A memory control plane for model fleets.
Per-user memory stores, controller-curated recall, exact provenance checks, and pressure-aware admission gates reduce the cost and reliability penalty of long-memory inference. This board does not claim a new live memory score; it records the current evidence boundary and the reusable personal-memory eval suite.
Research Question
True north: can Infinite Memory act as a single-user coherent entity recall/retrieval layer under pressure, across research development, story canon, relationship boundaries, personal psychology preferences, agent workflows, multi-turn updates, and tenant isolation?
Current answer: the harness is now complete enough to rerun against future versions. Live evidence is still partial: v0.62 proves research-update recall at 1024/2048 pressure, while Q1/Q2/Q3/Q4 endpoint-capable runners are staged and blocked from live certification by the v0.66 admission gate.
Objective Readiness Audit
| Question | Audited Answer | Evidence Path |
|---|---|---|
| Is the independent eval suite ready? | Yes. The manifest, seven reusable cases, Q1/Q2/Q3/Q4 plans, dry-run scores, and suite orchestrator all validate. | research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json |
| Is the primary objective complete? | No. The current audited state is not_complete_live_gate_blocked; goal_complete is false. | research/tracks/hypernym-infinite-mim/results/objective-readiness-audit/20260610T_objective_completion_matrix_finalizer_codex_v1/audit.json |
| What does the completion matrix say? | 11 satisfied harness requirements, 2 partial-live evidence requirements, and 6 blocked current-suite live-certification requirements. | completion_matrix.satisfied_count, completion_matrix.partial_live_evidence_count, completion_matrix.blocked_count |
| Which requirements are still blocked? | Story-writing current canon, relationship boundary recall, personal-psychology preference/abstention, long-running agent workflow recall, sequential multi-turn personal memory, and full-domain live threshold/pressure coverage. | completion_matrix.blocked_requirement_names |
| What live capability is actually proven most recently? | v0.62 scored four research-update rows at 1024/2048 pressure with strict and semantic true-north 1.0. | research/tracks/hypernym-infinite-mim/results/v0.62-tail-contract-cross-domain-pressure/20260610T_tail_contract_cross_domain_pressure_live_codex_v1/scores.json |
| What does the audit refuse to overclaim? | Dry-runs and gate-refusals are now listed under evidence_summary.harness_evidence_not_capability and evidence_summary.gate_refusal_evidence, not live capability. | evidence_summary |
| What is still missing? | Gate-allowed live threshold/pressure rows for story, relationship, psychology, agent workflow, and the sequential multi-turn session. | research/tracks/hypernym-infinite-mim/results/v0.66-admission-control-gate/20260610T_admission_control_gate_codex_v1/decision.json |
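The audit fields named in the table can be read mechanically. The following is a minimal sketch, assuming `goal_complete` sits at the top level of audit.json next to `completion_matrix` (the board does not state the exact nesting); the helper names `summarize_audit` and `may_claim_completion` are hypothetical.

```python
def summarize_audit(audit: dict) -> dict:
    """Pull the completion-matrix counts out of an audit document."""
    matrix = audit.get("completion_matrix", {})
    return {
        "satisfied": matrix.get("satisfied_count", 0),
        "partial_live": matrix.get("partial_live_evidence_count", 0),
        "blocked": matrix.get("blocked_count", 0),
        "blocked_names": matrix.get("blocked_requirement_names", []),
        # Assumption: goal_complete is a top-level boolean.
        "goal_complete": bool(audit.get("goal_complete", False)),
    }


def may_claim_completion(audit: dict) -> bool:
    """Completion requires the flag plus zero blocked and zero partial rows."""
    summary = summarize_audit(audit)
    return (
        summary["goal_complete"]
        and summary["blocked"] == 0
        and summary["partial_live"] == 0
    )


# Counts taken from the table above; the dict shape is illustrative.
current = {
    "completion_matrix": {
        "satisfied_count": 11,
        "partial_live_evidence_count": 2,
        "blocked_count": 6,
    },
    "goal_complete": False,
}
print(summarize_audit(current)["blocked"], may_claim_completion(current))
```

Checking the counts as well as the flag means a hand-edited `goal_complete` value cannot hide blocked requirements.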
Threshold Boundary Analysis
| Domain | Live-Certified Boundary | Dry-Run Ready Boundary | Current Interpretation |
|---|---|---|---|
| Research development | 2048 pressure lower bound, 4 HTTP 200 rows in v0.62. | 1024, 2048 | Only current domain with a live current-suite pressure lower bound. Do not generalize this to other domains. |
| Story world canon | None in current Q1/Q2 suite. | 2048 | Executable row exists; no admitted HTTP 200 live certification under v0.66. |
| Relationship boundary editing | None in current Q1/Q2 suite. | 1024, 2048 | Executable rows exist; still needs live current-vs-superseded boundary proof. |
| Personal psychology preference | None in current Q1/Q2 suite. | 1024, 2048 | Executable rows exist; still needs live preference recall plus abstention proof. |
| Long-running agent workflows | None in current Q1/Q2 suite. | 256, 1024, 2048 | Q1 and Q2 are dry-run ready; no current-suite live certification yet. |
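One way to derive the live-certified column is to keep, per domain, the highest pressure band that has an admitted, passing, live HTTP 200 row. This is a sketch only: the row fields (`domain`, `pressure`, `http_status`, `dry_run`, `passed`) are assumed names, not the schema of the actual analysis.json.

```python
def live_lower_bounds(rows):
    """Map each domain to its highest pressure band with live passing evidence.

    Dry-run rows, non-200 rows, and failed rows never raise a bound, so a
    domain with no qualifying row is simply absent from the result.
    """
    bounds = {}
    for row in rows:
        if row.get("dry_run", False):
            continue  # harness evidence, not capability
        if row.get("http_status") != 200:
            continue  # not admitted or not served
        if not row.get("passed", False):
            continue  # served but failed scoring
        domain = row["domain"]
        bounds[domain] = max(row["pressure"], bounds.get(domain, 0))
    return bounds


# Illustrative rows; the values are made up to exercise each branch.
rows = [
    {"domain": "research", "pressure": 1024, "http_status": 200, "dry_run": False, "passed": True},
    {"domain": "research", "pressure": 2048, "http_status": 200, "dry_run": False, "passed": True},
    {"domain": "story", "pressure": 2048, "http_status": 200, "dry_run": True, "passed": True},
    {"domain": "agent", "pressure": 2048, "http_status": 503, "dry_run": False, "passed": False},
]
print(live_lower_bounds(rows))
```

A bound is recorded per domain on purpose, which mirrors the rule that the research result must not be generalized to other domains.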
V3 Rerun Packet
| Field | Current Value | Why It Matters |
|---|---|---|
| Packet | research/tracks/hypernym-infinite-mim/results/versioned-eval-packet/20260610T_versioned_eval_packet_v3_candidate_codex_v8/packet.json | Stable machine-readable contract for rerunning the same personal-memory eval against V3 or any future model version. |
| Target version | v3-candidate | Names the future comparison target without changing the objective or case catalog. |
| Command order | status -> audit -> validate -> dry_run -> live_when_gate_allows | Prevents accidental live traffic, cross-track contamination, or capability claims from dry-run artifacts. |
| Suite fingerprints | 14+ SHA256 fingerprints across manifest, catalog, runner, suite executors, validator, audit, comparator, launch checklist, and Q1/Q2/Q3/Q4 plans. | Lets CTO compare future scores against the same eval definition instead of a silently changed suite. |
| Comparison contract | q1 strict/semantic, q2 current_fact_recall, q2 forbidden_absence, q3 page safety, q4 abstention/forbidden absence, tokens, latency, non-200 rows, stop reason. | Defines the institutional scoreboard for V3: quality, safety, cost, latency, and serving reliability. |
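The suite-fingerprint idea can be illustrated with the standard library. This is a sketch under assumptions: the packet is taken to store a mapping from relative file path to SHA256 hex digest, which the board does not spell out, and `changed_files` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded: dict, root=".") -> list:
    """Return recorded paths that are missing or whose digest no longer matches.

    `recorded` maps a path relative to `root` to its expected hex digest
    (an assumed packet layout).
    """
    changed = []
    for relative, expected in sorted(recorded.items()):
        candidate = Path(root) / relative
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(relative)
    return changed
```

An empty result from `changed_files` would indicate that a rerun uses the same eval definition; any non-empty result means scores are not directly comparable until the difference is explained.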
Version Comparison Scaffold
| Component | Current State | Decision Rule |
|---|---|---|
| Comparator | research/tracks/hypernym-infinite-mim/compare_version_eval_results.py | Consumes future Q1/Q2/Q3/Q4 artifacts and emits a structured comparison report without touching the live endpoint. |
| Current scaffold report | research/tracks/hypernym-infinite-mim/results/version-comparison/20260610T_version_comparison_scaffold_codex_v8/comparison.json | Status is comparison_scaffold_ready_no_candidate_artifacts; no V3 capability claim is possible until candidate artifacts exist. |
| Evidence filter | q1, q2, q3, and q4 are all marked missing for future candidate live evidence today. | Dry-runs, zero-token artifacts, missing artifacts, no-admitted-row artifacts, and artifacts missing required comparison fields are explicitly refused as capability evidence. |
| Validator coverage | version_comparison_verified: 1 | The eval suite now fails validation if the comparison scaffold disappears, points at a stale packet through the launch checklist, or starts overclaiming candidate capability. |
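The evidence filter described above amounts to a chain of refusal checks. The sketch below is not the comparator's code: `prompt_tokens_total` and `completion_tokens_total` appear elsewhere on this board, while `dry_run`, `admitted_rows`, `scores`, and the reason strings are assumed for illustration.

```python
# Assumption: these are the fields a comparison would need.
REQUIRED_FIELDS = (
    "scores",
    "admitted_rows",
    "prompt_tokens_total",
    "completion_tokens_total",
)


def refusal_reason(artifact):
    """Return why an artifact is refused as capability evidence, or None."""
    if artifact is None:
        return "missing_artifact"
    if artifact.get("dry_run", False):
        return "dry_run"
    missing = [name for name in REQUIRED_FIELDS if name not in artifact]
    if missing:
        return "missing_fields:" + ",".join(missing)
    if artifact["admitted_rows"] == 0:
        return "no_admitted_rows"
    tokens = artifact["prompt_tokens_total"] + artifact["completion_tokens_total"]
    if tokens == 0:
        return "zero_tokens"
    return None


def accepted_evidence(artifacts: dict) -> dict:
    """Split a {question: artifact} mapping into accepted and refused."""
    report = {"accepted": [], "refused": {}}
    for question, artifact in sorted(artifacts.items()):
        reason = refusal_reason(artifact)
        if reason is None:
            report["accepted"].append(question)
        else:
            report["refused"][question] = reason
    return report
```

With today's state (no candidate artifacts), every question would land under `refused` with `missing_artifact`, which matches the scaffold's no-claim status.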
Live Launch Checklist
| Control | Current State | Operator Meaning |
|---|---|---|
| Checklist artifact | research/tracks/hypernym-infinite-mim/results/live-launch-checklist/20260610T_live_launch_checklist_codex_v8/checklist.json | Gate-aware launch order for the first valid live Q1/Q2/Q3/Q4 run, with current packet/comparison pointers and the post-first-live threshold finalizer. |
| Status | ready_but_gate_blocked | The suite is staged, but live memory-quality rows must wait for a fresh isolated lane, server-side lease, quiet window, or same-size admitted HTTP 200 calibration. |
| Launch order | status -> objective_audit -> validate_suite -> suite_dry_run -> gate_check -> first_live_subset -> first_live_threshold_finalize -> suite_live -> version_compare | Prevents out-of-order experiments, accidental live traffic, and dry-run artifacts being treated as capability evidence. |
| Hard stops | No non-direct endpoint, no parallel live calls, stop after first unrecovered non-200, no dry-run/gate-refusal capability claims, no goal completion until full live coverage plus Q2 sequential state and Q4 abstention. | This is the operator contract for safe continuation on a shared endpoint. |
| Validator coverage | live_launch_checklist_verified: 1 | The eval suite now fails validation if the checklist is missing, overclaims gate status, or changes the launch order. |
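The launch order can be enforced as a strict prefix check. The step names below are copied from the checklist row; the function `next_step` and its error handling are a hypothetical sketch, not the checklist validator itself.

```python
LAUNCH_ORDER = [
    "status",
    "objective_audit",
    "validate_suite",
    "suite_dry_run",
    "gate_check",
    "first_live_subset",
    "first_live_threshold_finalize",
    "suite_live",
    "version_compare",
]


def next_step(completed):
    """Return the next allowed step, or None when the sequence is finished.

    Raises ValueError if `completed` is not an exact prefix of LAUNCH_ORDER,
    which covers skipped, reordered, and unknown steps.
    """
    completed = list(completed)
    if completed != LAUNCH_ORDER[: len(completed)]:
        raise ValueError(f"out-of-order launch history: {completed}")
    if len(completed) == len(LAUNCH_ORDER):
        return None
    return LAUNCH_ORDER[len(completed)]


print(next_step(["status", "objective_audit"]))
```

Treating the history as a prefix rather than a set means a run cannot reach `first_live_subset` without `gate_check` recorded immediately before it.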
First Live Certification Subset
| Phase | Rows / Turns | What It Certifies |
|---|---|---|
| Gate recheck | 0 | Re-run v0.66 admission control and stop unless allow_memory_quality_run=true. |
| Q1 minimal cross-domain certification | 4 labels at p2048 | One max-pressure current-recall row each for story canon, agent workflow, relationship boundary, and psychology preference. |
| Q2 sequential state certification | 15 turns | Actual multi-turn state evolution across all objective domains without reducing the claim to a packed single prompt. |
| Q4 max-pressure abstention certification | 8 labels at p2048 | For relationship and psychology: current recall plus stale, rejected, and foreign abstention. |
| Total first certification | 27 | The smallest current plan that can close the main missing live-evidence gaps without running the whole suite first. |
| Executable dry-run | 4 / 15 / 8 | run-first-live-certification-subset --dry-run observed 4 Q1 rows, 15 Q2 turns, and 8 Q4 endpoint-runner rows with no live endpoint traffic. |
| Current limit | blocked_by_gate | Q4 now has an endpoint-capable runner; live execution is still blocked by the same v0.66 admission gate as the rest of the certification subset. |
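The subset size and the gate recheck can be expressed together. In this sketch the phase counts and the `allow_memory_quality_run` flag come from the table above; the `gate_allowed` status string and the function name `plan_first_live_subset` are assumptions.

```python
FIRST_LIVE_SUBSET = {"q1_rows": 4, "q2_turns": 15, "q4_rows": 8}


def plan_first_live_subset(decision: dict) -> dict:
    """Decide whether the first live subset may start, without sending traffic."""
    planned_total = sum(FIRST_LIVE_SUBSET.values())  # 4 + 15 + 8
    # Only an explicit boolean True opens the gate; a missing or truthy
    # non-boolean value is treated as a refusal.
    allowed = decision.get("allow_memory_quality_run") is True
    return {
        "status": "gate_allowed" if allowed else "blocked_by_gate",
        "planned_total": planned_total,
        "live_endpoint_touched": False,  # planning never calls the endpoint
    }


print(plan_first_live_subset({"allow_memory_quality_run": False}))
```

Requiring the literal `True` is a deliberately conservative choice for a shared endpoint: an absent or malformed gate decision blocks rather than admits.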
Coverage Map
| Domain | Current Evidence | Gap Before Stronger Claim |
|---|---|---|
| Research development | v0.62 scored research-update rows passed strict and semantic true-north at 1024 and 2048 pressure. | Cross-domain 2048 matrix is incomplete after shared endpoint 503. |
| Story world canon | v0.57 passed all tested pressure bands; v0.65 tested same-size admission. | Newer same-size story rows were not admitted, so do not extend the quality claim yet. |
| Relationship boundary editing | Covered by v0.51 and staged in v0.62 cross-domain work. | Needs focused 2048 current-vs-superseded relationship boundary run under isolated lane. |
| Personal psychology preference | Covered by v0.51 and staged in v0.62 cross-domain work. | Needs contradiction pressure with sensitive-preference abstention and provenance checks. |
| Long-running agent workflows | v0.60 completed 6/6 agent-loop rows at 1024/2048; v0.61 tail-contract variants passed scored rows. | Needs repeated multi-turn session testing once admission is isolated. |
Q1 Cross-Domain Resume Plan
| Domain | Rows | Pressure | Status |
|---|---|---|---|
| Story world canon | 2 labels: tail contract + tail schema example. | 2048 | Pending gate allow. |
| Long-running agent workflows | 2 labels: tail contract + tail schema example. | 2048 | Pending gate allow. |
| Relationship boundary editing | 4 labels: two variants across two pressure bands. | 1024, 2048 | Pending gate allow. |
| Personal psychology preference | 4 labels: two variants across two pressure bands. | 1024, 2048 | Pending gate allow. |
| Research development | 0 rerun labels by default. | 1024, 2048 already scored in v0.62. | Control only unless needed. |
Q2 Sequential Multi-Turn Plan
| Phase | Turns | Purpose | Scored At |
|---|---|---|---|
| Active updates | 1, 3, 5, 7, 9, 12, 13 | Set current research, story, relationship, psychology, and agent facts, then supersede research and story. | Final probe. |
| Stale/rejected controls | 2, 4, 8 | Seed stale research plus rejected story and psychology records that must stay absent. | Final forbidden-id check. |
| Foreign control | 6 | Seed a different relationship entity with overlapping language. | Final foreign-id check. |
| Pressure inserts | 10, 14 | Add 256-band and 1024-band distractor pressure with near-matches. | Intermediate and final probes. |
| Probes | 11, 15 | Ask for current state as JSON without repasting the full synthetic bundle. | Semantic true-north, stale absence, foreign absence, admission. |
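The turn assignments in the plan can be checked for completeness: every turn from 1 to 15 should belong to exactly one phase. The turn numbers below are taken from the table; the phase keys and the helper `turn_phase_map` are hypothetical names.

```python
Q2_PHASES = {
    "active_updates": [1, 3, 5, 7, 9, 12, 13],
    "stale_rejected_controls": [2, 4, 8],
    "foreign_control": [6],
    "pressure_inserts": [10, 14],
    "probes": [11, 15],
}


def turn_phase_map(phases: dict, total_turns: int = 15) -> dict:
    """Return {turn: phase}, raising if a turn is duplicated, missing, or extra."""
    mapping = {}
    for phase, turns in phases.items():
        for turn in turns:
            if turn in mapping:
                raise ValueError(f"turn {turn} is in {mapping[turn]} and {phase}")
            mapping[turn] = phase
    expected = set(range(1, total_turns + 1))
    missing = sorted(expected - set(mapping))
    extra = sorted(set(mapping) - expected)
    if missing or extra:
        raise ValueError(f"missing turns {missing}, unexpected turns {extra}")
    return mapping


schedule = turn_phase_map(Q2_PHASES)
print([schedule[turn] for turn in (6, 10, 11, 15)])
```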
Q3 Tenant / Foreign Boundary Regression
| Suite | Rows | Safety Signal | Status |
|---|---|---|---|
| Tenant boundary | 24 logical / 60 page rows | Tenant B IDs, wicks, digests, and text absent at page level. | Dry-run verified. |
| Revoked memory | 24 logical / 60 page rows | Revoked IDs, wicks, digests, and text absent at page level. | Dry-run verified. |
| Forged namespace | 24 logical / 60 page rows | Forged digest and namespace collision controls absent at page level. | Dry-run verified. |
| Epoch rollback | 24 logical / 60 page rows | Stale epoch records, digests, and markers absent at page level. | Dry-run verified. |
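A page-level absence check of the kind listed in the Safety Signal column can be sketched as a substring scan over every page. This is an illustration only: the real scorer may normalize or match differently, and the sample IDs and digests are invented.

```python
def forbidden_hits(pages, forbidden):
    """Return (page_index, token) pairs for every forbidden token found.

    Matching is case-insensitive substring search, a simplifying assumption.
    An empty list means the pages pass the absence check.
    """
    hits = []
    for index, page in enumerate(pages):
        lowered = page.lower()
        for token in forbidden:
            if token.lower() in lowered:
                hits.append((index, token))
    return hits


# Invented sample data: tenant A pages must not leak tenant B markers.
pages = [
    "tenant-a note: current preference is morning check-ins",
    "tenant-a digest sha:aa11 confirmed",
]
tenant_b_markers = ["tenant-b", "sha:bb22"]
print(forbidden_hits(pages, tenant_b_markers))
```

Scanning each page rather than only the final answer is what distinguishes page-level absence from response-level absence.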
Q4 Sensitive Preference / Boundary Abstention
| Dimension | Current Artifact State | Why CTO Should Care |
|---|---|---|
| Scope | 32 endpoint-runner dry-run rows across IM-PER-003 relationship boundary and IM-PER-004 personal psychology preference. | This turns the vague "sensitive memory" problem into exact current-vs-stale-vs-rejected-vs-foreign checks. |
| Pressure | 0, 256, 1024, and 2048 pressure bands. | Future live runs can show where abstention and current recall break as memory pressure increases. |
| Query modes | current_recall, stale_abstain, rejected_abstain, foreign_abstain. | Tests both usefulness and restraint: recall the latest valid user state, refuse superseded or wrong-person state. |
| Dry-run metrics | semantic_true_north_score=1.0, strict_true_north_score=1.0, abstention_correct_mean=1.0, forbidden_absence_mean=1.0, prompt_tokens_total=0, completion_tokens_total=0. | Proves the harness and scorer are wired; it does not claim the endpoint achieved these live. |
| Gate status | blocked_by_gate; live_endpoint_touched=false. | Protects the shared endpoint and keeps institutional claims honest. |
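The 32-row scope follows from crossing two cases, four pressure bands, and four query modes. The sketch below reproduces that arithmetic; the row dictionary layout and the `expect_abstain` field are assumptions rather than the plan's actual schema.

```python
from itertools import product

CASES = ("IM-PER-003", "IM-PER-004")
PRESSURE_BANDS = (0, 256, 1024, 2048)
QUERY_MODES = ("current_recall", "stale_abstain", "rejected_abstain", "foreign_abstain")


def q4_rows():
    """Enumerate every case x pressure x query-mode combination."""
    return [
        {
            "case_id": case,
            "pressure": pressure,
            "query_mode": mode,
            # Only current_recall expects an answer; the rest expect abstention.
            "expect_abstain": mode != "current_recall",
        }
        for case, pressure, mode in product(CASES, PRESSURE_BANDS, QUERY_MODES)
    ]


rows = q4_rows()
max_pressure = [row for row in rows if row["pressure"] == 2048]
print(len(rows), len(max_pressure))
```

Filtering the same grid to the 2048 band yields the 8-label first-live Q4 subset.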
Suite Orchestrator
| Mode | Command | Current Result |
|---|---|---|
| Dry-run | forge_runner.sh run-personal-memory-eval-suite --dry-run | Passes Q1, Q2, Q3, and Q4 with no live endpoint traffic; the bootstrap dry-run skips preflight validation only to break first-materialization circularity. |
| Live | forge_runner.sh run-personal-memory-eval-suite --live | Currently exits `blocked_by_gate` before endpoint traffic because v0.66 blocks admission. |
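The two modes differ mainly in when, if ever, the endpoint is touched. The following sketch is not the orchestrator's code: `run_rows`, the `send` callable, and the `stopped_on_non_200` and `live_complete` statuses are hypothetical, and retry or recovery of a non-200 response is omitted for brevity.

```python
def run_rows(rows, send, gate_allows: bool, dry_run: bool = True) -> dict:
    """Run rows one at a time, honoring the gate and the first-non-200 stop.

    `send` is any callable that takes a row and returns an HTTP status code.
    """
    result = {"status": None, "completed_rows": 0, "live_endpoint_touched": False}
    if dry_run:
        # Dry-run exercises the harness only; send is never called.
        result["status"] = "dry_run_pass"
        return result
    if not gate_allows:
        result["status"] = "blocked_by_gate"
        return result
    for row in rows:  # strictly sequential: no parallel live calls
        result["live_endpoint_touched"] = True
        if send(row) != 200:
            result["status"] = "stopped_on_non_200"
            return result
        result["completed_rows"] += 1
    result["status"] = "live_complete"
    return result


# Fake transport for illustration: the third call is refused.
codes = iter([200, 200, 503, 200])
print(run_rows(["r1", "r2", "r3", "r4"], lambda row: next(codes), True, dry_run=False))
```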
Reusable Case Catalog
| Case | Domain | What It Tests | Next Queue |
|---|---|---|---|
| IM-PER-001 | Research development | Latest accepted research claim over stale hypotheses, rejected interpretations, and foreign research entities. | cross_domain_tail_contract_resume |
| IM-PER-002 | Story world canon | Current character, setting, and plot invariants over discarded drafts and decoy characters. | cross_domain_tail_contract_resume |
| IM-PER-003 | Relationship boundary editing | Current boundary and allowed communication mode over stale, rejected, and foreign-person records. | sensitive_preference_boundary_abstention |
| IM-PER-004 | Personal psychology preference | Current self-model/preference with abstention for rejected or diagnosis-like framings. | sensitive_preference_boundary_abstention |
| IM-PER-005 | Long-running agent workflows | Current directive, state-machine node, and next action over older directives and foreign agent tasks. | cross_domain_tail_contract_resume |
| IM-PER-006 | Sequential multi-turn session | Conversation updates across research, story, relationship, psychology, and agent state without repasting the full synthetic bundle. | multi_turn_personal_memory_session |
| IM-PER-007 | Tenant / foreign boundary | Empty or abstain on foreign, revoked, forged namespace, or rollback-epoch memory. | tenant_foreign_boundary_regression |
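Grouping the catalog by next queue shows which cases each plan exercises. The case IDs and queue names are copied from the table; the dictionary layout is an assumed simplification of the catalog JSON, and `cases_by_queue` is a hypothetical helper.

```python
from collections import defaultdict

CASE_QUEUES = {
    "IM-PER-001": "cross_domain_tail_contract_resume",
    "IM-PER-002": "cross_domain_tail_contract_resume",
    "IM-PER-003": "sensitive_preference_boundary_abstention",
    "IM-PER-004": "sensitive_preference_boundary_abstention",
    "IM-PER-005": "cross_domain_tail_contract_resume",
    "IM-PER-006": "multi_turn_personal_memory_session",
    "IM-PER-007": "tenant_foreign_boundary_regression",
}


def cases_by_queue(case_queues: dict) -> dict:
    """Invert {case_id: queue} into {queue: sorted list of case_ids}."""
    grouped = defaultdict(list)
    for case_id, queue in case_queues.items():
        grouped[queue].append(case_id)
    return {queue: sorted(cases) for queue, cases in grouped.items()}


for queue, cases in sorted(cases_by_queue(CASE_QUEUES).items()):
    print(queue, cases)
```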
What Is Actually Proven Right Now
Partial v0.62 evidence shows exact current research-update recall survived 1024 and 2048 pressure.
v0.54-v0.61 show controller-selected current payloads and tail output contracts are stronger than broad freeform recall.
Prior isolation rows cover tenant, revoked, stale, forged namespace, nonce replay, and rollback-style failure modes.
v0.65 proves health can be OK while same-size large requests are not admitted on the shared lane.
Next step, not yet proven: resume live memory-quality runs only after an isolated lane, server-side lease, quiet window, or passing same-size calibration is available.
Data Trace
| Artifact | Path / Handle | Use |
|---|---|---|
| Manifest | research/tracks/hypernym-infinite-mim/infinite-memory-eval-suite-manifest.json | Machine-readable suite coverage and gates. |
| Case catalog | research/tracks/hypernym-infinite-mim/personal-memory-eval-case-catalog.json | Concrete reusable cases, next-run queues, pressure bands, and success floors. |
| Validation report | research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json | Pass/fail proof: 7 cases, 5 domains, 8 axes, 4 materialized plans, 4 executable dry-runs, first-live subset, suite orchestrator, objective audit, V3 packet, comparator, launch checklist, threshold-boundary analysis, and Q4 abstention verified with no live endpoint traffic. |
| Objective readiness audit | research/tracks/hypernym-infinite-mim/results/objective-readiness-audit/20260610T_objective_completion_matrix_finalizer_codex_v1/audit.json | Machine-readable closeout: 11 satisfied harness requirements, 2 partial-live evidence requirements, 6 blocked current-suite live-certification requirements, explicit evidence summary, and `goal_complete: false`. |
| Versioned eval packet | research/tracks/hypernym-infinite-mim/results/versioned-eval-packet/20260610T_versioned_eval_packet_v3_candidate_codex_v8/packet.json | V3/new-version rerun contract: command order, suite fingerprints, live policy, comparison fields, and data trace. |
| Version comparison scaffold | research/tracks/hypernym-infinite-mim/results/version-comparison/20260610T_version_comparison_scaffold_codex_v8/comparison.json | Future V3 comparison contract: refuses dry-run/gate-refusal/health-only artifacts as capability evidence and emits structured deltas when live candidate artifacts exist. |
| Live launch checklist | research/tracks/hypernym-infinite-mim/results/live-launch-checklist/20260610T_live_launch_checklist_codex_v8/checklist.json | Operator handoff contract: launch order, first-live subset plan, threshold finalizer, gate decision, hard stops, after-live result steps, and current packet/comparison pointers for the first valid live Q1/Q2/Q3/Q4 suite run. |
| First live certification subset | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/plan.json | Minimal post-gate live certification plan: 4 Q1 rows, 15 Q2 turns, 8 Q4 rows, 27 total live rows/turns after gate allow. |
| First live subset dry-run | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/subset-run.json | Executable dry-run proof: 4 Q1 rows, 15 Q2 turns, 8 Q4 endpoint-runner rows, no live endpoint traffic, plus an artifact index for Q1/Q2/Q4 scores and the follow-on threshold-analysis command. |
| First live subset artifact index | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/artifact-index.json | Machine-readable handoff: Q1 scores path, Q2 scores path, Q4 scores path, label files, and threshold-analysis command template. |
| Post-first-live threshold finalizer | research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_dry_index_codex_v1/finalizer.json | Consumes the artifact index and produces threshold analysis from indexed Q1/Q2/Q4 score paths; current dry-index proof classifies all indexed scores as non-live. |
| Finalizer non-live refusal | research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_nonlive_refusal_codex_v1/finalizer.json | Guard proof: without explicit non-live allowance, dry indexed score files are refused and no threshold analysis is produced. |
| Finalizer dry-index threshold analysis | research/tracks/hypernym-infinite-mim/results/threshold-boundary-analysis/20260610T_first_live_threshold_finalizer_dry_index_codex_v1_threshold/analysis.json | No-promotion proof: dry indexed Q1/Q2/Q4 artifacts do not create live-success rows or Q2 live certification. |
| First live subset gate refusal | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_live_gate_refusal_indexed_codex_v1/subset-run.json | Live-mode safety proof: exits blocked_by_gate with live_endpoint_touched=false while v0.66 blocks admission, while still writing the expected artifact index for a future admitted run. |
| First live Q1 labels | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/q1-first-labels.txt | Four p2048 labels for story, agent, relationship, and psychology current-recall certification. |
| First live Q4 labels | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/q4-first-labels.txt | Eight p2048 labels covering current recall plus stale/rejected/foreign abstention for relationship and psychology. |
| Threshold boundary analysis | research/tracks/hypernym-infinite-mim/results/threshold-boundary-analysis/20260610T_threshold_boundary_analysis_live_inputs_codex_v1/analysis.json | Pressure-threshold matrix with explicit live score source tracing: research has a live 2048 lower bound; story, relationship, psychology, agent workflow, and Q2 remain dry-run-ready but not live-certified. |
| Suite orchestrator | research/tracks/hypernym-infinite-mim/run_personal_memory_eval_suite.py | One-command entrypoint for validation and Q1/Q2/Q3/Q4 execution. |
| Suite dry-run | research/tracks/hypernym-infinite-mim/results/personal-memory-eval-suite/20260610T_personal_memory_eval_suite_dryrun_codex_v3/suite-run.json | Full orchestrator proof: dry_run_pass, validation bootstrap + Q1 + Q2 + Q3 + Q4. |
| Q1 plan | research/tracks/hypernym-infinite-mim/results/q1-cross-domain-tail-contract-resume-plan/20260610T_q1_cross_domain_tail_contract_resume_plan_codex_v1/plan.json | Exact 12-row resume plan plus already-scored research control. |
| Q1 selected labels | research/tracks/hypernym-infinite-mim/results/q1-cross-domain-tail-contract-resume-plan/20260610T_q1_cross_domain_tail_contract_resume_plan_codex_v1/selected-labels.txt | Execution label list for `run-unscored-domain-drain-resume` once v0.66 allows live traffic. |
| Q1 dry-run scores | research/tracks/hypernym-infinite-mim/results/v0.63-unscored-domain-drain-resume/20260610T_q1_cross_domain_tail_contract_resume_dryrun_codex_v1/scores.json | Executable selected-label proof: 12 rows, strict/semantic true-north 1.0, no live endpoint traffic. |
| Q2 plan | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session-plan/20260610T_q2_multi_turn_personal_memory_session_plan_codex_v1/plan.json | 15-turn sequential personal-memory plan with active updates, controls, pressure inserts, probes, and scoring contract. |
| Q2 turns | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session-plan/20260610T_q2_multi_turn_personal_memory_session_plan_codex_v1/turns.jsonl | Turn-by-turn session source for the future live runner. |
| Q2 dry-run scores | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session/20260610T_q2_multi_turn_personal_memory_session_dryrun_codex_v1/scores.json | Executable runner proof: 15 turns, 2 probes, strict/semantic true-north 1.0, no live endpoint traffic. |
| Q3 plan | research/tracks/hypernym-infinite-mim/results/q3-tenant-foreign-boundary-regression-plan/20260610T_q3_tenant_foreign_boundary_regression_plan_codex_v1/plan.json | Aggregates tenant, revoked, forged namespace, and epoch rollback dry-runs into one boundary regression artifact. |
| Q4 plan | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/plan.json | 32-row sensitive preference/boundary abstention plan across current, stale, rejected, and foreign query modes. |
| Q4 endpoint runner | research/tracks/hypernym-infinite-mim/run_q4_sensitive_preference_boundary_abstention.py | Direct-endpoint-capable executor with gate refusal, local dry-run scoring, and frontier-endpoint guardrails. |
| Q4 selected labels | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/selected-labels.txt | Execution labels for Q4 once the admission gate allows live traffic. |
| Q4 dry-run scores | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_dryrun_codex_v1/scores.json | Dry-run scoring proof: 32 rows, abstention correctness 1.0, forbidden absence 1.0, no live endpoint traffic. |
| Q4 gate refusal | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_live_gate_refusal_codex_v1/scores.json | Live-mode safety proof: blocked_by_gate, completed_rows=0, live_endpoint_touched=false. |
| Latest gate | research/tracks/hypernym-infinite-mim/results/v0.66-admission-control-gate/20260610T_admission_control_gate_codex_v1/decision.json | Blocks live memory-quality rows under current shared-lane conditions. |
| Latest positive memory-quality evidence | research/tracks/hypernym-infinite-mim/results/v0.62-tail-contract-cross-domain-pressure/20260610T_tail_contract_cross_domain_pressure_live_codex_v1/scores.json | Partial cross-domain pressure score file. |
| Latest same-size admission evidence | research/tracks/hypernym-infinite-mim/results/v0.65-request-size-admission-calibration/20260610T_request_size_admission_calibration_live_codex_v1/scores.json | Capacity/admission result: zero admitted rows under shared endpoint pressure. |
| Working memory | research/tracks/hypernym-infinite-mim/WORKING_MEMORY.md | Human handoff and current operational state. |
API Pull Targets
| Need | Local Pull Target | Stable Interpretation |
|---|---|---|
| Latest validated suite state | jq '.coverage_counts, .status, .failures, .warnings' research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json | Shows validator pass/fail plus counts for Q1/Q2/Q3/Q4 materialization. |
| First-live threshold handoff | jq '.threshold_analysis_command' research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/artifact-index.json | Shows the command template that will recompute live threshold boundaries from Q1/Q2/Q4 score artifacts after the gate admits traffic. |
| Finalizer guard status | jq '.status, .live_like, .threshold_analysis' research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_nonlive_refusal_codex_v1/finalizer.json | Shows that current dry indexed scores are refused as capability evidence unless a harness-only override is explicit. |
| Q4 abstention contract | jq '.planned_row_count, .case_ids, .query_modes, .pressure_bands, .status' research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/plan.json | Shows exactly what future live Q4 rows will test. |
| Q4 dry-run scores | jq '.summary' research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_dryrun_codex_v1/scores.json | Harness/scorer proof only; do not treat as live model capability. |
| Future version comparison | bash research/tracks/hypernym-infinite-mim/forge_runner.sh compare-version-eval-results --run-id <id> --candidate-version <version> --q1-scores <path> --q2-scores <path> --q3-scores <path> --q4-scores <path> | Creates a structured delta once live candidate artifacts exist. |
| First live launch plan | jq '.summary, .phases[].phase' research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/plan.json | Shows the minimal 27-row/turn certification sequence and gate-blocked status. |
| Resume snapshot | .forge/artifacts/cxdb-hypernym-infinite-mim-post-q4-sensitive-boundary-snapshot-20260610T095129Z.json | Portable handoff record for a CTO, future agent, or local retrieval tool. |
Compound Research Chain
Prior deployed boards are kept as immutable research waypoints so later CTO review can reconstruct the wall-climb rather than reading one orphan page.