Institutional one-liner

A memory control plane for model fleets.

Per-user memory stores, controller-curated recall, exact provenance checks, and pressure-aware admission gates reduce the cost and reliability penalty of long-memory inference. This board does not claim a new live memory score; it records the current evidence boundary and the reusable personal-memory eval suite.

Research Question

True north: can Infinite Memory act as a coherent recall and retrieval layer for a single-user entity under pressure, across research development, story canon, relationship boundaries, personal psychology preferences, agent workflows, multi-turn updates, and tenant isolation?

Current answer: the harness is now complete enough to rerun against future versions. Live evidence is still partial: v0.62 proves research-update recall at 1024/2048 pressure, while Q1/Q2/Q3/Q4 endpoint-capable runners are staged and blocked from live certification by the v0.66 admission gate.

5 / 5
Required objective domains covered in the manifest.
8 / 8
Required eval axes mapped to prior evidence.
5 / 5
Pressure bands represented: 0, 64, 256, 1024, 2048.
BLOCK
More live memory-quality rows are blocked until v0.66 admits the lane.
1
Suite-level orchestrator now verified.
4
One-command dry-run executes validation plus Q1, Q2, Q3, and Q4.
3
Live mode exits with gate code while v0.66 blocks admission.
0
Live endpoint calls in suite dry-run and gate-refusal paths.

Objective Readiness Audit

Question | Audited Answer | Evidence Path
Is the independent eval suite ready? | Yes. The manifest, seven reusable cases, Q1/Q2/Q3/Q4 plans, dry-run scores, and suite orchestrator all validate. | research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json
Is the primary objective complete? | No. The current audited state is not_complete_live_gate_blocked; goal_complete is false. | research/tracks/hypernym-infinite-mim/results/objective-readiness-audit/20260610T_objective_completion_matrix_finalizer_codex_v1/audit.json
What does the completion matrix say? | 11 satisfied harness requirements, 2 partial-live evidence requirements, and 6 blocked current-suite live-certification requirements. | completion_matrix.satisfied_count, completion_matrix.partial_live_evidence_count, completion_matrix.blocked_count
Which requirements are still blocked? | Story-writing current canon, relationship boundary recall, personal-psychology preference/abstention, long-running agent workflow recall, sequential multi-turn personal memory, and full-domain live threshold/pressure coverage. | completion_matrix.blocked_requirement_names
What live capability is actually proven most recently? | v0.62 scored four research-update rows at 1024/2048 pressure with strict and semantic true-north 1.0. | research/tracks/hypernym-infinite-mim/results/v0.62-tail-contract-cross-domain-pressure/20260610T_tail_contract_cross_domain_pressure_live_codex_v1/scores.json
What does the audit refuse to overclaim? | Dry-runs and gate-refusals are now listed under evidence_summary.harness_evidence_not_capability and evidence_summary.gate_refusal_evidence, not live capability. | evidence_summary
What is still missing? | Gate-allowed live threshold/pressure rows for story, relationship, psychology, agent workflow, and the sequential multi-turn session. | research/tracks/hypernym-infinite-mim/results/v0.66-admission-control-gate/20260610T_admission_control_gate_codex_v1/decision.json
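
The completion-matrix counts above can be checked mechanically. The following is a minimal sketch, assuming audit.json exposes a `completion_matrix` object with the three count fields cited in the table and a top-level `goal_complete` flag. The inline dict is a stand-in for the real file, and `summarize_audit` is a hypothetical helper, not part of the suite.

```python
# Stand-in for audit.json. The nesting is an assumption based on the
# field names cited in the audit table; values are the ones reported there.
audit = {
    "goal_complete": False,
    "completion_matrix": {
        "satisfied_count": 11,
        "partial_live_evidence_count": 2,
        "blocked_count": 6,
    },
}


def summarize_audit(audit: dict) -> dict:
    """Total the requirement counts and decide whether completion is claimable."""
    matrix = audit["completion_matrix"]
    total = (
        matrix["satisfied_count"]
        + matrix["partial_live_evidence_count"]
        + matrix["blocked_count"]
    )
    # Completion is claimable only when the audit says so and no
    # requirement is still partial or blocked.
    claimable = (
        audit["goal_complete"]
        and matrix["partial_live_evidence_count"] == 0
        and matrix["blocked_count"] == 0
    )
    return {"total_requirements": total, "claimable": claimable}


summary = summarize_audit(audit)
print(summary)
```

With the reported counts, the sketch totals 19 requirements and refuses the completion claim, matching the audited state.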

Threshold Boundary Analysis

Domain | Live-Certified Boundary | Dry-Run Ready Boundary | Current Interpretation
Research development | 2048 pressure lower bound, 4 HTTP 200 rows in v0.62. | 1024, 2048 | Only current domain with a live current-suite pressure lower bound. Do not generalize this to other domains.
Story world canon | None in current Q1/Q2 suite. | 2048 | Executable row exists; no admitted HTTP 200 live certification under v0.66.
Relationship boundary editing | None in current Q1/Q2 suite. | 1024, 2048 | Executable rows exist; still needs live current-vs-superseded boundary proof.
Personal psychology preference | None in current Q1/Q2 suite. | 1024, 2048 | Executable rows exist; still needs live preference recall plus abstention proof.
Long-running agent workflows | None in current Q1/Q2 suite. | 256, 1024, 2048 | Q1 and Q2 are dry-run ready; no current-suite live certification yet.
2048
Current live lower bound for research-development recall only.
4
Non-research domains are dry-run-ready but not current-suite live-certified.
Q2
Sequential session has 15 dry-run turns and pressure inserts at 256/1024.
GATED
Threshold claims cannot advance until v0.66 admits live rows.
27
Minimal live rows/turns needed after gate allow for first certification.
4
Q1 p2048 labels: story, agent, relationship, psychology.
15
Q2 full sequential turns; cannot be proven by a final probe alone.
8
Q4 p2048 sensitive abstention/current-recall labels.
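
The rule behind the boundary table is that only admitted, passing live rows move a domain's lower bound. The sketch below illustrates that rule under stated assumptions: the row shape (`domain`, `pressure`, `mode`, `http_status`, `passed`) and the sample rows are illustrative, not the suite's actual score schema.

```python
# Illustrative rows only. The field names are assumptions, not the
# schema of the real scores.json files.
rows = [
    {"domain": "research", "pressure": 1024, "mode": "live", "http_status": 200, "passed": True},
    {"domain": "research", "pressure": 2048, "mode": "live", "http_status": 200, "passed": True},
    {"domain": "story", "pressure": 2048, "mode": "dry_run", "http_status": None, "passed": True},
]


def live_lower_bounds(rows):
    """Map each domain to the highest pressure band backed by a live HTTP 200 pass."""
    bounds = {}
    for row in rows:
        # Every domain appears in the result, even with no live evidence.
        bounds.setdefault(row["domain"], None)
        is_live_pass = (
            row["mode"] == "live"
            and row["http_status"] == 200
            and row["passed"]
        )
        if is_live_pass:
            current = bounds[row["domain"]]
            bounds[row["domain"]] = (
                row["pressure"] if current is None else max(current, row["pressure"])
            )
    return bounds


bounds = live_lower_bounds(rows)
print(bounds)
```

A dry-run row never raises a bound, which is why the story domain stays at `None` in this example even though its dry-run row passed.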

V3 Rerun Packet

Field | Current Value | Why It Matters
Packet | research/tracks/hypernym-infinite-mim/results/versioned-eval-packet/20260610T_versioned_eval_packet_v3_candidate_codex_v8/packet.json | Stable machine-readable contract for rerunning the same personal-memory eval against V3 or any future model version.
Target version | v3-candidate | Names the future comparison target without changing the objective or case catalog.
Command order | status -> audit -> validate -> dry_run -> live_when_gate_allows | Prevents accidental live traffic, cross-track contamination, or capability claims from dry-run artifacts.
Suite fingerprints | 14+ SHA256 fingerprints across manifest, catalog, runner, suite executors, validator, audit, comparator, launch checklist, and Q1/Q2/Q3/Q4 plans. | Lets the CTO compare future scores against the same eval definition instead of a silently changed suite.
Comparison contract | q1 strict/semantic, q2 current_fact_recall, q2 forbidden_absence, q3 page safety, q4 abstention/forbidden absence, tokens, latency, non-200 rows, stop reason. | Defines the institutional scoreboard for V3: quality, safety, cost, latency, and serving reliability.
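
Suite fingerprints guard against a silently changed eval definition. The sketch below shows one way such a check could work using SHA256 from Python's standard `hashlib`. The file names, the `recorded` mapping, and the `changed_files` helper are hypothetical, and the packet's real fingerprint layout is not shown in this board.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded: dict, root: Path) -> list:
    """Return the names whose current hash differs from the recorded fingerprint."""
    return sorted(
        name for name, expected in recorded.items()
        if sha256_file(root / name) != expected
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical stand-ins for suite files.
    (root / "manifest.json").write_text('{"cases": 7}')
    (root / "catalog.json").write_text('{"domains": 5}')
    recorded = {name: sha256_file(root / name) for name in ("manifest.json", "catalog.json")}

    unchanged = changed_files(recorded, root)
    (root / "catalog.json").write_text('{"domains": 6}')  # simulate a silent suite edit
    drifted = changed_files(recorded, root)

print(unchanged, drifted)
```

A rerun against a future version would compare scores only when the drift list is empty.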

Version Comparison Scaffold

Component | Current State | Decision Rule
Comparator | research/tracks/hypernym-infinite-mim/compare_version_eval_results.py | Consumes future Q1/Q2/Q3/Q4 artifacts and emits a structured comparison report without touching the live endpoint.
Current scaffold report | research/tracks/hypernym-infinite-mim/results/version-comparison/20260610T_version_comparison_scaffold_codex_v8/comparison.json | Status is comparison_scaffold_ready_no_candidate_artifacts; no V3 capability claim is possible until candidate artifacts exist.
Evidence filter | q1, q2, q3, and q4 are all marked missing for future candidate live evidence today. | Dry-runs, zero-token artifacts, missing artifacts, no-admitted-row artifacts, and artifacts missing required comparison fields are explicitly refused as capability evidence.
Validator coverage | version_comparison_verified: 1 | The eval suite now fails validation if the comparison scaffold disappears, points at a stale packet through the launch checklist, or starts overclaiming candidate capability.
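
The evidence filter's refusal rules can be sketched as a single classification function. This is an illustration under assumptions, not the comparator's code: the field names (`mode`, `admitted_rows`, `total_tokens`) and the reason strings are hypothetical.

```python
# Hypothetical required fields. The comparator's real contract lists
# more comparison fields than this sketch checks.
REQUIRED_FIELDS = ("mode", "admitted_rows", "total_tokens")


def refusal_reason(artifact):
    """Return why an artifact is not capability evidence, or None if it qualifies."""
    if artifact is None:
        return "missing_artifact"
    if any(field not in artifact for field in REQUIRED_FIELDS):
        return "missing_required_fields"
    if artifact["mode"] != "live":
        return "non_live_artifact"
    if artifact["admitted_rows"] == 0:
        return "no_admitted_rows"
    if artifact["total_tokens"] == 0:
        return "zero_token_artifact"
    return None


# Today all four candidate slots are missing, as the table states.
candidates = {"q1": None, "q2": None, "q3": None, "q4": None}
verdicts = {name: refusal_reason(artifact) for name, artifact in candidates.items()}

dry_run = refusal_reason({"mode": "dry_run", "admitted_rows": 12, "total_tokens": 0})
admitted = refusal_reason({"mode": "live", "admitted_rows": 4, "total_tokens": 9000})
print(verdicts, dry_run, admitted)
```

Ordering the checks from "missing" to "zero tokens" means each artifact reports the most basic reason it was refused.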

Live Launch Checklist

Control | Current State | Operator Meaning
Checklist artifact | research/tracks/hypernym-infinite-mim/results/live-launch-checklist/20260610T_live_launch_checklist_codex_v8/checklist.json | Gate-aware launch order for the first valid live Q1/Q2/Q3/Q4 run, with current packet/comparison pointers and the post-first-live threshold finalizer.
Status | ready_but_gate_blocked | The suite is staged, but live memory-quality rows must wait for a fresh isolated lane, server-side lease, quiet window, or same-size admitted HTTP 200 calibration.
Launch order | status -> objective_audit -> validate_suite -> suite_dry_run -> gate_check -> first_live_subset -> first_live_threshold_finalize -> suite_live -> version_compare | Prevents out-of-order experiments, accidental live traffic, and dry-run artifacts being treated as capability evidence.
Hard stops | No non-direct endpoint, no parallel live calls, stop after first unrecovered non-200, no dry-run/gate-refusal capability claims, no goal completion until full live coverage plus Q2 sequential state and Q4 abstention. | This is the operator contract for safe continuation on a shared endpoint.
Validator coverage | live_launch_checklist_verified: 1 | The eval suite now fails validation if the checklist is missing, overclaims gate status, or changes the launch order.
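
The launch order is a strict sequence with one refusal point. The sketch below walks the nine steps listed in the checklist and stops at `gate_check` when the gate refuses. The `run_launch` helper and its return shape are hypothetical, and the real runner executes each step rather than only recording its name.

```python
# Step names are taken from the launch order in the checklist table.
LAUNCH_ORDER = [
    "status",
    "objective_audit",
    "validate_suite",
    "suite_dry_run",
    "gate_check",
    "first_live_subset",
    "first_live_threshold_finalize",
    "suite_live",
    "version_compare",
]


def run_launch(gate_allows: bool):
    """Walk the launch order in sequence, stopping once the gate refuses."""
    completed = []
    for step in LAUNCH_ORDER:
        completed.append(step)
        # The gate check itself runs; nothing after it runs on refusal.
        if step == "gate_check" and not gate_allows:
            return completed, "blocked_by_gate"
    return completed, "complete"


blocked_steps, blocked_status = run_launch(gate_allows=False)
allowed_steps, allowed_status = run_launch(gate_allows=True)
print(blocked_status, blocked_steps)
```

Under the current gate decision, the walk ends after `gate_check`, so no live step is ever reached.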

First Live Certification Subset

Phase | Rows / Turns | What It Certifies
Gate recheck | 0 | Re-run v0.66 admission control and stop unless allow_memory_quality_run=true.
Q1 minimal cross-domain certification | 4 labels at p2048 | One max-pressure current-recall row each for story canon, agent workflow, relationship boundary, and psychology preference.
Q2 sequential state certification | 15 turns | Actual multi-turn state evolution across all objective domains without reducing the claim to a packed single prompt.
Q4 max-pressure abstention certification | 8 labels at p2048 | For relationship and psychology: current recall plus stale, rejected, and foreign abstention.
Total first certification | 27 | The smallest current plan that can close the main missing live-evidence gaps without running the whole suite first.
Executable dry-run | 4 / 15 / 8 | run-first-live-certification-subset --dry-run observed 4 Q1 rows, 15 Q2 turns, and 8 Q4 endpoint-runner rows with no live endpoint traffic.
Current limit | blocked_by_gate | Q4 now has an endpoint-capable runner; live execution is still blocked by the same v0.66 admission gate as the rest of the certification subset.
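
The 27-row total and the gate recheck can be expressed in a few lines. This is a minimal sketch: the phase names and the `plan_first_live` helper are hypothetical, the `ready` status is an assumed label, and only the `allow_memory_quality_run` field and the row counts come from the table above.

```python
# Row and turn counts per phase, as listed in the subset table.
PHASES = [
    ("gate_recheck", 0),
    ("q1_minimal_cross_domain", 4),
    ("q2_sequential_state", 15),
    ("q4_max_pressure_abstention", 8),
]


def plan_first_live(decision: dict) -> dict:
    """Total the planned rows/turns and decide which phases may run."""
    planned = sum(count for _, count in PHASES)
    # Anything other than an explicit True is treated as a refusal.
    allowed = decision.get("allow_memory_quality_run") is True
    return {
        "planned_rows_and_turns": planned,
        "status": "ready" if allowed else "blocked_by_gate",
        "phases_to_run": [name for name, _ in PHASES] if allowed else ["gate_recheck"],
    }


blocked = plan_first_live({"allow_memory_quality_run": False})
print(blocked)
```

Treating a missing or non-boolean flag as a refusal keeps the default behavior on the safe side of the gate.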
7
Concrete eval cases now defined.
4
Next-run queue groups: cross-domain resume, multi-turn session, isolation regression, sensitive abstention.
1
Explicit sequential multi-turn personal-memory case, not just packed recall.
0
Validator failures or warnings after catalog coverage checks.
12
Q1 rows materialized and dry-run verified for the cross-domain tail-contract resume.
4
Pending domains in Q1: story, agent workflow, relationship, psychology.
1
Research case carried forward as already-scored v0.62 control.
GATED
Planner and dry-run compiled labels but did not touch the live endpoint.
4
Executable dry-runs now validated: Q1 matrix, Q2 multi-turn session, Q3 isolation regression, and Q4 sensitive abstention.
12 / 12
Q1 dry-run rows completed with status dry_run.
1.0
Q1 dry-run strict and semantic true-north scores.
0
Live tokens consumed by Q1 dry-run.
4
Q3 isolation suites aggregated: tenant, revoked, forged namespace, epoch rollback.
96
Q3 dry-run logical rows.
240
Q3 dry-run page rows.
1.0
Q3 page-level safety across all aggregated suites.
15
Q2 sequential session turns materialized.
5
Current final-state domains in the Q2 scoring contract.
4
Forbidden stale/rejected/foreign fact ids checked at final probe.
Q2
Sequential session is runnable, dry-run verified, and gated for live traffic.
32
Q4 dry-run rows materialized for sensitive preference and boundary abstention.
2
Q4 cases: relationship boundary update and personal psychology preference.
4
Q4 query modes: current recall, stale abstain, rejected abstain, foreign abstain.
GATED
Q4 endpoint runner is staged; no live endpoint traffic or capability claim yet.
1.0
Q2 dry-run semantic true-north score.
1.0
Q2 dry-run strict true-north score.
2
Q2 probe turns scored in dry-run.
0
Live endpoint calls made by Q2 dry-run.

Coverage Map

Domain | Current Evidence | Gap Before Stronger Claim
Research development | v0.62 scored research-update rows passed strict and semantic true-north at 1024 and 2048 pressure. | Cross-domain 2048 matrix is incomplete after shared endpoint 503.
Story world canon | v0.57 passed all tested pressure bands; v0.65 tested same-size admission. | Newer same-size story rows were not admitted, so do not extend the quality claim yet.
Relationship boundary editing | Covered by v0.51 and staged in v0.62 cross-domain work. | Needs focused 2048 current-vs-superseded relationship boundary run under isolated lane.
Personal psychology preference | Covered by v0.51 and staged in v0.62 cross-domain work. | Needs contradiction pressure with sensitive-preference abstention and provenance checks.
Long-running agent workflows | v0.60 completed 6/6 agent-loop rows at 1024/2048; v0.61 tail-contract variants passed scored rows. | Needs repeated multi-turn session testing once admission is isolated.

Q1 Cross-Domain Resume Plan

Domain | Rows | Pressure | Status
Story world canon | 2 labels: tail contract + tail schema example. | 2048 | Pending gate allow.
Long-running agent workflows | 2 labels: tail contract + tail schema example. | 2048 | Pending gate allow.
Relationship boundary editing | 4 labels: two variants across two pressure bands. | 1024, 2048 | Pending gate allow.
Personal psychology preference | 4 labels: two variants across two pressure bands. | 1024, 2048 | Pending gate allow.
Research development | 0 rerun labels by default. | 1024, 2048 already scored in v0.62. | Control only unless needed.

Q2 Sequential Multi-Turn Plan

Phase | Turns | Purpose | Scored At
Active updates | 1, 3, 5, 7, 9, 12, 13 | Set current research, story, relationship, psychology, and agent facts, then supersede research and story. | Final probe.
Stale/rejected controls | 2, 4, 8 | Seed stale research plus rejected story and psychology records that must stay absent. | Final forbidden-id check.
Foreign control | 6 | Seed a different relationship entity with overlapping language. | Final foreign-id check.
Pressure inserts | 10, 14 | Add 256-band and 1024-band distractor pressure with near-matches. | Intermediate and final probes.
Probes | 11, 15 | Ask for current state as JSON without repasting the full synthetic bundle. | Semantic true-north, stale absence, foreign absence, admission.
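
The turn assignments above should partition the 15-turn session: every turn belongs to exactly one phase. The sketch below checks that property using the turn numbers from the plan. The phase keys and the `phase_by_turn` helper are hypothetical names chosen for illustration.

```python
# Turn numbers are copied from the Q2 plan; the phase keys are illustrative.
PHASES = {
    "active_updates": [1, 3, 5, 7, 9, 12, 13],
    "stale_rejected_controls": [2, 4, 8],
    "foreign_control": [6],
    "pressure_inserts": [10, 14],
    "probes": [11, 15],
}


def phase_by_turn(phases: dict, total_turns: int) -> dict:
    """Map each turn to its phase, rejecting duplicates and gaps."""
    assignment = {}
    for phase, turns in phases.items():
        for turn in turns:
            if turn in assignment:
                raise ValueError(f"turn {turn} is assigned to two phases")
            assignment[turn] = phase
    missing = sorted(set(range(1, total_turns + 1)) - set(assignment))
    if missing:
        raise ValueError(f"turns without a phase: {missing}")
    return assignment


assignment = phase_by_turn(PHASES, total_turns=15)
print(len(assignment), assignment[11], assignment[15])
```

Both probes land after a pressure insert (turn 11 after turn 10, turn 15 after turn 14), which is what lets the session score recall under the 256 and 1024 bands.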

Q3 Tenant / Foreign Boundary Regression

Suite | Rows | Safety Signal | Status
Tenant boundary | 24 logical / 60 page rows | Tenant B IDs, wicks, digests, and text absent at page level. | Dry-run verified.
Revoked memory | 24 logical / 60 page rows | Revoked IDs, wicks, digests, and text absent at page level. | Dry-run verified.
Forged namespace | 24 logical / 60 page rows | Forged digest and namespace collision controls absent at page level. | Dry-run verified.
Epoch rollback | 24 logical / 60 page rows | Stale epoch records, digests, and markers absent at page level. | Dry-run verified.
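
The aggregate Q3 counts follow directly from the four suites, and page-level safety can be read as the fraction of pages free of forbidden markers. The sketch below shows both. The suite keys, the `page_safety` helper, and the sample pages and markers are illustrative assumptions, not the scorer's actual inputs.

```python
# Per-suite row counts from the Q3 table; the keys are illustrative.
SUITES = {
    "tenant_boundary": {"logical": 24, "page": 60},
    "revoked_memory": {"logical": 24, "page": 60},
    "forged_namespace": {"logical": 24, "page": 60},
    "epoch_rollback": {"logical": 24, "page": 60},
}

totals = {
    "logical": sum(suite["logical"] for suite in SUITES.values()),
    "page": sum(suite["page"] for suite in SUITES.values()),
}


def page_safety(pages, forbidden_markers):
    """Fraction of pages that contain none of the forbidden markers."""
    if not pages:
        raise ValueError("no pages to score")
    safe = sum(
        1 for page in pages
        if not any(marker in page for marker in forbidden_markers)
    )
    return safe / len(pages)


# Made-up page text and markers, for illustration only.
clean = page_safety(["alpha current note", "beta current note"], ["tenant-b-id", "revoked-digest"])
leaky = page_safety(["alpha current note", "leaked tenant-b-id"], ["tenant-b-id"])
print(totals, clean, leaky)
```

The totals reproduce the 96 logical rows and 240 page rows reported for the Q3 dry-run.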

Q4 Sensitive Preference / Boundary Abstention

Dimension | Current Artifact State | Why CTO Should Care
Scope | 32 endpoint-runner dry-run rows across IM-PER-003 relationship boundary and IM-PER-004 personal psychology preference. | This turns the vague "sensitive memory" problem into exact current-vs-stale-vs-rejected-vs-foreign checks.
Pressure | 0, 256, 1024, and 2048 pressure bands. | Future live runs can show where abstention and current recall break as memory pressure increases.
Query modes | current_recall, stale_abstain, rejected_abstain, foreign_abstain. | Tests both usefulness and restraint: recall the latest valid user state, refuse superseded or wrong-person state.
Dry-run metrics | semantic_true_north_score=1.0, strict_true_north_score=1.0, abstention_correct_mean=1.0, forbidden_absence_mean=1.0, prompt_tokens_total=0, completion_tokens_total=0. | Proves the harness and scorer are wired; it does not claim the endpoint achieved these live.
Gate status | blocked_by_gate; live_endpoint_touched=false. | Protects the shared endpoint and keeps institutional claims honest.
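
The 32-row Q4 matrix is the product of 2 cases, 4 pressure bands, and 4 query modes, and the 8 first-live labels are its p2048 slice. The sketch below enumerates that matrix. The label format and the `expect_abstain` field are hypothetical; the case ids, bands, and modes come from the table above.

```python
from itertools import product

CASES = ["IM-PER-003", "IM-PER-004"]
PRESSURE_BANDS = [0, 256, 1024, 2048]
QUERY_MODES = ["current_recall", "stale_abstain", "rejected_abstain", "foreign_abstain"]


def build_rows():
    """Enumerate every case x pressure band x query mode combination."""
    return [
        {
            "label": f"{case}-p{band}-{mode}",  # hypothetical label format
            "case": case,
            "pressure": band,
            "mode": mode,
            # Only current_recall rows should answer; the rest must abstain.
            "expect_abstain": mode != "current_recall",
        }
        for case, band, mode in product(CASES, PRESSURE_BANDS, QUERY_MODES)
    ]


rows = build_rows()
first_live = [row for row in rows if row["pressure"] == 2048]
abstain_in_first_live = sum(row["expect_abstain"] for row in first_live)
print(len(rows), len(first_live), abstain_in_first_live)
```

Of the 8 p2048 rows, 6 expect abstention and 2 expect current recall, one per case.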

Suite Orchestrator

Mode | Command | Current Result
Dry-run | forge_runner.sh run-personal-memory-eval-suite --dry-run | Passes Q1, Q2, Q3, and Q4 with no live endpoint traffic; the bootstrap dry-run skips preflight validation only to break first-materialization circularity.
Live | forge_runner.sh run-personal-memory-eval-suite --live | Currently exits `blocked_by_gate` before endpoint traffic because v0.66 blocks admission.

Reusable Case Catalog

Case | Domain | What It Tests | Next Queue
IM-PER-001 | Research development | Latest accepted research claim over stale hypotheses, rejected interpretations, and foreign research entities. | cross_domain_tail_contract_resume
IM-PER-002 | Story world canon | Current character, setting, and plot invariants over discarded drafts and decoy characters. | cross_domain_tail_contract_resume
IM-PER-003 | Relationship boundary editing | Current boundary and allowed communication mode over stale, rejected, and foreign-person records. | sensitive_preference_boundary_abstention
IM-PER-004 | Personal psychology preference | Current self-model/preference with abstention for rejected or diagnosis-like framings. | sensitive_preference_boundary_abstention
IM-PER-005 | Long-running agent workflows | Current directive, state-machine node, and next action over older directives and foreign agent tasks. | cross_domain_tail_contract_resume
IM-PER-006 | Sequential multi-turn session | Conversation updates across research, story, relationship, psychology, and agent state without repasting the full synthetic bundle. | multi_turn_personal_memory_session
IM-PER-007 | Tenant / foreign boundary | Empty or abstain on foreign, revoked, forged namespace, or rollback-epoch memory. | tenant_foreign_boundary_regression

What Is Actually Proven Right Now

Quality

Partial v0.62 evidence shows exact current research-update recall survived 1024 and 2048 pressure.

Control

v0.54-v0.61 show controller-selected current payloads and tail output contracts are stronger than broad freeform recall.

Safety

Prior isolation rows cover tenant, revoked, stale, forged namespace, nonce replay, and rollback-style failure modes.

Serving

v0.65 proves health can be OK while same-size large requests are not admitted on the shared lane.

Next

Resume only after an isolated lane, server-side lease, quiet window, or passing same-size calibration.

Data Trace

Artifact | Path / Handle | Use
Manifest | research/tracks/hypernym-infinite-mim/infinite-memory-eval-suite-manifest.json | Machine-readable suite coverage and gates.
Case catalog | research/tracks/hypernym-infinite-mim/personal-memory-eval-case-catalog.json | Concrete reusable cases, next-run queues, pressure bands, and success floors.
Validation report | research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json | Pass/fail proof: 7 cases, 5 domains, 8 axes, 4 materialized plans, 4 executable dry-runs, first-live subset, suite orchestrator, objective audit, V3 packet, comparator, launch checklist, threshold-boundary analysis, and Q4 abstention verified with no live endpoint traffic.
Objective readiness audit | research/tracks/hypernym-infinite-mim/results/objective-readiness-audit/20260610T_objective_completion_matrix_finalizer_codex_v1/audit.json | Machine-readable closeout: 11 satisfied harness requirements, 2 partial-live evidence requirements, 6 blocked current-suite live-certification requirements, explicit evidence summary, and `goal_complete: false`.
Versioned eval packet | research/tracks/hypernym-infinite-mim/results/versioned-eval-packet/20260610T_versioned_eval_packet_v3_candidate_codex_v8/packet.json | V3/new-version rerun contract: command order, suite fingerprints, live policy, comparison fields, and data trace.
Version comparison scaffold | research/tracks/hypernym-infinite-mim/results/version-comparison/20260610T_version_comparison_scaffold_codex_v8/comparison.json | Future V3 comparison contract: refuses dry-run/gate-refusal/health-only artifacts as capability evidence and emits structured deltas when live candidate artifacts exist.
Live launch checklist | research/tracks/hypernym-infinite-mim/results/live-launch-checklist/20260610T_live_launch_checklist_codex_v8/checklist.json | Operator handoff contract: launch order, first-live subset plan, threshold finalizer, gate decision, hard stops, after-live result steps, and current packet/comparison pointers for the first valid live Q1/Q2/Q3/Q4 suite run.
First live certification subset | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/plan.json | Minimal post-gate live certification plan: 4 Q1 rows, 15 Q2 turns, 8 Q4 rows, 27 total live rows/turns after gate allow.
First live subset dry-run | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/subset-run.json | Executable dry-run proof: 4 Q1 rows, 15 Q2 turns, 8 Q4 endpoint-runner rows, no live endpoint traffic, plus an artifact index for Q1/Q2/Q4 scores and the follow-on threshold-analysis command.
First live subset artifact index | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/artifact-index.json | Machine-readable handoff: Q1 scores path, Q2 scores path, Q4 scores path, label files, and threshold-analysis command template.
Post-first-live threshold finalizer | research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_dry_index_codex_v1/finalizer.json | Consumes the artifact index and produces threshold analysis from indexed Q1/Q2/Q4 score paths; current dry-index proof classifies all indexed scores as non-live.
Finalizer non-live refusal | research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_nonlive_refusal_codex_v1/finalizer.json | Guard proof: without explicit non-live allowance, dry indexed score files are refused and no threshold analysis is produced.
Finalizer dry-index threshold analysis | research/tracks/hypernym-infinite-mim/results/threshold-boundary-analysis/20260610T_first_live_threshold_finalizer_dry_index_codex_v1_threshold/analysis.json | No-promotion proof: dry indexed Q1/Q2/Q4 artifacts do not create live-success rows or Q2 live certification.
First live subset gate refusal | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_live_gate_refusal_indexed_codex_v1/subset-run.json | Live-mode safety proof: exits blocked_by_gate with live_endpoint_touched=false while v0.66 blocks admission, while still writing the expected artifact index for a future admitted run.
First live Q1 labels | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/q1-first-labels.txt | Four p2048 labels for story, agent, relationship, and psychology current-recall certification.
First live Q4 labels | research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/q4-first-labels.txt | Eight p2048 labels covering current recall plus stale/rejected/foreign abstention for relationship and psychology.
Threshold boundary analysis | research/tracks/hypernym-infinite-mim/results/threshold-boundary-analysis/20260610T_threshold_boundary_analysis_live_inputs_codex_v1/analysis.json | Pressure-threshold matrix with explicit live score source tracing: research has a live 2048 lower bound; story, relationship, psychology, agent workflow, and Q2 remain dry-run-ready but not live-certified.
Suite orchestrator | research/tracks/hypernym-infinite-mim/run_personal_memory_eval_suite.py | One-command entrypoint for validation and Q1/Q2/Q3/Q4 execution.
Suite dry-run | research/tracks/hypernym-infinite-mim/results/personal-memory-eval-suite/20260610T_personal_memory_eval_suite_dryrun_codex_v3/suite-run.json | Full orchestrator proof: dry_run_pass, validation bootstrap + Q1 + Q2 + Q3 + Q4.
Q1 plan | research/tracks/hypernym-infinite-mim/results/q1-cross-domain-tail-contract-resume-plan/20260610T_q1_cross_domain_tail_contract_resume_plan_codex_v1/plan.json | Exact 12-row resume plan plus already-scored research control.
Q1 selected labels | research/tracks/hypernym-infinite-mim/results/q1-cross-domain-tail-contract-resume-plan/20260610T_q1_cross_domain_tail_contract_resume_plan_codex_v1/selected-labels.txt | Execution label list for `run-unscored-domain-drain-resume` once v0.66 allows live traffic.
Q1 dry-run scores | research/tracks/hypernym-infinite-mim/results/v0.63-unscored-domain-drain-resume/20260610T_q1_cross_domain_tail_contract_resume_dryrun_codex_v1/scores.json | Executable selected-label proof: 12 rows, strict/semantic true-north 1.0, no live endpoint traffic.
Q2 plan | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session-plan/20260610T_q2_multi_turn_personal_memory_session_plan_codex_v1/plan.json | 15-turn sequential personal-memory plan with active updates, controls, pressure inserts, probes, and scoring contract.
Q2 turns | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session-plan/20260610T_q2_multi_turn_personal_memory_session_plan_codex_v1/turns.jsonl | Turn-by-turn session source for the future live runner.
Q2 dry-run scores | research/tracks/hypernym-infinite-mim/results/q2-multi-turn-personal-memory-session/20260610T_q2_multi_turn_personal_memory_session_dryrun_codex_v1/scores.json | Executable runner proof: 15 turns, 2 probes, strict/semantic true-north 1.0, no live endpoint traffic.
Q3 plan | research/tracks/hypernym-infinite-mim/results/q3-tenant-foreign-boundary-regression-plan/20260610T_q3_tenant_foreign_boundary_regression_plan_codex_v1/plan.json | Aggregates tenant, revoked, forged namespace, and epoch rollback dry-runs into one boundary regression artifact.
Q4 plan | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/plan.json | 32-row sensitive preference/boundary abstention plan across current, stale, rejected, and foreign query modes.
Q4 endpoint runner | research/tracks/hypernym-infinite-mim/run_q4_sensitive_preference_boundary_abstention.py | Direct-endpoint-capable executor with gate refusal, local dry-run scoring, and frontier-endpoint guardrails.
Q4 selected labels | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/selected-labels.txt | Execution labels for Q4 once the admission gate allows live traffic.
Q4 dry-run scores | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_dryrun_codex_v1/scores.json | Dry-run scoring proof: 32 rows, abstention correctness 1.0, forbidden absence 1.0, no live endpoint traffic.
Q4 gate refusal | research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_live_gate_refusal_codex_v1/scores.json | Live-mode safety proof: blocked_by_gate, completed_rows=0, live_endpoint_touched=false.
Latest gate | research/tracks/hypernym-infinite-mim/results/v0.66-admission-control-gate/20260610T_admission_control_gate_codex_v1/decision.json | Blocks live memory-quality rows under current shared-lane conditions.
Latest positive memory-quality evidence | research/tracks/hypernym-infinite-mim/results/v0.62-tail-contract-cross-domain-pressure/20260610T_tail_contract_cross_domain_pressure_live_codex_v1/scores.json | Partial cross-domain pressure score file.
Latest same-size admission evidence | research/tracks/hypernym-infinite-mim/results/v0.65-request-size-admission-calibration/20260610T_request_size_admission_calibration_live_codex_v1/scores.json | Capacity/admission result: zero admitted rows under shared endpoint pressure.
Working memory | research/tracks/hypernym-infinite-mim/WORKING_MEMORY.md | Human handoff and current operational state.

API Pull Targets

Need | Local Pull Target | Stable Interpretation
Latest validated suite state | jq '.coverage_counts, .status, .failures, .warnings' research/tracks/hypernym-infinite-mim/results/eval-suite-manifest-validation/20260610T_eval_objective_audit_finalizer_codex_v1/report.json | Shows validator pass/fail plus counts for Q1/Q2/Q3/Q4 materialization.
First-live threshold handoff | jq '.threshold_analysis_command' research/tracks/hypernym-infinite-mim/results/first-live-certification-subset/20260610T_first_live_certification_subset_dryrun_indexed_codex_v1/artifact-index.json | Shows the command template that will recompute live threshold boundaries from Q1/Q2/Q4 score artifacts after the gate admits traffic.
Finalizer guard status | jq '.status, .live_like, .threshold_analysis' research/tracks/hypernym-infinite-mim/results/post-first-live-threshold-finalizer/20260610T_first_live_threshold_finalizer_nonlive_refusal_codex_v1/finalizer.json | Shows that current dry indexed scores are refused as capability evidence unless a harness-only override is explicit.
Q4 abstention contract | jq '.planned_row_count, .case_ids, .query_modes, .pressure_bands, .status' research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention-plan/20260610T_q4_sensitive_preference_boundary_abstention_plan_codex_v1/plan.json | Shows exactly what future live Q4 rows will test.
Q4 dry-run scores | jq '.summary' research/tracks/hypernym-infinite-mim/results/q4-sensitive-preference-boundary-abstention/20260610T_q4_sensitive_preference_boundary_abstention_dryrun_codex_v1/scores.json | Harness/scorer proof only; do not treat as live model capability.
Future version comparison | bash research/tracks/hypernym-infinite-mim/forge_runner.sh compare-version-eval-results --run-id <id> --candidate-version <version> --q1-scores <path> --q2-scores <path> --q3-scores <path> --q4-scores <path> | Creates a structured delta once live candidate artifacts exist.
First live launch plan | jq '.summary, .phases[].phase' research/tracks/hypernym-infinite-mim/results/first-live-certification-subset-plan/20260610T_first_live_certification_subset_codex_v1/plan.json | Shows the minimal 27-row/turn certification sequence and gate-blocked status.
Resume snapshot | .forge/artifacts/cxdb-hypernym-infinite-mim-post-q4-sensitive-boundary-snapshot-20260610T095129Z.json | Portable handoff record for a CTO, future agent, or local retrieval tool.
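
For tools that cannot shell out to jq, the same pulls can be done with the standard `json` module. The sketch below mirrors the first pull target. The inline `report` is a stand-in whose key names inside `coverage_counts` and whose `status` value are assumptions; the counts (7 cases, 5 domains, 8 axes) are the ones this board reports. The `pull` helper is hypothetical.

```python
import json


def pull(document: dict, dotted_paths):
    """Resolve dotted key paths such as 'coverage_counts.cases' against a JSON document."""
    results = {}
    for path in dotted_paths:
        value = document
        for key in path.split("."):
            value = value[key]  # raises KeyError if the report lacks the field
        results[path] = value
    return results


# Stand-in for report.json; the layout and status value are assumptions.
report = json.loads(
    '{"status": "pass",'
    ' "coverage_counts": {"cases": 7, "domains": 5, "axes": 8},'
    ' "failures": [], "warnings": []}'
)

state = pull(report, ["coverage_counts", "status", "failures", "warnings"])
print(state)
```

Against the real artifact, `report` would come from `json.load` on the validation report path, and a `KeyError` would indicate that the report no longer carries a field this board depends on.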

Compound Research Chain