D2.1 token-agreement harness scaffold (I11 cert gate infra, 117/117 tests) #236
Merged
First Phase 2 deliverable: scaffold of the I11 cert gate harness. The PR #219 → #220 lesson landed as a typed-rejection wall: the stub result carries stub:true + backend:"stub" so no client can confuse Phase 0 stub output for a real measurement.

crates/cognitive-shader-driver/src/token_agreement.rs (~320 LOC):

ReferenceModel { path, path_hash, stub_token_count }
  ::load(&Path) -> Result<Self, TokenAgreementError>
    D2.1 stub: validates path exists, hashes display; does NOT parse safetensors yet.
    D2.2 replaces with real loader driven by auto_detect::detect() → ModelFingerprint.
  ::stub(tag, n_tokens): builds stub model without touching fs

TokenAgreementError:
  ModelPathMissing { path }
  EmptyPromptSet
  TokenCountMismatch { reference, candidate }
  NotImplementedYet { what } ← measure_full() until D2.2

TopKAgreement { top1_matches, top5_matches, total_positions, divergence_positions: Vec<u32> }
  ::compare(ref: &[Vec<u32>], cand: &[Vec<u32>]) -> Result<Self>
    Position-by-position: top1 = r[0] == c[0]; top5 = r[0] in c[..5].
    Records divergence positions for failure-mode analysis (late-sequence drift vs random errors).
  ::top1_rate() / top5_rate() -> f32
  ::meets_cert_gate() -> bool (top1 ≥ 0.99 AND top5 ≥ 0.999)
  ::aggregate(per_prompt): sums counters; concatenates divergence with per-prompt offset so failures stay localised

TokenAgreementHarness:
  ::new(reference, baseline, candidate, n_tokens)
  ::measure_stub() -> WireTokenAgreementResult { stub:true, .. }
  ::measure_full() -> NotImplementedYet (D2.2 scope)

Tests (13 new):
- reference_model_stub_builds_without_filesystem
- reference_model_load_missing_path_yields_typed_error
- topk_compare_identical_streams_is_perfect (full cert gate pass)
- topk_compare_all_different_fails_cert_gate
- topk_top5_matches_when_top1_misses_but_in_top5 (ref top-1 = 7; cand has 7 at position 3 in top-5 → top5 counts)
- topk_mismatched_stream_lengths_yield_typed_error
- topk_aggregate_sums_counters_and_offsets_divergence (prompt 2's divergence at pos 4 → aggregate pos 14 after prompt 1's 10)
- cert_gate_passes_at_exact_thresholds (990/1000 = 0.99, 999/1000 = 0.999; both boundaries hit)
- cert_gate_fails_when_top1_below_threshold_even_if_top5_passes
- cert_gate_fails_when_top5_below_threshold_even_if_top1_passes
- harness_measure_stub_returns_machine_checkable_stub_flag (stub:true enforced; backend="stub"; all rates 0.0; zero latencies)
- harness_measure_full_returns_not_implemented_pointing_at_d22
- harness_measure_stub_rejects_zero_n_tokens

Board hygiene (CLAUDE.md Mandatory rule):
  STATUS_BOARD.md D2.1 Queued → In PR

Phase state:
  Phase 0 ✅ complete (D0.1-D0.7 all shipped)
  Phase 1 scaffold ✅ (D1.1, D1.2, D1.3 shipped; D1.1b queued)
  Phase 2 ⏳ D2.1 (this PR), D2.2 + D2.3 queued

Rules honored:
  Rule D: Measurement set comes from Wire DTOs (D0.2 WireTokenAgreement)
  Rule E: TopKAgreement exposes object-methods (top1_rate, meets_cert_gate)
  Rule F: No serialization between stages; per-prompt Vec<Vec<u32>> token streams are plain Rust owned; the serde happens at D2.3 handler entry / exit only

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
Summary
First Phase 2 deliverable: token-agreement harness scaffold. The I11 cert gate infrastructure lands with a machine-checkable stub: true wall to prevent the #219 → #220 failure mode (synthetic numbers read as real measurements).

117/117 cognitive-shader-driver --features serve tests pass (+13 new).

What lands

crates/cognitive-shader-driver/src/token_agreement.rs (~320 LOC):

ReferenceModel

In D2.1, load() validates that the path exists and hashes the path's display string; it does not parse the model. D2.2 replaces this with real safetensors parsing + tokenizer + runtime decoder, driven by auto_detect::detect() (D0.5).

TopKAgreement comparator

Position-by-position comparison. Records divergence positions for failure-mode analysis ("late-sequence drift" vs "random errors everywhere"). Aggregation concatenates per-prompt divergences with offsets so failures stay localisable.
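The comparison, cert gate, and aggregation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in token_agreement.rs: the field names and thresholds come from the commit message, while the `CompareError` type, the treatment of a top-1 miss as the "divergence" event, the handling of empty per-position lists, and the zero-total rate of 0.0 are assumptions made for the example.

```rust
/// Counters for one prompt, or for an aggregate of several prompts.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct TopKAgreement {
    pub top1_matches: u32,
    pub top5_matches: u32,
    pub total_positions: u32,
    pub divergence_positions: Vec<u32>,
}

/// Assumption: a local stand-in for the PR's TokenAgreementError.
#[derive(Debug, PartialEq)]
pub enum CompareError {
    TokenCountMismatch { reference: usize, candidate: usize },
}

impl TopKAgreement {
    /// Each element of `reference` / `candidate` holds the ranked top-k
    /// token ids at one position (index 0 is the top-1 token).
    pub fn compare(
        reference: &[Vec<u32>],
        candidate: &[Vec<u32>],
    ) -> Result<Self, CompareError> {
        if reference.len() != candidate.len() {
            return Err(CompareError::TokenCountMismatch {
                reference: reference.len(),
                candidate: candidate.len(),
            });
        }
        let mut out = Self::default();
        for (pos, (r, c)) in reference.iter().zip(candidate).enumerate() {
            out.total_positions += 1;
            let ref_top1 = r.first();
            // top1: both top-1 tokens exist and are equal.
            let top1 = ref_top1.is_some() && ref_top1 == c.first();
            // top5: the reference top-1 appears in the candidate's first five.
            let top5 = ref_top1.map_or(false, |t| c.iter().take(5).any(|x| x == t));
            if top1 {
                out.top1_matches += 1;
            } else {
                // Assumption: a "divergence" is recorded on every top-1 miss.
                out.divergence_positions.push(pos as u32);
            }
            if top5 {
                out.top5_matches += 1;
            }
        }
        Ok(out)
    }

    fn rate(matches: u32, total: u32) -> f32 {
        // Assumption: an empty comparison reports 0.0 rather than NaN.
        if total == 0 { 0.0 } else { matches as f32 / total as f32 }
    }

    pub fn top1_rate(&self) -> f32 {
        Self::rate(self.top1_matches, self.total_positions)
    }

    pub fn top5_rate(&self) -> f32 {
        Self::rate(self.top5_matches, self.total_positions)
    }

    /// AND-gate: both thresholds must hold.
    pub fn meets_cert_gate(&self) -> bool {
        self.top1_rate() >= 0.99 && self.top5_rate() >= 0.999
    }

    /// Sums counters and shifts each prompt's divergence positions by the
    /// number of positions contributed by the prompts before it.
    pub fn aggregate(per_prompt: &[TopKAgreement]) -> TopKAgreement {
        let mut out = Self::default();
        for p in per_prompt {
            let offset = out.total_positions;
            out.divergence_positions
                .extend(p.divergence_positions.iter().map(|d| d + offset));
            out.top1_matches += p.top1_matches;
            out.top5_matches += p.top5_matches;
            out.total_positions += p.total_positions;
        }
        out
    }
}
```

With this shape, a divergence at position 4 of a second prompt that follows a 10-position first prompt is reported at aggregate position 14, matching the case the PR's aggregation test describes.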
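
TokenAgreementHarness

measure_stub() returns stub: true, backend: "stub", top1_rate: 0.0, top5_rate: 0.0. The stub flag is machine-checkable per D0.2's anti-#219 pattern: clients assert !result.stub to fail loudly if they mistake stub output for real measurements.

Typed errors

- ModelPathMissing { path }
- EmptyPromptSet
- TokenCountMismatch { reference, candidate }
- NotImplementedYet { what } (returned by measure_full() until D2.2)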
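As a rough illustration of how the typed errors and the stub wall fit together, the sketch below uses the four TokenAgreementError variants named in the commit message. Everything else is assumed for the example: the field types, the one-argument `new` (the PR's constructor also takes reference, baseline, and candidate), the reduced set of result fields, and the `require_real_measurement` helper, which is hypothetical and only shows one way a client could enforce the `!result.stub` check.

```rust
use std::fmt;
use std::path::PathBuf;

/// Variant names follow the commit message; field types are assumptions.
#[derive(Debug, PartialEq)]
pub enum TokenAgreementError {
    ModelPathMissing { path: PathBuf },
    EmptyPromptSet,
    TokenCountMismatch { reference: usize, candidate: usize },
    NotImplementedYet { what: &'static str },
}

impl fmt::Display for TokenAgreementError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelPathMissing { path } => {
                write!(f, "model path missing: {}", path.display())
            }
            Self::EmptyPromptSet => write!(f, "empty prompt set"),
            Self::TokenCountMismatch { reference, candidate } => write!(
                f,
                "token count mismatch: reference={reference}, candidate={candidate}"
            ),
            Self::NotImplementedYet { what } => write!(f, "not implemented yet: {what}"),
        }
    }
}

impl std::error::Error for TokenAgreementError {}

/// Assumption: a reduced version of the wire result with only the fields
/// the PR description mentions.
#[derive(Debug, PartialEq)]
pub struct WireTokenAgreementResult {
    pub stub: bool,
    pub backend: String,
    pub top1_rate: f32,
    pub top5_rate: f32,
}

/// Assumption: simplified harness that only tracks the token budget.
pub struct TokenAgreementHarness {
    n_tokens: usize,
}

impl TokenAgreementHarness {
    pub fn new(n_tokens: usize) -> Self {
        Self { n_tokens }
    }

    /// Returns placeholder numbers that are explicitly flagged as a stub.
    pub fn measure_stub(&self) -> Result<WireTokenAgreementResult, TokenAgreementError> {
        if self.n_tokens == 0 {
            return Err(TokenAgreementError::EmptyPromptSet);
        }
        Ok(WireTokenAgreementResult {
            stub: true,
            backend: "stub".to_string(),
            top1_rate: 0.0,
            top5_rate: 0.0,
        })
    }

    /// Real measurement is out of scope until D2.2.
    pub fn measure_full(&self) -> Result<WireTokenAgreementResult, TokenAgreementError> {
        Err(TokenAgreementError::NotImplementedYet {
            what: "measure_full (D2.2 scope)",
        })
    }
}

/// Hypothetical client-side guard: refuse to treat stub output as a measurement.
pub fn require_real_measurement(result: &WireTokenAgreementResult) -> Result<(), String> {
    if result.stub {
        Err(format!("refusing stub result from backend {:?}", result.backend))
    } else {
        Ok(())
    }
}
```

The point of the design is that the stub path succeeds with a value a program can reject mechanically, while the unimplemented path fails with a typed error instead of returning plausible-looking numbers.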
Tests (13 new)
Critical coverage:
- topk_compare_identical_streams_is_perfect: full cert gate pass
- topk_top5_matches_when_top1_misses_but_in_top5: top-5 logic verified on ref[0] = 7 appearing at position 3 in candidate top-5
- topk_aggregate_sums_counters_and_offsets_divergence: prompt 2's divergence at position 4 becomes aggregate position 14 after prompt 1's 10 positions
- cert_gate_passes_at_exact_thresholds: 990/1000 = 0.99 AND 999/1000 = 0.999 (exact boundary pass)
- cert_gate_fails_when_top1_below_threshold_even_if_top5_passes: AND-gate semantics
- cert_gate_fails_when_top5_below_threshold_even_if_top1_passes: AND-gate semantics
- harness_measure_stub_returns_machine_checkable_stub_flag: enforces stub == true, backend == "stub", zero rates + latencies
- harness_measure_full_returns_not_implemented_pointing_at_d22: D2.2 scope pointer preserved
- harness_measure_stub_rejects_zero_n_tokens: EmptyPromptSet typed error

Phase state after merge
Rules honored
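- Rule D: measurement set comes from the WireTokenAgreement DTO (D0.2 surface)
- Rule E: TopKAgreement exposes methods (top1_rate, meets_cert_gate, aggregate), not raw field access
- Rule F: no serialization between stages; per-prompt Vec<Vec<u32>> token streams are owned Rust values

Board hygiene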
STATUS_BOARD.md: D2.1 Queued → In PR

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh