
D2.1 token-agreement harness scaffold (I11 cert gate infra, 117/117 tests) #236

Merged
AdaWorldAPI merged 1 commit into main from claude/teleport-session-setup-wMZfb
Apr 21, 2026

Conversation

@AdaWorldAPI
Owner

Summary

First Phase 2 deliverable: the token-agreement harness scaffold. The I11 cert gate infrastructure lands with a machine-checkable stub: true flag that acts as a wall against the #219/#220 failure mode (synthetic numbers read as real measurements).

117/117 cognitive-shader-driver --features serve tests pass (+13 new).

What lands

crates/cognitive-shader-driver/src/token_agreement.rs — ~320 LOC:

ReferenceModel

pub struct ReferenceModel { path, path_hash, stub_token_count }

impl ReferenceModel {
    pub fn load(path: &Path) -> Result<Self, TokenAgreementError>;   // D2.1 stub
    pub fn stub(tag: u64, n_tokens: u32) -> Self;                     // testing fixture
}

In D2.1, load() only validates that the path exists and hashes its display form. D2.2 replaces this with real safetensors parsing, a tokenizer, and a runtime decoder, driven by auto_detect::detect() (D0.5).
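A minimal sketch of what the D2.1 stub behaviour described above could look like. The field types, the use of DefaultHasher, the stub:// path, and the choice to reuse the tag as path_hash are all assumptions made for illustration; the PR only states the field names and that load() checks existence and hashes the display form.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Only the variant this sketch needs; the real enum has more variants.
#[derive(Debug, PartialEq)]
pub enum TokenAgreementError {
    ModelPathMissing { path: PathBuf },
}

/// Field names come from the PR; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ReferenceModel {
    pub path: PathBuf,
    pub path_hash: u64,
    pub stub_token_count: u32,
}

impl ReferenceModel {
    /// Stub loader: checks that the path exists and hashes its display
    /// string. It does not open or parse the file.
    pub fn load(path: &Path) -> Result<Self, TokenAgreementError> {
        if !path.exists() {
            return Err(TokenAgreementError::ModelPathMissing {
                path: path.to_path_buf(),
            });
        }
        let mut hasher = DefaultHasher::new();
        path.display().to_string().hash(&mut hasher);
        Ok(Self {
            path: path.to_path_buf(),
            path_hash: hasher.finish(),
            stub_token_count: 0, // assumption: nothing is counted until D2.2
        })
    }

    /// Test fixture that never touches the filesystem.
    /// Assumption: the tag doubles as the hash so fixtures are easy to tell apart.
    pub fn stub(tag: u64, n_tokens: u32) -> Self {
        Self {
            path: PathBuf::from(format!("stub://{}", tag)),
            path_hash: tag,
            stub_token_count: n_tokens,
        }
    }
}
```

With this shape, a missing path surfaces as ModelPathMissing rather than a panic, and ReferenceModel::stub(7, 32) can be built in tests without any model file on disk.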

TopKAgreement comparator

pub struct TopKAgreement { top1_matches, top5_matches, total_positions, divergence_positions }

impl TopKAgreement {
    pub fn compare(reference_topk: &[Vec<u32>], candidate_topk: &[Vec<u32>]) -> Result<Self>;
    pub fn top1_rate() / top5_rate() -> f32;
    pub fn meets_cert_gate() -> bool;    // top1 ≥ 0.99 AND top5 ≥ 0.999
    pub fn aggregate(per_prompt: &[Self]) -> Self;
}

Position-by-position comparison. Records divergence positions for failure-mode analysis ("late-sequence drift" vs "random errors everywhere"). Aggregation concatenates per-prompt divergences with offsets so failures stay localisable.
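The comparison and aggregation described above can be sketched roughly as follows. This is an illustration, not the code in token_agreement.rs: the field types, the error payload types, and the decision to record a divergence on every top-1 miss are assumptions. The top-1 and top-5 rules follow the commit message (top1 = r[0] == c[0]; top5 = r[0] in c[..5]).

```rust
/// Only the variant this sketch needs; payload types are assumptions.
#[derive(Debug, PartialEq)]
pub enum TokenAgreementError {
    TokenCountMismatch { reference: usize, candidate: usize },
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct TopKAgreement {
    pub top1_matches: u32,
    pub top5_matches: u32,
    pub total_positions: u32,
    pub divergence_positions: Vec<u32>,
}

impl TopKAgreement {
    /// Compares two streams of per-position top-k token ids.
    pub fn compare(
        reference_topk: &[Vec<u32>],
        candidate_topk: &[Vec<u32>],
    ) -> Result<Self, TokenAgreementError> {
        if reference_topk.len() != candidate_topk.len() {
            return Err(TokenAgreementError::TokenCountMismatch {
                reference: reference_topk.len(),
                candidate: candidate_topk.len(),
            });
        }
        let mut out = Self::default();
        for (pos, (r, c)) in reference_topk.iter().zip(candidate_topk).enumerate() {
            out.total_positions += 1;
            let ref_top1 = r.first();
            // top-1: the reference's best token is also the candidate's best token.
            let top1_hit = ref_top1.is_some() && ref_top1 == c.first();
            // top-5: the reference's best token appears in the candidate's first five.
            let top5_hit = match ref_top1 {
                Some(t) => c.iter().take(5).any(|x| x == t),
                None => false,
            };
            if top1_hit {
                out.top1_matches += 1;
            } else {
                // Assumption: a divergence is any top-1 miss.
                out.divergence_positions.push(pos as u32);
            }
            if top5_hit {
                out.top5_matches += 1;
            }
        }
        Ok(out)
    }

    /// Sums counters and shifts each prompt's divergence positions by the
    /// number of positions that came before it.
    pub fn aggregate(per_prompt: &[Self]) -> Self {
        let mut out = Self::default();
        for p in per_prompt {
            let offset = out.total_positions;
            out.divergence_positions
                .extend(p.divergence_positions.iter().map(|d| d + offset));
            out.top1_matches += p.top1_matches;
            out.top5_matches += p.top5_matches;
            out.total_positions += p.total_positions;
        }
        out
    }
}
```

Under these assumptions, a divergence at position 4 of a second prompt lands at aggregate position 14 when the first prompt contributed 10 positions, which is the case the aggregation test in this PR describes.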

TokenAgreementHarness

pub struct TokenAgreementHarness { reference, baseline, candidate, n_tokens }

impl TokenAgreementHarness {
    pub fn measure_stub() -> Result<WireTokenAgreementResult>;        // D2.1: stub:true
    pub fn measure_full() -> Result<WireTokenAgreementResult>;        // D2.2: NotImplementedYet
}

measure_stub() returns stub: true, backend: "stub", top1_rate: 0.0, top5_rate: 0.0. The stub flag is machine-checkable per D0.2's anti-#219 pattern — clients assert !result.stub to fail loudly if they mistake stub output for real measurements.
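To make the client-side assertion concrete, here is a hedged sketch of a stub result and a guard that refuses it. StubHarness and require_real_measurement are hypothetical names introduced for this example, and the result struct carries only the four fields named above (the real WireTokenAgreementResult also has latency fields whose names are not shown in this PR).

```rust
/// Reduced stand-in for the wire DTO: only the fields named in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct WireTokenAgreementResult {
    pub stub: bool,
    pub backend: String,
    pub top1_rate: f32,
    pub top5_rate: f32,
}

/// Only the variant this sketch needs.
#[derive(Debug, PartialEq)]
pub enum TokenAgreementError {
    EmptyPromptSet,
}

/// Hypothetical, reduced harness: just enough state to show the stub path.
pub struct StubHarness {
    pub n_tokens: u32,
}

impl StubHarness {
    /// Returns a result that is explicitly flagged as synthetic.
    pub fn measure_stub(&self) -> Result<WireTokenAgreementResult, TokenAgreementError> {
        if self.n_tokens == 0 {
            return Err(TokenAgreementError::EmptyPromptSet);
        }
        Ok(WireTokenAgreementResult {
            stub: true,
            backend: "stub".to_string(),
            top1_rate: 0.0,
            top5_rate: 0.0,
        })
    }
}

/// Hypothetical client-side guard: fails loudly on stub output instead of
/// letting zeroed rates pass as a measurement.
pub fn require_real_measurement(
    result: &WireTokenAgreementResult,
) -> Result<&WireTokenAgreementResult, String> {
    if result.stub {
        Err(format!(
            "refusing stub result from backend {:?}: not a real measurement",
            result.backend
        ))
    } else {
        Ok(result)
    }
}
```

A reporting path that calls require_real_measurement before reading top1_rate cannot publish the 0.0 placeholders by accident, which is the failure mode the flag is meant to block.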

Typed errors

pub enum TokenAgreementError {
    ModelPathMissing { path },
    EmptyPromptSet,
    TokenCountMismatch { reference, candidate },
    NotImplementedYet { what },    // points at D2.2 scope
}
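For readers who want to see how callers might consume these variants, the sketch below fills in payload types and adds Display and std::error::Error implementations. The payload types (PathBuf, usize, &'static str), the message wording, and the is_deferred_to_later_phase helper are assumptions; the PR shows only variant and field names.

```rust
use std::fmt;
use std::path::PathBuf;

/// Variant and field names from the PR; payload types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenAgreementError {
    ModelPathMissing { path: PathBuf },
    EmptyPromptSet,
    TokenCountMismatch { reference: usize, candidate: usize },
    NotImplementedYet { what: &'static str },
}

impl fmt::Display for TokenAgreementError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelPathMissing { path } => {
                write!(f, "model path missing: {}", path.display())
            }
            Self::EmptyPromptSet => write!(f, "prompt set is empty"),
            Self::TokenCountMismatch { reference, candidate } => write!(
                f,
                "token count mismatch: reference has {} positions, candidate has {}",
                reference, candidate
            ),
            Self::NotImplementedYet { what } => write!(f, "not implemented yet: {}", what),
        }
    }
}

impl std::error::Error for TokenAgreementError {}

/// Hypothetical caller-side helper: separates "scope not shipped yet"
/// from genuine input errors.
pub fn is_deferred_to_later_phase(err: &TokenAgreementError) -> bool {
    matches!(err, TokenAgreementError::NotImplementedYet { .. })
}
```

Because each failure is its own variant, a caller can treat NotImplementedYet as a scheduling signal and still surface the other three as real problems.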

Tests (13 new)

Critical coverage:

  • topk_compare_identical_streams_is_perfect — full cert gate pass
  • topk_top5_matches_when_top1_misses_but_in_top5 — top-5 logic verified on ref[0] = 7 appearing at position 3 in candidate top-5
  • topk_aggregate_sums_counters_and_offsets_divergence — prompt 2's divergence at position 4 becomes aggregate position 14 after prompt 1's 10 positions
  • cert_gate_passes_at_exact_thresholds — 990/1000 = 0.99 AND 999/1000 = 0.999 (exact boundary pass)
  • cert_gate_fails_when_top1_below_threshold_even_if_top5_passes — AND-gate semantics
  • cert_gate_fails_when_top5_below_threshold_even_if_top1_passes — AND-gate semantics
  • harness_measure_stub_returns_machine_checkable_stub_flag — enforces stub == true, backend == "stub", zero rates + latencies
  • harness_measure_full_returns_not_implemented_pointing_at_d22 — D2.2 scope pointer preserved
  • harness_measure_stub_rejects_zero_n_tokens → EmptyPromptSet typed error
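The boundary and AND-gate cases in the list above can be worked through with a small sketch. GateCounters is a hypothetical stand-in holding only the three counters, and returning 0.0 for an empty measurement is an assumption; the thresholds (top-1 ≥ 0.99 and top-5 ≥ 0.999) are the ones stated in this PR.

```rust
/// Hypothetical stand-in for the counters that feed the cert gate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GateCounters {
    pub top1_matches: u32,
    pub top5_matches: u32,
    pub total_positions: u32,
}

impl GateCounters {
    /// Fraction of positions where top-1 agreed.
    /// Assumption: an empty measurement reports 0.0 rather than NaN.
    pub fn top1_rate(&self) -> f32 {
        if self.total_positions == 0 {
            0.0
        } else {
            self.top1_matches as f32 / self.total_positions as f32
        }
    }

    /// Fraction of positions where the reference top-1 was in the candidate top-5.
    pub fn top5_rate(&self) -> f32 {
        if self.total_positions == 0 {
            0.0
        } else {
            self.top5_matches as f32 / self.total_positions as f32
        }
    }

    /// Both thresholds must hold; passing one does not compensate for the other.
    pub fn meets_cert_gate(&self) -> bool {
        self.top1_rate() >= 0.99 && self.top5_rate() >= 0.999
    }
}
```

With 1000 positions, 990 top-1 matches and 999 top-5 matches sit exactly on both thresholds and pass. Dropping to 989 top-1 matches fails the gate even with a perfect top-5 count, and 998 top-5 matches fails it even with a perfect top-1 count.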

Phase state after merge

| Phase | Status |
|---|---|
| Phase 0 (Wire surface) | ✅ Complete (D0.1–D0.7 all shipped) |
| Phase 1 scaffold | ✅ D1.1 / D1.2 / D1.3 shipped |
| Phase 1 D1.1b (Cranelift wiring) | ⏳ Queued |
| Phase 2 | D2.1 this PR · D2.2 + D2.3 queued |
| Phase 3-5 | ⏳ Queued |

Rules honored

  • Rule D: measurement set configured via WireTokenAgreement DTO (D0.2 surface)
  • Rule E: TopKAgreement exposes methods (top1_rate, meets_cert_gate, aggregate), not raw field access
  • Rule F: no serialisation between stages; per-prompt Vec<Vec<u32>> token streams are owned Rust values

Board hygiene

  • STATUS_BOARD.md: D2.1 Queued → In PR

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

First Phase 2 deliverable — scaffold of the I11 cert gate harness.
The PR #219/#220 lesson landed as a typed-rejection wall: the
stub result carries stub:true + backend:"stub" so no client can
confuse Phase 0 stub output for a real measurement.

crates/cognitive-shader-driver/src/token_agreement.rs (~320 LOC):

  ReferenceModel { path, path_hash, stub_token_count }
    ::load(&Path) -> Result<Self, TokenAgreementError>
      D2.1 stub: validates path exists, hashes display; does NOT
      parse safetensors yet. D2.2 replaces with real loader driven
      by auto_detect::detect() → ModelFingerprint.
    ::stub(tag, n_tokens) — builds stub model without touching fs

  TokenAgreementError:
    ModelPathMissing { path }
    EmptyPromptSet
    TokenCountMismatch { reference, candidate }
    NotImplementedYet { what }  ← measure_full() until D2.2

  TopKAgreement { top1_matches, top5_matches, total_positions,
                  divergence_positions: Vec<u32> }
    ::compare(ref: &[Vec<u32>], cand: &[Vec<u32>]) -> Result<Self>
      Position-by-position: top1 = r[0] == c[0]; top5 = r[0] in c[..5].
      Records divergence positions for failure-mode analysis
      (late-sequence drift vs random errors).
    ::top1_rate() / top5_rate() -> f32
    ::meets_cert_gate() -> bool  (top1 ≥ 0.99 AND top5 ≥ 0.999)
    ::aggregate(per_prompt) — sums counters; concatenates
      divergence with per-prompt offset so failures stay localised

  TokenAgreementHarness:
    ::new(reference, baseline, candidate, n_tokens)
    ::measure_stub() -> WireTokenAgreementResult { stub:true, .. }
    ::measure_full() -> NotImplementedYet (D2.2 scope)

Tests (13 new):
  - reference_model_stub_builds_without_filesystem
  - reference_model_load_missing_path_yields_typed_error
  - topk_compare_identical_streams_is_perfect (full cert gate pass)
  - topk_compare_all_different_fails_cert_gate
  - topk_top5_matches_when_top1_misses_but_in_top5
    (ref top-1 = 7; cand has 7 at position 3 in top-5 → top5 counts)
  - topk_mismatched_stream_lengths_yield_typed_error
  - topk_aggregate_sums_counters_and_offsets_divergence
    (prompt 2's divergence at pos 4 → aggregate pos 14 after prompt 1's 10)
  - cert_gate_passes_at_exact_thresholds
    (990/1000 = 0.99, 999/1000 = 0.999 — both boundaries hit)
  - cert_gate_fails_when_top1_below_threshold_even_if_top5_passes
  - cert_gate_fails_when_top5_below_threshold_even_if_top1_passes
  - harness_measure_stub_returns_machine_checkable_stub_flag
    (stub:true enforced; backend="stub"; all rates 0.0; zero latencies)
  - harness_measure_full_returns_not_implemented_pointing_at_d22
  - harness_measure_stub_rejects_zero_n_tokens

Board hygiene (CLAUDE.md Mandatory rule):
  STATUS_BOARD.md D2.1 Queued → In PR

Phase state:
  Phase 0 ✅ complete (D0.1-D0.7 all shipped)
  Phase 1 scaffold ✅ (D1.1, D1.2, D1.3 shipped; D1.1b queued)
  Phase 2 ⏳ D2.1 (this PR), D2.2 + D2.3 queued

Rules honored:
  Rule D — Measurement set comes from Wire DTOs (D0.2 WireTokenAgreement)
  Rule E — TopKAgreement exposes object-methods (top1_rate, meets_cert_gate)
  Rule F — No serialization between stages; per-prompt Vec<Vec<u32>>
           token streams are plain Rust owned; the serde happens at
           D2.3 handler entry / exit only

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
@AdaWorldAPI AdaWorldAPI merged commit 3ee739a into main Apr 21, 2026
0 of 6 checks passed

2 participants