AdaWorldAPI · AdaWorldAPI · May 16, 2026 · May 16, 2026 · May 16, 2026 · May 16, 2026
diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md
@@ -1,3 +1,56 @@
+## [Fleet sprint-13-w-i1-salvage] [IN PR] D-CSV-13b i4 batch SIMD dispatch (branch claude/sprint-13-w-i1-salvage)
+
+**D-id:** D-CSV-13b — SIMD vectorization of i4 MUL evaluation. AVX-512F+BW path (8 elements/iter), NEON path (2 elements/iter), scalar fallback. Runtime dispatch via cached `simd_caps()` (`AtomicU8`); zero ndarray dep preserves contract-crate zero-dep posture.
+
+**Worker:** W-I1 retry worker (Opus, salvage continuation). Previous W-I1 burned 134 tool uses without committing; ~979 LOC of impl recovered to the salvage branch (commit `cdc84ec`) for this run to finish.
+
+**Files modified:**
+- `crates/lance-graph-contract/src/mul.rs` (+210 LOC net, ~3 surgical fixes):
+  (a) `#[repr(u8)]` with explicit discriminants on `DkPosition`/`TrustTexture`/`FlowState` per spec §5 (the salvaged SIMD impl already byte-wrote into these slices via `extract_8_lane0_bytes` — without `#[repr(u8)]` the byte writes were UB-prone);
+  (b) FIX `extract_dim_i8` to sign-extend across the full i64 lane via `_mm512_slli_epi64::<60>` + `_mm512_srai_epi64::<60>` — salvage only sign-extended within i16 sub-lanes, so every `_mm512_cmp*_epi64_mask` against a negative threshold (e.g. coherence ≤ -3) silently returned all-false, collapsing the priority chains; this is what made the pre-existing batch tests fail on the salvage branch;
+  (c) switch flow_state's `flow_proxy` arithmetic from `_mm512_adds/subs_epi16` (wrong granularity given the i64 inputs) to `_mm512_add/sub_epi64` (exact for the i4 input range -23..=+22);
+  (d) promote `mod scalar_impl` from `pub(crate)` to `#[doc(hidden)] pub` so `benches/i4_batch.rs` can baseline SIMD against scalar without going through the dispatch wrapper;
+  (e) `#[allow(dead_code)]` on `SimdCapsShim` (each field is read only on its matching `#[cfg(target_arch)]` branch — fixes the lingering warning per the retry brief);
+  (f) add 5 new randomised SIMD-vs-scalar parity tests (xorshift64 fixed seed, zero-dep) over 10 sizes [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024] covering: empty / size-1 / sub-MIN_BATCH-AVX / exact MIN_BATCH-1 / exact MIN_BATCH=8 / MIN_BATCH+1 / 2×MIN-1 / 2×MIN / large / very-large.
+- `crates/lance-graph-contract/Cargo.toml`: criterion 0.5 dev-dep (matches `lance-graph-benches`) + `[[bench]] name="i4_batch" harness=false`.
+
+**Tests:** 449 lance-graph-contract tests green — 429 lib + 8 + 7 + 4 + 1 doctest. Includes:
+- 5 new `test_*_batch_parity_simd_vs_scalar` (10 sizes each × 5 fns).
+- 5 pre-existing `test_*_batch_matches_scalar` (silently FAILING on the salvage branch before fix (b)).
+- Pre-existing `test_batch_empty_input_returns_empty_output` covers size 0 on all 5 fns.
+
+**Benchmarks (Intel Xeon @ 2.10GHz, AVX-512F+BW+VBMI2 host, `cargo bench --quick --measurement-time 1`, batch=1024):**
+- `dk_position_batch`: 2.68 µs scalar / 0.31 µs dispatch = **8.7×** (SHIP gate ≥4× ✓)
+- `trust_texture_batch`: 2.28 µs / 0.31 µs = **7.4×** (SHIP ✓)
+- `flow_state_batch`: 2.44 µs / 0.47 µs = **5.2×** (SHIP ✓)
+- `gate_decision_disc_batch`: 15.25 µs / 1.49 µs = **10.2×** (SHIP ✓)
+- `mul_assess_batch`: 17.78 µs / 5.76 µs = **3.1×** (spec target ≥2.5× because the scalar f64 finalize stage bounds the speedup ✓)
+
+All SHIP gates met on this host. NEON path is correctness-only per spec §7 (cannot validate on x86_64); shape mirrors AVX-512 with `vqtbl1q_u8` table lookup + `vbslq_s8` blend.
+
+**Iron-rule citations:**
+- **I-LEGACY-API-FEATURE-GATED** (CLAUDE.md, spec §5) — explicit `#[repr(u8)] = N` discriminants + safety doc-comments lock the SIMD-byte-write contract. Reviewers must check the LUTs in `avx512_impl` and `neon_impl` whenever these enum layouts change.
+- **I-NOISE-FLOOR-JIRAK** (CLAUDE.md, spec §7) — speedups reported as point estimates with criterion CIs; no claims of statistical significance beyond that.
+
+**AP1-AP8 self-scan:**
+- AP1 (silent layout drift across feature gates) — addressed via explicit `#[repr(u8)] = N` + parity tests at 10 sizes × 5 fns; SIMD output is byte-identical to scalar.
+- AP2 (panic-prone unchecked indexing) — all SIMD inner fns iterate `while i + N <= n` with scalar tail.
+- AP3 (UB through transmute) — enum byte-writes are now safe with `#[repr(u8)]`; `transmute(disc_byte)` in `mul_assess_batch` is bounded by SIMD-produced ranges 0..=3.
+- AP4 (atomic ordering bugs) — `CAPS_CACHE: AtomicU8` uses `Ordering::Relaxed`, correct for cache-singleton init (re-probe is idempotent).
+- AP5 (missing `#[target_feature]`) — all SIMD inner fns carry `#[target_feature(enable = "avx512f,avx512bw")]` or `enable = "neon"`.
+- AP6 (incorrect SIMD dispatch fallback) — dispatch falls through to scalar when caps absent OR when `len() < MIN_BATCH`; scalar_impl is the correctness anchor.
+- AP7 (under-tested edge cases) — covered: 0, 1, sub-MIN, MIN, MIN+1, 2×MIN-1, 2×MIN, large.
+- AP8 (silent NEON divergence) — NEON path is structurally parallel to AVX-512 (`vqtbl1q_u8` + `vbslq_s8`); cross-arch parity test deferred (no aarch64 host this session).
+
+**Validation gaps disclosed:**
+- NEON path compiled but not executed (no aarch64 host); spec §6 cross-arch parity test W-SIMD-VERIFY-1 deferred. Tracked as TD-D-CSV-13b-NEON-VERIFY-1.
+- `cargo bench` ran end-to-end and SHIP gates met on the Skylake-class AVX-512 host; spec §8 R-2 multi-microarch validation (Sapphire Rapids + Zen 4 + Tiger Lake) also deferred. Tracked as TD-D-CSV-13b-MULTI-MICROARCH-1.
+- No linker bus error encountered this run.
+
+**Outcome:** D-CSV-13b ready for merge as sprint-13 W-I1.
+
+---
+
 ## [Fleet sprint-11-wave-c-qualia-i4-column] [IN PR] D-CSV-5a sibling QualiaI4Column add (branch claude/sprint-11-wave-c-qualia-i4-column)
 
 **D-id:** D-CSV-5a — QualiaColumn migration phase 5a (split from D-CSV-5 per OQ-CSV-4 sibling-cutover ratification). Adds `QualiaI4Column` ALONGSIDE the existing `QualiaColumn` with double-write on push paths; no read-side change. Phase 5b (separate PR after merge) flips readers + drops the f32 column.

diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md
@@ -152,7 +152,7 @@ Types live in `crates/cognitive-shader-driver/src/wire.rs` behind `--features se
 
 **Queued Work — sprint-13 (specs being drafted in the sprint-13-preflight fleet on this branch):**
 
-- **D-CSV-13b** — SIMD vectorization of D-CSV-8 i4 MUL evaluation (AVX-512 + NEON intrinsics; ~150-300 LOC per ISA; 4-8× throughput gain over PR #387 scalar path). Spec being drafted by PP-6.
+- **D-CSV-13b** — SIMD vectorization of D-CSV-8 i4 MUL evaluation. **IN PR (sprint-13/W-I1 salvage)** on branch `claude/sprint-13-w-i1-salvage`. AVX-512F+BW path runtime-dispatched via cached `simd_caps()` (zero ndarray dep); NEON path correctness-only per spec §7; scalar fallback. Bench on Skylake-AVX512 host: 8.7× dk / 7.4× trust / 5.2× flow / 10.2× gate_disc / 3.1× mul_assess at batch 1024 — all SHIP gates met. `#[repr(u8)]` discriminants locked on `DkPosition`/`TrustTexture`/`FlowState` per spec §5 (I-LEGACY-API-FEATURE-GATED). 449 lance-graph-contract tests green including 5 new SIMD-vs-scalar parity tests over 10 sizes.
 - **D-CSV-14** — on-Think method migration for D-CSV-12 splat ops (struct-method surface per L-20 lock; depends on D-CSV-11 ndarray streaming PR #147). Spec being drafted by PP-4.
 - **D-CSV-16** — NEW sprint-13 entry. Spec being drafted by PP-5.
 - **D-CSV-17** — NEW sprint-13 entry. Spec being drafted by PP-3.

diff --git a/.claude/board/PR_ARC_INVENTORY.md b/.claude/board/PR_ARC_INVENTORY.md
@@ -35,6 +35,46 @@
 
 ---
 
+## sprint-13/W-I1 — impl(sprint-13): D-CSV-13b i4 batch SIMD dispatch + tests (in PR)
+
+**Status:** In PR (branch `claude/sprint-13-w-i1-salvage`, HEAD `c9c1c79`, awaiting user merge). 4 commits on the branch: `cdc84ec` salvage W-I1 i4_eval::batch impl + criterion scaffold (recovered from cleaned worktree) → `a356e64` SIMD-vs-scalar parity tests + repr(u8) enum invariant (5 new randomised tests over 10 sizes, criterion 0.5 dev-dep, dead-code warning fix) → `d8d1437` AVX-512 dim-extract sign-extend fix (the bug that made the salvage path silently produce wrong bytes on negative thresholds) → `c9c1c79` `scalar_impl` made `#[doc(hidden)] pub` for bench access.
+
+**Confidence (2026-05-16):** salvage-and-finish run. Previous W-I1 (Sonnet) burned 134 tool uses without staging a commit; harness auto-cleaned the worktree, ~979 LOC of partial impl was recovered to the salvage branch by orchestration. This retry (Opus) commit 1 of 4 landed within 7 tool uses per the brief's "commit early, commit often" hard rule. AVX-512F+BW path is now correct (verified against scalar over 10 batch sizes × 5 fns); NEON path compiles but is correctness-only per spec §7 (no aarch64 host this session). Bench at batch 1024: 8.7×/7.4×/5.2×/10.2×/3.1× — all SHIP gates met on the Skylake-AVX512 host.
+
+### Added
+
+- `crates/lance-graph-contract/src/mul.rs::i4_eval::batch` — the SIMD dispatch module: `dk_position_batch`, `trust_texture_batch`, `flow_state_batch`, `gate_decision_disc_batch`, `gate_decision_batch` (full GateDecision; scalar-only carve-out due to `String` payloads), `mul_assess_batch`, `mul_assess_vec`. Runtime dispatch via cached `simd_caps()` (`AtomicU8` packed bits, `Ordering::Relaxed`). AVX-512F+BW intrinsics path (8 elements/iter) under `#[cfg(target_arch = "x86_64")]`. NEON intrinsics path (2 elements/iter) under `#[cfg(target_arch = "aarch64")]`. `pub(crate) #[doc(hidden)] pub mod scalar_impl` as the correctness anchor + bench baseline.
+- `crates/lance-graph-contract/src/mul.rs::GateDecision::to_disc()` — `u8` discriminant (0=Flow, 1=Hold, 2=Block) for SIMD-packable gate output.
+- 5 new randomised SIMD-vs-scalar parity tests in `mul::i4_eval::tests` covering all 5 batch fns at 10 sizes [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024] (xorshift64 fixed seed, zero-dep).
+- `crates/lance-graph-contract/benches/i4_batch.rs` — criterion bench scaffold sweeping batch sizes [8, 64, 1024, 16384] for all 5 batch fns (dispatch vs scalar baseline).
+- `crates/lance-graph-contract/Cargo.toml` — `criterion = "0.5"` dev-dep matching `lance-graph-benches`; `[[bench]] name="i4_batch" harness=false`.
+
+### Locked
+
+- **Enum layout invariant (D-CSV-13b, spec §5; I-LEGACY-API-FEATURE-GATED):** `DkPosition`, `TrustTexture`, `FlowState` are `#[repr(u8)]` with explicit discriminants. The SIMD impl byte-writes into `&mut [Enum]` slices via `extract_8_lane0_bytes` — reordering or removing these discriminants WILL silently corrupt SIMD output. Discriminants locked: `DkPosition { MountStupid=0, ValleyOfDespair=1, SlopeOfEnlightenment=2, Plateau=3 }`; `TrustTexture { Calibrated=0, Overconfident=1, Uncertain=2, Underconfident=3 }`; `FlowState { Flow=0, Boredom=1, Transition=2, Anxiety=3 }`. Doc-comments on each enum cite the SIMD-byte-write contract and the LUT locations in `avx512_impl`/`neon_impl` that reviewers must check on any future layout change.
+- **GateDecision discriminant mapping (spec §5):** `GateDecision::to_disc()` returns `0=Flow, 1=Hold, 2=Block`; this is the byte mapping written by `gate_decision_disc_batch`. `GateDecision` itself cannot be `#[repr(u8)]` due to its `String` payloads — `gate_decision_batch` materializes the full enum via the scalar path.
+- **Runtime SIMD dispatch (OQ-CSV-13, spec §4):** dispatch happens via cached `simd_caps()` inside `lance-graph-contract`, NOT via an ndarray dev-dep. Preserves the contract crate's zero-dep posture. The shim is ~50 LOC and uses `is_x86_feature_detected!` / `is_aarch64_feature_detected!`.
+- **MIN_BATCH guards:** AVX-512 needs `len >= 8`; NEON needs `len >= 2`. Below those thresholds the dispatch falls through to scalar.
+
+### Deferred
+
+- **NEON cross-arch parity verification (spec §6, W-SIMD-VERIFY-1):** no aarch64 host this session; the NEON path compiled but byte-equivalence to scalar was not executed. Tracked as TD-D-CSV-13b-NEON-VERIFY-1.
+- **Multi-microarch AVX-512 perf validation (spec §8 R-2):** bench results came from a single Skylake-class Xeon. Sapphire Rapids + Zen 4 + Tiger Lake validation deferred. Tracked as TD-D-CSV-13b-MULTI-MICROARCH-1.
+- **AVX-2-only fast path (spec §1):** out of scope per spec; AVX-2 hardware falls through to scalar. Tracked as TD-SIMD-I4-AVX2-1.
+- **WASM SIMD128 + VBMI2 compressstore (spec §1, §8 R-6):** sprint-14+. The current AVX-512 path uses a scalar byte-extract from a 64-byte stack buffer instead of `_mm512_mask_compressstoreu_epi8` to preserve Skylake-X / Cascade Lake portability (no VBMI2 requirement). Tracked as TD-D-CSV-13b-VBMI2-1.
+
+### Docs
+
+- Spec `.claude/specs/pr-sprint-13-simd-i4.md` — the 982-LOC planning document covering AP1-AP8 anti-pattern catalogue, §3 per-function SIMD pseudocode, §5 semantic-equivalence iron rule, §6 test plan, §7 benchmark plan with Jirak rate citation, §8 R-1..R-10 risk matrix.
+- Doc-comments on `DkPosition`/`TrustTexture`/`FlowState` cite the D-CSV-13b layout invariant and point reviewers at the SIMD LUTs.
+- `GateDecision::to_disc()` rustdoc documents the locked byte mapping.
+
+### Confidence (2026-05-16)
+
+Salvage retry succeeded. The critical bug in the salvaged AVX-512 impl (i64-grained comparisons against negative thresholds always returning false because `extract_dim_i8` only sign-extended within i16 sub-lanes) was diagnosed and fixed surgically. All 449 lance-graph-contract tests green; SHIP gates met on this host. The pre-existing batch tests that were silently passing because they didn't reach the bug + the new randomised parity tests that DO reach it together close the I-LEGACY-API-FEATURE-GATED audit per spec §5.
+
+---
+
 ## #390 — impl(sprint-12/wave-G): D-CSV-5b cutover + D-CSV-6b WitnessCorpus index + D-CSV-13 batch + D-CSV-15 Jirak math (in PR)
 
 **Status:** In PR (branch `claude/sprint-12-wave-g-fleet`, HEAD `bad0875`, awaiting user merge). 6 commits on the branch: `7d7b537` WIP snapshot → `03ce219` W-G3 + W-G5 + W-G6 + W-G1 partial → `291878f` W-G1 driver.rs + W-G2 refinement + W-G4 Σ10 → `67c2ca8` W-G1 cutover finalization + W-G4 Jirak math correction → `4d429e3` W-Meta-Opus honest review (grade A−) + CSI-15 rename → `bad0875` cargo fmt rustfmt 1.95 CI gate.

diff --git a/.claude/board/STATUS_BOARD.md b/.claude/board/STATUS_BOARD.md
@@ -463,7 +463,7 @@ Consolidates sprint-10 architectural decisions before context dilution.
 | D-id | Title | Status | PR / Evidence |
 |---|---|---|---|
 | D-CSV-13 | Batch i4 scalar MUL (paired with D-CSV-8 SIMD-readiness) | **Shipped** | PR #388 merge `77f2d26` (W-G3 batch i4 scalar) |
-| D-CSV-13b | SIMD vectorization of D-CSV-8 i4 MUL evaluation (AVX-512 + NEON intrinsics) | **Queued (PP-6 spec drafting)** | sprint-13 preflight; ~150-300 LOC per ISA |
+| D-CSV-13b | SIMD vectorization of D-CSV-8 i4 MUL evaluation (AVX-512 + NEON intrinsics) | **In PR (sprint-13/W-I1 salvage)** | branch `claude/sprint-13-w-i1-salvage`; AVX-512F+BW dispatch via `simd_caps()`; bench on Skylake-AVX512 host = 8.7× dk / 7.4× trust / 5.2× flow / 10.2× gate_disc / 3.1× mul_assess at batch 1024 — all SHIP gates met; 5 SIMD-vs-scalar parity tests over 10 sizes green |
 | D-CSV-14 | On-Think method migration for D-CSV-12 splat ops (struct-method surface per L-20) | **Queued (PP-4 spec drafting)** | sprint-13; depends on D-CSV-11 streaming substrate (shipped via ndarray #147) |
 | D-CSV-15 | Σ10 Jirak-derived threshold (TD-SIGMA-TIER-THRESHOLDS-1 resolution) | **In PR (#390 W-G4 Jirak threshold)** | sprint-12 Wave G partial; full VAMPE coupled-revival deferred sprint-13+ |
 | D-CSV-16 | NEW sprint-13 entry | **Queued (PP-5 spec drafting)** | sprint-13 preflight |

diff --git a/Cargo.lock b/Cargo.lock
diff --git a/crates/lance-graph-contract/Cargo.toml b/crates/lance-graph-contract/Cargo.toml
@@ -22,6 +22,13 @@ glob = "0.3"
 # Used in manifest_codegen tests to replicate validation logic in-process.
 serde_yaml = "0.9"
 serde = { version = "1", features = ["derive"] }
+# D-CSV-13b: criterion bench scaffold for i4_eval::batch SIMD dispatch.
+# Matches version used in crates/lance-graph-benches/Cargo.toml.
+criterion = { version = "0.5", default-features = false, features = ["html_reports"] }
+
+[[bench]]
+name = "i4_batch"
+harness = false
 
 [features]
 # A-unlock-stepdomain — `step_trajectory_hash` forward stub for the E4