impl(sprint-13/W-I1): D-CSV-13b — i4 batch SIMD dispatch + tests by AdaWorldAPI · Pull Request #398 · AdaWorldAPI/lance-graph

AdaWorldAPI · 2026-05-16T19:28:14Z

Summary

D-CSV-13b — SIMD vectorization of i4 MUL evaluation in lance_graph_contract::mul::i4_eval::batch. AVX-512F+BW path (8 elements/iter, x86_64), NEON path (2 elements/iter, aarch64; correctness-only this session), scalar fallback. Runtime dispatch via cached simd_caps() (AtomicU8, zero ndarray dep).
5 new randomised SIMD-vs-scalar parity tests (xorshift64 fixed seed) over 10 batch sizes [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024] for all 5 batch fns — closes the spec §5 I-LEGACY-API-FEATURE-GATED byte-identity audit on the AVX-512 path.
#[repr(u8)] enum layout invariant locked on DkPosition/TrustTexture/FlowState with explicit discriminants per spec §5; the SIMD impl writes raw bytes into these slices via extract_8_lane0_bytes and was UB-prone on the salvage branch.
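
The cached runtime dispatch described above can be sketched as follows. This is a minimal illustration, not the PR's code: the bit layout of the cache byte, the `detect` helper, the `use_simd` helper and the `MIN_BATCH` value are all assumptions; only the names `simd_caps`, `CAPS_CACHE`, `AtomicU8` and `Ordering::Relaxed` come from the description.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical bit layout for the cache byte (the PR does not show the real one).
const INIT: u8 = 0b1000_0000; // set once detection has run
const AVX512: u8 = 0b0000_0001; // avx512f + avx512bw both present
const NEON: u8 = 0b0000_0010;
const MIN_BATCH: usize = 8; // assumed threshold below which scalar is used

static CAPS_CACHE: AtomicU8 = AtomicU8::new(0);

// Probe the CPU once. Re-running this is harmless because the result
// for a given host never changes.
fn detect() -> u8 {
    let mut caps = INIT;
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx512f")
            && std::is_x86_feature_detected!("avx512bw")
        {
            caps |= AVX512;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            caps |= NEON;
        }
    }
    caps
}

/// Return the cached capability byte, detecting on first use.
/// Relaxed ordering suffices here: two racing threads compute the same
/// byte, so whichever store wins leaves the cache in the same state.
pub fn simd_caps() -> u8 {
    let cached = CAPS_CACHE.load(Ordering::Relaxed);
    if cached & INIT != 0 {
        return cached;
    }
    let caps = detect();
    CAPS_CACHE.store(caps, Ordering::Relaxed);
    caps
}

/// Decide whether a batch of `len` elements should take a SIMD path.
pub fn use_simd(len: usize) -> bool {
    simd_caps() & (AVX512 | NEON) != 0 && len >= MIN_BATCH
}
```

Calling `simd_caps()` twice returns the same byte, and `use_simd(0)` is always false, so short batches fall through to the scalar path regardless of host.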

Salvage context

Previous W-I1 worker (Sonnet) burned 134 tool uses without staging a commit; the harness auto-cleaned the worktree but ~979 LOC of partial impl was recovered to this branch (commit cdc84ec) for the retry. The retry's first commit landed within 7 tool uses per the brief's "commit early, commit often" hard rule.

The salvaged AVX-512 impl compiled but had a critical sign-extend bug: extract_dim_i8 only sign-extended within i16 sub-lanes, so every _mm512_cmp*_epi64_mask against a negative threshold (e.g. coherence ≤ -3) silently returned all-false — collapsing the priority chains. The pre-existing batch tests on the salvage branch were FAILING because of this. Fixed surgically: _mm512_slli_epi64::<60> + _mm512_srai_epi64::<60> now sign-extend across the full i64 lane.
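
The sign-extend bug and its fix can be reproduced per lane in scalar Rust. The sketch below is an illustration under assumptions: the function names are hypothetical, and each function models a single 64-bit lane whose low nibble holds the i4 value, rather than calling the AVX-512 intrinsics.

```rust
/// Fixed behaviour: shift the nibble's sign bit up to bit 63, then
/// arithmetic-shift back down, so the sign fills the whole i64 lane.
pub fn sign_extend_i4_full(lane: u64) -> i64 {
    ((lane << 60) as i64) >> 60
}

/// Buggy behaviour: sign-extend only inside the low 16 bits and leave
/// the upper 48 bits of the lane at zero.
pub fn sign_extend_i4_in_i16(lane: u64) -> i64 {
    let low = lane as u16;
    let ext = ((low << 12) as i16) >> 12; // correct when read as i16
    (ext as u16) as i64 // but zero-extended when read as i64
}

/// A negative-threshold check like the one in the priority chain.
pub fn below_threshold(value: i64) -> bool {
    value <= -3
}
```

For the nibble `0xD` (the i4 encoding of -3), `sign_extend_i4_full` yields -3 and passes `below_threshold`, while `sign_extend_i4_in_i16` yields 65533 and fails it, which is the all-false mask described above.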

Benchmarks (Intel Xeon @ 2.10GHz, AVX-512F+BW+VBMI2, `cargo bench --quick`, batch=1024)

| function | scalar | dispatch | speedup | SHIP gate (spec §10) |
|---|---|---|---|---|
| `dk_position_batch` | 2.68 µs | 0.31 µs | 8.7× | ≥4× ✓ |
| `trust_texture_batch` | 2.28 µs | 0.31 µs | 7.4× | ≥4× ✓ |
| `flow_state_batch` | 2.44 µs | 0.47 µs | 5.2× | ≥4× ✓ |
| `gate_decision_disc_batch` | 15.25 µs | 1.49 µs | 10.2× | ≥4× ✓ |
| `mul_assess_batch` | 17.78 µs | 5.76 µs | 3.1× | ≥2.5× ✓ (scalar f64 finalize bounds speedup per spec §7) |

All SHIP gates met on this host. Speedups reported as point estimates with criterion CIs (no statistical-significance claims per I-NOISE-FLOOR-JIRAK).

Iron-rule citations

I-LEGACY-API-FEATURE-GATED (CLAUDE.md, spec §5) — DkPosition/TrustTexture/FlowState are now #[repr(u8)] = N with explicit discriminants. Doc-comments on each enum cite the SIMD-byte-write contract and direct reviewers to the LUTs in avx512_impl/neon_impl on any future layout change.
I-NOISE-FLOOR-JIRAK (CLAUDE.md, spec §7) — speedups as point estimates with criterion CIs; no significance claims beyond that.

AP1-AP8 anti-pattern self-scan (per spec)

AP1 (silent layout drift) — closed via explicit #[repr(u8)] = N + parity tests at 10 sizes × 5 fns; SIMD output is byte-identical to scalar.
AP2 (panic-prone indexing) — all SIMD inner fns iterate while i + N <= n with scalar tail.
AP3 (UB transmute) — enum byte-writes are now safe with #[repr(u8)]; transmute(disc_byte) in mul_assess_batch is bounded by SIMD-produced ranges 0..=3.
AP4 (atomic ordering) — CAPS_CACHE: AtomicU8 uses Ordering::Relaxed (cache-singleton init is idempotent).
AP5 (missing #[target_feature]) — all SIMD inner fns carry #[target_feature(enable = "avx512f,avx512bw")] or enable = "neon".
AP6 (incorrect dispatch fallback) — dispatch falls through to scalar when caps absent OR len() < MIN_BATCH; scalar_impl is the correctness anchor.
AP7 (under-tested edge cases) — covered: 0, 1, sub-MIN, MIN, MIN+1, 2×MIN-1, 2×MIN, large.
AP8 (silent NEON divergence) — NEON path mirrors AVX-512 logic at 2 elements/iter; cross-arch parity test deferred (no aarch64 host this session) → TD-D-CSV-13b-NEON-VERIFY-1.
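
The AP2 loop shape and the AP1/AP7 parity check can be sketched together. This is a portable stand-in under assumptions: `classify`, `scalar_batch`, `chunked_batch`, `parity_holds` and the thresholds are hypothetical, and the "SIMD" body is an ordinary loop over 8 lanes. Only the `while i + N <= n` shape, the xorshift64 generator and the ten batch sizes come from the description.

```rust
/// xorshift64 (shift triple 13, 7, 17): deterministic and dependency-free.
/// The seed must be non-zero.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

const LANES: usize = 8;

// Hypothetical per-element decision with one negative threshold.
fn classify(v: i8) -> u8 {
    if v <= -3 {
        1
    } else if v >= 4 {
        3
    } else {
        2
    }
}

/// Scalar reference: the correctness anchor.
pub fn scalar_batch(input: &[i8], out: &mut [u8]) {
    for (o, &v) in out.iter_mut().zip(input) {
        *o = classify(v);
    }
}

/// Chunked path: full chunks of LANES elements, then a scalar tail.
pub fn chunked_batch(input: &[i8], out: &mut [u8]) {
    let n = input.len().min(out.len());
    let mut i = 0;
    while i + LANES <= n {
        // Stand-in for the vector body; every index is below n here.
        for lane in 0..LANES {
            out[i + lane] = classify(input[i + lane]);
        }
        i += LANES;
    }
    for j in i..n {
        out[j] = classify(input[j]);
    }
}

/// Compare both paths on random i4-range inputs at the ten batch sizes.
pub fn parity_holds(seed: u64) -> bool {
    let mut rng = XorShift64(seed);
    for &n in &[0usize, 1, 3, 7, 8, 9, 15, 16, 64, 1024] {
        let input: Vec<i8> = (0..n)
            .map(|_| ((rng.next_u64() & 0xF) as i8) - 8) // values in -8..=7
            .collect();
        let mut a = vec![0u8; n];
        let mut b = vec![0u8; n];
        scalar_batch(&input, &mut a);
        chunked_batch(&input, &mut b);
        if a != b {
            return false;
        }
    }
    true
}
```

With a fixed seed the inputs are identical on every run, so a parity failure is reproducible rather than flaky.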

Files touched

crates/lance-graph-contract/src/mul.rs (+210 LOC net) — surgical fixes to the salvaged impl + 5 new parity tests + #[repr(u8)] invariant.
crates/lance-graph-contract/Cargo.toml — criterion = "0.5" dev-dep + [[bench]] name="i4_batch" harness=false.
crates/lance-graph-contract/benches/i4_batch.rs — salvaged from cdc84ec; compiles and runs end-to-end after the impl fixes landed.
.claude/board/AGENT_LOG.md — prepended sprint-13-w-i1-salvage entry.
.claude/board/STATUS_BOARD.md — flipped D-CSV-13b row to "In PR".
.claude/board/LATEST_STATE.md — updated D-CSV-13b queued-work line.
.claude/board/PR_ARC_INVENTORY.md — prepended Added / Locked / Deferred / Docs / Confidence entry.

Validation status

cargo check -p lance-graph-contract — clean (one benign workspace warning about cognitive-shader-driver's duplicate bin target, unrelated to this PR).
cargo test -p lance-graph-contract — 449 tests passing (429 lib + 8 + 7 + 4 + 1 doctest), zero failures.
cargo bench -p lance-graph-contract --no-run — compiles cleanly.
cargo bench -p lance-graph-contract --bench i4_batch -- --quick --measurement-time 1 — runs end-to-end; SHIP gates met (table above).

Validation gaps disclosed

NEON cross-arch parity test (spec §6 W-SIMD-VERIFY-1): no aarch64 host this session. NEON path compiled, structure mirrors AVX-512 with vqtbl1q_u8 + vbslq_s8. Deferred → TD-D-CSV-13b-NEON-VERIFY-1.
Multi-microarch AVX-512 perf validation (spec §8 R-2): bench from a single Skylake-class Xeon; Sapphire Rapids + Zen 4 + Tiger Lake deferred → TD-D-CSV-13b-MULTI-MICROARCH-1.
No linker bus error encountered this session.

Test plan

cargo check -p lance-graph-contract
cargo test -p lance-graph-contract — 449 pass
cargo bench -p lance-graph-contract --no-run — compiles
cargo bench -p lance-graph-contract --bench i4_batch -- --quick — SHIP gates met
NEON parity test on aarch64 (deferred per gap disclosure)
Multi-microarch AVX-512 perf (deferred per gap disclosure)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

Generated by Claude Code

…affold

Recovered W-I1 working tree state that never reached git: the previous worker (134 tool uses, ~37 min) burned its quota mid-implementation and exited without staging or committing. The work was held only in:

- working-tree uncommitted edits to mul.rs (+799/-36 LOC)
- new untracked file benches/i4_batch.rs (180 LOC)
- stray ELF scratch binaries (check_avx, test_avx512*, test_srli2)
- a `doc/` rustdoc dump

The binaries and rustdoc were removed; the source is preserved here so the retry worker (post 15:30-UTC quota reset) can resume from this point instead of from zero.

What landed:

1. `GateDecision::to_disc(&self) -> u8` — SIMD-packable byte mapping (0=Flow, 1=Hold, 2=Block). The variant payloads (String reasons) prevent `#[repr(u8)]`; the manual discriminant lets the batch path stay branch-free.
2. `mul::i4_eval::batch` module — five batch entry points with runtime SIMD dispatch via `simd_caps()` (OQ-CSV-13). One binary runs on any host; AVX-512BW / NEON / scalar all coexist:
   - `dk_position_batch`
   - `trust_texture_batch`
   - `flow_state_batch`
   - `gate_decision_disc_batch` (u8 fast path)
   - `gate_decision_batch` (full GateDecision with reason strings, scalar-only — carve-out documented)
   - `mul_assess_batch`

   Each has an AVX-512 `#[cfg(target_arch = "x86_64")]` arm, an aarch64 NEON arm, and a `scalar_impl` fallback submodule with the same function names.
3. `benches/i4_batch.rs` — Criterion benchmark scaffold targeting the SHIP/LAND gates from the spec:
   - SHIP: ≥4× AVX-512 vs scalar for dk/trust/flow/gate_disc at 1024
   - LAND: ≥2× (records TD-D-CSV-13b-PERF-FLOOR-1 if 2≤x<4)
   - mul_assess target: ≥2.5× (limited by scalar f64 finalize)

   Sweeps batch sizes [8, 64, 1024, 16384] per fn.

Validation gap (the work the worker never got to):

- `cargo check -p lance-graph-contract` → CLEAN (one dead-code warning for `SimdCapsShim::neon` field, benign — retry worker can either use the field or drop it).
- `cargo test -p lance-graph-contract i4_eval::batch` → 0 tests; the worker did not write unit tests for the new batch fns. Tests must be added on retry against the scalar reference (i.e. assert dispatch output equals `scalar_impl` output element-wise for randomised input).
- `cargo bench` on benches/i4_batch.rs will NOT compile until `criterion` is added to `[dev-dependencies]` in lance-graph-contract/Cargo.toml. Intentionally left absent here — adding the dep belongs to the retry commit that also adds the unit tests.

Branch is not for merge as-is; it's a seed state for the retry worker.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

…u8) enum invariant

Adds randomised SIMD-vs-scalar parity tests with fixed seed (xorshift64, deterministic, zero-dep) covering all 5 batch fns at 10 sizes including edge cases (0, 1, 3, 7, 8, 9, 15, 16, 64, 1024). Each test exercises every decision branch by setting all 5 read dims (valence, tension, warmth, coherence, groundedness).

Locks DkPosition/TrustTexture/FlowState to #[repr(u8)] with explicit discriminants per spec §5 (I-LEGACY-API-FEATURE-GATED). The SIMD impl already byte-wrote into &mut [DkPosition] / [TrustTexture] / [FlowState] slices via extract_8_lane0_bytes; before this commit the underlying enum layout was default-repr so the byte writes were potentially undefined. Discriminants match the SIMD LUT assumptions:

- DkPosition: MountStupid=0, ValleyOfDespair=1, Slope=2, Plateau=3
- TrustTexture: Calibrated=0, Overconfident=1, Uncertain=2, Underconfident=3 (note: prior declaration order placed Uncertain=3 — corrected per spec)
- FlowState: Flow=0, Boredom=1, Transition=2, Anxiety=3 (note: prior declaration order placed Anxiety=0 — corrected per spec)

Also fixes the SimdCapsShim dead-code warning (each field is only read on its matching #[cfg(target_arch)] dispatch branch; tagged #[allow(dead_code)] on the struct).

Adds criterion 0.5 as a dev-dep (matches lance-graph-benches version) plus the [[bench]] harness=false declaration needed for benches/i4_batch.rs to build via `cargo bench --no-run`.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
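
A minimal sketch of the layout lock for one of the three enums. The variant names and discriminant values are the ones listed in the commit message; the derives, the compile-time size check and the `from_disc` helper are assumptions added for illustration and may not match the crate.

```rust
/// One byte per value, with discriminants fixed by declaration
/// rather than by variant order.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DkPosition {
    MountStupid = 0,
    ValleyOfDespair = 1,
    Slope = 2,
    Plateau = 3,
}

// Compile-time guard: the build fails if the enum stops being one byte.
const _: () = assert!(std::mem::size_of::<DkPosition>() == 1);

impl DkPosition {
    /// Checked conversion from a raw byte (hypothetical helper).
    /// Bytes outside 0..=3 are rejected instead of transmuted.
    pub fn from_disc(byte: u8) -> Option<Self> {
        match byte {
            0 => Some(Self::MountStupid),
            1 => Some(Self::ValleyOfDespair),
            2 => Some(Self::Slope),
            3 => Some(Self::Plateau),
            _ => None,
        }
    }
}
```

Writing a raw byte into a `DkPosition` slot is only sound when that byte is a declared discriminant, which is why reordering variants without updating the lookup tables would silently change results.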

…ull i64

The salvaged AVX-512 batch impls used _mm512_cmp*_epi64_mask comparisons against i64 thresholds, but extract_dim_i8 only sign-extended the i4 nibble within an i16 sub-lane. After the i16 srai, the upper 48 bits of each i64 lane stayed zero — so a negative i4 (e.g. -3 → 0xFD as i8) read back as i64 = 0x000000000000FFFD = +65533 to the i64 comparator. Negative-threshold checks like (coh <= -3) silently became (positive value <= -3), always false, which collapsed the priority chain (Valley/Anxiety/etc. branches never fired).

Fix extract_dim_i8 to sign-extend across the full i64 lane via _mm512_slli_epi64<60> + _mm512_srai_epi64<60>. The dim values now live as proper i64 signed values in -8..=+7, so the existing i64-grained comparisons work correctly.

Also switch flow_state_batch's flow_proxy arithmetic from _mm512_adds_epi16/_subs_epi16 (i16 saturating, wrong granularity given the i64 inputs) to _mm512_add_epi64/_sub_epi64 (i64, exact for the i4 input range -23..=+22 which can never overflow i64). The scalar's i8 clamp is never triggered for i4 inputs so the behaviours match.

After the fix all 449 lance-graph-contract tests pass, including the 5 new SIMD-vs-scalar parity tests over batch sizes [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024] and the pre-existing 5 *_batch_matches_scalar tests that were silently failing on the salvage branch.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

…#[doc(hidden)]

benches/i4_batch.rs needs to baseline SIMD dispatch against scalar_impl directly. Promote the module from pub(crate) to pub with #[doc(hidden)] so the crate's external API is unchanged at the rustdoc level but the bench scaffold can compile.

Bench results (cargo bench --quick, AVX-512 host, batch size 1024):

- dk_position_batch 8.7x (SHIP gate >=4x met)
- trust_texture_batch 7.4x
- flow_state_batch 5.2x
- gate_decision_disc_batch 10.2x
- mul_assess_batch 3.1x (>=2.5x target met; scalar f64 finalize bounds the speedup per spec section 7)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

…BOARD / LATEST_STATE / PR_ARC)

Per CLAUDE.md "Mandatory Board-Hygiene Rule":

- AGENT_LOG.md: PREPEND a sprint-13-w-i1-salvage entry covering files touched (210 LOC net in mul.rs + Cargo.toml dev-dep), tests (449 green incl. 5 new SIMD-vs-scalar parity tests over 10 sizes), benchmarks (8.7x/7.4x/5.2x/10.2x/3.1x at batch 1024 on Skylake-AVX512 host), iron-rule citations (I-LEGACY-API-FEATURE-GATED, I-NOISE-FLOOR-JIRAK), AP1-AP8 self-scan, validation gaps disclosed (NEON cross-arch deferred, multi-microarch deferred).
- STATUS_BOARD.md: flip D-CSV-13b row from "Queued (PP-6 spec drafting)" to "In PR (sprint-13/W-I1 salvage)" with bench summary.
- LATEST_STATE.md: replace the "queued, spec being drafted by PP-6" line with the in-PR status including the SHIP-gate-met bench numbers.
- PR_ARC_INVENTORY.md: PREPEND a new sprint-13/W-I1 entry covering Added (the i4_eval::batch module surface), Locked (the #[repr(u8)] enum layout invariant per spec section 5), Deferred (NEON cross-arch verification, multi-microarch perf, AVX-2 fast path, WASM SIMD128, VBMI2 compressstore), Docs (the spec + doc-comments), Confidence.

These updates land in a follow-up commit rather than the impl commit because the impl needed surgical fixes (the salvage AVX-512 path had a critical sign-extend bug); separating the commits keeps the bug-fix attribution clean. Future sessions should still aim for impl + board in the same commit.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

Orchestrator-only fix per autoattended-multiagent-pattern §5 Rule 3 (atomic consolidation pass).

The retry worker added `criterion` as a dev-dep in `crates/lance-graph-contract/Cargo.toml` (commit a356e64) and successfully ran `cargo bench` locally — which silently regenerated Cargo.lock — but never staged or committed Cargo.lock itself. PR #398 as pushed has Cargo.toml expecting `criterion` while Cargo.lock does not list it; any `cargo build --locked` (CI default) fails with "the lock file Cargo.lock needs to be updated".

Adds the single missing entry under `[[package]]` for `lance-graph-contract`:

```
 dependencies = [
+ "criterion",
  "glob",
  "serde",
  "serde_yaml",
```

No semantic changes to lance-graph-contract or any other crate. The criterion package itself was already pulled in as a transitive of other workspace members, so this commit adds only the dep reference, not a new crate version.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4df835fc30


chatgpt-codex-connector · 2026-05-16T19:29:57Z

+                    // SAFETY: QualiaI4_16D is repr(C, align(8)); &.0 is a valid *const u64.
+                    let q0 = vld1q_u64(&qualia[i].0 as *const u64);
+                    let q1 = vld1q_u64(&qualia[i + 1].0 as *const u64);


Avoid loading past the qualia slice in NEON batches

On aarch64, vld1q_u64(&qualia[i + 1].0 as *const u64) loads two u64 lanes starting at the last element of each 2-item chunk, so for inputs like qualia.len() == 2 it reads one QualiaI4_16D past the slice even though only lane 0 is later used. Because the public dispatcher enables this path for any len >= 2, dk_position_batch can hit out-of-bounds reads/UB on ARM; the same load pattern appears in the other NEON batch routines.
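
One way to stay inside the slice is to gather the two values into a local array before any 128-bit load. The sketch below is portable and makes several assumptions: `Qualia` is a stand-in for `QualiaI4_16D` (a single `u64` field, per the SAFETY comment in the diff), and `load_pair` and `wide_load_end` are hypothetical helpers, not functions from the PR.

```rust
/// Stand-in for the PR's QualiaI4_16D: one u64 of packed nibbles.
#[repr(C, align(8))]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Qualia(pub u64);

/// Gather elements i and i+1 into a 16-byte stack array using checked
/// indexing. On aarch64 the array's pointer could then be handed to a
/// 128-bit load without reading past the slice.
pub fn load_pair(qualia: &[Qualia], i: usize) -> Option<[u64; 2]> {
    let a = qualia.get(i)?;
    let b = qualia.get(i + 1)?;
    Some([a.0, b.0])
}

/// One past the last element index that a two-lane u64 load touches
/// when it starts at element `start`.
pub fn wide_load_end(start: usize) -> usize {
    start + 2
}
```

For a two-element slice, a wide load that starts at index 1 ends at index 3, which is past `len() == 2`; that is the read the review flags. `load_pair(&q, 0)` returns both values and `load_pair(&q, 1)` returns `None` instead of reading out of bounds.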


chatgpt-codex-connector · 2026-05-16T19:29:57Z

+                    let neg_man = _mm512_sub_epi64(zero, man_vec);
+                    let man_neg_mask = _mm512_cmplt_epi64_mask(man_vec, zero);
+                    let abs_man = _mm512_mask_blend_epi64(man_neg_mask, man_vec, neg_man);


Preserve scalar handling of i8::MIN mantissas

When signed_mantissa is i8::MIN, the scalar anchor uses signed_mantissa.unsigned_abs() as i8, which wraps to -128 and therefore falls into the abs_mantissa <= 1 ValleyOfDespair branch. This AVX-512 path negates after widening to i64, producing 128, so on AVX-512 hosts high-coherence rows with mantissa -128 are classified as Slope/Plateau instead of matching scalar output; that also propagates through mul_assess_batch.
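
The divergence at `i8::MIN` can be checked in scalar Rust. In this sketch the function names are hypothetical; `abs_scalar` mirrors the `unsigned_abs() as i8` expression quoted in the review, `abs_widened` mirrors negating after widening to i64, and `abs_widened_matching` is one possible way to reconcile the two (truncating back to i8), offered as an assumption rather than the fix the authors chose.

```rust
/// Scalar anchor behaviour: 128u8 cast to i8 wraps to -128.
pub fn abs_scalar(mantissa: i8) -> i8 {
    mantissa.unsigned_abs() as i8
}

/// Widen first, then negate: the true magnitude 128 is kept.
pub fn abs_widened(mantissa: i8) -> i64 {
    let wide = mantissa as i64;
    if wide < 0 {
        0 - wide
    } else {
        wide
    }
}

/// Truncate the widened magnitude back to i8 so it wraps like the scalar.
pub fn abs_widened_matching(mantissa: i8) -> i64 {
    (abs_widened(mantissa) as i8) as i64
}
```

`abs_scalar(i8::MIN)` is -128, which satisfies a `<= 1` test, while `abs_widened(i8::MIN)` is 128, which does not; every other i8 input gives the same magnitude on both paths.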


Adds the simd-savant agent card alongside the project-agnostic 4-savant taxonomy (PP-13 brutally-honest-tester / PP-14 convergence-architect / PP-15 baton-handoff-auditor / PP-16 preflight-drift-auditor). Its scope is the one workspace-specific SIMD invariant codified earlier in this session:

All SIMD must come from `ndarray::simd` via the polyfill — `simd.rs` + `simd_ops.rs` > `simd_{type}.rs` per-arch. Raw intrinsics outside `ndarray/src/simd_*.rs` are a violation.

The savant runs at three checkpoints (PRE-SPAWN / DURING-IMPL / PRE-MERGE) and owns 8 anti-patterns (AP-SIMD-1..8) covering raw intrinsics in consumer crates, hand-rolled feature detection, arch-specific cfg outside the polyfill, unchecked pointer loads, missing scalar fallback, and duplicated SIMD wrappers.

Hand-offs are explicit per autoattended-pattern §3 discipline:

- SIMD-induced UB / OOB → PP-13 (post-impl gate)
- Missing primitive → file `TD-NDARRAY-SIMD-<NAME>` and route to ndarray maintainer (do NOT approve inlining the raw intrinsic)
- Spec-vs-code drift → PP-16
- Cross-crate SIMD type aliasing → PP-15
- Compile error → PP-13

Files touched:

- `.claude/agents/simd-savant.md` (new) — the agent card.
- `.claude/agents/BOOT.md` — adds the Quality-lifecycle row for the simd-savant in the Knowledge Activation table (alongside the four PP-N rows).
- `.claude/knowledge/autoattended-multiagent-pattern.md` § 14 (lance-graph adapter section) — adds the workspace-specific note explaining why the 5th savant is an adapter rather than a §3 transferable slot (depends on having a polyfill repo to be the source-of-truth — not all projects do).

Documented trigger: Sprint-13 W-I1 PR #398. The salvaged D-CSV-13b impl inlined raw `_mm512_*` (x86_64) and `vld1q_u64` (aarch64) intrinsics directly in `crates/lance-graph-contract/src/mul.rs`, bypassing `ndarray::simd` entirely. Codex P1 (NEON OOB at len==2) is a direct consequence of AP-SIMD-5 (hand-rolled ptr-load with no bounds proof). This savant would have caught the violation PRE-SPAWN (in the worker brief) and PRE-MERGE (in the audit grep).

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

claude added 6 commits May 16, 2026 17:46

chatgpt-codex-connector Bot reviewed May 16, 2026

AdaWorldAPI merged commit 291c5cd into main May 16, 2026
5 checks passed

AdaWorldAPI merged 6 commits into main from claude/sprint-13-w-i1-salvage
