impl(sprint-13/W-I1): D-CSV-13b - i4 batch SIMD dispatch + tests #398

Merged
AdaWorldAPI merged 6 commits into main from claude/sprint-13-w-i1-salvage on May 16, 2026

@AdaWorldAPI
Owner

Summary

  • D-CSV-13b — SIMD vectorization of i4 MUL evaluation in lance_graph_contract::mul::i4_eval::batch. AVX-512F+BW path (8 elements/iter, x86_64), NEON path (2 elements/iter, aarch64; correctness-only this session), scalar fallback. Runtime dispatch via cached simd_caps() (AtomicU8, zero ndarray dep).
  • 5 new randomised SIMD-vs-scalar parity tests (xorshift64 fixed seed) over 10 batch sizes [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024] for all 5 batch fns — closes the spec §5 I-LEGACY-API-FEATURE-GATED byte-identity audit on the AVX-512 path.
  • #[repr(u8)] enum layout invariant locked on DkPosition/TrustTexture/FlowState with explicit discriminants per spec §5; the SIMD impl writes raw bytes into these slices via extract_8_lane0_bytes and was UB-prone on the salvage branch.

Salvage context

Previous W-I1 worker (Sonnet) burned 134 tool uses without staging a commit; the harness auto-cleaned the worktree but ~979 LOC of partial impl was recovered to this branch (commit cdc84ec) for the retry. The retry's first commit landed within 7 tool uses per the brief's "commit early, commit often" hard rule.

The salvaged AVX-512 impl compiled but had a critical sign-extend bug: extract_dim_i8 only sign-extended within i16 sub-lanes, so every _mm512_cmp*_epi64_mask against a negative threshold (e.g. coherence ≤ -3) silently returned all-false — collapsing the priority chains. The pre-existing batch tests on the salvage branch were FAILING because of this. Fixed surgically: _mm512_slli_epi64::<60> + _mm512_srai_epi64::<60> now sign-extend across the full i64 lane.
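
A minimal scalar sketch of the sign-extend issue described above, for reviewers who want to see the arithmetic. The function names and the `dim * 4` nibble layout are assumptions made for illustration and are not taken from `mul.rs`; the shift-left-60 / arithmetic-shift-right-60 pair is the scalar analogue of the `_mm512_slli_epi64::<60>` + `_mm512_srai_epi64::<60>` fix.

```rust
/// Extract the 4-bit field `dim` (0..16) from a packed u64 and sign-extend it
/// across the full 64-bit lane: move the nibble to the top four bits, then
/// arithmetic-shift it back down so the sign bit fills the lane.
/// The `dim * 4` layout is an assumption for this sketch.
fn extract_dim_i64(packed: u64, dim: u32) -> i64 {
    let nibble = (packed >> (dim * 4)) & 0xF;
    ((nibble << 60) as i64) >> 60
}

/// The failure mode: sign-extend only within a 16-bit sub-lane and leave the
/// upper 48 bits of the 64-bit lane zero.
fn extract_dim_i16_only(packed: u64, dim: u32) -> i64 {
    let nibble = ((packed >> (dim * 4)) & 0xF) as u16;
    let ext16 = ((nibble << 12) as i16) >> 12; // correct when read as an i16
    (ext16 as u16) as i64 // but zero-extended when read as an i64
}

/// A negative-threshold check of the kind used in the priority chains.
fn below_threshold(value: i64) -> bool {
    value <= -3
}
```

With the nibble `0xD` (that is, -3), `extract_dim_i64` yields -3 and the threshold check fires, while `extract_dim_i16_only` yields 65533 and the check is always false.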

Benchmarks (Intel Xeon @ 2.10GHz, AVX-512F+BW+VBMI2, cargo bench --quick, batch=1024)

| function | scalar | dispatch | speedup | SHIP gate (spec §10) |
|---|---|---|---|---|
| dk_position_batch | 2.68 µs | 0.31 µs | 8.7× | ≥4× ✓ |
| trust_texture_batch | 2.28 µs | 0.31 µs | 7.4× | ≥4× ✓ |
| flow_state_batch | 2.44 µs | 0.47 µs | 5.2× | ≥4× ✓ |
| gate_decision_disc_batch | 15.25 µs | 1.49 µs | 10.2× | ≥4× ✓ |
| mul_assess_batch | 17.78 µs | 5.76 µs | 3.1× | ≥2.5× ✓ (scalar f64 finalize bounds speedup per spec §7) |

All SHIP gates met on this host. Speedups reported as point estimates with criterion CIs (no statistical-significance claims per I-NOISE-FLOOR-JIRAK).

Iron-rule citations

  • I-LEGACY-API-FEATURE-GATED (CLAUDE.md, spec §5) — DkPosition/TrustTexture/FlowState are now #[repr(u8)] = N with explicit discriminants. Doc-comments on each enum cite the SIMD-byte-write contract and direct reviewers to the LUTs in avx512_impl/neon_impl on any future layout change.
  • I-NOISE-FLOOR-JIRAK (CLAUDE.md, spec §7) — speedups as point estimates with criterion CIs; no significance claims beyond that.

AP1-AP8 anti-pattern self-scan (per spec)

  • AP1 (silent layout drift) — closed via explicit #[repr(u8)] = N + parity tests at 10 sizes × 5 fns; SIMD output is byte-identical to scalar.
  • AP2 (panic-prone indexing) — all SIMD inner fns iterate while i + N <= n with scalar tail.
  • AP3 (UB transmute) — enum byte-writes are now safe with #[repr(u8)]; transmute(disc_byte) in mul_assess_batch is bounded by SIMD-produced ranges 0..=3.
  • AP4 (atomic ordering) — CAPS_CACHE: AtomicU8 uses Ordering::Relaxed (cache-singleton init is idempotent).
  • AP5 (missing #[target_feature]) — all SIMD inner fns carry #[target_feature(enable = "avx512f,avx512bw")] or enable = "neon".
  • AP6 (incorrect dispatch fallback) — dispatch falls through to scalar when caps absent OR len() < MIN_BATCH; scalar_impl is the correctness anchor.
  • AP7 (under-tested edge cases) — covered: 0, 1, sub-MIN, MIN, MIN+1, 2×MIN-1, 2×MIN, large.
  • AP8 (silent NEON divergence) — NEON path mirrors AVX-512 logic at 2 elements/iter; cross-arch parity test deferred (no aarch64 host this session) → TD-D-CSV-13b-NEON-VERIFY-1.
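
The dispatch shape referenced in AP2, AP4 and AP6 can be sketched roughly as follows. This is an illustration under assumptions, not the code in `mul.rs`: the `classify_*` functions, the capability constants and the placeholder kernel body are hypothetical, and the "vector" loop here simply calls the scalar function so that the sketch runs on any host.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const UNINIT: u8 = 0;
const CAP_SCALAR: u8 = 1;
#[allow(dead_code)]
const CAP_AVX512: u8 = 2;
#[allow(dead_code)]
const CAP_NEON: u8 = 3;
const MIN_BATCH: usize = 8;

static CAPS_CACHE: AtomicU8 = AtomicU8::new(UNINIT);

fn detect_caps() -> u8 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512bw") {
            return CAP_AVX512;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return CAP_NEON;
        }
    }
    CAP_SCALAR
}

/// Cached capability probe. Relaxed ordering suffices because every thread
/// that races on initialisation computes and stores the same value.
fn simd_caps() -> u8 {
    let cached = CAPS_CACHE.load(Ordering::Relaxed);
    if cached != UNINIT {
        return cached;
    }
    let caps = detect_caps();
    CAPS_CACHE.store(caps, Ordering::Relaxed);
    caps
}

/// Scalar correctness anchor (hypothetical classifier).
fn classify_scalar(x: i8) -> u8 {
    if x <= -3 { 1 } else { 0 }
}

/// Chunked kernel: full chunks while `i + MIN_BATCH <= n`, then a scalar tail.
fn classify_chunked(input: &[i8], out: &mut [u8]) {
    let n = input.len().min(out.len());
    let mut i = 0;
    while i + MIN_BATCH <= n {
        // Placeholder: a real kernel would handle these lanes with vector ops.
        for j in 0..MIN_BATCH {
            out[i + j] = classify_scalar(input[i + j]);
        }
        i += MIN_BATCH;
    }
    while i < n {
        out[i] = classify_scalar(input[i]);
        i += 1;
    }
}

/// Dispatcher: scalar when no SIMD caps are present or the batch is too small.
fn classify_batch(input: &[i8], out: &mut [u8]) {
    let n = input.len().min(out.len());
    if simd_caps() == CAP_SCALAR || n < MIN_BATCH {
        for i in 0..n {
            out[i] = classify_scalar(input[i]);
        }
    } else {
        classify_chunked(input, out);
    }
}
```

The loop bound `i + MIN_BATCH <= n` is what keeps every chunk access inside the slice; the tail loop picks up the remaining `n % MIN_BATCH` elements.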

Files touched

  • crates/lance-graph-contract/src/mul.rs (+210 LOC net) — surgical fixes to the salvaged impl + 5 new parity tests + #[repr(u8)] invariant.
  • crates/lance-graph-contract/Cargo.toml: criterion = "0.5" dev-dep + [[bench]] name="i4_batch" harness=false.
  • crates/lance-graph-contract/benches/i4_batch.rs — salvaged from cdc84ec; compiles and runs end-to-end after the impl fixes landed.
  • .claude/board/AGENT_LOG.md — prepended sprint-13-w-i1-salvage entry.
  • .claude/board/STATUS_BOARD.md — flipped D-CSV-13b row to "In PR".
  • .claude/board/LATEST_STATE.md — updated D-CSV-13b queued-work line.
  • .claude/board/PR_ARC_INVENTORY.md — prepended Added / Locked / Deferred / Docs / Confidence entry.

Validation status

  • cargo check -p lance-graph-contract — clean (one benign workspace warning about cognitive-shader-driver's duplicate bin target, unrelated to this PR).
  • cargo test -p lance-graph-contract: 449 tests passing (429 lib + 8 + 7 + 4 + 1 doctest), zero failures.
  • cargo bench -p lance-graph-contract --no-run — compiles cleanly.
  • cargo bench -p lance-graph-contract --bench i4_batch -- --quick --measurement-time 1 — runs end-to-end; SHIP gates met (table above).

Validation gaps disclosed

  • NEON cross-arch parity test (spec §6 W-SIMD-VERIFY-1): no aarch64 host this session. NEON path compiled, structure mirrors AVX-512 with vqtbl1q_u8 + vbslq_s8. Deferred → TD-D-CSV-13b-NEON-VERIFY-1.
  • Multi-microarch AVX-512 perf validation (spec §8 R-2): bench from a single Skylake-class Xeon; Sapphire Rapids + Zen 4 + Tiger Lake deferred → TD-D-CSV-13b-MULTI-MICROARCH-1.
  • No linker bus error encountered this session.

Test plan

  • cargo check -p lance-graph-contract
  • cargo test -p lance-graph-contract — 449 pass
  • cargo bench -p lance-graph-contract --no-run — compiles
  • cargo bench -p lance-graph-contract --bench i4_batch -- --quick — SHIP gates met
  • NEON parity test on aarch64 (deferred per gap disclosure)
  • Multi-microarch AVX-512 perf (deferred per gap disclosure)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS


Generated by Claude Code

claude added 6 commits May 16, 2026 17:46
…affold

Recovered W-I1 working tree state that never reached git: the previous
worker (134 tool uses, ~37 min) burned its quota mid-implementation and
exited without staging or committing. The work was held only in:
  - working-tree uncommitted edits to mul.rs (+799/-36 LOC)
  - new untracked file benches/i4_batch.rs (180 LOC)
  - stray ELF scratch binaries (check_avx, test_avx512*, test_srli2)
  - a `doc/` rustdoc dump

The binaries and rustdoc were removed; the source is preserved here so
the retry worker (post 15:30-UTC quota reset) can resume from this point
instead of from zero.

What landed:

1. `GateDecision::to_disc(&self) -> u8` — SIMD-packable byte mapping
   (0=Flow, 1=Hold, 2=Block). The variant payloads (String reasons)
   prevent `#[repr(u8)]`; the manual discriminant lets the batch path
   stay branch-free.

2. `mul::i4_eval::batch` module — five batch entry points with runtime
   SIMD dispatch via `simd_caps()` (OQ-CSV-13). One binary runs on any
   host; AVX-512BW / NEON / scalar all coexist:
     - `dk_position_batch`
     - `trust_texture_batch`
     - `flow_state_batch`
     - `gate_decision_disc_batch` (u8 fast path)
     - `gate_decision_batch` (full GateDecision with reason strings,
       scalar-only — carve-out documented)
     - `mul_assess_batch`
   Each has an AVX-512 `#[cfg(target_arch = "x86_64")]` arm, an aarch64
   NEON arm, and a `scalar_impl` fallback submodule with the same
   function names.

3. `benches/i4_batch.rs` — Criterion benchmark scaffold targeting the
   SHIP/LAND gates from the spec:
     - SHIP: ≥4× AVX-512 vs scalar for dk/trust/flow/gate_disc at 1024
     - LAND: ≥2× (records TD-D-CSV-13b-PERF-FLOOR-1 if 2≤x<4)
     - mul_assess target: ≥2.5× (limited by scalar f64 finalize)
   Sweeps batch sizes [8, 64, 1024, 16384] per fn.
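
For item 1 above, a rough sketch of what a manual discriminant mapping can look like. Which variants carry a `String` reason is an assumption here (the message only says that payloads exist), and `discs` is a hypothetical helper, not a function from the crate.

```rust
/// Hypothetical shape of the decision type: payload-carrying variants are
/// assumed for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum GateDecision {
    Flow,
    Hold(String),
    Block(String),
}

impl GateDecision {
    /// Byte mapping from the commit message: 0=Flow, 1=Hold, 2=Block.
    fn to_disc(&self) -> u8 {
        match self {
            GateDecision::Flow => 0,
            GateDecision::Hold(_) => 1,
            GateDecision::Block(_) => 2,
        }
    }
}

/// Collapse full decisions into a byte-per-element representation that a
/// batch path can compare against without touching the reason strings.
fn discs(decisions: &[GateDecision]) -> Vec<u8> {
    decisions.iter().map(GateDecision::to_disc).collect()
}
```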

Validation gap (the work the worker never got to):

- `cargo check -p lance-graph-contract` → CLEAN (one dead-code warning
  for `SimdCapsShim::neon` field, benign — retry worker can either
  use the field or drop it).
- `cargo test -p lance-graph-contract i4_eval::batch` → 0 tests; the
  worker did not write unit tests for the new batch fns. Tests must
  be added on retry against the scalar reference (i.e. assert dispatch
  output equals `scalar_impl` output element-wise for randomised input).
- `cargo bench` on benches/i4_batch.rs will NOT compile until `criterion`
  is added to `[dev-dependencies]` in lance-graph-contract/Cargo.toml.
  Intentionally left absent here — adding the dep belongs to the retry
  commit that also adds the unit tests.

Branch is not for merge as-is; it's a seed state for the retry worker.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
…u8) enum invariant

Adds randomised SIMD-vs-scalar parity tests with fixed seed (xorshift64,
deterministic, zero-dep) covering all 5 batch fns at 10 sizes including
edge cases (0, 1, 3, 7, 8, 9, 15, 16, 64, 1024). Each test exercises every
decision branch by setting all 5 read dims (valence, tension, warmth,
coherence, groundedness).
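
The parity-test pattern described above can be sketched as follows. The generator is a standard xorshift64 (shifts 13, 7, 17); the `reference` and `candidate` functions are hypothetical stand-ins for `scalar_impl` and the dispatched batch function, and the seed is arbitrary (it only needs to be non-zero).

```rust
/// Zero-dependency deterministic generator (xorshift64, shifts 13/7/17).
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Batch sizes from the commit message: empty, sub-chunk, chunk boundaries, large.
const SIZES: [usize; 10] = [0, 1, 3, 7, 8, 9, 15, 16, 64, 1024];

/// Stand-in for the scalar reference.
fn reference(xs: &[u64]) -> Vec<u8> {
    xs.iter().map(|&x| (x & 3) as u8).collect()
}

/// Stand-in for the dispatched path: 8-wide chunks plus a scalar tail.
fn candidate(xs: &[u64]) -> Vec<u8> {
    let mut out = vec![0u8; xs.len()];
    let mut i = 0;
    while i + 8 <= xs.len() {
        for j in 0..8 {
            out[i + j] = (xs[i + j] & 3) as u8;
        }
        i += 8;
    }
    while i < xs.len() {
        out[i] = (xs[i] & 3) as u8;
        i += 1;
    }
    out
}

/// Element-wise equality of candidate and reference at every size.
fn parity_holds(seed: u64) -> bool {
    let mut rng = XorShift64(seed);
    SIZES.iter().all(|&n| {
        let input: Vec<u64> = (0..n).map(|_| rng.next()).collect();
        reference(&input) == candidate(&input)
    })
}
```

A fixed seed makes any failure reproducible, and the sizes straddle the chunk width so that both the full-chunk loop and the tail are exercised.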

Locks DkPosition/TrustTexture/FlowState to #[repr(u8)] with explicit
discriminants per spec §5 (I-LEGACY-API-FEATURE-GATED). The SIMD impl
already byte-wrote into &mut [DkPosition] / [TrustTexture] / [FlowState]
slices via extract_8_lane0_bytes; before this commit the underlying enum
layout was default-repr so the byte writes were potentially undefined.

Discriminants match the SIMD LUT assumptions:
- DkPosition: MountStupid=0, ValleyOfDespair=1, Slope=2, Plateau=3
- TrustTexture: Calibrated=0, Overconfident=1, Uncertain=2, Underconfident=3
  (note: prior declaration order placed Uncertain=3 — corrected per spec)
- FlowState: Flow=0, Boredom=1, Transition=2, Anxiety=3
  (note: prior declaration order placed Anxiety=0 — corrected per spec)
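
As an illustration of why the explicit layout matters, here is a small sketch using the `FlowState` discriminants listed above. The helper names `from_disc` and `write_discs` are hypothetical, and the validation step is an addition for this sketch; it is not a description of what `extract_8_lane0_bytes` does.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FlowState {
    Flow = 0,
    Boredom = 1,
    Transition = 2,
    Anxiety = 3,
}

impl FlowState {
    /// Checked conversion from a raw discriminant byte.
    fn from_disc(b: u8) -> Option<Self> {
        match b {
            0 => Some(FlowState::Flow),
            1 => Some(FlowState::Boredom),
            2 => Some(FlowState::Transition),
            3 => Some(FlowState::Anxiety),
            _ => None,
        }
    }
}

/// Write raw discriminant bytes into an enum slice.
/// Returns false (and writes nothing) on a length mismatch or an invalid byte.
fn write_discs(out: &mut [FlowState], discs: &[u8]) -> bool {
    if out.len() != discs.len() || discs.iter().any(|&b| b > 3) {
        return false;
    }
    // SAFETY: FlowState is #[repr(u8)], so it has size and alignment 1, and
    // every byte was checked above to be a declared discriminant (0..=3).
    let bytes = unsafe {
        std::slice::from_raw_parts_mut(out.as_mut_ptr() as *mut u8, out.len())
    };
    bytes.copy_from_slice(discs);
    true
}
```

Without `#[repr(u8)]` the compiler is free to choose the enum's size and discriminant encoding, so the same byte write would have no guaranteed meaning.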

Also fixes the SimdCapsShim dead-code warning (each field is only read on
its matching #[cfg(target_arch)] dispatch branch; tagged #[allow(dead_code)]
on the struct).

Adds criterion 0.5 as a dev-dep (matches lance-graph-benches version) plus
the [[bench]] harness=false declaration needed for benches/i4_batch.rs to
build via `cargo bench --no-run`.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
…ull i64

The salvaged AVX-512 batch impls used _mm512_cmp*_epi64_mask comparisons
against i64 thresholds, but extract_dim_i8 only sign-extended the i4 nibble
within an i16 sub-lane. After the i16 srai, the upper 48 bits of each i64
lane stayed zero — so a negative i4 (e.g. -3 → 0xFD as i8) read back as
i64 = 0x000000000000FFFD = +65533 to the i64 comparator. Negative-threshold
checks like (coh <= -3) silently became (65533 <= -3), always false,
which collapsed the priority chain (Valley/Anxiety/etc. branches never
fired).

Fix extract_dim_i8 to sign-extend across the full i64 lane via
_mm512_slli_epi64<60> + _mm512_srai_epi64<60>. The dim values now live as
proper i64 signed values in -8..=+7, so the existing i64-grained
comparisons work correctly.

Also switch flow_state_batch's flow_proxy arithmetic from
_mm512_adds_epi16/_subs_epi16 (i16 saturating, wrong granularity given
the i64 inputs) to _mm512_add_epi64/_sub_epi64 (i64, exact for the i4
input range -23..=+22 which can never overflow i64). The scalar's i8
clamp is never triggered for i4 inputs so the behaviours match.

After the fix all 449 lance-graph-contract tests pass, including the 5
new SIMD-vs-scalar parity tests over batch sizes [0, 1, 3, 7, 8, 9, 15,
16, 64, 1024] and the pre-existing 5 *_batch_matches_scalar tests that
were silently failing on the salvage branch.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
…#[doc(hidden)]

benches/i4_batch.rs needs to baseline SIMD dispatch against scalar_impl
directly. Promote the module from pub(crate) to pub with #[doc(hidden)]
so the crate's external API is unchanged at the rustdoc level but the
bench scaffold can compile.

Bench results (cargo bench --quick, AVX-512 host, batch size 1024):
- dk_position_batch          8.7x (SHIP gate >=4x met)
- trust_texture_batch        7.4x
- flow_state_batch           5.2x
- gate_decision_disc_batch  10.2x
- mul_assess_batch           3.1x (>=2.5x target met; scalar f64 finalize
                                    bounds the speedup per spec section 7)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
…BOARD / LATEST_STATE / PR_ARC)

Per CLAUDE.md "Mandatory Board-Hygiene Rule":
- AGENT_LOG.md: PREPEND a sprint-13-w-i1-salvage entry covering files
  touched (210 LOC net in mul.rs + Cargo.toml dev-dep), tests (449 green
  incl. 5 new SIMD-vs-scalar parity tests over 10 sizes), benchmarks
  (8.7x/7.4x/5.2x/10.2x/3.1x at batch 1024 on Skylake-AVX512 host),
  iron-rule citations (I-LEGACY-API-FEATURE-GATED, I-NOISE-FLOOR-JIRAK),
  AP1-AP8 self-scan, validation gaps disclosed (NEON cross-arch deferred,
  multi-microarch deferred).
- STATUS_BOARD.md: flip D-CSV-13b row from "Queued (PP-6 spec drafting)"
  to "In PR (sprint-13/W-I1 salvage)" with bench summary.
- LATEST_STATE.md: replace the "queued, spec being drafted by PP-6" line
  with the in-PR status including the SHIP-gate-met bench numbers.
- PR_ARC_INVENTORY.md: PREPEND a new sprint-13/W-I1 entry covering
  Added (the i4_eval::batch module surface), Locked (the #[repr(u8)]
  enum layout invariant per spec section 5), Deferred (NEON cross-arch
  verification, multi-microarch perf, AVX-2 fast path, WASM SIMD128,
  VBMI2 compressstore), Docs (the spec + doc-comments), Confidence.

These updates land in a follow-up commit rather than the impl commit
because the impl needed surgical fixes (the salvage AVX-512 path had a
critical sign-extend bug); separating the commits keeps the bug-fix
attribution clean. Future sessions should still aim for impl + board
in the same commit.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
Orchestrator-only fix per autoattended-multiagent-pattern §5 Rule 3
(atomic consolidation pass). The retry worker added `criterion` as
a dev-dep in `crates/lance-graph-contract/Cargo.toml` (commit a356e64)
and successfully ran `cargo bench` locally — which silently
regenerated Cargo.lock — but never staged or committed Cargo.lock
itself. PR #398 as pushed has Cargo.toml expecting `criterion` while
Cargo.lock does not list it; any `cargo build --locked` (CI default)
fails with "the lock file Cargo.lock needs to be updated".

Adds the single missing entry under `[[package]]` for `lance-graph-contract`:
```
 dependencies = [
+ "criterion",
   "glob",
   "serde",
   "serde_yaml",
```

No semantic changes to lance-graph-contract or any other crate. The
criterion package itself was already pulled in as a transitive of
other workspace members, so this commit adds only the dep reference,
not a new crate version.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4df835fc30


Comment on lines +1199 to +1201
// SAFETY: QualiaI4_16D is repr(C, align(8)); &.0 is a valid *const u64.
let q0 = vld1q_u64(&qualia[i].0 as *const u64);
let q1 = vld1q_u64(&qualia[i + 1].0 as *const u64);

P1: Avoid loading past the qualia slice in NEON batches

On aarch64, vld1q_u64(&qualia[i + 1].0 as *const u64) loads two u64 lanes starting at the last element of each 2-item chunk, so for inputs like qualia.len() == 2 it reads one QualiaI4_16D past the slice even though only lane 0 is later used. Because the public dispatcher enables this path for any len >= 2, dk_position_batch can hit out-of-bounds reads/UB on ARM; the same load pattern appears in the other NEON batch routines.
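
One way to make the bounds argument explicit is to take both lanes from a pair that is proven to lie inside the slice. The portable sketch below uses `chunks_exact(2)` for that proof; `Qualia` is a hypothetical stand-in for `QualiaI4_16D`, and the low-nibble extraction is a placeholder for the real per-element work. On aarch64 the pair could presumably feed a single two-lane load starting at the first element of the chunk, but that is an assumption and is not shown here.

```rust
/// Hypothetical stand-in for QualiaI4_16D (one packed u64 per element).
#[derive(Clone, Copy)]
#[repr(C, align(8))]
struct Qualia(u64);

/// Process elements two at a time without ever touching memory past the slice.
fn low_nibbles_pairwise(qualia: &[Qualia]) -> Vec<u8> {
    let mut out = Vec::with_capacity(qualia.len());
    let mut pairs = qualia.chunks_exact(2);
    for pair in &mut pairs {
        // Both lanes come from inside `pair`, which is exactly two elements
        // long, so no read extends beyond the end of `qualia`.
        let lanes: [u64; 2] = [pair[0].0, pair[1].0];
        out.push((lanes[0] & 0xF) as u8);
        out.push((lanes[1] & 0xF) as u8);
    }
    // An odd trailing element goes through the scalar path.
    for q in pairs.remainder() {
        out.push((q.0 & 0xF) as u8);
    }
    out
}
```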

Comment on lines +888 to +890
let neg_man = _mm512_sub_epi64(zero, man_vec);
let man_neg_mask = _mm512_cmplt_epi64_mask(man_vec, zero);
let abs_man = _mm512_mask_blend_epi64(man_neg_mask, man_vec, neg_man);

P2: Preserve scalar handling of i8::MIN mantissas

When signed_mantissa is i8::MIN, the scalar anchor uses signed_mantissa.unsigned_abs() as i8, which wraps to -128 and therefore falls into the abs_mantissa <= 1 ValleyOfDespair branch. This AVX-512 path negates after widening to i64, producing 128, so on AVX-512 hosts high-coherence rows with mantissa -128 are classified as Slope/Plateau instead of matching scalar output; that also propagates through mul_assess_batch.
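
The divergence can be reproduced in a few lines of scalar Rust. The function names are hypothetical; `abs_widened_matching` shows one possible way to make a widened path agree with the scalar anchor, on the assumption that the scalar wrap-around is the behaviour to preserve (the review comment does not say whether that wrap is intended).

```rust
/// Scalar anchor as described in the review: 128u8 reinterpreted as i8 is -128.
fn abs_scalar(m: i8) -> i8 {
    m.unsigned_abs() as i8
}

/// Widen first, then negate: the absolute value is exact, so -128 becomes 128.
fn abs_widened(m: i8) -> i64 {
    let w = m as i64;
    if w < 0 { -w } else { w }
}

/// Widened path truncated back to 8 bits, so that it wraps like the scalar.
fn abs_widened_matching(m: i8) -> i64 {
    (abs_widened(m) as i8) as i64
}
```

For `i8::MIN` the scalar result is -128, which satisfies `abs_mantissa <= 1`, while the widened result is 128, which does not; every other input agrees across the two paths.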


AdaWorldAPI pushed a commit that referenced this pull request May 16, 2026
Adds the simd-savant agent card alongside the project-agnostic 4-savant
taxonomy (PP-13 brutally-honest-tester / PP-14 convergence-architect /
PP-15 baton-handoff-auditor / PP-16 preflight-drift-auditor). Its scope
is the one workspace-specific SIMD invariant codified earlier in this
session:

  All SIMD must come from `ndarray::simd` via the polyfill —
  `simd.rs` + `simd_ops.rs` > `simd_{type}.rs` per-arch.
  Raw intrinsics outside `ndarray/src/simd_*.rs` are a violation.

The savant runs at three checkpoints (PRE-SPAWN / DURING-IMPL /
PRE-MERGE) and owns 8 anti-patterns (AP-SIMD-1..8) covering raw
intrinsics in consumer crates, hand-rolled feature detection,
arch-specific cfg outside the polyfill, unchecked pointer loads,
missing scalar fallback, and duplicated SIMD wrappers.

Hand-offs are explicit per autoattended-pattern §3 discipline:
- SIMD-induced UB / OOB → PP-13 (post-impl gate)
- Missing primitive → file `TD-NDARRAY-SIMD-<NAME>` and route to
  ndarray maintainer (do NOT approve inlining the raw intrinsic)
- Spec-vs-code drift → PP-16
- Cross-crate SIMD type aliasing → PP-15
- Compile error → PP-13

Files touched:
- `.claude/agents/simd-savant.md` (new) — the agent card.
- `.claude/agents/BOOT.md` — adds the Quality-lifecycle row for the
  simd-savant in the Knowledge Activation table (alongside the four
  PP-N rows).
- `.claude/knowledge/autoattended-multiagent-pattern.md` § 14
  (lance-graph adapter section) — adds the workspace-specific note
  explaining why the 5th savant is an adapter rather than a §3
  transferable slot (depends on having a polyfill repo to be the
  source-of-truth — not all projects do).

Documented trigger: Sprint-13 W-I1 PR #398. The salvaged D-CSV-13b
impl inlined raw `_mm512_*` (x86_64) and `vld1q_u64` (aarch64)
intrinsics directly in `crates/lance-graph-contract/src/mul.rs`,
bypassing `ndarray::simd` entirely. Codex P1 (NEON OOB at len==2)
is a direct consequence of AP-SIMD-5 (hand-rolled ptr-load with no
bounds proof). This savant would have caught the violation
PRE-SPAWN (in the worker brief) and PRE-MERGE (in the audit grep).

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
@AdaWorldAPI AdaWorldAPI merged commit 291c5cd into main May 16, 2026
5 checks passed