Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .claude/board/EPIPHANIES.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
**Missing primitives** (must be added to ndarray before consumer remediation can complete):

- `TD-NDARRAY-SIMD-UNPACK-I4-16D` — `I8x16::from_i4_packed_u64` + `batch_packed_i4_16<E, F>` closure-batch
- `TD-NDARRAY-SIMD-SATURATING-ABS-I8` — `I8x16::saturating_abs` (closes codex P2 i8::MIN divergence)
- `TD-NDARRAY-SIMD-SATURATING-ABS-I8` — `I8x16::saturating_abs` via `_mm512_min_epu8(_mm512_abs_epi8(x), 0x7f)` on AVX-512 (VPABSB alone does NOT saturate `i8::MIN`; needs VPMINUB clamp), `vqabsq_s8` on NEON, `i8::saturating_abs` scalar — closes codex P2 i8::MIN divergence
- `TD-NDARRAY-SIMD-GATHER` — `U16x8::gather_u16` (palette lookup, currently raw `_mm256_i32gather_epi32` in `bgz17`)
- `TD-NDARRAY-SIMD-PREFETCH` — cross-arch `prefetch_read_t0` (no-op on unsupported)
- `TD-NDARRAY-SIMD-POPCOUNT-U64` — `U64x8::popcnt` (lane-wise 64-bit popcount; currently raw `_mm512_popcnt_epi64` in `holograph` + `blasgraph`)
Expand Down
7 changes: 4 additions & 3 deletions .claude/board/TECH_DEBT.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,10 +32,11 @@
- **Severity:** P1 (closes codex P2 i8::MIN divergence on PR #398 by giving consumers a single source-of-truth for hardware-semantics abs)
- **Surfaced in:** PR #398 codex P2 review; PP-16 preflight-drift-auditor verdict "Direction B" 2026-05-16
- **Status:** Open
- **Description:** Scalar path in `mul.rs` uses `signed_mantissa.unsigned_abs() as i8`, which wraps `i8::MIN = -128` back to `-128i8` (the cast `u8 → i8` doesn't saturate), then `-128 ≤ 1` is true → wrongly classifies as `ValleyOfDespair`. AVX-512 `_mm512_abs_epi8` saturates `i8::MIN → 127` by ISA semantics (VPABSB), correctly NOT triggering `ValleyOfDespair`. Spec line 233 of `pr-sprint-13-simd-i4.md`: `|signed_mantissa| ≤ 1 → ValleyOfDespair` represents weak rule signal, NOT sign-extreme. Direction B (scalar is buggy, AVX-512 is correct) is canonical.
- **Description:** Scalar path in `mul.rs` uses `signed_mantissa.unsigned_abs() as i8`, which wraps `i8::MIN = -128` back to `-128i8` (the cast `u8 → i8` doesn't saturate), then `-128 ≤ 1` is true → wrongly classifies as `ValleyOfDespair`. PR #398's AVX-512 path correctly classifies `i8::MIN` not because of VPABSB (VPABSB does NOT saturate — `abs(0x80) = 0x80`, the bit pattern is unchanged), but because the path widens i8 → i64 first and then negate-blends, where the negate of -128 (i64) is +128 (i64), comparing > 1. Spec line 233 of `pr-sprint-13-simd-i4.md`: `|signed_mantissa| ≤ 1 → ValleyOfDespair` represents weak rule signal, NOT sign-extreme. Direction B (scalar is buggy, AVX-512 outcome is correct) is canonical — but the new ndarray primitive must produce truly-saturating semantics across all three backends.
- **Required API surface:**
- `impl I8x16 { pub fn saturating_abs(self) -> Self; }` — AVX-512 `_mm512_abs_epi8` (saturates by ISA); NEON `vqabsq_s8`; scalar `i8::saturating_abs` fused loop.
- `impl I8x32 { pub fn saturating_abs(self) -> Self; }` (parity)
- `impl I8x16 { pub fn saturating_abs(self) -> Self; }` — AVX-512 `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))` (VPABSB leaves `0x80 → 0x80`; VPMINUB clamps `0x80` unsigned-greater-than `0x7f` down to `0x7f`); NEON `vqabsq_s8` (the `q` suffix is hardware-saturating); scalar `i8::saturating_abs` fused loop.
- `impl I8x32 { pub fn saturating_abs(self) -> Self; }` (parity, same AVX-512 + clamp pattern)
- **Mandatory test:** assert `I8x16::saturating_abs(splat(i8::MIN))` returns `splat(i8::MAX)` on all three backends. PR #398-style widen-then-negate is NOT a correct substitute; the primitive must be saturating in the same byte-wide register without widening.
- **Cross-ref:** `.claude/knowledge/ndarray-vertical-simd-alien-magic.md` §W1a #2; EPIPHANIES.md E-SIMD-SWEEP-1; PR #398 codex P2.

---
Expand Down
2 changes: 1 addition & 1 deletion .claude/knowledge/ndarray-vertical-simd-alien-magic.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ Each row maps one stack workload to (a) the ndarray struct methods it needs and
Five small primitives, each on its own branch, each auditable by `simd-savant` before merge:

1. **`TD-NDARRAY-SIMD-UNPACK-I4-16D`** — `I8x16::from_i4_packed_u64` + `I8x16::lane_i8::<N>` + `batch_packed_i4_16<E, F>` closure-batch entry. AVX-512 path via `_mm512_cvtepi8_epi16` + nibble shuffle; NEON via `vshl_n_s8` / `vqshl_n_s8`; scalar via fused-loop fallback. Bounds-aware tail.
2. **`TD-NDARRAY-SIMD-SATURATING-ABS-I8`** — `I8x16::saturating_abs(self) -> Self`. AVX-512 `_mm512_abs_epi8` (saturates `i8::MIN → 127` by ISA); NEON `vqabsq_s8`; scalar `i8::saturating_abs`. Closes codex P2 i8::MIN divergence on PR #398 by giving consumers a single source-of-truth for "the abs that matches hardware semantics."
2. **`TD-NDARRAY-SIMD-SATURATING-ABS-I8`** — `I8x16::saturating_abs(self) -> Self`. **AVX-512 needs `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`** — VPABSB alone does NOT saturate (`abs(0x80) = 0x80`, still `i8::MIN`); the VPMINUB clamp is required to remap `0x80 → 0x7f`. NEON `vqabsq_s8` (the `q` suffix means saturating); scalar `i8::saturating_abs`. Closes codex P2 i8::MIN divergence on PR #398 by giving consumers a single source-of-truth for "true saturating abs" across all three backends.
3. **`TD-NDARRAY-SIMD-GATHER`** — `U16x8::gather_u16(indices, table)`. AVX2 `_mm256_i32gather_epi32` + downcast; NEON loop (no native gather); scalar `indices.iter().map(|&i| table[i])`. Closes `bgz17/src/simd.rs:88` raw `_mm256_i32gather_epi32`.
4. **`TD-NDARRAY-SIMD-PREFETCH`** — `prefetch_read_t0(ptr: *const u8)`, `prefetch_read_t1`, `prefetch_read_t2`. AVX `_mm_prefetch`; NEON `__builtin_prefetch`-equivalent; no-op on unsupported. Closes `bgz17/src/prefetch.rs:96` / `:100`.
5. **`TD-NDARRAY-SIMD-POPCOUNT-U64`** — `U64x8::popcnt(self) -> Self` (lane-wise 64-bit popcount). AVX-512 VPOPCNTDQ `_mm512_popcnt_epi64`; NEON `vcntq_u8` + horizontal-sum; scalar `u64::count_ones`. Closes holograph + blasgraph hamming raw-intrinsic blocks.
Expand Down