fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop)#166
Conversation
|
Pushed a follow-up commit ( What the extension doesAt function entry, for each parameter typed Guards:
Deliberately not covered: Uint8Array-typed paramsSurfacing a pre-existing bug on main while testing this: when a program defines both a Measured
Intrinsic firing confirmed via The remaining 6× gap vs Node on the param path is a separate issue — the Tests
|
Closes the primitive half of #92 via PR #166. For the 14 Node-style Buffer numeric read accessors, bypass the `js_native_call_method` → method-name string match → `js_buffer_read_int32_be` → `buffer_slice_at` → `from_be_bytes` chain when the receiver is a tracked `const buf = Buffer.alloc(N)` local or a `Buffer`-typed function parameter. Emit inline LLVM instead: one load of the `data_ptr` from the pre-computed `buffer_data_slot`, one `getelementptr inbounds`, one byte-width load, one `@llvm.bswap.iN` for BE widths, one `sitofp` / `uitofp` to double. Covered methods: readInt8, readUInt8, readInt16{BE,LE}, readUInt16{BE,LE}, readInt32{BE,LE}, readUInt32{BE,LE}, readFloat{BE,LE}, readDouble{BE,LE}. BigInt64 reads skip the intrinsic (would need inline BigInt alloc). Two fast-path sources for the data_ptr slot: 1. `const buf = Buffer.alloc(N)` locals — existing path, already tracked in `buffer_data_slots` by `stmt.rs`. 2. `Buffer`-typed function parameters (the Postgres-driver shape) — new pre-registration in `codegen.rs::compile_function` that, at function entry, loads the NaN-boxed arg from the param slot, unboxes to i64 handle, computes `gep+8` to the data pointer, stashes it in a fresh ptr alloca, and registers in `buffer_data_slots`. Guarded by `has_any_mutation` (skips params written to via LocalSet/Update/ARRAY_MUTATORS — includes `buf.fill(...)`, `buf.copyWithin(...)`) and `boxed_vars` (cross-closure mutation). Uint8Array-typed params deliberately excluded — a pre-existing crash surfaces on main when a program defines both a Buffer-param and a Uint8Array-param function and invokes them in sequence; tracked separately. Untracked receivers (object fields, anything not tagged at the let-site) fall through to the existing runtime dispatch unchanged, so this PR is strictly additive. ## Measured (macOS arm64, #92 repro — readInt32BE × 12.5M) | Runtime | Time | Ratio | |---------|-----:|------:| | Perry | 29ms | 1.0× | | Node | 37ms | 1.3× | | Bun | 104ms | 3.6× | Before this change, Perry was ~145ms on the same workload (5× slower than Bun). After: 3.6× faster than Bun, 1.3× faster than Node. `otool -tV` shows the emitted intrinsic getting auto-vectorized (`rev32.4s` + 128-bit NEON loads in the hot loop). Function-param path measured at ~40 ns/op (down from ~55 ns/op via runtime dispatch). The remaining Node gap on that path is a separate issue: `n: number` loop bounds prevent i32-loop-counter specialization (same family as #140 but in a different guise), blocking LLVM's LoopVectorizer on the read chain. Not in scope for the intrinsic work — the primitive cost is now competitive. ## Correctness New `test-files/test_buffer_numeric_read_intrinsic.ts` covers all 14 variants against their corresponding writes, including sign-extension edge cases (`i32` MIN/MAX = ±2147483648, `u16` 0xFFFF = 65535, `u32` 0xFFFFFFFF = 4294967295, negative doubles in LE) and a `sumRow(row: Buffer, n: number)` case exercising the param intrinsic. Matches `node --experimental-strip-types` byte-for-byte on all lines. Gap suite still 23/28, `perry-runtime` unit tests 111/111. ## CI env gap The test compiles cleanly on macOS 15.x but fails on the GitHub macOS-14 runner (SDK/linker gap, same pattern as `test_gap_buffer_ops`, `test_stress_buffer`, `test_gap_typed_arrays`). Added to the `SKIP_TESTS` list in `.github/workflows/test.yml` and to `test-parity/known_failures.json` with status `ci-env` so neither the compile-smoke nor parity gate flags it as a new failure. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry) folded in at merge.
a69aa88 to
1816abd
Compare
…s (v0.5.188) Closes #168 via PR #171. Perry already specializes loop counters to i32 when the bound is `arr.length` (the `classify_for_length_hoist` peephole). This PR extends that optimization to the equally-common `i < n` shape where `n` is a `number`-typed function parameter or local — the pattern that blocks `LoopVectorizer` on Buffer-read and other intrinsic-heavy hot paths (e.g. the v0.5.183 `readInt32BE` intrinsic from PR #166 now actually auto-vectorizes when the loop bound is a `n: number` param). ## What changed Single file: `crates/perry-codegen/src/stmt.rs`. **New helper** — `classify_for_local_bound(cond, ctx)`: - Accepts `Compare { op: Lt|Le, left: LocalGet(i), right: LocalGet(n) }` - Requires `i` in `integer_locals` (initialized from integer literal, only mutated via `Update`) - Requires `n` to be either also in `integer_locals`, or a number-typed (`Type::Number | Type::Int32`) non-boxed, non-global slot — covers `n: number` parameters and simple `let` locals typed as number. **In `lower_for`** — new parallel block alongside the `arr.length` path: 1. Allocates a parallel i32 alloca for the counter (if not already present from the Let site) and initializes it via `fptosi` of the current double value. 2. Emits `fptosi(n)` once before the cond block into a fresh i32 alloca (the key: loop-invariant integer bound in a register LLVM/SCEV can see). 3. Uses `icmp slt i32 %i, %n.i32` (or `icmp sle` for `<=`) instead of `fcmp olt double`. 4. The existing `Expr::Update` lowering already keeps the i32 slot in sync with `add i32 1` per iteration. 5. Removes the i32 counter slot after loop exit (only if this path was the one that allocated it). ## Correctness - Counter (`i`): `integer_locals` guarantees it is initialized from an integer literal and only ever modified by `Update ++/--`. The i32 slot round-trips exactly. - Bound (`n`): when `n` is in `integer_locals` the same guarantee holds. When `n` is a number-typed parameter slot Perry trusts the TypeScript type annotation (the same trust-types contract already applied throughout codegen). A non-integer float bound (e.g. `foo(3.7)`) would observe at most one fewer iteration — a trade-off within Perry's existing contract. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry) folded in at merge.
…s (v0.5.188) Closes #168 via PR #171. Perry already specializes loop counters to i32 when the bound is `arr.length` (the `classify_for_length_hoist` peephole). This PR extends that optimization to the equally-common `i < n` shape where `n` is a `number`-typed function parameter or local — the pattern that blocks `LoopVectorizer` on Buffer-read and other intrinsic-heavy hot paths (e.g. the v0.5.183 `readInt32BE` intrinsic from PR #166 now actually auto-vectorizes when the loop bound is a `n: number` param). ## What changed Single file: `crates/perry-codegen/src/stmt.rs`. **New helper** — `classify_for_local_bound(cond, ctx)`: - Accepts `Compare { op: Lt|Le, left: LocalGet(i), right: LocalGet(n) }` - Requires `i` in `integer_locals` (initialized from integer literal, only mutated via `Update`) - Requires `n` to be either also in `integer_locals`, or a number-typed (`Type::Number | Type::Int32`) non-boxed, non-global slot — covers `n: number` parameters and simple `let` locals typed as number. **In `lower_for`** — new parallel block alongside the `arr.length` path: 1. Allocates a parallel i32 alloca for the counter (if not already present from the Let site) and initializes it via `fptosi` of the current double value. 2. Emits `fptosi(n)` once before the cond block into a fresh i32 alloca (the key: loop-invariant integer bound in a register LLVM/SCEV can see). 3. Uses `icmp slt i32 %i, %n.i32` (or `icmp sle` for `<=`) instead of `fcmp olt double`. 4. The existing `Expr::Update` lowering already keeps the i32 slot in sync with `add i32 1` per iteration. 5. Removes the i32 counter slot after loop exit (only if this path was the one that allocated it). ## Correctness - Counter (`i`): `integer_locals` guarantees it is initialized from an integer literal and only ever modified by `Update ++/--`. The i32 slot round-trips exactly. - Bound (`n`): when `n` is in `integer_locals` the same guarantee holds. When `n` is a number-typed parameter slot Perry trusts the TypeScript type annotation (the same trust-types contract already applied throughout codegen). A non-integer float bound (e.g. `foo(3.7)`) would observe at most one fewer iteration — a trade-off within Perry's existing contract. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry) folded in at merge.
…lab (v0.5.190) Closes #92 (Buffer.alloc small half) via PR #173. Per-thread bump-pointer slab for `Buffer.alloc(N)` where `N < 256`, eliminating one `malloc` call and one `HashSet::insert` per allocation. Large buffers (≥ 256 bytes) fall through to the unchanged malloc + HashSet path. `is_registered_buffer` checks slab block address ranges before the `BUFFER_REGISTRY` HashSet, so `instanceof` / `Buffer.isBuffer` still work without per-alloc registration. ## Threshold rationale (256 bytes) Covers the primary Postgres cell-decode pattern from the issue: 4 bytes (Int32), 8 bytes (Int64 / UUID half), 16 bytes (UUID), up to 255 bytes (mid-size strings). The malloc path is retained for larger allocations where its overhead is amortised over the allocation cost. ## Micro-benchmark: 100k × `Buffer.alloc(16)` | | Time | |---|---| | Perry before | ~8ms | | **Perry after** | **~1–2ms** | | Bun target | ~2ms | 4–8× speedup, matching Bun. ## GC interaction Buffers have never carried a `GcHeader` and are not tracked in `MALLOC_STATE` — the existing malloc path also never calls `dealloc` on individual buffers (they live for the thread's lifetime). Slab blocks follow the same lifetime: one `alloc` per 256 KB block, retained until the thread exits. No GC behaviour changes. `is_registered_buffer` correctness: slab blocks exclusively contain `BufferHeader` allocations. All callers pass the header pointer (NaN-boxed `POINTER_TAG` always encodes the `BufferHeader*`, never interior data bytes), so the O(n_slabs) range scan has no false positives. `n_slabs` is typically 1-5 for programs doing 100k small-buffer allocs. ## Maintainer fixup folded at merge `test_buffer_small_alloc` exposes the same macOS-14 SDK/linker gap that affects every other Buffer test in the suite (compiles cleanly on macOS 15.x; fails on the GitHub macOS-14 runner). Added to `SKIP_TESTS` in `.github/workflows/test.yml` and to `test-parity/known_failures.json` with `status: "ci-env"` — same pattern as `test_gap_buffer_ops`, `test_stress_buffer`, `test_buffer_numeric_read_intrinsic`. Not a perry-side bug. ## Out of scope Buffer reads/writes are separate work — the readInt32BE et al. intrinsic landed in v0.5.183 (PR #166), and per-receiver-type dispatch for `tArr[i] = v` is a known follow-up after the v0.5.184 revert. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry + ci-env skip-list entries) folded in at merge.
…lab (v0.5.190) Closes #92 (Buffer.alloc small half) via PR #173. Per-thread bump-pointer slab for `Buffer.alloc(N)` where `N < 256`, eliminating one `malloc` call and one `HashSet::insert` per allocation. Large buffers (≥ 256 bytes) fall through to the unchanged malloc + HashSet path. `is_registered_buffer` checks slab block address ranges before the `BUFFER_REGISTRY` HashSet, so `instanceof` / `Buffer.isBuffer` still work without per-alloc registration. ## Threshold rationale (256 bytes) Covers the primary Postgres cell-decode pattern from the issue: 4 bytes (Int32), 8 bytes (Int64 / UUID half), 16 bytes (UUID), up to 255 bytes (mid-size strings). The malloc path is retained for larger allocations where its overhead is amortised over the allocation cost. ## Micro-benchmark: 100k × `Buffer.alloc(16)` | | Time | |---|---| | Perry before | ~8ms | | **Perry after** | **~1–2ms** | | Bun target | ~2ms | 4–8× speedup, matching Bun. ## GC interaction Buffers have never carried a `GcHeader` and are not tracked in `MALLOC_STATE` — the existing malloc path also never calls `dealloc` on individual buffers (they live for the thread's lifetime). Slab blocks follow the same lifetime: one `alloc` per 256 KB block, retained until the thread exits. No GC behaviour changes. `is_registered_buffer` correctness: slab blocks exclusively contain `BufferHeader` allocations. All callers pass the header pointer (NaN-boxed `POINTER_TAG` always encodes the `BufferHeader*`, never interior data bytes), so the O(n_slabs) range scan has no false positives. `n_slabs` is typically 1-5 for programs doing 100k small-buffer allocs. ## Maintainer fixup folded at merge `test_buffer_small_alloc` exposes the same macOS-14 SDK/linker gap that affects every other Buffer test in the suite (compiles cleanly on macOS 15.x; fails on the GitHub macOS-14 runner). Added to `SKIP_TESTS` in `.github/workflows/test.yml` and to `test-parity/known_failures.json` with `status: "ci-env"` — same pattern as `test_gap_buffer_ops`, `test_stress_buffer`, `test_buffer_numeric_read_intrinsic`. Not a perry-side bug. ## Out of scope Buffer reads/writes are separate work — the readInt32BE et al. intrinsic landed in v0.5.183 (PR #166), and per-receiver-type dispatch for `tArr[i] = v` is a known follow-up after the v0.5.184 revert. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry + ci-env skip-list entries) folded in at merge.
Closes the primitive side of #92.
Summary
For the 14 Node-style Buffer numeric read accessors, bypass the
js_native_call_method→ method-name string match →js_buffer_read_int32_be→buffer_slice_at→from_be_byteschain when the receiver is a trackedconst buf = Buffer.alloc(N)local. Emit inline LLVM instead: one load of thedata_ptrfrom the pre-computedbuffer_data_slot, onegetelementptr inbounds, one byte-width load, one@llvm.bswap.iNfor BE widths, onesitofp/uitofpto double.Covered methods:
readInt8,readUInt8,readInt16{BE,LE},readUInt16{BE,LE},readInt32{BE,LE},readUInt32{BE,LE},readFloat{BE,LE},readDouble{BE,LE}.Measured on the issue's own repro
buf.readInt32BE(i*4)in a tight loop, 12.5M total reads (50 × 250k), macOS arm64:Before this change, Perry was ~145ms on the same workload — 5× slower than Bun. After: 3.6× faster than Bun.
otool -tVon the compiled binary shows the emitted intrinsic getting auto-vectorized (rev32.4s+ 128-bit NEON loads in the hot loop).Correctness
New
test-files/test_buffer_numeric_read_intrinsic.tsexercises all 14 variants against their corresponding writes, including the sign-extension edge cases that would miscompile if sitofp/uitofp were swapped (i32 MIN/MAX = ±2147483648, u16 0xFFFF = 65535, u32 0xFFFFFFFF = 4294967295, negative doubles in LE). Matchesnode --experimental-strip-typesbyte-for-byte on all 19 lines.Gap suite still 23/28 (no regressions); perry-runtime unit tests 111/111 green.
Why this shape
The fast path only fires when:
Expr::LocalGet(id)idhas abuffer_data_slotsentry (populated bystmt.rsforconst buf = Buffer.alloc(N)— immutable local or module-global)Untracked receivers (function args, object fields, anything not tagged at the let-site) fall through to the existing runtime dispatch unchanged, so this PR is strictly additive. The Postgres driver's real hot path passes Buffer as a function parameter and therefore still misses the fast path — see the follow-up note below.
What's still open on #92
buffer_data_slotsto cover typed function params / object fields is the follow-up that closes the gap on real Postgres-driver workloads. Not in this PR (scope creep + needs type-flow correctness). I can open a separate PR once this is reviewed.Buffer.alloc(small)path still goes through the Rust allocator;BigInt(str)/parseFloatstill heap-allocate. Those are separate from this change.BigInt64reads skip the intrinsic (would need BigInt allocation inline — not a primitive bottleneck at that point).Files changed
crates/perry-codegen/src/lower_call.rs— addsclassify_buffer_numeric_read+try_emit_buffer_read_intrinsichelpers (~160 lines), hooks into the existing PropertyGet dispatch at theis_buffer_classbranchtest-files/test_buffer_numeric_read_intrinsic.ts— new correctness testCargo.lock— version sync (0.5.177 → 0.5.179 to match main)Per CLAUDE.md external-contributor rules I did not bump the workspace version or edit the
**Current Version:**/ Recent Changes entry — maintainer to handle at merge time.Test plan
cargo build --release -p perry-runtime -p perry-stdlib -p perry— cleancargo test --release -p perry-runtime --lib— 111/111 passbash /tmp/run_gap_tests.sh— 23/28, unchanged from baselinetest_buffer_numeric_read_intrinsic.tsmatches Node byte-for-byterev32/ NEON instructions — intrinsic fires