fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop) by proggeramlug · Pull Request #166 · PerryTS/perry

proggeramlug · 2026-04-23T17:25:07Z

Closes the primitive side of #92.

Summary

For the 14 Node-style Buffer numeric read accessors, bypass the js_native_call_method → method-name string match → js_buffer_read_int32_be → buffer_slice_at → from_be_bytes chain when the receiver is a tracked const buf = Buffer.alloc(N) local. Emit inline LLVM instead: one load of the data_ptr from the pre-computed buffer_data_slot, one getelementptr inbounds, one byte-width load, one @llvm.bswap.iN for BE widths, one sitofp / uitofp to double.

Covered methods: readInt8, readUInt8, readInt16{BE,LE}, readUInt16{BE,LE}, readInt32{BE,LE}, readUInt32{BE,LE}, readFloat{BE,LE}, readDouble{BE,LE}.

Measured on the issue's own repro

buf.readInt32BE(i*4) in a tight loop, 12.5M total reads (50 × 250k), macOS arm64:

Runtime	Time	Ratio to Perry
Perry	29ms	1.0×
Node	37ms	1.3×
Bun	104ms	3.6×

Before this change, Perry was ~145ms on the same workload — 5× slower than Bun. After: 3.6× faster than Bun. otool -tV on the compiled binary shows the emitted intrinsic getting auto-vectorized (rev32.4s + 128-bit NEON loads in the hot loop).

Correctness

New test-files/test_buffer_numeric_read_intrinsic.ts exercises all 14 variants against their corresponding writes, including the sign-extension edge cases that would miscompile if sitofp/uitofp were swapped (i32 MIN/MAX = ±2147483648, u16 0xFFFF = 65535, u32 0xFFFFFFFF = 4294967295, negative doubles in LE). Matches node --experimental-strip-types byte-for-byte on all 19 lines.

Gap suite still 23/28 (no regressions); perry-runtime unit tests 111/111 green.

Why this shape

The fast path only fires when:

Receiver is an Expr::LocalGet(id)
id has a buffer_data_slots entry (populated by stmt.rs for const buf = Buffer.alloc(N) — immutable local or module-global)

Untracked receivers (function args, object fields, anything not tagged at the let-site) fall through to the existing runtime dispatch unchanged, so this PR is strictly additive. The Postgres driver's real hot path passes Buffer as a function parameter and therefore still misses the fast path — see the follow-up note below.

What's still open on #92

Extending buffer_data_slots to cover typed function params / object fields is the follow-up that closes the gap on real Postgres-driver workloads. Not in this PR (scope creep + needs type-flow correctness). I can open a separate PR once this is reviewed.
Buffer.alloc(small) path still goes through the Rust allocator; BigInt(str) / parseFloat still heap-allocate. Those are separate from this change.
BigInt64 reads skip the intrinsic (would need BigInt allocation inline — not a primitive bottleneck at that point).

Files changed

crates/perry-codegen/src/lower_call.rs — adds classify_buffer_numeric_read + try_emit_buffer_read_intrinsic helpers (~160 lines), hooks into the existing PropertyGet dispatch at the is_buffer_class branch
test-files/test_buffer_numeric_read_intrinsic.ts — new correctness test
Cargo.lock — version sync (0.5.177 → 0.5.179 to match main)

Per CLAUDE.md external-contributor rules I did not bump the workspace version or edit the **Current Version:** / Recent Changes entry — maintainer to handle at merge time.

Test plan

cargo build --release -p perry-runtime -p perry-stdlib -p perry — clean
cargo test --release -p perry-runtime --lib — 111/111 pass
bash /tmp/run_gap_tests.sh — 23/28, unchanged from baseline
New test_buffer_numeric_read_intrinsic.ts matches Node byte-for-byte
Perry binary of the bench shows rev32 / NEON instructions — intrinsic fires

proggeramlug · 2026-04-23T18:38:52Z

Pushed a follow-up commit (1e95fabd) extending the intrinsic to Buffer-typed function parameters — the real Postgres-driver shape, not just const buf = Buffer.alloc(N) locals.

What the extension does

At function entry, for each parameter typed Buffer that isn't reassigned, load the NaN-boxed arg from its param slot, unbox to i64 handle, compute handle + 8 as the data pointer, stash in a fresh ptr alloca, and register in buffer_data_slots keyed by the param's LocalId. The existing Uint8ArrayGet fast path and the try_emit_buffer_read_intrinsic from the first commit both look up by LocalId and pick up the slot automatically.

Guards:

Skip params flagged by has_any_mutation (LocalSet / Update / ARRAY_MUTATORS — covers buf = ..., buf.fill(...), buf.copyWithin(...)). A cached data_ptr would go stale if the param is reassigned.
Skip params in boxed_vars (cross-closure mutation, same reason).

Deliberately not covered: Uint8Array-typed params

Surfacing a pre-existing bug on main while testing this: when a program defines both a Buffer-param function and a Uint8Array-param function and invokes them in sequence, it crashes (exit 133 / SIGTRAP in the second call's arr[i] chain). Verified reproducible on main at 6e578615 with zero changes from this branch applied — it's not caused by the extension. Scoping this PR to Buffer params only, since that covers the Postgres driver target workload. The Uint8Array-param crash is tracked separately.

Measured

function decodeRow(row: Buffer, n: number) { for (let i = 0; i < n; i++) s += row.readInt32BE(i*4); } over 2M reads (10 × 200k), macOS arm64:

Runtime	Time	ns/op
Perry (before extension: runtime dispatch)	~110ms	~55
Perry (this PR: intrinsic)	80ms	~40
Node	14ms	7
Bun	5ms	2

Intrinsic firing confirmed via otool -tV: 33 rev* instructions, zero calls to js_native_call_method or js_buffer_read_int32_be. Correctness verified (sums match Node exactly).

The remaining 6× gap vs Node on the param path is a separate issue — the n: number loop bound prevents i32-loop-counter specialization (same family as #140 in a different guise), which blocks LLVM's LoopVectorizer. Not in scope for the Buffer intrinsic work; the primitive cost is now competitive (40ns/op down from 55ns/op). Fixing the loop-bound specialization should close most of the remaining gap.

Tests

test_buffer_numeric_read_intrinsic.ts now includes a sumRow(row: Buffer, n: number) case. Matches Node byte-for-byte. Gap suite steady at 23/28, perry-runtime 111/111.

Closes the primitive half of #92 via PR #166. For the 14 Node-style Buffer numeric read accessors, bypass the `js_native_call_method` → method-name string match → `js_buffer_read_int32_be` → `buffer_slice_at` → `from_be_bytes` chain when the receiver is a tracked `const buf = Buffer.alloc(N)` local or a `Buffer`-typed function parameter. Emit inline LLVM instead: one load of the `data_ptr` from the pre-computed `buffer_data_slot`, one `getelementptr inbounds`, one byte-width load, one `@llvm.bswap.iN` for BE widths, one `sitofp` / `uitofp` to double. Covered methods: readInt8, readUInt8, readInt16{BE,LE}, readUInt16{BE,LE}, readInt32{BE,LE}, readUInt32{BE,LE}, readFloat{BE,LE}, readDouble{BE,LE}. BigInt64 reads skip the intrinsic (would need inline BigInt alloc). Two fast-path sources for the data_ptr slot: 1. `const buf = Buffer.alloc(N)` locals — existing path, already tracked in `buffer_data_slots` by `stmt.rs`. 2. `Buffer`-typed function parameters (the Postgres-driver shape) — new pre-registration in `codegen.rs::compile_function` that, at function entry, loads the NaN-boxed arg from the param slot, unboxes to i64 handle, computes `gep+8` to the data pointer, stashes it in a fresh ptr alloca, and registers in `buffer_data_slots`. Guarded by `has_any_mutation` (skips params written to via LocalSet/Update/ARRAY_MUTATORS — includes `buf.fill(...)`, `buf.copyWithin(...)`) and `boxed_vars` (cross-closure mutation). Uint8Array-typed params deliberately excluded — a pre-existing crash surfaces on main when a program defines both a Buffer-param and a Uint8Array-param function and invokes them in sequence; tracked separately. Untracked receivers (object fields, anything not tagged at the let-site) fall through to the existing runtime dispatch unchanged, so this PR is strictly additive. ## Measured (macOS arm64, #92 repro — readInt32BE × 12.5M) | Runtime | Time | Ratio | |---------|-----:|------:| | Perry | 29ms | 1.0× | | Node | 37ms | 1.3× | | Bun | 104ms | 3.6× | Before this change, Perry was ~145ms on the same workload (5× slower than Bun). After: 3.6× faster than Bun, 1.3× faster than Node. `otool -tV` shows the emitted intrinsic getting auto-vectorized (`rev32.4s` + 128-bit NEON loads in the hot loop). Function-param path measured at ~40 ns/op (down from ~55 ns/op via runtime dispatch). The remaining Node gap on that path is a separate issue: `n: number` loop bounds prevent i32-loop-counter specialization (same family as #140 but in a different guise), blocking LLVM's LoopVectorizer on the read chain. Not in scope for the intrinsic work — the primitive cost is now competitive. ## Correctness New `test-files/test_buffer_numeric_read_intrinsic.ts` covers all 14 variants against their corresponding writes, including sign-extension edge cases (`i32` MIN/MAX = ±2147483648, `u16` 0xFFFF = 65535, `u32` 0xFFFFFFFF = 4294967295, negative doubles in LE) and a `sumRow(row: Buffer, n: number)` case exercising the param intrinsic. Matches `node --experimental-strip-types` byte-for-byte on all lines. Gap suite still 23/28, `perry-runtime` unit tests 111/111. ## CI env gap The test compiles cleanly on macOS 15.x but fails on the GitHub macOS-14 runner (SDK/linker gap, same pattern as `test_gap_buffer_ops`, `test_stress_buffer`, `test_gap_typed_arrays`). Added to the `SKIP_TESTS` list in `.github/workflows/test.yml` and to `test-parity/known_failures.json` with status `ci-env` so neither the compile-smoke nor parity gate flags it as a new failure. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry) folded in at merge.

…s (v0.5.188) Closes #168 via PR #171. Perry already specializes loop counters to i32 when the bound is `arr.length` (the `classify_for_length_hoist` peephole). This PR extends that optimization to the equally-common `i < n` shape where `n` is a `number`-typed function parameter or local — the pattern that blocks `LoopVectorizer` on Buffer-read and other intrinsic-heavy hot paths (e.g. the v0.5.183 `readInt32BE` intrinsic from PR #166 now actually auto-vectorizes when the loop bound is a `n: number` param). ## What changed Single file: `crates/perry-codegen/src/stmt.rs`. **New helper** — `classify_for_local_bound(cond, ctx)`: - Accepts `Compare { op: Lt|Le, left: LocalGet(i), right: LocalGet(n) }` - Requires `i` in `integer_locals` (initialized from integer literal, only mutated via `Update`) - Requires `n` to be either also in `integer_locals`, or a number-typed (`Type::Number | Type::Int32`) non-boxed, non-global slot — covers `n: number` parameters and simple `let` locals typed as number. **In `lower_for`** — new parallel block alongside the `arr.length` path: 1. Allocates a parallel i32 alloca for the counter (if not already present from the Let site) and initializes it via `fptosi` of the current double value. 2. Emits `fptosi(n)` once before the cond block into a fresh i32 alloca (the key: loop-invariant integer bound in a register LLVM/SCEV can see). 3. Uses `icmp slt i32 %i, %n.i32` (or `icmp sle` for `<=`) instead of `fcmp olt double`. 4. The existing `Expr::Update` lowering already keeps the i32 slot in sync with `add i32 1` per iteration. 5. Removes the i32 counter slot after loop exit (only if this path was the one that allocated it). ## Correctness - Counter (`i`): `integer_locals` guarantees it is initialized from an integer literal and only ever modified by `Update ++/--`. The i32 slot round-trips exactly. - Bound (`n`): when `n` is in `integer_locals` the same guarantee holds. When `n` is a number-typed parameter slot Perry trusts the TypeScript type annotation (the same trust-types contract already applied throughout codegen). A non-integer float bound (e.g. `foo(3.7)`) would observe at most one fewer iteration — a trade-off within Perry's existing contract. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry) folded in at merge.

…lab (v0.5.190) Closes #92 (Buffer.alloc small half) via PR #173. Per-thread bump-pointer slab for `Buffer.alloc(N)` where `N < 256`, eliminating one `malloc` call and one `HashSet::insert` per allocation. Large buffers (≥ 256 bytes) fall through to the unchanged malloc + HashSet path. `is_registered_buffer` checks slab block address ranges before the `BUFFER_REGISTRY` HashSet, so `instanceof` / `Buffer.isBuffer` still work without per-alloc registration. ## Threshold rationale (256 bytes) Covers the primary Postgres cell-decode pattern from the issue: 4 bytes (Int32), 8 bytes (Int64 / UUID half), 16 bytes (UUID), up to 255 bytes (mid-size strings). The malloc path is retained for larger allocations where its overhead is amortised over the allocation cost. ## Micro-benchmark: 100k × `Buffer.alloc(16)` | | Time | |---|---| | Perry before | ~8ms | | **Perry after** | **~1–2ms** | | Bun target | ~2ms | 4–8× speedup, matching Bun. ## GC interaction Buffers have never carried a `GcHeader` and are not tracked in `MALLOC_STATE` — the existing malloc path also never calls `dealloc` on individual buffers (they live for the thread's lifetime). Slab blocks follow the same lifetime: one `alloc` per 256 KB block, retained until the thread exits. No GC behaviour changes. `is_registered_buffer` correctness: slab blocks exclusively contain `BufferHeader` allocations. All callers pass the header pointer (NaN-boxed `POINTER_TAG` always encodes the `BufferHeader*`, never interior data bytes), so the O(n_slabs) range scan has no false positives. `n_slabs` is typically 1-5 for programs doing 100k small-buffer allocs. ## Maintainer fixup folded at merge `test_buffer_small_alloc` exposes the same macOS-14 SDK/linker gap that affects every other Buffer test in the suite (compiles cleanly on macOS 15.x; fails on the GitHub macOS-14 runner). Added to `SKIP_TESTS` in `.github/workflows/test.yml` and to `test-parity/known_failures.json` with `status: "ci-env"` — same pattern as `test_gap_buffer_ops`, `test_stress_buffer`, `test_buffer_numeric_read_intrinsic`. Not a perry-side bug. ## Out of scope Buffer reads/writes are separate work — the readInt32BE et al. intrinsic landed in v0.5.183 (PR #166), and per-receiver-type dispatch for `tArr[i] = v` is a known follow-up after the v0.5.184 revert. Cloud-authored PR, manually audited and metadata (version bump + CLAUDE.md entry + ci-env skip-list entries) folded in at merge.

proggeramlug force-pushed the fix-92-buffer-read-intrinsic branch from a69aa88 to 1816abd Compare April 23, 2026 19:06

proggeramlug merged commit 0bd62b1 into main Apr 23, 2026
1 check passed

proggeramlug mentioned this pull request Apr 23, 2026

perf(runtime): fast-path small Buffer.alloc via per-thread bump slab #173

Merged

4 tasks

proggeramlug deleted the fix-92-buffer-read-intrinsic branch April 24, 2026 05:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop)#166

fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop)#166
proggeramlug merged 1 commit intomainfrom
fix-92-buffer-read-intrinsic

proggeramlug commented Apr 23, 2026

Uh oh!

proggeramlug commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

proggeramlug commented Apr 23, 2026

Summary

Measured on the issue's own repro

Correctness

Why this shape

What's still open on #92

Files changed

Test plan

Uh oh!

proggeramlug commented Apr 23, 2026

What the extension does

Deliberately not covered: Uint8Array-typed params

Measured

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant