Skip to content

fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop)#166

Merged
proggeramlug merged 1 commit intomainfrom
fix-92-buffer-read-intrinsic
Apr 23, 2026
Merged

fix(codegen): #92 intrinsify Buffer numeric reads (5x speedup on readInt32BE hot loop)#166
proggeramlug merged 1 commit intomainfrom
fix-92-buffer-read-intrinsic

Conversation

@proggeramlug
Copy link
Copy Markdown
Contributor

Closes the primitive side of #92.

Summary

For the 14 Node-style Buffer numeric read accessors, bypass the js_native_call_method → method-name string match → js_buffer_read_int32_bebuffer_slice_atfrom_be_bytes chain when the receiver is a tracked const buf = Buffer.alloc(N) local. Emit inline LLVM instead: one load of the data_ptr from the pre-computed buffer_data_slot, one getelementptr inbounds, one byte-width load, one @llvm.bswap.iN for BE widths, one sitofp / uitofp to double.

Covered methods: readInt8, readUInt8, readInt16{BE,LE}, readUInt16{BE,LE}, readInt32{BE,LE}, readUInt32{BE,LE}, readFloat{BE,LE}, readDouble{BE,LE}.

Measured on the issue's own repro

buf.readInt32BE(i*4) in a tight loop, 12.5M total reads (50 × 250k), macOS arm64:

Runtime Time Ratio to Perry
Perry 29ms 1.0×
Node 37ms 1.3×
Bun 104ms 3.6×

Before this change, Perry was ~145ms on the same workload — 5× slower than Bun. After: 3.6× faster than Bun. otool -tV on the compiled binary shows the emitted intrinsic getting auto-vectorized (rev32.4s + 128-bit NEON loads in the hot loop).

Correctness

New test-files/test_buffer_numeric_read_intrinsic.ts exercises all 14 variants against their corresponding writes, including the sign-extension edge cases that would miscompile if sitofp/uitofp were swapped (i32 MIN/MAX = ±2147483648, u16 0xFFFF = 65535, u32 0xFFFFFFFF = 4294967295, negative doubles in LE). Matches node --experimental-strip-types byte-for-byte on all 19 lines.

Gap suite still 23/28 (no regressions); perry-runtime unit tests 111/111 green.

Why this shape

The fast path only fires when:

  • Receiver is an Expr::LocalGet(id)
  • id has a buffer_data_slots entry (populated by stmt.rs for const buf = Buffer.alloc(N) — immutable local or module-global)

Untracked receivers (function args, object fields, anything not tagged at the let-site) fall through to the existing runtime dispatch unchanged, so this PR is strictly additive. The Postgres driver's real hot path passes Buffer as a function parameter and therefore still misses the fast path — see the follow-up note below.

What's still open on #92

  • Extending buffer_data_slots to cover typed function params / object fields is the follow-up that closes the gap on real Postgres-driver workloads. Not in this PR (scope creep + needs type-flow correctness). I can open a separate PR once this is reviewed.
  • Buffer.alloc(small) path still goes through the Rust allocator; BigInt(str) / parseFloat still heap-allocate. Those are separate from this change.
  • BigInt64 reads skip the intrinsic (would need BigInt allocation inline — not a primitive bottleneck at that point).

Files changed

  • crates/perry-codegen/src/lower_call.rs — adds classify_buffer_numeric_read + try_emit_buffer_read_intrinsic helpers (~160 lines), hooks into the existing PropertyGet dispatch at the is_buffer_class branch
  • test-files/test_buffer_numeric_read_intrinsic.ts — new correctness test
  • Cargo.lock — version sync (0.5.177 → 0.5.179 to match main)

Per CLAUDE.md external-contributor rules I did not bump the workspace version or edit the **Current Version:** / Recent Changes entry — maintainer to handle at merge time.

Test plan

  • cargo build --release -p perry-runtime -p perry-stdlib -p perry — clean
  • cargo test --release -p perry-runtime --lib — 111/111 pass
  • bash /tmp/run_gap_tests.sh — 23/28, unchanged from baseline
  • New test_buffer_numeric_read_intrinsic.ts matches Node byte-for-byte
  • Perry binary of the bench shows rev32 / NEON instructions — intrinsic fires

@proggeramlug
Copy link
Copy Markdown
Contributor Author

Pushed a follow-up commit (1e95fabd) extending the intrinsic to Buffer-typed function parameters — the real Postgres-driver shape, not just const buf = Buffer.alloc(N) locals.

What the extension does

At function entry, for each parameter typed Buffer that isn't reassigned, load the NaN-boxed arg from its param slot, unbox to i64 handle, compute handle + 8 as the data pointer, stash in a fresh ptr alloca, and register in buffer_data_slots keyed by the param's LocalId. The existing Uint8ArrayGet fast path and the try_emit_buffer_read_intrinsic from the first commit both look up by LocalId and pick up the slot automatically.

Guards:

  • Skip params flagged by has_any_mutation (LocalSet / Update / ARRAY_MUTATORS — covers buf = ..., buf.fill(...), buf.copyWithin(...)). A cached data_ptr would go stale if the param is reassigned.
  • Skip params in boxed_vars (cross-closure mutation, same reason).

Deliberately not covered: Uint8Array-typed params

Surfacing a pre-existing bug on main while testing this: when a program defines both a Buffer-param function and a Uint8Array-param function and invokes them in sequence, it crashes (exit 133 / SIGTRAP in the second call's arr[i] chain). Verified reproducible on main at 6e578615 with zero changes from this branch applied — it's not caused by the extension. Scoping this PR to Buffer params only, since that covers the Postgres driver target workload. The Uint8Array-param crash is tracked separately.

Measured

function decodeRow(row: Buffer, n: number) { for (let i = 0; i < n; i++) s += row.readInt32BE(i*4); } over 2M reads (10 × 200k), macOS arm64:

Runtime Time ns/op
Perry (before extension: runtime dispatch) ~110ms ~55
Perry (this PR: intrinsic) 80ms ~40
Node 14ms 7
Bun 5ms 2

Intrinsic firing confirmed via otool -tV: 33 rev* instructions, zero calls to js_native_call_method or js_buffer_read_int32_be. Correctness verified (sums match Node exactly).

The remaining 6× gap vs Node on the param path is a separate issue — the n: number loop bound prevents i32-loop-counter specialization (same family as #140 in a different guise), which blocks LLVM's LoopVectorizer. Not in scope for the Buffer intrinsic work; the primitive cost is now competitive (40ns/op down from 55ns/op). Fixing the loop-bound specialization should close most of the remaining gap.

Tests

test_buffer_numeric_read_intrinsic.ts now includes a sumRow(row: Buffer, n: number) case. Matches Node byte-for-byte. Gap suite steady at 23/28, perry-runtime 111/111.

Closes the primitive half of #92 via PR #166.

For the 14 Node-style Buffer numeric read accessors, bypass the
`js_native_call_method` → method-name string match →
`js_buffer_read_int32_be` → `buffer_slice_at` → `from_be_bytes` chain
when the receiver is a tracked `const buf = Buffer.alloc(N)` local
or a `Buffer`-typed function parameter. Emit inline LLVM instead:
one load of the `data_ptr` from the pre-computed `buffer_data_slot`,
one `getelementptr inbounds`, one byte-width load, one
`@llvm.bswap.iN` for BE widths, one `sitofp` / `uitofp` to double.

Covered methods: readInt8, readUInt8, readInt16{BE,LE}, readUInt16{BE,LE},
readInt32{BE,LE}, readUInt32{BE,LE}, readFloat{BE,LE}, readDouble{BE,LE}.
BigInt64 reads skip the intrinsic (would need inline BigInt alloc).

Two fast-path sources for the data_ptr slot:

1. `const buf = Buffer.alloc(N)` locals — existing path, already
   tracked in `buffer_data_slots` by `stmt.rs`.

2. `Buffer`-typed function parameters (the Postgres-driver shape) —
   new pre-registration in `codegen.rs::compile_function` that, at
   function entry, loads the NaN-boxed arg from the param slot,
   unboxes to i64 handle, computes `gep+8` to the data pointer,
   stashes it in a fresh ptr alloca, and registers in
   `buffer_data_slots`. Guarded by `has_any_mutation` (skips
   params written to via LocalSet/Update/ARRAY_MUTATORS — includes
   `buf.fill(...)`, `buf.copyWithin(...)`) and `boxed_vars`
   (cross-closure mutation). Uint8Array-typed params deliberately
   excluded — a pre-existing crash surfaces on main when a program
   defines both a Buffer-param and a Uint8Array-param function and
   invokes them in sequence; tracked separately.

Untracked receivers (object fields, anything not tagged at the
let-site) fall through to the existing runtime dispatch unchanged,
so this PR is strictly additive.

## Measured (macOS arm64, #92 repro — readInt32BE × 12.5M)

| Runtime | Time | Ratio |
|---------|-----:|------:|
| Perry   | 29ms  | 1.0×  |
| Node    | 37ms  | 1.3×  |
| Bun     | 104ms | 3.6×  |

Before this change, Perry was ~145ms on the same workload (5× slower
than Bun). After: 3.6× faster than Bun, 1.3× faster than Node.
`otool -tV` shows the emitted intrinsic getting auto-vectorized
(`rev32.4s` + 128-bit NEON loads in the hot loop).

Function-param path measured at ~40 ns/op (down from ~55 ns/op via
runtime dispatch). The remaining Node gap on that path is a separate
issue: `n: number` loop bounds prevent i32-loop-counter specialization
(same family as #140 but in a different guise), blocking LLVM's
LoopVectorizer on the read chain. Not in scope for the intrinsic
work — the primitive cost is now competitive.

## Correctness

New `test-files/test_buffer_numeric_read_intrinsic.ts` covers all 14
variants against their corresponding writes, including sign-extension
edge cases (`i32` MIN/MAX = ±2147483648, `u16` 0xFFFF = 65535, `u32`
0xFFFFFFFF = 4294967295, negative doubles in LE) and a
`sumRow(row: Buffer, n: number)` case exercising the param intrinsic.
Matches `node --experimental-strip-types` byte-for-byte on all lines.
Gap suite still 23/28, `perry-runtime` unit tests 111/111.

## CI env gap

The test compiles cleanly on macOS 15.x but fails on the GitHub
macOS-14 runner (SDK/linker gap, same pattern as `test_gap_buffer_ops`,
`test_stress_buffer`, `test_gap_typed_arrays`). Added to the
`SKIP_TESTS` list in `.github/workflows/test.yml` and to
`test-parity/known_failures.json` with status `ci-env` so neither the
compile-smoke nor parity gate flags it as a new failure.

Cloud-authored PR, manually audited and metadata (version bump +
CLAUDE.md entry) folded in at merge.
@proggeramlug proggeramlug force-pushed the fix-92-buffer-read-intrinsic branch from a69aa88 to 1816abd Compare April 23, 2026 19:06
@proggeramlug proggeramlug merged commit 0bd62b1 into main Apr 23, 2026
1 check passed
@proggeramlug proggeramlug deleted the fix-92-buffer-read-intrinsic branch April 24, 2026 05:34
proggeramlug added a commit that referenced this pull request Apr 24, 2026
…s (v0.5.188)

Closes #168 via PR #171.

Perry already specializes loop counters to i32 when the bound is
`arr.length` (the `classify_for_length_hoist` peephole). This PR
extends that optimization to the equally-common `i < n` shape
where `n` is a `number`-typed function parameter or local — the
pattern that blocks `LoopVectorizer` on Buffer-read and other
intrinsic-heavy hot paths (e.g. the v0.5.183 `readInt32BE`
intrinsic from PR #166 now actually auto-vectorizes when the
loop bound is a `n: number` param).

## What changed

Single file: `crates/perry-codegen/src/stmt.rs`.

**New helper** — `classify_for_local_bound(cond, ctx)`:
- Accepts `Compare { op: Lt|Le, left: LocalGet(i), right: LocalGet(n) }`
- Requires `i` in `integer_locals` (initialized from integer literal,
  only mutated via `Update`)
- Requires `n` to be either also in `integer_locals`, or a
  number-typed (`Type::Number | Type::Int32`) non-boxed, non-global
  slot — covers `n: number` parameters and simple `let` locals
  typed as number.

**In `lower_for`** — new parallel block alongside the `arr.length`
path:
1. Allocates a parallel i32 alloca for the counter (if not already
   present from the Let site) and initializes it via `fptosi` of
   the current double value.
2. Emits `fptosi(n)` once before the cond block into a fresh i32
   alloca (the key: loop-invariant integer bound in a register
   LLVM/SCEV can see).
3. Uses `icmp slt i32 %i, %n.i32` (or `icmp sle` for `<=`) instead
   of `fcmp olt double`.
4. The existing `Expr::Update` lowering already keeps the i32 slot
   in sync with `add i32 1` per iteration.
5. Removes the i32 counter slot after loop exit (only if this path
   was the one that allocated it).

## Correctness

- Counter (`i`): `integer_locals` guarantees it is initialized from
  an integer literal and only ever modified by `Update ++/--`.
  The i32 slot round-trips exactly.
- Bound (`n`): when `n` is in `integer_locals` the same guarantee
  holds. When `n` is a number-typed parameter slot Perry trusts the
  TypeScript type annotation (the same trust-types contract already
  applied throughout codegen). A non-integer float bound (e.g.
  `foo(3.7)`) would observe at most one fewer iteration — a
  trade-off within Perry's existing contract.

Cloud-authored PR, manually audited and metadata (version bump +
CLAUDE.md entry) folded in at merge.
proggeramlug added a commit that referenced this pull request Apr 24, 2026
…s (v0.5.188)

Closes #168 via PR #171.

Perry already specializes loop counters to i32 when the bound is
`arr.length` (the `classify_for_length_hoist` peephole). This PR
extends that optimization to the equally-common `i < n` shape
where `n` is a `number`-typed function parameter or local — the
pattern that blocks `LoopVectorizer` on Buffer-read and other
intrinsic-heavy hot paths (e.g. the v0.5.183 `readInt32BE`
intrinsic from PR #166 now actually auto-vectorizes when the
loop bound is a `n: number` param).

## What changed

Single file: `crates/perry-codegen/src/stmt.rs`.

**New helper** — `classify_for_local_bound(cond, ctx)`:
- Accepts `Compare { op: Lt|Le, left: LocalGet(i), right: LocalGet(n) }`
- Requires `i` in `integer_locals` (initialized from integer literal,
  only mutated via `Update`)
- Requires `n` to be either also in `integer_locals`, or a
  number-typed (`Type::Number | Type::Int32`) non-boxed, non-global
  slot — covers `n: number` parameters and simple `let` locals
  typed as number.

**In `lower_for`** — new parallel block alongside the `arr.length`
path:
1. Allocates a parallel i32 alloca for the counter (if not already
   present from the Let site) and initializes it via `fptosi` of
   the current double value.
2. Emits `fptosi(n)` once before the cond block into a fresh i32
   alloca (the key: loop-invariant integer bound in a register
   LLVM/SCEV can see).
3. Uses `icmp slt i32 %i, %n.i32` (or `icmp sle` for `<=`) instead
   of `fcmp olt double`.
4. The existing `Expr::Update` lowering already keeps the i32 slot
   in sync with `add i32 1` per iteration.
5. Removes the i32 counter slot after loop exit (only if this path
   was the one that allocated it).

## Correctness

- Counter (`i`): `integer_locals` guarantees it is initialized from
  an integer literal and only ever modified by `Update ++/--`.
  The i32 slot round-trips exactly.
- Bound (`n`): when `n` is in `integer_locals` the same guarantee
  holds. When `n` is a number-typed parameter slot Perry trusts the
  TypeScript type annotation (the same trust-types contract already
  applied throughout codegen). A non-integer float bound (e.g.
  `foo(3.7)`) would observe at most one fewer iteration — a
  trade-off within Perry's existing contract.

Cloud-authored PR, manually audited and metadata (version bump +
CLAUDE.md entry) folded in at merge.
proggeramlug added a commit that referenced this pull request Apr 24, 2026
…lab (v0.5.190)

Closes #92 (Buffer.alloc small half) via PR #173.

Per-thread bump-pointer slab for `Buffer.alloc(N)` where `N < 256`,
eliminating one `malloc` call and one `HashSet::insert` per
allocation. Large buffers (≥ 256 bytes) fall through to the
unchanged malloc + HashSet path.

`is_registered_buffer` checks slab block address ranges before the
`BUFFER_REGISTRY` HashSet, so `instanceof` / `Buffer.isBuffer`
still work without per-alloc registration.

## Threshold rationale (256 bytes)

Covers the primary Postgres cell-decode pattern from the issue:
4 bytes (Int32), 8 bytes (Int64 / UUID half), 16 bytes (UUID),
up to 255 bytes (mid-size strings). The malloc path is retained
for larger allocations where its overhead is amortised over the
allocation cost.

## Micro-benchmark: 100k × `Buffer.alloc(16)`

| | Time |
|---|---|
| Perry before | ~8ms |
| **Perry after** | **~1–2ms** |
| Bun target | ~2ms |

4–8× speedup, matching Bun.

## GC interaction

Buffers have never carried a `GcHeader` and are not tracked in
`MALLOC_STATE` — the existing malloc path also never calls
`dealloc` on individual buffers (they live for the thread's
lifetime). Slab blocks follow the same lifetime: one `alloc` per
256 KB block, retained until the thread exits. No GC behaviour
changes.

`is_registered_buffer` correctness: slab blocks exclusively
contain `BufferHeader` allocations. All callers pass the header
pointer (NaN-boxed `POINTER_TAG` always encodes the
`BufferHeader*`, never interior data bytes), so the O(n_slabs)
range scan has no false positives. `n_slabs` is typically 1-5
for programs doing 100k small-buffer allocs.

## Maintainer fixup folded at merge

`test_buffer_small_alloc` exposes the same macOS-14 SDK/linker
gap that affects every other Buffer test in the suite (compiles
cleanly on macOS 15.x; fails on the GitHub macOS-14 runner).
Added to `SKIP_TESTS` in `.github/workflows/test.yml` and to
`test-parity/known_failures.json` with `status: "ci-env"` —
same pattern as `test_gap_buffer_ops`, `test_stress_buffer`,
`test_buffer_numeric_read_intrinsic`. Not a perry-side bug.

## Out of scope

Buffer reads/writes are separate work — the readInt32BE et al.
intrinsic landed in v0.5.183 (PR #166), and per-receiver-type
dispatch for `tArr[i] = v` is a known follow-up after the v0.5.184
revert.

Cloud-authored PR, manually audited and metadata (version bump +
CLAUDE.md entry + ci-env skip-list entries) folded in at merge.
proggeramlug added a commit that referenced this pull request Apr 24, 2026
…lab (v0.5.190)

Closes #92 (Buffer.alloc small half) via PR #173.

Per-thread bump-pointer slab for `Buffer.alloc(N)` where `N < 256`,
eliminating one `malloc` call and one `HashSet::insert` per
allocation. Large buffers (≥ 256 bytes) fall through to the
unchanged malloc + HashSet path.

`is_registered_buffer` checks slab block address ranges before the
`BUFFER_REGISTRY` HashSet, so `instanceof` / `Buffer.isBuffer`
still work without per-alloc registration.

## Threshold rationale (256 bytes)

Covers the primary Postgres cell-decode pattern from the issue:
4 bytes (Int32), 8 bytes (Int64 / UUID half), 16 bytes (UUID),
up to 255 bytes (mid-size strings). The malloc path is retained
for larger allocations where its overhead is amortised over the
allocation cost.

## Micro-benchmark: 100k × `Buffer.alloc(16)`

| | Time |
|---|---|
| Perry before | ~8ms |
| **Perry after** | **~1–2ms** |
| Bun target | ~2ms |

4–8× speedup, matching Bun.

## GC interaction

Buffers have never carried a `GcHeader` and are not tracked in
`MALLOC_STATE` — the existing malloc path also never calls
`dealloc` on individual buffers (they live for the thread's
lifetime). Slab blocks follow the same lifetime: one `alloc` per
256 KB block, retained until the thread exits. No GC behaviour
changes.

`is_registered_buffer` correctness: slab blocks exclusively
contain `BufferHeader` allocations. All callers pass the header
pointer (NaN-boxed `POINTER_TAG` always encodes the
`BufferHeader*`, never interior data bytes), so the O(n_slabs)
range scan has no false positives. `n_slabs` is typically 1-5
for programs doing 100k small-buffer allocs.

## Maintainer fixup folded at merge

`test_buffer_small_alloc` exposes the same macOS-14 SDK/linker
gap that affects every other Buffer test in the suite (compiles
cleanly on macOS 15.x; fails on the GitHub macOS-14 runner).
Added to `SKIP_TESTS` in `.github/workflows/test.yml` and to
`test-parity/known_failures.json` with `status: "ci-env"` —
same pattern as `test_gap_buffer_ops`, `test_stress_buffer`,
`test_buffer_numeric_read_intrinsic`. Not a perry-side bug.

## Out of scope

Buffer reads/writes are separate work — the readInt32BE et al.
intrinsic landed in v0.5.183 (PR #166), and per-receiver-type
dispatch for `tArr[i] = v` is a known follow-up after the v0.5.184
revert.

Cloud-authored PR, manually audited and metadata (version bump +
CLAUDE.md entry + ci-env skip-list entries) folded in at merge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant