
[REFACTOR][OrcJIT] Per-session slab-pool memory manager#574

Open
cyx-6 wants to merge 7 commits into apache:main from cyx-6:orcjit-refactor

Conversation


@cyx-6 cyx-6 commented Apr 27, 2026

Summary

Replaces the single ~1 GB per-session arena with a growable pool of
fixed-size Slabs
plus explicit reclamation. The work is staged as
six independently revertible commits:

| # | Commit | Topic |
|---|---------|-------|
| 1 | b42d1c4 | Isolate LLVM patches under llvm_patches/ (code motion) |
| 2 | 52b7f9d | Lint fixes for the isolation commit |
| 3 | f4e5e10 | Release JITDylib + arena memory on DynamicLibrary drop |
| 4 | e9992ca | Stage A — extract Slab from ArenaJITLinkMemoryManager |
| 5 | 4cc8462 | Stage B — SlabPoolMemoryManager (growable per-session pool) |
| 6 | a8690ba | Stage C — ExecutionSession.clear_free_slabs() manual reclamation |

Net: orcjit_session.cc shrank from 864 to 498 lines (patches moved
out), orcjit_memory_manager.cc shrank from ~700 to ~190 lines
with a clean Slab/Pool split, and per-session JIT memory now passes
through four well-defined lifecycle moments (session start, pool growth,
dylib drop, manual reclaim) instead of one "arena lives as long as the
session" blob.

What each stage does

1. LLVM patches isolation (b42d1c4, 52b7f9d)

  • Move GOTPCRELXFixPlugin and InitFiniPlugin — two
    ObjectLinkingLayer plugins whose entire reason to exist is an
    upstream LLVM defect — out of orcjit_session.cc into
    addons/tvm_ffi_orcjit/src/ffi/llvm_patches/. Each file opens with
    a fixed-shape header (issue link, affected versions, trigger,
    symptom, ## Removal checklist) so deletion is mechanical once
    upstream lands a fix. A README.md indexes the patches.
  • Keeps the arena memory manager (still useful regardless of LLVM
    state) at the top level.
  • Reduces default arena capacity from 4 GB x86_64 / 8 GB AArch64 to
    1 GB on both.

2. Dylib removal + arena reuse fix (f4e5e10)

Previously, dropping a DynamicLibrary ran static destructors but
never called ExecutionSession::removeJITDylib — JIT code pages
accumulated until the whole session was destroyed. This commit:

  • Plumbs RemoveDylib through ~ORCJITDynamicLibraryObj (Linux,
    Windows, and macOS — I checked that MachOPlatform::notifyRemoving's
    llvm_unreachable is dead code; no caller actually reaches it).
  • Fixes a latent bug the new deallocate path exposed: commitPages
    stickily mprotects a 2 MB slab once, but finalize re-mprotects
    pages to r-x / r--. Returning a region to the free list left those
    protections in place → the next allocate's memset(0) would
    fault. Resets protection to RW in the allocator, not the deallocator,
    so the teardown-time invariant (ORC runtime may execute deallocated
    pages while DeallocActions unwind) is preserved.

3. Stage A: Slab abstraction (e9992ca)

  • Extracts all per-arena state (mmap reservation, dual-pool bump,
    commit bitmap, free list, protection primitives) into a new
    Slab class in orcjit_slab.{h,cc}.
  • ArenaJITLinkMemoryManager becomes a thin ~50-line wrapper that
    holds unique_ptr<Slab> and delegates. The capacity-halving retry
    loop (kernel negotiation) stays on the wrapper.
  • FinalizedAllocInfo gains a Slab* owner so future multi-slab
    routing is O(1).
  • Renames kSlabSize (2 MB commit granularity) → kCommitGranularity
    to free the name for the Slab class.
  • API break: Python kwarg arena_size → slab_size. One breaking
    change now instead of two once the multi-slab pool arrives.

Zero behavior change — pure restructuring.

4. Stage B: growable pool (4cc8462)

  • Rewrites ArenaJITLinkMemoryManager → SlabPoolMemoryManager,
    which holds vector<unique_ptr<Slab>>.
  • Default slab_size: 1 GB → 64 MB. Small workloads reserve 64
    MB up front instead of 1 GB. Oversize graphs get a dedicated slab
    sized to fit.
  • Adds retriable SlabPoolExhaustedError and
    Slab::computeGraphFootprint pre-flight helper.
  • Concurrency fix: pool_mu_ is dropped before calling into
    Slab::allocate or user callbacks. LLJIT materialization re-enters
    allocate() via nested lookups — a coarse lock deadlocks.
    (Reproduced, diagnosed, fixed during implementation; no such
    deadlock exists today.)

5. Stage C: manual reclaim (a8690ba)

  • Slab tracks live_count_ (atomic) + ever_used_ (one-way).
    noteAllocated / noteDeallocated are called at finalize /
    deallocateOne. isReclaimable = used at least once AND zero live
    allocations.
  • SlabPoolMemoryManager::clearFreeSlabs() partitions under the
    lock, moves discards out, drops the lock, lets ~Slab munmap
    outside the lock.
  • Exposed as session.clear_free_slabs() -> int in Python.
  • Chose manual over automatic warm-slab eviction because the automatic
    path has teardown-race hazards (ORC runtime can hold pointers into
    drained pages while its DeallocActions unwind). Manual lets the
    user pick a moment when those hazards don't apply (after
    del lib has returned, the C++ destructor has run, counts
    reflect reality). Can be upgraded to automatic later without
    changing the API.

API changes

  • Break: ExecutionSession(arena_size=...) → ExecutionSession(slab_size=...).
    Semantics: size per slab, not total capacity. Default 64 MB. -1
    still disables the slab allocator (LLJIT uses its default
    scattered-mmap allocator).
  • New: ExecutionSession.clear_free_slabs() -> int. Returns count
    of slabs reclaimed. No-op when the pool is disabled or on
    macOS/Windows (pool compiled out there).

Motivation

Before this stack:

  • orcjit_session.cc mixed core session setup with 300+ lines of
    inline LLVM workarounds.
  • Every session pre-reserved 1 GB of VA regardless of workload size.
  • Dropping a library didn't free its JIT code — memory accumulated
    until session destruction.
  • A session couldn't JIT more than ~1 GB of cumulative code.
  • No way to recover RSS on a long-running host that loads and unloads
    libraries.

After this stack all five are fixed, and each stage is independently
revertible.

Test plan

  • Build on aarch64 Linux: LLVM_PREFIX=/opt/llvm pip install -e addons/tvm_ffi_orcjit
  • pytest addons/tvm_ffi_orcjit/tests — 75 passed, 3 skipped (was 48 passed, 3 skipped at branch base)
  • Behavior smoke: repeated create/load/drop cycles, captured
    Function keeps dylib alive, oversize path, reclaim after drop —
    all covered by new parametrized tests
  • ruff check on touched Python
  • CI: lint / clang-tidy / macOS / Windows / x86_64 Linux

Moves the two ObjectLinkingLayer plugins whose entire reason to exist is
an upstream LLVM defect out of orcjit_session.cc into their own files
under addons/tvm_ffi_orcjit/src/ffi/llvm_patches/:

  - gotpcrelx_fix.{h,cc}     (x86_64 Linux GOTPCRELX relaxation bug)
  - init_fini_plugin.{h,cc}  (ELFNixPlatform gap, COFFPlatform stalled)

Each file opens with a fixed-shape header (LLVM issue link, affected
versions, trigger, symptom, Removal checklist) so that when upstream
lands a fix we can delete the file and the matching include / plugin
registration mechanically. README.md indexes the patches and their
removal criteria.

The arena memory manager stays at the top level -- it is a design-level
feature (contiguous r-x layout, THP, faster teardown) that we keep even
after LLVM #173269 is fixed, not a workaround.

Also reduces the default arena capacity from 4 GB (x86_64) / 8 GB
(AArch64) to 1 GB on both architectures. 1 GB covers typical ML JIT
workloads while staying well under the PC-relative relocation limit
(x86_64 +/-2 GB, AArch64 +/-4 GB) and is friendlier to memory-constrained
hosts (containers, CI runners).

No logic changes: orcjit_session.cc shrinks from 864 -> 498 lines of
pure code motion. CI-covered build and tests pass on aarch64 Linux.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors LLVM JITLink workarounds into a dedicated llvm_patches directory and reduces the default JIT arena size from 4GB/8GB to 1GB to better support memory-constrained environments. It introduces specific fixes for x86_64 GOTPCRELX relaxation bugs and implements cross-platform init/fini section handling for ELF and COFF. Feedback was provided regarding a potential lifetime issue when capturing JITDylib by reference in a lambda and a logic inconsistency in section priority parsing when getAsInteger fails.

// Handles ELF (.init_array, .ctors, .fini_array, .dtors),
// Mach-O (__DATA,__mod_init_func, __DATA,__mod_term_func),
// and COFF (.CRT$XC*, .CRT$XT*) section conventions.
Config.PostFixupPasses.emplace_back([this, &jit_dylib](llvm::jitlink::LinkGraph& G) {
medium

The lambda captures jit_dylib by reference. While JITDylib objects are generally long-lived within an ExecutionSession, jit_dylib is a reference obtained from MR.getTargetJITDylib(). Since PostFixupPasses are executed during the linking process which is part of the materialization, this is likely safe, but capturing by pointer or ensuring the JITDylib's lifetime explicitly would be more robust if the materialization context were to change.

// negate so that higher-numbered entries run first (reverse order).
if (is_init_array) {
if (section_name.consume_front(".init_array.")) {
section_name.getAsInteger(10, priority);
medium

getAsInteger returns true on error. The return value is ignored here, which means if a section name like .init_array.abc is encountered, priority will remain at its default value of 0. This is inconsistent with the default priority of 65535 used for sections without a suffix (line 160). Consider checking the return value and falling back to a sensible default or reporting an error.

- README.md: switch to per-line <!--- --> ASF license headers (adds an
  H1 right after, satisfies MD041) and drop the pipe-aligned table in
  favour of a bulleted index (no more MD060 aligned-pipe errors).
- CMakeLists.txt: pack add_library() source list the way cmake-format
  wants (two lines, not one per source).

Previously, dropping a `DynamicLibrary` ran any pending static destructors
but never removed the underlying `JITDylib` from the `ExecutionSession`.
JIT code pages stayed resident in the arena until the session itself was
destroyed, blocking any reuse across dylib lifetimes.

Changes:

- `ORCJITExecutionSessionObj::RemoveDylib`: erases any pending init/fini
  map entries keyed by the JITDylib* (so a recycled address starts clean)
  and calls `ExecutionSession::removeJITDylib`. Errors are swallowed so
  it is safe from a destructor.
- `~ORCJITDynamicLibraryObj` now calls `RemoveDylib` after the existing
  deinit step, on Linux, Windows, and macOS. `MachOPlatform::notifyRemoving`
  is `llvm_unreachable` but is dead code in LLVM — no caller invokes it;
  `removeJITDylibs` only calls `teardownJITDylib`, which is a plain map
  cleanup on every platform.
- Arena memory manager: fix a latent bug that surfaced as soon as
  deallocate was exercised. `commitPages` mprotects a 2 MB slab once
  (sticky `slab_committed_`), but finalize later re-mprotects pages to
  r-x / r--. Returning a region to the free list left those protections
  in place, and the next `allocate()` faulted in its `memset(0)`. Reset
  the region to RW at allocate time (not deallocate) so the teardown-time
  invariant — ORC runtime may still execute deallocated pages while its
  DeallocActions unwind — is preserved.

Tests (8 new, parametrized over C / C++ variants):

- Drop empty library; drop + recreate; 32-iteration load/drop cycle
  exercising the recycled-region path.
- Drop one library while another is live; captured `Function` keeps its
  library alive across `del lib`.
- Drop runs static destructors immediately (not at session teardown).
- Dropping a `set_link_order` caller leaves its base usable.
- Dropping every library before the session still produces clean teardown.
@cyx-6 cyx-6 changed the title [REFACTOR][OrcJIT] Isolate LLVM patches under llvm_patches/ [REFACTOR][OrcJIT] Isolate LLVM patches + release dylib memory on drop Apr 29, 2026
cyx-6 added 3 commits April 29, 2026 14:43
Stage A of the slab-pool refactor in refactor_plan.md. Pure internal
restructuring; zero behavior change.

- Move every per-arena concern — mmap reservation, dual-pool bump
  allocator, commit bitmap, free list with coalescing, page-protection
  primitives, InFlightAlloc logic — into a new `Slab` class in
  `orcjit_slab.{h,cc}`. Slab is the unit-of-VA-reservation; future
  stages will introduce a pool of them per session.
- `ArenaJITLinkMemoryManager` becomes a thin ~30-line wrapper that
  owns one `unique_ptr<Slab>` and delegates `allocate` / `deallocate`.
  The capacity-halving retry loop (negotiation between caller request
  and kernel RLIMIT_AS) stays here — it is not a Slab concern.
- `FinalizedAllocInfo` gains a `Slab* owner` field, stamped at finalize
  time. Redundant today (one Slab per session) but makes Stage B's
  pool-manager routing O(1) without address comparisons.
- Rename the 2 MB commit-granularity constant `kSlabSize` →
  `Slab::kCommitGranularity`. This frees the "Slab" name for the new
  class and clarifies that the constant describes page-commit / THP
  granularity, not the Slab-as-pool-unit.
- Rename the user-facing parameter `arena_size` / `arena_size_bytes`
  → `slab_size` / `slab_size_bytes` across the C++ constructors,
  Python kwarg, registered-global-func lambda, and tests. Stage A has
  one Slab per session so the semantics are identical; Stage B
  preserves the name while adding multi-slab pool behavior. This is
  an API break: `ExecutionSession(arena_size=...)` must become
  `ExecutionSession(slab_size=...)`. Doing the rename now avoids a
  second break once Stage B lands.

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 64 passed, 3
skipped (identical to pre-refactor). `nm` confirms `Slab::allocate`,
`Slab::deallocateOne`, etc. in the built .so.
Stage B of the slab-pool refactor. `ArenaJITLinkMemoryManager` is
replaced by `SlabPoolMemoryManager`, which holds a vector of `Slab`s
per session and grows on demand instead of pre-reserving one giant
arena.

Behavior changes:

- **Default capacity**: per-slab `slab_size` drops from 1 GB to 64 MB.
  Typical ML JIT graphs are well under 10 MB; the first slab now
  reserves 64 MB instead of 1 GB, dramatically reducing VA footprint
  for small workloads on memory-constrained hosts.
- **Growth on demand**: when no existing slab can fit a graph, a fresh
  slab is mmap'd at `slab_size` bytes and appended to the pool.
  Sessions that previously failed with "pool exhausted" for cumulative
  allocations beyond 1 GB now succeed transparently.
- **Oversize path**: a single graph whose footprint exceeds
  `slab_size - 2 * kCommitGranularity` gets its own dedicated slab
  sized to fit it, rounded to the commit granularity. One graph per
  oversize slab; the slab becomes available to other allocations
  after the graph is freed but usually isn't reused (sized tightly).

New infrastructure:

- `SlabPoolExhaustedError` — retriable error class. Emitted by
  `Slab::bumpAllocate` when the requested region exceeds the pool
  limit; caught by `SlabPoolMemoryManager::allocate` to fall through
  to the next slab or mmap a new one. Other errors (mmap, mprotect,
  JITLink) keep their existing types and are propagated to the caller
  without retry.
- `Slab::computeGraphFootprint(G, page_size)` — static helper that
  pre-computes per-pool byte totals for a graph. The pool manager
  uses it to make the normal-vs-oversize decision without first
  attempting a normal allocation.
- `classifyOverflowSections(G)` — file-scope helper extracted from
  `Slab::allocate`; also used by `computeGraphFootprint`. Keeps the
  two entry points consistent on which sections go to the overflow
  (separate-mmap) path.

Concurrency:

`pool_mu_` guards only the `slabs_` vector. It is dropped before
calling into `Slab::allocate` or the caller's `OnAllocated` callback,
because LLJIT materialization frequently invokes nested lookups from
inside those callbacks — a coarse lock here deadlocks. Existing slab
pointers are stable across concurrent grows (Stage B never removes
slabs), so a snapshot taken under the lock is safe to iterate
afterwards.

Initial-slab retry: the session constructor still halves capacity on
mmap failure, now down to `kMinSlabSize = 8 MB`. Subsequent grows use
exactly `slab_size_` with no retry — mmap failures propagate.

Tests:

- `test_arena.py`: `_ARENA_SIZE` bumped from 16 MB → 256 MB so the
  co-location tests continue to exercise single-slab invariants. The
  existing overflow-section test's contiguous-region assertion still
  passes because 256 MB is enough for one slab.
- `test_basic.py`: 3 new parametrized tests (×2 C/C++ variants = 6
  cases) under the "Slab-pool growth" section —
  `test_pool_grows_under_small_slab` (16 libs, 8 MB slab, pool must
  grow), `test_small_slab_recycles_after_drop` (32-iter load/drop
  exercises free-list within a slab), `test_pool_survives_mixed_load_drop_create`
  (interleaved paths).

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 70 passed,
3 skipped (was 64 + 3).

Scope: Stage C (warm-slab eviction + real munmap of drained slabs)
remains a planned follow-up. Drained slabs in Stage B stay mapped
until the session is destroyed, same as today's single arena.
…b reclamation

Stage C of the slab-pool refactor. Gives users explicit control over
when drained slabs are `munmap`'d back to the OS, rather than relying
on automatic eviction heuristics with their timing hazards.

Rationale: in Stage B a Slab stays mapped for the session's lifetime
even after every allocation on it is freed, so long-running workloads
(load model → unload → reload) never recover their RSS. Instead of
an automatic warm-slab eviction with deferred-reclamation bookkeeping
(hard to get right around teardown-time ORC-runtime references), this
change exposes a manual reclamation call. The user picks a safe
moment — typically right after `del lib` on a batch of libraries —
and calls `session.clear_free_slabs()` to release drained slabs.

Implementation:

- `Slab` gains an atomic `live_count_` and a one-way `ever_used_`
  flag. `noteAllocated()` fires in `InFlightAlloc::finalize` just
  before the FinalizedAlloc handle is published; `noteDeallocated()`
  fires in `deallocateOne()` after the region is returned to the
  free list. `isReclaimable()` returns true only when the slab was
  used at least once and currently has zero live allocations — a
  fresh initial slab is preserved.
- `SlabPoolMemoryManager::clearFreeSlabs()` partitions `slabs_` into
  keep and discard under `pool_mu_`, moves the discard half into a
  local vector, drops the lock, and lets `unique_ptr<Slab>` destructors
  `munmap` outside the lock. Returns the count reclaimed.
- `ORCJITExecutionSessionObj::ClearFreeSlabs()` forwards to the pool
  manager, or returns 0 if the pool is disabled (`slab_size=-1`) or
  on non-Linux (pool compiled out).
- Registered as `orcjit.ExecutionSessionClearFreeSlabs`; exposed on
  Python as `ExecutionSession.clear_free_slabs() -> int`.

Tests (5 new):

- `test_clear_free_slabs_no_drained`: fresh session and one-live-lib
  session both return 0.
- `test_clear_free_slabs_reclaims_oversize`: a 3 MB ZeroFill blob
  built on the fly via `tvm_ffi.cpp.build` forces the oversize path
  under a 4 MB `slab_size`; dropping the lib makes the dedicated
  slab reclaimable.
- `test_clear_free_slabs_idempotent`: a second call after everything
  is reclaimed returns 0.
- `test_clear_free_slabs_preserves_live_pool`: drop-one + keep-one
  — only the drained slab is reclaimed, the kept lib still executes.
- `test_clear_free_slabs_disabled_pool`: no-op under `slab_size=-1`.

Safety note (in the Python docstring): call when no JIT work is in
flight on another thread. Under Python's GIL, once `del lib` returns,
the C++ destructor has finished and the slab's live count reflects
the drop; subsequent `clear_free_slabs()` is safe.

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 75 passed,
3 skipped (was 70 + 3).
@cyx-6 cyx-6 changed the title [REFACTOR][OrcJIT] Isolate LLVM patches + release dylib memory on drop [REFACTOR][OrcJIT] Per-session slab-pool memory manager Apr 29, 2026
Collapse the allocator's five-step logic (footprint pre-check, usable
estimate, oversize branch, first-fit, grow-fresh) into two steps:

  1. First-fit over existing slabs; retry on SlabPoolExhaustedError.
  2. On miss, grow the pool with a slab sized by a new
     Slab::capacityForFootprint helper — power-of-2 doubling from
     slab_size until both per-pool budgets cover the graph's footprint.

The separate "oversize path" disappears: a skewed or oversize graph is
just a miss that happens to grow the pool to 2·slab_size, 4·slab_size,
etc. instead of slab_size.  Net −90 lines of allocator logic, +40 lines
of helper + updated comments.

Observable change: a dropped oversize slab can now be reused by a
subsequent graph that fits — strictly better for RSS.  The
`test_clear_free_slabs_preserves_live_pool` test was adjusted to use
two distinctly-sized blobs (3 MB and 5 MB on a 4 MB pool) so drop and
keep land on separate slabs, preserving the invariant the test was
really checking.  Extend `_build_big_object` to accept a `name` so two
objects can live in the same tmp_path.

Also rename tests/test_arena.py to tests/test_memory_manager.py and
sweep the "arena" vocabulary to "slab" throughout — the
ArenaJITLinkMemoryManager class was replaced by SlabPoolMemoryManager
in commit 4cc8462.
@cyx-6 cyx-6 force-pushed the orcjit-refactor branch from 4542333 to b078a98 on April 30, 2026 17:35