[REFACTOR][OrcJIT] Per-session slab-pool memory manager#574
cyx-6 wants to merge 7 commits into apache:main
Conversation
Moves the two ObjectLinkingLayer plugins whose entire reason to exist is
an upstream LLVM defect out of orcjit_session.cc into their own files
under addons/tvm_ffi_orcjit/src/ffi/llvm_patches/:
- gotpcrelx_fix.{h,cc} (x86_64 Linux GOTPCRELX relaxation bug)
- init_fini_plugin.{h,cc} (ELFNixPlatform gap, COFFPlatform stalled)
Each file opens with a fixed-shape header (LLVM issue link, affected
versions, trigger, symptom, Removal checklist) so that when upstream
lands a fix we can delete the file and the matching include / plugin
registration mechanically. README.md indexes the patches and their
removal criteria.
The arena memory manager stays at the top level -- it is a design-level
feature (contiguous r-x layout, THP, faster teardown) that we keep even
after LLVM #173269 is fixed, not a workaround.
Also reduces the default arena capacity from 4 GB (x86_64) / 8 GB
(AArch64) to 1 GB on both architectures. 1 GB covers typical ML JIT
workloads while staying well under the PC-relative relocation limit
(x86_64 +/-2 GB, AArch64 +/-4 GB) and is friendlier to memory-constrained
hosts (containers, CI runners).
No logic changes: orcjit_session.cc shrinks from 864 to 498 lines through
pure code motion. The CI-covered build and tests pass on aarch64 Linux.
Code Review
This pull request refactors LLVM JITLink workarounds into a dedicated llvm_patches directory and reduces the default JIT arena size from 4GB/8GB to 1GB to better support memory-constrained environments. It introduces specific fixes for x86_64 GOTPCRELX relaxation bugs and implements cross-platform init/fini section handling for ELF and COFF. Feedback was provided regarding a potential lifetime issue when capturing JITDylib by reference in a lambda and a logic inconsistency in section priority parsing when getAsInteger fails.
    // Handles ELF (.init_array, .ctors, .fini_array, .dtors),
    // Mach-O (__DATA,__mod_init_func, __DATA,__mod_term_func),
    // and COFF (.CRT$XC*, .CRT$XT*) section conventions.
    Config.PostFixupPasses.emplace_back([this, &jit_dylib](llvm::jitlink::LinkGraph& G) {
The lambda captures jit_dylib by reference. While JITDylib objects are generally long-lived within an ExecutionSession, jit_dylib is a reference obtained from MR.getTargetJITDylib(). Since PostFixupPasses are executed during the linking process which is part of the materialization, this is likely safe, but capturing by pointer or ensuring the JITDylib's lifetime explicitly would be more robust if the materialization context were to change.
    // negate so that higher-numbered entries run first (reverse order).
    if (is_init_array) {
      if (section_name.consume_front(".init_array.")) {
        section_name.getAsInteger(10, priority);
getAsInteger returns true on error. The return value is ignored here, which means if a section name like .init_array.abc is encountered, priority will remain at its default value of 0. This is inconsistent with the default priority of 65535 used for sections without a suffix (line 160). Consider checking the return value and falling back to a sensible default or reporting an error.
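The suggested fix can be sketched in Python (a hypothetical stand-in for the LLVM `StringRef` calls; `DEFAULT_PRIORITY` mirrors the 65535 no-suffix default the review mentions):

```python
# Sketch of the reviewer's suggestion: parse the numeric suffix of an
# .init_array.<N> section name, and on parse failure fall back to the same
# default used for suffix-less sections instead of silently leaving 0.

DEFAULT_PRIORITY = 65535  # the no-suffix default mentioned in the review

def parse_init_priority(section_name: str) -> int:
    prefix = ".init_array."
    if section_name.startswith(prefix):
        suffix = section_name[len(prefix):]
        try:
            return int(suffix, 10)      # getAsInteger equivalent
        except ValueError:
            return DEFAULT_PRIORITY     # ".init_array.abc" -> sane default
    return DEFAULT_PRIORITY             # plain ".init_array"

print(parse_init_priority(".init_array.101"))  # 101
print(parse_init_priority(".init_array.abc"))  # 65535, not 0
```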
- README.md: switch to per-line `<!--- -->` ASF license headers (adds an H1 right after, satisfying MD041) and drop the pipe-aligned table in favour of a bulleted index (no more MD060 aligned-pipe errors).
- CMakeLists.txt: pack the `add_library()` source list the way cmake-format wants (two lines, not one per source).
Previously, dropping a `DynamicLibrary` ran any pending static destructors but never removed the underlying `JITDylib` from the `ExecutionSession`. JIT code pages stayed resident in the arena until the session itself was destroyed, blocking any reuse across dylib lifetimes.

Changes:
- `ORCJITExecutionSessionObj::RemoveDylib`: erases any pending init/fini map entries keyed by the JITDylib* (so a recycled address starts clean) and calls `ExecutionSession::removeJITDylib`. Errors are swallowed so it is safe from a destructor.
- `~ORCJITDynamicLibraryObj` now calls `RemoveDylib` after the existing deinit step, on Linux, Windows, and macOS. `MachOPlatform::notifyRemoving` is `llvm_unreachable` but is dead code in LLVM: no caller invokes it; `removeJITDylibs` only calls `teardownJITDylib`, which is a plain map cleanup on every platform.
- Arena memory manager: fix a latent bug that surfaced as soon as deallocate was exercised. `commitPages` mprotects a 2 MB slab once (sticky `slab_committed_`), but finalize later re-mprotects pages to r-x / r--. Returning a region to the free list left those protections in place, and the next `allocate()` faulted in its `memset(0)`. Reset the region to RW at allocate time (not deallocate time) so the teardown-time invariant is preserved: the ORC runtime may still execute deallocated pages while its DeallocActions unwind.

Tests (8 new, parametrized over C / C++ variants):
- Drop an empty library; drop + recreate; a 32-iteration load/drop cycle exercising the recycled-region path.
- Drop one library while another is live; a captured `Function` keeps its library alive across `del lib`.
- Drop runs static destructors immediately (not at session teardown).
- Dropping a `set_link_order` caller leaves its base usable.
- Dropping every library before the session still produces clean teardown.
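The protection fix can be illustrated with a toy Python model (not the real C++ allocator; names and the 4096-byte region are illustrative):

```python
# Toy model of the fix described above: deallocate returns a region to the
# free list with its r-x / r-- protections intact (so the ORC runtime can
# still execute deallocated pages during teardown); allocate resets the
# region to rw- *before* zeroing it, so the memset(0) cannot fault.

RW, RX = "rw-", "r-x"

class ToySlab:
    def __init__(self):
        self.free_list = []   # list of (offset, size) regions
        self.prot = {}        # offset -> current page protection

    def finalize(self, off):
        self.prot[off] = RX   # stands in for the mprotect at finalize time

    def deallocate(self, off, size):
        # Intentionally does NOT touch protections: teardown code may still
        # be executing out of these pages.
        self.free_list.append((off, size))

    def allocate(self):
        off, size = self.free_list.pop()
        self.prot[off] = RW        # reset to RW first...
        zeroed = bytes(size)       # ...so zeroing the region is safe
        return off, zeroed

slab = ToySlab()
slab.finalize(0)
slab.deallocate(0, 4096)
off, data = slab.allocate()
print(slab.prot[off], len(data))   # rw- 4096
```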
Stage A of the slab-pool refactor in refactor_plan.md. Pure internal
restructuring; zero behavior change.
- Move every per-arena concern — mmap reservation, dual-pool bump
allocator, commit bitmap, free list with coalescing, page-protection
primitives, InFlightAlloc logic — into a new `Slab` class in
`orcjit_slab.{h,cc}`. Slab is the unit-of-VA-reservation; future
stages will introduce a pool of them per session.
- `ArenaJITLinkMemoryManager` becomes a thin ~30-line wrapper that
owns one `unique_ptr<Slab>` and delegates `allocate` / `deallocate`.
The capacity-halving retry loop (negotiation between caller request
and kernel RLIMIT_AS) stays here — it is not a Slab concern.
- `FinalizedAllocInfo` gains a `Slab* owner` field, stamped at finalize
time. Redundant today (one Slab per session) but makes Stage B's
pool-manager routing O(1) without address comparisons.
- Rename the 2 MB commit-granularity constant `kSlabSize` →
`Slab::kCommitGranularity`. This frees the "Slab" name for the new
class and clarifies that the constant describes page-commit / THP
granularity, not the Slab-as-pool-unit.
- Rename the user-facing parameter `arena_size` / `arena_size_bytes`
→ `slab_size` / `slab_size_bytes` across the C++ constructors,
Python kwarg, registered-global-func lambda, and tests. Stage A has
one Slab per session so the semantics are identical; Stage B
preserves the name while adding multi-slab pool behavior. This is
an API break: `ExecutionSession(arena_size=...)` must become
`ExecutionSession(slab_size=...)`. Doing the rename now avoids a
second break once Stage B lands.
Verification: `pytest addons/tvm_ffi_orcjit/tests` — 64 passed, 3
skipped (identical to pre-refactor). `nm` confirms `Slab::allocate`,
`Slab::deallocateOne`, etc. in the built .so.
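The capacity-halving retry loop that stays on the wrapper can be sketched as follows (`try_reserve` is a stand-in for the mmap reservation; the 8 MB floor is the `kMinSlabSize` named in the Stage B notes):

```python
# Sketch of the capacity-halving negotiation between the caller's requested
# reservation and the kernel's RLIMIT_AS: halve on failure, down to a floor.

K_MIN_SLAB_SIZE = 8 * 1024 * 1024  # kMinSlabSize from the Stage B notes

def reserve_with_retry(requested: int, try_reserve) -> int:
    """Halve the requested reservation on failure, down to the 8 MB floor."""
    size = requested
    while size >= K_MIN_SLAB_SIZE:
        if try_reserve(size):   # stand-in for the mmap attempt
            return size
        size //= 2
    raise MemoryError("could not reserve even the minimum slab")

# Example: a kernel that only grants reservations up to 100 MB.
granted = reserve_with_retry(1 << 30, lambda s: s <= 100 * 1024 * 1024)
print(granted // (1024 * 1024), "MB")  # 64 MB (1 GB halved four times)
```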
Stage B of the slab-pool refactor. `ArenaJITLinkMemoryManager` is replaced by `SlabPoolMemoryManager`, which holds a vector of `Slab`s per session and grows on demand instead of pre-reserving one giant arena.

Behavior changes:
- **Default capacity**: per-slab `slab_size` drops from 1 GB to 64 MB. Typical ML JIT graphs are well under 10 MB; the first slab now reserves 64 MB instead of 1 GB, dramatically reducing VA footprint for small workloads on memory-constrained hosts.
- **Growth on demand**: when no existing slab can fit a graph, a fresh slab is mmap'd at `slab_size` bytes and appended to the pool. Sessions that previously failed with "pool exhausted" for cumulative allocations beyond 1 GB now succeed transparently.
- **Oversize path**: a single graph whose footprint exceeds `slab_size - 2 * kCommitGranularity` gets its own dedicated slab sized to fit it, rounded to the commit granularity. One graph per oversize slab; the slab becomes available to other allocations after the graph is freed but usually isn't reused (it is sized tightly).

New infrastructure:
- `SlabPoolExhaustedError`: retriable error class. Emitted by `Slab::bumpAllocate` when the requested region exceeds the pool limit; caught by `SlabPoolMemoryManager::allocate` to fall through to the next slab or mmap a new one. Other errors (mmap, mprotect, JITLink) keep their existing types and propagate to the caller without retry.
- `Slab::computeGraphFootprint(G, page_size)`: static helper that pre-computes per-pool byte totals for a graph. The pool manager uses it to make the normal-vs-oversize decision without first attempting a normal allocation.
- `classifyOverflowSections(G)`: file-scope helper extracted from `Slab::allocate`; also used by `computeGraphFootprint`. Keeps the two entry points consistent on which sections go to the overflow (separate-mmap) path.

Concurrency: `pool_mu_` guards only the `slabs_` vector. It is dropped before calling into `Slab::allocate` or the caller's `OnAllocated` callback, because LLJIT materialization frequently invokes nested lookups from inside those callbacks, and a coarse lock here deadlocks. Existing slab pointers are stable across concurrent grows (Stage B never removes slabs), so a snapshot taken under the lock is safe to iterate afterwards.

Initial-slab retry: the session constructor still halves capacity on mmap failure, now down to `kMinSlabSize = 8 MB`. Subsequent grows use exactly `slab_size_` with no retry; mmap failures propagate.

Tests:
- `test_arena.py`: `_ARENA_SIZE` bumped from 16 MB to 256 MB so the co-location tests continue to exercise single-slab invariants. The existing overflow-section test's contiguous-region assertion still passes because 256 MB is enough for one slab.
- `test_basic.py`: 3 new parametrized tests (×2 C/C++ variants = 6 cases) under the "Slab-pool growth" section: `test_pool_grows_under_small_slab` (16 libs, 8 MB slab, pool must grow), `test_small_slab_recycles_after_drop` (32-iteration load/drop exercises the free list within a slab), `test_pool_survives_mixed_load_drop_create` (interleaved paths).

Verification: `pytest addons/tvm_ffi_orcjit/tests`: 70 passed, 3 skipped (was 64 + 3).

Scope: Stage C (warm-slab eviction + real munmap of drained slabs) remains a planned follow-up. Drained slabs in Stage B stay mapped until the session is destroyed, same as today's single arena.
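The lock discipline can be sketched with a toy Python model (hypothetical names; `threading.Lock` stands in for `pool_mu_`):

```python
# Sketch of the locking discipline described above: the pool lock guards
# only the slab vector; it is released before calling into a slab, and a
# snapshot taken under the lock stays valid because slabs are never removed
# while the session is alive.

import threading

class MiniSlab:
    def __init__(self, capacity):
        self.capacity = capacity
    def try_allocate(self, size):
        if size <= self.capacity:
            self.capacity -= size
            return size
        return None  # stands in for SlabPoolExhaustedError

class Pool:
    SLAB_CAPACITY = 64

    def __init__(self):
        self._mu = threading.Lock()
        self._slabs = []

    def allocate(self, size):
        with self._mu:
            snapshot = list(self._slabs)   # stable: slabs are never removed
        for slab in snapshot:              # lock dropped: a nested lookup
            r = slab.try_allocate(size)    # may re-enter allocate() safely
            if r is not None:
                return r
        new_slab = MiniSlab(self.SLAB_CAPACITY)
        with self._mu:                     # re-take only to mutate the vector
            self._slabs.append(new_slab)
        return new_slab.try_allocate(size)

pool = Pool()
print(pool.allocate(10))   # 10: first call grows the pool with a fresh slab
```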
…b reclamation

Stage C of the slab-pool refactor. Gives users explicit control over when drained slabs are `munmap`'d back to the OS, rather than relying on automatic eviction heuristics with their timing hazards.

Rationale: in Stage B a Slab stays mapped for the session's lifetime even after every allocation on it is freed, so long-running workloads (load model → unload → reload) never recover their RSS. Instead of automatic warm-slab eviction with deferred-reclamation bookkeeping (hard to get right around teardown-time ORC-runtime references), this change exposes a manual reclamation call. The user picks a safe moment, typically right after `del lib` on a batch of libraries, and calls `session.clear_free_slabs()` to release drained slabs.

Implementation:
- `Slab` gains an atomic `live_count_` and a one-way `ever_used_` flag. `noteAllocated()` fires in `InFlightAlloc::finalize` just before the FinalizedAlloc handle is published; `noteDeallocated()` fires in `deallocateOne()` after the region is returned to the free list. `isReclaimable()` returns true only when the slab was used at least once and currently has zero live allocations; a fresh initial slab is preserved.
- `SlabPoolMemoryManager::clearFreeSlabs()` partitions `slabs_` into keep and discard under `pool_mu_`, moves the discard half into a local vector, drops the lock, and lets the `unique_ptr<Slab>` destructors `munmap` outside the lock. Returns the count reclaimed.
- `ORCJITExecutionSessionObj::ClearFreeSlabs()` forwards to the pool manager, or returns 0 if the pool is disabled (`slab_size=-1`) or on non-Linux (pool compiled out).
- Registered as `orcjit.ExecutionSessionClearFreeSlabs`; exposed in Python as `ExecutionSession.clear_free_slabs() -> int`.

Tests (5 new):
- `test_clear_free_slabs_no_drained`: fresh session and one-live-lib session both return 0.
- `test_clear_free_slabs_reclaims_oversize`: a 3 MB ZeroFill blob built on the fly via `tvm_ffi.cpp.build` forces the oversize path under a 4 MB `slab_size`; dropping the lib makes the dedicated slab reclaimable.
- `test_clear_free_slabs_idempotent`: a second call after everything is reclaimed returns 0.
- `test_clear_free_slabs_preserves_live_pool`: drop one + keep one; only the drained slab is reclaimed, and the kept lib still executes.
- `test_clear_free_slabs_disabled_pool`: no-op under `slab_size=-1`.

Safety note (in the Python docstring): call when no JIT work is in flight on another thread. Under Python's GIL, once `del lib` returns, the C++ destructor has finished and the slab's live count reflects the drop; a subsequent `clear_free_slabs()` is safe.

Verification: `pytest addons/tvm_ffi_orcjit/tests`: 75 passed, 3 skipped (was 70 + 3).
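The partition-then-unlock pattern can be sketched with a toy Python model (hypothetical names; real `munmap` is simulated by dropping references):

```python
# Sketch of clearFreeSlabs: partition the slab vector under the lock, move
# the reclaimable slabs out, drop the lock, and let their destruction (the
# munmap in the real code) happen outside the critical section.

import threading

class TrackedSlab:
    def __init__(self, name):
        self.name = name
        self.live_count = 0       # atomic in the real code
        self.ever_used = False    # one-way flag
    def is_reclaimable(self):
        # Used at least once AND zero live allocations; a fresh, never-used
        # initial slab is preserved.
        return self.ever_used and self.live_count == 0

class ReclaimPool:
    def __init__(self, slabs):
        self._mu = threading.Lock()
        self._slabs = slabs

    def clear_free_slabs(self) -> int:
        with self._mu:
            keep = [s for s in self._slabs if not s.is_reclaimable()]
            discard = [s for s in self._slabs if s.is_reclaimable()]
            self._slabs = keep
        n = len(discard)
        del discard   # lock dropped: the expensive teardown happens outside
        return n

a, b = TrackedSlab("a"), TrackedSlab("b")
a.ever_used = True                       # used once, now drained: reclaimable
b.ever_used, b.live_count = True, 1      # still has a live allocation
pool = ReclaimPool([a, b])
print(pool.clear_free_slabs())           # 1: only slab "a" is reclaimed
```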
Collapse the allocator's five-step logic (footprint pre-check, usable
estimate, oversize branch, first-fit, grow-fresh) into two steps:
1. First-fit over existing slabs; retry on SlabPoolExhaustedError.
2. On miss, grow the pool with a slab sized by a new
Slab::capacityForFootprint helper — power-of-2 doubling from
slab_size until both per-pool budgets cover the graph's footprint.
The separate "oversize path" disappears: a skewed or oversize graph is
just a miss that happens to grow the pool to 2·slab_size, 4·slab_size,
etc. instead of slab_size. Net −90 lines of allocator logic, +40 lines
of helper + updated comments.
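The `Slab::capacityForFootprint` idea can be sketched in Python (simplified to a single byte total; the real helper checks both per-pool budgets):

```python
# Sketch of the power-of-2 sizing described above: double from slab_size
# until the candidate capacity covers the graph's footprint. A graph that
# fits in slab_size grows the pool by one ordinary slab; a skewed or
# oversize graph is just a miss that grows by 2x, 4x, ... slab_size.

def capacity_for_footprint(footprint: int, slab_size: int) -> int:
    capacity = slab_size
    while capacity < footprint:
        capacity *= 2
    return capacity

MB = 1024 * 1024
print(capacity_for_footprint(3 * MB, 4 * MB) // MB)   # 4: fits in slab_size
print(capacity_for_footprint(5 * MB, 4 * MB) // MB)   # 8: 2 * slab_size
print(capacity_for_footprint(20 * MB, 4 * MB) // MB)  # 32: 8 * slab_size
```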
Observable change: a dropped oversize slab can now be reused by a
subsequent graph that fits — strictly better for RSS. The
`test_clear_free_slabs_preserves_live_pool` test was adjusted to use
two distinctly-sized blobs (3 MB and 5 MB on a 4 MB pool) so drop and
keep land on separate slabs, preserving the invariant the test was
really checking. Extend `_build_big_object` to accept a `name` so two
objects can live in the same tmp_path.
Also rename tests/test_arena.py to tests/test_memory_manager.py and
sweep the "arena" vocabulary to "slab" throughout — the
ArenaJITLinkMemoryManager class was replaced by SlabPoolMemoryManager
in commit 4cc8462.
Summary

Replaces the single ~1 GB per-session arena with a growable pool of
fixed-size Slabs plus explicit reclamation. The work is staged as six
independently revertible commits:
- b42d1c4, 52b7f9d: `llvm_patches/` isolation (code motion)
- f4e5e10: `DynamicLibrary` drop
- e9992ca: `Slab` extracted from `ArenaJITLinkMemoryManager`
- 4cc8462: `SlabPoolMemoryManager` (growable per-session pool)
- a8690ba: `ExecutionSession.clear_free_slabs()` manual reclamation

Net: `orcjit_session.cc` shrank 864 → 498 lines (patches moved out),
`orcjit_memory_manager.cc` shrank from ~700 to ~190 lines with a clean
Slab/Pool split, and per-session JIT memory now lifecycles through four
well-defined moments (session start, pool growth, dylib drop, manual
reclaim) instead of one "arena lives as long as the session" blob.
What each stage does
1. LLVM patches isolation (b42d1c4, 52b7f9d)
Moves `GOTPCRELXFixPlugin` and `InitFiniPlugin`, two `ObjectLinkingLayer`
plugins whose entire reason to exist is an upstream LLVM defect, out of
`orcjit_session.cc` into `addons/tvm_ffi_orcjit/src/ffi/llvm_patches/`.
Each file opens with a fixed-shape header (issue link, affected versions,
trigger, symptom, Removal checklist) so deletion is mechanical once
upstream lands a fix. A `README.md` indexes the patches. The arena memory
manager (design-level, not a workaround) stays at the top level, and the
default arena capacity is reduced to 1 GB on both architectures.
2. Dylib removal + arena reuse fix (f4e5e10)
Previously, dropping a `DynamicLibrary` ran static destructors but never
called `ExecutionSession::removeJITDylib`: JIT code pages accumulated
until the whole session was destroyed. This commit:
- routes `RemoveDylib` through `~ORCJITDynamicLibraryObj` on Linux,
Windows, and macOS (I checked that `MachOPlatform::notifyRemoving`'s
`llvm_unreachable` is dead code; no caller actually reaches it).
- fixes a latent protection bug: `commitPages` stickily mprotects a 2 MB
slab once, but finalize re-mprotects pages to r-x / r--. Returning a
region to the free list left those protections in place, so the next
`allocate`'s `memset(0)` would fault. Protection is reset to RW in the
allocator, not the deallocator, so the teardown-time invariant (the ORC
runtime may execute deallocated pages while DeallocActions unwind) is
preserved.
3. Stage A: Slab abstraction (e9992ca)
- Moves the per-arena internals (commit bitmap, free list, protection
primitives) into a new `Slab` class in `orcjit_slab.{h,cc}`.
- `ArenaJITLinkMemoryManager` becomes a thin ~50-line wrapper that holds a
`unique_ptr<Slab>` and delegates. The capacity-halving retry loop (kernel
negotiation) stays on the wrapper.
- `FinalizedAllocInfo` gains a `Slab* owner` so future multi-slab routing
is O(1).
- `kSlabSize` (the 2 MB commit granularity) is renamed
`kCommitGranularity` to free the name for the `Slab` class.
- `arena_size` is renamed `slab_size`: one breaking change now instead of
two once the multi-slab pool arrives.
Zero behavior change: pure restructuring.
4. Stage B: growable pool (4cc8462)
- `ArenaJITLinkMemoryManager` becomes `SlabPoolMemoryManager`, holding a
`vector<unique_ptr<Slab>>`.
- Default `slab_size`: 1 GB → 64 MB. Small workloads reserve 64 MB up
front instead of 1 GB. Oversize graphs get a dedicated slab sized to fit.
- Adds `SlabPoolExhaustedError` and the `Slab::computeGraphFootprint`
pre-flight helper.
- `pool_mu_` is dropped before calling into `Slab::allocate` or user
callbacks: LLJIT materialization re-enters `allocate()` via nested
lookups, so a coarse lock deadlocks. (Reproduced, diagnosed, and fixed
during implementation; no such deadlock exists today.)
5. Stage C: manual reclaim (a8690ba)
- `Slab` tracks `live_count_` (atomic) and `ever_used_` (one-way).
`noteAllocated` / `noteDeallocated` are called at finalize /
`deallocateOne`.
- `isReclaimable` = used at least once AND zero live allocations.
- `SlabPoolMemoryManager::clearFreeSlabs()` partitions under the lock,
moves discards out, drops the lock, and lets `~Slab` munmap outside the
lock.
- Exposed as `session.clear_free_slabs() -> int` in Python.
- Reclamation is manual because the automatic path has teardown-race
hazards (the ORC runtime can hold pointers into drained pages while its
DeallocActions unwind). Manual reclamation lets the user pick a moment
when those hazards don't apply (after `del lib` has returned, the C++
destructor has run, and counts reflect reality). It can be upgraded to
automatic later without changing the API.
API changes
- `ExecutionSession(arena_size=...)` → `ExecutionSession(slab_size=...)`.
Semantics: size per slab, not total capacity. Default 64 MB. `-1` still
disables the slab allocator (LLJIT uses its default scattered-mmap
allocator).
- New: `ExecutionSession.clear_free_slabs() -> int`. Returns the count of
slabs reclaimed. No-op when the pool is disabled or on macOS/Windows (the
pool is compiled out there).
Motivation
Before this stack:
- `orcjit_session.cc` mixed core session setup with 300+ lines of inline
LLVM workarounds.
- JIT code pages stayed resident until session destruction.
- Freed memory was never reused across libraries.
After this stack all five are fixed, and each stage is independently
revertible.
Test plan
- `LLVM_PREFIX=/opt/llvm pip install -e addons/tvm_ffi_orcjit`
- `pytest addons/tvm_ffi_orcjit/tests`: 75 passed, 3 skipped (was 48
passed, 3 skipped at branch base)
- `Function` keeps dylib alive, oversize path, reclaim after drop: all
covered by new parametrized tests
- `ruff check` on touched Python