
[REFACTOR][OrcJIT] Per-session slab-pool memory manager#574

Open
cyx-6 wants to merge 7 commits into apache:main from cyx-6:orcjit-refactor

Conversation


@cyx-6 cyx-6 commented Apr 27, 2026

Summary

Replaces the single ~1 GB per-session arena with a growable pool of
fixed-size Slabs
plus explicit reclamation. The work is staged as
six independently revertible commits:

| # | Commit | Topic |
|---|---------|-------|
| 1 | b42d1c4 | Isolate LLVM patches under llvm_patches/ (code motion) |
| 2 | 52b7f9d | Lint fixes for the isolation commit |
| 3 | f4e5e10 | Release JITDylib + arena memory on DynamicLibrary drop |
| 4 | e9992ca | Stage A — extract Slab from ArenaJITLinkMemoryManager |
| 5 | 4cc8462 | Stage B — SlabPoolMemoryManager (growable per-session pool) |
| 6 | a8690ba | Stage C — ExecutionSession.clear_free_slabs() manual reclamation |

Net: orcjit_session.cc shrank from 864 to 498 lines (patches moved
out), orcjit_memory_manager.cc shrank from ~700 to ~190 lines
with a clean Slab/Pool split, and per-session JIT memory now passes
through four well-defined lifecycle moments (session start, pool growth,
dylib drop, manual reclaim) instead of one "arena lives as long as the
session" blob.

What each stage does

1. LLVM patches isolation (b42d1c4, 52b7f9d)

  • Move GOTPCRELXFixPlugin and InitFiniPlugin — two
    ObjectLinkingLayer plugins whose entire reason to exist is an
    upstream LLVM defect — out of orcjit_session.cc into
    addons/tvm_ffi_orcjit/src/ffi/llvm_patches/. Each file opens with
    a fixed-shape header (issue link, affected versions, trigger,
    symptom, ## Removal checklist) so deletion is mechanical once
    upstream lands a fix. A README.md indexes the patches.
  • Keeps the arena memory manager (still useful regardless of LLVM
    state) at the top level.
  • Reduces default arena capacity from 4 GB x86_64 / 8 GB AArch64 to
    1 GB on both.

2. Dylib removal + arena reuse fix (f4e5e10)

Previously, dropping a DynamicLibrary ran static destructors but
never called ExecutionSession::removeJITDylib — JIT code pages
accumulated until the whole session was destroyed. This commit:

  • Plumbs RemoveDylib through ~ORCJITDynamicLibraryObj (Linux,
    Windows, and macOS — I checked that MachOPlatform::notifyRemoving's
    llvm_unreachable is dead code; no caller actually reaches it).
  • Fixes a latent bug the new deallocate path exposed: commitPages
    stickily mprotects a 2 MB slab once, but finalize re-mprotects
    pages to r-x / r--. Returning a region to the free list left those
    protections in place → the next allocate's memset(0) would
    fault. Resets protection to RW in the allocator, not the deallocator,
    so the teardown-time invariant (ORC runtime may execute deallocated
    pages while DeallocActions unwind) is preserved.

3. Stage A: Slab abstraction (e9992ca)

  • Extracts all per-arena state (mmap reservation, dual-pool bump,
    commit bitmap, free list, protection primitives) into a new
    Slab class in orcjit_slab.{h,cc}.
  • ArenaJITLinkMemoryManager becomes a thin ~50-line wrapper that
    holds unique_ptr<Slab> and delegates. The capacity-halving retry
    loop (kernel negotiation) stays on the wrapper.
  • FinalizedAllocInfo gains a Slab* owner so future multi-slab
    routing is O(1).
  • Renames kSlabSize (2 MB commit granularity) → kCommitGranularity
    to free the name for the Slab class.
  • API break: Python kwarg arena_size → slab_size. One breaking
    change now instead of two once the multi-slab pool arrives.

Zero behavior change — pure restructuring.

4. Stage B: growable pool (4cc8462)

  • Rewrites ArenaJITLinkMemoryManager → SlabPoolMemoryManager,
    which holds vector<unique_ptr<Slab>>.
  • Default slab_size: 1 GB → 64 MB. Small workloads reserve 64
    MB up front instead of 1 GB. Oversize graphs get a dedicated slab
    sized to fit.
  • Adds retriable SlabPoolExhaustedError and
    Slab::computeGraphFootprint pre-flight helper.
  • Concurrency fix: pool_mu_ is dropped before calling into
    Slab::allocate or user callbacks. LLJIT materialization re-enters
    allocate() via nested lookups — a coarse lock deadlocks.
    (Reproduced, diagnosed, fixed during implementation; no such
    deadlock exists today.)

5. Stage C: manual reclaim (a8690ba)

  • Slab tracks live_count_ (atomic) + ever_used_ (one-way).
    noteAllocated / noteDeallocated are called at finalize /
    deallocateOne. isReclaimable = used at least once AND zero live
    allocations.
  • SlabPoolMemoryManager::clearFreeSlabs() partitions under the
    lock, moves discards out, drops the lock, lets ~Slab munmap
    outside the lock.
  • Exposed as session.clear_free_slabs() -> int in Python.
  • Chose manual over automatic warm-slab eviction because the automatic
    path has teardown-race hazards (ORC runtime can hold pointers into
    drained pages while its DeallocActions unwind). Manual lets the
    user pick a moment when those hazards don't apply (after
    del lib has returned, the C++ destructor has run, counts
    reflect reality). Can be upgraded to automatic later without
    changing the API.

API changes

  • Break: ExecutionSession(arena_size=...) → ExecutionSession(slab_size=...).
    Semantics: size per slab, not total capacity. Default 64 MB. -1
    still disables the slab allocator (LLJIT uses its default
    scattered-mmap allocator).
  • New: ExecutionSession.clear_free_slabs() -> int. Returns count
    of slabs reclaimed. No-op when the pool is disabled or on
    macOS/Windows (pool compiled out there).

Motivation

Before this stack:

  • orcjit_session.cc mixed core session setup with 300+ lines of
    inline LLVM workarounds.
  • Every session pre-reserved 1 GB of VA regardless of workload size.
  • Dropping a library didn't free its JIT code — memory accumulated
    until session destruction.
  • A session couldn't JIT more than ~1 GB of cumulative code.
  • No way to recover RSS on a long-running host that loads and unloads
    libraries.

After this stack all five are fixed, and each stage is independently
revertible.

Test plan

  • Build on aarch64 Linux: LLVM_PREFIX=/opt/llvm pip install -e addons/tvm_ffi_orcjit
  • pytest addons/tvm_ffi_orcjit/tests — 75 passed, 3 skipped (was 48 passed, 3 skipped at branch base)
  • Behavior smoke: repeated create/load/drop cycles, captured
    Function keeps dylib alive, oversize path, reclaim after drop —
    all covered by new parametrized tests
  • ruff check on touched Python
  • CI: lint / clang-tidy / macOS / Windows / x86_64 Linux

Moves the two ObjectLinkingLayer plugins whose entire reason to exist is
an upstream LLVM defect out of orcjit_session.cc into their own files
under addons/tvm_ffi_orcjit/src/ffi/llvm_patches/:

  - gotpcrelx_fix.{h,cc}     (x86_64 Linux GOTPCRELX relaxation bug)
  - init_fini_plugin.{h,cc}  (ELFNixPlatform gap, COFFPlatform stalled)

Each file opens with a fixed-shape header (LLVM issue link, affected
versions, trigger, symptom, Removal checklist) so that when upstream
lands a fix we can delete the file and the matching include / plugin
registration mechanically. README.md indexes the patches and their
removal criteria.

The arena memory manager stays at the top level -- it is a design-level
feature (contiguous r-x layout, THP, faster teardown) that we keep even
after LLVM #173269 is fixed, not a workaround.

Also reduces the default arena capacity from 4 GB (x86_64) / 8 GB
(AArch64) to 1 GB on both architectures. 1 GB covers typical ML JIT
workloads while staying well under the PC-relative relocation limit
(x86_64 +/-2 GB, AArch64 +/-4 GB) and is friendlier to memory-constrained
hosts (containers, CI runners).

No logic changes: orcjit_session.cc shrinks from 864 -> 498 lines of
pure code motion. CI-covered build and tests pass on aarch64 Linux.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors LLVM JITLink workarounds into a dedicated llvm_patches directory and reduces the default JIT arena size from 4GB/8GB to 1GB to better support memory-constrained environments. It introduces specific fixes for x86_64 GOTPCRELX relaxation bugs and implements cross-platform init/fini section handling for ELF and COFF. Feedback was provided regarding a potential lifetime issue when capturing JITDylib by reference in a lambda and a logic inconsistency in section priority parsing when getAsInteger fails.

// Handles ELF (.init_array, .ctors, .fini_array, .dtors),
// Mach-O (__DATA,__mod_init_func, __DATA,__mod_term_func),
// and COFF (.CRT$XC*, .CRT$XT*) section conventions.
Config.PostFixupPasses.emplace_back([this, &jit_dylib](llvm::jitlink::LinkGraph& G) {
medium

The lambda captures jit_dylib by reference. While JITDylib objects are generally long-lived within an ExecutionSession, jit_dylib is a reference obtained from MR.getTargetJITDylib(). Since PostFixupPasses are executed during the linking process which is part of the materialization, this is likely safe, but capturing by pointer or ensuring the JITDylib's lifetime explicitly would be more robust if the materialization context were to change.

// negate so that higher-numbered entries run first (reverse order).
if (is_init_array) {
if (section_name.consume_front(".init_array.")) {
section_name.getAsInteger(10, priority);
medium

getAsInteger returns true on error. The return value is ignored here, which means if a section name like .init_array.abc is encountered, priority will remain at its default value of 0. This is inconsistent with the default priority of 65535 used for sections without a suffix (line 160). Consider checking the return value and falling back to a sensible default or reporting an error.

- README.md: switch to per-line <!--- --> ASF license headers (adds an
  H1 right after, satisfies MD041) and drop the pipe-aligned table in
  favour of a bulleted index (no more MD060 aligned-pipe errors).
- CMakeLists.txt: pack add_library() source list the way cmake-format
  wants (two lines, not one per source).

Previously, dropping a `DynamicLibrary` ran any pending static destructors
but never removed the underlying `JITDylib` from the `ExecutionSession`.
JIT code pages stayed resident in the arena until the session itself was
destroyed, blocking any reuse across dylib lifetimes.

Changes:

- `ORCJITExecutionSessionObj::RemoveDylib`: erases any pending init/fini
  map entries keyed by the JITDylib* (so a recycled address starts clean)
  and calls `ExecutionSession::removeJITDylib`. Errors are swallowed so
  it is safe from a destructor.
- `~ORCJITDynamicLibraryObj` now calls `RemoveDylib` after the existing
  deinit step, on Linux, Windows, and macOS. `MachOPlatform::notifyRemoving`
  is `llvm_unreachable` but is dead code in LLVM — no caller invokes it;
  `removeJITDylibs` only calls `teardownJITDylib`, which is a plain map
  cleanup on every platform.
- Arena memory manager: fix a latent bug that surfaced as soon as
  deallocate was exercised. `commitPages` mprotects a 2 MB slab once
  (sticky `slab_committed_`), but finalize later re-mprotects pages to
  r-x / r--. Returning a region to the free list left those protections
  in place, and the next `allocate()` faulted in its `memset(0)`. Reset
  the region to RW at allocate time (not deallocate) so the teardown-time
  invariant — ORC runtime may still execute deallocated pages while its
  DeallocActions unwind — is preserved.

Tests (8 new, parametrized over C / C++ variants):

- Drop empty library; drop + recreate; 32-iteration load/drop cycle
  exercising the recycled-region path.
- Drop one library while another is live; captured `Function` keeps its
  library alive across `del lib`.
- Drop runs static destructors immediately (not at session teardown).
- Dropping a `set_link_order` caller leaves its base usable.
- Dropping every library before the session still produces clean teardown.
@cyx-6 cyx-6 changed the title [REFACTOR][OrcJIT] Isolate LLVM patches under llvm_patches/ [REFACTOR][OrcJIT] Isolate LLVM patches + release dylib memory on drop Apr 29, 2026
cyx-6 added 3 commits April 29, 2026 14:43
Stage A of the slab-pool refactor in refactor_plan.md. Pure internal
restructuring; zero behavior change.

- Move every per-arena concern — mmap reservation, dual-pool bump
  allocator, commit bitmap, free list with coalescing, page-protection
  primitives, InFlightAlloc logic — into a new `Slab` class in
  `orcjit_slab.{h,cc}`. Slab is the unit-of-VA-reservation; future
  stages will introduce a pool of them per session.
- `ArenaJITLinkMemoryManager` becomes a thin ~30-line wrapper that
  owns one `unique_ptr<Slab>` and delegates `allocate` / `deallocate`.
  The capacity-halving retry loop (negotiation between caller request
  and kernel RLIMIT_AS) stays here — it is not a Slab concern.
- `FinalizedAllocInfo` gains a `Slab* owner` field, stamped at finalize
  time. Redundant today (one Slab per session) but makes Stage B's
  pool-manager routing O(1) without address comparisons.
- Rename the 2 MB commit-granularity constant `kSlabSize` →
  `Slab::kCommitGranularity`. This frees the "Slab" name for the new
  class and clarifies that the constant describes page-commit / THP
  granularity, not the Slab-as-pool-unit.
- Rename the user-facing parameter `arena_size` / `arena_size_bytes`
  → `slab_size` / `slab_size_bytes` across the C++ constructors,
  Python kwarg, registered-global-func lambda, and tests. Stage A has
  one Slab per session so the semantics are identical; Stage B
  preserves the name while adding multi-slab pool behavior. This is
  an API break: `ExecutionSession(arena_size=...)` must become
  `ExecutionSession(slab_size=...)`. Doing the rename now avoids a
  second break once Stage B lands.

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 64 passed, 3
skipped (identical to pre-refactor). `nm` confirms `Slab::allocate`,
`Slab::deallocateOne`, etc. in the built .so.
Stage B of the slab-pool refactor. `ArenaJITLinkMemoryManager` is
replaced by `SlabPoolMemoryManager`, which holds a vector of `Slab`s
per session and grows on demand instead of pre-reserving one giant
arena.

Behavior changes:

- **Default capacity**: per-slab `slab_size` drops from 1 GB to 64 MB.
  Typical ML JIT graphs are well under 10 MB; the first slab now
  reserves 64 MB instead of 1 GB, dramatically reducing VA footprint
  for small workloads on memory-constrained hosts.
- **Growth on demand**: when no existing slab can fit a graph, a fresh
  slab is mmap'd at `slab_size` bytes and appended to the pool.
  Sessions that previously failed with "pool exhausted" for cumulative
  allocations beyond 1 GB now succeed transparently.
- **Oversize path**: a single graph whose footprint exceeds
  `slab_size - 2 * kCommitGranularity` gets its own dedicated slab
  sized to fit it, rounded to the commit granularity. One graph per
  oversize slab; the slab becomes available to other allocations
  after the graph is freed but usually isn't reused (sized tightly).

New infrastructure:

- `SlabPoolExhaustedError` — retriable error class. Emitted by
  `Slab::bumpAllocate` when the requested region exceeds the pool
  limit; caught by `SlabPoolMemoryManager::allocate` to fall through
  to the next slab or mmap a new one. Other errors (mmap, mprotect,
  JITLink) keep their existing types and are propagated to the caller
  without retry.
- `Slab::computeGraphFootprint(G, page_size)` — static helper that
  pre-computes per-pool byte totals for a graph. The pool manager
  uses it to make the normal-vs-oversize decision without first
  attempting a normal allocation.
- `classifyOverflowSections(G)` — file-scope helper extracted from
  `Slab::allocate`; also used by `computeGraphFootprint`. Keeps the
  two entry points consistent on which sections go to the overflow
  (separate-mmap) path.

Concurrency:

`pool_mu_` guards only the `slabs_` vector. It is dropped before
calling into `Slab::allocate` or the caller's `OnAllocated` callback,
because LLJIT materialization frequently invokes nested lookups from
inside those callbacks — a coarse lock here deadlocks. Existing slab
pointers are stable across concurrent grows (Stage B never removes
slabs), so a snapshot taken under the lock is safe to iterate
afterwards.

Initial-slab retry: the session constructor still halves capacity on
mmap failure, now down to `kMinSlabSize = 8 MB`. Subsequent grows use
exactly `slab_size_` with no retry — mmap failures propagate.

Tests:

- `test_arena.py`: `_ARENA_SIZE` bumped from 16 MB → 256 MB so the
  co-location tests continue to exercise single-slab invariants. The
  existing overflow-section test's contiguous-region assertion still
  passes because 256 MB is enough for one slab.
- `test_basic.py`: 3 new parametrized tests (×2 C/C++ variants = 6
  cases) under the "Slab-pool growth" section —
  `test_pool_grows_under_small_slab` (16 libs, 8 MB slab, pool must
  grow), `test_small_slab_recycles_after_drop` (32-iter load/drop
  exercises free-list within a slab), `test_pool_survives_mixed_load_drop_create`
  (interleaved paths).

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 70 passed,
3 skipped (was 64 + 3).

Scope: Stage C (warm-slab eviction + real munmap of drained slabs)
remains a planned follow-up. Drained slabs in Stage B stay mapped
until the session is destroyed, same as today's single arena.
…b reclamation

Stage C of the slab-pool refactor. Gives users explicit control over
when drained slabs are `munmap`'d back to the OS, rather than relying
on automatic eviction heuristics with their timing hazards.

Rationale: in Stage B a Slab stays mapped for the session's lifetime
even after every allocation on it is freed, so long-running workloads
(load model → unload → reload) never recover their RSS. Instead of
an automatic warm-slab eviction with deferred-reclamation bookkeeping
(hard to get right around teardown-time ORC-runtime references), this
change exposes a manual reclamation call. The user picks a safe
moment — typically right after `del lib` on a batch of libraries —
and calls `session.clear_free_slabs()` to release drained slabs.

Implementation:

- `Slab` gains an atomic `live_count_` and a one-way `ever_used_`
  flag. `noteAllocated()` fires in `InFlightAlloc::finalize` just
  before the FinalizedAlloc handle is published; `noteDeallocated()`
  fires in `deallocateOne()` after the region is returned to the
  free list. `isReclaimable()` returns true only when the slab was
  used at least once and currently has zero live allocations — a
  fresh initial slab is preserved.
- `SlabPoolMemoryManager::clearFreeSlabs()` partitions `slabs_` into
  keep and discard under `pool_mu_`, moves the discard half into a
  local vector, drops the lock, and lets `unique_ptr<Slab>` destructors
  `munmap` outside the lock. Returns the count reclaimed.
- `ORCJITExecutionSessionObj::ClearFreeSlabs()` forwards to the pool
  manager, or returns 0 if the pool is disabled (`slab_size=-1`) or
  on non-Linux (pool compiled out).
- Registered as `orcjit.ExecutionSessionClearFreeSlabs`; exposed on
  Python as `ExecutionSession.clear_free_slabs() -> int`.

Tests (5 new):

- `test_clear_free_slabs_no_drained`: fresh session and one-live-lib
  session both return 0.
- `test_clear_free_slabs_reclaims_oversize`: a 3 MB ZeroFill blob
  built on the fly via `tvm_ffi.cpp.build` forces the oversize path
  under a 4 MB `slab_size`; dropping the lib makes the dedicated
  slab reclaimable.
- `test_clear_free_slabs_idempotent`: a second call after everything
  is reclaimed returns 0.
- `test_clear_free_slabs_preserves_live_pool`: drop-one + keep-one
  — only the drained slab is reclaimed, the kept lib still executes.
- `test_clear_free_slabs_disabled_pool`: no-op under `slab_size=-1`.

Safety note (in the Python docstring): call when no JIT work is in
flight on another thread. Under Python's GIL, once `del lib` returns,
the C++ destructor has finished and the slab's live count reflects
the drop; subsequent `clear_free_slabs()` is safe.

Verification: `pytest addons/tvm_ffi_orcjit/tests` — 75 passed,
3 skipped (was 70 + 3).
@cyx-6 cyx-6 changed the title [REFACTOR][OrcJIT] Isolate LLVM patches + release dylib memory on drop [REFACTOR][OrcJIT] Per-session slab-pool memory manager Apr 29, 2026
Collapse the allocator's five-step logic (footprint pre-check, usable
estimate, oversize branch, first-fit, grow-fresh) into two steps:

  1. First-fit over existing slabs; retry on SlabPoolExhaustedError.
  2. On miss, grow the pool with a slab sized by a new
     Slab::capacityForFootprint helper — power-of-2 doubling from
     slab_size until both per-pool budgets cover the graph's footprint.

The separate "oversize path" disappears: a skewed or oversize graph is
just a miss that happens to grow the pool to 2·slab_size, 4·slab_size,
etc. instead of slab_size.  Net −90 lines of allocator logic, +40 lines
of helper + updated comments.

Observable change: a dropped oversize slab can now be reused by a
subsequent graph that fits — strictly better for RSS.  The
`test_clear_free_slabs_preserves_live_pool` test was adjusted to use
two distinctly-sized blobs (3 MB and 5 MB on a 4 MB pool) so drop and
keep land on separate slabs, preserving the invariant the test was
really checking.  Extend `_build_big_object` to accept a `name` so two
objects can live in the same tmp_path.

Also rename tests/test_arena.py to tests/test_memory_manager.py and
sweep the "arena" vocabulary to "slab" throughout — the
ArenaJITLinkMemoryManager class was replaced by SlabPoolMemoryManager
in commit 4cc8462.
@cyx-6 cyx-6 force-pushed the orcjit-refactor branch from 4542333 to b078a98 on April 30, 2026 17:35