fix: chunk Neo4j flush into bounded sub-transactions to prevent transaction-memory OOM #14

Closed
bkrabach wants to merge 1 commit into main from fix/chunked-flush-bounded-transaction

@bkrabach

Collaborator

Problem

Under sustained ingest backpressure (e.g. draining a large per-session backlog after a restart), the Neo4j write flush builds a single unbounded transaction and OOMs:

neo4j.exceptions.TransientError: MemoryPoolOutOfMemoryError
"...would use more than the limit 20.6 GiB. Currently using 20.6 GiB.
 dbms.memory.transaction.total.max threshold reached"

This is an unbounded-transaction problem, not a sizing problem. In the field, the "currently using" figure tracked whatever ceiling was configured (it climbed in lock-step when the ceiling was raised from 20.6 GiB to 40 GiB) while the Neo4j process RSS stayed at about 4.7 GiB. Raising the memory limit only moves the wall.

Root cause

Neo4jGraphStore._flush_body snapshots the entire _node_buffer + _edge_buffer + _label_patches and writes them in one execute_write(_write_batch, ...) transaction. On failure the finally block restores the whole snapshot back into the buffers, so under continued load each failed flush makes the next one larger → a self-amplifying grow-spiral that never commits. The per-session queue offset therefore never advances, and the data is stuck (it stays safe on disk in the JSONL queue, but never reaches the graph). Both ingest (registry._flush_barrier) and session finalization (registry._finalize_session) funnel through the same flush(), so both paths exhibit it.

The existing _DRAIN_MAX_BATCH = 100 read cap does not help — the buffer/flush, not the read batch, is the unbounded unit.
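
The grow-spiral described above can be illustrated with a toy simulation. This is not the project's code: `TX_LIMIT`, `ARRIVALS_PER_CYCLE`, `flush_all_or_nothing`, and `simulate` are hypothetical names, and the "transaction" is reduced to a size check that stands in for the memory ceiling.

```python
TX_LIMIT = 1_000          # hypothetical per-transaction item ceiling
ARRIVALS_PER_CYCLE = 300  # hypothetical items buffered between flush attempts


def flush_all_or_nothing(buffer: list[int]) -> list[int]:
    """Try to write the whole buffer in one transaction.

    Returns the buffer after the attempt: empty on success,
    the full snapshot restored on failure.
    """
    snapshot = list(buffer)
    if len(snapshot) > TX_LIMIT:
        # Simulated out-of-memory failure: nothing commits, everything is restored.
        return snapshot
    return []


def simulate(initial_backlog: int, cycles: int) -> list[int]:
    """Return the buffer size seen before each flush attempt."""
    buffer = list(range(initial_backlog))
    sizes = []
    for _ in range(cycles):
        sizes.append(len(buffer))
        buffer = flush_all_or_nothing(buffer)
        buffer.extend(range(ARRIVALS_PER_CYCLE))  # ingest continues regardless
    return sizes


sizes = simulate(initial_backlog=1_200, cycles=4)
print(sizes)  # [1200, 1500, 1800, 2100]
```

Once the backlog exceeds the ceiling, every attempt fails and the next attempt is strictly larger, so the flush can never succeed without intervention.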

Fix

Write the buffered snapshot in bounded sub-transactions of at most neo4j_flush_chunk_size items each:

  • _flush_body now writes in three ordered phases — nodes → edges → label-patches (ordering preserved for referential integrity: edges/patches MATCH nodes that must already exist). Each chunk is its own execute_write, so per-transaction memory is bounded regardless of backlog size.
  • _write_batch is unchanged — still a pure function of its args; each phase passes only its category populated.
  • Failure restores only the un-committed remainder (from the failing chunk onward), merged with any concurrent new writes. Already-committed chunks are durable and idempotent (MERGE on uniqueness constraints) and are not restored — so the buffer can only shrink across retries. This removes the grow-spiral.
  • The all-or-raise caller contract is preserved: a failure re-raises, the offset isn't committed, and the drainer re-runs the batch idempotently.
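
The phases and the remainder-only restore listed above can be sketched as follows. This is a minimal illustration under assumptions, not the actual `_flush_body`: `BufferedStore`, `PHASES`, and the `write(phase, chunk)` callback are hypothetical stand-ins, with each `write` call playing the role of one `execute_write` transaction.

```python
PHASES = ("nodes", "edges", "patches")  # write order: edges/patches need their nodes


class BufferedStore:
    """Toy stand-in for the graph store: three buffers and a chunked flush."""

    def __init__(self, write, chunk_size=200):
        self.write = write            # called once per chunk, like one transaction
        self.chunk_size = chunk_size
        self.buffers = {phase: [] for phase in PHASES}

    def flush(self):
        # Swap in fresh buffers so writes arriving mid-flush stay separate.
        snapshot = self.buffers
        self.buffers = {phase: [] for phase in PHASES}
        committed = {phase: 0 for phase in PHASES}
        try:
            for phase in PHASES:
                items = snapshot[phase]
                for start in range(0, len(items), self.chunk_size):
                    chunk = items[start:start + self.chunk_size]
                    self.write(phase, chunk)
                    committed[phase] += len(chunk)
        except Exception:
            # Restore only what did not commit, ahead of anything buffered since.
            for phase in PHASES:
                remainder = snapshot[phase][committed[phase]:]
                self.buffers[phase] = remainder + self.buffers[phase]
            raise  # the caller still sees the failure (all-or-raise)


def failing_on_second_call():
    """Build a writer whose second call raises, to simulate a mid-flush failure."""
    calls = []

    def write(phase, chunk):
        calls.append((phase, len(chunk)))
        if len(calls) == 2:
            raise RuntimeError("simulated transient failure")

    return write


def recording(log):
    """Build a writer that always succeeds and records (phase, size) per call."""
    def write(phase, chunk):
        log.append((phase, len(chunk)))

    return write


store = BufferedStore(failing_on_second_call(), chunk_size=2)
store.buffers["nodes"] = [{"id": i} for i in range(5)]
store.buffers["edges"] = [{"src": 0, "dst": 1}]

try:
    store.flush()
except RuntimeError:
    pass

remaining = {phase: len(items) for phase, items in store.buffers.items()}
print(remaining)  # {'nodes': 3, 'edges': 1, 'patches': 0}

retry_log = []
store.write = recording(retry_log)
store.flush()
print(retry_log)  # [('nodes', 2), ('nodes', 1), ('edges', 1)]
```

The first chunk of two nodes stays committed after the failure, so the retry starts from the failing chunk and the buffer is smaller than the original snapshot.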

New tunable

neo4j_flush_chunk_size (default 200) — added to Settings, reachable via AMPLIFIER_CONTEXT_INTELLIGENCE_SERVER_NEO4J_FLUSH_CHUNK_SIZE or neo4j_flush_chunk_size: in server-config.yaml, matching the existing write_concurrency pattern.
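
As a rough illustration of the environment override, the sketch below reads the variable directly with the standard library. The real `Settings` class may resolve it differently; `resolve_chunk_size` and the positive-integer validation are assumptions made for this example, and the YAML path and any precedence between the two sources are not covered.

```python
import os
from typing import Mapping

ENV_VAR = "AMPLIFIER_CONTEXT_INTELLIGENCE_SERVER_NEO4J_FLUSH_CHUNK_SIZE"
DEFAULT_CHUNK_SIZE = 200  # default stated in this PR


def resolve_chunk_size(environ: Mapping[str, str] = os.environ) -> int:
    """Return the chunk size from the environment, or the default if unset.

    Hypothetical helper: rejecting values below 1 is an assumption,
    not documented behaviour of the real Settings class.
    """
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    value = int(raw)
    if value < 1:
        raise ValueError(f"{ENV_VAR} must be a positive integer, got {value}")
    return value


print(resolve_chunk_size({}))               # 200
print(resolve_chunk_size({ENV_VAR: "50"}))  # 50
```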

Tests / proof

New, in tests/neo4j/test_concurrent_flush.py (@pytest.mark.neo4j, real Neo4j container):

  • test_chunked_flush_writes_all_in_bounded_chunks — buffers 500 nodes + 200 edges at chunk=50; asserts all rows land and no transaction exceeded 50 items (instruments _write_batch per-call sizes).
  • test_chunked_flush_partial_failure_restores_only_remainder — injects a mid-flush failure; asserts already-committed chunks are durable, the buffer holds only the remainder (not the whole snapshot), and a retry completes.
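
The per-call size instrumentation used by the first test can be approximated as below. This is a sketch with hypothetical names (`record_call_sizes`, `fake_write_batch`, `flush`) and an assumed `(nodes, edges, patches)` argument shape; the real `_write_batch` signature and the real test run against a Neo4j container and are not reproduced here.

```python
import functools


def record_call_sizes(write_fn, sizes):
    """Wrap a write function so each call appends its total item count to `sizes`."""
    @functools.wraps(write_fn)
    def wrapper(nodes, edges, patches):
        sizes.append(len(nodes) + len(edges) + len(patches))
        return write_fn(nodes, edges, patches)

    return wrapper


def fake_write_batch(nodes, edges, patches):
    """Stand-in for the real write; pretends every item was written."""
    return len(nodes) + len(edges) + len(patches)


def flush(nodes, edges, write, chunk_size):
    """Write nodes then edges, one call per chunk, one category per call."""
    written = 0
    for start in range(0, len(nodes), chunk_size):
        written += write(nodes[start:start + chunk_size], [], [])
    for start in range(0, len(edges), chunk_size):
        written += write([], edges[start:start + chunk_size], [])
    return written


sizes = []
write = record_call_sizes(fake_write_batch, sizes)
written = flush(list(range(500)), list(range(200)), write, chunk_size=50)
print(len(sizes), max(sizes), written)  # 14 50 700
```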

Results on this branch:

  • tests/neo4j/test_concurrent_flush.py: 6 passed (4 pre-existing flush/drain/poison tests = no regression, plus the 2 above), real neo4j:5.26.22-community.
  • Unit suite pytest -m "not neo4j and not integration": 1273 passed.
  • ruff + pyright — clean.

Notes

  • Default neo4j_flush_chunk_size=200 is conservative; operators can tune.
  • Tracking/discussion: microsoft-amplifier/amplifier-support#278 (filed there because Issues are disabled on this repo — re-enabling Issues would help external triage).
  • Based on main @ 750de9d.

…action-memory OOM

Previously, _flush_body wrote the entire node+edge+patch buffer in a single
execute_write transaction. On failure, the finally block would restore and
re-merge the whole snapshot, causing a self-amplifying grow-spiral under
continued load—each failed flush became larger than the last, leading to
unbounded transaction memory growth and MemoryPoolOutOfMemoryError.

This fix rewrites the flush to:
- Write in bounded sub-transactions (nodes→edges→patches)
- Respect neo4j_flush_chunk_size (default 200)
- Restore only the un-committed remainder on failure (killing the grow-spiral)
- Maintain strict ordering for referential integrity

Adds neo4j_flush_chunk_size tunable (AMPLIFIER_CONTEXT_INTELLIGENCE_SERVER_NEO4J_FLUSH_CHUNK_SIZE).

Tests: 6/6 real-neo4j flush tests pass (4 existing = no regression, 2 new
bounding + failure-restore tests), 1273 unit tests pass, ruff+pyright clean.

Fixes: microsoft-amplifier/amplifier-support#278

Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
@colombod

Collaborator

Thanks @bkrabach — we independently arrived at the same core fix (chunked bounded sub-transactions in _flush_body, _write_batch untouched, both drain + finalization paths), which is good validation of the approach.

#15 is a functional superset of this PR, so I'd propose consolidating onto it and closing this one:

  • Same OOM fix, plus a serialized-byte co-bound alongside the row cap (a 200-row chunk can still OOM on a few fat tool_input/messages rows; the byte bound closes that).
  • Phase 2 failure-visibility (which #278 also asked for): last_successful_flush + orphaned_sessions() surfaced on /status, so a wedged finalization orphan is no longer silent.
  • Stronger proof: a live test that reproduces a real MemoryPoolOutOfMemoryError and shows the chunked flush drains the same buffer, on both paths, with a kill/restart "frozen-across-restarts" arc; plus a dangling-node reader audit; a db.memory.transaction.max deployment cap; and the 4.0.1 bump.
  • One deliberate divergence: #15 ("fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal (v4.0.1)") restores the whole snapshot and re-raises rather than tracking the uncommitted remainder. That is correctness-equivalent under idempotent MERGE, and your remainder optimization is easy to port if preferred.
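
The byte co-bound mentioned in the first bullet can be sketched as a chunker that closes a chunk when either limit would be exceeded. How #15 actually measures row size is not stated here; using `json.dumps` for the serialized size, and the name `chunk_by_rows_and_bytes`, are assumptions for illustration.

```python
import json


def chunk_by_rows_and_bytes(rows, max_rows, max_bytes):
    """Yield chunks holding at most `max_rows` rows and about `max_bytes` bytes.

    A single row larger than `max_bytes` is emitted alone rather than dropped.
    """
    chunk, chunk_bytes = [], 0
    for row in rows:
        row_bytes = len(json.dumps(row).encode("utf-8"))
        over_rows = len(chunk) >= max_rows
        over_bytes = chunk_bytes + row_bytes > max_bytes
        if chunk and (over_rows or over_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(row)
        chunk_bytes += row_bytes
    if chunk:
        yield chunk


small = [{"id": i, "tool_input": "x" * 10} for i in range(5)]
fat = [{"id": 5, "tool_input": "x" * 5000}]
tail = [{"id": i, "tool_input": "x" * 10} for i in range(6, 8)]
rows = small + fat + tail

chunks = list(chunk_by_rows_and_bytes(rows, max_rows=3, max_bytes=1000))
print([len(c) for c in chunks])  # [3, 2, 1, 2]
```

The first chunk closes on the row cap, the second closes early because the fat row would break the byte bound, and the fat row is then isolated in its own chunk.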

Proposing we close this in favor of #15. Thanks for the independent confirmation of the root cause and fix.

@colombod closed this Jun 18, 2026