Skip to content

fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls)#18

Closed
colombod wants to merge 2 commits into
mainfrom
fix/flush-lock-order-and-fail-loud
Closed

fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls)#18
colombod wants to merge 2 commits into
mainfrom
fix/flush-lock-order-and-fail-loud

Conversation

@colombod

Copy link
Copy Markdown
Collaborator

Summary

Follow-up to #15. PR #15 (v4.0.1) replaced the single global flush transaction with a per-chunk coordinator, which lost the global Neo4j lock ordering. Under concurrent ingestion this caused TransientError.DeadlockDetected cycles, and — because db.lock.acquisition.timeout=0 (unlimited) — silent, unrecoverable drain stalls (a blocked transaction waits forever and raises nothing).

This PR fixes both, in two commits.

1. Restore global write-lock ordering (bba226b)

  • Sort the full node snapshot before chunking; iterate edge_groups in sorted type order; sort the edge snapshot before chunking (row-level global order).
  • Acquire write-locks on all edge endpoints in node_id-sorted order before the edge MERGEs (canonical per-endpoint pre-lock). This eliminates the topological root↔concept lock-order inversion that row-sorting alone cannot reach (the two endpoints appear in different src/dst roles across different edge types within one transaction). APOC is not available, so pure Cypher; implemented Eager-free (per-endpoint MATCH … SET) to avoid added memory pressure.

2. Finite lock-acquisition timeout — fail loud, never silent (657b800)

  • NEO4J_db_lock_acquisition_timeout="30s" in the shipped compose. Converts a blocked flush from a silent infinite wait into a bounded TransientError that flows through the existing retry → dead-letter path (already ERROR-logged; /status reports degraded). No application source change.
  • Deliberately no write_semaphore acquire timeout (it would dead-letter valid events under backpressure); the lock timeout alone bounds starvation.

Evidence (isolated reproduction stack, real Neo4j)

  • Deadlocks: 194 → 0 even at concept-node in_degree ~1920; 0 driver retries, 0 dead-letters.
  • Never-silent proven: a RED probe reproduced the silent infinite hang on timeout=0; with the finite timeout a blocked flush raises in ~5.5s, dead-letters in ~15s, and /status shows degraded=true.
  • Starvation bounded: under induced contention a free drainer commits in ~5.6s (lock timeout alone).
  • Scales gracefully (K=4→64): 0 deadlock / 0 dead-letter / 0 loss / full recovery at every level; throughput plateaus at ~12 events/s (supernode serialization), queue grows linearly under backpressure — bounded, no cliff.
  • Regression: 1372 non-Neo4j + 64 Neo4j tests pass. New TDD tests: tests/test_flush_global_lock_order.py, tests/neo4j/test_lock_timeout_fail_loud.py.

Known ceiling / follow-ups (not in this PR)

  • Throughput ceiling ~12 events/s is supernode write serialization. The lever to raise it (~order of magnitude) is reducing cross-session writes to the shared root/concept nodes — but it needs a safe replacement for the HAS_SUBSESSION MATCH (which silently drops the edge if the parent Session isn't committed). Tracked as a follow-up.
  • Latent silent-edge-drop: HAS_SUBSESSION MATCH … MERGE silently drops the edge if the parent Session is absent at child-flush time — pre-existing, independent of fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal (v4.0.1) #15. Tracked.
  • FD/ulimit ceiling past ~1024 concurrent never-ending sessions is a config matter (raise ulimit); still 0 loss when crossed.

colombod and others added 2 commits June 18, 2026 23:46
…dlocks

Restore deterministic global write-lock ordering in the Neo4j flush path to
eliminate concurrent-ingestion deadlocks.

PR #15 (v4.0.1) replaced the single global write transaction with a per-chunk
coordinator, which lost the global node lock order — concurrent drainers then
acquired locks out of order and Neo4j raised TransientError.DeadlockDetected
(and, with db.lock.acquisition.timeout=0, caused silent blocking stalls).

This fix applies two changes:

1. Row-level global order (Layer 1): Sort the full node snapshot before
   chunking; iterate edge_groups in sorted type order; sort the edge snapshot
   before chunking. This eliminated 194 → 80 deadlocks in testing, but left
   topological inversions that row-sorting cannot reach.

2. Canonical node-lock pre-acquisition (Layer 2): Acquire write-locks on all
   edge endpoints in node_id-sorted order before the edge MERGEs, eliminating
   the root ↔ concept lock-order inversion. Implemented Eager-free via
   per-endpoint MATCH...SET (APOC not installed); confirmed by EXPLAIN on the
   reproduction stack.

Verification (isolated reproduction stack, prod untouched):
- DeadlockDetected: 194 → 80 (row-sort) → 0 (pre-lock), at concept in_degree 1280
- 0 driver retries, 0 dead-letters; drain kept up (1408 → 692 queued)
- Full Neo4j integration suite: 61 passed (incl. orphan OOM-surfacing test)
- Non-Neo4j tests: 1372 passed; lint clean
- New TDD test asserts global node/edge lock-order invariant and sorted pre-lock

Fixes: #15 regression (deadlock regression in v4.0.1)
Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
…il loud

With db.lock.acquisition.timeout=0 (unlimited) a blocked (non-deadlock) transaction
waits forever and raises nothing — the silent drain stall Brian observed.
A finite timeout converts that into a bounded TransientError that flows through the
existing retry → dead-letter path (already logged at ERROR; /status reports degraded).

Set NEO4J_db_lock_acquisition_timeout="30s" in the shipped docker-compose neo4j
service (tunable default). No application source change needed — the exhausted-flush
path already surfaces loudly (flush_chunk_failed/drain_batch_exhausted at ERROR,
dead_letter_total + degraded on /status). Deliberately NOT adding a write_semaphore
acquire timeout (would dead-letter valid events under backpressure); the lock timeout
alone bounds starvation.

Verification: NEW fault-injection tests tests/neo4j/test_lock_timeout_fail_loud.py —
RED probe on 0s reproduced the silent infinite block (hang); with the finite timeout
the flush raises in ~5.5s, dead-letters in ~15s with degraded=true on /status, and
other drainers keep progressing under contention. Full Neo4j suite 64 passed, 1372
non-Neo4j pass, harness DeadlockDetected still 0.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
@colombod

Copy link
Copy Markdown
Collaborator Author

Closing in favor of #19 (merged), which is the more complete and live-proven fix for this incident.

#19 independently converges on the same root cause — PR #15's per-chunk flush lost the global lock order, and db.lock.acquisition.timeout=0 turned the resulting contention into silent, unrecoverable drain stalls — and additionally fixes a dominant real-world factor this PR missed: the label-free node MERGE/MATCH causing an AllNodesScan on the ~1.3M-node production graph. #19 addresses it with a universal :Node label + composite (node_id, workspace) index + idempotent backfill, with live before/after evidence (backlog 58→0 in ~2 min; slowest txn 25–30s → sub-ms).

We re-based our investigation onto merged #19 and measured our one non-redundant contribution — the canonical per-endpoint pre-lock. On an identical harness it drives residual DeadlockDetected 110 → 0, but #19 already self-heals those deadlocks losslessly (0 dead-letters, 0 data loss), and the pre-lock yields no throughput gain (~10.2 vs ~11.3 eps) at an unconditional hot-path cost. So it is not worth shipping. The branch and its evidence are preserved as a ready-to-revive mitigation if production telemetry ever shows the residual deadlock-retry churn causing real harm at scale.

Net: #19 is the fix. Our independent reproduction corroborates it (194→0 lock-order progression; topological root↔concept lock-order analysis; a four-gate evidence framework on an isolated real-Neo4j stack).

Two follow-ups worth tracking for the team (repo Issues are disabled, noting here):

  1. Latent silent edge-dropHAS_SUBSESSION (and similar cross-session edges) use MATCH (src) MATCH (dst) MERGE (src)-[r]->(dst); if an endpoint isn't committed yet the MATCH returns nothing and the edge is silently dropped (no error, no log). Independent of fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal (v4.0.1) #15, and the blocker for the throughput work below.
  2. Throughput lever — the ~10–12 eps drain ceiling is supernode write serialization (every drainer contends on the shared root-Session + concept nodes). Reducing cross-session writes (W1/W2) is the real ~order-of-magnitude lever, gated on a safe replacement for the silent-MATCH edge write (item 1).

🤖 Generated with Amplifier

Co-Authored-By: Amplifier 240397093+microsoft-amplifier@users.noreply.github.com

@colombod colombod closed this Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant