fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls) #18
…dlocks

Restore deterministic global write-lock ordering in the Neo4j flush path to eliminate concurrent-ingestion deadlocks.

PR #15 (v4.0.1) replaced the single global write transaction with a per-chunk coordinator, which lost the global node lock order. Concurrent drainers then acquired locks out of order and Neo4j raised TransientError.DeadlockDetected. With db.lock.acquisition.timeout=0, the same contention also produced silent blocking stalls.

This fix applies two changes:

1. Row-level global order (Layer 1): Sort the full node snapshot before chunking; iterate edge_groups in sorted type order; sort the edge snapshot before chunking. In testing this reduced deadlocks from 194 to 80, but it left topological inversions that row-sorting cannot reach.
2. Canonical node-lock pre-acquisition (Layer 2): Acquire write-locks on all edge endpoints in node_id-sorted order before the edge MERGEs, eliminating the root ↔ concept lock-order inversion. Implemented Eager-free via per-endpoint MATCH...SET (APOC not installed); confirmed by EXPLAIN on the reproduction stack.

Verification (isolated reproduction stack, prod untouched):
- DeadlockDetected: 194 → 80 (row-sort) → 0 (pre-lock), at concept in_degree 1280
- 0 driver retries, 0 dead-letters; drain kept up (1408 → 692 queued)
- Full Neo4j integration suite: 61 passed (incl. orphan OOM-surfacing test)
- Non-Neo4j tests: 1372 passed; lint clean
- New TDD test asserts global node/edge lock-order invariant and sorted pre-lock

Fixes: #15 regression (deadlock regression in v4.0.1)

Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
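The two layers described in this commit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's flush code: `plan_flush`, `chunked`, the dict/tuple row shapes, and the edge-type names `REFERS_TO` and `DERIVED_FROM` are all hypothetical. Only the ordering rules (sort nodes before chunking, iterate edge types in sorted order, sort edges before chunking, pre-lock endpoints in node_id-sorted order) come from the commit message.

```python
from itertools import islice


def chunked(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def plan_flush(nodes, edge_groups, chunk_size):
    """Build a flush plan whose lock order is the same for every drainer.

    nodes:       list of dicts carrying a "node_id" key (assumed shape)
    edge_groups: dict mapping edge type -> list of (src_id, dst_id) tuples
    """
    # Layer 1: sort the full node snapshot BEFORE chunking, so chunk
    # boundaries cannot reorder node locks between drainers.
    node_chunks = list(chunked(sorted(nodes, key=lambda n: n["node_id"]), chunk_size))

    edge_plan = []
    for edge_type in sorted(edge_groups):            # sorted type order
        edges = sorted(edge_groups[edge_type])       # sorted edge snapshot
        for chunk in chunked(edges, chunk_size):
            # Layer 2: every endpoint of the chunk, deduplicated and sorted by
            # node_id, regardless of whether it appears as src or dst.
            prelock = sorted({node_id for edge in chunk for node_id in edge})
            edge_plan.append((edge_type, prelock, chunk))
    return node_chunks, edge_plan


# Two edge types that touch the same pair of nodes in opposite roles.
nodes = [{"node_id": "root"}, {"node_id": "concept"}, {"node_id": "a"}]
edge_groups = {
    "REFERS_TO": [("root", "concept")],
    "DERIVED_FROM": [("concept", "root")],
}
node_chunks, edge_plan = plan_flush(nodes, edge_groups, chunk_size=2)
prelock_orders = [prelock for _, prelock, _ in edge_plan]
```

In this example both edge chunks pre-lock `concept` before `root`, even though the two nodes swap src/dst roles between the edge types. Row sorting alone cannot guarantee that, which is why the pre-lock step is a separate layer.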
…il loud

With db.lock.acquisition.timeout=0 (unlimited), a blocked (non-deadlock) transaction waits forever and raises nothing: this is the silent drain stall Brian observed. A finite timeout converts that into a bounded TransientError that flows through the existing retry → dead-letter path (already logged at ERROR; /status reports degraded).

Set NEO4J_db_lock_acquisition_timeout="30s" in the shipped docker-compose neo4j service (tunable default). No application source change is needed: the exhausted-flush path already surfaces loudly (flush_chunk_failed/drain_batch_exhausted at ERROR, dead_letter_total + degraded on /status).

Deliberately NOT adding a write_semaphore acquire timeout (it would dead-letter valid events under backpressure); the lock timeout alone bounds starvation.

Verification: new fault-injection tests in tests/neo4j/test_lock_timeout_fail_loud.py. The RED probe at timeout 0 reproduced the silent infinite block (hang); with the finite timeout the flush raises in ~5.5s, dead-letters in ~15s with degraded=true on /status, and other drainers keep progressing under contention. Full Neo4j suite: 64 passed; non-Neo4j tests: 1372 passed; harness DeadlockDetected still 0.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
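The "fail loud" behaviour this commit relies on can be modelled without a database. The sketch below is an assumption-laden illustration, not the application's retry code: `TransientLockError`, `DrainStatus`, `flush_with_retry`, and the attempt count are hypothetical stand-ins. Only the event name `drain_batch_exhausted` and the `dead_letter_total` and `degraded` fields are taken from the commit message. A real drainer would presumably also back off between attempts.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("drain_sketch")


class TransientLockError(Exception):
    """Stand-in for the bounded TransientError raised once the lock timeout expires."""


@dataclass
class DrainStatus:
    dead_letter_total: int = 0
    degraded: bool = False
    dead_letters: list = field(default_factory=list)


def flush_with_retry(flush, batch, status, max_attempts=3):
    """Try `flush(batch)` up to `max_attempts` times, then dead-letter the batch.

    Because every attempt either succeeds or raises, the loop always
    terminates. With an unlimited lock timeout, `flush` could block forever
    and this function would never return.
    """
    for _ in range(max_attempts):
        try:
            flush(batch)
            return True
        except TransientLockError:
            continue  # bounded failure: safe to retry
    # Retries exhausted: surface the failure instead of stalling.
    log.error("drain_batch_exhausted size=%d", len(batch))
    status.dead_letters.append(batch)
    status.dead_letter_total += 1
    status.degraded = True
    return False


def always_blocked(batch):
    # Simulates a flush whose lock wait hits the finite timeout every time.
    raise TransientLockError("lock acquisition timed out")


status = DrainStatus()
ok_blocked = flush_with_retry(always_blocked, ["event-1"], status)
ok_normal = flush_with_retry(lambda batch: None, ["event-2"], status)
```

The design point is that the timeout turns an invisible state (a transaction waiting on a lock) into an exception, which is the only kind of failure a retry loop like this one can react to.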
Closing in favor of #19 (merged), which is the more complete and live-proven fix for this incident.

#19 independently converges on the same root cause: PR #15's per-chunk flush lost the global lock order, and

We re-based our investigation onto merged #19 and measured our one non-redundant contribution, the canonical per-endpoint pre-lock. On an identical harness it drives residual

Net: #19 is the fix. Our independent reproduction corroborates it (194→0 lock-order progression; topological root↔concept lock-order analysis; a four-gate evidence framework on an isolated real-Neo4j stack).

Two follow-ups worth tracking for the team (repo Issues are disabled, noting here):
🤖 Generated with Amplifier

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
Summary
Follow-up to #15. PR #15 (v4.0.1) replaced the single global flush transaction with a per-chunk coordinator, which lost the global Neo4j lock ordering. Under concurrent ingestion this caused TransientError.DeadlockDetected cycles and, because db.lock.acquisition.timeout=0 (unlimited), silent, unrecoverable drain stalls (a blocked transaction waits forever and raises nothing).

This PR fixes both, in two commits.

1. Restore global write-lock ordering (bba226b)

- Sort the full node snapshot before chunking; iterate edge_groups in sorted type order; sort the edge snapshot before chunking (row-level global order).
- Acquire write-locks on all edge endpoints in node_id-sorted order before the edge MERGEs (canonical per-endpoint pre-lock). This eliminates the topological root↔concept lock-order inversion that row-sorting alone cannot reach (the two endpoints appear in different src/dst roles across different edge types within one transaction). APOC is not available, so pure Cypher; implemented Eager-free (per-endpoint MATCH … SET) to avoid added memory pressure.

2. Finite lock-acquisition timeout: fail loud, never silent (657b800)

- Set NEO4J_db_lock_acquisition_timeout="30s" in the shipped compose. Converts a blocked flush from a silent infinite wait into a bounded TransientError that flows through the existing retry → dead-letter path (already ERROR-logged; /status reports degraded). No application source change.
- Deliberately not adding a write_semaphore acquire timeout (it would dead-letter valid events under backpressure); the lock timeout alone bounds starvation.

Evidence (isolated reproduction stack, real Neo4j)

- Fault injection reproduced the silent infinite block with timeout=0; with the finite timeout a blocked flush raises in ~5.5s, dead-letters in ~15s, and /status shows degraded=true.
- Tests: tests/test_flush_global_lock_order.py, tests/neo4j/test_lock_timeout_fail_loud.py.

Known ceiling / follow-ups (not in this PR)

- HAS_SUBSESSION MATCH (which silently drops the edge if the parent Session isn't committed). Tracked as a follow-up.
- HAS_SUBSESSION MATCH … MERGE silently drops the edge if the parent Session is absent at child-flush time; pre-existing, independent of #15 (fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal, v4.0.1). Tracked.
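The HAS_SUBSESSION follow-up is easier to reason about with a small model of the MATCH … MERGE row semantics. The sketch below is pure Python and only imitates the behaviour described above: a row whose parent Session does not match yields no result row and no error. The function name, the set-based "graph", and the idea of comparing sent and merged row counts are assumptions for illustration, not code or a fix from this PR.

```python
def match_then_merge(committed_sessions, edges, rows):
    """Model of: MATCH the parent Session, then MERGE the HAS_SUBSESSION edge.

    committed_sessions: set of Session ids visible at child-flush time
    edges:              set of (parent_id, child_id) pairs, mutated in place
    rows:               list of (parent_id, child_id) pairs to write
    Returns the number of rows that actually produced an edge.
    """
    merged = 0
    for parent_id, child_id in rows:
        if parent_id not in committed_sessions:
            # MATCH found nothing: the row vanishes, nothing is raised.
            continue
        edges.add((parent_id, child_id))
        merged += 1
    return merged


committed = {"session-1"}
edges = set()
rows = [("session-1", "child-a"), ("session-2", "child-b")]  # session-2 not committed yet

merged = match_then_merge(committed, edges, rows)
dropped = len(rows) - merged  # a caller could log or re-queue when this is non-zero
```

Comparing the number of rows sent with the number merged is one way such a drop could be made visible; whether the follow-up takes that route is not stated here.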