fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls) by colombod · Pull Request #18 · microsoft/amplifier-context-intelligence

colombod · 2026-06-19T01:55:57Z

Summary

Follow-up to #15. PR #15 (v4.0.1) replaced the single global flush transaction with a per-chunk coordinator, which lost the global Neo4j lock ordering. Under concurrent ingestion this caused TransientError.DeadlockDetected cycles, and — because db.lock.acquisition.timeout=0 (unlimited) — silent, unrecoverable drain stalls (a blocked transaction waits forever and raises nothing).

This PR fixes both, in two commits.

1. Restore global write-lock ordering (`bba226b`)

Sort the full node snapshot before chunking; iterate edge_groups in sorted type order; sort the edge snapshot before chunking (row-level global order).
Acquire write-locks on all edge endpoints in node_id-sorted order before the edge MERGEs (canonical per-endpoint pre-lock). This eliminates the topological root↔concept lock-order inversion that row-sorting alone cannot reach (the two endpoints appear in different src/dst roles across different edge types within one transaction). APOC is not available, so pure Cypher; implemented Eager-free (per-endpoint MATCH … SET) to avoid added memory pressure.

2. Finite lock-acquisition timeout — fail loud, never silent (`657b800`)

NEO4J_db_lock_acquisition_timeout="30s" in the shipped compose. Converts a blocked flush from a silent infinite wait into a bounded TransientError that flows through the existing retry → dead-letter path (already ERROR-logged; /status reports degraded). No application source change.
Deliberately no write_semaphore acquire timeout (it would dead-letter valid events under backpressure); the lock timeout alone bounds starvation.

Evidence (isolated reproduction stack, real Neo4j)

Deadlocks: 194 → 0 even at concept-node in_degree ~1920; 0 driver retries, 0 dead-letters.
Never-silent proven: a RED probe reproduced the silent infinite hang on timeout=0; with the finite timeout a blocked flush raises in ~5.5s, dead-letters in ~15s, and /status shows degraded=true.
Starvation bounded: under induced contention a free drainer commits in ~5.6s (lock timeout alone).
Scales gracefully (K=4→64): 0 deadlock / 0 dead-letter / 0 loss / full recovery at every level; throughput plateaus at ~12 events/s (supernode serialization), queue grows linearly under backpressure — bounded, no cliff.
Regression: 1372 non-Neo4j + 64 Neo4j tests pass. New TDD tests: tests/test_flush_global_lock_order.py, tests/neo4j/test_lock_timeout_fail_loud.py.

Known ceiling / follow-ups (not in this PR)

Throughput ceiling ~12 events/s is supernode write serialization. The lever to raise it (~order of magnitude) is reducing cross-session writes to the shared root/concept nodes — but it needs a safe replacement for the HAS_SUBSESSION MATCH (which silently drops the edge if the parent Session isn't committed). Tracked as a follow-up.
Latent silent-edge-drop: HAS_SUBSESSION MATCH … MERGE silently drops the edge if the parent Session is absent at child-flush time — pre-existing, independent of fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal (v4.0.1) #15. Tracked.
FD/ulimit ceiling past ~1024 concurrent never-ending sessions is a config matter (raise ulimit); still 0 loss when crossed.

…dlocks Restore deterministic global write-lock ordering in the Neo4j flush path to eliminate concurrent-ingestion deadlocks. PR #15 (v4.0.1) replaced the single global write transaction with a per-chunk coordinator, which lost the global node lock order — concurrent drainers then acquired locks out of order and Neo4j raised TransientError.DeadlockDetected (and, with db.lock.acquisition.timeout=0, caused silent blocking stalls). This fix applies two changes: 1. Row-level global order (Layer 1): Sort the full node snapshot before chunking; iterate edge_groups in sorted type order; sort the edge snapshot before chunking. This eliminated 194 → 80 deadlocks in testing, but left topological inversions that row-sorting cannot reach. 2. Canonical node-lock pre-acquisition (Layer 2): Acquire write-locks on all edge endpoints in node_id-sorted order before the edge MERGEs, eliminating the root ↔ concept lock-order inversion. Implemented Eager-free via per-endpoint MATCH...SET (APOC not installed); confirmed by EXPLAIN on the reproduction stack. Verification (isolated reproduction stack, prod untouched): - DeadlockDetected: 194 → 80 (row-sort) → 0 (pre-lock), at concept in_degree 1280 - 0 driver retries, 0 dead-letters; drain kept up (1408 → 692 queued) - Full Neo4j integration suite: 61 passed (incl. orphan OOM-surfacing test) - Non-Neo4j tests: 1372 passed; lint clean - New TDD test asserts global node/edge lock-order invariant and sorted pre-lock Fixes: #15 regression (deadlock regression in v4.0.1) Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

…il loud With db.lock.acquisition.timeout=0 (unlimited) a blocked (non-deadlock) transaction waits forever and raises nothing — the silent drain stall Brian observed. A finite timeout converts that into a bounded TransientError that flows through the existing retry → dead-letter path (already logged at ERROR; /status reports degraded). Set NEO4J_db_lock_acquisition_timeout="30s" in the shipped docker-compose neo4j service (tunable default). No application source change needed — the exhausted-flush path already surfaces loudly (flush_chunk_failed/drain_batch_exhausted at ERROR, dead_letter_total + degraded on /status). Deliberately NOT adding a write_semaphore acquire timeout (would dead-letter valid events under backpressure); the lock timeout alone bounds starvation. Verification: NEW fault-injection tests tests/neo4j/test_lock_timeout_fail_loud.py — RED probe on 0s reproduced the silent infinite block (hang); with the finite timeout the flush raises in ~5.5s, dead-letters in ~15s with degraded=true on /status, and other drainers keep progressing under contention. Full Neo4j suite 64 passed, 1372 non-Neo4j pass, harness DeadlockDetected still 0. 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

colombod · 2026-06-19T10:53:10Z

Closing in favor of #19 (merged), which is the more complete and live-proven fix for this incident.

#19 independently converges on the same root cause — PR #15's per-chunk flush lost the global lock order, and db.lock.acquisition.timeout=0 turned the resulting contention into silent, unrecoverable drain stalls — and additionally fixes a dominant real-world factor this PR missed: the label-free node MERGE/MATCH causing an AllNodesScan on the ~1.3M-node production graph. #19 addresses it with a universal :Node label + composite (node_id, workspace) index + idempotent backfill, with live before/after evidence (backlog 58→0 in ~2 min; slowest txn 25–30s → sub-ms).

We re-based our investigation onto merged #19 and measured our one non-redundant contribution — the canonical per-endpoint pre-lock. On an identical harness it drives residual DeadlockDetected 110 → 0, but #19 already self-heals those deadlocks losslessly (0 dead-letters, 0 data loss), and the pre-lock yields no throughput gain (~10.2 vs ~11.3 eps) at an unconditional hot-path cost. So it is not worth shipping. The branch and its evidence are preserved as a ready-to-revive mitigation if production telemetry ever shows the residual deadlock-retry churn causing real harm at scale.

Net: #19 is the fix. Our independent reproduction corroborates it (194→0 lock-order progression; topological root↔concept lock-order analysis; a four-gate evidence framework on an isolated real-Neo4j stack).

Two follow-ups worth tracking for the team (repo Issues are disabled, noting here):

Latent silent edge-drop — HAS_SUBSESSION (and similar cross-session edges) use MATCH (src) MATCH (dst) MERGE (src)-[r]->(dst); if an endpoint isn't committed yet the MATCH returns nothing and the edge is silently dropped (no error, no log). Independent of fix: chunked Neo4j flush eliminates OOM-induced ingest stall + failure-visibility signal (v4.0.1) #15, and the blocker for the throughput work below.
Throughput lever — the ~10–12 eps drain ceiling is supernode write serialization (every drainer contends on the shared root-Session + concept nodes). Reducing cross-session writes (W1/W2) is the real ~order-of-magnitude lever, gated on a safe replacement for the silent-MATCH edge write (item 1).

🤖 Generated with Amplifier

Co-Authored-By: Amplifier 240397093+microsoft-amplifier@users.noreply.github.com

colombod and others added 2 commits June 18, 2026 23:46

colombod closed this Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls)#18

fix(neo4j): restore global write-lock ordering + finite lock timeout (eliminate flush deadlocks & silent drain stalls)#18
colombod wants to merge 2 commits into
mainfrom
fix/flush-lock-order-and-fail-loud

colombod commented Jun 19, 2026

Uh oh!

colombod commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

colombod commented Jun 19, 2026

Summary

1. Restore global write-lock ordering (bba226b)

2. Finite lock-acquisition timeout — fail loud, never silent (657b800)

Evidence (isolated reproduction stack, real Neo4j)

Known ceiling / follow-ups (not in this PR)

Uh oh!

colombod commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Restore global write-lock ordering (`bba226b`)

2. Finite lock-acquisition timeout — fail loud, never silent (`657b800`)