fix: ensure sequence metadata list accesses are threadsafe in task runner #19467

Open
jtuglu1 wants to merge 1 commit into apache:master from jtuglu1:fix-streaming-task-runner-race
jtuglu1 commented May 15, 2026

Description

Fixes #19458 and cleans up SeekableStreamIndexTaskRunnerTest.java

The old code had racy logic: in multiple places it checked the size of the copy-on-write (COW) sequences list and then indexed into that list. A publishing thread calling .remove() on the list between the size() check and the index operation could invalidate the size check.

Race:

  1. [main runner thread]: calls getLastSequenceMetadata().
  2. [main runner thread]: inside that method, sequences.size() returns 2.
  3. [publisher thread]: removes an element from the sequences list.
  4. [main runner thread]: sequences.get() throws an out-of-bounds error because the size it read is now stale.
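
The interleaving above can be replayed deterministically on a single thread. This is a minimal sketch, not the task runner's code: the class name, method name, and String elements are hypothetical stand-ins, and it assumes the sequences list behaves like a java.util.concurrent.CopyOnWriteArrayList.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the runner's state; names are illustrative only.
final class SequenceRaceSketch
{
  // Replays steps 2-4 of the race in order on one thread, so the outcome is deterministic.
  static boolean staleSizeReadThrows()
  {
    List<String> sequences = new CopyOnWriteArrayList<>(List.of("seq-0", "seq-1"));

    int size = sequences.size();  // step 2: the runner observes size == 2
    sequences.remove(0);          // step 3: the publisher removes a sequence
    try {
      sequences.get(size - 1);    // step 4: index 1 no longer exists
      return false;
    }
    catch (IndexOutOfBoundsException e) {
      // Each individual call is thread-safe, but the size-then-get pair is not atomic.
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.println("stale read throws: " + staleSizeReadThrows());
  }
}
```

The point of the sketch is that a thread-safe collection does not make a check-then-act sequence safe; the two calls need a common guard or a single consistent view of the list.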

Fix

Taking a snapshot before doing multiple reads on the sequence list would technically fix this issue, but it would not prevent read/write interleavings that can leave in-memory and on-disk state temporarily inconsistent (especially if we crash). So I guarded the sequence list with a reentrant lock. Although the lock is re-acquired per record, load tests and flamegraphs showed no noticeable overhead. I believe this is mainly because the lock is rarely contended in the common case and the bottlenecks in the ingestion code lie elsewhere. Switching to a read-write lock is another option, but performance was the same between the two. I will add the performance benchmarks soon.
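
A minimal sketch of the locking approach, assuming a plain list guarded by a java.util.concurrent.locks.ReentrantLock. The class, field, and method names are hypothetical and do not reflect the actual diff in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; names are assumptions, not the PR's code.
final class GuardedSequencesSketch
{
  private final ReentrantLock sequencesLock = new ReentrantLock();
  private final List<String> sequences = new ArrayList<>();

  void add(String sequence)
  {
    sequencesLock.lock();
    try {
      sequences.add(sequence);
    }
    finally {
      sequencesLock.unlock();
    }
  }

  // Publisher-side removal takes the same lock as the readers.
  boolean remove(String sequence)
  {
    sequencesLock.lock();
    try {
      return sequences.remove(sequence);
    }
    finally {
      sequencesLock.unlock();
    }
  }

  // The emptiness check and the index operation happen under one lock hold,
  // so a concurrent remove() cannot slip in between them.
  String getLast()
  {
    sequencesLock.lock();
    try {
      return sequences.isEmpty() ? null : sequences.get(sequences.size() - 1);
    }
    finally {
      sequencesLock.unlock();
    }
  }
}
```

Because the lock is reentrant, a method that already holds it can call another guarded method without deadlocking, which keeps compound operations simple to write.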

Release note

Fix fatal race in streaming ingest task during segment publish.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

jtuglu1 force-pushed the fix-streaming-task-runner-race branch from 6f824ce to c2553ce May 16, 2026 00:19
jtuglu1 force-pushed the fix-streaming-task-runner-race branch from c2553ce to 4831e0e May 16, 2026 00:31
FrankChen021 left a comment

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 3 of 3 changed files.


This is an automated review by Codex GPT-5.5

Development

Successfully merging this pull request may close these issues.

java.lang.ArrayIndexOutOfBoundsException in streaming supervisor index task

3 participants