investigation: FLAKE_PROBE on obj_read NotFound (DO NOT MERGE)#6664

Draft
gtarpenning wants to merge 15 commits into master from gtarpenning/flake-probe-investigation
@gtarpenning
Member

Throwaway investigation branch. Adds a diagnostic SELECT against object_versions at the moment obj_read is about to raise NotFoundError, so we can tell whether the row is genuinely missing or is just being looked up wrong (wrong project_id, wrong digest, or a cross-project match).

Target flakes:

  • test_content_in_dataset
  • test_published_dataset_laziness

Re-triggering CI until we capture a FLAKE_PROBE log line on a failing run. Revert once we've learned something.

When `obj_read` is about to raise `NotFoundError`, query
`object_versions` for the same project_id/object_id/digest (and
cross-project matches on digest) and log the counts + a small sample
of actual rows. Lets us see, at the moment of flake, whether:
- the row is genuinely absent (points at write-side issue),
- the row exists with a different digest (ref-conversion / caching),
- the row exists in a different project_id (database isolation bug),
- or the digest matches something elsewhere.

CI-only throwaway. Revert once we've captured a failing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
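
The four outcomes listed in this commit message can be read off the probe's counts mechanically. A minimal sketch, assuming the probe returns four row counts; the `ProbeCounts` field names and the `classify` helper are hypothetical and not taken from the PR, and the "row present" branch is an added assumption for the case where the exact row turns out to exist:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeCounts:
    """Row counts a probe might collect (hypothetical field names)."""

    exact: int            # same project_id, object_id, and digest
    same_object: int      # same project_id and object_id, any digest
    other_project: int    # same object_id and digest, different project_id
    digest_anywhere: int  # same digest on any object in any project


def classify(counts: ProbeCounts) -> str:
    """Map probe counts to the most specific hypothesis they support."""
    if counts.exact > 0:
        # Assumption: the row exists, so the read path itself is suspect.
        return "row present: lookup or visibility issue on the read side"
    if counts.same_object > 0:
        return "digest mismatch: ref conversion or caching"
    if counts.other_project > 0:
        return "wrong project_id: database isolation bug"
    if counts.digest_anywhere > 0:
        return "digest matches elsewhere"
    return "row absent: write-side issue"
```

The checks run from most to least specific so that a single log line maps to exactly one hypothesis.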
@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 21.73913% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ve/trace_server/clickhouse_trace_server_batched.py | 21.73% | 17 Missing and 1 partial ⚠️ |

gtarpenning and others added 6 commits April 21, 2026 09:59
Round 2 of CI saw flakes in test_publish_round_trip_query_object, which
raises NotFoundError from `refs_read_batch` (L5470), not from `obj_read`
(L1899, where the first probe lives). Add the same diagnostic probe there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te/read

- tests/conftest.py: autouse fixture that sets NOT_FOUND_RETRY_WAIT_SECONDS=0
  and WEAVE_RETRY_MAX_ATTEMPTS=1 so write-then-read races surface as hard
  failures on first attempt (previously masked by 3x retry window).
- obj_create: log FLAKE_TRACE on successful INSERT (project_id, object_id,
  digest, database).
- obj_read: log FLAKE_TRACE on entry with the same fields.

Correlating these with the existing FLAKE_PROBE on the NotFoundError path
should tell us whether an earlier obj_create succeeded for the exact
(project, object, digest) that the failing obj_read is looking for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
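
The retry-disabling fixture described in this commit might look roughly like the sketch below. The two environment variable names and values come from the commit message; the `no_retry_env` helper is hypothetical, and the sketch assumes the retry code reads these variables at call time rather than at import time:

```python
import os
from contextlib import contextmanager
from unittest import mock

# Values from the commit message: no wait between retries, single attempt.
NO_RETRY_ENV = {
    "NOT_FOUND_RETRY_WAIT_SECONDS": "0",
    "WEAVE_RETRY_MAX_ATTEMPTS": "1",
}


@contextmanager
def no_retry_env():
    """Apply the no-retry settings, restoring the old environment on exit."""
    with mock.patch.dict(os.environ, NO_RETRY_ENV):
        yield


# In tests/conftest.py this could be wrapped as an autouse fixture:
#
#   @pytest.fixture(autouse=True)
#   def _disable_retries():
#       with no_retry_env():
#           yield
```

`mock.patch.dict` restores the previous environment when the block exits, so the settings do not leak between tests.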
@w-b-hivemind

w-b-hivemind Bot commented Apr 21, 2026

HiveMind Sessions

1 session · 1h 51m · $117

| Session | Agent | Duration | Tokens | Cost | Lines |
| --- | --- | --- | --- | --- | --- |
| Fix Pr Lint And Test Flakes Via Monkeypatching (7c92ac2d-b797-4be9-9440-18bd25e9a3d4) | claude | 1h 51m | 273.1K | $117 | +753 -235 |
| Total | | 1h 51m | 273.1K | $117 | +753 -235 |

Run `claude --resume 7c92ac2d-b797-4be9-9440-18bd25e9a3d4` to pick up where you left off.

gtarpenning and others added 8 commits April 21, 2026 11:13
- Remove hot-path FLAKE_TRACE logs on obj_create/obj_read. These were
  injecting ~0.1-1ms of logger.warning overhead per call and appear to
  close the race window (Heisenbug: probe stops reproducing the flake).
- Keep server-side failure-branch FLAKE_PROBE at L1899 and L5519 as a
  zero-cost backstop.
- Add test-level _probe_ch_state() helper in tests/trace/test_dataset.py
  and tests/trace/type_handlers/Content/test_content.py that, on catching
  NotFoundError/ValueError from ref.get(), queries object_versions
  directly and prints counts + sample so pytest captured stdout shows
  both the test failure AND the CH state at the moment of the failure.
- Wrap ref.get() in test_published_dataset_laziness and
  test_content_in_dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
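
The test-level wrapping described in this commit could take roughly the following shape. This is a sketch, not the PR's code: `get_with_probe` is a hypothetical name, the local `NotFoundError` class stands in for the server's exception, and `probe` stands in for a helper like `_probe_ch_state()` that queries object_versions directly:

```python
from typing import Any, Callable


class NotFoundError(Exception):
    """Stand-in for the trace server's NotFoundError (assumption)."""


def get_with_probe(ref: Any, probe: Callable[[], Any]) -> Any:
    """Call ref.get(); on a not-found style failure, print DB state and re-raise."""
    try:
        return ref.get()
    except (NotFoundError, ValueError):
        try:
            # Printed to stdout so pytest's captured output shows it
            # alongside the test failure.
            print(f"FLAKE_PROBE state={probe()!r}")
        except Exception as probe_err:
            # A broken probe must not mask the original failure.
            print(f"FLAKE_PROBE probe failed: {probe_err!r}")
        raise
```

The probe runs only on the failure branch, so the passing path carries no extra cost and the timing of the race is left undisturbed.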
test_image_as_property hit NotFoundError from file_content_read at L5838
(not obj_read). Different table (files, not object_versions), same race
pattern: file_create then file_content_read fails to find the just-written
chunks. Probe counts file_chunks rows by digest to see if the chunks are
present in CH at the moment the read fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…not arr[1:N]

Last run captured the flake (test_table_cant_set_bad_data → NotFoundError
Obj Table:...) and my probe fired, but the probe query itself failed with
'Syntax error at position 415 (AS)' because ClickHouse doesn't parse
`groupArray(...)[1:10] AS sample`. Fix all 5 probe sites to use
arraySlice().
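
A sketch of what a corrected probe query might look like, built in Python. The table and column names come from the PR description; `build_probe_sql` is a hypothetical helper, the exact query text in the PR is not shown here, and the `{name:Type}` placeholders assume ClickHouse server-side query parameters are in use:

```python
def build_probe_sql(sample_size: int = 10) -> str:
    """Build a probe query that samples digests with arraySlice()."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return (
        "SELECT count() AS n, "
        # arraySlice(array, offset, length) uses a 1-based offset; it
        # replaces the unsupported groupArray(...)[1:10] slice syntax.
        f"arraySlice(groupArray(digest), 1, {sample_size}) AS sample "
        "FROM object_versions "
        # Plain (non-f) strings, so these braces reach ClickHouse verbatim.
        "WHERE project_id = {project_id:String} "
        "AND object_id = {object_id:String}"
    )
```

ClickHouse's parametric form `groupArray(10)(digest)` would also cap the sample size at aggregation time; `arraySlice()` is used here because it is the fix the commit names.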