investigation: FLAKE_PROBE on obj_read NotFound (DO NOT MERGE)#6664
Draft
gtarpenning wants to merge 15 commits into master
When `obj_read` is about to raise `NotFoundError`, query `object_versions` for the same project_id/object_id/digest (and cross-project matches on digest) and log the counts + a small sample of actual rows. Lets us see, at the moment of flake, whether:

- the row is genuinely absent (points at write-side issue),
- the row exists with a different digest (ref-conversion / caching),
- the row exists in a different project_id (database isolation bug),
- or the digest matches something elsewhere.

CI-only throwaway. Revert once we've captured a failing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
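The four outcomes listed above can be read off the probe's row counts. A minimal sketch of that decision, assuming the probe yields four counts; the `ProbeCounts` shape, the `classify` helper, and the extra "row present" branch are hypothetical and not part of this PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeCounts:
    """Row counts a probe might collect from object_versions (hypothetical shape)."""

    exact: int            # same project_id, object_id and digest
    same_object: int      # same project_id and object_id, any digest
    other_project: int    # same object_id and digest, different project_id
    digest_anywhere: int  # same digest, any project or object


def classify(counts: ProbeCounts) -> str:
    """Map probe counts to the hypotheses in the commit message, most specific first."""
    if counts.exact > 0:
        # Assumption: not listed in the commit message. The row is there by the
        # time the probe runs, so the failing read likely raced the write.
        return "row present: lookup or visibility timing"
    if counts.same_object > 0:
        return "different digest: ref-conversion / caching"
    if counts.other_project > 0:
        return "different project_id: database isolation bug"
    if counts.digest_anywhere > 0:
        return "digest matches something elsewhere"
    return "row genuinely absent: write-side issue"
```

Checking the most specific match first keeps one log line from being attributed to two hypotheses at once.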
Round 2 of CI saw flakes in test_publish_round_trip_query_object, which raises NotFoundError from `refs_read_batch` (L5470), not from `obj_read` (L1899, where the first probe lives). Add the same diagnostic probe there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te/read

- tests/conftest.py: autouse fixture that sets NOT_FOUND_RETRY_WAIT_SECONDS=0 and WEAVE_RETRY_MAX_ATTEMPTS=1 so write-then-read races surface as hard failures on first attempt (previously masked by 3x retry window).
- obj_create: log FLAKE_TRACE on successful INSERT (project_id, object_id, digest, database).
- obj_read: log FLAKE_TRACE on entry with the same fields.

Correlating these with the existing FLAKE_PROBE on the NotFoundError path should tell us whether an earlier obj_create succeeded for the exact (project, object, digest) that the failing obj_read is looking for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
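A rough sketch of the retry override described in the first bullet, assuming both settings are read from environment variables at call time (the commit message does not say how they are consumed). The helper name `retries_disabled` is hypothetical:

```python
import os
from contextlib import contextmanager
from unittest.mock import patch

# Values taken from the commit message; env-var delivery is an assumption.
NO_RETRY_ENV = {
    "NOT_FOUND_RETRY_WAIT_SECONDS": "0",
    "WEAVE_RETRY_MAX_ATTEMPTS": "1",
}


@contextmanager
def retries_disabled():
    """Apply the no-retry settings, restoring the previous environment on exit."""
    with patch.dict(os.environ, NO_RETRY_ENV):
        yield
```

In `tests/conftest.py` this could be wrapped in a generator function decorated with `@pytest.fixture(autouse=True)` whose body is `with retries_disabled(): yield`, so every test runs with a single attempt and the race surfaces immediately.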
- Remove hot-path FLAKE_TRACE logs on obj_create/obj_read. These were injecting ~0.1-1ms of logger.warning overhead per call and appear to close the race window (Heisenbug: probe stops reproducing the flake).
- Keep server-side failure-branch FLAKE_PROBE at L1899 and L5519 as a zero-cost backstop.
- Add test-level _probe_ch_state() helper in tests/trace/test_dataset.py and tests/trace/type_handlers/Content/test_content.py that, on catching NotFoundError/ValueError from ref.get(), queries object_versions directly and prints counts + sample so pytest captured stdout shows both the test failure AND the CH state at the moment of the failure.
- Wrap ref.get() in test_published_dataset_laziness and test_content_in_dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
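The test-level wrapping described above might look roughly like this. It is a sketch only: `NotFoundError` here is a local stand-in for the real exception, `get_with_probe` is a hypothetical name, and the `probe` callable stands in for `_probe_ch_state()`, whose actual signature is not shown in this PR:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class NotFoundError(Exception):
    """Local stand-in for the real NotFoundError (hypothetical)."""


def get_with_probe(get: Callable[[], T], probe: Callable[[], str]) -> T:
    """Call get(); on a lookup failure, print the probe result and re-raise."""
    try:
        return get()
    except (NotFoundError, ValueError):
        # Printing before re-raising puts the database state next to the
        # traceback in pytest's captured stdout.
        print(f"FLAKE_PROBE {probe()}")
        raise
```

Because the probe only runs inside the `except` branch, the passing path pays nothing for it, which matters given that the hot-path logs appeared to close the race window.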
test_image_as_property hit NotFoundError from file_content_read at L5838 (not obj_read). Different table (files, not object_versions), same race pattern: file_create then file_content_read fails to find the just-written chunks. Probe counts file_chunks rows by digest to see if the chunks are present in CH at the moment the read fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…not arr[1:N]

Last run captured the flake (test_table_cant_set_bad_data → NotFoundError Obj Table:...) and my probe fired, but the probe query itself failed with 'Syntax error at position 415 (AS)' because ClickHouse doesn't parse `groupArray(...)[1:10] AS sample`. Fix all 5 probe sites to use arraySlice().
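For illustration, a probe query in the corrected form could be assembled as below. The selected columns, the `WHERE` clause, and the `build_probe_query` helper are assumptions, not the PR's actual SQL; only the `arraySlice(groupArray(...), 1, N)` shape comes from the fix described above:

```python
def build_probe_query(sample_size: int = 10) -> str:
    """Build a diagnostic SELECT that samples rows with arraySlice()."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return (
        "SELECT count() AS n, "
        # arraySlice(array, offset, length) uses a 1-based offset and replaces
        # the Python-style [1:N] slice that ClickHouse rejected.
        f"arraySlice(groupArray((project_id, object_id, digest)), 1, {sample_size}) AS sample "
        "FROM object_versions "
        # ClickHouse {name:Type} query parameter, left for the client to bind.
        "WHERE digest = {digest:String}"
    )
```

Only the integer sample size is interpolated into the string; the digest stays a bound parameter so the probe never concatenates user-controlled values into SQL.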
Throwaway investigation branch. Adds a diagnostic SELECT against `object_versions` at the moment `obj_read` is about to raise `NotFoundError`, so we can tell whether the row is genuinely missing or just being looked up wrong (project_id, digest, cross-project).

Target flakes:

- `test_content_in_dataset`
- `test_published_dataset_laziness`

Re-triggering CI until we capture a `FLAKE_PROBE` log line on a failing run. Revert once we've learned something.