Core, Spark: Fix stale manifest_length in rewrite_table_path manifest lists#16910
Core, Spark: Fix stale manifest_length in rewrite_table_path manifest lists#16910wombatu-kun wants to merge 4 commits into
Conversation
|
|
||
| // rebuild manifest-list files last, stamping manifest_length with the rewritten manifest sizes | ||
| Set<RewriteResult<ManifestFile>> manifestListResults = Sets.newConcurrentHashSet(); | ||
| Tasks.foreach(validSnapshots) |
There was a problem hiding this comment.
Note on performance: this loop currently defaults to single-threaded execution because PySpark/SQL cannot configure an Executor Service, directly causing slower runtimes for rewrite_table_path. I've raised a feature request to address this, and I believe it's an important bottleneck we should advocate to fix.
There was a problem hiding this comment.
The threading here is unchanged by this PR - both the new discovery loop and the manifest-list loop reuse the action's existing executorService, the same as the original single loop did. Making that executor parallel by default is orthogonal to the manifest_length fix, so #16752 is the right place to address it.
There was a problem hiding this comment.
Of course.
I assumed it was Out-Of-Scope, just wanted to point out a potential bottleneck.
leeyam24
left a comment
There was a problem hiding this comment.
I'm done reviewing, looks pretty good to me.
|
Thanks for the PR! I'll take a look. I reference this in #15470 (comment) I think it would be good to include this in the upcoming patch release. Could you rebase to resolve conflict? Might also be good to follow the pattern from #15470 |
… lists Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a81e706 to
babe6a8
Compare
…se line Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Rebased onto |
| ManifestFile newFile = file.copy(); | ||
| ((StructLike) newFile).set(0, newPath(newFile.path(), sourcePrefix, targetPrefix)); | ||
| ((StructLike) newFile) | ||
| .set(1, rewrittenManifestLengths.getOrDefault(file.path(), file.length())); |
There was a problem hiding this comment.
This seems like an intentional decision, based on
Manifests not rewritten in an incremental run keep their original length.
but it does still mean that this procedure will silently produce incorrect manifest lists in incremental mode, if rewritten manifest lists referenced manifests that were already rewritten in a previous run.
If the rewritten manifest size changed after the prefix rewrite, which it typically does, the new manifest list is still inconsistent with the file it references.
This is a known incorrect behaviour described in #13720 (comment) and acknowledged there.
There was a problem hiding this comment.
You're right - this is the same bug as your #13719 (#16905 is a later duplicate), and your #13720 went further than this PR: by rewriting every referenced manifest and requiring all lengths to be known, it kept the incremental case correct too. This PR keeps the incremental optimization (only new manifests are rewritten), so carried-over manifests fall back to file.length() and their manifest_length stays stale - the exact A/B/C case you raised with @dramaticlly. So the non-incremental path is fixed here but the incremental path is not, and closing it means re-rewriting all referenced manifests, the efficiency tradeoff that stalled #13720 at the community sync. I'd rather settle that completeness-vs-efficiency question on #13719 / dev@ than decide it unilaterally here, and I'm glad to help revive your #13720. Even as-is this PR is a strict improvement over main, where every entry kept the source length.
Closes #16905
Problem
RewriteTablePathUtil.rewriteManifestListrewrote each manifest-list entry updating only the manifest path and leftmanifest_length(spec field 501) at the source value. When the target prefix differs in length from the source, the rewritten manifest embeds different-length data-file paths and therefore differs in byte size, somanifest_lengthno longer matches the rewritten manifest on disk. Readers that validate the field (Trino, Impala, iceberg-rust) fail withIncorrect file size ... (end of stream not reached); Spark ignores the field, which is why the bug only surfaces on some engines.Fix
The rewritten manifest's true length is only known after the manifests are rewritten, but the Spark action wrote the manifest list first.
rebuildMetadatanow discovers the manifests to rewrite (read-only), rewrites them while capturing each one's byte length, and writes the manifest lists last, stampingmanifest_lengthfrom the captured lengths. The length is taken from the manifest writer (ManifestWriter.length()) rather than agetLength()/HEAD call, so it stays accurate even where the length of an in-progress write underreports. Manifests not rewritten in an incremental run keep their original length. Because the staged metadata is copied to the target byte-for-byte (Distcp), the recorded length equals the target file's actual size.Tests
TestRewriteTablePathsAction#testManifestLengthAfterRewrite: rewrites a multi-snapshot table to a longer target prefix and asserts thatmanifest_lengthmatches the on-disk manifest size for every manifest-list entry across all snapshots, current and historical (format versions 2-4, covering both Avro and Parquet manifests). Fails before this change, passes after.TestRewriteTablePathUtil: unit coverage for stampingmanifest_lengthfrom the length map and for falling back to the source length when a manifest is absent from the map.Notes
Same class of bug as #15470 (delete-file
file_size_in_bytesinside manifests), but formanifest_lengthin the manifest list; the two fixes are independent. They touch overlapping code in the rewrite path, so whichever lands second will need a small rebase.