API, Core: Add manifestLocation API to ContentFile#11044
Conversation
72323a7 to
2d88443
Compare
| * Returns the path of the manifest which this file is referenced in or null if it was not read | ||
| * from a manifest. | ||
| */ | ||
| default CharSequence manifestPath() { |
There was a problem hiding this comment.
Shall we put this method before pos() so both are co-located?
There was a problem hiding this comment.
We need to discuss using CharSequence for paths in our APIs. I am not sure this approach is working out very well. I believe the idea was to expose Avro Utf8 directly. The problem is that Utf8 is not serializable. Therefore, we always convert Utf8 to String. Working with CharSequence adds a lot of complexity for no performance benefit. Puffin and a few other places use String for paths.
@rdblue, is my assumption about CharSequence correct? What are your thoughts now?
There was a problem hiding this comment.
BTW, path() in ManifestFile returns String.
There was a problem hiding this comment.
I think it would be fine to use String here and always expect to convert.
There was a problem hiding this comment.
Updated to use String. I also published #11092 to deprecate the path API which returns CharSequence, in favor of a location API which returns String. This removal would happen in 2.0
|
I took another look. I am on board. My high-level comments are below.
|
Great points on these, will run distributed planning benchmarks to make sure there's not a regression from adding this. |
b923d60 to
494339a
Compare
|
@aokolnychyi I ran the plan benchmarks, and added to the PR description. there's no significant difference in planning before and after.
For tests, we actually already have SparkDistributedDataScanTestBase which tests different combinations of local/distributed for data file and delete file planning. We have TestSparkDistributedDataScanJavaSerialization and TestSparkDistributedDataScanKryoSerialization for testing java/kryo serialization respectively. So I think we're effectively getting coverage in different planning + serialization combinations for the new code path. Please let me know if you think there's a coverage gap though! |
494339a to
32145d7
Compare
32145d7 to
499011d
Compare
aokolnychyi
left a comment
There was a problem hiding this comment.
LGTM, one nit in the tests. Thanks, @amogh-jahagirdar!
|
Verified TestSparkReaderDeletes#testMultiplePosDeleteFiles() failure is unrelated (ran locally a few times with my changes and it passed). The test may be flaky. I'll push another change to address the nit and we can verify if it's flaky. |
…path to a manifest from which the content file resides in
499011d to
bbf159b
Compare
|
This time |
|
I'm going to go ahead and merge. Thanks for reviewing @aokolnychyi @rdblue! |
This change adds a manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in. If the data or delete file was not read from a manifest, the API returns null. The planned use case for this API is to enable a more optimized determination of which manifests need to be rewritten during commits.
Note: To be very clear this is not a spec change, this is an additional read API on ContentFile which surfaces the manifest path on reading of the manifest entry
Benchmark results prior to the changes:
Benchmark results after the changes: