fix: apply per-column compression metadata on direct-path writes #477
Open
LuciferYang wants to merge 1 commit into
When users write via df.write().format("lance").option("<col>.lance.compression",
...).save(path), Spark's DataFrameWriter routes the call through
SupportsCatalogOptions.createTable with properties=Map.empty, so the compression
setting is silently dropped at table creation time.
Apply SchemaConverter.processSchemaWithProperties against the merged write
options in newWriteBuilder so the lance-encoding:compression field metadata
reaches the Arrow schema before the writer runs. The TBLPROPERTIES and
writeTo().tableProperty() paths were already correct.
Add an E2E roundtrip suite that verifies both data integrity and the on-disk
codec (via Lance v2 footer + column metadata protobuf parsing) across lz4,
zstd, none, mixed per-column codecs, and all three write paths.
Count via collectAsList().size() rather than SELECT count(*) to sidestep a
Spark 3.5 V2 pushed-aggregation binary-compat bug that instantiates the old
2-arg Sum(Expression, Enumeration$Value) constructor removed in some 3.5.x
patch releases.
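The core of the fix (turning `<col>.lance.compression` write options into `lance-encoding:compression` field metadata) can be illustrated with plain Java maps. This is a minimal sketch, not the code in this PR: the class `CompressionOptionsSketch`, the helper `fieldMetadataFromOptions`, and the assumption that write options override table properties when merged are all hypothetical. Only the option suffix and the metadata key come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: not the lance-spark implementation. */
public class CompressionOptionsSketch {
    // Option suffix and metadata key as named in the commit message.
    static final String OPTION_SUFFIX = ".lance.compression";
    static final String METADATA_KEY = "lance-encoding:compression";

    /**
     * Hypothetical helper: for each column, look up "<col>.lance.compression"
     * in the merged options and, if present, emit field metadata for it.
     */
    static Map<String, Map<String, String>> fieldMetadataFromOptions(
            List<String> columns,
            Map<String, String> tableProps,
            Map<String, String> writeOptions) {
        // Assumed precedence: write options override table properties.
        Map<String, String> merged = new LinkedHashMap<>(tableProps);
        merged.putAll(writeOptions);

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String col : columns) {
            String codec = merged.get(col + OPTION_SUFFIX);
            if (codec != null) {
                result.put(col, Map.of(METADATA_KEY, codec));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> meta = fieldMetadataFromOptions(
                List.of("id", "payload"),
                Map.of(), // direct-path write: createTable saw properties=Map.empty
                Map.of("payload.lance.compression", "lz4"));
        System.out.println(meta);
    }
}
```

With empty table properties, the codec is recovered only because the write options are consulted as well, which is the situation the direct-path write ends up in.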
Summary

When users write via df.write().format("lance").option("<col>.lance.compression", "lz4").save(path), the compression setting is silently dropped. Spark's DataFrameWriter routes this path through SupportsCatalogOptions.createTable with properties=Map.empty, so the per-column compression metadata never reaches the Arrow schema at table-creation time.

Fix this in LanceDataset.newWriteBuilder by running SchemaConverter.processSchemaWithProperties against the merged write options, so the lance-encoding:compression field metadata is applied before handing off to the write builder. The TBLPROPERTIES and writeTo().tableProperty() paths already worked correctly; this only repairs the direct-path write.

Test plan

Added BaseCompressionRoundtripTest (with subclasses for Spark 3.4 and 3.5; reused by 4.0 / 4.1 via add-test-source) covering:

- TBLPROPERTIES path: lz4, zstd, none
- writeTo().tableProperty() path
- df.write().option(...).save(path): this is the scenario the fix addresses

Each test verifies both data integrity and the on-disk codec by parsing the Lance v2 footer and column-metadata protobuf to extract BufferCompression.scheme. Tests use file_format_version = '2.2' (block-level compression, no 4 KB miniblock threshold) and 10 K rows of compressible strings.
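To illustrate what "10 K rows of compressible strings" could look like, the sketch below builds repetitive rows and measures how much a general-purpose codec shrinks them. It is not taken from the test suite: the class `CompressibleRowsSketch`, the row contents, and the use of the JDK's `java.util.zip.Deflater` (DEFLATE, as a stdlib stand-in for lz4/zstd) are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Illustrative only: DEFLATE stands in for the lz4/zstd codecs used by the tests. */
public class CompressibleRowsSketch {
    /** Builds `rows` newline-terminated strings drawn from a small repeating vocabulary. */
    static byte[] buildRows(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append("compressible-value-").append(i % 10).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the DEFLATE-compressed size of `raw` in bytes. */
    static int deflatedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        // Drain the compressor until all input has been consumed.
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = buildRows(10_000);
        System.out.println("raw=" + raw.length + " bytes, deflated=" + deflatedSize(raw) + " bytes");
    }
}
```

Highly repetitive input like this leaves a clear gap between raw and compressed size, which is the property a roundtrip test would want from its data when exercising a codec.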