Skip to content

fix: apply per-column compression metadata on direct-path writes#477

Open
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:feat/compression-roundtrip-test
Open

fix: apply per-column compression metadata on direct-path writes#477
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:feat/compression-roundtrip-test

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

Summary

When users write via df.write().format("lance").option("<col>.lance.compression", "lz4").save(path), the compression setting is silently dropped. Spark's DataFrameWriter routes this path through SupportsCatalogOptions.createTable with properties=Map.empty, so the per-column compression metadata never reaches the Arrow schema at table-creation time.

Fix this in LanceDataset.newWriteBuilder by running SchemaConverter.processSchemaWithProperties against the merged write options, so the lance-encoding:compression field metadata is applied before handing off to the write builder. The TBLPROPERTIES and writeTo().tableProperty() paths already worked correctly; this only repairs the direct-path write.

Test plan

Added BaseCompressionRoundtripTest (with subclasses for Spark 3.4 and 3.5; reused by 4.0 / 4.1 via add-test-source) covering:

  • SQL TBLPROPERTIES path — lz4, zstd, none
  • Mixed per-column codecs (lz4 on one column, zstd on another)
  • DataFrame writeTo().tableProperty() path
  • Multiple appends reuse the table's compression settings
  • Direct-path df.write().option(...).save(path) — this is the scenario the fix addresses

Each test verifies both data integrity and the on-disk codec by parsing the Lance v2 footer and column-metadata protobuf to extract BufferCompression.scheme. Tests use file_format_version = '2.2' (block-level compression, no 4 KB miniblock threshold) and 10 K rows of compressible strings.

@github-actions github-actions Bot added the enhancement New feature or request label Apr 24, 2026
When users write via df.write().format("lance").option("<col>.lance.compression",
...).save(path), Spark's DataFrameWriter routes the call through
SupportsCatalogOptions.createTable with properties=Map.empty, so the compression
setting is silently dropped at table creation time.

Apply SchemaConverter.processSchemaWithProperties against the merged write
options in newWriteBuilder so the lance-encoding:compression field metadata
reaches the Arrow schema before the writer runs. The TBLPROPERTIES and
writeTo().tableProperty() paths were already correct.

Add an E2E roundtrip suite that verifies both data integrity and the on-disk
codec (via Lance v2 footer + column metadata protobuf parsing) across lz4,
zstd, none, mixed per-column codecs, and all three write paths.

Count via collectAsList().size() rather than SELECT count(*) to sidestep a
Spark 3.5 V2 pushed-aggregation binary-compat bug that instantiates the old
2-arg Sum(Expression, Enumeration$Value) constructor removed in some 3.5.x
patch releases.
@LuciferYang LuciferYang force-pushed the feat/compression-roundtrip-test branch from b8ff883 to 4a12ac6 Compare April 24, 2026 11:03
@LuciferYang LuciferYang changed the title feat: support per-column compression on direct-path writes fix: apply per-column compression metadata on direct-path writes Apr 24, 2026
@github-actions github-actions Bot added the bug Something isn't working label Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant