fix: apply per-column compression metadata on direct-path writes by LuciferYang · Pull Request #477 · lance-format/lance-spark

LuciferYang · 2026-04-24T11:01:24Z

Summary

When users write via df.write().format("lance").option("<col>.lance.compression", "lz4").save(path), the compression setting is silently dropped. Spark's DataFrameWriter routes this path through SupportsCatalogOptions.createTable with properties=Map.empty, so the per-column compression metadata never reaches the Arrow schema at table-creation time.

Fix this in LanceDataset.newWriteBuilder by running SchemaConverter.processSchemaWithProperties against the merged write options, so the lance-encoding:compression field metadata is applied before handing off to the write builder. The TBLPROPERTIES and writeTo().tableProperty() paths already worked correctly; this only repairs the direct-path write.
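The mapping this fix relies on can be sketched in plain Java, without Spark or Arrow on the classpath. This is a minimal illustration, not the actual SchemaConverter.processSchemaWithProperties implementation: the class CompressionOptionSketch and the helper fieldMetadata are hypothetical names, and only the option suffix and the metadata key are taken from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompressionOptionSketch {
    // Option-key suffix and field-metadata key as named in the PR description.
    static final String OPTION_SUFFIX = ".lance.compression";
    static final String METADATA_KEY = "lance-encoding:compression";

    /**
     * Hypothetical helper: for each column, look up "<col>.lance.compression"
     * in the merged write options and, if present, record the codec as
     * field-level metadata. Columns without an option get empty metadata.
     */
    static Map<String, Map<String, String>> fieldMetadata(
            List<String> columns, Map<String, String> writeOptions) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String column : columns) {
            Map<String, String> metadata = new LinkedHashMap<>();
            String codec = writeOptions.get(column + OPTION_SUFFIX);
            if (codec != null) {
                metadata.put(METADATA_KEY, codec);
            }
            result.put(column, metadata);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative column names; only "payload" carries a codec option.
        Map<String, String> options = Map.of("payload.lance.compression", "lz4");
        System.out.println(fieldMetadata(List.of("id", "payload"), options));
    }
}
```

The point of the sketch is the ordering: the metadata has to be derived from the write options before the schema is handed to the writer, because on the direct path the table properties are empty.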

Test plan

Added BaseCompressionRoundtripTest (with subclasses for Spark 3.4 and 3.5; reused by 4.0 / 4.1 via add-test-source) covering:

SQL TBLPROPERTIES path — lz4, zstd, none
Mixed per-column codecs (lz4 on one column, zstd on another)
DataFrame writeTo().tableProperty() path
Multiple appends reuse the table's compression settings
Direct-path df.write().option(...).save(path) — this is the scenario the fix addresses

Each test verifies both data integrity and the on-disk codec by parsing the Lance v2 footer and column-metadata protobuf to extract BufferCompression.scheme. Tests use file_format_version = '2.2' (block-level compression, no 4 KB miniblock threshold) and 10 K rows of compressible strings.
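The first step of that on-disk check, reading the Lance v2 footer, might look roughly like the sketch below. The test's own parsing code is not shown in this PR, so this is an assumption-laden illustration: the 40-byte little-endian layout (three u64 offsets, two u32 counts, two u16 version fields, then the "LANC" magic) is my reading of the Lance file format documentation, and LanceFooterSketch and its field names are hypothetical. Locating the column metadata and decoding the protobuf to reach BufferCompression.scheme is not covered.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LanceFooterSketch {
    // ASSUMPTION: 3 x u64 offsets + 2 x u32 counts + 2 x u16 versions + 4-byte magic.
    static final int FOOTER_SIZE = 40;
    static final byte[] MAGIC = "LANC".getBytes(StandardCharsets.US_ASCII);

    /** Hypothetical holder for the footer fields; names are illustrative. */
    static final class Footer {
        final long columnMetaStart;
        final long cmoTableStart;
        final long gboTableStart;
        final int numGlobalBuffers;
        final int numColumns;
        final int major;
        final int minor;

        Footer(long columnMetaStart, long cmoTableStart, long gboTableStart,
               int numGlobalBuffers, int numColumns, int major, int minor) {
            this.columnMetaStart = columnMetaStart;
            this.cmoTableStart = cmoTableStart;
            this.gboTableStart = gboTableStart;
            this.numGlobalBuffers = numGlobalBuffers;
            this.numColumns = numColumns;
            this.major = major;
            this.minor = minor;
        }
    }

    /** Parses the last 40 bytes of the given buffer as a footer. */
    static Footer parse(byte[] fileTail) {
        if (fileTail.length < FOOTER_SIZE) {
            throw new IllegalArgumentException("need at least 40 trailing bytes");
        }
        ByteBuffer buf = ByteBuffer
                .wrap(fileTail, fileTail.length - FOOTER_SIZE, FOOTER_SIZE)
                .order(ByteOrder.LITTLE_ENDIAN);
        long columnMetaStart = buf.getLong();
        long cmoTableStart = buf.getLong();
        long gboTableStart = buf.getLong();
        int numGlobalBuffers = buf.getInt();
        int numColumns = buf.getInt();
        int major = buf.getShort() & 0xFFFF; // read as unsigned 16-bit
        int minor = buf.getShort() & 0xFFFF;
        byte[] magic = new byte[4];
        buf.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("missing LANC magic");
        }
        return new Footer(columnMetaStart, cmoTableStart, gboTableStart,
                numGlobalBuffers, numColumns, major, minor);
    }

    public static void main(String[] args) {
        // Synthetic footer with arbitrary values, not bytes from a real Lance file.
        ByteBuffer b = ByteBuffer.allocate(FOOTER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        b.putLong(100L).putLong(200L).putLong(300L);
        b.putInt(1).putInt(3).putShort((short) 2).putShort((short) 2);
        b.put(MAGIC);
        Footer f = parse(b.array());
        System.out.println("columns=" + f.numColumns + " columnMetaStart=" + f.columnMetaStart);
    }
}
```

The version values in main are placeholders; how file_format_version = '2.2' is encoded in the footer's version fields is not something this sketch asserts.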

When users write via df.write().format("lance").option("<col>.lance.compression", ...).save(path), Spark's DataFrameWriter routes the call through SupportsCatalogOptions.createTable with properties=Map.empty, so the compression setting is silently dropped at table-creation time.

Apply SchemaConverter.processSchemaWithProperties against the merged write options in newWriteBuilder so the lance-encoding:compression field metadata reaches the Arrow schema before the writer runs. The TBLPROPERTIES and writeTo().tableProperty() paths were already correct.

Add an E2E roundtrip suite that verifies both data integrity and the on-disk codec (via Lance v2 footer + column metadata protobuf parsing) across lz4, zstd, none, mixed per-column codecs, and all three write paths.

Count via collectAsList().size() rather than SELECT count(*) to sidestep a Spark 3.5 V2 pushed-aggregation binary-compat bug that instantiates the old 2-arg Sum(Expression, Enumeration$Value) constructor removed in some 3.5.x patch releases.

github-actions Bot added the enhancement New feature or request label Apr 24, 2026

LuciferYang force-pushed the feat/compression-roundtrip-test branch from b8ff883 to 4a12ac6 Compare April 24, 2026 11:03

LuciferYang changed the title ~~feat: support per-column compression on direct-path writes~~ fix: apply per-column compression metadata on direct-path writes Apr 24, 2026

github-actions Bot added the bug Something isn't working label Apr 24, 2026

LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:feat/compression-roundtrip-test
