fix(spark): strip recognized typed options from Rust storage_options map #520

Open
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:fix/upstream-typed-key-strip

Conversation

@LuciferYang
Contributor

LanceSparkReadOptions.Builder.fromOptions copies the entire input map into storageOptions before parseTypedFlags promotes recognized keys to their dedicated builder fields, and the promoted keys are never removed from the map. Typed connector options (path, batch_size, block_size, version, executor_credential_refresh, ...) therefore appear in the map forwarded to the Rust object_store layer via ReadOptions.setStorageOptions. The native layer silently drops unknown keys, so there is no user-visible breakage, but the native log carries entries that do not belong there and that confuse storage-layer troubleshooting.

LanceSparkWriteOptions has the same pattern for write_mode, max_row_per_file, batch_size, file_format_version, and eight others.

Change

Introduce a RECOGNIZED_TYPED_KEYS set on each options class and remove those keys from storageOptions at the end of Builder.build(). Stripping in build() rather than inside parseTypedFlags preserves the fromOptions → withCatalogDefaults merge semantics: catalog defaults merged after a per-read .option(...) still need to re-parse typed keys from a populated map, so only the final post-merge state is cleaned.
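A minimal sketch of the strip-in-build() idea, for illustration only. SketchReadOptions is a hypothetical stand-in, not the connector's actual class: the key set is a subset taken from the description above, the default batch size is arbitrary, and the "per-read value wins over catalog default" merge rule is an assumption about how withCatalogDefaults behaves.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for an options class; not the connector's real code. */
final class SketchReadOptions {
  // Assumed subset of the typed keys named in the PR description.
  static final Set<String> RECOGNIZED_TYPED_KEYS =
      Set.of("path", "batch_size", "block_size", "version", "executor_credential_refresh");

  private final Map<String, String> storageOptions;
  private final int batchSize;

  private SketchReadOptions(Map<String, String> storageOptions, int batchSize) {
    this.storageOptions = storageOptions;
    this.batchSize = batchSize;
  }

  Map<String, String> getStorageOptions() { return storageOptions; }

  int getBatchSize() { return batchSize; }

  static Builder builder() { return new Builder(); }

  static final class Builder {
    private final Map<String, String> storageOptions = new HashMap<>();
    private int batchSize = 512; // arbitrary default for the sketch

    Builder fromOptions(Map<String, String> options) {
      storageOptions.putAll(options); // the whole input map is kept for now
      parseTypedFlags(storageOptions);
      return this;
    }

    // Assumption: catalog defaults only fill keys the per-read options left unset.
    Builder withCatalogDefaults(Map<String, String> catalogDefaults) {
      for (Map.Entry<String, String> e : catalogDefaults.entrySet()) {
        storageOptions.putIfAbsent(e.getKey(), e.getValue());
      }
      parseTypedFlags(storageOptions); // re-parse from the still-populated map
      return this;
    }

    private void parseTypedFlags(Map<String, String> options) {
      String value = options.get("batch_size");
      if (value != null) {
        batchSize = Integer.parseInt(value);
      }
    }

    SketchReadOptions build() {
      // Strip typed keys only here, after every merge has been applied.
      Map<String, String> cleaned = new HashMap<>(storageOptions);
      cleaned.keySet().removeAll(RECOGNIZED_TYPED_KEYS);
      return new SketchReadOptions(Map.copyOf(cleaned), batchSize);
    }
  }
}
```

In this sketch, building from per-read options {batch_size=1024, aws_region=us-east-1} followed by catalog defaults {batch_size=256, allow_http=true} yields a batch size of 1024 and a storage map holding only aws_region and allow_http. Had parseTypedFlags removed batch_size eagerly, the putIfAbsent merge would have reintroduced the catalog's 256 and the re-parse would have overridden the per-read value.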

Surfaced while tracing how spark.sql.catalog.<name>.<key> configs flow through withCatalogDefaults (#476).

Test plan

  • 12 new tests in LanceSparkReadOptionsTypedKeyStrippingTest covering each typed key, catalog defaults, per-read overrides, and passthrough for genuine storage keys.
  • 8 new tests in LanceSparkWriteOptionsTypedKeyStrippingTest mirroring read-side coverage.

github-actions bot added the bug label on May 12, 2026

LanceSparkReadOptions.Builder.fromOptions saves the entire input map as
storageOptions before parseTypedFlags promotes recognized keys to their
dedicated builder fields, so typed connector-level knobs (path,
pushDownFilters, block_size, version, index_cache_size,
metadata_cache_size, batch_size, topN_push_down, nearest,
executor_credential_refresh) were leaking into the Rust-side
storage_options map, which is reserved for object-store credentials and
endpoint config (aws_*, gcs_*, allow_http, ...). LanceSparkWriteOptions
had the same pattern for write_mode, max_row_per_file,
max_rows_per_group, max_bytes_per_file, file_format_version,
use_queued_write_buffer, queue_depth, batch_size, enable_stable_row_ids,
use_large_var_types, max_batch_bytes, and blob_pack_file_size_threshold.

The Rust layer silently drops unknown keys, so no functional breakage —
this is debug-hygiene only. Cleanup surfaced while investigating how
typed read options flow through spark.sql.catalog.<name>.<key> catalog-
level configuration introduced by lance-format#476: the catalog-level path works,
but recognized typed keys also end up in the native storage_options
map, adding noise to storage-layer logs and debug output.

Introduce RECOGNIZED_TYPED_KEYS sets on both options classes and strip
them in Builder.build(). Stripping in build() (not inside
parseTypedFlags) preserves the fromOptions -> withCatalogDefaults merge
semantics: the chain can re-parse typed keys from a still-populated
storageOptions when per-read options merge over catalog defaults, and
only the final post-merge state is cleaned.

LuciferYang force-pushed the fix/upstream-typed-key-strip branch from bc2c3ee to 8c63dd5 on May 12, 2026, 11:53