Skip to content

feat: add MemWAL sharding evaluator#6854

Merged
jackye1995 merged 5 commits into
lance-format:mainfrom
jackye1995:jack/arrow-native-memwal-sharding
May 20, 2026
Merged

feat: add MemWAL sharding evaluator#6854
jackye1995 merged 5 commits into
lance-format:mainfrom
jackye1995:jack/arrow-native-memwal-sharding

Conversation

@jackye1995

Copy link
Copy Markdown
Contributor

Adds an Arrow-native MemWAL sharding evaluator and exposes it through the Java API/JNI.

  • Evaluates MemWAL sharding specs against Arrow RecordBatch values for bucket, identity, and unsharded fields.
  • Resolves sharding source IDs through a Java-provided source-id-to-column map.
  • Adds Java-facing ShardingEvaluator returning an Arrow reader for the evaluated sharding key batch.

This is needed by lance-spark to route writes using Lance's sharding semantics instead of duplicating Spark-side bucket logic.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added enhancement New feature or request A-java Java bindings + JNI labels May 20, 2026
Lift bucket sharding initialization to persist the configured shard field independently from primary-key metadata.
@github-actions github-actions Bot added the A-python Python bindings label May 20, 2026
Remove deprecated Region compatibility aliases from the Python MemWAL API and align raw bindings with Shard naming.
@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.84131% with 84 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/mem_wal/sharding.rs 79.16% 68 Missing and 12 partials ⚠️
rust/lance/src/dataset/mem_wal/api.rs 50.00% 3 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Xuanwo
Xuanwo previously requested changes May 20, 2026

@Xuanwo Xuanwo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Python sharding spec round-trip needs one fix before this lands.

Comment thread python/src/mem_wal.rs
field_id: get_py_value(field, "field_id")?.extract::<String>()?,
source_ids: get_py_value(field, "source_ids")?.extract::<Vec<i32>>()?,
transform: optional_string(get_py_value(field, "transform")?)?,
expression: optional_string(get_py_value(field, "expression")?)?,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes dict specs returned by Dataset.mem_wal_index_details() unusable with the new evaluator.

mem_wal_index_details() currently serializes each sharding field with field_id, source_ids, transform, result_type, and parameters, but it does not include expression. Since this parser now requires expression to be present, the natural flow below fails with Missing sharding spec field 'expression':

spec = ds.mem_wal_index_details()["sharding_specs"][0]
evaluate_sharding_spec(batch, spec, LanceSchema.from_pyarrow(batch.schema))

Could we either include expression in the dict returned by mem_wal_index_details() or treat a missing expression key as None here? I think adding it to mem_wal_index_details() is cleaner because that keeps the exported spec shape complete and round-trippable.

@jackye1995 jackye1995 May 20, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4f2ddc5 by including expression in the Python Dataset.mem_wal_index_details() sharding field dict. I also extended test_initialize_mem_wal_bucket_sharding to pass the returned dict spec directly into evaluate_sharding_spec(...), covering the round-trip path from the comment.

@Xuanwo Xuanwo dismissed their stale review May 20, 2026 07:19

Submitted as changes requested by mistake; intended as a non-blocking review comment.

@hamersaw hamersaw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! It feels a little off to be burying the sharding (partitioning / clustering) metadata into the memwal index implementation IMO. I think there are a lot of use-cases where we use this metadata (ex. writing dataset, Spark SPJ, etc) that is not tied to mem_wal in any way.

@jackye1995

Copy link
Copy Markdown
Contributor Author

Looks great to me! It feels a little off to be burying the sharding (partitioning / clustering) metadata into the memwal index implementation IMO. I think there are a lot of use-cases where we use this metadata (ex. writing dataset, Spark SPJ, etc) that is not tied to mem_wal in any way.

I agree, I think one related proposal is to lift this into some independent module like lance-sharding.

But on the other side, the good thing about this is user who sets this can directly write with WAL. It's just we don't handle the read path.

@jackye1995 jackye1995 requested a review from Xuanwo May 20, 2026 17:11
@jackye1995 jackye1995 merged commit bf68171 into lance-format:main May 20, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-java Java bindings + JNI A-python Python bindings enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants