feat: add MemWAL sharding evaluator by jackye1995 · Pull Request #6854 · lance-format/lance

jackye1995 · 2026-05-20T02:33:00Z

Adds an Arrow-native MemWAL sharding evaluator and exposes it through the Java API/JNI.

Evaluates MemWAL sharding specs against Arrow RecordBatch values for bucket, identity, and unsharded fields.
Resolves sharding source IDs through a Java-provided source-id-to-column map.
Adds Java-facing ShardingEvaluator returning an Arrow reader for the evaluated sharding key batch.
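
As a rough illustration of what evaluating a sharding spec involves, the sketch below applies the three field kinds named above to plain Python columns. Everything in it is hypothetical: the `ShardField` class, the `num_buckets` parameter, the `evaluate` helper, the constant key for unsharded fields, and the CRC32-based bucket hash are stand-ins chosen for the example, not Lance's API or hash function. The real evaluator works on Arrow RecordBatch values.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class ShardField:
    """Hypothetical stand-in for one field of a sharding spec."""
    field_id: str
    source_ids: List[int]
    transform: str                      # "bucket", "identity", or "unsharded"
    num_buckets: Optional[int] = None   # assumed parameter; only used by "bucket"


def evaluate(batch: Dict[str, Sequence], fields: List[ShardField],
             source_id_to_column: Dict[int, str]) -> Dict[str, list]:
    """Compute one sharding-key column per field (illustrative only)."""
    num_rows = len(next(iter(batch.values()))) if batch else 0
    keys: Dict[str, list] = {}
    for f in fields:
        if f.transform == "unsharded":
            # Assumption: every row maps to the same single shard.
            keys[f.field_id] = [0] * num_rows
            continue
        # Resolve the source ID to a column name through the caller's map.
        column = batch[source_id_to_column[f.source_ids[0]]]
        if f.transform == "identity":
            keys[f.field_id] = list(column)
        elif f.transform == "bucket":
            if not f.num_buckets:
                raise ValueError("bucket transform needs num_buckets")
            # CRC32 is a placeholder hash, NOT the hash Lance uses.
            keys[f.field_id] = [
                zlib.crc32(repr(v).encode("utf-8")) % f.num_buckets
                for v in column
            ]
        else:
            raise ValueError(f"unknown transform: {f.transform}")
    return keys


batch = {"user_id": [1, 2, 3], "region": ["us", "eu", "us"]}
fields = [
    ShardField("by_user", [0], "bucket", num_buckets=4),
    ShardField("by_region", [1], "identity"),
    ShardField("all_rows", [], "unsharded"),
]
keys = evaluate(batch, fields, {0: "user_id", 1: "region"})
print(keys["by_region"])  # ['us', 'eu', 'us']
```

The caller supplies the source-id-to-column map, so the evaluation logic itself never needs to know column names; that mirrors the Java-provided map described above.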

This is needed by lance-spark to route writes using Lance's sharding semantics instead of duplicating Spark-side bucket logic.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov · 2026-05-20T03:57:19Z

Codecov Report

❌ Patch coverage is 78.84131% with 84 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/mem_wal/sharding.rs	79.16%	68 Missing and 12 partials ⚠️
rust/lance/src/dataset/mem_wal/api.rs	50.00%	3 Missing and 1 partial ⚠️
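
The figures in the report can be cross-checked with a few lines of arithmetic. The sketch below assumes that the headline "84 lines" counts both missing and partial lines, and that the patch percentage is covered lines divided by total patch lines; the implied patch size is an inference from those assumptions, not a number stated in the report.

```python
# (missing, partials) per file, copied from the Codecov table above.
rows = {
    "rust/lance/src/dataset/mem_wal/sharding.rs": (68, 12),
    "rust/lance/src/dataset/mem_wal/api.rs": (3, 1),
}

# Assumption: partial lines count as uncovered in the headline number.
uncovered = sum(missing + partial for missing, partial in rows.values())

patch_coverage_pct = 78.84131  # headline figure from the report

# Inferred, not reported: total patch lines implied by the percentage.
implied_patch_lines = round(uncovered / (1 - patch_coverage_pct / 100))
covered = implied_patch_lines - uncovered

print(uncovered, implied_patch_lines, covered)  # 84 397 313
```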

Xuanwo

I think the Python sharding spec round-trip needs one fix before this lands.

Xuanwo · 2026-05-20T07:18:31Z

+        field_id: get_py_value(field, "field_id")?.extract::<String>()?,
+        source_ids: get_py_value(field, "source_ids")?.extract::<Vec<i32>>()?,
+        transform: optional_string(get_py_value(field, "transform")?)?,
+        expression: optional_string(get_py_value(field, "expression")?)?,


This makes dict specs returned by Dataset.mem_wal_index_details() unusable with the new evaluator.

mem_wal_index_details() currently serializes each sharding field with field_id, source_ids, transform, result_type, and parameters, but it does not include expression. Since this parser now requires expression to be present, the natural flow below fails with Missing sharding spec field 'expression':

spec = ds.mem_wal_index_details()["sharding_specs"][0]
evaluate_sharding_spec(batch, spec, LanceSchema.from_pyarrow(batch.schema))

Could we either include expression in the dict returned by mem_wal_index_details() or treat a missing expression key as None here? I think adding it to mem_wal_index_details() is cleaner because that keeps the exported spec shape complete and round-trippable.
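
The two options in this comment can be sketched in plain Python. This is only an illustration of the behavior under discussion: `parse_strict`, `parse_lenient`, and the sample dict values (`"bucket_0"`, `"int32"`, `num_buckets`) are hypothetical and do not reproduce the actual PyO3 parser or the real output of `mem_wal_index_details()`.

```python
from typing import Any, Dict, Optional


def _optional_str(value: Any) -> Optional[str]:
    """Map None to None and anything else to its string form."""
    return None if value is None else str(value)


def parse_strict(field: Dict[str, Any]) -> Dict[str, Any]:
    """Every key must be present, as in the parser under review."""
    for key in ("field_id", "source_ids", "transform", "expression"):
        if key not in field:
            raise ValueError(f"Missing sharding spec field '{key}'")
    return {
        "field_id": str(field["field_id"]),
        "source_ids": [int(i) for i in field["source_ids"]],
        "transform": _optional_str(field["transform"]),
        "expression": _optional_str(field["expression"]),
    }


def parse_lenient(field: Dict[str, Any]) -> Dict[str, Any]:
    """Alternative: treat a missing optional key as None."""
    for key in ("field_id", "source_ids"):
        if key not in field:
            raise ValueError(f"Missing sharding spec field '{key}'")
    return {
        "field_id": str(field["field_id"]),
        "source_ids": [int(i) for i in field["source_ids"]],
        "transform": _optional_str(field.get("transform")),
        "expression": _optional_str(field.get("expression")),
    }


# Hypothetical exported field that lacks the "expression" key.
exported = {
    "field_id": "bucket_0",
    "source_ids": [0],
    "transform": "bucket",
    "result_type": "int32",
    "parameters": {"num_buckets": "4"},
}

try:
    parse_strict(exported)
    strict_error = None
except ValueError as err:
    strict_error = str(err)

print(strict_error)                         # Missing sharding spec field 'expression'
print(parse_lenient(exported)["expression"])  # None
```

Adding the key on the export side keeps the strict parser unchanged and makes the dict round-trippable, which is the trade-off the comment argues for.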

Addressed in 4f2ddc5 by including expression in the Python Dataset.mem_wal_index_details() sharding field dict. I also extended test_initialize_mem_wal_bucket_sharding to pass the returned dict spec directly into evaluate_sharding_spec(...), covering the round-trip path from the comment.

Submitted as changes requested by mistake; intended as a non-blocking review comment.

hamersaw

Looks great to me! It feels a little off to be burying the sharding (partitioning / clustering) metadata into the memwal index implementation IMO. I think there are a lot of use-cases where we use this metadata (ex. writing dataset, Spark SPJ, etc) that is not tied to mem_wal in any way.

jackye1995 · 2026-05-20T16:21:01Z

> Looks great to me! It feels a little off to be burying the sharding (partitioning / clustering) metadata into the memwal index implementation IMO. I think there are a lot of use-cases where we use this metadata (ex. writing dataset, Spark SPJ, etc) that is not tied to mem_wal in any way.

I agree. One related proposal is to lift this into an independent module such as lance-sharding.

On the other hand, the good thing about the current placement is that a user who sets this can write with the WAL directly. It's just that we don't handle the read path.

feat: add memwal sharding evaluator

147b9ca

claude Bot reviewed May 20, 2026

github-actions Bot added the enhancement (New feature or request) and A-java (Java bindings + JNI) labels May 20, 2026

feat: expose memwal sharding evaluator to python

043c0a8

Lift bucket sharding initialization to persist the configured shard field independently from primary-key metadata.

github-actions Bot added the A-python (Python bindings) label May 20, 2026

feat: expose shard-only python memwal api

7eb4a29

Remove deprecated Region compatibility aliases from the Python MemWAL API and align raw bindings with Shard naming.

feat: derive memwal sharding sources from schema

e35a92a

jackye1995 mentioned this pull request May 20, 2026

feat: use Lance MemWAL sharding for bucketed writes and SPJ lance-format/lance-spark#519

Merged

Xuanwo previously requested changes May 20, 2026

hamersaw approved these changes May 20, 2026

fix: include memwal sharding expressions in python details

4f2ddc5

jackye1995 requested a review from Xuanwo May 20, 2026 17:11

Xuanwo approved these changes May 20, 2026

jackye1995 merged commit bf68171 into lance-format:main May 20, 2026
28 checks passed

jackye1995 merged 5 commits into lance-format:main from jackye1995:jack/arrow-native-memwal-sharding