Skip to content

feat(java): allow schema override for fragment writes#6919

Open
beinan wants to merge 1 commit into
lance-format:mainfrom
beinan:user/beinan/write-fragment-schema-override
Open

feat(java): allow schema override for fragment writes#6919
beinan wants to merge 1 commit into
lance-format:mainfrom
beinan:user/beinan/write-fragment-schema-override

Conversation

@beinan
Copy link
Copy Markdown
Contributor

@beinan beinan commented May 22, 2026

Issue

In distributed writes, workers can create uncommitted fragments and defer the final commit until all fragments are ready. The Java WriteFragmentBuilder only exposed APPEND mode without a way to pass the target dataset schema, so lance-core had to open the existing dataset to infer schema and field IDs before writing each fragment.

That dataset open is unnecessary when the caller already has the target schema, and it becomes expensive for datasets with very large fragment counts because opening the dataset has to load/read manifest metadata. This shows up as fragment writing getting slower as the dataset grows, even before the final commit step.

Summary

  • Add WriteFragmentBuilder.schema(Schema) so Java distributed writers can pass the target dataset schema when creating uncommitted fragments.
  • Pass the optional schema through Arrow FFI/JNI into FragmentCreateBuilder.schema(...), avoiding the append-mode dataset open used only for schema inference.
  • Preserve current base path / object store write parameters when the schema override path is used.
  • Add Java coverage for append fragment writes with a schema override.

Benefits

  • Lets Java callers avoid an expensive dataset open per fragment write in distributed append workflows.
  • Keeps Lance field IDs from the target dataset schema instead of inferring from the incoming Arrow batches.
  • Makes the Java API match the underlying Rust FragmentCreateBuilder capability.
  • Reduces write-time overhead for datasets with high fragment counts, especially when many workers are writing fragments concurrently.

Testing

  • cargo check --manifest-path /tmp/lance-write-fragment-schema-pr/java/lance-jni/Cargo.toml
  • ./mvnw -Dtest=org.lance.FragmentTest#testWriteFragmentWithSchemaOverride test

@github-actions github-actions Bot added enhancement New feature or request java labels May 22, 2026
@beinan beinan marked this pull request as ready for review May 22, 2026 20:18
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@beinan beinan force-pushed the user/beinan/write-fragment-schema-override branch from 7da7ecc to 74fa3b3 Compare May 22, 2026 20:29
Copy link
Copy Markdown
Contributor

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just one small change.

Concern: schema override silently reassigns Lance field IDs

The override is only safe for a freshly-created dataset whose field IDs happen to be 0..N in declaration order — which is exactly (and only) what the new test covers. For any dataset that has evolved (dropped/added/reordered columns, or nested fields with gappy IDs), this path produces fragments whose data-file field IDs point at the wrong columns.

Trace:

  1. The override transports a plain Arrow schema, which can't carry Lance field IDs. On the Rust side it's converted with Schema::try_from(&arrow_schema) (java/lance-jni/src/fragment.rs), which calls set_field_id(None) and blindly assigns sequential IDs 0,1,2,… (rust/lance-core/src/datatypes/schema.rs:667-674).
  2. This bypasses the very mechanism that exists to keep IDs correct — the normal append path loads the dataset "because it has the correct field ids" (rust/lance/src/dataset/fragment/write.rs:319-321), and the override short-circuits it (write.rs:298-299).
  3. The doc comment advises "The schema should come from the target dataset so Lance field IDs are preserved", but that doesn't actually work: Dataset.getSchema()inner_import_ffi_schemaSchema::from(...) goes through From<&Field> for ArrowField (rust/lance-core/src/datatypes/field.rs:1196), which never emits lance:field_id. So a dataset-derived Arrow schema arrives field-ID-less and gets renumbered anyway.
  4. validate_schema (write.rs:351-359) only compares the data stream against the override — never the override against the target dataset — so a clean write gives zero guarantee the IDs line up.

Why the test passes anyway: SimpleTestDataset declares id, name in that exact order, so its IDs are 0, 1 and the blind reassignment coincidentally matches. The test also asserts only fragments.size() and physicalRows — it never commits + reads back, and never inspects field IDs, so it can't catch a mismatch.

Prior art for doing it right: convert_schema_from_operation / from_arrow_schema (java/lance-jni/src/transaction.rs:826-944) derives field IDs by name against the read schema (metadata → base schema → max id). The fragment-write path reimplements schema handling and skips all of it.

Suggested fix (keeps the no-reopen goal of distributed writes):

  • Accept a LanceSchema (from Dataset.getLanceSchema(), which already carries id/parent_id) instead of an Arrow Schema, and transport the IDs across FFI (e.g. embed lance:field_id in field metadata for this path only — Field::try_from already reads it at field.rs:1084).
  • In the override path, fail closed: if any field id is -1 after conversion, error out instead of renumbering.
  • Strengthen the test: use a dataset with non-contiguous field IDs (add then drop a column), write via override, commit (Append), read back and assert values, and assert the fragment's data-file field IDs match the manifest.

@beinan beinan force-pushed the user/beinan/write-fragment-schema-override branch from 74fa3b3 to 09f9813 Compare May 22, 2026 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants