feat(java): allow schema override for fragment writes#6919
Conversation
7da7ecc to
74fa3b3
Compare
There was a problem hiding this comment.
Looks great, just one small change.
Concern: schema override silently reassigns Lance field IDs
The override is only safe for a freshly-created dataset whose field IDs happen to be 0..N in declaration order — which is exactly (and only) what the new test covers. For any dataset that has evolved (dropped/added/reordered columns, or nested fields with gappy IDs), this path produces fragments whose data-file field IDs point at the wrong columns.
Trace:
- The override transports a plain Arrow schema, which can't carry Lance field IDs. On the Rust side it's converted with
Schema::try_from(&arrow_schema)(java/lance-jni/src/fragment.rs), which callsset_field_id(None)and blindly assigns sequential IDs0,1,2,…(rust/lance-core/src/datatypes/schema.rs:667-674). - This bypasses the very mechanism that exists to keep IDs correct — the normal append path loads the dataset "because it has the correct field ids" (
rust/lance/src/dataset/fragment/write.rs:319-321), and the override short-circuits it (write.rs:298-299). - The doc comment advises "The schema should come from the target dataset so Lance field IDs are preserved", but that doesn't actually work:
Dataset.getSchema()→inner_import_ffi_schema→Schema::from(...)goes throughFrom<&Field> for ArrowField(rust/lance-core/src/datatypes/field.rs:1196), which never emitslance:field_id. So a dataset-derived Arrow schema arrives field-ID-less and gets renumbered anyway. validate_schema(write.rs:351-359) only compares the data stream against the override — never the override against the target dataset — so a clean write gives zero guarantee the IDs line up.
Why the test passes anyway: SimpleTestDataset declares id, name in that exact order, so its IDs are 0, 1 and the blind reassignment coincidentally matches. The test also asserts only fragments.size() and physicalRows — it never commits + reads back, and never inspects field IDs, so it can't catch a mismatch.
Prior art for doing it right: convert_schema_from_operation / from_arrow_schema (java/lance-jni/src/transaction.rs:826-944) derives field IDs by name against the read schema (metadata → base schema → max id). The fragment-write path reimplements schema handling and skips all of it.
Suggested fix (keeps the no-reopen goal of distributed writes):
- Accept a
LanceSchema(fromDataset.getLanceSchema(), which already carriesid/parent_id) instead of an ArrowSchema, and transport the IDs across FFI (e.g. embedlance:field_idin field metadata for this path only —Field::try_fromalready reads it atfield.rs:1084). - In the override path, fail closed: if any field id is
-1after conversion, error out instead of renumbering. - Strengthen the test: use a dataset with non-contiguous field IDs (add then drop a column), write via override, commit (Append), read back and assert values, and assert the fragment's data-file field IDs match the manifest.
74fa3b3 to
09f9813
Compare
Issue
In distributed writes, workers can create uncommitted fragments and defer the final commit until all fragments are ready. The Java
WriteFragmentBuilderonly exposedAPPENDmode without a way to pass the target dataset schema, so lance-core had to open the existing dataset to infer schema and field IDs before writing each fragment.That dataset open is unnecessary when the caller already has the target schema, and it becomes expensive for datasets with very large fragment counts because opening the dataset has to load/read manifest metadata. This shows up as fragment writing getting slower as the dataset grows, even before the final commit step.
Summary
WriteFragmentBuilder.schema(Schema)so Java distributed writers can pass the target dataset schema when creating uncommitted fragments.FragmentCreateBuilder.schema(...), avoiding the append-mode dataset open used only for schema inference.Benefits
FragmentCreateBuildercapability.Testing
cargo check --manifest-path /tmp/lance-write-fragment-schema-pr/java/lance-jni/Cargo.toml./mvnw -Dtest=org.lance.FragmentTest#testWriteFragmentWithSchemaOverride test