feat(transcode): Phase-2-B triples_to_batch (ExpandedTriple stream → RecordBatch) #313
Adds the missing reverse-direction helper: it takes a stream of ExpandedTriple (what Ontology::expand_entity returns) and materialises a RecordBatch grouped by subject_label. This is the Phase 5 (in #309's ROADMAP) / Phase 2-B (in the SQL-SPO bridge plan) bridge that consumer code needs to roundtrip an entity row through the SPO substrate.

## Why this shape (and not 'walk SpoStore::scan(lookup)')

The original Phase-2 plan-doc described 'walk SpoStore::scan(lookup)' as the read path. SpoStore (in lance-graph proper) is fingerprint-Hamming-indexed and doesn't expose a flat scan(lookup) method. Its API is per-verb (query_forward / query_reverse / query_relation), and the FNV-1a fingerprint is one-way (a subject can't be reversed back to entity_id).

ExpandedTriple is the right input shape: it carries the canonical subject_label (`entity:{type}:{id}`), so entity_id is recoverable, and the contract crate already mints these via SchemaExpander. Consumers wire SpoStore -> ExpandedTriple -> RecordBatch through their own subject_id-aware reader; this helper does the second step canonically.

## What ships

triples_to_batch(soa, &[ExpandedTriple]) -> Result<RecordBatch>
- Groups by subject_label (BTreeMap, lex sort, stable)
- Parses entity_id from `entity:{type}:{id}`
- Emits one row per subject; missing-required surfaces as MissingColumn error
- Drops triples whose predicate isn't declared in the schema (BBB outer-view rule)
- Rejects mixed entity_type via EntityTypeMismatch

round1_lenient_schema(soa) -> SchemaRef
- Round-1 helper: every body column emitted as nullable Utf8. The typed schema (Float32, Date32, etc.) applies on the from_columns / from_columns_partial path, which has typed input. Round 3 adds typed-value reconstruction inside triples_to_batch.

parse_entity_id_from_label(label, expected_type) -> Option<u64>
- Private helper. Matches the canonical mint format.

Two new TranscodeError variants:
- EntityTypeMismatch { expected, got }
- BadSubjectLabel(String)

## Tests

5 prior + 7 new (zerocopy:: total 12, all pass under query-lite + auth-rls-lite):
- triples_to_batch_produces_one_row_per_subject
- triples_to_batch_rejects_mixed_entity_types
- triples_to_batch_returns_empty_batch_for_empty_input
- triples_to_batch_drops_undeclared_predicates_silently
- triples_to_batch_rejects_missing_required_column
- triples_to_batch_subject_label_round_trip
- triples_to_batch_preserves_lex_subject_order

## What's deferred

- Typed value reconstruction (round 3): every body column emits Utf8 today; round 3 parses object_label according to semantic_type (Currency -> Float32, Date -> Date32, etc.).
- SpoStore reader proper: still needs a side-table mapping subject fingerprint -> entity_id. Consumer-side; tracked in `.claude/plans/sql-spo-ontology-bridge-v1.md`.

Verified: cargo check + cargo test transcode::zerocopy:: -> 12/12 pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean, fmt clean.
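The grouping and label-parsing steps described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: triples are modelled as plain `(subject_label, predicate, object_label)` tuples instead of `ExpandedTriple`, `group_by_subject` is a hypothetical name, and the parser collapses every failure into `None`, whereas the real helper reports `EntityTypeMismatch` and `BadSubjectLabel` separately.

```rust
use std::collections::BTreeMap;

/// Parse the canonical `entity:{type}:{id}` label.
/// Assumption: the id is the segment after the last ':' and is a decimal u64.
fn parse_entity_id_from_label(label: &str, expected_type: &str) -> Option<u64> {
    let rest = label.strip_prefix("entity:")?;
    let (entity_type, id) = rest.rsplit_once(':')?;
    if entity_type != expected_type {
        return None; // the real helper surfaces a typed mismatch error instead
    }
    id.parse::<u64>().ok()
}

/// Group (subject_label, predicate, object_label) tuples by subject.
/// A BTreeMap keeps subjects in lexicographic order, which is stable
/// across runs but is NOT numeric order of the ids.
fn group_by_subject<'a>(
    triples: &[(&'a str, &'a str, &'a str)],
) -> BTreeMap<&'a str, Vec<(&'a str, &'a str)>> {
    let mut groups: BTreeMap<&'a str, Vec<(&'a str, &'a str)>> = BTreeMap::new();
    for &(subject, predicate, object) in triples {
        groups.entry(subject).or_default().push((predicate, object));
    }
    groups
}
```

Because the order is lexicographic on the label string, `entity:patient:10` sorts before `entity:patient:9`; consumers that need numeric id order would have to sort the resulting rows themselves.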
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66feaea26e
    fields.push(Field::new("id", DataType::UInt64, false));
    fields.push(Field::new("entity_type", DataType::Utf8, false));
    for col in &soa.columns {
        fields.push(Field::new(col.name, DataType::Utf8, true));
Preserve required-field nullability in lenient schema
round1_lenient_schema marks every body field as nullable, which drops the required/optional contract encoded in OuterSchema and used by arrow_schema. That means batches produced by triples_to_batch no longer carry non-null constraints for required properties, and can become schema-incompatible with batches built through from_columns for the same entity type. The field nullability should still follow PropertyKind even if the data type is temporarily widened to Utf8.
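One way to read this suggestion, sketched with stand-in types so it compiles without Arrow: widen the data type to Utf8 but derive nullability from the property kind. `PropertyKind`'s variants, `LenientField`, and `lenient_fields` are assumptions made for illustration; in the crate the equivalent would build real Arrow `Field`s from `OuterSchema`.

```rust
/// Stand-in for the schema's required/optional contract (variant names assumed).
#[derive(Clone, Copy, Debug, PartialEq)]
enum PropertyKind {
    Required,
    Optional,
    Free,
}

/// Stand-in for an Arrow field: name, widened data type, nullability.
#[derive(Debug, PartialEq)]
struct LenientField {
    name: String,
    data_type: &'static str,
    nullable: bool,
}

/// Every body column is widened to Utf8, but only non-required
/// columns are marked nullable, so the required contract survives.
fn lenient_fields(columns: &[(&str, PropertyKind)]) -> Vec<LenientField> {
    columns
        .iter()
        .map(|&(name, kind)| LenientField {
            name: name.to_string(),
            data_type: "Utf8",
            nullable: kind != PropertyKind::Required,
        })
        .collect()
}
```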
Closes the deferred item from PR #313: triples_to_batch could only emit hashes (`value:1a2b3c4d5e6f7a8b`) because ExpandedTriple's object_label is FNV-1a-encoded. Round-3 adds triples_to_batch_with_resolver, which takes a closure mapping the object_label back to the original value bytes (consumer-side state) and parses them per the column's SemanticType into a typed Arrow scalar.

## API

    pub fn triples_to_batch_with_resolver<R>(
        soa: &OuterSchema,
        triples: &[ExpandedTriple],
        resolver: R,
    ) -> Result<RecordBatch, TranscodeError>
    where
        R: Fn(&str) -> Option<Vec<u8>>,

Returns the canonical typed arrow_schema (NOT the lenient Utf8 fallback that triples_to_batch uses). The original triples_to_batch stays unchanged for callers without a resolver.

## Type mapping

    Currency(_)       -> Float32 (str::parse::<f32>)
    Date(_)           -> Date32  (YYYY-MM-DD -> days_since_epoch)
    CustomerId        -> UInt64  (str::parse::<u64>)
    InvoiceNumber     -> UInt64  (str::parse::<u64>)
    PlainText / etc.  -> Utf8    (UTF-8 lossy)
    FixedSizeListF32  -> nulls   (resolver returns single-bytes FixedSizeBinary only; fixed-shape needs round-5 wide-payload resolver)

## Required vs optional behaviour

- Resolver returns None -> null cell. Required columns with all rows resolved as None surface as MissingColumn (unchanged from triples_to_batch).
- Resolver returns Some(bytes) but the parser fails:
  * REQUIRED column -> typed ParseFailure { column, reason }. Surfaces the typed error rather than letting Arrow reject the null with an opaque InvalidArgumentError later.
  * OPTIONAL/FREE column -> null cell. Consumers that need to know about parse failures can compare against the original input separately.

## Helper: parse_iso_date_to_days(s)

Howard Hinnant civil_to_days, public-domain. Tested against:
- 1970-01-01 -> 0
- 1970-01-02 -> 1
- 2000-01-01 -> 10_957
- 2020-02-29 -> 18_321 (leap day)
- garbage / out-of-range / wrong-separator -> None

## New TranscodeError variant

ParseFailure { column: String, reason: &'static str }

## Tests

12 prior + 9 new (zerocopy:: total 21, all pass under query-lite + auth-rls-lite):
- typed_resolver_currency_parses_to_float32
- typed_resolver_date_parses_to_days_since_epoch
- typed_resolver_customer_id_round_trips_uint64
- typed_resolver_required_unparseable_returns_parse_failure
- typed_resolver_optional_unparseable_emits_null
- typed_resolver_returns_null_when_resolver_misses
- typed_resolver_required_all_unresolved_errors
- typed_resolver_iso_date_parses_known_dates
- typed_resolver_iso_date_rejects_garbage

## What's still deferred

- Date(Month) / Date(Year) precisions. Today only YYYY-MM-DD parses; round-4 plumbs the precision into the parser.
- Geo / File / Image still collapse to Utf8. Round-4 candidates.
- async resolver. Round-5.
- FixedSizeListF32 / FixedSizeBinary single-bytes resolver. Round-5 wide-payload resolver.

Verified: cargo check + cargo test transcode::zerocopy:: -> 21/21 pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean, fmt clean.

Cross-link: PR #313 (the round-1 triples_to_batch this extends).
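The required-versus-optional rules can be illustrated for a single Float32 cell. This is a sketch under assumptions, not the shipped code: `resolve_f32_cell` and `CellError` are hypothetical names, the `required` flag stands in for the column's property kind, and the all-rows-unresolved `MissingColumn` check (a per-column rule) is out of scope here.

```rust
/// Stand-in for the ParseFailure variant described above.
#[derive(Debug, PartialEq)]
enum CellError {
    ParseFailure { column: String, reason: &'static str },
}

/// Resolve one object_label into an optional f32 cell.
/// - resolver miss                -> Ok(None) (null cell)
/// - parse ok                     -> Ok(Some(v))
/// - parse fails, required column -> Err(ParseFailure)
/// - parse fails, optional column -> Ok(None) (null cell)
fn resolve_f32_cell<R>(
    column: &str,
    required: bool,
    object_label: &str,
    resolver: R,
) -> Result<Option<f32>, CellError>
where
    R: Fn(&str) -> Option<Vec<u8>>,
{
    let Some(bytes) = resolver(object_label) else {
        return Ok(None);
    };
    // Lossy decoding mirrors the "UTF-8 lossy" note in the type mapping.
    let text = String::from_utf8_lossy(&bytes);
    match text.parse::<f32>() {
        Ok(value) => Ok(Some(value)),
        Err(_) if required => Err(CellError::ParseFailure {
            column: column.to_string(),
            reason: "not a valid f32",
        }),
        Err(_) => Ok(None),
    }
}
```

A resolver here is any closure from label to bytes, for example a lookup into a consumer-side `HashMap<String, Vec<u8>>` keyed by the hashed `value:…` label.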
Closes the deferred item from PR #316: parse_iso_date_to_days previously only accepted YYYY-MM-DD, even for columns declared Date(Month) or Date(Year). Round-4 widens the parser to accept all three precisions, treating lower-precision inputs as the earliest point in the period:

    YYYY-MM-DD   exact day
    YYYY-MM      first day of the month (1970-02 -> 31)
    YYYY         first day of the year  (2000 -> 10_957)

Algorithm unchanged (Howard Hinnant civil_to_days). The match on parts.as_slice() lets the parser decide what to default for the missing components without changing the math.

## Tests (4 new + 1 expanded; 13 typed_resolver total)

- typed_resolver_iso_year_only_parses_to_jan_1
- typed_resolver_iso_year_month_parses_to_day_1
- typed_resolver_iso_lower_precision_rejects_bad_components (bad month / bad day / non-numeric / empty year)
- typed_resolver_iso_lower_precision_handles_leap_year_feb (2020-02 = day 31 of 2020 = 18_293 since epoch)
- typed_resolver_iso_date_rejects_garbage (expanded: adds empty-string + 4-part-input rejection cases)

## What's still deferred (round-5 territory)

- Strict-mode parser that REFUSES cross-precision inputs (e.g. a column declared Date(Day) rejects a YYYY-only string). Today's permissive default is the right round-1 choice for graceful up-cast; strict-mode lands when a consumer asks.
- Date(DateTime) precision: an ISO timestamp parser would need a Timestamp column type, not Date32. Different Arrow plumbing.

Verified: cargo test transcode::zerocopy::tests::typed_resolver:: 13/13 pass under query-lite,auth-rls-lite. Clippy clean, fmt clean.

Cross-link: PR #316 (round-3 typed-value resolver), PR #313 (round-1 triples_to_batch, where Date(_) was first plumbed).
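For reference, the three-precision parse can be sketched on top of Hinnant's day-count algorithm. This is an independent reconstruction from the description above, not the crate's source: the fixed-width checks (4-digit year, 2-digit month and day) and the `fixed_width_num` and `days_in_month` helpers are assumptions added for the example.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date (Hinnant's algorithm).
fn civil_to_days(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y }; // treat Jan/Feb as end of previous year
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

fn days_in_month(y: i64, m: i64) -> i64 {
    match m {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        _ => {
            if (y % 4 == 0 && y % 100 != 0) || y % 400 == 0 { 29 } else { 28 }
        }
    }
}

/// Assumption: components are fixed-width ASCII digits (YYYY, MM, DD).
fn fixed_width_num(s: &str, width: usize) -> Option<i64> {
    if s.len() != width || !s.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    s.parse().ok()
}

/// Accepts YYYY, YYYY-MM, or YYYY-MM-DD; missing components default to 1.
fn parse_iso_date_to_days(s: &str) -> Option<i32> {
    let parts: Vec<&str> = s.split('-').collect();
    let (y, m, d) = match parts.as_slice() {
        [y] => (fixed_width_num(y, 4)?, 1, 1),
        [y, m] => (fixed_width_num(y, 4)?, fixed_width_num(m, 2)?, 1),
        [y, m, d] => (
            fixed_width_num(y, 4)?,
            fixed_width_num(m, 2)?,
            fixed_width_num(d, 2)?,
        ),
        _ => return None, // empty split results and 4+ parts are rejected
    };
    if !(1..=12).contains(&m) || d < 1 || d > days_in_month(y, m) {
        return None;
    }
    i32::try_from(civil_to_days(y, m, d)).ok()
}
```

Defaulting the missing month and day to 1 before calling the day-count function is what keeps the arithmetic identical across precisions; only the slice match differs.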
## Summary

Phase-2-B: the missing reverse-direction helper, `triples_to_batch(soa, &[ExpandedTriple]) → RecordBatch`. It takes a stream of triples (what `Ontology::expand_entity()` returns) and materialises one Arrow row per subject.

This is the Phase 5 / Phase 2-B bridge consumer code needs to roundtrip an entity through the SPO substrate.

## Why this shape (not "walk `SpoStore::scan(lookup)`")

The original Phase-2 plan-doc described "walk `SpoStore::scan(lookup)`" as the read path. `SpoStore` doesn't expose a flat `scan(lookup)`: its API is per-verb (`query_forward` / `query_reverse` / `query_relation`) and the FNV-1a fingerprint is one-way (a subject can't be reversed back to `entity_id`).

`ExpandedTriple` is the right input shape:

- It carries the canonical `subject_label` (`entity:{type}:{id}`) → `entity_id` is recoverable.
- The contract crate already mints these via `SchemaExpander::expand_entity()`.

Consumers wire `SpoStore → ExpandedTriple → RecordBatch` through their own subject-id-aware reader (the side-table mapping subject fingerprint → entity_id is consumer-side state, not transcode's). This helper does the second step canonically, so every consumer doesn't reinvent it.

## What ships
### Behaviour

| Case | Result |
| --- | --- |
| Grouping | One row per `subject_label` (BTreeMap, lex sort, stable) |
| `entity_type_id` ≠ schema's | `Err(EntityTypeMismatch)` |
| `subject_label` not in `entity:{type}:{id}` form | `Err(BadSubjectLabel)` |
| `predicate` not in schema | Triple dropped silently (BBB outer-view rule) |
| Required column missing | `Err(MissingColumn)` |

### Round-1 honesty: `Utf8` everywhere on the body

`triples_to_batch` emits every body column as nullable `Utf8` regardless of the schema's declared `ArrowTypeCode`. The typed surface (`Currency` → `Float32`, `Date(_)` → `Date32`, etc.) applies on the `from_columns` / `from_columns_partial` path, which has typed input. Triple input is string-shaped (`object_label`), so round-1 keeps it `Utf8`. Round 3 adds typed-value reconstruction inside this function.

The schema returned by the function is `round1_lenient_schema(soa)` (every body column nullable Utf8). The typed schema (`arrow_schema(soa)`) is still the wire-shape contract; round-3 makes `triples_to_batch` emit it.

### New `TranscodeError` variants

`EntityTypeMismatch { expected, got }` and `BadSubjectLabel(String)`. Both surface as concrete typed errors so consumers can branch on them rather than parsing display strings.
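Branching on the typed variants might look like the following on the consumer side. The enum here is a local stand-in: the field types of `EntityTypeMismatch`, the payload of `MissingColumn`, and the `recovery_hint` function are assumptions for illustration, since the PR text only names the variants.

```rust
/// Local stand-in for the crate's error type (payload types assumed).
#[derive(Debug, PartialEq)]
enum TranscodeError {
    EntityTypeMismatch { expected: String, got: String },
    BadSubjectLabel(String),
    MissingColumn(String),
}

/// Decide what a caller should do, by matching on the variant
/// instead of inspecting a formatted error message.
fn recovery_hint(err: &TranscodeError) -> String {
    match err {
        TranscodeError::EntityTypeMismatch { expected, got } => {
            format!("route `{got}` triples to their own schema (this one is `{expected}`)")
        }
        TranscodeError::BadSubjectLabel(label) => {
            format!("skip or quarantine subject `{label}`: not in entity:{{type}}:{{id}} form")
        }
        TranscodeError::MissingColumn(column) => {
            format!("re-expand the entity: required column `{column}` has no value")
        }
    }
}
```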
## Tests

12/12 pass under `--features query-lite,auth-rls-lite`:

- 5 prior (`from_columns` / `from_columns_partial`)
- 7 new for `triples_to_batch`:
  - `triples_to_batch_produces_one_row_per_subject`: two patients via `expand_entity`
  - `triples_to_batch_rejects_mixed_entity_types`: Diagnosis triples fed to a Patient soa
  - `triples_to_batch_returns_empty_batch_for_empty_input`: clean empty case
  - `triples_to_batch_drops_undeclared_predicates_silently`: BBB outer-view
  - `triples_to_batch_rejects_missing_required_column`: required column gate
  - `triples_to_batch_subject_label_round_trip`: `expand_entity(999_999)` → `id_col[0] == 999_999`
  - `triples_to_batch_preserves_lex_subject_order`: BTreeMap-stable order

## What's deferred to Round 3
- Typed value reconstruction: parse `object_label` according to `semantic_type` (`"123.45"` → `Float32 123.45`, `"1980-04-15"` → `Date32 days_since_epoch`, etc.). Schema agreement between `triples_to_batch` and `arrow_schema` once typed.
- `SpoStore` reader proper: still needs a side-table mapping subject fingerprint → `entity_id`. Consumer-side; tracked in `.claude/plans/sql-spo-ontology-bridge-v1.md`.

## Verified

- `cargo check -p lance-graph-callcenter --no-default-features --features query-lite`: clean
- `cargo test transcode::zerocopy::`: 12/12 pass
- `cargo clippy`: zero `zerocopy` warnings
- `rustfmt --check`: clean

## Files changed

- `crates/lance-graph-callcenter/src/transcode/zerocopy.rs`: +~250 / −~10

## Cross-link

- … `ExpandedTriple` consumption)
- … `OuterColumn.codec_route`, `from_columns_partial`)

Generated by Claude Code