feat(transcode): Phase-2-B triples_to_batch (ExpandedTriple stream → RecordBatch) by AdaWorldAPI · Pull Request #313 · AdaWorldAPI/lance-graph

AdaWorldAPI · 2026-04-30T09:01:09Z

Summary

Phase-2-B: the missing reverse-direction helper — triples_to_batch(soa, &[ExpandedTriple]) → RecordBatch. Takes a stream of triples (what Ontology::expand_entity() returns) and materialises one Arrow row per subject.

This is the Phase 5 / Phase 2-B bridge that consumer code needs to round-trip an entity through the SPO substrate.

Why this shape — not "walk `SpoStore::scan(lookup)`"

The original Phase-2 plan-doc described "walk SpoStore::scan(lookup)" as the read path. SpoStore doesn't expose a flat scan(lookup) — its API is per-verb (query_forward / query_reverse / query_relation) and the FNV-1a fingerprint is one-way (subject can't be reversed back to entity_id).

ExpandedTriple is the right input shape:

Carries the canonical subject_label (entity:{type}:{id}) → entity_id is recoverable.
The contract crate already mints these via SchemaExpander::expand_entity().

Consumers wire SpoStore → ExpandedTriple → RecordBatch through their own subject-id-aware reader (the side-table mapping subject fingerprint → entity_id is consumer-side state, not transcode's). This helper does the second step canonically, so every consumer doesn't reinvent it.

What ships

pub fn triples_to_batch(
    soa: &OuterSchema,
    triples: &[ExpandedTriple],
) -> Result<RecordBatch, TranscodeError>;

pub fn round1_lenient_schema(soa: &OuterSchema) -> SchemaRef;
fn parse_entity_id_from_label(label: &str, expected_type: &str) -> Option<u64>;
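To make the label contract concrete, here is a minimal sketch of how a parser with the signature above could recover `entity_id` from `entity:{type}:{id}`. The body is an assumption, not the code in this PR; in particular, splitting from the right and comparing the type segment as a plain string are illustrative choices.

```rust
/// Hypothetical body for a parser matching the signature listed above.
/// Accepts labels of the form `entity:{type}:{id}` and returns the numeric id.
fn parse_entity_id_from_label(label: &str, expected_type: &str) -> Option<u64> {
    // The canonical prefix must be present.
    let rest = label.strip_prefix("entity:")?;
    // Split at the last ':' so the id is always the final segment
    // (assumption: the id itself never contains ':').
    let (ty, id) = rest.rsplit_once(':')?;
    // A label minted for another entity type is rejected.
    if ty != expected_type {
        return None;
    }
    // Only plain decimal digits count as an id in this sketch.
    if id.is_empty() || !id.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    id.parse::<u64>().ok()
}
```

A `None` from a helper like this is what the caller would turn into `BadSubjectLabel`.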

Behaviour

| Input shape | Outcome |
| --- | --- |
| Triples for one subject | 1 row, columns populated from matching predicates |
| Triples for N subjects | N rows, lex-sorted by `subject_label` (BTreeMap) |
| Triple's `entity_type_id` ≠ schema's | `Err(EntityTypeMismatch)` |
| `subject_label` not in `entity:{type}:{id}` form | `Err(BadSubjectLabel)` |
| Triple's `predicate` not in schema | silently dropped (BBB outer-view rule) |
| Required column with no triple in the group | `Err(MissingColumn)` |
| Empty input | empty batch with the right schema |
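The grouping rules in the table can be sketched with the standard library alone. `Triple` and `group_rows` below are hypothetical stand-ins for `ExpandedTriple` and the grouping step inside `triples_to_batch`; the real function builds Arrow arrays and also applies the error checks, which are left out here.

```rust
use std::collections::BTreeMap;

// Stand-in for ExpandedTriple, reduced to the fields this sketch needs.
#[derive(Debug, Clone)]
struct Triple {
    subject_label: String,
    predicate: String,
    object_label: String,
}

/// Group triples by subject and keep only predicates declared in `columns`.
/// Returns one (subject_label, cells) pair per subject, in BTreeMap order.
fn group_rows(columns: &[&str], triples: &[Triple]) -> Vec<(String, Vec<Option<String>>)> {
    let mut groups: BTreeMap<&str, Vec<Option<String>>> = BTreeMap::new();
    for t in triples {
        // Every subject gets a row, initially all nulls.
        let row = groups
            .entry(t.subject_label.as_str())
            .or_insert_with(|| vec![None; columns.len()]);
        // Declared predicate: fill the matching cell.
        // Undeclared predicate: no match, so the triple is dropped silently.
        if let Some(i) = columns.iter().position(|c| *c == t.predicate) {
            row[i] = Some(t.object_label.clone());
        }
    }
    groups.into_iter().map(|(k, v)| (k.to_string(), v)).collect()
}
```

Because the order is lexicographic on the label string, `entity:patient:10` sorts before `entity:patient:2`; callers that need numeric id order would have to sort the batch afterwards.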

Round-1 honesty: `Utf8` everywhere on the body

triples_to_batch emits every body column as nullable Utf8 regardless of the schema's declared ArrowTypeCode. The typed surface (Currency → Float32, Date(_) → Date32, etc.) applies on the from_columns / from_columns_partial path which has typed input. Triple input is string-shaped (object_label), so round-1 keeps it Utf8. Round 3 adds typed-value reconstruction inside this function.

The schema returned by the function is round1_lenient_schema(soa) (every body column nullable Utf8). The typed schema (arrow_schema(soa)) is still the wire-shape contract; round-3 makes triples_to_batch emit it.

New `TranscodeError` variants

EntityTypeMismatch {
    expected: EntityTypeId,
    got: EntityTypeId,
},
BadSubjectLabel(String),

Both surface as concrete typed errors so consumers can branch on them rather than parsing display strings.
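As an illustration of branching on the variants, the sketch below mirrors the two new variants in a local enum and matches on them. The enum is a stand-in, not the crate's definition: `EntityTypeId` is assumed to be an integer alias and the payload of `MissingColumn` is a guess.

```rust
// Assumption: the real EntityTypeId lives in the contract crate.
type EntityTypeId = u32;

// Local mirror of the error type, for illustration only.
#[derive(Debug, PartialEq)]
enum TranscodeError {
    EntityTypeMismatch { expected: EntityTypeId, got: EntityTypeId },
    BadSubjectLabel(String),
    MissingColumn(String), // payload shape is an assumption
}

/// Turn an error into an operator-facing message without parsing Display text.
fn describe(err: &TranscodeError) -> String {
    match err {
        TranscodeError::EntityTypeMismatch { expected, got } => {
            format!("schema expects entity type {expected}, triple carries {got}")
        }
        TranscodeError::BadSubjectLabel(label) => {
            format!("label '{label}' is not in entity:{{type}}:{{id}} form")
        }
        TranscodeError::MissingColumn(name) => {
            format!("required column '{name}' has no triple")
        }
    }
}
```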

Tests

12/12 pass under --features query-lite,auth-rls-lite:

5 existing (regression coverage on from_columns / from_columns_partial)
7 new for triples_to_batch:
- triples_to_batch_produces_one_row_per_subject — two patients via expand_entity
- triples_to_batch_rejects_mixed_entity_types — Diagnosis triples fed to a Patient soa
- triples_to_batch_returns_empty_batch_for_empty_input — clean empty case
- triples_to_batch_drops_undeclared_predicates_silently — BBB outer-view
- triples_to_batch_rejects_missing_required_column — required column gate
- triples_to_batch_subject_label_round_trip — expand_entity(999_999) → id_col[0] == 999_999
- triples_to_batch_preserves_lex_subject_order — BTreeMap-stable order

What's deferred to Round 3

Typed value reconstruction: parse object_label according to semantic_type ("123.45" → Float32 123.45, "1980-04-15" → Date32 days_since_epoch, etc.). Schema agreement between triples_to_batch and arrow_schema once typed.
SpoStore reader proper: needs the consumer-side fingerprint → entity_id side-table. Out of scope here; tracked in .claude/plans/sql-spo-ontology-bridge-v1.md.

Verified

cargo check -p lance-graph-callcenter --no-default-features --features query-lite — clean
cargo test transcode::zerocopy:: — 12/12 pass
cargo clippy — zero zerocopy warnings
rustfmt --check — clean

Files changed

crates/lance-graph-callcenter/src/transcode/zerocopy.rs — +~250 / −~10

Cross-link

lance-graph PR feat(callcenter::transcode): outer ↔ inner ontology mapper + parallelbetrieb #309 — original transcode submodule (introduced ExpandedTriple consumption)
lance-graph PR feat(transcode): r2 fixes — typed Arrow + codec_route + partial writes + CachedOntology + route validation #310 — round-2 fixes (OuterColumn.codec_route, from_columns_partial)
lance-graph PR feat(transcode): Phase-2-A pushdown classification (Inexact for recognised filters) #312 — Phase-2-A pushdown classification (the forward direction)

Generated by Claude Code


Adds the missing reverse-direction helper: takes a stream of ExpandedTriple (what Ontology::expand_entity returns) and materialises a RecordBatch grouped by subject_label. This is the Phase 5 (in #309's ROADMAP) / Phase 2-B (in the SQL-SPO bridge plan) bridge that consumer code needs to roundtrip an entity row through the SPO substrate.

## Why this shape (and not 'walk SpoStore::scan(lookup)')

The original Phase-2 plan-doc described 'walk SpoStore::scan(lookup)' as the read path. SpoStore (in lance-graph proper) is fingerprint-Hamming-indexed and doesn't expose a flat scan(lookup) method: its API is per-verb (query_forward / query_reverse / query_relation) and the FNV-1a fingerprint is one-way (subject can't be reversed back to entity_id).

ExpandedTriple is the right input shape: it carries the canonical subject_label (`entity:{type}:{id}`) so entity_id is recoverable, and the contract crate already mints these via SchemaExpander. Consumers wire SpoStore -> ExpandedTriple -> RecordBatch through their own subject_id-aware reader; this helper does the second step canonically.

## What ships

triples_to_batch(soa, &[ExpandedTriple]) -> Result<RecordBatch>
- Groups by subject_label (BTreeMap, lex sort, stable)
- Parses entity_id from `entity:{type}:{id}`
- Emits one row per subject; missing-required surfaces as MissingColumn error
- Drops triples whose predicate isn't declared in the schema (BBB outer-view rule)
- Rejects mixed entity_type via EntityTypeMismatch

round1_lenient_schema(soa) -> SchemaRef
- Round-1 helper: every body column emitted as nullable Utf8. The typed schema (Float32, Date32, etc.) applies on the from_columns / from_columns_partial path which has typed input. Round 3 adds typed-value reconstruction inside triples_to_batch.

parse_entity_id_from_label(label, expected_type) -> Option<u64>
- Private helper. Matches the canonical mint format.

Two new TranscodeError variants:
- EntityTypeMismatch { expected, got }
- BadSubjectLabel(String)

## Tests

5 prior + 7 new (zerocopy:: total 12, all pass under query-lite + auth-rls-lite):
- triples_to_batch_produces_one_row_per_subject
- triples_to_batch_rejects_mixed_entity_types
- triples_to_batch_returns_empty_batch_for_empty_input
- triples_to_batch_drops_undeclared_predicates_silently
- triples_to_batch_rejects_missing_required_column
- triples_to_batch_subject_label_round_trip
- triples_to_batch_preserves_lex_subject_order

## What's deferred

- Typed value reconstruction (round 3): every body column emits Utf8 today; round 3 parses object_label according to semantic_type (Currency -> Float32, Date -> Date32, etc.).
- SpoStore reader proper: still needs a side-table mapping subject fingerprint -> entity_id. Consumer-side; tracked in `.claude/plans/sql-spo-ontology-bridge-v1.md`.

Verified: cargo check + cargo test transcode::zerocopy:: -> 12/12 pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean, fmt clean.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66feaea26e


chatgpt-codex-connector · 2026-04-30T09:03:53Z

+    fields.push(Field::new("id", DataType::UInt64, false));
+    fields.push(Field::new("entity_type", DataType::Utf8, false));
+    for col in &soa.columns {
+        fields.push(Field::new(col.name, DataType::Utf8, true));


Preserve required-field nullability in lenient schema

round1_lenient_schema marks every body field as nullable, which drops the required/optional contract encoded in OuterSchema and used by arrow_schema. That means batches produced by triples_to_batch no longer carry non-null constraints for required properties, and can become schema-incompatible with batches built through from_columns for the same entity type. The field nullability should still follow PropertyKind even if the data type is temporarily widened to Utf8.
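The suggestion can be pictured with a small model of the schema builder. Everything below is a stand-in: `FieldSpec` replaces Arrow's `Field`, and the `PropertyKind` variants (`Required`, `Optional`, `Free`) are inferred from the wording used elsewhere in this thread, not copied from the crate.

```rust
// Assumed variants; the real PropertyKind is defined in the contract crate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PropertyKind {
    Required,
    Optional,
    Free,
}

struct OuterColumn {
    name: &'static str,
    kind: PropertyKind,
}

// Stand-in for arrow's Field: name, data type name, nullability.
#[derive(Debug, PartialEq)]
struct FieldSpec {
    name: &'static str,
    data_type: &'static str,
    nullable: bool,
}

/// Lenient schema that widens every body column to Utf8 but keeps
/// the non-null constraint on required properties.
fn lenient_fields(columns: &[OuterColumn]) -> Vec<FieldSpec> {
    let mut fields = vec![
        FieldSpec { name: "id", data_type: "UInt64", nullable: false },
        FieldSpec { name: "entity_type", data_type: "Utf8", nullable: false },
    ];
    for col in columns {
        fields.push(FieldSpec {
            name: col.name,
            data_type: "Utf8",
            // Only the data type is relaxed; nullability follows the kind.
            nullable: col.kind != PropertyKind::Required,
        });
    }
    fields
}
```

With this shape, a batch from the triple path and a batch from `from_columns` would at least agree on which columns may hold nulls, even while their data types still differ.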


Closes the deferred item from PR #313: triples_to_batch could only emit hashes (`value:1a2b3c4d5e6f7a8b`) because ExpandedTriple's object_label is FNV-1a-encoded. Round-3 adds triples_to_batch_with_resolver, which takes a closure mapping the object_label back to original value bytes (consumer-side state) and parses per the column's SemanticType into a typed Arrow scalar.

## API

    pub fn triples_to_batch_with_resolver<R>(
        soa: &OuterSchema,
        triples: &[ExpandedTriple],
        resolver: R,
    ) -> Result<RecordBatch, TranscodeError>
    where
        R: Fn(&str) -> Option<Vec<u8>>,

Returns the canonical typed arrow_schema (NOT the lenient Utf8 fallback that triples_to_batch uses). The original triples_to_batch stays unchanged for callers without a resolver.

## Type mapping

- Currency(_) -> Float32 (str::parse::<f32>)
- Date(_) -> Date32 (YYYY-MM-DD -> days_since_epoch)
- CustomerId -> UInt64 (str::parse::<u64>)
- InvoiceNumber -> UInt64 (str::parse::<u64>)
- PlainText / etc. -> Utf8 (UTF-8 lossy)
- FixedSizeListF32 -> nulls (resolver returns single-bytes FixedSizeBinary only; fixed-shape needs round-5 wide-payload resolver)

## Required vs optional behaviour

- Resolver returns None -> null cell. Required columns with all rows resolved as None surface as MissingColumn (unchanged from triples_to_batch).
- Resolver returns Some(bytes) but parser fails:
  * REQUIRED column -> typed ParseFailure { column, reason }. Surfaces the typed error rather than letting Arrow reject the null with an opaque InvalidArgumentError later.
  * OPTIONAL/FREE column -> null cell. Consumers that need to know about parse failures can compare against the original input separately.

## Helper: parse_iso_date_to_days(s)

Howard Hinnant civil_to_days, public-domain. Tested against:
- 1970-01-01 -> 0
- 1970-01-02 -> 1
- 2000-01-01 -> 10_957
- 2020-02-29 -> 18_321 (leap day)
- garbage / out-of-range / wrong-separator -> None

## New TranscodeError variant

ParseFailure { column: String, reason: &'static str }

## Tests

12 prior + 9 new (zerocopy:: total 21, all pass under query-lite + auth-rls-lite):
- typed_resolver_currency_parses_to_float32
- typed_resolver_date_parses_to_days_since_epoch
- typed_resolver_customer_id_round_trips_uint64
- typed_resolver_required_unparseable_returns_parse_failure
- typed_resolver_optional_unparseable_emits_null
- typed_resolver_returns_null_when_resolver_misses
- typed_resolver_required_all_unresolved_errors
- typed_resolver_iso_date_parses_known_dates
- typed_resolver_iso_date_rejects_garbage

## What's still deferred

- Date(Month) / Date(Year) precisions. Today only YYYY-MM-DD parses; round-4 plumbs the precision into the parser.
- Geo / File / Image collapse to Utf8 still. Round-4 candidates.
- async resolver. Round-5.
- FixedSizeListF32 / FixedSizeBinary single-bytes resolver. Round-5 wide-payload resolver.

Verified: cargo check + cargo test transcode::zerocopy:: -> 21/21 pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean, fmt clean.

Cross-link: PR #313 (the round-1 triples_to_batch this extends).
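The required-versus-optional rule for a single cell can be sketched as follows. `Cell`, `Semantic`, and `resolve_cell` are hypothetical stand-ins for the Arrow scalar, the crate's `SemanticType`, and the per-cell step inside `triples_to_batch_with_resolver`; only the resolver signature `Fn(&str) -> Option<Vec<u8>>` and the `ParseFailure { column, reason }` shape come from the description above.

```rust
// Stand-in for a typed Arrow scalar.
#[derive(Debug, PartialEq)]
enum Cell {
    Null,
    F32(f32),
    U64(u64),
    Text(String),
}

// Reduced stand-in for SemanticType (three of the mapped cases).
#[derive(Debug, Clone, Copy)]
enum Semantic {
    Currency,
    CustomerId,
    PlainText,
}

#[derive(Debug, PartialEq)]
struct ParseFailure {
    column: String,
    reason: &'static str,
}

/// Resolve one object_label into a typed cell.
fn resolve_cell<R>(
    column: &str,
    required: bool,
    semantic: Semantic,
    object_label: &str,
    resolver: R,
) -> Result<Cell, ParseFailure>
where
    R: Fn(&str) -> Option<Vec<u8>>,
{
    // Resolver miss: null cell. Whether a required column ends up entirely
    // null is a batch-level check (MissingColumn), not done here.
    let bytes = match resolver(object_label) {
        Some(b) => b,
        None => return Ok(Cell::Null),
    };
    let text = String::from_utf8_lossy(&bytes);
    let parsed = match semantic {
        Semantic::Currency => text.parse::<f32>().ok().map(Cell::F32),
        Semantic::CustomerId => text.parse::<u64>().ok().map(Cell::U64),
        Semantic::PlainText => Some(Cell::Text(text.into_owned())),
    };
    match parsed {
        Some(cell) => Ok(cell),
        // Required column: surface a typed error instead of a null.
        None if required => Err(ParseFailure {
            column: column.to_string(),
            reason: "value did not parse as the declared type",
        }),
        // Optional or free column: degrade to null.
        None => Ok(Cell::Null),
    }
}
```

In practice the resolver would close over the consumer's side-table, for example a `HashMap<String, Vec<u8>>` keyed by `object_label`.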

Closes the deferred item from PR #316: parse_iso_date_to_days previously only accepted YYYY-MM-DD, even for columns declared Date(Month) or Date(Year). Round-4 widens the parser to accept all three precisions, treating lower-precision inputs as the earliest point in the period:

- YYYY-MM-DD: exact day
- YYYY-MM: first day of the month (1970-02 -> 31)
- YYYY: first day of the year (2000 -> 10_957)

Algorithm unchanged (Howard Hinnant civil_to_days). The match on parts.as_slice() lets the parser decide what to default for the missing components without changing the math.

## Tests (4 new + 1 expanded; 13 typed_resolver total)

- typed_resolver_iso_year_only_parses_to_jan_1
- typed_resolver_iso_year_month_parses_to_day_1
- typed_resolver_iso_lower_precision_rejects_bad_components -- bad month / bad day / non-numeric / empty year
- typed_resolver_iso_lower_precision_handles_leap_year_feb -- 2020-02 = day 31 of 2020 = 18_293 since epoch
- typed_resolver_iso_date_rejects_garbage (expanded) -- adds empty-string + 4-part-input rejection cases

## What's still deferred (round-5 territory)

- Strict-mode parser that REFUSES cross-precision inputs (e.g. a column declared Date(Day) rejects a YYYY-only string). Today's permissive default is the right round-1 choice for graceful up-cast; strict-mode lands when a consumer asks.
- Date(DateTime) precision -- ISO timestamp parser would need a Timestamp column type, not Date32. Different Arrow plumbing.

Verified: cargo test transcode::zerocopy::tests::typed_resolver:: 13/13 pass under query-lite,auth-rls-lite. Clippy clean, fmt clean.

Cross-link: PR #316 (round-3 typed-value resolver), PR #313 (round-1 triples_to_batch, where Date(_) was first plumbed).
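A self-contained sketch of the three-precision parser follows, using Howard Hinnant's days-from-civil arithmetic and a match on `parts.as_slice()` as described. The validation rules (four-digit year, two-digit month and day) are assumptions made for this sketch; the merged code may accept or reject different edge cases.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date (Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    // Shift the year so it starts in March; leap days then fall at year end.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

fn days_in_month(y: i64, m: i64) -> i64 {
    let leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    match m {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        _ => if leap { 29 } else { 28 },
    }
}

/// Parse exactly `len` ASCII digits (assumption: strict fixed-width fields).
fn digits(s: &str, len: usize) -> Option<i64> {
    if s.len() == len && s.bytes().all(|b| b.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

/// Accepts YYYY, YYYY-MM, or YYYY-MM-DD; missing parts default to 1.
fn parse_iso_date_to_days(s: &str) -> Option<i32> {
    let parts: Vec<&str> = s.split('-').collect();
    let (y, m, d) = match parts.as_slice() {
        [y] => (digits(y, 4)?, 1, 1),
        [y, m] => (digits(y, 4)?, digits(m, 2)?, 1),
        [y, m, d] => (digits(y, 4)?, digits(m, 2)?, digits(d, 2)?),
        _ => return None, // empty handled by digits; 4+ parts rejected here
    };
    if !(1..=12).contains(&m) || d < 1 || d > days_in_month(y, m) {
        return None;
    }
    i32::try_from(days_from_civil(y, m, d)).ok()
}
```

The defaults are chosen before the arithmetic runs, so the day-count math is identical for all three precisions.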


chatgpt-codex-connector Bot reviewed Apr 30, 2026


AdaWorldAPI merged commit 41145c0 into main Apr 30, 2026
1 of 5 checks passed

AdaWorldAPI mentioned this pull request Apr 30, 2026

feat(transcode): round-3 typed-value resolver for triples_to_batch #316

Merged

AdaWorldAPI mentioned this pull request Apr 30, 2026

feat(transcode): round-4 Date(Month) and Date(Year) precision support #318

Merged


feat(transcode): Phase-2-B triples_to_batch (ExpandedTriple stream → RecordBatch) #313: AdaWorldAPI merged 1 commit into main from claude/transcode-phase2b-spo-scan-L3DF0
