feat(transcode): Phase-2-B triples_to_batch (ExpandedTriple stream → RecordBatch) #313

Merged: AdaWorldAPI merged 1 commit into main from claude/transcode-phase2b-spo-scan-L3DF0 on Apr 30, 2026
@AdaWorldAPI (Owner)

Summary

Phase-2-B: the missing reverse-direction helper — triples_to_batch(soa, &[ExpandedTriple]) → RecordBatch. Takes a stream of triples (what Ontology::expand_entity() returns) and materialises one Arrow row per subject.

This is the Phase 5 / Phase 2-B bridge that consumer code needs to round-trip an entity through the SPO substrate.

Why this shape — not "walk SpoStore::scan(lookup)"

The original Phase-2 plan-doc described "walk SpoStore::scan(lookup)" as the read path. SpoStore doesn't expose a flat scan(lookup) — its API is per-verb (query_forward / query_reverse / query_relation) and the FNV-1a fingerprint is one-way (subject can't be reversed back to entity_id).

ExpandedTriple is the right input shape:

  • Carries the canonical subject_label (entity:{type}:{id}) → entity_id is recoverable.
  • The contract crate already mints these via SchemaExpander::expand_entity().

Consumers wire SpoStore → ExpandedTriple → RecordBatch through their own subject-id-aware reader (the side-table mapping subject fingerprint → entity_id is consumer-side state, not transcode's). This helper does the second step canonically, so every consumer doesn't reinvent it.

What ships

pub fn triples_to_batch(
    soa: &OuterSchema,
    triples: &[ExpandedTriple],
) -> Result<RecordBatch, TranscodeError>;

pub fn round1_lenient_schema(soa: &OuterSchema) -> SchemaRef;
fn parse_entity_id_from_label(label: &str, expected_type: &str) -> Option<u64>;
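
To illustrate how the canonical `entity:{type}:{id}` label makes entity_id recoverable, here is a minimal sketch of a parser with the same signature as the private helper. The body is an assumption, not the shipped code: in particular, whether the `{type}` segment is a name or a numeric id, and how strictly the id is validated, is not stated in this PR.

```rust
/// Hypothetical sketch: recover the numeric id from `entity:{type}:{id}`.
/// Returns None when the prefix, the type segment, or the id does not match.
fn parse_entity_id_from_label(label: &str, expected_type: &str) -> Option<u64> {
    let rest = label.strip_prefix("entity:")?;
    let rest = rest.strip_prefix(expected_type)?;
    // The type segment must be followed directly by the `:` separator.
    let id = rest.strip_prefix(':')?;
    // Reject empty ids and anything that is not plain decimal digits
    // (u64::from_str alone would also accept a leading `+`).
    if id.is_empty() || !id.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    id.parse::<u64>().ok()
}
```

Under these assumptions, a label minted for a different type, or one missing the `entity:` prefix, yields `None`, which a caller could map to `BadSubjectLabel`.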

Behaviour

| Input shape | Outcome |
| --- | --- |
| Triples for one subject | 1 row, columns populated from matching predicates |
| Triples for N subjects | N rows, lex-sorted by subject_label (BTreeMap) |
| Triple's entity_type_id ≠ schema's | Err(EntityTypeMismatch) |
| subject_label not in entity:{type}:{id} form | Err(BadSubjectLabel) |
| Triple's predicate not in schema | silently dropped (BBB outer-view rule) |
| Required column with no triple in the group | Err(MissingColumn) |
| Empty input | empty batch with the right schema |
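
The grouping rules above can be sketched without Arrow. The types below (`Triple`, `Column`, `GroupError`, `Row`) are hypothetical stand-ins for `ExpandedTriple`, the `OuterSchema` columns, and `TranscodeError`; the sketch also assumes that a repeated predicate for one subject keeps the last value, which the PR does not specify.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins; the real ExpandedTriple and OuterSchema carry more fields.
struct Triple { subject_label: String, predicate: String, object_label: String }
struct Column { name: &'static str, required: bool }

#[derive(Debug, PartialEq)]
enum GroupError { MissingColumn(String) }

/// One row per subject: (subject_label, cells in schema column order).
type Row = (String, Vec<Option<String>>);

fn group_rows(columns: &[Column], triples: &[Triple]) -> Result<Vec<Row>, GroupError> {
    // BTreeMap iterates keys in lexicographic order, giving a stable row order.
    let mut groups: BTreeMap<&str, BTreeMap<&str, &str>> = BTreeMap::new();
    for t in triples {
        let cells = groups.entry(t.subject_label.as_str()).or_default();
        // Predicates the schema does not declare are dropped silently.
        if columns.iter().any(|c| c.name == t.predicate) {
            cells.insert(t.predicate.as_str(), t.object_label.as_str());
        }
    }
    let mut rows = Vec::with_capacity(groups.len());
    for (subject, cells) in groups {
        let mut row = Vec::with_capacity(columns.len());
        for col in columns {
            match cells.get(col.name) {
                Some(v) => row.push(Some(v.to_string())),
                // A required column with no triple in this group is an error.
                None if col.required => {
                    return Err(GroupError::MissingColumn(col.name.to_string()))
                }
                None => row.push(None),
            }
        }
        rows.push((subject.to_string(), row));
    }
    Ok(rows)
}
```

Note that the order is lexicographic on the label string, not numeric on the id: `entity:Patient:10` sorts before `entity:Patient:2`.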

Round-1 honesty: Utf8 everywhere on the body

triples_to_batch emits every body column as nullable Utf8 regardless of the schema's declared ArrowTypeCode. The typed surface (Currency → Float32, Date(_) → Date32, etc.) applies on the from_columns / from_columns_partial path which has typed input. Triple input is string-shaped (object_label), so round-1 keeps it Utf8. Round 3 adds typed-value reconstruction inside this function.

The schema returned by the function is round1_lenient_schema(soa) (every body column nullable Utf8). The typed schema (arrow_schema(soa)) is still the wire-shape contract; round-3 makes triples_to_batch emit it.

New TranscodeError variants

EntityTypeMismatch {
    expected: EntityTypeId,
    got: EntityTypeId,
},
BadSubjectLabel(String),

Both surface as concrete typed errors so consumers can branch on them rather than parsing display strings.
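
As an illustration of branching on the variants rather than on display strings, the sketch below mirrors the two new variants plus `MissingColumn`. The enum here is a hypothetical copy: `EntityTypeId` is modelled as `u16`, the payload of `MissingColumn` is assumed to be a `String`, and the `Action` policy is invented for the example.

```rust
// Hypothetical mirror of the variants named above; the real enum lives in the
// transcode module and EntityTypeId is its own type (u16 is an assumption).
type EntityTypeId = u16;

#[allow(dead_code)]
#[derive(Debug)]
enum TranscodeError {
    EntityTypeMismatch { expected: EntityTypeId, got: EntityTypeId },
    BadSubjectLabel(String),
    MissingColumn(String),
}

/// Illustrative consumer policy for each failure mode.
#[derive(Debug, PartialEq)]
enum Action {
    RouteToOtherSchema(EntityTypeId),
    SkipSubject,
    Abort,
}

fn classify(err: &TranscodeError) -> Action {
    match err {
        // The triple belongs to another entity type: retry with that schema.
        TranscodeError::EntityTypeMismatch { got, .. } => Action::RouteToOtherSchema(*got),
        // The label was not minted in the canonical form: skip it.
        TranscodeError::BadSubjectLabel(_) => Action::SkipSubject,
        // A required column is absent: treat as fatal.
        TranscodeError::MissingColumn(_) => Action::Abort,
    }
}
```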

Tests

12/12 pass under --features query-lite,auth-rls-lite:

  • 5 existing (regression coverage on from_columns / from_columns_partial)
  • 7 new for triples_to_batch:
    • triples_to_batch_produces_one_row_per_subject — two patients via expand_entity
    • triples_to_batch_rejects_mixed_entity_types — Diagnosis triples fed to a Patient soa
    • triples_to_batch_returns_empty_batch_for_empty_input — clean empty case
    • triples_to_batch_drops_undeclared_predicates_silently — BBB outer-view
    • triples_to_batch_rejects_missing_required_column — required column gate
    • triples_to_batch_subject_label_round_trip: expand_entity(999_999) → id_col[0] == 999_999
    • triples_to_batch_preserves_lex_subject_order — BTreeMap-stable order

What's deferred to Round 3

  • Typed value reconstruction: parse object_label according to semantic_type ("123.45" → Float32 123.45, "1980-04-15" → Date32 days_since_epoch, etc.). Schema agreement between triples_to_batch and arrow_schema once typed.
  • SpoStore reader proper: needs the consumer-side fingerprint → entity_id side-table. Out of scope here; tracked in .claude/plans/sql-spo-ontology-bridge-v1.md.

Verified

  • cargo check -p lance-graph-callcenter --no-default-features --features query-lite — clean
  • cargo test transcode::zerocopy:: — 12/12 pass
  • cargo clippy — zero zerocopy warnings
  • rustfmt --check — clean

Files changed

  • crates/lance-graph-callcenter/src/transcode/zerocopy.rs (+~250 / −~10)

Cross-link

Generated by Claude Code



Adds the missing reverse-direction helper: takes a stream of
ExpandedTriple (what Ontology::expand_entity returns) and
materialises a RecordBatch grouped by subject_label. This is the
Phase 5 (in #309's ROADMAP) / Phase 2-B (in the SQL-SPO bridge
plan) bridge that consumer code needs to roundtrip an entity row
through the SPO substrate.

## Why this shape (and not 'walk SpoStore::scan(lookup)')

The original Phase-2 plan-doc described 'walk SpoStore::scan(lookup)'
as the read path. SpoStore (in lance-graph proper) is
fingerprint-Hamming-indexed and doesn't expose a flat scan(lookup)
method — its API is per-verb (query_forward / query_reverse /
query_relation) and the FNV-1a fingerprint is one-way (subject can't
be reversed back to entity_id).

ExpandedTriple is the right input shape: it carries the canonical
subject_label (`entity:{type}:{id}`) so entity_id is recoverable,
and the contract crate already mints these via SchemaExpander.
Consumers wire SpoStore -> ExpandedTriple -> RecordBatch through
their own subject_id-aware reader; this helper does the second
step canonically.

## What ships

  triples_to_batch(soa, &[ExpandedTriple]) -> Result<RecordBatch>
    - Groups by subject_label (BTreeMap, lex sort, stable)
    - Parses entity_id from `entity:{type}:{id}`
    - Emits one row per subject; missing-required surfaces as
      MissingColumn error
    - Drops triples whose predicate isn't declared in the schema
      (BBB outer-view rule)
    - Rejects mixed entity_type via EntityTypeMismatch

  round1_lenient_schema(soa) -> SchemaRef
    - Round-1 helper: every body column emitted as nullable Utf8.
      The typed schema (Float32, Date32, etc.) applies on the
      from_columns / from_columns_partial path which has typed
      input. Round 3 adds typed-value reconstruction inside
      triples_to_batch.

  parse_entity_id_from_label(label, expected_type) -> Option<u64>
    - Private helper. Matches the canonical mint format.

Two new TranscodeError variants:
  EntityTypeMismatch { expected, got }
  BadSubjectLabel(String)

## Tests

5 prior + 7 new (zerocopy:: total 12, all pass under query-lite +
auth-rls-lite):
  triples_to_batch_produces_one_row_per_subject
  triples_to_batch_rejects_mixed_entity_types
  triples_to_batch_returns_empty_batch_for_empty_input
  triples_to_batch_drops_undeclared_predicates_silently
  triples_to_batch_rejects_missing_required_column
  triples_to_batch_subject_label_round_trip
  triples_to_batch_preserves_lex_subject_order

## What's deferred

- Typed value reconstruction (round 3): every body column emits
  Utf8 today; round 3 parses object_label according to
  semantic_type (Currency -> Float32, Date -> Date32, etc.).
- SpoStore reader proper: still needs a side-table mapping
  subject fingerprint -> entity_id. Consumer-side; tracked in
  `.claude/plans/sql-spo-ontology-bridge-v1.md`.

Verified: cargo check + cargo test transcode::zerocopy:: -> 12/12
pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean,
fmt clean.

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66feaea26e


fields.push(Field::new("id", DataType::UInt64, false));
fields.push(Field::new("entity_type", DataType::Utf8, false));
for col in &soa.columns {
    fields.push(Field::new(col.name, DataType::Utf8, true));

P2: Preserve required-field nullability in lenient schema

round1_lenient_schema marks every body field as nullable, which drops the required/optional contract encoded in OuterSchema and used by arrow_schema. That means batches produced by triples_to_batch no longer carry non-null constraints for required properties, and can become schema-incompatible with batches built through from_columns for the same entity type. The field nullability should still follow PropertyKind even if the data type is temporarily widened to Utf8.
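
A sketch of what the suggestion amounts to, written without the Arrow crate so it stays self-contained. `PropertyKind`'s variants, `Column`, and `FieldSketch` are hypothetical stand-ins (the real code builds `arrow` `Field` values from `OuterSchema`); only the rule "keep Utf8, derive nullability from the property kind" is taken from the review comment.

```rust
// Hypothetical, Arrow-free stand-ins for OuterSchema columns and Arrow fields.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum PropertyKind { Required, Optional, Free }

struct Column { name: &'static str, kind: PropertyKind }

#[derive(Debug, PartialEq)]
struct FieldSketch { name: &'static str, data_type: &'static str, nullable: bool }

fn lenient_fields_preserving_nullability(columns: &[Column]) -> Vec<FieldSketch> {
    let mut fields = vec![
        FieldSketch { name: "id", data_type: "UInt64", nullable: false },
        FieldSketch { name: "entity_type", data_type: "Utf8", nullable: false },
    ];
    for col in columns {
        fields.push(FieldSketch {
            name: col.name,
            // The data type stays widened to Utf8 for round 1.
            data_type: "Utf8",
            // Nullability follows the declared kind instead of always being true.
            nullable: col.kind != PropertyKind::Required,
        });
    }
    fields
}
```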


@AdaWorldAPI merged commit 41145c0 into main on Apr 30, 2026
1 of 5 checks passed
AdaWorldAPI added a commit that referenced this pull request Apr 30, 2026
Closes the deferred item from PR #313: triples_to_batch could only
emit hashes (`value:1a2b3c4d5e6f7a8b`) because ExpandedTriple's
object_label is FNV-1a-encoded. Round-3 adds
triples_to_batch_with_resolver, which takes a closure mapping the
object_label back to original value bytes (consumer-side state)
and parses per the column's SemanticType into a typed Arrow scalar.

## API

  pub fn triples_to_batch_with_resolver<R>(
      soa: &OuterSchema,
      triples: &[ExpandedTriple],
      resolver: R,
  ) -> Result<RecordBatch, TranscodeError>
  where R: Fn(&str) -> Option<Vec<u8>>,

Returns the canonical typed arrow_schema (NOT the lenient Utf8
fallback that triples_to_batch uses). The original triples_to_batch
stays unchanged for callers without a resolver.
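
A resolver satisfying the `Fn(&str) -> Option<Vec<u8>>` bound can be as small as a lookup into a consumer-owned side table. The sketch below assumes such a table is a `HashMap` keyed by the hashed object_label; `make_resolver` and the table layout are hypothetical, not part of this commit.

```rust
use std::collections::HashMap;

/// Build a resolver over a consumer-owned side table (hypothetical layout):
/// key = hashed object_label, value = the original value bytes.
fn make_resolver(table: HashMap<String, Vec<u8>>) -> impl Fn(&str) -> Option<Vec<u8>> {
    // A miss returns None, which the typed path turns into a null cell.
    move |object_label: &str| table.get(object_label).cloned()
}
```

The returned closure could then be passed as the `resolver` argument; labels absent from the table resolve to `None`.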

## Type mapping

  Currency(_)        -> Float32  (str::parse::<f32>)
  Date(_)            -> Date32   (YYYY-MM-DD -> days_since_epoch)
  CustomerId         -> UInt64   (str::parse::<u64>)
  InvoiceNumber      -> UInt64   (str::parse::<u64>)
  PlainText / etc.   -> Utf8     (UTF-8 lossy)
  FixedSizeListF32 /
  FixedSizeBinary    -> nulls    (resolver returns single-bytes only;
                                  fixed-shape needs round-5 wide-payload
                                  resolver)
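
A minimal sketch of the scalar part of this mapping, using plain Rust values in place of Arrow arrays. `SemanticType` and `Cell` are hypothetical, reduced enums; the date case is left out here and the real function writes into Arrow builders rather than returning an enum.

```rust
// Hypothetical subset of the semantic types and typed cells in the mapping.
enum SemanticType { Currency, CustomerId, InvoiceNumber, PlainText }

#[derive(Debug, PartialEq)]
enum Cell { Float32(f32), UInt64(u64), Utf8(String) }

/// Parse resolved bytes according to the column's semantic type.
/// None means "bytes present but not parseable as this type".
fn parse_cell(ty: &SemanticType, bytes: &[u8]) -> Option<Cell> {
    match ty {
        SemanticType::Currency => {
            let s = std::str::from_utf8(bytes).ok()?;
            s.parse::<f32>().ok().map(Cell::Float32)
        }
        SemanticType::CustomerId | SemanticType::InvoiceNumber => {
            let s = std::str::from_utf8(bytes).ok()?;
            s.parse::<u64>().ok().map(Cell::UInt64)
        }
        // Text never fails: invalid UTF-8 is replaced with U+FFFD.
        SemanticType::PlainText => {
            Some(Cell::Utf8(String::from_utf8_lossy(bytes).into_owned()))
        }
    }
}
```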

## Required vs optional behaviour

  - Resolver returns None -> null cell. Required columns with all
    rows resolved as None surface as MissingColumn (unchanged from
    triples_to_batch).
  - Resolver returns Some(bytes) but parser fails:
      * REQUIRED column -> typed ParseFailure { column, reason }.
        Surfaces the typed error rather than letting Arrow reject
        the null with an opaque InvalidArgumentError later.
      * OPTIONAL/FREE column -> null cell. Consumers that need
        to know about parse failures can compare against the
        original input separately.
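
The per-cell decision described above can be sketched as a small generic function. `resolve_cell`, its parameters, and the reason string are hypothetical; the column-level check that a required column is not entirely unresolved (the `MissingColumn` case) is deliberately left out.

```rust
#[derive(Debug, PartialEq)]
struct ParseFailure { column: String, reason: &'static str }

/// Decide one cell: Ok(Some(v)) = value, Ok(None) = null, Err = typed failure.
fn resolve_cell<T>(
    column: &str,
    required: bool,
    resolved: Option<Vec<u8>>,
    parse: impl Fn(&[u8]) -> Option<T>,
) -> Result<Option<T>, ParseFailure> {
    // Resolver miss: always a null cell at this level.
    let Some(bytes) = resolved else {
        return Ok(None);
    };
    match parse(&bytes) {
        Some(v) => Ok(Some(v)),
        // Required column with unparseable bytes: surface a typed error.
        None if required => Err(ParseFailure {
            column: column.to_string(),
            reason: "unparseable value",
        }),
        // Optional/free column with unparseable bytes: degrade to null.
        None => Ok(None),
    }
}
```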

## Helper: parse_iso_date_to_days(s)

Howard Hinnant civil_to_days, public-domain. Tested against:
  - 1970-01-01 -> 0
  - 1970-01-02 -> 1
  - 2000-01-01 -> 10_957
  - 2020-02-29 -> 18_321  (leap day)
  - garbage / out-of-range / wrong-separator -> None
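
For reference, a self-contained sketch of a strict YYYY-MM-DD parser built on Hinnant's days-from-civil formula. The function name matches the helper above, but the validation details (fixed field widths, calendar-aware day check, `i32` return) are assumptions about how "garbage / out-of-range" might be rejected, not the committed code.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date (Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y }; // treat Jan/Feb as end of prior year
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

fn days_in_month(y: i64, m: i64) -> i64 {
    let leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    match m {
        4 | 6 | 9 | 11 => 30,
        2 => if leap { 29 } else { 28 },
        _ => 31,
    }
}

/// Parse exactly `width` ASCII digits.
fn digits(s: &str, width: usize) -> Option<i64> {
    if s.len() == width && s.bytes().all(|b| b.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

fn parse_iso_date_to_days(s: &str) -> Option<i32> {
    let parts: Vec<&str> = s.split('-').collect();
    let [y, m, d] = parts.as_slice() else {
        return None; // wrong separator or wrong number of components
    };
    let (y, m, d) = (digits(y, 4)?, digits(m, 2)?, digits(d, 2)?);
    if !(1..=12).contains(&m) || d < 1 || d > days_in_month(y, m) {
        return None;
    }
    i32::try_from(days_from_civil(y, m, d)).ok()
}
```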

## New TranscodeError variant

  ParseFailure { column: String, reason: &'static str }

## Tests

12 prior + 9 new (zerocopy:: total 21, all pass under
query-lite + auth-rls-lite):

  typed_resolver_currency_parses_to_float32
  typed_resolver_date_parses_to_days_since_epoch
  typed_resolver_customer_id_round_trips_uint64
  typed_resolver_required_unparseable_returns_parse_failure
  typed_resolver_optional_unparseable_emits_null
  typed_resolver_returns_null_when_resolver_misses
  typed_resolver_required_all_unresolved_errors
  typed_resolver_iso_date_parses_known_dates
  typed_resolver_iso_date_rejects_garbage

## What's still deferred

- Date(Month) / Date(Year) precisions. Today only YYYY-MM-DD
  parses; round-4 plumbs the precision into the parser.
- Geo / File / Image collapse to Utf8 still. Round-4 candidates.
- async resolver. Round-5.
- FixedSizeListF32 / FixedSizeBinary single-bytes resolver. Round-5
  wide-payload resolver.

Verified: cargo check + cargo test transcode::zerocopy:: -> 21/21
pass under {query-lite, query-lite+auth-rls-lite}. Clippy clean,
fmt clean.

Cross-link: PR #313 (the round-1 triples_to_batch this extends).
AdaWorldAPI added a commit that referenced this pull request Apr 30, 2026
Closes the deferred item from PR #316: parse_iso_date_to_days
previously only accepted YYYY-MM-DD, even for columns declared
Date(Month) or Date(Year). Round-4 widens the parser to accept
all three precisions, treating lower-precision inputs as the
earliest point in the period:

  YYYY-MM-DD  exact day
  YYYY-MM     first day of the month  (1970-02 -> 31)
  YYYY        first day of the year   (2000    -> 10_957)

Algorithm unchanged (Howard Hinnant civil_to_days). The match on
parts.as_slice() lets the parser decide what to default for the
missing components without changing the math.
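
A sketch of that slice match, isolated from the day-count math. `parse_components` and `num` are hypothetical names; the sketch defaults missing components to 1 and only range-checks month and day coarsely, leaving calendar-aware validation and the days-since-epoch conversion to the unchanged downstream step.

```rust
/// Parse exactly `width` ASCII digits.
fn num(s: &str, width: usize) -> Option<i64> {
    if s.len() == width && s.bytes().all(|b| b.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

/// Split an ISO-style date into (year, month, day). Missing components
/// default to 1, so lower-precision input maps to the start of its period.
fn parse_components(s: &str) -> Option<(i64, i64, i64)> {
    let parts: Vec<&str> = s.split('-').collect();
    let (y, m, d) = match parts.as_slice() {
        [y] => (num(y, 4)?, 1, 1),                          // YYYY
        [y, m] => (num(y, 4)?, num(m, 2)?, 1),              // YYYY-MM
        [y, m, d] => (num(y, 4)?, num(m, 2)?, num(d, 2)?),  // YYYY-MM-DD
        _ => return None,                                   // 4+ parts
    };
    // Coarse range check only; month-length validation is not done here.
    if !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some((y, m, d))
}
```

An empty string splits into a single empty part, which fails the four-digit year check, so it is rejected along with four-part input.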

## Tests (4 new + 1 expanded; 13 typed_resolver total)

  typed_resolver_iso_year_only_parses_to_jan_1
  typed_resolver_iso_year_month_parses_to_day_1
  typed_resolver_iso_lower_precision_rejects_bad_components
    -- bad month / bad day / non-numeric / empty year
  typed_resolver_iso_lower_precision_handles_leap_year_feb
    -- 2020-02 = day 31 of 2020 = 18_293 since epoch
  typed_resolver_iso_date_rejects_garbage (expanded)
    -- adds empty-string + 4-part-input rejection cases

## What's still deferred (round-5 territory)

- Strict-mode parser that REFUSES cross-precision inputs (e.g.
  a column declared Date(Day) rejects a YYYY-only string). Today's
  permissive default is the right round-1 choice for graceful
  up-cast; strict-mode lands when a consumer asks.
- Date(DateTime) precision -- ISO timestamp parser would need a
  Timestamp column type, not Date32. Different Arrow plumbing.

Verified: cargo test transcode::zerocopy::tests::typed_resolver::
13/13 pass under query-lite,auth-rls-lite. Clippy clean,
fmt clean.

Cross-link: PR #316 (round-3 typed-value resolver), PR #313
(round-1 triples_to_batch, where Date(_) was first plumbed).
AdaWorldAPI added a commit that referenced this pull request Apr 30, 2026
…#318)
