feat(callcenter::transcode): outer ↔ inner ontology mapper + parallelbetrieb #309
…betrieb
Reusable Foundry-style mapper between the wire-shape DTO surface
(already in `ontology_dto`) and the inner SoA / SPO substrate.
Domain-agnostic — every module operates on whatever `Ontology` is
handed in. No medcare or smb specifics live under transcode/.
## Modules (all under crate::transcode)
zerocopy         OuterColumn / OuterSchema / OwnedColumn /
                 from_columns. The cheap-zerocopy lane: Vec<T> →
                 Buffer is O(1) reinterpretation. Refuses
                 undeclared columns at the boundary.

cam_pq_decode    CamPqDecoder trait + PassthroughDecoder for
                 CodecRoute::{Skip, Passthrough}. The codec math
                 itself stays in lance_graph_contract::cam.

spo_filter       SpoFilterTranslator: SQL filter terms →
                 SpoLookup. Domain-agnostic; uses canonical
                 lance_graph_contract::hash::fnv1a for predicate
                 fingerprints.

ontology_table   OntologyTableProvider: DataFusion TableProvider
                 over (Ontology, entity_type). Round-1 backs scan
                 with MemTable; SpoStore reader is the next round.

parallelbetrieb  DriftEvent + DriftKind + Reconciler trait. The
                 ONE deliberate transition bandaid: MySQL ↔
                 DataFusion ↔ SPO ground-truth reconciliation.
                 Schema matches MedCareV2's C# DriftEvent.ToJson()
                 so both sides feed one dashboard.
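
The `spo_filter` entry refers to FNV-1a predicate fingerprints. As a rough illustration only, the sketch below is a standalone 64-bit FNV-1a using the published constants. It is an assumption that the real `lance_graph_contract::hash::fnv1a` is the 64-bit variant and hashes the raw predicate bytes; `fnv1a_64` and `predicate_fingerprint` are hypothetical names, not the contract crate's API.

```rust
/// Standard published constants for 64-bit FNV-1a.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Plain 64-bit FNV-1a over a byte slice: XOR the byte in, then multiply.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper (not the crate's API): fingerprint a predicate
/// name exactly as written, with no case folding or trimming.
pub fn predicate_fingerprint(predicate: &str) -> u64 {
    fnv1a_64(predicate.as_bytes())
}
```

Because the hash is one-way, a fingerprint can be compared against stored fingerprints but cannot be turned back into the predicate name.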
## Why a submodule, not a sibling crate
PR #73 on medcare-rs explicitly framed `lance-graph-callcenter` as
the Foundry / supabase-realtime transcode crate. A sibling crate
would create a competing framing. These are the four reusable
primitives + the one bandaid; they belong here.
## What this does NOT introduce
- No duplicate of ontology_dto. The DTO surface stays canonical;
transcode re-exports OntologyDto / EntityTypeDto / Locale /
Label / SchemaExpander from one path so consumers reach the
whole transcode surface from one import.
- No duplicate of version_watcher. Realtime fan-out belongs to
LanceVersionWatcher; transcode does not introduce a second
channel primitive.
- No new HTTP / WS deps. Realtime transport adapters live in
the consumer-facing serve feature.
## Cargo / features
- Adds `async-trait = "0.1"` (small, no transitive deps).
- transcode/zerocopy + transcode/ontology_table compile under
`query-lite` or `persist` (both pull in arrow); the others
are always-on.
## Verified
- cargo check: clean across {default, persist, query-lite,
audit-log, query-lite+persist+audit-log,
auth-rls-lite+query-lite}
- cargo test transcode::: 26 passed, 0 failed
- clippy: zero transcode warnings across the same feature combos
- rustfmt --check: clean on all 6 new files
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a60c1eaa9e
    Some((_, slot_opt)) => slot_opt
        .take()
        .ok_or_else(|| TranscodeError::MissingColumn(soa_col.name.to_string()))?,
    None => return Err(TranscodeError::MissingColumn(soa_col.name.to_string())),
    };
Allow optional ontology columns to be omitted
from_columns currently raises MissingColumn for every schema column that is not present in body_columns, regardless of whether the property is optional. For ontologies with nullable fields (for example krankenkasse), omitting that column or trying to represent all-null values fails the batch build, which makes optional fields effectively required at runtime. This breaks ingestion/query paths for sparse records and should instead synthesize null arrays for optional properties (while still erroring on required ones).
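
One way the suggested behaviour could look, as a minimal std-only sketch: columns are modelled as `Vec<Option<String>>` rather than Arrow arrays, and `Presence`, `BuildError`, `Column`, and `build_partial` are hypothetical names introduced for illustration, not the crate's types. Omitted optional columns are filled with nulls; missing required columns and undeclared columns still fail.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the ontology's required/optional marker.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Presence {
    Required,
    Optional,
}

#[derive(Debug, PartialEq)]
pub enum BuildError {
    MissingColumn(String),
    UndeclaredColumn(String),
}

/// A column is modelled as one nullable string per row (not an Arrow array).
pub type Column = Vec<Option<String>>;

pub fn build_partial(
    schema: &[(&str, Presence)],
    mut supplied: BTreeMap<String, Column>,
    row_count: usize,
) -> Result<Vec<(String, Column)>, BuildError> {
    let mut out = Vec::with_capacity(schema.len());
    for &(name, presence) in schema {
        let column = match (supplied.remove(name), presence) {
            // Caller supplied the column: use it as-is.
            (Some(col), _) => col,
            // Optional and omitted: synthesize an all-null column.
            (None, Presence::Optional) => vec![None; row_count],
            // Required and omitted: still an error.
            (None, Presence::Required) => {
                return Err(BuildError::MissingColumn(name.to_string()))
            }
        };
        out.push((name.to_string(), column));
    }
    // Anything left in the map was never declared in the schema.
    if let Some(name) = supplied.into_keys().next() {
        return Err(BuildError::UndeclaredColumn(name));
    }
    Ok(out)
}
```

With a schema of a required `name` and an optional `krankenkasse`, supplying only `name` yields a `krankenkasse` column of nulls instead of a `MissingColumn` error.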
    let id = entity_type_id(self.ontology, s);
    if id != 0 {
        out.entity_type_id = Some(id);
    }
Preserve impossible entity_type filters in SPO lookup
When translating entity_type = 'UnknownType', the code computes ID 0 and then drops the predicate entirely, leaving entity_type_id unset. Any caller that relies on SpoLookup alone will treat this as an unconstrained lookup and can return rows instead of the SQL-correct empty result set. Unknown entity types should be encoded as an impossible constraint (or explicit no-match state), not silently removed.
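
A minimal sketch of the "explicit no-match state" idea, under the assumption that the lookup can carry a three-state constraint instead of an `Option`. `EntityTypeConstraint`, `constrain_entity_type`, and the sample type names in the usage note are hypothetical; the real `SpoLookup` fields may be shaped differently.

```rust
/// Hypothetical three-state constraint replacing `Option<id>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityTypeConstraint {
    /// No entity_type predicate was given.
    Any,
    /// The predicate named a declared type.
    Exactly(u32),
    /// The predicate named an unknown type: nothing can match.
    NoMatch,
}

/// Translate `entity_type = '<wanted>'` against the declared (name, id) pairs.
pub fn constrain_entity_type(declared: &[(&str, u32)], wanted: &str) -> EntityTypeConstraint {
    match declared.iter().find(|(name, _)| *name == wanted) {
        Some(&(_, id)) => EntityTypeConstraint::Exactly(id),
        // Keep the predicate as an impossible constraint instead of dropping it.
        None => EntityTypeConstraint::NoMatch,
    }
}

impl EntityTypeConstraint {
    /// Whether a row with the given entity type id passes the constraint.
    pub fn admits(self, row_entity_type_id: u32) -> bool {
        match self {
            EntityTypeConstraint::Any => true,
            EntityTypeConstraint::Exactly(id) => id == row_entity_type_id,
            EntityTypeConstraint::NoMatch => false,
        }
    }
}
```

With this shape, `entity_type = 'UnknownType'` translates to `NoMatch`, so a caller that relies on the lookup alone returns the empty set rather than an unconstrained scan.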
partial writes + CachedOntology + route validation

Addresses the five concrete gaps the brutally-honest review on #309
called out:

1. arrow_type_for_semantic no longer collapses everything to Utf8.
   Currency → Float32, Date(_) → Date32, CustomerId / InvoiceNumber →
   UInt64. DataFusion can now do real numeric / temporal predicate
   pushdown on those columns. The remaining semantic types stay Utf8
   by deliberate choice; round 3 may pivot specific ones
   (Geo → struct{lat,lon}) when a consumer asks.

2. CachedOntology helper extracted upstream. Bundles an Arc<Ontology>
   with eagerly-projected DTOs per Locale (De/En). Prevents the
   per-call OntologyDto::from_ontology rebuild that medcare-rs's
   MedcareOntology and smb-office-rs's session ontology both grew
   independently. One implementation, one bug surface.

3. validate_route(route, ontology) added to parallelbetrieb. Parses
   /api/{entity_type}/{...} and asserts entity_type resolves to a
   declared Schema.name (case-insensitive). 4 tests covering
   accept-valid, reject-typo, reject-missing-prefix, reject-empty.
   Used as a pre-flight check for static route lists; the runner
   itself doesn't gate on it because typo routes are still genuine
   drift telemetry.

4. from_columns_partial added for PATCH-style upserts. Allows omitted
   Optional / Free columns (filled with Arrow null arrays); still
   rejects missing Required columns and undeclared columns. The
   existing from_columns keeps its strict full-row contract.

5. route_for_column now reads OuterColumn.codec_route directly, copied
   from the upstream PropertySpec.codec_route at schema derivation.
   Round-1 went through route_tensor which is calibrated for
   model-weight tensor names (q_proj, lm_head, ...) and would silently
   mis-classify document predicates. The contract's own field is now
   the source of truth: drift-by-construction impossible.

Adds 7 new tests (6 in transcode/ + 1 implicit in cam_pq_decode where
the existing test was rewritten to assert the new semantics):

- cached_ontology_projects_every_locale_at_construction
- cached_ontology_clones_are_arc_cheap
- cached_ontology_inner_round_trips
- validate_route_accepts_known_entity_type
- validate_route_rejects_typo_entity_type
- validate_route_rejects_missing_api_prefix
- validate_route_rejects_empty_entity_segment
- route_for_required_scalar_column_uses_property_spec_default
  (rewritten from route_for_scalar_columns_skips_codec; old test
  asserted the wrong thing: required scalars should default to
  Passthrough, not Skip)

Verified:

- cargo check across {default, query-lite,
  query-lite+audit-log+auth-rls-lite}: clean
- cargo test transcode:: → 33/33 passed (was 26/26 in #309; +7 new +
  1 rewritten)
- cargo clippy on the same combos: zero transcode warnings
- rustfmt --check: clean across all 4 modified files
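
The route check described in item 3 can be sketched roughly as follows. This is a std-only illustration, not the shipped function: it takes a plain slice of schema names instead of an `Ontology`, and `RouteError` and its variants are hypothetical names chosen to mirror the four test cases.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RouteError {
    MissingApiPrefix,
    EmptyEntitySegment,
    UnknownEntityType(String),
}

/// Parse `/api/{entity_type}/{...}` and resolve `entity_type`
/// case-insensitively against the declared schema names.
/// Returns the canonical (declared) spelling on success.
pub fn validate_route<'a>(route: &str, schema_names: &[&'a str]) -> Result<&'a str, RouteError> {
    let rest = route
        .strip_prefix("/api/")
        .ok_or(RouteError::MissingApiPrefix)?;
    // First path segment after the prefix; empty for "/api/" or "/api//x".
    let segment = rest.split('/').next().unwrap_or("");
    if segment.is_empty() {
        return Err(RouteError::EmptyEntitySegment);
    }
    schema_names
        .iter()
        .copied()
        .find(|name| name.eq_ignore_ascii_case(segment))
        .ok_or_else(|| RouteError::UnknownEntityType(segment.to_string()))
}
```

Returning the declared spelling lets a pre-flight pass normalise static route lists, while a typo such as `/api/patinet/42` surfaces as `UnknownEntityType` and can still be reported as drift rather than blocked.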
…ring

F1 (MySQL <-> SPO oracle parity) shipped via MedCareV2 PRs #1, #2, #3,
medcare-rs PR #71, and lance-graph PR #309. The vision doc still
claimed F1 was "the next concrete deliverable".

Rewrite section 7 to:

- state F1 has shipped
- describe the LanceProbe -> ParityWitness -> DriftSink flow
- name the contract DTO
  (lance-graph-callcenter::transcode::parallelbetrieb::DriftEvent)
- list F1's known gaps (no latency claims; in-memory ring buffer)
- state F2 RBAC+audit wiring (medcare-rs adopting RlsRewriter) as the
  next posture

No other sections touched.
Adds the missing reverse-direction helper: takes a stream of
ExpandedTriple (what Ontology::expand_entity returns) and materialises
a RecordBatch grouped by subject_label. This is the Phase 5 (in #309's
ROADMAP) / Phase 2-B (in the SQL-SPO bridge plan) bridge that consumer
code needs to roundtrip an entity row through the SPO substrate.

## Why this shape (and not 'walk SpoStore::scan(lookup)')

The original Phase-2 plan-doc described 'walk SpoStore::scan(lookup)'
as the read path. SpoStore (in lance-graph proper) is
fingerprint-Hamming-indexed and doesn't expose a flat scan(lookup)
method: its API is per-verb (query_forward / query_reverse /
query_relation) and the FNV-1a fingerprint is one-way (subject can't
be reversed back to entity_id).

ExpandedTriple is the right input shape: it carries the canonical
subject_label (`entity:{type}:{id}`) so entity_id is recoverable, and
the contract crate already mints these via SchemaExpander. Consumers
wire SpoStore -> ExpandedTriple -> RecordBatch through their own
subject_id-aware reader; this helper does the second step canonically.

## What ships

triples_to_batch(soa, &[ExpandedTriple]) -> Result<RecordBatch>
- Groups by subject_label (BTreeMap, lex sort, stable)
- Parses entity_id from `entity:{type}:{id}`
- Emits one row per subject; missing-required surfaces as
  MissingColumn error
- Drops triples whose predicate isn't declared in the schema
  (BBB outer-view rule)
- Rejects mixed entity_type via EntityTypeMismatch

round1_lenient_schema(soa) -> SchemaRef
- Round-1 helper: every body column emitted as nullable Utf8. The
  typed schema (Float32, Date32, etc.) applies on the from_columns /
  from_columns_partial path which has typed input. Round 3 adds
  typed-value reconstruction inside triples_to_batch.

parse_entity_id_from_label(label, expected_type) -> Option<u64>
- Private helper. Matches the canonical mint format.

Two new TranscodeError variants:
- EntityTypeMismatch { expected, got }
- BadSubjectLabel(String)

## Tests

5 prior + 7 new (zerocopy:: total 12, all pass under query-lite +
auth-rls-lite):

- triples_to_batch_produces_one_row_per_subject
- triples_to_batch_rejects_mixed_entity_types
- triples_to_batch_returns_empty_batch_for_empty_input
- triples_to_batch_drops_undeclared_predicates_silently
- triples_to_batch_rejects_missing_required_column
- triples_to_batch_subject_label_round_trip
- triples_to_batch_preserves_lex_subject_order

## What's deferred

- Typed value reconstruction (round 3): every body column emits Utf8
  today; round 3 parses object_label according to semantic_type
  (Currency -> Float32, Date -> Date32, etc.).
- SpoStore reader proper: still needs a side-table mapping subject
  fingerprint -> entity_id. Consumer-side; tracked in
  `.claude/plans/sql-spo-ontology-bridge-v1.md`.

Verified: cargo check + cargo test transcode::zerocopy:: -> 12/12 pass
under {query-lite, query-lite+auth-rls-lite}. Clippy clean, fmt clean.
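
The grouping and label-parsing steps can be illustrated with a std-only sketch. It is a simplification under stated assumptions: `Triple`, `Row`, `parse_entity_id`, and `group_rows` are hypothetical stand-ins (the real helper takes `ExpandedTriple` and emits a `RecordBatch`), errors are collapsed into a `String` instead of the `EntityTypeMismatch` / `BadSubjectLabel` variants, and the required-column check is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for ExpandedTriple: labels only.
#[derive(Debug, Clone)]
pub struct Triple {
    pub subject_label: String,
    pub predicate: String,
    pub object_label: String,
}

/// Parse `entity:{type}:{id}`; `None` if the shape, type, or id is off.
pub fn parse_entity_id(label: &str, expected_type: &str) -> Option<u64> {
    let mut parts = label.splitn(3, ':');
    if parts.next()? != "entity" {
        return None;
    }
    if parts.next()? != expected_type {
        return None;
    }
    parts.next()?.parse().ok()
}

/// One row per subject: (entity_id, predicate -> object_label).
pub type Row = (u64, BTreeMap<String, String>);

pub fn group_rows(
    expected_type: &str,
    declared: &[&str],
    triples: &[Triple],
) -> Result<Vec<Row>, String> {
    // BTreeMap keeps subjects in lexical label order, independent of input order.
    let mut by_subject: BTreeMap<&str, BTreeMap<String, String>> = BTreeMap::new();
    for t in triples {
        let row = by_subject.entry(t.subject_label.as_str()).or_default();
        // Undeclared predicates are dropped silently.
        if declared.contains(&t.predicate.as_str()) {
            row.insert(t.predicate.clone(), t.object_label.clone());
        }
    }
    by_subject
        .into_iter()
        .map(|(label, row)| {
            parse_entity_id(label, expected_type)
                .map(|id| (id, row))
                .ok_or_else(|| format!("bad subject label: {label}"))
        })
        .collect()
}
```

Note that the order is lexical on the label, not numeric on the id: `entity:Patient:10` sorts before `entity:Patient:2`.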
## Summary

Reusable Foundry-style outer ↔ inner ontology mapper under `lance-graph-callcenter::transcode`, plus the one deliberate transition bandaid (`parallelbetrieb`) for the MySQL ↔ DataFusion ↔ SPO ground-truth reconciliation.

Domain-agnostic: every module operates on whatever `Ontology` is handed in. No medcare, smb, or any other vertical lives under `transcode/`.

## Why this is a callcenter submodule, not a sibling crate

Medcare-rs PR #73 (merged) framed `lance-graph-callcenter` as the canonical Foundry / supabase-realtime transcode crate. A sibling `lance-graph-transcode` would create a competing framing. These are the four reusable primitives + the one bandaid; they belong here, alongside `ontology_dto` and `version_watcher`.

## What ships

| Module | Contents |
| --- | --- |
| `transcode::zerocopy` | `OuterColumn` / `OuterSchema` / `OwnedColumn` / `from_columns`. Cheap-zerocopy lane: `Vec<T>` → `Buffer` is `O(1)`. Refuses undeclared columns at the boundary. |
| `transcode::cam_pq_decode` | `CamPqDecoder` trait + `PassthroughDecoder` for `CodecRoute::{Skip, Passthrough}`. Codec math stays in `lance_graph_contract::cam`. |
| `transcode::spo_filter` | `SpoFilterTranslator`: SQL filter terms → `SpoLookup`. Uses canonical `lance_graph_contract::hash::fnv1a`. |
| `transcode::ontology_table` | `OntologyTableProvider`: DataFusion `TableProvider` over `(Ontology, entity_type)`. Round-1 backs scan with `MemTable`; SpoStore reader is the next round. |
| `transcode::parallelbetrieb` | `DriftEvent` + `DriftKind` + `Reconciler` trait. The one deliberate bandaid. Schema matches MedCareV2's C# `DriftEvent.ToJson()` so both sides feed one dashboard. |

Total: 26 tests, all passing.
## Outer ↔ inner ontology framing

The transcode subtree is the mapper. Five files; ~1 100 LOC. Uses only primitives the rest of the workspace already exposes; no new dep on the bgz-tensor / cognitive-shader-driver internals.

## Why `parallelbetrieb` is in this PR (and labelled as transitional)

Every other `transcode/` module should still make sense in five years. `parallelbetrieb` is different by design: it's the MySQL ground-truth reconciler that runs in F1 → F4 to prove the new substrate is correct. The module's doc-comment is explicit:

The hard rules are spelled out in the module doc:
## What this PR does NOT introduce

- No duplicate of `ontology_dto`. The DTO surface stays canonical; `transcode/mod.rs` re-exports `OntologyDto` / `EntityTypeDto` / `Locale` / `Label` / `SchemaExpander` from one path so consumers reach the whole transcode surface from one import.
- No duplicate of `version_watcher`. Realtime fan-out belongs to `LanceVersionWatcher`. `transcode` does not introduce a second channel primitive.
- No new HTTP / WS deps. Realtime transport adapters live in the consumer-facing `serve` feature.

## Cargo / features

- Adds `async-trait = "0.1"` (small, no transitive deps).
- `transcode::zerocopy` + `transcode::ontology_table` compile under `query-lite` or `persist` (both pull in arrow); the others are always-on.
- `transcode::ontology_table` is gated on `query-lite` because it needs `datafusion::TableProvider`.

## Verified

- `cargo check`: clean across 6 feature combos: `{default, persist, query-lite, audit-log, query-lite+persist+audit-log, auth-rls-lite+query-lite}`
- `cargo test transcode::`: 26 passed, 0 failed
- `cargo clippy`: zero transcode warnings across the same feature combos
- `rustfmt --check`: clean on all 6 new files

## Files changed

- `crates/lance-graph-callcenter/Cargo.toml` (+5 lines: `async-trait` dep)
- `crates/lance-graph-callcenter/src/lib.rs` (+8 lines: `pub mod transcode;`)
- `crates/lance-graph-callcenter/src/transcode/mod.rs` (new, 78 lines)
- `crates/lance-graph-callcenter/src/transcode/zerocopy.rs` (new, ~360 lines)
- `crates/lance-graph-callcenter/src/transcode/cam_pq_decode.rs` (new, ~135 lines)
- `crates/lance-graph-callcenter/src/transcode/spo_filter.rs` (new, ~165 lines)
- `crates/lance-graph-callcenter/src/transcode/ontology_table.rs` (new, ~185 lines)
- `crates/lance-graph-callcenter/src/transcode/parallelbetrieb.rs` (new, ~265 lines)

Generated by Claude Code