Skip to content

feat: Foundry parity — RLS rewriter, audit log, PostgREST, with_registry#278

Merged
AdaWorldAPI merged 8 commits into
mainfrom
claude/foundry-parallel-track-a-2026-04-28
Apr 29, 2026
Merged

feat: Foundry parity — RLS rewriter, audit log, PostgREST, with_registry#278
AdaWorldAPI merged 8 commits into
mainfrom
claude/foundry-parallel-track-a-2026-04-28

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

  • LF-3/DM-7 (critical path): RLS rewriter as DataFusion OptimizerRule — tenant/actor predicate injection on every TableScan. Unblocks SMB F8 + MedCare RBAC simultaneously.
  • LF-90: Append-only audit log skeleton (AuditSink trait + InMemoryAuditSink ring buffer with poison recovery).
  • DM-8: PostgREST-shape handler stub — pure parse_path() + EchoHandler dispatcher, 20 tests, no HTTP deps.
  • LanceMembrane::with_registry(): Builder pattern for wiring RLS + audit + future plugins into the membrane.
  • StepDomain::Medcare: Enum variant for the medcare reality-check vertical.
  • Foundry docs: foundry-roadmap.md + medcare-foundry-vision.md drafts under .claude/.

New Cargo features on lance-graph-callcenter

  • audit-log — gates pub mod audit
  • postgrest — gates pub mod postgrest
  • membrane-plugins-rls / membrane-plugins-audit — gates builder integration

All feature combos pass cargo check; 35 lib tests pass under auth-rls-lite.

Diff stats

  • 11 files changed, +2,473 / -222 lines
  • 8 commits (6 workers + 1 meta integration + 1 path fix)

Test plan

  • cargo check -p lance-graph-callcenter (default, query-lite, auth-rls-lite, audit-log, postgrest, all-features)
  • cargo test -p lance-graph-callcenter --features auth-rls-lite --lib → 35 passed
  • Manual review of RLS predicate injection on multi-table JOINs
  • PostgREST filter parsing edge cases (nested paths, unicode table names)

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

@AdaWorldAPI AdaWorldAPI merged commit abc6375 into main Apr 29, 2026
1 of 4 checks passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da18308553

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +233 to +237
// Append to scan.filters — DataFusion AND-s these
// during execution and predicate-pushdown sees them as
// filterable expressions.
scan.filters.push(predicate);
Ok(Transformed::yes(LogicalPlan::TableScan(scan)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce RLS with a residual Filter node

Appending predicates only to TableScan.filters is not sufficient for correctness in DataFusion: those filters are pushdown candidates, and when a provider reports Unsupported or Inexact, DataFusion relies on a separate Filter node to enforce row filtering. This rewrite path adds no residual Filter, so RLS predicates can be partially or entirely skipped depending on table provider capabilities, which can leak cross-tenant rows.

Useful? React with 👍 / 👎.

Comment on lines +90 to +91
#[cfg(all(feature = "membrane-plugins-audit", feature = "audit-log"))]
audit: Option<Arc<dyn crate::audit::AuditSink>>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require Debug for audit trait object or drop derive

MembraneRegistry derives Debug while containing Option<Arc<dyn AuditSink>>, but AuditSink does not require Debug. When membrane-plugins-audit and audit-log are enabled together, this field makes the derive fail to compile (dyn AuditSink has no Debug impl), breaking that feature combination.

Useful? React with 👍 / 👎.

@AdaWorldAPI
Copy link
Copy Markdown
Owner Author

Review #1 — Brutally Honest Code Review

PR landed. Now the honest pass. 2,688 LOC across 9 files, 35 tests, all cargo check exit codes 0. That last sentence is doing a lot of work — cargo check says compiles, not correct, not safe, not finished. Here's what's actually wrong.

The one thing that has to change before anyone builds on this

🔴 CRITICAL — Fail-open RLS contradicts deny-by-default roadmap

rls.rs:19 documents the policy as "fail-open for unprivileged-but-non-secret data." foundry-roadmap.md:42 promises "deny-by-default on missing tenant context" with the exit criterion "no plan reaches physical execution without either a tenant predicate or an explicit system_context bypass marker."

These cannot both be true. The current implementation lets any TableScan of an unregistered table flow through with zero RLS predicates. A developer adding a new table forgets to register a policy → that table is silently world-readable across tenants. There is no compile-time check, no runtime warning, nothing.

This is the entire point of LF-3/DM-7. Both downstream consumers (SMB F8, MedCare RBAC) will build on top of this assumption. If we ship the fail-open behaviour, we are shipping the wrong thing.

Solution:

pub enum RegistryMode { FailOpen, Sealed }

impl RlsPolicyRegistry {
    pub fn sealed(self) -> Self { ... }   // any unregistered scan → DataFusionError
    pub fn fail_open(self) -> Self { ... } // current behaviour, must be explicit
}

Default constructor returns Sealed. Membrane wiring picks fail_open() only with an explicit annotation. CI grep ensures no production wiring opts into fail-open without a TODO comment + tracking issue.

High-severity: ship-blockers for production use

🟠 HIGH — RlsContext::new accepts empty strings

RlsContext::new("", "")  // compiles, runs, returns zero rows on every query

A bug upstream that produces an empty tenant_id string silently degrades to "no rows" instead of erroring. From a debugging standpoint that's worse than fail-open — the system appears to work but quietly returns empty result sets.

Fix: Result<RlsContext, RlsError> with RlsError::EmptyTenantId.

🟠 HIGH — audit.rs ring buffer is O(n) per append

guard.remove(0) shifts every element on every overflow. At cap=1024, that's 1024 memmove on every steady-state append. The whole point of an audit log is high-throughput append — this implementation is the opposite.

Fix: swap Vec for VecDeque. One-line change.

🟠 HIGH — postgrest.rs doesn't URL-decode

parse_path("name=eq.John%20Doe")
// returns Filter { value: "John%20Doe" }, not "John Doe"

Real PostgREST clients percent-encode. Every filter from a real client will fail to match.

Fix: percent-decode each value after the op. split. percent-encoding crate or 20 lines of hand-rolled decoder.

🟠 HIGH — gate_f: meta.free_e() duplicates free_e: meta.free_e()

lance_membrane.rs:293. Either copy-paste bug or unfilled placeholder. The field carries no information distinct from free_e. If this was intentional, the field shouldn't exist; if it wasn't, it's wrong.

Fix: Delete the field, OR populate from a real gate signal. State which.

Medium — design smells that will bite later

# File Issue Fix
1 rls.rs:169 RlsRewriter has pub fields; mutable after construction Privatize, add getters
2 audit.rs DefaultHasher is NOT stable across Rust versions Switch to FNV/xxhash, or document this as known-broken across builds
3 audit.rs rls_predicates_added: u8 caps at 255, wraps silently u16
4 lance_membrane.rs:338 intent.role as u8 truncates if ExternalRole grows past 255 variants u8::try_from
5 lance_membrane.rs Ordering::Relaxed on current_scent reads Acquire/Release pair, or document single-thread invariant
6 postgrest.rs Table name not validated — parse_path("../../../etc/passwd") succeeds Reject non-[A-Za-z0-9_] chars
7 postgrest.rs:191 ParseError::new is private but struct field is public — inconsistent Either pub new or From<String>
8 lib.rs:79 drain module unconditionally public, others gated Add feature gate or document
9 medcare-foundry-vision.md:130 "Confidence-based RLS column masking" — not implemented; aspirational Mark as F2+ explicitly, not F1

Low — cleanup

  • rls.rs no test for "policy with both columns None" (degenerate path covered in code, untested)
  • postgrest.rs no integration tests for ilike, in, is, like ops in parse_path (only eq/gte)
  • postgrest.rs JSON escape doesn't handle surrogate pairs above U+FFFF (edge case, document)
  • lib.rs:76 commented-out phoenix module — remove or link tracking issue
  • Cargo.toml datafusion 52 vs CLAUDE.md says 51 — update CLAUDE.md
  • orchestration.rs no Display impl on StepDomain — every other enum has one

What's actually good

  • DataFusion OptimizerRule integration is correct. ApplyOrder::TopDown, Transformed::yes/no semantics, predicate AND-injection on TableScan — all by the book.
  • Parameterized literals via lit(...) mean no SQL injection is possible at the rewriter layer. This is the one thing that actually had to be right and it is.
  • 35 tests pass, including JOIN with two policies, existing-filter preservation, and registry CRUD. Test coverage on the rewriter's actual query-rewrite logic is adequate.
  • Audit log poisoned-mutex recovery is correctly implemented (F-09 pattern from the August fix).
  • PostgREST hand-rolled JSON emitter handles control chars, quotes, backslashes correctly.
  • MembraneRegistry builder pattern is clean — idempotent with_rls/with_audit, second call replaces, registry() accessor.
  • Feature flag layering on lance-graph-callcenter is correct: audit-log, postgrest, membrane-plugins-rls, membrane-plugins-audit compose without conflict.

Recommended PR sequence to fix

  1. PR-278a — Sealed RLS registry (CRITICAL). Add RegistryMode::Sealed default. Update roadmap to match implementation. Add CI check that no production wiring uses fail_open() without explicit annotation.
  2. PR-278b — Hardening (HIGH). Empty-string check on RlsContext::new. Swap VecVecDeque in audit. Percent-decode in postgrest. Resolve gate_f/free_e duplication.
  3. PR-278c — Polish (MEDIUM). All 9 medium items in one batch.
  4. PR-278d — Test gaps (LOW). Missing test cases.

Estimated total: ~600 LOC across 4 small PRs over 1 week.

Verdict

The skeleton is right. The DataFusion integration is right. The feature-flag layout is right. The fail-open default and the duplicated gate_f field are the two things that have to change before SMB F8 or MedCare RBAC builds on top of this. Everything else is medium-impact cleanup that can ride on follow-up PRs.

This is solid foundation work with one wrong default. Fix the default; ship it.

@AdaWorldAPI
Copy link
Copy Markdown
Owner Author

Review #2 — Foundry Parity Future Outlook + Epiphanies

The honest review (above) is the floor. Here's the ceiling — what this PR actually unlocks if we follow it through, and where the genuine architectural wins are hiding.

The Foundry Parity Map (where we actually are)

Foundry capability SMB MedCare Status post-#278
Tenant isolation F8 RBAC Partial — RLS rewriter exists, fail-open default blocks production
Audit trail required HIPAA-shape Stub — sink trait + ring buffer, no Lance-backed durable writer yet
HTTP shape client SDK REST API Stub — parse + echo, no DataFusion dispatch
Plugin wiring F8 F1 onwards Donewith_registry() builder is composable
Per-table policy required required Done — registered policies inject correctly
Cross-table JOIN F4 F3 Done & tested — both tables get predicates
Subquery / CTE F4 F3 Untested — DataFusion's OptimizerRule should handle, no test
Window functions F5 Untested
Schema evolution F2 F2 Out of scope
Confidence-based masking F4 F3 Aspirational only — vision doc references, no impl

The parity map shows three honest tiers:

  • Done (registry + builder + JOIN injection) — ready to consume
  • Stub (audit, postgrest) — interface frozen, no production path
  • Aspirational (confidence masking, schema evolution) — words on a page

A consumer reading the merged PR + roadmap will hit the third tier first because the vision doc dwells on it. Fix the doc to mark aspirational items explicitly so SMB/MedCare don't plan around features that don't exist.

Epiphanies of Potential

E1 — RLS-as-Optimizer-Rule lets us unify ALL access policy

The DataFusion OptimizerRule slot we just used for tenant predicates is the same slot that supports:

  • Column masking — rewrite Projection to wrap sensitive columns in mask(col, redaction_policy). Same machinery, different transform.
  • Row-level encryption — wrap reads in a Decrypt(col, key_for(tenant_id)) UDF when RlsContext carries a key handle.
  • Differential privacy — inject Laplace noise UDFs on aggregate paths when RlsContext carries a privacy budget.
  • Audit hooks — the rewrite step IS where audit emission belongs, not a separate sink. Every rewritten LogicalPlan gets stamped before execution.

This means PR #278's RlsRewriter is actually PolicyRewriter v0.1, and the rls/audit/postgrest split should converge into a single lance_graph_callcenter::policy module that owns the optimizer-rule chain. The current 3-file split is a hangover from the original PR brief, not the right long-term shape.

E2 — MembraneRegistry is the seed of a typed plugin protocol

with_rls(...) + with_audit(...) is the start of dependency injection done right. The next wins:

  • Compile-time wiring graph. A MembraneRegistry could derive a &'static [&str] of plugin names at compile time; missing-feature-but-required-plugin becomes a compile_error! instead of a runtime no-op.
  • Plugin ordering. RLS must run before audit (audit logs the rewritten plan, not the original). Currently this is implicit in the order of method calls. A Plugin::depends_on() declaration + topological sort makes it explicit and impossible to misconfigure.
  • Plugin handshake. A Plugin::seal(&self, &MembraneRegistry) -> Result<()> hook lets each plugin assert its prerequisites at boot — e.g., audit demands rls.is_some().

E3 — The audit log + RLS context + verb taxonomy is the substrate for retroactive policy enforcement

Every AuditEntry has tenant_id + actor_id + statement_hash + timestamp. If we additionally retain the rewritten LogicalPlan, we get a fully replayable execution trace. That unlocks:

  • Policy-change replay. Change a policy → replay the last N days of statements → identify queries that would have been blocked under the new policy. This is what regulators ask for in HIPAA audits.
  • Differential plan diffing. When a developer changes the RLS policy, CI replays a fixed corpus of test queries and shows the diff in physical plans. Catches accidental column unmasking before code review.
  • Query archeology. Combined with the SPO graph (other PRs), a sufficiently rich audit log becomes the ground truth for "what happened in the system on day X?" — the same epistemological role that git plays for code.

E4 — PostgREST shape is a lossy projection of the underlying DataFusion plan

PostgREST queries are a strict subset of what DataFusion can express. That asymmetry is an opportunity:

  • The PostgREST layer can emit a richness budget — every PostgREST request gets compiled into a DataFusion plan, and the plan's complexity (joins, subqueries, aggregations) maps back to a tier (free / paid / enterprise). Pricing as plan-cost.
  • Plan caching keyed by request shape. PostgREST requests that translate to identical plans hit the same cache row. This is more aggressive than HTTP-layer caching because it sees through query reordering.
  • Reverse-PostgREST. Given a DataFusion plan, generate the PostgREST query that produces it (when possible) — useful for SDK auto-generation and for explaining "why was this row visible?" to clinicians.

E5 — StepDomain::Medcare is the seam for vertical-specific orchestration

Adding the variant looks trivial. It's not. It's the seam where:

  • Verb taxonomy varies per domain. Medcare's "prescribe / refer / discharge" verbs are not in the SMB "invoice / quote / dispatch" taxonomy. A StepDomain lets the same orchestrator dispatch to vertical-specific verb tables without if/else trees.
  • Audit retention varies per domain. HIPAA needs 6 years; SMB might be 90 days. StepDomain gates the retention policy.
  • Confidence calibration varies per domain. Medical decisions need higher unbundle thresholds than retail decisions. StepDomain is the calibration key.
  • Failure mode varies per domain. Medcare must escalate to human; SMB can degrade to LLM. StepDomain is the dispatch axis.

The current variant is a placeholder. The roadmap PR-5 ("StepDomain::Medcare + minimal end-to-end RBAC trace") needs to operationalize at least the dispatch-axis use case before SMB starts to depend on the same enum.

E6 — Foundry-as-membrane vs. Foundry-as-platform

Palantir Foundry is a platform — opinionated, monolithic, vendor-locked. What we're building with lance-graph-callcenter is a membrane — thin, composable, policy-injecting, agnostic to the storage layer beneath. That's the structural advantage:

  • Foundry-on-Postgres ← we can do this. RLS rewriter compiles SQL; backend is Postgres-via-DataFusion-table-provider.
  • Foundry-on-Lance ← native path.
  • Foundry-on-Delta-Lake ← already in the deps (deltalake 0.31).
  • Foundry-on-Iceberg ← when DataFusion's Iceberg provider matures.

The ability to swap the storage layer without changing the policy layer is what Palantir cannot offer — they own both. This PR is the first concrete demonstration that the policy layer is independently shippable.

What's missing to actually claim Foundry parity

Capability Status What's missing Estimated PRs
Ontology-as-data Partial (orchestration.rs Pearl mask) Schema-bound triple types, type-inference query planner 3-4
Object-typed access control Not started Object-graph traversal predicates (not just row predicates) 2-3
Pipeline lineage Not started DAG capture from Lance dataset version + RLS rewriter trace 2
Code-as-data hooks Aspirational UDF registration + sandboxing 4-5
Operational ontology Aspirational Real-time triple update propagation + invalidation 5-6
Foundry Workshop UI Out of scope Cypher cockpit (other PR track)
Foundry AIP Aspirational LLM tail integration with audit-aware RAG 3-4

Roughly 20-25 PRs to genuine Foundry parity at the data/policy layer. The scaffolding in #278 reduces that count from "infinite" to "tractable." That's the win.

Risk register update

  • R1 (calibration drift): RLS confidence thresholds inherit from PR feat: grammar/crystal contract + AriGraph episodic unbundling #208's UNBUNDLE_HARDNESS_THRESHOLD = 0.88. That number was tuned for episodic memory unbundling, NOT access control. It will be wrong for Medcare. Track as F1 calibration task.
  • R2 (DataFusion version churn): We just bumped 51→52 in PR feat: bump lance 2→4 + datafusion 51→52 + deltalake 0.30→0.31 #273. Foundry parity demands long-term API stability; if DataFusion 53 ships breaking changes to OptimizerRule, every downstream consumer breaks. Pin or vendor.
  • R3 (audit log retention costs): O(n) ring buffer at 1024 entries in-memory is the current floor. At 1k QPS for 6 years, retention requires a Lance-backed durable writer that doesn't exist yet. The roadmap mentions it; the code doesn't.
  • R4 (PostgREST surface drift): PostgREST itself is a moving target. Pinning to a specific API version + documenting unsupported features (RPC, Realtime) is a chore that nobody on the SMB or MedCare side has signed up for yet.

Next 4 PRs (in priority order)

  1. PR-278a Sealed RLS registry — un-block downstream consumers from inheriting fail-open default.
  2. PR-Lance-audit — Lance-backed durable AuditSink impl. Replaces the in-memory ring for production.
  3. PR-RLS-column-mask — extend RlsRewriter to handle column masking (E1 epiphany). One file, ~200 LOC.
  4. PR-Postgrest-dispatch — wire PostgRestHandler to a real DataFusion query executor. Brings the stub to MVP.

These four PRs collectively turn #278 from "interface frozen, not production-ready" to "F1-deployable on a single tenant." Estimated 3 weeks of work.

The bottom line

PR #278 is the right shape with the wrong default and a few duplicate fields. Fix the default, ship the four follow-ups, and we have something Palantir doesn't: a policy layer that's swappable across storage backends and inspectable down to the optimizer-rule level.

The substrate is more important than the mistakes. The mistakes are easy.

AdaWorldAPI added a commit that referenced this pull request Apr 30, 2026
Sprint C agent (PR #311) flagged five staleness items in the vision
doc that were out of its §7-only scope. Closing the debt now:

  Header           DRAFT - pending review (2026-04-28)
                   -> Status: F1 parity shipped 2026-04-30. F1
                      latency benchmark not yet started. F2 is a
                      posture, not a delivery.

  §2 anchor        as of 2026-04-28
                   -> as of 2026-04-30 (post-F1 parity ship)

  §2 latency cell  Designed to match; F1 numbers (forward tense)
                   -> Designed to match; benchmark pending

  §2 caveat        F1 publishes the first numbers (forward tense)
                   -> F1 parity has shipped (correctness); the
                      separately-scoped F1 latency benchmark has
                      not been started. Distinguishes the two
                      sub-deliverables explicitly.

  §3 F1            We stand up a Foundry instance... (forward)
                   -> Shipped 2026-04-30. Cross-link to §7's
                      as-shipped architecture.

  §3 F2            gated upstream by lance-graph PR-1 / PR-2
                   -> lance-graph PR #278 + #280 + #284 (RLS) and
                      PR #278 + #302 (audit). Status today:
                      lance-graph in production; medcare-rs
                      adopter not yet open. Posture, not
                      delivery.

  §3 F3            gated upstream by lance-graph PR-4
                   -> lance-graph PR #278 + #280 (parser +
                      hardening). Status today: parser stub on
                      lance-graph main; medcare-rs adopter is
                      future round-2 work.

  §4               benchmark harness lands as part of F1
                   F1 numbers are published (both forward tense)
                   -> F1 parity (correctness) shipped; F1 latency
                      benchmarking has not been started. The two
                      are separately-scoped F1 sub-deliverables.

What this PR does NOT touch:
  - F4, F5, §5 (risks), §6 (NOT promising), §7 (next deliverable
    just landed in PR #311 - clean already).
  - The vision doc's tone rule. Every change cites a concrete PR
    number or file path; no marketing language introduced.
  - Performance numbers. None claimed; the §4 'do not quote
    unbenchmarked numbers' rule is preserved verbatim.

Diff: +41 / -26 across 1 file. Markdown renders cleanly.

Cross-link: PR #311 (the §7 fix that motivated this cleanup).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants