demo(ca-governance): golden-query-via-workflow vs. normal agentic CA #9
A BigQuery Conversational Analytics agent with a governance dial, built on the RFC google#93 model-authored-workflow engine. Proves that restricting CA to governed/golden (verified) queries is enforced STRUCTURALLY by the engine — not with a prompt: a WorkflowSpec may only compose capabilities in the CapabilityRegistry, and the validator rejects any plan referencing one that is not.
- STRICT registry (match_verified_query, run_frozen_query, summarize, refuse) has no nl2sql -> an adversarial "just write SQL" plan is rejected at validation before any query runs; the same plan validates under FLEXIBLE.
- Golden hit -> a frozen, auditable workflow runs the approved SQL on REAL BigQuery (thelook_ecommerce); miss -> STRICT refuses, OPEN falls through to a normal agentic agent (free-form ADK Agent + query_thelook BQ tool).
- FLEXIBLE middle ground: gated nl2sql fallback that promotes the result into the governed pool (assisted authoring).
Real Gemini + real BigQuery with a deterministic mock-warehouse fallback (engine-labeled). 9 CI-safe tests (no LLM, no BQ) pin the governance proofs. Headless driver (governance_demo.py) as a live-demo backstop. Self-contained; reuses the committed authoring engine. No changes to src/ or sibling samples.
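The allow-list idea above can be sketched in a few lines. This is a minimal illustration, not the engine's code: `Step`, `validate`, and the two frozensets are hypothetical stand-ins for `WorkflowSpec`, the validator, and the `CapabilityRegistry`, and the FLEXIBLE capability set shown here is only an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node of a plan (hypothetical stand-in for a WorkflowSpec step)."""
    id: str
    capability: str


def validate(plan, allowed):
    """Return a list of violations; an empty list means the plan is admissible."""
    return [
        f"step {s.id!r} uses unregistered capability {s.capability!r}"
        for s in plan
        if s.capability not in allowed
    ]


# STRICT capability names are taken from the description above.
STRICT = frozenset({"match_verified_query", "run_frozen_query", "summarize", "refuse"})
# FLEXIBLE contents are an assumption for illustration only.
FLEXIBLE = STRICT | {"nl2sql", "dry_run", "run_adhoc"}

# An adversarial "just write SQL" plan.
adversarial = [Step("gen", "nl2sql"), Step("run", "run_adhoc")]

strict_errors = validate(adversarial, STRICT)      # two violations
flexible_errors = validate(adversarial, FLEXIBLE)  # no violations
```

The point of the sketch is that the check runs before any step executes, so a rejected plan never reaches the warehouse regardless of how the model was prompted.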
def _mode_from(text: str) -> str:
    low = text.lower()
    if any(k in low for k in ("open mode", "agentic", "flexible")):
This collapses flexible into open, and the runtime path below always uses golden_registry() + author_golden_plan(). A live prompt with (flexible) therefore never exercises the author_flexible_plan() path described in the README/NARRATIVE; on a miss it falls straight to the normal agentic fallback. To make the governance dial real, keep flexible as its own mode, select flexible_registry() + author_flexible_plan(), and reserve open for the free-form agentic fallback.
Fixed in b256f9c. _mode_from now returns three distinct modes; FLEXIBLE selects flexible_registry() + author_flexible_plan(), OPEN is reserved for the agentic fallback, STRICT refuses on a miss. Added test_mode_routing_is_three_distinct_modes. Live-verified: a (flexible) miss authors the gated nl2sql plan, runs it, and promotes.
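A minimal sketch of three-way routing, assuming the same keyword matching as the snippet under review. The function name `mode_from` and the choice to default to STRICT when no keyword matches are assumptions made for illustration; the actual `_mode_from` may differ.

```python
def mode_from(text: str) -> str:
    """Map a user prompt to one of three governance modes."""
    low = text.lower()
    # FLEXIBLE is checked on its own so it is never folded into OPEN.
    if "flexible" in low:
        return "flexible"
    if any(k in low for k in ("open mode", "agentic")):
        return "open"
    # Assumed default: fall back to the most restrictive mode.
    return "strict"


# Each mode would then select its own registry and plan author, e.g.
# flexible -> flexible_registry() + author_flexible_plan().
routes = {q: mode_from(q) for q in (
    "average sale price (flexible)",
    "churn cohorts (open mode)",
    "revenue by country (strict)",
)}
```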
    score: float = 0.0

class Sql(BaseModel):
Sql only accepts sql, so the real nl2sql output cannot carry the original question forward. Even if the model returns a question field, Pydantic will drop it, _dry_run() will set question to "", and freeze_verified will promote an empty-question record. The tests hide this because the stub returns a raw dict with question. Add question: str = "" to Sql (and make the instruction return it), or otherwise bind the task question into the dry-run/run/freeze path, then assert the promoted record keeps the original question.
Fixed in b256f9c. Added question: str = "" to Sql and updated the nl2sql instruction to copy the question verbatim, so it survives into dry_run/run/freeze. test_flexible_falls_back_validates_and_promotes_with_question now asserts the promoted record keeps the original question. Live run confirmed the promoted query_id is derived from the real question.
route.block = [
    StepRef(kind="step", id="gen", capability="nl2sql",
            input=Binding(source="step", step="match")),
    StepRef(kind="step", id="check", capability="dry_run",
Right now dry_run is observational, not a gate: run_adhoc and freeze_verified run regardless of check.valid or check.error. In real BigQuery an invalid generated query would return an error from run_query, but this plan still reaches freeze_verified and promotes the SQL. Since the demo story says FLEXIBLE validates before running/promoting, add a branch on check.valid (or make run_adhoc/freeze_verified refuse invalid inputs) and add a test where nl2sql returns invalid SQL and no query is run or frozen.
Fixed in b256f9c. The dry-run is now a real gate: author_flexible_plan() branches on check.valid — valid -> run_adhoc + freeze + summarize; invalid -> new reject_invalid leaf (nothing run, nothing promoted). Added test_flexible_gate_rejects_invalid_sql_no_run_no_freeze (asserts no adhoc/freeze state and the pool is unchanged).
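The difference between an observational dry-run and a gating one can be sketched as follows. This is an illustration under stated assumptions: `run_flexible_miss` and the three injected callables are hypothetical stand-ins for the demo's capabilities, and the real plan expresses the branch inside the workflow spec rather than in a Python function.

```python
def run_flexible_miss(sql, dry_run, run_adhoc, promote):
    """Gate a generated query on its dry-run result."""
    check = dry_run(sql)
    if not check.get("valid"):
        # reject_invalid leaf: nothing is run, nothing is promoted.
        return {"status": "rejected", "error": check.get("error", "")}
    # Only the valid branch reaches execution and promotion.
    rows = run_adhoc(sql)
    promote(sql)
    return {"status": "answered", "rows": rows}


# Stub capabilities that record what was actually executed.
calls = []


def fake_dry_run(sql):
    ok = sql.lower().startswith("select")
    return {"valid": ok, "error": "" if ok else "syntax error"}


def fake_run(sql):
    calls.append(("run", sql))
    return [{"n": 1}]


def fake_promote(sql):
    calls.append(("promote", sql))


bad = run_flexible_miss("SELEC oops", fake_dry_run, fake_run, fake_promote)
calls_after_bad = list(calls)
good = run_flexible_miss("SELECT 1 AS n", fake_dry_run, fake_run, fake_promote)
```

Recording the calls is what lets a test assert the "no run, no freeze" property directly instead of inferring it from the returned status.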
    return {"sql": sql, "valid": False, "error": str(e)[:500], "engine": "bigquery"}

def run_query(value) -> dict:
run_query is documented as read-only, and the agentic tool instruction asks for read-only SELECTs, but the tool does not enforce that before calling BigQuery. In OPEN mode the model can pass arbitrary SQL to query_thelook, including DDL/DML or multi-statement SQL billed to GOOGLE_CLOUD_PROJECT. Please add a local guard shared by dry_run/run_query/query_thelook that rejects anything except a single read-only SELECT before client.query() (and before the mock path too, so tests cover it).
Fixed in b256f9c. Added warehouse.read_only_violation() (rejects non single-SELECT/WITH: DDL/DML, scripting keywords, multi-statement) and enforced it at the top of dry_run and run_query — before BigQuery AND before the mock — so query_thelook is covered too. Added test_read_only_guard_blocks_non_select.
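A guard of this kind might look roughly like the sketch below. It is not the code in `warehouse.py`: the keyword list, the messages, and the return convention (a reason string, or `None` when the statement is acceptable) are assumptions. It is a conservative text check rather than a SQL parser, so it can produce false positives (for example a string literal containing `;` or a column literally named `update`) and it does not try to handle SQL comments.

```python
import re

# Assumed deny-list of DDL/DML/scripting keywords; whole-word matches only.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|create|drop|alter|truncate|grant|revoke|"
    r"declare|begin|call|execute|export|load)\b",
    re.IGNORECASE,
)


def read_only_violation(sql: str):
    """Return a reason string if sql is not a single read-only query, else None."""
    text = sql.strip().rstrip(";").strip()
    if not text:
        return "empty statement"
    if ";" in text:
        return "multiple statements"
    if not re.match(r"(select|with)\b", text, re.IGNORECASE):
        return "not a SELECT/WITH query"
    hit = _FORBIDDEN.search(text)
    if hit:
        return f"forbidden keyword: {hit.group(1).lower()}"
    return None


def guarded(sql: str) -> dict:
    """Shape of the early return a caller could use before touching any backend."""
    reason = read_only_violation(sql)
    if reason:
        return {"sql": sql, "valid": False, "error": reason}
    return {"sql": sql, "valid": True, "error": ""}
```

Placing the check ahead of both the real client and the mock means one test suite covers the behavior in credential-less mode as well.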
python contributing/samples/workflows/authored_workflow_ca_governance_demo/governance_demo.py

# Deterministic (no creds): forces the mock warehouse; LLM steps still
# need a model, so pass --no-llm to script only the non-LLM beats.
This mentions --no-llm, but the argparse setup below only supports --beats. Either add the flag or remove this sentence and keep the documented deterministic command to CA_GOV_USE_BIGQUERY=0 python .../governance_demo.py --beats diff adversarial, which currently matches the implementation.
Fixed in b256f9c. Removed the --no-llm sentence; the deterministic command is now CA_GOV_USE_BIGQUERY=0 ... --beats diff adversarial (matches the implementation). Also added a flexible beat to the driver.
Top-level demo polish suggestion: this is a strong leadership demo, but I would make the README follow the ADK sample shape a bit more closely before sharing broadly: add a small Mermaid graph for the three modes (STRICT -> golden workflow/refusal, FLEXIBLE -> golden first + validated promotion, OPEN -> agentic fallback), and add a short Related Guides/Related RFCs section pointing to the workflow/RFC google#92/google#93 material. That will make the control-plane story scannable for people who land directly on the sample, not only through the PR description.
…un, read-only guard, question threading
- flexible is now a distinct live mode: FLEXIBLE selects flexible_registry() + author_flexible_plan(); OPEN is reserved for the free-form agentic fallback; STRICT refuses on a miss (comment 1).
- Sql carries `question` (output schema + instruction) so the originating question survives nl2sql and the promoted verified query keeps it (comment 2).
- The FLEXIBLE dry-run is a real GATE: branch on check.valid -> run+freeze on valid, else reject_invalid (nothing run, nothing promoted) (comment 3).
- warehouse.read_only_violation() rejects non single-SELECT (DDL/DML, scripting, multi-statement) before BigQuery AND the mock; enforced by dry_run/run_query/query_thelook (comment 4).
- governance_demo: drop the nonexistent --no-llm doc; add a `flexible` beat (comment 5).
- README: Mermaid 3-mode diagram + Related (engine/RFC google#92/google#93) section; mode table includes FLEXIBLE (comment 6).
Tests: 12 pass (added flexible-gate-rejects-invalid, read-only guard, mode routing; flexible test now asserts the promoted question is preserved).
Live re-validated (gemini-3.5-flash global + real BigQuery): FLEXIBLE generated SQL, passed the real dry-run gate, ran, and promoted with the question intact.
Thanks for the review — all six points addressed in README polish (this comment): added a Mermaid flowchart of the three governance modes (STRICT → refuse, FLEXIBLE → golden-first + validated promotion, OPEN → agentic fallback) and a Related section pointing to the engine ( Code review comments (replied inline):
Tests: 12 pass (added: flexible-gate-rejects-invalid, read-only guard, mode routing; flexible test asserts the promoted question is preserved).
    ),
}

DEFAULT_ORDER = ["diff", "adversarial", "hit", "refuse", "flexible", "agentic"]
Suggestion for LT demo repeatability: the default run includes the flexible beat, and that beat writes the promoted query into the file-backed ca_gov_store by default. After one rehearsal, rerunning the same default script will make beat 5 a governed hit instead of showing nl2sql -> dry_run -> promote, because the query is already in the pool. I would make the headless driver use a fresh temp CA_GOV_STORE by default (and print it for inspection), or add a --reset-store / --store flag and document the clean-start command. That keeps rehearsals deterministic while leaving adk web persistence intact.
Fixed in 4e574f0. The headless driver now defaults to a fresh temp CA_GOV_STORE per run (printed as store: …), so the FLEXIBLE beat always shows nl2sql → dry_run → promote instead of becoming a governed hit on a re-run. Added --store (persist / share with adk web) and --reset-store (clear promoted, non-seed queries). adk web persistence is unchanged.
**refuses** rather than guessing. `0 queries run`. *(A hard boundary that
fails safe.)*

5. **`…churn cohorts… (open mode)`** — the *same* question, dial turned to OPEN,
The talking track skips the FLEXIBLE beat in its numbered walkthrough, while the README and driver now run flexible before agentic. For the LT demo, I would align this narrative with the actual default order: add the average-sale-price flexible promotion beat here, then make the open-mode churn question beat 6. Otherwise the presenter notes miss the core middle-ground story this PR added.
Fixed in 4e574f0. The numbered walkthrough now matches the README/driver order: beat 5 is the FLEXIBLE average-sale-price promotion (semantic-constrained nl2sql → real dry-run gate → run → promote), and the OPEN-mode churn question is beat 6.
if client is None:
    return {
        "sql": sql,
        "valid": sql.strip().lower().startswith("select"),
Mock and real dry-run can diverge here: read_only_violation() already accepts WITH queries, and BigQuery dry-run would validate a legal CTE, but credential-less mode returns valid: false unless the qualified SQL starts with select. Since this demo is likely rehearsed with CA_GOV_USE_BIGQUERY=0, a valid nl2sql CTE can be rejected locally even though it would pass live. I would either treat the mock dry-run as valid after the read-only guard passes, or check select|with here, plus add a small test for warehouse.dry_run({'sql': 'WITH x AS (SELECT 1) SELECT * FROM x'})['valid'] is True.
Fixed in 4e574f0. Mock and real dry-run no longer diverge: since read_only_violation() already confirms a single SELECT/WITH, the credential-less dry_run now returns valid: True once the guard passes (no second leading-select check). Added test_mock_dry_run_accepts_cte asserting dry_run({'sql': 'WITH x AS (SELECT 1) SELECT * FROM x'})['valid'] is True.
…, mock/real dry-run parity, narrative alignment
- governance_demo: default to a FRESH temp CA_GOV_STORE per run so the FLEXIBLE promotion beat is repeatable (a persisted promotion would turn a re-run into a governed hit); add --store / --reset-store and print the store path.
- warehouse.dry_run: mock now returns valid=True once the read-only guard passes, matching what BigQuery accepts — a legal `WITH ... SELECT` CTE is no longer rejected only in credential-less mode. Added test_mock_dry_run_accepts_cte.
- NARRATIVE: numbered walkthrough now includes the FLEXIBLE promotion beat (5) and moves the OPEN-mode churn beat to 6, matching the README/driver order.
Tests: 13 pass.
```bash
python contributing/samples/workflows/authored_workflow_ca_governance_demo/governance_demo.py
# or a subset:
python .../governance_demo.py --beats diff adversarial hit refuse flexible agentic
```
Small LT-demo doc polish: the driver now has the right behavior (fresh temp CA_GOV_STORE by default, plus --store / --reset-store), but this README section still only shows the old invocation. I would add one sentence here saying the headless driver uses a fresh store per run for repeatable rehearsals, and show the persistent/replay command, e.g. python .../governance_demo.py --store contributing/samples/workflows/authored_workflow_ca_governance_demo/ca_gov_store --reset-store. That makes it clear when beat 5 should re-promote versus when a presenter intentionally wants to share the promoted pool with adk web.
Fixed in 904aff9. The Headless driver section now states the driver uses a fresh temp CA_GOV_STORE per run (printed as store: …) so beat 5 always re-promotes, and shows the persistent command for sharing the promoted pool with adk web:
python .../governance_demo.py --store contributing/samples/workflows/authored_workflow_ca_governance_demo/ca_gov_store --reset-store
…t + --store/--reset-store
The headless-driver section showed only the old invocation. Document that the driver uses a fresh temp CA_GOV_STORE per run (repeatable beat 5), and show the persistent command (--store + --reset-store) for sharing the promoted pool with adk web.
FLEXIBLE no longer auto-writes to the governed pool. Removed the freeze_verified capability entirely, so a model-authored plan CANNOT promote — the strongest form of the governance point. The flexible miss path now generates -> validates (dry-run gate) -> runs -> answers, then PARKS the validated query as a pending candidate. A human replies `approve` (-> added to the golden pool via golden.approve_pending) or `reject` (-> discarded). Promotion is the only path into the pool and it requires explicit human sign-off.
- golden.py: pending-candidate store (save/get/clear/approve_pending), single-slot.
- agent.py: approve/reject handling at the top of plan_and_run; flexible miss parks a candidate and asks for approval; flexible_registry drops freeze_verified; _strip_mode cleans the stored question; banner blurb updated.
- governance_demo.py: `flexible` beat is now the multi-turn HITL sequence (ask -> approve -> re-ask = governed hit), merging the old beat-5 + closer.
- README/NARRATIVE: HITL flow, mermaid (approve/reject branch), merged beat 5, "no promote capability" framing.
Tests: 15 pass (added HITL approve/reject + _strip_mode; flexible test now asserts no auto-promote and that freeze_verified is absent from the registry).
Live-verified end-to-end: flexible -> pending -> approve -> governed hit on gemini-3.5-flash global + real BigQuery.
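A single-slot pending-candidate store could be sketched as below. The `verified/` directory and `pending_candidate.json` file names come from this PR's discussion; the `PendingStore` class, its method signatures, the JSON layout, and passing `query_id` explicitly are assumptions made for the sketch, not the contents of `golden.py`.

```python
import json
import os
import tempfile


class PendingStore:
    """File-backed store with one pending slot and a verified pool."""

    def __init__(self, root):
        self.verified_dir = os.path.join(root, "verified")
        self.pending_path = os.path.join(root, "pending_candidate.json")
        os.makedirs(self.verified_dir, exist_ok=True)

    def save_pending(self, question, sql):
        # Single slot: a new candidate overwrites any earlier one.
        with open(self.pending_path, "w", encoding="utf-8") as f:
            json.dump({"question": question, "sql": sql}, f)

    def get_pending(self):
        if not os.path.exists(self.pending_path):
            return None
        with open(self.pending_path, encoding="utf-8") as f:
            return json.load(f)

    def clear_pending(self):
        if os.path.exists(self.pending_path):
            os.remove(self.pending_path)

    def approve_pending(self, query_id):
        """Human sign-off: move the candidate into the verified pool."""
        candidate = self.get_pending()
        if candidate is None:
            return None
        path = os.path.join(self.verified_dir, f"{query_id}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(candidate, f)
        self.clear_pending()
        return path
```

Because promotion lives in this store's `approve_pending` and not in any registered capability, no plan the model authors can reach it; only the approve handler can.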
store = args.store or os.environ.get("CA_GOV_STORE") or tempfile.mkdtemp(
    prefix="ca_gov_store_"
)
if args.reset_store:
With the new HITL flow, --reset-store should also clear pending_candidate.json, not just verified/. Otherwise a presenter can run with a durable --store, generate a candidate but not approve/reject it, then later run --reset-store expecting a clean rehearsal; the old pending candidate survives and a stray approve will promote stale SQL into the freshly reset pool. I would either set CA_GOV_STORE = store before this block and call golden.clear_pending() as part of reset, or remove os.path.join(store, "pending_candidate.json") directly here. The help/comments should also say the reset clears promoted + pending candidates in the HITL version.
Fixed in the latest push. CA_GOV_STORE is now set before the reset block, and --reset-store clears both verified/ and the pending candidate via golden.clear_pending() — so a leftover candidate can't be approved into a freshly reset pool. The --reset-store help and README now say it clears promoted and pending. Verified: a pending candidate in a durable --store is gone after --reset-store.
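The reset described here might be sketched as follows. `reset_store` and the `seed_ids` parameter are hypothetical: the thread does not show how the driver tells seed queries apart from promoted ones, so this sketch simply assumes the caller passes the seed ids. The real driver calls `golden.clear_pending()` for the second half.

```python
import os
import tempfile


def reset_store(root, seed_ids=frozenset()):
    """Remove promoted (non-seed) queries and any pending candidate."""
    removed = []
    verified = os.path.join(root, "verified")
    if os.path.isdir(verified):
        for name in sorted(os.listdir(verified)):
            # Assumption: a file's stem is its query id.
            if os.path.splitext(name)[0] not in seed_ids:
                os.remove(os.path.join(verified, name))
                removed.append(name)
    pending = os.path.join(root, "pending_candidate.json")
    if os.path.exists(pending):
        # Without this, a stale candidate could be approved after the reset.
        os.remove(pending)
        removed.append("pending_candidate.json")
    return removed


def make_demo_store():
    """Build a throwaway store with one seed, one promotion, one pending file."""
    root = tempfile.mkdtemp(prefix="ca_gov_store_")
    os.makedirs(os.path.join(root, "verified"))
    for rel in ("verified/seed.json", "verified/promoted.json", "pending_candidate.json"):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
            f.write("{}")
    return root
```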
…date
With human-in-the-loop promotion, a durable --store could retain an un-approved pending_candidate.json across a --reset-store, so a later `approve` would promote stale SQL into the freshly reset pool. Set CA_GOV_STORE before the reset block and clear BOTH verified/ and the pending candidate (golden.clear_pending()). Help text and README updated to say reset clears promoted + pending.
…eterministic fallback
The demo now genuinely exercises RFC google#93's headline: the model AUTHORS the typed WorkflowSpec at runtime via LlmAgent(output_schema=WorkflowSpec), which is then validated against the registry and governed. Adds _author_live() (with retry) + brace-free planner instructions (ADK LlmAgent treats {...} as state-template vars, so the instruction must avoid literal braces) and a per-mode catalogue built from the registry. plan_and_run authors golden/flexible/adversarial plans live; a canned author_*_plan() is the fallback if live authoring is off or the model returns an off-shape plan (so the demo never breaks). The banner shows "Model-authored (live)" vs the fallback, honestly.
Verified live (gemini-3.5-flash global + real BigQuery): golden hit, adversarial (model-authored nl2sql plan -> rejected by STRICT), and the post-approval strict re-ask all author live; the flexible nested-gate plan falls back gracefully.
Tests: 18 pass (added _spec_ids, live-authoring-disabled fallback, planner instruction catalogue). CA_GOV_LIVE_PLANNER=1 default; set 0 for deterministic.
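The retry-then-fallback behaviour can be sketched independently of ADK. `author_with_fallback`, its parameters, and the label strings are illustrative assumptions; the real `_author_live()` calls an `LlmAgent`, which is replaced here by an injected callable so the sketch runs without a model.

```python
def author_with_fallback(author_live, canned_plan, accept, retries=2):
    """Try live authoring a few times; otherwise return the canned plan.

    Returns (plan, label) so the banner can say which path was taken.
    """
    for _ in range(retries):
        try:
            plan = author_live()
        except Exception:
            continue  # a failed model call counts as one attempt
        if accept(plan):
            return plan, "Model-authored (live)"
    return canned_plan, "fallback (canned)"


canned = {"steps": ["match", "run", "summarize"]}

# First attempt is off-shape, second matches: the live label is earned.
attempts = iter([{"steps": ["nl2sql"]}, {"steps": ["match", "run", "summarize"]}])
plan, label = author_with_fallback(lambda: next(attempts), canned, lambda p: p == canned)

# A model that never produces an acceptable plan falls back.
fb_plan, fb_label = author_with_fallback(lambda: {"steps": []}, canned, lambda p: p == canned)
```

Returning the label alongside the plan keeps the "live vs. fallback" banner tied to what actually happened instead of to a configuration flag.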
README: document CA_GOV_LIVE_PLANNER, add the 🧠 Model-authored callout to "what to point at", and an honest-scope note (authoring is real but instruction-guided; free-authoring evidence in sibling samples; governance rests on validator+registry regardless of authoring style). NARRATIVE: state the plan is model-authored live and tag beats 2/3 as 🧠 model-authored.
Update: the demo now genuinely exercises RFC google#93's model-authoring, not just the governance machinery. The plan is authored live by an
Verified live (gemini-3.5-flash, global Vertex + real BigQuery): the golden hit, the adversarial beat (the model's own nl2sql plan is rejected by STRICT), and the post-approval strict re-ask all author live; the flexible nested-gate plan falls back gracefully when the model emits an off-shape plan.
Implementation note for reviewers: ADK
Fresh feedback on the latest live-authoring update: I think model-authored workflow is a strong fit for this demo. The demo is no longer only showing “workflow governance”; it now shows the more important RFC google#93 story: the model can author a typed The value is clearest in three beats:
That is the right enterprise story: “dynamic model-authored orchestration, bounded by structural policy,” not “please trust the model to follow a prompt.”
The remaining thing I would tighten before using this as the LT demo is the live-plan acceptance gate. Right now With that tightened, I think the demo shows the value of model-authored workflow well while staying honest about the current implementation: live authoring is real, but intentionally instruction-guided for reliability.
Preferred LT demo narrative: I would make the story a governance-control story first, and a model-authoring story second. The key line is:
Suggested flow:
Close with: “Model-authored workflow is valuable here because it separates who proposes the plan from who grants authority. The model authors; the registry limits; the validator enforces; the frozen record audits; the human approves promotion. That is the enterprise governance shape.”
One implementation caveat I would mention only if asked: the live authoring in this demo is intentionally instruction-guided for reliability. That is fine for LT, as long as the UI/text honestly distinguishes live model-authored plans from deterministic fallback and the acceptance gate verifies the expected shape.
Address PR google#9 review (discussion_r3476149931): _author_live previously accepted any registry-valid spec that merely contained the required node ids, so a model could return an off-shape-but-valid plan (different output binding, route values, branch condition, or capability/input wiring) and still be labeled "Model-authored (live)" and executed.
Now the live label is earned only when the authored plan matches the exact expected shape per mode: _is_golden_shape / _is_flexible_shape / _is_adversarial_shape compare a canonical structural signature (node order, ids, capabilities, input/branch bindings, route values, spec output) against the canned plan for that mode. Any registry-valid but off-shape plan falls back to the deterministic canned plan and is honestly labeled a fallback.
Tests: 21 pass (added shape-predicate acceptance/cross-mode, off-shape-but-registry-valid rejection, and live off-shape -> fallback). README honest-scope updated to describe the exact-shape gate.
Live re-validated (gemini-3.5-flash, global Vertex + real BigQuery): golden hit and strict refusal author live, adversarial plan rejected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
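An exact-shape gate of this kind can be sketched by reducing a plan to a comparable signature. The dict layout used for a spec (`nodes`, `input`, `when`, `output`) is a hypothetical simplification of `WorkflowSpec`, and `signature` / `is_expected_shape` are illustrative names, not the `_is_*_shape` predicates themselves.

```python
def signature(spec):
    """Canonical structural signature: node order, ids, wiring, and output."""
    nodes = tuple(
        (n["id"], n["capability"], n.get("input"), n.get("when"))
        for n in spec["nodes"]
    )
    return (nodes, spec.get("output"))


def is_expected_shape(authored, canned):
    """True only if the authored plan is structurally identical to the canned one."""
    return signature(authored) == signature(canned)


CANNED_GOLDEN = {
    "nodes": [
        {"id": "match", "capability": "match_verified_query", "input": "task"},
        {"id": "run", "capability": "run_frozen_query", "input": "match"},
        {"id": "answer", "capability": "summarize", "input": "run"},
    ],
    "output": "answer",
}

# Same ids and capabilities (so an id-only check would pass),
# but the spec output is bound to a different node.
off_shape = dict(CANNED_GOLDEN, output="run")

# Same nodes in a different order.
reordered = dict(CANNED_GOLDEN, nodes=list(reversed(CANNED_GOLDEN["nodes"])))
```

Comparing against the canned plan for the mode keeps a single source of truth for the expected shape: changing the canned plan changes the gate with it.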
Tightened the live-plan acceptance gate as you asked (commit
Each compares a canonical structural signature — node order, ids, capability-per-id, input bindings, branch condition (
Tests: 18 → 21. Added: shape predicates accept their own canned plan and reject cross-mode plans; an off-shape-but-registry-valid plan (right ids, different output binding) passes the validator but fails the shape gate; and with the live planner ON, an off-shape authored plan makes README honest-scope updated to describe the exact-shape gate.
Live re-validated (gemini-3.5-flash, global Vertex + real BigQuery): the golden hit and the strict refusal both author live (the model reliably emits the exact shape, so the tightening doesn't cost the on-camera live label), and the adversarial plan is still rejected under STRICT. With this in, live authoring is real, honestly distinguished from fallback, and the acceptance gate verifies the expected shape — matching the narrative.
Also took your preferred LT narrative — the governance-control-first framing with "the model is allowed to author the workflow, but not to choose its own powers" — and worked it into the recording track. Thanks, this made the demo materially more honest.
Punchline: A human-compiled workflow hardcodes one policy path; a model-authored workflow lets the model adapt the plan to the question, while the registry prevents it from granting itself new authority.
Top 3 demo points:
…line
Fold PR google#9's narrative feedback into NARRATIVE.md + README.md, governance-first:
- Punchline: a human-compiled workflow hardcodes one policy path; a model-authored workflow adapts the plan to the question while the registry prevents it from self-granting authority ("authors the plan, not its powers").
- The three LT points (adaptive-without-losing-control / structural-not-prompt / safe discovery->governance) mapped to beats 2, 3, 5.
- Keep honest scope: in this demo the plan shape is instruction-guided and exact-shape-gated, so per-question adaptation is dial/branch/SQL-content, not free structural decomposition (that evidence is in the sibling samples); the no-self-granted-authority guarantee holds regardless of authoring style.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Love the punchline — adopted it (commit
Folded the human-compiled-vs-model-authored framing and the three points into both
One small honesty edit I made on point 1 ("adaptive"): in this demo the plan shape is instruction-guided and now exact-shape-gated, so what the model adapts per question is the dial/mode, the match-vs-
Sequential operator walkthrough (send / point-at / say) for the eight beats, wired to the actual prompts and on-screen markers. Carries the governance-first framing, the human-compiled-vs-model-authored punchline, the three LT points, and the honest-scope note (exact-shape gate, free-authoring in sibling samples). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
A BigQuery Conversational Analytics demo with a governance dial, built on the model-authored-workflow engine from this branch. It shows how to restrict CA to governed (golden / verified) queries structurally — not with a prompt — while keeping a normal agentic answer one dial-turn away.
Self-contained sample under contributing/samples/workflows/authored_workflow_ca_governance_demo/. No changes to src/ or sibling samples — it reuses the committed authoring.py engine.
The idea
The engine's CapabilityRegistry is the allow-list: a WorkflowSpec may only compose registered capabilities, and WorkflowSpecValidator rejects any plan referencing one that is not. So governance is a registry composition, enforced at validation:
One agent, two surfaces: a data question is matched against the verified-query pool; a hit is answered by a frozen, auditable workflow running approved SQL on real BigQuery (thelook_ecommerce); a miss refuses under STRICT or falls through to a normal agentic agent (free-form Agent + query_thelook tool) under OPEN.
Beats (see README)
1. show modes registry diff — governance is a one-capability difference, not a prompt.
2. adversarial: just write SQL — an nl2sql plan is rejected at validation under STRICT (the same plan validates under FLEXIBLE). You can't prompt your way out.
3. revenue by country (strict) — governed hit: frozen approved SQL on real BigQuery, 0 model-drafted SQL.
4. churn cohorts (strict) — refused, outside the governed set, 0 queries.
5. churn cohorts (open mode) — same question → normal agentic answer (not a frozen workflow — the trade-off).
FLEXIBLE adds the constrained-yet-flexible middle ground: gated nl2sql fallback that promotes the result into the governed pool (assisted authoring).
Verification
bigquery).
test_ca_governance_demo.py — 9 tests, no LLM / no BigQuery (governance proofs are deterministic). Run: pytest contributing/samples/workflows/authored_workflow_ca_governance_demo/test_ca_governance_demo.py -q.
governance_demo.py runs the same root_agent through the beats — a live-demo backstop.
Notes
spike/dynamic-supervisor-concurrency because it imports the committed authoring engine.