demo(ca-governance): golden-query-via-workflow vs. normal agentic CA by haiyuan-eng-google · Pull Request #9 · caohy1988/adk-python

haiyuan-eng-google · 2026-06-24T21:19:48Z

What

A BigQuery Conversational Analytics demo with a governance dial, built on the model-authored-workflow engine from this branch. It shows how to restrict CA to governed (golden / verified) queries structurally — not with a prompt — while keeping a normal agentic answer one dial-turn away.

Self-contained sample under contributing/samples/workflows/authored_workflow_ca_governance_demo/. No changes to src/ or sibling samples — it reuses the committed authoring.py engine.

The idea

The engine's CapabilityRegistry is the allow-list: a WorkflowSpec may only compose registered capabilities, and WorkflowSpecValidator rejects any plan referencing one that is not. So governance is a registry composition, enforced at validation:

STRICT (golden) : match_verified_query · run_frozen_query · summarize · refuse
FLEXIBLE        : … + nl2sql · dry_run · run_adhoc · freeze_verified

One agent, two surfaces: a data question is matched against the verified-query pool; a hit is answered by a frozen, auditable workflow running approved SQL on real BigQuery (thelook_ecommerce); a miss refuses under STRICT or falls through to a normal agentic agent (free-form Agent + query_thelook tool) under OPEN.
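To make the allow-list idea concrete, here is a minimal sketch of registry-based validation. The capability names are the ones listed above; `Step`, `Plan`, and `validate` are hypothetical stand-ins and do not reproduce the engine's `CapabilityRegistry` or `WorkflowSpecValidator` API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

# Capability names come from the description above; everything else here is
# a hypothetical stand-in for the engine's registry and validator.
STRICT = frozenset({"match_verified_query", "run_frozen_query", "summarize", "refuse"})
FLEXIBLE = STRICT | {"nl2sql", "dry_run", "run_adhoc", "freeze_verified"}


@dataclass
class Step:
  id: str
  capability: str


@dataclass
class Plan:
  steps: List[Step] = field(default_factory=list)


def validate(plan: Plan, registry: FrozenSet[str]) -> List[str]:
  """Return one error per step whose capability is not in the registry."""
  return [
      f"step {s.id!r}: capability {s.capability!r} is not registered"
      for s in plan.steps
      if s.capability not in registry
  ]


# The adversarial "just write SQL" plan: rejected under STRICT, valid under FLEXIBLE.
adversarial = Plan([Step("gen", "nl2sql"), Step("run", "run_adhoc"), Step("sum", "summarize")])
strict_errors = validate(adversarial, STRICT)
flexible_errors = validate(adversarial, FLEXIBLE)
```

The check is pure set membership, which is why the governance proofs can run without a model or a warehouse.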

Beats (see README)

show modes registry diff — governance is a one-capability difference, not a prompt.
adversarial: just write SQL — an nl2sql plan is rejected at validation under STRICT (the same plan validates under FLEXIBLE). You can't prompt your way out.
revenue by country (strict) — governed hit: frozen approved SQL on real BigQuery, 0 model-drafted SQL.
churn cohorts (strict) — refused, outside the governed set, 0 queries.
churn cohorts (open mode) — same question → normal agentic answer (not a frozen workflow — the trade-off).

FLEXIBLE adds the constrained-yet-flexible middle ground: gated nl2sql fallback that promotes the result into the governed pool (assisted authoring).

Verification

Live: all five beats run end-to-end with Gemini (Vertex global) + real BigQuery (engine-labeled bigquery).
CI-safe: test_ca_governance_demo.py — 9 tests, no LLM / no BigQuery (governance proofs are deterministic). Run: pytest contributing/samples/workflows/authored_workflow_ca_governance_demo/test_ca_governance_demo.py -q.
Headless driver: governance_demo.py runs the same root_agent through the beats — a live-demo backstop.
Without credentials it degrades to a deterministic mock warehouse (engine-labeled), so it runs anywhere.

Notes

The verified-query matcher is deterministic keyword overlap (auditable); production would use the dataset's semantic model/graph + embeddings — the governance mechanism is unchanged by that swap.
Stacked on spike/dynamic-supervisor-concurrency because it imports the committed authoring engine.

A BigQuery Conversational Analytics agent with a governance dial, built on the RFC google#93 model-authored-workflow engine. Proves that restricting CA to governed/golden (verified) queries is enforced STRUCTURALLY by the engine, not with a prompt: a WorkflowSpec may only compose capabilities in the CapabilityRegistry, and the validator rejects any plan referencing one that is not.

- STRICT registry (match_verified_query, run_frozen_query, summarize, refuse) has no nl2sql -> an adversarial "just write SQL" plan is rejected at validation before any query runs; the same plan validates under FLEXIBLE.
- Golden hit -> a frozen, auditable workflow runs the approved SQL on REAL BigQuery (thelook_ecommerce); miss -> STRICT refuses, OPEN falls through to a normal agentic agent (free-form ADK Agent + query_thelook BQ tool).
- FLEXIBLE middle ground: gated nl2sql fallback that promotes the result into the governed pool (assisted authoring).

Real Gemini + real BigQuery with a deterministic mock-warehouse fallback (engine-labeled). 9 CI-safe tests (no LLM, no BQ) pin the governance proofs. Headless driver (governance_demo.py) as a live-demo backstop. Self-contained; reuses the committed authoring engine. No changes to src/ or sibling samples.

caohy1988 · 2026-06-24T21:32:27Z

+
+def _mode_from(text: str) -> str:
+  low = text.lower()
+  if any(k in low for k in ("open mode", "agentic", "flexible")):


This collapses flexible into open, and the runtime path below always uses golden_registry() + author_golden_plan(). A live prompt with (flexible) therefore never exercises the author_flexible_plan() path described in the README/NARRATIVE; on a miss it falls straight to the normal agentic fallback. To make the governance dial real, keep flexible as its own mode, select flexible_registry() + author_flexible_plan(), and reserve open for the free-form agentic fallback.

Fixed in b256f9c. _mode_from now returns three distinct modes; FLEXIBLE selects flexible_registry() + author_flexible_plan(), OPEN is reserved for the agentic fallback, STRICT refuses on a miss. Added test_mode_routing_is_three_distinct_modes. Live-verified: a (flexible) miss authors the gated nl2sql plan, runs it, and promotes.
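For readers following the thread, a minimal sketch of three-way mode routing. The keyword choices and the returned strings are assumptions based on the prompts quoted in this PR; the actual `_mode_from` in b256f9c may differ.

```python
def mode_from(text: str) -> str:
  """Map a prompt to one of three governance modes (keywords are assumed)."""
  low = text.lower()
  # FLEXIBLE is checked first so it is never folded into OPEN.
  if "flexible" in low:
    return "flexible"
  if "open mode" in low or "agentic" in low:
    return "open"
  # Default to the most restrictive mode when no dial keyword is present.
  return "strict"
```

Defaulting to `strict` means a prompt that names no mode gets the governed-only behavior.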

caohy1988 · 2026-06-24T21:32:35Z

+  score: float = 0.0
+
+
+class Sql(BaseModel):


Sql only accepts sql, so the real nl2sql output cannot carry the original question forward. Even if the model returns a question field, Pydantic will drop it, _dry_run() will set question to "", and freeze_verified will promote an empty-question record. The tests hide this because the stub returns a raw dict with question. Add question: str = "" to Sql (and make the instruction return it), or otherwise bind the task question into the dry-run/run/freeze path, then assert the promoted record keeps the original question.

Fixed in b256f9c. Added question: str = "" to Sql and updated the nl2sql instruction to copy the question verbatim, so it survives into dry_run/run/freeze. test_flexible_falls_back_validates_and_promotes_with_question now asserts the promoted record keeps the original question. Live run confirmed the promoted query_id is derived from the real question.

caohy1988 · 2026-06-24T21:32:43Z

+      route.block = [
+          StepRef(kind="step", id="gen", capability="nl2sql",
+                  input=Binding(source="step", step="match")),
+          StepRef(kind="step", id="check", capability="dry_run",


Right now dry_run is observational, not a gate: run_adhoc and freeze_verified run regardless of check.valid or check.error. In real BigQuery an invalid generated query would return an error from run_query, but this plan still reaches freeze_verified and promotes the SQL. Since the demo story says FLEXIBLE validates before running/promoting, add a branch on check.valid (or make run_adhoc/freeze_verified refuse invalid inputs) and add a test where nl2sql returns invalid SQL and no query is run or frozen.

Fixed in b256f9c. The dry-run is now a real gate: author_flexible_plan() branches on check.valid — valid -> run_adhoc + freeze + summarize; invalid -> new reject_invalid leaf (nothing run, nothing promoted). Added test_flexible_gate_rejects_invalid_sql_no_run_no_freeze (asserts no adhoc/freeze state and the pool is unchanged).
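The difference between an observational dry run and a gate can be sketched as below. `run_gated` and the two stub capabilities are hypothetical; in the PR the branch lives inside the authored plan (on `check.valid`), not in a Python helper.

```python
from typing import Callable


def run_gated(
    sql: str,
    dry_run: Callable[[str], dict],
    run_adhoc: Callable[[str], dict],
) -> dict:
  """Run `sql` only if the dry run reports it valid; otherwise reject."""
  check = dry_run(sql)
  if not check.get("valid"):
    # Reject leaf: nothing is executed and nothing is promoted.
    return {"status": "rejected", "error": check.get("error", ""), "rows": None}
  return {"status": "ok", "error": "", "rows": run_adhoc(sql)["rows"]}


# Stub capabilities that record whether the warehouse was touched.
executed = []


def fake_dry_run(sql: str) -> dict:
  ok = sql.lower().startswith(("select", "with"))
  return {"valid": ok, "error": "" if ok else "syntax error"}


def fake_run(sql: str) -> dict:
  executed.append(sql)
  return {"rows": [{"n": 1}]}


bad = run_gated("SELEC 1", fake_dry_run, fake_run)
good = run_gated("SELECT 1 AS n", fake_dry_run, fake_run)
```

After both calls, `executed` holds only the valid statement, which is the property the new test asserts for the real plan.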

caohy1988 · 2026-06-24T21:32:52Z

+    return {"sql": sql, "valid": False, "error": str(e)[:500], "engine": "bigquery"}
+
+
+def run_query(value) -> dict:


run_query is documented as read-only, and the agentic tool instruction asks for read-only SELECTs, but the tool does not enforce that before calling BigQuery. In OPEN mode the model can pass arbitrary SQL to query_thelook, including DDL/DML or multi-statement SQL billed to GOOGLE_CLOUD_PROJECT. Please add a local guard shared by dry_run/run_query/query_thelook that rejects anything except a single read-only SELECT before client.query() (and before the mock path too, so tests cover it).

Fixed in b256f9c. Added warehouse.read_only_violation() (rejects non single-SELECT/WITH: DDL/DML, scripting keywords, multi-statement) and enforced it at the top of dry_run and run_query — before BigQuery AND before the mock — so query_thelook is covered too. Added test_read_only_guard_blocks_non_select.
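A rough sketch of what such a guard can look like. The deny-list and the return convention (a reason string, or `None` when the statement is allowed) are assumptions; the keyword set used by `warehouse.read_only_violation()` is not shown in this thread.

```python
import re
from typing import Optional

# Hypothetical deny-list of DDL/DML and scripting keywords.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|create|drop|alter|truncate|grant|revoke|"
    r"declare|set|call|begin|execute)\b",
    re.IGNORECASE,
)


def read_only_violation(sql: str) -> Optional[str]:
  """Return a reason if `sql` is not a single SELECT/WITH statement, else None."""
  body = sql.strip().rstrip(";").strip()
  if not body:
    return "empty statement"
  if ";" in body:
    return "multiple statements are not allowed"
  if not re.match(r"(select|with)\b", body, re.IGNORECASE):
    return "only SELECT/WITH statements are allowed"
  hit = _FORBIDDEN.search(body)
  if hit:
    return f"forbidden keyword: {hit.group(0).lower()}"
  return None
```

This version is deliberately conservative: it does not parse SQL, so a semicolon or a deny-listed word inside a string literal is also rejected. For a demo guard that runs before both BigQuery and the mock, failing closed is the safer default.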

caohy1988 · 2026-06-24T21:32:59Z

+    python contributing/samples/workflows/authored_workflow_ca_governance_demo/governance_demo.py
+
+    # Deterministic (no creds): forces the mock warehouse; LLM steps still
+    # need a model, so pass --no-llm to script only the non-LLM beats.


This mentions --no-llm, but the argparse setup below only supports --beats. Either add the flag or remove this sentence and keep the documented deterministic command to CA_GOV_USE_BIGQUERY=0 python .../governance_demo.py --beats diff adversarial, which currently matches the implementation.

Fixed in b256f9c. Removed the --no-llm sentence; the deterministic command is now CA_GOV_USE_BIGQUERY=0 ... --beats diff adversarial (matches the implementation). Also added a flexible beat to the driver.

caohy1988 · 2026-06-24T21:33:06Z

Top-level demo polish suggestion: this is a strong leadership demo, but I would make the README follow the ADK sample shape a bit more closely before sharing broadly: add a small Mermaid graph for the three modes (STRICT -> golden workflow/refusal, FLEXIBLE -> golden first + validated promotion, OPEN -> agentic fallback), and add a short Related Guides/Related RFCs section pointing to the workflow/RFC google#92/google#93 material. That will make the control-plane story scannable for people who land directly on the sample, not only through the PR description.

…un, read-only guard, question threading

- flexible is now a distinct live mode: FLEXIBLE selects flexible_registry() + author_flexible_plan(); OPEN is reserved for the free-form agentic fallback; STRICT refuses on a miss (comment 1).
- Sql carries `question` (output schema + instruction) so the originating question survives nl2sql and the promoted verified query keeps it (comment 2).
- The FLEXIBLE dry-run is a real GATE: branch on check.valid -> run+freeze on valid, else reject_invalid (nothing run, nothing promoted) (comment 3).
- warehouse.read_only_violation() rejects non single-SELECT (DDL/DML, scripting, multi-statement) before BigQuery AND the mock; enforced by dry_run/run_query/query_thelook (comment 4).
- governance_demo: drop the nonexistent --no-llm doc; add a `flexible` beat (comment 5).
- README: Mermaid 3-mode diagram + Related (engine/RFC google#92/google#93) section; mode table includes FLEXIBLE (comment 6).

Tests: 12 pass (added flexible-gate-rejects-invalid, read-only guard, mode routing; flexible test now asserts the promoted question is preserved).

Live re-validated (gemini-3.5-flash global + real BigQuery): FLEXIBLE generated SQL, passed the real dry-run gate, ran, and promoted with the question intact.

haiyuan-eng-google · 2026-06-24T21:46:22Z

Thanks for the review — all six points addressed in b256f9c1.

README polish (this comment): added a Mermaid flowchart of the three governance modes (STRICT → refuse, FLEXIBLE → golden-first + validated promotion, OPEN → agentic fallback) and a Related section pointing to the engine (authored_workflow_spike / dynamic_supervisor_spike), RFC google#92/google#93, the sibling samples, and the CA verified-queries docs. The mode table now includes the FLEXIBLE beat.

Code review comments (replied inline):

FLEXIBLE is now a real, distinct live mode (flexible_registry + author_flexible_plan); OPEN reserved for the agentic fallback.
Sql carries question so it survives nl2sql → dry_run → run → freeze (promoted record keeps the original question).
The dry-run is now a real gate (branch on check.valid; invalid SQL → reject_invalid, nothing run or promoted).
New read_only_violation() guard rejects non single-SELECT before BigQuery and the mock, shared by dry_run/run_query/query_thelook.
Removed the nonexistent --no-llm doc; added a flexible driver beat.

Tests: 12 pass (added: flexible-gate-rejects-invalid, read-only guard, mode routing; flexible test asserts the promoted question is preserved).
Live re-validation (gemini-3.5-flash, global Vertex endpoint + real BigQuery): the FLEXIBLE miss generated SQL, passed the real dry-run gate, executed (Men $63.46 / Women $55.84), and promoted the query with its question intact.

caohy1988 · 2026-06-24T22:02:16Z

+    ),
+}
+
+DEFAULT_ORDER = ["diff", "adversarial", "hit", "refuse", "flexible", "agentic"]


Suggestion for LT demo repeatability: the default run includes the flexible beat, and that beat writes the promoted query into the file-backed ca_gov_store by default. After one rehearsal, rerunning the same default script will make beat 5 a governed hit instead of showing nl2sql -> dry_run -> promote, because the query is already in the pool. I would make the headless driver use a fresh temp CA_GOV_STORE by default (and print it for inspection), or add a --reset-store / --store flag and document the clean-start command. That keeps rehearsals deterministic while leaving adk web persistence intact.

Fixed in 4e574f0. The headless driver now defaults to a fresh temp CA_GOV_STORE per run (printed as store: …), so the FLEXIBLE beat always shows nl2sql → dry_run → promote instead of becoming a governed hit on a re-run. Added --store (persist / share with adk web) and --reset-store (clear promoted, non-seed queries). adk web persistence is unchanged.

caohy1988 · 2026-06-24T22:02:16Z

+   **refuses** rather than guessing. `0 queries run`. *(A hard boundary that
+   fails safe.)*
+
+5. **`…churn cohorts… (open mode)`** — the *same* question, dial turned to OPEN,


The talking track skips the FLEXIBLE beat in its numbered walkthrough, while the README and driver now run flexible before agentic. For the LT demo, I would align this narrative with the actual default order: add the average-sale-price flexible promotion beat here, then make the open-mode churn question beat 6. Otherwise the presenter notes miss the core middle-ground story this PR added.

Fixed in 4e574f0. The numbered walkthrough now matches the README/driver order: beat 5 is the FLEXIBLE average-sale-price promotion (semantic-constrained nl2sql → real dry-run gate → run → promote), and the OPEN-mode churn question is beat 6.

caohy1988 · 2026-06-24T22:02:16Z

+  if client is None:
+    return {
+        "sql": sql,
+        "valid": sql.strip().lower().startswith("select"),


Mock and real dry-run can diverge here: read_only_violation() already accepts WITH queries, and BigQuery dry-run would validate a legal CTE, but credential-less mode returns valid: false unless the qualified SQL starts with select. Since this demo is likely rehearsed with CA_GOV_USE_BIGQUERY=0, a valid nl2sql CTE can be rejected locally even though it would pass live. I would either treat the mock dry-run as valid after the read-only guard passes, or check select|with here, plus add a small test for warehouse.dry_run({'sql': 'WITH x AS (SELECT 1) SELECT * FROM x'})['valid'] is True.

Fixed in 4e574f0. Mock and real dry-run no longer diverge: since read_only_violation() already confirms a single SELECT/WITH, the credential-less dry_run now returns valid: True once the guard passes (no second leading-select check). Added test_mock_dry_run_accepts_cte asserting dry_run({'sql': 'WITH x AS (SELECT 1) SELECT * FROM x'})['valid'] is True.

…, mock/real dry-run parity, narrative alignment

- governance_demo: default to a FRESH temp CA_GOV_STORE per run so the FLEXIBLE promotion beat is repeatable (a persisted promotion would turn a re-run into a governed hit); add --store / --reset-store and print the store path.
- warehouse.dry_run: mock now returns valid=True once the read-only guard passes, matching what BigQuery accepts: a legal `WITH ... SELECT` CTE is no longer rejected only in credential-less mode. Added test_mock_dry_run_accepts_cte.
- NARRATIVE: numbered walkthrough now includes the FLEXIBLE promotion beat (5) and moves the OPEN-mode churn beat to 6, matching the README/driver order.

Tests: 13 pass.

caohy1988 · 2026-06-24T22:11:45Z

+```bash
+python contributing/samples/workflows/authored_workflow_ca_governance_demo/governance_demo.py
+# or a subset:
+python .../governance_demo.py --beats diff adversarial hit refuse flexible agentic


Small LT-demo doc polish: the driver now has the right behavior (fresh temp CA_GOV_STORE by default, plus --store / --reset-store), but this README section still only shows the old invocation. I would add one sentence here saying the headless driver uses a fresh store per run for repeatable rehearsals, and show the persistent/replay command, e.g. python .../governance_demo.py --store contributing/samples/workflows/authored_workflow_ca_governance_demo/ca_gov_store --reset-store. That makes it clear when beat 5 should re-promote versus when a presenter intentionally wants to share the promoted pool with adk web.

Fixed in 904aff9. The Headless driver section now states the driver uses a fresh temp CA_GOV_STORE per run (printed as store: …) so beat 5 always re-promotes, and shows the persistent command for sharing the promoted pool with adk web:

python .../governance_demo.py --store contributing/samples/workflows/authored_workflow_ca_governance_demo/ca_gov_store --reset-store

…t + --store/--reset-store

The headless-driver section showed only the old invocation. Document that the driver uses a fresh temp CA_GOV_STORE per run (repeatable beat 5), and show the persistent command (--store + --reset-store) for sharing the promoted pool with adk web.

FLEXIBLE no longer auto-writes to the governed pool.

Removed the freeze_verified capability entirely, so a model-authored plan CANNOT promote, which is the strongest form of the governance point. The flexible miss path now generates -> validates (dry-run gate) -> runs -> answers, then PARKS the validated query as a pending candidate. A human replies `approve` (-> added to the golden pool via golden.approve_pending) or `reject` (-> discarded). Promotion is the only path into the pool and it requires explicit human sign-off.

- golden.py: pending-candidate store (save/get/clear/approve_pending), single-slot.
- agent.py: approve/reject handling at the top of plan_and_run; flexible miss parks a candidate and asks for approval; flexible_registry drops freeze_verified; _strip_mode cleans the stored question; banner blurb updated.
- governance_demo.py: `flexible` beat is now the multi-turn HITL sequence (ask -> approve -> re-ask = governed hit), merging the old beat-5 + closer.
- README/NARRATIVE: HITL flow, mermaid (approve/reject branch), merged beat 5, "no promote capability" framing.

Tests: 15 pass (added HITL approve/reject + _strip_mode; flexible test now asserts no auto-promote and that freeze_verified is absent from the registry).

Live-verified end-to-end: flexible -> pending -> approve -> governed hit on gemini-3.5-flash global + real BigQuery.
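The single-slot pending store with human approval can be sketched as follows. The method names mirror the ones in the commit message (save/get/clear/approve_pending), but the class, the JSON layout, and the `verified.json` pool file are assumptions made for illustration; golden.py itself is not shown in this thread.

```python
import json
import os
import tempfile


class PendingStore:
  """Single-slot pending candidate plus an approved pool, both on disk."""

  def __init__(self, root: str):
    self.pending_path = os.path.join(root, "pending_candidate.json")
    self.pool_path = os.path.join(root, "verified.json")  # assumed layout

  def pool(self) -> list:
    if not os.path.exists(self.pool_path):
      return []
    with open(self.pool_path) as f:
      return json.load(f)

  def save_pending(self, question: str, sql: str) -> None:
    # Single slot: a new candidate overwrites any earlier one.
    with open(self.pending_path, "w") as f:
      json.dump({"question": question, "sql": sql}, f)

  def get_pending(self):
    if not os.path.exists(self.pending_path):
      return None
    with open(self.pending_path) as f:
      return json.load(f)

  def clear_pending(self) -> None:
    if os.path.exists(self.pending_path):
      os.remove(self.pending_path)

  def approve_pending(self) -> bool:
    """Move the pending candidate into the pool; False if nothing is pending."""
    cand = self.get_pending()
    if cand is None:
      return False
    entries = self.pool() + [cand]
    with open(self.pool_path, "w") as f:
      json.dump(entries, f)
    self.clear_pending()
    return True


store = PendingStore(tempfile.mkdtemp(prefix="ca_gov_store_"))
store.save_pending("average sale price by department", "SELECT 1")
approved = store.approve_pending()  # stands in for the human typing `approve`
```

Because `approve_pending` is called only from the approve/reject handling and is not a registered capability, no authored plan can reach it.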

caohy1988 · 2026-06-24T23:08:37Z

+  store = args.store or os.environ.get("CA_GOV_STORE") or tempfile.mkdtemp(
+      prefix="ca_gov_store_"
+  )
+  if args.reset_store:


With the new HITL flow, --reset-store should also clear pending_candidate.json, not just verified/. Otherwise a presenter can run with a durable --store, generate a candidate but not approve/reject it, then later run --reset-store expecting a clean rehearsal; the old pending candidate survives and a stray approve will promote stale SQL into the freshly reset pool. I would either set CA_GOV_STORE = store before this block and call golden.clear_pending() as part of reset, or remove os.path.join(store, "pending_candidate.json") directly here. The help/comments should also say the reset clears promoted + pending candidates in the HITL version.

Fixed in the latest push. CA_GOV_STORE is now set before the reset block, and --reset-store clears both verified/ and the pending candidate via golden.clear_pending() — so a leftover candidate can't be approved into a freshly reset pool. The --reset-store help and README now say it clears promoted and pending. Verified: a pending candidate in a durable --store is gone after --reset-store.
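The ordering fix can be sketched like this. `resolve_store` is a hypothetical helper; the `verified/` directory and `pending_candidate.json` file names come from the review comment, and the real driver calls `golden.clear_pending()` rather than removing the file directly.

```python
import os
import shutil
import tempfile


def resolve_store(store_arg=None, reset=False):
  """Pick the store directory, export it, then optionally reset it."""
  store = store_arg or os.environ.get("CA_GOV_STORE") or tempfile.mkdtemp(
      prefix="ca_gov_store_"
  )
  os.makedirs(store, exist_ok=True)
  # Export BEFORE resetting so store-aware helpers act on the same path.
  os.environ["CA_GOV_STORE"] = store
  if reset:
    # Clear promoted queries and the un-approved candidate together.
    shutil.rmtree(os.path.join(store, "verified"), ignore_errors=True)
    pending = os.path.join(store, "pending_candidate.json")
    if os.path.exists(pending):
      os.remove(pending)
  print(f"store: {store}")
  return store
```

This sketch removes everything under `verified/`; the driver described above keeps seed queries and clears only promoted ones.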

…date

With human-in-the-loop promotion, a durable --store could retain an un-approved pending_candidate.json across a --reset-store, so a later `approve` would promote stale SQL into the freshly reset pool. Set CA_GOV_STORE before the reset block and clear BOTH verified/ and the pending candidate (golden.clear_pending()). Help text and README updated to say reset clears promoted + pending.

…eterministic fallback

The demo now genuinely exercises RFC google#93's headline: the model AUTHORS the typed WorkflowSpec at runtime via LlmAgent(output_schema=WorkflowSpec), which is then validated against the registry and governed. Adds _author_live() (with retry) + brace-free planner instructions (ADK LlmAgent treats {...} as state-template vars, so the instruction must avoid literal braces) and a per-mode catalogue built from the registry. plan_and_run authors golden/flexible/adversarial plans live; a canned author_*_plan() is the fallback if live authoring is off or the model returns an off-shape plan (so the demo never breaks). The banner shows "Model-authored (live)" vs the fallback, honestly.

Verified live (gemini-3.5-flash global + real BigQuery): golden hit, adversarial (model-authored nl2sql plan -> rejected by STRICT), and the post-approval strict re-ask all author live; the flexible nested-gate plan falls back gracefully.

Tests: 18 pass (added _spec_ids, live-authoring-disabled fallback, planner instruction catalogue). CA_GOV_LIVE_PLANNER=1 default; set 0 for deterministic.

README: document CA_GOV_LIVE_PLANNER, add the 🧠 Model-authored callout to "what to point at", and an honest-scope note (authoring is real but instruction-guided; free-authoring evidence in sibling samples; governance rests on validator+registry regardless of authoring style). NARRATIVE: state the plan is model-authored live and tag beats 2/3 as 🧠 model-authored.

haiyuan-eng-google · 2026-06-25T00:14:39Z

Update: the demo now genuinely exercises RFC google#93's model-authoring, not just the governance machinery. The plan is authored live by an LlmAgent(output_schema=WorkflowSpec) (_author_live, with retry), then validated against the registry and governed; a canned author_*_plan() is the deterministic fallback (and CA_GOV_LIVE_PLANNER=0 forces it). The banner shows 🧠 Model-authored (live) vs the fallback, honestly.
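The live-then-fallback control flow can be sketched as below. `author_plan` and its parameters are hypothetical, and plain dicts stand in for `WorkflowSpec`; only the `CA_GOV_LIVE_PLANNER` switch and the retry-then-canned behavior come from this PR.

```python
import os
from typing import Callable, Optional


def author_plan(
    author_live: Callable[[], Optional[dict]],
    canned: dict,
    is_acceptable: Callable[[dict], bool],
    retries: int = 2,
) -> tuple:
  """Return (plan, label); use the canned plan when live authoring fails."""
  if os.environ.get("CA_GOV_LIVE_PLANNER", "1") != "0":
    for _ in range(retries):
      try:
        plan = author_live()
      except Exception:
        continue  # transient model or parse error: try again
      if plan is not None and is_acceptable(plan):
        return plan, "model-authored (live)"
  # Live authoring is off, kept failing, or produced an unacceptable plan.
  return canned, "fallback (canned)"
```

Returning the label with the plan keeps the banner honest: the live label is only produced on the branch where an authored plan was accepted.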

Verified live (gemini-3.5-flash, global Vertex + real BigQuery): the golden hit, the adversarial beat (the model's own nl2sql plan is rejected by STRICT), and the post-approval strict re-ask all author live; the flexible nested-gate plan falls back gracefully when the model emits an off-shape plan.

Implementation note for reviewers: ADK LlmAgent instructions treat {...} as session-state template vars, so the planner instructions are deliberately brace-free (the JSON schema comes from output_schema). Honest-scope: the plan shape is instruction-guided for on-camera reliability — free, un-prescribed authoring evidence remains in the sibling authored_workflow_spike / authored_workflow_demo samples. 18 tests pass.

caohy1988 · 2026-06-25T16:56:22Z

Fresh feedback on the latest live-authoring update:

I think model-authored workflow is a strong fit for this demo. The demo is no longer only showing “workflow governance”; it now shows the more important RFC google#93 story: the model can author a typed WorkflowSpec, but the system still governs what that plan can compose, validates it, freezes it, and keeps promotion under human control.

The value is clearest in three beats:

adversarial: the model-authored nl2sql -> run_adhoc plan is rejected under STRICT because the registry does not expose those capabilities;
governed hit: even when the model authors the plan, execution is still from a frozen/auditable workflow over approved SQL;
FLEXIBLE + HITL: the model can propose and validate a candidate, but cannot self-promote it into the governed pool.

That is the right enterprise story: “dynamic model-authored orchestration, bounded by structural policy,” not “please trust the model to follow a prompt.”

The remaining thing I would tighten before using this as the LT demo is the live-plan acceptance gate. Right now _author_live() accepts any registry-valid spec that contains the required ids. For the on-camera story, I would require the exact expected shape for golden/flexible/adversarial plans before labeling the turn as live model-authored; otherwise fall back. I left an inline comment here:

#9 (comment)

With that tightened, I think the demo shows the value of model-authored workflow well while staying honest about the current implementation: live authoring is real, but intentionally instruction-guided for reliability.

caohy1988 · 2026-06-25T16:57:13Z

Preferred LT demo narrative:

I would make the story a governance-control story first, and a model-authoring story second. The key line is:

The model is allowed to author the workflow, but it is not allowed to choose its own powers.

Suggested flow:

Start with the enterprise ask.

“Customers want Conversational Analytics, but some of them need a hard boundary: only answer from verified/golden queries unless policy explicitly allows more. A prompt like ‘only use verified queries’ is not governance. It is a request.”
Show the registry diff.

“The boundary is structural. STRICT exposes only match_verified_query, run_frozen_query, summarize, and refuse. FLEXIBLE adds nl2sql, dry_run, run_adhoc, and reject_invalid. Neither registry exposes promotion. That means the model cannot invent a capability or write itself into the governed pool.”
Show the adversarial beat.

“Now let the model author the wrong plan: nl2sql -> run_adhoc -> summarize. The plan may be model-authored, but validation rejects it under STRICT before any query runs. This is the headline: we are not trusting the model to obey a prompt; we are validating the workflow it authored against a capability registry.”
Show a governed hit.

“For a verified question, the model authors a workflow, the workflow validates, freezes, and runs the approved SQL on BigQuery. The answer is dynamic in orchestration, but governed in execution: approved SQL, frozen spec hash, replayable artifact, zero model-drafted SQL on the governed path.”
Show STRICT refusal.

“For an out-of-set question, STRICT refuses. That refusal is a feature. It proves the boundary fails closed: no verified match, no SQL run, no cost, no hallucinated answer.”
Show FLEXIBLE + HITL.

“Some customers do not want a hard stop; they want constrained authoring. FLEXIBLE lets the model generate SQL under the allowed capability set, dry-run validates it, then it runs and parks the candidate. But the model still cannot promote it. A human must approve. After approval, the same question becomes a governed hit.”
Show OPEN.

“OPEN is today’s normal agentic CA path: useful and powerful, but not frozen/auditable in the same way. The point is not that OPEN is bad; the point is that we can expose a dial: strict governed-only, flexible HITL-assisted authoring, or full agentic fallback.”

Close with:

“Model-authored workflow is valuable here because it separates who proposes the plan from who grants authority. The model authors; the registry limits; the validator enforces; the frozen record audits; the human approves promotion. That is the enterprise governance shape.”

One implementation caveat I would mention only if asked: the live authoring in this demo is intentionally instruction-guided for reliability. That is fine for LT, as long as the UI/text honestly distinguishes live model-authored plans from deterministic fallback and the acceptance gate verifies the expected shape.

Address PR google#9 review (discussion_r3476149931): _author_live previously accepted any registry-valid spec that merely contained the required node ids, so a model could return an off-shape-but-valid plan (different output binding, route values, branch condition, or capability/input wiring) and still be labeled "Model-authored (live)" and executed.

Now the live label is earned only when the authored plan matches the exact expected shape per mode: _is_golden_shape / _is_flexible_shape / _is_adversarial_shape compare a canonical structural signature (node order, ids, capabilities, input/branch bindings, route values, spec output) against the canned plan for that mode. Any registry-valid but off-shape plan falls back to the deterministic canned plan and is honestly labeled a fallback.

Tests: 21 pass (added shape-predicate acceptance/cross-mode, off-shape-but-registry-valid rejection, and live off-shape -> fallback). README honest-scope updated to describe the exact-shape gate.

Live re-validated (gemini-3.5-flash, global Vertex + real BigQuery): golden hit and strict refusal author live, adversarial plan rejected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

haiyuan-eng-google · 2026-06-25T17:12:42Z

Tightened the live-plan acceptance gate as you asked (commit 51376c86).

_author_live() no longer accepts any registry-valid spec that merely contains the required node ids. The 🧠 Model-authored (live) label is now earned only when the authored plan matches the exact expected shape for that mode, via the predicates you suggested:

_is_golden_shape, _is_flexible_shape, _is_adversarial_shape

Each compares a canonical structural signature — node order, ids, capability-per-id, input bindings, branch condition (on), route values, and the spec output binding — against the canned plan for that mode (so the predicates stay in sync with author_*_plan() automatically; single source of truth). Any registry-valid but off-shape plan now falls back to the deterministic canned plan and is honestly labeled a fallback — covering exactly the "right ids, wrong output binding / route / wiring" case you flagged.
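A minimal sketch of an exact-shape gate built on a canonical signature. The dict-based plan layout (`nodes`, `input`, `on`, `routes`, `output`) is a hypothetical simplification of `WorkflowSpec`; the real predicates operate on the engine's typed spec.

```python
def signature(plan: dict) -> tuple:
  """Canonical structural signature: node order, wiring, and spec output."""
  nodes = tuple(
      (
          n["id"],
          n.get("capability"),
          n.get("input"),
          n.get("on"),
          tuple(sorted(n.get("routes", {}).items())),
      )
      for n in plan["nodes"]
  )
  return (nodes, plan.get("output"))


def is_shape(plan: dict, canned: dict) -> bool:
  """True only if `plan` is structurally identical to the canned plan."""
  return signature(plan) == signature(canned)


canned_golden = {
    "nodes": [
        {"id": "match", "capability": "match_verified_query", "input": "task"},
        {"id": "run", "capability": "run_frozen_query", "input": "match"},
        {"id": "answer", "capability": "summarize", "input": "run"},
    ],
    "output": "answer",
}
# Right ids and capabilities, but the spec output is bound to another node.
off_shape = {**canned_golden, "output": "run"}
```

Here `off_shape` would pass a registry check, since every capability is allowed, yet `is_shape(off_shape, canned_golden)` is false. Deriving the expected signature from the canned plan is what keeps the predicate and the fallback in sync.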

Tests: 18 → 21. Added: shape predicates accept their own canned plan and reject cross-mode plans; an off-shape-but-registry-valid plan (right ids, different output binding) passes the validator but fails the shape gate; and with the live planner ON, an off-shape authored plan makes _author_live return None → caller uses the canned fallback.

README honest-scope updated to describe the exact-shape gate.

Live re-validated (gemini-3.5-flash, global Vertex + real BigQuery): the golden hit and the strict refusal both author live (the model reliably emits the exact shape, so the tightening doesn't cost the on-camera live label), and the adversarial plan is still rejected under STRICT. With this in, live authoring is real, honestly distinguished from fallback, and the acceptance gate verifies the expected shape — matching the narrative.

Also took your preferred LT narrative — the governance-control-first framing with "the model is allowed to author the workflow, but not to choose its own powers" — and worked it into the recording track. Thanks, this made the demo materially more honest.

caohy1988 · 2026-06-25T17:15:54Z

Punchline: A human-compiled workflow hardcodes one policy path; a model-authored workflow lets the model adapt the plan to the question, while the registry prevents it from granting itself new authority.

Top 3 demo points:

Adaptive without losing control: the model can author the workflow for the user’s question, but it can only compose approved capabilities.
Governance is structural, not prompt-based: STRICT does not expose nl2sql, so even a model-authored SQL plan is rejected before anything runs.
Safe path from discovery to governance: FLEXIBLE lets the model generate and validate a candidate, but only human approve adds it to the governed pool.

…line

Fold PR google#9's narrative feedback into NARRATIVE.md + README.md, governance-first:

- Punchline: a human-compiled workflow hardcodes one policy path; a model-authored workflow adapts the plan to the question while the registry prevents it from self-granting authority ("authors the plan, not its powers").
- The three LT points (adaptive-without-losing-control / structural-not-prompt / safe discovery->governance) mapped to beats 2, 3, 5.
- Keep honest scope: in this demo the plan shape is instruction-guided and exact-shape-gated, so per-question adaptation is dial/branch/SQL-content, not free structural decomposition (that evidence is in the sibling samples); the no-self-granted-authority guarantee holds regardless of authoring style.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

haiyuan-eng-google · 2026-06-25T17:19:16Z

Love the punchline — adopted it (commit 96c8c46a, docs-only).

Folded the human-compiled-vs-model-authored framing and the three points into both NARRATIVE.md and README.md, told governance-first with your line front and center: the model is allowed to author the workflow, but not to choose its own powers. The "separates who proposes the plan from who grants authority" close is now the thesis of the talking track.

One small honesty edit I made on point 1 ("adaptive"): in this demo the plan shape is instruction-guided and now exact-shape-gated, so what the model adapts per question is the dial/mode, the match-vs-nl2sql branch it takes at runtime, and the SQL content — not free structural decomposition. I pointed to the sibling authored_workflow_spike / authored_workflow_demo samples for the unconstrained-authoring evidence, and noted that the governance guarantee (can't self-grant authority) holds regardless of authoring style — so the punchline lands without overclaiming on camera. Flag if you'd rather state it more strongly.
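For readers unfamiliar with the term, an exact-shape gate can be sketched as below. This is an illustration under stated assumptions: the expected shapes in `EXPECTED_SHAPES` and the helper `authoring_label` are hypothetical, since the real gate's shapes and code are not shown in this thread. The only behavior taken from the discussion is that a plan earns the "live" label when it matches the expected shape exactly and is otherwise treated as fallback.

```python
# Hypothetical expected shapes per beat; the demo's actual shapes may differ.
EXPECTED_SHAPES = {
    "golden_hit": ("match_verified_query", "run_frozen_query", "summarize"),
    "strict_refusal": ("match_verified_query", "refuse"),
}


def authoring_label(authored_steps, beat):
    """Label a model-authored plan 'live' only on an exact shape match.

    Any deviation (extra, missing, or reordered steps) is labeled 'fallback',
    so a near-miss plan is never presented as live authoring.
    """
    expected = EXPECTED_SHAPES[beat]
    return "live" if tuple(authored_steps) == expected else "fallback"


print(authoring_label(
    ["match_verified_query", "run_frozen_query", "summarize"], "golden_hit"
))  # live
print(authoring_label(
    ["match_verified_query", "nl2sql", "summarize"], "golden_hit"
))  # fallback
```

Because the gate compares whole sequences, what varies per question is the branch taken and the SQL content, not the structure of the plan.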

Sequential operator walkthrough (send / point-at / say) for the eight beats, wired to the actual prompts and on-screen markers. Carries the governance-first framing, the human-compiled-vs-model-authored punchline, the three LT points, and the honest-scope note (exact-shape gate, free-authoring in sibling samples).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

haiyuan-eng-google force-pushed the demo/ca-governance-spectrum branch from 29deff3 to cf7173a on June 24, 2026 21:26

caohy1988 reviewed Jun 24, 2026

haiyuan-eng-google added 2 commits June 24, 2026 22:34

caohy1988 reviewed Jun 24, 2026

haiyuan-eng-google added 3 commits June 24, 2026 23:27

caohy1988 reviewed Jun 25, 2026

Comment thread contributing/samples/workflows/authored_workflow_ca_governance_demo/bq_ca_governance/agent.py Outdated

		return {"sql": sql, "valid": False, "error": str(e)[:500], "engine": "bigquery"}


		def run_query(value) -> dict:
