Draft
64 commits
5531d61
spike(workflow): concurrent dynamic dispatch harness (DynamicNodeSupe…
caohy1988 Jun 2, 2026
627a5dc
spike(workflow): RFC #93 authored-workflow demand-gate artifact
caohy1988 Jun 2, 2026
6177a55
demo(workflow): ADK Web wrapper for RFC #93 authored workflows
caohy1988 Jun 2, 2026
f7ad788
docs(workflow): RFC #93 canonical design (incl. plan export/storage t…
caohy1988 Jun 2, 2026
bfd90a4
docs(workflow): export contract — digest definition, task_input_schem…
caohy1988 Jun 2, 2026
c4c1b23
docs(workflow): unify FrozenWorkflowRecord + import integrity (review)
caohy1988 Jun 2, 2026
921fbbe
docs(workflow): make positioning/phasing explicit (MVP-first, #92-fir…
caohy1988 Jun 2, 2026
5aa2dc6
docs(workflow): make demo-vs-production persistence explicit
caohy1988 Jun 2, 2026
5aa73fd
docs(workflow): rename to 'Reproducible Model-Authored Workflows for …
caohy1988 Jun 2, 2026
305c62e
docs(workflow): add Pipeline block (uses #92 ctx.pipeline) + Claude C…
caohy1988 Jun 2, 2026
40e2814
spike(workflow): implement Pipeline (barrier-free per-item) + tests
caohy1988 Jun 2, 2026
64fcb95
spike(workflow): enforce max_fan_out per Pipeline stage + sync docs
caohy1988 Jun 2, 2026
75264e5
demo(workflow): author a Pipeline (reviewer->verifier) in the securit…
caohy1988 Jun 2, 2026
f48b985
demo(workflow): derive the planning message's capability list from th…
caohy1988 Jun 2, 2026
82bd893
feat(workflow): exportable FrozenWorkflowRecord — export_plan/import_…
caohy1988 Jun 2, 2026
0bc3323
fix(workflow): enforce schema_version + registry_version on import; i…
caohy1988 Jun 2, 2026
7dc25cf
docs(workflow): converge with ADK AgentConfig; answer storage/tools/o…
caohy1988 Jun 3, 2026
0061706
docs(workflow): correct AgentConfig convergence framing (§11)
caohy1988 Jun 3, 2026
11b3215
demo(workflow): add honest AgentConfig-convergence talking point
caohy1988 Jun 3, 2026
625d34b
feat(workflow): demonstrate AgentConfig lowering of the static subset…
caohy1988 Jun 3, 2026
5a91136
docs(workflow): note AgentConfig is deprecated/experimental; qualify …
caohy1988 Jun 3, 2026
1aebca2
demo(workflow): add AgentConfig deprecated/experimental caveat to dem…
caohy1988 Jun 3, 2026
3774e4a
docs(workflow): clarify config lowering vs loop_config raw YAML
caohy1988 Jun 3, 2026
f6b4b0e
docs(workflow): separate Workflow YAML target from deprecated config …
caohy1988 Jun 3, 2026
8d61f0e
demo(workflow): trim lowering output — drop per-block reason + messag…
caohy1988 Jun 4, 2026
98c3ed9
spike(workflow): pattern-coverage sweep, plan-quality lints, loop-car…
caohy1988 Jun 9, 2026
c6bc335
demo(workflow): independence-lints beat + cost beat; narrative and co…
caohy1988 Jun 9, 2026
47dc907
docs(workflow): mdformat table alignment in spike README
caohy1988 Jun 9, 2026
d449d1b
demo(workflow): quality-gate beat — adversarial ask, lint fires on ca…
caohy1988 Jun 9, 2026
1a2957e
spike+demo(workflow): contract-hash drift, auditable lint waivers, fr…
caohy1988 Jun 9, 2026
826ab5b
fix(workflow): fail closed on stripped contract hashes in import_plan
caohy1988 Jun 9, 2026
5d5ad04
demo(workflow): BQ Conversational Analytics planner — 7 prompts, 7 au…
caohy1988 Jun 9, 2026
b29ea8f
fix(ca-demo): brace-free instruction strings — ADK templates {identif…
caohy1988 Jun 9, 2026
17bc042
fix(ca-demo): stubs tolerate authored binding shapes (dict / JSON str…
caohy1988 Jun 9, 2026
502f470
feat(ca-demo): live question as task input — template reuse on replay
caohy1988 Jun 9, 2026
afd30e8
fix(ca-demo): specialized scenario triggers beat the ask-a-question f…
caohy1988 Jun 9, 2026
9c6b4a5
feat(ca-demo): render_chart capability — CA-style Vega-Lite chart art…
caohy1988 Jun 9, 2026
787e275
feat(ca-demo): intelligent mock executor (micro-warehouse intent engi…
caohy1988 Jun 10, 2026
f88aae6
fix(ca-demo): engine understands yearly/quarterly grains; trend chart…
caohy1988 Jun 10, 2026
667c834
feat(ca-demo): REAL BigQuery execution over the public thelook datase…
caohy1988 Jun 10, 2026
b4466f8
docs(ca-demo): README — real BigQuery backend, engine transparency, c…
caohy1988 Jun 10, 2026
35c035b
feat(ca-demo): default flow self-repairs from REAL BigQuery dry-run e…
caohy1988 Jun 10, 2026
d469636
feat(ca-demo): cross-session workflow reuse via an exported plan store
caohy1988 Jun 10, 2026
a203834
fix(ca-demo): review findings — isort order, question preserved throu…
caohy1988 Jun 10, 2026
298f926
docs(ca-demo): precise contract-hash scope — primitive helpers rely o…
caohy1988 Jun 10, 2026
7272008
feat(ca-demo): conversational intent gate — meta/chat turns answer di…
caohy1988 Jun 10, 2026
a3ad5dd
test(ca-demo): no-LLM end-to-end escape-path test; sync counts
caohy1988 Jun 10, 2026
039c8b9
feat(ca-demo): audit scenario takes LIVE insights — inline, last-gene…
caohy1988 Jun 11, 2026
530b222
feat(ca-demo): skeptic verdicts show their reasoning in a rendered au…
caohy1988 Jun 11, 2026
c496b1b
test(ca-demo): runtime isolation proof — skeptics see ONLY their own …
caohy1988 Jun 11, 2026
82c8244
feat(ca-demo)+fix(workflow): data-grounded skeptics + per-dispatch is…
caohy1988 Jun 12, 2026
c4943c5
demo(ca): plan_inspector.py — render the frozen-plan store as an anno…
caohy1988 Jun 12, 2026
d58f4ef
fix(ca-demo): scenario banner states the ACTUAL data backend
caohy1988 Jun 12, 2026
076ae11
feat(ca-demo): all 7 real dataset tables + REAL profiling from __TABL…
caohy1988 Jun 12, 2026
04b6262
feat(ca-demo): everything real — no simulated data anywhere in the li…
caohy1988 Jun 12, 2026
3786a51
feat(ca-demo): profiling discovers the LIVE table list (all 8, incl. …
caohy1988 Jun 12, 2026
c41e6df
fix(ca-demo): live table discovery excludes empty strays (row_count > 0)
caohy1988 Jun 12, 2026
3a4730a
feat(ca-demo): dashboard pipeline EXECUTES every panel (draft/dry-run…
caohy1988 Jun 12, 2026
83a9636
feat(ca-demo): SQL freezing (numeric determinism) + human-feedback re…
caohy1988 Jun 12, 2026
768c1d4
demo(ca): plan inspector renders the frozen-SQL store with revision h…
caohy1988 Jun 12, 2026
fd315c0
demo(ca): plan inspector renders the LIVE session flow as a timeline
caohy1988 Jun 12, 2026
0e034a1
demo(ca): inspector reframed to sell the RFC — frozen workflows with …
caohy1988 Jun 12, 2026
9592610
demo(ca): inspector scopes middle results to the rendered session's flow
caohy1988 Jun 12, 2026
b41323a
feat(ca-demo): frozen middle results record their workflow lineage
caohy1988 Jun 12, 2026
8 changes: 8 additions & 0 deletions .gitignore
@@ -121,3 +121,11 @@ CLAUDE.md

# Conformance test outputs (timestamped folders from --test mode)
**/conformance/20*-*-*_*-*-*/

# Generated by the authored_workflow_demo "Export plan" beat (sample output)
security_audit_plan.json

# ADK Web demo session stores (runtime)
demo_sessions*.db
ca_demo_sessions*.db
ca_plan_store/
132 changes: 132 additions & 0 deletions contributing/samples/workflows/authored_workflow_ca_demo/README.md
@@ -0,0 +1,132 @@
# ADK Web demo — model-authored workflows for BigQuery Conversational Analytics (RFC #93)

One agent, **seven prompts, seven workflow shapes**. Styled after [BigQuery
Conversational Analytics](https://docs.cloud.google.com/bigquery/docs/conversational-analytics):
a user asks data questions in natural language, and the planner **authors a
different typed `WorkflowSpec` per scenario** over Conversational-Analytics
capabilities (`nl2sql`, `dry_run`, `run_query`, `profile_table`, `skeptic`,
chart judging) against the public `thelook_ecommerce` dataset, the one the
CA docs demo against. **Query execution is REAL BigQuery** when
credentials allow: `dry_run` hits the actual BigQuery dry-run API (real
errors, real bytes-scanned) and `run_query` executes against
`bigquery-public-data.thelook_ecommerce`, billed to your
`GOOGLE_CLOUD_PROJECT` with safety rails (`maximum_bytes_billed` = 2 GB per
query, 500-row result cap). Multi-dimensional questions ("each region's
trend per year") return real grouped results and chart as multi-series
lines. Without credentials (or with `CA_DEMO_USE_BIGQUERY=0`), execution
falls back to a deterministic micro-warehouse (synthetic facts +
SQL-intent aggregation) so CI and credential-less machines keep working —
each dry-run/result beat carries an `engine` field (`bigquery` or `mock`)
so the demo never misrepresents its data source. The language steps
(NL2SQL, summaries, classification, skeptics) are live Gemini calls.
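
The engine choice and safety rails described above can be sketched in a few lines. This is a minimal illustration, not the demo's code: `pick_engine`, `tag_result`, and both constants are hypothetical names, the presence of `GOOGLE_CLOUD_PROJECT` stands in for a real credential check, and "2 GB" is read as 2 × 10^9 bytes, which may not match the demo's exact value.

```python
import os

# Assumed reading of "2 GB"; the demo may use 2 * 1024**3 instead.
MAX_BYTES_BILLED = 2 * 10**9
MAX_RESULT_ROWS = 500


def pick_engine(env=None):
    """Return "bigquery" or "mock" from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("CA_DEMO_USE_BIGQUERY", "1") == "0":
        return "mock"
    # Stand-in for a real credential check: require a billing project.
    return "bigquery" if env.get("GOOGLE_CLOUD_PROJECT") else "mock"


def tag_result(rows, engine):
    """Cap the rows and label the beat with the engine that produced it."""
    return {"engine": engine, "rows": list(rows)[:MAX_RESULT_ROWS]}
```

Carrying `engine` on every result, rather than logging it once at startup, is what lets a reader check the data source beat by beat.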

Every scenario runs the full #93 machinery: **author → validate →
independence lints → freeze (per-scenario key) → execute on the real engine
(#92 supervisor) → cost line**, and every shape is pinned in CI with the
language capabilities stubbed.

## 0. Configure a model (no hardcoded project)

```bash
export GOOGLE_GENAI_USE_VERTEXAI=1
export GOOGLE_CLOUD_PROJECT=<your-project>
export GOOGLE_CLOUD_LOCATION=global
export SPIKE_GEMINI_MODEL=gemini-3.5-flash
```

## 1. Run it

```bash
adk web contributing/samples/workflows/authored_workflow_ca_demo --port 8001
```

**Talk to it first** — the agent has a conversational gate (the RFC's
"no-plan escape hatch"): untriggered messages are intent-classified, and
meta/chit-chat turns get a direct answer instead of a workflow. Try:

```text
What kinds of workflow can you issue?
```

→ a plain-language catalogue of the seven shapes with example prompts — `0 planner calls, 0 queries`. Data questions proceed to the machinery below.
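
One way to picture the gate, as a sketch only: explicit scenario triggers are checked first, and untriggered messages go to an intent classifier. `route_message`, the trigger table, the scenario names, and the `"data"`/`"meta"` labels are all illustrative assumptions; in the demo the classifier is a live model call, replaced here by any callable.

```python
def route_message(message, triggers, classify):
    """Return ("workflow", scenario) or ("direct", None) for a user message.

    `triggers` maps a lowercase phrase to a scenario name; `classify` is any
    callable returning "data" or "meta". Both are illustrative.
    """
    text = message.lower()
    # Specialized scenario triggers win over the generic question path.
    for phrase, scenario in triggers.items():
        if phrase in text:
            return ("workflow", scenario)
    # Untriggered data questions still get a workflow; chat does not.
    if classify(message) == "data":
        return ("workflow", "ask_question")
    return ("direct", None)
```

A `("direct", None)` result corresponds to the `0 planner calls, 0 queries` path: nothing is authored and nothing is executed.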

Open the UI, pick `bq_ca_planner`, and send the prompts below — **one
scenario per prompt**, each authoring a different coordination shape:

| # | Send this prompt | Shape authored | CA story |
| --- | --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | `What was revenue by region last quarter?` | `loop_until(draft → REAL dry-run → repair) → run_query → render_chart + summarize` | the standard CA flow — **your actual question is the task input**, and a real BigQuery dry-run error (e.g. `TIMESTAMP_SUB ... YEAR`) feeds the repair round |
| 2 | `Profile data quality across the dataset tables.` | fan-out → synthesize | per-table profiling in parallel, one report |
| 3 | `Build a dashboard for these three questions.` | pipeline(`nl2sql → dry_run`) per item | each panel translated + validated barrier-free |
| 4 | `Route my question: what does order status 'Complete' mean?` | classify & route (branch) | metadata questions skip SQL planning — answered by a data-grounded agent that queries the REAL data (DISTINCT values, counts) |
| 5 | `Answer with SQL self-repair — the dry run is unreliable.` | loop_until + **loop-carried `init`** | a REALLY broken query (`thelook_ecommerce.order`) is checked by the REAL dry-run, repaired from the actual BigQuery error, then executed |
| 6 | `Audit this insight: <paste any claim>` (or just `audit that insight` after a question) | adversarial verification | **audits YOUR insights with DATA-GROUNDED skeptics** — each runs real BigQuery checks via its `query_thelook` tool and cites the numbers (the $1M-AOV claim is refuted with the actual ~$86 AOV); insights from your message, the session's last insight, or the canned fallback |
| 7 | `Pick the best chart for revenue by region.` | tournament | pairwise chart judging to a single winner |

What to point at as each one streams:

- **🗂️ scenario banner** — the expected shape, named before the model authors it.
- **📋 authored plan** — a *different* typed `WorkflowSpec` per prompt; same closed vocabulary every time.
- **✅ + 🧪 validation & independence lints** — every scenario lints clean; the provenance facts are statically provable from the bindings.
- **🔒 freeze (per-scenario key) + 📦 cross-session export** — every authored plan exports its full `FrozenWorkflowRecord` to `ca_plan_store/<scenario>.json`. **Re-send any prompt**: same hash, `0 planner calls (frozen replay)`. **Start a whole new session** and ask again: the plan is **imported from the store** through the RFC's defensive path. The spec hash is recomputed, the plan is re-validated against the current registry, and manual-version or declared-contract drift fails loudly: the rejection is shown, then a fresh plan is authored. The contract hash covers input kind plus declared output schema; the typed object-output capabilities declare output models so the hash has teeth, while primitive helpers like `sql_ok`/`judge_chart` return bare bool/str/list values and rely on manual versions. Your new question is then validated against the captured `task_input_schema` (cross-session **template reuse**). Plans now outlive sessions.
- **template reuse (scenario 1)** — after the first ask, send a *different* question (`What was revenue by region last year?`): the frozen plan is reused unchanged, your new question flows through it as new task input, and the mock rows change with the window (quarter vs year canned sets). Same plan, new data — the RFC's replay-vs-template distinction, live.
- **📈 chart** — scenarios 1 and 7 emit the Conversational-Analytics-style chart artifact: a **rendered chart image inline in the chat** (matplotlib, optional — falls back to a Unicode preview) plus the **Vega-Lite spec** (what the real CA API returns). Time-series rows infer a line mark; in the tournament, the bracket picks the mark and `render_chart` draws the data with it.
- **honest failure handling** — a query that still fails after repair returns empty rows + the real error (`engine: bigquery`); the mock warehouse is used ONLY when credentials are absent, never to paper over a failing query.
- **📄 result + 📊 cost** — real execution on the #92 supervisor; the repair scenario shows exactly one repair iteration (`Table not found … did you mean orders?` → fixed), the audit scenario rejects the implausible insight, the tournament returns `["bar"]`.
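
The export and defensive-import steps in the freeze bullet can be illustrated with a small sketch. It assumes the spec hash is a SHA-256 over canonical JSON and that a record carries a single `registry_version` string; the real `FrozenWorkflowRecord` layout and digest definition are not shown in this README, and `spec_hash`, `export_record`, and `import_record` are hypothetical names.

```python
import hashlib
import json


def spec_hash(spec):
    """SHA-256 over a canonical JSON encoding of the plan (assumed digest)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def export_record(spec, registry_version):
    """Bundle the plan with its hash and the registry version it was built on."""
    return {
        "spec": spec,
        "spec_hash": spec_hash(spec),
        "registry_version": registry_version,
    }


def import_record(record, current_registry_version):
    """Return the spec, or raise ValueError so the caller authors a fresh plan."""
    # A missing or stale hash fails closed: None never equals a digest.
    if record.get("spec_hash") != spec_hash(record["spec"]):
        raise ValueError("spec hash mismatch")
    if record.get("registry_version") != current_registry_version:
        raise ValueError("registry version drift")
    return record["spec"]
```

Recomputing the hash on import, instead of trusting the stored value, is what makes an edited or truncated store file show up as a rejection rather than as a silently different plan.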

Talking point for scenario 5 (the differentiated one): *the repair loop needs
**loop-carried state** — the drafting step reads the loop's own id to get the
prior round's failed dry-run output. That's `LoopUntil.init`, the vocabulary
gap the pattern-coverage sweep surfaced. And the whole loop is frozen and
replayable — a turn-by-turn agent retry never is.*
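
Loop-carried state is easiest to see in plain Python. The sketch below is not the `LoopUntil` block itself: `loop_until` and `draft_and_check` are hypothetical stand-ins, and the error string is a placeholder rather than a real BigQuery message. Each round receives the previous round's output, and `init` seeds the first round.

```python
def loop_until(body, done, init, max_rounds=3):
    """Run body(prev) until done(result) holds; `init` seeds the first round."""
    prev = init
    for round_no in range(1, max_rounds + 1):
        prev = body(prev)
        if done(prev):
            return prev, round_no
    raise RuntimeError("loop did not converge")


def draft_and_check(prev):
    """Stand-in for draft + dry-run: fix the table name once an error is seen."""
    table = "orders" if prev["error"] else "order"
    sql = f"SELECT COUNT(*) FROM thelook_ecommerce.{table}"
    error = None if table == "orders" else "Table not found: order"
    return {"sql": sql, "error": error}


result, rounds = loop_until(
    draft_and_check,
    done=lambda r: r["error"] is None,
    init={"sql": None, "error": None},
)
```

Here the first round drafts the broken query and the second round repairs it from the carried error, so `rounds` is 2, which corresponds to one repair iteration.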

## 2. Correctness proof (no LLM, no BigQuery)

```bash
pytest contributing/samples/workflows/authored_workflow_ca_demo/test_ca_demo_agent.py -q # 38 collected (one live-gated; one gated on the patched ADK wrapper)
```

All seven expected shapes are built by hand, validated + lint-checked against
the demo registry, and **executed end-to-end** with the language capabilities
stubbed: the loop repairs exactly once, the branch routes the metadata
question away from SQL, the audit rejects the implausible insight, and the
tournament converges to `bar` and renders it as a Vega-Lite chart artifact.
The fan-out and tournament scenarios execute against the **live** registry
(their capabilities are deterministic mocks).

## SQL freezing + human-feedback revision

Plan freezing pins the *process*; **SQL freezing pins the numbers**. After a
question's SQL passes the real dry-run, it's frozen to
`ca_plan_store/sql/<question-digest>.json`. Re-ask the exact question (any
session): the drafting LLM is **skipped**, the frozen SQL re-validates
(doubling as warehouse-drift detection) and replays — live-verified
identical results run-to-run. Then govern it with feedback:

```text
revise: exclude orders with status Cancelled or Returned
```

→ the SQL is revised to follow the feedback, must pass the REAL dry-run
before it replaces the frozen artifact, and the feedback itself is recorded
in the artifact's `revisions` history — who changed the query and why,
auditable. A failed revision leaves the frozen SQL untouched.
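
The freeze-then-revise rule can be sketched with an in-memory dict standing in for `ca_plan_store/sql/`. Everything here is an assumption for illustration: the digest is a SHA-256 of the exact question text, the artifact has only `sql` and `revisions` fields, and `dry_run_ok` is any callable standing in for the real dry-run.

```python
import hashlib


def question_digest(question):
    """Digest of the exact question text (the demo's normalization is unknown)."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()


def freeze_sql(store, question, sql):
    """Pin validated SQL for a question, with an empty revision history."""
    store[question_digest(question)] = {"sql": sql, "revisions": []}


def revise_sql(store, question, new_sql, feedback, dry_run_ok):
    """Replace the frozen SQL only if the revised SQL passes the dry-run."""
    artifact = store[question_digest(question)]
    if not dry_run_ok(new_sql):
        return False  # failed revision: the frozen SQL stays untouched
    artifact["sql"] = new_sql
    artifact["revisions"].append({"feedback": feedback, "sql": new_sql})
    return True
```

Running the dry-run check before any mutation is the design choice that keeps a bad revision from corrupting an artifact that previously replayed cleanly.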

## Notes

- Honesty: like the security-audit demo, scenario recipes are
instruction-guided so each prompt reliably authors its intended shape; the
free-decomposition evidence is the spike's demand gate and the main demo's
free-authoring beat. The *variety* — seven shapes from one closed
vocabulary — is the claim here.
- Nothing in the live path is simulated anymore: the repair scenario
checks a really-broken query against the real dry-run; transient-failure
simulation now lives only in the CI test stubs.
- Frozen plans are per-scenario (`authored_workflow:ca:<scenario>`) in
session state, AND exported per-scenario to `ca_plan_store/` for
cross-session reuse (delete a file to force fresh authoring; the store is
the demo's stand-in for the ArtifactService in the RFC's revised Q1).
- Scenario 1 takes your live message as the question; the other six prompts
are mode selectors with canned task inputs (their results don't change
with your wording). Query answers come from real BigQuery when
credentials allow (check the `engine` field in the dry-run/result beats);
otherwise the deterministic micro-warehouse.
@@ -0,0 +1,15 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from . import agent # noqa: F401