Skip to content

feat(weave): add agent_turn/agent_conversation/agent_span ref kinds#6684

Draft
nikumar1206 wants to merge 57 commits intoben/genai-query-layerfrom
nikhil/agent-ref-kinds
Draft

feat(weave): add agent_turn/agent_conversation/agent_span ref kinds#6684
nikumar1206 wants to merge 57 commits intoben/genai-query-layerfrom
nikhil/agent-ref-kinds

Conversation

@nikumar1206
Copy link
Copy Markdown
Contributor

@nikumar1206 nikumar1206 commented Apr 22, 2026

Description

  • Fixes WB-NNNNN
  • Fixes #NNNN

First of three stacked PRs for Phase 1 of human feedback on agent turns.

Adds three new ref kinds (external + internal): `agent_turn/{trace_id}`, `agent_conversation/{conversation_id}`, `agent_span/{span_id}`. No consumers yet — just the grammar. `conversation_id` is URL-encoded since it's app-chosen.

Stack: this → #6685#6686

Testing

  • unit tests

bcsherma and others added 30 commits April 22, 2026 14:34
Add ClickHouse migration 027 with v4 genai schema (genai_spans MergeTree,
genai_agents/genai_agent_versions AggregatingMergeTree MVs,
genai_message_search ReplacingMergeTree) and wire up 12 query methods on
ClickHouseTraceServer via delegation to AgentQueryHandler/AgentWriteHandler.

Includes 6 integration tests covering table creation, span queries, trace
aggregation, MV-backed agent stats, conversation grouping, and message search.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration: ReplacingMergeTree, monthly partitions, TTL, ngrambf
  indexes on dimensions, wb_run_id/step columns, raw_span_dump
- genai_extraction.py: semconv field extraction from OTel spans with
  provider fallback chains (OpenAI, Google ADK, Anthropic/Traceloop)
- OTel ingest handler on AgentWriteHandler with batch insert
- Fix conversations query to use time bounds (5s → 10ms at 50M)
- Add genai_otel_export to ClickHouseTraceServer

Benchmarked at 50M spans (200KB messages): all queries <200ms
round-trip except full-text search. CH server-side p50 is 5-16ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sort_by/custom_filters are pydantic models — the .get() fallback
branches were unreachable. Also collapse the hand-rolled sort loop in
conversations_query onto build_order_by (extended with a column_exprs
map for the arrayElement aliases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every SELECT the agent handlers emit is now constructed in
query_builder/agent_query_builder.py via make_*_query functions, so the
SQL can be unit-tested against a ParamBuilder without a live ClickHouse
(matches the pattern in annotation_queues_query_builder,
objects_query_builder, etc).

Handlers now take a QueryFn callable — the server's bound _query method
— instead of a raw CHClient. Every agent query now participates in the
existing ddtrace span, structured logging (clickhouse_query with
duration/summary), ch_settings merging, and handle_clickhouse_query_error
normalization that the rest of the trace server gets. Before, agent
queries used self._ch.query() raw and bypassed all of that.

AgentWriteHandler takes both the ch_client (for batch insert, which has
no wrapper) and the query_fn (for the read query that feeds the chat
projection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the pattern in test_annotation_queues_query_builder.py,
test_objects_query_builder.py, etc — build a ParamBuilder, call the
make_*_query function directly, assert the entire formatted SQL plus
the full param dict (values and slot names).

The old test_genai_query_sql.py assertions were often just substring
checks (`assert "ORDER BY input_tokens asc" in sql`) or param-value-set
checks; the new tests diff the whole query against a verbatim expected
block, so we actually catch SELECT/FROM/WHERE/GROUP BY regressions.

add_span_filters now iterates sorted(SPAN_FILTERABLE_COLS) so multi-
filter test cases get deterministic param slot ordering.

28 tests covering: spans count/list/trace, traces count/list,
agents count/list, agent_versions count/list, conversations count/list
(including the arrayElement sort alias), message_search, conversation
chat spans, build_order_by, and parameterization safety.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Type genai_otel_export as (GenAIOTelExportReq) -> GenAIOTelExportRes
  instead of Any -> Any.
- Extract the magic `50` trace cap in conversation_chat into a module
  constant MAX_CONVERSATION_CHAT_TURNS.
- Drop test_genai_tables_created (redundant — if the migration didn't
  run, every other test in the file would fail too).
- Fix stale migration number in the file docstring (027 -> 029).
- test_message_search docstring no longer claims it verifies the
  tokenbf_v1 index was used — it checks LIKE returns expected rows.
- agent_schema.py docstring points to semconv.py as the single source
  of truth for attribute conventions rather than re-describing them
  inconsistently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the existing add_X naming convention (023_add_annotation_queues,
025_add_tags_and_aliases, 027_add_cache_token_costs,
028_add_bloom_filter_id_index). The migration creates spans, agents,
agent_versions, two materialized views, and message_search — not just
spans — so the old name was misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
029 is now 029_add_ttl_support on master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The injection substring checks (DROP TABLE not in query) duplicate what
the rest of the suite already proves: every value goes through
ParamBuilder, which emits {genai_N:Type} placeholders — you can see
that in every full-SQL assertion. If someone broke parameterization,
the existing shape tests would fail before these would.

test_list_sort_by_array_alias now uses assert_sql against the full
conversations GROUP BY instead of a substring check for the
arrayElement(agent_names, 1) ORDER BY fragment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review: the ~18 granular tests across 9 classes were
mostly exercising individual branches of build_chat_messages in
isolation — coverage for its own sake. Replaced with one comprehensive
test that builds a realistic agent trace (invoke_agent root with tool,
handoff, and chat children) and asserts every message type is emitted
with expected content + token aggregation from child spans. Plus a
small empty-input smoke test.

Integration tests in test_genai_agent_queries.py already cover the
projection end-to-end against ClickHouse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin: the per-extractor tests (TestExtractProvider,
TestTokenExtraction, TestReasoningContent, etc.) each verified 1-3
lookups of a single key — coverage for its own sake. Replaced with one
comprehensive test that builds a span exercising every extractor at
once (weave vs gen_ai precedence, tokens + computed total, messages +
parts + finish_reason, tool call fields, compaction, content_refs,
reasoning from parts, custom attrs str/int/float, raw dumps), and one
separate test for ERROR status which is a distinct code path (reads
from Span.status, not attributes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review: group the agent-related source files under a
common subfolder instead of sprinkling agent_*.py across trace_server/.

- agent_chat_view.py    -> agents/chat_view.py
- agent_clickhouse.py   -> agents/clickhouse.py
- agent_helpers.py      -> agents/helpers.py
- agent_schema.py       -> agents/schema.py
- agent_types.py        -> agents/types.py
- semconv.py            -> agents/semconv.py

The SQL query builder stays under query_builder/ next to its peers
(annotation_queues_query_builder, objects_query_builder, etc.) so it's
easy to find alongside other query builders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review. Makes the migration safe to reapply (idempotent)
if it's ever interrupted partway through — matches the behavior of
CREATE DATABASE IF NOT EXISTS that the trace server does on init.

Also applied to the two MATERIALIZED VIEWs, so all six DDL statements
in the migration are idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bcsherma and others added 2 commits April 22, 2026 14:34
NOISE_TOOL_NAMES was a frozenset of tool names (``(merged tools)``,
``(merged)``, ``transfer_to_agent``) that the chat projection
silently dropped from the rendered trajectory. Three problems:

1. ``transfer_to_agent`` was already dead code — the preceding
   ``startswith("transfer_to_")`` branch claims it as an agent_handoff
   before the NOISE check runs.
2. ``(merged tools)`` / ``(merged)`` are instrumentation artifacts
   from specific frameworks. Baking framework-specific tool names
   into Weave's backend is leakage — if the framework changes the
   placeholder, our list is silently wrong.
3. Render-time filtering is a UI concern. A backend that quietly
   drops spans from the chat view while leaving them visible in
   direct spans queries gives users inconsistent answers with no
   signal.

Delete the constant, the import, and the ``elif`` check. Tool calls
are now emitted unconditionally from ``execute_tool`` spans that
aren't a ``transfer_to_*`` handoff. If a producer later turns out to
emit junk spans that genuinely need filtering, the right fixes are
at ingest (recognize and flag the synthetic span) or in the frontend
(user-toggleable filter), not a hidden backend allowlist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatter fallout from the review-addressing commits. No
behavior change.

- ruff D209 on test_genai_agent_queries.py docstring
- ruff format line-wrap on types.py Field(...) default expressions
- ruff format on test_genai_chat_view.py / test_genai_extraction.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

nikumar1206 commented Apr 22, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 77.52809% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
weave/trace/refs.py 72.34% 13 Missing ⚠️
weave/shared/refs_internal.py 82.85% 3 Missing and 3 partials ⚠️
weave/trace/ref_conversion.py 85.71% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@nikumar1206 nikumar1206 force-pushed the nikhil/agent-ref-kinds branch from dd3668c to c6ebd20 Compare April 22, 2026 21:44
bcsherma and others added 6 commits April 22, 2026 16:09
Two bugs flagged in review:

1. ``agents_mv`` / ``agent_versions_mv`` computed duration as
   ``toUInt64(toUnixTimestamp64Milli(ended_at) -
   toUnixTimestamp64Milli(started_at))``. When ``ended_at`` was left
   at the schema default (epoch), the subtraction was hugely negative
   and the UInt64 cast wrapped modularly to ~2^64, permanently
   poisoning the ``total_duration_ms`` SimpleAggregateFunction(sum)
   for that (project, agent) pair. Guard the expression with
   ``if(ended_at > started_at, toUInt64(...), toUInt64(0))``.

2. ``_find_user_prompt`` in ``chat_view.py`` sorted spans by
   ``s.started_at or ""`` which mixes datetime and str and raises
   TypeError for any span where ``started_at is None`` — latent today
   (CH column is non-null) but a trap for in-memory callers and tests.
   Switch to the same tuple-key shape we use in ``build_span_tree``:
   ``(started_at is None, started_at, span_id)``.

Integration tests cover both: one inserts a span with defaulted
``ended_at`` and verifies ``total_duration_ms`` stays human-sized
instead of wrapping; one constructs a trace with a null-started
invoke_agent span and confirms ``build_trace_chat`` doesn't crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Python has four primitive attribute types (str / int / float / bool);
we had Maps for three. Bools were being coerced to ``"true"`` /
``"false"`` strings in the ``custom_attrs`` map — lossy round-trip,
awkward aggregation (``countIf(custom_attrs['x'] = 'true')`` instead
of ``sum(custom_attrs_bool['x'])``), and no clean way to write
``custom_attrs_bool.is_streaming`` in the Query DSL.

Adds ``custom_attrs_bool Map(String, Bool)`` on the ``spans`` table
and threads it through schema, extraction, compiler, and group-by
sources. The bool branch in ``_extract_custom_attrs`` must stay
before the int branch because Python's ``bool`` is a subclass of
``int``; ``_LITERAL_TYPE_TO_MAP`` dispatch is safe because
``dict.get(type(value))`` uses exact-type lookup rather than MRO.

No migration chain issue — migration 030 hasn't shipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat view previously recognized ``handoff`` / ``agent_handoff``
operation names and ``transfer_to_*`` tool-call names as special
boundaries, emitting a dedicated ``agent_handoff`` chat message type.
Pulling those extensions out — not a scope we want to promote as
first-class at this point. Producers that emit those operations still
land in the spans table and render as normal tool calls or
default-projected spans.

Changes:
- ``chat_view.py::_walk`` — remove the ``transfer_to_*`` prefix branch
  and the ``handoff`` / ``agent_handoff`` operation-name block.
- ``AgentChatMessage.type`` Literal — drop ``agent_handoff``. Five
  message types remain: ``user_message``, ``agent_message``,
  ``tool_call``, ``agent_start``, ``context_compacted``.
- test_genai_chat_view: remove the ``transfer_to_sales_agent`` fixture
  span from ``test_full_agent_turn`` and drop the associated
  assertions.

Also expands ``WEAVE_OTEL_SEMANTIC_CONVENTIONS.md`` from a thin column
list into a spec-relative reference: adds instrumentation-key columns
(``gen_ai.*`` / ``weave.*``) per attribute, marks every Weave-only row
explicitly, adds sections covering stability status, key-naming
conventions (short-form / aliases / legacy fallbacks), and a "Known
gaps vs. the OTel spec" section calling out the retrieval / embedding
attributes we don't extract plus the entire events API and metrics
tracks that are out of scope today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The grouped spans aggregate was publishing countIf(invoke_agent) under
the name turn_count, while trace_count held uniqExact(trace_id) — the
actual turn count per our data model. Rename to invocation_count (what
it really is), drop trace_count (we track invocations, not turns, in
the aggregate), and update types and tests to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the five per-source materialized views (output_messages,
input_messages, system_instructions, tool_call_arguments,
tool_call_result) that populate the messages search table with a
single MV that fans the sources into rows via arrayConcat + ARRAY JOIN,
tagged by role. One INSERT into messages per ingest block instead of
five, one projection to maintain, and downstream per-source breakdowns
remain available via GROUP BY role.

Also adds DATA_MODEL.md, which spells out how spans, sessions, turns,
invocations, and the messages projection relate.

Migration 030 is edited in place; local dev DB was cleaned and
backfilled through the new MV, and the test DB is auto-rebuilt from
the migration on each session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Single-backticks in comments/docstrings across agents/ and the
  agent query builders (was RST ``literal``, switch to Markdown-style `literal`).
- Rename AgentWriteHandler.self._ch -> self.ch_client.
- Extract SPAN_KIND_UNSPECIFIED and SEARCH_CONTENT_PREVIEW_CHARS
  constants; reuse the existing EXPIRE_AT_NEVER from clickhouse_schema.
- Extract _span_sort_key helper for the (started_at is None, started_at,
  span_id) idiom used in build_span_tree and _find_user_prompt.
- zip(col_names, row, strict=True) in _rows_as_dicts so CH column/row
  shape drift surfaces instead of silently truncating.
- Drop the `_: Any = None` unused-import silencer in agent_query_compiler;
  Any is no longer used in the module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bcsherma bcsherma force-pushed the ben/genai-query-layer branch from 25c84c6 to b1086fa Compare April 22, 2026 23:17
bcsherma and others added 3 commits April 22, 2026 17:02
Batches a set of cosmetic/low-risk fixes from PR review:

- `status_code` typed as `Literal["UNSET","OK","ERROR"]` — matches the
  ClickHouse `Enum8` exactly.
- `= []` defaults replaced with `Field(default_factory=list)` across
  eight list fields on `AgentSpanCHInsertable`.
- `str(x)` coercion in `unpack_string_array`; `normalize_span_row`
  rewritten as a `for` loop.
- Extract `NO_CONVERSATION_LABEL`, `MAX_INGEST_ERRORS_REPORTED`, and
  `OP_INVOKE_AGENT` to `constants.py`; wire them through.
- Ingest error messages: `logger.exception(...)` + surface only the
  exception type name to the client (never raw `str(e)`).
- Guard `AgentQueryHandler.search` sort against `None` `last_activity`.
- `_hydrate_group_row` gets a real docstring; `_compute_duration_ms` and
  `_dt_to_iso` log on their bare-except paths.
- Comments: trace == turn is a convention (pointing to DATA_MODEL), the
  conversation-chat truncation depends on the SQL `ORDER BY ASC`, and
  `conversation_name` is Weave-specific (not OTel).
- Module docstring nit: drop the "not gen_ai_usage_input_tokens" example.
- Log when the chat-view walk truncates at `MAX_WALK_DEPTH`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The document rendered too large to diff on GitHub, which blunted its
value as an in-tree living spec. Removing it for now; any pieces worth
keeping will land closer to the code they describe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comments on the chat-view recursion and the user-prompt
fallback:

- Expanded `build_chat_messages` docstring explaining the emission rules
  (user_message, agent_start, tool_call, agent_message, context_compacted)
  and the preorder walk order.
- Added a docstring to the inner `_walk` that names which closure
  variables it mutates (`messages`, `emitted_span_ids`) and what
  `nearest_agent` threads through the recursion.
- Documented `total_duration_ms` on the trace chat response as the
  root span's wall-clock duration (not a sum over subtree spans).
- Debug-log invoke_agent spans that arrive without `gen_ai.agent.name`
  so the silent "no agent_start divider" path is traceable.
- Comment on the orphan-as-root branch in `build_span_tree`: when a
  referenced parent isn't in the input, we promote the span to a
  synthetic root instead of dropping its subtree.
- Tightened `_extract_user_text` fallback: now excludes `assistant`
  and `tool` messages when no `role == "user"` is found, so we can't
  accidentally surface a model reply or tool result as "the user prompt".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nikumar1206 nikumar1206 force-pushed the nikhil/agent-ref-kinds branch from c6ebd20 to d7439af Compare April 23, 2026 16:49
@nikumar1206 nikumar1206 force-pushed the ben/genai-query-layer branch from ade26c3 to 5868d80 Compare April 23, 2026 16:49
@nikumar1206 nikumar1206 changed the title feat(weave): add agent_turn/agent_conversation/agent_step ref kinds feat(weave): add agent_turn/agent_conversation/agent_span ref kinds Apr 23, 2026
@nikumar1206 nikumar1206 force-pushed the ben/genai-query-layer branch from 5868d80 to ade26c3 Compare April 23, 2026 17:05
@nikumar1206 nikumar1206 force-pushed the nikhil/agent-ref-kinds branch from d7439af to f3d9e4c Compare April 23, 2026 17:09
bcsherma and others added 5 commits April 23, 2026 11:44
… its child LLM output

Single-call agent turns in the OpenAI Agents SDK and Google ADK land as
an invoke_agent span whose `output_messages` mirror the inner chat
span's `output_messages`. The projection previously emitted from both —
the child `chat` span emitted first, then the parent invoke_agent
emitted the same text again on the way back up the walk — because the
`emitted_span_ids` dedup set only blocked a span from emitting twice,
not a parent from emitting after its child already had.

Rework `_walk` to return a bool signalling whether its subtree emitted
an `agent_message`. The enclosing invoke_agent only emits its own
mirrored response if the subtree didn't. Drop `emitted_span_ids`
entirely — it wasn't protecting against this case and in a normal tree
walk the check never fires.

Also:
- Collapse `_find_user_prompt`'s two-pass `for prefer_invoke_agent in
  (True, False)` loop into a single sorted iteration, prioritising
  invoke_agent via the sort key.
- Add regression tests for both the mirrored-emit case (now emits once)
  and the solo invoke_agent case (still emits, since no descendant did).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename `AgentQueryHandler.search` → `search_messages`, `_trace_detail_spans`
  → `trace_detail_spans`, `AgentWriteHandler.otel_export` → `insert_otel_spans`
  for clearer intent.
- Use `model_validate` instead of `model_construct` so span hydration actually
  validates.
- Fix `last_activity` in `search_messages`: previously set only on first-seen
  row, now tracked as max across all rows for a conversation.
- Log at debug when `_extract_user_text` falls back past strict role=user.
- Drop `turn_count` from `AgentConversationChatRes` (client can use
  `len(turns)`).
- Clean up `` backticks in test docstrings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… its child LLM output

Single-call agent turns in the OpenAI Agents SDK and Google ADK land as
an invoke_agent span whose `output_messages` mirror the inner chat
span's `output_messages`. The projection previously emitted from both —
the child `chat` span emitted first, then the parent invoke_agent
emitted the same text again on the way back up the walk — because the
`emitted_span_ids` dedup set only blocked a span from emitting twice,
not a parent from emitting after its child already had.

Rework `_walk` to return a bool signalling whether its subtree emitted
an `agent_message`. The enclosing invoke_agent only emits its own
mirrored response if the subtree didn't. Drop `emitted_span_ids`
entirely — it wasn't protecting against this case and in a normal tree
walk the check never fires.

Also:
- Collapse `_find_user_prompt`'s two-pass `for prefer_invoke_agent in
  (True, False)` loop into a single sorted iteration, prioritising
  invoke_agent via the sort key.
- Add regression tests for both the mirrored-emit case (now emits once)
  and the solo invoke_agent case (still emits, since no descendant did).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename `AgentQueryHandler.search` → `search_messages`, `_trace_detail_spans`
  → `trace_detail_spans`, `AgentWriteHandler.otel_export` → `insert_otel_spans`
  for clearer intent.
- Use `model_validate` instead of `model_construct` so span hydration actually
  validates.
- Fix `last_activity` in `search_messages`: previously set only on first-seen
  row, now tracked as max across all rows for a conversation.
- Log at debug when `_extract_user_text` falls back past strict role=user.
- Drop `turn_count` from `AgentConversationChatRes` (client can use
  `len(turns)`).
- Clean up `` backticks in test docstrings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bcsherma bcsherma force-pushed the ben/genai-query-layer branch from 9593dba to a9d2a72 Compare April 24, 2026 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants