feat(weave): add agent_turn/agent_conversation/agent_span ref kinds#6684
Draft
nikumar1206 wants to merge 57 commits intoben/genai-query-layerfrom
Draft
feat(weave): add agent_turn/agent_conversation/agent_span ref kinds#6684nikumar1206 wants to merge 57 commits intoben/genai-query-layerfrom
nikumar1206 wants to merge 57 commits intoben/genai-query-layerfrom
Conversation
Add ClickHouse migration 027 with v4 genai schema (genai_spans MergeTree, genai_agents/genai_agent_versions AggregatingMergeTree MVs, genai_message_search ReplacingMergeTree) and wire up 12 query methods on ClickHouseTraceServer via delegation to AgentQueryHandler/AgentWriteHandler. Includes 6 integration tests covering table creation, span queries, trace aggregation, MV-backed agent stats, conversation grouping, and message search. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration: ReplacingMergeTree, monthly partitions, TTL, ngrambf indexes on dimensions, wb_run_id/step columns, raw_span_dump - genai_extraction.py: semconv field extraction from OTel spans with provider fallback chains (OpenAI, Google ADK, Anthropic/Traceloop) - OTel ingest handler on AgentWriteHandler with batch insert - Fix conversations query to use time bounds (5s → 10ms at 50M) - Add genai_otel_export to ClickHouseTraceServer Benchmarked at 50M spans (200KB messages): all queries <200ms round-trip except full-text search. CH server-side p50 is 5-16ms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sort_by/custom_filters are pydantic models — the .get() fallback branches were unreachable. Also collapse the hand-rolled sort loop in conversations_query onto build_order_by (extended with a column_exprs map for the arrayElement aliases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every SELECT the agent handlers emit is now constructed in query_builder/agent_query_builder.py via make_*_query functions, so the SQL can be unit-tested against a ParamBuilder without a live ClickHouse (matches the pattern in annotation_queues_query_builder, objects_query_builder, etc). Handlers now take a QueryFn callable — the server's bound _query method — instead of a raw CHClient. Every agent query now participates in the existing ddtrace span, structured logging (clickhouse_query with duration/summary), ch_settings merging, and handle_clickhouse_query_error normalization that the rest of the trace server gets. Before, agent queries used self._ch.query() raw and bypassed all of that. AgentWriteHandler takes both the ch_client (for batch insert, which has no wrapper) and the query_fn (for the read query that feeds the chat projection). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the pattern in test_annotation_queues_query_builder.py, test_objects_query_builder.py, etc — build a ParamBuilder, call the make_*_query function directly, assert the entire formatted SQL plus the full param dict (values and slot names). The old test_genai_query_sql.py assertions were often just substring checks (`assert "ORDER BY input_tokens asc" in sql`) or param-value-set checks; the new tests diff the whole query against a verbatim expected block, so we actually catch SELECT/FROM/WHERE/GROUP BY regressions. add_span_filters now iterates sorted(SPAN_FILTERABLE_COLS) so multi- filter test cases get deterministic param slot ordering. 28 tests covering: spans count/list/trace, traces count/list, agents count/list, agent_versions count/list, conversations count/list (including the arrayElement sort alias), message_search, conversation chat spans, build_order_by, and parameterization safety. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Type genai_otel_export as (GenAIOTelExportReq) -> GenAIOTelExportRes instead of Any -> Any. - Extract the magic `50` trace cap in conversation_chat into a module constant MAX_CONVERSATION_CHAT_TURNS. - Drop test_genai_tables_created (redundant — if the migration didn't run, every other test in the file would fail too). - Fix stale migration number in the file docstring (027 -> 029). - test_message_search docstring no longer claims it verifies the tokenbf_v1 index was used — it checks LIKE returns expected rows. - agent_schema.py docstring points to semconv.py as the single source of truth for attribute conventions rather than re-describing them inconsistently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the existing add_X naming convention (023_add_annotation_queues, 025_add_tags_and_aliases, 027_add_cache_token_costs, 028_add_bloom_filter_id_index). The migration creates spans, agents, agent_versions, two materialized views, and message_search — not just spans — so the old name was misleading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
029 is now 029_add_ttl_support on master. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The injection substring checks (DROP TABLE not in query) duplicate what
the rest of the suite already proves: every value goes through
ParamBuilder, which emits {genai_N:Type} placeholders — you can see
that in every full-SQL assertion. If someone broke parameterization,
the existing shape tests would fail before these would.
test_list_sort_by_array_alias now uses assert_sql against the full
conversations GROUP BY instead of a substring check for the
arrayElement(agent_names, 1) ORDER BY fragment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review: the ~18 granular tests across 9 classes were mostly exercising individual branches of build_chat_messages in isolation — coverage for its own sake. Replaced with one comprehensive test that builds a realistic agent trace (invoke_agent root with tool, handoff, and chat children) and asserts every message type is emitted with expected content + token aggregation from child spans. Plus a small empty-input smoke test. Integration tests in test_genai_agent_queries.py already cover the projection end-to-end against ClickHouse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin: the per-extractor tests (TestExtractProvider, TestTokenExtraction, TestReasoningContent, etc.) each verified 1-3 lookups of a single key — coverage for its own sake. Replaced with one comprehensive test that builds a span exercising every extractor at once (weave vs gen_ai precedence, tokens + computed total, messages + parts + finish_reason, tool call fields, compaction, content_refs, reasoning from parts, custom attrs str/int/float, raw dumps), and one separate test for ERROR status which is a distinct code path (reads from Span.status, not attributes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review: group the agent-related source files under a common subfolder instead of sprinkling agent_*.py across trace_server/. - agent_chat_view.py -> agents/chat_view.py - agent_clickhouse.py -> agents/clickhouse.py - agent_helpers.py -> agents/helpers.py - agent_schema.py -> agents/schema.py - agent_types.py -> agents/types.py - semconv.py -> agents/semconv.py The SQL query builder stays under query_builder/ next to its peers (annotation_queues_query_builder, objects_query_builder, etc.) so it's easy to find alongside other query builders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per griffin's review. Makes the migration safe to reapply (idempotent) if it's ever interrupted partway through — matches the behavior of CREATE DATABASE IF NOT EXISTS that the trace server does on init. Also applied to the two MATERIALIZED VIEWs, so all six DDL statements in the migration are idempotent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NOISE_TOOL_NAMES was a frozenset of tool names (``(merged tools)``,
``(merged)``, ``transfer_to_agent``) that the chat projection
silently dropped from the rendered trajectory. Three problems:
1. ``transfer_to_agent`` was already dead code — the preceding
``startswith("transfer_to_")`` branch claims it as an agent_handoff
before the NOISE check runs.
2. ``(merged tools)`` / ``(merged)`` are instrumentation artifacts
from specific frameworks. Baking framework-specific tool names
into Weave's backend is leakage — if the framework changes the
placeholder, our list is silently wrong.
3. Render-time filtering is a UI concern. A backend that quietly
drops spans from the chat view while leaving them visible in
direct spans queries gives users inconsistent answers with no
signal.
Delete the constant, the import, and the ``elif`` check. Tool calls
are now emitted unconditionally from ``execute_tool`` spans that
aren't a ``transfer_to_*`` handoff. If a producer later turns out to
emit junk spans that genuinely need filtering, the right fixes are
at ingest (recognize and flag the synthetic span) or in the frontend
(user-toggleable filter), not a hidden backend allowlist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatter fallout from the review-addressing commits. No behavior change. - ruff D209 on test_genai_agent_queries.py docstring - ruff format line-wrap on types.py Field(...) default expressions - ruff format on test_genai_chat_view.py / test_genai_extraction.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
This was referenced Apr 22, 2026
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
dd3668c to
c6ebd20
Compare
Two bugs flagged in review: 1. ``agents_mv`` / ``agent_versions_mv`` computed duration as ``toUInt64(toUnixTimestamp64Milli(ended_at) - toUnixTimestamp64Milli(started_at))``. When ``ended_at`` was left at the schema default (epoch), the subtraction was hugely negative and the UInt64 cast wrapped modularly to ~2^64, permanently poisoning the ``total_duration_ms`` SimpleAggregateFunction(sum) for that (project, agent) pair. Guard the expression with ``if(ended_at > started_at, toUInt64(...), toUInt64(0))``. 2. ``_find_user_prompt`` in ``chat_view.py`` sorted spans by ``s.started_at or ""`` which mixes datetime and str and raises TypeError for any span where ``started_at is None`` — latent today (CH column is non-null) but a trap for in-memory callers and tests. Switch to the same tuple-key shape we use in ``build_span_tree``: ``(started_at is None, started_at, span_id)``. Integration tests cover both: one inserts a span with defaulted ``ended_at`` and verifies ``total_duration_ms`` stays human-sized instead of wrapping; one constructs a trace with a null-started invoke_agent span and confirms ``build_trace_chat`` doesn't crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Python has four primitive attribute types (str / int / float / bool); we had Maps for three. Bools were being coerced to ``"true"`` / ``"false"`` strings in the ``custom_attrs`` map — lossy round-trip, awkward aggregation (``countIf(custom_attrs['x'] = 'true')`` instead of ``sum(custom_attrs_bool['x'])``), and no clean way to write ``custom_attrs_bool.is_streaming`` in the Query DSL. Adds ``custom_attrs_bool Map(String, Bool)`` on the ``spans`` table and threads it through schema, extraction, compiler, and group-by sources. The bool branch in ``_extract_custom_attrs`` must stay before the int branch because Python's ``bool`` is a subclass of ``int``; ``_LITERAL_TYPE_TO_MAP`` dispatch is safe because ``dict.get(type(value))`` uses exact-type lookup rather than MRO. No migration chain issue — migration 030 hasn't shipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat view previously recognized ``handoff`` / ``agent_handoff`` operation names and ``transfer_to_*`` tool-call names as special boundaries, emitting a dedicated ``agent_handoff`` chat message type. Pulling those extensions out — not a scope we want to promote as first-class at this point. Producers that emit those operations still land in the spans table and render as normal tool calls or default-projected spans. Changes: - ``chat_view.py::_walk`` — remove the ``transfer_to_*`` prefix branch and the ``handoff`` / ``agent_handoff`` operation-name block. - ``AgentChatMessage.type`` Literal — drop ``agent_handoff``. Five message types remain: ``user_message``, ``agent_message``, ``tool_call``, ``agent_start``, ``context_compacted``. - test_genai_chat_view: remove the ``transfer_to_sales_agent`` fixture span from ``test_full_agent_turn`` and drop the associated assertions. Also expands ``WEAVE_OTEL_SEMANTIC_CONVENTIONS.md`` from a thin column list into a spec-relative reference: adds instrumentation-key columns (``gen_ai.*`` / ``weave.*``) per attribute, marks every Weave-only row explicitly, adds sections covering stability status, key-naming conventions (short-form / aliases / legacy fallbacks), and a "Known gaps vs. the OTel spec" section calling out the retrieval / embedding attributes we don't extract plus the entire events API and metrics tracks that are out of scope today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The grouped spans aggregate was publishing countIf(invoke_agent) under the name turn_count, while trace_count held uniqExact(trace_id) — the actual turn count per our data model. Rename to invocation_count (what it really is), drop trace_count (we track invocations, not turns, in the aggregate), and update types and tests to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the five per-source materialized views (output_messages, input_messages, system_instructions, tool_call_arguments, tool_call_result) that populate the messages search table with a single MV that fans the sources into rows via arrayConcat + ARRAY JOIN, tagged by role. One INSERT into messages per ingest block instead of five, one projection to maintain, and downstream per-source breakdowns remain available via GROUP BY role. Also adds DATA_MODEL.md, which spells out how spans, sessions, turns, invocations, and the messages projection relate. Migration 030 is edited in place; local dev DB was cleaned and backfilled through the new MV, and the test DB is auto-rebuilt from the migration on each session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Single-backticks in comments/docstrings across agents/ and the agent query builders (was RST ``literal``, switch to Markdown-style `literal`). - Rename AgentWriteHandler.self._ch -> self.ch_client. - Extract SPAN_KIND_UNSPECIFIED and SEARCH_CONTENT_PREVIEW_CHARS constants; reuse the existing EXPIRE_AT_NEVER from clickhouse_schema. - Extract _span_sort_key helper for the (started_at is None, started_at, span_id) idiom used in build_span_tree and _find_user_prompt. - zip(col_names, row, strict=True) in _rows_as_dicts so CH column/row shape drift surfaces instead of silently truncating. - Drop the `_: Any = None` unused-import silencer in agent_query_compiler; Any is no longer used in the module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
25c84c6 to
b1086fa
Compare
Batches a set of cosmetic/low-risk fixes from PR review: - `status_code` typed as `Literal["UNSET","OK","ERROR"]` — matches the ClickHouse `Enum8` exactly. - `= []` defaults replaced with `Field(default_factory=list)` across eight list fields on `AgentSpanCHInsertable`. - `str(x)` coercion in `unpack_string_array`; `normalize_span_row` rewritten as a `for` loop. - Extract `NO_CONVERSATION_LABEL`, `MAX_INGEST_ERRORS_REPORTED`, and `OP_INVOKE_AGENT` to `constants.py`; wire them through. - Ingest error messages: `logger.exception(...)` + surface only the exception type name to the client (never raw `str(e)`). - Guard `AgentQueryHandler.search` sort against `None` `last_activity`. - `_hydrate_group_row` gets a real docstring; `_compute_duration_ms` and `_dt_to_iso` log on their bare-except paths. - Comments: trace == turn is a convention (pointing to DATA_MODEL), the conversation-chat truncation depends on the SQL `ORDER BY ASC`, and `conversation_name` is Weave-specific (not OTel). - Module docstring nit: drop the "not gen_ai_usage_input_tokens" example. - Log when the chat-view walk truncates at `MAX_WALK_DEPTH`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The document rendered too large to diff on GitHub, which blunted its value as an in-tree living spec. Removing it for now; any pieces worth keeping will land closer to the code they describe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review comments on the chat-view recursion and the user-prompt fallback: - Expanded `build_chat_messages` docstring explaining the emission rules (user_message, agent_start, tool_call, agent_message, context_compacted) and the preorder walk order. - Added a docstring to the inner `_walk` that names which closure variables it mutates (`messages`, `emitted_span_ids`) and what `nearest_agent` threads through the recursion. - Documented `total_duration_ms` on the trace chat response as the root span's wall-clock duration (not a sum over subtree spans). - Debug-log invoke_agent spans that arrive without `gen_ai.agent.name` so the silent "no agent_start divider" path is traceable. - Comment on the orphan-as-root branch in `build_span_tree`: when a referenced parent isn't in the input, we promote the span to a synthetic root instead of dropping its subtree. - Tightened `_extract_user_text` fallback: now excludes `assistant` and `tool` messages when no `role == "user"` is found, so we can't accidentally surface a model reply or tool result as "the user prompt". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c6ebd20 to
d7439af
Compare
ade26c3 to
5868d80
Compare
5868d80 to
ade26c3
Compare
d7439af to
f3d9e4c
Compare
… its child LLM output Single-call agent turns in the OpenAI Agents SDK and Google ADK land as an invoke_agent span whose `output_messages` mirror the inner chat span's `output_messages`. The projection previously emitted from both — the child `chat` span emitted first, then the parent invoke_agent emitted the same text again on the way back up the walk — because the `emitted_span_ids` dedup set only blocked a span from emitting twice, not a parent from emitting after its child already had. Rework `_walk` to return a bool signalling whether its subtree emitted an `agent_message`. The enclosing invoke_agent only emits its own mirrored response if the subtree didn't. Drop `emitted_span_ids` entirely — it wasn't protecting against this case and in a normal tree walk the check never fires. Also: - Collapse `_find_user_prompt`'s two-pass `for prefer_invoke_agent in (True, False)` loop into a single sorted iteration, prioritising invoke_agent via the sort key. - Add regression tests for both the mirrored-emit case (now emits once) and the solo invoke_agent case (still emits, since no descendant did). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename `AgentQueryHandler.search` → `search_messages`, `_trace_detail_spans` → `trace_detail_spans`, `AgentWriteHandler.otel_export` → `insert_otel_spans` for clearer intent. - Use `model_validate` instead of `model_construct` so span hydration actually validates. - Fix `last_activity` in `search_messages`: previously set only on first-seen row, now tracked as max across all rows for a conversation. - Log at debug when `_extract_user_text` falls back past strict role=user. - Drop `turn_count` from `AgentConversationChatRes` (client can use `len(turns)`). - Clean up `` backticks in test docstrings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… its child LLM output Single-call agent turns in the OpenAI Agents SDK and Google ADK land as an invoke_agent span whose `output_messages` mirror the inner chat span's `output_messages`. The projection previously emitted from both — the child `chat` span emitted first, then the parent invoke_agent emitted the same text again on the way back up the walk — because the `emitted_span_ids` dedup set only blocked a span from emitting twice, not a parent from emitting after its child already had. Rework `_walk` to return a bool signalling whether its subtree emitted an `agent_message`. The enclosing invoke_agent only emits its own mirrored response if the subtree didn't. Drop `emitted_span_ids` entirely — it wasn't protecting against this case and in a normal tree walk the check never fires. Also: - Collapse `_find_user_prompt`'s two-pass `for prefer_invoke_agent in (True, False)` loop into a single sorted iteration, prioritising invoke_agent via the sort key. - Add regression tests for both the mirrored-emit case (now emits once) and the solo invoke_agent case (still emits, since no descendant did). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename `AgentQueryHandler.search` → `search_messages`, `_trace_detail_spans` → `trace_detail_spans`, `AgentWriteHandler.otel_export` → `insert_otel_spans` for clearer intent. - Use `model_validate` instead of `model_construct` so span hydration actually validates. - Fix `last_activity` in `search_messages`: previously set only on first-seen row, now tracked as max across all rows for a conversation. - Log at debug when `_extract_user_text` falls back past strict role=user. - Drop `turn_count` from `AgentConversationChatRes` (client can use `len(turns)`). - Clean up `` backticks in test docstrings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9593dba to
a9d2a72
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
First of three stacked PRs for Phase 1 of human feedback on agent turns.
Adds three new ref kinds (external + internal): `agent_turn/{trace_id}`, `agent_conversation/{conversation_id}`, `agent_span/{span_id}`. No consumers yet — just the grammar. `conversation_id` is URL-encoded since it's app-chosen.
Stack: this → #6685 → #6686
Testing