Skip to content

hotspots: rank tools by raw output bytes alongside tokens (#436)#444

Merged
willwashburn merged 2 commits into
mainfrom
claude/burn-3.0-436-tool-output-bytes
May 25, 2026
Merged

hotspots: rank tools by raw output bytes alongside tokens (#436)#444
willwashburn merged 2 commits into
mainfrom
claude/burn-3.0-436-tool-output-bytes

Conversation

@willwashburn

Copy link
Copy Markdown
Member

Closes #436.

Part of the burn 3.0 coordinated bump (see #430#440 batch — do not merge alone).

Summary

  • Schema v1→v2: nullable output_bytes, output_truncated columns on tool_result_events. Idempotent ALTER TABLE migration; pre-v2 rows stay NULL until burn state rebuild.
  • Ingest: every reader's Measured struct now computes as_bytes().len() alongside its existing length field. Claude reader gets a detect_truncation_marker helper (<system-truncated>, [truncated], "output/result/response truncated", case-insensitive). Codex/opencode leave output_truncated: None (no equivalent marker).
  • SDK aggregations: total_output_bytes, max_output_bytes, truncated_count added to FileAggregation, BashAggregation, BashVerbAggregation, SubagentAggregation. ToolAttribution gets per-event output_bytes/output_truncated.
  • CLI: burn hotspots --rank-by cost|bytes flag (default cost). Tables gain a bytes column (decimal-SI formatted, inline helper — no humansize dep). Trailing * flags buckets with truncation; - means "no bytes measured" (pre-v2 rows).
  • JSON output: totalOutputBytes, maxOutputBytes, truncatedCount on every aggregation row.

Why this matters

A 4 MB Bash result that's post-truncation billed as 200 tokens currently looks cheap in burn hotspots. The full payload still blew the context window for every subsequent inference (via cached prefix). Bytes-rank surfaces the cause; cost-rank stays as today's default.

Ported the dual-metric pattern from agent-profiler's ui/src/components/conversation/ToolUsageOverviewChart.tsx.

Merge note

This PR bumps schema v1→v2 by adding columns to tool_result_events. #437 (stop_reason) ALSO bumps v1→v2 for a turns.stop_reason column. Whichever lands second needs to become v3 with a chained ALTER TABLE. Migration is idempotent; reconciling on merge is mechanical.

Test plan

  • aggregations_track_output_bytes_so_byte_ranking_inverts_token_ranking — 1 MB Bash (200 tok post-truncation) + 1 KB Read (4000 tok); cost ranks Read first, bytes ranks Bash first
  • measure_tool_result_populates_byte_length_and_truncation_flag (Claude reader)
  • detect_truncation_marker_matches_known_phrasings
  • Golden snapshots updated: hotspots, hotspots-json, state-status, state-status-json
  • cargo build --workspace clean
  • cargo test --workspace green — 794 unit/integration + 5/5 golden

Out of scope

  • #[non_exhaustive] — not added; this is part of the 3.0 major bump.
  • Legacy content_length field stays as-is (char-count for Claude, bytes for codex/opencode). The new output_bytes is consistently UTF-8 byte length across every source.
  • MCP wire-format update — verified packages/mcp doesn't reference Hotspot* structs.

Generated by Claude Code

@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@willwashburn, we couldn't start this review because you've used your available PR reviews for now.

Your plan includes 1 review of capacity. Refill in 8 minutes and 10 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c8ad8b27-4a45-4d9e-88d0-7f81fcadf237

📥 Commits

Reviewing files that changed from the base of the PR and between 6778f77 and 9798ffc.

📒 Files selected for processing (18)
  • CHANGELOG.md
  • crates/relayburn-cli/src/commands/hotspots.rs
  • crates/relayburn-sdk/src/analyze/hotspots.rs
  • crates/relayburn-sdk/src/analyze/patterns_tests.rs
  • crates/relayburn-sdk/src/analyze/tool_output_bloat.rs
  • crates/relayburn-sdk/src/ledger/db.rs
  • crates/relayburn-sdk/src/ledger/schema.rs
  • crates/relayburn-sdk/src/ledger/tests.rs
  • crates/relayburn-sdk/src/ledger/writer.rs
  • crates/relayburn-sdk/src/query_verbs.rs
  • crates/relayburn-sdk/src/reader/claude.rs
  • crates/relayburn-sdk/src/reader/codex.rs
  • crates/relayburn-sdk/src/reader/opencode.rs
  • crates/relayburn-sdk/src/reader/types.rs
  • tests/fixtures/cli-golden/snapshots/hotspots-json.stdout.txt
  • tests/fixtures/cli-golden/snapshots/hotspots.stdout.txt
  • tests/fixtures/cli-golden/snapshots/state-status-json.stdout.txt
  • tests/fixtures/cli-golden/snapshots/state-status.stdout.txt
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/burn-3.0-436-tool-output-bytes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@willwashburn willwashburn marked this pull request as ready for review May 25, 2026 14:24

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 19 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread crates/relayburn-sdk/src/ledger/db.rs Outdated
Comment thread crates/relayburn-cli/src/commands/hotspots.rs Outdated
willwashburn pushed a commit that referenced this pull request May 25, 2026
The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used
a `column_exists` pre-check, which is race-prone under concurrent
ledger opens: two processes can both observe the column missing, both
issue the `ALTER`, and the second loses. Switch to try-then-swallow
on the `duplicate column name` SQLite error so the migration is
genuinely idempotent under contention.

Deliberately does not catch `SqliteFailure(_, None)` — that shape is
too broad and would mask real schema breakage. Keeps the pattern
aligned with the matching migration in #444 so the two PRs don't
diverge when both land.

`column_exists` had no other callers, so the helper is dropped.

Addresses cubic-dev-ai review on #445.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
willwashburn pushed a commit that referenced this pull request May 25, 2026
- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only
  the duplicate-column-name case is swallowed; any other `SqliteFailure`
  with no detail message now propagates rather than silently advancing
  `schema_version`.
- hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values
  fail at parse time (in both human and `--json` modes), instead of after
  ingest. Drops the manual string match on `args.rank_by.as_str()`.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
@willwashburn willwashburn force-pushed the claude/burn-3.0-436-tool-output-bytes branch from e954e3c to f59c2ac Compare May 25, 2026 15:01
willwashburn added a commit that referenced this pull request May 25, 2026
…y breakdown (#437) (#445)

* feat(sdk): expose StopReason on TurnRecord, surface turn outcomes in `burn summary` (#437)

BREAKING: `TurnRecord.stop_reason` switches from `Option<String>` to
`Option<StopReason>`. Anthropic / opencode finish-reason strings are
parsed into the enum at the reader boundary; unknown values fall back to
`StopReason::Silent` so pre-3.0 ledgers replay cleanly. Schema bumps to v2
with a new denormalized `turns.stop_reason TEXT` column populated on
insert; existing v1 ledgers are migrated in place via `ALTER TABLE` on
`Ledger::open`. `burn summary` adds a `Turn outcomes: …` line and a
`stopReasons` block in JSON output. Codex turns continue to carry `None`
(the rollout schema doesn't include a stop reason field).

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

* fix(sdk): map `stop` finish reason to EndTurn, not StopSequence

OpenAI / AI-SDK convention emits a bare `"stop"` for ordinary
end-of-turn completions, which is what opencode forwards when it wraps
an OpenAI-shaped provider. The previous mapping collapsed `"stop"` into
`StopSequence` alongside Anthropic's `stop_sequence`, which would
misclassify successful assistant endings as stop-sequence terminations
and skew the outcome buckets in `burn summary`.

`stop_sequence` / `stop-sequence` still resolve to `StopSequence` (the
actual Anthropic semantics).

Addresses chatgpt-codex-connector review on #445.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

* fix(sdk): make stop_reason migration idempotent via try-catch

The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used
a `column_exists` pre-check, which is race-prone under concurrent
ledger opens: two processes can both observe the column missing, both
issue the `ALTER`, and the second loses. Switch to try-then-swallow
on the `duplicate column name` SQLite error so the migration is
genuinely idempotent under contention.

Deliberately does not catch `SqliteFailure(_, None)` — that shape is
too broad and would mask real schema breakage. Keeps the pattern
aligned with the matching migration in #444 so the two PRs don't
diverge when both land.

`column_exists` had no other callers, so the helper is dropped.

Addresses cubic-dev-ai review on #445.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

---------

Co-authored-by: Claude <noreply@anthropic.com>
claude added 2 commits May 25, 2026 16:25
Track `output_bytes` and `output_truncated` per tool_result_event so a 4 MB
Bash blowout that gets truncated to ~0 billed tokens still surfaces in
`burn hotspots`. Adds a nullable `output_bytes` / `output_truncated` column
pair to `tool_result_events` (schema v2), threads the new fields through
`ToolResultEventRecord` ingest in every reader, and rolls them up onto
`ToolAttribution`, `FileAggregation`, `BashAggregation`,
`BashVerbAggregation`, and `SubagentAggregation` as `total_output_bytes`,
`max_output_bytes`, and `truncated_count`.

CLI gains a `--rank-by bytes` mode and a new `bytes` column on the per-tool
tables. The trailing `*` next to a bytes cell flags buckets with at least
one truncated payload. JSON output now ships the three new fields on every
aggregation row.

Pre-v2 ledgers leave the new columns NULL; `burn state rebuild` re-runs
ingest to backfill.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only
  the duplicate-column-name case is swallowed; any other `SqliteFailure`
  with no detail message now propagates rather than silently advancing
  `schema_version`.
- hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values
  fail at parse time (in both human and `--json` modes), instead of after
  ingest. Drops the manual string match on `args.rank_by.as_str()`.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
@willwashburn willwashburn force-pushed the claude/burn-3.0-436-tool-output-bytes branch from 1ed469d to 9798ffc Compare May 25, 2026 16:26
@willwashburn willwashburn merged commit 33cc446 into main May 25, 2026
11 checks passed
@willwashburn willwashburn deleted the claude/burn-3.0-436-tool-output-bytes branch May 25, 2026 16:51
willwashburn pushed a commit that referenced this pull request May 25, 2026
Introduces an `Inference` aggregate keyed by `(source, session_id,
request_id)` so callers asking "how many API calls" stop conflating
Claude's multi-content-block assistant rows. One Claude API call lands
as multiple JSONL rows sharing a `requestId`; the existing
`TurnRecord` collapses by `message.id` (1:1 with `requestId` today),
but the inference key gives a durable per-API-call identity that
survives future harness changes and exposes a stable provenance field
(`request-id` / `message-id` / `row-synthetic`).

Schema bumps to v4: new `inferences` table keyed by the composite
triple, populated by the ingest pipeline from the parser's new
`request_id_lookup`. Chained migration on top of v2 (#437
stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates
on legacy ledgers. (If #444 hasn't merged by integration time, this
should renumber to v3 — the version constant + migration step + tests
sit together for an easy rebase.)

Also fixes a latent bug in the Claude parser: usage merging only
updated `usage_coverage` on subsequent rows of the same `message_id`,
not `usage` itself. If the carrier row wasn't the first row for that
message id, its tokens were dropped. The merge now adopts the
carrier's `usage` values regardless of arrival order.

SDK verb: `LedgerHandle::inferences(InferencesOptions) ->
Vec<Inference>` + free-function `inferences()`. Codex / opencode (no
upstream `requestId`) fall back to `message_id` via the
`InferenceKeySource::MessageId` provenance.

Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt`
gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The
fixture is bootstrap-only (no ingest), so the count stays 0 until the
next `burn ingest` or `burn state rebuild`.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
willwashburn added a commit that referenced this pull request May 25, 2026
…#448)

* reader: dedupe assistant rows by requestId into an Inference unit (#434)

Introduces an `Inference` aggregate keyed by `(source, session_id,
request_id)` so callers asking "how many API calls" stop conflating
Claude's multi-content-block assistant rows. One Claude API call lands
as multiple JSONL rows sharing a `requestId`; the existing
`TurnRecord` collapses by `message.id` (1:1 with `requestId` today),
but the inference key gives a durable per-API-call identity that
survives future harness changes and exposes a stable provenance field
(`request-id` / `message-id` / `row-synthetic`).

Schema bumps to v4: new `inferences` table keyed by the composite
triple, populated by the ingest pipeline from the parser's new
`request_id_lookup`. Chained migration on top of v2 (#437
stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates
on legacy ledgers. (If #444 hasn't merged by integration time, this
should renumber to v3 — the version constant + migration step + tests
sit together for an easy rebase.)

Also fixes a latent bug in the Claude parser: usage merging only
updated `usage_coverage` on subsequent rows of the same `message_id`,
not `usage` itself. If the carrier row wasn't the first row for that
message id, its tokens were dropped. The merge now adopts the
carrier's `usage` values regardless of arrival order.

SDK verb: `LedgerHandle::inferences(InferencesOptions) ->
Vec<Inference>` + free-function `inferences()`. Codex / opencode (no
upstream `requestId`) fall back to `message_id` via the
`InferenceKeySource::MessageId` provenance.

Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt`
gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The
fixture is bootstrap-only (no ingest), so the count stays 0 until the
next `burn ingest` or `burn state rebuild`.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

* fix(reader): make InferenceKind::Reasoning reachable and apply project filter (#448)

The classifier hard-coded `has_text = !has_tools`, which made the
`(true, false, false)` arm — reasoning-only turns — unreachable; they
were silently lumped into `Mixed`. Switch to a 2-tuple match on
`(has_reasoning, has_tools)` and document the intentional coarseness
(reasoning + text lumps with `Reasoning`, tools + text with `ToolUse`)
so the trade-off is visible at the call site.

`query_inferences` accepted `Query::project` but never applied it to
SQL, so project-scoped callers received cross-project rows. The
`inferences` table doesn't carry project columns; filter via a subquery
against `turns` (`session_id IN (... WHERE project = ? OR project_key = ?)`)
which mirrors the predicate shape `query_turns` already uses.

Adds two tests: a reasoning-only turn classifies as `Reasoning`, and a
two-project ledger returns only the requested project's inferences.

https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hotspots: surface tool-output bytes alongside tokens

2 participants