hotspots: rank tools by raw output bytes alongside tokens (#436) by willwashburn · Pull Request #444 · AgentWorkforce/burn

willwashburn · 2026-05-25T13:11:05Z

Closes #436.

Part of the burn 3.0 coordinated bump (see #430–#440 batch — do not merge alone).

Summary

Schema v1→v2: nullable output_bytes, output_truncated columns on tool_result_events. Idempotent ALTER TABLE migration; pre-v2 rows stay NULL until burn state rebuild.
Ingest: every reader's Measured struct now computes as_bytes().len() alongside its existing length field. Claude reader gets a detect_truncation_marker helper (<system-truncated>, [truncated], "output/result/response truncated", case-insensitive). Codex/opencode leave output_truncated: None (no equivalent marker).
SDK aggregations: total_output_bytes, max_output_bytes, truncated_count added to FileAggregation, BashAggregation, BashVerbAggregation, SubagentAggregation. ToolAttribution gets per-event output_bytes/output_truncated.
CLI: burn hotspots --rank-by cost|bytes flag (default cost). Tables gain a bytes column (decimal-SI formatted, inline helper — no humansize dep). Trailing * flags buckets with truncation; - means "no bytes measured" (pre-v2 rows).
JSON output: totalOutputBytes, maxOutputBytes, truncatedCount on every aggregation row.

Why this matters

A 4 MB Bash result that's post-truncation billed as 200 tokens currently looks cheap in burn hotspots. The full payload still blew the context window for every subsequent inference (via cached prefix). Bytes-rank surfaces the cause; cost-rank stays as today's default.

Ported the dual-metric pattern from agent-profiler's ui/src/components/conversation/ToolUsageOverviewChart.tsx.

Merge note

This PR bumps schema v1→v2 by adding columns to tool_result_events. #437 (stop_reason) ALSO bumps v1→v2 for a turns.stop_reason column. Whichever lands second needs to become v3 with a chained ALTER TABLE. Migration is idempotent; reconciling on merge is mechanical.

Test plan

aggregations_track_output_bytes_so_byte_ranking_inverts_token_ranking — 1 MB Bash (200 tok post-truncation) + 1 KB Read (4000 tok); cost ranks Read first, bytes ranks Bash first
measure_tool_result_populates_byte_length_and_truncation_flag (Claude reader)
detect_truncation_marker_matches_known_phrasings
Golden snapshots updated: hotspots, hotspots-json, state-status, state-status-json
cargo build --workspace clean
cargo test --workspace green — 794 unit/integration + 5/5 golden

Out of scope

#[non_exhaustive] — not added; this is part of the 3.0 major bump.
Legacy content_length field stays as-is (char-count for Claude, bytes for codex/opencode). The new output_bytes is consistently UTF-8 byte length across every source.
MCP wire-format update — verified packages/mcp doesn't reference Hotspot* structs.

Generated by Claude Code

coderabbitai · 2026-05-25T13:11:13Z

Warning

Review limit reached

@willwashburn, we couldn't start this review because you've used your available PR reviews for now.

Your plan includes 1 review of capacity. Refill in 8 minutes and 10 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c8ad8b27-4a45-4d9e-88d0-7f81fcadf237

📥 Commits

Reviewing files that changed from the base of the PR and between 6778f77 and 9798ffc.

📒 Files selected for processing (18)

CHANGELOG.md
crates/relayburn-cli/src/commands/hotspots.rs
crates/relayburn-sdk/src/analyze/hotspots.rs
crates/relayburn-sdk/src/analyze/patterns_tests.rs
crates/relayburn-sdk/src/analyze/tool_output_bloat.rs
crates/relayburn-sdk/src/ledger/db.rs
crates/relayburn-sdk/src/ledger/schema.rs
crates/relayburn-sdk/src/ledger/tests.rs
crates/relayburn-sdk/src/ledger/writer.rs
crates/relayburn-sdk/src/query_verbs.rs
crates/relayburn-sdk/src/reader/claude.rs
crates/relayburn-sdk/src/reader/codex.rs
crates/relayburn-sdk/src/reader/opencode.rs
crates/relayburn-sdk/src/reader/types.rs
tests/fixtures/cli-golden/snapshots/hotspots-json.stdout.txt
tests/fixtures/cli-golden/snapshots/hotspots.stdout.txt
tests/fixtures/cli-golden/snapshots/state-status-json.stdout.txt
tests/fixtures/cli-golden/snapshots/state-status.stdout.txt

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/burn-3.0-436-tool-output-bytes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cubic-dev-ai

2 issues found across 19 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used a `column_exists` pre-check, which is race-prone under concurrent ledger opens: two processes can both observe the column missing, both issue the `ALTER`, and the second loses. Switch to try-then-swallow on the `duplicate column name` SQLite error so the migration is genuinely idempotent under contention. Deliberately does not catch `SqliteFailure(_, None)` — that shape is too broad and would mask real schema breakage. Keeps the pattern aligned with the matching migration in #444 so the two PRs don't diverge when both land. `column_exists` had no other callers, so the helper is dropped. Addresses cubic-dev-ai review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only the duplicate-column-name case is swallowed; any other `SqliteFailure` with no detail message now propagates rather than silently advancing `schema_version`. - hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values fail at parse time (in both human and `--json` modes), instead of after ingest. Drops the manual string match on `args.rank_by.as_str()`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

…y breakdown (#437) (#445) * feat(sdk): expose StopReason on TurnRecord, surface turn outcomes in `burn summary` (#437) BREAKING: `TurnRecord.stop_reason` switches from `Option<String>` to `Option<StopReason>`. Anthropic / opencode finish-reason strings are parsed into the enum at the reader boundary; unknown values fall back to `StopReason::Silent` so pre-3.0 ledgers replay cleanly. Schema bumps to v2 with a new denormalized `turns.stop_reason TEXT` column populated on insert; existing v1 ledgers are migrated in place via `ALTER TABLE` on `Ledger::open`. `burn summary` adds a `Turn outcomes: …` line and a `stopReasons` block in JSON output. Codex turns continue to carry `None` (the rollout schema doesn't include a stop reason field). https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(sdk): map `stop` finish reason to EndTurn, not StopSequence OpenAI / AI-SDK convention emits a bare `"stop"` for ordinary end-of-turn completions, which is what opencode forwards when it wraps an OpenAI-shaped provider. The previous mapping collapsed `"stop"` into `StopSequence` alongside Anthropic's `stop_sequence`, which would misclassify successful assistant endings as stop-sequence terminations and skew the outcome buckets in `burn summary`. `stop_sequence` / `stop-sequence` still resolve to `StopSequence` (the actual Anthropic semantics). Addresses chatgpt-codex-connector review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(sdk): make stop_reason migration idempotent via try-catch The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used a `column_exists` pre-check, which is race-prone under concurrent ledger opens: two processes can both observe the column missing, both issue the `ALTER`, and the second loses. Switch to try-then-swallow on the `duplicate column name` SQLite error so the migration is genuinely idempotent under contention. Deliberately does not catch `SqliteFailure(_, None)` — that shape is too broad and would mask real schema breakage. Keeps the pattern aligned with the matching migration in #444 so the two PRs don't diverge when both land. `column_exists` had no other callers, so the helper is dropped. Addresses cubic-dev-ai review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY --------- Co-authored-by: Claude <noreply@anthropic.com>

Track `output_bytes` and `output_truncated` per tool_result_event so a 4 MB Bash blowout that gets truncated to ~0 billed tokens still surfaces in `burn hotspots`. Adds a nullable `output_bytes` / `output_truncated` column pair to `tool_result_events` (schema v2), threads the new fields through `ToolResultEventRecord` ingest in every reader, and rolls them up onto `ToolAttribution`, `FileAggregation`, `BashAggregation`, `BashVerbAggregation`, and `SubagentAggregation` as `total_output_bytes`, `max_output_bytes`, and `truncated_count`. CLI gains a `--rank-by bytes` mode and a new `bytes` column on the per-tool tables. The trailing `*` next to a bytes cell flags buckets with at least one truncated payload. JSON output now ships the three new fields on every aggregation row. Pre-v2 ledgers leave the new columns NULL; `burn state rebuild` re-runs ingest to backfill. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only the duplicate-column-name case is swallowed; any other `SqliteFailure` with no detail message now propagates rather than silently advancing `schema_version`. - hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values fail at parse time (in both human and `--json` modes), instead of after ingest. Drops the manual string match on `args.rank_by.as_str()`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

Introduces an `Inference` aggregate keyed by `(source, session_id, request_id)` so callers asking "how many API calls" stop conflating Claude's multi-content-block assistant rows. One Claude API call lands as multiple JSONL rows sharing a `requestId`; the existing `TurnRecord` collapses by `message.id` (1:1 with `requestId` today), but the inference key gives a durable per-API-call identity that survives future harness changes and exposes a stable provenance field (`request-id` / `message-id` / `row-synthetic`). Schema bumps to v4: new `inferences` table keyed by the composite triple, populated by the ingest pipeline from the parser's new `request_id_lookup`. Chained migration on top of v2 (#437 stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates on legacy ledgers. (If #444 hasn't merged by integration time, this should renumber to v3 — the version constant + migration step + tests sit together for an easy rebase.) Also fixes a latent bug in the Claude parser: usage merging only updated `usage_coverage` on subsequent rows of the same `message_id`, not `usage` itself. If the carrier row wasn't the first row for that message id, its tokens were dropped. The merge now adopts the carrier's `usage` values regardless of arrival order. SDK verb: `LedgerHandle::inferences(InferencesOptions) -> Vec<Inference>` + free-function `inferences()`. Codex / opencode (no upstream `requestId`) fall back to `message_id` via the `InferenceKeySource::MessageId` provenance. Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt` gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The fixture is bootstrap-only (no ingest), so the count stays 0 until the next `burn ingest` or `burn state rebuild`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY

…#448) * reader: dedupe assistant rows by requestId into an Inference unit (#434) Introduces an `Inference` aggregate keyed by `(source, session_id, request_id)` so callers asking "how many API calls" stop conflating Claude's multi-content-block assistant rows. One Claude API call lands as multiple JSONL rows sharing a `requestId`; the existing `TurnRecord` collapses by `message.id` (1:1 with `requestId` today), but the inference key gives a durable per-API-call identity that survives future harness changes and exposes a stable provenance field (`request-id` / `message-id` / `row-synthetic`). Schema bumps to v4: new `inferences` table keyed by the composite triple, populated by the ingest pipeline from the parser's new `request_id_lookup`. Chained migration on top of v2 (#437 stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates on legacy ledgers. (If #444 hasn't merged by integration time, this should renumber to v3 — the version constant + migration step + tests sit together for an easy rebase.) Also fixes a latent bug in the Claude parser: usage merging only updated `usage_coverage` on subsequent rows of the same `message_id`, not `usage` itself. If the carrier row wasn't the first row for that message id, its tokens were dropped. The merge now adopts the carrier's `usage` values regardless of arrival order. SDK verb: `LedgerHandle::inferences(InferencesOptions) -> Vec<Inference>` + free-function `inferences()`. Codex / opencode (no upstream `requestId`) fall back to `message_id` via the `InferenceKeySource::MessageId` provenance. Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt` gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The fixture is bootstrap-only (no ingest), so the count stays 0 until the next `burn ingest` or `burn state rebuild`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(reader): make InferenceKind::Reasoning reachable and apply project filter (#448) The classifier hard-coded `has_text = !has_tools`, which made the `(true, false, false)` arm — reasoning-only turns — unreachable; they were silently lumped into `Mixed`. Switch to a 2-tuple match on `(has_reasoning, has_tools)` and document the intentional coarseness (reasoning + text lumps with `Reasoning`, tools + text with `ToolUse`) so the trade-off is visible at the call site. `query_inferences` accepted `Query::project` but never applied it to SQL, so project-scoped callers received cross-project rows. The `inferences` table doesn't carry project columns; filter via a subquery against `turns` (`session_id IN (... WHERE project = ? OR project_key = ?)`) which mirrors the predicate shape `query_turns` already uses. Adds two tests: a reasoning-only turn classifies as `Reasoning`, and a two-project ledger returns only the requested project's inferences. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY --------- Co-authored-by: Claude <noreply@anthropic.com>

willwashburn marked this pull request as ready for review May 25, 2026 14:24

cubic-dev-ai Bot reviewed May 25, 2026

View reviewed changes

Comment thread crates/relayburn-sdk/src/ledger/db.rs Outdated

Comment thread crates/relayburn-cli/src/commands/hotspots.rs Outdated

willwashburn force-pushed the claude/burn-3.0-436-tool-output-bytes branch from e954e3c to f59c2ac Compare May 25, 2026 15:01

claude added 2 commits May 25, 2026 16:25

willwashburn force-pushed the claude/burn-3.0-436-tool-output-bytes branch from 1ed469d to 9798ffc Compare May 25, 2026 16:26

willwashburn merged commit 33cc446 into main May 25, 2026
11 checks passed

willwashburn deleted the claude/burn-3.0-436-tool-output-bytes branch May 25, 2026 16:51

This was referenced May 25, 2026

reader: dedupe assistant rows by requestId into Inference unit (#434) #448

Merged

reader: pair Task subagents via toolUseResult.agentId, surface orphans as UnattachedGroup (#435) #449

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hotspots: rank tools by raw output bytes alongside tokens (#436)#444

hotspots: rank tools by raw output bytes alongside tokens (#436)#444
willwashburn merged 2 commits into
mainfrom
claude/burn-3.0-436-tool-output-bytes

willwashburn commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 •

edited

Loading

Review limit reached

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

willwashburn commented May 25, 2026

Summary

Why this matters

Merge note

Test plan

Out of scope

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 25, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading