hotspots: rank tools by raw output bytes alongside tokens (#436)#444
Conversation
|
Warning Review limit reached
Your plan includes 1 review of capacity. Refill in 8 minutes and 10 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (18)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
2 issues found across 19 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used a `column_exists` pre-check, which is race-prone under concurrent ledger opens: two processes can both observe the column missing, both issue the `ALTER`, and the second loses. Switch to try-then-swallow on the `duplicate column name` SQLite error so the migration is genuinely idempotent under contention. Deliberately does not catch `SqliteFailure(_, None)` — that shape is too broad and would mask real schema breakage. Keeps the pattern aligned with the matching migration in #444 so the two PRs don't diverge when both land. `column_exists` had no other callers, so the helper is dropped. Addresses cubic-dev-ai review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only the duplicate-column-name case is swallowed; any other `SqliteFailure` with no detail message now propagates rather than silently advancing `schema_version`. - hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values fail at parse time (in both human and `--json` modes), instead of after ingest. Drops the manual string match on `args.rank_by.as_str()`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
e954e3c to
f59c2ac
Compare
…y breakdown (#437) (#445) * feat(sdk): expose StopReason on TurnRecord, surface turn outcomes in `burn summary` (#437) BREAKING: `TurnRecord.stop_reason` switches from `Option<String>` to `Option<StopReason>`. Anthropic / opencode finish-reason strings are parsed into the enum at the reader boundary; unknown values fall back to `StopReason::Silent` so pre-3.0 ledgers replay cleanly. Schema bumps to v2 with a new denormalized `turns.stop_reason TEXT` column populated on insert; existing v1 ledgers are migrated in place via `ALTER TABLE` on `Ledger::open`. `burn summary` adds a `Turn outcomes: …` line and a `stopReasons` block in JSON output. Codex turns continue to carry `None` (the rollout schema doesn't include a stop reason field). https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(sdk): map `stop` finish reason to EndTurn, not StopSequence OpenAI / AI-SDK convention emits a bare `"stop"` for ordinary end-of-turn completions, which is what opencode forwards when it wraps an OpenAI-shaped provider. The previous mapping collapsed `"stop"` into `StopSequence` alongside Anthropic's `stop_sequence`, which would misclassify successful assistant endings as stop-sequence terminations and skew the outcome buckets in `burn summary`. `stop_sequence` / `stop-sequence` still resolve to `StopSequence` (the actual Anthropic semantics). Addresses chatgpt-codex-connector review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(sdk): make stop_reason migration idempotent via try-catch The v1 -> v2 `ALTER TABLE turns ADD COLUMN stop_reason` previously used a `column_exists` pre-check, which is race-prone under concurrent ledger opens: two processes can both observe the column missing, both issue the `ALTER`, and the second loses. Switch to try-then-swallow on the `duplicate column name` SQLite error so the migration is genuinely idempotent under contention. Deliberately does not catch `SqliteFailure(_, None)` — that shape is too broad and would mask real schema breakage. Keeps the pattern aligned with the matching migration in #444 so the two PRs don't diverge when both land. `column_exists` had no other callers, so the helper is dropped. Addresses cubic-dev-ai review on #445. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY --------- Co-authored-by: Claude <noreply@anthropic.com>
Track `output_bytes` and `output_truncated` per tool_result_event so a 4 MB Bash blowout that gets truncated to ~0 billed tokens still surfaces in `burn hotspots`. Adds a nullable `output_bytes` / `output_truncated` column pair to `tool_result_events` (schema v2), threads the new fields through `ToolResultEventRecord` ingest in every reader, and rolls them up onto `ToolAttribution`, `FileAggregation`, `BashAggregation`, `BashVerbAggregation`, and `SubagentAggregation` as `total_output_bytes`, `max_output_bytes`, and `truncated_count`. CLI gains a `--rank-by bytes` mode and a new `bytes` column on the per-tool tables. The trailing `*` next to a bytes cell flags buckets with at least one truncated payload. JSON output now ships the three new fields on every aggregation row. Pre-v2 ledgers leave the new columns NULL; `burn state rebuild` re-runs ingest to backfill. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
- ledger/db.rs: drop the `SqliteFailure(_, None)` migration arm so only the duplicate-column-name case is swallowed; any other `SqliteFailure` with no detail message now propagates rather than silently advancing `schema_version`. - hotspots: turn `--rank-by` into a clap `ValueEnum` so invalid values fail at parse time (in both human and `--json` modes), instead of after ingest. Drops the manual string match on `args.rank_by.as_str()`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
1ed469d to
9798ffc
Compare
Introduces an `Inference` aggregate keyed by `(source, session_id, request_id)` so callers asking "how many API calls" stop conflating Claude's multi-content-block assistant rows. One Claude API call lands as multiple JSONL rows sharing a `requestId`; the existing `TurnRecord` collapses by `message.id` (1:1 with `requestId` today), but the inference key gives a durable per-API-call identity that survives future harness changes and exposes a stable provenance field (`request-id` / `message-id` / `row-synthetic`). Schema bumps to v4: new `inferences` table keyed by the composite triple, populated by the ingest pipeline from the parser's new `request_id_lookup`. Chained migration on top of v2 (#437 stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates on legacy ledgers. (If #444 hasn't merged by integration time, this should renumber to v3 — the version constant + migration step + tests sit together for an easy rebase.) Also fixes a latent bug in the Claude parser: usage merging only updated `usage_coverage` on subsequent rows of the same `message_id`, not `usage` itself. If the carrier row wasn't the first row for that message id, its tokens were dropped. The merge now adopts the carrier's `usage` values regardless of arrival order. SDK verb: `LedgerHandle::inferences(InferencesOptions) -> Vec<Inference>` + free-function `inferences()`. Codex / opencode (no upstream `requestId`) fall back to `message_id` via the `InferenceKeySource::MessageId` provenance. Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt` gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The fixture is bootstrap-only (no ingest), so the count stays 0 until the next `burn ingest` or `burn state rebuild`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY
…#448) * reader: dedupe assistant rows by requestId into an Inference unit (#434) Introduces an `Inference` aggregate keyed by `(source, session_id, request_id)` so callers asking "how many API calls" stop conflating Claude's multi-content-block assistant rows. One Claude API call lands as multiple JSONL rows sharing a `requestId`; the existing `TurnRecord` collapses by `message.id` (1:1 with `requestId` today), but the inference key gives a durable per-API-call identity that survives future harness changes and exposes a stable provenance field (`request-id` / `message-id` / `row-synthetic`). Schema bumps to v4: new `inferences` table keyed by the composite triple, populated by the ingest pipeline from the parser's new `request_id_lookup`. Chained migration on top of v2 (#437 stop_reason) and v3 (#436 output_bytes); `burn state rebuild` repopulates on legacy ledgers. (If #444 hasn't merged by integration time, this should renumber to v3 — the version constant + migration step + tests sit together for an easy rebase.) Also fixes a latent bug in the Claude parser: usage merging only updated `usage_coverage` on subsequent rows of the same `message_id`, not `usage` itself. If the carrier row wasn't the first row for that message id, its tokens were dropped. The merge now adopts the carrier's `usage` values regardless of arrival order. SDK verb: `LedgerHandle::inferences(InferencesOptions) -> Vec<Inference>` + free-function `inferences()`. Codex / opencode (no upstream `requestId`) fall back to `message_id` via the `InferenceKeySource::MessageId` provenance. Golden updates: `state-status.stdout.txt` and `state-status-json.stdout.txt` gain an `inferences: 0` row and the `schemaVersion` bumps 3 → 4. The fixture is bootstrap-only (no ingest), so the count stays 0 until the next `burn ingest` or `burn state rebuild`. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY * fix(reader): make InferenceKind::Reasoning reachable and apply project filter (#448) The classifier hard-coded `has_text = !has_tools`, which made the `(true, false, false)` arm — reasoning-only turns — unreachable; they were silently lumped into `Mixed`. Switch to a 2-tuple match on `(has_reasoning, has_tools)` and document the intentional coarseness (reasoning + text lumps with `Reasoning`, tools + text with `ToolUse`) so the trade-off is visible at the call site. `query_inferences` accepted `Query::project` but never applied it to SQL, so project-scoped callers received cross-project rows. The `inferences` table doesn't carry project columns; filter via a subquery against `turns` (`session_id IN (... WHERE project = ? OR project_key = ?)`) which mirrors the predicate shape `query_turns` already uses. Adds two tests: a reasoning-only turn classifies as `Reasoning`, and a two-project ledger returns only the requested project's inferences. https://claude.ai/code/session_01QEpNZbWEYNwxzqQjTN5LCY --------- Co-authored-by: Claude <noreply@anthropic.com>
Closes #436.
Part of the burn 3.0 coordinated bump (see #430–#440 batch — do not merge alone).
Summary
output_bytes,output_truncatedcolumns ontool_result_events. IdempotentALTER TABLEmigration; pre-v2 rows stay NULL untilburn state rebuild.Measuredstruct now computesas_bytes().len()alongside its existing length field. Claude reader gets adetect_truncation_markerhelper (<system-truncated>,[truncated], "output/result/response truncated", case-insensitive). Codex/opencode leaveoutput_truncated: None(no equivalent marker).total_output_bytes,max_output_bytes,truncated_countadded toFileAggregation,BashAggregation,BashVerbAggregation,SubagentAggregation.ToolAttributiongets per-eventoutput_bytes/output_truncated.burn hotspots --rank-by cost|bytesflag (default cost). Tables gain abytescolumn (decimal-SI formatted, inline helper — nohumansizedep). Trailing*flags buckets with truncation;-means "no bytes measured" (pre-v2 rows).totalOutputBytes,maxOutputBytes,truncatedCounton every aggregation row.Why this matters
A 4 MB Bash result that's post-truncation billed as 200 tokens currently looks cheap in
burn hotspots. The full payload still blew the context window for every subsequent inference (via cached prefix). Bytes-rank surfaces the cause; cost-rank stays as today's default.Ported the dual-metric pattern from agent-profiler's
ui/src/components/conversation/ToolUsageOverviewChart.tsx.Merge note
This PR bumps schema v1→v2 by adding columns to
tool_result_events. #437 (stop_reason) ALSO bumps v1→v2 for aturns.stop_reasoncolumn. Whichever lands second needs to become v3 with a chainedALTER TABLE. Migration is idempotent; reconciling on merge is mechanical.Test plan
aggregations_track_output_bytes_so_byte_ranking_inverts_token_ranking— 1 MB Bash (200 tok post-truncation) + 1 KB Read (4000 tok); cost ranks Read first, bytes ranks Bash firstmeasure_tool_result_populates_byte_length_and_truncation_flag(Claude reader)detect_truncation_marker_matches_known_phrasingshotspots,hotspots-json,state-status,state-status-jsoncargo build --workspacecleancargo test --workspacegreen — 794 unit/integration + 5/5 goldenOut of scope
#[non_exhaustive]— not added; this is part of the 3.0 major bump.content_lengthfield stays as-is (char-count for Claude, bytes for codex/opencode). The newoutput_bytesis consistently UTF-8 byte length across every source.packages/mcpdoesn't referenceHotspot*structs.Generated by Claude Code