[WIP] agentx#348
Draft
cquil11 wants to merge 66 commits into
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext / ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data join keeps both `on` and `off` variants for the same (config, conc). Without it, the second variant collapsed onto the first key, so FP8 offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode` AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
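The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `RawArtifactRow` shape, the `normalizeRow` helper, the `|` key separator, and the rule that any value other than 'none'/'off' counts as "on" are all assumptions.

```typescript
// Hypothetical shapes: field names follow the commit message, not the real source.
type OffloadMode = 'on' | 'off';

interface RawArtifactRow {
  users?: number;        // older artifacts
  conc?: number;         // newer artifacts
  offload_mode?: string; // older artifacts
  offloading?: string;   // newer artifacts
}

// Accept both the older and the newer field names.
function normalizeRow(raw: RawArtifactRow): { conc: number | null; offloadMode: OffloadMode } {
  const conc = raw.conc ?? raw.users ?? null;
  const rawMode = raw.offload_mode ?? raw.offloading ?? 'off';
  // Assumption: 'none' and 'off' both mean "no offload"; anything else counts as on.
  const offloadMode: OffloadMode = rawMode === 'none' || rawMode === 'off' ? 'off' : 'on';
  return { conc, offloadMode };
}

// Including the offload mode keeps the on/off variants of one (config, conc)
// pair from collapsing onto the same data-join key.
function buildPointConfigId(config: string, conc: number, offloadMode: OffloadMode): string {
  return `${config}|${conc}|${offloadMode}`;
}
```

With a key like this, the offload-on and offload-off variants of the same config and concurrency map to different ids, so a keyed data join treats them as separate points.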
The halo's purpose is to surface KV-offload usage; restricting it to Pareto-frontier-only points hid the indicator on most runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool suffix wasn't in the strip list and didn't normalize to the canonical b300 GPU key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
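A rough sketch of this kind of normalization, assuming a simple suffix strip list. The suffix values, the canonical key set, and the `normalizeGpuKey` name are illustrative placeholders, not the repository's actual lists.

```typescript
// Hypothetical strip list and canonical keys, for illustration only.
const RUNNER_POOL_SUFFIXES: string[] = ['-p1', '-p2'];
const CANONICAL_GPU_KEYS = new Set<string>(['b200', 'b300', 'h100']);

// Returns the canonical GPU key, or null when the runner name is unknown
// (the case in which an artifact would be skipped at ingest).
function normalizeGpuKey(runnerName: string): string | null {
  let key = runnerName.toLowerCase();
  for (const suffix of RUNNER_POOL_SUFFIXES) {
    if (key.endsWith(suffix)) {
      key = key.slice(0, key.length - suffix.length);
      break;
    }
  }
  return CANONICAL_GPU_KEYS.has(key) ? key : null;
}
```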
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each label above/below the point through 4 candidate dy offsets, hiding the label only when no slot is free

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint Tspans now ride above the text's `dy` anchor — the LAST line sits at the anchor (just above the point) and earlier lines stack above it. Previously the second tspan landed below the anchor and crashed into the marker. Also widened collision candidates by label height so the flipped-below position fully clears the point on multi-line labels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass When a `<text>` contains tspans, the parent's `dy` does not shift the bbox cleanly — its (unused) y=0 origin still factors in, so the rendered text ended up centered on the point. Move the absolute offset into the FIRST tspan's `dy`; later tspans cascade by 1.1em. Collision avoidance now drives the first tspan's `dy` and tries four candidate baselines (primary above, primary below, secondary above, secondary below), accounting for full label height when picking a non- overlapping slot. Labels still hidden as a last resort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
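A greedy pass of this kind might look like the sketch below. It works on plain rectangles rather than SVG nodes, and the `gap` value, the `LabelInput` shape, and the exact candidate offsets are assumptions chosen for illustration; the real code drives the first tspan's `dy`.

```typescript
interface Box { x: number; y: number; width: number; height: number; }

// x/y are the point's center; width/height are the measured label size.
interface LabelInput { x: number; y: number; width: number; height: number; }

function overlaps(a: Box, b: Box): boolean {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Returns, per label, the chosen vertical offset of the label's top edge
// relative to the point, or null when every candidate slot is taken.
function placeLabels(labels: LabelInput[], gap = 6): (number | null)[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    // Order: primary above, primary below, secondary above, secondary below.
    const candidates = [
      -(gap + l.height),
      gap,
      -(gap + 2 * l.height),
      gap + l.height,
    ];
    for (const dy of candidates) {
      const box: Box = { x: l.x - l.width / 2, y: l.y + dy, width: l.width, height: l.height };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box);
        return dy;
      }
    }
    return null; // hide the label as a last resort
  });
}
```

Because the pass is greedy, earlier labels always keep their preferred slot; the order in which labels are visited therefore decides which ones get flipped or hidden.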
Two complementary fixes for runs whose `results_bmk` aggregated artifact ends up containing both a successful row and a failed-attempt row for the same (config, conc, offload). The failed row's null metrics were overwriting the good row via ON CONFLICT DO UPDATE.

1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix from each artifact name and group by the logical name, keeping only the most recent per group.
2. Row-level: skip rows with `num_requests_successful === 0` AND `num_requests_total > 0`. The aggregated artifact merges rows from all runners, including failed ones, so artifact-level dedup alone can't reach inside it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
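Both filters can be sketched as below. The `Artifact` shape, the `createdAt` field, and the suffix regex (a pool name without underscores followed by a numeric attempt) are assumptions; only the two row fields come from the commit message.

```typescript
interface Artifact { name: string; createdAt: string; } // hypothetical shape
interface BenchmarkRow { num_requests_successful: number; num_requests_total: number; }

// Assumption: the suffix is `_<runner-pool>_<attempt>`, where the pool name
// contains no underscore and the attempt is an integer.
function logicalName(name: string): string {
  return name.replace(/_[^_]+_\d+$/, '');
}

// Artifact-level: keep only the most recent artifact per logical name.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const key = logicalName(a.name);
    const prev = latest.get(key);
    if (!prev || Date.parse(a.createdAt) > Date.parse(prev.createdAt)) {
      latest.set(key, a);
    }
  }
  return Array.from(latest.values());
}

// Row-level: a row that attempted requests but completed none is a failed attempt.
function isFailedAttempt(row: BenchmarkRow): boolean {
  return row.num_requests_successful === 0 && row.num_requests_total > 0;
}
```

A caller would first reduce the artifact list with `latestPerLogicalName`, then drop rows for which `isFailedAttempt` returns true before the upsert.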
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the alternate-harness runs ingested for the agentic minimax sweep. Without this entry the legend shows 'AIPERF' from the default toUpperCase fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the frontend depending on the postgres adapter's behavior; strict === between a number and a string silently dropped every match, so the changelog popover always reported "no changelog data available." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
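One way to make such a comparison robust is to normalize both sides to strings before comparing. The `sameRunId` helper below is a hypothetical name used for illustration; note that an id already deserialized as a JavaScript number above 2^53 - 1 has lost precision, which string comparison cannot repair.

```typescript
type RunId = string | number;

// Compare ids by their string form so 123 and '123' match.
function sameRunId(a: RunId, b: RunId): boolean {
  return String(a) === String(b);
}
```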
If the selected model has agentic_traces data, prefer that over the default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL. effectiveSequence already falls back to availableSequences[0] for models without agentic, so models with only fixed-seq data still render correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/app/src/components/inference/ui/ChartControls.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants — picking p90/p99.9 in the percentile selector silently fell back to 0 and collapsed every point into a vertical line at x=0. Copy the full median/p90/p99/p99.9 set into AggDataEntry. Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the percentile selector) and route the input-metric chart through withPercentile so picking p99 actually plots p99_ttft instead of the hard-coded p99_ttft config default. Percentile options pared back to median + p99.
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/InferenceContext.tsx
#	packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only p90 is offered everywhere. Default x-axis metric and chart config input-throughput x are p90_ttft. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the user's `p90_ttft` pick in agentic mode, leaving the chart on the default p90_e2el. The trailing withPercentile pass is idempotent when xAxisField is already at the right percentile, so the gate is unnecessary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Our chart_series + aggregate_stats extractors hardcoded vllm:* metric names, so SGLang runs (e.g. qwen3.5/h100/sglang) ingested cleanly but the per-point detail page rendered empty charts: chart_series fields were all zero-length arrays.

Adds fallback chains in each extractor:

  KV cache util       vllm:kv_cache_usage_perc   → sglang:token_usage
  Prefix cache hits   vllm:prefix_cache_hits     → sglang:cached_tokens
  Prefix cache qrys   vllm:prefix_cache_queries  → sglang:prompt_tokens
  Requests running    vllm:num_requests_running  → sglang:num_running_reqs
  Requests waiting    vllm:num_requests_waiting  → sglang:num_queue_reqs
  Prompt tokens rate  vllm:prompt_tokens         → sglang:prompt_tokens
  Generation rate     vllm:generation_tokens     → sglang:generation_tokens

The `pickFirstNonEmpty` helper walks the chain and uses whichever series has data, so a future framework (mori-sglang, dynamo, etc.) can plug in by adding its names to each chain, with no per-framework branching.

CHART_SERIES_VERSION → 4, STATS_VERSION → 3. Both backfills re-ran (86 chart_series rows, 190 aggregate_stats rows). SGLang chart_series for qwen3.5 run 944 verified: was 0-length arrays before, now ~1800 samples each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
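A helper along these lines could look like the following sketch. The `Series` and `MetricMap` shapes and the helper's signature are assumptions; only the helper name and the metric names are taken from the commit message.

```typescript
// Hypothetical series shape: parallel arrays of timestamps and values.
interface Series { t: number[]; v: number[]; }
type MetricMap = Map<string, Series>;

// Walk the chain in order and return the first series that has samples.
function pickFirstNonEmpty(metrics: MetricMap, chain: readonly string[]): Series | null {
  for (const name of chain) {
    const series = metrics.get(name);
    if (series && series.v.length > 0) return series;
  }
  return null;
}

// One chain per chart field; supporting a new framework means appending a name.
const KV_CACHE_UTIL_CHAIN: readonly string[] = ['vllm:kv_cache_usage_perc', 'sglang:token_usage'];
```

Because vLLM names come first in each chain, existing vLLM rows keep resolving to the same series, and SGLang names are consulted only when the vLLM series is missing or empty.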
SGLang runs' harness JSON doesn't populate server_gpu_cache_hit_rate (vLLM runs do), so the detail-page header and inference chart tooltip showed "—" for SGLang points. Now at trace_replay ingest, if any of the linked benchmark_results rows has a null server_gpu_cache_hit_rate and we have non-empty prefill/hits time-series in the computed chart_series, derive the lifetime cluster ratio as sum(hits.rate) / sum(prompt.rate) and write it into the row's metrics JSONB. Already-stored SGLang rows from runs 944/945 backfilled via a one-off UPDATE earlier in this session (8 rows, mostly ~87-89% hit rate, one high-conc outlier at 2.4%). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Cumulative prompt token source breakdown" chart was empty for
SGLang runs because the vllm-specific vllm:prompt_tokens_by_source
metric doesn't exist on SGLang. Maps sglang:realtime_tokens (which has
mode={prefill_cache, prefill_compute, decode}) into the same source
breakdown when no vllm series is present, filtered to prefill_* modes
(decode tokens are output throughput, not prompt-token volume).
CHART_SERIES_VERSION → 5. Backfilled 128 rows; SGLang rows from runs
944/946/947 now have prefill_cache + prefill_compute sources populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously SGLang detail pages showed two stacked-area layers in the prompt-token source breakdown: prefill_cache (everything that hit the cache) + prefill_compute (cache miss). The user wanted finer granularity, specifically a distinction between on-GPU HBM cache and CPU-offloaded (hicache) host cache.

SGLang's sglang:cached_tokens metric carries a cache_source label that varies per cache tier:
- "device" → on-GPU HBM cache hit
- "host" → CPU-offload (hicache) cache hit
- "total" → older sglang, single series with no tier breakdown

Switches the cache-hit portion of the breakdown from the coarse `prefill_cache` mode label to per-cache_source series:
- device → "cache hit (HBM)"
- host → "cache hit (CPU offload)"
- total → "cache hit"
- other → "cache hit (<src>)"

Cache misses still come from realtime_tokens[mode=prefill_compute], relabeled "compute (miss)" for symmetry.

Current data only contains device/total (no hicache runs ingested yet). When hicache runs come in, the chart will automatically split cache hits into HBM + CPU-offload layers with no further code change.

CHART_SERIES_VERSION → 6. Backfilled 128 rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
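The label mapping described above amounts to a small lookup with a pass-through default. A minimal sketch, where `cacheSourceLabel` is a hypothetical function name:

```typescript
// Map an sglang:cached_tokens `cache_source` label value to a chart series name.
function cacheSourceLabel(cacheSource: string): string {
  switch (cacheSource) {
    case 'device':
      return 'cache hit (HBM)';
    case 'host':
      return 'cache hit (CPU offload)';
    case 'total':
      return 'cache hit';
    default:
      // Unknown tiers stay visible under their raw label.
      return `cache hit (${cacheSource})`;
  }
}
```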
…lors
Two related fixes for SGLang hicache rendering on the agentic detail page:
1. KV cache utilization chart was GPU-HBM-only. SGLang hicache runs also
expose sglang:hicache_host_{used,total}_tokens — the CPU offload
pool's tokens-in-use over its capacity. Extracted as a new
`hostKvCacheUsage` time series; frontend overlays it as a second
orange line on the existing chart when the row has hicache data.
2. The cumulative-prompt-token-source-breakdown chart rendered ALL
three SGLang sources in the same color, because the colors dict
only knew vllm-style names (local_compute, local_cache_hit, etc.).
Added explicit colors for the SGLang label names ('cache hit
(HBM)', 'cache hit (CPU offload)', 'cache hit', 'compute (miss)')
plus a memoized fallback palette so any future unknown source name
gets a distinct color rather than falling through to gray.
CHART_SERIES_VERSION → 7. Backfilled 128 rows; hicache rows from
workflow_run 947 (8 rows) now have ~1830 hostKvCacheUsage samples
matching their HBM samples.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
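The color lookup in point 2 might be structured as in the sketch below. All hex values, the palette contents, and the `colorForSource` name are placeholders; only the source label names come from the commit message.

```typescript
// Explicit colors for known source names (hex values are illustrative).
const KNOWN_SOURCE_COLORS = new Map<string, string>([
  ['cache hit (HBM)', '#2ca02c'],
  ['cache hit (CPU offload)', '#ff7f0e'],
  ['cache hit', '#1f77b4'],
  ['compute (miss)', '#d62728'],
]);

// Fallback palette for names the map does not know about.
const FALLBACK_PALETTE: string[] = ['#9467bd', '#8c564b', '#e377c2', '#17becf'];
const fallbackAssigned = new Map<string, string>();

function colorForSource(name: string): string {
  const known = KNOWN_SOURCE_COLORS.get(name);
  if (known !== undefined) return known;
  const cached = fallbackAssigned.get(name);
  if (cached !== undefined) return cached;
  // Memoize so the same unknown name keeps the same color across renders.
  const next = FALLBACK_PALETTE[fallbackAssigned.size % FALLBACK_PALETTE.length];
  fallbackAssigned.set(name, next);
  return next;
}
```

Colors repeat once the number of unknown names exceeds the palette length, which is a trade-off of keeping the palette small.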
The cumulative-prompt-token-source-breakdown chart was showing huge "100% compute (miss)" plateaus around minute 20-24 of many SGLang runs.

Root cause: the chart computed cumulative shares per ARRAY INDEX (not timestamp), but in SGLang's per-scrape metrics, cache hits and misses fire on different ticks: one scrape reports 193K hits + 0 miss, the next reports 0 hits + 8K miss. So each source has a different timestamp array. Indexing them in lockstep mixed values from different moments and made the share calculation flap to 100% one side or the other.

Fix: union timestamps across all sources, then for each unique timestamp carry forward each source's cumulative sum (a source that didn't report at time t holds its previous cumulative value rather than appearing as 0).

After fix: shares change smoothly over time as each source's cumulative sum grows; transient single-tick gaps no longer drive the visible share to either extreme.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
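The union-and-carry-forward step can be sketched as below. This assumes each source provides ascending timestamps with per-scrape token increments; the `SourceSeries` shape and the `cumulativeShares` name are hypothetical.

```typescript
// Hypothetical input: ascending timestamps with per-scrape token increments.
interface SourceSeries { t: number[]; v: number[]; }

function cumulativeShares(
  sources: Map<string, SourceSeries>,
): { t: number[]; shares: Map<string, number[]> } {
  // 1. Union of timestamps across all sources.
  const timeSet = new Set<number>();
  sources.forEach((s) => s.t.forEach((ts) => timeSet.add(ts)));
  const t = Array.from(timeSet).sort((a, b) => a - b);

  const cursor = new Map<string, number>();  // next unread index per source
  const running = new Map<string, number>(); // cumulative sum per source
  const shares = new Map<string, number[]>();
  sources.forEach((_s, name) => {
    cursor.set(name, 0);
    running.set(name, 0);
    shares.set(name, []);
  });

  for (const ts of t) {
    // 2. Advance each source up to `ts`; a source with no sample at `ts`
    //    simply keeps (carries forward) its previous cumulative sum.
    let total = 0;
    sources.forEach((s, name) => {
      let i = cursor.get(name) ?? 0;
      let sum = running.get(name) ?? 0;
      while (i < s.t.length && s.t[i] <= ts) {
        sum += s.v[i];
        i++;
      }
      cursor.set(name, i);
      running.set(name, sum);
      total += sum;
    });
    // 3. Share of each source in the cumulative total at this timestamp.
    sources.forEach((_s, name) => {
      const sum = running.get(name) ?? 0;
      shares.get(name)!.push(total > 0 ? sum / total : 0);
    });
  }
  return { t, shares };
}
```

For a hit series reporting 192 tokens at t=1 and a miss series reporting 8 tokens at t=2, the hit share is 1 at t=1 and 0.96 at t=2, instead of flapping to 0 at the tick where only misses were reported.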
Previous inline derivation (commit 625d6e8) summed ALL cache hit sources into server_gpu_cache_hit_rate, which conflated GPU HBM hits with CPU offload hits on SGLang hicache rows. The harness JSON also never sets server_cpu_cache_hit_rate.

Now derives both metrics from chart_series.promptTokensBySource:

  server_gpu_cache_hit_rate = sum(HBM + 'cache hit') / sum(prompts)
  server_cpu_cache_hit_rate = sum(CPU offload) / sum(prompts) or null
                              (null when no CPU offload source exists)

Falls back to prefixCacheHitsTps for vLLM rows where promptTokensBySource isn't broken out by cache source. Overwrites any pre-existing value so the derivation stays consistent with what the detail-page charts plot.

Backfilled all existing rows via two-phase SQL update earlier in the session:
- 8 hicache rows in workflow_run 947 now show GPU ~1-2% / CPU ~87-91%
- Other SGLang rows show GPU ~87% / CPU null
- vLLM rows restored to their original GPU hit rates

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inline cache-hit-rate derivation only handled SGLang's hicache label
('cache hit (CPU offload)'). vLLM with LMCache uses 'external_kv_transfer'
in its prompt_tokens_by_source breakdown for the same concept (CPU
offload pool serving tokens to GPU). Those vLLM rows had cpu rate
null even when external_kv_transfer was the dominant source.
Adds external_kv_transfer + local_cache_hit to the source name aliases:
GPU hits = local_cache_hit + cache hit (HBM) + cache hit
CPU hits = external_kv_transfer + cache hit (CPU offload)
fallback = prefixCacheHitsTps total (for single-source rows)
Backfilled 132 affected rows via SQL — vLLM LMCache rows now show CPU
rate where present (e.g. dsv4 b300 conc=128 offload=on shows GPU ~1%
+ CPU ~87%, matching the actual cache topology).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
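The alias-based split can be illustrated with the sketch below. It assumes the inputs are lifetime token totals per source name plus a total prompt-token count, and it omits the `prefixCacheHitsTps` fallback; the `deriveHitRates` name and its signature are hypothetical.

```typescript
// Source-name aliases from the commit message.
const GPU_HIT_SOURCES: string[] = ['local_cache_hit', 'cache hit (HBM)', 'cache hit'];
const CPU_HIT_SOURCES: string[] = ['external_kv_transfer', 'cache hit (CPU offload)'];

interface HitRates { gpu: number | null; cpu: number | null; }

// Assumption: `tokensBySource` holds lifetime token totals keyed by source name.
function deriveHitRates(tokensBySource: Map<string, number>, promptTokens: number): HitRates {
  if (promptTokens <= 0) return { gpu: null, cpu: null };
  const sum = (names: string[]): number =>
    names.reduce((acc, n) => acc + (tokensBySource.get(n) ?? 0), 0);
  // The CPU rate is null (not 0) when the row has no CPU-offload source at all.
  const hasCpuSource = CPU_HIT_SOURCES.some((n) => tokensBySource.has(n));
  return {
    gpu: sum(GPU_HIT_SOURCES) / promptTokens,
    cpu: hasCpuSource ? sum(CPU_HIT_SOURCES) / promptTokens : null,
  };
}
```

Keeping null distinct from 0 lets the UI show "—" for rows without an offload tier rather than a misleading 0% CPU hit rate.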
The chart pre-fetched full trace_replay JSONL blobs for every visible
agentic point just to decide whether to render the "View charts" button
in pinned tooltips. With the latest run's 8x8 conc=512 rows pushing up
to 13 MB compressed per blob, 12-id chunks blew past Neon's 64 MB
per-HTTP-response cap and 500'd — hiding the button for every point.
New /api/v1/trace-availability returns {id: true} for ids that have a
stored blob; ScatterGraph uses that boolean instead. trace-histograms
is still used by the detail page (single id, no chunking issue).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cluster-average KV util line hides load skew on DEP configs — 8 ranks averaging 20% can hide one rank at 12% and another at 23%. Bump CHART_SERIES_VERSION 7 -> 8 to keep one entry per engine in kvCacheUsageByEngine. The detail page draws each rank in the request-timeline palette (so DP indices read as the same color in both views) and overlays the bold red "Avg" line on top. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The TTFT, interactivity, session-time, and prefill-tps charts used to compute their own Pareto frontiers on the swapped x metric. That let a vendor benchmark-hack: tune a config to top TTFT while quietly tanking decode (or vice versa), and post a chart-topping point that didn't reflect real e2e performance. When xmode != 'e2e', filter the displayed point set to those that sit on the (e2e_latency, y) Pareto frontier — same set of points across every non-e2e chart, just rendered at the chosen x metric. The e2e chart itself is unchanged and remains the source of truth. Per Oren's review: "all and only the points that show up on e2e latency pareto should show up on ttft & interactivity & prefill tok/s/user pareto." Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous change filtered the displayed data down to e2e-Pareto winners, which hid every dominated config from the TTFT / interactivity / session-time / prefill-tps views. Users couldn't see where the non-optimal configs actually sit on the alternative axes — losing diagnostic visibility just to enforce the anti-benchmark-hack rule. Switch from hard filter to a per-point `isOnE2eFrontier` flag: every point still renders as scatter, only the e2e-Pareto winners feed the frontier line. ScatterGraph honors the flag in its roofline compute so the line stays restricted to non-hackable configs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
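The flagging approach might be sketched as follows, assuming lower e2e latency and higher y are better. The `Point` shape and the `markE2eFrontier` name are hypothetical; only the `isOnE2eFrontier` flag name comes from the commit message.

```typescript
interface Point {
  id: string;
  e2eLatency: number; // lower is better (assumption)
  y: number;          // higher is better (assumption)
  isOnE2eFrontier?: boolean;
}

// Flag the points on the (e2e latency, y) Pareto frontier without removing
// the dominated ones, so every point can still be drawn as scatter.
function markE2eFrontier(points: Point[]): Point[] {
  const sorted = points.slice().sort((a, b) => a.e2eLatency - b.e2eLatency || b.y - a.y);
  const frontierIds = new Set<string>();
  let bestY = -Infinity;
  for (const p of sorted) {
    // Walking in order of increasing latency, a point is on the frontier
    // only if it beats the best y seen at any lower latency.
    if (p.y > bestY) {
      frontierIds.add(p.id);
      bestY = p.y;
    }
  }
  return points.map((p) => ({ ...p, isOnE2eFrontier: frontierIds.has(p.id) }));
}
```

A chart on another x metric would then draw all returned points but build its frontier line only from those with `isOnE2eFrontier === true`.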
Fixed-seq workloads don't have the multi-turn / session-time framing that motivated the anti-hack rule — their e2e IS the request latency, so a TTFT hack there reads honestly on e2e too. Reverting fixed-seq to the prior per-axis Pareto avoids changing established leaderboard semantics for non-agentic runs. Agentic continues to mark `isOnE2eFrontier` on each point so the TTFT, interactivity, session-time and prefill-tps lines stay restricted to e2e-winning configs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add an optional infoTooltip field to LegendSwitchConfig that renders a small info icon next to the switch label. On agentic + non-e2e xmodes, hovering it explains that "optimal" means on the end-to-end Pareto frontier (not a per-axis Pareto), so users understand why off-frontier points may appear above the line. Hit target widened (-m-1.5 p-1.5) and delay dropped to 100ms so the tiny icon isn't flaky to hover. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t hardware Two workflow runs landing on the same date for the same model+precision but DIFFERENT hardware (e.g. a B300 dsv4 run and a B200 dsv4 run) each get their own changelog entry. The single-run scoping guard matched runs by model+precision only, so both counted as "runs with a changelog for this model", length>1 tripped, and selecting either run scoped the benchmarks query to that one workflow run — hiding the other GPU's curve entirely (carry-forward across hardware silently broke). Scope to a single run only when two runs contest the SAME full config_key (model-precision-hardware-framework) — a genuine same-day re-run of one hardware, where a DISTINCT ON merge could mix them. Complementary different-hardware runs now both render via the normal date carry-forward. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
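The revised guard reduces to checking whether any full config key is claimed by more than one run. A minimal sketch, with a hypothetical `RunWithChangelog` shape and `needsSingleRunScope` name:

```typescript
interface RunWithChangelog {
  workflowRunId: string;
  configKeys: string[]; // e.g. model-precision-hardware-framework
}

// Scope to a single run only when two different runs contest the same config key.
function needsSingleRunScope(runs: RunWithChangelog[]): boolean {
  const ownerByKey = new Map<string, string>();
  for (const run of runs) {
    for (const key of run.configKeys) {
      const owner = ownerByKey.get(key);
      if (owner !== undefined && owner !== run.workflowRunId) return true;
      ownerByKey.set(key, run.workflowRunId);
    }
  }
  return false;
}
```

Two same-day runs on different hardware produce different keys, so the guard stays off and both curves render through the normal date carry-forward.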
No description provided.