
[WIP] agentx#348

Draft
cquil11 wants to merge 66 commits into master from feat/agentx

Conversation

@cquil11
Contributor

@cquil11 cquil11 commented May 14, 2026

No description provided.

cquil11 and others added 12 commits April 23, 2026 13:40
Adds agentic_traces scenario end-to-end:
- Schema migrations for agentic scenario, availability, and KV offload mode
- DB ingest/ETL + query updates to carry scenario, offload_mode, and
  server/theoretical cache-hit rates through to the API layer
- Frontend types, filters (GlobalFilterContext / InferenceContext /
  ChartControls), URL state, and tooltip rows for agentic-only fields
- ScatterGraph: subtle dashed halo on Pareto-frontier points that used
  KV offload so the tradeoff is visible at a glance
- ScatterGraph: include `offload_mode` in `buildPointConfigId` so d3's data
  join keeps both `on` and `off` variants for the same (config, conc).
  Without it, the second variant collapsed onto the first key, so FP8
  offload-on points (and their halos) silently disappeared.
- benchmark-mapper: handle older artifacts that emit `users`/`offload_mode`
  AND newer ones that emit `conc`/`offloading` (with 'none' → 'off' mapping).
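
A minimal sketch of why the join key needs `offload_mode`. The real `buildPointConfigId` is not shown on this page, so the `Point` shape, the key format, and the sample values below are assumptions for illustration only.

```typescript
// Hypothetical point shape; only offload_mode is named in the commit message.
interface Point {
  config: string;
  conc: number;
  offload_mode: "on" | "off";
}

// Key without offload_mode: both variants of one (config, conc) collide.
const keyWithoutOffload = (p: Point): string => `${p.config}|${p.conc}`;

// Key with offload_mode: each variant keeps its own identity in a keyed data join.
const keyWithOffload = (p: Point): string =>
  `${p.config}|${p.conc}|${p.offload_mode}`;

const points: Point[] = [
  { config: "fp8-tp8", conc: 64, offload_mode: "off" },
  { config: "fp8-tp8", conc: 64, offload_mode: "on" },
];

// Counts how many distinct identities a key function yields for the sample points.
const distinctKeys = (key: (p: Point) => string): number =>
  new Set(points.map(key)).size;
```

With the shorter key the two sample points share one identity, so a keyed join can bind only one of them to an element; with the longer key both survive.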

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The halo's purpose is to surface KV-offload usage; restricting it to
Pareto-frontier-only points hid the indicator on most runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b300-p1 (and similar) artifacts were skipping ingest because the runner-pool
suffix wasn't in the strip list and didn't normalize to the canonical b300
GPU key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Label text now includes `C=<conc>` alongside the GPU/parallelism tag
  (default `<tp> C=<conc>`, advanced `<getPointLabel> C=<conc>`)
- Bumped point-label font-weight to 700 so the labels read clearly against
  the chart fill
- Greedy collision-avoidance pass on render and zoom: tries placing each
  label above/below the point through 4 candidate dy offsets, hiding the
  label only when no slot is free

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oint

Tspans now ride above the text's `dy` anchor — the LAST line sits at the
anchor (just above the point) and earlier lines stack above it. Previously
the second tspan landed below the anchor and crashed into the marker.

Also widened collision candidates by label height so the flipped-below
position fully clears the point on multi-line labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pass

When a `<text>` contains tspans, the parent's `dy` does not shift the bbox
cleanly — its (unused) y=0 origin still factors in, so the rendered text
ended up centered on the point. Move the absolute offset into the FIRST
tspan's `dy`; later tspans cascade by 1.1em.

Collision avoidance now drives the first tspan's `dy` and tries four
candidate baselines (primary above, primary below, secondary above,
secondary below), accounting for full label height when picking a
non-overlapping slot. Labels are still hidden as a last resort.
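
The greedy pass described above could look roughly like the sketch below. It is not the PR's code: the `Box` and `LabelInput` types, the `gap` parameter, and the exact candidate offsets are assumptions chosen to match the four slots named in the message.

```typescript
interface Box { x: number; y: number; width: number; height: number; }
interface LabelInput { px: number; py: number; width: number; height: number; }

// Axis-aligned overlap test; boxes that only touch at an edge do not overlap.
const overlaps = (a: Box, b: Box): boolean =>
  a.x < b.x + b.width && b.x < a.x + a.width &&
  a.y < b.y + b.height && b.y < a.y + a.height;

// Greedy placement: each label takes the first free slot, or null (hidden).
function placeLabels(labels: LabelInput[], gap = 4): (Box | null)[] {
  const placed: Box[] = [];
  return labels.map((l) => {
    // Candidate top edges (SVG y grows downward):
    // primary above, primary below, secondary above, secondary below.
    const candidates = [
      l.py - gap - l.height,
      l.py + gap,
      l.py - gap - 2 * l.height,
      l.py + gap + l.height,
    ];
    for (const y of candidates) {
      const box: Box = { x: l.px - l.width / 2, y, width: l.width, height: l.height };
      if (!placed.some((p) => overlaps(p, box))) {
        placed.push(box);
        return box;
      }
    }
    return null; // no free slot: hide the label as a last resort
  });
}
```

Because the pass is greedy, the result depends on label order; earlier labels always win their preferred slot.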

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary fixes for runs whose `results_bmk` aggregated artifact
ends up containing both a successful row and a failed-attempt row for the
same (config, conc, offload) — the failed row's null metrics were
overwriting the good row via ON CONFLICT DO UPDATE.

1. Artifact-level: strip the trailing `_<runner-pool>_<attempt>` suffix
   from each artifact name and group by the logical name, keeping only the
   most recent per group.

2. Row-level: skip rows with `num_requests_successful === 0` AND
   `num_requests_total > 0`. The aggregated artifact merges rows from all
   runners — including failed ones — so artifact-level dedup alone can't
   reach inside it.
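
A sketch of the two filters under stated assumptions: the suffix regex assumes `<attempt>` is numeric and `<runner-pool>` contains no underscore, and the `Artifact` type, its `createdAt` field, and the sample names are hypothetical. Only `num_requests_successful` and `num_requests_total` come from the message.

```typescript
interface Artifact { name: string; createdAt: string; } // createdAt: ISO timestamp (assumed)
interface Row { num_requests_successful: number; num_requests_total: number; }

// Strip a trailing `_<runner-pool>_<attempt>` suffix to get the logical name.
const logicalName = (name: string): string => name.replace(/_[^_]+_\d+$/, "");

// Artifact-level dedup: keep only the most recent artifact per logical name.
function latestPerLogicalName(artifacts: Artifact[]): Artifact[] {
  const latest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const key = logicalName(a.name);
    const prev = latest.get(key);
    if (!prev || Date.parse(a.createdAt) > Date.parse(prev.createdAt)) {
      latest.set(key, a);
    }
  }
  return Array.from(latest.values());
}

// Row-level filter: a row that attempted requests but completed none is a failed attempt.
const isFailedAttempt = (r: Row): boolean =>
  r.num_requests_successful === 0 && r.num_requests_total > 0;
```

Rows with `num_requests_total === 0` are deliberately not treated as failures here, matching the AND condition in the message.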

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
Tag display name for the `aiperf` spec_method suffix used by the
alternate-harness runs ingested for the agentic minimax sweep.
Without this entry the legend shows 'AIPERF' from the default
toUpperCase fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bigint workflow_run_id sometimes deserializes as a number on the
frontend depending on the postgres adapter's behavior; strict ===
between a number and a string silently dropped every match, so the
changelog popover always reported "no changelog data available."
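
One way to make the comparison robust, sketched with a hypothetical helper name (`sameRunId` is not from the PR): normalize both sides to strings before comparing.

```typescript
// A run id may arrive as a string or as a number, depending on the adapter.
type RunId = string | number;

// Compare by string form so 123 and "123" are treated as the same run.
const sameRunId = (a: RunId, b: RunId): boolean => String(a) === String(b);
```

This only fixes the type mismatch. An id that was already deserialized as a number above `Number.MAX_SAFE_INTEGER` has lost precision before this helper ever sees it.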

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the selected model has agentic_traces data, prefer that over the
default 8K/1K fixed-seq when the user hasn't explicitly chosen via URL.
effectiveSequence already falls back to availableSequences[0] for models
without agentic, so models with only fixed-seq data still render correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: inferencemax-app | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Jun 4, 2026 7:51pm


# Conflicts:
#	packages/app/src/components/inference/ui/ChartControls.tsx
#	packages/app/src/components/inference/utils/tooltipUtils.ts
#	packages/db/src/etl/normalizers.ts
rowToAggDataEntry was only copying median/p99 metric variants — picking
p90/p99.9 in the percentile selector silently fell back to 0 and
collapsed every point into a vertical line at x=0. Copy the full
median/p90/p99/p99.9 set into AggDataEntry.

Hide the X-Axis Metric dropdown for agentic mode (it doubled up with the
percentile selector) and route the input-metric chart through
withPercentile so picking p99 actually plots p99_ttft instead of the
hard-coded p99_ttft config default. Percentile options pared back to
median + p99.
cquil11 added 2 commits May 15, 2026 12:30
# Conflicts:
#	packages/app/src/components/GlobalFilterContext.tsx
#	packages/app/src/components/inference/InferenceContext.tsx
#	packages/app/src/components/inference/hooks/useChartData.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the TTFT x-axis selectors with the percentile selector — only
p90 is offered everywhere. Default x-axis metric and chart config
input-throughput x are p90_ttft.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `!isAgentic` gate on the e2e TTFT override branch dropped the
user's `p90_ttft` pick in agentic mode, leaving the chart on the
default p90_e2el. The trailing withPercentile pass is idempotent
when xAxisField is already at the right percentile, so the gate is
unnecessary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Our chart_series + aggregate_stats extractors hardcoded vllm:* metric
names, so SGLang runs (e.g. qwen3.5/h100/sglang) ingested cleanly but
the per-point detail page rendered empty charts — chart_series fields
were all zero-length arrays.

Adds fallback chains in each extractor:

  KV cache util      vllm:kv_cache_usage_perc  → sglang:token_usage
  Prefix cache hits  vllm:prefix_cache_hits    → sglang:cached_tokens
  Prefix cache qrys  vllm:prefix_cache_queries → sglang:prompt_tokens
  Requests running   vllm:num_requests_running → sglang:num_running_reqs
  Requests waiting   vllm:num_requests_waiting → sglang:num_queue_reqs
  Prompt tokens rate vllm:prompt_tokens        → sglang:prompt_tokens
  Generation rate    vllm:generation_tokens    → sglang:generation_tokens

The `pickFirstNonEmpty` helper walks the chain and uses whichever
series has data, so a future framework (mori-sglang, dynamo, etc.) can
plug in by adding its names to each chain — no per-framework branching.
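
A sketch of what such a helper might look like. The signature and the `MetricMap` type are assumptions; only the helper name and the metric names come from the message.

```typescript
type Series = number[];
type MetricMap = Record<string, Series | undefined>;

// Walk the chain in order and return the first series that has samples.
function pickFirstNonEmpty(metrics: MetricMap, chain: string[]): Series {
  for (const name of chain) {
    const series = metrics[name];
    if (series && series.length > 0) return series;
  }
  return []; // nothing in the chain had data
}

// One chain per chart field; a new framework appends its metric name here.
const KV_CACHE_UTIL_CHAIN = ["vllm:kv_cache_usage_perc", "sglang:token_usage"];
```

Order in the chain sets precedence, so a run that somehow exposed both names would use the vLLM series.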

CHART_SERIES_VERSION → 4, STATS_VERSION → 3. Both backfills re-ran (86
chart_series rows, 190 aggregate_stats rows). SGLang chart_series for
qwen3.5 run 944 verified — was 0-length arrays before, now ~1800
samples each.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SGLang runs' harness JSON doesn't populate server_gpu_cache_hit_rate
(vLLM runs do), so the detail-page header and inference chart tooltip
showed "—" for SGLang points. Now at trace_replay ingest, if any of
the linked benchmark_results rows has a null server_gpu_cache_hit_rate
and we have non-empty prefill/hits time-series in the computed
chart_series, derive the lifetime cluster ratio as
sum(hits.rate) / sum(prompt.rate) and write it into the row's metrics
JSONB.

Already-stored SGLang rows from runs 944/945 backfilled via a one-off
UPDATE earlier in this session (8 rows, mostly ~87-89% hit rate, one
high-conc outlier at 2.4%).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Cumulative prompt token source breakdown" chart was empty for
SGLang runs because the vllm-specific vllm:prompt_tokens_by_source
metric doesn't exist on SGLang. Maps sglang:realtime_tokens (which has
mode={prefill_cache, prefill_compute, decode}) into the same source
breakdown when no vllm series is present, filtered to prefill_* modes
(decode tokens are output throughput, not prompt-token volume).

CHART_SERIES_VERSION → 5. Backfilled 128 rows; SGLang rows from runs
944/946/947 now have prefill_cache + prefill_compute sources populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously SGLang detail pages showed two stacked-area layers in the
prompt-token source breakdown: prefill_cache (everything that hit the
cache) + prefill_compute (cache miss). The user wanted finer
granularity — specifically a distinction between on-GPU HBM cache and
CPU-offloaded (hicache) host cache.

SGLang's sglang:cached_tokens metric carries a cache_source label that
varies per cache tier:
  - "device" → on-GPU HBM cache hit
  - "host"   → CPU-offload (hicache) cache hit
  - "total"  → older sglang, single series with no tier breakdown

Switches the cache-hit portion of the breakdown from the coarse
`prefill_cache` mode label to per-cache_source series:
  - device → "cache hit (HBM)"
  - host   → "cache hit (CPU offload)"
  - total  → "cache hit"
  - other  → "cache hit (<src>)"

Cache misses still come from realtime_tokens[mode=prefill_compute],
relabeled "compute (miss)" for symmetry.
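
The label mapping above is small enough to express as one function. The function name is hypothetical; the label strings are the ones listed in this message.

```typescript
// Map SGLang's cache_source label value to the chart's series label.
function cacheSourceLabel(src: string): string {
  switch (src) {
    case "device":
      return "cache hit (HBM)";
    case "host":
      return "cache hit (CPU offload)";
    case "total":
      return "cache hit"; // older sglang: single series, no tier breakdown
    default:
      return `cache hit (${src})`; // unknown tier: keep the raw name visible
  }
}
```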

Current data only contains device/total (no hicache runs ingested
yet) — when hicache runs come in, the chart will automatically split
cache hits into HBM + CPU-offload layers with no further code change.

CHART_SERIES_VERSION → 6. Backfilled 128 rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lors

Two related fixes for SGLang hicache rendering on the agentic detail page:

1. KV cache utilization chart was GPU-HBM-only. SGLang hicache runs also
   expose sglang:hicache_host_{used,total}_tokens — the CPU offload
   pool's tokens-in-use over its capacity. Extracted as a new
   `hostKvCacheUsage` time series; frontend overlays it as a second
   orange line on the existing chart when the row has hicache data.

2. The cumulative-prompt-token-source-breakdown chart rendered ALL
   three SGLang sources in the same color, because the colors dict
   only knew vllm-style names (local_compute, local_cache_hit, etc.).
   Added explicit colors for the SGLang label names ('cache hit
   (HBM)', 'cache hit (CPU offload)', 'cache hit', 'compute (miss)')
   plus a memoized fallback palette so any future unknown source name
   gets a distinct color rather than falling through to gray.

CHART_SERIES_VERSION → 7. Backfilled 128 rows; hicache rows from
workflow_run 947 (8 rows) now have ~1830 hostKvCacheUsage samples
matching their HBM samples.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cumulative-prompt-token-source-breakdown chart was showing huge
"100% compute (miss)" plateaus around minute 20-24 of many SGLang runs.

Root cause: the chart computed cumulative shares per ARRAY INDEX (not
timestamp), but in SGLang's per-scrape metrics, cache hits and misses
fire on different ticks — one scrape reports 193K hits + 0 miss, the
next reports 0 hits + 8K miss. So each source has a different timestamp
array. Indexing them in lockstep mixed values from different moments
and made the share calculation flap to 100% one side or the other.

Fix: union timestamps across all sources, then for each unique
timestamp carry forward each source's cumulative sum (a source that
didn't report at time t holds its previous cumulative value rather
than appearing as 0).

After fix: shares change smoothly over time as each source's cumulative
sum grows; transient single-tick gaps no longer drive the visible
share to either extreme.
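
A sketch of the union-and-carry-forward approach. The `SourceSeries` shape and function name are assumptions, and the code assumes each source's timestamps are sorted ascending and its values are per-scrape increments.

```typescript
interface SourceSeries { timestamps: number[]; values: number[]; }
interface SharePoint { t: number; shares: Record<string, number>; }

function cumulativeShares(sources: Record<string, SourceSeries>): SharePoint[] {
  const names = Object.keys(sources);

  // Union of every timestamp any source reported at.
  const tsSet = new Set<number>();
  for (const n of names) for (const t of sources[n].timestamps) tsSet.add(t);
  const allTs = Array.from(tsSet).sort((a, b) => a - b);

  const cursor: Record<string, number> = {};  // next unread sample per source
  const running: Record<string, number> = {}; // cumulative sum per source
  for (const n of names) { cursor[n] = 0; running[n] = 0; }

  return allTs.map((t) => {
    let total = 0;
    for (const n of names) {
      const s = sources[n];
      // Consume samples up to t; a source silent at t keeps its previous sum.
      while (cursor[n] < s.timestamps.length && s.timestamps[cursor[n]] <= t) {
        running[n] += s.values[cursor[n]];
        cursor[n] += 1;
      }
      total += running[n];
    }
    const shares: Record<string, number> = {};
    for (const n of names) shares[n] = total > 0 ? running[n] / total : 0;
    return { t, shares };
  });
}
```

With hits reported at t=1 and t=3 and a miss at t=2, the miss share at t=3 stays at its carried-forward cumulative value instead of dropping to zero.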

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous inline derivation (commit 625d6e8) summed ALL cache hit
sources into server_gpu_cache_hit_rate, which conflated GPU HBM hits
with CPU offload hits on SGLang hicache rows. The harness JSON also
never sets server_cpu_cache_hit_rate.

Now derives both metrics from chart_series.promptTokensBySource:
  server_gpu_cache_hit_rate = sum(HBM + 'cache hit') / sum(prompts)
  server_cpu_cache_hit_rate = sum(CPU offload) / sum(prompts) or null
                              (null when no CPU offload source exists)

Falls back to prefixCacheHitsTps for vLLM rows where promptTokensBySource
isn't broken out by cache source. Overwrites any pre-existing value so
the derivation stays consistent with what the detail-page charts plot.

Backfilled all existing rows via two-phase SQL update earlier in the
session:
  - 8 hicache rows in workflow_run 947 now show GPU ~1-2% / CPU ~87-91%
  - Other SGLang rows show GPU ~87% / CPU null
  - vLLM rows restored to their original GPU hit rates

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inline cache-hit-rate derivation only handled SGLang's hicache label
('cache hit (CPU offload)'). vLLM with LMCache uses 'external_kv_transfer'
in its prompt_tokens_by_source breakdown for the same concept (CPU
offload pool serving tokens to GPU). Those vLLM rows had cpu rate
null even when external_kv_transfer was the dominant source.

Adds external_kv_transfer + local_cache_hit to the source name aliases:
  GPU hits  = local_cache_hit + cache hit (HBM) + cache hit
  CPU hits  = external_kv_transfer + cache hit (CPU offload)
  fallback  = prefixCacheHitsTps total (for single-source rows)
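
The alias grouping can be sketched as below. This assumes the denominator is the lifetime token total across all sources and that `bySource` already holds lifetime sums; the function name and types are hypothetical, and the `prefixCacheHitsTps` fallback for single-source rows is left out.

```typescript
const GPU_HIT_SOURCES = ["local_cache_hit", "cache hit (HBM)", "cache hit"];
const CPU_HIT_SOURCES = ["external_kv_transfer", "cache hit (CPU offload)"];

type TokensBySource = Record<string, number>; // lifetime token totals per source (assumed)

function deriveHitRates(bySource: TokensBySource): { gpu: number | null; cpu: number | null } {
  // Assumption: total prompt tokens = sum over every source, hits and misses alike.
  const totalPrompt = Object.keys(bySource).reduce((sum, k) => sum + bySource[k], 0);
  if (totalPrompt <= 0) return { gpu: null, cpu: null };

  const sumOf = (names: string[]): number =>
    names.reduce((sum, n) => sum + (bySource[n] ?? 0), 0);

  // CPU rate is null (not 0) when no CPU-offload source exists at all.
  const hasCpuSource = CPU_HIT_SOURCES.some((n) => n in bySource);
  return {
    gpu: sumOf(GPU_HIT_SOURCES) / totalPrompt,
    cpu: hasCpuSource ? sumOf(CPU_HIT_SOURCES) / totalPrompt : null,
  };
}
```

Keeping the alias lists as data means another framework's source names can be added without touching the arithmetic.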

Backfilled 132 affected rows via SQL — vLLM LMCache rows now show CPU
rate where present (e.g. dsv4 b300 conc=128 offload=on shows GPU ~1%
+ CPU ~87%, matching the actual cache topology).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cquil11 and others added 2 commits June 4, 2026 13:02
The chart pre-fetched full trace_replay JSONL blobs for every visible
agentic point just to decide whether to render the "View charts" button
in pinned tooltips. With the latest run's 8x8 conc=512 rows pushing up
to 13 MB compressed per blob, 12-id chunks blew past Neon's 64 MB
per-HTTP-response cap and 500'd — hiding the button for every point.

New /api/v1/trace-availability returns {id: true} for ids that have a
stored blob; ScatterGraph uses that boolean instead. trace-histograms
is still used by the detail page (single id, no chunking issue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cluster-average KV util line hides load skew on DEP configs — 8
ranks averaging 20% can hide one rank at 12% and another at 23%.

Bump CHART_SERIES_VERSION 7 -> 8 to keep one entry per engine in
kvCacheUsageByEngine. The detail page draws each rank in the
request-timeline palette (so DP indices read as the same color in
both views) and overlays the bold red "Avg" line on top.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The TTFT, interactivity, session-time, and prefill-tps charts used to
compute their own Pareto frontiers on the swapped x metric. That let a
vendor benchmark-hack: tune a config to top TTFT while quietly tanking
decode (or vice versa), and post a chart-topping point that didn't
reflect real e2e performance.

When xmode != 'e2e', filter the displayed point set to those that sit
on the (e2e_latency, y) Pareto frontier — same set of points across
every non-e2e chart, just rendered at the chosen x metric. The e2e
chart itself is unchanged and remains the source of truth.

Per Oren's review:
  "all and only the points that show up on e2e latency pareto should
   show up on ttft & interactivity & prefill tok/s/user pareto."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous change filtered the displayed data down to e2e-Pareto winners,
which hid every dominated config from the TTFT / interactivity /
session-time / prefill-tps views. Users couldn't see where the
non-optimal configs actually sit on the alternative axes — losing
diagnostic visibility just to enforce the anti-benchmark-hack rule.

Switch from hard filter to a per-point `isOnE2eFrontier` flag: every
point still renders as scatter, only the e2e-Pareto winners feed the
frontier line. ScatterGraph honors the flag in its roofline compute
so the line stays restricted to non-hackable configs.
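
A sketch of flagging rather than filtering. It assumes lower e2e latency and higher y are both better; the `Pt` field names and the function name are hypothetical, and only `isOnE2eFrontier` comes from the message.

```typescript
interface Pt { e2eLatency: number; y: number; }

// Flag every point; a point is on the frontier when no other point is at least
// as good on both axes and strictly better on one.
function markE2eFrontier<T extends Pt>(points: T[]): (T & { isOnE2eFrontier: boolean })[] {
  return points.map((p) => {
    const dominated = points.some(
      (q) =>
        q !== p &&
        q.e2eLatency <= p.e2eLatency &&
        q.y >= p.y &&
        (q.e2eLatency < p.e2eLatency || q.y > p.y),
    );
    return { ...p, isOnE2eFrontier: !dominated };
  });
}
```

All points are returned, so the scatter still shows dominated configs; only the frontier line needs to read the flag. The pairwise check is O(n²), which is fine for a sketch but may not reflect how the PR computes it.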

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixed-seq workloads don't have the multi-turn / session-time framing
that motivated the anti-hack rule — their e2e IS the request latency,
so a TTFT hack there reads honestly on e2e too. Reverting fixed-seq
to the prior per-axis Pareto avoids changing established leaderboard
semantics for non-agentic runs.

Agentic continues to mark `isOnE2eFrontier` on each point so the TTFT,
interactivity, session-time and prefill-tps lines stay restricted to
e2e-winning configs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add an optional infoTooltip field to LegendSwitchConfig that renders a
small info icon next to the switch label. On agentic + non-e2e xmodes,
hovering it explains that "optimal" means on the end-to-end Pareto
frontier (not a per-axis Pareto), so users understand why off-frontier
points may appear above the line.

Hit target widened (-m-1.5 p-1.5) and delay dropped to 100ms so the
tiny icon isn't flaky to hover.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t hardware

Two workflow runs landing on the same date for the same model+precision
but DIFFERENT hardware (e.g. a B300 dsv4 run and a B200 dsv4 run) each
get their own changelog entry. The single-run scoping guard matched runs
by model+precision only, so both counted as "runs with a changelog for
this model", length>1 tripped, and selecting either run scoped the
benchmarks query to that one workflow run — hiding the other GPU's curve
entirely (carry-forward across hardware silently broke).

Scope to a single run only when two runs contest the SAME full config_key
(model-precision-hardware-framework) — a genuine same-day re-run of one
hardware, where a DISTINCT ON merge could mix them. Complementary
different-hardware runs now both render via the normal date carry-forward.
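
The revised guard can be sketched as a check for a config_key claimed by more than one run. The `RunWithChangelog` type, the function name, and the sample keys are assumptions; the config_key format follows the model-precision-hardware-framework description above.

```typescript
interface RunWithChangelog {
  workflowRunId: string;
  configKeys: string[]; // e.g. "dsv4-fp8-b300-vllm" (illustrative value)
}

// True only when two different runs contest the same full config_key.
function shouldScopeToSingleRun(runsOnDate: RunWithChangelog[]): boolean {
  const owners = new Map<string, string>(); // config_key -> first run that claimed it
  for (const run of runsOnDate) {
    for (const key of run.configKeys) {
      const owner = owners.get(key);
      if (owner !== undefined && owner !== run.workflowRunId) return true;
      owners.set(key, run.workflowRunId);
    }
  }
  return false;
}
```

Two same-day runs on different hardware have disjoint keys, so the guard stays off and both curves render through the normal carry-forward.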

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2 participants