
feat(grafana): add gpu workload history deep-link endpoint (#824)#596

Open
SekiXu wants to merge 2 commits into develop from feat/824-gpu-workload-history-backend

SekiXu commented Jun 29, 2026


What type of PR is this?

Feature — new backend API endpoint (Grafana deep-link for GPU workload history).

Which issue(s) this PR fixes?

Backend track of #824 and #825. Both are cross-track User Stories; this PR covers only the backend portion and does not close them.

What this PR does?

Adds GET /api/v1/datacenters/{dataCenter}/grafana/gpuWorkloadHistory/{hostname}, returning the Device dashboard (UID i-device) deep-links for a physical node's GPU Utilization (panel 50) and VRAM (panel 51), so the Frontend can add a "View Workload History" button on the physical GPU table.

Response:

{
  "gpuUtilizationUrl": "https://<vip>/grafana/d/i-device/device?orgId=1&var-GPU_HOST=<host>&from=now-3h&to=now&viewPanel=50",
  "vramUrl":           "https://<vip>/grafana/d/i-device/device?orgId=1&var-GPU_HOST=<host>&from=now-3h&to=now&viewPanel=51",
  "enabled": true
}
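
For readers consuming this endpoint from Go, the response above could be decoded roughly as follows. This is a minimal sketch: the struct field names and JSON tags are inferred from the example response, not copied from the PR diff, and decodeGpuWorkloadHistory and the vip.example host are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GpuWorkloadHistory mirrors the JSON response shown above.
// Assumption: field names and tags are inferred from that example;
// the real struct in definition/v1/grafana may differ.
type GpuWorkloadHistory struct {
	GpuUtilizationUrl string `json:"gpuUtilizationUrl"`
	VramUrl           string `json:"vramUrl"`
	Enabled           bool   `json:"enabled"`
}

// decodeGpuWorkloadHistory (hypothetical helper) parses a response body.
func decodeGpuWorkloadHistory(body []byte) (GpuWorkloadHistory, error) {
	var h GpuWorkloadHistory
	err := json.Unmarshal(body, &h)
	return h, err
}

func main() {
	// Placeholder payload; vip.example stands in for the real VIP.
	body := []byte(`{
		"gpuUtilizationUrl": "https://vip.example/grafana/d/i-device/device?viewPanel=50",
		"vramUrl": "https://vip.example/grafana/d/i-device/device?viewPanel=51",
		"enabled": true
	}`)

	h, err := decodeGpuWorkloadHistory(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Enabled, h.GpuUtilizationUrl)
}
```

A caller would typically check Enabled before rendering the "View Workload History" button.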

Changes (additive, zero breaking change):

  • definition/v1/grafana: new GpuWorkloadHistory struct (two URLs + enabled).
  • handlers/grafana/links.go: genGpuUtilizationHistoryLink / genGpuVramHistoryLink (reuse existing link-gen style + base.DataCenterVip).
  • handlers/grafana/handlers.go: forwardGpuWorkloadHistoryLinks + route.

Design notes:

  • One endpoint returns both URLs (per the spike) → Frontend gets both in one call.
  • Filter variable is var-GPU_HOST (hidden $GPU_HOST), not var-HOST. var-HOST resolves to ipmi_sensor.hostname, which leaves the GPU panels blank.
  • enabled = nodes.IsExist(hostname) (cluster-wide). GetNodeGpusMap is intentionally avoided: it runs hex_sdk locally and reports the wrong node for remote hostnames.
  • Auth-free (GET /grafana/* is in the API auth-free allowlist), consistent with the other Grafana link endpoints.
  • Contract (see #824/#825): panel ids 50=Util / 51=VRAM and var-GPU_HOST must stay stable.
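
The URL contract described in these notes can be sketched as a small builder. This is an illustration under stated assumptions, not the PR's genGpuUtilizationHistoryLink / genGpuVramHistoryLink code: buildPanelLink, the constants, and the 10.0.0.10 VIP are hypothetical, and the query-parameter order simply follows the example response.

```go
package main

import (
	"fmt"
	"net/url"
)

const (
	// Dashboard path and panel ids taken from the PR description.
	deviceDashboardPath = "/grafana/d/i-device/device"
	panelGpuUtilization = 50
	panelGpuVram        = 51
)

// buildPanelLink (hypothetical) assembles a deep-link for one panel.
// The host is filtered through var-GPU_HOST, never var-HOST.
func buildPanelLink(vip, host string, panelID int) string {
	return fmt.Sprintf(
		"https://%s%s?orgId=1&var-GPU_HOST=%s&from=now-3h&to=now&viewPanel=%d",
		vip, deviceDashboardPath, url.QueryEscape(host), panelID,
	)
}

func main() {
	// 10.0.0.10 is a placeholder VIP; sky150 is the test node named in the PR.
	fmt.Println(buildPanelLink("10.0.0.10", "sky150", panelGpuUtilization))
	fmt.Println(buildPanelLink("10.0.0.10", "sky150", panelGpuVram))
}
```

Keeping the panel ids and the variable name in named constants makes the stability contract visible in one place if the dashboard ever changes.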

Test results (optional)

1). make sure the api docs have been updated

✅ Added /grafana/gpuWorkloadHistory/{hostname} to the OpenAPI spec (cube-cos-openapi#103), and bumped the submodule pointer in this PR.

2). make sure the api works properly

  • Built on the x86 build container (go 1.24.2); ELF x86-64.
  • Deployed to test node sky150 and smoke-verified:
    • valid node → gpuUtilizationUrl (viewPanel=50) + vramUrl (viewPanel=51) + enabled:true
    • non-existent hostname → enabled:false
    • hostname consistency confirmed: gpu.host.host = Node.Hostname = sky150
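
The smoke checks listed above could be automated along these lines. This is only a sketch of one possible check, assuming the returned URLs carry viewPanel and var-GPU_HOST as query parameters; checkPanelLink and the vip.example host are hypothetical and do not appear in the PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkPanelLink (hypothetical) verifies that a returned deep-link targets
// the expected panel and filters on the expected host via var-GPU_HOST.
func checkPanelLink(rawURL, host, wantPanel string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return fmt.Errorf("parse %q: %w", rawURL, err)
	}
	q := u.Query()
	if got := q.Get("viewPanel"); got != wantPanel {
		return fmt.Errorf("viewPanel = %q, want %q", got, wantPanel)
	}
	if got := q.Get("var-GPU_HOST"); got != host {
		return fmt.Errorf("var-GPU_HOST = %q, want %q", got, host)
	}
	return nil
}

func main() {
	// Placeholder link; vip.example stands in for the real VIP.
	link := "https://vip.example/grafana/d/i-device/device?orgId=1&var-GPU_HOST=sky150&from=now-3h&to=now&viewPanel=50"
	if err := checkPanelLink(link, "sky150", "50"); err != nil {
		fmt.Println("contract violated:", err)
		return
	}
	fmt.Println("link matches the contract")
}
```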

🤖 Generated with Claude Code

SekiXu and others added 2 commits June 29, 2026 17:56
Add GET /grafana/gpuWorkloadHistory/:hostname returning the device
dashboard deep-links for a physical node's GPU Utilization (panel 50)
and VRAM (panel 51), filtered by var-GPU_HOST (not var-HOST). enabled
reflects node existence via nodes.IsExist; GetNodeGpusMap is avoided
because it is local-only and reports the wrong node for remote ones.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Seki Xu <seki.xu@bigstack.co>
…cs (#824)

Points to the openapi commit that documents
GET .../grafana/gpuWorkloadHistory/{hostname}, so the generated api
docs include the new endpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Seki Xu <seki.xu@bigstack.co>
SekiXu force-pushed the feat/824-gpu-workload-history-backend branch from 5ca75cd to 8863be5 on June 29, 2026 09:56
SekiXu self-assigned this on Jun 29, 2026