
feat(rdna): add RDNA4 Triton + HIP cheatsheets, arch config, and preprocessing support #37

Merged
irvineoy merged 2 commits into geak-triton-common-benchmark from feature/rdna-triton-enablement on May 8, 2026

@mehdi-saeedi
Collaborator

Summary

Enables RDNA4 (gfx1201) as a target architecture for Triton and HIP tasks. Mirrors the AgentKernelArena (AKA) HIP-branch commit 14d240f, with an added Triton-specific cheatsheet and per-language knowledge_override wiring.

  • New RDNA4_architecture.md cheatsheet covering gfx1201 hardware specs (Wave32, WGP, WMMA, GDDR6 bandwidth constraints).
  • New hip_rdna_cheatsheet.md — standalone RDNA HIP best practices for hip2hip / torch2hip tasks on this branch.
  • New triton_rdna_cheatsheet.md — standalone RDNA Triton best practices (Wave32 implications, WMMA vs MFMA, gfx1201 tl.dot dtype support).
  • Wires RDNA4 into default_cheatsheet.yaml with a per-language knowledge_override (hip + triton).
  • Sets AMDGPU_TARGETS and GPU_TARGETS in src/preprocessing.py so CMake-based HIP builds target RDNA4.
  • Adds knowledge_override support to src/prompt_builder.py for per-arch cheatsheets.
  • Removes a hardcoded MI300X prompt from tasks/hip2hip/others/points_in_boxes/config.yaml.
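
As a rough illustration of the per-language knowledge_override idea described above, the sketch below picks a cheatsheet file for a given language and falls back to a default when no override exists. The dictionary layout, the function name `resolve_knowledge_cheatsheet`, and the `default` argument are assumptions made for this example; only the `knowledge_override` key and the two cheatsheet file names come from this PR, and the actual schema in default_cheatsheet.yaml and the logic in src/prompt_builder.py may differ.

```python
from typing import Any, Mapping

# Hypothetical, already-parsed form of one default_cheatsheet.yaml entry.
# Only the `knowledge_override` key and the file names come from the PR text.
RDNA4_ENTRY: Mapping[str, Any] = {
    "knowledge_override": {
        "hip": "hip_rdna_cheatsheet.md",
        "triton": "triton_rdna_cheatsheet.md",
    },
}


def resolve_knowledge_cheatsheet(
    arch_entry: Mapping[str, Any], language: str, default: str
) -> str:
    """Return the per-language override if present, else the default file."""
    # An entry without `knowledge_override` behaves as before: the default wins.
    overrides = arch_entry.get("knowledge_override") or {}
    return overrides.get(language, default)
```

Under these assumptions, an architecture entry that defines no override keeps its existing behaviour, which is the property the MI300/MI355 sanity check in the test plan is meant to confirm.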

Files changed (7)

  • src/preprocessing.py
  • src/prompt_builder.py
  • src/prompts/cheatsheet/RDNA4_architecture.md (new)
  • src/prompts/cheatsheet/default_cheatsheet.yaml
  • src/prompts/cheatsheet/hip_rdna_cheatsheet.md (new)
  • src/prompts/cheatsheet/triton_rdna_cheatsheet.md (new)
  • tasks/hip2hip/others/points_in_boxes/config.yaml

Test plan

  • Run a Triton task on RDNA4 (gfx1201) and confirm triton_rdna_cheatsheet.md is injected into the agent prompt via knowledge_override.
  • Run a HIP task on RDNA4 and confirm hip_rdna_cheatsheet.md is injected.
  • Confirm CMake-based HIP builds use gfx1201 (check AMDGPU_TARGETS / GPU_TARGETS propagation).
  • Sanity-check that MI300/MI355 task runs are unaffected (cheatsheet selection still works).

Made with Cursor

…rocessing support

- Add RDNA4_architecture.md with gfx1201 hardware specs (Wave32, WGP,
  WMMA, GDDR6 bandwidth constraints)
- Add hip_rdna_cheatsheet.md as standalone RDNA HIP best practices
  (needed for hip2hip and torch2hip tasks on this branch)
- Add triton_rdna_cheatsheet.md as standalone RDNA Triton best practices
  (Wave32 implications, WMMA vs MFMA, gfx1201 tl.dot dtype support)
- Wire RDNA4 into default_cheatsheet.yaml with per-language
  knowledge_override (hip + triton)
- Set AMDGPU_TARGETS and GPU_TARGETS in preprocessing for CMake builds
- Support knowledge_override in prompt_builder for per-arch cheatsheets
- Remove hardcoded MI300X prompt from points_in_boxes task config

Mirrors AgentKernelArena HIP-branch commit 14d240f, with an added
triton-specific cheatsheet and knowledge_override wiring.

Made-with: Cursor
@irvineoy
Collaborator

irvineoy commented May 7, 2026

One issue I noticed with the new ROCm env propagation: the added CMake env vars are only set on the fallback path, not on the normal rocminfo detection path.

In setup_rocm_env(), when _detect_gfx_arch_from_rocminfo() succeeds, the function currently sets only PYTORCH_ROCM_ARCH and then returns:

detected_arch = _detect_gfx_arch_from_rocminfo()
if detected_arch:
    os.environ["PYTORCH_ROCM_ARCH"] = detected_arch
    logger.info(...)
    return

That means on machines where rocminfo works, AMDGPU_TARGETS and GPU_TARGETS remain unset even though the fallback path sets them. I reproduced this on the MI300 box: rocminfo detected gfx942, but after setup_rocm_env("MI300") the env was:

PYTORCH_ROCM_ARCH=gfx942
AMDGPU_TARGETS=None
GPU_TARGETS=None

This seems to undermine the PR's goal of making CMake/HIP builds pick up the selected arch consistently. Could we set all three vars in the detected-arch branch as well, e.g.:

if detected_arch:
    os.environ["PYTORCH_ROCM_ARCH"] = detected_arch
    os.environ["AMDGPU_TARGETS"] = detected_arch
    os.environ["GPU_TARGETS"] = detected_arch
    logger.info(...)
    return
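
For anyone wanting to check the same three variables on their own machine, a small helper along these lines could print them in the format used in the listing above. This is only a sketch: the helper names are made up for illustration, and the comment does not say how the original values were actually collected.

```python
import os
from typing import Dict, Optional

# The three variables discussed in this review.
ROCM_ARCH_ENV_VARS = ("PYTORCH_ROCM_ARCH", "AMDGPU_TARGETS", "GPU_TARGETS")


def snapshot_rocm_arch_env() -> Dict[str, Optional[str]]:
    """Read each variable; unset variables are reported as None."""
    return {name: os.environ.get(name) for name in ROCM_ARCH_ENV_VARS}


def format_snapshot(snapshot: Dict[str, Optional[str]]) -> str:
    """Render the snapshot as NAME=value lines, one per variable."""
    return "\n".join(f"{name}={value}" for name, value in snapshot.items())


if __name__ == "__main__":
    # Call this right after setup_rocm_env(...) in the real code base.
    print(format_snapshot(snapshot_rocm_arch_env()))
```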

… path too

setup_rocm_env() previously set all three of PYTORCH_ROCM_ARCH,
AMDGPU_TARGETS, and GPU_TARGETS only on the fallback path. On the
common rocminfo-success path it set just PYTORCH_ROCM_ARCH and
returned, leaving the two CMake env vars unset. Reproduced on MI300:
rocminfo detected gfx942 but AMDGPU_TARGETS / GPU_TARGETS remained
None, so CMake-based HIP builds did not pick up the selected arch.

Refactor the function to resolve gfx_arch first (rocminfo -> cheatsheet
fallback) and converge on a single export block driven by a module-
level tuple. Both paths now export all three vars uniformly, and any
future CMake env var can be added to the tuple in one place.

Addresses irvineoy's review on #37.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mehdi-saeedi
Collaborator Author

Thanks @irvineoy — confirmed and fixed in a254c5e.

The asymmetry was real: setup_rocm_env() exported all three vars only on the fallback path; the rocminfo-success path (the common case on any working ROCm box) exported only PYTORCH_ROCM_ARCH and returned, leaving AMDGPU_TARGETS / GPU_TARGETS unset. This undermined the PR's stated goal of getting CMake/HIP builds to pick up the arch consistently.

Rather than duplicate the three-var export inline on each path (which leaves the same shape vulnerable to drift the next time someone touches the function), I refactored to resolve gfx_arch first and converge on a single export block driven by a module-level tuple _ROCM_ARCH_ENV_VARS. Both paths now export all three uniformly, the log message is generated from the same tuple (so it can never lie about what was set), and adding any future CMake env var is a one-line change.

Verified by mocking _detect_gfx_arch_from_rocminfo:

  • rocminfo succeeds (gfx942) → PYTORCH_ROCM_ARCH=AMDGPU_TARGETS=GPU_TARGETS=gfx942 ✅ (was the broken path)
  • rocminfo fails, model resolves (RDNA4 → gfx1201) → all three set to gfx1201 ✅ (unchanged)
  • both fail → env left untouched ✅ (unchanged)

@irvineoy
Collaborator

irvineoy commented May 8, 2026

Thanks for the contribution

@irvineoy irvineoy merged commit 3a79d46 into geak-triton-common-benchmark May 8, 2026
@mehdi-saeedi mehdi-saeedi deleted the feature/rdna-triton-enablement branch May 8, 2026 14:24