Problem Description
I'm trying to run some sample code in a container after installing the gpu-operator Helm chart on my cluster (Kubernetes 1.32.8). My test code is below. How can I solve this problem? I'd appreciate it if you could let me know.
import torch

if torch.cuda.is_available():
    free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
    free_memory_gb = free_memory_bytes / (1024**3)
    total_memory_gb = total_memory_bytes / (1024**3)
    print(f"Free GPU memory: {free_memory_gb:.2f} GB")
    print(f"Total GPU memory: {total_memory_gb:.2f} GB")
else:
    print("CUDA is not available. Cannot get GPU memory info.")
And the error message is:
Traceback (most recent call last):
File "/var/lib/jenkins/test.py", line 4, in <module>
free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
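If it helps, here is a more verbose variant of the same check that could be run in the same container to see whether HIP enumerates any devices at all. The explicit device index and the torch.version.hip print are only there for debugging and are not part of my original test:

import torch

# ROCm builds of PyTorch report the HIP runtime version here (None on CUDA builds).
print("torch.version.hip:", torch.version.hip)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    # mem_get_info also accepts an explicit device index.
    print(i, torch.cuda.get_device_name(i))
    free_b, total_b = torch.cuda.mem_get_info(i)
    print(f"  free: {free_b / 1024**3:.2f} GB, total: {total_b / 1024**3:.2f} GB")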
My pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: rocm-pytorch-test
  namespace: kube-amd-gpu
  labels:
    app: rocm-pytorch-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "amd.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: rocm-pytorch
      image: rocm/pytorch:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          "amd.com/gpu": 8
Here is my OS and GPU information:
OS: NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
CPU:
model name : AMD EPYC 7413 24-Core Processor
GPU:
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Operating System
Ubuntu 22.04
CPU
AMD EPYC 7413 24-Core Processor
GPU
AMD Instinct MI250X/MI250
ROCm Version
ROCm 6.4.2
ROCm Component
HIP
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response