Problem Description
I'm trying to run some sample code in a container after installing the gpu-operator Helm chart on my cluster (Kubernetes 1.32.8). My test code is below. How can I solve this problem? I'd appreciate it if you could let me know.
import torch

if torch.cuda.is_available():
    free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
    free_memory_gb = free_memory_bytes / (1024**3)
    total_memory_gb = total_memory_bytes / (1024**3)
    print(f"Free GPU memory: {free_memory_gb:.2f} GB")
    print(f"Total GPU memory: {total_memory_gb:.2f} GB")
else:
    print("CUDA is not available. Cannot get GPU memory info.")
And the error message is:
Traceback (most recent call last):
File "/var/lib/jenkins/test.py", line 4, in <module>
free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
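If it helps, here is a more verbose variant of the same check that could be run in the same container to see whether HIP enumerates any devices at all. The explicit device index and the torch.version.hip print are only there for debugging and are not part of my original test:

import torch

# ROCm builds of PyTorch report the HIP runtime version here (None on CUDA builds).
print("torch.version.hip:", torch.version.hip)
print("torch.cuda.is_available():", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    # mem_get_info also accepts an explicit device index.
    print(i, torch.cuda.get_device_name(i))
    free_b, total_b = torch.cuda.mem_get_info(i)
    print(f"  free: {free_b / 1024**3:.2f} GB, total: {total_b / 1024**3:.2f} GB")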
My pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: rocm-pytorch-test
  namespace: kube-amd-gpu
  labels:
    app: rocm-pytorch-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "amd.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: rocm-pytorch
      image: rocm/pytorch:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          "amd.com/gpu": 8
Here is my OS and GPU information:
OS: NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
CPU:
model name : AMD EPYC 7413 24-Core Processor
GPU:
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Operating System
Ubuntu 22.04
CPU
AMD EPYC 7413 24-Core Processor
GPU
AMD Instinct MI250X/MI250
ROCm Version
ROCm 6.4.2
ROCm Component
HIP
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response