[BUG] Deepspeed installation issue on ROCm #6599

@itej89

I am seeing the following error while building main on ROCm with ops enabled.

Bug:

cd Deepspeed
DS_BUILD_FUSED_ADAM=1 pip install . 
Processing /myworkspace/DeepSpeed
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      [2024-10-04 18:00:35,186] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-10-04 18:00:35,821] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/myworkspace/DeepSpeed/setup.py", line 200, in <module>
          ext_modules.append(builder.builder())
        File "/myworkspace/DeepSpeed/op_builder/builder.py", line 711, in builder
          compile_args['cxx'].append('-DROCM_WAVEFRONT_SIZE=%s' % self.get_rocm_wavefront_size())
        File "/myworkspace/DeepSpeed/op_builder/builder.py", line 276, in get_rocm_wavefront_size
          result = subprocess.check_output(rocm_wavefront_size_cmd)
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 424, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 505, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 951, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 1837, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: "/opt/rocm/bin/rocminfo | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
      DS_BUILD_OPS=0
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

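This is caused by the removal of "shell=True" in the security fix 659f6be. Without a shell, the whole pipeline string is passed to Popen as the name of a single executable, which does not exist, hence the FileNotFoundError above.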
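To illustrate the failure mode, here is a minimal sketch (not DeepSpeed code) of what happens on a POSIX system when a pipeline string is handed to subprocess without a shell. The helper name try_run and the echoed text are made up for this example.

```python
import subprocess
import sys


def try_run(cmd):
    """Run cmd without a shell; return its stdout, or the OSError if it cannot start."""
    try:
        return subprocess.check_output(cmd, text=True)
    except OSError as exc:
        return exc


# A pipeline written as one string: with no shell, the entire string is
# looked up as the name of a single program, so the launch fails.
pipeline = "echo 'Wavefront Size: 64' | grep -Eo '[0-9]+'"
failed = try_run(pipeline)

# An argument list starts exactly one program; any filtering that the
# pipe used to do has to happen afterwards, in Python.
ok = try_run([sys.executable, "-c", "print('Wavefront Size: 64')"])
```

The same applies to a list that contains "|" as an element: the bar is passed to the program as an ordinary argument, not interpreted as a pipe.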

Resolution:
I propose updating the script to use subprocess.run instead of subprocess.check_output:

    # Needs at module level: import re, import subprocess, from pathlib import Path
    @staticmethod
    def get_rocm_wavefront_size():
        if OpBuilder._rocm_wavefront_size:
            return OpBuilder._rocm_wavefront_size

        rocm_info = Path("/opt/rocm/bin/rocminfo")
        if not rocm_info.is_file():
            rocm_info = Path("rocminfo")

        # Without shell=True, "|" is not a pipe, so run rocminfo on its own
        # and do the grep-style filtering in Python.
        rocm_wavefront_size = "32"  # fallback if rocminfo fails or reports no size
        try:
            result = subprocess.run([str(rocm_info)], capture_output=True, text=True, check=True)
            # Equivalent of: grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'
            match = re.search(r"Wavefront Size:\s+([0-9]+)", result.stdout)
            if match:
                rocm_wavefront_size = match.group(1)
        except (subprocess.CalledProcessError, OSError):
            # rocminfo exited non-zero or could not be started; keep the fallback
            pass

        OpBuilder._rocm_wavefront_size = rocm_wavefront_size
        return OpBuilder._rocm_wavefront_size
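The filtering that the two grep calls performed can also be checked in isolation. The sketch below assumes rocminfo prints a line of the form "Wavefront Size: <number>", as the grep pattern in the traceback implies. The function name parse_wavefront_size and the sample text are hypothetical and were not captured from a real device.

```python
import re


def parse_wavefront_size(rocminfo_output, default="32"):
    """Return the first 'Wavefront Size' value found in the text, or default."""
    match = re.search(r"Wavefront Size:\s+([0-9]+)", rocminfo_output)
    return match.group(1) if match else default


# Hypothetical excerpt shaped like the line the grep pattern expects.
sample = "  Name:                    gfx90a\n  Wavefront Size:          64(0x40)\n"

size = parse_wavefront_size(sample)
fallback = parse_wavefront_size("no matching line here")
```

Keeping the parsing separate from the subprocess call makes it possible to test the pattern without a ROCm installation.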
