
Conversation

@Datta0 (Collaborator) commented on Dec 11, 2025

Fixes: unslothai/unsloth#3703, unslothai/unsloth#3685

Without this, the trainer throws the following error:

$ cd /home/datta0/unsloth && UNSLOTH_COMPILE_LOCATION=/tmp/unsloth_compile_cache ~/.venvs/pyenv/bin/torchrun --nproc_per_node=2 unsloth-cli.py \
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] 
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] *****************************************
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] *****************************************
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 12-11 12:51:55 [__init__.py:216] Automatically detected platform cuda.
INFO 12-11 12:51:55 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Unsloth: Could not import trl.trainer.ddpo_trainer: Failed to import trl.trainer.ddpo_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Traceback (most recent call last):
  File "https://github.com/home/datta0/unsloth/unsloth-cli.py", line 441, in <module>
    run(args)
  File "https://github.com/home/datta0/unsloth/unsloth-cli.py", line 37, in run
    from unsloth import FastLanguageModel
  File "https://github.com/home/datta0/unsloth/unsloth/__init__.py", line 257, in <module>
    from .models import *
  File "https://github.com/home/datta0/unsloth/unsloth/models/__init__.py", line 15, in <module>
    from .llama import FastLlamaModel
  File "https://github.com/home/datta0/unsloth/unsloth/models/llama.py", line 3400, in <module>
    PatchFastRL(FastLanguageModel = FastLlamaModel)
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1347, in PatchFastRL
    patch_trl_rl_trainers()
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1333, in patch_trl_rl_trainers
    _patch_trl_rl_trainers(trainer)
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1003, in _patch_trl_rl_trainers
    created_module = create_new_function(
                     ^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 573, in create_new_function
    function_location = os.path.join(compile_folder, f"{name}.py")
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
[unsloth_zoo.log|WARNING]Unsloth: Failed to import trl openenv: No module named 'trl.experimental'
==((====))==  Unsloth 2025.12.4: Fast Llama patching. Transformers: 4.57.1. vLLM: 0.11.0.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 2. Max memory: 79.179 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: https://github.com/unslothai/unsloth
W1211 12:52:00.431000 178520 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 178576 closing signal SIGTERM
E1211 12:52:00.946000 178520 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 178577) of binary: /home/datta0/.venvs/pyenv/bin/python3
Traceback (most recent call last):
  File "https://github.com/home/datta0/.venvs/pyenv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
unsloth-cli.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-11_12:52:00
  host      : h100-v3-datta.us-east4-b.c.unsloth.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 178577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
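
The traceback bottoms out in os.path.join receiving None for compile_folder on the non-main rank (rank 1 is the one that fails above). A minimal, self-contained illustration of that failure mode, not the actual unsloth_zoo code; the module name below is made up:

```python
import os

# Illustration only: on the non-main rank the compile folder is never
# resolved, so os.path.join receives None as its first argument.
compile_folder = None           # what the non-main rank ends up holding
name = "UnslothSFTTrainer"      # hypothetical module name

try:
    function_location = os.path.join(compile_folder, f"{name}.py")
except TypeError as e:
    print(e)  # expected str, bytes or os.PathLike object, not NoneType
```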

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a critical compilation error encountered during distributed training setups. By implementing a safeguard in the distributed_function utility, it ensures that operations can proceed correctly even when the torch.distributed process group has not yet been fully initialized, thereby preventing a TypeError and allowing distributed training to compile successfully.

Highlights

  • Distributed Training Compilation Fix: Introduced a conditional check within the distributed_function to verify whether torch.distributed is initialized. If it is not, all processes execute the function independently, bypassing the main-process check (a minimal sketch of this guard follows these highlights).
  • Error Resolution: This change directly resolves a TypeError that occurred when os.path.join received a NoneType argument during the compilation phase in distributed training environments, as reported in issue #3703.
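
For reference, here is a minimal sketch of the guard described in these highlights. The function name follows the summary above, but the signature, the broadcast mechanism, and the rank-0 logic are assumptions for illustration, not the actual unsloth_zoo implementation:

```python
import torch.distributed as dist

def distributed_function(function, *args, **kwargs):
    # Sketch only. If the torch.distributed process group is not yet
    # initialized (e.g. early under torchrun, before init_process_group
    # has run), every process executes the function independently instead
    # of waiting on the main process. Without such a guard, non-main ranks
    # can be left holding None (such as a None compile folder), which later
    # surfaces as the os.path.join TypeError shown in the log above.
    if not (dist.is_available() and dist.is_initialized()):
        return function(*args, **kwargs)

    # With an initialized process group, only rank 0 does the work and the
    # result is broadcast to all other ranks.
    if dist.get_rank() == 0:
        payload = [function(*args, **kwargs)]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```

The essential part for this fix is the early-return branch; the broadcast path is just one plausible shape for the already-initialized case.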

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request fixes a compilation error during distributed training by ensuring that all processes execute a function independently if the torch.distributed process group has not been initialized yet. The change correctly addresses the bug. I've left one comment regarding code duplication introduced by the fix and suggested a refactoring to improve maintainability.


Development

Successfully merging this pull request may close these issues.

Expected str or path object but received Nonetype Error when training using multi-gpu
