
Conversation

@Datta0 (Collaborator) commented on Dec 11, 2025

Fixes: unslothai/unsloth#3703, unslothai/unsloth#3685

Without this, the trainer throws the following error:

$ cd /home/datta0/unsloth && UNSLOTH_COMPILE_LOCATION=/tmp/unsloth_compile_cache ~/.venvs/pyenv/bin/torchrun --nproc_per_node=2 unsloth-cli.py \
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] 
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] *****************************************
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1211 12:51:49.592000 178520 torch/distributed/run.py:774] *****************************************
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 12-11 12:51:55 [__init__.py:216] Automatically detected platform cuda.
INFO 12-11 12:51:55 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Unsloth: Could not import trl.trainer.ddpo_trainer: Failed to import trl.trainer.ddpo_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.ip_adapter because of the following error (look up to see its traceback):
Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
Traceback (most recent call last):
  File "https://github.com/home/datta0/unsloth/unsloth-cli.py", line 441, in <module>
    run(args)
  File "https://github.com/home/datta0/unsloth/unsloth-cli.py", line 37, in run
    from unsloth import FastLanguageModel
  File "https://github.com/home/datta0/unsloth/unsloth/__init__.py", line 257, in <module>
    from .models import *
  File "https://github.com/home/datta0/unsloth/unsloth/models/__init__.py", line 15, in <module>
    from .llama import FastLlamaModel
  File "https://github.com/home/datta0/unsloth/unsloth/models/llama.py", line 3400, in <module>
    PatchFastRL(FastLanguageModel = FastLlamaModel)
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1347, in PatchFastRL
    patch_trl_rl_trainers()
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1333, in patch_trl_rl_trainers
    _patch_trl_rl_trainers(trainer)
  File "https://github.com/home/datta0/unsloth/unsloth/models/rl.py", line 1003, in _patch_trl_rl_trainers
    created_module = create_new_function(
                     ^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/unsloth_zoo/compiler.py", line 573, in create_new_function
    function_location = os.path.join(compile_folder, f"{name}.py")
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
[unsloth_zoo.log|WARNING]Unsloth: Failed to import trl openenv: No module named 'trl.experimental'
==((====))==  Unsloth 2025.12.4: Fast Llama patching. Transformers: 4.57.1. vLLM: 0.11.0.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 2. Max memory: 79.179 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: https://github.com/unslothai/unsloth
W1211 12:52:00.431000 178520 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 178576 closing signal SIGTERM
E1211 12:52:00.946000 178520 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 178577) of binary: /home/datta0/.venvs/pyenv/bin/python3
Traceback (most recent call last):
  File "https://github.com/home/datta0/.venvs/pyenv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "https://github.com/home/datta0/.venvs/pyenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
unsloth-cli.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-11_12:52:00
  host      : h100-v3-datta.us-east4-b.c.unsloth.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 178577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
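
The traceback bottoms out in os.path.join receiving None for compile_folder on the non-main rank (rank 1 is the one that fails above). A minimal, self-contained illustration of that failure mode, not the actual unsloth_zoo code; the module name below is made up:

```python
import os

# Illustration only: on the non-main rank the compile folder is never
# resolved, so os.path.join receives None as its first argument.
compile_folder = None           # what the non-main rank ends up holding
name = "UnslothSFTTrainer"      # hypothetical module name

try:
    function_location = os.path.join(compile_folder, f"{name}.py")
except TypeError as e:
    print(e)  # expected str, bytes or os.PathLike object, not NoneType
```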

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a critical compilation error encountered during distributed training setups. By implementing a safeguard in the distributed_function utility, it ensures that operations can proceed correctly even when the torch.distributed process group has not yet been fully initialized, thereby preventing a TypeError and allowing distributed training to compile successfully.

Highlights

  • Distributed Training Compilation Fix: Introduced a conditional check within the distributed_function to verify whether torch.distributed is initialized. If it is not, all processes execute the function independently, bypassing the main-process check (a minimal sketch of this guard follows these highlights).
  • Error Resolution: This change directly resolves a TypeError that occurred when os.path.join received a NoneType argument during the compilation phase in distributed training environments, as reported in issue #3703.
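
For reference, here is a minimal sketch of the guard described in these highlights. The function name follows the summary above, but the signature, the broadcast mechanism, and the rank-0 logic are assumptions for illustration, not the actual unsloth_zoo implementation:

```python
import torch.distributed as dist

def distributed_function(function, *args, **kwargs):
    # Sketch only. If the torch.distributed process group is not yet
    # initialized (e.g. early under torchrun, before init_process_group
    # has run), every process executes the function independently instead
    # of waiting on the main process. Without such a guard, non-main ranks
    # can be left holding None (such as a None compile folder), which later
    # surfaces as the os.path.join TypeError shown in the log above.
    if not (dist.is_available() and dist.is_initialized()):
        return function(*args, **kwargs)

    # With an initialized process group, only rank 0 does the work and the
    # result is broadcast to all other ranks.
    if dist.get_rank() == 0:
        payload = [function(*args, **kwargs)]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```

The essential part for this fix is the early-return branch; the broadcast path is just one plausible shape for the already-initialized case.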

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request fixes a compilation error during distributed training by ensuring that all processes execute a function independently if the torch.distributed process group has not been initialized yet. The change correctly addresses the bug. I've left one comment regarding code duplication introduced by the fix and suggested a refactoring to improve maintainability.


Development

Successfully merging this pull request may close these issues.

Expected str or path object but received Nonetype Error when training using multi-gpu
