
Conversation

@jeromeku (Contributor) commented Jun 26, 2025

Unsloth Blackwell Compatibility

Overview

Blackwell (sm100+) requires all dependent libraries to be compiled with CUDA 12.8.

The core libraries for running unsloth that depend on the CUDA version are:

  • bitsandbytes - already has wheels built with CUDA 12.8 so pip install should work out of the box
  • triton - requires triton>=3.3.1
  • torch - requires installing with pip install torch --extra-index-url https://download.pytorch.org/whl/cu128
  • vllm - safest is to use the nightly build: uv pip install -U vllm --torch-backend=cu128 --extra-index-url https://wheels.vllm.ai/nightly
  • xformers - as of 6/26, xformers wheels are not yet built with sm100+ enabled, since support was only recently added, so a source build is required (see below).
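
A quick way to check what you are starting from (a minimal sketch) is to print the GPU's compute capability and the CUDA version that the currently installed torch build, if any, was compiled against:

    # Sanity check (sketch): confirm the GPU is Blackwell and inspect the current torch build.
    import torch

    print("compute capability:", torch.cuda.get_device_capability(0))  # (12, 0) on RTX 5090, (10, 0) on B200
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)  # want a 12.8 (+cu128) build
    print("sm_120 kernels present:", "sm_120" in torch.cuda.get_arch_list())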

Installation

The installation order is important, since we want to overwrite bundled dependencies with specific versions (namely, xformers and triton).

  1. I prefer to use uv over pip as it's faster and better at resolving dependencies, especially for libraries that depend on torch but require a specific CUDA version, as in this scenario.

    Install uv

    curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env

    Create a project dir and venv:

    mkdir unsloth-blackwell && cd unsloth-blackwell
    uv venv .venv --python=3.12 --seed
    source .venv/bin/activate
  2. Install vllm

    uv pip install -U vllm --torch-backend=cu128 --extra-index-url https://wheels.vllm.ai/nightly

    Note that we have to specify cu128, otherwise vllm will install torch==2.7.0 but with cu126.

  3. Install unsloth dependencies

    uv pip install unsloth unsloth_zoo bitsandbytes
  4. Download and build xformers

    # First uninstall xformers installed by previous libraries
    uv pip uninstall xformers
    
    # Clone and build
    git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
    cd xformers
    export TORCH_CUDA_ARCH_LIST="12.0"
    python setup.py install

    Note that we have to explicitly set TORCH_CUDA_ARCH_LIST=12.0 so that the build targets Blackwell; a quick smoke test of the finished build is included just before the Test section below.

  5. Update triton

    uv pip install -U "triton>=3.3.1"

    triton>=3.3.1 is required for Blackwell support.

  6. transformers
    transformers >= 4.53.0 breaks unsloth inference. Specifically, transformers with gradient_checkpointing enabled will automatically switch off caching.

    When using unsloth FastLanguageModel to generate directly after training with use_cache=True, this results in a mismatch between expected and actual outputs.

    The temporary solution is to switch off gradient_checkpointing (e.g., model.disable_gradient_checkpointing()) before generation if you are on 4.53.0, or to stick with 4.52.4 for now:

    uv pip install -U transformers==4.52.4
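
    If you do stay on 4.53.0, here is a minimal sketch of the workaround; the model name and prompt are placeholders, and it uses the transformers method gradient_checkpointing_disable():

    # Sketch: disable gradient checkpointing before generating to avoid the use_cache mismatch.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B-Instruct",  # placeholder; use the model you just trained
        max_seq_length=2048,
    )
    # ... training with gradient checkpointing enabled happens here ...

    model.gradient_checkpointing_disable()  # standard transformers method
    FastLanguageModel.for_inference(model)

    inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=16, use_cache=True)
    print(tokenizer.decode(out[0]))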

After installation, your environment should look similar to blackwell.requirements.txt.

Note: you might need to downgrade numpy (to numpy<=2.2) after all the installs.
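
Before running the test scripts, a quick GPU smoke test (a minimal sketch, assuming the steps above completed) compiles a trivial Triton kernel and runs one xformers attention call on the card, which exercises the Blackwell code paths in both libraries:

    # smoke_test.py - sketch: exercise the Triton JIT and the source-built xformers kernels.
    import torch
    import triton
    import triton.language as tl
    from xformers.ops import memory_efficient_attention

    print("torch", torch.__version__, "| CUDA", torch.version.cuda,
          "| capability", torch.cuda.get_device_capability(0), "| triton", triton.__version__)

    @triton.jit
    def add_one(x_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)

    x = torch.zeros(4096, device="cuda")
    add_one[(triton.cdiv(x.numel(), 1024),)](x, x.numel(), BLOCK=1024)
    assert x.sum().item() == x.numel(), "Triton kernel produced a wrong result"

    # xformers memory_efficient_attention expects (batch, seq_len, heads, head_dim)
    q = torch.randn(2, 256, 8, 64, device="cuda", dtype=torch.bfloat16)
    out = memory_efficient_attention(q, q, q)
    print("OK: triton kernel and xformers attention ran; attention output shape", tuple(out.shape))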

Test

Both test_llama32_sft.py and test_qwen3_grpo.py should run without issue if the install is correct. If not, check the diff between your installed environment and blackwell.requirements.txt, for example with the sketch below.
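
A minimal sketch of that comparison (it assumes the requirements file uses simple name==version pins and sits in the current directory):

    # Sketch: report packages whose installed version differs from blackwell.requirements.txt.
    from importlib.metadata import PackageNotFoundError, version

    with open("blackwell.requirements.txt") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments and non-pinned entries
            name, _, pinned = line.partition("==")
            try:
                installed = version(name)
            except PackageNotFoundError:
                installed = "MISSING"
            if installed != pinned:
                print(f"{name}: requirements={pinned} installed={installed}")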

Tested on an RTX 5090, though it should also work for sm100+ in general.

@jeromeku jeromeku changed the title Add instructions for installing on blackwell Add instructions for installing unsloth on RTX 5090 Jun 26, 2025
@rolandtannous (Collaborator)

In terms of documentation, I think we should also include instructions that use pip and conda, as a lot of people are not using uv yet and we do not want to force a specific package manager. I can probably test the equivalent using conda/pip, and we can add those in a separate section.

@danielhanchen (Contributor)

Very nice - I'll also use disable_gradient_checkpointing()

@danielhanchen danielhanchen merged commit b02be21 into unslothai:main Jun 27, 2025
@l-cacherr

Very, very nice job, thanks! In my env, it's necessary to downgrade numpy<=2.2 after all the installs. test_qwen3_grpo.py succeeded on the 5090; max memory usage was ~8 GB of host memory, 25 GB of CUDA memory, and 2.7 GB of CUDA shared memory. test_llama32_sft.py also succeeded on the 5090, with almost no memory usage (< 5 GB).

@0xrushi commented Jul 13, 2025

How did you run that vllm install command?

$ uv pip install -U vllm --torch-backend=cu128 --extra-index-url https://wheels.vllm.ai/nightly
error: invalid value 'cu128' for '--torch-backend <TORCH_BACKEND>'
  [possible values: auto, cpu, cu126, cu125, cu124, cu123, cu122, cu121, cu120, cu118, cu117, cu116, cu115, cu114, cu113, cu112, cu111, cu110, cu102, cu101, cu100, cu92, cu91, cu90, cu80]

  tip: a similar value exists: 'cu102'

@0xrushi commented Jul 14, 2025

Never mind, my uv was outdated; I had to run uv self update.

@john-yick-modv

A few notes on things I needed to do to get it running on my 5090 to fine-tune Qwen-14B:

  1. The user guide at https://docs.unsloth.ai/basics/training-llms-with-blackwell-rtx-50-series-and-unsloth works fine, but there are some caveats.
  2. The CUDA toolkit MUST be version 12.8; using 12.9 will cause errors. This is what caught me for well over a day, as 12.9 kept throwing errors that prevented unsloth from running.
  3. I am using NVIDIA driver 575.64.03.
  4. Not sure if it is strictly needed, but I had to monkey-patch .venv/lib/python3.12/site-packages/transformers/configuration_utils.py

by adding this function to the bottom of the file

# Shim added at the bottom of configuration_utils.py: validate that the config's
# layer_type is one of the expected layer types, raising ValueError otherwise.
def layer_type_validation(config, expected_layer_types):
    layer_type = getattr(config, "layer_type", None)
    if layer_type not in expected_layer_types:
        raise ValueError(f"Unexpected layer type: {layer_type}")
    return True
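
An equivalent approach (a sketch) is to install the same shim at runtime instead of editing site-packages:

# Sketch: an alternative to editing site-packages - attach the same shim at runtime.
# Run this before importing unsloth or loading the model, so the symbol exists when needed.
import transformers.configuration_utils as _cu

def _layer_type_validation(config, expected_layer_types):
    layer_type = getattr(config, "layer_type", None)
    if layer_type not in expected_layer_types:
        raise ValueError(f"Unexpected layer type: {layer_type}")
    return True

if not hasattr(_cu, "layer_type_validation"):
    _cu.layer_type_validation = _layer_type_validation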

After this I was able to run my normal training code

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-23 22:51:59 [__init__.py:235] Automatically detected platform cuda.
==((====))==  Unsloth 2025.7.8: Fast Qwen3 patching. Transformers: 4.52.4. vLLM: 0.10.0rc2.dev73+gf59ec35b7.
   \\   /|    NVIDIA GeForce RTX 5090. Num GPUs = 1. Max memory: 31.354 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: https://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.02s/it]
Unsloth 2025.7.8 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,628 | Num Epochs = 3 | Total steps = 2,862
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 64,225,280 of 14,832,532,480 (0.43% trained)
  0%|                                                                                                                                                                                                 | 0/2862 [00:00<?, ?it/s]Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 2.0211, 'grad_norm': 0.24703896045684814, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                         
{'loss': 2.0929, 'grad_norm': 0.2276933193206787, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}                                                                                                                       
{'loss': 1.934, 'grad_norm': 0.3005346357822418, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}                                                                                                                         
{'loss': 1.7917, 'grad_norm': 0.22617483139038086, 'learning_rate': 6e-06, 'epoch': 0.0}                                                                                                                                       
{'loss': 1.911, 'grad_norm': 0.21807023882865906, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.01}                                                                             

It is processing at around 7 seconds per iteration.
