QLoRA Support + NNX Decoder Sharding Fixes #3968

Merged
copybara-service[bot] merged 1 commit into main from jackyf/qlora-support on Jun 25, 2026

Conversation

@RexBearIU

Collaborator

Description

This PR introduces core support for QLoRA and implements robust sharding metadata synchronization for NNX decoders.

Currently, applying LoRA adapters alongside quantization and complex multi-host sharding setups can lead to PartitionSpec mismatches and cross-backend device_put issues during lax.scan.

This PR solves these issues by:

  • Extending the LoRA configuration to support quantization types (lora_weight_qtype) and block sizes (lora_tile_size) for QLoRA.
  • Adding generic NNX metadata synchronization helpers (nnx_remove_scan_axis, nnx_add_scan_axis, nnx_sync_moveaxis) to safely manipulate PartitionSpec metadata during lax.scan operations in NNXDecoder.
  • Implementing a _safe_reshard workaround using jax.make_array_from_callback in lora_utils.py to natively construct globally sharded arrays, bypassing backend-specific device_put issues on Pathways/McJAX.
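
As a rough illustration of why the scan-axis helpers matter: when layers are stacked for lax.scan, each parameter carries an extra layer axis, so its PartitionSpec has one more entry than the per-layer slice seen inside the scan body. The sketch below is not the PR's implementation of nnx_remove_scan_axis or nnx_add_scan_axis. The function names, the axis names ("embed", "layers", "mlp"), and the scan-axis position are hypothetical; only jax.sharding.PartitionSpec is a real API.

```python
from jax.sharding import PartitionSpec as P


def remove_scan_axis(spec, scan_axis):
  """Hypothetical helper: drop the stacked-layer entry from a PartitionSpec."""
  entries = list(spec)
  if scan_axis < len(entries):
    del entries[scan_axis]
  return P(*entries)


def add_scan_axis(spec, scan_axis, axis_name=None):
  """Hypothetical helper: re-insert the stacked-layer entry at scan_axis."""
  entries = list(spec)
  # Pad with None (replicated) so the insert position is always valid.
  while len(entries) < scan_axis:
    entries.append(None)
  entries.insert(scan_axis, axis_name)
  return P(*entries)


# Assumed layout: weights stacked along axis 1 with logical name "layers".
stacked_spec = P("embed", "layers", "mlp")

# Inside the scan body each slice has lost axis 1, so its spec must too.
per_layer_spec = remove_scan_axis(stacked_spec, scan_axis=1)

# After the scan, the stacked output needs the layer entry back.
restored_spec = add_scan_axis(per_layer_spec, scan_axis=1, axis_name="layers")
```

If the spec is not adjusted this way, the per-layer array has rank 2 while its metadata still describes rank 3, which is one way the PartitionSpec mismatches described above can arise.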

Future improvements will include removing the _safe_reshard workaround once the underlying JAX/Qwix parameter materialization issues are fully resolved upstream.
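
For readers unfamiliar with the API the workaround relies on, here is a minimal sketch of building a sharded array with jax.make_array_from_callback instead of jax.device_put. It is not the PR's _safe_reshard. The helper name reshard_from_host, the one-dimensional "data" mesh, and the array shape are assumptions chosen so the example runs on a single host.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


def reshard_from_host(host_value, sharding):
  """Hypothetical helper: assemble a sharded jax.Array shard by shard."""
  host_value = np.asarray(host_value)

  def shard_for(index):
    # `index` is a tuple of slices selecting one device's shard of the
    # global array; it is only requested for locally addressable devices.
    return host_value[index]

  return jax.make_array_from_callback(host_value.shape, sharding, shard_for)


# Assumed setup: a 1-D mesh over whatever devices are visible.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))
sharding = NamedSharding(mesh, P("data", None))

# Leading dimension is a multiple of the device count so it shards evenly.
n = devices.size
host = np.arange(n * 2 * 4, dtype=np.float32).reshape(n * 2, 4)
arr = reshard_from_host(host, sharding)
```

Because each process only supplies the shards for its own devices, this construction avoids a single cross-backend transfer of the whole array, which appears to be the motivation for the workaround.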

Tests

  • Unit tests added for NNX metadata synchronization (tests/utils/test_maxtext_utils_nnx.py).
  • Existing unit tests for lora_utils updated (note: QLoRA-specific tests are temporarily skipped pending an upstream qwix fix).
  • Tested via multi-host mock execution to trigger the resharding callback logic.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented May 22, 2026


Codecov Report

❌ Patch coverage is 64.28571% with 30 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/maxtext/utils/maxtext_utils_nnx.py | 62.29% | 11 Missing and 12 partials ⚠️ |
| src/maxtext/utils/lora_utils.py | 53.84% | 5 Missing and 1 partial ⚠️ |
| src/maxtext/layers/nnx_decoders.py | 90.00% | 0 Missing and 1 partial ⚠️ |

@RexBearIU RexBearIU changed the title from "feat/fix: QLoRA support and NNX Decoder Sharding Fixes" to "QLoRA Support + NNX Decoder Sharding Fixes" May 22, 2026
Comment thread src/maxtext/configs/types.py Outdated
Comment thread src/maxtext/configs/types.py Outdated
Comment thread tests/post_training/unit/lora_utils_test.py Outdated
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch 2 times, most recently from c37faef to 4dde975 Compare June 1, 2026 06:59
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch from 4dde975 to 6a12059 Compare June 8, 2026 07:20
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch 5 times, most recently from 4d44567 to 67eec07 Compare June 23, 2026 01:15
Comment thread src/maxtext/utils/maxtext_utils_nnx.py
Comment thread tests/utils/test_maxtext_utils_nnx.py
Comment thread tests/utils/test_maxtext_utils_nnx.py
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch 4 times, most recently from 2a0276d to bdc5177 Compare June 24, 2026 08:12
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch 2 times, most recently from 82a010b to 1eae130 Compare June 25, 2026 10:04
@RexBearIU RexBearIU force-pushed the jackyf/qlora-support branch from 1eae130 to ef34fac Compare June 25, 2026 10:09
@copybara-service copybara-service Bot merged commit 863398e into main Jun 25, 2026
34 checks passed
@copybara-service copybara-service Bot deleted the jackyf/qlora-support branch June 25, 2026 14:56
5 participants