
@dmsuehir
Contributor

Summary

When using unsloth for batch inference, model.generate() changes tokenizer.padding_side from left to right. This causes problems when the tokenizer is later used to decode the responses.
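
For context, a typical batched-inference call where this shows up looks like the sketch below (the checkpoint name is illustrative; left padding is what keeps decoder-only generation aligned with the real end of each prompt):

```python
from unsloth import FastLanguageModel

# Illustrative checkpoint name; any Llama-style model shows the same behavior.
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/llama-3-8b")
FastLanguageModel.for_inference(model)

tokenizer.padding_side = "left"  # required for batched decoder-only generation
inputs = tokenizer(
    ["first prompt", "a second, longer prompt"],
    padding=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=32)
# Before this fix, tokenizer.padding_side was "right" at this point, so any
# later tokenize/decode round-trip padded the wrong side of the sequences.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```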

Root Cause

I debugged this and found that padding_side changes because FastLlamaModel.for_training(self) is called after generation, and that call sets padding_side to right.
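
In simplified form, the mechanism looks like this (a sketch, not the exact unsloth source; the saved-tokenizer attribute name is illustrative):

```python
# for_training()/for_inference() toggle both the module mode and the padding
# side of the tokenizer that unsloth keeps alongside the model.
class FastLlamaModel:
    @staticmethod
    def for_training(model):
        model.train()
        # Training uses right padding, so calling this unconditionally after
        # generate() clobbers the left padding set up for batched inference.
        model._saved_temp_tokenizer.padding_side = "right"

    @staticmethod
    def for_inference(model):
        model.eval()
        model._saved_temp_tokenizer.padding_side = "left"
```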

Changes

To fix the issue, I changed the code to only call FastLlamaModel.for_training(self) if the model was in training mode before generation. If the model was already in inference mode, it stays in inference mode; a simplified sketch of the change follows.
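
The shape of the change is roughly the following (simplified, with illustrative names; see the diff for the actual code):

```python
def unsloth_fast_generate(self, *args, **kwargs):
    was_training = self.training           # standard torch.nn.Module flag
    FastLlamaModel.for_inference(self)     # eval mode + left padding
    output = self._old_generate(*args, **kwargs)
    if was_training:
        # Only flip back (and right-pad the tokenizer) if the caller actually
        # had the model in training mode before generate() was invoked.
        FastLlamaModel.for_training(self)
    return output
```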

Related Issues

Fixes #2217
Fixes #3283

Testing

I tested the use cases reported in both of these issues with my branch to verify the fix. For issue #2217, the output now looks like:

>>> print(tokenizer.decode(completions[0]))
<|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|begin_of_text|>shortest path problem

And for issue #3283, the test script output now looks like:

✅ Initial padding_side: left
✅ Before generation - padding_side: left
❌ After generation - padding_side: left
✅ No bug detected

… in training mode

Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @dmsuehir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical fix for Unsloth's model.generate() function, specifically addressing an unintended alteration of the Llama tokenizer's padding_side during inference. By ensuring that the model's training state is only restored if it was initially in training mode, the change prevents the tokenizer from switching its padding direction, thereby resolving decoding inconsistencies reported by users.

Highlights

  • Tokenizer Padding Fix: Resolves an issue where tokenizer.padding_side was incorrectly changed from 'left' to 'right' after model.generate() in inference mode, causing decoding problems.
  • Conditional Training Mode Restoration: Implements a conditional check to restore the model to training mode only if it was originally in training mode before generation, preventing unintended side effects on the tokenizer's padding configuration.
  • Addresses Known Issues: This fix directly addresses and resolves the reported issues #2217 (Llama 3 tokenizer is using right-padding during inference) and #3283 ([Bug] model.generate changes tokenizer's padding side), both related to the tokenizer padding behavior.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue where the tokenizer's padding_side was improperly changed during inference. The change to conditionally restore the model's training state is a direct and effective solution. I've added one suggestion to improve the robustness of this state management by using a try...finally block, ensuring the model's state is correctly restored even if an error occurs during generation.
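
For reference, the restore-in-finally shape that suggestion describes might look like this (a sketch with illustrative names, not the suggestion's actual diff):

```python
def unsloth_fast_generate(self, *args, **kwargs):
    was_training = self.training
    FastLlamaModel.for_inference(self)
    try:
        return self._old_generate(*args, **kwargs)
    finally:
        # Runs even if generation raises, so the model's and tokenizer's
        # state is always restored to what the caller had before.
        if was_training:
            FastLlamaModel.for_training(self)
```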

@danielhanchen
Contributor

Oh thank you, this works!

@danielhanchen danielhanchen merged commit d4a311d into unslothai:main Nov 26, 2025
1 check passed
@dmsuehir dmsuehir deleted the dina/padding_side_fix branch December 1, 2025 23:04