
@dmsuehir
Contributor

Summary

When using unsloth for batch inference, model.generate() changes tokenizer.padding_side from left to right. This causes problems when the tokenizer is later used to decode the responses.
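
For context, a typical batched-inference call where this shows up looks like the sketch below (the checkpoint name is illustrative; left padding is what keeps decoder-only generation aligned with the real end of each prompt):

```python
from unsloth import FastLanguageModel

# Illustrative checkpoint name; any Llama-style model shows the same behavior.
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/llama-3-8b")
FastLanguageModel.for_inference(model)

tokenizer.padding_side = "left"  # required for batched decoder-only generation
inputs = tokenizer(
    ["first prompt", "a second, longer prompt"],
    padding=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=32)
# Before this fix, tokenizer.padding_side was "right" at this point, so any
# later tokenize/decode round-trip padded the wrong side of the sequences.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```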

Root Cause

I debugged this and found that padding_side changes because FastLlamaModel.for_training(self) is called after generation, and that call sets padding_side to right.
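
In simplified form, the mechanism looks like this (a sketch, not the exact unsloth source; the saved-tokenizer attribute name is illustrative):

```python
# for_training()/for_inference() toggle both the module mode and the padding
# side of the tokenizer that unsloth keeps alongside the model.
class FastLlamaModel:
    @staticmethod
    def for_training(model):
        model.train()
        # Training uses right padding, so calling this unconditionally after
        # generate() clobbers the left padding set up for batched inference.
        model._saved_temp_tokenizer.padding_side = "right"

    @staticmethod
    def for_inference(model):
        model.eval()
        model._saved_temp_tokenizer.padding_side = "left"
```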

Changes

To fix the issue, I changed the code to only call FastLlamaModel.for_training(self) if the model was in training mode before generation. If the model was already in inference mode, it stays in inference mode; a simplified sketch of the change follows.
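
The shape of the change is roughly the following (simplified, with illustrative names; see the diff for the actual code):

```python
def unsloth_fast_generate(self, *args, **kwargs):
    was_training = self.training           # standard torch.nn.Module flag
    FastLlamaModel.for_inference(self)     # eval mode + left padding
    output = self._old_generate(*args, **kwargs)
    if was_training:
        # Only flip back (and right-pad the tokenizer) if the caller actually
        # had the model in training mode before generate() was invoked.
        FastLlamaModel.for_training(self)
    return output
```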

Related Issues

Fixes #2217
Fixes #3283

Testing

I tested the use cases reported in both of these issues with my branch to verify the fix. For issue #2217, the output now looks like:

>>> print(tokenizer.decode(completions[0]))
<|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|begin_of_text|>shortest path problem

And for issue #3283, the test script output now looks like:

✅ Initial padding_side: left
✅ Before generation - padding_side: left
❌ After generation - padding_side: left
✅ No bug detected

… in training mode

Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @dmsuehir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical fix for Unsloth's model.generate() function, specifically addressing an unintended alteration of the Llama tokenizer's padding_side during inference. By ensuring that the model's training state is only restored if it was initially in training mode, the change prevents the tokenizer from switching its padding direction, thereby resolving decoding inconsistencies reported by users.

Highlights

  • Tokenizer Padding Fix: Resolves an issue where tokenizer.padding_side was incorrectly changed from 'left' to 'right' after model.generate() in inference mode, causing decoding problems.
  • Conditional Training Mode Restoration: Implements a conditional check to restore the model to training mode only if it was originally in training mode before generation, preventing unintended side effects on the tokenizer's padding configuration.
  • Addresses Known Issues: This fix directly addresses and resolves the reported issues #2217 (Llama 3 tokenizer is using right-padding during inference) and #3283 ([Bug] model.generate changes tokenizer's padding side), both related to the tokenizer padding behavior.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue where the tokenizer's padding_side was improperly changed during inference. The change to conditionally restore the model's training state is a direct and effective solution. I've added one suggestion to improve the robustness of this state management by using a try...finally block, ensuring the model's state is correctly restored even if an error occurs during generation.
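
For reference, the restore-in-finally shape that suggestion describes might look like this (a sketch with illustrative names, not the suggestion's actual diff):

```python
def unsloth_fast_generate(self, *args, **kwargs):
    was_training = self.training
    FastLlamaModel.for_inference(self)
    try:
        return self._old_generate(*args, **kwargs)
    finally:
        # Runs even if generation raises, so the model's and tokenizer's
        # state is always restored to what the caller had before.
        if was_training:
            FastLlamaModel.for_training(self)
```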

@danielhanchen
Contributor

Oh thank you, this works!

@danielhanchen danielhanchen merged commit d4a311d into unslothai:main Nov 26, 2025
1 check passed
@dmsuehir dmsuehir deleted the dina/padding_side_fix branch December 1, 2025 23:04