# tt-scale-flux

Initially, the repo started out as:

> Simple re-implementation of inference-time scaling Flux.1-Dev as introduced in [Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps](https://arxiv.org/abs/2501.09732) by Ma et al. We implement the random search strategy to scale the inference compute budget.

But it has been growing since! Check out the rest of the README to know more 🤗

Photo of an athlete cat explaining it’s latest scandal at a press conference to journalists.

**Updates**

* 🔥 04/03/2025: Support for LTX-Video and Wan in [this PR](https://github.com/sayakpaul/tt-scale-flux/pull/18) 🎬 Check out [this section](#videos) for results and more info.
* 🔥 01/03/2025: `OpenAIVerifier` was added in [this PR](https://github.com/sayakpaul/tt-scale-flux/pull/16). Specify "openai" as the `name` under `verifier_args`. Thanks to [zhuole1025](https://github.com/zhuole1025) for contributing this!
* 🔥 27/02/2025: [MaximClouser](https://github.com/MaximClouser) implemented a ComfyUI node for inference-time scaling in [this repo](https://github.com/YRIKKA/ComfyUI-InferenceTimeScaling). Check it out!
* 🔥 25/02/2025: Support for zero-order search was added [in this PR](https://github.com/sayakpaul/tt-scale-flux/pull/14). Many thanks to Willis Ma for the reviews. Check out [this section](#controlling-search) for more details.
* 🔥 21/02/2025: Better configuration management for more flexibility, freeing us from the `argparse` madness.
* 🔥 16/02/2025: Support for batched image generation was added [in this PR](https://github.com/sayakpaul/tt-scale-flux/pull/9). It speeds up the total run but consumes more memory.
* 🔥 15/02/2025: Support for structured generation with Qwen2.5 VL was added (using `outlines` and `pydantic`) in [this PR](https://github.com/sayakpaul/tt-scale-flux/pull/6).
* 🔥 15/02/2025: Support for loading other pipelines was added in [this PR](https://github.com/sayakpaul/tt-scale-flux/pull/5)! The [result section](#more-results) has been updated, too.

## Getting started

Make sure to install the dependencies: `pip install -r requirements.txt`. The codebase was tested on a single H100 and on two H100s (both 80GB variants).

By default, we use [Gemini 2.0 Flash](https://deepmind.google/technologies/gemini/flash/) as the verifier (you [can use](#controlling-the-verifier) Qwen2.5 VL, too). This requires two things:

* `GEMINI_API_KEY` (obtain it from [here](https://ai.google.dev/gemini-api/docs)).
* The `google-genai` Python [library](https://pypi.org/project/google-genai/).

Now, fire up:

```bash
GEMINI_API_KEY=... python main.py --prompt="a tiny astronaut hatching from an egg on the moon" --num_prompts=None
```

If you want to use prompts from the [data-is-better-together/open-image-preferences-v1-binarized](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized) dataset, you can just run:

```bash
GEMINI_API_KEY=... python main.py
```

After this is done executing, you should expect a folder named `output` with the following structure:
<details>
<summary>Click to expand</summary>

```bash
output/flux.1-dev/gemini/overall_score/20250215_141308$ tree
.
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@1_s@1039315023.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@1_s@77559330.json
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@1_s@77559330.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@2_s@1046091514.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@2_s@1388753168.json
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@2_s@1388753168.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@2_s@1527774201.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@2_s@1632020675.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@1648932110.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@2033640094.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@2056028012.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@510118118.json
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@510118118.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@544879571.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@722867022.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@951309743.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@3_s@973580742.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1169137714.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1271234848.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1327836930.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1589777351.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1592595351.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1654773907.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1901647417.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1916603945.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@209448213.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@2104826872.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@532500803.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@710122236.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@744797903.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@754998363.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@823891989.png
├── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@836183088.json
└── prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@836183088.png
```

</details>
Each JSON file should look like so:
<details>
<summary>Click to expand</summary>

```json
{
  "prompt": "Photo of an athlete cat explaining it\u2019s latest scandal at a press conference to journalists.",
  "search_round": 4,
  "num_noises": 16,
  "best_noise_seed": 1940263961,
  "best_score": {
    "explanation": "The image excels in accuracy, visual quality, and originality, with minor deductions for thematic resonance. Overall, it's a well-executed and imaginative response to the prompt.",
    "score": 9.0
  },
  "choice_of_metric": "overall_score",
  "best_img_path": "output/flux.1-dev/gemini/overall_score/20250216_135414/prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@4_s@1940263961.png"
}
```

</details>
To limit the number of prompts, specify `--num_prompts`. By default, we use 2 prompts. Specify `--num_prompts=all` to use all of them.

The output directory should also contain a `config.json`, looking like so:
<details>
<summary>Click to expand</summary>

```json
{
  "max_new_tokens": 300,
  "use_low_gpu_vram": false,
  "choice_of_metric": "overall_score",
  "verifier_to_use": "gemini",
  "torch_dtype": "bf16",
  "height": 1024,
  "width": 1024,
  "max_sequence_length": 512,
  "guidance_scale": 3.5,
  "num_inference_steps": 50,
  "pipeline_config_path": "configs/flux.1_dev.json",
  "search_rounds": 4,
  "prompt": "an anime illustration of a wiener schnitzel",
  "num_prompts": null
}
```

</details>
> [!NOTE]
> The `max_new_tokens` arg is ignored when using Gemini.

Once the results are generated, process them by running:

```bash
python process_results.py --path=path_to_the_output_dir
```

This should output a collage of the best images generated in each search round, grouped by the same prompt.

By default, `--batch_size_for_img_gen` is set to 1. To speed up the process (at the expense of more memory), this number can be increased.

## Controlling experiment configurations

Experiment configurations are provided through the `--pipeline_config_path` arg, which points to a JSON file. The structure of such JSON files should look like so:

```json
{
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
  "torch_dtype": "bf16",
  "pipeline_call_args": {
    "height": 1024,
    "width": 1024,
    "max_sequence_length": 512,
    "guidance_scale": 3.5,
    "num_inference_steps": 50
  },
  "verifier_args": {
    "name": "gemini",
    "max_new_tokens": 800,
    "choice_of_metric": "overall_score"
  },
  "search_args": {
    "search_method": "random",
    "search_rounds": 4
  }
}
```

This lets us control the pipeline call arguments, the verifier, and the search process.

### Controlling the pipeline checkpoint and `__call__()` args

This is controlled via the `--pipeline_config_path` CLI arg. By default, it uses [`configs/flux.1_dev.json`](./configs/flux.1_dev.json). You can either modify this one or create your own JSON file to experiment with different pipelines. We provide predefined configs for Flux.1-Dev, PixArt-Sigma, SDXL, and SD v1.5 in the [`configs`](./configs) directory. The above-mentioned pipelines are already supported.
To add your own, you need to make modifications to:

* [`MODEL_NAME_MAP`](https://github.com/sayakpaul/tt-scale-flux/blob/8e4ba232fbdfeb7a6879049d3b5765f81969ddf3/utils.py#L16)
* [`get_latent_prep_fn()`](https://github.com/sayakpaul/tt-scale-flux/blob/8e4ba232fbdfeb7a6879049d3b5765f81969ddf3/utils.py#L125C5-L125C23)

### Controlling the "scale"

By default, we use 4 `search_rounds` and start with a noise pool size of 2. Each search round scales up the pool size like so: `2 ** current_search_round` (with indexing starting from 1). This is where the "scale" in inference-time scaling comes from. You can increase the compute budget by specifying a larger `search_rounds` in the config file.

For each search round, we serialize the images and the best datapoint (characterized by the best eval score) in a JSON file.

For other supported CLI args, run `python main.py -h`.

### Controlling the verifier

If you don't want to use Gemini, you can use [Qwen2.5 VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) as an option. Simply specify `"name": "qwen"` under the `"verifier_args"` of the config. Below is a complete command that uses SDXL-base:

```bash
python main.py \
  --pipeline_config_path="configs/sdxl.json" \
  --prompt="Photo of an athlete cat explaining it’s latest scandal at a press conference to journalists." \
  --num_prompts=None
```
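The pool-size schedule described under "Controlling the scale" can be sketched as follows. This is only an illustration of the `2 ** current_search_round` rule, not the repo's actual code:

```python
# Sketch of how the noise-pool size grows per search round.
# Round indexing starts at 1, so round r draws 2 ** r noise candidates.
def pool_sizes(search_rounds: int) -> list[int]:
    return [2**r for r in range(1, search_rounds + 1)]

print(pool_sizes(4))  # [2, 4, 8, 16]
```

So with the default `search_rounds` of 4, the final round evaluates 16 noise candidates.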
<details>
<summary>Sample search JSON</summary>

```json
{
  "prompt": "Photo of an athlete cat explaining it\u2019s latest scandal at a press conference to journalists.",
  "search_round": 6,
  "num_noises": 64,
  "best_noise_seed": 1937268448,
  "best_score": {
    "explanation": "Overall, the image demonstrates a high level of accuracy, creativity, and theme consistency while maintaining a high visual quality and coherence within the depicted scenario. The humor and surprise value are significant, contributing to above-average scoring.",
    "score": 9.0
  },
  "choice_of_metric": "overall_score",
  "best_img_path": "output/sdxl-base/qwen/overall_score/20250216_141140/prompt@Photo_of_an_athlete_cat_explaining_it_s_latest_scandal_at_a_press_conference_to_journ_hash@b9094b65_i@6_s@1937268448.png"
}
```

</details>
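The per-round JSON artifacts can also be post-processed programmatically. A minimal sketch that picks the overall best image across rounds, assuming the schema shown above (hypothetical in-memory records stand in for `json.load()`-ing the files):

```python
# Pick the best image across search rounds from serialized search records.
# The dicts mirror the JSON schema shown above; loading them from disk is
# left out for brevity.
records = [
    {"search_round": 1, "best_score": {"score": 6.5}, "best_img_path": "round1.png"},
    {"search_round": 2, "best_score": {"score": 8.0}, "best_img_path": "round2.png"},
    {"search_round": 3, "best_score": {"score": 9.0}, "best_img_path": "round3.png"},
]
best = max(records, key=lambda r: r["best_score"]["score"])
print(best["best_img_path"])  # round3.png
```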
  
**Results**

*(image: scandal_cat)* "Photo of an athlete cat explaining it’s latest scandal at a press conference to journalists."
> [!IMPORTANT]
> This setup was tested on 2 H100s. If you want to do this on a single GPU, specify `--use_low_gpu_vram`.

When using Qwen2.5 VL, by default, we use this checkpoint: [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). However, you can pass other supported checkpoints, too. Specify the `model_name` parameter under `verifier_args`:

```json
"verifier_args": {
  "name": "qwen",
  "model_name": "Qwen/Qwen2.5-VL-3B-Instruct",
  "max_new_tokens": 800,
  "choice_of_metric": "overall_score"
}
```

`model_name` is also supported for the other, non-local verifiers. For example, for the `GeminiVerifier`, you can pass any model supported by the Gemini API through `model_name`.

You can also bring in your own verifier by implementing a `Verifier` class following the structure of either `GeminiVerifier` or `QwenVerifier`. You will then have to make adjustments to the following places:

* [Scoring](https://github.com/sayakpaul/tt-scale-flux/blob/c654bc066171aee9c765fa42a322f65415529a77/main.py#L135)
* [Sorting](https://github.com/sayakpaul/tt-scale-flux/blob/c654bc066171aee9c765fa42a322f65415529a77/main.py#L163)

By default, we use "overall_score" as the metric to obtain the best samples in each search round. You can change it by specifying `choice_of_metric` in the `verifier_args`. The list of supported values for `choice_of_metric` is verifier-dependent. Supported values for the Gemini and Qwen verifiers are:

* "accuracy_to_prompt"
* "creativity_and_originality"
* "visual_quality_and_realism"
* "consistency_and_cohesion"
* "emotional_or_thematic_resonance"
* "overall_score"

The verifier prompt used during grading/verification is specified in [this file](./verifiers/verifier_prompt.txt). The prompt is a slightly modified version of the one specified in Figure 16 of the paper (Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps). You are welcome to experiment with a different prompt.
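For orientation, a custom verifier might look roughly like the skeleton below. The class name, method name, and returned schema here are assumptions modeled on the serialized outputs shown earlier; consult `GeminiVerifier`/`QwenVerifier` for the actual interface:

```python
# Hypothetical skeleton of a custom verifier. A real verifier would call a
# VLM or API; this stub returns a fixed score per image so the surrounding
# control flow can be exercised end to end.
class ConstantVerifier:
    SUPPORTED_METRIC_CHOICES = ["overall_score"]

    def __init__(self, choice_of_metric: str = "overall_score"):
        if choice_of_metric not in self.SUPPORTED_METRIC_CHOICES:
            raise ValueError(f"Unsupported metric: {choice_of_metric}")
        self.choice_of_metric = choice_of_metric

    def score(self, images, prompt: str) -> list[dict]:
        # One result dict per image, keyed by the chosen metric.
        return [
            {self.choice_of_metric: {"score": 5.0, "explanation": "stub"}}
            for _ in images
        ]

verifier = ConstantVerifier()
outputs = verifier.score(images=["img_a", "img_b"], prompt="a cat")
print(len(outputs))  # 2
```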
Set the `prompt_path` under `verifier_args`.

### Controlling search

You can configure search-related arguments through `search_args` in the configuration file. Currently, "random search" and "zero-order search" are supported. The default configurations provided under [`configs/`](./configs/) are all for random search. Below is a configuration for zero-order search:

```json
"search_args": {
  "search_method": "zero-order",
  "search_rounds": 4,
  "threshold": 0.95,
  "num_neighbors": 4
}
```
Details about the parameters:

* `threshold`: threshold used to filter out neighbor candidates generated from the base noise
* `num_neighbors`: number of neighbors to generate from the base noise
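One plausible way to generate and filter neighbors is sketched below. This numpy sketch assumes `threshold` acts as a cosine-similarity floor between a candidate and the base noise; the repo's exact neighbor-generation scheme may differ:

```python
import numpy as np

def make_neighbors(base, num_neighbors, threshold, step=0.1, max_tries=1000):
    """Perturb the base noise; keep candidates that stay close to it."""
    rng = np.random.default_rng(0)
    neighbors = []
    for _ in range(max_tries):
        if len(neighbors) == num_neighbors:
            break
        candidate = base + step * rng.standard_normal(base.shape)
        # Cosine similarity between the flattened candidate and base.
        cos = float((candidate * base).sum()) / (
            np.linalg.norm(candidate) * np.linalg.norm(base)
        )
        if cos >= threshold:  # reject candidates that drifted too far
            neighbors.append(candidate)
    return neighbors

base_noise = np.random.default_rng(42).standard_normal((4, 4))
neighbors = make_neighbors(base_noise, num_neighbors=4, threshold=0.95)
print(len(neighbors))  # 4
```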
> [!NOTE]
> If the neighbors in the current round do not improve on the results of the current search round, we simply reject the round, starting the next round with a new base noise. In the case of worse neighbors, we don't serialize the artifacts.

## More results
<details>
<summary>Click to expand</summary>

*(image: Manga)* "a bustling manga street, devoid of vehicles, detailed with vibrant colors and dynamic line work, characters in the background adding life and movement, under a soft golden hour light, with rich textures and a lively atmosphere, high resolution, sharp focus"

*(image: Alice)* "Alice in a vibrant, dreamlike digital painting inside the Nemo Nautilus submarine."

*(image: wiener_schnitzel)* "an anime illustration of a wiener schnitzel"

</details>
Both searches were performed with "overall_score" as the metric. Below is an example comparing the outputs of different metrics -- "overall_score" vs. "emotional_or_thematic_resonance" -- for the prompt "a tiny astronaut hatching from an egg on the moon":
<details>
<summary>Click to expand</summary>

| Metric | Result |
|---|---|
| "overall_score" | *(image: overall)* |
| "emotional_or_thematic_resonance" | *(image: Alicet)* |

</details>
### Results from other models
**PixArt-Sigma**

*(image: saxophone)* "A person playing saxophone."

*(image: scandal_cat)* "Photo of an athlete cat explaining it’s latest scandal at a press conference to journalists."

**SD v1.5**

*(image: saxophone)* "a photo of an astronaut riding a horse on mars"
**SDXL-base**

*(image: scandal_cat)* "Photo of an athlete cat explaining it’s latest scandal at a press conference to journalists."
## Videos

We currently support [LTX-Video](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video) and [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan). Only LAION aesthetic scoring is supported for these. Check out the LTX-Video and Wan configs [here](./configs/ltx_video.json) and [here](./configs/wan.json).
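Frame-level aesthetic scoring reduces to a single per-video score, e.g. by averaging over frames. A minimal sketch, with a hypothetical `score_frame` stub (here just mean brightness) standing in for the actual LAION aesthetic predictor:

```python
# Rank candidate videos by mean per-frame aesthetic score.
# `score_frame` is a stand-in for the LAION aesthetic predictor; it only
# averages pixel values so the example stays self-contained.
def score_frame(frame: list[float]) -> float:
    return sum(frame) / len(frame)

def video_score(frames: list[list[float]]) -> float:
    return sum(score_frame(f) for f in frames) / len(frames)

videos = {
    "candidate_a": [[0.2, 0.4], [0.6, 0.8]],
    "candidate_b": [[0.9, 0.7], [0.8, 1.0]],
}
best = max(videos, key=lambda name: video_score(videos[name]))
print(best)  # candidate_b
```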
<details>
<summary>Expand for results</summary>

**Wan**

*(video omitted)* "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

**LTX-Video**

*(video omitted)* "The camera pans over a snow-covered mountain range, revealing a vast expanse of snow-capped peaks and valleys. The mountains are covered in a thick layer of snow, with some areas appearing almost white while others have a slightly darker, almost grayish hue. The peaks are jagged and irregular, with some rising sharply into the sky while others are more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature."

</details>
## Acknowledgements

* Thanks to [Willis Ma](https://twitter.com/ma_nanye) for all the guidance and pair-coding.
* Thanks to Hugging Face for supporting the compute.
* Thanks to Google for providing Gemini credits.
* Thanks a bunch to [amitness](https://github.com/amitness) for [this PR](https://github.com/sayakpaul/tt-scale-flux/pull/7).