
MultiGPU Work Units For Accelerated Sampling #7063

Open
Kosinkadink wants to merge 131 commits into master from worksplit-multigpu

Conversation

@Kosinkadink
Member

@Kosinkadink Kosinkadink commented Mar 4, 2025

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditioning is treated as the work unit. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g., an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node effectively does nothing, so workflows making use of the node stay compatible between single- and multi-GPU setups.

The feature works best when work splitting is symmetrical (the GPUs are identical or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load-balancing code about the relative performance of each device.

Nvidia (CUDA): Tested, works ✅
AMD (ROCm): Untested, will validate soon
AMD (DirectML): Untested
Intel (Arc XPU): Tested, works on Linux but not on Windows ⚠️

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, keeping any additional_models consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model; if one is found, it is (soft) cloned and made to match the main ModelPatcher.
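
A toy sketch of this clone bookkeeping (illustrative stand-ins only; the real ModelPatcher and additional_models API in comfy.model_patcher is considerably more involved):

import copy

class ToyModelPatcher:
    """Illustrative stand-in for comfy.model_patcher.ModelPatcher."""
    def __init__(self, model, load_device):
        self.model = model
        self.load_device = load_device
        self.additional_models = {}  # key -> list of patchers

    def clone(self):
        # Soft clone: shares the underlying model weights.
        c = ToyModelPatcher(self.model, self.load_device)
        c.additional_models = dict(self.additional_models)
        return c

    def deepclone_multigpu(self, load_device):
        # Deep clone: an independent copy of the model for another device.
        return ToyModelPatcher(copy.deepcopy(self.model), load_device)

def create_multigpu_deepclones(patcher, extra_devices):
    # Store one deep clone per extra device under the "multigpu" key,
    # mirroring how the PR registers clones on additional_models.
    patcher.additional_models["multigpu"] = [
        patcher.deepclone_multigpu(dev) for dev in extra_devices]
    return patcher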

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu, avoiding the potential performance regression that refactoring the single-GPU code could introduce. In the future, this can be revisited to reuse the same code while carefully comparing performance across various models. No processes are created, only Python threads; while the GIL does limit CPU parallelism, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. This vastly improves compatibility with existing code.
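
A minimal sketch of the threading model, assuming one already-loaded callable model per device and one chunk per device (the real _calc_cond_batch_multigpu batches conds per device and merges outputs differently):

import threading
import torch

def _run_on_device(model, x, device, results, idx):
    # One thread per GPU; CUDA kernels release the GIL, so plain
    # Python threads suffice when the GPU is the bottleneck.
    with torch.no_grad():
        results[idx] = model(x.to(device)).to("cpu")  # gather on a common device

def eval_across_devices(models_by_device, chunks):
    # Assumes len(models_by_device) == len(chunks).
    results = [None] * len(chunks)
    threads = [
        threading.Thread(target=_run_on_device, args=(m, x, d, results, i))
        for i, ((d, m), x) in enumerate(zip(models_by_device.items(), chunks))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torch.cat(results)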

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.
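
An illustrative sketch of the intended semantics (toy code; the real function operates on comfy.model_management's LoadedModel list, and is_clone/unload here stand in for the real checks):

def unload_model_and_clones(model, loaded_models):
    # Unload the given patcher plus anything cloned from it, while
    # leaving unrelated loaded models in place so they stay warm.
    for lm in [m for m in loaded_models if m is model or m.is_clone(model)]:
        lm.unload()  # hypothetical; stands in for the real unload path
        loaded_models.remove(lm)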

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage. This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to control their order (the first id is treated as device:0, the second as device:1, etc.).
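
A plausible sketch of the string-based flow (the environment-variable step is how device visibility is typically restricted; the exact wiring in ComfyUI may differ):

import argparse
import os

parser = argparse.ArgumentParser()
# A string keeps "--cuda-device 0" working while also allowing "--cuda-device 1,0".
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID",
                    help="Comma-separated cuda device ids; listed order maps to device:0, device:1, ...")
args = parser.parse_args()

if args.cuda_device is not None:
    # CUDA enumerates devices in the order given, so "1,0" makes
    # physical GPU 1 appear as torch.device("cuda:0").
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device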

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
@coderabbitai

coderabbitai Bot commented Mar 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The PR adds multi‑GPU support across the codebase: new comfy.multigpu (GPUOptions, GPUOptionsGroup, deep‑clone creation, and load balancing); ModelPatcher gains multigpu metadata, deepclone_multigpu, match_multigpu_clones, and multigpu‑aware hook/keyframe handling; ControlBase/ControlNet/T2IAdapter gain per‑device clone management and ControlIsolation; samplers and sampler_helpers add per‑device batching and threaded evaluation; model_management enumerates and frees memory across all torch devices; CLI --cuda-device now accepts strings; new Comfy nodes expose multigpu setup; minor whitespace change in quant_ops.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage - ⚠️ Warning: Docstring coverage is 16.09%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check - ✅ Passed: The title 'MultiGPU Work Units For Accelerated Sampling' directly and clearly describes the main feature being added: multi-GPU acceleration through work unit splitting.
  • Description check - ✅ Passed: The pull request description comprehensively documents the MultiGPU Work Units feature, implementation details, performance metrics, and known issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_multigpu.py`:
- Around line 66-69: The GPUOptionsGroup.clone() return value is being discarded
in create_gpu_options; capture and use the cloned object so we don't mutate the
caller-supplied gpu_options. Change the behavior in create_gpu_options to assign
the result of GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

In `@comfy/cli_args.py`:
- Line 52: The --cuda-device argument currently only accepts a single token;
update the parser.add_argument call for "--cuda-device" to accept multiple
space-separated device IDs by adding nargs='+' (and set type=int if you want
integer IDs) so that invocations like "--cuda-device 0 1" parse correctly;
alternatively, if you prefer comma-separated input, change the help text to
explicitly state the required format instead of implying plural support.

In `@comfy/controlnet.py`:
- Around line 322-328: The multigpu clone path in deepclone_multigpu currently
builds c = self.copy() which does not carry the previous_controlnet chain,
causing stacked ControlNets/T2IAdapters to be lost on secondary GPUs; update
deepclone_multigpu to copy previous_controlnet (and any linked
.previous_controlnet chain) from self to c after c = self.copy() so that the
full chain is preserved, then continue deep-copying control_model and wrapping
it as before (ensure multigpu_clones[load_device] assignment remains unchanged);
apply the same preservation of previous_controlnet chaining to the similar clone
code paths that use copy_to()/get_instance_for_device() so all per-device clones
keep the full previous_controlnet chain.

In `@comfy/model_management.py`:
- Around line 214-231: The function get_all_torch_devices currently only handles
NVIDIA/Intel/Ascend and can return an empty list (breaking exclude_current and
unload_all_models); update it to (1) add detection for other common backends
(e.g., ROCm/DirectML/MLU/MPS) or at minimum attempt generic torch backend checks
such as torch.cuda.device_count() and torch.backends.mps.is_available() and
append appropriate torch.device entries, (2) if after all backend checks devices
is still empty, append get_torch_device() as a safe fallback so callers always
get at least the current device, and (3) make the exclude_current branch robust
by checking membership before calling devices.remove(get_torch_device()); refer
to get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

In `@comfy/model_patcher.py`:
- Around line 1315-1321: The ON_PREPARE_STATE callbacks are being invoked with
four positional args in prepare_state, breaking backward-compatibility for
callbacks that expect three; update prepare_state to detect each callback's
accepted arity (e.g., via inspect.signature or callable.__code__.co_argcount)
and call either callback(self, timestep, model_options, ignore_multigpu) if it
accepts 4 args or callback(self, timestep, model_options) if it only accepts 3
(or attempt the 4-arg call and fall back to 3-arg on TypeError), and apply the
same arity-gated invocation when recursing into multigpu clones; reference
prepare_state and CallbacksMP.ON_PREPARE_STATE to locate where to change the
callsite.

In `@comfy/multigpu.py`:
- Around line 60-112: create_multigpu_deepclones clones existing "multigpu"
additional models but never removes ones that exceed the new max_gpus; to fix,
after computing limit_extra_devices (the allowed device list) retrieve
model.get_additional_models_with_key("multigpu"), filter out any clone whose
load_device is not in ([model.load_device] + limit_extra_devices) (use each
ModelPatcher.load_device to decide), then call
model.set_additional_models("multigpu", filtered_list) before
match_multigpu_clones()/gpu_options.register; ensure reuse_loaded logic still
can find matching clones and that is_multigpu_base_clone flags remain correct
for retained clones.

In `@comfy/quant_ops.py`:
- Line 23: The unconditional call ck.registry.disable("cuda") in
comfy/quant_ops.py should be removed and only invoked when the unsupported
multigpu+cuda combination is actually active; locate the
ck.registry.disable("cuda") invocation and wrap it with a guard that checks the
real multigpu/backend state (for example an existing multigpu flag or function
like is_multigpu_enabled(), a config/ENV check, or the code path that handles
multigpu setup) so that CUDA is only disabled when multigpu is enabled and the
specific backend combination is unsupported, otherwise leave CUDA enabled for
normal single-GPU runs.

In `@comfy/sampler_helpers.py`:
- Line 200: Add the missing BaseModel import used in the type annotation for
real_model (the line "real_model: BaseModel = model.model") by adding "from
typing import TYPE_CHECKING" already present and then inside the existing
TYPE_CHECKING block import BaseModel from its module (e.g., "from <module>
import BaseModel") so the annotation is defined at type-check time;
alternatively remove the BaseModel annotation if you prefer not to add the
import.

In `@comfy/samplers.py`:
- Around line 391-397: The multigpu scheduler currently ignores multigpu_options
and uses integer floor division (//) inside math.ceil, producing coarse,
incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
- Around line 847-850: The code calls x['control'].pre_run(model, ...) for the
base control and then calls device_cnet.pre_run(model, ...) for each control
clone, incorrectly passing the base model to per-device controls; update the
loop to pass the matching per-device model clone instead. Specifically, when
iterating x['control'].multigpu_clones (the device_cnet clones), look up the
corresponding model clone (e.g., from model.multigpu_clones using the same
keys/ids) and call device_cnet.pre_run(model_clone,
percent_to_timestep_function) so each control clone receives its matching model
clone; keep the initial x['control'].pre_run(model, ...) for the base control.

In `@comfy/sd.py`:
- Line 1557: The assignment to out[0].cached_patcher_init can raise when out[0]
is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so
guard it: check that out[0] is not None before assigning to
out[0].cached_patcher_init and, if the CLIP patcher is created separately for
checkpoint-backed models, set its own cached_patcher_init instead (or attach the
init tuple to the patcher instance). Locate the assignment line and ensure both
code paths (when out[0] exists and when only a CLIP patcher exists) receive the
same cached_patcher_init tuple so no NullReference occurs and checkpoint-backed
CLIP patchers keep their cached_patcher_init.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acbb973a-f861-46f7-805a-a50660ad4671

📥 Commits

Reviewing files that changed from the base of the PR and between b941913 and f410d28.

📒 Files selected for processing (12)
  • comfy/cli_args.py
  • comfy/controlnet.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/multigpu.py
  • comfy/patcher_extension.py
  • comfy/quant_ops.py
  • comfy/sampler_helpers.py
  • comfy/samplers.py
  • comfy/sd.py
  • comfy_extras/nodes_multigpu.py
  • nodes.py

Comment thread comfy_extras/nodes_multigpu.py Outdated
Comment on lines +66 to +69
def create_gpu_options(self, device_index: int, relative_speed: float, gpu_options: comfy.multigpu.GPUOptionsGroup=None):
    if not gpu_options:
        gpu_options = comfy.multigpu.GPUOptionsGroup()
    gpu_options.clone()


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^multigpu\.py$' comfy)
rg -n -C4 'class GPUOptionsGroup|def clone\(' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 793


🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^nodes_multigpu\.py$' comfy_extras)
sed -n '66,80p' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 529


Capture the result of gpu_options.clone() on line 69.

The cloned object is being discarded. Since GPUOptionsGroup.clone() returns a new object, the current code ignores it and then mutates the caller-supplied gpu_options directly. When multiple MultiGPU Options nodes share the same GPU options group, this causes unintended state sharing across the node chain.

Change line 69 from:

gpu_options.clone()

to:

gpu_options = gpu_options.clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_multigpu.py` around lines 66 - 69, The
GPUOptionsGroup.clone() return value is being discarded in create_gpu_options;
capture and use the cloned object so we don't mutate the caller-supplied
gpu_options. Change the behavior in create_gpu_options to assign the result of
GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

Comment thread comfy/cli_args.py
parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID", help="Set the ids of cuda devices this instance will use. All other devices will not be visible.")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None)

for argv in (
    ["--cuda-device", "0,1"],
    ["--cuda-device", "0", "1"],
):
    try:
        ns = parser.parse_args(argv)
        print(argv, "->", ns.cuda_device)
    except SystemExit as e:
        print(argv, "-> parse failed with exit", e.code)
PY

Repository: Comfy-Org/ComfyUI

Length of output: 243


🏁 Script executed:

rg "cuda_device" -A 3 -B 3 --max-count=20

Repository: Comfy-Org/ComfyUI

Length of output: 702


🏁 Script executed:

rg "(cuda-device|cuda_device)" -l

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

git log --oneline --all -n 1

Repository: Comfy-Org/ComfyUI

Length of output: 113


🏁 Script executed:

git diff HEAD~1 HEAD -- comfy/cli_args.py 2>/dev/null || git show HEAD:comfy/cli_args.py 2>/dev/null | grep -A 5 -B 5 "cuda-device"

Repository: Comfy-Org/ComfyUI

Length of output: 1367


--cuda-device 0 1 still won't parse.

Changing the type to str only changes the value type, not the option arity. argparse still accepts one token here, so --cuda-device 0 1 fails with "unrecognized arguments: 1". The help text mentions "ids" (plural), implying multi-device support, but the current implementation requires comma-separated format: --cuda-device 0,1. Either add nargs='+' to accept space-separated device IDs or clarify the help text to document the required comma-separated input format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/cli_args.py` at line 52, The --cuda-device argument currently only
accepts a single token; update the parser.add_argument call for "--cuda-device"
to accept multiple space-separated device IDs by adding nargs='+' (and set
type=int if you want integer IDs) so that invocations like "--cuda-device 0 1"
parse correctly; alternatively, if you prefer comma-separated input, change the
help text to explicitly state the required format instead of implying plural
support.

Comment thread comfy/controlnet.py
Comment on lines +322 to +328
def deepclone_multigpu(self, load_device, autoregister=False):
    c = self.copy()
    c.control_model = copy.deepcopy(c.control_model)
    c.control_model_wrapped = comfy.model_patcher.ModelPatcher(c.control_model, load_device=load_device, offload_device=comfy.model_management.unet_offload_device())
    if autoregister:
        self.multigpu_clones[load_device] = c
    return c


⚠️ Potential issue | 🟠 Major

Preserve the previous_controlnet chain in multigpu clones.

These new clone paths build c from copy(), but copy_to() does not carry previous_controlnet. Once get_instance_for_device() returns the per-device clone, stacked ControlNets/T2IAdapters on earlier links are silently dropped on secondary GPUs.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

Also applies to: 952-958
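
A minimal sketch of the fix being asked for (assuming, per the reviewer, that copy() leaves previous_controlnet unset; whether the previous link should itself be deep-cloned per device is a design choice):

def deepclone_multigpu(self, load_device, autoregister=False):
    c = self.copy()
    # Re-attach the chain so stacked ControlNets/T2IAdapters survive
    # on secondary devices.
    if self.previous_controlnet is not None:
        c.previous_controlnet = self.previous_controlnet.deepclone_multigpu(load_device)
    c.control_model = copy.deepcopy(c.control_model)
    c.control_model_wrapped = comfy.model_patcher.ModelPatcher(
        c.control_model, load_device=load_device,
        offload_device=comfy.model_management.unet_offload_device())
    if autoregister:
        self.multigpu_clones[load_device] = c
    return c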

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/controlnet.py` around lines 322 - 328, The multigpu clone path in
deepclone_multigpu currently builds c = self.copy() which does not carry the
previous_controlnet chain, causing stacked ControlNets/T2IAdapters to be lost on
secondary GPUs; update deepclone_multigpu to copy previous_controlnet (and any
linked .previous_controlnet chain) from self to c after c = self.copy() so that
the full chain is preserved, then continue deep-copying control_model and
wrapping it as before (ensure multigpu_clones[load_device] assignment remains
unchanged); apply the same preservation of previous_controlnet chaining to the
similar clone code paths that use copy_to()/get_instance_for_device() so all
per-device clones keep the full previous_controlnet chain.

Comment thread comfy/model_management.py
Comment on lines +214 to +231
def get_all_torch_devices(exclude_current=False):
    global cpu_state
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            for i in range(torch.cuda.device_count()):
                devices.append(torch.device(i))
        elif is_intel_xpu():
            for i in range(torch.xpu.device_count()):
                devices.append(torch.device(i))
        elif is_ascend_npu():
            for i in range(torch.npu.device_count()):
                devices.append(torch.device(i))
    else:
        devices.append(get_torch_device())
    if exclude_current:
        devices.remove(get_torch_device())
    return devices


⚠️ Potential issue | 🟠 Major

Handle non-CUDA backends in get_all_torch_devices.

This helper only enumerates CUDA/XPU/NPU devices, so ROCm/DirectML/MLU-style paths leave devices empty. With exclude_current=True that turns into a remove() failure, and unload_all_models() also stops freeing anything on those backends because it now routes through this helper.

As per coding guidelines, comfy/** changes should focus on backward compatibility and memory management/GPU resource handling.
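
A hedged sketch of the defensive shape being requested (the backend probes shown are illustrative; the helper names come from comfy.model_management):

def get_all_torch_devices(exclude_current=False):
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            devices = [torch.device(i) for i in range(torch.cuda.device_count())]
        elif is_intel_xpu():
            devices = [torch.device(i) for i in range(torch.xpu.device_count())]
        elif is_ascend_npu():
            devices = [torch.device(i) for i in range(torch.npu.device_count())]
    if not devices:
        # ROCm/DirectML/MPS-style paths (and CPU mode) fall back to the
        # current device instead of returning an empty list.
        devices.append(get_torch_device())
    if exclude_current and get_torch_device() in devices:
        devices.remove(get_torch_device())
    return devices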

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 214 - 231, The function
get_all_torch_devices currently only handles NVIDIA/Intel/Ascend and can return
an empty list (breaking exclude_current and unload_all_models); update it to (1)
add detection for other common backends (e.g., ROCm/DirectML/MLU/MPS) or at
minimum attempt generic torch backend checks such as torch.cuda.device_count()
and torch.backends.mps.is_available() and append appropriate torch.device
entries, (2) if after all backend checks devices is still empty, append
get_torch_device() as a safe fallback so callers always get at least the current
device, and (3) make the exclude_current branch robust by checking membership
before calling devices.remove(get_torch_device()); refer to
get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

Comment thread comfy/model_patcher.py
Comment on lines +1315 to +1321
def prepare_state(self, timestep, model_options, ignore_multigpu=False):
    for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
-       callback(self, timestep)
+       callback(self, timestep, model_options, ignore_multigpu)
    if not ignore_multigpu and "multigpu_clones" in model_options:
        for p in model_options["multigpu_clones"].values():
            p: ModelPatcher
            p.prepare_state(timestep, model_options, ignore_multigpu=True)


⚠️ Potential issue | 🟠 Major

Keep ON_PREPARE_STATE callback arity backward-compatible.

This now passes ignore_multigpu as a fourth positional argument to every existing ON_PREPARE_STATE callback. Any custom node still implementing the old (patcher, timestep, model_options) signature will fail with TypeError during sampling unless this is gated behind an arity check or opt-in keyword path.

As per coding guidelines, comfy/** changes should focus on backward compatibility.
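
One hedged way to gate the call on arity (illustrative; attempting the 4-arg call and falling back on TypeError would also work):

import inspect

for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
    params = inspect.signature(callback).parameters
    # Legacy callbacks take (patcher, timestep, model_options); newer ones
    # also accept ignore_multigpu. *args counts as accepting everything.
    accepts_four = len(params) >= 4 or any(
        p.kind is inspect.Parameter.VAR_POSITIONAL for p in params.values())
    if accepts_four:
        callback(self, timestep, model_options, ignore_multigpu)
    else:
        callback(self, timestep, model_options)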

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_patcher.py` around lines 1315 - 1321, The ON_PREPARE_STATE
callbacks are being invoked with four positional args in prepare_state, breaking
backward-compatibility for callbacks that expect three; update prepare_state to
detect each callback's accepted arity (e.g., via inspect.signature or
callable.__code__.co_argcount) and call either callback(self, timestep,
model_options, ignore_multigpu) if it accepts 4 args or callback(self, timestep,
model_options) if it only accepts 3 (or attempt the 4-arg call and fall back to
3-arg on TypeError), and apply the same arity-gated invocation when recursing
into multigpu clones; reference prepare_state and CallbacksMP.ON_PREPARE_STATE
to locate where to change the callsite.

Comment thread comfy/quant_ops.py Outdated
ck.registry.disable("cuda")
logging.warning("WARNING: You need pytorch with cu130 or higher to use optimized CUDA operations.")

ck.registry.disable("cuda") # multigpu will not work rn with comfy-kitchen on cuda backend


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see the full context of the file around line 23
cat -n comfy/quant_ops.py | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Let's also check if there's any git context to see if this is a new change
git diff HEAD -- comfy/quant_ops.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Search for multigpu-related logic in the codebase to understand the gating/conditions
rg -n "multigpu" --type py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 25235


Don't disable the comfy-kitchen CUDA backend globally.

Line 23 disables it unconditionally at import time, which affects all CUDA runs including single-GPU and non-multigpu workflows. This regresses quantization performance for users not using multigpu. Gate the disable behind actual multigpu activation or the specific unsupported backend combination instead of doing it unconditionally here.
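
A hedged sketch of that gating (is_multigpu_requested() is a hypothetical helper; the real signal would come from wherever multigpu setup is decided):

def is_multigpu_requested() -> bool:
    # Hypothetical check: e.g. more than one id passed via --cuda-device.
    ids = getattr(args, "cuda_device", None)
    return ids is not None and "," in str(ids)

if is_multigpu_requested():
    # comfy-kitchen's CUDA backend currently cannot be used with multigpu.
    ck.registry.disable("cuda")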

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/quant_ops.py` at line 23, The unconditional call
ck.registry.disable("cuda") in comfy/quant_ops.py should be removed and only
invoked when the unsupported multigpu+cuda combination is actually active;
locate the ck.registry.disable("cuda") invocation and wrap it with a guard that
checks the real multigpu/backend state (for example an existing multigpu flag or
function like is_multigpu_enabled(), a config/ENV check, or the code path that
handles multigpu setup) so that CUDA is only disabled when multigpu is enabled
and the specific backend combination is unsupported, otherwise leave CUDA
enabled for normal single-GPU runs.

Comment thread comfy/sampler_helpers.py
minimum_memory_required += inference_memory
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required, force_full_load=force_full_load)
-real_model = model.model
+real_model: BaseModel = model.model


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check the imports and line 200 in the file
head -20 comfy/sampler_helpers.py && echo "---" && sed -n '195,205p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 1130


🏁 Script executed:

# Search for all imports of BaseModel or where it might come from
rg -i "basemodel|from.*import|import" comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 463


🏁 Script executed:

# Check if there's a BaseModel defined or imported elsewhere that should be in this file
rg "class BaseModel|from.*BaseModel" comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 50375


🏁 Script executed:

# Check git diff to see if line 200 is new
git diff HEAD^ comfy/sampler_helpers.py | grep -A 5 -B 5 "BaseModel"

Repository: Comfy-Org/ComfyUI

Length of output: 246


🏁 Script executed:

# Look for similar type annotations in the file to understand the pattern
rg ":\s*\w+\s*=" comfy/sampler_helpers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 100


🏁 Script executed:

# Check if there are any existing TYPE_CHECKING imports or patterns
sed -n '1,30p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 894


Add the missing BaseModel import to fix undefined name.

Line 200 uses BaseModel in the type annotation, but it's not imported. Add it to the TYPE_CHECKING block to match the file's existing pattern, or remove the annotation.

Suggested fix
 if TYPE_CHECKING:
     from comfy.model_patcher import ModelPatcher
     from comfy.controlnet import ControlBase
+    from comfy.model_base import BaseModel
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
real_model: BaseModel = model.model
if TYPE_CHECKING:
    from comfy.model_patcher import ModelPatcher
    from comfy.controlnet import ControlBase
    from comfy.model_base import BaseModel
🧰 Tools
🪛 GitHub Actions: Python Linting

[error] 200-200: F821 Undefined name BaseModel detected by ruff check. Ensure BaseModel is imported or defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sampler_helpers.py` at line 200, Add the missing BaseModel import used
in the type annotation for real_model (the line "real_model: BaseModel =
model.model") by adding "from typing import TYPE_CHECKING" already present and
then inside the existing TYPE_CHECKING block import BaseModel from its module
(e.g., "from <module> import BaseModel") so the annotation is defined at
type-check time; alternatively remove the BaseModel annotation if you prefer not
to add the import.

Comment thread comfy/samplers.py
Comment on lines +391 to +397
devices = [dev_m for dev_m in model_options['multigpu_clones'].keys()]
device_batched_hooked_to_run: dict[torch.device, list[tuple[comfy.hooks.HookGroup, tuple]]] = {}

total_conds = 0
for to_run in hooked_to_run.values():
    total_conds += len(to_run)
conds_per_device = max(1, math.ceil(total_conds//len(devices)))


⚠️ Potential issue | 🟠 Major

relative_speed is not used by the multigpu scheduler.

This branch still computes a fixed conds_per_device and round-robins by raw condition count; multigpu_options is never consulted here. The new MultiGPU Options node therefore has no effect on work distribution, and the // inside math.ceil(...) makes the split even coarser on uneven counts.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths.

Also applies to: 403-416, 433-435
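
A hedged sketch of the proportional split (the relative_speed lookup is an assumption about how the options are keyed; note the ceil can over-allocate, so the scheduler must still cap at the remaining total):

import math

def allocate_quotas(total_conds, devices, relative_speed):
    # relative_speed: weight per device; faster devices get more conds.
    total_speed = sum(relative_speed[d] for d in devices)
    quotas = {}
    for d in devices:
        share = total_conds * relative_speed[d] / total_speed  # true division, not //
        quotas[d] = max(1, math.ceil(share)) if total_conds > 0 else 0
    return quotas

# Example: a 2:1 speed ratio splits 9 conds roughly 6/3.
# allocate_quotas(9, ["cuda:0", "cuda:1"], {"cuda:0": 2.0, "cuda:1": 1.0})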

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The multigpu scheduler currently
ignores multigpu_options and uses integer floor division (//) inside math.ceil,
producing coarse, incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.

Comment thread comfy/samplers.py
Comment on lines 847 to +850
if 'control' in x:
    x['control'].pre_run(model, percent_to_timestep_function)
    for device_cnet in x['control'].multigpu_clones.values():
        device_cnet.pre_run(model, percent_to_timestep_function)


⚠️ Potential issue | 🟠 Major

Run per-device controls against the matching model clone.

These new pre_run() calls feed every device clone the base model. Any control that snapshots model-specific state during pre_run() will capture the wrong device/model; QwenFunControlNet.pre_run() in this file already stores model.diffusion_model, so its multigpu clone will still point at the base UNet.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 847 - 850, The code calls
x['control'].pre_run(model, ...) for the base control and then calls
device_cnet.pre_run(model, ...) for each control clone, incorrectly passing the
base model to per-device controls; update the loop to pass the matching
per-device model clone instead. Specifically, when iterating
x['control'].multigpu_clones (the device_cnet clones), look up the corresponding
model clone (e.g., from model.multigpu_clones using the same keys/ids) and call
device_cnet.pre_run(model_clone, percent_to_timestep_function) so each control
clone receives its matching model clone; keep the initial
x['control'].pre_run(model, ...) for the base control.

Comment thread comfy/sd.py
-out[0].cached_patcher_init = (load_checkpoint_guess_config_model_only, (ckpt_path, embedding_directory, model_options, te_model_options))
-if output_clip and out[1] is not None:
-    out[1].patcher.cached_patcher_init = (load_checkpoint_guess_config_clip_only, (ckpt_path, embedding_directory, model_options, te_model_options))
+out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)


⚠️ Potential issue | 🔴 Critical

Guard the model-side cache assignment.

load_checkpoint_guess_config_clip_only() reaches this path with output_model=False, so out[0] is None and this line raises before the CLIP patcher can be returned. It also leaves checkpoint-backed CLIP patchers without their own cached_patcher_init.

Possible fix
-    out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
+    if out[0] is not None:
+        out[0].cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, False, False, embedding_directory, True, model_options, te_model_options),
+            0,
+        )
+    if out[1] is not None:
+        out[1].patcher.cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, True, False, embedding_directory, False, model_options, te_model_options),
+            1,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sd.py` at line 1557, The assignment to out[0].cached_patcher_init can
raise when out[0] is None (e.g. when called from
load_checkpoint_guess_config_clip_only()), so guard it: check that out[0] is not
None before assigning to out[0].cached_patcher_init and, if the CLIP patcher is
created separately for checkpoint-backed models, set its own cached_patcher_init
instead (or attach the init tuple to the patcher instance). Locate the
assignment line and ensure both code paths (when out[0] exists and when only a
CLIP patcher exists) receive the same cached_patcher_init tuple so no
NullReference occurs and checkpoint-backed CLIP patchers keep their
cached_patcher_init.

Amp-Thread-ID: https://ampcode.com/threads/T-019d3ee9-19d5-767a-9d7a-e50cbbef815b
Co-authored-by: Amp <amp@ampcode.com>

# Conflicts:
#	comfy/samplers.py

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
comfy/samplers.py (1)

506-508: Exception re-raised after appending to results - consider removing redundant raise.

The raise in the exception handler terminates the thread silently (exceptions in threads don't propagate to the main thread automatically). Since errors are already captured in results and re-raised at line 522-523, the raise here is redundant but harmless. The current pattern works correctly - the main thread will surface the first captured error after all threads complete.

✏️ Optional: Remove redundant raise for clarity
         except Exception as e:
             results.append(thread_result(None, None, None, None, None, error=e))
-            raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 506 - 508, Remove the redundant re-raise
inside the except block: keep the results.append(thread_result(..., error=e)) to
record the error (referencing the thread_result construction), but delete the
trailing raise so the worker thread simply records the error and exits; the main
thread already inspects results and re-raises the first captured exception
later, so no further in-thread raising is necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@comfy/samplers.py`:
- Around line 506-508: Remove the redundant re-raise inside the except block:
keep the results.append(thread_result(..., error=e)) to record the error
(referencing the thread_result construction), but delete the trailing raise so
the worker thread simply records the error and exits; the main thread already
inspects results and re-raises the first captured exception later, so no further
in-thread raising is necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 360b873a-b7a1-43f6-904a-72bb0f1a647b

📥 Commits

Reviewing files that changed from the base of the PR and between f410d28 and be35378.

📒 Files selected for processing (6)
  • comfy/cli_args.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/samplers.py
  • comfy/sd.py
  • nodes.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • nodes.py
  • comfy/cli_args.py
  • comfy/sd.py
  • comfy/model_patcher.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
comfy/samplers.py (2)

391-397: ⚠️ Potential issue | 🔴 Critical

Fix per-device quota math; current scheduling can stall and ignores speed weights.

Line 397 uses floor division inside ceil, and Lines 415-416 hard-cap selection by remaining quota. When quotas are exhausted, to_batch can become empty, to_run is never popped, and the while loop can spin indefinitely. relative_speed is also not applied, so load-balancing options still have no effect.

Suggested direction
- conds_per_device = max(1, math.ceil(total_conds//len(devices)))
+ # Use true division and weighted shares (relative_speed), then assign remainders.
+ # Also skip saturated devices before batching to guarantee progress.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths and thread safety for concurrent execution.

Also applies to: 403-416, 433-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The per-device quota calculation
and scheduling are wrong: replace the floor-division inside ceil
(total_conds//len(devices)) with true division (total_conds / len(devices)) when
computing conds_per_device, and implement per-device dynamic quotas that honor
each device's relative_speed weight (use model_options['relative_speed'] keyed
by device to allocate proportional quotas instead of a single uniform
conds_per_device). In the scheduling loop that moves items from hooked_to_run
into device_batched_hooked_to_run (symbols: hooked_to_run, to_run, to_batch,
device_batched_hooked_to_run), ensure you always pop empty to_run lists so the
while loop can terminate (if to_batch becomes empty remove that to_run from
consideration), guarantee at least one item is selected per-device when work
remains (but still respect remaining global/weighted quota), and recalc
remaining quotas each iteration so devices with higher relative_speed get more
items; this prevents empty selections, avoids infinite loops, and applies the
relative_speed weighting correctly.

850-851: ⚠️ Potential issue | 🟠 Major

Run control clones with their matching model clone, not the base model.

Lines 850-851 call device_cnet.pre_run(model, ...) for every clone. Controls that cache model internals during pre_run() will bind to the wrong model/device.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 850 - 851, The control clones are being
pre-run with the base model instead of their corresponding model clone; change
the call in the loop that iterates x['control'].multigpu_clones so each
device_cnet.pre_run gets the matching model clone (not the top-level model).
Locate the multigpu_clones structures (x['control'].multigpu_clones and the
model clone container, e.g. x['model'].multigpu_clones or similarly named
mapping) and pair clones by device/key (or zip the values if they are ordered)
and call device_cnet.pre_run(matching_model_clone, percent_to_timestep_function)
so cached internals bind to the correct model clone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/samplers.py`:
- Around line 424-425: The multigpu memory-fit check builds input_shape from
batch_amount/first_shape but calls model.memory_required(input_shape) without
conditioning shapes, which underestimates memory for conditioned models; change
the call to pass the conditioning shapes (use cond_shapes) just like the
single-GPU path so memory_required receives both input_shape and cond_shapes
(update the check around the variables input_shape, cond_shapes,
model.memory_required, batch_amount, and first_shape) to avoid over-batching and
OOMs.
- Around line 447-448: The code unconditionally calls
torch.cuda.set_device(device) before selecting model_current from
model_options["multigpu_clones"], which fails for non-CUDA devices; change the
logic in the section that sets model_current (and any device setup) to first
check the device type (e.g., inspect device.type or use
str(device).startswith("cuda")) and only call torch.cuda.set_device(device) for
CUDA devices, otherwise skip or use appropriate backend-specific calls; update
the block referencing model_options["multigpu_clones"][device].model and
torch.cuda.set_device to conditionally handle CUDA vs non-CUDA backends so
multi-backend multigpu clones (XPU/NPU/MLU/MPS/CPU/etc.) don't trigger a
CUDA-only call.
- Around line 503-506: The worker thread is moving tensors asynchronously with
.to(output_device) (via model_options['model_function_wrapper'] /
model_current.apply_model) and then returning thread_result, which can cause
races when the main thread aggregates GPU tensors; fix by calling
torch.cuda.synchronize(output_device) immediately after the
.to(output_device).chunk(...) transfer completes in the worker (guarded by a
check that output_device is a CUDA device) so the copy/kernels have finished
before appending/returning the thread_result.

---

Duplicate comments:
In `@comfy/samplers.py`:
- Around line 391-397: The per-device quota calculation and scheduling are
wrong: replace the floor-division inside ceil (total_conds//len(devices)) with
true division (total_conds / len(devices)) when computing conds_per_device, and
implement per-device dynamic quotas that honor each device's relative_speed
weight (use model_options['relative_speed'] keyed by device to allocate
proportional quotas instead of a single uniform conds_per_device). In the
scheduling loop that moves items from hooked_to_run into
device_batched_hooked_to_run (symbols: hooked_to_run, to_run, to_batch,
device_batched_hooked_to_run), ensure you always pop empty to_run lists so the
while loop can terminate (if to_batch becomes empty remove that to_run from
consideration), guarantee at least one item is selected per-device when work
remains (but still respect remaining global/weighted quota), and recalc
remaining quotas each iteration so devices with higher relative_speed get more
items; this prevents empty selections, avoids infinite loops, and applies the
relative_speed weighting correctly.
- Around line 850-851: The control clones are being pre-run with the base model
instead of their corresponding model clone; change the call in the loop that
iterates x['control'].multigpu_clones so each device_cnet.pre_run gets the
matching model clone (not the top-level model). Locate the multigpu_clones
structures (x['control'].multigpu_clones and the model clone container, e.g.
x['model'].multigpu_clones or similarly named mapping) and pair clones by
device/key (or zip the values if they are ordered) and call
device_cnet.pre_run(matching_model_clone, percent_to_timestep_function) so
cached internals bind to the correct model clone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbdf3a6c-ad44-4348-8b59-a5b62349893a

📥 Commits

Reviewing files that changed from the base of the PR and between be35378 and 1d8e379.

📒 Files selected for processing (2)
  • comfy/samplers.py
  • comfy_extras/nodes_multigpu.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • comfy_extras/nodes_multigpu.py

Comment thread comfy/samplers.py
Comment on lines +424 to +425
input_shape = [len(batch_amount) * first_shape[0]] + list(first_shape)[1:]
if model.memory_required(input_shape) * 1.5 < free_memory:


⚠️ Potential issue | 🟠 Major

Use cond_shapes in multigpu memory-fit checks to avoid over-batching.

Line 425 calls model.memory_required(input_shape) without conditioning shapes, unlike the single-GPU path. This can under-estimate required memory and trigger avoidable OOMs under complex conditioning.

Parity fix with single-GPU path
-                if model.memory_required(input_shape) * 1.5 < free_memory:
+                cond_shapes = collections.defaultdict(list)
+                for tt in batch_amount:
+                    for k, v in to_run[tt][0].conditioning.items():
+                        cond_shapes[k].append(v.size())
+                if model.memory_required(input_shape, cond_shapes=cond_shapes) * 1.5 < free_memory:
                     to_batch = batch_amount
                     break

As per coding guidelines, comfy/** changes should focus on memory management and GPU resource handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 424 - 425, The multigpu memory-fit check
builds input_shape from batch_amount/first_shape but calls
model.memory_required(input_shape) without conditioning shapes, which
underestimates memory for conditioned models; change the call to pass the
conditioning shapes (use cond_shapes) just like the single-GPU path so
memory_required receives both input_shape and cond_shapes (update the check
around the variables input_shape, cond_shapes, model.memory_required,
batch_amount, and first_shape) to avoid over-batching and OOMs.

Comment thread comfy/samplers.py
Comment on lines +447 to +448
torch.cuda.set_device(device)
model_current: BaseModel = model_options["multigpu_clones"][device].model


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify backend diversity and cuda-only binding usage
rg -n --type=py -C2 'def (is_intel_xpu|is_ascend_npu|is_mlu)|torch\.(xpu|npu|mlu)\.set_device'
rg -n --type=py -C2 'torch\.cuda\.set_device\(device\)|_calc_cond_batch_multigpu'

Repository: Comfy-Org/ComfyUI

Length of output: 1936


🏁 Script executed:

# Find where multigpu_clones is populated
rg -n --type=py 'multigpu_clones\s*=' | head -20

# Check the context around creation
rg -n --type=py -B5 -A5 'multigpu_clones\[' | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1511


🏁 Script executed:

# Check if line 447 is new in this commit or pre-existing
git log --oneline -10 -- comfy/samplers.py

# Get git diff for this specific line
git diff HEAD~1 HEAD -- comfy/samplers.py | grep -A10 -B10 'torch.cuda.set_device'

Repository: Comfy-Org/ComfyUI

Length of output: 164


🏁 Script executed:

# Find where _calc_cond_batch_multigpu is called and what device is passed
rg -n --type=py -B10 '_calc_cond_batch_multigpu' | head -40

# Look for where multigpu_clones dict is iterated or populated
rg -n --type=py 'for .* in.*multigpu_clones' -A5 -B5 | head -80

Repository: Comfy-Org/ComfyUI

Length of output: 6026


🏁 Script executed:

# Find what creates model_options['multigpu_clones'] and what device types go into it
rg -n --type=py "model_options\[.multigpu_clones.\]" -B3 -A3 | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 3189


🏁 Script executed:

# Find where multigpu setup happens and what device types are assigned
rg -n --type=py -B15 'multigpu_dict\[.*load_device.*\]' comfy/sampler_helpers.py

# Check if there's device filtering for multigpu
rg -n --type=py 'multigpu' comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 2758


🏁 Script executed:

# Look for where load_device is set and what device types are possible
rg -n --type=py 'load_device\s*=' --type=py -A2 -B2 | grep -E '(load_device|xpu|npu|mlu|cuda)' | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 5885


🏁 Script executed:

# Find get_torch_device implementation
rg -n --type=py 'def get_torch_device' -A20 comfy/model_management.py

# Also check if there's device type checking in the codebase
rg -n --type=py 'device\.type.*==.*"cuda"' | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Check git history for line 447 in samplers.py
git log --all --oneline --follow -p -- comfy/samplers.py | head -200

# Alternative: check recent changes to that specific line
git blame comfy/samplers.py | grep -A2 -B2 "torch.cuda.set_device"

Repository: Comfy-Org/ComfyUI

Length of output: 7767


Avoid CUDA-only set_device in a multi-backend multigpu path.

Line 447 unconditionally calls torch.cuda.set_device(device), which will crash when multigpu clones contain non-CUDA devices (XPU, NPU, MLU, MPS, CPU, DirectML). Since get_torch_device() returns various device types that are added as keys to multigpu_clones, this breaks multi-backend support.

Safer minimal fix
-            torch.cuda.set_device(device)
+            if hasattr(device, "type") and device.type == "cuda":
+                torch.cuda.set_device(device)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 447 - 448, The code unconditionally calls
torch.cuda.set_device(device) before selecting model_current from
model_options["multigpu_clones"], which fails for non-CUDA devices; change the
logic in the section that sets model_current (and any device setup) to first
check the device type (e.g., inspect device.type or use
str(device).startswith("cuda")) and only call torch.cuda.set_device(device) for
CUDA devices, otherwise skip or use appropriate backend-specific calls; update
the block referencing model_options["multigpu_clones"][device].model and
torch.cuda.set_device to conditionally handle CUDA vs non-CUDA backends so
multi-backend multigpu clones (XPU/NPU/MLU/MPS/CPU/etc.) don't trigger a
CUDA-only call.

Comment thread comfy/samplers.py
Comment on lines +503 to +506
    output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).to(output_device).chunk(batch_chunks)
else:
    output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
results.append(thread_result(output, mult, area, batch_chunks, cond_or_uncond))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, does joining a Python thread guarantee completion of CUDA kernels and cross-device .to() copies launched in that thread, or is explicit stream/device synchronization required?

💡 Result:

No. thread.join() only guarantees the Python thread has finished executing and enqueuing work, not that the CUDA work it launched has completed.

  • In CUDA, kernel launches and many copies are asynchronous w.r.t. the host thread (the CPU thread can return before the GPU finishes). CUDA can also use a per-thread default stream, so “the thread ended” is not a synchronization primitive for the GPU. [2]
  • In PyTorch, torch.cuda.synchronize(device=...) is the explicit mechanism that waits for all kernels in all streams on a CUDA device to complete. [1]
  • For .to() copies: if you use non_blocking=True, PyTorch explicitly performs the transfer without triggering host synchronization (i.e., it can still be in flight after the call returns, and therefore after join()). [3][4]

So if you need a guarantee that kernels and cross-device .to() copies launched in that thread are finished (e.g., before freeing/reusing buffers, consuming results on another device/stream, or accurate timing), you must use explicit CUDA/PyTorch synchronization (e.g., stream.synchronize(), event-based sync, or torch.cuda.synchronize(device)), rather than relying on join(). [1][2]


🏁 Script executed:

# First, let's look at the code around lines 503-506
sed -n '490,520p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 2024


🏁 Script executed:

# Get more context on the function to understand the threading model
sed -n '470,540p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 4231


🏁 Script executed:

# Check if there's any synchronization happening after thread.join()
grep -n "thread.join\|synchronize\|cuda.stream" comfy/samplers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

# Check git status and diff for comfy/samplers.py to see what changed
git diff HEAD comfy/samplers.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check git log for recent changes to this file
git log --oneline -n 10 -- comfy/samplers.py

Repository: Comfy-Org/ComfyUI

Length of output: 118


🏁 Script executed:

# Look at the actual diff around lines 503-506
git diff HEAD -- comfy/samplers.py | grep -A 20 -B 20 "output_device"

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check git status
git status

Repository: Comfy-Org/ComfyUI

Length of output: 126


🏁 Script executed:

# Show the recent commit to see what changed
git show --stat 1d8e379f

Repository: Comfy-Org/ComfyUI

Length of output: 49365


🏁 Script executed:

# Check if samplers.py was modified in that commit
git show 1d8e379f -- comfy/samplers.py | head -150

Repository: Comfy-Org/ComfyUI

Length of output: 5195


🏁 Script executed:

# Verify the exact lines 503-506 match the pattern described
sed -n '500,510p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 937


Add explicit CUDA synchronization before tensor aggregation.

Inside worker threads (lines 503-506), .to(output_device) transfers tensors asynchronously. After thread.join() returns, CUDA kernels may still be in flight. The subsequent aggregation (GPU operations on these tensors) can race with pending device copies unless there is explicit synchronization such as torch.cuda.synchronize(output_device).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 503 - 506, The worker thread is moving
tensors asynchronously with .to(output_device) (via
model_options['model_function_wrapper'] / model_current.apply_model) and then
returning thread_result, which can cause races when the main thread aggregates
GPU tensors; fix by calling torch.cuda.synchronize(output_device) immediately
after the .to(output_device).chunk(...) transfer completes in the worker
(guarded by a check that output_device is a CUDA device) so the copy/kernels
have finished before appending/returning the thread_result.
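
For illustration, a minimal standalone sketch of the pattern the fix describes, using placeholder tensors rather than the actual sampler state:

```python
import threading
import torch

def worker(src: torch.Tensor, out_device: torch.device, results: list):
    # The cross-device .to() is asynchronous; the thread finishing does
    # not mean the copy has landed on out_device.
    moved = src.to(out_device, non_blocking=True)
    if out_device.type == "cuda":
        torch.cuda.synchronize(out_device)  # wait for copy and kernels
    results.append(moved)

if torch.cuda.device_count() >= 2:
    results = []
    x = torch.randn(1024, 1024, device="cuda:1")
    t = threading.Thread(target=worker, args=(x, torch.device("cuda:0"), results))
    t.start()
    t.join()
    # Only now is it safe to aggregate results[0] on cuda:0.
    print(results[0].mean())
```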

Kosinkadink and others added 4 commits March 30, 2026 08:32
…aps to parent patcher

When a multigpu clone ModelPatcher is garbage collected, LoadedModel._switch_parent
switches the weakref to point at the parent (main) ModelPatcher. However, it was not
updating LoadedModel.device, leaving it with the old clone's device (e.g., cuda:1).
On subsequent runs, this stale device was passed to ModelPatcherDynamic.load(), causing
an assertion failure (device_to != self.load_device).

Amp-Thread-ID: https://ampcode.com/threads/T-019d3f5c-28c5-72c9-abed-34681f1b54ba
Co-authored-by: Amp <amp@ampcode.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
comfy/model_management.py (3)

522-526: Consider narrowing the exception handler.

Bare except: catches everything including KeyboardInterrupt and SystemExit. While this is best-effort logging, using except Exception: would be more precise.

🔧 Suggested change
 try:
     for device in get_all_torch_devices(exclude_current=True):
         logging.info("Device: {}".format(get_torch_device_name(device)))
-except:
+except Exception:
     pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 522 - 526, The try/except block
around the device logging is too broad; replace the bare except with a narrower
Exception handler so system-exiting signals aren't swallowed. Update the block
that calls get_all_torch_devices(exclude_current=True) and logging.info("Device:
{}".format(get_torch_device_name(device))) to catch only Exception (i.e., use
except Exception:) and optionally log the exception via logging.debug or
logging.exception to retain best-effort behavior without suppressing
KeyboardInterrupt/SystemExit.

1815-1839: Docstring should use triple quotes.

Line 1816 uses single quotes for the docstring. Per Python convention, docstrings should use triple double-quotes.

📝 Suggested change
 def unload_model_and_clones(model: ModelPatcher, unload_additional_models=True, all_devices=False):
-    'Unload only model and its clones - primarily for multigpu cloning purposes.'
+    """Unload only model and its clones - primarily for multigpu cloning purposes."""
     initial_keep_loaded: list[LoadedModel] = current_loaded_models.copy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 1815 - 1839, The function
unload_model_and_clones uses a single-quoted string for its docstring; replace
the single-quoted docstring with a proper triple-quoted docstring (use triple
double-quotes """...""") at the start of the unload_model_and_clones function
so it follows Python conventions and is recognized by tools that read __doc__
for that function.

1822-1834: Consider using a set for UUID lookups.

The nested loop checking additional_models is O(n×m). If additional_models grows large, converting UUIDs to a set would improve lookup performance.

♻️ Suggested optimization
 def unload_model_and_clones(model: ModelPatcher, unload_additional_models=True, all_devices=False):
-    'Unload only model and its clones - primarily for multigpu cloning purposes.'
+    """Unload only model and its clones - primarily for multigpu cloning purposes."""
     initial_keep_loaded: list[LoadedModel] = current_loaded_models.copy()
     additional_models = []
+    additional_uuids = set()
     if unload_additional_models:
         additional_models = model.get_nested_additional_models()
+        additional_uuids = {m.clone_base_uuid for m in additional_models}
     keep_loaded = []
     for loaded_model in initial_keep_loaded:
         if loaded_model.model is not None:
             if model.clone_base_uuid == loaded_model.model.clone_base_uuid:
                 continue
             # check additional models if they are a match
-            skip = False
-            for add_model in additional_models:
-                if add_model.clone_base_uuid == loaded_model.model.clone_base_uuid:
-                    skip = True
-                    break
-            if skip:
+            if loaded_model.model.clone_base_uuid in additional_uuids:
                 continue
         keep_loaded.append(loaded_model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 1822 - 1834, The nested loop over
initial_keep_loaded and additional_models causes O(n×m) UUID comparisons;
optimize by precomputing a set of clone_base_uuid values from additional_models
and use set membership instead. Specifically, before iterating
initial_keep_loaded, build a set (e.g., additional_uuids = {m.clone_base_uuid
for m in additional_models}) and then inside the loop replace the inner loop
that checks add_model.clone_base_uuid with a single membership test against
additional_uuids; keep references to initial_keep_loaded, loaded_model,
model.clone_base_uuid, additional_models, and keep_loaded to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/model_management.py`:
- Around line 220-231: The device-listing loops create devices with the wrong
type and removal is unsafe: replace torch.device(i) with explicit device strings
(e.g. torch.device(f"cuda:{i}") for CUDA, torch.device(f"xpu:{i}") for XPU when
is_intel_xpu() is true, and torch.device(f"npu:{i}") for NPU when
is_ascend_npu() is true) so the device kind matches get_torch_device(); then
change the exclude_current removal to check membership first (if
get_torch_device() in devices: devices.remove(...)) to avoid ValueError.
Reference: the device construction loops and the use of get_torch_device() /
exclude_current in this block.

---

Nitpick comments:
In `@comfy/model_management.py`:
- Around line 522-526: The try/except block around the device logging is too
broad; replace the bare except with a narrower Exception handler so
system-exiting signals aren't swallowed. Update the block that calls
get_all_torch_devices(exclude_current=True) and logging.info("Device:
{}".format(get_torch_device_name(device))) to catch only Exception (i.e., use
except Exception:) and optionally log the exception via logging.debug or
logging.exception to retain best-effort behavior without suppressing
KeyboardInterrupt/SystemExit.
- Around line 1815-1839: The function unload_model_and_clones uses a
single-quoted string for its docstring; replace the single-quoted docstring with
a proper triple-quoted docstring (use triple double-quotes """...""") at the
start of the unload_model_and_clones function so it follows Python conventions
and is recognized by tools that read __doc__ for that function.
- Around line 1822-1834: The nested loop over initial_keep_loaded and
additional_models causes O(n×m) UUID comparisons; optimize by precomputing a set
of clone_base_uuid values from additional_models and use set membership instead.
Specifically, before iterating initial_keep_loaded, build a set (e.g.,
additional_uuids = {m.clone_base_uuid for m in additional_models}) and then
inside the loop replace the inner loop that checks add_model.clone_base_uuid
with a single membership test against additional_uuids; keep references to
initial_keep_loaded, loaded_model, model.clone_base_uuid, additional_models, and
keep_loaded to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e3f6eb1-aa35-4792-b004-a5900f6db6b4

📥 Commits

Reviewing files that changed from the base of the PR and between 1d8e379 and b418fb1.

📒 Files selected for processing (2)
  • comfy/model_management.py
  • comfy/quant_ops.py
💤 Files with no reviewable changes (1)
  • comfy/quant_ops.py

Comment thread comfy/model_management.py
Comment on lines +220 to +231
for i in range(torch.cuda.device_count()):
devices.append(torch.device(i))
elif is_intel_xpu():
for i in range(torch.xpu.device_count()):
devices.append(torch.device(i))
elif is_ascend_npu():
for i in range(torch.npu.device_count()):
devices.append(torch.device(i))
else:
devices.append(get_torch_device())
if exclude_current:
devices.remove(get_torch_device())

⚠️ Potential issue | 🟠 Major

XPU and NPU device creation uses incorrect device type.

Lines 224 and 227 use torch.device(i) which defaults to CUDA, not the actual accelerator type. On XPU systems, this creates cuda:0 instead of xpu:0, causing a mismatch with get_torch_device() (which correctly returns xpu:X). This will cause devices.remove() at line 231 to fail with ValueError when exclude_current=True.

Additionally, line 231 should guard against the device not being in the list to handle edge cases gracefully.
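
A quick REPL check of the underlying behavior (assuming a standard CUDA-enabled PyTorch build):

```python
import torch

# A bare int always resolves to a CUDA device, whatever backend is present:
assert torch.device(0) == torch.device("cuda", 0)
assert torch.device(0) != torch.device("xpu", 0)
# So on an XPU system the list holds cuda:N entries while
# get_torch_device() returns xpu:N, and devices.remove() raises ValueError.
```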

🐛 Proposed fix for device type and remove() safety
 def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
         if is_nvidia():
             for i in range(torch.cuda.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("cuda", i))
         elif is_intel_xpu():
             for i in range(torch.xpu.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("xpu", i))
         elif is_ascend_npu():
             for i in range(torch.npu.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("npu", i))
     else:
         devices.append(get_torch_device())
     if exclude_current:
-        devices.remove(get_torch_device())
+        current = get_torch_device()
+        if current in devices:
+            devices.remove(current)
     return devices
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 220 - 231, The device-listing loops
create devices with the wrong type and removal is unsafe: replace
torch.device(i) with explicit device strings (e.g. torch.device(f"cuda:{i}") for
CUDA, torch.device(f"xpu:{i}") for XPU when is_intel_xpu() is true, and
torch.device(f"npu:{i}") for NPU when is_ascend_npu() is true) so the device
kind matches get_torch_device(); then change the exclude_current removal to
check membership first (if get_torch_device() in devices: devices.remove(...))
to avoid ValueError. Reference: the device construction loops and the use of
get_torch_device() / exclude_current in this block.

Kosinkadink and others added 4 commits April 8, 2026 05:08
Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a
persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device()
once at startup, preserving compiled kernel caches across diffusion steps.

- Add MultiGPUThreadPool class in comfy/multigpu.py
- Create pool in CFGGuider.outer_sample(), shut down in finally block
- Main thread handles its own device batch directly for zero overhead
- Falls back to sequential execution if no pool is available
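
For reference, a minimal sketch of what such a persistent per-device pool can look like; the class name matches the commit, but the body here is illustrative, not the actual comfy/multigpu.py implementation:

```python
import queue
import threading

import torch

class MultiGPUThreadPool:
    # Sketch only: one long-lived worker per device, each pinning its
    # CUDA device once at startup so compiled-kernel caches survive
    # across diffusion steps.
    def __init__(self, devices):
        self._queues = {d: queue.Queue() for d in devices}
        self._threads = [
            threading.Thread(target=self._worker, args=(d,), daemon=True)
            for d in devices
        ]
        for t in self._threads:
            t.start()

    def _worker(self, device):
        if device.type == "cuda":
            torch.cuda.set_device(device)  # once per thread lifetime
        q = self._queues[device]
        while True:
            job = q.get()
            if job is None:  # shutdown sentinel
                return
            fn, args, done = job
            done.put(fn(*args))

    def submit(self, device, fn, *args):
        done = queue.Queue(maxsize=1)
        self._queues[device].put((fn, args, done))
        return done  # caller blocks on done.get() for the result

    def shutdown(self):
        for q in self._queues.values():
            q.put(None)
        for t in self._threads:
            t.join()
```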
Benchmarked hybrid (main thread + pool) vs all-pool on 2x RTX 4090
with SD1.5 and NetaYume models. No meaningful performance difference
(within noise). All-pool is simpler: eliminates the main_device
special case, main_batch_tuple deferred execution, and the 3-way
branch in the dispatch loop.
* main: init all visible cuda devices in aimdo

* mp: call vbars_analyze for the GPU in question

* requirements: bump aimdo to pre-release version
@socket-security

socket-security Bot commented Apr 16, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | comfy-aimdo@0.2.12 ⏵ 0.0.213 | 100 | 100 | 100 | 100 | 70 (-30) |

View full report

@Suppe2000

Screenshot 2026-04-21 190648

What am I doing wrong? ComfyUI isn't starting.

@Kosinkadink
Member Author

You'll need to install the requirements (pip install -r requirements.txt)

@Suppe2000

You'll need to install the requirements (pip install -r requirements.txt)

All requirements are satisfied, so I don't understand what is going on. The master branch works fine; only this branch doesn't.

@Kosinkadink
Member Author

Can you do a pip freeze and show me the printout? This branch currently requires a specific preview version of aimdo, which is included in the requirements here.

@Suppe2000

Suppe2000 commented Apr 21, 2026

Can you do a pip freeze and show me the printout? This branch currently requires a specific preview version of aimdo, which is included in the requirements here.

This one?

Requirement already satisfied: comfy-aimdo==0.0.213 in C:\Users\...\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages (from -r requirements.txt (line 26)) (0.0.213)

Does anyone have an idea how to solve the issue?

rattus128 and others added 3 commits April 23, 2026 19:09
Cut pre-release 0.0.214 off aimdo master to pickup async mem accounting
fix.

* Fix Hunyuan 3D 2.1 multi-GPU worksplit: use cond_or_uncond instead of hardcoded chunk(2)

Amp-Thread-ID: https://ampcode.com/threads/T-019da964-2cc8-77f9-9aae-23f65da233db
Co-authored-by: Amp <amp@ampcode.com>

* Add GPU device selection to all loader nodes

- Add get_gpu_device_options() and resolve_gpu_device_option() helpers
  in model_management.py for vendor-agnostic GPU device selection
- Add device widget to CheckpointLoaderSimple, UNETLoader, VAELoader
- Expand device options in CLIPLoader, DualCLIPLoader, LTXAVTextEncoderLoader
  from [default, cpu] to include gpu:0, gpu:1, etc. on multi-GPU systems
- Wire load_diffusion_model_state_dict and load_state_dict_guess_config
  to respect model_options['load_device']
- Graceful fallback: unrecognized devices (e.g. gpu:1 on single-GPU)
  silently fall back to default
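
A sketch of the fallback logic described above; the function name comes from the commit, but the signature and 'gpu:N' parsing here are assumptions:

```python
from typing import Optional

import torch

def resolve_gpu_device_option(value: str) -> Optional[torch.device]:
    # 'default' -> None (caller picks), 'cpu' -> CPU, 'gpu:N' -> Nth
    # visible accelerator; unknown or out-of-range values fall back to
    # None so workflows stay portable across machines.
    if value == "cpu":
        return torch.device("cpu")
    if value.startswith("gpu:"):
        try:
            idx = int(value.split(":", 1)[1])
        except ValueError:
            return None
        if 0 <= idx < torch.cuda.device_count():
            return torch.device("cuda", idx)
    return None  # 'default', gpu:-1, gpu:9 on a 1-GPU box, etc.
```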

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Add VALIDATE_INPUTS to skip device combo validation for workflow portability

When a workflow saved on a 2-GPU machine (with device=gpu:1) is loaded
on a 1-GPU machine, the combo validation would reject the unknown value.
VALIDATE_INPUTS with the device parameter bypasses combo validation for
that input only, allowing resolve_gpu_device_option to handle the
graceful fallback at runtime.
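
In ComfyUI, naming an input in a node's VALIDATE_INPUTS signature makes the executor skip built-in validation for that input. A minimal sketch of the pattern the commit describes (node name hypothetical):

```python
class DeviceAwareLoader:  # hypothetical node, for illustration only
    @classmethod
    def VALIDATE_INPUTS(cls, device):
        # Listing `device` here bypasses combo validation for this input
        # alone; resolve_gpu_device_option() then handles unknown values
        # (e.g. gpu:1 on a single-GPU machine) at runtime.
        return True
```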

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Set CUDA device context in outer_sample to match model load_device

Custom CUDA kernels (comfy_kitchen fp8 quantization) use
torch.cuda.current_device() for DLPack tensor export. When a model is
loaded on a non-default GPU (e.g. cuda:1), the CUDA context must match
or the kernel fails with 'Can't export tensors on a different CUDA
device index'. Save and restore the previous device around sampling.

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix code review bugs: negative index guard, CPU offload_device, checkpoint te_model_options

- resolve_gpu_device_option: reject negative indices (gpu:-1)
- UNETLoader: set offload_device when cpu is selected
- CheckpointLoaderSimple: pass te_model_options for CLIP device,
  set offload_device for cpu, pass load_device to VAE
- load_diffusion_model_state_dict: respect offload_device from model_options
- load_state_dict_guess_config: respect offload_device, pass load_device to VAE

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix CUDA device context for CLIP encoding and VAE encode/decode

Add torch.cuda.set_device() calls to match model's load device in:
- CLIP.encode_from_tokens: fixes 'Can't export tensors on a different
  CUDA device index' when CLIP is loaded on a non-default GPU
- CLIP.encode_from_tokens_scheduled: same fix for the hooks code path
- CLIP.generate: same fix for text generation
- VAE.decode: fixes VAE decoding on non-default GPU
- VAE.encode: fixes VAE encoding on non-default GPU

Same pattern as the existing outer_sample fix in samplers.py - saves
and restores previous CUDA device in a try/finally block.

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Extract cuda_device_context manager, fix tiled VAE methods

Add model_management.cuda_device_context() — a context manager that
saves/restores torch.cuda.current_device when operating on a non-default
GPU. Replaces 6 copies of the manual save/set/restore boilerplate.
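
A sketch of what such a context manager plausibly looks like (the real helper lives in comfy.model_management; this shows the pattern, not the verbatim code):

```python
import contextlib

import torch

@contextlib.contextmanager
def cuda_device_context(device):
    # No-op for non-CUDA devices; otherwise pin `device` for the block
    # and restore the previous current device afterwards.
    if getattr(device, "type", None) != "cuda":
        yield
        return
    prev = torch.cuda.current_device()
    try:
        torch.cuda.set_device(device)
        yield
    finally:
        torch.cuda.set_device(prev)
```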

Refactored call sites:
- CLIP.encode_from_tokens
- CLIP.encode_from_tokens_scheduled (hooks path)
- CLIP.generate
- VAE.decode
- VAE.encode
- samplers.outer_sample

Bug fixes (newly wrapped):
- VAE.decode_tiled: was missing device context entirely, would fail
  on non-default GPU when called from 'VAE Decode (Tiled)' node
- VAE.encode_tiled: same issue for 'VAE Encode (Tiled)' node

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Restore CheckpointLoaderSimple, add CheckpointLoaderDevice

Revert CheckpointLoaderSimple to its original form (no device input)
so it remains the simple default loader.

Add new CheckpointLoaderDevice node (advanced/loaders) with separate
model_device, clip_device, and vae_device inputs for per-component
GPU placement in multi-GPU setups.

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

---------

Co-authored-by: Amp <amp@ampcode.com>
* fix: pin SQLAlchemy>=2.0 in requirements.txt (fixes #13036) (#13316)

* Refactor io to IO in nodes_ace.py (#13485)

* Bump comfyui-frontend-package to 1.42.12 (#13489)

* Make the ltx audio vae more native. (#13486)

* feat(api-nodes): add automatic downscaling of videos for ByteDance 2 nodes (#13465)

* Support standalone LTXV audio VAEs (#13499)

* [Partner Nodes]  added 4K resolution for Veo models; added Veo 3 Lite model (#13330)

* feat(api nodes): added 4K resolution for Veo models; added Veo 3 Lite model

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* increase poll_interval from 5 to 9

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>

* Bump comfyui-frontend-package to 1.42.14 (#13493)

* Add gpt-image-2 as version option (#13501)

* Allow logging in comfy app files. (#13505)

* chore: update workflow templates to v0.9.59 (#13507)

* fix(veo): reject 4K resolution for veo-3.0 models in Veo3VideoGenerationNode (#13504)

The tooltip on the resolution input states that 4K is not available for
veo-3.1-lite or veo-3.0 models, but the execute guard only rejected the
lite combination. Selecting 4K with veo-3.0-generate-001 or
veo-3.0-fast-generate-001 would fall through and hit the upstream API
with an invalid request.

Broaden the guard to match the documented behavior and update the error
message accordingly.

Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>

* feat: RIFE and FILM frame interpolation model support (CORE-29) (#13258)

* initial RIFE support

* Also support FILM

* Better RAM usage, reduce FILM VRAM peak

* Add model folder placeholder

* Fix oom fallback frame loss

* Remove torch.compile for now

* Rename model input

* Shorter input type name

---------

* fix: use Parameter assignment for Stable_Zero123 cc_projection weights (fixes #13492) (#13518)

On Windows with aimdo enabled, disable_weight_init.Linear uses lazy
initialization that sets weight and bias to None to avoid unnecessary
memory allocation. This caused a crash when copy_() was called on the
None weight attribute in Stable_Zero123.__init__.

Replace copy_() with direct torch.nn.Parameter assignment, which works
correctly on both Windows (aimdo enabled) and other platforms.
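
A minimal reproduction of the fix, with a plain nn.Linear standing in for the cc_projection layer:

```python
import torch

lin = torch.nn.Linear(4, 4)
lin.weight = None  # simulate lazy init leaving the weight unallocated
w = torch.randn(4, 4)

# lin.weight.copy_(w)  # AttributeError: 'NoneType' has no attribute 'copy_'
lin.weight = torch.nn.Parameter(w)  # direct assignment works either way
```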

* Derive InterruptProcessingException from BaseException (#13523)

* bump manager version to 4.2.1 (#13516)

* ModelPatcherDynamic: force cast stray weights on comfy layers (#13487)

The mixed_precision ops can have input_scale parameters that are used
in tensor math but aren't a weight or bias, so they don't get proper VRAM
management. Treat these as force-castable parameters, like the non-comfy
weights; stray random params are already buffers.

* Update logging level for invalid version format (#13526)

* [Partner Nodes] add SD2 real human support (#13509)

* feat(api-nodes): add SD2 real human support

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* fix: add validation before uploading Assets

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* Add asset_id and group_id displaying on the node

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* extend poll_op to use instead of custom async cycle

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* added the polling for the "Active" status after asset creation

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* updated tooltip for group_id

* allow usage of real human in the ByteDance2FirstLastFrame node

* add reference count limits

* corrected price in status when input assets contain video

Signed-off-by: bigcat88 <bigcat88@icloud.com>

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* feat: SAM (segment anything) 3.1 support (CORE-34) (#13408)

* [Partner Nodes] GPTImage: fix price badges, add new resolutions (#13519)

* fix(api-nodes): fixed price badges, add new resolutions

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* properly calculate the total run cost when "n > 1"

Signed-off-by: bigcat88 <bigcat88@icloud.com>

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* chore: update workflow templates to v0.9.61 (#13533)

* chore: update embedded docs to v0.4.4 (#13535)

* add 4K resolution to Kling nodes (#13536)

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* Fix LTXV Reference Audio node (#13531)

* comfy-aimdo 0.2.14: Hotfix async allocator estimations (#13534)

This was over-estimating the VRAM used by the async allocator when lots
of small tensors were in play.

Also change the versioning scheme to == so we can roll forward aimdo without
worrying about stable regressions downstream in comfyUI core.

* Disable sageattention for SAM3 (#13529)

Causes Nans

* execution: Add anti-cycle validation (#13169)

Currently, if the graph contains a cycle, validation just recurses infinitely,
hits a catch-all, and then throws a generic error against the output node
that seeded the validation. Instead, fail the offending cycling node
chain and handle it as an error in its own right.

Co-authored-by: guill <jacob.e.segal@gmail.com>

* chore: update workflow templates to v0.9.62 (#13539)

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>
Co-authored-by: Octopus <liyuan851277048@icloud.com>
Co-authored-by: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
Co-authored-by: Comfy Org PR Bot <snomiao+comfy-pr@gmail.com>
Co-authored-by: Alexander Piskun <13381981+bigcat88@users.noreply.github.com>
Co-authored-by: Jukka Seppänen <40791699+kijai@users.noreply.github.com>
Co-authored-by: AustinMroz <austin@comfy.org>
Co-authored-by: Daxiong (Lin) <contact@comfyui-wiki.com>
Co-authored-by: Matt Miller <matt@miller-media.com>
Co-authored-by: blepping <157360029+blepping@users.noreply.github.com>
Co-authored-by: Dr.Lt.Data <128333288+ltdrdata@users.noreply.github.com>
Co-authored-by: rattus <46076784+rattus128@users.noreply.github.com>
Co-authored-by: guill <jacob.e.segal@gmail.com>