
MultiGPU Work Units For Accelerated Sampling #7063

Open
Kosinkadink wants to merge 131 commits into master from worksplit-multigpu

Conversation

@Kosinkadink
Member

@Kosinkadink Kosinkadink commented Mar 4, 2025

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditioning is treated as the work unit. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g., an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node effectively does nothing, so workflows making use of the node stay compatible between single- and multi-GPU setups.

The feature works best when work splitting is symmetrical (the GPUs are identical or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load-balancing code about the relative performance of each device.

Nvidia (CUDA): Tested, works ✅
AMD (ROCm): Untested, will validate soon
AMD (DirectML): Untested
Intel (Arc XPU): Tested, works on Linux but not on Windows ⚠️

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, keeping any additional_models consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model; if one is found, it is (soft) cloned and made to match the main ModelPatcher.
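
A toy sketch of this clone bookkeeping (illustrative stand-ins only; the real ModelPatcher and additional_models API in comfy.model_patcher is considerably more involved):

import copy

class ToyModelPatcher:
    """Illustrative stand-in for comfy.model_patcher.ModelPatcher."""
    def __init__(self, model, load_device):
        self.model = model
        self.load_device = load_device
        self.additional_models = {}  # key -> list of patchers

    def clone(self):
        # Soft clone: shares the underlying model weights.
        c = ToyModelPatcher(self.model, self.load_device)
        c.additional_models = dict(self.additional_models)
        return c

    def deepclone_multigpu(self, load_device):
        # Deep clone: an independent copy of the model for another device.
        return ToyModelPatcher(copy.deepcopy(self.model), load_device)

def create_multigpu_deepclones(patcher, extra_devices):
    # Store one deep clone per extra device under the "multigpu" key,
    # mirroring how the PR registers clones on additional_models.
    patcher.additional_models["multigpu"] = [
        patcher.deepclone_multigpu(dev) for dev in extra_devices]
    return patcher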

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu, avoiding the potential performance regression that refactoring the single-GPU code could introduce. In the future, this can be revisited to reuse the same code while carefully comparing performance across various models. No processes are created, only Python threads; while the GIL does limit CPU parallelism, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. This vastly improves compatibility with existing code.
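
A minimal sketch of the threading model, assuming one already-loaded callable model per device and one chunk per device (the real _calc_cond_batch_multigpu batches conds per device and merges outputs differently):

import threading
import torch

def _run_on_device(model, x, device, results, idx):
    # One thread per GPU; CUDA kernels release the GIL, so plain
    # Python threads suffice when the GPU is the bottleneck.
    with torch.no_grad():
        results[idx] = model(x.to(device)).to("cpu")  # gather on a common device

def eval_across_devices(models_by_device, chunks):
    # Assumes len(models_by_device) == len(chunks).
    results = [None] * len(chunks)
    threads = [
        threading.Thread(target=_run_on_device, args=(m, x, d, results, i))
        for i, ((d, m), x) in enumerate(zip(models_by_device.items(), chunks))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torch.cat(results)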

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.
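
An illustrative sketch of the intended semantics (toy code; the real function operates on comfy.model_management's LoadedModel list, and is_clone/unload here stand in for the real checks):

def unload_model_and_clones(model, loaded_models):
    # Unload the given patcher plus anything cloned from it, while
    # leaving unrelated loaded models in place so they stay warm.
    for lm in [m for m in loaded_models if m is model or m.is_clone(model)]:
        lm.unload()  # hypothetical; stands in for the real unload path
        loaded_models.remove(lm)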

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage. This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to control their order (the first id is treated as device:0, the second as device:1, etc.).
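
A plausible sketch of the string-based flow (the environment-variable step is how device visibility is typically restricted; the exact wiring in ComfyUI may differ):

import argparse
import os

parser = argparse.ArgumentParser()
# A string keeps "--cuda-device 0" working while also allowing "--cuda-device 1,0".
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID",
                    help="Comma-separated cuda device ids; listed order maps to device:0, device:1, ...")
args = parser.parse_args()

if args.cuda_device is not None:
    # CUDA enumerates devices in the order given, so "1,0" makes
    # physical GPU 1 appear as torch.device("cuda:0").
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device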

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
@coderabbitai

coderabbitai Bot commented Mar 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The PR adds multi‑GPU support across the codebase: new comfy.multigpu (GPUOptions, GPUOptionsGroup, deep‑clone creation, and load balancing); ModelPatcher gains multigpu metadata, deepclone_multigpu, match_multigpu_clones, and multigpu‑aware hook/keyframe handling; ControlBase/ControlNet/T2IAdapter gain per‑device clone management and ControlIsolation; samplers and sampler_helpers add per‑device batching and threaded evaluation; model_management enumerates and frees memory across all torch devices; CLI --cuda-device now accepts strings; new Comfy nodes expose multigpu setup; minor whitespace change in quant_ops.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage - ⚠️ Warning: Docstring coverage is 16.09%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check - ✅ Passed: The title 'MultiGPU Work Units For Accelerated Sampling' directly and clearly describes the main feature being added: multi-GPU acceleration through work unit splitting.
  • Description check - ✅ Passed: The pull request description comprehensively documents the MultiGPU Work Units feature, implementation details, performance metrics, and known issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_multigpu.py`:
- Around line 66-69: The GPUOptionsGroup.clone() return value is being discarded
in create_gpu_options; capture and use the cloned object so we don't mutate the
caller-supplied gpu_options. Change the behavior in create_gpu_options to assign
the result of GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

In `@comfy/cli_args.py`:
- Line 52: The --cuda-device argument currently only accepts a single token;
update the parser.add_argument call for "--cuda-device" to accept multiple
space-separated device IDs by adding nargs='+' (and set type=int if you want
integer IDs) so that invocations like "--cuda-device 0 1" parse correctly;
alternatively, if you prefer comma-separated input, change the help text to
explicitly state the required format instead of implying plural support.

In `@comfy/controlnet.py`:
- Around line 322-328: The multigpu clone path in deepclone_multigpu currently
builds c = self.copy() which does not carry the previous_controlnet chain,
causing stacked ControlNets/T2IAdapters to be lost on secondary GPUs; update
deepclone_multigpu to copy previous_controlnet (and any linked
.previous_controlnet chain) from self to c after c = self.copy() so that the
full chain is preserved, then continue deep-copying control_model and wrapping
it as before (ensure multigpu_clones[load_device] assignment remains unchanged);
apply the same preservation of previous_controlnet chaining to the similar clone
code paths that use copy_to()/get_instance_for_device() so all per-device clones
keep the full previous_controlnet chain.

In `@comfy/model_management.py`:
- Around line 214-231: The function get_all_torch_devices currently only handles
NVIDIA/Intel/Ascend and can return an empty list (breaking exclude_current and
unload_all_models); update it to (1) add detection for other common backends
(e.g., ROCm/DirectML/MLU/MPS) or at minimum attempt generic torch backend checks
such as torch.cuda.device_count() and torch.backends.mps.is_available() and
append appropriate torch.device entries, (2) if after all backend checks devices
is still empty, append get_torch_device() as a safe fallback so callers always
get at least the current device, and (3) make the exclude_current branch robust
by checking membership before calling devices.remove(get_torch_device()); refer
to get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

In `@comfy/model_patcher.py`:
- Around line 1315-1321: The ON_PREPARE_STATE callbacks are being invoked with
four positional args in prepare_state, breaking backward-compatibility for
callbacks that expect three; update prepare_state to detect each callback's
accepted arity (e.g., via inspect.signature or callable.__code__.co_argcount)
and call either callback(self, timestep, model_options, ignore_multigpu) if it
accepts 4 args or callback(self, timestep, model_options) if it only accepts 3
(or attempt the 4-arg call and fall back to 3-arg on TypeError), and apply the
same arity-gated invocation when recursing into multigpu clones; reference
prepare_state and CallbacksMP.ON_PREPARE_STATE to locate where to change the
callsite.

In `@comfy/multigpu.py`:
- Around line 60-112: create_multigpu_deepclones clones existing "multigpu"
additional models but never removes ones that exceed the new max_gpus; to fix,
after computing limit_extra_devices (the allowed device list) retrieve
model.get_additional_models_with_key("multigpu"), filter out any clone whose
load_device is not in ([model.load_device] + limit_extra_devices) (use each
ModelPatcher.load_device to decide), then call
model.set_additional_models("multigpu", filtered_list) before
match_multigpu_clones()/gpu_options.register; ensure reuse_loaded logic still
can find matching clones and that is_multigpu_base_clone flags remain correct
for retained clones.

In `@comfy/quant_ops.py`:
- Line 23: The unconditional call ck.registry.disable("cuda") in
comfy/quant_ops.py should be removed and only invoked when the unsupported
multigpu+cuda combination is actually active; locate the
ck.registry.disable("cuda") invocation and wrap it with a guard that checks the
real multigpu/backend state (for example an existing multigpu flag or function
like is_multigpu_enabled(), a config/ENV check, or the code path that handles
multigpu setup) so that CUDA is only disabled when multigpu is enabled and the
specific backend combination is unsupported, otherwise leave CUDA enabled for
normal single-GPU runs.

In `@comfy/sampler_helpers.py`:
- Line 200: Add the missing BaseModel import used in the type annotation for
real_model (the line "real_model: BaseModel = model.model") by adding "from
typing import TYPE_CHECKING" already present and then inside the existing
TYPE_CHECKING block import BaseModel from its module (e.g., "from <module>
import BaseModel") so the annotation is defined at type-check time;
alternatively remove the BaseModel annotation if you prefer not to add the
import.

In `@comfy/samplers.py`:
- Around line 391-397: The multigpu scheduler currently ignores multigpu_options
and uses integer floor division (//) inside math.ceil, producing coarse,
incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.
- Around line 847-850: The code calls x['control'].pre_run(model, ...) for the
base control and then calls device_cnet.pre_run(model, ...) for each control
clone, incorrectly passing the base model to per-device controls; update the
loop to pass the matching per-device model clone instead. Specifically, when
iterating x['control'].multigpu_clones (the device_cnet clones), look up the
corresponding model clone (e.g., from model.multigpu_clones using the same
keys/ids) and call device_cnet.pre_run(model_clone,
percent_to_timestep_function) so each control clone receives its matching model
clone; keep the initial x['control'].pre_run(model, ...) for the base control.

In `@comfy/sd.py`:
- Line 1557: The assignment to out[0].cached_patcher_init can raise when out[0]
is None (e.g. when called from load_checkpoint_guess_config_clip_only()), so
guard it: check that out[0] is not None before assigning to
out[0].cached_patcher_init and, if the CLIP patcher is created separately for
checkpoint-backed models, set its own cached_patcher_init instead (or attach the
init tuple to the patcher instance). Locate the assignment line and ensure both
code paths (when out[0] exists and when only a CLIP patcher exists) receive the
same cached_patcher_init tuple so no NullReference occurs and checkpoint-backed
CLIP patchers keep their cached_patcher_init.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acbb973a-f861-46f7-805a-a50660ad4671

📥 Commits

Reviewing files that changed from the base of the PR and between b941913 and f410d28.

📒 Files selected for processing (12)
  • comfy/cli_args.py
  • comfy/controlnet.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/multigpu.py
  • comfy/patcher_extension.py
  • comfy/quant_ops.py
  • comfy/sampler_helpers.py
  • comfy/samplers.py
  • comfy/sd.py
  • comfy_extras/nodes_multigpu.py
  • nodes.py

Comment thread comfy_extras/nodes_multigpu.py Outdated
Comment on lines +66 to +69
def create_gpu_options(self, device_index: int, relative_speed: float, gpu_options: comfy.multigpu.GPUOptionsGroup=None):
    if not gpu_options:
        gpu_options = comfy.multigpu.GPUOptionsGroup()
    gpu_options.clone()


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^multigpu\.py$' comfy)
rg -n -C4 'class GPUOptionsGroup|def clone\(' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 793


🏁 Script executed:

#!/bin/bash
set -e
FILE=$(fd '^nodes_multigpu\.py$' comfy_extras)
sed -n '66,80p' "$FILE"

Repository: Comfy-Org/ComfyUI

Length of output: 529


Capture the result of gpu_options.clone() on line 69.

The cloned object is being discarded. Since GPUOptionsGroup.clone() returns a new object, the current code ignores it and then mutates the caller-supplied gpu_options directly. When multiple MultiGPU Options nodes share the same GPU options group, this causes unintended state sharing across the node chain.

Change line 69 from:

gpu_options.clone()

to:

gpu_options = gpu_options.clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_multigpu.py` around lines 66 - 69, The
GPUOptionsGroup.clone() return value is being discarded in create_gpu_options;
capture and use the cloned object so we don't mutate the caller-supplied
gpu_options. Change the behavior in create_gpu_options to assign the result of
GPUOptionsGroup.clone() back to gpu_options (i.e., gpu_options =
gpu_options.clone()) and then continue using that local gpu_options, ensuring
each node gets its own cloned GPUOptionsGroup rather than sharing state.

Comment thread comfy/cli_args.py
parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID", help="Set the ids of cuda devices this instance will use. All other devices will not be visible.")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None)

for argv in (
    ["--cuda-device", "0,1"],
    ["--cuda-device", "0", "1"],
):
    try:
        ns = parser.parse_args(argv)
        print(argv, "->", ns.cuda_device)
    except SystemExit as e:
        print(argv, "-> parse failed with exit", e.code)
PY

Repository: Comfy-Org/ComfyUI

Length of output: 243


🏁 Script executed:

rg "cuda_device" -A 3 -B 3 --max-count=20

Repository: Comfy-Org/ComfyUI

Length of output: 702


🏁 Script executed:

rg "(cuda-device|cuda_device)" -l

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

git log --oneline --all -n 1

Repository: Comfy-Org/ComfyUI

Length of output: 113


🏁 Script executed:

git diff HEAD~1 HEAD -- comfy/cli_args.py 2>/dev/null || git show HEAD:comfy/cli_args.py 2>/dev/null | grep -A 5 -B 5 "cuda-device"

Repository: Comfy-Org/ComfyUI

Length of output: 1367


--cuda-device 0 1 still won't parse.

Changing the type to str only changes the value type, not the option arity. argparse still accepts one token here, so --cuda-device 0 1 fails with "unrecognized arguments: 1". The help text mentions "ids" (plural), implying multi-device support, but the current implementation requires comma-separated format: --cuda-device 0,1. Either add nargs='+' to accept space-separated device IDs or clarify the help text to document the required comma-separated input format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/cli_args.py` at line 52, The --cuda-device argument currently only
accepts a single token; update the parser.add_argument call for "--cuda-device"
to accept multiple space-separated device IDs by adding nargs='+' (and set
type=int if you want integer IDs) so that invocations like "--cuda-device 0 1"
parse correctly; alternatively, if you prefer comma-separated input, change the
help text to explicitly state the required format instead of implying plural
support.

Comment thread comfy/controlnet.py
Comment on lines +322 to +328
def deepclone_multigpu(self, load_device, autoregister=False):
    c = self.copy()
    c.control_model = copy.deepcopy(c.control_model)
    c.control_model_wrapped = comfy.model_patcher.ModelPatcher(c.control_model, load_device=load_device, offload_device=comfy.model_management.unet_offload_device())
    if autoregister:
        self.multigpu_clones[load_device] = c
    return c


⚠️ Potential issue | 🟠 Major

Preserve the previous_controlnet chain in multigpu clones.

These new clone paths build c from copy(), but copy_to() does not carry previous_controlnet. Once get_instance_for_device() returns the per-device clone, stacked ControlNets/T2IAdapters on earlier links are silently dropped on secondary GPUs.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

Also applies to: 952-958
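
A minimal sketch of the fix being asked for (assuming, per the reviewer, that copy() leaves previous_controlnet unset; whether the previous link should itself be deep-cloned per device is a design choice):

def deepclone_multigpu(self, load_device, autoregister=False):
    c = self.copy()
    # Re-attach the chain so stacked ControlNets/T2IAdapters survive
    # on secondary devices.
    if self.previous_controlnet is not None:
        c.previous_controlnet = self.previous_controlnet.deepclone_multigpu(load_device)
    c.control_model = copy.deepcopy(c.control_model)
    c.control_model_wrapped = comfy.model_patcher.ModelPatcher(
        c.control_model, load_device=load_device,
        offload_device=comfy.model_management.unet_offload_device())
    if autoregister:
        self.multigpu_clones[load_device] = c
    return c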

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/controlnet.py` around lines 322 - 328, The multigpu clone path in
deepclone_multigpu currently builds c = self.copy() which does not carry the
previous_controlnet chain, causing stacked ControlNets/T2IAdapters to be lost on
secondary GPUs; update deepclone_multigpu to copy previous_controlnet (and any
linked .previous_controlnet chain) from self to c after c = self.copy() so that
the full chain is preserved, then continue deep-copying control_model and
wrapping it as before (ensure multigpu_clones[load_device] assignment remains
unchanged); apply the same preservation of previous_controlnet chaining to the
similar clone code paths that use copy_to()/get_instance_for_device() so all
per-device clones keep the full previous_controlnet chain.

Comment thread comfy/model_management.py
Comment on lines +214 to +231
def get_all_torch_devices(exclude_current=False):
    global cpu_state
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            for i in range(torch.cuda.device_count()):
                devices.append(torch.device(i))
        elif is_intel_xpu():
            for i in range(torch.xpu.device_count()):
                devices.append(torch.device(i))
        elif is_ascend_npu():
            for i in range(torch.npu.device_count()):
                devices.append(torch.device(i))
    else:
        devices.append(get_torch_device())
    if exclude_current:
        devices.remove(get_torch_device())
    return devices


⚠️ Potential issue | 🟠 Major

Handle non-CUDA backends in get_all_torch_devices.

This helper only enumerates CUDA/XPU/NPU devices, so ROCm/DirectML/MLU-style paths leave devices empty. With exclude_current=True that turns into a remove() failure, and unload_all_models() also stops freeing anything on those backends because it now routes through this helper.

As per coding guidelines, comfy/** changes should focus on backward compatibility and memory management/GPU resource handling.
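
A hedged sketch of the defensive shape being requested (the backend probes shown are illustrative; the helper names come from comfy.model_management):

def get_all_torch_devices(exclude_current=False):
    devices = []
    if cpu_state == CPUState.GPU:
        if is_nvidia():
            devices = [torch.device(i) for i in range(torch.cuda.device_count())]
        elif is_intel_xpu():
            devices = [torch.device(i) for i in range(torch.xpu.device_count())]
        elif is_ascend_npu():
            devices = [torch.device(i) for i in range(torch.npu.device_count())]
    if not devices:
        # ROCm/DirectML/MPS-style paths (and CPU mode) fall back to the
        # current device instead of returning an empty list.
        devices.append(get_torch_device())
    if exclude_current and get_torch_device() in devices:
        devices.remove(get_torch_device())
    return devices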

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 214 - 231, The function
get_all_torch_devices currently only handles NVIDIA/Intel/Ascend and can return
an empty list (breaking exclude_current and unload_all_models); update it to (1)
add detection for other common backends (e.g., ROCm/DirectML/MLU/MPS) or at
minimum attempt generic torch backend checks such as torch.cuda.device_count()
and torch.backends.mps.is_available() and append appropriate torch.device
entries, (2) if after all backend checks devices is still empty, append
get_torch_device() as a safe fallback so callers always get at least the current
device, and (3) make the exclude_current branch robust by checking membership
before calling devices.remove(get_torch_device()); refer to
get_all_torch_devices, cpu_state, CPUState.GPU, is_nvidia, is_intel_xpu,
is_ascend_npu, and get_torch_device when implementing these fixes.

Comment thread comfy/model_patcher.py
Comment on lines +1315 to +1321
def prepare_state(self, timestep, model_options, ignore_multigpu=False):
    for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
-       callback(self, timestep)
+       callback(self, timestep, model_options, ignore_multigpu)
    if not ignore_multigpu and "multigpu_clones" in model_options:
        for p in model_options["multigpu_clones"].values():
            p: ModelPatcher
            p.prepare_state(timestep, model_options, ignore_multigpu=True)


⚠️ Potential issue | 🟠 Major

Keep ON_PREPARE_STATE callback arity backward-compatible.

This now passes ignore_multigpu as a fourth positional argument to every existing ON_PREPARE_STATE callback. Any custom node still implementing the old (patcher, timestep, model_options) signature will fail with TypeError during sampling unless this is gated behind an arity check or opt-in keyword path.

As per coding guidelines, comfy/** changes should focus on backward compatibility.
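
One hedged way to gate the call on arity (illustrative; attempting the 4-arg call and falling back on TypeError would also work):

import inspect

for callback in self.get_all_callbacks(CallbacksMP.ON_PREPARE_STATE):
    params = inspect.signature(callback).parameters
    # Legacy callbacks take (patcher, timestep, model_options); newer ones
    # also accept ignore_multigpu. *args counts as accepting everything.
    accepts_four = len(params) >= 4 or any(
        p.kind is inspect.Parameter.VAR_POSITIONAL for p in params.values())
    if accepts_four:
        callback(self, timestep, model_options, ignore_multigpu)
    else:
        callback(self, timestep, model_options)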

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_patcher.py` around lines 1315 - 1321, The ON_PREPARE_STATE
callbacks are being invoked with four positional args in prepare_state, breaking
backward-compatibility for callbacks that expect three; update prepare_state to
detect each callback's accepted arity (e.g., via inspect.signature or
callable.__code__.co_argcount) and call either callback(self, timestep,
model_options, ignore_multigpu) if it accepts 4 args or callback(self, timestep,
model_options) if it only accepts 3 (or attempt the 4-arg call and fall back to
3-arg on TypeError), and apply the same arity-gated invocation when recursing
into multigpu clones; reference prepare_state and CallbacksMP.ON_PREPARE_STATE
to locate where to change the callsite.

Comment thread comfy/quant_ops.py Outdated
ck.registry.disable("cuda")
logging.warning("WARNING: You need pytorch with cu130 or higher to use optimized CUDA operations.")

ck.registry.disable("cuda") # multigpu will not work rn with comfy-kitchen on cuda backend


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's see the full context of the file around line 23
cat -n comfy/quant_ops.py | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Let's also check if there's any git context to see if this is a new change
git diff HEAD -- comfy/quant_ops.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Search for multigpu-related logic in the codebase to understand the gating/conditions
rg -n "multigpu" --type py -B 2 -A 2

Repository: Comfy-Org/ComfyUI

Length of output: 25235


Don't disable the comfy-kitchen CUDA backend globally.

Line 23 disables it unconditionally at import time, which affects all CUDA runs including single-GPU and non-multigpu workflows. This regresses quantization performance for users not using multigpu. Gate the disable behind actual multigpu activation or the specific unsupported backend combination instead of doing it unconditionally here.
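
A hedged sketch of that gating (is_multigpu_requested() is a hypothetical helper; the real signal would come from wherever multigpu setup is decided):

def is_multigpu_requested() -> bool:
    # Hypothetical check: e.g. more than one id passed via --cuda-device.
    ids = getattr(args, "cuda_device", None)
    return ids is not None and "," in str(ids)

if is_multigpu_requested():
    # comfy-kitchen's CUDA backend currently cannot be used with multigpu.
    ck.registry.disable("cuda")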

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/quant_ops.py` at line 23, The unconditional call
ck.registry.disable("cuda") in comfy/quant_ops.py should be removed and only
invoked when the unsupported multigpu+cuda combination is actually active;
locate the ck.registry.disable("cuda") invocation and wrap it with a guard that
checks the real multigpu/backend state (for example an existing multigpu flag or
function like is_multigpu_enabled(), a config/ENV check, or the code path that
handles multigpu setup) so that CUDA is only disabled when multigpu is enabled
and the specific backend combination is unsupported, otherwise leave CUDA
enabled for normal single-GPU runs.

Comment thread comfy/sampler_helpers.py
minimum_memory_required += inference_memory
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required, minimum_memory_required=minimum_memory_required, force_full_load=force_full_load)
-real_model = model.model
+real_model: BaseModel = model.model


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check the imports and line 200 in the file
head -20 comfy/sampler_helpers.py && echo "---" && sed -n '195,205p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 1130


🏁 Script executed:

# Search for all imports of BaseModel or where it might come from
rg -i "basemodel|from.*import|import" comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 463


🏁 Script executed:

# Check if there's a BaseModel defined or imported elsewhere that should be in this file
rg "class BaseModel|from.*BaseModel" comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 50375


🏁 Script executed:

# Check git diff to see if line 200 is new
git diff HEAD^ comfy/sampler_helpers.py | grep -A 5 -B 5 "BaseModel"

Repository: Comfy-Org/ComfyUI

Length of output: 246


🏁 Script executed:

# Look for similar type annotations in the file to understand the pattern
rg ":\s*\w+\s*=" comfy/sampler_helpers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 100


🏁 Script executed:

# Check if there are any existing TYPE_CHECKING imports or patterns
sed -n '1,30p' comfy/sampler_helpers.py

Repository: Comfy-Org/ComfyUI

Length of output: 894


Add the missing BaseModel import to fix undefined name.

Line 200 uses BaseModel in the type annotation, but it's not imported. Add it to the TYPE_CHECKING block to match the file's existing pattern, or remove the annotation.

Suggested fix
 if TYPE_CHECKING:
     from comfy.model_patcher import ModelPatcher
     from comfy.controlnet import ControlBase
+    from comfy.model_base import BaseModel
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
real_model: BaseModel = model.model
if TYPE_CHECKING:
    from comfy.model_patcher import ModelPatcher
    from comfy.controlnet import ControlBase
    from comfy.model_base import BaseModel
🧰 Tools
🪛 GitHub Actions: Python Linting

[error] 200-200: F821 Undefined name BaseModel detected by ruff check. Ensure BaseModel is imported or defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sampler_helpers.py` at line 200, Add the missing BaseModel import used
in the type annotation for real_model (the line "real_model: BaseModel =
model.model") by adding "from typing import TYPE_CHECKING" already present and
then inside the existing TYPE_CHECKING block import BaseModel from its module
(e.g., "from <module> import BaseModel") so the annotation is defined at
type-check time; alternatively remove the BaseModel annotation if you prefer not
to add the import.

Comment thread comfy/samplers.py
Comment on lines +391 to +397
devices = [dev_m for dev_m in model_options['multigpu_clones'].keys()]
device_batched_hooked_to_run: dict[torch.device, list[tuple[comfy.hooks.HookGroup, tuple]]] = {}

total_conds = 0
for to_run in hooked_to_run.values():
    total_conds += len(to_run)
conds_per_device = max(1, math.ceil(total_conds//len(devices)))


⚠️ Potential issue | 🟠 Major

relative_speed is not used by the multigpu scheduler.

This branch still computes a fixed conds_per_device and round-robins by raw condition count; multigpu_options is never consulted here. The new MultiGPU Options node therefore has no effect on work distribution, and the // inside math.ceil(...) makes the split even coarser on uneven counts.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths.

Also applies to: 403-416, 433-435
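
A hedged sketch of the proportional split (the relative_speed lookup is an assumption about how the options are keyed; note the ceil can over-allocate, so the scheduler must still cap at the remaining total):

import math

def allocate_quotas(total_conds, devices, relative_speed):
    # relative_speed: weight per device; faster devices get more conds.
    total_speed = sum(relative_speed[d] for d in devices)
    quotas = {}
    for d in devices:
        share = total_conds * relative_speed[d] / total_speed  # true division, not //
        quotas[d] = max(1, math.ceil(share)) if total_conds > 0 else 0
    return quotas

# Example: a 2:1 speed ratio splits 9 conds roughly 6/3.
# allocate_quotas(9, ["cuda:0", "cuda:1"], {"cuda:0": 2.0, "cuda:1": 1.0})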

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The multigpu scheduler currently
ignores multigpu_options and uses integer floor division (//) inside math.ceil,
producing coarse, incorrect splits; update the batching logic around devices,
device_batched_hooked_to_run, total_conds, hooked_to_run and conds_per_device to
consult multigpu_options (specifically the relative_speed entry for each device
clone) and distribute total_conds proportionally to those relative_speed weights
(then ceil each device's share and ensure at least 1 if there are any
conditions), replacing the math.ceil(total_conds//len(devices)) approach with a
proper float division and per-device allocation; keep device ordering based on
model_options['multigpu_clones'].keys() and ensure the same proportional logic
is applied in the other affected blocks (lines mentioned: 403-416, 433-435) so
the MultiGPU Options node actually affects work distribution.

Comment thread comfy/samplers.py
Comment on lines 847 to +850
if 'control' in x:
    x['control'].pre_run(model, percent_to_timestep_function)
    for device_cnet in x['control'].multigpu_clones.values():
        device_cnet.pre_run(model, percent_to_timestep_function)


⚠️ Potential issue | 🟠 Major

Run per-device controls against the matching model clone.

These new pre_run() calls feed every device clone the base model. Any control that snapshots model-specific state during pre_run() will capture the wrong device/model; QwenFunControlNet.pre_run() in this file already stores model.diffusion_model, so its multigpu clone will still point at the base UNet.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 847 - 850, The code calls
x['control'].pre_run(model, ...) for the base control and then calls
device_cnet.pre_run(model, ...) for each control clone, incorrectly passing the
base model to per-device controls; update the loop to pass the matching
per-device model clone instead. Specifically, when iterating
x['control'].multigpu_clones (the device_cnet clones), look up the corresponding
model clone (e.g., from model.multigpu_clones using the same keys/ids) and call
device_cnet.pre_run(model_clone, percent_to_timestep_function) so each control
clone receives its matching model clone; keep the initial
x['control'].pre_run(model, ...) for the base control.

Comment thread comfy/sd.py
-out[0].cached_patcher_init = (load_checkpoint_guess_config_model_only, (ckpt_path, embedding_directory, model_options, te_model_options))
-if output_clip and out[1] is not None:
-    out[1].patcher.cached_patcher_init = (load_checkpoint_guess_config_clip_only, (ckpt_path, embedding_directory, model_options, te_model_options))
+out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)


⚠️ Potential issue | 🔴 Critical

Guard the model-side cache assignment.

load_checkpoint_guess_config_clip_only() reaches this path with output_model=False, so out[0] is None and this line raises before the CLIP patcher can be returned. It also leaves checkpoint-backed CLIP patchers without their own cached_patcher_init.

Possible fix
-    out[0].cached_patcher_init = (load_checkpoint_guess_config, (ckpt_path, False, False, False, embedding_directory, output_model, model_options, te_model_options), 0)
+    if out[0] is not None:
+        out[0].cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, False, False, embedding_directory, True, model_options, te_model_options),
+            0,
+        )
+    if out[1] is not None:
+        out[1].patcher.cached_patcher_init = (
+            load_checkpoint_guess_config,
+            (ckpt_path, False, True, False, embedding_directory, False, model_options, te_model_options),
+            1,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/sd.py` at line 1557, The assignment to out[0].cached_patcher_init can
raise when out[0] is None (e.g. when called from
load_checkpoint_guess_config_clip_only()), so guard it: check that out[0] is not
None before assigning to out[0].cached_patcher_init and, if the CLIP patcher is
created separately for checkpoint-backed models, set its own cached_patcher_init
instead (or attach the init tuple to the patcher instance). Locate the
assignment line and ensure both code paths (when out[0] exists and when only a
CLIP patcher exists) receive the same cached_patcher_init tuple so no
NullReference occurs and checkpoint-backed CLIP patchers keep their
cached_patcher_init.

Amp-Thread-ID: https://ampcode.com/threads/T-019d3ee9-19d5-767a-9d7a-e50cbbef815b
Co-authored-by: Amp <amp@ampcode.com>

# Conflicts:
#	comfy/samplers.py

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
comfy/samplers.py (1)

506-508: Exception re-raised after appending to results - consider removing redundant raise.

The raise in the exception handler terminates the thread silently (exceptions in threads don't propagate to the main thread automatically). Since errors are already captured in results and re-raised at line 522-523, the raise here is redundant but harmless. The current pattern works correctly - the main thread will surface the first captured error after all threads complete.

✏️ Optional: Remove redundant raise for clarity
         except Exception as e:
             results.append(thread_result(None, None, None, None, None, error=e))
-            raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 506 - 508, Remove the redundant re-raise
inside the except block: keep the results.append(thread_result(..., error=e)) to
record the error (referencing the thread_result construction), but delete the
trailing raise so the worker thread simply records the error and exits; the main
thread already inspects results and re-raises the first captured exception
later, so no further in-thread raising is necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@comfy/samplers.py`:
- Around line 506-508: Remove the redundant re-raise inside the except block:
keep the results.append(thread_result(..., error=e)) to record the error
(referencing the thread_result construction), but delete the trailing raise so
the worker thread simply records the error and exits; the main thread already
inspects results and re-raises the first captured exception later, so no further
in-thread raising is necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 360b873a-b7a1-43f6-904a-72bb0f1a647b

📥 Commits

Reviewing files that changed from the base of the PR and between f410d28 and be35378.

📒 Files selected for processing (6)
  • comfy/cli_args.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/samplers.py
  • comfy/sd.py
  • nodes.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • nodes.py
  • comfy/cli_args.py
  • comfy/sd.py
  • comfy/model_patcher.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
comfy/samplers.py (2)

391-397: ⚠️ Potential issue | 🔴 Critical

Fix per-device quota math; current scheduling can stall and ignores speed weights.

Line 397 uses floor division inside ceil, and Lines 415-416 hard-cap selection by remaining quota. When quotas are exhausted, to_batch can become empty, to_run is never popped, and the while loop can spin indefinitely. relative_speed is also not applied, so load-balancing options still have no effect.

Suggested direction
- conds_per_device = max(1, math.ceil(total_conds//len(devices)))
+ # Use true division and weighted shares (relative_speed), then assign remainders.
+ # Also skip saturated devices before batching to guarantee progress.

As per coding guidelines, comfy/** changes should focus on performance implications in hot paths and thread safety for concurrent execution.

Also applies to: 403-416, 433-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 391 - 397, The per-device quota calculation
and scheduling are wrong: replace the floor-division inside ceil
(total_conds//len(devices)) with true division (total_conds / len(devices)) when
computing conds_per_device, and implement per-device dynamic quotas that honor
each device's relative_speed weight (use model_options['relative_speed'] keyed
by device to allocate proportional quotas instead of a single uniform
conds_per_device). In the scheduling loop that moves items from hooked_to_run
into device_batched_hooked_to_run (symbols: hooked_to_run, to_run, to_batch,
device_batched_hooked_to_run), ensure you always pop empty to_run lists so the
while loop can terminate (if to_batch becomes empty remove that to_run from
consideration), guarantee at least one item is selected per-device when work
remains (but still respect remaining global/weighted quota), and recalc
remaining quotas each iteration so devices with higher relative_speed get more
items; this prevents empty selections, avoids infinite loops, and applies the
relative_speed weighting correctly.

850-851: ⚠️ Potential issue | 🟠 Major

Run control clones with their matching model clone, not the base model.

Lines 850-851 call device_cnet.pre_run(model, ...) for every clone. Controls that cache model internals during pre_run() will bind to the wrong model/device.

As per coding guidelines, comfy/** changes should focus on backward compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 850 - 851, The control clones are being
pre-run with the base model instead of their corresponding model clone; change
the call in the loop that iterates x['control'].multigpu_clones so each
device_cnet.pre_run gets the matching model clone (not the top-level model).
Locate the multigpu_clones structures (x['control'].multigpu_clones and the
model clone container, e.g. x['model'].multigpu_clones or similarly named
mapping) and pair clones by device/key (or zip the values if they are ordered)
and call device_cnet.pre_run(matching_model_clone, percent_to_timestep_function)
so cached internals bind to the correct model clone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/samplers.py`:
- Around line 424-425: The multigpu memory-fit check builds input_shape from
batch_amount/first_shape but calls model.memory_required(input_shape) without
conditioning shapes, which underestimates memory for conditioned models; change
the call to pass the conditioning shapes (use cond_shapes) just like the
single-GPU path so memory_required receives both input_shape and cond_shapes
(update the check around the variables input_shape, cond_shapes,
model.memory_required, batch_amount, and first_shape) to avoid over-batching and
OOMs.
- Around line 447-448: The code unconditionally calls
torch.cuda.set_device(device) before selecting model_current from
model_options["multigpu_clones"], which fails for non-CUDA devices; change the
logic in the section that sets model_current (and any device setup) to first
check the device type (e.g., inspect device.type or use
str(device).startswith("cuda")) and only call torch.cuda.set_device(device) for
CUDA devices, otherwise skip or use appropriate backend-specific calls; update
the block referencing model_options["multigpu_clones"][device].model and
torch.cuda.set_device to conditionally handle CUDA vs non-CUDA backends so
multi-backend multigpu clones (XPU/NPU/MLU/MPS/CPU/etc.) don't trigger a
CUDA-only call.
- Around line 503-506: The worker thread is moving tensors asynchronously with
.to(output_device) (via model_options['model_function_wrapper'] /
model_current.apply_model) and then returning thread_result, which can cause
races when the main thread aggregates GPU tensors; fix by calling
torch.cuda.synchronize(output_device) immediately after the
.to(output_device).chunk(...) transfer completes in the worker (guarded by a
check that output_device is a CUDA device) so the copy/kernels have finished
before appending/returning the thread_result.

---

Duplicate comments:
In `@comfy/samplers.py`:
- Around line 391-397: The per-device quota calculation and scheduling are
wrong: replace the floor-division inside ceil (total_conds//len(devices)) with
true division (total_conds / len(devices)) when computing conds_per_device, and
implement per-device dynamic quotas that honor each device's relative_speed
weight (use model_options['relative_speed'] keyed by device to allocate
proportional quotas instead of a single uniform conds_per_device). In the
scheduling loop that moves items from hooked_to_run into
device_batched_hooked_to_run (symbols: hooked_to_run, to_run, to_batch,
device_batched_hooked_to_run), ensure you always pop empty to_run lists so the
while loop can terminate (if to_batch becomes empty remove that to_run from
consideration), guarantee at least one item is selected per-device when work
remains (but still respect remaining global/weighted quota), and recalc
remaining quotas each iteration so devices with higher relative_speed get more
items; this prevents empty selections, avoids infinite loops, and applies the
relative_speed weighting correctly.
- Around line 850-851: The control clones are being pre-run with the base model
instead of their corresponding model clone; change the call in the loop that
iterates x['control'].multigpu_clones so each device_cnet.pre_run gets the
matching model clone (not the top-level model). Locate the multigpu_clones
structures (x['control'].multigpu_clones and the model clone container, e.g.
x['model'].multigpu_clones or similarly named mapping) and pair clones by
device/key (or zip the values if they are ordered) and call
device_cnet.pre_run(matching_model_clone, percent_to_timestep_function) so
cached internals bind to the correct model clone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbdf3a6c-ad44-4348-8b59-a5b62349893a

📥 Commits

Reviewing files that changed from the base of the PR and between be35378 and 1d8e379.

📒 Files selected for processing (2)
  • comfy/samplers.py
  • comfy_extras/nodes_multigpu.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • comfy_extras/nodes_multigpu.py

Comment thread comfy/samplers.py
Comment on lines +424 to +425
input_shape = [len(batch_amount) * first_shape[0]] + list(first_shape)[1:]
if model.memory_required(input_shape) * 1.5 < free_memory:


⚠️ Potential issue | 🟠 Major

Use cond_shapes in multigpu memory-fit checks to avoid over-batching.

Line 425 calls model.memory_required(input_shape) without conditioning shapes, unlike the single-GPU path. This can under-estimate required memory and trigger avoidable OOMs under complex conditioning.

Parity fix with single-GPU path
-                if model.memory_required(input_shape) * 1.5 < free_memory:
+                cond_shapes = collections.defaultdict(list)
+                for tt in batch_amount:
+                    for k, v in to_run[tt][0].conditioning.items():
+                        cond_shapes[k].append(v.size())
+                if model.memory_required(input_shape, cond_shapes=cond_shapes) * 1.5 < free_memory:
                     to_batch = batch_amount
                     break

As per coding guidelines, comfy/** changes should focus on memory management and GPU resource handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 424 - 425, The multigpu memory-fit check
builds input_shape from batch_amount/first_shape but calls
model.memory_required(input_shape) without conditioning shapes, which
underestimates memory for conditioned models; change the call to pass the
conditioning shapes (use cond_shapes) just like the single-GPU path so
memory_required receives both input_shape and cond_shapes (update the check
around the variables input_shape, cond_shapes, model.memory_required,
batch_amount, and first_shape) to avoid over-batching and OOMs.

Comment thread comfy/samplers.py
Comment on lines +447 to +448
torch.cuda.set_device(device)
model_current: BaseModel = model_options["multigpu_clones"][device].model


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify backend diversity and cuda-only binding usage
rg -n --type=py -C2 'def (is_intel_xpu|is_ascend_npu|is_mlu)|torch\.(xpu|npu|mlu)\.set_device'
rg -n --type=py -C2 'torch\.cuda\.set_device\(device\)|_calc_cond_batch_multigpu'

Repository: Comfy-Org/ComfyUI

Length of output: 1936


🏁 Script executed:

# Find where multigpu_clones is populated
rg -n --type=py 'multigpu_clones\s*=' | head -20

# Check the context around creation
rg -n --type=py -B5 -A5 'multigpu_clones\[' | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1511


🏁 Script executed:

# Check if line 447 is new in this commit or pre-existing
git log --oneline -10 -- comfy/samplers.py

# Get git diff for this specific line
git diff HEAD~1 HEAD -- comfy/samplers.py | grep -A10 -B10 'torch.cuda.set_device'

Repository: Comfy-Org/ComfyUI

Length of output: 164


🏁 Script executed:

# Find where _calc_cond_batch_multigpu is called and what device is passed
rg -n --type=py -B10 '_calc_cond_batch_multigpu' | head -40

# Look for where multigpu_clones dict is iterated or populated
rg -n --type=py 'for .* in.*multigpu_clones' -A5 -B5 | head -80

Repository: Comfy-Org/ComfyUI

Length of output: 6026


🏁 Script executed:

# Find what creates model_options['multigpu_clones'] and what device types go into it
rg -n --type=py "model_options\[.multigpu_clones.\]" -B3 -A3 | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 3189


🏁 Script executed:

# Find where multigpu setup happens and what device types are assigned
rg -n --type=py -B15 'multigpu_dict\[.*load_device.*\]' comfy/sampler_helpers.py

# Check if there's device filtering for multigpu
rg -n --type=py 'multigpu' comfy/sampler_helpers.py | head -30

Repository: Comfy-Org/ComfyUI

Length of output: 2758


🏁 Script executed:

# Look for where load_device is set and what device types are possible
rg -n --type=py 'load_device\s*=' --type=py -A2 -B2 | grep -E '(load_device|xpu|npu|mlu|cuda)' | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 5885


🏁 Script executed:

# Find get_torch_device implementation
rg -n --type=py 'def get_torch_device' -A20 comfy/model_management.py

# Also check if there's device type checking in the codebase
rg -n --type=py 'device\.type.*==.*"cuda"' | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 1910


🏁 Script executed:

# Check git history for line 447 in samplers.py
git log --all --oneline --follow -p -- comfy/samplers.py | head -200

# Alternative: check recent changes to that specific line
git blame comfy/samplers.py | grep -A2 -B2 "torch.cuda.set_device"

Repository: Comfy-Org/ComfyUI

Length of output: 7767


Avoid CUDA-only set_device in a multi-backend multigpu path.

Line 447 unconditionally calls torch.cuda.set_device(device), which will crash when multigpu clones contain non-CUDA devices (XPU, NPU, MLU, MPS, CPU, DirectML). Since get_torch_device() returns various device types that are added as keys to multigpu_clones, this breaks multi-backend support.

Safer minimal fix
-            torch.cuda.set_device(device)
+            if hasattr(device, "type") and device.type == "cuda":
+                torch.cuda.set_device(device)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 447 - 448, The code unconditionally calls
torch.cuda.set_device(device) before selecting model_current from
model_options["multigpu_clones"], which fails for non-CUDA devices; change the
logic in the section that sets model_current (and any device setup) to first
check the device type (e.g., inspect device.type or use
str(device).startswith("cuda")) and only call torch.cuda.set_device(device) for
CUDA devices, otherwise skip or use appropriate backend-specific calls; update
the block referencing model_options["multigpu_clones"][device].model and
torch.cuda.set_device to conditionally handle CUDA vs non-CUDA backends so
multi-backend multigpu clones (XPU/NPU/MLU/MPS/CPU/etc.) don't trigger a
CUDA-only call.

Comment thread comfy/samplers.py
Comment on lines +503 to +506
    output = model_options['model_function_wrapper'](model_current.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).to(output_device).chunk(batch_chunks)
else:
    output = model_current.apply_model(input_x, timestep_, **c).to(output_device).chunk(batch_chunks)
results.append(thread_result(output, mult, area, batch_chunks, cond_or_uncond))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, does joining a Python thread guarantee completion of CUDA kernels and cross-device .to() copies launched in that thread, or is explicit stream/device synchronization required?

💡 Result:

No. thread.join() only guarantees the Python thread has finished executing and enqueuing work, not that the CUDA work it launched has completed.

  • In CUDA, kernel launches and many copies are asynchronous w.r.t. the host thread (the CPU thread can return before the GPU finishes). CUDA can also use a per-thread default stream, so “the thread ended” is not a synchronization primitive for the GPU. [2]
  • In PyTorch, torch.cuda.synchronize(device=...) is the explicit mechanism that waits for all kernels in all streams on a CUDA device to complete. [1]
  • For .to() copies: if you use non_blocking=True, PyTorch explicitly performs the transfer without triggering host synchronization (i.e., it can still be in flight after the call returns, and therefore after join()). [3][4]

So if you need a guarantee that kernels and cross-device .to() copies launched in that thread are finished (e.g., before freeing/reusing buffers, consuming results on another device/stream, or accurate timing), you must use explicit CUDA/PyTorch synchronization (e.g., stream.synchronize(), event-based sync, or torch.cuda.synchronize(device)), rather than relying on join(). [1][2]


🏁 Script executed:

# First, let's look at the code around lines 503-506
sed -n '490,520p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 2024


🏁 Script executed:

# Get more context on the function to understand the threading model
sed -n '470,540p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 4231


🏁 Script executed:

# Check if there's any synchronization happening after thread.join()
grep -n "thread.join\|synchronize\|cuda.stream" comfy/samplers.py | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 86


🏁 Script executed:

# Check git status and diff for comfy/samplers.py to see what changed
git diff HEAD comfy/samplers.py | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check git log for recent changes to this file
git log --oneline -n 10 -- comfy/samplers.py

Repository: Comfy-Org/ComfyUI

Length of output: 118


🏁 Script executed:

# Look at the actual diff around lines 503-506
git diff HEAD -- comfy/samplers.py | grep -A 20 -B 20 "output_device"

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check git status
git status

Repository: Comfy-Org/ComfyUI

Length of output: 126


🏁 Script executed:

# Show the recent commit to see what changed
git show --stat 1d8e379f

Repository: Comfy-Org/ComfyUI

Length of output: 49365


🏁 Script executed:

# Check if samplers.py was modified in that commit
git show 1d8e379f -- comfy/samplers.py | head -150

Repository: Comfy-Org/ComfyUI

Length of output: 5195


🏁 Script executed:

# Verify the exact lines 503-506 match the pattern described
sed -n '500,510p' comfy/samplers.py | cat -n

Repository: Comfy-Org/ComfyUI

Length of output: 937


Add explicit CUDA synchronization before tensor aggregation.

Inside worker threads (lines 503-506), .to(output_device) transfers tensors asynchronously. After thread.join() returns, CUDA kernels may still be in flight. The subsequent aggregation (GPU operations on these tensors) can race with pending device copies unless there is explicit synchronization such as torch.cuda.synchronize(output_device).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/samplers.py` around lines 503 - 506, The worker thread is moving
tensors asynchronously with .to(output_device) (via
model_options['model_function_wrapper'] / model_current.apply_model) and then
returning thread_result, which can cause races when the main thread aggregates
GPU tensors; fix by calling torch.cuda.synchronize(output_device) immediately
after the .to(output_device).chunk(...) transfer completes in the worker
(guarded by a check that output_device is a CUDA device) so the copy/kernels
have finished before appending/returning the thread_result.
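
For illustration, a minimal standalone sketch of the pattern the fix describes, using placeholder tensors rather than the actual sampler state:

```python
import threading
import torch

def worker(src: torch.Tensor, out_device: torch.device, results: list):
    # The cross-device .to() is asynchronous; the thread finishing does
    # not mean the copy has landed on out_device.
    moved = src.to(out_device, non_blocking=True)
    if out_device.type == "cuda":
        torch.cuda.synchronize(out_device)  # wait for copy and kernels
    results.append(moved)

if torch.cuda.device_count() >= 2:
    results = []
    x = torch.randn(1024, 1024, device="cuda:1")
    t = threading.Thread(target=worker, args=(x, torch.device("cuda:0"), results))
    t.start()
    t.join()
    # Only now is it safe to aggregate results[0] on cuda:0.
    print(results[0].mean())
```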

Kosinkadink and others added 4 commits March 30, 2026 08:32
…aps to parent patcher

When a multigpu clone ModelPatcher is garbage collected, LoadedModel._switch_parent
switches the weakref to point at the parent (main) ModelPatcher. However, it was not
updating LoadedModel.device, leaving it with the old clone's device (e.g., cuda:1).
On subsequent runs, this stale device was passed to ModelPatcherDynamic.load(), causing
an assertion failure (device_to != self.load_device).

Amp-Thread-ID: https://ampcode.com/threads/T-019d3f5c-28c5-72c9-abed-34681f1b54ba
Co-authored-by: Amp <amp@ampcode.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
comfy/model_management.py (3)

522-526: Consider narrowing the exception handler.

Bare except: catches everything including KeyboardInterrupt and SystemExit. While this is best-effort logging, using except Exception: would be more precise.

🔧 Suggested change
 try:
     for device in get_all_torch_devices(exclude_current=True):
         logging.info("Device: {}".format(get_torch_device_name(device)))
-except:
+except Exception:
     pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 522 - 526, The try/except block
around the device logging is too broad; replace the bare except with a narrower
Exception handler so system-exiting signals aren't swallowed. Update the block
that calls get_all_torch_devices(exclude_current=True) and logging.info("Device:
{}".format(get_torch_device_name(device))) to catch only Exception (i.e., use
except Exception:) and optionally log the exception via logging.debug or
logging.exception to retain best-effort behavior without suppressing
KeyboardInterrupt/SystemExit.

1815-1839: Docstring should use triple quotes.

Line 1816 uses single quotes for the docstring. Per Python convention, docstrings should use triple double-quotes.

📝 Suggested change
 def unload_model_and_clones(model: ModelPatcher, unload_additional_models=True, all_devices=False):
-    'Unload only model and its clones - primarily for multigpu cloning purposes.'
+    """Unload only model and its clones - primarily for multigpu cloning purposes."""
     initial_keep_loaded: list[LoadedModel] = current_loaded_models.copy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 1815 - 1839, The function
unload_model_and_clones uses a single-quoted string for its docstring; replace
the single-quoted docstring with a proper triple-quoted docstring (use triple
double-quotes """...""") at the start of the unload_model_and_clones function
so it follows Python conventions and is recognized by tools that read __doc__
for that function.

1822-1834: Consider using a set for UUID lookups.

The nested loop checking additional_models is O(n×m). If additional_models grows large, converting UUIDs to a set would improve lookup performance.

♻️ Suggested optimization
 def unload_model_and_clones(model: ModelPatcher, unload_additional_models=True, all_devices=False):
-    'Unload only model and its clones - primarily for multigpu cloning purposes.'
+    """Unload only model and its clones - primarily for multigpu cloning purposes."""
     initial_keep_loaded: list[LoadedModel] = current_loaded_models.copy()
     additional_models = []
+    additional_uuids = set()
     if unload_additional_models:
         additional_models = model.get_nested_additional_models()
+        additional_uuids = {m.clone_base_uuid for m in additional_models}
     keep_loaded = []
     for loaded_model in initial_keep_loaded:
         if loaded_model.model is not None:
             if model.clone_base_uuid == loaded_model.model.clone_base_uuid:
                 continue
             # check additional models if they are a match
-            skip = False
-            for add_model in additional_models:
-                if add_model.clone_base_uuid == loaded_model.model.clone_base_uuid:
-                    skip = True
-                    break
-            if skip:
+            if loaded_model.model.clone_base_uuid in additional_uuids:
                 continue
         keep_loaded.append(loaded_model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 1822 - 1834, The nested loop over
initial_keep_loaded and additional_models causes O(n×m) UUID comparisons;
optimize by precomputing a set of clone_base_uuid values from additional_models
and use set membership instead. Specifically, before iterating
initial_keep_loaded, build a set (e.g., additional_uuids = {m.clone_base_uuid
for m in additional_models}) and then inside the loop replace the inner loop
that checks add_model.clone_base_uuid with a single membership test against
additional_uuids; keep references to initial_keep_loaded, loaded_model,
model.clone_base_uuid, additional_models, and keep_loaded to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/model_management.py`:
- Around line 220-231: The device-listing loops create devices with the wrong
type and removal is unsafe: replace torch.device(i) with explicit device strings
(e.g. torch.device(f"cuda:{i}") for CUDA, torch.device(f"xpu:{i}") for XPU when
is_intel_xpu() is true, and torch.device(f"npu:{i}") for NPU when
is_ascend_npu() is true) so the device kind matches get_torch_device(); then
change the exclude_current removal to check membership first (if
get_torch_device() in devices: devices.remove(...)) to avoid ValueError.
Reference: the device construction loops and the use of get_torch_device() /
exclude_current in this block.

---

Nitpick comments:
In `@comfy/model_management.py`:
- Around line 522-526: The try/except block around the device logging is too
broad; replace the bare except with a narrower Exception handler so
system-exiting signals aren't swallowed. Update the block that calls
get_all_torch_devices(exclude_current=True) and logging.info("Device:
{}".format(get_torch_device_name(device))) to catch only Exception (i.e., use
except Exception:) and optionally log the exception via logging.debug or
logging.exception to retain best-effort behavior without suppressing
KeyboardInterrupt/SystemExit.
- Around line 1815-1839: The function unload_model_and_clones uses a
single-quoted string for its docstring; replace the single-quoted docstring with
a proper triple-quoted docstring (use triple double-quotes """...""") at the
start of the unload_model_and_clones function so it follows Python conventions
and is recognized by tools that read __doc__ for that function.
- Around line 1822-1834: The nested loop over initial_keep_loaded and
additional_models causes O(n×m) UUID comparisons; optimize by precomputing a set
of clone_base_uuid values from additional_models and use set membership instead.
Specifically, before iterating initial_keep_loaded, build a set (e.g.,
additional_uuids = {m.clone_base_uuid for m in additional_models}) and then
inside the loop replace the inner loop that checks add_model.clone_base_uuid
with a single membership test against additional_uuids; keep references to
initial_keep_loaded, loaded_model, model.clone_base_uuid, additional_models, and
keep_loaded to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e3f6eb1-aa35-4792-b004-a5900f6db6b4

📥 Commits

Reviewing files that changed from the base of the PR and between 1d8e379 and b418fb1.

📒 Files selected for processing (2)
  • comfy/model_management.py
  • comfy/quant_ops.py
💤 Files with no reviewable changes (1)
  • comfy/quant_ops.py

Comment thread comfy/model_management.py
Comment on lines +220 to +231
for i in range(torch.cuda.device_count()):
devices.append(torch.device(i))
elif is_intel_xpu():
for i in range(torch.xpu.device_count()):
devices.append(torch.device(i))
elif is_ascend_npu():
for i in range(torch.npu.device_count()):
devices.append(torch.device(i))
else:
devices.append(get_torch_device())
if exclude_current:
devices.remove(get_torch_device())

⚠️ Potential issue | 🟠 Major

XPU and NPU device creation uses incorrect device type.

Lines 224 and 227 use torch.device(i) which defaults to CUDA, not the actual accelerator type. On XPU systems, this creates cuda:0 instead of xpu:0, causing a mismatch with get_torch_device() (which correctly returns xpu:X). This will cause devices.remove() at line 231 to fail with ValueError when exclude_current=True.

Additionally, line 231 should guard against the device not being in the list to handle edge cases gracefully.
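
A quick REPL check of the underlying behavior (assuming a standard CUDA-enabled PyTorch build):

```python
import torch

# A bare int always resolves to a CUDA device, whatever backend is present:
assert torch.device(0) == torch.device("cuda", 0)
assert torch.device(0) != torch.device("xpu", 0)
# So on an XPU system the list holds cuda:N entries while
# get_torch_device() returns xpu:N, and devices.remove() raises ValueError.
```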

🐛 Proposed fix for device type and remove() safety
 def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
         if is_nvidia():
             for i in range(torch.cuda.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("cuda", i))
         elif is_intel_xpu():
             for i in range(torch.xpu.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("xpu", i))
         elif is_ascend_npu():
             for i in range(torch.npu.device_count()):
-                devices.append(torch.device(i))
+                devices.append(torch.device("npu", i))
     else:
         devices.append(get_torch_device())
     if exclude_current:
-        devices.remove(get_torch_device())
+        current = get_torch_device()
+        if current in devices:
+            devices.remove(current)
     return devices
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_management.py` around lines 220 - 231, The device-listing loops
create devices with the wrong type and removal is unsafe: replace
torch.device(i) with explicit device strings (e.g. torch.device(f"cuda:{i}") for
CUDA, torch.device(f"xpu:{i}") for XPU when is_intel_xpu() is true, and
torch.device(f"npu:{i}") for NPU when is_ascend_npu() is true) so the device
kind matches get_torch_device(); then change the exclude_current removal to
check membership first (if get_torch_device() in devices: devices.remove(...))
to avoid ValueError. Reference: the device construction loops and the use of
get_torch_device() / exclude_current in this block.

Kosinkadink and others added 4 commits April 8, 2026 05:08
Replace per-step thread create/destroy in _calc_cond_batch_multigpu with a
persistent MultiGPUThreadPool. Each worker thread calls torch.cuda.set_device()
once at startup, preserving compiled kernel caches across diffusion steps.

- Add MultiGPUThreadPool class in comfy/multigpu.py
- Create pool in CFGGuider.outer_sample(), shut down in finally block
- Main thread handles its own device batch directly for zero overhead
- Falls back to sequential execution if no pool is available
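
For reference, a minimal sketch of what such a persistent per-device pool can look like; the class name matches the commit, but the body here is illustrative, not the actual comfy/multigpu.py implementation:

```python
import queue
import threading

import torch

class MultiGPUThreadPool:
    # Sketch only: one long-lived worker per device, each pinning its
    # CUDA device once at startup so compiled-kernel caches survive
    # across diffusion steps.
    def __init__(self, devices):
        self._queues = {d: queue.Queue() for d in devices}
        self._threads = [
            threading.Thread(target=self._worker, args=(d,), daemon=True)
            for d in devices
        ]
        for t in self._threads:
            t.start()

    def _worker(self, device):
        if device.type == "cuda":
            torch.cuda.set_device(device)  # once per thread lifetime
        q = self._queues[device]
        while True:
            job = q.get()
            if job is None:  # shutdown sentinel
                return
            fn, args, done = job
            done.put(fn(*args))

    def submit(self, device, fn, *args):
        done = queue.Queue(maxsize=1)
        self._queues[device].put((fn, args, done))
        return done  # caller blocks on done.get() for the result

    def shutdown(self):
        for q in self._queues.values():
            q.put(None)
        for t in self._threads:
            t.join()
```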
Benchmarked hybrid (main thread + pool) vs all-pool on 2x RTX 4090
with SD1.5 and NetaYume models. No meaningful performance difference
(within noise). All-pool is simpler: eliminates the main_device
special case, main_batch_tuple deferred execution, and the 3-way
branch in the dispatch loop.
* main: init all visible cuda devices in aimdo

* mp: call vbars_analyze for the GPU in question

* requirements: bump aimdo to pre-release version
@socket-security

socket-security Bot commented Apr 16, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Updated | comfy-aimdo@0.2.12 ⏵ 0.0.213 | 100 | 100 | 100 | 100 | 70 (-30) |

View full report

@Suppe2000

Screenshot 2026-04-21 190648

What am I doing wrong? ComfyUI isn't starting.

@Kosinkadink
Member Author

You'll need to install the requirements (pip install -r requirements.txt)

@Suppe2000

You'll need to install the requirements (pip install -r requirements.txt)

All requirements are satisfied, so I don't understand what is going on. The master branch works fine; only this branch doesn't.

@Kosinkadink
Member Author

Can you do a pip freeze and show me the printout? This branch currently requires a specific preview version of aimdo, which is included in the requirements here.

@Suppe2000

Suppe2000 commented Apr 21, 2026

Can you do a pip freeze and show me the printout? This branch currently requires a specific preview version of aimdo, which is included in the requirements here.

This one?

Requirement already satisfied: comfy-aimdo==0.0.213 in C:\Users\...\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages (from -r requirements.txt (line 26)) (0.0.213)

Does anyone have an idea how to solve the issue?

rattus128 and others added 3 commits April 23, 2026 19:09
Cut pre-release 0.0.214 off aimdo master to pickup async mem accounting
fix.

* Fix Hunyuan 3D 2.1 multi-GPU worksplit: use cond_or_uncond instead of hardcoded chunk(2)

Amp-Thread-ID: https://ampcode.com/threads/T-019da964-2cc8-77f9-9aae-23f65da233db
Co-authored-by: Amp <amp@ampcode.com>

* Add GPU device selection to all loader nodes

- Add get_gpu_device_options() and resolve_gpu_device_option() helpers
  in model_management.py for vendor-agnostic GPU device selection
- Add device widget to CheckpointLoaderSimple, UNETLoader, VAELoader
- Expand device options in CLIPLoader, DualCLIPLoader, LTXAVTextEncoderLoader
  from [default, cpu] to include gpu:0, gpu:1, etc. on multi-GPU systems
- Wire load_diffusion_model_state_dict and load_state_dict_guess_config
  to respect model_options['load_device']
- Graceful fallback: unrecognized devices (e.g. gpu:1 on single-GPU)
  silently fall back to default
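
A sketch of the fallback logic described above; the function name comes from the commit, but the signature and 'gpu:N' parsing here are assumptions:

```python
from typing import Optional

import torch

def resolve_gpu_device_option(value: str) -> Optional[torch.device]:
    # 'default' -> None (caller picks), 'cpu' -> CPU, 'gpu:N' -> Nth
    # visible accelerator; unknown or out-of-range values fall back to
    # None so workflows stay portable across machines.
    if value == "cpu":
        return torch.device("cpu")
    if value.startswith("gpu:"):
        try:
            idx = int(value.split(":", 1)[1])
        except ValueError:
            return None
        if 0 <= idx < torch.cuda.device_count():
            return torch.device("cuda", idx)
    return None  # 'default', gpu:-1, gpu:9 on a 1-GPU box, etc.
```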

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Add VALIDATE_INPUTS to skip device combo validation for workflow portability

When a workflow saved on a 2-GPU machine (with device=gpu:1) is loaded
on a 1-GPU machine, the combo validation would reject the unknown value.
VALIDATE_INPUTS with the device parameter bypasses combo validation for
that input only, allowing resolve_gpu_device_option to handle the
graceful fallback at runtime.
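
In ComfyUI, naming an input in a node's VALIDATE_INPUTS signature makes the executor skip built-in validation for that input. A minimal sketch of the pattern the commit describes (node name hypothetical):

```python
class DeviceAwareLoader:  # hypothetical node, for illustration only
    @classmethod
    def VALIDATE_INPUTS(cls, device):
        # Listing `device` here bypasses combo validation for this input
        # alone; resolve_gpu_device_option() then handles unknown values
        # (e.g. gpu:1 on a single-GPU machine) at runtime.
        return True
```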

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Set CUDA device context in outer_sample to match model load_device

Custom CUDA kernels (comfy_kitchen fp8 quantization) use
torch.cuda.current_device() for DLPack tensor export. When a model is
loaded on a non-default GPU (e.g. cuda:1), the CUDA context must match
or the kernel fails with 'Can't export tensors on a different CUDA
device index'. Save and restore the previous device around sampling.

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix code review bugs: negative index guard, CPU offload_device, checkpoint te_model_options

- resolve_gpu_device_option: reject negative indices (gpu:-1)
- UNETLoader: set offload_device when cpu is selected
- CheckpointLoaderSimple: pass te_model_options for CLIP device,
  set offload_device for cpu, pass load_device to VAE
- load_diffusion_model_state_dict: respect offload_device from model_options
- load_state_dict_guess_config: respect offload_device, pass load_device to VAE

Amp-Thread-ID: https://ampcode.com/threads/T-019daa41-f394-731a-8955-4cff4f16283a
Co-authored-by: Amp <amp@ampcode.com>

* Fix CUDA device context for CLIP encoding and VAE encode/decode

Add torch.cuda.set_device() calls to match model's load device in:
- CLIP.encode_from_tokens: fixes 'Can't export tensors on a different
  CUDA device index' when CLIP is loaded on a non-default GPU
- CLIP.encode_from_tokens_scheduled: same fix for the hooks code path
- CLIP.generate: same fix for text generation
- VAE.decode: fixes VAE decoding on non-default GPU
- VAE.encode: fixes VAE encoding on non-default GPU

Same pattern as the existing outer_sample fix in samplers.py - saves
and restores previous CUDA device in a try/finally block.

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Extract cuda_device_context manager, fix tiled VAE methods

Add model_management.cuda_device_context() — a context manager that
saves/restores torch.cuda.current_device when operating on a non-default
GPU. Replaces 6 copies of the manual save/set/restore boilerplate.
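
A sketch of what such a context manager plausibly looks like (the real helper lives in comfy.model_management; this shows the pattern, not the verbatim code):

```python
import contextlib

import torch

@contextlib.contextmanager
def cuda_device_context(device):
    # No-op for non-CUDA devices; otherwise pin `device` for the block
    # and restore the previous current device afterwards.
    if getattr(device, "type", None) != "cuda":
        yield
        return
    prev = torch.cuda.current_device()
    try:
        torch.cuda.set_device(device)
        yield
    finally:
        torch.cuda.set_device(prev)
```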

Refactored call sites:
- CLIP.encode_from_tokens
- CLIP.encode_from_tokens_scheduled (hooks path)
- CLIP.generate
- VAE.decode
- VAE.encode
- samplers.outer_sample

Bug fixes (newly wrapped):
- VAE.decode_tiled: was missing device context entirely, would fail
  on non-default GPU when called from 'VAE Decode (Tiled)' node
- VAE.encode_tiled: same issue for 'VAE Encode (Tiled)' node

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

* Restore CheckpointLoaderSimple, add CheckpointLoaderDevice

Revert CheckpointLoaderSimple to its original form (no device input)
so it remains the simple default loader.

Add new CheckpointLoaderDevice node (advanced/loaders) with separate
model_device, clip_device, and vae_device inputs for per-component
GPU placement in multi-GPU setups.

Amp-Thread-ID: https://ampcode.com/threads/T-019dabdc-8feb-766f-b4dc-f46ef4d8ff57
Co-authored-by: Amp <amp@ampcode.com>

---------

Co-authored-by: Amp <amp@ampcode.com>
* fix: pin SQLAlchemy>=2.0 in requirements.txt (fixes #13036) (#13316)

* Refactor io to IO in nodes_ace.py (#13485)

* Bump comfyui-frontend-package to 1.42.12 (#13489)

* Make the ltx audio vae more native. (#13486)

* feat(api-nodes): add automatic downscaling of videos for ByteDance 2 nodes (#13465)

* Support standalone LTXV audio VAEs (#13499)

* [Partner Nodes]  added 4K resolution for Veo models; added Veo 3 Lite model (#13330)

* feat(api nodes): added 4K resolution for Veo models; added Veo 3 Lite model

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* increase poll_interval from 5 to 9

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>

* Bump comfyui-frontend-package to 1.42.14 (#13493)

* Add gpt-image-2 as version option (#13501)

* Allow logging in comfy app files. (#13505)

* chore: update workflow templates to v0.9.59 (#13507)

* fix(veo): reject 4K resolution for veo-3.0 models in Veo3VideoGenerationNode (#13504)

The tooltip on the resolution input states that 4K is not available for
veo-3.1-lite or veo-3.0 models, but the execute guard only rejected the
lite combination. Selecting 4K with veo-3.0-generate-001 or
veo-3.0-fast-generate-001 would fall through and hit the upstream API
with an invalid request.

Broaden the guard to match the documented behavior and update the error
message accordingly.

Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>

* feat: RIFE and FILM frame interpolation model support (CORE-29) (#13258)

* initial RIFE support

* Also support FILM

* Better RAM usage, reduce FILM VRAM peak

* Add model folder placeholder

* Fix oom fallback frame loss

* Remove torch.compile for now

* Rename model input

* Shorter input type name

---------

* fix: use Parameter assignment for Stable_Zero123 cc_projection weights (fixes #13492) (#13518)

On Windows with aimdo enabled, disable_weight_init.Linear uses lazy
initialization that sets weight and bias to None to avoid unnecessary
memory allocation. This caused a crash when copy_() was called on the
None weight attribute in Stable_Zero123.__init__.

Replace copy_() with direct torch.nn.Parameter assignment, which works
correctly on both Windows (aimdo enabled) and other platforms.
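
A minimal reproduction of the fix, with a plain nn.Linear standing in for the cc_projection layer:

```python
import torch

lin = torch.nn.Linear(4, 4)
lin.weight = None  # simulate lazy init leaving the weight unallocated
w = torch.randn(4, 4)

# lin.weight.copy_(w)  # AttributeError: 'NoneType' has no attribute 'copy_'
lin.weight = torch.nn.Parameter(w)  # direct assignment works either way
```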

* Derive InterruptProcessingException from BaseException (#13523)

* bump manager version to 4.2.1 (#13516)

* ModelPatcherDynamic: force cast stray weights on comfy layers (#13487)

The mixed_precision ops can have input_scale parameters that are used
in tensor math but aren't a weight or bias, so they don't get proper VRAM
management. Treat these as force-castable parameters, like the non-comfy
weights; stray random params are already buffers.

* Update logging level for invalid version format (#13526)

* [Partner Nodes] add SD2 real human support (#13509)

* feat(api-nodes): add SD2 real human support

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* fix: add validation before uploading Assets

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* Add asset_id and group_id displaying on the node

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* extend poll_op to use instead of custom async cycle

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* added the polling for the "Active" status after asset creation

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* updated tooltip for group_id

* allow usage of real human in the ByteDance2FirstLastFrame node

* add reference count limits

* corrected price in status when input assets contain video

Signed-off-by: bigcat88 <bigcat88@icloud.com>

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* feat: SAM (segment anything) 3.1 support (CORE-34) (#13408)

* [Partner Nodes] GPTImage: fix price badges, add new resolutions (#13519)

* fix(api-nodes): fixed price badges, add new resolutions

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* properly calculate the total run cost when "n > 1"

Signed-off-by: bigcat88 <bigcat88@icloud.com>

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* chore: update workflow templates to v0.9.61 (#13533)

* chore: update embedded docs to v0.4.4 (#13535)

* add 4K resolution to Kling nodes (#13536)

Signed-off-by: bigcat88 <bigcat88@icloud.com>

* Fix LTXV Reference Audio node (#13531)

* comfy-aimdo 0.2.14: Hotfix async allocator estimations (#13534)

This was over-estimating the VRAM used by the async allocator when lots
of small tensors were in play.

Also change the versioning scheme to == so we can roll forward aimdo without
worrying about stable regressions downstream in comfyUI core.

* Disable sageattention for SAM3 (#13529)

Causes Nans

* execution: Add anti-cycle validation (#13169)

Currently, if the graph contains a cycle, validation just recurses infinitely,
hits a catch-all, and then throws a generic error against the output node
that seeded the validation. Instead, fail the offending cycling node
chain and handle it as an error in its own right.

Co-authored-by: guill <jacob.e.segal@gmail.com>

* chore: update workflow templates to v0.9.62 (#13539)

---------

Signed-off-by: bigcat88 <bigcat88@icloud.com>
Co-authored-by: Octopus <liyuan851277048@icloud.com>
Co-authored-by: comfyanonymous <121283862+comfyanonymous@users.noreply.github.com>
Co-authored-by: Comfy Org PR Bot <snomiao+comfy-pr@gmail.com>
Co-authored-by: Alexander Piskun <13381981+bigcat88@users.noreply.github.com>
Co-authored-by: Jukka Seppänen <40791699+kijai@users.noreply.github.com>
Co-authored-by: AustinMroz <austin@comfy.org>
Co-authored-by: Daxiong (Lin) <contact@comfyui-wiki.com>
Co-authored-by: Matt Miller <matt@miller-media.com>
Co-authored-by: blepping <157360029+blepping@users.noreply.github.com>
Co-authored-by: Dr.Lt.Data <128333288+ltdrdata@users.noreply.github.com>
Co-authored-by: rattus <46076784+rattus128@users.noreply.github.com>
Co-authored-by: guill <jacob.e.segal@gmail.com>