Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal)#13745
Enderfga wants to merge 37 commits into
…vel imports This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits.
The flow-map scheduler advances samples from timestep t to a caller-provided target r in a single Euler step, supporting any-step sampling on flow-map-distilled checkpoints. It is a general-purpose scheduler, not specific to the AnyFlow checkpoints.
Tests: 12 standalone tests covering instantiation, set_timesteps endpoints, shift identity/monotonicity, step shape preservation, zero-interval identity, one-shot sampling, train weight schemes, scale_noise endpoints.
Docs: api/schedulers/flow_map_euler_discrete.md
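The single-step update described in this commit can be sketched as follows. This is a minimal illustration, not the scheduler's code: `flow_map_euler_step` is a hypothetical name, plain lists stand in for tensors, and it assumes the usual flow-matching convention where t = 1 is pure noise and t = 0 is data.

```python
from typing import List, Sequence


def flow_map_euler_step(
    sample: Sequence[float],
    mean_velocity: Sequence[float],
    t: float,
    r: float,
) -> List[float]:
    """Advance `sample` from time t to time r in one Euler step.

    Assumes x_t = (1 - t) * x_0 + t * noise, so the velocity along a
    straight path is (noise - x_0) and the update is x_r = x_t + (r - t) * u.
    """
    return [x + (r - t) * u for x, u in zip(sample, mean_velocity)]


# Toy data on a straight path, where the velocity is constant.
x0 = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]
velocity = [n - d for n, d in zip(noise, x0)]

# One-shot sampling: jump from t = 1 (noise) directly to r = 0 (data).
one_shot = flow_map_euler_step(noise, velocity, t=1.0, r=0.0)

# Zero-interval identity: t == r leaves the sample untouched.
unchanged = flow_map_euler_step(noise, velocity, t=0.7, r=0.7)
```

The two calls mirror the "one-shot sampling" and "zero-interval identity" cases named in the test list.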
A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules:
* FAR causal blocks (init_far_model=True): block-sparse causal attention via flex_attention + compressed-frame patch embedding for frame-level autoregressive generation (Gu et al., 2025, arXiv:2503.19325).
* Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary intervals (AnyFlow).
With both flags off, the model reduces to stock Wan2.1. The class is intentionally self-contained rather than annotated with '# Copied from diffusers.models.transformers.transformer_wan' because upstream Wan has been refactored extensively since v0.35.1 (new WanAttention class, different processor architecture).
Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and determinism, return_dict variants, save/load round-trip with and without init_far_model, gradient checkpointing toggle.
Docs: api/models/anyflow_transformer3d.md
* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using
flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based
causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints
from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.
Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel,
and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel
introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.
Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests +
slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.
Reference slices for slow integration tests are deferred to Phase 7
(Final quality pass) where the user runs them on a real GPU.
Modeled on the Helios pipeline doc (PR huggingface#13208). Sections: paper link + abstract, supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V examples for both bidirectional and causal variants, autodoc trailers.
…ersion script
* Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING.
* AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key.
* scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all 4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the upstream repo with argparse to match other diffusers conversion scripts.
* ruff format pass on all 5 source files (long lines + trailing comma fixes)
* check_dummies.py --fix_and_overwrite regenerated:
  - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler
  - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline
Local fast tests: 21/21 passed
  - 12 scheduler tests (FlowMapEulerDiscreteScheduler)
  - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load)
The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install that matches the diffusers main branch's transformers >= compatibility floor.
The reference slices for slow integration tests (real GPU + 1.3B/14B checkpoints) are intentionally left as TODO stubs to be captured by the user on a real GPU machine before opening the PR.
…torials
Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded
transformer_dtype = torch.bfloat16 with self.transformer.dtype, so
pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a
dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in
_build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals
and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the
pipeline manages KV cache directly, the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)`
with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the
stdlib logger (warning_once-style) and a module-level basicConfig.
Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial:
drop the fictitious `task_type` / `image` / `video` arguments and document
the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`)
to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's
two-stage LoRA distillation including DMD reverse-divergence supervision
with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the
3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and
Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added
and registered in both `_toctree.yml` files.
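The mode switch documented above (T2V when no context is passed, I2V for a single frame, TV2V for 4n+1 frames) can be summarized with a small sketch. `infer_far_mode` is a hypothetical helper written for illustration only; it works on the number of raw conditioning frames rather than on the `context_sequence` dict itself, and the pipeline's actual dispatch logic may differ.

```python
from typing import Optional


def infer_far_mode(num_context_frames: Optional[int]) -> str:
    """Map the number of raw conditioning frames to a generation mode.

    None  -> text-to-video (no context_sequence passed)
    1     -> image-to-video
    4n+1  -> text+video-to-video (n >= 1)
    """
    if num_context_frames is None:
        return "t2v"
    if num_context_frames == 1:
        return "i2v"
    if num_context_frames > 1 and (num_context_frames - 1) % 4 == 0:
        return "tv2v"
    raise ValueError(
        f"Expected 1 or 4n+1 conditioning frames, got {num_context_frames}."
    )


modes = [infer_far_mode(n) for n in (None, 1, 5, 9)]
```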
Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes
with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).
Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.
Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos
(push to Hub manually before opening the PR)
… output
Round 2 of review feedback. Three groups of changes; transformer state-dict
keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact
validation remains valid.
A. Pipeline rename (mechanical, no behavior change):
* Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers
usually means an attention mask; AnyFlow's variant is FAR autoregressive,
so the FAR name is more specific and matches the paper).
* File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
* Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
* All references updated in src/, tests/, docs/, scripts/, plus stale
anyflowcausalpipeline anchor links in tutorial markdown.
B. Pipeline test bug fixes (closes 19 fast-test failures reported by
precision-validation reviewer):
* pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets
self._num_timesteps = num_inference_steps before the rollout, so the
PipelineTesterMixin callback tests can read pipe.num_timesteps.
* tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious
task_type="t2v" kwarg that crashed every causal fast test (the FAR
pipeline selects mode via context_sequence, not a task_type arg).
C. Transformer architecture cleanups (review-driven, no tensor changes):
* Replace forward(*args, **kwargs) dispatcher with an explicit signature
listing every supported kwarg (hidden_states, timestep, r_timestep,
encoder_hidden_states, encoder_hidden_states_image, chunk_partition,
clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal,
attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile
tracing.
* Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput
(BaseOutput dataclass with sample + kv_cache fields) for the two causal
paths that need to also propagate kv_cache (_forward_inference and the
newly return_dict-aware _forward_cache). _forward_train and
_forward_bidirection now consistently return Transformer2DModelOutput.
Pipeline call sites already use return_dict=False with positional
unpacking, so the fix is transparent there.
Out of scope (deferred until canonical-org HF metadata sync):
* Splitting AnyFlowTransformer3DModel into a bidi class plus an
AnyFlowFARTransformer3DModel subclass — touches register_to_config keys
and would require updating model_index.json on every released checkpoint.
* Promoting chunk_partition from register_to_config to a forward-time
argument (same reason).
* Renaming training_rollout to _denoise — would break callers in the
FAR-Dev on-policy trainer that produced the released checkpoints.
Local fast tests: 21/21 still pass (12 scheduler + 9 transformer).
ruff check, ruff format, and check_copies.py are all clean.
…nk_partition to FAR fast-test fixture
Two root causes for the 19 remaining PipelineTesterMixin failures, identified
by the H200 reviewer:
1. callback_on_step_end was accepted by __call__ but never invoked. Both
pipelines pass it through to training_rollout (and FAR additionally through
inference()), and inference_range now fires it after scheduler.step in
the standard inference branch:
    if callback_on_step_end is not None:
        callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
        callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
        latents = callback_outputs.pop("latents", latents)
        prompt_embeds = ...
        negative_prompt_embeds = ...
`nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite
the closure-captured embeddings, matching upstream WanPipeline semantics.
The 3-segment grad_timestep training rollout does not invoke the callback;
it is intentionally training-only.
2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built
the dummy transformer without a `chunk_partition`, leaving it None on the
model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`.
Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame
each, matching the test's num_frames=9 -> 3 latent frames).
Local fast tests: 21/21 still pass.
ruff check, ruff format, and check_copies.py are all clean.
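The callback contract described in root cause 1 can be exercised with a self-contained sketch. Everything here is illustrative: `denoise_loop` is a hypothetical stand-in for the pipeline's inner loop, lists replace tensors, and halving the values replaces `scheduler.step`. Only the call shape `callback(pipe, i, t, callback_kwargs)` and the `pop("latents", latents)` pattern are taken from the commit message.

```python
from typing import Callable, Dict, List, Optional, Sequence

CallbackFn = Callable[[object, int, float, Dict[str, List[float]]], Dict[str, List[float]]]


def denoise_loop(
    latents: List[float],
    timesteps: Sequence[float],
    callback_on_step_end: Optional[CallbackFn] = None,
    callback_on_step_end_tensor_inputs: Sequence[str] = ("latents",),
    pipe: object = None,
) -> List[float]:
    for i, t in enumerate(timesteps):
        # Stand-in for `scheduler.step`: halve every value.
        latents = [0.5 * x for x in latents]

        if callback_on_step_end is not None:
            available = {"latents": latents}
            callback_kwargs = {k: available[k] for k in callback_on_step_end_tensor_inputs}
            callback_outputs = callback_on_step_end(pipe, i, t, callback_kwargs)
            # The callback may return a replacement; otherwise keep the current value.
            latents = callback_outputs.pop("latents", latents)
    return latents


calls: List[int] = []


def zero_on_last_step(pipe, step, timestep, callback_kwargs):
    """Record each invocation and zero the latents on the final step."""
    calls.append(step)
    if step == 2:
        callback_kwargs["latents"] = [0.0 for _ in callback_kwargs["latents"]]
    return callback_kwargs


final = denoise_loop([4.0, -8.0], [1000.0, 500.0, 250.0], zero_on_last_step)
```

A callback that was accepted but never invoked would leave `calls` empty, which is the failure mode the fix addresses.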
…ig + rename helpers
Major architectural refactor that aligns the integration with diffusers conventions
ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and
tensor flow are unchanged so the H200 bit-exact validation remains valid; only
the on-disk transformer/config.json fields move.
Changes:
1. **Sibling transformer classes** replace the flag-driven single class:
* AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size /
full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition
kwargs (always-on for AnyFlow distilled checkpoints).
* AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward
paths (train / cache-prefill / autoregressive inference).
* AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by
the old setup_flowmap_model bootstrap) is removed; both classes now build
AnyFlowDualTimestepTextImageEmbedding directly in __init__.
* setup_flowmap_model / setup_far_model methods are removed; weight warm-start
for far_patch_embedding (trilinear interpolation from patch_embedding) moves
into AnyFlowFARTransformer3DModel.__init__.
2. **chunk_partition** is no longer a model config field. The FAR pipeline owns
the schedule:
* AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]
matches the released 81-frame NVIDIA checkpoints.
* AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition
argument that overrides the default for non-default num_frames.
3. **training_rollout -> _denoise_rollout** rename across both pipelines and all
English / Chinese docs that referenced it. Signals the method is internal to
the pipeline driver, not a public training API.
4. **Conversion script + tests + docs + registries**:
* scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right
transformer class per variant; init_far_model / init_flowmap_model /
chunk_partition kwargs are removed from the from_pretrained call.
* Transformer test file split into AnyFlowTransformer3DModelTest and
AnyFlowFARTransformer3DModelTest classes.
* Pipeline test fixtures use the right class and pass chunk_partition via
get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
* New docs page docs/source/en/api/models/anyflow_far_transformer3d.md;
anyflow_transformer3d.md rewritten for the bidi-only class.
* AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py,
src/diffusers/models/__init__.py, models/transformers/__init__.py and the
dummy_pt_objects.py stubs.
* docs/source/en/_toctree.yml: new entry for the FAR transformer page.
5. **Cleanups**:
* Pipeline __call__ no longer passes is_causal=False to the bidi forward (the
bidi class doesn't accept it).
* Pipeline class docstrings drop stale references to init_*_model flags.
Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes).
ruff check / format / check_copies clean.
Hub artifacts (model_index.json, transformer/config.json, scheduler config) need
to be regenerated for the released checkpoints; the HF update guide will be
delivered separately.
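The relationship between `num_frames` and a chunk schedule can be checked with a little arithmetic. The sketch below assumes a temporal VAE stride of 4 (consistent with the 9-frame test mapping to 3 latent frames) and assumes that a valid `chunk_partition` must sum to the latent frame count; both helper names are hypothetical and the pipeline's own validation may differ.

```python
from typing import Sequence


def num_latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent frame count for a 4n+1-frame video (assumed stride of 4)."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be {temporal_stride}n + 1, got {num_frames}.")
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(chunk_partition: Sequence[int], num_frames: int) -> None:
    """Raise if the chunk sizes do not cover every latent frame exactly once."""
    expected = num_latent_frames(num_frames)
    if sum(chunk_partition) != expected:
        raise ValueError(
            f"chunk_partition sums to {sum(chunk_partition)}, "
            f"but {num_frames} frames give {expected} latent frames."
        )


default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]

check_chunk_partition(default_chunk_partition, num_frames=81)  # 21 latent frames
check_chunk_partition([1, 1, 1], num_frames=9)                 # 3 latent frames
```

Under these assumptions, a caller who changes `num_frames` needs to pass a matching `chunk_partition` override, which is what the new `__call__` argument allows.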
…models.md
Hard violations (per official diffusers guidelines):
* drop einops dependency: replace 25+ rearrange() calls with native permute/reshape/unflatten in transformer + both pipelines
* device-gate torch.float64: apply_rotary_emb and AnyFlowRotaryPosEmbed now fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt per-device via _build_freqs (matches transformer_wan / transformer_flux pattern)
* migrate attention to dispatch_attention_fn: replace direct F.scaled_dot_product_attention calls with dispatch_attention_fn (works with sage / flash / native backends); introduce AnyFlowAttention(AttentionModuleMixin) with _default_processor_cls / _available_processors; rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and declare _attention_backend / _parallel_config class attrs
* drop dead config fields: qk_norm and added_kv_proj_dim are pruned from both transformer __init__ signatures and AnyFlowTransformerBlock; AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme the released checkpoints use) and has no add_k_proj path (T2V only)
* add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer classes for compile_repeated_blocks() support (matches Wan)
* annotate prepare_latents with `# Copied from diffusers.pipelines.wan.pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site
State-dict keys are preserved (legacy Attention had identical to_q / to_k / to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load bit-exactly into the new AnyFlowAttention class.
The HF Hub config-update guide is updated correspondingly: transformer/config.json now drops qk_norm and added_kv_proj_dim alongside the previous init_far_model / init_flowmap_model / chunk_partition removals.
22 fast CPU tests still pass; ruff format / ruff check / check_copies all clean.
…/head-dim fallbacks + KV-cache dtype + num_timesteps
Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR
causal path still calls flex_attention directly, which has hard requirements
(CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy
components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact
numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward
0.00e+00, backward kernel-nondet only, ratio 1.000).
Code fixes:
1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now
short-circuit to an empty tensor when num_frames / height / width is 0.
PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw
spatial input becomes a 2x2 latent which then floors to 0 against
compressed_patch_size=(1, 4, 4); the original
`freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.
2. flex_attention dispatch: split the module-load
`torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager`
(always available) plus `_flex_attention_compiled`, with a tiny wrapper
that picks compiled for CUDA tensors and eager for CPU. Avoids
torch._inductor C++ codegen failures that broke fast tests after
`pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on
bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).
3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16
(flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass
`scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows
contribute 0, so trimming the output back is mathematically equivalent.
Released ckpts use head_dim=128 so the branch is never taken in production.
4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded
`latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded
bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and
bias type (float) should be the same"); real bf16 ckpts are unaffected.
5. pipeline_anyflow_far._denoise_rollout sets
`self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps`
before the chunk loop, so PipelineTesterMixin.test_callback_cfg's
`pipe.num_timesteps`-based assertion matches the actual number of callback
fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.
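The head-dim padding argument in fix 3 can be verified numerically. The sketch below uses a plain-Python softmax attention as a stand-in for flex_attention (no masking, a single head, hypothetical helper names): zero-padding q/k/v from head_dim 4 to 16 while keeping `scale = 1/sqrt(4)` leaves the attention weights unchanged, and the padded output columns are all zero, so trimming recovers the original result.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix, scale: float) -> Matrix:
    """Unmasked softmax attention: softmax(scale * q @ k^T) @ v."""
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) / total for d in range(len(v[0]))]
        )
    return out


def zero_pad(rows: Matrix, dim: int) -> Matrix:
    """Right-pad every row with zeros up to `dim` entries."""
    return [row + [0.0] * (dim - len(row)) for row in rows]


head_dim, min_dim = 4, 16
q = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.6, 0.7, 0.8]]
k = [[0.9, -0.1, 0.2, 0.3], [-0.4, 0.5, 0.6, -0.7], [0.2, 0.2, 0.2, 0.2]]
v = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]

scale = 1.0 / math.sqrt(head_dim)  # scale from the ORIGINAL head_dim, not 16
reference = attention(q, k, v, scale)
padded = attention(zero_pad(q, min_dim), zero_pad(k, min_dim), zero_pad(v, min_dim), scale)
trimmed = [row[:head_dim] for row in padded]
```

Passing the original scale explicitly matters: letting the kernel default to `1/sqrt(16)` would change the softmax temperature and break the equivalence.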
Tests:
* test_callback_inputs cannot pass without changing FAR's chunk-wise output
semantics — it zeroes latents on the final step and asserts the *entire*
output buffer is zero, but only the active chunk's slice is overwritten in
a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale;
callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ +
tests/models/transformers/test_models_transformer_anyflow.py +
tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed,
0 failed, 11 skipped.
Quality gates:
* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.
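The `_num_timesteps` formula from fix 5 is easy to sanity-check by counting callback fires in a toy chunk-wise loop. The function names and the loop structure below are assumptions for illustration; only the formula itself comes from the commit message.

```python
from typing import Sequence


def far_num_timesteps(
    chunk_partition: Sequence[int],
    num_context_chunks: int,
    num_inference_steps: int,
) -> int:
    """Total denoising steps: one full schedule per generated (non-context) chunk."""
    return (len(chunk_partition) - num_context_chunks) * num_inference_steps


def count_callback_fires(
    chunk_partition: Sequence[int],
    num_context_chunks: int,
    num_inference_steps: int,
) -> int:
    """Toy rollout: context chunks are skipped, every other chunk runs all steps."""
    fires = 0
    for chunk_idx in range(len(chunk_partition)):
        if chunk_idx < num_context_chunks:
            continue  # conditioning chunk, nothing to denoise
        for _ in range(num_inference_steps):
            fires += 1  # the callback would fire here, after scheduler.step
    return fires


t2v_steps = far_num_timesteps([1, 1, 1], num_context_chunks=0, num_inference_steps=2)
i2v_steps = far_num_timesteps([1, 3, 3, 3, 3, 3, 3, 2], num_context_chunks=1, num_inference_steps=4)
```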
User-facing alignment with the official HF Hub model card and the day-of-announcement materials at https://huggingface.co/collections/nvidia/anyflow.
* Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries).
* Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers copy uses the same Video-to-Video terminology as the official model card.
* Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) HF collection link to the three tutorial intros.
* Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page / ZH tutorial: the nvidia/AnyFlow-*-Diffusers repos are now live.
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project page) in place of the prior <github-org> / <project-page-url> placeholders.
* Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA affiliation in the main tutorial, API pipeline page, and both transformer model pages; BibTeX uses the standard `and others` to elide the full list until the next pass.
Working tree, CI gates, and tests after the change:
ruff format --check ✓
ruff check ✓
python utils/check_copies.py ✓
python utils/check_dummies.py ✓
pytest tests/models + tests/schedulers (22 fast) ✓
No production code logic changes: only docstring wording inside pipeline files (TV2V → V2V).
Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and
Fang, Guian and others}, ...}`` block in both the English and Chinese
tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion,
...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors:
Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai,
Mike Zheng Shou.
Docs-only.
Scheduler
- FlowMapEulerDiscreteScheduler.step now returns a FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False) and uses the conventional positional order (model_output, timestep, sample, r_timestep).
- Drop training-only helpers: adaptive_weighting, set_train_weight, get_train_weight, linear_timesteps_weights, and the weight_type config field.
- Add scale_model_input no-op for API parity; raise ValueError on missing r_timestep.
Transformer
- Remove gate_track debug write inside AnyFlowDualTimestepTextImageEmbedding.forward_timestep.
- Compile flex_attention lazily on first CUDA call instead of at import time.
- Replace assert with ValueError in build_block_mask.
- Resolve <arxiv-id> placeholders to 2605.13724.
Pipelines (AnyFlowPipeline + AnyFlowFARPipeline)
- Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__ docstrings covering every argument.
- Move use_mean_velocity from __init__ to __call__ so save/load round-trips.
- Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout), the inner inference_range closure, and the redundant negative-prompt concat.
- Replace asserts with ValueError; wire show_progress to tqdm; rename inference -> _inference; remove dead current_timestep property.
- Update scheduler.step call sites to the new signature.
- Trim class docstrings to inference-only language.
Pipeline output
- Add Apache 2.0 license header; switch to relative import.
Auto pipeline / conversion script
- Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and AUTO_VIDEO2VIDEO_PIPELINES_MAPPING.
- Document the weights_only=False requirement in the conversion script.
Tests
- Scheduler tests use the new step signature and verify the Output dataclass contract.
- Drop the four obsolete training-weight tests; drop weight_type kwarg from pipeline test fixtures; remove internal milestone names from TODO comments.
Docs
- Resolve <arxiv-id> in the scheduler docs page.
- Trim DMD / on-policy distillation language in EN/ZH tutorials and the pipelines page; the paper abstract quote is preserved verbatim.
…all_docstrings main huggingface#13758 added utils/check_forward_call_docstrings.py which requires every signature arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR transformer forward docstrings to list each parameter individually.
Suggested change:
- _attention_backend = None
- _parallel_config = None
- _SUPPORTED_BACKENDS = (None, "flex", "_native_flex")
+ _attention_backend = "flex"
+ _parallel_config = None
+ _SUPPORTED_BACKENDS = ("flex", "_native_flex")
I think setting the default _attention_backend to "flex" rather than None and removing None from the _SUPPORTED_BACKENDS is cleaner, as only Flex Attention backends are compatible with AnyFlowCausalAttnProcessor. (Using None would generally default to the "native" backend, which isn't compatible.)
Done in ffdc969 — _attention_backend = "flex" default and _SUPPORTED_BACKENDS = ("flex", "_native_flex"). Caught a real bug while verifying: the previous None default would silently fall through to SDPA on backends that ignore BlockMask (visible on mps locally — now raises loudly instead of returning wrong outputs).
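The "raise loudly instead of silently falling back" behavior discussed in this thread can be sketched in a few lines. The class and method names below are hypothetical and only mirror the two class attributes from the suggestion; the real processor's validation may be structured differently.

```python
from typing import Optional, Tuple


class CausalAttnProcessorSketch:
    """Illustrative processor that only accepts flex-attention backends."""

    _attention_backend: Optional[str] = "flex"
    _SUPPORTED_BACKENDS: Tuple[str, ...] = ("flex", "_native_flex")

    def resolve_backend(self) -> str:
        backend = self._attention_backend
        if backend not in self._SUPPORTED_BACKENDS:
            # A None backend would otherwise resolve to a default that may
            # ignore the block mask, producing wrong outputs without an error.
            raise ValueError(
                f"Unsupported attention backend {backend!r}; "
                f"expected one of {self._SUPPORTED_BACKENDS}."
            )
        return backend


processor = CausalAttnProcessorSketch()
default_backend = processor.resolve_backend()

processor._attention_backend = "_native_flex"
explicit_backend = processor.resolve_backend()
```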
Suggested change:
  dropout_p=0.0,
  is_causal=False,
  scale=scale,
- backend="flex",
+ backend=self._attention_backend,
Follow up to #13745 (comment): using self._attention_backend instead of hardcoding flex here allows us to use other supported backends such as _native_flex.
Done in ffdc969 — backend=self._attention_backend so _native_flex can be selected explicitly.
Suggested change:
  # complex128, so we downcast to complex64 there.
  self._freqs_cache: Optional[Tuple[Any, torch.Tensor]] = None

- def _build_freqs(self, device: torch.device) -> torch.Tensor:
+ # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._build_freqs
+ def _build_freqs(self, device: torch.device) -> torch.Tensor:
I think _build_freqs should be the same for both the causal and non-causal RoPE embedding modules, so sync their implementations.
Done in ffdc969 — added # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._build_freqs. make fix-copies runs clean.
Suggested change:
  freqs = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1)
  return freqs

- def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:
+ # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._forward_full_frame
+ def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:
Similarly, I think _forward_full_frame should be the same between the causal and non-causal RoPE modules.
Done in ffdc969 — same pattern (# Copied from for _forward_full_frame).
Suggested change:
- Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]`. When provided, the pipeline
- VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
- with `video_latents`.
+ Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]`. When provided, the pipeline
+ VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
+ with `video_latents`.
I think this needs to be updated as VideoProcessor.preprocess_video expects 5D torch.Tensor inputs to have shape [B, T, C, H, W] instead of [B, C, T, H, W].
Done in ffdc969 — docstring + EXAMPLE_DOC_STRING flipped to (B, T, C, H, W) everywhere in both pipelines. Good catch — video_processor.preprocess_video's 5D contract is (B, T, C, H, W), so the previous (B, C, T, H, W) doc would have silently broken users who followed it literally.
# 6. Encode conditioning frames (or accept pre-encoded latents).
if video is not None and video_latents is not None:
    raise ValueError("Provide either `video` or `video_latents`, not both.")
if video is not None:
Can we move this check to check_inputs so that we fail earlier?
Done in 7a6643b — both bidi and FAR pipelines now do the video / video_latents mutual-exclusion check inside check_inputs. The FAR-specific (num_frames - 1) % 4 == 0 constraint moved there too, so both fail before any work runs.
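The early-failure checks described in this reply can be sketched as a standalone function. This is a hypothetical simplification: the real `check_inputs` methods take many more arguments, and the `require_4n_plus_1` flag is an invention here to show the FAR-specific constraint alongside the shared one.

```python
from typing import Optional


def check_inputs(
    video: Optional[object] = None,
    video_latents: Optional[object] = None,
    num_frames: int = 81,
    require_4n_plus_1: bool = True,
) -> None:
    """Validate arguments before any encoding or sampling work starts."""
    if video is not None and video_latents is not None:
        raise ValueError("Provide either `video` or `video_latents`, not both.")
    if require_4n_plus_1 and (num_frames - 1) % 4 != 0:
        raise ValueError(f"`num_frames - 1` must be divisible by 4, got num_frames={num_frames}.")


def fails(**kwargs) -> bool:
    """Return True if check_inputs rejects the given arguments."""
    try:
        check_inputs(**kwargs)
    except ValueError:
        return True
    return False


results = {
    "valid": fails(video="frames", num_frames=81),
    "both_sources": fails(video="frames", video_latents="latents"),
    "bad_frame_count": fails(num_frames=80),
}
```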
Suggested change:
- @torch.no_grad()
- @torch.no_grad()
- def encode_video(self, video: torch.Tensor, height: int, width: int) -> torch.Tensor:
+ @torch.no_grad()
+ def encode_video(self, video: torch.Tensor, height: int, width: int) -> torch.Tensor:
nit: remove extra @torch.no_grad() decorator.
Done in ffdc969 — duplicate @torch.no_grad() removed. (Per the bot follow-up, the non-duplicate @torch.no_grad() was also dropped from bidi encode_video and FAR encode_kv_cache since __call__ already wraps the no-grad scope.)
Suggested change:
- >>> # Single-frame I2V: wrap the conditioning image as a (1, 3, 1, H, W) tensor in [0, 1].
- >>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
- >>> arr = np.asarray(first_frame).astype("float32") / 255.0
- >>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(2).to("cuda")
+ >>> # Single-frame I2V: wrap the conditioning image as a (1, 1, 3, H, W) tensor in [0, 1].
+ >>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
+ >>> arr = np.asarray(first_frame).astype("float32") / 255.0
+ >>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
For the same reason as #13745 (comment), we should input a [B, T, C, H, W] rather than a [B, C, T, H, W] video tensor.
Done in ffdc969 — unsqueeze(0).unsqueeze(1) to produce the (1, 1, 3, H, W) shape per the suggestion.
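The difference between the two `unsqueeze` chains is easiest to see by tracking shapes alone. The sketch below reimplements `permute` and `unsqueeze` on plain shape tuples (hypothetical helpers, no tensors involved) and assumes the image array is laid out as (H, W, C) = (480, 832, 3), as it would be after resizing to 832x480.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def permute(shape: Shape, dims: Sequence[int]) -> Shape:
    """Shape after reordering axes, mirroring tensor.permute(*dims)."""
    return tuple(shape[d] for d in dims)


def unsqueeze(shape: Shape, dim: int) -> Shape:
    """Shape after inserting a size-1 axis at `dim`, mirroring tensor.unsqueeze(dim)."""
    return shape[:dim] + (1,) + shape[dim:]


hwc = (480, 832, 3)              # image array: height, width, channels
chw = permute(hwc, (2, 0, 1))    # channels first

# Corrected chain: batch axis at 0, then time axis at 1 -> (B, T, C, H, W).
btchw = unsqueeze(unsqueeze(chw, 0), 1)

# Previous chain: time axis at 2 -> (B, C, T, H, W), the layout the review flags.
bcthw = unsqueeze(unsqueeze(chw, 0), 2)
```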
Suggested change:
- video (`torch.Tensor`, *optional*):
- Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
- pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
- exclusive with `video_latents`.
+ video (`torch.Tensor`, *optional*):
+ Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
+ pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
+ exclusive with `video_latents`.
Analogous suggestion to #13745 (comment) for the FAR causal pipeline.
if show_progress:
    chunk_iter = tqdm(chunk_iter)
Instead of using a show_progress argument here, I think we should use nested progress bars like LLaDA-2 does. We can define an outer progress bar (diffusers/src/diffusers/pipelines/llada2/pipeline_llada2.py, lines 426 to 428 in f502538) and an inner progress bar (diffusers/src/diffusers/pipelines/llada2/pipeline_llada2.py, lines 444 to 450 in f502538) using the pipeline's _progress_bar_config and appropriate arguments to make sure that the inner progress bars don't stack up. This should respect any configuration set through DiffusionPipeline.set_progress_bar_config better (for example, using pipe.set_progress_bar_config(disable=None) to disable the progress bars).
Done in 7a6643b — show_progress argument removed; replaced with nested tqdm bars in the LLaDA-2 pattern (outer Chunks at position=0, inner Inference Steps per chunk at position=1, leave=False). Both pick up DiffusionPipeline._progress_bar_config, so set_progress_bar_config(disable=None) etc. now work as expected.
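For readers who have not seen the nested-bar pattern, here is a minimal sketch of the idea using plain tqdm. The function name `run_chunks` and the `progress_bar_config` argument are hypothetical stand-ins for the pipeline's `_progress_bar_config`; the actual pipeline code may be organized differently.

```python
from tqdm import tqdm


def run_chunks(num_chunks, steps_per_chunk, progress_bar_config=None):
    """Run a dummy chunked loop with an outer and an inner progress bar.

    `progress_bar_config` is a hypothetical stand-in for the dict a pipeline
    would store via set_progress_bar_config; it is forwarded to both bars.
    It must not contain `desc`, `position`, or `leave`, which are set here.
    """
    config = dict(progress_bar_config or {})
    completed = 0
    # Outer bar: one tick per autoregressive chunk, pinned to the first row.
    for _ in tqdm(range(num_chunks), desc="Chunks", position=0, **config):
        # Inner bar: one tick per denoising step, drawn on the second row.
        # leave=False clears it when the chunk finishes so bars do not stack up.
        for _ in tqdm(
            range(steps_per_chunk),
            desc="Inference Steps",
            position=1,
            leave=False,
            **config,
        ):
            completed += 1  # placeholder for the real per-step work
    return completed


total = run_chunks(num_chunks=3, steps_per_chunk=4, progress_bar_config={"disable": True})
```

Because both bars receive the same config, a single `disable=True` silences them together. In tqdm itself, `disable=None` disables the bar only when the output is not a TTY.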
| timestep = timestep / self.config.num_train_timesteps | ||
| r_timestep = r_timestep / self.config.num_train_timesteps |
nit: I think getting the underlying t_sigma and r_sigma corresponding to timestep and r_timestep via something like the logic in _resolve_next_timestep or an internal step_idx like FlowMatchEulerDiscreteScheduler uses:
diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py
Lines 501 to 503 in f502538
would be slightly better here to cover cases where the timesteps and sigmas aren't related through scaling by self.config.num_train_timesteps.
Done across 7a6643b + 84605d5 + 89128cf:
- 7a6643b: introduced index_for_timestep() and rewrote step() to resolve both t_sigma and r_sigma via self.sigmas[idx] lookups. For the shipped linspace + shift schedule this is bit-identical to the previous t / num_train_timesteps formulation (max abs diff = 0.0 on an 8-step replay), but it stays correct under any future schedule whose timestep / sigma mapping isn't strictly linear.
- 84605d5: added the full FlowMatchEulerDiscreteScheduler-style state machine: _step_index, _begin_index, step_index / begin_index properties, set_begin_index, _init_step_index. step() lazily initializes and advances the counter on every call so downstream callbacks / composable schedulers observe it. Sigma resolution stays a pure function of the passed-in (timestep, r_timestep) so step() is idempotent (calling it twice with the same args returns identical prev_sample).
- 89128cf: audit fix. The earlier _init_step_index raised when the first timestep was off-schedule, which contradicted step()'s documented any-step support. _init_step_index now falls back to 0 for off-schedule starts (still a valid observable counter); _resolve_next_timestep was removed since its callers were all inlined.
Bit-exact replay on H200 (random-init bidi + FAR forward, fp32, comparing d0181ea baseline to 84605d5): state_dict missing=0 / unexpected=0, L2 = 0.0e+00, max|Δ| = 0.0e+00.
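To make the sigma-lookup change concrete, the following is a rough pure-Python sketch, not the scheduler's actual code. It assumes the standard shift formula sigma' = shift * sigma / (1 + (shift - 1) * sigma), an update rule x_r = x_t + (sigma_r - sigma_t) * v in the style of FlowMatchEulerDiscreteScheduler, and num_train_timesteps = 1000. The helpers `shifted_sigmas`, `resolve_sigma`, and `euler_flow_map_step` are hypothetical names; only `index_for_timestep` is taken from the thread, and its signature here is simplified.

```python
NUM_TRAIN_TIMESTEPS = 1000  # assumed value for illustration


def shifted_sigmas(num_steps, shift):
    """Linspace sigmas from 1.0 down to 0.0, then apply the shift warp."""
    raw = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in raw]


def index_for_timestep(timesteps, t, tol=1e-6):
    """Return the schedule index of timestep t, or None if t is off-schedule."""
    for i, ts in enumerate(timesteps):
        if abs(ts - t) <= tol:
            return i
    return None


def resolve_sigma(timesteps, sigmas, t):
    """Prefer an index lookup; fall back to linear scaling when off-schedule."""
    idx = index_for_timestep(timesteps, t)
    if idx is not None:
        return sigmas[idx]
    return t / NUM_TRAIN_TIMESTEPS


def euler_flow_map_step(sample, velocity, t, r, timesteps, sigmas):
    """One Euler step from timestep t to target r (assumed update rule)."""
    sigma_t = resolve_sigma(timesteps, sigmas, t)
    sigma_r = resolve_sigma(timesteps, sigmas, r)
    return [x + (sigma_r - sigma_t) * v for x, v in zip(sample, velocity)]


sigmas = shifted_sigmas(num_steps=8, shift=5.0)
timesteps = [s * NUM_TRAIN_TIMESTEPS for s in sigmas]
# One-shot sampling: jump from the noisiest timestep straight to the clean end.
one_shot = euler_flow_map_step([1.0, 2.0], [0.5, 0.5], timesteps[0], timesteps[-1], timesteps, sigmas)
```

On this schedule the timesteps are the sigmas scaled by a constant, so lookup and division agree up to floating-point rounding in this sketch. The lookup only changes the result on a schedule where the two are related non-linearly.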
| @unittest.skip("AnyFlow has no optional components.") | ||
| def test_save_load_optional_components(self): | ||
| pass | ||
|
|
| @unittest.skip("AnyFlow has no optional components.") | |
| def test_save_load_optional_components(self): | |
| pass |
I think we don't need to skip test_save_load_optional_components here as this test actually passes.
| @unittest.skip("AnyFlow has no optional components.") | ||
| def test_save_load_optional_components(self): | ||
| pass | ||
|
|
| @unittest.skip("AnyFlow has no optional components.") | |
| def test_save_load_optional_components(self): | |
| pass |
Analogous comment as #13745 (comment).
|
Hi @claude, can you do a review to see if there is anything I missed? |
|
Claude finished @dg845's task in 5m 22s
Review Complete
Hi @dg845, here's my review of the remaining items. Your review was thorough; I'll focus on things I found that you didn't already flag, and then include a summary of your open items for tracking.
New findings (beyond your existing review comments)
1. Unused Both Per models.md: "Don't declare
2. Misleading The bidirectional transformer (
3. Beyond the
4. Redundant You flagged the duplicate
5. Dead In
6. Public methods In
7. Minor: comment numbering skips step 4 in both pipelines Both
8. Minor: In
Suggestions / additional info: dead code trace
I traced the call paths from both pipelines into the transformers:
Bidirectional pipeline →
FAR pipeline →
Summary of dg845's open review items (for tracking)
These are your existing review comments that appear to still be open based on the latest code:
Overall the PR is well-structured — the transformer split is clean, the scheduler API is sensible, and the |
|
Thanks for the second pass @dg845 — really appreciate the careful read. Working through all 13 now. A quick map of how I'm going to handle them:
I'll re-run the bit-exact replay against |
FAR transformer:
- AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None);
  remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA
  which silently ignored the BlockMask; failing loudly is the right default.
- dispatch_attention_fn call: read self._attention_backend instead of
  hardcoded 'flex', so '_native_flex' selection works.
- _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE.
Pipelines:
- bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to
  match VideoProcessor.preprocess_video.
- FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for
  the T axis instead of unsqueeze(2).
- FAR encode_video: drop duplicated @torch.no_grad() decorator.
Tests:
- test_anyflow / test_anyflow_far: lift the test_save_load_optional_components
  skip (the test actually passes).
- FAR processor smoke test: assert default backend is 'flex' (was 'None').
Pipelines:
- check_inputs accepts video / video_latents and raises early on:
(a) mutual exclusion (was checked late in __call__);
(b) FAR's (num_frames - 1) % 4 == 0 constraint.
__call__ no longer carries duplicate validation.
- FAR pipeline: drop the show_progress kwarg and replace the single tqdm with
nested progress bars in the LLaDA-2 pattern — outer 'Chunks' (position=0)
and per-chunk inner 'Inference Steps' (position=1, leave=False) — both
picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config
controls them, including disable=None).
Scheduler:
- step() resolves source and target sigmas by indexing self.sigmas via the new
index_for_timestep(), instead of dividing the input timesteps by
num_train_timesteps. This keeps the math correct for any future schedule
whose timestep/sigma relationship is non-linear. For an off-schedule
r_timestep the code falls back to r / num_train_timesteps, so explicit
any-step sampling outside the schedule still works (and t off-schedule with
r=None still raises a clear ValueError, as before).
Numerical equivalence: for the shipped linspace+shift schedule the two
formulations are bit-identical (verified: max abs diff = 0.0 over an N=8,
shift=5 schedule).
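The early validation described in the Pipelines part of this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's real check_inputs: the function name `check_video_inputs`, its signature, and the error messages are hypothetical.

```python
def check_video_inputs(video=None, video_latents=None, num_frames=None):
    """Raise early on invalid conditioning inputs (hypothetical helper).

    (a) `video` and `video_latents` are mutually exclusive.
    (b) The FAR pipeline requires (num_frames - 1) % 4 == 0, i.e. T = 4n + 1.
    """
    if video is not None and video_latents is not None:
        raise ValueError("Pass only one of `video` or `video_latents`, not both.")
    if num_frames is not None and (num_frames - 1) % 4 != 0:
        raise ValueError(
            f"`num_frames` must satisfy (num_frames - 1) % 4 == 0, got {num_frames}."
        )


# 81 = 4 * 20 + 1 frames with a single conditioning source passes silently.
check_video_inputs(video=[0.0], num_frames=81)
```

Running the checks before any encoding work means a bad argument combination fails before the VAE or text encoder is touched.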
Finding #1 - attention_kwargs plumbing:
  Both transformers now decorate forward() with
  @apply_lora_scale('attention_kwargs') (matches Wan); pipelines forward
  attention_kwargs to the transformer + encode_kv_cache, and the unused
  parameter is dropped from the inner _forward_train / _forward_cache /
  _forward_inference signatures. Pipeline docstrings updated to the standard
  wording.
Finding #2 - naming:
  Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not
  FAR; the FAR transformer keeps far_cfg, which is accurate there).
Finding #3 - scheduler state machine:
  Add _step_index, _begin_index, step_index property, begin_index property,
  set_begin_index(), _init_step_index(). step() lazily initializes and
  advances the counter so downstream callbacks / composable schedulers can
  observe rollout progress. Sigma resolution remains a pure function of
  (timestep, r_timestep): calling step() twice with identical args still
  returns identical prev_sample (idempotent).
Finding #4 - redundant @torch.no_grad():
  Drop the redundant decorators on bidi pipeline's encode_video and FAR
  pipeline's encode_kv_cache (callers are already in __call__'s no-grad
  scope).
Finding #5 - dead code:
  Remove the unreachable temb.ndim == 2 else branch from the bidi
  transformer's output-norm path (condition_embedder.forward always returns a
  3D temb).
Finding #6 - private rename:
  forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only
  called internally by _forward_train / _forward_cache / _forward_inference).
Finding #7 - pipeline comment numbering:
  Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped.
Finding #8 - mask-mod comment numbering:
  _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...).
Tests:
- New test_step_index_advances + test_set_begin_index_anchors_step_index in
  the scheduler test file exercise the new state machine.
- All existing pipeline / transformer / scheduler tests still pass (85 passed,
  85 skipped on CPU).
Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the
new sigma-lookup is byte-identical to t/num_train_timesteps on this schedule).
…; drop dead _resolve_next_timestep
Audit caught two issues in the previous scheduler commit:
1. The new state machine raised in _init_step_index whenever the first
   timestep wasn't on the active schedule, contradicting the documented
   contract that step() falls back to t/num_train_timesteps for off-schedule
   any-step sampling. The fall-back numerics were intact but they were
   unreachable: the init check fired first.
   Fix: _init_step_index now initializes _step_index to 0 when the timestep is
   off-schedule (still a valid observable counter for callbacks). step()'s
   sigma resolution is untouched, so on-schedule rollouts stay bit-exact and
   off-schedule any-step sampling actually runs again.
   Regression test: test_step_off_schedule_anystep_supported.
2. _resolve_next_timestep had no remaining callers after the step() rewrite
   inlined the same lookup. Removed (private helper, no external API).
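The step-counter behaviour discussed in this commit can be pictured with a small standalone sketch. The attribute and method names mirror those mentioned in the thread, but the class `StepCounter` itself is hypothetical and leaves out all sigma math. It only shows lazy initialization, the begin-index anchor, and the fall-back to 0 for an off-schedule start.

```python
class StepCounter:
    """Toy model of the scheduler's step-index state machine (illustrative only)."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._step_index = None
        self._begin_index = None

    @property
    def step_index(self):
        return self._step_index

    @property
    def begin_index(self):
        return self._begin_index

    def set_begin_index(self, begin_index=0):
        # Anchor the counter explicitly instead of inferring it from a timestep.
        self._begin_index = begin_index

    def _init_step_index(self, timestep):
        if self._begin_index is not None:
            self._step_index = self._begin_index
        elif timestep in self.timesteps:
            self._step_index = self.timesteps.index(timestep)
        else:
            # Off-schedule start: fall back to 0 rather than raising, so that
            # any-step sampling with arbitrary timesteps keeps working.
            self._step_index = 0

    def step(self, timestep):
        if self._step_index is None:
            self._init_step_index(timestep)  # lazy initialization on first call
        self._step_index += 1
        return self._step_index


counter = StepCounter([1000.0, 750.0, 500.0, 250.0])
counter.step(900.0)  # 900.0 is not on the schedule; the counter starts from 0
```

The counter is bookkeeping for callbacks only. In the design described above, the sample update depends solely on the timesteps passed to step(), not on this index.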
- en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W);
example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2)
to match VideoProcessor.preprocess_video's 5D contract.
- zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V
examples from the obsolete context_sequence={...} dict to the current
video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W);
add a note about mutual exclusion.
|
Hi @dg845 @sayakpaul, second-pass review fully addressed. Per-thread replies are inline; this is the high-level summary.
dg845's 13 review threads: all applied
Claude bot follow-up: also applied
Bit-exact validation (H200, fp32, random-init bidi + FAR)
Replay comparing
So the entire second-pass refactor is provably numerically equivalent on the shipped linspace + shift schedule. The earlier
Commits this round
Ready when you are. Happy to iterate further if anything's still off. |
.ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1 yet.' Our two integration test classes were shape-only assertions with TODOs for a future numeric reference, so dropping them loses no actual coverage — the relevant rollouts are covered by H200 bit-exact replay outside the pytest suite. Can land a follow-up PR after merge with proper numeric reference slices once the maintainer is comfortable enabling slow tests.
|
@bot /style |
|
Style bot fixed some files and pushed the changes. |
What does this PR do?
This PR adds pipelines for AnyFlow (paper, project page, official code, model weights), an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, 32 NFE without retraining, and quality scales monotonically with steps — unlike consistency-based distillation, which often degrades as NFE grows.
Two new pipelines are added, both on top of a new FlowMapEulerDiscreteScheduler and reusing WanLoraLoaderMixin:
- AnyFlowPipeline → AnyFlowTransformer3DModel: bidirectional text-to-video built on the Wan2.1 backbone with an AnyFlowDualTimestepTextImageEmbedding conditioning on the source/target timestep pair (t, r).
- AnyFlowFARPipeline → AnyFlowFARTransformer3DModel: frame-level autoregressive variant (block-sparse causal flex_attention + KV cache + compressed-frame patch embedding) jointly handling T2V / I2V / V2V through one context_sequence argument.
Four checkpoints are released under the nvidia/anyflow collection (Wan2.1-T2V-{1.3B,14B} bidi + FAR-Wan2.1-{1.3B,14B} causal). All four have been validated bit-exact against the official NVlabs/AnyFlow reference on H200: forward L2 = 0.00e+00 for scheduler / transformer / bidi pipeline / FAR pipeline; backward grad delta is 4.88e-04, attributable to bf16 kernel non-determinism only (PR-vs-PR = PR-vs-reference, ratio 1.000); inference latency matches the reference at ±0.0% on both pipelines.
T2V inference example:
I2V inference example with the FAR pipeline (single conditioning frame → autoregressive rollout):
Documentation: EN tutorial at docs/source/en/using-diffusers/anyflow.md, ZH tutorial at docs/source/zh/using-diffusers/anyflow.md, and three API pages (pipelines + two transformer model pages).
Tests: 22 fast tests (transformer + scheduler, CPU) plus four pipeline test files, with slow integration tests gated on RUN_SLOW=1 @require_torch_accelerator for the released checkpoints.
anyflow-pr-presentation.mp4
Before submitting
Who can review?
@yiyixuxu @asomoza