
Add modular pipeline for HunyuanVideo 1.5#13389

Merged
yiyixuxu merged 19 commits into huggingface:main from akshan-main:modular-hunyuan1.5
Apr 16, 2026

Conversation

@akshan-main
Contributor

@akshan-main akshan-main commented Apr 2, 2026

What does this PR do?

Adds modular pipeline blocks for HunyuanVideo 1.5 with both text-to-video (HunyuanVideo15Blocks) and image-to-video (HunyuanVideo15Image2VideoBlocks).

Parity verified on Colab G4 GPU:

  • T2V: MAD 0.000000 vs HunyuanVideo15Pipeline
hv15_t2v_standard.mp4
hv15_t2v_modular.mp4
T2V reproduction code
import gc
import numpy as np
import torch
from diffusers import (
    HunyuanVideo15Pipeline,
    HunyuanVideo15ImageToVideoPipeline,
    HunyuanVideo15Blocks,
    HunyuanVideo15ModularPipeline,
)
from diffusers.utils import load_image, export_to_video

device = "cuda"
dtype = torch.bfloat16

T2V_ID = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v"
I2V_ID = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v"

def to_np(x):
    if hasattr(x, "frames"):
        x = x.frames
    if isinstance(x, list):
        x = np.array(x)
    if isinstance(x, torch.Tensor):
        x = x.float().cpu().numpy()
    return x
prompt = "A cinematic drone shot over snowy mountains at sunrise."

print("=== Standard T2V ===")

ref_pipe = HunyuanVideo15Pipeline.from_pretrained(T2V_ID, torch_dtype=dtype).to(device)
g = torch.Generator(device=device).manual_seed(1234)
ref_out = ref_pipe(prompt=prompt, num_frames=55, num_inference_steps=6, generator=g, output_type="np").frames
print(f"Shape: {np.array(ref_out).shape}")
export_to_video(ref_out[0], "/content/hv15_t2v_standard.mp4", fps=24)
del ref_pipe; gc.collect(); torch.cuda.empty_cache()



print("\n=== Modular T2V ===")
blocks = HunyuanVideo15Blocks()
pipe = blocks.init_pipeline(T2V_ID)
pipe.load_components(torch_dtype=dtype)
pipe.to(device)

print("Guider type:", type(pipe.guider).__name__)
print("Guider scale:", pipe.guider.guidance_scale)
print("Guider enabled:", pipe.guider._enabled)
print("Guider num_conditions:", pipe.guider.num_conditions)
g = torch.Generator(device=device).manual_seed(1234)
mod_out = pipe(prompt=prompt, num_frames=55, num_inference_steps=6, generator=g, output="videos", output_type="np")
print(f"Shape: {np.array(mod_out).shape}")
export_to_video(mod_out[0], "/content/hv15_t2v_modular.mp4", fps=24)

diff = np.abs(to_np(ref_out).astype(float) - to_np(mod_out).astype(float)).mean()
print(f"\nT2V MAD: {diff:.6f}")
del pipe, blocks; gc.collect(); torch.cuda.empty_cache()
  • I2V: MAD 0.000000 vs HunyuanVideo15ImageToVideoPipeline
hv15_i2v_standard.mp4
hv15_i2v_modular.mp4
I2V reproduction code
# Continues the same session as the T2V snippet above (reuses np, torch,
# load_image, export_to_video, device, dtype, I2V_ID, and to_np).
from diffusers.modular_pipelines import HunyuanVideo15Blocks, HunyuanVideo15Image2VideoBlocks, HunyuanVideo15ModularPipeline

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png").convert("RGB")

print("=== Standard I2V ===")
ref_pipe = HunyuanVideo15ImageToVideoPipeline.from_pretrained(I2V_ID, torch_dtype=dtype).to(device)
g = torch.Generator(device=device).manual_seed(1234)
ref_out = ref_pipe(image=image, prompt="A cat turns its head", num_frames=55, num_inference_steps=6, generator=g, output_type="np").frames
print(f"Shape: {np.array(ref_out).shape}")
export_to_video(ref_out[0], "/content/hv15_i2v_standard.mp4", fps=24)
del ref_pipe; gc.collect(); torch.cuda.empty_cache()

print("\n=== Modular I2V ===")
blocks = HunyuanVideo15Image2VideoBlocks()
pipe = blocks.init_pipeline(I2V_ID)
pipe.load_components(torch_dtype=dtype)
pipe.to(device)
g = torch.Generator(device=device).manual_seed(1234)
mod_out = pipe(image=image, prompt="A cat turns its head", num_frames=55, num_inference_steps=6, generator=g, output="videos", output_type="np")
print(f"Shape: {np.array(mod_out).shape}")
export_to_video(mod_out[0], "/content/hv15_i2v_modular.mp4", fps=24)

diff = np.abs(to_np(ref_out).astype(float) - to_np(mod_out).astype(float)).mean()
print(f"\nI2V MAD: {diff:.6f}")
print("\n=== Done ===")

Addresses #13295 (HunyuanVideo 1.5 contribution)

Before submitting

Who can review?

@sayakpaul @yiyixuxu @asomoza

@akshan-main
Contributor Author

hey guys @yiyixuxu @sayakpaul would greatly appreciate a review!

@sayakpaul
Member

@claude could you do an initial review here?

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@akshan-main
Contributor Author

@sayakpaul looks like the Claude bot run failed on this one

@github-actions github-actions bot added the size/L PR with diff > 200 LOC label Apr 11, 2026
@akshan-main
Contributor Author

friendly ping @sayakpaul @yiyixuxu , awaiting review for modular hunyuanvideo 1.5. Will greatly appreciate a review!

@akshan-main
Contributor Author

Hey guys, this is one more of my contributions to modular diffusers @yiyixuxu @sayakpaul

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!

@yiyixuxu
Collaborator

@akshan-main
thanks for your PR and contribution!
we'll wait to merge until we fix our claude CI and have it do a review as well, as we're currently testing and rolling it out.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@akshan-main
Contributor Author

@yiyixuxu I think these tests are always failing; they have nothing to do with my PR. Anyways I was wondering if you'd like me to work on something else within or outside modular diffusers. Will be glad to contribute or test out any pipelines.

@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026
@yiyixuxu
Collaborator

@akshan-main
how about modular ernie-image?

@yiyixuxu
Collaborator

@claude can you help to do a review here? Please pay additional attention to:

  1. look over all the PipelineBlocks to see if there are any inputs/intermediate_outputs/expected_components that are defined but not used in __call__
  2. compare it with modular implementation of wan/qwen/flux2 to see if there is anything that's missing

@github-actions
Contributor

github-actions bot commented Apr 16, 2026

Claude finished @yiyixuxu's task in 6m 50s. View job


Review: Modular HunyuanVideo 1.5 Pipeline

  • Read repo guidelines (CLAUDE.md, AGENTS.md, modular.md)
  • Read all changed files in the PR
  • Compare with wan/qwen/flux2 modular implementations
  • Check PipelineBlocks for unused inputs/intermediate_outputs/expected_components
  • Post final review

Overall the implementation follows the modular pipeline conventions well — block structure, file layout, guider abstraction, and denoise loop patterns are all correct and consistent with wan/flux2/qwen. Parity is verified at MAD 0.000000 for both T2V and I2V, which is great.

Below are the findings, organized by severity.


Issues to fix

1. Missing __init__.py registrations for HunyuanVideo15Blocks and HunyuanVideo15Image2VideoBlocks

These classes are exported from the local hunyuan_video1_5/__init__.py but are not registered in:

  • src/diffusers/modular_pipelines/__init__.py
  • src/diffusers/__init__.py
  • src/diffusers/utils/dummy_torch_and_transformers_objects.py

This means from diffusers import HunyuanVideo15Blocks (as shown in the PR description) will fail. The wan pipeline registers WanBlocks, Wan22Blocks, etc. in all three files — this PR should do the same.

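For context, what the bot is asking for corresponds to diffusers' lazy-import registration pattern. The sketch below illustrates the mechanism only (the real `__init__.py` files feed an `_import_structure` dict into `_LazyModule`; the dict contents here are simplified, and note the thread later settles on exporting only the AutoBlocks classes at top level):

```python
# Simplified illustration of the _import_structure registration pattern in
# diffusers' __init__.py files. Not the actual file contents.

_import_structure = {
    "modular_pipelines": ["WanBlocks"],  # pre-existing entry, for comparison
}

# Registering the new block classes would amount to extending that list:
_import_structure["modular_pipelines"] += [
    "HunyuanVideo15Blocks",
    "HunyuanVideo15Image2VideoBlocks",
]

# Flatten for a quick check that the names would now be importable.
exported = [name for names in _import_structure.values() for name in names]
print(exported)
```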

2. Test model path uses personal repo (before_denoise.py:49, test_modular_pipeline_hunyuan_video1_5.py:49)

pretrained_model_name_or_path = "akshan-main/tiny-hunyuanvideo1_5-modular-pipe"

Per modular.md gotcha #9: "Tiny test models must live under hf-internal-testing/, not personal repos." The model should be moved to hf-internal-testing/ before merge.

3. HunyuanVideo15TextInputStep declares num_videos_per_prompt as input but never uses it (before_denoise.py:102, 114-118)

The input is declared at line 102:

InputParam.template("num_images_per_prompt", name="num_videos_per_prompt"),

But the __call__ (lines 114–118) only uses batch_size and prompt_embeds; num_videos_per_prompt is never accessed.
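For reference, a block that did consume this input would typically repeat the prompt embeddings along the batch dimension, once per requested video. A minimal NumPy sketch of that pattern (illustrative only; the function name and shapes are assumptions, not the PR's code):

```python
import numpy as np

def expand_for_num_videos(prompt_embeds: np.ndarray, num_videos_per_prompt: int) -> np.ndarray:
    """Repeat each prompt's embeddings so the batch covers every requested video.

    Mirrors the usual torch `repeat_interleave(num_videos_per_prompt, dim=0)`
    idiom seen in diffusers input steps; names here are illustrative.
    """
    return np.repeat(prompt_embeds, num_videos_per_prompt, axis=0)

embeds = np.zeros((2, 77, 4096))      # (batch, seq_len, dim) for 2 prompts
expanded = expand_for_num_videos(embeds, 3)
print(expanded.shape)                 # (6, 77, 4096)
```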

4. Encoder accepts pre-computed embeddings (modular.md gotcha #4) (encoders.py:166-176)

HunyuanVideo15TextEncoderStep accepts prompt_embeds, prompt_embeds_mask, negative_prompt_embeds, negative_prompt_embeds_mask, prompt_embeds_2, prompt_embeds_mask_2, negative_prompt_embeds_2, negative_prompt_embeds_mask_2 as inputs to skip encoding.

Per modular.md:

"In modular pipelines this is unnecessary — users just pop out the encoder block and run it separately. Encoder blocks should only accept raw inputs (prompt, image, etc.)."

The Wan text encoder, for comparison, only accepts prompt, negative_prompt, and max_sequence_length. These pre-computed embedding inputs should be removed.
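The workflow the guideline describes, running the encoder as its own step rather than threading pre-computed embeddings through it, can be sketched with toy blocks. This is plain Python illustrating the pattern, not the diffusers modular API:

```python
class TextEncoderBlock:
    """Toy stand-in for a modular text-encoder block: raw prompt in, embeds out."""
    def __call__(self, state: dict) -> dict:
        state["prompt_embeds"] = f"embeds({state['prompt']})"
        return state

class DenoiseBlock:
    """Toy stand-in for a downstream block that consumes prompt_embeds."""
    def __call__(self, state: dict) -> dict:
        state["video"] = f"video from {state['prompt_embeds']}"
        return state

blocks = {"text_encoder": TextEncoderBlock(), "denoise": DenoiseBlock()}

# Pattern 1: run the full pipeline from a raw prompt.
state = {"prompt": "a cat"}
for block in blocks.values():
    state = block(state)

# Pattern 2: pop the encoder and run it once separately (e.g. to cache
# embeddings), then feed its output state into the remaining blocks.
# No block needs to declare pre-computed embeddings as an input.
encoder = blocks.pop("text_encoder")
cached = encoder({"prompt": "a cat"})
state2 = dict(cached)
for block in blocks.values():
    state2 = block(state2)
print(state2["video"])
```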


Suggestions (non-blocking)

5. HunyuanVideo15PrepareLatentsStep accesses components.video_processor without declaring it (before_denoise.py:204)

The block accesses components.video_processor.calculate_default_height_width(...) but video_processor is not in its expected_components. It works in practice because the decoder block registers it, but this is fragile if the block is used standalone. Consider either adding it to expected_components or replacing with pipeline properties.

6. Many TODO: Add description. placeholders remain in auto-generated docstrings (modular_blocks_hunyuan_video1_5.py)

The docstrings for prompt_embeds_2, prompt_embeds_mask_2, negative_prompt_embeds_2, etc., num_frames, and image_latents all show TODO: Add description. For example, lines 62–83 of modular_blocks_hunyuan_video1_5.py. While other merged implementations (wan, flux2) also have unfilled TODOs, it would be nice to fill these in — especially for the dual-encoder parameters that are unique to HunyuanVideo 1.5.

7. Code duplication between T2V and I2V denoisers (denoise.py:59-149 vs 238-334)

HunyuanVideo15LoopDenoiser and HunyuanVideo15Image2VideoLoopDenoiser are ~90% identical, differing only in the MeanFlow timestep_r computation (lines 293–301) and the extra timestep_r=timestep_r kwarg. This is not a blocker since the pattern matches other implementations (e.g., Wan has WanLoopDenoiser and Wan22LoopDenoiser), but noting it for awareness.
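If the duplication were ever worth removing, one option is to factor the shared loop body into a base class and override only the MeanFlow-specific kwargs. A toy sketch of that design (names and arithmetic are illustrative, not the PR's code):

```python
class LoopDenoiserBase:
    """Shared denoise-step body; subclasses supply model-specific extra kwargs."""
    def extra_kwargs(self, t: float) -> dict:
        return {}

    def step(self, latents: float, t: float) -> float:
        kwargs = self.extra_kwargs(t)
        # Stand-in for the transformer forward pass + scheduler step.
        return latents - 0.1 * (t + kwargs.get("timestep_r", 0.0))

class I2VLoopDenoiser(LoopDenoiserBase):
    """Overrides only the MeanFlow timestep_r computation, the one real
    difference between the two denoisers noted in the review."""
    def extra_kwargs(self, t: float) -> dict:
        return {"timestep_r": t / 2}

base_out = LoopDenoiserBase().step(1.0, 0.5)   # 1.0 - 0.1 * 0.5  = 0.95
i2v_out = I2VLoopDenoiser().step(1.0, 0.5)     # 1.0 - 0.1 * 0.75 = 0.925
print(base_out, i2v_out)
```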


Block-by-block audit summary

| Block | Unused inputs | Unused outputs | Unused components | Notes |
| --- | --- | --- | --- | --- |
| HunyuanVideo15TextEncoderStep | prompt_embeds* (gotcha #4) | | | Pre-computed embeds violate modular convention |
| HunyuanVideo15VaeEncoderStep | | | | Clean |
| HunyuanVideo15ImageEncoderStep | | | | Clean |
| HunyuanVideo15TextInputStep | num_videos_per_prompt | | | Declared but never accessed in __call__ |
| HunyuanVideo15SetTimestepsStep | | | | Clean |
| HunyuanVideo15PrepareLatentsStep | | | | Uses undeclared video_processor |
| HunyuanVideo15Image2VideoPrepareLatentsStep | | | | Clean |
| HunyuanVideo15LoopBeforeDenoiser | | N/A | | Clean |
| HunyuanVideo15LoopDenoiser | | N/A | | Clean |
| HunyuanVideo15LoopAfterDenoiser | | N/A | | Clean |
| HunyuanVideo15Image2VideoLoopDenoiser | | N/A | | Clean |
| HunyuanVideo15VaeDecoderStep | | | | Clean |

@akshan-main
Contributor Author

@yiyixuxu will address 3 and 4. skipping 1 since earlier feedback on the LTX PR was to only export AutoBlocks from top-level. For 2, can akshan-main/tiny-hunyuanvideo1_5-modular-pipe be transferred?

@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026
Comment on lines +26 to +27
"HunyuanVideo15Blocks",
"HunyuanVideo15Image2VideoBlocks",
Collaborator


Suggested change
"HunyuanVideo15Blocks",
"HunyuanVideo15Image2VideoBlocks",

Comment on lines +40 to +41
HunyuanVideo15Blocks,
HunyuanVideo15Image2VideoBlocks,
Collaborator


Suggested change
HunyuanVideo15Blocks,
HunyuanVideo15Image2VideoBlocks,

Contributor Author


done


@property
def expected_components(self) -> list[ComponentSpec]:
return [ComponentSpec("transformer", HunyuanVideo15Transformer3DModel)]
Collaborator


indeed missing a video_processor here #13389 (comment)

Contributor Author


added

@yiyixuxu
Collaborator

@akshan-main sounds good, let's try to address #6 too

@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 16, 2026
@akshan-main akshan-main requested a review from yiyixuxu April 16, 2026 20:08
@akshan-main
Contributor Author

> @akshan-main how about modular ernie-image?

on it

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks

@yiyixuxu yiyixuxu merged commit b3889ea into huggingface:main Apr 16, 2026
17 of 18 checks passed