
Video

HunyuanCustom

Open-source custom videos with consistent subjects

Developer: —

Launched: 2025

Last Updated: 2026-02-14

Overall score: 8.5 out of 10


Overall Rating: 8.5

Starting Price: —

Key Features: 5

Overview

HunyuanCustom is a multimodal conditional video-generation pipeline designed to preserve subject identity while accepting conditioning from text, images, audio, and video. Built on the HunyuanVideo backbone, it adds modules that close modality gaps and prevent identity drift, so generated videos stay faithful to the provided subject image or images.

Key components are a LLaVA-based text–image fusion module that injects image identity cues into text prompts; an image ID enhancement module that concatenates image features along the temporal axis, across frames; AudioNet, which aligns audio and visual features for audio-conditioned generation; and a patchify-based video-driven injection module for latent-conditioned editing. A disentangled identity representation decouples identity information from the other modalities, which allows independent control over image, audio, and video inputs.

Supported workflows include single-subject video customization, mask-based video-driven editing, and audio-driven customization, with downstream use cases such as singing avatars and virtual advertisements. Deployment details cover Linux, CUDA, PyTorch, multi-GPU setups, and Docker- and Gradio-based tooling.

Key Features

Text–image fusion module

LLaVA-based module that injects image identity cues into text prompts to improve image- and text-conditioned generation.

Image ID enhancement

Temporally concatenates image features across frames to strengthen subject identity consistency.
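The idea of concatenating image features along the time axis can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `concat_identity_along_time`, the nested-list latent layout (frames of tokens of channels), and the choice to place the identity latent before the first video frame are all assumptions made for the example.

```python
from typing import List

# One latent frame: a list of tokens, each token a list of channel values.
Frame = List[List[float]]


def concat_identity_along_time(video: List[Frame], identity: Frame) -> List[Frame]:
    """Return a new sequence with the identity latent added as an extra frame.

    Assumption: the identity latent is placed before the first video frame.
    The source only states that image features are concatenated temporally.
    """
    if video:
        same_tokens = len(video[0]) == len(identity)
        same_channels = len(video[0][0]) == len(identity[0])
        if not (same_tokens and same_channels):
            raise ValueError("identity latent must match the per-frame shape")
    # Concatenation along the time axis: T frames become T + 1 frames.
    return [identity] + list(video)
```

Because the identity latent sits on the same axis as the video frames, any attention over time can read from it at every step, which is one plausible reading of how this strengthens identity consistency.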

AudioNet

Audio-conditioned module that hierarchically aligns audio and visual features through spatial cross-attention, for singing and avatar-style generation.
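As a rough illustration of spatial cross-attention between the two modalities, the sketch below lets each visual token of one frame attend over a set of audio tokens. It is a simplified, assumption-laden example: there are no learned query, key, or value projections, the hierarchy mentioned above is not modelled, and the names `cross_attention` and `softmax` are hypothetical helpers rather than AudioNet's actual API.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual: Matrix, audio: Matrix) -> Matrix:
    """Visual tokens act as queries; audio tokens act as keys and values.

    Assumption: both inputs already share the same channel dimension.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    out: Matrix = []
    for q in visual:
        # Scaled dot-product score of this visual token against each audio token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio]
        weights = softmax(scores)
        # Weighted sum of the audio tokens, one value per channel.
        attended = [sum(w * v[c] for w, v in zip(weights, audio)) for c in range(dim)]
        # Residual connection: the audio-derived update is added to the visual token.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out
```

The output keeps the visual token layout, so the audio signal modulates the frame without changing its shape.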

Video-driven injection

Patchify-based feature alignment that injects latent features of the conditioning video, for video-driven editing and guidance.
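Patchify here refers to cutting a latent grid into non-overlapping patches and flattening each patch into one token. The sketch below shows that step on a single-channel 2D latent and then adds the conditioning tokens to the video tokens element-wise. The function names `patchify` and `inject`, the single-channel layout, and the additive injection are assumptions chosen for illustration; the source does not specify how the aligned features are combined.

```python
from typing import List

Grid = List[List[float]]    # a single-channel H x W latent
Tokens = List[List[float]]  # one flattened patch per row


def patchify(latent: Grid, patch: int) -> Tokens:
    """Split an H x W grid into non-overlapping patch x patch tokens (row-major)."""
    height, width = len(latent), len(latent[0])
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    tokens: Tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                latent[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


def inject(video_tokens: Tokens, cond_tokens: Tokens) -> Tokens:
    """Add conditioning tokens to video tokens element-wise (assumed scheme)."""
    if len(video_tokens) != len(cond_tokens):
        raise ValueError("token counts must match after patchify")
    return [[v + c for v, c in zip(vt, ct)] for vt, ct in zip(video_tokens, cond_tokens)]
```

Patchifying both latents with the same patch size gives token sequences of equal length, which is what makes a position-wise injection possible in the first place.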

Disentangled identity representation

Separates identity information from other modalities to enable decoupled control over image, audio, and video inputs.

Pricing Plans

Pricing information is not available yet.

Related Articles (5)

Daily Papers Newsletter: Trending Papers on Hierarchical Cross-Attention Mechanisms
huggingface.co • 3w ago • 1 min read
Newsletter signup page offering trending papers on hierarchical cross-attention; no article content provided.
Tags: newsletter, trending papers, Hugging Face, hierarchical cross-attention

Alibaba Tongyi Lab Unleashes Z-Image-Base: Non-Distilled, High-Quality Image Gen with Day-0 ComfyUI Support
comfyui-wiki.com • 3w ago • 2 min read
Non-distilled, high-quality Z-Image-Base now with Day-0 ComfyUI support and full ecosystem tooling.
Tags: Z-Image-Base, non-distilled, ComfyUI, image generation

HunyuanCustom: A Multi-Modal, Identity-Preserving Framework for Customized Video Generation
github.com • 9mo ago • 7 min read
A multi-modal framework for subject-consistent customized video generation from text, image, audio, and video inputs.
Tags: multimodal video generation, subject consistency, HunyuanCustom, text-image fusion

Your Feedback Drives Better Docs: A Practical Guide to Qualifiers and Metadata
github.com • 9mo ago • 1 min read
A meta-guide showing how user feedback and docs shape metadata for article content.
Tags: feedback, documentation, qualifiers, metadata

HunyuanCustom: A Multimodal Framework for Subject-Consistent Customized Video Generation
github.io • 1y ago • 5 min read
A multimodal, subject-consistent video generation framework that fuses image, text, audio, and video inputs for controllable synthesis.
Tags: multimodal, video generation, subject consistency, LLaVA

Price: Custom

Details

Developer: —

Launch Year: 2025

Free Trial: No

Updated: 2026-02-14

Tags

multimodal, identity-preserving, video generation, text conditioning, image conditioning, audio conditioning, video conditioning, LLaVA-based fusion, AudioNet, patchify, identity drift, subject identity, disentangled identity, Gradio, Linux, CUDA 11.8/12.4, PyTorch 2.4.0, reproducibility, open-source

Similar Tools

Video Memories AI
9.1/10 • Free

LTXV-13b AI Video Generation
9.1/10 • $36/mo

9.1/10 • $10/mo