Video

HunyuanCustom

Open-source custom videos with consistent subjects
Rating: 8.5
Price: Custom
Key Features: 5

Overview

HunyuanCustom is a multimodal conditional video-generation pipeline that preserves subject identity while accepting text, image, audio, and video conditions. Built on a HunyuanVideo backbone, it adds modules that close the gaps between modalities and prevent identity drift, so generated videos stay faithful to the provided subject image or images. Its key components are a LLaVA-based text–image fusion module that injects image identity cues into text prompts, an image ID enhancement module that temporally concatenates image features across frames, AudioNet, which aligns audio and visual features for audio-conditioned generation, and a patchify-based video-driven injection module for latent-conditioned editing. A disentangled identity representation decouples identity information from the other modalities, which allows independent control over image, audio, and video inputs. Supported workflows include single-subject video customization, video-driven editing with masks, and audio-driven customization, with downstream use cases such as singing avatars and virtual advertisements. The deployment documentation covers Linux, CUDA, PyTorch, multi-GPU setups, and Docker and Gradio-based tooling.

Details

Developer
Launch Year: 2025
Free Trial: No
Updated: 2026-02-14

Features

Text–image fusion module

LLaVA-based module that injects image identity cues into text prompts to improve generation conditioned on both image and text.

Image ID enhancement

Temporally concatenates image features across frames to strengthen subject identity consistency.
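
A minimal sketch of temporal concatenation, assuming the identity image has already been encoded into a latent with the same feature size as each video frame latent. Plain Python lists stand in for tensors, the function name is hypothetical, and placing the identity latent before the first frame is an assumption of this sketch rather than a confirmed detail of HunyuanCustom.

```python
def concat_identity_frame(video_latents, image_latent):
    """Place the identity latent on the time axis as an extra leading frame.

    video_latents: list of T frame latents, each a flat list of floats.
    image_latent:  flat list of floats with the same length as one frame.
    Returns a new list of T + 1 frame latents; the inputs are not modified.
    """
    for index, frame in enumerate(video_latents):
        if len(frame) != len(image_latent):
            raise ValueError(
                f"frame {index} has {len(frame)} features, "
                f"identity latent has {len(image_latent)}"
            )
    # Concatenate along the temporal axis: identity latent first, then frames.
    return [list(image_latent)] + [list(frame) for frame in video_latents]


if __name__ == "__main__":
    video = [[0.1, 0.2], [0.3, 0.4]]   # T = 2 frames, 2 features each
    identity = [9.0, 9.0]
    print(len(concat_identity_frame(video, identity)))  # T + 1 frames
```

Because the identity latent sits in the same sequence as the frames, every frame can attend to it, which is one way the subject's appearance can be kept consistent over time.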

AudioNet

Audio-conditioned module that hierarchically aligns audio and visual features through spatial cross-attention, used for singing and avatar-style generation.
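
A minimal sketch of per-frame spatial cross-attention, assuming each frame's spatial tokens act as queries and the audio tokens aligned to that frame act as keys and values. Plain Python lists stand in for tensors, the learned projections of a real attention layer are omitted, and the function names are hypothetical rather than part of HunyuanCustom's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        attended.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return attended


def framewise_audio_attention(video_tokens, audio_tokens):
    """Add audio information to each frame's spatial tokens.

    video_tokens[t]: spatial tokens of frame t (list of feature lists).
    audio_tokens[t]: audio tokens aligned to frame t (same feature size).
    Attention is computed inside each frame only, so audio from one
    frame never leaks into another frame.
    """
    if len(video_tokens) != len(audio_tokens):
        raise ValueError("need one group of audio tokens per video frame")
    output = []
    for frame, audio in zip(video_tokens, audio_tokens):
        attended = cross_attend(frame, audio, audio)
        # Residual connection: keep the visual token, add the audio signal.
        output.append(
            [[x + a for x, a in zip(tok, att)] for tok, att in zip(frame, attended)]
        )
    return output
```

Restricting attention to tokens of the same frame is the design choice this sketch illustrates: audio influences the spatial layout of the frame it is aligned with, and temporal consistency is left to the rest of the model.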

Video-driven injection

Patchify-based feature alignment that injects latent features from a conditioning video, used for video-driven editing and guidance.
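
A minimal sketch of patchify-based injection, assuming the conditioning video and the noisy video have already been encoded to latents of the same size, and that injection is a frame-by-frame addition of aligned patch tokens. Each frame is a plain H x W grid of numbers here, and the function names and the additive rule are assumptions of this sketch.

```python
def patchify(frame, patch):
    """Split an H x W grid into flattened patch x patch tokens, row-major."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [
                    frame[top + dy][left + dx]
                    for dy in range(patch)
                    for dx in range(patch)
                ]
            )
    return tokens


def inject_condition(noisy_video, cond_video, patch):
    """Patchify both videos and add the condition tokens frame by frame.

    noisy_video, cond_video: lists of frames, each an H x W grid.
    Returns, for every frame, a list of tokens (one per patch).
    """
    if len(noisy_video) != len(cond_video):
        raise ValueError("videos must have the same number of frames")
    injected = []
    for noisy_frame, cond_frame in zip(noisy_video, cond_video):
        noisy_tokens = patchify(noisy_frame, patch)
        cond_tokens = patchify(cond_frame, patch)
        # Same patch grid on both sides, so token i covers the same region.
        injected.append(
            [
                [n + c for n, c in zip(n_tok, c_tok)]
                for n_tok, c_tok in zip(noisy_tokens, cond_tokens)
            ]
        )
    return injected
```

Using the same patch grid for both inputs is what keeps the tokens spatially aligned, so the conditioning signal for a region lands on the token that generates that region.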

Disentangled identity representation

Separates identity information from other modalities to enable decoupled control over image, audio, and video inputs.
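
A minimal sketch of what decoupled control can look like from the caller's side, assuming the subject image and prompt are always present while audio, video, and mask inputs are optional. The `Conditions` class, the `select_workflow` function, and the rule that a conditioning video takes precedence over audio are all hypothetical and are not HunyuanCustom's interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conditions:
    """Inputs for one generation request (paths are placeholders)."""
    prompt: str
    subject_image: str            # identity source, always required
    audio: Optional[str] = None   # optional audio condition
    video: Optional[str] = None   # optional conditioning video
    mask: Optional[str] = None    # region to edit in the conditioning video


def select_workflow(conditions):
    """Map the supplied modalities to one of the workflows named above."""
    if conditions.video is not None:
        if conditions.mask is None:
            raise ValueError("video-driven editing expects a mask")
        return "video-driven editing"
    if conditions.audio is not None:
        return "audio-driven customization"
    return "single-subject customization"


if __name__ == "__main__":
    request = Conditions(prompt="a singer on stage", subject_image="subject.png",
                         audio="song.wav")
    print(select_workflow(request))
```

The identity input stays the same across all three branches, which mirrors the idea that identity is handled separately from whichever other modality drives the generation.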

Tags

multimodal, identity-preserving, video generation, text conditioning, image conditioning, audio conditioning, video conditioning, LLaVA-based fusion, AudioNet, patchify, identity drift, subject identity, disentangled identity, Gradio, Linux, CUDA 11.8/12.4, PyTorch 2.4.0, reproducibility, open-source