Video generation and editing have traditionally been handled by separate models – one model for text-to-video, another for video stylization, yet another for inpainting. This fragmentation makes it difficult to build comprehensive video production pipelines and forces practitioners to learn multiple model interfaces. VACE (All-in-One Video Creation and Editing) eliminates this problem by unifying all video creation and editing tasks in a single diffusion transformer model.
Accepted at ICCV 2025, VACE is the work of Alibaba’s Tongyi Lab. The key insight behind VACE is that video creation and editing tasks share a common underlying structure: they all involve generating or modifying video content based on some combination of reference frames, text descriptions, and mask information. By designing a unified conditioning mechanism, VACE can handle all of these tasks without task-specific model variants.
The model supports three main task categories: video creation (generating new video from text, images, or reference clips), video editing (stylizing or transforming existing video), and masked editing (precise modifications using masks for inpainting, outpainting, or object removal).
## What Tasks Can VACE Perform?
VACE’s unified architecture enables a wide range of video generation and editing tasks through different input configurations.
```mermaid
graph TD
    A[VACE Unified Model] --> B[Video Creation]
    A --> C[Video Editing]
    A --> D[Masked Editing]
    B --> E[Text-to-Video]
    B --> F[Image-to-Video]
    B --> G[Reference-to-Video]
    C --> H[Style Transfer]
    C --> I[Object Replacement]
    C --> J[Background Change]
    D --> K[Video Inpainting]
    D --> L[Video Outpainting]
    D --> M[Object Removal]
```
| Task Category | Input Type | Output | Example Use Case |
|---|---|---|---|
| Text-to-Video | Text prompt | Generated video | Creating B-roll from description |
| Image-to-Video | Image + text | Animated video | Bringing a photo to life |
| Reference-to-Video | Reference clip + text | Generated video | Reusing a reference clip’s subject or motion in new footage |
| Style Transfer | Source video + style text | Styled video | Converting footage to anime style |
| Video Inpainting | Video + mask | Repaired video | Removing unwanted objects |
| Video Outpainting | Video + expansion mask | Extended video | Expanding video frame boundaries |
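Because VACE selects the task from the combination of inputs rather than from a mode switch, the dispatch logic can be thought of as a simple function of which conditions are present. The sketch below is illustrative only – the function and category names are hypothetical, not VACE’s actual API:

```python
# Hypothetical sketch: one entry point, task implied by which inputs are present.
# The names here (infer_task, the category strings) are illustrative, not VACE's API.

def infer_task(prompt=None, image=None, source_video=None, mask=None):
    """Infer the VACE task category from the supplied inputs."""
    if source_video is not None and mask is not None:
        return "masked_editing"      # inpainting / outpainting / object removal
    if source_video is not None:
        return "video_editing"       # style transfer, object replacement
    if image is not None:
        return "image_to_video"
    if prompt is not None:
        return "text_to_video"
    raise ValueError("At least a text prompt is required")

print(infer_task(prompt="a cat on a beach"))                       # text_to_video
print(infer_task(prompt="anime style", source_video="clip.mp4"))   # video_editing
```

The same pattern extends naturally: adding reference frames or an expansion mask selects reference-to-video or outpainting without any new model variant.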
## How Does VACE’s Architecture Compare to Other Methods?
VACE’s unified approach contrasts with the more common practice of training separate models or adapters for each task.
| Aspect | VACE (Unified) | Task-Specific Models | Multi-Adapter Approaches |
|---|---|---|---|
| Architecture | Single base model | Separate base per task | Single base + separate adapters |
| Training | Joint training | Independent training | Sequential adapter training |
| Parameter Efficiency | One set of weights | N sets of weights | Base + N adapters |
| Cross-task Transfer | Natural knowledge sharing | No transfer | Limited by adapter isolation |
| Inference Overhead | Single model load | Load appropriate model | Load base + switch adapters |
| Maintenance | One codebase | Multiple codebases | One codebase + adapter management |
The unified approach means that improvements from training on one task benefit all other tasks. For instance, learning better motion representation from video-to-video translation improves the model’s ability to generate coherent motion in text-to-video creation.
## What Model Variants Are Available and What Hardware Do They Need?
VACE provides two variants to accommodate different hardware and quality requirements.
| Variant | Parameters | Recommended GPU | Inference Speed | Quality |
|---|---|---|---|---|
| VACE Full | ~7B | A100 / H100 | ~15 s per 16 frames (A100) | Best |
| VACE Lite | ~3B | RTX 4090 / A10G | ~20 s per 16 frames (RTX 4090) | High |
| Feature | Full Model | Lite Model |
|---|---|---|
| Resolution | 1024x576 | 720x480 |
| Frame Count | 16-32 frames | 8-16 frames |
| GPU Memory | ~24 GB | ~12 GB |
| Inference Time | ~15s (A100 for 16 frames) | ~20s (RTX 4090 for 16 frames) |
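Choosing between the two variants mostly comes down to available GPU memory. The helper below encodes the figures from the tables above; the thresholds and structure are illustrative, not part of VACE’s tooling:

```python
# Rough variant selector based on the memory figures quoted in the tables above.
# The VARIANTS dict and pick_variant are illustrative helpers, not VACE tooling.

VARIANTS = {
    "full": {"params": "~7B", "min_vram_gb": 24, "max_frames": 32, "res": (1024, 576)},
    "lite": {"params": "~3B", "min_vram_gb": 12, "max_frames": 16, "res": (720, 480)},
}

def pick_variant(available_vram_gb: float) -> str:
    """Return the largest variant that fits in the given VRAM budget."""
    if available_vram_gb >= VARIANTS["full"]["min_vram_gb"]:
        return "full"
    if available_vram_gb >= VARIANTS["lite"]["min_vram_gb"]:
        return "lite"
    raise RuntimeError("Not enough VRAM for either variant; consider offloading")

print(pick_variant(40))  # full  (e.g. A100 40 GB)
print(pick_variant(16))  # lite  (e.g. a 16 GB consumer GPU)
```

In practice, actual memory use also depends on resolution, frame count, and attention implementation, so treat these thresholds as starting points.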
## FAQ
**What is VACE?** VACE (All-in-One Video Creation and Editing) is a unified video generation and editing model by Alibaba’s Tongyi Lab, accepted at ICCV 2025. It handles reference-to-video generation, video-to-video translation, and masked video editing within a single framework, eliminating the need for separate models for each task.
**What task categories does VACE support?** VACE supports three main task categories: video creation (text-to-video, image-to-video, reference-to-video), video editing (video-to-video style transfer, object replacement), and masked editing (inpainting, outpainting, object removal). Users specify the task through different input combinations rather than selecting separate model modes.
**What model variants are available?** VACE offers a full model variant and a lightweight lite variant. The full model provides the highest quality for all tasks, while the lite variant is optimized for faster inference on consumer GPUs. Both variants share the same architecture but differ in parameter count and inference speed.
**What is VACE’s architecture?** VACE uses a unified diffusion transformer architecture with a task-agnostic design. Instead of training separate adapters for each task, VACE uses a unified conditioning mechanism that can represent any video creation or editing task as a combination of reference frames, target frames, and mask information. This shared representation enables all tasks to benefit from joint training.
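The idea of expressing every task as frames plus a mask can be sketched concretely. The NumPy toy below is a conceptual illustration – the tensor layout and `build_condition` helper are assumptions for exposition, not VACE’s actual internal representation:

```python
import numpy as np

# Conceptual sketch of a unified conditioning input. Shapes and the
# build_condition helper are assumptions for illustration, not VACE internals.
# The idea: every task is expressed as context frames plus a per-pixel mask
# marking the regions the model should generate.

T, H, W, C = 8, 64, 64, 3  # toy frame count and resolution

def build_condition(context_frames, mask):
    """Pack context frames and a generation mask into one conditioning array.

    mask == 1 marks pixels to generate; those pixels are zeroed in the
    context so the model cannot copy them directly.
    """
    visible = context_frames * (1.0 - mask)           # keep only known content
    return np.concatenate([visible, mask], axis=-1)   # shape (T, H, W, C+1)

# Text-to-video: no known pixels, generate everything.
t2v = build_condition(np.zeros((T, H, W, C)), np.ones((T, H, W, 1)))

# Inpainting: keep a source clip, regenerate only a masked box.
src = np.random.rand(T, H, W, C)
box = np.zeros((T, H, W, 1))
box[:, 16:48, 16:48] = 1.0
inpaint = build_condition(src, box)

print(t2v.shape, inpaint.shape)  # (8, 64, 64, 4) (8, 64, 64, 4)
```

Because both tasks reduce to the same `(frames, mask)` encoding, a single model trained on this format can serve either one – which is the core of the task-agnostic design described above.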
**How do you install and use VACE?** VACE can be installed by cloning the repository and setting up the environment with the provided requirements. The repository includes inference scripts for all supported tasks, a Gradio web interface for interactive use, and pre-trained model weights available on Hugging Face. An A100 GPU is recommended for the full model, while the lite variant runs on an RTX 4090.
## Further Reading
- VACE GitHub Repository – Source code, models, and documentation
- VACE Academic Paper (ICCV 2025) – Research paper on the unified video framework
- Alibaba Tongyi Lab Research – Alibaba’s AI research lab
- ICCV 2025 Conference – Conference where VACE was accepted
- VACE Model on Hugging Face – Pre-trained model weights and demos