Video generation and editing have traditionally been handled by separate models – one model for text-to-video, another for video stylization, yet another for inpainting. This fragmentation makes it difficult to build comprehensive video production pipelines and forces practitioners to learn multiple model interfaces. VACE (All-in-One Video Creation and Editing) eliminates this problem by unifying all video creation and editing tasks in a single diffusion transformer model.
Accepted at ICCV 2025, VACE is the work of Alibaba’s Tongyi Lab. The key insight behind VACE is that video creation and editing tasks share a common underlying structure: they all involve generating or modifying video content based on some combination of reference frames, text descriptions, and mask information. By designing a unified conditioning mechanism, VACE can handle all of these tasks without task-specific model variants.
The model supports three main task categories: video creation (generating new video from text, images, or reference clips), video editing (stylizing or transforming existing video), and masked editing (precise modifications using masks for inpainting, outpainting, or object removal).
## What Tasks Can VACE Perform?
VACE’s unified architecture enables a wide range of video generation and editing tasks through different input configurations.
```mermaid
graph TD
    A[VACE Unified Model] --> B[Video Creation]
    A --> C[Video Editing]
    A --> D[Masked Editing]
    B --> E[Text-to-Video]
    B --> F[Image-to-Video]
    B --> G[Reference-to-Video]
    C --> H[Style Transfer]
    C --> I[Object Replacement]
    C --> J[Background Change]
    D --> K[Video Inpainting]
    D --> L[Video Outpainting]
    D --> M[Object Removal]
```
| Task Category | Input Type | Output | Example Use Case |
|---|---|---|---|
| Text-to-Video | Text prompt | Generated video | Creating B-roll from description |
| Image-to-Video | Image + text | Animated video | Bringing a photo to life |
| Reference-to-Video | Reference clip + text | Generated video | Reusing a reference clip’s subject or motion in new footage |
| Style Transfer | Source video + style text | Styled video | Converting footage to anime style |
| Video Inpainting | Video + mask | Repaired video | Removing unwanted objects |
| Video Outpainting | Video + expansion mask | Extended video | Expanding video frame boundaries |
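Because VACE selects the task from the combination of inputs rather than from a mode switch, the dispatch logic can be thought of as a simple function of which conditions are present. The sketch below is illustrative only – the function and category names are hypothetical, not VACE’s actual API:

```python
# Hypothetical sketch: one entry point, task implied by which inputs are present.
# The names here (infer_task, the category strings) are illustrative, not VACE's API.

def infer_task(prompt=None, image=None, source_video=None, mask=None):
    """Infer the VACE task category from the supplied inputs."""
    if source_video is not None and mask is not None:
        return "masked_editing"      # inpainting / outpainting / object removal
    if source_video is not None:
        return "video_editing"       # style transfer, object replacement
    if image is not None:
        return "image_to_video"
    if prompt is not None:
        return "text_to_video"
    raise ValueError("At least a text prompt is required")

print(infer_task(prompt="a cat on a beach"))                       # text_to_video
print(infer_task(prompt="anime style", source_video="clip.mp4"))   # video_editing
```

The same pattern extends naturally: adding reference frames or an expansion mask selects reference-to-video or outpainting without any new model variant.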
## How Does VACE’s Architecture Compare to Other Methods?
VACE’s unified approach contrasts with the more common practice of training separate models or adapters for each task.
| Aspect | VACE (Unified) | Task-Specific Models | Multi-Adapter Approaches |
|---|---|---|---|
| Architecture | Single base model | Separate base per task | Single base + separate adapters |
| Training | Joint training | Independent training | Sequential adapter training |
| Parameter Efficiency | One set of weights | N sets of weights | Base + N adapters |
| Cross-task Transfer | Natural knowledge sharing | No transfer | Limited by adapter isolation |
| Inference Overhead | Single model load | Load appropriate model | Load base + switch adapters |
| Maintenance | One codebase | Multiple codebases | One codebase + adapter management |
The unified approach means that improvements from training on one task benefit all other tasks. For instance, learning better motion representation from video-to-video translation improves the model’s ability to generate coherent motion in text-to-video creation.
## What Model Variants Are Available and What Hardware Do They Need?
VACE provides two variants to accommodate different hardware and quality requirements.
| Variant | Parameters | Recommended GPU | Inference Speed | Quality |
|---|---|---|---|---|
| VACE Full | ~7B | A100 / H100 | ~15 s per 16 frames (A100) | Best |
| VACE Lite | ~3B | RTX 4090 / A10G | ~20 s per 16 frames (RTX 4090) | High |
| Feature | Full Model | Lite Model |
|---|---|---|
| Resolution | 1024x576 | 720x480 |
| Frame Count | 16-32 frames | 8-16 frames |
| GPU Memory | ~24 GB | ~12 GB |
| Inference Time | ~15s (A100 for 16 frames) | ~20s (RTX 4090 for 16 frames) |
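Choosing between the two variants mostly comes down to available GPU memory. The helper below encodes the figures from the tables above; the thresholds and structure are illustrative, not part of VACE’s tooling:

```python
# Rough variant selector based on the memory figures quoted in the tables above.
# The VARIANTS dict and pick_variant are illustrative helpers, not VACE tooling.

VARIANTS = {
    "full": {"params": "~7B", "min_vram_gb": 24, "max_frames": 32, "res": (1024, 576)},
    "lite": {"params": "~3B", "min_vram_gb": 12, "max_frames": 16, "res": (720, 480)},
}

def pick_variant(available_vram_gb: float) -> str:
    """Return the largest variant that fits in the given VRAM budget."""
    if available_vram_gb >= VARIANTS["full"]["min_vram_gb"]:
        return "full"
    if available_vram_gb >= VARIANTS["lite"]["min_vram_gb"]:
        return "lite"
    raise RuntimeError("Not enough VRAM for either variant; consider offloading")

print(pick_variant(40))  # full  (e.g. A100 40 GB)
print(pick_variant(16))  # lite  (e.g. a 16 GB consumer GPU)
```

In practice, actual memory use also depends on resolution, frame count, and attention implementation, so treat these thresholds as starting points.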
## FAQ
**What is VACE?** VACE (All-in-One Video Creation and Editing) is a unified video generation and editing model by Alibaba’s Tongyi Lab, accepted at ICCV 2025. It handles reference-to-video generation, video-to-video translation, and masked video editing within a single framework, eliminating the need for separate models for each task.
**What task categories does VACE support?** VACE supports three main task categories: video creation (text-to-video, image-to-video, reference-to-video), video editing (video-to-video style transfer, object replacement), and masked editing (inpainting, outpainting, object removal). Users specify the task through different input combinations rather than selecting separate model modes.
**What model variants are available?** VACE offers a full model variant and a lightweight lite variant. The full model provides the highest quality for all tasks, while the lite variant is optimized for faster inference on consumer GPUs. Both variants share the same architecture but differ in parameter count and inference speed.
**What is VACE’s architecture?** VACE uses a unified diffusion transformer architecture with a task-agnostic design. Instead of training separate adapters for each task, VACE uses a unified conditioning mechanism that can represent any video creation or editing task as a combination of reference frames, target frames, and mask information. This shared representation enables all tasks to benefit from joint training.
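The idea of expressing every task as frames plus a mask can be sketched concretely. The NumPy toy below is a conceptual illustration – the tensor layout and `build_condition` helper are assumptions for exposition, not VACE’s actual internal representation:

```python
import numpy as np

# Conceptual sketch of a unified conditioning input. Shapes and the
# build_condition helper are assumptions for illustration, not VACE internals.
# The idea: every task is expressed as context frames plus a per-pixel mask
# marking the regions the model should generate.

T, H, W, C = 8, 64, 64, 3  # toy frame count and resolution

def build_condition(context_frames, mask):
    """Pack context frames and a generation mask into one conditioning array.

    mask == 1 marks pixels to generate; those pixels are zeroed in the
    context so the model cannot copy them directly.
    """
    visible = context_frames * (1.0 - mask)           # keep only known content
    return np.concatenate([visible, mask], axis=-1)   # shape (T, H, W, C+1)

# Text-to-video: no known pixels, generate everything.
t2v = build_condition(np.zeros((T, H, W, C)), np.ones((T, H, W, 1)))

# Inpainting: keep a source clip, regenerate only a masked box.
src = np.random.rand(T, H, W, C)
box = np.zeros((T, H, W, 1))
box[:, 16:48, 16:48] = 1.0
inpaint = build_condition(src, box)

print(t2v.shape, inpaint.shape)  # (8, 64, 64, 4) (8, 64, 64, 4)
```

Because both tasks reduce to the same `(frames, mask)` encoding, a single model trained on this format can serve either one – which is the core of the task-agnostic design described above.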
**How do you install and use VACE?** VACE can be installed by cloning the repository and setting up the environment with the provided requirements. The repository includes inference scripts for all supported tasks, a Gradio web interface for interactive use, and pre-trained model weights available on Hugging Face. An A100 GPU is recommended for the full model, while the lite variant runs on an RTX 4090.
## Further Reading
- VACE GitHub Repository – Source code, models, and documentation
- VACE Academic Paper (ICCV 2025) – Research paper on the unified video framework
- Alibaba Tongyi Lab Research – Alibaba’s AI research lab
- ICCV 2025 Conference – Conference where VACE was accepted
- VACE Model on Hugging Face – Pre-trained model weights and demos