VoxCPM2 is a tokenizer-free text-to-speech (TTS) model developed by OpenBMB, an open-source AI research community affiliated with Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI). With 2 billion parameters, VoxCPM2 represents a paradigm shift in speech synthesis by operating directly on continuous speech representations, eliminating the need for discrete audio tokenizers that typically degrade voice quality.
The model supports over 30 languages with capabilities spanning zero-shot voice cloning, voice design (creating entirely new voices from text descriptions), and real-time streaming inference. VoxCPM2 has quickly become one of the most talked-about open-source TTS models of 2026, competing directly with commercial offerings like ElevenLabs and OpenAI’s TTS while remaining freely available under the Apache 2.0 license.
What makes VoxCPM2 different from traditional TTS models?
Traditional TTS pipelines rely on cascaded systems: text is converted to linguistic features, then to discrete audio tokens, and finally to waveforms. Each stage introduces compression artifacts and information loss. VoxCPM2’s tokenizer-free architecture processes continuous speech representations directly using a flow-matching diffusion backbone, preserving the full richness of natural speech including prosody, emotion, and speaker identity.
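The core idea of flow matching is to learn a time-dependent vector field and generate a sample by integrating an ODE from noise toward the data distribution. As a minimal sketch (a toy hand-written field standing in for the learned network, with a fixed target vector standing in for a speech latent; none of this is VoxCPM2's actual code):

```python
import numpy as np

def euler_sample(v_field, x0, steps=200):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy "learned" vector field: the conditional field of a straight-line
# probability path toward a fixed target, v(x, t) = (target - x) / (1 - t).
# In a real model, the target is unknown and the field is a neural network.
target = np.array([0.5, -1.0, 2.0])

def v_field(x, t):
    return (target - x) / max(1.0 - t, 1e-3)

noise = np.random.default_rng(0).standard_normal(3)
sample = euler_sample(v_field, noise)
print(np.round(sample, 3))  # converges to target
```

Because the model operates on continuous latents end to end, there is no quantization step in this loop where prosodic detail could be discarded.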
Model Versions and Specifications
| Model Variant | Parameters | Languages | Key Feature |
|---|---|---|---|
| VoxCPM2-Base | 2B | 30+ | Full multilingual TTS |
| VoxCPM2-VoiceDesign | 2B | 30+ | Text-prompted voice creation |
| VoxCPM2-Streaming | 2B | 30+ | Real-time streaming output |
| VoxCPM2-Light | ~600M | 10 | Lightweight for edge deployment |
Voice Design: Creating Voices from Text Descriptions
One of VoxCPM2’s most innovative features is voice design. Instead of requiring a reference audio sample for cloning, users can describe the desired voice in natural language. For example, “A warm, authoritative male voice with a slight British accent” generates a matching voice on demand. This capability rivals commercial offerings from ElevenLabs and Play.ht, but runs entirely locally with no API costs.
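The contract of voice design is simple: the same description should always yield the same voice, and different descriptions should yield different voices. The toy sketch below illustrates only that contract, using a hash-seeded random vector as a stand-in for a learned design encoder (this is not VoxCPM2's actual mechanism, and `design_voice_latent` is a hypothetical name):

```python
import hashlib
import numpy as np

def design_voice_latent(description: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a design encoder: map a natural-language voice
    description to a deterministic latent vector. A real encoder is a
    learned text model; hashing just makes the mapping reproducible."""
    digest = hashlib.sha256(description.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

a = design_voice_latent("A warm, authoritative male voice with a slight British accent")
b = design_voice_latent("A warm, authoritative male voice with a slight British accent")
c = design_voice_latent("A bright, energetic female voice")
print(np.allclose(a, b), np.allclose(a, c))  # True False
```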
```mermaid
flowchart LR
    A[Text Prompt] --> B[Voice Encoder]
    B --> C[Latent Space]
    D[Speaker Description] --> E[Design Encoder]
    E --> C
    C --> F[Flow Matching Decoder]
    F --> G[Waveform Output]
```

Supported Languages and Performance
| Language Family | Languages | Quality Rating |
|---|---|---|
| Indo-European | English, Spanish, French, German, Portuguese, Italian, Russian, Hindi, Urdu, Bengali | Excellent |
| Sino-Tibetan | Mandarin Chinese, Cantonese, Tibetan, Burmese | Excellent |
| Japonic/Korean | Japanese, Korean | Very Good |
| Austronesian | Indonesian, Malay, Tagalog, Vietnamese | Very Good |
| Afro-Asiatic | Arabic, Hebrew, Amharic | Good |
| Turkic | Turkish, Uzbek, Kazakh, Azerbaijani | Good |
Hardware Requirements for Running VoxCPM2
| Configuration | GPU Memory | Inference Speed (Real-Time Factor) |
|---|---|---|
| Minimum | 8 GB VRAM | ~0.3 RTF |
| Recommended | 16 GB VRAM | ~0.15 RTF |
| Real-time streaming | 24 GB VRAM | ~0.05 RTF (sub-100ms latency) |
| CPU (ONNX) | 32 GB RAM | ~0.8 RTF |
The model runs efficiently on consumer GPUs like the NVIDIA RTX 4090, and quantization via bitsandbytes can reduce memory requirements by 40-50% with minimal quality loss.
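Some back-of-envelope arithmetic makes these numbers concrete. A 2B-parameter model stores roughly 4 GB of weights in fp16 and roughly 1 GB in 4-bit form; actual VRAM usage is higher once activations and framework overhead are included, which is why the table's figures exceed the weight-only estimates. The real-time factor (RTF) is synthesis time divided by audio duration:

```python
PARAMS = 2_000_000_000  # VoxCPM2-Base parameter count

def weight_gb(params: int, bits_per_param: float) -> float:
    """Weight-only memory footprint in GB (ignores activations,
    KV caches, and framework overhead)."""
    return params * bits_per_param / 8 / 1e9

fp16 = weight_gb(PARAMS, 16)
int4 = weight_gb(PARAMS, 4)
print(f"fp16 weights: {fp16:.1f} GB, 4-bit weights: {int4:.1f} GB")

# RTF 0.15 means 10 s of audio is synthesized in 1.5 s.
rtf, audio_seconds = 0.15, 10
print(f"Synthesis time at RTF {rtf}: {rtf * audio_seconds:.1f} s")
```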
How does zero-shot voice cloning work in VoxCPM2?
Zero-shot cloning requires a 3-10 second reference audio clip. VoxCPM2 extracts a speaker embedding from the reference and conditions the flow-matching decoder to generate speech matching the reference voice. The process requires no fine-tuning or additional training, making it ideal for applications like audiobook narration, content localization, and personalized voice assistants.
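The speaker-embedding step can be illustrated with a toy encoder: pool per-frame features into one fixed-size vector and compare voices by cosine similarity. The mean-pooling below is only a stand-in for VoxCPM2's learned speaker encoder, and the synthetic "frames" replace real audio features:

```python
import numpy as np

def speaker_embedding(frames: np.ndarray) -> np.ndarray:
    """Toy speaker encoder: mean-pool per-frame features into a single
    L2-normalized vector. Real encoders are learned networks, but the
    pooled fixed-size embedding interface is the same."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

rng = np.random.default_rng(0)
voice = rng.standard_normal(80)  # per-speaker "timbre" offset
clip_a = voice + 0.1 * rng.standard_normal((300, 80))  # two clips, same speaker
clip_b = voice + 0.1 * rng.standard_normal((300, 80))
other = rng.standard_normal(80) + 0.1 * rng.standard_normal((300, 80))

same = cosine(speaker_embedding(clip_a), speaker_embedding(clip_b))
diff = cosine(speaker_embedding(clip_a), speaker_embedding(other))
print(f"same speaker: {same:.3f}, different speaker: {diff:.3f}")
```

The decoder is then conditioned on this embedding, so matching the reference voice reduces to matching its embedding, with no per-speaker training.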
Can VoxCPM2 run in real-time?
Yes. VoxCPM2 supports streaming inference with sub-100ms latency on modern GPUs. The model uses a delayed parallel decoding strategy where speech is generated in overlapping chunks, allowing the first audio segment to start playing before the remaining utterance is fully generated. This makes it suitable for live voice assistants, real-time translation, and interactive dialogue systems.
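A generator is the natural shape for this kind of chunked streaming: audio for the first chunk is yielded to the playback buffer while later chunks are still being produced. The sketch below simulates only that control flow; `fake_synthesize` is a placeholder for the actual model call, and the real delayed parallel decoding happens inside VoxCPM2:

```python
import time

def fake_synthesize(chunk: str) -> bytes:
    """Placeholder for the model: returns the chunk's bytes as 'audio'."""
    return chunk.encode("utf-8")

def stream_tts(text: str, chunk_chars: int = 16):
    """Toy streaming synthesizer: yield audio chunk by chunk instead of
    waiting for the full utterance to finish."""
    for start in range(0, len(text), chunk_chars):
        yield fake_synthesize(text[start:start + chunk_chars])

t0 = time.perf_counter()
first_chunk_at = None
audio = []
for buf in stream_tts("This sentence is long enough to span several chunks."):
    if first_chunk_at is None:  # playback could begin here
        first_chunk_at = time.perf_counter() - t0
    audio.append(buf)
print(f"{len(audio)} chunks; first available after {first_chunk_at * 1e3:.2f} ms")
```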
```mermaid
sequenceDiagram
    participant User as User
    participant Model as VoxCPM2
    participant Speaker as Speaker Encoder
    participant Audio as Audio Output
    User->>Model: Provide text + reference audio
    Model->>Speaker: Extract speaker embedding
    Speaker-->>Model: Speaker vector
    Note over Model: Generate chunk 1
    Model->>Audio: Stream chunk 1 (50ms latency)
    Note over Model: Generate chunk 2 (parallel)
    Model->>Audio: Stream chunk 2
    Note over Model: Continue until complete
    Audio-->>User: Full speech output
```

What is the license and how can I use it?
VoxCPM2 is released under the Apache 2.0 license, allowing free use for commercial and research purposes. The model weights are hosted on Hugging Face. The team provides a Gradio web interface for easy experimentation and a Python API for programmatic use. Installation requires Python 3.10+ and PyTorch 2.0+.
Frequently Asked Questions
What is VoxCPM2? VoxCPM2 is a tokenizer-free TTS model by OpenBMB that generates natural speech across 30+ languages using continuous speech representations.
What model versions are available? The project offers VoxCPM2-Base (2B, multilingual), VoxCPM2-Light (600M, 10 languages), VoxCPM2-VoiceDesign (text-to-voice), and VoxCPM2-Streaming (real-time).
How does voice design work? Users describe the desired voice in natural language (e.g., “warm female voice with a Southern accent”) and the model generates speech matching that description without reference audio.
What languages are supported? Over 30 languages including English, Chinese, Japanese, Korean, Spanish, French, German, Arabic, Hindi, and many more.
What are the hardware requirements? Minimum 8 GB VRAM for inference, 16 GB recommended for optimal quality, and 24 GB for real-time streaming. CPU inference is possible with ONNX export.