VoxCPM2: OpenBMB's Tokenizer-Free TTS for Multilingual Speech Generation

VoxCPM2 is a 2B parameter tokenizer-free TTS model by OpenBMB supporting 30 languages with voice design, voice cloning, and real-time streaming.

VoxCPM2 is a tokenizer-free text-to-speech (TTS) model developed by OpenBMB, an open-source AI research community affiliated with Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI). With 2 billion parameters, VoxCPM2 represents a paradigm shift in speech synthesis by operating directly on continuous speech representations, eliminating the need for discrete audio tokenizers that typically degrade voice quality.

The model supports over 30 languages with capabilities spanning zero-shot voice cloning, voice design (creating entirely new voices from text descriptions), and real-time streaming inference. VoxCPM2 has quickly become one of the most talked-about open-source TTS models of 2026, competing directly with commercial offerings like ElevenLabs and OpenAI’s TTS while remaining freely available under the Apache 2.0 license.

What makes VoxCPM2 different from traditional TTS models?

Traditional TTS pipelines rely on cascaded systems: text is converted to linguistic features, then to discrete audio tokens, and finally to waveforms. Each stage introduces compression artifacts and information loss. VoxCPM2’s tokenizer-free architecture processes continuous speech representations directly using a flow-matching diffusion backbone, preserving the full richness of natural speech including prosody, emotion, and speaker identity.
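Flow matching is easiest to see in a toy setting. The sketch below is a generic illustration of ODE-based flow-matching sampling, not VoxCPM2's actual network: it Euler-integrates a known velocity field that carries a starting point along a straight line to a target, which is the same sampling loop a flow-matching decoder runs with a learned, conditioned velocity field.

```python
import numpy as np

def sample_flow(x0: np.ndarray, target: np.ndarray, steps: int = 100) -> np.ndarray:
    """Euler-integrate a straight-line flow field from x0 toward target.

    For the linear flow x_t = (1 - t) * x0 + t * target, the velocity
    field is constant: v(x, t) = target - x0. In a real flow-matching
    TTS decoder, a neural network predicts v from the current state,
    the timestep, and the text/speaker conditioning.
    """
    x, dt = x0.astype(float), 1.0 / steps
    v = target - x0          # ground-truth velocity of the linear flow
    for _ in range(steps):
        x = x + dt * v       # one Euler step along the flow
    return x

noise = np.zeros(4)                         # stand-in for an initial Gaussian sample
target = np.array([1.0, -2.0, 0.5, 3.0])    # stand-in for a continuous speech latent
print(np.allclose(sample_flow(noise, target), target))  # True
```

Because the model predicts velocities over continuous latents rather than indices into a discrete codebook, no quantization step ever touches the signal.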

Model Versions and Specifications

| Model Variant | Parameters | Languages | Key Feature |
| --- | --- | --- | --- |
| VoxCPM2-Base | 2B | 30+ | Full multilingual TTS |
| VoxCPM2-VoiceDesign | 2B | 30+ | Text-prompted voice creation |
| VoxCPM2-Streaming | 2B | 30+ | Real-time streaming output |
| VoxCPM2-Light | ~600M | 10 | Lightweight for edge deployment |

Voice Design: Creating Voices from Text Descriptions

One of VoxCPM2’s most innovative features is voice design. Instead of requiring a reference audio sample for cloning, users can describe the desired voice in natural language. For example, “A warm, authoritative male voice with a slight British accent” generates a matching voice on demand. This capability rivals commercial offerings from ElevenLabs and Play.ht, but runs entirely locally with no API costs.
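A text-prompted voice design call might look like the sketch below. The `VoxCPM2VoiceDesign` class, its `generate` signature, and the assumed sample rate are all hypothetical illustrations, not the project's actual API; the stub body stands in for the real model so the snippet is self-contained.

```python
import numpy as np

class VoxCPM2VoiceDesign:
    """Stub standing in for the real model wrapper (hypothetical API)."""
    SAMPLE_RATE = 24_000  # assumed output sample rate, not confirmed

    def generate(self, text: str, voice_prompt: str) -> np.ndarray:
        # The real model would synthesize `text` in a voice matching
        # `voice_prompt`; this stub returns one second of silence with
        # the right shape and dtype so the call pattern is runnable.
        return np.zeros(self.SAMPLE_RATE, dtype=np.float32)

model = VoxCPM2VoiceDesign()
audio = model.generate(
    text="Welcome to tonight's broadcast.",
    voice_prompt="A warm, authoritative male voice with a slight British accent",
)
print(audio.shape, audio.dtype)  # (24000,) float32
```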

Supported Languages and Performance

| Language Family | Languages | Quality Rating |
| --- | --- | --- |
| Indo-European | English, Spanish, French, German, Portuguese, Italian, Russian, Hindi, Urdu, Bengali | Excellent |
| Sino-Tibetan | Mandarin Chinese, Cantonese, Tibetan, Burmese | Excellent |
| Japonic/Korean | Japanese, Korean | Very Good |
| Austronesian | Indonesian, Malay, Tagalog, Vietnamese | Very Good |
| Afro-Asiatic | Arabic, Hebrew, Amharic | Good |
| Turkic | Turkish, Uzbek, Kazakh, Azerbaijani | Good |

Hardware Requirements for Running VoxCPM2

| Configuration | Memory | Inference Speed (Real-Time Factor) |
| --- | --- | --- |
| Minimum | 8 GB VRAM | ~0.3 RTF |
| Recommended | 16 GB VRAM | ~0.15 RTF |
| Real-time streaming | 24 GB VRAM | ~0.05 RTF (sub-100ms latency) |
| CPU (ONNX) | 32 GB RAM | ~0.8 RTF |

The model runs efficiently on consumer GPUs like the NVIDIA RTX 4090, and quantization via bitsandbytes can reduce memory requirements by 40-50% with minimal quality loss.
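Real-time factor is synthesis time divided by audio duration, so the table above converts directly into wall-clock estimates:

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor.

    RTF = synthesis_time / audio_duration, so values below 1.0 mean
    faster-than-real-time generation.
    """
    return audio_seconds * rtf

# A 10-minute audiobook chapter (600 s) on the recommended 16 GB setup:
print(synthesis_seconds(600, 0.15))  # 90.0 seconds
# The same chapter on the 24 GB streaming configuration:
print(synthesis_seconds(600, 0.05))  # 30.0 seconds
```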

How does zero-shot voice cloning work in VoxCPM2?

Zero-shot cloning requires a 3-10 second reference audio clip. VoxCPM2 extracts a speaker embedding from the reference and conditions the flow-matching decoder to generate speech matching the reference voice. The process requires no fine-tuning or additional training, making it ideal for applications like audiobook narration, content localization, and personalized voice assistants.
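Since the 3-10 second window matters in practice, a pre-flight check on the reference clip is a sensible guard in front of any cloning call. The helper below is an assumption for illustration, not part of VoxCPM2; only the duration window comes from the description above.

```python
import numpy as np

def check_reference(audio: np.ndarray, sample_rate: int) -> float:
    """Validate a reference clip for zero-shot voice cloning.

    VoxCPM2 expects roughly 3-10 seconds of reference audio: shorter
    clips yield a weak speaker embedding, longer ones add little.
    Returns the clip duration in seconds.
    """
    duration = len(audio) / sample_rate
    if not 3.0 <= duration <= 10.0:
        raise ValueError(f"reference clip is {duration:.1f}s, expected 3-10s")
    return duration

clip = np.zeros(5 * 16_000, dtype=np.float32)  # 5 s of audio at 16 kHz
print(check_reference(clip, 16_000))  # 5.0
```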

Can VoxCPM2 run in real-time?

Yes. VoxCPM2 supports streaming inference with sub-100ms latency on modern GPUs. The model uses a delayed parallel decoding strategy where speech is generated in overlapping chunks, allowing the first audio segment to start playing before the remaining utterance is fully generated. This makes it suitable for live voice assistants, real-time translation, and interactive dialogue systems.
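The latency math behind chunked streaming is simple: once the first chunk is ready, playback never stalls as long as each chunk is generated faster than it plays back (RTF < 1). The timeline simulation below is illustrative only, not the model's actual scheduler; chunk size and RTF values are assumptions drawn from the hardware table.

```python
def stream_timeline(total_s: float, chunk_s: float, rtf: float):
    """Return (first_audio_latency_s, underruns) for chunked streaming.

    Chunk i finishes generating at (i + 1) * chunk_s * rtf and must be
    ready by the time playback reaches it: latency + i * chunk_s.
    """
    latency = chunk_s * rtf              # wait only for the first chunk
    n_chunks = int(total_s / chunk_s)
    underruns = sum(
        1 for i in range(n_chunks)
        if (i + 1) * chunk_s * rtf > latency + i * chunk_s
    )
    return latency, underruns

# 10 s utterance, 0.5 s chunks, streaming-config RTF of 0.05:
lat, misses = stream_timeline(10.0, 0.5, 0.05)
print(f"first audio after {lat * 1000:.0f} ms, {misses} underruns")
# first audio after 25 ms, 0 underruns
```

At RTF 0.05 the first 0.5 s chunk is ready in 25 ms, consistent with the sub-100ms latency figure above.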

What is the license and how can I use it?

VoxCPM2 is released under the Apache 2.0 license, allowing free use for commercial and research purposes. The model weights are hosted on Hugging Face. The team provides a Gradio web interface for easy experimentation and a Python API for programmatic use. Installation requires Python 3.10+ and PyTorch 2.0+.

Frequently Asked Questions

What is VoxCPM2? VoxCPM2 is a tokenizer-free TTS model by OpenBMB that generates natural speech across 30+ languages using continuous speech representations.

What model versions are available? The project offers VoxCPM2-Base (2B, multilingual), VoxCPM2-Light (~600M, 10 languages), VoxCPM2-VoiceDesign (text-to-voice), and VoxCPM2-Streaming (real-time).

How does voice design work? Users describe the desired voice in natural language (e.g., “warm female voice with a Southern accent”) and the model generates speech matching that description without reference audio.

What languages are supported? Over 30 languages including English, Chinese, Japanese, Korean, Spanish, French, German, Arabic, Hindi, and many more.

What are the hardware requirements? Minimum 8 GB VRAM for inference, 16 GB recommended for optimal quality, and 24 GB for real-time streaming. CPU inference is possible with ONNX export.
