VoxCPM2 is a tokenizer-free text-to-speech (TTS) model developed by OpenBMB, an open-source AI research community affiliated with Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI). With 2 billion parameters, VoxCPM2 represents a paradigm shift in speech synthesis by operating directly on continuous speech representations, eliminating the need for discrete audio tokenizers that typically degrade voice quality.
The model supports over 30 languages with capabilities spanning zero-shot voice cloning, voice design (creating entirely new voices from text descriptions), and real-time streaming inference. VoxCPM2 has quickly become one of the most talked-about open-source TTS models of 2026, competing directly with commercial offerings like ElevenLabs and OpenAI’s TTS while remaining freely available under the Apache 2.0 license.
What makes VoxCPM2 different from traditional TTS models?
Traditional TTS pipelines rely on cascaded systems: text is converted to linguistic features, then to discrete audio tokens, and finally to waveforms. Each stage introduces compression artifacts and information loss. VoxCPM2’s tokenizer-free architecture processes continuous speech representations directly using a flow-matching diffusion backbone, preserving the full richness of natural speech including prosody, emotion, and speaker identity.
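The core idea of flow matching is to learn a time-dependent vector field and generate a sample by integrating an ODE from noise toward the data distribution. As a minimal sketch (a toy hand-written field standing in for the learned network, with a fixed target vector standing in for a speech latent; none of this is VoxCPM2's actual code):

```python
import numpy as np

def euler_sample(v_field, x0, steps=200):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy "learned" vector field: the conditional field of a straight-line
# probability path toward a fixed target, v(x, t) = (target - x) / (1 - t).
# In a real model, the target is unknown and the field is a neural network.
target = np.array([0.5, -1.0, 2.0])

def v_field(x, t):
    return (target - x) / max(1.0 - t, 1e-3)

noise = np.random.default_rng(0).standard_normal(3)
sample = euler_sample(v_field, noise)
print(np.round(sample, 3))  # converges to target
```

Because the model operates on continuous latents end to end, there is no quantization step in this loop where prosodic detail could be discarded.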
Model Versions and Specifications
| Model Variant | Parameters | Languages | Key Feature |
|---|---|---|---|
| VoxCPM2-Base | 2B | 30+ | Full multilingual TTS |
| VoxCPM2-VoiceDesign | 2B | 30+ | Text-prompted voice creation |
| VoxCPM2-Streaming | 2B | 30+ | Real-time streaming output |
| VoxCPM2-Light | ~600M | 10 | Lightweight for edge deployment |
Voice Design: Creating Voices from Text Descriptions
One of VoxCPM2’s most innovative features is voice design. Instead of requiring a reference audio sample for cloning, users can describe the desired voice in natural language. For example, “A warm, authoritative male voice with a slight British accent” generates a matching voice on demand. This capability rivals commercial offerings from ElevenLabs and Play.ht, but runs entirely locally with no API costs.
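The contract of voice design is simple: the same description should always yield the same voice, and different descriptions should yield different voices. The toy sketch below illustrates only that contract, using a hash-seeded random vector as a stand-in for a learned design encoder (this is not VoxCPM2's actual mechanism, and `design_voice_latent` is a hypothetical name):

```python
import hashlib
import numpy as np

def design_voice_latent(description: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a design encoder: map a natural-language voice
    description to a deterministic latent vector. A real encoder is a
    learned text model; hashing just makes the mapping reproducible."""
    digest = hashlib.sha256(description.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

a = design_voice_latent("A warm, authoritative male voice with a slight British accent")
b = design_voice_latent("A warm, authoritative male voice with a slight British accent")
c = design_voice_latent("A bright, energetic female voice")
print(np.allclose(a, b), np.allclose(a, c))  # True False
```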
```mermaid
flowchart LR
    A[Text Prompt] --> B[Voice Encoder]
    B --> C[Latent Space]
    D[Speaker Description] --> E[Design Encoder]
    E --> C
    C --> F[Flow Matching Decoder]
    F --> G[Waveform Output]
```

Supported Languages and Performance
| Language Family | Languages | Quality Rating |
|---|---|---|
| Indo-European | English, Spanish, French, German, Portuguese, Italian, Russian, Hindi, Urdu, Bengali | Excellent |
| Sino-Tibetan | Mandarin Chinese, Cantonese, Tibetan, Burmese | Excellent |
| Japonic/Korean | Japanese, Korean | Very Good |
| Austronesian | Indonesian, Malay, Tagalog, Vietnamese | Very Good |
| Afro-Asiatic | Arabic, Hebrew, Amharic | Good |
| Turkic | Turkish, Uzbek, Kazakh, Azerbaijani | Good |
Hardware Requirements for Running VoxCPM2
| Configuration | GPU Memory | Inference Speed (Real-Time Factor) |
|---|---|---|
| Minimum | 8 GB VRAM | ~0.3 RTF |
| Recommended | 16 GB VRAM | ~0.15 RTF |
| Real-time streaming | 24 GB VRAM | ~0.05 RTF (sub-100ms latency) |
| CPU (ONNX) | 32 GB RAM | ~0.8 RTF |
The model runs efficiently on consumer GPUs like the NVIDIA RTX 4090, and quantization via bitsandbytes can reduce memory requirements by 40-50% with minimal quality loss.
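Some back-of-envelope arithmetic makes these numbers concrete. A 2B-parameter model stores roughly 4 GB of weights in fp16 and roughly 1 GB in 4-bit form; actual VRAM usage is higher once activations and framework overhead are included, which is why the table's figures exceed the weight-only estimates. The real-time factor (RTF) is synthesis time divided by audio duration:

```python
PARAMS = 2_000_000_000  # VoxCPM2-Base parameter count

def weight_gb(params: int, bits_per_param: float) -> float:
    """Weight-only memory footprint in GB (ignores activations,
    KV caches, and framework overhead)."""
    return params * bits_per_param / 8 / 1e9

fp16 = weight_gb(PARAMS, 16)
int4 = weight_gb(PARAMS, 4)
print(f"fp16 weights: {fp16:.1f} GB, 4-bit weights: {int4:.1f} GB")

# RTF 0.15 means 10 s of audio is synthesized in 1.5 s.
rtf, audio_seconds = 0.15, 10
print(f"Synthesis time at RTF {rtf}: {rtf * audio_seconds:.1f} s")
```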
How does zero-shot voice cloning work in VoxCPM2?
Zero-shot cloning requires a 3-10 second reference audio clip. VoxCPM2 extracts a speaker embedding from the reference and conditions the flow-matching decoder to generate speech matching the reference voice. The process requires no fine-tuning or additional training, making it ideal for applications like audiobook narration, content localization, and personalized voice assistants.
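The speaker-embedding step can be illustrated with a toy encoder: pool per-frame features into one fixed-size vector and compare voices by cosine similarity. The mean-pooling below is only a stand-in for VoxCPM2's learned speaker encoder, and the synthetic "frames" replace real audio features:

```python
import numpy as np

def speaker_embedding(frames: np.ndarray) -> np.ndarray:
    """Toy speaker encoder: mean-pool per-frame features into a single
    L2-normalized vector. Real encoders are learned networks, but the
    pooled fixed-size embedding interface is the same."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

rng = np.random.default_rng(0)
voice = rng.standard_normal(80)  # per-speaker "timbre" offset
clip_a = voice + 0.1 * rng.standard_normal((300, 80))  # two clips, same speaker
clip_b = voice + 0.1 * rng.standard_normal((300, 80))
other = rng.standard_normal(80) + 0.1 * rng.standard_normal((300, 80))

same = cosine(speaker_embedding(clip_a), speaker_embedding(clip_b))
diff = cosine(speaker_embedding(clip_a), speaker_embedding(other))
print(f"same speaker: {same:.3f}, different speaker: {diff:.3f}")
```

The decoder is then conditioned on this embedding, so matching the reference voice reduces to matching its embedding, with no per-speaker training.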
Can VoxCPM2 run in real-time?
Yes. VoxCPM2 supports streaming inference with sub-100ms latency on modern GPUs. The model uses a delayed parallel decoding strategy where speech is generated in overlapping chunks, allowing the first audio segment to start playing before the remaining utterance is fully generated. This makes it suitable for live voice assistants, real-time translation, and interactive dialogue systems.
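A generator is the natural shape for this kind of chunked streaming: audio for the first chunk is yielded to the playback buffer while later chunks are still being produced. The sketch below simulates only that control flow; `fake_synthesize` is a placeholder for the actual model call, and the real delayed parallel decoding happens inside VoxCPM2:

```python
import time

def fake_synthesize(chunk: str) -> bytes:
    """Placeholder for the model: returns the chunk's bytes as 'audio'."""
    return chunk.encode("utf-8")

def stream_tts(text: str, chunk_chars: int = 16):
    """Toy streaming synthesizer: yield audio chunk by chunk instead of
    waiting for the full utterance to finish."""
    for start in range(0, len(text), chunk_chars):
        yield fake_synthesize(text[start:start + chunk_chars])

t0 = time.perf_counter()
first_chunk_at = None
audio = []
for buf in stream_tts("This sentence is long enough to span several chunks."):
    if first_chunk_at is None:  # playback could begin here
        first_chunk_at = time.perf_counter() - t0
    audio.append(buf)
print(f"{len(audio)} chunks; first available after {first_chunk_at * 1e3:.2f} ms")
```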
```mermaid
sequenceDiagram
    participant User as User
    participant Model as VoxCPM2
    participant Speaker as Speaker Encoder
    participant Audio as Audio Output
    User->>Model: Provide text + reference audio
    Model->>Speaker: Extract speaker embedding
    Speaker-->>Model: Speaker vector
    Note over Model: Generate chunk 1
    Model->>Audio: Stream chunk 1 (50ms latency)
    Note over Model: Generate chunk 2 (parallel)
    Model->>Audio: Stream chunk 2
    Note over Model: Continue until complete
    Audio-->>User: Full speech output
```

What is the license and how can I use it?
VoxCPM2 is released under the Apache 2.0 license, allowing free use for commercial and research purposes. The model weights are hosted on Hugging Face. The team provides a Gradio web interface for easy experimentation and a Python API for programmatic use. Installation requires Python 3.10+ and PyTorch 2.0+.
Frequently Asked Questions
What is VoxCPM2? VoxCPM2 is a tokenizer-free TTS model by OpenBMB that generates natural speech across 30+ languages using continuous speech representations.
What model versions are available? The project offers VoxCPM2-Base (2B, multilingual), VoxCPM2-Light (600M, 10 languages), VoxCPM2-VoiceDesign (text-to-voice), and VoxCPM2-Streaming (real-time).
How does voice design work? Users describe the desired voice in natural language (e.g., “warm female voice with a Southern accent”) and the model generates speech matching that description without reference audio.
What languages are supported? Over 30 languages including English, Chinese, Japanese, Korean, Spanish, French, German, Arabic, Hindi, and many more.
What are the hardware requirements? Minimum 8 GB VRAM for inference, 16 GB recommended for optimal quality, and 24 GB for real-time streaming. CPU inference is possible with ONNX export.