ChatTTS: Open-Source Conversational Text-to-Speech Model for Natural Dialogue

ChatTTS is an open-source text-to-speech model optimized for conversational dialogue with fine-grained prosody control, supporting English and Chinese.

Text-to-speech technology has advanced dramatically in recent years, but a persistent gap remains between synthetic voices and the natural cadence of human conversation. Most TTS models produce clean, clear speech that sounds unmistakably artificial — perfectly enunciated, but lacking the pauses, breathiness, laughter, and tonal variation that make dialogue feel real. ChatTTS directly targets this gap, offering an open-source model designed from the ground up for conversational speech rather than narration or announcement.

Developed by the team at 2noise, ChatTTS has rapidly gained traction in the open-source community for its ability to produce speech that sounds genuinely human. The model was trained on over 30,000 hours of conversational audio data, deliberately prioritizing natural dialogue patterns over the pristine recording quality that characterizes most commercial TTS datasets. The result is a model that laughs, pauses, trails off, and varies its pitch and pace in ways that feel remarkably organic.

The model’s architecture builds on modern transformer-based neural codec language models, similar in spirit to models like Bark and VALL-E, but optimized specifically for the two most widely spoken languages on the web: English and Chinese. Its ability to handle code-switching — mixing English and Chinese within a single sentence — makes it particularly valuable for bilingual applications ranging from language learning to international customer service.
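
To get a feel for code-switching in practice, here is a minimal sketch that feeds a mixed Chinese-English sentence to the model. It assumes the same Python API shown in the getting-started section below, and the sentence itself is an invented example:

import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # renamed to chat.load() in newer releases

# One utterance mixing Chinese and English; no language tags are needed,
# as the model infers the language directly from the text.
texts = ["今天的 meeting 改到下午三点, sorry for the short notice."]
wavs = chat.infer(texts)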


What Makes ChatTTS Different from Other TTS Models?

The fundamental difference lies in training data philosophy and prosody modeling. Most TTS systems are trained on audiobook recordings or professionally narrated datasets: clean, well-paced, and deliberately enunciated. These produce excellent results for reading aloud but sound unnatural in dialogue contexts.

ChatTTS, by contrast, was trained on conversational data: real human conversations with all their imperfections, overlaps, hesitations, and expressive variations. The model learned to reproduce these patterns, including paralinguistic elements like laughter, audible breathing, and filled pauses (“um,” “uh”) that are essential for natural-sounding dialogue but typically filtered out of TTS training corpora.

| TTS Model | Training Data | Prosody Control | Languages | Naturalness Rating | VRAM |
| --- | --- | --- | --- | --- | --- |
| ChatTTS | 30,000+ hours of conversation | Fine-grained tokens | EN, ZH | Very high | 4 GB |
| Bark (Suno) | Labeled audio | Coarse (speaker prompts) | Multilingual | High | 10+ GB |
| VALL-E (Microsoft) | 60,000 hours | Speaker adaptation | EN | Very high | 8+ GB |
| Piper TTS | Varied | Limited (speed/pitch) | Multilingual | Moderate | 1-2 GB |
| Edge / Azure TTS | Professional studio | SSML markup | 100+ languages | High | Cloud API |

How Does ChatTTS’s Prosody Control Work in Practice?

ChatTTS offers one of the most granular prosody control systems available in open-source TTS. Rather than requiring complex SSML markup or post-processing, prosody markers are embedded directly in the text as special tokens:

| Token | Effect | Example Use |
| --- | --- | --- |
| [laugh] | Light laughter | “That’s hilarious [laugh] I can’t believe it” |
| [uv_break] | Unvoiced breath/pause | “Well [uv_break] let me think about that” |
| [v_break] | Voiced hesitation | “I’m not sure [v_break] maybe tomorrow?” |
| [lbreak] | Long pause for emphasis | “The answer is [lbreak] forty-two” |
| Intra-word colon | Extended vowel sound | “I’m so: sorry to hear that” |

This token-based approach means developers can script dialogue with specific emotional and rhythmic qualities without needing a separate prosody prediction model. The tokens function similarly to stage directions in a script — they tell the model how to perform the lines, not just what words to say.
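
As a concrete sketch (using the Python API covered in the getting-started section below), the tokens are simply written inline in the input strings; everything else about the call stays the same:

import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # renamed to chat.load() in newer releases

# Stage-direction-style tokens embedded directly in the text.
texts = [
    "Well [uv_break] that is a good question [v_break] let me think.",
    "The answer is [lbreak] forty-two [laugh]",
]
wavs = chat.infer(texts, use_decoder=True)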


What Are the Practical Applications of ChatTTS?

The conversational focus of ChatTTS unlocks use cases where traditional TTS falls short:

Voice assistants and chatbots benefit most directly. A customer service bot reading scripted responses sounds robotic; one using ChatTTS can insert natural hesitations, confirmations, and even empathetic tones. Language learning applications can use ChatTTS to generate realistic bilingual dialogue examples with authentic pacing. Audiobook narration of dialogue-heavy fiction becomes more engaging when characters speak with natural conversational patterns. Content creation — including YouTube narration, podcast segments, and social media voiceovers — gains production value from speech that does not sound synthesized.

| Application | Traditional TTS | ChatTTS | Why It Matters |
| --- | --- | --- | --- |
| Customer service IVR | Noticeably synthetic | Near-human dialogue | Higher caller satisfaction |
| Language learning apps | Stiff pronunciation | Natural conversational flow | Better listening comprehension |
| Game NPC dialogue | Pre-recorded or robotic | Dynamic, expressive speech | Reduced production costs |
| Accessibility tools | Functional but flat | Engaging, varied delivery | Improved user experience |
| Content creation | Requires heavy editing | Less post-processing | Faster production cycles |

How Resource-Intensive Is ChatTTS to Run?

ChatTTS is designed for practical deployment. The model requires approximately 4 GB of VRAM for GPU inference, a modest footprint that runs on most consumer GPUs. CPU inference is possible but approximately 10-20x slower.

| Inference Mode | Hardware | Speed (per second of audio) |
| --- | --- | --- |
| CUDA GPU | NVIDIA RTX 3060+ | ~0.3-0.5x real-time |
| CUDA GPU | NVIDIA RTX 4090 | ~2-3x real-time |
| Metal GPU | Apple M2/M3 | ~0.8-1.5x real-time |
| CPU only | Modern multi-core | ~5-10x real-time |
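
These figures are easy to sanity-check on your own hardware. The sketch below times one generation and reports seconds of compute per second of generated audio; it assumes the API from the getting-started section and ChatTTS’s 24 kHz output rate:

import time

import numpy as np
import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # renamed to chat.load() in newer releases

start = time.perf_counter()
wavs = chat.infer(["This is a short benchmark sentence for ChatTTS."])
elapsed = time.perf_counter() - start

audio = np.squeeze(np.asarray(wavs[0]))  # handles (N,) or (1, N) output shapes
audio_seconds = audio.shape[-1] / 24000  # ChatTTS generates 24 kHz audio
print(f"~{elapsed / audio_seconds:.2f} s of compute per second of audio")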

The model’s GitHub repository provides a straightforward Python API. A basic inference script requires fewer than 20 lines of code, making it accessible to developers who are not speech synthesis specialists.


How to Get Started with ChatTTS

Setting up ChatTTS locally is straightforward for anyone familiar with Python and PyTorch:

  1. Clone the repository from github.com/2noise/ChatTTS
  2. Install dependencies: pip install ChatTTS torch torchaudio
  3. Run a basic inference script:
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load_models()  # renamed to chat.load() in newer releases

texts = ["Hello [uv_break] this is a test of ChatTTS [laugh]"]
wavs = chat.infer(texts, use_decoder=True)

# infer() returns NumPy arrays, so convert to a tensor before saving.
# Depending on the release, wavs[0] may be 1-D; add a channel dimension
# with .unsqueeze(0) if torchaudio complains about the shape.
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)

The model weights download automatically on first load. The complete pipeline — from text to a playable WAV file — runs in under a minute on a CUDA-capable GPU.
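
A useful follow-up: by default, each inference call can sample a new random voice. Recent releases expose a sample_random_speaker helper for drawing one speaker embedding and reusing it, which keeps multi-sentence output in a single consistent voice. The exact parameter container varies by release (a plain dict in older versions, a ChatTTS.Chat.InferCodeParams object in newer ones), so treat this as a sketch:

# Draw one speaker embedding and reuse it so every clip shares a voice.
spk = chat.sample_random_speaker()
wavs = chat.infer(texts, params_infer_code={"spk_emb": spk})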


License Considerations for ChatTTS

ChatTTS uses a dual-licensing approach. The default open-source license is AGPLv3, which requires that any software incorporating the model also be released under a compatible open-source license when it is distributed or offered as a network service. For non-commercial research and personal projects, a CC BY-NC 4.0 license is available, which permits free use as long as it is not for commercial purposes.

Developers building commercial applications should consult the license files in the repository carefully, and consider whether the AGPLv3 terms are compatible with their distribution model. The repository also includes a separate agreement for commercial licensing inquiries.

