GPT-SoVITS is an open-source voice cloning and text-to-speech system developed by RVC-Boss that has taken the AI audio community by storm. The project’s standout capability is few-shot voice cloning: just 1 minute of voice data trains a convincing voice model, and zero-shot mode works from as little as 5-10 seconds of reference audio. Supporting Chinese, English, Japanese, and Korean, GPT-SoVITS combines GPT-based autoregressive modeling with the spectral fidelity of SoVITS, a synthesis model derived from so-vits-svc (SoftVC VITS Singing Voice Conversion) and built on the VITS architecture.
The project has amassed significant GitHub popularity by making professional-grade voice cloning accessible to anyone with a consumer GPU. Unlike commercial voice cloning services that charge per minute or require cloud uploads, GPT-SoVITS runs entirely locally, protecting user privacy and enabling unlimited usage. The quality has improved dramatically through iterative versions, with recent releases approaching studio-grade fidelity for trained voices.
What is GPT-SoVITS and how does it work?
GPT-SoVITS uses a two-stage architecture. First, a GPT-based autoregressive model generates semantic tokens from text input, conditioned on a speaker reference. These semantic tokens capture the prosody, intonation, and speaking style. Second, a VITS-based SoVITS model converts the semantic tokens into high-fidelity audio. This separation allows the GPT component to focus on “what to say and how to say it” while the SoVITS component focuses on “how to make it sound real.”
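In rough pseudocode, the pipeline looks like the sketch below. This is a conceptual illustration only; the names (`speaker_encoder`, `gpt_model`, `sovits_decoder`, and so on) are invented for clarity and are not GPT-SoVITS’s actual API.

```python
# Conceptual sketch of the two-stage GPT-SoVITS pipeline.
# All names here are illustrative, not the project's real API.
import numpy as np

def synthesize(
    text: str,
    reference_audio: np.ndarray,
    speaker_encoder,   # pre-trained encoder: audio -> speaker embedding
    tokenizer,         # text -> token IDs
    gpt_model,         # autoregressive semantic-token generator
    sovits_decoder,    # VITS-based token-to-audio decoder
) -> np.ndarray:
    # Condition generation on the reference speaker.
    speaker_embedding = speaker_encoder.encode(reference_audio)

    # Stage 1: GPT predicts semantic tokens capturing content and
    # prosody, i.e. "what to say and how to say it".
    text_tokens = tokenizer.encode(text)
    semantic_tokens = gpt_model.generate(text_tokens, condition=speaker_embedding)

    # Stage 2: SoVITS renders the tokens as audio,
    # i.e. "how to make it sound real".
    return sovits_decoder.decode(semantic_tokens, speaker_embedding)
```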
How much training data is needed?
| Mode | Reference Audio | Training Time (RTX 4090) | Quality |
|---|---|---|---|
| Zero-shot | 5-10 seconds | None (instant) | Good |
| Quick few-shot | 30 seconds | 2-3 minutes | Very good |
| Standard few-shot | 1 minute | 5-10 minutes | Excellent |
| Optimal | 3-5 minutes | 15-30 minutes | Studio quality |
What languages are supported?
| Language | Zero-shot | Few-shot | Quality Rating |
|---|---|---|---|
| Chinese (Mandarin) | Excellent | Excellent | Best |
| English | Excellent | Excellent | Best |
| Japanese | Very Good | Very Good | Very High |
| Korean | Good | Very Good | High |
| Cantonese | Fair | Good | Beta |
| Other languages | Via transfer | Experimental | Variable |
How does zero-shot voice cloning work?
Zero-shot voice cloning in GPT-SoVITS requires only a short reference audio clip (5-10 seconds). The system extracts a speaker embedding using a pre-trained speaker encoder and uses it to condition the GPT model during inference. While zero-shot quality is good for short utterances, it can struggle with emotional variation and unusual prosody. For production use, few-shot fine-tuning with 1 minute of data is recommended for significantly better quality.
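The practical difference between the two modes can be summed up in a short sketch; as before, the method names are hypothetical placeholders, not the project’s real interface. The flowchart that follows traces the full zero-shot pipeline from reference audio to output.

```python
# Hypothetical sketch contrasting zero-shot and few-shot cloning.
# Method names are placeholders, not GPT-SoVITS's actual interface.

def clone_zero_shot(model, reference_clip, text):
    """No training: a 5-10 second reference conditions inference directly."""
    embedding = model.speaker_encoder.encode(reference_clip)
    return model.generate(text, condition=embedding)

def clone_few_shot(model, training_clips, text):
    """Fine-tune on ~1 minute of audio first, then synthesize."""
    finetuned = model.finetune(training_clips)  # minutes on a modern GPU
    return finetuned.generate(text)
```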
```mermaid
flowchart LR
    A[Reference Audio] --> B[Speaker Encoder]
    B --> C[Speaker Embedding]
    D[Text Input] --> E[Text Tokenizer]
    E --> F[GPT Model]
    C --> F
    F --> G[Semantic Tokens]
    G --> H[SoVITS Decoder]
    H --> I[Mel Spectrogram]
    I --> J[Vocoder]
    J --> K[Output Audio]
```
What features does GPT-SoVITS offer?
| Feature | Description | Status |
|---|---|---|
| Text-to-Speech | Generate speech from text with cloned voice | Stable |
| Voice Conversion | Convert any audio to the target voice | Stable |
| Emotion Control | Adjust emotional tone of generated speech | Beta |
| Cross-lingual | Speak one language with voice trained on another | Stable |
| Real-time | Low-latency inference for interactive use | Experimental |
| Web UI | Gradio-based graphical interface | Stable |
| API Server | REST API for programmatic integration | Stable |
How does GPT-SoVITS compare to other voice cloning tools?
Compared to commercial solutions like ElevenLabs, GPT-SoVITS offers comparable quality for trained voices while being free and fully local. Compared to other open-source TTS models like Coqui TTS or Tortoise-TTS, it typically produces more natural prosody and better voice similarity from less training data. Against VALL-E and similar token-based approaches, its key advantage is practical: the model and weights are openly available and can be fine-tuned locally on a single consumer GPU, rather than existing only as a research system trained on tens of thousands of hours of data.
```mermaid
sequenceDiagram
    participant User
    participant GPT as GPT Model
    participant SoVITS as SoVITS Decoder
    participant Vocoder
    User->>GPT: "Hello, welcome to our podcast" + reference
    GPT->>GPT: Generate semantic tokens
    GPT-->>SoVITS: Token sequence with prosody
    SoVITS->>SoVITS: Iterative refinement
    SoVITS-->>Vocoder: Mel spectrogram
    Vocoder->>Vocoder: Waveform generation
    Vocoder-->>User: Audio output
    Note over User,Vocoder: Total latency ~500ms for 10s audio
```
What are the hardware requirements?
| Component | Minimum | Recommended |
|---|---|---|
| GPU Memory | 6 GB VRAM | 12 GB VRAM |
| GPU Model | RTX 3060 | RTX 4090 |
| RAM | 16 GB | 32 GB |
| Storage | 10 GB (models + dependencies) | 20 GB |
| Training Time (1 min data) | 30 minutes (RTX 3060) | 5-10 minutes (RTX 4090) |
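Before training, it is worth confirming that your GPU clears the 6 GB floor. Here is a minimal check with PyTorch; the thresholds simply mirror the table above.

```python
# Quick check that your GPU meets the VRAM requirements before training.
# Requires PyTorch installed with CUDA support.
import torch

MIN_VRAM_GB = 6    # minimum per the table above
REC_VRAM_GB = 12   # recommended for faster training

if not torch.cuda.is_available():
    print("No CUDA GPU detected; GPT-SoVITS training needs one.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < MIN_VRAM_GB:
        print("Below the 6 GB minimum; training will likely fail.")
    elif vram_gb < REC_VRAM_GB:
        print("Meets the minimum; expect slower training times.")
    else:
        print("Meets the recommended spec.")
```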
How do I install GPT-SoVITS?
Installation is streamlined through the project’s one-click installers for Windows and Linux. For manual installation, the project requires Python 3.9+, PyTorch with CUDA support, and several audio processing libraries. The Gradio web UI launches automatically after setup, providing an intuitive interface for voice cloning, TTS generation, and voice conversion. An API mode is available for server deployment and integration with other applications.
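As a rough sketch, a client for the API server might look like the following. The default port, endpoint path, and parameter names shown here are assumptions based on common GPT-SoVITS deployments; verify them against the API script shipped with your installed version.

```python
# Sketch of calling a locally running GPT-SoVITS API server.
# Port, endpoint, and parameter names are assumptions; check the
# API script in your installation for the exact interface.
import requests

payload = {
    "refer_wav_path": "reference/speaker.wav",  # reference clip on the server
    "prompt_text": "Transcript of the reference clip.",
    "prompt_language": "en",
    "text": "Hello, welcome to our podcast.",
    "text_language": "en",
}

resp = requests.post("http://127.0.0.1:9880/", json=payload, timeout=120)
resp.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(resp.content)  # the server returns WAV audio bytes
```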
Frequently Asked Questions
What is GPT-SoVITS? GPT-SoVITS is an open-source voice cloning TTS system that can clone a voice with just 1 minute of training data, supporting Chinese, English, Japanese, and Korean.
How much training data is needed? Zero-shot works with 5-10 seconds of audio, few-shot requires about 1 minute for high quality, and optimal results use 3-5 minutes.
What is the difference between zero-shot and few-shot? Zero-shot uses a reference audio at inference time without fine-tuning; few-shot fine-tunes the model on the reference audio for better quality and similarity.
What languages are supported? Chinese (best quality), English, Japanese, and Korean are fully supported. Other languages have experimental support via cross-lingual transfer.
What are the hardware requirements? Minimum 6 GB VRAM (RTX 3060), recommended 12+ GB (RTX 4090). Training 1 minute of data takes 5-30 minutes depending on GPU.