RVC WebUI: Open-Source Real-Time Voice Conversion with VITS

RVC is an easy-to-use voice conversion framework based on VITS that trains good models with just 10 minutes of voice data and supports real-time conversion.

Editorial Team May 03, 2026 5 min read

RVC (Retrieval-based Voice Conversion) WebUI is an open-source voice conversion framework developed by the RVC-Project team and widely used for AI voice conversion of both speech and singing. Built on the VITS architecture (Variational Inference with adversarial learning for end-to-end Text-to-Speech), RVC can produce a convincing voice model from remarkably little training data: about 10 minutes of audio is enough.

The project distinguishes itself from traditional voice conversion approaches through its retrieval-based mechanism. Instead of requiring paired data (same content spoken in different voices), RVC uses a feature retrieval approach that extracts and transfers speaker characteristics while preserving the linguistic content of the source audio. This makes it particularly powerful for singing voice conversion, where preserving pitch, rhythm, and emotional expression is critical.

What is RVC and how does voice conversion work?

RVC converts the voice in an audio recording from one speaker to another while preserving the linguistic content, rhythm, and emotional delivery. The process involves extracting speaker-agnostic content features from the source audio, retrieving relevant voice characteristics from the target speaker’s trained model, and reconstructing the audio with the target voice characteristics applied. Unlike TTS, voice conversion doesn’t require text input – it takes audio as input and outputs audio with a different voice.

Training Requirements

Aspect	Minimum	Recommended	Optimal
Voice Data Duration	5 minutes	10 minutes	30+ minutes
Audio Quality	16kHz/16-bit	44.1kHz/24-bit	48kHz/24-bit
Training Steps	10,000	20,000	50,000+
Training Time (RTX 4090)	15 minutes	30 minutes	1 hour
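
The duration tiers in the table can be checked mechanically before training. The sketch below is a minimal illustration, assuming the dataset is a folder of uncompressed PCM WAV files. The helper names `wav_seconds`, `classify_minutes`, and `dataset_tier` are hypothetical and are not part of RVC WebUI.

```python
import wave
from pathlib import Path

# Duration thresholds in minutes, taken from the table above (highest first).
TIERS = [(30.0, "optimal"), (10.0, "recommended"), (5.0, "minimum")]


def wav_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_minutes(total_minutes):
    """Map a total duration in minutes to the highest tier it reaches."""
    for threshold, label in TIERS:
        if total_minutes >= threshold:
            return label
    return "below minimum"


def dataset_tier(folder):
    """Sum the durations of all .wav files in folder and classify the total."""
    total_s = sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav")))
    minutes = total_s / 60.0
    return minutes, classify_minutes(minutes)
```

Calling `dataset_tier("my_voice_clips")` on a hypothetical folder returns the total minutes and the matching tier label. It measures length only and says nothing about how clean the recordings are.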

Key Components

RVC’s pipeline includes several specialized components that work together to deliver high-quality voice conversion.

Component	Function	Technical Detail
RMVPE	Pitch extraction	Accurate F0 estimation for singing voices
UVR5	Source separation	Isolates vocals from background music
Content Extractor	Extracts content features	HuBERT-based feature extraction
Feature Retriever	Matches to target voice	KNN-based retrieval from trained database
VITS Generator	Reconstructs audio	VITS-based neural vocoder
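
The Feature Retriever row describes a KNN lookup against a database of target-speaker features. A minimal sketch of that idea in plain Python follows. The toy vectors, the value of `k`, the inverse-distance weighting, and the `strength` blend are illustrative assumptions; the article does not specify how RVC weights or mixes the retrieved features.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_retrieve(query, database, k=3):
    """Average the k database vectors nearest to query, weighted by closeness."""
    nearest = sorted(database, key=lambda vec: squared_distance(query, vec))[:k]
    # Closer neighbours get larger weights; the epsilon avoids division by zero.
    weights = [1.0 / (squared_distance(query, vec) + 1e-8) for vec in nearest]
    total = sum(weights)
    return [
        sum(w * vec[i] for w, vec in zip(weights, nearest)) / total
        for i in range(len(query))
    ]


def blend(source, retrieved, strength):
    """Mix retrieved target features into source features (strength in [0, 1])."""
    return [strength * r + (1.0 - strength) * s for s, r in zip(source, retrieved)]


# Toy 2-D "target speaker" database and one source frame (made-up numbers).
database = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
source_frame = [0.2, 0.1]
matched = knn_retrieve(source_frame, database, k=2)
print(blend(source_frame, matched, strength=0.75))
```

A `strength` of 0 leaves the source features untouched and 1 replaces them entirely with the retrieved ones. This is one plausible reading of a "retrieval strength" control.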

How does real-time voice conversion work?

RVC supports real-time voice conversion with latency as low as 20-30ms on modern GPUs. In real-time mode, audio is processed in small overlapping frames. The content extractor analyzes each frame, the feature retriever finds the best-matching target features, and the VITS generator produces the converted output. This enables live applications like voice changers for streaming, real-time interpretation, and interactive voice filters.

flowchart LR
    A[Source Audio Input] --> B[UVR5 Source Separation]
    B --> C[Vocal Track]
    C --> D[RMVPE Pitch Extraction]
    C --> E["Content Extractor (HuBERT)"]
    D --> F[Pitch Features]
    E --> G[Content Features]
    G --> H["Feature Retriever (KNN)"]
    H --> I[Matched Target Features]
    F --> J[VITS Generator]
    I --> J
    J --> K[Converted Audio Output]
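
Processing audio in small overlapping frames, as described above, can be illustrated with a framing and crossfade sketch. This is a generic technique shown under stated assumptions, not RVC's actual real-time code. The function names are hypothetical, and the linear crossfade assumes the overlap is no longer than the hop (`frame_len - hop <= hop`).

```python
def split_frames(samples, frame_len, hop):
    """Cut samples into frames of frame_len, advancing hop samples each time.

    Trailing samples that do not fill a whole frame are dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def crossfade_join(frames, hop):
    """Overlap-add frames, linearly crossfading each overlapping region."""
    frame_len = len(frames[0])
    overlap = frame_len - hop
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for n, frame in enumerate(frames):
        for i, value in enumerate(frame):
            gain = 1.0
            if n > 0 and i < overlap:
                # Head of every frame but the first fades in.
                gain = i / overlap
            elif n < len(frames) - 1 and i >= hop:
                # Tail of every frame but the last fades out.
                gain = 1.0 - (i - hop) / overlap
            out[n * hop + i] += gain * value
    return out


signal = [float(i) for i in range(10)]
frames = split_frames(signal, frame_len=4, hop=2)
# In a real pipeline each frame would be converted here before rejoining.
rebuilt = crossfade_join(frames, hop=2)
```

Because the fade-in and fade-out gains sum to 1 in every overlap, unmodified frames rejoin into the original signal. The crossfade smooths over discontinuities once each frame has been converted independently.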

What is the RMVPE component?

RMVPE (Robust Model for Vocal Pitch Estimation in Polyphonic Music) is a critical component for singing voice conversion. Standard pitch extractors often struggle with the wide pitch ranges and rapid variations of singing; RMVPE is designed to track vocal pitch accurately even in complex performances. This lets RVC preserve the singer’s original melody while changing the timbre to the target voice.

Features and Capabilities

Feature	Description	Performance
Voice Conversion	Change voice of any audio recording	About 20x faster than real time (500 ms for 10 s of audio)
Real-time Conversion	Live voice changing	20-30ms latency on RTX 4090
Singing Voice	Pitch-preserving voice conversion for songs	Excellent quality
Cross-lingual	Convert voice across languages	Good (limited by language coverage)
Batch Processing	Convert multiple files at once	Configurable batch size
Audio Enhancement	Post-processing filters and EQ	Built-in equalizer
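
The offline throughput figure in the table can be turned into a real-time factor with simple arithmetic. The short sketch below uses only the numbers quoted in this article; `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; below 1.0 beats real time."""
    return processing_s / audio_s


rtf = real_time_factor(0.5, 10.0)  # 500 ms of compute per 10 s of audio
print(f"real-time factor: {rtf:.2f}")
print(f"speed-up over real time: {1 / rtf:.0f}x")
print(f"estimated time for 1 minute of audio: {60 * rtf:.0f} s")
```

The one-minute estimate agrees with the "1 min audio processed in ~3 seconds" note in the article's sequence diagram.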

What is UVR5 and why is it needed?

UVR5 (Ultimate Vocal Remover 5) is the source separation component. When converting the voice in a song, UVR5 first separates the vocal track from the background music. This separation is essential because the voice conversion model should process only the voice signal; processing mixed audio would introduce artifacts from the music. UVR5 relies on deep learning separation models to preserve vocal quality while removing the instrumental backing.

sequenceDiagram
    participant User
    participant RVC as RVC WebUI
    participant UVR as UVR5 Separator
    participant Model as Voice Model
    participant Output as Audio Output

    User->>RVC: Upload song with vocals
    RVC->>UVR: Separate vocals from music
    UVR-->>RVC: Isolated vocal track
    RVC->>RVC: Apply RMVPE pitch detection
    RVC->>Model: Extract + retrieve features
    Model-->>RVC: Converted voice features
    RVC->>RVC: VITS reconstruction
    RVC-->>Output: Converted audio
    Note over Output: 1 min audio processed in ~3 seconds

What are the hardware requirements for RVC?

GPU	Real-time Latency	Training Speed	Quality
RTX 4090 (24 GB)	20-30ms	15 min (10k steps)	Excellent
RTX 3090 (24 GB)	25-35ms	25 min	Excellent
RTX 3060 (12 GB)	40-50ms	45 min	Very good
GTX 1660 (6 GB)	60-80ms	90 min	Good
CPU Only	500-1000ms	Not recommended	Fair

How do I install and use RVC?

RVC WebUI provides a one-click installer for Windows, and manual installation guides for Linux and macOS. The web interface guides users through the full workflow: uploading training data, preprocessing audio (via UVR5), extracting features, training the voice model (with adjustable steps and learning rate), and performing voice conversion with tunable parameters like pitch shift, formant preservation, and retrieval strength.
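
The pitch shift parameter is most naturally expressed in semitones, where each semitone multiplies the frequency by 2^(1/12). The sketch below shows that arithmetic. It assumes the pitch contour is a list of F0 values in Hz with 0.0 marking unvoiced frames, and `shift_f0` is a hypothetical helper, not an RVC function.

```python
def shift_f0(f0_hz, semitones):
    """Shift a pitch contour by a number of semitones.

    Frames with a value of 0.0 are treated as unvoiced and left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [220.0, 0.0, 440.0]        # A3, unvoiced frame, A4
one_octave_up = shift_f0(contour, 12)
one_octave_down = shift_f0(contour, -12)
```

Shifting by +12 doubles every voiced frequency and -12 halves it. This is why a full octave is a common starting point when converting between voices with very different registers.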

Frequently Asked Questions

What is RVC? RVC (Retrieval-based Voice Conversion) is an open-source voice conversion framework based on VITS that can train high-quality voice models with just 10 minutes of audio data.

How much training data is required? At least 5 minutes of clean vocal audio. 10 minutes is recommended, and 30 or more minutes is optimal for a high-quality voice model.

What is RMVPE? RMVPE (Robust Model for Vocal Pitch Estimation in Polyphonic Music) is the pitch extraction component, designed for accurate pitch tracking in singing voice conversion.

What is UVR5? UVR5 (Ultimate Vocal Remover 5) is the source separation component that isolates vocals from background music before voice conversion.

Does RVC support real-time conversion? Yes, with 20-30ms latency on high-end GPUs like the RTX 4090, suitable for live streaming and real-time voice changing applications.
