RVC WebUI: Open-Source Real-Time Voice Conversion with VITS

RVC is an easy-to-use voice conversion framework based on VITS that trains good models with just 10 minutes of voice data and supports real-time conversion.

RVC (Retrieval-based Voice Conversion) WebUI is an open-source voice conversion framework developed by the RVC-Project team and widely used for AI voice conversion of both speech and singing. Built on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture, RVC achieves high-quality voice conversion with remarkably little training data: about 10 minutes of audio is enough for a convincing voice model.

The project distinguishes itself from traditional voice conversion approaches through its retrieval-based mechanism. Instead of requiring paired data (same content spoken in different voices), RVC uses a feature retrieval approach that extracts and transfers speaker characteristics while preserving the linguistic content of the source audio. This makes it particularly powerful for singing voice conversion, where preserving pitch, rhythm, and emotional expression is critical.

What is RVC and how does voice conversion work?

RVC converts the voice in an audio recording from one speaker to another while preserving the linguistic content, rhythm, and emotional delivery. The process involves extracting speaker-agnostic content features from the source audio, retrieving relevant voice characteristics from the target speaker’s trained model, and reconstructing the audio with the target voice characteristics applied. Unlike TTS, voice conversion doesn’t require text input – it takes audio as input and outputs audio with a different voice.

Training Requirements

| Aspect | Minimum | Recommended | Optimal |
|---|---|---|---|
| Voice Data Duration | 5 minutes | 10 minutes | 30+ minutes |
| Audio Quality | 16 kHz / 16-bit | 44.1 kHz / 24-bit | 48 kHz / 24-bit |
| Training Steps | 10,000 | 20,000 | 50,000+ |
| Training Time (RTX 4090) | 15 minutes | 30 minutes | 1 hour |
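
Before training, it can help to check a folder of recordings against the duration tiers in the table. The following is a minimal sketch using only Python's standard `wave` module; the helper names (`wav_info`, `rate_dataset`, `summarize`) are illustrative and not part of RVC, and it assumes the recordings are uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

# Duration tiers (in minutes) copied from the training requirements table.
MIN_MINUTES, RECOMMENDED_MINUTES, OPTIMAL_MINUTES = 5, 10, 30

def wav_info(path):
    """Return (duration_seconds, sample_rate_hz) for one PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate

def rate_dataset(total_seconds):
    """Map a total duration onto the tiers from the table."""
    minutes = total_seconds / 60
    if minutes >= OPTIMAL_MINUTES:
        return "optimal"
    if minutes >= RECOMMENDED_MINUTES:
        return "recommended"
    if minutes >= MIN_MINUTES:
        return "minimum"
    return "insufficient"

def summarize(folder):
    """Sum the duration of every .wav file in `folder` and report the lowest sample rate."""
    infos = [wav_info(p) for p in sorted(Path(folder).glob("*.wav"))]
    total = sum(duration for duration, _ in infos)
    lowest_rate = min((rate for _, rate in infos), default=0)
    return total, lowest_rate, rate_dataset(total)
```

Calling `summarize("my_voice_dataset")` (a hypothetical folder name) returns the total duration in seconds, the lowest sample rate found, and the matching tier.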

Key Components

RVC’s pipeline includes several specialized components that work together to deliver high-quality voice conversion.

| Component | Function | Technical Detail |
|---|---|---|
| RMVPE | Pitch extraction | Accurate F0 estimation for singing voices |
| UVR5 | Source separation | Isolates vocals from background music |
| Content Extractor | Extracts content features | HuBERT-based feature extraction |
| Feature Retriever | Matches to target voice | KNN-based retrieval from trained database |
| VITS Generator | Reconstructs audio | VITS-based neural vocoder |
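
The feature retriever row can be illustrated with a small, dependency-free sketch of KNN retrieval. It assumes content features are plain lists of floats, that the retrieved vector is an inverse-square-distance weighted average of the `k` nearest database entries, and that a hypothetical `index_rate` parameter plays the role of the retrieval strength setting. A real deployment would use a vector index rather than a linear scan; this is only meant to show the idea.

```python
import math

def knn_retrieve(query, database, k=3):
    """Return a distance-weighted average of the k database vectors nearest to `query`."""
    scored = sorted((math.dist(query, vec), vec) for vec in database)[:k]
    # Inverse-square distance weights; the epsilon avoids division by zero on exact matches.
    weights = [1.0 / (dist * dist + 1e-9) for dist, _ in scored]
    total = sum(weights)
    return [
        sum(w * vec[i] for w, (_, vec) in zip(weights, scored)) / total
        for i in range(len(query))
    ]

def blend(source_feat, retrieved_feat, index_rate):
    """Mix retrieved target features with the source features.

    index_rate=1.0 uses only the retrieved features, 0.0 keeps the source untouched.
    """
    return [
        index_rate * r + (1.0 - index_rate) * s
        for s, r in zip(source_feat, retrieved_feat)
    ]
```

For each source frame, `blend(frame, knn_retrieve(frame, target_db), index_rate)` would yield the feature vector handed to the generator, where `target_db` stands for the features stored from the target speaker's training audio.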

How does real-time voice conversion work?

RVC supports real-time voice conversion with latency as low as 20-30ms on modern GPUs. In real-time mode, audio is processed in small overlapping frames. The content extractor analyzes each frame, the feature retriever finds the best-matching target features, and the VITS generator produces the converted output. This enables live applications like voice changers for streaming, real-time interpretation, and interactive voice filters.
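
The overlapping-frame processing described above can be sketched as follows. This is an illustrative outline rather than RVC's actual streaming code: the function names are hypothetical, the conversion step is left out, and a linear crossfade is assumed for stitching blocks. The block duration also gives a rough lower bound on latency, since a block must be fully captured before it can be converted.

```python
def block_latency_ms(block_samples, sample_rate):
    """Time needed to capture one block of audio, in milliseconds."""
    return 1000.0 * block_samples / sample_rate

def split_blocks(samples, block, overlap):
    """Cut `samples` into blocks of length `block` that share `overlap` samples with their neighbor."""
    hop = block - overlap
    return [samples[i:i + block] for i in range(0, max(len(samples) - overlap, 1), hop)]

def crossfade_join(blocks, overlap):
    """Stitch blocks back together, linearly crossfading each overlapping region."""
    out = list(blocks[0])
    for blk in blocks[1:]:
        n = min(overlap, len(blk), len(out))
        tail_start = len(out) - n
        for i in range(n):
            fade_in = (i + 1) / (n + 1)
            # Fade out the previous block's tail while fading in the new block's head.
            out[tail_start + i] = out[tail_start + i] * (1.0 - fade_in) + blk[i] * fade_in
        out.extend(blk[n:])
    return out
```

In a live setting, each block would pass through the conversion model between `split_blocks` and `crossfade_join`; the crossfade hides small discontinuities between independently converted blocks.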

What is the RMVPE component?

RMVPE (Robust Model for Vocal Pitch Estimation in Polyphonic Music) is a critical component for singing voice conversion. Standard pitch extractors struggle with the wide pitch ranges and rapid variations in singing, whereas RMVPE is a neural pitch estimator designed to track vocal pitch accurately even in complex performances. This enables RVC to preserve the singer’s original melody while changing the timbre to the target voice.
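
To make "pitch extraction" concrete, here is a minimal sketch of F0 estimation on a single frame. It deliberately uses classic autocorrelation peak picking, not RMVPE, which is a trained neural model and is not reproduced here. The function name and the `fmin`/`fmax` search range are assumptions chosen for illustration.

```python
import math

def estimate_f0(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency of one frame from its autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # A lag of 0 means no positive correlation was found; treat the frame as unvoiced.
    return sample_rate / best_lag if best_lag else 0.0

# Usage: a synthetic 200 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(1024)]
f0 = estimate_f0(tone, 16000)
```

Simple estimators like this one tend to break down on fast vibrato, wide leaps, or leftover accompaniment, which is the gap a dedicated model such as RMVPE is meant to close.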

Features and Capabilities

| Feature | Description | Performance |
|---|---|---|
| Voice Conversion | Change voice of any audio recording | Near real-time (500 ms for 10 s audio) |
| Real-time Conversion | Live voice changing | 20-30 ms latency on RTX 4090 |
| Singing Voice | Pitch-preserving voice conversion for songs | Excellent quality |
| Cross-lingual | Convert voice across languages | Good (limited by language coverage) |
| Batch Processing | Convert multiple files at once | Configurable batch size |
| Audio Enhancement | Post-processing filters and EQ | Built-in equalizer |

What is UVR5 and why is it needed?

UVR5 (Ultimate Vocal Remover 5) is the source separation component. When converting voice from a song, UVR5 first separates the vocal track from the background music. This separation is essential because the voice conversion model needs to process only the voice signal – processing mixed audio would introduce artifacts from the music. UVR5 relies on deep learning separation models that preserve vocal quality while effectively removing the instrumental backing.

What are the hardware requirements for RVC?

| GPU | Real-time Latency | Training Speed | Quality |
|---|---|---|---|
| RTX 4090 (24 GB) | 20-30 ms | 15 min (10k steps) | Excellent |
| RTX 3090 (24 GB) | 25-35 ms | 25 min | Excellent |
| RTX 3060 (12 GB) | 40-50 ms | 45 min | Very good |
| GTX 1660 (6 GB) | 60-80 ms | 90 min | Good |
| CPU Only | 500-1000 ms | Not recommended | Fair |

How do I install and use RVC?

RVC WebUI provides a one-click installer for Windows, and manual installation guides for Linux and macOS. The web interface guides users through the full workflow: uploading training data, preprocessing audio (via UVR5), extracting features, training the voice model (with adjustable steps and learning rate), and performing voice conversion with tunable parameters like pitch shift, formant preservation, and retrieval strength.
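
As an illustration of the pitch shift parameter, the sketch below transposes an F0 contour by a number of semitones using the equal-temperament relation, where each semitone multiplies frequency by 2^(1/12). The function name `shift_f0` is hypothetical, and the convention that 0 Hz marks an unvoiced frame is an assumption made for this example.

```python
def shift_f0(f0_hz, semitones):
    """Transpose an F0 contour by `semitones`; unvoiced frames (0 Hz) stay at 0."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor for f in f0_hz]

# Usage: raise a contour by one octave (12 semitones).
contour = [220.0, 0.0, 246.94]
raised = shift_f0(contour, 12)
```

A shift of +12 is commonly used when converting a low voice to a much higher target voice, and -12 for the opposite direction; smaller values fine-tune the match.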

Frequently Asked Questions

What is RVC? RVC (Retrieval-based Voice Conversion) is an open-source voice conversion framework based on VITS that can train high-quality voice models with just 10 minutes of audio data.

How much training data is required? Minimum 5 minutes, recommended 10 minutes, optimal 30+ minutes of clean vocal audio for a high-quality voice model.

What is RMVPE? RMVPE (Robust Model for Vocal Pitch Estimation in Polyphonic Music) is the pitch extraction component, designed for accurate pitch tracking in singing voice conversion.

What is UVR5? UVR5 (Ultimate Vocal Remover 5) is the source separation component that isolates vocals from background music before voice conversion.

Does RVC support real-time conversion? Yes, with 20-30ms latency on high-end GPUs like the RTX 4090, suitable for live streaming and real-time voice changing applications.
