Retrieval-based-Voice-Conversion-WebUI
Easily train a good VC model with voice data <= 10 mins!
Top Related Projects
リアルタイムボイスチェンジャー Realtime Voice Changer
Clone a voice in 5 seconds to generate arbitrary speech in real-time
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Quick Overview
RVC-Project/Retrieval-based-Voice-Conversion-WebUI is an open-source project that provides a web-based interface for voice conversion using retrieval-based methods. It converts one person's voice into another's while preserving the original content and emotion. Under the hood it combines a HuBERT content encoder, top-1 feature retrieval against the training set, and a VITS-based synthesizer, achieving high-quality voice conversion with relatively modest computational requirements.
Pros
- User-friendly web interface for easy access and operation
- Supports multiple languages and can handle various accents
- Requires less training data compared to some other voice conversion methods
- Offers real-time voice conversion capabilities
Cons
- May require significant computational resources for optimal performance
- The quality of voice conversion can vary depending on the input audio and target voice
- Limited documentation for advanced customization and troubleshooting
- Potential ethical concerns regarding voice cloning and misuse
Code Examples
The examples below are illustrative sketches; exact module paths, class names, and signatures vary between versions of the repository.
# Example 1: Loading a pre-trained model
from infer.modules.vc.modules import VC

model = VC(config_path='path/to/config.json')  # config/model paths are placeholders
model.load_model('path/to/model.pth')
# Example 2: Performing voice conversion (convert/save are assumed helpers)
input_audio = 'path/to/input.wav'
output_audio = 'path/to/output.wav'
converted_audio = model.convert(input_audio, target_speaker='Speaker1')
converted_audio.save(output_audio)
# Example 3: Real-time voice conversion (sketch; convert_realtime is an assumed API)
import sounddevice as sd

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)
    # Conversion must finish within one block's duration to keep up in real time
    converted = model.convert_realtime(indata)
    outdata[:] = converted

with sd.Stream(callback=callback, channels=1, samplerate=44100):
    sd.sleep(10000)  # Run for 10 seconds
Getting Started
1. Clone the repository:
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
2. Install dependencies:
pip install -r requirements.txt
3. Run the web interface:
python infer-web.py
4. Open a web browser and navigate to http://localhost:7860 to access the interface.
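Once the UI is up, the same Gradio server can also be driven programmatically. Below is a minimal sketch using the gradio_client package; endpoint names differ between RVC versions, so inspect them with view_api() rather than hard-coding any:
# Sketch: connect to the running WebUI and list its callable endpoints.
from gradio_client import Client

client = Client("http://localhost:7860")
client.view_api()  # prints the available named endpoints and their parameters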
Competitor Comparisons
リアルタイムボイスチェンジャー Realtime Voice Changer
Pros of voice-changer
- Real-time voice conversion capabilities
- Supports multiple voice conversion models (RVC, MMVCv13, MMVCv15, So-VITS-SVC)
- Cross-platform compatibility (Windows, Mac, Linux)
Cons of voice-changer
- Less focus on training custom voice models
- May have higher system requirements for real-time processing
- Potentially more complex setup for beginners
Code Comparison
voice-changer:
def _onnx_inference(self, wave):
    inputs = {self.input_name: wave}
    out = self.onnx_session.run(None, inputs)[0]
    return out
Retrieval-based-Voice-Conversion-WebUI:
def vc_single(
    sid,
    input_audio,
    f0_up_key,
    f0_file,
    f0_method,
    file_index,
    file_index2,
    # ... (additional parameters)
):
    # Function implementation
The code snippets show different approaches:
- voice-changer focuses on ONNX inference for real-time processing
- Retrieval-based-Voice-Conversion-WebUI has a more comprehensive function for voice conversion with various parameters
Both projects aim to provide voice conversion capabilities, but voice-changer emphasizes real-time performance and multiple model support, while Retrieval-based-Voice-Conversion-WebUI offers more customization options and focuses on training custom voice models.
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- Focuses on real-time voice cloning, allowing for immediate results
- Utilizes a pre-trained model, reducing the need for extensive training
- Provides a more straightforward approach for quick voice cloning tasks
Cons of Real-Time-Voice-Cloning
- Less customizable compared to Retrieval-based-Voice-Conversion-WebUI
- May have lower audio quality in some cases due to real-time processing
- Limited to English language support
Code Comparison
Real-Time-Voice-Cloning:
def load_model(weights_fpath):
    model = SpeakerEncoder()
    checkpoint = torch.load(weights_fpath)
    model.load_state_dict(checkpoint["model_state"])
    return model
Retrieval-based-Voice-Conversion-WebUI:
def get_vc(sid, to_return_protect0):
    global n_spk, tgt_sr, net_g, vc, cpt, version
    if sid == "" or sid == []:
        global hubert_model
        if hubert_model is not None:
            print("clean_empty_cache")
            del net_g, n_spk, vc, hubert_model, tgt_sr
            hubert_model = net_g = n_spk = vc = hubert_model = tgt_sr = None
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        if_f0 = cpt.get("f0", 1)
        version = cpt.get("version", "v1")
        return (
            {"visible": False, "__type__": "update"},
            {"visible": False, "__type__": "update"},
            {"visible": False, "__type__": "update"},
            "clean_empty_cache",
        )
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Pros of TTS
- More comprehensive text-to-speech solution with multiple models and languages
- Better documentation and easier integration into existing projects
- Active development with frequent updates and community support
Cons of TTS
- Requires more computational resources for training and inference
- Less focused on voice conversion, primarily a text-to-speech system
- Steeper learning curve for customization and fine-tuning
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Retrieval-based-Voice-Conversion-WebUI:
from infer_web import get_vc
from tools.infer_tools import infer_tool
vc = get_vc()
audio = infer_tool.infer(vc, "input.wav", "output.wav")
The code snippets demonstrate that TTS is more straightforward for text-to-speech tasks, while Retrieval-based-Voice-Conversion-WebUI is specifically designed for voice conversion. TTS offers a simpler API for generating speech from text, whereas Retrieval-based-Voice-Conversion-WebUI requires more setup and is tailored for converting one voice to another.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader scope: fairseq is a comprehensive sequence modeling toolkit, supporting various tasks beyond voice conversion
- Extensive documentation and examples: Provides detailed guides and tutorials for different use cases
- Active development and community support: Regular updates and contributions from Facebook AI Research and the open-source community
Cons of fairseq
- Steeper learning curve: Requires more technical expertise to set up and use effectively
- Less specialized for voice conversion: May require additional configuration or fine-tuning for specific voice conversion tasks
Code Comparison
Retrieval-based-Voice-Conversion-WebUI:
import torch
from infer_pack.models import SynthesizerTrnMs256NSFsid, SynthesizerTrnMs256NSFsid_nono
from vc_infer_pipeline import VC
model = SynthesizerTrnMs256NSFsid(*args)
vc = VC(model)
audio = vc.pipeline(input_audio)
fairseq:
from fairseq.models.text_to_speech import TTSHubInterface
model = TTSHubInterface.from_pretrained("tts_transformer_lj")
wav, rate = model.predict("Hello world", voice="ljspeech")
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Pros of StyleTTS2
- More advanced text-to-speech capabilities, including style transfer and prosody control
- Potentially higher quality voice synthesis with more natural-sounding results
- Supports multi-speaker and multi-lingual voice conversion
Cons of StyleTTS2
- May require more computational resources due to its advanced features
- Potentially more complex to set up and use compared to Retrieval-based-Voice-Conversion-WebUI
- Less focus on real-time voice conversion, which might be important for some use cases
Code Comparison
StyleTTS2:
style_vector = model.get_style_vector(ref_wav, ref_wav_lengths)
audio = model.infer(text, text_lengths, speakers, style_vector=style_vector)
Retrieval-based-Voice-Conversion-WebUI:
f0_up_key = int(tgt_sr / 16000 * 12)
audio = vc.pipeline(hubert_model, net_g, sid, audio, tgt_sr, f0_up_key)
Both projects offer voice conversion capabilities, but StyleTTS2 focuses more on text-to-speech with style transfer, while Retrieval-based-Voice-Conversion-WebUI emphasizes real-time voice conversion. StyleTTS2 provides more advanced features for controlling voice characteristics, while Retrieval-based-Voice-Conversion-WebUI may be simpler to use for basic voice conversion tasks.
README
Retrieval-based-Voice-Conversion-WebUI
An easy-to-use voice conversion framework based on VITS. Changelog | FAQ | AutoDL: train an AI singer for about ¥0.5 | Comparative experiment records | Online demo
English | 中文简体 | 日本語 | 한국어 | Français | Türkçe | Português
The base model is trained on nearly 50 hours of the open-source, high-quality VCTK dataset, so there are no copyright concerns; please feel free to use it.
And look forward to the RVCv3 base model: more parameters, more training data, better results, roughly the same inference speed, and less training data required.
Training and inference UI | Real-time voice-changing UI |
go-web.bat | go-realtime-gui.bat |
You can freely choose the operation you want to perform. | We have achieved an end-to-end latency of 170 ms. Using ASIO input/output devices, an end-to-end latency of 90 ms is already achievable, but it is highly dependent on hardware driver support. |
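To make the latency arithmetic concrete, here is a minimal sketch (assuming the sounddevice library; this is not the project's actual GUI code) of how block size bounds the end-to-end delay of a streaming converter:
# Sketch: the block size of a duplex audio stream sets a floor on end-to-end
# latency, before any model inference time is added on top.
import sounddevice as sd

samplerate = 40000   # Hz; assumed for illustration
blocksize = 4096     # frames per block -> 4096 / 40000 ~= 102 ms of audio

print(f"block duration: {1000 * blocksize / samplerate:.0f} ms")

# latency="low" asks PortAudio (and, on Windows, drivers such as ASIO)
# for its low-latency buffer settings; actual figures depend on hardware.
stream = sd.Stream(samplerate=samplerate, blocksize=blocksize,
                   channels=1, latency="low")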
Introduction
This repository has the following features:
- Uses top-1 retrieval to replace input source features with training-set features, eliminating timbre leakage (see the sketch after this list)
- Fast training, even on relatively weak GPUs
- Good results even with small amounts of training data (collecting at least 10 minutes of low-noise speech is recommended)
- Timbre can be changed through model fusion (via the ckpt-merge option in the ckpt processing tab)
- A simple, easy-to-use web interface
- Can invoke the UVR5 model to quickly separate vocals from accompaniment
- Uses the state-of-the-art vocal pitch extraction algorithm InterSpeech2023-RMVPE to eliminate the muted-voice problem; delivers the best results (significantly) while being faster and lighter on resources than crepe_full
- AMD and Intel GPU acceleration support
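A minimal sketch of that top-1 retrieval step, assuming a faiss index built from the training set's HuBERT features and the index_rate blending weight exposed in the UI (newer versions actually blend several nearest neighbors, and the file paths here are placeholders):
# Sketch: suppress timbre leakage by blending each input content-feature frame
# with its nearest neighbor from the training set.
import faiss
import numpy as np

index = faiss.read_index("path/to/added.index")                # built during training
feats = np.load("path/to/input_feats.npy").astype(np.float32)  # (T, 256 or 768)
index_rate = 0.75                                              # 1.0 = pure training features

big_npy = index.reconstruct_n(0, index.ntotal)  # all stored training features
_, ids = index.search(feats, k=1)               # top-1 neighbor per frame
feats = index_rate * big_npy[ids[:, 0]] + (1 - index_rate) * feats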
Click here to watch our demo video!
Environment Setup
The following commands must be executed in an environment with a Python version greater than 3.8.
Universal method for Windows/Linux/MacOS and other platforms
Choose any one of the following methods.
1. Install dependencies via pip
- Install PyTorch and its core dependencies (skip if already installed). Reference: https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio
- For Windows systems with an Nvidia Ampere architecture GPU (RTX 30xx), according to the experience reported in #21, you need to specify the CUDA version corresponding to PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
- Install the dependencies matching your GPU:
- Nvidia GPUs
pip install -r requirements.txt
- AMD/Intel GPUs
pip install -r requirements-dml.txt
- AMD GPUs with ROCm (Linux)
pip install -r requirements-amd.txt
- Intel GPUs with IPEX (Linux)
pip install -r requirements-ipex.txt
2. Install dependencies via Poetry
Install the Poetry dependency-management tool (skip if already installed). Reference: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -
When installing dependencies via Poetry, Python 3.7-3.10 is recommended; other versions will run into conflicts when installing llvmlite==0.39.0.
poetry init -n
poetry env use "path to your python.exe"
poetry run pip install -r requirements.txt
MacOS
Dependencies can be installed via run.sh:
sh ./run.sh
Preparing Other Pretrained Models
RVC requires some other pretrained models for inference and training.
You can download these models from our Hugging Face space.
1. Download assets
The following is a checklist of all the pretrained models and other files RVC requires. You can find scripts to download them in the tools folder.
- ./assets/hubert/hubert_base.pt
- ./assets/pretrained
- ./assets/uvr5_weights
If you want to use v2 models, you additionally need to download:
- ./assets/pretrained_v2
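For illustration only (prefer the official download scripts in the tools folder), here is a sketch of fetching one asset; the Hugging Face repository name and the remote file layout below are assumptions and should be checked against the current README:
# Hypothetical asset download; the remote layout on Hugging Face may differ
# from the local ./assets/ layout, so both paths are passed explicitly.
import os
import urllib.request

BASE = "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/"  # assumed host

def fetch(remote_name: str, local_path: str) -> None:
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(BASE + remote_name, local_path)
    print("downloaded", local_path)

fetch("hubert_base.pt", "assets/hubert/hubert_base.pt")  # content encoder weights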
2. Install ffmpeg
Skip this step if ffmpeg and ffprobe are already installed.
Ubuntu/Debian users
sudo apt install ffmpeg
MacOS users
brew install ffmpeg
Windows users
Download the binaries and place them in the root directory.
- Download ffmpeg.exe
- Download ffprobe.exe
3. Download the weights for the rmvpe vocal pitch extraction algorithm
If you want to use the latest RMVPE vocal pitch extraction algorithm, download the pitch extraction model parameters and place them in the RVC root directory.
- Download rmvpe.pt
Download the dml environment for rmvpe (optional, for AMD/Intel GPU users)
- Download rmvpe.onnx
4. AMD GPU ROCm (optional, Linux only)
If you want to run RVC on a Linux system using AMD's ROCm technology, first install the required drivers here.
If you are using Arch Linux, you can install the required drivers with pacman:
pacman -S rocm-hip-sdk rocm-opencl-sdk
For some GPU models (e.g., the RX6700XT), you may additionally need to configure the following environment variables:
export ROCM_PATH=/opt/rocm
export HSA_OVERRIDE_GFX_VERSION=10.3.0
Also make sure your current user is in the render and video user groups:
sudo usermod -aG render $USERNAME
sudo usermod -aG video $USERNAME
Getting Started
Direct launch
Use the following command to start the WebUI:
python infer-web.py
If you previously installed the dependencies with Poetry, you can start the WebUI like this:
poetry run python infer-web.py
Using the all-in-one package
Download and extract RVC-beta.7z
Windows users
Double-click go-web.bat
MacOS users
sh ./run.sh
For Intel GPU users who need IPEX technology (Linux only)
source /opt/intel/oneapi/setvars.sh
Reference Projects
- ContentVec
- VITS
- HIFIGAN
- Gradio
- FFmpeg
- Ultimate Vocal Remover
- audio-slicer
- Vocal pitch extraction: RMVPE
Thanks to all contributors for their efforts!