Retrieval-based-Voice-Conversion-WebUI
Easily train a good VC model with voice data <= 10 mins!
Top Related Projects
リアルタイムボイスチェンジャー Realtime Voice Changer
Clone a voice in 5 seconds to generate arbitrary speech in real-time
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Quick Overview
RVC-Project/Retrieval-based-Voice-Conversion-WebUI is an open-source project that provides a web-based interface for voice conversion using retrieval-based methods. It converts one person's voice into another's while preserving the original content and emotion. Under the hood it combines a HuBERT content encoder, top-1 feature retrieval against the training set, and a VITS-based synthesizer, achieving high-quality voice conversion with relatively modest computational requirements.
Pros
- User-friendly web interface for easy access and operation
- Supports multiple languages and can handle various accents
- Requires less training data compared to some other voice conversion methods
- Offers real-time voice conversion capabilities
Cons
- May require significant computational resources for optimal performance
- The quality of voice conversion can vary depending on the input audio and target voice
- Limited documentation for advanced customization and troubleshooting
- Potential ethical concerns regarding voice cloning and misuse
Code Examples
The examples below are illustrative sketches; exact module paths, class names, and signatures vary between versions of the repository.
# Example 1: Loading a pre-trained model
from infer.modules.vc.modules import VC

model = VC(config_path='path/to/config.json')  # config/model paths are placeholders
model.load_model('path/to/model.pth')
# Example 2: Performing voice conversion (convert/save are assumed helpers)
input_audio = 'path/to/input.wav'
output_audio = 'path/to/output.wav'
converted_audio = model.convert(input_audio, target_speaker='Speaker1')
converted_audio.save(output_audio)
# Example 3: Real-time voice conversion (sketch; convert_realtime is an assumed API)
import sounddevice as sd

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)
    # Conversion must finish within one block's duration to keep up in real time
    converted = model.convert_realtime(indata)
    outdata[:] = converted

with sd.Stream(callback=callback, channels=1, samplerate=44100):
    sd.sleep(10000)  # Run for 10 seconds
Getting Started
1. Clone the repository:
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
2. Install dependencies:
pip install -r requirements.txt
3. Run the web interface:
python infer-web.py
4. Open a web browser and navigate to http://localhost:7860 to access the interface.
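Once the UI is up, the same Gradio server can also be driven programmatically. Below is a minimal sketch using the gradio_client package; endpoint names differ between RVC versions, so inspect them with view_api() rather than hard-coding any:
# Sketch: connect to the running WebUI and list its callable endpoints.
from gradio_client import Client

client = Client("http://localhost:7860")
client.view_api()  # prints the available named endpoints and their parameters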
Competitor Comparisons
リアルタイムボイスチェンジャー Realtime Voice Changer
Pros of voice-changer
- Real-time voice conversion capabilities
- Supports multiple voice conversion models (RVC, MMVCv13, MMVCv15, So-VITS-SVC)
- Cross-platform compatibility (Windows, Mac, Linux)
Cons of voice-changer
- Less focus on training custom voice models
- May have higher system requirements for real-time processing
- Potentially more complex setup for beginners
Code Comparison
voice-changer:
def _onnx_inference(self, wave):
    inputs = {self.input_name: wave}
    out = self.onnx_session.run(None, inputs)[0]
    return out
Retrieval-based-Voice-Conversion-WebUI:
def vc_single(
    sid,
    input_audio,
    f0_up_key,
    f0_file,
    f0_method,
    file_index,
    file_index2,
    # ... (additional parameters)
):
    # Function implementation
The code snippets show different approaches:
- voice-changer focuses on ONNX inference for real-time processing
- Retrieval-based-Voice-Conversion-WebUI has a more comprehensive function for voice conversion with various parameters
Both projects aim to provide voice conversion capabilities, but voice-changer emphasizes real-time performance and multiple model support, while Retrieval-based-Voice-Conversion-WebUI offers more customization options and focuses on training custom voice models.
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- Focuses on real-time voice cloning, allowing for immediate results
- Utilizes a pre-trained model, reducing the need for extensive training
- Provides a more straightforward approach for quick voice cloning tasks
Cons of Real-Time-Voice-Cloning
- Less customizable compared to Retrieval-based-Voice-Conversion-WebUI
- May have lower audio quality in some cases due to real-time processing
- Limited to English language support
Code Comparison
Real-Time-Voice-Cloning:
def load_model(weights_fpath):
    model = SpeakerEncoder()
    checkpoint = torch.load(weights_fpath)
    model.load_state_dict(checkpoint["model_state"])
    return model
Retrieval-based-Voice-Conversion-WebUI:
def get_vc(sid, to_return_protect0):
    global n_spk, tgt_sr, net_g, vc, cpt, version
    if sid == "" or sid == []:
        global hubert_model
        if hubert_model is not None:
            print("clean_empty_cache")
            del net_g, n_spk, vc, hubert_model, tgt_sr
            hubert_model = net_g = n_spk = vc = hubert_model = tgt_sr = None
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        if_f0 = cpt.get("f0", 1)
        version = cpt.get("version", "v1")
        return (
            {"visible": False, "__type__": "update"},
            {"visible": False, "__type__": "update"},
            {"visible": False, "__type__": "update"},
            "clean_empty_cache",
        )
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Pros of TTS
- More comprehensive text-to-speech solution with multiple models and languages
- Better documentation and easier integration into existing projects
- Active development with frequent updates and community support
Cons of TTS
- Requires more computational resources for training and inference
- Less focused on voice conversion, primarily a text-to-speech system
- Steeper learning curve for customization and fine-tuning
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Retrieval-based-Voice-Conversion-WebUI:
from infer_web import get_vc
from tools.infer_tools import infer_tool
vc = get_vc()
audio = infer_tool.infer(vc, "input.wav", "output.wav")
The code snippets demonstrate that TTS is more straightforward for text-to-speech tasks, while Retrieval-based-Voice-Conversion-WebUI is specifically designed for voice conversion. TTS offers a simpler API for generating speech from text, whereas Retrieval-based-Voice-Conversion-WebUI requires more setup and is tailored for converting one voice to another.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader scope: fairseq is a comprehensive sequence modeling toolkit, supporting various tasks beyond voice conversion
- Extensive documentation and examples: Provides detailed guides and tutorials for different use cases
- Active development and community support: Regular updates and contributions from Facebook AI Research and the open-source community
Cons of fairseq
- Steeper learning curve: Requires more technical expertise to set up and use effectively
- Less specialized for voice conversion: May require additional configuration or fine-tuning for specific voice conversion tasks
Code Comparison
Retrieval-based-Voice-Conversion-WebUI:
import torch
from infer_pack.models import SynthesizerTrnMs256NSFsid, SynthesizerTrnMs256NSFsid_nono
from vc_infer_pipeline import VC
model = SynthesizerTrnMs256NSFsid(*args)
vc = VC(model)
audio = vc.pipeline(input_audio)
fairseq:
from fairseq.models.text_to_speech import TTSHubInterface
model = TTSHubInterface.from_pretrained("tts_transformer_lj")
wav, rate = model.predict("Hello world", voice="ljspeech")
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Pros of StyleTTS2
- More advanced text-to-speech capabilities, including style transfer and prosody control
- Potentially higher quality voice synthesis with more natural-sounding results
- Supports multi-speaker and multi-lingual voice conversion
Cons of StyleTTS2
- May require more computational resources due to its advanced features
- Potentially more complex to set up and use compared to Retrieval-based-Voice-Conversion-WebUI
- Less focus on real-time voice conversion, which might be important for some use cases
Code Comparison
StyleTTS2:
style_vector = model.get_style_vector(ref_wav, ref_wav_lengths)
audio = model.infer(text, text_lengths, speakers, style_vector=style_vector)
Retrieval-based-Voice-Conversion-WebUI:
f0_up_key = int(tgt_sr / 16000 * 12)
audio = vc.pipeline(hubert_model, net_g, sid, audio, tgt_sr, f0_up_key)
Both projects offer voice conversion capabilities, but StyleTTS2 focuses more on text-to-speech with style transfer, while Retrieval-based-Voice-Conversion-WebUI emphasizes real-time voice conversion. StyleTTS2 provides more advanced features for controlling voice characteristics, while Retrieval-based-Voice-Conversion-WebUI may be simpler to use for basic voice conversion tasks.
README
Retrieval-based-Voice-Conversion-WebUI
An easy-to-use voice conversion framework based on VITS. Changelog | FAQ | AutoDL: train an AI singer for about ¥0.5 | Comparative experiment records | Online demo
English | 中文简体 | 日本語 | 한국어 | Français | Türkçe | Português
The base model is trained on nearly 50 hours of the open-source, high-quality VCTK dataset, so there are no copyright concerns; please feel free to use it.
And look forward to the RVCv3 base model: more parameters, more training data, better results, roughly the same inference speed, and less training data required.
Training and inference UI | Real-time voice-changing UI |
go-web.bat | go-realtime-gui.bat |
You can freely choose the operation you want to perform. | We have achieved an end-to-end latency of 170 ms. Using ASIO input/output devices, an end-to-end latency of 90 ms is already achievable, but it is highly dependent on hardware driver support. |
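To make the latency arithmetic concrete, here is a minimal sketch (assuming the sounddevice library; this is not the project's actual GUI code) of how block size bounds the end-to-end delay of a streaming converter:
# Sketch: the block size of a duplex audio stream sets a floor on end-to-end
# latency, before any model inference time is added on top.
import sounddevice as sd

samplerate = 40000   # Hz; assumed for illustration
blocksize = 4096     # frames per block -> 4096 / 40000 ~= 102 ms of audio

print(f"block duration: {1000 * blocksize / samplerate:.0f} ms")

# latency="low" asks PortAudio (and, on Windows, drivers such as ASIO)
# for its low-latency buffer settings; actual figures depend on hardware.
stream = sd.Stream(samplerate=samplerate, blocksize=blocksize,
                   channels=1, latency="low")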
Introduction
This repository has the following features:
- Uses top-1 retrieval to replace input source features with training-set features, eliminating timbre leakage (see the sketch after this list)
- Fast training, even on relatively weak GPUs
- Good results even with small amounts of training data (collecting at least 10 minutes of low-noise speech is recommended)
- Timbre can be changed through model fusion (via the ckpt-merge option in the ckpt processing tab)
- A simple, easy-to-use web interface
- Can invoke the UVR5 model to quickly separate vocals from accompaniment
- Uses the state-of-the-art vocal pitch extraction algorithm InterSpeech2023-RMVPE to eliminate the muted-voice problem; delivers the best results (significantly) while being faster and lighter on resources than crepe_full
- AMD and Intel GPU acceleration support
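A minimal sketch of that top-1 retrieval step, assuming a faiss index built from the training set's HuBERT features and the index_rate blending weight exposed in the UI (newer versions actually blend several nearest neighbors, and the file paths here are placeholders):
# Sketch: suppress timbre leakage by blending each input content-feature frame
# with its nearest neighbor from the training set.
import faiss
import numpy as np

index = faiss.read_index("path/to/added.index")                # built during training
feats = np.load("path/to/input_feats.npy").astype(np.float32)  # (T, 256 or 768)
index_rate = 0.75                                              # 1.0 = pure training features

big_npy = index.reconstruct_n(0, index.ntotal)  # all stored training features
_, ids = index.search(feats, k=1)               # top-1 neighbor per frame
feats = index_rate * big_npy[ids[:, 0]] + (1 - index_rate) * feats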
Click here to watch our demo video!
Environment Setup
The following commands must be executed in an environment with a Python version greater than 3.8.
Universal method for Windows/Linux/MacOS and other platforms
Choose any one of the following methods.
1. Install dependencies via pip
- Install PyTorch and its core dependencies (skip if already installed). Reference: https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio
- For Windows systems with an Nvidia Ampere architecture GPU (RTX 30xx), according to the experience reported in #21, you need to specify the CUDA version corresponding to PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
- Install the dependencies matching your GPU:
- Nvidia GPUs
pip install -r requirements.txt
- AMD/Intel GPUs
pip install -r requirements-dml.txt
- AMD GPUs with ROCm (Linux)
pip install -r requirements-amd.txt
- Intel GPUs with IPEX (Linux)
pip install -r requirements-ipex.txt
2. Install dependencies via Poetry
Install the Poetry dependency-management tool (skip if already installed). Reference: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -
When installing dependencies via Poetry, Python 3.7-3.10 is recommended; other versions will run into conflicts when installing llvmlite==0.39.0.
poetry init -n
poetry env use "path to your python.exe"
poetry run pip install -r requirements.txt
MacOS
Dependencies can be installed via run.sh:
sh ./run.sh
Preparing Other Pretrained Models
RVC requires some other pretrained models for inference and training.
You can download these models from our Hugging Face space.
1. Download assets
The following is a checklist of all the pretrained models and other files RVC requires. You can find scripts to download them in the tools folder.
- ./assets/hubert/hubert_base.pt
- ./assets/pretrained
- ./assets/uvr5_weights
If you want to use v2 models, you additionally need to download:
- ./assets/pretrained_v2
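For illustration only (prefer the official download scripts in the tools folder), here is a sketch of fetching one asset; the Hugging Face repository name and the remote file layout below are assumptions and should be checked against the current README:
# Hypothetical asset download; the remote layout on Hugging Face may differ
# from the local ./assets/ layout, so both paths are passed explicitly.
import os
import urllib.request

BASE = "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/"  # assumed host

def fetch(remote_name: str, local_path: str) -> None:
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(BASE + remote_name, local_path)
    print("downloaded", local_path)

fetch("hubert_base.pt", "assets/hubert/hubert_base.pt")  # content encoder weights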
2. Install ffmpeg
Skip this step if ffmpeg and ffprobe are already installed.
Ubuntu/Debian users
sudo apt install ffmpeg
MacOS users
brew install ffmpeg
Windows users
Download the binaries and place them in the root directory.
- Download ffmpeg.exe
- Download ffprobe.exe
3. Download the weights for the rmvpe vocal pitch extraction algorithm
If you want to use the latest RMVPE vocal pitch extraction algorithm, download the pitch extraction model parameters and place them in the RVC root directory.
- Download rmvpe.pt
Download the dml environment for rmvpe (optional, for AMD/Intel GPU users)
- Download rmvpe.onnx
4. AMD GPU ROCm (optional, Linux only)
If you want to run RVC on a Linux system using AMD's ROCm technology, first install the required drivers here.
If you are using Arch Linux, you can install the required drivers with pacman:
pacman -S rocm-hip-sdk rocm-opencl-sdk
For some GPU models (e.g., the RX6700XT), you may additionally need to configure the following environment variables:
export ROCM_PATH=/opt/rocm
export HSA_OVERRIDE_GFX_VERSION=10.3.0
Also make sure your current user is in the render and video user groups:
sudo usermod -aG render $USERNAME
sudo usermod -aG video $USERNAME
Getting Started
Direct launch
Use the following command to start the WebUI:
python infer-web.py
If you previously installed the dependencies with Poetry, you can start the WebUI like this:
poetry run python infer-web.py
Using the all-in-one package
Download and extract RVC-beta.7z
Windows users
Double-click go-web.bat
MacOS users
sh ./run.sh
For Intel GPU users who need IPEX technology (Linux only)
source /opt/intel/oneapi/setvars.sh
Reference Projects
- ContentVec
- VITS
- HIFIGAN
- Gradio
- FFmpeg
- Ultimate Vocal Remover
- audio-slicer
- Vocal pitch extraction: RMVPE
Thanks to all contributors for their efforts!