Top Related Projects
- CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
- facebookresearch/fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
- mozilla/TTS: Deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts)
- NVIDIA/tacotron2: Tacotron 2, a PyTorch implementation with faster-than-realtime inference
- resemble-ai/Resemblyzer: A Python package to analyze and compare voices with deep learning
Quick Overview
The w-okada/voice-changer repository is a Python-based project that provides tools for real-time voice conversion. It lets users modify the pitch, timbre, and other characteristics of a voice, either on live audio input or on pre-recorded audio files.
Pros
- Real-time processing: The library supports real-time voice conversion, enabling live applications such as video conferencing or voice chat.
- Flexible configuration: Users can customize the voice conversion parameters to achieve a wide range of voice transformations.
- Cross-platform compatibility: The library can be used on various operating systems, including Windows, macOS, and Linux.
- Open-source: The project is open-source, allowing for community contributions and further development.
Cons
- Complexity: The library may have a steep learning curve for users unfamiliar with audio processing and machine learning concepts.
- Performance limitations: Depending on the hardware and the complexity of the voice conversion, the library may not be able to achieve real-time performance on all devices.
- Limited pre-trained models: The library currently provides a limited set of pre-trained voice conversion models, which may not cover all desired voice transformations.
- Potential privacy concerns: Users should be aware of the potential privacy implications when using voice conversion technology, especially in live applications.
Code Examples
Here are a few code examples demonstrating usage of the voice-changer library (the VoiceChanger API shown below is illustrative):
- Real-time voice conversion:
from voice_changer import VoiceChanger
vc = VoiceChanger()
vc.start_stream()
while True:
    audio_frame = vc.get_audio_frame()
    converted_audio = vc.convert_voice(audio_frame)
    # Process the converted audio (e.g., play it, save it to a file)
- Batch voice conversion:
from voice_changer import VoiceChanger
vc = VoiceChanger()
vc.load_audio_file('input_audio.wav')
converted_audio = vc.convert_voice()
vc.save_audio_file('output_audio.wav')
- Adjusting voice conversion parameters:
from voice_changer import VoiceChanger
vc = VoiceChanger()
vc.set_pitch_shift(2.0) # Shift the pitch up by 2 semitones
vc.set_timbre_shift(0.5) # Shift the timbre by 0.5
converted_audio = vc.convert_voice()
- Using pre-trained voice conversion models:
from voice_changer import VoiceChanger
vc = VoiceChanger(model_name='male_to_female')
vc.load_audio_file('input_audio.wav')
converted_audio = vc.convert_voice()
vc.save_audio_file('output_audio.wav')
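The set_pitch_shift(2.0) call above is specified in semitones; a shift of n semitones corresponds to a frequency ratio of 2**(n/12). The sketch below illustrates that relationship with plain NumPy resampling. It is a naive stand-in, not the library's implementation, and it also changes duration because there is no time stretching:

```python
import numpy as np

def naive_pitch_shift(audio: np.ndarray, semitones: float) -> np.ndarray:
    """Pitch-shift by resampling at a 2**(semitones/12) rate ratio.
    Naive: the output is also shorter/longer than the input."""
    ratio = 2.0 ** (semitones / 12.0)
    positions = np.arange(0, len(audio), ratio)  # resampled read positions
    return np.interp(positions, np.arange(len(audio)), audio)

# A 440 Hz tone shifted up 12 semitones (one octave) doubles in frequency.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
shifted = naive_pitch_shift(tone, 12.0)
print(len(shifted))  # 8000: half of the original 16000 samples
```

Real voice changers pair this ratio with a time-stretching step (e.g., a phase vocoder) so pitch moves while duration stays fixed.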
Getting Started
To get started with the w-okada/voice-changer library, follow these steps:
- Install the library using pip:
pip install voice-changer
- Import the VoiceChanger class and create an instance:
from voice_changer import VoiceChanger
vc = VoiceChanger()
- Load an audio file or start a real-time audio stream:
vc.load_audio_file('input_audio.wav')
# or
vc.start_stream()
- Convert the voice using the convert_voice() method:
converted_audio = vc.convert_voice()
- Save the converted audio to a file:
vc.save_audio_file('output_audio.wav')
- Adjust the voice conversion parameters as needed:
vc.set_pitch_shift(2.0)
vc.set_timbre_shift(0.5)
- Explore the available pre-trained voice conversion models:
vc = VoiceChanger(model_name='male_to_female')
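The load and save steps above ultimately come down to ordinary WAV file I/O, which Python's standard wave module already covers. A minimal round-trip, independent of any conversion model (the path here is just a temp file):

```python
import os
import tempfile
import wave

# Write one second of 16-bit mono silence at 16 kHz, then read it back.
path = os.path.join(tempfile.mkdtemp(), "input_audio.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)

with wave.open(path, "rb") as r:
    frames = r.getnframes()
print(frames)  # 16000
```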
Competitor Comparisons
CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- More comprehensive voice cloning solution, including voice encoding and synthesis
- Implements a pre-trained speaker verification model for enhanced accuracy
- Provides a graphical user interface for easier interaction
Cons of Real-Time-Voice-Cloning
- Less focus on real-time processing compared to voice-changer
- May require more computational resources due to its comprehensive approach
- Limited to English language support
Code Comparison
Real-Time-Voice-Cloning:
def load_model(weights_fpath):
    model = SpeakerEncoder()
    checkpoint = torch.load(weights_fpath)
    model.load_state_dict(checkpoint["model_state"])
    return model
voice-changer:
def get_audio_data(self):
    data = self.stream.read(self.chunk)
    return np.frombuffer(data, dtype=np.int16)
The Real-Time-Voice-Cloning code snippet shows model loading, while the voice-changer code focuses on real-time audio processing. This highlights the different priorities of each project, with Real-Time-Voice-Cloning emphasizing comprehensive voice cloning and voice-changer prioritizing real-time functionality.
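The np.frombuffer step in the voice-changer snippet can be exercised on its own; decoding 16-bit PCM bytes and normalizing to floats in [-1, 1] is the usual preprocessing before any model sees the audio (the PyAudio-style stream object is omitted here):

```python
import numpy as np

def bytes_to_float_audio(raw: bytes) -> np.ndarray:
    """Decode 16-bit PCM bytes and scale to [-1.0, 1.0]."""
    samples = np.frombuffer(raw, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

# Two samples: the maximum positive int16 value, then silence.
raw = np.array([32767, 0], dtype=np.int16).tobytes()
audio = bytes_to_float_audio(raw)
print(audio)  # first sample ~1.0, second exactly 0.0
```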
facebookresearch/fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
Pros of fairseq
- Comprehensive sequence-to-sequence modeling toolkit with support for various tasks
- Highly optimized and efficient implementation for large-scale training
- Extensive documentation and active community support
Cons of fairseq
- Steeper learning curve due to its complexity and wide range of features
- Requires more computational resources for training and inference
- Less focused on real-time voice conversion applications
Code comparison
fairseq:
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
translated = model.translate('Hello world!')
print(translated)
voice-changer:
from voice_changer import VoiceChanger
vc = VoiceChanger(model_path='/path/to/model')
converted_audio = vc.convert('input_audio.wav', target_speaker='speaker_id')
vc.save_audio(converted_audio, 'output_audio.wav')
Key differences
- fairseq is a general-purpose sequence-to-sequence toolkit, while voice-changer focuses specifically on voice conversion
- fairseq offers more flexibility and customization options, but voice-changer provides a simpler API for voice conversion tasks
- fairseq is better suited for research and large-scale applications, while voice-changer is more accessible for quick voice conversion projects
mozilla/TTS: Deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- Comprehensive text-to-speech toolkit with multiple models and languages
- Well-documented and actively maintained by Mozilla
- Supports both training and inference
Cons of TTS
- Focused solely on text-to-speech, lacking voice conversion capabilities
- May require more technical expertise to set up and use effectively
- Larger project scope, potentially overwhelming for simple use cases
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
voice-changer:
from voice_changer import VoiceChanger
vc = VoiceChanger(model_path="path/to/model")
vc.convert("input.wav", "output.wav", target_speaker="speaker_id")
Key Differences
- TTS focuses on generating speech from text, while voice-changer primarily handles voice conversion
- TTS offers a wider range of pre-trained models and languages
- voice-changer provides a simpler interface for voice conversion tasks
- TTS is better suited for large-scale text-to-speech applications
- voice-changer is more appropriate for real-time voice modification and conversion
NVIDIA/tacotron2: Tacotron 2, a PyTorch implementation with faster-than-realtime inference
Pros of Tacotron2
- Highly advanced text-to-speech synthesis model
- Backed by NVIDIA, with extensive documentation and research
- Supports fine-tuning for custom voices
Cons of Tacotron2
- More complex to set up and use
- Requires significant computational resources
- Limited to text-to-speech, not real-time voice conversion
Code Comparison
Tacotron2 (PyTorch implementation):
from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams
hparams = create_hparams()
model = Tacotron2(hparams)
Voice-changer (JavaScript implementation):
import { VoiceChanger } from 'voice-changer';
const voiceChanger = new VoiceChanger();
voiceChanger.initialize();
Key Differences
- Tacotron2 focuses on text-to-speech synthesis, while Voice-changer is designed for real-time voice conversion
- Tacotron2 is implemented in Python using PyTorch, whereas Voice-changer is primarily JavaScript-based
- Tacotron2 requires more setup and computational resources, while Voice-changer is more lightweight and easier to integrate into web applications
Use Cases
- Tacotron2: High-quality text-to-speech systems, custom voice synthesis for virtual assistants
- Voice-changer: Real-time voice modification for gaming, streaming, or voice chat applications
resemble-ai/Resemblyzer: A Python package to analyze and compare voices with deep learning
Pros of Resemblyzer
- Focused on speaker verification and embedding extraction
- Lightweight and easy to integrate into other projects
- Well-documented with clear examples and use cases
Cons of Resemblyzer
- Limited to speaker recognition tasks, not full voice changing
- Less active development and community support
- Fewer features for real-time voice manipulation
Code Comparison
Resemblyzer:
from resemblyzer import VoiceEncoder, preprocess_wav
encoder = VoiceEncoder()
wav = preprocess_wav(wav_fpath)
embedding = encoder.embed_utterance(wav)
Voice Changer:
from voice_changer import VoiceChanger
vc = VoiceChanger()
changed_voice = vc.change_voice(input_audio, target_voice)
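Resemblyzer's embed_utterance returns a fixed-size embedding, and speaker verification then reduces to comparing embeddings, typically by cosine similarity. A self-contained sketch of that comparison, with random vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1]; values near 1 suggest the same speaker."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
speaker = rng.normal(size=256)                 # stand-in for an embedding
same = speaker + 0.05 * rng.normal(size=256)   # slightly perturbed: same speaker
other = rng.normal(size=256)                   # independent: different speaker

print(cosine_similarity(speaker, same) > cosine_similarity(speaker, other))
```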
Key Differences
- Resemblyzer is primarily for speaker recognition and embedding generation, while Voice Changer focuses on voice transformation and manipulation.
- Voice Changer offers more comprehensive voice modification features, including pitch shifting and voice conversion.
- Resemblyzer is better suited for tasks like speaker identification and verification, while Voice Changer is designed for voice alteration and synthesis.
Use Cases
Resemblyzer is ideal for:
- Speaker verification systems
- Voice-based authentication
- Speaker clustering and diarization
Voice Changer is better for:
- Voice conversion applications
- Real-time voice modification
- Creating voice effects and alterations
README
Japanese / English / Korean / Chinese / German / Arabic / Greek / Spanish / French / Italian / Latin / Malay / Russian (*languages other than Japanese are machine-translated)
VCClient
VCClient is software that performs real-time voice conversion using AI.
What's New!
- v.2.0.76-beta
  - New features:
    - Beatrice: speaker merging
    - Beatrice: automatic pitch shift
  - Bug fixes:
    - Fixed a device-selection issue in server mode
- v.2.0.73-beta
  - New features:
    - Download of edited Beatrice models
  - Bug fixes:
    - Fixed a bug where Beatrice v2 pitch and formant settings were not applied
    - Fixed a bug where models using Applio's embedder could not be exported to ONNX
Downloads and related links
The Windows and M1 Mac builds can be downloaded from the Hugging Face repository.
*1 On Linux, clone the repository to use it.
Related links
- Beatrice V2 training code repository
- Beatrice V2 training code, Colab edition
Related software
- Real-time voice changer VCClient
- Text-to-speech software TTSClient
- Real-time speech recognition software ASRClient
Features of VC Client
Supports a variety of AI models
AI model | v.2 | v.1 | License |
---|---|---|---|
RVC | supported | supported | See the repository. |
Beatrice v1 | n/a | supported (Windows only) | Proprietary |
Beatrice v2 | supported | n/a | Proprietary |
MMVC | n/a | supported | See the repository. |
so-vits-svc | n/a | supported | See the repository. |
DDSP-SVC | n/a | supported | See the repository. |
Supports both standalone and networked configurations
Voice conversion can run entirely on the local PC or over a network. Running it over a network lets you offload the voice-conversion workload to an external machine, which helps when using it alongside high-load applications such as games.
Runs on multiple platforms
Windows, Mac (M1), Linux, Google Colab
*1 On Linux, clone the repository to use it.
Provides a REST API
Clients can be written in any programming language.
The API can also be driven with HTTP clients that ship with the OS, such as curl.
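Since this page does not document the endpoint layout, the snippet below only sketches how a client would drive such a REST API from Python; the /api/convert path, the pitch parameter, and port 18888 are assumptions for illustration, not VCClient's documented interface:

```python
import urllib.request

def build_convert_request(host: str, wav_bytes: bytes, pitch: int = 0):
    """Build (but do not send) a POST carrying raw audio to a
    hypothetical /api/convert endpoint; adapt to the real API."""
    url = f"http://{host}/api/convert?pitch={pitch}"
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

req = build_convert_request("localhost:18888", b"\x00\x00", pitch=2)
print(req.get_method(), req.full_url)
# Sending would be urllib.request.urlopen(req) against a running server.
```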
Troubleshooting
About the developer signature
This software is not signed by its developer, so a warning like the one below appears. You can still run it by clicking the icon while holding the Control key. This is a consequence of Apple's security policy; run it at your own risk.
Acknowledgments
The voice synthesis in this software uses voice data published free of charge by the free-material character "Tsukuyomi-chan".
■ Tsukuyomi-chan Corpus (CV. Rei Yumesaki)
https://tyc.rei-yumesaki.net/material/corpus/
© Rei Yumesaki
Terms of Use
- For the real-time voice changer Tsukuyomi-chan, in accordance with the terms of use of the Tsukuyomi-chan Corpus, using the converted audio for the following purposes is prohibited:
■ Criticizing or attacking others (the definition of "criticizing or attacking" follows the Tsukuyomi-chan character license).
■ Advocating for or against particular political positions, religions, or ideologies.
■ Publishing strongly graphic or provocative content without zoning.
■ Publishing it in a form that allows others to reuse it (as source material).
* Distributing or selling it as a work for appreciation is not a problem.
- For the real-time voice changer Amitaro, the following terms of use of Amitaro's voice-material studio apply. Details here:
It is fine to build voice models from Amitaro's voice materials or corpus recordings, and to use a voice changer or voice-conversion tool to turn your own voice into Amitaro's.
In that case, however, you must always state clearly that the voice has been converted into Amitaro's (or Koharune Ami's) voice, and make it obvious to everyone that it is not Amitaro (or Koharune Ami) who is actually speaking.
Also, anything spoken in Amitaro's voice must stay within the scope of the voice material's terms of use; do not make sensitive statements.
- For the real-time voice changer Kikoto Mahiro, the terms of use of Replica Doll apply. Details here.
Disclaimer
We accept no liability whatsoever for any direct, indirect, consequential, resulting, or special damages arising from the use of, or the inability to use, this software.