speechbrain/speechbrain

A PyTorch-based Speech Toolkit


Top Related Projects

  • coqui-ai/TTS (37,019 ⭐): 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
  • espnet/espnet (8,696 ⭐): End-to-End Speech Processing Toolkit
  • pytorch/audio (2,584 ⭐): Data manipulation and transformation for audio signal processing, powered by PyTorch
  • kaldi-asr/kaldi (14,458 ⭐): The official location of the Kaldi project
  • alphacep/vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Quick Overview

SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. It provides a wide range of speech-related technologies, including speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing, and language modeling. The project aims to make speech technology research and development more accessible and efficient.

Pros

  • Comprehensive toolkit covering various speech processing tasks
  • Built on PyTorch, allowing for easy integration with other deep learning projects
  • Modular architecture for flexibility and customization
  • Extensive documentation and tutorials for beginners and advanced users

Cons

  • Steeper learning curve for users new to speech processing
  • May require significant computational resources for large-scale tasks
  • Some advanced features might be less stable or thoroughly tested compared to more established libraries

Code Examples

  1. Speech Recognition:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("audio_file.wav")
print(transcription)
  2. Speaker Recognition:
from speechbrain.pretrained import SpeakerRecognition

speaker_model = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = speaker_model.verify_files("speaker1.wav", "speaker2.wav")
print(f"Verification score: {score}")
print(f"Same speaker: {prediction}")
  3. Speech Enhancement:
from speechbrain.pretrained import SpectralMaskEnhancement

enhancer = SpectralMaskEnhancement.from_hparams(source="speechbrain/metricgan-plus-voicebank", savedir="pretrained_models/metricgan-plus-voicebank")
enhanced_signal = enhancer.enhance_file("noisy_speech.wav")
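The enhancement example returns the denoised waveform as a PyTorch tensor rather than writing a file. One way to save or listen to it is torchaudio; this is a minimal sketch, and the added channel dimension and the 16 kHz sample rate are assumptions that depend on the model and on the shape returned by enhance_file:

import torchaudio

# torchaudio.save expects a [channels, time] tensor; add a channel dimension
# if enhance_file returned a 1-D waveform (assumption about the output shape)
torchaudio.save("enhanced_speech.wav", enhanced_signal.unsqueeze(0).cpu(), 16000)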

Getting Started

  1. Install SpeechBrain:
pip install speechbrain
  2. Import and use a pretrained model:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("path/to/audio_file.wav")
print(transcription)
  3. For custom training, refer to the SpeechBrain documentation and tutorials for detailed instructions on preparing data, defining hyperparameters, and running experiments; a minimal sketch of what a custom training loop looks like is shown below.
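The sketch below is a toy illustration of custom training with SpeechBrain's Brain class; the SimpleBrain name and the random-tensor "dataset" are made up for the example and are not part of the library:

import torch
import speechbrain as sb

class SimpleBrain(sb.Brain):
    def compute_forward(self, batch, stage):
        # Forward pass through the module registered as "model" below
        return self.modules.model(batch["input"])

    def compute_objectives(self, predictions, batch, stage):
        # Any differentiable loss works; L1 keeps the toy example simple
        return torch.nn.functional.l1_loss(predictions, batch["target"])

model = torch.nn.Linear(in_features=10, out_features=10)
brain = SimpleBrain({"model": model}, opt_class=lambda x: torch.optim.SGD(x, 0.1))

# A list of ready-made batches of random tensors stands in for a real dataset
data = [{"input": torch.rand(10, 10), "target": torch.rand(10, 10)}]
brain.fit(range(10), data)

Real recipes replace the toy data with SpeechBrain's data pipeline and read the model, optimizer, and loss settings from a YAML hyperparameter file.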

Competitor Comparisons

coqui-ai/TTS (37,019 ⭐)

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Pros of TTS

  • Focused specifically on text-to-speech, offering a wide range of TTS models and voices
  • Provides pre-trained models for quick deployment and easy fine-tuning
  • Supports multiple languages and accents out-of-the-box

Cons of TTS

  • Limited to text-to-speech tasks, lacking broader speech processing capabilities
  • May require more setup and configuration for custom use cases
  • Less extensive documentation compared to SpeechBrain

Code Comparison

SpeechBrain example:

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("audio_file.wav")

TTS example:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Both repositories offer powerful speech processing capabilities, but they focus on different aspects. SpeechBrain provides a comprehensive toolkit for various speech-related tasks, including ASR, TTS, and speaker recognition. TTS, on the other hand, specializes in text-to-speech synthesis, offering a range of models and voices for generating speech from text input.

espnet/espnet (8,696 ⭐)

End-to-End Speech Processing Toolkit

Pros of ESPnet

  • More comprehensive toolkit with support for various speech processing tasks
  • Larger community and more frequent updates
  • Better documentation and extensive examples

Cons of ESPnet

  • Steeper learning curve due to complexity
  • Heavier dependencies and larger codebase
  • Less flexibility for customization in some areas

Code Comparison

ESPnet example (ASR training):

from espnet2.bin.asr_train import main

# espnet2 entry points parse command-line style arguments; a real run also
# needs data, tokenizer, and model configuration options (omitted here).
main(cmd=[
    "--output_dir", "exp/asr_train_asr_transformer_raw_bpe",
    "--max_epoch", "100",
    "--batch_size", "32",
    "--accum_grad", "2",
    "--use_amp", "true",
])

SpeechBrain example (pretrained ASR inference):

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-transformer-transformerlm-librispeech",
    savedir="pretrained_models/asr-transformer-transformerlm-librispeech",
)

transcripts = asr_model.transcribe_file("audio_file.wav")

Both ESPnet and SpeechBrain are powerful speech processing toolkits, but they cater to slightly different needs. ESPnet offers a more comprehensive set of features and has a larger community, while SpeechBrain provides a more user-friendly interface and easier customization for specific use cases.

pytorch/audio (2,584 ⭐)

Data manipulation and transformation for audio signal processing, powered by PyTorch

Pros of Audio

  • More comprehensive audio processing toolkit with broader functionality
  • Tighter integration with PyTorch ecosystem
  • Larger community and more frequent updates

Cons of Audio

  • Steeper learning curve for beginners
  • Less focused on speech recognition tasks
  • Requires more manual setup for speech-specific workflows

Code Comparison

SpeechBrain example (speech recognition):

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("audio.wav")

Audio example (audio processing):

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
spectrogram = torchaudio.transforms.Spectrogram()(waveform)
mel_spectrogram = torchaudio.transforms.MelSpectrogram()(waveform)

SpeechBrain focuses on providing high-level APIs for speech-related tasks, making it easier to implement speech recognition systems. Audio offers a broader range of audio processing tools but requires more manual setup for speech-specific tasks. SpeechBrain is more beginner-friendly for speech recognition, while Audio provides more flexibility for general audio processing.

kaldi-asr/kaldi (14,458 ⭐)

The official location of the Kaldi project.

Pros of Kaldi

  • Mature and widely-used in industry and academia
  • Extensive documentation and community support
  • Highly optimized for performance, especially for large-scale tasks

Cons of Kaldi

  • Steeper learning curve due to C++ codebase
  • Less flexible for rapid prototyping and experimentation
  • Requires more manual configuration and setup

Code Comparison

Kaldi (C++):

LatticeFasterDecoder decoder(fst, config);
DecodableAmDiagGmmScaled gmm_decodable(am_gmm, trans_model, features, acoustic_scale);
decoder.Decode(&gmm_decodable);

SpeechBrain (Python):

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
transcripts = asr_model.transcribe_file("audio.wav")

SpeechBrain is a more recent, Python-based toolkit that offers easier integration with deep learning frameworks and simpler usage for researchers. It provides a higher-level API and is more suitable for quick experimentation. However, Kaldi remains a powerful choice for production systems and large-scale tasks, especially when performance optimization is crucial.

alphacep/vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Pros of Vosk-API

  • Lightweight and optimized for mobile and embedded devices
  • Supports offline speech recognition without internet connection
  • Provides bindings for multiple programming languages (Python, Java, Node.js, etc.)

Cons of Vosk-API

  • Limited to speech recognition tasks, lacks broader speech processing capabilities
  • Smaller community and fewer pre-trained models compared to SpeechBrain
  • Less flexibility for customizing model architectures

Code Comparison

SpeechBrain example (speech recognition):

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
transcription = asr_model.transcribe_file("audio_file.wav")

Vosk-API example (speech recognition):

from vosk import Model, KaldiRecognizer
import wave

model = Model("model")
wf = wave.open("audio_file.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
# Feed audio to the recognizer chunk by chunk before requesting the result
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)
result = rec.FinalResult()

Both repositories focus on speech processing, but SpeechBrain offers a more comprehensive toolkit for various speech-related tasks, while Vosk-API specializes in lightweight, offline speech recognition. SpeechBrain provides a higher level of abstraction and more pre-trained models, making it easier for researchers to experiment with different architectures. Vosk-API, on the other hand, is more suitable for deployment in resource-constrained environments or applications requiring offline functionality.


README


| 📘 Tutorials | 🌐 Website | 📚 Documentation | 🤝 Contributing | 🤗 HuggingFace | ▶️ YouTube | 🐦 X |

Please help our community project. Star SpeechBrain on GitHub!

Exciting News (January 2024): Discover what is new in SpeechBrain 1.0 here!

🗣️💬 What SpeechBrain Offers

  • SpeechBrain is an open-source PyTorch toolkit that accelerates Conversational AI development, i.e., the technology behind speech assistants, chatbots, and large language models.

  • It is crafted for fast and easy creation of advanced technologies for Speech and Text Processing.

🌐 Vision

  • With the rise of deep learning, once-distant domains like speech processing and NLP are now very close. A well-designed neural network and large datasets are all you need.

  • We think it is now time for a holistic toolkit that, mimicking the human brain, jointly supports diverse technologies for complex Conversational AI systems.

  • This spans speech recognition, speaker recognition, speech enhancement, speech separation, language modeling, dialogue, and beyond.

  • Aligned with our long-term goal of natural human-machine conversation, including for non-verbal individuals, we have recently added support for the EEG modality.

📚 Training Recipes

  • We share over 200 competitive training recipes on more than 40 datasets supporting 20 speech and text processing tasks (see below).

  • We support both training from scratch and fine-tuning pretrained models such as Whisper, Wav2Vec2, WavLM, Hubert, GPT2, Llama2, and beyond. The models on HuggingFace can be easily plugged in and fine-tuned.

  • For any task, you train the model using the following command:

python train.py hparams/train.yaml
  • The hyperparameters are encapsulated in a YAML file, while the training process is orchestrated through a Python script (a minimal sketch of this mechanism follows this list).

  • We maintain a consistent code structure across different tasks.

  • For better replicability, training logs and checkpoints are hosted on Dropbox.
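To illustrate the YAML mechanism mentioned above, here is a minimal sketch using HyperPyYAML, the YAML extension SpeechBrain builds on; the keys lr and model are made up for the example:

from hyperpyyaml import load_hyperpyyaml

# A hyperparameter file mixes plain values and full objects; the !new: tag
# instantiates the referenced class with the arguments listed beneath it.
yaml_string = """
lr: 0.001
model: !new:torch.nn.Linear
    in_features: 80
    out_features: 40
"""

hparams = load_hyperpyyaml(yaml_string)
print(hparams["lr"])     # 0.001
print(hparams["model"])  # an instantiated torch.nn.Linear module

In a recipe, the training script receives such an hparams dictionary and hands the relevant entries (model, optimizer, loss, paths) to the Brain class.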

Pretrained Models and Inference

  • Access over 100 pretrained models hosted on HuggingFace.
  • Each model comes with a user-friendly interface for seamless inference. For example, transcribing speech using a pretrained model requires just three lines of code:
from speechbrain.inference import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-conformer-transformerlm-librispeech", savedir="pretrained_models/asr-transformer-transformerlm-librispeech")
asr_model.transcribe_file("speechbrain/asr-conformer-transformerlm-librispeech/example.wav")

Documentation

  • We are deeply dedicated to promoting inclusivity and education.
  • We have authored over 30 tutorials that not only describe how SpeechBrain works but also help users familiarize themselves with Conversational AI.
  • Every class or function has clear explanations and examples that you can run. Check out the documentation for more details 📚.

🎯 Use Cases

  • 🚀 Research Acceleration: Speeding up academic and industrial research. You can develop and integrate new models effortlessly, comparing their performance against our baselines.

  • ⚡️ Rapid Prototyping: Ideal for quick prototyping in time-sensitive projects.

  • 🎓 Educational Tool: SpeechBrain's simplicity makes it a valuable educational resource. It is used by institutions like Mila, Concordia University, Avignon University, and many others for student training.

🚀 Quick Start

To get started with SpeechBrain, follow these simple steps:

🛠️ Installation

Install via PyPI

  1. Install SpeechBrain using PyPI:

    pip install speechbrain
    
  2. Access SpeechBrain in your Python code:

    import speechbrain as sb
    

Install from GitHub

This installation is recommended for users who wish to conduct experiments and customize the toolkit according to their needs.

  1. Clone the GitHub repository and install the requirements:

    git clone https://github.com/speechbrain/speechbrain.git
    cd speechbrain
    pip install -r requirements.txt
    pip install --editable .
    
  2. Access SpeechBrain in your Python code:

    import speechbrain as sb
    

Any modifications made to the speechbrain package will be automatically reflected, thanks to the --editable flag.

✔️ Test Installation

Ensure your installation is correct by running the following commands:

pytest tests
pytest --doctest-modules speechbrain

🏃‍♂️ Running an Experiment

In SpeechBrain, you can train a model for any task using the following steps:

cd recipes/<dataset>/<task>/
python experiment.py params.yaml

The results will be saved in the output_folder specified in the YAML file.

📘 Learning SpeechBrain

  • Website: Explore general information on the official website.

  • Tutorials: Start with basic tutorials covering fundamental functionalities. Find advanced tutorials and topics in the Tutorial notebooks category in the SpeechBrain documentation.

  • Documentation: Detailed information on the SpeechBrain API, contribution guidelines, and code is available in the documentation.

🔧 Supported Technologies

  • SpeechBrain is a versatile framework designed for implementing a wide range of technologies within the field of Conversational AI.
  • It excels not only in individual task implementations but also in combining various technologies into complex pipelines.

🎙️ Speech/Audio Processing

| Tasks | Datasets | Technologies/Models |
|-------|----------|---------------------|
| Speech Recognition | AISHELL-1, CommonVoice, DVoice, KsponSpeech, LibriSpeech, MEDIA, RescueSpeech, Switchboard, TIMIT, Tedlium2, Voicebank | CTC, Transducers, Transformers, Seq2Seq, Beamsearch techniques (CTC, seq2seq, transducers), Rescoring, Conformer, Branchformer, Hyperconformer, Kaldi2-FST |
| Speaker Recognition | VoxCeleb | ECAPA-TDNN, ResNET, Xvectors, PLDA, Score Normalization |
| Speech Separation | WSJ0Mix, LibriMix, WHAM!, WHAMR!, Aishell1Mix, BinauralWSJ0Mix | SepFormer, RESepFormer, SkiM, DualPath RNN, ConvTasNET |
| Speech Enhancement | DNS, Voicebank | SepFormer, MetricGAN, MetricGAN-U, SEGAN, spectral masking, time masking |
| Interpretability | ESC50 | Listenable Maps for Audio Classifiers (L-MAC), Learning-to-Interpret (L2I), Non-Negative Matrix Factorization (NMF), PIQ |
| Speech Generation | AudioMNIST | Diffusion, Latent Diffusion |
| Text-to-Speech | LJSpeech, LibriTTS | Tacotron2, Zero-Shot Multi-Speaker Tacotron2, FastSpeech2 |
| Vocoding | LJSpeech, LibriTTS | HiFiGAN, DiffWave |
| Spoken Language Understanding | MEDIA, SLURP, Fluent Speech Commands, Timers-and-Such | Direct SLU, Decoupled SLU, Multistage SLU |
| Speech-to-Speech Translation | CVSS | Discrete Hubert, HiFiGAN, wav2vec2 |
| Speech Translation | Fisher CallHome (Spanish), IWSLT22 (low-resource) | wav2vec2 |
| Emotion Classification | IEMOCAP, ZaionEmotionDataset | ECAPA-TDNN, wav2vec2, Emotion Diarization |
| Language Identification | VoxLingua107, CommonLanguage | ECAPA-TDNN |
| Voice Activity Detection | LibriParty | CRDNN |
| Sound Classification | ESC50, UrbanSound | CNN14, ECAPA-TDNN |
| Self-Supervised Learning | CommonVoice, LibriSpeech | wav2vec2 |
| Metric Learning | REAL-M, Voicebank | Blind SNR-Estimation, PESQ Learning |
| Alignment | TIMIT | CTC, Viterbi, Forward Forward |
| Diarization | AMI | ECAPA-TDNN, X-vectors, Spectral Clustering |
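Many of the rows above are also available as pretrained models with simple inference interfaces. As a hedged example (please check the HuggingFace hub for the exact model identifier), language identification with an ECAPA-TDNN trained on VoxLingua107 might look like this:

from speechbrain.pretrained import EncoderClassifier

lang_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="pretrained_models/lang-id-voxlingua107-ecapa",
)
# classify_file returns the posteriors, the best score, the index, and the label
out_prob, score, index, text_lab = lang_id.classify_file("audio_file.wav")
print(text_lab)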

📝 Text Processing

| Tasks | Datasets | Technologies/Models |
|-------|----------|---------------------|
| Language Modeling | CommonVoice, LibriSpeech | n-grams, RNNLM, TransformerLM |
| Response Generation | MultiWOZ | GPT2, Llama2 |
| Grapheme-to-Phoneme | LibriSpeech | RNN, Transformer, Curriculum Learning, Homograph loss |

🧠 EEG Processing

| Tasks | Datasets | Technologies/Models |
|-------|----------|---------------------|
| Motor Imagery | BNCI2014001, BNCI2014004, BNCI2015001, Lee2019_MI, Zhou201 | EEGNet, ShallowConvNet, EEGConformer |
| P300 | BNCI2014009, EPFLP300, bi2015a | EEGNet |
| SSVEP | Lee2019_SSVEP | EEGNet |

🔍 Additional Features

SpeechBrain includes a range of native functionalities that enhance the development of Conversational AI technologies. Here are some examples:

  • Training Orchestration: The Brain class serves as a fully customizable tool for managing training and evaluation loops over data. It simplifies training loops while providing the flexibility to override any part of the process.

  • Hyperparameter Management: A YAML-based hyperparameter file specifies all hyperparameters, from individual numbers (e.g., learning rate) to complete objects (e.g., custom models). This elegant solution drastically simplifies the training script.

  • Dynamic Dataloader: Enables flexible and efficient data reading (see the sketch after this list).

  • GPU Training: Supports single and multi-GPU training, including distributed training.

  • Dynamic Batching: On-the-fly dynamic batching enhances the efficient processing of variable-length signals.

  • Mixed-Precision Training: Accelerates training through mixed-precision techniques.

  • Efficient Data Reading: Reads large datasets efficiently from a shared Network File System (NFS) via WebDataset.

  • Hugging Face Integration: Interfaces seamlessly with HuggingFace for popular models such as wav2vec2 and Hubert.

  • Orion Integration: Interfaces with Orion for hyperparameter tuning.

  • Speech Augmentation Techniques: Includes SpecAugment, Noise, Reverberation, and more.

  • Data Preparation Scripts: Includes scripts for preparing data for supported datasets.
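To make the dynamic dataloading bullet above concrete, here is a minimal sketch of SpeechBrain's dynamic-item data pipeline; the manifest contents and the audio path are hypothetical:

import speechbrain as sb
from speechbrain.dataio.dataset import DynamicItemDataset

# Hypothetical manifest: one entry per utterance with its raw annotation fields
data = {
    "utt1": {"wav": "path/to/utt1.wav", "words": "hello world"},
}
dataset = DynamicItemDataset(data)

# Dynamic items (here the audio signal "sig") are computed on the fly
@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav):
    return sb.dataio.dataio.read_audio(wav)

dataset.add_dynamic_item(audio_pipeline)
dataset.set_output_keys(["id", "sig", "words"])

Such a dataset plugs directly into the Brain class and the dynamic batching utilities mentioned above.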

SpeechBrain is rapidly evolving, with ongoing efforts to support a growing array of technologies in the future.

📊 Performance

  • SpeechBrain integrates a variety of technologies, including those that achieve competitive or state-of-the-art performance.

  • For a comprehensive overview of the achieved performance across different tasks, datasets, and technologies, please visit here.

📜 License

  • SpeechBrain is released under the Apache License, version 2.0, a popular BSD-like license.
  • You are free to redistribute SpeechBrain for both free and commercial purposes, with the condition of retaining license headers. Unlike the GPL, the Apache License is not viral, meaning you are not obligated to release modifications to the source code.

🔮Future Plans

We have ambitious plans for the future, with a focus on the following priorities:

  • Scale Up: We aim to provide comprehensive recipes and technologies for training massive models on extensive datasets.

  • Scale Down: While scaling up delivers unprecedented performance, we recognize the challenges of deploying large models in production scenarios. We are focusing on real-time, streamable, and small-footprint Conversational AI.

  • Multimodal Large Language Models: We envision a future where a single foundation model can handle a wide range of text, speech, and audio tasks. Our core team is focused on enabling the training of advanced multimodal LLMs.

🤝 Contributing

  • SpeechBrain is a community-driven project, led by a core team with the support of numerous international collaborators.
  • We welcome contributions and ideas from the community. For more information, check here.

🙏 Sponsors

  • SpeechBrain is an academically driven project and relies on the passion and enthusiasm of its contributors.
  • As we cannot rely on the resources of a large company, we deeply appreciate any form of support, including donations or collaboration with the core team.
  • If you're interested in sponsoring SpeechBrain, please reach out to us at speechbrainproject@gmail.com.
  • A heartfelt thank you to all our sponsors, including the current ones.


📖 Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}