Top Related Projects
- ESPnet: End-to-End Speech Processing Toolkit
- fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
- SpeechBrain: A PyTorch-based Speech Toolkit
- Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers
- TTS (Coqui): 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Quick Overview
Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. It is designed to be flexible and modular, providing state-of-the-art speech recognition techniques and algorithms for researchers and industry professionals.
Pros
- Highly efficient and optimized for performance
- Extensive documentation and active community support
- Supports various speech recognition tasks and models
- Integrates well with other tools and libraries in the speech processing ecosystem
Cons
- Steep learning curve for beginners
- Complex setup and installation process
- Primarily command-line based, lacking a user-friendly GUI
- Requires significant computational resources for large-scale tasks
Code Examples
- Feature extraction using MFCC:
#include "feat/feature-mfcc.h"
using namespace kaldi;
int main() {
MfccOptions opts;
opts.num_ceps = 13;
Mfcc mfcc(opts);
Vector<BaseFloat> waveform(1000);
// Fill waveform with audio data
Matrix<BaseFloat> features;
mfcc.Compute(waveform, 1.0, &features);
}
- Decoding using a simple FST-based decoder:
#include "decoder/simple-decoder.h"
using namespace kaldi;
int main() {
FstReader<StdArc> fst_reader;
Fst<StdArc>* decode_fst = fst_reader.Read("graph.fst");
SimpleDecoder decoder(*decode_fst);
DecodableInterface* decodable = // Initialize your decodable object
decoder.Decode(decodable);
Lattice lat;
decoder.GetBestPath(&lat);
}
- Training a GMM-HMM model:
#include "gmm/am-diag-gmm.h"
#include "hmm/transition-model.h"
using namespace kaldi;
int main() {
AmDiagGmm am_gmm;
TransitionModel trans_model;
// Initialize am_gmm and trans_model
for (int iter = 0; iter < num_iters; iter++) {
// Accumulate statistics
AccumAmDiagGmm gmm_accs;
for (/* each utterance */) {
// Accumulate stats for current utterance
}
// Update model
BaseFloat objf_impr, count;
am_gmm.Update(gmm_accs, kGmmAll, &objf_impr, &count);
}
}
Getting Started
- Clone the Kaldi repository:
  git clone https://github.com/kaldi-asr/kaldi.git
  cd kaldi
- Install dependencies:
  ./tools/extras/check_dependencies.sh
- Compile Kaldi:
  cd tools
  make -j $(nproc)
  cd ../src
  ./configure --shared
  make depend -j $(nproc)
  make -j $(nproc)
- Run example scripts:
  cd egs/yesno/s5
  ./run.sh
Competitor Comparisons
ESPnet: End-to-End Speech Processing Toolkit
Pros of ESPnet
- More user-friendly and easier to get started with
- Supports end-to-end models like Transformer and Conformer
- Integrates well with popular deep learning frameworks (PyTorch)
Cons of ESPnet
- Less mature and stable compared to Kaldi
- Smaller community and fewer pre-trained models available
- May not be as suitable for large-scale production deployments
Code Comparison
ESPnet (Python-based):
import soundfile
import torch
from espnet2.bin.asr_inference import Speech2Text
speech2text = Speech2Text.from_pretrained("espnet/model")
speech, rate = soundfile.read("audio.wav")
nbest = speech2text(speech)
text, *_ = nbest[0]
Kaldi (C++ and shell script-based):
# Extract features
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
copy-feats ark:- ark,scp:mfcc/raw_mfcc.ark,mfcc/raw_mfcc.scp
# Decode
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
--acoustic-scale=0.083333 --allow-partial=true \
--word-symbol-table=exp/tri4b/graph/words.txt \
exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst \
"ark,s,cs:apply-cmvn --utt2spk=ark:data/test/utt2spk scp:data/test/cmvn.scp scp:mfcc/raw_mfcc.scp ark:- | add-deltas ark:- ark:- |" \
"ark:|gzip -c > exp/tri4b/decode_test/lat.1.gz"
fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
Pros of fairseq
- More versatile, supporting various NLP tasks beyond speech recognition
- Easier to use and integrate with PyTorch ecosystem
- More active development and frequent updates
Cons of fairseq
- Less specialized for speech recognition compared to Kaldi
- May require more computational resources for training
Code Comparison
Kaldi (C++):
LatticeFasterDecoder decoder(fst, config);
DecodableAmDiagGmmScaled gmm_decodable(am_gmm, trans_model, features, acoustic_scale);
decoder.Decode(&gmm_decodable);
fairseq (Python):
model = TransformerModel.build_model(args, task)
generator = task.build_generator([model], args)
translations = task.inference_step(generator, [model], sample)
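fairseq also exposes a hub-style interface for loading pretrained checkpoints, which is closer to the "easy integration with the PyTorch ecosystem" mentioned above. The following is a minimal sketch; the checkpoint and data-bin paths are placeholder assumptions, not part of the original comparison:
from fairseq.models.transformer import TransformerModel

# Load a pretrained model from a local checkpoint directory.
# The paths below are placeholders; substitute your own checkpoint/data-bin.
model = TransformerModel.from_pretrained(
    "checkpoints/",
    checkpoint_file="model.pt",
    data_name_or_path="data-bin/",
)
model.eval()

# The hub interface wraps tokenization, generation, and detokenization.
print(model.translate("Hello world!"))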
Summary
Fairseq is a more versatile and user-friendly toolkit for various NLP tasks, including speech recognition. It benefits from integration with PyTorch and active development. However, Kaldi remains a specialized and efficient tool specifically for speech recognition tasks. The choice between the two depends on the specific requirements of the project and the user's familiarity with the respective ecosystems.
SpeechBrain: A PyTorch-based Speech Toolkit
Pros of SpeechBrain
- More user-friendly and easier to learn, especially for those familiar with PyTorch
- Offers end-to-end solutions for various speech tasks, including pre-trained models
- Actively maintained with frequent updates and a growing community
Cons of SpeechBrain
- Less mature and battle-tested compared to Kaldi's long-standing reputation
- Fewer available recipes and pre-built models for specific tasks or languages
- May have lower performance in some scenarios due to its higher-level abstractions
Code Comparison
SpeechBrain example (PyTorch-based):
import speechbrain as sb

class SimpleCNN(sb.Brain):
    def compute_forward(self, batch, stage):
        wavs, lens = batch.sig
        feats = self.modules.compute_features(wavs)
        out = self.modules.cnn(feats)
        return out
Kaldi example (C++ based):
void ComputeFeatures(const VectorBase<BaseFloat> &wave,
                     Matrix<BaseFloat> *output_feats) {
  // Fbank is the whole-waveform filterbank extractor (feat/feature-fbank.h).
  Fbank fbank(fbank_opts_);
  fbank.Compute(wave, 1.0, output_feats);
}
The code snippets illustrate the difference in language and abstraction level between the two libraries, with SpeechBrain offering a more high-level, Python-based approach compared to Kaldi's lower-level C++ implementation.
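Beyond the training loop above, SpeechBrain also offers a pretrained-model API that makes the "user-friendly" point concrete. A minimal sketch follows; the model identifier and cache directory are illustrative assumptions based on one of SpeechBrain's published LibriSpeech recipes, not anything from the original comparison:
from speechbrain.pretrained import EncoderDecoderASR

# Download (or load from cache) a published LibriSpeech ASR model.
# The source/savedir values are illustrative assumptions.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

# Transcribe a local audio file (expects a path to a readable audio file).
print(asr_model.transcribe_file("audio.wav"))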
Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of Vosk
- Easier to use and integrate, with a simpler API and pre-built models
- Supports multiple programming languages (Python, Java, Node.js, etc.)
- Designed for real-time speech recognition on mobile and embedded devices
Cons of Vosk
- Less flexible and customizable compared to Kaldi's extensive toolkit
- Smaller community and fewer resources for advanced users
- Limited to pre-trained models, which may not cover all use cases
Code Comparison
Vosk (Python):
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
# Audio processing and recognition code
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
. ./cmd.sh
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
# Additional preprocessing and training steps
Vosk provides a higher-level API for quick integration, while Kaldi offers more granular control over the speech recognition pipeline. Vosk is better suited for developers seeking rapid deployment, whereas Kaldi is ideal for researchers and those requiring extensive customization.
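To make the "higher-level API" point concrete, here is a minimal sketch of file-based recognition with Vosk. It assumes a 16 kHz, 16-bit mono WAV and a model directory downloaded from the Vosk site; the file and directory names are placeholders:
import json
import wave

from vosk import Model, KaldiRecognizer

# Open a 16 kHz, 16-bit mono WAV file (path is illustrative).
wf = wave.open("audio.wav", "rb")

model = Model("model")  # directory containing a downloaded Vosk model
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# The final result is returned as a JSON string.
print(json.loads(rec.FinalResult())["text"])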
DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers
Pros of DeepSpeech
- Easier to use and deploy, with a simpler architecture
- Better support for streaming audio and real-time transcription
- More accessible for developers without extensive speech recognition expertise
Cons of DeepSpeech
- Less flexible and customizable compared to Kaldi's extensive toolkit
- May not perform as well on specialized or domain-specific tasks
- Smaller community and fewer pre-trained models available
Code Comparison
DeepSpeech example (Python):
import deepspeech

model = deepspeech.Model('model.pbmm')
# `audio` must be a 16 kHz, 16-bit mono NumPy int16 array
text = model.stt(audio)
Kaldi example (Bash/Shell):
. ./path.sh
. ./cmd.sh
steps/nnet3/decode.sh --nj 4 --cmd "$decode_cmd" \
exp/nnet3/tdnn_sp/graph_tgsmall data/test_dev93 \
exp/nnet3/tdnn_sp/decode_test_dev93_tgsmall
DeepSpeech uses a simpler API, making it more accessible for quick implementation. Kaldi's approach requires more setup and understanding of the underlying speech recognition pipeline, but offers greater flexibility and control over the process.
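As a slightly fuller sketch of the DeepSpeech side, the code below reads a 16 kHz, 16-bit mono WAV into the int16 buffer that model.stt() expects; the file names and the optional external scorer are illustrative assumptions rather than anything from the original comparison:
import wave

import numpy as np
import deepspeech

# Load the acoustic model and, optionally, an external scorer (paths are illustrative).
model = deepspeech.Model("model.pbmm")
model.enableExternalScorer("model.scorer")

# Read a 16 kHz, 16-bit mono WAV into an int16 buffer.
with wave.open("audio.wav", "rb") as wf:
    frames = wf.readframes(wf.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))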
TTS (Coqui): 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Pros of TTS
- More user-friendly and easier to set up, especially for beginners
- Focuses specifically on text-to-speech, offering a streamlined experience
- Provides pre-trained models and easy-to-use APIs for quick implementation
Cons of TTS
- Less comprehensive than Kaldi, which covers a broader range of speech technologies
- May have fewer options for fine-tuning and customization compared to Kaldi
- Smaller community and potentially less extensive documentation
Code Comparison
TTS example:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Kaldi, by contrast, is a speech recognition toolkit and does not ship a comparable one-line text-to-speech command; synthesis built on Kaldi typically relies on separate projects and custom pipelines. The TTS code above is straightforward and Python-based, reflecting Coqui TTS's focus on synthesis, whereas Kaldi's command-line, configuration-driven tooling targets recognition.
README
Kaldi Speech Recognition Toolkit
To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux; Darwin; and Cygwin (has not been tested on more "exotic" varieties of UNIX). For Windows installation instructions (excluding Cygwin), see windows/INSTALL.
To run the example system builds, see egs/README.txt
If you encounter problems (and you probably will), please do not hesitate to contact the developers (see below). In addition to specific questions, please let us know if there are specific aspects of the project that you feel could be improved, that you find confusing, etc., and which missing features you most wish it had.
Kaldi information channels
For HOT news about Kaldi see the project site.
Documentation of Kaldi:
- Info about the project, description of techniques, tutorial for C++ coding.
- Doxygen reference of the C++ code.
Kaldi forums and mailing lists:
We have two different lists:
- User list kaldi-help
- Developer list kaldi-developers
To sign up to either of these mailing lists, go to http://kaldi-asr.org/forums.html.
Development pattern for contributors
- Create a personal fork of the main Kaldi repository on GitHub.
- Make your changes in a named branch different from master, e.g. create a branch my-awesome-feature.
- Generate a pull request through the Web interface of GitHub.
- As a general rule, please follow the Google C++ Style Guide. There are a few exceptions in Kaldi. You can use Google's cpplint.py to verify that your code is free of basic mistakes.
Platform specific notes
PowerPC 64bits little-endian (ppc64le)
- Kaldi is expected to work out of the box in RHEL >= 7 and Ubuntu >= 16.04 with OpenBLAS, ATLAS, or CUDA.
- CUDA drivers for ppc64le can be found at https://developer.nvidia.com/cuda-downloads.
- An IBM Redbook is available as a guide to install and configure CUDA.
Android
- Kaldi supports cross compiling for Android using Android NDK, clang++ and OpenBLAS.
- See this blog post for details.
Web Assembly
- Kaldi supports cross compiling for Web Assembly for in-browser execution using emscripten and CLAPACK.
- See this post for a step-by-step description of the build process.