Top Related Projects
- ESPnet: End-to-End Speech Processing Toolkit
- fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
- SpeechBrain: A PyTorch-based Speech Toolkit
- Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers
- TTS (Coqui): 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Quick Overview
Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. It is designed to be flexible and modular, providing state-of-the-art speech recognition techniques and algorithms for researchers and industry professionals.
Pros
- Highly efficient and optimized for performance
- Extensive documentation and active community support
- Supports various speech recognition tasks and models
- Integrates well with other tools and libraries in the speech processing ecosystem
Cons
- Steep learning curve for beginners
- Complex setup and installation process
- Primarily command-line based, lacking a user-friendly GUI
- Requires significant computational resources for large-scale tasks
Code Examples
- Feature extraction using MFCC:
#include "feat/feature-mfcc.h"
using namespace kaldi;
int main() {
MfccOptions opts;
opts.num_ceps = 13;
Mfcc mfcc(opts);
Vector<BaseFloat> waveform(1000);
// Fill waveform with audio data
Matrix<BaseFloat> features;
mfcc.Compute(waveform, 1.0, &features);
}
- Decoding using a simple FST-based decoder:
#include "decoder/simple-decoder.h"
using namespace kaldi;
int main() {
FstReader<StdArc> fst_reader;
Fst<StdArc>* decode_fst = fst_reader.Read("graph.fst");
SimpleDecoder decoder(*decode_fst);
DecodableInterface* decodable = // Initialize your decodable object
decoder.Decode(decodable);
Lattice lat;
decoder.GetBestPath(&lat);
}
- Training a GMM-HMM model:
#include "gmm/am-diag-gmm.h"
#include "hmm/transition-model.h"
using namespace kaldi;
int main() {
AmDiagGmm am_gmm;
TransitionModel trans_model;
// Initialize am_gmm and trans_model
for (int iter = 0; iter < num_iters; iter++) {
// Accumulate statistics
AccumAmDiagGmm gmm_accs;
for (/* each utterance */) {
// Accumulate stats for current utterance
}
// Update model
BaseFloat objf_impr, count;
am_gmm.Update(gmm_accs, kGmmAll, &objf_impr, &count);
}
}
Getting Started
- Clone the Kaldi repository:
  git clone https://github.com/kaldi-asr/kaldi.git
  cd kaldi
- Install dependencies:
  ./tools/extras/check_dependencies.sh
- Compile Kaldi:
  cd tools
  make -j $(nproc)
  cd ../src
  ./configure --shared
  make depend -j $(nproc)
  make -j $(nproc)
- Run example scripts:
  cd egs/yesno/s5
  ./run.sh
Competitor Comparisons
ESPnet: End-to-End Speech Processing Toolkit
Pros of ESPnet
- More user-friendly and easier to get started with
- Supports end-to-end models like Transformer and Conformer
- Integrates well with popular deep learning frameworks (PyTorch)
Cons of ESPnet
- Less mature and stable compared to Kaldi
- Smaller community and fewer pre-trained models available
- May not be as suitable for large-scale production deployments
Code Comparison
ESPnet (Python-based):
import soundfile
import torch
from espnet2.bin.asr_inference import Speech2Text
speech2text = Speech2Text.from_pretrained("espnet/model")
speech, rate = soundfile.read("audio.wav")
nbest = speech2text(speech)
text, *_ = nbest[0]
Kaldi (C++ and shell script-based):
# Extract features
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
copy-feats ark:- ark,scp:mfcc/raw_mfcc.ark,mfcc/raw_mfcc.scp
# Decode
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
--acoustic-scale=0.083333 --allow-partial=true \
--word-symbol-table=exp/tri4b/graph/words.txt \
exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst \
"ark,s,cs:apply-cmvn --utt2spk=ark:data/test/utt2spk scp:data/test/cmvn.scp scp:mfcc/raw_mfcc.scp ark:- | add-deltas ark:- ark:- |" \
"ark:|gzip -c > exp/tri4b/decode_test/lat.1.gz"
fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
Pros of fairseq
- More versatile, supporting various NLP tasks beyond speech recognition
- Easier to use and integrate with PyTorch ecosystem
- More active development and frequent updates
Cons of fairseq
- Less specialized for speech recognition compared to Kaldi
- May require more computational resources for training
Code Comparison
Kaldi (C++):
LatticeFasterDecoder decoder(fst, config);
DecodableAmDiagGmmScaled gmm_decodable(am_gmm, trans_model, features, acoustic_scale);
decoder.Decode(&gmm_decodable);
fairseq (Python):
model = TransformerModel.build_model(args, task)
generator = task.build_generator([model], args)
translations = task.inference_step(generator, [model], sample)
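fairseq also exposes a hub-style interface for loading pretrained checkpoints, which is closer to the "easy integration with the PyTorch ecosystem" mentioned above. The following is a minimal sketch; the checkpoint and data-bin paths are placeholder assumptions, not part of the original comparison:
from fairseq.models.transformer import TransformerModel

# Load a pretrained model from a local checkpoint directory.
# The paths below are placeholders; substitute your own checkpoint/data-bin.
model = TransformerModel.from_pretrained(
    "checkpoints/",
    checkpoint_file="model.pt",
    data_name_or_path="data-bin/",
)
model.eval()

# The hub interface wraps tokenization, generation, and detokenization.
print(model.translate("Hello world!"))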
Summary
Fairseq is a more versatile and user-friendly toolkit for various NLP tasks, including speech recognition. It benefits from integration with PyTorch and active development. However, Kaldi remains a specialized and efficient tool specifically for speech recognition tasks. The choice between the two depends on the specific requirements of the project and the user's familiarity with the respective ecosystems.
SpeechBrain: A PyTorch-based Speech Toolkit
Pros of SpeechBrain
- More user-friendly and easier to learn, especially for those familiar with PyTorch
- Offers end-to-end solutions for various speech tasks, including pre-trained models
- Actively maintained with frequent updates and a growing community
Cons of SpeechBrain
- Less mature and battle-tested compared to Kaldi's long-standing reputation
- Fewer available recipes and pre-built models for specific tasks or languages
- May have lower performance in some scenarios due to its higher-level abstractions
Code Comparison
SpeechBrain example (PyTorch-based):
import speechbrain as sb

class SimpleCNN(sb.Brain):
    def compute_forward(self, batch, stage):
        wavs, lens = batch.sig
        feats = self.modules.compute_features(wavs)
        out = self.modules.cnn(feats)
        return out
Kaldi example (C++ based):
void ComputeFeatures(const VectorBase<BaseFloat> &wave,
                     Matrix<BaseFloat> *output_feats) {
  // Fbank is the whole-waveform filterbank extractor (feat/feature-fbank.h).
  Fbank fbank(fbank_opts_);
  fbank.Compute(wave, 1.0, output_feats);
}
The code snippets illustrate the difference in language and abstraction level between the two libraries, with SpeechBrain offering a more high-level, Python-based approach compared to Kaldi's lower-level C++ implementation.
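Beyond the training loop above, SpeechBrain also offers a pretrained-model API that makes the "user-friendly" point concrete. A minimal sketch follows; the model identifier and cache directory are illustrative assumptions based on one of SpeechBrain's published LibriSpeech recipes, not anything from the original comparison:
from speechbrain.pretrained import EncoderDecoderASR

# Download (or load from cache) a published LibriSpeech ASR model.
# The source/savedir values are illustrative assumptions.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

# Transcribe a local audio file (expects a path to a readable audio file).
print(asr_model.transcribe_file("audio.wav"))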
Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of Vosk
- Easier to use and integrate, with a simpler API and pre-built models
- Supports multiple programming languages (Python, Java, Node.js, etc.)
- Designed for real-time speech recognition on mobile and embedded devices
Cons of Vosk
- Less flexible and customizable compared to Kaldi's extensive toolkit
- Smaller community and fewer resources for advanced users
- Limited to pre-trained models, which may not cover all use cases
Code Comparison
Vosk (Python):
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
# Audio processing and recognition code
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
. ./cmd.sh
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
# Additional preprocessing and training steps
Vosk provides a higher-level API for quick integration, while Kaldi offers more granular control over the speech recognition pipeline. Vosk is better suited for developers seeking rapid deployment, whereas Kaldi is ideal for researchers and those requiring extensive customization.
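To make the "higher-level API" point concrete, here is a minimal sketch of file-based recognition with Vosk. It assumes a 16 kHz, 16-bit mono WAV and a model directory downloaded from the Vosk site; the file and directory names are placeholders:
import json
import wave

from vosk import Model, KaldiRecognizer

# Open a 16 kHz, 16-bit mono WAV file (path is illustrative).
wf = wave.open("audio.wav", "rb")

model = Model("model")  # directory containing a downloaded Vosk model
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# The final result is returned as a JSON string.
print(json.loads(rec.FinalResult())["text"])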
DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers
Pros of DeepSpeech
- Easier to use and deploy, with a simpler architecture
- Better support for streaming audio and real-time transcription
- More accessible for developers without extensive speech recognition expertise
Cons of DeepSpeech
- Less flexible and customizable compared to Kaldi's extensive toolkit
- May not perform as well on specialized or domain-specific tasks
- Smaller community and fewer pre-trained models available
Code Comparison
DeepSpeech example (Python):
import deepspeech

model = deepspeech.Model('model.pbmm')
# `audio` must be a 16 kHz, 16-bit mono NumPy int16 array
text = model.stt(audio)
Kaldi example (Bash/Shell):
. ./path.sh
. ./cmd.sh
steps/nnet3/decode.sh --nj 4 --cmd "$decode_cmd" \
exp/nnet3/tdnn_sp/graph_tgsmall data/test_dev93 \
exp/nnet3/tdnn_sp/decode_test_dev93_tgsmall
DeepSpeech uses a simpler API, making it more accessible for quick implementation. Kaldi's approach requires more setup and understanding of the underlying speech recognition pipeline, but offers greater flexibility and control over the process.
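As a slightly fuller sketch of the DeepSpeech side, the code below reads a 16 kHz, 16-bit mono WAV into the int16 buffer that model.stt() expects; the file names and the optional external scorer are illustrative assumptions rather than anything from the original comparison:
import wave

import numpy as np
import deepspeech

# Load the acoustic model and, optionally, an external scorer (paths are illustrative).
model = deepspeech.Model("model.pbmm")
model.enableExternalScorer("model.scorer")

# Read a 16 kHz, 16-bit mono WAV into an int16 buffer.
with wave.open("audio.wav", "rb") as wf:
    frames = wf.readframes(wf.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))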
TTS (Coqui): 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Pros of TTS
- More user-friendly and easier to set up, especially for beginners
- Focuses specifically on text-to-speech, offering a streamlined experience
- Provides pre-trained models and easy-to-use APIs for quick implementation
Cons of TTS
- Less comprehensive than Kaldi, which covers a broader range of speech technologies
- May have fewer options for fine-tuning and customization compared to Kaldi
- Smaller community and potentially less extensive documentation
Code Comparison
TTS example:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Kaldi, by contrast, is a speech recognition toolkit and does not ship a comparable one-line text-to-speech command; synthesis built on Kaldi typically relies on separate projects and custom pipelines. The TTS code above is straightforward and Python-based, reflecting Coqui TTS's focus on synthesis, whereas Kaldi's command-line, configuration-driven tooling targets recognition.
README
Kaldi Speech Recognition Toolkit
To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux; Darwin; and Cygwin (has not been tested on more "exotic" varieties of UNIX). For Windows installation instructions (excluding Cygwin), see windows/INSTALL.
To run the example system builds, see egs/README.txt
If you encounter problems (and you probably will), please do not hesitate to contact the developers (see below). In addition to specific questions, please let us know if there are specific aspects of the project that you feel could be improved, that you find confusing, etc., and which missing features you most wish it had.
Kaldi information channels
For HOT news about Kaldi see the project site.
Documentation of Kaldi:
- Info about the project, description of techniques, tutorial for C++ coding.
- Doxygen reference of the C++ code.
Kaldi forums and mailing lists:
We have two different lists:
- User list kaldi-help
- Developer list kaldi-developers
To sign up to either of these mailing lists, go to http://kaldi-asr.org/forums.html.
Development pattern for contributors
- Create a personal fork of the main Kaldi repository on GitHub.
- Make your changes in a named branch different from master, e.g. create a branch my-awesome-feature.
- Generate a pull request through the Web interface of GitHub.
- As a general rule, please follow the Google C++ Style Guide. There are a few exceptions in Kaldi. You can use Google's cpplint.py to verify that your code is free of basic mistakes.
Platform specific notes
PowerPC 64bits little-endian (ppc64le)
- Kaldi is expected to work out of the box in RHEL >= 7 and Ubuntu >= 16.04 with OpenBLAS, ATLAS, or CUDA.
- CUDA drivers for ppc64le can be found at https://developer.nvidia.com/cuda-downloads.
- An IBM Redbook is available as a guide to install and configure CUDA.
Android
- Kaldi supports cross compiling for Android using Android NDK, clang++ and OpenBLAS.
- See this blog post for details.
Web Assembly
- Kaldi supports cross compiling for Web Assembly for in-browser execution using emscripten and CLAPACK.
- See this post for a step-by-step description of the build process.