kaldi-asr/kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.


Top Related Projects

  • ESPnet: End-to-End Speech Processing Toolkit
  • fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python
  • SpeechBrain: A PyTorch-based Speech Toolkit
  • Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
  • DeepSpeech: an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers
  • TTS (Coqui): a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Quick Overview

Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. It is designed to be flexible and modular, providing state-of-the-art speech recognition techniques and algorithms for researchers and industry professionals.
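
Kaldi's modular design is visible at the command line: most operations are small single-purpose binaries that read and write Kaldi tables (ark/scp files) and compose through Unix pipes. A minimal sketch of a feature pipeline, assuming a standard data directory with a wav.scp index (the paths are illustrative):

compute-mfcc-feats scp:data/train/wav.scp ark:- | \
  add-deltas ark:- ark:data/train/feats_deltas.ark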

Pros

  • Highly efficient and optimized for performance
  • Extensive documentation and active community support
  • Supports various speech recognition tasks and models
  • Integrates well with other tools and libraries in the speech processing ecosystem

Cons

  • Steep learning curve for beginners
  • Complex setup and installation process
  • Primarily command-line based, lacking a user-friendly GUI
  • Requires significant computational resources for large-scale tasks

Code Examples

  1. Feature extraction using MFCC:
#include "feat/feature-mfcc.h"

using namespace kaldi;

int main() {
    MfccOptions opts;
    opts.num_ceps = 13;
    Mfcc mfcc(opts);

    Vector<BaseFloat> waveform(1000);
    // Fill waveform with audio data
    Matrix<BaseFloat> features;
    mfcc.Compute(waveform, 1.0, &features);
}
  2. Decoding using a simple FST-based decoder:
#include "decoder/simple-decoder.h"
#include "fstext/kaldi-fst-io.h"

using namespace kaldi;

int main() {
    // Read a pre-compiled decoding graph (HCLG).
    fst::VectorFst<fst::StdArc> *decode_fst = fst::ReadFstKaldi("graph.fst");

    BaseFloat beam = 16.0;
    SimpleDecoder decoder(*decode_fst, beam);

    // A real decodable object wraps an acoustic model and features,
    // e.g. DecodableAmDiagGmmScaled; this is just a placeholder.
    DecodableInterface *decodable = nullptr;
    decoder.Decode(decodable);

    Lattice lat;
    decoder.GetBestPath(&lat);
    delete decode_fst;
    return 0;
}
  3. Training a GMM-HMM model:
#include "gmm/am-diag-gmm.h"
#include "gmm/mle-am-diag-gmm.h"
#include "hmm/transition-model.h"

using namespace kaldi;

int main() {
    AmDiagGmm am_gmm;
    TransitionModel trans_model;
    // Initialize am_gmm and trans_model (e.g. read them from disk).

    MleDiagGmmOptions gmm_opts;
    int32 num_iters = 4;
    for (int32 iter = 0; iter < num_iters; iter++) {
        // Accumulate statistics over the training data.
        AccumAmDiagGmm gmm_accs;
        gmm_accs.Init(am_gmm, kGmmAll);
        // For each utterance: accumulate stats into gmm_accs using
        // the current alignments and features.

        // Re-estimate the model from the accumulated statistics.
        BaseFloat objf_impr, count;
        MleAmDiagGmmUpdate(gmm_opts, gmm_accs, kGmmAll,
                           &am_gmm, &objf_impr, &count);
    }
    return 0;
}

Getting Started

  1. Clone the Kaldi repository:

    git clone https://github.com/kaldi-asr/kaldi.git
    cd kaldi
    
  2. Check for missing dependencies (the script reports what you still need to install):

    ./tools/extras/check_dependencies.sh
    
  3. Compile Kaldi:

    cd tools
    make -j $(nproc)
    cd ../src
    ./configure --shared
    make depend -j $(nproc)
    make -j $(nproc)
    
  4. Run example scripts:

    cd egs/yesno/s5
    ./run.sh
    

Competitor Comparisons

ESPnet

End-to-End Speech Processing Toolkit

Pros of ESPnet

  • More user-friendly and easier to get started with
  • Supports end-to-end models like Transformer and Conformer
  • Integrates well with popular deep learning frameworks (PyTorch)

Cons of ESPnet

  • Less mature and stable compared to Kaldi
  • Smaller community and fewer pre-trained models available
  • May not be as suitable for large-scale production deployments

Code Comparison

ESPnet (Python-based):

import soundfile
import torch
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("espnet/model")
speech, rate = soundfile.read("audio.wav")
nbest = speech2text(speech)
text, *_ = nbest[0]

Kaldi (C++ and shell script-based):

# Extract features
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
copy-feats ark:- ark,scp:mfcc/raw_mfcc.ark,mfcc/raw_mfcc.scp

# Decode
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
  --acoustic-scale=0.083333 --allow-partial=true \
  --word-symbol-table=exp/tri4b/graph/words.txt \
  exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst \
  "ark,s,cs:apply-cmvn --utt2spk=ark:data/test/utt2spk scp:data/test/cmvn.scp scp:mfcc/raw_mfcc.scp ark:- | add-deltas ark:- ark:- |" \
  "ark:|gzip -c > exp/tri4b/decode_test/lat.1.gz"

fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More versatile, supporting various NLP tasks beyond speech recognition
  • Easier to use and integrate with PyTorch ecosystem
  • More active development and frequent updates

Cons of fairseq

  • Less specialized for speech recognition compared to Kaldi
  • May require more computational resources for training

Code Comparison

Kaldi (C++):

LatticeFasterDecoder decoder(fst, config);
DecodableAmDiagGmmScaled gmm_decodable(am_gmm, trans_model, features, acoustic_scale);
decoder.Decode(&gmm_decodable);

fairseq (Python):

model = TransformerModel.build_model(args, task)
generator = task.build_generator([model], args)
translations = task.inference_step(generator, [model], sample)

Summary

Fairseq is a more versatile and user-friendly toolkit for various NLP tasks, including speech recognition. It benefits from integration with PyTorch and active development. However, Kaldi remains a specialized and efficient tool specifically for speech recognition tasks. The choice between the two depends on the specific requirements of the project and the user's familiarity with the respective ecosystems.

SpeechBrain

A PyTorch-based Speech Toolkit

Pros of SpeechBrain

  • More user-friendly and easier to learn, especially for those familiar with PyTorch
  • Offers end-to-end solutions for various speech tasks, including pre-trained models
  • Actively maintained with frequent updates and a growing community

Cons of SpeechBrain

  • Less mature and battle-tested compared to Kaldi's long-standing reputation
  • Fewer available recipes and pre-built models for specific tasks or languages
  • May have lower performance in some scenarios due to its higher-level abstractions

Code Comparison

SpeechBrain example (PyTorch-based):

import speechbrain as sb

class SimpleCNN(sb.Brain):
    def compute_forward(self, batch, stage):
        wavs, lens = batch.sig
        feats = self.modules.compute_features(wavs)
        out = self.modules.cnn(feats)
        return out

Kaldi example (C++ based):

#include "feat/feature-fbank.h"

using namespace kaldi;

void ComputeFeatures(const VectorBase<BaseFloat> &wave,
                     Matrix<BaseFloat> *output_feats) {
  FbankOptions fbank_opts;
  Fbank fbank(fbank_opts);  // whole-waveform filterbank extractor
  fbank.Compute(wave, 1.0 /* vtln_warp */, output_feats);
}

The code snippets illustrate the difference in language and abstraction level between the two libraries, with SpeechBrain offering a more high-level, Python-based approach compared to Kaldi's lower-level C++ implementation.

Vosk

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Pros of Vosk

  • Easier to use and integrate, with a simpler API and pre-built models
  • Supports multiple programming languages (Python, Java, Node.js, etc.)
  • Designed for real-time speech recognition on mobile and embedded devices

Cons of Vosk

  • Less flexible and customizable compared to Kaldi's extensive toolkit
  • Smaller community and fewer resources for advanced users
  • Limited to pre-trained models, which may not cover all use cases

Code Comparison

Vosk (Python):

from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

# Stream 16 kHz mono audio from the microphone and print results.
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())

Kaldi (Shell script):

#!/bin/bash
. ./path.sh
. ./cmd.sh

steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir

# Additional preprocessing and training steps

Vosk provides a higher-level API for quick integration, while Kaldi offers more granular control over the speech recognition pipeline. Vosk is better suited for developers seeking rapid deployment, whereas Kaldi is ideal for researchers and those requiring extensive customization.

DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Pros of DeepSpeech

  • Easier to use and deploy, with a simpler architecture
  • Better support for streaming audio and real-time transcription
  • More accessible for developers without extensive speech recognition expertise

Cons of DeepSpeech

  • Less flexible and customizable compared to Kaldi's extensive toolkit
  • May not perform as well on specialized or domain-specific tasks
  • Smaller community and fewer pre-trained models available

Code Comparison

DeepSpeech example (Python):

import wave, numpy as np, deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as w:  # 16 kHz, 16-bit mono wav
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
text = model.stt(audio)

Kaldi example (Bash/Shell):

. ./path.sh
. ./cmd.sh
steps/nnet3/decode.sh --nj 4 --cmd "$decode_cmd" \
  exp/nnet3/tdnn_sp/graph_tgsmall data/test_dev93 \
  exp/nnet3/tdnn_sp/decode_test_dev93_tgsmall

DeepSpeech uses a simpler API, making it more accessible for quick implementation. Kaldi's approach requires more setup and understanding of the underlying speech recognition pipeline, but offers greater flexibility and control over the process.

TTS (Coqui)

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Pros of TTS

  • More user-friendly and easier to set up, especially for beginners
  • Focuses specifically on text-to-speech, offering a streamlined experience
  • Provides pre-trained models and easy-to-use APIs for quick implementation

Cons of TTS

  • Less comprehensive than Kaldi, which covers a broader range of speech technologies
  • May have fewer options for fine-tuning and customization compared to Kaldi
  • Smaller community and potentially less extensive documentation

Code Example

TTS example:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Kaldi has no equivalent snippet: it is a speech recognition toolkit and does not ship a text-to-speech pipeline. The two projects address opposite directions of the speech problem, with TTS synthesizing audio from text and Kaldi transcribing audio to text.

README

Kaldi Speech Recognition Toolkit

To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux; Darwin; and Cygwin (has not been tested on more "exotic" varieties of UNIX). For Windows installation instructions (excluding Cygwin), see windows/INSTALL.

To run the example system builds, see egs/README.txt

If you encounter problems (and you probably will), please do not hesitate to contact the developers (see below). In addition to specific questions, please let us know if there are specific aspects of the project that you feel could be improved, that you find confusing, etc., and which missing features you most wish it had.

Kaldi information channels

For HOT news about Kaldi see the project site.

Documentation of Kaldi:

  • Info about the project, description of techniques, tutorial for C++ coding.
  • Doxygen reference of the C++ code.

Kaldi forums and mailing lists:

We have two different lists:

  • User list kaldi-help
  • Developer list kaldi-developers

To sign up to either of these mailing lists, go to http://kaldi-asr.org/forums.html.

Development pattern for contributors

  1. Create a personal fork of the main Kaldi repository on GitHub.
  2. Make your changes in a branch other than master, e.g. a branch named my-awesome-feature.
  3. Open a pull request through the GitHub web interface.
  4. As a general rule, please follow the Google C++ Style Guide (Kaldi makes a few exceptions). You can use Google's cpplint.py to verify that your code is free of basic mistakes; see the sketch after this list.
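
A minimal sketch of that workflow, assuming your fork lives at https://github.com/<you>/kaldi and that cpplint.py has been downloaded into the working directory (both are illustrative placeholders):

git clone https://github.com/<you>/kaldi.git
cd kaldi
git checkout -b my-awesome-feature
# ... edit code ...
python cpplint.py src/feat/feature-mfcc.cc   # style-check the files you changed
git commit -am "Add my awesome feature"
git push origin my-awesome-feature
# then open a pull request for my-awesome-feature on GitHub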

Platform specific notes

PowerPC 64bits little-endian (ppc64le)

Android

  • Kaldi supports cross compiling for Android using Android NDK, clang++ and OpenBLAS.
  • See this blog post for details.

Web Assembly

  • Kaldi supports cross compiling for Web Assembly for in-browser execution using emscripten and CLAPACK.
  • See this post for a step-by-step description of the build process.