mozilla/DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.


Top Related Projects

  • kaldi-asr/kaldi: The official location of the Kaldi project.
  • alphacep/vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node.
  • coqui-ai/STT: 🐸STT, the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
  • openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision.
  • espnet/espnet: End-to-End Speech Processing Toolkit.
  • facebookresearch/fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Quick Overview

DeepSpeech is an open-source Speech-To-Text engine developed by Mozilla. It uses a model trained by machine learning techniques based on Baidu's Deep Speech research paper. The project aims to make speech recognition technology more accessible and accurate for a wide range of applications.

Pros

  • Open-source and free to use
  • Supports multiple platforms (Windows, macOS, Linux, Android)
  • Provides pre-trained models for immediate use
  • Runs fully offline and on-device, with no cloud dependency

Cons

  • Requires significant computational resources for training custom models
  • May have lower accuracy compared to some commercial speech recognition solutions
  • Limited language support compared to some other speech recognition systems
  • Steep learning curve for customization and fine-tuning

Code Examples

  1. Basic usage with a pre-trained model:
import deepspeech
import numpy as np

# Load pre-trained model
model = deepspeech.Model('path/to/model.pbmm')

# Perform inference on raw audio; audio_data holds 16-bit, 16 kHz mono PCM bytes
audio = np.frombuffer(audio_data, np.int16)
text = model.stt(audio)
print(text)
  2. Streaming inference (a partial-results variant is sketched after this list):
import deepspeech
import numpy as np

model = deepspeech.Model('path/to/model.pbmm')
stream = model.createStream()

# Process audio in chunks
while True:
    chunk = get_next_audio_chunk()  # placeholder: returns 16-bit, 16 kHz mono PCM bytes; empty when done
    if len(chunk) == 0:
        break
    stream.feedAudioContent(np.frombuffer(chunk, np.int16))

text = stream.finishStream()
print(text)
  3. Using a language model (external scorer) for improved accuracy:
import deepspeech

model = deepspeech.Model('path/to/model.pbmm')
model.enableExternalScorer('path/to/scorer.scorer')

# Adjust language model weight and word insertion bonus
model.setScorerAlphaBeta(alpha=0.75, beta=1.85)

text = model.stt(audio)  # audio: 16-bit, 16 kHz mono PCM as a NumPy int16 array (see example 1)
print(text)
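
As referenced in example 2, the streaming API can also report partial hypotheses while audio is still arriving, via the stream's intermediateDecode() method. A brief sketch follows; get_next_audio_chunk() is again a placeholder for whatever supplies 16-bit, 16 kHz mono PCM bytes:

import deepspeech
import numpy as np

model = deepspeech.Model('path/to/model.pbmm')
stream = model.createStream()

while True:
    chunk = get_next_audio_chunk()  # placeholder audio source; empty bytes when done
    if len(chunk) == 0:
        break
    stream.feedAudioContent(np.frombuffer(chunk, np.int16))
    # Partial transcript of everything fed so far, without ending the stream
    print(stream.intermediateDecode())

text = stream.finishStream()
print(text)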

Getting Started

  1. Install DeepSpeech:

    pip install deepspeech
    
  2. Download a pre-trained model:

    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
    
  3. Basic usage (a sample-rate check follows these steps):

    import deepspeech
    import numpy as np
    import wave
    
    # Load model
    model = deepspeech.Model('deepspeech-0.9.3-models.pbmm')
    model.enableExternalScorer('deepspeech-0.9.3-models.scorer')
    
    # Load audio file
    with wave.open('audio.wav', 'rb') as w:
        rate = w.getframerate()
        frames = w.getnframes()
        buffer = w.readframes(frames)
    
    # Perform inference
    data = np.frombuffer(buffer, dtype=np.int16)
    text = model.stt(data)
    print(text)
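
The pre-trained English model expects 16 kHz, 16-bit mono audio. A minimal sketch of a sample-rate check is shown below, assuming SciPy is available for resampling (the resampling step is illustrative, not part of DeepSpeech itself):

    import wave
    import numpy as np
    from scipy.signal import resample_poly  # assumption: SciPy installed
    import deepspeech

    model = deepspeech.Model('deepspeech-0.9.3-models.pbmm')
    desired_rate = model.sampleRate()  # 16000 for the released English models

    with wave.open('audio.wav', 'rb') as w:
        rate = w.getframerate()
        data = np.frombuffer(w.readframes(w.getnframes()), np.int16)

    if rate != desired_rate:
        # Resample to the rate the model was trained on
        data = resample_poly(data, desired_rate, rate).astype(np.int16)

    print(model.stt(data))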
    

Competitor Comparisons

kaldi-asr/kaldi is the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive and flexible toolkit for speech recognition
  • Supports a wider range of acoustic models and language models
  • Better suited for research and customization of ASR systems

Cons of Kaldi

  • Steeper learning curve and more complex setup
  • Requires more expertise in speech recognition algorithms
  • Less user-friendly for beginners or quick deployments

Code Comparison

DeepSpeech (Python):

import deepspeech
model = deepspeech.Model('model.pbmm')
text = model.stt(audio)  # audio: 16-bit, 16 kHz mono PCM as a NumPy int16 array

Kaldi (Shell script):

./path.sh
./cmd.sh
steps/nnet3/decode.sh --nj 1 --cmd "$decode_cmd" \
  exp/nnet3/tdnn_sp/graph_tgsmall data/test \
  exp/nnet3/tdnn_sp/decode_test_tgsmall

DeepSpeech offers a simpler API for quick integration, while Kaldi provides more granular control over the ASR pipeline. Kaldi's approach allows for fine-tuning of various components but requires more in-depth knowledge of the underlying algorithms and data preparation steps. DeepSpeech, on the other hand, abstracts much of this complexity, making it more accessible for rapid prototyping and deployment in applications where customization is less critical.

Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Pros of Vosk

  • Lightweight and efficient, suitable for mobile and embedded devices
  • Supports multiple languages out-of-the-box
  • Easier to integrate with existing applications due to simple API

Cons of Vosk

  • Less accurate for complex speech recognition tasks
  • Smaller community and fewer resources compared to DeepSpeech
  • Limited customization options for advanced users

Code Comparison

DeepSpeech:

import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
text = model.stt(audio)

Vosk:

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")
with wave.open("audio.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())
    while data := wf.readframes(4000):
        rec.AcceptWaveform(data)
text = json.loads(rec.FinalResult())["text"]

Both repositories provide speech recognition capabilities, but they cater to different use cases. DeepSpeech offers more advanced features and higher accuracy for complex tasks, while Vosk focuses on simplicity and efficiency, making it more suitable for lightweight applications and mobile devices. DeepSpeech has a larger community and more extensive documentation, but Vosk provides easier integration and out-of-the-box support for multiple languages. The code comparison shows that both expose compact Python APIs: DeepSpeech transcribes a complete audio buffer in one call, while Vosk feeds audio to a Kaldi-based recognizer in chunks, which also makes it a natural fit for streaming input.

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

Pros of STT

  • More active development and frequent updates
  • Improved performance and accuracy in speech recognition
  • Better support for multiple languages and accents

Cons of STT

  • Steeper learning curve for beginners
  • Requires more computational resources for training and inference

Code Comparison

DeepSpeech:

import deepspeech

model = deepspeech.Model('path/to/model.pbmm')
result = model.stt(audio_buffer)

STT:

from stt import Model

model = Model('path/to/model.tflite')
result = model.stt(audio_buffer)

Both repositories provide similar high-level APIs for speech recognition, but STT offers more flexibility in model configuration and fine-tuning. STT also supports TensorFlow Lite models, which can be beneficial for deployment on edge devices or mobile platforms.

STT builds upon the foundation laid by DeepSpeech, offering improvements in accuracy, performance, and language support. However, it may require more expertise to fully utilize its advanced features. DeepSpeech remains a solid choice for simpler use cases or when working with limited computational resources.
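
Because STT started as a fork of DeepSpeech, its inference API is nearly identical, including external-scorer support. A brief sketch, assuming the Coqui stt Python package and placeholder model/scorer file names:

import wave
import numpy as np
from stt import Model

model = Model('model.tflite')               # placeholder model path
model.enableExternalScorer('kenlm.scorer')  # placeholder scorer path

with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

print(model.stt(audio))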

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

  • Supports multiple languages and can perform translation
  • Generally higher accuracy, especially for noisy or accented speech
  • Easier to use out-of-the-box with pre-trained models

Cons of Whisper

  • Larger model size, requiring more computational resources
  • Slower inference time compared to DeepSpeech
  • Less flexibility for fine-tuning on specific domains or accents

Code Comparison

DeepSpeech:

import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
text = model.stt(audio)

Whisper:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")
text = result["text"]

Both libraries offer straightforward APIs for transcription, but Whisper provides additional features like language detection and translation with minimal extra code. DeepSpeech focuses on efficient, lightweight models for specific use cases, while Whisper aims for broader language support and higher accuracy at the cost of increased resource requirements.
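
For example, Whisper's Python API exposes translation and language detection directly. A brief sketch, assuming the openai-whisper package and its "base" model:

import whisper

model = whisper.load_model("base")

# Transcribe and translate non-English speech into English text in one call
result = model.transcribe("audio.wav", task="translate")
print(result["text"])

# Detect the spoken language from the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))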

ESPnet: End-to-End Speech Processing Toolkit

Pros of ESPnet

  • More comprehensive toolkit supporting various speech processing tasks (ASR, TTS, speech enhancement, etc.)
  • Actively maintained with frequent updates and a larger contributor base
  • Flexible architecture allowing for easy integration of new models and techniques

Cons of ESPnet

  • Steeper learning curve due to its extensive feature set
  • Potentially higher computational requirements for some models
  • Less focus on production deployment compared to DeepSpeech

Code Comparison

ESPnet example (ASR training):

# Schematic sketch: espnet2 training runs are normally driven by the project's
# shell recipes and command-line arguments rather than a Python dict.
from espnet2.bin.asr_train import main

args = {
    "output_dir": "exp/asr_train_asr_transformer_raw_bpe",
    "max_epoch": 100,
    "train_data_path_and_name_and_type": ["dump/raw/train/text", "text", "text"],
    "valid_data_path_and_name_and_type": ["dump/raw/valid/text", "text", "text"],
}
main(args)

DeepSpeech example (model training):

# Schematic sketch: DeepSpeech training is normally launched via DeepSpeech.py
# with command-line flags; the call below condenses the equivalent options.
from deepspeech_training.train import train

train(train_files, dev_files, test_files, n_hidden=2048, epochs=100, learning_rate=0.0001, dropout=[0.05, 0.05, 0.05])

Both repositories provide powerful speech recognition capabilities, but ESPnet offers a broader range of speech processing tasks and more flexibility in model architecture. DeepSpeech, on the other hand, focuses primarily on ASR and may be easier to get started with for beginners.
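
For inference rather than training, ESPnet2 also provides a high-level Python API. A minimal sketch, assuming a recent espnet2/espnet_model_zoo installation and an illustrative (placeholder) model tag from the ESPnet model zoo:

import soundfile
from espnet2.bin.asr_inference import Speech2Text

# The model tag below is a placeholder; substitute any ASR model from the zoo
speech2text = Speech2Text.from_pretrained("espnet/some_asr_model_tag")

speech, rate = soundfile.read("audio.wav")
nbests = speech2text(speech)  # n-best list of (text, tokens, ...) hypotheses
text, *_ = nbests[0]
print(text)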

fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • Supports a wider range of NLP tasks beyond speech recognition
  • More actively maintained with frequent updates and contributions
  • Offers more flexibility and customization options for model architectures

Cons of fairseq

  • Steeper learning curve due to its broader scope and complexity
  • Requires more computational resources for training and inference
  • Less focused on speech recognition specifically compared to DeepSpeech

Code Comparison

DeepSpeech:

from deepspeech import Model

model = Model('path/to/model.pbmm')
text = model.stt(audio)

fairseq:

# Schematic sketch: actual fairseq wav2vec inference goes through its checkpoint
# and task utilities plus a CTC decoder rather than a one-line transcribe API.
from fairseq.models.wav2vec import Wav2VecCtc

model = Wav2VecCtc.from_pretrained('path/to/model')
tokens = model.transcribe(audio)
text = model.decode(tokens)

Both repositories provide tools for speech recognition, but fairseq offers a more comprehensive toolkit for various NLP tasks. DeepSpeech is more specialized and easier to use for speech recognition, while fairseq provides greater flexibility at the cost of increased complexity. The code examples demonstrate that DeepSpeech has a simpler API for transcription, whereas fairseq requires additional steps but allows for more fine-grained control over the process.

README

Project DeepSpeech

.. image:: https://readthedocs.org/projects/deepspeech/badge/?version=latest
   :target: https://deepspeech.readthedocs.io/?badge=latest
   :alt: Documentation

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/macOS-amd64.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/macOS-amd64.yml
   :alt: macOS builds

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/lint.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/lint.yml
   :alt: Linters

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/docker.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/docker.yml
   :alt: Docker Images

DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on `Baidu's Deep Speech research paper <https://arxiv.org/abs/1412.5567>`_. Project DeepSpeech uses `Google's TensorFlow <https://www.tensorflow.org/>`_ to make the implementation easier.

Documentation for installation, usage, and training models is available on `deepspeech.readthedocs.io <https://deepspeech.readthedocs.io/?badge=latest>`_.

For the latest release, including pre-trained models and checkpoints, see the `latest release on GitHub <https://github.com/mozilla/DeepSpeech/releases/latest>`_.

For contribution guidelines, see `CONTRIBUTING.rst <CONTRIBUTING.rst>`_.

For contact and support information, see `SUPPORT.rst <SUPPORT.rst>`_.
