mozilla/DeepSpeech

DeepSpeech is an open-source, embedded (offline, on-device) speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.

Top Related Projects

  • kaldi-asr/kaldi (14,200 stars): the official location of the Kaldi project.
  • alphacephei/vosk-api: offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node.
  • coqui-ai/STT (2,251 stars): 🐸STT, the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
  • openai/whisper (69,530 stars): robust speech recognition via large-scale weak supervision.
  • espnet/espnet (8,390 stars): end-to-end speech processing toolkit.
  • facebookresearch/fairseq (30,331 stars): Facebook AI Research sequence-to-sequence toolkit written in Python.

Quick Overview

DeepSpeech is an open-source Speech-To-Text engine developed by Mozilla. It uses a model trained by machine learning techniques based on Baidu's Deep Speech research paper. The project aims to make speech recognition technology more accessible and accurate for a wide range of applications.

Pros

  • Open-source and free to use
  • Supports multiple platforms (Windows, macOS, Linux, Android)
  • Provides pre-trained models for immediate use
  • Large community with extensive documentation and examples

Cons

  • Requires significant computational resources for training custom models
  • May have lower accuracy compared to some commercial speech recognition solutions
  • Limited language support compared to some other speech recognition systems
  • Steep learning curve for customization and fine-tuning

Code Examples

  1. Basic usage with a pre-trained model:
import deepspeech
import numpy as np

# Load pre-trained model
model = deepspeech.Model('path/to/model.pbmm')

# Perform inference on raw audio samples
# (audio_data: 16-bit, 16 kHz mono PCM bytes, e.g. read from a WAV file)
audio = np.frombuffer(audio_data, np.int16)
text = model.stt(audio)
print(text)
  2. Streaming inference:
import deepspeech
import numpy as np

model = deepspeech.Model('path/to/model.pbmm')
stream = model.createStream()

# Process audio in chunks; get_next_audio_chunk() is a placeholder
# for your audio source (microphone, file reader, network socket, ...)
while True:
    chunk = get_next_audio_chunk()
    if len(chunk) == 0:
        break
    stream.feedAudioContent(np.frombuffer(chunk, np.int16))

text = stream.finishStream()
print(text)
  3. Using a language model for improved accuracy:
import deepspeech

model = deepspeech.Model('path/to/model.pbmm')
model.enableExternalScorer('path/to/scorer.scorer')

# Adjust language model weight and word insertion bonus
model.setScorerAlphaBeta(alpha=0.75, beta=1.85)

text = model.stt(audio)  # audio: 16-bit PCM NumPy array, as in example 1
print(text)
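  4. Transcription with character-level timing metadata (a hedged sketch: sttWithMetadata() and the transcripts/tokens attributes follow the 0.9.x Python API, and audio is the same 16-bit PCM NumPy array as in example 1):
import deepspeech

model = deepspeech.Model('path/to/model.pbmm')

# Ask for the single best transcript together with per-character timing
metadata = model.sttWithMetadata(audio, 1)
best = metadata.transcripts[0]
print(''.join(token.text for token in best.tokens))                # transcript text
print([(token.text, token.start_time) for token in best.tokens])   # (char, seconds)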

Getting Started

  1. Install DeepSpeech:

    pip install deepspeech
    
  2. Download a pre-trained model:

    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
    
  3. Basic usage:

    import deepspeech
    import numpy as np
    import wave
    
    # Load model
    model = deepspeech.Model('deepspeech-0.9.3-models.pbmm')
    model.enableExternalScorer('deepspeech-0.9.3-models.scorer')
    
    # Load audio file
    with wave.open('audio.wav', 'rb') as w:
        rate = w.getframerate()
        frames = w.getnframes()
        buffer = w.readframes(frames)
    
    # Perform inference
    data = np.frombuffer(buffer, dtype=np.int16)
    text = model.stt(data)
    print(text)
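
  4. Optional: check the audio sample rate. The release models expect 16 kHz, 16-bit mono audio. This is a minimal sketch that relies on the Python API's Model.sampleRate() helper; resampling itself is left to an external tool such as SoX or ffmpeg.

    import wave
    import deepspeech

    model = deepspeech.Model('deepspeech-0.9.3-models.pbmm')

    # Compare the file's sample rate with the rate the model expects
    with wave.open('audio.wav', 'rb') as w:
        if w.getframerate() != model.sampleRate():
            raise ValueError(
                f'Model expects {model.sampleRate()} Hz audio, '
                f'got {w.getframerate()} Hz'
            )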
    

Competitor Comparisons

kaldi-asr/kaldi (14,200 stars): the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive and flexible toolkit for speech recognition
  • Supports a wider range of acoustic models and language models
  • Better suited for research and customization of ASR systems

Cons of Kaldi

  • Steeper learning curve and more complex setup
  • Requires more expertise in speech recognition algorithms
  • Less user-friendly for beginners or quick deployments

Code Comparison

DeepSpeech (Python):

import deepspeech
model = deepspeech.Model('model.pbmm')
text = model.stt(audio)

Kaldi (Shell script):

. ./path.sh   # put Kaldi binaries on PATH
. ./cmd.sh    # define $decode_cmd and related cluster settings
steps/nnet3/decode.sh --nj 1 --cmd "$decode_cmd" \
  exp/nnet3/tdnn_sp/graph_tgsmall data/test \
  exp/nnet3/tdnn_sp/decode_test_tgsmall

DeepSpeech offers a simpler API for quick integration, while Kaldi provides more granular control over the ASR pipeline. Kaldi's approach allows for fine-tuning of various components but requires more in-depth knowledge of the underlying algorithms and data preparation steps. DeepSpeech, on the other hand, abstracts much of this complexity, making it more accessible for rapid prototyping and deployment in applications where customization is less critical.
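
To make the data-preparation point concrete: Kaldi recipes expect a "data directory" containing files such as wav.scp, text, and utt2spk. The sketch below writes a minimal one in Python; the file names follow Kaldi's documented data-directory convention, while the utterance IDs, speaker IDs, and paths are placeholders.

from pathlib import Path

# One (utt_id, spk_id, wav_path, transcript) entry per utterance
utterances = [("utt0001", "spk01", "/abs/path/utt0001.wav", "hello world")]

data_dir = Path("data/test")
data_dir.mkdir(parents=True, exist_ok=True)

with open(data_dir / "wav.scp", "w") as wav_scp, \
     open(data_dir / "text", "w") as text, \
     open(data_dir / "utt2spk", "w") as utt2spk:
    for utt_id, spk_id, wav_path, transcript in utterances:
        wav_scp.write(f"{utt_id} {wav_path}\n")   # utterance id -> audio path
        text.write(f"{utt_id} {transcript}\n")    # utterance id -> transcript
        utt2spk.write(f"{utt_id} {spk_id}\n")     # utterance id -> speaker id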

alphacephei/vosk-api: offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node.

Pros of Vosk

  • Lightweight and efficient, suitable for mobile and embedded devices
  • Supports multiple languages out-of-the-box
  • Easier to integrate with existing applications due to simple API

Cons of Vosk

  • Less accurate for complex speech recognition tasks
  • Smaller community and fewer resources compared to DeepSpeech
  • Limited customization options for advanced users

Code Comparison

DeepSpeech:

import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as wf:
    audio = np.frombuffer(wf.readframes(wf.getnframes()), np.int16)
text = model.stt(audio)

Vosk:

import wave
from vosk import Model, KaldiRecognizer

model = Model("model")
with wave.open("audio.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())
    while data := wf.readframes(4000):
        rec.AcceptWaveform(data)
text = rec.FinalResult()  # JSON string containing the transcript

Both repositories provide speech recognition capabilities, but they cater to different use cases. DeepSpeech offers more advanced features and higher accuracy for complex tasks, while Vosk focuses on simplicity and efficiency, making it more suitable for lightweight applications and mobile devices. DeepSpeech has a larger community and more extensive documentation, but Vosk provides easier integration and out-of-the-box support for multiple languages. The code comparison shows that both expose compact Python APIs: DeepSpeech decodes a complete buffer in a single call, while Vosk feeds audio to a recognizer chunk by chunk, which maps naturally onto streaming input.
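
As an illustration of that streaming-friendly API, here is a hedged sketch of live partial results with Vosk; AcceptWaveform(), PartialResult(), Result(), and FinalResult() are part of the Vosk Python API, while get_next_audio_chunk() is the same placeholder used in the DeepSpeech examples:

from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)  # sample rate of the incoming audio

while True:
    chunk = get_next_audio_chunk()   # placeholder: microphone, socket, ...
    if len(chunk) == 0:
        break
    if rec.AcceptWaveform(chunk):    # True when an utterance boundary is detected
        print(rec.Result())          # finalized segment (JSON)
    else:
        print(rec.PartialResult())   # running partial hypothesis (JSON)

print(rec.FinalResult())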

coqui-ai/STT (2,251 stars): 🐸STT, the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

Pros of STT

  • More active development and frequent updates
  • Improved performance and accuracy in speech recognition
  • Better support for multiple languages and accents

Cons of STT

  • Steeper learning curve for beginners
  • Requires more computational resources for training and inference

Code Comparison

DeepSpeech:

import deepspeech

model = deepspeech.Model('path/to/model.pbmm')
result = model.stt(audio_buffer)

STT:

from stt import Model

model = Model('path/to/model.tflite')
result = model.stt(audio_buffer)

Both repositories provide similar high-level APIs for speech recognition, but STT offers more flexibility in model configuration and fine-tuning. STT also supports TensorFlow Lite models, which can be beneficial for deployment on edge devices or mobile platforms.

STT builds upon the foundation laid by DeepSpeech, offering improvements in accuracy, performance, and language support. However, it may require more expertise to fully utilize its advanced features. DeepSpeech remains a solid choice for simpler use cases or when working with limited computational resources.
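
Because STT started as a fork of DeepSpeech, the streaming example from the DeepSpeech section ports over with little more than a changed import and model file. A hedged sketch, assuming the STT Python bindings keep the DeepSpeech-style createStream()/feedAudioContent()/finishStream() methods and reusing the get_next_audio_chunk() placeholder:

import numpy as np
from stt import Model

model = Model('path/to/model.tflite')
stream = model.createStream()

while True:
    chunk = get_next_audio_chunk()  # placeholder, as in the DeepSpeech examples
    if len(chunk) == 0:
        break
    stream.feedAudioContent(np.frombuffer(chunk, np.int16))

print(stream.finishStream())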

openai/whisper (69,530 stars): robust speech recognition via large-scale weak supervision.

Pros of Whisper

  • Supports multiple languages and can perform translation
  • Generally higher accuracy, especially for noisy or accented speech
  • Easier to use out-of-the-box with pre-trained models

Cons of Whisper

  • Larger model size, requiring more computational resources
  • Slower inference time compared to DeepSpeech
  • Less flexibility for fine-tuning on specific domains or accents

Code Comparison

DeepSpeech:

import wave
import numpy as np
import deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as wf:
    audio = np.frombuffer(wf.readframes(wf.getnframes()), np.int16)
text = model.stt(audio)

Whisper:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")
text = result["text"]

Both libraries offer straightforward APIs for transcription, but Whisper provides additional features like language detection and translation with minimal extra code. DeepSpeech focuses on efficient, lightweight models for specific use cases, while Whisper aims for broader language support and higher accuracy at the cost of increased resource requirements.
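
The translation and language-detection features mentioned above take only a few extra lines. A sketch based on the openai-whisper Python package, with the audio file name as a placeholder:

import whisper

model = whisper.load_model("base")

# Transcribe and translate non-English speech directly into English text
result = model.transcribe("audio.wav", task="translate")
print(result["language"], result["text"])

# Explicit language detection on the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))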

espnet/espnet (8,390 stars): end-to-end speech processing toolkit.

Pros of ESPnet

  • More comprehensive toolkit supporting various speech processing tasks (ASR, TTS, speech enhancement, etc.)
  • Actively maintained with frequent updates and a larger contributor base
  • Flexible architecture allowing for easy integration of new models and techniques

Cons of ESPnet

  • Steeper learning curve due to its extensive feature set
  • Potentially higher computational requirements for some models
  • Less focus on production deployment compared to DeepSpeech

Code Comparison

ESPnet example (ASR training):

from espnet2.bin.asr_train import main

args = {
    "output_dir": "exp/asr_train_asr_transformer_raw_bpe",
    "max_epoch": 100,
    "train_data_path_and_name_and_type": ["dump/raw/train/text", "text", "text"],
    "valid_data_path_and_name_and_type": ["dump/raw/valid/text", "text", "text"],
}
main(args)

DeepSpeech example (model training):

from deepspeech_training.train import train

train(train_files, dev_files, test_files, n_hidden=2048, epochs=100, learning_rate=0.0001, dropout=[0.05, 0.05, 0.05])

Both repositories provide powerful speech recognition capabilities, but ESPnet offers a broader range of speech processing tasks and more flexibility in model architecture. DeepSpeech, on the other hand, focuses primarily on ASR and may be easier to get started with for beginners.
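
For the inference side, espnet2 exposes a high-level Speech2Text interface. A hedged sketch, assuming the espnet_model_zoo helper is installed and using a placeholder model tag (entry points differ between ESPnet versions):

import soundfile
from espnet2.bin.asr_inference import Speech2Text

# Download and load a pre-trained model by its model-zoo tag (placeholder)
speech2text = Speech2Text.from_pretrained("espnet/<model-tag>")

speech, rate = soundfile.read("audio.wav")
nbests = speech2text(speech)
text, tokens, token_ids, hyp = nbests[0]  # best hypothesis first
print(text)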

facebookresearch/fairseq (30,331 stars): Facebook AI Research sequence-to-sequence toolkit written in Python.

Pros of fairseq

  • Supports a wider range of NLP tasks beyond speech recognition
  • More actively maintained with frequent updates and contributions
  • Offers more flexibility and customization options for model architectures

Cons of fairseq

  • Steeper learning curve due to its broader scope and complexity
  • Requires more computational resources for training and inference
  • Less focused on speech recognition specifically compared to DeepSpeech

Code Comparison

DeepSpeech:

from deepspeech import Model

model = Model('path/to/model.pbmm')
text = model.stt(audio)

fairseq:

from fairseq.models.wav2vec import Wav2VecCtc

model = Wav2VecCtc.from_pretrained('path/to/model')
tokens = model.transcribe(audio)
text = model.decode(tokens)

Both repositories provide tools for speech recognition, but fairseq offers a more comprehensive toolkit for various NLP tasks. DeepSpeech is more specialized and easier to use for speech recognition, while fairseq provides greater flexibility at the cost of increased complexity. The code examples demonstrate that DeepSpeech has a simpler API for transcription, whereas fairseq requires additional steps but allows for more fine-grained control over the process.

README

Project DeepSpeech

.. image:: https://readthedocs.org/projects/deepspeech/badge/?version=latest
   :target: https://deepspeech.readthedocs.io/?badge=latest
   :alt: Documentation

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/macOS-amd64.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/macOS-amd64.yml
   :alt: macOS builds

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/lint.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/lint.yml
   :alt: Linters

.. image:: https://github.com/mozilla/DeepSpeech/actions/workflows/docker.yml/badge.svg
   :target: https://github.com/mozilla/DeepSpeech/actions/workflows/docker.yml
   :alt: Docker Images

DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on `Baidu's Deep Speech research paper <https://arxiv.org/abs/1412.5567>`_. Project DeepSpeech uses Google's `TensorFlow <https://www.tensorflow.org/>`_ to make the implementation easier.

Documentation for installation, usage, and training models is available on `deepspeech.readthedocs.io <https://deepspeech.readthedocs.io/?badge=latest>`_.

For the latest release, including pre-trained models and checkpoints, see the `latest release on GitHub <https://github.com/mozilla/DeepSpeech/releases/latest>`_.

For contribution guidelines, see `CONTRIBUTING.rst <CONTRIBUTING.rst>`_.

For contact and support information, see `SUPPORT.rst <SUPPORT.rst>`_.
