STT
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Top Related Projects
- mozilla/DeepSpeech: DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
- alphacep/vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- kaldi-asr/kaldi: The official location of the Kaldi project.
- espnet/espnet: End-to-End Speech Processing Toolkit
- pytorch/audio: Data manipulation and transformation for audio signal processing, powered by PyTorch
Quick Overview
coqui-ai/STT is an open-source speech-to-text engine, originally developed at Mozilla (as DeepSpeech) and continued by Coqui. It uses deep-learning models trained on a range of speech datasets to convert audio into text, supports multiple languages, and offers pre-trained models for immediate use.
Pros
- Open-source and free to use
- Supports multiple languages and accents
- Offers pre-trained models for quick deployment
- Can be used offline, ensuring privacy and data security
Cons
- May have lower accuracy compared to some commercial solutions
- Requires significant computational resources for training custom models
- Limited documentation and community support compared to larger projects
- May require fine-tuning for specific use cases or accents
Code Examples
- Basic usage with a pre-trained model:
import wave
import numpy as np
from stt import Model

# Load a pre-trained English model
model = Model('path/to/model.pbmm')
# Read 16-bit mono PCM audio into a NumPy buffer
with wave.open('path/to/audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
text = model.stt(audio)
print(text)
- Streaming audio transcription:
import wave
import numpy as np
from stt import Model

def transcribe_streaming(audio_file):
    model = Model('path/to/model.pbmm')
    with wave.open(audio_file, 'rb') as w:
        # The model expects 16-bit mono PCM at its sample rate (16 kHz by default)
        frames = w.readframes(w.getnframes())
    buffer = np.frombuffer(frames, np.int16)
    context = model.createStream()
    context.feedAudioContent(buffer)
    text = context.finishStream()
    return text
result = transcribe_streaming('path/to/audio.wav')
print(result)
- Using a custom language model:
from stt import Model

model = Model('path/to/acoustic_model.pbmm')
# Attach an external scorer (KenLM language model) to guide decoding
model.enableExternalScorer('path/to/language_model.scorer')
text = model.stt(audio)  # `audio` is a 16-bit mono PCM NumPy array, as above
print(text)
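If decoding with the scorer needs tuning, the Python API also exposes the decoder weights via setScorerAlphaBeta; a minimal sketch with illustrative (not tuned) values:
# Language model weight (alpha) and word insertion bonus (beta); the values below are illustrative
model.setScorerAlphaBeta(0.93, 1.18)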
Getting Started
- Install the STT package:
pip install stt
- Download a pre-trained model. The online Model Zoo at https://coqui.ai/models is no longer hosted; models are now published in the releases of the coqui-ai/STT-models repository on GitHub.
- Use the model in your Python script:
from stt import Model

model = Model('path/to/model.pbmm')
# `audio` is a 16-bit mono PCM NumPy array (see the Code Examples above)
text = model.stt(audio)
print(text)
Competitor Comparisons
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- Longer development history and more established community
- Extensive documentation and tutorials available
- Supports a wider range of languages out-of-the-box
Cons of DeepSpeech
- No longer actively maintained by Mozilla
- Generally slower inference times compared to STT
- Larger model size, requiring more computational resources
Code Comparison
DeepSpeech:
import deepspeech
model = deepspeech.Model('path/to/model.pbmm')
text = model.stt(audio_buffer)
STT:
import stt
model = stt.Model('path/to/model.tflite')
text = model.stt(audio_buffer)
Both repositories provide speech-to-text functionality, but STT is a fork of DeepSpeech with ongoing development. STT offers improved performance and smaller model sizes, making it more suitable for edge devices and real-time applications. However, DeepSpeech still has a larger community and more extensive documentation due to its longer history.
The code usage is similar for both projects, with minor differences in import statements and model file formats. Users familiar with DeepSpeech can easily transition to STT if they require active development and performance improvements.
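To make the transition concrete, here is a hedged sketch that loads whichever engine is installed; the try/except fallback and the model file names are assumptions for illustration, not an officially documented pattern:
import wave
import numpy as np

try:
    from stt import Model          # Coqui STT
    MODEL_PATH = "model.tflite"
except ImportError:
    from deepspeech import Model   # Mozilla DeepSpeech fallback
    MODEL_PATH = "model.pbmm"

model = Model(MODEL_PATH)
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
print(model.stt(audio))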
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of vosk-api
- Lightweight and efficient, suitable for embedded systems and mobile devices
- Supports multiple languages out of the box
- Offers offline speech recognition capabilities
Cons of vosk-api
- Less accurate for complex or noisy audio compared to STT
- Smaller community and fewer resources available
- Limited customization options for specialized use cases
Code Comparison
vosk-api:
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()
while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
STT:
import stt
import numpy as np

model = stt.Model("model.pbmm")
desired_sample_rate = model.sampleRate()  # audio must already be at this rate (16 kHz by default)
# `audio_data` holds raw 16-bit mono PCM bytes captured or loaded elsewhere
audio = np.frombuffer(audio_data, np.int16)
text = model.stt(audio)
print(text)
The code examples demonstrate that the vosk-api snippet bundles microphone capture via PyAudio, while the STT snippet works on pre-loaded audio data. Both engines support streaming recognition: the vosk-api example shows it directly, and STT exposes an equivalent streaming interface, sketched below.
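For a closer comparison with the vosk-api loop above, here is a minimal sketch of STT's streaming interface, feeding a WAV file in chunks and printing partial hypotheses; the chunk size and file names are illustrative:
import wave
import numpy as np
from stt import Model

model = Model("model.pbmm")
stream = model.createStream()
with wave.open("audio.wav", "rb") as w:
    while True:
        chunk = w.readframes(4000)
        if not chunk:
            break
        stream.feedAudioContent(np.frombuffer(chunk, np.int16))
        print(stream.intermediateDecode())  # partial transcript so far
print(stream.finishStream())  # final transcript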
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Supports 99 languages, offering broader multilingual capabilities
- Utilizes a large-scale, pre-trained model for improved accuracy
- Performs well on diverse audio sources, including noisy environments
Cons of Whisper
- Requires more computational resources due to its large model size
- May have slower inference times, especially on less powerful hardware
- Less suitable for real-time applications due to processing requirements
Code Comparison
STT example:
from stt import Model

model = Model("path/to/model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "path/to/audio.wav"
result = model.stt(audio)
print(result)
Whisper example:
import whisper
model = whisper.load_model("base")
result = model.transcribe("path/to/audio.wav")
print(result["text"])
Key Differences
- STT is designed for lightweight, real-time applications, while Whisper focuses on accuracy and language coverage
- STT allows for more customization and fine-tuning, whereas Whisper provides a pre-trained model
- Whisper offers integrated language detection and translation features, which are not present in STT (see the sketch after this list)
- STT is more suitable for edge devices and low-resource environments, while Whisper is better for server-side processing
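To illustrate the language detection and translation point, a minimal sketch using Whisper's Python API; the model size and audio path are placeholders:
import whisper

model = whisper.load_model("base")
# Detect the spoken language from the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("path/to/audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))
# Translate non-English speech directly into English text
result = model.transcribe("path/to/audio.wav", task="translate")
print(result["text"])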
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive and flexible toolkit for speech recognition research
- Supports a wider range of acoustic models and language models
- Better performance for large-scale and complex speech recognition tasks
Cons of Kaldi
- Steeper learning curve and more complex setup process
- Requires more computational resources for training and inference
- Less user-friendly for beginners or those seeking quick deployment
Code Comparison
STT (Python):
import stt

model = stt.Model("model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "audio.wav"
result = model.stt(audio)
print(result)
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
apply-cmvn-sliding --norm-vars=false --center=true --cmn-window=300 ark:- ark:- | \
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
--acoustic-scale=0.083333 --allow-partial=true \
--word-symbol-table=exp/tri4b/graph/words.txt \
exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst ark:- ark:- | \
lattice-scale --inv-acoustic-scale=10 ark:- ark:- | \
lattice-best-path --word-symbol-table=exp/tri4b/graph/words.txt \
ark:- ark,t:- | utils/int2sym.pl -f 2- exp/tri4b/graph/words.txt
End-to-End Speech Processing Toolkit
Pros of espnet
- Broader scope: Supports multiple speech processing tasks (ASR, TTS, speech enhancement, etc.)
- More flexible: Offers a wide range of models and architectures
- Active development: Frequent updates and contributions from the research community
Cons of espnet
- Steeper learning curve: More complex to set up and use due to its extensive features
- Higher resource requirements: May need more computational power for training and inference
Code Comparison
STT (Python):
from stt import Model

model = Model("path/to/model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "path/to/audio.wav"
text = model.stt(audio)
print(text)
espnet (Python):
import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("model_name")
speech, rate = soundfile.read("path/to/audio.wav")
text, *_ = speech2text(speech)[0]  # best hypothesis from the n-best list
print(text)
Both repositories provide speech recognition capabilities, but espnet offers a more comprehensive toolkit for various speech processing tasks. STT focuses specifically on speech-to-text and may be easier to use for beginners or those with limited resources. espnet, while more complex, provides greater flexibility and supports cutting-edge research in speech processing.
Data manipulation and transformation for audio signal processing, powered by PyTorch
Pros of audio
- Broader scope: Covers various audio processing tasks beyond speech recognition
- Tighter integration with PyTorch ecosystem
- More active development and larger community support
Cons of audio
- Less specialized for speech-to-text tasks
- May require more setup and configuration for specific STT use cases
- Potentially steeper learning curve for STT-focused developers
Code Comparison
STT example:
import stt

model = stt.Model("model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "audio.wav"
result = model.stt(audio)
print(result)
audio example:
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
waveform, sample_rate = torchaudio.load("audio.wav")  # expects 16 kHz mono audio
emission, _ = model(waveform)
# Greedy CTC decoding: best label per frame, collapse repeats, drop blanks (index 0)
indices = torch.unique_consecutive(emission[0].argmax(dim=-1)).tolist()
result = "".join(bundle.get_labels()[i] for i in indices if i != 0).replace("|", " ")
The STT code is more concise and focused on speech-to-text, while the audio code offers more flexibility but requires more setup for STT tasks. STT provides a simpler API for quick speech recognition, whereas audio allows for more granular control over the process.
README
.. note:: This project is no longer actively maintained, and we have stopped hosting the online Model Zoo. We've seen focus shift towards newer STT models such as Whisper, and have ourselves focused on Coqui TTS and Coqui Studio.
The models will remain available in the releases of the coqui-ai/STT-models repo.
.. image:: images/coqui-STT-logo-green.png
   :alt: Coqui STT logo

.. |doc-img| image:: https://readthedocs.org/projects/stt/badge/?version=latest
   :target: https://stt.readthedocs.io/?badge=latest
   :alt: Documentation

.. |covenant-img| image:: https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg
   :target: CODE_OF_CONDUCT.md
   :alt: Contributor Covenant

.. |gitter-img| image:: https://badges.gitter.im/coqui-ai/STT.svg
   :target: https://gitter.im/coqui-ai/STT?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
   :alt: Gitter Room

.. |doi| image:: https://zenodo.org/badge/344354127.svg
   :target: https://zenodo.org/badge/latestdoi/344354127
|doc-img| |covenant-img| |gitter-img| |doi|
📰 `Subscribe to 🐸Coqui's Newsletter <https://coqui.ai/?subscription=true>`_
Coqui STT (🐸STT) is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. 🐸STT is battle tested in both production and research 🚀

🐸STT features
- High-quality pre-trained STT model.
- Efficient training pipeline with Multi-GPU support.
- Streaming inference.
- Multiple possible transcripts, each with an associated confidence score (see the sketch after this list).
- Real-time inference.
- Small-footprint acoustic model.
- Bindings for various programming languages.
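As an illustration of the multiple-transcripts feature, a minimal sketch using the Python API's ``sttWithMetadata`` call; the model path and candidate count are placeholders::

    import wave
    import numpy as np
    from stt import Model

    model = Model("model.tflite")
    with wave.open("audio.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    # Request up to 3 candidate transcripts, each with a confidence score
    metadata = model.sttWithMetadata(audio, 3)
    for transcript in metadata.transcripts:
        text = "".join(token.text for token in transcript.tokens)
        print(transcript.confidence, text)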
`Quickstart <https://stt.readthedocs.io/en/latest/#quickstart>`_
Where to Ask Questions
.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 🚨 Bug Reports
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - 🎁 Feature Requests & Ideas
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - ❔ Questions
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_
   * - 💬 General Discussion
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_ or `Gitter Room <https://gitter.im/coqui-ai/STT?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_
Links & Resources
.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 📰 Documentation
     - `stt.readthedocs.io <https://stt.readthedocs.io/>`_
   * - 🚀 Latest release with pre-trained models
     - `see the latest release on GitHub <https://github.com/coqui-ai/STT/releases/latest>`_
   * - 🤝 Contribution Guidelines
     - `CONTRIBUTING.rst <CONTRIBUTING.rst>`_