
coqui-ai/STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.


Top Related Projects

  • mozilla/DeepSpeech: an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
  • alphacep/vosk-api: offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node.
  • openai/whisper: robust speech recognition via large-scale weak supervision.
  • kaldi-asr/kaldi: the official location of the Kaldi project.
  • espnet/espnet: end-to-end speech processing toolkit.
  • pytorch/audio: data manipulation and transformation for audio signal processing, powered by PyTorch.

Quick Overview

coqui-ai/STT is an open-source speech-to-text engine, originally developed at Mozilla (as DeepSpeech) and now maintained by Coqui. It uses deep-learning models trained on a variety of speech datasets to convert audio into text, supports multiple languages, and provides pre-trained models for immediate use.

Pros

  • Open-source and free to use
  • Supports multiple languages and accents
  • Offers pre-trained models for quick deployment
  • Can be used offline, ensuring privacy and data security

Cons

  • May have lower accuracy compared to some commercial solutions
  • Requires significant computational resources for training custom models (see the training sketch after this list)
  • Limited documentation and community support compared to larger projects
  • May require fine-tuning for specific use cases or accents
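
For a sense of what training or fine-tuning a custom model involves, the STT repository ships a coqui_stt_training package that its official training notebooks drive from Python. The sketch below is illustrative only: the CSV manifest paths, checkpoint directory, and epoch count are placeholders, and exact argument names can differ between STT versions.

# Illustrative sketch: requires the coqui_stt_training package from the STT repo
# and CSV manifests listing audio files and their transcripts.
from coqui_stt_training.train import train
from coqui_stt_training.util.config import initialize_globals_from_args

initialize_globals_from_args(
    train_files=["data/train.csv"],  # placeholder manifest paths
    dev_files=["data/dev.csv"],
    checkpoint_dir="checkpoints/",
    epochs=10,
)
train()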

Code Examples

  1. Basic usage with a pre-trained model:
import wave

import numpy as np
from stt import Model

# Load pre-trained English model
model = Model('path/to/model.pbmm')

# Read a 16-bit mono WAV file into an int16 sample buffer
with wave.open('path/to/audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

# Transcribe audio
text = model.stt(audio)
print(text)
  2. Streaming audio transcription:
import wave

import numpy as np
from stt import Model

def transcribe_streaming(audio_file):
    model = Model('path/to/model.pbmm')
    context = model.createStream()

    with wave.open(audio_file, 'rb') as w:
        # Feed the stream in roughly half-second chunks of int16 samples
        chunk = w.getframerate() // 2
        while True:
            frames = w.readframes(chunk)
            if not frames:
                break
            context.feedAudioContent(np.frombuffer(frames, np.int16))

    return context.finishStream()

result = transcribe_streaming('path/to/audio.wav')
print(result)
  3. Using a custom language model (external scorer):
import wave

import numpy as np
from stt import Model

model = Model('path/to/acoustic_model.pbmm')
model.enableExternalScorer('path/to/language_model.scorer')

with wave.open('path/to/audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

text = model.stt(audio)
print(text)

Getting Started

  1. Install the STT package:

    pip install stt
    
  2. Download a pre-trained model. The online Coqui Model Zoo (https://coqui.ai/models) is no longer hosted; models remain available from the releases of the coqui-ai/STT-models repository.

  3. Use the model in your Python script:

    import wave

    import numpy as np
    from stt import Model

    model = Model('path/to/model.pbmm')
    with wave.open('path/to/audio.wav', 'rb') as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    print(model.stt(audio))
    

Competitor Comparisons

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Pros of DeepSpeech

  • Longer development history and more established community
  • Extensive documentation and tutorials available
  • Supports a wider range of languages out-of-the-box

Cons of DeepSpeech

  • No longer actively maintained by Mozilla
  • Generally slower inference times compared to STT
  • Larger model size, requiring more computational resources

Code Comparison

DeepSpeech:

import deepspeech

model = deepspeech.Model('path/to/model.pbmm')
text = model.stt(audio_buffer)

STT:

import stt

model = stt.Model('path/to/model.tflite')
text = model.stt(audio_buffer)

Both repositories provide speech-to-text functionality, but STT is a fork of DeepSpeech with ongoing development. STT offers improved performance and smaller model sizes, making it more suitable for edge devices and real-time applications. However, DeepSpeech still has a larger community and more extensive documentation due to its longer history.

The code usage is similar for both projects, with minor differences in import statements and model file formats. Users familiar with DeepSpeech can easily transition to STT if they require active development and performance improvements.

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Pros of vosk-api

  • Lightweight and efficient, suitable for embedded systems and mobile devices
  • Supports multiple languages out of the box
  • Offers offline speech recognition capabilities

Cons of vosk-api

  • Less accurate for complex or noisy audio compared to STT
  • Smaller community and fewer resources available
  • Limited customization options for specialized use cases

Code Comparison

vosk-api:

from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())

STT:

import stt
import numpy as np

model = stt.Model("model.pbmm")
desired_sample_rate = model.sampleRate()

audio = np.frombuffer(audio_data, np.int16)
text = model.stt(audio)
print(text)

The code examples show that vosk-api needs more setup for audio input, since it captures and recognizes live microphone audio, while the STT snippet transcribes a pre-loaded buffer in a single call. STT also exposes a streaming API (createStream, feedAudioContent, finishStream) that can be fed microphone audio, as sketched below.
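
The following is a minimal sketch of real-time microphone transcription with STT, assuming pyaudio is installed and a 16 kHz model; the buffer size and the fixed number of reads are illustrative rather than prescriptive.

import numpy as np
import pyaudio
import stt

model = stt.Model("model.pbmm")
rate = model.sampleRate()  # typically 16000 Hz

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=rate,
                input=True, frames_per_buffer=4000)

context = model.createStream()
for _ in range(40):  # roughly 10 seconds of audio at 16 kHz with 4000-frame reads
    data = stream.read(4000)
    context.feedAudioContent(np.frombuffer(data, np.int16))

print(context.finishStream())
stream.stop_stream()
stream.close()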


Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

  • Supports 99 languages, offering broader multilingual capabilities
  • Utilizes a large-scale, pre-trained model for improved accuracy
  • Performs well on diverse audio sources, including noisy environments

Cons of Whisper

  • Requires more computational resources due to its large model size
  • May have slower inference times, especially on less powerful hardware
  • Less suitable for real-time applications due to processing requirements

Code Comparison

STT example:

from stt import Model

model = Model("path/to/model.pbmm")
# audio_buffer: 16-bit PCM samples as a NumPy int16 array (see the loading examples above)
result = model.stt(audio_buffer)
print(result)

Whisper example:

import whisper

model = whisper.load_model("base")
result = model.transcribe("path/to/audio.wav")
print(result["text"])

Key Differences

  • STT is designed for lightweight, real-time applications, while Whisper focuses on accuracy and language coverage
  • STT allows for more customization and fine-tuning, whereas Whisper provides a pre-trained model
  • Whisper offers integrated language detection and translation features, which are not present in STT (illustrated in the sketch after this list)
  • STT is more suitable for edge devices and low-resource environments, while Whisper is better for server-side processing
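
As a rough illustration of the last two points, the snippet below uses Whisper's built-in language detection and its translate task. It assumes the openai-whisper package and the helper functions shown in its README; the audio path is a placeholder.

import whisper

model = whisper.load_model("base")

# Detect the spoken language from the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("path/to/audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Transcribe and translate to English in a single call
result = model.transcribe("path/to/audio.wav", task="translate")
print(result["text"])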

kaldi-asr/kaldi is the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive and flexible toolkit for speech recognition research
  • Supports a wider range of acoustic models and language models
  • Better performance for large-scale and complex speech recognition tasks

Cons of Kaldi

  • Steeper learning curve and more complex setup process
  • Requires more computational resources for training and inference
  • Less user-friendly for beginners or those seeking quick deployment

Code Comparison

STT (Python):

import stt

model = stt.Model("model.pbmm")
# audio_buffer: 16-bit PCM samples as a NumPy int16 array
result = model.stt(audio_buffer)
print(result)

Kaldi (Shell script):

#!/bin/bash
. ./path.sh
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
  apply-cmvn-sliding --norm-vars=false --center=true --cmn-window=300 ark:- ark:- | \
  gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
    --acoustic-scale=0.083333 --allow-partial=true \
    --word-symbol-table=exp/tri4b/graph/words.txt \
    exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst ark:- ark:- | \
  lattice-scale --inv-acoustic-scale=10 ark:- ark:- | \
  lattice-best-path --word-symbol-table=exp/tri4b/graph/words.txt \
    ark:- ark,t:- | utils/int2sym.pl -f 2- exp/tri4b/graph/words.txt

End-to-End Speech Processing Toolkit

Pros of espnet

  • Broader scope: Supports multiple speech processing tasks (ASR, TTS, speech enhancement, etc.)
  • More flexible: Offers a wide range of models and architectures
  • Active development: Frequent updates and contributions from the research community

Cons of espnet

  • Steeper learning curve: More complex to set up and use due to its extensive features
  • Higher resource requirements: May need more computational power for training and inference

Code Comparison

STT (Python):

from stt import Model

model = Model("path/to/model.pbmm")
# audio_buffer: 16-bit PCM samples as a NumPy int16 array
text = model.stt(audio_buffer)
print(text)

espnet (Python):

import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("model_name")
speech, rate = soundfile.read("path/to/audio.wav")
nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)

Both repositories provide speech recognition capabilities, but espnet offers a more comprehensive toolkit for various speech processing tasks. STT focuses specifically on speech-to-text and may be easier to use for beginners or those with limited resources. espnet, while more complex, provides greater flexibility and supports cutting-edge research in speech processing.


Data manipulation and transformation for audio signal processing, powered by PyTorch

Pros of audio

  • Broader scope: Covers various audio processing tasks beyond speech recognition
  • Tighter integration with PyTorch ecosystem
  • More active development and larger community support

Cons of audio

  • Less specialized for speech-to-text tasks
  • May require more setup and configuration for specific STT use cases
  • Potentially steeper learning curve for STT-focused developers

Code Comparison

STT example:

import stt

model = stt.Model("model.pbmm")
# audio_buffer: 16-bit PCM samples as a NumPy int16 array
result = model.stt(audio_buffer)
print(result)

audio example:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
waveform, sample_rate = torchaudio.load("audio.wav")  # this bundle expects 16 kHz audio
with torch.inference_mode():
    emission, _ = model(waveform)
# Greedy CTC decode: best label per frame, collapse repeats, drop the blank token
indices = torch.unique_consecutive(torch.argmax(emission[0], dim=-1)).tolist()
labels = bundle.get_labels()
result = "".join(labels[i] for i in indices if labels[i] != "-").replace("|", " ")

The STT code is more concise and focused on speech-to-text, while the audio code offers more flexibility but requires more setup for STT tasks. STT provides a simpler API for quick speech recognition, whereas audio allows for more granular control over the process.


README

.. note:: This project is no longer actively maintained, and we have stopped hosting the online Model Zoo. We've seen focus shift towards newer STT models such as Whisper, and have ourselves focused on Coqui TTS and Coqui Studio.

   The models will remain available in the releases of the coqui-ai/STT-models repo.

.. image:: images/coqui-STT-logo-green.png
   :alt: Coqui STT logo

.. |doc-img| image:: https://readthedocs.org/projects/stt/badge/?version=latest
   :target: https://stt.readthedocs.io/?badge=latest
   :alt: Documentation

.. |covenant-img| image:: https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg
   :target: CODE_OF_CONDUCT.md
   :alt: Contributor Covenant

.. |gitter-img| image:: https://badges.gitter.im/coqui-ai/STT.svg
   :target: https://gitter.im/coqui-ai/STT?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
   :alt: Gitter Room

.. |doi| image:: https://zenodo.org/badge/344354127.svg
   :target: https://zenodo.org/badge/latestdoi/344354127

|doc-img| |covenant-img| |gitter-img| |doi|

👉 Subscribe to `🐸Coqui's Newsletter <https://coqui.ai/?subscription=true>`_

Coqui STT (🐸STT) is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. 🐸STT is battle tested in both production and research 🚀

🐸STT features

  • High-quality pre-trained STT model.
  • Efficient training pipeline with Multi-GPU support.
  • Streaming inference.
  • Multiple possible transcripts, each with an associated confidence score (see the sketch after this list).
  • Real-time inference.
  • Small-footprint acoustic model.
  • Bindings for various programming languages.
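
The "multiple possible transcripts" feature is exposed through the Python binding's ``sttWithMetadata`` call. The snippet below is a minimal sketch (model and audio paths are placeholders, and a 16-bit mono WAV at the model's sample rate is assumed); it asks the decoder for the three most likely transcripts and prints each with its confidence score.

.. code-block:: python

   import wave

   import numpy as np
   from stt import Model

   model = Model("model.tflite")
   with wave.open("audio.wav", "rb") as w:
       audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

   # Request the 3 most likely transcripts with confidence scores
   metadata = model.sttWithMetadata(audio, num_results=3)
   for transcript in metadata.transcripts:
       text = "".join(token.text for token in transcript.tokens)
       print(transcript.confidence, text)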

`Quickstart <https://stt.readthedocs.io/en/latest/#quickstart>`_

Where to Ask Questions

.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 🚨 Bug Reports
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - 🎁 Feature Requests & Ideas
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - ❔ Questions
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_
   * - 💬 General Discussion
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_ or `Gitter Room <https://gitter.im/coqui-ai/STT?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_

Links & Resources

.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 📰 Documentation
     - `stt.readthedocs.io <https://stt.readthedocs.io/>`_
   * - 🚀 Latest release with pre-trained models
     - `see the latest release on GitHub <https://github.com/coqui-ai/STT/releases/latest>`_
   * - 🤝 Contribution Guidelines
     - `CONTRIBUTING.rst <CONTRIBUTING.rst>`_