STT
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Top Related Projects
- mozilla/DeepSpeech: DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
- alphacep/vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- kaldi-asr/kaldi: The official location of the Kaldi project.
- espnet/espnet: End-to-End Speech Processing Toolkit
- pytorch/audio: Data manipulation and transformation for audio signal processing, powered by PyTorch
Quick Overview
coqui-ai/STT is an open-source speech-to-text engine, originally developed at Mozilla (as DeepSpeech) and continued by Coqui. It uses deep-learning models trained on a range of speech datasets to convert audio into text, supports multiple languages, and offers pre-trained models for immediate use.
Pros
- Open-source and free to use
- Supports multiple languages and accents
- Offers pre-trained models for quick deployment
- Can be used offline, ensuring privacy and data security
Cons
- May have lower accuracy compared to some commercial solutions
- Requires significant computational resources for training custom models
- Limited documentation and community support compared to larger projects
- May require fine-tuning for specific use cases or accents
Code Examples
- Basic usage with a pre-trained model:
import wave
import numpy as np
from stt import Model

# Load a pre-trained English model
model = Model('path/to/model.pbmm')
# Read 16-bit mono PCM audio into a NumPy buffer
with wave.open('path/to/audio.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
text = model.stt(audio)
print(text)
- Streaming audio transcription:
import wave
import numpy as np
from stt import Model

def transcribe_streaming(audio_file):
    model = Model('path/to/model.pbmm')
    with wave.open(audio_file, 'rb') as w:
        # The model expects 16-bit mono PCM at its sample rate (16 kHz by default)
        frames = w.readframes(w.getnframes())
    buffer = np.frombuffer(frames, np.int16)
    context = model.createStream()
    context.feedAudioContent(buffer)
    text = context.finishStream()
    return text
result = transcribe_streaming('path/to/audio.wav')
print(result)
- Using a custom language model:
from stt import Model

model = Model('path/to/acoustic_model.pbmm')
# Attach an external scorer (KenLM language model) to guide decoding
model.enableExternalScorer('path/to/language_model.scorer')
text = model.stt(audio)  # `audio` is a 16-bit mono PCM NumPy array, as above
print(text)
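If decoding with the scorer needs tuning, the Python API also exposes the decoder weights via setScorerAlphaBeta; a minimal sketch with illustrative (not tuned) values:
# Language model weight (alpha) and word insertion bonus (beta); the values below are illustrative
model.setScorerAlphaBeta(0.93, 1.18)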
Getting Started
- Install the STT package:
pip install stt
- Download a pre-trained model. The online Model Zoo at https://coqui.ai/models is no longer hosted; models are now published in the releases of the coqui-ai/STT-models repository on GitHub.
- Use the model in your Python script:
from stt import Model

model = Model('path/to/model.pbmm')
# `audio` is a 16-bit mono PCM NumPy array (see the Code Examples above)
text = model.stt(audio)
print(text)
Competitor Comparisons
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- Longer development history and more established community
- Extensive documentation and tutorials available
- Supports a wider range of languages out-of-the-box
Cons of DeepSpeech
- No longer actively maintained by Mozilla
- Generally slower inference times compared to STT
- Larger model size, requiring more computational resources
Code Comparison
DeepSpeech:
import deepspeech
model = deepspeech.Model('path/to/model.pbmm')
text = model.stt(audio_buffer)
STT:
import stt
model = stt.Model('path/to/model.tflite')
text = model.stt(audio_buffer)
Both repositories provide speech-to-text functionality, but STT is a fork of DeepSpeech with ongoing development. STT offers improved performance and smaller model sizes, making it more suitable for edge devices and real-time applications. However, DeepSpeech still has a larger community and more extensive documentation due to its longer history.
The code usage is similar for both projects, with minor differences in import statements and model file formats. Users familiar with DeepSpeech can easily transition to STT if they require active development and performance improvements.
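To make the transition concrete, here is a hedged sketch that loads whichever engine is installed; the try/except fallback and the model file names are assumptions for illustration, not an officially documented pattern:
import wave
import numpy as np

try:
    from stt import Model          # Coqui STT
    MODEL_PATH = "model.tflite"
except ImportError:
    from deepspeech import Model   # Mozilla DeepSpeech fallback
    MODEL_PATH = "model.pbmm"

model = Model(MODEL_PATH)
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
print(model.stt(audio))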
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of vosk-api
- Lightweight and efficient, suitable for embedded systems and mobile devices
- Supports multiple languages out of the box
- Offers offline speech recognition capabilities
Cons of vosk-api
- Less accurate for complex or noisy audio compared to STT
- Smaller community and fewer resources available
- Limited customization options for specialized use cases
Code Comparison
vosk-api:
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()
while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
STT:
import stt
import numpy as np

model = stt.Model("model.pbmm")
desired_sample_rate = model.sampleRate()  # audio must already be at this rate (16 kHz by default)
# `audio_data` holds raw 16-bit mono PCM bytes captured or loaded elsewhere
audio = np.frombuffer(audio_data, np.int16)
text = model.stt(audio)
print(text)
The code examples demonstrate that the vosk-api snippet bundles microphone capture via PyAudio, while the STT snippet works on pre-loaded audio data. Both engines support streaming recognition: the vosk-api example shows it directly, and STT exposes an equivalent streaming interface, sketched below.
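For a closer comparison with the vosk-api loop above, here is a minimal sketch of STT's streaming interface, feeding a WAV file in chunks and printing partial hypotheses; the chunk size and file names are illustrative:
import wave
import numpy as np
from stt import Model

model = Model("model.pbmm")
stream = model.createStream()
with wave.open("audio.wav", "rb") as w:
    while True:
        chunk = w.readframes(4000)
        if not chunk:
            break
        stream.feedAudioContent(np.frombuffer(chunk, np.int16))
        print(stream.intermediateDecode())  # partial transcript so far
print(stream.finishStream())  # final transcript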
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Supports 99 languages, offering broader multilingual capabilities
- Utilizes a large-scale, pre-trained model for improved accuracy
- Performs well on diverse audio sources, including noisy environments
Cons of Whisper
- Requires more computational resources due to its large model size
- May have slower inference times, especially on less powerful hardware
- Less suitable for real-time applications due to processing requirements
Code Comparison
STT example:
from stt import Model

model = Model("path/to/model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "path/to/audio.wav"
result = model.stt(audio)
print(result)
Whisper example:
import whisper
model = whisper.load_model("base")
result = model.transcribe("path/to/audio.wav")
print(result["text"])
Key Differences
- STT is designed for lightweight, real-time applications, while Whisper focuses on accuracy and language coverage
- STT allows for more customization and fine-tuning, whereas Whisper provides a pre-trained model
- Whisper offers integrated language detection and translation features, which are not present in STT (see the sketch after this list)
- STT is more suitable for edge devices and low-resource environments, while Whisper is better for server-side processing
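To illustrate the language detection and translation point, a minimal sketch using Whisper's Python API; the model size and audio path are placeholders:
import whisper

model = whisper.load_model("base")
# Detect the spoken language from the first 30 seconds of audio
audio = whisper.pad_or_trim(whisper.load_audio("path/to/audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))
# Translate non-English speech directly into English text
result = model.transcribe("path/to/audio.wav", task="translate")
print(result["text"])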
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive and flexible toolkit for speech recognition research
- Supports a wider range of acoustic models and language models
- Better performance for large-scale and complex speech recognition tasks
Cons of Kaldi
- Steeper learning curve and more complex setup process
- Requires more computational resources for training and inference
- Less user-friendly for beginners or those seeking quick deployment
Code Comparison
STT (Python):
import stt

model = stt.Model("model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "audio.wav"
result = model.stt(audio)
print(result)
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
compute-mfcc-feats --config=conf/mfcc.conf scp:data/test/wav.scp ark:- | \
apply-cmvn-sliding --norm-vars=false --center=true --cmn-window=300 ark:- ark:- | \
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
--acoustic-scale=0.083333 --allow-partial=true \
--word-symbol-table=exp/tri4b/graph/words.txt \
exp/tri4b/final.mdl exp/tri4b/graph/HCLG.fst ark:- ark:- | \
lattice-scale --inv-acoustic-scale=10 ark:- ark:- | \
lattice-best-path --word-symbol-table=exp/tri4b/graph/words.txt \
ark:- ark,t:- | utils/int2sym.pl -f 2- exp/tri4b/graph/words.txt
End-to-End Speech Processing Toolkit
Pros of espnet
- Broader scope: Supports multiple speech processing tasks (ASR, TTS, speech enhancement, etc.)
- More flexible: Offers a wide range of models and architectures
- Active development: Frequent updates and contributions from the research community
Cons of espnet
- Steeper learning curve: More complex to set up and use due to its extensive features
- Higher resource requirements: May need more computational power for training and inference
Code Comparison
STT (Python):
from stt import Model

model = Model("path/to/model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "path/to/audio.wav"
text = model.stt(audio)
print(text)
espnet (Python):
import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("model_name")
speech, rate = soundfile.read("path/to/audio.wav")
text, *_ = speech2text(speech)[0]  # best hypothesis from the n-best list
print(text)
Both repositories provide speech recognition capabilities, but espnet offers a more comprehensive toolkit for various speech processing tasks. STT focuses specifically on speech-to-text and may be easier to use for beginners or those with limited resources. espnet, while more complex, provides greater flexibility and supports cutting-edge research in speech processing.
Data manipulation and transformation for audio signal processing, powered by PyTorch
Pros of audio
- Broader scope: Covers various audio processing tasks beyond speech recognition
- Tighter integration with PyTorch ecosystem
- More active development and larger community support
Cons of audio
- Less specialized for speech-to-text tasks
- May require more setup and configuration for specific STT use cases
- Potentially steeper learning curve for STT-focused developers
Code Comparison
STT example:
import stt

model = stt.Model("model.pbmm")
# `audio` is a 16-bit mono PCM NumPy array loaded from "audio.wav"
result = model.stt(audio)
print(result)
audio example:
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
waveform, sample_rate = torchaudio.load("audio.wav")  # expects 16 kHz mono audio
emission, _ = model(waveform)
# Greedy CTC decoding: best label per frame, collapse repeats, drop blanks (index 0)
indices = torch.unique_consecutive(emission[0].argmax(dim=-1)).tolist()
result = "".join(bundle.get_labels()[i] for i in indices if i != 0).replace("|", " ")
The STT code is more concise and focused on speech-to-text, while the audio code offers more flexibility but requires more setup for STT tasks. STT provides a simpler API for quick speech recognition, whereas audio allows for more granular control over the process.
README
.. note:: This project is no longer actively maintained, and we have stopped hosting the online Model Zoo. We've seen focus shift towards newer STT models such as Whisper, and have ourselves focused on Coqui TTS and Coqui Studio.
The models will remain available in the releases of the coqui-ai/STT-models repo.
.. image:: images/coqui-STT-logo-green.png
   :alt: Coqui STT logo

.. |doc-img| image:: https://readthedocs.org/projects/stt/badge/?version=latest
   :target: https://stt.readthedocs.io/?badge=latest
   :alt: Documentation

.. |covenant-img| image:: https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg
   :target: CODE_OF_CONDUCT.md
   :alt: Contributor Covenant

.. |gitter-img| image:: https://badges.gitter.im/coqui-ai/STT.svg
   :target: https://gitter.im/coqui-ai/STT?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
   :alt: Gitter Room

.. |doi| image:: https://zenodo.org/badge/344354127.svg
   :target: https://zenodo.org/badge/latestdoi/344354127
|doc-img| |covenant-img| |gitter-img| |doi|
📰 `Subscribe to 🐸Coqui's Newsletter <https://coqui.ai/?subscription=true>`_
Coqui STT (🐸STT) is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. 🐸STT is battle tested in both production and research 🚀

🐸STT features
- High-quality pre-trained STT model.
- Efficient training pipeline with Multi-GPU support.
- Streaming inference.
- Multiple possible transcripts, each with an associated confidence score (see the sketch after this list).
- Real-time inference.
- Small-footprint acoustic model.
- Bindings for various programming languages.
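As an illustration of the multiple-transcripts feature, a minimal sketch using the Python API's ``sttWithMetadata`` call; the model path and candidate count are placeholders::

    import wave
    import numpy as np
    from stt import Model

    model = Model("model.tflite")
    with wave.open("audio.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    # Request up to 3 candidate transcripts, each with a confidence score
    metadata = model.sttWithMetadata(audio, 3)
    for transcript in metadata.transcripts:
        text = "".join(token.text for token in transcript.tokens)
        print(transcript.confidence, text)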
`Quickstart <https://stt.readthedocs.io/en/latest/#quickstart>`_
Where to Ask Questions
.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 🚨 Bug Reports
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - 🎁 Feature Requests & Ideas
     - `Github Issue Tracker <https://github.com/coqui-ai/STT/issues/>`_
   * - ❔ Questions
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_
   * - 💬 General Discussion
     - `Github Discussions <https://github.com/coqui-ai/stt/discussions/>`_ or `Gitter Room <https://gitter.im/coqui-ai/STT?utm_source=share-link&utm_medium=link&utm_campaign=share-link>`_
Links & Resources
.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Type
     - Link
   * - 📰 Documentation
     - `stt.readthedocs.io <https://stt.readthedocs.io/>`_
   * - 🚀 Latest release with pre-trained models
     - `see the latest release on GitHub <https://github.com/coqui-ai/STT/releases/latest>`_
   * - 🤝 Contribution Guidelines
     - `CONTRIBUTING.rst <CONTRIBUTING.rst>`_