vosk-api
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Top Related Projects
kaldi-asr/kaldi is the official location of the Kaldi project.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Robust Speech Recognition via Large-Scale Weak Supervision
Open-Source Large Vocabulary Continuous Speech Recognition Engine
A small speech recognizer
Quick Overview
Vosk-api is an offline speech recognition toolkit that provides fast and accurate speech-to-text capabilities. It supports multiple languages and can be integrated into various applications and platforms, including mobile devices and embedded systems.
Pros
- Offline functionality, ensuring privacy and reliability without internet dependency
- Supports multiple languages and can be easily extended
- Lightweight and suitable for embedded systems and mobile devices
- Provides APIs for various programming languages (Python, Java, Node.js, etc.)
Cons
- May not be as accurate as some cloud-based speech recognition services
- Requires downloading language models, which can be large files
- Limited documentation compared to some commercial alternatives
- May require more setup and configuration than cloud-based solutions
Code Examples
- Basic speech recognition in Python:
import json
import wave
import vosk

model = vosk.Model("path/to/model")
wf = wave.open("audio.wav", "rb")  # expects 16 kHz mono 16-bit PCM WAV
rec = vosk.KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])

print(json.loads(rec.FinalResult())["text"])
- Real-time speech recognition from microphone:
import json
import vosk
import pyaudio

model = vosk.Model("path/to/model")
rec = vosk.KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:  # a blocking PyAudio read never returns empty; stop with Ctrl+C
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])
- Using Vosk with Node.js:
const vosk = require('vosk');
const fs = require('fs');
const { Readable } = require('stream');
const wav = require('wav');

vosk.setLogLevel(0);
const model = new vosk.Model('path/to/model');
const rec = new vosk.Recognizer({ model: model, sampleRate: 16000 });

const wfStream = fs.createReadStream('audio.wav');
const wfReader = new wav.Reader();

wfReader.on('format', async ({ audioFormat, sampleRate, channels }) => {
    if (audioFormat != 1 || channels != 1) {
        console.error('Audio file must be WAV format mono PCM.');
        process.exit(1);
    }
    for await (const data of new Readable().wrap(wfReader)) {
        const endOfSpeech = rec.acceptWaveform(data);
        if (endOfSpeech) {
            // result() already returns a parsed object, so no JSON.parse is needed
            console.log(rec.result().text);
        }
    }
    console.log(rec.finalResult().text);
    rec.free();
    model.free();
});

wfStream.pipe(wfReader);
Getting Started
1. Install Vosk:
   pip install vosk
2. Download a language model from the Vosk website (see the sketch after this list for a programmatic download).
3. Use the code examples above, replacing "path/to/model" with the actual path to your downloaded model.
4. For more advanced usage and integration with other languages, refer to the Vosk API documentation.
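Models can also be fetched and unpacked from a script. A minimal sketch, assuming the small English model hosted on the Vosk site (the URL and archive name are illustrative and change between releases):

import io
import urllib.request
import zipfile

# Illustrative model URL; check the Vosk website for current links.
MODEL_URL = "https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"

with urllib.request.urlopen(MODEL_URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(".")

# The archive unpacks to a directory such as vosk-model-small-en-us-0.15;
# pass that directory to vosk.Model(...) as the model path.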
Competitor Comparisons
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive and flexible toolkit for speech recognition research
- Extensive documentation and active community support
- Wider range of acoustic models and language processing tools
Cons of Kaldi
- Steeper learning curve and more complex setup process
- Requires more computational resources and expertise to use effectively
- Less suitable for quick deployment in production environments
Code Comparison
Vosk API (Python):
from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
. ./cmd.sh
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
utils/subset_data_dir.sh --first data/train 10000 data/train_10k
steps/train_mono.sh --nj 20 --cmd "$train_cmd" data/train_10k data/lang exp/mono
Summary
Kaldi offers a more comprehensive toolkit for speech recognition research with extensive documentation and community support. However, it has a steeper learning curve and requires more resources. Vosk API, on the other hand, provides a simpler interface for quick deployment but with fewer advanced features.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- More extensive documentation and community support
- Better performance on English language recognition
- Supports streaming recognition as well as batch transcription
Cons of DeepSpeech
- Larger model size, requiring more computational resources
- Limited language support compared to Vosk
- Development has been discontinued by Mozilla
Code Comparison
DeepSpeech:
import wave, numpy as np, deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as wf:  # expects 16 kHz mono 16-bit WAV
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
print(model.stt(audio))
Vosk:
from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)
rec.AcceptWaveform(data)  # data: a chunk of 16 kHz mono 16-bit PCM bytes
result = rec.Result()
Both DeepSpeech and Vosk are open-source speech recognition libraries, but they have different strengths and use cases. DeepSpeech offers robust English language support and can work offline, making it suitable for privacy-focused applications. However, its development has been discontinued, which may impact long-term support and updates.
Vosk, on the other hand, provides support for multiple languages and has a smaller model size, making it more versatile and resource-efficient. It's actively maintained and offers good performance across various languages, but may not match DeepSpeech's accuracy for English recognition.
When choosing between the two, consider factors such as language requirements, computational resources, and the need for ongoing support and updates.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Pros of STT
- More extensive language support, including multi-language models
- Advanced features like speaker diarization and punctuation
- Larger community and more frequent updates
Cons of STT
- Higher resource requirements and slower inference speed
- More complex setup and integration process
- Larger model sizes, requiring more storage space
Code Comparison
Vosk API usage:
from vosk import Model, KaldiRecognizer
import wave

model = Model("model")
rec = KaldiRecognizer(model, 16000)
wf = wave.open("audio.wav", "rb")  # 16 kHz mono 16-bit WAV
rec.AcceptWaveform(wf.readframes(wf.getnframes()))
print(rec.Result())
STT usage:
import wave
import numpy as np
import stt

model = stt.Model("model.pbmm")
stream = model.createStream()
with wave.open("audio.wav", "rb") as wf:  # 16 kHz mono 16-bit WAV
    while True:
        chunk = wf.readframes(1024)
        if not chunk:
            break
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
print(stream.finishStream())
Both repositories offer speech-to-text capabilities, but they differ in their approach and features. Vosk API is lightweight and efficient, making it suitable for embedded systems and real-time applications. STT, on the other hand, provides more advanced features and language support at the cost of higher resource requirements and complexity.
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Higher accuracy for speech recognition, especially in challenging conditions
- Supports multilingual transcription and translation
- Leverages large-scale pre-training on diverse audio datasets
Cons of Whisper
- Requires more computational resources and may be slower for real-time applications
- Larger model size, which can be challenging for deployment on resource-constrained devices
- Depends on PyTorch, which may not be suitable for all environments
Code Comparison
Vosk API usage:
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
# Audio processing loop
Whisper usage:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Both Vosk and Whisper are powerful speech recognition libraries, but they cater to different use cases. Vosk is designed for lightweight, offline, and real-time applications, while Whisper excels in accuracy and multilingual support at the cost of higher computational requirements. The choice between them depends on the specific needs of your project, such as resource constraints, accuracy requirements, and language support.
Open-Source Large Vocabulary Continuous Speech Recognition Engine
Pros of Julius
- Long-standing project with extensive documentation and research papers
- Supports real-time decoding and low-latency recognition
- Highly customizable with various acoustic models and language models
Cons of Julius
- Primarily focused on Japanese language support
- Less active development compared to Vosk
- Steeper learning curve for integration and customization
Code Comparison
Julius:
#include <julius/juliuslib.h>

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;

    jconf = j_config_load_file_new(argv[1]);  /* path to a .jconf configuration file */
    recog = j_create_instance_from_jconf(jconf);
    j_adin_init(recog);
    j_recognize_stream(recog);
    return 0;
}
Vosk:
from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
The code snippets demonstrate the basic setup and usage of each library. Julius uses C and requires more configuration, while Vosk offers a simpler Python interface for quick integration.
A small speech recognizer
Pros of PocketSphinx
- Longer development history and more established in the speech recognition community
- Supports a wider range of languages and acoustic models
- More extensive documentation and academic research backing
Cons of PocketSphinx
- Generally slower performance compared to Vosk
- Less active development and updates in recent years
- More complex setup and integration process
Code Comparison
PocketSphinx (C):
#include <pocketsphinx.h>

/* config: a cmd_ln_t * built beforehand with cmd_ln_init() */
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];

while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
Vosk (Python):
from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)

while True:
    data = stream.read(4000)  # stream: an open PyAudio input stream (see earlier examples)
    if rec.AcceptWaveform(data):
        print(rec.Result())
Both libraries offer speech recognition capabilities, but Vosk provides a more modern and streamlined API with better performance on resource-constrained devices. PocketSphinx offers more flexibility in terms of language support and acoustic models, but at the cost of increased complexity and slower processing speed. Vosk is generally easier to integrate and use, especially for developers new to speech recognition.
README
Vosk Speech Recognition Toolkit
Vosk is an offline open source speech recognition toolkit. It enables speech recognition for 20+ languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish. More to come.
Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification.
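The reconfigurable vocabulary and streaming behavior are visible directly in the Python API: a recognizer can be created with a restricted grammar, and partial hypotheses can be read while audio is still arriving. A minimal sketch (the grammar phrases, model path, and the chunk buffer are illustrative):

import json
from vosk import Model, KaldiRecognizer

model = Model("path/to/model")
# Restrict recognition to a small set of phrases; "[unk]" absorbs
# anything spoken outside the grammar.
grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
rec = KaldiRecognizer(model, 16000, grammar)

# chunk: a buffer of 16 kHz mono 16-bit PCM bytes from a stream
if rec.AcceptWaveform(chunk):
    print(json.loads(rec.Result())["text"])            # finalized utterance
else:
    print(json.loads(rec.PartialResult())["partial"])  # unstable partial hypothesis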
Speech recognition bindings are available for a range of programming languages, including Python, Java, Node.JS, C#, C++, Rust, Go and others.
Vosk supplies speech recognition for chatbots, smart home appliances, and virtual assistants. It can also create subtitles for movies and transcriptions of lectures and interviews.
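Word-level timestamps, which subtitle generation needs, can be requested from the Python recognizer with SetWords(True); each final result then carries per-word "start" and "end" times. A rough sketch of turning one result into an SRT cue (the srt_time helper is ad hoc, and audio is assumed to have been fed with AcceptWaveform beforehand):

import json
from vosk import Model, KaldiRecognizer

model = Model("path/to/model")
rec = KaldiRecognizer(model, 16000)
rec.SetWords(True)  # include per-word timing in results

def srt_time(t):
    # ad hoc helper: seconds -> "HH:MM:SS,mmm"
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int(t * 1000) % 1000:03}"

res = json.loads(rec.Result())  # after feeding audio with AcceptWaveform(...)
if res.get("result"):
    words = res["result"]
    print(f"1\n{srt_time(words[0]['start'])} --> {srt_time(words[-1]['end'])}")
    print(res["text"] + "\n")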
Vosk scales from small devices like the Raspberry Pi or an Android smartphone to large server clusters.
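At the server end of that range, the companion vosk-server project exposes the same recognizer over a WebSocket. A minimal client sketch, assuming a vosk-server instance listening on ws://localhost:2700 (host, port, and chunk size are illustrative):

import asyncio
import wave
import websockets

async def transcribe(path):
    async with websockets.connect("ws://localhost:2700") as ws:
        wf = wave.open(path, "rb")
        await ws.send('{ "config" : { "sample_rate" : %d } }' % wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            await ws.send(data)
            print(await ws.recv())  # partial or final JSON result per chunk
        await ws.send('{"eof" : 1}')
        print(await ws.recv())      # final result

asyncio.run(transcribe("audio.wav"))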
Documentation
For installation instructions, examples and documentation visit Vosk Website.