vosk-api
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Top Related Projects
kaldi-asr/kaldi is the official location of the Kaldi project.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Robust Speech Recognition via Large-Scale Weak Supervision
Open-Source Large Vocabulary Continuous Speech Recognition Engine
A small speech recognizer
Quick Overview
Vosk-api is an offline speech recognition toolkit that provides fast and accurate speech-to-text capabilities. It supports multiple languages and can be integrated into various applications and platforms, including mobile devices and embedded systems.
Pros
- Offline functionality, ensuring privacy and reliability without internet dependency
- Supports multiple languages and can be easily extended
- Lightweight and suitable for embedded systems and mobile devices
- Provides APIs for various programming languages (Python, Java, Node.js, etc.)
Cons
- May not be as accurate as some cloud-based speech recognition services
- Requires downloading language models, which can be large files
- Limited documentation compared to some commercial alternatives
- May require more setup and configuration than cloud-based solutions
Code Examples
- Basic speech recognition in Python:
import json
import wave
import vosk

model = vosk.Model("path/to/model")
wf = wave.open("audio.wav", "rb")  # expects 16 kHz mono 16-bit PCM WAV
rec = vosk.KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])

print(json.loads(rec.FinalResult())["text"])
- Real-time speech recognition from microphone:
import json
import vosk
import pyaudio

model = vosk.Model("path/to/model")
rec = vosk.KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:  # a blocking PyAudio read never returns empty; stop with Ctrl+C
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])
- Using Vosk with Node.js:
const vosk = require('vosk');
const fs = require('fs');
const { Readable } = require('stream');
const wav = require('wav');

vosk.setLogLevel(0);
const model = new vosk.Model('path/to/model');
const rec = new vosk.Recognizer({ model: model, sampleRate: 16000 });

const wfStream = fs.createReadStream('audio.wav');
const wfReader = new wav.Reader();

wfReader.on('format', async ({ audioFormat, sampleRate, channels }) => {
    if (audioFormat != 1 || channels != 1) {
        console.error('Audio file must be WAV format mono PCM.');
        process.exit(1);
    }
    for await (const data of new Readable().wrap(wfReader)) {
        const endOfSpeech = rec.acceptWaveform(data);
        if (endOfSpeech) {
            // result() already returns a parsed object, so no JSON.parse is needed
            console.log(rec.result().text);
        }
    }
    console.log(rec.finalResult().text);
    rec.free();
    model.free();
});

wfStream.pipe(wfReader);
Getting Started
1. Install Vosk:
   pip install vosk
2. Download a language model from the Vosk website (see the sketch after this list for a programmatic download).
3. Use the code examples above, replacing "path/to/model" with the actual path to your downloaded model.
4. For more advanced usage and integration with other languages, refer to the Vosk API documentation.
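Models can also be fetched and unpacked from a script. A minimal sketch, assuming the small English model hosted on the Vosk site (the URL and archive name are illustrative and change between releases):

import io
import urllib.request
import zipfile

# Illustrative model URL; check the Vosk website for current links.
MODEL_URL = "https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"

with urllib.request.urlopen(MODEL_URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(".")

# The archive unpacks to a directory such as vosk-model-small-en-us-0.15;
# pass that directory to vosk.Model(...) as the model path.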
Competitor Comparisons
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More comprehensive and flexible toolkit for speech recognition research
- Extensive documentation and active community support
- Wider range of acoustic models and language processing tools
Cons of Kaldi
- Steeper learning curve and more complex setup process
- Requires more computational resources and expertise to use effectively
- Less suitable for quick deployment in production environments
Code Comparison
Vosk API (Python):
from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
Kaldi (Shell script):
#!/bin/bash
. ./path.sh
. ./cmd.sh
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
utils/subset_data_dir.sh --first data/train 10000 data/train_10k
steps/train_mono.sh --nj 20 --cmd "$train_cmd" data/train_10k data/lang exp/mono
Summary
Kaldi offers a more comprehensive toolkit for speech recognition research with extensive documentation and community support. However, it has a steeper learning curve and requires more resources. Vosk API, on the other hand, provides a simpler interface for quick deployment but with fewer advanced features.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- More extensive documentation and community support
- Better performance on English language recognition
- Supports streaming recognition as well as batch transcription
Cons of DeepSpeech
- Larger model size, requiring more computational resources
- Limited language support compared to Vosk
- Development has been discontinued by Mozilla
Code Comparison
DeepSpeech:
import wave, numpy as np, deepspeech

model = deepspeech.Model('model.pbmm')
with wave.open('audio.wav', 'rb') as wf:  # expects 16 kHz mono 16-bit WAV
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
print(model.stt(audio))
Vosk:
from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)
rec.AcceptWaveform(data)  # data: a chunk of 16 kHz mono 16-bit PCM bytes
result = rec.Result()
Both DeepSpeech and Vosk are open-source speech recognition libraries, but they have different strengths and use cases. DeepSpeech offers robust English language support and can work offline, making it suitable for privacy-focused applications. However, its development has been discontinued, which may impact long-term support and updates.
Vosk, on the other hand, provides support for multiple languages and has a smaller model size, making it more versatile and resource-efficient. It's actively maintained and offers good performance across various languages, but may not match DeepSpeech's accuracy for English recognition.
When choosing between the two, consider factors such as language requirements, computational resources, and the need for ongoing support and updates.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Pros of STT
- More extensive language support, including multi-language models
- Advanced features like speaker diarization and punctuation
- Larger community and more frequent updates
Cons of STT
- Higher resource requirements and slower inference speed
- More complex setup and integration process
- Larger model sizes, requiring more storage space
Code Comparison
Vosk API usage:
from vosk import Model, KaldiRecognizer
import wave

model = Model("model")
rec = KaldiRecognizer(model, 16000)
wf = wave.open("audio.wav", "rb")  # 16 kHz mono 16-bit WAV
rec.AcceptWaveform(wf.readframes(wf.getnframes()))
print(rec.Result())
STT usage:
import wave
import numpy as np
import stt

model = stt.Model("model.pbmm")
stream = model.createStream()
with wave.open("audio.wav", "rb") as wf:  # 16 kHz mono 16-bit WAV
    while True:
        chunk = wf.readframes(1024)
        if not chunk:
            break
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
print(stream.finishStream())
Both repositories offer speech-to-text capabilities, but they differ in their approach and features. Vosk API is lightweight and efficient, making it suitable for embedded systems and real-time applications. STT, on the other hand, provides more advanced features and language support at the cost of higher resource requirements and complexity.
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Higher accuracy for speech recognition, especially in challenging conditions
- Supports multilingual transcription and translation
- Leverages large-scale pre-training on diverse audio datasets
Cons of Whisper
- Requires more computational resources and may be slower for real-time applications
- Larger model size, which can be challenging for deployment on resource-constrained devices
- Depends on PyTorch, which may not be suitable for all environments
Code Comparison
Vosk API usage:
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
# Audio processing loop
Whisper usage:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Both Vosk and Whisper are powerful speech recognition libraries, but they cater to different use cases. Vosk is designed for lightweight, offline, and real-time applications, while Whisper excels in accuracy and multilingual support at the cost of higher computational requirements. The choice between them depends on the specific needs of your project, such as resource constraints, accuracy requirements, and language support.
Open-Source Large Vocabulary Continuous Speech Recognition Engine
Pros of Julius
- Long-standing project with extensive documentation and research papers
- Supports real-time decoding and low-latency recognition
- Highly customizable with various acoustic models and language models
Cons of Julius
- Primarily focused on Japanese language support
- Less active development compared to Vosk
- Steeper learning curve for integration and customization
Code Comparison
Julius:
#include <julius/juliuslib.h>

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;

    jconf = j_config_load_file_new(argv[1]);  /* path to a .jconf configuration file */
    recog = j_create_instance_from_jconf(jconf);
    j_adin_init(recog);
    j_recognize_stream(recog);
    return 0;
}
Vosk:
from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())
The code snippets demonstrate the basic setup and usage of each library. Julius uses C and requires more configuration, while Vosk offers a simpler Python interface for quick integration.
A small speech recognizer
Pros of PocketSphinx
- Longer development history and more established in the speech recognition community
- Supports a wider range of languages and acoustic models
- More extensive documentation and academic research backing
Cons of PocketSphinx
- Generally slower performance compared to Vosk
- Less active development and updates in recent years
- More complex setup and integration process
Code Comparison
PocketSphinx (C):
#include <pocketsphinx.h>

/* config: a cmd_ln_t * built beforehand with cmd_ln_init() */
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];

while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
Vosk (Python):
from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)

while True:
    data = stream.read(4000)  # stream: an open PyAudio input stream (see earlier examples)
    if rec.AcceptWaveform(data):
        print(rec.Result())
Both libraries offer speech recognition capabilities, but Vosk provides a more modern and streamlined API with better performance on resource-constrained devices. PocketSphinx offers more flexibility in terms of language support and acoustic models, but at the cost of increased complexity and slower processing speed. Vosk is generally easier to integrate and use, especially for developers new to speech recognition.
README
Vosk Speech Recognition Toolkit
Vosk is an offline open source speech recognition toolkit. It enables speech recognition for 20+ languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish. More to come.
Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification.
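The reconfigurable vocabulary and streaming behavior are visible directly in the Python API: a recognizer can be created with a restricted grammar, and partial hypotheses can be read while audio is still arriving. A minimal sketch (the grammar phrases, model path, and the chunk buffer are illustrative):

import json
from vosk import Model, KaldiRecognizer

model = Model("path/to/model")
# Restrict recognition to a small set of phrases; "[unk]" absorbs
# anything spoken outside the grammar.
grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
rec = KaldiRecognizer(model, 16000, grammar)

# chunk: a buffer of 16 kHz mono 16-bit PCM bytes from a stream
if rec.AcceptWaveform(chunk):
    print(json.loads(rec.Result())["text"])            # finalized utterance
else:
    print(json.loads(rec.PartialResult())["partial"])  # unstable partial hypothesis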
Speech recognition bindings are available for a range of programming languages, including Python, Java, Node.JS, C#, C++, Rust, Go and others.
Vosk supplies speech recognition for chatbots, smart home appliances, and virtual assistants. It can also create subtitles for movies and transcriptions of lectures and interviews.
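Word-level timestamps, which subtitle generation needs, can be requested from the Python recognizer with SetWords(True); each final result then carries per-word "start" and "end" times. A rough sketch of turning one result into an SRT cue (the srt_time helper is ad hoc, and audio is assumed to have been fed with AcceptWaveform beforehand):

import json
from vosk import Model, KaldiRecognizer

model = Model("path/to/model")
rec = KaldiRecognizer(model, 16000)
rec.SetWords(True)  # include per-word timing in results

def srt_time(t):
    # ad hoc helper: seconds -> "HH:MM:SS,mmm"
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int(t * 1000) % 1000:03}"

res = json.loads(rec.Result())  # after feeding audio with AcceptWaveform(...)
if res.get("result"):
    words = res["result"]
    print(f"1\n{srt_time(words[0]['start'])} --> {srt_time(words[-1]['end'])}")
    print(res["text"] + "\n")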
Vosk scales from small devices like the Raspberry Pi or an Android smartphone to large server clusters.
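At the server end of that range, the companion vosk-server project exposes the same recognizer over a WebSocket. A minimal client sketch, assuming a vosk-server instance listening on ws://localhost:2700 (host, port, and chunk size are illustrative):

import asyncio
import wave
import websockets

async def transcribe(path):
    async with websockets.connect("ws://localhost:2700") as ws:
        wf = wave.open(path, "rb")
        await ws.send('{ "config" : { "sample_rate" : %d } }' % wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            await ws.send(data)
            print(await ws.recv())  # partial or final JSON result per chunk
        await ws.send('{"eof" : 1}')
        print(await ws.recv())      # final result

asyncio.run(transcribe("audio.wav"))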
Documentation
For installation instructions, examples and documentation visit Vosk Website.