alphacep/vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Top Related Projects

  • kaldi-asr/kaldi — the official location of the Kaldi project.

  • mozilla/DeepSpeech — an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

  • coqui-ai/STT — 🐸STT, the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

  • openai/whisper — Robust Speech Recognition via Large-Scale Weak Supervision.

  • julius-speech/julius — Open-Source Large Vocabulary Continuous Speech Recognition Engine.

  • cmusphinx/pocketsphinx — a small speech recognizer.

Quick Overview

Vosk-api is an offline speech recognition toolkit that provides fast and accurate speech-to-text capabilities. It supports multiple languages and can be integrated into various applications and platforms, including mobile devices and embedded systems.

Pros

  • Offline functionality, ensuring privacy and reliability without internet dependency
  • Supports multiple languages and can be easily extended
  • Lightweight and suitable for embedded systems and mobile devices
  • Provides APIs for various programming languages (Python, Java, Node.js, etc.)

Cons

  • May not be as accurate as some cloud-based speech recognition services
  • Requires downloading language models, which can be large files
  • Limited documentation compared to some commercial alternatives
  • May require more setup and configuration than cloud-based solutions

Code Examples

  1. Basic speech recognition in Python:
import json
import wave

import vosk

model = vosk.Model("path/to/model")

with wave.open("audio.wav", "rb") as wf:
    # The recognizer expects mono 16-bit PCM; match its rate to the file's.
    rec = vosk.KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            result = json.loads(rec.Result())
            print(result["text"])

print(json.loads(rec.FinalResult())["text"])
  2. Real-time speech recognition from the microphone:
import json

import pyaudio
import vosk

model = vosk.Model("path/to/model")
rec = vosk.KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

# A live microphone stream never runs dry, so read until interrupted (Ctrl+C).
while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        print(result["text"])
  3. Using Vosk with Node.js:
const vosk = require('vosk');
const fs = require('fs');
const { Readable } = require('stream');
const wav = require('wav');

vosk.setLogLevel(0);
const model = new vosk.Model('path/to/model');

const wfStream = fs.createReadStream('audio.wav');
const wfReader = new wav.Reader();

wfReader.on('format', async ({ audioFormat, sampleRate, channels }) => {
    if (audioFormat != 1 || channels != 1) {
        console.error('Audio file must be WAV format mono PCM.');
        process.exit(1);
    }
    // Create the recognizer with the file's actual sample rate.
    const rec = new vosk.Recognizer({ model: model, sampleRate: sampleRate });
    for await (const data of new Readable().wrap(wfReader)) {
        const endOfSpeech = rec.acceptWaveform(data);
        if (endOfSpeech) {
            // The Node binding returns already-parsed result objects.
            console.log(rec.result().text);
        }
    }
    console.log(rec.finalResult().text);
    rec.free();
    model.free();
});

wfStream.pipe(wfReader);

Getting Started

  1. Install Vosk:

    pip install vosk
    
  2. Download a language model from the Vosk website.

  3. Use the code examples above, replacing "path/to/model" with the actual path to your downloaded model.

  4. For more advanced usage and integration with other languages, refer to the Vosk API documentation.
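
Putting the steps together: newer releases of the Python package can also fetch a model by name on first use and cache it locally, which automates step 2. A minimal end-to-end sketch, assuming a recent vosk release, network access, and a mono PCM WAV file named audio.wav (the model name and file name are placeholders):

import json
import wave

from vosk import Model, KaldiRecognizer

# model_name triggers an automatic download to a local cache in recent
# vosk releases (assumption); older versions need Model("path/to/model").
model = Model(model_name="vosk-model-small-en-us-0.15")

with wave.open("audio.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.AcceptWaveform(wf.readframes(wf.getnframes()))

print(json.loads(rec.FinalResult())["text"])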

Competitor Comparisons

kaldi-asr/kaldi is the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive and flexible toolkit for speech recognition research
  • Extensive documentation and active community support
  • Wider range of acoustic models and language processing tools

Cons of Kaldi

  • Steeper learning curve and more complex setup process
  • Requires more computational resources and expertise to use effectively
  • Less suitable for quick deployment in production environments

Code Comparison

Vosk API (Python):

from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())

Kaldi (Shell script):

#!/bin/bash

. ./path.sh
. ./cmd.sh

mfccdir=mfcc  # output directory for MFCC features

steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir

utils/subset_data_dir.sh --first data/train 10000 data/train_10k

steps/train_mono.sh --nj 20 --cmd "$train_cmd" data/train_10k data/lang exp/mono

Summary

Kaldi offers a more comprehensive toolkit for speech recognition research with extensive documentation and community support. However, it has a steeper learning curve and requires more resources. Vosk API, on the other hand, provides a simpler interface for quick deployment but with fewer advanced features.

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Pros of DeepSpeech

  • More extensive documentation and community support
  • Better performance on English language recognition
  • Supports both streaming and batch (file-based) recognition

Cons of DeepSpeech

  • Larger model size, requiring more computational resources
  • Limited language support compared to Vosk
  • Development has been discontinued by Mozilla

Code Comparison

DeepSpeech:

import deepspeech

model = deepspeech.Model('model.pbmm')
# `audio` must be a 16-bit, 16 kHz mono NumPy array of samples
text = model.stt(audio)

Vosk:

from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)
# Feed chunks of 16 kHz mono PCM, then read the JSON transcription.
rec.AcceptWaveform(data)
result = rec.Result()

Both DeepSpeech and Vosk are open-source speech recognition libraries, but they have different strengths and use cases. DeepSpeech offers robust English language support and can work offline, making it suitable for privacy-focused applications. However, its development has been discontinued, which may impact long-term support and updates.

Vosk, on the other hand, provides support for multiple languages and has a smaller model size, making it more versatile and resource-efficient. It's actively maintained and offers good performance across various languages, but may not match DeepSpeech's accuracy for English recognition.

When choosing between the two, consider factors such as language requirements, computational resources, and the need for ongoing support and updates.

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

Pros of STT

  • More extensive language support, including multi-language models
  • Advanced features like speaker diarization and punctuation
  • Larger community and more frequent updates

Cons of STT

  • Higher resource requirements and slower inference speed
  • More complex setup and integration process
  • Larger model sizes, requiring more storage space

Code Comparison

Vosk API usage:

from vosk import Model, KaldiRecognizer
import wave

model = Model("model")

wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
# Feed the whole file at once, then flush the final result.
rec.AcceptWaveform(wf.readframes(wf.getnframes()))
print(rec.FinalResult())

STT usage:

import numpy as np
import stt

model = stt.Model("model.tflite")
stream = model.createStream()

with open("audio.wav", "rb") as audio:
    while True:
        chunk = audio.read(1024)
        if not chunk:
            break
        # feedAudioContent expects 16-bit PCM samples
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))

print(stream.finishStream())

Both repositories offer speech-to-text capabilities, but they differ in their approach and features. Vosk API is lightweight and efficient, making it suitable for embedded systems and real-time applications. STT, on the other hand, provides more advanced features and language support at the cost of higher resource requirements and complexity.

openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

  • Higher accuracy for speech recognition, especially in challenging conditions
  • Supports multilingual transcription and translation
  • Leverages large-scale pre-training on diverse audio datasets

Cons of Whisper

  • Requires more computational resources and may be slower for real-time applications
  • Larger model size, which can be challenging for deployment on resource-constrained devices
  • Depends on PyTorch, which may not be suitable for all environments

Code Comparison

Vosk API usage:

from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

# Audio processing loop (see the microphone example above)

Whisper usage:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

Both Vosk and Whisper are powerful speech recognition libraries, but they cater to different use cases. Vosk is designed for lightweight, offline, and real-time applications, while Whisper excels in accuracy and multilingual support at the cost of higher computational requirements. The choice between them depends on the specific needs of your project, such as resource constraints, accuracy requirements, and language support.

julius-speech/julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine

Pros of Julius

  • Long-standing project with extensive documentation and research papers
  • Supports real-time decoding and low-latency recognition
  • Highly customizable with various acoustic models and language models

Cons of Julius

  • Primarily focused on Japanese language support
  • Less active development compared to Vosk
  • Steeper learning curve for integration and customization

Code Comparison

Julius:

#include <julius/juliuslib.h>

int main(int argc, char *argv[])
{
    /* Load a Julius configuration file and build a recognizer from it. */
    Jconf *jconf = j_config_load_file_new(argv[1]);
    Recog *recog = j_create_instance_from_jconf(jconf);
    j_adin_init(recog);          /* initialize audio input */
    j_open_stream(recog, NULL);  /* open the input stream */
    j_recognize_stream(recog);   /* run the recognition loop */
    return 0;
}

Vosk:

from vosk import Model, KaldiRecognizer
import pyaudio

model = Model("model")
rec = KaldiRecognizer(model, 16000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())

The code snippets demonstrate the basic setup and usage of each library. Julius uses C and requires more configuration, while Vosk offers a simpler Python interface for quick integration.

cmusphinx/pocketsphinx: A small speech recognizer

Pros of PocketSphinx

  • Longer development history and more established in the speech recognition community
  • Supports a wider range of languages and acoustic models
  • More extensive documentation and academic research backing

Cons of PocketSphinx

  • Generally slower performance compared to Vosk
  • Less active development and updates in recent years
  • More complex setup and integration process

Code Comparison

PocketSphinx (C):

ps_decoder_t *ps = ps_init(&config);  /* config built with cmd_ln_init() */
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];
ps_start_utt(ps);
while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
ps_end_utt(ps);
int32 score;
printf("%s\n", ps_get_hyp(ps, &score));

Vosk (Python):

from vosk import Model, KaldiRecognizer

model = Model("model")
rec = KaldiRecognizer(model, 16000)
# `stream` is a 16 kHz mono PyAudio input stream (see the examples above)
while True:
    data = stream.read(4000)
    if rec.AcceptWaveform(data):
        print(rec.Result())

Both libraries offer speech recognition capabilities, but Vosk provides a more modern and streamlined API with better performance on resource-constrained devices. PocketSphinx offers more flexibility in terms of language support and acoustic models, but at the cost of increased complexity and slower processing speed. Vosk is generally easier to integrate and use, especially for developers new to speech recognition.

README

Vosk Speech Recognition Toolkit

Vosk is an offline open source speech recognition toolkit. It enables speech recognition for 20+ languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish. More to come.

Vosk models are small (about 50 MB) but provide continuous large-vocabulary transcription, zero-latency response with a streaming API, a reconfigurable vocabulary, and speaker identification.
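
The reconfigurable vocabulary is exposed through the recognizer itself: a grammar (a JSON list of phrases) can be passed when the recognizer is created, and partial hypotheses can be polled while audio streams in. A minimal sketch, assuming a downloaded model at path/to/model and an iterable `chunks` of 16 kHz mono PCM byte buffers (both placeholders):

import json

from vosk import Model, KaldiRecognizer

model = Model("path/to/model")
# Passing a JSON list of phrases restricts the vocabulary at run time;
# "[unk]" catches out-of-vocabulary speech.
rec = KaldiRecognizer(model, 16000,
                      '["turn on the light", "turn off the light", "[unk]"]')

# Streaming: feed chunks as they arrive and poll partial hypotheses.
# `chunks` stands in for any source of audio buffers (file, socket, mic).
for data in chunks:
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])
    else:
        print(json.loads(rec.PartialResult())["partial"])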

Speech recognition bindings are implemented for various programming languages, including Python, Java, Node.JS, C#, C++, Rust, Go and others.

Vosk supplies speech recognition for chatbots, smart home appliances, and virtual assistants. It can also be used to create subtitles for movies and transcriptions for lectures and interviews.
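
For subtitle and transcription work, word-level timing matters: SetWords(True) asks the recognizer to include per-word start and end times in its JSON results. A minimal sketch, assuming a mono PCM WAV recording named lecture.wav and a model at path/to/model (both placeholders):

import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("path/to/model")

with wave.open("lecture.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)  # include per-word timestamps in results
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            res = json.loads(rec.Result())
            # Each entry in "result" carries word, start and end (seconds).
            for w in res.get("result", []):
                print(f"{w['start']:.2f} --> {w['end']:.2f}  {w['word']}")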

Vosk scales from small devices like a Raspberry Pi or an Android smartphone to big clusters.
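
At the cluster end of that range, the separate vosk-server project exposes the same recognizer over a WebSocket interface. A hedged client sketch, assuming a vosk-server instance listening on localhost:2700 and the third-party websockets Python package; the protocol details (raw PCM frames in, JSON out, an {"eof": 1} terminator) follow that project's test clients:

import asyncio
import json

import websockets

async def transcribe():
    # Assumption: a vosk-server instance is listening on port 2700.
    async with websockets.connect("ws://localhost:2700") as ws:
        with open("audio.wav", "rb") as f:
            while chunk := f.read(8000):
                await ws.send(chunk)                # stream raw PCM bytes
                print(json.loads(await ws.recv()))  # partial/final JSON
        await ws.send('{"eof" : 1}')                # signal end of stream
        print(json.loads(await ws.recv()))          # final result

asyncio.run(transcribe())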

Documentation

For installation instructions, examples, and documentation, visit the Vosk website.