Top Related Projects
- Vosk (alphacep/vosk-api): Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- DeepSpeech (mozilla/DeepSpeech): an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers
- Kaldi (kaldi-asr/kaldi): the official location of the Kaldi project
- Julius (julius-speech/julius): Open-Source Large Vocabulary Continuous Speech Recognition Engine
- 🐸STT (coqui-ai/STT): the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
- Silero Models (snakers4/silero-models): pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Quick Overview
PocketSphinx is an open-source, lightweight speech recognition engine specifically designed for mobile and embedded devices. It is part of the CMU Sphinx toolkit and provides a flexible, fast, and accurate solution for speech recognition tasks in resource-constrained environments.
Pros
- Lightweight and efficient, suitable for mobile and embedded devices
- Supports multiple languages and can be easily adapted to new ones
- Offers both offline and real-time speech recognition capabilities
- Provides a flexible API for integration into various applications
Cons
- May have lower accuracy compared to more resource-intensive speech recognition systems
- Limited support for handling background noise and multiple speakers
- Requires manual acoustic model training for optimal performance in specific domains
- Documentation can be sparse or outdated in some areas
Code Examples
- Basic speech recognition from an audio file:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(model_path=model_path)
ps.decode(audio_file='path/to/audio.wav')
print(ps.hypothesis())
- Continuous speech recognition from microphone input:
from pocketsphinx import LiveSpeech
for phrase in LiveSpeech():
    print(phrase)
- Customizing the language model:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(
    model_path=model_path,
    lm='path/to/custom_language_model.lm',
    dict='path/to/custom_dictionary.dict'
)
ps.decode(audio_file='path/to/audio.wav')
print(ps.hypothesis())
Getting Started
To get started with PocketSphinx, follow these steps:
- Install PocketSphinx:
pip install pocketsphinx
- Download the necessary acoustic model and language model files from the CMU Sphinx website.
- Use the following basic code to perform speech recognition:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(model_path=model_path)
ps.decode(audio_file='path/to/your/audio.wav')
print(ps.hypothesis())
Replace 'path/to/your/audio.wav' with the path to your audio file. This will output the recognized speech from the audio file.
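The default models expect single-channel, 16-bit, 16 kHz audio. If your recording is in another format, converting it first with a tool such as sox should help; the command below is only an illustration, with placeholder filenames:
sox input.mp3 -r 16000 -c 1 -b 16 path/to/your/audio.wav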
Competitor Comparisons
Vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of Vosk-api
- Better performance and accuracy, especially for non-English languages
- Supports streaming recognition, allowing real-time processing
- Smaller model sizes, making it more suitable for mobile and embedded devices
Cons of Vosk-api
- Less mature and less widely adopted compared to PocketSphinx
- Fewer language models available out-of-the-box
- Limited documentation and community support
Code Comparison
PocketSphinx:
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];
while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
Vosk-api:
from vosk import Model, KaldiRecognizer
import sys
import os
import wave
wf = wave.open(sys.argv[1], "rb")
rec = KaldiRecognizer(Model("model"), wf.getframerate())
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
The code examples show that Vosk-api takes a more modern, Python-based approach, while the PocketSphinx snippet uses its lower-level C API (PocketSphinx also provides Python bindings, as shown in the Code Examples above). Vosk-api's code is concise and easy to read, which can lead to faster development and easier integration.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- Higher accuracy in speech recognition, especially for complex or noisy audio
- Better support for multiple languages and accents
- Utilizes deep learning techniques, potentially offering more advanced features
Cons of DeepSpeech
- Requires more computational resources and memory
- Longer processing time for speech recognition tasks
- Steeper learning curve for implementation and customization
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];
while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
DeepSpeech (Python):
ds = Model(model_path)
fs, audio = wav.read(audio_file)
data = np.frombuffer(audio, np.int16)
text = ds.stt(data)
print(text)
Both repositories offer speech recognition capabilities, but they differ in their approach and implementation. PocketSphinx is lightweight and suitable for embedded systems, while DeepSpeech provides higher accuracy at the cost of increased resource requirements. The code examples contrast PocketSphinx's lower-level C API with DeepSpeech's more concise Python interface.
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More advanced and flexible acoustic modeling techniques
- Better performance on large vocabulary tasks
- Active development and community support
Cons of Kaldi
- Steeper learning curve and more complex setup
- Higher computational requirements
- Less suitable for embedded or resource-constrained devices
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(&config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL);
Kaldi (C++):
OnlineNnet2FeaturePipeline feature_pipeline(feature_info);
SingleUtteranceNnet3Decoder decoder(decoder_opts, trans_model, *am_nnet, *decode_fst, &feature_pipeline);
decoder.InitDecoding();
decoder.AdvanceDecoding();
decoder.FinalizeDecoding();
CompactLattice clat;
decoder.GetLattice(true, &clat);
Both Pocketsphinx and Kaldi are open-source speech recognition toolkits, but they cater to different use cases. Pocketsphinx is lightweight and suitable for embedded systems, while Kaldi offers more advanced features and better performance for large-scale speech recognition tasks. The code comparison illustrates the difference in complexity, with Kaldi requiring more setup but providing more flexibility in its decoding process.
Julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
Pros of Julius
- Higher recognition accuracy for Japanese language
- More flexible acoustic model adaptation capabilities
- Supports real-time recognition with low latency
Cons of Julius
- Less extensive documentation and community support
- Fewer pre-trained models available for languages other than Japanese
- Steeper learning curve for non-Japanese speakers
Code Comparison
Julius:
JCONF *jconf = j_config_load_file_new(jconf_filename);
Recog *recog = j_create_instance_from_jconf(jconf);
j_adin_init(recog);
j_recognize_stream(recog);
PocketSphinx:
ps_config_t *config = ps_config_init(NULL);
ps_decoder_t *ps = ps_init(config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
Both Julius and PocketSphinx are open-source speech recognition engines, but they have different strengths and use cases. Julius excels in Japanese speech recognition and offers more advanced acoustic model adaptation, while PocketSphinx provides better support for multiple languages and has a larger community. The code examples show that both libraries have similar initialization and recognition processes, but with different function names and structures.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Pros of STT
- Utilizes deep learning techniques for improved accuracy
- Supports multiple languages and accents
- Offers pre-trained models for quick deployment
Cons of STT
- Requires more computational resources
- May have longer processing times for real-time applications
- Steeper learning curve for customization
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(&config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL);
STT (Python):
model = Model("model.pbmm")
stream = model.createStream()
stream.feedAudioContent(audio_buffer)
text = stream.finishStream()
Both libraries provide APIs for speech recognition, but STT offers a more high-level interface with pre-trained models, while PocketSphinx requires more low-level configuration. STT's Python-based approach may be more accessible for many developers, whereas PocketSphinx's C implementation could offer better performance in resource-constrained environments.
Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Pros of Silero-models
- Supports multiple languages and provides pre-trained models
- Offers both speech-to-text and text-to-speech capabilities
- Designed for easy integration with popular deep learning frameworks
Cons of Silero-models
- Requires more computational resources due to neural network architecture
- May have a steeper learning curve for users new to deep learning
- Less extensive documentation compared to PocketSphinx
Code Comparison
PocketSphinx (C):
ps_decoder_t *decoder = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
ps_decode_raw(decoder, fh, -1);
char const *hyp = ps_get_hyp(decoder, NULL);
Silero-models (Python):
import torch
# torch.hub.load returns the model, a CTC decoder, and helper utilities
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language='en')
read_batch, split_into_batches, read_audio, prepare_model_input = utils
batches = split_into_batches(['audio.wav'], batch_size=1)
output = model(prepare_model_input(read_batch(batches[0])))
print(decoder(output[0].cpu()))
Both repositories offer speech recognition capabilities, but they differ in approach and implementation. PocketSphinx is a lightweight, C-based solution suitable for embedded systems, while Silero-models leverages deep learning techniques and provides more flexibility in terms of supported languages and features.
README
PocketSphinx 5.0.3
This is PocketSphinx, one of Carnegie Mellon University's open source large vocabulary, speaker-independent continuous speech recognition engines.
Although this was at one point a research system, active development has largely ceased and it has become very, very far from the state of the art. I am making a release, because people are nonetheless using it, and there are a number of historical errors in the build system and API which needed to be corrected.
The version number is strangely large because there was a "release" that people are using called 5prealpha, and we will use proper semantic versioning from now on.
Please see the LICENSE file for terms of use.
Installation
We now use CMake for building, which should give reasonable results across Linux and Windows. Not certain about Mac OS X because I don't have one of those. In addition, the audio library, which never really built or worked correctly on any platform at all, has simply been removed.
There is no longer any dependency on SphinxBase. There is no SphinxBase anymore. This is not the SphinxBase you're looking for. All your SphinxBase are belong to us.
To install the Python module in a virtual environment (replace ~/ve_pocketsphinx with the virtual environment you wish to create), from the top level directory:
python3 -m venv ~/ve_pocketsphinx
. ~/ve_pocketsphinx/bin/activate
pip install .
To install the C library and bindings (assuming you have access to /usr/local - if not, use -DCMAKE_INSTALL_PREFIX to set a different prefix in the first cmake command below):
cmake -S . -B build
cmake --build build
cmake --build build --target install
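For example, to install under your home directory rather than /usr/local, the first command might look like this (the prefix path is only an illustration):
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local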
Usage
The pocketsphinx command-line program reads single-channel 16-bit PCM audio from standard input or one or more files, and attempts to recognize speech in it using the default acoustic and language model. It accepts a large number of options which you probably don't care about, a command which defaults to live, and one or more inputs (except in align mode), or - to read from standard input.
If you have a single-channel WAV file called "speech.wav" and you want to recognize speech in it, you can try doing this (the results may not be wonderful):
pocketsphinx single speech.wav
If your input is in some other format I suggest converting it with sox as described below.
The commands are as follows:
- help: Print a long list of those options you don't care about.
- config: Dump configuration as JSON to standard output (can be loaded with the -config option).
- live: Detect speech segments in each input, run recognition on them (using those options you don't care about), and write the results to standard output in line-delimited JSON. I realize this isn't the prettiest format, but it sure beats XML. Each line contains a JSON object with these fields, which have short names to make the lines more readable (a small Python sketch for reading this output follows the list below):
  - b: Start time in seconds, from the beginning of the stream
  - d: Duration in seconds
  - p: Estimated probability of the recognition result, i.e. a number between 0 and 1 representing the likelihood of the input according to the model
  - t: Full text of recognition result
  - w: List of segments (usually words), each of which in turn contains the b, d, p, and t fields, for start, end, probability, and the text of the word. If -phone_align yes has been passed, then a w field will be present containing phone segmentations, in the same format.
- single: Recognize each input as a single utterance, and write a JSON object in the same format described above.
- align: Align a single input file (or - for standard input) to a word sequence, and write a JSON object in the same format described above. The first positional argument is the input, and all subsequent ones are concatenated to make the text, to avoid surprises if you forget to quote it. You are responsible for normalizing the text to remove punctuation, uppercase, centipedes, etc. For example:
  pocketsphinx align goforward.wav "go forward ten meters"
  By default, only word-level alignment is done. To get phone alignments, pass -phone_align yes in the flags, e.g.:
  pocketsphinx -phone_align yes align audio.wav $text
  This will make not particularly readable output, but you can use jq to clean it up. For example, you can get just the word names and start times like this:
  pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'
  Or you could get the phone names and durations like this:
  pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'
  There are many, many other possibilities, of course.
- soxflags: Return arguments to sox which will create the appropriate input format. Note that because the sox command-line is slightly quirky these must always come after the filename or -d (which tells sox to read from the microphone). You can run live recognition like this:
  sox -d $(pocketsphinx soxflags) | pocketsphinx -
  or decode from a file named "audio.mp3" like this:
  sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -
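As promised above, here is a minimal Python sketch (not part of PocketSphinx itself, just an illustration; the script name is made up) that reads the line-delimited JSON produced by the live or single commands from standard input and prints each result with its word timings, using the field names documented above:
import json
import sys
# Pipe the recognizer into this script, for example:
#   sox -d $(pocketsphinx soxflags) | pocketsphinx - | python3 show_results.py
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    result = json.loads(line)
    # b = start time (seconds), d = duration (seconds), t = full hypothesis text
    print(f"[{result['b']:.2f}s + {result['d']:.2f}s] {result['t']}")
    # w = list of word segments, with the same short field names
    for word in result.get("w", []):
        print(f"    {word['b']:8.2f}s  {word['t']}")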
By default only errors are printed to standard error, but if you want more information you can pass -loglevel INFO. Partial results are not printed, maybe they will be in the future, but don't hold your breath.
Programming
For programming, see the examples directory for a number of examples of using the library from C and Python. You can also read the documentation for the Python API or the C API.
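To give a flavour of the Python API, here is a rough sketch based on the documented Decoder class (treat it as an outline and defer to the API documentation and the examples directory for authoritative code) that decodes a single-channel, 16-bit, 16 kHz raw file such as the goforward.raw used in the snippets above:
from pocketsphinx import Decoder
# Default configuration: bundled US English model, 16 kHz 16-bit mono input
decoder = Decoder()
decoder.start_utt()
with open("goforward.raw", "rb") as fh:
    while True:
        buf = fh.read(2048)
        if not buf:
            break
        # Feed raw PCM samples to the recognizer
        decoder.process_raw(buf, no_search=False, full_utt=False)
decoder.end_utt()
hyp = decoder.hyp()
print(hyp.hypstr if hyp is not None else "(no hypothesis)")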
Authors
PocketSphinx is ultimately based on Sphinx-II, which in turn was based on some older systems at Carnegie Mellon University, which were released as free software under a BSD-like license thanks to the efforts of Kevin Lenzo. Much of the decoder in particular was written by Ravishankar Mosur (look for "rkm" in the comments), but various other people contributed as well; see the AUTHORS file for more details.
David Huggins-Daines (the author of this document) is responsible for creating PocketSphinx, which added various speed and memory optimizations, fixed-point computation, JSGF support, portability to various platforms, and a somewhat coherent API. He then disappeared for a while.
Nickolay Shmyrev took over maintenance for quite a long time afterwards, and a lot of code was contributed by Alexander Solovets, Vyacheslav Klimkov, and others.
Currently this is maintained by David Huggins-Daines again.