Top Related Projects
- Vosk (alphacep/vosk-api): Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
- DeepSpeech (mozilla/DeepSpeech): an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers
- Kaldi (kaldi-asr/kaldi): the official location of the Kaldi project
- Julius (julius-speech/julius): Open-Source Large Vocabulary Continuous Speech Recognition Engine
- 🐸STT (coqui-ai/STT): the deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
- Silero Models (snakers4/silero-models): pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Quick Overview
PocketSphinx is an open-source, lightweight speech recognition engine specifically designed for mobile and embedded devices. It is part of the CMU Sphinx toolkit and provides a flexible, fast, and accurate solution for speech recognition tasks in resource-constrained environments.
Pros
- Lightweight and efficient, suitable for mobile and embedded devices
- Supports multiple languages and can be easily adapted to new ones
- Offers both offline and real-time speech recognition capabilities
- Provides a flexible API for integration into various applications
Cons
- May have lower accuracy compared to more resource-intensive speech recognition systems
- Limited support for handling background noise and multiple speakers
- Requires manual acoustic model training for optimal performance in specific domains
- Documentation can be sparse or outdated in some areas
Code Examples
- Basic speech recognition from an audio file:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(model_path=model_path)
ps.decode(audio_file='path/to/audio.wav')
print(ps.hypothesis())
- Continuous speech recognition from microphone input:
from pocketsphinx import LiveSpeech
for phrase in LiveSpeech():
    print(phrase)
- Customizing the language model:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(
    model_path=model_path,
    lm='path/to/custom_language_model.lm',
    dict='path/to/custom_dictionary.dict'
)
ps.decode(audio_file='path/to/audio.wav')
print(ps.hypothesis())
Getting Started
To get started with PocketSphinx, follow these steps:
- Install PocketSphinx:
pip install pocketsphinx
- Download the necessary acoustic model and language model files from the CMU Sphinx website.
- Use the following basic code to perform speech recognition:
from pocketsphinx import Pocketsphinx, get_model_path
model_path = get_model_path()
ps = Pocketsphinx(model_path=model_path)
ps.decode(audio_file='path/to/your/audio.wav')
print(ps.hypothesis())
Replace 'path/to/your/audio.wav' with the path to your audio file. This will output the recognized speech from the audio file.
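The default models expect single-channel, 16-bit, 16 kHz audio. If your recording is in another format, converting it first with a tool such as sox should help; the command below is only an illustration, with placeholder filenames:
sox input.mp3 -r 16000 -c 1 -b 16 path/to/your/audio.wav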
Competitor Comparisons
Vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of Vosk-api
- Better performance and accuracy, especially for non-English languages
- Supports streaming recognition, allowing real-time processing
- Smaller model sizes, making it more suitable for mobile and embedded devices
Cons of Vosk-api
- Less mature and less widely adopted compared to PocketSphinx
- Fewer language models available out-of-the-box
- Limited documentation and community support
Code Comparison
PocketSphinx:
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];
while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
Vosk-api:
from vosk import Model, KaldiRecognizer
import sys
import os
import wave
wf = wave.open(sys.argv[1], "rb")
rec = KaldiRecognizer(Model("model"), wf.getframerate())
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
The code examples show that Vosk-api takes a more modern, Python-based approach, while the PocketSphinx snippet uses its lower-level C API (PocketSphinx also provides Python bindings, as shown in the Code Examples above). Vosk-api's code is concise and easy to read, which can lead to faster development and easier integration.
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- Higher accuracy in speech recognition, especially for complex or noisy audio
- Better support for multiple languages and accents
- Utilizes deep learning techniques, potentially offering more advanced features
Cons of DeepSpeech
- Requires more computational resources and memory
- Longer processing time for speech recognition tasks
- Steeper learning curve for implementation and customization
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
int16 buf[512];
while (!feof(fh)) {
    size_t nsamp = fread(buf, 2, 512, fh);
    ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
}
DeepSpeech (Python):
ds = Model(model_path)
fs, audio = wav.read(audio_file)
data = np.frombuffer(audio, np.int16)
text = ds.stt(data)
print(text)
Both repositories offer speech recognition capabilities, but they differ in their approach and implementation. PocketSphinx is lightweight and suitable for embedded systems, while DeepSpeech provides higher accuracy at the cost of increased resource requirements. The code examples contrast PocketSphinx's lower-level C API with DeepSpeech's more concise Python interface.
kaldi-asr/kaldi is the official location of the Kaldi project.
Pros of Kaldi
- More advanced and flexible acoustic modeling techniques
- Better performance on large vocabulary tasks
- Active development and community support
Cons of Kaldi
- Steeper learning curve and more complex setup
- Higher computational requirements
- Less suitable for embedded or resource-constrained devices
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(&config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL);
Kaldi (C++):
OnlineNnet2FeaturePipeline feature_pipeline(feature_info);
SingleUtteranceNnet3Decoder decoder(decoder_opts, trans_model, *am_nnet, *decode_fst, &feature_pipeline);
decoder.InitDecoding();
decoder.AdvanceDecoding();
decoder.FinalizeDecoding();
CompactLattice clat;
decoder.GetLattice(true, &clat);
Both Pocketsphinx and Kaldi are open-source speech recognition toolkits, but they cater to different use cases. Pocketsphinx is lightweight and suitable for embedded systems, while Kaldi offers more advanced features and better performance for large-scale speech recognition tasks. The code comparison illustrates the difference in complexity, with Kaldi requiring more setup but providing more flexibility in its decoding process.
Julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
Pros of Julius
- Higher recognition accuracy for Japanese language
- More flexible acoustic model adaptation capabilities
- Supports real-time recognition with low latency
Cons of Julius
- Less extensive documentation and community support
- Fewer pre-trained models available for languages other than Japanese
- Steeper learning curve for non-Japanese speakers
Code Comparison
Julius:
JCONF *jconf = j_config_load_file_new(jconf_filename);
Recog *recog = j_create_instance_from_jconf(jconf);
j_adin_init(recog);
j_recognize_stream(recog);
PocketSphinx:
ps_config_t *config = ps_config_init(NULL);
ps_decoder_t *ps = ps_init(config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
Both Julius and PocketSphinx are open-source speech recognition engines, but they have different strengths and use cases. Julius excels in Japanese speech recognition and offers more advanced acoustic model adaptation, while PocketSphinx provides better support for multiple languages and has a larger community. The code examples show that both libraries have similar initialization and recognition processes, but with different function names and structures.
🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Pros of STT
- Utilizes deep learning techniques for improved accuracy
- Supports multiple languages and accents
- Offers pre-trained models for quick deployment
Cons of STT
- Requires more computational resources
- May have longer processing times for real-time applications
- Steeper learning curve for customization
Code Comparison
PocketSphinx (C):
ps_decoder_t *ps = ps_init(&config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL);
STT (Python):
model = Model("model.pbmm")
stream = model.createStream()
stream.feedAudioContent(audio_buffer)
text = stream.finishStream()
Both libraries provide APIs for speech recognition, but STT offers a more high-level interface with pre-trained models, while PocketSphinx requires more low-level configuration. STT's Python-based approach may be more accessible for many developers, whereas PocketSphinx's C implementation could offer better performance in resource-constrained environments.
Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Pros of Silero-models
- Supports multiple languages and provides pre-trained models
- Offers both speech-to-text and text-to-speech capabilities
- Designed for easy integration with popular deep learning frameworks
Cons of Silero-models
- Requires more computational resources due to neural network architecture
- May have a steeper learning curve for users new to deep learning
- Less extensive documentation compared to PocketSphinx
Code Comparison
PocketSphinx (C):
ps_decoder_t *decoder = ps_init(config);
FILE *fh = fopen("goforward.raw", "rb");
ps_decode_raw(decoder, fh, -1);
char const *hyp = ps_get_hyp(decoder, NULL);
Silero-models (Python):
import torch
# torch.hub.load returns the model, a CTC decoder, and helper utilities
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_stt', language='en')
read_batch, split_into_batches, read_audio, prepare_model_input = utils
batches = split_into_batches(['audio.wav'], batch_size=1)
output = model(prepare_model_input(read_batch(batches[0])))
print(decoder(output[0].cpu()))
Both repositories offer speech recognition capabilities, but they differ in approach and implementation. PocketSphinx is a lightweight, C-based solution suitable for embedded systems, while Silero-models leverages deep learning techniques and provides more flexibility in terms of supported languages and features.
README
PocketSphinx 5.0.3
This is PocketSphinx, one of Carnegie Mellon University's open source large vocabulary, speaker-independent continuous speech recognition engines.
Although this was at one point a research system, active development has largely ceased and it has become very, very far from the state of the art. I am making a release, because people are nonetheless using it, and there are a number of historical errors in the build system and API which needed to be corrected.
The version number is strangely large because there was a "release" that people are using called 5prealpha, and we will use proper semantic versioning from now on.
Please see the LICENSE file for terms of use.
Installation
We now use CMake for building, which should give reasonable results across Linux and Windows. Not certain about Mac OS X because I don't have one of those. In addition, the audio library, which never really built or worked correctly on any platform at all, has simply been removed.
There is no longer any dependency on SphinxBase. There is no SphinxBase anymore. This is not the SphinxBase you're looking for. All your SphinxBase are belong to us.
To install the Python module in a virtual environment (replace ~/ve_pocketsphinx with the virtual environment you wish to create), from the top level directory:
python3 -m venv ~/ve_pocketsphinx
. ~/ve_pocketsphinx/bin/activate
pip install .
To install the C library and bindings (assuming you have access to /usr/local - if not, use -DCMAKE_INSTALL_PREFIX to set a different prefix in the first cmake command below):
cmake -S . -B build
cmake --build build
cmake --build build --target install
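For example, to install under your home directory rather than /usr/local, the first command might look like this (the prefix path is only an illustration):
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local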
Usage
The pocketsphinx command-line program reads single-channel 16-bit PCM audio from standard input or one or more files, and attempts to recognize speech in it using the default acoustic and language model. It accepts a large number of options which you probably don't care about, a command which defaults to live, and one or more inputs (except in align mode), or - to read from standard input.
If you have a single-channel WAV file called "speech.wav" and you want to recognize speech in it, you can try doing this (the results may not be wonderful):
pocketsphinx single speech.wav
If your input is in some other format I suggest converting it with sox as described below.
The commands are as follows:
- help: Print a long list of those options you don't care about.
- config: Dump configuration as JSON to standard output (can be loaded with the -config option).
- live: Detect speech segments in each input, run recognition on them (using those options you don't care about), and write the results to standard output in line-delimited JSON. I realize this isn't the prettiest format, but it sure beats XML. Each line contains a JSON object with these fields, which have short names to make the lines more readable (a small Python sketch for reading this output follows the list below):
  - b: Start time in seconds, from the beginning of the stream
  - d: Duration in seconds
  - p: Estimated probability of the recognition result, i.e. a number between 0 and 1 representing the likelihood of the input according to the model
  - t: Full text of recognition result
  - w: List of segments (usually words), each of which in turn contains the b, d, p, and t fields, for start, end, probability, and the text of the word. If -phone_align yes has been passed, then a w field will be present containing phone segmentations, in the same format.
- single: Recognize each input as a single utterance, and write a JSON object in the same format described above.
- align: Align a single input file (or - for standard input) to a word sequence, and write a JSON object in the same format described above. The first positional argument is the input, and all subsequent ones are concatenated to make the text, to avoid surprises if you forget to quote it. You are responsible for normalizing the text to remove punctuation, uppercase, centipedes, etc. For example:
  pocketsphinx align goforward.wav "go forward ten meters"
  By default, only word-level alignment is done. To get phone alignments, pass -phone_align yes in the flags, e.g.:
  pocketsphinx -phone_align yes align audio.wav $text
  This will make not particularly readable output, but you can use jq to clean it up. For example, you can get just the word names and start times like this:
  pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'
  Or you could get the phone names and durations like this:
  pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'
  There are many, many other possibilities, of course.
- soxflags: Return arguments to sox which will create the appropriate input format. Note that because the sox command-line is slightly quirky these must always come after the filename or -d (which tells sox to read from the microphone). You can run live recognition like this:
  sox -d $(pocketsphinx soxflags) | pocketsphinx -
  or decode from a file named "audio.mp3" like this:
  sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -
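As promised above, here is a minimal Python sketch (not part of PocketSphinx itself, just an illustration; the script name is made up) that reads the line-delimited JSON produced by the live or single commands from standard input and prints each result with its word timings, using the field names documented above:
import json
import sys
# Pipe the recognizer into this script, for example:
#   sox -d $(pocketsphinx soxflags) | pocketsphinx - | python3 show_results.py
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    result = json.loads(line)
    # b = start time (seconds), d = duration (seconds), t = full hypothesis text
    print(f"[{result['b']:.2f}s + {result['d']:.2f}s] {result['t']}")
    # w = list of word segments, with the same short field names
    for word in result.get("w", []):
        print(f"    {word['b']:8.2f}s  {word['t']}")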
By default only errors are printed to standard error, but if you want more information you can pass -loglevel INFO. Partial results are not printed, maybe they will be in the future, but don't hold your breath.
Programming
For programming, see the examples directory for a number of examples of using the library from C and Python. You can also read the documentation for the Python API or the C API.
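To give a flavour of the Python API, here is a rough sketch based on the documented Decoder class (treat it as an outline and defer to the API documentation and the examples directory for authoritative code) that decodes a single-channel, 16-bit, 16 kHz raw file such as the goforward.raw used in the snippets above:
from pocketsphinx import Decoder
# Default configuration: bundled US English model, 16 kHz 16-bit mono input
decoder = Decoder()
decoder.start_utt()
with open("goforward.raw", "rb") as fh:
    while True:
        buf = fh.read(2048)
        if not buf:
            break
        # Feed raw PCM samples to the recognizer
        decoder.process_raw(buf, no_search=False, full_utt=False)
decoder.end_utt()
hyp = decoder.hyp()
print(hyp.hypstr if hyp is not None else "(no hypothesis)")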
Authors
PocketSphinx is ultimately based on Sphinx-II, which in turn was based on some older systems at Carnegie Mellon University, which were released as free software under a BSD-like license thanks to the efforts of Kevin Lenzo. Much of the decoder in particular was written by Ravishankar Mosur (look for "rkm" in the comments), but various other people contributed as well; see the AUTHORS file for more details.
David Huggins-Daines (the author of this document) is responsible for creating PocketSphinx, which added various speed and memory optimizations, fixed-point computation, JSGF support, portability to various platforms, and a somewhat coherent API. He then disappeared for a while.
Nickolay Shmyrev took over maintenance for quite a long time afterwards, and a lot of code was contributed by Alexander Solovets, Vyacheslav Klimkov, and others.
Currently this is maintained by David Huggins-Daines again.