julius-speech/julius

Open-Source Large Vocabulary Continuous Speech Recognition Engine

Top Related Projects

  • PocketSphinx: A small speech recognizer
  • Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
  • Kaldi: kaldi-asr/kaldi is the official location of the Kaldi project
  • DeepSpeech: An open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers
  • 🐸STT: The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
  • ESPnet: End-to-End Speech Processing Toolkit

Quick Overview

Julius is an open-source large vocabulary continuous speech recognition (LVCSR) engine. It is designed for research and development purposes, offering high recognition performance and flexibility. Julius supports various languages and can be used for both real-time and offline speech recognition tasks.

Pros

  • High recognition accuracy and performance
  • Supports multiple languages and acoustic models
  • Flexible and customizable for various speech recognition tasks
  • Active development and community support

Cons

  • Steep learning curve for beginners
  • Limited documentation in English
  • Requires significant computational resources for large vocabulary tasks
  • May require additional tools and resources for optimal performance

Code Examples

  1. Basic recognition from an audio file:
#include <julius/juliuslib.h>

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;

    /* load settings; the jconf should contain "-input rawfile"
       so that j_open_stream() below accepts an audio file name */
    jconf = j_config_load_file_new("julius.jconf");
    recog = j_create_instance_from_jconf(jconf);

    if (j_adin_init(recog) == FALSE) {
        fprintf(stderr, "Error: failed to initialize audio input\n");
        return -1;
    }

    if (j_open_stream(recog, "input.wav") < 0) {
        fprintf(stderr, "Error: failed to open input.wav\n");
        return -1;
    }
    j_recognize_stream(recog);

    j_recog_free(recog);
    return 0;
}
  2. Real-time recognition from microphone input:
#include <julius/juliuslib.h>

static void
output_result(Recog *recog, void *dummy)
{
    RecogProcess *r;
    Sentence *s;
    WORD_INFO *winfo;
    int i;

    for (r = recog->process_list; r; r = r->next) {
        if (r->result.status != J_RESULT_STATUS_SUCCESS) continue;
        winfo = r->lm->winfo;        /* word dictionary of this process */
        s = &(r->result.sent[0]);    /* best hypothesis */
        for (i = 0; i < s->word_num; i++) {
            /* map each word ID to its output string */
            printf("%s ", winfo->woutput[s->word[i]]);
        }
        printf("\n");
    }
}

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;

    /* the jconf should contain "-input mic" for microphone input */
    jconf = j_config_load_file_new("julius.jconf");
    recog = j_create_instance_from_jconf(jconf);

    callback_add(recog, CALLBACK_RESULT, output_result, NULL);

    if (j_adin_init(recog) == FALSE) {
        fprintf(stderr, "Error: failed to initialize audio input\n");
        return -1;
    }

    if (j_open_stream(recog, NULL) < 0) {
        fprintf(stderr, "Error: failed to open audio stream\n");
        return -1;
    }

    j_recognize_stream(recog);

    j_close_stream(recog);
    j_recog_free(recog);
    return 0;
}
  3. Using a custom language model:
#include <julius/juliuslib.h>

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;

    jconf = j_config_load_file_new("julius.jconf");
    /* override the LM: "-nlr" loads a forward ARPA N-gram
       ("-d" would load a binary N-gram instead) */
    j_config_load_string(jconf, "-nlr custom_language_model.arpa");
    recog = j_create_instance_from_jconf(jconf);

    // ... (rest of the recognition code)

    j_recog_free(recog);
    return 0;
}

Getting Started

  1. Install Julius and its dependencies:

    git clone https://github.com/julius-speech/julius.git
    cd julius
    ./configure
    make
    sudo make install
    
  2. Prepare a configuration file (julius.jconf) with appropriate settings for your task; a minimal sketch is shown after these steps.

  3. Compile and run your recognition program:

    gcc -o recognition_program recognition_program.c `libjulius-config --cflags --libs` `libsent-config --cflags --libs`
    ./recognition_program
    
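As a rough starting point, the following is a minimal sketch of a julius.jconf for N-gram recognition of audio files. It is an illustrative template, not an official sample: every model file name below is a placeholder for your own acoustic model, HMMList, N-gram, and dictionary.

    # minimal jconf sketch: replace each model file name with your own
    -input rawfile
    # acoustic model in HTK ascii format, plus its HMMList
    -h hmmdefs
    -hlist allophones.list
    # word N-gram in Julius binary format (use -nlr/-nrl for ARPA files)
    -d ngram.bingram
    # pronunciation dictionary
    -v dictionary.dict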

Competitor Comparisons

PocketSphinx: A small speech recognizer

Pros of PocketSphinx

  • More active development with frequent updates and contributions
  • Better support for mobile and embedded devices
  • Wider language support out-of-the-box

Cons of PocketSphinx

  • Steeper learning curve for beginners
  • Less extensive documentation compared to Julius

Code Comparison

PocketSphinx:

/* "config" is a cmd_ln_t prepared beforehand with cmd_ln_init() */
ps_decoder_t *ps = ps_init(config);
ps_start_utt(ps);
ps_process_raw(ps, buffer, n_samples, FALSE, FALSE);
ps_end_utt(ps);
const char *hyp = ps_get_hyp(ps, NULL);

Julius:

Jconf *jconf = j_config_load_file_new("julius.jconf");
Recog *recog = j_create_instance_from_jconf(jconf);
callback_add(recog, CALLBACK_RESULT, output_result, NULL);  /* results arrive via user callback */
j_adin_init(recog);
j_open_stream(recog, NULL);
j_recognize_stream(recog);

Both libraries offer C APIs for speech recognition, but PocketSphinx's API is more concise and straightforward. Julius requires more setup and configuration, which can be beneficial for advanced users but may be overwhelming for beginners.

PocketSphinx is better suited for mobile and embedded applications, while Julius excels in desktop environments and offers more customization options. PocketSphinx has broader language support, making it a good choice for multilingual projects. However, Julius provides more comprehensive documentation, which can be advantageous for developers new to speech recognition.

Vosk: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Pros of Vosk

  • Supports multiple languages and acoustic models out of the box
  • Offers bindings for various programming languages (Python, Java, Node.js, etc.)
  • Designed for real-time, streaming speech recognition

Cons of Vosk

  • Larger model size and potentially higher resource usage
  • Less customizable for specific use cases compared to Julius
  • Newer project with potentially less mature ecosystem

Code Comparison

Julius:

Jconf *jconf = j_config_load_file_new("julius.jconf");
Recog *recog = j_create_instance_from_jconf(jconf);
j_adin_init(recog);
j_open_stream(recog, NULL);
j_recognize_stream(recog);

Vosk:

from vosk import Model, KaldiRecognizer
model = Model("model")
rec = KaldiRecognizer(model, 16000)
while True:
    data = stream.read(4000)  # "stream": a 16 kHz, 16-bit mono audio source (e.g. PyAudio)
    if rec.AcceptWaveform(data):
        print(rec.Result())

Summary

Julius is a more established, highly customizable speech recognition engine, while Vosk offers easier multi-language support and integration with various programming languages. Julius may be better for specific, fine-tuned applications, whereas Vosk is more suitable for quick implementation of multilingual speech recognition tasks.


Kaldi: kaldi-asr/kaldi is the official location of the Kaldi project.

Pros of Kaldi

  • More comprehensive and feature-rich toolkit for speech recognition
  • Supports state-of-the-art deep learning techniques
  • Larger and more active community, with frequent updates and contributions

Cons of Kaldi

  • Steeper learning curve and more complex setup process
  • Higher computational requirements for training and decoding
  • Less suitable for lightweight or embedded applications

Code Comparison

Julius:

int main(int argc, char *argv[])
{
    Jconf *jconf;
    Recog *recog;
    jconf = j_config_load_args_new(argc, argv);
    recog = j_create_instance_from_jconf(jconf);
    j_adin_init(recog);
    j_open_stream(recog, NULL);
    j_recognize_stream(recog);
    j_recog_free(recog);
    return 0;
}

Kaldi:

int main(int argc, char *argv[]) {
    using namespace kaldi;
    try {
        const char *usage = "Decode features using GMM-based model.\n";
        ParseOptions po(usage);
        po.Read(argc, argv);
        // (loading the model and building fst, config and
        //  decodable are elided in this sketch)
        LatticeFasterDecoder decoder(fst, config);
        decoder.Decode(&decodable);
        return 0;
    } catch(const std::exception &e) {
        std::cerr << e.what() << '\n';
        return -1;
    }
}

Both Julius and Kaldi are open-source speech recognition toolkits, but they differ in their approach and target applications. Julius is lightweight and suitable for real-time speech recognition, while Kaldi offers a more comprehensive set of tools for research and large-scale applications.

DeepSpeech: An open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Pros of DeepSpeech

  • Uses deep learning techniques, potentially offering better accuracy for complex speech recognition tasks
  • Supports multiple languages and can be fine-tuned for specific domains
  • Actively maintained with regular updates and improvements

Cons of DeepSpeech

  • Requires more computational resources due to its neural network-based approach
  • May have a steeper learning curve for implementation and customization
  • Larger model size, which can impact deployment in resource-constrained environments

Code Comparison

Julius:

int main(int argc, char *argv[])
{
  Jconf *jconf;
  Recog *recog;
  jconf = j_config_load_args_new(argc, argv);
  recog = j_create_instance_from_jconf(jconf);
  j_adin_init(recog);
  j_open_stream(recog, NULL);
  j_recognize_stream(recog);
  j_recog_free(recog);
  return 0;
}

DeepSpeech:

import deepspeech
model = deepspeech.Model('path/to/model.pbmm')
stream = model.createStream()
# audio_buffer: 16 kHz, 16-bit PCM samples as a NumPy int16 array
stream.feedAudioContent(audio_buffer)
text = stream.finishStream()
print(text)

Key Differences

  • Julius is written in C, while DeepSpeech uses Python for its main interface
  • Julius uses traditional speech recognition techniques, while DeepSpeech employs deep learning
  • DeepSpeech's API is more streamlined, focusing on model loading and inference
  • Julius offers more fine-grained control over the recognition process

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

Pros of STT

  • Supports multiple languages and accents out of the box
  • Utilizes deep learning techniques for improved accuracy
  • Offers pre-trained models for quick deployment

Cons of STT

  • Higher computational requirements due to deep learning models
  • May require more setup and configuration for custom use cases

Code Comparison

STT (Python):

import wave

import numpy as np
import stt

model = stt.Model("path/to/model.pbmm")
# model.stt() expects 16-bit PCM samples as an int16 NumPy array
with wave.open("audio.wav", "rb") as f:
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
result = model.stt(audio)
print(result)

Julius (C):

#include <julius/juliuslib.h>

Jconf *jconf = j_config_load_file_new("julius.jconf");
Recog *recog = j_create_instance_from_jconf(jconf);
j_adin_init(recog);
j_open_stream(recog, NULL);
j_recognize_stream(recog);

Key Differences

  • STT uses deep learning models, while Julius relies on traditional HMM-based recognition
  • STT supports multiple languages by default, whereas Julius focuses primarily on Japanese
  • Julius offers more fine-grained control over recognition parameters
  • STT provides a simpler API for quick integration, while Julius requires more low-level setup

Both projects are open-source and actively maintained, with STT being more suitable for modern, multi-language applications, and Julius excelling in scenarios requiring detailed control and optimization for specific use cases.


ESPnet: End-to-End Speech Processing Toolkit

Pros of ESPnet

  • More comprehensive toolkit with support for various speech processing tasks (ASR, TTS, speech enhancement, etc.)
  • Active development and frequent updates
  • Utilizes modern deep learning frameworks (PyTorch)

Cons of ESPnet

  • Steeper learning curve due to its complexity
  • Higher computational requirements for training and inference

Code Comparison

ESPnet (Python):

import soundfile
from espnet2.bin.asr_inference import Speech2Text

# "espnet/model" is a placeholder for a pretrained model tag
speech2text = Speech2Text.from_pretrained("espnet/model")
speech, rate = soundfile.read("audio.wav")
nbest = speech2text(speech)
text, *_ = nbest[0]

Julius (C):

int main(int argc, char *argv[])
{
    Jconf *jconf = j_config_load_file_new(argv[1]);
    Recog *recog = j_create_instance_from_jconf(jconf);
    j_adin_init(recog);
    j_open_stream(recog, NULL);
    j_recognize_stream(recog);
    j_recog_free(recog);
    return 0;
}

Summary

ESPnet is a more modern and versatile toolkit for speech processing tasks, offering support for various applications beyond just speech recognition. It leverages deep learning frameworks and provides frequent updates. However, it may be more complex to use and require more computational resources compared to Julius, which is a lightweight speech recognition engine focused primarily on ASR tasks.


README

Julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine

Copyright (c) 1991-2020 Kawahara Lab., Kyoto University
Copyright (c) 2005-2020 Julius project team, Lee Lab., Nagoya Institute of Technology
Copyright (c) 1997-2000 Information-technology Promotion Agency, Japan
Copyright (c) 2000-2005 Shikano Lab., Nara Institute of Science and Technology

About Julius

"Julius" is a high-performance, small-footprint large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform real-time decoding on various computers and devices from micro-computer to cloud server. The algorithm is based on 2-pass tree-trellis search, which fully incorporates major decoding techniques such as tree-organized lexicon, 1-best / word-pair context approximation, rank/score pruning, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also modularized to be independent from model structures, and wide variety of HMM structures are supported such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phone sets. It also can run multi-instance recognition, running dictation, grammar-based recognition or isolated word recognition simultaneously in a single thread. Standard formats are adopted for the models to cope with other speech / language modeling toolkit such as HTK, SRILM, etc. Recent version also supports Deep Neural Network (DNN) based real-time decoding.

The main platforms are Linux and other Unix-based systems, as well as Windows, macOS, Android and others.

Julius has been developed as research software for Japanese LVCSR since 1997, and the work was continued under the IPA Japanese dictation toolkit project (1997-2000), the Continuous Speech Recognition Consortium, Japan (CSRC) (2000-2003) and the Interactive Speech Technology Consortium (ISTC).

The main developer / maintainer is Akinobu Lee (ri@nitech.ac.jp).

Features

  • An open-source LVCSR software (BSD 3-clause license).
  • Real-time, hi-speed, accurate recognition based on 2-pass strategy.
  • Low memory requirement: less than 32MBytes required for work area (<64MBytes for 20k-word dictation with on-memory 3-gram LM).
  • Supports N-gram LMs with arbitrary N, as well as rule-based grammars and word lists for isolated word recognition.
  • Language- and unit-independent: any LM in ARPA standard format and AM in HTK ascii hmm definition format can be used.
  • Highly configurable: various search parameters can be set, and alternate decoding algorithms (1-best/word-pair approximation, word trellis/word graph intermediates, etc.) can be chosen.
  • List of major supported features:
    • On-the-fly recognition for microphone and network input
    • GMM-based input rejection
    • Successive decoding, delimiting input by short pauses
    • N-best output
    • Word graph output
    • Forced alignment on word, phoneme, and state level
    • Confidence scoring
    • Server mode and control API (see the sketch after this list)
    • Many search parameters for tuning its performance
    • Character code conversion for result output.
    • (Rev. 4) Engine becomes Library and offers simple API
    • (Rev. 4) Long N-gram support
    • (Rev. 4) Run with forward / backward N-gram only
    • (Rev. 4) Confusion network output
    • (Rev. 4) Arbitrary multi-model decoding in a single thread.
    • (Rev. 4) Rapid isolated word recognition
    • (Rev. 4) User-defined LM function embedding
  • DNN-based decoding, using a front-end module for frame-wise state probability calculation for flexibility.
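A brief sketch of the server mode mentioned in the list above: starting Julius with the "-module" option turns it into a recognition server that clients control over a TCP socket (port 10500 by default). The jconf name here is a placeholder.

% julius -C your.jconf -module

Recognition results are then sent to the connected client as tagged text; any TCP client (for example, telnet) can be used to inspect them:

% telnet localhost 10500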

Quick Run

How to test English dictation with Julius and an English DNN model. The procedure below is for Linux, but it is almost the same on other OSes.

(For Japanese dictation, use the Japanese dictation kit.)

1. Build the latest Julius

% sudo apt-get install build-essential zlib1g-dev libsdl2-dev libasound2-dev
% git clone https://github.com/julius-speech/julius.git
% cd julius
% ./configure --enable-words-int
% make -j4
% ls -l julius/julius
-rwxr-xr-x 1 ri lab 746056 May 26 13:01 julius/julius

2. Get English DNN model

Go to the JuliusModels page and download the English model (LM + DNN-HMM) named "ENVR-v5.4.Dnn.Bin.zip". Unzip it and cd into the unzipped folder.

% cd ..
% unzip /some/where/ENVR-v5.4.Dnn.Bin.zip
% cd ENVR-v5.4.Dnn.Bin

3. Modify config file

Edit the dnn.jconf file in the unzipped folder to fit the latest version of Julius:

(edit dnn.jconf)
@@ -1,5 +1,5 @@
 feature_type MFCC_E_D_A_Z
-feature_options -htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cmnstatic
+feature_options -htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic
 num_threads 1
 feature_len 48
 context_len 11
@@ -21,3 +21,4 @@
 output_B ENVR-v5.3.layerout_bias.npy
 state_prior_factor 1.0
 state_prior ENVR-v5.3.prior
+state_prior_log10nize false

4. Recognize audio file

Recognize "mozilla.wav" included in the zip file.

% ../julius/julius/julius -C julius.jconf -dnnconf dnn.jconf

You'll get tons of messages, but the final result of the first speech part will be output like this:

sentence1: <s> without the data said the article was useless </s>
wseq1: <s> without the data said the article was useless </s>
phseq1: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s | sil
cmscore1: 0.785 0.892 0.318 0.284 0.669 0.701 0.818 0.103 0.528 1.000
score1: 261.947144

"test.dbl" contains list of audio files to be recognized. Edit the file and run again to test with another files.

5. Run with live microphone input

To run Julius on live microphone input, save the following text as "mic.jconf".

-input mic
-htkconf wav_config
-h ENVR-v5.3.am
-hlist ENVR-v5.3.phn
-d ENVR-v5.3.lm
-v ENVR-v5.3.dct
-b 4000
-lmp 12 -6
-lmp2 12 -6
-fallback1pass
-multipath
-iwsp
-iwcd1 max
-spmodel sp
-no_ccd
-sepnum 150
-b2 360
-n 40
-s 2000
-m 8000
-lookuprange 5
-sb 80
-forcedict

Then run Julius with mic.jconf instead of julius.jconf:

% ../julius/julius/julius -C mic.jconf -dnnconf dnn.jconf

Download

The latest release version is 4.6, released on September 2, 2020. You can get the released package from the Release page. See the "Release.txt" file for the full list of updates. Run with "-help" to see the full list of options.

Install / Build Julius

Follow the instructions in INSTALL.txt.

Tools and Assets

There are also toolkits and assets for running Julius, maintained by the Julius development team. You can get them from the following GitHub pages:

Japanese Dictation Kit

A set of Julius executables and Japanese LM/AM. You can test 60k-word Japanese dictation with this kit. For the AM, triphone HMMs of both GMM and DNN types are included. For DNN, a front-end DNN module, separate from Julius, computes the HMM state probabilities for each input frame and sends them to Julius via a socket to perform real-time DNN decoding. For the LM, a 60k-word 3-gram trained on the BCCWJ corpus is included. You can get it from its GitHub page.

Recognition Grammar Toolkit

Documents, sample files and conversion tools to use and build a recognition grammar for Julius. You can get it from the GitHub page.

Speech Segmentation Toolkit

This is a handy toolkit for phoneme segmentation (a.k.a. phoneme alignment) of speech audio files using Julius. Given pairs of a speech audio file and its transcription, the toolkit performs Viterbi alignment to get the beginning and ending time of each phoneme. It is available at its GitHub page.

Prompter

Prompter is a Perl/Tkx-based tiny program that displays recognition results of Julius in a scrolling caption style.

About Models

Since Julius itself is a language-independent decoding program, you can build a recognizer for a language given an appropriate language model and acoustic model for it. The recognition accuracy largely depends on the models. Julius adopts acoustic models in HTK ascii format, pronunciation dictionaries in (almost) HTK format, and word 3-gram language models in ARPA standard format (a forward 2-gram and a reverse N-gram trained from the same corpus).
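To make the mapping concrete, here is a hedged sketch of the jconf options that bind such models to the decoder; all file names are placeholders:

# acoustic model: HTK ascii hmmdefs plus its HMMList
-h hmmdefs
-hlist allophones.list
# language model: forward 2-gram and reverse N-gram in ARPA format
-nlr forward.arpa
-nrl reverse.arpa
# pronunciation dictionary
-v dictionary.dict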

We have already tested English dictation with Julius, and other researchers have reported that Julius also works well in English, Slovenian (see pp. 681--684 of Proc. ICSLP 2002), French, Thai, and many other languages.

Here you can get Japanese and English language/acoustic models.

Japanese

A Japanese language model (60k-word, trained on a balanced corpus) and acoustic models (triphone GMM/DNN) are included in the Japanese dictation kit. More varieties of Japanese N-gram LMs and acoustic models are available at CSRC. For details, please contact csrc@astem.or.jp.

English

There are some user-contributed English models for Julius available on the Web.

JuliusModels hosts English and Polish models for Julius. All of the models are based on HTK modelling software and data sets freely available on the Internet, and can be downloaded from the project website created for this purpose. Please note that the DNN versions of these models require minor changes, which the author has included in a modified version of Julius on GitHub at https://github.com/palles77/julius.

The VoxForge project is working on the creation of an open-source acoustic model for English. If you have any language or acoustic model that can be distributed as freeware, please contact us: we want to run the dictation kit on languages other than Japanese and share the results freely, to provide a free speech recognition system for many languages.


References

  • Official web site (Japanese)
  • Old development site, with old releases
  • Publications:
    • A. Lee and T. Kawahara. "Recent Development of Open-Source Speech Recognition Engine Julius" Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2009.
    • A. Lee, T. Kawahara and K. Shikano. "Julius --- an open source real-time large vocabulary recognition engine." In Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1691--1694, 2001.
    • T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, S. Sagayama, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro and K. Shikano. "Free software toolkit for Japanese large vocabulary continuous speech recognition." In Proc. Int'l Conf. on Spoken Language Processing (ICSLP) , Vol. 4, pp. 476--479, 2000.

Moved to UTF-8

The master branch has moved to UTF-8. All files after the release of 4.5 (2019/1/2) were converted to UTF-8, and future updates will also be committed in UTF-8.

For backward compatibility and log visibility, the old encodings are kept in the branch "master-4.5-legacy", which holds the legacy-encoding version of 4.5. If you want to inspect the code history before the 4.5 release (2019/1/2), please check out that branch.

License and Citation

This code is made available under the modified BSD License (BSD-3-Clause License).

Over and above the legal restrictions imposed by this license, when you publish or present results obtained using this software, we would highly appreciate it if you mentioned the use of "Large Vocabulary Continuous Speech Recognition Engine Julius" and provided a proper reference or citation, so that readers can easily access information about the software. This helps boost the visibility of Julius, which in turn helps us further enhance Julius and the related software.

A citation to this software can be a paper that describes it,

A. Lee, T. Kawahara and K. Shikano. "Julius --- An Open Source Real-Time Large Vocabulary Recognition Engine". In Proc. EUROSPEECH, pp.1691--1694, 2001.

A. Lee and T. Kawahara. "Recent Development of Open-Source Speech Recognition Engine Julius" Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2009.

or a direct citation to this software,

A. Lee and T. Kawahara: Julius v4.5 (2019) https://doi.org/10.5281/zenodo.2530395

or both.