whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

16,462

1,760

16,462

331

View on GitHub

Top Related Projects

whisper

85,961

Robust Speech Recognition via Large-Scale Weak Supervision

whisper.cpp

41,097

Port of OpenAI's Whisper model in C/C++

faster-whisper

17,373

Faster Whisper transcription with CTranslate2

fairseq

31,682

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

silero-models

5,420

Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

Quick Overview

WhisperX is an advanced speech recognition and transcription tool that extends OpenAI's Whisper model. It offers improved timestamp accuracy, speaker diarization, and faster transcription speeds. WhisperX aims to provide a more comprehensive and efficient solution for audio transcription tasks.

Pros

Enhanced timestamp accuracy for word-level alignment
Integrated speaker diarization for multi-speaker audio
Faster transcription speeds compared to the original Whisper model
Support for multiple languages and accents

Cons

Requires more computational resources due to additional features
May have occasional accuracy issues with heavily accented speech
Limited documentation for advanced customization
Dependency on external libraries and models

Code Examples

Basic transcription:

import whisperx

model = whisperx.load_model("large-v2")
result = model.transcribe("audio.mp3")
print(result["text"])

Transcription with speaker diarization:

import whisperx

model = whisperx.load_model("large-v2")
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN")
result = model.transcribe("audio.mp3", diarize=True, diarize_model=diarize_model)
print(result["segments"])

Transcription with language detection:

import whisperx

model = whisperx.load_model("large-v2")
result = model.transcribe("audio.mp3", language="auto")
print(result["language"], result["text"])

Getting Started

To get started with WhisperX, follow these steps:

Install WhisperX:

pip install git+https://github.com/m-bain/whisperx.git

Install additional dependencies:

pip install pyannote.audio

Use WhisperX in your Python script:

import whisperx

model = whisperx.load_model("large-v2")
result = model.transcribe("path/to/your/audio.mp3")
print(result["text"])

Note: For speaker diarization, you'll need to obtain an authentication token from Hugging Face and use it when initializing the diarization pipeline.

Competitor Comparisons

whisper

85,961

Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

Developed by OpenAI, a leading AI research company
Extensive documentation and community support
Broader language support with 99 languages

Cons of Whisper

Slower processing speed for long audio files
Less precise timestamp alignment for transcriptions
Limited fine-tuning options for specific use cases

Code Comparison

WhisperX:

from whisperx import load_model, transcribe

model = load_model("large-v2")
result = transcribe("audio.wav", model)

Whisper:

import whisper

model = whisper.load_model("large")
result = model.transcribe("audio.wav")

Key Differences

WhisperX focuses on improved speed and accuracy for long-form content
WhisperX offers better word-level timestamp alignment
Whisper provides a more general-purpose solution for various audio transcription tasks

Use Cases

WhisperX: Ideal for long-form content, podcasts, and applications requiring precise word-level timestamps
Whisper: Better suited for general transcription tasks and multilingual applications

Community and Support

Whisper: Larger community, more third-party integrations, and extensive documentation
WhisperX: Growing community, focused on specific improvements over the original Whisper model

whisper.cpp

41,097

Port of OpenAI's Whisper model in C/C++

Pros of whisper.cpp

Lightweight and efficient C++ implementation, optimized for CPU usage
Supports various quantization levels for reduced memory footprint
Can run on embedded devices and older hardware

Cons of whisper.cpp

Limited to basic transcription and translation functionality
Lacks advanced features like speaker diarization and word-level timestamps
May have lower accuracy compared to GPU-accelerated implementations

Code Comparison

WhisperX:

from whisperx import load_model, transcribe

model = load_model("large-v2")
result = transcribe(model, "audio.wav", language="en")

whisper.cpp:

#include "whisper.h"

whisper_context * ctx = whisper_init_from_file("ggml-large.bin");
whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
whisper_full(ctx, params, "audio.wav", NULL, 0);

WhisperX offers a more user-friendly Python interface with additional features, while whisper.cpp provides a lower-level C++ implementation focused on efficiency and portability. WhisperX is better suited for advanced speech recognition tasks, while whisper.cpp excels in resource-constrained environments.

faster-whisper

17,373

Faster Whisper transcription with CTranslate2

Pros of faster-whisper

Optimized for speed, offering faster transcription times
Supports streaming audio input for real-time transcription
Implements efficient CPU and GPU inference

Cons of faster-whisper

Limited to core Whisper functionality without additional features
May have slightly lower accuracy compared to WhisperX in some cases
Less focus on timestamp alignment and speaker diarization

Code Comparison

WhisperX:

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
result = model.transcribe("audio.wav", batch_size=16)
segments = result["segments"]

faster-whisper:

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(segment.text)

Key Differences

WhisperX focuses on enhancing Whisper with additional features like improved timestamp alignment and speaker diarization, while faster-whisper prioritizes speed optimization and efficient inference. WhisperX may offer better accuracy and more advanced features, but faster-whisper excels in transcription speed and real-time processing capabilities.

Both projects build upon the original Whisper model, but cater to different use cases. WhisperX is more suitable for applications requiring precise timing and speaker identification, while faster-whisper is ideal for scenarios where speed and efficiency are paramount.

fairseq

31,682

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

Comprehensive toolkit for sequence modeling tasks
Supports a wide range of architectures and tasks
Highly customizable and extensible

Cons of fairseq

Steeper learning curve due to its complexity
May be overkill for simple speech recognition tasks
Requires more setup and configuration

Code Comparison

fairseq:

from fairseq.models.wav2vec import Wav2VecModel

model = Wav2VecModel.from_pretrained('path/to/model')
features = model.extract_features(waveform)

WhisperX:

import whisperx

model = whisperx.load_model("base")
result = model.transcribe("audio.mp3")

Key Differences

fairseq is a general-purpose sequence modeling toolkit, while WhisperX focuses specifically on speech recognition and diarization
WhisperX provides a simpler API for quick transcription tasks
fairseq offers more flexibility for advanced users and researchers
WhisperX includes built-in speaker diarization, which is not a core feature of fairseq

Use Cases

fairseq: Research, custom model development, and complex sequence modeling tasks
WhisperX: Rapid speech transcription, speaker diarization, and alignment in production environments

silero-models

5,420

Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

Pros of silero-models

Supports multiple languages and tasks (speech recognition, text-to-speech, voice activity detection)
Lightweight models suitable for edge devices and mobile applications
Extensive documentation and examples for various programming languages

Cons of silero-models

May have lower accuracy compared to WhisperX for speech recognition tasks
Lacks advanced features like word-level timestamps and speaker diarization
Smaller community and fewer updates compared to WhisperX

Code Comparison

silero-models:

import torch
import torchaudio
from silero import silero_stt

model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                       model='silero_stt',
                                       language='en')

WhisperX:

import whisperx

device = "cuda"
audio_file = "audio.wav"
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio_file)

Both repositories offer speech recognition capabilities, but WhisperX focuses on extending OpenAI's Whisper model with additional features, while silero-models provides a broader range of speech-related tasks. WhisperX may be more suitable for high-accuracy transcription with advanced features, while silero-models is better for lightweight, multi-purpose speech processing applications.

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

WhisperX

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

â¡ï¸ Batched inference for 70x realtime transcription using whisper large-v2
ðª¶ faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
ð¯ Accurate word-level timestamps using wav2vec2 alignment
ð¯ââï¸ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
ð£ï¸ VAD preprocessing, reduces hallucination & batching with no WER degradation

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

Phoneme-Based ASR A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.

Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.

Newð¨

1st place at Ego4d transcription challenge ð
WhisperX accepted at INTERSPEECH 2023
v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitlting & better diarization
v3 released, 70x speed-up open-sourced. Using batched whisper with faster-whisper backend!
v2 released, code cleanup, imports whisper library VAD filtering is now turned on by default, as in the paper.
Paper dropðð¨âð«! Please see our ArxiV preprint for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed.

Setup âï¸

1. Simple Installation (Recommended)

The easiest way to install WhisperX is through PyPi:

pip install whisperx

Or if using uvx:

uvx whisperx

2. Advanced Installation Options

These installation methods are for developers or users with specific needs. If you're not sure, stick with the simple installation above.

Option A: Install from GitHub

To install directly from the GitHub repository:

uvx git+https://github.com/m-bain/whisperX.git

Option B: Developer Installation

If you want to modify the code or contribute to the project:

git clone https://github.com/m-bain/whisperX.git
cd whisperX
uv sync --all-extras --dev

Note: The development version may contain experimental features and bugs. Use the stable PyPI release for production environments.

You may also need to install ffmpeg, rust etc. Follow openAI instructions here https://github.com/openai/whisper#setup.

Common Issues & Troubleshooting ð§

libcudnn Dependencies (GPU Users)

If you're using WhisperX with GPU support and encounter errors like:

Could not load library libcudnn_ops_infer.so.8
Unable to load any of {libcudnn_cnn.so.9.1.0, libcudnn_cnn.so.9.1, libcudnn_cnn.so.9, libcudnn_cnn.so}
libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory

This means your system is missing the CUDA Deep Neural Network library (cuDNN). This library is needed for GPU acceleration but isn't always installed by default.

Install cuDNN (example for apt based systems):

sudo apt update
sudo apt install libcudnn8 libcudnn8-dev -y

Speaker Diarization

To enable Speaker Diarization, include your Hugging Face access token (read) that you can generate from Here after the --hf_token argument and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1 (if you choose to use Speaker-Diarization 2.x, follow requirements here instead.)

Note
As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.

Usage ð¬ (command line)

English

Run whisper on example segment (using default params, whisper small) add --highlight_words True to visualise word timings in the .srt file.

whisperx path/to/audio.wav

Result using WhisperX with forced alignment to wav2vec2.0 large:

https://user-images.githubusercontent.com/36994049/208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4

Compare this to original whisper out the box, where many transcriptions are out of sync:

https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov

For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.

whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

To label the transcript with speaker ID's (set number of speakers if known e.g. --min_speakers 2 --max_speakers 2):

whisperx path/to/audio.wav --model large-v2 --diarize --highlight_words True

To run on CPU instead of GPU (and for running on Mac OS X):

whisperx path/to/audio.wav --compute_type int8

Other languages

The phoneme ASR alignment model is language-specific, for tested languages these models are automatically picked from torchaudio pipelines or huggingface. Just pass in the --language code, and use the whisper --model large.

Currently default models provided for {en, fr, de, es, it} via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under DEFAULT_ALIGN_MODELS_HF on alignment.py. If the detected language is not in this list, you need to find a phoneme-based ASR model from huggingface model hub and test it on your data.

E.g. German

whisperx --model large-v2 --language de path/to/audio.wav

https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov

See more examples in other languages here.

Python usage ð

import whisperx
import gc

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

Demos ð

If you don't have access to your own GPUs, use the links above to try out WhisperX.

Technical Details ð·ââï¸

For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.

To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):

reduce batch size, e.g. --batch_size 4
use a smaller ASR model --model base
Use lighter compute type --compute_type int8

Transcription differences from openai's whisper:

Transcription without timestamps. To enable single pass batching, whisper inference is performed --without_timestamps True, this ensures 1 forward pass per sample in the batch. However, this can cause discrepancies the default whisper output.
VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER, and enables accurate batched inference
--condition_on_prev_text is set to False by default (reduces hallucination)

Limitations â ï¸

Transcript words which do not contain characters in the alignment models dictionary e.g. "2014." or "Â£13.60" cannot be aligned and therefore are not given a timing.
Overlapping speech is not handled particularly well by whisper nor whisperx
Diarization is far from perfect
Language specific wav2vec2 model is needed

Contribute ð§âð«

If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.

Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

TODO ð

Contact/Support ð

Contact maxhbain@gmail.com for queries.

Acknowledgements ð

This work, and my PhD, is supported by the VGG (Visual Geometry Group) and the University of Oxford.

Of course, this is builds on openAI's whisper. Borrows important alignment code from PyTorch tutorial on forced alignment And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio

Valuable VAD & Diarization Models from:

[pyannote audio][https://github.com/pyannote/pyannote-audio]
[silero vad][https://github.com/snakers4/silero-vad]

Great backend from faster-whisper and CTranslate2

Those who have supported this work financially ð

Finally, thanks to the OS contributors of this project, keeping it going and identifying bugs.

Citation

If you use this in your research, please cite the paper:

@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

Top Related Projects

Quick Overview

Pros

Cons

Code Examples

Getting Started

Competitor Comparisons

Pros of Whisper

Cons of Whisper

Code Comparison

Key Differences

Use Cases

Community and Support

Pros of whisper.cpp

Cons of whisper.cpp

Code Comparison

Pros of faster-whisper

Cons of faster-whisper

Code Comparison

Key Differences

Pros of fairseq

Cons of fairseq

Code Comparison

Key Differences

Use Cases

Pros of silero-models

Cons of silero-models

Code Comparison

Convert designs to code with AI

README

WhisperX

Newð¨

Setup âï¸

1. Simple Installation (Recommended)

2. Advanced Installation Options

Option A: Install from GitHub

Option B: Developer Installation

Common Issues & Troubleshooting ð§

libcudnn Dependencies (GPU Users)

Speaker Diarization

Usage ð¬ (command line)

English

Other languages

E.g. German

Python usage ð

Demos ð

Technical Details ð·ââï¸

Limitations â ï¸

Contribute ð§âð«

TODO ð

Contact/Support ð

Acknowledgements ð

Citation

Top Related Projects

Convert designs to code with AI

Newð¨

Setup âï¸

Common Issues & Troubleshooting ð§

Usage ð¬ (command line)

Python usage ð

Demos ð

Technical Details ð·ââï¸

Limitations â ï¸

Contribute ð§âð«

TODO ð

Contact/Support ð

Acknowledgements ð