insanely-fast-whisper

No description available

8,570

619

8,570

107


Top Related Projects

whisper

85,961

Robust Speech Recognition via Large-Scale Weak Supervision

whisper.cpp

41,097

Port of OpenAI's Whisper model in C/C++

whisperX

16,462

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

faster-whisper

17,373

Faster Whisper transcription with CTranslate2

whisper-jax

4,618

JAX implementation of OpenAI's Whisper model for up to 70x speed-up on TPU.

Quick Overview

Insanely Fast Whisper is an opinionated command-line tool for transcribing audio with OpenAI's Whisper models on-device. It builds on Hugging Face Transformers, Optimum and flash-attn, and combines fp16 inference, batched decoding and Flash Attention 2 to run much faster than an unoptimised Transformers baseline, which makes it well suited to large-scale audio processing.

Pros

Dramatically faster inference than unoptimised Whisper (150 minutes of audio in under 98 seconds on an A100 with large-v3)
Runs on NVIDIA GPUs (CUDA) and Apple Silicon Macs (mps)
Easy to use through a single CLI command
Compatible with various audio input formats

Cons

May have slightly reduced accuracy compared to the original Whisper model
Requires additional dependencies for optimal performance
Limited documentation and examples for advanced use cases
Potential compatibility issues with future Whisper model updates

Code Examples

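Basic transcription (through the Transformers pipeline, as in the project's README; insanely-fast-whisper is a CLI and does not expose a WhisperModel class):

import torch
from transformers import pipeline

# Load Whisper through the Transformers ASR pipeline on the first CUDA GPU
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base", torch_dtype=torch.float16, device="cuda:0")
outputs = pipe("audio.mp3", chunk_length_s=30, batch_size=24, return_timestamps=True)

# Each chunk carries a (start, end) timestamp in seconds and its text
for chunk in outputs["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start}s -> {end}s] {chunk['text']}")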

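Transcription with automatic or explicit language (Whisper auto-detects the language when none is given):

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2", torch_dtype=torch.float16, device="cuda:0")
# generate_kwargs is forwarded to Whisper's generate(); "french" is only an example value.
# Omit generate_kwargs to let Whisper detect the language itself.
outputs = pipe("audio.mp3", chunk_length_s=30, generate_kwargs={"language": "french", "task": "transcribe"})

print(outputs["text"])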

Batch processing multiple audio files:

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-medium", torch_dtype=torch.float16, device="cuda:0")
audio_files = ["file1.wav", "file2.mp3", "file3.ogg"]

for audio_file in audio_files:
    # Long files are split into 30-second chunks that are decoded 24 at a time
    outputs = pipe(audio_file, chunk_length_s=30, batch_size=24)
    print(f"Transcription for {audio_file}:")
    print(outputs["text"])
    print()

Getting Started

To get started with Insanely Fast Whisper, follow these steps:

Install the CLI with pipx:

pipx install insanely-fast-whisper

Transcribe an audio file (add --device-id mps on macOS):

insanely-fast-whisper --file-name path/to/audio.mp3

The transcript is saved to output.json by default; pass --transcript-path to change the location.

Competitor Comparisons

whisper

85,961

Robust Speech Recognition via Large-Scale Weak Supervision

Pros of Whisper

More comprehensive and versatile, supporting multiple languages and tasks
Backed by OpenAI, with extensive documentation and community support
Offers fine-tuning capabilities for specific use cases

Cons of Whisper

Generally slower processing speed, especially for longer audio files
Higher computational requirements, which can be resource-intensive
May require more setup and configuration for optimal performance

Code Comparison

Whisper:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

Insanely Fast Whisper (a CLI rather than a Python class):

insanely-fast-whisper --file-name audio.mp3 --model-name openai/whisper-base

The main difference is that Insanely Fast Whisper utilizes optimized libraries and GPU acceleration to achieve faster processing speeds, especially for longer audio files. However, Whisper offers more flexibility and features, making it suitable for a wider range of applications.

whisper.cpp

41,097

Port of OpenAI's Whisper model in C/C++

Pros of whisper.cpp

Lightweight and efficient C++ implementation
Runs well on CPU alone, making it suitable for devices without GPUs
Supports various quantization levels for model compression

Cons of whisper.cpp

Generally slower than GPU-accelerated implementations
Limited to Whisper models, less flexibility for other architectures
Requires more manual setup and compilation

Code Comparison

whisper.cpp:

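// Load model
struct whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");

// Process audio (pcmf32 holds 16 kHz mono float samples)
struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
whisper_full(ctx, params, pcmf32.data(), pcmf32.size());

// Print result, one segment at a time
for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
    printf("%s", whisper_full_get_segment_text(ctx, i));
}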

insanely-fast-whisper (CLI):

insanely-fast-whisper --file-name audio.mp3 --model-name openai/whisper-large-v3

The snippets demonstrate the basic usage of each project. whisper.cpp is a C/C++ library that requires manual model loading and audio processing, while insanely-fast-whisper wraps the whole pipeline in a single command.

whisperX

16,462

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Pros of WhisperX

Offers advanced features like word-level timestamps and speaker diarization
Supports multiple languages and provides language identification
Includes a VAD (Voice Activity Detection) feature for improved accuracy

Cons of WhisperX

May have slower processing speed compared to Insanely Fast Whisper
Requires more dependencies and setup steps
Potentially higher computational resource requirements

Code Comparison

WhisperX:

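import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)
# Alignment uses a separate model, loaded for the detected language
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)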

Insanely Fast Whisper (CLI):

insanely-fast-whisper --file-name audio.mp3 --model-name openai/whisper-large-v2 --timestamp word

Both repositories provide efficient speech recognition, but they cater to different use cases. WhisperX centres on forced alignment for accurate word-level timestamps, plus VAD and speaker diarization, making it suitable for complex transcription tasks. Insanely Fast Whisper focuses on throughput: it also exposes word-level timestamps (--timestamp word) and pyannote-based diarization options, but its main goal is transcribing long recordings quickly. The choice depends on whether your project needs alignment quality or raw processing speed.

faster-whisper

17,373

Faster Whisper transcription with CTranslate2

Pros of faster-whisper

More established project with a longer development history
Supports a wider range of Whisper models, including medium and large
Offers more fine-grained control over transcription parameters

Cons of faster-whisper

Generally slower transcription speeds compared to insanely-fast-whisper
Requires more manual configuration and setup
Less focus on optimizing for consumer-grade hardware

Code Comparison

faster-whisper:

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

insanely-fast-whisper (CLI):

insanely-fast-whisper --file-name audio.mp3 --model-name openai/whisper-large-v2 --batch-size 24

Both projects aim to provide faster Whisper transcription, but they take different approaches. faster-whisper focuses on a broader range of models and more customization options, while insanely-fast-whisper prioritizes speed optimizations for specific use cases. The choice between them depends on the user's specific needs, hardware capabilities, and desired level of control over the transcription process.

whisper-jax

4,618

JAX implementation of OpenAI's Whisper model for up to 70x speed-up on TPU.

Pros of whisper-jax

Utilizes JAX for efficient GPU acceleration, potentially offering faster performance on compatible hardware
Provides more detailed documentation and examples for advanced usage scenarios
Offers a wider range of pre-trained models, including multilingual options

Cons of whisper-jax

May have a steeper learning curve due to JAX-specific implementation
Less focus on ease of use for beginners compared to insanely-fast-whisper
Requires additional dependencies, which could increase setup complexity

Code Comparison

whisper-jax:

import jax
from whisper_jax import FlaxWhisperPipline

pipeline = FlaxWhisperPipline("openai/whisper-large-v2")
text = pipeline("audio.mp3")

insanely-fast-whisper (CLI):

insanely-fast-whisper --file-name audio.mp3 --model-name openai/whisper-large-v2

Both repositories aim to provide efficient implementations of the Whisper model for speech recognition. whisper-jax leverages JAX for potential performance gains on compatible hardware, while insanely-fast-whisper focuses on simplicity and ease of use. The choice between them depends on specific requirements, hardware compatibility, and user expertise.


README

Insanely Fast Whisper

An opinionated CLI to transcribe audio files w/ Whisper on-device! Powered by 🤗 Transformers, Optimum & flash-attn

TL;DR - Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds - with OpenAI's Whisper Large v3. Blazingly fast transcription is now a reality! ⚡️

pipx install insanely-fast-whisper==0.0.15 --force

Not convinced? Here are some benchmarks we ran on an Nvidia A100 - 80GB:

Optimisation type	Time to Transcribe (150 mins of Audio)
large-v3 (Transformers) (`fp32`)	~31 min (31 min 1 sec)
large-v3 (Transformers) (`fp16` + `batching [24]` + `bettertransformer`)	~5 min (5 min 2 sec)
large-v3 (Transformers) (`fp16` + `batching [24]` + `Flash Attention 2`)	*~2 min (1 min 38 sec)*
distil-large-v2 (Transformers) (`fp16` + `batching [24]` + `bettertransformer`)	~3 min (3 min 16 sec)
distil-large-v2 (Transformers) (`fp16` + `batching [24]` + `Flash Attention 2`)	*~1 min (1 min 18 sec)*
large-v2 (Faster Whisper) (`fp16` + `beam_size [1]`)	~9 min (9 min 23 sec)
large-v2 (Faster Whisper) (`8-bit` + `beam_size [1]`)	~8 min (8 min 15 sec)
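To put the table in perspective, the timings can be converted into real-time factors (seconds of audio transcribed per second of wall-clock time). The short sketch below does that arithmetic for three rows of the table; the dictionary keys are illustrative labels, not names used by the project.

```python
# Audio length used in the benchmark table above: 150 minutes.
AUDIO_SECONDS = 150 * 60

# Wall-clock times copied from the table, as (minutes, seconds).
timings = {
    "large-v3 fp32": (31, 1),
    "large-v3 fp16 + batching + Flash Attention 2": (1, 38),
    "large-v2 Faster Whisper fp16": (9, 23),
}


def real_time_factor(minutes: int, seconds: int) -> float:
    """Return how many seconds of audio are transcribed per second of compute."""
    return AUDIO_SECONDS / (minutes * 60 + seconds)


for name, (m, s) in timings.items():
    print(f"{name}: {real_time_factor(m, s):.1f}x real time")
```

On these numbers the Flash Attention 2 configuration runs at roughly 92x real time, against roughly 5x for the fp32 baseline.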

P.S. We also ran the benchmarks on a Google Colab T4 GPU instance!

P.P.S. This project originally started as a way to showcase benchmarks for Transformers, but has since evolved into a lightweight CLI for people to use. This is purely community driven. We add whatever the community seems to have a strong demand for!

Blazingly fast transcriptions via your terminal! ⚡️

We've added a CLI to enable fast transcriptions. Here's how you can use it:

Install insanely-fast-whisper with pipx (pip install pipx or brew install pipx):

pipx install insanely-fast-whisper

⚠️ If you have Python 3.11.XX installed, pipx may parse the version incorrectly and install a very old version of insanely-fast-whisper without telling you (version 0.0.8, which won't work anymore with the current BetterTransformers). In that case, you can install the latest version by passing --ignore-requires-python to pip:

pipx install insanely-fast-whisper --force --pip-args="--ignore-requires-python"

If you're installing with pip, you can pass the argument directly: pip install insanely-fast-whisper --ignore-requires-python.
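One way to notice the silent downgrade described above is to check the installed version from Python. The sketch below is an assumption-laden helper, not part of the project: the function names are hypothetical, the minimum of 0.0.15 is simply the version pinned in the TL;DR install command, and the parser only handles plain dotted versions. It has to be run with the interpreter of the environment that holds the package (for pipx, that is the package's own virtualenv).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn a simple dotted version such as '0.0.15' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed: str, minimum: str = "0.0.15") -> bool:
    """Return True if the installed version is older than the assumed minimum."""
    return parse_version(installed) < parse_version(minimum)


def installed_version(package: str = "insanely-fast-whisper"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("insanely-fast-whisper is not installed in this environment")
elif is_outdated(version):
    print(f"Found {version}; reinstall with --ignore-requires-python")
else:
    print(f"Found {version}")
```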

Run inference from any path on your computer:

insanely-fast-whisper --file-name <filename or URL>

Note: if you are running on macOS, you also need to add the --device-id mps flag.

🔥 You can run Whisper-large-v3 w/ Flash Attention 2 from this CLI too:

insanely-fast-whisper --file-name <filename or URL> --flash True

You can run distil-whisper directly from this CLI too:

insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <filename or URL>

Don't want to install insanely-fast-whisper? Just use pipx run:

pipx run insanely-fast-whisper --file-name <filename or URL>

Note: The CLI is highly opinionated and only works on NVIDIA GPUs & Mac. Make sure to check out the defaults and the list of options you can play around with to maximise your transcription throughput. Run insanely-fast-whisper --help or pipx run insanely-fast-whisper --help to get all the CLI arguments along with their defaults.
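When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical helper (build_command is not part of the project) that only uses flags documented in this README: --file-name, --device-id, --batch-size, --flash and --model-name. The file name is a placeholder.

```python
import shlex


def build_command(file_name, device_id="0", batch_size=24, flash=False, model_name=None):
    """Assemble an insanely-fast-whisper command line from documented flags."""
    args = [
        "insanely-fast-whisper",
        "--file-name", file_name,
        "--device-id", str(device_id),
        "--batch-size", str(batch_size),
    ]
    if flash:
        args += ["--flash", "True"]
    if model_name:
        args += ["--model-name", model_name]
    return args


# Example: a Mac run with the smaller batch size suggested in the FAQ.
mac_cmd = build_command("interview.mp3", device_id="mps", batch_size=4)
print(shlex.join(mac_cmd))
```

The resulting list can be handed to subprocess.run(mac_cmd, check=True) once the CLI is installed.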

CLI Options

The insanely-fast-whisper repo provides all-round support for running Whisper in various settings. Note that as of today, 26th Nov, insanely-fast-whisper works on both CUDA and mps (Mac) enabled devices.

  -h, --help            show this help message and exit
  --file-name FILE_NAME
                        Path or URL to the audio file to be transcribed.
  --device-id DEVICE_ID
                        Device ID for your GPU. Just pass the device number when using CUDA, or "mps" for Macs with Apple Silicon. (default: "0")
  --transcript-path TRANSCRIPT_PATH
                        Path to save the transcription output. (default: output.json)
  --model-name MODEL_NAME
                        Name of the pretrained model/ checkpoint to perform ASR. (default: openai/whisper-large-v3)
  --task {transcribe,translate}
                        Task to perform: transcribe or translate to another language. (default: transcribe)
  --language LANGUAGE   
                        Language of the input audio. (default: "None" (Whisper auto-detects the language))
  --batch-size BATCH_SIZE
                        Number of parallel batches you want to compute. Reduce if you face OOMs. (default: 24)
  --flash FLASH         
                        Use Flash Attention 2. Read the FAQs to see how to install FA2 correctly. (default: False)
  --timestamp {chunk,word}
                        Whisper supports both chunked as well as word level timestamps. (default: chunk)
  --hf-token HF_TOKEN
                        Provide a hf.co/settings/token for Pyannote.audio to diarise the audio clips
  --diarization_model DIARIZATION_MODEL
                        Name of the pretrained model/ checkpoint to perform diarization. (default: pyannote/speaker-diarization)
  --num-speakers NUM_SPEAKERS
                        Specifies the exact number of speakers present in the audio file. Useful when the exact number of participants in the conversation is known. Must be at least 1. Cannot be used together with --min-speakers or --max-speakers. (default: None)
  --min-speakers MIN_SPEAKERS
                        Sets the minimum number of speakers that the system should consider during diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be less than or equal to --max-speakers if both are specified. (default: None)
  --max-speakers MAX_SPEAKERS
                        Defines the maximum number of speakers that the system should consider in diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be greater than or equal to --min-speakers if both are specified. (default: None)
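The constraints on the three speaker options can be summarised as a small validation routine. The function below is a sketch written from the help text above, with a hypothetical name; it is not the project's own argument checking.

```python
def validate_speaker_args(num_speakers=None, min_speakers=None, max_speakers=None):
    """Raise ValueError if the combination breaks the rules in the help text."""
    named = (("num", num_speakers), ("min", min_speakers), ("max", max_speakers))
    for name, value in named:
        # Every speaker count that is given must be at least 1.
        if value is not None and value < 1:
            raise ValueError(f"--{name}-speakers must be at least 1")
    # An exact count excludes a min/max range.
    if num_speakers is not None and (min_speakers is not None or max_speakers is not None):
        raise ValueError("--num-speakers cannot be combined with --min-speakers or --max-speakers")
    # A range must be ordered.
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("--min-speakers must be less than or equal to --max-speakers")
    return True


print(validate_speaker_args(min_speakers=2, max_speakers=4))
```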

Frequently Asked Questions

How to correctly install flash-attn to make it work with insanely-fast-whisper?

Make sure to install it via pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation. Massive kudos to @li-yifei for helping with this.

How to solve an AssertionError: Torch not compiled with CUDA enabled error on Windows?

The root cause of this problem is still unknown. However, you can resolve it by manually installing torch in the virtualenv: python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121. Thanks to @pto2k for debugging this.

How to avoid Out-Of-Memory (OOM) exceptions on Mac?

The mps backend isn't as optimised as CUDA, hence is way more memory hungry. Typically you can run with --batch-size 4 without any issues (should use roughly 12GB GPU VRAM). Don't forget to set --device-id mps.

How to use Whisper without a CLI?

All you need to run is the below snippet:

pip install --upgrade transformers optimum accelerate

import torch
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3", # select checkpoint from https://huggingface.co/openai/whisper-large-v3#model-details
    torch_dtype=torch.float16,
    device="cuda:0", # or mps for Mac devices
    model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
)

outputs = pipe(
    "<FILE_NAME>",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)

outputs
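With return_timestamps=True, the Transformers ASR pipeline returns a dictionary with the full "text" and a list of "chunks", each holding a "timestamp" pair and its "text". The sketch below formats such a result as timestamped lines; format_chunks is a hypothetical helper, and the sample dictionary is illustrative data in that shape rather than real model output.

```python
def format_chunks(outputs):
    """Render each timestamped chunk as '[start -> end] text'."""
    lines = []
    for chunk in outputs.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time can be None when the final chunk is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return lines


# Illustrative data in the shape described above, not real model output.
sample = {
    "text": " Hello there. How are you?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " How are you?"},
    ],
}
print("\n".join(format_chunks(sample)))
```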

Acknowledgements

OpenAI Whisper team for open sourcing such a brilliant checkpoint.
Hugging Face Transformers team, specifically Arthur, Patrick, Sanchit & Yoach (alphabetical order) for continuing to maintain Whisper in Transformers.
Hugging Face Optimum team for making the BetterTransformer API so easily accessible.
Patrick Arminio for helping me tremendously to put together this CLI.

Community showcase

@ochen1 created a brilliant MVP for a CLI here: https://github.com/ochen1/insanely-fast-whisper-cli (Try it out now!)
@arihanv created an app (Shush) using NextJS (Frontend) & Modal (Backend): https://github.com/arihanv/Shush (Check it outtt!)
@kadirnar created a python package on top of the transformers with optimisations: https://github.com/kadirnar/whisper-plus (Go go go!!!)
