Whisper
High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Top Related Projects
Robust Speech Recognition via Large-Scale Weak Supervision
Port of OpenAI's Whisper model in C/C++
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Faster Whisper transcription with CTranslate2
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Quick Overview
Const-me/Whisper is a high-performance Windows implementation of OpenAI's Whisper automatic speech recognition (ASR) model, written in C++ and running inference on the GPU through DirectCompute. It provides a Windows DLL, a desktop app, and a command-line interface for transcribing audio files, offering significantly faster processing times than the original Python implementation.
Pros
- Substantially faster performance than the original Python implementation
- Windows-native DLL for easy integration into other applications
- Command-line interface for quick and easy transcription of audio files
- Supports various audio formats and multiple languages
Cons
- Limited to Windows operating systems
- Requires a Direct3D 11.0 capable GPU, plus a CPU with AVX1 and F16C support
- May have slight differences in output compared to the original Whisper implementation
- Limited documentation compared to the original Python version
Code Examples
// The snippets below follow the upstream whisper.cpp C API for illustration;
// Const-me/Whisper itself exposes a COM-style C++ API (see the README below).
// Audio is assumed to be 16 kHz mono float PCM already loaded into std::vector<float> pcmf32.

// Initialize the Whisper context from a GGML model file
struct whisper_context* ctx = whisper_init_from_file("ggml-base.en.bin");

// Run the full transcription pipeline on the PCM samples
whisper_full(ctx, whisper_full_default_params(WHISPER_SAMPLING_GREEDY), pcmf32.data(), pcmf32.size());

// Print transcribed text, segment by segment
int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; i++) {
    const char* text = whisper_full_get_segment_text(ctx, i);
    printf("%s", text);
}

// Free the context
whisper_free(ctx);

// Set custom parameters for transcription
whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
params.print_progress = false;
params.print_special = false;
params.language = "en";
params.translate = false;

// Process audio with the custom parameters
whisper_full(ctx, params, pcmf32.data(), pcmf32.size());

// Get token-level timing information (t0 / t1 are timestamps in 10 ms units)
for (int i = 0; i < whisper_full_n_segments(ctx); i++) {
    int n_tokens = whisper_full_n_tokens(ctx, i);
    for (int j = 0; j < n_tokens; j++) {
        whisper_token_data token = whisper_full_get_token_data(ctx, i, j);
        const char* text = whisper_full_get_token_text(ctx, i, j);
        printf("Token: %s, Start: %.2fs, End: %.2fs\n", text, token.t0 * 0.01, token.t1 * 0.01);
    }
}
Getting Started
- Clone the repository:
git clone https://github.com/Const-me/Whisper.git
- Open the solution in Visual Studio 2022
- Build the project in Release mode
- Download a Whisper model (e.g., ggml-base.en.bin) and place it in the same directory as the executable
- Run the command-line tool:
whisper.exe -m ggml-base.en.bin -f audio.wav
For programmatic use, include the necessary headers and link against the built DLL in your C++ project.
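For illustration only, below is a minimal sketch of consuming the DLL dynamically via LoadLibrary. The export name whisperCreateContext is purely hypothetical; the real, COM-style entry points are declared in the project's headers, so substitute those in actual code.

```cpp
// Minimal sketch: load Whisper.dll at runtime on Windows.
// NOTE: "whisperCreateContext" is a hypothetical export used only for illustration;
// consult the project's headers for the real COM-style API.
#include <windows.h>
#include <cstdio>

int main() {
    HMODULE dll = LoadLibraryW(L"Whisper.dll");
    if (!dll) {
        printf("Failed to load Whisper.dll, error %lu\n", GetLastError());
        return 1;
    }

    // Resolve an exported function by name (hypothetical name shown)
    using factory_t = void* (*)(const wchar_t* modelPath);
    auto createContext = reinterpret_cast<factory_t>(
        GetProcAddress(dll, "whisperCreateContext"));
    if (!createContext)
        printf("Export not found; check the actual declarations in the project's headers\n");

    FreeLibrary(dll);
    return 0;
}
```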
Competitor Comparisons
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Original implementation by OpenAI, ensuring accuracy and alignment with the published model
- Extensive documentation and community support
- Supports multiple languages and tasks (transcription, translation)
Cons of Whisper
- Slower performance, especially on CPU
- Higher memory requirements
- Less optimized for real-time applications
Code Comparison
Whisper (OpenAI, Python):
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Whisper (Const-me, C++):
int result = whisper_full(ctx, params, pcmf32.data(), pcmf32.size());
if (result == 0) {
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const char* text = whisper_full_get_segment_text(ctx, i);
printf("%s", text);
}
}
Summary
openai/whisper is the original implementation, with broad language support and extensive documentation, but it is slower and less optimized for real-time use. Const-me/Whisper offers improved performance and memory efficiency, particularly for C++ applications, but has a steeper learning curve and less comprehensive documentation.
Port of OpenAI's Whisper model in C/C++
Pros of whisper.cpp
- Highly optimized C++ implementation, offering better performance
- Supports various platforms including mobile devices and web browsers
- Provides a command-line interface for easy integration
Cons of whisper.cpp
- Limited to C++ language, potentially reducing accessibility for some developers
- May require more setup and configuration compared to Whisper
Code Comparison
whisper.cpp:
// Initialize whisper context
struct whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");
// Process audio
whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size());
// Print result
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const char * text = whisper_full_get_segment_text(ctx, i);
printf("%s", text);
}
Whisper (C#):
using var whisper = WhisperFactory.FromPath("ggml-base.en.bin");
using var context = whisper.CreateContext();
var result = context.RunFull(new FullParams
{
Language = "en",
Translate = false,
}, audioData);
foreach (var segment in result.Segments)
Console.WriteLine(segment.Text);
Both repositories provide implementations of OpenAI's Whisper model for speech recognition. whisper.cpp focuses on performance and cross-platform support, while Const-me/Whisper targets Windows GPUs and also ships a C# wrapper. The code comparison shows the basic usage of each library, highlighting the differences in syntax and approach.
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Pros of WhisperX
- Offers word-level timestamps and speaker diarization
- Supports multiple languages and provides language detection
- Includes a VAD (Voice Activity Detection) feature for improved accuracy
Cons of WhisperX
- May have higher computational requirements due to additional features
- Potentially slower processing speed compared to Whisper
- Could be more complex to set up and use for beginners
Code Comparison
WhisperX:
import whisperx
model = whisperx.load_model("large-v2")
result = model.transcribe("audio.mp3")
print(result["segments"])
Whisper:
#include "whisper.h"
whisper_context * ctx = whisper_init_from_file("ggml-large.bin");
whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
params.language = "en";
whisper_full(ctx, params, audio_data, audio_len);
The code snippets demonstrate the basic usage of each library. WhisperX uses Python and provides a more straightforward API, while Whisper is implemented in C++ and requires more setup. WhisperX offers additional features like speaker diarization and word-level timestamps, making it more suitable for advanced transcription tasks. However, Whisper may be more performant for basic transcription needs and could be easier to integrate into existing C++ projects.
Faster Whisper transcription with CTranslate2
Pros of faster-whisper
- Significantly faster inference times, especially for long audio files
- Supports both CPU and GPU acceleration
- Implements efficient beam search for improved transcription accuracy
Cons of faster-whisper
- May have slightly lower accuracy compared to the original Whisper model
- Requires additional dependencies (CTranslate2 and FFmpeg)
- Less straightforward installation process for non-technical users
Code Comparison
faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Whisper:
import whisper
model = whisper.load_model("large")
result = model.transcribe("audio.mp3")
print(result["text"])
The code comparison shows that faster-whisper offers more granular control over transcription parameters and provides segment-level output, while Whisper has a simpler API but less flexibility in output formatting.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Comprehensive toolkit for sequence modeling tasks
- Supports a wide range of architectures and tasks
- Highly customizable and extensible
Cons of fairseq
- Steeper learning curve due to its complexity
- Requires more setup and configuration
- May be overkill for simple speech recognition tasks
Code Comparison
Whisper (C++ implementation):
void CWhisperContext::runEncoder( const float* samples, int nSamples )
{
ggml_tensor* mel = ggml_new_tensor_2d( ctx, GGML_TYPE_F32, n_mel, n_frames );
whisper_mel_spectrogram( samples, nSamples, hParams.n_sample_rate, mel->data, mel->ne[0], mel->ne[1] );
// ... (encoder processing)
}
fairseq (Python implementation):
class WhisperEncoder(FairseqEncoder):
def forward(self, src_tokens, src_lengths):
x = self.embed_positions(src_tokens)
x = self.dropout_module(x)
encoder_padding_mask = src_tokens.eq(self.padding_idx)
x = self.transformer_layers(x, encoder_padding_mask)
return {
"encoder_out": [x], # T x B x C
"encoder_padding_mask": [encoder_padding_mask], # B x T
}
Note: The code snippets are simplified examples and may not represent the full functionality of each repository.
README
This project is a Windows port of the whisper.cpp implementation, which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.
Quick Start Guide
Download WhisperDesktop.zip from the "Releases" section of this repository, unpack the ZIP, and run WhisperDesktop.exe.
On the first screen it will ask you to download a model.
I recommend ggml-medium.bin (1.42GB in size), because I've mostly tested the software with that model.
The next screen allows you to transcribe an audio file.
There's another screen which allows you to capture and transcribe or translate live audio from a microphone.
Features
- Vendor-agnostic GPGPU based on DirectCompute; another name for that technology is "compute shaders in Direct3D 11"
- Plain C++ implementation, no runtime dependencies except essential OS components
- Much faster than OpenAI's implementation. On my desktop computer with GeForce 1080Ti GPU, medium model, 3:24 min speech took 45 seconds to transcribe with PyTorch and CUDA, but only 19 seconds with my implementation and DirectCompute. Fun fact: that's 9.63 gigabytes of runtime dependencies, versus the 431 kilobytes Whisper.dll
- Mixed F16 / F32 precision: Windows requires support of R16_FLOAT buffers since D3D version 10.0 (see the capability-check sketch after this list)
- Built-in performance profiler which measures execution time of individual compute shaders
- Low memory usage
- Media Foundation for audio handling; supports most audio and video formats (with the notable exception of Ogg Vorbis), and most audio capture devices which work on Windows (except some professional ones, which only implement the ASIO API)
- Voice activity detection for audio capture. The implementation is based on the 2009 article "A simple but efficient real-time voice activity detection algorithm" by Mohammad Moattar and Mahdi Homayoonpoor.
- Easy to use COM-style API. An idiomatic C# wrapper is available on NuGet. Version 1.10 introduced scripting support for PowerShell 5.1, the older "Windows PowerShell" version which comes pre-installed on Windows.
- Pre-built binaries available
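As an illustration of the Direct3D 11 and mixed-precision requirements above, here is a minimal sketch using only the public D3D11 API (not code from this repository): it creates a hardware device at feature level 11.0 and asks the driver whether R16_FLOAT data can be used in buffers and unordered access views.

```cpp
// Minimal sketch: create a D3D11 device and verify R16_FLOAT buffer support,
// i.e. the prerequisites for the mixed F16 / F32 precision path described above.
#include <d3d11.h>
#include <cstdio>
#pragma comment(lib, "d3d11.lib")

int main() {
    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* context = nullptr;
    const D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_0 };

    // Create a hardware device with feature level 11.0, as the library requires
    HRESULT hr = D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                   levels, 1, D3D11_SDK_VERSION,
                                   &device, nullptr, &context);
    if (FAILED(hr)) {
        printf("No Direct3D 11.0 capable GPU found, hr = 0x%08lX\n",
               static_cast<unsigned long>(hr));
        return 1;
    }

    // Ask the driver which operations are supported for 16-bit float data
    UINT support = 0;
    device->CheckFormatSupport(DXGI_FORMAT_R16_FLOAT, &support);
    const bool f16Buffers =
        (support & D3D11_FORMAT_SUPPORT_BUFFER) &&
        (support & D3D11_FORMAT_SUPPORT_TYPED_UNORDERED_ACCESS_VIEW);
    printf("R16_FLOAT buffers usable from compute shaders: %s\n",
           f16Buffers ? "yes" : "no");

    context->Release();
    device->Release();
    return 0;
}
```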
The only supported platform is 64-bit Windows.
Should work on Windows 8.1 or newer, but I have only tested on Windows 10.
The library requires a Direct3D 11.0 capable GPU, which in 2023 simply means "any hardware GPU".
The most recent GPU without D3D 11.0 support was Intel Sandy Bridge from 2011.
On the CPU side, the library requires AVX1 and F16C support.
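A quick way to verify the CPU-side requirement at runtime is a CPUID query; this is a generic MSVC sketch, not part of the library itself.

```cpp
// Minimal sketch: check for the AVX1 and F16C instruction sets via CPUID (MSVC).
// A complete check would also confirm OSXSAVE / XGETBV for OS-level AVX support.
#include <intrin.h>
#include <cstdio>

int main() {
    int info[4] = { 0 };
    __cpuid(info, 1);                              // leaf 1: feature bits in ECX / EDX

    const bool avx  = (info[2] & (1 << 28)) != 0;  // ECX bit 28: AVX
    const bool f16c = (info[2] & (1 << 29)) != 0;  // ECX bit 29: F16C

    printf("AVX1: %s, F16C: %s\n", avx ? "yes" : "no", f16c ? "yes" : "no");
    return (avx && f16c) ? 0 : 1;
}
```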
Developer Guide
Build Instructions
- Clone this repository
- Open WhisperCpp.sln in Visual Studio 2022. I'm using the freeware community edition, version 17.4.4.
- Switch to the Release configuration
- Build and run the CompressShaders C# project, in the Tools subfolder of the solution. To run that project, right click in Visual Studio, "Set as startup project", then in the main menu of VS "Debug / Start Without Debugging". When completed successfully, you should see a console window with a line like this:
Compressed 46 compute shaders, 123.5 kb -> 18.0 kb
- Build the Whisper project to get the native DLL, or WhisperNet for the C# wrapper and nuget package, or the examples.
Other Notes
If you are going to consume the library in software built with Visual C++ 2022 or newer, you probably redistribute the Visual C++ runtime DLLs in the form of the .msm merge module, or the vc_redist.x64.exe binary.
If you do that, right click on the Whisper project, Properties, C/C++, Code Generation, switch the "Runtime Library" setting from Multi-threaded (/MT) to Multi-threaded DLL (/MD), and rebuild: the binary will become smaller.
The library includes RenderDoc GPU debugger integration.
When your program is launched from RenderDoc, hold the F12 key to capture the compute calls.
If you are going to debug HLSL shaders, use the debug build of the DLL: it includes debug builds of the shaders, and you'll get a better experience in the debugger.
The repository includes a lot of code which was only used for development:
a couple of alternative model implementations, compatible FP64 versions of some compute shaders, debug tracing and the tool to compare the traces, etc.
That stuff is disabled by preprocessor macros or constexpr flags; I hope it's fine to keep it here.
Performance Notes
I have a limited selection of GPUs in this house.
Specifically, I have optimized for nVidia 1080Ti, Radeon Vega 8 inside Ryzen 7 5700G, and Radeon Vega 7 inside Ryzen 5 5600U.
Here's the summary.
The nVidia delivers relative speed 5.8 for the large model, 10.6 for the medium model.
The AMD Ryzen 5 5600U APU delivers relative speed about 2.2 for the medium model. Not great, but still, much faster than realtime.
I have also tested on nVidia 1650: slower than 1080Ti but pretty good, much faster than realtime.
I have also tested on Intel HD Graphics 4000 inside Core i7-3612QM, the relative speed was 0.14 for medium model, 0.44 for small model.
That's much slower than realtime, but I was happy to find my software works even on the integrated mobile GPU launched in 2012.
I'm not sure the performance is ideal on discrete AMD GPUs or integrated Intel GPUs; I have not specifically optimized for them.
Ideally, they might need slightly different builds of a couple of the most expensive compute shaders, mulMatTiled.hlsl and mulMatByRowTiled.hlsl, and maybe other adjustments, like the useReshapedMatMul() value in the Whisper/D3D/device.h header file.
I don't know how to measure that, but I have a feeling the bottleneck is memory, not compute.
Someone on Hacker News has tested on a 3060Ti, the version with GDDR6 memory.
Compared to the 1080Ti, that GPU has 1.3x the FP32 FLOPS, but 0.92x the VRAM bandwidth.
The app was about 10% slower on the 3060Ti, which tracks the bandwidth ratio rather than the compute ratio.
Further Optimisations
I have only spent a few days optimizing the performance of these shaders.
It might be possible to do much better; here are a few ideas.
- Newer GPUs like Radeon Vega or nVidia 1650 have higher FP16 performance compared to FP32, yet my compute shaders are only using the FP32 data type. Half The Precision, Twice The Fun.
- In the current version, FP16 tensors are using shader resource views to upcast loaded values, and unordered access views to downcast stored ones. It might be a good idea to switch to byte address buffers, load/store complete 4-byte values, and upcast / downcast in HLSL with the f16tof32 / f32tof16 intrinsics.
- In the current version all shaders are compiled offline, and Whisper.dll includes the DXBC byte codes. The HLSL compiler D3DCompiler_47.dll is an OS component, and is pretty fast. For the expensive compute shaders, it's probably a good idea to ship HLSL instead of DXBC, and compile on startup with environment-specific values for the macros (see the sketch after this list).
- It might be a good idea to upgrade the whole thing from D3D11 to D3D12. The newer API is harder to use, but includes potentially useful features not exposed to D3D11: wave intrinsics, and explicit FP16.
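To make the compile-on-startup idea concrete, here is a rough sketch (not code from this repository) that compiles a toy HLSL compute shader with D3DCompile, passing an environment-specific value through a shader macro; the shader body and the TILE macro are invented for illustration.

```cpp
// Minimal sketch: compile HLSL to DXBC at startup via D3DCompiler_47.dll,
// with a macro value chosen for the current GPU. The shader is a toy example.
#include <d3dcompiler.h>
#include <cstdio>
#pragma comment(lib, "d3dcompiler.lib")

static const char hlsl[] =
    "RWStructuredBuffer<float> buf : register(u0);\n"
    "[numthreads(TILE, 1, 1)]\n"
    "void main(uint3 id : SV_DispatchThreadID) { buf[id.x] *= 2.0f; }\n";

int main() {
    // Environment-specific value, e.g. picked per GPU vendor at startup
    const D3D_SHADER_MACRO defines[] = { { "TILE", "64" }, { nullptr, nullptr } };

    ID3DBlob* code = nullptr;
    ID3DBlob* errors = nullptr;
    HRESULT hr = D3DCompile(hlsl, sizeof(hlsl) - 1, "toy.hlsl",
                            defines, nullptr, "main", "cs_5_0",
                            D3DCOMPILE_OPTIMIZATION_LEVEL3, 0, &code, &errors);
    if (FAILED(hr)) {
        printf("Compile failed: %s\n",
               errors ? (const char*)errors->GetBufferPointer() : "unknown error");
        return 1;
    }

    // The resulting DXBC blob would be passed to ID3D11Device::CreateComputeShader
    printf("Compiled %zu bytes of DXBC\n", code->GetBufferSize());
    code->Release();
    if (errors) errors->Release();
    return 0;
}
```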
Missing Features
Automatic language detection is not implemented.
In the current version there's high latency for realtime audio capture: depending on voice activity detection, the figure is about 5-10 seconds.
At least in my tests, the model wasn't happy when I supplied pieces of audio which were too short.
I have increased the latency and called it a day, but ideally this needs a better fix for optimal UX.
Final Words
From my perspective, this is an unpaid hobby project, which I completed over the 2022-23 winter holidays.
The code probably has bugs.
The software is provided "as is", without warranty of any kind.
Thanks to Georgi Gerganov for whisper.cpp implementation,
and the models in GGML binary format.
I don't program Python, and I don't know anything about the ML ecosystem.
I wouldn't have even started this project without a good C++ reference implementation to test my version against.
That whisper.cpp project has an example which uses the same GGML implementation to run another OpenAI model, GPT-2.
It shouldnât be hard to support that ML model with the compute shaders and relevant infrastructure already implemented in this project.
If you find this useful, I'll be very grateful if you consider a donation to the "Come Back Alive" foundation.