Pipeline · Audio & Speech · Free
WhisperX
Whisper with fast forced alignment, accurate word-level timestamps, and multi-speaker diarization.
Installs: 280k
Rating: ★ 4.7
Reviews: 93
WhisperX extends OpenAI Whisper with phoneme-based forced alignment for word-level timestamps accurate to ±20 ms, and integrates pyannote.audio for speaker diarization, so a single command can produce a speaker-labelled transcript.
Key Features
- Word-level timestamps: phoneme alignment via `wav2vec2` gives far more accurate boundaries than Whisper's built-in timestamps
- Speaker diarization: plug in a HuggingFace pyannote token to automatically label each segment by speaker
- Batched inference: chunked audio with `faster-whisper` backend for 70× real-time throughput on GPU
- Language detection: automatic per-segment language ID for multilingual recordings
- SRT/VTT output: emit subtitle files directly from the CLI
- Minimal code change: drop-in replacement for `whisper.load_model` in existing pipelines
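To make the subtitle and speaker-label features concrete, here is a minimal sketch of how speaker-labelled segments could be rendered as SRT cues. It is not WhisperX's own writer. The helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and the segment layout (dicts with `start`, `end`, `text`, and an optional `speaker` key) is an assumption about the shape of the transcript data.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, prefixing a speaker label if present."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"[{seg['speaker']}]: {text}"
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical segments in the assumed layout (not real model output).
example_segments = [
    {"start": 0.0, "end": 1.52, "text": " Hello there.", "speaker": "SPEAKER_00"},
    {"start": 1.9, "end": 3.605, "text": " Hi, how are you?", "speaker": "SPEAKER_01"},
]
print(segments_to_srt(example_segments))
```

Working in integer milliseconds before splitting into hours, minutes, and seconds avoids a cue such as `00:00:59,1000` when a value rounds up at a boundary.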
Quick Start
pip install whisperx
# Transcribe with word timestamps and speaker labels
whisperx audio.mp3 \
--model large-v3 \
--diarize \
--hf_token hf_xxx \
--output_format srt
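import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")  # decode once, reuse for alignment
model = whisperx.load_model("large-v3", device=device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Load the wav2vec2 alignment model for the detected language, then align words
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)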
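Once words carry aligned timestamps, diarization comes down to matching each word against the speaker turns. The sketch below illustrates that idea with a simple largest-overlap rule. It is a conceptual illustration, not WhisperX's implementation. The function names `overlap` and `assign_speakers`, and the `words` and `turns` layouts, are hypothetical and use made-up values.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn get speaker None.
    """
    labelled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**word, "speaker": best_speaker})
    return labelled


# Hypothetical aligned words and diarization turns (illustrative values only).
words = [
    {"word": "Hello", "start": 0.10, "end": 0.45},
    {"word": "there", "start": 0.50, "end": 0.90},
    {"word": "Hi", "start": 1.95, "end": 2.10},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 3.0},
]

for item in assign_speakers(words, turns):
    print(item["speaker"], item["word"])
```

The rule compares individual turns rather than summing overlap per speaker, which keeps the sketch short. It also shows why tighter word boundaries matter: a word whose timestamps drift across a turn boundary can be assigned to the wrong speaker.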
npx ai-supply add whisperx-forced-alignment-diarization
Curated mirror of the open-source WhisperX (BSD-2-Clause). Get it from the source.