Name: WhisperX
Author: ai-supply

WhisperX

WhisperX extends OpenAI Whisper with phoneme-based forced alignment for word-level timestamps accurate to ±20ms, and integrates pyannote.audio for speaker diarization, so a single command can produce a speaker-labelled transcript.

Key Features

Word-level timestamps: phoneme alignment via wav2vec2 gives far more accurate boundaries than Whisper's built-in timestamps
Speaker diarization: supply a Hugging Face access token (needed to download the pyannote models) to automatically label each segment by speaker
Batched inference: chunked audio with faster-whisper backend for 70× real-time throughput on GPU
Language detection: automatic language identification when no language is specified
SRT/VTT output: emit subtitle files directly from the CLI
Minimal code change: drop-in replacement for whisper.load_model in existing pipelines
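
To make the timestamp, speaker-label and subtitle features above more concrete, here is a minimal sketch of rendering speaker-labelled segments as SRT cues. It is not WhisperX's own writer: the helper names format_srt_time and segments_to_srt are hypothetical, and the segment shape (start, end, text, optional speaker) is an assumption about what an aligned and diarized result looks like.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker if present."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            text = f"[{speaker}]: {text}"
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Hypothetical segments, shaped like an aligned and diarized result.
example_segments = [
    {"start": 0.52, "end": 2.31, "text": " Hello there.", "speaker": "SPEAKER_00"},
    {"start": 2.80, "end": 4.05, "text": " Hi, how are you?", "speaker": "SPEAKER_01"},
]

srt_text = segments_to_srt(example_segments)
print(srt_text)
```

In practice the CLI's --output_format srt option does this for you; the sketch only shows what information each cue carries.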

Quick Start

pip install whisperx

# Transcribe with word timestamps and speaker labels
whisperx audio.mp3 \
  --model large-v3 \
  --diarize \
  --hf_token hf_xxx \
  --output_format srt

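import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Align against a wav2vec2 model to get word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)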
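
Conceptually, speaker labelling combines two timelines: the aligned words and the speaker turns produced by diarization. The sketch below illustrates one simple way to merge them, by giving each word the speaker whose turn overlaps it the most. It is an illustration under assumptions, not WhisperX's implementation: assign_speakers, overlap and the sample data are all hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list, turns: list) -> list:
    """Return copies of the words, each tagged with the best-overlapping speaker.

    A word that overlaps no speaker turn gets speaker None.
    """
    labelled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**word, "speaker": best_speaker})
    return labelled


# Hypothetical aligned words and diarization turns (times in seconds).
words = [
    {"word": "Hello", "start": 0.50, "end": 0.90},
    {"word": "there", "start": 0.95, "end": 1.40},
    {"word": "Hi", "start": 2.80, "end": 3.00},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0},
    {"speaker": "SPEAKER_01", "start": 2.5, "end": 5.0},
]

labelled = assign_speakers(words, turns)
for item in labelled:
    print(item["speaker"], item["word"])
```

Because the assignment works per word, accurate word boundaries from the alignment step matter most near speaker changes, where a coarse segment-level timestamp could straddle two turns.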

npx ai-supply add whisperx-forced-alignment-diarization

Curated mirror of the open-source WhisperX (BSD-2-Clause). Get it from the source.
