A simple ASR API service powered by WhisperX for transcription with speaker diarization. Built for self-hosters running Speakr or similar applications.
v0.3.2: two Docker image variants are now published per release.
| Your GPU | Pull this tag |
|---|---|
| RTX 50xx (Blackwell) | learnedmachine/whisperx-asr-service:blackwell (PyTorch 2.8.0 / cu128) |
| Anything else (10xx, 20xx, 30xx, 40xx, A-series, H-series) | learnedmachine/whisperx-asr-service:latest (PyTorch 2.7.1 / cu126) |

The :latest tag was previously broken on Pascal cards. v0.3.2 fixes this by re-pinning torch after the WhisperX install. See Image Variants and Changelog.
- Transcribes audio files using OpenAI Whisper models
- Identifies speakers ("Who spoke when") using Pyannote.audio
- Returns word-level timestamps
- Supports 90+ languages
- Outputs JSON, SRT, VTT, TSV formats
- Runs on your own GPU hardware via Docker
- Not production-grade: Basic error handling, no authentication
- GPU required: Needs NVIDIA GPU with 14GB+ VRAM for large models
- File size limits: Large audio files (>1GB) can cause out-of-memory errors
- VRAM usage: Memory consumption increases with file size and diarization
- Alpha software: Expect bugs and breaking changes
Audio --> Whisper (transcription) --> Wav2Vec2 (alignment) --> Pyannote (speaker ID) --> Output
The service supports two serving modes:
- Simple mode (default): Single-process uvicorn with an async GPU queue. Requests are serialized through a semaphore so only one pipeline runs on the GPU at a time. Good for single-GPU, low-traffic, or development use.
- Ray Serve mode: Runs on Ray Serve with cross-request batching (@serve.batch). Scales from one GPU to multiple GPUs. Two pipeline strategies are available:
  - Replicate (default): Each GPU runs the complete pipeline (Whisper + Align + Diarize). 4 GPUs = 4 independent pipeline replicas, no cross-GPU data transfer.
  - Split: Each stage runs as a separate deployment with fractional GPU allocation. Useful when you want independent scaling per stage.
GPU memory requirements vary by model size:
| Whisper Model | VRAM Required (with diarization) | Suitable GPUs |
|---|---|---|
| tiny, base | ~4-5GB | RTX 3060 8GB, RTX 2060, GTX 1660 Ti |
| small | ~6GB | RTX 3060, RTX 2070, RTX 2080 |
| medium | ~10GB | RTX 3080, RTX 3060 12GB, RTX 2080 Ti |
| large-v2, large-v3 | ~14GB | RTX 3090, RTX 4090, A6000, A100 |
Note: Measured with preloaded model + alignment + pyannote community-1 diarization on RTX 3090
Minimum Configuration (small/medium models):
- GPU: NVIDIA RTX 3060 (12GB VRAM) or better
- CPU: 8+ cores
- RAM: 16GB
- Storage: 50GB SSD
Recommended (large-v3 with diarization):
- GPU: NVIDIA RTX 3090 (24GB VRAM) or RTX 4090
- CPU: 12+ cores
- RAM: 32GB
- Storage: 100GB SSD
- Docker and Docker Compose
- NVIDIA Docker Runtime (for GPU support)
- Hugging Face Account (for speaker diarization models)
Two prebuilt Docker images are published per release. They differ only in the PyTorch wheel they ship; the application code is identical.
| Tag | PyTorch | CUDA wheels | Supported GPUs |
|---|---|---|---|
| :latest, :vX.Y.Z | 2.7.1 | cu126 | Pascal (10xx) through Hopper. Compatible with the broadest GPU range. |
| :blackwell, :vX.Y.Z-blackwell | 2.8.0 | cu128 | Blackwell (RTX 50xx). Drops Pascal/Maxwell support per the PyTorch 2.8 cuDNN/CUDA 12.8 build. |
If you have an RTX 50xx, use the :blackwell tag; everyone else should use :latest.
To build a custom variant locally, override the build args:
docker build \
--build-arg TORCH_VERSION=2.7.1 \
--build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu126 \
  -t whisperx-asr-service:custom .

Get up and running in 3 steps using the prebuilt Docker image:
Speaker diarization requires a Hugging Face token and model access:
a) Create Hugging Face Account:
- Visit: https://huggingface.co/join and sign up
b) Accept Model User Agreements (ALL REQUIRED):
You need to accept agreements for all three models:
- pyannote/speaker-diarization-community-1 - Click "Agree and access repository"
- pyannote/segmentation-3.0 - Click "Agree and access repository"
- pyannote/speaker-diarization-3.1 - Click "Agree and access repository"
c) Generate Access Token:
- Visit: https://huggingface.co/settings/tokens
- Click "New token", name it (e.g., "whisperx-diarization")
- Select "Read" permission and generate
- Copy the token (starts with hf_...)
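If you want to confirm the token and the gated-model access before starting the container, a quick check with the huggingface_hub Python package can save a failed first run. This is an optional sketch, not part of the service; it assumes `pip install huggingface_hub` and the three repo names listed above:

```python
# Sketch: verify the HF token and gated pyannote repos before the first run.
from huggingface_hub import HfApi

TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # your token here
REPOS = [
    "pyannote/speaker-diarization-community-1",
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-3.1",
]

api = HfApi(token=TOKEN)
print("Token belongs to:", api.whoami()["name"])
for repo in REPOS:
    try:
        api.model_info(repo)          # raises if the user agreement was not accepted
        print(f"OK       {repo}")
    except Exception as exc:          # GatedRepoError, HTTP 403, etc.
        print(f"BLOCKED  {repo}: {exc}")
```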
Create a .env file with your Hugging Face token:
# Create .env file
cat > .env << 'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
DEVICE=cuda
COMPUTE_TYPE=float16
BATCH_SIZE=16
PRELOAD_MODEL=large-v3
MAX_FILE_SIZE_MB=1000
# Serve mode: simple (default) or ray (Ray Serve with batching)
SERVE_MODE=simple
EOF

Replace hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with your actual token.
Download the docker-compose.yml file and start the service:
# Download docker-compose.yml
curl -O https://raw.githubusercontent.com/murtaza-nasir/whisperx-asr-service/main/docker-compose.yml
# RTX 50xx (Blackwell) only: switch to the blackwell image
# sed -i 's|whisperx-asr-service:latest|whisperx-asr-service:blackwell|' docker-compose.yml
# Start the service (pulls prebuilt image automatically)
docker compose up -d
# Check logs
docker compose logs -f

The bundled docker-compose.yml uses :latest (PyTorch 2.7.1 / cu126),
which works on every NVIDIA card from Pascal through Hopper. RTX 50xx
users should swap to :blackwell (PyTorch 2.8.0 / cu128). See
Image Variants for the full matrix.
Or run with Docker command:
# Pick the right tag for your GPU:
IMAGE=learnedmachine/whisperx-asr-service:latest # 10xx-40xx, A/H-series
# IMAGE=learnedmachine/whisperx-asr-service:blackwell # RTX 50xx only
docker run -d \
--name whisperx-asr-api \
--gpus all \
-p 9000:9000 \
-e DEVICE=cuda \
-e COMPUTE_TYPE=float16 \
-e BATCH_SIZE=16 \
-e HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-e PRELOAD_MODEL=large-v3 \
-v whisperx-cache:/.cache \
  --restart unless-stopped \
  "$IMAGE"

The service will be available at http://localhost:9000.
# Health check
curl http://localhost:9000/health
# Test transcription
curl -X POST http://localhost:9000/asr \
  -F "audio_file=@your_audio.mp3" \
  -F "language=en"

For development or if you want to build from source:
git clone https://github.com/murtaza-nasir/whisperx-asr-service.git
cd whisperx-asr-service

# Copy example environment file
cp .env.example .env
# Edit .env and add your Hugging Face token
nano .env

Using docker-compose.dev.yml (with live code mounting):
# Build and start
docker compose -f docker-compose.dev.yml up -d --build
# Check logs
docker compose -f docker-compose.dev.yml logs -f

Or build manually:
# Build image
docker build -t whisperx-asr-service .
# Run container
docker run -d \
--name whisperx-asr-api \
--gpus all \
-p 9000:9000 \
--env-file .env \
-v whisperx-cache:/.cache \
  whisperx-asr-service

Note: The docker-compose.dev.yml file mounts the ./app directory for live code changes without rebuilding.
Once running, visit http://localhost:9000/docs for interactive API documentation.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_file | File | Required | Audio file to transcribe |
| task | String | transcribe | Task type: transcribe or translate |
| language | String | Auto-detect | Language code (e.g., en, es, fr) |
| model | String | large-v3 | Whisper model: tiny, base, small, medium, large-v2, large-v3 |
| initial_prompt | String | None | Context or spelling guide to steer the model |
| hotwords | String | None | Comma-separated words to bias transcription toward |
| output_format | String | json | Output format: json, text, srt, vtt, tsv |
| word_timestamps | Boolean | true | Return word-level timestamps |
| diarize | Boolean | true | Enable speaker diarization |
| num_speakers | Integer | Auto | Exact number of speakers (if known, overrides min/max) |
| min_speakers | Integer | Auto | Minimum number of speakers |
| max_speakers | Integer | Auto | Maximum number of speakers |
Example Request (JSON output):
curl -X POST http://localhost:9000/asr \
-F "audio_file=@meeting.mp3" \
-F "language=en" \
-F "model=large-v3" \
-F "output_format=json" \
-F "diarize=true" \
-F "min_speakers=2" \
  -F "max_speakers=5"

Example Request (SRT subtitles):
curl -X POST http://localhost:9000/asr \
-F "audio_file=@video.mp4" \
-F "language=en" \
-F "output_format=srt" \
  -F "diarize=false"

Example Response (JSON):
{
"text": [
{
"start": 0.5,
"end": 2.3,
"text": " Hello, welcome to the meeting.",
"speaker": "SPEAKER_00",
"words": [
{"word": "Hello", "start": 0.5, "end": 0.8, "score": 0.95},
{"word": "welcome", "start": 0.9, "end": 1.2, "score": 0.93}
]
}
],
"language": "en",
"segments": [...],
"word_segments": [...]
}

When you know the exact number of speakers, use num_speakers for more accurate diarization:
curl -X POST http://localhost:9000/asr \
-F "audio_file=@interview.mp3" \
-F "num_speakers=2" \
-F "diarize=true"This overrides min_speakers and max_speakers and typically provides better accuracy than range-based detection.
This service automatically uses exclusive speaker diarization when available from the pyannote community-1 model. This feature simplifies reconciliation between fine-grained speaker diarization timestamps and transcription timestamps, making it ideal for applications like Speakr where you need to align transcripts with speaker segments.
Benefits:
- More accurate timestamp alignment between speakers and words
- Better handling of speaker transitions
- Simplified post-processing for multi-speaker transcripts
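To see why cleaner diarization timestamps matter downstream, here is an illustrative client-side reconciliation pass (not part of the service) that assigns each word the speaker whose segment overlaps it most. Field names follow the example JSON response above; words returned without timestamps are skipped:

```python
# Sketch: assign each word to the speaker whose segment overlaps it most.
def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_words(result):
    segments = result["text"]                      # speaker-labelled segments from /asr
    for word in result.get("word_segments", []):
        if "start" not in word or not segments:    # some words come back without timestamps
            continue
        best = max(
            segments,
            key=lambda s: overlap(word["start"], word["end"], s["start"], s["end"]),
        )
        word["speaker"] = best.get("speaker")
    return result
```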
Whisper often misspells brand names, acronyms, and domain-specific terms. You can improve accuracy using hotwords and initial_prompt:
- hotwords biases the model's beam search to favor specific words during decoding
- initial_prompt provides a sentence of context that primes the model to expect certain spellings
Example: transcribing audio that says "We deployed Speakr on a Kubernetes cluster using CTranslate2 for inference. PyAnnote handles the diarization."
# Baseline (no hints)
curl -X POST "http://localhost:9000/asr?language=en" \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using ctranslate2...PnNote handles the diarization."
# With hotwords
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote" \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using CTranslate2...PyAnnote handles the diarization."
# With hotwords + initial_prompt (best results)
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote&initial_prompt=Speakr is a transcription app." \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speakr...using CTranslate2...PyAnnote handles the diarization."

| Word | No hints | Hotwords | Hotwords + initial_prompt |
|---|---|---|---|
| CTranslate2 | ctranslate2 | CTranslate2 | CTranslate2 |
| PyAnnote | PnNote | PyAnnote | PyAnnote |
| Speakr | Speaker | Speaker | Speakr |
hotwords alone fixes most spelling issues. Words that sound identical to common English words (like "Speakr" vs "Speaker") may need initial_prompt as well to provide enough context for the model to override its default.
The OpenAI-compatible endpoints (/v1/audio/transcriptions) also support a hotwords form field. If only prompt is provided, it is used as hotwords.
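The same hint mechanism through the OpenAI-compatible route, sketched with plain requests. The multipart field names are assumed to mirror the OpenAI API (file, model, prompt); hotwords is the service-specific extension described above, and whisper-1 is one of the advertised aliases:

```python
# Sketch: OpenAI-compatible transcription endpoint with hotwords.
import requests

with open("meeting.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:9000/v1/audio/transcriptions",
        files={"file": f},
        data={
            "model": "whisper-1",                       # alias accepted by the service
            "hotwords": "Speakr,CTranslate2,PyAnnote",  # service-specific extension
            # "prompt": "Speakr is a transcription app.",  # used as hotwords if hotwords is absent
        },
        timeout=3600,
    )
resp.raise_for_status()
print(resp.json())
```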
A test script is included to verify hotwords with your own audio:
./tests/test_hotwords.sh testfiles/your_audio.flac
# Custom hotwords
HOTWORDS="MyBrand,TechTerm" INITIAL_PROMPT="MyBrand is a product." \
./tests/test_hotwords.sh testfiles/your_audio.flac

To use this service with Speakr instead of the default ASR endpoint:
Update Speakr's .env file:
# Enable ASR endpoint
USE_ASR_ENDPOINT=true
# Point to WhisperX service
ASR_BASE_URL=http://whisperx-asr-api:9000

If Speakr and WhisperX are in the same Docker Compose stack, use the container name. Otherwise, use http://localhost:9000.
- On GPU Machine: Deploy this service
# Make service accessible from network
# Edit docker-compose.yml ports:
ports:
  - "0.0.0.0:9000:9000" # Expose to network

- On Speakr Machine: Update configuration
# In Speakr's .env file
USE_ASR_ENDPOINT=true
ASR_BASE_URL=http://<GPU_MACHINE_IP>:9000

Note: Replace <GPU_MACHINE_IP> with your GPU server's IP address. Use firewall rules to restrict access to trusted machines only.
Edit .env to customize:
# GPU or CPU processing
DEVICE=cuda # cuda for GPU, cpu for CPU-only
# Computation precision
COMPUTE_TYPE=float16 # float16 (GPU), float32 (CPU), int8 (faster, lower quality)
# Batch size (higher = faster but more memory). Default is device-aware:
# 16 on cuda, 2 on cpu. Long audio on CPU benefits from BATCH_SIZE=1.
BATCH_SIZE=16 # 16 for 8GB VRAM, 32+ for high-end GPUs, 1-2 on CPU
# Hugging Face token for diarization
HF_TOKEN=hf_xxx...
# Model preloading (optional, reduces first-request latency)
PRELOAD_MODEL=large-v3 # Leave empty to disable, or set to: tiny, base, small, medium, large-v2, large-v3
# Maximum file size in MB (prevents out-of-memory errors)
MAX_FILE_SIZE_MB=1000 # Default 1GB, adjust lower for GPUs with <16GB VRAM
# Idle model eviction (default disabled). When > 0, Whisper models that have
# not served a request in this many seconds are unloaded from memory by a
# background sweep. The next request that needs the model will reload it.
MODEL_KEEP_ALIVE_SECONDS=0 # 0 disables eviction; e.g. 3600 = 1 hour
MODEL_EVICTION_INTERVAL_SECONDS=60 # Sweep frequency (floor of 30 seconds)

Controlled by the SERVE_MODE environment variable in your .env file.
SERVE_MODE=simple

Runs uvicorn directly. Requests are serialized through an async GPU semaphore so the event loop stays responsive while GPU work runs in a thread pool. This is backward-compatible with previous versions.
You can tune GPU_CONCURRENCY (default 1) to control how many pipeline runs execute concurrently; leave it at 1 for single-GPU setups.
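Conceptually, the simple-mode queue looks roughly like the sketch below. This is illustrative only, not the service's actual code: GPU_CONCURRENCY maps to the semaphore size, and the blocking pipeline runs in a worker thread so the event loop keeps serving /health and /metrics:

```python
# Sketch: an async GPU queue that serializes GPU work behind a semaphore
# while keeping the event loop responsive. Illustrative only.
import asyncio

GPU_CONCURRENCY = 1
gpu_slots = asyncio.Semaphore(GPU_CONCURRENCY)

def run_pipeline_blocking(audio_path: str) -> dict:
    # placeholder for the real Whisper -> align -> diarize pipeline
    ...

async def transcribe(audio_path: str) -> dict:
    async with gpu_slots:                      # only N pipelines touch the GPU at once
        loop = asyncio.get_running_loop()
        # run the blocking GPU work in the default thread pool
        return await loop.run_in_executor(None, run_pipeline_blocking, audio_path)
```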
SERVE_MODE=ray

Runs on Ray Serve with cross-request batching (@serve.batch). Two pipeline strategies are available, controlled by PIPELINE_STRATEGY:
PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4 # one full pipeline per GPU

Each GPU runs the complete 3-stage pipeline. Ray Serve routes incoming requests across replicas.
+-- GPU 0: [Whisper + Align + Diarize] --+
HTTP --> Proxy --> +-- GPU 1: [Whisper + Align + Diarize] --+--> Response
+-- GPU 2: [Whisper + Align + Diarize] --+
+-- GPU 3: [Whisper + Align + Diarize] --+
PIPELINE_STRATEGY=split

Each pipeline stage runs as a separate deployment with fractional GPU allocation and cross-request batching (@serve.batch). Stage-level pipelining means request B can start transcription while request A is still in diarization.
HTTP --> Proxy --> ASR Ingress
|
+--------+--------+
| | |
Whisper Align Diarize
(GPU 0.5) (GPU 0.3) (GPU 0.2)
Both strategies achieve similar throughput (~6.3-6.5x speedup on 4x RTX 3090 with 8 concurrent workers).
| | Replicate | Split |
|---|---|---|
| Configuration | Simple (just set NUM_GPU_REPLICAS) | Complex (GPU fractions, per-stage replicas, bin-packing tuning) |
| Cross-GPU transfer | None (audio stays on one GPU) | Yes (results move between stages) |
| Tail latency | Higher variance | Lower and more consistent |
| Scaling | Add a GPU, add a replica | Scale bottleneck stages independently (e.g. more Whisper replicas) |
| VRAM per GPU | Must fit all 3 models | Each stage uses a fraction |
| Best for | Most setups, multi-GPU throughput | Advanced tuning, heterogeneous stage scaling |
All optional, shown with defaults:
# Pipeline strategy: replicate (full pipeline per GPU) or split (stage per GPU)
PIPELINE_STRATEGY=replicate
# Number of pipeline replicas (set to number of GPUs)
NUM_GPU_REPLICAS=1
# Cross-request batch sizes per stage (tune for GPU VRAM)
WHISPER_BATCH_SIZE=4
ALIGN_BATCH_SIZE=8
DIARIZE_BATCH_SIZE=2
# Seconds to wait collecting a batch before processing what's available
BATCH_WAIT_TIMEOUT=0.1
# Split strategy only: fractional GPU allocation per stage
WHISPER_GPU_FRACTION=0.5
ALIGN_GPU_FRACTION=0.3
DIARIZE_GPU_FRACTION=0.2
# Split strategy only: per-stage replica overrides (fall back to NUM_GPU_REPLICAS)
# WHISPER_NUM_REPLICAS=2
# ALIGN_NUM_REPLICAS=1
# DIARIZE_NUM_REPLICAS=1

# 4 GPUs, full pipeline on each (recommended)
SERVE_MODE=ray
PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4
# 3 GPUs, one stage per GPU
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_GPU_FRACTION=1.0
ALIGN_GPU_FRACTION=1.0
DIARIZE_GPU_FRACTION=1.0
# 4 GPUs, hybrid: 2 Whisper + 1 Align + 1 Diarize
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_NUM_REPLICAS=2 WHISPER_GPU_FRACTION=1.0
ALIGN_NUM_REPLICAS=1 ALIGN_GPU_FRACTION=1.0
DIARIZE_NUM_REPLICAS=1 DIARIZE_GPU_FRACTION=1.0

When running in Ray mode, the Ray Dashboard is available at http://localhost:8265 for monitoring deployments, replicas, and request metrics.
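Cross-request batching only pays off when requests arrive concurrently; the bundled stress test does this, and a minimal version looks like the sketch below. The file list, worker count, and timeout are placeholders:

```python
# Sketch: fire several /asr requests concurrently so Ray Serve can batch them.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

FILES = ["testfiles/a.mp3", "testfiles/b.mp3", "testfiles/c.mp3", "testfiles/d.mp3"]

def transcribe(path: str) -> float:
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"diarize": "true"},
                          timeout=3600)
    r.raise_for_status()
    return time.perf_counter() - t0

with ThreadPoolExecutor(max_workers=4) as pool:
    for path, secs in zip(FILES, pool.map(transcribe, FILES)):
        print(f"{path}: {secs:.1f}s")
```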
To restrict which GPUs the service uses, create a docker-compose.dev.local.yml override (gitignored):
services:
whisperx-asr:
environment:
- CUDA_VISIBLE_DEVICES=0,1,2,3
      - NUM_GPU_REPLICAS=4

Use CUDA_VISIBLE_DEVICES (not NVIDIA_VISIBLE_DEVICES) since the Docker Compose deploy section exposes all GPUs to the container.
Then start with:
docker compose -f docker-compose.dev.yml -f docker-compose.dev.local.yml up -d

Available Whisper models (speed vs accuracy tradeoff):
| Model | Parameters | VRAM (model only) | VRAM (full pipeline*) | Speed | Quality |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | ~4GB | Fastest | Lowest |
| base | 74M | ~1GB | ~5GB | Very Fast | Low |
| small | 244M | ~2GB | ~6GB | Fast | Medium |
| medium | 769M | ~5GB | ~10GB | Moderate | Good |
| large-v2 | 1550M | ~10GB | ~14GB | Slow | Excellent |
| large-v3 | 1550M | ~10GB | ~14GB | Slow | Best |
*Full pipeline = Whisper model + alignment model + pyannote speaker diarization (measured on RTX 3090)
Recommendation:
- Use large-v3 for best quality (requires 16GB+ VRAM)
- Use small or medium for speed/resource constraints (8-12GB VRAM)
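If you are unsure where on this tradeoff you sit, the model form parameter lets you compare candidates on one of your own files. A quick wall-clock timing sketch (accuracy still has to be judged by reading the output; file name and model list are placeholders):

```python
# Sketch: time the same file across a few Whisper model sizes via the `model` field.
import time
import requests

for model in ["small", "medium", "large-v3"]:
    t0 = time.time()
    with open("sample.mp3", "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"model": model, "diarize": "false"},
                          timeout=3600)
    r.raise_for_status()
    print(f"{model:10s} {time.time() - t0:6.1f}s")
```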
# Start in simple mode (default)
docker compose up -d
# Or start in Ray Serve mode
# Set SERVE_MODE=ray in your .env file, then:
docker compose up -d
# View logs
docker compose logs -f

Note: When using Ray Serve mode, Docker Compose is configured with shm_size: 8g for Ray's shared memory object store. The Ray Dashboard is exposed on port 8265.
# Real-time logs
docker compose logs -f
# Last 100 lines
docker compose logs --tail=100
# Specific container logs
docker logs whisperx-asr-api

# Check service health
curl http://localhost:9000/health
# Response:
{
"status": "healthy",
"device": "cuda",
"loaded_models": ["large-v3"],
"serve_mode": "ray"
}

GET /metrics returns OpenMetrics text format suitable for a Prometheus
scrape config:
curl http://localhost:9000/metrics
# # HELP whisperx_requests_total Total HTTP requests by endpoint and status
# # TYPE whisperx_requests_total counter
# whisperx_requests_total{endpoint="/asr",status="ok"} 12.0
# whisperx_request_duration_seconds_bucket{endpoint="/asr",le="60.0"} 11.0
# ...

| Metric | Type | Notes |
|---|---|---|
| whisperx_requests_total{endpoint,status} | Counter | status is ok, http_<code>, or error |
| whisperx_request_duration_seconds{endpoint} | Histogram | End-to-end handler time |
| whisperx_active_transcriptions | Gauge | In-flight /asr requests |
| whisperx_loaded_models | Gauge | Whisper models currently in cache |
| whisperx_model_evictions_total{model} | Counter | Models unloaded by the idle-eviction sweep |
| whisperx_audio_duration_seconds | Histogram | Submitted audio duration |
| whisperx_audio_size_megabytes | Histogram | Submitted file size |
| whisperx_vram_allocated_bytes | Gauge | CUDA memory_allocated() (0 on CPU) |
| whisperx_service_info | Info | Static labels: version, device, compute_type, serve_mode |
Ray Serve caveat: /metrics is served by the ingress process. Whisper
models load inside replica processes, so whisperx_loaded_models,
whisperx_vram_allocated_bytes, and whisperx_model_evictions_total will
read 0 (or be missing) in Ray Serve mode. These counters and gauges are
per-process; without prometheus_client multi-process mode the ingress'
registry never sees the replica's events. HTTP-level metrics (request
counts, durations, audio sizes) are accurate in both modes because they
are recorded inside the ingress handler. Use the Ray Dashboard at port
8265 for per-replica state, and tail serve/replica_* log files for
eviction events. Multi-process metrics support is a planned follow-up.
In simple mode (SERVE_MODE=simple) all metrics including the eviction
counter and loaded-model gauge work as expected, since the model cache
and the /metrics handler share a process. Each Whisper model load
pre-registers a whisperx_model_evictions_total{model="<name>"} 0 row
so dashboards can graph the metric from the first load onward, not only
after the first eviction.
The legacy JSON shape (queue/loaded model state) is still available at
GET /queue-metrics for callers that depended on it.
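Outside a full Prometheus setup you can still consume the endpoint directly: prometheus_client ships a text-format parser. A small sketch, assuming `pip install prometheus-client requests`:

```python
# Sketch: scrape /metrics and print the whisperx_* series without Prometheus.
import requests
from prometheus_client.parser import text_string_to_metric_families

body = requests.get("http://localhost:9000/metrics", timeout=5).text
for family in text_string_to_metric_families(body):
    if not family.name.startswith("whisperx_"):
        continue
    for sample in family.samples:
        print(sample.name, dict(sample.labels), sample.value)
```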
Monitor GPU usage:
# NVIDIA GPU stats
nvidia-smi -l 1
# Docker container stats
docker stats whisperx-asr-api
# Ray Dashboard (Ray mode only)
# Open http://localhost:8265 in your browser

This service can run completely offline after an initial setup with internet access. This is useful for air-gapped environments or when you want to avoid network latency.
- Start the container with internet access
- Run at least one transcription request with diarization enabled to cache all models:
curl -X POST http://localhost:9000/asr \
  -F "audio_file=@test.mp3" \
  -F "diarize=true"
- This downloads and caches:
- Whisper model (e.g., large-v3)
- Alignment model (wav2vec2)
- Pyannote speaker diarization models
Add HF_HUB_OFFLINE=1 to your docker-compose.yml environment section:
environment:
- HF_HUB_OFFLINE=1
  # ... other environment variables

Important: This must be set directly in docker-compose.yml, not in the .env file.
Then restart the container:
docker compose down && docker compose up -d

The service will now operate without any network requests to Hugging Face.
| Component | Cache Location | Notes |
|---|---|---|
| Whisper models | /.cache/models--Systran--faster-whisper-* | Downloaded on first use |
| Alignment model | /.cache/wav2vec2_*.pth | Downloaded on first alignment |
| Pyannote models | /.cache/huggingface/hub/models--pyannote--* | Downloaded on first diarization |
| NLTK tokenizers | /.cache/nltk_data/ | Pre-downloaded in Docker image |
If you see errors like Failed to resolve 'huggingface.co':
- Ensure you ran a full transcription with diarization while online
- Verify HF_HUB_OFFLINE=1 is set in docker-compose.yml (not .env)
- Check the cache volume contains the models:
docker exec whisperx-asr-api ls -la /.cache/huggingface/hub/
Symptom: Service runs on CPU despite having GPU
Solution:
# Verify NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Symptom: CUDA out of memory errors or VRAM exhaustion with large files
Solutions:
- Reduce file size limit in .env: MAX_FILE_SIZE_MB=500 (default is 1000MB)
- Use a smaller model: small or medium instead of large-v3
- Reduce batch size in .env: BATCH_SIZE=8 or BATCH_SIZE=4
- Use int8 precision: COMPUTE_TYPE=int8 (lower quality but less memory)
- Split large files: process audio in smaller chunks before uploading (see the sketch below)
- Disable diarization: for very large files, skip speaker diarization
Note: The service automatically clears GPU cache between operations to minimize VRAM buildup, but very large files (>500MB) can still cause issues.
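One way to split a long recording before upload is ffmpeg's segment muxer, then posting each chunk separately. This is a sketch; the 10-minute chunk length and file names are arbitrary, and diarization is disabled because speaker labels would not be consistent across independently processed chunks:

```python
# Sketch: split a long file into ~10-minute chunks with ffmpeg, then send each chunk.
import glob
import subprocess
import requests

subprocess.run(
    ["ffmpeg", "-i", "long_recording.mp3", "-f", "segment",
     "-segment_time", "600", "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

for chunk in sorted(glob.glob("chunk_*.mp3")):
    with open(chunk, "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"diarize": "false"},   # labels would not line up across chunks
                          timeout=3600)
    r.raise_for_status()
    print(chunk, "done")
```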
Symptom: No speaker labels in output
Solutions:
- Verify HF_TOKEN is set correctly
- Accept model user agreements on Hugging Face
- Check logs for diarization errors: docker compose logs
- Ensure diarize=true in the request
Symptom: Transcription takes too long
Solutions:
- Use GPU instead of CPU (DEVICE=cuda)
- Use a smaller model for faster processing
- Increase BATCH_SIZE (if you have VRAM headroom)
- Disable diarization if not needed: diarize=false
Symptom: Error message containing Weights only load failed or GLOBAL omegaconf.listconfig.ListConfig was not an allowed global
This occurs due to a security change in PyTorch 2.6 where weights_only=True became the default for torch.load().
Solution:
Add this environment variable to your docker-compose.yml:
environment:
  - TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true

Important: Setting this in the .env file alone may not work - it must be set directly in docker-compose.yml under the environment section.
See WhisperX Issue #1304 for more details.
Check logs:
docker compose logs whisperx-asr

Common causes:
- Invalid audio format (use ffmpeg to convert)
- Model not loaded (check VRAM, logs)
- Incorrect parameters (check API docs)
The service supports formats that WhisperX can process (via FFmpeg):
- Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
- Video: MP4, AVI, MOV, MKV, WebM (audio track extracted)
- Other: AMR, 3GP, 3GPP
Note: Large files (>1GB) may cause out-of-memory errors as files are loaded entirely into memory.
This service has NO built-in authentication or security features.
If exposing to a network:
- Use firewall rules to restrict access
- Consider putting behind a reverse proxy
- Store HF_TOKEN securely (use the .env file, not hardcoded values)
If you run the prebuilt image, pull the new tag:
docker compose pull
docker compose up -d

If you build from source, the default build now installs PyTorch 2.7.1 from the cu126 wheel index (Pascal-Hopper). To target a different combo:
git pull
docker compose build --no-cache \
--build-arg TORCH_VERSION=2.8.0 \
--build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
docker compose up -d

Build args also work via Compose's build.args if you check that into
your overlay file.
# Remove model cache
docker compose down -v
docker volume rm whisperx-asr-service_whisperx-cache
# Rebuild
docker compose up -d

Back up the cache volume to preserve downloaded models:
docker run --rm -v whisperx-asr-service_whisperx-cache:/cache \
  -v $(pwd):/backup ubuntu tar czf /backup/whisperx-cache-backup.tar.gz /cache

A stress test script is included to measure throughput and latency under concurrent load:
# Default: 4 concurrent workers, all files in testfiles/
python tests/stress_test.py
# 8 concurrent workers, 3 rounds
python tests/stress_test.py --workers 8 --rounds 3
# Test OpenAI-compat endpoint
python tests/stress_test.py --endpoint openai
# Without diarization
python tests/stress_test.py --no-diarize

Place .mp3 files in the testfiles/ directory (gitignored). The report shows per-request latency, throughput in requests/minute, and the speedup from concurrent execution. See tests/README.md for full details.
This project is MIT licensed. See LICENSE for details.
WhisperX is licensed under BSD-4-Clause. See WhisperX repository for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For issues and questions:
- GitHub Issues: Create an issue
- WhisperX Issues: WhisperX repository
- WhisperX: m-bain/whisperX
- WhisperX Pyannote.audio 4 Support: sealambda/whisperX@feat/pyannote-audio-4 - This service uses sealambda's fork for pyannote.audio 4.0 compatibility
- OpenAI Whisper: openai/whisper
- Pyannote.audio: pyannote/pyannote-audio
- Docker WhisperX: jim60105/docker-whisperX
Reported issues fixed
- Pascal/Blackwell image variants (#15): :latest ships PyTorch 2.7.1 (cu126, supports Pascal through Hopper). New :blackwell tag ships PyTorch 2.8.0 (cu128) for RTX 50xx. The Dockerfile now exposes TORCH_VERSION and TORCH_INDEX_URL build args and re-pins torch after the WhisperX install so the requested version sticks (the upstream fork was silently upgrading torch to 2.8 and breaking Pascal). CI publishes both variants per release with separate buildx cache scopes; either job can fail independently.
- Device-aware BATCH_SIZE default (#12): the default is now 16 on cuda and 2 on cpu. The hardcoded 16 was OOM-killing CPU runs (exit 137) on audio longer than ~30 minutes.
- Idle model eviction (#16): new MODEL_KEEP_ALIVE_SECONDS env var (default 0/disabled) unloads Whisper models that have been idle longer than the configured window. MODEL_EVICTION_INTERVAL_SECONDS controls sweep frequency (floor 30s). The next request that needs an evicted model reloads it transparently.
- Real Prometheus /metrics (#13): /metrics now returns OpenMetrics text instead of JSON. New histograms and counters cover request duration, status, audio duration/size, in-flight requests, loaded model count, model evictions, and VRAM. The previous JSON shape is preserved at /queue-metrics for callers that depended on it. See Prometheus Metrics for the full table and the Ray Serve caveat.
Other fixes shipped in v0.3.2
- /asr accepts the OpenAI-style aliases advertised by /v1/models (whisper-1, whisper-tiny, whisper-large-v3, ...). Previously these returned a 500 because the raw value was passed straight through to faster_whisper.WhisperModel.
- /v1/models is sourced from faster_whisper.available_models() instead of a hardcoded list, so the advertised set stays in sync with whatever engine version is installed (about 20 canonical names plus the whisper-1 alias).
- app.* log records from Ray Serve replicas now flow into the per-replica log file. Previously "Loading WhisperX model: X", "Starting transcription...", etc. were silently dropped because Ray Serve disables propagation on its own logger.
- whisperx_model_evictions_total{model="<name>"} is pre-registered with value 0 each time a model loads, so dashboards can graph the metric from the first load onward instead of only after the first eviction. Simple mode only; see the Ray Serve caveat.
New env vars
| Variable | Default | Description |
|---|---|---|
| MODEL_KEEP_ALIVE_SECONDS | 0 (disabled) | Idle window after which a Whisper model is unloaded |
| MODEL_EVICTION_INTERVAL_SECONDS | 60 (floor 30) | Sweep cadence for the eviction daemon |
New build args
| Build arg | Default | Description |
|---|---|---|
| TORCH_VERSION | 2.7.1 | PyTorch version to install |
| TORCH_INDEX_URL | https://download.pytorch.org/whl/cu126 | PyTorch wheel index URL |
Tests
tests/test_v0_3_2.sh and tests/test_keep_alive.sh cover the new endpoints and eviction logic; both verified end-to-end against a Ray Serve container.
- Thread-safe model loading with double-checked locking for concurrent request safety
- Add Ray Serve mode for high-throughput ASR with cross-request batching
- Two pipeline strategies: replicate (full pipeline per GPU) and split (stage per GPU)
- Multi-GPU support via NUM_GPU_REPLICAS and per-stage replica/fraction config
- Refactor pipeline into shared stage functions (app/pipeline.py)
- Add async GPU queue with semaphore for simple mode (non-blocking event loop)
- Add /metrics endpoint for pipeline monitoring
- Add SERVE_MODE, PIPELINE_STRATEGY, NUM_GPU_REPLICAS env vars
- Add entrypoint.sh for automatic mode switching in Docker
- GPU pinning via NVIDIA_VISIBLE_DEVICES in local compose overrides
- Add stress test suite (tests/stress_test.py)
- Add /v1/models and /v1/audio/transcriptions endpoints for OpenAI API compatibility
- Add diarize parameter for broader API compatibility
- Add offline mode support and fix model caching
- Use Query() parameters to work as a drop-in replacement for other Whisper ASR services
- Initial release
- WhisperX integration with API wrapper
- Speaker diarization support
- Docker deployment