A simple ASR API service powered by WhisperX for transcription with speaker diarization. Built for self-hosters running Speakr or similar applications.
v0.3.2: two Docker image variants are now published per release.
| Your GPU | Pull this tag |
|---|---|
| RTX 50xx (Blackwell) | learnedmachine/whisperx-asr-service:blackwell (PyTorch 2.8.0 / cu128) |
| Anything else (10xx, 20xx, 30xx, 40xx, A-series, H-series) | learnedmachine/whisperx-asr-service:latest (PyTorch 2.7.1 / cu126) |

The :latest tag was previously broken on Pascal cards. v0.3.2 fixes this by re-pinning torch after the WhisperX install. See Image Variants and Changelog.
- Transcribes audio files using OpenAI Whisper models
- Identifies speakers ("Who spoke when") using Pyannote.audio
- Returns word-level timestamps
- Supports 90+ languages
- Outputs JSON, SRT, VTT, TSV formats
- Runs on your own GPU hardware via Docker
- Not production-grade: Basic error handling, no authentication
- GPU required: Needs NVIDIA GPU with 14GB+ VRAM for large models
- File size limits: Large audio files (>1GB) can cause out-of-memory errors
- VRAM usage: Memory consumption increases with file size and diarization
- Alpha software: Expect bugs and breaking changes
Audio --> Whisper (transcription) --> Wav2Vec2 (alignment) --> Pyannote (speaker ID) --> Output
The service supports two serving modes:
- Simple mode (default): Single-process uvicorn with an async GPU queue. Requests are serialized through a semaphore so only one pipeline runs on the GPU at a time. Good for single-GPU, low-traffic, or development use.
- Ray Serve mode: Runs on Ray Serve with cross-request batching (@serve.batch). Scales from one GPU to multiple GPUs. Two pipeline strategies are available:
  - Replicate (default): Each GPU runs the complete pipeline (Whisper + Align + Diarize). 4 GPUs = 4 independent pipeline replicas, no cross-GPU data transfer.
  - Split: Each stage runs as a separate deployment with fractional GPU allocation. Useful when you want independent scaling per stage.
GPU memory requirements vary by model size:
| Whisper Model | VRAM Required (with diarization) | Suitable GPUs |
|---|---|---|
| tiny, base | ~4-5GB | RTX 3060 8GB, RTX 2060, GTX 1660 Ti |
| small | ~6GB | RTX 3060, RTX 2070, RTX 2080 |
| medium | ~10GB | RTX 3080, RTX 3060 12GB, RTX 2080 Ti |
| large-v2, large-v3 | ~14GB | RTX 3090, RTX 4090, A6000, A100 |
Note: Measured with preloaded model + alignment + pyannote community-1 diarization on RTX 3090
Minimum Configuration (small/medium models):
- GPU: NVIDIA RTX 3060 (12GB VRAM) or better
- CPU: 8+ cores
- RAM: 16GB
- Storage: 50GB SSD
Recommended (large-v3 with diarization):
- GPU: NVIDIA RTX 3090 (24GB VRAM) or RTX 4090
- CPU: 12+ cores
- RAM: 32GB
- Storage: 100GB SSD
- Docker and Docker Compose
- NVIDIA Docker Runtime (for GPU support)
- Hugging Face Account (for speaker diarization models)
Two prebuilt Docker images are published per release. They differ only in the PyTorch wheel they ship; the application code is identical.
| Tag | PyTorch | CUDA wheels | Supported GPUs |
|---|---|---|---|
| :latest, :vX.Y.Z | 2.7.1 | cu126 | Pascal (10xx) through Hopper. Compatible with the broadest GPU range. |
| :blackwell, :vX.Y.Z-blackwell | 2.8.0 | cu128 | Blackwell (RTX 50xx). Drops Pascal/Maxwell support per the PyTorch 2.8 cuDNN/CUDA 12.8 build. |
If you have an RTX 50xx, use the :blackwell tag; everyone else should use :latest.
To build a custom variant locally, override the build args:
docker build \
--build-arg TORCH_VERSION=2.7.1 \
--build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu126 \
  -t whisperx-asr-service:custom .

Get up and running in 3 steps using the prebuilt Docker image:
Speaker diarization requires a Hugging Face token and model access:
a) Create Hugging Face Account:
- Visit: https://huggingface.co/join and sign up
b) Accept Model User Agreements (ALL REQUIRED):
You need to accept agreements for all three models:
- pyannote/speaker-diarization-community-1 - Click "Agree and access repository"
- pyannote/segmentation-3.0 - Click "Agree and access repository"
- pyannote/speaker-diarization-3.1 - Click "Agree and access repository"
c) Generate Access Token:
- Visit: https://huggingface.co/settings/tokens
- Click "New token", name it (e.g., "whisperx-diarization")
- Select "Read" permission and generate
- Copy the token (starts with hf_...)
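If you want to confirm the token and the gated-model access before starting the container, a quick check with the huggingface_hub Python package can save a failed first run. This is an optional sketch, not part of the service; it assumes `pip install huggingface_hub` and the three repo names listed above:

```python
# Sketch: verify the HF token and gated pyannote repos before the first run.
from huggingface_hub import HfApi

TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # your token here
REPOS = [
    "pyannote/speaker-diarization-community-1",
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-3.1",
]

api = HfApi(token=TOKEN)
print("Token belongs to:", api.whoami()["name"])
for repo in REPOS:
    try:
        api.model_info(repo)          # raises if the user agreement was not accepted
        print(f"OK       {repo}")
    except Exception as exc:          # GatedRepoError, HTTP 403, etc.
        print(f"BLOCKED  {repo}: {exc}")
```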
Create a .env file with your Hugging Face token:
# Create .env file
cat > .env << 'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
DEVICE=cuda
COMPUTE_TYPE=float16
BATCH_SIZE=16
PRELOAD_MODEL=large-v3
MAX_FILE_SIZE_MB=1000
# Serve mode: simple (default) or ray (Ray Serve with batching)
SERVE_MODE=simple
EOF

Replace hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with your actual token.
Download the docker-compose.yml file and start the service:
# Download docker-compose.yml
curl -O https://raw.githubusercontent.com/murtaza-nasir/whisperx-asr-service/main/docker-compose.yml
# RTX 50xx (Blackwell) only: switch to the blackwell image
# sed -i 's|whisperx-asr-service:latest|whisperx-asr-service:blackwell|' docker-compose.yml
# Start the service (pulls prebuilt image automatically)
docker compose up -d
# Check logs
docker compose logs -f

The bundled docker-compose.yml uses :latest (PyTorch 2.7.1 / cu126),
which works on every NVIDIA card from Pascal through Hopper. RTX 50xx
users should swap to :blackwell (PyTorch 2.8.0 / cu128). See
Image Variants for the full matrix.
Or run with Docker command:
# Pick the right tag for your GPU:
IMAGE=learnedmachine/whisperx-asr-service:latest # 10xx-40xx, A/H-series
# IMAGE=learnedmachine/whisperx-asr-service:blackwell # RTX 50xx only
docker run -d \
--name whisperx-asr-api \
--gpus all \
-p 9000:9000 \
-e DEVICE=cuda \
-e COMPUTE_TYPE=float16 \
-e BATCH_SIZE=16 \
-e HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-e PRELOAD_MODEL=large-v3 \
-v whisperx-cache:/.cache \
  --restart unless-stopped \
  "$IMAGE"

The service will be available at http://localhost:9000.
# Health check
curl http://localhost:9000/health
# Test transcription
curl -X POST http://localhost:9000/asr \
  -F "audio_file=@your_audio.mp3" \
  -F "language=en"

For development or if you want to build from source:
git clone https://github.com/murtaza-nasir/whisperx-asr-service.git
cd whisperx-asr-service

# Copy example environment file
cp .env.example .env
# Edit .env and add your Hugging Face token
nano .env

Using docker-compose.dev.yml (with live code mounting):
# Build and start
docker compose -f docker-compose.dev.yml up -d --build
# Check logs
docker compose -f docker-compose.dev.yml logs -f

Or build manually:
# Build image
docker build -t whisperx-asr-service .
# Run container
docker run -d \
--name whisperx-asr-api \
--gpus all \
-p 9000:9000 \
--env-file .env \
-v whisperx-cache:/.cache \
  whisperx-asr-service

Note: The docker-compose.dev.yml file mounts the ./app directory for live code changes without rebuilding.
Once running, visit http://localhost:9000/docs for interactive API documentation.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_file | File | Required | Audio file to transcribe |
| task | String | transcribe | Task type: transcribe or translate |
| language | String | Auto-detect | Language code (e.g., en, es, fr) |
| model | String | large-v3 | Whisper model: tiny, base, small, medium, large-v2, large-v3 |
| initial_prompt | String | None | Context or spelling guide to steer the model |
| hotwords | String | None | Comma-separated words to bias transcription toward |
| output_format | String | json | Output format: json, text, srt, vtt, tsv |
| word_timestamps | Boolean | true | Return word-level timestamps |
| diarize | Boolean | true | Enable speaker diarization |
| num_speakers | Integer | Auto | Exact number of speakers (if known, overrides min/max) |
| min_speakers | Integer | Auto | Minimum number of speakers |
| max_speakers | Integer | Auto | Maximum number of speakers |
Example Request (JSON output):
curl -X POST http://localhost:9000/asr \
-F "audio_file=@meeting.mp3" \
-F "language=en" \
-F "model=large-v3" \
-F "output_format=json" \
-F "diarize=true" \
-F "min_speakers=2" \
  -F "max_speakers=5"

Example Request (SRT subtitles):
curl -X POST http://localhost:9000/asr \
-F "audio_file=@video.mp4" \
-F "language=en" \
-F "output_format=srt" \
  -F "diarize=false"

Example Response (JSON):
{
"text": [
{
"start": 0.5,
"end": 2.3,
"text": " Hello, welcome to the meeting.",
"speaker": "SPEAKER_00",
"words": [
{"word": "Hello", "start": 0.5, "end": 0.8, "score": 0.95},
{"word": "welcome", "start": 0.9, "end": 1.2, "score": 0.93}
]
}
],
"language": "en",
"segments": [...],
"word_segments": [...]
}

When you know the exact number of speakers, use num_speakers for more accurate diarization:
curl -X POST http://localhost:9000/asr \
-F "audio_file=@interview.mp3" \
-F "num_speakers=2" \
-F "diarize=true"This overrides min_speakers and max_speakers and typically provides better accuracy than range-based detection.
This service automatically uses exclusive speaker diarization when available from the pyannote community-1 model. This feature simplifies reconciliation between fine-grained speaker diarization timestamps and transcription timestamps, making it ideal for applications like Speakr where you need to align transcripts with speaker segments.
Benefits:
- More accurate timestamp alignment between speakers and words
- Better handling of speaker transitions
- Simplified post-processing for multi-speaker transcripts
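To see why cleaner diarization timestamps matter downstream, here is an illustrative client-side reconciliation pass (not part of the service) that assigns each word the speaker whose segment overlaps it most. Field names follow the example JSON response above; words returned without timestamps are skipped:

```python
# Sketch: assign each word to the speaker whose segment overlaps it most.
def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_words(result):
    segments = result["text"]                      # speaker-labelled segments from /asr
    for word in result.get("word_segments", []):
        if "start" not in word or not segments:    # some words come back without timestamps
            continue
        best = max(
            segments,
            key=lambda s: overlap(word["start"], word["end"], s["start"], s["end"]),
        )
        word["speaker"] = best.get("speaker")
    return result
```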
Whisper often misspells brand names, acronyms, and domain-specific terms. You can improve accuracy using hotwords and initial_prompt:
- hotwords biases the model's beam search to favor specific words during decoding
- initial_prompt provides a sentence of context that primes the model to expect certain spellings
Example: transcribing audio that says "We deployed Speakr on a Kubernetes cluster using CTranslate2 for inference. PyAnnote handles the diarization."
# Baseline (no hints)
curl -X POST "http://localhost:9000/asr?language=en" \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using ctranslate2...PnNote handles the diarization."
# With hotwords
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote" \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using CTranslate2...PyAnnote handles the diarization."
# With hotwords + initial_prompt (best results)
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote&initial_prompt=Speakr is a transcription app." \
-F "audio_file=@meeting.mp3"
# Result: "We deployed Speakr...using CTranslate2...PyAnnote handles the diarization."

| Word | No hints | Hotwords | Hotwords + initial_prompt |
|---|---|---|---|
| CTranslate2 | ctranslate2 | CTranslate2 | CTranslate2 |
| PyAnnote | PnNote | PyAnnote | PyAnnote |
| Speakr | Speaker | Speaker | Speakr |
hotwords alone fixes most spelling issues. Words that sound identical to common English words (like "Speakr" vs "Speaker") may need initial_prompt as well to provide enough context for the model to override its default.
The OpenAI-compatible endpoints (/v1/audio/transcriptions) also support a hotwords form field. If only prompt is provided, it is used as hotwords.
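The same hint mechanism through the OpenAI-compatible route, sketched with plain requests. The multipart field names are assumed to mirror the OpenAI API (file, model, prompt); hotwords is the service-specific extension described above, and whisper-1 is one of the advertised aliases:

```python
# Sketch: OpenAI-compatible transcription endpoint with hotwords.
import requests

with open("meeting.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:9000/v1/audio/transcriptions",
        files={"file": f},
        data={
            "model": "whisper-1",                       # alias accepted by the service
            "hotwords": "Speakr,CTranslate2,PyAnnote",  # service-specific extension
            # "prompt": "Speakr is a transcription app.",  # used as hotwords if hotwords is absent
        },
        timeout=3600,
    )
resp.raise_for_status()
print(resp.json())
```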
A test script is included to verify hotwords with your own audio:
./tests/test_hotwords.sh testfiles/your_audio.flac
# Custom hotwords
HOTWORDS="MyBrand,TechTerm" INITIAL_PROMPT="MyBrand is a product." \
./tests/test_hotwords.sh testfiles/your_audio.flac

To use this service with Speakr instead of the default ASR endpoint:
Update Speakr's .env file:
# Enable ASR endpoint
USE_ASR_ENDPOINT=true
# Point to WhisperX service
ASR_BASE_URL=http://whisperx-asr-api:9000

If Speakr and WhisperX are in the same Docker Compose stack, use the container name. Otherwise, use http://localhost:9000.
- On GPU Machine: Deploy this service
# Make service accessible from network
# Edit docker-compose.yml ports:
ports:
  - "0.0.0.0:9000:9000" # Expose to network

- On Speakr Machine: Update configuration
# In Speakr's .env file
USE_ASR_ENDPOINT=true
ASR_BASE_URL=http://<GPU_MACHINE_IP>:9000

Note: Replace <GPU_MACHINE_IP> with your GPU server's IP address. Use firewall rules to restrict access to trusted machines only.
Edit .env to customize:
# GPU or CPU processing
DEVICE=cuda # cuda for GPU, cpu for CPU-only
# Computation precision
COMPUTE_TYPE=float16 # float16 (GPU), float32 (CPU), int8 (faster, lower quality)
# Batch size (higher = faster but more memory). Default is device-aware:
# 16 on cuda, 2 on cpu. Long audio on CPU benefits from BATCH_SIZE=1.
BATCH_SIZE=16 # 16 for 8GB VRAM, 32+ for high-end GPUs, 1-2 on CPU
# Hugging Face token for diarization
HF_TOKEN=hf_xxx...
# Model preloading (optional, reduces first-request latency)
PRELOAD_MODEL=large-v3 # Leave empty to disable, or set to: tiny, base, small, medium, large-v2, large-v3
# Maximum file size in MB (prevents out-of-memory errors)
MAX_FILE_SIZE_MB=1000 # Default 1GB, adjust lower for GPUs with <16GB VRAM
# Idle model eviction (default disabled). When > 0, Whisper models that have
# not served a request in this many seconds are unloaded from memory by a
# background sweep. The next request that needs the model will reload it.
MODEL_KEEP_ALIVE_SECONDS=0 # 0 disables eviction; e.g. 3600 = 1 hour
MODEL_EVICTION_INTERVAL_SECONDS=60 # Sweep frequency (floor of 30 seconds)

Controlled by the SERVE_MODE environment variable in your .env file.
SERVE_MODE=simple

Runs uvicorn directly. Requests are serialized through an async GPU semaphore so the event loop stays responsive while GPU work runs in a thread pool. This is backward-compatible with previous versions.
You can tune GPU_CONCURRENCY (default 1) to control how many pipeline runs execute concurrently; leave it at 1 for single-GPU setups.
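Conceptually, the simple-mode queue looks roughly like the sketch below. This is illustrative only, not the service's actual code: GPU_CONCURRENCY maps to the semaphore size, and the blocking pipeline runs in a worker thread so the event loop keeps serving /health and /metrics:

```python
# Sketch: an async GPU queue that serializes GPU work behind a semaphore
# while keeping the event loop responsive. Illustrative only.
import asyncio

GPU_CONCURRENCY = 1
gpu_slots = asyncio.Semaphore(GPU_CONCURRENCY)

def run_pipeline_blocking(audio_path: str) -> dict:
    # placeholder for the real Whisper -> align -> diarize pipeline
    ...

async def transcribe(audio_path: str) -> dict:
    async with gpu_slots:                      # only N pipelines touch the GPU at once
        loop = asyncio.get_running_loop()
        # run the blocking GPU work in the default thread pool
        return await loop.run_in_executor(None, run_pipeline_blocking, audio_path)
```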
SERVE_MODE=ray

Runs on Ray Serve with cross-request batching (@serve.batch). Two pipeline strategies are available, controlled by PIPELINE_STRATEGY:
PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4 # one full pipeline per GPU

Each GPU runs the complete 3-stage pipeline. Ray Serve routes incoming requests across replicas.
+-- GPU 0: [Whisper + Align + Diarize] --+
HTTP --> Proxy --> +-- GPU 1: [Whisper + Align + Diarize] --+--> Response
+-- GPU 2: [Whisper + Align + Diarize] --+
+-- GPU 3: [Whisper + Align + Diarize] --+
PIPELINE_STRATEGY=split

Each pipeline stage runs as a separate deployment with fractional GPU allocation and cross-request batching (@serve.batch). Stage-level pipelining means request B can start transcription while request A is still in diarization.
HTTP --> Proxy --> ASR Ingress
|
+--------+--------+
| | |
Whisper Align Diarize
(GPU 0.5) (GPU 0.3) (GPU 0.2)
Both strategies achieve similar throughput (~6.3-6.5x speedup on 4x RTX 3090 with 8 concurrent workers).
| | Replicate | Split |
|---|---|---|
| Configuration | Simple (just set NUM_GPU_REPLICAS) | Complex (GPU fractions, per-stage replicas, bin-packing tuning) |
| Cross-GPU transfer | None (audio stays on one GPU) | Yes (results move between stages) |
| Tail latency | Higher variance | Lower and more consistent |
| Scaling | Add a GPU, add a replica | Scale bottleneck stages independently (e.g. more Whisper replicas) |
| VRAM per GPU | Must fit all 3 models | Each stage uses a fraction |
| Best for | Most setups, multi-GPU throughput | Advanced tuning, heterogeneous stage scaling |
All optional, shown with defaults:
# Pipeline strategy: replicate (full pipeline per GPU) or split (stage per GPU)
PIPELINE_STRATEGY=replicate
# Number of pipeline replicas (set to number of GPUs)
NUM_GPU_REPLICAS=1
# Cross-request batch sizes per stage (tune for GPU VRAM)
WHISPER_BATCH_SIZE=4
ALIGN_BATCH_SIZE=8
DIARIZE_BATCH_SIZE=2
# Seconds to wait collecting a batch before processing what's available
BATCH_WAIT_TIMEOUT=0.1
# Split strategy only: fractional GPU allocation per stage
WHISPER_GPU_FRACTION=0.5
ALIGN_GPU_FRACTION=0.3
DIARIZE_GPU_FRACTION=0.2
# Split strategy only: per-stage replica overrides (fall back to NUM_GPU_REPLICAS)
# WHISPER_NUM_REPLICAS=2
# ALIGN_NUM_REPLICAS=1
# DIARIZE_NUM_REPLICAS=1

# 4 GPUs, full pipeline on each (recommended)
SERVE_MODE=ray
PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4
# 3 GPUs, one stage per GPU
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_GPU_FRACTION=1.0
ALIGN_GPU_FRACTION=1.0
DIARIZE_GPU_FRACTION=1.0
# 4 GPUs, hybrid: 2 Whisper + 1 Align + 1 Diarize
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_NUM_REPLICAS=2 WHISPER_GPU_FRACTION=1.0
ALIGN_NUM_REPLICAS=1 ALIGN_GPU_FRACTION=1.0
DIARIZE_NUM_REPLICAS=1 DIARIZE_GPU_FRACTION=1.0

When running in Ray mode, the Ray Dashboard is available at http://localhost:8265 for monitoring deployments, replicas, and request metrics.
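Cross-request batching only pays off when requests arrive concurrently; the bundled stress test does this, and a minimal version looks like the sketch below. The file list, worker count, and timeout are placeholders:

```python
# Sketch: fire several /asr requests concurrently so Ray Serve can batch them.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

FILES = ["testfiles/a.mp3", "testfiles/b.mp3", "testfiles/c.mp3", "testfiles/d.mp3"]

def transcribe(path: str) -> float:
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"diarize": "true"},
                          timeout=3600)
    r.raise_for_status()
    return time.perf_counter() - t0

with ThreadPoolExecutor(max_workers=4) as pool:
    for path, secs in zip(FILES, pool.map(transcribe, FILES)):
        print(f"{path}: {secs:.1f}s")
```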
To restrict which GPUs the service uses, create a docker-compose.dev.local.yml override (gitignored):
services:
whisperx-asr:
environment:
- CUDA_VISIBLE_DEVICES=0,1,2,3
      - NUM_GPU_REPLICAS=4

Use CUDA_VISIBLE_DEVICES (not NVIDIA_VISIBLE_DEVICES) since the Docker Compose deploy section exposes all GPUs to the container.
Then start with:
docker compose -f docker-compose.dev.yml -f docker-compose.dev.local.yml up -d

Available Whisper models (speed vs accuracy tradeoff):
| Model | Parameters | VRAM (model only) | VRAM (full pipeline*) | Speed | Quality |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | ~4GB | Fastest | Lowest |
| base | 74M | ~1GB | ~5GB | Very Fast | Low |
| small | 244M | ~2GB | ~6GB | Fast | Medium |
| medium | 769M | ~5GB | ~10GB | Moderate | Good |
| large-v2 | 1550M | ~10GB | ~14GB | Slow | Excellent |
| large-v3 | 1550M | ~10GB | ~14GB | Slow | Best |
*Full pipeline = Whisper model + alignment model + pyannote speaker diarization (measured on RTX 3090)
Recommendation:
- Use large-v3 for best quality (requires 16GB+ VRAM)
- Use small or medium for speed/resource constraints (8-12GB VRAM)
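If you are unsure where on this tradeoff you sit, the model form parameter lets you compare candidates on one of your own files. A quick wall-clock timing sketch (accuracy still has to be judged by reading the output; file name and model list are placeholders):

```python
# Sketch: time the same file across a few Whisper model sizes via the `model` field.
import time
import requests

for model in ["small", "medium", "large-v3"]:
    t0 = time.time()
    with open("sample.mp3", "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"model": model, "diarize": "false"},
                          timeout=3600)
    r.raise_for_status()
    print(f"{model:10s} {time.time() - t0:6.1f}s")
```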
# Start in simple mode (default)
docker compose up -d
# Or start in Ray Serve mode
# Set SERVE_MODE=ray in your .env file, then:
docker compose up -d
# View logs
docker compose logs -f

Note: When using Ray Serve mode, Docker Compose is configured with shm_size: 8g for Ray's shared memory object store. The Ray Dashboard is exposed on port 8265.
# Real-time logs
docker compose logs -f
# Last 100 lines
docker compose logs --tail=100
# Specific container logs
docker logs whisperx-asr-api

# Check service health
curl http://localhost:9000/health
# Response:
{
"status": "healthy",
"device": "cuda",
"loaded_models": ["large-v3"],
"serve_mode": "ray"
}

GET /metrics returns OpenMetrics text format suitable for a Prometheus
scrape config:
curl http://localhost:9000/metrics
# # HELP whisperx_requests_total Total HTTP requests by endpoint and status
# # TYPE whisperx_requests_total counter
# whisperx_requests_total{endpoint="/asr",status="ok"} 12.0
# whisperx_request_duration_seconds_bucket{endpoint="/asr",le="60.0"} 11.0
# ...

| Metric | Type | Notes |
|---|---|---|
| whisperx_requests_total{endpoint,status} | Counter | status is ok, http_<code>, or error |
| whisperx_request_duration_seconds{endpoint} | Histogram | End-to-end handler time |
| whisperx_active_transcriptions | Gauge | In-flight /asr requests |
| whisperx_loaded_models | Gauge | Whisper models currently in cache |
| whisperx_model_evictions_total{model} | Counter | Models unloaded by the idle-eviction sweep |
| whisperx_audio_duration_seconds | Histogram | Submitted audio duration |
| whisperx_audio_size_megabytes | Histogram | Submitted file size |
| whisperx_vram_allocated_bytes | Gauge | CUDA memory_allocated() (0 on CPU) |
| whisperx_service_info | Info | Static labels: version, device, compute_type, serve_mode |
Ray Serve caveat: /metrics is served by the ingress process. Whisper
models load inside replica processes, so whisperx_loaded_models,
whisperx_vram_allocated_bytes, and whisperx_model_evictions_total will
read 0 (or be missing) in Ray Serve mode. These counters and gauges are
per-process; without prometheus_client multi-process mode the ingress'
registry never sees the replica's events. HTTP-level metrics (request
counts, durations, audio sizes) are accurate in both modes because they
are recorded inside the ingress handler. Use the Ray Dashboard at port
8265 for per-replica state, and tail serve/replica_* log files for
eviction events. Multi-process metrics support is a planned follow-up.
In simple mode (SERVE_MODE=simple) all metrics including the eviction
counter and loaded-model gauge work as expected, since the model cache
and the /metrics handler share a process. Each Whisper model load
pre-registers a whisperx_model_evictions_total{model="<name>"} 0 row
so dashboards can graph the metric from the first load onward, not only
after the first eviction.
The legacy JSON shape (queue/loaded model state) is still available at
GET /queue-metrics for callers that depended on it.
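Outside a full Prometheus setup you can still consume the endpoint directly: prometheus_client ships a text-format parser. A small sketch, assuming `pip install prometheus-client requests`:

```python
# Sketch: scrape /metrics and print the whisperx_* series without Prometheus.
import requests
from prometheus_client.parser import text_string_to_metric_families

body = requests.get("http://localhost:9000/metrics", timeout=5).text
for family in text_string_to_metric_families(body):
    if not family.name.startswith("whisperx_"):
        continue
    for sample in family.samples:
        print(sample.name, dict(sample.labels), sample.value)
```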
Monitor GPU usage:
# NVIDIA GPU stats
nvidia-smi -l 1
# Docker container stats
docker stats whisperx-asr-api
# Ray Dashboard (Ray mode only)
# Open http://localhost:8265 in your browser

This service can run completely offline after an initial setup with internet access. This is useful for air-gapped environments or when you want to avoid network latency.
- Start the container with internet access
- Run at least one transcription request with diarization enabled to cache all models:
curl -X POST http://localhost:9000/asr \
  -F "audio_file=@test.mp3" \
  -F "diarize=true"
- This downloads and caches:
- Whisper model (e.g., large-v3)
- Alignment model (wav2vec2)
- Pyannote speaker diarization models
Add HF_HUB_OFFLINE=1 to your docker-compose.yml environment section:
environment:
- HF_HUB_OFFLINE=1
  # ... other environment variables

Important: This must be set directly in docker-compose.yml, not in the .env file.
Then restart the container:
docker compose down && docker compose up -d

The service will now operate without any network requests to Hugging Face.
| Component | Cache Location | Notes |
|---|---|---|
| Whisper models | /.cache/models--Systran--faster-whisper-* | Downloaded on first use |
| Alignment model | /.cache/wav2vec2_*.pth | Downloaded on first alignment |
| Pyannote models | /.cache/huggingface/hub/models--pyannote--* | Downloaded on first diarization |
| NLTK tokenizers | /.cache/nltk_data/ | Pre-downloaded in Docker image |
If you see errors like Failed to resolve 'huggingface.co':
- Ensure you ran a full transcription with diarization while online
- Verify HF_HUB_OFFLINE=1 is set in docker-compose.yml (not .env)
- Check the cache volume contains the models:
docker exec whisperx-asr-api ls -la /.cache/huggingface/hub/
Symptom: Service runs on CPU despite having GPU
Solution:
# Verify NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Symptom: CUDA out of memory errors or VRAM exhaustion with large files
Solutions:
- Reduce file size limit in .env: MAX_FILE_SIZE_MB=500 (default is 1000MB)
- Use a smaller model: small or medium instead of large-v3
- Reduce batch size in .env: BATCH_SIZE=8 or BATCH_SIZE=4
- Use int8 precision: COMPUTE_TYPE=int8 (lower quality but less memory)
- Split large files: process audio in smaller chunks before uploading (see the sketch below)
- Disable diarization: for very large files, skip speaker diarization
Note: The service automatically clears GPU cache between operations to minimize VRAM buildup, but very large files (>500MB) can still cause issues.
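One way to split a long recording before upload is ffmpeg's segment muxer, then posting each chunk separately. This is a sketch; the 10-minute chunk length and file names are arbitrary, and diarization is disabled because speaker labels would not be consistent across independently processed chunks:

```python
# Sketch: split a long file into ~10-minute chunks with ffmpeg, then send each chunk.
import glob
import subprocess
import requests

subprocess.run(
    ["ffmpeg", "-i", "long_recording.mp3", "-f", "segment",
     "-segment_time", "600", "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

for chunk in sorted(glob.glob("chunk_*.mp3")):
    with open(chunk, "rb") as f:
        r = requests.post("http://localhost:9000/asr",
                          files={"audio_file": f},
                          data={"diarize": "false"},   # labels would not line up across chunks
                          timeout=3600)
    r.raise_for_status()
    print(chunk, "done")
```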
Symptom: No speaker labels in output
Solutions:
- Verify HF_TOKEN is set correctly
- Accept model user agreements on Hugging Face
- Check logs for diarization errors: docker compose logs
- Ensure diarize=true in the request
Symptom: Transcription takes too long
Solutions:
- Use GPU instead of CPU (DEVICE=cuda)
- Use a smaller model for faster processing
- Increase BATCH_SIZE (if you have VRAM headroom)
- Disable diarization if not needed: diarize=false
Symptom: Error message containing Weights only load failed or GLOBAL omegaconf.listconfig.ListConfig was not an allowed global
This occurs due to a security change in PyTorch 2.6 where weights_only=True became the default for torch.load().
Solution:
Add this environment variable to your docker-compose.yml:
environment:
  - TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true

Important: Setting this in the .env file alone may not work - it must be set directly in docker-compose.yml under the environment section.
See WhisperX Issue #1304 for more details.
Check logs:
docker compose logs whisperx-asr

Common causes:
- Invalid audio format (use ffmpeg to convert)
- Model not loaded (check VRAM, logs)
- Incorrect parameters (check API docs)
The service supports formats that WhisperX can process (via FFmpeg):
- Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
- Video: MP4, AVI, MOV, MKV, WebM (audio track extracted)
- Other: AMR, 3GP, 3GPP
Note: Large files (>1GB) may cause out-of-memory errors as files are loaded entirely into memory.
This service has NO built-in authentication or security features.
If exposing to a network:
- Use firewall rules to restrict access
- Consider putting behind a reverse proxy
- Store HF_TOKEN securely (use the .env file, not hardcoded values)
If you run the prebuilt image, pull the new tag:
docker compose pull
docker compose up -d

If you build from source, the default build now installs PyTorch 2.7.1 from the cu126 wheel index (Pascal-Hopper). To target a different combo:
git pull
docker compose build --no-cache \
--build-arg TORCH_VERSION=2.8.0 \
--build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
docker compose up -d

Build args also work via Compose's build.args if you check that into
your overlay file.
# Remove model cache
docker compose down -v
docker volume rm whisperx-asr-service_whisperx-cache
# Rebuild
docker compose up -d

Back up the cache volume to preserve downloaded models:
docker run --rm -v whisperx-asr-service_whisperx-cache:/cache \
  -v $(pwd):/backup ubuntu tar czf /backup/whisperx-cache-backup.tar.gz /cache

A stress test script is included to measure throughput and latency under concurrent load:
# Default: 4 concurrent workers, all files in testfiles/
python tests/stress_test.py
# 8 concurrent workers, 3 rounds
python tests/stress_test.py --workers 8 --rounds 3
# Test OpenAI-compat endpoint
python tests/stress_test.py --endpoint openai
# Without diarization
python tests/stress_test.py --no-diarize

Place .mp3 files in the testfiles/ directory (gitignored). The report shows per-request latency, throughput in requests/minute, and the speedup from concurrent execution. See tests/README.md for full details.
This project is MIT licensed. See LICENSE for details.
WhisperX is licensed under BSD-4-Clause. See WhisperX repository for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For issues and questions:
- GitHub Issues: Create an issue
- WhisperX Issues: WhisperX repository
- WhisperX: m-bain/whisperX
- WhisperX Pyannote.audio 4 Support: sealambda/whisperX@feat/pyannote-audio-4 - This service uses sealambda's fork for pyannote.audio 4.0 compatibility
- OpenAI Whisper: openai/whisper
- Pyannote.audio: pyannote/pyannote-audio
- Docker WhisperX: jim60105/docker-whisperX
Reported issues fixed
- Pascal/Blackwell image variants (#15): :latest ships PyTorch 2.7.1 (cu126, supports Pascal through Hopper). New :blackwell tag ships PyTorch 2.8.0 (cu128) for RTX 50xx. The Dockerfile now exposes TORCH_VERSION and TORCH_INDEX_URL build args and re-pins torch after the WhisperX install so the requested version sticks (the upstream fork was silently upgrading torch to 2.8 and breaking Pascal). CI publishes both variants per release with separate buildx cache scopes; either job can fail independently.
- Device-aware BATCH_SIZE default (#12): the default is now 16 on cuda and 2 on cpu. The hardcoded 16 was OOM-killing CPU runs (exit 137) on audio longer than ~30 minutes.
- Idle model eviction (#16): new MODEL_KEEP_ALIVE_SECONDS env var (default 0/disabled) unloads Whisper models that have been idle longer than the configured window. MODEL_EVICTION_INTERVAL_SECONDS controls sweep frequency (floor 30s). The next request that needs an evicted model reloads it transparently.
- Real Prometheus /metrics (#13): /metrics now returns OpenMetrics text instead of JSON. New histograms and counters cover request duration, status, audio duration/size, in-flight requests, loaded model count, model evictions, and VRAM. The previous JSON shape is preserved at /queue-metrics for callers that depended on it. See Prometheus Metrics for the full table and the Ray Serve caveat.
Other fixes shipped in v0.3.2
- /asr accepts the OpenAI-style aliases advertised by /v1/models (whisper-1, whisper-tiny, whisper-large-v3, ...). Previously these returned a 500 because the raw value was passed straight through to faster_whisper.WhisperModel.
- /v1/models is sourced from faster_whisper.available_models() instead of a hardcoded list, so the advertised set stays in sync with whatever engine version is installed (about 20 canonical names plus the whisper-1 alias).
- app.* log records from Ray Serve replicas now flow into the per-replica log file. Previously "Loading WhisperX model: X", "Starting transcription...", etc. were silently dropped because Ray Serve disables propagation on its own logger.
- whisperx_model_evictions_total{model="<name>"} is pre-registered with value 0 each time a model loads, so dashboards can graph the metric from the first load onward instead of only after the first eviction. Simple mode only; see the Ray Serve caveat.
New env vars
| Variable | Default | Description |
|---|---|---|
| MODEL_KEEP_ALIVE_SECONDS | 0 (disabled) | Idle window after which a Whisper model is unloaded |
| MODEL_EVICTION_INTERVAL_SECONDS | 60 (floor 30) | Sweep cadence for the eviction daemon |
New build args
| Build arg | Default | Description |
|---|---|---|
| TORCH_VERSION | 2.7.1 | PyTorch version to install |
| TORCH_INDEX_URL | https://download.pytorch.org/whl/cu126 | PyTorch wheel index URL |
Tests
tests/test_v0_3_2.sh and tests/test_keep_alive.sh cover the new endpoints and eviction logic; both verified end-to-end against a Ray Serve container.
- Thread-safe model loading with double-checked locking for concurrent request safety
- Add Ray Serve mode for high-throughput ASR with cross-request batching
- Two pipeline strategies: replicate (full pipeline per GPU) and split (stage per GPU)
- Multi-GPU support via NUM_GPU_REPLICAS and per-stage replica/fraction config
- Refactor pipeline into shared stage functions (app/pipeline.py)
- Add async GPU queue with semaphore for simple mode (non-blocking event loop)
- Add /metrics endpoint for pipeline monitoring
- Add SERVE_MODE, PIPELINE_STRATEGY, NUM_GPU_REPLICAS env vars
- Add entrypoint.sh for automatic mode switching in Docker
- GPU pinning via NVIDIA_VISIBLE_DEVICES in local compose overrides
- Add stress test suite (tests/stress_test.py)
- Add /v1/models and /v1/audio/transcriptions endpoints for OpenAI API compatibility
- Add diarize parameter for broader API compatibility
- Add offline mode support and fix model caching
- Use Query() parameters to work as a drop-in replacement for other Whisper ASR services
- Initial release
- WhisperX integration with API wrapper
- Speaker diarization support
- Docker deployment