WhisperX ASR API Service


⚠️ Alpha Version - For Self-Hosting Enthusiasts

A simple ASR API service powered by WhisperX for transcription with speaker diarization. Built for self-hosters running Speakr or similar applications.

v0.3.2: two Docker image variants are now published per release.

Your GPU                                                    Pull this tag
RTX 50xx (Blackwell)                                        learnedmachine/whisperx-asr-service:blackwell (PyTorch 2.8.0 / cu128)
Anything else (10xx, 20xx, 30xx, 40xx, A-series, H-series)  learnedmachine/whisperx-asr-service:latest (PyTorch 2.7.1 / cu126)

The :latest tag was previously broken on Pascal cards. v0.3.2 fixes this by re-pinning torch after the WhisperX install. See Image Variants and Changelog.

What This Does

  • Transcribes audio files using OpenAI Whisper models
  • Identifies speakers ("Who spoke when") using Pyannote.audio
  • Returns word-level timestamps
  • Supports 90+ languages
  • Outputs JSON, SRT, VTT, TSV formats
  • Runs on your own GPU hardware via Docker

Limitations

  • Not production-grade: Basic error handling, no authentication
  • GPU required: Needs NVIDIA GPU with 14GB+ VRAM for large models
  • File size limits: Large audio files (>1GB) can cause out-of-memory errors
  • VRAM usage: Memory consumption increases with file size and diarization
  • Alpha software: Expect bugs and breaking changes

How It Works

Audio --> Whisper (transcription) --> Wav2Vec2 (alignment) --> Pyannote (speaker ID) --> Output

The service supports two serving modes:

  • Simple mode (default): Single-process uvicorn with an async GPU queue. Requests are serialized through a semaphore so only one pipeline runs on the GPU at a time. Good for single-GPU, low-traffic, or development use.
  • Ray Serve mode: Runs on Ray Serve with cross-request batching (@serve.batch). Scales from 1 GPU to multi-GPU. Two pipeline strategies are available:
    • Replicate (default): Each GPU runs the complete pipeline (Whisper + Align + Diarize). 4 GPUs = 4 independent pipeline replicas, no cross-GPU data transfer.
    • Split: Each stage runs as a separate deployment with fractional GPU allocation. Useful when you want independent scaling per stage.

Prerequisites

Hardware Requirements

GPU memory requirements vary by model size:

Whisper Model       VRAM Required (with diarization)  Suitable GPUs
tiny, base          ~4-5GB                            RTX 3060 8GB, RTX 2060, GTX 1660 Ti
small               ~6GB                              RTX 3060, RTX 2070, RTX 2080
medium              ~10GB                             RTX 3080, RTX 3060 12GB, RTX 2080 Ti
large-v2, large-v3  ~14GB                             RTX 3090, RTX 4090, A6000, A100

Note: Measured with preloaded model + alignment + pyannote community-1 diarization on RTX 3090

Minimum Configuration (small/medium models):

  • GPU: NVIDIA RTX 3060 (12GB VRAM) or better
  • CPU: 8+ cores
  • RAM: 16GB
  • Storage: 50GB SSD

Recommended (large-v3 with diarization):

  • GPU: NVIDIA RTX 3090 (24GB VRAM) or RTX 4090
  • CPU: 12+ cores
  • RAM: 32GB
  • Storage: 100GB SSD

Software Requirements

  • Docker and Docker Compose
  • NVIDIA Docker Runtime (for GPU support)
  • Hugging Face Account (for speaker diarization models)

Image Variants

Two prebuilt Docker images are published per release. They differ only in the PyTorch wheel they ship; the application code is identical.

Tag                            PyTorch  CUDA wheels  Supported GPUs
:latest, :vX.Y.Z               2.7.1    cu126        Pascal (10xx) through Hopper. Compatible with the broadest GPU range.
:blackwell, :vX.Y.Z-blackwell  2.8.0    cu128        Blackwell (RTX 50xx). Drops Pascal/Maxwell support per the PyTorch 2.8 cuDNN/CUDA 12.8 build.

If you have an RTX 50xx, use the -blackwell tag. Everyone else: use :latest.

To build a custom variant locally, override the build args:

docker build \
  --build-arg TORCH_VERSION=2.7.1 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu126 \
  -t whisperx-asr-service:custom .

Quick Start (Prebuilt Image)

Get up and running in a few steps using the prebuilt Docker image:

1. Get Hugging Face Token and Model Access

Speaker diarization requires a Hugging Face token and model access:

a) Create Hugging Face Account:

Sign up at https://huggingface.co if you do not already have an account.

b) Accept Model User Agreements (ALL REQUIRED):

You need to accept agreements for all three models:

  1. pyannote/speaker-diarization-community-1 - Click "Agree and access repository"
  2. pyannote/segmentation-3.0 - Click "Agree and access repository"
  3. pyannote/speaker-diarization-3.1 - Click "Agree and access repository"

c) Generate Access Token:

In your Hugging Face account settings, create an access token with read access and copy it for the next step.

⚠️ Important: Without accepting all model agreements, you'll get "403 Access Denied" errors.

2. Create Configuration File

Create a .env file with your Hugging Face token:

# Create .env file
cat > .env << 'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
DEVICE=cuda
COMPUTE_TYPE=float16
BATCH_SIZE=16
PRELOAD_MODEL=large-v3
MAX_FILE_SIZE_MB=1000
# Serve mode: simple (default) or ray (Ray Serve with batching)
SERVE_MODE=simple
EOF

Replace hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with your actual token.

3. Run with Docker Compose (Recommended)

Download the docker-compose.yml file and start the service:

# Download docker-compose.yml
curl -O https://raw.githubusercontent.com/murtaza-nasir/whisperx-asr-service/main/docker-compose.yml

# RTX 50xx (Blackwell) only: switch to the blackwell image
# sed -i 's|whisperx-asr-service:latest|whisperx-asr-service:blackwell|' docker-compose.yml

# Start the service (pulls prebuilt image automatically)
docker compose up -d

# Check logs
docker compose logs -f

The bundled docker-compose.yml uses :latest (PyTorch 2.7.1 / cu126), which works on every NVIDIA card from Pascal through Hopper. RTX 50xx users should swap to :blackwell (PyTorch 2.8.0 / cu128). See Image Variants for the full matrix.

Or run with Docker command:

# Pick the right tag for your GPU:
IMAGE=learnedmachine/whisperx-asr-service:latest      # 10xx-40xx, A/H-series
# IMAGE=learnedmachine/whisperx-asr-service:blackwell # RTX 50xx only

docker run -d \
  --name whisperx-asr-api \
  --gpus all \
  -p 9000:9000 \
  -e DEVICE=cuda \
  -e COMPUTE_TYPE=float16 \
  -e BATCH_SIZE=16 \
  -e HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -e PRELOAD_MODEL=large-v3 \
  -v whisperx-cache:/.cache \
  --restart unless-stopped \
  "$IMAGE"

The service will be available at http://localhost:9000

4. Test the Service

# Health check
curl http://localhost:9000/health

# Test transcription
curl -X POST http://localhost:9000/asr \
  -F "audio_file=@your_audio.mp3" \
  -F "language=en"

Build from Source (Advanced)

For development or if you want to build from source:

1. Clone Repository

git clone https://github.com/murtaza-nasir/whisperx-asr-service.git
cd whisperx-asr-service

2. Set Up Environment

# Copy example environment file
cp .env.example .env

# Edit .env and add your Hugging Face token
nano .env

3. Build and Run

Using docker-compose.dev.yml (with live code mounting):

# Build and start
docker compose -f docker-compose.dev.yml up -d --build

# Check logs
docker compose -f docker-compose.dev.yml logs -f

Or build manually:

# Build image
docker build -t whisperx-asr-service .

# Run container
docker run -d \
  --name whisperx-asr-api \
  --gpus all \
  -p 9000:9000 \
  --env-file .env \
  -v whisperx-cache:/.cache \
  whisperx-asr-service

Note: The docker-compose.dev.yml file mounts ./app directory for live code changes without rebuilding.


API Documentation

Once running, visit http://localhost:9000/docs for interactive API documentation.

Main Endpoint: POST /asr

Parameters:

Parameter        Type     Default      Description
audio_file       File     Required     Audio file to transcribe
task             String   transcribe   Task type: transcribe or translate
language         String   Auto-detect  Language code (e.g., en, es, fr)
model            String   large-v3     Whisper model: tiny, base, small, medium, large-v2, large-v3
initial_prompt   String   None         Context or spelling guide to steer the model
hotwords         String   None         Comma-separated words to bias transcription toward
output_format    String   json         Output format: json, text, srt, vtt, tsv
word_timestamps  Boolean  true         Return word-level timestamps
diarize          Boolean  true         Enable speaker diarization
num_speakers     Integer  Auto         Exact number of speakers (if known, overrides min/max)
min_speakers     Integer  Auto         Minimum number of speakers
max_speakers     Integer  Auto         Maximum number of speakers

Example Request (JSON output):

curl -X POST http://localhost:9000/asr \
  -F "audio_file=@meeting.mp3" \
  -F "language=en" \
  -F "model=large-v3" \
  -F "output_format=json" \
  -F "diarize=true" \
  -F "min_speakers=2" \
  -F "max_speakers=5"

Example Request (SRT subtitles):

curl -X POST http://localhost:9000/asr \
  -F "audio_file=@video.mp4" \
  -F "language=en" \
  -F "output_format=srt" \
  -F "diarize=false"

Example Response (JSON):

{
  "text": [
    {
      "start": 0.5,
      "end": 2.3,
      "text": " Hello, welcome to the meeting.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello", "start": 0.5, "end": 0.8, "score": 0.95},
        {"word": "welcome", "start": 0.9, "end": 1.2, "score": 0.93}
      ]
    }
  ],
  "language": "en",
  "segments": [...],
  "word_segments": [...]
}
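A response in this shape is easy to post-process, for example by collapsing segments into speaker-labelled lines. This sketch uses only the fields shown in the sample above:

```python
def speaker_lines(response):
    """Collapse /asr JSON segments into 'SPEAKER: text' lines."""
    lines = []
    for seg in response.get("text", []):
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"{speaker}: {seg['text'].strip()}")
    return lines

# The sample response from above, abbreviated.
sample = {
    "text": [
        {"start": 0.5, "end": 2.3, "text": " Hello, welcome to the meeting.",
         "speaker": "SPEAKER_00"}
    ],
    "language": "en",
}
print(speaker_lines(sample))  # → ['SPEAKER_00: Hello, welcome to the meeting.']
```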

Advanced Speaker Diarization Features

Exact Speaker Count

When you know the exact number of speakers, use num_speakers for more accurate diarization:

curl -X POST http://localhost:9000/asr \
  -F "audio_file=@interview.mp3" \
  -F "num_speakers=2" \
  -F "diarize=true"

This overrides min_speakers and max_speakers and typically provides better accuracy than range-based detection.

Exclusive Speaker Diarization

This service automatically uses exclusive speaker diarization when available from the pyannote community-1 model. This feature simplifies reconciliation between fine-grained speaker diarization timestamps and transcription timestamps, making it ideal for applications like Speakr where you need to align transcripts with speaker segments.

Benefits:

  • More accurate timestamp alignment between speakers and words
  • Better handling of speaker transitions
  • Simplified post-processing for multi-speaker transcripts

Custom Vocabulary (Hotwords)

Whisper often misspells brand names, acronyms, and domain-specific terms. You can improve accuracy using hotwords and initial_prompt:

  • hotwords biases the model's beam search to favor specific words during decoding
  • initial_prompt provides a sentence of context that primes the model to expect certain spellings

Example: transcribing audio that says "We deployed Speakr on a Kubernetes cluster using CTranslate2 for inference. PyAnnote handles the diarization."

# Baseline (no hints)
curl -X POST "http://localhost:9000/asr?language=en" \
  -F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using ctranslate2...PnNote handles the diarization."

# With hotwords
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote" \
  -F "audio_file=@meeting.mp3"
# Result: "We deployed Speaker...using CTranslate2...PyAnnote handles the diarization."

# With hotwords + initial_prompt (best results); note the URL-encoded spaces
curl -X POST "http://localhost:9000/asr?language=en&hotwords=Speakr,CTranslate2,PyAnnote&initial_prompt=Speakr%20is%20a%20transcription%20app." \
  -F "audio_file=@meeting.mp3"
# Result: "We deployed Speakr...using CTranslate2...PyAnnote handles the diarization."

Word         No hints     Hotwords     Hotwords + initial_prompt
CTranslate2  ctranslate2  CTranslate2  CTranslate2
PyAnnote     PnNote       PyAnnote     PyAnnote
Speakr       Speaker      Speaker      Speakr

hotwords alone fixes most spelling issues. Words that sound identical to common English words (like "Speakr" vs "Speaker") may need initial_prompt as well to provide enough context for the model to override its default.

The OpenAI-compatible endpoints (/v1/audio/transcriptions) also support a hotwords form field. If only prompt is provided, it is used as hotwords.

A test script is included to verify hotwords with your own audio:

./tests/test_hotwords.sh testfiles/your_audio.flac

# Custom hotwords
HOTWORDS="MyBrand,TechTerm" INITIAL_PROMPT="MyBrand is a product." \
  ./tests/test_hotwords.sh testfiles/your_audio.flac

Integration with Speakr

To use this service with Speakr instead of the default ASR endpoint:

If Running on the Same Machine

Update Speakr's .env file:

# Enable ASR endpoint
USE_ASR_ENDPOINT=true

# Point to WhisperX service
ASR_BASE_URL=http://whisperx-asr-api:9000

If Speakr and WhisperX are in the same Docker Compose stack, use the container name. Otherwise, use http://localhost:9000.

If Running on a Different GPU Machine

  1. On GPU Machine: Deploy this service and make it reachable from the network by editing the ports section of docker-compose.yml:

ports:
  - "0.0.0.0:9000:9000"  # Expose to network

  2. On Speakr Machine: Update Speakr's configuration:

# In Speakr's .env file
USE_ASR_ENDPOINT=true
ASR_BASE_URL=http://<GPU_MACHINE_IP>:9000

Note: Replace <GPU_MACHINE_IP> with your GPU server's IP address. Use firewall rules to restrict access to trusted machines only.

Configuration

Environment Variables

Edit .env to customize:

# GPU or CPU processing
DEVICE=cuda              # cuda for GPU, cpu for CPU-only

# Computation precision
COMPUTE_TYPE=float16     # float16 (GPU), float32 (CPU), int8 (faster, lower quality)

# Batch size (higher = faster but more memory). Default is device-aware:
# 16 on cuda, 2 on cpu. Long audio on CPU benefits from BATCH_SIZE=1.
BATCH_SIZE=16           # 16 for 8GB VRAM, 32+ for high-end GPUs, 1-2 on CPU

# Hugging Face token for diarization
HF_TOKEN=hf_xxx...

# Model preloading (optional, reduces first-request latency)
PRELOAD_MODEL=large-v3   # Leave empty to disable, or set to: tiny, base, small, medium, large-v2, large-v3

# Maximum file size in MB (prevents out-of-memory errors)
MAX_FILE_SIZE_MB=1000    # Default 1GB, adjust lower for GPUs with <16GB VRAM

# Idle model eviction (default disabled). When > 0, Whisper models that have
# not served a request in this many seconds are unloaded from memory by a
# background sweep. The next request that needs the model will reload it.
MODEL_KEEP_ALIVE_SECONDS=0          # 0 disables eviction; e.g. 3600 = 1 hour
MODEL_EVICTION_INTERVAL_SECONDS=60  # Sweep frequency (floor of 30 seconds)
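The keep-alive behaviour described above can be illustrated with a toy cache. This is a hypothetical sketch, not the service's implementation: models unused for longer than the keep-alive window are unloaded by a sweep and reloaded lazily on the next get().

```python
import time

class ModelCache:
    """Toy idle-eviction cache: models idle for keep_alive seconds are unloaded."""
    def __init__(self, keep_alive_seconds):
        self.keep_alive = keep_alive_seconds
        self.models = {}      # name -> loaded model object
        self.last_used = {}   # name -> timestamp of last request

    def get(self, name, loader):
        if name not in self.models:
            self.models[name] = loader(name)   # reload transparently after eviction
        self.last_used[name] = time.monotonic()
        return self.models[name]

    def sweep(self):
        """Unload every model idle longer than the keep-alive window."""
        if self.keep_alive <= 0:               # 0 disables eviction, as in the service
            return []
        now = time.monotonic()
        evicted = [n for n, t in self.last_used.items() if now - t > self.keep_alive]
        for name in evicted:
            del self.models[name]
            del self.last_used[name]
        return evicted
```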

Serve Mode

Controlled by the SERVE_MODE environment variable in your .env file.

Simple Mode (default)

SERVE_MODE=simple

Runs uvicorn directly. Requests are serialized through an async GPU semaphore so the event loop stays responsive while GPU work runs in a thread pool. This is backward-compatible with previous versions.

GPU_CONCURRENCY (default 1) controls how many pipeline runs may execute concurrently. Leave it at 1 for single-GPU setups.

Ray Serve Mode

SERVE_MODE=ray

Runs on Ray Serve with cross-request batching (@serve.batch). Two pipeline strategies are available, controlled by PIPELINE_STRATEGY:

Replicate Strategy (default, recommended)

PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4       # one full pipeline per GPU

Each GPU runs the complete 3-stage pipeline. Ray Serve routes incoming requests across replicas.

                    +-- GPU 0: [Whisper + Align + Diarize] --+
HTTP --> Proxy -->  +-- GPU 1: [Whisper + Align + Diarize] --+--> Response
                    +-- GPU 2: [Whisper + Align + Diarize] --+
                    +-- GPU 3: [Whisper + Align + Diarize] --+

Split Strategy

PIPELINE_STRATEGY=split

Each pipeline stage runs as a separate deployment with fractional GPU allocation and cross-request batching (@serve.batch). Stage-level pipelining means request B can start transcription while request A is still in diarization.

HTTP --> Proxy --> ASR Ingress
                       |
              +--------+--------+
              |        |        |
          Whisper   Align    Diarize
         (GPU 0.5) (GPU 0.3) (GPU 0.2)

Strategy Comparison

Both strategies achieve similar throughput (~6.3-6.5x speedup on 4x RTX 3090 with 8 concurrent workers).

                    Replicate                           Split
Configuration       Simple (just set NUM_GPU_REPLICAS)  Complex (GPU fractions, per-stage replicas, bin-packing tuning)
Cross-GPU transfer  None (audio stays on one GPU)       Yes (results move between stages)
Tail latency        Higher variance                     Lower and more consistent
Scaling             Add a GPU, add a replica            Scale bottleneck stages independently (e.g. more Whisper replicas)
VRAM per GPU        Must fit all 3 models               Each stage uses a fraction
Best for            Most setups, multi-GPU throughput   Advanced tuning, heterogeneous stage scaling

Ray Serve Configuration

All optional, shown with defaults:

# Pipeline strategy: replicate (full pipeline per GPU) or split (stage per GPU)
PIPELINE_STRATEGY=replicate

# Number of pipeline replicas (set to number of GPUs)
NUM_GPU_REPLICAS=1

# Cross-request batch sizes per stage (tune for GPU VRAM)
WHISPER_BATCH_SIZE=4
ALIGN_BATCH_SIZE=8
DIARIZE_BATCH_SIZE=2

# Seconds to wait collecting a batch before processing what's available
BATCH_WAIT_TIMEOUT=0.1

# Split strategy only: fractional GPU allocation per stage
WHISPER_GPU_FRACTION=0.5
ALIGN_GPU_FRACTION=0.3
DIARIZE_GPU_FRACTION=0.2

# Split strategy only: per-stage replica overrides (fall back to NUM_GPU_REPLICAS)
# WHISPER_NUM_REPLICAS=2
# ALIGN_NUM_REPLICAS=1
# DIARIZE_NUM_REPLICAS=1

Multi-GPU Examples

# 4 GPUs, full pipeline on each (recommended)
SERVE_MODE=ray
PIPELINE_STRATEGY=replicate
NUM_GPU_REPLICAS=4

# 3 GPUs, one stage per GPU
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_GPU_FRACTION=1.0
ALIGN_GPU_FRACTION=1.0
DIARIZE_GPU_FRACTION=1.0

# 4 GPUs, hybrid: 2 Whisper + 1 Align + 1 Diarize
SERVE_MODE=ray
PIPELINE_STRATEGY=split
WHISPER_NUM_REPLICAS=2
WHISPER_GPU_FRACTION=1.0
ALIGN_NUM_REPLICAS=1
ALIGN_GPU_FRACTION=1.0
DIARIZE_NUM_REPLICAS=1
DIARIZE_GPU_FRACTION=1.0

When running in Ray mode, the Ray Dashboard is available at http://localhost:8265 for monitoring deployments, replicas, and request metrics.
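The cross-request batching idea (wait up to BATCH_WAIT_TIMEOUT seconds to fill a batch, then process whatever arrived) can be sketched in plain asyncio. This is an illustrative analogue of @serve.batch, not Ray code; the batch sizes and timeout mirror the env vars above.

```python
import asyncio

BATCH_SIZE = 4            # cf. WHISPER_BATCH_SIZE
BATCH_WAIT_TIMEOUT = 0.1  # seconds to wait for a fuller batch

queue = asyncio.Queue()

async def batcher():
    """Collect requests until the batch is full or the wait timeout expires."""
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + BATCH_WAIT_TIMEOUT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # stand-in for running one batched GPU inference over all collected requests
        for req, fut in batch:
            fut.set_result(f"transcript:{req}")

async def submit(req):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((req, fut))
    return await fut

async def main():
    worker = asyncio.create_task(batcher())
    out = await asyncio.gather(*(submit(f"audio{i}") for i in range(6)))
    worker.cancel()
    return out

results = asyncio.run(main())
print(results)  # six transcripts, in submission order
```

With six submissions and BATCH_SIZE=4, the first four are processed as one batch and the remaining two as a second batch after the timeout expires.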

GPU Pinning

To restrict which GPUs the service uses, create a docker-compose.dev.local.yml override (gitignored):

services:
  whisperx-asr:
    environment:
      - CUDA_VISIBLE_DEVICES=0,1,2,3
      - NUM_GPU_REPLICAS=4

Use CUDA_VISIBLE_DEVICES (not NVIDIA_VISIBLE_DEVICES) since the Docker Compose deploy section exposes all GPUs to the container.

Then start with:

docker compose -f docker-compose.dev.yml -f docker-compose.dev.local.yml up -d

Model Selection

Available Whisper models (speed vs accuracy tradeoff):

Model     Parameters  VRAM (model only)  VRAM (full pipeline*)  Speed      Quality
tiny      39M         ~1GB               ~4GB                   Fastest    Lowest
base      74M         ~1GB               ~5GB                   Very Fast  Low
small     244M        ~2GB               ~6GB                   Fast       Medium
medium    769M        ~5GB               ~10GB                  Moderate   Good
large-v2  1550M       ~10GB              ~14GB                  Slow       Excellent
large-v3  1550M       ~10GB              ~14GB                  Slow       Best

*Full pipeline = Whisper model + alignment model + pyannote speaker diarization (measured on RTX 3090)

Recommendation:

  • Use large-v3 for best quality (requires 16GB+ VRAM)
  • Use small or medium for speed/resource constraints (8-12GB VRAM)

Running the Service

# Start in simple mode (default)
docker compose up -d

# Or start in Ray Serve mode
# Set SERVE_MODE=ray in your .env file, then:
docker compose up -d

# View logs
docker compose logs -f

Note: When using Ray Serve mode, Docker Compose is configured with shm_size: 8g for Ray's shared memory object store. The Ray Dashboard is exposed on port 8265.

Monitoring and Logs

View Logs

# Real-time logs
docker compose logs -f

# Last 100 lines
docker compose logs --tail=100

# Specific container logs
docker logs whisperx-asr-api

Health Check

# Check service health
curl http://localhost:9000/health

# Response:
{
  "status": "healthy",
  "device": "cuda",
  "loaded_models": ["large-v3"],
  "serve_mode": "ray"
}

Prometheus Metrics

GET /metrics returns OpenMetrics text format suitable for a Prometheus scrape config:

curl http://localhost:9000/metrics
# # HELP whisperx_requests_total Total HTTP requests by endpoint and status
# # TYPE whisperx_requests_total counter
# whisperx_requests_total{endpoint="/asr",status="ok"} 12.0
# whisperx_request_duration_seconds_bucket{endpoint="/asr",le="60.0"} 11.0
# ...

Metric                                       Type       Notes
whisperx_requests_total{endpoint,status}     Counter    status is ok, http_<code>, or error
whisperx_request_duration_seconds{endpoint}  Histogram  End-to-end handler time
whisperx_active_transcriptions               Gauge      In-flight /asr requests
whisperx_loaded_models                       Gauge      Whisper models currently in cache
whisperx_model_evictions_total{model}        Counter    Models unloaded by the idle-eviction sweep
whisperx_audio_duration_seconds              Histogram  Submitted audio duration
whisperx_audio_size_megabytes                Histogram  Submitted file size
whisperx_vram_allocated_bytes                Gauge      CUDA memory_allocated() (0 on CPU)
whisperx_service_info                        Info       Static labels: version, device, compute_type, serve_mode
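For reference, metrics of this shape can be emitted with the prometheus_client library. This is an illustrative sketch using a private registry and two of the metric names from the table, not the service's actual registration code; it assumes prometheus_client is installed.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

registry = CollectorRegistry()

requests_total = Counter(
    "whisperx_requests_total",
    "Total HTTP requests by endpoint and status",
    ["endpoint", "status"],
    registry=registry,
)
active_transcriptions = Gauge(
    "whisperx_active_transcriptions",
    "In-flight /asr requests",
    registry=registry,
)

# Simulate one successful /asr request.
active_transcriptions.inc()
requests_total.labels(endpoint="/asr", status="ok").inc()
active_transcriptions.dec()

# Render the registry in Prometheus text exposition format.
text = generate_latest(registry).decode()
print(text)
```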

Ray Serve caveat: /metrics is served by the ingress process. Whisper models load inside replica processes, so whisperx_loaded_models, whisperx_vram_allocated_bytes, and whisperx_model_evictions_total will read 0 (or be missing) in Ray Serve mode. These counters and gauges are per-process; without prometheus_client multi-process mode, the ingress's registry never sees the replicas' events. HTTP-level metrics (request counts, durations, audio sizes) are accurate in both modes because they are recorded inside the ingress handler. Use the Ray Dashboard on port 8265 for per-replica state, and tail the serve/replica_* log files for eviction events. Multi-process metrics support is a planned follow-up.

In simple mode (SERVE_MODE=simple) all metrics including the eviction counter and loaded-model gauge work as expected, since the model cache and the /metrics handler share a process. Each Whisper model load pre-registers a whisperx_model_evictions_total{model="<name>"} 0 row so dashboards can graph the metric from the first load onward, not only after the first eviction.

The legacy JSON shape (queue/loaded model state) is still available at GET /queue-metrics for callers that depended on it.

Performance Monitoring

Monitor GPU usage:

# NVIDIA GPU stats
nvidia-smi -l 1

# Docker container stats
docker stats whisperx-asr-api

# Ray Dashboard (Ray mode only)
# Open http://localhost:8265 in your browser

Offline Use

This service can run completely offline after an initial setup with internet access. This is useful for air-gapped environments or when you want to avoid network latency.

Initial Setup (requires internet)

  1. Start the container with internet access
  2. Run at least one transcription request with diarization enabled to cache all models:
    curl -X POST http://localhost:9000/asr \
      -F "audio_file=@test.mp3" \
      -F "diarize=true"
  3. This downloads and caches:
    • Whisper model (e.g., large-v3)
    • Alignment model (wav2vec2)
    • Pyannote speaker diarization models

Enable Offline Mode

Add HF_HUB_OFFLINE=1 to your docker-compose.yml environment section:

environment:
  - HF_HUB_OFFLINE=1
  # ... other environment variables

Important: This must be set directly in docker-compose.yml, not in the .env file.

Then restart the container:

docker compose down && docker compose up -d

The service will now operate without any network requests to Hugging Face.

What Gets Cached

Component        Cache Location                               Notes
Whisper models   /.cache/models--Systran--faster-whisper-*    Downloaded on first use
Alignment model  /.cache/wav2vec2_*.pth                       Downloaded on first alignment
Pyannote models  /.cache/huggingface/hub/models--pyannote--*  Downloaded on first diarization
NLTK tokenizers  /.cache/nltk_data/                           Pre-downloaded in Docker image

Troubleshooting Offline Mode

If you see errors like Failed to resolve 'huggingface.co':

  1. Ensure you ran a full transcription with diarization while online
  2. Verify HF_HUB_OFFLINE=1 is set in docker-compose.yml (not .env)
  3. Check the cache volume contains the models: docker exec whisperx-asr-api ls -la /.cache/huggingface/hub/

Troubleshooting

GPU Not Detected

Symptom: Service runs on CPU despite having GPU

Solution:

# Verify NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Out of Memory Errors

Symptom: CUDA out of memory errors or VRAM exhaustion with large files

Solutions:

  1. Reduce file size limit in .env: MAX_FILE_SIZE_MB=500 (default is 1000MB)
  2. Use smaller model: small or medium instead of large-v3
  3. Reduce batch size in .env: BATCH_SIZE=8 or BATCH_SIZE=4
  4. Use int8 precision: COMPUTE_TYPE=int8 (lower quality but less memory)
  5. Split large files: Process audio in smaller chunks before uploading
  6. Disable diarization: For very large files, skip speaker diarization

Note: The service automatically clears GPU cache between operations to minimize VRAM buildup, but very large files (>500MB) can still cause issues.

Speaker Diarization Not Working

Symptom: No speaker labels in output

Solutions:

  1. Verify HF_TOKEN is set correctly
  2. Accept model user agreements on Hugging Face
  3. Check logs for diarization errors: docker compose logs
  4. Ensure diarize=true in request

Slow Processing

Symptom: Transcription takes too long

Solutions:

  1. Use GPU instead of CPU (DEVICE=cuda)
  2. Use smaller model for faster processing
  3. Increase BATCH_SIZE (if you have VRAM)
  4. Disable diarization if not needed: diarize=false

PyTorch 2.6 Weights Loading Error

Symptom: Error message containing Weights only load failed or GLOBAL omegaconf.listconfig.ListConfig was not an allowed global

This occurs due to a security change in PyTorch 2.6 where weights_only=True became the default for torch.load().

Solution:

Add this environment variable to your docker-compose.yml:

environment:
  - TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true

Important: Setting this in .env file alone may not work - it must be set directly in docker-compose.yml under the environment section.

See WhisperX Issue #1304 for more details.

API Returns 500 Errors

Check logs:

docker compose logs whisperx-asr

Common causes:

  • Invalid audio format (use ffmpeg to convert)
  • Model not loaded (check VRAM, logs)
  • Incorrect parameters (check API docs)

Supported Audio Formats

The service supports formats that WhisperX can process (via FFmpeg):

  • Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
  • Video: MP4, AVI, MOV, MKV, WebM (audio track extracted)
  • Other: AMR, 3GP, 3GPP

Note: Large files (>1GB) may cause out-of-memory errors as files are loaded entirely into memory.

Security Notes

This service has NO built-in authentication or security features.

If exposing to a network:

  • Use firewall rules to restrict access
  • Consider putting behind a reverse proxy
  • Store HF_TOKEN securely (use .env file, not hardcoded)

Maintenance

Updating WhisperX

If you run the prebuilt image, pull the new tag:

docker compose pull
docker compose up -d

If you build from source, the default build now installs PyTorch 2.7.1 from the cu126 wheel index (Pascal-Hopper). To target a different combo:

git pull
docker compose build --no-cache \
  --build-arg TORCH_VERSION=2.8.0 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
docker compose up -d

Build args can also be set via Compose's build.args in an overlay file you check in.

Clearing Cache

# Remove model cache
docker compose down -v
docker volume rm whisperx-asr-service_whisperx-cache

# Rebuild
docker compose up -d

Backup

Backup the cache volume to preserve downloaded models:

docker run --rm -v whisperx-asr-service_whisperx-cache:/cache \
  -v $(pwd):/backup ubuntu tar czf /backup/whisperx-cache-backup.tar.gz /cache

Stress Testing

A stress test script is included to measure throughput and latency under concurrent load:

# Default: 4 concurrent workers, all files in testfiles/
python tests/stress_test.py

# 8 concurrent workers, 3 rounds
python tests/stress_test.py --workers 8 --rounds 3

# Test OpenAI-compat endpoint
python tests/stress_test.py --endpoint openai

# Without diarization
python tests/stress_test.py --no-diarize

Place .mp3 files in the testfiles/ directory (gitignored). The report shows per-request latency, throughput in requests/minute, and the speedup from concurrent execution. See tests/README.md for full details.

License

This project is MIT licensed. See LICENSE for details.

WhisperX is licensed under BSD-4-Clause. See WhisperX repository for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Support

For issues and questions, open an issue on the GitHub repository.

Credits

Built on WhisperX, faster-whisper, and pyannote.audio.

Changelog

v0.3.2 (2026-05-03)

Reported issues fixed

  • Pascal/Blackwell image variants (#15): :latest ships PyTorch 2.7.1 (cu126, supports Pascal through Hopper). New :blackwell tag ships PyTorch 2.8.0 (cu128) for RTX 50xx. The Dockerfile now exposes TORCH_VERSION and TORCH_INDEX_URL build args and re-pins torch after the WhisperX install so the requested version sticks (the upstream fork was silently upgrading torch to 2.8 and breaking Pascal). CI publishes both variants per release with separate buildx cache scopes; either job can fail independently.
  • Device-aware BATCH_SIZE default (#12): the default is now 16 on cuda and 2 on cpu. The hardcoded 16 was OOM-killing CPU runs (exit 137) on audio longer than ~30 minutes.
  • Idle model eviction (#16): new MODEL_KEEP_ALIVE_SECONDS env var (default 0/disabled) unloads Whisper models that have been idle longer than the configured window. MODEL_EVICTION_INTERVAL_SECONDS controls sweep frequency (floor 30s). The next request that needs an evicted model reloads it transparently.
  • Real Prometheus /metrics (#13): /metrics now returns OpenMetrics text instead of JSON. New histograms and counters cover request duration, status, audio duration/size, in-flight requests, loaded model count, model evictions, and VRAM. The previous JSON shape is preserved at /queue-metrics for callers that depended on it. See Prometheus Metrics for the full table and the Ray Serve caveat.

Other fixes shipped in v0.3.2

  • /asr accepts the OpenAI-style aliases advertised by /v1/models (whisper-1, whisper-tiny, whisper-large-v3, ...). Previously these returned a 500 because the raw value was passed straight through to faster_whisper.WhisperModel.
  • /v1/models is sourced from faster_whisper.available_models() instead of a hardcoded list, so the advertised set stays in sync with whatever engine version is installed (about 20 canonical names plus the whisper-1 alias).
  • app.* log records from Ray Serve replicas now flow into the per-replica log file. Previously Loading WhisperX model: X, Starting transcription..., etc. were silently dropped because Ray Serve disables propagation on its own logger.
  • whisperx_model_evictions_total{model="<name>"} is pre-registered with value 0 each time a model loads, so dashboards can graph the metric from the first load onward instead of only after the first eviction. Simple mode only; see the Ray Serve caveat.

New env vars

Variable                         Default        Description
MODEL_KEEP_ALIVE_SECONDS         0 (disabled)   Idle window after which a Whisper model is unloaded
MODEL_EVICTION_INTERVAL_SECONDS  60 (floor 30)  Sweep cadence for the eviction daemon

New build args

Build arg        Default                                 Description
TORCH_VERSION    2.7.1                                   PyTorch version to install
TORCH_INDEX_URL  https://download.pytorch.org/whl/cu126  PyTorch wheel index URL

Tests

tests/test_v0_3_2.sh and tests/test_keep_alive.sh cover the new endpoints and eviction logic; both verified end-to-end against a Ray Serve container.

v0.3.0 (2026-02-28)

  • Thread-safe model loading with double-checked locking for concurrent request safety
  • Add Ray Serve mode for high-throughput ASR with cross-request batching
  • Two pipeline strategies: replicate (full pipeline per GPU) and split (stage per GPU)
  • Multi-GPU support via NUM_GPU_REPLICAS and per-stage replica/fraction config
  • Refactor pipeline into shared stage functions (app/pipeline.py)
  • Add async GPU queue with semaphore for simple mode (non-blocking event loop)
  • Add /metrics endpoint for pipeline monitoring
  • Add SERVE_MODE, PIPELINE_STRATEGY, NUM_GPU_REPLICAS env vars
  • Add entrypoint.sh for automatic mode switching in Docker
  • GPU pinning via NVIDIA_VISIBLE_DEVICES in local compose overrides
  • Add stress test suite (tests/stress_test.py)

v0.2.0 (2025-01-21)

  • Add /v1/models and /v1/audio/transcriptions endpoints for OpenAI API compatibility
  • Add diarize parameter for broader API compatibility
  • Add offline mode support and fix model caching
  • Use Query() parameters to work as a drop-in replacement for other Whisper ASR services

v0.1.1alpha (2025-11-23)

  • Initial release
  • WhisperX integration with API wrapper
  • Speaker diarization support
  • Docker deployment
