A serverless worker that provides high-quality speech transcription with timestamp alignment and speaker diarization using WhisperX on the RunPod platform.
## Features

- Automatic speech transcription with WhisperX
- Automatic language detection
- Word-level timestamp alignment
- Speaker diarization (optional)
- Highly parallelized batch processing
- Voice activity detection with configurable parameters
- RunPod serverless compatibility
## Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `audio_file` | string | Yes | N/A | URL to the audio file for transcription |
| `language` | string | No | null | ISO code of the language spoken in the audio (e.g., `en`, `fr`). If not specified, automatic detection is performed |
| `language_detection_min_prob` | float | No | 0 | Minimum probability threshold for language detection |
| `language_detection_max_tries` | int | No | 5 | Maximum number of attempts for language detection |
| `initial_prompt` | string | No | null | Optional text to provide as a prompt for the first transcription window |
| `batch_size` | int | No | 64 | Batch size for parallelized input audio transcription |
| `temperature` | float | No | 0 | Temperature to use for sampling (higher = more random) |
| `vad_onset` | float | No | 0.500 | Voice Activity Detection onset threshold |
| `vad_offset` | float | No | 0.363 | Voice Activity Detection offset threshold |
| `align_output` | bool | No | false | Whether to align Whisper output for accurate word-level timestamps |
| `diarization` | bool | No | false | Whether to assign speaker ID labels to segments |
| `huggingface_access_token` | string | No* | null | HuggingFace token for diarization model access (*required if diarization is enabled) |
| `min_speakers` | int | No | null | Minimum number of speakers (only applicable if diarization is enabled) |
| `max_speakers` | int | No | null | Maximum number of speakers (only applicable if diarization is enabled) |
| `debug` | bool | No | false | Whether to print compute/inference times and memory usage information |
| `speaker_samples` | list | No | [] | List of speaker sample objects for speaker diarization |
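The defaults above can be mirrored client-side before submitting a job. Below is a minimal sketch of a payload builder; the helper name and validation logic are illustrative (the worker only requires `audio_file` and applies the listed defaults server-side), but it captures the one cross-field rule in the table: `huggingface_access_token` is required when `diarization` is enabled.

```python
# Sketch of a client-side payload builder mirroring the parameter table above.
# The function and DEFAULTS dict are illustrative, not part of the worker API.

DEFAULTS = {
    "language": None,
    "language_detection_min_prob": 0,
    "language_detection_max_tries": 5,
    "initial_prompt": None,
    "batch_size": 64,
    "temperature": 0,
    "vad_onset": 0.500,
    "vad_offset": 0.363,
    "align_output": False,
    "diarization": False,
    "huggingface_access_token": None,
    "min_speakers": None,
    "max_speakers": None,
    "debug": False,
    "speaker_samples": [],
}

def build_input(audio_file: str, **overrides) -> dict:
    """Merge overrides onto the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides, "audio_file": audio_file}
    # Per the table: the token is required when diarization is enabled.
    if payload["diarization"] and not payload["huggingface_access_token"]:
        raise ValueError("huggingface_access_token is required when diarization is enabled")
    return {"input": payload}
```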
## Usage Examples

### Basic Request

```json
{
  "input": {
    "audio_file": "https://github.com/runpod-workers/sample-inputs/raw/main/audio/gettysburg.wav"
  }
}
```

### With Alignment and Debugging

```json
{
  "input": {
    "audio_file": "https://github.com/runpod-workers/sample-inputs/raw/main/audio/gettysburg.wav",
    "align_output": true,
    "batch_size": 32,
    "debug": true
  }
}
```

### Full Configuration with Diarization

```json
{
  "input": {
    "audio_file": "https://github.com/runpod-workers/sample-inputs/raw/main/audio/gettysburg.wav",
    "language": "en",
    "batch_size": 32,
    "temperature": 0.2,
    "align_output": true,
    "diarization": true,
    "huggingface_access_token": "YOUR_HUGGINGFACE_TOKEN",
    "min_speakers": 2,
    "max_speakers": 5,
    "debug": true
  }
}
```

### Full Configuration with Speaker Verification

There is no limit to the number of voice samples you can upload, but precision may be reduced beyond a certain threshold.
```json
{
  "input": {
    "audio_file": "https://example.com/audio/sample.mp3",
    "language": "en",
    "batch_size": 32,
    "temperature": 0.2,
    "align_output": true,
    "diarization": true,
    "huggingface_access_token": "YOUR_HUGGINGFACE_TOKEN",
    "min_speakers": 2,
    "max_speakers": 5,
    "debug": true,
    "speaker_verification": true,
    "speaker_samples": [
      {
        "name": "Speaker1",
        "url": "https://example.com/speaker1.wav"
      },
      {
        "name": "Speaker2",
        "url": "https://example.com/speaker2.wav"
      },
      {
        "name": "Speaker3",
        "url": "https://example.com/speaker3.wav"
      }
      ...
    ]
  }
}
```
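A request like the ones above is sent to a deployed endpoint over RunPod's HTTP API. The sketch below only constructs the request object without sending it; the endpoint ID and API key are placeholders, and it assumes RunPod's standard `https://api.runpod.ai/v2/<endpoint_id>/runsync` route for synchronous jobs.

```python
import json
import urllib.request

def make_runsync_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for RunPod's synchronous run route."""
    return urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Example: the basic request from above (placeholder credentials).
req = make_runsync_request(
    "YOUR_ENDPOINT_ID",
    "YOUR_API_KEY",
    {"input": {"audio_file": "https://github.com/runpod-workers/sample-inputs/raw/main/audio/gettysburg.wav"}},
)
# Sending it would be: urllib.request.urlopen(req)
```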
## Output Format
The service returns a JSON object structured as follows:
### Without Diarization
```json
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Transcribed text segment 1",
"words": [
{"word": "Transcribed", "start": 0.1, "end": 0.7},
{"word": "text", "start": 0.8, "end": 1.2},
{"word": "segment", "start": 1.3, "end": 1.9},
{"word": "1", "start": 2.0, "end": 2.4}
]
},
{
"start": 2.5,
"end": 5.0,
"text": "Transcribed text segment 2",
"words": [
{"word": "Transcribed", "start": 2.6, "end": 3.2},
{"word": "text", "start": 3.3, "end": 3.7},
{"word": "segment", "start": 3.8, "end": 4.4},
{"word": "2", "start": 4.5, "end": 4.9}
]
}
],
"detected_language": "en",
"language_probability": 0.997
}
```

### With Diarization

```json
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Transcribed text segment 1",
"words": [
{"word": "Transcribed", "start": 0.1, "end": 0.7, "speaker": "SPEAKER_01"},
{"word": "text", "start": 0.8, "end": 1.2, "speaker": "SPEAKER_01"},
{"word": "segment", "start": 1.3, "end": 1.9, "speaker": "SPEAKER_01"},
{"word": "1", "start": 2.0, "end": 2.4, "speaker": "SPEAKER_01"}
],
"speaker": "SPEAKER_01"
},
{
"start": 2.5,
"end": 5.0,
"text": "Transcribed text segment 2",
"words": [
{"word": "Transcribed", "start": 2.6, "end": 3.2, "speaker": "SPEAKER_02"},
{"word": "text", "start": 3.3, "end": 3.7, "speaker": "SPEAKER_02"},
{"word": "segment", "start": 3.8, "end": 4.4, "speaker": "SPEAKER_02"},
{"word": "2", "start": 4.5, "end": 4.9, "speaker": "SPEAKER_02"}
],
"speaker": "SPEAKER_02"
}
],
"detected_language": "en",
"language_probability": 0.997,
"speakers": {
"SPEAKER_01": {"name": "Speaker 1", "time": 2.5},
"SPEAKER_02": {"name": "Speaker 2", "time": 2.5}
}
}
```

## Performance Considerations

- GPU Memory: Adjust `batch_size` based on available GPU memory for optimal performance
- Processing Time: Enabling diarization and alignment will increase processing time
- File Size: Large audio files may require more processing time and resources
- Language Detection: For shorter audio clips, language detection may be less accurate
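For downstream use, the `segments` list shown above maps directly onto subtitle formats. The helper below is an illustrative sketch (not part of the worker) that renders segments, with optional speaker labels, as SRT:

```python
def segments_to_srt(segments: list[dict]) -> str:
    """Render WhisperX-style segments (start/end in seconds, text) as SRT."""
    def ts(seconds: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")  # present only when diarization is enabled
        text = f"[{speaker}] {seg['text']}" if speaker else seg["text"]
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{text}")
    return "\n\n".join(blocks) + "\n"
```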
## Troubleshooting

- "Model was trained with pyannote.audio 0.0.1, yours is X.X.X"
  - This is a warning only and shouldn't affect functionality in most cases
  - If issues persist, consider downgrading pyannote.audio
- Diarization failures
  - Ensure you're providing a valid HuggingFace access token
  - Try specifying reasonable min/max speaker values
## Building the Docker Image

```bash
docker build -t your-username/whisperx-worker:your-tag .
```

## License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

## Acknowledgements
- This project utilizes code from WhisperX, licensed under the BSD-2-Clause license
- Special thanks to the RunPod team for the serverless platform
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.