A transformer-based speaker diarization system for audio processing and analysis.
This project requires FFmpeg for audio/video processing. Install it using:
```bash
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
```

Install the required Python packages:

```bash
pip install -r requirements.txt
```

The main dependencies include:

- `lhotse==1.31.1` - Audio data processing
- `yamlargparse==1.31.1` - YAML configuration parsing
- `gdown==5.2.0` - Google Drive file downloads
- `torchaudio>=2.0.0` - Audio processing with PyTorch
If using a development container, ensure FFmpeg is installed in the container. Add this to your Dockerfile or devcontainer.json:
```dockerfile
RUN apt-get update && apt-get install -y ffmpeg
```

The system supports multiple datasets, including AVA-AVD, MSWild, and VoxConverse. Use the data manager to download and prepare datasets:
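Alternatively, FFmpeg can be installed via a `postCreateCommand` in `devcontainer.json`. This is a minimal sketch; the base image shown is illustrative, not the one this project uses:

```json
{
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "sudo apt-get update && sudo apt-get install -y ffmpeg"
}
```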
```bash
python datasets/data_manager.py --config your_config.yml
```

The AVA-AVD dataset requires video processing capabilities. Make sure FFmpeg is installed before processing:
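The actual config schema is defined by the classes in `datasets/dataset_types.py`; as an illustration only (the field names below are assumptions, not the real schema), a config might look like:

```yaml
# Hypothetical dataset config - check datasets/dataset_types.py for the real fields
datasets:
  - name: voxconverse
    download: true
data_dir: data/
manifest_dir: manifests/
```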
```bash
# Test AVA-AVD integration (small subset)
python datasets/data_manager.py --config test_ava_avd_small.yml

# Full AVA-AVD processing (with videos)
python datasets/data_manager.py --config test_ava_avd_full.yml
```

AVA-AVD dataset processing has been optimized to avoid slow TorchAudio StreamReader calls:

- Uses `ffprobe` directly for metadata extraction (much faster)
- Avoids deprecated TorchAudio APIs that cause slowdowns
- Processes videos in parallel when possible
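The `ffprobe`-based metadata extraction can be sketched as follows. This is a minimal illustration of the approach, not the project's actual implementation; `probe_duration` and `parse_duration` are hypothetical helper names:

```python
import json
import subprocess


def parse_duration(ffprobe_json: str) -> float:
    """Parse the duration (seconds) from ffprobe's JSON output."""
    info = json.loads(ffprobe_json)
    return float(info["format"]["duration"])


def probe_duration(path: str) -> float:
    """Get media duration via ffprobe (assumes ffprobe is on PATH)."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format",
        path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_duration(out.stdout)


# ffprobe's -show_format JSON looks roughly like this:
sample = '{"format": {"filename": "clip.mp4", "duration": "12.345"}}'
print(parse_duration(sample))  # 12.345
```

Calling out to `ffprobe` once per file avoids opening a TorchAudio streaming decoder just to read a duration, which is where the slowdown came from.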
For large datasets:

- Consider processing splits separately
- Use `download_videos: false` for annotation-only testing
- Ensure sufficient disk space for video files
If you encounter FFmpeg-related errors:

- Ensure FFmpeg is installed: `ffmpeg -version`
- Check that the FFmpeg binary is in your PATH
- For development containers, ensure FFmpeg is installed in the container
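A quick way to check both conditions from Python, using only the standard library (a small sketch, not part of this project's code):

```python
import shutil
import subprocess

# shutil.which returns the resolved path, or None if ffmpeg is not on PATH
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found in PATH")
else:
    # The first line of `ffmpeg -version` reports the installed version
    out = subprocess.run([ffmpeg_path, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
```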
You may see deprecation warnings from TorchAudio. These are expected and don't affect functionality:

```
UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated
```
For video datasets like AVA-AVD:

- Ensure sufficient disk space (videos can be large)
- Consider using `download_videos: false` in your config for testing without full video downloads
- Note that video processing requires significant memory and CPU resources
If you see a CUDA OOM originating from `causal_linear_attention` (e.g. an error pointing at `context_cumsum`), the causal linear attention implementation has built large intermediate tensors for the chunk being processed. Recommended steps:

- Reduce the training/evaluation batch size in your config (fastest fix).
- Reduce `eval_knobs.sliding_window` or `global_config.features.batch_duration` to shorten the sequence length.
- Reduce `model.encoder.nb_features` / `model.decoder.nb_features` to lower the random-features dimensionality.
- Tune `causal_chunk_size` for the model layers: smaller chunk sizes lower peak memory at the cost of some runtime overhead.
- Run with mixed precision if not already enabled (config: `training.mixed_precision: true`).
- Set an environment variable to reduce allocator fragmentation: `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

Where possible, try a smaller `causal_chunk_size` first. The code supports passing `causal_chunk_size` through model construction, so you can set it from your model config (lower is safer for memory).
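The memory effect of `causal_chunk_size` can be illustrated with a toy chunked prefix sum (a sketch of the idea only, not the project's attention code): processing the sequence chunk by chunk while carrying a running state produces the same causal accumulation as materializing the whole sequence at once, but the largest temporary is one chunk.

```python
import numpy as np


def chunked_cumsum(x: np.ndarray, chunk_size: int) -> np.ndarray:
    """Causal prefix sum computed chunk by chunk with a carried state.

    Peak temporary memory is one chunk instead of the full sequence -
    the same trade-off a smaller causal_chunk_size makes for the
    attention layers' context accumulation.
    """
    out = np.empty_like(x)
    carry = np.zeros_like(x[0])  # state accumulated from previous chunks
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        cs = np.cumsum(chunk, axis=0) + carry  # causal within the chunk
        out[start:start + chunk_size] = cs
        carry = cs[-1]  # hand the running total to the next chunk
    return out


x = np.random.randn(10, 4)  # (time, features)
# Identical result to the one-shot cumsum, at lower peak memory
assert np.allclose(chunked_cumsum(x, chunk_size=3), np.cumsum(x, axis=0))
```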
```
TS-DIA/
├── datasets/            # Dataset management and recipes
│   ├── recipes/         # Dataset-specific download/prepare functions
│   ├── data_manager.py  # Main dataset management system
│   └── dataset_types.py # Dataset configuration classes
├── model/               # Transformer model components
├── data/                # Dataset storage
├── manifests/           # Processed dataset manifests
├── configs/             # Configuration files
└── examples/            # Usage examples
```
- AVA-AVD: Audio-visual diarization dataset with video files
- MSWild: Multi-speaker wild dataset
- VoxConverse: Speaker diarization benchmark dataset