- Smart Language Detection - Automatically detects Swedish and English (for optimal alignment models)
- Speaker Diarization - Identifies who said what in multi-speaker recordings (Swedish works without HF_TOKEN)
- Automated Workflow - Drop files → Get transcripts → Files archived
- Native Notifications - macOS notifications for processing status
- Multiple Formats - Default TXT output, customizable to include JSON, SRT, VTT, and TSV
- Optimized Performance - float32 compute type for better quality (configurable)
- Production Ready - Built with uv for reliable dependency management
- macOS only (notifications and system integration)
- Python 3.9-3.12
- uv package manager
- HuggingFace account for speaker diarization (optional for Swedish)
# Clone this repository
git clone https://github.com/jimmystridh/whisperx-transcription.git
cd whisperx-transcription
# Install dependencies
uv sync
# Configure your HuggingFace token (optional for Swedish)
cp .env.sample .env
# Edit .env and add your HF_TOKEN from https://huggingface.co/settings/tokens
# Note: Swedish diarization works without HF_TOKEN

# Test with a single file
uv run python transcribe.py path/to/your/audio.m4a
# Start automatic processing
uv run python watcher.py

That's it! Drop audio files in incoming/ and watch the magic happen! ✨
whisperx-transcription/
├── incoming/           # Drop new recordings here
├── transcripts/        # Your transcripts appear here
├── recording_archive/  # Processed files stored here
├── transcribe.py       # Core transcription engine
├── watcher.py          # File monitoring daemon
├── .env                # Your configuration
├── pyproject.toml      # Project dependencies
└── README.md           # This guide
- Drop - Put audio files into the incoming/ folder
- Detection - System detects language and loads optimal models
- Transcription - WhisperX processes with timestamps and alignment
- Diarization - Identifies different speakers (if enabled)
- Output - Transcript files saved to transcripts/ (TXT by default)
- Archive - Original files moved to recording_archive/
- Notify - macOS notification confirms completion
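The steps above could be wired together roughly as follows. This is a minimal sketch and not the project's actual code: `process_file` and the `transcribe` callable are hypothetical names, and the detection, transcription, and diarization steps are delegated to whatever function the caller passes in, so WhisperX itself is not invoked here.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_file(
    audio: Path,
    transcribe: Callable[[Path], str],
    transcripts_dir: Path,
    archive_dir: Path,
) -> Path:
    """Transcribe one recording, save the text, then archive the original."""
    transcripts_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Detection, transcription, and diarization are delegated to the
    # caller-supplied function (an assumption made for this sketch).
    text = transcribe(audio)

    # Output: write the default TXT transcript.
    transcript = transcripts_dir / f"{audio.stem}.txt"
    transcript.write_text(text, encoding="utf-8")

    # Archive: move the original only after the transcript exists, so a
    # failed run leaves the recording in incoming/ for a retry.
    shutil.move(str(audio), str(archive_dir / audio.name))
    return transcript
```

Archiving last is a deliberate choice in this sketch: if transcription raises, the recording stays where it was dropped and nothing is lost.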
Perfect for ongoing transcription needs:
uv run python watcher.py

- Processes existing files first (no need to move files around)
- Monitors incoming/ folder continuously
- Processes files as soon as they're added
- Sends macOS notifications for each completed transcription
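For readers curious how such a watcher can work, here is a minimal polling sketch that uses only the standard library. It is an illustration under assumptions, not the implementation in watcher.py (which may rely on a file-system event library instead): the `watch` function, its `handle` callback, and the `poll_seconds` and `max_polls` parameters are hypothetical names.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Set


def watch(
    incoming: Path,
    handle: Callable[[Path], None],
    poll_seconds: float = 2.0,
    max_polls: Optional[int] = None,
) -> None:
    """Poll a folder and pass each newly seen file to handle()."""
    seen: Set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        # Skip hidden files such as .DS_Store, which Finder creates on macOS.
        current = {
            p for p in incoming.iterdir()
            if p.is_file() and not p.name.startswith(".")
        }
        # The first pass picks up files that were already waiting;
        # later passes pick up files added since the previous poll.
        for path in sorted(current - seen):
            handle(path)
        # Remember only files that are still present, so a recording that
        # was archived and later re-dropped under the same name runs again.
        seen = {p for p in current if p.exists()}
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

Calling `watch(Path("incoming"), handler, max_polls=1)` processes whatever is already in the folder and returns, which is similar in spirit to a single pass without watching. A real watcher would also need to wait until a file has finished copying before handing it over.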
For one-off transcriptions:
uv run python transcribe.py my-recording.m4a -o ./transcripts

Process existing files all at once:
uv run python watcher.py --process-existing --once

uv run python transcribe.py [file] [options]
-o, --output DIR      Output directory (default: ./transcripts)
-l, --language CODE   Force language (sv/en, auto-detected by default)
--no-diarize          Skip speaker identification
--device DEVICE       cpu or cuda (default: cpu)
--formats FORMAT      Output formats (default: txt)
--all-formats         Generate all output formats

uv run python watcher.py [options]
--incoming DIR        Watch directory (default: ./incoming)
--transcripts DIR     Output directory (default: ./transcripts)
--archive DIR         Archive directory (default: ./recording_archive)
--process-existing    Process files already in incoming/
--once                Process once and exit (don't watch)

| Format | Extension | Notes |
|---|---|---|
| MPEG-4 Audio | .m4a | Recommended |
| MPEG-4 Video | .mp4 | Audio extracted |
| QuickTime | .mov | Audio extracted |
| WAVE | .wav | Uncompressed |
| MP3 | .mp3 | Compressed |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open source |
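If a dropped file is ignored, a quick way to check it against the table is a small extension test like the one below. This is only a sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the case-insensitive match is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions copied from the supported-formats table.
SUPPORTED_EXTENSIONS = {".m4a", ".mp4", ".mov", ".wav", ".mp3", ".flac", ".ogg"}


def is_supported(path: Path) -> bool:
    """Return True if the file extension appears in the table (any case)."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported(Path("meeting-notes.M4A")))  # True
print(is_supported(Path("notes.txt")))          # False
```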
Smart Detection: The system automatically detects Swedish vs English to select optimal alignment models.
| Language | Alignment Model | Auto-Detection | Diarization |
|---|---|---|---|
| Swedish | KBLab/wav2vec2-base-voxpopuli-sv-swedish | Supported | No HF_TOKEN needed |
| English | WAV2VEC2_ASR_LARGE_LV60K_960H | Supported | Requires HF_TOKEN |
| Others | English model (fallback) | Manual only | Requires HF_TOKEN |
Note: Language detection is currently tuned for Swedish and English. For any other language, specify it manually with the -l flag.
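The table can be read as a simple lookup with a fallback. The sketch below encodes it that way; it is an illustration, not the logic in transcribe.py. The names `Plan`, `plan_for`, `ALIGNMENT_MODELS`, and `TOKEN_FREE_DIARIZATION` are hypothetical, and the language codes `sv` and `en` are the ones accepted by the -l option.

```python
from dataclasses import dataclass
from typing import Optional

# Alignment models as listed in the language table.
ALIGNMENT_MODELS = {
    "sv": "KBLab/wav2vec2-base-voxpopuli-sv-swedish",
    "en": "WAV2VEC2_ASR_LARGE_LV60K_960H",
}
# Languages the table lists as not needing HF_TOKEN for diarization.
TOKEN_FREE_DIARIZATION = {"sv"}


@dataclass(frozen=True)
class Plan:
    language: str
    alignment_model: str
    can_diarize: bool


def plan_for(language: str, hf_token: Optional[str]) -> Plan:
    """Pick an alignment model and decide whether diarization can run."""
    code = language.strip().lower()
    # Unlisted languages fall back to the English alignment model.
    model = ALIGNMENT_MODELS.get(code, ALIGNMENT_MODELS["en"])
    # Swedish needs no token; everything else needs a non-empty HF_TOKEN.
    can_diarize = code in TOKEN_FREE_DIARIZATION or bool(hf_token)
    return Plan(code, model, can_diarize)


print(plan_for("sv", None).can_diarize)  # True
print(plan_for("en", None).can_diarize)  # False
```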
By default, each audio file generates TXT output (customizable):
| Format | File | Description |
|---|---|---|
| Plain Text | filename.txt | Clean transcript with speaker labels |
| JSON | filename.json | Full metadata, timestamps, confidence scores |
| SRT | filename.srt | Subtitle format for video players |
| WebVTT | filename.vtt | Web-compatible subtitle format |
| TSV | filename.tsv | Spreadsheet-friendly with precise timings |
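The two subtitle formats in the table differ mainly in how cue timestamps are written: SRT uses a comma before the milliseconds, WebVTT a period. The sketch below shows that conversion from seconds. The function names `format_timestamp` and `srt_cue` are hypothetical and are not taken from this project or from WhisperX.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Format seconds as HH:MM:SS<marker>mmm (comma for SRT, period for VTT)."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue block."""
    return f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"


print(format_timestamp(0.5))                      # 00:00:00,500
print(format_timestamp(3.2, decimal_marker="."))  # 00:00:03.200
```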
transcripts/
├── meeting-2025-01-15.txt   # [Speaker 1] Welcome everyone...
├── meeting-2025-01-15.json  # {"segments": [{"start": 0.5, ...}]}
├── meeting-2025-01-15.srt   # 1\n00:00:00,500 --> 00:00:03,200\n...
├── meeting-2025-01-15.vtt   # WEBVTT\n\n00:00:00.500 --> 00:00:03.200\n...
└── meeting-2025-01-15.tsv   # start end speaker text
Stay informed with native notifications:
- Processing Started - When transcription begins
- Transcription Complete - With language, duration, and speaker info
- Processing Failed - Error details for troubleshooting
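One common way to raise a macOS notification from Python is to call the built-in `osascript` command with an AppleScript `display notification` statement. The sketch below shows that approach. It is an assumption about how such notifications can be sent, not a description of what this project does, and `build_notification_command` and `notify` are hypothetical names.

```python
import subprocess
import sys
from typing import List


def _applescript_string(value: str) -> str:
    """Quote a Python string as an AppleScript string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_notification_command(title: str, message: str) -> List[str]:
    """Build the osascript argument list for a notification."""
    script = (
        f"display notification {_applescript_string(message)} "
        f"with title {_applescript_string(title)}"
    )
    return ["osascript", "-e", script]


def notify(title: str, message: str) -> bool:
    """Send the notification on macOS; do nothing on other platforms."""
    if sys.platform != "darwin":
        return False
    result = subprocess.run(build_notification_command(title, message), check=False)
    return result.returncode == 0
```

Passing the command as a list avoids shell quoting problems, and escaping quotes and backslashes keeps file names with unusual characters from breaking the AppleScript statement.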
# Note: Primarily tested on macOS (CPU-only)
# GPU support available but not extensively tested on macOS
uv sync --extra gpu # Install CUDA support
uv run python transcribe.py file.m4a --device cuda

# Compare different configurations (float32 is now default)
uv run python benchmark.py audio.m4a
uv run python benchmark.py audio.m4a --compute-types "float32,int8,float16"

uv sync --extra dev # Install Jupyter, matplotlib, etc.

Set up automatic startup with launchd:
# Create service file
cat > ~/Library/LaunchAgents/com.user.whisperx-watcher.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.user.whisperx-watcher</string>
<key>ProgramArguments</key>
<array>
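<!-- launchd does not inherit your shell PATH; if the service fails to start, replace uv with its absolute path (find it with: which uv) -->
<string>uv</string>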
<string>run</string>
<string>python</string>
<string>watcher.py</string>
</array>
<key>WorkingDirectory</key>
<string>/path/to/your/whisperx-transcription</string>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/path/to/your/whisperx-transcription/watcher.out.log</string>
<key>StandardErrorPath</key>
<string>/path/to/your/whisperx-transcription/watcher.err.log</string>
</dict>
</plist>
EOF
# Start service
launchctl load ~/Library/LaunchAgents/com.user.whisperx-watcher.plist

| Issue | Solution |
|---|---|
| "HF_TOKEN not found" | Create .env file with your HuggingFace token (not needed for Swedish) |
| "CUDA not available" | Normal on Mac - system automatically uses CPU |
| Files not processing | Check file format, ensure complete upload, verify macOS compatibility |
| Slow processing | Use benchmark tool to find optimal settings for your hardware |
| No notifications | Check macOS notification permissions in System Settings (System Preferences on older macOS versions) |
| Language detection issues | Currently optimized for Swedish/English only - use -l flag for others |
Monitor system activity:
tail -f transcription.log # Transcription process
tail -f watcher.log # File monitoring

# Quick system test
uv run python transcribe.py --help
uv run python watcher.py --help
# Process test file
uv run python watcher.py --process-existing --once

- Timing: Run during off-hours for CPU-intensive tasks
- File Size: Split very long recordings (>2 hours) for better processing
- Batch Mode: Use --process-existing for multiple files
- Storage: Archive old transcripts to save disk space
Here's a typical day with the system:
# Morning: Start the watcher
uv run python watcher.py
# During the day: Drop recordings in incoming/
# - meeting-notes.m4a
# - interview-candidate.mov
# - conference-call.wav
# System automatically:
# - Detects languages
# - Transcribes with speakers
# - Saves multiple formats
# - Archives originals
# - Sends notifications
# Evening: Find all transcripts ready in transcripts/

Found a bug or have a feature request?
- Check existing issues
- Create detailed bug reports
- Share your use cases
This project is open source. Feel free to modify and adapt for your needs.
Made with ❤️ for seamless audio transcription