dillingerstaffing/readback

Interview Audio Triage Tool

Trims silence from an interview audio file with FFmpeg, then does a fast triage pass — transcript, sentiment, and energy per chunk — and writes a JSON report.

Quick Start (< 2 minutes)

Prerequisites: Zig 0.13+ and FFmpeg; Python 3.8+ only if you want sentiment analysis

git clone <repo>
cd readback

zig build
mkdir -p input output
cp your_interview.mp3 input/your_interview.mp3
zig-out/bin/readback input/your_interview.mp3

Output:

  • output/trimmed.mp3 — silence removed
  • output/report.json — analysis report
  • output/your_interview.mp3_sentiment.json — per-chunk transcript + sentiment (with sentiment setup; see below)

Prerequisites

  • Zig 0.13+: Download from ziglang.org
  • FFmpeg: Install via brew install ffmpeg (macOS), apt install ffmpeg (Ubuntu), or ffmpeg.org
  • Python 3.8+ (optional, only if using sentiment analysis)

Build

zig build test        # Run full test suite
zig build             # Build binary

Binary location: zig-out/bin/readback

Usage

# Trim silence and generate report (sentiment analysis runs only after the optional setup below)
zig-out/bin/readback input/your_interview.mp3

# With custom output paths
zig-out/bin/readback input/your_interview.mp3 output/trimmed.mp3 output/report.json

Sentiment Analysis (Optional)

Emotion is read from the audio (prosody), not the words — so unusual tone on ordinary speech is caught. "How are you?" said hostilely is flagged; the same words said calmly are not.

For each 8-second chunk:

  • emotion: neutral / angry / fearful / happy / sad / calm, derived from the three dimensions below with a neutral deadzone (only clearly off-neutral chunks get a non-neutral label, which guards against misclassification)
  • valence / arousal / dominance — the raw audio-model outputs (each ~0–1), kept for auditing. Hostility = high arousal + high dominance + low valence; dominance is what separates "angry" from "fearful"
  • text — speech transcript (OpenAI Whisper, tiny model)
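
As a rough illustration of the derivation described in the list above, here is a minimal sketch in Python. The exact rule the tool applies is not documented here, so everything in it is an assumption: the function name `label_emotion`, the neutral center of 0.5, and the quadrant mapping (low valence with high arousal splits into angry or fearful on dominance; the remaining quadrants map to sad, happy, and calm).

```python
def label_emotion(valence: float, arousal: float, dominance: float,
                  deadzone: float = 0.10, center: float = 0.5) -> str:
    """Map one chunk's valence/arousal/dominance scores to a coarse label.

    Assumed rule, for illustration only: scores sit on a ~0-1 scale with
    0.5 as the neutral center, and a chunk stays neutral unless valence or
    arousal is more than `deadzone` away from that center.
    """
    if abs(valence - center) <= deadzone and abs(arousal - center) <= deadzone:
        return "neutral"

    low_valence = valence < center
    high_arousal = arousal >= center

    if low_valence and high_arousal:
        # Dominance separates the two hostile-sounding cases.
        return "angry" if dominance >= center else "fearful"
    if low_valence:
        return "sad"
    return "happy" if high_arousal else "calm"


if __name__ == "__main__":
    # Hypothetical scores, not real model output.
    print(label_emotion(0.20, 0.80, 0.80))
    print(label_emotion(0.20, 0.80, 0.20))
    print(label_emotion(0.52, 0.47, 0.50))
```

Under this assumed rule, a wider deadzone leaves more chunks labeled neutral and a narrower one surfaces more emotion labels.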

The emotion model is audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, fine-tuned on MSP-Podcast (real conversational speech) — unlike acted-emotion models, it generalizes to natural audio. Valence is the model's weakest dimension (a known limitation of speech emotion recognition), so treat it as the coarsest of the three. Whisper-tiny on fixed 8-second cuts splits some utterances: transcripts are triage signposts, not clip- or show-note-grade.

Setup (one-time)

./scripts/setup_sentiment.sh

This creates .venv/ with all Python dependencies (torch, transformers, openai-whisper, soundfile, soxr). The emotion model (~661 MB) and Whisper tiny weights (~75 MB) download on first run.

Run with sentiment analysis

zig build
zig-out/bin/readback input/your_interview.mp3

Outputs:

  • output/your_interview.mp3_sentiment.json — normalized per-chunk transcript + sentiment
  • output/your_interview.mp3.raw_sentiment.json — raw Python output (intermediate, also a cache)
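
One way to use the normalized file is to pull out the chunks that did not come back neutral. The sketch below assumes the file is a JSON array with one object per chunk carrying `emotion` and `text` keys; that layout is a guess based on the fields described earlier, not a documented schema, so check a real output file before relying on it. `flagged_chunks` is a hypothetical helper name.

```python
import json


def flagged_chunks(report_text: str) -> list:
    """Return (index, emotion, text) for every chunk not labeled neutral.

    Assumes `report_text` is a JSON array of per-chunk objects with
    "emotion" and "text" keys (an assumed layout, not a documented one).
    """
    chunks = json.loads(report_text)
    return [
        (i, c["emotion"], c.get("text", ""))
        for i, c in enumerate(chunks)
        if c.get("emotion", "neutral") != "neutral"
    ]


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real sentiment file.
    sample = json.dumps([
        {"emotion": "neutral", "text": "Thanks for joining."},
        {"emotion": "angry", "text": "How are you?"},
    ])
    for index, emotion, text in flagged_chunks(sample):
        print(index, emotion, text)
```

With 8-second chunks, a chunk index multiplied by 8 gives an approximate offset in seconds into the trimmed audio.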

Audio is decoded in a single sequential pass, one 8-second chunk in memory at a time, so a four-hour file never loads into memory: decoding a 4.4-hour MP3 is ~5 s of I/O at a flat ~0.2 GB, and total process memory is set by the models (~1 GB), independent of file length. If the raw sentiment JSON already exists for an input, inference is skipped — delete it to force a fresh run.
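
The single-pass, one-chunk-at-a-time idea can be sketched as a generator over a decoded PCM stream. This is not the project's decoder: it assumes the audio has already been decoded to raw 16-bit mono PCM at 16 kHz (for example by FFmpeg writing `-f s16le -ac 1 -ar 16000` to a pipe), and `iter_pcm_chunks` is a hypothetical name.

```python
import io
from typing import BinaryIO, Iterator


def iter_pcm_chunks(stream: BinaryIO, sample_rate: int = 16000,
                    chunk_seconds: int = 8,
                    bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive fixed-length chunks from a raw PCM byte stream.

    Only one chunk is held at a time, so memory use does not grow with the
    length of the recording. The final chunk may be shorter than the rest.
    Assumes a buffered stream whose read(n) returns n bytes until EOF.
    """
    chunk_bytes = sample_rate * chunk_seconds * bytes_per_sample
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # 20 seconds of silence standing in for decoded audio.
    fake_audio = io.BytesIO(bytes(16000 * 2 * 20))
    for n, chunk in enumerate(iter_pcm_chunks(fake_audio)):
        print(n, len(chunk) / (16000 * 2), "seconds")
```

Because each chunk is dropped before the next one is read, the decoding side holds only about 256 KB at these settings, whatever the file length.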

Steady-state runtime: both the emotion model and Whisper run on every chunk (~0.3 s of inference per 8 s chunk on an M-series Mac — emotion on MPS, Whisper on CPU), so a four-hour file (~2,000 chunks) completes in roughly 12 minutes (measured: a 4.39 h MP3 → 1,978 segments in 12.0 min, 0.96 GB peak RSS). That cost is paid once: the raw-JSON cache is what stands between a re-run and re-inferring every chunk again, so keep output/<name>.raw_sentiment.json unless you intend to recompute. Use --no-transcribe (Python CLI) to skip Whisper when you only need emotion.
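
The arithmetic behind these figures can be reproduced with a small estimator. This is a back-of-the-envelope sketch, not part of the tool: `estimate_runtime` is a hypothetical helper and the per-chunk cost is an input you supply.

```python
import math


def estimate_runtime(duration_hours: float, chunk_seconds: float = 8.0,
                     seconds_per_chunk: float = 0.3):
    """Return (chunk_count, estimated_minutes) for a recording.

    chunk_count is the number of fixed-length chunks needed to cover the
    recording; estimated_minutes multiplies that by the per-chunk cost.
    """
    chunks = math.ceil(duration_hours * 3600 / chunk_seconds)
    return chunks, chunks * seconds_per_chunk / 60


if __name__ == "__main__":
    chunks, minutes = estimate_runtime(4.39)
    print(chunks, round(minutes, 1))

    # Per-segment cost implied by the measured run: 12.0 min over 1,978 segments.
    measured = 12.0 * 60 / 1978
    print(round(measured, 3))
```

The measured run works out to about 0.36 s per segment end to end, slightly above the ~0.3 s inference figure, so the measured value is the safer one to plug in when budgeting for a long file.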

The neutral/emotion split is governed by a tunable deadzone (--deadzone, default 0.10); widen it to label more chunks neutral, narrow it to surface more emotion.

Cache files (not committed)

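These are auto-generated and never committed: .venv/ is excluded from git, and the two model caches live in your home directory, outside the repository: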

  • .venv/ — Python virtual environment
  • ~/.cache/whisper/ — Whisper model weights cache
  • ~/.cache/huggingface/ — audeering emotion-model weights cache

Troubleshooting

zig: command not found

  • Download and install Zig from ziglang.org. Ensure zig is on your PATH.

ffmpeg: command not found

  • Install FFmpeg: brew install ffmpeg (macOS), apt install ffmpeg (Ubuntu), or see ffmpeg.org.

Sentiment analysis is slow on first run

  • This is normal: the emotion model (~661 MB) and Whisper tiny weights (~75 MB) download on first use, then cache.
  • Both models run per chunk on CPU/MPS, so long files take a while. The raw-JSON cache means you only pay it once per input.
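
The pay-once behavior of the raw-JSON cache follows a common pattern: reuse the file if it exists, otherwise run inference and write it. The sketch below illustrates that pattern only; `load_or_infer` and the `infer` callback are hypothetical names, not the project's actual functions.

```python
import json
import os
import tempfile


def load_or_infer(raw_path: str, infer):
    """Return cached results from raw_path, or run infer() and cache them.

    `infer` is any zero-argument callable returning JSON-serializable data.
    Deleting raw_path forces a fresh run on the next call.
    """
    if os.path.exists(raw_path):
        with open(raw_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = infer()
    with open(raw_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.raw_sentiment.json")
        calls = []

        def fake_infer():
            # Stand-in for the expensive per-chunk model pass.
            calls.append(1)
            return [{"emotion": "neutral"}]

        load_or_infer(path, fake_infer)   # runs inference, writes the cache
        load_or_infer(path, fake_infer)   # served from the cache
        print(len(calls))
```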

Tests fail

  • Run zig build test for the Zig suite and .venv/bin/python -m pytest sentiment/tests/test_cli.py for the Python unit suite.

Development

  • Code layout: Zig in src/, Python sentiment in sentiment/
  • Run tests: zig build test or zig test src/<module>/<module>_test.zig
  • Run sentiment CLI directly: .venv/bin/python -m interview_sentiment.cli --input <file> --output <file>

About

Turn a recording into a read-back: trimmed audio, a local transcript, the tone and energy across the conversation, and the moments worth a second listen. Runs locally; the audio never leaves your machine.
