Trims silence from an interview audio file with FFmpeg, then does a fast triage pass — transcript, sentiment, and energy per chunk — and writes a JSON report.
Prerequisites: Zig 0.13+, FFmpeg, Python 3.8+
git clone <repo>
cd readback
zig build
mkdir -p input output
cp your_interview.mp3 input/your_interview.mp3
zig-out/bin/readback input/your_interview.mp3Output:
output/trimmed.mp3— silence removedoutput/report.json— analysis reportoutput/your_interview.mp3_sentiment.json— per-chunk transcript + sentiment (with sentiment setup; see below)
- Zig 0.13+: Download from ziglang.org
- FFmpeg: Install via
brew install ffmpeg(macOS),apt install ffmpeg(Ubuntu), or ffmpeg.org - Python 3.8+ (optional, only if using sentiment analysis)
zig build test # Run full test suite
zig build # Build binaryBinary location: zig-out/bin/readback
# Trim silence and generate report (no sentiment analysis)
zig-out/bin/readback input/your_interview.mp3
# With custom output paths
zig-out/bin/readback input/your_interview.mp3 output/trimmed.mp3 output/report.jsonEmotion is read from the audio (prosody), not the words — so unusual tone on ordinary speech is caught. "How are you?" said hostilely is flagged; the same words said calmly are not.
For each 8-second chunk:
- emotion —
neutral/angry/fearful/happy/sad/calm, derived from the three dimensions below with a neutral deadzone (only clearly off-neutral chunks get a non-neutral label, which guards against misclassification) - valence / arousal / dominance — the raw audio-model outputs (each ~0–1), kept for auditing. Hostility = high arousal + high dominance + low valence; dominance is what separates "angry" from "fearful"
- text — speech transcript (OpenAI Whisper,
tinymodel)
The emotion model is audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim,
fine-tuned on MSP-Podcast (real conversational speech) — unlike acted-emotion
models, it generalizes to natural audio. Valence is the model's weakest dimension
(a known limitation of speech emotion recognition), so treat it as the coarsest of
the three. Whisper-tiny on fixed 8-second cuts splits some utterances: transcripts
are triage signposts, not clip- or show-note-grade.
./scripts/setup_sentiment.shThis creates .venv/ with all Python dependencies (torch, transformers,
openai-whisper, soundfile, soxr). The emotion model (~661 MB) and Whisper
tiny weights (~75 MB) download on first run.
zig build
zig-out/bin/readback input/your_interview.mp3Outputs:
output/your_interview.mp3_sentiment.json— normalized per-chunk transcript + sentimentoutput/your_interview.mp3.raw_sentiment.json— raw Python output (intermediate, also a cache)
Audio is decoded in a single sequential pass, one 8-second chunk in memory at a time, so a four-hour file never loads into memory: decoding a 4.4-hour MP3 is ~5 s of I/O at a flat ~0.2 GB, and total process memory is set by the models (~1 GB), independent of file length. If the raw sentiment JSON already exists for an input, inference is skipped — delete it to force a fresh run.
Steady-state runtime: both the emotion model and Whisper run on every chunk
(~0.3 s of inference per 8 s chunk on an M-series Mac — emotion on MPS, Whisper
on CPU), so a four-hour file (~2,000 chunks) completes in roughly 12 minutes
(measured: a 4.39 h MP3 → 1,978 segments in 12.0 min, 0.96 GB peak RSS).
That cost is paid once: the raw-JSON cache is what stands between a re-run and
re-inferring every chunk again, so keep output/<name>.raw_sentiment.json
unless you intend to recompute. Use --no-transcribe (Python CLI) to skip Whisper
when you only need emotion.
The neutral/emotion split is governed by a tunable deadzone (--deadzone, default
0.10); widen it to label more chunks neutral, narrow it to surface more emotion.
These are auto-generated and excluded from git:
.venv/— Python virtual environment~/.cache/whisper/— Whisper model weights cache~/.cache/huggingface/— audeering emotion-model weights cache
zig: command not found
- Download and install Zig from ziglang.org. Ensure
zigis on your PATH.
ffmpeg: command not found
- Install FFmpeg:
brew install ffmpeg(macOS),apt install ffmpeg(Ubuntu), or see ffmpeg.org.
Sentiment analysis is slow on first run
- This is normal: the emotion model (~661 MB) and Whisper
tinyweights (~75 MB) download on first use, then cache. - Both models run per chunk on CPU/MPS, so long files take a while. The raw-JSON cache means you only pay it once per input.
Tests fail
- Run
zig build testfor the Zig suite and.venv/bin/python -m pytest sentiment/tests/test_cli.pyfor the Python unit suite.
- Code layout: Zig in
src/, Python sentiment insentiment/ - Run tests:
zig build testorzig test src/<module>/<module>_test.zig - Run sentiment CLI directly:
.venv/bin/python -m interview_sentiment.cli --input <file> --output <file>