A local web app for turning long-form interview videos into short clips — with AI transcription, speaker diarization, interactive editing, and CSV export.
| Layer | Tech |
|---|---|
| Frontend | React + TypeScript + Vite + Tailwind |
| Backend | FastAPI (Python) |
| Transcription | faster-whisper |
| Speaker ID | pyannote.audio 3.1 |
| Clip cutting | ffmpeg |
| Descriptions | OpenAI GPT-4o |
| YouTube download | yt-dlp |
- Python 3.10+ and pip
- Node.js 18+ and npm
- ffmpeg installed (`brew install ffmpeg` on macOS)
- yt-dlp installed (`brew install yt-dlp` or `pip install yt-dlp`)
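To confirm the CLI prerequisites are on your `PATH` before starting, here is a minimal standard-library sketch; the tool names mirror the list above, and the helper name `missing_tools` is hypothetical, not part of this project:

```python
import shutil

# CLI tools from the prerequisites list above.
REQUIRED_TOOLS = ["ffmpeg", "yt-dlp", "python3", "node", "npm"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```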
- Clone and configure

  ```bash
  cp .env.example .env   # Edit .env with your API keys
  ```

- API keys in `.env`
  - `OPENAI_API_KEY` — for GPT-4o clip descriptions (optional but recommended)
  - `HF_TOKEN` — for speaker diarization via pyannote.audio (optional)
    - Accept the model terms at https://huggingface.co/pyannote/speaker-diarization-3.1
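For reference, a `.env` might look like the sketch below. The variable names come from this README; the values are placeholders, not real keys:

```
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
WHISPER_MODEL=large-v3
```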
- Start everything

  ```bash
  chmod +x start.sh
  ./start.sh
  ```

  Then open http://localhost:5173
- Upload — Paste a YouTube URL or choose a local video file. Enter interviewer and interviewee names.
- Process — Whisper transcribes the audio; pyannote.audio identifies speakers. Takes a few minutes.
- Transcript — Click a segment to start a selection, click another to end it. Hit Create Clip.
- Clips — For each clip:
  - Drag the green/red markers on the timeline to fine-tune start/end times
  - Enter thumbnail text
  - Generate a description with GPT-4o (or write your own)
  - Click Cut clip to render the video segment
- Export — Preview and download `clips.csv` with columns: `Artist | Start | End | Thumbnail | Title | Description | Full link`
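Cutting a clip boils down to an ffmpeg invocation. The sketch below is illustrative, not the app's actual code: the helper `build_cut_command` is hypothetical, and `-c copy` assumes the clip is cut without re-encoding (a stream copy is fast but snaps to keyframes):

```python
import subprocess

def build_cut_command(src, start, end, dest):
    """Build an ffmpeg argv that extracts [start, end] of `src` into `dest`.

    `start`/`end` are "HH:MM:SS(.ms)" strings; `-c copy` copies streams
    without re-encoding, and `-y` overwrites `dest` if it exists.
    """
    return ["ffmpeg", "-y", "-ss", start, "-to", end, "-i", src, "-c", "copy", dest]

def cut_clip(src, start, end, dest):
    """Run ffmpeg and raise if the cut fails."""
    subprocess.run(build_cut_command(src, start, end, dest), check=True)
```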
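The export format can be reproduced with the standard library alone. `write_clips_csv` is a hypothetical helper, but the column names match the export step above:

```python
import csv

# Column order from the Export step.
COLUMNS = ["Artist", "Start", "End", "Thumbnail", "Title", "Description", "Full link"]

def write_clips_csv(path, rows):
    """Write `rows` (dicts keyed by COLUMNS) to a clips.csv at `path`."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```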
- All data is stored in memory and on disk under `backend/jobs/`. Nothing leaves your machine except API calls to OpenAI/HuggingFace.
- Speaker diarization can be skipped (no HF token needed) — speakers will show as `SPEAKER_00`, `SPEAKER_01`, etc.
- Use `WHISPER_MODEL=base` for faster (less accurate) transcription, `large-v3` for best results.