# Video Watching English Learn — Segmentation & Clips (CLI + Electron)
## Overview
- Two pipelines are provided to segment movie audio/video and cut clips:
  - Option A (fast): `inaSpeechSegmenter` + `ffmpeg` via a Python script.
  - Option B (robust): `pyannote.audio` + `whisperx` + `ffmpeg` in a Jupyter notebook.
- A simple Electron GUI is included to run Option A on Windows and produce clips without using the terminal.
## User Guide
- For a more detailed, step-by-step walkthrough, see `USER_GUIDE.md`.
## Docker
- Build the image: `docker compose build`
- Run Option A (CLI) inside the container: `docker compose run --rm cli bash -lc "python scripts/option_a_segment_and_cut.py --input samples/tts/tts_speech.mp4 --label female --ffmpeg $(which ffmpeg) --execute"`
  - Replace `--input` with your video path under the mounted workspace.
- Launch JupyterLab for Option B: `docker compose up jupyter`
  - Open `http://localhost:8888` in your browser and load `notebooks/pyannote_whisperx_pipeline.ipynb`.
- Note: model downloads may be large; consider mirrors or local caches.
## Quick Start (Electron GUI)
- Prereqs:
  - Windows 10/11
  - Python 3.10+ on PATH (for running the segmentation script)
  - Node.js 18+ (for running Electron)
- Install dependencies (optionally set the npm registry for speed):
  - `npm config set registry https://registry.npmmirror.com`
  - `npm install`
- Run the app in dev mode: `npm run start`
  - In the GUI, click “Detect Python & FFmpeg” (it auto-fills paths). Provide a video path (e.g., `samples\tts\tts_speech.mp4`), choose a label (`female`/`male`/`speech`), then click “Run Option A”.
- Package a Windows EXE: `npm run dist`
  - The installer/exe appears under `dist/`. The build bundles the GUI and uses your local `scripts/` and `ffmpeg/` folders.
## CLI Usage (Option A)
- Command: `python scripts/option_a_segment_and_cut.py --input <video.mp4> --label <female|male|speech> --ffmpeg <path-to-ffmpeg.exe> --execute`
- Outputs:
  - `outputs/option_a/segments_<label>.csv` — segment timestamps
  - `outputs/option_a/cut_<label>_clips.bat` — generated Windows cutting commands
  - `outputs/option_a/clips/clip_XXX.mp4` — cut clips (with `--execute`)
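The generated `.bat` file simply chains `ffmpeg` commands, so the same cuts can be scripted directly. A minimal Python sketch, assuming the CSV holds `start`/`end` columns in seconds (the column names here are an assumption, not confirmed from the script):

```python
import csv
from pathlib import Path

def build_cut_commands(csv_path, video, outdir, ffmpeg="ffmpeg"):
    """Build one ffmpeg stream-copy command per CSV segment row.

    Assumes 'start'/'end' columns holding seconds (column names are a guess).
    """
    outdir = Path(outdir)
    commands = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            start, end = float(row["start"]), float(row["end"])
            clip = outdir / f"clip_{i:03d}.mp4"
            # -ss before -i: fast input seek; -t: clip duration; -c copy: no re-encode
            commands.append([ffmpeg, "-y", "-ss", f"{start:.3f}",
                             "-i", str(video), "-t", f"{end - start:.3f}",
                             "-c", "copy", str(clip)])
    return commands
```

Running each command with `subprocess.run(cmd, check=True)` reproduces what `--execute` does; writing them to a file instead reproduces the `.bat` output.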
## Notebook (Option B)
- Path: `notebooks/pyannote_whisperx_pipeline.ipynb`
- Purpose: diarization + transcription + word alignment + clip-cutting command generation.
- Note: requires reliable model downloads from Hugging Face; use mirrors or pre-cached models in limited-network environments.
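In restricted networks, a Hugging Face mirror has to be configured before the hub libraries are imported, since the endpoint is read at import time. A minimal first-cell sketch (the mirror URL and cache path are examples, not project defaults):

```python
import os

# Must be set before importing huggingface_hub / pyannote / whisperx,
# because the endpoint is read when those modules load.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
# Optional: keep downloaded model files in a local, reusable cache directory.
os.environ.setdefault("HF_HOME", "./hf_cache")
```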
## Troubleshooting
- No speech detected:
  - Some content is classified as `music`/`noEnergy`. Try `--label female` / `--label male`, or relax the VAD thresholds.
  - Use a sample with clear spoken voice; `samples\tts\tts_speech.mp4` is provided.
- FFmpeg missing:
  - A portable FFmpeg is included under `ffmpeg/ffmpeg-8.0-essentials_build/bin/ffmpeg.exe`.
  - In Docker, ffmpeg is installed system-wide and available via `$(which ffmpeg)`.
- WhisperX/pyannote model download timeouts:
  - Set a mirror: `set HF_ENDPOINT=https://hf-mirror.com` (PowerShell: `$env:HF_ENDPOINT = 'https://hf-mirror.com'`).
  - Provide local model caches if internet access is constrained.
## Credits
- `inaSpeechSegmenter`, `ffmpeg`, `pyannote.audio`, `whisperx` — see their respective repositories for licensing and usage details.

---

This project provides two practical pipelines to segment a movie by speaker and cut clips for male speech:
- Option A — quick + simple using inaSpeechSegmenter + ffmpeg
- Option B — higher-quality using pyannote.audio + WhisperX + (optional) inaSpeechSegmenter + ffmpeg
Folder layout:
- `scripts/option_a_segment_and_cut.py` — ready-to-run Python script for Option A
- `notebooks/pyannote_whisperx_pipeline.ipynb` — step-by-step notebook for Option B
Option A (inaSpeechSegmenter + ffmpeg)

What it does:
- Extracts audio from your video
- Runs inaSpeechSegmenter to tag segments as `male`/`female`/`music`/`noise`
- Generates a CSV and a Windows `.bat` file with `ffmpeg` commands to cut male-only clips
- Can optionally execute the cuts automatically
Prerequisites:
- Windows: install ffmpeg and ensure `ffmpeg.exe` is in PATH, e.g. `winget install --id=Gyan.FFmpeg`.
- Python 3.8–3.13
- `pip install inaSpeechSegmenter`
Run:
`python scripts/option_a_segment_and_cut.py --input C:\path\to\movie.mp4 --copy --execute`
Useful flags:
- `--ffmpeg C:\path\to\ffmpeg.exe` — provide an explicit ffmpeg path if not in PATH
- `--label male|female|speech` — choose which label to cut (default: `male`)
- `--pad 0.05` — add padding (seconds) around each segment
- `--merge-gap 0.2` — merge adjacent segments closer than this gap
- `--min-dur 1.0` — drop segments shorter than this duration
- `--copy` — stream copy (fast, may have less-precise boundaries)
- `--reencode` — re-encode to H.264/AAC (precise boundaries, slower)
- `--execute` — actually run ffmpeg and produce `clips/*.mp4`
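The `--pad`, `--merge-gap`, and `--min-dur` flags describe a common post-processing chain for raw detector output. A sketch of that logic (illustrative, not the script's exact implementation):

```python
def refine_segments(segments, pad=0.05, merge_gap=0.2, min_dur=1.0):
    """Refine (start, end) segments, given in seconds and sorted by start.

    1) pad each segment, clamping start at 0
    2) merge neighbours whose gap is below merge_gap
    3) drop anything shorter than min_dur
    """
    padded = [(max(0.0, s - pad), e + pad) for s, e in segments]
    merged = []
    for s, e in padded:
        if merged and s - merged[-1][1] < merge_gap:
            # Close enough to the previous segment: extend it instead
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return [(s, e) for s, e in merged if e - s >= min_dur]
```

Padding before merging matters: two segments separated by slightly more than `merge_gap` can still fuse once each gains `pad` seconds on both sides.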
Outputs:
- `outputs/option_a/extracted_audio.wav` — audio extracted from the input
- `outputs/option_a/segments_male.csv` — segment timestamps and labels
- `outputs/option_a/cut_male_clips.bat` — Windows batch file to cut clips
- `outputs/option_a/clips/clip_001.mp4` (etc.) — if `--execute` is used
Option B (pyannote.audio + WhisperX)

What it does:
- Extracts audio (`ffmpeg`)
- Runs speaker diarization with `pyannote.audio` (“who spoke when”)
- Transcribes and aligns words with `whisperx` (word-level timestamps)
- Optionally maps speakers to perceived gender via `inaSpeechSegmenter`
- Outputs CSVs and a `.bat` file to cut male clips (speaker-level or frame-level)
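Assigning each aligned word to a diarized speaker reduces to an interval lookup. A minimal midpoint-containment sketch (simpler than whisperx's own assignment logic, shown only to illustrate the idea):

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each aligned word.

    words: list of (word, start, end) with times in seconds.
    turns: list of (speaker, start, end) diarized turns, sorted by start.
    Each word goes to the turn containing its midpoint, else 'UNKNOWN'.
    """
    out = []
    for word, ws, we in words:
        mid = (ws + we) / 2
        speaker = next((spk for spk, ts, te in turns if ts <= mid < te),
                       "UNKNOWN")
        out.append((word, ws, we, speaker))
    return out
```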
Prerequisites:
- ffmpeg installed and available
- Python 3.8–3.13
- `pip install pyannote.audio whisperx inaSpeechSegmenter`
- Hugging Face token (user access) for `pyannote/speaker-diarization-community-1`: https://huggingface.co/settings/tokens
- GPU optional (CUDA 12.8 recommended for speed); CPU works but is slower
Usage:
- Open the notebook: `notebooks/pyannote_whisperx_pipeline.ipynb`
- Set `INPUT_VIDEO`, `OUTDIR`, `FFMPEG_PATH` (if needed), and `HF_TOKEN` in the first cell
- Run the cells in order; outputs are saved under `outputs/option_b/`
Outputs:
- `outputs/option_b/speaker_turns.csv` — diarized speaker turns
- `outputs/option_b/words.csv` — word-level timestamps
- `outputs/option_b/words_with_speakers.csv` — words assigned to speakers
- `outputs/option_b/gender_segments.csv` — optional gender segments (via inaSpeechSegmenter)
- `outputs/option_b/speaker_gender.csv` — speaker→gender mapping (heuristic)
- `outputs/option_b/male_cuts.csv` — final cut list
- `outputs/option_b/cut_male_clips.bat` — batch commands to cut clips
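The speaker→gender heuristic behind `speaker_gender.csv` can be as simple as duration-weighted voting between diarized turns and inaSpeechSegmenter's gender segments. A sketch under that assumption (not necessarily the notebook's exact code):

```python
from collections import defaultdict

def map_speaker_gender(turns, gender_segments):
    """Vote each speaker's gender by total overlapping duration.

    turns: list of (speaker, start, end) diarized turns.
    gender_segments: list of (label, start, end) with label 'male'/'female'.
    Returns {speaker: label} for speakers with any overlap.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for spk, ts, te in turns:
        for label, gs, ge in gender_segments:
            overlap = min(te, ge) - max(ts, gs)
            if overlap > 0:
                votes[spk][label] += overlap
    # Pick the label with the largest accumulated overlap per speaker
    return {spk: max(v, key=v.get) for spk, v in votes.items()}
```

Duration weighting makes the mapping robust to short misclassified stretches, though it still inherits the caveats about perceived-gender detection noted below.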
Notes:
- Gender detection from voice infers perceived gender from audio traits; it can be inaccurate and biased across languages, cultures, and recording conditions. Expect mistakes and review the outputs.
- Overlapping speech: diarization handles overlap better than basic segmentation; decide whether to include or exclude mixed speech in your cuts.
- Copyright: cutting and redistributing movie clips may violate copyright, and rules for personal study vary by jurisdiction. Ensure your use complies with the applicable license and local law.
- Performance: long movies take time to process. Consider chunking the input or using a machine with a decent CPU/GPU.
References:
- inaSpeechSegmenter — CNN-based segmentation with speech/music/noise and speaker gender labels: https://github.com/ina-foss/inaSpeechSegmenter
- pyannote.audio — state-of-the-art speaker diarization toolkit: https://github.com/pyannote/pyannote-audio
- WhisperX — Whisper + word-level alignment + diarization integration: https://github.com/m-bain/whisperX
- OpenVINO diarization tutorial — pipeline concepts and tips: https://docs.openvino.ai/2023.3/notebooks/212-pyannote-speaker-diarization-with-output.html
- SpeechBrain — speech toolkit (embeddings/classifiers optional): https://github.com/speechbrain/speechbrain
Start with Option A to validate your workflow quickly. If you need better handling of overlapping speakers or a transcript with word-level timestamps, move to Option B and map speaker IDs to male/female using inaSpeechSegmenter or a simple classifier.