MexLinker/video_en_learn

Video Watching English Learn — Segmentation & Clips (CLI + Electron)

Overview

  • Two pipelines are provided to segment movie audio/video and cut clips:
    • Option A (fast): inaSpeechSegmenter + ffmpeg via Python script.
    • Option B (robust): pyannote.audio + whisperx + ffmpeg in a Jupyter notebook.
  • A simple Electron GUI is included to run Option A on Windows and produce clips without using the terminal.

User Guide

  • For a more vivid, step-by-step walkthrough, see USER_GUIDE.md.

Docker

  • Build the image:
    • docker compose build
  • Run Option A (CLI) inside the container:
    • docker compose run --rm cli bash -lc "python scripts/option_a_segment_and_cut.py --input samples/tts/tts_speech.mp4 --label female --ffmpeg $(which ffmpeg) --execute"
    • Replace --input with your video path under the mounted workspace.
  • Launch JupyterLab for Option B:
    • docker compose up jupyter
    • Open http://localhost:8888 in your browser and load notebooks/pyannote_whisperx_pipeline.ipynb.
    • Note: model downloads may be large; consider mirrors or local caches.
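
A minimal docker-compose.yml consistent with the commands above might look like the following sketch. The service names cli and jupyter come from the commands; the build context, mounted volume, and exact jupyter command are assumptions, so check the repository's actual compose file:

```yaml
services:
  cli:
    build: .                # assumed: image built from a Dockerfile in the repo root
    volumes:
      - .:/workspace        # mount the repo so --input paths resolve inside the container
    working_dir: /workspace
  jupyter:
    build: .
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root
    ports:
      - "8888:8888"         # matches http://localhost:8888 above
    volumes:
      - .:/workspace
    working_dir: /workspace
```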

Quick Start (Electron GUI)

  • Prereqs:
    • Windows 10/11
    • Python 3.10+ on PATH (for running the segmentation script)
    • Node.js 18+ (for running Electron)
  • Install dependencies:
    • Optional: use an npm registry mirror for faster installs: npm config set registry https://registry.npmmirror.com
    • npm install
  • Run the app in dev mode:
    • npm run start
    • In the GUI, click “Detect Python & FFmpeg” (it auto-fills paths). Provide a video path (e.g., samples\tts\tts_speech.mp4). Choose a label (female/male/speech), then click “Run Option A”.
  • Package a Windows EXE:
    • npm run dist
    • Output installer/exe appears under dist/. The build bundles the GUI and uses your local scripts/ and ffmpeg/ folders.

CLI Usage (Option A)

  • Command:
    • python scripts/option_a_segment_and_cut.py --input <video.mp4> --label <female|male|speech> --ffmpeg <path-to-ffmpeg.exe> --execute
  • Outputs:
    • outputs/option_a/segments_<label>.csv — segment timestamps
    • outputs/option_a/cut_<label>_clips.bat — generated Windows cutting commands
    • outputs/option_a/clips/clip_XXX.mp4 — cut clips (with --execute)
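
The segments CSV lends itself to simple post-processing. As an illustrative sketch (assuming columns named label, start, end, which may differ from the script's actual header), here is how rows could be turned into ffmpeg stream-copy commands:

```python
import csv
import io

def ffmpeg_cmds(csv_text, src="movie.mp4", ffmpeg="ffmpeg"):
    """Turn (label,start,end) rows into ffmpeg stream-copy cut commands."""
    cmds = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        cmds.append(
            f'{ffmpeg} -ss {float(row["start"]):.2f} -to {float(row["end"]):.2f} '
            f'-i "{src}" -c copy "clips/clip_{i:03d}.mp4"'
        )
    return cmds

rows = "label,start,end\nfemale,1.2,4.85\nfemale,7.1,9.4\n"
for cmd in ffmpeg_cmds(rows):
    print(cmd)
```

Stream copy (-c copy) keeps cutting fast at the cost of keyframe-aligned, slightly imprecise boundaries.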

Notebook (Option B)

  • Path: notebooks/pyannote_whisperx_pipeline.ipynb
  • Purpose: diarization + transcription + word alignment + clip cutting command generation.
  • Note: Requires reliable model downloads from Hugging Face; use mirrors or pre-cached models in limited-network environments.

Troubleshooting

  • No speech detected:
    • Some content is classified as music or noEnergy. Try --label female or --label male, or relax the VAD thresholds.
    • Use a sample with clear spoken voice; samples\tts\tts_speech.mp4 is provided.
  • FFmpeg missing:
    • A portable FFmpeg is included under ffmpeg/ffmpeg-8.0-essentials_build/bin/ffmpeg.exe.
    • In Docker, ffmpeg is installed system-wide and available via $(which ffmpeg).
  • WhisperX/pyannote model download timeouts:
    • Set a mirror endpoint: set HF_ENDPOINT=https://hf-mirror.com (cmd) or $env:HF_ENDPOINT = 'https://hf-mirror.com' (PowerShell).
    • Provide local model caches if internet is constrained.

Credits

  • inaSpeechSegmenter, ffmpeg, pyannote.audio, whisperx — see their respective repositories for licensing and usage details.

==========================================================

This project provides two practical pipelines to segment a movie by speaker and cut clips for male speech:

  • Option A — quick + simple using inaSpeechSegmenter + ffmpeg
  • Option B — higher-quality using pyannote.audio + WhisperX + (optional) inaSpeechSegmenter + ffmpeg

Folder layout:

  • scripts/option_a_segment_and_cut.py — ready-to-run Python script for Option A
  • notebooks/pyannote_whisperx_pipeline.ipynb — step-by-step notebook for Option B

Option A — Quick + Simple (Recommended to start)

What it does:

  • Extracts audio from your video
  • Runs inaSpeechSegmenter to tag segments as male/female/music/noise
  • Generates a CSV and a Windows .bat file with ffmpeg commands to cut male-only clips
  • Can optionally execute the cuts automatically

Prerequisites:

  • Windows: install ffmpeg and ensure ffmpeg.exe is in PATH. For example: winget install --id=Gyan.FFmpeg.
  • Python 3.8–3.13
  • pip install inaSpeechSegmenter

Run:

python scripts/option_a_segment_and_cut.py --input C:\path\to\movie.mp4 --copy --execute

Useful flags:

  • --ffmpeg C:\path\to\ffmpeg.exe — provide explicit ffmpeg path if not in PATH
  • --label male|female|speech — choose which label to cut (default: male)
  • --pad 0.05 — add padding (seconds) around each segment
  • --merge-gap 0.2 — merge adjacent segments closer than this gap
  • --min-dur 1.0 — drop segments shorter than this duration
  • --copy — stream copy (fast, may have less-precise boundaries)
  • --reencode — re-encode to H.264/AAC (precise boundaries, slower)
  • --execute — actually run ffmpeg and produce clips/*.mp4
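
The --pad, --merge-gap, and --min-dur flags correspond to a standard interval clean-up pass. A sketch of the likely logic (not the script's actual code):

```python
def clean_segments(segs, pad=0.05, merge_gap=0.2, min_dur=1.0):
    """Pad, merge nearby, and drop short (start, end) segments, in seconds."""
    # Pad each segment, clamping the start at zero.
    padded = [(max(0.0, s - pad), e + pad) for s, e in sorted(segs)]
    merged = []
    for s, e in padded:
        if merged and s - merged[-1][1] <= merge_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    # Drop anything still shorter than min_dur.
    return [(s, e) for s, e in merged if e - s >= min_dur]
```

With the defaults, clean_segments([(0.0, 0.4), (10.0, 12.0), (12.1, 13.0)]) merges the last two segments and drops the sub-second first one.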

Outputs:

  • outputs/option_a/extracted_audio.wav — audio extracted from the input
  • outputs/option_a/segments_male.csv — segment timestamps and labels
  • outputs/option_a/cut_male_clips.bat — Windows batch file to cut clips
  • outputs/option_a/clips/clip_001.mp4 (etc.) — if --execute is used

Option B — Higher Quality (Diarization + Transcript + Alignment)

What it does:

  • Extracts audio (ffmpeg)
  • Runs speaker diarization with pyannote.audio (“who spoke when”)
  • Transcribes and aligns words with whisperx (word-level timestamps)
  • Optionally maps speakers to perceived gender via inaSpeechSegmenter
  • Outputs CSVs and a .bat file to cut male clips (speaker-level or frame-level)
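
The speaker-to-gender mapping can be implemented as an overlap vote between diarized turns and inaSpeechSegmenter labels; the notebook's actual heuristic may differ, but a minimal sketch looks like this:

```python
from collections import defaultdict

def map_speaker_gender(turns, gender_segs):
    """Assign each speaker the gender label with the most overlapped time.

    turns: list of (speaker, start, end) from diarization
    gender_segs: list of (label, start, end) from inaSpeechSegmenter
    """
    votes = defaultdict(lambda: defaultdict(float))
    for spk, ts, te in turns:
        for lab, gs, ge in gender_segs:
            overlap = min(te, ge) - max(ts, gs)
            if overlap > 0:
                votes[spk][lab] += overlap
    # Majority (by overlapped duration) wins per speaker.
    return {spk: max(labs, key=labs.get) for spk, labs in votes.items()}
```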

Prerequisites:

  • ffmpeg installed and available
  • Python 3.8–3.13
  • pip install pyannote.audio whisperx inaSpeechSegmenter
  • Hugging Face token (user access) for pyannote/speaker-diarization-community-1: https://huggingface.co/settings/tokens
  • GPU optional (CUDA 12.8 recommended for speed); CPU works but is slower

Usage:

  1. Open the notebook: notebooks/pyannote_whisperx_pipeline.ipynb
  2. Set INPUT_VIDEO, OUTDIR, FFMPEG_PATH (if needed), and HF_TOKEN in the first cell
  3. Run cells in order; outputs are saved under outputs/option_b/

Outputs:

  • outputs/option_b/speaker_turns.csv — diarized speaker turns
  • outputs/option_b/words.csv — word-level timestamps
  • outputs/option_b/words_with_speakers.csv — words assigned to speakers
  • outputs/option_b/gender_segments.csv — optional gender segments (via inaSpeech)
  • outputs/option_b/speaker_gender.csv — speaker→gender mapping (heuristic)
  • outputs/option_b/male_cuts.csv — final cut list
  • outputs/option_b/cut_male_clips.bat — batch commands to cut clips
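
words_with_speakers.csv is conceptually a join of word timestamps onto speaker turns (whisperx also ships an assign_word_speakers helper for this). A hedged sketch that assigns each word to the turn containing its midpoint:

```python
def assign_words(words, turns):
    """Attach a speaker to each word by the turn containing its midpoint.

    words: list of (word, start, end); turns: list of (speaker, start, end).
    Words falling outside every turn get speaker None.
    """
    out = []
    for w, ws, we in words:
        mid = (ws + we) / 2
        spk = next((s for s, ts, te in turns if ts <= mid < te), None)
        out.append((w, ws, we, spk))
    return out
```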

Accuracy, Bias, and Legal Notes

  • Gender detection from voice infers perceived gender from audio traits and can be inaccurate and biased across languages/cultures/conditions. Expect mistakes; review outputs.
  • Overlapping speech: diarization can handle overlap better than basic segmentation; decide whether to include/exclude mixed speech in your cuts.
  • Copyright: cutting and redistributing movie clips may infringe copyright; exceptions for personal study vary by jurisdiction. Ensure your use complies with the applicable license and local law.
  • Performance: long movies take time to process. Consider chunking or a machine with a decent CPU/GPU.

Quick Recommendation

Start with Option A to validate your workflow quickly. If you need better handling of overlapping speakers or a transcript with word-level timestamps, move to Option B and map speaker IDs to male/female using inaSpeechSegmenter or a simple classifier.
