Sebastian-Zok/StudioVoiceHelper

Studio Voice Helper

Python wrapper around the NVIDIA Maxine Studio Voice NIM that accepts audio or video, enhances the speech track, and writes the result back out.

Features

  • Accepts audio or video input
  • Extracts audio from video, resamples to 48 kHz PCM WAV (NIM requirement)
  • Silence-aware chunking keeps each piece under the NIM file-size limit
  • Sends chunks to the NIM via gRPC and stitches results in order
  • Remuxes enhanced audio back into the original video container
  • Supports NGC_API_KEY via env vars / .env
  • Cleans up temp files (unless --debug)
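The silence-aware chunking can be pictured as a greedy planner: given the byte offsets of detected silences, cut at the latest silence that keeps each chunk under the size limit, and fall back to a hard cut when no silence fits. A minimal sketch (plan_chunks is illustrative, not the code in app/chunking.py):

```python
def plan_chunks(total_bytes, silence_points, max_chunk_bytes):
    """Greedy split planner.

    silence_points: sorted byte offsets where the audio may be cut.
    Returns (start, end) pairs covering [0, total_bytes) so that every
    chunk stays under max_chunk_bytes; hard-cuts when no silence fits.
    """
    bounds = [0]
    while total_bytes - bounds[-1] > max_chunk_bytes:
        start = bounds[-1]
        limit = start + max_chunk_bytes
        # Prefer the latest silence point still inside the budget.
        candidates = [p for p in silence_points if start < p <= limit]
        bounds.append(candidates[-1] if candidates else limit)
    bounds.append(total_bytes)
    return list(zip(bounds, bounds[1:]))
```

The real implementation detects silences in the decoded audio first; this only shows the budgeting step.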

Easy Setup

Requirements: Docker Desktop with an NVIDIA GPU (and nvidia-container-toolkit on Linux, or WSL2 GPU support on Windows).

0. Create API Key

The model is free to use, but downloading it requires authentication. You therefore need to create a free API key here: https://org.ngc.nvidia.com/setup/api-keys

1. Configure your API key

Create a .env file and put the API key there.

cp .env.example .env
# Open .env and set NGC_API_KEY to your key from https://org.ngc.nvidia.com/setup/api-keys

2. Put your media file in ./data/

cp /path/to/your/video.mp4 data/

3. Start the NIM

The first run will download the model weights — this may take several minutes.

docker compose up -d nim

Check that it is ready (status should show healthy):

docker compose ps nim
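If you would rather script the wait than re-run docker compose ps, a small TCP probe against the gRPC port works. This is a sketch that assumes the NIM listens on 127.0.0.1:8001 (the compose default):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 300.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(2.0)
    return False
```

A plain TCP connect only proves the port is open, not that the model is loaded; the compose health check remains the authoritative signal.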

4. Improve your audio

To enhance the audio simply run:

docker compose run --rm helper --input /data/video.mp4 --output /data/output.mp4

Paths inside the container must use /data/ — the mounted volume.

5. Stop when done

Your enhanced audio file or video should now be in ./data/, so feel free to stop the containers:

docker compose down

Advanced Setup

Prerequisites

| Tool | Purpose |
| --- | --- |
| Python 3.10+ | Runtime |
| ffmpeg | Audio extraction, resampling, remuxing |
| NVIDIA Maxine Studio Voice NIM | gRPC inference server (GPU required) |

Quick Start

1. Clone & install

Linux / macOS

git clone <repo-url> && cd StudioVoiceHelper

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Windows (PowerShell)

git clone <repo-url>; cd StudioVoiceHelper

python -m venv .venv; .\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

2. Compile gRPC protos

python compile_protos.py

3. Configure

cp .env.example .env
# Edit .env and set NGC_API_KEY

4. Start the NIM container

See NVIDIA docs.

docker run -it --rm --name=studio-voice \
    --runtime=nvidia --gpus all --shm-size=8GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e STREAMING=false \
    -p 8001:8001 \
    nvcr.io/nim/nvidia/maxine-studio-voice:latest

5. Run

# Enhance audio inside a video
python -m app.cli --input input.mp4 --output output.mp4

# Enhance a standalone audio file
python -m app.cli --input noisy.wav --output clean.wav

# Verbose / debug mode (keeps temp files)
python -m app.cli --input in.mp4 --output out.mp4 --debug
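Whether the final step remuxes into a video container can be decided from the input extension. A hypothetical sketch of that dispatch (needs_remux and the extension set are assumptions, not the actual app/pipeline.py logic):

```python
from pathlib import Path

# Illustrative set of standalone-audio extensions.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}

def needs_remux(path: str) -> bool:
    """True when the input is a video container whose enhanced audio
    must be muxed back in; False for standalone audio files."""
    return Path(path).suffix.lower() not in AUDIO_EXTS
```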

CLI Options

| Flag | Description |
| --- | --- |
| -i, --input | Input audio or video file (required) |
| -o, --output | Output file path (required) |
| --target | NIM gRPC endpoint (default 127.0.0.1:8001) |
| --model-type | 48k-hq (default), 48k-ll, or 16k-hq |
| --debug | Keep temp files, verbose logging |
| -v, --verbose | Enable DEBUG-level logging |

Environment Variables

All optional; override via .env or shell.

| Variable | Default | Purpose |
| --- | --- | --- |
| NGC_API_KEY | (none) | NVIDIA NGC API key |
| NIM_TARGET | 127.0.0.1:8001 | NIM gRPC address |
| NIM_MODEL_TYPE | 48k-hq | Model variant |
| NIM_MAX_CHUNK_BYTES | 34000000 | Max WAV chunk size (bytes) |
| MIN_SILENCE_MS | 400 | Min silence for split detection |
| SILENCE_THRESH_DBFS | -40 | Silence dBFS threshold |
| DEBUG | false | Keep temp files |
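A loader for these variables might look like the following sketch. Settings and load_settings are illustrative names with the documented defaults filled in, not necessarily what app/config.py defines:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    ngc_api_key: Optional[str]
    nim_target: str
    nim_model_type: str
    nim_max_chunk_bytes: int
    min_silence_ms: int
    silence_thresh_dbfs: float
    debug: bool

def load_settings(env=None) -> Settings:
    """Read configuration from the environment, falling back to the
    documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        ngc_api_key=env.get("NGC_API_KEY"),
        nim_target=env.get("NIM_TARGET", "127.0.0.1:8001"),
        nim_model_type=env.get("NIM_MODEL_TYPE", "48k-hq"),
        nim_max_chunk_bytes=int(env.get("NIM_MAX_CHUNK_BYTES", "34000000")),
        min_silence_ms=int(env.get("MIN_SILENCE_MS", "400")),
        silence_thresh_dbfs=float(env.get("SILENCE_THRESH_DBFS", "-40")),
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```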

Docker

docker-compose.yml defines two services:

| Service | What it does |
| --- | --- |
| nim | Runs the NVIDIA Maxine Studio Voice NIM container (requires NVIDIA GPU) |
| helper | Runs this wrapper; waits for the NIM to be healthy before starting |

Prerequisites

  • NVIDIA GPU with nvidia-container-toolkit installed
  • On Windows: Docker Desktop with WSL2 GPU support enabled
  • NGC_API_KEY set in .env

Build

docker compose build helper

Run

# 1. Start (and keep running) the NIM in the background.
#    On first launch it downloads models — this can take several minutes.
docker compose up -d nim

# 2. Wait for it to become healthy (optional — helper depends_on handles this too)
docker compose ps nim

# 3. Run the helper against a file in ./data/
docker compose run --rm helper --input /data/input.mp4 --output /data/output.mp4

# 4. Stop the NIM when done
docker compose down

Note: Paths inside the container must start with /data/ (the mounted volume), not host-relative paths like .\data\file.mp4.
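If you are scripting the call, a small helper can translate a host file under ./data/ into its container path. Purely illustrative (to_container_path is not part of this repo):

```python
from pathlib import Path, PurePosixPath

def to_container_path(host_path: str) -> str:
    """Map a host file under ./data/ to its path inside the container,
    which always lives under /data/ regardless of the host layout."""
    return str(PurePosixPath("/data") / Path(host_path).name)
```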

Manual docker run (without compose)

Linux / macOS

docker run --rm \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_TARGET=host.docker.internal:8001 \
    -v "$(pwd)/data:/data" \
    studio-voice-helper \
    --input /data/input.mp4 --output /data/output.mp4

Windows (PowerShell)

docker run --rm `
    -e NGC_API_KEY=$env:NGC_API_KEY `
    -e NIM_TARGET=host.docker.internal:8001 `
    -v "${PWD}\data:/data" `
    studio-voice-helper `
    --input /data/input.mp4 --output /data/output.mp4

Tests

pip install pytest
pytest tests/ -v

Project Structure

app/
├── __init__.py
├── __main__.py        # python -m app entrypoint
├── cli.py             # Argument parsing, logging setup
├── config.py          # Env / .env config loader
├── media.py           # ffmpeg: extract, resample, remux
├── chunking.py        # Silence-aware splitting & stitching
├── nim_client.py      # gRPC client for Studio Voice NIM
├── pipeline.py        # Orchestrates the full flow
└── _generated/        # Proto-compiled gRPC stubs
protos/
└── studiovoice.proto  # NIM proto definition
tests/
├── test_config.py
├── test_media.py
├── test_chunking.py
└── test_pipeline.py
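The stitching half of chunking.py can be pictured with the stdlib wave module. This sketch assumes every chunk shares one format (48 kHz PCM, per the NIM requirement); stitch_wavs is an illustrative name, not the repo's API:

```python
import wave

def stitch_wavs(chunk_paths, out_path):
    """Concatenate enhanced WAV chunks, in order, into one output file.
    Assumes identical sample rate, width, and channel count across chunks."""
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(chunk_paths):
            with wave.open(path, "rb") as chunk:
                if i == 0:
                    # Take the format from the first chunk; the wave module
                    # recomputes the frame count when the file is closed.
                    out.setparams(chunk.getparams())
                out.writeframes(chunk.readframes(chunk.getnframes()))
```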

License

MIT

About

Helper script for the NVIDIA Maxine Studio Voice NIM. Transforms low-quality audio into studio quality!
