Python wrapper around the NVIDIA Maxine Studio Voice NIM that accepts audio or video, enhances the speech track, and writes the result back out.
- Accepts audio or video input
- Extracts audio from video, resamples to 48 kHz PCM WAV (NIM requirement)
- Silence-aware chunking keeps each piece under the NIM file-size limit
- Sends chunks to the NIM via gRPC and stitches results in order
- Remuxes enhanced audio back into the original video container
- Supports `NGC_API_KEY` via env vars / `.env`
- Cleans up temp files (unless `--debug`)
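
The silence-aware chunking listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `chunking.py`: the function names `max_chunk_ms` and `plan_cuts` are hypothetical, and the byte arithmetic assumes 48 kHz, 16-bit, mono PCM with a canonical 44-byte WAV header. The idea is to cut at the latest detected silence that still keeps the chunk under the size limit, and to fall back to a hard cut only when no silence is available.

```python
from bisect import bisect_right

# Assumption: 48 kHz, 16-bit, mono PCM, so one millisecond of audio is 96 bytes.
BYTES_PER_MS = 48_000 * 2 // 1000
WAV_HEADER_BYTES = 44  # canonical PCM WAV header size (assumed)


def max_chunk_ms(max_chunk_bytes: int) -> int:
    """Longest chunk duration (ms) whose WAV file stays within the byte limit."""
    return (max_chunk_bytes - WAV_HEADER_BYTES) // BYTES_PER_MS


def plan_cuts(total_ms: int, silences_ms: list[int], limit_ms: int) -> list[int]:
    """Return cut positions (ms) so that no chunk is longer than limit_ms.

    silences_ms holds the sorted midpoints of detected silences. Each cut uses
    the latest silence that still fits; if none fits, a hard cut is made.
    """
    cuts = []
    start = 0
    while total_ms - start > limit_ms:
        hard_limit = start + limit_ms
        i = bisect_right(silences_ms, hard_limit)
        # Latest silence at or before the hard limit, if it lies past the chunk start.
        if i > 0 and silences_ms[i - 1] > start:
            cut = silences_ms[i - 1]
        else:
            cut = hard_limit
        cuts.append(cut)
        start = cut
    return cuts
```

With the default limit of 34000000 bytes, and under the format assumptions above, each chunk could hold a little under six minutes of audio.
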
Requirements: Docker Desktop with an NVIDIA GPU (and nvidia-container-toolkit on Linux, or WSL2 GPU support on Windows).
The AI model is free to use, but its download is restricted, so you need to create a free API key here: https://org.ngc.nvidia.com/setup/api-keys
Then create a .env file and put the API key in it:

    cp .env.example .env
    # Open .env and set NGC_API_KEY to your key from https://org.ngc.nvidia.com/setup/api-keys

Copy the file you want to enhance into the data folder:

    cp /path/to/your/video.mp4 data/

The first run will download the model weights, which may take several minutes.

    docker compose up -d nim

Check that it is ready (status should show healthy):

    docker compose ps nim

To enhance the audio, run:

    docker compose run --rm helper --input /data/video.mp4 --output /data/output.mp4

Paths inside the container must use /data/, the mounted volume.
Your enhanced audio file or video should now be in the data folder, so you can stop the containers:

    docker compose down

| Tool | Purpose |
|---|---|
| Python 3.10+ | Runtime |
| ffmpeg | Audio extraction, resampling, remuxing |
| NVIDIA Maxine Studio Voice NIM | gRPC inference server (GPU required) |
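
To make the ffmpeg role concrete, here is a hedged sketch of the two commands such a wrapper might build: one to extract and resample the audio track, and one to remux the enhanced audio into the original video. The function names are hypothetical and this is not the project's `media.py`; the mono downmix (`-ac 1`) and 16-bit sample format are assumptions, while the 48 kHz rate comes from the feature list.

```python
def extract_audio_cmd(src: str, dst_wav: str) -> list[str]:
    """Build an ffmpeg command that writes the audio of src as 48 kHz PCM WAV."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                 # drop any video stream
        "-ac", "1",            # assumption: downmix to mono
        "-ar", "48000",        # resample to 48 kHz
        "-c:a", "pcm_s16le",   # assumption: 16-bit little-endian PCM
        dst_wav,
    ]


def remux_cmd(video: str, enhanced_wav: str, dst: str) -> list[str]:
    """Build an ffmpeg command that pairs the original video with new audio."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", enhanced_wav,
        "-map", "0:v:0",       # video from the first input
        "-map", "1:a:0",       # audio from the second input
        "-c:v", "copy",        # keep the video stream untouched
        "-shortest",
        dst,
    ]


# Usage (requires ffmpeg on PATH):
#   import subprocess
#   subprocess.run(extract_audio_cmd("input.mp4", "audio.wav"), check=True)
```

Because no audio codec is given in the remux command, ffmpeg would pick its default encoder for the output container.
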

Linux / macOS:

    git clone <repo-url> && cd StudioVoiceHelper
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt

Windows (PowerShell):

    git clone <repo-url>; cd StudioVoiceHelper
    python -m venv .venv; .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt

Compile the gRPC stubs:

    python compile_protos.py

Configure the API key:

    cp .env.example .env
    # Edit .env and set NGC_API_KEY

See NVIDIA docs.

    docker run -it --rm --name=studio-voice \
      --runtime=nvidia --gpus all --shm-size=8GB \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e STREAMING=false \
      -p 8001:8001 \
      nvcr.io/nim/nvidia/maxine-studio-voice:latest

Then run the CLI:

    # Enhance audio inside a video
    python -m app.cli --input input.mp4 --output output.mp4
    # Enhance a standalone audio file
    python -m app.cli --input noisy.wav --output clean.wav
    # Verbose / debug mode (keeps temp files)
    python -m app.cli --input in.mp4 --output out.mp4 --debug

| Flag | Description |
|---|---|
| `-i, --input` | Input audio or video file (required) |
| `-o, --output` | Output file path (required) |
| `--target` | NIM gRPC endpoint (default `127.0.0.1:8001`) |
| `--model-type` | `48k-hq` (default), `48k-ll`, or `16k-hq` |
| `--debug` | Keep temp files, verbose logging |
| `-v, --verbose` | Enable DEBUG-level logging |

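
For illustration, the flags in the table above map naturally onto an `argparse` parser. The sketch below is an assumption about how such a parser could look, not a copy of the project's `cli.py`; `build_parser` is a hypothetical name, and the help strings are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="app.cli",
        description="Enhance the speech track of an audio or video file.",
    )
    parser.add_argument("-i", "--input", required=True,
                        help="Input audio or video file")
    parser.add_argument("-o", "--output", required=True,
                        help="Output file path")
    parser.add_argument("--target", default="127.0.0.1:8001",
                        help="NIM gRPC endpoint")
    parser.add_argument("--model-type", default="48k-hq",
                        choices=["48k-hq", "48k-ll", "16k-hq"],
                        help="Model variant")
    parser.add_argument("--debug", action="store_true",
                        help="Keep temp files, verbose logging")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable DEBUG-level logging")
    return parser
```

Calling `build_parser().parse_args(["-i", "in.mp4", "-o", "out.mp4"])` would yield the documented defaults for `--target` and `--model-type`.
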
All optional; override via .env or shell.
| Variable | Default | Purpose |
|---|---|---|
| `NGC_API_KEY` | (none) | NVIDIA NGC API key |
| `NIM_TARGET` | `127.0.0.1:8001` | NIM gRPC address |
| `NIM_MODEL_TYPE` | `48k-hq` | Model variant |
| `NIM_MAX_CHUNK_BYTES` | `34000000` | Max WAV chunk size (bytes) |
| `MIN_SILENCE_MS` | `400` | Min silence for split detection (ms) |
| `SILENCE_THRESH_DBFS` | `-40` | Silence threshold (dBFS) |
| `DEBUG` | `false` | Keep temp files |

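
As a rough sketch of how these variables and their defaults could be read in Python, consider the loader below. It is not the project's `config.py`: the `Settings` class and `load_settings` function are hypothetical names, and the set of strings treated as true for `DEBUG` is an assumption. Loading a `.env` file into the environment (for example with python-dotenv) is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ngc_api_key: Optional[str]
    nim_target: str
    nim_model_type: str
    nim_max_chunk_bytes: int
    min_silence_ms: int
    silence_thresh_dbfs: int
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, falling back to the documented defaults."""
    return Settings(
        ngc_api_key=env.get("NGC_API_KEY"),
        nim_target=env.get("NIM_TARGET", "127.0.0.1:8001"),
        nim_model_type=env.get("NIM_MODEL_TYPE", "48k-hq"),
        nim_max_chunk_bytes=int(env.get("NIM_MAX_CHUNK_BYTES", "34000000")),
        min_silence_ms=int(env.get("MIN_SILENCE_MS", "400")),
        silence_thresh_dbfs=int(env.get("SILENCE_THRESH_DBFS", "-40")),
        # Assumption: these spellings count as "true"; anything else is false.
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes", "on"},
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.
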
docker-compose.yml defines two services:
| Service | What it does |
|---|---|
| `nim` | Runs the NVIDIA Maxine Studio Voice NIM container (requires NVIDIA GPU) |
| `helper` | Runs this wrapper; waits for the NIM to be healthy before starting |

- NVIDIA GPU with nvidia-container-toolkit installed
- On Windows: Docker Desktop with WSL2 GPU support enabled
- `NGC_API_KEY` set in `.env`

Build the helper image:

    docker compose build helper

Then run the workflow:

    # 1. Start (and keep running) the NIM in the background.
    #    On first launch it downloads models, which can take several minutes.
    docker compose up -d nim
    # 2. Wait for it to become healthy (optional; helper depends_on handles this too)
    docker compose ps nim
    # 3. Run the helper against a file in ./data/
    docker compose run --rm helper --input /data/input.mp4 --output /data/output.mp4
    # 4. Stop the NIM when done
    docker compose down

Note: Paths inside the container must start with `/data/` (the mounted volume), not host-relative paths like `.\data\file.mp4`.
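
The path rule in the note above can be illustrated with a small helper that translates a host path under the data folder into its in-container form. This is a hypothetical convenience function, not part of the project; it assumes the host folder is named `data` and is mounted at `/data`, as in the commands shown here.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_container_path(host_path: str, data_dir: str = "data", mount: str = "/data") -> str:
    """Map a host path inside data_dir to the matching path under the mount.

    PureWindowsPath accepts both backslashes and forward slashes, so the same
    code handles '.\\data\\file.mp4' and 'data/file.mp4'. The first path
    component equal to data_dir is treated as the mounted folder.
    """
    parts = PureWindowsPath(host_path).parts
    if data_dir not in parts:
        raise ValueError(f"{host_path!r} is not inside the {data_dir!r} folder")
    rest = parts[parts.index(data_dir) + 1:]
    return str(PurePosixPath(mount, *rest))
```

For example, `to_container_path(r".\data\file.mp4")` gives the path to pass to `--input` when running the helper container.
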

Linux / macOS:

    docker run --rm \
      -e NGC_API_KEY=$NGC_API_KEY \
      -e NIM_TARGET=host.docker.internal:8001 \
      -v "$(pwd)/data:/data" \
      studio-voice-helper \
      --input /data/input.mp4 --output /data/output.mp4

Windows (PowerShell):

    docker run --rm `
      -e NGC_API_KEY=$env:NGC_API_KEY `
      -e NIM_TARGET=host.docker.internal:8001 `
      -v "${PWD}\data:/data" `
      studio-voice-helper `
      --input /data/input.mp4 --output /data/output.mp4

Run the tests:

    pip install pytest
    pytest tests/ -v

Project layout:

    app/
    ├── __init__.py
    ├── __main__.py      # python -m app entrypoint
    ├── cli.py           # Argument parsing, logging setup
    ├── config.py        # Env / .env config loader
    ├── media.py         # ffmpeg: extract, resample, remux
    ├── chunking.py      # Silence-aware splitting & stitching
    ├── nim_client.py    # gRPC client for Studio Voice NIM
    ├── pipeline.py      # Orchestrates the full flow
    └── _generated/      # Proto-compiled gRPC stubs
    protos/
    └── studiovoice.proto    # NIM proto definition
    tests/
    ├── test_config.py
    ├── test_media.py
    ├── test_chunking.py
    └── test_pipeline.py

MIT