PiMPStudios/anvil-audio

Anvil Audio

A pluggable studio tool for AI audio generation. Swap models, keep your workflow.

Anvil Audio is a refactored and extended fork of stable-audio-tools by Stability AI. It turns a single-model inference codebase into a clean, swappable-component platform where models, conditioners, and compressors are first-class abstractions.

Supports Stable Audio diffusion models (Stability AI), ACE-Step music-generation models (ACE Studio / StepFun), and MLX-accelerated Stable Audio models on Apple Silicon — all through a unified registry, CLI, and Gradio UI.


What's New in Anvil

  • Pluggable pipeline architecture — BasePipeline, BaseGenerator, BaseCompressor, and BaseConditioner ABCs; swap any component without touching the rest of your workflow.
  • Named model registry — anvil generate --model stable-audio-open-1.0 --prompt "..." loads the right pipeline automatically; add your own entries in ~/.anvil-audio/registry.yaml.
  • ACE-Step support — optional integration with ACE-Step v1.5 for full-song music generation with lyrics and style tags.
  • MLX acceleration — Apple Silicon users can install mlx-audiogen to get native Metal GPU inference for Stable Audio models (~2x faster than PyTorch MPS); weights are auto-converted and cached on first use.
  • Output management — collision-free timestamped filenames, JSON metadata sidecars with generation_duration_seconds, batch manifests, and project-scoped folders under ~/anvil-audio-outputs/.
  • MPS / CUDA / CPU auto-detection — runs on Apple Silicon, NVIDIA GPUs, or CPU with no flags needed.
  • anvil generate CLI — multi-GPU via Accelerate, wav/flac/mp3 output, batch YAML conditions, per-run seed control; anvil --list-models works at the top level.
  • Gradio web UI — project name, seed input, live metadata panel, model dropdown with hot-reload, device field.
  • Built-in audio editor — post-processing tab with normalize, trim, fade, time stretch, pitch shift, EQ, and reverb; non-destructive exports with full effects sidecar.
  • MCP server — expose all generation and editing capabilities to Claude and other MCP clients over stdio; models are cached between calls.
  • Python 3.12+ — uses modern union syntax, slots=True dataclasses, and lowercase generics throughout.

Requirements

  • Python 3.12 or 3.13 (Python 3.14 is too new for several ML dependencies and will cause build failures)
  • PyTorch 2.0 or later

Install

Python 3.12 or 3.13 is strongly recommended. Python 3.14 is too new — several ML dependencies (scipy, k-diffusion) don't have pre-built wheels for it yet and will attempt to compile from source. If you're on Homebrew Python, check your version with python3 --version and install 3.12/3.13 via brew install python@3.13 if needed.

git clone https://github.com/DaRealDaHoodie/anvil-audio.git
cd anvil-audio

# Create a virtual environment (required on Homebrew / system Python, which block global pip installs)
python3.13 -m venv .venv        # use python3.12 if 3.13 isn't available
source .venv/bin/activate       # Windows: .venv\Scripts\activate

pip install .
# avoid Accelerate import error on some setups
pip uninstall -y transformer-engine

If you must use Python 3.14 and hit scipy build errors, install a Fortran compiler first:

brew install gcc        # provides gfortran, required to compile scipy from source
pip install .

Quick Start

The fastest path to generating audio:

# 1. Clone and install (see Install above)
# 2. Add a model to your registry (see User Registry below)
# 3. Launch the Gradio UI — loads your first registered model automatically
python run_gradio.py

Or from the CLI in one command:

anvil generate --model stable-audio-open-1.0 --prompt "wooden door creak"

Stable Audio Models

CLI

# Use a registered model by name
anvil generate --model stable-audio-open-1.0 --prompt "wooden door creak"

# List all registered models
anvil --list-models

# Batch generation from a YAML file
anvil generate --model stable-audio-open-1.0 --cond-yaml-path batch.yaml --output-dir ./out

# Legacy path (local config + checkpoint)
anvil generate --model-config config.json --ckpt-path model.ckpt \
    --prompt "rain on a tin roof" --output-dir ./out

Multi-GPU generation is supported via Accelerate.

Gradio web UI

# Load by registry name (recommended)
python run_gradio.py --model stable-audio-open-1.0

# Load from HuggingFace Hub directly
python run_gradio.py --pretrained-name stabilityai/stable-audio-open-1.0

# No args — loads the first model from your registry
python run_gradio.py

# Route outputs to a named project folder
python run_gradio.py --model stable-audio-open-1.0 --project sfx-pack-v1

Presets & Reproducibility

Every generation saves a JSON sidecar alongside the audio containing the full parameters — prompt, seed, steps, CFG scale, model, duration, and for edits the full effects chain. These sidecars double as presets: the Load Recent dropdown at the bottom of the Generate tab shows your last 10 generations from the current project, and selecting one pre-populates all fields instantly. Load Preset accepts any .json sidecar via drag-and-drop, so you can share settings with collaborators or reload a preset from a different project. Tweak any field after loading and hit Generate to create a variation.
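A generation sidecar looks roughly like this (the fields below are the ones listed above; the exact key names are illustrative, not Anvil's guaranteed schema):

```json
{
  "prompt": "wooden door creak",
  "model": "stable-audio-open-1.0",
  "seed": 12345,
  "steps": 100,
  "cfg_scale": 7.0,
  "seconds_total": 30.0,
  "generation_duration_seconds": 31.2
}
```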


ACE-Step Music Generation (optional)

ACE-Step is an open-source full-song music generation model that supports style tags and full lyric input. Anvil integrates it through the same registry and UI as Stable Audio — no separate server or app required.

ACE-Step is optional. If you don't install it, all other Anvil functionality works as normal.

Install ACE-Step

# Clone the ACE-Step repo (Anvil imports it directly — no pip install needed)
git clone https://github.com/ace-step/ACE-Step.git /path/to/ACE-Step
cd /path/to/ACE-Step

# Install ACE-Step's dependencies into your Anvil venv
pip install -r requirements.txt

# ACE-Step requires transformers 4.x (not 5.x)
pip install "transformers>=4.51.0,<4.58.0"

The model weights are downloaded automatically from HuggingFace the first time you generate.

ACESTEP_PROJECT_ROOT must be set for Anvil to find the ACE-Step models. Without it, ACE-Step entries will not appear in the registry at all.

export ACESTEP_PROJECT_ROOT=/path/to/your/ACE-Step

# Verify it's registered:
anvil --list-models

Add the export to your shell profile (.zshrc, .bashrc, etc.) to make it permanent.

LM lyric planner

ACE-Step ships a separate 5 Hz LM lyric planner that produces structured audio_codes fed into the DiT. Using it gives significantly better vocal structure and timing compared to passing raw lyric text directly — this is the path the standalone ACE-Step Gradio UI uses.

Anvil initialises the LM planner automatically using the checkpoint specified in the registry entry. The built-in entries default to:

Model               LM checkpoint
acestep-v1.5-turbo  acestep-5Hz-lm-1.7B (lighter, faster)
acestep-v1.5-sft    acestep-5Hz-lm-4B (heavier, better quality)

Both checkpoints are relative paths within <ACESTEP_PROJECT_ROOT>/checkpoints/ and are downloaded automatically from HuggingFace on first use.

You can override the LM checkpoint for a specific registry entry via lm_model_path in registry.yaml (see ACE-Step models below), or set a global fallback for all ACE-Step models via an environment variable:

export ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-4B

If the LM planner fails to initialise (missing checkpoint, memory constraints, etc.) Anvil falls back gracefully to DiT-only generation and prints a warning. For purely instrumental tracks the LM step is skipped automatically regardless of the setting.
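The fallback described above amounts to a try/except around planner initialisation; here is a minimal sketch with a hypothetical loader callable, not Anvil's actual internals:

```python
# Sketch of graceful LM-planner fallback; load_fn is a hypothetical
# loader callable, not Anvil's real API.
def init_lm_planner(load_fn):
    try:
        return load_fn()
    except Exception as err:  # missing checkpoint, memory constraints, etc.
        print(f"warning: LM planner unavailable ({err}); using DiT-only generation")
        return None
```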

Apple Silicon acceleration

On macOS, Anvil automatically sets ACESTEP_LM_BACKEND=mlx before initializing ACE-Step, enabling the MLX backend for the 5Hz LM lyric planner (unless you've already set the variable yourself). The DiT and VAE also run via native MLX on Apple Silicon automatically — no flags needed.

If you run the ACE-Step MCP server or API separately, set the variable in your shell or MCP env block:

export ACESTEP_LM_BACKEND=mlx

Built-in registry entries

Anvil registers two ACE-Step variants automatically when the repo is found:

Model name          Description      Steps  Notes
acestep-v1.5-turbo  Fast generation  50     Good for drafts and quick iteration
acestep-v1.5-sft    Full quality     100    Better for final exports

CLI

# Single prompt (instrumental)
anvil generate --model acestep-v1.5-turbo \
    --prompt "indie pop, acoustic guitar, warm vocals, upbeat"

# Batch generation with lyrics (see example file)
anvil generate --model acestep-v1.5-turbo \
    --cond-yaml-path example/generation/acestep_conditions.yaml \
    --output-dir ./out

Batch YAML format — each entry supports prompt, lyrics, and seconds_total:

tracks:
  indie_pop:
    prompt: "indie pop, acoustic guitar, warm vocals, upbeat, sunny afternoon"
    lyrics: |
      [verse]
      Walking down the open road
      Sunlight through the trees
      [chorus]
      This is where I start again
    seconds_total: 30.0

  electronic_instrumental:
    prompt: "electronic, synthwave, driving bass, retro 80s, cinematic"
    lyrics: "[Instrumental]"
    seconds_total: 45.0

Gradio web UI

# Turbo (fast, good for iteration)
python run_gradio.py --model acestep-v1.5-turbo

# Full quality
python run_gradio.py --model acestep-v1.5-sft

The ACE-Step UI adds a Lyrics field below the prompt. Leave it blank or enter [Instrumental] for tracks with no vocals. Structure lyrics with section markers like [verse], [chorus], [bridge].


Audio Editor

The Gradio UI includes a built-in Edit tab for quick post-processing without leaving Anvil. It works with any loaded model and any audio file — not just files generated in the current session.

Available tools:

Tool                What it does
Normalize           Peak or LUFS loudness targeting
Trim silence        Strips quiet sections from the edges
Fade in / fade out  Linear ramp from/to silence
Loop / clip         Trim the audio to a start/end range
Time stretch        Speed up or slow down without changing pitch
Pitch shift         Transpose up or down in semitones
EQ                  Low shelf, peak mid, high shelf
Reverb              Room size, damping, wet/dry mix

Effects are applied in a fixed chain (trim → clip → stretch/pitch → EQ → reverb → fade → normalize), which keeps results predictable regardless of the order you adjust knobs.
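The fixed ordering can be sketched as a loop over a canonical stage list (stage names and effect callables here are illustrative, not the editor's actual internals):

```python
# Canonical stage order from the chain above; the order in which the
# user adjusted knobs is irrelevant.
CHAIN = ("trim", "clip", "stretch_pitch", "eq", "reverb", "fade", "normalize")

def apply_chain(audio, effects: dict):
    """Apply enabled effects in the fixed chain order.

    `effects` maps stage name -> callable(audio) -> audio; stages the
    user didn't enable are simply skipped.
    """
    for stage in CHAIN:
        fn = effects.get(stage)
        if fn is not None:
            audio = fn(audio)
    return audio
```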

Typical workflow:

  1. Generate on the Generate tab
  2. Switch to Edit
  3. Click Load Last Generation — the output loads automatically
  4. Adjust effects; click Preview to hear the result
  5. Click Export when satisfied

Export creates a new file via the output manager — the original is never touched. The JSON sidecar for the exported file records the source path and the full effects chain so you can always trace what was applied and replay it.
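An exported edit's sidecar could record the chain like this (the key names are illustrative, not Anvil's exact schema; the parameter values echo the MCP example session later in this README):

```json
{
  "source_path": "~/anvil-audio-outputs/default/20260401_181907_thunderstorm.wav",
  "effects": {
    "fade_in": 2.0,
    "normalize": {"target_db": -14, "lufs": true}
  }
}
```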

You can also drag any audio file into the source field to edit files from outside Anvil.


MLX Acceleration (Apple Silicon)

On M1/M2/M3/M4 Macs, Anvil can run Stable Audio inference through mlx-audiogen, which ports the DiT, VAE, and T5 conditioner to Apple's native MLX framework. This runs directly on the Metal GPU without going through PyTorch MPS.

Benchmark — 30-second clip, stable-audio-open-1.0:

Backend             Time
PyTorch MPS         ~61 s
MLX (native Metal)  ~31 s

Enable MLX

pip install mlx-audiogen

That's it. Once installed, two new models appear in the registry automatically:

Model name                   Source model
stable-audio-open-small-mlx  stabilityai/stable-audio-open-small
stable-audio-open-1.0-mlx    stabilityai/stable-audio-open-1.0

On first use, Anvil downloads the original HuggingFace weights and converts them to MLX safetensors format. Converted weights are cached at:

~/.cache/anvil-audio/mlx-weights/<model-slug>/

Subsequent loads skip conversion and go straight to inference.

Usage

# CLI
anvil generate --model stable-audio-open-1.0-mlx --prompt "rain on leaves"

# Gradio — select from the model dropdown
python run_gradio.py --model stable-audio-open-1.0-mlx

MLX models use a rectified-flow sampler (euler or rk4). The valid sigma_max range is [0.01, 2.0] — values outside it (e.g. the PyTorch default of 500.0) are automatically reset to 1.0.
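That reset rule can be expressed as follows (an illustrative sketch mirroring the documented behaviour, not the actual code):

```python
# Valid sigma_max range for the MLX rectified-flow sampler.
SIGMA_MAX_MIN, SIGMA_MAX_MAX = 0.01, 2.0

def effective_sigma_max(requested: float) -> float:
    """Out-of-range values (like PyTorch's 500.0 default) fall back to 1.0."""
    if SIGMA_MAX_MIN <= requested <= SIGMA_MAX_MAX:
        return requested
    return 1.0
```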

Requirements

  • macOS on Apple Silicon (M1 or later)
  • pip install mlx-audiogen

mlx-audiogen is an optional dependency — Anvil works normally on all platforms without it. The MLX model entries only appear in the registry on Apple Silicon with mlx-audiogen installed.

User registry — custom MLX weights

If you have pre-converted weights in a custom directory, point to them via mlx_weights_dir:

- name: my-mlx-model
  pipeline_type: mlx_diffusion
  pretrained_name: stabilityai/stable-audio-open-small
  mlx_weights_dir: /path/to/my/converted/weights
  default_params:
    steps: 100
    cfg_scale: 7.0
    sampler_type: euler
    sigma_max: 1.0

MCP Server

Anvil exposes its full capabilities as an MCP server so Claude and other MCP clients can generate and edit audio directly without the Gradio UI or manual CLI commands.

Install

pip install mcp

The mcp package is not installed by default. Everything else is already a dependency.

Available tools

Tool                     What it does
generate_audio           Generate a clip from a prompt; auto-selects model if not specified
batch_generate           Generate multiple clips in one call
edit_audio               Post-process a file with normalize, trim, EQ, reverb, etc.
list_models              All registered models with type, limits, and loaded status
get_model_info           Full details for one model
list_recent_outputs      Recent output files with their metadata, newest-first
get_generation_metadata  Read the sidecar for any output file
list_projects            Project folders under ~/anvil-audio-outputs/
set_active_project       Set a default project so you don't repeat it every call

All generate_audio and batch_generate responses include generation_duration_seconds — the wall-clock time from the start of inference to the file being written. This lets you compare backends directly (e.g. PyTorch MPS vs MLX) without any external timing.

Models are loaded lazily on first use and cached between calls — switching between two models during a session only pays the load cost once per model.
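The load-once behaviour is the usual lazy-cache pattern; a sketch with a hypothetical loader function, not the server's actual code:

```python
# Loaded pipelines, keyed by registry name; survives across MCP calls.
_pipelines: dict[str, object] = {}

def get_pipeline(name: str, load_pipeline):
    """Load a model on first use; later calls reuse the cached instance."""
    if name not in _pipelines:
        _pipelines[name] = load_pipeline(name)  # load cost paid once per model
    return _pipelines[name]
```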

Claude Desktop config

Add this to ~/Library/Application Support/Claude/claude_desktop_config.json (create the file if it doesn't exist):

{
  "mcpServers": {
    "anvil-audio": {
      "command": "/path/to/anvil-audio/.venv/bin/python",
      "args": ["-m", "anvil_audio.mcp_server"],
      "env": {
        "ACESTEP_PROJECT_ROOT": "/path/to/ACE-Step"
      }
    }
  }
}

Replace /path/to/anvil-audio with the absolute path to your clone. The env block is required for ACE-Step models — Claude Desktop doesn't inherit your shell environment, so the variable must be set explicitly. Omit the env block if you're only using Stable Audio models.

Claude Code config

Add to ~/.claude.json under mcpServers:

{
  "mcpServers": {
    "anvil-audio": {
      "command": "/path/to/anvil-audio/.venv/bin/python",
      "args": ["-m", "anvil_audio.mcp_server"],
      "type": "stdio",
      "env": {
        "ACESTEP_PROJECT_ROOT": "/path/to/ACE-Step"
      }
    }
  }
}

Example session

Once configured, Claude can generate and edit audio directly:

You:    Generate a short thunderstorm ambience clip
Claude: [calls generate_audio(prompt="thunderstorm ambience, rain, distant thunder", duration_seconds=20)]
        Generated: ~/anvil-audio-outputs/default/20260401_181907_thunderstorm_...wav
        Generation time: 31.2 s

You:    Add a slight fade in and normalize it to -14 LUFS
Claude: [calls edit_audio(file_path="...", fade_in=2.0, normalize=True,
                          normalize_target_db=-14, normalize_lufs=True)]
        Exported: ~/anvil-audio-outputs/default/20260401_181942_edit_...wav

User Registry

Add your own models to ~/.anvil-audio/registry.yaml. The file is a YAML list — entries with the same name as a built-in will override it.
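The override rule (a user entry with a built-in's name replaces it) amounts to a dict merge keyed on name; this helper is hypothetical, not Anvil's actual loader:

```python
def merge_registry(builtin: list[dict], user: list[dict]) -> list[dict]:
    """User entries override built-ins that share the same name."""
    merged = {entry["name"]: entry for entry in builtin}
    merged.update({entry["name"]: entry for entry in user})  # user wins
    return list(merged.values())
```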

Stable Audio / diffusion models

- name: my-sfx-model
  pretrained_name: myorg/my-sfx-model        # HuggingFace Hub
  default_params:
    steps: 100
    cfg_scale: 7.0

- name: local-vae-dit
  model_config_path: /path/to/config.json
  ckpt_path: /path/to/model.ckpt
  pretransform_ckpt_path: /path/to/vae.ckpt

MLX models (Apple Silicon)

- name: my-mlx-model
  pipeline_type: mlx_diffusion
  pretrained_name: stabilityai/stable-audio-open-small
  # Optional: point to a directory with pre-converted MLX safetensors.
  # Omit to use the auto-convert cache at ~/.cache/anvil-audio/mlx-weights/
  mlx_weights_dir: /path/to/converted/weights
  default_params:
    steps: 100
    cfg_scale: 7.0
    sampler_type: euler
    sigma_max: 1.0

ACE-Step models

- name: my-acestep-finetune
  pipeline_type: acestep
  acestep_project_root: /path/to/ACE-Step    # path to the cloned repo
  model_config_path: acestep-v15-turbo        # checkpoint variant name
  # Optional: override the LM lyric-planner checkpoint.
  # Relative paths are resolved under <acestep_project_root>/checkpoints/.
  # Omit to use the built-in default (1.7B for turbo, 4B for sft).
  lm_model_path: acestep-5Hz-lm-1.7B
  default_params:
    steps: 50
    cfg_scale: 4.0
    audio_duration: 60

run_gradio.py flags

Flag                      Description
--model                   Registry model name (e.g. stable-audio-open-1.0, acestep-v1.5-turbo)
--pretrained-name         HuggingFace Hub repo ID (e.g. stabilityai/stable-audio-open-1.0)
--model-config            Local model config JSON (ignored if --model or --pretrained-name set)
--ckpt-path               Local checkpoint (ignored if --model or --pretrained-name set)
--pretransform-ckpt-path  Optional separate VAE checkpoint
--username / --password   Gradio auth
--model-half              Use float16 inference
--device                  cuda, mps, or cpu (auto-detects if omitted)
--project                 Outputs go to ~/anvil-audio-outputs/{project}/
--share                   Create a public Gradio share URL

anvil generate flags

Flag                           Default           Description
--model NAME                                     Registry model name
--list-models                                    Print registry and exit (also works as anvil --list-models)
--model-config PATH                              Legacy: local JSON config
--ckpt-path PATH                                 Legacy: local checkpoint
--pretransform-ckpt-path PATH                    Separate VAE checkpoint
--prompt TEXT                                    Single text prompt
--cond-yaml-path PATH                            Batch YAML conditions file
--seconds-start                0.0               Start time (seconds)
--seconds-total                30.0              Duration (seconds)
--output-dir                   ./output          Output directory
--format                       wav               wav, flac, or mp3
--clip-length                  off               Clip to seconds_total
--sample-steps                 pipeline default  Diffusion / inference steps
--cfg-scale                    pipeline default  CFG guidance scale
--sampler-type                 pipeline default  Sampler type
--sigma-min / --sigma-max      pipeline default  Noise schedule bounds
--n-sample-per-cond            1                 Samples per condition
--batch-size                   10                Items per GPU batch
--seed                         -1 (random)       RNG seed
--device                       auto              cuda, mps, or cpu

Logging

Training requires a Weights & Biases account:

wandb login
# or pass as env var
export WANDB_API_KEY="your-key-here"

Training

Configuration files

You need two config files before starting a training run:

  • model config — defines architecture and training hyperparameters
  • dataset config — points to your audio and metadata

See docs/datasets.md for dataset config details.

Training from scratch

python3 train.py \
    --dataset-config /path/to/dataset/config \
    --model-config /path/to/model/config \
    --name my_experiment

Fine-tuning

  • Resume from a wrapped checkpoint: --ckpt-path path/to/wrapped.ckpt
  • Start fresh from an unwrapped pre-trained model: --pretrained-ckpt-path path/to/unwrapped.ckpt

Unwrapping a model

Training checkpoints include the full training wrapper (discriminators, EMA, optimizer states). Unwrap before using for inference or as a pretransform:

python3 unwrap_model.py \
    --model-config /path/to/model/config \
    --ckpt-path /path/to/wrapped/ckpt.ckpt \
    --name /path/to/output/unwrapped_name

Training Stable Audio 2.0

Prerequisites

1. CLAP encoder checkpoint

Download music_audioset_epoch_15_esc_90.14.pt from the LAION CLAP repository and set clap_ckpt_path in stable_audio_2_0.json:

"config": {
    "clap_ckpt_path": "ckpt/clap/music_audioset_epoch_15_esc_90.14.pt"
}

2. Audio + metadata

Each audio file needs a paired JSON sidecar with at minimum a prompt field:

dataset/
├── music_1.wav
├── music_1.json   ← {"prompt": "upbeat electronic track with positive vibes"}
├── music_2.wav
├── music_2.json
└── ...

Stage 1 — VAE-GAN

MODEL_CONFIG="anvil_audio/configs/model_configs/autoencoders/stable_audio_2_0_vae.json"
DATASET_CONFIG="anvil_audio/configs/dataset_configs/local_training_example.json"

python3 train.py \
    --dataset-config ${DATASET_CONFIG} \
    --model-config ${MODEL_CONFIG} \
    --name "vae_training" \
    --num-gpus 8 \
    --batch-size 10 \
    --num-workers 8 \
    --save-dir ./output

After training, unwrap the checkpoint before Stage 2.

Stage 2 — Diffusion Transformer (DiT)

MODEL_CONFIG="anvil_audio/configs/model_configs/txt2audio/stable_audio_2_0.json"
PRETRANSFORM_CKPT="/path/to/unwrapped_vae.ckpt"

python3 train.py \
    --dataset-config ${DATASET_CONFIG} \
    --model-config ${MODEL_CONFIG} \
    --pretransform-ckpt-path ${PRETRANSFORM_CKPT} \
    --name "dit_training" \
    --num-gpus 8 \
    --batch-size 10 \
    --save-dir ./output

Reconstruction test

python3 reconstruct_audios.py \
    --model-config ${MODEL_CONFIG} \
    --ckpt-path /path/to/unwrapped_vae.ckpt \
    --audio-dir /path/to/original_audio/ \
    --output-dir /path/to/reconstructed/ \
    --frame-duration 1.0 \
    --overlap-rate 0.01 \
    --batch-size 50

Container Setup

Build a Docker image and optionally convert to Singularity for HPC clusters:

NAME=anvil-audio
docker build -t ${NAME} -f ./container/anvil-audio.Dockerfile .

# Convert to Singularity
singularity build anvil-audio.sif docker-daemon://anvil-audio:latest

Backlog

  • PyPI package (pip install anvil-audio)
  • Contribution guidelines
  • More audio augmentations
  • Troubleshooting section

Licensing

Anvil Audio is MIT licensed. It builds on several open-source projects and optional model weights with their own licenses — see THIRD_PARTY_NOTICES.md for full attributions.
