A pluggable studio tool for AI audio generation. Swap models, keep your workflow.
Anvil Audio is a refactored and extended fork of
stable-audio-tools by Stability AI.
It turns a single-model inference codebase into a clean, swappable-component platform where
models, conditioners, and compressors are first-class abstractions.
Supports Stable Audio diffusion models (Stability AI), ACE-Step music-generation models (ACE Studio / StepFun), and MLX-accelerated Stable Audio models on Apple Silicon — all through a unified registry, CLI, and Gradio UI.
- Pluggable pipeline architecture — `BasePipeline`, `BaseGenerator`, `BaseCompressor`, `BaseConditioner` ABCs; swap any component without touching the rest of your workflow.
- Named model registry — `anvil generate --model stable-audio-open-1.0 --prompt "..."` loads the right pipeline automatically; add your own entries in `~/.anvil-audio/registry.yaml`.
- ACE-Step support — optional integration with ACE-Step v1.5 for full-song music generation with lyrics and style tags.
- MLX acceleration — Apple Silicon users can install `mlx-audiogen` to get native Metal GPU inference for Stable Audio models (~2x faster than PyTorch MPS); weights are auto-converted and cached on first use.
- Output management — collision-free timestamped filenames, JSON metadata sidecars with `generation_duration_seconds`, batch manifests, and project-scoped folders under `~/anvil-audio-outputs/`.
- MPS / CUDA / CPU auto-detection — runs on Apple Silicon, NVIDIA GPUs, or CPU with no flags needed.
- `anvil generate` CLI — multi-GPU via Accelerate, wav/flac/mp3 output, batch YAML conditions, per-run seed control; `anvil --list-models` works at the top level.
- Gradio web UI — project name, seed input, live metadata panel, model dropdown with hot-reload, device field.
- Built-in audio editor — post-processing tab with normalize, trim, fade, time stretch, pitch shift, EQ, and reverb; non-destructive exports with full effects sidecar.
- MCP server — expose all generation and editing capabilities to Claude and other MCP clients over stdio; models are cached between calls.
- Python 3.12+ — uses modern union syntax, `slots=True` dataclasses, and lowercase generics throughout.
- Python 3.12 or 3.13 (strongly recommended — Python 3.14 is too new for several ML dependencies and will cause build failures)
- PyTorch 2.0 or later
Python 3.12 or 3.13 is strongly recommended. Python 3.14 is too new — several ML dependencies (scipy, k-diffusion) don't have pre-built wheels for it yet and will attempt to compile from source. If you're on Homebrew Python, check your version with `python3 --version` and install 3.12/3.13 via `brew install python@3.13` if needed.
```bash
git clone https://github.com/DaRealDaHoodie/anvil-audio.git
cd anvil-audio

# Python 3.12+ requires a virtual environment (mandatory on Homebrew / system Python)
python3.13 -m venv .venv     # use python3.12 if 3.13 isn't available
source .venv/bin/activate    # Windows: .venv\Scripts\activate

pip install .

# avoid Accelerate import error on some setups
pip uninstall -y transformer-engine
```

If you must use Python 3.14 and hit scipy build errors, install a Fortran compiler first:

```bash
brew install gcc   # provides gfortran, required to compile scipy from source
pip install .
```

The fastest path to generating audio:
```bash
# 1. Clone and install (see Install above)
# 2. Add a model to your registry (see User Registry below)
# 3. Launch the Gradio UI — loads your first registered model automatically
python run_gradio.py
```

Or from the CLI in one command:

```bash
anvil generate --model stable-audio-open-1.0 --prompt "wooden door creak"
```

```bash
# Use a registered model by name
anvil generate --model stable-audio-open-1.0 --prompt "wooden door creak"

# List all registered models
anvil --list-models

# Batch generation from a YAML file
anvil generate --model stable-audio-open-1.0 --cond-yaml-path batch.yaml --output-dir ./out

# Legacy path (local config + checkpoint)
anvil generate --model-config config.json --ckpt-path model.ckpt \
    --prompt "rain on a tin roof" --output-dir ./out
```

Multi-GPU generation is supported via Accelerate.
```bash
# Load by registry name (recommended)
python run_gradio.py --model stable-audio-open-1.0

# Load from HuggingFace Hub directly
python run_gradio.py --pretrained-name stabilityai/stable-audio-open-1.0

# No args — loads the first model from your registry
python run_gradio.py

# Route outputs to a named project folder
python run_gradio.py --model stable-audio-open-1.0 --project sfx-pack-v1
```

Every generation saves a JSON sidecar alongside the audio containing the full parameters — prompt, seed, steps, CFG scale, model, duration, and for edits the full effects chain. These sidecars double as presets: the Load Recent dropdown at the bottom of the Generate tab shows your last 10 generations from the current project, and selecting one pre-populates all fields instantly. Load Preset accepts any `.json` sidecar via drag-and-drop, so you can share settings with collaborators or reload a preset from a different project. Tweak any field after loading and hit Generate to create a variation.
ACE-Step is an open-source full-song music generation model that supports style tags and full lyric input. Anvil integrates it through the same registry and UI as Stable Audio — no separate server or app required.
ACE-Step is optional. If you don't install it, all other Anvil functionality works as normal.
```bash
# Clone the ACE-Step repo (Anvil imports it directly — no pip install needed)
git clone https://github.com/ace-step/ACE-Step.git /path/to/ACE-Step
cd /path/to/ACE-Step

# Install ACE-Step's dependencies into your Anvil venv
pip install -r requirements.txt

# ACE-Step requires transformers 4.x (not 5.x)
pip install "transformers>=4.51.0,<4.58.0"
```

The model weights are downloaded automatically from HuggingFace the first time you generate.
`ACESTEP_PROJECT_ROOT` must be set for Anvil to find the ACE-Step models. Without it, ACE-Step entries will not appear in the registry at all.

```bash
export ACESTEP_PROJECT_ROOT=/path/to/your/ACE-Step

# Verify it's registered:
anvil --list-models
```

Add the export to your shell profile (`.zshrc`, `.bashrc`, etc.) to make it permanent.
ACE-Step ships a separate 5 Hz LM lyric planner that produces structured `audio_codes` fed into the DiT. Using it gives significantly better vocal structure and timing compared to passing raw lyric text directly — this is the path the standalone ACE-Step Gradio UI uses.
Anvil initialises the LM planner automatically using the checkpoint specified in the registry entry. The built-in entries default to:
| Model | LM checkpoint |
|---|---|
| `acestep-v1.5-turbo` | `acestep-5Hz-lm-1.7B` (lighter, faster) |
| `acestep-v1.5-sft` | `acestep-5Hz-lm-4B` (heavier, better quality) |
Both checkpoints are relative paths within `<ACESTEP_PROJECT_ROOT>/checkpoints/` and are downloaded automatically from HuggingFace on first use.

You can override the LM checkpoint for a specific registry entry via `lm_model_path` in `registry.yaml` (see ACE-Step models below), or set a global fallback for all ACE-Step models via an environment variable:
```bash
export ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-4B
```

If the LM planner fails to initialise (missing checkpoint, memory constraints, etc.), Anvil falls back gracefully to DiT-only generation and prints a warning. For purely instrumental tracks the LM step is skipped automatically regardless of the setting.
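The fallback behaviour described above can be pictured as follows. This is an illustrative sketch, not Anvil's internals: `init_lm_planner` and `plan_audio_codes` are made-up names, and the stand-in loader always fails to simulate a missing checkpoint.

```python
import warnings

def init_lm_planner(checkpoint: str):
    # Stand-in that always fails, simulating a missing LM checkpoint.
    raise FileNotFoundError(f"checkpoint not found: {checkpoint}")

def plan_audio_codes(lyrics: str, checkpoint: str):
    """Return planned audio codes, or None to signal DiT-only generation."""
    if lyrics.strip() in ("", "[Instrumental]"):
        return None  # purely instrumental: LM step skipped regardless of settings
    try:
        planner = init_lm_planner(checkpoint)
    except Exception as exc:
        warnings.warn(f"LM planner unavailable ({exc}); falling back to DiT-only")
        return None
    return planner
```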
On macOS, Anvil automatically sets `ACESTEP_LM_BACKEND=mlx` before initializing ACE-Step, enabling the MLX backend for the 5 Hz LM lyric planner (unless you've already set the variable yourself). The DiT and VAE also run via native MLX on Apple Silicon automatically — no flags needed.
If you run the ACE-Step MCP server or API separately, set the variable in your shell or MCP env block:

```bash
export ACESTEP_LM_BACKEND=mlx
```

Anvil registers two ACE-Step variants automatically when the repo is found:
| Model name | Description | Steps | Notes |
|---|---|---|---|
| `acestep-v1.5-turbo` | Fast generation | 50 | Good for drafts and quick iteration |
| `acestep-v1.5-sft` | Full quality | 100 | Better for final exports |
```bash
# Single prompt (instrumental)
anvil generate --model acestep-v1.5-turbo \
    --prompt "indie pop, acoustic guitar, warm vocals, upbeat"

# Batch generation with lyrics (see example file)
anvil generate --model acestep-v1.5-turbo \
    --cond-yaml-path example/generation/acestep_conditions.yaml \
    --output-dir ./out
```

Batch YAML format — each entry supports `prompt`, `lyrics`, and `seconds_total`:
```yaml
tracks:
  indie_pop:
    prompt: "indie pop, acoustic guitar, warm vocals, upbeat, sunny afternoon"
    lyrics: |
      [verse]
      Walking down the open road
      Sunlight through the trees
      [chorus]
      This is where I start again
    seconds_total: 30.0
  electronic_instrumental:
    prompt: "electronic, synthwave, driving bass, retro 80s, cinematic"
    lyrics: "[Instrumental]"
    seconds_total: 45.0
```

```bash
# Turbo (fast, good for iteration)
python run_gradio.py --model acestep-v1.5-turbo

# Full quality
python run_gradio.py --model acestep-v1.5-sft
```

The ACE-Step UI adds a Lyrics field below the prompt. Leave it blank or enter `[Instrumental]` for tracks with no vocals. Structure lyrics with section markers like `[verse]`, `[chorus]`, `[bridge]`.
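If you assemble lyrics programmatically, the section markers are easy to parse. A sketch (not Anvil's or ACE-Step's actual parser):

```python
import re

def split_sections(lyrics: str) -> list[tuple[str, str]]:
    """Return (section_name, text) pairs from marker-structured lyrics."""
    parts = re.split(r"\[(\w+)\]", lyrics)
    # re.split yields: [preamble, name1, text1, name2, text2, ...]
    return [(parts[i].lower(), parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

lyrics = """[verse]
Walking down the open road
[chorus]
This is where I start again"""
print(split_sections(lyrics))
```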
The Gradio UI includes a built-in Edit tab for quick post-processing without leaving Anvil. It works with any loaded model and any audio file — not just files generated in the current session.
Available tools:
| Tool | What it does |
|---|---|
| Normalize | Peak or LUFS loudness targeting |
| Trim silence | Strips quiet sections from the edges |
| Fade in / fade out | Linear ramp from/to silence |
| Loop / clip | Trim the audio to a start/end range |
| Time stretch | Speed up or slow down without changing pitch |
| Pitch shift | Transpose up or down in semitones |
| EQ | Low shelf, peak mid, high shelf |
| Reverb | Room size, damping, wet/dry mix |
Effects are applied in a fixed chain (trim → clip → stretch/pitch → EQ → reverb → fade → normalize), which keeps results predictable regardless of the order you adjust knobs.
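The fixed-chain idea can be sketched in a few lines: collect the user's adjustments in any order, then apply them in the canonical chain order. Names are illustrative, not Anvil's internals.

```python
# Canonical application order, mirroring the chain described above.
CHAIN = ["trim", "clip", "stretch", "pitch", "eq", "reverb", "fade", "normalize"]

def ordered_effects(requested: dict) -> list[str]:
    """Return the enabled effects in canonical chain order, regardless of
    the order the user toggled them in."""
    return [name for name in CHAIN if name in requested]

# User enabled reverb first, then normalize, then trim:
print(ordered_effects({"reverb": {}, "normalize": {}, "trim": {}}))
# They are still applied trim, then reverb, then normalize.
```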
Typical workflow:
- Generate on the Generate tab
- Switch to Edit
- Click Load Last Generation — the output loads automatically
- Adjust effects; click Preview to hear the result
- Click Export when satisfied
Export creates a new file via the output manager — the original is never touched. The JSON sidecar for the exported file records the source path and the full effects chain so you can always trace what was applied and replay it.
You can also drag any audio file into the source field to edit files from outside Anvil.
On M1/M2/M3/M4 Macs, Anvil can run Stable Audio inference through `mlx-audiogen`, which ports the DiT, VAE, and T5 conditioner to Apple's native MLX framework. This runs directly on the Metal GPU without going through PyTorch MPS.
Benchmark — 30-second clip, stable-audio-open-1.0:
| Backend | Time |
|---|---|
| PyTorch MPS | ~61 s |
| MLX (native Metal) | ~31 s |
```bash
pip install mlx-audiogen
```

That's it. Once installed, two new models appear in the registry automatically:
| Model name | Source model |
|---|---|
| `stable-audio-open-small-mlx` | `stabilityai/stable-audio-open-small` |
| `stable-audio-open-1.0-mlx` | `stabilityai/stable-audio-open-1.0` |
On first use, Anvil downloads the original HuggingFace weights and converts them to MLX safetensors format. Converted weights are cached at:
```
~/.cache/anvil-audio/mlx-weights/<model-slug>/
```
Subsequent loads skip conversion and go straight to inference.
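The convert-once-then-cache pattern looks roughly like this. The cache root matches the path above, but the slug format and the cache check are assumptions for illustration, not Anvil's actual code.

```python
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "anvil-audio" / "mlx-weights"

def mlx_weights_dir(pretrained_name: str) -> Path:
    """Cache location for a model's converted MLX weights (slug is assumed)."""
    slug = pretrained_name.replace("/", "--")
    return CACHE_ROOT / slug

def needs_conversion(pretrained_name: str) -> bool:
    """True if no converted safetensors exist yet for this model."""
    cache = mlx_weights_dir(pretrained_name)
    return not any(cache.glob("*.safetensors")) if cache.exists() else True

print(mlx_weights_dir("stabilityai/stable-audio-open-1.0"))
```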
```bash
# CLI
anvil generate --model stable-audio-open-1.0-mlx --prompt "rain on leaves"

# Gradio — select from the model dropdown
python run_gradio.py --model stable-audio-open-1.0-mlx
```

MLX models use a rectified-flow sampler (`euler` or `rk4`). The `sigma_max` range is `[0.01, 2.0]` — values outside this range (e.g. the PyTorch default of 500.0) are automatically clamped to 1.0.
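The clamping rule fits in a few lines. Note that, as stated above, out-of-range values are reset to 1.0 rather than clamped to the nearest bound (illustrative sketch):

```python
SIGMA_MAX_RANGE = (0.01, 2.0)

def clamp_sigma_max(value: float) -> float:
    """Reset out-of-range sigma_max values to 1.0; pass through valid ones."""
    lo, hi = SIGMA_MAX_RANGE
    return value if lo <= value <= hi else 1.0

print(clamp_sigma_max(500.0))  # PyTorch default, reset to 1.0
print(clamp_sigma_max(1.5))    # in range, unchanged
```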
- macOS on Apple Silicon (M1 or later)
- `pip install mlx-audiogen`

`mlx-audiogen` is an optional dependency — Anvil works normally on all platforms without it. The MLX model entries only appear in the registry on Apple Silicon with `mlx-audiogen` installed.
If you have pre-converted weights in a custom directory, point to them via `mlx_weights_dir`:
```yaml
- name: my-mlx-model
  pipeline_type: mlx_diffusion
  pretrained_name: stabilityai/stable-audio-open-small
  mlx_weights_dir: /path/to/my/converted/weights
  default_params:
    steps: 100
    cfg_scale: 7.0
    sampler_type: euler
    sigma_max: 1.0
```

Anvil exposes its full capabilities as an MCP server so Claude and other MCP clients can generate and edit audio directly without the Gradio UI or manual CLI commands.

```bash
pip install mcp
```

The `mcp` package is not installed by default. Everything else is already a dependency.
| Tool | What it does |
|---|---|
| `generate_audio` | Generate a clip from a prompt; auto-selects model if not specified |
| `batch_generate` | Generate multiple clips in one call |
| `edit_audio` | Post-process a file with normalize, trim, EQ, reverb, etc. |
| `list_models` | All registered models with type, limits, and loaded status |
| `get_model_info` | Full details for one model |
| `list_recent_outputs` | Recent output files with their metadata, newest-first |
| `get_generation_metadata` | Read the sidecar for any output file |
| `list_projects` | Project folders under `~/anvil-audio-outputs/` |
| `set_active_project` | Set a default project so you don't repeat it every call |
All `generate_audio` and `batch_generate` responses include `generation_duration_seconds` — the wall-clock time from the start of inference to the file being written. This lets you compare backends directly (e.g. PyTorch MPS vs MLX) without any external timing.
Models are loaded lazily on first use and cached between calls — switching between two models during a session only pays the load cost once per model.
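The caching behaviour is the familiar load-on-first-use pattern; a sketch with a stand-in loader (not Anvil's real server code):

```python
_loaded: dict[str, object] = {}

def load_pipeline(name: str) -> object:
    # Stand-in for the expensive model load.
    return f"pipeline<{name}>"

def get_pipeline(name: str) -> object:
    """Load a model on first use; later calls reuse the cached instance."""
    if name not in _loaded:
        _loaded[name] = load_pipeline(name)
    return _loaded[name]

a = get_pipeline("stable-audio-open-1.0")
b = get_pipeline("stable-audio-open-1.0")
print(a is b)  # the load cost is paid once per model
```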
Add this to `~/Library/Application Support/Claude/claude_desktop_config.json` (create the file if it doesn't exist):
```json
{
  "mcpServers": {
    "anvil-audio": {
      "command": "/path/to/anvil-audio/.venv/bin/python",
      "args": ["-m", "anvil_audio.mcp_server"],
      "env": {
        "ACESTEP_PROJECT_ROOT": "/path/to/ACE-Step"
      }
    }
  }
}
```

Replace `/path/to/anvil-audio` with the absolute path to your clone. The `env` block is required for ACE-Step models — Claude Desktop doesn't inherit your shell environment, so the variable must be set explicitly. Omit the `env` block if you're only using Stable Audio models.
Add to `~/.claude.json` under `mcpServers`:

```json
{
  "mcpServers": {
    "anvil-audio": {
      "command": "/path/to/anvil-audio/.venv/bin/python",
      "args": ["-m", "anvil_audio.mcp_server"],
      "type": "stdio",
      "env": {
        "ACESTEP_PROJECT_ROOT": "/path/to/ACE-Step"
      }
    }
  }
}
```

Once configured, Claude can generate and edit audio directly:
```
You: Generate a short thunderstorm ambience clip

Claude: [calls generate_audio(prompt="thunderstorm ambience, rain, distant thunder", duration_seconds=20)]
        Generated: ~/anvil-audio-outputs/default/20260401_181907_thunderstorm_...wav
        Generation time: 31.2 s

You: Add a slight fade in and normalize it to -14 LUFS

Claude: [calls edit_audio(file_path="...", fade_in=2.0, normalize=True,
                          normalize_target_db=-14, normalize_lufs=True)]
        Exported: ~/anvil-audio-outputs/default/20260401_181942_edit_...wav
```
Add your own models to `~/.anvil-audio/registry.yaml`. The file is a YAML list — entries with the same `name` as a built-in will override it.
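The override rule is last-writer-wins keyed on `name`; a sketch with illustrative entry shapes:

```python
def merge_registry(builtin: list[dict], user: list[dict]) -> dict[str, dict]:
    """Index entries by name; user entries shadow built-ins with the same name."""
    merged = {entry["name"]: entry for entry in builtin}
    merged.update({entry["name"]: entry for entry in user})  # same name wins
    return merged

builtin = [{"name": "stable-audio-open-1.0", "steps": 100}]
user = [{"name": "stable-audio-open-1.0", "steps": 50},  # overrides the built-in
        {"name": "my-sfx-model", "steps": 100}]
reg = merge_registry(builtin, user)
print(sorted(reg))                             # both names present
print(reg["stable-audio-open-1.0"]["steps"])   # user value wins
```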
```yaml
- name: my-sfx-model
  pretrained_name: myorg/my-sfx-model   # HuggingFace Hub
  default_params:
    steps: 100
    cfg_scale: 7.0

- name: local-vae-dit
  model_config_path: /path/to/config.json
  ckpt_path: /path/to/model.ckpt
  pretransform_ckpt_path: /path/to/vae.ckpt

- name: my-mlx-model
  pipeline_type: mlx_diffusion
  pretrained_name: stabilityai/stable-audio-open-small
  # Optional: point to a directory with pre-converted MLX safetensors.
  # Omit to use the auto-convert cache at ~/.cache/anvil-audio/mlx-weights/
  mlx_weights_dir: /path/to/converted/weights
  default_params:
    steps: 100
    cfg_scale: 7.0
    sampler_type: euler
    sigma_max: 1.0

- name: my-acestep-finetune
  pipeline_type: acestep
  acestep_project_root: /path/to/ACE-Step   # path to the cloned repo
  model_config_path: acestep-v15-turbo      # checkpoint variant name
  # Optional: override the LM lyric-planner checkpoint.
  # Relative paths are resolved under <acestep_project_root>/checkpoints/.
  # Omit to use the built-in default (1.7B for turbo, 4B for sft).
  lm_model_path: acestep-5Hz-lm-1.7B
  default_params:
    steps: 50
    cfg_scale: 4.0
    audio_duration: 60
```

| Flag | Description |
|---|---|
| `--model` | Registry model name (e.g. `stable-audio-open-1.0`, `acestep-v1.5-turbo`) |
| `--pretrained-name` | HuggingFace Hub repo ID (e.g. `stabilityai/stable-audio-open-1.0`) |
| `--model-config` | Local model config JSON (ignored if `--model` or `--pretrained-name` set) |
| `--ckpt-path` | Local checkpoint (ignored if `--model` or `--pretrained-name` set) |
| `--pretransform-ckpt-path` | Optional separate VAE checkpoint |
| `--username` / `--password` | Gradio auth |
| `--model-half` | Use float16 inference |
| `--device` | `cuda`, `mps`, or `cpu` (auto-detects if omitted) |
| `--project` | Outputs go to `~/anvil-audio-outputs/{project}/` |
| `--share` | Create a public Gradio share URL |
| Flag | Default | Description |
|---|---|---|
| `--model NAME` | — | Registry model name |
| `--list-models` | — | Print registry and exit (also works as `anvil --list-models`) |
| `--model-config PATH` | — | Legacy: local JSON config |
| `--ckpt-path PATH` | — | Legacy: local checkpoint |
| `--pretransform-ckpt-path PATH` | — | Separate VAE checkpoint |
| `--prompt TEXT` | — | Single text prompt |
| `--cond-yaml-path PATH` | — | Batch YAML conditions file |
| `--seconds-start` | `0.0` | Start time (seconds) |
| `--seconds-total` | `30.0` | Duration (seconds) |
| `--output-dir` | `./output` | Output directory |
| `--format` | `wav` | `wav`, `flac`, or `mp3` |
| `--clip-length` | off | Clip to `seconds_total` |
| `--sample-steps` | pipeline default | Diffusion / inference steps |
| `--cfg-scale` | pipeline default | CFG guidance scale |
| `--sampler-type` | pipeline default | Sampler type |
| `--sigma-min` / `--sigma-max` | pipeline default | Noise schedule bounds |
| `--n-sample-per-cond` | `1` | Samples per condition |
| `--batch-size` | `10` | Items per GPU batch |
| `--seed` | `-1` (random) | RNG seed |
| `--device` | auto | `cuda`, `mps`, or `cpu` |
Training requires a Weights & Biases account:

```bash
wandb login
# or pass as env var
export WANDB_API_KEY="your-key-here"
```

You need two config files before starting a training run:
- model config — defines architecture and training hyperparameters
- dataset config — points to your audio and metadata
See `docs/datasets.md` for dataset config details.
```bash
python3 train.py \
    --dataset-config /path/to/dataset/config \
    --model-config /path/to/model/config \
    --name my_experiment
```

- Resume from a wrapped checkpoint: `--ckpt-path path/to/wrapped.ckpt`
- Start fresh from an unwrapped pre-trained model: `--pretrained-ckpt-path path/to/unwrapped.ckpt`
Training checkpoints include the full training wrapper (discriminators, EMA, optimizer states). Unwrap before using for inference or as a pretransform:
```bash
python3 unwrap_model.py \
    --model-config /path/to/model/config \
    --ckpt-path /path/to/wrapped/ckpt.ckpt \
    --name /path/to/output/unwrapped_name
```

**1. CLAP encoder checkpoint**

Download `music_audioset_epoch_15_esc_90.14.pt` from the LAION CLAP repository and set `clap_ckpt_path` in `stable_audio_2_0.json`:

```json
"config": {
  "clap_ckpt_path": "ckpt/clap/music_audioset_epoch_15_esc_90.14.pt"
}
```

**2. Audio + metadata**
Each audio file needs a paired JSON sidecar with at minimum a `prompt` field:

```
dataset/
├── music_1.wav
├── music_1.json   ← {"prompt": "upbeat electronic track with positive vibes"}
├── music_2.wav
├── music_2.json
└── ...
```
```bash
MODEL_CONFIG="anvil_audio/configs/model_configs/autoencoders/stable_audio_2_0_vae.json"
DATASET_CONFIG="anvil_audio/configs/dataset_configs/local_training_example.json"

python3 train.py \
    --dataset-config ${DATASET_CONFIG} \
    --model-config ${MODEL_CONFIG} \
    --name "vae_training" \
    --num-gpus 8 \
    --batch-size 10 \
    --num-workers 8 \
    --save-dir ./output
```

After training, unwrap the checkpoint before Stage 2.
```bash
MODEL_CONFIG="anvil_audio/configs/model_configs/txt2audio/stable_audio_2_0.json"
PRETRANSFORM_CKPT="/path/to/unwrapped_vae.ckpt"

python3 train.py \
    --dataset-config ${DATASET_CONFIG} \
    --model-config ${MODEL_CONFIG} \
    --pretransform-ckpt-path ${PRETRANSFORM_CKPT} \
    --name "dit_training" \
    --num-gpus 8 \
    --batch-size 10 \
    --save-dir ./output
```

```bash
python3 reconstruct_audios.py \
    --model-config ${MODEL_CONFIG} \
    --ckpt-path /path/to/unwrapped_vae.ckpt \
    --audio-dir /path/to/original_audio/ \
    --output-dir /path/to/reconstructed/ \
    --frame-duration 1.0 \
    --overlap-rate 0.01 \
    --batch-size 50
```

Build a Docker image and optionally convert to Singularity for HPC clusters:
```bash
NAME=anvil-audio
docker build -t ${NAME} -f ./container/anvil-audio.Dockerfile .

# Convert to Singularity
singularity build anvil-audio.sif docker-daemon://anvil-audio
```

- PyPI package (`pip install anvil-audio`)
- Contribution guidelines
- More audio augmentations
- Troubleshooting section
Anvil Audio is MIT licensed. It builds on several open-source projects and optional model weights with their own licenses — see THIRD_PARTY_NOTICES.md for full attributions.