SongGeneration - 16GB VRAM Optimized Fork

This is a performance-optimized fork of the original SongGeneration project. It is specifically redesigned to run the v2 Large model on consumer-grade GPUs with 16GB of VRAM.

Key Optimizations

v2 Large on 16GB VRAM: Achieved through 8-bit µ-law quantization for KV-caching and FP16 model conversion (reducing the model footprint from 13GB to 9.5GB). Combined with fused QKV/MLP layers, these optimizations significantly lower the VRAM entry barrier without sacrificing output quality.
Long-form Generation: Support for song lengths up to 280 seconds.
Triple-Phase Memory Management: The workflow is split into three independent stages to ensure only one model occupies the VRAM at a time.
Precision Balance: Latent generation is optimized for memory, while the final diffuser stage remains in full fp32 precision for high-quality audio output.
Code Cleanup: Redundant dependencies and unused legacy code have been removed to create a streamlined experience.
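
The 8-bit µ-law quantization mentioned above can be illustrated with a minimal pure-Python sketch. It applies the standard µ-law companding formula to a list of floats, normalised by a per-vector scale, and rounds the result to signed 8-bit codes. The function names, the per-vector scaling and the list-based layout are assumptions made for illustration; the fork's actual tensor implementation may differ. The value 64.0 matches the q8_kv_cache_mu default described under Configuration & Input.

```python
import math

def mulaw_encode(values, mu=64.0):
    """Compress floats to signed 8-bit codes (hypothetical helper, illustration only)."""
    # Per-vector scale so that every value falls into [-1, 1] (an assumption).
    scale = max(abs(v) for v in values) or 1.0
    codes = []
    for v in values:
        x = v / scale
        # Standard mu-law companding: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)
        y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
        codes.append(round(y * 127))  # integer in [-127, 127]
    return codes, scale

def mulaw_decode(codes, scale, mu=64.0):
    """Invert mulaw_encode up to rounding error."""
    values = []
    for q in codes:
        y = q / 127
        # Inverse companding: sign(y) * ((1 + mu)^|y| - 1) / mu
        x = math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
        values.append(x * scale)
    return values

if __name__ == "__main__":
    original = [0.0, 0.01, -0.5, 2.0]
    codes, scale = mulaw_encode(original)
    print(codes, mulaw_decode(codes, scale))
```

Because the companding curve is logarithmic, small magnitudes receive finer quantization steps than large ones, which is the usual motivation for µ-law over uniform 8-bit rounding.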

System Requirements

The following setup was used for development and verification. Although the fork is optimized for AMD hardware, it is architecturally compatible with NVIDIA systems.

GPU: Minimum 16GB VRAM (Verified on AMD RX 9070).
System RAM: 32GB (under WSL2, at least 26GB must be allocated to WSL2).
OS: Linux or Windows with WSL2.
Environment: ROCm 7.2.1 with librocdxg, Python 3.12, PyTorch 2.11 (sdpa), Triton-ROCm 3.6.

Installation

Base Environment: Install PyTorch with the appropriate backend for your hardware (CUDA or ROCm).
Dependencies:
```
pip install -r requirements.txt
```

Model Preparation

Checkpoints are not included and must be downloaded from Hugging Face. Use download_ckpts.sh and the directory structure under ckpt/ as a guide.

Run the conversion scripts to prepare the models for the 16GB workflow:
* python ckpt/songgeneration/convert_fp16.py
* python ckpt/songgeneration/convert_ckpt_data_structure.py
* python ckpt/model_septoken/convert_fp32.py
* python ckpt/model_1rvq/convert_fp32.py

Workflow (Three-Phase Process)

To minimize VRAM usage, execute the generation in the following sequence:

Phase 1: Conditioning (jsonl2conditions.sh --input sample/your.jsonl) – Audio source separation via Demucs if an audio prompt is provided.
Phase 2: Token Generation (conditions2tokens.sh) – v2 Large inference using the µ-law KV cache.
Phase 3: Audio Synthesis (tokens2audio.sh) – Final rendering using the septoken model and the VAE.
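
The sequence above can be driven from a small Python wrapper, sketched below. It runs the three scripts one after another, so each phase exits before the next one starts. Invoking the scripts through bash from the repository root, and calling phases 2 and 3 without arguments, are assumptions; the helper names are hypothetical.

```python
import subprocess

def build_phase_commands(jsonl_path):
    """Return the three phase commands in execution order (hypothetical helper)."""
    return [
        # Phase 1: conditioning; the --input flag is taken from this README.
        ["bash", "jsonl2conditions.sh", "--input", jsonl_path],
        # Phase 2: token generation (assumed to need no arguments).
        ["bash", "conditions2tokens.sh"],
        # Phase 3: audio synthesis (assumed to need no arguments).
        ["bash", "tokens2audio.sh"],
    ]

def run_phases(jsonl_path, runner=subprocess.run):
    """Run each phase to completion before starting the next one."""
    for cmd in build_phase_commands(jsonl_path):
        # check=True stops the sequence as soon as one phase fails.
        runner(cmd, check=True)

if __name__ == "__main__":
    run_phases("sample/your.jsonl")
```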

Configuration & Input

Customizing `config.yaml`

lyric_processor

max_dur is the maximum song length in seconds. The default is 280.

lm

max_position_embeddings is the maximum kv_cache length for the main transformer. A song of about 280 seconds needs 8210 tokens.
max_position_embeddings_sub is the maximum kv_cache length for the sub transformer. I use the same value here as for the main transformer.
use_flash_attn_2 set to false activates PyTorch sdpa, which on my system is a lot faster than the flash_attn package. If you enable flash attention, the fp16 kv cache is used automatically.
use_q8_kv_cache set to true uses the int8 µ-law cache, while false uses standard fp16 kv-caching.
q8_kv_cache_mu sets the µ value for the µ-law cache. The default is 64.0; you can experiment with other values here.
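
How the lm settings above interact can be summarised in a short Python sketch. The dictionary mirrors the section and key names from this README with example values, and the nesting is an assumption about how config.yaml is laid out. The effective_kv_cache helper is hypothetical and only restates the rules described above; it is not part of the fork.

```python
# Example values only; the nesting follows the section names used in this README.
example_config = {
    "lyric_processor": {"max_dur": 280},
    "lm": {
        "max_position_embeddings": 8210,      # enough for a song of about 280 seconds
        "max_position_embeddings_sub": 8210,  # same value as the main transformer
        "use_flash_attn_2": False,            # false selects PyTorch sdpa
        "use_q8_kv_cache": True,              # true selects the int8 mu-law cache
        "q8_kv_cache_mu": 64.0,
    },
}

def effective_kv_cache(lm_cfg):
    """Return which kv cache the documented rules imply (hypothetical helper)."""
    if lm_cfg["use_flash_attn_2"]:
        # Enabling flash attention always implies the fp16 kv cache.
        return "fp16"
    return "int8-mulaw" if lm_cfg["use_q8_kv_cache"] else "fp16"

if __name__ == "__main__":
    print(effective_kv_cache(example_config["lm"]))
```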

Input Format (`.jsonl`)

jsonl2conditions.sh --input expects a JSONL file where each line represents a separate song:
{"idx": "unique_songname", "gt_lyric": "[intro-short] ; [verse] lyrics ; [outro-short]", "descriptions": "style and mood description"}
See ./conf/vocab.yaml for structure tags within gt_lyric.
See ./sample/description/* for the different types of information that descriptions can contain.

Optional conditioning:

Add "prompt_audio_path": "path/to/file.wav" for specific audio prompts.
Add "auto_prompt_audio_type": "type" for automatic conditioning.
Supported types: 'Pop', 'Latin', 'Rock', 'Electronic', 'Metal', 'Country', 'R&B/Soul', 'Ballad', 'Jazz', 'World', 'Hip-Hop', 'Funk', 'Soundtrack' or 'Auto'.
Expert Settings: You can override global settings per song by adding a parameters string: "parameters": "temp:0.9, cfg_coef:1.5, record_window:50, top_p:0.9, top_k:500"

See examples under ./sample
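
As an illustration of the input format, the following Python sketch builds one record with the fields described above and serialises it as a JSONL line. The helper names are hypothetical, and whether prompt_audio_path and auto_prompt_audio_type may be combined in one record is not stated here, so the sketch does not enforce either way.

```python
import json

# Supported values for auto_prompt_audio_type, as listed in this README.
SUPPORTED_AUTO_TYPES = {
    "Pop", "Latin", "Rock", "Electronic", "Metal", "Country", "R&B/Soul",
    "Ballad", "Jazz", "World", "Hip-Hop", "Funk", "Soundtrack", "Auto",
}

def format_parameters(overrides):
    """Turn {'temp': 0.9, 'top_k': 500} into 'temp:0.9, top_k:500'."""
    return ", ".join(f"{key}:{value}" for key, value in overrides.items())

def make_song_record(idx, gt_lyric, descriptions, prompt_audio_path=None,
                     auto_prompt_audio_type=None, overrides=None):
    """Build one song record; optional keys are added only when given."""
    record = {"idx": idx, "gt_lyric": gt_lyric, "descriptions": descriptions}
    if prompt_audio_path is not None:
        record["prompt_audio_path"] = prompt_audio_path
    if auto_prompt_audio_type is not None:
        if auto_prompt_audio_type not in SUPPORTED_AUTO_TYPES:
            raise ValueError(f"unsupported type: {auto_prompt_audio_type}")
        record["auto_prompt_audio_type"] = auto_prompt_audio_type
    if overrides:
        record["parameters"] = format_parameters(overrides)
    return record

def to_jsonl_line(record):
    """Serialise one record as a single line, keeping non-ASCII lyrics readable."""
    return json.dumps(record, ensure_ascii=False)

if __name__ == "__main__":
    song = make_song_record(
        "unique_songname",
        "[intro-short] ; [verse] lyrics ; [outro-short]",
        "style and mood description",
        auto_prompt_audio_type="Auto",
        overrides={"temp": 0.9, "cfg_coef": 1.5},
    )
    # One line per song; append further lines for additional songs.
    with open("your.jsonl", "w", encoding="utf-8") as handle:
        handle.write(to_jsonl_line(song) + "\n")
```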

Credits

Original Project: Tencent SongGeneration
Optimizations: µ-law implementation and refactoring by Siriusquirrel.
