Neural audio codec built from scratch, designed for AMD APU tri-processor inference.
The encoder runs on the NPU (INT8-friendly convolutions), the quantizer runs on the CPU (sequential codebook lookups), and the decoder runs on the GPU via Vulkan (parallel transposed convolutions). Routed by R.A.G-Race-Router.
| Component | Target processor | Design rationale |
|---|---|---|
| Encoder | NPU (XDNA 2) | Small Conv1d stack, INT8-quantizable, low power |
| Quantizer (RVQ) | CPU | Sequential codebook lookups, not parallelizable |
| Decoder | GPU (Vulkan) | Parallel ConvTranspose1d, benefits from 16 CUs |
Audio (44.1kHz mono)
-> Encoder: Conv1d stack, 512x downsample -> ~86 tokens/sec
-> RVQ: 8 codebooks x 2048 entries, residual quantization
-> Decoder: ConvTranspose1d stack, 512x upsample -> 44.1kHz audio
| Spec | Value |
|---|---|
| Parameters | 32.4M |
| Sample rate | 44.1 kHz (native, no resampling from 32kHz) |
| Codebooks | 8 x 2048 entries |
| Latent dimension | 128 |
| Compression ratio | 512x (44,100 samples/sec -> ~86 tokens/sec) |
| Loss function | L1 + multi-resolution STFT + commitment (no adversarial) |
| APU-Codec | EnCodec | |
|---|---|---|
| Sample rate | 44.1 kHz | 24/32 kHz |
| Codebooks | 8 x 2048 | 4 x 1024 |
| Processor split | NPU / CPU / GPU | Single GPU |
| Quantization noise | Lower (larger codebooks) | Higher |
Architecture complete, training started (~1 epoch).
- Model definition: done
- Training loop: done (L1 + spectral + commitment loss, AdamW, cosine LR)
- First training run: started on synthetic data
- Inference test harness: done
- Real audio training: pending (needs audio dataset)
# Set this for MIOpen kernel search on gfx1150
export MIOPEN_FIND_MODE=3
# Activate a PyTorch environment with ROCm support
source ~/pytorch-gfx1150-env/bin/activate
# Train (uses ~/Music/ or generates synthetic data)
python training/train.pyNote:
MIOPEN_FIND_MODE=3is required for training on gfx1150 (RDNA 3.5). Without it, MIOpen's default kernel search hangs or crashes on unsupported architectures.
from model.codec import APUCodec
import torch
codec = APUCodec()
print(f"Parameters: {codec.param_count:,}") # 32,400,xxx
audio = torch.randn(1, 1, 44100 * 2) # 2 seconds
reconstructed, codes, loss = codec(audio)
print(f"Codes shape: {codes.shape}") # (1, 8, ~172)
print(f"Compression: {codec.compression_ratio}x") # 512xapu-codec/
├── model/
│ └── codec.py # APUCodec, Encoder, Decoder, RVQ
├── training/
│ └── train.py # Training loop (L1 + spectral + commitment)
├── inference/
│ └── test_codec.py # Inference test harness
└── README.md
- R.A.G-Race-Router — Tri-processor inference runtime that routes this codec across NPU/CPU/GPU
- amdxdna-strix-fix — NPU driver fix required for encoder deployment
- unified-ml — Vulkan + HIP kernel benchmarks for the GPU decoder path
- pytorch-gfx1150 — PyTorch with native gfx1150 GPU support (used for training)
Peter Clemente (@Peterc3-dev)
MIT