Unofficial Implementation of "FINALLY: fast and universal speech enhancement with studio-like quality"
Explore details and sample results on our GitHub Pages: https://inverse-ai.github.io/FINALLY-Speech-Enhancement/. The page includes comprehensive information about the FINALLY speech enhancement model, audio examples comparing input and enhanced speech, and spectrogram visualizations for easy comparison.
Try the model live at https://noise-reducer.com (with SE v1.0) to enhance your audio.
For architecture details, see the paper *FINALLY: fast and universal speech enhancement with studio-like quality*.
We recommend using Conda to manage dependencies:

```bash
conda create -n finally_env python=3.10 pip
conda activate finally_env
pip install -r requirements.txt
```

The model uses WavLM-Large from Hugging Face as a frozen feature extractor.
- Automatically downloaded via `transformers` when training starts:

```python
import os

from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained(
    "microsoft/wavlm-large",
    output_hidden_states=True,
    # Any non-empty FORCE environment variable forces a fresh download.
    force_download=bool(os.getenv("FORCE", "")),
)
```
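For orientation, here is a minimal sketch of how a frozen WavLM extractor is typically used; the explicit freezing, the dummy input, and the choice of hidden layer are illustrative assumptions, not necessarily what our trainer does:

```python
import torch

# Freeze WavLM so it acts purely as a feature extractor (illustrative).
wavlm.eval()
for param in wavlm.parameters():
    param.requires_grad = False

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz mono audio, (batch, samples)

with torch.no_grad():
    outputs = wavlm(waveform)

# `hidden_states` holds one (batch, frames, 1024) tensor per layer plus the
# input embeddings; which layer(s) the losses consume is an assumption here.
features = outputs.hidden_states[6]
```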
Trainable parameters per component (in millions):
| Component | Parameters (M) |
|---|---|
| SpectralUNet | 4.5 |
| WavLM Post Processing | 50 |
| HiFi Pre Processing | 33 |
| HiFi(v1) | 14 |
| WaveUNet | 10.7 |
| SpectralMaskNet | 16.5 |
| WaveUNet Upsampler | 15 |
- Total trainable parameters: 143.7 M
- Non-trainable WavLM parameters: 315 M
- Total number of parameters (including WavLM): 458.7 M
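These numbers can be reproduced for any instantiated component with the standard PyTorch parameter-counting idiom (assuming each component is an `nn.Module`):

```python
import torch.nn as nn

def count_params(module: nn.Module) -> tuple[float, float]:
    """Return (trainable, frozen) parameter counts in millions."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable / 1e6, frozen / 1e6
```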
Data handling and preprocessing are implemented in the `datasets_manager/` directory, which includes:

- `datasets.py` – definitions for dataset structures and preprocessing
- `dataloaders.py` – PyTorch dataloaders for training and validation
- `augmentations_modules.py` – audio augmentation utilities
- `inference_datasets.py` – dataset structures and preprocessing for inference
Note: Your datasets should be placed inside the `datasets/` directory.
For details on the expected dataset structure and directory paths, refer to the config files and `datasets_manager/datasets.py`.
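The exact dataset classes live in `datasets_manager/datasets.py`; as a rough orientation only, a paired noisy/clean dataset often looks like the hypothetical sketch below (the `noisy/`/`clean/` layout and the class name are assumptions, not this repo's API):

```python
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset

class PairedAudioDataset(Dataset):
    """Hypothetical layout: datasets/<name>/noisy/*.wav and datasets/<name>/clean/*.wav."""

    def __init__(self, root: str, sample_rate: int = 16000):
        self.noisy_paths = sorted(Path(root, "noisy").glob("*.wav"))
        self.sample_rate = sample_rate

    def __len__(self) -> int:
        return len(self.noisy_paths)

    def __getitem__(self, idx):
        noisy_path = self.noisy_paths[idx]
        clean_path = noisy_path.parent.parent / "clean" / noisy_path.name
        noisy, sr = torchaudio.load(str(noisy_path))
        clean, _ = torchaudio.load(str(clean_path))
        if sr != self.sample_rate:  # resample everything to the model rate
            noisy = torchaudio.functional.resample(noisy, sr, self.sample_rate)
            clean = torchaudio.functional.resample(clean, sr, self.sample_rate)
        return noisy, clean
```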
To train the model on your dataset, provide a config file and run name.
Train the generator only.

```bash
python train.py exp.config_path=configs/stage1_config.yaml exp.run_name=stage1
```

- New Training: Set `gen:checkpoint_path: null` when starting from the beginning.
- Resuming: Provide the path to the latest Stage 1 checkpoint in `gen:checkpoint_path`.
Train with a discriminator.

```bash
python train.py exp.config_path=configs/stage2_config.yaml exp.run_name=stage2
```

- Setup: Use the last checkpoint from Stage 1 for `gen:checkpoint_path`.
- Note: Set `disc:checkpoint_path: null` when starting Stage 2.
Upsample the result from 16 kHz to 48 kHz.

```bash
python train.py exp.config_path=configs/stage3_config.yaml exp.run_name=stage3
```

- Setup: Use the last checkpoint from Stage 2 for `gen:checkpoint_path`.
- Note: Set `disc:checkpoint_path: null`.
- Configuration: Ensure `gen:args:use_upsamplewaveunet: true` is set in the config to enable 16 kHz to 48 kHz upsampling.
Optionally, you can specify the device, e.g. `exp.device=cuda:0`.
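We have not verified the exact configuration machinery, but dotlist overrides like these follow the common OmegaConf pattern; below is a hedged sketch of how `train.py` might merge CLI overrides with the YAML config (the actual wiring may differ):

```python
from omegaconf import OmegaConf

# Illustrative only: merge a YAML config with command-line dotlist
# overrides such as `exp.device=cuda:0`; see train.py for the real logic.
cli_cfg = OmegaConf.from_cli()
file_cfg = OmegaConf.load(cli_cfg.exp.config_path)
cfg = OmegaConf.merge(file_cfg, cli_cfg)

print(cfg.exp.run_name, cfg.exp.get("device", "cuda:0"))
```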
We provide support for Distributed Data Parallel (DDP) training to speed up the process using multiple GPUs.
To launch training on multiple GPUs (e.g., 2 GPUs), use torchrun:
```bash
torchrun --nproc_per_node=2 train_ddp.py \
    exp.config_path=configs/stage3_config_ddp.yaml \
    exp.run_name=stage3_ddp
```

When using DDP (see `configs/stage3_config_ddp.yaml`), pay attention to these parameters:
- Batch Size: `data.train_batch_size` is the batch size per GPU.
- Effective Batch Size: `train.effective_batch_size` is the total batch size across all GPUs and accumulation steps.
- Auto-accumulation: The trainer automatically calculates the required gradient accumulation steps from the world size and the target effective batch size, as in the sketch below.
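A minimal sketch of that accumulation arithmetic (the batch sizes here are made-up values; see `trainers/finally_trainer_ddp.py` for the actual implementation):

```python
# Gradient accumulation derived from world size and target effective batch.
world_size = 2               # GPUs launched via torchrun --nproc_per_node
train_batch_size = 8         # data.train_batch_size (per GPU, illustrative)
effective_batch_size = 64    # train.effective_batch_size (illustrative)

assert effective_batch_size % (train_batch_size * world_size) == 0
accumulation_steps = effective_batch_size // (train_batch_size * world_size)
print(accumulation_steps)  # -> 4 micro-batches accumulated per optimizer step
```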
Relevant files: `train_ddp.py`, `trainers/finally_trainer_ddp.py`, and `configs/*_ddp.yaml`.
To enhance speech from input audio files, provide the config and run name. Example:
```bash
python inference.py exp.config_path=configs/inference_config.yaml exp.run_name=inference
```

Our model was trained on both the datasets mentioned in the paper and additional high-quality datasets curated by us.
The table below compares the performance of the model using various metrics.
| Metric | Paper’s Score | Our Score |
|---|---|---|
| UTMOS | 4.32 | 4.30 |
| WV-MOS | 4.87 | 4.62 |
| DNSMOS | 3.22 | 3.30 |
| PESQ | 2.94 | 3.22 |
| STOI | 0.92 | 0.95 |
| SDR | 4.6 | 6.79 |
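For reference, the intrusive metrics in the table (PESQ, STOI) can be computed with the widely used `pesq` and `pystoi` packages; whether our evaluation scripts use these exact packages is an assumption here:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def intrusive_metrics(clean: np.ndarray, enhanced: np.ndarray, sr: int = 16000) -> dict:
    """Wideband PESQ and STOI for 16 kHz mono signals (reference-based)."""
    return {
        "PESQ": pesq(sr, clean, enhanced, "wb"),
        "STOI": stoi(clean, enhanced, sr, extended=False),
    }
```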
During our implementation of the FINALLY speech enhancement model, we have identified several areas for improvement:
- A tiny amount of stationary noise remains in the enhanced audio, which is particularly audible at high volume during silent sections.
- When integrating UTMOS loss, we observe that the speaker's accent occasionally changes in low-SNR (signal-to-noise ratio) portions. Interestingly, the accent is preserved when training without UTMOS loss, suggesting a trade-off between perceived quality scores and speaker identity preservation.
- The model sometimes exhibits voice identity shifts (the voice sounds like a different person) when the input speech is extremely quiet or masked by heavy noise to the point of being nearly unintelligible.
We invite the research community to help resolve these challenges or to suggest alternative approaches. If you have experience with:
- WavLM feature extraction and perceptual losses
- Speech enhancement model training and loss balancing
- Phoneme preservation techniques in generative models
Please feel free to:
- Open an issue in the Issues section to discuss potential solutions, report bugs, or share feedback.
- Submit a pull request with experimental results.
- Share relevant research papers or approaches.
Your insights and contributions could help improve the quality and robustness of this implementation.