Unofficial Implementation of "FINALLY: fast and universal speech enhancement with studio-like quality"
Explore details and sample results on our GitHub Pages: https://inverse-ai.github.io/FINALLY-Speech-Enhancement/. The page includes comprehensive information about the FINALLY speech enhancement model, audio examples comparing input and enhanced speech, and spectrogram visualizations for easy comparison.
Try the model live at https://noise-reducer.com (with SE v1.0) to enhance your audio.
For architecture details, see the paper *FINALLY: fast and universal speech enhancement with studio-like quality*.
We recommend using Conda to manage dependencies:

```bash
conda create -n finally_env python=3.10 pip
conda activate finally_env
pip install -r requirements.txt
```

The model uses WavLM-Large from Hugging Face as a frozen feature extractor.
- Automatically downloaded via `transformers` when training starts:

```python
import os

from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained(
    "microsoft/wavlm-large",
    output_hidden_states=True,
    # Any non-empty FORCE environment variable forces a fresh download.
    force_download=bool(os.getenv("FORCE", "")),
)
```
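For orientation, here is a minimal sketch of how a frozen WavLM extractor is typically used; the explicit freezing, the dummy input, and the choice of hidden layer are illustrative assumptions, not necessarily what our trainer does:

```python
import torch

# Freeze WavLM so it acts purely as a feature extractor (illustrative).
wavlm.eval()
for param in wavlm.parameters():
    param.requires_grad = False

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz mono audio, (batch, samples)

with torch.no_grad():
    outputs = wavlm(waveform)

# `hidden_states` holds one (batch, frames, 1024) tensor per layer plus the
# input embeddings; which layer(s) the losses consume is an assumption here.
features = outputs.hidden_states[6]
```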
Trainable parameters per component (in millions):
| Component | Parameters (M) |
|---|---|
| SpectralUNet | 4.5 |
| WavLM Post Processing | 50 |
| HiFi Pre Processing | 33 |
| HiFi(v1) | 14 |
| WaveUNet | 10.7 |
| SpectralMaskNet | 16.5 |
| WaveUNet Upsampler | 15 |
- Total trainable parameters: 143.7 M
- Non-trainable WavLM parameters: 315 M
- Total number of parameters (including WavLM): 458.7 M
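These numbers can be reproduced for any instantiated component with the standard PyTorch parameter-counting idiom (assuming each component is an `nn.Module`):

```python
import torch.nn as nn

def count_params(module: nn.Module) -> tuple[float, float]:
    """Return (trainable, frozen) parameter counts in millions."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable / 1e6, frozen / 1e6
```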
Data handling and preprocessing are implemented in the `datasets_manager/` directory, which includes:

- `datasets.py` – definitions for dataset structures and preprocessing
- `dataloaders.py` – PyTorch dataloaders for training and validation
- `augmentations_modules.py` – audio augmentation utilities
- `inference_datasets.py` – dataset structures and preprocessing for inference
Note: Your datasets should be placed inside the `datasets/` directory.
For details on the expected dataset structure and directory paths, refer to the config files and `datasets_manager/datasets.py`.
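The exact dataset classes live in `datasets_manager/datasets.py`; as a rough orientation only, a paired noisy/clean dataset often looks like the hypothetical sketch below (the `noisy/`/`clean/` layout and the class name are assumptions, not this repo's API):

```python
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset

class PairedAudioDataset(Dataset):
    """Hypothetical layout: datasets/<name>/noisy/*.wav and datasets/<name>/clean/*.wav."""

    def __init__(self, root: str, sample_rate: int = 16000):
        self.noisy_paths = sorted(Path(root, "noisy").glob("*.wav"))
        self.sample_rate = sample_rate

    def __len__(self) -> int:
        return len(self.noisy_paths)

    def __getitem__(self, idx):
        noisy_path = self.noisy_paths[idx]
        clean_path = noisy_path.parent.parent / "clean" / noisy_path.name
        noisy, sr = torchaudio.load(str(noisy_path))
        clean, _ = torchaudio.load(str(clean_path))
        if sr != self.sample_rate:  # resample everything to the model rate
            noisy = torchaudio.functional.resample(noisy, sr, self.sample_rate)
            clean = torchaudio.functional.resample(clean, sr, self.sample_rate)
        return noisy, clean
```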
To train the model on your dataset, provide a config file and run name.
Train the generator only.

```bash
python train.py exp.config_path=configs/stage1_config.yaml exp.run_name=stage1
```

- New Training: Set `gen:checkpoint_path: null` when starting from the beginning.
- Resuming: Provide the path to the latest Stage 1 checkpoint in `gen:checkpoint_path`.
Train with a discriminator.

```bash
python train.py exp.config_path=configs/stage2_config.yaml exp.run_name=stage2
```

- Setup: Use the last checkpoint from Stage 1 for `gen:checkpoint_path`.
- Note: Set `disc:checkpoint_path: null` when starting Stage 2.
Upsample the result from 16 kHz to 48 kHz.

```bash
python train.py exp.config_path=configs/stage3_config.yaml exp.run_name=stage3
```

- Setup: Use the last checkpoint from Stage 2 for `gen:checkpoint_path`.
- Note: Set `disc:checkpoint_path: null`.
- Configuration: Ensure `gen:args:use_upsamplewaveunet: true` is set in the config to enable 16 kHz to 48 kHz upsampling.
Optionally, you can specify the device, e.g. `exp.device=cuda:0`.
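We have not verified the exact configuration machinery, but dotlist overrides like these follow the common OmegaConf pattern; below is a hedged sketch of how `train.py` might merge CLI overrides with the YAML config (the actual wiring may differ):

```python
from omegaconf import OmegaConf

# Illustrative only: merge a YAML config with command-line dotlist
# overrides such as `exp.device=cuda:0`; see train.py for the real logic.
cli_cfg = OmegaConf.from_cli()
file_cfg = OmegaConf.load(cli_cfg.exp.config_path)
cfg = OmegaConf.merge(file_cfg, cli_cfg)

print(cfg.exp.run_name, cfg.exp.get("device", "cuda:0"))
```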
We provide support for Distributed Data Parallel (DDP) training to speed up the process using multiple GPUs.
To launch training on multiple GPUs (e.g., 2 GPUs), use torchrun:
```bash
torchrun --nproc_per_node=2 train_ddp.py \
    exp.config_path=configs/stage3_config_ddp.yaml \
    exp.run_name=stage3_ddp
```

When using DDP (see `configs/stage3_config_ddp.yaml`), pay attention to these parameters:
- Batch Size: `data.train_batch_size` is the batch size per GPU.
- Effective Batch Size: `train.effective_batch_size` is the total batch size across all GPUs and accumulation steps.
- Auto-accumulation: The trainer automatically calculates the required gradient accumulation steps from the world size and the target effective batch size, as in the sketch below.
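A minimal sketch of that accumulation arithmetic (the batch sizes here are made-up values; see `trainers/finally_trainer_ddp.py` for the actual implementation):

```python
# Gradient accumulation derived from world size and target effective batch.
world_size = 2               # GPUs launched via torchrun --nproc_per_node
train_batch_size = 8         # data.train_batch_size (per GPU, illustrative)
effective_batch_size = 64    # train.effective_batch_size (illustrative)

assert effective_batch_size % (train_batch_size * world_size) == 0
accumulation_steps = effective_batch_size // (train_batch_size * world_size)
print(accumulation_steps)  # -> 4 micro-batches accumulated per optimizer step
```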
Relevant files: `train_ddp.py`, `trainers/finally_trainer_ddp.py`, and `configs/*_ddp.yaml`.
To enhance speech from input audio files, provide the config and run name. Example:
```bash
python inference.py exp.config_path=configs/inference_config.yaml exp.run_name=inference
```

Our model was trained on both the datasets mentioned in the paper and additional high-quality datasets curated by us.
The table below compares the performance of the model using various metrics.
| Metric | Paper’s Score | Our Score |
|---|---|---|
| UTMOS | 4.32 | 4.30 |
| WV-MOS | 4.87 | 4.62 |
| DNSMOS | 3.22 | 3.30 |
| PESQ | 2.94 | 3.22 |
| STOI | 0.92 | 0.95 |
| SDR | 4.6 | 6.79 |
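For reference, the intrusive metrics in the table (PESQ, STOI) can be computed with the widely used `pesq` and `pystoi` packages; whether our evaluation scripts use these exact packages is an assumption here:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def intrusive_metrics(clean: np.ndarray, enhanced: np.ndarray, sr: int = 16000) -> dict:
    """Wideband PESQ and STOI for 16 kHz mono signals (reference-based)."""
    return {
        "PESQ": pesq(sr, clean, enhanced, "wb"),
        "STOI": stoi(clean, enhanced, sr, extended=False),
    }
```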
During our implementation of the FINALLY speech enhancement model, we have identified several areas for improvement:
- A tiny amount of stationary noise remains in the enhanced audio, which is particularly audible at high volume during silent sections.
- When integrating UTMOS loss, we observe that the speaker's accent occasionally changes in low-SNR (signal-to-noise ratio) portions. Interestingly, the accent is preserved when training without UTMOS loss, suggesting a trade-off between perceived quality scores and speaker identity preservation.
- The model sometimes exhibits voice identity shifts (the voice sounds like a different person) when the input speech is extremely quiet or masked by heavy noise to the point of being nearly unintelligible.
We invite the research community to help resolve these challenges or to suggest alternative approaches. If you have experience with:
- WavLM feature extraction and perceptual losses
- Speech enhancement model training and loss balancing
- Phoneme preservation techniques in generative models
Please feel free to:
- Open an issue in the Issues section to discuss potential solutions, report bugs, or share feedback.
- Submit a pull request with experimental results.
- Share relevant research papers or approaches.
Your insights and contributions could help improve the quality and robustness of this implementation.