A comprehensive evaluation of two popular Automatic Speech Recognition (ASR) systems, DeepSpeech and wav2letter, under different types and levels of environmental noise. Our repository contains three parts: noisy audio synthesis, a DeepSpeech model evaluation wrapper, and a wav2letter model wrapper.
We provide a Python script that duplicates an audio dataset's folder tree structure and adds noise to each audio file.
The noisy audio synthesis code uses the following Python packages: docopt, pydub, and numpy (os, fnmatch, and shutil are part of the standard library and need no installation). Please make sure the third-party packages are installed before running the code:
pip install docopt pydub numpy
The usage of the noisy audio synthesis code on the command line is:
python make_noisy_dataset.py <audio_dataset_dir> <noise_dataset_dir> <destination_dir> <file_type> <snr>
For example: python make_noisy_dataset.py 'LibriSpeech/' '15 Free Ambient Sound Effects/Busy City Street.mp3' './' 'wav' 0. Please don't forget the trailing '/' on <audio_dataset_dir> and <destination_dir>.
You can control how strongly the noise corrupts the audio by adjusting the SNR (signal-to-noise ratio) value, given in decibels: the lower the SNR, the louder the noise relative to the speech.
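To make the behaviour concrete, here is a minimal sketch of the script's core logic. It is an illustration only: the real `make_noisy_dataset.py` may differ in details, and `mix_at_snr` is a hypothetical helper, not necessarily a function in the script.

```python
"""Sketch of the core of make_noisy_dataset.py (illustrative, not the real script).

Usage:
    make_noisy_dataset.py <audio_dataset_dir> <noise_dataset_dir> <destination_dir> <file_type> <snr>
"""
import fnmatch
import os

from docopt import docopt
from pydub import AudioSegment  # pydub needs ffmpeg for mp3 input


def mix_at_snr(signal, noise, snr_db):
    """Overlay noise on signal so the result has the requested SNR in dB.

    pydub's dBFS property gives each segment's average loudness, so the gain
    applied to the noise is chosen to make signal.dBFS - noise.dBFS == snr_db.
    """
    gain = signal.dBFS - noise.dBFS - snr_db
    return signal.overlay(noise.apply_gain(gain), loop=True)  # loop short noise


if __name__ == "__main__":
    args = docopt(__doc__)
    src, dst = args["<audio_dataset_dir>"], args["<destination_dir>"]
    # In the example above, <noise_dataset_dir> is a single noise file.
    noise = AudioSegment.from_file(args["<noise_dataset_dir>"])
    snr = float(args["<snr>"])
    ext = args["<file_type>"]  # e.g. 'wav'

    for root, _dirs, files in os.walk(src):
        out_dir = os.path.join(dst, os.path.relpath(root, src))
        os.makedirs(out_dir, exist_ok=True)  # duplicate the folder tree
        for name in fnmatch.filter(files, "*." + ext):
            clean = AudioSegment.from_file(os.path.join(root, name), format=ext)
            mix_at_snr(clean, noise, snr).export(os.path.join(out_dir, name), format=ext)
```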
We used some free, open-source noise files, which were downloaded here. However, you can use any other noise files instead.
This is a wrapper around Mozilla's DeepSpeech. The architecture comes from Baidu's Deep Speech paper, and the framework is implemented by Mozilla. We use its speech recognition inference module and implemented the WER reporting ourselves.
Decoder: CTC + language model beam search
Language model: KenLM
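The wrapper loads the language model through DeepSpeech itself, but for intuition, a KenLM model can also be queried directly from Python with the kenlm package. This snippet is illustrative only; the model path is a placeholder:

```python
import kenlm  # e.g. pip install kenlm

lm = kenlm.Model("3-gram.arpa")  # placeholder path to any KenLM model

# Log10 probability of a candidate transcript (with sentence boundaries).
# The beam-search decoder trades scores like this off against the CTC
# acoustic scores when ranking hypotheses.
print(lm.score("the quick brown fox", bos=True, eos=True))
print(lm.perplexity("the quick brown fox"))
```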
- Download the .so file from here.
- Install DeepSpeech with `pip3 install deepspeech`.
- Run `DeepSpeech-mozilla/batch_trans_xer.py` to generate transcripts from the audio input, save them in .txt files, and then calculate the WER, CER, and SER results from the generated transcripts and the labels (single-file inference is sketched below).
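Under the hood, `batch_trans_xer.py` loops this kind of single-file inference over the whole dataset. The snippet assumes the deepspeech 0.9.x Python API and placeholder file names; older releases used different constructor and scorer arguments:

```python
import wave

import numpy as np
from deepspeech import Model

# Placeholder file names; use the model and scorer files you downloaded.
ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # KenLM-based scorer

# DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # the transcript that would be written to a .txt file
```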
This is a wrapper of the wav2letter++ model by Facebook AI Research. wav2letter++ is a fast, open-source speech processing toolkit from the Speech Team at Facebook AI Research, written entirely in C++ and built on the ArrayFire tensor library. Because the pretrained model had an error, we slightly modified the code and trained the model on LibriSpeech's train-clean-100 dataset for 24 hours. We also implemented the WER reporting.
Decoder: CTC + language model beam search
Language model: 3-gram LM trained on the LibriSpeech corpus
- Install wav2letter with Docker using `sudo docker run --runtime=nvidia --rm -itd --ipc=host --name w2l wav2letter/wav2letter:cuda-latest`.
- Run `Split.py` to extract and save the labels in the correct format.
- Run `WER` to calculate the WER results from the generated transcripts and the labels (the metric is sketched below).
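Both wrappers end with the same scoring step. As a reference for what these metrics mean, here is a minimal, self-contained sketch; the repository's own WER script may aggregate results differently:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, via dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution (free if equal)
    return d[len(hyp)]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over the reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: the same distance computed over characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def ser(refs, hyps):
    """Sentence error rate: fraction of utterances that are not exactly correct."""
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```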