Denoising

This repo implements the DEMUCS model proposed in "Real Time Speech Enhancement in the Waveform Domain" from scratch in PyTorch. The model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains using multiple loss functions. A web interface for this project is available on Hugging Face: you can record your voice in noisy conditions and get a denoised version produced by the DEMUCS model.
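
The architecture can be pictured as a stack of strided 1-D convolutions that compress the raw waveform, a sequence model at the bottleneck, and transposed convolutions that reconstruct it, with each encoder output added back to the matching decoder input. The sketch below is a minimal illustration of that idea, not the repo's actual model class; the `EncoderDecoderSketch` name, layer sizes, and length handling are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class EncoderDecoderSketch(nn.Module):
    """Minimal DEMUCS-style U-Net over raw waveforms (illustrative, not the repo's class)."""

    def __init__(self, hidden=48, depth=3, kernel_size=8, stride=2):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch, ch = 1, hidden
        for _ in range(depth):
            # Encoder block: strided conv downsamples time, 1x1 conv + GLU mixes channels.
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel_size, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
            ))
            # Matching decoder block, inserted at the front so the deepest one runs first.
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(ch, in_ch, kernel_size, stride),
            ))
            in_ch, ch = ch, 2 * ch
        # Sequence model at the bottleneck (the paper uses an LSTM here).
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2, batch_first=True)

    def forward(self, noisy):                       # noisy: (batch, 1, time)
        x, skips = noisy, []
        for encode in self.encoder:
            x = encode(x)
            skips.append(x)                         # saved for the skip connection
        x = self.lstm(x.permute(0, 2, 1))[0].permute(0, 2, 1)
        for decode in self.decoder:
            skip = skips.pop()
            n = min(x.shape[-1], skip.shape[-1])    # crude length alignment for the sketch
            x = decode(x[..., :n] + skip[..., :n])  # skip connection from the encoder
        return x                                    # estimated clean waveform
```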

Potential usages

  • Real-time denoising in communication systems (such as Skype)
  • Improving speech assistants (the ASR front end)

Data

This project uses the Valentini dataset, a parallel database of clean and noisy speech. The database was designed for training and testing speech enhancement methods that operate at 48 kHz. It contains 56 speakers and roughly 10 GB of speech data.

For further improvement, the model could be trained on the larger training set from the DNS Challenge.

Training

The training process is implemented in PyTorch. The data consists of (noisy speech, clean speech) pairs that are loaded as 2-second samples, randomly cropped from the audio and zero-padded if necessary. The model is optimized with SGD. Two loss functions are used: L1 loss and MultiResolutionSTFTLoss, where MultiResolutionSTFTLoss is the sum of STFT losses computed over different window sizes, hop sizes and FFT sizes.
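
As a rough illustration of that pipeline, the sketch below shows an aligned random 2-second crop and a hand-rolled multi-resolution STFT term added to the time-domain L1 loss. The function names, the FFT/hop/window settings, and the 16 kHz sample rate are assumptions for the example, not the repo's exact values.

```python
import torch
import torch.nn.functional as F


def random_crop_pair(noisy, clean, sr=16000, seconds=2):
    """Cut an aligned random 2 s segment from a (noisy, clean) pair,
    zero-padding on the right when the recording is shorter."""
    target = sr * seconds
    if noisy.shape[-1] < target:
        pad = target - noisy.shape[-1]
        noisy, clean = F.pad(noisy, (0, pad)), F.pad(clean, (0, pad))
    start = torch.randint(0, noisy.shape[-1] - target + 1, (1,)).item()
    return noisy[..., start:start + target], clean[..., start:start + target]


def _stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs()


def multi_resolution_stft_loss(est, ref,
                               resolutions=((512, 50, 240),
                                            (1024, 120, 600),
                                            (2048, 240, 1200))):
    """Sum of spectral-convergence and log-magnitude L1 terms over several
    (n_fft, hop, win) settings -- the multi-resolution STFT idea."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        m_est, m_ref = _stft_mag(est, n_fft, hop, win), _stft_mag(ref, n_fft, hop, win)
        sc = torch.norm(m_ref - m_est) / torch.norm(m_ref)
        log_mag = F.l1_loss(torch.log(m_est + 1e-7), torch.log(m_ref + 1e-7))
        loss = loss + sc + log_mag
    return loss


def total_loss(est, ref):
    # est, ref: (batch, time) waveforms.
    # Time-domain L1 plus the multi-resolution STFT term.
    return F.l1_loss(est, ref) + multi_resolution_stft_loss(est, ref)
```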

Metrics

  • Perceptual Evaluation of Speech Quality (PESQ)
  • Short-Time Objective Intelligibility (STOI)

PESQ is used to estimate overall speech quality after denoising, while STOI estimates speech intelligibility; the STOI measure is highly correlated with the intelligibility of degraded speech signals.
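
As an example, both metrics can be computed per utterance with the `pesq` and `pystoi` packages; this helper and the 16 kHz wideband setting are assumptions for illustration, not necessarily how this repo evaluates.

```python
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi


def evaluate_pair(clean: np.ndarray, denoised: np.ndarray, sr: int = 16000):
    """PESQ and STOI for one utterance; both inputs are 1-D float arrays at `sr`.
    Wideband PESQ ('wb') expects 16 kHz audio."""
    pesq_score = pesq(sr, clean, denoised, "wb")
    stoi_score = stoi(clean, denoised, sr, extended=False)
    return pesq_score, stoi_score
```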

Experiments

Experiments are tracked with a locally hosted Weights & Biases server. Configs for different experiments are managed with Hydra, which makes it easy to track configurations and override parameters.
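
For illustration, a Hydra entry point typically looks like the sketch below; the config path, file name, and fields shown are hypothetical, not the repo's actual config.

```python
# conf/config.yaml (hypothetical contents):
#   model:
#     hidden: 64
#     depth: 5
#   train:
#     lr: 3e-4
#     batch_size: 16
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))   # the resolved config is logged with each run
    # ... build the model and optimizer from cfg.model / cfg.train here ...


if __name__ == "__main__":
    train()
```

Any field can then be overridden from the command line, e.g. `python train.py train.lr=1e-4 model.depth=5`, and Hydra keeps the resulting config alongside the run outputs.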

| Experiment | Description | Result |
| --- | --- | --- |
| Baseline | Initial experiment with L1 loss | Poor quality |
| Baseline_L1_Multi_STFT_loss | Changed the loss to Multi STFT + L1 | Better performance |
| L1_Multi_STFT_no_resample | Tried training without resampling | No improvement, probably because of the ReLU on the last layer |
| Updated_DEMUCS | Removed the ReLU from the last layer | Significant improvement |
| wav_normalization | Tried normalizing the waveform by its std during training | Small improvement |
| original_sr | Trained with the original sample rate | Significant improvement |
| increased_L | Increased the number of encoder-decoder pairs from 3 to 5 | Performance comparable with original_sr |
| double_sr | Trained with double the sample rate | Small improvement |
| replicate paper | Lowered the learning rate and fixed a bug in the dataloader | Massive improvement! |

![img.png](img.png)

Final model

```yaml
H: 64
L: 5
encoder:
  conv1:
    kernel_size: 8
    stride: 2
  conv2:
    kernel_size: 1
    stride: 1
decoder:
  conv1:
    kernel_size: 1
    stride: 1
  conv2:
    kernel_size: 8
    stride: 2
```
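
Plugging these values into the illustrative `EncoderDecoderSketch` from the sketch above would look like the snippet below; the real model class in the repo will differ in details, and the 16 kHz rate is an assumption.

```python
import torch

# H = 64 channels in the first layer, L = 5 encoder-decoder pairs (illustrative mapping).
model = EncoderDecoderSketch(hidden=64, depth=5, kernel_size=8, stride=2)
noisy = torch.randn(1, 1, 2 * 16000)   # one 2-second mono clip at 16 kHz (assumed rate)
with torch.no_grad():
    denoised = model(noisy)
print(denoised.shape)                   # roughly (1, 1, 32000); trim/pad to match targets
```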

Testing

| Model | valentini_PESQ | valentini_STOI |
| --- | --- | --- |
| Spectral Gating | 1.7433 | 0.8844 |
| Demucs (this repo) | 2.4838 | 0.9192 |
| DEMUCS (facebook) | 3.0795 | |
