Single-channel speech denoising by masking the colored spectrograms
| Main Author: | |
|---|---|
| Other Authors: | |
| Published: | 2025 |
| Subjects: | |
| Summary: | <p>Speech denoising (SD) covers the algorithms that remove background noise from the target speech and thus improve its quality and intelligibility. In this paper, a novel SD technique is proposed that masks the colored spectrogram. U-Net (a deep neural network originally developed for image segmentation) is trained on the noisy log-powered colored spectrograms (LPcS), using the binarized Mel spectrograms as ground truth (GT). After training, the colored spectrogram of the noisy speech is passed through U-Net, which generates a soft mask at its output. This mask is applied to the magnitude matrix of the short-time Fourier transform (STFT) of the noisy speech to obtain the magnitude matrix of the estimated speech. This matrix is then combined with the noisy phase matrix to recover the target speech. The results show that, with masking-based targets, the colored spectrograms provide an improvement of 0.12 points in the perceptual evaluation of speech quality (PESQ) score, an improvement of 4% in short-time objective intelligibility (STOI), and a 163-fold reduction in learnable network parameters, compared with processing them by a mapping-based model that uses a pix2pix generative adversarial network (GAN) followed by a feedforward regression neural network. With a slightly reduced PESQ score (by 0.58 points), the proposed model offers an improvement of 2% in STOI and reductions of 4375 and 1135 times, respectively, in the required number of training epochs and network parameters when compared to a GAN-based model augmented by WavLM, a large-scale self-supervised learning model. Similarly, it offers an improvement of 1% in STOI and reductions of 33 and 200 times, respectively, in network size and training epochs when compared to a complex variational U-Net-based model. Also, with comparable PESQ, the proposed system offers an improvement of almost 2% in STOI, a 2-fold reduction in network size, and a 100-fold reduction in training epochs when compared to a lightweight system that uses automatic dimension reduction of network layers by a structured pruning method.</p><h2>Other Information</h2> <p> Published in: Computers and Electrical Engineering<br> License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1016/j.compeleceng.2025.110656" target="_blank">https://dx.doi.org/10.1016/j.compeleceng.2025.110656</a></p> |
|---|
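
The reconstruction step described in the summary (soft mask applied to the noisy STFT magnitude, then recombined with the noisy phase) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the trained U-Net is not available here, so the mask is a stand-in computed from a known clean signal. The frame length, hop size, test tone, noise level, and the helper names `stft`, `istft`, and `apply_soft_mask` are all assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT of shape (n_fft // 2 + 1, n_frames), periodic Hann window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add with window-squared normalization."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=0) * window[:, None]
    n_frames = spec.shape[1]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[:, i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-12)

def apply_soft_mask(noisy, mask, n_fft=512, hop=128):
    """Scale the noisy STFT magnitude by a soft mask and reuse the noisy phase."""
    spec = stft(noisy, n_fft, hop)
    magnitude, phase = np.abs(spec), np.angle(spec)
    est_magnitude = mask * magnitude               # estimated speech magnitude
    est_spec = est_magnitude * np.exp(1j * phase)  # recombine with noisy phase
    return istft(est_spec, n_fft, hop)

# Toy data (assumption): a 440 Hz tone in white noise, 1 second at 16 kHz.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(fs)

# Stand-in for the U-Net output (assumption): a ratio-style soft mask in [0, 1]
# built from the known clean signal, which a real system would not have.
mask = np.clip(np.abs(stft(clean)) / (np.abs(stft(noisy)) + 1e-8), 0.0, 1.0)
denoised = apply_soft_mask(noisy, mask)
```

In a real pipeline the `mask` array would instead come from the network's output, resized to the STFT's frequency-by-frame shape. Because only the magnitude is modified, any phase distortion introduced by the noise remains in the recovered signal.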