In real-world environments, speech signals are often corrupted by ambient noise during acquisition, degrading the quality and intelligibility of the speech for a listener. As one of the central topics in speech processing, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods based on statistical signal processing have been proposed and widely used in the past. However, the performance of these methods is limited, and they fail in complex acoustic scenarios. Over the last decade, deep learning, as a primary tool for developing data-driven information systems, has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from the issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to address the current limitations of these three components.

First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, the CNN is employed to automatically extract sophisticated speech features that can maximize the performance of the model. An LSTM network is then chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. In addition, an attention mechanism is embedded to adaptively recalibrate the useful CNN-extracted features. Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural network-based speech enhancement methods in the presence of highly non-stationary noises, while exhibiting a relatively small number of parameters compared to some commonly employed deep neural network (DNN)-based methods.

Second, we note that most available DNN-based approaches to speech enhancement face a number of limitations: they do not exploit the information contained in the phase spectrum, while their high computational complexity and memory requirements make them unsuited for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using the phase derivative are proposed as key subtasks of the new network, simultaneously enhancing the magnitude and phase spectra. Moreover, the network is meticulously designed to take advantage of the strong temporal and spectral dependencies of speech, while its components perform independently and in parallel to speed up computation. The advantages of the proposed PACDNN model over some well-known DNN-based speech enhancement methods are demonstrated through extensive comparative experiments.

Finally, considering that some acoustic scenarios could be better handled by a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very-low-complexity, fully convolutional multi-branch speech enhancement (MBSE) framework that performs speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework consists of two main stages: classification and mapping. In the former, a CNN-based network classifies the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. In the latter, four well-trained CNNs, each specialized for a different simple task, transform the STMDCT of noisy input speech into that of clean speech. Since the framework operates in the STMDCT domain, there is no need to deal with phase information, i.e., no phase-related computation is required. Moreover, the training target length is only one-half of that in the previous chapters, leading to lower computational complexity and placing less demand on the mapping CNNs. Although there are multiple branches in the model, only one of the expert CNNs is active at a time, i.e., the computational burden is that of a single branch at any time. Also, the mapping CNNs are fully convolutional and their computations are performed in parallel, reducing the computation time. This framework also reduces the latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only gives superior speech enhancement performance but also has lower complexity than some existing deep learning-based methods.
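The phase-sensitive mask that serves as the training target of the first model can be sketched as follows. This is a minimal numpy illustration, not the thesis's implementation: the STFT framing and the CNN-LSTM estimator are omitted, and the clipping range is an assumption.

```python
import numpy as np

def phase_sensitive_mask(S, Y, eps=1e-8, clip=(0.0, 1.0)):
    """Phase-sensitive mask: |S|/|Y| * cos(theta_S - theta_Y).

    S, Y: complex STFT coefficients of clean and noisy speech.
    The clipping range is an assumption for training stability.
    """
    ratio = np.abs(S) / (np.abs(Y) + eps)
    phase_diff = np.angle(S) - np.angle(Y)
    psm = ratio * np.cos(phase_diff)
    return np.clip(psm, *clip)

# Enhancement applies the estimated mask to the noisy magnitude,
# |S_hat| = PSM * |Y|, and resynthesizes with the noisy phase.
```

When clean and noisy phases agree, the mask reduces to the plain magnitude ratio; a large phase mismatch drives the target toward zero, which is what makes the mask phase-sensitive.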
Item Type: | Thesis (PhD)
---|---
Institution: | Concordia University
Degree Name: | Ph.D.
Program: | Electrical and Computer Engineering
Date: | 9 June 2021
Thesis Supervisor(s): | Zhu, Wei-Ping
ID Code: | 988619
Deposited By: | Mojtaba Hasannezhad
Deposited On: | 29 Nov 2021 16:49
Last Modified: | 29 Nov 2021 16:49
Speech enhancement algorithms aim to improve the quality and intelligibility of speech signals degraded by noise, so as to aid human or machine interpretation of speech. Thanks to large-scale datasets and online simulation, supervised algorithms based on deep neural networks can accurately suppress non-stationary noise, making them useful in practice for real-time communication systems and as the front end of automatic speech recognition systems. Despite all these advances, the extent to which these algorithms are robust to adverse acoustic conditions and to the phonetic categories of speech stimuli is still being investigated.
This thesis addresses supervised speech enhancement in three parts. First, we describe the four-region error that serves as a diagnostic tool for speech enhancement algorithms. Compared to popular perceptual measures of speech quality, the four-region error distinguishes between two universal problems: under-suppression and over-suppression. We will show that all algorithms exhibit a trade-off between these error types and describe loss functions that balance the two. Second, we address the under-suppression problem within the frequency-domain speech enhancement framework. In the domain of instantaneous signal-to-noise ratio (ISNR), we unify algorithms trained on different targets. We will show that all methods face inevitable uncertainties as the ISNR decreases. We then introduce uncertainty learning that quantifies these uncertainties and improves noise reduction capability. Third, we address the over-suppression problem by incorporating phonetic information into the supervised framework. Through measurements of phonetically-dependent four-region error, we identify the over-suppression problem in obstruents in American English as the critical challenge of frequency-domain algorithms. We further identify a class of time-domain algorithms that exhibit different trade-offs and use them to train a phonetic segregation network. Finally, we explore phonetically-dependent channel selection rules to improve automatic speech recognition accuracy.
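The under-/over-suppression trade-off at the heart of the diagnostic above can be illustrated with a small numpy sketch. Note this is a simplified two-region split of per-bin gain error; the exact four-region definitions from the thesis are not reproduced here, and the function name is ours.

```python
import numpy as np

def suppression_errors(ideal_gain, est_gain):
    """Split per-bin gain error into under- and over-suppression.

    A bin is over-suppressed when the estimated gain removes more
    energy than the ideal gain (speech distortion), and
    under-suppressed when it removes less (residual noise).
    Both arrays hold gains in [0, 1] per time-frequency bin.
    """
    diff = est_gain - ideal_gain
    over = np.mean(np.maximum(-diff, 0.0))   # speech removed
    under = np.mean(np.maximum(diff, 0.0))   # noise left in
    return under, over
```

A loss function that balances the two error types, as described above, would weight these two terms against each other rather than minimizing a single symmetric distance.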
229 papers with code • 13 benchmarks • 20 datasets
Speech Enhancement is a signal processing task that involves improving the quality of speech signals captured under noisy or degraded conditions. The goal of speech enhancement is to make speech signals clearer, more intelligible, and more pleasant to listen to; enhanced speech is used in applications such as voice recognition, teleconferencing, and hearing aids.
(Image credit: A Fully Convolutional Neural Network For Speech Enhancement)
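A classical baseline for the task described above is magnitude spectral subtraction. The following is a minimal numpy sketch, not a production enhancer: the noise estimate is assumed to come from a speech-free segment, and the floor value is an assumption.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.05):
    """Classic magnitude spectral subtraction with a spectral floor.

    noisy_mag: magnitude spectrogram of the noisy speech.
    noise_mag: estimated noise magnitude (e.g. averaged over a
    speech-free segment). The floor keeps a fraction of the noisy
    magnitude to limit musical-noise artifacts.
    """
    enhanced = noisy_mag - noise_mag
    return np.maximum(enhanced, floor * noisy_mag)
```

Deep learning methods replace this fixed subtraction rule with a learned mapping from noisy to clean spectra, which is what allows them to handle non-stationary noise.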
Benchmark leaderboards: best models across the datasets include PESQetarian, MP-SENet, Inter-Channel Conv-TasNet, MaxDI (baseline), DCUnet-MC, SGMSE+, SepFormer, Audio-Visual concat-ref, SEMamba (+PCS), SE-MelGAN, and DeFT-AN.
Representative papers:

- Proximal Policy Optimization Algorithms: "We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a 'surrogate' objective function using stochastic gradient ascent."
- alexjc/neural-enhance (27 Mar 2016): "We consider image transformation problems, where an input image is transformed into an output image."
- "In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them."
- "The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms."
- "Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of the spectrogram while reusing the phase from noisy speech for reconstruction."
- "Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality."
- "In hearing aids, the presence of babble noise greatly degrades the intelligibility of human speech."
- "In our proposed FullSubNet, we connect a pure full-band model and a pure sub-band model sequentially and use practical joint training to integrate the advantages of these two types of models."
- JasonSWFu/MetricGAN (13 May 2019): "Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus may not always guide the generator in a GAN to generate data with improved metric scores."
- google/lyra (7 Jul 2021): "We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs."
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Shen, Huizhi | |
dc.date.accessioned | 2016-03-10T01:56:51Z | |
dc.date.available | 2016-03-10T01:56:51Z | |
dc.date.issued | 2016 | |
dc.identifier.citation | Shen, H. (2016). Speech enhancement via adaptive beamforming. Master's thesis, Nanyang Technological University, Singapore. | |
dc.identifier.uri | http://hdl.handle.net/10356/66088 | |
dc.description.abstract | Beamforming is an array signal processing technique for extracting signals from one or more directions while suppressing noise from other directions. Applications of the technique include direction-of-arrival (DOA) estimation of signal sources and directional signal enhancement. Over the past decades, several beamforming approaches have been proposed. Among them, the adaptive beamformer estimates its filter coefficients using knowledge of the signal and environment, making it popular for non-stationary environments. However, its performance can degrade significantly due to a large number of interferers, room reverberation, and DOA mismatch. The research documented in this thesis aims to achieve robust speech source extraction using single or distributed microphone arrays in a non-stationary environment with time-varying background noise and multiple speech interferers. To reduce the sensitivity of the adaptive beamformer to model mismatch, the probability of interference and/or noise occurrence is first estimated and then applied to the optimization process, where only contributions from interference and noise are utilized to ensure minimum distortion of the desired speech signal. The estimated coefficients are then adjusted to relax the DOA restriction for a reverberant environment. For a single array, this probability is obtained using properties of the Hermitian angle; for distributed arrays, mutual information provides knowledge of the presence of the common desired signal. | en_US
dc.format.extent | 71 p. | en_US |
dc.language.iso | en | en_US |
dc.subject | DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing | en_US |
dc.title | Speech enhancement via adaptive beamforming | en_US |
dc.type | Thesis | |
dc.contributor.supervisor | Andy Khong Wai Hoong | en_US |
dc.contributor.school | School of Electrical and Electronic Engineering | en_US |
dc.description.degree | Master of Engineering | en_US |
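The adaptive beamforming described in the abstract above centers on data-dependent weight computation. Below is a minimal numpy sketch of the classical MVDR solution such beamformers build on; the probability-weighted covariance estimation the thesis proposes is not shown, and the function and variable names are ours.

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer weights: w = R^{-1} d / (d^H R^{-1} d).

    R: (M, M) noise-plus-interference covariance matrix.
    d: (M,) steering vector toward the desired source.
    The weights minimize output power subject to a distortionless
    response (w^H d = 1) in the look direction.
    """
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# The beamformer output for an array snapshot x[t] is y[t] = w^H x[t].
```

With spatially white noise (R proportional to the identity), the weights reduce to delay-and-sum; the adaptive behavior comes entirely from estimating R from the data.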
Title: Multichannel Speech Enhancement Without Beamforming
Abstract: Deep neural networks are often coupled with traditional spatial filters, such as MVDR beamformers, to effectively exploit spatial information. Even though single-stage end-to-end supervised models can obtain impressive enhancement, combining them with a traditional beamformer and a DNN-based post-filter in a multistage processing pipeline provides additional improvements. In this work, we propose a two-stage strategy for multi-channel speech enhancement that does not require a traditional beamformer to obtain additional improvements. First, we propose a novel attentive dense convolutional network (ADCN) for estimating the real and imaginary parts of the complex spectrogram. ADCN obtains state-of-the-art results among single-stage models. Next, we use ADCN with a recently proposed triple-path attentive recurrent network (TPARN) for estimating waveform samples. The proposed strategy relies on two insights: first, using different approaches in the two stages; and second, using the stronger model in the first stage. We illustrate the efficacy of our strategy by evaluating multiple models in a two-stage approach with and without a traditional beamformer.
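The two stages described above work in different domains: the first estimates real and imaginary spectrogram parts, the second refines waveform samples. The numpy sketch below shows the glue between those representations and a common waveform-level metric; the networks themselves are omitted, and the function names are ours.

```python
import numpy as np

def apply_complex_estimate(real_part, imag_part):
    """Combine separately estimated real and imaginary parts into a
    complex spectrogram, as a first-stage model like ADCN would
    (the network producing the two parts is omitted here)."""
    return real_part + 1j * imag_part

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB, a common training/evaluation metric
    for waveform-domain second-stage models."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

Scale invariance (dividing out alpha) matters here because a waveform model may change the overall level without changing perceptual quality.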
Comments: | Accepted for publication in ICASSP 2022 |
Subjects: | Sound (cs.SD); Audio and Speech Processing (eess.AS) |
Cite as: | [cs.SD] |
Recommendations:

- A regression approach to single-channel speech separation via high-resolution deep neural networks: "We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other ..."
- "We address issues for improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates ..."
- "In this paper, we propose a novel wavelet denoising system using time-frequency adaptation to provide speech enhancement that is robust to non-stationary and colored noise. Different from conventional threshold-selection methods, e.g. invariant ..."
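The wavelet denoising approach in the last recommendation rests on coefficient thresholding. Below is a minimal numpy sketch of soft thresholding, the standard shrinkage rule such systems apply; the adaptive, per-region threshold selection the paper proposes is not shown, and a fixed threshold is an assumption.

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft thresholding of wavelet (or other transform) coefficients:
    shrink each coefficient toward zero by thr, zeroing anything whose
    magnitude falls below it. Small (mostly-noise) coefficients vanish
    while large (mostly-speech) ones are only slightly attenuated."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)
```

Time-frequency adaptation, as described above, would replace the single `thr` with a threshold that varies with the local noise estimate.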
Published in: Academic Press Ltd., United Kingdom
FASE model also demonstrates improvements in compute time. To the best of our knowledge, the FASE model is the first meta-learning architecture to address the problem of speaker dependency in audio-visual speech enhancement, using few-shot learning approaches. Extending few-shot learning methods to cope with more shots ...
representative corpus. This thesis explores two adaptations necessary to meet the requirements of real-time enhancement for lower-powered devices by resolving these two issues. Firstly, in Chapter 3, the nature of speech is exploited to impose a hierarchical structure on the clean speech corpus, to facilitate a tree-based search of the corpus.
speech enhancement techniques, algorithms are based on a model of the noisy speech, on a perceptual masking-threshold model of speech, or on a combination of the two. The generalized diagram of the single-channel enhancement technique is shown in Fig. 1 (Fig. 1: Single-channel enhancement technique). One of the early papers [1] in speech enhancement
environments. This thesis focuses on improving separation and enhancement performance in the real-world environment. The first contribution of this thesis is to address monaural speech separation and enhancement within reverberant room environments by designing new training targets and advanced network structures. The second contribution ...
I would like to thank Dr. David V. Anderson for this wonderful thesis opportunity as well as his guidance and support. Thanks to everyone in lab ESP who advised and helped me along the way. To Dr. Chin-hui Lee, whose lectures on Digital Speech Processing inspired my interest in speech processing; to Dr. Mark A. Davenport, whose materials in ...
Single-Microphone Speech Enhancement and Separation Using Deep Learning PhD Thesis Morten Kolbæk 2018 arXiv:1808.10620v2 [cs.SD] 4 Dec 2018
This thesis explores the possibility of enhancing noisy speech signals using deep neural networks. Signal enhancement is a classic problem in speech processing. In recent years, research using deep learning has been applied to many speech processing tasks, as it has provided very satisfactory results.
Except for simple cases where speech and noise can be easily separated in time or frequency, traditional single channel noise reduction methods can improve speech quality but not speech intelligibility, for reasons that are still not entirely understood. Single-channel speech intelligibility enhancement is more challenging than multi-channel
This thesis aims to investigate if and how pre-trained models like wav2vec 2.0 [4] or Whisper [5] can be used to improve the quality of neural speech enhancement models. A corpus of training and test data sets with additional audio artifacts like background noise, codec compression, reverberation and down-sampling are already available.
Figure 1.2: Signal flow graph of the Speech Enhancement System explored in this Thesis. The hypothesis of this Thesis lies in the fact that using an Automatic Speech Recognition (ASR) system as a loss function in a model architecture that has been proven to work well with speech enhancement could eventually close the gap ...
This thesis looks into one of the aspects that can ruin the availability of speech: background noise. The field of study is called speech enhancement and aims to remove noise and background intrusiveness to introduce clarity and intelligibility to the speech sample. Speech enhancement is an important subject that has several applications ...
Wang-Y-2016-PhD-Thesis.pdf: Thesis: 8.15 MB: Adobe PDF: View/Open. Title: Speech enhancement in the modulation domain: Authors: Wang, Yu: Item Type: Thesis or dissertation: Abstract: The goal of a speech enhancement algorithm is to reduce or eliminate background noise without distorting the speech signal. Although speech enhancement is ...
The next work approaches the speech enhancement problem in wirelessly connected binaural hearing aids. In this case, the two devices are connected by a wireless link, which increases the power consumption. The objective of this thesis is the design of low-cost speech enhancement algorithms that increase the energy efficiency ...
Gaze Strategies and Audiovisual Speech Enhancement by Astrid Yi. A thesis submitted in conformity with the requirements ... demonstrating an audiovisual speech enhancement of 35% when subjects wore a visual lipreading aid which encoded different speech signal features (voice pitch, energy of the
In this thesis, two topics are integrated: the famous MMSE estimator, the Kalman filter, and speech processing. In other words, the application of the Kalman filter in speech enhancement is explored in detail. Speech enhancement is the removal of noise from corrupted speech and has applications in cellular and radio communication, voice-controlled ...
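The Kalman filtering approach in the snippet above can be illustrated with a scalar sketch. This is a minimal numpy example, not the thesis's method: speech is modeled as a first-order AR process, and the AR coefficient and variances are assumptions (in practice they would be estimated from the signal).

```python
import numpy as np

def kalman_denoise(y, a=0.95, q=0.1, r=1.0):
    """Scalar Kalman filter for speech modeled as s[t] = a*s[t-1] + w[t],
    observed as y[t] = s[t] + v[t], with process variance q and
    observation (noise) variance r."""
    s_hat, p = 0.0, 1.0
    out = []
    for obs in y:
        # Predict
        s_pred = a * s_hat
        p_pred = a * a * p + q
        # Update
        k = p_pred / (p_pred + r)            # Kalman gain
        s_hat = s_pred + k * (obs - s_pred)  # blend prediction and data
        p = (1 - k) * p_pred
        out.append(s_hat)
    return np.array(out)
```

The filter is the MMSE estimator for this linear-Gaussian model, which is exactly the connection between the two topics the thesis integrates.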
speech from noise requires prior knowledge of both, as the mask is created based off of the relative strengths of the speech signal and the noise. This strategy also faces difficulty if the noise and target speech occupy similar frequency ranges, as is the case with babble noise. More recent studies in speech enhancement related to the cocktail ...
This thesis deals with speech enhancement, which refers to the restoration of clean speech. Speech enhancement implementations should preferably be robust in the environments for which they are intended. Moreover, versatility and flexibility are key features for speech enhancement devices, e.g. the ability to adapt to changing ...
Speech Enhancement in Hands-Free Device (Hearing Aid) with Emphasis on Elko's Beamformer. Master's thesis by Telagareddi S N U V Ramesh, presented as part of the Degree of Master of Science in Electrical Engineering with Emphasis on Signal Processing, Blekinge Institute of Technology, April 2012.
Abstract. Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted ...
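The colored-spectrogram idea above amounts to mapping scalar magnitudes to a three-channel image before feeding a DNN. The numpy sketch below uses a simple hand-rolled red-green-blue ramp as a hypothetical stand-in for the colormap; the actual mapping used in the paper is not specified here.

```python
import numpy as np

def spectrogram_to_rgb(mag, eps=1e-8):
    """Map a magnitude spectrogram to a 3-channel 'colored' image by
    normalizing log-magnitudes to [0, 1] and applying a simple
    piecewise-linear color ramp (an assumed colormap)."""
    logm = np.log10(mag + eps)
    x = (logm - logm.min()) / (logm.max() - logm.min() + eps)
    r = np.clip(2 * x - 1, 0, 1)    # high-energy bins -> red
    g = 1 - np.abs(2 * x - 1)       # mid-energy bins -> green
    b = np.clip(1 - 2 * x, 0, 1)    # low-energy bins -> blue
    return np.stack([r, g, b], axis=-1)
```

The resulting array has an image-like shape, which is what lets image-oriented DNN architectures be adapted to the enhancement task.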