In real-world environments, speech signals are often corrupted by ambient noise during acquisition, degrading the quality and intelligibility of the speech for a listener. As one of the central topics in speech processing, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods based on statistical signal processing have been proposed and widely used in the past. However, the performance of these methods is limited, and they fail in complex acoustic scenarios. Over the last decade, deep learning, as a primary tool for developing data-driven information systems, has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from the issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to address the current limitations of these three components.

First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, the CNN is employed to automatically extract sophisticated speech features that can maximize the performance of the model. An LSTM network is then chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. In addition, an attention mechanism is embedded to adaptively recalibrate the useful CNN-extracted features. Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural network-based speech enhancement methods in the presence of highly non-stationary noises, while exhibiting a relatively small number of parameters compared to some commonly employed deep neural network (DNN)-based methods.

Second, we note that most available DNN-based approaches to speech enhancement face a number of limitations: they do not exploit the information contained in the phase spectrum, while their high computational complexity and memory requirements make them unsuited for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using the phase derivative are proposed as key subtasks of the new network, simultaneously enhancing the magnitude and phase spectra. Moreover, the network is meticulously designed to take advantage of the strong temporal and spectral dependencies of speech, while its components perform independently and in parallel to speed up computation. The advantages of the proposed PACDNN model over some well-known DNN-based speech enhancement methods are demonstrated through extensive comparative experiments.

Finally, considering that some acoustic scenarios could be better handled by a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very-low-complexity, fully convolutional multi-branch speech enhancement (MBSE) framework that performs speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework consists of two main stages: classification and mapping. In the former, a CNN-based network classifies the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. In the latter, four well-trained CNNs, each specialized for a different simple task, transform the STMDCT of noisy input speech into that of clean speech. Since the framework operates in the STMDCT domain, there is no need to deal with phase information, i.e., no phase-related computation is required. Moreover, the training target length is only one-half of that in the previous chapters, leading to lower computational complexity and placing less demand on the mapping CNNs. Although there are multiple branches in the model, only one of the expert CNNs is active at a time, i.e., the computational burden is that of a single branch at any time. Also, the mapping CNNs are fully convolutional and their computations are performed in parallel, reducing the computation time. This framework also reduces the latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only gives superior speech enhancement performance but also has lower complexity than some existing deep learning-based methods.
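The phase-sensitive mask that serves as the training target of the first model can be sketched as follows. This is a minimal numpy illustration, not the thesis's implementation: the STFT framing and the CNN-LSTM estimator are omitted, and the clipping range is an assumption.

```python
import numpy as np

def phase_sensitive_mask(S, Y, eps=1e-8, clip=(0.0, 1.0)):
    """Phase-sensitive mask: |S|/|Y| * cos(theta_S - theta_Y).

    S, Y: complex STFT coefficients of clean and noisy speech.
    The clipping range is an assumption for training stability.
    """
    ratio = np.abs(S) / (np.abs(Y) + eps)
    phase_diff = np.angle(S) - np.angle(Y)
    psm = ratio * np.cos(phase_diff)
    return np.clip(psm, *clip)

# Enhancement applies the estimated mask to the noisy magnitude,
# |S_hat| = PSM * |Y|, and resynthesizes with the noisy phase.
```

When clean and noisy phases agree, the mask reduces to the plain magnitude ratio; a large phase mismatch drives the target toward zero, which is what makes the mask phase-sensitive.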
Item Type: | Thesis (PhD)
---|---
Institution: | Concordia University
Degree Name: | Ph.D.
Program: | Electrical and Computer Engineering
Date: | 9 June 2021
Thesis Supervisor(s): | Zhu, Wei-Ping
ID Code: | 988619
Deposited By: | Mojtaba Hasannezhad
Deposited On: | 29 Nov 2021 16:49
Last Modified: | 29 Nov 2021 16:49
Speech enhancement algorithms aim to improve the quality and intelligibility of speech signals degraded by noise, so as to aid human or machine interpretation of speech. Thanks to large-scale datasets and online simulation, supervised algorithms based on deep neural networks can accurately suppress non-stationary noise, making them useful in practice for real-time communication systems and as the front end of automatic speech recognition systems. Despite all these advances, the extent to which these algorithms are robust to adverse acoustic conditions and to the phonetic categories of speech stimuli is still being investigated.
This thesis addresses supervised speech enhancement in three parts. First, we describe the four-region error that serves as a diagnostic tool for speech enhancement algorithms. Compared to popular perceptual measures of speech quality, the four-region error distinguishes between two universal problems: under-suppression and over-suppression. We will show that all algorithms exhibit a trade-off between these error types and describe loss functions that balance the two. Second, we address the under-suppression problem within the frequency-domain speech enhancement framework. In the domain of instantaneous signal-to-noise ratio (ISNR), we unify algorithms trained on different targets. We will show that all methods face inevitable uncertainties as the ISNR decreases. We then introduce uncertainty learning that quantifies these uncertainties and improves noise reduction capability. Third, we address the over-suppression problem by incorporating phonetic information into the supervised framework. Through measurements of phonetically-dependent four-region error, we identify the over-suppression problem in obstruents in American English as the critical challenge of frequency-domain algorithms. We further identify a class of time-domain algorithms that exhibit different trade-offs and use them to train a phonetic segregation network. Finally, we explore phonetically-dependent channel selection rules to improve automatic speech recognition accuracy.
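The under-/over-suppression trade-off at the heart of the diagnostic above can be illustrated with a small numpy sketch. Note this is a simplified two-region split of per-bin gain error; the exact four-region definitions from the thesis are not reproduced here, and the function name is ours.

```python
import numpy as np

def suppression_errors(ideal_gain, est_gain):
    """Split per-bin gain error into under- and over-suppression.

    A bin is over-suppressed when the estimated gain removes more
    energy than the ideal gain (speech distortion), and
    under-suppressed when it removes less (residual noise).
    Both arrays hold gains in [0, 1] per time-frequency bin.
    """
    diff = est_gain - ideal_gain
    over = np.mean(np.maximum(-diff, 0.0))   # speech removed
    under = np.mean(np.maximum(diff, 0.0))   # noise left in
    return under, over
```

A loss function that balances the two error types, as described above, would weight these two terms against each other rather than minimizing a single symmetric distance.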
229 papers with code • 13 benchmarks • 20 datasets
Speech Enhancement is a signal processing task that involves improving the quality of speech signals captured under noisy or degraded conditions. The goal of speech enhancement is to make speech signals clearer, more intelligible, and more pleasant to listen to; enhanced speech is used in applications such as voice recognition, teleconferencing, and hearing aids.
(Image credit: A Fully Convolutional Neural Network For Speech Enhancement)
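A classical baseline for the task described above is magnitude spectral subtraction. The following is a minimal numpy sketch, not a production enhancer: the noise estimate is assumed to come from a speech-free segment, and the floor value is an assumption.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.05):
    """Classic magnitude spectral subtraction with a spectral floor.

    noisy_mag: magnitude spectrogram of the noisy speech.
    noise_mag: estimated noise magnitude (e.g. averaged over a
    speech-free segment). The floor keeps a fraction of the noisy
    magnitude to limit musical-noise artifacts.
    """
    enhanced = noisy_mag - noise_mag
    return np.maximum(enhanced, floor * noisy_mag)
```

Deep learning methods replace this fixed subtraction rule with a learned mapping from noisy to clean spectra, which is what allows them to handle non-stationary noise.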
Benchmark leaderboards: best models across the datasets include PESQetarian, MP-SENet, Inter-Channel Conv-TasNet, MaxDI (baseline), DCUnet-MC, SGMSE+, SepFormer, Audio-Visual concat-ref, SEMamba (+PCS), SE-MelGAN, and DeFT-AN.
Representative papers:

- Proximal Policy Optimization Algorithms: "We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a 'surrogate' objective function using stochastic gradient ascent."
- alexjc/neural-enhance (27 Mar 2016): "We consider image transformation problems, where an input image is transformed into an output image."
- "In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them."
- "The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms."
- "Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of the spectrogram while reusing the phase from noisy speech for reconstruction."
- "Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality."
- "In hearing aids, the presence of babble noise greatly degrades the intelligibility of human speech."
- "In our proposed FullSubNet, we connect a pure full-band model and a pure sub-band model sequentially and use practical joint training to integrate the advantages of these two types of models."
- JasonSWFu/MetricGAN (13 May 2019): "Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus may not always guide the generator in a GAN to generate data with improved metric scores."
- google/lyra (7 Jul 2021): "We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs."
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Shen, Huizhi | |
dc.date.accessioned | 2016-03-10T01:56:51Z | |
dc.date.available | 2016-03-10T01:56:51Z | |
dc.date.issued | 2016 | |
dc.identifier.citation | Shen, H. (2016). Speech enhancement via adaptive beamforming. Master's thesis, Nanyang Technological University, Singapore. | |
dc.identifier.uri | http://hdl.handle.net/10356/66088 | |
dc.description.abstract | Beamforming is an array signal processing technique for extracting signals from one or more directions while suppressing noise from other directions. Applications of the technique include direction-of-arrival (DOA) estimation of signal sources and directional signal enhancement. Over the past decades, several beamforming approaches have been proposed. Among them, the adaptive beamformer estimates its filter coefficients using knowledge of the signal and environment, making it popular for non-stationary environments. However, its performance can degrade significantly due to a large number of interferers, room reverberation, and DOA mismatch. The research documented in this thesis aims to achieve robust speech source extraction using single or distributed microphone arrays in a non-stationary environment with time-varying background noise and multiple speech interferers. To reduce the sensitivity of the adaptive beamformer to model mismatch, the probability of interference and/or noise occurrence is first estimated and then applied to the optimization process, where only contributions from interference and noise are utilized to ensure minimum distortion of the desired speech signal. The estimated coefficients are then adjusted to relax the DOA restriction for a reverberant environment. For a single array, this probability is obtained using properties of the Hermitian angle; for distributed arrays, mutual information provides knowledge of the presence of the common desired signal. | en_US
dc.format.extent | 71 p. | en_US |
dc.language.iso | en | en_US |
dc.subject | DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing | en_US |
dc.title | Speech enhancement via adaptive beamforming | en_US |
dc.type | Thesis | |
dc.contributor.supervisor | Andy Khong Wai Hoong | en_US |
dc.contributor.school | School of Electrical and Electronic Engineering | en_US |
dc.description.degree | Master of Engineering | en_US |
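The adaptive beamforming described in the abstract above centers on data-dependent weight computation. Below is a minimal numpy sketch of the classical MVDR solution such beamformers build on; the probability-weighted covariance estimation the thesis proposes is not shown, and the function and variable names are ours.

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer weights: w = R^{-1} d / (d^H R^{-1} d).

    R: (M, M) noise-plus-interference covariance matrix.
    d: (M,) steering vector toward the desired source.
    The weights minimize output power subject to a distortionless
    response (w^H d = 1) in the look direction.
    """
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# The beamformer output for an array snapshot x[t] is y[t] = w^H x[t].
```

With spatially white noise (R proportional to the identity), the weights reduce to delay-and-sum; the adaptive behavior comes entirely from estimating R from the data.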
Title: Multichannel Speech Enhancement Without Beamforming
Abstract: Deep neural networks are often coupled with traditional spatial filters, such as MVDR beamformers, to effectively exploit spatial information. Even though single-stage end-to-end supervised models can obtain impressive enhancement, combining them with a traditional beamformer and a DNN-based post-filter in a multistage processing pipeline provides additional improvements. In this work, we propose a two-stage strategy for multi-channel speech enhancement that does not require a traditional beamformer to obtain additional improvements. First, we propose a novel attentive dense convolutional network (ADCN) for estimating the real and imaginary parts of the complex spectrogram. ADCN obtains state-of-the-art results among single-stage models. Next, we use ADCN with a recently proposed triple-path attentive recurrent network (TPARN) for estimating waveform samples. The proposed strategy relies on two insights: first, using different approaches in the two stages; and second, using the stronger model in the first stage. We illustrate the efficacy of our strategy by evaluating multiple models in a two-stage approach with and without a traditional beamformer.
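The two stages described above work in different domains: the first estimates real and imaginary spectrogram parts, the second refines waveform samples. The numpy sketch below shows the glue between those representations and a common waveform-level metric; the networks themselves are omitted, and the function names are ours.

```python
import numpy as np

def apply_complex_estimate(real_part, imag_part):
    """Combine separately estimated real and imaginary parts into a
    complex spectrogram, as a first-stage model like ADCN would
    (the network producing the two parts is omitted here)."""
    return real_part + 1j * imag_part

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB, a common training/evaluation metric
    for waveform-domain second-stage models."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

Scale invariance (dividing out alpha) matters here because a waveform model may change the overall level without changing perceptual quality.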
Comments: | Accepted for publication in ICASSP 2022 |
Subjects: | Sound (cs.SD); Audio and Speech Processing (eess.AS) |
Cite as: | [cs.SD] |
Recommendations:

- A regression approach to single-channel speech separation via high-resolution deep neural networks: "We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed signal containing a target speaker and other ..."
- "We address issues for improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates ..."
- "In this paper, we propose a novel wavelet denoising system using time-frequency adaptation to provide speech enhancement that is robust to non-stationary and colored noise. Different from conventional threshold-selection methods, e.g. invariant ..."
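The wavelet denoising approach in the last recommendation rests on coefficient thresholding. Below is a minimal numpy sketch of soft thresholding, the standard shrinkage rule such systems apply; the adaptive, per-region threshold selection the paper proposes is not shown, and a fixed threshold is an assumption.

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft thresholding of wavelet (or other transform) coefficients:
    shrink each coefficient toward zero by thr, zeroing anything whose
    magnitude falls below it. Small (mostly-noise) coefficients vanish
    while large (mostly-speech) ones are only slightly attenuated."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)
```

Time-frequency adaptation, as described above, would replace the single `thr` with a threshold that varies with the local noise estimate.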
Published in: Academic Press Ltd., United Kingdom
FASE model also demonstrates improvements in compute time. To the best of our knowledge, the FASE model is the first meta-learning architecture to address the problem of speaker dependency in audio-visual speech enhancement, using few-shot learning approaches. Extending few-shot learning methods to cope with more shots ...
representative corpus. This thesis explores two adaptations necessary to meet the requirements of real-time enhancement for lower-powered devices by resolving these two issues. Firstly, in Chapter 3, the nature of speech is exploited to impose a hierarchical structure on the clean speech corpus, to facilitate a tree-based search of the corpus.
speech enhancement techniques, algorithms are based on a model of the noisy speech, on a perceptual masking-threshold model of speech, or on a combination of the two. The generalized diagram of the single-channel enhancement technique is shown in Fig. 1 (Fig. 1: Single-channel enhancement technique). One of the early papers [1] in speech enhancement
environments. This thesis focuses on improving separation and enhancement performance in the real-world environment. The first contribution of this thesis is to address monaural speech separation and enhancement within reverberant room environments by designing new training targets and advanced network structures. The second contribution ...
I would like to thank Dr. David V. Anderson for this wonderful thesis opportunity as well as his guidance and support. Thanks to everyone in lab ESP who advised and helped me along the way. To Dr. Chin-hui Lee, whose lectures on Digital Speech Processing inspired my interest in speech processing; to Dr. Mark A. Davenport, whose materials in ...
Single-Microphone Speech Enhancement and Separation Using Deep Learning PhD Thesis Morten Kolbæk 2018 arXiv:1808.10620v2 [cs.SD] 4 Dec 2018
This thesis explores the possibility of enhancing noisy speech signals using deep neural networks. Signal enhancement is a classic problem in speech processing. In recent years, research using deep learning has been applied to many speech processing tasks, as it has provided very satisfactory results.
Except for simple cases where speech and noise can be easily separated in time or frequency, traditional single channel noise reduction methods can improve speech quality but not speech intelligibility, for reasons that are still not entirely understood. Single-channel speech intelligibility enhancement is more challenging than multi-channel
This thesis aims to investigate if and how pre-trained models like wav2vec 2.0 [4] or Whisper [5] can be used to improve the quality of neural speech enhancement models. A corpus of training and test data sets with additional audio artifacts like background noise, codec compression, reverberation and down-sampling are already available.
Figure 1.2: Signal flow graph of the Speech Enhancement System explored in this Thesis. The hypothesis of this Thesis lies in the fact that using an Automatic Speech Recognition (ASR) system as a loss function in a model architecture that has been proven to work well with speech enhancement could eventually close the gap ...
This thesis looks into one of the aspects that can ruin the availability of speech: background noise. The field of study is called speech enhancement and aims to remove noise and background intrusiveness to introduce clarity and intelligibility to the speech sample. Speech enhancement is an important subject that has several applications ...
Wang-Y-2016-PhD-Thesis.pdf: Thesis: 8.15 MB: Adobe PDF: View/Open. Title: Speech enhancement in the modulation domain: Authors: Wang, Yu: Item Type: Thesis or dissertation: Abstract: The goal of a speech enhancement algorithm is to reduce or eliminate background noise without distorting the speech signal. Although speech enhancement is ...
The next work approaches the speech enhancement problem in wirelessly connected binaural hearing aids. In this case, the two devices are connected by a wireless link, which increases the power consumption. The objective of this thesis is the design of low-cost speech enhancement algorithms that increase the energy efficiency ...
Gaze Strategies and Audiovisual Speech Enhancement by Astrid Yi. A thesis submitted in conformity with the requirements ... demonstrating an audiovisual speech enhancement of 35% when subjects wore a visual lipreading aid which encoded different speech signal features (voice pitch, energy of the
In this thesis, two topics are integrated: the famous MMSE estimator, the Kalman filter, and speech processing. In other words, the application of the Kalman filter in speech enhancement is explored in detail. Speech enhancement is the removal of noise from corrupted speech and has applications in cellular and radio communication, voice-controlled ...
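The Kalman filtering approach in the snippet above can be illustrated with a scalar sketch. This is a minimal numpy example, not the thesis's method: speech is modeled as a first-order AR process, and the AR coefficient and variances are assumptions (in practice they would be estimated from the signal).

```python
import numpy as np

def kalman_denoise(y, a=0.95, q=0.1, r=1.0):
    """Scalar Kalman filter for speech modeled as s[t] = a*s[t-1] + w[t],
    observed as y[t] = s[t] + v[t], with process variance q and
    observation (noise) variance r."""
    s_hat, p = 0.0, 1.0
    out = []
    for obs in y:
        # Predict
        s_pred = a * s_hat
        p_pred = a * a * p + q
        # Update
        k = p_pred / (p_pred + r)            # Kalman gain
        s_hat = s_pred + k * (obs - s_pred)  # blend prediction and data
        p = (1 - k) * p_pred
        out.append(s_hat)
    return np.array(out)
```

The filter is the MMSE estimator for this linear-Gaussian model, which is exactly the connection between the two topics the thesis integrates.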
speech from noise requires prior knowledge of both, as the mask is created based off of the relative strengths of the speech signal and the noise. This strategy also faces difficulty if the noise and target speech occupy similar frequency ranges, as is the case with babble noise. More recent studies in speech enhancement related to the cocktail ...
This thesis deals with speech enhancement, which refers to the restoration of clean speech. Speech enhancement implementations should preferably be robust in the environments for which they are intended. Moreover, versatility and flexibility are key features for speech enhancement devices, e.g. the ability to adapt to changing ...
Speech Enhancement in Hands-Free Device (Hearing Aid) with Emphasis on Elko's Beamformer. Master's thesis by Telagareddi S N U V Ramesh, presented as part of the Degree of Master of Science in Electrical Engineering with Emphasis on Signal Processing, Blekinge Institute of Technology, April 2012.
Abstract. Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted ...
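The colored-spectrogram idea above amounts to mapping scalar magnitudes to a three-channel image before feeding a DNN. The numpy sketch below uses a simple hand-rolled red-green-blue ramp as a hypothetical stand-in for the colormap; the actual mapping used in the paper is not specified here.

```python
import numpy as np

def spectrogram_to_rgb(mag, eps=1e-8):
    """Map a magnitude spectrogram to a 3-channel 'colored' image by
    normalizing log-magnitudes to [0, 1] and applying a simple
    piecewise-linear color ramp (an assumed colormap)."""
    logm = np.log10(mag + eps)
    x = (logm - logm.min()) / (logm.max() - logm.min() + eps)
    r = np.clip(2 * x - 1, 0, 1)    # high-energy bins -> red
    g = 1 - np.abs(2 * x - 1)       # mid-energy bins -> green
    b = np.clip(1 - 2 * x, 0, 1)    # low-energy bins -> blue
    return np.stack([r, g, b], axis=-1)
```

The resulting array has an image-like shape, which is what lets image-oriented DNN architectures be adapted to the enhancement task.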