Doctoral Defense: Phase-Aware Spectral Speech Enhancement Using Deep Learning Techniques
Lars Thieling
Monday, 5th May 2025
02:00 PM
IKS 4G / hybrid
Speech communication is crucial for human interaction and has become central in various domains like entertainment, education, and healthcare. However, speech signals often suffer from impairments due to, e.g., background noise, reverberation, acoustic echo, limited bandwidth, and packet losses. These impairments lead to degraded speech quality and intelligibility, ultimately resulting in unsatisfactory communication experiences. To mitigate the degradations, speech enhancement techniques are required.
In recent years, speech enhancement approaches leveraging deep learning (DL) have achieved significant advancements and established new benchmarks in the field. Particularly in noise reduction, deep neural networks (DNNs) have facilitated enhancements even in challenging scenarios characterized by very low signal-to-noise ratios (SNRs) and highly non-stationary noise environments. Most speech enhancement methods operate in the spectral domain and traditionally focus on processing the magnitude spectrum, as the phase spectrum is often deemed less relevant under moderate SNR conditions. However, the remarkable success of DNNs in these magnitude-controlled approaches at low SNR levels has increased the need for enhancing the phase as well, leading to a greater research focus on this aspect. Many modern speech enhancement approaches are now phase-aware, meaning they estimate both the magnitude and the phase spectrum.
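Why the phase gains importance at low SNRs can be illustrated with a small numerical sketch. It is not taken from the thesis: random complex values stand in for STFT bins, the clean magnitude is assumed to be known perfectly, and the names `complex_noise` and `relative_error` are hypothetical helpers. The magnitude-only estimate reuses the noisy phase, while the phase-aware estimate also uses the clean phase.

```python
import numpy as np

rng = np.random.default_rng(0)


def complex_noise(n, scale):
    """Complex Gaussian samples used as stand-ins for STFT bins."""
    return scale * (rng.standard_normal(n) + 1j * rng.standard_normal(n))


def relative_error(estimate, target):
    """Squared error of the estimate, normalized by the target energy."""
    return np.sum(np.abs(estimate - target) ** 2) / np.sum(np.abs(target) ** 2)


n_bins = 10_000
clean = complex_noise(n_bins, 1.0)  # toy "clean speech" bins (assumption)

errors = {}
for label, noise_scale in [("high_snr", 0.1), ("low_snr", 10.0)]:
    noisy = clean + complex_noise(n_bins, noise_scale)
    # Magnitude-only enhancement: ideal magnitude, but the noisy phase is reused.
    magnitude_only = np.abs(clean) * np.exp(1j * np.angle(noisy))
    # Phase-aware enhancement: ideal magnitude and ideal phase.
    phase_aware = np.abs(clean) * np.exp(1j * np.angle(clean))
    errors[label] = (
        relative_error(magnitude_only, clean),
        relative_error(phase_aware, clean),
    )

for label, (err_mag_only, err_phase_aware) in errors.items():
    print(label, err_mag_only, err_phase_aware)
```

In this toy setting, the error that remains after a perfect magnitude estimate comes entirely from the noisy phase, and it grows as the noise level increases.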
The task of this thesis is to develop concepts and algorithms in the emerging topic of phase-aware speech enhancement using DNNs with the aim of providing new insights and directions for the next generation of speech enhancement approaches.
The first major contribution of the dissertation is a novel magnitude-controlled two-stage speech enhancement (TSSE) approach, which is designed to be effective under challenging SNR conditions. This method consists of a mask estimation stage and a speech extraction stage, each utilizing a DNN specifically designed for its respective task. Unlike other state-of-the-art solutions, which mask the noisy spectrum in the first stage, this approach passes the estimated mask to the second stage as prior information. The mask thus provides a rough classification into speech-dominated and noise-dominated regions and facilitates precise extraction of the speech, thereby eliminating the need to restore it only from the unmasked areas.
The second major contribution of the dissertation is a novel two-stage phase reconstruction (TSPR) approach. For a given magnitude spectrum, this method first estimates phase derivatives using DNNs and then combines these estimates into a unified phase spectrum that can be utilized for speech synthesis. Modifications are proposed for both stages. For the first stage, a preprocessing step and a new loss function are introduced that simplify the DNN training and stabilize it against hyperparameter variations. For the second stage, a new phase combination method is presented that recursively calculates each time-frequency entry of the phase spectrum from the estimated phase derivatives in its local vicinity. Unlike other state-of-the-art combination methods, this approach leverages the magnitude spectrum as prior information, ultimately resulting in improved performance.
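The idea of recursively combining phase derivatives under magnitude guidance can be sketched as follows. This is a minimal illustration under stated assumptions, not the combination method of the thesis: the function name `combine_phase`, the choice of only two neighbors (previous frame and lower frequency bin), and the magnitude-weighted phasor average are all hypothetical choices made for the example.

```python
import numpy as np


def combine_phase(d_time, d_freq, magnitude, phase_origin=0.0):
    """Recursively build a phase spectrum from phase-derivative estimates.

    d_time[k, l] approximates phase[k, l] - phase[k, l - 1] (used for l >= 1).
    d_freq[k, l] approximates phase[k, l] - phase[k - 1, l] (used for k >= 1).
    magnitude[k, l] weights how much each neighbor is trusted (assumption).
    """
    n_freq, n_frames = magnitude.shape
    phase = np.zeros((n_freq, n_frames))
    phase[0, 0] = phase_origin
    eps = 1e-12  # keeps the weighted sum non-zero when both magnitudes vanish
    for l in range(n_frames):
        for k in range(n_freq):
            if k == 0 and l == 0:
                continue
            acc = 0j
            if l > 0:  # candidate propagated along time from the previous frame
                candidate = phase[k, l - 1] + d_time[k, l]
                acc += (magnitude[k, l - 1] + eps) * np.exp(1j * candidate)
            if k > 0:  # candidate propagated along frequency from the lower bin
                candidate = phase[k - 1, l] + d_freq[k, l]
                acc += (magnitude[k - 1, l] + eps) * np.exp(1j * candidate)
            # Averaging phasors instead of angles avoids 2*pi wrapping problems.
            phase[k, l] = np.angle(acc)
    return phase


# Usage with exact derivatives of a random toy phase spectrum.
rng = np.random.default_rng(1)
true_phase = rng.uniform(-np.pi, np.pi, size=(5, 7))
magnitude = rng.uniform(0.1, 1.0, size=(5, 7))
d_time = np.zeros_like(true_phase)
d_time[:, 1:] = np.diff(true_phase, axis=1)
d_freq = np.zeros_like(true_phase)
d_freq[1:, :] = np.diff(true_phase, axis=0)
rebuilt = combine_phase(d_time, d_freq, magnitude, phase_origin=true_phase[0, 0])
```

With exact derivatives both candidates agree, so the toy phase is recovered up to multiples of 2π. With estimated derivatives the candidates differ, and the magnitude weights decide which neighbor dominates.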
The third major contribution of the dissertation is the development of novel phase spectra that can be utilized in phase-controlled speech enhancement approaches. A silence-generating phase is introduced, which achieves perfect cancellation through destructive interference of the time signals from adjacent frames during synthesis. By suitably combining this silence-generating phase with the clean speech phase, a combined consistent-inconsistent phase (CIP) is developed. This CIP enables noise reduction purely by modifying the phase, without altering the noisy magnitude spectrum before synthesis. A phase-controlled approach using this CIP performed similarly to, or even better than, a magnitude-controlled approach using the clean magnitude spectrum, highlighting the remarkable potential of phase processing in speech enhancement.
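The cancellation principle behind a silence-generating phase can be made concrete with a toy overlap-add example. It does not reproduce the phase construction of the thesis, which starts from a given magnitude spectrum. Instead, the sketch assumes 50% overlap and a rectangular synthesis window, and builds frames directly in the time domain so that the second half of each frame is the negative of the first half of the next one. Every frame then has a non-zero spectrum, yet the synthesized signal is silent.

```python
import numpy as np

rng = np.random.default_rng(2)
hop, n_frames = 8, 6
frame_len = 2 * hop  # 50% overlap (assumption for this toy example)

# Second halves of the frames; the last frame has no successor to cancel it.
tails = rng.standard_normal((n_frames, hop))
tails[-1] = 0.0

frames = np.zeros((n_frames, frame_len))
frames[:, hop:] = tails
frames[1:, :hop] = -tails[:-1]  # each first half negates the previous tail

# Frame-wise spectra: a mutually inconsistent set of short-time spectra.
spectra = np.fft.rfft(frames, axis=1)
frame_energy = np.sum(np.abs(spectra) ** 2, axis=1)

# Overlap-add synthesis with a rectangular window.
output = np.zeros(hop * (n_frames + 1))
for l, frame in enumerate(np.fft.irfft(spectra, n=frame_len, axis=1)):
    output[l * hop : l * hop + frame_len] += frame

print("frame energies:", frame_energy)
print("max |output|:", np.max(np.abs(output)))
```

The frames interfere destructively in every overlap region, which is the effect the abstract attributes to the silence-generating phase during synthesis.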
The fourth major contribution of the dissertation is the phase-aware two-stage speech enhancement (PATSSE) approach. It is a phase-aware extension of the magnitude-controlled TSSE approach: in addition to predicting the clean speech magnitude spectrum, it estimates the proposed CIP by building upon and extending concepts from the TSPR approach. Two variants are distinguished, a separated and a joint PATSSE. While the separated PATSSE approach generates independent estimates for the magnitude and phase spectra, the joint PATSSE approach introduces an additional joint loss term to optimize these estimates simultaneously. For the joint approach, a perceptually motivated loss is proposed, which considers aspects of human perception and therefore generally correlates better with subjective listening results. Objective and subjective evaluation results demonstrate the effectiveness of both the additional estimation of the CIP in the separated approach and the simultaneous optimization of the estimates in the joint approach.
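The difference between the separated and the joint training objective can be sketched as follows. The individual loss terms are assumptions chosen for illustration: a mean squared error on the magnitude, a cosine-based phase distance, and a plain complex-spectrum error as a stand-in for the joint term. The perceptually motivated loss proposed in the thesis is not reproduced here, and all function names and the parameter `joint_weight` are hypothetical.

```python
import numpy as np


def magnitude_loss(mag_est, mag_ref):
    """Mean squared error between estimated and reference magnitudes."""
    return np.mean((mag_est - mag_ref) ** 2)


def phase_loss(phase_est, phase_ref):
    """Cosine-based phase distance, insensitive to 2*pi wrapping."""
    return np.mean(1.0 - np.cos(phase_est - phase_ref))


def joint_loss(mag_est, phase_est, mag_ref, phase_ref):
    """Error of the complex spectrum formed from both estimates (stand-in)."""
    est = mag_est * np.exp(1j * phase_est)
    ref = mag_ref * np.exp(1j * phase_ref)
    return np.mean(np.abs(est - ref) ** 2)


def total_loss(mag_est, phase_est, mag_ref, phase_ref, joint_weight=0.0):
    """joint_weight = 0 gives the separated objective, > 0 the joint one."""
    loss = magnitude_loss(mag_est, mag_ref) + phase_loss(phase_est, phase_ref)
    if joint_weight > 0.0:
        loss += joint_weight * joint_loss(mag_est, phase_est, mag_ref, phase_ref)
    return loss


# Usage: perfect magnitude estimate, phase estimate off by 0.5 rad.
mag_ref = np.array([1.0, 2.0])
phase_ref = np.array([0.0, 1.0])
separated = total_loss(mag_ref, phase_ref + 0.5, mag_ref, phase_ref)
joint = total_loss(mag_ref, phase_ref + 0.5, mag_ref, phase_ref, joint_weight=1.0)
perfect = total_loss(mag_ref, phase_ref, mag_ref, phase_ref, joint_weight=1.0)
```

In the separated case the two terms never interact, so a phase error in a high-magnitude bin costs the same as in a low-magnitude bin. The joint term couples both estimates, which is the property the joint PATSSE variant exploits.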
