Colloquium - Talk Details

You will be notified of upcoming talks by e-mail in good time if you subscribe to the newsletter of the Communications Engineering Colloquium.

All interested parties are cordially invited; registration is not required.

Master-Vortrag: End-to-End Speech Inpainting Using Convolutional Network Structures

Jingcheng Tian
Wednesday, 17 June 2020
11:00 a.m.
virtual conference room

Speech signals are often subject to interference or damage in the time or frequency domain during transmission, and there are many ways to address such disturbances. One of them is Packet Loss Concealment (PLC), a technique designed to minimize the audible effect of lost packets in digital communications. Bandwidth Extension (BWE), on the other hand, is the process of extending the frequency range of a signal.


Speech inpainting, a generalization of BWE and PLC, addresses the loss of signal content at arbitrary times and frequencies rather than at fixed positions. The term inpainting comes from image inpainting, a subarea of digital image processing in which many deep-learning-based techniques for reconstructing damaged pictures already exist. In the field of speech, however, this technology is not yet widespread; only a few dictionary-based speech inpainting approaches exist. In this work we introduce a model for solving the speech inpainting task.
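The relationship between PLC, BWE, and speech inpainting can be pictured as different masking patterns on a time-frequency representation. The following is an illustrative sketch (not the thesis's actual data pipeline) with a hypothetical toy spectrogram of 64 frequency bins by 100 time frames:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "spectrogram": 64 frequency bins x 100 time frames (hypothetical sizes).
spec = rng.random((64, 100))

# PLC-style loss: whole time frames are missing (all frequencies, fixed times).
plc_mask = np.ones_like(spec, dtype=bool)
plc_mask[:, 40:45] = False

# BWE-style loss: a whole upper frequency band is missing (all times).
bwe_mask = np.ones_like(spec, dtype=bool)
bwe_mask[48:, :] = False

# Speech inpainting generalizes both: arbitrary time-frequency regions are lost.
inpaint_mask = np.ones_like(spec, dtype=bool)
inpaint_mask[10:20, 30:50] = False   # a rectangular time-frequency hole
inpaint_mask[:, 70:72] = False       # plus a short full-band dropout

# The observed (damaged) spectrogram the inpainting model must restore:
damaged = spec * inpaint_mask
```

PLC and BWE masks are special cases of the inpainting mask, which is what makes speech inpainting the more general problem.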
Learning-based methods have been shown to outperform traditional algorithms in front-end processing tasks such as speech denoising and BWE. However, most algorithms extract features and feed magnitude spectrograms to the model, which has the disadvantage of discarding phase information.
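The loss of phase information can be made concrete with a small numpy experiment (a sketch for illustration, not part of the thesis): reconstructing a signal from its magnitude spectrum alone, with all phases discarded, preserves the magnitudes exactly but produces a very different waveform.

```python
import numpy as np

# A short synthetic frame: two sinusoids, one with a phase offset.
t = np.arange(256)
x = np.sin(2 * np.pi * 0.05 * t + 0.7) + 0.5 * np.sin(2 * np.pi * 0.12 * t)

X = np.fft.rfft(x)
magnitude = np.abs(X)

# Reconstruct from magnitude alone, i.e. with all phases set to zero,
# as a magnitude-spectrogram-only model would effectively have to do.
x_zero_phase = np.fft.irfft(magnitude, n=len(x))

# The spectra match in magnitude, but the waveforms differ substantially.
err = np.max(np.abs(x - x_zero_phase))
```

This is why operating directly on the raw waveform, where phase is implicit in the samples, is attractive.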


WaveNet, the well-known speech synthesis model, uses dilated CNNs to generate raw audio directly. This thesis uses a modified WaveNet that reads and generates raw speech directly, so that the model's input and output are lossless and no information is discarded; instead, the first CNN layers perform feature extraction automatically. At the same time, the large memory footprint of WaveNet is reduced. In addition, the causal convolutions are replaced by symmetric ones: the model can see not only past but also future samples, which enlarges the receptive field and improves accuracy. We also compare different loss functions. In the experiments we tried different types of data, different noise, and different losses in time and frequency. Computational evaluation shows that this method can reconstruct speech signals not only in magnitude but also in phase.
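The two ideas in the paragraph above, symmetric (non-causal) dilated convolutions and the exponentially growing receptive field, can be sketched in a few lines of numpy. This is a minimal illustration of the general technique, not the thesis's actual network; kernel size and dilation schedule are assumptions:

```python
import numpy as np

def dilated_conv1d_symmetric(x, w, dilation):
    """Symmetric (non-causal) dilated 1-D convolution with zero padding.

    For kernel size 3, output[i] depends on x[i-d], x[i], x[i+d]:
    unlike causal WaveNet, the model sees past *and* future samples.
    """
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return sum(w[j] * xp[j * dilation : j * dilation + len(x)] for j in range(k))

def receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked dilated layers: grows with the dilation sum."""
    return 1 + (kernel_size - 1) * sum(dilations)

# With doubling dilations 1, 2, 4, 8, 16 the receptive field grows
# exponentially with depth while the parameter count grows only linearly.
rf = receptive_field([1, 2, 4, 8, 16])
```

A unit impulse passed through one symmetric layer spreads to both sides of its position, confirming that past and future context contribute equally.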
