
High Quality Video Conferencing

Audio and video signal processing for high‐quality multi‐point video conferencing imposes strict real‐time processing and transmission constraints as well as bit‐rate limitations on the overall system design and implementation. Within the collaborative Ziel 2 research project "Connected Visual Reality", a new conference system has been developed that achieves high presentation quality as well as high flexibility with respect to room set‐ups, clients, and network configurations. A key element is a newly developed multimodal signal processing concept for speaker localization and activity estimation. Furthermore, the use of sophisticated coding and signal enhancement techniques, together with new features such as artificial bandwidth extension, enables an implementation with cost‐efficient consumer electronics instead of specialized conference room installations.
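As a rough illustration of beamformer‐based speaker localization (not the project's actual algorithm), the direction of an active talker can be estimated by a steered‐response‐power search over a delay‐and‐sum beamformer: steer the array toward each candidate angle and pick the angle with maximum output energy. The array geometry, sampling rate, and far‐field assumption below are all simplifications.

```python
import numpy as np

def steered_energy(frames, mic_positions, angle, fs=16000.0, c=343.0):
    """Delay-and-sum beamformer output energy for one steering angle.

    frames: (num_mics, num_samples) time-domain block,
    mic_positions: (num_mics, 2) coordinates in metres.
    Far-field model; the phase alignment is exact only for periodic blocks.
    """
    direction = np.array([np.cos(angle), np.sin(angle)])
    delays = mic_positions @ direction / c               # arrival-time offsets in seconds
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    # Compensate each channel's delay by a phase rotation, then average (delay-and-sum).
    aligned = spectra * np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    output = aligned.mean(axis=0)
    return float(np.sum(np.abs(output) ** 2))

def localize(frames, mic_positions, angles, fs=16000.0):
    """Pick the steering angle with maximum beamformer energy (steered response power)."""
    energies = [steered_energy(frames, mic_positions, a, fs) for a in angles]
    return angles[int(np.argmax(energies))]
```

Smoothing this per‐direction energy over time yields a crude per‐direction speech activity measure; the project's beamformers (see the references on near‐field‐optimized filter‐and‐sum and weighted delay‐and‐sum designs) are considerably more elaborate.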

Our technical focus was a new multimodal signal processing concept for identifying the most active talkers in a video conference, even among competing talkers in a single room. The interacting audio and video analysis scheme combines dedicated beamformer‐driven speaker activity estimation with face detection and tracking. Complementary information from the audio and video signals is exchanged, merged, and transmitted as metadata. The proposed multimodal signal processing concept enables an automatic audio‐visual scene composition at the receiver side, where the most active talkers are arranged and displayed side by side for an enhanced conversational experience. In contrast to other commercially available high‐quality solutions, this system was intentionally designed for off‐the‐shelf consumer electronics at low cost. The developed conference system has been validated by a real‐time prototype implementation and was successfully demonstrated at CeBIT and at scientific conferences.
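The receiver‐side idea can be sketched as a late fusion of the two cues: each participant carries an audio activity measure (from the beamformer) and a face detection confidence, the fused scores rank the participants, and the top‐ranked talkers fill the side‐by‐side layout. All names and the fusion weights below are hypothetical; the project's actual estimator and metadata format are not specified here.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    audio_activity: float   # smoothed beamformer speech activity, 0..1 (illustrative scale)
    face_confidence: float  # face detector/tracker confidence, 0..1

def rank_active_talkers(participants, w_audio=0.7, w_video=0.3):
    # Late fusion of audio and video cues into one activity score;
    # the weights are illustrative, not values used in the project.
    score = lambda p: w_audio * p.audio_activity + w_video * p.face_confidence
    return sorted(participants, key=score, reverse=True)

def compose_side_by_side(participants, slots=2):
    # Receiver-side scene composition: the most active talkers, side by side.
    return [p.name for p in rank_active_talkers(participants)[:slots]]
```

In practice such scores would be smoothed over time with hysteresis so that the composed scene does not switch on every short interjection.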

The project was selected for funding by the NRW Ziel 2 program "Regionale Wettbewerbsfähigkeit und Beschäftigung" (Regional Competitiveness and Employment) 2007–2013, co‐funded by the European Regional Development Fund (ERDF).

References

[bulla13]
Christopher Bulla, Christian Feldmann, Magnus Schäfer, Florian Heese, Thomas Schlien, and Martin Schink
High Quality Video Conferencing: Region of Interest Encoding and Joint Video/Audio Analysis
International Journal on Advances in Telecommunications, December 2013

[schlien13]
Thomas Schlien, Florian Heese, Magnus Schäfer, Christiane Antweiler, and Peter Vary
Audiosignalverarbeitung für Videokonferenzsysteme (Audio Signal Processing for Video Conferencing Systems)
Workshop Audiosignal- und Sprachverarbeitung (WASP), September 2013

[heese13]
Florian Heese, Magnus Schäfer, Jona Wernerus, and Peter Vary
Numerical Near Field Optimization of a Non-Uniform Sub-band Filter-and-Sum Beamformer
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013

[heese12a]
Florian Heese, Magnus Schäfer, Peter Vary, Elior Hadad, Shmulik Markovich Golan, and Sharon Gannot
Comparison of Supervised and Semi-supervised Beamformers Using Real Audio Recordings
Proceedings of IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), November 2012

[schaefer12b]
Magnus Schäfer, Florian Heese, Jona Wernerus, and Peter Vary
Numerical Near Field Optimization of Weighted Delay-and-Sum Microphone Arrays
Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), September 2012

[hamm12]
Laurits Hamm, Tobias Engelbert, Jose Lausuch, Arturo Martin de Nicolas, Ramsundar Kandasamy, Martin Schink, Christian Feldmann, Christopher Bulla, Magnus Schäfer, Florian Heese, Thomas Schlien, and Christiane Antweiler
Connected Visual Reality – High Quality Audio Visual Communication in Heterogeneous Networks
International Workshop on Acoustic Signal Enhancement (IWAENC), September 2012