Estimating acoustic parameters, such as the location of a sound source, the geometry, or the acoustical properties of an environment from audio recordings, is a crucial component of audio augmented reality systems. These tasks become especially challenging in the blind setting, e.g., when using noisy recordings of human speakers. Significant progress has been made in recent years thanks to the advent of supervised machine learning. However, these methods are often hindered by the limited availability of real-world annotated data for such tasks. A common strategy has been to use acoustic simulators to train such models, a framework we refer to as "Virtually Supervised Learning." In this talk, we will explore how the realism of simulation impacts the generalizability of virtually supervised models to real-world data. We will focus on the tasks of sound source localization, room geometry estimation, and reverberation time estimation from noisy multichannel speech recordings. Our results suggest that enhancing the realism of the source, microphone, and wall responses during simulated training by making them frequency- and angle-dependent significantly improves generalization performance.
Methodological advances for Audio Augmented Reality and its applications
As part of the HAIKUS project (ANR-19-CE23-0023), funded by the French National Research Agency, IRCAM, LORIA and IJLRA organized a one-day workshop focusing on methodological advances for Audio Augmented Reality and its applications.
Audio Augmented Reality (AAR) seeks to integrate computer-generated and/or pre-recorded auditory content into the listener's real-world environment. Hearing plays a vital role in understanding and interacting with our spatial environment. AAR significantly enhances the auditory experience and increases user engagement in Augmented Reality (AR) applications, particularly in artistic creation, cultural mediation, and the entertainment and communication industries.
Audio-signal processors are a key component of the AAR workflow, as they are required for real-time control of 3D sound spatialisation and artificial reverberation applied to virtual sound events. These tools have now reached a level of maturity that allows them to support large multichannel loudspeaker systems as well as binaural rendering on headphones. However, the accuracy of the spatial processing applied to virtual sound objects is essential to ensure their seamless integration into the listener's real environment, thereby guaranteeing a high-quality user experience. To achieve this level of integration, methods are needed to identify the acoustic properties of the environment and adjust the spatialisation engine's parameters accordingly. Ideally, such methods should enable automatic inference of the acoustic channel's characteristics, based solely on live recordings of the natural, and often dynamic, sounds present in the real environment (e.g. voices, noise, ambient sounds, moving sources). These topics are gaining increasing attention, especially in light of recent advances in data-driven approaches within the field of acoustics. In parallel, perceptual studies are being conducted to define the requirements needed to guarantee a coherent sound experience.
Organising committee: Antoine Deleforge (INRIA), François Ollivier (MPIA-IJLRA), Olivier Warusfel (IRCAM)