Sound of the Metaverse: Meta Creates AI Models to Improve Virtual Audio

Zoom calls, metaverse meetings, and virtual events could all benefit from a series of AI models developed by Meta engineers that, according to the company, match sound to imagery, mimicking how humans experience sound in the real world.

Visual-Acoustic Matching, Visually-Informed Dereverberation, and VisualVoice are the three models developed in collaboration with researchers from the University of Texas at Austin. The models have been made available to developers by Meta.

“We need AI models that understand a person’s physical surroundings based on how they look as well as how things sound,” the company wrote in a blog post introducing the new models.

“For example, there is a significant difference in how a concert sounds in a large venue versus in your living room. This is due to the fact that the geometry of a physical space, the materials and surfaces in the area, and the proximity of where the sounds are coming from all influence how we hear audio.”

New Audio AI Models From Meta

The Visual-Acoustic Matching model takes an audio clip recorded anywhere, along with an image of a room or other space, and transforms the clip so that it sounds as if it had been recorded in that space. One application could be to ensure that everyone in a video chat sounds as if they are in the same place. So, if one person is at home, another in a coffee shop, and a third in an office, the sound could be adjusted so that what you hear sounds like it’s coming from the room you’re sitting in.
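To give a rough sense of what “sounding like a room” means in signal terms, the sketch below uses a textbook technique: convolving a dry recording with a room impulse response. This is not Meta’s model, which infers the acoustics from an image; here the impulse response is simply synthesized as decaying noise, and the function names, sample rate, and reverberation times are assumptions chosen for illustration.

```python
import math
import random

def synthetic_impulse_response(rt60_s, sample_rate, seed=0):
    """Toy room impulse response (an assumption, not a measured room):
    random noise whose amplitude decays by 60 dB over rt60_s seconds."""
    rng = random.Random(seed)
    n = int(rt60_s * sample_rate)
    # A 60 dB drop in amplitude is a factor of 1000.
    decay = math.log(1000.0) / rt60_s
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * i / sample_rate)
          for i in range(n)]
    ir[0] = 1.0  # the direct sound arrives first, at full strength
    return ir

def convolve(dry, ir):
    """Direct-form convolution: every input sample triggers a scaled
    copy of the impulse response, and the copies are summed."""
    out = [0.0] * (len(dry) + len(ir) - 1)
    for i, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(ir):
            out[i + j] += x * h
    return out

sample_rate = 8000                       # samples per second (illustrative)
dry_click = [1.0] + [0.0] * 99           # a single click with no room sound

small_room = synthetic_impulse_response(0.05, sample_rate)  # short tail
large_hall = synthetic_impulse_response(0.5, sample_rate)   # long tail

wet_small = convolve(dry_click, small_room)
wet_large = convolve(dry_click, large_hall)

print(len(wet_small), len(wet_large))
```

The same click comes out with a much longer tail in the “large hall” than in the “small room”, which is the kind of difference a matching model would need to reproduce from visual cues alone.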

Visually-Informed Dereverberation does the inverse: it takes the sounds and visual cues from a space and focuses on removing the reverberation that the space adds. It can, for example, focus on violin music even if it is recorded inside a large train station.

Finally, the VisualVoice model separates speech from other background sounds and voices using visual and audio cues, allowing the listener to focus on a specific conversation. This could be useful in a large conference hall where many people are mingling. According to Meta, this focused audio technique could also be used to generate higher-quality subtitles, or to make it easier for future machine learning systems to understand speech when more than one person is speaking.

How AI Can Improve Audio in Virtual Reality

According to Rob Godman, a reader in music at the University of Hertfordshire and an expert in acoustic spaces, this work takes a human need to understand where we are in the world and applies it to virtual audio settings. “We need to consider how humans perceive sound in their surroundings,” Godman says. “Human beings want to know where the sound is coming from, how big or small a space is.”

“When we listen to sound being created, we hear a variety of things. One is the source, but you also pay attention to what happens to sound when it interacts with the environment: the acoustics.” Being able to accurately capture and mimic that second aspect, he explains, could make virtual audio worlds and spaces appear more realistic, as well as eliminate the disconnect that humans may experience if the visuals do not match the audio.

An example would be a concert in which a band is seen performing outdoors on a beach, but the audio was recorded inside a cathedral, complete with significant reverb. Because reverb isn’t something you’d expect to find on a beach, the mismatch of sound and visual would be surprising and off-putting.

The most significant change, according to Godman, is how the listener’s perception is taken into account when these AI models are implemented. “The position of the listener must be carefully considered,” he says. “The sound made close to a person versus metres away is significant. It is based on the speed of sound in air, so even a small delay in reaching a person is critical.”
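The delay Godman refers to is simple to estimate: travel time is distance divided by the speed of sound. The sketch below assumes a speed of about 343 metres per second (dry air at roughly 20 °C); the function name and the example distances are illustrative and do not come from Meta or Godman.

```python
# Approximate speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0

def propagation_delay_ms(distance_m):
    """Time in milliseconds for sound to travel distance_m metres."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

# Compare a voice at arm's length with sources further across a room.
for distance_m in (0.5, 3.0, 10.0):
    delay = propagation_delay_ms(distance_m)
    print(f"{distance_m:>5.1f} m -> {delay:.1f} ms")
```

Under that assumption, a source half a metre away is heard after roughly 1.5 milliseconds, while one ten metres away takes about 29 milliseconds, so a virtual scene that ignores the listener’s position gets these arrival times wrong.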

Godman adds that users will “spend thousands of pounds on curved monitors but won’t pay more than £20 for a pair of headphones,” which is part of the problem with improving audio.

Professor Mark Plumbley, an EPSRC Fellow in AI for Sound at the University of Surrey, is working on classifiers for various types of sounds, which can be removed or highlighted in recordings. “You need the vision and sound to match if you’re going to create this realistic experience for people,” he says.

“It is more difficult for a computer than I believe it would be for humans.” When we listen to sounds, an effect known as directional marking helps us focus on the sound from the person in front of us and ignore sounds from the side.

This is something people are used to doing in the real world, Plumbley says. “If you’re at a cocktail party with a lot of conversations going on, you can focus on the one that interests you, and we can block out sounds from the side or elsewhere,” he says. “In a virtual audio world, this is a difficult task.”

He claims that much of this work has resulted from advances in machine learning, with better deep learning techniques that work across multiple disciplines, including sound and image AI. “Many of these are related to signal processing,” Plumbley adds.

“Whether it’s sounds, gravitational waves, or financial data time series, they deal with signals that appear over time. Previously, researchers had to create separate extraction methods for different types of objects. Deep learning models are now capable of extracting patterns.”

