
4D Audio-Visual Learning: Multimodal Perception and Generation from Spaces to Humans

Summary
Dr. Changan Chen (Stanford)
Packard 101
Apr 9

Talk Abstract: Humans use multiple modalities to perceive the world, including vision, sound, touch, and smell. Among them, vision and sound are two of the most important modalities, and they naturally co-occur. Recent works have explored this natural correspondence between sight and sound, but mainly in an object-centric way. While exciting, the correspondence with 3D spaces and humans remains understudied. For example, the sound we hear is transformed by the spaces around us, and as humans we also produce sounds in our daily lives. In this talk, I present 4D audio-visual learning, which learns the correspondence between sight and sound in spaces and humans. More specifically, I focus on three topics in this direction: simulating sounds in spaces for embodied AI, generating multimodal human interaction sounds, and modeling the language of human motion. Throughout these topics, I use vision as the main bridge connecting audio and scene understanding, and I show promising results in building fundamental simulation platforms, enabling multimodal embodied AI, learning how actions sound from in-the-wild egocentric videos, and unifying the verbal and non-verbal language of human motion. In the last part of my talk, I will discuss open research directions for 4D audio-visual learning.

Speaker Biography: Changan Chen is a postdoctoral researcher at Stanford University working with Dr. Fei-Fei Li and Dr. Ehsan Adeli. He received his PhD in computer science from UT Austin. His work focuses on multimodal learning and embodied AI. He led the development of the audio-visual simulation platforms SoundSpaces 1.0 and 2.0, and was previously a visiting researcher at FAIR for three years. He received the 2022 Adobe Research Fellowship, and his research has been featured in media outlets including MIT Technology Review, VentureBeat, and Yahoo News. He was the lead organizer of the AV4D workshops at ECCV 2022 and ICCV 2023 and the Multimodalities for 3D Scenes (M3DS) workshop at CVPR 2024, and co-organized the Embodied AI workshop at CVPR 2023, 2024, and 2025.