Networking for Distributed AI – What’s so special about it
Y2E2 111
Zoom
About the talk: Distributed AI workloads are unique in the way they use the network. As a consequence, traditional networking solutions are not ideal for interconnecting the massive number of processors deployed in clusters for AI training and inference. Given that the network plays a key role in Distributed AI's performance and power consumption, academia and industry have devoted an enormous amount of effort to developing specialized solutions.
After analyzing the unique networking requirements of distributed AI workloads, this talk describes the network architectures commonly used to achieve the extreme scale of AI clusters and the standard protocol stacks developed for this specific context. It then considers the benefits of more advanced solutions to address the needs of distributed AI workloads, such as collective communication offload and optical networking.
About the speaker: Mario Baldi is a Fellow at AMD, Research and Advanced Development.
He has held various positions in startup and established companies in the computer networking industry, as well as several visiting professorships at universities on four continents. He has authored over 150 scientific publications and two books, and he is an inventor on dozens of patents. While his recent focus has been on optimizing networking for distributed AI workloads, over the years Mario's research and engineering work has spanned many areas of computer networking, including programmable data planes, big data analytics, trust in distributed software execution, internetworking, high-performance switching, optical networking, quality of service, and multimedia over packet networks.
Mario holds an M.S. with honors (summa cum laude) in Electrical Engineering and a Ph.D. in Computer and Systems Engineering, both from Politecnico di Torino, Italy.