Scale Your AI: Multi-Node Training & Profiling - Research Infrastructure Services

Join us for an advanced workshop: Scalable Deep Learning on RIS Compute2. This session is designed for users who are ready to move beyond single-GPU constraints and run distributed training across multiple GPUs and nodes.

Prerequisites: Completion of Intro to PyTorch/Containers (or equivalent experience) and intermediate Python skills.

💡 What you’ll learn:

Multi-Node & Multi-GPU Execution: Orchestrating complex jobs across the RIS Compute2 fabric.
PyTorch Scaling with Slurm: Implementing Distributed Data Parallel (DDP) and managing multi-node communication.
NVIDIA Nsight Systems & Nsight Compute: Profiling your code, identifying kernel bottlenecks, and optimizing GPU utilization.
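
As a small taste of the PyTorch-with-Slurm topic, here is a minimal sketch of how a training script might derive its DDP rank information from the environment variables Slurm sets for each task. The helper name `ddp_env_from_slurm` is hypothetical, and the sketch assumes the job is launched with one task per GPU. It is not necessarily the approach taught in the workshop or the recommended setup on RIS Compute2.

```python
import os


def ddp_env_from_slurm(env=None):
    """Map Slurm per-task variables to the values a DDP setup typically needs.

    Hypothetical helper for illustration. Assumes one Slurm task per GPU,
    so SLURM_LOCALID can double as the local GPU index.
    """
    env = os.environ if env is None else env
    return {
        # Global index of this task across all nodes in the job step.
        "rank": int(env.get("SLURM_PROCID", 0)),
        # Total number of tasks in the job step.
        "world_size": int(env.get("SLURM_NTASKS", 1)),
        # Index of this task on its own node.
        "local_rank": int(env.get("SLURM_LOCALID", 0)),
    }


if __name__ == "__main__":
    # Outside a Slurm job this falls back to a single-process configuration.
    print(ddp_env_from_slurm())
```

In a real training script, these values would usually be passed on to `torch.distributed.init_process_group` (as `rank` and `world_size`) and `torch.cuda.set_device` (as the local rank), with `MASTER_ADDR` and `MASTER_PORT` set so the processes can find each other.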

❓ Why attend:

Don’t just run your code; optimize it. Learn how to use professional-grade profiling tools to find where your PyTorch models lose time and make them run more efficiently across our cluster.

This training will take place on Zoom. After you register, you will automatically receive a calendar invite that includes the Zoom link.