12.6.4 : Simulation
Tuesday, Mar 17 1:00 AM - 1:40 AM CET : Cutting-Edge Molecular Dynamics on the Latest Multi-Node NVLink Technology [S81542]
- Alan Gray, Principal Developer Technology Engineer, NVIDIA
- Mahesh Doijade, Sr. Compute Developer Technology Engineer, NVIDIA
Tuesday, Mar 17 5:00 PM - 5:40 PM CET : The State of the Art of Quantum Chemistry on GPUs: EXESS, Exascale, and Floating-Point Emulation on Blackwell [S81503]
- Giuseppe M. J. Barca, Professor at MIPS and ANU, Co-Founder and Head of Research at QDX, Monash Institute of Pharmaceutical Sciences, Australian National University, and QDX Technologies
LIVE : Tuesday, Mar 17 11:00 PM - 11:40 PM CET : Accelerate Geospatial Workflows for Planetary Insight [S81732]
- May Casterline, Director, Solutions Architecture, NVIDIA
- Kiruthika Devaraj, VP of Spacecraft, Planet
- Observing is looking at history => watching the Andromeda galaxy means looking about 2M years into the past
- Getting compute as close to the sensor as possible
- 800 PB of data about Earth over the last 80 years
- 200 satellites around the Earth: soon 1 m resolution
- Query the data directly
- Downlink raw data as fast as possible
- Entirely on the GPU
- Blob of data => 2s on a single GPU
- The GPU is compared to a single CPU thread (=> lazy baseline, but they don't care about the CPU)
- Next step : atmospheric compensation
- Jetson can fly on Pelican satellites
- Thermal management in space? Orin: not that compute intensive. You have to dissipate heat as quickly as possible and then radiate it
- Downlink latency => edge compute on the satellites => a 50 MB image is compressed to 1 MB with a model, and the communication link will get faster and faster
- How many satellites do you need? Pelican: 30 min revisit (~22 satellites)
- Band registration? Most satellites have a TDI (time delay integration) sensor; bands are slightly offset while the satellite moves
- Feed raw data directly into a model to get a rough description quickly. They could in theory train a model on raw data
- The Pelican constellation has a 5-year lifetime
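
The edge-compute argument in the notes above (a 50 MB image reduced to 1 MB on board before downlink) comes down to simple arithmetic. A minimal sketch follows; the 100 Mbit/s link rate is a purely hypothetical value chosen for illustration, not a figure from the talk, and `downlink_seconds` is an illustrative helper name.

```python
def downlink_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Time in seconds to send `size_mb` megabytes over a link of `link_mbit_per_s` megabits/s."""
    return size_mb * 8.0 / link_mbit_per_s


RAW_MB = 50.0         # from the notes: image size before compression
COMPRESSED_MB = 1.0   # from the notes: size after on-board model compression
LINK_MBIT_S = 100.0   # ASSUMPTION: hypothetical downlink rate, not from the talk

raw_t = downlink_seconds(RAW_MB, LINK_MBIT_S)
comp_t = downlink_seconds(COMPRESSED_MB, LINK_MBIT_S)
ratio = RAW_MB / COMPRESSED_MB

print(f"raw: {raw_t:.2f} s, compressed: {comp_t:.2f} s, ratio: {ratio:.0f}x")
```

Whatever the real link rate is, the transfer time scales with the same 50x ratio, which is why compressing on the satellite pays off even as links get faster.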
Tuesday, Mar 17 11:00 PM - 11:40 PM CET : The Earth System at 1 km Resolution: Breaking Frontiers in Climate Science [S82185]
- Daniel Klocke, Group Leader, Max Planck Institute for Meteorology (MPI-M)
- The resolution has a big impact on the topography and the accuracy of the results
- Mont Blanc: 4810 m in reality => 4018 m at 1 km resolution, 1394 m in a traditional simulation
- At km scale we see high cloud structures and rain shafts
- At km scale the computation is simplified (fewer bugs, fewer assumptions, simpler models)
- Getting information at scale (global or local)
- Incorporate the water, energy and carbon cycles
- 1 km simulation is possible with an exascale computer
- 220 m with NICAM, for atmospheric simulation only
- For now the best multiphysics simulations have 200 km resolution
- 1 million lines of Fortran and OpenACC code to take all interactions into account
- They used Jupiter: 24000 GH200, 1 EFlops, #4 on the Top500
- Alps: 11000 GH200, 0.435 EFlops, #8 on the Top500
- Since the atmosphere moves faster than the ocean, it needs smaller time steps
- A lot of small kernels => use of CUDA Graphs
- Atmosphere simulation on the Hopper GPU
- Ocean simulation on the Grace CPU
- Pragmas in the Fortran code represent about 50 percent of the whole code base
- They will remove the pragmas to gain portability
- About 10^12 degrees of freedom
- 145.7 simulated days per day on Jupiter
- Next => 150 m resolution => what do clouds do in a warmer climate?
- They did some system tuning to get good scaling on both Jupiter and Alps even though their networks are different
- They plan to handle different kinds of precipitation depending on the clouds, but for now their microphysics simulation is not complex enough
- They will try cBottle (model from NVIDIA to simulate climate)
- Stefan Henneking, Research Associate, The University of Texas at Austin
- Omar Ghattas, Professor and Cockrell Endowed Chair in Engineering, The University of Texas at Austin
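
For the 1 km Earth system session [S82185], two points from the notes lend themselves to a quick numerical check: a coarse grid flattens mountain peaks (the Mont Blanc example), and 145.7 simulated days per day translates into a throughput in simulated years per day. The sketch below is only an illustration. The Gaussian mountain shape, its 10 km width, the 100 m sampling and the 365.25-day year are all assumptions, so the coarse heights it produces will not match the values quoted in the talk.

```python
import numpy as np


def coarsen_peak(height_m: np.ndarray, factor: int) -> float:
    """Block-average a 1D elevation profile by `factor` cells and return the highest coarse cell."""
    n = (height_m.size // factor) * factor
    coarse = height_m[:n].reshape(-1, factor).mean(axis=1)
    return float(coarse.max())


# ASSUMPTION: synthetic Gaussian mountain, 4810 m high, sampled every 100 m over 400 km.
x_km = np.linspace(0.0, 400.0, 4000, endpoint=False)
profile = 4810.0 * np.exp(-((x_km - 200.0) / 10.0) ** 2)

peak_fine = coarsen_peak(profile, 1)       # native 100 m sampling
peak_1km = coarsen_peak(profile, 10)       # 1 km grid cells
peak_100km = coarsen_peak(profile, 1000)   # 100 km grid cells

# Throughput from the notes: 145.7 simulated days per wall-clock day.
SDPD = 145.7
DAYS_PER_YEAR = 365.25                     # ASSUMPTION: Julian year
sypd = SDPD / DAYS_PER_YEAR                # simulated years per day
wall_days_30y = 30.0 * DAYS_PER_YEAR / SDPD

print(f"peak: {peak_fine:.0f} m, {peak_1km:.0f} m at 1 km, {peak_100km:.0f} m at 100 km")
print(f"{sypd:.3f} simulated years/day, 30 years in {wall_days_30y:.1f} wall-clock days")
```

The averaging step is the whole effect: each grid cell can only hold the mean elevation of the terrain it covers, so the wider the cell, the lower the represented peak.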
LIVE : Thursday, Mar 19 5:00 PM - 5:40 PM CET : Magic Attention: A Composable Framework for Exploring Warp-Specialized Attention on Blackwell [S82294]
- Manish Gupta, Member of Technical Staff, Magic AI, Inc
- SLIDES OK : GTC2026/S82294_1773959552971001Ivf7.pdf
- Flash attention algorithms
- Composable components, instruction interleaving
- Handles low-level, hardware-specific variations
- Static and dynamic schedulers
- Magic Attention on FA4 with a 2-queue schedule, 2-3 weeks with the 2-queue schedule
- The peak performance might not be at the max package
- TMEM is limited
- Precompute QK to have a softmax ready
- Let's warp-specialize everything
- 95% of the CTA performance (CTA = Cooperative Thread Array == thread block)
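
The flash attention algorithms mentioned in the notes above rely on the online softmax trick: keys and values are processed tile by tile while a running row maximum and a running denominator are maintained, so the full score matrix is never materialized. The NumPy sketch below shows that general technique only. It is not the Magic Attention framework or an FA4 kernel, and the function names, tile size and shapes are illustrative assumptions.

```python
import numpy as np


def reference_attention(q, k, v):
    """Plain softmax(QK^T / sqrt(d)) V, materializing the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, tile=4):
    """Same result, computed over key/value tiles with an online softmax."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((n_q, 1), -np.inf)        # running row maximum
    l = np.zeros((n_q, 1))                # running softmax denominator
    acc = np.zeros((n_q, v.shape[1]))     # running unnormalized output
    for start in range(0, k.shape[0], tile):
        k_t = k[start:start + tile]
        v_t = v[start:start + tile]
        s = (q @ k_t.T) * scale           # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)    # rescale earlier partial sums (0 on the first tile)
        p = np.exp(s - m_new)
        l = l * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_t
        m = m_new
    return acc / l


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((10, 16))         # 10 keys: last tile is partial on purpose
v = rng.standard_normal((10, 16))

out_tiled = tiled_attention(q, k, v, tile=4)
out_ref = reference_attention(q, k, v)
max_err = float(np.abs(out_tiled - out_ref).max())
print(f"max abs difference vs reference: {max_err:.2e}")
```

On a GPU the per-tile loop body is what gets split across specialized warps (loads, matrix multiplies, softmax rescaling); the sketch keeps everything sequential to make the arithmetic easy to follow.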
LIVE : Thursday, Mar 19 7:00 PM - 8:30 PM CET : An Introduction to the Newton Physics Engine for Robotics [S81613]
- NO access