12.4.2 : Hardware
- Monday, (6:00 PM - 6:50 PM CET) Early Science with Grace Hopper at Scale on Alps [S62157]
- Thomas Schulthess : Director, ETH Zurich/The Swiss National Supercomputing Center (CSCS)
- Each blade: 6 or 7 kW
- 400 Gb/s across all of Switzerland
- A lot of context about the institutes and universities
- 2017 : 5320 P100
- Alps: the meteorologists accepted the project right away
- 2016 : 192 K80 (90% of computational peak from GPU)
- The meteorologists have 40 years of experience in storing large amounts of data
- Alps was built alongside the EXCLAIM project (2021 - 2027)
- The idea is to run a weather simulation with everything coupled (atmosphere, land, etc.)
- But while consuming less than 5 MW
- Implementation of ICON in CUDA
- They simulate 60 days of weather in 1 day
- Alps should be operational this summer
- 16x CG4 on Alps: fully coupled simulation of 300 days per day (64 GPUs with Grace)
- Grace: 375 GB/s in the STREAM benchmark
- They will reach 200 MW, but spread over several countries
- They are seriously considering using mixed precision for their simulations
- Testing and integration of these simulations need to be improved, even with machine learning algorithms
- Tuesday, (4:00 PM - 4:25 PM CET) How to Safely and Successfully Boost Your Data Center Productivity and ROI (Presented by DDN) [S63121]
- James Coomer : Sr. Vice President, Products, DDN
- Slides
- DDN : data storage and data management solutions
- DDN EXAScaler is used on EOS (NVIDIA's own computing cluster)
- Tackles the storage problem, especially for AI training (60% data loading and only 5% model loading; 43% on checkpointing for LLMs)
- mmap is 20x faster for reads (see the sketch after these notes)
- Grace: 7 GB/s write bandwidth from a single thread
- All packed into a blade server (2U)
- The goal is to take I/O out of the execution time, since compute time is still being optimized every year
- It is based on a kind of simplified Lustre
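The 20x read claim above refers to memory-mapped I/O. Below is a minimal sketch of that access pattern (not DDN's implementation), assuming a POSIX system; the file name `data.bin` is a placeholder.

```c
/* Minimal sketch: reading a file through mmap instead of read().
 * Assumes a POSIX system; "data.bin" is a placeholder path. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only: pages are faulted in on access,
     * avoiding an extra copy through a user-space read buffer. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch the data (a trivial checksum) as a stand-in for real work. */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("bytes=%lld checksum=%lu\n", (long long)st.st_size, sum);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```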
- Tuesday, (4:00 PM - 4:50 PM CET) Revolutionizing Supercomputing: Unleashing the Power of Grace [S62579]
- Grace and Grace Hopper users
- Their earlier experience with Arm architectures went well, so they moved to Grace
- Users reach the 4 TB/s memory bandwidth of Grace Hopper
- The only way to get users to compute on Arm is to make it the only architecture available
- In Europe: about 35 cents per kilowatt-hour
- Using more efficient architectures saves money that can be reinvested in things other than hardware, such as manpower
- We built a Ferrari, now we need good drivers
- Easier to port to Grace Hopper than before
- If you did a good job as a programmer the port will be easy; if the software is a mess, it will be difficult
- Tuesday, (10:00 PM - 10:50 PM CET) Scientific Computing With NVIDIA Grace and the Arm Software Ecosystem [S61598]
- John Linford : Principal Technical Product Manager, Datacenter CPU Software, NVIDIA
- Lars Koesterke : Research Associate/Manager, University of Texas at Austin
- The Grace architecture combines the Arm architecture and micro-architecture with NVIDIA's chip-to-chip interconnect and networking
- LPDDR5: low-power, high-bandwidth memory
- NVLink-C2C: coherent memory
- No code modification needed: same toolchain, same compilers, etc.
- ARM Software ecosystem : Armv8 SBSA (same binary for tools)
- There is an equivalent of MKL for Arm
- CUDA has feature parity for Grace / Hopper and x86
- The full ARM ecosystem is working
- NVHPC, g++ 12.3, Clang and gfortran work well on Grace
- NVIDIA contributes to LLVM (lld, the LLVM linker)
- NVPL (provides CPU versions of commonly used math libraries for Grace): BLAS, LAPACK, PBLAS, SCALAPACK, TENSOR, SPARSE, RAND, FFT (see the BLAS sketch after these notes)
- Still -flto for link-time optimization
- You can optimize matrix allocation if you want to improve performance, but this is not required to get started
- The audio is out of sync in the second half of the video
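NVPL is positioned as a drop-in CPU math library set for Grace, so an existing CBLAS call site should not need to change. Here is a minimal sketch of such a call site through the generic `cblas.h` interface; the exact NVPL header and link library names are not from the talk and should be checked against the NVPL documentation rather than taken from this example.

```c
/* Minimal sketch: a standard CBLAS dgemm call. On Grace the same call can be
 * served by NVPL's BLAS; header and link flags here follow the generic CBLAS
 * convention and may need to be adapted to the NVPL packaging. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { N = 4 };
    double A[N * N], B[N * N], C[N * N];

    /* Fill A and B with simple values; C starts at zero. */
    for (int i = 0; i < N * N; i++) {
        A[i] = (double)(i % N);
        B[i] = 1.0;
        C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C, row-major layout. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    printf("C[0][0] = %f\n", C[0]);
    return 0;
}
```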
- Wednesday, Mar 20 (12:00 AM - 12:25 AM CET) Accelerating HPC and AI Applications with NVIDIA BlueField DPUs: Strategies and Benefits [S61956]
- Dhabaleswar K (DK) Panda : Professor and University Distinguished Scholar, The Ohio State University
- Slides
- What's new in MVAPICH with BlueField 1, 2 and 3
- How to design high-performance middleware for CPUs, DPUs and GPUs
- MVAPICH since 2001 (23 years in 2024)
- Supports all the interconnects
- Non-blocking point-to-point architecture
- MPI : a lot of blocking all-to-all
- No computation happens during them, but they take a lot of time on big machines: wasted compute
- Introducing non-blocking collectives => better performance, but somebody has to do that work (see the MPI sketch after these notes)
- BlueField 1 and 2: staged transfer (reads the local DPU memory, which is not strictly necessary)
- BlueField 3 : GVMI : Guest Virtual Memory Id
- Saves about 20% on all-to-all communication
- But the workload has to be modified
- Replacing blocking collectives with non-blocking all-to-all: 12 to 21% speedup (32 nodes)
- MVAPICH DPU: 5 to 18% more GFlops
- Doing the same for MPI_Isend and MPI_Irecv: 13 to 18% speedup
- PETSc (used in Adflow, DAFoam, FreeFEM, MFEM, MOOSE, OpenFoam, etc.): 18 to 24% speedup on a 256x256x256 problem with 2, 4 and 8 nodes
- X-ScaleAI-DPU package to support PyTorch
- Some audio problems
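The speedups above come from turning blocking collectives into non-blocking ones so that communication can progress (on the DPU) while the host keeps computing. A minimal sketch of that overlap pattern with MPI_Ialltoall follows; the buffer size and the "independent work" loop are placeholders, and the offload itself is done by the MPI library (e.g. MVAPICH with BlueField), not by this code.

```c
/* Minimal sketch: overlapping an all-to-all with independent computation
 * using MPI_Ialltoall. Compile with mpicc. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;                 /* elements sent to each rank */
    double *sendbuf = malloc((size_t)size * count * sizeof(double));
    double *recvbuf = malloc((size_t)size * count * sizeof(double));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = rank + 0.001 * i;

    /* Start the collective, then do work that does not depend on recvbuf. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    double local = 0.0;                     /* placeholder "independent" work */
    for (int i = 0; i < size * count; i++)
        local += sendbuf[i] * sendbuf[i];

    /* Only block once the exchanged data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("local=%f first received=%f\n", local, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```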
- Wednesday, (6:00 PM - 6:50 PM CET) The Next-Generation DGX Architecture for Generative AI [S62421]
- Mike Houston : Vice President and Chief Architect of AI systems, NVIDIA
- Julie Bernauer : Senior Director, Data Center Systems Engineering, NVIDIA
- Slides
- DGX cluster since 2016
- 4000 GPUs seems to be the norm for AI training now
- 2 EOS systems: one in Santa Clara (top 5 in the TOP500) and one in Texas
- A lot of replicas with SuperPODs
- Team: applied systems at NVIDIA: how to build clusters
- Design and architecture of the SuperPOD
- CI for regression testing on the POD
- EOS : reference architecture for H100 (to do all the tests)
- Everything is liquid-cooled
- 576 nodes
- Rack layout: 4 DGX, 3 PDUs, minimize cable length as much as possible
- If that is not possible because of power, 1 or 2 PDUs are OK
- They have used liquid cooling since A100
- Support multiple linux distributions and CPU architectures
- Slurm config with Pyxis and Enroot
- SHARPv3 is supported with the NVIDIA plugin
- NGC containers: ensure reproducibility if a customer has a problem
- Test the hardware as quickly as possible once it is finished
- DGX B200 and DGX GB200 NVL72 (72 GPUs + 36 CPUs): 18 compute trays (blade servers, each with 2 Grace CPUs and 4 B200 GPUs as GB200 modules), 9 NVLink switch trays, 4 InfiniBand links per compute tray, 2 BlueField-3 NICs per compute tray for storage
- Liquid cooling leak detection
- The audio is out of sync and actually ahead of the video
- Goal: land in data centers that already exist
- Liquid-cooled high-density system:
- Storage: the orange stuff (maybe)
- Compute fabric is blue (maybe)
- Multimode fiber is the light blue (maybe)
- Standard RJ45 cables in green (maybe)
- On NVL72 there is no NVLink connection between racks (in this version)
- How much fiber is needed? 10,000 km
- Fibers come in big bundles
- EOS: 600 km of fiber
- Wednesday, (7:00 PM - 7:50 PM CET) Accelerating Scientific Workflows With the NVIDIA Grace Hopper Platform [S62337]
- Mathias Wagner : Senior Developer Technology Engineer, NVIDIA
- Slides
- Try to limit data transfers as much as possible
- Almost 2x from A100 to GH200 just with I/O optimization on MPTRAC
- Differences between PCIe 4, PCIe 5 and NVLink-C2C (see the bandwidth sketch after these notes)
- A lot of examples and applications
- Repeats content from last year
- Simulation of Sycamore: GH200 is 9.11x faster compared to A100+x86 and 2.33x compared to H100+x86
- Again a lot of examples of applications
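To make the PCIe-versus-C2C comparison concrete, here is a minimal sketch (not from the talk) that times a host-to-device copy with the CUDA runtime API from plain C; on a Grace Hopper node the copy crosses NVLink-C2C, while on an x86 host it crosses PCIe. The 1 GiB buffer size is arbitrary and error checking is omitted for brevity.

```c
/* Minimal sketch: measuring host-to-device copy bandwidth with the CUDA
 * runtime API (plain C, link against cudart). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = (size_t)1 << 30;   /* 1 GiB */
    void *host, *dev;
    cudaMallocHost(&host, bytes);           /* pinned host memory */
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time a single host-to-device copy. */
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```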