12.4.2 : Hardware

  • Monday, (6:00 PM - 6:50 PM CET) Early Science with Grace Hopper at Scale on Alps [S62157]
    • Thomas Schulthess : Director, ETH Zurich/The Swiss National Supercomputing Center (CSCS)
    • Each blade: 6 or 7 kW
    • 400 Gb/s across all of Switzerland
    • A lot of context on the institutes and universities
    • 2017 : 5320 P100
    • Alps: the meteorologists accepted the project right away
    • 2016 : 192 K80 (90% of computational peak from GPU)
    • The meteorologists have 40 years of experience storing large amounts of data
    • Alps was built together with the EXCLAIM project (2021 - 2027)
    • The idea is to run a weather simulation with everything coupled (atmosphere, land, etc.)
    • But while consuming less than 5 MW
    • CUDA implementation of ICON
    • They simulate 60 days of weather per day
    • Alps should be operational this summer
    • 16x CG4 on Alps: fully coupled simulation of 300 days per day (64 GPUs with Grace)
    • Grace: 375 GB/s in a STREAM test (see the sketch after this list)
    • They will get to 200 MW, but spread over several countries
    • They are seriously considering mixed precision for their simulations
    • Testing and integration of these simulations must improve, even with machine-learning algorithms
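    • A STREAM-triad-style bandwidth probe, as a minimal sketch (OpenMP threading, the array size, and the build line are assumptions; the exact benchmark CSCS ran was not shown):

      ```c
      /* stream_triad.c -- rough memory-bandwidth probe in the spirit of STREAM.
       * Assumed build line: gcc -O3 -fopenmp stream_triad.c -o stream_triad */
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      #define N (1UL << 27)                  /* 128M doubles per array, ~1 GiB each */

      int main(void) {
          double *a = malloc(N * sizeof *a);
          double *b = malloc(N * sizeof *b);
          double *c = malloc(N * sizeof *c);
          if (!a || !b || !c) return 1;

          #pragma omp parallel for
          for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

          double t0 = omp_get_wtime();
          #pragma omp parallel for
          for (size_t i = 0; i < N; i++)
              a[i] = b[i] + 3.0 * c[i];      /* triad: a = b + s*c */
          double t = omp_get_wtime() - t0;

          /* per iteration: two reads and one write of a double */
          printf("triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / 1e9 / t);
          free(a); free(b); free(c);
          return 0;
      }
      ```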
  • Tuesday, (4:00 PM - 4:25 PM CET) How to Safely and Successfully Boost Your Data Center Productivity and ROI (Presented by DDN) [S63121]
    • James Coomer : Sr. Vice President, Products, DDN
    • Slides
    • DDN : data storage and data management solutions
    • DDN EXAScaler is used on EOS (NVIDIA's in-house supercomputer)
    • Tackles the storage problem, especially for AI training (60% data load and only 5% model load; 43% on checkpointing for LLMs)
    • mmap is about 20x faster for reads (see the sketch after this list)
    • Grace: 7 GB/s write from a single thread
    • All packed into a blade server (2U)
    • Takes I/O out of the execution time, while compute time keeps being optimised every year anyway
    • It is based on a kind of simplified Lustre
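    • A minimal sketch of the mmap read path behind that "20x faster" claim (plain POSIX; the file name and the page-stride walk are illustrative assumptions, not DDN's client code):

      ```c
      /* mmap_read.c -- map a data file instead of read()-ing it into a buffer. */
      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(void) {
          int fd = open("training_shard.bin", O_RDONLY);   /* hypothetical file */
          if (fd < 0) { perror("open"); return 1; }

          struct stat st;
          if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

          /* Map the whole file read-only; pages are faulted in on demand. */
          const uint8_t *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          if (data == MAP_FAILED) { perror("mmap"); return 1; }

          /* Touch every page; a real loader would hand `data` to the framework. */
          uint64_t sum = 0;
          for (off_t i = 0; i < st.st_size; i += 4096)
              sum += data[i];
          printf("checksum %llu over %lld bytes\n",
                 (unsigned long long)sum, (long long)st.st_size);

          munmap((void *)data, st.st_size);
          close(fd);
          return 0;
      }
      ```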
  • Tuesday, (4:00 PM - 4:50 PM CET) Revolutionizing Supercomputing: Unleashing the Power of Grace [S62579]
    • Grace and Grace Hopper users
    • Earlier experience with Arm architectures went well, so they moved to Grace
    • Users reach the 4 TB/s bandwidth of Grace Hopper
    • The only way to get users to compute on Arm is to make it the only architecture available
    • In Europe: about 35 cents per kilowatt-hour
    • Using more efficient architectures saves money that can be spent on things other than hardware, such as manpower
    • We built a Ferrari, now we need good drivers
    • Porting to Grace Hopper is easier than it used to be
    • If you did a good job as a programmer, porting will be easy; if the software is a mess, it will be difficult
  • Tuesday, (10:00 PM - 10:50 PM CET) Scientific Computing With NVIDIA Grace and the Arm Software Ecosystem [S61598]
    • John Linford : Principal Technical Product Manager, Datacenter CPU Software, NVIDIA
    • Lars Koesterke : Research Associate/Manager, University of Texas at Austin
    • The Grace architecture combines the Arm architecture and micro-architecture with NVIDIA's chip-to-chip interconnect and networking
    • LPDDR5: low-power, high-bandwidth memory
    • NVLink-C2C coherent memory
    • No code modification needed: same toolchain, same compilers, etc.
    • Arm software ecosystem: Armv8 SBSA (same binaries for the tools)
    • There is an equivalent of MKL for Arm
    • CUDA has feature parity between Grace Hopper and x86
    • The full Arm ecosystem works
    • NVHPC, G++ 12.3, Clang and gfortran work well on Grace
    • NVIDIA contributes to LLVM (lld, the LLVM linker)
    • NVPL (Grace/Arm-optimised CPU versions of commonly used math libraries): BLAS, LAPACK, PBLAS, ScaLAPACK, TENSOR, SPARSE, RAND, FFT (see the sketch after this list)
    • -flto is still there for link-time optimisation
    • You can optimise matrix allocation to improve performance, but it is not required to get started
    • The audio goes out of sync in the second part of the video
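    • A minimal sketch of what the drop-in math libraries mean in practice: a standard CBLAS call that on Grace would just be linked against NVPL BLAS instead of MKL or OpenBLAS (the cblas.h header and the link line are assumptions, since the talk did not show code):

      ```c
      /* dgemm_demo.c -- standard CBLAS GEMM; only the link line changes per BLAS. */
      #include <stdio.h>
      #include <cblas.h>

      int main(void) {
          enum { N = 3 };
          /* Row-major 3x3 matrices */
          double A[N * N] = {1, 0, 0,  0, 1, 0,  0, 0, 1};   /* identity */
          double B[N * N] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
          double C[N * N] = {0};

          /* C = 1.0 * A * B + 0.0 * C */
          cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      N, N, N, 1.0, A, N, B, N, 0.0, C, N);

          for (int i = 0; i < N * N; i++)
              printf("%g%c", C[i], (i % N == N - 1) ? '\n' : ' ');
          return 0;
      }
      ```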
  • Wednesday, Mar 20 (12:00 AM - 12:25 AM CET) Accelerating HPC and AI Applications with NVIDIA BlueField DPUs: Strategies and Benefits [S61956]
    • Dhabaleswar K (DK) Panda : Professor and University Distinguished Scholar, The Ohio State University
    • Slides
    • What's new in MVAPICH with BlueField 1, 2 and 3
    • How to design high-performance middleware for CPU, DPU and GPU
    • MVAPICH since 2001 (23 years in 2024)
    • Supports all the interconnects
    • Non-blocking point-to-point architecture
    • MPI: a lot of blocking all-to-all
    • No computation happens during it, yet it takes a lot of time on big machines: wasted compute
    • Introducing non-blocking collectives => better performance, but somebody has to do the communication work (see the sketch after this list)
    • BlueField 1 and 2: staged transfers (go through the local DPU memory, which is not strictly necessary)
    • BlueField 3: GVMI: Guest Virtual Memory Id
    • Saves about 20% on all-to-all communication
    • But the workload has to be modified
    • Replacing blocking collectives with non-blocking all-to-all: 12 to 21% speedup (32 nodes)
    • MVAPICH-DPU: 5 to 18% more GFLOPS
    • Doing the same for MPI_Isend and MPI_Irecv: 13 to 18% speedup
    • PETSc (used in ADflow, DAFoam, FreeFEM, MFEM, MOOSE, OpenFOAM, etc.): 18 to 24% speedup on a 256x256x256 problem with 2, 4 and 8 nodes
    • X-ScaleAI-DPU package to support PyTorch
    • Some audio problems
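    • A minimal sketch of the blocking-to-non-blocking change those speedups refer to: start MPI_Ialltoall, do independent work, then wait (buffer sizes and the dummy work are illustrative; the DPU offload itself happens inside MVAPICH, not in user code):

      ```c
      /* ialltoall_overlap.c -- overlap an all-to-all exchange with local compute. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const int count = 1 << 16;   /* doubles sent to each peer (illustrative) */
          double *sendbuf = malloc((size_t)count * size * sizeof *sendbuf);
          double *recvbuf = malloc((size_t)count * size * sizeof *recvbuf);
          for (int i = 0; i < count * size; i++) sendbuf[i] = rank;

          /* Start the exchange, then do work that touches neither buffer. */
          MPI_Request req;
          MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                        recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

          double local = 0.0;
          for (int i = 0; i < 1000000; i++) local += 1.0 / (i + 1.0);

          MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is only valid after this */

          if (rank == 0) printf("overlapped work result: %g\n", local);
          free(sendbuf); free(recvbuf);
          MPI_Finalize();
          return 0;
      }
      ```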
  • Wednesday, (6:00 PM - 6:50 PM CET) The Next-Generation DGX Architecture for Generative AI [S62421]
    • Mike Houston : Vice President and Chief Architect of AI systems, NVIDIA
    • Julie Bernauer : Senior Director, Data Center Systems Engineering, NVIDIA
    • Slides
    • DGX cluster since 2016
    • 4000 GPUs seems to be the norm for AI training now
    • Two EOS systems: one in Santa Clara (top 5 in the TOP500) and one in Texas
    • A lot of replicas built from SuperPODs
    • Team: applied systems at NVIDIA: how to build clusters
    • Design and architecture of the SuperPOD
    • CI for regression testing on the POD
    • EOS: reference architecture for H100 (used to run all the tests)
    • Everything is liquid-cooled
    • 576 nodes
    • Rack layout: 4 DGX, 3 PDUs, minimise cable length as much as possible
    • If that is not possible because of power, 1 or 2 PDUs are OK
    • They have used liquid cooling since A100
    • Supports multiple Linux distributions and CPU architectures
    • Slurm config with Pyxis and Enroot
    • SHARPv3 is supported with the NVIDIA plugin
    • NGC containers: ensure reproducibility if a customer has a problem
    • Test the hardware as quickly as possible once it is finished
    • DGX B200
    • DGX GB200 NVL (72 GPUs + 36 CPUs): 18 compute trays (blade servers, each with 2 Grace CPUs and 4 B200 GPUs as GB200 modules), 9 NVLink switch trays, 4 InfiniBand links per compute tray, 2 BF3 NICs per compute tray for storage
    • Liquid cooling leak detection
    • The audio is out of sync and actually ahead of the video
    • Goal: land in data centers that already exist
    • Liquid-cooled high-density system:
    • Storage: the orange cables (maybe)
    • Compute fabric: blue (maybe)
    • Multimode fiber: light blue (maybe)
    • Standard RJ45 cables: green (maybe)
    • On NVL72 there is no NVLink connection between racks (in this version)
    • How much fiber is needed? 10,000 km
    • Fibers come in big bundles
    • EOS: 600 km of fiber
  • Wednesday, (7:00 PM - 7:50 PM CET) Accelerating Scientific Workflows With the NVIDIA Grace Hopper Platform [S62337]
    • Mathias Wagner : Senior Developer Technology Engineer, NVIDIA
    • Slides
    • Try to limit data transfer as much as possible
    • Almost 2x from A100 to GH200 just with I/O optimisation on MPTRAC
    • Differences between PCIe 4, PCIe 5 and C2C
    • A lot of examples and applications
    • Repeats quite a bit from last year
    • Simulation of Sycamore: GH200 is 9.11x faster than A100+x86 and 2.33x faster than H100+x86
    • Again, a lot of application examples