12.4.3 : HPC

  • Tuesday, (4:00 PM - 4:25 PM CET) NERSC-10 Benchmarks on Grace Hopper and Milan-A100 Systems: A Performance and Energy Case Study [S61402]
    • Zhengji Zhao : HPC Architect and Performance Engineer, NERSC
    • Slides
    • NERSC : mission HPC for the DOE Office of Science research
    • 35 PB flash file system
    • NERSC-10 : coming in 2026 : 10x capability compared to the previous system
    • Perlmutter : 4 A100 GPUs + 1 Milan CPU : based on PCIe Gen 4 and NVLink 3
    • Blade server slightly different from NVIDIA's : GH100
    • All-flash Lustre file system
    • Only HPC SDK 22.7 and 23.7
    • nvidia-smi can measure GPU power
    • nvidia-smi --loop-ms=500 --format=csv,nounits --query-gpu=index,timestamp,clocks.sm,power.draw,temperature.gpu,clocks_throttle_reasons.active,utilization.gpu
    • Periodic power fluctuations during code execution on H100 and A100, about 30%
    • More than 1.7x speedup without any optimisation
    • H100 draws more power but consumes 40% less energy overall compared to the A100
    • On our side, we never exceed 240 W per GPU on A100s
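The nvidia-smi query above emits one CSV row per GPU per interval; a minimal sketch (field order taken from the command above; the sample rows themselves are made up) that computes average and peak power per GPU:

```python
import csv
import io
from collections import defaultdict

# Sample rows in the format of the nvidia-smi --query-gpu command above:
# index, timestamp, clocks.sm, power.draw, temperature.gpu,
# clocks_throttle_reasons.active, utilization.gpu. Values are illustrative.
sample = """0, 2024/03/19 16:00:00.000, 1410, 231.5, 62, 0x0, 98
0, 2024/03/19 16:00:00.500, 1410, 238.2, 63, 0x0, 99
1, 2024/03/19 16:00:00.000, 1395, 224.9, 60, 0x0, 97
1, 2024/03/19 16:00:00.500, 1395, 229.1, 61, 0x0, 98"""

def power_stats(csv_text):
    """Return {gpu_index: (avg_watts, peak_watts)} from the CSV stream."""
    draws = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        idx, power = row[0].strip(), float(row[3])  # power.draw is field 3
        draws[idx].append(power)
    return {i: (sum(p) / len(p), max(p)) for i, p in draws.items()}

stats = power_stats(sample)
print(stats["0"])  # average and peak power draw for GPU 0
```

With --loop-ms=500 this lets you watch the ~30% periodic power fluctuations mentioned above.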
  • Tuesday, (6:00 PM - 6:50 PM CET) Achieving Higher Performance From Your Data Center and Cloud Application [S62388]
    • Daniel Horowitz : Senior Director of Engineering, Developer Tools, NVIDIA
    • Ankur Srivastava : Senior Solution Architect, Amazon Web Services
    • Slides
    • Points covered : GPU utilisation, SM active, communication, compute, overlap, CUDA kernel statistics
    • Nsight in Slurm
    • Nsight in Kubernetes
  • Tuesday (7:00 PM - 7:50 PM CET) From Scratch to Extreme: Boosting Service Throughput by Dozens of Times With Step-by-Step Optimization [S62410]
    • Gems Guo : Developer Technology Engineer, NVIDIA
    • Slides
    • The slide colours look dubious
    • Batched matrix multiplications (32x32x32)
    • Asynchronous data copies with several streams
    • Warp collectives
    • Latency and average latency are not linear
    • Multiple streams, and multiple CPU threads to manage the GPU streams
    • Page locking improves parallelism on the GPU?
    • In any case, unified memory rocks
    • Asynchronous scheduling
    • Persistent kernel pull
    • Getting the optimal number of threads and streams for the problem
    • Let's be honest : it is interesting but very hard to follow. There is even upside-down text in the slides
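The "multiple streams, multiple CPU threads" idea above can be sketched host-side in plain Python: each CPU thread owns one stream and submits its work in order, so independent requests overlap across streams. No GPU here; the inner computation is a stand-in for an async copy + kernel launch, and all names are illustrative:

```python
import queue
import threading

NUM_STREAMS = 4           # illustrative: one CPU thread per GPU stream
requests = queue.Queue()  # incoming work items
results = []
results_lock = threading.Lock()

def stream_worker(stream_id):
    """One CPU thread per stream: drain the shared queue in order."""
    while True:
        item = requests.get()
        if item is None:  # poison pill: shut this worker down
            requests.task_done()
            return
        # Stand-in for: async H2D copy + kernel launch + D2H copy
        # issued on this thread's dedicated stream.
        with results_lock:
            results.append((stream_id, item * item))
        requests.task_done()

threads = [threading.Thread(target=stream_worker, args=(i,))
           for i in range(NUM_STREAMS)]
for t in threads:
    t.start()
for req in range(32):
    requests.put(req)
for _ in threads:          # one poison pill per worker
    requests.put(None)
for t in threads:
    t.join()

print(len(results))  # 32: every request handled by some stream
```

Finding the optimal NUM_STREAMS (and batch size) for the problem is exactly the tuning step the talk describes.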
  • Tuesday, (11:30 PM - 11:55 PM CET) High-Speed Streaming Signal Processing: Teaming Up the NIC and GPU [S61931]
    • John Romein : Researcher, ASTRON (Netherlands Institute for Radio Astronomy)
    • Radioastronomy
    • Fast radio burst, dark matter, etc
    • More sensitive telescopes hopefully lead to more discovery science
    • LOFAR : 100s of antennas
    • A station is a group of antennas (processing with FPGAs)
    • The correlator combines antenna data and uses GPUs
    • The correlator creates matrices of samples, and computes only half of each matrix because it is symmetric
    • Use GPU Tensor Cores if you need GEMM
    • Complex numbers are not supported by Tensor Cores
    • Tensor Cores are so fast that it is difficult to feed them data at this rate
    • GH200 : up to 500 TOPS in this case
    • They want to avoid GPU network transfers passing through the CPU
    • The GPU can handle the network packets; the majority of the data goes directly to the GPU
    • PCIe Gen 4 : ~26 GB/s
    • PCIe Gen 5 (H100) : ~52 GB/s
    • A100 : (on 2 x 100 Gb/s lines) gets 198.6 Gb/s into the A100 => careful tuning to avoid packet loss
    • Jetson : 100 GbE NIC in a PCIe slot => 99.6 Gb/s on one 100 Gb/s line, with an additional packet copy (needed because of the overhead of the DPDK library)
    • GH200 : 398.6 Gb/s on a 400 Gb/s line (again needs a copy because of DPDK)
    • Packet loss when there are too many packet buffers in flight
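The "compute only half of the symmetric matrix" note above can be illustrated with a toy correlator in plain Python: for N antennas the visibility matrix is Hermitian, so only N(N+1)/2 baselines need computing and the other half falls out by conjugation (toy data, no Tensor Cores; all names are illustrative):

```python
# Toy correlator: correlate each pair of antenna sample streams.
# The visibility matrix is Hermitian (V[j,i] == conj(V[i,j])), so only
# the lower triangle -- N*(N+1)//2 baselines -- is actually computed.
N = 4   # illustrative antenna count
T = 8   # samples per antenna
samples = [[complex(a + t, a - t) for t in range(T)] for a in range(N)]

def correlate(x, y):
    """Sum over time of x[t] * conj(y[t]) -- one visibility."""
    return sum(xs * ys.conjugate() for xs, ys in zip(x, y))

# Lower triangle only, diagonal included (autocorrelations).
visibilities = {}
for i in range(N):
    for j in range(i + 1):
        visibilities[(i, j)] = correlate(samples[i], samples[j])

n_baselines = len(visibilities)  # N*(N+1)//2 = 10 for N = 4
# The upper half is recovered for free by conjugation.
upper = {(j, i): v.conjugate() for (i, j), v in visibilities.items()}
print(n_baselines)
```

At LOFAR scale the same triangle trick halves the dominant GEMM-like workload the talk maps onto Tensor Cores.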
  • Wednesday, (7:00 PM - 7:50 PM CET) Grace Hopper Superchip Architecture and Performance Optimizations for Deep Learning Applications [S61159]
    • Matthias Jouanneaux : DevTech Compute, NVIDIA
    • Slides
    • no recording (no audio)
  • Thursday, Mar 21 (12:00 AM - 12:50 AM CET) Energy and Power Efficiency for Applications on the Latest NVIDIA Technology [S62419]
    • Alan Gray : Principal Developer Technology Engineer, NVIDIA
    • Slides
    • NVIDIA GPUs can be configured to run at a lower clock frequency
    • Direct quotes from the same presentation at GTC 2023
    • A lot of different examples and applications
    • Power depends a lot on application usage
    • Time x Power = Energy (and some graphs)
    • TensorRT has an influence on final energy consumption
    • A lot of tests on A100
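The Time x Power = Energy point can be made concrete with a toy trade-off: capping the clock lowers power but lengthens runtime, and energy only improves when the power saving outweighs the slowdown. All numbers below are hypothetical, not from the talk:

```python
def energy_joules(runtime_s, avg_power_w):
    """Energy = time x power -- what the talk's graphs plot."""
    return runtime_s * avg_power_w

# Hypothetical run at default clocks vs. a capped clock frequency.
default = energy_joules(runtime_s=100.0, avg_power_w=400.0)  # 40 kJ
capped = energy_joules(runtime_s=115.0, avg_power_w=300.0)   # 34.5 kJ

# 15% slower but 25% less power -> ~14% less energy overall.
saving = 1 - capped / default
print(f"{saving:.0%} energy saved")
```

This is why the talk measures both axes per application: the break-even point is workload dependent.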
  • Thursday, Mar 21 (12:00 AM - 12:25 AM CET) Harnessing Grace Hopper's Capabilities to Accelerate Vector Database Search [S62339]
    • Akira Naruse : Principal Developer Technology Engineer, NVIDIA
    • A lot of applications : RAG, Image search, molecular search, ANNS, Graph-ANNS
    • The number of dimensions is increasing
    • An exact solution is practically impossible to get, but good accuracy is needed
    • CAGRA is faster than HNSW because these algorithms are limited by memory bandwidth
    • Weakness of Graph-ANNS : needs 10 GPUs to handle a 1-billion-scale vector DB
    • 384 GB for the vector DB for DEEP-1B
    • Reducible to 52 GB with scalar quantisation and product quantisation
    • No lossy compression for graph index
    • Grace Hopper Helps
    • Compression of vectors, but access to the PQ codebook is random
    • CAGRA is kind of the state of the art today for vector search
    • No support for huge pages in cudaMallocHost
    • Grace Hopper allows mmap with huge pages
    • GH200 7x faster than x86 + H100 => 1M queries per second at 90% accuracy
    • CAGRA-Q with compressed DB : 23x compared to HNSW on CPU (269 GB, 88M vectors, 768 dimensions)
    • CAGRA-Q will be available in RAPIDS cuVS very soon
    • Interesting, but you have to hang on
    • Using Grace Hopper coherent memory still shows a ~5% drop
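The memory figures above are easy to reproduce: DEEP-1B is 1 billion 96-dimensional fp32 vectors, which is exactly 384 GB uncompressed. The quantised size below assumes 1 byte per dimension for scalar quantisation (the talk's 52 GB figure combines scalar and product quantisation, whose codebook parameters are not given here):

```python
def dataset_bytes(n_vectors, n_dims, bytes_per_dim):
    """Raw size of a dense vector database."""
    return n_vectors * n_dims * bytes_per_dim

GB = 10**9

# DEEP-1B: 1 billion vectors, 96 dimensions, fp32 (4 bytes per dim).
fp32 = dataset_bytes(1_000_000_000, 96, 4)  # 384 GB, as in the talk
# Scalar quantisation to int8 (assumed 1 byte/dim) already shrinks it 4x.
sq8 = dataset_bytes(1_000_000_000, 96, 1)   # 96 GB

print(fp32 // GB, sq8 // GB)
```

Even at 52 GB the working set exceeds a single H100's HBM, which is why the talk leans on Grace Hopper's coherent CPU memory.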
  • Thursday, (5:00 PM - 5:50 PM CET) VMAF CUDA: Running at Transcode Speed [S62417]
    • Jorge Estrada : Software Engineer, Snap, Inc.
    • Xavier Drudis : Staff Software Engineer, Snap, Inc
    • Cem Moluluo : Senior Developer Technology Engineer, NVIDIA
    • Slides
    • Video Quality Metric
    • 4 ms latency on 4k videos
    • Live VMAF runs asynchronously thanks to NVENC on the GPU
    • 45% reduction of storage for Memories media
  • Thursday, (6:00 PM - 6:25 PM CET) RAPIDS Accelerator for Apache Spark Propels Data Center Efficiency and Cost Savings [S62130]
    • Eyal Hirsch : Software Engineer, Taboola
    • Slides
    • no video