12.4.3 : HPC

  • Tuesday, (4:00 PM - 4:25 PM CET) NERSC-10 Benchmarks on Grace Hopper and Milan-A100 Systems: A Performance and Energy Case Study [S61402]
    • Zhengji Zhao : HPC Architect and Performance Engineer, NERSC
    • Slides
    • NERSC : mission HPC for the DOE Office of Science research
    • 35 PB flash file system
    • NERSC-10 : coming in 2026 : 10x capability compared to the previous system
    • Perlmutter : 4 A100 GPUs + 1 Milan CPU : based on PCIe Gen 4 and NVLink 3
    • Blade server slightly different from NVIDIA's : GH100
    • All-flash Lustre file system
    • Only HPC SDK 22.7 and 23.7
    • nvidia-smi can measure GPU power
    • nvidia-smi --loop-ms=500 --format=csv,nounits --query-gpu=index,timestamp,clocks.sm,power.draw,temperature.gpu,clocks_throttle_reasons.active,utilization.gpu
    • Periodic power fluctuations during code execution on H100 and A100, about 30%
    • More than 1.7x speedup without any optimisation
    • H100 draws more power but consumes 40% less energy overall compared to the A100
    • On our side, we never exceed 240 W per GPU on A100s
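The nvidia-smi query above emits one CSV row per GPU per interval; a minimal sketch (field order taken from the command above; the sample rows themselves are made up) that computes average and peak power per GPU:

```python
import csv
import io
from collections import defaultdict

# Sample rows in the format of the nvidia-smi --query-gpu command above:
# index, timestamp, clocks.sm, power.draw, temperature.gpu,
# clocks_throttle_reasons.active, utilization.gpu. Values are illustrative.
sample = """0, 2024/03/19 16:00:00.000, 1410, 231.5, 62, 0x0, 98
0, 2024/03/19 16:00:00.500, 1410, 238.2, 63, 0x0, 99
1, 2024/03/19 16:00:00.000, 1395, 224.9, 60, 0x0, 97
1, 2024/03/19 16:00:00.500, 1395, 229.1, 61, 0x0, 98"""

def power_stats(csv_text):
    """Return {gpu_index: (avg_watts, peak_watts)} from the CSV stream."""
    draws = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        idx, power = row[0].strip(), float(row[3])  # power.draw is field 3
        draws[idx].append(power)
    return {i: (sum(p) / len(p), max(p)) for i, p in draws.items()}

stats = power_stats(sample)
print(stats["0"])  # average and peak power draw for GPU 0
```

With --loop-ms=500 this lets you watch the ~30% periodic power fluctuations mentioned above.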
  • Tuesday, (6:00 PM - 6:50 PM CET) Achieving Higher Performance From Your Data Center and Cloud Application [S62388]
    • Daniel Horowitz : Senior Director of Engineering, Developer Tools, NVIDIA
    • Ankur Srivastava : Senior Solution Architect, Amazon Web Services
    • Slides
    • Points covered : GPU utilisation, SM active, communication, compute, overlap, CUDA kernel statistics
    • Nsight in Slurm
    • Nsight in Kubernetes
  • Tuesday (7:00 PM - 7:50 PM CET) From Scratch to Extreme: Boosting Service Throughput by Dozens of Times With Step-by-Step Optimization [S62410]
    • Gems Guo : Developer Technology Engineer, NVIDIA
    • Slides
    • The slide colours look dubious
    • Batched matrix multiplications (32x32x32)
    • Asynchronous data copies with several streams
    • Warp collectives
    • Latency and average latency are not linear
    • Multiple streams, and multiple CPU threads to manage the GPU streams
    • Page locking improves parallelism on the GPU?
    • In any case, unified memory rocks
    • Asynchronous scheduling
    • Persistent kernel pull
    • Getting the optimal number of threads and streams for the problem
    • Let's be honest : it is interesting but very hard to follow. There is even upside-down text in the slides
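The "multiple streams, multiple CPU threads" idea above can be sketched host-side in plain Python: each CPU thread owns one stream and submits its work in order, so independent requests overlap across streams. No GPU here; the inner computation is a stand-in for an async copy + kernel launch, and all names are illustrative:

```python
import queue
import threading

NUM_STREAMS = 4           # illustrative: one CPU thread per GPU stream
requests = queue.Queue()  # incoming work items
results = []
results_lock = threading.Lock()

def stream_worker(stream_id):
    """One CPU thread per stream: drain the shared queue in order."""
    while True:
        item = requests.get()
        if item is None:  # poison pill: shut this worker down
            requests.task_done()
            return
        # Stand-in for: async H2D copy + kernel launch + D2H copy
        # issued on this thread's dedicated stream.
        with results_lock:
            results.append((stream_id, item * item))
        requests.task_done()

threads = [threading.Thread(target=stream_worker, args=(i,))
           for i in range(NUM_STREAMS)]
for t in threads:
    t.start()
for req in range(32):
    requests.put(req)
for _ in threads:          # one poison pill per worker
    requests.put(None)
for t in threads:
    t.join()

print(len(results))  # 32: every request handled by some stream
```

Finding the optimal NUM_STREAMS (and batch size) for the problem is exactly the tuning step the talk describes.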
  • Tuesday, (11:30 PM - 11:55 PM CET) High-Speed Streaming Signal Processing: Teaming Up the NIC and GPU [S61931]
    • John Romein : Researcher, ASTRON (Netherlands Institute for Radio Astronomy)
    • Radioastronomy
    • Fast radio burst, dark matter, etc
    • More sensitive telescopes hopefully lead to more discovery science
    • LOFAR : 100s of antennas
    • A station is a group of antennas (processing with FPGAs)
    • The correlator combines antenna data and uses GPUs
    • The correlator creates matrices of samples, and computes only half of each matrix because it is symmetric
    • Use GPU Tensor Cores if you need GEMM
    • Complex numbers are not supported by Tensor Cores
    • Tensor Cores are so fast that it is difficult to feed them data at this rate
    • GH200 : up to 500 TOPS in this case
    • They want to avoid GPU network transfers passing through the CPU
    • The GPU can handle the network packets; the majority of the data goes directly to the GPU
    • PCIe Gen 4 : ~26 GB/s
    • PCIe Gen 5 (H100) : ~52 GB/s
    • A100 : (on 2 x 100 Gb/s lines) gets 198.6 Gb/s into the A100 => careful tuning to avoid packet loss
    • Jetson : 100 GbE NIC in a PCIe slot => 99.6 Gb/s on one 100 Gb/s line, with an additional packet copy (needed because of the overhead of the DPDK library)
    • GH200 : 398.6 Gb/s on a 400 Gb/s line (again needs a copy because of DPDK)
    • Packet loss when there are too many packet buffers in flight
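The "compute only half of the symmetric matrix" note above can be illustrated with a toy correlator in plain Python: for N antennas the visibility matrix is Hermitian, so only N(N+1)/2 baselines need computing and the other half falls out by conjugation (toy data, no Tensor Cores; all names are illustrative):

```python
# Toy correlator: correlate each pair of antenna sample streams.
# The visibility matrix is Hermitian (V[j,i] == conj(V[i,j])), so only
# the lower triangle -- N*(N+1)//2 baselines -- is actually computed.
N = 4   # illustrative antenna count
T = 8   # samples per antenna
samples = [[complex(a + t, a - t) for t in range(T)] for a in range(N)]

def correlate(x, y):
    """Sum over time of x[t] * conj(y[t]) -- one visibility."""
    return sum(xs * ys.conjugate() for xs, ys in zip(x, y))

# Lower triangle only, diagonal included (autocorrelations).
visibilities = {}
for i in range(N):
    for j in range(i + 1):
        visibilities[(i, j)] = correlate(samples[i], samples[j])

n_baselines = len(visibilities)  # N*(N+1)//2 = 10 for N = 4
# The upper half is recovered for free by conjugation.
upper = {(j, i): v.conjugate() for (i, j), v in visibilities.items()}
print(n_baselines)
```

At LOFAR scale the same triangle trick halves the dominant GEMM-like workload the talk maps onto Tensor Cores.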
  • Wednesday, (7:00 PM - 7:50 PM CET) Grace Hopper Superchip Architecture and Performance Optimizations for Deep Learning Applications [S61159]
    • Matthias Jouanneaux : DevTech Compute, NVIDIA
    • Slides
    • no recording (no audio)
  • Thursday, Mar 21 (12:00 AM - 12:50 AM CET) Energy and Power Efficiency for Applications on the Latest NVIDIA Technology [S62419]
    • Alan Gray : Principal Developer Technology Engineer, NVIDIA
    • Slides
    • NVIDIA GPUs can be configured to run at a lower clock frequency
    • Direct quotes from the same presentation at GTC 2023
    • A lot of different examples and applications
    • Power depends a lot on application usage
    • Time x Power = Energy (and some graphs)
    • TensorRT has an influence on final energy consumption
    • A lot of tests on A100
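The Time x Power = Energy point can be made concrete with a toy trade-off: capping the clock lowers power but lengthens runtime, and energy only improves when the power saving outweighs the slowdown. All numbers below are hypothetical, not from the talk:

```python
def energy_joules(runtime_s, avg_power_w):
    """Energy = time x power -- what the talk's graphs plot."""
    return runtime_s * avg_power_w

# Hypothetical run at default clocks vs. a capped clock frequency.
default = energy_joules(runtime_s=100.0, avg_power_w=400.0)  # 40 kJ
capped = energy_joules(runtime_s=115.0, avg_power_w=300.0)   # 34.5 kJ

# 15% slower but 25% less power -> ~14% less energy overall.
saving = 1 - capped / default
print(f"{saving:.0%} energy saved")
```

This is why the talk measures both axes per application: the break-even point is workload dependent.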
  • Thursday, Mar 21 (12:00 AM - 12:25 AM CET) Harnessing Grace Hopper's Capabilities to Accelerate Vector Database Search [S62339]
    • Akira Naruse : Principal Developer Technology Engineer, NVIDIA
    • A lot of applications : RAG, Image search, molecular search, ANNS, Graph-ANNS
    • The number of dimensions is increasing
    • An exact solution is practically impossible to get, but good accuracy is needed
    • CAGRA is faster than HNSW because these algorithms are limited by memory bandwidth
    • Weakness of Graph-ANNS : needs 10 GPUs to handle a 1-billion-scale vector DB
    • 384 GB for the vector DB for DEEP-1B
    • Reducible to 52 GB with scalar quantisation and product quantisation
    • No lossy compression for graph index
    • Grace Hopper Helps
    • Compression of vectors, but access to the PQ codebook is random
    • CAGRA is kind of the state of the art today for vector search
    • No support for huge pages in cudaMallocHost
    • Grace Hopper allows mmap with huge pages
    • GH200 7x faster than x86 + H100 => 1M queries per second at 90% accuracy
    • CAGRA-Q with compressed DB : 23x compared to HNSW on CPU (269 GB, 88M vectors, 768 dimensions)
    • CAGRA-Q will be available in RAPIDS cuVS very soon
    • Interesting, but you have to hang on
    • Using Grace Hopper coherent memory still shows a ~5% drop
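The memory figures above are easy to reproduce: DEEP-1B is 1 billion 96-dimensional fp32 vectors, which is exactly 384 GB uncompressed. The quantised size below assumes 1 byte per dimension for scalar quantisation (the talk's 52 GB figure combines scalar and product quantisation, whose codebook parameters are not given here):

```python
def dataset_bytes(n_vectors, n_dims, bytes_per_dim):
    """Raw size of a dense vector database."""
    return n_vectors * n_dims * bytes_per_dim

GB = 10**9

# DEEP-1B: 1 billion vectors, 96 dimensions, fp32 (4 bytes per dim).
fp32 = dataset_bytes(1_000_000_000, 96, 4)  # 384 GB, as in the talk
# Scalar quantisation to int8 (assumed 1 byte/dim) already shrinks it 4x.
sq8 = dataset_bytes(1_000_000_000, 96, 1)   # 96 GB

print(fp32 // GB, sq8 // GB)
```

Even at 52 GB the working set exceeds a single H100's HBM, which is why the talk leans on Grace Hopper's coherent CPU memory.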
  • Thursday, (5:00 PM - 5:50 PM CET) VMAF CUDA: Running at Transcode Speed [S62417]
    • Jorge Estrada : Software Engineer, Snap, Inc.
    • Xavier Drudis : Staff Software Engineer, Snap, Inc
    • Cem Moluluo : Senior Developer Technology Engineer, NVIDIA
    • Slides
    • Video Quality Metric
    • 4 ms latency on 4k videos
    • Live VMAF runs asynchronously thanks to NVENC on the GPU
    • 45% reduction of storage for Memories media
  • Thursday, (6:00 PM - 6:25 PM CET) RAPIDS Accelerator for Apache Spark Propels Data Center Efficiency and Cost Savings [S62130]
    • Eyal Hirsch : Software Engineer, Taboola
    • Slides
    • no video