12.4.4 : HPC Development
Monday, (4:00 PM - 5:20 PM CET), available: Introduction to CUDA Programming and Performance Optimization [S62191]
Athena Elafrou, Developer Technology Engineer, NVIDIA
A must-watch if you program in CUDA
Hopper: 4 warp schedulers per SM
SoA (structure of arrays) is 6x faster than AoS (array of structures) on H100 (and on all GPUs in general)
atomicAdd(&a, b*c) does not fetch a back to the thread; the L2 updates the value of a with b*c (see the Numba sketch below)
Use ncu (Nsight Compute) to collect metrics
Limit of 256 registers per thread (the info is available in Nsight Compute)
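A minimal sketch (not from the talk) of the atomicAdd point above, using Numba's CUDA Python bindings: every thread accumulates b[i]*c[i] into a single scalar a without ever reading its current value back.

import numpy as np
from numba import cuda

@cuda.jit
def accumulate(a, b, c):
    i = cuda.grid(1)
    if i < b.shape[0]:
        # The atomic update happens near the data (L2); the thread never
        # fetches the current value of a[0] back into its registers.
        cuda.atomic.add(a, 0, b[i] * c[i])

b = np.random.rand(1 << 20).astype(np.float32)
c = np.random.rand(1 << 20).astype(np.float32)
a = np.zeros(1, dtype=np.float32)
accumulate.forall(b.size)(a, b, c)
print(a[0])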
Monday, (5:00 PM - 5:25 PM CET) Accelerating Pandas with Zero Code Change using RAPIDS cuDF [S62168]
Ashwin Srinath : Senior Software Engineer, NVIDIA
cuDF : all core functionalities of Pandas, but on the GPU
10 to 100x speedup compared to Pandas on a 100-300k rows x 2 columns DataFrame
import cudf
or python -m cudf.pandas script.py
Pandas is everywhere but largely single-threaded
Pandas is not a query engine
cuDF (via cudf.pandas) covers 100% of the Pandas API (if an operation can run on the GPU it runs on the GPU, otherwise it falls back to the CPU)
36.7 s down to 720 ms for read + groupby + sort
%load_ext cudf.pandas
>100x faster join and >40x faster groupby
Example with Pandas + LLMs : cuDF becomes the fastest component of the pipeline
cuDF passes 94% of the Pandas unit tests (some edge cases remain to be solved)
Stay on the GPU as much as possible
from cudf.pandas import Profiler (or the %%cudf.pandas.profiler cell magic)
Pay attention to GPU memory usage (when GPU memory is full, operations fall back to the CPU, which can lead to unexpected slowdowns)
Use idiomatic Pandas patterns (a user-defined function took 4 s vs 846 ms for the native method; sketch below)
Datetime index is supported on the GPU but some functionalities are not ready yet
cuDF does not really depend on NumPy (it may come to cuDF when NumPy becomes the default for Pandas 3, maybe)
For multiple GPUs : use Dask cuDF (dask_cudf)
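A minimal sketch (not from the talk) of the zero-code-change mode and the idiomatic-Pandas point above, assuming cudf is installed and a GPU is present:

import cudf.pandas
cudf.pandas.install()          # must run before "import pandas"

import pandas as pd            # plain pandas code now runs on the GPU when it can

df = pd.DataFrame({"key": ["a", "b", "c", "d"] * 250_000,
                   "val": range(1_000_000)})

fast = df.groupby("key")["val"].mean()      # idiomatic, stays on the GPU
slow = df["val"].apply(lambda x: x * 2)     # row-wise UDF: the slow pattern noted above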
Monday, (5:00 PM - 6:40 PM CET) Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads [DLIT61667]
Monday, Mar (5:30 PM - 6:50 PM CET), available: Advanced Performance Optimization in CUDA [S62192]
Igor Terentiev : Engineer V, Dev Tech S/W, NVIDIA
Slides
CTA (Cooperative Thread Array) == Thread Block
CGA (Cooperative Grid Array) == Thread Block Cluster
Code snippets: namespace cg = cooperative_groups;
Increased SMEM saves GMEM trips in many algorithms
linear rank: cg::this_cluster().block_rank()
3D rank : cg::this_cluster().block_index()
Thread synchronization:
CTA -> cg::this_thread_block().sync()
CGA -> cg::this_cluster().sync()
Remote DSMEM (distributed shared memory) has to be mapped:
__shared__ T smem;
auto dsmem_ptr = cg::this_cluster().map_shared_rank(&smem, rank);
__syncthreads();
cg::this_grid().sync();
CUDA Graphs for optimized kernel launches
Monday, Mar 18 6:00 PM - 6:50 PM CET Magnum IO GPUDirect, NCCL, NVSHMEM, and GDA-KI on Grace Hopper and Hopper systems [S61368]
Pak Markthub : Senior Software Engineer, NVIDIA
Harry Petty : NVIDIA
Davide Rossetti : Senior Software Engineer, NVIDIA
Slides
Magnum IO umbrella
Storage IO, Network IO, In-Network Computing, NVLink Network, Collective Operations
NVLink in Magnum IO shows a 30% speedup
NCCL : parallel efficiency
Parallel efficiency drops to 50% at 128 GPUs
GPUDirect RDMA : GPU and NIC are inline
GPUDirect P2P : a CUDA kernel running on GPU0 can use the memory of GPU5
GPUDirect Async : Mapping of the NIC memory on the GPU
GPUDirect virtualisation : KVM
GH200 : the main coherent platform
NIC memory registration based on DMA-BUF with contribution to Linux kernel (with GPL libraries)
GDA-KI : a kernel running on the SMs can trigger the NIC (possibility to prepare WQEs in parallel)
NVSHMEM IBGDA : IBRC is the default communication proxy; thread, warp and block scopes
270 Million packets per second with 4 CTA in shmem_p_bw
94 Million packets per second with 8 CTA in shmem_g_bw
Jacobi test : scaling close to NVLink
NCCL IBGDA : MPI-like semantics for collective operations
For all-to-all NCCL
Supports network transports via plugins; IBGDA external network plugin
Register a landing zone, used as a FIFO for the GPU
Channel : local stream to talk to a NIC
512 GPU : 75
A few audio problems
Monday, (6:00 PM - 6:50 PM CET) No More Porting: Accelerated Computing With Standard C++, Fortran, and Python [S61204]
Wednesday 11am No More Porting: Accelerated Computing With Standard C++, Fortran, and Python: A Q&A From the EMEA Region [S61204a]
Anastasia Stulova : HPC Compiler Programming Models Architect, NVIDIA
Jeff Larkin : HPC Architect, NVIDIA
Slides
No need to be a ninja programmer
Develop applications that run everywhere
Standard C++, Fortran, and common Python used on real cases
Available with containers
Nothing unique to Grace Hopper, but it improves a lot
C++ is a parallel language
Use algorithms, not raw for loops
std::mdspan (C++23) available
C++26 preview (senders/receivers available)
-stdpar=multicore or -stdpar=gpu
std::mdspan : solution for calibration
M-AIA : same perf on CPU but 8.74x on GPU
Fortran is still a lively parallel language
No-coarray for Fortran 2018
Reduce subclause added in Fortran 2023
MiniWeather : triple loop converted into a do concurrent loop
Same CPU perf but 18x for GPU
ABINIT : Same CPU but 5.5x on GPU (FFTW interface on GPU) so no difference in code
Python is not a standard but a set of packages
Each has to develop its own GPU port
Legate solves this issue
No need to change code for Grace Hopper
CUDA or OpenACC will work
On Grace, different bandwidths when accessing data on the CPU vs on the GPU
stdpar automatically places data where it should be
NVHPC compilers : -gpu=managed (but mostly automatic) to manage memory automatically
Unified memory with -gpu=unified since 2023
It is possible to mix nvcc (CUDA) with nvc++
Languages like Julia and Rust will be integrated
Optimization of block size? Yes, it is expected to be automatic, with the ability to tune it manually
C++ dynamic_cast is not supported for now
The approach is to develop almost everything with standard algorithms, then fine-tune what is needed (and debug it on the CPU)
Parallel from the start
Tuesday, (4:00 PM - 4:50 PM CET) Accelerating and Securing GPU Accesses to Large Datasets [S62559]
CJ Newburn : Distinguished Engineer, NVIDIA
Oren Duer : Director, Software Architecture, NBU, NVIDIA
Vikram Sharma Maithody : Senior Research Scientist, NVIDIA
Slides
Tuesday, Mar (5:00 PM - 5:25 PM CET) Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions [S62405]
Tuesday, (5:00 PM - 5:50 PM CET) Demystify CUDA Debugging and Performance with Powerful Developer Tools [S62256]
Jackson Marusarz, Technical Product Manager, NVIDIA
Slides
Python + nvtx for Nsight (sketch below)
New Nsight features : Grace host profiling, extension to JupyterLab
Soon : Python call stack and syntax highlighting
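A minimal sketch (not from the talk) of annotating Python code with nvtx so the ranges show up on the Nsight Systems timeline (assumes the nvtx package; run the script under nsys profile, e.g. nsys profile python script.py):

import time
import nvtx

@nvtx.annotate("preprocess", color="blue")        # decorator form
def preprocess(n):
    time.sleep(0.01)                               # stand-in for real work
    return list(range(n))

with nvtx.annotate("main loop", color="green"):    # context-manager form
    for _ in range(3):
        preprocess(1000)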
Tuesday, (5:00 PM - 5:50 PM CET) RAPIDS in 2024: Accelerated Data Science Everywhere [S62741]
Dante Gamma Desavre : Machine Learning Engineering Manager, NVIDIA
Nick Becker : Senior Technical Product Manager, NVIDIA
RAPIDS : ecosystem
Forecasting : from once a month or once a year to once a day
RAPIDS for Dask or Apache Spark
cuVS : accelerated vector search
NVIDIA AI Enterprise : open-source foundation
Use a CPU/GPU switch, or both, or a custom CUDA kernel
Pandas acceleration : cudf.pandas covers the whole Pandas API; it does not accelerate everything, but whatever can run on the GPU is accelerated
python -m cudf.pandas script.py
NetworkX : NetworkX cuGraph backend (more than 60 algorithms in NetworkX, some are 5x to 600x faster; sketch below)
OK even if you don't have a GPU
Integrated in the NetworkX documentation
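A minimal sketch (not from the talk) of the backend dispatching, assuming NetworkX 3.x and the nx-cugraph package are installed:

import networkx as nx

G = nx.karate_club_graph()

bc_cpu = nx.betweenness_centrality(G)                     # default CPU implementation
bc_gpu = nx.betweenness_centrality(G, backend="cugraph")  # same call, dispatched to the GPU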
Numba CUDA : supports shared memory and cooperative groups
Data interoperability
Dask supports a configurable GPU backend (LocalCUDACluster instead of a local CPU cluster; sketch below)
Dask Expressions : expression simplification on the GPU as well; deals with multiple GPUs and avoids running out of memory
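A minimal sketch (not from the talk) of the Dask GPU backend, assuming dask, dask-cuda and cudf are installed; the file path and column names are made up:

import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    client = Client(LocalCUDACluster())              # one worker per visible GPU
    dask.config.set({"dataframe.backend": "cudf"})   # GPU-backed partitions

    df = dd.read_parquet("data/*.parquet")           # hypothetical dataset
    print(df.groupby("key")["value"].mean().compute())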
Apache Spark : 5.5x faster
Most popular algos in Spark ML
XGBoost 2.0 : NVFlare integration, UCX to speed up networking
cuVS : CUDA vector search : accelerated vector search for C++, Python, Rust and C, for vector databases, and graph-based ANN CAGRA-G (IVF-PQ with CAGRA)
CAGRA-Q has quantization
CUDA Toolkit on conda-forge
JupyterLab NVDashboard (high-level view of GPU metrics), and Nsight
NVTX to annotate code (C++, Python)
Workbench : smoothly start from a laptop and go into a cluster of GPUs
NVIDIA LaunchPad
RAPIDS ecosystem of tools : Legate is a distributed runtime (so a different level of abstraction)
Tuesday, (7:00 PM - 7:50 PM CET) CUDA: New Features and Beyond [S62400]
Stephen Jones : CUDA Architect, NVIDIA
Slides
Combined hardware and software
Power is the metric that matters
What can I do with 20 MW?
Data movement and computation
Power consumption scales with the square of the mantissa width for floating-point computation
LU decomposition in tensor cores
Mixed precision is 6x faster than double precision with the same result
How to use tensor cores : cuBLAS, cuBLASLt, CUTLASS (write tensor-core code inside your own kernel)
cuBLASDx : device extension of cuBLAS
Kernel fusion to save loads, with JIT
CUTLASS can mix with PyTorch
Nsight tracks code origin (C++, Python, etc.), even the Python interpreter's global lock
Warp : differentiable kernel development in Python (example with a bread simulation; sketch below)
Demo of a breaking-bread simulation
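A minimal sketch (not from the talk) of a Warp kernel, assuming NVIDIA's warp-lang package is installed; kernels are written in Python, JIT-compiled for the GPU, and can be made differentiable:

import numpy as np
import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()
    y[i] = a * x[i] + y[i]

n = 1024
x = wp.array(np.ones(n, dtype=np.float32))   # copied to the default GPU device
y = wp.zeros(n, dtype=float)
wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
print(y.numpy()[:4])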
Legate into JAX
Nsight on large-scale machines, and interfaces with JupyterLab
GPUDirect : direct copy to/from the network
GPUDirect Async : triggered from the GPU
GPUDirect Async KI : kernel-initiated
NCCL and NVSHMEM traces in Nsight
20% speedup on Grace Hopper for transformer training thanks to data cached on Grace
CUDA Graphs : task graphs on the GPU (6 years of work)
DCGM : data center monitoring
Tuesday, Mar (7:00 PM - 7:50 PM CET) A Deep Dive into the Latest HPC Software [S61203]
Jeff Larkin : HPC Architect, NVIDIA
Slides
No sound
Tuesday, (10:00 PM - 10:50 PM CET) Unlocking Developer Productivity across CPU and GPU with Mojo [S62220]
Chris Lattner : Chief Executive Officer and Co-Founder, Modular AI
Mostafa Hagog : Performance & Compiler Engineering Lead, Modular AI
Slides
Tuesday, Mar (11:00 PM - 11:50 PM CET) CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores [S61198]
Vijay Thakkar : Senior Architect, NVIDIA
Jack Kosaian : Senior Architect, NVIDIA
Slides
CUTLASS : C++ BLAS template library, 2.5M downloads/month
Handles matrix-product fusion, and FP8 (E5M2 and E4M3)
Composable schedulers
5-level conceptual hierarchy
Kernel layer, device layer : block- and grid-level concerns
Reduction of the API
Convolution in CUTLASS 3.5
Convolution collectives
Arbitrary number of dimensions
GEMMs are just tensor contractions (GETT to implicit GEMM)
They use the Hopper TMA
Beta API with 1, 2 and 3 dimensions
Grouped GEMM and Mixture of Experts (MoE)
Weight quantization : using lighter weights, e.g. from FP16 to INT4, and using tensor cores for that GEMM
Epilogue Visitor Tree (EVT) : compose complex epilogues
Also available in Python
It is possible to extend CUTLASS with custom kernels
CUTLASS 3.6 in the summer, with sparsity for Hopper
Mixed-precision GEMM is not supported
Wednesday, Mar 20 (1:30 AM - 1:45 AM CET) Practical Tips for Using Grace Hopper to Dramatically Accelerate Your Deep Learning and HPC Pipelines [EXPT63133]
Wednesday, (4:00 PM - 4:25 PM CET) Perform High-Efficiency Search, Improve Data Freshness, and Increase Recall With GPU-Accelerated Vector Search and RAG Workflows [S62599]
Charles Xie : CEO, Zilliz
Corey Nolet : Principal Engineer, RAPIDS ML, NVIDIA
Slides
Vector database systems and RAG systems, and how to bring the GPU into them; introducing cuVS
Vectors for text, video, molecules
Milvus : first vector database in the world (6 years ago)
Vector databases to enhance proprietary domain data
Retrieval-Augmented Generation (RAG) avoids hallucinations and improves accuracy
GPU acceleration on worker nodes
Brute-force algorithms are OK on GPU => CAGRA (almost zero build time and very high throughput)
CAGRA : find nearest neighbours on the GPU
A CAGRA graph can also be used on the CPU, and HNSW is faster on a CAGRA graph
1 TB of raw text becomes 2.5 TB of vectors (635 M vectors of 1024 dimensions)
Wraps RAFT and increases language support
cuVS for C, C++, Python, Rust
They need a separate GPU if cuVS is used to train a model
Wednesday, Mar (4:00 PM - 4:50 PM CET) Mastering CUDA C++: Modern Best Practices with the CUDA C++ Core Libraries [S62175]
Jake Hemstad, Software Engineering Manager, NVIDIA
Georgii Evtushenko : Senior Software Engineer, NVIDIA
No recording
Wednesday, (6:00 PM - 6:25 PM CET) Performance Optimization for Grace CPU Superchip [S62275]
Wednesday, (6:00 PM - 6:50 PM CET) Legate: A Productive Programming Framework for Composable, Scalable, Accelerated Libraries [S62262]
Wonchan Lee : Senior Software Engineering Lead, NVIDIA
Manolis Papadakis : Senior Software Engineer, Legate Framework, NVIDIA
Slides
Use a wide range of accelerated hardware
Not easy to program
Bring NumPy, SciPy and widely used Python frameworks to the GPU
Transparently scaling programs
NumPy, SciPy, Pandas, scikit-learn, XGBoost, JAX, Zarr, HDF, ...
Each implementation should scale transparently
No expensive operation to go from one library to another
Common solution for these problems
First, the Legate runtime stack : Legate + Legion + Realm
Users do not have to change their code for new hardware
NumPy -> Legate cuNumeric (presented at GTC 2023)
Pandas -> Legate DataFrame (new)
Scikit-learn -> Legate ML (new)
XGBoost -> Legate Boost (new)
JAX -> Legate JAX (new)
HDF -> Legate IO (new)
Python program -> task graph into Legate Runtime
Problem of data partitioning
Sometimes data needs to be repartitioned
Problem if repartitioning happens outside of Legate
Legate Runtime : common way to deal with partitioning (you can add partitioning constraints)
Legate JAX : from the popular ML framework, with flexible tensor-partitioning strategies
But no support to scale up to 1k GPUs
Augment the JAX pipelining parallelism
JAX uses a Legate XLA backend => performance comparable to the state of the art at 128, 256 or 512 GPUs, but with all libraries
Rapids + Legate :
RAPIDS : a collection of fast and maintained kernels for ETL and ML (for Pandas, XGBoost and HDF)
Legate Boost : mixed-model boosting
SLAC : Stanford Linear Accelerator Center
Develop on your laptop and scale to a computing center for production
Legate IO : HDF5 and Zarr with GPUDirect Storage
Switching between CPU and GPU in one click
Example with a multiphysics solver : possibility to map computation on CPU and GPU (slide 13)
Legate Sparse : Sparse matrix implementation
Task Fusion : MLIR Kernel Fusion (perf slide 14)
Complete C++ reimplementation of the core layer; it was previously in Python but turned out to be the performance bottleneck
Legate JAX uses the C++ Legate
And Legate STL : implement your C++ programs in a functional style (legate::stl::transform_reduce)
Nsight has a logical view and a physical view with Legate to help users understand what is going on
conda install -c nvidia -c conda-forge -c legate cunumeric
https://github.com/nv-legate
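A minimal sketch (not from the talk) of the drop-in idea, assuming cuNumeric is installed via the conda command above and the script is launched with the legate driver (e.g. legate script.py):

import cunumeric as np      # drop-in replacement for "import numpy as np"

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
c = a @ b                   # scheduled by the Legate runtime on CPUs or GPUs
print(c.sum())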
Legate's design is more inspired by the STL than by TBB (so no TBB in this)
Will Legate also work on Jetson, Orin? Normally no; the runtime overhead could be excessive on smaller CPUs and GPUs
Release in May or July 2024 (for the new Legate version with all the Python and C++ stuff)
The runtime does not care how you implement your library
Wednesday, (10:00 PM - 10:50 PM CET) Multi GPU Programming Models for HPC and AI [S61339]
Jiri Kraus : Principal DevTech Compute, NVIDIA
Slides
A bit of NCCL, MPI, OpenSHMEM, NVSHMEM; worth watching if you develop communication-based computations
8x H100 with non-blocking communication
Domain decomposition : optimize the number of neighbours, or optimize the communications between neighbours
A lot of examples
Overlap communication with ongoing computation
NVSHMEM : direct communication between GPU kernels
NVSHMEM is faster with -dlto (link-time optimization)
Start by fusing kernels before and after communication
NCCL and NVSHMEM can be used with CUDA Graphs
The affinity between the network adapter and the GPU is very important
Better to use a CUDA-aware MPI on a cluster if possible (sketch below)
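A minimal sketch (not from the talk) of CUDA-aware MPI from Python, assuming mpi4py built against a CUDA-aware MPI plus CuPy; device buffers are handed to MPI directly, without staging through the host. Run with e.g. mpirun -np 2 python script.py.

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = (rank + 1) % comm.Get_size()

sendbuf = cp.full(1 << 20, rank, dtype=cp.float32)   # buffer lives on the GPU
recvbuf = cp.empty_like(sendbuf)

# With a CUDA-aware MPI the device pointers are exchanged directly.
comm.Sendrecv(sendbuf, dest=peer, recvbuf=recvbuf, source=peer)
print(rank, float(recvbuf[0]))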
Wednesday, (11:00 PM - 11:50 PM CET) More Data, Faster: GPU Memory Management Best Practices in Python and C++ [S62550]
Mark Harris : Distinguished Engineer, NVIDIA
You have to watch this if you are a library developer (C++ or Python)
No metadata from the host allocator when it is used to allocate data on the device, because it cannot read it
Multiple workflows can be problematic: each one allocates its own memory pool, and without cooperation we run out of memory
cudaMalloc / cudaFree => synchronous, and free is slow for security reasons
cudaFree adds bubbles to the pipeline
On 16 V100s => 88% of the time spent in memory management, dropped to ~0 with a pool
cudaMallocAsync since CUDA 11.2
RMM : fast and flexible memory management
Used in cuDF, cuML, RAFT
By default RMM uses cudaMalloc and cudaFree, but they can be replaced with a stream-ordered pool allocator (Python sketch after the best-practices list below)
RMM is in transition (refactoring ongoing)
A deallocation may be reused by the next allocation, asynchronously
rmm::device_buffer : uninitialized bytes of memory
rmm::device_uvector : like thrust::device_vector but uninitialized
Thrust does not know about streams
binning_memory_resource : separates small allocations from large allocations, using a different upstream resource if needed
Share memory pool between libraries
External allocator interface
Best Practices :
No raw allocations
Give your library an external allocator interface (allows libraries to cooperate)
Use stream ordered allocation
Use stream ordered data containers
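A minimal sketch (not from the talk) of the pool-allocator point above using RMM's Python API (the C++ API mirrors it); the pool size is an arbitrary example value:

import rmm

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),       # upstream allocator (cudaMalloc/cudaFree)
    initial_pool_size=2 * 1024**3,     # grab 2 GiB up front, grow on demand
)
rmm.mr.set_current_device_resource(pool)

# Libraries built on RMM (cuDF, cuML, RAFT) now allocate from the shared pool.
buf = rmm.DeviceBuffer(size=1024)      # 1 KiB of uninitialized device bytes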
Thursday, live (5:00 PM - 5:50 PM CET) How To Write A CUDA Program: The Ninja Edition [S62401]
Stephen Jones (SW), CUDA Architect, NVIDIA
How to think about writing a program
The last 10 percent of peak performance is hard to get
1M Threads on H100
Data Parallelism : GPU (Throughput machine)
Task Parallelism : CPU (Latency machine)
GPU : expects 100 waves
Map data to threads
10x in cache L2
Split the task so that the data fits in cache
Don't start all your kernels from the top-left, to preserve the cache contents of the previous kernel
No all-to-all
You can write CUDA code in Python (CuPy, Numba) but you have to understand how the machine works (sketch below)
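A minimal sketch (not from the talk) of writing CUDA from Python with CuPy's RawKernel (Numba, shown earlier, is the other option); assumes CuPy is installed:

import cupy as cp

vec_add = cp.RawKernel(r'''
extern "C" __global__
void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}
''', 'vec_add')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.empty_like(x)
threads = 256
blocks = (n + threads - 1) // threads
vec_add((blocks,), (threads,), (x, y, z, cp.int32(n)))
print(z[:4])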
Tensor parallelism is data parallelism
No way to automate the block size because it is problem-dependent
If you have many kernels, just use CUDA Graphs
Nsight gives info on low-level cache usage by the kernel (occupancy, latency, etc.)