12.5.3 : Computing

Monday, Mar 17 9:00 PM - 10:30 PM CET : CUDA Techniques to Maximize Memory Bandwidth and Hide Latency [S72683]
  • Athena Elafrou : Developer Technology Engineer, NVIDIA
  • Allard Hendriksen : Sr. Developer Technology, NVIDIA
  • Slides
  • A must-watch if you are a CUDA developer or if you want to improve the performance of your computations
  • Hopper => distributed shared memory => thread block cluster
  • Unrolling with an extra for loop is better when the compiler cannot prove that the arrays do not overlap each other
  • But it increases register usage.
  • On Blackwell, you could use up to 40 percent of the registers just to increase bandwidth, which does not leave you many registers for computing
  • __pipeline_wait_prior(n) => wait for all the transfers except the last n (see the first sketch after this list)
  • You want to issue the commit from non-divergent code, otherwise it will produce a commit for each conditional path
  • Producer/consumer pattern => a subset of the threads fetch the data, then all threads do the computing
  • But synchronization is needed
  • 5 percent bandwidth increase on a Hadamard product
  • 20 percent on more complex computations
  • TMA 1D and ND
  • Source and destination pointers must be aligned on 16 bytes.
  • Copy size must be a multiple of 16 bytes
  • thrust::transform uses TMA under the hood so you can get it for free
  • But it is better to wrap the given lambda in cuda::proclaim_copyable_arguments(lambda) to tell the compiler that the data can be copied to shared memory
  • The compiler cannot prove the TMA is launched by only one thread, so it will produce a peel
  • To avoid that, you have to use cooperative_groups::invoke_one(cooperative_groups::coalesced_threads(), [&](){...})
  • TMA is efficient for transfers larger than 2 KiB
  • LDGSTS is better for transfers smaller than 1 KiB
  • Between 1 and 2 KiB you can use either
  • Since Volta : a sequentially consistent memory model gives guarantees but can be a bit slow
  • Acquire : any load which comes after the std::memory_order_acquire stays after
  • Release : any store which comes before the std::memory_order_release stays before
  • Relaxed : no ordering guarantees with std::memory_order_relaxed
  • cuda::thread_scope_thread => local thread
  • cuda::thread_scope_block => only thread block
  • cuda::thread_scope_device => only the current GPU
  • cuda::thread_scope_system => other threads in the system (CPU, GPU, other nodes) can observe this thread's loads and stores (see the second sketch after this list)
  • In a SM, data have to be consistent only in L1 cache
  • Threads in a cluster have to exchange data through the L2 cache
  • see slide 88
  • New instruction in Hopper => store async to store a value from one SM into the shared memory of another SM
  • // synchronize threads in the cluster
  • ptx::barrier_cluster_arrive(ptx::sem_release)
  • // do independent work
  • ptx::barrier_cluster_wait(ptx::sem_acquire)
  • >5x with the store async compared to the classic cluster::async on H100
  • These optimizations are part of CUTLASS, which provides an optimized kernel for each architecture
  • If you use PyTorch right now you might want to use these techniques, but at some point they will be part of the PyTorch backend
  • Look at the CUDA C++ documentation
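  • A minimal sketch (mine, not from the slides) of the __pipeline_* flow above: each thread async-copies one 16-byte float4 into shared memory, the commit is issued from non-divergent code, then __pipeline_wait_prior(0) waits for all transfers before computing. TILE, the scaling kernel and the launch assumption (blockDim.x == TILE) are mine:

      #include <cuda_pipeline.h>

      constexpr int TILE = 256;                       // one float4 per thread, blockDim.x == TILE

      __global__ void scale_tiles(const float4* __restrict__ in,
                                  float4* __restrict__ out, float alpha, int n4)
      {
          __shared__ float4 tile[TILE];
          for (int base = blockIdx.x * TILE; base < n4; base += gridDim.x * TILE) {
              int idx = base + threadIdx.x;
              if (idx < n4)                           // 16-byte aligned source/destination, 16-byte copy
                  __pipeline_memcpy_async(&tile[threadIdx.x], &in[idx], sizeof(float4));
              __pipeline_commit();                    // one commit, executed by every thread (non-divergent)
              __pipeline_wait_prior(0);               // wait for all outstanding transfers
              __syncthreads();
              if (idx < n4) {                         // the shared-memory staging only illustrates the API flow
                  float4 v = tile[threadIdx.x];
                  out[idx] = make_float4(alpha * v.x, alpha * v.y, alpha * v.z, alpha * v.w);
              }
              __syncthreads();
          }
      }
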
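  • A second minimal sketch (mine) of the scoped atomics and acquire/release rules above, using libcu++: block 0 publishes a value, block 1 spins on a device-scope flag and is then guaranteed to see the payload. The kernel, the payload value and the 2-block launch are assumptions (both blocks must be resident for the spin to make progress):

      #include <cuda/atomic>
      #include <cstdio>

      __global__ void message_passing(int* data, int* flag)
      {
          cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
          if (blockIdx.x == 0 && threadIdx.x == 0) {
              *data = 42;                                     // payload written first
              f.store(1, cuda::std::memory_order_release);    // stores before the release stay before
          } else if (blockIdx.x == 1 && threadIdx.x == 0) {
              while (f.load(cuda::std::memory_order_acquire) == 0) { }  // loads after the acquire stay after
              printf("observed %d\n", *data);                 // guaranteed to read 42
          }
      }
      // e.g. message_passing<<<2, 32>>>(d_data, d_flag); with *d_flag initialized to 0
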


Monday, Mar 17 9:00 PM - 9:40 PM CET : Optimizing Multi-Language Scientific Simulations: A Grace Superchip Case Study [S72537]
  • Holly Wilper : Manager, System Software Tools, NVIDIA
  • Michael Wall : Senior Scientist, Los Alamos National Lab
  • Slides
  • VENADO : Grace SuperChip at Los Alamos
  • SEDACS : multi-language scientific simulations (many thousands of atoms)
  • From quantum simulation to classical molecule simulations
  • Graph based methods
  • NSight : CPU, GPU, network and storage profiling (including Linux virtual file system)
  • https://nvidia.github.io/NVTX/doxygen/index.html
  • https://nvidia.github.io/NVTX/doxygen-cpp/index.html
  • https://github.com/NVIDIA/NVTX/tree/release-v3/
  • C++ and Python NVTX available in cuda libs
  • Fortran and Rust wrappers for NVTX available from the community
  • Reducing communication improves performance
  • Possibility to track advanced SIMD instruction with NSight
  • Transferring data allocated by Python on the host and used on the device by C++ CUDA caused a lot of lag (they did not use the unified memory of Grace Hopper because the code is quite old)


Monday, Mar 17 11:00 PM - 12:30 AM CET : CUDA Techniques to Maximize Compute and Instruction Throughput [S72685]
  • Ben Pinzone : Compute Developer Technology Engineer, NVIDIA
  • David Clark : Compute Performance Developer Technology Engineer, NVIDIA
  • Slides
  • Always check what the profiler is telling you
  • In the Details page of Nsight Compute you can see warp divergence (work divergence, because threads in the same warp do different work)
  • import-source-file=true to resolve the source when importing into Nsight
  • Make a queue of the work to do and, in a second step, process the data (if queuing is not too expensive in that case)
  • NCU : get the info in the Scheduler Statistics
  • Find the origin of stalls
  • New dependency columns coming soon
  • Occupancy on Hopper is up to 64 warps per SM
  • __forceinline__ if you really want to force the compiler to inline (can be very expensive on big functions, and inlining is still automatic on very small functions)
  • Use CCCL as much as possible (with cub, and thrust)
  • Use only the precision that your application needs
  • Signed iterators allow more optimizations than unsigned iterators with the NVIDIA compiler (but do not do that with other compilers; use long unsigned int there) (see the sketch after this list)
  • Overflow of unsigned is defined behaviour
  • Overflow of signed is undefined behaviour, and the NVIDIA compilers take advantage of that
  • Power of dot and cross products to speed up computation and avoid trigonometric functions
  • Tensor Core Gen 5 with Blackwell and new instructions
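  • A small sketch (mine) of the signed-index advice above: with an unsigned counter the compiler must honor wrap-around, which can block address-computation optimizations; with a signed counter overflow is undefined behaviour, so the compiler may assume it never happens. The axpy kernels are illustrative:

      __global__ void axpy_unsigned(float a, const float* x, float* y, unsigned n)
      {
          for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)        // unsigned wrap-around is defined, limits optimization
              y[i] += a * x[i];
      }

      __global__ void axpy_signed(float a, const float* x, float* y, int n)
      {
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)        // signed overflow is UB, the compiler can optimize freely
              y[i] += a * x[i];
      }
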


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : RAPIDS in 2025: Accelerated Data Science Everywhere [S73290]
  • Nick Becker : Sr. Technical Product Manager, NVIDIA
  • Slides
  • Industry needs high performance data science
  • RAPIDS : foundation of open sourced libraries
  • cuGraph helps PyTorch geometric
  • RAPIDS is a part of CUDA-X ecosystem
  • Zero code change for Apache Spark or Pandas etc.
  • Pandas acceleration up to 50x when GPU accelerated
  • GPU-accelerated Polars : 100,000,000 records in seconds on GPU => beta version for now
  • Falls back on the CPU when an operation is not supported
  • Memory can be a challenge => scale beyond the GPU memory limit (on Pascal+ GPUs)
  • available by default to scale up Pandas and Polars
  • I/O in the cloud and remote storage (Amazon S3) : you can optimise it yourself but they choose a safe default
  • Apache Spark with RAPIDS => 6x on quite standard request
  • Scikit-learn complemented with UMAP (dimensionality reduction) and HDBSCAN (clustering)
  • cuML accelerates UMAP and HDBSCAN without any code change => from 5x to 200x acceleration
  • Any third-party library using UMAP or HDBSCAN will be accelerated as well
  • Available in open beta
  • spark-rapids-submit <options> app.py (just change from Python to this)
  • XGBoost 3.0 : pushing the memory limit and the scaling, can scale on 1 TB data on Grace Hopper
  • FIL : Forest Inference Library for Scikit-Learn, XGBoost, LightGBM and cuML
  • NetworkX acceleration with cuGraph
  • PyG up to 3x on training
  • Even better on Grace Blackwell
  • RAG : Retrieval-Augmented Generation
  • NeMo curator to filter your data before training
  • cuVS : accelerated vector search to build indexes
  • and a lot of documentation
  • Aether : optimize Apache Spark workflows => tools to optimize, redeploy and check performance
  • JupyterLab NVDashboard
  • The Google Colab Gemini coding assistant is now aware of whether you are using a GPU
  • Not only about AI


***** Wednesday, Mar 19 5:00 PM - 5:40 PM CET : How You Should Write a CUDA C++ Kernel [S72575]
  • Georgii Evtushenko : Sr. Software Engineer, NVIDIA
  • Slides
  • Kernels are asynchronous
  • NVBench -> interface similar to Google Benchmark
  • Speedup is not always a good metric. It is better to compare to the speed of light => the theoretical peak performance of the GPU
  • cuda::ceil_div divides two integers and rounds the result up to the next integer (great to compute block and grid sizes)
  • use cuda::std::span or cuda::std::mdspan in your kernel instead of std::span or std::mdspan
  • Little's law : concurrency = latency * throughput
  • The fma operation has to wait until its data is available, so we see a high elapsed time in fma even if the computation is memory bound
  • cuobjdump --dump-ptx vec to dump ptx instructions
  • __launch_bounds__(load_size) to tell the compiler the expected block size of the kernel
  • There is only one register file per SM, and if you unroll your kernels too much, at some point one kernel will use most of the registers and the others will fight for resources
  • To get the best grid size (see the combined sketch after this list) :
  • int num_blocks;
  • cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks, kernel, block_size, 0);
  • To get the best block size :
  • int _;
  • int num_threads;
  • cudaOccupancyMaxPotentialBlockSize(&_, &num_threads, kernel, 0);
  • use cuda::ptx instead of asm
  • compute-sanitizer for CUDA kernels correctness checking
  • Use vocabulary types (span, mdspan, etc.)
  • Use libcu++ for latest hardware features
  • Replace intrinsics with atomics
  • cuda-gdb for CUDA kernels debugging
  • Use NVIDIA Nsight Systems to identify bottlenecks in your application
  • NVIDIA Nsight Compute for in-depth kernel profiling
  • The big problem with LLMs is that they always want to implement what you ask instead of using the relevant library for it
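  • Putting the cuda::ceil_div and occupancy bullets above together, a minimal launch helper (my sketch: the kernel body, the doubling operation and the default stream are assumptions):

      #include <cuda/cmath>           // cuda::ceil_div
      #include <cuda_runtime.h>

      __global__ void kernel(float* data, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) data[i] *= 2.0f;
      }

      void launch(float* data, int n)
      {
          int min_grid_size = 0, block_size = 0;
          // block size that maximizes occupancy for this kernel
          cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, kernel, 0, 0);

          // enough blocks to cover n elements, rounded up
          int grid_size = cuda::ceil_div(n, block_size);
          kernel<<<grid_size, block_size>>>(data, n);
      }
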


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : A Deep Dive Into the Latest HPC Software [S71350]
  • Becca Zandstein : Director, Product Management, NVIDIA
  • Slides
  • HPC SDK => scale to the full system you can use (supports x86, Arm or PowerPC CPUs)
  • Provides a way to program on laptop, workstation and cluster with a smooth transition
  • A single binary running on multiple GPUs in parallel is OK now
  • NEMO : ocean simulation => 64x on Grace Hopper with standard parallel Fortran do concurrent (see the sketch after this list)
  • cuFFTmp uses NVSHMEM to perform the all-to-all communication (all-to-all being hard to scale)
  • VASP multi GPU integration with HPC SDK
  • NSight Copilot integrated inside vscode
  • nvmath-python : currently beta
  • Built on NVIDIA core libraries with JIT
  • Seamless interoperability with PyTorch
  • Warp : open-source Python differentiable kernels
  • The math unit can kill your performance, but tile programming can improve efficiency and productivity. I don't get why these are related
  • cuPyNumeric : write code simply with python and numpy. import cupynumeric as np
  • CUPTI Python : collect your performance data directly from Python
  • python profiling available (slide 37)
  • How to approach petascale / exascale ?
  • Iterative linear solver ? Not included in the HPC SDK stack. And CUDA Graph ???
  • Does the HPC SDK work with lower precision ? Yes, this will come.
  • cuPy <=> cuPyNumeric ?
  • Using Copilot to translate Python to Fortran and vice versa ? Not yet available.
  • NVPL : for all of our CPUs (so Grace for now)
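  • The NEMO result above comes from Fortran do concurrent; here is my C++ analogue of the same ISO-standard parallelism idea supported by the HPC SDK: compile with nvc++ -stdpar=gpu and the standard algorithm below is offloaded to the GPU with no CUDA code (the saxpy and sizes are illustrative):

      #include <algorithm>
      #include <execution>
      #include <vector>

      int main()
      {
          const std::size_t n = 1 << 20;
          std::vector<float> x(n, 1.0f), y(n, 2.0f);
          const float a = 3.0f;

          // y = a * x + y expressed as a standard parallel algorithm
          std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                         y.begin(), [a](float xi, float yi) { return a * xi + yi; });
          return 0;
      }
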


Wednesday, Mar 19 10:00 PM - 10:40 PM CET : How Math Libraries Can Help Accelerate Your Applications on Blackwell GPUs [S72434]
  • Babak Hejazi : Senior Engineering Manager, NVIDIA
  • Azi Riahi : NVIDIA
  • Slides
  • Math Libraries => ease of use
  • more than 400 math libraries on CUDA-X (cuSPARSE, cuBLAS, AmgX, cuSOLVER, cuDSS, cuTENSOR, cuFFT, cuRAND, etc)
  • GB200 NVL72 => 72 GPU connected with NVLink and 36 Grace CPU
  • DGX B200 => 8 Blackwell connected with NVLink and Intel x86 CPU
  • DGX/HGX H100/H200 : 8 H100s connected with NVLink and Intel x86 CPU, networked with InfiniBand
  • Blackwell emulation of FP64/FP32
  • FP64/FP32 emulation in cuBLAS matmuls keeps the precision but uses FP16 and other formats at higher throughput
  • FP32 with BF16 tensor cores
  • FP64 with INT8 tensor cores
  • Tests on real cases : ecTrans (weather forecasting), BerkeleyGW (material simulation)
  • About 2x on training at lower precision for AI
  • cuSPARSE : from 1.4x-1.5x up to 2.4x speedup on Blackwell compared to H100
  • cuFFT, cuFFTmp : multi-GPU, multi-node library with as little overhead as possible
  • strong scaling on GB200 NVL72 up to 64 GPUs (roughly 41x on 64 GPUs which is good)
  • Loss of performance from 8 to 64 GPUs due to InfiniBand; no problem on GB200 NVL72
  • cuDSS : Cuda Direct Sparse Solver (uses both CPU and GPU)
  • nvCOMP uses the Blackwell hardware decompression engine : up to 600 GB/s decompression (9x on Deflate and 4.5x on Snappy)
  • nvJPEG uses hardware decompression since Hopper and Blackwell


Tuesday, Mar 18 12:00 AM - 12:40 AM CET : NCCL: The Inter-GPU Communication Library Powering Multi-GPU AI [S72583]
  • Sylvain Jeaugey : Distinguished Engineer, NVIDIA
  • Slides
  • NCCL : NVIDIA Collective Communications Library (see the all-reduce sketch at the end of this list)
  • Data Parallelism / Tensor Parallelism / Pipeline Parallelism / Expert Parallelism
  • Different optimization techniques depending on the hardware
  • Big PCIe bottleneck for all-to-all (cf. the presentation of DK last year)
  • NCCL now avoids extra copies, reduces the number of SMs used and improves overlap
  • From 5 to 20 percent improvement in end-to-end training
  • PAT algorithm (Parallel Aggregated Tree)
  • RAS subsystem to diagnose applications
  • And profiler plugin
  • In the future : all to all with better latencies
  • Fused communication and computation kernels (distributed matmul)
  • New stack in NCCL coming in the next few months
  • You can use any memory you want as long as it is CUDA valid
  • Symmetric memory to improve performance (version 2.27)
  • CUDA level API (2.28) custom operation, Computation and communication fusion, new algorithms
  • NVLink SHARP operations and GPU Direct Async (version 2.28) : NVLink and IBGDA
  • Symmetric memory removes a lot of checks that kill performance => 4x latency improvement => officially inspired by NVSHMEM
  • Everything is already preconnected
  • Integration with high level device (MLIR, TileIR) => build algorithm in Python and compile them on the fly inside CUDA
  • Source code available on github and even for preview to give a try
  • More improvement when you scale to more than 8 GPUs for NVlink to be efficient
  • The PyTorch people are trying to integrate NCCL into their profiler
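  • A minimal single-process sketch (mine, not from the talk) of the collective behind data-parallel training: an in-place all-reduce across the local GPUs; the buffer size and the absence of error checking are assumptions for brevity:

      #include <nccl.h>
      #include <cuda_runtime.h>
      #include <vector>

      int main()
      {
          int ndev = 0;
          cudaGetDeviceCount(&ndev);

          std::vector<int> devs(ndev);
          for (int i = 0; i < ndev; ++i) devs[i] = i;
          std::vector<ncclComm_t> comms(ndev);
          ncclCommInitAll(comms.data(), ndev, devs.data());     // one communicator per local GPU

          const size_t count = 1 << 20;
          std::vector<float*> buf(ndev);
          std::vector<cudaStream_t> streams(ndev);
          for (int i = 0; i < ndev; ++i) {
              cudaSetDevice(i);
              cudaMalloc(&buf[i], count * sizeof(float));
              cudaStreamCreate(&streams[i]);
          }

          ncclGroupStart();                                     // group the per-GPU calls into one collective
          for (int i = 0; i < ndev; ++i)
              ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
          ncclGroupEnd();

          for (int i = 0; i < ndev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
          for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
          return 0;
      }
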


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : CUDA: New Features and Beyond [S72383]
  • Stephen Jones (SW) : CUDA Architect, NVIDIA
  • Slides
  • CUDA => 250 SDKs
  • Every layer has a compiler builtin
  • Optimization is a hierarchical problem
  • Optimization of compile times, for JIT but not only
  • Keeping track of dependencies in python is tricky
  • cuda.core : high-level CUDA for Python, based on cuda-python
  • Python NVTX
  • Talking about future
  • cuPyNumeric : 1-to-1 GPU port of NumPy
  • import cupynumeric as np (can scale on a full computing center)
  • nvmath.python : optimised math for GPU + JIT compilation
  • cuda.parallel and cuda.cooperative : accelerated libraries
  • don't rewrite a FFT or a sort
  • Let the compiler do the thread mapping, you just have to do the block mapping
  • cuTile : python tiling for cuda, cuTile is simpler cuda, first release a little later this year
  • Warp + Warp Tile
  • you have to pick the right tool for your job
  • CUTLASS has a Python backend now and gets almost the performance of C++
  • Tensor core programming in python with CUTLASS 4.0
  • CUDA DTX : CUDA Distributed Execution => beyond
  • Nsight will be distributed in a container so you do not have to have a special container to run Nsight


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : Application Optimization for NVIDIA Grace CPU [S72978]
  • Lukas Krenz : Sr. Developer Technology Engineer, NVIDIA
  • Mathias Wagner : Sr. Developer Technology Engineer, NVIDIA
  • Slides
  • Grace is a family of CPUs
  • It should work because the ARM ecosystem is very mature
  • -O3 -mcpu=native -ffp-contract=fast
  • -flto : Link time optimization
  • Be careful with Fast math (with -Ofast)
  • -fsigned-char or -funsigned-char if your application needs signed or unsigned char
  • -fno-stack-array for fortran
  • Prefer BLAS/LAPACK or FFTW interfaces
  • Performance Analysis
  • PMU : Performance Monitoring Unit
  • NSight support ARM and Grace already
  • Perf, Likwid or eBPF
  • nvtxRangePush("Some function") ... nvtxRangePop() with #include <nvtx3/nvToolsExt.h> (see the sketch after this list)
  • nsys profile --cpu-core-metrics=help
  • https://github.com/brendangregg/FlameGraph
  • https://godbolt.org
  • nvidia compilers : -Minfo=vect to get info on vectorization
  • The Arm memory model is weakly ordered compared to x86
  • Loads/stores can be reordered by the CPU at runtime
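  • A small sketch (mine) of the NVTX range quoted above, so a region of your code shows up as a named range on the Nsight Systems timeline; solve_step and its body are placeholders (NVTX v3 is header-only, nvtxRangePushA is the ANSI variant of nvtxRangePush):

      #include <nvtx3/nvToolsExt.h>

      void solve_step()
      {
          nvtxRangePushA("solve_step");   // open a named range
          // ... the work you want to see on the timeline ...
          nvtxRangePop();                 // close the innermost open range
      }
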


Tuesday, Mar 18 10:00 PM - 10:40 PM CET : Accelerated Tabular Processing with Polars & RAPIDS [S74362] (RUST)
  • Matt Harrison : Python and Data Science Expert, Author, and Educator, MetaSnake
  • William Hill : Developer Advocate, Data Science, NVIDIA
  • no live
  • Slides
  • Polars : DataFrame library, core written in Rust, and has lazy evaluation
  • Python is a slow language
  • You can install Polars with pip or uv (which is better than pip of course)
  • Polars does not support 4-bit integers for now
  • Function building is close to Rust iterators
  • Change pl.read_parquet to pl.scan_parquet to enable laziness
  • agg because aggregate is too long to write
  • hvplot.polars to plot
  • .collect(engine="gpu") to run on the GPU (with the NVIDIA package)
  • See Polars Definitive Guide
  • 2x faster than CPU Polars (not a lot, but Polars is already supposed to be fast)
  • https://github.com/TNieuwdorp/polars-benchmarks/
  • Polars has some streaming support if the data does not fit in RAM but does fit on disk


Tuesday, Mar 18 11:00 PM - 11:40 PM CET : Energy-Efficient Supercomputing Through Tensor Core-Accelerated Mixed-Precision Computing and Floating-Point Emulation [S71487]
  • Samuel Rodriguez Bernabeu : Senior Software Engineer CUDA Math Libraries, NVIDIA
  • Slides
  • HPL benchmark
  • 56-bit mantissa up to 2x faster than classic FP64
  • It is intuitive that manipulating fewer bits should be faster
  • Mixed precision has been around for a long time; it computes an approximate result, and a fair amount of expertise is needed to evaluate where it is relevant for the given computation
  • Represent 1 FP32 number as 3 BF16 numbers (see the sketch after this list)
  • BF16*9 is more accurate than FP32 on weather forecasting simulation
  • More complex for denormal numbers, but it works
  • 2.1x on the materials code BerkeleyGW
  • Sparse matrices converted to dense => 64-bit (S=8) really close to FP64 on a linear system solve
  • It is easier to explore the needed precision
  • Automatic precision tuning framework to analyse the precision of DGEMM
  • They fall back to FP64 if too many slices are needed to emulate FP64
  • Introduced first in cuBLAS; coming to cuTENSOR and cuSOLVER very soon
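  • A sketch (mine) of the "1 FP32 as 3 BF16" decomposition idea above: each slice captures roughly the next 8 mantissa bits of the remaining residual, so b0 + b1 + b2 reconstructs x almost exactly and products of slices can run on BF16 tensor cores with FP32 accumulation; the function name is an assumption:

      #include <cuda_bf16.h>

      __device__ void split_fp32_to_bf16x3(float x, __nv_bfloat16& b0,
                                           __nv_bfloat16& b1, __nv_bfloat16& b2)
      {
          b0 = __float2bfloat16(x);                 // top ~8 mantissa bits
          float r1 = x - __bfloat162float(b0);      // residual
          b1 = __float2bfloat16(r1);                // next ~8 bits
          float r2 = r1 - __bfloat162float(b1);
          b2 = __float2bfloat16(r2);                // remaining bits
      }
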


Wednesday, Mar 19 12:00 AM - 12:40 AM CET : Insights From NVIDIA Research [S73202]
  • Bill Dally : Chief Scientist and SVP of Research, NVIDIA
  • Slides
  • RT Core : ray tracing, traversal of the hierarchy
  • NVLink and NVSwitch came from NVIDIA Research
  • Spring 22 : AI for ship design
  • Fall 22 : instant NeRF
  • High signaling to get through the grey balls, ground reference signaling (use ground as a reference)
  • 50 GBit/s per pin
  • 3D dies stacking interconnect
  • Cache Aware GEMM Kernel (CAGE)
  • SMs read data in reverse order to avoid cache thrashing as much as possible
  • Prefill : what things have to use other things in memory
  • Decode : look at which tokens have to be attended to
  • 3x speedup on the decode stage
  • Transformers, great but costly
  • Mamba + Transformers => Hymba
  • Reasoning about the string is Reasoning about the molecule


Wednesday, Mar 19 1:00 PM - 1:40 PM CET : Accelerate High-Performance Signal Processing Using GPU/CUDA [S71459]
  • Anders Ahlander : Sr. Systems Designer, Saab
  • Jacob Lundberg : Systems Engineer, Saab
  • Slides
  • Swarms for closely spaced drones
  • Switching from CPU to GPU
  • Radar : Radio Detection and Ranging
  • Pulse doppler radar
  • Focus on signal processing
  • AESA radar is a kind of reverse SKA detector
  • But a huge leap in computation
  • Small matrices but at a high rate
  • cuSOLVER, cuBLAS and cuFFT
  • The final steps are still on the CPU and use many batches (see the batched FFT sketch after this list)
  • GPUs take longer to be overwhelmed by the computation than CPUs, so hardware changes are less frequent
  • A100 gives the best performance for FP64 computation
  • NVIDIA NV-radar Holoscan (edge computing, cloud and supercomputer)
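  • A sketch (mine) of the "small FFTs at a high rate" pattern above: batch many short transforms into a single cuFFT plan so one call processes a whole burst of pulses; fft_size and batch are illustrative and error handling is omitted:

      #include <cufft.h>
      #include <cuda_runtime.h>

      int main()
      {
          const int fft_size = 512;       // samples per pulse (assumed)
          const int batch    = 4096;      // pulses processed per call (assumed)

          cufftComplex* data = nullptr;
          cudaMalloc(&data, sizeof(cufftComplex) * fft_size * batch);

          cufftHandle plan;
          cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);   // one plan repeated 'batch' times

          cufftExecC2C(plan, data, data, CUFFT_FORWARD);    // all pulses in one launch
          cudaDeviceSynchronize();

          cufftDestroy(plan);
          cudaFree(data);
          return 0;
      }
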


Wednesday, Mar 19 4:00 PM - 4:40 PM CET : The CUDA C++ Developer’s Toolbox [S72572]
  • Bryce Adelstein Lelbach : Principal Architect, NVIDIA
  • Slides
  • The default is host compilation
  • thrust::universal_vector
  • thrust::constant_iterator
  • thrust::make_transform_iterator
  • thrust::make_zip_iterator, even with +1 to get adjacent elements
  • thrust::cuda::par_nosync
  • OK for thrust::for_each or transform because they do not allocate or return a value (see the sketch after this list)
  • Use CUB if you want an asynchronous reduce
  • If a universal_vector is bigger than GPU memory, this is not a problem if the computation can be chunked
  • If the CPU and GPU share the same memory it can be better to use unified memory rather than doing the copy yourself, but there are some caveats on Jetson (where there is only one physical memory for both CPU and GPU)
  • github.com/NVIDIA/accelerated-computing-hub
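  • A sketch (mine) combining some of the Thrust pieces above: a universal_vector for migratable memory, a non-allocating transform run with thrust::cuda::par_nosync, and placeholder expressions instead of a hand-written functor (the saxpy and sizes are illustrative; compile with nvcc):

      #include <thrust/universal_vector.h>
      #include <thrust/transform.h>
      #include <thrust/functional.h>                 // placeholders _1, _2
      #include <thrust/system/cuda/execution_policy.h>
      #include <cuda_runtime.h>
      #include <cstdio>

      int main()
      {
          thrust::universal_vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
          const float a = 3.0f;

          using namespace thrust::placeholders;
          // y = a*x + y: no allocation and no return value, so par_nosync is fine
          thrust::transform(thrust::cuda::par_nosync, x.begin(), x.end(), y.begin(),
                            y.begin(), a * _1 + _2);

          cudaDeviceSynchronize();                   // synchronize before touching the result
          std::printf("y[0] = %f\n", static_cast<float>(y[0]));
          return 0;
      }
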


Wednesday, Mar 19 5:00 PM - 5:40 PM CET : The Performance of CUDA with the Flexibility of PyTorch [S71946]
  • Mark Saroufim : Software Engineer, Meta Platforms
  • Slides
  • CUDA is not flexible and PyTorch is slow
  • torch.compile uses JIT profiling to determine which computations to fuse and which not to
  • PyTorch : fundamentally a string-to-string conversion from Python to CUDA
  • The CUTLASS backend is generated on the fly by PyTorch but the performance is very dependent on the PyTorch version
  • The problem with string-to-string generation is finding and fixing typos
  • load_inline => simpler packaging but needs JIT compilation
  • 20 minutes to compile LLMs with PyTorch => you should export your model once for your architecture.


Wednesday, Mar 19 6:00 PM - 6:40 PM CET : Get the Most Performance From Grace Hopper [S72687]
  • Akshay Subramaniam : Sr. AI Developer Technology Engineer, NVIDIA
  • Slides
  • Grace Hopper Superchip
  • Memory Consistency
  • Same for Grace Blackwell but with 2 GPU per CPU
  • They share the same programming model
  • cudaMallocManaged => allocate with a first-touch policy
  • ATS : address translation service
  • Access-counter-based optimization migrates data depending on whether the other side of the chip uses it or not
  • But sometimes you want more control : cudaMemPrefetchAsync
  • cudaMemAdvise to give a preferred location for the allocation (see the sketch after this list)
  • Maybe recompile on ARM for Grace with sm90 for Hopper
  • XLB : accelerated Lattice Boltzmann Method (LBM), a stencil-based method
  • Too much data for GPU memory, but it can be split into slices and streamed to the GPU
  • Out-of-core algorithm
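  • A sketch (mine) of the managed-memory controls above: allocate with cudaMallocManaged, first-touch on the CPU, declare a preferred location, then prefetch explicitly instead of waiting for access-counter migration; sizes and the elided kernel launches are placeholders:

      #include <cuda_runtime.h>

      int main()
      {
          const size_t bytes = (size_t{1} << 26) * sizeof(float);
          float* data = nullptr;
          int device = 0;
          cudaGetDevice(&device);

          cudaMallocManaged(&data, bytes);                  // pages populated on first touch
          for (size_t i = 0; i < bytes / sizeof(float); ++i) data[i] = 1.0f;   // first touch on the CPU

          cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device); // prefer GPU residency
          cudaMemPrefetchAsync(data, bytes, device, 0);     // move the pages now, on stream 0

          // ... launch kernels that read/write data ...

          cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);   // bring them back for the CPU
          cudaDeviceSynchronize();
          cudaFree(data);
          return 0;
      }
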


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : 1,001 Ways to Write CUDA Kernels in Python [S72449]
  • Leo Fang : Python CUDA Tech Lead, NVIDIA
  • Slides
  • How to stay in Python and still get performance
  • Clang/LLVM, MLIR, NVRTC / nvJitLink / NVVM
  • No need for a preinstalled compiler such as g++ thanks to the JIT mechanism
  • cuda.core : new API to access CUDA functions from Python
  • Template/constexpr parameters become JIT compilation parameters
  • numba-cuda
  • @cuda.jit : all in python but hard to write
  • nvJitLink : generates a kernel as performant as CUDA
  • cuda.cooperative, built on CUB => very small and open source
  • A reduction of rows is a matrix-vector multiply with a vector of ones (as I showed in the CUDA/nvc++ webinar)
  • Optimization : set_smem(smem_vecOnes, nbElement, 1.) to set the vector of ones directly in shared memory because we do not need a real vector for that
  • and then perform the GEMM with cuBLAS
  • cuTile : cuda with tiles programming (see Stephen Jones presentation)
  • @cuda.tile.jit : extension of CUDA
  • The cooperative nature is important
  • Warp : auto-differentiable Python kernels (you do not have to write the backward version of the kernel yourself), open source on GitHub
  • CUTLASS for Python : compilation is faster and performance is about the same
  • You can also use cupy, cupyx or triton
  • cuda.parallel : Write thrust like code in python (open source with Apache 2.0)
  • nvmath (but why is matmul in "advanced" ?)
  • out_arr = xp.sum(in_arr, axis=-1) you can replace xp by PyTorch, CuPy, JAX or DASK and be portable
  • FP4 is on the roadmap for Python
  • You can generate CUBIN from Python and use it in C++


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : Speed-of-Light Data Movement Between Storage and the GPU [S73012]
  • CJ Newburn : Distinguished Engineer, NVIDIA
  • Prashant Prabhu : NVIDIA
  • Vikram Sharma Mailthody : Senior Research Scientist, NVIDIA
  • Slides
  • If threads access a lot of data on GPU, we kill the bandwidth
  • SCADA : SCaled Accelerated Data Access
  • Look ahead to get things ready
  • BlueField-3 with Spectrum-X to deal with congestion problems
  • High-bandwidth data path to the GPU
  • RDMA and TCP
  • The CPU is not deeply involved in the data transfer
  • GPU Direct Storage since April 2021
  • Add support for PyTorch and HDF5
  • S3 over RDMA / cuObject investigation
  • cuFile : up to 6x performance, sometimes 1.02x (see the sketch after this list)
  • The plot slide 44 is very interesting
  • The math: 100 GB/s for Gen6/512B is 200M IOPs/GPU, 1600M for 8 GPUs
  • Explore applications with technologists and storage providers
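  • A sketch (mine) of the cuFile / GPUDirect Storage path above: read a file straight into GPU memory without a CPU bounce buffer; "input.bin", the 16 MiB size and the missing error checks are assumptions:

      #include <cufile.h>
      #include <cuda_runtime.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <cstring>

      int main()
      {
          cuFileDriverOpen();
          int fd = open("input.bin", O_RDONLY | O_DIRECT);

          CUfileDescr_t descr;
          std::memset(&descr, 0, sizeof(descr));
          descr.handle.fd = fd;
          descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
          CUfileHandle_t handle;
          cuFileHandleRegister(&handle, &descr);

          const size_t size = size_t{1} << 24;
          void* dev_buf = nullptr;
          cudaMalloc(&dev_buf, size);
          cuFileBufRegister(dev_buf, size, 0);              // register the GPU buffer with cuFile

          ssize_t got = cuFileRead(handle, dev_buf, size, 0, 0);   // DMA storage -> GPU memory
          (void)got;

          cuFileBufDeregister(dev_buf);
          cuFileHandleDeregister(handle);
          close(fd);
          cudaFree(dev_buf);
          cuFileDriverClose();
          return 0;
      }
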


Thursday, Mar 20 11:00 PM - 11:40 PM CET : How To Write A CUDA Program: The Parallel Programming Edition [S72897]
  • Stephen Jones (SW) : CUDA Architect, NVIDIA
  • Slides
  • A must watch (with a lot of nice schemas, so a bit less of notes)
  • Working on CUDA since 2008
  • How to avoid writing a GPU kernel
  • Oversubscription : launching more work than the hardware can theoretically handle at once (see the sketch at the end of this list)
  • cub and cuda.cooperative
  • cuTile : tile programming for CUDA
  • Llama 3.1 inference with cuTile gets within 10 percent of cuDNN, which is the state of the art
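  • A sketch (mine) of the oversubscription idea above: a grid-stride loop lets you launch many more blocks than the GPU can run at once, and each resident block keeps striding through the data while the rest wait their turn; the scaling kernel and launch numbers are illustrative:

      __global__ void scale(float* data, int n, float alpha)
      {
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)       // stride = total number of threads in the grid
              data[i] *= alpha;
      }
      // e.g. scale<<<4096, 256>>>(d_data, n, 2.0f); works for any n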