Computing

12.6.3 : Computing

LIVE : Monday, Mar 16 11:00 PM - 11:40 PM CET : CUDA: New Features and Beyond [S81859]

Stephen Jones : CUDA Architect, NVIDIA
From MIG to streams, to graphs, to flexible workloads
Task graphs do not always fill the whole machine
Parallelism is not a guarantee, it is an opportunity
Modern workloads are unpredictable
Prefill and decode can run on several GPUs or on a single one
CUDA Graphs can run on multiple GPUs
cuTile for Python and C++
Many of the hacks in the CUDA stack for optimising computation and data movement will be replaced by a multi-GPU driver
CUTLASS embeds NVSHMEM
You can debug numba-cuda

Tuesday, Mar 17 12:00 AM - 12:40 AM CET : cuTile Programming [S81433]

Bryce Lelbach : Principal Architect, NVIDIA
Grid wide programming
Sometimes no library does what you need
Writing a kernel -> thread level (SIMT)
Need a higher-level abstraction
Built on TileIR
Starts with CUDA 13.1, but 13.2 supports Ampere and Ada as well
cuTile reaches most of the performance of cuBLAS
cuTile C++ for CUDA 13.3
cuTile in Julia
cuTile in Rust (https://github.com/nblabs/cutile-rs)
SIMT : unit of execution is a thread
cuTile : unit of execution is a block
Tile arrays are immutable and are copied, not referenced, which allows more optimisations (fusion, fission, etc.)
but a tile copy is cheap because tiles are not materialised unnecessarily
Today each tile size must be a power of two but this will be relaxed in a future release
Quite good performance on sparse matrix multiplication with matrices from Matrix Market, compared to cuSPARSE (78 lines of generic code versus 1035 lines of code)
Hydrodynamics example
For now, we can mix tile and SIMT kernels in the same source file
In the future, we will be able to write functions which use both tile and SIMT code
Tiles will have support for multi-GPU libraries such as NVSHMEM and NCCL
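As a rough illustration of the block-level model noted in this session, the plain-Python sketch below emulates a tile kernel: each block loads a whole tile, computes on it as an immutable value, and stores the result. This is not the cuTile API. The helper names (load_tile, store_tile, launch) and the zero padding of the last tile are assumptions made for the example.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (the tile-size restriction noted above)."""
    return n > 0 and (n & (n - 1)) == 0


def load_tile(array, block_id, tile_size, pad=0.0):
    """Copy one tile out of `array`; the tuple makes the tile immutable."""
    start = block_id * tile_size
    chunk = tuple(array[start:start + tile_size])
    # Pad the last, partial tile so every tile has the same size.
    return chunk + (pad,) * (tile_size - len(chunk))


def store_tile(array, block_id, tile_size, tile):
    """Write a tile back, ignoring padded elements past the end of `array`."""
    start = block_id * tile_size
    for offset, value in enumerate(tile):
        if start + offset < len(array):
            array[start + offset] = value


def axpy_tile_kernel(block_id, tile_size, alpha, x, y, out):
    """Hypothetical tile kernel: the unit of execution is the block."""
    tx = load_tile(x, block_id, tile_size)
    ty = load_tile(y, block_id, tile_size)
    # Whole-tile operation: builds a new tile instead of mutating tx or ty.
    result = tuple(alpha * a + b for a, b in zip(tx, ty))
    store_tile(out, block_id, tile_size, result)


def launch(kernel, n, tile_size, *args):
    """Run `kernel` once per block; a GPU would run the blocks in parallel."""
    if not is_power_of_two(tile_size):
        raise ValueError("tile size must be a power of two")
    grid = (n + tile_size - 1) // tile_size  # ceil(n / tile_size) blocks
    for block_id in range(grid):
        kernel(block_id, tile_size, *args)


x = [float(i) for i in range(10)]
y = [1.0] * 10
out = [0.0] * 10
launch(axpy_tile_kernel, len(x), 4, 2.0, x, y, out)
print(out)
```

The kernel body never indexes individual threads: it only describes what happens to one tile, which is the part a tile compiler would be free to map onto threads and tensor cores.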

LIVE : Tuesday, Mar 17 12:00 AM - 12:40 AM CET : Inside the NVIDIA AI Platform and Ecosystem [S81911]

Ian Buck, VP Hyperscale and HPC Computing, NVIDIA
Mixture of experts => a lot of communication across the GPUs / racks
=> first time a meeting will get to a solution
KV cache => working memory of the model, used to answer the question with the context
Vera Rubin => Available mid-2026
CPUs will be used by agents to test their generated code
Vera : neural branch prediction => 2 branch predictions per cycle
2x perf per watt compared to x86
Compile the full Linux kernel in 1s
Polyphe : all racks connected to all other racks

Tuesday, Mar 17 1:00 AM - 1:40 AM CET : The Case for Block-Based Programming With cuTile and Triton for Image Processing Workloads Used in Semi Manufacturing [S81894]

Pradeep Ramachandran, Director, KLA
Arjun Vadakkeveedu, KLA
Semiconductors have complex geometries
Gate width is from 50 to 100 nm
Manufacturing a chip takes from 3 to 6 months, then we have to test it
So we have to find defects as soon as possible => find issues of 40 nm
Huge processing needs => move as many algorithms as possible onto the GPU
Software maintenance is a big challenge: the software has to last for 30 years
For images, block-based programming is natural
cuTile today does not support random number generators
Writing PTX can be fun but debugging it is horrible
cuTile supports automatic TMA use for loads and stores
cuTile distinguishes between kernels and device functions
Better performance debugging in Triton
Operations with shifted loads tend to perform poorly
Only power-of-two tile sizes are supported
Need hardware / software co-design to get performance

VIRTUAL : Tuesday, Mar 17 9:00 AM - 9:25 AM CET : Build a GPU-Accelerated Database Engine With CUDA [S82203]

Harishankar G, Member Leadership Staff, Zoho Corp.
Jalakandeshwaran A, Member Leadership Staff, Zoho Corp.
Postgres extensions
22 queries, under 2 minutes on 1 GPU with 90 GB of Memory
On the fly compilation of binary query ???
GPU accelerated decompression with 4090 GPU (up to 22x)
Pinned buffers, and NUMA pinning
Primitives from Thrust
They will look at NVIDIA Blackwell

Tuesday, Mar 17 5:00 PM - 5:40 PM CET : Achieve Truly Serverless GPUs With libfuse, CRIU, and CUDA-Checkpoint [S81424]

Charles Frye, Member of Technical Staff, Modal

Tuesday, Mar 17 6:00 PM - 6:40 PM CET : Accelerate Open Science: Incorporating CUDA Into the SciPy Ecosystem [S82113]

Ianna Osborne, Research Software Engineer, Princeton University
Leo Fang, Python CUDA Tech Lead, NVIDIA
Travis Oliphant, CEO, OpenTeams
Gaël Varoquaux, Co-Founder, :probabl.
Katrina Riehl, Principal Technical Product Manager, NVIDIA

Tuesday, Mar 17 7:00 PM - 7:40 PM CET : Achieving 8x Lower Cost Analytics with GPU-Accelerated DuckDB [S81870]

Xiangyao Yu, Assistant Professor, University of Wisconsin-Madison
Bobbi Yogatama, Sr. Systems Software Engineer, NVIDIA

Wednesday, Mar 18 12:00 AM - 12:40 AM CET : Accelerating Quantum Chemistry on GPUs—Latest Advances [S81770]

Robert Parrish, Sr. Engineering Manager, NVIDIA
Anders Blom, Technical Product Manager, Synopsys
cuEST : CUDA Electronic Structure Theory
Where electrons go in any molecule
Can be 50x faster than the previous CPU state of the art
21.7x in one afternoon (from 56 CPU threads to 1 B200 + 14 CPU threads)
53.4x on B200 alone
cuEST forms the potential matrix to predict where electrons go
Molecules of about 300 atoms on a B200
closed source, free of charge project
They only need 8 to 9 decimal digits of precision to compute DF-K
Using the Ozaki I and II DGEMM methods
Nice precision with Ozaki II and 8 moduli
RTX 6000 + emulation basically gives you A100 performance
They learned how to turn their problems into matrix multiplications

Wednesday, Mar 18 5:00 PM - 5:40 PM CET : All the Precision, None of the Transistors [S81811]

Harun Bayraktar, Sr. Director of Software Engineering, Libraries, NVIDIA
Blackwell GPU => 208B Transistors
FMA => Scalar part
MMA => Matrix Multiply part (tensor core)
Less and less FP64 in new hardware, but this does not mean that performance drops
27x between FP64 and FP32 in performance
In terms of performance, everything is increasing
What if we do not use linear algebra?
Ozaki I and II methods
ADP : Automatic Dynamic Precision => guaranteed DGEMM accuracy while using reduced-precision tensor cores, an extension of the Ozaki scheme
Exponent span capacity algorithm
Shipped in cuBLAS in 2025 for FP32 and FP64
Well tested
Impact will broaden through other CUDA-X libraries
But it uses more memory, right? => No, we can mitigate that with tiling; in the test we can use 4 GB and get the same performance with the same memory consumption
DGEMM on GB200 up to 4x on big matrices with the Ozaki II method, even better with non-square matrices
cuEST to speed up Quantum Chemistry simulation
Without emulation Hopper can sometimes beat Blackwell, but with emulation the speed-up is massive on Blackwell, even with Hopper
A lot of these do not need any code change, just enable it with some environment variables
Now, you can use all the computing units of your GPU
At some point this will come to A100, but it depends on the computation
An FP16 is not 4 FP4s, so that makes no sense
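To make the slicing idea behind the Ozaki-I style emulation mentioned in this session more concrete, here is a minimal pure-Python sketch on a dot product. It is a simplified fixed-point variant and not cuBLAS's implementation: the 7-bit slices, the restriction to inputs with |x| < 1, the truncation rule and the function names are all assumptions made for the example. Each slice fits in an int8 and each partial product is exact in integer arithmetic, which is the role the low-precision tensor cores would play.

```python
def to_slices(x, k, bits):
    """Split x (with |x| < 1) into k signed integer slices of `bits` bits,
    most significant first, so that x ~= sum(s[i] * 2**(-bits * (i + 1)))."""
    q = int(x * (1 << (k * bits)))  # fixed-point value, truncated toward zero
    sign = -1 if q < 0 else 1
    mag = abs(q)
    mask = (1 << bits) - 1
    return [sign * ((mag >> (bits * (k - 1 - i))) & mask) for i in range(k)]


def sliced_dot(a, b, k=7, bits=7, truncate=True):
    """Dot product of a and b computed from exact integer slice products.

    Returns (value, number_of_slice_products)."""
    sa = [to_slices(v, k, bits) for v in a]
    sb = [to_slices(v, k, bits) for v in b]
    total = 0
    products = 0
    for i in range(k):
        for j in range(k):
            if truncate and i + j >= k:
                continue  # contribution is below the target precision
            # Exact integer dot product of slice i of a with slice j of b.
            partial = sum(sa[n][i] * sb[n][j] for n in range(len(a)))
            total += partial * (1 << (bits * (2 * k - 2 - i - j)))
            products += 1
    return total / (1 << (2 * k * bits)), products


a = [0.1, -0.25, 0.7, 0.3333]
b = [0.9, 0.5, -0.125, 0.6]
reference = sum(u * v for u, v in zip(a, b))
approx, n_products = sliced_dot(a, b)
print(n_products, abs(approx - reference))
```

With k slices the truncated scheme keeps k(k+1)/2 slice products instead of k*k, which gives 28 for k = 7. Adding slices refines the result without changing the slices already computed.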

Thursday, Mar 19 1:00 AM - 1:50 AM CET : Ozaki Scheme: Addressing the Accuracy Challenge in the Era of Low-Precision Accelerators [S82285]

Katsuhisa Ozaki, Professor, Shibaura Institute of Technology
High performance in matrix multiplication
Performance at higher precision increases more slowly than at lower precision
Topic limited to matrix multiplication
Ozaki-I scheme from 2012
We can multiply FP16 values and accumulate in FP32
or use INT8 for multiplication and INT32 for addition
In CUDA 13.0 Update 2
export CUBLAS_EMULATE_DOUBLE_PRECISION=1
export CUBLAS_EMULATION_STRATEGY=performant (automatically set precision)
export CUBLAS_EMULATION_STRATEGY=eager (automatically use emulation)
Precision controlled by : cublasSetFixedPointEmulationMaxMantissaBitCount
Automatic dynamic precision : ADP
GEMM : Matrix–matrix multiplication of two general real matrices
SYMM : Multiplication of a symmetric matrix and a general matrix
SYRK : Symmetric rank-𝑘 update using a matrix and its transpose
SYR2K : Symmetric rank-2𝑘 update using two matrices
TRMM : Multiplication of a triangular matrix and a general matrix
TRSM : Solution of a matrix equation with a triangular coefficient matrix
Ozaki-II scheme : for DGEMM, DSYRK, DTRMM
Chinese Remainder Theorem
For k = 7, Ozaki-I needs 28 matrix multiplications of A and B, but Ozaki-II with 14 moduli only needs 14-15 matrix multiplications
Reproducible results across different GPUs
Adaptive accuracy for Ozaki-I: if k slices are not enough, we can compute α more slices without recomputing the previous k slices
Ozaki-II is not adaptive because adding more moduli changes the product of the moduli
The matrix has to be well conditioned
Obviously more efficient in algorithms which depend heavily on matrix multiplication (LU decomposition, QR, eigenvalue problems, Cholesky decomposition, etc.)
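The role of the Chinese Remainder Theorem in the Ozaki-II notes above can be sketched with a toy integer example: the matrix product is computed once per modulus with small residues, and the exact result is rebuilt from those residue products. This is only an illustration under assumptions. The real scheme works on scaled floating-point matrices and runs the residue products on INT8 tensor cores, while the moduli and function names below are chosen for the example.

```python
from math import prod

# Pairwise coprime moduli chosen for the example; residues fit in 8 bits.
MODULI = (251, 253, 255, 256)


def matmul(A, B):
    """Plain integer matrix product, used as the reference."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matmul_mod(A, B, m):
    """Matrix product where inputs and result are reduced modulo m."""
    return [[sum((A[i][t] % m) * (B[t][j] % m) for t in range(len(B))) % m
             for j in range(len(B[0]))] for i in range(len(A))]


def crt(residues, moduli):
    """Rebuild the integer in (-M/2, M/2] that has the given residues."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m): inverse of Mi modulo m
    x %= M
    return x - M if x > M // 2 else x  # symmetric range recovers negatives


def crt_matmul(A, B, moduli=MODULI):
    """One modular matrix product per modulus, then CRT on every entry."""
    partials = [matmul_mod(A, B, m) for m in moduli]
    return [[crt([p[i][j] for p in partials], moduli)
             for j in range(len(B[0]))] for i in range(len(A))]


A = [[123, -456], [789, 1011]]
B = [[-12, 34], [56, -78]]
C = crt_matmul(A, B)
print(C == matmul(A, B), len(MODULI), "modular products")
```

The number of matrix products equals the number of moduli, and the result is exact as long as every entry stays below half the product of the moduli. The reconstruction weights depend on that product, which is in line with the note that adding a modulus changes it.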

Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Pushing the Boundaries of CAE and EDA With NVIDIA cuDSS, the CUDA Direct Sparse Solver [S81824]

Azi Riahi, Principal Product Manager, NVIDIA

Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Massive GPU Acceleration for Scalable Quantum Compilation and Graph Analytics [S81533]

Oded Green, Sr. Developer Technology Engineer, NVIDIA
Yulun Wang, Staff Scientist, Quantum Control, Q-CTRL

Wednesday, Mar 18 7:00 PM - 7:40 PM CET : Achieve Peak Tensor Core Performance for GEMM on Blackwell via CUTLASS Python [S81463]

Vicki Wang, Distinguished Engineer, NVIDIA
Manas Sahni, Senior Software Engineer, NVIDIA
CUTLASS : C++ and Python DSLs
CUTLASS 4.4 manages TMA
Since Blackwell : 2 CTAs can share a matrix multiplication
Performing TMA asynchronously lets tensor core computation overlap with data movement
New CUTLASS DSL and CuTeDSL
CuTeDSL : available in @cute.experimental.jit and @cute.experimental.kernel
Get the same performance with less effort
Express what you want to do and not what the kernel is

Wednesday, Mar 18 11:00 PM - 12:30 AM CET : Performance Optimization of cuTile Kernels [S81705]

Rob Armstrong, CUDA Technical PM Lead, NVIDIA
Vishal Mehta, DevTech Engineer, NVIDIA
CUDA Tile hides the complexity of the classical SIMT model
How to guide the compiler with a system of hints
A kernel has a load at the beginning and a store at the end to express the tile computation
a tile kernel is just a new flavour of kernel
We can profile cuTile kernels with ncu (Nsight Compute)
You want to use large tiles to use tensor cores efficiently
ftz=true => flush denormals to zero
You can fuse kernels in cuTile
3.74x with all the optimisation tricks compared to the first naive kernel
2-CTA mode is slower in this example
Tiles of 256 cost too much memory and bring no gain
Requires CUDA 13.1+
TileGym for examples, autotuning, benchmarks
The goal of cuTile is to get 80% of speed of light with minimum effort
Could be more depending on the computation
cuTile cannot integrate PTX, which is non-portable, but it is possible with Triton
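The numerical effect of the ftz=true hint noted above (flush denormals to zero) can be shown on FP32 values in plain Python. This sketch only emulates the arithmetic behaviour. How cuTile applies the hint is not covered in these notes, and the function names are made up for the example.

```python
import math
import struct

FP32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal FP32 value


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Return x as FP32, replacing denormal values by a signed zero."""
    y = to_fp32(x)
    if y != 0.0 and abs(y) < FP32_MIN_NORMAL:
        return math.copysign(0.0, y)
    return y


tiny = 1e-40  # below the normal FP32 range, so stored as a denormal
print(to_fp32(tiny) != 0.0, flush_to_zero(tiny))
```

Values in the normal range are unchanged, so the trade-off is a loss of accuracy only for results very close to zero.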

LIVE : Wednesday, Mar 18 11:00 PM - 11:40 PM CET : Architects of the Accelerated Age: How CUDA Builders Changed the World (and How You’re Next) [S82416]

Stephen Jones (SW), CUDA Architect, NVIDIA
Joe Stam, Head of Research Technology, Core Strategies, Jump Trading
Kate Clark, Distinguished DevTech Engineer, NVIDIA
Paulius Micikevicius, Software Engineer, Meta Superintelligence Labs
Wen-Mei Hwu, Senior Distinguished Research Scientist and Senior Research Director, NVIDIA
2010 : 2000 GPUs but no chassis => wooden computer => 3rd on the Green500
Challenge with CUDA : building something new and making people use it
Video editing software in 4K when it was not really a thing
It takes people a while to get comfortable with a new technology
People wait for several generations of GPUs to evaluate the stability of development
Now you need a full rack to do stuff, where are we going?
A lot of investigations can still be done on a single GPU
There will always be development platforms which are laptops and not gigantic GPU platforms with racks
You hire smart people because they are creative not because they can type fast
Tools have to make up the gap as hardware complexity continues to increase
If you cannot debug it you need some tool to fill up that gap for you
Push and pull between hardware and software ?
Hopper's TMA was driven by software
Better L1 cache is for the software
This tension is healthy
Do we need to learn CUDA because of AI ?
When you write a prompt, you have to express the prompt to ensure the model will generate the kernel well. So you still have to learn CUDA.
You remove the tedious syntax, but you have to express what you want the hardware to do.
When you have billions and billions of dollars of investment, every percent of performance matters.
And you have to understand what the model is doing

Thursday, Mar 19 12:00 AM - 12:40 AM CET : Accelerated Building Blocks for Next-Generation AI+HPC Workloads [S81792]

Heidi Poxon, Director of Product, HPC Software, NVIDIA

Thursday, Mar 19 5:00 PM - 6:30 PM CET : Maximize Memory Bandwidth on Modern GPUs: Practical Techniques, Patterns, and Worked Examples [S81666]

Benedikt Dorschner, Sr. DevTech Engineer, NVIDIA
Matthew Martineau, Sr. Developer Technology Engineer, NVIDIA
Sam Mish, DevTech Engineer, NVIDIA

Thursday, Mar 19 5:00 PM - 5:40 PM CET : Python All the Way Down: Speed-of-Light CUDA Without Leaving Python [S81531]

Ashwin Srinath, Senior Software Engineer, NVIDIA
Ianna Osborne, Research Software Engineer, Princeton University

Thursday, Mar 19 6:00 PM - 6:40 PM CET : Best Practices in Building and Packaging CUDA Python Libraries [S82370]

Vyas Ramasubramani, Sr. Systems Software Engineer, NVIDIA
Jonathan Dekhtiar, Sr. CUDA Python Engineer, NVIDIA
A must-see if you are a developer!
If you can avoid binary packages, do it.
Don't use pip
Pip is stupid, don't use it, use conda; oh, conda is slow and not reliable, use pixi
MMA : Matrix Multiplication and Accumulation
Use CMake
For Python use scikit-build-core and not Meson or setuptools
Try to support >= 2 CUDA major versions, and try to be compatible with all minor versions of one major CUDA version at once, in one package
Use conda-forge
cuda-pathfinder will find CUDA, cuBLAS or any package you need to get your dependencies
Dependencies are broken because pip is horrible and people are too lazy to make working packages for pip (and they are not helped by pip)
NVIDIA people try to increase the overlap of dependencies

Thursday, Mar 19 7:00 PM - 7:40 PM CET : Accelerate GPU Scientific Computing With nvmath-python [S81581]

Sergey Maydanov, Sr. Software Engineering Manager, NVIDIA
Aart Bik, Distinguished Engineer, NVIDIA
If you write your own math library you will have to maintain it forever.
Very different performance between CuPy A @ B, cupy.matmul(A, B) and nvmath-python matmul, even though they call the same implementation under the hood
3x with JIT and kernel fusion
nvmath-python brings a random generation API for Python (on top of cuRAND)
from nvmath.linalg import matmul for generic matrix multiplication
from nvmath.linalg.advanced import matmul for advanced matrix multiplication
For parallelism you can use mpi4py with MPIProcessingGroup
It also supports FP32 emulation
Version 0.9 coming just after GTC

LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Orchestrate Next-Generation AI Workloads With Open-Source Slurm [S82420]

Danny Auble, Head Director for Slurm and Slinky, NVIDIA
Tim Wickberg, Director, System Software, Slurm and Slinky, NVIDIA
Slurm is open source and vendor agnostic
Pull requests now allowed
Topology block plugin for NVL 72
--segment to place jobs better and group them
new configuration topology.yaml
New in Slurm 25.11
NVIDIA topograph to generate the topology
120 seconds to restart a failing job => --requeue=expedite to fix it for jobs which use 1000s of nodes
EXPEDITING state to restart as soon as possible

LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Top-K Selection at the Speed of Light [S81614]

Christina Zhang, DevTech Compute Engineer, NVIDIA
Elias Stehle, Senior Systems Software Engineer, NVIDIA
Yue Weng, DevTech, NVIDIA

Thursday, Mar 19 11:00 PM - 11:40 PM CET : Break Communication Barriers: Scale AI and HPC with NCCL, NIXL, and NVSHMEM [S82407]

Matthew Nicely, Group Manager, AI Platform SW, NVIDIA
NVSHMEM, NCCL, NIXL
No focus on fault tolerance because it is a part of business as usual
Communication is integrated throughout the stack
GROMACS : simulate molecules
NVSHMEM : don't do the handshake on the CPU, stay on the GPU, up to 4x on bandwidth
NCCL is an ecosystem
NCCL inspector => simplify debugging with NCCL
Since last year : NCCL has symmetric memory (like NVSHMEM)
Zero SM communication => use a copy engine instead
NIXL : a new communication ecosystem (KV page layout, tier selection policy, replication strategy)
Data movement : memory, NVMe, S3 storage, etc.
GPU threads can initiate communications with another GPU
From 480 us to 50 us by using NIXL with DeepSeek
Spectrum-X : with the NCCL Spectrum-X plugin you get real-time congestion feedback
Clean noise isolation
The future :
NCCL : elastic buffer, port recovery, DSL support, TMA support
NIXL : container support, DSL support, load balancing and failover, KV compression service
NVSHMEM : DSL support, user buffer registration enhancement, IB non-blocking RMA completion
They should work with CUDA Graphs through the DSL they will provide later this year