12.6.3 : Computing
LIVE : Monday, Mar 16 11:00 PM - 11:40 PM CET : CUDA: New Features and Beyond [S81859]
- Stephen Jones : CUDA Architect, NVIDIA
- From MIG to streams, to graphs, to flexible workloads
- Task graphs, but they do not always fill the whole machine
- Parallelism is not a guarantee, it is an opportunity
- Modern workloads are unpredictable
- Several GPUs, or a single one, with Prefill and Decode
- CUDA Graphs can run on multiple GPUs
- cuTile for Python and C++
- A lot of hacking in the CUDA stack to optimise computing and data movement will be replaced by a multi-GPU driver
- CUTLASS embeds NVSHMEM
- You can debug numba-cuda
Tuesday, Mar 17 12:00 AM - 12:40 AM CET : cuTile Programming [S81433]
- Bryce Lelbach : Principal Architect, NVIDIA
- Grid-wide programming
- Sometimes no library does what you need
- Writing a kernel -> thread level (SIMT)
- Need a higher level abstraction
- Built on TileIR
- Available from CUDA 13.1; CUDA 13.2 adds support for Ampere and Ada as well
- cuTile gets most of the performance of cuBLAS
- cuTile C++ for CUDA 13.3
- cuTile in Julia
- cuTile in Rust (https://github.com/nblabs/cutile-rs)
- SIMT : unit of execution is a thread
- cuTile : unit of execution is a block
- Tile arrays are immutable and are copied, not referenced, which allows more optimisations (fusion, fission, etc.)
- but a tile copy is cheap because tiles are not materialised unnecessarily
- Today each tile size must be a power of two, but this will be relaxed in a future release
- Quite good performance on sparse matrix multiplication with Matrix Market matrices compared to cuSPARSE (with 78 lines of generic code, compared to 1035 lines of code)
- Hydrodynamics example
- For now, we can mix tile and kernel in the same source file
- In the future, we will be able to develop functions which can use both tile and kernel
- Tiles will have support for multi-GPU libraries such as NVSHMEM and NCCL
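To make the "unit of execution is a block" idea above concrete, here is a minimal CPU sketch in plain NumPy. It is not the cuTile API: the function name `tiled_matmul` and the tile edge of 4 are assumptions chosen for illustration. Each iteration of the two outer loops plays the role of one block that loads input tiles, builds a new accumulator tile without mutating its inputs, and stores one output tile.

```python
import numpy as np

TILE = 4  # hypothetical power-of-two tile edge, mirroring the restriction noted above


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (M x K) by b (K x N), producing one output tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    assert m % tile == 0 and n % tile == 0 and k % tile == 0, "shapes must be tile multiples"
    c = np.zeros((m, n), dtype=a.dtype)
    # Each (bi, bj) pair stands for one block: it owns exactly one output tile.
    for bi in range(0, m, tile):
        for bj in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=a.dtype)
            for bk in range(0, k, tile):
                a_tile = a[bi:bi + tile, bk:bk + tile]  # "load" a tile of A
                b_tile = b[bk:bk + tile, bj:bj + tile]  # "load" a tile of B
                acc = acc + a_tile @ b_tile             # new tile value, inputs untouched
            c[bi:bi + tile, bj:bj + tile] = acc          # "store" the finished tile
    return c
```

In a real tile kernel the two outer loops would disappear: the runtime launches one block per output tile and the compiler decides how the threads of the block cooperate.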
LIVE : Tuesday, Mar 17 12:00 AM - 12:40 AM CET : Inside the NVIDIA AI Platform and Ecosystem [S81911]
- Ian Buck, VP Hyperscale and HPC Computing, NVIDIA
- Mixture of experts => a lot of communication across the GPUs / racks
- => first time a meeting will get to a solution
- KV cache => working memory of the model, used to answer the question with the context
- Vera Rubin => Available mid-2026
- CPUs will be used by agents to test their generated code
- Vera : Neural Branch prediction => 2 branch predictions per Cycle
- 2x perf per watt compared to x86
- Compile the full Linux kernel in 1s
- Polyphe : all racks connected to all other racks
Tuesday, Mar 17 1:00 AM - 1:40 AM CET : The Case for Block-Based Programming With cuTile and Triton for Image Processing Workloads Used in Semi Manufacturing [S81894]
- Pradeep Ramachandran, Director, KLA
- Arjun Vadakkeveedu, KLA
- Semiconductors have complex geometries
- Width of the gate is from 50 to 100 nm
- Manufacturing a chip takes 3 to 6 months, then we have to test it
- So we have to find defects as soon as possible => find issues of 40 nm
- Huge processing needs => move as many algorithms as possible onto the GPU
- Software maintenance is a big challenge: the software has to last for 30 years
- For images, block-based programming is natural
- cuTile today does not support random number generation
- Writing PTX could be fun but debugging it is horrible
- cuTile supports automatic TMA use for loads and stores
- cuTile distinguishes between kernels and device functions
- Better performance debugging in Triton
- Operations with shifted loads tend to perform poorly
- Only power-of-two tile sizes are supported
- Need hardware / software co-design to get performance
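The notes above mention block-based image processing, power-of-two tiles and "shifted loads". A minimal NumPy sketch of what that combination looks like, assuming a 3x3 neighbourhood sum as the workload (the function name `box_sum_3x3_tiled`, the tile edge of 8 and the zero padding at the borders are illustrative choices, not something shown in the talk): each tile is rebuilt from nine loads of the same region, each shifted by one stencil offset.

```python
import numpy as np


def box_sum_3x3_tiled(image, tile=8):
    """3x3 neighbourhood sum computed tile by tile, with zero padding at the borders."""
    h, w = image.shape
    assert tile > 0 and tile & (tile - 1) == 0, "tile edge must be a power of two"
    # Round the image size up to a multiple of the tile edge.
    hp = -(-h // tile) * tile
    wp = -(-w // tile) * tile
    # One-pixel halo on every side so that shifted loads never leave the buffer.
    padded = np.zeros((hp + 2, wp + 2), dtype=image.dtype)
    padded[1:h + 1, 1:w + 1] = image
    out = np.zeros((hp, wp), dtype=image.dtype)
    for ty in range(0, hp, tile):
        for tx in range(0, wp, tile):
            acc = np.zeros((tile, tile), dtype=image.dtype)
            # Nine shifted loads of the same tile, one per stencil offset.
            for dy in range(3):
                for dx in range(3):
                    acc = acc + padded[ty + dy:ty + dy + tile, tx + dx:tx + dx + tile]
            out[ty:ty + tile, tx:tx + tile] = acc
    return out[:h, :w]  # crop the padding added for the tile size
```

The nine overlapping loads per tile are the pattern the speakers reported as performing poorly; on a GPU one would typically try to load the tile plus halo once and reuse it.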
VIRTUAL : Tuesday, Mar 17 9:00 AM - 9:25 AM CET : Build a GPU-Accelerated Database Engine With CUDA [S82203]
- Harishankar G, Member Leadership Staff, Zoho Corp.
- Jalakandeshwaran A, Member Leadership Staff, Zoho Corp.
- Postgres extensions
- 22 queries, under 2 minutes on 1 GPU with 90 GB of Memory
- On the fly compilation of binary query ???
- GPU accelerated decompression with 4090 GPU (up to 22x)
- Pinned buffers, and NUMA pinning
- Primitives from Thrust
- They will look at NVIDIA Blackwell
Tuesday, Mar 17 5:00 PM - 5:40 PM CET : Achieve Truly Serverless GPUs With libfuse, CRIU, and CUDA-Checkpoint [S81424]
- Charles Frye, Member of Technical Staff, Modal
Tuesday, Mar 17 6:00 PM - 6:40 PM CET : Accelerate Open Science: Incorporating CUDA Into the SciPy Ecosystem [S82113]
- Ianna Osborne, Research Software Engineer, Princeton University
- Leo Fang, Python CUDA Tech Lead, NVIDIA
- Travis Oliphant, CEO, OpenTeams
- Gaël Varoquaux, Co-Founder, :probabl.
- Katrina Riehl, Principal Technical Product Manager, NVIDIA
Tuesday, Mar 17 7:00 PM - 7:40 PM CET : Achieving 8x Lower Cost Analytics with GPU-Accelerated DuckDB [S81870]
- Xiangyao Yu, Assistant Professor, University of Wisconsin-Madison
- Bobbi Yogatama, Sr. Systems Software Engineer, NVIDIA
Wednesday, Mar 18 12:00 AM - 12:40 AM CET : Accelerating Quantum Chemistry on GPUs—Latest Advances [S81770]
- Robert Parrish, Sr. Engineering Manager, NVIDIA
- Anders Blom, Technical Product Manager, Synopsys
- cuEST : CUDA Electronic Structure Theory
- where electrons go in any molecule
- can be 50x faster than the previous CPU state of the art
- 21.7x in one afternoon (from 56 CPU threads to 1 B200 + 14 CPU threads)
- 53.4x on B200 alone
- cuEST forms the potential matrix to predict where electrons go
- molecules of about 300 atoms on a B200
- closed source, free of charge project
- they only need 8-9 decimal digits of precision to compute DF-K
- Using Ozaki I and II DGEMM methods
- nice precision with Ozaki II and 8 moduli
- RTX6000 + emulation gives you basically A100 performance
- They learned how to turn their problems into matrix multiplications
Wednesday, Mar 18 5:00 PM - 5:40 PM CET : All the Precision, None of the Transistors [S81811]
- Harun Bayraktar, Sr. Director of Software Engineering, Libraries, NVIDIA
- Blackwell GPU => 208B Transistors
- FMA => Scalar part
- MMA => Matrix Multiply part (tensor core)
- Less and less FP64 in new hardware, but this does not mean that performance drops
- 27x between FP64 and FP32 in performance
- In terms of performance, everything is increasing
- What if we do not use linear algebra ?
- Ozaki I and II methods
- ADP : Automatic Dynamic Precision => guaranteed DGEMM accuracy while using reduced-precision tensor cores, an extension of the Ozaki scheme
- Exponent span capacity algorithm
- shipped in cuBLAS in 2025 for FP32 and FP64
- Well tested
- Impact will broaden through other CUDA-X libraries
- But it uses more memory, right ? => no, we can mitigate that with tiling, in the test we can use 4GB and get the same performance with the same memory consumption
- DGEMM on GB200 up to 4x on big matrices, with Ozaki II method, even better with non square matrices
- cuEST to speed up Quantum Chemistry simulation
- Without emulation Hopper can sometimes beat Blackwell, but with emulation the speed up is massive on Blackwell, even with Hopper
- A lot of these do not need any code change, just enable it with some environment variables
- Now, you can use all the computing units of your GPU
- At some point this will come to A100, but it depends on the computation
- An FP16 is not 4 FP4, so that makes no sense
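As a rough illustration of the slicing idea behind Ozaki-I style emulation mentioned above, here is a simplified fixed-point sketch in NumPy. It is not the cuBLAS implementation: the real scheme handles exponents per row and column, while this version assumes all entries lie in [-1, 1]; `BITS`, `split_slices` and `sliced_matmul` are hypothetical names. Each matrix is split into small integer slices that fit in int8, every slice product is exact with int32 accumulation, and the products are recombined in FP64.

```python
import numpy as np

BITS = 6  # assumed bits per slice, small enough for int8 storage


def split_slices(x, k):
    """Split x (entries in [-1, 1]) into k int8 slices such that
    x ~= sum_i slices[i] * 2**(-BITS * (i + 1))."""
    slices = []
    residual = x.astype(np.float64)
    for i in range(k):
        scale = 2.0 ** (BITS * (i + 1))
        s = np.round(residual * scale)      # |s| <= 64, so it fits in int8
        slices.append(s.astype(np.int8))
        residual = residual - s / scale     # what the remaining slices must represent
    return slices


def sliced_matmul(a, b, k):
    """Approximate a @ b from exact integer products of slices."""
    a_s = split_slices(a, k)
    b_s = split_slices(b, k)
    c = np.zeros((a.shape[0], b.shape[1]))
    products = 0
    for i in range(k):
        for j in range(k - i):  # skip pairs whose weight is below the kept precision
            # int8 inputs with int32 accumulation: this product is exact.
            p = a_s[i].astype(np.int32) @ b_s[j].astype(np.int32)
            c += p * 2.0 ** (-BITS * (i + j + 2))
            products += 1
    return c, products
```

With k slices the double loop performs k(k+1)/2 low-precision products, so k = 7 gives 28, and raising k improves the accuracy at the cost of more products.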
Thursday, Mar 19 1:00 AM - 1:50 AM CET : Ozaki Scheme: Addressing the Accuracy Challenge in the Era of Low-Precision Accelerators [S82285]
- Katsuhisa Ozaki, Professor, Shibaura Institute of Technology
- High performance in matrix multiplication
- Performance for higher precision increases more slowly than for lower precision
- Topic limited to matrix multiplication
- Ozaki-I scheme from 2012
- We can multiply fp16 values and accumulate in fp32
- or int8 for multiplication and int32 for addition
- in CUDA 13.0 update 2
- export CUBLAS_EMULATE_DOUBLE_PRECISION=1
- export CUBLAS_EMULATION_STRATEGY=performant (automatically set precision)
- export CUBLAS_EMULATION_STRATEGY=eager (automatically use emulation)
- precision controlled by : cublasSetFixedPointEmulationMaxMantissaBitCount
- Automatic dynamic precision : ADP
- GEMM : Matrix–matrix multiplication of two general real matrices
- SYMM : Multiplication of a symmetric matrix and a general matrix
- SYRK : Symmetric rank-𝑘 update using a matrix and its transpose
- SYR2K : Symmetric rank-2𝑘 update using two matrices
- TRMM : Multiplication of a triangular matrix and a general matrix
- TRSM : Solution of a matrix equation with a triangular coefficient matrix
- Ozaki-II Scheme : for DGEMM, DSYRK, DTRMM
- Chinese Remainder Theorem
- For k = 7, Ozaki-I needs 28 matrix multiplications of A and B, but Ozaki-II with 14 moduli only needs 14-15 matrix multiplications
- Reproducible results across different GPUs
- Adaptive accuracy for Ozaki-I: if k slices are not enough, we can compute alpha more slices without recomputing the previous k slices
- Ozaki-II is not adaptive because adding more moduli changes the product of the moduli
- The matrix has to be well conditioned
- Obviously more efficient in algorithms which depend heavily on matrix multiplication (LU decomposition, QR, eigenvalues, Cholesky decomposition, etc.)
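The Chinese Remainder Theorem step behind Ozaki-II can be sketched on integer matrices. This is only an illustration of the principle, not the scheme as presented: the four prime moduli and the function name `crt_matmul` are assumptions, and the scaling of floating-point inputs to integers is left out. One small modular product is computed per modulus, and the exact result is rebuilt from the residues.

```python
import numpy as np
from math import prod

MODULI = [251, 241, 239, 233]  # hypothetical choice: distinct primes, hence pairwise coprime


def crt_matmul(a, b, moduli=MODULI):
    """Exact product of integer matrices a and b, rebuilt from one modular
    product per modulus. Valid while every |entry of a @ b| < prod(moduli) / 2."""
    big_m = prod(moduli)  # fits easily in int64 for these four moduli
    result = np.zeros((a.shape[0], b.shape[1]), dtype=np.int64)
    for m in moduli:
        # One matrix multiplication per modulus, on residues smaller than m.
        r = ((a % m) @ (b % m)) % m
        m_i = big_m // m
        inv = pow(m_i, -1, m)  # modular inverse of m_i modulo m (Python 3.8+)
        result += r * (m_i * inv)
    result %= big_m
    # Map from [0, M) back to the symmetric range so negative entries are recovered.
    return np.where(result > big_m // 2, result - big_m, result)
```

The number of matrix products equals the number of moduli, which is consistent with the "14 moduli, 14-15 multiplications" figure above. Every reconstruction weight `m_i * inv` depends on the full product of the moduli, which is why adding a modulus afterwards invalidates the earlier weights.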
Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Pushing the Boundaries of CAE and EDA With NVIDIA cuDSS, the CUDA Direct Sparse Solver [S81824]
- Azi Riahi, Principal Product Manager, NVIDIA
Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Massive GPU Acceleration for Scalable Quantum Compilation and Graph Analytics [S81533]
- Oded Green, Sr. Developer Technology Engineer, NVIDIA
- Yulun Wang, Staff Scientist, Quantum Control, Q-CTRL
Wednesday, Mar 18 7:00 PM - 7:40 PM CET : Achieve Peak Tensor Core Performance for GEMM on Blackwell via CUTLASS Python [S81463]
- Vicki Wang, Distinguished Engineer, NVIDIA
- Manas Sahni, Senior Software Engineer, NVIDIA
- CUTLASS : C++ and Python DSLs
- CUTLASS 4.4 manages TMA
- Since Blackwell : 2 CTAs can share a matrix multiplication
- Performing TMA asynchronously enables overlap with tensor core computation
- New CUTLASS DSL and CuTeDSL
- CuTeDSL : available in @cute.experimental.jit and @cute.experimental.kernel
- Get the same performance with less effort
- Express what you want to do and not what the kernel is
Wednesday, Mar 18 11:00 PM - 12:30 AM CET : Performance Optimization of cuTile Kernels [S81705]
- Rob Armstrong, CUDA Technical PM Lead, NVIDIA
- Vishal Mehta, DevTech Engineer, NVIDIA
- CUDA Tile hides the complexity of the classical SIMT model
- How to guide the compiler with a system of hints
- A kernel with a load at the beginning and a store at the end to express tile computation
- a tile kernel is just a new flavour of kernel
- We can profile cuTile with Nsight Compute (ncu)
- You want to use large tiles to use tensor cores efficiently
- ftz=true => flush denormals to zero
- You can fuse kernels in cuTile
- 3.74x with all optimisation tricks compared to the first naive kernel
- 2-CTA mode is slower in this example
- Tiles of 256 cost too much memory and there is no gain
- Requires CUDA 13.1+
- TileGym for examples, autotuning, benchmarks
- The goal of cuTile is to get 80% of speed of light with minimum effort
- Could be more depending on the computation
- cuTile cannot integrate PTX, which is not portable. But it is possible with Triton
LIVE : Wednesday, Mar 18 11:00 PM - 11:40 PM CET : Architects of the Accelerated Age: How CUDA Builders Changed the World (and How You’re Next) [S82416]
- Stephen Jones (SW), CUDA Architect, NVIDIA
- Joe Stam, Head of Research Technology, Core Strategies, Jump Trading
- Kate Clark, Distinguished DevTech Engineer, NVIDIA
- Paulius Micikevicius, Software Engineer, Meta Superintelligence Labs
- Wen-Mei Hwu, Senior Distinguished Research Scientist and Senior Research Director, NVIDIA
- 2010 : 2000 GPUs but no chassis => wooden computer => 3rd on the Green500
- Challenge with CUDA : building something new and making people use it
- Video editing software in 4K when it was not really a thing
- It takes people a while to get comfortable with a new technology
- People wait for several generations of GPUs to evaluate stability of development
- Now you need a full rack to do stuff, where are we going ?
- A lot of investigations can still be done on a single GPU
- There will always be development platforms which are laptops and not gigantic GPU platforms with racks
- You hire smart people because they are creative not because they can type fast
- Tools have to make up the gap as hardware complexity continues to increase
- If you cannot debug it you need some tool to fill up that gap for you
- Push and pull between hardware and software ?
- TMA of Hopper was driven by software
- Better L1 cache is for the software
- This tension is healthy
- Do we need to learn CUDA because of AI ?
- When you write a prompt, you have to express the prompt to ensure the model will generate the kernel well. So you still have to learn CUDA.
- You remove the tedious syntax, but you have to express what you want the hardware to do.
- When you have billions and billions of dollars of investment, every percent of performance matters.
- And you have to understand what the model is doing
Thursday, Mar 19 12:00 AM - 12:40 AM CET : Accelerated Building Blocks for Next-Generation AI+HPC Workloads [S81792]
- Heidi Poxon, Director of Product, HPC Software, NVIDIA
Thursday, Mar 19 5:00 PM - 6:30 PM CET : Maximize Memory Bandwidth on Modern GPUs: Practical Techniques, Patterns, and Worked Examples [S81666]
- Benedikt Dorschner, Sr. DevTech Engineer, NVIDIA
- Matthew Martineau, Sr. Developer Technology Engineer, NVIDIA
- Sam Mish, DevTech Engineer, NVIDIA
Thursday, Mar 19 5:00 PM - 5:40 PM CET : Python All the Way Down: Speed-of-Light CUDA Without Leaving Python [S81531]
- Ashwin Srinath, Senior Software Engineer, NVIDIA
- Ianna Osborne, Research Software Engineer, Princeton University
Thursday, Mar 19 6:00 PM - 6:40 PM CET : Best Practices in Building and Packaging CUDA Python Libraries [S82370]
- Vyas Ramasubramani, Sr. Systems Software Engineer, NVIDIA
- Jonathan Dekhtiar, Sr. CUDA Python Engineer, NVIDIA
- A must see if you are a developer !
- if you can avoid binary packages, do it.
- don't use pip
- Pip is stupid, don't use it, use conda, oh, conda is slow and not reliable, use pixi
- MMA : Matrix Multiplication and Accumulation
- use CMake
- For Python use scikit-build-core and not meson or setuptools
- Try to support >= 2 CUDA major versions, and try to be compatible with all minor versions of one major CUDA version at once, in one package
- Use conda-forge
- cuda-pathfinder will find CUDA, cuBLAS or any package you need to get your dependencies
- dependencies are broken because pip is horrible and people are too lazy to make working packages for pip (and they are not helped by pip)
- NVIDIA people try to increase the overlap of dependencies
Thursday, Mar 19 7:00 PM - 7:40 PM CET : Accelerate GPU Scientific Computing With nvmath-python [S81581]
- Sergey Maydanov, Sr. Software Engineering Manager, NVIDIA
- Aart Bik, Distinguished Engineer, NVIDIA
- If you write your own math library you will have to maintain it forever.
- Very different performance between cupy A @ B, cupy.matmul(A, B) and nvmath-python matmul, even though they call the same implementation under the hood
- 3x with JIT and kernel fusion
- nvmath-python brings a random number generation API to Python (on top of cuRAND)
- from nvmath.linalg import matmul for generic matrix multiplication
- from nvmath.linalg.advanced import matmul for advanced matrix multiplication
- For parallelism you can use mpi4py with MPIProcessingGroup
- It also supports FP32 emulation
- version 0.9 coming just after GTC
LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Orchestrate Next-Generation AI Workloads With Open-Source Slurm [S82420]
- Danny Auble, Head Director for Slurm and Slinky, NVIDIA
- Tim Wickberg, Director, System Software, Slurm and Slinky, NVIDIA
- Slurm is open source and vendor agnostic
- Pull requests now allowed
- Topology block plugin for NVL 72
- --segment to place jobs better and group them
- new configuration topology.yaml
- New in Slurm 25.11
- NVIDIA topograph to generate the topology
- 120 seconds to restart a failing job => --requeue=expedite to fix it for jobs which use 1000s of nodes
- EXPEDITING state to restart as soon as possible
LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Top-K Selection at the Speed of Light [S81614]
- Christina Zhang, DevTech Compute Engineer, NVIDIA
- Elias Stehle, Senior Systems Software Engineer, NVIDIA
- Yue Weng, DevTech, NVIDIA
Thursday, Mar 19 11:00 PM - 11:40 PM CET : Break Communication Barriers: Scale AI and HPC with NCCL, NIXL, and NVSHMEM [S82407]
- Matthew Nicely, Group Manager, AI Platform SW, NVIDIA
- NVSHMEM, NCCL, NIXL
- No focus on fault tolerance because it is a part of business as usual
- Communication is integrated across the stack
- GROMACS : simulate molecules
- NVSHMEM : don't do the handshake on the CPU, stay on the GPU, up to 4x on bandwidth
- NCCL is an ecosystem
- NCCL inspector => simplify debugging with NCCL
- Since last year : NCCL has symmetric memory (like NVSHMEM)
- Zero SM communication => use a copy engine instead
- NIXL : a new Communication ecosystem (KV page layout, Tier Selection Policy, Replication strategy)
- Data movement : Memory, NVMe, S3 storage, etc
- GPU threads can initiate communications with another GPU
- from 480us to 50us by using NIXL with DeepSeek
- Spectrum-X : with the NCCL Spectrum-X plugin you get real-time congestion feedback
- Clean noise isolation
- The future :
- NCCL : elastic buffer, Port Recovery, DSL Support, TMA support
- NIXL : Container support, DSL support, Load balance and failover, KV compression service
- NVShmem : DSL Support, User buffer registration enhancement, IB non-blocking RMA completion
- They should work with CUDA Graphs through the DSL they will provide later this year