12.6.3 : Computing

LIVE : Monday, Mar 16 11:00 PM - 11:40 PM CET : CUDA: New Features and Beyond [S81859]
  • Stephen Jones, CUDA Architect, NVIDIA
  • From MIG to streams, to graphs, to flexible workloads
  • Task graphs do not always fill the whole machine
  • Parallelism is not a guarantee, it is an opportunity
  • Modern workloads are unpredictable
  • Several GPUs, or a single one, handling prefill and decode
  • CUDA Graphs can run on multiple GPUs
  • cuTile for Python and C++
  • A lot of the hacking done in the CUDA stack to optimise compute and data movement will be replaced by a multi-GPU driver
  • CUTLASS embeds NVSHMEM
  • You can debug numba-cuda


Tuesday, Mar 17 12:00 AM - 12:40 AM CET : cuTile Programming [S81433]
  • Bryce Lelbach, Principal Architect, NVIDIA
  • Grid-wide programming
  • Sometimes no library does what you need
  • Writing a kernel means working at thread level (SIMT)
  • Need a higher-level abstraction
  • Built on TileIR
  • Starts with CUDA 13.1, but 13.2 supports Ampere and Ada as well
  • cuTile gets most of the performance of cuBLAS
  • cuTile C++ for CUDA 13.3
  • cuTile in Julia
  • cuTile in Rust (https://github.com/nblabs/cutile-rs)
  • SIMT : unit of execution is a thread
  • cuTile : unit of execution is a block
  • Tile arrays are immutable and are copied, not referenced, which allows more optimisations (fusion, fission, etc.)
  • but a tile copy is cheap because a tile is not materialised unnecessarily
  • Today each tile size must be a power of two, but this will be relaxed in a future release
  • Quite good performance on sparse matrix multiplication with Matrix Market matrices compared to cuSPARSE (with 78 lines of generic code, compared to 1035 lines of code)
  • Example of hydrodynamics
  • For now, we can mix tile and kernel in the same source file
  • In the future, we will be able to develop functions which can use both tile and kernel
  • Tiles will have support for multi-GPU libraries such as NVSHMEM and NCCL
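
The "unit of execution is a block" idea and the power-of-two tile restriction can be illustrated with a minimal sketch in plain Python. This is not the cuTile API: `load_tile`, `tile_mma` and `tiled_matmul` are hypothetical helper names, and the loops only mimic what a tile compiler would map to hardware. Each tile is copied on load and never modified in place, mirroring the "immutable, copied" point above.

```python
TILE = 2  # power-of-two tile edge, mirroring the current restriction


def load_tile(m, row0, col0, size):
    """Copy a size x size tile out of matrix m (tiles are values, not views)."""
    return [row[col0:col0 + size] for row in m[row0:row0 + size]]


def tile_mma(acc, a, b):
    """Return a new tile acc + a @ b; the inputs are left untouched."""
    n = len(acc)
    return [[acc[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def tiled_matmul(a, b, tile=TILE):
    """Multiply square matrices whose size is a multiple of `tile`."""
    n = len(a)
    assert n % tile == 0 and (tile & (tile - 1)) == 0
    c = [[0] * n for _ in range(n)]
    # Each (bi, bj) pair is one block of work: the unit of execution.
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            acc = [[0] * tile for _ in range(tile)]
            for bk in range(0, n, tile):
                acc = tile_mma(acc, load_tile(a, bi, bk, tile),
                               load_tile(b, bk, bj, tile))
            # Single store of the finished output tile.
            for i in range(tile):
                c[bi + i][bj:bj + tile] = acc[i]
    return c


def naive_matmul(a, b):
    """Element-by-element reference, one output value at a time."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Both functions return the same matrix; the difference is only in how the work is expressed: whole tiles with one load phase and one store, versus individual elements.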


LIVE : Tuesday, Mar 17 12:00 AM - 12:40 AM CET : Inside the NVIDIA AI Platform and Ecosystem [S81911]
  • Ian Buck, VP Hyperscale and HPC Computing, NVIDIA
  • Mixture of experts => a lot of communication across the GPUs / racks
  • => first time a meeting will get to a solution
  • KV cache => working memory of the model to answer the question with the context
  • Vera Rubin => Available mid-2026
  • CPUs will be used by agents to test their generated code
  • Vera : Neural Branch prediction => 2 branch predictions per Cycle
  • 2x perf per watt compared to x86
  • Compile the full Linux kernel in 1s
  • Polyphe : all racks connected to all other racks


Tuesday, Mar 17 1:00 AM - 1:40 AM CET : The Case for Block-Based Programming With cuTile and Triton for Image Processing Workloads Used in Semi Manufacturing [S81894]
  • Pradeep Ramachandran, Director, KLA
  • Arjun Vadakkeveedu, KLA
  • Semiconductors have complex geometries
  • Width of the gate is from 50 to 100 nm
  • Manufacturing a chip takes 3 to 6 months, then it has to be tested
  • So defects have to be found as early as possible => find issues of 40 nm
  • Huge processing needs => move as many algorithms as possible onto the GPU
  • Software maintenance is a big challenge: the software has to last for 30 years
  • For images, block-based programming is natural
  • cuTile today does not support random number generation
  • Writing PTX can be fun, but debugging it is horrible
  • cuTile supports automatic TMA use for loads and stores
  • cuTile distinguishes between kernels and device functions
  • Better performance debugging in Triton
  • Operations with shifted loads tend to perform poorly
  • Only power-of-two tile sizes are supported
  • Needs hardware / software co-design to get performance


VIRTUAL : Tuesday, Mar 17 9:00 AM - 9:25 AM CET : Build a GPU-Accelerated Database Engine With CUDA [S82203]
  • Harishankar G, Member Leadership Staff, Zoho Corp.
  • Jalakandeshwaran A, Member Leadership Staff, Zoho Corp.
  • Postgres extensions
  • 22 queries, under 2 minutes on 1 GPU with 90 GB of Memory
  • On the fly compilation of binary query ???
  • GPU-accelerated decompression on a 4090 GPU (up to 22x)
  • Pinned buffers, and NUMA pinning
  • Primitives from Thrust
  • They will look at NVIDIA Blackwell


Tuesday, Mar 17 5:00 PM - 5:40 PM CET : Achieve Truly Serverless GPUs With libfuse, CRIU, and CUDA-Checkpoint [S81424]
  • Charles Frye, Member of Technical Staff, Modal


Tuesday, Mar 17 6:00 PM - 6:40 PM CET : Accelerate Open Science: Incorporating CUDA Into the SciPy Ecosystem [S82113]
  • Ianna Osborne, Research Software Engineer, Princeton University
  • Leo Fang, Python CUDA Tech Lead, NVIDIA
  • Travis Oliphant, CEO, OpenTeams
  • Gaël Varoquaux, Co-Founder, :probabl.
  • Katrina Riehl, Principal Technical Product Manager, NVIDIA


Tuesday, Mar 17 7:00 PM - 7:40 PM CET : Achieving 8x Lower Cost Analytics with GPU-Accelerated DuckDB [S81870]
  • Xiangyao Yu, Assistant Professor, University of Wisconsin-Madison
  • Bobbi Yogatama, Sr. Systems Software Engineer, NVIDIA


Wednesday, Mar 18 12:00 AM - 12:40 AM CET : Accelerating Quantum Chemistry on GPUs—Latest Advances [S81770]
  • Robert Parrish, Sr. Engineering Manager, NVIDIA
  • Anders Blom, Technical Product Manager, Synopsys
  • cuEST : CUDA Electronic Structure Theory
  • where electrons go in any molecule
  • could be 50x faster than the previous CPU state of the art
  • 21.7x in one afternoon (from 56 CPU threads to 1 B200 + 14 CPU threads)
  • 53.4x on B200 alone
  • cuEST forms the potential matrix to predict where electrons go
  • molecules of about 300 atoms on a B200
  • closed source, free of charge project
  • they only need 8-9 decimal digits of precision to compute DF-K
  • Using the Ozaki I and II DGEMM methods
  • nice precision with Ozaki II and 8 moduli
  • RTX 6000 + emulation gives basically A100 performance
  • They learned how to turn their problems into matrix multiplications


Wednesday, Mar 18 5:00 PM - 5:40 PM CET : All the Precision, None of the Transistors [S81811]
  • Harun Bayraktar, Sr. Director of Software Engineering, Libraries, NVIDIA
  • Blackwell GPU => 208B Transistors
  • FMA => Scalar part
  • MMA => Matrix Multiply part (tensor core)
  • Less and less FP64 in new hardware, but this does not mean that performance drops
  • 27x between FP64 and FP32 in performance
  • In terms of performance, everything is increasing
  • What if we do not use linear algebra?
  • Ozaki I and II methods
  • ADP : Automatic Dynamic Precision => guaranteed DGEMM accuracy while using reduced-precision tensor cores, an extension of the Ozaki scheme
  • Exponent span capacity algorithm
  • Shipped in cuBLAS in 2025 for FP32 and FP64
  • Well tested
  • Impact will broaden through other CUDA-X libraries
  • But it uses more memory, right? => no, we can mitigate that with tiling; in the test we can use 4 GB and get the same performance with the same memory consumption
  • DGEMM on GB200 up to 4x on big matrices with the Ozaki II method, even better with non-square matrices
  • cuEST to speed up Quantum Chemistry simulation
  • Without emulation Hopper can sometimes beat Blackwell, but with emulation the speed-up is massive on Blackwell, even with Hopper
  • A lot of this does not need any code change, just enable it with some environment variables
  • Now you can use all the compute units of your GPU
  • At some point this will come to A100, but it depends on the computation
  • An FP16 is not four FP4, so that makes no sense


Thursday, Mar 19 1:00 AM - 1:50 AM CET : Ozaki Scheme: Addressing the Accuracy Challenge in the Era of Low-Precision Accelerators [S82285]
  • Katsuhisa Ozaki, Professor, Shibaura Institute of Technology
  • High performance in matrix multiplication
  • Performance for higher precision increases more slowly than for lower precision
  • Topic limited to matrix multiplication
  • Ozaki-I scheme from 2012
  • We can multiply fp16 values and accumulate in fp32
  • or use int8 for multiplication and int32 for addition
  • in CUDA 13.0 update 2
  • export CUBLAS_EMULATE_DOUBLE_PRECISION=1
  • export CUBLAS_EMULATION_STRATEGY=performant (automatically set precision)
  • export CUBLAS_EMULATION_STRATEGY=eager (automatically use emulation)
  • precision controlled by : cublasSetFixedPointEmulationMaxMantissaBitCount
  • Automatic dynamic precision : ADP
  • GEMM : Matrix–matrix multiplication of two general real matrices
  • SYMM : Multiplication of a symmetric matrix and a general matrix
  • SYRK : Symmetric rank-𝑘 update using a matrix and its transpose
  • SYR2K : Symmetric rank-2𝑘 update using two matrices
  • TRMM : Multiplication of a triangular matrix and a general matrix
  • TRSM : Solution of a matrix equation with a triangular coefficient matrix
  • Ozaki-II scheme : for DGEMM, DSYRK, DTRMM
  • Chinese Remainder Theorem
  • For k = 7, Ozaki-I needs 28 matrix multiplications of A and B, but Ozaki-II with 14 moduli only needs 14-15 matrix multiplications
  • Reproducible results across different GPUs
  • Adaptive accuracy for Ozaki-I: if k slices are not enough, we can compute alpha more slices without recomputing the previous k slices
  • Ozaki-II is not adaptive because adding more moduli changes the product of the moduli
  • The matrix has to be well conditioned
  • Obviously more efficient in algorithms that depend heavily on matrix multiplication (LU decomposition, QR, eigenvalues, Cholesky decomposition, etc.)
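
A minimal sketch of the Chinese Remainder Theorem idea behind the second scheme, in plain Python. It assumes integer-valued matrices and uses four hypothetical moduli chosen only because they are pairwise coprime and give residues that fit in 8 bits; the real scheme first scales floating-point inputs and uses its own moduli. One small-integer matrix product is computed per modulus, then each entry is reconstructed from its residues.

```python
from math import prod

# Hypothetical moduli: pairwise coprime, residues always below 256.
MODULI = [251, 253, 255, 256]


def matmul_mod(a, b, m):
    """(a @ b) mod m, computed from residues below m only."""
    ar = [[x % m for x in row] for row in a]
    br = [[x % m for x in row] for row in b]
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * br[k][j] for k in range(inner)) % m
             for j in range(cols)] for row in ar]


def crt(residues, moduli):
    """Reconstruct the unique x in [-M/2, M/2) with x = r_i mod m_i."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        x += r * mi * pow(mi, -1, m)  # pow(..., -1, m) is the modular inverse
    x %= big_m
    return x - big_m if x >= big_m // 2 else x


def crt_matmul(a, b, moduli=MODULI):
    """Integer matrix product using one modular product per modulus."""
    per_modulus = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [[crt([c[i][j] for c in per_modulus], moduli)
             for j in range(cols)] for i in range(rows)]
```

The number of matrix products equals the number of moduli, which is consistent with 14 moduli needing about 14 products, against k(k+1)/2 = 28 slice products for k = 7 in the first scheme. The reconstruction depends on the product of all moduli, so adding a modulus later changes every reconstruction weight; this matches the remark that the second scheme is not adaptive.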


Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Pushing the Boundaries of CAE and EDA With NVIDIA cuDSS, the CUDA Direct Sparse Solver [S81824]
  • Azi Riahi, Principal Product Manager, NVIDIA


Wednesday, Mar 18 6:00 PM - 6:40 PM CET : Massive GPU Acceleration for Scalable Quantum Compilation and Graph Analytics [S81533]
  • Oded Green, Sr. Developer Technology Engineer, NVIDIA
  • Yulun Wang, Staff Scientist, Quantum Control, Q-CTRL


Wednesday, Mar 18 7:00 PM - 7:40 PM CET : Achieve Peak Tensor Core Performance for GEMM on Blackwell via CUTLASS Python [S81463]
  • Vicki Wang, Distinguished Engineer, NVIDIA
  • Manas Sahni, Senior Software Engineer, NVIDIA
  • CUTLASS : C++ and Python DSLs
  • CUTLASS 4.4 manages TMA
  • Since Blackwell : two CTAs can share one matrix multiplication
  • Performing TMA asynchronously enables tensor core compute to overlap with it
  • New CUTLASS DSL and CuTeDSL
  • CuTeDSL : available in @cute.experimental.jit and @cute.experimental.kernel
  • Get the same performance with less effort
  • Express what you want to do and not what the kernel is


Wednesday, Mar 18 11:00 PM - 12:30 AM CET : Performance Optimization of cuTile Kernels [S81705]
  • Rob Armstrong, CUDA Technical PM Lead, NVIDIA
  • Vishal Mehta, DevTech Engineer, NVIDIA
  • CUDA Tile hides the complexity of the classical SIMT model
  • How to guide the compiler with a system of hints
  • A kernel with a load at the beginning and a store at the end expresses a tile computation
  • a tile kernel is just a new flavour of kernel
  • We can profile cuTile kernels with Nsight Compute (ncu)
  • You want to use large tiles to use the tensor cores efficiently
  • ftz=true => flush denormals to zero
  • You can fuse kernels in cuTile
  • 3.74x with all optimisation tricks compared to the first naive kernel
  • 2-CTA mode is slower in this example
  • Tiles of 256 cost too much memory and bring no gain
  • Requires CUDA 13.1+
  • TileGym for examples, autotuning, benchmarks
  • The goal of cuTile is to get 80% of speed of light with minimum effort
  • Could be more depending on the computation
  • cuTile cannot integrate PTX, which is not portable. But it is possible with Triton


LIVE : Wednesday, Mar 18 11:00 PM - 11:40 PM CET : Architects of the Accelerated Age: How CUDA Builders Changed the World (and How You’re Next) [S82416]
  • Stephen Jones (SW), CUDA Architect, NVIDIA
  • Joe Stam, Head of Research Technology, Core Strategies, Jump Trading
  • Kate Clark, Distinguished DevTech Engineer, NVIDIA
  • Paulius Micikevicius, Software Engineer, Meta Superintelligence Labs
  • Wen-Mei Hwu, Senior Distinguished Research Scientist and Senior Research Director, NVIDIA
  • 2010 : 2000 GPUs but no chassis => wooden computer => number 3 on the Green500
  • Challenge with CUDA : building something new and making people use it
  • Video editing software in 4K when it was not really a thing
  • It takes people a while to get comfortable with a new technology
  • People wait for several generations of GPUs to evaluate stability of development
  • Now you need a full rack to do stuff, where are we going?
  • A lot of investigations can still be done on a single GPU
  • There will always be development platforms that are laptops and not gigantic rack-scale GPU platforms
  • You hire smart people because they are creative not because they can type fast
  • Tools have to make up the gap as hardware complexity continues to increase
  • If you cannot debug it, you need some tool to fill that gap for you
  • Push and pull between hardware and software?
  • TMA of Hopper was driven by software
  • Better L1 cache is for the software
  • This tension is healthy
  • Do we need to learn CUDA because of AI ?
  • When you write a prompt, you have to express the prompt to ensure the model will generate the kernel well. So you still have to learn CUDA.
  • You remove the tedious syntax, but you have to express what you want the hardware to do.
  • When you have billions and billions of dollars of investment, every percent of performance matters.
  • And you have to understand what the model is doing


Thursday, Mar 19 12:00 AM - 12:40 AM CET : Accelerated Building Blocks for Next-Generation AI+HPC Workloads [S81792]
  • Heidi Poxon, Director of Product, HPC Software, NVIDIA


Thursday, Mar 19 5:00 PM - 6:30 PM CET : Maximize Memory Bandwidth on Modern GPUs: Practical Techniques, Patterns, and Worked Examples [S81666]
  • Benedikt Dorschner, Sr. DevTech Engineer, NVIDIA
  • Matthew Martineau, Sr. Developer Technology Engineer, NVIDIA
  • Sam Mish, DevTech Engineer, NVIDIA


Thursday, Mar 19 5:00 PM - 5:40 PM CET : Python All the Way Down: Speed-of-Light CUDA Without Leaving Python [S81531]
  • Ashwin Srinath, Senior Software Engineer, NVIDIA
  • Ianna Osborne, Research Software Engineer, Princeton University


Thursday, Mar 19 6:00 PM - 6:40 PM CET : Best Practices in Building and Packaging CUDA Python Libraries [S82370]
  • Vyas Ramasubramani, Sr. Systems Software Engineer, NVIDIA
  • Jonathan Dekhtiar, Sr. CUDA Python Engineer, NVIDIA
  • A must see if you are a developer !
  • if you can avoid binary packages, do it.
  • don't use pip
  • Pip is stupid, don't use it, use conda, oh, conda is slow and not reliable, use pixi
  • MMA : Matrix Multiplication and Accumulation
  • use cmake
  • For Python, use scikit-build-core and not meson or setuptools
  • Try to support >= 2 CUDA major versions, and try to be compatible with all minor versions of one major CUDA version at once, in one package
  • Use conda-forge
  • cuda-pathfinder will find CUDA, cuBLAS or any package you need to get your dependencies
  • dependencies are broken because pip is horrible and people are too lazy to make working packages for pip (and pip does not help them)
  • NVIDIA people try to increase the overlap of dependencies


Thursday, Mar 19 7:00 PM - 7:40 PM CET : Accelerate GPU Scientific Computing With nvmath-python [S81581]
  • Sergey Maydanov, Sr. Software Engineering Manager, NVIDIA
  • Aart Bik, Distinguished Engineer, NVIDIA
  • If you write your own math library you will have to maintain it forever.
  • Very different performance between cupy A @ B, cupy.matmul(A, B) and nvmath-python matmul, even though they call the same implementation under the hood
  • 3x with JIT and kernel fusion
  • nvmath-python brings a random number generation API to Python (on top of cuRAND)
  • from nvmath.linalg import matmul for generic matrix multiplication
  • from nvmath.linalg.advanced import matmul for advanced matrix multiplication
  • For parallelism you can use mpi4py with MPIProcessingGroup
  • It also supports FP32 emulation
  • version 0.9 coming just after GTC


LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Orchestrate Next-Generation AI Workloads With Open-Source Slurm [S82420]
  • Danny Auble, Head Director for Slurm and Slinky, NVIDIA
  • Tim Wickberg, Director, System Software, Slurm and Slinky, NVIDIA
  • Slurm is open source and vendor agnostic
  • Pull requests are now allowed
  • Topology block plugin for NVL72
  • --segment to place jobs better and group them
  • new configuration topology.yaml
  • New in Slurm 25.11
  • NVIDIA topograph to generate the topology
  • 120 seconds to restart a failing job => --requeue=expedite to fix it for jobs which use 1000s of nodes
  • EXPEDITING state to restart as soon as possible


LIVE : Thursday, Mar 19 11:00 PM - 11:40 PM CET : Top-K Selection at the Speed of Light [S81614]
  • Christina Zhang, DevTech Compute Engineer, NVIDIA
  • Elias Stehle, Senior Systems Software Engineer, NVIDIA
  • Yue Weng, DevTech, NVIDIA


Thursday, Mar 19 11:00 PM - 11:40 PM CET : Break Communication Barriers: Scale AI and HPC with NCCL, NIXL, and NVSHMEM [S82407]
  • Matthew Nicely, Group Manager, AI Platform SW, NVIDIA
  • NVSHMEM, NCCL, NIXL
  • No focus on fault tolerance because it is part of business as usual
  • Communications are integrated throughout the stack
  • GROMACS : simulate molecules
  • NVSHMEM : don't do the handshake on the CPU, stay on the GPU, up to 4x on bandwidth
  • NCCL is an ecosystem
  • NCCL inspector => simplify debugging with NCCL
  • Since last year : NCCL has symmetric memory (like NVSHMEM)
  • Zero SM communication => use a copy engine instead
  • NIXL : a new communication ecosystem (KV page layout, Tier Selection Policy, Replication strategy)
  • Data movement : memory, NVMe, S3 storage, etc.
  • GPU threads can initiate communications with another GPU
  • from 480us to 50us by using NIXL with DeepSeek
  • Spectrum-X : with the NCCL Spectrum-X plugin you get real-time congestion feedback
  • Clean noise isolation
  • The future :
  • NCCL : elastic buffer, Port Recovery, DSL Support, TMA support
  • NIXL : Container support, DSL support, Load balance and failover, KV compression service
  • NVSHMEM : DSL support, User buffer registration enhancement, IB non-blocking RMA completion
  • They should work with CUDA Graphs through the DSL they will provide later this year