12.5.3 : Computing

Monday, Mar 17 9:00 PM - 10:30 PM CET : CUDA Techniques to Maximize Memory Bandwidth and Hide Latency [S72683]
  • Athena Elafrou : Developer Technology Engineer, NVIDIA
  • Allard Hendriksen : Sr. Developer Technology, NVIDIA
  • Slides
  • A must-watch if you are a CUDA developer or if you want to improve the performance of your computations
  • Hopper => distributed shared memory => thread block cluster
  • Unrolling with an extra for loop is better when the compiler cannot prove that the arrays do not overlap each other
  • But it increases register usage.
  • On Blackwell, you could use up to 40 percent of the registers just to increase bandwidth, which does not leave you many registers for computing
  • __pipeline_wait_prior(n) => wait for all the transfers except the last n (see the first sketch after this list)
  • You want to issue the commit from non-divergent code, otherwise it will produce a commit for each conditional path
  • Producer/consumer pattern => a subset of the threads fetch the data, then all threads do the computing
  • But synchronization is needed
  • 5 percent bandwidth increase on a Hadamard product
  • 20 percent on more complex computations
  • TMA 1D and ND
  • Source and destination pointers must be aligned on 16 bytes.
  • Copy size must be a multiple of 16 bytes
  • thrust::transform uses TMA under the hood so you can get it for free
  • But it is better to wrap the given lambda in cuda::proclaim_copyable_arguments(lambda) to tell the compiler that the data can be copied to shared memory
  • The compiler cannot prove the TMA is launched by only one thread, so it will produce a peel
  • To avoid that, you have to use cooperative_groups::invoke_one(cooperative_groups::coalesced_threads(), [&](){...})
  • TMA is efficient for transfers larger than 2 KiB
  • LDGSTS is better for transfers smaller than 1 KiB
  • Between 1 and 2 KiB you can use either
  • Since Volta : a sequentially consistent memory model gives guarantees but can be a bit slow
  • Acquire : any load which comes after the std::memory_order_acquire stays after
  • Release : any store which comes before the std::memory_order_release stays before
  • Relaxed : no ordering guarantees with std::memory_order_relaxed
  • cuda::thread_scope_thread => local thread
  • cuda::thread_scope_block => only thread block
  • cuda::thread_scope_device => only the current GPU
  • cuda::thread_scope_system => other threads in the system (CPU, GPU, other nodes) can observe this thread's loads and stores (see the second sketch after this list)
  • In a SM, data have to be consistent only in L1 cache
  • Threads in a cluster have to exchange data through the L2 cache
  • see slide 88
  • New instruction in Hopper => store async to store a value from one SM into the shared memory of another SM
  • // synchronize threads in the cluster
  • ptx::barrier_cluster_arrive(ptx::sem_release)
  • // do independent work
  • ptx::barrier_cluster_wait(ptx::sem_acquire)
  • >5x with the store async compared to the classic cluster::async on H100
  • These optimizations are part of CUTLASS, which provides an optimized kernel for each architecture
  • If you use PyTorch right now you might want to use these techniques, but at some point they will be part of the PyTorch backend
  • Look at the CUDA C++ documentation
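  • A minimal sketch (mine, not from the slides) of the __pipeline_* flow above: each thread async-copies one 16-byte float4 into shared memory, the commit is issued from non-divergent code, then __pipeline_wait_prior(0) waits for all transfers before computing. TILE, the scaling kernel and the launch assumption (blockDim.x == TILE) are mine:

      #include <cuda_pipeline.h>

      constexpr int TILE = 256;                       // one float4 per thread, blockDim.x == TILE

      __global__ void scale_tiles(const float4* __restrict__ in,
                                  float4* __restrict__ out, float alpha, int n4)
      {
          __shared__ float4 tile[TILE];
          for (int base = blockIdx.x * TILE; base < n4; base += gridDim.x * TILE) {
              int idx = base + threadIdx.x;
              if (idx < n4)                           // 16-byte aligned source/destination, 16-byte copy
                  __pipeline_memcpy_async(&tile[threadIdx.x], &in[idx], sizeof(float4));
              __pipeline_commit();                    // one commit, executed by every thread (non-divergent)
              __pipeline_wait_prior(0);               // wait for all outstanding transfers
              __syncthreads();
              if (idx < n4) {                         // the shared-memory staging only illustrates the API flow
                  float4 v = tile[threadIdx.x];
                  out[idx] = make_float4(alpha * v.x, alpha * v.y, alpha * v.z, alpha * v.w);
              }
              __syncthreads();
          }
      }
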
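  • A second minimal sketch (mine) of the scoped atomics and acquire/release rules above, using libcu++: block 0 publishes a value, block 1 spins on a device-scope flag and is then guaranteed to see the payload. The kernel, the payload value and the 2-block launch are assumptions (both blocks must be resident for the spin to make progress):

      #include <cuda/atomic>
      #include <cstdio>

      __global__ void message_passing(int* data, int* flag)
      {
          cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
          if (blockIdx.x == 0 && threadIdx.x == 0) {
              *data = 42;                                     // payload written first
              f.store(1, cuda::std::memory_order_release);    // stores before the release stay before
          } else if (blockIdx.x == 1 && threadIdx.x == 0) {
              while (f.load(cuda::std::memory_order_acquire) == 0) { }  // loads after the acquire stay after
              printf("observed %d\n", *data);                 // guaranteed to read 42
          }
      }
      // e.g. message_passing<<<2, 32>>>(d_data, d_flag); with *d_flag initialized to 0
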


Monday, Mar 17 9:00 PM - 9:40 PM CET : Optimizing Multi-Language Scientific Simulations: A Grace Superchip Case Study [S72537]
  • Holly Wilper : Manager, System Software Tools, NVIDIA
  • Michael Wall : Senior Scientist, Los Alamos National Lab
  • Slides
  • VENADO : Grace SuperChip at Los Alamos
  • SEDACS : multi-language scientific simulations (many thousands of atoms)
  • From quantum simulation to classical molecule simulations
  • Graph based methods
  • NSight : CPU, GPU, network and storage profiling (including Linux virtual file system)
  • https://nvidia.github.io/NVTX/doxygen/index.html
  • https://nvidia.github.io/NVTX/doxygen-cpp/index.html
  • https://github.com/NVIDIA/NVTX/tree/release-v3/
  • C++ and Python NVTX available in cuda libs
  • Fortran and Rust wrappers for NVTX available from the community
  • Reducing communication improves performance
  • Possibility to track advanced SIMD instruction with NSight
  • Transferring data allocated by Python on the host and used on the device by C++ CUDA caused a lot of lag (they did not use the unified memory of Grace Hopper because the code is quite old)


Monday, Mar 17 11:00 PM - 12:30 AM CET : CUDA Techniques to Maximize Compute and Instruction Throughput [S72685]
  • Ben Pinzone : Compute Developer Technology Engineer, NVIDIA
  • David Clark : Compute Performance Developer Technology Engineer, NVIDIA
  • Slides
  • Always check what the profiler is telling you
  • In the Details page of Nsight Compute you can see warp divergence (work divergence, because threads in the same warp do different work)
  • import-source-file=true to resolve the source when importing into Nsight
  • Make a queue of the work to do and, in a second step, process the data (if queuing is not too expensive in that case)
  • NCU : get the info in the Scheduler Statistics
  • Find the origin of stalls
  • New dependency columns coming soon
  • Occupancy on Hopper is up to 64 warps per SM
  • __forceinline__ if you really want to force the compiler to inline (can be very expensive on big functions, and inlining is still automatic on very small functions)
  • Use CCCL as much as possible (with cub, and thrust)
  • Use only the precision that your application needs
  • Signed iterators allow more optimizations than unsigned iterators with the NVIDIA compiler (but do not do that with other compilers; use long unsigned int there) (see the sketch after this list)
  • Overflow of unsigned is defined behaviour
  • Overflow of signed is undefined behaviour, and the NVIDIA compilers take advantage of that
  • Power of dot and cross products to speed up computation and avoid trigonometric functions
  • Tensor Core Gen 5 with Blackwell and new instructions
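  • A small sketch (mine) of the signed-index advice above: with an unsigned counter the compiler must honor wrap-around, which can block address-computation optimizations; with a signed counter overflow is undefined behaviour, so the compiler may assume it never happens. The axpy kernels are illustrative:

      __global__ void axpy_unsigned(float a, const float* x, float* y, unsigned n)
      {
          for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)        // unsigned wrap-around is defined, limits optimization
              y[i] += a * x[i];
      }

      __global__ void axpy_signed(float a, const float* x, float* y, int n)
      {
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)        // signed overflow is UB, the compiler can optimize freely
              y[i] += a * x[i];
      }
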


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : RAPIDS in 2025: Accelerated Data Science Everywhere [S73290]
  • Nick Becker : Sr. Technical Product Manager, NVIDIA
  • Slides
  • Industry needs high performance data science
  • RAPIDS : foundation of open sourced libraries
  • cuGraph helps PyTorch geometric
  • RAPIDS is a part of CUDA-X ecosystem
  • Zero code change for Apache Spark or Pandas etc.
  • Pandas acceleration up to 50x when GPU accelerated
  • GPU-accelerated Polars : 100,000,000 records in seconds on GPU => beta version for now
  • Falls back on the CPU when an operation is not supported
  • Memory can be a challenge => scale beyond the GPU memory limit (on Pascal+ GPUs)
  • available by default to scale up Pandas and Polars
  • I/O in the cloud and remote storage (Amazon S3) : you can optimise it yourself but they choose a safe default
  • Apache Spark with RAPIDS => 6x on quite standard request
  • Scikit-learn complemented with UMAP (dimensionality reduction) and HDBSCAN (clustering)
  • cuML accelerates UMAP and HDBSCAN without any code change => from 5x to 200x acceleration
  • Any third-party library using UMAP or HDBSCAN will be accelerated as well
  • Available in open beta
  • spark-rapids-submit <options> app.py (just change from Python to this)
  • XGBoost 3.0 : pushing the memory limit and the scaling, can scale on 1 TB data on Grace Hopper
  • FIL : Forest Inference Library for Scikit-Learn, XGBoost, LightGBM and cuML
  • NetworkX acceleration with cuGraph
  • PyG up to 3x on training
  • Even better on Grace Blackwell
  • RAG : Retrieval-Augmented Generation
  • NeMo curator to filter your data before training
  • cuVS : accelerated vector search to build indexes
  • and a lot of documentation
  • Aether : optimize Apache Spark workflows => tools to optimize, redeploy and check performance
  • JupyterLab NVDashboard
  • The Google Colab Gemini coding assistant is now aware of whether you are using a GPU
  • Not only about AI


***** Wednesday, Mar 19 5:00 PM - 5:40 PM CET : How You Should Write a CUDA C++ Kernel [S72575]
  • Georgii Evtushenko : Sr. Software Engineer, NVIDIA
  • Slides
  • Kernels are asynchronous
  • NVBench -> interface similar to Google Benchmark
  • Speedup is not always a good metric. It is better to compare to the speed of light => the theoretical peak performance of the GPU
  • cuda::ceil_div divides two integers and rounds the result up to the next integer (great to compute block and grid sizes)
  • use cuda::std::span or cuda::std::mdspan in your kernel instead of std::span or std::mdspan
  • Little's law : concurrency = latency * throughput
  • The fma operation has to wait until its data is available, so we see a high elapsed time in fma even if the computation is memory bound
  • cuobjdump --dump-ptx vec to dump ptx instructions
  • __launch_bounds__(load_size) to tell the compiler the expected block size of the kernel
  • There is only one register file per SM, and if you unroll your kernels too much, at some point one kernel will use most of the registers and the others will fight for resources
  • To get the best grid size (see the combined sketch after this list) :
  • int num_blocks;
  • cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks, kernel, block_size, 0);
  • To get the best block size :
  • int _;
  • int num_threads;
  • cudaOccupancyMaxPotentialBlockSize(&_, &num_threads, kernel, 0);
  • use cuda::ptx instead of asm
  • compute-sanitizer for CUDA kernels correctness checking
  • Use vocabulary types (span, mdspan, etc.)
  • Use libcu++ for latest hardware features
  • Replace intrinsics with atomics
  • cuda-gdb for CUDA kernels debugging
  • Use NVIDIA Nsight Systems to identify bottlenecks in your application
  • NVIDIA Nsight Compute for in-depth kernel profiling
  • The big problem with LLMs is that they always want to implement what you ask instead of using the relevant library for it
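  • Putting the cuda::ceil_div and occupancy bullets above together, a minimal launch helper (my sketch: the kernel body, the doubling operation and the default stream are assumptions):

      #include <cuda/cmath>           // cuda::ceil_div
      #include <cuda_runtime.h>

      __global__ void kernel(float* data, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) data[i] *= 2.0f;
      }

      void launch(float* data, int n)
      {
          int min_grid_size = 0, block_size = 0;
          // block size that maximizes occupancy for this kernel
          cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, kernel, 0, 0);

          // enough blocks to cover n elements, rounded up
          int grid_size = cuda::ceil_div(n, block_size);
          kernel<<<grid_size, block_size>>>(data, n);
      }
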


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : A Deep Dive Into the Latest HPC Software [S71350]
  • Becca Zandstein : Director, Product Management, NVIDIA
  • Slides
  • HPC SDK => scale to the full system you can use (supports x86, Arm or PowerPC CPUs)
  • Provides a way to program on laptop, workstation and cluster with a smooth transition
  • A single binary running on multiple GPUs in parallel is OK now
  • NEMO : ocean simulation => 64x on Grace Hopper with standard parallel Fortran do concurrent (see the sketch after this list)
  • cuFFTmp uses NVSHMEM to perform the all-to-all communication (all-to-all being hard to scale)
  • VASP multi GPU integration with HPC SDK
  • NSight Copilot integrated inside vscode
  • nvmath-python : currently beta
  • Built on NVIDIA core libraries with JIT
  • Seamless interoperability with PyTorch
  • Warp : open-source Python differentiable kernels
  • The math unit can kill your performance, but tile programming can improve efficiency and productivity. I don't get why these are related
  • cuPyNumeric : write code simply with python and numpy. import cupynumeric as np
  • CUPTI Python : collect your performance data directly from Python
  • python profiling available (slide 37)
  • How to approach petascale / exascale ?
  • Iterative linear solver ? Not included in the HPC SDK stack. And CUDA Graph ???
  • Does the HPC SDK work with lower precision ? Yes, this will come.
  • cuPy <=> cuPyNumeric ?
  • Using Copilot to translate Python to Fortran and vice versa ? Not yet available.
  • NVPL : for all of our CPUs (so Grace for now)
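  • The NEMO result above comes from Fortran do concurrent; here is my C++ analogue of the same ISO-standard parallelism idea supported by the HPC SDK: compile with nvc++ -stdpar=gpu and the standard algorithm below is offloaded to the GPU with no CUDA code (the saxpy and sizes are illustrative):

      #include <algorithm>
      #include <execution>
      #include <vector>

      int main()
      {
          const std::size_t n = 1 << 20;
          std::vector<float> x(n, 1.0f), y(n, 2.0f);
          const float a = 3.0f;

          // y = a * x + y expressed as a standard parallel algorithm
          std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                         y.begin(), [a](float xi, float yi) { return a * xi + yi; });
          return 0;
      }
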


Wednesday, Mar 19 10:00 PM - 10:40 PM CET : How Math Libraries Can Help Accelerate Your Applications on Blackwell GPUs [S72434]
  • Babak Hejazi : Senior Engineering Manager, NVIDIA
  • Azi Riahi : NVIDIA
  • Slides
  • Math Libraries => ease of use
  • more than 400 math libraries on CUDA-X (cuSPARSE, cuBLAS, AmgX, cuSOLVER, cuDSS, cuTENSOR, cuFFT, cuRAND, etc)
  • GB200 NVL72 => 72 GPU connected with NVLink and 36 Grace CPU
  • DGX B200 => 8 Blackwell connected with NVLink and Intel x86 CPU
  • DGX/HGX H100/H200 : 8 H100s connected with NVLink and Intel x86 CPU, networked with InfiniBand
  • Blackwell emulation of FP64/FP32
  • FP64/FP32 emulation in cuBLAS matmuls keeps the precision but uses FP16 and other formats at higher throughput
  • FP32 with BF16 tensor cores
  • FP64 with INT8 tensor cores
  • Tests on real cases : ecTrans (weather forecasting), BerkeleyGW (material simulation)
  • About 2x on training at lower precision for AI
  • cuSPARSE : from 1.4x-1.5x up to 2.4x speedup on Blackwell compared to H100
  • cuFFT, cuFFTmp : multi-GPU, multi-node library with as little overhead as possible
  • strong scaling on GB200 NVL72 up to 64 GPUs (roughly 41x on 64 GPUs which is good)
  • Loss of performance from 8 to 64 GPUs due to InfiniBand; no problem on GB200 NVL72
  • cuDSS : Cuda Direct Sparse Solver (uses both CPU and GPU)
  • nvCOMP uses the Blackwell hardware decompression engine : up to 600 GB/s decompression (9x on Deflate and 4.5x on Snappy)
  • nvJPEG uses hardware decompression since Hopper and Blackwell


Tuesday, Mar 18 12:00 AM - 12:40 AM CET : NCCL: The Inter-GPU Communication Library Powering Multi-GPU AI [S72583]
  • Sylvain Jeaugey : Distinguished Engineer, NVIDIA
  • Slides
  • NCCL : NVIDIA Collective Communications Library (see the all-reduce sketch at the end of this list)
  • Data Parallelism / Tensor Parallelism / Pipeline Parallelism / Expert Parallelism
  • Different optimization techniques depending on the hardware
  • Big PCIe bottleneck for all-to-all (cf. the presentation of DK last year)
  • NCCL now avoids extra copies, reduces the number of SMs used and improves overlap
  • From 5 to 20 percent improvement in end-to-end training
  • PAT algorithm (Parallel Aggregated Tree)
  • RAS subsystem to diagnose applications
  • And profiler plugin
  • In the future : all to all with better latencies
  • Fused communication and computation kernels (distributed matmul)
  • New stack in NCCL coming in the next few months
  • You can use any memory you want as long as it is CUDA valid
  • Symmetric memory to improve performance (version 2.27)
  • CUDA level API (2.28) custom operation, Computation and communication fusion, new algorithms
  • NVLink SHARP operations and GPU Direct Async (version 2.28) : NVLink and IBGDA
  • Symmetric memory removes a lot of checks that kill performance => 4x latency improvement => officially inspired by NVSHMEM
  • Everything is already preconnected
  • Integration with high level device (MLIR, TileIR) => build algorithm in Python and compile them on the fly inside CUDA
  • Source code available on github and even for preview to give a try
  • More improvement when you scale to more than 8 GPUs for NVlink to be efficient
  • The PyTorch people are trying to integrate NCCL into their profiler
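  • A minimal single-process sketch (mine, not from the talk) of the collective behind data-parallel training: an in-place all-reduce across the local GPUs; the buffer size and the absence of error checking are assumptions for brevity:

      #include <nccl.h>
      #include <cuda_runtime.h>
      #include <vector>

      int main()
      {
          int ndev = 0;
          cudaGetDeviceCount(&ndev);

          std::vector<int> devs(ndev);
          for (int i = 0; i < ndev; ++i) devs[i] = i;
          std::vector<ncclComm_t> comms(ndev);
          ncclCommInitAll(comms.data(), ndev, devs.data());     // one communicator per local GPU

          const size_t count = 1 << 20;
          std::vector<float*> buf(ndev);
          std::vector<cudaStream_t> streams(ndev);
          for (int i = 0; i < ndev; ++i) {
              cudaSetDevice(i);
              cudaMalloc(&buf[i], count * sizeof(float));
              cudaStreamCreate(&streams[i]);
          }

          ncclGroupStart();                                     // group the per-GPU calls into one collective
          for (int i = 0; i < ndev; ++i)
              ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
          ncclGroupEnd();

          for (int i = 0; i < ndev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
          for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
          return 0;
      }
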


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : CUDA: New Features and Beyond [S72383]
  • Stephen Jones (SW) : CUDA Architect, NVIDIA
  • Slides
  • CUDA => 250 SDKs
  • Every layer has a compiler builtin
  • Optimization is a hierarchical problem
  • Optimization of compile times, for JIT but not only
  • Keeping track of dependencies in python is tricky
  • cuda.core : high-level CUDA for Python, based on cuda-python
  • Python NVTX
  • Talking about future
  • cuPyNumeric : 1-to-1 GPU port of NumPy
  • import cupynumeric as np (can scale on a full computing center)
  • nvmath.python : optimised math for GPU + JIT compilation
  • cuda.parallel and cuda.cooperative : accelerated libraries
  • don't rewrite a FFT or a sort
  • Let the compiler do the thread mapping, you just have to do the block mapping
  • cuTile : python tiling for cuda, cuTile is simpler cuda, first release a little later this year
  • Warp + Warp Tile
  • you have to pick the right tool for your job
  • CUTLASS has a Python backend now and gets almost the performance of C++
  • Tensor core programming in python with CUTLASS 4.0
  • CUDA DTX : CUDA Distributed Execution => beyond
  • Nsight will be distributed in a container so you do not have to have a special container to run Nsight


Tuesday, Mar 18 9:00 PM - 9:40 PM CET : Application Optimization for NVIDIA Grace CPU [S72978]
  • Lukas Krenz : Sr. Developer Technology Engineer, NVIDIA
  • Mathias Wagner : Sr. Developer Technology Engineer, NVIDIA
  • Slides
  • Grace is a family of CPUs
  • It should work because the ARM ecosystem is very mature
  • -O3 -mcpu=native -ffp-contract=fast
  • -flto : Link time optimization
  • Be careful with Fast math (with -Ofast)
  • -fsigned-char or -funsigned-char if your application needs signed or unsigned char
  • -fno-stack-array for fortran
  • Prefer BLAS/LAPACK or FFTW interfaces
  • Performance Analysis
  • PMU : Performance Monitoring Unit
  • NSight support ARM and Grace already
  • Perf, Likwid or eBPF
  • nvtxRangePush("Some function") ... nvtxRangePop() with #include <nvtx3/nvToolsExt.h> (see the sketch after this list)
  • nsys profile --cpu-core-metrics=help
  • https://github.com/brendangregg/FlameGraph
  • https://godbolt.org
  • nvidia compilers : -Minfo=vect to get info on vectorization
  • The Arm memory model is weakly ordered compared to x86
  • Loads/stores can be reordered by the CPU at runtime
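  • A small sketch (mine) of the NVTX range quoted above, so a region of your code shows up as a named range on the Nsight Systems timeline; solve_step and its body are placeholders (NVTX v3 is header-only, nvtxRangePushA is the ANSI variant of nvtxRangePush):

      #include <nvtx3/nvToolsExt.h>

      void solve_step()
      {
          nvtxRangePushA("solve_step");   // open a named range
          // ... the work you want to see on the timeline ...
          nvtxRangePop();                 // close the innermost open range
      }
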


Tuesday, Mar 18 10:00 PM - 10:40 PM CET : Accelerated Tabular Processing with Polars & RAPIDS [S74362] (RUST)
  • Matt Harrison : Python and Data Science Expert, Author, and Educator, MetaSnake
  • William Hill : Developer Advocate, Data Science, NVIDIA
  • no live
  • Slides
  • Polars : DataFrame library, core written in Rust, and has lazy evaluation
  • Python is a slow language
  • You can install Polars with pip or uv (which is better than pip of course)
  • Polars does not support 4-bit integers for now
  • Function building is close to Rust iterators
  • Change pl.read_parquet to pl.scan_parquet to enable laziness
  • agg because aggregate is too long to write
  • hvplot.polars to plot
  • .collect(engine="gpu") to run on the GPU (with the NVIDIA package)
  • See Polars Definitive Guide
  • 2x faster than CPU Polars (not a lot, but Polars is already supposed to be fast)
  • https://github.com/TNieuwdorp/polars-benchmarks/
  • Polars has some streaming support if the data does not fit in RAM but does fit on disk


Tuesday, Mar 18 11:00 PM - 11:40 PM CET : Energy-Efficient Supercomputing Through Tensor Core-Accelerated Mixed-Precision Computing and Floating-Point Emulation [S71487]
  • Samuel Rodriguez Bernabeu : Senior Software Engineer CUDA Math Libraries, NVIDIA
  • Slides
  • HPL benchmark
  • 56-bit mantissa up to 2x faster than classic FP64
  • It is intuitive that manipulating fewer bits should be faster
  • Mixed precision has been around for a long time; it computes an approximate result, and a fair amount of expertise is needed to evaluate where it is relevant for the given computation
  • Represent 1 FP32 number as 3 BF16 numbers (see the sketch after this list)
  • BF16*9 is more accurate than FP32 on weather forecasting simulation
  • More complex for denormal numbers, but it works
  • 2.1x on the materials code BerkeleyGW
  • Sparse matrices converted to dense => 64-bit (S=8) really close to FP64 on a linear system solve
  • It is easier to explore the needed precision
  • Automatic precision tuning framework to analyse the precision of DGEMM
  • They fall back to FP64 if too many slices are needed to emulate FP64
  • Introduced first in cuBLAS; coming to cuTENSOR and cuSOLVER very soon
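  • A sketch (mine) of the "1 FP32 as 3 BF16" decomposition idea above: each slice captures roughly the next 8 mantissa bits of the remaining residual, so b0 + b1 + b2 reconstructs x almost exactly and products of slices can run on BF16 tensor cores with FP32 accumulation; the function name is an assumption:

      #include <cuda_bf16.h>

      __device__ void split_fp32_to_bf16x3(float x, __nv_bfloat16& b0,
                                           __nv_bfloat16& b1, __nv_bfloat16& b2)
      {
          b0 = __float2bfloat16(x);                 // top ~8 mantissa bits
          float r1 = x - __bfloat162float(b0);      // residual
          b1 = __float2bfloat16(r1);                // next ~8 bits
          float r2 = r1 - __bfloat162float(b1);
          b2 = __float2bfloat16(r2);                // remaining bits
      }
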


Wednesday, Mar 19 12:00 AM - 12:40 AM CET : Insights From NVIDIA Research [S73202]
  • Bill Dally : Chief Scientist and SVP of Research, NVIDIA
  • Slides
  • RT Core : ray tracing, traversal of the hierarchy
  • NVLink and NVSwitch came from NVIDIA Research
  • Spring 22 : AI for ship design
  • Fall 22 : instant NeRF
  • High signaling to get through the grey balls, ground reference signaling (use ground as a reference)
  • 50 GBit/s per pin
  • 3D dies stacking interconnect
  • Cache Aware GEMM Kernel (CAGE)
  • SMs read data in reverse order to avoid cache thrashing as much as possible
  • Prefill : what things have to use other things in memory
  • Decode : look at which tokens have to be attended to
  • 3x speedup on the decode stage
  • Transformers, great but costly
  • Mamba + Transformers => Hymba
  • Reasoning about the string is Reasoning about the molecule


Wednesday, Mar 19 1:00 PM - 1:40 PM CET : Accelerate High-Performance Signal Processing Using GPU/CUDA [S71459]
  • Anders Ahlander : Sr. Systems Designer, Saab
  • Jacob Lundberg : Systems Engineer, Saab
  • Slides
  • Swarms for closely spaced drones
  • Switching from CPU to GPU
  • Radar : Radio Detection and Ranging
  • Pulse doppler radar
  • Focus on signal processing
  • AESA radar is a kind of reverse SKA detector
  • But a huge leap in computation
  • Small matrices but at a high rate
  • cuSOLVER, cuBLAS and cuFFT
  • The final steps are still on the CPU and use many batches (see the batched FFT sketch after this list)
  • GPUs take longer to be overwhelmed by the computation than CPUs, so hardware changes are less frequent
  • A100 gives the best performance for FP64 computation
  • NVIDIA NV-radar Holoscan (edge computing, cloud and supercomputer)
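  • A sketch (mine) of the "small FFTs at a high rate" pattern above: batch many short transforms into a single cuFFT plan so one call processes a whole burst of pulses; fft_size and batch are illustrative and error handling is omitted:

      #include <cufft.h>
      #include <cuda_runtime.h>

      int main()
      {
          const int fft_size = 512;       // samples per pulse (assumed)
          const int batch    = 4096;      // pulses processed per call (assumed)

          cufftComplex* data = nullptr;
          cudaMalloc(&data, sizeof(cufftComplex) * fft_size * batch);

          cufftHandle plan;
          cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);   // one plan repeated 'batch' times

          cufftExecC2C(plan, data, data, CUFFT_FORWARD);    // all pulses in one launch
          cudaDeviceSynchronize();

          cufftDestroy(plan);
          cudaFree(data);
          return 0;
      }
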


Wednesday, Mar 19 4:00 PM - 4:40 PM CET : The CUDA C++ Developer’s Toolbox [S72572]
  • Bryce Adelstein Lelbach : Principal Architect, NVIDIA
  • Slides
  • The default is host compilation
  • thrust::universal_vector
  • thrust::constant_iterator
  • thrust::make_transform_iterator
  • thrust::make_zip_iterator, even with +1 to get adjacent elements
  • thrust::cuda::par_nosync
  • OK for thrust::for_each or transform because they do not allocate or return a value (see the sketch after this list)
  • Use CUB if you want an asynchronous reduce
  • If a universal_vector is bigger than GPU memory, this is not a problem if the computation can be chunked
  • If the CPU and GPU share the same memory it can be better to use unified memory rather than doing the copy yourself, but there are some caveats on Jetson (where there is only one physical memory for both CPU and GPU)
  • github.com/NVIDIA/accelerated-computing-hub
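  • A sketch (mine) combining some of the Thrust pieces above: a universal_vector for migratable memory, a non-allocating transform run with thrust::cuda::par_nosync, and placeholder expressions instead of a hand-written functor (the saxpy and sizes are illustrative; compile with nvcc):

      #include <thrust/universal_vector.h>
      #include <thrust/transform.h>
      #include <thrust/functional.h>                 // placeholders _1, _2
      #include <thrust/system/cuda/execution_policy.h>
      #include <cuda_runtime.h>
      #include <cstdio>

      int main()
      {
          thrust::universal_vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
          const float a = 3.0f;

          using namespace thrust::placeholders;
          // y = a*x + y: no allocation and no return value, so par_nosync is fine
          thrust::transform(thrust::cuda::par_nosync, x.begin(), x.end(), y.begin(),
                            y.begin(), a * _1 + _2);

          cudaDeviceSynchronize();                   // synchronize before touching the result
          std::printf("y[0] = %f\n", static_cast<float>(y[0]));
          return 0;
      }
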


Wednesday, Mar 19 5:00 PM - 5:40 PM CET : The Performance of CUDA with the Flexibility of PyTorch [S71946]
  • Mark Saroufim : Software Engineer, Meta Platforms
  • Slides
  • CUDA is not flexible and PyTorch is slow
  • torch.compile uses JIT profiling to determine which computations to fuse and which not to
  • PyTorch : fundamentally a string-to-string conversion from Python to CUDA
  • The CUTLASS backend is generated on the fly by PyTorch but the performance is very dependent on the PyTorch version
  • The problem with string-to-string generation is finding and fixing typos
  • load_inline => simpler packaging but needs JIT compilation
  • 20 minutes to compile LLMs with PyTorch => you should export your model once for your architecture.


Wednesday, Mar 19 6:00 PM - 6:40 PM CET : Get the Most Performance From Grace Hopper [S72687]
  • Akshay Subramaniam : Sr. AI Developer Technology Engineer, NVIDIA
  • Slides
  • Grace Hopper Superchip
  • Memory Consistency
  • Same for Grace Blackwell but with 2 GPU per CPU
  • They share the same programming model
  • cudaMallocManaged => allocate with a first-touch policy
  • ATS : address translation service
  • Access-counter-based optimization migrates data depending on whether the other side of the chip uses it or not
  • But sometimes you want more control : cudaMemPrefetchAsync
  • cudaMemAdvise to give a preferred location for the allocation (see the sketch after this list)
  • Maybe recompile on ARM for Grace with sm90 for Hopper
  • XLB : accelerated Lattice Boltzmann Method (LBM), a stencil-based method
  • Too much data for GPU memory, but it can be split into slices and streamed to the GPU
  • Out-of-core algorithm
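  • A sketch (mine) of the managed-memory controls above: allocate with cudaMallocManaged, first-touch on the CPU, declare a preferred location, then prefetch explicitly instead of waiting for access-counter migration; sizes and the elided kernel launches are placeholders:

      #include <cuda_runtime.h>

      int main()
      {
          const size_t bytes = (size_t{1} << 26) * sizeof(float);
          float* data = nullptr;
          int device = 0;
          cudaGetDevice(&device);

          cudaMallocManaged(&data, bytes);                  // pages populated on first touch
          for (size_t i = 0; i < bytes / sizeof(float); ++i) data[i] = 1.0f;   // first touch on the CPU

          cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device); // prefer GPU residency
          cudaMemPrefetchAsync(data, bytes, device, 0);     // move the pages now, on stream 0

          // ... launch kernels that read/write data ...

          cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);   // bring them back for the CPU
          cudaDeviceSynchronize();
          cudaFree(data);
          return 0;
      }
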


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : 1,001 Ways to Write CUDA Kernels in Python [S72449]
  • Leo Fang : Python CUDA Tech Lead, NVIDIA
  • Slides
  • How to stay in Python and still get performance
  • Clang/LLVM, MLIR, NVRTC / nvJitLink / NVVM
  • No need for a preinstalled compiler such as g++ thanks to the JIT mechanism
  • cuda.core : new API to access CUDA functions from Python
  • Template/constexpr parameters become JIT compilation parameters
  • numba-cuda
  • @cuda.jit : all in python but hard to write
  • nvJitLink : generates a kernel as performant as CUDA
  • cuda.cooperative, built on CUB => very small and open source
  • A reduction of rows is a matrix-vector multiply with a vector of ones (as I showed in the CUDA/nvc++ webinar)
  • Optimization : set_smem(smem_vecOnes, nbElement, 1.) to set the vector of ones directly in shared memory because we do not need a real vector for that
  • and then perform the GEMM with cuBLAS
  • cuTile : cuda with tiles programming (see Stephen Jones presentation)
  • @cuda.tile.jit : extension of CUDA
  • The cooperative nature is important
  • Warp : auto-differentiable Python kernels (you do not have to write the backward version of the kernel yourself), open source on GitHub
  • CUTLASS for Python : compilation is faster and performance is about the same
  • You can also use cupy, cupyx or triton
  • cuda.parallel : Write thrust like code in python (open source with Apache 2.0)
  • nvmath (but why is matmul in "advanced" ?)
  • out_arr = xp.sum(in_arr, axis=-1) you can replace xp by PyTorch, CuPy, JAX or DASK and be portable
  • FP4 is on the roadmap for Python
  • You can generate CUBIN from Python and use it in C++


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : Speed-of-Light Data Movement Between Storage and the GPU [S73012]
  • CJ Newburn : Distinguished Engineer, NVIDIA
  • Prashant Prabhu : NVIDIA
  • Vikram Sharma Mailthody : Senior Research Scientist, NVIDIA
  • Slides
  • If threads access a lot of data on GPU, we kill the bandwidth
  • SCADA : SCaled Accelerated Data Access
  • Look ahead to get things ready
  • BlueField-3 with Spectrum-X to deal with congestion problems
  • High-bandwidth data path to the GPU
  • RDMA and TCP
  • The CPU is not deeply involved in the data transfer
  • GPU Direct Storage since April 2021
  • Add support for PyTorch and HDF5
  • S3 over RDMA / cuObject investigation
  • cuFile : up to 6x performance, sometimes 1.02x (see the sketch after this list)
  • The plot slide 44 is very interesting
  • The math: 100 GB/s for Gen6/512B is 200M IOPs/GPU, 1600M for 8 GPUs
  • Explore applications with technologists and storage providers
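  • A sketch (mine) of the cuFile / GPUDirect Storage path above: read a file straight into GPU memory without a CPU bounce buffer; "input.bin", the 16 MiB size and the missing error checks are assumptions:

      #include <cufile.h>
      #include <cuda_runtime.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <cstring>

      int main()
      {
          cuFileDriverOpen();
          int fd = open("input.bin", O_RDONLY | O_DIRECT);

          CUfileDescr_t descr;
          std::memset(&descr, 0, sizeof(descr));
          descr.handle.fd = fd;
          descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
          CUfileHandle_t handle;
          cuFileHandleRegister(&handle, &descr);

          const size_t size = size_t{1} << 24;
          void* dev_buf = nullptr;
          cudaMalloc(&dev_buf, size);
          cuFileBufRegister(dev_buf, size, 0);              // register the GPU buffer with cuFile

          ssize_t got = cuFileRead(handle, dev_buf, size, 0, 0);   // DMA storage -> GPU memory
          (void)got;

          cuFileBufDeregister(dev_buf);
          cuFileHandleDeregister(handle);
          close(fd);
          cudaFree(dev_buf);
          cuFileDriverClose();
          return 0;
      }
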


Thursday, Mar 20 11:00 PM - 11:40 PM CET : How To Write A CUDA Program: The Parallel Programming Edition [S72897]
  • Stephen Jones (SW) : CUDA Architect, NVIDIA
  • Slides
  • A must watch (with a lot of nice schemas, so a bit less of notes)
  • Working on CUDA since 2008
  • How to avoid writing a GPU kernel
  • Oversubscription : launching more work than the hardware can theoretically handle at once (see the sketch at the end of this list)
  • cub and cuda.cooperative
  • cuTile : tile programming for CUDA
  • Llama 3.1 inference with cuTile gets within 10 percent of cuDNN, which is the state of the art
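  • A sketch (mine) of the oversubscription idea above: a grid-stride loop lets you launch many more blocks than the GPU can run at once, and each resident block keeps striding through the data while the rest wait their turn; the scaling kernel and launch numbers are illustrative:

      __global__ void scale(float* data, int n, float alpha)
      {
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)       // stride = total number of threads in the grid
              data[i] *= alpha;
      }
      // e.g. scale<<<4096, 256>>>(d_data, n, 2.0f); works for any n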