12.4.4 : HPC Development

  • Monday, (4:00 PM - 5:20 PM CET) available: Introduction to CUDA Programming and Performance Optimization [S62191]
    • Athena Elafrou, Developer Technology Engineer, NVIDIA
    • A must-watch if you program in CUDA
    • Hopper: 4 warp schedulers per SM
    • SoA (structure of arrays) is 6x faster than AoS (array of structures) on H100 (and on GPUs in general)
    • atomicAdd(&a, b*c) does not fetch a back to the thread; the L2 updates the value of a with b*c (see the sketch below)
    • Use ncu (Nsight Compute) to collect metrics
    • Limit of 256 registers per thread (the value is reported in Nsight)
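    • A sketch of the atomicAdd pattern above, written with Numba CUDA rather than the session's CUDA C++ (array sizes and names are illustrative):

      # Hedged sketch: Numba CUDA analogue of atomicAdd(&a, b*c); needs numba and a CUDA GPU
      import numpy as np
      from numba import cuda

      @cuda.jit
      def fma_reduce(acc, b, c):
          i = cuda.grid(1)
          if i < b.size:
              # the accumulator is updated in place (near L2); the thread never reads it back
              cuda.atomic.add(acc, 0, b[i] * c[i])

      n = 1 << 20
      b = cuda.to_device(np.random.rand(n).astype(np.float32))
      c = cuda.to_device(np.random.rand(n).astype(np.float32))
      acc = cuda.to_device(np.zeros(1, dtype=np.float32))
      fma_reduce[(n + 255) // 256, 256](acc, b, c)
      print(acc.copy_to_host()[0])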
  • Monday, (5:00 PM - 5:25 PM CET) Accelerating Pandas with Zero Code Change using RAPIDS cuDF [S62168]
    • Ashwin Srinath : Senior Software Engineer, NVIDIA
    • cuDF : all core functionalities of Pandas, but on GPU
    • 10-100x speedup compared to Pandas on a 100-300k rows × 2 columns DataFrame
    • import cudf
    • or python -m cudf.pandas script.py
    • Pandas is everywhere but largely single threaded
    • Pandas is not a query engine
    • cuDF pandas supports 100% of the Pandas API (if an operation can run on the GPU it does, otherwise it falls back to the CPU) (see the sketch below)
    • 36.7s to 720ms for read + groupBy + sort
    • %load_ext cudf.pandas
    • >100x faster joins and >40x faster group-bys
    • Example with Pandas + LLMs : cuDF becomes the fastest component of the pipeline
    • cuDF passes 94% of the Pandas unit tests (some edge cases remain to be solved)
    • Stay on GPU as much as possible
    • from cudf.pandas import Profiler (or the %%cudf.pandas.profiler cell magic)
    • Pay attention to GPU memory usage (when GPU memory is full, operations fall back to the CPU, which can cause unexpected slowdowns)
    • Use idiomatic Pandas patterns (a user-defined function took 4s vs. 846ms for the native method)
    • Datetime index is supported on GPU but some functionalities are not ready yet
    • cuDF does not really depend on NumPy (that may come when NumPy becomes the default for Pandas 3)
    • For multiple GPUs : use dask-cudf
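    • A minimal sketch of the zero-code-change path described above (the file and column names are made up); run it unchanged with python -m cudf.pandas script.py, or add %load_ext cudf.pandas in a notebook:

      # Plain Pandas code; cudf.pandas transparently routes supported operations to the GPU
      import pandas as pd

      df = pd.read_parquet("transactions.parquet")          # hypothetical input file
      top = (df.groupby("account_id")["amount"]             # hypothetical columns
               .sum()
               .sort_values(ascending=False)
               .head(10))
      print(top)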
  • Monday, (5:00 PM - 6:40 PM CET) Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads [DLIT61667]
    • no record
  • Monday, Mar (5:30 PM - 6:50 PM CET) available: Advanced Performance Optimization in CUDA [S62192]
    • Igor Terentiev : Engineer V, Dev Tech S/W, NVIDIA
    • Slides
    • CTA (Cooperative Thread Array) == Thread Block
    • CGA (Cooperative Grid Array) == Thread Block Cluster
    • Code snippets: namespace cg = cooperative_groups;
    • Increased SMEM saves GMEM trips in many algorithms
    • linear rank: cg::this_cluster().block_rank()
    • 3D rank: cg::this_cluster().block_index()
    • Thread synchronization:
    • CTA -> cg::this_thread_block().sync()
    • CGA -> cg::this_cluster().sync()
    • Remote DSMEM (distributed shared memory) has to be mapped:
    • __shared__ T smem;
    • auto dsmem_ptr = cg::this_cluster().map_shared_rank(&smem, rank);
    • __syncthreads(); // Barrier + makes all previous writes visible to all threads in the block
    • cg::this_grid().sync(); // Barrier + makes all previous writes visible to all threads in the grid, including global memory!
    • CUDA Graphs for optimized kernel launch (see the sketch below)
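    • The session's CUDA Graphs examples are in CUDA C++; as a stand-in, a minimal capture/replay sketch using PyTorch's graph API (tensor sizes are arbitrary):

      import torch

      x = torch.randn(1 << 20, device="cuda")
      y = torch.empty_like(x)

      # warm up on a side stream so allocations settle before capture
      s = torch.cuda.Stream()
      s.wait_stream(torch.cuda.current_stream())
      with torch.cuda.stream(s):
          y.copy_(2.0 * x + 1.0)
      torch.cuda.current_stream().wait_stream(s)

      # capture once, then replay: one launch per iteration instead of one per kernel
      g = torch.cuda.CUDAGraph()
      with torch.cuda.graph(g):
          y.copy_(2.0 * x + 1.0)
      for _ in range(100):
          g.replay()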
  • Monday, Mar 18 6:00 PM - 6:50 PM CET Magnum IO GPUDirect, NCCL, NVSHMEM, and GDA-KI on Grace Hopper and Hopper systems [S61368]
    • Pak Markthub : Senior Software Engineer, NVIDIA
    • Harry Petty : NVIDIA
    • Davide Rossetti : Senior Software Engineer, NVIDIA
    • Slides
    • Magnum IO umbrella
    • Storage IO, Network IO, In-Network Computing, NVLink Network, Collective Operations
    • NVLink in Magnum IO shows a 30% performance speedup
    • NCCL : parallel efficiency
    • Parallel efficiency drops to 50% at 128 GPUs
    • GPUDirect RDMA : GPU and NIC are inline
    • GPUDirect P2P : a CUDA kernel running on GPU0 can use memory of GPU5
    • GPUDirect Async : Mapping of the NIC memory on the GPU
    • GPUDirect virtualisation : KVM
    • GH200 is the main coherent platform
    • NIC memory registration based on DMA-BUF with contribution to Linux kernel (with GPL libraries)
    • GDA-KI : a kernel running on the SMs can trigger the NIC (WQEs can be prepared in parallel)
    • NVSHMEM IBGDA : IBRC is the default communication proxy; thread, warp, and block scopes
    • 270 Million packets per second with 4 CTA in shmem_p_bw
    • 94 Million packets per second with 8 CTA in shmem_g_bw
    • Jacobi test : scaling close to NVLink
    • NCCL IBGDA : MPI-like semantics for collective operations (see the sketch below)
    • For all-to-all, NCCL
    • Network transports supported via plugins; IBGDA is an external network plugin
    • Register a landing zone, as a FIFO for GPU
    • Channel : local stream to talk to a NIC
    • 512 GPU : 75
    • Some audio issues
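    • NCCL's collectives are usually reached through a wrapper library; as an illustration of the collective path these GPUDirect optimizations accelerate, a minimal all-reduce sketch with torch.distributed (launched with one process per GPU; the script name is illustrative):

      # run: torchrun --nproc_per_node=8 this_script.py
      import torch
      import torch.distributed as dist

      dist.init_process_group(backend="nccl")          # NCCL picks transports (P2P, RDMA, ...)
      rank = dist.get_rank()
      torch.cuda.set_device(rank % torch.cuda.device_count())

      t = torch.ones(1 << 20, device="cuda") * rank
      dist.all_reduce(t, op=dist.ReduceOp.SUM)         # GPU buffers are never staged through the host
      if rank == 0:
          print(t[0].item())
      dist.destroy_process_group()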
  • Monday, (6:00 PM - 6:50 PM CET) No More Porting: Accelerated Computing With Standard C++, Fortran, and Python [S61204]
  • Wednesday 11am No More Porting: Accelerated Computing With Standard C++, Fortran, and Python: A Q&A From the EMEA Region [S61204a]
    • Anastasia Stulova : HPC Compiler Programming Models Architect, NVIDIA
    • Jeff Larkin : HPC Architect, NVIDIA
    • Slides
    • No need to be a ninja programmer
    • Develop applications that run everywhere
    • Standard C++, Fortran, and common Python used on real cases
    • Available in containers
    • Nothing unique to Grace Hopper, but it improves a lot
    • C++ is a parallel language
    • use algorithms, not for loops
    • std::mdspan (C++23) available
    • C++26 preview (senders/receivers available)
    • -stdpar=multicore or -stdpar=gpu
    • std::mdspan : solution for calibration
    • M-AIA : same perf on CPU but 8.74x on GPU
    • Fortran is still a lively parallel language
    • No coarray support (Fortran 2018)
    • Reduce clause added in Fortran 2023
    • MiniWeather : triple loop converted into a do concurrent loop
    • Same CPU perf but 18x for GPU
    • ABINIT : Same CPU but 5.5x on GPU (FFTW interface on GPU) so no difference in code
    • Python is not a standard but a set of packages
    • Each package has to develop its own GPU port
    • Legate solves this issue (see the sketch after this session's notes)
    • No need to change code for Grace Hopper
    • CUDA or OpenACC will work
    • On Grace, the bandwidth differs when accessing data on the CPU vs. on the GPU
    • stdpar automatically places data where it should be
    • NVHPC compilers : -gpu=managed (but mostly automatic) to manage memory automatically
    • Unified memory with -gpu=unified since 2023
    • It is possible to mix nvcc (CUDA) with nvc++
    • Languages like Julia and Rust will be integrated
    • Block-size optimization? It is expected to be automatic, with the option to tune it manually
    • C++ dynamic_cast is not supported yet
    • The approach is to develop almost everything with standard algorithms and then fine-tune what is needed (and debug it on CPU)
    • Parallel from the start
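    • For the Python side ("Legate solves this issue"), a minimal sketch of the drop-in idea: the same NumPy-style code, with only the import changed, can run on CPU or GPUs through the legate launcher (the launcher flags below are indicative, not taken from the talk):

      # Whole-array expressions instead of Python loops: the runtime decides where they execute.
      # e.g. run with: legate --gpus 2 jacobi_sketch.py   (script name is illustrative)
      import cunumeric as np   # drop-in replacement for numpy

      a = np.random.rand(4096, 4096)
      b = a.copy()
      # one Jacobi-style sweep, written as slicing arithmetic ("algorithms, not loops")
      b[1:-1, 1:-1] = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:])
      print(float(b.mean()))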
  • Tuesday, (4:00 PM - 4:50 PM CET) Accelerating and Securing GPU Accesses to Large Datasets [S62559]
    • CJ Newburn : Distinguished Engineer, NVIDIA
    • Oren Duer : Director, Software Architecture, NBU, NVIDIA
    • Vikram Sharma Maithody : Senior Research Scientist, NVIDIA
    • Slides
  • Tuesday, Mar (5:00 PM - 5:25 PM CET) Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions [S62405]
  • Tuesday, (5:00 PM - 5:50 PM CET) Demystify CUDA Debugging and Performance with Powerful Developer Tools [S62256]
    • Jackson Marusarz , Technical Product Manager, NVIDIA
    • Slides
    • Python + nvtx annotations for Nsight (see the sketch below)
    • New Nsight features : host Grace profiling, extension to JupyterLab
    • Soon : Python call stack and syntax highlighting
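    • A minimal sketch of the "Python + nvtx" workflow: annotate regions with the nvtx package and the ranges show up on the Nsight Systems timeline (profile with nsys profile python script.py):

      import time
      import nvtx

      @nvtx.annotate("preprocess", color="blue")            # decorator form
      def preprocess():
          time.sleep(0.1)

      def train_step():
          with nvtx.annotate("train_step", color="green"):  # context-manager form
              time.sleep(0.2)

      preprocess()
      for _ in range(3):
          train_step()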
  • Tuesday, (5:00 PM - 5:50 PM CET) RAPIDS in 2024: Accelerated Data Science Everywhere [S62741]
    • Dante Gamma Desavre : Machine Learning Engineering Manager, NVIDIA
    • Nick Becker : Senior Technical Product Manager, NVIDIA
    • RAPIDS : ecosystem
    • Forecasting : from once a month or once a year to once a day
    • RAPIDS for Dask or Apache Spark
    • cuVS : accelerated vector search
    • NVIDIA AI Enterprise : open-source foundation
    • Use a CPU/GPU switch, or both, or custom CUDA kernels
    • Pandas acceleration : cudf.pandas covers the whole Pandas API; not everything is accelerated, but everything that can run on the GPU is
    • python -m cudf.pandas script.py
    • NetworkX : nx-cugraph backend (more than 60 algorithms in NetworkX, with speedups from 5x to 600x) (see the sketch below)
    • OK even if you don't have a GPU
    • Integrated in networkX documentation
    • Numba CUDA : supports shared memory and cooperative groups
    • Data interoperability
    • Dask supports a configurable GPU backend (a local CUDA cluster instead of a local CPU cluster)
    • Dask Expressions : simplify expressions on GPU as well, handle multiple GPUs, and avoid running out of memory
    • Apache spark : 5.5x faster
    • Most popular algos in Spark ML
    • XGBoost 2.0 : NVFlare integration, UCX to speed up networking
    • cuVS : CUDA vector search : accelerated vector search for C++, Python, Rust, and C, for vector databases and graph-based ANN CAGRA-G (IVF-PQ with CAGRA)
    • CAGRA-Q has quantization
    • CUDA Toolkit on conda-forge
    • JupyterLab NVDashboard (high-level view of GPU metrics), and Nsight
    • NVTX to annotate code (C++, Python)
    • Workbench : smoothly start from a laptop and go into a cluster of GPUs
    • NVIDIA LaunchPad
    • RAPIDS ecosystem of tools : Legate is a distributed runtime (so a different level of abstraction)
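    • A minimal sketch of the zero-code-change NetworkX path mentioned above (requires the nx-cugraph backend package; without it, the same call runs on the CPU):

      import networkx as nx

      G = nx.karate_club_graph()
      # dispatch this call to the GPU backend explicitly;
      # environment-based auto-dispatch achieves the same without the keyword
      bc = nx.betweenness_centrality(G, backend="cugraph")
      print(sorted(bc, key=bc.get, reverse=True)[:5])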
  • Tuesday, (7:00 PM - 7:50 PM CET) CUDA: New Features and Beyond [S62400]
    • Stephen Jones : CUDA Architect, NVIDIA
    • Slides
    • Combined hardware and software
    • Power is the metric that matters
    • What can I do with 20 MW?
    • Data movement and computation
    • Power consumption scales with the square of the mantissa width in floating-point computation
    • LU decomposition in tensor core
    • Mixed precision is 6x faster than double precision with the same result
    • How to use tensor cores : cuBLAS, cuBLASLt, CUTLASS (write tensor core code inside your own kernel)
    • cuBLASDx : device extension of cuBLAS
    • Kernel fusion to save loads, with JIT
    • CUTLASS can mix with PyTorch
    • Nsight tracks code origin (C++, Python, etc.), even the Python interpreter's global lock
    • Warp : differentiable kernel development (example with a bread simulation) (see the sketch below)
    • Breaking bread simulation
    • Legate into JAX
    • Nsight on large-scale machines and interface with JupyterLab
    • GPUDirect : direct copies over the network
    • GPUDirect Async : trigger from the GPU
    • GPUDirect-KI : kernel-initiated
    • NCCL and NVSHMEM traces in Nsight
    • 20% speedup on Grace Hopper for transformer training thanks to data cached on Grace
    • CUDA Graphs : task graphs on the GPU (6 years of work)
    • DCGM : data center monitoring
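    • A minimal NVIDIA Warp sketch to illustrate the "kernels written in Python" point above (a simple saxpy, not the differentiable simulation shown in the talk):

      import numpy as np
      import warp as wp

      wp.init()

      @wp.kernel
      def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
          i = wp.tid()
          y[i] = a * x[i] + y[i]

      n = 1024
      x = wp.array(np.random.rand(n).astype(np.float32), dtype=float)
      y = wp.zeros(n, dtype=float)
      wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
      print(y.numpy()[:4])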
  • Tuesday, Mar (7:00 PM - 7:50 PM CET) A Deep Dive into the Latest HPC Software [S61203]
    • Jeff Larkin : HPC Architect, NVIDIA
    • Slides
    • No audio
  • Tuesday, (10:00 PM - 10:50 PM CET) Unlocking Developer Productivity across CPU and GPU with Mojo [S62220]
    • Chris Lattner : Chief Executive Officer and Co-Founder, Modular AI
    • Mostafa Hagog : Performance & Compiler Engineering Lead, Modular AI
    • Slides
  • Tuesday, Mar (11:00 PM - 11:50 PM CET) CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores [S61198]
    • Vijay Thakkar : Senior Architect, NVIDIA
    • Jack Kosaian : Senior Architect, NVIDIA
    • Slides
    • CUTLASS : C++ BLAS template library, 2.5M downloads/month
    • Handles fused matrix products and FP8 (E5M2 and E4M3)
    • Composable schedulers
    • 5-level conceptual hierarchy
    • Kernel layer, device layer : block- and grid-level concerns
    • Reduction of the API surface
    • Convolution in CUTLASS 3.5
    • Convolution collective
    • Arbitrary number of dimensions
    • GEMMs are just tensor contractions (GETT mapped to implicit GEMM)
    • They use Hopper TMA
    • Beta API with 1, 2 and 3 dimensions
    • Grouped GEMM and Mixture of Experts (MoE)
    • Weight quantization : use lighter weights, e.g. from FP16 to INT4, and use tensor cores for that GEMM
    • Epilogue Visitor Tree (EVT) : compose complex epilogues
    • Also available in Python
    • It is possible to extend CUTLASS with custom Kernels
    • CUTLASS 3.6 expected this summer, with sparsity support for Hopper
    • Mixed-precision GEMM is not supported
  • Wednesday, Mar 20 (1:30 AM - 1:45 AM CET) Practical Tips for using Grace Hopper to Dramatically Accelerate your Deep Learning and HPC pipelines [EXPT63133]
  • Wednesday, (4:00 PM - 4:25 PM CET) Perform High-Efficiency Search, Improve Data Freshness, and Increase Recall With GPU-Accelerated Vector Search and RAG Workflows [S62599]
    • Charles Xie : CEO, Zilliz
    • Corey Nolet : Principal Engineer, RAPIDS ML, NVIDIA
    • Slides
    • Vector database systems and RAG systems, and how to bring GPUs into them, introducing cuVS
    • Vectors for text, video, molecules
    • Milvus : the first vector database in the world (6 years ago)
    • Vector databases to leverage proprietary domain data
    • Retrieval-Augmented Generation (RAG) avoids hallucinations and improves accuracy
    • GPU acceleration on worker nodes
    • Brute-force algorithms are OK on GPU => CAGRA (almost zero build time and very high throughput)
    • CAGRA : find nearest neighbours on GPU (see the sketch below)
    • The CAGRA graph can also be used on CPU, and HNSW search is faster on a CAGRA-built graph
    • 1 TB of raw text becomes 2.5 TB of vectors (635M vectors of 1024 dimensions)
    • Wraps RAFT and increases language support
    • cuVS for C, C++, Python, Rust
    • A separate GPU is needed if cuVS is used while training a model
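    • A hedged sketch of GPU ANN search with CAGRA; the module path and parameter names below are assumptions based on the cuVS/RAFT Python interfaces and may differ from the released API:

      import cupy as cp
      from cuvs.neighbors import cagra    # assumed module path

      dataset = cp.random.random((100_000, 128), dtype=cp.float32)
      queries = cp.random.random((1_000, 128), dtype=cp.float32)

      index = cagra.build(cagra.IndexParams(), dataset)    # graph build is very fast on GPU
      distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)
      print(cp.asarray(neighbors)[:2])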
  • Wednesday, Mar (4:00 PM - 4:50 PM CET) Mastering CUDA C++: Modern Best Practices with the CUDA C++ Core Libraries [S62175]
    • Jake Hemstad, Software Engineering Manager, NVIDIA
    • Georgii Evtushenko : Senior Software Engineer, NVIDIA
    • no record
  • Wednesday, (6:00 PM - 6:25 PM CET) Performance Optimization for Grace CPU Superchip [S62275]
  • Wednesday, (6:00 PM - 6:50 PM CET) Legate: A Productive Programming Framework for Composable, Scalable, Accelerated Libraries [S62262]
    • Wonchan Lee : Senior Software Engineering Lead, NVIDIA
    • Manolis Papadakis : Senior Software Engineer, Legate Framework, NVIDIA
    • Slides
    • Use a wide range of accelerated hardware
    • Not easy to program
    • Bring NumPy, SciPy, and widely used Python frameworks to the GPU
    • Transparently scale programs
    • NumPy, SciPy, Pandas, scikit-learn, XGBoost, JAX, Zarr, HDF, ...
    • Each implementation should scale transparently
    • No expensive operation to go from one library to another
    • Common solution for these problems
    • First, the Legate runtime stack : Legate + Legion + Realm
    • Users do not have to change their code on new hardware
    • NumPy -> Legate cuNumeric (presented at GTC 2023) (see the sketch after this session's notes)
    • Pandas -> Legate DataFrame (new)
    • Scikit learn -> Legate ML (new)
    • XGBoost -> Legate Boost (new)
    • JAX -> Legate JAX (new)
    • HDF -> Legate IO (new)
    • Python program -> task graph into Legate Runtime
    • Problem : data partitioning
    • Sometimes data needs to be repartitioned
    • Problem if repartitioning happens outside of Legate
    • Legate Runtime : common way to deal with partitioning (you can add partitioning constraints)
    • Legate JAX : from the popular ML framework, with flexible tensor partitioning strategies
    • But no support to scale up to 1k GPUs
    • Augment the JAX pipelining parallelism
    • JAX uses the Legate XLA backend => performance comparable to the state of the art at 128, 256 or 512 GPUs, but with all libraries
    • RAPIDS + Legate :
    • RAPIDS : collection of fast, maintained kernels for ETL and ML (for Pandas, XGBoost and HDF)
    • Legate Boost : mixed-model boosting
    • SLAC : Stanford Linear Accelerator Center
    • Develop on your laptop and scale to a computing center for production
    • Legate IO : HDF5 and Zarr with GPUDirect Storage
    • Switching between CPU and GPU in one click
    • Example with a multiphysics solver : possibility to map computation onto CPU and GPU (slide 13)
    • Legate Sparse : Sparse matrix implementation
    • Task Fusion : MLIR Kernel Fusion (perf slide 14)
    • Complete C++ reimplementation of the core layer; it was previously in Python, which turned out to be the performance bottleneck
    • Legate JAX uses the C++ Legate core
    • And Legate STL : implement your C++ programs in a functional style (legate::stl::transform_reduce)
    • Nsight has a logical view and a physical view with Legate, to help users understand what is going on
    • conda install -c nvidia -c conda-forge -c legate cunumeric
    • https://github.com/nv-legate
    • Legate's design is more inspired by the STL than by TBB (so no TBB in this)
    • Will Legate also work on Jetson, Orin? Not really: the runtime overhead could be excessive on smaller CPUs and GPUs
    • Release in May or July 2024 (new Legate version with all the Python and C++ work)
    • The runtime does not care how you implement your library
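    • A minimal cuNumeric sketch of the "unchanged code, implicit partitioning" point above: the runtime turns these array operations into a task graph and partitions the arrays across the available GPUs (the resource flags are indicative, not taken from the talk):

      # e.g. run with: legate --nodes 1 --gpus 4 script.py
      import cunumeric as np

      A = np.random.rand(8192, 8192)
      x = np.random.rand(8192)
      # matmul and reduction become tasks; the runtime chooses the partitioning,
      # with no explicit communication or per-GPU code in the user program
      y = A @ x
      print(float(y.sum()))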
  • Wednesday, (10:00 PM - 10:50 PM CET) Multi GPU Programming Models for HPC and AI [S61339]
    • Jiri Kraus : Principal DevTech Compute, NVIDIA
    • Slides
    • Some NCCL, MPI, OpenSHMEM, NVSHMEM; worth watching if you develop communication-based computations
    • 8xH100 with non blocking communication
    • Domain decomposition : optimize the number of neighbours, or optimize the communications between neighbours
    • A lot of examples
    • Do the communication while computation is still running
    • NVSHMEM : direct communication between GPU kernels
    • NVSHMEM is faster with -dlto (link-time optimization)
    • Start by fusing kernels before and after communication
    • NCCL and NVSHMEM can be used with CUDA Graphs
    • The affinity between the network adapter and the GPU is very important
    • Better to use a CUDA-aware MPI on a cluster if possible (see the sketch below)
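    • A minimal sketch of the CUDA-aware MPI point: with mpi4py and a CUDA-aware MPI build, GPU arrays (here CuPy) can be passed to MPI calls directly, with no host staging (one rank per GPU; the script name is illustrative):

      # run: mpirun -np 2 python halo_sketch.py
      import cupy as cp
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

      send = cp.full(1 << 20, rank, dtype=cp.float32)
      recv = cp.empty_like(send)
      peer = (rank + 1) % comm.Get_size()
      cp.cuda.runtime.deviceSynchronize()   # make sure the send buffer is ready before MPI touches it
      # CUDA-aware MPI reads/writes the device buffers directly (via __cuda_array_interface__)
      comm.Sendrecv(send, dest=peer, recvbuf=recv, source=peer)
      print(rank, float(recv[0]))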
  • Wednesday, (11:00 PM - 11:50 PM CET) More Data, Faster: GPU Memory Management Best Practices in Python and C++ [S62550]
    • Mark Harris : Distinguished Engineer, NVIDIA
    • You have to watch this if you are a library developer (C++ or Python)
    • No metadata from the host allocator when it is used to allocate data on the device, because it cannot read it
    • Multiple workflows can be problematic: if each one allocates its own memory pool without cooperating, we run out of memory
    • cudaMalloc / cudaFree => synchronous, and the free is slow for safety
    • cudaFree adds bubbles to the pipeline
    • On 16 V100s => 88% of time spent in memory management, dropping to ~0 with a pool
    • cudaMallocAsync (CUDA 11.2)
    • RMM : fast and flexible memory management
    • Used in cuDF, cuML, RAFT
    • By default RMM uses cudaMalloc and cudaFree, but they can be replaced with a stream-ordered pool allocator (see the sketch below)
    • RMM is in transition (refactoring ongoing)
    • A deallocation may be reused by the next allocation, asynchronously
    • rmm::device_buffer : uninitialized bytes of device memory
    • rmm::device_uvector : like thrust::device_vector but uninitialized
    • Thrust does not know about streams
    • binning_memory_resource : separates small allocations from large ones, using a different resource if needed
    • Share memory pool between libraries
    • External allocator interface
    • Best Practices :
    • no raw allocation
    • Give your library an external allocator interface (allows libraries cooperation)
    • Use stream ordered allocation
    • Use stream ordered data containers
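    • A minimal sketch of the pool / shared-allocator advice above, using RMM's Python API (the pool size is arbitrary; module paths are as in recent RMM releases): libraries that use RMM's current device resource, plus CuPy via its allocator hook, then all draw from the same pool:

      import cupy as cp
      import rmm
      from rmm.allocators.cupy import rmm_cupy_allocator

      # one stream-ordered pool instead of many cudaMalloc/cudaFree calls
      pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(),
                                       initial_pool_size=2 ** 30)
      rmm.mr.set_current_device_resource(pool)

      cp.cuda.set_allocator(rmm_cupy_allocator)      # CuPy now allocates from the RMM pool

      buf = rmm.DeviceBuffer(size=1 << 20)           # uninitialized device bytes from the pool
      arr = cp.zeros(1 << 20, dtype=cp.float32)      # also served by the pool
      print(buf.size, arr.nbytes)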
  • Thursday, live (5:00 PM - 5:50 PM CET) How To Write A CUDA Program: The Ninja Edition [S62401]
    • Stephen Jones (SW), CUDA Architect, NVIDIA
    • How to think about writing a program
    • The last 10 percent of peak performance is hard to get
    • 1M Threads on H100
    • Data Parallelism : GPU (Throughput machine)
    • Task Parallelism : CPU (Latency machine)
    • GPU : expects 100 waves
    • Map data to threads
    • 10x in cache L2
    • Split the task so the data fits in cache
    • Don't start all your kernels from the top left, to preserve the previous kernel's cache contents
    • No all-to-all
    • You can write CUDA code in Python (CuPy, Numba) but you have to understand how the machine works (see the sketch below)
    • Tensor Parallelism is data Parallelism
    • No way to automate block size selection because it is problem-dependent
    • If you have many kernels, just use CUDA Graphs
    • Nsight gives info on the low-level cache usage of kernels (occupancy, latency, etc.)
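    • A small Numba CUDA sketch of the "map data to threads" / waves idea above: a fixed grid walks the whole array with a grid-stride loop, so the same kernel keeps every SM busy regardless of problem size (grid dimensions are arbitrary):

      import numpy as np
      from numba import cuda

      @cuda.jit
      def scale(out, x, a):
          start = cuda.grid(1)
          stride = cuda.gridsize(1)
          # grid-stride loop: threads are reused across the data in successive waves
          for i in range(start, x.size, stride):
              out[i] = a * x[i]

      x = cuda.to_device(np.random.rand(1 << 22).astype(np.float32))
      out = cuda.device_array_like(x)
      scale[1024, 256](out, x, 2.0)      # 1024 blocks x 256 threads, independent of x.size
      print(out.copy_to_host()[:4])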