12.4.4 : HPC Development

  • Monday, (4:00 PM - 5:20 PM CET) available: Introduction to CUDA Programming and Performance Optimization [S62191]
    • Athena Elafrou, Developer Technology Engineer, NVIDIA
    • A must-watch if you program in CUDA
    • Hopper: 4 warp schedulers per SM
    • SoA (structure of arrays) is 6x faster than AoS (array of structures) on H100 (and on GPUs in general)
    • atomicAdd(&a, b*c) does not fetch a back to the thread; the L2 updates the value of a with b*c (see the sketch below)
    • Use ncu (Nsight Compute) to collect metrics
    • Limit of 256 registers per thread (the value is reported in Nsight)
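    • A sketch of the atomicAdd pattern above, written with Numba CUDA rather than the session's CUDA C++ (array sizes and names are illustrative):

      # Hedged sketch: Numba CUDA analogue of atomicAdd(&a, b*c); needs numba and a CUDA GPU
      import numpy as np
      from numba import cuda

      @cuda.jit
      def fma_reduce(acc, b, c):
          i = cuda.grid(1)
          if i < b.size:
              # the accumulator is updated in place (near L2); the thread never reads it back
              cuda.atomic.add(acc, 0, b[i] * c[i])

      n = 1 << 20
      b = cuda.to_device(np.random.rand(n).astype(np.float32))
      c = cuda.to_device(np.random.rand(n).astype(np.float32))
      acc = cuda.to_device(np.zeros(1, dtype=np.float32))
      fma_reduce[(n + 255) // 256, 256](acc, b, c)
      print(acc.copy_to_host()[0])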
  • Monday, (5:00 PM - 5:25 PM CET) Accelerating Pandas with Zero Code Change using RAPIDS cuDF [S62168]
    • Ashwin Srinath : Senior Software Engineer, NVIDIA
    • cuDF : all core functionalities of Pandas, but on GPU
    • 10-100x speedup compared to Pandas on a 100-300k rows × 2 columns DataFrame
    • import cudf
    • or python -m cudf.pandas script.py
    • Pandas is everywhere but largely single threaded
    • Pandas is not a query engine
    • cuDF pandas supports 100% of the Pandas API (if an operation can run on the GPU it does, otherwise it falls back to the CPU) (see the sketch below)
    • 36.7s to 720ms for read + groupBy + sort
    • %load_ext cudf.pandas
    • >100x faster joins and >40x faster group-bys
    • Example with Pandas + LLMs : cuDF becomes the fastest component of the pipeline
    • cuDF passes 94% of the Pandas unit tests (some edge cases remain to be solved)
    • Stay on GPU as much as possible
    • from cudf.pandas import Profiler (or the %%cudf.pandas.profiler cell magic)
    • Pay attention to GPU memory usage (when GPU memory is full, operations fall back to the CPU, which can cause unexpected slowdowns)
    • Use idiomatic Pandas patterns (a user-defined function took 4s vs. 846ms for the native method)
    • Datetime index is supported on GPU but some functionalities are not ready yet
    • cuDF does not really depend on NumPy (that may come when NumPy becomes the default for Pandas 3)
    • For multiple GPUs : use dask-cudf
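    • A minimal sketch of the zero-code-change path described above (the file and column names are made up); run it unchanged with python -m cudf.pandas script.py, or add %load_ext cudf.pandas in a notebook:

      # Plain Pandas code; cudf.pandas transparently routes supported operations to the GPU
      import pandas as pd

      df = pd.read_parquet("transactions.parquet")          # hypothetical input file
      top = (df.groupby("account_id")["amount"]             # hypothetical columns
               .sum()
               .sort_values(ascending=False)
               .head(10))
      print(top)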
  • Monday, (5:00 PM - 6:40 PM CET) Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads [DLIT61667]
    • no record
  • Monday, Mar (5:30 PM - 6:50 PM CET) available: Advanced Performance Optimization in CUDA [S62192]
    • Igor Terentiev : Engineer V, Dev Tech S/W, NVIDIA
    • Slides
    • CTA (Cooperative Thread Array) == Thread Block
    • CGA (Cooperative Grid Array) == Thread Block Cluster
    • Code snippets: namespace cg = cooperative_groups;
    • Increased SMEM saves GMEM trips in many algorithms
    • linear rank: cg::this_cluster().block_rank()
    • 3D rank: cg::this_cluster().block_index()
    • Thread synchronization:
    • CTA -> cg::this_thread_block().sync()
    • CGA -> cg::this_cluster().sync()
    • Remote DSMEM (distributed shared memory) has to be mapped:
    • __shared__ T smem;
    • auto dsmem_ptr = cg::this_cluster().map_shared_rank(&smem, rank);
    • __syncthreads(); // Barrier + makes all previous writes visible to all threads in the block
    • cg::this_grid().sync(); // Barrier + makes all previous writes visible to all threads in the grid, including global memory!
    • CUDA Graphs for optimized kernel launch (see the sketch below)
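    • The session's CUDA Graphs examples are in CUDA C++; as a stand-in, a minimal capture/replay sketch using PyTorch's graph API (tensor sizes are arbitrary):

      import torch

      x = torch.randn(1 << 20, device="cuda")
      y = torch.empty_like(x)

      # warm up on a side stream so allocations settle before capture
      s = torch.cuda.Stream()
      s.wait_stream(torch.cuda.current_stream())
      with torch.cuda.stream(s):
          y.copy_(2.0 * x + 1.0)
      torch.cuda.current_stream().wait_stream(s)

      # capture once, then replay: one launch per iteration instead of one per kernel
      g = torch.cuda.CUDAGraph()
      with torch.cuda.graph(g):
          y.copy_(2.0 * x + 1.0)
      for _ in range(100):
          g.replay()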
  • Monday, Mar 18 6:00 PM - 6:50 PM CET Magnum IO GPUDirect, NCCL, NVSHMEM, and GDA-KI on Grace Hopper and Hopper systems [S61368]
    • Pak Markthub : Senior Software Engineer, NVIDIA
    • Harry Petty : NVIDIA
    • Davide Rossetti : Senior Software Engineer, NVIDIA
    • Slides
    • Magnum IO umbrella
    • Storage IO, Network IO, In-Network Computing, NVLink Network, Collective Operations
    • NVLink in Magnum IO shows a 30% performance speedup
    • NCCL : parallel efficiency
    • Parallel efficiency drops to 50% at 128 GPUs
    • GPUDirect RDMA : GPU and NIC are inline
    • GPUDirect P2P : a CUDA kernel running on GPU0 can use memory of GPU5
    • GPUDirect Async : Mapping of the NIC memory on the GPU
    • GPUDirect virtualisation : KVM
    • GH200 is the main coherent platform
    • NIC memory registration based on DMA-BUF with contribution to Linux kernel (with GPL libraries)
    • GDA-KI : a kernel running on the SMs can trigger the NIC (WQEs can be prepared in parallel)
    • NVSHMEM IBGDA : IBRC is the default communication proxy; thread, warp, and block scopes
    • 270 Million packets per second with 4 CTA in shmem_p_bw
    • 94 Million packets per second with 8 CTA in shmem_g_bw
    • Jacobi test : scaling close to NVLink
    • NCCL IBGDA : MPI-like semantics for collective operations (see the sketch below)
    • For all-to-all, NCCL
    • Network transports supported via plugins; IBGDA is an external network plugin
    • Register a landing zone, as a FIFO for GPU
    • Channel : local stream to talk to a NIC
    • 512 GPU : 75
    • Some audio issues
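    • NCCL's collectives are usually reached through a wrapper library; as an illustration of the collective path these GPUDirect optimizations accelerate, a minimal all-reduce sketch with torch.distributed (launched with one process per GPU; the script name is illustrative):

      # run: torchrun --nproc_per_node=8 this_script.py
      import torch
      import torch.distributed as dist

      dist.init_process_group(backend="nccl")          # NCCL picks transports (P2P, RDMA, ...)
      rank = dist.get_rank()
      torch.cuda.set_device(rank % torch.cuda.device_count())

      t = torch.ones(1 << 20, device="cuda") * rank
      dist.all_reduce(t, op=dist.ReduceOp.SUM)         # GPU buffers are never staged through the host
      if rank == 0:
          print(t[0].item())
      dist.destroy_process_group()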
  • Monday, (6:00 PM - 6:50 PM CET) No More Porting: Accelerated Computing With Standard C++, Fortran, and Python [S61204]
  • Wednesday 11am No More Porting: Accelerated Computing With Standard C++, Fortran, and Python: A Q&A From the EMEA Region [S61204a]
    • Anastasia Stulova : HPC Compiler Programming Models Architect, NVIDIA
    • Jeff Larkin : HPC Architect, NVIDIA
    • Slides
    • No need to be a ninja programmer
    • Develop applications that run everywhere
    • Standard C++, Fortran, and common Python used on real cases
    • Available in containers
    • Nothing unique to Grace Hopper, but it improves a lot
    • C++ is a parallel language
    • use algorithms, not for loops
    • std::mdspan (C++23) available
    • C++26 preview (senders/receivers available)
    • -stdpar=multicore or -stdpar=gpu
    • std::mdspan : solution for calibration
    • M-AIA : same perf on CPU but 8.74x on GPU
    • Fortran is still a lively parallel language
    • No coarray support (Fortran 2018)
    • Reduce clause added in Fortran 2023
    • MiniWeather : triple loop converted into a do concurrent loop
    • Same CPU perf but 18x for GPU
    • ABINIT : Same CPU but 5.5x on GPU (FFTW interface on GPU) so no difference in code
    • Python is not a standard but a set of packages
    • Each package has to develop its own GPU port
    • Legate solves this issue (see the sketch after this session's notes)
    • No need to change code for Grace Hopper
    • CUDA or OpenACC will work
    • On Grace, the bandwidth differs when accessing data on the CPU vs. on the GPU
    • stdpar automatically places data where it should be
    • NVHPC compilers : -gpu=managed (but mostly automatic) to manage memory automatically
    • Unified memory with -gpu=unified since 2023
    • It is possible to mix nvcc (CUDA) with nvc++
    • Languages like Julia and Rust will be integrated
    • Block-size optimization? It is expected to be automatic, with the option to tune it manually
    • C++ dynamic_cast is not supported yet
    • The approach is to develop almost everything with standard algorithms and then fine-tune what is needed (and debug it on CPU)
    • Parallel from the start
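    • For the Python side ("Legate solves this issue"), a minimal sketch of the drop-in idea: the same NumPy-style code, with only the import changed, can run on CPU or GPUs through the legate launcher (the launcher flags below are indicative, not taken from the talk):

      # Whole-array expressions instead of Python loops: the runtime decides where they execute.
      # e.g. run with: legate --gpus 2 jacobi_sketch.py   (script name is illustrative)
      import cunumeric as np   # drop-in replacement for numpy

      a = np.random.rand(4096, 4096)
      b = a.copy()
      # one Jacobi-style sweep, written as slicing arithmetic ("algorithms, not loops")
      b[1:-1, 1:-1] = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:])
      print(float(b.mean()))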
  • Tuesday, (4:00 PM - 4:50 PM CET) Accelerating and Securing GPU Accesses to Large Datasets [S62559]
    • CJ Newburn : Distinguished Engineer, NVIDIA
    • Oren Duer : Director, Software Architecture, NBU, NVIDIA
    • Vikram Sharma Maithody : Senior Research Scientist, NVIDIA
    • Slides
  • Tuesday, Mar (5:00 PM - 5:25 PM CET) Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions [S62405]
  • Tuesday, (5:00 PM - 5:50 PM CET) Demystify CUDA Debugging and Performance with Powerful Developer Tools [S62256]
    • Jackson Marusarz , Technical Product Manager, NVIDIA
    • Slides
    • Python + nvtx annotations for Nsight (see the sketch below)
    • New Nsight features : host Grace profiling, extension to JupyterLab
    • Soon : Python call stack and syntax highlighting
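    • A minimal sketch of the "Python + nvtx" workflow: annotate regions with the nvtx package and the ranges show up on the Nsight Systems timeline (profile with nsys profile python script.py):

      import time
      import nvtx

      @nvtx.annotate("preprocess", color="blue")            # decorator form
      def preprocess():
          time.sleep(0.1)

      def train_step():
          with nvtx.annotate("train_step", color="green"):  # context-manager form
              time.sleep(0.2)

      preprocess()
      for _ in range(3):
          train_step()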
  • Tuesday, (5:00 PM - 5:50 PM CET) RAPIDS in 2024: Accelerated Data Science Everywhere [S62741]
    • Dante Gamma Desavre : Machine Learning Engineering Manager, NVIDIA
    • Nick Becker : Senior Technical Product Manager, NVIDIA
    • RAPIDS : ecosystem
    • Forecasting : from once a month or once a year to once a day
    • RAPIDS for Dask or Apache Spark
    • cuVS : accelerated vector search
    • NVIDIA AI Enterprise : open-source foundation
    • Use a CPU/GPU switch, or both, or custom CUDA kernels
    • Pandas acceleration : cudf.pandas covers the whole Pandas API; not everything is accelerated, but everything that can run on the GPU is
    • python -m cudf.pandas script.py
    • NetworkX : nx-cugraph backend (more than 60 algorithms in NetworkX, with speedups from 5x to 600x) (see the sketch below)
    • OK even if you don't have a GPU
    • Integrated in networkX documentation
    • Numba CUDA : supports shared memory and cooperative groups
    • Data interoperability
    • Dask supports a configurable GPU backend (a local CUDA cluster instead of a local CPU cluster)
    • Dask Expressions : simplify expressions on GPU as well, handle multiple GPUs, and avoid running out of memory
    • Apache spark : 5.5x faster
    • Most popular algos in Spark ML
    • XGBoost 2.0 : NVFlare integration, UCX to speed up networking
    • cuVS : CUDA vector search : accelerated vector search for C++, Python, Rust, and C, for vector databases and graph-based ANN CAGRA-G (IVF-PQ with CAGRA)
    • CAGRA-Q has quantization
    • CUDA Toolkit on conda-forge
    • JupyterLab NVDashboard (high-level view of GPU metrics), and Nsight
    • NVTX to annotate code (C++, Python)
    • Workbench : smoothly start from a laptop and go into a cluster of GPUs
    • NVIDIA LaunchPad
    • RAPIDS ecosystem of tools : Legate is a distributed runtime (so a different level of abstraction)
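    • A minimal sketch of the zero-code-change NetworkX path mentioned above (requires the nx-cugraph backend package; without it, the same call runs on the CPU):

      import networkx as nx

      G = nx.karate_club_graph()
      # dispatch this call to the GPU backend explicitly;
      # environment-based auto-dispatch achieves the same without the keyword
      bc = nx.betweenness_centrality(G, backend="cugraph")
      print(sorted(bc, key=bc.get, reverse=True)[:5])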
  • Tuesday, (7:00 PM - 7:50 PM CET) CUDA: New Features and Beyond [S62400]
    • Stephen Jones : CUDA Architect, NVIDIA
    • Slides
    • Combined hardware and software
    • Power is the metric that matters
    • What can I do with 20 MW?
    • Data movement and computation
    • Power consumption scales with the square of the mantissa width in floating-point computation
    • LU decomposition in tensor core
    • Mixed precision is 6x faster than double precision with the same result
    • How to use tensor cores : cuBLAS, cuBLASLt, CUTLASS (write tensor core code inside your own kernel)
    • cuBLASDx : device extension of cuBLAS
    • Kernel fusion to save loads, with JIT
    • CUTLASS can mix with PyTorch
    • Nsight tracks code origin (C++, Python, etc.), even the Python interpreter's global lock
    • Warp : differentiable kernel development (example with a bread simulation) (see the sketch below)
    • Breaking bread simulation
    • Legate into JAX
    • Nsight on large-scale machines and interface with JupyterLab
    • GPUDirect : direct copies over the network
    • GPUDirect Async : trigger from the GPU
    • GPUDirect-KI : kernel-initiated
    • NCCL and NVSHMEM traces in Nsight
    • 20% speedup on Grace Hopper for transformer training thanks to data cached on Grace
    • CUDA Graphs : task graphs on the GPU (6 years of work)
    • DCGM : data center monitoring
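    • A minimal NVIDIA Warp sketch to illustrate the "kernels written in Python" point above (a simple saxpy, not the differentiable simulation shown in the talk):

      import numpy as np
      import warp as wp

      wp.init()

      @wp.kernel
      def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
          i = wp.tid()
          y[i] = a * x[i] + y[i]

      n = 1024
      x = wp.array(np.random.rand(n).astype(np.float32), dtype=float)
      y = wp.zeros(n, dtype=float)
      wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
      print(y.numpy()[:4])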
  • Tuesday, Mar (7:00 PM - 7:50 PM CET) A Deep Dive into the Latest HPC Software [S61203]
    • Jeff Larkin : HPC Architect, NVIDIA
    • Slides
    • No audio
  • Tuesday, (10:00 PM - 10:50 PM CET) Unlocking Developer Productivity across CPU and GPU with Mojo [S62220]
    • Chris Lattner : Chief Executive Officer and Co-Founder, Modular AI
    • Mostafa Hagog : Performance & Compiler Engineering Lead, Modular AI
    • Slides
  • Tuesday, Mar (11:00 PM - 11:50 PM CET) CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores [S61198]
    • Vijay Thakkar : Senior Architect, NVIDIA
    • Jack Kosaian : Senior Architect, NVIDIA
    • Slides
    • CUTLASS : C++ BLAS template library, 2.5M downloads/month
    • Handles fused matrix products and FP8 (E5M2 and E4M3)
    • Composable schedulers
    • 5-level conceptual hierarchy
    • Kernel layer, device layer : block- and grid-level concerns
    • Reduction of the API surface
    • Convolution in CUTLASS 3.5
    • Convolution collective
    • Arbitrary number of dimensions
    • GEMMs are just tensor contractions (GETT mapped to implicit GEMM)
    • They use Hopper TMA
    • Beta API with 1, 2 and 3 dimensions
    • Grouped GEMM and Mixture of Experts (MoE)
    • Weight quantization : use lighter weights, e.g. from FP16 to INT4, and use tensor cores for that GEMM
    • Epilogue Visitor Tree (EVT) : compose complex epilogues
    • Also available in Python
    • It is possible to extend CUTLASS with custom Kernels
    • CUTLASS 3.6 expected this summer, with sparsity support for Hopper
    • Mixed-precision GEMM is not supported
  • Wednesday, Mar 20 (1:30 AM - 1:45 AM CET) Practical Tips for using Grace Hopper to Dramatically Accelerate your Deep Learning and HPC pipelines [EXPT63133]
  • Wednesday, (4:00 PM - 4:25 PM CET) Perform High-Efficiency Search, Improve Data Freshness, and Increase Recall With GPU-Accelerated Vector Search and RAG Workflows [S62599]
    • Charles Xie : CEO, Zilliz
    • Corey Nolet : Principal Engineer, RAPIDS ML, NVIDIA
    • Slides
    • Vector database systems and RAG systems, and how to bring GPUs into them, introducing cuVS
    • Vectors for text, video, molecules
    • Milvus : the first vector database in the world (6 years ago)
    • Vector databases to leverage proprietary domain data
    • Retrieval-Augmented Generation (RAG) avoids hallucinations and improves accuracy
    • GPU acceleration on worker nodes
    • Brute-force algorithms are OK on GPU => CAGRA (almost zero build time and very high throughput)
    • CAGRA : find nearest neighbours on GPU (see the sketch below)
    • The CAGRA graph can also be used on CPU, and HNSW search is faster on a CAGRA-built graph
    • 1 TB of raw text becomes 2.5 TB of vectors (635M vectors of 1024 dimensions)
    • Wraps RAFT and increases language support
    • cuVS for C, C++, Python, Rust
    • A separate GPU is needed if cuVS is used while training a model
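    • A hedged sketch of GPU ANN search with CAGRA; the module path and parameter names below are assumptions based on the cuVS/RAFT Python interfaces and may differ from the released API:

      import cupy as cp
      from cuvs.neighbors import cagra    # assumed module path

      dataset = cp.random.random((100_000, 128), dtype=cp.float32)
      queries = cp.random.random((1_000, 128), dtype=cp.float32)

      index = cagra.build(cagra.IndexParams(), dataset)    # graph build is very fast on GPU
      distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)
      print(cp.asarray(neighbors)[:2])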
  • Wednesday, Mar (4:00 PM - 4:50 PM CET) Mastering CUDA C++: Modern Best Practices with the CUDA C++ Core Libraries [S62175]
    • Jake Hemstad, Software Engineering Manager, NVIDIA
    • Georgii Evtushenko : Senior Software Engineer, NVIDIA
    • no record
  • Wednesday, (6:00 PM - 6:25 PM CET) Performance Optimization for Grace CPU Superchip [S62275]
  • Wednesday, (6:00 PM - 6:50 PM CET) Legate: A Productive Programming Framework for Composable, Scalable, Accelerated Libraries [S62262]
    • Wonchan Lee : Senior Software Engineering Lead, NVIDIA
    • Manolis Papadakis : Senior Software Engineer, Legate Framework, NVIDIA
    • Slides
    • Use a wide range of accelerated hardware
    • Not easy to program
    • Bring NumPy, SciPy, and widely used Python frameworks to the GPU
    • Transparently scale programs
    • NumPy, SciPy, Pandas, scikit-learn, XGBoost, JAX, Zarr, HDF, ...
    • Each implementation should scale transparently
    • No expensive operation to go from one library to another
    • Common solution for these problems
    • First, the Legate runtime stack : Legate + Legion + Realm
    • Users do not have to change their code on new hardware
    • NumPy -> Legate cuNumeric (presented at GTC 2023) (see the sketch after this session's notes)
    • Pandas -> Legate DataFrame (new)
    • Scikit learn -> Legate ML (new)
    • XGBoost -> Legate Boost (new)
    • JAX -> Legate JAX (new)
    • HDF -> Legate IO (new)
    • Python program -> task graph into Legate Runtime
    • Problem : data partitioning
    • Sometimes data needs to be repartitioned
    • Problem if repartitioning happens outside of Legate
    • Legate Runtime : common way to deal with partitioning (you can add partitioning constraints)
    • Legate JAX : from the popular ML framework, with flexible tensor partitioning strategies
    • But no support to scale up to 1k GPUs
    • Augment the JAX pipelining parallelism
    • JAX uses the Legate XLA backend => performance comparable to the state of the art at 128, 256 or 512 GPUs, but with all libraries
    • RAPIDS + Legate :
    • RAPIDS : collection of fast, maintained kernels for ETL and ML (for Pandas, XGBoost and HDF)
    • Legate Boost : mixed-model boosting
    • SLAC : Stanford Linear Accelerator Center
    • Develop on your laptop and scale to a computing center for production
    • Legate IO : HDF5 and Zarr with GPUDirect Storage
    • Switching between CPU and GPU in one click
    • Example with a multiphysics solver : possibility to map computation onto CPU and GPU (slide 13)
    • Legate Sparse : Sparse matrix implementation
    • Task Fusion : MLIR Kernel Fusion (perf slide 14)
    • Complete C++ reimplementation of the core layer; it was previously in Python, which turned out to be the performance bottleneck
    • Legate JAX uses the C++ Legate core
    • And Legate STL : implement your C++ programs in a functional style (legate::stl::transform_reduce)
    • Nsight has a logical view and a physical view with Legate, to help users understand what is going on
    • conda install -c nvidia -c conda-forge -c legate cunumeric
    • https://github.com/nv-legate
    • Legate's design is more inspired by the STL than by TBB (so no TBB in this)
    • Will Legate also work on Jetson, Orin? Not really: the runtime overhead could be excessive on smaller CPUs and GPUs
    • Release in May or July 2024 (new Legate version with all the Python and C++ work)
    • The runtime does not care how you implement your library
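    • A minimal cuNumeric sketch of the "unchanged code, implicit partitioning" point above: the runtime turns these array operations into a task graph and partitions the arrays across the available GPUs (the resource flags are indicative, not taken from the talk):

      # e.g. run with: legate --nodes 1 --gpus 4 script.py
      import cunumeric as np

      A = np.random.rand(8192, 8192)
      x = np.random.rand(8192)
      # matmul and reduction become tasks; the runtime chooses the partitioning,
      # with no explicit communication or per-GPU code in the user program
      y = A @ x
      print(float(y.sum()))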
  • Wednesday, (10:00 PM - 10:50 PM CET) Multi GPU Programming Models for HPC and AI [S61339]
    • Jiri Kraus : Principal DevTech Compute, NVIDIA
    • Slides
    • Some NCCL, MPI, OpenSHMEM, NVSHMEM; worth watching if you develop communication-based computations
    • 8xH100 with non blocking communication
    • Domain decomposition : optimize the number of neighbours, or optimize the communications between neighbours
    • A lot of examples
    • Do the communication while computation is still running
    • NVSHMEM : direct communication between GPU kernels
    • NVSHMEM is faster with -dlto (link-time optimization)
    • Start by fusing kernels before and after communication
    • NCCL and NVSHMEM can be used with CUDA Graphs
    • The affinity between the network adapter and the GPU is very important
    • Better to use a CUDA-aware MPI on a cluster if possible (see the sketch below)
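    • A minimal sketch of the CUDA-aware MPI point: with mpi4py and a CUDA-aware MPI build, GPU arrays (here CuPy) can be passed to MPI calls directly, with no host staging (one rank per GPU; the script name is illustrative):

      # run: mpirun -np 2 python halo_sketch.py
      import cupy as cp
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

      send = cp.full(1 << 20, rank, dtype=cp.float32)
      recv = cp.empty_like(send)
      peer = (rank + 1) % comm.Get_size()
      cp.cuda.runtime.deviceSynchronize()   # make sure the send buffer is ready before MPI touches it
      # CUDA-aware MPI reads/writes the device buffers directly (via __cuda_array_interface__)
      comm.Sendrecv(send, dest=peer, recvbuf=recv, source=peer)
      print(rank, float(recv[0]))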
  • Wednesday, (11:00 PM - 11:50 PM CET) More Data, Faster: GPU Memory Management Best Practices in Python and C++ [S62550]
    • Mark Harris : Distinguished Engineer, NVIDIA
    • You have to watch this if you are a library developer (C++ or Python)
    • No metadata from the host allocator when it is used to allocate data on the device, because it cannot read it
    • Multiple workflows can be problematic: if each one allocates its own memory pool without cooperating, we run out of memory
    • cudaMalloc / cudaFree => synchronous, and the free is slow for safety
    • cudaFree adds bubbles to the pipeline
    • On 16 V100s => 88% of time spent in memory management, dropping to ~0 with a pool
    • cudaMallocAsync (CUDA 11.2)
    • RMM : fast and flexible memory management
    • Used in cuDF, cuML, RAFT
    • By default RMM uses cudaMalloc and cudaFree, but they can be replaced with a stream-ordered pool allocator (see the sketch below)
    • RMM is in transition (refactoring ongoing)
    • A deallocation may be reused by the next allocation, asynchronously
    • rmm::device_buffer : uninitialized bytes of device memory
    • rmm::device_uvector : like thrust::device_vector but uninitialized
    • Thrust does not know about streams
    • binning_memory_resource : separates small allocations from large ones, using a different resource if needed
    • Share memory pool between libraries
    • External allocator interface
    • Best Practices :
    • no raw allocation
    • Give your library an external allocator interface (allows libraries cooperation)
    • Use stream ordered allocation
    • Use stream ordered data containers
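    • A minimal sketch of the pool / shared-allocator advice above, using RMM's Python API (the pool size is arbitrary; module paths are as in recent RMM releases): libraries that use RMM's current device resource, plus CuPy via its allocator hook, then all draw from the same pool:

      import cupy as cp
      import rmm
      from rmm.allocators.cupy import rmm_cupy_allocator

      # one stream-ordered pool instead of many cudaMalloc/cudaFree calls
      pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(),
                                       initial_pool_size=2 ** 30)
      rmm.mr.set_current_device_resource(pool)

      cp.cuda.set_allocator(rmm_cupy_allocator)      # CuPy now allocates from the RMM pool

      buf = rmm.DeviceBuffer(size=1 << 20)           # uninitialized device bytes from the pool
      arr = cp.zeros(1 << 20, dtype=cp.float32)      # also served by the pool
      print(buf.size, arr.nbytes)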
  • Thursday, live (5:00 PM - 5:50 PM CET) How To Write A CUDA Program: The Ninja Edition [S62401]
    • Stephen Jones (SW), CUDA Architect, NVIDIA
    • How to think about writing a program
    • The last 10 percent of peak performance is hard to get
    • 1M Threads on H100
    • Data Parallelism : GPU (Throughput machine)
    • Task Parallelism : CPU (Latency machine)
    • GPU : expects 100 waves
    • Map data to threads
    • 10x in cache L2
    • Split the task so the data fits in cache
    • Don't start all your kernels from the top left, to preserve the previous kernel's cache contents
    • No all-to-all
    • You can write CUDA code in Python (CuPy, Numba) but you have to understand how the machine works (see the sketch below)
    • Tensor Parallelism is data Parallelism
    • No way to automate block size selection because it is problem-dependent
    • If you have many kernels, just use CUDA Graphs
    • Nsight gives info on the low-level cache usage of kernels (occupancy, latency, etc.)
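    • A small Numba CUDA sketch of the "map data to threads" / waves idea above: a fixed grid walks the whole array with a grid-stride loop, so the same kernel keeps every SM busy regardless of problem size (grid dimensions are arbitrary):

      import numpy as np
      from numba import cuda

      @cuda.jit
      def scale(out, x, a):
          start = cuda.grid(1)
          stride = cuda.gridsize(1)
          # grid-stride loop: threads are reused across the data in successive waves
          for i in range(start, x.size, stride):
              out[i] = a * x[i]

      x = cuda.to_device(np.random.rand(1 << 22).astype(np.float32))
      out = cuda.device_array_like(x)
      scale[1024, 256](out, x, 2.0)      # 1024 blocks x 256 threads, independent of x.size
      print(out.copy_to_host()[:4])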