12.5.5 : AI



Monday, Mar 17 6:00 PM - 6:40 PM CET : Blackwell Numerics for AI [S72458]
  • Paulius Micikevicius : NVIDIA
  • Recording problem during the session
  • Hoping a replay will be posted
  • Slides
  • FP4 for inference
  • MXFP8 : training in 8 bits
  • FP16 : one scale factor for the whole gradient
  • FP8 : one scale factor per tensor
  • FP4 : one scale factor per block of values (see the block-scaling sketch after this list)
  • The dot product has to be blocked accordingly, for example, and the scales stored in an efficient way (just for the tensor cores)
  • Successful training can need 10 to 20 binades
  • You don't want to round the scale to the nearest value but to the right range:
  • the decode scale is rounded up and the encode scale is rounded down
  • It is OK to have a few values flushed to 0
  • FLUX in FP4 : results are quite similar
  • PTQ : Post-Training Quantization (converting an FP16-trained model to FP4)
  • LLM FP4 PTQ : loses a bit of accuracy at inference
  • QAT : Quantization Aware Training
  • MXFP8 : easy path for 8-bit inference
  • If you are compute limited, you can add sparsity or use a lower precision
  • If you are bandwidth limited, it depends on which tensor is limiting the speed
  • Additions are generally still done in FP32 or FP16 (which is totally normal even with lower-precision operands)
  • Hopper : fast accumulator mode, but it loses accuracy
  • FP4 gives quite decent accuracy
  • The quantization operation is not differentiable (see the straight-through sketch after this list)
  • It is possible to get back to the BF16 score with FP4
  • He compares FP8 on H200 and B200, but there are two FP8 formats, so which one is used?
  • Algorithm to find the layers which can be well compressed and still keep accuracy
  • The KV cache can use a huge part of the GPU memory, so it is important to quantize it
  • Using NVIDIA Nemotron 4
  • SVDQuant decomposes GEMM layers into simpler parts to keep accuracy and reduce size (see the sketch after this list)
  • Distillation with Quantization
  • Integer quantization can work with diffusion models but does not produce the best results
  • Speculative decoding generates multiple tokens per decoding step
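
A minimal sketch of the per-block scaling idea above, assuming a block size of 16 and an FP4 E2M1-style magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}; the block size, the power-of-two scale, and the rounding choices are illustrative assumptions, not the exact Blackwell/MX recipe.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1-style format (illustrative assumption).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise_fp4(x, block_size=16):
    """Quantize a 1-D tensor with one scale factor per block of `block_size` values.

    The encode scale is rounded *down* to a power of two so that, after scaling,
    no value overflows the FP4 grid; a few small values may flush to zero instead.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    max_level = FP4_LEVELS[-1]                         # largest FP4 magnitude (6.0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    # Encode scale, rounded down to a power of two (floor in log2).
    scale = 2.0 ** np.floor(np.log2(max_level / amax))

    scaled = blocks * scale                            # bring each block into FP4 range
    # Round every scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    q = np.sign(scaled) * FP4_LEVELS[idx]

    dequantized = (q / scale).reshape(-1)[: len(x)]    # decode: divide by the stored scale
    return q, scale, dequantized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=64)
    _, _, w_hat = quantize_blockwise_fp4(w)
    print("max abs error:", np.abs(w - w_hat).max())
```

Storing one scale per block, instead of one per tensor, keeps small-magnitude blocks from being crushed by a single global scale, which is the point of the MX-style formats discussed in the talk.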
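
Since the quantization op is not differentiable, QAT commonly backpropagates through it with a straight-through estimator; a minimal PyTorch sketch, assuming a simple symmetric fake-quantizer on a 4-bit grid (the class name and grid are illustrative, not the speaker's implementation).

```python
import torch

NUM_LEVELS = 15  # symmetric 4-bit grid (illustrative choice)

class FakeQuantSTE(torch.autograd.Function):
    """Fake quantization with a straight-through estimator (STE).

    Forward: scale, round to the grid, and rescale (quantize/dequantize).
    Backward: pass the gradient through unchanged, as if the rounding were
    the identity -- the usual workaround for its zero-almost-everywhere gradient.
    """

    @staticmethod
    def forward(ctx, x):
        half = NUM_LEVELS // 2
        scale = x.abs().max().clamp(min=1e-8) / half
        return torch.round(x / scale).clamp(-half, half) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: identity gradient

if __name__ == "__main__":
    w = torch.randn(8, requires_grad=True)
    loss = FakeQuantSTE.apply(w).pow(2).sum()
    loss.backward()
    print(w.grad)  # nonzero gradients despite the rounding in the forward pass
```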
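
On SVDQuant: the idea is that a small low-rank branch, kept in high precision, absorbs the outliers so the residual quantizes better. A minimal numpy sketch under assumed settings (rank 8, symmetric 4-bit uniform grid for the residual); this illustrates the decomposition, not the paper's exact algorithm.

```python
import numpy as np

def svdquant_like(W, rank=8, levels=15):
    """Approximate W as (low-rank high-precision branch) + (quantized residual).

    The low-rank branch absorbs the largest singular directions (outliers),
    leaving a flatter residual that survives coarse quantization better.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # kept in 16/32-bit in practice
    residual = W - low_rank
    half = levels // 2
    scale = max(np.abs(residual).max(), 1e-8) / half
    resid_q = np.round(residual / scale).clip(-half, half) * scale
    return low_rank + resid_q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256))
    W[0, 0] = 50.0                                    # an outlier that wrecks naive quantization
    for r in (0, 8):                                  # rank 0 == plain quantization
        err = np.abs(W - svdquant_like(W, rank=r)).mean()
        print(f"rank {r}: mean abs error {err:.4f}")
```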


Monday, Mar 17 6:00 PM - 6:40 PM CET : Building Generative AI for a Billion Indian Voices [S73151]
  • Pratyush Kumar : Co-Founder, Sarvam AI
  • Sunil Gupta : Co-Founder, Managing Director, and Chief Executive Officer, Yotta Data Centres
  • Slides
  • This guy is talking super fast
  • Tokenizer very efficient on Indian languages (10 languages + English); see the fertility sketch after this list
  • From 52 to 296 tok/s on H100 (with 32 streams)
  • 20 cents per 100 tokens
  • Text-to-speech for all Indian languages
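
One simple way to check how efficient a tokenizer is on a given language is its fertility (tokens per word), which drives both cost and tok/s figures like those above. A minimal sketch assuming a Hugging Face tokenizer; the checkpoint name and sample sentences are placeholders, not Sarvam's tokenizer or data.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.encode(s, add_special_tokens=False)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

if __name__ == "__main__":
    # Placeholder checkpoint and text; swap in the tokenizer and corpus you care about.
    tok = AutoTokenizer.from_pretrained("gpt2")
    samples = ["Generative AI for a billion voices.", "नमस्ते, आप कैसे हैं?"]
    print(f"fertility: {fertility(tok, samples):.2f} tokens per word")
```

Lower fertility on Indian-language text means fewer tokens per sentence, hence cheaper and faster inference.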


Tuesday, Mar 18 11:00 PM - 11:40 PM CET : Accelerate Inference on NVIDIA GPUs [S72330]
  • Ce Zhang : CTO, Together AI
  • Slides
  • Very interesting presentation, but it is better to have the slides because there are a lot of diagrams
  • Why only 80% of the hardware?
  • Attention is time-consuming => replace it with Hyena (convolution-based)
  • Speculative prefill : you can keep only 10% of the tokens as input
  • A good model is a good speculator (see the speculative decoding sketch after this list)
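
Both this talk and the Blackwell numerics session touch on speculative decoding: a small draft model proposes several tokens and the target model verifies them in one pass. A minimal greedy-verification sketch, assuming toy `draft_next` / `target_next` functions (placeholders, not Together AI's implementation); real systems use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],    # cheap draft model: next-token guess
    target_next: Callable[[List[int]], int],   # expensive target model: ground truth
    k: int = 4,
    max_new: int = 32,
) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the longest verified prefix."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:  # may slightly overshoot max_new
        # 1. Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: the target model checks each position (batched in practice).
        accepted = 0
        for i in range(k):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        # 3. Keep the verified prefix, plus one token from the target model itself,
        #    so at least one new token is produced per iteration.
        tokens += draft[:accepted]
        tokens.append(target_next(tokens))
    return tokens

if __name__ == "__main__":
    # Toy models over integer "tokens": both just emit the last token + 1.
    draft = lambda t: t[-1] + 1
    target = lambda t: t[-1] + 1
    print(speculative_decode([0], draft, target, k=4, max_new=8))
```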


Wednesday, Mar 19 4:00 PM - 4:40 PM CET : Bending Scaling Laws with Brighter Algorithms [S73204]
  • Yejin Choi : Senior Research Director, NVIDIA
  • Slides
  • The era of brute force scaling is over
  • SuperBPE makes tokens bigger (it first learns subwords, then superwords); see the sketch after this list
  • Fewer tokens to train over
  • Me : from syllables to words to expressions
  • Socratic MCTS : maieutics : questioning one's way to knowledge
  • The LLM generates questions during the reasoning
  • But the verifier is very hard to build
  • Between 1 and 9 minutes per question (average score 0.748, while GPT-4o (0513) is at 0.540 and Claude 3.5 Sonnet at 0.550)
  • Retro search for better reasoning
  • Challenge the status quo about searches
  • LLMs can generate a very long chain of thought and still end up at the wrong answer (underthinking)
  • Sometimes shorter reasoning is better
  • Look at alternative paths and see if they lead to a good answer
  • Test-time training
  • Unnatural split between training and testing
  • Relevant for long video content
  • Classic transformers cannot handle a one-minute video because it would take about 1,000,000 tokens (see the rough estimate after this list)
  • Learning to learn at test time (personally, I thought the test set was there to prevent overfitting, but why not)
  • In this paradigm, hidden states are considered as part of the model
  • On Tom and Jerry, it looks like Tom and Jerry, but the animation and overall look are really far from the original William Hanna and Joseph Barbera work.
  • Even Chuck Jones's Tom and Jerry looks better...
  • The art of new and diverse synthetic data generation
  • Need a metric for diversity; topic diversity does not matter much for mathematical reasoning
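
On SuperBPE: a toy illustration of why letting tokens cross word boundaries ("superwords") shrinks the token count. The greedy longest-match tokenizer and the tiny vocabularies below are assumptions for illustration, not the actual SuperBPE training procedure.

```python
# Toy vocabularies: the second one also contains multi-word "superwords".
SUBWORD_VOCAB = {"in", "the", "united", "states", "of", "america", " ", "un", "it", "ed"}
SUPERWORD_VOCAB = SUBWORD_VOCAB | {"the united states", "of america"}

def tokenize(text, vocab):
    """Greedy longest-match segmentation (falls back to single characters)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                                       # unknown piece: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

if __name__ == "__main__":
    text = "in the united states of america"
    print(len(tokenize(text, SUBWORD_VOCAB)), "tokens with subwords only")
    print(len(tokenize(text, SUPERWORD_VOCAB)), "tokens with superwords allowed")
```

With the superword vocabulary the same sentence needs 5 tokens instead of 11, which is the "fewer tokens to train over" effect noted above.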
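
A rough back-of-the-envelope check of the ~1,000,000-token figure for one minute of video; the frame rate, resolution, and patch size below are my assumptions for illustration, not numbers from the talk.

```python
# Rough token count for one minute of video under assumed settings.
fps = 24                     # assumed frame rate
seconds = 60
height, width = 256, 256     # assumed per-frame resolution fed to the model
patch = 16                   # assumed ViT-style patch size (16x16 pixels per token)

tokens_per_frame = (height // patch) * (width // patch)   # 16 * 16 = 256
total_tokens = fps * seconds * tokens_per_frame           # 24 * 60 * 256 = 368,640

print(f"{tokens_per_frame} tokens/frame, {total_tokens:,} tokens/minute")
# With higher resolution or denser temporal sampling this quickly reaches the
# ~1M tokens mentioned in the talk, which is why quadratic attention struggles.
```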


Wednesday, Mar 19 7:00 PM - 7:40 PM CET : Advancing AI Reasoning: From Game Theoretic Methods to Complex Problem Solving [S73862]
  • Vartika Singh : Strategic AI Lead, NVIDIA
  • Noam Brown : Research Scientist, OpenAI (2017 : two-player poker AI, 2019 : multiplayer poker, then extended to Diplomacy; the reasoning techniques for poker, chess, and Diplomacy were all different, and the idea was to find a way to generalize them)
  • Bryan Catanzaro : VP, Applied Deep Learning Research, NVIDIA. cuDNN, DLSS. Every day I try to make researchers work together.
  • How do we improve inference? It was not a priority at the time for poker (games with partial information)
  • There wasn't a universal way to train on poker, chess, Diplomacy
  • Pluribus for poker : $150 to train a model at the time, on 28 CPU cores
  • Diplomacy games can stretch over several weeks or months
  • The poker model was good so let's try on diplomacy
  • Matrix multiplication is the best way to create a computer that can do a lot of work
  • It is not trivial to turn matrix multiplication into intelligence
  • We have to develop a reasoning technique that is as performant as deep learning is for images
  • OpenAI o1 creates a new paradigm, as AlexNet did several years ago
  • Reporting model performance as a single number is not relevant anymore because inference time is not taken into account
  • Yes, there are things the models cannot do today, but people are working on them


Thursday, Mar 20 8:00 AM - 8:40 AM CET : Profile Large Language Model Trainings on the Grace Hopper Superchip [S72967]
  • Giuseppe Fiameni : Solutions Architect, NVIDIA
  • Karin Sevegnani : Solutions Architect, NVIDIA
  • Slides
  • Both CPU and GPU super critical for computing tasks
  • PCIe is a bottleneck for CPU-GPU programming => Virtual Unified Memory on the Grace Hopper Superchip
  • NeMo parallelizes computation over 100,000s of nodes
  • NeMo integrates with the NVIDIA Triton Inference Server
  • Rocky Linux 9.3 with custom kernel
  • LoRA : efficient fine tuning method for Llama
  • nsys : -y to delay when profiling starts and -d to specify the profiling duration (see the NVTX sketch after this list)
  • Profiling show the program is compute bound and not memory bound
  • Waits caused by PyTorch autograd
  • Offloading : memory that is kept on the GPU during training is temporarily moved to the CPU => it reduces GPU memory use and GPU utilization, but adds more CPU-GPU synchronization
  • Mixed-precision training to optimize training
  • FP8 training : speed improvement and efficiency on Hopper
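
To make training phases easy to spot in an nsys timeline, the usual trick is to wrap them in NVTX ranges; a minimal PyTorch sketch, assuming a toy model and optimizer (the range names and the `nsys profile` flags in the comment are illustrative choices).

```python
import torch
import torch.nn as nn

# Toy model and data; replace with the real LLM training step.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")

# Run under Nsight Systems, e.g.:
#   nsys profile -t cuda,nvtx -y 5 -d 30 -o train_profile python train_step.py
for step in range(100):
    torch.cuda.nvtx.range_push(f"step {step}")

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).pow(2).mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # end of "step"
```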


Tuesday, Mar 18 11:00 PM - 11:40 PM CET : The Impact of AI on Filmmaking: Real-World Transformation [S73743]
  • Jo Plaete : CIO and VFX Supervisor, Metaphysic
  • Ed Ulbrich : Chief Content Officer and President of Production, Metaphysic
  • Very interesting, but a lot of visuals, so hard to take notes
  • Slides
  • From 3D asset rendering to neural network inference
  • Real-time generative AI
  • They take care of licensing for training