12.5.5 : AI
Monday, Mar 17 6:00 PM - 6:40 PM CET : Blackwell Numerics for AI [S72458]
- Paulius Micikevicius : NVIDIA
- recording problem
- hope for replay
- Slides
- FP4 for inference
- MXFP8 : training in 8 bits
- FP16 : one scale factor for the whole gradient (loss scaling)
- FP8 : one scale factor per tensor
- FP4 : one scale factor per block
- Dot products have to be blocked accordingly (for GEMMs, for example), and the scales stored in an efficient layout (just for the tensor cores)
- Successful training can need 10 to 20 binades
- For the scales, you don't want to round to the nearest value but toward the safe range
- The decode scale is rounded up and the encode scale is rounded down
- It is OK to have a few values flushed to 0 (see the block-scaling sketch after this session's notes)
- FLUX in FP4 : results quite similar
- PTQ : Post-Training Quantization (converting an FP16-trained model to FP4)
- LLM FP4 PTQ : loses a bit of accuracy at inference
- QAT : Quantization Aware Training
- MXFP8 : easy path for 8-bits inference
- If you are compute limited, you can add sparsity or use a lower precision
- If you are bandwidth limited, it depends on which tensor is limiting the speed
- Additions (accumulations) are generally still done in FP32 or FP16 (which is normal even when the operands are in lower precision)
- Hopper : has a fast accumulator mode, but it loses accuracy
- FP4 quite decent accuracy
- The quantization operation is not differentiable (see the straight-through estimator sketch after this session's notes)
- Possible to get back to BF16-level scores with FP4
- He compares FP8 on H200 and B200, but there are two types of FP8, so which one is used?
- Algorithm to find the layers which can be well compressed and still keep accuracy
- KV cache can use a huge part of the GPU memory so it is important to quantize it
- Using NVIDIA Nemotron-4
- SVDQuant decomposes a GEMM layer into simpler parts (a low-rank branch plus a quantized residual) to keep accuracy and reduce size; see the sketch after this session's notes
- Distillation with Quantization
- Integer quantization can work with diffusion models but does not produce the best results
- Speculative decoding computes multiple tokens in one generation step
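To make the scale-factor notes above concrete, here is a minimal NumPy sketch of block-scaled FP4 (E2M1) quantization. The block size of 32, the power-of-two scales, and the rounding direction (decode scale rounded up so the encode scale is rounded down and no block value overflows the FP4 range) are my reading of the talk, not code from the slides.

```python
# Minimal sketch of block-scaled FP4 (E2M1) quantization, NumPy only.
# Assumptions (mine): block size 32, power-of-two per-block scales,
# tensor length divisible by the block size.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = 6.0
BLOCK = 32

def quantize_block_fp4(x):
    """Quantize a 1-D tensor block-wise to FP4 with one scale per block."""
    x = x.reshape(-1, BLOCK)                      # one row per block
    amax = np.abs(x).max(axis=1, keepdims=True)   # per-block absolute max
    # decode scale = amax / FP4_MAX, rounded UP to the next power of two,
    # so the encode scale (its inverse) is rounded down and nothing overflows
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / FP4_MAX))
    scaled = x / scale                            # now |scaled| <= 6.0
    # round each magnitude to the nearest representable FP4 value
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale                               # store 4-bit codes + scales

def dequantize(q, scale):
    return (q * scale).reshape(-1)

x = np.random.randn(4 * BLOCK).astype(np.float32)
q, s = quantize_block_fp4(x)
print("max abs error:", np.abs(x - dequantize(q, s)).max())
```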
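Since the quantization op has zero gradient almost everywhere, QAT usually trains through it with a straight-through estimator (STE). A minimal PyTorch sketch of the general trick (my illustration of STE, not the speaker's recipe):

```python
# Straight-through estimator: apply fake quantization in the forward pass,
# but pass the gradient through unchanged in the backward pass.
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax      # symmetric per-tensor scale
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # pretend the quantizer was the identity
        return grad_output, None

x = torch.randn(16, requires_grad=True)
FakeQuant.apply(x).sum().backward()
print(x.grad)  # all ones, as if no quantization had happened
```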
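A rough sketch of the SVDQuant idea as summarized above: keep a small low-rank piece of the weight in high precision and quantize the residual aggressively (here a crude 4-bit uniform quantizer). This is my simplification; the real method also handles activation outliers, which is omitted here.

```python
# Split a weight matrix into a low-rank branch (kept in FP16/BF16) plus a
# 4-bit quantized residual, then check the reconstruction error.
import numpy as np

def svd_split_quant(W, rank=16, bits=4):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]       # high-precision branch
    residual = W - low_rank
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(residual).max() / qmax
    q = np.round(residual / scale).clip(-qmax, qmax)      # int4 codes
    return low_rank, q, scale

W = np.random.randn(256, 256).astype(np.float32)
low_rank, q, scale = svd_split_quant(W)
W_hat = low_rank + q * scale
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```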
Monday, Mar 17 6:00 PM - 6:40 PM CET : Building Generative AI for a Billion Indian Voices [S73151]
- Pratyush Kumar : Co-Founder, Sarvam AI
- Sunil Gupta : Co-Founder, Managing Director, and Chief Executive Officer, Yotta Data Centres
- Slides
- This guy is talking super fast
- Tokenizer very efficient on Indian languages (10 languages + English); see the tokenizer-fertility sketch after this session's notes
- From 52 to 296 tok/s on H100 (with 32 streams)
- 20 cents per 100 tokens
- Text-to-speech for all Indian languages
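A quick way to see why an Indic-optimized tokenizer matters is to measure fertility (tokens per word) on sample text: fewer tokens per word means fewer decoding steps and cheaper inference. Hypothetical sketch with Hugging Face tokenizers; "gpt2" is only a stand-in to compare against, not Sarvam's tokenizer:

```python
# Compare tokenizer fertility (tokens per whitespace-separated word) on
# an English and a Hindi sample; swap in the tokenizer you want to test.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
samples = {
    "English": "The weather is pleasant today.",
    "Hindi": "आज मौसम सुहावना है।",
}
for lang, text in samples.items():
    n_tokens = len(tok(text)["input_ids"])
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f}")
```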
Tuesday, Mar 18 11:00 PM - 11:40 PM CET : Accelerate Inference on NVIDIA GPUs [S72330]
- Ce Zhang : CTO, Together AI
- Slides
- Very interesting presentation, but it is better to have the slides because there are a lot of diagrams
- Why only 80% of the hardware?
- Attention is time-consuming => replace it with Hyena (convolution); see the long-convolution sketch after this session's notes
- Speculative prefill : you can keep only 10% of the tokens as input
- A good model is a good speculator (see the speculative-decoding sketch after this session's notes)
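The core of a Hyena-style replacement for attention is a sequence-length ("long") convolution plus gating, computed in O(L log L) with an FFT instead of attention's O(L^2). A minimal PyTorch sketch of that operator with random stand-in filters (the real layer learns its filters implicitly and stacks several of these):

```python
# Causal depthwise long convolution via FFT, the building block behind
# Hyena-style operators. Gating and filters here are random placeholders.
import torch

def fft_long_conv(u, k):
    """u: (batch, length, dim), k: (length, dim) -> causal conv along length."""
    L = u.shape[1]
    n = 2 * L                                    # zero-pad to avoid wrap-around
    u_f = torch.fft.rfft(u, n=n, dim=1)
    k_f = torch.fft.rfft(k, n=n, dim=0)
    return torch.fft.irfft(u_f * k_f.unsqueeze(0), n=n, dim=1)[:, :L]

B, L, D = 2, 1024, 64
u = torch.randn(B, L, D)
k = torch.randn(L, D) / L                        # stand-in for the implicit long filter
gate = torch.sigmoid(torch.randn(B, L, D))       # Hyena combines this with gating
y = gate * fft_long_conv(u, k)
print(y.shape)                                   # torch.Size([2, 1024, 64])
```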
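"A good model is a good speculator" refers to speculative decoding: a cheap draft model proposes several tokens and the large target model checks them in one pass, so several tokens come out per generation step. A toy greedy sketch with stand-in models (my illustration; real systems use proper acceptance sampling and a batched verification pass):

```python
# Toy greedy speculative decoding: draft k tokens with a cheap model, keep the
# longest prefix the target model agrees with, then emit one target token.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def draft_next(ctx):    # stand-in for the small, fast draft model
    return int((sum(ctx) * 7 + 3) % VOCAB)

def target_next(ctx):   # stand-in for the large model; agrees ~80% of the time
    tok = draft_next(ctx)
    return tok if rng.random() < 0.8 else int(rng.integers(VOCAB))

def speculative_step(ctx, k=4):
    proposal, tmp = [], list(ctx)
    for _ in range(k):                 # 1) draft k tokens autoregressively
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    accepted, tmp = [], list(ctx)
    for t in proposal:                 # 2) verify (one batched pass in practice)
        if target_next(tmp) != t:
            break
        accepted.append(t)
        tmp.append(t)
    if len(accepted) < k:              # 3) on rejection, emit the target's own token
        accepted.append(target_next(tmp))
    return accepted

print(speculative_step([1, 2, 3]))     # usually more than one token per step
```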
Wednesday, Mar 19 4:00 PM - 4:40 PM CET : Bending Scaling Laws with Brighter Algorithms [S73204]
- Yejin Choi : Senior Research Director, NVIDIA
- Slides
- The era of brute force scaling is over
- SuperBPE makes tokens bigger (it first learns subwords, then superwords); see the toy sketch after this session's notes
- Fewer tokens to train over
- Me : like going from syllables to words to expressions
- Socratic MCTS : maieutics, reaching knowledge through questioning
- The LLM generates questions during reasoning
- But the verifier is very hard to build
- Between 1 and 9 minutes per question (average score 0.748, vs. 0.540 for GPT-4o (0513) and 0.550 for Claude 3.5 Sonnet)
- Retro search for better reasoning
- Challenge the status quo about searches
- An LLM can generate a very long thought and still end up at the wrong answer (underthinking)
- Sometimes shorter reasoning is better
- Look at alternative paths and see if they lead to a good answer
- Test-time training (TTT); see the TTT sketch after this session's notes
- There is an unnatural split between training and testing
- Relevant for long video content
- Classic transformers cannot handle a one-minute video because it would take about 1,000,000 tokens
- Learning to learn at test time (personally, I thought the test set was there to prevent overfitting, but why not)
- In this paradigm, hidden states are considered as part of the model
- On Tom and Jerry, the result is shaped like Tom and Jerry, but the animation is really far from the original work of William Hanna and Joseph Barbera.
- Even Chuck Jones's Tom and Jerry looks better...
- The art of new and diverse synthetic data generation
- Need a metric for diversity; diverse topics do not matter for math, what counts is diversity in the mathematical reasoning
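A toy sketch of the two-stage SuperBPE idea noted above: ordinary BPE merges that stay inside word boundaries first (subwords), then merges that may cross whitespace and form multi-word superwords, which shrinks the token count. This is my own simplification, only to show the mechanism:

```python
# Toy two-stage BPE: stage 1 forbids merges across the word boundary marker,
# stage 2 allows them, so frequent word sequences collapse into "superwords".
from collections import Counter

BOUNDARY = "▁"  # marks "this token starts a new word"

def merge_pair(line, a, b):
    out, i = [], 0
    while i < len(line):
        if i + 1 < len(line) and line[i] == a and line[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(line[i]); i += 1
    return out

def learn_bpe(corpus, n_merges, n_subword_merges):
    seqs = []
    for line in corpus:  # prefix non-initial words with the boundary, then split to chars
        words = [(BOUNDARY + w if i else w) for i, w in enumerate(line.split())]
        seqs.append([ch for w in words for ch in w])
    for step in range(n_merges):
        allow_cross_word = step >= n_subword_merges      # stage 2: superwords allowed
        pairs = Counter()
        for line in seqs:
            for a, b in zip(line, line[1:]):
                if not allow_cross_word and b.startswith(BOUNDARY):
                    continue                              # stage 1: stay inside words
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]               # most frequent pair wins
        seqs = [merge_pair(line, a, b) for line in seqs]
    return seqs

corpus = ["by the way the model is by the way quite good"] * 4
subword_only = learn_bpe(corpus, 40, n_subword_merges=40)
superbpe     = learn_bpe(corpus, 40, n_subword_merges=20)
print(len(subword_only[0]), len(superbpe[0]))  # superwords leave fewer tokens per line
```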
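And a minimal sketch of the test-time-training idea: the hidden state is itself a tiny model (here just a linear map W) that is updated with a gradient step on a self-supervised reconstruction loss at every time step, including at inference. The dimensions, learning rate, and loss are my assumptions, not the talk's exact formulation:

```python
# Test-time training, linear version: the "hidden state" W is updated by one
# gradient step per token on the loss 0.5 * ||W x - x||^2, then used as output.
import torch

D, LR = 32, 0.1
W = torch.zeros(D, D)                    # hidden state = parameters of a tiny model

def ttt_step(W, x):
    x = x / (x.norm() + 1e-6)
    pred = W @ x                         # try to reconstruct the current token
    grad = torch.outer(pred - x, x)      # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    W = W - LR * grad                    # inner-loop update, even at inference
    return W, W @ x                      # updated state and this step's output

outputs = []
for x in torch.randn(100, D):            # a toy "sequence" of 100 tokens
    W, y = ttt_step(W, x)
    outputs.append(y)
print(torch.stack(outputs).shape)        # torch.Size([100, 32])
```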
Wednesday, Mar 19 7:00 PM - 7:40 PM CET : Advancing AI Reasoning: From Game Theoretic Methods to Complex Problem Solving [S73862]
- Vartika Singh : Strategic AI Lead, NVIDIA
- Noam Brown : Research Scientist, OpenAI (2017 : two-player poker AI; 2019 : multiplayer poker; then extended to Diplomacy. The reasoning techniques for poker, chess, and Diplomacy were all different, and the idea was to find a way to generalize them all.)
- Bryan Catanzaro : VP, Applied Deep Learning Research, NVIDIA. cuDNN, DLSS. "Every day I try to make researchers work together."
- How do we improve inference? It was not a priority at the time for poker (a game with partial information)
- There wasn't a universal way to train on poker, chess, and Diplomacy
- Pluribus for poker : about \$150 to train the model at the time, on 28 CPU cores
- Diplomacy games can stretch over several weeks or months
- The poker model was good so let's try on diplomacy
- Matrix multiplication is the best way to create a computer that can do a lot of work
- It is not trivial to turn matrix multiplication into intelligence
- We have to develop a reasoning technique that is as effective as deep learning was for images
- OpenAI o1 creates a new paradigm, like AlexNet several years ago
- Reporting model performance as a single score is not relevant anymore because inference time is not taken into account
- Yes, there are things the models cannot do today, but people are working on it
Thursday, Mar 20 8:00 AM - 8:40 AM CET : Profile Large Language Model Trainings on the Grace Hopper Superchip [S72967]
- Giuseppe Fiameni : Solutions Architect, NVIDIA
- Karin Sevegnani : Solutions Architect, NVIDIA
- Slides
- Both CPU and GPU super critical for computing tasks
- PCIe is a bottleneck for CPU-GPU programming => unified memory on the Grace Hopper Superchip
- NeMo parallelizes computation over 100,000s of nodes
- NeMo integrates with the NVIDIA Triton Inference Server
- Rocky Linux 9.3 with custom kernel
- LoRA : efficient fine tuning method for Llama
- nsys : -d to specify the duration of the profiling and -y to delay when profiling starts (see the annotated training-step sketch after this session's notes)
- Profiling shows the program is compute bound, not memory bound
- Waits caused by PyTorch autograd
- Offloading : tensors normally kept in GPU memory during training are temporarily moved to the CPU => it reduces GPU memory usage and GPU utilization, but adds more CPU-GPU synchronization
- Mixed-precision training to speed up training
- FP8 training : speed and efficiency improvements on Hopper
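A sketch of how a training step can be annotated with NVTX ranges so the forward, backward, and optimizer phases show up on the nsys timeline, together with the delayed, fixed-duration capture mentioned above. The model, batch, and command line are placeholders of mine, not the exact setup from the talk:

```python
# Annotate a training step with NVTX ranges for Nsight Systems, using
# bf16 autocast for mixed precision. Capture with a delayed window, e.g.:
#   nsys profile -y 60 -d 120 -o train_profile python train.py
#   (-y: delay before collection starts, in seconds; -d: collection duration)
import torch

model = torch.nn.Linear(4096, 4096).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch):
    torch.cuda.nvtx.range_push("forward")
    with torch.autocast("cuda", dtype=torch.bfloat16):    # mixed precision
        loss = model(batch).float().pow(2).mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    opt.step()
    opt.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()

for _ in range(10):
    train_step(torch.randn(64, 4096, device="cuda"))
```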
Tuesday, Mar 18 11:00 PM - 11:40 PM CET : The Impact of AI on Filmmaking: Real-World Transformation [S73743]
- Jo Plaete : CIO and VFX Supervisor, Metaphysic
- Ed Ulbrich : Chief Content Officer and President of Production, Metaphysic
- Very interesting, but a lot of visuals, so hard to take notes
- Slides
- From 3D asset rendering to neural network inference
- Real time generative AI
- They take care of licensing for the training data