12.5.5 : AI
Monday, Mar 17 6:00 PM - 6:40 PM CET : Blackwell Numerics for AI [S72458]
- Paulius Micikevicius : NVIDIA
- recording problem
- hope for replay
- Slides
- FP4 for inference
- MXFP8 : training in 8 bits
- FP16 : one scale factor for the whole gradient (loss scaling)
- FP8 : one scale factor per tensor
- FP4 : one scale factor per block
- Dot products have to be blocked accordingly (for GEMMs, for example), and the scales stored in an efficient layout (just for the tensor cores)
- Successful training can need 10 to 20 binades
- For the scales, you don't want to round to the nearest value but toward the safe range
- The decode scale is rounded up and the encode scale is rounded down
- It is OK to have a few values flushed to 0 (see the block-scaling sketch after this session's notes)
- FLUX in FP4 : results quite similar
- PTQ : Post-Training Quantization (converting an FP16-trained model to FP4)
- LLM FP4 PTQ : loses a bit of accuracy at inference
- QAT : Quantization Aware Training
- MXFP8 : easy path for 8-bits inference
- If you are compute limited, you can add sparsity or use a lower precision
- If you are bandwidth limited, it depends on which tensor is limiting the speed
- Additions (accumulations) are generally still done in FP32 or FP16 (which is normal even when the operands are in lower precision)
- Hopper : has a fast accumulator mode, but it loses accuracy
- FP4 quite decent accuracy
- The quantization operation is not differentiable (see the straight-through estimator sketch after this session's notes)
- Possible to get back to BF16-level scores with FP4
- He compares FP8 on H200 and B200, but there are two types of FP8, so which one is used?
- Algorithm to find the layers which can be well compressed and still keep accuracy
- KV cache can use a huge part of the GPU memory so it is important to quantize it
- Using NVIDIA Nemotron-4
- SVDQuant decomposes a GEMM layer into simpler parts (a low-rank branch plus a quantized residual) to keep accuracy and reduce size; see the sketch after this session's notes
- Distillation with Quantization
- Integer quantization can work with diffusion models but does not produce the best results
- Speculative decoding computes multiple tokens in one generation step
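To make the scale-factor notes above concrete, here is a minimal NumPy sketch of block-scaled FP4 (E2M1) quantization. The block size of 32, the power-of-two scales, and the rounding direction (decode scale rounded up so the encode scale is rounded down and no block value overflows the FP4 range) are my reading of the talk, not code from the slides.

```python
# Minimal sketch of block-scaled FP4 (E2M1) quantization, NumPy only.
# Assumptions (mine): block size 32, power-of-two per-block scales,
# tensor length divisible by the block size.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = 6.0
BLOCK = 32

def quantize_block_fp4(x):
    """Quantize a 1-D tensor block-wise to FP4 with one scale per block."""
    x = x.reshape(-1, BLOCK)                      # one row per block
    amax = np.abs(x).max(axis=1, keepdims=True)   # per-block absolute max
    # decode scale = amax / FP4_MAX, rounded UP to the next power of two,
    # so the encode scale (its inverse) is rounded down and nothing overflows
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / FP4_MAX))
    scaled = x / scale                            # now |scaled| <= 6.0
    # round each magnitude to the nearest representable FP4 value
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale                               # store 4-bit codes + scales

def dequantize(q, scale):
    return (q * scale).reshape(-1)

x = np.random.randn(4 * BLOCK).astype(np.float32)
q, s = quantize_block_fp4(x)
print("max abs error:", np.abs(x - dequantize(q, s)).max())
```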
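Since the quantization op has zero gradient almost everywhere, QAT usually trains through it with a straight-through estimator (STE). A minimal PyTorch sketch of the general trick (my illustration of STE, not the speaker's recipe):

```python
# Straight-through estimator: apply fake quantization in the forward pass,
# but pass the gradient through unchanged in the backward pass.
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax      # symmetric per-tensor scale
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # pretend the quantizer was the identity
        return grad_output, None

x = torch.randn(16, requires_grad=True)
FakeQuant.apply(x).sum().backward()
print(x.grad)  # all ones, as if no quantization had happened
```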
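A rough sketch of the SVDQuant idea as summarized above: keep a small low-rank piece of the weight in high precision and quantize the residual aggressively (here a crude 4-bit uniform quantizer). This is my simplification; the real method also handles activation outliers, which is omitted here.

```python
# Split a weight matrix into a low-rank branch (kept in FP16/BF16) plus a
# 4-bit quantized residual, then check the reconstruction error.
import numpy as np

def svd_split_quant(W, rank=16, bits=4):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]       # high-precision branch
    residual = W - low_rank
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(residual).max() / qmax
    q = np.round(residual / scale).clip(-qmax, qmax)      # int4 codes
    return low_rank, q, scale

W = np.random.randn(256, 256).astype(np.float32)
low_rank, q, scale = svd_split_quant(W)
W_hat = low_rank + q * scale
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```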
Monday, Mar 17 6:00 PM - 6:40 PM CET : Building Generative AI for a Billion Indian Voices [S73151]
- Pratyush Kumar : Co-Founder, Sarvam AI
- Sunil Gupta : Co-Founder, Managing Director, and Chief Executive Officer, Yotta Data Centres
- Slides
- This guy is talking super fast
- Tokenizer very efficient on Indian languages (10 languages + English); see the tokenizer-fertility sketch after this session's notes
- From 52 to 296 tok/s on H100 (with 32 streams)
- 20 cents per 100 tokens
- Text-to-speech for all Indian languages
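A quick way to see why an Indic-optimized tokenizer matters is to measure fertility (tokens per word) on sample text: fewer tokens per word means fewer decoding steps and cheaper inference. Hypothetical sketch with Hugging Face tokenizers; "gpt2" is only a stand-in to compare against, not Sarvam's tokenizer:

```python
# Compare tokenizer fertility (tokens per whitespace-separated word) on
# an English and a Hindi sample; swap in the tokenizer you want to test.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
samples = {
    "English": "The weather is pleasant today.",
    "Hindi": "आज मौसम सुहावना है।",
}
for lang, text in samples.items():
    n_tokens = len(tok(text)["input_ids"])
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f}")
```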
Tuesday, Mar 18 11:00 PM - 11:40 PM CET : Accelerate Inference on NVIDIA GPUs [S72330]
- Ce Zhang : CTO, Together AI
- Slides
- Very interesting presentation, but it is better to have the slides because there are a lot of diagrams
- Why only 80% of the hardware?
- Attention is time-consuming => replace it with Hyena (convolution); see the long-convolution sketch after this session's notes
- Speculative prefill : you can keep only 10% of the tokens as input
- A good model is a good speculator (see the speculative-decoding sketch after this session's notes)
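The core of a Hyena-style replacement for attention is a sequence-length ("long") convolution plus gating, computed in O(L log L) with an FFT instead of attention's O(L^2). A minimal PyTorch sketch of that operator with random stand-in filters (the real layer learns its filters implicitly and stacks several of these):

```python
# Causal depthwise long convolution via FFT, the building block behind
# Hyena-style operators. Gating and filters here are random placeholders.
import torch

def fft_long_conv(u, k):
    """u: (batch, length, dim), k: (length, dim) -> causal conv along length."""
    L = u.shape[1]
    n = 2 * L                                    # zero-pad to avoid wrap-around
    u_f = torch.fft.rfft(u, n=n, dim=1)
    k_f = torch.fft.rfft(k, n=n, dim=0)
    return torch.fft.irfft(u_f * k_f.unsqueeze(0), n=n, dim=1)[:, :L]

B, L, D = 2, 1024, 64
u = torch.randn(B, L, D)
k = torch.randn(L, D) / L                        # stand-in for the implicit long filter
gate = torch.sigmoid(torch.randn(B, L, D))       # Hyena combines this with gating
y = gate * fft_long_conv(u, k)
print(y.shape)                                   # torch.Size([2, 1024, 64])
```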
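"A good model is a good speculator" refers to speculative decoding: a cheap draft model proposes several tokens and the large target model checks them in one pass, so several tokens come out per generation step. A toy greedy sketch with stand-in models (my illustration; real systems use proper acceptance sampling and a batched verification pass):

```python
# Toy greedy speculative decoding: draft k tokens with a cheap model, keep the
# longest prefix the target model agrees with, then emit one target token.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def draft_next(ctx):    # stand-in for the small, fast draft model
    return int((sum(ctx) * 7 + 3) % VOCAB)

def target_next(ctx):   # stand-in for the large model; agrees ~80% of the time
    tok = draft_next(ctx)
    return tok if rng.random() < 0.8 else int(rng.integers(VOCAB))

def speculative_step(ctx, k=4):
    proposal, tmp = [], list(ctx)
    for _ in range(k):                 # 1) draft k tokens autoregressively
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    accepted, tmp = [], list(ctx)
    for t in proposal:                 # 2) verify (one batched pass in practice)
        if target_next(tmp) != t:
            break
        accepted.append(t)
        tmp.append(t)
    if len(accepted) < k:              # 3) on rejection, emit the target's own token
        accepted.append(target_next(tmp))
    return accepted

print(speculative_step([1, 2, 3]))     # usually more than one token per step
```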
Wednesday, Mar 19 4:00 PM - 4:40 PM CET : Bending Scaling Laws with Brighter Algorithms [S73204]
- Yejin Choi : Senior Research Director, NVIDIA
- Slides
- The era of brute force scaling is over
- SuperBPE makes tokens bigger (it first learns subwords, then superwords); see the toy sketch after this session's notes
- Fewer tokens to train over
- Me : like going from syllables to words to expressions
- Socratic MCTS : maieutics, reaching knowledge through questioning
- The LLM generates questions during reasoning
- But the verifier is very hard to build
- Between 1 and 9 minutes per question (average score 0.748, vs. 0.540 for GPT-4o (0513) and 0.550 for Claude 3.5 Sonnet)
- Retro search for better reasoning
- Challenge the status quo about searches
- An LLM can generate a very long thought and still end up at the wrong answer (underthinking)
- Sometimes shorter reasoning is better
- Look at alternative paths and see if they lead to a good answer
- Test-time training (TTT); see the TTT sketch after this session's notes
- There is an unnatural split between training and testing
- Relevant for long video content
- Classic transformers cannot handle a one-minute video because it would take about 1,000,000 tokens
- Learning to learn at test time (personally, I thought the test set was there to prevent overfitting, but why not)
- In this paradigm, hidden states are considered as part of the model
- On Tom and Jerry, the result is shaped like Tom and Jerry, but the animation is really far from the original work of William Hanna and Joseph Barbera.
- Even Chuck Jones's Tom and Jerry looks better...
- The art of new and diverse synthetic data generation
- Need a metric for diversity; diverse topics do not matter for math, what counts is diversity in the mathematical reasoning
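A toy sketch of the two-stage SuperBPE idea noted above: ordinary BPE merges that stay inside word boundaries first (subwords), then merges that may cross whitespace and form multi-word superwords, which shrinks the token count. This is my own simplification, only to show the mechanism:

```python
# Toy two-stage BPE: stage 1 forbids merges across the word boundary marker,
# stage 2 allows them, so frequent word sequences collapse into "superwords".
from collections import Counter

BOUNDARY = "▁"  # marks "this token starts a new word"

def merge_pair(line, a, b):
    out, i = [], 0
    while i < len(line):
        if i + 1 < len(line) and line[i] == a and line[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(line[i]); i += 1
    return out

def learn_bpe(corpus, n_merges, n_subword_merges):
    seqs = []
    for line in corpus:  # prefix non-initial words with the boundary, then split to chars
        words = [(BOUNDARY + w if i else w) for i, w in enumerate(line.split())]
        seqs.append([ch for w in words for ch in w])
    for step in range(n_merges):
        allow_cross_word = step >= n_subword_merges      # stage 2: superwords allowed
        pairs = Counter()
        for line in seqs:
            for a, b in zip(line, line[1:]):
                if not allow_cross_word and b.startswith(BOUNDARY):
                    continue                              # stage 1: stay inside words
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]               # most frequent pair wins
        seqs = [merge_pair(line, a, b) for line in seqs]
    return seqs

corpus = ["by the way the model is by the way quite good"] * 4
subword_only = learn_bpe(corpus, 40, n_subword_merges=40)
superbpe     = learn_bpe(corpus, 40, n_subword_merges=20)
print(len(subword_only[0]), len(superbpe[0]))  # superwords leave fewer tokens per line
```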
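And a minimal sketch of the test-time-training idea: the hidden state is itself a tiny model (here just a linear map W) that is updated with a gradient step on a self-supervised reconstruction loss at every time step, including at inference. The dimensions, learning rate, and loss are my assumptions, not the talk's exact formulation:

```python
# Test-time training, linear version: the "hidden state" W is updated by one
# gradient step per token on the loss 0.5 * ||W x - x||^2, then used as output.
import torch

D, LR = 32, 0.1
W = torch.zeros(D, D)                    # hidden state = parameters of a tiny model

def ttt_step(W, x):
    x = x / (x.norm() + 1e-6)
    pred = W @ x                         # try to reconstruct the current token
    grad = torch.outer(pred - x, x)      # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    W = W - LR * grad                    # inner-loop update, even at inference
    return W, W @ x                      # updated state and this step's output

outputs = []
for x in torch.randn(100, D):            # a toy "sequence" of 100 tokens
    W, y = ttt_step(W, x)
    outputs.append(y)
print(torch.stack(outputs).shape)        # torch.Size([100, 32])
```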
Wednesday, Mar 19 7:00 PM - 7:40 PM CET : Advancing AI Reasoning: From Game Theoretic Methods to Complex Problem Solving [S73862]
- Vartika Singh : Strategic AI Lead, NVIDIA
- Noam Brown : Research Scientist, OpenAI (2017 : two-player poker AI; 2019 : multiplayer poker; then extended to Diplomacy. The reasoning techniques for poker, chess, and Diplomacy were all different, and the idea was to find a way to generalize them all.)
- Bryan Catanzaro : VP, Applied Deep Learning Research, NVIDIA. cuDNN, DLSS. "Every day I try to make researchers work together."
- How do we improve inference? It was not a priority at the time for poker (a game with partial information)
- There wasn't a universal way to train on poker, chess, and Diplomacy
- Pluribus for poker : about \$150 to train the model at the time, on 28 CPU cores
- Diplomacy games can stretch over several weeks or months
- The poker model was good so let's try on diplomacy
- Matrix multiplication is the best way to create a computer that can do a lot of work
- It is not trivial to turn matrix multiplication into intelligence
- We have to develop a reasoning technique that is as effective as deep learning was for images
- OpenAI o1 creates a new paradigm, like AlexNet several years ago
- Reporting model performance as a single score is not relevant anymore because inference time is not taken into account
- Yes, there are things the models cannot do today, but people are working on it
Thursday, Mar 20 8:00 AM - 8:40 AM CET : Profile Large Language Model Trainings on the Grace Hopper Superchip [S72967]
- Giuseppe Fiameni : Solutions Architect, NVIDIA
- Karin Sevegnani : Solutions Architect, NVIDIA
- Slides
- Both CPU and GPU super critical for computing tasks
- PCIe is a bottleneck for CPU-GPU programming => unified memory on the Grace Hopper Superchip
- NeMo parallelizes computation over 100,000s of nodes
- NeMo integrates with the NVIDIA Triton Inference Server
- Rocky Linux 9.3 with custom kernel
- LoRA : efficient fine tuning method for Llama
- nsys : -d to specify the duration of the profiling and -y to delay when profiling starts (see the annotated training-step sketch after this session's notes)
- Profiling shows the program is compute bound, not memory bound
- Waits caused by PyTorch autograd
- Offloading : tensors normally kept in GPU memory during training are temporarily moved to the CPU => it reduces GPU memory usage and GPU utilization, but adds more CPU-GPU synchronization
- Mixed-precision training to speed up training
- FP8 training : speed and efficiency improvements on Hopper
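A sketch of how a training step can be annotated with NVTX ranges so the forward, backward, and optimizer phases show up on the nsys timeline, together with the delayed, fixed-duration capture mentioned above. The model, batch, and command line are placeholders of mine, not the exact setup from the talk:

```python
# Annotate a training step with NVTX ranges for Nsight Systems, using
# bf16 autocast for mixed precision. Capture with a delayed window, e.g.:
#   nsys profile -y 60 -d 120 -o train_profile python train.py
#   (-y: delay before collection starts, in seconds; -d: collection duration)
import torch

model = torch.nn.Linear(4096, 4096).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch):
    torch.cuda.nvtx.range_push("forward")
    with torch.autocast("cuda", dtype=torch.bfloat16):    # mixed precision
        loss = model(batch).float().pow(2).mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    opt.step()
    opt.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()

for _ in range(10):
    train_step(torch.randn(64, 4096, device="cuda"))
```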
Tuesday, Mar 18 11:00 PM - 11:40 PM CET : The Impact of AI on Filmmaking: Real-World Transformation [S73743]
- Jo Plaete : CIO and VFX Supervisor, Metaphysic
- Ed Ulbrich : Chief Content Officer and President of Production, Metaphysic
- Very interesting, but a lot of visuals, so hard to take notes
- Slides
- From 3D asset rendering to neural network inference
- Real time generative AI
- They take care of licensing for the training data