12.4.5 : AI

  • Monday, Mar (5:00 PM - 5:50 PM CET) Accelerating Drug Discovery: Optimizing Dynamic GPU Workflows with CUDA Graphs, Mapped Memory, C++ Coroutines, and More [S61156]
    • no record
  • Monday, Mar (6:00 PM - 6:25 PM CET) Generative AI Powered With Extreme-Scale Computing to Discover Novel Metal-Organic Frameworks for Carbon Capture at Scale [S62172]
    • no record
  • Monday, (6:00 PM - 6:50 PM CET) recording available XGBoost is All You Need [S62960]
    • Bojan Tunguz, Data Scientist, NVIDIA
    • Slides
    • Do not use XGBoost for everything, but it is very robust and has GPU support (see the sketch after these notes)
    • XGBoost is largely developed by NVIDIA these days
    • XGBoost is not always superior; you can use neural networks if you want
    • Tabular data : just data in columns of a table (simple values, no local structure)
    • Widely used in science
    • Gradient-boosted trees are not perfect
    • Not sensitive to linearity (to verify, I am not convinced)
    • Apparently this guy does not know Scikit-learn
    • Way too much text in the slides for not much in the end
    • Use the GPU version of SVM (support vector machine)
    • XGBoost can be used with Dask
    • Plenty of other presentations show the same thing much better and more clearly
    • I think this guy gave his presentation just to say he played with a DGX H100
    • In short, not everyone can be an engineer
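    • A minimal sketch of GPU-accelerated XGBoost on tabular data (assumes xgboost >= 2.0 and scikit-learn; the synthetic dataset and parameters are illustrative, not from the talk):

      ```python
      # Hedged sketch: train gradient-boosted trees on the GPU.
      import xgboost as xgb
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      clf = xgb.XGBClassifier(
          n_estimators=500,
          tree_method="hist",  # histogram-based tree construction
          device="cuda",       # run training on the GPU (XGBoost >= 2.0 API)
      )
      clf.fit(X_train, y_train)
      print("accuracy:", clf.score(X_test, y_test))
      ```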
  • Monday, (6:00 PM - 6:50 PM CET) Enterprise MLOps 101 [S61934]
    • no record
  • Monday, (6:00 PM - 6:50 PM CET) live Revolutionizing Vision AI: From 2D to 3D Worlds [S62724]
    • GAN : two networks fighting each other (the other speaker says it is beautiful and elegant but hard to optimize; it does not try to memorize the training set, otherwise it would be pointless)
    • Diffusion : inference runs the network many times
    • But it can be expensive in energy consumption
    • Interpolation in a very high-dimensional space is indistinguishable from magic
    • Neither of the two will be real time on a phone
    • Better architectures are under development to get faster and more efficient models
    • We went from 64x64 to 1024x1024 in a few years
    • In 2016 we waited overnight for a GAN; now it takes a month
    • Inferring 3D instead of 2D makes sense; at least try to bridge the two
    • Converting a radiance field into a mesh is very difficult
    • In any case standardization helps adoption of the technology, but questions such as compression are still open; it is too early to standardize for now
    • They want to rethink graphics in terms of neural networks
    • Rethink how animals understand 3D
    • Execution speed and interactivity are fundamental, because you cannot do this with a compute center connected on the other side of the planet
    • Unsupervised learning can still improve (filling holes in an image, etc.)
    • For 3D : radiance fields or signed distance functions (see the sketch after these notes)
    • The whole point is to understand how it works; if we do not understand, it is not the algorithm's fault, it is our brain's fault for not grasping that many dimensions
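    • As a minimal illustration of the "signed distance" representation mentioned above (my own sketch, not from the talk): a 3D shape is encoded as a function returning the signed distance from any point to the surface, negative inside and positive outside.

      ```python
      # Hedged sketch: signed distance function (SDF) of a sphere.
      import numpy as np

      def sphere_sdf(points, center=np.zeros(3), radius=1.0):
          """Signed distance from each 3D point to the sphere surface."""
          return np.linalg.norm(points - center, axis=-1) - radius

      pts = np.array([[0.0, 0.0, 0.0],   # center  -> -1.0 (inside)
                      [1.0, 0.0, 0.0],   # surface ->  0.0
                      [2.0, 0.0, 0.0]])  # outside ->  1.0
      print(sphere_sdf(pts))
      ```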
  • Tuesday, Mar 19 (11:00 PM - 11:50 PM CET) A Culture of Open and Reproducible Research, in the Era of Large AI Generative Models [S62219]
    • The size of a model predicts its performance well
    • 2019 : full segmentation by pixel
    • SAM (Segment Anything Model) : 11M images and 1B masks
    • Coevolution between data and the model
    • Open source : economic growth, better research and innovation, higher code quality, better accountability
    • Llama : Feb 2023
    • Llama 2 : July 2023 : 160 M downloads
    • Llama 3 coming soon
    • Internal and external red teaming to test the model
    • 350k and 600k H100 Infrastructure
  • Tuesday 9:00 AM - 9:25 AM Le Futur de l'Intelligence Artificielle en France [SE62694]
    • 85% of companies considered it a priority technology
    • Exotec : robotics
  • Tuesday, (4:00 PM - 4:50 PM CET) Large-Scale Graph GNN Training Accelerated With cuGraph [S61197]
    • Joe Eaton, Distinguished System Engineer for Graph and Data Analytics, NVIDIA
    • Slides
    • Graph Neural Network at large scale
    • Global Scale Graph
    • From Network to Molecules
    • Meaning of sparsity
    • 1 trillion Edges
    • 4 trillion edges for Page rank
    • cuGraph Integrated in PyG and DGL
    • Optimise the data loader for each use case
    • Feature fetching to build the local graph to operate convolution on it
    • NP hard problem : partitioning the graph on several GPU
    • Accelerate every step with CUDA and GPU
    • Read Parquet files from GPU (see the sketch after these notes)
    • From 10 TB of data to 100 TB for samples, so it needs to be storable (on local disk with acceleration, much faster than randomly gathering data on the fly)
    • Need to do 100 samples in parallel
    • Split the graph structure from the features to ensure performance (but needs NVLink)
    • Redundant computation to avoid communications
    • EOS : 576 DGX = 4608 GPUs H100 (30 M Dollars)
    • 1.6 TEdges and 113 GNodes, starting with 70 TB
    • 1024 GPUs for sampling and 1024 for training
    • Sampling 20 minutes
    • Training 3 minutes (1 epoch)
    • Try to get the maximum overlap between training and sampling
    • pylibcugraph
    • More algorithms are coming
    • Billions of small graphs will also be tackled
    • Real-time graph embedding
    • Deal with dynamic graphs (data masking to ignore parts of the graph, already in)
    • Real-world knowledge is stored as property graphs
    • No database engine runs on GPU yet
    • Graphcore has hardware built for graphs, but it is not efficient on sequential workloads
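    • A minimal sketch of the GPU data path above, reading an edge list from Parquet with cuDF and running a cuGraph algorithm on it (assumes RAPIDS cudf/cugraph are installed; the file and column names are illustrative):

      ```python
      # Hedged sketch: GPU-resident graph analytics with RAPIDS.
      import cudf
      import cugraph

      # Read the edge list directly into GPU memory.
      edges = cudf.read_parquet("edges.parquet")  # illustrative: columns "src", "dst"

      # Build the graph on the GPU and run PageRank.
      G = cugraph.Graph(directed=True)
      G.from_cudf_edgelist(edges, source="src", destination="dst")
      scores = cugraph.pagerank(G)  # cudf.DataFrame with vertex and pagerank columns
      print(scores.sort_values("pagerank", ascending=False).head())
      ```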
  • Tuesday, Mar (6:00 PM - 6:50 PM CET) Training Deep Learning Models at Scale: How NCCL Enables Best Performance on AI Data Center Networks [S62129]
    • Sylvain Jeaugey : Principal Engineer, NVIDIA
    • Very close to the GTC 2023 presentation
  • Tuesday, Mar 19 (6:00 PM - 6:25 PM CET) AI Supercomputing: Pioneering the Future of Computational Science [S62242]
    • Ian Buck : Vice President of Hyperscale and HPC, NVIDIA (inventor of CUDA)
    • Moore's law has a problem
    • Move onto accelerated computing : GPU
    • 4.5M developers
    • 3000 applications
    • 600 SDKs and AI models
    • Reduce time by 3x and energy by 4x
    • Quantum Computing : using multiverse to do maths
    • Weather simulation : from 10km to 50m resolution
    • ALPS : 40 EFlops of AI
    • Jupiter : 93 EFlops (24 000 GH200), 18 MW
    • EOS : 10752 H100 : 42.6 EFlops
    • Mixture of experts : half of the time is spent on data exchange
    • Titan : 7MW = 20PFlops in 2012
    • Blackwell is 20PFlops but 1 kW
  • Wednesday, Mar 20 (10:00 PM - 10:25 PM CET) Solving the Biggest Challenges in Generative AI [S62493]
    • Matt Bell : Head of Product Research, Anthropic
    • Anthropic
    • 7 people from GPT-3
    • Deliver powerful and safe models to the market
    • LLM scaling laws thanks to GPUs
    • Hallucination : wrong answers
    • Legal liability of a company for what its chatbot says (e.g., agreeing to sell something expensive for 1 dollar)
    • Use a constitution learned by the model to make sure it questions its answers and conforms to the constitution
    • Retrieval-augmented generation
    • The model can be too cautious and refuse to answer a harmless request
    • Scratchpad : Claude gives intermediate partial answers before answering; it presents the answer after its reasoning
  • Tuesday, (6:00 PM - 6:50 PM CET) Accelerate ETL and Machine Learning in Apache Spark [S62257]
    • Sameer Raheja : Senior Director of Engineering, NVIDIA
    • Erik Ordentlich : Senior Manager, Distributed Machine Learning, NVIDIA
    • Slides
    • From Hadoop CPU only to Spark on GPU
    • No code change required
    • They don't accelerate UDF
    • The CPU can overlap IO while GPU is computing
    • 5.5x faster, 80% cheaper in cost
    • Works with Grace Hopper
    • RAPIDS Accelerator for Apache Spark
    • Same Spark Core but based on RAPIDS
    • pip install spark-rapids-user-tools
    • Improvements :
      • Reliability
        • Spill framework to reduce OOM issues and minimize GPU-specific config changes
        • OOM retry framework for automatic OOM handling in memory-intensive operators
      • Performance
        • Dynamic repartitioning in large/skewed hash joins
        • File caching
        • Improved I/O and larger chunk handling for Parquet
      • Usability
        • Tooling support on Azure and AWS Databricks, Google Dataproc and AWS EMR
      • Scaling to 100s of TB and beyond
      • ARM support
      • JSON handling improvements
      • Support for Delta Lake
    • Take advantage of Decompression engine in Blackwell (snappy and zstd)
    • from pyspark.ml.clustering import KMeans => from spark_rapids_ml.clustering import KMeans (see the sketch after these notes)
    • Based on NCCL
    • Speed up depends on the number of computation per I/O
    • Optimised version of cross validator
    • No Dataset API acceleration on GPU because Java does nonsense in terms of class descriptions
    • Spark NLP can leverage GPUs
    • If there is an out-of-memory exception on the GPU, they catch it and retry with less data on the GPU
    • GPU Direct Storage not used fully in Spark ML because of Hadoop (security issues), but in some area
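    • A minimal sketch of the one-line import swap noted above, using spark_rapids_ml as a drop-in for the PySpark ML estimator API (assumes the spark-rapids-ml package and a GPU-enabled Spark session; the toy data is illustrative):

      ```python
      # Hedged sketch: GPU KMeans via the spark_rapids_ml drop-in API.
      from pyspark.sql import SparkSession
      # Swap: pyspark.ml.clustering -> spark_rapids_ml.clustering
      from spark_rapids_ml.clustering import KMeans

      spark = SparkSession.builder.appName("gpu-kmeans").getOrCreate()

      df = spark.createDataFrame(
          [(0, [0.0, 0.0]), (1, [0.1, 0.1]), (2, [9.0, 9.0]), (3, [9.1, 9.1])],
          ["id", "features"],
      )

      # Same estimator interface as pyspark.ml, but fit/transform run on GPUs.
      kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
      model = kmeans.fit(df)
      model.transform(df).show()
      ```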
  • Tuesday, (10:00 PM - 10:50 PM CET) LLMOps: The New Frontier of Machine Learning Operations [S62458]
    • Nik Spirin : Director, Generative AI and LLMOps Platform, NVIDIA
    • Michael Balint : Senior Manager, Product Architecture, NVIDIA
    • no record
  • Tuesday, (11:00 PM - 11:25 PM CET) Throughput Performance Benchmarking: Pre-Training Foundational Large Language Models on Kubernetes [S61477]
    • Ronen Dar : Chief Technology Officer, Run:AI
    • Raz Rotenberg : Director of Engineering, Run:ai
    • Slides
  • Tuesday, (11:00 PM - 11:50 PM CET) Optimizing Large-Scale Distributed GPU Training with Holistic Trace Analysis [S61453]
    • Yuzhen Huang : Research Scientist, Meta Platforms
    • Xizhou Feng : Software Engineer, Meta Platforms
    • Slides
  • Tuesday, (11:00 PM - 11:50 PM CET) Efficient Deployment of Long Context Large Language Models [S62665]
    • no record
  • Wednesday, Mar 20 (7:00 PM - 7:50 PM CET) Transforming AI [S63046]
    • Jensen Huang : Founder and Chief Executive Officer, NVIDIA
    • Ashish Vaswani : Co-Founder and CEO, Essential AI
    • Noam Shazeer : Chief Executive Officer and Co-Founder, Character.AI
    • Jakob Uszkoreit : Co-Founder and Chief Executive Officer, Inceptive
    • Llion Jones : Co-Founder and Chief Technology Officer, Sakana AI
    • Aidan Gomez : Co-Founder and Chief Executive Officer, Cohere
    • Lukasz Kaiser : Member of Technical Staff, OpenAI
    • Illia Polosukhin : Co-Founder, NEAR Protocol
    • 1964 : modern computing has not fundamentally changed since (software separated from hardware, software compatibility, etc.)
    • 10x every 5 year
    • 20 Years : 10000x
    • That pace of change has stopped
    • Computer graphics : a large market that revolutionized computing
    • Software recognizes the meaning of pixels : cat picture -> cat description -> cat picture
    • Received the creators of the Transformer ("Attention Is All You Need")
    • It started from question answering, and the neural networks of the time couldn't do that
    • RNNs were a pain to deal with at the time
    • Gradient descent is a much better teacher than me so I will let it do the work
    • Gradient descent based on GEMM makes GPU happy
    • Machine translation was so hard 5 years ago
    • They dropped parts of the model and it got better and better => hence "attention is all you need"
    • Transformer fits what the model does (cargonet, convolution, attention, recognition, google)
    • Started from translation but wanted to make something more generic
    • We should train on everything instead of only text-to-text, text-to-image, or image-to-text
    • Fundamental improvements, breakthroughs :
      • Spending the right amount of computation on what matters
      • 2+2 goes through billions of parameters, but computers can do that easily
      • model can just pick a calculator (ChatGPT-4 does that now)
    • What comes next after transformers
    • You don't just have to be better, you have to be clearly, obviously better
    • They wanted to mimic how text evolves, going back and forth
    • Essential AI : build models that can learn new tasks efficiently
      • get them to the public to interact with people
    • Character.AI : incredible technology but not given to everyone; let's do it for real and give it to everyone
    • Inceptive : improving people's lives with this technology, then AlphaFold 2, then mRNA
      • Programming proteins, tested for real
    • Sakana AI : school of fish, biology-based learning, learning always wins
      • NVIDIA computing power : all we can do apart from gradient descent
      • Use all the models available on Hugging Face and use evolutionary competition to scan the universe of parameters
    • Cohere : computers can talk to us but nothing was changing; create a platform to use the product, make it cheaper and accessible
    • Lukasz : joined OpenAI : tons of data + tons of compute; hopefully we only need a ton of compute
    • Teach machines to code (2017, a bit too early); programmable money, use it to generate more data, leverage programmable money, a new way to contribute to it
    • ChatGPT - 10 trillion tokens : almost the size of the internet
    • New data will come from interactions
    • Next big thing is reasoning
    • Trying to figure out the right prompt is not how we should ultimately use models (a bit ridiculous to search for the right prompt for hours)
    • Learning for more abstract tasks
    • You cannot do engineering without a measurement system
    • SSMs (State Space Models) are too complicated, not elegant yet (a very poor man's LSTM)
      • We will probably end up with a hybrid model
    • How to go away from tokens
    • We have never truly learned how to train models with gradient descent
    • A whole industry is thankful to them
  • Wednesday, Mar 20 (12:00 AM - 12:50 AM CET) Warp: Advancing Simulation AI with Differentiable GPU Computing in Python [S63345]
    • Miles Macklin : Director, Warp Engineering, NVIDIA
    • Slides
    • Developed over the last 2-3 years
    • CUDA : C++ centric, quite low level (but higher level than Vulkan), and not differentiable
    • Find an easy way to write differentiable kernels
    • Python has some packages for that, but they are sloooooow
    • pip install warp-lang
    • import warp as wp
    • @wp.kernel (see the kernel sketch after these notes)
    • Quite close to Numba, but Warp can do spatial computing (1D, 2D, 3D, sparse volumes, OpenVDB, ...)
    • Integration with Omniverse (USD compatibility)
    • Warp use cases : Data processing, image processing, Simulation or Scripting
    • Easy to write kernel and plug them into an existing simulator
    • No distributed computing in Warp
    • New warp.fem : partial differential equation (heat transfer, diffusion, elasticity)
    • Built-in spatial math types (vec2, vec3, mat44, quaternion, etc.), with zero copy when interfaced with PyTorch, JAX or NumPy
    • Possibility to include CUTLASS code (JIT but cached for next executions)
    • The use of CUDA snippets is based on strings, so it is very ugly
    • In PyTorch : no_grad
    • In warp you ask for automatic differentiation
    • Limitation on dynamic loops due to memory consumption
    • Warp uses CUDA graph
    • Warp supports mesh data structure (mesh queries)
    • Warp hash grids are 10x faster than Open3D
    • Warp hash grid : 128k particles, 10 ms per frame
    • Support of NanoVDB with warp.Volume
    • Warp sim : simulation : Rigid Bodies, Particles, Constraints, Geometry, Forces
    • Easy to write your own integrator
    • warp.fem : Flexible Finite Element Base (FEM/DG) : diffusion, convection, fluid flow, etc
    • Neural Constitutive Law : learn elasticity from video
    • Neural Stress Field (Bread Simulation) 10x faster simulation
    • Large Scale Fluid Simulation in collaboration with Modulus (GH200 + XLB + Warp + nvComp)
    • Support simulation that don't fit in GPU memory (Grace Hopper => GH200)
    • Differentiable Molecular Dynamic : Protein Docking : 10x faster than Torch (slide 28)
    • Fast Grasp'D : Dexterous Multi-finger Grasp Generation : 120x faster than previous SOTA
    • Robotic Perception : ANYmal Quadruped Training
    • Extract the point cloud around 4000 robots using the built-in Warp hash grid
    • ~12-16M points, 20ms for all the robots -> 27k FPS total
    • Additional LIDAR simulation via mesh ray-casts
    • Deployed on Jetson AGX Orin hardware
    • Future Developments :
    • Add FFT in Warp
    • Add neural network inference with custom simulation code, dense linear layers
    • Warp Kernel in JAX
    • Warp a JAX kernel in one primitive
    • Integration with generative AI to generate warp code, natural language to write kernel in python
    • omni.warp available in Omniverse extensions registry
    • For now Warp uses CUDA cores and not RT cores
    • The generated code is human readable and you can attach a debugger if you want
    • Warp supports multi-GPU as CUDA does, but there is no automatic scaling
    • It could be integrated with a differentiable renderer in Omniverse, but not for now
    • There is no pure Python way to use shared memory in Warp
    • Warp could be integrated into Isaac simulation to enable differentiable simulation
    • JAX JIT can call a JIT of Warp
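    • A minimal sketch of the @wp.kernel workflow referenced above (assumes `pip install warp-lang`; the SAXPY-style kernel is illustrative, not from the talk):

      ```python
      # Hedged sketch: define and launch a simple Warp kernel.
      import numpy as np
      import warp as wp

      wp.init()

      @wp.kernel
      def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
          tid = wp.tid()               # one thread per array element
          y[tid] = a * x[tid] + y[tid]

      n = 1024
      x = wp.array(np.ones(n, dtype=np.float32), dtype=float)
      y = wp.array(np.full(n, 2.0, dtype=np.float32), dtype=float)

      # Launch on the default device (GPU if available).
      wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
      print(y.numpy()[:4])             # -> [4. 4. 4. 4.]
      ```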
  • Wednesday, Mar 20 (12:00 AM - 12:50 AM CET) Building a Lower-Carbon Future With HPC and AI in Energy [S62121]
    • A lot of carbon-based technologies and how to limit their use
    • Move to cloud
    • HPC and AI use a lot of energy but can help to develop low energy Consumption technologies
    • Demand of energy is increasing
  • Wednesday, (5:30 PM - 5:55 PM CET) Advances in Optimization AI [S62495]
    • Alex Fender : Senior Engineering Manager, NVIDIA
    • Slides
  • Wednesday, Mar 20 (6:00 PM - 6:25 PM CET) An Intro to the MLPerf Benchmarks and New Generative AI Tests [S63557]
    • David Kanter : Executive Director, MLCommons
    • no record
  • Wednesday, (7:00 PM - 7:25 PM CET) Turbocharging NVIDIA DGX SuperPOD With Dell Data Lakehouse and PowerScale (Presented by Dell Technologies) [S62991]
    • Jay Limbasiya : Former Data Scientist, Dell Technologies
    • Darren Miller : Director, UDS Vertical Solutions, Dell Technologies
    • Slides
    • no record
  • Wednesday, (10:00 PM - 10:50 PM CET) Enabling High-Performance, Value-Driven AI for Real-World Use Cases [S62584]
    • Mary Reagan : Product Manager, DataRobot
    • Adam Tetelman : Principal Engineering Architect, Product Architecture, NVIDIA
    • Slides
    • Some audio problems
    • The biggest model is not necessarily the most efficient
    • Sometimes it gives a wrong answer in a very convinced way
    • Need to remove hallucination and toxicity
    • Checking accuracy is difficult
    • It is mostly a sales presentation
  • Wednesday, (11:00 PM - 11:50 PM CET) Large Language Model Prompt Tuning [S62109]
    • John Wu : Senior Product Manager, AI, Domino Data Lab
    • Josh Mineroff : Director of Solution Architecture, Tech Alliances, Domino Data Lab
    • Slides
    • Some audio and video problems
    • Adapt model to use cases
    • On demand Infrastructure, Comprehensive Reproducibility, AI Factory, Model Governance
    • Pretrained models have to be fine-tuned to fit the needs of a particular industry
    • Prompt Engineering: Use carefully structured inputs to guide the outputs.
    • RAG (Retrieval Augmented Generation): Adds contextual information to prompts by querying a vector database for related information.
    • Full Fine-Tuning: Transfer learning approach in which all the parameters are adjusted using task-specific data.
    • Parameter-Efficient Fine-Tuning (PEFT): Modifies only a small select amount of parameters for more efficient adaptation.
    • LoRA : decompose a large matrix into 2 smaller matrices in the attention layers (drastically reduces the number of trainable parameters) -> the standard for fine-tuning models for inference (see the sketch after these notes)
    • The delta-W matrix is not merged into the model, but kept separate to be applied during inference
    • NVIDIA NeMo and NVIDIA AI Foundation Models
    • Domino : Kubernetes native software
    • Then a demo
    • Select a workspace : Disk (needed size), Hardware (GPU), Service (Local DGX, Aws, etc) to start Jupyter Lab in the Kubernetes Pod
    • Domino supports also Dask, Spark or MPI clusters
    • Nemotron-3 8B is 17 GB
    • SQuAD 2.0 : Stanford Question Answering Dataset
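    • A minimal PyTorch sketch of the LoRA idea from these notes (illustrative, not the NeMo/Domino implementation): the pretrained weight W stays frozen and only a low-rank update B·A is trained and added on top.

      ```python
      # Hedged sketch: LoRA-style adapter around a frozen linear layer.
      import torch
      import torch.nn as nn

      class LoRALinear(nn.Module):
          def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
              super().__init__()
              self.base = base
              for p in self.base.parameters():       # freeze W (and bias)
                  p.requires_grad_(False)
              self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
              self.B = nn.Parameter(torch.zeros(base.out_features, r))
              self.scale = alpha / r

          def forward(self, x):
              # y = x W^T + scale * x (B A)^T ; delta-W stays separate, as noted above
              return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

      layer = LoRALinear(nn.Linear(768, 768), r=8)
      trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
      print(trainable)  # only A and B: 2 * 8 * 768 = 12288 parameters
      ```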
  • Thursday, (4:00 PM - 4:50 PM CET) Optimizing Parallelization and Overlap to Increase Training Efficiency using Megatron-Core [S61626]
    • Xue Huang : Solution Architect, NVIDIA
    • Yuliang Liu : Foundation Model Training Director, Kuaishou Technology (pre-recorded)
    • Slides
  • Thursday, (7:00 PM - 7:50 PM CET) The Vision-AI Revolution powered by DeepStream [S62624]
    • no record