GAN: two networks competing against each other (the other one says it is beautiful and elegant but hard to optimize; it does not try to memorize the training set, otherwise it would be pointless)
Diffusion: inference runs the network many times
But it can be expensive in power consumption
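A minimal sketch of why diffusion inference is heavy: the denoising network is evaluated once per step, typically hundreds of times per image (DDPM-style loop; `denoise` is a placeholder for a trained noise-prediction network, the schedule values are illustrative):

    import numpy as np

    def denoise(x, t):
        # placeholder for a trained noise-prediction network eps_theta(x, t)
        return np.zeros_like(x)

    T = 1000                                   # number of denoising steps
    betas = np.linspace(1e-4, 0.02, T)         # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = np.random.randn(64, 64, 3)             # start from pure noise
    for t in reversed(range(T)):               # one full network call per step
        eps = denoise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*x.shape)
    # 1000 network evaluations per image, versus a single forward pass for a GAN generator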
Interpolation in a very high-dimensional space is indistinguishable from magic
Neither of the two will be real time on a phone
Better architectures are being developed to get faster, more efficient models
We went from 64x64 to 1024x1024 in a few years
In 2016 we waited overnight for a GAN, now it takes a month
Inferring 3D instead of 2D makes sense; at least try to bridge the two
Converting a radiance field into a mesh is very hard
In any case, standardization helps the technology get accepted, but questions like compression are still open; it is too early to standardize for now
They want to rethink graphics in terms of neural networks
Rethink how animals understand 3D
Execution speed and interactivity are fundamental, because it cannot be done with a compute center on the other side of the planet
Unsupervised learning can still improve (filling in holes in an image, etc.)
For 3D: radiance fields or signed distance functions
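For context, a signed distance function gives, for any 3D point, its distance to a surface, with the sign telling inside from outside; neural approaches fit such a field with an MLP. A tiny analytic example for a sphere (the neural part is only hinted at in the comment):

    import numpy as np

    def sphere_sdf(p, center=np.zeros(3), radius=1.0):
        # negative inside the sphere, zero on the surface, positive outside
        return np.linalg.norm(p - center) - radius

    print(sphere_sdf(np.array([0.0, 0.0, 0.0])))   # -1.0 (inside)
    print(sphere_sdf(np.array([2.0, 0.0, 0.0])))   #  1.0 (outside)
    # a neural SDF replaces sphere_sdf with an MLP fitted to sampled (point, distance) pairs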
The key is understanding how it works. If we don't understand, it is not the algorithm's fault, it is the fault of our brain, which cannot grasp that many dimensions
Feature fetching to build the local subgraph the convolution operates on
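A rough sketch of what feature fetching means here, assuming a CSR-style adjacency and a dense feature matrix (names and shapes are illustrative, not a cuGraph API):

    import numpy as np

    def sample_local_graph(seed, indptr, indices, features, fanout=10, rng=np.random):
        # CSR adjacency: neighbors of `seed` are indices[indptr[seed]:indptr[seed + 1]]
        neighbors = indices[indptr[seed]:indptr[seed + 1]]
        if len(neighbors) > fanout:
            neighbors = rng.choice(neighbors, size=fanout, replace=False)
        nodes = np.concatenate(([seed], neighbors))
        # feature fetching: gather the feature rows of the sampled nodes; this random
        # gather is the bandwidth-bound step the pipeline accelerates on GPU
        return nodes, features[nodes]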
NP-hard problem: partitioning the graph across several GPUs
Accelerate every step with CUDA and GPUs
Read Parquet files directly from the GPU
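For example, with RAPIDS the Parquet decode itself runs on the GPU (assuming cuDF is installed; the file name is made up):

    import cudf

    # the Parquet file is decompressed and decoded directly into GPU memory
    df = cudf.read_parquet("node_features.parquet")
    features = df.to_cupy()   # hand the columns to the training pipeline as a GPU array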
From 10 TB of data to 100 TB of samples, so it needs to be storable (on local disk with acceleration, much faster than randomly gathering data on the fly)
Need to run 100 samples in parallel
Split the graph structure from the features to keep performance up (but requires NVLink)
Redundant computation to avoid communication
EOS: 576 DGX = 4,608 H100 GPUs (~$30M)
1.6 trillion edges and 113 billion nodes, starting from 70 TB
1,024 GPUs for sampling and 1,024 for training
Sampling: 20 minutes
Training: 3 minutes (1 epoch)
Try to get the maximum overlap between training and sampling
pylibcugraph
More algorithms are coming
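A small example of the GPU graph analytics stack mentioned here, using the higher-level cugraph wrapper on top of pylibcugraph (the edge list values are made up):

    import cudf
    import cugraph

    edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})
    G = cugraph.Graph(directed=True)
    G.from_cudf_edgelist(edges, source="src", destination="dst")

    # one of the GPU-accelerated algorithms available today
    ranks = cugraph.pagerank(G)
    print(ranks.sort_values("pagerank", ascending=False).head())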
Billions of small graphs will also be tackled
Real-time graph embedding
Deal with dynamic graphs (data masking to ignore parts of the graph is already in)
Real-world knowledge is stored as property graphs
No database engine runs on GPUs yet
Graphcore has hardware built for graphs, but it is not efficient on sequential workloads
Jensen Huang : Founder and Chief Executive Officer, NVIDIA
Ashish Vaswani : Co-Founder and CEO, Essential AI
Noam Shazeer : Chief Executive Officer and Co-Founder, Character.AI
Jakob Uszkoreit : Co-Founder and Chief Executive Officer, Inceptive
Llion Jones : Co-Founder and Chief Technology Officer, Sakana AI
Aidan Gomez : Co-Founder and Chief Executive Officer, Cohere
Lukasz Kaiser : Member of Technical Staff, OpenAI
Illia Polosukhin : Co-Founder, NEAR Protocol
1964: the foundations of modern computing have not changed since (software separated from hardware, software compatibility, etc.)
10x every 5 years
20 years: 10,000x
That pace of change has stopped
Computer graphics: a large market, and a way to revolutionize computing
Software recognizes the meaning of pixels: cat picture -> cat description -> cat picture
Received the creators of the Transformer ("Attention Is All You Need")
It started from question answering, and the neural networks of the time couldn't do it
RNNs were a pain to deal with at the time
Gradient descent is a much better teacher than me so I will let it do the work
Gradient descent based on GEMM makes GPU happy
Machine translation was so hard 5 years ago
They dropped parts of the model and it kept getting better => and found that attention is all you need
Transformer fits what the model does (cargonet, convolution, attention, recognition, google)
Started from translation but wanted to make something more generic
We should train on everything instead of only text-to-text, text-to-image, or image-to-text
Fundamental improvement, breakthrough:
Spending the right amount of computation on what matters
2+2 goes through billions of parameters, but computers can do that easily
The model can just pick up a calculator (GPT-4 does that now)
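A toy illustration of the "pick a calculator" idea: instead of spending billions of parameters on arithmetic, the host detects a tool call emitted by the model and evaluates it exactly (the tool-call format here is invented for illustration):

    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calc(expr):
        # tiny safe arithmetic evaluator standing in for an external calculator tool
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    model_output = "CALL calculator(2+2)"          # pretend the model emitted this
    if model_output.startswith("CALL calculator("):
        expr = model_output[len("CALL calculator("):-1]
        print(calc(expr))                           # 4, computed exactly, no parameters needed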
What comes next after transformers
You don't have to be just better, you have to be clearly, obviously better
They wanted to mimic how text evolves, going back and forth
Essential AI: build models that can learn new tasks efficiently
Get them out to the public so they interact with people
Character.AI: incredible technology, but it wasn't being given to everyone; let's do it for real and give it to everyone
Inceptive: improving people's lives with this technology, then AlphaFold 2, then mRNA
Programming proteins, tested for real
Sakana AI: school of fish, biology-inspired learning, learning always wins
NVIDIA computing power: everything we can do apart from gradient descent
Use all the models available on Hugging Face and use evolutionary competition to scan the universe of parameters
Cohere: computers can talk to us but nothing was changing; create a platform to use the product, make it cheaper and accessible
Lukasz: joining OpenAI: a ton of data + a ton of compute; hopefully we will only need a ton of compute
Teach machines to code (2017, a bit too early). Programmable money, use it to generate more data, leverage programmable money, a new way to contribute to it
ChatGPT - 10 trillion tokens: almost the size of the internet
New data will come from interactions
Next big thing is reasoning
Trying to figure out the right prompt is not how we should ultimately use these models (a bit ridiculous to spend hours searching for the right prompt)
Learning for more abstract tasks
You cannot do engineering without a measurement system
SSMs (State Space Models): too complicated, not elegant yet (a very poor man's LSTM)
We will probably end up with a hybrid model
How to move away from tokens
We have never truly learned how to train models with gradient descent
On demand Infrastructure, Comprehensive Reproducibility, AI Factory, Model Governance
Pretrained models have to be fine-tuned to fit the needs of a particular industry
Prompt Engineering: Use carefully structured inputs to guide the outputs.
RAG (Retrieval Augmented Generation): Adds contextual information to prompts by querying a vector database for related information (sketch after this list).
Full Fine-Tuning: Transfer learning approach in which all the parameters are adjusted using task-specific data.
Parameter-Efficient Fine-Tuning (PEFT): Modifies only a small, select set of parameters for more efficient adaptation.
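A minimal RAG sketch, assuming `embed` returns a vector for a text and `llm` answers a prompt (both are placeholders, not a specific library; the brute-force cosine search stands in for a real vector database):

    import numpy as np

    def embed(text):
        # placeholder embedding model; a real setup would call an embedding service
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    def retrieve(query, documents, k=3):
        # brute-force cosine similarity against the document collection
        q = embed(query)
        scores = [np.dot(q, embed(d)) / (np.linalg.norm(q) * np.linalg.norm(embed(d)))
                  for d in documents]
        return [documents[i] for i in np.argsort(scores)[::-1][:k]]

    def rag_answer(query, documents, llm):
        # add the retrieved context to the prompt before calling the model
        context = "\n".join(retrieve(query, documents))
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        return llm(prompt)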
LoRA: decompose a large weight matrix into 2 smaller matrices in the attention layers (drastically reduces the number of trainable parameters) -> the standard for fine-tuning models for inference
The ΔW matrix is not merged into the model, but kept separate to be applied during inference
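A sketch of the LoRA idea: the frozen weight W stays untouched and the update is stored as two small matrices B and A, so ΔW = BA can be kept aside and applied at inference time (shapes and scaling are illustrative):

    import numpy as np

    d, r = 1024, 8                      # hidden size and LoRA rank (r << d)
    W = np.random.randn(d, d) * 0.02    # frozen pretrained weight (d*d values)
    A = np.random.randn(r, d) * 0.01    # trainable, r*d values
    B = np.zeros((d, r))                # trainable, d*r values; starts at zero so ΔW = 0

    def lora_forward(x, alpha=16):
        # base projection plus the low-rank update, without materializing a full ΔW
        return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

    # trainable parameters: 2*d*r instead of d*d
    print(2 * d * r, "vs", d * d)       # 16384 vs 1048576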
NVIDIA NeMo and NVIDIA AI Foundation Models
Domino: Kubernetes-native software
Then a demo
Select a workspace: disk (required size), hardware (GPU), service (local DGX, AWS, etc.) to start JupyterLab in a Kubernetes pod