12.6.2 : Hardware
Tuesday, Mar 17 11:00 PM - 11:40 PM CET : From Grace to Vera: NVIDIA’s next data center CPU [S81680]
- Matthias Langer, Sr. AI and Developer Technology Engineer, NVIDIA
- Lukas Alt, DevTech Engineer, NVIDIA
- Vera : Next CPU for AI and HPC
- Available second half of 2026
- Some customers are already using Vera
- Liquid-cooled Vera rack => 256 Vera CPUs in a single rack => 22,528 CPU cores
- Armv9.2
- 6x128-bit SVE2 vectorization
- 88 Olympus cores, 176 threads with spatial multithreading
- 2 MB L2 Cache per core, 162 MB L3 Cache
- 2nd-generation NVIDIA Scalable Coherency Fabric (SCF), 3.4 TB/s (raw hardware bandwidth)
- RAM : LPDDR5x SOCAMM
- Raw memory bandwidth up to 1.2 TB/s
- Memory Capacity up to 1.5 TB
- NVLink-C2C: 2nd generation, 1.8 TB/s
- PCIe Gen6: 88 lanes with x16, x8, x4, x2 bifurcation
- Expect uniform memory latency across all cores (no NUMA effects within the CPU)
- Olympus can decode 10 instructions per cycle (6 per cycle on Grace)
- Spatial Multithreading (SMT)
- A single thread uses all 6x128-bit vector resources; with 2 threads, each uses 3x128-bit
- FP8 is supported on Vera, with accumulation into FP16 or FP32
- Memory System Resource Partitioning and Monitoring (MPAM)
- Recommended compilers: GCC 15.2 and LLVM >= 22; use -mcpu=olympus (or -mcpu=grace on Grace)
- Works with NVTX (NVIDIA Tools Extension) annotations
- Linux perf is also available; 6 PMU counters can be used at the same time
- 2.02x speedup compared to Grace on an AV1 encoding benchmark
- Vera is Arm-compatible, so an application compiled for Arm should run as-is; however, it is better to compile and optimise specifically for Vera
- Vera should be available in several form factors, as Grace is
- Clock speeds are similar to Grace's (not identical, but close)
- In short: to run on both Grace and Vera, compile for Grace and the binary will perform reasonably well on Vera; if you need a highly optimised binary, you can build two optimised code paths, selected at run time
- Up to 1050 W per chip
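As a quick consistency check on the Vera figures above, this small Python sketch recomputes the rack core count, threads per CPU, total L2 per CPU and the vector width available to each SMT thread. The constant and function names are hypothetical, and the per-thread width assumes the 6x128-bit vector resources are split evenly between active SMT threads, as the notes state; it is back-of-the-envelope arithmetic, not a model of the actual core scheduler.

```python
# Back-of-the-envelope Vera sizing, using only figures quoted in the notes.
# All names here are hypothetical, chosen for illustration.
CORES_PER_CPU = 88        # Olympus cores per Vera CPU
THREADS_PER_CORE = 2      # spatial multithreading (SMT)
CPUS_PER_RACK = 256       # liquid-cooled Vera rack
VECTOR_UNITS = 6          # 6x128-bit SVE2 vector resources per core
VECTOR_WIDTH_BITS = 128
L2_MB_PER_CORE = 2


def rack_cores(cpus: int = CPUS_PER_RACK, cores: int = CORES_PER_CPU) -> int:
    """Total physical cores in one rack."""
    return cpus * cores


def threads_per_cpu(cores: int = CORES_PER_CPU, smt: int = THREADS_PER_CORE) -> int:
    """Hardware threads exposed by one CPU."""
    return cores * smt


def total_l2_mb(cores: int = CORES_PER_CPU, l2_mb: int = L2_MB_PER_CORE) -> int:
    """Aggregate private L2 capacity of one CPU, in MB."""
    return cores * l2_mb


def vector_bits_per_thread(active_threads: int) -> int:
    """Vector width per thread, assuming the vector resources are split
    evenly between active SMT threads (6x128b for 1, 3x128b each for 2)."""
    if active_threads not in (1, THREADS_PER_CORE):
        raise ValueError("a core runs 1 or 2 SMT threads")
    return (VECTOR_UNITS // active_threads) * VECTOR_WIDTH_BITS


if __name__ == "__main__":
    print(rack_cores())                # 22528
    print(threads_per_cpu())           # 176
    print(total_l2_mb())               # 176
    print(vector_bits_per_thread(1))   # 768
    print(vector_bits_per_thread(2))   # 384
```

The last two lines show the SMT trade-off from the notes: turning on the second thread halves the vector width each thread sees, so a vector-bound kernel does not automatically run faster with SMT enabled.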
- Sachin Katti, Head of Industrial Compute, OpenAI
- The most depressing talk ever.
- Many of these points are thoughts
- Innovators: what should be the way to do something
- From stateless to stateful use, where models have the full context of your life...
- 3x more compute consumption each year
- More than 30 GW in 2030
- The goal is to make the human the bottleneck; for now, this is not the case
- This would keep the human from switching context to other things while the agent works on the task
- No guarantee that agentic workloads will scale with the number of GPUs or CPUs: 40% utilisation is not acceptable for a GPU
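Two of the figures above compound in ways that are easy to underestimate. This Python sketch, with hypothetical helper names, assumes a constant 3x yearly growth factor and a fleet-wide average utilisation; at 3x per year, demand grows 81x in four years, and at 40% utilisation delivering the work of 100 fully busy GPUs takes about 250 GPUs.

```python
# Hypothetical helpers illustrating compounding growth and utilisation.

def compute_growth(years: int, factor_per_year: float = 3.0) -> float:
    """Cumulative growth after `years` years at a constant yearly factor
    (3x per year, as quoted in the talk)."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return factor_per_year ** years


def gpus_needed(busy_gpu_equivalents: float, utilisation: float) -> float:
    """GPUs to provision so that, at the given average utilisation, the
    fleet delivers the work of `busy_gpu_equivalents` fully busy GPUs."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    return busy_gpu_equivalents / utilisation


if __name__ == "__main__":
    for years in range(1, 5):
        print(f"after {years} year(s): {compute_growth(years):.0f}x")
    print(f"GPUs at 40% utilisation: {gpus_needed(100, 0.40):.0f}")
```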
Tuesday, Mar 17 11:00 PM - 11:40 PM CET : Liquid Cooling: How to Achieve 100% Heat Capture on 500kW+ Racks (Presented by CoolIT Systems) [S82004]
- Andrew Buckrell, Sr. Thermal Mechanical Engineer, CoolIT Systems
- Exponential growth in the power of computing resources => CPU/GPU at 80-90 °C, peripheral components at 65 °C
- 20-30 L/s of coolant flow to cool the rack
- By 2030 => 1 MW racks => achieving 100% heat capture is crucial
- Minimise the thickness of the TIM (thermal interface material, between the CPU and the heat sink)
- Liquid-metal TIM => gallium-based => highly reactive with aluminium (it destroys aluminium); mixing with copper degrades heat transfer
- TIM thickness down to 50 µm
- Avoid the ballooning effect, which deforms the TIM with temperature and reduces heat transfer
- Liquid cooling at high velocity => erosion and corrosion effects on copper
- Vapour chambers to cool RAM
- Cooling with wetted aluminium is a challenge
- The future: 500 W/cm² dies, 15 kW chips
- Heat transfer becomes better at higher temperatures
- PG0: propylene glycol => no corrosion
- Water: freezing is a problem
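The flow rates quoted above follow from the steady-state heat balance Q = ρ · V̇ · c_p · ΔT, where V̇ is the volumetric flow and ΔT the coolant temperature rise. This Python sketch assumes approximate properties for water (ρ ≈ 1000 kg/m³, c_p ≈ 4186 J/(kg·K)) and full heat capture; the function names and ΔT values are hypothetical choices for illustration. Under these assumptions a 500 kW rack at a 5 K rise needs roughly 24 L/s. A glycol mix has a lower specific heat than water, so it needs more flow for the same heat load. The same block also derives the die area implied by 15 kW at 500 W/cm².

```python
# Coolant flow from the heat balance Q = rho * V_dot * c_p * dT.
# Water properties are approximate; names and dT values are illustrative.
WATER_DENSITY_KG_M3 = 1000.0
WATER_CP_J_KG_K = 4186.0


def flow_rate_l_per_s(heat_w: float, delta_t_k: float,
                      density: float = WATER_DENSITY_KG_M3,
                      cp: float = WATER_CP_J_KG_K) -> float:
    """Volumetric flow (L/s) needed to carry `heat_w` watts with a coolant
    temperature rise of `delta_t_k` kelvin, assuming 100% heat capture."""
    if heat_w < 0 or delta_t_k <= 0:
        raise ValueError("need heat_w >= 0 and delta_t_k > 0")
    m3_per_s = heat_w / (density * cp * delta_t_k)
    return m3_per_s * 1000.0  # m^3/s -> L/s


def implied_die_area_cm2(chip_w: float, flux_w_per_cm2: float) -> float:
    """Die area implied by a chip power and an average heat flux."""
    return chip_w / flux_w_per_cm2


if __name__ == "__main__":
    for rack_kw in (500, 1000):
        for dt in (5, 10):
            flow = flow_rate_l_per_s(rack_kw * 1e3, dt)
            print(f"{rack_kw} kW, dT={dt} K: {flow:.1f} L/s")
    print(f"15 kW at 500 W/cm^2: {implied_die_area_cm2(15e3, 500):.0f} cm^2")
```

Halving ΔT doubles the required flow, which is one reason higher coolant temperatures are attractive at these power levels.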