Hardware

12.5.2 : Hardware

Wednesday, Mar 19 12:00 AM - 12:40 AM CET : Next-generation at Scale Compute in the Data Center [S73623]

Mike Houston : VP and Chief Architect of AI Systems, NVIDIA
Julie Bernauer : Senior Director, Applied Systems Engineering, NVIDIA
Slides
They were the first to have thousands of GPUs to scale up and test everything
80 percent liquid cooling, 20 percent air cooling
Telemetry per rack (humidity, temperature, noise, airflow) with Prometheus + Grafana, Alertmanager and Splunk
A lot of cables in the rack means a lot of management
Do not put the checkpoint in your home dir when you train a model
No performance drop on any rack
Linux on the switch, so very simple to manage
Job sizes are not always a multiple of 72, but a multiple of 72 is more efficient
They use GitLab CI a lot
Very high density Kyber design
Next year : talk on Kyber-based design
They started with Slurm because it is simpler, but their colleagues use Kubernetes as well
Fault tolerance OK
Resilience is a bit tricky
Moving to photonics will improve reliability
A small scratch in an optical fiber is the worst thing to debug because whether the link works depends on the temperature
For now cooling is single-phase DLC (direct liquid cooling), but two-phase cooling also exists
Ultra Ethernet is still very much a work in progress
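The note above on job sizes and multiples of 72 can be illustrated with a bit of arithmetic. This is a minimal sketch, not something shown in the talk: it assumes that 72 is the number of GPUs in one rack and that a job is given whole racks. The name `rack_usage` and the sample job sizes are hypothetical.

```python
import math

GPUS_PER_RACK = 72  # assumption: the "72" in the notes is the number of GPUs per rack


def rack_usage(job_gpus: int, gpus_per_rack: int = GPUS_PER_RACK) -> tuple[int, int]:
    """Return (racks needed, GPUs left idle) for a job that is given whole racks."""
    if job_gpus <= 0:
        raise ValueError("job_gpus must be positive")
    racks = math.ceil(job_gpus / gpus_per_rack)
    idle = racks * gpus_per_rack - job_gpus
    return racks, idle


# Hypothetical job sizes: only multiples of 72 leave no GPU idle.
for size in (64, 72, 100, 144):
    racks, idle = rack_usage(size)
    print(f"{size:4d} GPUs -> {racks} rack(s), {idle} GPU(s) idle")
```

Under these assumptions, a job whose size is a multiple of 72 fills its racks exactly, which is one plausible reading of why such sizes are more efficient.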

Friday, Mar 21 6:00 PM - 6:40 PM CET : Accelerate Computational Lithography for Optical Proximity Correction [S71759]

Seongtae Jeong : Fellow, VP with Executive Privilege, Samsung Electronics
Slides
cuLitho Integration: Recent adoption of NVIDIA's cuLitho library for further performance improvements in computational lithography
GPU-Accelerated OPC: How Samsung is leveraging NVIDIA GPUs to accelerate optical proximity correction workloads
Performance Gains: Significant increases in both speed and efficiency of OPC processes due to GPU and cuLitho synergy
Scalability: The role of GPU-based solutions in meeting the growing complexity of semiconductor manufacturing
Future Outlook: Potential future enhancements in computational lithography through continued GPU optimization
Evolution of the size of the features (transistors on a die)
Printing smaller features => photolithography (you can see it as a very expensive scanner)
Features are smaller than the wavelength : the brush is larger than what we want to paint
OPC adjusts the mask so that the wafer is printed correctly => iterative process
Deflectors: features on the mask that are not printed themselves but help the printed shape come out correctly
There is no one-to-one correspondence between what is on the mask and what is on the wafer
NVIDIA GPU acceleration since 2019 => but the computation takes several weeks
Very branch-heavy code, so it is difficult to take advantage of parallelism
Layout polygons are handled on the CPU, the rest on the GPU
cuLitho since 2023
Rasterisation : conversion from the data format to the format to print
6x speedup with cuLitho using CPU + GPU
Acceleration of Etch Bias correction is production ready
Old OPC algorithms need to be reworked to fit on the GPU
All the results are based on A100, but they will upgrade to H100
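The iterative nature of OPC mentioned above can be sketched with a toy one-dimensional model. This is not Samsung's method or the cuLitho API. It assumes the optics act as a Gaussian blur and the resist as a simple threshold. The values of `SIGMA`, `THRESHOLD` and `gain`, and the function names, are all hypothetical.

```python
import math

SIGMA = 20.0     # blur of the toy optical model, in nm (assumed value)
THRESHOLD = 0.5  # resist threshold on the normalised intensity (assumed value)


def intensity(x: float, mask_width: float) -> float:
    """Blurred intensity at position x for a mask line of given width centred at 0."""
    s = SIGMA * math.sqrt(2.0)
    half = mask_width / 2.0
    return 0.5 * (math.erf((x + half) / s) - math.erf((x - half) / s))


def printed_width(mask_width: float) -> float:
    """Width of the region where the blurred intensity exceeds the threshold."""
    if intensity(0.0, mask_width) <= THRESHOLD:
        return 0.0  # the line is too narrow to print at all
    lo, hi = 0.0, mask_width / 2.0 + 10.0 * SIGMA  # intensity at hi is ~0
    for _ in range(60):  # bisection on the position of the printed edge
        mid = 0.5 * (lo + hi)
        if intensity(mid, mask_width) > THRESHOLD:
            lo = mid
        else:
            hi = mid
    return 2.0 * lo


def correct_mask(target_width: float, gain: float = 0.5,
                 tol: float = 0.01, max_iter: int = 100) -> tuple[float, int]:
    """Adjust the mask width until the printed width matches the target."""
    mask_width = target_width  # start from the drawn (target) shape
    for iteration in range(max_iter):
        error = target_width - printed_width(mask_width)
        if abs(error) < tol:
            return mask_width, iteration
        mask_width += gain * error  # damped correction of the mask
    return mask_width, max_iter


target = 40.0  # nm, hypothetical target line width
mask, iterations = correct_mask(target)
print(f"uncorrected print: {printed_width(target):.2f} nm")
print(f"corrected mask:    {mask:.2f} nm after {iterations} iterations")
print(f"corrected print:   {printed_width(mask):.2f} nm")
```

In this toy model the uncorrected line prints narrower than drawn, so the loop widens the mask until the printed line matches the target. This also shows why the mask and the wafer pattern are not in one-to-one correspondence.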