12.5.2 : Hardware
Wednesday, Mar 19 12:00 AM - 12:40 AM CET : Next-generation at Scale Compute in the Data Center [S73623]
- Mike Houston : VP and Chief Architect of AI Systems, NVIDIA
- Julie Bernauer : Senior Director, Applied Systems Engineering, NVIDIA
- Slides
- They are the first to have thousands of GPUs on which to scale up and test everything
- 80 percent liquid cooling, 20 percent air cooling
- Telemetry per rack (humidity, temperature, noise, airflow) with Prometheus + Grafana, Alertmanager and Splunk
- A lot of cables in the rack means a lot of management
- Do not put the checkpoint in your home dir when you train a model
- No performance drop on any rack
- Linux on the switch, so very simple to manage
- Job sizes are not always a multiple of 72, but a multiple of 72 is more efficient
- They use GitLab CI a lot
- Very high-density Kyber design
- Next year: talk on the Kyber-based design
- They started with Slurm because it is simpler, but their colleagues also use Kubernetes
- Fault tolerance OK
- Resilience is a bit tricky
- Moving to photonics will improve reliability
- A small scratch in an optical fiber is the worst thing to debug, because whether the link works depends on the temperature
- For now, cooling is single-phase DLC (direct liquid cooling), but two-phase cooling also exists
- Ultra Ethernet is still very much a work in progress
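
The checkpoint advice in the notes above can be enforced in a training script. The following is a minimal sketch and not something shown in the talk: the `SCRATCH` environment variable, the `checkpoints/<job_name>` layout and the function name `resolve_checkpoint_dir` are all assumptions made for illustration, and the right scratch location depends on the site.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_checkpoint_dir(job_name: str,
                           scratch_root: Optional[str] = None,
                           home: Optional[str] = None) -> Path:
    """Return a checkpoint directory, refusing any location under the home directory."""
    # SCRATCH is a hypothetical variable name; many clusters define something
    # similar, but the exact name is site-specific.
    root = Path(scratch_root or os.environ.get("SCRATCH", "/tmp")).resolve()
    # `home` can be passed explicitly (useful for testing); otherwise use the
    # current user's home directory.
    home_dir = Path(home).resolve() if home else Path.home().resolve()
    # Reject the home directory itself and anything nested below it.
    if root == home_dir or home_dir in root.parents:
        raise ValueError(f"{root} is inside the home directory; use scratch storage")
    return root / "checkpoints" / job_name
```

A training script would call `resolve_checkpoint_dir("my-run")` once at startup and fail fast if the scratch location is misconfigured, rather than discovering the problem after hours of training.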
Friday, Mar 21 6:00 PM - 6:40 PM CET : Accelerate Computational Lithography for Optical Proximity Correction [S71759]
- Seongtae Jeong : Fellow, VP with Executive Privilege, Samsung Electronics
- Slides
- CuLitho Integration: Recent adoption of NVIDIA's cuLitho library for further performance improvements in computational lithography
- GPU-Accelerated OPC: How Samsung is leveraging NVIDIA GPUs to accelerate optical proximity correction workloads
- Performance Gains: Significant increases in both speed and efficiency of OPC processes due to GPU and cuLitho synergy
- Scalability: The role of GPU-based solutions in meeting the growing complexity of semiconductor manufacturing
- Future Outlook: Potential future enhancements in computational lithography through continued GPU optimization
- Evolution of the size of the features (transistors on a die)
- Printing smaller features => photolithography (you can see it as a very expensive scanner)
- Features are smaller than the wavelength: the brush is larger than what we want to paint
- OPC adjusts the mask to make sure that the wafer will be printed correctly => iterative process
- Deflectors, which are not printed themselves but help the printed shape come out correct
- There is no one-to-one correspondence between what is on the mask and what is on the wafer
- NVIDIA GPU acceleration since 2019 => but computation still takes several weeks
- Very branch-heavy code, so it is difficult to take advantage of the parallelism
- Layout polygons are handled on the CPU, the rest on the GPU
- cuLitho since 2023
- Rasterisation: conversion from the data format to the format to print
- 6x speedup with cuLitho using CPU + GPU
- Acceleration of etch bias correction is production-ready
- Old OPC algorithms need to be updated to fit on the GPU
- All the results are based on A100, but they will upgrade to H100
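
To make the "iterative process" of OPC in the notes above more concrete, here is a toy one-dimensional sketch. It is not Samsung's flow and does not use cuLitho: the Gaussian blur standing in for the optics, the resist threshold, the feedback gain and every name in the code are assumptions chosen only to show the loop "simulate the print, measure the error, adjust the mask".

```python
import math
import numpy as np

SIGMA_NM = 20.0    # assumed blur of the toy optical model
THRESHOLD = 0.5    # assumed resist threshold on the normalized intensity
X_NM = np.linspace(-100.0, 100.0, 2001)  # 0.1 nm sampling grid
_erf = np.vectorize(math.erf)            # element-wise erf without SciPy


def printed_width(mask_width_nm: float) -> float:
    """Width of the region where the blurred image of a centered line prints."""
    s = SIGMA_NM * math.sqrt(2.0)
    half = mask_width_nm / 2.0
    # Image of a line of the given width after convolution with a Gaussian.
    intensity = 0.5 * (_erf((X_NM + half) / s) - _erf((X_NM - half) / s))
    dx = X_NM[1] - X_NM[0]
    return float(np.count_nonzero(intensity > THRESHOLD) * dx)


def correct_mask(target_nm: float, gain: float = 0.5, iterations: int = 20) -> float:
    """Iteratively bias the mask width until the printed width matches the target."""
    mask = target_nm  # start from the drawn layout
    for _ in range(iterations):
        error = target_nm - printed_width(mask)  # width error on the wafer
        mask += gain * error                     # widen or narrow the mask
    return mask


if __name__ == "__main__":
    target = 30.0
    corrected = correct_mask(target)
    print("uncorrected print:", printed_width(target))
    print("corrected mask:", corrected, "prints:", printed_width(corrected))
```

In this toy model the uncorrected mask prints a line narrower than drawn, and the loop widens the mask until the printed width is close to the target. This also illustrates why the mask and the wafer are not in one-to-one correspondence. Production OPC works on two-dimensional polygons with far richer optical and resist models.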