12.5.2 : Hardware

Wednesday, Mar 19 12:00 AM - 12:40 AM CET : Next-generation at Scale Compute in the Data Center [S73623]
  • Mike Houston : VP and Chief Architect of AI Systems, NVIDIA
  • Julie Bernauer : Senior Director, Applied Systems Engineering, NVIDIA
  • Slides
  • First to have thousands of GPUs to scale up and test everything
  • 80 percent liquid cooling, 20 percent air cooling
  • Per-rack telemetry (humidity, temperature, noise, airflow) with Prometheus + Grafana, Alertmanager and Splunk
  • A lot of cables in the rack means a lot of management
  • Do not put the checkpoint in your home dir when you train a model
  • No perf drop on each rack
  • Linux on the switch, so very simple to manage
  • Job sizes are not always a multiple of 72, but a multiple of 72 is more efficient
  • They use GitLab CI a lot
  • Very high density Kyber design
  • Next year : talk on the Kyber-based design
  • They started with Slurm because it is simpler, but their colleagues also use Kubernetes
  • Fault tolerance OK
  • Resilience is a bit tricky
  • Moving to photonics will improve reliability
  • A small scratch in an optical fiber is the worst thing to debug, because whether the link works depends on the temperature
  • For now cooling is single-phase DLC (direct liquid cooling), but two-phase cooling also exists
  • Ultra Ethernet is still very much a work in progress
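
The checkpoint advice above can be enforced directly in a training script. The following is a minimal sketch, not something shown in the talk: it assumes the cluster exposes a scratch filesystem through an environment variable, here given the hypothetical name SCRATCH, and the function name and directory layout are made up for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_checkpoint_dir(run_name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return a checkpoint directory that is guaranteed to be outside $HOME.

    SCRATCH is a hypothetical variable name; real clusters use
    site-specific names for their scratch or parallel filesystem.
    """
    env = os.environ if env is None else env
    scratch = env.get("SCRATCH")
    if not scratch:
        raise RuntimeError("SCRATCH is not set; refusing to guess a checkpoint location")

    home = Path(env.get("HOME", str(Path.home()))).resolve()
    target = (Path(scratch) / "checkpoints" / run_name).resolve()

    # Reject the home directory itself and anything below it.
    if target == home or home in target.parents:
        raise ValueError(f"{target} is inside the home directory {home}")
    return target
```

Calling resolve_checkpoint_dir("my-run") at the start of a job fails early if the scratch location is missing or points back into the home directory, instead of failing later when the first checkpoint is written.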


Friday, Mar 21 6:00 PM - 6:40 PM CET : Accelerate Computational Lithography for Optical Proximity Correction [S71759]
  • Seongtae Jeong : Fellow, VP with Executive Privilege, Samsung Electronics
  • Slides
  • cuLitho Integration: Recent adoption of NVIDIA's cuLitho library for further performance improvements in computational lithography
  • GPU-Accelerated OPC: How Samsung is leveraging NVIDIA GPUs to accelerate optical proximity correction workloads
  • Performance Gains: Significant increases in both speed and efficiency of OPC processes due to GPU and cuLitho synergy
  • Scalability: The role of GPU-based solutions in meeting the growing complexity of semiconductor manufacturing
  • Future Outlook: Potential future enhancements in computational lithography through continued GPU optimization
  • Evolution of the size of the features (transistors on a die)
  • Printing smaller features => photolithography (you can think of it as a very expensive scanner)
  • Features are smaller than the wavelength : the brush is larger than what we want to paint
  • OPC adjusts the mask so that the wafer is printed correctly => iterative process
  • Deflectors, which are not printed themselves but help the printed shape come out correctly
  • There is no one-to-one correspondence between what is on the mask and what is on the wafer
  • NVIDIA GPU acceleration since 2019 => but the computation still takes several weeks
  • Very branch-heavy code, so it is difficult to take advantage of parallelism
  • Layout polygon on the CPU, the rest on the GPU
  • cuLitho since 2023
  • Rasterisation : conversion from the data format to the format to print
  • 6x speedup with cuLitho using CPU + GPU
  • Acceleration of etch bias correction is production ready
  • Old OPC algorithms need to be upgraded to fit on the GPU
  • All the results are based on A100, but they will upgrade to H100
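
The iterative nature of OPC noted above can be illustrated with a deliberately tiny 1D model. This is a sketch under strong assumptions and has nothing to do with cuLitho's actual API or Samsung's flow: the optics are reduced to a Gaussian blur, the resist to a fixed threshold, and the layout to a single line on a pixel grid. All function names and parameter values are made up for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian, a crude stand-in for optical blur."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def print_line(mask, n, kernel, threshold):
    """Return the printed half-open interval (left, right), or None.

    mask is a half-open pixel interval (left, right) on a grid of n pixels.
    """
    radius = len(kernel) // 2
    left, right = mask
    printed = []
    for x in range(n):
        # Intensity at x: kernel weights summed over the open mask pixels.
        intensity = sum(
            kernel[j - x + radius]
            for j in range(max(left, x - radius), min(right, x + radius + 1))
        )
        if intensity >= threshold:
            printed.append(x)
    if not printed:
        return None
    return (printed[0], printed[-1] + 1)


def toy_opc(target, n=40, sigma=2.0, radius=6, threshold=0.3, max_iter=20):
    """Move the mask edges until the printed line matches the target."""
    kernel = gaussian_kernel(sigma, radius)
    mask = target  # start from the drawn layout
    history = []
    for _ in range(max_iter):
        printed = print_line(mask, n, kernel, threshold)
        history.append((mask, printed))
        if printed == target:
            break
        if printed is None:
            # Nothing printed: widen the mask by one pixel per side.
            mask = (mask[0] - 1, mask[1] + 1)
            continue
        # Edge placement error per edge; move each mask edge against it.
        epe_left = printed[0] - target[0]
        epe_right = printed[1] - target[1]
        new_left, new_right = mask[0] - epe_left, mask[1] - epe_right
        if new_left >= new_right:
            break  # the correction would collapse the line; stop in this toy
        mask = (new_left, new_right)
    return mask, history


mask, history = toy_opc((17, 23))
for step, (m, p) in enumerate(history):
    print(f"step {step}: mask={m} printed={p}")
```

With these defaults, a 6-pixel target line drawn as-is prints 8 pixels wide. After one correction the mask is narrowed to 4 pixels and the print matches the target, which also shows why the mask and the wafer are not one-to-one. Real OPC works on 2D polygons with far more elaborate optical and resist models.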