Date: 2015-01-01

About the Team

OpenAI’s Hardware Health team oversees every aspect of hardware health for our custom-built hyperscale supercomputers. The team is responsible for maximizing the supercomputing capacity available for research and ensuring that hardware faults have minimal impact on our researchers.

The Hardware Health team is being incubated inside OpenAI’s Scaling team, which operates at the far edge of all available innovations in AI, doing the engineering and research required to train large-scale AI models of unprecedented capability.

About the Role

As a software engineer (SWE) in Hardware Health, you will maintain a sophisticated and comprehensive suite of hardware health tests, and collaborate with researchers and our Supercomputing team to root-cause and reliably reproduce newly discovered problems.

The team moves at a fast pace and gives individuals a high degree of autonomy and a strong ability to effect change.



An ideal candidate would have:

A balance of building and operational skills

Excellent development skills in Python and shell scripting

A high degree of comfort digging into noisy data with SQL, PromQL, and pandas

Experience developing reproducible analyses and building dashboards and visualizations

Strong attention to detail and a good intuition for when results are “too good/bad to be true”

A strong sense of ownership that drives them to carefully monitor the outcomes of deployed updates

Prior tech lead (TL) experience, as this is a 0-to-1 effort with team growth on the horizon

Bonus points if you have expertise and interest in the low-level details of hardware components, protocols, and associated Linux tooling (PCIe, networking, power management, kernel performance tuning)