Blog

Discovering the minutiae of backend systems

Christian Gibson is an engineer on the Supercomputing team at OpenAI.

Person gazing across the room with an optimistic expression

Photo: Jake Stangel

December 8, 2022

Authors

What first interested you in engineering?

I was fortunate to discover programming at a young age and used that as a gateway to explore other topics. In middle school, a friend introduced me to the particular flavor of the BASIC programming language included with Texas Instruments calculators (my code was predictably unmaintainable given a restriction of 27 single-letter variables per program and a heavy reliance on GOTO statements). Nevertheless, we created some simple programs, like text-based adventure games, a chat app for linked calculators, and the usual quadratic formula aide.

Later on, I wrote more complicated programs: a visual helper for illustrating Newton’s method and an orbit calculator for estimating the position of the planets and their moons, which caught the eye of my school’s Linux club. Soon, I was tussling with NDISwrapper trying to get my laptop’s CardBus-based WiFi adapter working and setting my desktop windows ablaze with Compiz! That pattern of discovery via code continued throughout high school and beyond, resulting in my engineering interest today.

What made you come to OpenAI?

At my last job, I’d moved from a backend role into a full-stack position, only to find a distaste for frontend work and UX design. I wanted to move back to a role closer to backend systems and missed the interaction with Linux environments I’d enjoyed in academia. OpenAI offered the change in work I was looking for and then some; you’d be hard-pressed to find a better fit for what I was looking for than working on OpenAI’s supercomputing clusters.

What are the problems you’re focused on solving here at OpenAI?

Exploratory AI workflows are inherently fast-paced; researchers want to be able to take a preprint off of arXiv and test out new approaches without being encumbered by the platform they’re launching their code on. They are also incredibly complicated, with researchers behaving much like mathematicians—relying on the intuition they’ve built over their careers to design a solution in tackling whatever problem has caught their eye this week. The fact these runtimes are executing on some of the world’s largest supercomputers adds yet another layer of complexity, and handling that penultimate layer is where my team gets involved. We work to preempt research needs before they block progress and, failing that, we work with research teams to identify bottlenecks and implement workarounds as quickly as possible.

Person sitting at a cafeteria table with a glass of water and closed laptop

Photo: Jake Stangel

What do you think differentiates working on supercomputing at OpenAI from another place?

The sheer scale we operate at is, frankly, astonishing. Third-party hardware vendors routinely confide that we’re encountering issues they’ve never previously seen. Often this is simply because our installations have more hardware shoved into a single contiguous supercomputer than their other clients, although occasionally it’s a consequence of our performance expectations. The synchronized nature of most model training approaches results in a configuration where the entire cluster effectively runs at the speed of the slowest node.

Our most prominent models are trained on billion-dollar supercomputers, and as a result, we end up chasing down performance degradations that most others would ignore. It’s exciting to see something like a one-line change hit the mainline kernel, knowing that it’ll save ~6 days of compute across our fleet per week, or see a line item on a new driver release, knowing that it was one of our discoveries that resulted in the now-upstreamed fix.

What does a typical day at OpenAI look like for you?

My days generally consist of some mixture of working on code, investigating issues, and attending meetings. Meetings dominate my Tuesdays (and usually only Tuesdays, thankfully), and the remainder of the week is split between debugging and coding. Issues identified generally become coding work, e.g., writing up a design doc, pushing a quick hotfix to a PR branch, or adding passive health check logic to keep errant hardware out of our clusters.

Digging into the issues requires a bit of detective work. The research impact varies from the vague (“my job seems to be running slower than it was yesterday”) to the terrifyingly specific (“I think if I push more than 30Gbps over the Ethernet NIC, I cause a kernel panic?”). This is likely a familiar mix: productive on days that proceed as expected, and exciting when the expected is disrupted and you get the chance to learn something new.

OpenAI offers the opportunity to dig deep into aspects of computing ignored elsewhere.

What energizes you each day?

I rarely show up for work without having something top-of-mind that needs doing, and I’m generally aware of the specific team, project, and researchers that benefit from the timely completion of a task. OpenAI is the largest employer I’ve worked for and having an immediate appreciation of the impact of my work is crucial for my day-to-day motivation. I also get a kick out of discovering the minutiae of systems. OpenAI isn’t the first employer I’ve worked on backend systems for, but this is my first time working in the HPC space.

The technologies we work with often exist purely due to performance concerns bespoke to this space. I hadn’t needed to worry about the physical topology of our hardware at previous employers—ensuring that communication occurred within the same NUMA domain, for instance, or that a GPU utilized a co-located NVME or InfiniBand device via Nvidia’s GPUDirect, or that system processes were pinned to specific CPUs to avoid noisy neighbor conflicts with research runtimes. OpenAI offers the opportunity to dig deep into aspects of computing ignored elsewhere, which keeps me interested in the task at hand.

Where do you find inspiration?

Nothing is as inspiring as watching our research teams make progress on improving their models. Many groups set up Slack bots or simple playgrounds where you can interact with and test models still under development, allowing you to watch the models improve as training continues!

I also use the popular :meow_party: Slackmoji to tag motivating or inspirational content from our various Slack channels. Since I joined in mid-2020, I have more than 400 :meow_party:-tagged posts, averaging close to 4 per week!

Authors