Infrastructure for deep learning
Deep learning is an empirical science, and the quality of a group’s infrastructure is a multiplier on progress. Fortunately, today’s open-source ecosystem makes it possible for anyone to build great deep learning infrastructure.
Summary
In this post, we’ll share how deep learning research usually proceeds, describe the infrastructure choices we’ve made to support it, and open-source kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. We hope you find this post useful in building your own deep learning infrastructure.
The use case
A typical deep learning advance starts out as an idea, which you test on a small problem. At this stage, you want to run many ad-hoc experiments quickly. Ideally, you can just SSH into a machine, run a script in screen, and get a result in less than an hour.
Making the model really work usually requires seeing it fail in every conceivable way and finding ways to fix those limitations. (This is similar to building any new software system, where you’ll run your code many times to build an intuition for how it behaves.)
So deep learning infrastructure must allow users to flexibly introspect models, and it’s not enough to just expose summary statistics.
Once the model shows sufficient promise, you’ll scale it up to larger datasets and more GPUs. This requires long jobs that consume many cycles and last for multiple days. You’ll need careful experiment management, and to be extremely thoughtful about your chosen range of hyperparameters.
The early research process is unstructured and rapid; the later scaling phase is methodical and somewhat painful, but both are absolutely necessary to get a great result.
An example
The paper Improved Techniques for Training GANs began with Tim Salimans devising several ideas for improving Generative Adversarial Network training. We’ll describe the simplest of these ideas (which happened to produce the best-looking samples, though not the best semi-supervised learning results).
GANs consist of a generator and a discriminator network. The generator tries to fool the discriminator, and the discriminator tries to distinguish between generated data and real data. Intuitively, a generator which can fool every discriminator is quite good. But there is a hard-to-fix failure mode: the generator can “collapse” by always outputting exactly the same (likely realistic-looking!) sample.
Tim had the idea to give the discriminator an entire minibatch of samples as input, rather than just one sample. The discriminator can then tell whether the generator constantly produces a single image, and once the collapse is discovered, gradients are sent to the generator to correct the problem.
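The paper’s minibatch discrimination layer is more elaborate than this, but a minimal NumPy sketch of the core idea (letting the discriminator see cross-sample statistics; the feature shapes and distance metric here are illustrative, not the paper’s exact formulation) might look like:

```python
import numpy as np

def minibatch_features(features):
    """Append a per-sample summary of how close each sample is to the
    rest of the minibatch. If the generator collapses to a single
    output, these distances shrink to zero, which the discriminator
    can learn to detect.

    features: array of shape (batch_size, num_features), e.g. an
    intermediate discriminator activation.
    """
    # Pairwise L1 distances between all samples in the minibatch.
    diffs = np.abs(features[:, None, :] - features[None, :, :]).sum(axis=2)
    # For each sample, a soft count of how similar the rest of the batch is.
    closeness = np.exp(-diffs).sum(axis=1, keepdims=True)
    # Concatenate the minibatch statistic onto each sample's own features.
    return np.concatenate([features, closeness], axis=1)

# A collapsed batch produces a much larger closeness statistic.
rng = np.random.RandomState(0)
healthy = rng.randn(8, 16)
collapsed = np.tile(rng.randn(1, 16), (8, 1))
print(minibatch_features(healthy)[:, -1])    # modest values
print(minibatch_features(collapsed)[:, -1])  # 8.0 for every sample
```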
The next step was to prototype the idea on MNIST and CIFAR-10. This required prototyping a small model as quickly as possible, running it on real data, and inspecting the result. After some rapid iteration, Tim got very encouraging CIFAR-10 samples—pretty much the best samples we’d seen on this dataset.
However, deep learning (and AI algorithms in general) must be scaled to be truly impressive—a small neural network is a proof of concept, but a big neural network actually solves the problem and is useful. So Ian Goodfellow dug into scaling the model up to work on ImageNet.
With a larger model and dataset, Ian needed to parallelize the model across multiple GPUs. Each job would push multiple machines to 90% CPU and GPU utilization, but even then the model took many days to train. In this regime, every experiment became precious, and he would meticulously log the results of each experiment.
Ultimately, while the results were good, they were not as good as we hoped. We’ve tested many hypotheses as to why, but still haven’t cracked it. Such is the nature of science.
Infrastructure
Software
The vast majority of our research code is written in Python, as reflected in our open-source projects. We mostly use TensorFlow (or Theano in special cases) for GPU computing; for CPU we use those or NumPy. Researchers also sometimes use higher-level frameworks like Keras on top of TensorFlow.
Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for some scientific libraries.
Hardware
For an ideal batch job, doubling the number of nodes in your cluster will halve the job’s runtime. Unfortunately, in deep learning, people usually see very sublinear speedups from many GPUs. Top performance thus requires top-of-the-line GPUs. We also use quite a lot of CPU for simulators, reinforcement learning environments, or small-scale models (which run no faster on a GPU).
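As a rough illustration of why (the 10% communication fraction below is an assumption for the example, not a measurement), a simple Amdahl-style estimate shows how quickly returns diminish:

```python
def estimated_speedup(num_gpus, comm_fraction=0.1):
    """Toy Amdahl-style estimate: a fixed fraction of each training step
    is spent on gradient communication that does not parallelize. The
    10% figure is an assumption for illustration, not a measurement."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / num_gpus)

for n in (1, 2, 4, 8, 16):
    print(n, round(estimated_speedup(n), 2))
# 16 GPUs yield only ~6.4x under this assumption, which is why
# per-GPU performance still matters so much.
```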
AWS generously agreed to donate a large amount of compute to us. We use it for CPU instances and for horizontally scaling up GPU jobs. We also run our own physical servers, primarily running Titan X GPUs. We expect to have a hybrid cloud for the long haul: it’s valuable to experiment with different GPUs, interconnects, and other techniques which may become important for the future of deep learning.
Provisioning
We approach infrastructure like many companies treat product: it must present a simple interface, and usability is as important as functionality. We use a consistent set of tools to manage all of our servers and configure them as identically as possible.
We use Terraform to set up our AWS cloud resources (instances, network routes, DNS records, etc). Our cloud and physical nodes run Ubuntu and are configured with Chef. For faster spinup times, we pre-bake our cluster AMIs using Packer. All our clusters use non-overlapping IP ranges and are interconnected over the public internet with OpenVPN on user laptops, and strongSwan on physical nodes (which act as AWS Customer Gateways).
We store people’s home directories, data sets, and results on NFS (on physical hardware) and EFS/S3 (on AWS).
Orchestration
Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our infrastructure for small- and large-scale jobs, and we’re actively solidifying our toolkit for making distributed use-cases as accessible as local ones.
We provide a cluster of SSH nodes (both with and without GPUs) for ad-hoc experimentation, and run Kubernetes as our cluster scheduler for physical and AWS nodes. Our cluster spans 3 AWS regions—our jobs are bursty enough that we’ll sometimes hit capacity in individual regions.
Kubernetes requires each job to be a Docker container, which gives us dependency isolation and code snapshotting. However, building a new Docker container can add precious extra seconds to a researcher’s iteration cycle, so we also provide tooling to transparently ship code from a researcher’s laptop into a standard image.
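As a sketch of what a submitted job amounts to (this uses the modern kubernetes Python client, and the image name, namespace, and GPU resource key are placeholders rather than our actual tooling), an experiment is ultimately just a container image plus a command and a resource request:

```python
from kubernetes import client, config

config.load_kube_config()

# Placeholder image, namespace, and GPU resource key, for illustration only.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="example-experiment"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="train",
                    image="research-base:latest",   # standard base image
                    command=["python", "train.py"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "4", "memory": "16Gi"},
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```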
We expose Kubernetes’s flannel network directly to researchers’ laptops, allowing users seamless network access to their running jobs. This is especially useful for accessing monitoring services such as TensorBoard. (Our initial approach—which is cleaner from a strict isolation perspective—required people to create a Kubernetes Service for each port they wanted to expose, but we found that it added too much friction.)
kubernetes-ec2-autoscaler
Our workload is bursty and unpredictable: a line of research can go quickly from single-machine experimentation to needing 1,000 cores. For example, over a few weeks, one experiment went from an interactive phase on a single Titan X, to an experimental phase on 60 Titan Xs, to needing nearly 1600 AWS GPUs. Our cloud infrastructure thus needs to dynamically provision Kubernetes nodes.
It’s easy to run Kubernetes nodes in Auto Scaling groups, but it’s harder to correctly manage the size of those groups. After a batch job is submitted, the cluster knows exactly what resources it needs, and should allocate those directly. (In contrast, AWS’s Scaling Policies will spin up new nodes piecemeal until resources are no longer exhausted, which can take multiple iterations.) Also, the cluster needs to drain nodes before terminating them to avoid losing in-flight jobs.
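The sizing arithmetic the cluster can do directly is simple. Here is a simplified sketch, assuming a hypothetical 8-GPU instance type and ignoring bin-packing details:

```python
import math

# Instance shape is an illustrative assumption, not a real configuration.
GPUS_PER_INSTANCE = 8

def desired_instances(pending_gpu_requests, current_instances):
    """Compute the Auto Scaling group size in a single step from the
    pending jobs' GPU ask, rather than scaling up piecemeal."""
    gpus_needed = sum(pending_gpu_requests)
    instances_needed = int(math.ceil(gpus_needed / float(GPUS_PER_INSTANCE)))
    return max(current_instances, instances_needed)

print(desired_instances([1, 1, 4, 8, 8], current_instances=1))  # -> 3
```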
It’s tempting to just use raw EC2 for big batch jobs, and indeed that’s where we started. However, the Kubernetes ecosystem adds quite a lot of value: low-friction tooling, logging, monitoring, ability to manage physical nodes separately from the running instances, and the like. Making Kubernetes autoscale correctly was easier than rebuilding this ecosystem on raw EC2.
We’re releasing kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. It runs as a normal Pod on Kubernetes and requires only that your worker nodes are in Auto Scaling groups.
The autoscaler works by polling the Kubernetes master’s state, which contains everything needed to calculate the cluster resource ask and capacity. If there’s excess capacity, it drains the relevant nodes and ultimately terminates them. If more resources are needed, it calculates what servers should be created and increases your Auto Scaling group sizes appropriately (or simply uncordons drained nodes, which avoids new node spinup time).
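A heavily simplified sketch of that loop, using boto3 and the kubernetes Python client (the group name and node shape are placeholders, and the real autoscaler also accounts for nodes already spinning up, scales down by draining, and handles GPUs, memory, and multiple groups):

```python
import math
import time

import boto3
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
autoscaling = boto3.client("autoscaling")

ASG_NAME = "worker-group"   # placeholder Auto Scaling group name
CPUS_PER_NODE = 32          # illustrative node shape

def pending_cpu_request():
    """Total CPU requested by pods the scheduler has not yet placed."""
    pods = core.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    total = 0.0
    for pod in pods.items:
        for container in pod.spec.containers:
            requests = (container.resources and container.resources.requests) or {}
            cpu = requests.get("cpu", "0")
            # Handle both whole-CPU ("4") and millicpu ("500m") requests.
            total += float(cpu[:-1]) / 1000.0 if cpu.endswith("m") else float(cpu)
    return total

while True:
    extra_nodes = int(math.ceil(pending_cpu_request() / CPUS_PER_NODE))
    if extra_nodes > 0:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=group["DesiredCapacity"] + extra_nodes)
    time.sleep(60)
```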
kubernetes-ec2-autoscaler handles multiple Auto Scaling groups, resources beyond CPU (memory and GPUs), and fine-grained constraints on your jobs such as AWS region and instance size. Additionally, bursty workloads can lead to Auto Scaling group timeouts and errors, since (surprisingly!) even AWS does not have infinite capacity. In these cases, kubernetes-ec2-autoscaler detects the error and overflows to a secondary AWS region.
Our infrastructure aims to maximize the productivity of deep learning researchers, allowing them to focus on the science. We’re building tools to further improve our infrastructure and workflow, and will share these in upcoming weeks and months. We welcome help to make this go even faster!