The use case
A typical deep learning advance starts out as an idea, which you test on a small problem. At this stage, you want to run many ad-hoc experiments quickly. Ideally, you can just SSH into a machine, run a script in screen, and get a result in less than an hour.
Making the model really work usually requires seeing it fail in every conceivable way and finding ways to fix those limitations. (This is similar to building any new software system, where you’ll run your code many times to build an intuition for how it behaves.)
So deep learning infrastructure must allow users to flexibly introspect models, and it’s not enough to just expose summary statistics.
Once the model shows sufficient promise, you’ll scale it up to larger datasets and more GPUs. This requires long jobs that consume many cycles and last for multiple days. You’ll need careful experiment management, and to be extremely thoughtful about your chosen range of hyperparameters.
The early research process is unstructured and rapid; the latter is methodical and somewhat painful, but it’s all absolutely necessary to get a great result.
The paper Improved Techniques for Training GANs began with Tim Salimans devising several ideas for improving Generative Adversarial Network training. We’ll describe the simplest of these ideas (which happened to produce the best-looking samples, though not the best semi-supervised learning).
GANs consist of a generator and a discriminator network. The generator tries to fool the discriminator, and the discriminator tries to distinguish between generated data and real data. Intuitively, a generator which can fool every discriminator is quite good. But there is a hard-to-fix failure mode: the generator can “collapse” by always outputting exactly the same (likely realistic-looking!) sample.
Tim had the idea to give discriminator an entire minibatch of samples as input, rather than just one sample. Thus the discriminator can tell whether the generator just constantly produces a single image. With the collapse discovered, gradients will be sent to the generator to correct the problem.
The next step was to prototype the idea on MNIST and CIFAR-10. This required prototyping a small model as quickly as possible, running it on real data, and inspecting the result. After some rapid iteration, Tim got very encouraging CIFAR-10 samples=—pretty much the best samples we’d seen on this dataset.
However, deep learning (and AI algorithms in general) must be scaled to be truly impressive—a small neural network is a proof of concept, but a big neural network actually solves the problem and is useful. So Ian Goodfellow dug into scaling the model up to work on ImageNet.
With a larger model and dataset, Ian needed to parallelize the model across multiple GPUs. Each job would push multiple machines to 90% CPU and GPU utilization, but even then the model took many days to train. In this regime, every experiment became precious, and he would meticulously log the results of each experiment.
Ultimately, while the results were good, they were not as good as we hoped. We’ve tested many hypotheses as to why, but still haven’t cracked it. Such is the nature of science.
The vast majority of our research code is written in Python, as reflected in our open-source projects. We mostly use TensorFlow (or Theano in special cases) for GPU computing; for CPU we use those or Numpy. Researchers also sometimes use higher-level frameworks like Keras on top of TensorFlow.
Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for some scientific libraries.
For an ideal batch job, doubling the number of nodes in your cluster will halve the job’s runtime. Unfortunately, in deep learning, people usually see very sublinear speedups from many GPUs. Top performance thus requires top-of-the-line GPUs. We also use quite a lot of CPU for simulators, reinforcement learning environments, or small-scale models (which run no faster on a GPU).
AWS generously agreed to donate a large amount of compute to us. We’re using them for CPU instances and for horizontally scaling up GPU jobs. We also run our own physical servers, primarily running Titan X GPUs. We expect to have a hybrid cloud for the long haul: it’s valuable to experiment with different GPUs, interconnects, and other techniques which may become important for the future of deep learning.
We approach infrastructure like many companies treat product: it must present a simple interface, and usability is as important as functionality. We use a consistent set of tools to manage all of our servers and configure them as identically as possible.
We use Terraform to set up our AWS cloud resources (instances, network routes, DNS records, etc). Our cloud and physical nodes run Ubuntu and are configured with Chef. For faster spinup times, we pre-bake our cluster AMIs using Packer. All our clusters use non-overlapping IP ranges and are interconnected over the public internet with OpenVPN on user laptops, and strongSwan on physical nodes (which act as AWS Customer Gateways).
Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our infrastructure for small- and large-scale jobs, and we’re actively solidifying our toolkit for making distributed use-cases as accessible as local ones.
We provide a cluster of SSH nodes (both with and without GPUs) for ad-hoc experimentation, and run Kubernetes as our cluster scheduler for physical and AWS nodes. Our cluster spans 3 AWS regions—our jobs are bursty enough that we’ll sometimes hit capacity on individual regions.
Kubernetes requires each job to be a Docker container, which gives us dependency isolation and code snapshotting. However, building a new Docker container can add precious extra seconds to a researcher’s iteration cycle, so we also provide tooling to transparently ship code from a researcher’s laptop into a standard image.
We expose Kubernetes’s flannel network directly to researchers’ laptops, allowing users seamless network access to their running jobs. This is especially useful for accessing monitoring services such as TensorBoard. (Our initial approach—which is cleaner from a strict isolation perspective—required people to create a Kubernetes Service for each port they wanted to expose, but we found that it added too much friction.)
Our workload is bursty and unpredictable: a line of research can go quickly from single-machine experimentation to needing 1,000 cores. For example, over a few weeks, one experiment went from an interactive phase on a single Titan X, to an experimental phase on 60 Titan Xs, to needing nearly 1600 AWS GPUs. Our cloud infrastructure thus needs to dynamically provision Kubernetes nodes.
It’s easy to run Kubernetes nodes in Auto Scaling groups, but it’s harder to correctly manage the size of those groups. After a batch job is submitted, the cluster knows exactly what resources it needs, and should allocate those directly. (In contrast, AWS’s Scaling Policies will spin up new nodes piecemeal until resources are no longer exhausted, which can take multiple iterations.) Also, the cluster needs to drain nodes before terminating them to avoid losing in-flight jobs.
It’s tempting to just use raw EC2 for big batch jobs, and indeed that’s where we started. However, the Kubernetes ecosystem adds quite a lot of value: low-friction tooling, logging, monitoring, ability to manage physical nodes separately from the running instances, and the like. Making Kubernetes autoscale correctly was easier than rebuilding this ecosystem on raw EC2.
The autoscaler works by polling the Kubernetes master’s state, which contains everything needed to calculate the cluster resource ask and capacity. If there’s excess capacity, it drains the relevant nodes and ultimately terminates them. If more resources are needed, it calculates what servers should be created and increases your Auto Scaling group sizes appropriately (or simply uncordons drained nodes, which avoids new node spinup time).
kubernetes-ec2-autoscaler handles multiple Auto Scaling groups, resources beyond CPU (memory and GPUs), and fine-grained constraints on your jobs such as AWS region and instance size. Additionally, bursty workloads can lead to Auto Scaling Groups timeouts and errors, since (surprisingly!) even AWS does not have infinite capacity. In these cases, kubernetes-ec2-autoscaler detects the error and overflows to a secondary AWS region.
Our infrastructure aims to maximize the productivity of deep learning researchers, allowing them to focus on the science. We’re building tools to further improve our infrastructure and workflow, and will share these in upcoming weeks and months. We welcome help to make this go even faster!