Date: 2018-02-15

Our two-stage technique works like this. In the first step, a ‘student’ neural network is given randomly selected input examples of concepts and is trained on them, using traditional supervised learning methods, to guess the correct concept labels. In the second step, the ‘teacher’ network, which has an intended concept to teach and access to labels linking concepts to examples, tests different examples on the student and sees which concept labels the student assigns them, eventually converging on the smallest set of examples it needs to give for the student to guess the intended concept. These examples end up looking interpretable because they remain grounded in the concepts (via the student trained in step one).
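To make the two steps concrete, here is a minimal toy sketch rather than the actual implementation. It assumes a hypothetical world of objects described by four properties (size, color, shape, border) and concepts that are each a single property value. It also swaps the neural networks for much simpler stand-ins: the student is a Find-S style learner that keeps the property values shared by every positive example it has seen, and the teacher is an exhaustive smallest-first search. All names (`Student`, `teach`, `is_member`, and so on) are illustrative.

```python
import itertools
import random

# Hypothetical toy world: every object is one (size, color, shape, border) tuple.
PROPERTIES = {
    "size": ["small", "medium", "large"],
    "color": ["red", "blue", "green"],
    "shape": ["square", "circle"],
    "border": ["solid", "none"],
}
NAMES = list(PROPERTIES)
OBJECTS = list(itertools.product(*PROPERTIES.values()))
# Assumption: each concept is a single property value, e.g. ("color", "red").
CONCEPTS = [(name, value) for name, values in PROPERTIES.items() for value in values]


def is_member(obj, concept):
    """Ground-truth label: does obj belong to the concept?"""
    name, value = concept
    return obj[NAMES.index(name)] == value


def features(obj):
    """Describe an object as a set of (property, value) pairs."""
    return set(zip(NAMES, obj))


class Student:
    """Stand-in for the student network: a Find-S style learner."""

    def __init__(self):
        # concept -> property values shared by all positive examples seen so far
        self.hypothesis = {}

    def train(self, labeled_examples):
        # Step 1: supervised learning from randomly drawn (object, concept, label) triples.
        for obj, concept, label in labeled_examples:
            if not label:
                continue  # this simple learner only uses positive examples
            if concept in self.hypothesis:
                self.hypothesis[concept] &= features(obj)
            else:
                self.hypothesis[concept] = features(obj)

    def guess(self, shown):
        """Return the single concept consistent with all shown objects, else None."""
        common = set.intersection(*(features(o) for o in shown))
        matches = [c for c, h in self.hypothesis.items() if h <= common]
        return matches[0] if len(matches) == 1 else None


def teach(student, concept, max_size=4):
    """Step 2: probe the already-trained student with candidate sets, smallest first."""
    positives = [o for o in OBJECTS if is_member(o, concept)]
    for size in range(1, max_size + 1):
        for candidate in itertools.combinations(positives, size):
            if student.guess(candidate) == concept:
                return list(candidate)
    return None  # no teaching set of at most max_size examples works


rng = random.Random(0)
training_data = []
for _ in range(2000):
    obj, concept = rng.choice(OBJECTS), rng.choice(CONCEPTS)
    training_data.append((obj, concept, is_member(obj, concept)))

student = Student()
student.train(training_data)  # the student is fixed from here on
teaching_set = teach(student, ("color", "red"))
print(teaching_set)
```

In this sketch a single red object is ambiguous, since it is also an example of its own size, shape, and border, so the search settles on two red objects that differ in every other property. Because the student was fixed before the teacher started probing it, the teacher can only succeed with examples that really are instances of the concept, which is the grounding described above.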

In contrast, if we train the student and teacher jointly (as is done in many current communication games), the two can collude to communicate via arbitrary examples that make no sense to humans. For instance, the concept of a “dog” might end up being encoded through arbitrary vectors that happen to depict llamas and motorcycles, or a rectangle might be communicated as two dots that look random to a human but encode that specific rectangle’s dimensions.

To understand why our technique works, consider what happens when we use our method to teach the student to recognize concepts from example images that vary along four properties: size (small, medium, large), color (red, blue, green), shape (square or circle), and border (solid or none).