Discovering types for entity disambiguation
We’ve built a system for automatically figuring out which object is meant by a word by having a neural network decide if the word belongs to each of about 100 automatically-discovered “types” (non-exclusive categories).
For example, given a sentence like “the prey saw the jaguar cross the jungle”, rather than trying to reason directly whether jaguar means the car, the animal, or something else, the system plays “20 questions” with a pre-chosen set of categories. This approach gives a big boost in state-of-the-art on several entity disambiguation datasets.
We achieve 94.88% accuracy on CoNLL (YAGO)(opens in a new window) (previous state of the arts: 91.50(opens in a new window)%, 91.70(opens in a new window)%) and 90.85% on TAC KBP 2010 challenge(opens in a new window) (previous state of the arts: 87.20(opens in a new window)%, and 87.70(opens in a new window)%). Previous methods used distributed representations(opens in a new window). Types can go almost all the way on these tasks, as perfect type prediction would give accuracies of 98.6-99%.
High-level overview
Our system uses the following steps:
Extract every Wikipedia-internal link to determine, for each word, the set of conceivable entities it can refer to. For example, when encountering the link
[jaguar](https://en.wikipedia.org/wiki/Jaguar)
in a Wikipedia page, we conclude thathttps://en.wikipedia.org/wiki/Jaguar
is one of the meanings of jaguar.Walk the Wikipedia category tree (using the Wikidata(opens in a new window) knowledge graph) to determine, for each entity, the set of categories it belongs to. For example, at the bottom of Jaguar cars Wikipedia page(opens in a new window) are the following categories (which themselves have their own categories, such as Automobiles(opens in a new window)): “British brands | Car brands | Jaguar cars | Jaguar vehicles.”
Pick a list of ~100 categories to be your “type” system, and optimize over this choice of categories so that they compactly express any entity. We know the mapping of entities to categories, so given a type system, we can represent each entity as a ~100-dimensional binary vector indicating membership in each category.
Using every Wikipedia-internal link and its surrounding context, produce training data mapping a word plus context to the ~100-dimensional binary representation of the corresponding entity, and train a neural network to predict this mapping. This chains together the previous steps: Wikipedia links map a word to an entity, we know the categories for each entity from step 2, and step 3 picked the categories in our type system.
At test time, given a word and surrounding context, our neural network’s output can be interpreted as the probability that the word belongs to each category. If we knew the exact set of category memberships, we would narrow down to one entity (assuming well-chosen categories). But instead, we must play a probabilistic 20 questions: use Bayes’ theorem(opens in a new window) to calculate the chance of the word disambiguating to each of its possible entities.
More examples
Here are some other examples of our system in action:
Cleaning the data
Wikidata’s knowledge graph can be turned into a source of training data for mapping fine-grained entities to types. We apply its instance of
(opens in a new window) relation recursively to determine the set of types for any given entity — for example, any descendent node of the human(opens in a new window) node has type human. Wikipedia can also provide entity-to-type mapping through its category link
(opens in a new window).
Wikipedia-internal link statistics provide a good estimate of the chance a particular phrase refers to some entity. However, this is noisy since Wikipedia will often link to specific instance of a type rather than the type itself (anaphora(opens in a new window) — e.g. king → Charles I of England) or link from a nickname (metonymy(opens in a new window)). This results in an explosion of associated entities (e.g. king has 974 associated entities) and distorted link frequencies (e.g. queen links to the band Queen(opens in a new window) 4920 times, Elizabeth II(opens in a new window) 1430 times, and monarch(opens in a new window) only 32 times).
The easiest approach is to prune(opens in a new window) rare(opens in a new window) links(opens in a new window), but this loses information. We instead use the Wikidata property graph to heuristically turn links into their “generic” meaning, as illustrated below.
After this process, king goes from 974 to 14 associated entities, while the number of links from queen to monarch(opens in a new window) increases from 32 to 3553.
Learning a good type system
We need to select the best type system and parameters such that disambiguation accuracy is maximized. There’s a huge number of possible sets of types, making an exact solution intractable. Instead, we use heuristic search or stochastic optimization (evolutionary algorithm) to select a type system, and gradient descent to train a type classifier to predict the behavior of the type system.
We need to select types that are discriminating (so quickly whittle down the possible set of entities), while being easy to learn (so surrounding context is informative for a neural network to infer that a type applies). We inform our search with two heuristics: learnability (average of area under the curve(opens in a new window) [AUC] scores of a classifier trained to predict type membership), and oracle accuracy (how well we would disambiguate entities if we predicted all types perfectly).
Type system evolution
We train binary classifiers to predict membership in each of the 150,000 most common types in our dataset, given a window of context. The area under the curve(opens in a new window) (AUC) of the classifier becomes the “learnability score” for that type. High AUC means it’s easy to predict this type from context; poor performance can mean we have little training data or that a word window isn’t terribly helpful (this tends to be true for unnatural categories like ISBNs(opens in a new window)). Our full model takes several days to train, so we instead use a much smaller model as a proxy in our “learnability score”, which takes only 2.5s to train.
We can now use these learnability scores and count statistics to estimate the performance of a given subset of types as our type system. Below you can run the Cross Entropy Method(opens in a new window) to discover types in your browser. Note how changing sample size and penalties affects the solution.
To better visualize what parts of the type system design are easy and hard, we invite you to try your hand at designing your own below. After choosing a high-level domain you can start looking at ambiguous examples. The possible answers are shown as circles on the top row, and the correct answer is the colored circle (hover to see its name). The bottom row contains types you can use. Lines connecting the top to the bottom row are inheritance relations. Select the relations you want. Once you have enough relations to separate the right answer from the rest, the example is disambiguated.
Neural type system
Using the top solution from our type system optimization, we can now label data from Wikipedia using labels generated by the type system. Using this data (in our experiments, 400M tokens for each of English and French), we can now train a bidirectional LSTM(opens in a new window) to independently predict all the type memberships for each word. On the Wikipedia source text, we only have supervision on intra-wiki links, however this is sufficient to train a deep neural network to predict type membership with an F1(opens in a new window) of over 0.91.
One of our type systems, discovered by beam search, includes types such as Aviation
, Clothing
, and Games
(as well as surprisingly specific ones like 1754 in Canada
—indicating 1754 was an exciting year in the dataset of 1,000 Wikipedia articles it was trained on); you can also view the full(opens in a new window) type system.
Inference
Predicting entities in a document usually relies on a “coherence” metric between different entities, e.g., measuring how well each entity fits with each other, which is O(N^2)
in the length of the document. Instead, our runtime is O(N)
as we need only to look up each phrase in a trie which maps phrases to their possible meanings. We rank each of the possible entities according to the link frequency seen in Wikipedia, refined by weighting each entity by its likelihood under the type classifier. New entities can be added just by specifying their type memberships (person, animal, country of origin, time period, etc.).
Next steps
Our approach has many differences to previous work on this problem. We’re interested in how well end-to-end learning of distributed representations(opens in a new window) performs in comparison to the type-based inference we developed here. The type systems here were discovered using a small Wikipedia subset; scaling to all of Wikipedia could discover a type system with broad application. We hope you find our code(opens in a new window) useful!
If you’d like to help push research like this forward, please apply to OpenAI!