We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.^{[^footnote-1]} The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.

We achieved a new state-of-the-art (41.2% vs 29.3%) on the miniF2F benchmark, a challenging collection of high-school olympiad problems. Our approach, which we call *statement curriculum learning*, consists of manually collecting a set of statements of varying difficulty levels (without proof) where the hardest statements are similar to the benchmark we target. Initially our neural prover is weak and can only prove a few of them. We iteratively search for new proofs and re-train our neural network on the newly discovered proofs, and after 8 iterations, our prover ends up being vastly superior when tested on miniF2F.

Formal mathematics is an exciting domain to study because of (i) its richness, letting you prove arbitrary theorems which require reasoning, creativity and insight and (ii) its similarity to games—where AI has been spectacularly successful—in that it has an automated way of determining whether a proof is successful (i.e., verified by the formal system). As demonstrated in the trivial example below, proving a formal statement requires generating a sequence of proof steps, each proof step consisting in a call to a tactic.^{[^footnote-2]}

These tactics take mathematical terms as arguments and each tactic call will transform the current statement to prove, into statements that are easier to prove, until nothing is left to prove.

## Problem 1

*Adapted from AMC12 2000 Problem 5*

# Prove that if $|x - 2| = p$, where $x < 2$, then $x - p = 2 - 2p$.

We observe that the capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure. The proof below is an example of it: the proof step `use n + 1`

(entirely generated by our models) proposes to use `n + 1`

as a solution, the rest of the formal proof relying on the `ring_exp`

tactic to verify that it is indeed valid.

## Problem 2

*Adapted from AMC12B 2020 Problem 6*

# For all integers $n ≥ 9$, prove that $((n + 2)! −(n + 1)!) / n!$ is a perfect square.

We also observe that our models and search procedure are capable of producing proofs that chain multiple non-trivial reasoning steps. In the proof below, the model starts by using contraposition leading to the existential statement (`∃ (x : ℝ), f x ≠ a * x + b`

). It then generates a witness for it with `use (0 : ℝ)`

and finishes the proof by leveraging the `norm_num`

tactic.

## Problem 3

*Adapted from the MATH dataset*

# Let $f(x) = Ax + B$ and $g(x) = Bx + A$, where $A <br />e B$. If $f(g(x)) - g(f(x)) = B - A$, prove that $A + B = 0$.

## Problem 4

*Adapted from IMO 1964 Problem 2*

# Suppose $a$, $b$, $c$ are the sides of a triangle.

Prove that $a^2(b + c − a) + b^2(c + a − b) + c^2(a + b − c) \leq 3abc$.

## Problem 5

*Adapted from AIME 1984 Problem 1*

# Prove that $a2 + a4 + a6 + a8 + ...+ a98 = 93$ if $a1$, $a2$, $a3...$ is an arithmetic progression with common difference $1$, and $a1 + a2 + a3 + ... + a98 = 137$.

## Problem 6

*Adapted from IMO Longlist 1990 Problem 77*

# For $a, b, c$ reals, prove that $(a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) \geq (ab + bc + ca)^3$.

Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed.

- (i)
**Infinite action space**: not only does formal mathematics have an extremely large search space (like Go for example), it also has an infinite action space. At each step of a proof search, the model must choose not from a well-behaved finite set of actions, but a complex and infinite set of tactics, involving exogenous mathematical terms that have to be generated (e.g., generating a mathematical statement to be used as a witness, an object used in steps such as “there exists an*x*s.t. …”, or a cut, the introduction and the chaining of a lemma in the middle of a proof). - (ii)
**Lack of self-play**: conversely to 2-player games, a prover is not playing against an opponent but against a set of statements to prove. When faced with a statement that is just too hard, there is no obvious reframing that will let the prover generate intermediary easier statements to tackle first. This asymmetry prevents naive application of the self-play algorithms that were successful with 2-player games.

In our work, we address the infinite action space problem by sampling actions from a language model as we search for a proof. Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments. Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum. Our methodology proposes to replace this unsupervised curriculum with an auxiliary set of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, our training procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to the set of problems we care about.

While these results are extremely exciting, as they demonstrate that deep learning models are capable of non-trivial mathematical reasoning when interacting with a formal system, we are still very far from best-student performance on these competitions, only occasionally, rather than consistently, closing challenging olympiad problems. We hope nonetheless that our work will motivate research in this domain, in particular towards the IMO Grand Challenge and that the *statement curriculum learning* methodology we propose will help accelerate progress in automated reasoning in general.