A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry
With Molecule.one’s Maria, GPT‑5.4 found a surprising additive boosting Chan-Lam Coupling yields for over 80% of tested substrates.
OpenAI’s work in science is motivated by a simple belief: advanced AI can become a powerful partner for scientists, helping them explore more ideas, connect distant concepts, design better experiments, and accelerate discoveries that benefit humanity. We have already shared early examples of models contributing to novel results in mathematics, including work on the unit distance problem, in theoretical physics, through a new result on gluon amplitudes, and in biology, where GPT‑5 helped lower the cost of cell-free protein synthesis in an automated lab. We also introduced GPT‑Rosalind, a purpose-built model to support life sciences research and drug discovery workflows.
This project extends that trajectory into medicinal chemistry, where progress cannot be measured by reasoning alone. A hypothesis has to work in the lab with real molecules, instruments, and experimental noise. Working with Molecule.one(opens in a new window), we connected GPT‑5.4 to Maria—an agentic chemistry AI integrated with a high-throughput laboratory for autonomous research—and gave it an open-ended goal: to improve one of several important reaction classes. The system generated research proposals, designed and ran experiments, analyzed experimental data, and proposed follow-up experiments. Humans remained in the loop by designing steering and grading prompts and selecting proposals to test. They also made limited corrections to experimental plans, assisted with basic laboratory operations, and independently validated the final result.
The most promising proposal, OAI-M1-03, focused on a difficult but useful version of Chan–Lam coupling, a reaction chemists use to form carbon-nitrogen bonds. Starting from the open-ended goal of improving Chan–Lam coupling for process chemistry, GPT‑5.4 independently identified primary sulfonamides as a challenging, high-value substrate class and suggested that mild oxidants, including TEMPO, could improve the reaction.
Across two cycles of experimentation in Maria Lab that idea produced a significant improvement. Under the optimized conditions, measured yields improved for 88% of the boronic acids and 83% of the sulfonamides tested. The mean yield rose from 16.6% to 25.2%, and the share of reactions above 30% yield increased from 15.6% to 37.5%. Human chemists then repeated representative reactions at bench scale. Those experiments confirmed the microliter-scale results, showing higher yields for 11 of 14 substrate pairs, with a more than twofold increase in most cases. That matters because medicinal chemists need reactions that work not just in micro-liter screening experiments, but also in practical lab workflows used during drug discovery.
Improvements in this area of medicinal chemistry are particularly exciting because synthesis is often a major bottleneck in drug discovery: scientists can only test the molecules they can make or otherwise obtain. The sulfonamide group appears in medicines across a wide range of therapeutic areas, including anticancer drugs, antimicrobials, and diuretics, yet the Chan–Lam coupling of primary sulfonamides with boronic acids has historically given low yields. Making this form of the reaction more reliable could give medicinal chemists a broader and more practical way to produce and explore potentially useful molecules.
While this is still an early result, it provides another concrete example of the broader direction we are working toward: AI systems that can become valuable partners to scientists across much of the research loop. The model reviewed the literature, proposed an unexpected idea, helped design and analyze experiments, and arrived at a scientific finding that human chemists could evaluate.
Maria Lab: Molecule.one's specialized high-throughput laboratory that ran 10,080 reactions in OAI-M1-03
Organic chemistry underpins all small-molecule medicines, as well as products in agriculture, electronics, and materials science. A reaction is especially useful when it can make the same kind of chemical bond reliably across many different starting materials. When reactions produce low yields or too many unwanted byproducts, chemists may have to abandon otherwise promising molecules or spend significant time developing a different route. This makes synthesis a major bottleneck in drug discovery: scientists can generally only test the molecules they can make or otherwise obtain.
Chan–Lam coupling is useful in medicinal chemistry because it forms carbon-nitrogen bonds, which are common in medicines. However, the reaction does not work equally well for every class of molecule. In particular, coupling primary sulfonamides with boronic acids has historically produced low yields. Sulfonamides are an important family of molecules found in medicines used in oncology and infectious disease. Making this reaction more reliable could give medicinal chemists a broader and more practical way to produce and explore potentially useful molecules.
The combined system paired complementary capabilities. Prompts written by scientists working with Maria AI were used with GPT‑5.4 within a harness to generate and rank thousands of possible research proposals. Human chemists reviewed the small subset of proposals that ranked highest according to the system and selected four for laboratory testing. Maria AI then translated selected high-level plans into detailed lab instructions, ran thousands of high-throughput experiments, analyzed the raw data, and returned structured results to GPT‑5.4.
One of the four selected proposals, OAI-M1-03, suggested using mild oxidants such as TEMPO to improve the performance of the Chan-Lam reaction for sulfonamide synthesis. Chemists found the suggestion both surprising and interesting. We share the detailed findings from OAI-M1-03 in this blog post and in the paper(opens in a new window).
The final research proposal was then used by Maria to generate experimental grids, with slight corrections by humans. The largest human correction was to avoid dimethyl sulfoxide, or DMSO, as a solvent because chemists were concerned it could react with the stronger oxidants used as comparisons.
The full process took three months, from the first prompt on March 4th to sharing the OAI-M1-03 results with independent experts on June 4th.
We describe this workflow as near-autonomous, not fully autonomous, because human chemists still made important decisions throughout the process. The model proposed the key research ideas, while human chemists provided high-level steering and judgment, corrected experimental details, helped prepare lab consumables and reagents, and repeated key experiments by hand.
OAI-M1-03 identified TEMPO as a useful additive for the primary sulfonamide Chan-Lam coupling studied here. Under the optimized conditions, the reaction improved in two ways: average yield went up, and more substrate combinations reached practically useful yields.
Across two cycles, Maria ran a total of 10,080 reactions – more than a chemist running three reactions every day would run in a decade. That scale mattered because chemistry results can be misleading when they are tested on only a few examples. A reaction can look promising on one pair of starting materials, but fail across a broader set of molecules. Thousands of reactions made it possible to identify TEMPO among ten tested oxidants, see the effect repeat across diverse combinations, and find its limitations.
After analyzing the first round of data, the system proposed a more focused second round of experiments to test follow-up hypotheses. One useful follow-up finding was that TEMPO could be replaced by a much cheaper analog, 4-hydroxy-TEMPO, with little loss in performance.
The result also held up beyond Maria Lab’s microliter-scale screening format. Human chemists reproduced representative reactions manually at bench scale and observed an increase in yield for 11 of 14 substrate pairs; for eight pairs the increase was greater than twofold. That replication matters because very small-scale experiments can sometimes introduce artifacts that disappear at a larger scale. Bench-scale validation is also customary before research is published in a scientific journal.

Four external chemistry experts reviewed the preprint describing OAI-M1-03. Their assessments supported our view that the result was novel and worth sharing with the scientific community. The stronger test will come next: whether independent labs can reproduce the result, and whether chemists find it useful across a broader range of molecules.
Of the other three proposals generated by GPT‑5.4 and tested by Maria during the three-month period, OAI-M1-02 and OAI-M1-04 were experimentally proven in the Maria Lab, while OAI-M1-01 was disproven. Analysis of these results is ongoing.
This work shows that a model can make a useful contribution in organic chemistry. It did more than summarize the literature or suggest a one-off experiment: it proposed a specific surprising hypothesis and surfaced it for human review, designed experiments, interpreted experimental data, and designed follow-up experiments.
It does not show that AI can independently run a chemistry research program from end to end. Human judgment remained essential, and the workflow depended on specialized high-throughput infrastructure. It also does not establish that the method will generalize to other coupling reactions, other substrate classes, or manufacturing conditions.
The yield estimates came from a high-throughput platform, and bench validation covered 14 representative substrate pairs. More work is needed to characterize the reaction mechanism, define the substrate scope, measure performance under different laboratory conditions, and reproduce the result independently.
Chemistry capabilities require careful treatment because the same tools that can support medicine and materials science could also be misused. We deliberately scoped this work to a legitimate medicinal-chemistry problem: improving a known coupling reaction used to make drug-like molecules. The experiments did not involve toxins, chemical weapons, or requests to design harmful compounds. These results should not be read as evidence that the system can help with those harmful applications. The project did not test or demonstrate that.
We assess and mitigate emerging risks from advanced model capabilities through our Preparedness Framework, including risks related to chemical and biological domains. The model used in this work had already undergone relevant evaluations with the UK AI Security Institute, and the system was designed to refuse requests focused on harmful applications. The experimental workflow added another layer of control: human chemists selected which proposals entered the lab, reviewed experimental plans, and retained control of the physical infrastructure.
We think this is the responsible way to study AI's potential in experimental chemistry: choose a problem space with clear scientific value, pair model-level safeguards with expert oversight, and evaluate the system through constrained physical experiments. As these capabilities improve, we will continue to assess emerging risks, strengthen safeguards, and be specific about what a result does and does not imply.
The immediate next steps are scientific: test a broader range of starting materials, investigate why the additives improve the reaction, map where the effect works and fails, and support independent replication. Together, these studies will determine how broadly the method can be applied and how useful it is in practical medicinal chemistry workflows.
Our longer-term goal is to make AI systems reliable scientific partners that help researchers generate hypotheses, design experiments, interpret results, and decide what to test next, while remaining grounded in expert judgment, reliable measurement, and strong safeguards. Organic chemistry is a particularly high-leverage area because progress in small-molecule discovery and manufacturing depends on being able to make molecules reliably. Scientists can only test molecules they can make, and better synthesis can expand the range of ideas they can explore across medicine, agriculture, electronics, energy, and materials science. This result is one early example of that broader direction: a frontier model, specialized agents, an automated laboratory, and human chemists working together to move faster through the research loop and produce findings the scientific community can evaluate, reproduce, and build on.
We are grateful to the Molecule.one team and to the independent chemists who reviewed this work.