AI Solving Olympiad-Level Math

Claire Liang '29

Picture this: you pour hours upon hours into preparing for math competitions such as MATHCOUNTS, the American Mathematics Competition (AMC) 10/12, and, if you’re really dedicated, the American Invitational Mathematics Examination (AIME). You carve out time beyond schoolwork just to get enough questions right on the tests, and making it to the International Mathematical Olympiad (IMO) seems too unrealistic to even dream of. Then one day, a few artificial intelligence (AI) models solve most of the IMO questions. This once-unimaginable scenario is exactly what happened during the summer of 2025 (Kakaes, 2026). Beyond the result itself, it highlighted two things: the rapid and still poorly understood improvement in AI’s mathematical ability in recent years, and the unknown impact it may have on future research.

The IMO is an annual international competition for high school students. First held in 1959, it is one of the most prestigious student math competitions; even highly successful mathematicians and scientists highlight an IMO gold medal well into their careers. The contest runs over two days. On each day, students are given three problems and four and a half hours to solve them with nothing but pen and paper. Expert mathematicians grade each of the six problems on a scale from 0 to 7 points, making a perfect score 42 points. Though the percentages vary each year, typically about 8% of students win a gold medal, 17% earn a silver medal, and 25% earn a bronze medal (Davis & Marcus, 2025).
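
To make the scoring concrete, here is a small Python sketch of the arithmetic described above. The names used (`max_total_score`, `TYPICAL_MEDAL_SHARES`, and so on) are invented for this illustration, and the medal shares are the typical figures quoted above, not fixed rules.

```python
# Contest format as described in the article.
DAYS = 2
PROBLEMS_PER_DAY = 3
MAX_POINTS_PER_PROBLEM = 7


def max_total_score(days=DAYS, problems_per_day=PROBLEMS_PER_DAY,
                    max_points=MAX_POINTS_PER_PROBLEM):
    """Return the highest possible total score for the whole contest."""
    return days * problems_per_day * max_points


# Typical medal shares quoted in the article (they vary from year to year).
TYPICAL_MEDAL_SHARES = {"gold": 0.08, "silver": 0.17, "bronze": 0.25}


def total_medal_share(shares=TYPICAL_MEDAL_SHARES):
    """Return the fraction of contestants who typically earn any medal."""
    return sum(shares.values())


print(max_total_score())               # 2 days x 3 problems x 7 points
print(round(total_medal_share(), 2))   # gold + silver + bronze shares
```

Under these figures, the perfect score works out to 42, and roughly half of all contestants go home with a medal of some kind.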

In 2025, 630 contestants from 110 countries competed, and 72 won gold medals with a cutoff score of 35. Google’s Gemini Deep Think and OpenAI’s OpenAI-IMO model each earned 35 points out of 42, just clearing the gold-medal cutoff and marking the best performance by AI systems in the competition to date (Davis & Marcus, 2025; Unified Council, 2025). These new systems also outperformed earlier models, including Gemini-2.5-pro, o3, o4-mini, Grok 4, and DeepSeek-R1-0528, which Petrov et al. (2025) evaluated on the same problems. The earlier models’ issues ranged from extremely short responses, often with only a final answer, to citations of non-existent theorems and odd formatting; the best of them scored 13 points, or 31% (Petrov et al., 2025). In 2024, DeepMind tested a combination of AlphaProof and AlphaGeometry2, which scored 28 points, the highest score within the silver-medal range that year (Davis & Marcus, 2025). Together, these results show how quickly AI’s ability to carry out mathematical proofs and reasoning has improved.
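
The figures in this paragraph are easier to compare as percentages. The short Python sketch below redoes that arithmetic using only the numbers quoted above; the helper names (`percent_of_max`, `gold_share`) and the dictionary labels are made up for this illustration.

```python
MAX_SCORE = 42  # six problems worth 7 points each


def percent_of_max(points, max_score=MAX_SCORE):
    """Express a score as a percentage of the maximum, to one decimal place."""
    return round(100 * points / max_score, 1)


def gold_share(gold_medalists, contestants):
    """Percentage of contestants who won gold, to one decimal place."""
    return round(100 * gold_medalists / contestants, 1)


# Scores quoted in the article.
scores = {
    "2025 gold-level AI systems": 35,
    "2024 AlphaProof + AlphaGeometry2": 28,
    "best earlier model in Petrov et al.": 13,
}

for label, points in scores.items():
    print(f"{label}: {points}/{MAX_SCORE} = {percent_of_max(points)}%")

print(f"2025 gold medalists: {gold_share(72, 630)}% of contestants")
```

Rounded to one decimal place, the scores come to 83.3%, 66.7%, and 31.0% of the maximum. The 72 gold medalists make up about 11.4% of the 630 contestants, somewhat above the typical 8%.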

Of course, this improvement didn’t come from nowhere. Pushmeet Kohli, DeepMind’s vice president of science, said DeepMind’s work on math has been underway since 2018 (Kakaes, 2026). Before this major AI win, mathematicians often regarded AI models as too rudimentary to be of much use. Large language models (LLMs) have often struggled with advanced math problems that require full olympiad-style proofs, precise numerical calculation, or symbol manipulation. Other common failures included errors in supposedly logical steps, a lack of creative problem-solving strategies, and frequent false claims that a problem had been solved (Yue & Klein, 2025). However, after the results were released, mathematicians began experimenting with the models and discovered that the models were good at puzzles. “2025 was the year when AI really started being useful for many different tasks,” said Terence Tao, a very prominent mathematician and professor at the University of California, Los Angeles (Kakaes, 2026). Additionally, with the creation and expansion of other commonly used AI tools, such as ChatGPT, DeepSeek, and even Gemini, Lawrenceville has endorsed LLMs. While this support may raise questions and concerns about academic honesty, integrating AI into a learning environment, when done correctly, better prepares students to navigate a technology-filled future. These systems have advanced rapidly and can now construct a hypothesis, prove it, and verify the proof with minimal human intervention (Kakaes, 2026). Despite AI’s improvements, however, human contestants still outperformed the machines at the competition, drawing on their ability to spot patterns, their creative thinking, and years spent mastering math (Unified Council, 2025).

While AI’s high IMO score is a major breakthrough, it is difficult to say whether it will affect applications of AI to actual mathematical research. Many top mathematicians did well in these contests, but many others did not: the skills tested are useful, but IMO problems themselves are not crucial for research. Veríssimo (2025) noted that contest questions require clever problem-solving strategies and are meant to be solved in a couple of hours, culminating in an elegant solution. In a research environment, by contrast, questions are highly open-ended and can take several months to years to solve (Veríssimo, 2025). Furthermore, neither DeepMind nor OpenAI has provided much information about the design and training of its model, and no technical description of Deep Think has been published. The only statement from DeepMind was in its blog: “[W]e additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving, and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.” However, it did not provide any details of the mentioned “hints” and “tips.” OpenAI, for its part, merely stated that OpenAI-IMO is “an LLM doing next-token prediction” and that it uses “new experimental general purpose techniques” (Davis & Marcus, 2025). Because neither company has provided information on the effectiveness of its system compared with previous ones, it is impossible to say what works and what doesn’t.

However, these AI models can still serve a purpose. DeepMind’s and OpenAI’s top-performing models could serve as tools to help tutor students in solving problems and preparing for such difficult competitions. Similarly, educators can use AI to assist in training students while still acknowledging the challenges of learning math (Davis, 2023; Unified Council, 2025). In schools, AI typically carries a negative connotation because of the widespread fear of cheating on homework assignments. However, Davis (2023) points out that AI tools can also be valuable, and as effective as the textbooks typically used, by offering immediate feedback on students’ mistakes and helping with problem-solving. Beyond that, humans bring creativity that AI cannot, while AI provides quick calculations; working together, the two could potentially make new discoveries (Unified Council, 2025).

With AI’s rapid improvement over the past few years, AI systems have now solved problems from a competition as difficult as the IMO. However, AI still requires human creativity and intuition, leaving its ultimate role in mathematical research uncertain; success in competition doesn’t guarantee new discoveries. As mathematician Kevin Buzzard posted, “I certainly don’t agree that machines which can solve IMO problems will be useful for mathematicians doing research, in the same way that when I arrived in Cambridge, UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there.”

References

Boles, S. (2025, July 23). AI leaps from math dunce to whiz. Harvard Gazette. https://news.harvard.edu/gazette/story/2025/07/ai-leaps-from-math-dunce-to-whiz/

Unified Council. (2025). AI vs Human Intuition at International Mathematics Olympiad. Unified Council. https://www.unifiedcouncil.com/Blog/international-mathematics-olympiad-2025-ai-vs-human

Davis, E., & Marcus, G. (2025, July 22). DeepMind and OpenAI achieve IMO Gold. What does it all mean? Substack.com; Marcus on AI. https://garymarcus.substack.com/p/deepmind-and-openai-achieve-imo-gold

Davis, V. (2023, September 21). Using AI to Encourage Productive Struggle in Math. Edutopia. https://www.edutopia.org/article/using-ai-encourage-productive-struggle-math-chatgpt-wolfram-alpha/

Kakaes, K. (2026, April 13). The AI Revolution in Math Has Arrived. Quanta Magazine. https://www.quantamagazine.org/the-ai-revolution-in-math-has-arrived-20260413/

Petrov, I., Dekoninck, J., Balunović, M., Jovanović, N., Minchev, K., Drencheva, M., Marinov, M., & Vechev, M. (2025, July 17). Not Even Bronze: Evaluating LLMs on 2025 International Math Olympiad. MathArena. https://matharena.ai/imo/

Veríssimo, T. (2025, April 5). Mathematical Research vs Mathematical Olympiad. Medium. https://tiagoverissimokrypton.medium.com/mathematical-research-vs-mathematical-olympiad-68e8ff1c6153

Yue, J., & Klein, D. (2025). Benchmarking LLMs on Advanced Mathematical Reasoning. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.pdf