Problem-Solving: The AI That Beat Human Programmers

Shreshta Agrawal '28

In the 2025 International Collegiate Programming Contest (ICPC) World Finals, an artificial intelligence (AI) system competed against programmers from nearly 3,000 universities and 103 countries, and it won a gold medal (Cheng & Lin, 2025). The ICPC is an assessment-based competition in which participants must use code to analyze real-life scenarios and complete various activities such as movement simulations and mapping. For example, an ICPC question might ask a competitor to “optimize subway schedules” or “estimate oil reserves,” after providing sample inputs and outputs as well as illustrations (The Problems, n.d.). The ICPC evaluates a competitor’s ability not only to find logical patterns, but also to interpret new situations creatively. The World Finals challenge competitors to attempt approximately twelve problems within a five-hour period, a tight time limit even for machines. AI’s high processing speed and vast training resources may make its problem-solving skills seem superior to those of humans. AI, however, has long struggled with the creativity and critical thinking that problem-solving requires, skills which humans develop through continuous perception of the surrounding world. Google Gemini Deep Think, released to the public in August 2025, is different. Deep Think is a reasoning feature that can be activated within Gemini 2.5 Pro, a large language model (LLM) that analyzes the language of input prompts and provides corresponding written answers. Gemini Deep Think has far-reaching implications for various fields of research, but its limited availability also raises questions about accessibility.

Gemini may not be human, but it trains like one: by trying and trying again. Understanding Deep Think’s predecessors is integral to grasping the model’s capabilities.

Gemini 2.5 Pro, one of the AI models compatible with Deep Think and officially released in June 2025, enhanced the nuanced reasoning skills of previous models. Gemini 2.5 models all use parallel thinking: upon receiving an input, the AI’s thought process branches in different directions, critiquing its own solutions and building stronger responses. No matter how deeply an AI thinks, following only one line of reasoning at a time often leads to “overthinking,” where the AI keeps revising its outputs until it generates considerably inaccurate answers (Why “Thinking More,” 2025). Parallel thinking helps avoid this error because it allows an AI to start and advance multiple lines of reasoning at once, terminating an idea when it becomes illogical. According to one study, an AI using parallel thinking was up to 20% more accurate than a model that used previous methods (Why “Thinking More,” 2025).
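The idea of advancing several lines of reasoning at once and terminating the ones that stop making sense can be illustrated with a toy program. The sketch below is only an analogy, not how Gemini works internally: the number puzzle, the `STEPS` table, and the `parallel_search` function are all hypothetical names invented for this example. Each "branch" is a sequence of steps toward a target number, and a branch is dropped as soon as it overshoots.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy task: reach `target` from a positive `start`
# using only the steps listed here.
STEPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def parallel_search(start: int, target: int, max_rounds: int = 10) -> Optional[List[str]]:
    """Advance every surviving branch by one step per round.

    Returns the list of steps for the first branch that reaches the
    target, or None if every branch has been terminated.
    """
    # Each branch stores (current value, steps taken so far).
    branches: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(max_rounds):
        next_branches: List[Tuple[int, List[str]]] = []
        for value, path in branches:
            for name, step in STEPS.items():
                new_value = step(value)
                new_path = path + [name]
                if new_value == target:
                    return new_path  # one line of reasoning succeeded
                if new_value < target:
                    # Still plausible: keep this branch for the next round.
                    next_branches.append((new_value, new_path))
                # Otherwise the branch overshot and is terminated here.
        if not next_branches:
            return None  # every line of reasoning was abandoned
        branches = next_branches
    return None


print(parallel_search(1, 11))  # ['+3', '*2', '+3'] because 1 + 3 = 4, 4 * 2 = 8, 8 + 3 = 11
```

A single-branch solver that always tried "+3" first would pass 11 without ever landing on it; keeping several branches alive lets a different combination of steps succeed.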

Additionally, just like previous Gemini models, Gemini 2.5 Pro uses a sparse mixture-of-experts (sparse MoE) model (Pushing the Frontier, 2025, p. 2). A sparse MoE combines multiple “expert” sub-networks and activates only a subset of their combined parameters for a given input, learning to route different input types to different parameters. This energy-efficient structure has advanced Gemini 2.5 Pro beyond its predecessors, but it tends to make AI “suffer from training instabilities” (Pushing the Frontier, 2025, p. 2). Gemini 2.5 Pro, however, has used the sparse MoE structure to its advantage, showing what the development team reports to be “a considerable boost in performance straight out of pre-training compared to previous Gemini models” with regard to both stability and efficiency; these capabilities support the AI’s strong multimedia processing ability (Pushing the Frontier, 2025, p. 3). Gemini 2.5 Pro can process and convert long sets of text, codebases, audio, and video files, which it trains on to expand its logical and creative skills. The AI’s abilities have led it to excel in benchmark testing even beyond the ICPC. Against various Claude, OpenAI, Grok, and DeepSeek models, Gemini 2.5 Pro scored the highest on six benchmarks (Aider Polyglot, GPQA, Humanity’s Last Exam, FACTS Grounding, LOFT, and MRCR-V2) that assess its code editing skills, scientific accuracy, searching speed, and objective reasoning; furthermore, the AI even beat its own score, scoring five times higher on the Aider Polyglot test compared to a previous trial (Pushing the Frontier, 2025, pp. 12-13). Gemini 2.5 Pro displays its effectiveness and reliability in processing and building on a diverse array of information, skills that are integral to applying it in many research fields, including mathematics and software engineering.
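To make the "activate only a subset" idea concrete, here is a minimal sketch of top-k routing, a common way to build a sparse mixture-of-experts layer. It is an illustration under stated assumptions, not Gemini's actual architecture: the four toy `EXPERTS`, the `sparse_moe` function, and the router scores passed in by hand are all hypothetical. In a real model, the experts are neural sub-networks and the router scores are produced by a learned layer.

```python
import math
from typing import Callable, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical "experts": in a real model each would be a sub-network.
EXPERTS: List[Callable[[float], float]] = [
    lambda x: 2.0 * x,
    lambda x: x + 10.0,
    lambda x: -x,
    lambda x: x * x,
]


def sparse_moe(x: float, router_scores: List[float], top_k: int = 2) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and blend their outputs."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    active = ranked[:top_k]  # the remaining experts are never evaluated
    weights = softmax([router_scores[i] for i in active])
    output = sum(w * EXPERTS[i](x) for w, i in zip(weights, active))
    return output, sorted(active)


# Assumed router scores favouring experts 0 and 1 equally.
output, active = sparse_moe(3.0, [1.0, 1.0, -5.0, -5.0])
print(active, output)
```

Because only two of the four experts run for this input, roughly half of the layer's work is skipped, which is where the efficiency of the sparse design comes from.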

Gemini 2.5 Pro’s achievements may be difficult to surpass, but Deep Think consistently ranks higher. In 2025, Deep Think competed at the International Mathematical Olympiad (IMO), which was held in Queensland, Australia. The IMO consists of two four-and-a-half-hour sessions covering mathematical disciplines such as geometry, algebra, combinatorics, and number theory. The contest requires not only a deep grasp of mathematical concepts, but also logical reasoning and creative innovation, a feat difficult even for professional mathematicians. Deep Think not only scored well but also became the first AI model to earn a gold medal, demonstrating humanlike reasoning (Henkel, 2025, pp. 2-4). Deep Think’s capabilities are not limited to one subject, however: the AI displays its knowledge and reasoning skills in Humanity’s Last Exam (HLE), a multidisciplinary AI benchmark with over 1,000 questions across topics such as math, science, and the humanities. As its name suggests, the HLE is considered one of the most difficult exams of its kind. Gemini 2.5 Pro ranked highly in comparison to many other AI models with 21.6% accuracy, but Deep Think still beat it with 34.8% accuracy, 13.2 percentage points higher (Deep Think, 2025, p. 4). Deep Think breaks barriers with every test it excels in, displaying the many ways in which its skills can be applied. The AI’s performance at the ICPC displays its outstanding coding capabilities, for example. According to Stuart Russell, a computer science professor at the University of California, Berkeley, Deep Think’s “performance may show progress towards making AI-based coding systems sufficiently accurate for producing high quality code” (Booth, 2025). Improving AI coding abilities means that people can rely on generated code with less human supervision, saving time and energy.
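The gap between the two Humanity’s Last Exam scores can be stated in two ways, and the short calculation below shows both. It uses only the two accuracy figures quoted in this article; the variable names are chosen for illustration.

```python
# Accuracy figures quoted in the paragraph above, in percent.
gemini_pro_accuracy = 21.6
deep_think_accuracy = 34.8

# Absolute gap, measured in percentage points.
point_gap = round(deep_think_accuracy - gemini_pro_accuracy, 1)

# Relative improvement of Deep Think over Gemini 2.5 Pro, in percent.
relative_gain = round((deep_think_accuracy / gemini_pro_accuracy - 1) * 100, 1)

print(point_gap)      # 13.2
print(relative_gain)  # 61.1
```

In other words, a 13.2-point gap means Deep Think answered roughly 61% more questions correctly than Gemini 2.5 Pro did on the same benchmark.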

Deep Think’s performance in testing, which led up to its ICPC win, has prompted researchers to examine possible applications less than six months after the AI’s release. Regarding the AI’s accomplishments, ICPC Global Executive Director Dr. Bill Poucher commented, “Gemini successfully [...] achieving gold-level results [...] marks a key moment in defining the AI tools and academic standards needed for the next generation” (Cheng & Lin, 2025). With the competition-level version of Deep Think unreleased to the public and a basic version locked behind a high paywall, though, can the AI model really benefit everyone? Deep Think is included in the Google AI Ultra subscription plan, which, as of October 2025, costs nearly $250 per month after a three-month discount period (Power Your Everyday, n.d.). There is indeed a significant performance gap between Deep Think and other AI models, including free versions of Google Gemini with limited capabilities. While the competition-level version of Deep Think “can reason over several hours when solving complex math problems,” the public version sacrifices some capability to make it quicker and more practical for everyday use (Johnivan, 2025). Beginning in August 2025, Google released Deep Think’s advanced version to “a limited group of academic researchers and mathematicians” for use and further testing (Johnivan, 2025). Still, this group makes up a very small portion of the people looking to use Deep Think’s full potential. Considering the AI’s unique skills, its inaccessibility places those hoping to advance their research at a disadvantage. Additionally, the Deep Think model that people can access, which does not include many competition-level features, creates a disparity between developers’ claims and user experiences. For example, a user who submitted a novel outline to Deep Think noted that they were less satisfied with its feedback than with that of other models (Raian, 2025). Many people also find Deep Think to be lacking in the creative reasoning it is marketed for (Raian, 2025).

Google’s success in the 2025 ICPC with Gemini 2.5 Pro and Deep Think marks a great stride in AI’s journey towards a level of logical and creative capability rivalling human minds. Still, Google has made Deep Think’s most competitive abilities unavailable to most consumers, so only a limited population may benefit from the breakthrough.

References

(2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Google DeepMind. Retrieved October 16, 2025 from
https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf?utm_source=chatgpt.com
(2025, August 1). Gemini 2.5 Deep Think - model card. Google. Retrieved October 16, 2025 from
https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf
(2025, July 9). Why ‘Thinking More’ Isn’t Always Making Generative AI Smarter. A. James Clark School of Engineering. Retrieved October 17, 2025 from
https://ece.umd.edu/news/story/why-thinking-more-isnt-always-making-generative-ai-smarter
Booth, R. (2025, September 17). Google DeepMind claims ‘historic’ AI breakthrough in problem solving. The Guardian. Retrieved October 16, 2025 from
https://www.theguardian.com/technology/2025/sep/17/google-deepmind-claims-historic-ai-breakthrough-in-problem-solving
Cheng, H. & Lin, H. (2025, September 17). Gemini achieves gold-medal level at the International Collegiate Programming Contest World Finals. Google DeepMind. Retrieved October 16, 2025 from
https://deepmind.google/discover/blog/gemini-achieves-gold-level-performance-at-the-international-collegiate-programming-contest-world-finals/
Comanici, G. et al. (2025, July 22). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261v5. Retrieved October 16, 2025 from
https://doi.org/10.48550/arXiv.2507.06261
Henkel, J. (2025, August 27). The mathematician’s assistant: integrating AI into research practice. arXiv preprint arXiv:2508.20236v1. Retrieved October 16, 2025 from https://doi.org/10.48550/arXiv.2508.20236
Johnivan, J. R. (2025, August 4). Gemini 2.5 Deep Think is Google’s most advanced AI model to date. TechRepublic. Retrieved October 19, 2025 from
https://www.techrepublic.com/article/news-deep-think-ai-research-variant-launch/?utm_source=chatgpt.com
Power your everyday with a Google AI plan. Google One. Retrieved October 16, 2025 from https://one.google.com/intl/en/about/google-ai-plans/
The Problems. The ICPC International Collegiate Programming Contest. Retrieved October 16, 2025 from https://icpc.global/compete/problems
Raian. (2025, May 21). Is Gemini 2.5 Pro Deep Think worth the hype? Latenode. Retrieved October 16, 2025 from https://latenode.com/blog/gemini-deep-think-real-deal
Whitwam, R. (2025, September 17). Gemini AI solves coding problem that stumped 139 human teams at ICPC World Finals. Ars Technica. Retrieved October 16, 2025 from https://arstechnica.com/google/2025/09/google-gemini-earns-gold-medal-in-icpc-world-finals-coding-competition/