

Advanced Version of Gemini With Deep Think Officially Achieves Gold-Medal Standard at the International Mathematical Olympiad (deepmind.google) 27
An anonymous reader shares a blog post: The International Mathematical Olympiad is the world's most prestigious competition for young mathematicians, and has been held annually since 1959. Each country taking part is represented by six elite, pre-university mathematicians who compete to solve six exceptionally difficult problems in algebra, combinatorics, geometry, and number theory. Medals are awarded to the top half of contestants, with approximately 8% receiving a prestigious gold medal.
Recently, the IMO has also become an aspirational challenge for AI systems as a test of their advanced mathematical problem-solving and reasoning capabilities. Last year, Google DeepMind's combined AlphaProof and AlphaGeometry 2 systems achieved the silver-medal standard, solving four out of the six problems and scoring 28 points. Making use of specialist formal languages, this breakthrough demonstrated that AI was beginning to approach elite human mathematical reasoning.
This year, we were amongst an inaugural cohort to have our model results officially graded and certified by IMO coordinators using the same criteria as for student solutions. Recognizing the significant accomplishments of this year's student-participants, we're now excited to share the news of Gemini's breakthrough performance. An advanced version of Gemini Deep Think solved five out of the six IMO problems perfectly, earning 35 total points, and achieving gold-medal level performance.
Re: (Score:3)
Oh, so you must have found it easy when you got your gold medal?
Re: (Score:2)
The point is we have a myriad of "tests that are hard for humans, but don't necessarily translate to anything vaguely useful". In academics, a lot of tests are only demanding of reasoning ability because the human has limited memory. Computers short on actual "reasoning" largely make up for it by having just mind boggling amounts of something more akin to recall than reasoning (it's something a bit weirder, but as far as analogies go, recall is closer).
It's kind of like bragging that your RC boat could ge
AI Training (Score:2)
I have to wonder. Did "Gemini Deep Think" solve the problems or simply regurgitate the answer from the billions of sucked up webpages, math research papers, etc. used to train the model? Actual competitors don't have the complete history of https://math.stackexchange.com... [stackexchange.com] at their fingertips.
Re:AI Training (Score:5, Informative)
Re: AI Training (Score:1)
Thank you
Re:Ok (Score:4, Interesting)
Yes, they are high school students, but the students who get gold medals frequently started studying for the IMO in 9th or 10th grade, and sometimes even earlier. And yes, mathematicians aren't always going to see the trick that a given problem relies on. It is true that IMO problems often involve tricks or approaches that one can study for. The problems are not research mathematics, and a lot of very good mathematicians never did well at the IMO as a young person. At the same time, some problems which are very hard at the IMO are things that mathematicians in specific areas would have little trouble with. For example, P3 of 2007 is a tough graph theory problem https://artofproblemsolving.com/wiki/index.php/2007_IMO_Problems/Problem_3 [artofproblemsolving.com], but if one has done a lot of graph theory it may seem simpler.
But the IMO also frequently includes problems that require not just pre-existing knowledge but something we would normally call creativity, or that involve concepts which are simply not standard. For example, P6 from 2014 is a problem whose central idea is essentially a creative thematic connection between geometry and Ramsey theory https://artofproblemsolving.com/wiki/index.php/2014_IMO_Problems/Problem_6 [artofproblemsolving.com]. P6 from 2009 shows another interesting way things can go https://artofproblemsolving.com/wiki/index.php/2009_IMO_Problems/Problem_6 [artofproblemsolving.com]; it is a curious one because, although very few people got it right, once one has seen the solution it feels completely obvious (unlike some other IMO problems where, even after having seen a solution, it isn't clear where it came from). (Note that traditionally Problem 6 is meant to be the hardest problem.)
Your last point, that these problems are on the whole more suited to AI, probably has some validity. A lot of the geometry-style IMO problems are highly narrow in framing, and AI has had much more success with those. And all IMO problems are in an important sense easier than genuine research problems, because you know what you need to prove, or something very close to it, whereas one of the big issues in research is that you often spend a massive amount of time trying to prove something that turns out to be false. IMO problems are also selected so that they do not require any "advanced" techniques such as calculus, which drastically reduces the functional search space. In fact, one issue some LLMs had early on with IMO problems is that they would jump to solutions built on high-powered techniques that just weren't useful in that context. So no, this isn't research math, but it remains extremely impressive and shows how quickly these systems have progressed. My own opinion, based on where things had gone over the last two years, was that the current AI systems would likely fizzle out in the sense discussed here https://scottaaronson.blog/?p=7266 [scottaaronson.blog]. This is evidence that I was wrong in that assessment.
In terms of AI and research mathematics, we're still not there. But we are getting closer to the point where AI systems can be genuinely useful. For example, not too long ago I was looking for a specific result of a type I had seen before, and I asked an LLM about it. The LLM hallucinated a bunch of junk, but it kept hallucinating papers by one specific mathematician, and it turned out there was an actual paper by that person with the sort of result I was looking for. More direct use of AI, not just for looking things up like that but for doing actual research, is also being developed. The major hope is that we'll use systems like Lean http [wikipedia.org]
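For readers who haven't seen a proof assistant, here is a toy sketch (my own illustrative example, nowhere near IMO difficulty) of what a machine-checked Lean 4 proof looks like, assuming a recent toolchain where the "omega" tactic is built in. The point is that the kernel accepts nothing short of a complete formal argument, which is why such systems are attractive for checking AI-generated proofs.

    -- Claim: the sum of two even natural numbers is even.
    -- Every step must be formally justified; there is no "looks plausible" shortcut.
    theorem even_add_even (a b : Nat)
        (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
        ∃ k, a + b = 2 * k := by
      cases ha with
      | intro m hm =>
        cases hb with
        | intro n hn =>
          -- Witness m + n; the remaining linear-arithmetic goal
          -- a + b = 2 * (m + n) follows from hm and hn via omega.
          exact ⟨m + n, by omega⟩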
Re: (Score:2)
But what about putting them into a Computer Algebra system? We have had these for decades now. In fact, I used one when I started my CS studies 35 years ago.
The very point of such a competition is to have a human do it, not a machine.
Re: (Score:2)
Re: (Score:2)
but the usefulness of the IMO as essentially a natural metric of how effective these AI systems are at difficult reasoning problems.
Since LLMs have zero reasoning capability (the math does not allow it), it is obviously a failure at this.
Re: (Score:2)
Re: (Score:2)
Or that watches based on quartz crystals are just an idle curiosity because they aren't built using intricate gear assemblies.
Re: (Score:2)
No, they are not designed to reason. You're arguing from ignorance of the subject just using rhetorical techniques. You're mistaking technical questions with real answers for philosophical questions where you can just blow any horseshit you want out your ass and then use rhetoric to "argue" it.
And your rhetoric is mostly red herrings and strawmen.
Re: (Score:3)
Re: (Score:2)
You're not even wrong. Perhaps start by taking a first-year course on data compression. Once you get to the point where you understand how JPEG works, you will have everything you need to conceptually understand how it is possible for an LLM to represent its training data inside its parameters.
If you're lazy, then I'll just point you to the fact that LLM exploits have been found that make the LLM spit out its training data in clear text. [theregister.com]
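To make that analogy concrete (and it is only an analogy; model weights are not a literal archive of the training set), here is a small Python sketch using the standard-library zlib module, showing how a representation far smaller than the original text can still reproduce it verbatim:

    import zlib

    # Highly redundant "training text" compresses to a tiny blob...
    corpus = ("the quick brown fox jumps over the lazy dog. " * 200).encode()
    blob = zlib.compress(corpus, level=9)
    print(len(corpus), "bytes of text ->", len(blob), "bytes compressed")

    # ...yet the original can be recovered from it exactly ("recall").
    assert zlib.decompress(blob) == corpus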
A computer that can do math? (Score:3)
What will they think of next?
Re: (Score:3)
And other AI engines? (Score:1)
I have in mind the latest Grok 4 Heavy: was it tested? I didn't see any information about other engines taking the same test for comparison.
So? (Score:1)
I have no doubt that Maple or any other decent Computer Algebra system could have done the same ... 30 years ago. Or Wolfram Alpha.
This is a completely meaningless stunt. The only purpose is to deceive the stupid about what these systems can do, or rather cannot do.
Re: (Score:2)
Re: (Score:2)
Why would I waste time on what is clearly a failed technology? The stupid always need years and years to find out that a hype is just a hype and never has any of the claimed substance. I can do it directly.
Re: (Score:2)
NOTHING OFFICIAL AT ALL (Score:1)
The IMO is pretty pissed off at Google because the results are embargoed from publication until the 28th of July.
Apparently Google did their own grading and claimed the gold medal. The IMO is not happy.
https://arstechnica.com/ai/202... [arstechnica.com]
But hey, good on Goog for demonstrating some level of success, whether IMO gives them a gold medal or not.
Shame on them for violating the publication embargo, but I haven't read that contract or those T&Cs, so I only judge by what IMO says and what Goog says.
Complete Fail (Score:2)
What's worse than the self-grading is that the rules include not using calculators.
How can an AI do math without using a calculator?
There is no way any kind of computer can score even 1% using the same rules as humans.