AI Systems Solve Just 2% of Advanced Maths Problems in New Benchmark Test
Leading AI systems are solving less than 2% of problems in a new advanced mathematics benchmark, revealing significant limitations in their reasoning capabilities, research group Epoch AI reported this week.
The benchmark, called FrontierMath, consists of hundreds of original research-level mathematics problems developed in collaboration with over 60 mathematicians, including Fields Medalists Terence Tao and Timothy Gowers. While top AI models like GPT-4 and Gemini 1.5 Pro achieve over 90% accuracy on traditional math tests, they struggle with FrontierMath's problems, which span computational number theory to algebraic geometry and require complex reasoning.
"These are extremely challenging. [...] The only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages," Tao said. The problems are designed to be "guessproof," with large numerical answers or complex mathematical objects as solutions, making it nearly impossible to solve without proper mathematical reasoning.
Further reading: New secret math benchmark stumps AI models and PhDs alike.
I'm surprised they managed 2%. (Score:5, Insightful)
Pretty good for a system that is essentially repeated statistical guessing.
Re: (Score:2)
Well, the thing is that that can be said of our brains too.
see e.g. Anil Seth or Shamil Chandaria (or others) on Predictive Processing etc.
https://www.youtube.com/watch?... [youtube.com] (The Free Energy Principle and predictive processing Chandaria)
https://www.youtube.com/watch?... [youtube.com] (Your Brain Hallucinates Your Conscious Reality | Anil Seth | TED, from 7 years ago; though TED talks are kind of cringey to me)
You are an algorithm too (Score:2)
just go down to the molecular level and humans are a stochastic algorithm
Re: (Score:2)
yeah sure, and if you try to compute that with regular methods (like the ones used by LLMs) you'll run out of atoms to do it with.
This seems like a really smart thing to say, until you look past the semantic-level theory.
Less than 2% of slashdotters could solve even one (Score:2)
So? AIs are not intended to do logic problems per se; they reason like humans, not logicians. Most humans could not solve these problems, let alone 2% of them.
Re: (Score:2)
Well, the thing is that that can be said of our brains too.
You may very well have understood and debunked the GP's objection, but my take on this debate is that you're missing the point. The problem with current LLMs isn't that they aren't brains. It's that they lack a certain nonlinearity they would need to be truly smarter than something like a cat. For example, I can ask LLMs esoteric questions about chemistry. I know they will at least have a chance at answering, because there is ample input text that describes the subject matter I want to know about. Note that
Re: (Score:2)
The bigger problem is that an LLM is a "Large Language Model".
It is trained on language: books, articles, etc.
To have an AI (I don't like to call it AI, because strictly speaking it is not) that does math, you need a "Large Math Model":
An LMM, and not an LLM.
You probably have to mix them: an LLM to understand your prompts, and an LMM to answer them.
Re: (Score:2)
Personally I find the vitalist view to make more sense and better fit observations.
The problem is that you are unnecessarily complicating your model without adding any predictive power to it. Adding "pre-existing consciousness" will not increase the predictive power of a model of reality; it just makes the model more complicated. And by Occam's Razor, the simplest model (of all the models with the same predictive power) is the most likely to be correct.
Re: (Score:2)
If it is purely an LLM with nothing else to it, then yes. But that still means we need to augment it better. AI, without any major breakthroughs or infeasible computational need, should be able to see the problem format and apply known techniques. The benchmark wasn't asking it to come up with an unknown algorithm or proof (yet). We shouldn't be making excuses that AI is merely an LLM.
Re: (Score:3, Interesting)
Well, it will be worse: It will not know which ones it solved and which ones it did not. Humans with a working mind do two things: 1. solve the problem or not and 2. evaluate whether they have solved the problem. Statistical guessers can sometimes do (1), but they cannot do (2) at all.
Re: (Score:3)
Which is why the actual systems designed to solve these things generally include testing their solutions.
Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."
Re: (Score:1)
Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."
This is just you being dishonest by misdirection. Nobody is talking about "modern" AI. What is being talked about, and you know that, is LLMs. And these are just statistical pattern matching.
Re: (Score:3)
You either misunderstand what even the commercial LLM-based systems are (to be fair, the zeitgeist's habit of calling them "LLMs" is annoying) or you're choosing to misrepresent them to support the particular hill you've chosen to die on.
From the article:
From the paper it's about:
Re: (Score:3)
I think a lot of these folks are missing implied point 3 and you might need to say it explicitly.
Humans can be held accountable for their mistakes and have an incentive to not just give an answer that might cause massive problems.
This is not an immediate moral problem, so I can understand why they might need the implications for business process explained to them as well >_>
Re: (Score:1)
Counter Argument: Global Warming.
Humans are not only not held accountable for their mistakes. They are elected to high positions of power and given significant rewards of every kind.
Re: (Score:2)
Can I just ask...
What do you think the last line of my previous post means?
Re: (Score:1)
I didn't care about the moral aspect since your foundation, "Humans can be held accountable for their mistakes", wasn't well laid.
Humans are *not* held accountable for their mistakes.
Especially once they get any decent amount of power.
Re: (Score:2)
Do you understand the difference between the engineering terms "can" and "must"?
Re: (Score:2)
I did not add that intentionally. What happens in reality is that some (few) humans with intact personal integrity, a desire to improve themselves and working general intelligence can judge themselves and decide to do better.
How well that "holding people accountable" works is something we see every day in crap software.
Re: (Score:2)
I'm not talking about personal integrity.
I'm talking about leaving a smoking crater that other people point at fearfully.
I completely agree with what you're saying otherwise... perhaps I'm the one who needs to make what I'm saying clearer.
Re: (Score:2)
That is because you do not understand how incentives work. That "smoking crater" has zero preventative value. There is enough research on, for example, the total absence of preventative value of the death penalty.
Re: (Score:2)
The death penalty absolutely has preventative value.
Re: (Score:2)
You know what counts as a smoking crater in this scenario?
A massive disconnect between expected income and actual income after several years of a hype cycle. A bunch of work already done that was expected to yield profit that visibly doesn't.
I'm with you on the death penalty. They need feedback in a language they actually understand.
It will get a lot better... (Score:2)
Naturally (Score:5, Funny)
Having learned from elementary school answer keys, it's not hard to guess that the word that best follows "4 + 4 =" is "8". That doesn't mean the LLM even knows what 4 or 8 is, much less that it can do even basic arithmetic.
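The mechanism the parent describes can be sketched as a toy next-token predictor; the corpus and function here are illustrative stand-ins, not a real LLM:

```python
from collections import Counter

# Toy illustration (not a real LLM): "learning" arithmetic as next-token
# statistics over a training corpus, with no notion of numbers at all.
corpus = ["4 + 4 = 8", "2 + 2 = 4", "4 + 4 = 8", "3 + 3 = 6", "4 + 4 = 8"]

def next_token(prompt: str) -> str:
    # Pick the token that most often followed this prompt in the corpus.
    continuations = Counter(
        line[len(prompt):].strip() for line in corpus if line.startswith(prompt)
    )
    return continuations.most_common(1)[0][0]

print(next_token("4 + 4 ="))  # prints "8" purely by lookup, no arithmetic
```

The point of the sketch: the "8" comes out because it was the most frequent continuation in the training text, not because anything was added.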
Re: (Score:2)
Re:Naturally (Score:4, Funny)
Re: Naturally (Score:2, Troll)
Re: (Score:1)
Only when the answers are widely known and documented. Since LLMs don't have any means of performing logic operations like math, the LLM isn't actually doing math (barring outside libraries, which isn't the LLM doing the math but rather the UI/frontend choosing to load a Python library instead of sending the raw prompt to the LLM).
Re: (Score:1)
Exactly. It's just pattern matching published answers. That's why more novel solutions elude it.
Re: (Score:2)
State of the art chat bots are already acing math problems at undergrad level, which is probably already better than 90% of Americans.
Good thing I'm in the other half of Americans.
Re: Naturally (Score:2)
Re: (Score:2)
The great part about it is when an LLM can't even get that right (typically because such basic math is fucked up intentionally and sarcastically - 1+1=3 and all of that. But also because of the limited usefulness of the surrounding context in the content LLM trainers stole training data from).
Re: (Score:2)
No idea why someone modded this funny. Perhaps a misclick.
This is correct and insightful.
Uh huh (Score:2)
Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?
I don't have a problem with calling most pe
Re:Uh huh (Score:5, Insightful)
Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?
The difference is that "everyone" [human without a graduate degree] is not trained on all the books and all of the internet. The point is that the LLMs should already know all the math needed to solve the problems. If you prompt them properly, they can very likely write [approximately cite] the mathematical knowledge needed to solve the problems.
Re: (Score:2)
That's ignoring the fact that LLMs literally aren't designed to reason, especially generalist LLMs like GPT.
That's ignoring the fact that if you took someone with 8 years of liberal education - in the strictest sense and meaning of an extremely broad education - and asked them to do the same problems, they would have the same issues. A broad overview of many subjects in no way makes you an expert in a niche area.
There's a reason why the medical AI's that are detecting cancer months and years before huma
Re: (Score:2)
The point is that the LLMs should already know all the math needed to solve the problems.
There you go, anthropomorphizing LLMs. LLMs do not KNOW anything. There is no circuitry or code for "knowing". If the EXACT answer is not held anywhere within their data storage, they will not be able to come up with the correct answer.
Randomness can coincidentally look like "knowing", which is why so many people are being fooled into thinking there is anything sentient or knowing about LLMs.
Re: (Score:2)
Yeah, sports fan reasoning. TeamILike lost, but that actually means they're better because reasons.
You can argue that the test is irrelevant because it's unfair, wrong, or testing the wrong thing, but you can't use a test in which you got your ass kicked to argue that you're better at the task it was testing.
I think it's testing the wrong thing. Good AI math models combine a language model to formalize the written problem, symbolic math and formal logic engines to do the actual logic manipulation and test t
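The hybrid design described above (language model to formalize, separate engine to compute and check) can be sketched minimally; every name here is a hypothetical stand-in, with the "LLM" step reduced to a word-to-operator lookup:

```python
import operator

# Hypothetical two-stage pipeline: a language model formalizes the written
# problem (stubbed here as a lookup table), and a separate logic engine does
# the actual computation and verifies the result.
OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}

def formalize(problem_text: str):
    # Stand-in for the LLM step: map words to a formal (op, a, b) triple.
    words = problem_text.split()
    return OPS[words[1]], int(words[0]), int(words[2])

def compute(op, a, b):
    # The logic engine: deterministic arithmetic, not token prediction.
    return op(a, b)

def verify(op, a, b, answer):
    # Testing the solution, as real math-solving systems do.
    return compute(op, a, b) == answer

op, a, b = formalize("17 times 23")
answer = compute(op, a, b)
assert verify(op, a, b, answer)
print(answer)  # 391
```

The design point is the division of labor: the language model only translates into a formal representation, and everything downstream is exact.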
Re: (Score:2)
Terence Tao and a couple of other Fields Medalists said they could likely only do 1 or 2 of these problems (which, if out of 100 problems, would be 1 or 2 percent), so I'm going to guess your average math grad student would get zero.
I like how we now have a test that the smartest mathematicians in the world can only individually solve single problems from, but we expect the "dumb AI" to score 100% otherwise it's not very smart.
Meanwhile we have entire school districts where not a single student is proficient in
Re: (Score:2)
To be clear, I don't think this is "OMG we have superhuman AGI."
I do think this is how it will happen. People will be criticizing some future system with "well, there's one guy in Japan who can do this particular problem better, and this one woman in New York who might be able to do this other thing better, and...." Then after that goes too we'll be left with "well, it doesn't *actually* understand" and "yeah but it's not *conscious* like I am."
Boris Johnson Algorithms (Score:2)
They are a Boris Johnson algorithm: highly educated but incapable of reason, although they can fake it sometimes.
Re: (Score:2)
They do not. Not even the original chatGPT did that.
Re: (Score:2)
What this indicates is that 2% of the answers were already in the training data. The AI did not solve the problems. The AI searched its training data and found 2% of the answers readily available to it. Those 2% were created by humans.
Re: (Score:2)
Imagine you want to test someone's lockpicking ability--they claim they can unlock a combination lock just by feel. But here's the wrinkle, you can't do it in person, you have to mail them the lock and then they do it by video call.
The problem with this plan is that between the time that they receive the lock in the mail and the video call they could brute force it with a machine that spins the lock through every combination and then just memorizes the answer and on the video call this supposed lock picker
Repeat after me: (Score:5, Informative)
For generative "AI" the following is true: "AI" has no reasoning ability. "AI" cannot solve problems. "AI" has no model of reality. "AI" can only fake these and as soon as you leave what its training data covered, it is lost.
Re: (Score:1)
So tell me, what can you do as soon as you leave your training data?
For example, could you prebaxel plume mostna 2fe1::a0-2^4 guh guh guh?
Re: (Score:3)
Animal brains have more than just "training data". How else does a newborn animal breathe? Get up and run? What are phobias? They aren't any result of "training data".
Re: (Score:1)
I can do a thing called "thinking". You might be able to do so too, although you clearly are not at the moment.
Re: (Score:1)
Have they tried... (Score:3)
Not Reasoning. (Score:3)
Calling it Al is just fraud.
OF COURSE NOT (Score:2)
How hard are these problems? (Score:4, Interesting)
"problem set remains private" (Score:3)
Semi-private. Anthropic and OpenAI already combed their logs to find them.
Asking their next models these questions will be completely useless.
And 2% is impressive. (Score:3)
“All of the problems I looked at were not really in my area and all looked like things I had no idea how to solve,” Gowers said. “They appear to be at a different level of difficulty from IMO problems.” The problems are designed not just to be hard but also to resist shortcuts. Each one is “guessproof,” meaning it’s nearly impossible to solve without doing the mathematical work. As the FrontierMath paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1% chance of guessing correctly without the proper reasoning.
So solving 2% should already be impressive. Unfortunately, some people are going to look at the headlines like the one above and think that this says that the AI are not impressive.
Re: (Score:2)
Nearly guessproof and "less than 1% chance of guessing the correct answer" are damn near antonyms.
Pick a number, 1 through 101. That's less than a 1% chance.
Nearly guessproof needs to be far, far more than that.
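The gap between 1-in-101 odds and a large numerical answer space is easy to quantify with a toy simulation; the secret value and the answer range here are made up for illustration:

```python
import random

# Toy check of the "guessproof" idea: if answers are large integers drawn
# from a huge space, blind guessing succeeds far less often than 1 in 101.
random.seed(0)
secret = 7_362_984            # a made-up large numerical answer
answer_space = 10_000_000     # assumed size of the plausible answer range
trials = 100_000

hits = sum(random.randrange(1, answer_space) == secret for _ in range(trials))
print(hits, "hits in", trials, "blind guesses")
```

With a per-guess probability around 1 in 10 million, the empirical hit rate is vastly below the 1-in-101 bar the parent mentions.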
Re: (Score:2)
I haven't looked at the problems, but I'm pretty sure I won't be able to solve them, not being a graduate mathematician.
What would be interesting to know though is whether these are problems accessible to most post graduate mathematicians or whether they require specific skills so any one person will only be able to solve one or
Re: (Score:2)
Re: (Score:2)
The point is that the LLMs should already know all the math needed to solve the problems.
I suppose it is impressive that their data stores contained enough information to be able to get 2% of the problems solved... but ummm, there is no reasoning or even logic being used here. I fail to be impressed by this result.
About the same as an average human (Score:2)
I'd like to see the score for an average human. A truck driver, or an illiterate Congolese farmer, for example.
If they score less than 2%, does that mean AI is smarter?
Re: (Score:2)
I'd like to see the score for an average human.
No problem. Give me all of the content on the Internet, and the only thing that would hold me back is my reading and pattern matching speeds.
Seems a little beside the point. (Score:2)
Why are people spending so much effort trying to prove these things are not what they don't claim to be?
I can weld a hitch to a Miata frame, hook a 6000kg trailer to it, and declare it a shitty tower. Technically I'm right, but I don't know what else I would have accomplished...
What did they expect? (Score:2)
Solving maths problems requires innovative thinking. AI is just pre-existing, regurgitated content. What did they expect?
A case study in problem difficulty (Score:2)
Kaggle math challenges (Score:2)
I noticed this Kaggle $2M competition just the other day which looks neat. Of course it isn't FrontierMath. But the $1M AIMO 1 competition was about trying to get better than Gemma 7B's performance and it sounds like they blew past it from my brief look. I didn't take the time to see whether they actually have any reasoning in it though there are apparently math packages and some kind of feedback loops in there. AIMO 2 just started. Sure LLMs can't do math, for an arbitrary definition of math. They aren't d
Re: (Score:2)
P.S. Diving down the rabbit hole, it turns out that these LLMs can actually write and execute Python code to calculate intermediate results. Still not "reasoning", but I'm not seeing obvious impediments to it getting there if an LLM can call a tool in a TORA that can do some kind of mathematical reasoning based on actually understanding mathematical concepts. Which isn't an LLM, so far anyway.
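The loop described here (model emits Python source, harness executes it for an exact intermediate result) can be sketched roughly as follows; model_output is hard-coded as a stand-in for an actual LLM call, and no sandboxing is shown:

```python
# Sketch of a tool-use loop: the "model" emits Python source as text, and
# the harness executes it to get an exact intermediate result.
model_output = """
result = sum(i * i for i in range(1, 101))  # sum of squares 1..100
"""

def run_tool(code: str) -> int:
    # Execute the generated code in an isolated namespace. A real harness
    # would restrict what the code can do; none of that is shown here.
    scope = {}
    exec(code, scope)
    return scope["result"]

print(run_tool(model_output))  # 338350
```

The arithmetic is done by the interpreter, not the model; the model's only job is producing code that, when run, yields the number.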
So...better than humans? (Score:2)
About right.