AI Math

AI Systems Solve Just 2% of Advanced Maths Problems in New Benchmark Test

Leading AI systems are solving less than 2% of problems in a new advanced mathematics benchmark, revealing significant limitations in their reasoning capabilities, research group Epoch AI reported this week.

The benchmark, called FrontierMath, consists of hundreds of original research-level mathematics problems developed in collaboration with over 60 mathematicians, including Fields Medalists Terence Tao and Timothy Gowers. While top AI models like GPT-4 and Gemini 1.5 Pro achieve over 90% accuracy on traditional math tests, they struggle with FrontierMath's problems, which span computational number theory to algebraic geometry and require complex reasoning.

"These are extremely challenging. [...] The only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages," Tao said. The problems are designed to be "guessproof," with large numerical answers or complex mathematical objects as solutions, making it nearly impossible to solve without proper mathematical reasoning.

Further reading: New secret math benchmark stumps AI models and PhDs alike.

  • by danda ( 11343 ) on Wednesday November 13, 2024 @03:28PM (#64943305)

    pretty good for a system that is essentially repeated statistical guessing.

    • by rta ( 559125 )

      Well, the thing is that that can be said of our brains too.

      see e.g. Anil Seth or Shamil Chandaria (or others) on Predictive Processing etc.

      https://www.youtube.com/watch?... [youtube.com] (The Free Energy Principle and predictive processing Chandaria)

      https://www.youtube.com/watch?... [youtube.com] (Your Brain Hallucinates Your Conscious Reality | Anil Seth | TED, from 7 years ago, though TED talks are kind of cringey to me)

      • So? AI is not intended to do logic problems per se. They reason like humans, not logicians. Most humans could not solve any of these problems, let alone 2%.

      • by piojo ( 995934 )

        Well, the thing is that that can be said of our brains too.

        You may very well have understood and debunked the GP's objection, but my take on this debate is that you're missing the point. The problem with current LLMs isn't that they aren't brains. It's that they lack a certain nonlinearity they would need to be truly smarter than something like a cat. For example, I can ask LLMs esoteric questions about chemistry. I know they will at least have a chance at answering, because there is ample input text that describes the subject matter I want to know about. Note that

        • The bigger problem is that an LLM is a "Large Language Model".
          It is trained on languages, aka books, articles etc.

          To have an AI (I do not like to call that AI, because strictly speaking it is not) that does math: you need a "Large Math Model"

          An LMM, and not an LLM.

          You probably have to mix them, an LLM to understand your prompts, and an LMM to answer them.

    • If it is purely an LLM with nothing else to it, then yes. But that still means we need to augment it better. AI, without any major breakthroughs or infeasible computational need, should be able to see the problem format and apply known techniques. The benchmark wasn't asking it to come up with an unknown algorithm or proof (yet). We shouldn't be making excuses that AI is merely an LLM.

    • Re: (Score:3, Interesting)

      by gweihir ( 88907 )

      Well, it will be worse: It will not know which ones it solved and which ones it did not. Humans with a working mind do two things: 1. solve the problem or not and 2. evaluate whether they have solved the problem. Statistical guessers can sometimes do (1), but they cannot do (2) at all.

      • by ceoyoyo ( 59147 )

        Which is why the actual systems designed to solve these things generally include testing their solutions.

        Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."

        • by gweihir ( 88907 )

          Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."

          This is just you being dishonest by misdirection. Nobody is talking about "modern" AI. What is being talked about, and you know that, is LLMs. And these are just statistical pattern matching.

          • by ceoyoyo ( 59147 )

            You either misunderstand what even the commercial LLM-based systems are (to be fair, the zeitgeist's habit of calling them "LLMs" is annoying) or you're choosing to misrepresent them to support the particular hill you've chosen to die on.

            From the article:

            Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly.

            From the paper it's about:

            Our evaluation framework allows models to run experiments and r

      • I think a lot of these folks are missing implied point 3 and you might need to say it explicitly.

        Humans can be held accountable for their mistakes and have an incentive to not just give an answer that might cause massive problems.

        This is not an immediate moral problem, so I can understand why they might need the implications for business process explained to them as well >_>

        • Counter Argument: Global Warming.

          Humans are not only not held accountable for their mistakes. They are elected to high positions of power and given significant rewards of every kind.

          • Can I just ask...

            What do you think the last line of my previous post means?

            • I didn't care about the moral aspect since your foundation, "Humans can be held accountable for their mistakes", wasn't well laid.

              Humans are *not* held accountable for their mistakes.

              Especially once they get any decent amount of power.

        • by gweihir ( 88907 )

          I did not add that intentionally. What happens in reality is that some (few) humans with intact personal integrity, a desire to improve themselves and working general intelligence can judge themselves and decide to do better.

          How well that "holding people accountable" works is something we see every day in crap software.

          • I'm not talking about personal integrity.

            I'm talking about leaving a smoking crater that other people point at fearfully.

            I completely agree with what you're saying otherwise... perhaps I'm the one who needs to make what I'm saying clearer.

            • by gweihir ( 88907 )

              That is because you do not understand how incentives work. That "smoking crater" has zero preventative value. There is enough research on, for example, the total absence of preventative value of the death penalty.

              • The death penalty absolutely has preventative value.

              • You know what counts as a smoking crater in this scenario?

                A massive disconnect between expected income and actual income after several years of a hype cycle. A bunch of work already done that was expected to yield profit that visibly doesn't.

                I'm with you on the death penalty. They need feedback in a language they actually understand.

    • ....once they include the Frontier Math benchmark site in their training data!
  • Naturally (Score:5, Funny)

    by sjames ( 1099 ) on Wednesday November 13, 2024 @03:30PM (#64943311) Homepage Journal

    Having learned from elementary school answer keys, it's not hard to guess that the word that best follows "4 + 4 =" is "8". That doesn't mean the LLM even knows what 4 or 8 is, much less that it can do even basic arithmetic.

    • Sure it knows what 4 and 8 are. They're tokens! Everything is a token!
    • by migos ( 10321981 ) on Wednesday November 13, 2024 @05:01PM (#64943517)
      State of the art chat bots are already acing math problems at undergrad level, which is probably already better than 90% of Americans.
      • 99%. Fixed that for you.
      • by Rendus ( 2430 )

        Only when the answers are widely known and documented. Since LLMs don't have any means of performing logic operations like math, the LLM isn't actually DOING math (barring outside libraries, which isn't the LLM doing the math but more the UI/frontend choosing to load a Python library rather than sending the raw prompt to the LLM).

        • by sjames ( 1099 )

          Exactly. It's just pattern matching published answers. That's why more novel solutions elude it.

      • State of the art chat bots are already acing math problems at undergrad level, which is probably already better than 90% of Americans.

        Good thing I'm in the other half of Americans.

    • by Rendus ( 2430 )

      The great part about it is when an LLM can't even get that right (typically because such basic math is fucked up intentionally and sarcastically - 1+1=3 and all of that. But also because of the limited usefulness of the surrounding context in the content LLM trainers stole training data from).

    • No idea why one modded this funny. Perhaps a misclick.
      This is correct and insightful.

  • These are extremely challenging. [...] The only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages

    Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?

    I don't have a problem with calling most pe

    • Re:Uh huh (Score:5, Insightful)

      by vyvepe ( 809573 ) on Wednesday November 13, 2024 @03:56PM (#64943405)

      Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?

      The difference is that "everyone" [human without a graduate degree] is not trained on all the books and all of the internet. The point is that the LLMs should already know all the math needed to solve the problems. If you prompt them properly then they very likely can write [approximately cite] the mathematical knowledge needed to solve the problems.

      • That's ignoring the fact that LLMs literally aren't designed to reason, especially generalist LLMs like GPT.

        That's ignoring the fact that if you took someone with 8 years of liberal education - in the strictest sense and meaning of an extremely broad education - and asked them to do the same problems, they would have the same issues. A broad overview of many subjects in no way makes you an expert in a niche area.

        There's a reason why the medical AI's that are detecting cancer months and years before huma

      • The point is that the LLMs should already know all the math needed to solve the problems.

        There you go, anthropomorphizing LLMs. LLMs do not KNOW anything. There is no circuitry or code for "knowing". If the EXACT answer is not held anywhere within their data storage, they will not be able to come up with the correct answer.

        Randomness can coincidentally look like "knowing", which is why so many people are being fooled into thinking there is anything sentient or knowing about LLMs.

      • by ceoyoyo ( 59147 )

        Yeah, sports fan reasoning. TeamILike lost, but that actually means they're better because reasons.

        You can argue that the test is irrelevant because it's unfair, wrong, or testing the wrong thing, but you can't use a test in which you got your ass kicked to argue that you're better at the task it was testing.

        I think it's testing the wrong thing. Good AI math models combine a language model to formalize the written problem, symbolic math and formal logic engines to do the actual logic manipulation and test t

    • by Hodr ( 219920 )

      Terence Tao and a couple of other Fields medalists said they could likely only do 1 or 2 of these problems (which, if out of 100 problems, would be 1 or 2 percent), so I'm going to guess your average math grad student would get zero.

      I like how we now have a test that the smartest mathematicians in the world can only individually solve single problems from, but we expect the "dumb AI" to score 100% otherwise it's not very smart.

      Meanwhile we have entire school districts where not a single student is proficient in

      • by ceoyoyo ( 59147 )

        To be clear, I don't think this is "OMG we have superhuman AGI."

        I do think this is how it will happen. People will be criticizing some future system with "well, there's one guy in Japan who can do this particular problem better, and this one woman in New York who might be able to do this other thing better, and...." Then after that goes too we'll be left with "well, it doesn't *actually* understand" and "yeah but it's not *conscious* like I am."

    • LLMs have no reasoning capabilities. They predict what word comes next based on their training data. So if their training data contains enough maths documents that are similar enough to the question asked then they have a chance to trot out something approaching a correct answer.

      They are a Boris Johnson algorithm: highly educated but incapable of reason, although they can fake it sometimes.
      • by ceoyoyo ( 59147 )

        They predict what word comes next based on their training data.

        They do not. Not even the original ChatGPT did that.

    • What this indicates is that 2% of the answers were already in the training data. The AI did not solve the problems. The AI searched its training data and found 2% of the answers readily available to it. Those 2% were created by humans.

    • Imagine you want to test someone's lockpicking ability--they claim they can unlock a combination lock just by feel. But here's the wrinkle, you can't do it in person, you have to mail them the lock and then they do it by video call.

      The problem with this plan is that between the time that they receive the lock in the mail and the video call they could brute force it with a machine that spins the lock through every combination and then just memorizes the answer and on the video call this supposed lock picker

  • Repeat after me: (Score:5, Informative)

    by gweihir ( 88907 ) on Wednesday November 13, 2024 @03:48PM (#64943369)

    For generative "AI" the following is true: "AI" has no reasoning ability. "AI" cannot solve problems. "AI" has no model of reality. "AI" can only fake these and as soon as you leave what its training data covered, it is lost.

    • So tell me, what can you do as soon as you leave your training data?

      For example, could you prebaxel plume mostna 2fe1::a0-2^4 guh guh guh?

      • by dfghjk ( 711126 )

        Animal brains have more than just "training data". How else does a newborn animal breathe? Get up and run? What are phobias? They aren't any result of "training data".

      • by gweihir ( 88907 )

        I can do a thing called "thinking". You might be able to do so too, although you clearly are not at the moment.

      • by KlomDark ( 6370 )
        Yes, definitely: Oonteb weekin wokken wollen!
  • by Fly Swatter ( 30498 ) on Wednesday November 13, 2024 @03:56PM (#64943403) Homepage
    Pattern matching.

    Calling it AI is just fraud.
  • An LLM does one thing only: predict the next thing it should say. All of them can be much smarter than they currently are, but that takes iterative calculation and error checking, which use too much GPU resources. So, in order to optimize them to use fewer resources, they simply leave that part out of almost all models. If you wanted an LLM model that 'isn't wrong' with math, it could easily be created specifically for that purpose but would effectively be multiple models stacked on top of each other, some that do
  • by oumuamua ( 6173784 ) on Wednesday November 13, 2024 @05:16PM (#64943537)
    You've reached AGI/ASI if you can solve them. We may have a good ASI test here:

    Matthew Barnett, an AI researcher, captured the significance of FrontierMath in a series of tweets. “The first thing to understand about FrontierMath is that it’s genuinely extremely hard,” Barnett wrote. “Almost everyone on Earth would score approximately 0%, even if they’re given a full day to solve each problem.”
    Barnett also speculated on what it might mean if AI eventually cracks the benchmark. “I claim that, once FrontierMath is completely solved, humans will be living alongside an entirely distinct set of intelligent beings,” he wrote. “We will be sharing this Earth with artificial minds that are, in an important sense, just as smart as we are.”

  • by Pinky's Brain ( 1158667 ) on Wednesday November 13, 2024 @05:34PM (#64943581)

    Semi-private. Anthropic and OpenAI already combed their logs to find them.

    Asking the questions to their next models will be completely useless.

  • by JoshuaZ ( 1134087 ) on Wednesday November 13, 2024 @06:05PM (#64943665) Homepage
    From the article:

    “All of the problems I looked at were not really in my area and all looked like things I had no idea how to solve,” Gowers said. “They appear to be at a different level of difficulty from IMO problems.” The problems are designed not just to be hard but also to resist shortcuts. Each one is “guessproof,” meaning it’s nearly impossible to solve without doing the mathematical work. As the FrontierMath paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1% chance of guessing correctly without the proper reasoning.

    So solving 2% should already be impressive. Unfortunately, some people are going to look at the headlines like the one above and think that this says that the AI are not impressive.

    • by Rendus ( 2430 )

      Nearly guessproof and "less than 1% chance of guessing the correct answer" are damn near antonyms.

      Pick a number, 1 through 101. That's less than a 1% chance.

      Nearly guessproof needs to be far, far more than that.
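To put rough numbers on the comment above, here is a minimal sketch. It assumes a hypothetical benchmark of 300 problems (the article only says "hundreds") and treats each problem as an independent guess with probability 1/101, as in the "1 through 101" example. Neither figure comes from the FrontierMath paper; the paper only states a guessing chance below 1%. The code computes the expected number of lucky guesses and the chance of reaching 2% by guessing alone.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumptions for illustration only (not taken from the paper):
p_guess = 1 / 101      # "pick a number, 1 through 101"
n_problems = 300       # hypothetical size; the article says "hundreds"
target = 6             # 2% of 300 problems

# Expected number of problems answered correctly by pure guessing.
expected_correct = n_problems * p_guess

# Chance that pure guessing reaches the 2% mark or better.
p_reach_two_percent = prob_at_least(target, n_problems, p_guess)

print(f"expected lucky guesses: {expected_correct:.2f}")
print(f"P(at least {target} correct by guessing): {p_reach_two_percent:.4f}")
```

Under these assumptions the expected count is 300/101, just under three problems, so a guess rate near 1% per problem is not negligible next to a 2% score. A much smaller per-problem probability can be plugged into `p_guess` to see how quickly the tail shrinks.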

    • So solving 2% should already be impressive. Unfortunately, some people are going to look at the headlines like the one above and think that this says that the AI are not impressive.

      I haven't looked at the problems, but I'm pretty sure I won't be able to solve them, not being a graduate mathematician.

      What would be interesting to know though is whether these are problems accessible to most post graduate mathematicians or whether they require specific skills so any one person will only be able to solve one or

      • Tim Gowers said he had no idea how to solve many of the problems even if they were a bit outside his field. He got the Fields Medal for work connecting functional analysis and combinatorics which are two very different areas, so his own "area" is already very big. He's also done work in number theory, graph theory, and group theory. He's prolific and my guess is that he likely knows more about many fields that aren't his own than many grad students finishing up their work.
    • The point is that the LLMs should already know all the math needed to solve the problems.

      I suppose it is impressive that their data stores contained enough information to be able to get 2% of the problems solved... but ummm, there is no reasoning or even logic being used here. I fail to be impressed by this result.

  • I'd like to see the score for an average human. A truck driver, or an illiterate Congolese farmer, for example.
    If they score less than 2%, does that mean AI is smarter?

    • I'd like to see the score for an average human.

      No problem. Give me all of the content on the Internet, and the only thing that would hold me back is my reading and pattern matching speeds.

  • Why are people spending so much effort trying to prove these things are not what they don't claim to be?

    I can weld a hitch to a Miata frame, hook a 6000kg trailer to it, and declare it a shitty tower. Technically I'm right, but I don't know what else I would have accomplished...

  • Solving maths problems requires innovative thinking. AI is just pre-existing, regurgitated content. What did they expect?

  • As a professional mathematician specializing in representation theory, I evaluated these problems. One problem in particular caught my attention: "Orbit Counting of Matrix Tuples" (https://epoch.ai/frontiermath/benchmark-problems), rated as Medium-Low difficulty. I believe this problem was constructed "backwards" – taking a standard problem about classifying representations of the Coxeter group for a given Dynkin diagram and deliberately obscuring it to appear more intimidating. Once you decode its t
  • I noticed this Kaggle $2M competition just the other day which looks neat. Of course it isn't FrontierMath. But the $1M AIMO 1 competition was about trying to get better than Gemma 7B's performance and it sounds like they blew past it from my brief look. I didn't take the time to see whether they actually have any reasoning in it though there are apparently math packages and some kind of feedback loops in there. AIMO 2 just started. Sure LLMs can't do math, for an arbitrary definition of math. They aren't d

    • by mattr ( 78516 )

      p.s. Diving down the rabbit hole, it turns out that these LLMs can actually write and execute Python code to calculate intermediate results. Still not "reasoning", but I'm not seeing obvious impediments to it getting there if an LLM can call a tool in a TORA that can do some kind of mathematical reasoning based on actually understanding mathematical concepts. Which isn't an LLM, so far anyway.
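The "write and execute Python" step mentioned above can be sketched roughly as follows. This is a toy illustration, not how any particular vendor or the FrontierMath evaluation harness does it: `run_generated_code` and the `generated` string are hypothetical stand-ins for a real tool-calling loop, and `exec` with a trimmed `__builtins__` is not a real sandbox.

```python
def run_generated_code(source: str) -> dict:
    """Execute model-written Python and return the names it defined.

    NOTE: exec() with a reduced builtins table is NOT a security boundary;
    a real harness would run this in an isolated process or container.
    """
    namespace = {"__builtins__": {"range": range, "sum": sum}}
    exec(source, namespace)
    return namespace


# Hypothetical stand-in for code a model might emit for an intermediate step.
generated = "result = sum(i * i for i in range(1, 11))"

# The harness runs the code and hands the value back to the model as text.
intermediate = run_generated_code(generated)["result"]
print(f"tool output returned to the model: {intermediate}")
```

The point of the design is that the arithmetic is done by the interpreter, not by next-token prediction; the model only has to produce code that is correct and then read the result.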
