

Mathematicians Find GPT-5 Makes Critical Errors in Original Proof Generation
University of Luxembourg mathematicians tested whether GPT-5 could extend a qualitative fourth-moment theorem to include explicit convergence rates, a previously unaddressed problem in the Malliavin-Stein framework. The September 2025 experiment, prompted by claims GPT-5 solved a convex optimization problem, revealed the AI made critical errors requiring constant human correction.
GPT-5 overlooked an essential covariance property easily deducible from provided documents. The researchers compared the experience to working with a junior assistant needing careful verification. They warned AI reliance during doctoral training risks students losing opportunities to develop fundamental mathematical skills through mistakes and exploration.
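For background (a standard fact about the framework, not from the article): the prototype result here is the fourth-moment theorem of Nualart and Peccati, which says that for a sequence F_n of random variables living in a fixed Wiener chaos with E[F_n^2] -> 1, F_n converges in distribution to a standard Gaussian if and only if E[F_n^4] -> 3. The experiment asked GPT-5 to supply explicit rates of convergence for a qualitative statement of this kind.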
I'm surprised! (Score:5, Informative)
Re: Where are all the pet projects at, then, right (Score:3)
I'm sure that vibe coding is real and that it works, but nobody ever said the code will do what you think it does. The best thing it could do is crash and burn, and I'm pretty sure vibe coders can figure out fixes for it...provided they do more vibe coding. The second best thing it could do is fail in a really obvious way without actually crashing. And again, more vibe coding to the rescue.
But what happens when it fails in very subtle ways? Particularly on unknown edge cases. That's where fluffernutter or a
Re: (Score:2)
Re:I'm surprised! (Score:4, Informative)
I'm finding more and more hallucinations coming from AI every day. I asked multiple LLMs for a configuration file for some software, and they all made stuff up that didn't exist. For example, one result told me to use a specific plugin to achieve what I wanted because it was designed just for that purpose. Problem was, that plugin doesn't exist. Even the same LLM would come back and tell me there was no such thing.
Re: (Score:3)
I had my first LLM hallucination last week. When I asked how to perform a very specific task for an embedded Linux system, it suggested a specific package but hallucinated a command within that package that didn't exist. I thought maybe I just had an older version of the package, but nope, the latest version had no such feature. Looking at the man page for the utility in question, I was able to see how it jumped to that conclusion.
Re: (Score:2)
Hallucinations are interesting, but deception is what's really fun. If you were pretending to proofread a document, you'd only edit the beginning and end. I asked Claude to transform some code based on a spec, and it did so for the first ten class members and the last three, skipping everything in the middle. Fooled me.
Of course I'm not saying a program has a theory of mind and it knows it's going to fool me. That's silly. But this behavior is a functional equivalent of ordinary deception.
Re: (Score:2)
No theory of mind needed in your case, just ML engineers cu
Re:I'm surprised! (Score:4, Interesting)
I've been using it to write grant applications, and I share your opinion. It frequently makes mistakes (and 5 is worse in many ways than 4o). While it can certainly be used to create a rough draft of a document, the result is similar to what you would expect from a junior associate, with the same kinds of mistakes that create an "OMG, no" response in the reader when it starts to make things up.
There was a lot of talk about how rapidly it would accelerate in performance. That progress seems to have stalled this year. I have a hard time thinking that we've started to see the ultimate asymptote of performance, but it seems like we've hit a region where the easy, early gains have all been made.
Re: (Score:2)
Re: (Score:2)
Or maybe we've started to see the decline because the AI training is now including more AI slop. GPT-4 and the like never had to deal with much AI slop during training because it was new and novel. But given lo
Re: (Score:2)
As expected (Score:3)
Real progress is being made, but LLMs are still imperfect and incomplete
They find unexpected patterns in large sets of words, but have no understanding
Pundits and hypemongers tell fantastic stories to attract investment
To effectively use the tech, skepticism and cross-checking are essential
LLMs Bad At Math (Score:2)
It should be well known that LLMs are bad at math.
LLMs work on tokens (syllables). Their number crunching capabilities are a work in progress. But, still fairly good.
Re: (Score:2)
But anyways, I'm unclear why deriving proofs wouldn't be rooted in a theorem-prover, using the deep net as a search heuristic. (Or is advanced math not rigidly rooted in applying logical deduction to axioms after all?)
Re: (Score:3)
Theorem provers cannot get to the required depth in most cases. And neither can LLMs. This is not a surprise, just one more data point that LLMs fundamentally suck and cannot be trusted.
Re:LLMs Bad At Math (Score:4, Insightful)
This is not a surprise, just one more data point that LLMs fundamentally suck and cannot be trusted.
Huh? LLMs are not perfect and are not expert-level in every single thing ever. But that doesn't mean they suck. Nothing does everything. A great LLM can fail to produce a perfect original proof but still be excellent at helping people adjust the tone of their writing, understand interactions with others, develop communication and coping skills, or learn new subjects quickly. I've used ChatGPT successfully for everything from landscaping to plumbing. Right now it's helping to guide my diet, tracking macros and suggesting strategies and recipes to stay on target.
LLMs are a tool with use cases where they work well and use cases where they don't. They actually have a very wide set of use cases. A hammer doesn't suck just because I can't use it to cut my grass. That's not a use case where it excels. But a hammer is a perfect tool for hammering nails into wood, and it's pretty decent at putting holes in drywall. Let's not throw out LLMs just because they don't do everything everywhere perfectly at all times. They're a brand-new tool that's suddenly been put into millions of people's hands, and they've been massively improved over the past few years to expand their usefulness. But it's still just a tool.
Re: (Score:2)
To extend the hammer analogy, an LLM is like a hammer that's great at putting in nails, but that sometimes produces a really convincing-looking nail insertion where the nail hasn't actually gone in. You will find that out when your house collapses.
Re: (Score:2)
Yep. And that makes LLMs unsuitable to drive in nails (to stay with the analogy), because you need that "nail insertion" to be real and reliable.
Re:LLMs Bad At Math (Score:5, Insightful)
Their number crunching capabilities are a work in progress. But, still fairly good.
They still fail at basic arithmetic.
Remember that these do not and cannot reason, analyze, consider, deliberate, or do anything else we'd associate with a complex task. All they do, all they can do, is next-token prediction based on learned relationships between tokens. That's all. This is why they can articulate simple rules, but not apply them. It's why they can generate summaries of text that doesn't exist. They're not actually doing the task. They can't. That's simply not how they work.
In cases like TFA, they're not doing advanced mathematics; that's impossible for them. They're just generating text that looks like other text, the exact same way they generate any other text.
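To make "next-token prediction" concrete, here is a deliberately toy Python sketch. The lookup table of probabilities is made up for illustration; a real model conditions on the entire context with a learned neural network over a huge vocabulary rather than just the previous token, but the generation loop has the same shape:

# Toy next-token predictor: a lookup table from the previous token to a
# made-up probability distribution over possible next tokens.
toy_model = {
    "the":  {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "sat":  {"down": 0.8, "<end>": 0.2},
    "down": {"<end>": 1.0},
}

def generate(prompt, max_tokens=10):
    tokens = prompt.split()
    for _ in range(max_tokens):
        # Look up the distribution for the last token; stop if we know nothing about it.
        dist = toy_model.get(tokens[-1], {"<end>": 1.0})
        next_token = max(dist, key=dist.get)  # greedy: take the most probable next token
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("the cat"))  # prints "the cat sat down"

Nothing in that loop ever checks whether the output is true; it only asks which continuation is most probable given what came before.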
Re: (Score:2)
All they do, all they can do, is next-token prediction based on learned relationships between tokens. That's all.
Humans could arguably also be described as continuously making a decision on what word to say next, but it would be misleading.
Re: (Score:3)
The difference between humans and LLMs is that we know exactly how LLMs work, but don't have any idea how humans work. While we can make statements about LLMs with absolute certainty, any statement we make about humans is nothing but baseless speculation.
Re: (Score:2)
The difference between humans and LLMs is that we know exactly how LLMs work, but don't have any idea how humans work. While we can make statements about LLMs with absolute certainty, any statement we make about humans is nothing but baseless speculation.
That's a good point. And I confess to being a bit flippant in equating humans and LLMs in my previous comment.
If LLMs do continue to improve, and knowledge about how humans work improves, the question about whether artificial intelligence is real intelligence might in the end come down to whether you are a materialist (believing that humans are biological machines), or whether you believe that humans have something special that can never be replicated.
Re: (Score:2)
real intelligence might in the end come down to whether you are a materialist (believing that humans are biological machines), or whether you believe that humans have something special that can never be replicated.
You don't need to appeal to mysticism to reject computationalism. Searle, for example, is a committed physicalist. Of course, we're senselessly conflating intelligence with subjectivity here, which is very silly.
"easily deducible" (Score:5, Informative)
Yeah, GPT doesn't "deduce" anything, it predicts the most probable next word.
Re: (Score:2)
They do quite a bit more than that. There's a good bit of reasoning that comes into play and newer models (really beginning with o3 on the ChatGPT side) can do multi-step reasoning where they'll first determine what the user is actually seeking, then determine what they need to provide that, then begin the process of response generation based on all of that.
Re: (Score:3)
"There's a good bit of reasoning that comes into play and newer models (really beginning with o3 on the ChatGPT side) can do multi-step reasoning"
Re:"easily deducible" (Score:4, Informative)
Chained prediction, not deducing. Deduction is the process of inferring facts. LLMs do not do that at all.
True deduction requires actual understanding, not mere prediction.
The classic example is "All men are mortal. Socrates is a man. Socrates is mortal." Deduction. But an LLM would not go through the process you do in your head of forming those categories when you work it through. Instead it creates relationships and probabilities. Not deduction.
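For contrast, here is what that syllogism looks like as an actual deduction, written as a small illustrative Lean 4 sketch (the names Person, Man, Mortal, and socrates are just labels invented for this example):

variable (Person : Type) (Man Mortal : Person → Prop) (socrates : Person)

-- "All men are mortal. Socrates is a man. Therefore Socrates is mortal."
example (allMenMortal : ∀ x, Man x → Mortal x) (socratesMan : Man socrates) :
    Mortal socrates :=
  allMenMortal socrates socratesMan

The conclusion follows by applying the universal rule to the particular case; a proof checker verifies that step, while a pure next-token predictor only imitates the surface form of such arguments.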
Re: (Score:2)
If you spend time with the higher-tier (paid) reasoning models, you’ll see they already operate in ways that are effectively deductive (i.e., behaviorally indistinguishable) within the bounds of where they operate well. So not novel theorem proving. But give them scheduling constraints, warranty/return policies, travel planning, or system troubleshooting, and they’ll parse the conditions, decompose the problem, and run through intermediate steps until they land on the right conclusion. That
Handed a syllogism to Claude (Score:2)
Sure! Except I found that Claude Opus 4.1 was able to solve it. And not just this very simple syllogism but I tried a more complex one from a novel of Heinlein's and it showed a logical proof and got (or nearly got) the answer. By pasting in the rest of the chapter I got it from, it solved it. Here is my chat. I think the point is, the LLM itself is a pattern matcher but enough capabilities have been bolted on to the extent that it can indeed solve logical puzzles. It may not "understand" them or be very sm
Figures! LLM's are bad @ math (Score:3)
Re: (Score:2)
Re: (Score:2)
Newsflash: AI good at making stuff up (Score:2)
not so good when asked to produce exact information.
That's why it's great at making deepfakes of Natalie Portman covered in hot grits, but not so great at coming up with real, existing law cases or math proofs.
Re: (Score:2)
Re: Newsflash: AI good at making stuff up (Score:2)
I feel old. I've been reading this site for so long, I remember when Natalie Portman was always the goto reference for attractive woman here. Along with BSD is dying, and "in Soviet Russia" comments.
Re: (Score:2)
Hey, but at least you can take solace in the fact that UTF-8 characters are still fucked up. ®
Re: (Score:2)
And so of course, it doesn't mess up ®.
Wait... does UTF-8 work now??? Høly smökes!!!
Re: Newsflash: AI good at making stuff up (Score:2)
Can someone please help me, and not mock my uid number vs my ability to grok the utf 8 problem. What am I doing wrong? I'd gladly set something. I'm using safari on an iPhone.
Re: (Score:2)
Just disable Smart Punctuation in the General > Keyboards settings of your iPhone.
You can probably also long press the apostrophe key and choose the straight one each time if you want to keep that setting enabled.
I was amused by the link. (Score:2)
Emphasis is interesting (Score:5, Informative)
Relevant math background: the Gaussian integers are the complex numbers of the form a+bi where a and b are good, old-fashioned integers. For example, 2+3i or -1+2i are Gaussian integers. Any integer n is a Gaussian integer since you can write it as n+0i. But, say, 3 - 0.5i would not be a Gaussian integer. Also notation: we write x|y to mean y is a multiple of x. We can use this notation in our regular integers (so for example 2|8, but it is not true that 3|8) or in the Gaussian integers, where we are then allowed to multiply by another Gaussian integer. For example, (2+i) | (2+i)(3-i). A good exercise if you have not seen the Gaussian integers before: convince yourself that 1+i | 1+3i.
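For anyone who wants to check that exercise, one multiplication does it: (1+i)(2+i) = 2 + i + 2i + i^2 = 2 + 3i - 1 = 1 + 3i, so indeed 1+i | 1+3i.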
It also turns out that the Gaussian integers have an analog of unique prime factorization, just as in the usual integers. The Gaussian integers also have a notion of size called the norm: for a given Gaussian integer a+bi, the norm is a^2 + b^2.
Recently I had to prove a specific Lemma where I needed to find all Gaussian integers a and b where both are Gaussian primes, b | a^2 + a + 1, and a | b + 1. As a template I had a very similar Lemma in the integers, which said exactly which integers a and b satisfy b | a^2 + a + 1 and a | b + 1. I worked out the proof, essentially modifying the version in the integers. Then I did something I've often been doing after I've completed a small Lemma, namely giving the task to ChatGPT or another system and seeing how it does. For prior iterations (GPT3, ChatGPT, GPT4, 4o) this has almost universally been a disaster. But this time I gave the task to GPT5, and gave it the integer version to start with. It tried to do the same basic task and produced a result pretty close to mine, but it had multiple small errors in the process, to the point where I'm unsure if using it would have sped things up. At the same time, the errors were genuinely small. For example, in one subcase the system claimed that a specific number's norm needed to be at most 9, when it needed to be at most 10. These are not the sort of large jumps in reasoning that one saw with GPT4 or 4o. It might have been the case that if I had given this to GPT5 before proving it myself and then corrected its errors, I would have saved time. I generally doubt it, but the fact that it's close to the point where that's now plausible is striking.
Re: (Score:2)
However that part, transforming th
Re: (Score:2)
One part of the Art of Mathematics consists in creatively "abusing" notation. This is not definable precisely, it comes with experience and skill. This is one reason why the above argument is fundamentally flawed, ie I don't think it's just a matter of time with the current technologies. Another reason of course is due to Goedel: some pr
Humanity is slowly learning... (Score:5, Insightful)
...the difference between appearing intelligent and being intelligent.
We're so good at recognizing patterns that we see some patterns when they're not really there. Faces on toast, intelligence in writing, etc, etc.
Predicting the next word in a sequence is shockingly useful, but it's not a substitute for symbolic reasoning.
Luxury problems (Score:2)
Only a few years ago we were happy when transformers made it possible for a text generator to stay consistent for more than two sentences. Now we're picky when an LLM makes mistakes in "PhD level" problems. I think we shouldn't be too harsh with the tech; it does way more than one would have ever expected.
Re: (Score:2)
Details are not LLM's strong suit (Score:2)
Ask an LLM to paint a picture of a kitchen in Greece. It will generate something nice and pretty, with some Greek touches. But will the cooktop have the right number of control knobs? Will the electric outlets have the correct configuration, or will the painting even show electrical outlets? Will the faucets have knobs with a plausibly working design? Will the cabinet doors below the sink extend only to the bottom of the sink? There are SO many errors such a request is likely to generate.
An LLM's ability is
through mistakes and exploration (Score:2)
Re: (Score:2)
Re: (Score:2)
It's like flying with an airline that doesn't allow their pilots to touch the controls throughout their entire career. Until the rudder jams and the autopilot trips off. Now, do something.
Re: (Score:2)
THIS JUST IN (Score:2)
A glorified auto-complete system, which has at its heart a random number generator, sometimes makes mistakes. Film at 11.
Re: (Score:2)
Quite.
One of the best explanations I found was that we trained these statistical models to produce something that is convincingly like an answer to the question posed.
So when it makes up references and names and lawsuits and properties and programming keywords that don't exist, that at first look plausible but under scrutiny don't hold up... that's to be expected. It's doing exactly what it's designed to do. It made something LIKE an answer to the question posed.
Which, if anything, is even more dangerous
My test (Score:2)
Why? (Score:2)
Why would anyone expect an LLM to be able to do rigorous mathematical proofs? Better to substitute an LLM for a HS math teacher, who can prove things by hand-waving.