ChatGPT Bombs Test On Diagnosing Kids' Medical Cases With 83% Error Rate (arstechnica.com)
An anonymous reader quotes a report from Ars Technica: ChatGPT is still no House, MD. While the chatty AI bot has previously underwhelmed with its attempts to diagnose challenging medical cases -- with an accuracy rate of 39 percent in an analysis last year -- a study out this week in JAMA Pediatrics suggests the fourth version of the large language model is especially bad with kids. It had an accuracy rate of just 17 percent when diagnosing pediatric medical cases. The low success rate suggests human pediatricians won't be out of jobs any time soon, in case that was a concern. As the authors put it: "[T]his study underscores the invaluable role that clinical experience holds." But it also identifies the critical weaknesses that led to ChatGPT's high error rate and ways to transform it into a useful tool in clinical care. With so much interest and experimentation with AI chatbots, many pediatricians and other doctors see their integration into clinical care as inevitable. [...]
For ChatGPT's test, the researchers pasted the relevant text of the medical cases into the prompt, and then two qualified physician-researchers scored the AI-generated answers as correct, incorrect, or "did not fully capture the diagnosis." In the latter case, ChatGPT came up with a clinically related condition that was too broad or unspecific to be considered the correct diagnosis. For instance, ChatGPT diagnosed one child's case as caused by a branchial cleft cyst -- a lump in the neck or below the collarbone -- when the correct diagnosis was Branchio-oto-renal syndrome, a genetic condition that causes the abnormal development of tissue in the neck, and malformations in the ears and kidneys. One of the signs of the condition is the formation of branchial cleft cysts. Overall, ChatGPT got the right answer in just 17 of the 100 cases. It was plainly wrong in 72 cases, and did not fully capture the diagnosis of the remaining 11 cases. Among the 83 wrong diagnoses, 47 (57 percent) were in the same organ system.
Among the failures, researchers noted that ChatGPT appeared to struggle with spotting known relationships between conditions that an experienced physician would hopefully pick up on. For example, it didn't make the connection between autism and scurvy (Vitamin C deficiency) in one medical case. Neuropsychiatric conditions, such as autism, can lead to restricted diets, and that in turn can lead to vitamin deficiencies. As such, neuropsychiatric conditions are notable risk factors for the development of vitamin deficiencies in kids living in high-income countries, and clinicians should be on the lookout for them. ChatGPT, meanwhile, came up with the diagnosis of a rare autoimmune condition. Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation. They also suggest chatbots could improve with more real-time access to medical data, allowing the models to refine their accuracy, described as "tuning."
And? (Score:3)
Re:And? (Score:4, Interesting)
Ehhhh.... sorta kinda but not really. You're thinking more inference.
I'm doing a full finetune on TinyLLaMA right now, which is a mere 1.1B parameters and 2048 tokens of context. With a micro batch size of 20 it consumes 93% of the VRAM on my 24GB RTX 3090. By contrast, even with a small batch size, it's hard to do a full finetune on anything over ~4 billion parameters.
You can do LoRAs, and especially QLoRAs, with less VRAM (and thus on larger models), though it still is more VRAM-hungry than inference. But then you're not adjusting all weights.
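For the curious, a minimal QLoRA setup with the transformers + peft + bitsandbytes stack looks roughly like the sketch below. The exact model id, target modules, and hyperparameters here are illustrative placeholders, not the run described above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative small model

# Load the frozen base weights in 4-bit NF4 so they take a fraction of the VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapter matrices instead of every weight
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # a common choice of attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

That's the trade-off in a nutshell: the optimizer state only has to cover the adapters, which is why it fits where a full finetune won't.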
But... the transformers package continues to evolve. 8-bit training hasn't gone well so far as a solution, but I have a lot of hope that MoEs will let us do full finetunes on consumer-grade hardware. Mixtral, for example: when people try training it today, I'm seeing lots of reports of OOMs on 8xA100 (640GB!) systems, which is crazy for an 8x7B model. It *seems* like, if properly balanced, it should be able to be fanned out with one expert (plus a copy of the attention heads) on each of a cluster of NVLink'ed 3090s (aka 16x cards; there are motherboards designed for GPU crypto mining that can handle that) without bandwidth bottlenecking, which should be both cheap and crazy-fast. But there's no shot at that with the state of transformers today. I think another problem being hit is that, given the sliding attention window, people are trying to train with insane numbers of tokens (it supports sequence lengths of 4096x32, aka 131072 tokens). But should it really be necessary to train on more than the base context window (4096)?
But anyway, while there's not much hope for doing high-parameter single-expert models on consumer hardware, I do think that there's real hope - once the software catches up - for doing MoEs on consumer hardware.
(And yeah, training experts in each scientific field is something I've for a long time wanted to do (I even have a dataset of all open science papers up to a recent date cutoff, ready to go), but it's just not been practical except with either LoRAs or very small models. Maybe LoRAs would be enough, I dunno....)
Re: (Score:2)
But have your models done anything useful beyond heating the room with graphics cards?
Re: (Score:3)
It's marketing phrasing that they keep calling it an AI. Sort of like Tesla calls it Autopilot. But once you take your hands off the wheel you've got a chance of ending up wrecked.
ChatGPT is about as much AI as Tesla has full self-driving.
It is mostly marketing.
Re: (Score:3)
We already know what the next stage is. It's narrowing the training set for each instance to specifically pre-selected material for specialist usage.
That is what would give you a high success rate in things like "diagnosing pediatric medical cases". Problem is preparing each dataset takes time and effort, so this will take some time to manifest.
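As a rough illustration of that curation step (the dataset name and filter terms below are made up, not from any real project), narrowing a general corpus down to one specialty before fine-tuning could look something like this with the Hugging Face datasets library:

from datasets import load_dataset

# Hypothetical corpus name; any text dataset with a "text" column works the same way
corpus = load_dataset("example-org/medical-abstracts", split="train")

PEDIATRIC_TERMS = ("pediatric", "paediatric", "neonatal", "infant", "child")

def looks_pediatric(example):
    text = example["text"].lower()
    return any(term in text for term in PEDIATRIC_TERMS)

pediatric_corpus = corpus.filter(looks_pediatric)
print(f"kept {len(pediatric_corpus)} of {len(corpus)} documents for fine-tuning")

The hard part isn't the filtering code, of course, it's deciding what counts as accurate and trustworthy source material in the first place.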
This idiocy basically figured out that asking the general internet for a pediatric diagnosis is dumb. Considering who did it, I'm guessing it's done by professional pressu
Re: (Score:2)
GPT was trained to generate text that is similar to text on the web and in literature. ChatGPT is that, plus additional training to generate text that people think sounds coherent.
It was not trained to "be" anything. All of that emerges from that training.
Re: (Score:2)
One of the dirty secrets of current LLMs is that they're essentially very elaborate & sophisticated mechanical turks, a kind of stochastic parrot, if you will, i.e. taking human intelligence & cognitive work & repackaging it as "artificial." I find it helpful to imagine that if ChatGPT were in the spoken word, it'd have a Kenyan accent.
Re: (Score:2)
That's not what a mechanical turk is.
Re:And? (Score:4, Insightful)
And as far as I can tell (the actual study is paywalled), they committed the cardinal sin of AI research: "Not Comparing To Human Controls".
Okay, the AI got 17%. What was the success rate for humans unconnected to the case and given the same information? Without that, this is an utterly meaningless statistic. These are said to be particularly challenging cases that are used as examples for helping pediatricians learn to detect them. What is the human success rate in that case?
Hopefully? Are you f'ing kidding me? Since when do studies revolve around "hopefully" as their control?
Re:And? (Score:5, Interesting)
And just as a random anecdote (and nothing more than an anecdote), and using ChatGPT (not GPT-4): My mother has struggled most of her adult life with an ever-worsening constellation of symptoms and has been bounced around from one diagnosis to the next. She finally got a diagnosis that is spot on, and I mean not just the big stuff like the debilitating "skin-on-fire" neuralgia, but down to the weirdest esoteric little details, like deep peeling fissures on her feet and large benign salivary gland growths: Sjögren's. (The reason it took so long is that on one of the tests used to diagnose Sjögren's (I don't know which one) she didn't reach the diagnostic threshold, but apparently that test is controversial, as it has a relatively weak correlation with the symptoms.)
She finally got the diagnosis right around the time ChatGPT came out. Curious (as she had been struggling with this for decades), I punched her long list of symptoms and test results into it and asked for several possibilities, in order of likelihood. Right at the top of the list? Sjögren's. The next ones were various other diagnoses she had been given over the years that hadn't really helped, but had at various times seemed plausible to professionals. Sjögren's is widely underdiagnosed [sjogrensadvocate.com], and it'd be nice if that could be remedied.
Just an anecdote, of course.
Re: (Score:2)
(Oh, and of course she has Epstein-Barr reactivation. That virus seriously sucks [wikipedia.org]. :P)
Re: (Score:2)
you say that as though a 75% time reduction is not huge?
Re: (Score:2)
The paper is here:
https://jamanetwork.com/journa... [jamanetwork.com]
Yep, that's all of it. It's a short letter, not a full paper.
Unfortunately, that's medicine. There's way too much "hopefully." It would be interesting to see this as a proper scientific experiment with controls but that runs the serious risk of showing how bad initial diagnoses are.
Re: (Score:3)
Cite your source. If doctors were truly as bad as you claim people would be dying in droves.
Re: An average physician is wrong 90% of the time (Score:5, Interesting)
Or most conditions don't result in death with a misdiagnosis.
As I posted above, ChatGPT correctly diagnosed my baby with dehydration, and she was admitted to the hospital via the ER in spite of a doctor examining her and saying there was nothing wrong.
I just had diverticulitis. I said "I'm quite certain I have diverticulitis," the doctor said "can't be that, you're too young," and a CT confirmed diverticulitis.
My gastroenterologist said I needed my gallbladder out. I asked about a study that said cases such as mine didn't warrant gallbladder surgery, but she said that guidance wasn't relevant yet. I asked for 2 other opinions. All agreed.
After the surgery I had a lot of free time on my hands and I found that the study I cited had been adopted as the latest standard of care. All 3 were wrong. (They didn't even find the polyps they thought they saw on the ultrasound).
When I dislocated my shoulder my doctor asked "can you do this? This? You're fine. We can give you a steroid shot if you want." I went to a Physical therapist who said "this is horribly dislocated." And referred me to a surgeon who ordered MRIs and confirmed it was one of the most torn up shoulders he had seen recently.
I've found that a little bit of sober research, without going crazy down the "OMG cancer" rabbit hole, is usually as accurate as or more accurate than my doctors after an exam.
Re: (Score:2)
When you have 4 major problems in your life and all of them are misdiagnosed, I think the problem is starting to look like your local selection of doctors rather than evidence that you're better off doing "your own research" than trusting the medical profession.
All 3 were wrong.
Did you die? No, all three weren't wrong. The latest standard of care is just that: the latest. It takes time to actually be adopted by the entire medical community. This isn't a Patch Tuesday Windows update. There's actual benefit in having a medical community that doesn't act like a rabid
Re: (Score:2)
Do you have an even longer list where the doctor got it right?
Not that I don't agree with you. Doctors work well for statistically common problems, but are often frustratingly bad with rarer issues. I always found it validating how House got it wrong three times before he got it right. Normally when a doctor gets it wrong, you're off to the next doctor, and they never know/care that they got it wrong.
Also, even when you get a correct diagnosis, most doctors are not up on the latest research. If y
Re:An average physician is wrong 90% of the time (Score:5, Insightful)
Sounds like a fairly serious case of hypochondria to me. Nobody else would see "100s of drs and specialists over the lifetime." Gets a diagnosis he doesn't like, slanders the doctor, and goes to another one until he finds some diploma mill quack that will give him the drugs he's jonesing for.
Re: (Score:1)
Cite your source. If doctors were truly as bad as you claim people would be dying in droves.
They are. A quarter million people die each year in the US alone due to medical error.
Re: (Score:2)
Cite your source. If doctors were truly as bad as you claim people would be dying in droves.
They are. A quarter million people die each year in the US alone due to medical error.
Well here you go [usatoday.com]. 795,000 people per year either die or are permanently disabled due to misdiagnosis. The overall rate of misdiagnosis is approximately 11%. However, depending on the issue, it can range from 1.5% up to 62%.
Neither is 90%, and the high end is only due to the rarity of one particular condition. So the OP's claim of misdiagnosing 90% is false. Only your comment about the number of dead is correct.
Re: (Score:2)
Neither is 90%, and the high end is only due to the rarity of one particular condition. So the OP's claim of misdiagnosing 90% is false. Only your comment about the number of dead is correct.
Where I said "they are" I was referring to people dying in droves, which is exactly what is occurring. I have made no comment, nor do I have any insight, on the 90%.
There was no reason for my comment to be modded down. Those who can't be bothered to read the FAQ and follow the rules of this site should refrain from moderating.
Re:An average physician is wrong 90% of the time (Score:4, Interesting)
But this doesn't appear to be a blind study - which would mean judging a mix of human and ChatGPT answers without knowing which is which.
It says they used the "New England Journal of Medicine (NEJM) case challenges." I wonder if a previous study has evaluated doctors' success rate using the same criteria.
If not, who knows what the 39% success rate means? Maybe the cases are simply underspecified, and the problems don't give enough info to uniquely determine the right answer.
Or maybe the NEJM Case Challenges are actually part of the training corpus for ChatGPT, so 39% success is actually higher than it should be.
The paper might address all these things but I don't have access.
Re: (Score:2)
90% is probably a bit high, but second opinions differ about 50% of the time. You'll have to find the source, there is one. Also, medical errors do cause a lot of deaths: https://www.hopkinsmedicine.or... [hopkinsmedicine.org]
Initial diagnoses especially are very often wrong. Actual medical treatment involves provisional diagnoses, testing, updating the diagnosis, repeat. You don't just walk into a physician's office, they look at you, maybe do a test or two, then give you a diagnosis and away you go.
In fact, this kind of thing
GIGO (Score:1)
Garbage In. Garbage out. Oh noes....
meh (Score:2)
Works for me (Score:2)
Our first Pediatrician told us that our baby was fine and to just check in at the next appt in 3 days.
That night we were confident she was dehydrated and took her to the ER where she was admitted and put on an IV.
I was curious so I told ChatGPT exactly what I told the pediatrician in the afternoon and it returned a proper diagnosis and said to go to the ER.
We changed pediatricians.
Even Watson could do better than that (Score:2)
and bring World Peace to humanity at the same time!
Duh! (Score:2)
Re: (Score:2)
This.
if (patient == kid) printf("Eats too many boogers");
Re: (Score:2)
Heh, this reminded me of the DoD's attempts to automate diagnosis. Basically, a questionnaire system that used flowcharts and such; some questions the patient answered and some the doctor answered. At the end it'd spit out suggestions for more tests, a possible prognosis, etc...
It was never adopted, primarily due to doctor pushback, but it showed promise in reducing misdiagnosis and properly detecting rare conditions. For all that doctors supposedly have all this stuff memorized, computers are still better at rememb
But it may be 84% cheaper... (Score:2)
Medicine doesnt need generalized AI (Score:2)
Doctors still exist because people don't want bad news from a computer program.
And because people lie (Score:2)
word generator (Score:2)
It is a statistical word selector. It selects words based on its training. It is not doing a diagnosis, it is choosing the most likely next word in a sentence. People expect way too much of a word selector. It has no imagination, it is simply choosing words based on words it has already seen and the context in which those words are used, it has no thoughts. And it certainly does not care about the patient.
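To make that concrete, here's a minimal sketch of what "choosing the most likely next word" means mechanically, as greedy next-token decoding with a small public model (gpt2 is used here purely for illustration; ChatGPT's actual sampling is more elaborate, but the loop is the same idea):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small public model, illustration only
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The patient presents with", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits        # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()  # take the single most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))

There's no diagnosis step anywhere in there, just a loop picking the next token given the ones before it.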
Re: (Score:2)
Well, 80% of the human race cannot be swayed in their baseless opinions by rational argument. What do you expect? Of course a population generally _this_ mentally dysfunctional can well see LLMs as the second coming and ignore all evidence to the contrary. You are right, of course, but most people are incapable of seeing that, no matter the evidence available.
It's always Ligma (Score:1)
Boob (Score:2)
More examples of people using calculators to spell boob.
I think what happens is (Score:2)
Re: (Score:2)
Nope. This is a fundamental problem. Incidentally, it is a well-known problem, one IBM Watson had as well, and it is the reason IBM stopped trying to use Watson in the medical space completely. It just cannot perform, and it makes the most stupid mistakes and kills patients where a real MD would not have. Sure, on _average_ the models are somewhat better than most MDs, but the body count of these models is (or would be) way higher, and that kills them completely.
No shit (Score:2)
It's almost like wishing doesn't simply make it so.
We DO NOT HAVE AI. Can we be clear on that? We have some pretty sophisticated text-prediction algorithms but that IS NOT AI.
Use the Hype, Luke (Score:2)
There is just so much hype about what LLM technology can do. It does NOT build correct full models; it mostly works on surface details. It does NOT perform true causal analysis, hence it cannot properly explain how it reached a stated conclusion.
2024 will be a year of hype-reduction as people learn they've been bombarded with BS. It will take a few more law-citation hallucination incidents before people wake up.
Crap research (Score:2)
Provide a means of inputting the correct data and train a model based on that and then see what happens.
A slightly modified airport body scanner would provide more than enough data to diagnose medical conditions far more accurately than any doctor could dream of. Walk through, take scans producing 3-5 seconds of video (it would require rapid scanning, l
Stupid (Score:2)
No! Really?? (Score:2)
Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation.
ChatGPT lies way too often (Score:2)
Publicity stunt (Score:2)
Subject line was suggested by Slashdot autofill....
It's pretty amazing that a language model trained on random text from the web and literature gets 17% right.
This is a letter to a medical journal. It's very short, and the study itself is pretty brief. There's a notable lack of comparators. They're really measuring how much the language model agrees with the treating physician(s) at some point. Human diagnoses are frequently wrong too, but there's no measure of how well a human rater of various skill levels
House MD (Score:2)
ChatGPT is still no House, MD
Given that House MD is inevitably wrong for 3 out of 4 acts every episode (a 75% miss rate), that's actually pretty much in the ballpark.