AI Medicine

ChatGPT Bombs Test On Diagnosing Kids' Medical Cases With 83% Error Rate (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: ChatGPT is still no House, MD. While the chatty AI bot has previously underwhelmed with its attempts to diagnose challenging medical cases -- with an accuracy rate of 39 percent in an analysis last year -- a study out this week in JAMA Pediatrics suggests the fourth version of the large language model is especially bad with kids. It had an accuracy rate of just 17 percent when diagnosing pediatric medical cases. The low success rate suggests human pediatricians won't be out of jobs any time soon, in case that was a concern. As the authors put it: "[T]his study underscores the invaluable role that clinical experience holds." But it also identifies the critical weaknesses that led to ChatGPT's high error rate and ways to transform it into a useful tool in clinical care. With so much interest and experimentation with AI chatbots, many pediatricians and other doctors see their integration into clinical care as inevitable. [...]

For ChatGPT's test, the researchers pasted the relevant text of the medical cases into the prompt, and then two qualified physician-researchers scored the AI-generated answers as correct, incorrect, or "did not fully capture the diagnosis." In the latter case, ChatGPT came up with a clinically related condition that was too broad or unspecific to be considered the correct diagnosis. For instance, ChatGPT diagnosed one child's case as caused by a branchial cleft cyst -- a lump in the neck or below the collarbone -- when the correct diagnosis was Branchio-oto-renal syndrome, a genetic condition that causes the abnormal development of tissue in the neck, and malformations in the ears and kidneys. One of the signs of the condition is the formation of branchial cleft cysts. Overall, ChatGPT got the right answer in just 17 of the 100 cases. It was plainly wrong in 72 cases, and did not fully capture the diagnosis of the remaining 11 cases. Among the 83 wrong diagnoses, 47 (57 percent) were in the same organ system.
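
Below is a hypothetical sketch of that workflow using the OpenAI Python client; the model name, prompt wording, and helper function are illustrative assumptions, not details taken from the study:

    # Hypothetical sketch only: feed each case's text to a GPT-4-class model and
    # collect the answer for physician scoring. Not the study's actual tooling.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def diagnose(case_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Give the most likely diagnosis for this pediatric case:\n\n" + case_text,
            }],
        )
        return response.choices[0].message.content

    # Two physician-researchers would then score each answer as correct, incorrect,
    # or "did not fully capture the diagnosis".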

Among the failures, researchers noted that ChatGPT appeared to struggle with spotting known relationships between conditions that an experienced physician would hopefully pick up on. For example, it didn't make the connection between autism and scurvy (Vitamin C deficiency) in one medical case. Neuropsychiatric conditions, such as autism, can lead to restricted diets, and that in turn can lead to vitamin deficiencies. As such, neuropsychiatric conditions are notable risk factors for the development of vitamin deficiencies in kids living in high-income countries, and clinicians should be on the lookout for them. ChatGPT, meanwhile, came up with the diagnosis of a rare autoimmune condition. Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation. They also suggest chatbots could improve with more real-time access to medical data, allowing the models to refine their accuracy, described as "tuning."

Comments Filter:
  • by aldousd666 ( 640240 ) on Thursday January 04, 2024 @07:13PM (#64132989) Journal
    Unless you trained it on kids' medical diagnoses, I think it's not really so surprising that you are getting wrong answers. If the idea of this article is to be surprised that it has not generalized 'medical diagnosis' yet, well, I guess color me unsurprised.
    • AI needs to be better educated in the curriculum it is going to serve; maybe AI can go to medical school
      • by leptons ( 891340 )
        ChatGPT isn't AI, it's not "intelligence", it's a glorified filter. You throw in some words and it bounces them around and echoes of text someone else wrote pop out the other end. It has no capacity to reason, think, or diagnose anything.
        • People keep saying this as if you truly know how the human brain reasons and thinks. It is a good talking point for those skeptical by nature, but you take it 10 steps too far calling it a filter.
          • by leptons ( 891340 )
            Human neural networks are far different from LLMs, which don't have experiences or motivations or anything that gives a meat computer something a digital automaton will never have. A machine can unfeelingly infer a connection between two things, but it does not know why. LLMs are "trained" on text written by humans, but they do not grasp the meaning of the text, they can't reason about it or be genuinely creative with it, and they quite often fail to produce a meaningful result because they lack the co
    • ChatGPT was trained to be a general digital assistant. I haven't read anything about the selection process of what went in, but I'm assuming the criteria were pretty broad. If that's the case, then I think the output generated, at least from what I've seen in fairly intensive use since it was made publicly available, reflects the vague, general nature of the input, so yeah, pretty much what you've said. It ain't much use for anything other than producing confident-sounding bullshit, which may be enough for m
      • I originally thought they'd market specially trained bots for specific industries, and I guess they still may do that. But it seems like 'everybody' knows how to build those with open source tools and models. I mean, Llama 2 and Mistral base models can be fine-tuned on a laptop. Or anyone renting the services of a professional for a few hours could do it on the cloud too. So it could be hard to monetize that idea, at least for OpenAI. I would think that they will have to just keep wowing everyone with new
        • Re:And? (Score:4, Interesting)

          by Rei ( 128717 ) on Thursday January 04, 2024 @11:01PM (#64133369) Homepage

          I mean, Llama 2 and Mistral base models can be fine-tuned on a laptop.

          Ehhhh.... sorta kinda but not really. You're thinking more inference.

          I'm doing a full finetune on TinyLLaMA right now, which is a mere 1.1B parameters with a 2048-token context. With a micro batch size of 20 it consumes 93% of the VRAM on my 24GB RTX 3090. By contrast, even with a small batch size, it's hard to do a full finetune on anything over ~4-ish billion parameters.
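
          A minimal full-finetune sketch of that kind of setup with the Hugging Face Trainer; the model id, dataset file, and hyperparameters are illustrative assumptions, not the poster's actual configuration:

            # Full finetune: every weight is updated, which is what eats VRAM
            # compared with adapter-style (LoRA) training.
            import torch
            from datasets import load_dataset
            from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                                      TrainingArguments, DataCollatorForLanguageModeling)

            model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"       # assumed model id
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            tokenizer.pad_token = tokenizer.eos_token             # Llama tokenizers ship without a pad token
            model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

            dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
            dataset = dataset.map(
                lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

            args = TrainingArguments(
                output_dir="tinyllama-full-ft",
                per_device_train_batch_size=20,   # the "micro batch size of 20" mentioned above
                num_train_epochs=1,
                bf16=True,
            )
            Trainer(model=model, args=args, train_dataset=dataset,
                    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()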

          You can do LoRAs, and especially QLoRAs, with less VRAM (and thus on larger models), though it still is more VRAM-hungry than inference. But then you're not adjusting all weights.
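
          And a rough QLoRA sketch of that approach with the peft and bitsandbytes libraries: the base weights are loaded in 4-bit and frozen, and only small adapter matrices get trained (model id, rank, and target modules are assumptions for illustration):

            import torch
            from transformers import AutoModelForCausalLM, BitsAndBytesConfig
            from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

            bnb = BitsAndBytesConfig(load_in_4bit=True,
                                     bnb_4bit_quant_type="nf4",
                                     bnb_4bit_compute_dtype=torch.bfloat16)
            model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                                         quantization_config=bnb,
                                                         device_map="auto")
            model = prepare_model_for_kbit_training(model)   # freeze the quantized base weights

            lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                              target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                              task_type="CAUSAL_LM")
            model = get_peft_model(model, lora)
            model.print_trainable_parameters()               # only the adapters are updated, not all weights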

          But... the transformers package continues to evolve. 8-bit training hasn't gone well so far as a solution, but I have a lot of hope that MoEs will let us do full finetunes on consumer-grade hardware. Mixtral, for example: when people try training it today, I'm seeing lots of reports of OOM on 8xA100 (640GB!) systems, which is crazy for an 8x7B model. It *seems* like, if properly balanced, it should be possible to fan it out with one expert (plus a copy of the attention heads) on each of a cluster of NVLink'ed 3090s (aka 16x cards; there are motherboards designed for GPU crypto mining that can handle that) without bandwidth bottlenecking, which should be both cheap and crazy-fast. But there's no shot at that with the state of transformers today. I think another problem being hit is the sliding attention window: people are trying to train with insane numbers of tokens (it supports sequence lengths of 4096x32, aka 131072 tokens). But should it really be necessary to train on more than the base context window (4096)?

          But anyway, while there's not much hope for doing high-parameter single-expert models on consumer hardware, I do think that there's real hope - once the software catches up - for doing MoEs on consumer hardware.

          (And yeah, training experts in each scientific field is something I've for a long time wanted to do (I even have a dataset of all open science papers up to a recent date cutoff, ready to go), but it's just not been practical except with either LoRAs or very small models. Maybe LoRAs would be enough, I dunno....)

      • It's the phrasing: they keep calling it an AI. Sort of like Tesla calls it Autopilot. But once you take your hands off the wheel, you've got a chance of ending up wrecked.

        ChatGPT is about as much AI as Tesla has full self-driving.
        It is mostly marketing.

      • by Luckyo ( 1726890 )

        We already know what the next stage is. It's narrowing training sets to specifically pre-selected material for each instance, for specialist usage.

        That is what would give you a high success rate in things like "diagnosing pediatric medical cases". The problem is that preparing each dataset takes time and effort, so this will take some time to manifest.

        This idiocy basically figured out that asking the general internet for a pediatric diagnosis is dumb. Considering who did it, I'm guessing it's done by professional pressu

      • by ceoyoyo ( 59147 )

        GPT was trained to generate text that is similar to text on the web and in literature. ChatGPT is that, but additionally trained to generate text that people think sounds coherent.

        It was not trained to "be" anything. All of that emerges from the former.

        • OpenAI paid an agency that paid hundreds of Kenyans (at extremely low hourly rates) to train ChatGPT to be a digital assistant.

          One of the dirty secrets of current LLMs is that they're essentially very elaborate & sophisticated mechanical turks, a kind of stochastic parrot, if you will, i.e. taking human intelligence & cognitive work & repackaging it as "artificial." I find it helpful to imagine that if ChatGPT were in the spoken word, it'd have a Kenyan accent.
          • by ceoyoyo ( 59147 )

            That's not what a mechanical turk is.

            • Essentially, a mechanical Turk is a machine that is presented as autonomous but is actually operated by humans. I stand by this description of AI. It's somewhat abstracted & automated to a degree but GenAI is the work of thousands of human workers. GenAI is useless without them. Additionally, at the other end, we're all expected to evaluate the quality & validity of the output because, you know, hallucinations &, more often, output that looks convincing at first glance but is actually not fit fo
    • Re:And? (Score:4, Insightful)

      by Rei ( 128717 ) on Thursday January 04, 2024 @10:33PM (#64133325) Homepage

      And as far as I can tell (the actual study is paywalled), they committed the cardinal sin of AI research: "Not Comparing To Human Controls".

      Okay, the AI got 17%. What was the success rate for humans unconnected to the case and given the same information? Without that, this is an utterly meaningless statistic. These are said to be particularly challenging cases that are used as examples for helping pediatricians learn to detect them. What is the human success rate in that case?

      Among the failures, researchers noted that ChatGPT appeared to struggle with spotting known relationships between conditions that an experienced physician would hopefully pick up on.

      Hopefully? Are you f'ing kidding me? Since when do studies revolve around "hopefully" as their control?

      • Re:And? (Score:5, Interesting)

        by Rei ( 128717 ) on Thursday January 04, 2024 @10:36PM (#64133333) Homepage

        And just as a random anecdote (and nothing more than an anecdote), using ChatGPT (not GPT-4): my mother has struggled most of her adult life with an ever-worsening constellation of symptoms and has been bounced around from one diagnosis to the next. She finally got a diagnosis that is spot on, and I mean not just the big stuff like the debilitating "skin-on-fire" neuralgia, but down to the weirdest, most esoteric little details, like deep peeling fissures on her feet and large benign salivary gland growths: Sjögren's. (The reason it took so long is that on one of the tests for diagnosing Sjögren's (I don't know which one) she didn't exceed the diagnostic threshold, but apparently the test is controversial, as it has a relatively weak correlation with the symptoms.)

        She finally got the diagnosis right around the time ChatGPT came out. Curious (as she had been struggling with this for decades), I punched her long list of symptoms and test results into it and asked for several possibilities, in order of likelihood. Right at the top of the list? Sjögren's. The next ones were various other diagnoses she had been given over the years that hadn't really helped, but had at various times seemed plausible to professionals. Sjögren's is widely underdiagnosed [sjogrensadvocate.com], and it'd be nice if that could be remedied.

        Just an anecdote, of course.

        • by Rei ( 128717 )

          (Oh, and of course she has Epstein-Barr reactivation. That virus seriously sucks [wikipedia.org]. :P)

        • My kid had something very rare when he was young. Doctors didn't really know what it was, so I searched on the internet until I found something that had an exact match for the symptoms shown. What does ChatGPT bring to the table other than making a one-hour hunt into a 16-minute one?
      • by ceoyoyo ( 59147 )

        The paper is here:

        https://jamanetwork.com/journa... [jamanetwork.com]

        Yep, that's all of it. It's a short letter, not a full paper.

        Hopefully? Are you f'ing kidding me? Since when do studies revolve around "hopefully" as their control?

        Unfortunately, that's medicine. There's way too much "hopefully." It would be interesting to see this as a proper scientific experiment with controls but that runs the serious risk of showing how bad initial diagnoses are.

    • Comment removed based on user account deletion
  • Garbage In. Garbage out. Oh noes....

  • All that means is the training data hasn't been appropriate yet.
  • Our first Pediatrician told us that our baby was fine and to just check in at the next appt in 3 days.

    That night we were confident she was dehydrated and took her to the ER where she was admitted and put on an IV.

    I was curious so I told ChatGPT exactly what I told the pediatrician in the afternoon and it returned a proper diagnosis and said to go to the ER.

    We changed pediatricians.

  • and bring World Peace to humanity at the same time!

  • they need to add more else if statements.
    • by PPH ( 736903 )

      This.

      if (patient == kid) printf("Eats too many boogers\n");

    • Heh, this reminded me of the DoD's attempts to automate diagnosis. Basically, a questionnaire system that used flowcharts and such, some of which the patient answered and some the doctor answered. At the end it'd spit out suggestions for more tests, a possible prognosis, etc...

      It was never adopted, primarily due to doctor pushback, but it showed promise in reducing misdiagnosis and properly detecting rare conditions. For all that doctors supposedly have all this stuff memorized, computers are still better at remembering

  • ... than employing a human doctor, and that will be sufficient to convince the "decision makers" to rather let ChatGPT do the diagnostic work.
    • Forget ChatGPT. Medical diagnosis was a solved problem in AI 20 years back. What doctors do doesn't need Generalized AI; a rule-based expert system can do it (see the sketch after this thread). Medicine is mostly pattern recognition; it doesn't need higher functions like a Generalized AI.

      Doctors still exist because people don't want bad news from a computer program.
      • A rule-based expert system expects accurate inputs, but people lie to their doctors. A human doctor's job is part lie detector, part pattern recognizer. Computers can do the pattern recognition easily. The lie detection is beyond the capability of current AI.
        • Rule-based systems are perfect for analyzing symptoms and providing a diagnosis; in fact, they are probably far more reliable than a doctor at that. Yes, social engineering is also part of what a doctor does, but most doctors suck balls at that too.
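
          A toy sketch of the rule-based approach described in this thread: hand-written rules map reported findings to candidate conditions and are ranked by how well they match (the rules themselves are invented purely for illustration, not medical advice):

            # Each rule: a set of required findings plus the condition it suggests.
            RULES = [
                ({"fever", "barking cough", "stridor"}, "croup"),
                ({"fever", "rash", "strawberry tongue"}, "scarlet fever"),
                ({"bleeding gums", "fatigue", "restricted diet"}, "scurvy (vitamin C deficiency)"),
            ]

            def rank_conditions(findings):
                """Rank conditions by the fraction of their required findings that are present."""
                scored = [(condition, len(required & findings) / len(required))
                          for required, condition in RULES if required & findings]
                return sorted(scored, key=lambda pair: pair[1], reverse=True)

            # "scarlet fever" ranks first because all three of its findings are present.
            print(rank_conditions({"fever", "rash", "strawberry tongue", "fatigue"}))
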
  • It is a statistical word selector. It selects words based on its training. It is not doing a diagnosis; it is choosing the most likely next word in a sentence (see the sketch after this thread). People expect way too much of a word selector. It has no imagination; it is simply choosing words based on words it has already seen and the context in which those words are used. It has no thoughts, and it certainly does not care about the patient.

    • by gweihir ( 88907 )

      Well, 80% of the human race cannot be swayed in their baseless opinions by rational argument. What do you expect? Of course a population generally _this_ mentally dysfunctional can well see LLMs as the second coming and ignore all evidence to the contrary. You are right, of course, but most people are incapable of seeing that, no matter the evidence available.
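
    A bare-bones illustration of that "most likely next word" loop, using greedy next-token selection with a small open model (GPT-2 as a stand-in; the prompt and length are arbitrary):

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      ids = tokenizer("The patient presents with", return_tensors="pt").input_ids
      with torch.no_grad():
          for _ in range(10):
              logits = model(ids).logits[:, -1, :]                   # score every possible next token
              next_id = torch.argmax(logits, dim=-1, keepdim=True)   # keep only the single most likely one
              ids = torch.cat([ids, next_id], dim=-1)
      print(tokenizer.decode(ids[0]))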

  • More examples of people using calculators to spell boob.

  • OpenAI wokifies these things too aggressively; since wokism = wrong logic, the system is gaining wrong logic throughout.
    • by gweihir ( 88907 )

      Nope. This is a fundamental problem. Incidentally, it is a well-known problem, one IBM Watson had as well, and it is the reason IBM stopped trying to use Watson in the medical space completely. It just cannot perform; it makes the most stupid mistakes and kills patients where a real MD would not have. Sure, on _average_ the models are somewhat better than most MDs, but the body count of these models is (or would be) wayyy higher, and that kills them completely.

  • It's almost like wishing doesn't simply make it so.

    We DO NOT HAVE AI. Can we be clear on that? We have some pretty sophisticated text-prediction algorithms but that IS NOT AI.

  • There is just so much hype about what LLM technology can do. It does NOT build correct full models; it works mostly on surface details. It does NOT perform true causal analysis, hence it cannot properly explain how it reached a voiced conclusion.

    2024 will be a year of hype-reduction as people learn they've been bombarded with BS. It will take a few more law-citation hallucination incidents before people wake up.

  • Drawing the conclusion that, because a general LLM can't diagnose medical cases, AI won't be able to replace GPs is beyond foolishness; it's outright idiocy.

    Provide a means of inputting the correct data and train a model based on that and then see what happens.

    A slightly modified airport body scanner would provide more than enough data to diagnose medical conditions far more accurately than any doctor could dream of. Walk through, take scans producing 3-5 seconds of video (it would require rapid scanning, l
    • The "crap research" is what uou wrote. You have no clue about medical imaging if you think an airport security scanner has much diagnostic value. You're just plain wrong.
      • No, an airport security scanner doesn't. A modified scanner, on the other hand, does. This has been proven in labs and proven in research. But I suppose skipping little things like qualifiers such as "slightly modified" for the purpose of convenience is... never mind
  • ChatGPT wasn't trained on medical data, so it isn't a surprise it doesn't do well. This is like giving a hairdresser a couple of files and then asking her/him to diagnose. There are specialized AIs (for instance by IBM) already being trained on real medical data, and those are becoming much, MUCH better than most doctors at giving a correct diagnosis.
  • Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation.

  • That is its main problem. It lies often, and with a naturalness that would put Donald Trump to shame. Most irritatingly, after getting caught in an obvious lie it apologizes claiming that it made a mistake, and proceeds to lie again.
  • Subject line was suggested by Slashdot autofill....

    It's pretty amazing that a language model trained on random text from the web and literature gets 17% right.

    This is a letter to a medical journal. It's very short and the study itself is pretty brief. There's a notable lack of comparators. They're really measuring how much the language model agrees with the treating physician(s) at some point. Human diagnoses are frequently wrong too, but there's no measure of how well human raters of various skill levels would do.

  • ChatGPT is still no House, MD

    Given that House, MD is inevitably wrong for 3 out of 4 acts every episode (a 75% miss rate), that's actually right in the ballpark.
