AI Medicine

ChatGPT Bombs Test On Diagnosing Kids' Medical Cases With 83% Error Rate (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: ChatGPT is still no House, MD. While the chatty AI bot has previously underwhelmed with its attempts to diagnose challenging medical cases -- with an accuracy rate of 39 percent in an analysis last year -- a study out this week in JAMA Pediatrics suggests the fourth version of the large language model is especially bad with kids. It had an accuracy rate of just 17 percent when diagnosing pediatric medical cases. The low success rate suggests human pediatricians won't be out of jobs any time soon, in case that was a concern. As the authors put it: "[T]his study underscores the invaluable role that clinical experience holds." But it also identifies the critical weaknesses that led to ChatGPT's high error rate and ways to transform it into a useful tool in clinical care. With so much interest and experimentation with AI chatbots, many pediatricians and other doctors see their integration into clinical care as inevitable. [...]

For ChatGPT's test, the researchers pasted the relevant text of the medical cases into the prompt, and then two qualified physician-researchers scored the AI-generated answers as correct, incorrect, or "did not fully capture the diagnosis." In the latter case, ChatGPT came up with a clinically related condition that was too broad or unspecific to be considered the correct diagnosis. For instance, ChatGPT diagnosed one child's case as caused by a branchial cleft cyst -- a lump in the neck or below the collarbone -- when the correct diagnosis was Branchio-oto-renal syndrome, a genetic condition that causes the abnormal development of tissue in the neck, and malformations in the ears and kidneys. One of the signs of the condition is the formation of branchial cleft cysts. Overall, ChatGPT got the right answer in just 17 of the 100 cases. It was plainly wrong in 72 cases, and did not fully capture the diagnosis of the remaining 11 cases. Among the 83 wrong diagnoses, 47 (57 percent) were in the same organ system.
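
Below is a hypothetical sketch of that workflow using the OpenAI Python client; the model name, prompt wording, and helper function are illustrative assumptions, not details taken from the study:

    # Hypothetical sketch only: feed each case's text to a GPT-4-class model and
    # collect the answer for physician scoring. Not the study's actual tooling.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def diagnose(case_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Give the most likely diagnosis for this pediatric case:\n\n" + case_text,
            }],
        )
        return response.choices[0].message.content

    # Two physician-researchers would then score each answer as correct, incorrect,
    # or "did not fully capture the diagnosis".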

Among the failures, researchers noted that ChatGPT appeared to struggle with spotting known relationships between conditions that an experienced physician would hopefully pick up on. For example, it didn't make the connection between autism and scurvy (Vitamin C deficiency) in one medical case. Neuropsychiatric conditions, such as autism, can lead to restricted diets, and that in turn can lead to vitamin deficiencies. As such, neuropsychiatric conditions are notable risk factors for the development of vitamin deficiencies in kids living in high-income countries, and clinicians should be on the lookout for them. ChatGPT, meanwhile, came up with the diagnosis of a rare autoimmune condition. Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation. They also suggest chatbots could improve with more real-time access to medical data, allowing the models to refine their accuracy, described as "tuning."

Comments Filter:
  • by aldousd666 ( 640240 ) on Thursday January 04, 2024 @07:13PM (#64132989) Journal
    Unless you trained it on kids' medical diagnoses, I think it's not really so surprising that you are getting wrong answers. If the idea of this article is to be surprised that it has not generalized 'medical diagnosis' yet, well, I guess color me unsurprised.
    • AI needs to be better educated in the curriculum it is going to serve; maybe AI can go to medical school
      • by leptons ( 891340 )
        ChatGPT isn't AI, it's not "intelligence", it's a glorified filter. You throw in some words and it bounces them around and echoes of text someone else wrote pop out the other end. It has no capacity to reason, think, or diagnose anything.
        • People keep saying this as if you truly know how the human brain reasons and thinks. It is a good talking point for those skeptical by nature, but you take it 10 steps too far calling it a filter.
          • by leptons ( 891340 )
            Human neural networks are far different from LLMs, which don't have experiences or motivations or anything that gives a meat computer something a digital automaton will never have. A machine can unfeelingly infer a connection between two things, but it does not know why. LLMs are "trained" on text written by humans, but they do not grasp the meaning of the text, they can't reason about it or be genuinely creative with it, and they quite often fail to produce a meaningful result because they lack the co
    • ChatGPT was trained to be a general digital assistant. I haven't read anything about the selection process of what went in, but I'm assuming the criteria were pretty broad. If that's the case, then I think the output generated, at least from what I've seen in fairly intensive use since it was made publicly available, reflects the vague, general nature of the input, so yeah, pretty much what you've said. It ain't much use for anything other than producing confident-sounding bullshit, which may be enough for m
      • I originally thought they'd market specially trained bots for specific industries, and I guess they still may do that. But it seems like 'everybody' knows how to build those with open source tools and models. I mean, Llama 2 and Mistral base models can be fine-tuned on a laptop. Or anyone renting the services of a professional for a few hours could do it on the cloud too. So it could be hard to monetize that idea, at least for OpenAI. I would think that they will have to just keep wowing everyone with new
        • Re:And? (Score:4, Interesting)

          by Rei ( 128717 ) on Thursday January 04, 2024 @11:01PM (#64133369) Homepage

          I mean, Llama 2 and Mistral base models can be fine-tuned on a laptop.

          Ehhhh.... sorta kinda but not really. You're thinking more inference.

          I'm doing a full finetune on TinyLLaMA right now, which is a mere 1.1B parameters with a 2048-token context. With a micro batch size of 20 it consumes 93% of the VRAM on my 24GB RTX 3090. By contrast, even with a small batch size, it's hard to do a full finetune on anything over ~4-ish billion parameters.
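
          A minimal full-finetune sketch of that kind of setup with the Hugging Face Trainer; the model id, dataset file, and hyperparameters are illustrative assumptions, not the poster's actual configuration:

            # Full finetune: every weight is updated, which is what eats VRAM
            # compared with adapter-style (LoRA) training.
            import torch
            from datasets import load_dataset
            from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                                      TrainingArguments, DataCollatorForLanguageModeling)

            model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"       # assumed model id
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            tokenizer.pad_token = tokenizer.eos_token             # Llama tokenizers ship without a pad token
            model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

            dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
            dataset = dataset.map(
                lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

            args = TrainingArguments(
                output_dir="tinyllama-full-ft",
                per_device_train_batch_size=20,   # the "micro batch size of 20" mentioned above
                num_train_epochs=1,
                bf16=True,
            )
            Trainer(model=model, args=args, train_dataset=dataset,
                    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()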

          You can do LoRAs, and especially QLoRAs, with less VRAM (and thus on larger models), though it still is more VRAM-hungry than inference. But then you're not adjusting all weights.
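
          And a rough QLoRA sketch of that approach with the peft and bitsandbytes libraries: the base weights are loaded in 4-bit and frozen, and only small adapter matrices get trained (model id, rank, and target modules are assumptions for illustration):

            import torch
            from transformers import AutoModelForCausalLM, BitsAndBytesConfig
            from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

            bnb = BitsAndBytesConfig(load_in_4bit=True,
                                     bnb_4bit_quant_type="nf4",
                                     bnb_4bit_compute_dtype=torch.bfloat16)
            model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                                         quantization_config=bnb,
                                                         device_map="auto")
            model = prepare_model_for_kbit_training(model)   # freeze the quantized base weights

            lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                              target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                              task_type="CAUSAL_LM")
            model = get_peft_model(model, lora)
            model.print_trainable_parameters()               # only the adapters are updated, not all weights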

          But... the transformers package continues to evolve. 8-bit training hasn't gone well so far as a solution, but I have a lot of hope that MoEs will let us do full finetunes on consumer-grade hardware. Mixtral, for example: when people try training it today, I'm seeing lots of reports of OOM on 8xA100 (640GB!) systems, which is crazy for an 8x7B model. It *seems* like, if properly balanced, it should be possible to fan it out with one expert (plus a copy of the attention heads) on each of a cluster of NVLink'ed 3090s (aka 16x cards; there are motherboards designed for GPU crypto mining that can handle that) without bandwidth bottlenecking, which should be both cheap and crazy-fast. But there's no shot at that with the state of transformers today. I think another problem being hit is the sliding attention window: people are trying to train with insane numbers of tokens (it supports sequence lengths of 4096x32, aka 131072 tokens). But should it really be necessary to train on more than the base context window (4096)?

          But anyway, while there's not much hope for doing high-parameter single-expert models on consumer hardware, I do think that there's real hope - once the software catches up - for doing MoEs on consumer hardware.

          (And yeah, training experts in each scientific field is something I've for a long time wanted to do (I even have a dataset of all open science papers up to a recent date cutoff, ready to go), but it's just not been practical except with either LoRAs or very small models. Maybe LoRAs would be enough, I dunno....)

      • It's the phrasing: they keep calling it an AI. Sort of like Tesla calls it Autopilot. But once you take your hands off the wheel, you've got a chance of ending up wrecked.

        ChatGPT is about as much AI as Tesla has full self-driving.
        It is mostly marketing.

      • by Luckyo ( 1726890 )

        We already know what the next stage is. It's narrowing training sets to specifically pre-selected material for each instance, for specialist usage.

        That is what would give you a high success rate in things like "diagnosing pediatric medical cases". The problem is that preparing each dataset takes time and effort, so this will take some time to manifest.

        This idiocy basically figured out that asking the general internet for a pediatric diagnosis is dumb. Considering who did it, I'm guessing it's done by professional pressu

      • by ceoyoyo ( 59147 )

        GPT was trained to generate text that is similar to text on the web and in literature. ChatGPT is that, but additionally trained to generate text that people think sounds coherent.

        It was not trained to "be" anything. All of that emerges from the former.

        • OpenAI paid an agency that paid hundreds of Kenyans (at extremely low hourly rates) to train ChatGPT to be a digital assistant.

          One of the dirty secrets of current LLMs is that they're essentially very elaborate & sophisticated mechanical turks, a kind of stochastic parrot, if you will, i.e. taking human intelligence & cognitive work & repackaging it as "artificial." I find it helpful to imagine that if ChatGPT were in the spoken word, it'd have a Kenyan accent.
          • by ceoyoyo ( 59147 )

            That's not what a mechanical turk is.

            • Essentially, a mechanical Turk is a machine that is presented as autonomous but is actually operated by humans. I stand by this description of AI. It's somewhat abstracted & automated to a degree but GenAI is the work of thousands of human workers. GenAI is useless without them. Additionally, at the other end, we're all expected to evaluate the quality & validity of the output because, you know, hallucinations &, more often, output that looks convincing at first glance but is actually not fit fo
    • Re:And? (Score:4, Insightful)

      by Rei ( 128717 ) on Thursday January 04, 2024 @10:33PM (#64133325) Homepage

      And as far as I can tell (the actual study is paywalled), they committed the cardinal sin of AI research: "Not Comparing To Human Controls".

      Okay, the AI got 17%. What was the success rate for humans unconnected to the case and given the same information? Without that, this is an utterly meaningless statistic. These are said to be particularly challenging cases that are used as examples for helping pediatricians learn to detect them. What is the human success rate in that case?

      Among the failures, researchers noted that ChatGPT appeared to struggle with spotting known relationships between conditions that an experienced physician would hopefully pick up on.

      Hopefully? Are you f'ing kidding me? Since when do studies revolve around "hopefully" as their control?

      • Re:And? (Score:5, Interesting)

        by Rei ( 128717 ) on Thursday January 04, 2024 @10:36PM (#64133333) Homepage

        And just as a random anecdote (and nothing more than an anecdote), using ChatGPT (not GPT-4): my mother has struggled most of her adult life with an ever-worsening constellation of symptoms and has been bounced around from one diagnosis to the next. She finally got a diagnosis that is spot on, and I mean not just the big stuff like the debilitating "skin-on-fire" neuralgia, but down to the weirdest, most esoteric little details, like deep peeling fissures on her feet and large benign salivary gland growths: Sjögren's. (The reason it took so long is that on one of the tests for diagnosing Sjögren's (I don't know which one) she didn't exceed the diagnostic threshold, but apparently the test is controversial, as it has a relatively weak correlation with the symptoms.)

        She finally got the diagnosis right around the time ChatGPT came out. Curious (as she had been struggling with this for decades), I punched her long list of symptoms and test results into it and asked for several possibilities, in order of likelihood. Right at the top of the list? Sjögren's. The next ones were various other diagnoses she had been given over the years that hadn't really helped, but had at various times seemed plausible to professionals. Sjögren's is widely underdiagnosed [sjogrensadvocate.com], and it'd be nice if that could be remedied.

        Just an anecdote, of course.

        • by Rei ( 128717 )

          (Oh, and of course she has Epstein-Barr reactivation. That virus seriously sucks [wikipedia.org]. :P)

        • My kid had something very rare when he was young. Doctors didn't really know what it was, so I searched on the internet until I found something that had an exact match for the symptoms shown. What does ChatGPT bring to the table other than making a one-hour hunt into a 16-minute one?
      • by ceoyoyo ( 59147 )

        The paper is here:

        https://jamanetwork.com/journa... [jamanetwork.com]

        Yep, that's all of it. It's a short letter, not a full paper.

        Hopefully? Are you f'ing kidding me? Since when do studies revolve around "hopefully" as their control?

        Unfortunately, that's medicine. There's way too much "hopefully." It would be interesting to see this as a proper scientific experiment with controls but that runs the serious risk of showing how bad initial diagnoses are.

    • Comment removed based on user account deletion
  • Garbage In. Garbage out. Oh noes....

  • All that means is the training data hasn't been appropriate yet.
  • Our first Pediatrician told us that our baby was fine and to just check in at the next appt in 3 days.

    That night we were confident she was dehydrated and took her to the ER where she was admitted and put on an IV.

    I was curious so I told ChatGPT exactly what I told the pediatrician in the afternoon and it returned a proper diagnosis and said to go to the ER.

    We changed pediatricians.

  • and bring World Peace to humanity at the same time!

  • they need to add more else if statements.
    • by PPH ( 736903 )

      This.

      if (patient == kid) printf("Eats too many boogers\n");

    • Heh, this reminded me of the DoD's attempts to automate diagnosis. Basically, a questionnaire system that used flowcharts and such, some of which the patient answered and some the doctor answered. At the end it'd spit out suggestions for more tests, a possible prognosis, etc...

      It was never adopted, primarily due to doctor pushback, but it showed promise in reducing misdiagnosis and properly detecting rare conditions. For all that doctors supposedly have all this stuff memorized, computers are still better at remembering

  • ... than employing a human doctor, and that will be sufficient to convince the "decision makers" to rather let ChatGPT do the diagnostic work.
    • Forget ChatGPT. Medical diagnosis was a solved problem in AI 20 years back. What doctors do doesn't need Generalized AI; a rule-based expert system can do it (see the sketch after this thread). Medicine is mostly pattern recognition; it doesn't need higher functions like a Generalized AI.

      Doctors still exist because people don't want bad news from a computer program.
      • A rule-based expert system expects accurate inputs, but people lie to their doctors. A human doctor's job is part lie detector, part pattern recognizer. Computers can do the pattern recognition easily. The lie detection is beyond the capability of current AI.
        • Rule-based systems are perfect for analyzing symptoms and providing a diagnosis; in fact, they are probably far more reliable than a doctor at that. Yes, social engineering is also part of what a doctor does, but most doctors suck balls at that too.
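
          A toy sketch of the rule-based approach described in this thread: hand-written rules map reported findings to candidate conditions and are ranked by how well they match (the rules themselves are invented purely for illustration, not medical advice):

            # Each rule: a set of required findings plus the condition it suggests.
            RULES = [
                ({"fever", "barking cough", "stridor"}, "croup"),
                ({"fever", "rash", "strawberry tongue"}, "scarlet fever"),
                ({"bleeding gums", "fatigue", "restricted diet"}, "scurvy (vitamin C deficiency)"),
            ]

            def rank_conditions(findings):
                """Rank conditions by the fraction of their required findings that are present."""
                scored = [(condition, len(required & findings) / len(required))
                          for required, condition in RULES if required & findings]
                return sorted(scored, key=lambda pair: pair[1], reverse=True)

            # "scarlet fever" ranks first because all three of its findings are present.
            print(rank_conditions({"fever", "rash", "strawberry tongue", "fatigue"}))
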
  • It is a statistical word selector. It selects words based on its training. It is not doing a diagnosis; it is choosing the most likely next word in a sentence (see the sketch after this thread). People expect way too much of a word selector. It has no imagination; it is simply choosing words based on words it has already seen and the context in which those words are used. It has no thoughts, and it certainly does not care about the patient.

    • by gweihir ( 88907 )

      Well, 80% of the human race cannot be swayed in their baseless opinions by rational argument. What do you expect? Of course a population generally _this_ mentally dysfunctional can well see LLMs as the second coming and ignore all evidence to the contrary. You are right, of course, but most people are incapable of seeing that, no matter the evidence available.
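
    A bare-bones illustration of that "most likely next word" loop, using greedy next-token selection with a small open model (GPT-2 as a stand-in; the prompt and length are arbitrary):

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      ids = tokenizer("The patient presents with", return_tensors="pt").input_ids
      with torch.no_grad():
          for _ in range(10):
              logits = model(ids).logits[:, -1, :]                   # score every possible next token
              next_id = torch.argmax(logits, dim=-1, keepdim=True)   # keep only the single most likely one
              ids = torch.cat([ids, next_id], dim=-1)
      print(tokenizer.decode(ids[0]))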

  • More examples of people using calculators to spell boob.

  • OpenAI wokifies these things too aggressively; since wokism = wrong logic, the system is gaining wrong logic throughout.
    • by gweihir ( 88907 )

      Nope. This is a fundamental problem. Incidentally, it is a well-known problem, one IBM Watson had as well, and it is the reason IBM stopped trying to use Watson in the medical space completely. It just cannot perform; it makes the most stupid mistakes and kills patients where a real MD would not have. Sure, on _average_ the models are somewhat better than most MDs, but the body count of these models is (or would be) wayyy higher, and that kills them completely.

  • It's almost like wishing doesn't simply make it so.

    We DO NOT HAVE AI. Can we be clear on that? We have some pretty sophisticated text-prediction algorithms but that IS NOT AI.

  • There is just so much hype about what LLM technology can do. It does NOT build correct full models; it works mostly on surface details. It does NOT perform true causal analysis, hence it cannot properly explain how it reached a voiced conclusion.

    2024 will be a year of hype-reduction as people learn they've been bombarded with BS. It will take a few more law-citation hallucination incidents before people wake up.

  • Drawing the conclusion that, because a general LLM can't diagnose medical cases, AI won't be able to replace GPs is beyond foolishness; it's outright idiocy.

    Provide a means of inputting the correct data and train a model based on that and then see what happens.

    A slightly modified airport body scanner would provide more than enough data to diagnose medical conditions far more accurately than any doctor could dream of. Walk through, take scans producing 3-5 seconds of video (it would require rapid scanning, l
    • The "crap research" is what uou wrote. You have no clue about medical imaging if you think an airport security scanner has much diagnostic value. You're just plain wrong.
      • No, an airport security scanner doesn't. A modified scanner, on the other hand, does. This has been proven in labs and proven in research. But I suppose skipping little things like qualifiers such as "slightly modified" for the purpose of convenience is... never mind
  • ChatGPT wasn't trained on medical data, so it isn't a surprise it doesn't do well. This is like giving a hairdresser a couple of files and then asking her/him to diagnose. There are specialized AIs (for instance by IBM) already being trained on real medical data, and those are becoming much, MUCH better than most doctors at giving a correct diagnosis.
  • Though the chatbot struggled in this test, the researchers suggest it could improve by being specifically and selectively trained on accurate and trustworthy medical literature -- not stuff on the Internet, which can include inaccurate information and misinformation.

  • That is its main problem. It lies often, and with a naturalness that would put Donald Trump to shame. Most irritatingly, after getting caught in an obvious lie it apologizes claiming that it made a mistake, and proceeds to lie again.
  • Subject line was suggested by Slashdot autofill....

    It's pretty amazing that a language model trained on random text from the web and literature gets 17% right.

    This is a letter to a medical journal. It's very short and the study itself is pretty brief. There's a notable lack of comparators. They're really measuring how much the language model agrees with the treating physician(s) at some point. Human diagnoses are frequently wrong too, but there's no measure of how well human raters of various skill levels would do.

  • ChatGPT is still no House, MD

    Given that House, MD is inevitably wrong for 3 out of 4 acts every episode (a 75% miss rate), that's actually right in the ballpark.
