Follow Slashdot blog updates by subscribing to our blog RSS feed

ChatGPT-4 Beat Doctors at Diagnosing Illness, Study Finds (nytimes.com) 102

Posted by EditorDavid on Monday November 18, 2024 @05:04AM from the medical-machines dept.

Dr. Adam Rodman, a Boston-based internal medicine expert, helped design a study testing 50 licensed physicians to see whether ChatGPT improved their diagnoses, reports the New York TImes. The results? "Doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot.

"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.

The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when a chatbot potentially suggests a better one.

And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.

"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."

This discussion has been archived. No new comments can be posted.

ChatGPT-4 Beat Doctors at Diagnosing Illness, Study Finds

Load All Comments

Search 102 Comments Log In/Create an Account

Comments Filter:

Dunning-Kruger effect (Score:5, Insightful)

by backslashdot ( 95548 ) writes: on Monday November 18, 2024 @05:11AM (#64953461)

So the AI was 90% accurate, but most of the time doctors didn't trust it so when ahead with their own incorrect diagnosis? One thing I want to know is how bad the 10% that the AI missed were .. like major blunders or what? Also, what about the 26% that the doctors missed .. how severe was the error? Anyone read the actual study? (Yes I know it's linked, but I'm an a slashdotter.)

Share
twitter facebook
- Re:Dunning-Kruger effect (Score:5, Insightful)
  
  by martin-boundary ( 547041 ) writes: on Monday November 18, 2024 @05:29AM (#64953485)
  
  Accuracy alone means nothing, as usual.
  In a binary classification task, there are two numbers that should be reported, false negatives and true positives, or alternatively recall and precision, or alternatively the confusion matrix, etc.
  The point is that comparisons of classifiers (humans doctors or AI) are impossible on a linear scale, and anyone who reports results on a linear scale is biased. The math says so.
  
  Parent Share
  twitter facebook
  - Re:Dunning-Kruger effect (Score:5, Insightful)
    
    by phantomfive ( 622387 ) writes: on Monday November 18, 2024 @07:46AM (#64953611) Journal
    
    In a binary classification task, there are two numbers that should be reported, false negatives and true positives, or alternatively recall and precision, or alternatively the confusion matrix, etc.
    
    This makes no sense. The answers were graded by a panel of expert doctors. It wasn't a binary classification task, there were multiple answers to each question.
    
    Parent Share
    twitter facebook
    - Re: (Score:2, Informative)
      
      by martin-boundary ( 547041 ) writes:
      
      Ah you don't know enough about multiple class classification, sorry I should not have assumed.
      In a multiple class problem (say N possible answers) there is a NxN confusion matrix, so even more numbers that must be reported to compare two classifiers. Also, a multiple class problem can always be represented as a sequence of binary classifications, so there is really no loss of generality.
      In all cases, accuracy alone is not a useful way to compare two N-way classifiers or even rank a collection of them.
      - Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        I don't see how this is a classifier. How would you compare them?
        
        Re: Dunning-Kruger effect (Score:5, Informative)
        
        by beelsebob ( 529313 ) writes: on Monday November 18, 2024 @11:29AM (#64954021)
        
        How is it not a classifier? It takes a bunch of encoded information about signs and symptoms, and then attempts to identify the condition associated with them. Take a large amount of fuzzy information and tell you one (or a few) things that you could be looking at is pretty much exactly the definition of a classifier.
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        Is there anything a neural network does that you wouldn't consider to be a classifier?
        
        Re: (Score:2)
        
        by beelsebob ( 529313 ) writes:
        
        Absolutely - you know - things that don't fit the description above. For example, generating huge fuzzy outputs from relatively small inputs, like image generation.
        
        Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        But as soon as a person says, "This image matches the prompt I gave" or "This image doesn't match the prompt I gave," wouldn't you say that's classification?
      - Re: (Score:2)
        
        by ceoyoyo ( 59147 ) writes:
        
        Ironic is the subject line.
- Re:Dunning-Kruger effect (Score:5, Insightful)
  
  by 93 Escort Wagon ( 326346 ) writes: on Monday November 18, 2024 @05:30AM (#64953487)
  
  Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
  As in "70 YO male, lifetime smoker, presents with a persistent cough and severe shortness of breath" = "gangrenous foot, immediate amputation required to save patient"
  
  Parent Share
  twitter facebook
  - Re: (Score:3, Interesting)
    
    by ShanghaiBill ( 739463 ) writes:
    
    Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
    If they were, then that makes the human doctors look even worse.
    If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
    But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
    - Re: (Score:2)
      
      by cascadingstylesheet ( 140919 ) writes:
      
      Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
      If they were, then that makes the human doctors look even worse.
      If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
      But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
      Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
      - Re: (Score:3)
        
        by Ol Olsoc ( 1175323 ) writes:
        
        Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
        It is a low bar indeed. So many people look at doctors as infallible. Meh. They are just people, and I'd venture a guess that the popular meme of their infallibility leads many to believe that any decision they make is correct.
        I seldom go to see doctors, and as an example, my last visit during the plague year was for cellulitis. Doctor was apparently miffed that I self diagnosed, and wanted to prove to me it was something else. Then was a bit more agitated that my diagnosis was correct.
        Ego based diagno
        
        Re:Dunning-Kruger effect (Score:4, Interesting)
        
        by AleRunner ( 4556245 ) writes: on Monday November 18, 2024 @11:42AM (#64954047)
        
        You have lots of things right there. Especially about understanding that a doctor is a human and that you are responsible for your own body. That means that you are trying to use them as an expert to get advice rather than expecting them to fix you.
        I think a useful approach for IT guys going to the doctor is to think what you want in a good bug report. You want the full information and you don't want any suggested diagnosis until you heard everything. Someone comes to you with their disk full, you don't want to miss the fact that there's someone else writing to the disk at much greater speed than it should be. You don't want to hear "my disk is full, expand it for me". You want to hear "I have a 100TB disk, every time I clean out the logs and delete a month from the history it frees up 10TB, but then fills up again in two days".
        There's a reason the doctors moto isn't "cure them all". It's "do no harm". If the doctor gives the wrong cure - chops off the wrong leg trying to get rid of your cancer for example - that can be much worse than if they did nothing and left it for a different doctor to do the right cure. Nowadays they even changed the process for joint replacement so that the patient gets given a marker pen and writes something like "replace this knee" on their own leg.
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by Ol Olsoc ( 1175323 ) writes:
        
        Nowadays they even changed the process for joint replacement so that the patient gets given a marker pen and writes something like "replace this knee" on their own leg.
        Yup - My SO had some shoulder operations over the last decade ending up with a full reverse replacement. Each time, they used the sharpie trick. A very good idea. It was a super success, her surgeon invited other residents in to observe during the follow up appointments.
        I think that some Doctors get annoyed with me because I approach them as an equal, not them an oracle of wisdom. Bruises da ego, I suppose.
    - Re: (Score:2)
      
      by AleRunner ( 4556245 ) writes:
      
      If they were, then that makes the human doctors look even worse.
      You are kind of assuming that ChatGPT only improved human doctors. It could be that there were some diagnoses that the human doctors got right, but ChatGPT provided a convincing but completely wrong justification for a different diagnosis. I wouldn't also rule out the uncanny valley effect. Probably the doctors quite quickly understand that the AI doesn't know what it's talking about, just spouts from a kind of hidden script. They come to actively distrust the opinion of the AI and find it offputting.
      In a r
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        In a real sense, this probably says more bad things about AI in general than about these particular doctors.
        No, it says quite a lot about the doctors.
        Diagnostic medicine is the art of finding the signal in an absurd amount of noise.
        If your answer to noisy signal is to discard it, then I sure as fuck don't want you treating me. Go kill someone else, thanks.
  - Re: (Score:2)
    
    by ceoyoyo ( 59147 ) writes:
    
    It's a good thing for algorithms, when they're wrong, to be spectacularly wrong. It makes it easier to identify the mistakes. Your guy with a cough is probably going to ask for a second opinion when they want to aputate his foot.
    Theoretically this would also be good for the human physicians, but they tend to argue very effectively in their own favour.
- Re:Dunning-Kruger effect (Score:5, Insightful)
  
  by Errol backfiring ( 1280012 ) writes: on Monday November 18, 2024 @05:32AM (#64953489) Journal
  
  Apart from this, doctors can be "intelligently wrong", by giving a diagnose which is not chiseled in stone and starting a treatment that would also help related illnesses. How often has your doctor said "call me when things get worse" as he has sent you home with a prescription?
  Doctors do not want 100% accuracy, as the amount of work to get the last percents right is huge and they have other patients to treat. They want accuracy that is good enough.
  
  Parent Share
  twitter facebook
  - Re:Dunning-Kruger effect (Score:5, Informative)
    
    by pz ( 113803 ) writes: on Monday November 18, 2024 @08:41AM (#64953689) Journal
    
    Part of the reason for this is that there are important concepts about being accurate and doing no harm. False positives and false negatives can be devastating. A false cancer diagnosis, for example, can ruin a patient's life, with substantial financial and psychological impact.
    There's also the related important statistical concepts of rates of occurrence. If the accuracy of a test is 99% correct with a 1% false positive rate, but the underlying rate of occurrence is very low, that 1% of false positives can lead to an overwhelming number of misdiagnoses. Not only do those carry significant unnecessary burden for the patient, they create a similar burden for the healthcare system.
    So, a doctor saying, "call me if it gets worse," is often thinking that it's very likely you have the flu, and much less likely that you have dengue fever. The conservative course will be the right one in the vast majority of the time. That idea is summed up in the saying famous within healthcare, "when you hear hooves, think horses, not zebras." Giving the patient an opportunity for re-review provides a path for treating the horse cases while providing a path to handle the zebras as well.
    An important part of the cited test, which I'd like to read, is if it presented cases with normal rates of occurrence.
    
    Parent Share
    twitter facebook
- - Re: (Score:2, Insightful)
    
    by Anonymous Coward writes:
    
    Who is this guy and why should I care?
    - Re: (Score:2)
      
      by taustin ( 171655 ) writes:
      
      Sounds to me like his middle name is "Quack Quack," and he's chief snake oil salesman for some multi-level marketing company.
      - Re: (Score:2)
        
        by AleRunner ( 4556245 ) writes:
        
        I dunno. I guess that he'd probably get short shrift from most MLM groups. They have better ethics than that. He should try somewhere more dubious, for example the US government that is to come.
    - Re: (Score:2)
      
      by ThurstonMoore ( 605470 ) writes:
      
      He used to party with RFK jr.
- Re:Dunning-Kruger effect (Score:5, Interesting)
  
  by gweihir ( 88907 ) writes: on Monday November 18, 2024 @07:41AM (#64953601)
  
  This is not the first time this has been tried. Remember IBM Watson? It had better stats that this thing here, but unfortunately when it was wrong, it would have occasionally killed the patient. Hence the application scenario was scrapped. I bet it is the same here.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    It had better stats that this thing here
    No, it didn't.
    At some cancers, it matched this. At most, it trailed wildly behind.
    but unfortunately when it was wrong, it would have occasionally killed the patient.
    You made this up. This is a risk for oncology in general.
    Hence the application scenario was scrapped.
    It most certainly fucking was not.
    IBM Watson was, and is, still used.
    However, it simply isn't that good, and is slowly being replaced by better things.
    
    Why do you misinform? Were you touched in a naughty place by an LLM?
    - Re: (Score:2)
      
      by gweihir ( 88907 ) writes:
      
      You have really not followed what happened, did you? And then you have to be an AdHominem pushing asshole about it, which to me just indicates you have nothing solid.
      Here is a reference for you: https://spectrum.ieee.org/how-... [ieee.org]
      There are many more. Watson _failed_ in the medical space and it was due to, among other things, what today is called "hallucination".
      It is also possible you are lacking another source of information I have: Several invitation-only talks with attached workshops by Watson developers a
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        You have really not followed what happened, did you?
        Sure have. Everything I said was accurate, unlike your bullshit regurgitated from headlines attached to articles that you couldn't be bothered to read.
        And then you have to be an AdHominem pushing asshole about it
        Using the phrase ad hominem to mean "insulting" doesn't make you sound smart. It makes you sound like you're trying to sound smart.
        There was no argumentation via ad hominem here, so let that phrase rest.
        which to me just indicates you have nothing solid.
        Ah, finally, an actual fallacy.
        If that's how you judge the correctness of people, I can tell you right now that you're very often wrong.
        Here is a reference for you: https://spectrum.ieee.org/how- [ieee.org]... [ieee.org]
        Cute opinion piece,
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        To rescue your bullshit claims, go ahead and show us where it has killed people
        I see your problem: You cannot read. Because I never once claimed it did kill people.
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        You're right, you didn't.
        
        What you said was:
        It had better stats that this thing here, but unfortunately when it was wrong, it would have occasionally killed the patient.
        Which is something you pulled entirely from your ass.
        Literally- you made it the fuck up.
        Now slink the fuck away.
- Re:Dunning-Kruger effect (Score:5, Informative)
  
  by phantomfive ( 622387 ) writes: on Monday November 18, 2024 @07:58AM (#64953631) Journal
  
  There were some* problems with the way the test was set up. For example,
  
  *) The test didn't seem to subtract from the score for wrong answers. The LLM couldn't have given a wildly wrong diagnosis and not have been penalized (that's what I understand from the paper).
  
  *) The test is set up to give the LLM a bit of an advantage with the Jenning's effect [go.com], that is, not all the humans were able to finish all the test cases, and they were instructed to be slow and accurate instead of being fast.
  
  You might ask, "How did this paper pass peer review? What is wrong with you?" And the answer is they weren't trying to test LLMs against doctors. They were trying to test how well doctors worked when augmented by LLMs. They had a little side note about LLMs vs doctors, but they are fully aware (and clearly state) that this doesn't mean LLMs are better than doctors.
  
  The main point is testing how well augmented doctors perform. The paper does good science (afaict) investigating this question. All the hype comes from the news article, and it is fake.
  
  *tl;dr the article is hype, the paper is good.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    Augmentation of doctors with LLMs seems obvious to me.
    Like previous attempts at doing this, trying to augment the causative understanding of doctors with ridiculously superior classification is a no-brainer, because it's the part that doctors, being human, statistically suck at.
    
    People create this mythology around human pattern matching ability- but the fact is, we're flat out terrible at objectively matching. There's little evolutionary need for science. If a false pattern match turns into any kind of be
    - Re: (Score:2)
      
      by phantomfive ( 622387 ) writes:
      
      Augmentation of doctors with LLMs seems obvious to me. Like previous attempts at doing this, trying to augment the causative understanding of doctors with ridiculously superior classification is a no-brainer, because it's the part that doctors, being human, statistically suck at.
      Ok, that's an intuition you've had that leads to a reasonable hypothesis. Sounds good.
      
      Under the scrutiny provided by the study at hand, the hypothesis is unsupported. Maybe another study will provide better results.
      - Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        Under the scrutiny provided by the study at hand, the hypothesis is unsupported. Maybe another study will provide better results.
        I agree. All it really shows is that doctors continue to suck, and we need to find a way to prop up their very human failures so that their very excellent medical understanding can shine.
        
        Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        I think most people would disagree that "doctors suck," where is that coming from?
        
        Re: (Score:2)
        
        by DamnOregonian ( 963763 ) writes:
        
        I think most people would disagree that "doctors suck,"
        Hard to argue that doctors don't have a captive audience ;)
        where is that coming from?
        It's nothing personal- it's just numbers.
        In general, MDs simply aren't very good [nih.gov] at their job.
        
        Now that's not to say "I could do better" or whatever. And certainly some excel far more than the median, but the median is quite bad. Half of all doctors are even worse than that.
        
        There's no need to worship these people. We need to invent the tools to make them better at what they do.
        
        Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        Interesting.
- Re: (Score:2)
  
  by StormReaver ( 59959 ) writes:
  
  One thing I want to know is how bad the 10% that the AI missed were....
  That's a good question. But I think a more pertinent one is, "how does a statistical likelihood of one letter following another one lead to an accurate diagnosis?" To me, the most likely answer is two fold:
  1) Lots of medical training data of diagnoses by humans. Without humans, LLMs are worthless. This is what AI proponents tend to sweep under the rug. Without continuous human output (data) to serve as LLM input, the LLMs will fall apart since LLM output cannot be used as LLM input without severe degradatio
  - Re: (Score:2)
    
    by backslashdot ( 95548 ) writes:
    
    There are dozens of ways AI can surpass humans even though humans taught it. Example 1, When a patient comes in with certain symptoms or test results the doctor may tell them don't worry its not cancer. But then eventually it does turn out to be cancer. The AI learns if a patient has certain symptoms it’s associated with cancer so it can detect it earlier. Doctors usually make the mistaken diagnosis in the early stage, but then later you can't ignore it. The doctor may be in the habit of telling a pat
  - Re: (Score:2)
    
    by ceoyoyo ( 59147 ) writes:
    
    When a result seems very mysterious, it's often a good idea to consider whether your starting assumptions are true. Often you will discover that they are not.
  - Re: (Score:2)
    
    by DamnOregonian ( 963763 ) writes:
    
    understanding is not possible for an LLM
    This is a religious argument.
    You can't back it up with any kind of objective measure.
    
    You are, otherwise correct, that current training methodologies do in fact require human output.
- Re: (Score:3)
  
  by bagofbeans ( 567926 ) writes:
  
  So the AI was 90% accurate, but most of the time doctors didn't trust it so when ahead with their own incorrect diagnosis? One thing I want to know is how bad the 10% that the AI missed were .. like major blunders or what? Also, what about the 26% that the doctors missed .. how severe was the error? Anyone read the actual study? (Yes I know it's linked, but I'm an a slashdotter.)
  Article worth a read to see the flaws. As you say, no analysis on the severity of mis-diagnosis by any party. Also, median doc e
- Doctors have used AI for decades (Score:3)
  
  by drnb ( 2434720 ) writes:
  
  There is nothing really new here. For example AI has been used to analyze medical imagery for anomalies for decades. A human lab tech might point out anomalies, and AI based analysis might point out anomalies, it's all input into the Dr's analysis.
- Re: (Score:1)
  
  by Narcocide ( 102829 ) writes:
  
  When I hear something like this about AI, the first thing I conclude is someone has heavily fudged the test to give the AI an unfair advantage, like they only chose inexperienced doctors to participate, or the test studies they were reviewing were also in the AI's training data so the real question is why it was only 90% accurate.
Study on ChatGPT-4 ... (Score:5, Insightful)

by JasterBobaMereel ( 1102861 ) writes: on Monday November 18, 2024 @05:27AM (#64953479)

The doctors were fed information about the patients that was already suitable for giving to ChatGPT ... not required to gather the information themselves
So the largest part of the job of Doctor was omitted, and replaced with data tailored for machines
The researchers gave little or no instruction on how to use ChatGPT, but then compared the results to them using it with all their ChatGPT skills ...
Study finds that people who know how to get the best out of ChatGPT use it well ... and Doctors when taken out of their normal environment do not do as well ...

Share
twitter facebook
- Re: Study on ChatGPT-4 ... (Score:4, Informative)
  
  by johnstrass1 ( 2451730 ) writes: on Monday November 18, 2024 @07:12AM (#64953575)
  
  Yes! The hardest part of a doctor is actually taking a history. Trying to get the story straight from somebody whoâ(TM)s unhappy because theyâ(TM)re in pain or had something bad happened to them or nervous takes many years of training. A more appropriate that it would be to simply have an iPad with ChatGPT and let one of those patients talk to the chat, bot and see how well they do versus talking to a doctor.
  
  Parent Share
  twitter facebook
- Re: Study on ChatGPT-4 ... (Score:4, Insightful)
  
  by flyingfsck ( 986395 ) writes: on Monday November 18, 2024 @07:23AM (#64953585)
  
  A doctor will never admit that he was wrong.
  
  Parent Share
  twitter facebook
- Re: (Score:2)
  
  by esperto ( 3521901 ) writes:
  
  I think that was the point of it being used as tool by the doctor, chatGPT itself cannot get the patient history, but, from what I understood, comparing it being used by a doctor as an assistant or being feed the history and using its result as showed that the doctors were not using (or thrusting) the tool, because in such a low N 74% to 76% means there was no difference.
  I don't think training the doctors to use it is the issue, because using a LLM is as straight forward as it gets, it is just writing it do
- Re:Study on ChatGPT-4 ... (Score:5, Interesting)
  
  by buck-yar ( 164658 ) writes: on Monday November 18, 2024 @09:20AM (#64953731)
  
  As a next token generator, these LLMs sometimes might be finicky about the exact wording of the input. The article suggested that the increased accuracy of the LLMs over real doctors was due to how the prompts were worded to the LLM "This may be explained by the sensitivity of LLM output to prompt formulation." The wording they used precipitated the output because some words or phrases are often used in specific contexts, in this situation health, that the AI ends up being led to the answer. Guessing if they took the same cases and had them described with different wording that has the same meaning, the LLM's results would differ.
  
  Parent Share
  twitter facebook
AI isn't the relevant problem here. (Score:1, Interesting)

by jd ( 1658 ) writes:

The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
That, to me, is gross malpractice.
The correct answer is not necessarily more AI, but that might well be the end result. The correct answer is to require doctors to recertify through such test cases and withdrawing a license to practice if the success rate is under 90%.
AI is, ultimately, just using differential diagnosis, because that's
- Re: (Score:3)
  
  by martin-boundary ( 547041 ) writes:
  
  I too enjoyed watching House. But just because it's always sarcoidosis doesn't mean AI are doing anything resembling differential diagnosis.
  - Re: (Score:2)
    
    by phantomfive ( 622387 ) writes:
    
    Dr House isn't really relevant here because his cases tended to be rare or unusual illnesses. The study here specifically excluded rare pathologies from the test set. The goal wasn't to test ChatGPT against doctors, the goal was to see how well doctors performed when augmented by an LLM assistant.
- Re:AI isn't the relevant problem here. (Score:4, Interesting)
  
  by ShanghaiBill ( 739463 ) writes: on Monday November 18, 2024 @06:50AM (#64953553)
  
  The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
  TFA does not contain enough information to draw that conclusion.
  The correct answer is not necessarily more AI
  AI will be part of the solution.
  TFA says that ChatGPT reduced misdiagnoses from 26% to 24%. Two percent might not seem like much, but in a $5 trillion industry, it's a lot.
  Doctors will do much better if they're trained to use AI technology. It should be incorporated into medical school curriculum.
  
  Parent Share
  twitter facebook
  - Re: (Score:3)
    
    by AleRunner ( 4556245 ) writes:
    
    Doctors will do much better if they're trained to use AI technology. It should be incorporated into medical school curriculum.
    Exactly this. There has to be a correct procedure. Probably 1) examine patient record observations into (electronic) notes 2) do the diagnosis yourself 3) feed notes and current diagnosis to AI system and get it to suggest alternatives with some kind of probabilities and links to official statements of those diagnoses 4) rethink and re-examine everything with new knowledge.
    By bringing AI in late in the process you allow the human and AI bias to be independent and ensure clean verification and training data.
- Re: (Score:2)
  
  by alvinrod ( 889928 ) writes:
  
  That seems all well and good until you have doctors refusing to take on harder cases because the likelihood of successful diagnosis is much lower and will hurt their metrics. Making anything involving humans entirely or largely metric driven is a recipe for disaster. What you wind up with is people who are best able to game the metrics and not those who are most capable of what everyone really wants but was poorly expressed as some collection of metrics.
- Re: (Score:2)
  
  by hey! ( 33014 ) writes:
  
  I think it's a case of sloppy headline writing. AI isn't better than doctors at diagnosing, it's better at guessing *from text descriptions* than doctors are. Doctors normally examine the patient, generate multiple hypotheses and then test those hypotheses with further diagnostic procedures.
What was actually evaluated. (Score:5, Insightful)

by KAdamM ( 2996395 ) writes: on Monday November 18, 2024 @05:53AM (#64953503)

In short: 50 patients were studied by real doctors in real hospitals and clinics, and they get a proper diagnosis. Whatever was written in the papers - short history of present illness, past medical history, and symptoms (e.g. temperature, pulse, skin description) - was given to other doctors and LLM. What is shows is that people, to get proper treatment, need direct contact patient with a doctor. This is what doctors are taught, and expected to do. LLM or online consultation will not replace that.

Share
twitter facebook
- Re: (Score:2)
  
  by ShanghaiBill ( 739463 ) writes:
  
  What it shows is that people, to get proper treatment, need direct contact patient with a doctor.
  The study described in TFA does not show that.
  The error rate of doctors with direct contact was not compared to those without.
  - Re: What was actually evaluated. (Score:1)
    
    by KAdamM ( 2996395 ) writes:
    
    Who gave the reference diagnosis then? The one 100% accurate?
    - Re: (Score:2)
      
      by ShanghaiBill ( 739463 ) writes:
      
      Who gave the reference diagnosis then? The one 100% accurate?
      The reference diagnosis is determined retrospectively from the patient outcome.
  - Re: (Score:2)
    
    by taustin ( 171655 ) writes:
    
    The error rate of doctors with direct contact was not compared to those without.
    The error rate of doctors with direct contact was also not compared to the AI error rate. It wasn't part of the study at all. Which makes the study pretty useless.
  - Re: (Score:2)
    
    by SoftwareArtist ( 1472499 ) writes:
    
    I just looked up the paper, and that's exactly what it shows. This study was conducted online. The doctors never saw the patients.
    And where did they get the "true" diagnosis the doctors were supposed to match? From the doctor who actually examined the patient and treated them.
- Re: (Score:1)
  
  by geekmux ( 1040042 ) writes:
  
  What is shows is that people, to get proper treatment, need direct contact patient with a doctor. This is what doctors are taught, and expected to do. LLM or online consultation will not replace that.
  Wrong. AI can be taught the entire medical history. Everything we know about medicine. Ever. And you don’t go to see a human doctor to talk to them about your diagnosis. You go to the doctor and both of you “talk” to the results of the tests you took. Which again, is something that can be automated. AI can also be taught how what “low” or “high” means when reading a blood report. Just like the human does.
  Test. Review Results. Diagnose. If a $50K car diagn
  - Re:What was actually evaluated. (Score:4, Interesting)
    
    by drinkypoo ( 153816 ) writes: <drink@hyperlogos.org> on Monday November 18, 2024 @09:46AM (#64953769) Homepage Journal
    
    Test. Review Results. Diagnose. If a $50K car diagnostic scanner can do that, I donâ(TM)t see why AI canâ(TM)t in medicine.
    A car diagnostic scanner does not simply plug in and diagnose the vehicle, except for a small subset of tests. At best they have guided diagnosis and the technician has to perform various tests. This is exactly like the scenario in which the doctor is using a software agent to assist with diagnosis because in both cases, you need a trained professional to operate the equipment and perform the final diagnosis. They have to know enough to fact-check the machine, just like I know enough to recognize when Google gives me a completely bullshit answer to a technical question. Google doesn't know, though; their answer is written as if it were correct whether it is or not. The same is true of every one of these tools.
    
    Parent Share
    twitter facebook
    - Re: (Score:1)
      
      by geekmux ( 1040042 ) writes:
      
      The car diagnostic scanner, was once untrusted too. Until it wasn’t.
      The same will eventually be true of AI. Once AI learns “high” and”low” parameters and is trained on what to do next (not unlike the highly-trained human following the expert machine), it won’t have to re-learn it. Better yet, it won’t ever forget it. Unlike human brains do.
      Your concerns, have an expiration date.
      - Re: (Score:3)
        
        by drinkypoo ( 153816 ) writes:
        
        The car diagnostic scanner, was once untrusted too. Until it wasnâ(TM)t.
        The people who know something about automotive diagnostics still don't trust it, which is how we can tell you know fuck-all about this subject.
  - Re: (Score:2)
    
    by Entrope ( 68843 ) writes:
    
    Test. Review Results. Diagnose. If a $50K car diagnostic scanner can do that, I donâ(TM)t see why AI canâ(TM)t in medicine.
    That's easy if you can run a complete set of tests and don't have false positive results from those tests. In real life, test results are ambiguous, better tests are expensive, doctors have to accept inputs like "it hurts when I do this" as a starting point, patients don't want to admit how little they exercise or how badly they eat, and so on.
    - Re: (Score:2)
      
      by drinkypoo ( 153816 ) writes:
      
      Mechanics have to do the same thing! If I run a test on the Sprinter and it says there's a short in the EGR wiring I have to figure out if there actually is a short, or if it's actually a failed ground making it look like that, or if the EGR has seized up with soot and the motor that drives it is stalling out... The whole idea that you can just plug into even a car and have the scanner spit out the answer is horse shit. The scanner is used in conjunction with the manual and it has a whole series of troubles
- Re: (Score:2)
  
  by az-saguaro ( 1231754 ) writes:
  
  You are exactly correct.
  I am responding out of some sympathy. Other responses to your post don't seem to quite get it.
  For instance:
  The study described in TFA does not show that. The error rate of doctors with direct contact was not compared to those without.
  Well, technically he is correct.
  But I think he missed what you were saying.
  The study used 105 validated cases where real doctors had made correct diagnoses which are the basis for a standard training set for computer assisted diagnosis. From those 105 standard resource cases, the authors found 50 that best fit the intent of their study, and then they whittled it down to 6 to a
- Re: (Score:2)
  
  by SoftwareArtist ( 1472499 ) writes:
  
  This is a really common flaw in studies of this sort. They present the same data to the doctor and the AI, the AI does a better job of diagnosis, and they declare, "AI is better than doctors." You see papers like that all the time. But it's not true, because doctors do much more than read case reports before diagnosing real patients. Since the AI can't talk to the patient, observe their behavior, or conduct an exam, they make it "fair" by not letting the doctor do it either.
failure to understand procedure (Score:4, Insightful)

by TimothyHollins ( 4720957 ) writes: on Monday November 18, 2024 @06:30AM (#64953529)

I don't trust these conclusions *at all*.
AI, and machine learning, as performed by computer scientists, completely miss the meaning of data and protocol.
In machine learning/AI, a computer scientist will try to achieve the highest possible AUC. This is frequently seen when a dataset of 1,000,000 tests (99% controls, 1% cases) yields the best results when predicted as ANYTHING -> CONTROL. For a doctor, the 1% cases are the difficult part, not the 99% of controls.
A doctor should operate by a hierarchy of diagnoses. If you show up at the clinic with a bleeding ass, would you like the doctor to aim for maximum prediction score (there's a 95% chance it's nothing) or would you like your doctor to ass-ume the worst and schedule a colonoscopy for you? I would rather the second option, something the AI, and the people organizing this study, completely miss.

Share
twitter facebook
- Re: (Score:3)
  
  by phantomfive ( 622387 ) writes:
  
  You don't trust the conclusions because you didn't read the paper. You should have at least read the abstract before being skeptical. The abstract in this paper is reasonably good.
Billions (Score:1)

by fluffernutter ( 1411889 ) writes:

Billion dollar industry finds results in favor of billion dollar industry. How can we trust these results?
- Re: Billions (Score:2)
  
  by flyingfsck ( 986395 ) writes:
  
  The chances of anything coming from Mars are a million to one, he said.
  - Re: (Score:2)
    
    by fluffernutter ( 1411889 ) writes:
    
    I have no idea what that means. I certainly didn't say that.
AI is just a tool (Score:2)

by SecurID-Guy ( 10125484 ) writes:

Finally, someone gets it. AI is nothing more than an excellent tool. Key point: A.I. systems should be "doctor extenders," YES! I'm very much looking forward to a future where doctors will see AI as a colleague, not an adversary. I wonder if the doctors in the study would have been open to the alternate diagnosis whether it from was an AI, or some other licensed, certified physician unknown to the test subject.
Pattern Recognition (Score:3)

by Registered Coward v2 ( 447531 ) writes: on Monday November 18, 2024 @07:58AM (#64953633)

Machines are very good at pattern recognition, if fed the right data. Humans are as well, although biases and training sometimes conflict with getting the right answer. I suspect experience (the median years in practice was 3) may have impacted the results; given the type of cases: a broad range of pathologic settings, avoiding simplistic cases with limited plausible diagnoses, and excluding exceedingly rare cases.
For me, the takeaway is, LLMs can be a useful adjunct to diagnosis by a physician to help identify a range of possible causes, not replace a physician.
The real lesson healthcare CEOs may learn is "we can replace physicians with non-physicians and increase profits because Chat-GPT is better and cheaper than physicians."

Share
twitter facebook
Would you rather? (Score:1)

by madsh ( 266758 ) writes:

get diagnosed by a doctor who have seen 4,000 patients or a computer that have seen 4,000,000 patients? Would you rather receive treatment from a robot with no sense of ethics, morale and compassion or I would like to get treatment from a compassionate human with access to a experienced robot.
- Re: (Score:3)
  
  by geekmux ( 1040042 ) writes:
  
  I would like to get treatment from a compassionate human with access to a experienced robot.
  When the compassionate human fucks up in a very human way and attempts to dismiss the life-ending mistake with a compassionate apology, society will find the legal arguments against using humans in the future, quite compelling.
  Human liability, will become all that matters from a risk mitigation perspective if we allow the current legal system to continue. And we will.
- Re: (Score:2)
  
  by SuiteSisterMary ( 123932 ) writes:
  
  The diagnosis can benefit from AI.
  The treatment can come from the human.
  The AI can help monitor the efficacy and side effects of the treatment.
  The human can determine where to go from there.
  Though I will point out that a big part of being an effective doctor, especially in areas like surgery, one, by necessity, must lose certain elements of their compassion and empathy for humans. Cutting people open and removing bits of them, for example.
- Re: (Score:2)
  
  by ufgrat ( 6245202 ) writes:
  
  My mother was a very nice pleasant person with a bit of a temper. After being ignored for 5 hours in a hospital corridor, the nurse told her she needed to get up so they could run a test. She told the nurse to go to hell.
  Nurse went to doctor, told doctor patient was being unreasonable, so they hit her with a dose of an anti-psychotic medication called "Haldol"-- which the hospital's EMR system had documented my mother as being particularly sensitive to. She had an unusual, but not unheard of, reaction to
Having dealt with doctors (Score:1)

by 50000BTU_barbecue ( 588132 ) writes:

And their mush-brained indifference and arrogance, I think a potato has a better chance at diagnosing than a doctor.
You've Got Leprocy! (Score:3)

by Joe_Dragon ( 2206452 ) writes: on Monday November 18, 2024 @09:39AM (#64953755)

You've Got Leprocy!

Share
twitter facebook
The Gambler, the Nun and the LLM (Score:2)

by Pseudonymous Powers ( 4097097 ) writes:

There was a passage in a Hemingway short story about a small team of doctors, in which an older doctor told a young doctor that he was terrible and stupid, and that he would always be a danger to his patients, and then he gave him a book containing an algorithmic guide to diagnosis and told him to use it, always. And the younger doctor found this humiliating, but he used it, because he knew that the older doctor was right.
That's where AI should be good (Score:3)

by gnasher719 ( 869701 ) writes: on Monday November 18, 2024 @10:53AM (#64953905)

What we currently call "AI" is nothing more than glorified pattern recognition. And diagnosing patients is nothing but pattern recognition plus the ability to realise that sometimes the patterns are confusing.

I would bet that a closer study shows: Much higher failure rate when patterns are less clear (for example, if you diagnose based on skin discolouration, black patients will often be misdiagnosed by less experienced doctors, and by an AI).

Share
twitter facebook
Not trustworthy without restricted training (Score:1)

by Ceallach ( 24667 ) writes:

AI has a proven track record of making shit up. Medical AI could be possible IF and ONLY if they restrict the training data to trusted and verified medical sources. Otherwise you'll get medical "opinions" from some rando on reddit or tumblr mixed into the underlying model
Glaring flaw (Score:2)

by paul_engr ( 6280294 ) writes:

Something of the effect of "...the AI read the case report..." GIGO still applies, if not more than ever.
The Wisdom of Crowds (Score:2)

by Baron_Yam ( 643147 ) writes:

Given the AI would have been trained on human diagnosis, the answers it spits out should be a sort of vague merging of all the inputs. So if you had all those doctors doing all those diagnoses together, on average they'd collectively do as well as the AI.
To me, this suggests that AI isn't the answer; what you want is an expert system assembled step-by-step by humans with clear chains of reasoning that are easy to update as new medical knowledge becomes available.
If you want to put an AI chatbot front end o
- Re: (Score:2)
  
  by ceoyoyo ( 59147 ) writes:
  
  To me, this suggests that AI isn't the answer; what you want is an expert system assembled step-by-step by humans with clear chains of reasoning that are easy to update as new medical knowledge becomes available.
  You've descibed an AI expert system. They exist, and they work okay, but they can be brittle. Setting up those "clear chains of reasoning" by hand is a big job and it's easy to miss something.
  By "AI" I assume you actually mean something like GPT 4, a different kind of AI system, which is what they u
The one sector it really needs to be pushed (Score:2)

by SuperDre ( 982372 ) writes:

Medical is one of the most important sectors where we need a big push in advancement of AI as human doctors tend to make a lot of mistakes and misdiagnose. Best thing for our healthsector would be AI being able to replace human doctors (who themselves say they are overworked and swamped these days), this would be step one in lowering the cost, next would be to add surgical robots to AI to replace surgeons and make operating much more affordable. In the end when all mixed together it would be like you walk i
poobah (Score:2)

by groobly ( 6155920 ) writes:

Did the chatbot look at and interview the patient? Or was this 100% on paper?
Even the authors didn't think LLMs should diagnose (Score:1)

by Mufm ( 890904 ) writes:

"Results of this study should not be interpreted to indicate that LLMs should be used for diagnosis autonomously without physician oversight." .... "Furthermore, this study was acontextual, and clinicians’ understanding of the clinical environment is fundamental for high-quality decision-making. While early studies show that LLMs might effectively collect and summarize patient information, these capabilities need to be studied more thoroughly."
Re: (Score:2)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
Now throw something novel or undiscovered at it (Score:2)

by Torodung ( 31985 ) writes:

Sure. Most of this is settled science and humans sometimes overvalue their intuition and expertise.
Now throw something new at it. Have it discover a new diagnosis ("I don't know what this is"), a systemic autoimmune syndrome, a form of cancer never seen before.
I can guarantee you one thing it never said. "I don't know what this is."
Driving with ChatGPT is driving in the rear-view mirror. It's great for middle of the bell curve stuff, certainly better than humans, but it will never make progress.
Physicians are selected for memorization (Score:2)

by damn_registrars ( 1103043 ) writes:

When I was in undergrad, I pondered med school for a year or so. As I met more pre-med and medical students I realized that path was not for me. AMCAS is deliberately filtering for undergrads who excel at memorization / regurgitation, while unfortunately making life more difficult for scientists, engineers, and other students who actually want to know how things work. This leads to physicians who memorize very specific material and quickly toss aside previously memorized material.

A great example is O
- Not even a broken arm IME (Score:5, Interesting)
  
  by ihadafivedigituid ( 8391795 ) writes: on Monday November 18, 2024 @08:34AM (#64953681)
  
  I had an ER doc at a major hospital misdiagnose my very broken arm (left radius, above the wrist). Not only was my arm broken, but the recently installed steel plate that reduced a previous fracture in the same spot was broken, too. Yes, there was and is a large surgical scar so it's not like any of this was subtle--and I told him it was broken (along with the mushed pinky finger on the same side, which he did not miss).
  
  I had to argue with him to get an X-ray. When radiology came back showing the obvious mess in there, he defended himself because I reported no little pain in general and didn't jump when he manipulated the area.
  
  I learned a valuable lesson at only 19: I'm on my own. Fast forward 35 years, and another doctor almost killed me by grossly over prescribing (6X recommended) a drug that it turns out I was allergic to. If I wasn't in California, I'd have sued his ass off. But two attorneys set me straight: California's somewhat recent tort reform made it next to impossible to succeed in most malpractice suits. So I learned another lesson: I'm on my own, and no one gives a shit.
  
  Parent Share
  twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Dunning-Kruger effect (Score:5, Insightful)

Re:Dunning-Kruger effect (Score:5, Insightful)

Re:Dunning-Kruger effect (Score:5, Insightful)

Re: (Score:2, Informative)

Re: (Score:2)

Re: Dunning-Kruger effect (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Dunning-Kruger effect (Score:5, Insightful)

Re: (Score:3, Interesting)

Re: (Score:2)

Re: (Score:3)

Re:Dunning-Kruger effect (Score:4, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Dunning-Kruger effect (Score:5, Insightful)

Re:Dunning-Kruger effect (Score:5, Informative)

Re: (Score:2, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Dunning-Kruger effect (Score:5, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Dunning-Kruger effect (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Doctors have used AI for decades (Score:3)

Re: (Score:1)

Study on ChatGPT-4 ... (Score:5, Insightful)

Re: Study on ChatGPT-4 ... (Score:4, Informative)

Re: Study on ChatGPT-4 ... (Score:4, Insightful)

Re: (Score:2)

Re:Study on ChatGPT-4 ... (Score:5, Interesting)

AI isn't the relevant problem here. (Score:1, Interesting)

Re: (Score:3)

Re: (Score:2)

Re:AI isn't the relevant problem here. (Score:4, Interesting)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

What was actually evaluated. (Score:5, Insightful)

Re: (Score:2)

Re: What was actually evaluated. (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1)

Re:What was actually evaluated. (Score:4, Interesting)

Re: (Score:1)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

failure to understand procedure (Score:4, Insightful)

Re: (Score:3)

Billions (Score:1)

Re: Billions (Score:2)

Re: (Score:2)

AI is just a tool (Score:2)

Pattern Recognition (Score:3)

Would you rather? (Score:1)

Re: (Score:3)