ChatGPT-4 Beat Doctors at Diagnosing Illness, Study Finds (nytimes.com)
Dr. Adam Rodman, a Boston-based internal medicine expert, helped design a study testing 50 licensed physicians to see whether ChatGPT improved their diagnoses, reports the New York Times. The results? "Doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot.
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when the chatbot suggested a potentially better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when a chatbot potentially suggests a better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
Dunning-Kruger effect (Score:2)
So the AI was 90% accurate, but most of the time doctors didn't trust it so went ahead with their own incorrect diagnosis? One thing I want to know is how bad the 10% that the AI missed were .. like major blunders or what? Also, what about the 26% that the doctors missed .. how severe was the error? Anyone read the actual study? (Yes I know it's linked, but I'm a slashdotter.)
Re:Dunning-Kruger effect (Score:4, Insightful)
In a binary classification task, there are at least two numbers that should be reported: the false positive and false negative rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
The point is that comparisons of classifiers (human doctors or AI) are impossible on a linear scale, and anyone who reports results on a linear scale is biased. The math says so.
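To make that concrete, here is a minimal sketch in plain Python, with hypothetical labels, of two classifiers that get the same single "score" but behave very differently:

```python
def confusion(truth, pred):
    """Return (tp, fp, fn, tn) counts for binary labels (1 = disease, 0 = healthy)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def report(truth, pred):
    tp, fp, fn, tn = confusion(truth, pred)
    accuracy  = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels: four sick patients, six healthy ones.
truth  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # one missed case, one false alarm
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two missed cases, no false alarms

print(report(truth, pred_a))  # (0.8, 0.75, 0.75)
print(report(truth, pred_b))  # (0.8, 1.0, 0.5) -- same accuracy, very different classifier
```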
Re: (Score:2)
In a binary classification task, there are at least two numbers that should be reported: the false positive and false negative rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
This makes no sense. The answers were graded by a panel of expert doctors. It wasn't a binary classification task; there were multiple answers to each question.
Re: (Score:2)
In a multi-class problem (say N possible answers) there is an NxN confusion matrix, so even more numbers must be reported to compare two classifiers. Also, a multi-class problem can always be represented as a sequence of binary classifications, so there is really no loss of generality.
In all cases, accuracy alone is not a useful way to compare two N-way classifiers or even rank a collection of them.
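A rough sketch of the multi-class case, again in plain Python with hypothetical diagnosis labels:

```python
from collections import Counter

def confusion_matrix(truth, pred, classes):
    """N x N counts: rows are the true class, columns the predicted class."""
    counts = Counter(zip(truth, pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

# Hypothetical diagnoses for six cases.
classes = ["pneumonia", "CHF", "PE"]
truth   = ["pneumonia", "pneumonia", "CHF", "CHF", "PE", "PE"]
pred    = ["pneumonia", "CHF",       "CHF", "CHF", "PE", "CHF"]

for row in confusion_matrix(truth, pred, classes):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [0, 1, 1]
# Each class can also be scored one-vs-rest, which reduces the problem to
# a sequence of binary classifications as noted above.
```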
Re: (Score:2)
Re: (Score:2)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
As in "70 YO male, lifetime smoker, presents with a persistent cough and severe shortness of breath" = "gangrenous foot, immediate amputation required to save patient"
Re: (Score:2)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
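A back-of-the-envelope version of that argument, using the 90% and 76% figures from the summary (a rough sketch in plain Python; the percentages are the only inputs):

```python
# Percentages from the summary: ChatGPT alone 90% right, doctors + ChatGPT 76% right.
llm_error      = 100 - 90   # ChatGPT alone got 10% of cases wrong
assisted_error = 100 - 76   # doctors using ChatGPT got 24% of cases wrong

# If the doctors repeated every plausible-looking ChatGPT mistake,
# they added this much error of their own on top of it:
extra_if_repeated = assisted_error - llm_error   # 14

# If every ChatGPT mistake was obvious nonsense they should have caught,
# then the whole 24% was error they introduced themselves:
own_error_if_obvious = assisted_error            # 24

print(extra_if_repeated, own_error_if_obvious)   # 14 24
```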
Re: (Score:1)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
Re: (Score:3)
Apart from this, doctors can be "intelligently wrong", giving a diagnosis which is not chiseled in stone and starting a treatment that would also help related illnesses. How often has your doctor said "call me if things get worse" as he sent you home with a prescription?
Doctors do not want 100% accuracy, as the amount of work to get the last few percent right is huge and they have other patients to treat. They want accuracy that is good enough.
Re: (Score:1)
Re: (Score:2)
This is not the first time this has been tried. Remember IBM Watson? It had better stats than this thing here, but unfortunately when it was wrong, it would occasionally have killed the patient. Hence the application scenario was scrapped. I bet it is the same here.
Re: (Score:2)
*) The test didn't seem to subtract from the score for wrong answers. The LLM could have given a wildly wrong diagnosis and not been penalized for it (that's what I understand from the paper).
*) The test is set up to give the LLM a bit of an advantage via the Jennings effect [go.com]; that is, not all the humans were able to finish all the test cases, and they were instructed to be slow and accurate instead of fast.
You might ask,
Study on ChatGPT-4 ... (Score:5, Insightful)
The doctors were fed information about the patients that was already suitable for giving to ChatGPT ... they were not required to gather the information themselves
So the largest part of a doctor's job was omitted, and replaced with data tailored for machines
The researchers gave the doctors little or no instruction on how to use ChatGPT, but then compared their results to the researchers using it with all their own ChatGPT skills ...
Study finds that people who know how to get the best out of ChatGPT use it well ... and doctors, when taken out of their normal environment, do not do as well ...
Re: Study on ChatGPT-4 ... (Score:1)
Re: Study on ChatGPT-4 ... (Score:2)
Re: (Score:2)
I think that was the point of it being used as a tool by the doctor. ChatGPT itself cannot take the patient history, but, from what I understood, comparing a doctor using it as an assistant against feeding it the history directly showed that the doctors were not using (or trusting) the tool, because with such a low N, 74% versus 76% means there was no difference.
I don't think training the doctors to use it is the issue, because using an LLM is as straightforward as it gets, it is just writing it do
AI isn't the relevant problem here. (Score:1)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
That, to me, is gross malpractice.
The correct answer is not necessarily more AI, but that might well be the end result. The correct answer is to require doctors to recertify through such test cases and withdrawing a license to practice if the success rate is under 90%.
AI is, ultimately, just using differential diagnosis, because that's
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
TFA does not contain enough information to draw that conclusion.
The correct answer is not necessarily more AI
AI will be part of the solution.
TFA says that ChatGPT reduced misdiagnoses from 26% to 24%. Two percent might not seem like much, but in a $5 trillion industry, it's a lot.
Doctors will do much better if they're trained to use AI technology. It should be incorporated into the medical school curriculum.
What was actually evaluated. (Score:4, Insightful)
Re: (Score:1)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor.
The study described in TFA does not show that.
The error rate of doctors with direct contact was not compared to those without.
Re: What was actually evaluated. (Score:1)
Re: (Score:2)
Who gave the reference diagnosis then? The one 100% accurate?
The reference diagnosis is determined retrospectively from the patient outcome.
Re: (Score:2)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor. This is what doctors are taught, and expected to do. An LLM or online consultation will not replace that.
Wrong. AI can be taught the entire history of medicine. Everything we know about medicine. Ever. And you don’t go to see a human doctor to talk to them about your diagnosis. You go to the doctor and both of you “talk” to the results of the tests you took. Which, again, is something that can be automated. AI can also be taught what “low” or “high” means when reading a blood report. Just like the human does.
Test. Review Results. Diagnose. If a $50K car diagn
failure to understand procedure (Score:4, Insightful)
I don't trust these conclusions *at all*.
AI, and machine learning, as performed by computer scientists, completely miss the meaning of data and protocol.
In machine learning/AI, a computer scientist will try to maximize a single headline score, such as accuracy or AUC. This is frequently seen when a dataset of 1,000,000 tests (99% controls, 1% cases) yields the best accuracy when the classifier simply predicts ANYTHING -> CONTROL. For a doctor, the 1% of cases are the difficult part, not the 99% of controls.
A doctor should operate by a hierarchy of diagnoses. If you show up at the clinic with a bleeding ass, would you like the doctor to aim for the maximum prediction score (there's a 95% chance it's nothing) or would you like your doctor to ass-ume the worst and schedule a colonoscopy for you? I'd rather have the second option, something the AI, and the people organizing this study, completely miss.
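A tiny sketch of that imbalance problem, in plain Python using the hypothetical 1,000,000-test numbers above, showing why the headline score can be meaningless:

```python
# Hypothetical dataset from the example above: 1,000,000 tests, 99% controls, 1% cases.
n_cases    = 10_000
n_controls = 990_000
truth = [1] * n_cases + [0] * n_controls

# The degenerate "ANYTHING -> CONTROL" classifier: predict healthy for everyone.
pred = [0] * len(truth)

accuracy = sum(1 for t, p in zip(truth, pred) if t == p) / len(truth)
recall   = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1) / n_cases

print(accuracy)  # 0.99 -- looks impressive
print(recall)    # 0.0  -- yet every single sick patient is missed
```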
Re: (Score:2)
Billions (Score:2)
Re: Billions (Score:2)
AI is just a tool (Score:1)
Pattern Recognition (Score:2)
For me, the takeaway is, LLMs can be a useful adjunct to diagnosis by a physician to help ide