ChatGPT-4 Beat Doctors at Diagnosing Illness, Study Finds (nytimes.com)
Dr. Adam Rodman, a Boston-based internal medicine expert, helped design a study testing 50 licensed physicians to see whether ChatGPT improved their diagnoses, reports the New York Times. The results? "Doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot.
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when the chatbot suggested a potentially better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when a chatbot potentially suggests a better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
Dunning-Kruger effect (Score:2)
So the AI was 90% accurate, but most of the time doctors didn't trust it so went ahead with their own incorrect diagnosis? One thing I want to know is how bad the 10% that the AI missed were .. like major blunders or what? Also, what about the 26% that the doctors missed .. how severe was the error? Anyone read the actual study? (Yes I know it's linked, but I'm a slashdotter.)
Re:Dunning-Kruger effect (Score:4, Insightful)
In a binary classification task, there are at least two numbers that should be reported: the false positive and false negative rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
The point is that comparisons of classifiers (human doctors or AI) are impossible on a linear scale, and anyone who reports results on a linear scale is biased. The math says so.
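To make that concrete, here is a minimal sketch in plain Python, with hypothetical labels, of two classifiers that get the same single "score" but behave very differently:

```python
def confusion(truth, pred):
    """Return (tp, fp, fn, tn) counts for binary labels (1 = disease, 0 = healthy)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def report(truth, pred):
    tp, fp, fn, tn = confusion(truth, pred)
    accuracy  = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels: four sick patients, six healthy ones.
truth  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # one missed case, one false alarm
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two missed cases, no false alarms

print(report(truth, pred_a))  # (0.8, 0.75, 0.75)
print(report(truth, pred_b))  # (0.8, 1.0, 0.5) -- same accuracy, very different classifier
```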
Re: (Score:2)
In a binary classification task, there are at least two numbers that should be reported: the false positive and false negative rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
This makes no sense. The answers were graded by a panel of expert doctors. It wasn't a binary classification task; there were multiple answers to each question.
Re: (Score:2)
In a multi-class problem (say N possible answers) there is an NxN confusion matrix, so even more numbers must be reported to compare two classifiers. Also, a multi-class problem can always be represented as a sequence of binary classifications, so there is really no loss of generality.
In all cases, accuracy alone is not a useful way to compare two N-way classifiers or even rank a collection of them.
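A rough sketch of the multi-class case, again in plain Python with hypothetical diagnosis labels:

```python
from collections import Counter

def confusion_matrix(truth, pred, classes):
    """N x N counts: rows are the true class, columns the predicted class."""
    counts = Counter(zip(truth, pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

# Hypothetical diagnoses for six cases.
classes = ["pneumonia", "CHF", "PE"]
truth   = ["pneumonia", "pneumonia", "CHF", "CHF", "PE", "PE"]
pred    = ["pneumonia", "CHF",       "CHF", "CHF", "PE", "CHF"]

for row in confusion_matrix(truth, pred, classes):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [0, 1, 1]
# Each class can also be scored one-vs-rest, which reduces the problem to
# a sequence of binary classifications as noted above.
```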
Re: (Score:2)
Re: (Score:2)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
As in "70 YO male, lifetime smoker, presents with a persistent cough and severe shortness of breath" = "gangrenous foot, immediate amputation required to save patient"
Re: (Score:2)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
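A back-of-the-envelope version of that argument, using the 90% and 76% figures from the summary (a rough sketch in plain Python; the percentages are the only inputs):

```python
# Percentages from the summary: ChatGPT alone 90% right, doctors + ChatGPT 76% right.
llm_error      = 100 - 90   # ChatGPT alone got 10% of cases wrong
assisted_error = 100 - 76   # doctors using ChatGPT got 24% of cases wrong

# If the doctors repeated every plausible-looking ChatGPT mistake,
# they added this much error of their own on top of it:
extra_if_repeated = assisted_error - llm_error   # 14

# If every ChatGPT mistake was obvious nonsense they should have caught,
# then the whole 24% was error they introduced themselves:
own_error_if_obvious = assisted_error            # 24

print(extra_if_repeated, own_error_if_obvious)   # 14 24
```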
Re: (Score:1)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
Re: (Score:3)
Apart from this, doctors can be "intelligently wrong", giving a diagnosis which is not chiseled in stone and starting a treatment that would also help related illnesses. How often has your doctor said "call me if things get worse" as he sent you home with a prescription?
Doctors do not want 100% accuracy, as the amount of work to get the last few percent right is huge and they have other patients to treat. They want accuracy that is good enough.
Re: (Score:1)
Re: (Score:2)
This is not the first time this has been tried. Remember IBM Watson? It had better stats than this thing here, but unfortunately when it was wrong, it would occasionally have killed the patient. Hence the application scenario was scrapped. I bet it is the same here.
Re: (Score:2)
*) The test didn't seem to subtract from the score for wrong answers. The LLM could have given a wildly wrong diagnosis and not been penalized for it (that's what I understand from the paper).
*) The test is set up to give the LLM a bit of an advantage via the Jennings effect [go.com]; that is, not all the humans were able to finish all the test cases, and they were instructed to be slow and accurate instead of fast.
You might ask,
Study on ChatGPT-4 ... (Score:5, Insightful)
The doctors were fed information about the patients that was already suitable for giving to ChatGPT ... they were not required to gather the information themselves
So the largest part of a doctor's job was omitted, and replaced with data tailored for machines
The researchers gave the doctors little or no instruction on how to use ChatGPT, but then compared their results to the researchers using it with all their own ChatGPT skills ...
Study finds that people who know how to get the best out of ChatGPT use it well ... and doctors, when taken out of their normal environment, do not do as well ...
Re: Study on ChatGPT-4 ... (Score:1)
Re: Study on ChatGPT-4 ... (Score:2)
Re: (Score:2)
I think that was the point of it being used as a tool by the doctor. ChatGPT itself cannot take the patient history, but, from what I understood, comparing a doctor using it as an assistant against feeding it the history directly showed that the doctors were not using (or trusting) the tool, because with such a low N, 74% versus 76% means there was no difference.
I don't think training the doctors to use it is the issue, because using an LLM is as straightforward as it gets, it is just writing it do
AI isn't the relevant problem here. (Score:1)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
That, to me, is gross malpractice.
The correct answer is not necessarily more AI, but that might well be the end result. The correct answer is to require doctors to recertify through such test cases and withdrawing a license to practice if the success rate is under 90%.
AI is, ultimately, just using differential diagnosis, because that's
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
TFA does not contain enough information to draw that conclusion.
The correct answer is not necessarily more AI
AI will be part of the solution.
TFA says that ChatGPT reduced misdiagnoses from 26% to 24%. Two percent might not seem like much, but in a $5 trillion industry, it's a lot.
Doctors will do much better if they're trained to use AI technology. It should be incorporated into the medical school curriculum.
What was actually evaluated. (Score:4, Insightful)
Re: (Score:1)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor.
The study described in TFA does not show that.
The error rate of doctors with direct contact was not compared to those without.
Re: What was actually evaluated. (Score:1)
Re: (Score:2)
Who gave the reference diagnosis then? The one 100% accurate?
The reference diagnosis is determined retrospectively from the patient outcome.
Re: (Score:2)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor. This is what doctors are taught, and expected to do. An LLM or online consultation will not replace that.
Wrong. AI can be taught the entire history of medicine. Everything we know about medicine. Ever. And you don’t go to see a human doctor to talk to them about your diagnosis. You go to the doctor and both of you “talk” to the results of the tests you took. Which, again, is something that can be automated. AI can also be taught what “low” or “high” means when reading a blood report. Just like the human does.
Test. Review Results. Diagnose. If a $50K car diagn
failure to understand procedure (Score:4, Insightful)
I don't trust these conclusions *at all*.
AI, and machine learning, as performed by computer scientists, completely miss the meaning of data and protocol.
In machine learning/AI, a computer scientist will try to maximize a single headline score, such as accuracy or AUC. This is frequently seen when a dataset of 1,000,000 tests (99% controls, 1% cases) yields the best accuracy when the classifier simply predicts ANYTHING -> CONTROL. For a doctor, the 1% of cases are the difficult part, not the 99% of controls.
A doctor should operate by a hierarchy of diagnoses. If you show up at the clinic with a bleeding ass, would you like the doctor to aim for the maximum prediction score (there's a 95% chance it's nothing) or would you like your doctor to ass-ume the worst and schedule a colonoscopy for you? I'd rather have the second option, something the AI, and the people organizing this study, completely miss.
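A tiny sketch of that imbalance problem, in plain Python using the hypothetical 1,000,000-test numbers above, showing why the headline score can be meaningless:

```python
# Hypothetical dataset from the example above: 1,000,000 tests, 99% controls, 1% cases.
n_cases    = 10_000
n_controls = 990_000
truth = [1] * n_cases + [0] * n_controls

# The degenerate "ANYTHING -> CONTROL" classifier: predict healthy for everyone.
pred = [0] * len(truth)

accuracy = sum(1 for t, p in zip(truth, pred) if t == p) / len(truth)
recall   = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1) / n_cases

print(accuracy)  # 0.99 -- looks impressive
print(recall)    # 0.0  -- yet every single sick patient is missed
```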
Re: (Score:2)
Billions (Score:2)
Re: Billions (Score:2)
AI is just a tool (Score:1)
Pattern Recognition (Score:2)
For me, the takeaway is, LLMs can be a useful adjunct to diagnosis by a physician to help ide