ChatGPT-4 Beat Doctors at Diagnosing Illness, Study Finds (nytimes.com)
Dr. Adam Rodman, a Boston-based internal medicine expert, helped design a study testing 50 licensed physicians to see whether ChatGPT improved their diagnoses, reports the New York Times. The results? "Doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot.
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they had made, even when a chatbot suggested a potentially better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
"And, to the researchers' surprise, ChatGPT alone outperformed the doctors." [ChatGPT-4] scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot's superior performance. It unveiled doctors' sometimes unwavering belief in a diagnosis they made, even when a chatbot potentially suggests a better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems' ability to solve complex diagnostic problems and offer explanations for their diagnoses. A.I. systems should be "doctor extenders," Dr. Rodman said, offering valuable second opinions on diagnoses.
"The results were similar across subgroups of different training levels and experience with the chatbot," the study concludes. "These results suggest that access alone to LLMs will not improve overall physician diagnostic reasoning in practice.
"These findings are particularly relevant now that many health systems offer Health Insurance Portability and Accountability Act-compliant chatbots that physicians can use in clinical settings, often with no to minimal training on how to use these tools."
Dunning-Kruger effect (Score:2)
So the AI was 90% accurate, but most of the time the doctors didn't trust it and went ahead with their own incorrect diagnosis? One thing I want to know is how bad the 10% that the AI missed were ... like major blunders or what? Also, what about the 26% that the doctors missed ... how severe were those errors? Anyone read the actual study? (Yes, I know it's linked, but I'm a slashdotter.)
Re:Dunning-Kruger effect (Score:4, Insightful)
In a binary classification task, there are two numbers that should be reported, the false positive and true positive rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
The point is that comparisons of classifiers (human doctors or AI) are impossible on a single linear scale, and anyone who reports results on a linear scale is biased. The math says so.
Re: (Score:3)
In a binary classification task, there are two numbers that should be reported, the false positive and true positive rates, or alternatively recall and precision, or alternatively the confusion matrix, etc.
This makes no sense. The answers were graded by a panel of expert doctors. It wasn't a binary classification task, there were multiple answers to each question.
Re: (Score:2, Interesting)
In a multiple-class problem (say N possible answers) there is an NxN confusion matrix, so even more numbers must be reported to compare two classifiers. Also, a multiple-class problem can always be represented as a sequence of binary classifications, so there is really no loss of generality.
In all cases, accuracy alone is not a useful way to compare two N-way classifiers or even rank a collection of them.
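A minimal sketch in Python of the point being made here (made-up labels, nothing from the study): two classifiers with identical accuracy can have very different confusion matrices, so a single linear score hides exactly the comparison being asked for. The scikit-learn metric calls are standard; everything else is illustrative.

```python
# Two toy classifiers with the same accuracy but very different error profiles.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 cases, 6 controls (made up)

# Classifier A misses two sick patients (false negatives).
y_pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Classifier B flags two healthy patients instead (false positives).
y_pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

for name, y_pred in [("A", y_pred_a), ("B", y_pred_b)]:
    print(name,
          "accuracy:", accuracy_score(y_true, y_pred),     # 0.8 for both
          "precision:", precision_score(y_true, y_pred),   # 1.0 vs ~0.67
          "recall:", recall_score(y_true, y_pred))          # 0.5 vs 1.0
    print(confusion_matrix(y_true, y_pred))
```

Both score 80% "accuracy," yet one misses half the sick patients and the other sends healthy people for follow-up, which is exactly the distinction a single percentage erases.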
Re: Dunning-Kruger effect (Score:3)
How is it not a classifier? It takes a bunch of encoded information about signs and symptoms and then attempts to identify the condition associated with them. Taking a large amount of fuzzy information and telling you one (or a few) things you could be looking at is pretty much exactly the definition of a classifier.
Re: (Score:3)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
As in "70 YO male, lifetime smoker, presents with a persistent cough and severe shortness of breath" = "gangrenous foot, immediate amputation required to save patient"
Re: (Score:3)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
Re: (Score:2)
Also, what percentage of the 10% were blazingly wrong bull**** answers, AKA "hallucinations"?
If they were, then that makes the human doctors look even worse.
If the incorrect ChatGPT diagnoses were reasonable, the doctors likely made the same errors, and got an additional 14% wrong.
But if the incorrect ChatGPT diagnoses were blazing wrong bull****, the doctors should've easily corrected them, and got an additional 24% wrong.
Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
Re: (Score:3)
Yeah, it only has to be better than the humans ... which seems to be a lower bar than expected.
It is a low bar indeed. So many people look at doctors as infallible. Meh. They are just people, and I'd venture a guess that the popular meme of their infallibility leads many to believe that any decision they make is correct.
I seldom go to see doctors, and as an example, my last visit during the plague year was for cellulitis. The doctor was apparently miffed that I had self-diagnosed and wanted to prove to me it was something else. Then he was a bit more agitated when my diagnosis turned out to be correct.
Ego based diagno
Re: (Score:3)
You have a lot of things right there, especially understanding that a doctor is a human and that you are responsible for your own body. That means you are trying to use them as an expert to get advice rather than expecting them to fix you.
I think a useful approach for IT guys going to the doctor is to think about what you want in a good bug report. You want the full information, and you don't want any suggested diagnosis until you've heard everything. Someone comes to you with their disk full, you don't wan
Re: (Score:2)
If they were, then that makes the human doctors look even worse.
You are kind of assuming that ChatGPT only improved the human doctors. It could be that there were some diagnoses the human doctors got right, but ChatGPT provided a convincing but completely wrong justification for a different diagnosis. I also wouldn't rule out an uncanny-valley effect. Probably the doctors quite quickly understood that the AI doesn't know what it's talking about and just spouts from a kind of hidden script. They came to actively distrust the opinion of the AI and find it off-putting.
In a r
Re:Dunning-Kruger effect (Score:4, Insightful)
Apart from this, doctors can be "intelligently wrong," giving a diagnosis which is not chiseled in stone and starting a treatment that would also help related illnesses. How often has your doctor said "call me when things get worse" as he sent you home with a prescription?
Doctors do not want 100% accuracy, as the amount of work to get the last few percent right is huge and they have other patients to treat. They want accuracy that is good enough.
Re:Dunning-Kruger effect (Score:5, Informative)
Part of the reason for this is that there are important concepts about being accurate and doing no harm. False positives and false negatives can be devastating. A false cancer diagnosis, for example, can ruin a patient's life, with substantial financial and psychological impact.
There are also related statistical concepts around rates of occurrence. If a test is 99% accurate with a 1% false positive rate, but the underlying rate of occurrence is very low, that 1% of false positives can produce an overwhelming number of misdiagnoses. Not only do those carry a significant unnecessary burden for the patient, they create a similar burden for the healthcare system.
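A quick back-of-the-envelope sketch of that base-rate point, with purely illustrative numbers (not taken from the study or from any real test):

```python
# Illustrative only: a 99%-sensitive test with a 1% false positive rate,
# applied to a condition that only 0.1% of the population actually has.
population  = 1_000_000
prevalence  = 0.001    # 0.1% truly have the condition
sensitivity = 0.99     # true positive rate
fp_rate     = 0.01     # false positive rate among the healthy

sick            = population * prevalence          # 1,000 people
healthy         = population - sick                # 999,000 people
true_positives  = sick * sensitivity               # ~990
false_positives = healthy * fp_rate                # ~9,990

ppv = true_positives / (true_positives + false_positives)
print(f"Positive predictive value: {ppv:.1%}")     # roughly 9% -- most positives are false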
So, a doctor saying, "call me if it gets worse," is often thinking that it's very likely you have the flu and much less likely that you have dengue fever. The conservative course will be the right one the vast majority of the time. That idea is summed up in a saying famous within healthcare: "when you hear hooves, think horses, not zebras." Giving the patient an opportunity for re-review provides a path for treating the horse cases while still handling the zebras.
An important part of the cited test, which I'd like to read, is whether it presented cases with realistic rates of occurrence.
Re: (Score:2)
Sounds to me like his middle name is "Quack Quack," and he's chief snake oil salesman for some multi-level marketing company.
Re: (Score:2)
I dunno. I guess that he'd probably get short shrift from most MLM groups. They have better ethics than that. He should try somewhere more dubious, for example the US government that is to come.
Re: (Score:2)
He used to party with RFK jr.
Re: (Score:3)
This is not the first time this has been tried. Remember IBM Watson? It had better stats than this thing here, but unfortunately, when it was wrong, it would occasionally have killed the patient. Hence the application scenario was scrapped. I bet it is the same here.
Re:Dunning-Kruger effect (Score:5, Informative)
*) The test didn't seem to subtract from the score for wrong answers. The LLM could have given a wildly wrong diagnosis and not have been penalized (that's what I understand from the paper).
*) The test is set up to give the LLM a bit of an advantage via the Jennings effect [go.com]; that is, not all the humans were able to finish all the test cases, and they were instructed to be slow and accurate instead of fast.
You might ask, "How did this paper pass peer review? What is wrong with you?" And the answer is they weren't trying to test LLMs against doctors. They were trying to test how well doctors worked when augmented by LLMs. They had a little side note about LLMs vs doctors, but they are fully aware (and clearly state) that this doesn't mean LLMs are better than doctors.
The main point is testing how well augmented doctors perform. The paper does good science (afaict) investigating this question. All the hype comes from the news article, and it is fake.
*tl;dr the article is hype, the paper is good.
Re: (Score:2)
One thing I want to know is how bad the 10% that the AI missed were....
That's a good question. But I think a more pertinent one is, "how does the statistical likelihood of one letter following another lead to an accurate diagnosis?" To me, the most likely answer is twofold:
1) Lots of medical training data of diagnoses by humans. Without humans, LLMs are worthless. This is what AI proponents tend to sweep under the rug. Without continuous human output (data) to serve as LLM input, the LLMs will fall apart, since LLM output cannot be used as LLM input without severe degradatio
Re: (Score:2)
The article is worth a read to see the flaws. As you say, there is no analysis of the severity of mis-diagnosis by any party. Also, median doc e
Study on ChatGPT-4 ... (Score:5, Insightful)
The doctors were fed information about the patients that was already suitable for giving to ChatGPT ... they were not required to gather the information themselves.
So the largest part of the job of a doctor was omitted and replaced with data tailored for machines.
The researchers gave the doctors little or no instruction on how to use ChatGPT, but then compared their results against the researchers using it with all of their own ChatGPT skills ...
Study finds that people who know how to get the best out of ChatGPT use it well ... and doctors, when taken out of their normal environment, do not do as well ...
Re: (Score:2)
I think that was the point of it being used as a tool by the doctor. ChatGPT itself cannot take the patient history but, from what I understood, comparing it being used by a doctor as an assistant against being fed the history directly showed that the doctors were not using (or trusting) the tool, because with such a low N, 74% versus 76% means there was no difference.
I don't think training the doctors to use it is the issue, because using an LLM is as straightforward as it gets, it is just writing it do
AI isn't the relevant problem here. (Score:1, Interesting)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
That, to me, is gross malpractice.
The correct answer is not necessarily more AI, but that might well be the end result. The correct answer is to require doctors to recertify through such test cases and to withdraw the license to practice if the success rate is under 90%.
AI is, ultimately, just using differential diagnosis, because that's
Re: (Score:3)
The problem is that doctors are making elementary errors, failing to verify, and putting ego and large numbers of consultations a day over and above the wellbeing of patients.
TFA does not contain enough information to draw that conclusion.
The correct answer is not necessarily more AI
AI will be part of the solution.
TFA says that ChatGPT reduced misdiagnoses from 26% to 24%. Two percent might not seem like much, but in a $5 trillion industry, it's a lot.
Doctors will do much better if they're trained to use AI technology. It should be incorporated into medical school curriculum.
Re: (Score:2)
Doctors will do much better if they're trained to use AI technology. It should be incorporated into medical school curriculum.
Exactly this. There has to be a correct procedure. Probably: 1) examine the patient and record observations into (electronic) notes, 2) do the diagnosis yourself, 3) feed the notes and current diagnosis to the AI system and get it to suggest alternatives with some kind of probabilities and links to official statements of those diagnoses, 4) rethink and re-examine everything with the new knowledge.
By bringing AI in late in the process you allow the human and AI bias to be independent and ensure clean verification and training data.
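A minimal sketch of that ordering, assuming Python; `ask_llm`, the prompt wording, and the `Encounter` fields are all placeholders for whatever HIPAA-compliant chatbot interface a given health system actually exposes:

```python
# Sketch of the "second opinion late in the process" ordering described above.
# The physician commits to a diagnosis first; the model is consulted afterwards.
from dataclasses import dataclass

@dataclass
class Encounter:
    notes: str                  # step 1: examination findings recorded as notes
    physician_dx: str           # step 2: the physician's own working diagnosis
    llm_alternatives: str = ""  # step 3 output, reviewed by the physician in step 4

def ask_llm(prompt: str) -> str:
    # Placeholder: wire this to the locally approved, HIPAA-compliant chatbot client.
    raise NotImplementedError("plug in the approved chatbot client here")

def second_opinion(enc: Encounter) -> Encounter:
    prompt = (
        "Case notes:\n" + enc.notes + "\n\n"
        "Working diagnosis: " + enc.physician_dx + "\n"
        "List plausible alternative diagnoses with a rough likelihood for each, "
        "and cite the standard criteria you are relying on."
    )
    enc.llm_alternatives = ask_llm(prompt)  # step 3: get alternatives only after the human diagnosis
    return enc                              # step 4: physician re-examines with this in hand
```

Because the model never sees the case before the physician has committed to a diagnosis, the human and AI errors stay independent, which is the verification property the comment is after.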
What was actually evaluated. (Score:5, Insightful)
Re: (Score:2)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor.
The study described in TFA does not show that.
The error rate of doctors with direct contact was not compared to those without.
Re: (Score:2)
Who gave the reference diagnosis then? The one that's 100% accurate?
The reference diagnosis is determined retrospectively from the patient outcome.
Re: (Score:2)
The error rate of doctors with direct contact was not compared to those without.
The error rate of doctors with direct contact was also not compared to the AI error rate. It wasn't part of the study at all. Which makes the study pretty useless.
Re: (Score:2)
What it shows is that people, to get proper treatment, need direct patient contact with a doctor. This is what doctors are taught, and expected to do. An LLM or online consultation will not replace that.
Wrong. AI can be taught the entire history of medicine. Everything we know about medicine. Ever. And you don’t go to see a human doctor just to talk to them about your diagnosis. You go to the doctor and both of you “talk” to the results of the tests you took. Which, again, is something that can be automated. AI can also be taught what “low” or “high” means when reading a blood report. Just like the human does.
Test. Review Results. Diagnose. If a $50K car diagnostic scanner can do that, I don't see why AI can't in medicine.
Re: (Score:3)
Test. Review Results. Diagnose. If a $50K car diagnostic scanner can do that, I don't see why AI can't in medicine.
A car diagnostic scanner does not simply plug in and diagnose the vehicle, except for a small subset of tests. At best they have guided diagnosis and the technician has to perform various tests. This is exactly like the scenario in which the doctor is using a software agent to assist with diagnosis because in both cases, you need a trained professional to operate the equipment and perform the final diagnosis. They have to know enough to fact-check the machine, just like I know enough to recognize when Googl
Re: (Score:2)
The car diagnostic scanner was once untrusted too. Until it wasn’t.
The same will eventually be true of AI. Once AI learns the “high” and “low” parameters and is trained on what to do next (not unlike the highly trained human following the expert machine), it won’t have to re-learn it. Better yet, it won’t ever forget it. Unlike human brains.
Your concerns have an expiration date.
Re: (Score:2)
The car diagnostic scanner was once untrusted too. Until it wasn't.
The people who know something about automotive diagnostics still don't trust it, which is how we can tell you know fuck-all about this subject.
Re: (Score:2)
Test. Review Results. Diagnose. If a $50K car diagnostic scanner can do that, I don't see why AI can't in medicine.
That's easy if you can run a complete set of tests and don't have false positive results from those tests. In real life, test results are ambiguous, better tests are expensive, doctors have to accept inputs like "it hurts when I do this" as a starting point, patients don't want to admit how little they exercise or how badly they eat, and so on.
Re: (Score:2)
Mechanics have to do the same thing! If I run a test on the Sprinter and it says there's a short in the EGR wiring, I have to figure out if there actually is a short, or if it's actually a failed ground making it look like that, or if the EGR has seized up with soot and the motor that drives it is stalling out... The whole idea that you can just plug into even a car and have the scanner spit out the answer is horse shit. The scanner is used in conjunction with the manual, and it has a whole series of troubles
failure to understand procedure (Score:4, Insightful)
I don't trust these conclusions *at all*.
AI and machine learning, as performed by computer scientists, completely miss the meaning of the data and the protocol.
In machine learning/AI, a computer scientist will try to achieve the highest possible aggregate score (AUC, accuracy, and so on). This is frequently seen when a dataset of 1,000,000 tests (99% controls, 1% cases) yields its best overall score when everything is predicted as CONTROL. For a doctor, the 1% of cases is the difficult part, not the 99% of controls.
A doctor should operate by a hierarchy of diagnoses. If you show up at the clinic with a bleeding ass, would you like the doctor to aim for the maximum prediction score (there's a 95% chance it's nothing), or would you like your doctor to ass-ume the worst and schedule a colonoscopy for you? I would rather have the second option, something the AI, and the people organizing this study, completely miss.
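A tiny illustration of that 99%/1% situation (made-up labels, not from the study): always predicting CONTROL scores 99% accuracy while catching none of the cases, which is exactly why a raw accuracy target is the wrong one here.

```python
# The 99% controls / 1% cases scenario from the comment above, with the
# degenerate "predict ANYTHING -> CONTROL" classifier.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 10 + [0] * 990   # 1% cases, 99% controls
y_pred = [0] * 1000             # always predict CONTROL

print("accuracy:", accuracy_score(y_true, y_pred))           # 0.99
print("recall on the cases:", recall_score(y_true, y_pred))  # 0.0 -- every case missed
```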
Not even a broken arm IME (Score:2)
I had to argue with him to get an X-ray. When radiology came back showing the obvious
Pattern Recognition (Score:2)
For me, the takeaway is, LLMs can be a useful adjunct to diagnosis by a physician to help ide
Would you rather? (Score:1)
Re: (Score:3)
I would like to get treatment from a compassionate human with access to an experienced robot.
When the compassionate human fucks up in a very human way and attempts to dismiss the life-ending mistake with a compassionate apology, society will find the legal arguments against using humans in the future, quite compelling.
Human liability will become all that matters from a risk-mitigation perspective if we allow the current legal system to continue. And we will.
Having dealt with doctors (Score:1)
And their mush-brained indifference and arrogance, I think a potato has a better chance at diagnosing than a doctor.
You've Got Leprosy! (Score:2)
You've Got Leprosy!
That's where AI should be good (Score:2)
I would bet that a closer study would show a much higher failure rate when the patterns are less clear (for example, if you diagnose based on skin discolouration, black patients will often be misdiagnosed by less experienced doctors, and by an AI).
Not trustworthy without restricted training (Score:1)
AI has a proven track record of making shit up. Medical AI could be viable IF and ONLY IF the training data is restricted to trusted and verified medical sources. Otherwise you'll get medical "opinions" from some rando on reddit or tumblr mixed into the underlying model.