The internet can serve as a powerful tool for self-education on medical topics. With ChatGPT now at patients’ fingertips, researchers from Brigham and Women’s Hospital sought to assess how consistently the AI chatbot provides recommendations for cancer treatment that align with National Comprehensive Cancer Network guidelines.
The team’s findings, published in JAMA Oncology, show that in one-third of cases, ChatGPT provided an inappropriate — or “non-concordant” — recommendation, highlighting the need for awareness of the technology’s limitations.
“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the Internet should not be consulted in isolation,” said corresponding author Danielle Bitterman, a radiation oncologist and an instructor at Harvard Medical School. “ChatGPT responses can sound a lot like a human and can be quite convincing. But when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation. A right answer can be very nuanced, and not necessarily something ChatGPT or another large language model can provide.”
Although medical decision-making can be influenced by many factors, Bitterman and colleagues chose to evaluate the extent to which ChatGPT’s recommendations aligned with the NCCN guidelines, which are used by physicians across the country. They focused on the three most common cancers (breast, prostate, and lung cancer) and prompted ChatGPT to provide a treatment approach for each cancer based on the severity of the disease. In total, the researchers wrote 26 unique diagnosis descriptions and paired each one with four slightly different prompts asking ChatGPT for a treatment approach, generating a total of 104 prompts.
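The arithmetic behind the 104 prompts is a simple cross of diagnosis descriptions and prompt wordings. The short Python sketch below illustrates that design under stated assumptions: the article does not reproduce the actual diagnosis descriptions or prompt wordings, so the `diagnoses` and `templates` lists here are hypothetical placeholders, not the study's materials.

```python
from itertools import product

# Hypothetical placeholders: the article gives only the counts
# (26 diagnosis descriptions, 4 prompt wordings), not the text itself.
diagnoses = [f"diagnosis description {i}" for i in range(1, 27)]

templates = [
    "What is a recommended treatment for {dx}?",
    "How should {dx} be treated?",
    "What treatment approach would you suggest for {dx}?",
    "Describe a treatment plan for {dx}.",
]

# Pair every diagnosis description with every prompt wording.
prompts = [t.format(dx=dx) for dx, t in product(diagnoses, templates)]

print(len(diagnoses), "descriptions x", len(templates), "wordings =", len(prompts))
```

Varying the wording while holding the diagnosis fixed is what allows consistency across phrasings to be assessed, rather than just a single answer per diagnosis.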
Nearly all responses (98 percent) included at least one treatment approach that agreed with NCCN guidelines. However, the researchers found that 34 percent of these responses also included one or more non-concordant recommendations, which were sometimes difficult to detect amidst otherwise sound guidance. A non-concordant treatment recommendation was defined as one that was only partially correct: for example, a recommendation of surgery alone for a locally advanced breast cancer, without mention of another therapy modality. Notably, complete agreement in scoring occurred in only 62 percent of cases, underscoring both the complexity of the NCCN guidelines and the extent to which ChatGPT’s output could be vague or difficult to interpret.
In 12.5 percent of cases, ChatGPT produced “hallucinations,” or a treatment recommendation entirely absent from NCCN guidelines. These included recommendations of novel therapies or curative therapies for non-curative cancers. The authors emphasized that this form of misinformation can incorrectly set patients’ expectations about treatment and potentially impact the clinician-patient relationship.
Going forward, the researchers will explore how well both patients and clinicians can distinguish medical advice written by a clinician from advice written by a large language model. They also plan to prompt ChatGPT with more detailed clinical cases to further evaluate its clinical knowledge.
The authors used GPT-3.5-turbo-0301, one of the largest models available at the time they conducted the study and the model class that is currently used in the open-access version of ChatGPT (a newer version, GPT-4, is only available with the paid subscription). They also used the 2021 NCCN guidelines, because GPT-3.5-turbo-0301 was developed using data up to September 2021.
“It is an open research question as to the extent LLMs provide consistent logical responses as oftentimes ‘hallucinations’ are observed,” said first author Shan Chen of the Brigham’s AI in Medicine Program. “Users are likely to seek answers from the LLMs to educate themselves on health-related topics — similarly to how Google searches have been used. At the same time, we need to raise awareness that LLMs are not the equivalent of trained medical professionals.”