TY - JOUR
T1 - Artificial Intelligence Chatbot Responses to Patient Queries on Traumatic Brain Injury
T2 - An Expert Assessment of Reliability and Accuracy
AU - Schuss, Patrick
AU - Gonschorek, Andreas S
AU - Kämper, Michael
AU - Lemcke, Johannes
AU - Meisel, Hans-Jörg
AU - Rogge, Witold
AU - Schaan, Marc
AU - Schwenkreis, Peter
AU - Strowitzki, Martin
AU - Wohlfahrt, Kai
AU - Schmehl, Ingo
AU - Neuro-Trauma Working Group
N1 - Lehr-KH BG Klinikum Unfallklinik Murnau, Murnau, Germany
PY - 2025/11/21
Y1 - 2025/11/21
N2 - The increasing use of artificial intelligence-driven chatbots for medical queries requires systematic evaluation of their accuracy, reliability, and potential role in patient education. This study assesses the performance of three widely used chatbots (ChatGPT, Google Gemini, and Microsoft CoPilot) in answering patient-oriented questions on traumatic brain injury (TBI). A standardized set of TBI-related questions was developed, divided into eight subtopics, and presented to each chatbot using unified prompts. The responses were evaluated alongside reference answers prepared by a panel of specialists in neurology, neurosurgery, and neurorehabilitation, and subsequently assessed in a survey of patients undergoing rehabilitation for TBI. Performance was rated with a modified scoring framework across five key quality dimensions. Statistical analysis included multivariate analysis of variance to compare chatbot performance and logistic regression to estimate the likelihood of chatbot responses being considered an adequate substitute for expert advice. Significant differences between the chatbots emerged in several quality dimensions, with ChatGPT scoring higher than Gemini and CoPilot on reliability, responsiveness, and perceived trustworthiness (p < 0.05). No chatbot consistently demonstrated an advantage in conveying empathy. Logistic regression revealed that ChatGPT responses were significantly more likely to be rated as an adequate substitute for expert input (p < 0.0001, OR = 4.3, 95% CI: 2.4-7.6). AI-driven chatbots vary in their ability to provide high-quality medical information, with significant differences in reliability and responsiveness. While ChatGPT outperformed the other models in providing structured information, further improvements in context awareness and empathy are needed before broader clinical integration can be considered.
U2 - 10.1177/08977151251401539
DO - 10.1177/08977151251401539
M3 - Original Article
C2 - 41335521
SN - 0897-7151
JO - Journal of Neurotrauma
JF - Journal of Neurotrauma
ER -