TY - JOUR
T1 - How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology
AU - Kollitsch, L
AU - Eredics, K
AU - Marszalek, M
AU - Rauchenwald, M
AU - Brookman-May, SD
AU - Burger, M
AU - Körner-Riffard, K
AU - May, M
N1 - Eredics: Department of Urology, Paracelsus Medical University, Salzburg, Austria
PY - 2024/1/10
Y1 - 2024/1/10
AB - Purpose: This study is a comparative analysis of three Large Language Models (LLMs), evaluating their rate of correct answers (RoCA) and the reliability of their generated answers on a set of urological knowledge-based questions spanning different levels of complexity. Methods: ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various levels of question complexity. Ten weeks after the consensus round, a further testing round was conducted to assess potential knowledge gain and a corresponding improvement in RoCA. Results: Over the three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (kappa = 0.67, p < 0.001) for ChatGPT-3.5, 74% (kappa = 0.40, p < 0.001) for ChatGPT-4, and 76% (kappa = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed for any of the three LLMs. Conclusions: The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, their limited response reliability adds to existing concerns about their current utility for educational purposes.
KW - AI
KW - LLM
KW - ChatGPT-3.5
KW - ChatGPT-4
KW - Bing AI
KW - Medical exam
KW - ISA
KW - EBU
KW - Urology exam
KW - Pass mark
U2 - 10.1007/s00345-023-04749-6
DO - 10.1007/s00345-023-04749-6
M3 - Original Article
C2 - 38197996
SN - 0724-4983
VL - 42
JO - World Journal of Urology
JF - World Journal of Urology
IS - 1
M1 - 20
ER -