Abstract

Objective: This study aimed to compare the accuracy of four state-of-the-art large language models (LLMs) (ChatGPT-5, ChatGPT-4o, DeepSeek, and Gemini 2.5 Pro) on Dentistry Specialization Exam (DUS) questions in prosthodontics, and to assess the effect of querying method on performance.

Materials and methods: A total of 128 multiple-choice DUS questions (106 knowledge-based; 22 case-based) from 13 exams (2012-2021) were administered. Each model was tested within a single 24-hour window (September 9, 2025). Two protocols were used: (1) Independent Query (each item in a fresh chat) and (2) Sequential Query (exam-like blocks of 10 items). Only first responses were recorded and scored against the official answer keys. The chi-square (χ²) test was used to compare categorical outcomes (correct/incorrect responses) across models and query strategies. Inter-rater agreement for question classification was assessed using Cohen's kappa coefficient, and all analyses were conducted at a significance level of α = 0.05.

Results: With Independent Query, accuracy on knowledge questions was 91% (GPT-5), 86% (GPT-4o), 71% (DeepSeek), and 88% (Gemini 2.5 Pro), and total accuracy was 86%, 83%, 66%, and 85%, respectively. Differences across models were statistically significant, partly owing to the low scores of DeepSeek. Case-based accuracies (64%, 68%, 45%, and 77%) did not differ significantly. With Sequential Query, knowledge accuracies were 75%, 73%, 63%, and 82%, and case accuracies were 77%, 68%, 64%, and 91%, respectively. Total accuracy still differed across models (75%, 72%, 63%, and 84%). Within-model comparisons showed significant drops on knowledge items from Independent to Sequential querying for GPT-5 (91% to 75%) and GPT-4o (86% to 73%). DeepSeek and Gemini 2.5 Pro showed no significant changes. Notably, Gemini 2.5 Pro achieved the highest performance on case-based questions (91%) with Sequential Query.

Conclusions: While the findings highlight current limitations in clinical reasoning, they support the conclusion that LLMs can be used as supplementary educational tools for DUS-style knowledge assessment but should not replace expert judgment or patient-specific decision-making.

  • Scope

    International

  • Type

    Peer-reviewed

  • Index info

    WOS.SCI, WOS.SSCI

  • Language

    English
