Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.
Cureus 2025-05-09
Can American Board of Surgery In-Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.
Am Surg 2025-05-12
Comparison of a generative large language model to pharmacy student performance on therapeutics examinations.
Curr Pharm Teach Learn 2025-05-23
ChatGPT-3.5 performed markedly worse than pharmacy students on therapeutics examinations, scoring only 53% against a student average of 82%. It struggled especially with questions requiring application and case analysis, performing well only on recall-type items, indicating that generative AI still faces considerable limitations in complex medical-education tasks.
GPT-4 versus human authors in clinically complex MCQ creation: A blinded analysis of item quality.
Med Teach 2025-05-29
Chatbots' Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis.
JMIR Med Educ 2025-05-30
A comparison of the psychometric properties of GPT-4 versus human novice and expert authors of clinically complex MCQs in a mock examination of Australian medical students.
Med Teach 2025-06-12
Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.
J Cardiothorac Vasc Anesth 2025-06-15