AI in Medical Education: LLM Evaluation in TUS

Introduction

In recent years, technological advances such as Artificial Intelligence (AI) and Large Language Models (LLMs) have opened the door to significant transformations in medical education and knowledge assessment. In particular, these developments can make medical information more accessible and assessment more interactive.

Previous research has explored the performance of LLMs on various medical licensing exams, such as the United States Medical Licensing Examination (USMLE) and the Japanese Medical Licensing Examination (JMLE), but these exams differ significantly from the Turkish Medical Specialization Training Entrance Exam (TUS) in structure and content. The TUS focuses on basic and clinical sciences with a specific emphasis on the Turkish medical context, offering a unique opportunity to evaluate the capabilities of LLMs within a distinct assessment environment. This study aims to address this gap by evaluating the performance of four leading LLMs on the TUS. Furthermore, it investigates the potential implications of these findings for curriculum design, AI-assisted medical training, and the future of medical assessment in Turkey. Specifically, we examine how LLM performance can inform the development of more effective educational resources and assessment strategies tailored to the Turkish medical curriculum. This investigation not only contributes to understanding LLM performance in a specific language context but also informs the broader discourse on how AI can be effectively integrated into medical education and assessment globally.

The results of these prior studies indicate that ChatGPT and similar LLMs can play an important role in medical education and knowledge assessment. The use of AI and LLMs in medical information retrieval and evaluation can enable innovative approaches and learning methods, especially in medical education. This study aims to further investigate the impact of LLMs on medical education and knowledge assessment by evaluating the performance of ChatGPT 4, Gemini 1.5 Pro, Command R+, and Llama 3 70B on the Turkish Medical Specialization Training Entrance Exam.

This research examines the application of advanced AI models, namely ChatGPT 4, Gemini 1.5 Pro, Command R+, and Llama 3 70B, in medical education and assessment, focusing on their performance in answering medical specialization exam questions. The study assesses these models' ability to conduct a comprehensive and systematic analysis of the Turkish Medical Specialization Training Entrance Exam questions, underscoring AI's potential in medicine with respect to interpretative ability and accuracy. The results indicate that AI models can significantly facilitate the medical education and assessment process, opening avenues for new applications and research areas. The main objective of this article is to evaluate the rapid advancement of AI technology and to compare the performance of the four models on 240 questions from the first term of the 2021 Turkish Medical Specialization Training Entrance Exam.

This comparison aims to elucidate the evolution and distinctions of AI technologies, focusing on their utility in specialized domains such as medical education and exam preparation. The ultimate goal is to provide insights that assist users in selecting the most suitable learning tools for their specific needs.

Methods

The questions were obtained from the official website of the Student Selection and Placement Center in a multiple-choice format (five options, A to E) with only one best answer. They were presented to the LLMs in Turkish, and the models' answers were likewise given in Turkish.

The evaluation was based on the official answer key published by the Student Selection and Placement Center: a model's response was accepted as "correct" only if it was the answer identified as correct according to the instructions in the question text and matched the published key. Because both the questions and the responses were in Turkish, the evaluation consisted of comparing the models' Turkish answers with the official Turkish answer key.
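To make the grading procedure concrete, the following minimal sketch shows how a model's answers could be compared against the official key; the question numbers and option letters are hypothetical placeholders, not data from the study.

    # Minimal scoring sketch (illustrative only): the key and the model's responses
    # below are hypothetical placeholders, not data from the study.
    official_key = {1: "A", 2: "C", 3: "E"}    # question number -> correct option
    model_answers = {1: "A", 2: "B", 3: "E"}   # question number -> option chosen by the model

    def accuracy(answers: dict[int, str], key: dict[int, str]) -> float:
        """Fraction of questions whose answer matches the official key."""
        correct = sum(1 for q, ans in answers.items() if key.get(q) == ans)
        return correct / len(key)

    print(f"Accuracy: {accuracy(model_answers, official_key):.2%}")   # 66.67% in this toy example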

Medical Education Dataset

This study uses ChatGPT 4, Gemini 1.5 Pro, Command R+, and Llama 3 70B to test the capability of artificial intelligence models to evaluate medical knowledge and clinical cases. The study was conducted on the questions of the Turkish Medical Specialization Training Entrance Exam held on March 21, 2021. The exam, organized by the Student Selection and Placement Center, consists of 240 questions in two categories. The first category contains basic knowledge questions that test the factual knowledge and ethics required to complete medical education. The second category contains case questions, which cover a wide range of diseases and measure analytical thinking and reasoning skills.

Question Difficulty Classification

The difficulty levels of the questions were classified based on the official candidate performance data released by the Student Selection and Placement Center. Specifically, the correct answer rate for each question, as reported by the center, was used to categorize the questions into five difficulty levels:

  • Level 1 (Easiest): Questions with a correct answer rate of 80% or higher.
  • Level 2: Questions with a correct answer rate between 60% and 79.9%.
  • Level 3 (Medium): Questions with a correct answer rate between 40% and 59.9%.
  • Level 4: Questions with a correct answer rate between 20% and 39.9%.
  • Level 5 (Most Difficult): Questions with a correct answer rate of 19.9% or lower.
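These thresholds reduce to a simple mapping rule. The sketch below shows how a question's candidate correct-answer rate could be converted into a difficulty level; the example rates are hypothetical, not values from the official statistics.

    def difficulty_level(correct_rate: float) -> int:
        """Map a question's candidate correct-answer rate (in %) to a difficulty level 1-5."""
        if correct_rate >= 80.0:
            return 1   # easiest
        elif correct_rate >= 60.0:
            return 2
        elif correct_rate >= 40.0:
            return 3   # medium
        elif correct_rate >= 20.0:
            return 4
        return 5       # most difficult

    # Hypothetical example rates, not taken from the published candidate statistics
    for rate in (92.5, 71.0, 45.3, 28.8, 12.1):
        print(f"{rate}% -> Level {difficulty_level(rate)}")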


Knowledge and Case Domains

The Turkish Medical Specialization Training Entrance Exam, a pivotal step for medical graduates in Turkey aiming to specialize, assesses candidates in two key domains: the knowledge domain and the case domain. Understanding the distinction between these domains is crucial for adequate preparation. The knowledge domain evaluates the candidate's theoretical understanding and factual knowledge within the relevant medical field, testing mastery of the fundamental concepts and principles that form the core medical knowledge of the profession. It represents the specific medical knowledge area being tested, such as basic medical sciences (anatomy, biochemistry, physiology, etc.) and clinical sciences (internal medicine, surgery, pediatrics, etc.). The case domain, on the other hand, represents real-world scenarios or contexts for applying that knowledge, requiring problem-solving, analytical thinking, critical thinking, decision-making, and the application of concepts to real situations.

Prompt Engineering

Prompt engineering is the practice of designing and fine-tuning natural language prompts to obtain specific responses from language models or AI systems. In April 2024, we collected responses by querying each language model directly through its respective web interface.

To ensure a fair evaluation of each model's raw capabilities, the presentation of questions to the LLMs was tightly controlled. Each question was entered individually, and the conversation was reset before a new question was posed, preventing the model from learning or adapting based on prior interactions.
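Responses in this study were collected manually through the web interfaces, but the same one-question-per-fresh-conversation protocol can be expressed programmatically. The sketch below illustrates the idea with the OpenAI Python client; the model name, placeholder questions, and use of an API are assumptions for illustration, not the exact setup used in this study.

    # Illustrative sketch of the "fresh conversation per question" protocol via an API.
    # Assumes the OpenAI Python client (pip install openai) and an API key in the
    # OPENAI_API_KEY environment variable; model name and questions are placeholders.
    from openai import OpenAI

    client = OpenAI()

    questions = [
        "Hypothetical multiple-choice question 1 ...",
        "Hypothetical multiple-choice question 2 ...",
    ]

    answers = []
    for q in questions:
        # A brand-new message list for every question: no memory of earlier interactions.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": q}],
            temperature=0,
        )
        answers.append(response.choices[0].message.content)

    print(answers)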

Data Analysis

All analyses were performed using Microsoft Office Excel and Python. To compare the performance of the LLMs across question difficulty levels, Chi-square tests of independence were conducted. A p-value threshold of p < 0.05 was used to determine statistical significance. This analysis assessed whether the accuracy of the models varied with the difficulty level of the questions.
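As a concrete illustration of this analysis, the sketch below builds a 2×5 contingency table of one model's correct and incorrect answers by difficulty level and applies SciPy's Chi-square test of independence; the counts shown are hypothetical, not the study's data.

    # Chi-square test of a model's accuracy across difficulty levels (hypothetical counts).
    import numpy as np
    from scipy.stats import chi2_contingency

    correct   = np.array([40, 55, 50, 30, 10])   # correct answers at difficulty levels 1-5
    incorrect = np.array([ 2,  5, 10, 15, 23])   # incorrect answers at difficulty levels 1-5

    table = np.vstack([correct, incorrect])      # 2 x 5 contingency table

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
    # p < 0.05 would indicate that accuracy differs significantly across difficulty levels.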

Ethical Considerations

This study used only information published on the internet and did not involve human subjects. Therefore, approval from the Baskent University Ethics Committee was not required.

Results

The average number of correct answers among candidates who took the Basic Medical Sciences section of the 2021 first-term Turkish Medical Specialization Training Entrance Exam was 51.63, compared with 63.95 in the Clinical Medical Sciences section. In parallel with this, the artificial intelligence models were also more successful in answering the Clinical Medical Sciences questions.

AI Performance

The performance of the AI platforms was assessed using the same metrics as human candidates.

  • ChatGPT 4:

    ChatGPT 4 achieved an average score of 103 correct answers in the Basic Medical Sciences section and 110 correct answers in the Clinical Medical Sciences section. This represents an overall accuracy of 88.75%, significantly outperforming the average human candidates in both sections (p < 0.001).

  • Llama 3 70B:

    Llama 3 70B achieved an average score of 95 correct answers in the Basic Medical Sciences section and 95 correct answers in the Clinical Medical Sciences section. This represents an overall accuracy of 79.17%, which is also significantly higher than the average human performance (p < 0.01).

  • Gemini 1.5 Pro:

    Gemini 1.5 Pro achieved an average score of 94 correct answers in the Basic Medical Sciences section and 93 correct answers in the Clinical Medical Sciences section. This represents an overall accuracy of 78.13%, which is significantly higher than the average human performance (p < 0.01).

  • Command R+:

    Command R+ achieved an average score of 60 correct answers in the Basic Medical Sciences section and 60 correct answers in the Clinical Medical Sciences section. This represents an overall accuracy of 50%, which is not significantly different from the average human performance in the Basic Medical Sciences section (p = 0.12) but is significantly lower in the Clinical Medical Sciences section (p < 0.05).
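The article reports p-values for these model-versus-human comparisons without naming the specific test. One plausible way to perform such a comparison is a one-sided binomial test of a model's correct-answer count against the human candidates' average proportion correct; the sketch below uses the reported totals for ChatGPT 4, but the choice of test is an assumption for illustration.

    # Hypothetical re-analysis: one-sided binomial test of ChatGPT 4's correct-answer count
    # against the human candidates' average proportion correct (test choice is an assumption).
    from scipy.stats import binomtest

    model_correct = 103 + 110                    # 213 correct answers reported for ChatGPT 4
    total_questions = 240
    human_mean_rate = (51.63 + 63.95) / 240      # ~0.4816, from the reported human averages

    result = binomtest(model_correct, total_questions, human_mean_rate, alternative="greater")
    print(f"p = {result.pvalue:.2e}")            # a very small p-value: accuracy exceeds the human average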


Figure 3 compares the accuracy of the different LLMs by question difficulty:

  • ChatGPT 4: The best-performing model. Its accuracy remains the highest at every difficulty level, staying close to 70% even on the most challenging questions.
  • Llama 3 70B: A mid-performing model. Its accuracy first increases and then decreases as question difficulty rises, falling to around 25% on the most challenging questions.
  • Gemini 1.5 Pro: Performs similarly to Llama 3 70B. Its accuracy first increases and then decreases as question difficulty rises, falling to around 20% on the most challenging questions.
  • Command R+: The worst-performing model. Its accuracy decreases as question difficulty rises and remains around 15% on the most challenging questions.

In summary, ChatGPT 4 is the model that is least affected by question difficulty and has the highest overall accuracy. Llama 3 70B and Gemini 1.5 Pro showed moderate performance, while Command R+ had a lower success rate than the other models. As question difficulty increases, the accuracy of the models decreases, indicating that LLMs still need improvement in understanding and correctly answering complex questions.

In Table 1, the ChatGPT 4 model stands out as the best performer, with a success rate of 88.75%, indicating a robust ability to understand and accurately answer questions. The Llama 3 70B model ranks second with a success rate of 79.17%; while it lags behind ChatGPT 4, it still demonstrates a high level of proficiency. The Gemini 1.5 Pro model follows closely with a success rate of 78.13%, comparable to Llama 3 70B and indicative of strong question-answering capabilities. The Command R+ model, by contrast, trails the other models with a success rate of 50%, suggesting that it may struggle with certain questions or require further fine-tuning.

Looking at the distribution of correct answers across difficulty levels, all models performed well on easy questions (difficulty level 1), with ChatGPT 4 achieving a perfect score. On medium-difficulty questions (levels 2 and 3), ChatGPT 4 and Llama 3 70B continued to perform well, whereas Gemini 1.5 Pro started to show some weaknesses. On difficult questions (levels 4 and 5), the performance of all models decreased, with Command R+ struggling the most. Overall, these results provide valuable insights into the strengths and weaknesses of each AI model and can inform future development and improvement efforts.

In Table 3, within Basic Medical Sciences, Biochemistry received a perfect score from ChatGPT 4, demonstrating its exceptional ability to answer questions in this area. Llama 3 70B and Gemini 1.5 Pro also performed well, but Command R+ performed poorly, with an accuracy of 50%. In Pharmacology, Pathology, and Microbiology, the best-performing models (ChatGPT 4 and Llama 3 70B) demonstrated strong knowledge consistency, with accuracy rates ranging from 81% to 90%; Gemini 1.5 Pro and Command R+ lagged behind but still performed reasonably well. Anatomy and Physiology presented some challenges: ChatGPT 4 and Llama 3 70B performed well, while Gemini 1.5 Pro and Command R+ performed poorly, with accuracy rates below 70%.

In Clinical Medical Sciences, Pediatrics was a key area for all models, with ChatGPT 4 achieving a near-perfect score (90%); Llama 3 70B followed closely, and even Command R+ achieved an accuracy rate of 43%. In Internal Medicine and General Surgery, the best models performed strongly, with accuracy rates ranging from 79% to 90%, while Gemini 1.5 Pro and Command R+ lagged behind but still performed reasonably well. Fewer questions were asked in specialties such as Anesthesia and Resuscitation, Emergency Medicine, Neurology, and Dermatology, but the models generally performed well, with ChatGPT 4 and Llama 3 70B demonstrating exceptional accuracy in these areas.

Regarding model comparison, ChatGPT 4 is the best-performing model in most areas, with an overall accuracy of 88.75%. Its strength lies in its ability to answer both basic and clinical medical science questions accurately. Llama 3 70B follows closely with an overall accuracy of 79.17%; while it does not fully match the performance of ChatGPT 4, it still demonstrates strong knowledge consistency across various fields. Gemini 1.5 Pro and Command R+ lag behind with overall accuracy rates of 78.13% and 50%, respectively. While they show promise in some areas, they struggle to maintain consistency across all fields.

In short, ChatGPT 4 is currently the most suitable model for answering medical science questions across fields. Gemini 1.5 Pro and Command R+ show potential, but significant improvements are needed for them to compete with the best-performing models.

In Table 4, regarding the knowledge domain, ChatGPT 4 had an accuracy of 86.7% (85/98) in basic medical sciences and 89.7% (61/68) in clinical medical sciences, outperforming the other models in both fields. Regarding the case domain, ChatGPT 4 had an accuracy of 81.8% (18/22) in basic medical sciences and performed similarly well in clinical medical sciences, with an accuracy of 94.2% (49/52).

Pairwise comparison of the models shows that ChatGPT 4 significantly outperformed the other models in both domains and question types. Llama 3 70B and Gemini 1.5 Pro performed similarly, while Command R+ lagged behind. Based on this analysis, we can conclude that ChatGPT 4 demonstrated superior performance in both knowledge and case domains, as well as in basic and clinical medical sciences fields.
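The pairwise comparisons summarized above can be reproduced in spirit with a 2×2 Chi-square test on two models' overall correct and incorrect counts. The sketch below uses the overall totals reported earlier for ChatGPT 4 (213/240) and Command R+ (120/240); the exact pairwise procedure used in the study is not described, so this is an illustrative re-computation rather than the study's method.

    # Illustrative pairwise comparison of two models' overall accuracy (2 x 2 Chi-square test).
    # Counts come from the reported totals; the choice of test here is an assumption.
    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: models, columns: [correct, incorrect] out of 240 questions
    table = np.array([
        [213, 240 - 213],   # ChatGPT 4
        [120, 240 - 120],   # Command R+
    ])

    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")   # p << 0.05: a significant difference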

Statistical Analysis

The performance of the LLMs was analyzed using Microsoft Office Excel and Python (version 3.10.2). To compare the performance of the models across question difficulty levels, Chi-square tests of independence were conducted: contingency tables of each AI model's correct and incorrect answers by difficulty level were constructed, and the Chi-square test was applied to determine whether performance differed significantly across difficulty levels. A p-value threshold of <0.05 was used to determine statistical significance.

  • ChatGPT 4: p = 0.00028, significant at p < 0.05, indicating a significant difference in performance across difficulty levels.
  • Gemini 1.5 Pro: p = 0.047, significant at p < 0.05, indicating a significant difference in performance across difficulty levels.
  • Command R+: p = 0.197, not significant at p < 0.05, indicating no significant difference in performance across difficulty levels.
  • Llama 3 70B: p = 0.118, not significant at p < 0.05, indicating no significant difference in performance across difficulty levels.

The accuracy of ChatGPT 4 and Gemini 1.5 Pro differed significantly across question difficulty levels, indicating that their performance varied with question difficulty. Command R+ and Llama 3 70B did not show a significant difference across difficulty levels, suggesting more consistent performance regardless of question difficulty. These results may indicate that different models have different strengths and weaknesses in dealing with the complexity and topics associated with different difficulty levels.

Discussion

The TUS is a critical national examination for medical graduates in Turkey pursuing specialized training. The exam comprises multiple-choice questions covering basic and clinical sciences and features a centralized ranking system that determines placement in specialty programs.

In assessing the performance of large language models on the TUS, ChatGPT 4 emerged as the top performer. Similarly, previous studies have reported that ChatGPT showed near- or above-human-level performance in surgery, correctly answering 71% and 68% of multiple-choice SCORE and Data-B questions, respectively, and that it excelled in a public health examination, surpassing current pass rates and providing unique insights. These findings highlight the strong performance of ChatGPT 4 and ChatGPT in medical assessments, showcasing their potential to enhance medical education and to serve as potential diagnostic aids.

For medical educators and examiners, the increasing accuracy of LLMs raises important questions about exam design and assessment. If AI models can solve standardized medical exams with high precision, future assessments may need to incorporate higher-order reasoning and clinical judgment questions that go beyond simple recall. Additionally, Turkish medical institutions could explore AI-assisted education strategies, such as adaptive learning systems that tailor learning materials to the individual needs of students.

From a national perspective, this study highlights the growing importance of AI in medical education in Turkey. As these LLMs perform well on Turkish medical questions, they can bridge the gap for students in underserved areas to access high-quality educational resources. Furthermore, policymakers should consider how to integrate AI models into continuing medical education and lifelong learning programs for healthcare professionals in Turkey.

In conclusion, while AI models like ChatGPT 4 demonstrate remarkable accuracy, their role in medical education should be carefully evaluated. The potential benefits of AI-assisted learning are substantial, but proper implementation requires ensuring that these tools are used responsibly, ethically, and in conjunction with human expertise.

Limitations

This study provides valuable insights into the performance of Large Language Models (LLMs) on the Turkish Medical Specialization Training Entrance Exam (TUS), but it is essential to acknowledge several key limitations to contextualize the findings and guide future research. First, it is uncertain whether the training data of the AI models assessed in this study included TUS questions. Because past TUS questions are publicly available, it is possible that the questions used in this study were part of the models’ training data. This raises concerns about whether the models’ performance reflects genuine understanding or simply the ability to memorize specific questions. Future studies should develop methods to assess whether AI models demonstrate true reasoning capabilities or rely on memorized information.

Second, AI models have the potential to exhibit biases stemming from their training data. These biases may arise from imbalances in the representation of certain medical conditions, populations, or perspectives within the training data. For example, the models’ performance in Turkish may differ from that in English due to variations in the quantity and quality of training data available in each language. Additionally, these models may be less accurate in answering questions that require an understanding of local Turkish medical practices or cultural contexts. These biases may limit the generalizability of the findings and raise ethical concerns about the use of AI in medical education and practice.

A third limitation is that the study focuses exclusively on multiple-choice questions. In real-world clinical practice, medical professionals need to possess skills such as reasoning through complex cases, interpreting ambiguous findings, and making decisions under uncertainty. Furthermore, the ability to communicate diagnoses, treatment plans, and risks to patients and colleagues in a clear and compassionate manner is essential. AI models have not yet been tested for their ability to perform these tasks, and their capabilities may be limited by their current design and training. Future research should evaluate AI models in more realistic settings, such as clinical case simulations and open-ended assessments.

Fourth, the study did not include open-ended questions. Open-ended questions are crucial for assessing higher-order cognitive skills such as critical thinking, information synthesis, and clinical reasoning. These types of questions require the ability to generate coherent and contextually relevant responses rather than simply selecting the correct option from a list. The performance of AI models on such tasks may differ significantly from their performance on multiple-choice questions, representing an important area for future research.

A fifth limitation is that the AI models were not tested under time pressure. Human test-takers face strict time constraints during examinations, which can impact their performance. In contrast, the AI models in this study were not subjected to time pressure, allowing them to operate without the stress of a timed environment.