Evaluating LLM Performance: Accuracy and Consistency
Our main objective was to assess the accuracy of responses provided by leading LLMs when presented with questions related to CVD prevention. We focused on BARD (Google’s language model), ChatGPT-3.5 and ChatGPT-4.0 (OpenAI’s models), and ERNIE (Baidu’s model). A set of 75 meticulously crafted CVD prevention questions was posed to each LLM, with responses graded for appropriateness (appropriate, borderline, or inappropriate).
English Language Performance
In the English language, the LLMs demonstrated notable accuracy. BARD achieved an "appropriate" rating of 88.0%, ChatGPT-3.5 scored 92.0%, and ChatGPT-4.0 excelled with 97.3%. These results suggest that LLMs can provide valuable information to English-speaking users seeking guidance on CVD prevention. The high scores reflect the models’ training on extensive datasets that include English-language medical information, and the slight variations between models likely stem from differences in their architectures, training methodologies, and training data. These findings are encouraging for the potential of LLMs to serve as reliable sources of information on CVD prevention strategies. It remains crucial, however, that LLMs not replace consultation with qualified healthcare professionals, as they are not substitutes for personalized medical advice.
Chinese Language Performance
The analysis extended to Chinese-language queries, where the performance of the LLMs varied. ERNIE achieved an "appropriate" rating of 84.0%, ChatGPT-3.5 scored 88.0%, and ChatGPT-4.0 reached 85.3%. While generally positive, these results indicate a slight dip in performance relative to English, suggesting potential language bias in these models. Several factors could contribute to this difference. The datasets used to train these LLMs may be weighted towards English-language content, yielding a more comprehensive understanding and more nuanced response generation in English. Additionally, the complexities of written Chinese, including its idiomatic expressions and character-based script, may pose challenges for LLMs trained predominantly on Western languages. ERNIE, being a Baidu product, likely benefits from specialized training on Chinese-language datasets, potentially explaining its relatively strong performance in this context. Despite these potential language biases, the reasonably high "appropriate" ratings in Chinese suggest that LLMs can still provide valuable information to Chinese-speaking users, albeit with a somewhat higher margin for error than for English-language queries.
Temporal Improvement and Self-Awareness
Beyond initial accuracy, we investigated the LLMs’ ability to improve their responses over time and their self-awareness of correctness. This involved assessing how the models responded to suboptimal answers initially provided and whether they could identify and rectify errors when prompted. This aspect of the evaluation is crucial for understanding the long-term potential of LLMs in healthcare, as their ability to learn and adapt is essential for maintaining their reliability and relevance in a rapidly evolving field.
Enhanced Responses Over Time
The analysis revealed that the LLMs exhibit temporal improvement. When presented with initially suboptimal responses, BARD and ChatGPT-3.5 improved in 67% of cases (6/9 and 4/6, respectively), while ChatGPT-4.0 improved in 100% (2/2). This suggests that the models can refine their answers when given feedback, leading to more accurate and reliable information over the course of an interaction. Note that this improvement does not involve updating the models’ weights during the conversation; rather, the corrective prompt becomes part of the conversation context, allowing the model to condition its subsequent answers on the feedback it has received. This demonstrates the potential for LLMs to become increasingly valuable resources for CVD prevention as they are exposed to a wider range of queries and feedback. The superior performance of ChatGPT-4.0 in this regard may be attributable to its more advanced architecture and training data, enabling it to make better use of in-context feedback than its predecessors.
Self-Awareness of Correctness
We also examined the LLMs’ ability to recognize the correctness of their responses. BARD and ChatGPT-4.0 outperformed ChatGPT-3.5 in this area, demonstrating better self-awareness of the accuracy of the information they provided. This feature is particularly valuable in medical contexts, where incorrect information can have serious consequences. The ability of an LLM to assess the reliability of its own output hinges on its understanding of the underlying knowledge domain and its capacity to compare its responses against established facts and guidelines. LLMs that exhibit strong self-awareness are better equipped to identify and flag potentially inaccurate or incomplete information, thereby reducing the risk of users relying on flawed advice. The superior self-awareness of BARD and ChatGPT-4.0 underscores the importance of incorporating mechanisms for uncertainty estimation and knowledge validation into the design of LLMs intended for medical applications.
ERNIE’s Performance in Chinese
The analysis of Chinese prompts revealed that ERNIE excelled in temporal improvement and self-awareness of correctness. This suggests that ERNIE is well-suited for providing accurate and reliable information to Chinese-speaking users seeking CVD prevention guidance. ERNIE’s strong performance in these areas likely stems from its extensive training on Chinese language datasets and its integration of domain-specific knowledge related to Chinese medical practices and cultural contexts. This highlights the potential advantages of developing language-specific LLMs that are tailored to the unique characteristics of different linguistic and cultural communities. By focusing on the specific needs and nuances of a particular language, these models can deliver more accurate, relevant, and culturally sensitive information to users.
Comprehensive Evaluation of LLM Chatbots
To ensure a comprehensive evaluation that includes common and popular LLM chatbots, this study included four prominent models: ChatGPT-3.5 and ChatGPT-4.0 by OpenAI, BARD by Google, and ERNIE by Baidu. The evaluation of English prompts involved ChatGPT-3.5, ChatGPT-4.0, and BARD; for Chinese prompts, it involved ChatGPT-3.5, ChatGPT-4.0, and ERNIE. The models were used with their default configurations and temperature settings, without adjustments to these parameters during the analysis. This decision reflects the typical user experience, as most individuals interact with these chatbots without modifying the default settings, and so provides a more realistic assessment of the models’ performance in real-world scenarios. It is important to note, however, that adjusting the temperature setting could influence the models’ output: lower temperatures generally lead to more conservative and factually grounded responses, while higher temperatures yield more creative but potentially less accurate answers.
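The role of the temperature setting can be illustrated with a minimal, stdlib-only sketch of temperature-scaled softmax sampling. This is a toy model of how sampling-based decoding works in general, not any vendor’s implementation, and the logit values are invented for illustration:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from `logits` after temperature scaling.

    Lower temperatures sharpen the softmax distribution (the top-scoring
    option dominates, so output is more conservative); higher temperatures
    flatten it (more varied but potentially less accurate output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e
        if r < cum:
            return i
    return len(exps) - 1

rng = random.Random(0)       # fixed seed for reproducibility
logits = [2.0, 1.0, 0.1]     # toy next-token scores
low_t = [sample_with_temperature(logits, 0.1, rng) for _ in range(1000)]
high_t = [sample_with_temperature(logits, 5.0, rng) for _ in range(1000)]
# At temperature 0.1 the top-scoring index is chosen almost always;
# at temperature 5.0 the samples spread across all three indices.
print(low_t.count(0) / 1000, sorted(set(high_t)))
```

This makes concrete why leaving the default temperature in place matters for reproducing the typical user experience: the same model with a different temperature samples from a differently shaped distribution.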
Question Generation and Chatbot Response Evaluation
The American College of Cardiology and American Heart Association provide guidelines and recommendations for CVD prevention, encompassing information on risk factors, diagnostic tests, and treatment options, as well as patient education and self-management strategies. Two experienced cardiologists generated questions related to CVD prevention, framing them similarly to how patients would inquire with physicians to ensure relevance and comprehensibility from a patient’s perspective. This patient-centered and guideline-based approach yielded a final set of 300 questions covering various domains. These questions were then translated into Chinese, ensuring the appropriate use of conventional and international units. The use of cardiologists to generate the questions ensured that the content was clinically relevant and reflected the types of queries that patients would typically have about CVD prevention. The translation into Chinese was carefully performed to maintain the integrity of the questions and ensure that they were culturally appropriate for Chinese-speaking users.
Blinding and Randomly Ordered Assessment
To ensure the graders could not identify which LLM chatbot produced each response, any chatbot-specific features were manually concealed. The evaluation was conducted in a blinded, randomly ordered manner, with responses from the three chatbots shuffled within the question set. The responses were randomly assigned to three rounds, in a 1:1:1 ratio, for blinded assessment by three cardiologists, with a 48-hour wash-out interval between rounds to mitigate recency bias, which could arise if graders encountered similar responses from the same chatbot in close proximity. The blinding and random ordering together prevented knowledge of a response’s origin from influencing the graders’ evaluations.
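The shuffling and 1:1:1 round assignment described above can be sketched as follows. The function name and chatbot labels are hypothetical, shown only to make the randomization scheme concrete:

```python
import random

def assign_rounds(question_ids, chatbots, seed=0):
    """Assign each chatbot's response for each question to one of three
    grading rounds, in a 1:1:1 ratio per question, so that no single round
    reveals which chatbot produced which response."""
    rng = random.Random(seed)
    rounds = {1: [], 2: [], 3: []}
    for qid in question_ids:
        order = list(chatbots)
        rng.shuffle(order)  # random ordering blinds the grader to origin
        for round_no, bot in zip((1, 2, 3), order):
            rounds[round_no].append((qid, bot))
    return rounds

rounds = assign_rounds(range(10), ["A", "B", "C"])
# Each round contains exactly one response per question (1:1:1 ratio).
print(all(len(r) == 10 for r in rounds.values()))
```

Each grader then sees every question in every round, but the chatbot behind each response varies unpredictably from round to round.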
Accuracy Evaluation Methodology
The primary outcome was performance in responding to primary CVD prevention questions. A two-step approach was used to evaluate the responses. First, a panel of cardiologists reviewed all LLM chatbot-generated responses and graded each as “appropriate,” “borderline,” or “inappropriate” relative to expert consensus and guidelines. Second, a majority-consensus approach was applied: the final rating for each chatbot response was the most common rating among the three graders. In scenarios where no majority could be achieved, a senior cardiologist was consulted to finalize the rating. This rigorous methodology kept the assessment both comprehensive and objective: multiple cardiologist graders provided a diversity of perspectives and reduced the influence of individual biases, the majority-consensus approach strengthened the reliability of the evaluation, and the senior cardiologist’s involvement in unresolved cases ensured the final ratings aligned with expert clinical judgment.
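The majority-consensus rule can be expressed compactly. Here `consensus_rating` is a hypothetical helper, with `None` standing in for the cases escalated to the senior cardiologist:

```python
from collections import Counter

def consensus_rating(grades):
    """Return the majority rating among three graders, or None if no
    rating wins a majority (i.e., a three-way split requiring escalation
    to a senior cardiologist)."""
    top, count = Counter(grades).most_common(1)[0]
    return top if count >= 2 else None

print(consensus_rating(["appropriate", "appropriate", "borderline"]))      # appropriate
print(consensus_rating(["appropriate", "borderline", "inappropriate"]))    # None -> escalate
```

With three graders and three rating levels, escalation is needed only in the exact three-way-split case; any other grade combination produces a majority of at least two.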
Analysis of Key Findings
The data revealed that the LLM chatbots generally performed better with English prompts than with Chinese prompts. For English prompts, BARD, ChatGPT-3.5, and ChatGPT-4.0 demonstrated similar sum scores, although ChatGPT-4.0 had a notably higher proportion of ‘appropriate’ ratings than ChatGPT-3.5 and BARD. For Chinese prompts, ChatGPT-3.5 had the highest sum score, followed by ChatGPT-4.0 and ERNIE, but the differences were not statistically significant; likewise, ChatGPT-3.5 had a higher proportion of ‘appropriate’ ratings for Chinese prompts than ChatGPT-4.0 and ERNIE, again without reaching statistical significance. The overall trend of better performance with English prompts is consistent with the hypothesis that these LLMs are trained on larger English-language datasets, yielding a more comprehensive understanding of the subject matter in that language. The non-significant differences for Chinese prompts suggest broadly comparable performance in that language, although larger sample sizes and more diverse question sets may be needed to detect statistically significant differences.
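A "sum score" of the kind reported here is typically computed by weighting the three rating levels and summing over questions. The exact point values used in the study are not stated, so the weights below, and the borderline/inappropriate split in the example, are assumptions for illustration; the 66-of-75 ‘appropriate’ count is chosen to match an 88.0% proportion:

```python
# Hypothetical weights: the study reports "sum scores", but the point
# values per rating level are assumed here for illustration only.
WEIGHTS = {"appropriate": 2, "borderline": 1, "inappropriate": 0}

def sum_score(ratings):
    """Aggregate per-question consensus ratings into a single sum score."""
    return sum(WEIGHTS[r] for r in ratings)

def appropriate_proportion(ratings):
    """Fraction of questions rated 'appropriate'."""
    return ratings.count("appropriate") / len(ratings)

# Example: 75 questions, 66 rated appropriate (88.0%), with an assumed
# split of the remainder between borderline and inappropriate.
ratings = ["appropriate"] * 66 + ["borderline"] * 7 + ["inappropriate"] * 2
print(sum_score(ratings), appropriate_proportion(ratings))
```

This also shows why sum scores and ‘appropriate’ proportions can rank models differently: a model with many ‘borderline’ answers can match another’s sum score while having a lower ‘appropriate’ proportion.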
Performance Across CVD Prevention Domains
The analysis focused on "appropriate" ratings across different CVD prevention domains. Remarkably, ChatGPT-4.0 consistently performed well in most domains, with particularly high ratings in "dyslipidemia," "lifestyle," "biomarker and inflammation," and "DM and CKD." BARD showed suboptimal performance compared to ChatGPT-4.0 and ChatGPT-3.5. All three LLM chatbots achieved 100% "appropriate" ratings in the "lifestyle" domain (Supplementary Table S6), suggesting that these models are well-equipped to provide accurate and helpful information on topics such as diet, exercise, and smoking cessation. ChatGPT-4.0’s strong performance across multiple domains suggests a broad and deep understanding of CVD prevention concepts, while BARD’s suboptimal performance elsewhere warrants further investigation. The variations in performance across the other domains highlight the importance of carefully evaluating LLMs’ capabilities in specific areas before relying on them as sources of medical information.
Implications for Health Literacy
The study’s findings hold important implications for efforts to improve cardiovascular health literacy. As individuals increasingly turn to online resources for medical information, LLMs have the potential to serve as valuable tools for enhancing understanding of CVD prevention. By providing accurate and accessible information, LLMs can bridge gaps in knowledge and empower individuals to make informed decisions about their health. LLMs can be used to create personalized educational materials, answer common questions about CVD prevention strategies, and provide tailored recommendations based on individual risk factors. By democratizing access to medical information, LLMs can help to reduce health disparities and empower individuals to take control of their cardiovascular health. However, it is crucial to ensure that the information provided by LLMs is accurate, easy to understand, and culturally sensitive to the needs of diverse populations.
Disparities in Performance
The study also revealed significant disparities in LLM performance across different languages. The finding that LLMs generally performed better with English prompts than with Chinese prompts highlights the potential for language bias in these models. Addressing this issue is crucial to ensure that LLMs provide equitable access to accurate medical information for all individuals, regardless of their native language. Several strategies can be used to mitigate language bias in LLMs. These include increasing the representation of non-English language data in the training datasets, developing language-specific LLMs that are tailored to the unique characteristics of different languages, and implementing techniques for cross-lingual transfer learning that allow LLMs to leverage knowledge gained from English-language data to improve their performance in other languages.
The Role of Language-Specific Models
The analysis of ERNIE’s performance in Chinese provides valuable insights into the role of language-specific LLMs. ERNIE’s strengths in temporal improvement and self-awareness of correctness suggest that models tailored for specific languages can effectively address linguistic nuances and cultural contexts. Further development and refinement of language-specific LLMs may be essential to optimize the delivery of medical information to diverse populations. Language-specific LLMs can be trained on datasets that are curated specifically for a particular language and culture, allowing them to develop a deeper understanding of the nuances of that language and the cultural context in which it is used. These models can also be designed to incorporate domain-specific knowledge that is relevant to the target language and culture, such as traditional medical practices and local health beliefs. By tailoring LLMs to the specific needs of different language communities, we can ensure that these models provide accurate, relevant, and culturally sensitive information to all individuals, regardless of their native language.
Limitations and Future Directions
While this study provides valuable insights into the capabilities of LLMs in addressing CVD prevention queries, certain limitations must be acknowledged. The questions used represent only a subset of possible CVD prevention queries, and the generalizability of the findings is subject to the stochastic nature of LLM responses. Additionally, the rapid evolution of LLMs requires ongoing research to accommodate updated iterations and emerging models. Future studies should use larger and more diverse question sets to assess performance across a wider range of CVD prevention topics; explore the impact of different interaction patterns with LLMs, such as follow-up questions and the provision of feedback on responses; and carefully examine the ethical considerations surrounding the use of LLMs in medical contexts, including data privacy, algorithmic bias, and the potential for over-reliance on these models.
Conclusion
In conclusion, these findings underscore the promise of LLMs as tools for enhancing public understanding of cardiovascular health, while also emphasizing the need for careful evaluation and ongoing refinement to ensure accuracy, fairness, and responsible dissemination of medical information. The path forward involves continuous comparative evaluations, addressing language biases, and leveraging the strengths of language-specific models to promote equitable access to accurate and reliable CVD prevention guidance. As LLMs continue to evolve and become more integrated into healthcare, it is imperative that we prioritize the development of models that are accurate, unbiased, and culturally sensitive, and that we implement safeguards to ensure that these models are used responsibly and ethically. By doing so, we can harness the power of LLMs to improve cardiovascular health literacy and empower individuals to make informed decisions about their health.