A recent benchmark study by the French startup Giskard has cast a spotlight on the significant shortcomings of some of the most widely used language models (LLMs) in the artificial intelligence landscape. This study meticulously assesses the propensity of these models to generate harmful content, hallucinate information, and exhibit various biases in their responses.
Identifying the Riskiest LLMs: A Comprehensive Evaluation
Giskard’s benchmark, released in April, delves into the potential risks associated with LLMs, providing a reliable evaluation of their tendency to fabricate information, produce toxic outputs, and display prejudiced or stereotypical viewpoints. The study’s findings offer valuable insights for developers, researchers, and organizations seeking to deploy AI models responsibly.
The benchmark meticulously examines several critical aspects of LLM performance, including:
- Hallucination: The model’s tendency to generate false or nonsensical information.
- Harmfulness: The model’s propensity to produce dangerous, offensive, or inappropriate content.
- Bias and Stereotypes: The model’s inclination to perpetuate unfair or discriminatory viewpoints.
By evaluating these factors, Giskard’s benchmark provides a comprehensive assessment of the overall risk associated with different LLMs.
Ranking the LLMs with the Most Significant Flaws
The study’s findings reveal a ranking of LLMs based on their performance across these key metrics. The lower the score, the more problematic the model is considered to be. The table below summarizes the results:
Model | Overall Average | Hallucination | Harmfulness | Bias & Stereotypes | Developer |
---|---|---|---|---|---|
GPT-4o mini | 63.93% | 74.50% | 77.29% | 40.00% | |
Grok 2 | 65.15% | 77.35% | 91.44% | 26.67% | xAI |
Mistral Large | 66.00% | 79.72% | 89.38% | 28.89% | Mistral |
Mistral Small 3.1 24B | 67.88% | 77.72% | 90.91% | 35.00% | Mistral |
Llama 3.3 70B | 67.97% | 73.41% | 86.04% | 44.44% | Meta |
Deepseek V3 | 70.77% | 77.91% | 89.00% | 45.39% | Deepseek |
Qwen 2.5 Max | 72.71% | 77.12% | 89.89% | 51.11% | Alibaba Qwen |
GPT-4o | 72.80% | 83.89% | 92.66% | 41.85% | OpenAI |
Deepseek V3 (0324) | 73.92% | 77.86% | 92.80% | 51.11% | Deepseek |
Gemini 2.0 Flash | 74.89% | 78.13% | 94.30% | 52.22% | |
Gemma 3 27B | 75.23% | 69.90% | 91.36% | 64.44% | |
Claude 3.7 Sonnet | 75.53% | 89.26% | 95.52% | 41.82% | Anthropic |
Claude 3.5 Sonnet | 75.62% | 91.09% | 95.40% | 40.37% | Anthropic |
Llama 4 Maverick | 76.72% | 77.02% | 89.25% | 63.89% | Meta |
Llama 3.1 405B | 77.59% | 75.54% | 86.49% | 70.74% | Meta |
Claude 3.5 Haiku | 82.72% | 86.97% | 95.36% | 65.81% | Anthropic |
Gemini 1.5 Pro | 87.29% | 87.06% | 96.84% | 77.96% |
The benchmark encompassed 17 widely used models, carefully selected to represent the current AI landscape. Giskard prioritized evaluating stable and widely adopted models over experimental or unfinalized versions, ensuring the relevance and reliability of the results. This approach excludes models that are primarily designed for reasoning tasks, as they are not the primary focus of this benchmark.
Identifying the Worst Performers Across All Categories
The Phare benchmark’s initial findings largely align with existing community perceptions and feedback. The top five "worst" performing models (out of the 17 tested) include GPT-4o mini, Grok 2, Mistral Large, Mistral Small 3.1 24B, and Llama 3.3 70B. Conversely, the models demonstrating the best performance include Gemini 1.5 Pro, Claude 3.5 Haiku, and Llama 3.1 405B. These findings are significant for understanding the current state of LLM development and the challenges that developers face in mitigating these issues. The variability in performance highlights the different approaches and priorities taken by different organizations in their model training and deployment strategies. Further research and development are crucial to improve the safety and reliability of these models, especially as they are increasingly integrated into various applications and services.
The identification of the worst performers also serves as a call to action for developers to focus on specific areas of improvement. For instance, the high hallucination rates in some models indicate a need for better fact-checking mechanisms and more robust training data. Similarly, the prevalence of harmful content generation underscores the importance of implementing effective safeguards and content moderation strategies. Addressing these shortcomings is essential to ensure that LLMs are used responsibly and do not contribute to the spread of misinformation or harmful ideologies. The benchmark provides a valuable tool for tracking progress and identifying areas where further innovation is needed.
Hallucination Hotspots: Models Prone to Fabricating Information
When solely considering the hallucination metric, Gemma 3 27B, Llama 3.3 70B, GPT-4o mini, Llama 3.1 405B, and Llama 4 Maverick emerge as the models most prone to generating false or misleading information. In contrast, Anthropic demonstrates strength in this area, with three of its models exhibiting the lowest hallucination rates: Claude 3.5 Sonnet, Claude 3.7 Sonnet, and Claude 3.5 Haiku, along with Gemini 1.5 Pro and GPT-4o. This stark contrast underscores the varying levels of reliability across different LLMs. The high hallucination rates in certain models raise concerns about their suitability for applications where accuracy is paramount, such as medical diagnosis or legal research. Users should be aware of these limitations and exercise caution when relying on these models for critical decision-making.
The relatively low hallucination rates observed in Anthropic’s models and Gemini 1.5 Pro suggest that certain training techniques and architectures are more effective at mitigating this issue. These models may benefit from more rigorous fact-checking mechanisms, better integration of external knowledge sources, or more sophisticated methods for detecting and correcting errors. By studying the approaches used by these high-performing models, developers can gain valuable insights into how to improve the accuracy and reliability of LLMs in general. Further research is needed to fully understand the factors that contribute to hallucination and to develop more effective strategies for preventing it.
Dangerous Content Generation: Models with Weak Safeguards
Regarding the generation of dangerous or harmful content (assessing the model’s ability to recognize problematic inputs and respond appropriately), GPT-4o mini performs the poorest, followed by Llama 3.3 70B, Llama 3.1 405B, Deepseek V3, and Llama 4 Maverick. On the other hand, Gemini 1.5 Pro consistently demonstrates the best performance, closely followed by Anthropic’s three models (Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3.5 Haiku) and Gemini 2.0 Flash. The ability of a model to discern and appropriately respond to prompts that could lead to dangerous or unethical outcomes is crucial for ensuring responsible AI deployment. The disparities in performance highlight the diverse approaches taken in implementing safety measures and content filters.
The models that perform poorly in this category may lack robust mechanisms for identifying and blocking harmful inputs, or they may be more susceptible to adversarial attacks that bypass these safeguards. Strengthening these defenses is essential to prevent LLMs from being used to generate malicious content, such as hate speech, disinformation, or instructions for illegal activities. The superior performance of Gemini 1.5 Pro and Anthropic’s models suggests that more sophisticated techniques, such as reinforcement learning from human feedback (RLHF) and adversarial training, can be effective in mitigating this risk. Continual monitoring and refinement of these safeguards are necessary to stay ahead of evolving threats and ensure that LLMs are used in a safe and ethical manner.
Bias and Stereotypes: A Persistent Challenge
The presence of bias and stereotypes in LLMs remains a significant area requiring improvement. The Phare benchmark results indicate that LLMs still exhibit marked biases and stereotypes in their outputs. Grok 2 receives the worst score in this category, followed by Mistral Large, Mistral Small 3.1 24B, GPT-4o mini, and Claude 3.5 Sonnet. Conversely, Gemini 1.5 Pro achieves the best scores, followed by Llama 3.1 405B, Claude 3.5 Haiku, Gemma 3 27B, and Llama 4 Maverick. The persistence of bias in LLMs is a complex problem that stems from the biased data used to train these models. These biases can perpetuate and amplify existing societal inequalities, leading to unfair or discriminatory outcomes.
Addressing this issue requires a multi-faceted approach, including careful curation of training data, development of bias detection and mitigation techniques, and ongoing monitoring of model outputs. It is also important to consider the ethical implications of using LLMs in sensitive applications, such as hiring or loan approvals, where biased outputs could have significant consequences. The better performance of Gemini 1.5 Pro and certain other models suggests that some progress is being made in this area, but much work remains to be done to ensure that LLMs are fair and equitable. The use of diverse and representative training data, along with techniques such as adversarial debiasing, can help to reduce the impact of bias on model outputs. Transparency and accountability are also essential for building trust in these systems and ensuring that they are used responsibly.
While model size can influence the generation of toxic content (smaller models tend to produce more "harmful" outputs), the number of parameters is not the sole determinant. According to Matteo Dora, CTO of Giskard, "Our analyses demonstrate that the sensitivity to user wording varies considerably across different providers. For example, Anthropic’s models seem less influenced by the way questions are phrased compared to their competitors, regardless of their size. The manner of asking the question (requesting a brief or detailed answer) also has varying effects. This leads us to believe that specific training methods, such as reinforcement learning from human feedback (RLHF), are more significant than size." This highlights the importance of training methodologies in shaping the behavior of LLMs.
The observation that Anthropic’s models are less sensitive to user wording suggests that they have been trained to be more robust and less susceptible to manipulation. This could be due to the use of more diverse and comprehensive training data, or to the implementation of more sophisticated techniques for detecting and mitigating adversarial inputs. The fact that the manner of asking the question can have varying effects on different models underscores the need for careful consideration of prompt engineering and user interface design. Developers should strive to create interfaces that are intuitive and user-friendly, and that minimize the potential for unintended or harmful outputs. The emphasis on reinforcement learning from human feedback (RLHF) highlights the crucial role of human oversight in training and refining LLMs. By incorporating human feedback into the training process, developers can ensure that models are aligned with human values and that they exhibit more desirable behaviors.
A Robust Methodology for Evaluating LLMs
Phare employs a rigorous methodology to assess LLMs, utilizing a private dataset of approximately 6,000 conversations. To ensure transparency while preventing manipulation of model training, a subset of approximately 1,600 samples has been made publicly available on Hugging Face. The researchers collected data in multiple languages (French, English, Spanish) and designed tests that reflect real-world scenarios. The use of a large and diverse dataset is essential for ensuring that the benchmark is comprehensive and representative.
The inclusion of data in multiple languages is particularly important for assessing the performance of LLMs in multilingual settings. By testing models on a variety of languages, researchers can identify potential biases or weaknesses that may not be apparent when testing only on English. The design of tests that reflect real-world scenarios is also crucial for ensuring that the benchmark is relevant and practical. These tests should simulate the types of tasks and interactions that LLMs are likely to encounter in real-world applications. The public availability of a subset of the data on Hugging Face promotes transparency and allows other researchers to validate and build upon the findings of the benchmark. This collaborative approach is essential for advancing the field of LLM evaluation and ensuring that these models are used responsibly.
The benchmark assesses various sub-tasks for each metric:
Hallucination
- Factuality: The model’s ability to generate factual responses to general knowledge questions.
- Accuracy with False Information: The model’s ability to provide accurate information when responding to prompts containing false elements.
- Handling Dubious Claims: The model’s ability to process dubious claims (pseudoscience, conspiracy theories).
- Tool Usage without Hallucination: The model’s ability to use tools without generating false information.
Harmfulness
The researchers evaluated the model’s ability to recognize potentially dangerous situations and provide appropriate warnings. This involves assessing the model’s understanding of context, its ability to identify potential risks, and its capacity to generate appropriate responses that mitigate those risks. The evaluation considers a range of potentially harmful scenarios, including those involving violence, self-harm, discrimination, and illegal activities.
Bias & Fairness
The benchmark focuses on the model’s ability to identify biases and stereotypes generated in its own outputs. This involves assessing the model’s awareness of its own biases and its capacity to recognize and correct biased outputs. The evaluation considers a range of different types of bias, including gender bias, racial bias, and religious bias.
Collaboration with Leading AI Organizations
Phare’s significance is further enhanced by its direct focus on metrics crucial for organizations seeking to utilize LLMs. The detailed results for each model are publicly available on the Giskard website, including breakdowns by sub-task. The benchmark is financially supported by the BPI (French Public Investment Bank) and the European Commission. Giskard has also partnered with Mistral AI and DeepMind on the technical aspects of the project. The LMEval framework for utilization was developed in direct collaboration with the Gemma team at DeepMind, ensuring data privacy and security. These collaborations and partnerships underscore the importance of a collective effort in addressing the challenges associated with LLMs.
The financial support from the BPI and the European Commission highlights the public interest in ensuring the responsible development and deployment of AI technologies. The technical partnerships with Mistral AI and DeepMind bring valuable expertise and resources to the project, allowing for more comprehensive and rigorous evaluations. The direct collaboration with the Gemma team at DeepMind in developing the LMEval framework ensures that data privacy and security are prioritized throughout the evaluation process. The public availability of detailed results on the Giskard website promotes transparency and allows organizations to make informed decisions about which LLMs to use.
Looking ahead, the Giskard team plans to add two key features to Phare: "Probably by June, we will add a module to evaluate resistance to jailbreaks and prompt injection," says Matteo Dora. Additionally, the researchers will continue to update the leaderboard with the latest stable models, with Grok 3, Qwen 3, and potentially GPT-4.1 on the horizon. The addition of a module to evaluate resistance to jailbreaks and prompt injection is crucial for assessing the robustness of LLMs against adversarial attacks. These attacks can be used to bypass safety measures and generate harmful or inappropriate content. By testing models against these types of attacks, researchers can identify vulnerabilities and develop strategies for strengthening their defenses.
The continuous updating of the leaderboard with the latest stable models ensures that the benchmark remains relevant and up-to-date. As new models are released, it is important to evaluate their performance across a range of metrics to ensure that they meet the required standards for safety, reliability, and fairness. The inclusion of Grok 3, Qwen 3, and potentially GPT-4.1 on the horizon demonstrates the commitment to keeping the benchmark at the forefront of LLM evaluation. These ongoing efforts will contribute to a better understanding of the capabilities and limitations of LLMs and will help to guide the responsible development and deployment of these powerful technologies. The future enhancements of the Phare benchmark will provide even more comprehensive insights into the strengths and weaknesses of different LLMs, enabling developers and users to make more informed decisions and contributing to the overall advancement of the field.