Mistral Medium 3: Europe’s AI Challenger and the Gap Between Claims and Real-World Performance
French startup Mistral AI recently released its latest multimodal model, Mistral Medium 3, claiming performance rivaling Claude 3.7 Sonnet at a lower cost than DeepSeek V3. The announcement immediately drew widespread attention in the tech world. After hands-on testing, however, users found that the model’s performance fell well short of the official claims, with some even suggesting it is not worth the time and resources to download.
Official Claims of Mistral Medium 3
Mistral AI highlighted several core strengths of Mistral Medium 3 in its official blog:
- Performance and Cost Balance: Mistral Medium 3 aims to deliver top-tier performance at one-eighth the cost, accelerating enterprise adoption.
- Advantages in Professional Application Scenarios: The model excels in professional fields such as code writing and multimodal understanding.
- Enterprise-Grade Features: Mistral Medium 3 offers a range of enterprise-grade features, including support for hybrid cloud deployment, local deployment, and deployment within VPCs, as well as customized post-training and integration into enterprise tools and systems.
The Mistral Medium 3 API is already available on Mistral La Plateforme and Amazon SageMaker, with launches planned soon on IBM watsonx, NVIDIA NIM, Azure AI Foundry, and Google Cloud Vertex AI.
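For readers who want to try the hosted model, a minimal sketch of an API call is shown below. It assumes La Plateforme’s OpenAI-style chat completions endpoint and the model identifier `mistral-medium-latest`; check Mistral’s documentation for the current endpoint and model names.

```python
import os
import requests

# Minimal sketch: query Mistral Medium 3 via La Plateforme's
# chat completions endpoint. The endpoint path and the model
# identifier "mistral-medium-latest" are assumptions; consult
# the official docs before relying on them.
API_URL = "https://api.mistral.ai/v1/chat/completions"
API_KEY = os.environ["MISTRAL_API_KEY"]  # set this in your shell

payload = {
    "model": "mistral-medium-latest",
    "messages": [
        {"role": "user", "content": "Write a haiku about benchmarks."}
    ],
    "temperature": 0.7,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```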
Comparison of Performance Metrics
Mistral AI claims that Mistral Medium 3 matches or exceeds 90% of Claude 3.7 Sonnet’s scores across a range of benchmarks at a significantly lower cost: $0.40 per million input tokens and $2.00 per million output tokens.
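At those list prices, estimating the cost of a workload is simple arithmetic; the sketch below computes the bill for a hypothetical job (the token counts are illustrative, not figures from the announcement).

```python
# Back-of-the-envelope cost estimate at the quoted list prices.
INPUT_PRICE = 0.40   # USD per million input tokens
OUTPUT_PRICE = 2.00  # USD per million output tokens

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the total cost in USD for one workload."""
    return (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE

# Example: a batch job with 5M input tokens and 1M output tokens
# costs 5 * 0.40 + 1 * 2.00 = 4.00 USD.
print(f"${cost_usd(5_000_000, 1_000_000):.2f}")
```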
Furthermore, Mistral Medium 3 is said to outperform leading open-source models such as Llama 4 Maverick and Cohere Command A. Whether accessed via API or self-deployed, it is claimed to cost less than DeepSeek V3, and the model can be deployed on any cloud, including self-hosted environments with four or more GPUs.
Focus on Enterprise-Grade Applications
Mistral AI emphasizes that Mistral Medium 3 is designed to be a top-performing model, standing out in coding and STEM tasks, with performance approaching that of much larger, slower competitors.
Official data shows Mistral Medium 3 generally surpassing Llama 4 Maverick and GPT-4o, and approaching the level of Claude 3.7 Sonnet and DeepSeek 3.1.
To further substantiate these claims, Mistral AI also released third-party human evaluation results, which it says are more representative of real-world use cases. According to those results, Mistral Medium 3 performs strongly in coding and beats its competitors across the board.
Mistral Medium 3 is also said to outperform other SOTA models in adaptability to enterprise environments, giving enterprises a way to integrate intelligence fully into their systems and addressing challenges such as fine-tuning via API and model customization.
Le Chat Enterprise
Mistral AI also launched Le Chat Enterprise, a chatbot service for enterprises powered by the Mistral Medium 3 model. It provides an AI agent building tool and integrates Mistral’s models with third-party services such as Gmail, Google Drive, and SharePoint.
Le Chat Enterprise aims to address the AI challenges faced by enterprises, such as tool fragmentation, insecure knowledge integration, rigid models, and slow return on investment, providing a unified AI platform for all organizational work.
Le Chat Enterprise will soon support MCP (Model Context Protocol), a standard proposed by Anthropic for connecting AI systems to data sources and software.
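To make that concrete: MCP is built on JSON-RPC 2.0, so a client invoking a tool exposed by an MCP server sends a message shaped roughly like the sketch below. The tool name `search_drive` and its arguments are invented for illustration, standing in for a connector such as the Google Drive integration mentioned above.

```python
import json

# Rough shape of an MCP "tools/call" request (JSON-RPC 2.0).
# The tool name "search_drive" and its arguments are hypothetical,
# standing in for a data-source connector like Google Drive.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_drive",
        "arguments": {"query": "Q2 planning doc"},
    },
}
print(json.dumps(request, indent=2))
```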
Outlook for Mistral Large
Mistral AI also hinted in its blog that, with Mistral Small and Mistral Medium now released, something “large” is coming in the next few weeks: Mistral Large. Since the newly released Mistral Medium already performs well ahead of top open-source models such as Llama 4 Maverick, the company argues, Mistral Large is all the more worth looking forward to.
The Real Situation of User Testing
However, after Mistral AI’s heavy promotion of Mistral Medium 3’s capabilities, media outlets and users quickly ran their own tests, and the results were disappointing.
Performance Test Discrepancy
In an evaluation based on the New York Times Connections word-grouping puzzles, Mistral Medium 3’s performance was disappointing: it barely registered, and in the newer 100-question version of the evaluation it did not rank among the top models.
Some users reported after testing that Mistral Medium 3’s writing ability showed no obvious improvement, although it does sit on the Pareto front (performance relative to cost) in at least one LLM evaluation.
Zhu Liang’s test found that Mistral Medium 3 performed solidly in both code writing and text generation, ranking among the top five in both evaluations.
Performance in Coding Tasks
In a simple coding task (a Next.js TODO application), Mistral Medium 3 generated concise, clear output, scoring on par with Gemini 2.5 Pro and Claude 3.5 Sonnet but behind DeepSeek V3 (new) and GPT-4.1.
In a complex coding task (benchmark visualization), it produced middling results similar to Gemini 2.5 Pro and DeepSeek V3 (new), but not as good as GPT-4.1, o3, and Claude 3.7 Sonnet.
Writing Ability Evaluation
On writing, Mistral Medium 3 covered most of the key points but got the formatting wrong, scoring on par with DeepSeek V3 (new) and Claude 3.7 Sonnet but behind GPT-4.1 and Gemini 2.5 Pro.
The well-known tester “karminski-dentist” likewise reported that Mistral Medium 3 is not as strong as the official hype suggests, and advised users to skip the download to avoid wasting bandwidth and disk space.
Comparison and Reflection
The case of Mistral Medium 3 is another reminder that, when evaluating AI models, we cannot rely solely on official publicity and benchmark results; users’ real-world experience and third-party evaluations deserve more weight.
Official publicity tends to showcase a model’s strengths selectively while glossing over its shortcomings. Benchmarks offer some reference value, but they cannot fully reflect how a model performs in the real world. Hands-on experience and third-party evaluations are more objective and comprehensive, and help us understand a model’s strengths and weaknesses more accurately.
In addition, an AI model’s performance depends on many factors, including training data, model architecture, and optimization methods, and different models show different strengths and weaknesses across tasks. Choosing a model therefore requires weighing the specific application scenario and requirements.
The stark contrast between Mistral Medium 3’s launch claims and user test results has also sparked discussion about evaluation standards: how to build a more scientific, objective, and comprehensive evaluation system for AI models is a question worth exploring in depth.
Industry Impact
The Mistral Medium 3 episode has also had ripple effects across the AI industry. On the one hand, it reminds AI companies to prioritize user experience and avoid overstated or misleading marketing; on the other, it pushes practitioners to refine how AI models are evaluated.
As AI technology continues to develop, model performance will keep improving and application scenarios will keep expanding. We should view the technology with a rational, objective attitude, recognizing both its great potential and its limitations; only then can we use it to create real value for society.
In conclusion, the case of Mistral Medium 3 is a cautionary tale: keep a critical eye when evaluating AI models, do not blindly trust official publicity, and combine hands-on experience with third-party evaluations to reach a sound judgment.