Mistral Medium 3: Ambition vs. Reality

Mistral AI, a French startup, recently unveiled its latest multimodal model, Mistral Medium 3, which has drawn considerable attention within the industry. Mistral claims the model matches or exceeds 90% of Claude 3.7 Sonnet's performance while costing less than DeepSeek V3, positioning it as a cost-effective option. However, hands-on testing has revealed discrepancies between the official claims and the model's actual performance, sparking debate about the veracity of its advertised capabilities.

Core Highlights of Mistral Medium 3

Mistral outlined several core highlights of Mistral Medium 3 in its official blog post:

  • Performance and Cost Balance: Mistral Medium 3 aims to deliver leading-edge performance while cutting costs to roughly one-eighth those of comparable models, simplifying deployment and accelerating enterprise adoption.
  • Exceptional Performance in Specialized Applications: The model demonstrates outstanding performance in specialized application scenarios such as code generation and multimodal understanding.
  • Enterprise-Grade Features: Mistral Medium 3 offers a range of enterprise-grade features, including support for hybrid cloud deployments, on-premises deployments, deployments within VPCs, customized post-training, and integration with enterprise tools and systems.

The Mistral Medium 3 API is now available on Mistral's La Plateforme and Amazon SageMaker, and will soon be available on IBM watsonx, NVIDIA NIM, Azure AI Foundry, and Google Cloud Vertex AI.
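As a sketch of what calling the hosted model might look like, assuming Mistral's standard OpenAI-style chat-completions endpoint (the endpoint path and the `mistral-medium-latest` model identifier are assumptions here; check the official API reference before relying on them):

```python
import json
import os
import urllib.request

# Assumed endpoint for Mistral's chat-completions API; the model name
# below is likewise an assumption, not confirmed by this article.
API_URL = "https://api.mistral.ai/v1/chat/completions"


def build_payload(prompt: str, model: str = "mistral-medium-latest") -> dict:
    """Construct the JSON body for a single-turn chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }


def send_request(prompt: str) -> dict:
    """POST the payload; requires MISTRAL_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(build_payload("Write a haiku about Paris."))
```

The same request shape should work against the SageMaker and other managed endpoints listed above, modulo each platform's own authentication scheme.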

Balancing Performance and Cost

A key selling point of Mistral Medium 3 is its ability to deliver cutting-edge performance while significantly reducing costs. Official figures indicate that Mistral Medium 3 reaches or exceeds 90% of Claude 3.7 Sonnet's scores across a range of benchmarks at a fraction of the cost: $0.40 per million input tokens and $2.00 per million output tokens.
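At those advertised rates, per-request cost is straightforward to estimate. A minimal sketch (the token counts below are made-up example values, not measurements from the article):

```python
# Advertised Mistral Medium 3 pricing (USD per million tokens).
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 2.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the advertised rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example: a 2,000-token prompt with a 500-token completion.
print(f"${request_cost(2_000, 500):.6f}")  # -> $0.001800
```

A full million tokens in and out would come to $2.40 at these rates, which is the basis for the cost comparisons against Claude 3.7 Sonnet and DeepSeek V3 above.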

In addition, Mistral Medium 3’s performance surpasses leading open-source models such as Llama 4 Maverick and Cohere Command A. Whether deployed via API or self-hosted, Mistral Medium 3 is more cost-effective than DeepSeek V3.

Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments with four or more GPUs, providing enterprises with greater flexibility.

The Pursuit of Top-Tier Performance

Mistral claims that Mistral Medium 3 is designed to be a top-performing model, particularly excelling in coding and STEM tasks, with performance approaching that of larger, slower competitors.

The table provided by Mistral indicates that Mistral Medium 3 essentially surpasses Llama 4 Maverick and GPT-4o, approaching the levels of Claude 3.7 Sonnet and DeepSeek 3.1. However, these figures come primarily from academic benchmarks and may not fully reflect the model's performance in real-world applications.

Supplementing with Human Evaluations

To more comprehensively evaluate the performance of Mistral Medium 3, Mistral also released third-party human evaluation results. Human evaluations are more representative of real-world use cases and can compensate for the limitations of academic benchmarks.

The human evaluation results show that Mistral Medium 3 performs especially well in the coding domain, where evaluators preferred its outputs over those of competitors across the board. This suggests that Mistral Medium 3 may hold real advantages in practical applications.

Design for Enterprise-Level Applications

Mistral positions Medium 3 as better suited to enterprise environments than other state-of-the-art models. For companies facing the difficult choice between fine-tuning through APIs and self-deploying and customizing model behavior from scratch, Mistral Medium 3 offers a path to fully integrating the model into enterprise systems.

To further meet enterprise needs, Mistral also launched Le Chat Enterprise, a chatbot service powered by the Mistral Medium 3 model. Le Chat Enterprise provides an AI agent building tool and integrates Mistral’s models with third-party services such as Gmail, Google Drive, and SharePoint. It aims to address AI challenges faced by enterprises, such as tool fragmentation, insecure knowledge integration, rigid models, and slow ROI, providing a unified AI platform for all organizational work.

Le Chat Enterprise will soon support the MCP protocol, a standard proposed by Anthropic for connecting AI to data systems and software.

Mistral’s Future Outlook

Mistral revealed in its blog that, with Mistral Small and Mistral Medium now released, it has something "large" planned for the coming weeks: Mistral Large. The company stated that the newly released Mistral Medium already far outperforms leading open-source models such as Llama 4 Maverick, raising expectations for Mistral Large.

The release of Mistral Large will undoubtedly further enhance Mistral’s competitiveness in the AI field and provide users with more options.

Discrepancies in Actual Testing

Despite Mistral's confidence in Mistral Medium 3 and its claim that the model delivers at least 90% of Claude 3.7 Sonnet's performance, actual testing has exposed some issues.

Media outlets and users quickly put Mistral Medium 3 to the test, and the results were disappointing. In an evaluation based on the New York Times Connections word-grouping puzzles, Medium 3 ranked near the bottom. In a newer 100-question evaluation, it did not place among the top-performing models.

Some users who tested Medium 3 reported that its writing ability was unchanged, with no significant improvement, although it did sit on the cost-performance Pareto frontier in one LLM evaluation.

Zhu Liang’s test results showed that Mistral Medium 3 performed solidly in both code generation and text generation, ranking in the top five in both evaluations.

In a simple coding task (Next.js TODO application):

  • It generated concise and clear responses.
  • Its score was similar to Gemini 2.5 Pro and Claude 3.5 Sonnet.
  • It was inferior to DeepSeek V3 (new) and GPT-4.1.

In a complex coding task (benchmark visualization):

  • It produced average results similar to Gemini 2.5 Pro and DeepSeek V3 (new).
  • It was inferior to GPT-4.1, o3, and Claude 3.7 Sonnet.

In writing:

  • Its content covered most of the key points, but the format was incorrect.
  • Its score was similar to DeepSeek V3 (new) and Claude 3.7 Sonnet.
  • It was inferior to GPT-4.1 and Gemini 2.5 Pro.

The well-known influencer "karminski-牙医" found after testing that Mistral Medium 3 was not as powerful as the official hype suggested, even advising users not to download it so as not to waste bandwidth and disk space.

Conclusion

Mistral Medium 3, as an ambitious European entry in the AI field, seeks a balance between performance and cost and is optimized for enterprise applications. However, the gap between actual test results and the official promotion suggests that Mistral may have exaggerated its model's performance.

Despite this, Mistral Medium 3 still shows potential, especially in coding and text generation. Going forward, Mistral will need to improve its model's performance and subject it to more rigorous real-world testing to win users' trust. Meanwhile, the release of Mistral Large is worth watching, as it may address Mistral Medium 3's shortcomings and deliver a better experience.

In conclusion, the release of Mistral Medium 3 reflects Europe's active exploration and innovation in the AI field. Although there is a gap between actual performance and expectations, Mistral remains worth following, and its future development merits attention.