Phi-4-Reasoning: SLMs Can Reason Like Giants

Microsoft’s Phi-4 Reasoning delivers compact, open-weight (MIT-licensed), fast, and efficient SLMs capable of advanced reasoning.

Microsoft, while a privileged partner of OpenAI and working with most major players to integrate their AI models into Azure AI Foundry, doesn’t shy away from pursuing its own technological avenues. These include innovations at the core of neural networks, such as the intriguing BitNet b1.58 model built on ternary weights (trits), its own open-source SLMs, and even frontier models kept under wraps (Project MAI-1).

A year after introducing its Phi-3 range of small language models (SLMs), and two months after debuting the fourth generation with a multimodal SLM (Phi-4-Multimodal) and a tiny model (Phi-4-mini), Microsoft has announced three new variants of its latest-generation SLM: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning.

Released on April 30, 2025, these “reasoning-integrated” versions expand the open-weight offering of compact models for developers who need complex reasoning while keeping latency low.

At the heart of Microsoft engineers’ approach to making its SLMs “reason”: supervised fine-tuning (SFT) on reasoning chains generated by OpenAI’s o3-mini, and reinforcement learning (RL) for the “plus” version. “Through distillation, reinforcement learning, and high-quality data, these models reconcile size and performance,” Microsoft explains.

Small but Gifted

The results on the market’s leading benchmarks are enough to make the competition pale: with only 14 billion parameters, Phi-4-reasoning outperforms DeepSeek-R1-Distill-Llama-70B (70 billion parameters) on the AIME 2025, MMLU-Pro, and HumanEval-Plus series, and approaches the full DeepSeek-R1 model (671 billion parameters)! The Phi-4-reasoning-plus variant, built on the same 14 billion parameters but trained on 1.5 times more tokens, nearly matches OpenAI’s o3-mini scores on OmniMath! For reference, Phi-4-reasoning has a 128,000-token context window, extended to 256,000 tokens for the Phi-4-reasoning-plus version.

Designed for embedded systems, Phi-4-mini-reasoning packs 3.8 billion parameters, was trained on a synthetic set of one million mathematical problems generated by DeepSeek-R1, and achieves o1-mini-level performance on Math-500 while surpassing several 7-to-8-billion-parameter models. Its ultra-small size makes it ideal for local execution, including on mobile devices, and for near-instant responses; it is particularly suited to educational uses and local chatbots.
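
For readers who want to try this locally, here is a minimal sketch of single-prompt inference with the Hugging Face transformers library. The model identifier matches the public Hugging Face release (verify against the model card), and the generation settings are illustrative defaults rather than Microsoft’s recommended configuration.

```python
# Minimal local-inference sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"  # assumed public model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # bf16/fp16 where the hardware supports it
    device_map="auto",    # CPU, GPU, or Apple Silicon as available
)

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit long chains of thought, so leave generous headroom.
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```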

Open Models for Varied Uses

On the deployment side, CISOs will find these models already optimized for Copilot+ PCs: the NPU variant, “Phi Silica”, is preloaded into memory and provides near-instant response times, ensuring energy-efficient coexistence with business applications. Windows APIs allow offline generation to be integrated into Outlook or internal tools.

In terms of security, Microsoft claims a pipeline aligned with its responsible-AI principles: accountability, fairness, reliability, safety, and inclusion. The models undergo post-training combining SFT, Direct Preference Optimization, and RLHF on public and internal datasets oriented toward “helpfulness” and “harmlessness”. Microsoft also publishes “Cards” for its models, detailing residual limitations and mitigation measures.

Available now on Azure AI Foundry, Hugging Face, and GitHub Models, the three models are published under the very permissive MIT license, opening the way to local inference as well as hybrid cloud deployments. For security and architecture teams, this new generation of SLMs offers a credible alternative to massive LLMs: reduced TCO, execution locally as well as at the edge, and greater control over data. These models are proof of the remarkable progress SLMs have made in a year, and of their potential in a world searching for cheaper, more energy- and resource-frugal AI.

A Deeper Dive into Phi-4’s Reasoning Capabilities

The arrival of the Phi-4 family of models represents a significant step forward in the development of small language models (SLMs). What sets these models apart is their enhanced reasoning abilities, achieved through innovative training techniques and a focus on high-quality data. Microsoft’s commitment to open-source principles further democratizes access to these powerful tools, empowering developers to integrate advanced AI capabilities into a wide range of applications.

Understanding the Architecture

The Phi-4 models are built upon a transformer architecture, a proven framework for natural language processing. However, Microsoft has implemented several key innovations to optimize the models for reasoning tasks.

  • Supervised Fine-Tuning (SFT): The models are trained with supervised fine-tuning (SFT) on detailed reasoning chains generated by OpenAI’s o3-mini model, which teaches them the steps involved in complex reasoning. By observing and mimicking the reasoning of a more capable model, Phi-4 distills that knowledge and applies it to new, unseen problems. Concretely, the model is fed input prompts paired with the corresponding reasoning chains and learns to predict the next step of the chain given the current state, repeated over a large number of training examples. The effectiveness of SFT depends on the quality and diversity of the chains used for training, so Microsoft selected reasoning chains from o3-mini that cover a wide range of reasoning tasks.
  • Reinforcement Learning (RL): The “plus” variant, Phi-4-reasoning-plus, uses reinforcement learning (RL) to further sharpen its reasoning. The model interacts with an environment of reasoning tasks and receives feedback in the form of rewards based on the accuracy and efficiency of its solutions: a positive reward for a correct answer, a negative reward for an incorrect one, and a bonus for solving a problem quickly and efficiently. The goal is to learn a policy that maximizes cumulative reward over time. RL lets Phi-4-reasoning-plus learn more complex and nuanced reasoning strategies than supervised learning alone: by exploring different approaches and receiving feedback on their effectiveness, the model can discover novel solutions to challenging problems.
  • Distillation: Distillation transfers knowledge from larger, more complex models to the smaller Phi-4 models, allowing the SLMs to approach the performance of much larger models while keeping their compact size and efficiency. A smaller “student” model is trained to mimic the behavior of a larger “teacher”: in addition to the hard training labels, the student receives soft targets from the teacher, i.e. probability distributions over the possible outputs rather than just the single correct answer. These distributions convey the teacher’s uncertainty, which helps the student learn more effectively. Distillation has proven effective at compressing large models without sacrificing much performance, which makes it especially valuable for SLMs, where size and efficiency are critical. A minimal sketch of this soft-target loss follows this list.
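
To make the soft-target idea concrete, here is a minimal PyTorch sketch of the classic distillation loss (Hinton et al., 2015). The details of Microsoft’s actual recipe are not public, so the temperature, weighting, and loss composition below are illustrative assumptions, not the Phi-4 implementation.

```python
# Classic soft-target distillation loss in PyTorch. Logits are
# per-token predictions flattened to [N, vocab_size]; all constants
# are illustrative, not Microsoft's recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: the teacher's distribution, softened by temperature
    # so that low-probability alternatives still carry signal.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```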

Benchmarking Performance

The Phi-4 models have demonstrated impressive performance on a variety of reasoning benchmarks, surpassing larger models in some cases. For instance, Phi-4-reasoning, with only 14 billion parameters, outperforms DeepSeek-R1-Distill-Llama-70B (70 billion parameters) on several challenging datasets, including AIME 2025, MMLU-Pro, and HumanEval-Plus. This highlights the efficiency and effectiveness of Phi-4’s architecture and training techniques.

The AIME (American Invitational Mathematics Examination) is a challenging mathematics competition for high school students. MMLU-Pro is a harder variant of the MMLU (Massive Multitask Language Understanding) benchmark, a collection of multiple-choice questions covering a wide range of topics across the humanities, social sciences, and STEM. HumanEval-Plus is a coding benchmark that tests the ability of language models to generate code from natural-language descriptions. That Phi-4-reasoning can outperform a much larger model like DeepSeek-R1-Distill-Llama-70B on these benchmarks is a testament to its reasoning capabilities, likely owing to the combination of supervised fine-tuning, reinforcement learning, and distillation used in its training.

The Phi-4-reasoning-plus variant, trained with 1.5 times more tokens, achieves scores close to OpenAI’s o3-mini on the OmniMath benchmark, a collection of mathematical problems that require a combination of algebraic, geometric, and trigonometric reasoning skills. Coming close to o3-mini, a much larger and more powerful model, on this benchmark is a significant achievement and demonstrates the effectiveness of reinforcement learning in enhancing the reasoning abilities of the Phi-4 models. The increase in training tokens also contributes to the improved performance, allowing the model to learn more complex patterns and relationships in the data.

Applications and Use Cases

The Phi-4 models are well-suited for a variety of applications that require advanced reasoning capabilities.

  • Educational Tools: The Phi-4-mini-reasoning model, with its small size and high performance, is ideal for educational applications. It can be used to create interactive learning tools that provide students with personalized feedback and support. For example, the model could be used to generate practice problems, provide hints and explanations, and assess student understanding. Its ability to run locally on mobile devices makes it accessible to students in a variety of settings. The Phi-4-mini-reasoning model could also be used to create personalized learning paths for students, adapting to their individual needs and learning styles.
  • Local Chatbots: The Phi-4 models can be used to build local chatbots that provide users with instant access to information and support. Their small size allows them to be deployed on mobile devices and other resource-constrained environments. A local chatbot powered by Phi-4 could answer questions, provide recommendations, and assist with tasks such as scheduling appointments or making reservations. The ability to run locally ensures that the chatbot is always available, even without an internet connection. This is particularly useful for users in areas with limited connectivity or for those who want to protect their privacy. A minimal chat-loop sketch follows this list.
  • Copilot+ PCs: The Phi-4 models are optimized for Copilot+ PCs, providing users with a seamless AI experience. The “Phi Silica” variant is preloaded into memory and provides near-instant response times. This allows users to access AI-powered features quickly and easily, without having to wait for the model to load or connect to the internet. The integration of Phi-4 into Copilot+ PCs enhances productivity, creativity, and overall user experience.
  • Offline Generation: The Windows APIs allow integrating offline generation into Outlook or internal tools, enabling users to access AI capabilities even when they are not connected to the internet. This is particularly useful for users who travel frequently or work in areas with limited internet access. Offline generation allows users to draft emails, create documents, and perform other tasks using AI, even when they are not connected to the cloud.
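
Building on the inference sketch earlier, a local chatbot reduces to a loop that accumulates conversation history and re-applies the chat template on every turn. The model id is again the assumed public release name, and the loop is a bare-bones illustration rather than a production design.

```python
# Bare-bones offline chat loop: history accumulates in `messages`
# and the chat template is re-applied on every turn.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"  # assumed public model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = []
while True:
    user = input("you> ").strip()
    if user.lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": user})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=512)
    reply = tokenizer.decode(
        outputs[0][inputs.shape[-1]:], skip_special_tokens=True
    )
    # Store the reply so the next turn sees the full conversation.
    messages.append({"role": "assistant", "content": reply})
    print("bot>", reply)
```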

Security and Responsibility

Microsoft is committed to developing and deploying AI models in a responsible and ethical manner. The Phi-4 models are no exception.

  • Responsibility Principles: Microsoft’s AI development pipeline is aligned with its principles of responsibility, which include accountability, fairness, reliability, safety, and inclusion. These principles guide the development, deployment, and use of AI models to ensure that they are used in a way that benefits society and minimizes potential risks. Accountability means that those who develop and deploy AI models are responsible for their actions and their impact on society. Fairness means that AI models should be used in a way that does not discriminate against individuals or groups. Reliability means that AI models should be accurate and consistent in their predictions. Safety means that AI models should be designed and used in a way that does not pose a risk to human safety. Inclusion means that AI models should be developed and used in a way that benefits all members of society, including those who are traditionally underserved.
  • Post-Training: The Phi-4 models undergo post-training using SFT, Direct Preference Optimization, and RLHF on public and internal datasets oriented toward “helpfulness” and “harmlessness”. This helps to ensure that the models are safe and reliable. SFT (Supervised Fine-Tuning) is used to further refine the model’s performance on specific tasks. Direct Preference Optimization (DPO) is a technique for training language models to align with human preferences. RLHF (Reinforcement Learning from Human Feedback) is a technique for training language models to generate responses that are helpful, harmless, and honest. By combining these techniques, Microsoft aims to create AI models that are both powerful and responsible. A compact sketch of the DPO objective follows this list.
  • Model Cards: Microsoft publishes “Cards” for its models, which detail the residual limitations and mitigation measures. This provides users with transparency and allows them to make informed decisions about how to use the models. Model Cards provide information about the model’s intended use, its performance on various benchmarks, its limitations, and potential biases. This information helps users to understand the model’s capabilities and limitations and to use it in a responsible manner.
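
Of these techniques, Direct Preference Optimization has the most compact formulation, so here is a hedged PyTorch sketch of its objective (following Rafailov et al., 2023). Nothing here reflects Microsoft’s internal implementation; the inputs are summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
# Sketch of the DPO objective (Rafailov et al., 2023). Inputs are
# summed log-probabilities of whole responses under the policy being
# trained and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: how much more the policy
    # likes it than the reference model does.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response's implicit reward above the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```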

The Future of SLMs

The Phi-4 models represent a significant step forward in the development of small language models (SLMs). Their enhanced reasoning abilities, combined with their small size and efficiency, make them a compelling alternative to larger language models (LLMs) in many applications. The development of Phi-4 demonstrates that it is possible to achieve high levels of performance with smaller models, which has significant implications for the future of AI.

As SLMs continue to improve, they are likely to play an increasingly important role in the AI landscape. Their ability to run on resource-constrained devices and provide fast, efficient performance makes them well-suited for a wide range of applications, from educational tools to local chatbots to edge computing devices. SLMs are also more energy-efficient than LLMs, which is important for reducing the environmental impact of AI.

Microsoft’s commitment to open-source principles and responsible AI development further positions the Phi-4 models as a valuable resource for the AI community. By democratizing access to these powerful tools, Microsoft is empowering developers to create innovative and impactful applications that can benefit society as a whole. The MIT license allows developers to freely use, modify, and distribute the Phi-4 models, which encourages collaboration and innovation.

A Closer Look at the Technical Aspects

Delving deeper into the specifics of the Phi-4 architecture and training reveals the innovative techniques that enable these SLMs to achieve such impressive reasoning capabilities. The combination of carefully curated datasets, sophisticated training algorithms, and a focus on efficiency has resulted in a family of models that are both powerful and practical. Understanding these technical aspects is crucial for appreciating the advancements made in the Phi-4 models and for further developing and improving SLMs in the future.

Data Curation and Preparation

The success of any machine learning model hinges on the quality and relevance of the data it is trained on. Microsoft invested significant effort in curating and preparing the datasets used to train the Phi-4 models. This involved not only selecting appropriate data sources but also cleaning, transforming, and augmenting the data to ensure that it was suitable for training the models.

  • Reasoning Chains from OpenAI’s o3-mini: The models leverage reasoning chains generated by OpenAI’s o3-mini model to learn the steps involved in complex reasoning processes. These chains provide a detailed roadmap for the SLMs to follow, enabling them to develop a deeper understanding of the underlying logic. The reasoning chains were carefully selected to cover a wide range of reasoning tasks and to be of high quality. A hypothetical example of one such training record is sketched after this list.
  • Synthetic Mathematical Problems: The Phi-4-mini-reasoning model is trained on a synthetic dataset of one million mathematical problems generated by DeepSeek-R1. This dataset provides a diverse range of mathematical challenges, allowing the model to develop strong problem-solving skills. The use of synthetic data allows for greater control over the training data and can help to improve the model’s generalization performance.
  • Helpfulness/Harmlessness Datasets: The models undergo post-training using datasets designed to promote helpfulness and harmlessness. This helps to ensure that the models generate safe and responsible outputs. These datasets were carefully curated to represent a wide range of perspectives and to be free of bias.
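
As a concrete illustration of what one such training record might look like: the released Phi-4 reasoning models wrap their chain of thought in <think>…</think> markers, but the JSON-like schema below is a hypothetical shape for exposition, not Microsoft’s published training format.

```python
# Hypothetical shape of one SFT record built from a teacher reasoning
# chain. The <think>...</think> markers match what the released models
# emit; the schema itself is an illustrative assumption.
example = {
    "messages": [
        {"role": "user",
         "content": "How many primes lie between 10 and 30?"},
        {"role": "assistant",
         "content": ("<think>Check each candidate: 11, 13, 17, 19, 23, 29 "
                     "are prime; 15, 21, 25, 27 are composite.</think>\n"
                     "There are 6 primes between 10 and 30.")},
    ]
}
```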

Training Algorithms

The Phi-4 models are trained using a combination of supervised learning, reinforcement learning, and distillation. These techniques work together to optimize the models for reasoning tasks and ensure that they are both accurate and efficient. The specific details of the training algorithms are proprietary, but the general principles are described below.

  • Supervised Fine-Tuning (SFT): SFT is used to fine-tune the models on the reasoning chains generated by OpenAI’s o3-mini model. This allows the models to learn the specific patterns and relationships that are characteristic of complex reasoning processes. SFT involves training the model to predict the next step in the reasoning chain, given the current state.
  • Reinforcement Learning (RL): RL is used to train the Phi-4-reasoning-plus model to maximize a reward signal based on the accuracy and efficiency of its reasoning. This encourages the model to develop strategies for solving problems that are both effective and computationally efficient. RL involves training the model to interact with an environment and to receive feedback in the form of rewards. An illustrative reward function is sketched after this list.
  • Distillation: Distillation is used to transfer knowledge from larger, more complex models to the smaller Phi-4 models. This allows the SLMs to achieve performance levels comparable to much larger models, while maintaining their compact size and efficiency. Distillation involves training the smaller model to mimic the behavior of the larger model.
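
As an illustration of the reward described above, the sketch below scores a completion on verifiable correctness with a small penalty for long solutions. The actual reward shaping used for Phi-4-reasoning-plus has not been published, so every constant here is an assumption.

```python
# Illustrative reward for verifiable reasoning tasks: +1/-1 for
# correctness against a reference answer, minus a mild penalty for
# long solutions. All constants are assumptions for exposition.
def reasoning_reward(generated_answer: str,
                     reference_answer: str,
                     num_tokens: int,
                     max_tokens: int = 4096) -> float:
    correct = generated_answer.strip() == reference_answer.strip()
    accuracy_reward = 1.0 if correct else -1.0
    # Efficiency term: the more of the token budget used, the larger
    # the penalty, encouraging concise reasoning.
    length_penalty = 0.1 * min(num_tokens / max_tokens, 1.0)
    return accuracy_reward - length_penalty
```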

Optimization for Efficiency

One of the key goals in developing the Phi-4 models was to optimize them for efficiency. This is reflected in several aspects of their design and training. The focus on efficiency is crucial for enabling the deployment of SLMs on resource-constrained devices and for reducing the environmental impact of AI.

  • Compact Architecture: The Phi-4 models are designed with a compact architecture that minimizes the number of parameters required. This reduces the computational cost of running the models and makes them well-suited for deployment on resource-constrained devices. The architecture is based on the transformer architecture, but it has been optimized for efficiency.
  • Quantization: Quantization is used to reduce the memory footprint of the models and improve their inference speed. It involves representing the model’s parameters with fewer bits, which significantly reduces the computational cost of running the model, and it can be performed after training without significantly affecting performance. A loading-time quantization sketch follows this list.
  • Hardware Acceleration: The Phi-4 models are optimized for hardware acceleration on a variety of platforms, including CPUs, GPUs, and NPUs. This allows them to achieve maximum performance on a wide range of devices. Hardware acceleration can significantly improve the speed and efficiency of running the models.
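
As one concrete and widely used route, the sketch below loads the model with 4-bit NF4 quantization via bitsandbytes through the transformers API. This is a community-standard technique, not necessarily the quantization scheme Microsoft ships; the model id is the assumed public release name.

```python
# 4-bit NF4 quantization at load time via bitsandbytes, a common
# community technique for shrinking models on constrained hardware
# (not necessarily Microsoft's own scheme).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for accuracy
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-reasoning",       # assumed public model id
    quantization_config=quant_config,
    device_map="auto",
)
```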

Implications for the Future of AI

The Phi-4 models represent a significant step forward in the development of AI, with implications that extend far beyond the specific applications for which they are designed. Their ability to achieve high performance with relatively small size and computational resources opens up new possibilities for deploying AI in a wide range of settings. The impact of the Phi-4 models is likely to be felt across many different industries and applications.

Democratization of AI

The Phi-4 models are a testament to the fact that powerful AI capabilities can be achieved without requiring massive computational resources or access to proprietary datasets. This democratizes access to AI, empowering developers and researchers to create innovative applications even with limited resources. The open-source nature of the Phi-4 models further contributes to the democratization of AI.

Edge Computing

The small size and efficiency of the Phi-4 models make them well-suited for edge computing applications. This allows AI to be deployed closer to the data source, reducing latency and improving responsiveness. Edge computing has the potential to revolutionize a wide range of industries, from manufacturing to healthcare to transportation. The ability to run AI models locally, without relying on the cloud, can also improve privacy and security.

Personalized AI

The Phi-4 models can be customized and adapted to meet the specific needs of individual users or organizations. This allows for the creation of personalized AI experiences that are tailored to the unique requirements of each user. Personalized AI has the potential to improve productivity, enhance learning, and improve overall well-being.

Sustainable AI

The Phi-4 models are a more sustainable alternative to larger language models, requiring less energy and computational resources. This is important for reducing the environmental impact of AI and ensuring that it can be deployed in a responsible and sustainable manner. The energy efficiency of SLMs is becoming increasingly important as the use of AI continues to grow.

The Microsoft Phi-4-Reasoning models are not just another iteration in the ever-evolving world of AI; they are a paradigm shift. They demonstrate that intelligence is not solely a function of size and computational power but can be achieved through clever design, careful curation of data, and innovative training techniques. As these models continue to evolve, they are poised to unlock new possibilities for AI and transform the way we interact with technology. The development of the Phi-4 models is a significant achievement that will have a lasting impact on the field of AI.