Microsoft's Tiny Model: Math 'Cheat Code' on 6K Samples

The Rise of the Phi-4 Reasoning Models

The AI community is currently buzzing about reasoning models, and Microsoft has recently unveiled the Phi-4 family of reasoning models. The series includes Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. Remarkably, even the largest of these models, at only 14 billion parameters, runs smoothly on a high-performance laptop. Furthermore, the 3.8-billion-parameter Phi-4-mini-reasoning outperforms the 8-billion-parameter DeepSeek-R1 distilled model in mathematical reasoning, showcasing the potential of smaller models in reasoning tasks.

While the second-generation DeepSeek-R2 reasoning model expected in April has yet to arrive, Microsoft has introduced the new Phi-4 reasoning models. They exhibit outstanding mathematical reasoning, surpassing the DeepSeek-R1 distilled model despite Phi-4-mini-reasoning's much smaller parameter count.

Ahmed Awadallah, Partner Research Manager at Microsoft's AI Frontiers lab, described Phi-4-reasoning and summarized the new model's features:

  • The model is trained with supervised fine-tuning (on a carefully selected dataset of reasoning examples) and reinforcement learning.
  • It performs well on reasoning benchmarks and is comparable to much larger top-tier models such as DeepSeek-R1.
  • It continues to perform strongly on new tests (such as AIME 2025 and HMMT).
  • Reasoning ability transfers and generalizes strongly: even with supervised fine-tuning alone, the model adapts to new tasks (such as k-SAT, equation solving, and scheduling).
  • It retains, and in some cases greatly improves, general capabilities (such as understanding and following instructions).

He noted that Phi-4 still has several areas that need improvement, especially context length, coding ability, and tool integration.

In addition to the model itself, Microsoft also shared a detailed technical report that provides an in-depth analysis of the model’s training and evaluation process.

On X, Dimitris Papailiopoulos, Principal Researcher at the Microsoft Research AI Frontiers lab and Associate Professor at the University of Wisconsin, shared more details about the Phi-4 reasoning models.

He believes that Phi-4-reasoning has fully reached graduate level and can be run on a local PC.

This exceeded his expectations for the development of AI.

The new model has few parameters but strong performance.

A Performance Powerhouse

Despite its relatively small size, the model excels on mathematics benchmarks such as AIME, HMMT, and OmniMath. Its performance matches or surpasses larger open-weight models such as QwQ-32B, R1-70B, and R1, as well as closed models such as o1-mini and Claude 3.7 Sonnet.

Its small size also means it runs smoothly on a high-performance laptop.

At the same time, it can solve many puzzles that even larger non-reasoning models, and some reasoning models, cannot.

It also passed the DimitrisEval test!

Surprisingly, reasoning seems to be a genuinely transferable ‘meta-skill’ that can be learned even through supervised fine-tuning (SFT) alone!

Evidence 1: Even without specialized training on non-reasoning tasks, researchers still observed significant performance improvements on IFEval, FlenQA, and the internal PhiBench (gains of more than 10 points!).

In addition, very little coding-related data was used in the SFT stage (and none at all in the RL stage), yet the model still performs well on coding.

Dimitris Papailiopoulos also revealed that programming is a key focus for subsequent versions.

Evidence 2: On specific problems the model was never explicitly trained on (in either the SFT or RL stage), such as the traveling salesman problem, maze solving, k-SAT, and constraint planning, it still performs very well!

Plain Phi-4 (and even GPT-4) cannot do this.

This fully illustrates that reasoning ability can indeed be transferred as a skill!

After a very short round of reinforcement learning (using only 6,000 samples, compared with 1.4 million examples for SFT), the model’s reasoning mechanism seems to ‘lock in’.

This particularly shocked Dimitris Papailiopoulos.

He feels as if reinforcement learning taught the model to reason in ‘its own language’, raising accuracy by about 10% on AIME and HMMT and increasing the average answer length by 50% on difficult problems.

Reinforcement learning is really effective!!

When the reasoning mechanism ‘locks in’, the model’s output distribution typically becomes more concentrated and its accuracy higher.

That reinforcement learning can significantly improve a model’s capabilities has also been shown in previous Microsoft research.

In the reinforcement learning stage, the data was not even specially curated: the 6,000 questions were simply sampled at random from a larger pool of datasets.

So why didn’t Microsoft conduct more reinforcement learning training?

Because the model’s answers to some questions exceeded the 32k context length (a length beyond which the model was not trained), the team could only truncate them.

In addition, with the help of parallel test-time compute (such as Maj@N, majority voting over N samples), the new reasoning model has nearly reached the performance ceiling on AIME 2025, and even surpassed the pass@1 performance of its teacher model (o3-mini).
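For readers unfamiliar with the notation, Maj@N simply means sampling N independent answers and keeping the most common one. A minimal sketch, assuming a hypothetical generate_answer callable that samples one completion and extracts its final answer string:

```python
from collections import Counter

def maj_at_n(prompt, generate_answer, n=64):
    """Majority voting (Maj@N): sample N answers and return the most frequent one.

    `generate_answer` is a hypothetical callable that samples a single
    completion for `prompt` and returns its final answer as a string.
    """
    answers = [generate_answer(prompt) for _ in range(n)]
    # The answer that appears most often across the N samples wins.
    return Counter(answers).most_common(1)[0][0]
```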

All data collection was completed before February 2025, so AIME 2025 postdates the training data; the same holds for HMMT.

Researchers have also observed this ‘surpassing the teacher’ phenomenon in other tasks, such as OmniMath and calendar planning.

The prompt design in the SFT stage, coupled with the subsequent reinforcement learning process, seems to have given the model the ability to ‘self-improve’, exceeding the scope of knowledge provided by the teacher model.

In the figure below, magenta represents o3-mini and green represents Phi.

An interesting phenomenon: responses whose length falls in the top 25% are often strongly correlated with wrong answers!

On the other hand, across most evaluations, a longer overall average answer length tends to go with higher accuracy.

In other words, scaling up test-time compute does help, but the model is also prone to ‘rambling’ when it gets ‘stuck’.

Regarding the limitations of the model, there are also some things to pay attention to:

  • The ability to handle context lengths beyond 32k has not been fully extended or tested.
  • The model is prone to ‘overthinking’ simple problems and can be overly verbose in its self-assessment.
  • Multi-turn dialogue ability has not been widely tested.

Of course, there are more ‘blind spots’ to be discovered, but overall, the research team feels that they are on the right track!

Training Surprises

Suriya Gunasekar, Principal Research Manager at Microsoft Research and a member of the ‘AGI Physics’ team responsible for developing the Phi series of models, focused on the core principles behind the work.

This time, the Microsoft Phi team focused on the post-training stage and launched Phi-4-reasoning (SFT only) and Phi-4-reasoning-plus (SFT plus a small amount of RL).

Both are 14B models that have demonstrated strong capabilities in reasoning and general task benchmarks.

The core of this work lies in prompt selection and experimental exploration of transferable, self-improving reasoning skills.

There were two surprising discoveries during the training process:

First, using long chain-of-thought (CoT) trajectories from only a few training domains, Phi-4 achieves significant performance improvements across tasks such as scheduling, maze solving (without visual input), IFEval, FlenQA, KITAB (lookup-based question answering), and the internal PhiBench;

Second, even with only 6,000 mathematical examples used for a minimal round of RL training, the model’s performance improves significantly on some benchmarks, with gains of up to 10% (though token usage increases by about 1.5 times), and cross-domain transfer of skills was also observed during the RL stage.

In other words, compared with major competitors such as OpenAI and Google, the Microsoft Phi-4 reasoning series demonstrates a new possibility: with high-quality data and refined training strategies, small models can match or even surpass large models on specific tasks. It suggests a shift toward data efficiency and clever training methodology.

Core Methods

Phi-4-reasoning has 14 billion parameters and performs strongly on complex reasoning tasks.

The model is obtained by supervised fine-tuning of Phi-4 on a carefully selected set of ‘teachable’ prompts with appropriate complexity and diversity, using reasoning examples generated by o3-mini as references during training.

Phi-4-reasoning can generate detailed reasoning chains and make full use of computing resources during the reasoning process. This ability to generate detailed reasoning chains allows the model to solve complex problems in a step-by-step manner, mirroring human problem-solving strategies.

On this basis, Microsoft further developed Phi-4-reasoning-plus.

It enhances the original model through a short stage of outcome-based reinforcement learning and generates longer, more powerful reasoning chains. The goal of this reinforcement learning phase is to further refine the model’s reasoning and encourage it to produce more elaborate, insightful solutions.

Research shows that a well-designed SFT dataset can significantly improve the effect of reasoning language models, and reinforcement learning (RL) can further amplify this improvement on this basis. This highlights the importance of both high-quality training data and sophisticated training algorithms in achieving state-of-the-art performance in reasoning tasks.

In SFT experiments, even in this relatively simple generation setting, careful selection and strict filtering of seed problems are still key to the model’s success. This emphasizes the importance of curating a training dataset that is both relevant and challenging, and that avoids biases or irrelevant information that could hinder the model’s learning process.

The team subjected the entire training dataset to a strict decontamination process to ensure it contains no data that heavily overlaps with widely used reasoning or general benchmark questions, including some benchmarks not mentioned in the report. This decontamination is crucial for ensuring that the model actually learns to reason rather than memorizing solutions to common problems, and it helps improve the model’s generalization; a sketch of a typical overlap check appears after the list below.

The complete list of decontaminated benchmarks is as follows:

  • Mathematics and Reasoning: AIME-2024, MATH, GPQA, OmniMATH, GSM8k
  • Programming: LiveCodeBench, Codeforces, HumanEval, MBPP
  • Question Answering and General Knowledge: SimpleQA, DROP, AGIEval, ARC-Challenge, ARC-Easy, CommonsenseQA, OpenBookQA, PIQA, WinoGrande
  • Other Evaluation Tasks: SWE-Bench Verified, ArenaHard, MT-Bench, PhiBench
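The report does not spell out the exact matching procedure, but decontamination of this kind is typically implemented as an n-gram overlap check between training prompts and benchmark questions. A minimal illustrative sketch, where the n-gram size, the threshold (any shared n-gram), and the benchmark_questions source are assumptions rather than the team’s actual pipeline:

```python
def ngrams(text, n=13):
    """Return the set of word-level n-grams for a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_prompt, benchmark_questions, n=13):
    """Flag a training prompt if it shares any long n-gram with a benchmark item."""
    train_grams = ngrams(train_prompt, n)
    return any(train_grams & ngrams(q, n) for q in benchmark_questions)

# Keep only training prompts that do not overlap with any benchmark question.
# clean_data = [p for p in train_prompts if not is_contaminated(p, benchmark_questions)]
```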

Phi-4-reasoning was obtained through supervised fine-tuning (SFT) of the 14-billion-parameter Phi-4 model, with no reinforcement learning beforehand. This demonstrates that significant reasoning capability can be achieved through supervised learning alone, without more complex techniques such as reinforcement learning.

The goal of SFT is to refine the structured reasoning ability latent in the base model; during SFT the model sharpens its ability to extract, organize, and apply information.

The architecture of Phi-4-reasoning is the same as that of the Phi-4 model, but with two key modifications (see the sketch after this list):

  • Reasoning tokens: Two placeholder tokens in the base model are repurposed as <think> and </think> tokens, which mark the beginning and end of a reasoning (‘thinking’) trace. These dedicated tokens likely help the model delineate and focus on the reasoning steps involved in problem-solving.
  • Increased token length: The base model (Phi-4) initially supported a maximum token length of 16K. To accommodate the additional reasoning tokens, the RoPE base frequency was doubled and the model was trained with a maximum token length of 32K. The longer context lets the model consider more information and generate more detailed, nuanced reasoning chains, which is critical for complex problems that require substantial information and processing.
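As a rough illustration of these two changes, the sketch below marks the thinking span with dedicated tokens and doubles the RoPE base frequency using a Hugging Face-style configuration. The checkpoint id and the attribute names (rope_theta, max_position_embeddings) assume a Llama/Phi-style config and are not taken from the actual Phi-4-reasoning recipe; verify them against the real checkpoint before use.

```python
from transformers import AutoConfig, AutoTokenizer

base = "microsoft/phi-4"  # assumed checkpoint id for illustration

tokenizer = AutoTokenizer.from_pretrained(base)
# Mark the start and end of the reasoning ("thinking") span. Note that
# Phi-4-reasoning reuses two existing placeholder tokens rather than growing
# the vocabulary; adding them as special tokens here is a simplification.
tokenizer.add_special_tokens({"additional_special_tokens": ["<think>", "</think>"]})

config = AutoConfig.from_pretrained(base)
# Double the RoPE base frequency and raise the context window from 16K to 32K,
# mirroring the two modifications described in the list above.
config.rope_theta = config.rope_theta * 2
config.max_position_embeddings = 32_768
```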

They used a synthetic method to generate a large number of chain-of-thought reasoning examples. The use of synthetic data allows researchers to create a controlled and diverse training dataset that covers a wide range of reasoning scenarios.

The SFT dataset used contains more than 1.4 million prompt-response pairs, totaling 8.3 billion unique tokens, covering reasoning fields such as mathematics and programming, as well as alignment data for safe and responsible AI. The size and diversity of this dataset are essential for training a robust and generalizable reasoning model.
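For concreteness, a single prompt-response pair in such a dataset might look roughly like the following. The field names, system prompt wording, and example problem are hypothetical placeholders, since the report’s exact data schema is not reproduced here.

```python
# Hypothetical example of one SFT prompt-response pair; field names and the
# system prompt are illustrative placeholders, not the actual Phi-4 data format.
sft_example = {
    "prompt": [
        {"role": "system", "content": "You are a helpful assistant. Think step by step "
                                      "inside <think>...</think> before answering."},
        {"role": "user", "content": "What is the sum of the first 100 positive integers?"},
    ],
    "response": (
        "<think>Use the formula n(n+1)/2 with n = 100: 100 * 101 / 2 = 5050.</think>\n"
        "The sum is 5050."
    ),
}
```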

Figure 4a shows the changes in key indicators throughout the SFT iteration process. This visual representation of the training process provides valuable insights into the model’s learning dynamics and helps to identify potential issues or areas for improvement.

Early in the training, the model began to use explicit ‘thinking’ tokens, which indicates that the model quickly learned this shallow structured format. This suggests that the model readily adapts to the introduction of specific markers that signal the beginning and end of reasoning sequences.

However, as shown in Figure 4a, the effectiveness of the chain-of-thought module and the model’s reasoning ability are improving throughout the training process, which indicates that the model is not just copying the format, but is actually learning reasoning skills. This is a crucial finding, as it confirms that the model is not simply mimicking the structure of the training data, but is actually acquiring the ability to reason independently.

Interestingly, unlike reinforcement learning, researchers did not see an increase in response length during the SFT process. This suggests that supervised fine-tuning primarily focuses on improving the quality and accuracy of the model’s reasoning, rather than simply increasing the amount of text it generates.

In fact, as shown in Figure 4b, the average response length decreased slightly. This may indicate that the model becomes more efficient in its reasoning process and can arrive at the correct answer with fewer steps and less verbose explanations.

This shows that as training progresses, the model is learning to use its token budget more effectively. By reducing the average response length, the model can allocate more of its limited resources to the most important aspects of the reasoning process.

To systematically evaluate different training strategies, they used fixed benchmarks, AIME 2024 and GPQA Diamond, as indicators of progress. Using fixed benchmarks allows researchers to objectively compare the performance of different training strategies and to track the model’s progress over time.

Overall, the experimental method can be divided into two stages: exploration and scaling. This two-stage approach allows researchers to efficiently identify promising training strategies and then scale them up to achieve state-of-the-art performance.

In the exploration stage, researchers used shorter training cycles and limited data sources and fields to quickly iterate and extract robust training methods. This rapid iteration process allows researchers to quickly test different hypotheses and to identify the most effective training strategies.

In the subsequent scaling phase, researchers consolidated the results of the early de-risking experiments and finalized the SFT settings. This ensures the final training setup is well optimized and the model can reach its full potential.

Figure 5 summarizes this progress, highlighting ablation experiments for several key design choices. The inclusion of ablation experiments is essential for understanding the impact of each individual component of the training process and for identifying the most critical factors for success.

Figure 5 gives a high-level overview of the Phi-4-reasoning supervised fine-tuning (SFT) experimental cycle, covering the exploration and scaling phases and illustrated with a few example experiments. Each dot cluster represents the results of a specific training design choice. This visual representation helps convey the relative impact of different training parameters.

Figure 7 shows the key findings of the Phi-4-reasoning-plus model during the GRPO training process.

Starting from the supervised fine-tuning (SFT) base model Phi-4-reasoning, only 90 steps of GRPO training increased AIME performance by more than 10% (Figure 7a). This demonstrates the effectiveness of reinforcement learning in further enhancing the reasoning capabilities of a well-trained supervised model.
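GRPO (Group Relative Policy Optimization) scores each sampled response against the other responses in its group rather than against a separate value model. A minimal sketch of the standard group-relative advantage computation, with a simplified outcome-based reward; the report’s exact reward shaping and hyperparameters are not reproduced here:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    `rewards` holds one scalar reward per sampled response to the same prompt;
    each response's advantage is its reward standardized against the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled responses to one prompt, rewarded 1.0 if the final answer
# is correct and 0.0 otherwise (a simplified outcome-based reward).
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```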

Continuing to increase the number of training steps did not bring additional benefits, suggesting that a strong SFT model is already close to the performance ceiling. Note that outputs during GRPO training were limited to 31k tokens, which objectively restricts the optimization space of GRPO. This suggests the model may have reached a point of diminishing returns, where further training does not lead to significant improvements; the token limit also constrains further learning.

As shown in Figure 7c, response length is strongly correlated with AIME performance, while the correlation between reward score and AIME score is weak. This growth in response length is the intended effect of GRPO training: the model improves its reasoning by increasing its ‘thinking time’. This suggests the model benefits from having more opportunity to process information and explore different reasoning pathways.

Figure 7d further reveals that, owing to the design of the reward model, the generation length of wrong answers grows significantly faster than that of correct answers (when the model’s current answer is wrong, the system encourages it to think for longer). This highlights a potential issue with the reward function, which may inadvertently encourage the model to generate longer and more elaborate incorrect answers.

In fact, performing rejection sampling based solely on response length (especially on long responses that significantly exceed the median) may further improve GRPO performance. This suggests that filtering out overly long responses could help improve the accuracy of the model’s reasoning.

As shown in Figure 7d, during training the length of shorter responses (those in the bottom 25% quantile) grows in step with the average length of correct answers, while the length of wrong answers tracks the 75% quantile of the overall response length. In other words, the length distribution of wrong answers is highly skewed toward the long end.
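A length-based filter of the kind suggested above could look like the following sketch. The 75th-percentile cutoff is an assumption motivated by the quantiles discussed here, not a tuned value from the report.

```python
import numpy as np

def reject_overlong(responses, lengths, quantile=0.75):
    """Drop sampled responses whose token length exceeds the given quantile.

    Motivated by the observation that responses in the top length quartile are
    disproportionately wrong; the cutoff is illustrative, not from the report.
    """
    cutoff = np.quantile(lengths, quantile)
    return [r for r, n in zip(responses, lengths) if n <= cutoff]
```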

This divergence indicates that length-based rejection sampling could improve model efficiency by suppressing overly long incorrect outputs. By favoring shorter, more concise responses, the model can potentially improve its accuracy and reduce the computational resources it consumes; models that can cut to the chase are simply more practical to use.