Microsoft Research has unveiled Phi-4, a 14 billion-parameter small language model aimed at pushing the boundaries of mathematical reasoning. Initially available on Azure AI Foundry, it has recently been open-sourced on Hugging Face under the MIT license.
Innovations of Phi-4
According to Microsoft, Phi-4 outperforms comparable and larger models in mathematical reasoning due to several innovative techniques employed during its training, including:
- Synthetic Data Pre-training and Mid-training: Using synthetic data for pre-training and mid-training provides a more structured learning path for the model.
- Organic Data Curation: Carefully curated and filtered organic data ensures the quality of the training data.
- Novel Post-Training Scheme: A new post-training approach further enhances the model’s performance.
These innovations have allowed Phi-4 to surpass its teacher model, GPT-4o, in STEM-focused question-answering, demonstrating that Microsoft’s data generation and post-training techniques are more than just simple knowledge distillation.
The Unique Advantage of Synthetic Data
The use of synthetic data in the training of large language models (LLMs) is not new, and the Phi models have utilized this approach before. Microsoft points out that synthetic data is not a cheap substitute, but rather it excels over organic data in the following ways:
- More Gradual Learning Path: Synthetic data can guide LLMs to learn progressively, from the initial problem statement to the final solution, making the reasoning process easier for the model to follow.
- Better Alignment with Reasoning at Inference: Whereas organic data often pairs a problem statement with only a final answer, synthetic data can spell out the intermediate reasoning step by step, matching the format the model must produce at inference time.
Carefully Curated Organic Data
In addition to synthetic data, Microsoft also used carefully curated organic data, including tens of millions of high-quality math problems and solutions collected from public websites and external datasets. Where accurate solutions were unavailable, they synthesized candidate solutions and kept the majority-voted answer to improve accuracy. They also collected material from academic papers, education forums, and programming tutorials.
Microsoft emphasized the critical role of high-quality natural data in synthetic data generation, noting that even minor errors can lead to a significant degradation in the quality of derived synthetic documents. Therefore, they invested significant effort in refining the curation of web data.
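The majority-voting step mentioned above can be sketched in a few lines. The sampler that produces candidate solutions and the answer-extraction logic are not described in detail by Microsoft, so the input here is a hypothetical list of already-extracted final answers:

```python
from collections import Counter

def majority_vote(candidate_answers):
    """Pick the most common final answer among independently sampled solutions.

    candidate_answers: final answers extracted from several model-generated
    solutions to the same problem (hypothetical input format).
    """
    if not candidate_answers:
        return None
    counts = Counter(candidate_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled solutions to the same problem yield these final answers;
# the consensus answer wins even though no single sample is trusted alone.
samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))  # -> 42
```

The idea is that independent reasoning errors rarely converge on the same wrong answer, so agreement across samples is a usable proxy for correctness when no ground-truth solution exists.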
Phi-4’s Post-Training Phase
The post-training phase of Phi-4 aims to transform it into a reliable AI assistant. This phase includes the following steps:
- Fine-tuning: The model is fine-tuned using high-quality data generated from various domains, including mathematics, coding, reasoning, dialogue, model identity, and safety.
- Direct Preference Optimization (DPO): Two DPO steps are performed to better align the model with human preferences and eliminate undesirable behaviors.
- Pivotal Token Search: In the first step, Microsoft used a new technique called Pivotal Token Search to generate desired/undesired outcome pairs.
- GPT-4o as a Judge: In the second step, they used GPT-4o as a judge to label each outcome pair as positive or negative.
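The Pivotal Token Search idea can be illustrated with a minimal sketch: given per-prefix estimates of the probability that a completion will succeed (in practice obtained by repeated sampling), flag the tokens whose addition shifts that probability sharply. The threshold and probability values below are illustrative, not from Microsoft's implementation:

```python
def find_pivotal_tokens(success_probs, threshold=0.2):
    """Identify pivotal tokens from per-prefix success probabilities.

    success_probs[i] estimates the probability that a completion
    continued after token i ends in a correct answer. A token is
    "pivotal" if appending it moves that estimate by >= threshold.
    Returns (token_index, delta) pairs.
    """
    pivotal = []
    for i in range(1, len(success_probs)):
        delta = success_probs[i] - success_probs[i - 1]
        if abs(delta) >= threshold:
            pivotal.append((i, delta))
    return pivotal

# Token 2 sharply raises the success estimate; token 4 sharply lowers it.
# Both are flagged, while small fluctuations are ignored.
probs = [0.5, 0.55, 0.9, 0.88, 0.2]
print(find_pivotal_tokens(probs))
```

Pivotal tokens found this way mark exactly the decision points where a preference pair (desired vs. undesired continuation) is most informative for DPO.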
Evaluation of Phi-4
Phi-4 was evaluated using OpenAI’s SIMPLE-EVALS framework, and it surpassed Llama-3.1-405B on multiple benchmarks. It also outperformed its teacher model, GPT-4o, on the GPQA (graduate-level, Google-proof STEM question answering) and MATH (competition mathematics) benchmarks.
Detailed Explanation of Phi-4 Model’s Training Data
Microsoft employed a carefully designed data strategy when training the Phi-4 model, built primarily around synthetic data and selected real-world data. The combination aims to optimize the model’s learning process and enable it to excel at mathematical reasoning.
Synthetic Data Generation
Synthetic data plays a crucial role in the training of Phi-4. The Microsoft team did not view synthetic data as a simple substitute for real data, but rather as a tool to guide the model to learn gradually. The synthetic data generation process typically follows these steps:
- Problem Creation: First, various mathematical problems are generated based on predefined rules and templates. These problems cover different mathematical fields and difficulty levels to ensure the model’s comprehensive learning.
- Step-by-Step Solutions: For each generated problem, a step-by-step solution is created, detailing the reasoning process from the problem statement to the final answer. This step-by-step solution includes not only the final answer but also the intermediate steps and reasoning logic, helping the model understand the problem-solving process.
- Data Augmentation: To increase data diversity, synthetic data is also augmented, such as by changing the wording of the problems, adjusting numbers, or using different solution methods.
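The three steps above can be sketched for a toy problem family. The template, the worked-solution steps, and the augmentation-by-resampling strategy are illustrative stand-ins; Microsoft's actual generators are far more varied:

```python
import random

def make_problem(a, b):
    """Create one templated linear-equation problem with a worked solution.

    Returns the problem statement, the step-by-step reasoning, and the
    final answer together, mirroring the step-by-step format described above.
    """
    problem = f"Solve for x: {a}x + {b} = 0"
    steps = [
        f"Start with {a}x + {b} = 0.",
        f"Subtract {b} from both sides: {a}x = {-b}.",
        f"Divide both sides by {a}: x = {-b}/{a}.",
    ]
    return {"problem": problem, "steps": steps, "answer": -b / a}

def augment(n, seed=0):
    """Augment the dataset by resampling coefficients (one of the
    strategies above; rewording and alternative methods are others)."""
    rng = random.Random(seed)
    return [make_problem(rng.randint(1, 9), rng.randint(-9, 9)) for _ in range(n)]

dataset = augment(3)
print(dataset[0]["problem"], "->", dataset[0]["answer"])
```

Because every generated record carries its intermediate steps, the model trains on the full reasoning trace rather than just a question-answer pair.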
Selected Real-World Data
In addition to synthetic data, the training of Phi-4 also used a large amount of selected real-world data. This data comes from various public websites, academic papers, education forums, and programming tutorials, including the following types:
- Math Problems and Solutions: Tens of millions of high-quality math problems and their solutions were collected from public websites and external datasets. These problems cover different mathematical fields and difficulty levels.
- Academic Papers: To improve the model’s understanding and reasoning abilities, a large number of academic papers were also collected, providing in-depth mathematical concepts and theories.
- Education Forums: Questions raised by students and answers provided by experts were collected from education forums, enabling the model to understand mathematical problems from different perspectives.
- Programming Tutorials: To improve the model’s programming capabilities, a large number of programming tutorials were also collected, covering different programming languages and algorithms.
Data Quality Control
Microsoft invested significant effort in data quality control to ensure the accuracy and consistency of the training data. They took the following measures:
- Manual Review: For some key datasets, manual reviews were conducted to ensure the accuracy and quality of the data.
- Majority Voting: For problems without accurate solutions, solutions were generated using the majority voting method to improve accuracy.
- Data Cleaning: All data was cleaned to remove duplicate data, erroneous data, and irrelevant data.
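The deduplication and cleaning step can be sketched as a simple filtering pass. The record format and the concrete filters (empty text, mojibake marker) are illustrative placeholders, not Microsoft's actual pipeline:

```python
import hashlib

def clean(records):
    """Drop exact duplicates and obviously broken records.

    records: dicts with a "text" field (hypothetical schema).
    Hashing normalized text lets us detect exact duplicates without
    keeping every document in memory.
    """
    seen = set()
    kept = []
    for rec in records:
        text = rec.get("text", "").strip()
        if not text or "\ufffd" in text:   # empty or mis-encoded content
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                 # exact duplicate of an earlier record
            continue
        seen.add(digest)
        kept.append(rec)
    return kept

raw = [{"text": "2+2=4"}, {"text": "2+2=4"}, {"text": ""}, {"text": "bad \ufffd"}]
print(len(clean(raw)))  # -> 1
```

Production pipelines typically add near-duplicate detection (e.g. MinHash) and quality classifiers on top of exact-match filtering like this.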
Detailed Analysis of Post-Training Strategy
The post-training phase of Phi-4 aims to transform it into a reliable AI assistant, mainly consisting of fine-tuning and direct preference optimization (DPO).
Fine-tuning Phase
The goal of the fine-tuning phase is to adapt the model to various different tasks and fields. In this phase, Microsoft used high-quality data generated from the following domains:
- Mathematics: Including various math problems and solutions, aimed at improving the model’s mathematical reasoning ability.
- Coding: Including various programming problems and solutions, aimed at improving the model’s code generation and understanding abilities.
- Reasoning: Including various logical reasoning problems, aimed at improving the model’s logical thinking ability.
- Dialogue: Including various dialogue data, aimed at improving the model’s natural language understanding and generation abilities.
- Model Identity: Including data describing the model’s own identity and capabilities, aimed at helping it answer questions about itself accurately.
- Safety: Including various safety problems and solutions, aimed at improving the model’s safety.
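Mixing these domains into fine-tuning batches can be sketched as weighted sampling. The domain weights and example records below are invented for illustration; the actual Phi-4 mixture has not been published in this form:

```python
import random
from collections import Counter

def mix_sft_batch(domain_data, weights, batch_size, seed=0):
    """Sample one fine-tuning batch across domains with fixed weights.

    domain_data: domain name -> list of training examples.
    weights: domain name -> sampling weight (illustrative values).
    """
    rng = random.Random(seed)
    domains = list(domain_data)
    w = [weights[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        d = rng.choices(domains, weights=w, k=1)[0]
        batch.append((d, rng.choice(domain_data[d])))
    return batch

data = {
    "math": ["Solve x^2 - 4 = 0.", "Prove the sum of two evens is even."],
    "coding": ["Write a function that reverses a string."],
    "safety": ["Refuse to give instructions for wrongdoing."],
}
weights = {"math": 0.5, "coding": 0.3, "safety": 0.2}
batch = mix_sft_batch(data, weights, batch_size=8)
print(Counter(d for d, _ in batch))
```

Keeping the mixture weighted rather than uniform lets high-priority domains like mathematics dominate without starving smaller ones like safety.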
Direct Preference Optimization (DPO) Phase
The goal of the Direct Preference Optimization (DPO) phase is to better align the model’s behavior with human preferences and eliminate undesirable behaviors. This phase includes two steps:
- Pivotal Token Search: In the first step, Microsoft used a new technique called Pivotal Token Search to generate desired/undesired outcome pairs. This technique searches the model’s output space to find the key tokens that can distinguish between desired and undesired behaviors.
- GPT-4o as a Judge: In the second step, they used GPT-4o as a judge to label each outcome pair as positive or negative. GPT-4o can evaluate the model’s output based on human preferences, thus helping the model better learn human preferences.
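The DPO objective used in this phase can be written down exactly; this is the standard published DPO loss for a single preference pair, not Phi-4-specific code, and the log-probability values in the example are made up:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    logp_* are total sequence log-probabilities under the trained policy;
    ref_logp_* are the same quantities under a frozen reference model.
    beta controls how far the policy may drift from the reference.
    Loss = -log(sigmoid(beta * margin)), where the margin rewards the
    policy for favoring the chosen answer more than the reference does.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favors the chosen answer relative to the reference: low loss.
low = dpo_loss(-10.0, -30.0, -20.0, -25.0)
# Policy favors the rejected answer instead: high loss.
high = dpo_loss(-30.0, -10.0, -25.0, -20.0)
print(low < high)  # -> True
```

Minimizing this loss pushes probability mass toward the judge-approved outcome in each pair without needing an explicit reward model, which is what makes DPO simpler than classic RLHF.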
Performance Evaluation of Phi-4
To evaluate the performance of Phi-4, Microsoft used OpenAI’s SIMPLE-EVALS framework, which contains a variety of different benchmarks to assess the model’s performance on different tasks.
Benchmarks
Phi-4 performed well on the following benchmarks:
- GPQA (graduate-level, Google-proof STEM question answering): On this benchmark, Phi-4 surpassed its teacher model GPT-4o, demonstrating strong question-answering ability in STEM fields.
- MATH (competition mathematics): On this benchmark, Phi-4 also surpassed GPT-4o, demonstrating its ability to solve complex competition-style math problems.
- Comparison with Other Models: Across multiple benchmarks in the suite, Phi-4 surpassed Llama-3.1-405B despite having far fewer parameters, demonstrating strong overall performance.
Performance Analysis
Based on the performance evaluation of Phi-4, the following conclusions can be drawn:
- Strong Mathematical Reasoning Ability: Phi-4 demonstrated excellent performance in mathematical reasoning, thanks to the innovative methods used during its training, including synthetic data, selected real-world data, and post-training strategies.
- Surpassing the Teacher Model: On multiple benchmarks, Phi-4 surpassed its teacher model GPT-4o, indicating that its gains come from the data-generation and post-training techniques rather than simple knowledge distillation.
- Comparison with Other Models: Phi-4 surpassed Llama-3.1-405B on multiple benchmarks, proving its strong overall performance.
Application Prospects of Phi-4
As a small language model designed for complex mathematical reasoning, Phi-4 has broad application prospects. It can be applied in the following fields:
- Education: It can be used as a math tutoring tool to help students solve math problems and provide personalized learning experiences.
- Scientific Research: It can be used as a scientific research tool to help researchers perform mathematical modeling and data analysis.
- Engineering: It can be used as an engineering tool to help engineers in design and analysis.
- Finance: It can be used as a financial tool to help financial analysts in risk assessment and investment decisions.
- Other Fields: It can also be applied in other fields that require complex mathematical reasoning, such as healthcare, logistics, and manufacturing.
Conclusion
The release of Microsoft’s Phi-4 marks a significant advance for small language models in mathematical reasoning. Its data training strategy and post-training methods let it outperform comparable and larger models, offering new directions for future AI development. Now that Phi-4 is open-sourced on Hugging Face, more researchers and developers can build on it, furthering the application of AI technology across fields.