Nvidia Surpasses DeepSeek: Open-Source Model Deep Dive

Nvidia’s Llama-Nemotron series models have officially surpassed DeepSeek-R1, and the details of their training have been fully disclosed, offering insights into how these models were developed to achieve superior performance.

These models are now fully open-source, marking a significant advancement in accessible AI technology. This means that a series of inference models that significantly outperform DeepSeek-R1 in terms of inference throughput and memory efficiency are now available for anyone to use and modify.

Unveiling the Secrets Behind the Model’s Success

So, how exactly were these models, which surpass DeepSeek-R1, created? Nvidia’s technical report reveals the critical elements of their training process:

  • Supervised Fine-Tuning with Synthetic Data + Reinforcement Learning: This combination significantly enhances the model’s reasoning capabilities.
  • Comprehensive Post-Training Process: A robust and well-designed post-training process is crucial for optimizing the model’s performance.

Last month, Nvidia officially announced the Llama-Nemotron 253B, which quickly overshadowed Llama 4, at that point only three days old and already facing an "integrity crisis" over leaderboard manipulation. The release of this model series caused quite a stir in the industry.

According to the Artificial Analysis Intelligence Index, Llama-Nemotron-Ultra is considered the "most intelligent" open-source model as of April 2025.

Nvidia launched three models in the Llama-Nemotron series: LN-Nano 8B, LN-Super 49B, and LN-Ultra 253B.

Notably, LN-Ultra not only outperforms DeepSeek-R1 but also runs on a single 8×H100 node, delivering higher inference throughput.

These models are optimized for high-throughput inference while maintaining strong reasoning capabilities and a context length of up to 128K.

Moreover, Nvidia has introduced a reasoning-switch feature that is new to the open-source AI community: users can dynamically toggle between standard chat mode and reasoning mode via the system prompt "detailed thinking on/off."

This design allows the model to meet general everyday needs and handle complex, multi-step reasoning tasks without needing different models or architectures.
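
As a rough illustration, toggling this behavior might look like the sketch below, which assumes a Hugging Face checkpoint and the transformers chat pipeline; the model ID is an assumption, so substitute the actual released checkpoint.

```python
# Minimal sketch: switching reasoning on/off via the system prompt.
# The model ID below is an assumption; replace it with the released checkpoint.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # assumed checkpoint name
    device_map="auto",
)

messages = [
    {"role": "system", "content": "detailed thinking on"},  # or "detailed thinking off"
    {"role": "user", "content": "How many primes are there below 100?"},
]

out = generator(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```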

The Construction Process: A Five-Stage Approach

The construction of the Llama-Nemotron models is divided into five distinct stages:

Stage 1: Inference-efficiency optimization using neural architecture search (NAS), starting from the Llama 3 series models and introducing Feedforward Network Fusion (FFN Fusion). This initial phase lays the groundwork for an efficient model: making inference efficiency the starting point reflects Nvidia's emphasis on practical deployment, and building on the proven Llama 3 foundation lets novel techniques such as FFN Fusion deliver further gains.

Stage 2: Recovery of model performance through knowledge distillation and continued pre-training. Knowledge distillation transfers knowledge from a larger, more capable "teacher" model to a smaller, more efficient "student," letting the student reach quality that would be hard to obtain by training from scratch, while continued pre-training further refines the model's language understanding and generalization. Together, the two techniques restore the accuracy and capabilities affected by the Stage 1 optimization.
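
As a concrete illustration, a token-level distillation loss along these lines is sketched below; this is a generic recipe with assumed shapes and temperature, not Nvidia's exact objective.

```python
# Generic sketch of a token-level knowledge-distillation loss: the student is
# trained to match the teacher's softened token distribution. Shapes and the
# temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * temperature ** 2  # rescale so gradients stay comparable across temperatures

student_logits = torch.randn(4, 128, 32000)  # (batch, seq_len, vocab_size)
teacher_logits = torch.randn(4, 128, 32000)
loss = distillation_loss(student_logits, teacher_logits)
```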

Stage 3: Supervised fine-tuning (SFT), which combines standard instruction data with reasoning traces from powerful teacher models such as DeepSeek-R1, enabling multi-step reasoning. SFT aligns the model with human expectations and effective instruction following, the distilled reasoning traces equip it for tasks that demand critical thinking, problem-solving, and decision-making, and the standard instruction data keeps it well-rounded across everyday tasks.

Stage 4: Large-scale reinforcement learning on complex mathematical and STEM datasets, the step that allows the student model to surpass its teacher. Unlike distillation, RL lets the model learn through trial and error from reward signals, pushing its reasoning beyond what imitation alone can reach. For LN-Ultra, this stage yields a large improvement on the GPQA-Diamond benchmark, establishing it as the strongest open-source model for scientific reasoning, and the fact that LN-Ultra ends up surpassing DeepSeek-R1, its teacher, underscores what RL can add on top of distillation.

To support reinforcement learning at this scale, the team developed a new training framework with multiple optimizations, most notably FP8-precision generation, underscoring how much cutting-edge models depend on efficient training infrastructure.

Stage 5: A brief alignment phase focused on instruction following and human preferences. This final step fine-tunes the model to follow instructions accurately, avoid harmful or offensive output, and adhere to human preferences, ensuring it is safe, reliable, and ready for real-world deployment.

Innovative Architecture for Optimized Inference Efficiency

LN-Super and LN-Ultra leverage the Puzzle framework for neural architecture search to optimize model inference efficiency.

Puzzle transforms large language models into hardware-adapted, efficient versions, optimized for deployment.

Through "block-by-block local distillation," developers built a library of alternative Transformer modules using Llama 3 Instruct. This approach allows for a high degree of flexibility and customization in the model architecture. By creating a library of alternative Transformer modules, the developers can select the most appropriate modules for each layer of the model, optimizing for specific performance characteristics. The use of Llama 3 Instruct as the basis for these modules ensures that they are well-trained and capable of performing a wide range of tasks. The block-by-block local distillation process further enhances the efficiency of the modules, allowing them to achieve high performance with minimal computational resources. This is crucial for deploying the models on resource-constrained devices.

In this process, each module is trained independently and in parallel, approximating the functionality of the original block while optimizing computational cost. Parallel training keeps the module library cheap and fast to build, and approximating the original blocks preserves accuracy while shrinking the computational footprint, which is especially valuable for low-latency or high-throughput serving.
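
A conceptual sketch of this per-block training is shown below: a candidate block (here, a hypothetical attention-free FFN variant) is fitted to reproduce the original block's outputs on captured hidden states. The dimensions, MSE objective, and toy data are assumptions for illustration, not the report's exact procedure.

```python
# Conceptual sketch of "block-by-block local distillation": each candidate
# block is trained, independently of the rest of the network, to mimic the
# output of the original block on the same hidden states.
import torch
import torch.nn as nn

d_model = 1024
original_block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
candidate_block = nn.Sequential(            # e.g. an attention-free, FFN-only variant
    nn.Linear(d_model, 2048), nn.GELU(), nn.Linear(2048, d_model)
)

opt = torch.optim.AdamW(candidate_block.parameters(), lr=1e-4)
for _ in range(10):                          # toy loop; real training streams hidden states
    hidden = torch.randn(8, 128, d_model)    # hidden states captured from the parent model
    with torch.no_grad():
        target = original_block(hidden)
    loss = nn.functional.mse_loss(candidate_block(hidden), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```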

Each alternative module carries a specific precision-efficiency trade-off: some variants are cheaper to run but cost some output quality. Which variants end up in the final model therefore depends on the requirements of the target application, and managing this fundamental trade-off is exactly what the Puzzle framework is designed to do.

These module variations include:

Attention Mechanism Removal: Some modules omit the attention mechanism entirely, reducing both compute and KV-cache memory consumption. Attention is one of the most expensive parts of a Transformer block, so dropping it in selected layers yields large savings for latency- or throughput-critical serving, at the cost of some accuracy that the selection step must weigh.

Variable FFN Dimensions: The intermediate dimensions of the feedforward networks are adjusted, enabling model compression at different granularities and a smaller memory footprint for deployment on resource-constrained devices.

After building the module library, Puzzle assembles a complete model by selecting one module for each layer. The selection is controlled by a mixed-integer programming (MIP) solver, which finds the optimal configuration under constraints such as hardware compatibility, maximum allowed latency, memory budget, or desired inference throughput, so the final model can be tailored to a specific hardware platform and performance envelope.
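
To make the idea concrete, the toy sketch below formulates per-layer block selection as a small MIP using the open-source PuLP solver; the variant names, quality scores, and budgets are invented placeholders rather than Puzzle's actual objective or constraints.

```python
# Toy MIP: pick one block variant per layer to maximize a quality proxy
# subject to latency and memory budgets. All numbers are hypothetical.
import pulp

layers = range(4)                       # toy model with 4 layers
variants = ["full", "slim_ffn", "no_attn"]

quality = {"full": 1.00, "slim_ffn": 0.97, "no_attn": 0.93}   # per-block quality proxy
latency = {"full": 3.0, "slim_ffn": 2.0, "no_attn": 1.2}      # ms per block (hypothetical)
memory  = {"full": 4.0, "slim_ffn": 2.5, "no_attn": 1.5}      # GB per block (hypothetical)

prob = pulp.LpProblem("puzzle_block_selection", pulp.LpMaximize)
x = pulp.LpVariable.dicts("pick", (layers, variants), cat=pulp.LpBinary)

# Objective: maximize summed quality of the chosen blocks
prob += pulp.lpSum(quality[v] * x[l][v] for l in layers for v in variants)

# Exactly one variant per layer
for l in layers:
    prob += pulp.lpSum(x[l][v] for v in variants) == 1

# Global latency and memory budgets
prob += pulp.lpSum(latency[v] * x[l][v] for l in layers for v in variants) <= 9.0
prob += pulp.lpSum(memory[v] * x[l][v] for l in layers for v in variants) <= 12.0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([v for l in layers for v in variants if x[l][v].value() == 1])
```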

Vertical Compression and FFN Fusion

In the LN-Ultra model, researchers introduced FFN Fusion (Feedforward Network Fusion), an additional compression technique that reduces the model's sequential depth and improves inference latency. By fusing several consecutive FFN layers into one, FFN Fusion cuts the number of sequential operations in a forward pass, which is particularly important for applications that require real-time responses.

Puzzle's removal of some attention layers leaves a distinctive structure: runs of consecutive FFN blocks frequently appear in the network, and this structure creates an opportunity for further optimization.

FFN Fusion identifies these consecutive runs and replaces them with fewer but wider FFN layers that can execute in parallel. This reduces the number of sequential computation steps without sacrificing model expressiveness and significantly improves the utilization of computing resources, especially in multi-GPU deployments where cross-layer communication overhead is substantial.
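
The sketch below illustrates the core trick on two consecutive FFN blocks: when both can be treated as reading the same input, they can be concatenated into a single wider FFN and executed in one step. The residual approximation, dimensions, and initialization are illustrative assumptions, not the exact procedure from the report.

```python
# Conceptual sketch of FFN Fusion. With the residual form
#   y = x + FFN1(x), z = y + FFN2(y),
# fusion approximates z ≈ x + FFN1(x) + FFN2(x), so both FFNs read the same
# input and can be merged into one wider FFN computing FFN1(x) + FFN2(x).
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def fuse(ffn_a: FFN, ffn_b: FFN) -> FFN:
    """Concatenate two FFNs into one wider FFN computing FFN_a(x) + FFN_b(x)."""
    d_model = ffn_a.up.in_features
    d_ff = ffn_a.up.out_features + ffn_b.up.out_features
    fused = FFN(d_model, d_ff)
    with torch.no_grad():
        fused.up.weight.copy_(torch.cat([ffn_a.up.weight, ffn_b.up.weight], dim=0))
        fused.up.bias.copy_(torch.cat([ffn_a.up.bias, ffn_b.up.bias], dim=0))
        fused.down.weight.copy_(torch.cat([ffn_a.down.weight, ffn_b.down.weight], dim=1))
        fused.down.bias.copy_(ffn_a.down.bias + ffn_b.down.bias)
    return fused

x = torch.randn(2, 16, 1024)
a, b = FFN(), FFN()
assert torch.allclose(fuse(a, b)(x), a(x) + b(x), atol=1e-4)
```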

The LN-Ultra model consistently outperforms DeepSeek-R1 and Llama-3.1-405B on both accuracy and efficiency, achieving a better accuracy-efficiency balance and demonstrating the effectiveness of the Puzzle framework and FFN Fusion.

Post-NAS Training: Knowledge Distillation and Continued Pre-training

After the neural architecture search (NAS) phase, both LN-Super and LN-Ultra underwent additional training to improve compatibility between modules and recover any quality loss that may have occurred during module replacement. This ensures that the optimized models maintain their accuracy and capabilities.

  • LN-Super was trained on the Distillation Mix dataset for 40 billion tokens under the knowledge distillation objective. The Distillation Mix dataset is a curated collection of text and code data that is designed to improve the model’s understanding of language and code. The knowledge distillation objective encourages the model to learn from a larger, more complex model, further enhancing its performance.

  • LN-Ultra was initially trained on the same distillation dataset for 65 billion tokens, followed by continued training on the Nemotron-H fourth-stage pre-training dataset for 88 billion tokens. The Nemotron-H dataset is a large-scale dataset of text and code data that is designed to improve the model’s general knowledge and reasoning abilities. The continued pre-training helps the model to refine its understanding of language and code.

This final pre-training step enabled LN-Ultra to not only catch up with the reference model, Llama 3.1-405B-Instruct, but also surpass it in key benchmark tests. This demonstrates the effectiveness of the post-NAS training process.

This shows that brief distillation and pre-training can reconcile aggressive architectural optimization with high model performance, a significant finding: models can be optimized for efficiency without giving up accuracy.

Supervised Fine-Tuning: Refining Reasoning Prowess

Supervised Fine-Tuning (SFT) acts as a "personal trainer" for the Llama-Nemotron models, targeting the reasoning steps of specific tasks and learning reasoning techniques from strong teacher models such as DeepSeek-R1. SFT is what tailors the models to particular applications and lifts their performance on reasoning tasks.

To instill genuine reasoning skills, large-scale, high-quality reasoning training data is essential. This underscores the importance of data quality in AI model development.

Synthetic Data: Tailored for Reasoning

Researchers carefully curated data samples containing both reasoning and non-reasoning data for supervised fine-tuning. The use of synthetic data allows for a controlled and targeted approach to training the models.

For reasoning samples, they added "detailed thinking on" to the system instructions, while for non-reasoning samples, they used "detailed thinking off." This allows the model to learn to differentiate between reasoning and non-reasoning tasks and to apply the appropriate techniques.

This setting allows the model to switch its reasoning behavior based on the prompt at inference time, a crucial feature for applications that need both reasoning and non-reasoning capabilities.

Synthetic data for reasoning was prepared in math, coding, and related fields. This ensures that the model is well-trained on a variety of reasoning tasks.

To train the model to follow the "reasoning switch" instructions, researchers built paired datasets, where each prompt corresponds to a response with reasoning and one without reasoning. This pairing enables the model to learn to adjust its reasoning behavior based on system instructions.

Subsequent filtering of these responses is performed based on standard answers or reward models. This ensures that the model generates accurate and reliable responses.
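
A hypothetical paired sample might look like the following; the field names and answer format are assumptions, since the report does not publish a schema.

```python
# Hypothetical illustration of a paired SFT sample: the same prompt appears
# twice, once with reasoning enabled and once disabled.
paired_sample = [
    {
        "system": "detailed thinking on",
        "prompt": "What is 17 * 24?",
        "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\nThe answer is 408.",
    },
    {
        "system": "detailed thinking off",
        "prompt": "What is 17 * 24?",
        "response": "The answer is 408.",
    },
]
```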

Fine-Tuning Process

All models were trained on instruction fine-tuning data using token-level cross-entropy loss. Token-level cross-entropy loss is a common loss function used in natural language processing.
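
A minimal sketch of that loss, with prompt tokens masked out so that only response tokens contribute, is shown below; this is the standard recipe rather than Nvidia's specific implementation.

```python
# Token-level cross-entropy for SFT: positions labeled -100 (prompt/system
# tokens) are ignored, so only the response tokens carry loss.
import torch
import torch.nn.functional as F

def sft_loss(logits, labels):
    """logits: (batch, seq, vocab); labels: (batch, seq) with -100 where masked."""
    # Shift so each position predicts the next token
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
```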

In most training settings, reasoning and non-reasoning data are mixed to form training batches, where each prompt is paired with a corresponding response based on the "detailed thinking on/off" system instructions. This ensures that the model is exposed to both types of data during training.

Extending training over multiple epochs can improve performance, especially for smaller models, highlighting the role of training duration.

NeMo-Aligner was used for reinforcement learning training, supporting GRPO and training of heterogeneous models. NeMo-Aligner is a framework developed by Nvidia for aligning AI models with human values.

vLLM was used for the generation phase, and Megatron-LM was used for the training phase. vLLM is a framework for efficient inference, and Megatron-LM is a framework for large-scale training.

The training and generation phases shared the same pool of GPUs, running on the same devices.

The entire training process used 72 nodes, each equipped with 8 H100 GPUs. This highlights the scale of the training process.

The generation phase used FP8 precision, the training phase used BF16 precision, and the optimizer state used FP32. The use of different precision levels optimizes for both performance and accuracy.

Each phase maintained an independent copy of the model weights, which were resynchronized at the start of each step to keep training stable and consistent.

Reinforcement Learning: The Key to Surpassing R1’s Reasoning Ability

Supervised fine-tuning (SFT) enables the model to extract knowledge from powerful teacher models, achieving excellent capabilities.

However, knowledge distillation inherently sets a limit on the performance of the student model, particularly when the base model capability of the student model does not exceed that of the teacher model. This limitation highlights the need for reinforcement learning to surpass the capabilities of the teacher model.

Through supervised fine-tuning, LN-Ultra’s performance can approach DeepSeek-R1 but cannot surpass it.

Large-scale reinforcement learning (RL) is a viable method to enable the student model to surpass the teacher model because it allows the model to continuously explore new possibilities and self-learn. RL allows the model to learn through trial and error, optimizing its performance based on feedback signals.

Due to resource constraints, researchers only applied reasoning RL to LN-Ultra, resulting in a student model that surpassed the teacher model. This demonstrates the effectiveness of RL in pushing the limits of AI performance.

Throughout the reasoning reinforcement learning training process, the accuracy of LN-Ultra on the GPQA-Diamond dataset improved.

Training Process: A Focus on Scientific Reasoning

For LN-Ultra, researchers enhanced its scientific reasoning ability through large-scale reinforcement learning (RL) using the Group Relative Policy Optimization (GRPO) algorithm, the same algorithm used by DeepSeek-R1. GRPO is an RL algorithm designed to improve model performance on complex reasoning tasks.

The entire training process required approximately 140,000 H100 hours, continuously training the model until it converged on reasoning tasks. This highlights the computational resources required for large-scale reinforcement learning.
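
At its core, GRPO samples a group of responses per prompt and normalizes each response's reward against the group, as in the simplified sketch below; the clipping and KL terms of the full objective are omitted.

```python
# Sketch of GRPO's group-relative advantage: rewards for a group of sampled
# responses are normalized by the group mean and standard deviation.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```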

The reward mechanism design included two categories:

  • Accuracy Reward: Based on a standard answer (numerical/sentence/paragraph), the Llama-3.3-70B-Instruct model is called as a judge to score how closely the prediction matches, rewarding accurate responses.
  • Format Reward: Following DeepSeek-AI’s scheme, the model must wrap its reasoning process in <think> tags when "detailed thinking" mode is on, and such tags are prohibited when it is off, encouraging a consistent reasoning format (see the sketch after this list).
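
A hypothetical checker for the format reward might look like this; the exact scoring rules and tag handling are assumptions for illustration.

```python
# Hypothetical format-reward check: require a <think>...</think> block when
# detailed thinking is on, and forbid the tags when it is off.
import re

def format_reward(response: str, thinking_on: bool) -> float:
    has_think_block = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_any_tag = "<think>" in response or "</think>" in response
    if thinking_on:
        return 1.0 if has_think_block else 0.0
    return 1.0 if not has_any_tag else 0.0
```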

The research team also pre-processed the data, including data filtering and curriculum training. This improves the efficiency and effectiveness of the training process.

  • Data Screening: LN-Super is used in advance to generate 8 responses for each question, and simple samples with a pass rate ≥ 75% are removed. This focuses the training on more challenging examples.
  • Curriculum Training: Progressive batch allocation based on pass rate is adopted. Curriculum training involves gradually increasing the difficulty of the training examples, which can improve the model’s performance.

Dynamic Distribution: Batch difficulty is modeled with a Gaussian function, initially centered on high-pass-rate (easy) samples and gradually shifting toward low-pass-rate (hard) samples, so the model sees a controlled progression of difficulty (a sketch follows below).

Padding Logic: Samples are first allocated according to the target distribution, and the remaining capacity is filled from the largest remaining sample pool, keeping batches full and balanced.

Intra-Batch Processing: Samples within the same batch are randomly shuffled to maintain diversity and prevent the model from overfitting to ordering patterns.
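
The sketch below illustrates this kind of Gaussian-weighted curriculum batching; the sample pool, schedule, and sigma are made-up values rather than the report's actual settings.

```python
# Gaussian-weighted curriculum batching: the target difficulty (pass rate) of
# each step follows a Gaussian whose center drifts from easy to hard.
import numpy as np

rng = np.random.default_rng(0)
pass_rates = rng.uniform(0.0, 0.75, size=10_000)  # pool after removing easy samples

def sample_batch(step, total_steps, batch_size=256, sigma=0.1):
    # Target pass rate slides from 0.75 (easy) down toward 0.0 (hard)
    center = 0.75 * (1 - step / total_steps)
    weights = np.exp(-((pass_rates - center) ** 2) / (2 * sigma ** 2))
    weights /= weights.sum()
    return rng.choice(len(pass_rates), size=batch_size, replace=False, p=weights)

early = sample_batch(step=0, total_steps=100)
late = sample_batch(step=99, total_steps=100)
print(pass_rates[early].mean(), pass_rates[late].mean())  # early batches are easier
```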

Reinforcement Learning for Preference Optimization

After completing scientific reasoning training, researchers conducted a brief reinforcement learning phase for the LN-Super and LN-Ultra models, focusing on improving their instruction-following abilities. This ensures that the models can follow instructions accurately and effectively.

Researchers also used RLHF to optimize the models’ general help capabilities and chat performance while retaining the models’ capabilities in mathematics, science, and other fields. RLHF (Reinforcement Learning from Human Feedback) is a technique that involves training AI models based on human feedback.

LN-Super achieved a high score of 88.3 on the Arena Hard test, surpassing proprietary models such as Claude 3.5 Sonnet and GPT-4o-2024-05-13 as well as larger open-source models, demonstrating the effectiveness of the preference optimization process.

To achieve this result, they adopted online RPO (reward-aware preference optimization), maximizing the model's predicted reward on the HelpSteer2 dataset, with Llama-3.1-Nemotron-70B-Reward as the reward model.
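
As a rough illustration of "maximizing the predicted reward," the sketch below shows a generic policy-gradient step with a KL penalty against a reference policy; it is not the actual RPO objective, just a common pattern for reward-model-guided optimization.

```python
# Generic reward-maximization step (REINFORCE with a KL penalty toward a
# reference policy); rewards come from a reward model such as the one named above.
import torch

def policy_gradient_loss(logprobs, ref_logprobs, rewards, kl_coef=0.05):
    """logprobs/ref_logprobs: (batch,) summed log-probs of sampled responses."""
    kl = logprobs - ref_logprobs                  # per-sample KL estimate vs. reference
    shaped_reward = rewards - kl_coef * kl.detach()
    advantages = shaped_reward - shaped_reward.mean()
    return -(advantages * logprobs).mean()        # minimizing this maximizes expected reward

logprobs = torch.tensor([-12.3, -15.1, -9.8], requires_grad=True)
ref_logprobs = torch.tensor([-12.0, -14.9, -10.2])
rewards = torch.tensor([0.8, 0.2, 0.9])
policy_gradient_loss(logprobs, ref_logprobs, rewards).backward()
```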

Two rounds of online RPO training increased the Arena Hard score from 69.1 to 88.1. This demonstrates the effectiveness of online RPO training.

For LN-Ultra, they used a similar process but adopted GRPO.

For LN-Nano, they conducted two rounds of offline RPO training, using policy-generated training data.

The first round combined reasoning and non-reasoning data with appropriate system prompts to optimize the model’s reasoning control ability. The second round focused on improving instruction-following abilities.

Evaluation Results: A Comprehensive Assessment

Researchers evaluated the performance of all Llama-Nemotron models on two benchmark categories: reasoning tasks and non-reasoning tasks. This provides a comprehensive assessment of the models’ capabilities.

Reasoning benchmarks included: AIME24 and AIME25, GPQA-Diamond, LiveCodeBench, and MATH500.

Non-reasoning benchmarks included: IFEval for instruction following evaluation, BFCL V2 Live for function call tool usage evaluation, and Arena-Hard for evaluating alignment with human conversation preferences.

LN-Nano achieved excellent performance in all reasoning benchmarks, despite its small size.

This demonstrates that supervised fine-tuning processes and well-curated reasoning datasets are effective in transferring structured reasoning abilities to smaller models.

LN-Super showed strong competitiveness in both reasoning and non-reasoning tasks when compared with other models of similar parameter scale.

In "reasoning off" mode, LN-Super’s performance was comparable to its distilled source model, Llama-3.3-70B; in "reasoning on" mode, it surpassed other competing models, such as DeepSeek-R1-Distilled-Llama-70B, demonstrating strong reasoning ability while maintaining good instruction-following ability.

These results indicate that LN-Super is a versatile model that combines the advantages of reasoning-optimized models and non-reasoning models, making it suitable for daily assistant tasks and structured reasoning tasks.

LN-Ultra performed on par with or better than all existing open-weight models on both reasoning and non-reasoning benchmarks. It achieved state-of-the-art results among open-source models on GPQA, fully demonstrating the effectiveness of Nvidia researchers' large-scale reinforcement learning training methods.

Unlike DeepSeek-R1, which requires an 8×H200 hardware configuration, LN-Ultra is optimized to run efficiently on a single 8×H100 node, providing higher inference throughput and deployment efficiency.

By the end of its SFT phase, LN-Ultra had already approached or matched DeepSeek-R1's performance on multiple reasoning benchmarks (including GPQA and AIME).

In addition to the reasoning and dialogue capabilities the models were trained for, they were also tested on an out-of-distribution task.

Specifically, the model was tested on the JudgeBench dataset, requiring it to distinguish between high-quality and low-quality answers.

The new models outperformed the leading open-source models and most proprietary models on this task.

LN-Ultra became the best-performing open-source model, significantly exceeding DeepSeek-R1, second only to the proprietary model o3-mini(high).

In addition, LN-Super also exceeded o1-mini on this task, indicating that the new models generalize well across a variety of tasks.