Huawei Technologies, a company facing significant technological hurdles due to US sanctions, has reportedly achieved a breakthrough in artificial intelligence (AI) model training. Researchers working on Huawei’s large language model (LLM), Pangu, claim to have developed an enhanced approach that outperforms DeepSeek’s original methodology. This innovative method leverages Huawei’s own proprietary hardware, reducing the company’s reliance on US technologies, a crucial objective in the current geopolitical landscape.
The Emergence of Mixture of Grouped Experts (MoGE)
The cornerstone of Huawei’s advancement lies in the concept of Mixture of Grouped Experts (MoGE). This novel technique, detailed in a paper published by Huawei’s Pangu team, is presented as an upgraded version of the Mixture of Experts (MoE) technique. MoE has proven instrumental in creating cost-effective AI models, as demonstrated by DeepSeek’s success.
MoE lets models grow to very large parameter counts, which enhances learning capacity, without a proportional rise in computation. However, the Huawei researchers identified inefficiencies arising from the uneven activation of “experts”, the specialized components at the heart of the technique, which can hinder performance when tasks run across multiple devices in parallel. Huawei’s MoGE strategically addresses these challenges. The complexity of large language models, particularly those with billions or even trillions of parameters, demands innovative training techniques, and MoGE offers a path to scaling such models more effectively and efficiently. With the rise of generative AI accelerating the quest for ever-larger and more sophisticated models, MoGE provides an avenue to pursue these ambitions without an exponential increase in computational cost and resource requirements.
Addressing Inefficiencies in Traditional MoE Models
The MoGE system is designed to optimize workload distribution. The central idea is to “group” experts together during the selection process, yielding a more balanced distribution of work. By spreading the computational burden more equitably, the researchers reported a notable improvement in parallel computing environments, a key aspect of modern AI training. The gain is not merely incremental: it represents a substantial step forward in training speed, resource utilization, and overall model performance.
The concept of “experts” in AI training refers to specialized sub-models or components within a larger, more comprehensive model. Each expert is meticulously designed to handle very specific tasks or data types. This approach harnesses varied specialized expertise, allowing the overall AI system to significantly improve its overall performance. For example, in a language model, one expert might be specialized in understanding grammar, while another might focus on semantic meaning. This specialization can lead to a more nuanced and accurate understanding of complex text. The selection of the right experts for each input is determined by a gating mechanism that dynamically routes the data to the most appropriate sub-models.
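To make the gating idea concrete, the sketch below implements a toy mixture-of-experts layer in PyTorch. It is a minimal illustration, not Huawei’s code: the layer sizes, the number of experts, and the top-2 selection are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gate picks top-k experts per token.
    Sizes and top_k are illustrative assumptions, not Pangu's settings."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a small feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, keeping computation sparse.
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Calling `ToyMoELayer()(torch.randn(10, 64))` routes each of the 10 token vectors through just two of the eight experts, which is what keeps the computation sparse.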
Implications for China’s AI Advancement
This advancement is particularly timely. Chinese AI companies, despite facing US restrictions on the import of advanced AI chips like those from Nvidia, are aggressively pursuing methods to boost model training and inference efficiency. These methods include not only algorithmic improvements but also the synergistic integration of hardware and software. The geopolitical context makes this advancement even more significant. China’s ability to develop its own independent AI ecosystem is critical for its technological competitiveness and national security. Huawei’s innovations in AI hardware and software are therefore seen as a strategic asset for the country.
Huawei’s researchers rigorously tested the MoGE architecture on their Ascend neural processing unit (NPU), specifically engineered to accelerate AI tasks. The results indicated that MoGE achieved superior expert load balancing and more efficient execution for both model training and inference. This is a significant validation of the benefits of optimizing the hardware and software stack together. Specialized hardware like the Ascend NPU is essential for unlocking the full potential of AI algorithms: by tailoring hardware to specific AI workloads, significant performance gains are possible compared with general-purpose processors.
Benchmarking Pangu Against Leading AI Models
Huawei’s Pangu model, fortified by the MoGE architecture and Ascend NPUs, was benchmarked against leading AI models. These included DeepSeek-V3, Alibaba Group Holding’s Qwen2.5-72B, and Meta Platforms’ Llama-405B. Results of the benchmark showed that Pangu achieved state-of-the-art performance across a range of general English benchmarks, and it excelled on all Chinese benchmarks. Pangu also showcased higher efficiency in processing long-context training, an area of critical significance for sophisticated natural language processing tasks. The success of Pangu demonstrates the effectiveness of Huawei’s integrated approach to AI development. By combining algorithmic innovations with specialized hardware, the company has created a system that is competitive with the best AI models in the world. The model’s performance on Chinese benchmarks is particularly noteworthy, as it suggests that Huawei is well-positioned to serve the rapidly growing Chinese market.
Moreover, the Pangu model demonstrated exceptional capabilities in general language-comprehension tasks, with particular strengths in reasoning tasks. This ability to grasp nuances and extract meaning from complex language demonstrates the advancements Huawei has achieved in AI. The model’s ability to perform well on reasoning tasks indicates that it is not simply memorizing patterns in the data, but rather developing a deeper understanding of the underlying concepts. This is a crucial step towards creating more intelligent and capable AI systems.
Huawei’s Strategic Significance
Huawei’s progress in AI model architecture carries strategic significance. Given ongoing sanctions, the Shenzhen-based company is strategically seeking to decrease its reliance on US technologies. The Ascend chips developed by Huawei are regarded as viable domestic alternatives to processors from Nvidia and are a key component of this independence. This strategic independence is not only important for Huawei, but also for China as a whole. By developing its own AI hardware and software, China can reduce its vulnerability to external pressures and ensure its continued technological progress.
Pangu Ultra, a large language model with 135 billion parameters optimized for NPUs, underscores the effectiveness of Huawei’s architectural and systemic streamlining while showcasing what its NPUs can do. Demonstrating tight hardware-software integration is an important part of Huawei’s case for its AI capabilities, and the sheer scale of Pangu Ultra highlights the company’s ambitions: a 135-billion-parameter model puts Huawei in contention with the largest and most advanced AI models in the world.
Detailed Training Process
According to Huawei, the training process is divided into three major stages: pre-training, long-context extension, and post-training. Pre-training involves training the model on a massive dataset of 13.2 trillion tokens. Long-context extension then expands the model’s ability to handle longer and more complex texts, building on the initial pre-training; this phase uses large-scale distributed processing across 8,192 Ascend chips. The scale is striking: training on 13.2 trillion tokens ensures the model has seen a vast amount of data and is well equipped for a wide range of language tasks, while the use of 8,192 Ascend chips demonstrates Huawei’s commitment to specialized hardware for accelerating AI training.
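The outline below sketches how such a staged pipeline can be organized. The stage names and the 13.2-trillion-token figure come from Huawei’s description; everything else, including the `train_stage` helper and the placeholder sequence lengths, is hypothetical.

```python
# Hypothetical outline of a three-stage LLM training pipeline.
# Stage names and the 13.2-trillion-token figure follow Huawei's
# description; seq_len values and train_stage() are placeholders.
STAGES = [
    {"name": "pre-training",           "tokens": 13.2e12, "seq_len": 4096},
    {"name": "long-context extension", "tokens": None,    "seq_len": 131072},
    {"name": "post-training",          "tokens": None,    "seq_len": None},
]

def train_stage(model, stage):
    """Placeholder: configure the data mix and context length, then train."""
    ...

def run_pipeline(model):
    # Each stage resumes from the checkpoint produced by the previous one.
    for stage in STAGES:
        train_stage(model, stage)
```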
Huawei disclosed that the model and system will soon be made available to its commercial clientele, opening new opportunities for integration and development with its partners. Commercializing Pangu is a significant step towards making the technology more accessible: businesses and organizations will be able to leverage it to improve their operations and build new products and services.
Deep Dive into Mixture of Experts (MoE) and its Limitations
To fully appreciate the significance of Huawei’s MoGE, it’s crucial to understand the foundations upon which it builds: the Mixture of Experts (MoE) architecture. MoE represents a paradigm shift in how large AI models are designed and trained, offering a pathway to scaling model size and complexity without a proportional increase in computational cost. The MoE architecture provides a way to overcome the limitations of traditional neural networks. By dividing the model into smaller, specialized sub-models, MoE enables greater efficiency and scalability.
In a traditional neural network, every input is processed by every neuron in every layer. While this approach can yield high accuracy, it becomes computationally prohibitive for very large models. MoE, in contrast, introduces the concept of “experts” – smaller, specialized neural networks that focus on specific subsets of the input data. The use of specialized experts allows the model to learn more complex patterns in the data. By focusing on specific subsets of the data, each expert can become highly proficient in its area of expertise.
A “gate” network dynamically routes each input to the most relevant expert(s). This selective activation allows for a sparse computation, meaning that only a fraction of the model’s parameters are engaged for any given input. This sparsity dramatically reduces the computational cost of inference (using the model for prediction) and training. Further, because different experts can act on different parts of input data, it allows for greater specialization in the model. The dynamic routing of inputs to the most relevant experts is a key feature of the MoE architecture. This allows the model to adapt to different types of data and tasks.
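A back-of-the-envelope calculation shows why this sparsity matters. All the numbers below are invented for illustration and do not describe Pangu or any particular model:

```python
# Sparsity arithmetic for a top-k MoE layer. Every figure here is an
# assumed placeholder chosen for illustration only.
n_experts = 64        # experts per MoE layer (assumed)
top_k     = 4         # experts activated per token (assumed)
p_expert  = 50e6      # parameters per expert (assumed)
p_shared  = 200e6     # dense, always-active parameters (assumed)

total_params  = n_experts * p_expert + p_shared   # what you must store
active_params = top_k * p_expert + p_shared       # what each token touches

print(f"total:  {total_params / 1e9:.1f} B parameters")
print(f"active: {active_params / 1e9:.2f} B per token "
      f"({active_params / total_params:.0%} of the model)")
# -> total: 3.4 B parameters, active: 0.40 B per token (12% of the model)
```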
Despite the advantages of MoE, several limitations must be addressed to unlock its full potential. The uneven activation of experts is a prime concern. In many MoE implementations, some experts become heavily utilized, while others remain relatively idle. This imbalance stems from the inherent characteristics of the data and the design of the gate network. This uneven activation can lead to inefficiencies in training and inference. The underutilized experts are essentially wasted resources, while the overutilized experts can become bottlenecks.
This imbalance can lead to inefficiencies in parallel computing environments. Since the workload is not evenly distributed across the experts, some processing units are left underutilized while others are overwhelmed. This disparity hinders the scalability of MoE and reduces its overall performance. Also, this imbalance often stems from biases in the training data, leading to under-representation and under-training of less active experts. This results in a sub-optimal model in the long run. The uneven workload distribution makes it difficult to fully utilize the available computational resources. This is a significant challenge for large-scale AI training, where every resource counts.
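The MoE literature typically quantifies and penalizes this imbalance with an auxiliary load-balancing loss; the sketch below follows the version popularized by Switch-Transformer-style routers. It is standard prior art, shown here for context, not part of Huawei’s MoGE:

```python
import torch

def load_balance_loss(gate_probs, expert_idx, n_experts):
    """Switch-Transformer-style auxiliary load-balancing loss.

    gate_probs: (tokens, n_experts) softmax outputs of the gate
    expert_idx: (tokens,) int tensor of the expert each token was routed
                to, e.g. expert_idx = gate_probs.argmax(dim=-1) for top-1
    """
    tokens = gate_probs.shape[0]
    # f_i: fraction of tokens actually dispatched to each expert
    f = torch.bincount(expert_idx, minlength=n_experts).float() / tokens
    # p_i: mean gate probability assigned to each expert
    p = gate_probs.mean(dim=0)
    # Minimized when both distributions are uniform (perfect balance).
    return n_experts * torch.sum(f * p)
```

The loss is minimized when both the dispatch fractions and the mean gate probabilities are uniform, that is, when every expert carries an equal share of the work.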
Another common issue with MoE is the added complexity of designing the gate network. The gate requires sophisticated techniques to ensure that experts are properly selected; otherwise the model may fall short of expectations and incur unnecessary overhead. The gate network is thus a critical aspect of MoE: a poorly designed one leads to suboptimal expert selection and reduced performance.
Mixture of Grouped Experts (MoGE): Addressing the Challenges of MoE
Huawei’s Mixture of Grouped Experts (MoGE) architecture offers a refined alternative to traditional MoE by focusing on load balancing and efficient parallel execution. The method involves grouping experts strategically, which alters the routing process of input data, leading to more even workload distribution. By addressing the limitations of traditional MoE, MoGE provides a pathway to more efficient and scalable AI training.
By grouping the experts during selection, MoGE ensures that each group of experts receives a more balanced workload. Instead of routing each input independently, the gate network now directs groups of inputs to groups of experts. This approach promotes a more equitable distribution of computational burden. The grouping mechanism ensures that all experts within a group are trained on a more diverse set of inputs, reducing the risk of under-representation and under-training.
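One way to realize this grouped selection, consistent with the high-level description above, is to partition the experts into equal-sized groups and have each token pick its top-k experts inside every group. The function below is a simplified sketch of that idea, not Huawei’s implementation; the group count and k are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grouped_topk_routing(scores, n_groups, k_per_group):
    """Pick top-k experts inside each group, so groups get equal traffic."""
    tokens, n_experts = scores.shape           # requires n_experts % n_groups == 0
    group_size = n_experts // n_groups
    grouped = scores.view(tokens, n_groups, group_size)
    weights, idx = grouped.topk(k_per_group, dim=-1)   # top-k within each group
    # Convert within-group indices back to global expert ids.
    offsets = torch.arange(n_groups, device=scores.device).view(1, -1, 1) * group_size
    global_idx = (idx + offsets).flatten(1)            # (tokens, n_groups * k)
    weights = F.softmax(weights.flatten(1), dim=-1)    # renormalize kept scores
    # Every token activates exactly k_per_group experts in every group, so
    # each group (e.g., each device hosting a group) sees equal load.
    return weights, global_idx
```

With, say, 64 experts in 8 groups and k = 1, every token activates exactly one expert per group, so a device hosting one group receives the same traffic as every other device by construction.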
The grouping mechanism also helps to mitigate the effects of data biases: because routing is balanced across groups, less-popular experts are not starved of training signal. Grouping further enables better resource utilization, since each group handles a more consistent workload, making it easier to allocate computational resources efficiently and improving overall performance. By dampening the effects of data biases, MoGE helps produce more robust and generalizable AI models.
The end result is better expert load balancing and more efficient execution for both model training and inference, which translates into faster training times, lower computational costs, and improved overall performance, making MoGE a compelling alternative to traditional MoE.
The Ascend NPU: Hardware Acceleration for AI
The Ascend NPU (Neural Processing Unit) plays a key role in Huawei’s AI strategy. These processors are specifically designed to accelerate AI tasks, including model training and inference. They offer a variety of features optimized for deep learning workloads, such as high memory bandwidth, specialized processing units for matrix multiplication, and low-latency communication interfaces. The use of specialized hardware is essential for achieving high performance in AI training and inference.
Further, Huawei’s Ascend NPUs support a range of data types and precision levels, allowing fine-grained control over the trade-off between performance and accuracy.
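As a framework-agnostic illustration of what precision control buys, the snippet below uses PyTorch’s `autocast` to run compute-heavy operations in bfloat16. Ascend NPUs ship with their own software stack, so treat this as the general concept rather than Huawei’s API:

```python
# Generic mixed-precision illustration, shown with PyTorch for familiarity.
# This demonstrates the concept of trading precision for speed; it is not
# Ascend-specific code.
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024)

# Compute-heavy ops run in bfloat16; numerically sensitive ops stay fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```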
The synergistic combination of MoGE and the Ascend NPU creates a powerful platform for AI innovation: MoGE optimizes the software side by improving load balancing and parallel execution, while the Ascend NPU provides the hardware acceleration needed to realize those benefits. This integrated approach lets Huawei push the boundaries of AI performance and efficiency.
The Ascend NPU is characterized by high computing density and energy efficiency, features that are critical for deploying AI models in a variety of settings, from powerful cloud servers to edge devices with limited power budgets.
Benchmarks and Performance Metrics
Huawei’s benchmark results demonstrate the effectiveness of the MoGE architecture and the Ascend NPU. By comparing Pangu against leading AI models such as DeepSeek-V3, Qwen2.5-72B, and Llama-405B, Huawei showed that its technology achieves state-of-the-art performance across a variety of tasks.
Pangu’s success on general English and Chinese benchmarks highlights its versatility and adaptability. Its proficiency in long-context training is particularly noteworthy, as it reflects the capability to handle real-world data, and its strong performance on reasoning tasks underscores an ability to understand and process complex relationships.
These benchmarks are not merely academic exercises; they offer tangible evidence of the technological strides Huawei has made, bolstering the company’s claim to be at the forefront of AI innovation and strengthening its position in the global market.
Implications for Huawei’s Future
Huawei’s advances in AI model training have critical implications for the company’s strategic vision of establishing technological sovereignty in artificial intelligence. As it minimizes reliance on US technologies amid the ongoing trade conflict, its Ascend chips serve as domestic alternatives to processors from Nvidia and AMD.
Pangu Ultra, with its 135 billion parameters running on Huawei’s own NPUs, is central to this effort: it demonstrates that the company’s cutting-edge chips can host a frontier-scale model.
These efforts are expected to strengthen Huawei’s long-term competitiveness as it pursues a growing AI market, particularly within China. By continuing to invest in research and development, the company aims to establish itself as a leader in the AI space despite current market constraints.
Future Research
Huawei’s continuing enhancements to AI model architecture through system- and algorithm-level optimizations, alongside hardware developments such as the Ascend chip, underscore its push to stay ahead of the technological curve in artificial intelligence. While benchmarks show Pangu to be a state-of-the-art model, there is still plenty of room for improvement. Further refinement of the MoGE architecture may allow it to scale to larger and more complex computations, and deeper specialization of the Ascend NPU’s architecture may further accelerate deep learning workloads and reduce costs. Future work will continue these efforts to build better AI models and improve existing ones.