Reimagining AI Chips & Infrastructure Post-DeepSeek

The rapid pace of innovation in AI technology, exemplified by DeepSeek’s advancements, necessitates a fundamental re-evaluation of how we construct data centers, chips, and systems to provide the necessary computing power. DeepSeek’s engineering innovations have significantly reduced AI computing costs, prompting a broader discussion about the future of AI infrastructure.

While DeepSeek may not have drastically expanded the boundaries of AI technology, its influence on the AI market is profound. Technologies such as Mixture of Experts (MoE), Multi-head Latent Attention (MLA), and Multi-Token Prediction (MTP) have gained prominence alongside DeepSeek. Although not all of these technologies were pioneered by DeepSeek, its successful implementation of them has spurred widespread adoption. MLA, in particular, has become a focal point of discussion across platforms, from edge devices to cloud computing.

MLA and the Challenge of Algorithm Innovation

Elad Raz, CEO of NextSilicon, recently pointed out that while MLA improves memory efficiency, it may also increase the workload for developers and complicate the deployment of AI in production environments. GPU users may need to hand-code optimizations for MLA. This example underscores the need to rethink AI chip and infrastructure architectures in the post-DeepSeek era.

To understand the significance of MLA, it’s essential to grasp the underlying concepts of Large Language Models (LLMs). When generating responses to user inputs, LLMs rely heavily on KV vectors – keys and values – which enable the model to focus on relevant data. In attention mechanisms, the model compares new requests with keys to determine the most relevant content.
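To make the mechanism concrete, here is a minimal NumPy sketch of the attention lookup described above. The shapes and names are illustrative rather than taken from any particular model.

```python
# A minimal sketch of the attention lookup: compare a new query against
# cached keys, then return a weighted blend of the cached values.
import numpy as np

def attention(query, keys, values):
    """query: (d,) new request; keys: (n, d) cached keys; values: (n, d)."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)      # relevance of each cached entry
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over cached tokens
    return weights @ values                 # weighted blend of values

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((128, 64))   # 128 cached tokens, dimension 64
V = rng.standard_normal((128, 64))
print(attention(q, K, V).shape)      # -> (64,)
```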

Elad Raz offers a book analogy: the keys are like ‘a book’s chapter titles, indicating what each part is about,’ while the values are the more detailed summaries under those titles. When a user submits a request, the model is effectively asking, ‘Under this storyline, which chapter is most relevant?’

MLA compresses these chapter titles (keys) and summaries (values), accelerating the search for answers and boosting efficiency. Ultimately, MLA helps DeepSeek reduce memory usage to roughly 5-13% of what conventional attention would require; more detail can be found in DeepSeek’s official paper. MediaTek’s developer conference even discussed support for MLA in its Dimensity mobile chips, underscoring DeepSeek’s extensive influence.
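The core idea is easiest to see in code. Below is a rough sketch of low-rank KV compression, with all dimensions invented for illustration; the actual MLA design has further refinements and is specified in DeepSeek’s paper.

```python
# Hedged sketch of the low-rank KV-compression idea behind MLA:
# cache one small latent vector per token instead of full keys and values.
import numpy as np

d_model, d_latent, n_tokens = 1024, 128, 4096   # invented dimensions
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02   # rebuild keys
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02   # rebuild values

x = rng.standard_normal((n_tokens, d_model))    # hidden states

latent = x @ W_down          # (n_tokens, d_latent) -- all that gets cached
k = latent @ W_up_k          # keys recovered on the fly
v = latent @ W_up_v          # values recovered on the fly

full_cache = 2 * n_tokens * d_model   # standard: K and V cached separately
mla_cache = n_tokens * d_latent       # one latent vector per token
print(f"cache size: {mla_cache / full_cache:.1%} of standard KV caching")
# -> 6.2%, in the same ballpark as the 5-13% figure quoted above
```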

Technologies like MLA represent typical algorithmic innovations in the AI era. However, the rapid pace of AI technology development leads to a constant stream of innovations, which in turn creates new challenges, especially when these innovations are tailored to specific platforms. In the case of MLA, non-NVIDIA GPU users require extra manual coding to leverage the technology.

While DeepSeek’s technologies demonstrate the innovation and value of the AI era, hardware and software must adapt to these innovations. According to Elad Raz, such adaptation should minimize complexity for developers and production environments. Otherwise, the cost of each innovation becomes prohibitively high.

The question then becomes: ‘What happens if the next algorithmic innovation doesn’t translate simply and cleanly to existing architectures?’

The Conflict Between Chip Design and Algorithm Innovation

Over the past few years, AI chip manufacturers have consistently reported that designing large AI chips takes at least 1-2 years. This means that chip design must commence well in advance of a chip’s market release. Given the rapid advancements in AI technology, AI chip design must be forward-looking. Focusing solely on current needs will result in outdated AI chips that cannot adapt to the latest application innovations.

AI application algorithm innovation now occurs on a weekly basis. As mentioned in previous articles, the computing power required for AI models to achieve the same capabilities decreases by 4-10 times annually. The inference cost of AI models matching GPT-3 quality has fallen by 1200 times over the past three years, and today’s 2B-parameter models can reach the level of the original 175B-parameter GPT-3. This rapid innovation in the upper layers of the AI technology stack presents significant challenges for traditional chip architecture planning and design.
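A quick back-of-the-envelope check shows how these figures relate; the snippet below simply compounds the quoted annual reduction range over three years.

```python
# Compounding the 4-10x annual efficiency gain over three years.
annual_low, annual_high = 4, 10
years = 3

low = annual_low ** years     # 4x per year for 3 years  -> 64x
high = annual_high ** years   # 10x per year for 3 years -> 1000x
print(f"3-year compounded reduction: {low}x to {high}x")
# The cited ~1200x drop in GPT-3-class inference cost sits just beyond
# the aggressive end of this range.
```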

Elad Raz believes that the industry needs to recognize innovations like DeepSeek MLA as the norm for AI technology. ‘Next-generation computing needs to not only optimize for today’s workloads but also accommodate future breakthroughs.’ This perspective applies not only to the chip industry but to the entire mid-to-lower-level infrastructure of the AI technology stack.

‘DeepSeek and other innovations have demonstrated the rapid advancement of algorithm innovation,’ said Elad Raz. ‘Researchers and data scientists need more versatile, resilient tools to drive new insights and discoveries. The market needs intelligent, software-defined hardware computing platforms that allow customers to ‘drop-in replace’ existing accelerator solutions, while enabling developers to port their work painlessly.’

To address this situation, the industry must design more intelligent, adaptable, and flexible computing infrastructure.

Flexibility and efficiency are often conflicting goals. CPUs are highly flexible but have significantly lower parallel computing efficiency than GPUs. GPUs, with their programmability, may be less efficient than dedicated AI ASIC chips.

Elad Raz noted that NVIDIA expects AI data center racks to reach 600kW of power consumption soon. For context, 75% of standard enterprise data centers have a peak power consumption of only 15-20kW per rack. Regardless of the potential efficiency gains in AI, this poses a significant challenge for data centers building computing infrastructure systems.
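The gap is worth stating as a ratio; the quick check below uses only the figures quoted above.

```python
# Ratio check on the rack-power figures cited above.
ai_rack_kw = 600
typical_rack_kw = (15, 20)
for p in typical_rack_kw:
    print(f"{ai_rack_kw / p:.0f}x a {p} kW enterprise rack")
# -> 40x a 15 kW rack, 30x a 20 kW rack
```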

In Elad Raz’s view, current GPUs and AI accelerators may not be sufficient to meet the potential demands of AI and High-Performance Computing (HPC). ‘If we don’t fundamentally rethink how we improve computing efficiency, the industry risks hitting physical and economic limits. This wall will also have side effects, limiting access to AI and HPC for more organizations, hindering innovation even with advancements in algorithms or traditional GPU architectures.’

Recommendations and Requirements for Next-Generation Computing Infrastructure

Based on these observations, Elad Raz proposed ‘four pillars’ for defining next-generation computing infrastructure:

(1) Plug-and-Play Replaceability: ‘History has shown that complex architecture transitions, like the migration from CPU to GPU, can take decades to fully implement. Therefore, next-generation computing architectures should support smooth migration.’ For ‘plug-and-play’ replaceability, Elad Raz suggests that new computing architectures should learn from the x86 and Arm ecosystems, achieving broader adoption through backward compatibility.

Modern designs should also avoid requiring developers to rewrite large amounts of code or creating dependencies on specific vendors. ‘For example, support for emerging technologies like MLA should be standardized, rather than requiring extra manual adjustments as is the case with non-NVIDIA GPUs. Next-generation systems should understand and optimize new workloads out-of-the-box, without requiring manual code modifications or significant API adjustments.’
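What ‘out-of-the-box’ support could look like in software is sketched below: a hypothetical dispatch layer (not any vendor’s actual API) that routes an operation such as MLA attention to a tuned backend kernel when one exists, and to a portable reference implementation otherwise, so developers never hand-code the fallback path.

```python
# Illustrative kernel registry with a portable fallback. All names are
# invented for this sketch; no real vendor API is being depicted.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register(op: str, backend: str):
    """Register an implementation of `op` for a given hardware backend."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap

def dispatch(op: str, backend: str) -> Callable:
    """Prefer a backend-specific kernel; fall back to the portable one."""
    return _KERNELS.get((op, backend)) or _KERNELS[(op, "generic")]

@register("mla_attention", "generic")
def mla_attention_reference(*args):
    ...  # portable reference implementation: always available

@register("mla_attention", "vendor_x")
def mla_attention_tuned(*args):
    ...  # hand-tuned kernel, used only on vendor_x hardware

kernel = dispatch("mla_attention", backend="vendor_y")  # no tuned kernel
assert kernel is mla_attention_reference                # falls back cleanly
```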

(2) Adaptable, Real-Time Performance Optimization: Elad Raz believes that the industry should move away from fixed-function accelerators. ‘The industry needs to build on intelligent, software-defined hardware foundations that can dynamically self-optimize at runtime.’

‘By continuously learning from workloads, future systems can adjust themselves in real-time, maximizing utilization and sustained performance, regardless of the specific application workload. This dynamic adaptability means that infrastructure can provide consistent efficiency in real-world scenarios, whether it’s running HPC simulations, complex AI models, or vector database operations.’
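As a loose illustration of this idea, the sketch below profiles a few candidate kernel configurations against a sample of the live workload and keeps the fastest; the names, the configuration space, and the placeholder kernel are all invented.

```python
# Minimal runtime self-optimization loop: time each candidate config on
# real inputs and select the fastest. A production system would re-run
# this as the workload shifts.
import time

def autotune(kernel, configs, sample_input, trials=3):
    """Benchmark each config on sample_input; return the fastest one."""
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        t0 = time.perf_counter()
        for _ in range(trials):
            kernel(sample_input, **cfg)
        elapsed = (time.perf_counter() - t0) / trials
        if elapsed < best_t:
            best_cfg, best_t = cfg, elapsed
    return best_cfg

def matmul_kernel(x, tile=64):
    # stand-in for a real tiled kernel whose speed depends on tile size
    return [row[:] for row in x]

configs = [{"tile": 32}, {"tile": 64}, {"tile": 128}]
x = [[1.0] * 256 for _ in range(256)]
print(autotune(matmul_kernel, configs, x))   # -> winning config dict
```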

(3) Scalable Efficiency: ‘By decoupling hardware and software and focusing on intelligent real-time optimization, future systems should achieve higher utilization and lower overall energy consumption. This would make infrastructure more cost-effective and scalable to meet the evolving demands of new workloads.’ This is particularly critical given the escalating energy demands of AI data centers: optimizing energy consumption reduces operational costs and supports a more sustainable approach to AI development. Scalable efficiency also implies adapting to varying workloads, allocating resources dynamically across different AI applications, an agility that is crucial for organizations deploying a diverse range of AI solutions.

(4) Future-Ready Design: This pillar corresponds to the forward-looking requirement for AI infrastructure, especially chip design. ‘Today’s cutting-edge algorithms may be outdated tomorrow.’ ‘Whether it’s AI neural networks or Transformer-based LLM models, next-generation computing infrastructure needs to be adaptable, ensuring that enterprises’ technology investments remain resilient for years to come.’ Future-ready design goes beyond anticipating algorithmic advances: it also means accommodating new data types, evolving security threats, and emerging architectural paradigms, which calls for a modular, extensible design that can be upgraded and reconfigured over time. It should likewise weigh the ethical implications of AI, incorporating mechanisms for fairness, transparency, and accountability as AI becomes more deeply integrated into society.

These suggestions offer a relatively idealized yet thought-provoking perspective, and they are worth weighing as a guiding methodology for the future development of AI and HPC, even if some of the inherent contradictions remain long-standing industry issues. ‘To unleash the potential of AI, HPC, and other future computing and data-intensive workloads, we must rethink infrastructure and embrace dynamic and intelligent solutions to support innovation and pioneers.’

Implementing these recommendations will take a collaborative effort among researchers, engineers, policymakers, and industry leaders. Working together, the industry can build an AI ecosystem that is not only technologically advanced but also sustainable, equitable, and broadly beneficial, guided by a long-term perspective and the principles of adaptability, scalability, efficiency, and ethical responsibility.

Next-generation computing infrastructure must also address growing concerns around data privacy and security. As the volume and sensitivity of data used in AI models grow, robust safeguards are needed against unauthorized access, data breaches, and other cyber threats: technical measures such as encryption and access controls, alongside organizational policies for responsible data handling. Infrastructure should also be designed to comply with evolving privacy regulations such as GDPR and CCPA, which place strict requirements on the collection, storage, and use of personal data. Prioritizing privacy and security builds trust in AI and fosters its adoption across a wider range of applications.

The development of next-generation computing infrastructure also opens opportunities for innovation in hardware design, software development, and data management. New classes of processors, such as neuromorphic chips and quantum computers, could significantly accelerate AI workloads; better software tools and frameworks can simplify the development and deployment of AI models; and techniques such as federated learning and differential privacy can enable training on distributed data sources without compromising privacy. Investing in research and development across these areas can unlock new possibilities for AI and yield a more robust, versatile computing infrastructure. The future of AI depends not only on advances in algorithms and models but also on the infrastructure beneath them. By rethinking how we design and build computing systems, we can create a more efficient, scalable, and sustainable AI ecosystem that benefits everyone.