Hardware Strategy: Scaling Up and Out
Nvidia is strategically positioned to address the evolving demands of agent-based AI, a field poised to place unprecedented burdens on inference capabilities. The company’s comprehensive strategy encompasses advancements in both hardware and software, designed to provide the necessary infrastructure for this transformative technology.
The cornerstone of Nvidia’s hardware strategy is the relentless pursuit of ever more powerful GPUs, guided by a two-phase approach: scale up first, then scale out. The objective is not merely a single, ultra-powerful AI supercomputer within one rack; Nvidia aims to interconnect many such racks into a massive AI supercomputer complex. This ‘AI factory’ approach is specifically tailored to deliver the computational horsepower required by the most demanding AI workloads.
The recently unveiled Blackwell Ultra rack-scale AI supercomputer, showcased at the GTC conference, is a prime example of this strategy. It is designed to accelerate both training and test-time-scaling inference. While based on the established Blackwell architecture, Blackwell Ultra ships in the more potent GB300 NVL72 configuration: 72 Blackwell Ultra GPUs interconnected via NVLink, delivering 1.1 exaflops of FP4 compute, or 1.5 times the AI performance of the GB200 NVL72. A single DGX GB300 system boasts 15 exaflops of compute. Scheduled for release in the second half of 2025, Blackwell Ultra will be supported by a broad spectrum of server vendors, including Cisco, Dell, HPE, Lenovo, ASUS, Foxconn, Gigabyte, Pegatron, and Quanta. Leading cloud service providers such as AWS, GCP, and Azure will also offer compute services built on it.
Beyond these power-plant-scale AI factory systems, Nvidia has also introduced a new family of computers tailored to inference needs within the enterprise: the DGX Spark and DGX Station personal AI computers. The DGX Spark, with a form factor reminiscent of a Mac mini, delivers up to 1 petaflop of compute.
To put this level of computing power in perspective, consider the Taiwania 3 supercomputer, launched in 2021 with more than 50,000 cores: it delivers only about 2.7 petaflops. Just four years later, three desktop-sized personal AI computers together offer roughly 3 petaflops, surpassing Taiwania 3. Priced at US$3,999 (approximately NT$130,000) for the 128GB memory configuration, these personal AI computers are designed to power enterprises’ internal AI initiatives, functioning as miniature AI factories or serving edge AI deployments.
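As a quick back-of-envelope check of that comparison, using only the headline figures quoted above (vendor and system ratings, not like-for-like benchmark results):

```python
# Back-of-envelope check of the comparison above, using the headline figures
# quoted in the text (note these are not like-for-like benchmark results).
dgx_spark_pflops = 1.0   # per DGX Spark unit, as quoted
taiwania3_pflops = 2.7   # Taiwania 3, as quoted

units = 3
combined = units * dgx_spark_pflops
print(f"{units} x DGX Spark = {combined} PFLOPS vs Taiwania 3 = {taiwania3_pflops} PFLOPS")
print("Exceeds Taiwania 3:", combined > taiwania3_pflops)  # True
```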
Future Roadmap: Vera Rubin and Beyond
Looking forward, Nvidia CEO Jensen Huang has outlined a product roadmap for the next two years, providing insight into the company’s future innovations. The company plans to release the Vera Rubin NVL144 in the second half of 2026, named in honor of the American astronomer Vera Rubin, whose measurements of galaxy rotation provided key evidence for dark matter. The Vera Rubin NVL144 will deliver 3.3 times the performance of the GB300 NVL72, with more than 1.6 times the memory capacity, bandwidth, and NVLink speed. In the second half of 2027, Nvidia will follow with the Rubin Ultra NVL576, delivering a remarkable 14 times the performance of the GB300 NVL72, along with significantly greater memory capacity and bandwidth via NVLink7 and CX9.
Building upon the Vera Rubin architecture, Nvidia’s next-generation architecture will be named after the esteemed American physicist Richard Feynman, widely recognized for his contributions to quantum mechanics and his role in the investigation of the Challenger space shuttle disaster. This choice reflects Nvidia’s commitment to pushing the boundaries of scientific advancement and innovation.
Software Strategy: Nvidia Dynamo
Nvidia has consistently emphasized the critical role of software, often considering it to be even more important than hardware. This strategic focus is particularly evident in the company’s AI factory initiatives.
In addition to continuously expanding the CUDA-X AI acceleration library to encompass a diverse range of domains and developing specialized acceleration libraries tailored to specific AI tasks, Nvidia has introduced Nvidia Dynamo, a novel AI factory operating system. Notably, Nvidia has chosen to open-source this operating system, making it accessible to the broader AI community.
Nvidia Dynamo is an open-source inference-serving framework engineered to help build platforms that provide LLM inference services. It can be readily deployed in Kubernetes (K8s) environments, enabling the deployment and management of large-scale AI inference workloads. Nvidia plans to integrate Dynamo into its NIM microservices, positioning it as a key component of Nvidia AI Enterprise.
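In practice, a platform built on an inference-serving framework of this kind typically fronts the GPU cluster with an HTTP API. As a minimal sketch, the snippet below assumes the deployment exposes an OpenAI-compatible chat-completions endpoint; the host, port, and model name are placeholders for illustration, not Dynamo defaults.

```python
# Minimal sketch: querying an LLM inference service fronted by a framework
# such as Dynamo. Assumes the deployment exposes an OpenAI-compatible
# /v1/chat/completions endpoint; the URL and model name are placeholders.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical address

payload = {
    "model": "deepseek-ai/DeepSeek-R1",  # whichever model the cluster serves
    "messages": [{"role": "user",
                  "content": "Why does splitting prefill and decode help GPU utilization?"}],
    "max_tokens": 256,
}

resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```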
Dynamo represents the next generation of Nvidia’s existing open-source inference server, Triton. Its core innovation is splitting LLM inference into its two distinct stages and serving them on separate GPU resources, enabling more flexible and efficient use of GPUs to optimize inference processing, raise overall throughput, and maximize utilization. Dynamo dynamically allocates GPUs to match inference demand and accelerates asynchronous data transfer between GPUs, thereby reducing model response times.
Transformer-based generative AI models typically divide inference into two primary stages: Prefill, which processes the entire input prompt and stores the resulting intermediate state (the KV cache), and Decode, a sequential process that generates each new token based on the ones before it.
Traditional LLM inference assigns both the Prefill and Decode stages to the same GPU. But the two stages have different computational profiles: Prefill is compute-intensive, while Decode is bound by memory bandwidth. Dynamo therefore splits them apart, assigning GPU resources to each stage separately and adjusting the allocation dynamically based on the workload, as sketched below. This disaggregated scheduling improves the performance of the GPU cluster as a whole.
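The toy sketch below illustrates the disaggregation idea in miniature. It is not Dynamo’s API: the Request class, the two worker functions, and the list standing in for a KV cache are invented for the example, and a real system would move the KV cache between GPUs over NVLink or RDMA rather than passing a Python object.

```python
# Illustrative toy model of disaggregated prefill/decode scheduling.
# Not Dynamo's API; "KV cache" here is a stand-in object.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)  # filled by prefill
    output: list = field(default_factory=list)    # filled by decode

def prefill_worker(req: Request) -> Request:
    """Compute-bound stage: process the whole prompt at once and build the KV cache."""
    req.kv_cache = [f"kv({tok})" for tok in req.prompt.split()]
    return req

def decode_worker(req: Request) -> Request:
    """Memory-bandwidth-bound stage: generate tokens one at a time,
    each step reading (and extending) the KV cache."""
    for i in range(req.max_new_tokens):
        next_tok = f"tok{i}"              # stand-in for a sampled token
        req.kv_cache.append(f"kv({next_tok})")
        req.output.append(next_tok)
    return req

# A scheduler can keep separate GPU pools for each stage and size them
# independently, e.g. fewer prefill GPUs for long prompts, more decode GPUs
# for high-concurrency generation.
prefill_queue, decode_queue = Queue(), Queue()
prefill_queue.put(Request(prompt="explain disaggregated serving", max_new_tokens=4))

while not prefill_queue.empty():
    decode_queue.put(prefill_worker(prefill_queue.get()))
while not decode_queue.empty():
    done = decode_worker(decode_queue.get())
    print(" ".join(done.output))
```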
Nvidia’s internal testing shows that running the massive 671-billion-parameter DeepSeek-R1 model on GB200 NVL72 with Dynamo can improve inference performance by as much as 30 times, while throughput for Llama 70B on Hopper GPUs can be more than doubled.
Managing inference tasks is inherently complex due to the intricate nature of inference computation and the multitude of parallel processing models involved. Jensen Huang emphasized that Nvidia launched the Dynamo framework to provide an operating system specifically designed for AI factories, addressing the unique challenges of these environments.
Traditional data centers rely on virtualization software such as VMware to orchestrate different applications across enterprise IT resources. In the future, AI agents will become the predominant applications, and AI factories will need Dynamo, rather than VMware, to manage their operations.
Huang’s decision to name the new AI factory operating system after the dynamo, the early electrical generator that helped power the industrial revolution, underscores his ambitions for the platform. He envisions Dynamo as a transformative force that unlocks new levels of efficiency and innovation in AI infrastructure, streamlining AI development and deployment across industries, while its open-source nature invites community collaboration and a more robust, adaptable AI ecosystem.