AI Efficiency: Less Than 100% Brainpower Is OK

The relentless march of AI development has consistently shown that larger models tend to be smarter, but their operational demands also escalate. This creates a significant challenge, especially in regions with limited access to advanced AI chips. Regardless of geography, however, model developers are increasingly embracing Mixture of Experts (MoE) architectures coupled with innovative compression techniques. The goal? To drastically reduce the computational resources needed to deploy and run these expansive Large Language Models (LLMs). As we approach the third anniversary of the generative AI boom ignited by ChatGPT, the industry is finally beginning to seriously consider the economic implications of keeping these power-hungry models running.

While MoE models, like those from Mistral AI, have been around for some time, their real breakthrough has occurred in the last year. We’ve witnessed a surge of new open-source LLMs from tech giants like Microsoft, Google, IBM, Meta, DeepSeek, and Alibaba, all leveraging some form of MoE architecture. The allure is straightforward: MoE architectures offer a far more efficient alternative to traditional “dense” model architectures.

Overcoming Memory Limitations

The foundation of MoE architecture dates back to the early 1990s, with the publication of “Adaptive Mixtures of Local Experts.” The core idea revolves around distributing tasks to one or more specialized sub-models or “experts,” rather than relying on a single, massive model trained on a broad spectrum of data.

In theory, each expert can be meticulously optimized for a specific domain, from coding and mathematics to creative writing. However, it’s worth noting that most model developers provide limited details about the specific experts within their MoE models, and the number of experts varies from model to model. Crucially, only a fraction of the overall model is actively engaged at any given time. This selective activation is what allows these models to be significantly more efficient.

Consider DeepSeek’s V3 model, which comprises 256 routed experts along with a shared expert. During token processing, only eight routed experts, plus the shared one, are activated. This selective activation means that MoE models may not always match the quality of similarly sized dense models. Alibaba’s Qwen3-30B-A3B MoE model, for example, fell slightly short of the dense Qwen3-32B model in Alibaba’s own benchmark tests, illustrating the trade-off between efficiency and raw performance often seen in MoE implementations.
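
To make the selective-activation idea concrete, here is a minimal sketch of top-k expert routing using the DeepSeek V3 figures above (256 routed experts, eight active per token, plus a shared expert). The shapes, gating scheme, and names are purely illustrative, not DeepSeek’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_routed, top_k = 64, 256, 8          # expert count and top-k mirror the DeepSeek V3 figures above

# Illustrative experts: one small weight matrix each, plus one always-on shared expert.
routed_W = rng.standard_normal((num_routed, d, d)) * 0.02
shared_W = rng.standard_normal((d, d)) * 0.02
gate_W = rng.standard_normal((num_routed, d)) * 0.02     # router projection

def moe_forward(x):
    """Score every routed expert, but only run the top-k of them plus the shared expert."""
    scores = gate_W @ x                                   # one gating score per routed expert
    top_idx = np.argsort(scores)[-top_k:]                 # indices of the k best-scoring experts
    w = np.exp(scores[top_idx] - scores[top_idx].max())
    w /= w.sum()                                          # softmax over the selected experts only

    # Only 8 of the 256 routed experts (plus the shared one) touch this token, which is
    # why far fewer weights have to be streamed from memory per generated token.
    routed = sum(wi * (routed_W[i] @ x) for wi, i in zip(w, top_idx))
    return routed + shared_W @ x

out = moe_forward(rng.standard_normal(d))
```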

That dip in quality, however, has to be weighed against the substantial efficiency gains MoE architectures offer. Because only a fraction of the parameters are active for any given token, memory bandwidth requirements are no longer proportional to the capacity needed to store the model’s weights. Essentially, while MoE models may still require substantial memory, they don’t necessarily need it to be the fastest and most expensive High Bandwidth Memory (HBM). This is a crucial consideration, as HBM is a significant cost driver in high-performance computing, and the ability to use less expensive memory makes AI deployment more accessible.

Let’s illustrate this with a comparison. Consider Meta’s largest “dense” model, Llama 3.1 405B, and Llama 4 Maverick, a comparable model that employs an MoE architecture with 17 billion active parameters. While numerous factors, such as batch size, floating-point performance, and key-value caching, contribute to real-world performance, we can approximate the minimum bandwidth requirements by multiplying the model’s size in gigabytes at a given precision (1 byte per parameter for 8-bit models) by the target tokens per second at a batch size of one. This provides a simplified but useful benchmark.
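
To make that rule of thumb concrete, here is a back-of-the-envelope sketch in Python; the function is our own illustration rather than a vendor tool, and it deliberately ignores batch size, key-value cache traffic, and compute limits.

```python
def min_bandwidth_tbps(active_params_billion, bytes_per_param, tokens_per_sec):
    """Rough floor on memory bandwidth: every generated token has to stream the
    active weights out of memory once, so bytes of active weights x tokens/sec."""
    gigabytes = active_params_billion * bytes_per_param   # active weights in GB
    return gigabytes * tokens_per_sec / 1000              # GB/s -> TB/s

# Llama 3.1 405B at 8-bit, 50 tokens/s: all 405B parameters are active every token.
print(min_bandwidth_tbps(405, 1, 50))   # ~20.25 TB/s
# Llama 4 Maverick at 8-bit, 50 tokens/s: only ~17B parameters are active per token.
print(min_bandwidth_tbps(17, 1, 50))    # ~0.85 TB/s
```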

Running an 8-bit quantized version of Llama 3.1 405B would necessitate over 405 GB of vRAM and at least 20 TB/s of memory bandwidth to generate text at 50 tokens per second. Nvidia’s HGX H100-based systems, which until recently commanded prices of $300,000 or more, provided only 640 GB of HBM3 and approximately 26.8 TB/s of aggregate bandwidth, and running the full 16-bit model would have required at least two of them. The staggering cost and power consumption of serving large dense models limit where they can be deployed.

In contrast, Llama 4 Maverick, while consuming the same amount of memory, requires less than 1 TB/s of bandwidth to achieve comparable performance, because only 17 billion parameters’ worth of experts are involved in generating each token. That translates to an order-of-magnitude increase in text generation speed on the same hardware: the memory-bandwidth bottleneck is now set by the active parameter count rather than the total model size, which means faster inference at lower latency and reduced power consumption.

Conversely, if sheer performance isn’t a primary concern, many of these models can now be run on cheaper, albeit slower, GDDR6, GDDR7, or even DDR memory, as seen in Intel’s latest Xeons. This allows for deployment on more cost-effective platforms, making AI more accessible to a wider range of users and organizations. The trade-off is latency, but for some applications, this is acceptable.

Nvidia’s new RTX Pro Servers, announced at Computex, are tailored to this very scenario. Instead of relying on expensive and power-hungry HBM that requires advanced packaging, each of the eight RTX Pro 6000 GPUs in these systems is equipped with 96 GB of GDDR7 memory, the same type found in modern gaming cards. This strategic shift towards lower cost memory expands the opportunities for smaller companies to run complex models.

These systems deliver up to 768 GB of vRAM and 12.8 TB/s of aggregate bandwidth, more than sufficient to run Llama 4 Maverick at hundreds of tokens per second. While Nvidia hasn’t revealed pricing, the workstation edition of these cards retails at around $8,500, suggesting that these servers could be priced at less than half the cost of a used HGX H100. This further reinforces the move toward more cost-effective solutions for running demanding AI models.

However, MoE doesn’t signify the end of HBM-stacked GPUs. Expect Llama 4 Behemoth, assuming it ever ships, to require a rack full of GPUs due to its sheer size. Very large models with extensive context windows will continue to leverage powerful, parallel processing architectures. HBM will likely remain the gold standard for memory in situations where ultra-low latency and very high performance are essential.

While Behemoth has roughly half as many active parameters as Llama 3.1 405B, it boasts a total of 2 trillion parameters. Currently, no single conventional GPU server on the market can accommodate the full 16-bit model alongside a context window of a million tokens or more; further innovation is needed to handle that kind of memory footprint.

The CPU Renaissance in AI?

Depending on the specific application, a GPU may not always be necessary, particularly in regions where access to high-end accelerators is restricted. The trend toward running AI on CPUs could lower the barrier to entry and broaden access to AI development.

Intel showcased a dual-socket Xeon 6 platform equipped with 8800 MT/s MCRDIMMs in April. This setup achieved a throughput of 240 tokens per second in Llama 4 Maverick, with an average output latency of under 100 ms per token. This demonstrates the evolving landscape of CPU-based AI.

In simpler terms, the Xeon platform could sustain 10 tokens per second or more per user for approximately 24 concurrent users. This opens up various possibilities for real-world deployment in scenarios where GPU acceleration is not readily available. A single large Xeon-based server can handle a reasonable number of concurrent text generation tasks.
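
That per-user estimate is straightforward arithmetic, sketched below using the figures Intel reported; the variable names are our own.

```python
aggregate_tps = 240          # total Llama 4 Maverick throughput Intel reported for the dual Xeon 6 box
per_user_target_tps = 10     # tokens per second considered acceptable for one user
print(aggregate_tps // per_user_target_tps)   # 24 concurrent users
```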

Intel didn’t disclose single-user performance figures, as they are less relevant in real-world scenarios. However, estimates suggest a peak performance of around 100 tokens per second. This could still be useful for certain use cases such as local prototyping or experimentation.

Nonetheless, unless no better alternative is available or specific requirements dictate otherwise, the economics of CPU-based inference remain highly dependent on the use case. For demanding AI applications, GPUs will likely continue to deliver the best performance per dollar.

Weight Reduction: Pruning and Quantization

MoE architectures can reduce the memory bandwidth necessary for serving large models, but they don’t reduce the amount of memory required to store their weights. Even at 8-bit precision, Llama 4 Maverick requires over 400 GB of memory to run, regardless of the number of active parameters. Pruning techniques offer a way to lower this memory overhead by removing redundant parameters without significantly affecting quality.

Emerging pruning and quantization techniques can potentially halve that requirement without sacrificing quality, shrinking the memory footprint of large models enough to make them far easier to deploy at scale.

Nvidia has been a proponent of pruning, releasing versions of Meta’s Llama 3 models with redundant weights removed. Pruning shrinks a model by eliminating its least impactful weights.
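
As a rough illustration of the principle, the sketch below applies generic unstructured magnitude pruning to a weight matrix. Nvidia’s production recipe involves more sophisticated structured pruning and retraining, so treat this only as a sketch of the basic idea.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights; a generic unstructured-pruning sketch."""
    threshold = np.quantile(np.abs(weights), sparsity)    # cutoff below which weights are dropped
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

W = np.random.default_rng(0).standard_normal((1024, 1024)).astype(np.float32)
pruned, mask = magnitude_prune(W, sparsity=0.5)
print(f"weights kept: {mask.mean():.0%}")                 # ~50%; the zeroed weights can then be stored sparsely
```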

Nvidia was also among the first companies to support 8-bit floating-point data types in 2022, and again with 4-bit floating point with the launch of its Blackwell architecture in 2024. AMD’s first chips to offer native FP4 support are expected to be released soon. The shift towards lower precision arithmetic unlocks further performance gains.

While not strictly essential, native hardware support for these data types reduces the likelihood of computational bottlenecks when executing lower-precision models, which matters most when serving at scale.

We’ve witnessed a growing number of model developers adopting lower-precision data types, with Meta, Microsoft, and Alibaba offering eight-bit and even four-bit quantized versions of their models. The widespread adoption of lower precision further underscores the industry-wide focus on efficiency.

Quantization involves compressing model weights from their native precision, typically BF16, to FP8 or INT4. This effectively reduces the memory bandwidth and capacity requirements of the models by half or even three-quarters, at the cost of some quality. It’s a compromise, but one that offers significant improvements in model efficiency.
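
To show what that compression looks like mechanically, here is a minimal sketch of symmetric round-to-nearest INT4 quantization with a single per-tensor scale. Production schemes typically use per-group scales and smarter rounding, so this illustrates the principle rather than any particular library’s method.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest INT4 with one per-tensor scale (a simplification)."""
    scale = np.abs(w).max() / 7.0                  # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                                # int8 storage here; real kernels pack two 4-bit values per byte

def dequantize(q, scale):
    return q.astype(np.float32) * scale            # approximate weights reconstructed at compute time

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, s = quantize_int4(w)
print(f"mean absolute error: {np.abs(w - dequantize(q, s)).mean():.4f}")
# 4 bits is a quarter of BF16's 16, which is where the 75 percent capacity saving comes from.
```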

The losses associated with transitioning from 16 bits to eight are often negligible, and several model builders, including DeepSeek, have begun training at FP8 precision from the outset, suggesting FP8 is becoming a sweet spot for balancing quality and efficiency.

However, reducing the precision by another four bits can result in significant quality degradation. Consequently, many post-training quantization approaches, such as GGUF, don’t compress all of the weights equally, leaving some at higher precision levels to minimize quality loss. Hybrid quantization schemes help fine-tune the balance further.

Google recently demonstrated the use of quantization-aware training (QAT) to shrink its Gemma 3 models to roughly a quarter of their original size while maintaining quality close to native BF16. QAT mitigates the information loss caused by low-precision weight representations by taking quantization into account during the training process.

QAT simulates low-precision operations during training. By applying the technique for approximately 5,000 steps on a non-quantized model, Google was able to cut the drop in perplexity, a measure of how well a language model predicts a sample of text and a common proxy for quantization-related losses, by 54 percent when converting to INT4.
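
In practice, "simulating low-precision operations" usually means fake quantization: weights are quantized and immediately dequantized in the forward pass so the network learns to tolerate the rounding error, while gradients bypass the rounding step via the straight-through estimator. The sketch below shows the forward half of that idea and is not Google’s implementation.

```python
import numpy as np

def fake_quant_int4(w):
    """Quantize then immediately dequantize, so the forward pass sees INT4 rounding error.
    In a real QAT loop the backward pass treats this op as the identity (straight-through
    estimator), so gradients keep flowing to the full-precision master weights."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
w_q = fake_quant_int4(w)   # what a layer would use in its forward pass during QAT
# The optimizer still updates the original float weights; only the forward pass sees w_q.
```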

Another QAT-based approach, known as BitNet, aims for even lower precision, compressing weights to just 1.58 bits each, or roughly a tenth of their original size. Although this can lead to quality degradation, the resulting model requires drastically less storage and compute.
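
The 1.58-bit figure comes from restricting each weight to one of three values, {-1, 0, +1}, which carries log2(3) ≈ 1.58 bits of information. The sketch below shows absmean-style ternary rounding in that spirit; it is not the full BitNet training recipe.

```python
import numpy as np

def ternarize(w):
    """Round weights to {-1, 0, +1} using an absmean scale, in the spirit of BitNet b1.58."""
    scale = np.abs(w).mean() + 1e-8                # per-tensor scale
    return np.clip(np.round(w / scale), -1, 1), scale

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
w_t, s = ternarize(w)
print(np.unique(w_t))                              # [-1.  0.  1.]: log2(3) ~ 1.58 bits of information per weight
```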

The Synergy of Technologies

The combination of MoE and 4-bit quantization offers significant advantages, particularly where bandwidth is limited. Applied together, the two techniques compound, making it practical to run large models efficiently even in edge environments.

For deployments that aren’t bandwidth-constrained, either technology on its own, MoE or quantization, can substantially lower the cost of the equipment and operations needed to run larger, more powerful models, assuming a valuable service can be found for them to perform. That efficiency is what allows companies to offer compelling services at affordable rates.

And if not, you can at least take comfort in not being alone: a recent IBM survey found that only one in four AI deployments has delivered the return on investment that was promised. Finding valuable applications remains the key to unlocking the potential of efficient AI models.