The Dawn of AI’s Evolution: A 25-Year Journey from PageRank to AGI
Two luminaries of Google’s technological odyssey, Jeff Dean, the current Chief Scientist, and Noam Shazeer, a pivotal figure behind the Transformer model who recently rejoined the fold, engaged in an illuminating dialogue. Hosted by the renowned podcaster Dwarkesh Patel, their conversation offered a glimpse into the evolution of AI, spanning from the foundational days of MapReduce to the transformative era of Transformer and MoE architectures.
These seasoned veterans, with a combined experience of decades at Google, have not only witnessed but actively shaped the defining technologies of the internet and artificial intelligence. Ironically, Shazeer confessed that his initial motivation for joining Google was a short-term financial pursuit, a plan that was dramatically overturned by his subsequent contributions to the field.
The Current State and Future Trajectory of AI Compute
In a sprawling two-hour exchange, Dean and Shazeer unveiled insights into the present status of AI compute, revealing that:
- The scale of operations has transcended individual data centers; Gemini’s training now spans multiple data centers in different metropolitan areas, operating asynchronously.
- There’s substantial room for growth in scaling inference compute, as interacting with AI remains significantly more cost-effective than traditional reading.
- Future model architectures are envisioned to surpass the flexibility of MoE, enabling independent development of various model components by different teams.
Insights from the Trenches: Bug Bounties and Future Architectures
The conversation also sparked interest on social media, with users highlighting intriguing concepts, such as:
- The potential of storing vast MoE models in memory.
- The unexpected benefits of bugs in code, which, as scale increases, can inadvertently lead to groundbreaking discoveries.
Dean challenged the notion that AI compute is prohibitively expensive. By comparing the cost of engaging with a book versus interacting with an AI about the same book, he illustrated a compelling point:
The most advanced language models operate at an astonishingly low cost, roughly $10^{-18}$ dollars per operation, which works out to about a million tokens processed per dollar. In contrast, purchasing a paperback book offers a mere 10,000 tokens per dollar.
This stark difference—a hundredfold cost advantage for AI interaction—underscores the untapped potential for enhancing AI intelligence through increased inference compute.
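A quick back-of-the-envelope check of those figures (the paperback price and length below are assumptions chosen only to reproduce the rough numbers quoted above):

```python
# Back-of-the-envelope check of the cost comparison quoted above.
llm_tokens_per_dollar = 1_000_000   # ~1M tokens of LLM output per dollar (figure quoted above)
book_price_dollars = 15             # assumed price of a paperback
book_tokens = 150_000               # assumed ~100k words, ~150k tokens per book

book_tokens_per_dollar = book_tokens / book_price_dollars   # ~10,000 tokens per dollar

advantage = llm_tokens_per_dollar / book_tokens_per_dollar
print(f"Paperback: {book_tokens_per_dollar:,.0f} tokens per dollar")
print(f"LLM inference is roughly {advantage:,.0f}x cheaper per token")
# -> roughly the hundredfold gap described in the text
```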
From an infrastructural perspective, the escalating significance of inference-time computation could reshape data center planning. This might necessitate hardware specifically tailored for inference tasks, reminiscent of Google’s first-generation TPUs, initially designed for inference and later adapted for training.
Distributed and Asynchronous Computation: A New Paradigm
The growing emphasis on inference suggests that continuous communication between data centers might become unnecessary, potentially leading to a more distributed and asynchronous computational model.
Gemini 1.5 has already embarked on this path, leveraging computational resources across several major cities. High-speed networks synchronize computations from different data centers, achieving unprecedented training scales. For large models, where each training step can take several seconds, even a network latency of 50 milliseconds poses minimal impact.
In the realm of inference, latency sensitivity becomes a critical consideration. While immediate responses demand optimized low-latency performance, non-urgent tasks, such as complex contextual analysis, can tolerate longer processing times.
A more adaptable and efficient system could asynchronously manage multiple tasks, enhancing overall performance while minimizing user wait times. Additionally, algorithmic advancements, like employing smaller draft models, can alleviate bottlenecks in the inference process. This approach involves smaller models generating potential tokens, which are then verified by larger models, significantly accelerating the inference process through parallelization.
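A minimal sketch of that draft-and-verify idea, using a greedy variant for clarity; `draft_logits_fn` and `target_logits_fn` are hypothetical stand-ins for a small draft model and the large model, not any specific API:

```python
import numpy as np

def speculative_step(draft_logits_fn, target_logits_fn, prefix, k=4):
    """One round of draft-and-verify decoding (greedy variant for clarity).

    draft_logits_fn / target_logits_fn: callables mapping a token sequence
    to next-token logits; stand-ins for a small draft model and a large model.
    """
    # 1. The small draft model cheaply proposes k tokens autoregressively.
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(int(np.argmax(draft_logits_fn(proposal))))

    # 2. The large model checks the proposed positions. A real implementation
    #    scores all k positions in a single parallel forward pass; the loop
    #    here just makes the accept/reject logic explicit.
    accepted = list(prefix)
    for i in range(len(prefix), len(proposal)):
        target_choice = int(np.argmax(target_logits_fn(accepted)))
        if target_choice == proposal[i]:
            accepted.append(target_choice)   # draft token verified, keep it
        else:
            accepted.append(target_choice)   # first disagreement: take the large
            break                            # model's token and stop this round
    return accepted
```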
Shazeer added that during asynchronous training, each model replica operates independently, sending gradient updates to a central system for asynchronous application. Despite the theoretical implications of minor parameter fluctuations, this method has proven remarkably successful.
In contrast, synchronous training offers stability and reproducibility, a preference for many researchers. To ensure replicability in training, Dean highlighted the practice of logging operations, particularly gradient updates and data batch synchronization. By replaying these logs, even asynchronous training can yield reproducible results, making debugging more manageable and mitigating inconsistencies caused by environmental factors.
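A toy illustration of both points, assuming a simple parameter-server setup: replicas push gradient updates whenever they finish a step, the server applies them immediately, and every update is logged so the run can be replayed deterministically. This is a sketch of the idea, not Google's actual training infrastructure:

```python
import numpy as np

class LoggingParameterServer:
    """Toy parameter server for asynchronous SGD with a replayable update log."""

    def __init__(self, params, lr=0.01):
        self.params = {k: v.copy() for k, v in params.items()}
        self.lr = lr
        self.log = []   # (step, replica_id, gradients), in arrival order

    def apply_update(self, replica_id, grads):
        # Replicas push gradients whenever they finish a step; updates are applied
        # immediately, so other replicas may be computing on slightly stale weights.
        self.log.append((len(self.log), replica_id,
                         {k: g.copy() for k, g in grads.items()}))
        for k, g in grads.items():
            self.params[k] -= self.lr * g

    def replay(self, initial_params):
        # Re-applying the logged updates in recorded order reproduces the final
        # parameters exactly, even though training itself was asynchronous.
        params = {k: v.copy() for k, v in initial_params.items()}
        for _, _, grads in self.log:
            for k, g in grads.items():
                params[k] -= self.lr * g
        return params
```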
The Serendipitous Role of Bugs
Expanding on this, Shazeer introduced an intriguing perspective:
Models in training encounter all kinds of bugs, but their inherent noise tolerance lets them adjust around the errors, sometimes with unforeseen outcomes. Some bugs even yield positive effects, and as scale amplifies these experimental anomalies, they become opportunities for improvement.
When queried about debugging practices, Shazeer described their approach of conducting numerous small-scale experiments for rapid validation. This method simplifies the codebase and shortens experiment cycles to hours instead of weeks, facilitating quick feedback and adjustments.
Dean concurred, noting that many experiments with initially unfavorable results could later provide crucial insights. However, researchers face the challenge of code complexity; while incremental improvements are necessary, they also introduce performance and maintenance challenges, necessitating a balance between system cleanliness and innovation.
The Organic Structure of Future Models
Dean and Shazeer envision a significant shift in AI models from monolithic structures to modular architectures.
Models like Gemini 1.5 Pro already employ a Mixture of Experts (MoE) architecture, activating different components based on the task. For instance, mathematical problems engage the math-proficient section, while image processing activates the corresponding specialized module.
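A minimal sketch of the top-k gating mechanism behind MoE layers, with toy shapes and randomly initialized "experts"; it illustrates the mechanism, not Gemini's actual implementation:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Mixture-of-experts forward pass for a single token vector.

    x:       (d,) token representation
    gate_w:  (d, n_experts) router weights
    experts: list of callables, each mapping (d,) -> (d,)
    Only the top_k experts chosen by the router are evaluated, which is what
    keeps per-token compute at a small fraction of the total parameter count.
    """
    scores = x @ gate_w                          # one router logit per expert
    top = np.argsort(scores)[-top_k:]            # indices of the chosen experts
    w = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over chosen experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy usage: four random "experts" acting on an 8-dimensional token.
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): np.tanh(v @ W) for _ in range(n)]
out = moe_layer(rng.normal(size=d), rng.normal(size=(d, n)), experts)
```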
However, current model structures remain somewhat rigid, with expert modules being uniform in size and lacking flexibility. Dean proposed a more forward-thinking vision: future models should adopt an organic structure, allowing different teams to independently develop or enhance distinct parts of the model.
For example, a team specializing in Southeast Asian languages could refine the relevant module, while another focuses on improving code comprehension. This modular approach not only boosts development efficiency but also enables global teams to contribute to the model’s advancement.
Technically, models can continuously optimize individual modules through distillation. This involves condensing large, high-performance modules into smaller, efficient versions, which then continue to learn new knowledge.
A router can select the appropriate module version based on task complexity, balancing performance and efficiency, a concept central to Google’s Pathways architecture.
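A minimal sketch of that distill-then-route idea, assuming plain soft-target (logit-matching) distillation; the loss and the difficulty-based router below are generic textbook constructions, not the Pathways implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: the small 'student' module is trained to match
    the softened output distribution of the large 'teacher' module (KL divergence)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T)

def route(task_difficulty, distilled_module, full_module, threshold=0.5):
    # A router can then pick which version of a module to invoke per request:
    # the cheap distilled one for easy queries, the full one for hard ones.
    return distilled_module if task_difficulty < threshold else full_module
```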
This new architecture demands robust infrastructure, including powerful TPU clusters and ample high-bandwidth memory (HBM). Although each call might use only a fraction of the model’s parameters, the entire system needs to keep the complete model in memory to serve concurrent requests.
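A rough illustration of why serving still needs the whole model resident in high-bandwidth memory even when each token touches only a few experts; the parameter count, precision, and activation fraction below are assumptions made purely for the arithmetic:

```python
# Hypothetical numbers, chosen only to illustrate the point above.
total_params    = 1e12    # assume a 1-trillion-parameter sparse model
bytes_per_param = 2       # bf16 weights
active_fraction = 0.05    # assume ~5% of parameters are touched per token

hbm_for_weights_tb = total_params * bytes_per_param / 1e12
print(f"Weights resident across the serving cluster: ~{hbm_for_weights_tb:.0f} TB of HBM")
print(f"Parameters touched per token: ~{active_fraction * total_params / 1e9:.0f}B")
# Sparse activation cuts per-token FLOPs, but concurrent requests may hit any
# expert, so the full parameter set has to stay loaded somewhere in the cluster.
```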
As things stand, models can decompose a task into about 10 subtasks and succeed roughly 80% of the time; future models could break a task into 100 or even 1,000 subtasks and reach success rates of 90% or higher.
The “Holy Shit” Moment: Accurate Cat Recognition
Looking back, 2007 marked a significant milestone for large language models (LLMs).
At that time, Google trained an N-gram model on 2 trillion tokens for machine translation. However, because the N-gram data lived on disk, lookups required extensive disk I/O (on the order of 100,000 disk seeks per word), and translating a single sentence took 12 hours.
To address this, they devised several strategies: in-memory storage with compression, a distributed architecture, and a batch-processing API (a toy sketch follows the list):
- In-memory storage with compression: compressing the N-gram data so it fits entirely in memory, eliminating disk I/O.
- Distributed architecture: sharding the data across many machines (e.g., 200) so lookups can run in parallel.
- Batch-processing API: batching lookups to amortize per-request overhead and improve throughput.
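A toy sketch combining the three ideas above: hold the counts in RAM, shard them across machines, and batch lookups so each shard is queried once per batch. The hash-sharding scheme and in-process dictionaries are stand-ins for what would really be a compressed, distributed service:

```python
from collections import defaultdict

NUM_SHARDS = 200   # e.g., one shard per machine, as in the list above

def shard_of(ngram):
    # Hash-partition n-grams so each "machine" holds one slice of the table in RAM.
    return hash(ngram) % NUM_SHARDS

class InMemoryNgramService:
    """Toy stand-in for a sharded, in-memory n-gram count service."""

    def __init__(self, counts):
        # counts: dict mapping n-gram tuple -> count, partitioned across shards once.
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        for ngram, c in counts.items():
            self.shards[shard_of(ngram)][ngram] = c

    def batch_lookup(self, ngrams):
        # Group a whole batch of requests by shard so each "machine" is queried
        # once per batch, instead of paying per-lookup RPC (or disk-seek) cost.
        by_shard = defaultdict(list)
        for ng in ngrams:
            by_shard[shard_of(ng)].append(ng)
        results = {}
        for s, group in by_shard.items():
            shard = self.shards[s]
            for ng in group:
                results[ng] = shard.get(ng, 0)
        return results

# Usage: InMemoryNgramService({("new", "york"): 120_000}).batch_lookup([("new", "york")])
```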
During this period, available computing power was still growing exponentially along the Moore’s Law curve.
“From late 2008, thanks to Moore’s Law, neural networks really started to work.”
When asked about a “Holy shit” moment (a moment of disbelief that a particular research effort actually worked), Jeff recounted the early Google Brain project in which the team trained a model to learn high-level features (like recognizing cats and pedestrians) from YouTube video frames. Through distributed training (2,000 machines, 16,000 cores), they achieved large-scale unsupervised learning.
After unsupervised pre-training, the model’s performance in supervised tasks (ImageNet) improved by 60%, demonstrating the potential of large-scale training and unsupervised learning.
Addressing whether Google remains primarily an information retrieval company, Jeff emphasized:
“AI fulfills Google’s original mission.”
In essence, AI not only retrieves information but also understands and generates complex content, with vast future potential. As for Google’s future direction, “I don’t know.”
However, one can anticipate pulling Google’s internal codebase and a large body of open-source code into every developer’s context. In other words, as models handle more tokens, being able to search within that expanded context will further enhance their capabilities and utility.
This concept is already being experimented with internally at Google.
“In fact, we have already conducted further training on the Gemini model for internal developers on our internal codebase.”
More precisely, Google has reported that about 25% of its new code is now written by AI.
The Happiest Times at Google
Interestingly, the duo also shared more intriguing experiences related to Google.
For Noam in 1999, joining a company like Google initially seemed unappealing, as he felt his skills might be underutilized. However, after seeing a chart of Google’s daily search volume, he quickly changed his mind:
“These people are bound to succeed, and it seems they have many interesting problems to solve.”
He joined with a specific “small” intention:
“Make some money and then happily pursue my own AI research interests.”
Upon joining Google, he met his mentor, Jeff (new employees were assigned mentors), and they collaborated on several projects.
At this point, Jeff interjected with his own appreciation for Google:
“I like Google’s broad mandate around the RM vision (Responsive and Multimodal); even though it’s a single direction, it leaves room for us to do many small projects within it.”
This freedom is also what led Noam, who had initially planned to “hit and run,” to stay for the long term.
Meanwhile, when the topic turned to Jeff, his undergraduate thesis on parallel backpropagation was revisited.
This 8-page paper became the top undergraduate thesis of 1990 and is preserved in the University of Minnesota library. In it, Jeff explored two methods for parallel training of neural networks based on backpropagation:
- Pattern-partitioned approach: Representing the entire neural network on each processor and dividing input patterns among available processors.
- Network-partitioned approach (pipelined approach): Distributing neurons of the neural network across available processors, forming a communicating ring. Features pass through this pipeline, processed by neurons on each processor.
He tested these methods with neural networks of different sizes and various input data. Results showed that for the pattern-partitioned approach, larger networks and more input patterns yielded better acceleration.
Most notably, the paper reveals what a “large” neural network looked like in 1990:
“A 3-layer neural network with 10, 21, and 10 neurons per layer was considered very large.”
Jeff recalled that he used up to 32 processors for his tests.
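A small modern re-creation of the pattern-partitioned approach at the thesis’s scale: every simulated “processor” holds a full copy of the 10-21-10 network, each processes its own slice of the input patterns, and the gradients are summed before one synchronized weight update. The numpy framing and the toy training data are, of course, assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "very large" 1990 network: 10 -> 21 -> 10 units.
W1 = rng.normal(size=(10, 21)) * 0.1
W2 = rng.normal(size=(21, 10)) * 0.1
X, Y = rng.normal(size=(64, 10)), rng.normal(size=(64, 10))  # toy input/target patterns
NUM_PROCESSORS = 4   # simulated here; the thesis used up to 32 real processors

def grads(W1, W2, x, y, n_total):
    """Backprop for a tanh hidden layer with squared error on one pattern shard."""
    h = np.tanh(x @ W1)
    out = h @ W2
    d_out = (out - y) / n_total   # scaled so shard gradients sum to the full-batch gradient
    dW2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    dW1 = x.T @ d_h
    return dW1, dW2

lr = 0.1
for step in range(100):
    # Pattern partitioning: each processor sees the same weights but a different
    # slice of the patterns; gradients are summed, then one update is applied.
    shards = zip(np.array_split(X, NUM_PROCESSORS), np.array_split(Y, NUM_PROCESSORS))
    per_proc = [grads(W1, W2, xs, ys, len(X)) for xs, ys in shards]
    W1 -= lr * sum(g[0] for g in per_proc)
    W2 -= lr * sum(g[1] for g in per_proc)
```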
(At that time, he probably couldn’t imagine that more than two decades later, he, along with Andrew Ng, Quoc Le, and others, would use 16,000 CPU cores to identify cats from massive amounts of data.)
However, Jeff admitted that for these research findings to truly be effective, “we needed about a million times more computing power.”
Later, they discussed the potential risks of AI, especially the feedback loop problem when AI becomes extremely powerful. In other words, AI could enter an uncontrollable acceleration loop (i.e., “intelligence explosion”) by writing code or improving its algorithms.
This could lead to AI rapidly surpassing human control, even creating malicious versions. As the host put it, imagine “a million top programmers like Jeff, eventually turning into a million evil Jeffs.”
(One netizen’s reaction: “New nightmare unlocked, haha!”)
Finally, reflecting on their happiest times at Google, both shared their memories.
For Jeff, the most joyful moments in the early years of Google were witnessing the explosive growth of Google’s search traffic.
“Building something that 2 billion people now use is incredible.”
Recently, he has been thrilled to build things with the Gemini team that people wouldn’t have believed possible even five years ago, and he foresees the model’s impact expanding further.
Noam echoed similar experiences and a sense of mission, even fondly mentioning Google’s “micro-kitchen areas.”
This is a special space with about 50 tables, offering coffee and snacks, where people can freely chat and exchange ideas.
At this mention, even Jeff became animated (doge).