The Genesis of GPT-4.5: A Two-Year Odyssey
The GPT-4.5 initiative, conceived two years prior to its launch, represented OpenAI’s most meticulously planned undertaking to date. It demanded the concerted effort of hundreds of individuals, with Sam Altman noting that the project effectively engaged “almost everyone” at OpenAI. This widespread involvement underscores the strategic importance of GPT-4.5 within the organization’s broader mission.
During the development phase, the OpenAI team encountered what they termed “catastrophic problems.” The deployment of a 100,000 GPU cluster exposed latent infrastructure vulnerabilities that manifested as infrequent yet profound failures. To strike a balance between expediency and optimal performance, the system engineers adopted an iterative approach, essentially “building and fixing” concurrently. One particularly elusive bug plagued the cluster with recurrent errors, remaining undetected until the training process had reached approximately 40% completion.
Paradoxically, these trials contributed to the strengthening of OpenAI’s technical foundation. The expertise gained now enables a lean team of just 5-10 individuals to replicate a model of GPT-4’s magnitude. The performance leap from GPT-4 to GPT-4.5, estimated at around tenfold, was characterized by “difficult-to-quantify but comprehensively enhanced intelligence,” surprising even those within OpenAI. This qualitative leap suggests advances beyond mere scaling, pointing to fundamental improvements in the model’s ability to reason and understand.
Looking ahead, OpenAI recognizes that achieving the next order of magnitude in performance will hinge not on computational power alone, but rather on data efficiency. The focus is shifting towards developing algorithms that can extract more knowledge from existing datasets, thereby maximizing the utility of available compute resources.
Furthermore, the architecture is evolving from a single-cluster to a multi-cluster design, envisioning future training scenarios involving collaborative learning across as many as 10 million GPUs. This transition necessitates significant improvements in fault tolerance to ensure the stability and reliability of such large-scale distributed systems.
The conversation also delved into the relationship between data’s “long tail” and scaling laws, the advantages of close collaboration between machine learning and systems teams (co-design), the essence of unsupervised learning, and a culture of meticulous problem-solving.
Key Players Behind GPT-4.5
Besides Altman, the other three OpenAI team members who took part in this conversation were:
- Alex Paino: Responsible for the pre-training machine learning algorithms of GPT-4.5.
- Amin Tootoonchian: OpenAI’s chief system architect.
- Daniel Selsam: Researches data efficiency and algorithms.
Origins and Evolution of GPT-4.5
Sam Altman: What does it really take to build a model as large as GPT-4.5?
Alex Paino: We started this project about two years ago. At that time, OpenAI was about to bring up a new large compute cluster, and our team saw the opportunity and worked through a long list of tasks to determine what the model needed to include, running a large number of de-risking test operations along the way.
We developed a long plan for this, covering the entire technology stack from systems to machine learning. Reducing risk and preparing for training is a long execution process, and the training run itself is also a very large project.
Amin Tootoonchian: I think this process requires close cooperation between the machine learning team and the systems team from the very beginning, continuing until we know exactly what model we want to train, and only then does training start.
We make predictions on both the machine learning and systems sides and try to minimize the gap between expectation and reality. But because our working rhythm is fast and we want to use the latest compute as soon as it arrives, model training has become something that is hard to plan perfectly in advance.
We almost always start training with many unsolved problems and try to overcome the challenges and make progress as the run proceeds. The main remedy is to add more computing resources.
The final stage is execution, which requires many people to invest a lot of energy and motivation for a long time to complete the training process.
Sam Altman: How big do you think the gap is between our expectations and reality?
Amin Tootoonchian: On the systems side, we are usually far from the expected state at the beginning. We always face a choice: postpone the launch and wait until the problems are solved, or launch early and solve them along the way. It is always a trade-off, made to avoid unreasonable delays.
But unexpected problems almost always come up, and our job is to handle them as well as we can, deal with the unknowns, and lay out a plan for the training run.
Alex Paino: In this project, our goal was to make GPT-4.5, meaning a model ten times smarter than GPT-4. That is the initial goal we set about two years ago.
A lot happened along the way. We kept asking ourselves whether we would end up doing better or worse than expected. It is a very complicated process, but in the end, in terms of the effective compute we put in, we got a model that we believe is ten times smarter than GPT-4.
Amin Tootoonchian: In terms of execution, the time the GPT-4.5 project took was far beyond what we initially expected.
The Lean Team Revolution: Training GPT-4 with Minimal Resources
Sam Altman: When the cluster expanded from 10,000 GPUs to 100,000 GPUs, why did you run into so many problems?
Amin Tootoonchian: I think that if system developers are observant enough, most problems can already be seen at small scale.
There are also problems that are not unique to large-scale training: they existed before, but they turn catastrophic once the scale increases, especially when the team has not anticipated in advance that they would deteriorate to that extent.
Sam Altman: What kinds of things turned out to be catastrophic?
Amin Tootoonchian: The infrastructure problems are well known: the failure rate, the variety of failure types, and the total number of failures are all very high. A 100,000-GPU cluster is a huge statistical sample, so we also uncover failures that the compute provider itself has never observed.
The network is one part of it, and individual accelerators can also fail. But that is also the beauty of this kind of system - almost every component has to work as expected for the whole thing to produce the expected result. Our job is to minimize these problems as much as possible.
Sam Altman: It is certainly difficult to work at the limit of cluster scale, but I have also noticed that doing things that are no longer at the technological frontier has become much easier. Training GPT-4.5 required hundreds of people, and almost everyone at OpenAI was involved.
But if today you could pick the smallest possible team from OpenAI and retrain GPT-4 from scratch with all the knowledge we now have and all the systems work we have done, how many people would it take?
Alex Paino: I think it may take about 5 to 10 people to make a GPT-4-level model now. The technology stack has been greatly improved in the process of completing GPT-4.5.
In fact, we did something like this while training GPT-4.5 - we trained GPT-4o, a GPT-4-level model, re-training it with much of the same work that came out of the GPT-4.5 research program, and that run required far fewer people.
Data Efficiency: The Key to Unlocking the Next Generation of Models
Sam Altman: What about from your perspective, Dan? Why is it hard to train large models?
Daniel Selsam: I think it’s hard to do anything new. I think even just discovering that someone else has done something makes it much easier, because the hardest part is having the belief that you can do something in the first place. I think just knowing that something is feasible is a super cheat code, making things much easier.
Alex Paino: We keep scaling GPT pre-training to ten times what it was before, and each time we find interesting new things that you cannot necessarily predict.
Sam Altman: What is needed to achieve the next 10x or 100x growth in pre-training scale?
Daniel Selsam: Data efficiency. The Transformer architecture, on which GPT is built, is very efficient at using data: it absorbs and compresses information well and generalizes. Its defining trait is that it soaks up information efficiently given the available compute.
However, there is a limit to the depth of insight it extracts from data. When compute grows rapidly while data grows comparatively slowly, data becomes the bottleneck in this standard setup. That calls for algorithmic innovation: developing methods that use more compute to learn more knowledge from the same amount of data.
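One hedged way to see why the bottleneck flips, using the compute-optimal scaling results from the public literature (the "Chinchilla" analysis) rather than any figures from this conversation: with a compute budget C, the loss-minimizing parameter count and token count both grow roughly as the square root of compute.

```latex
% Compute-optimal scaling (approximate exponents from the public literature, not from this talk):
N^{*}(C) \;\propto\; C^{\,0.5}, \qquad D^{*}(C) \;\propto\; C^{\,0.5}
% If the stock of useful tokens grows more slowly than C^{0.5}, added compute can
% no longer be matched by added data, which is exactly the regime where
% algorithms that learn more from the same data start to matter.
```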
Sam Altman: What else do you think we need in order to keep scaling, besides that?
Amin Tootoonchian: My answer is about the system. I think the enormous amount of work GPT-4.5 required was essentially the inevitable consequence of the model's specifications. We could not have trained GPT-4.5 with exactly the same technical architecture as GPT-4.
In terms of state management, because the required compute exceeded what a single cluster could carry, we had to move to a multi-cluster training architecture. To get there, we had to integrate several different workflows in a short time.
That did help us achieve a step-change, but to reach the next order of magnitude of performance we still need to solve several known but temporarily shelved technical problems - problems that cannot be avoided forever. Trade-offs of this kind keep stretching out the development timeline of the perfect system; we are always making strategic compromises on the way to the optimal implementation.
To be clear, the system itself is not the end goal; its actual output value is the core consideration. For the next 10x performance improvement, I think the breakthrough that matters most is fault tolerance. We need fault-tolerance mechanisms that cooperate deeply with the workload, so that operational anxiety is dramatically reduced. The operational complexity of today's very large systems is fundamentally different from anything that came before.
Sam Altman: Do you know what percentage of failures during GPT-4.5 training were caused by specific components?
Amin Tootoonchian: I don’t have specific figures to share, but in general, the initial deployment of a new generation of hardware often faces many technical challenges that have not been fully understood. We chose to advance the project before the problem was fully clarified, which led to a high initial failure rate.
But experience shows that as the root cause is identified and resolved, the failure rate will be significantly reduced. This phenomenon essentially reflects our deepening understanding of infrastructure - some people call it cleaning up the infrastructure or understanding the basic problems of the infrastructure.
The early stages of execution are almost always quite painful. While we are advancing the project, we are also continuously discovering and solving new failure modes, but eventually the failure rate will gradually decrease and the normal running time will increase.
This is essentially a question of priorities and trade-offs: early in an infrastructure's life cycle, its failure risk is hard to estimate accurately, and if we over-optimize for the ultimate ideal state (the perfect "city-state" design, as it were), the system's initial availability can end up being extremely poor.
Beyond Compute: Algorithmic Innovation and the Untapped Potential of Data
Sam Altman: Reasoning models are a key component of our future technology stack, but let's focus for the moment on the development limits of traditional pre-trained models. Suppose we had unlimited GPU compute, unlimited network bandwidth, and unlimited power, but were still bound by today's technical bottlenecks, including system reliability issues, the lack of fault-tolerant training methods, and the limitations of existing datasets.
Following our rule of a 100-fold scale increase for each major GPT version number, and given today's technical boundaries, how far could pre-trained models go? Concretely, for the GPT series, what kind of model could we theoretically train with the knowledge we have now? Could we make GPT-5.5?
Alex Paino: From the perspective of machine learning and algorithm development, we have not yet hit a clear theoretical ceiling. In fact, we have only just begun exploring more data-efficient algorithms and ways to make fuller use of the data we already have. The situation is very interesting: even models like GPT-4 were largely developed under compute-constrained conditions, and that constraint shaped the direction of most prior research.
But the situation is completely different now. Since GPT-4.5, in some key dimensions, data rather than compute has become the main constraint. That shift makes the related research less exciting.
Sam Altman: But this is astonishing progress, and the world may not yet have fully realized that compute is no longer the main bottleneck on the best model we can build. That shift is deeply meaningful; after all, we lived in a compute-constrained environment for a very long time.
Unveiling the Surprises: Predictability vs. Unforeseen Intelligence
Sam Altman: What is the most interesting machine learning experience we learned during the training of GPT-4.5? Just say what you want to share.
Amin Tootoonchian: In general, the most thought-provoking things are those that deviate from our predictions - especially when we try to understand why the actual performance deviates from the expected curve.
Alex Paino: One of the most surprising discoveries for us was how differently the various machine learning components scale. Some parts scale very well, others do not. That is something we only really came to appreciate during the actual training run, and it gave us a lot of inspiration.
Daniel Selsam: I think the two core characteristics of the GPT paradigm are, first, that the test loss (a metric of how well the model performs on unseen test data) can be predicted accurately, and second, that model performance improves predictably as scale increases. What is even more magical is that the reduction in test loss translates into intelligence that is enhanced across the board, in all sorts of ways that are hard to quantify but wonderful and mysterious.
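As a reference point for the predictability Daniel describes, the published scaling-law literature usually writes the relationship as a power law in compute; the exact functional form and constants OpenAI fits internally are not stated in this conversation.

```latex
% Typical scaling-law parameterization from the public literature:
L(C) \;=\; L_{\infty} \;+\; \left(\frac{C_{0}}{C}\right)^{\alpha}
% L(C): test loss at training compute C;  L_inf: irreducible loss;  C_0, alpha: fitted constants.
% Fitting this curve on a ladder of smaller runs lets the loss of a much larger
% run be extrapolated before that run is ever launched.
```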
Sam Altman: Are you absolutely optimistic about this? Do you fully agree with this point of view?
Daniel Selsam: Actually, what I want to say is that we found a particularly interesting phenomenon when testing GPT-4.5: on re-testing, the model showed many sophisticated capabilities that completely exceeded everyone's expectations.
We were sure it would become smarter in all sorts of ways that are hard to define in advance, and after deployment those subtle improvements showed up in user satisfaction: stronger common-sense knowledge, more accurate contextual understanding, a more nuanced grasp of semantics. That is the magic bought by the extra reduction in test loss, and in my view the scaling law has been perfectly validated along this dimension.
The Power of Collaboration: Machine Learning and Systems Teams Working in Harmony
Sam Altman: What was the most positive moment during the entire training process? What is your favorite memory? Obviously there was a lot of pain, but I hope that pain has faded by now.
Alex Paino: I do have such a moment. We did a lot of machine learning work during training, and I think some of the changes we made during the process had a pretty good impact, maybe even better than expected, which was a very exciting moment for us.
Amin Tootoonchian: For me, while the run was going we were still building infrastructure at the same time. We firmly believed we could get over that performance cliff, we had a plan, and everyone was executing on it, but it took a long time. It was hard work, and definitely harder than I had thought. My prediction was wrong; I underestimated how long it would take to solve those problems.
The moment the team finally cracked those key problems and performance improved markedly is still fresh in my memory. You could feel the energy shift across the whole team - suddenly everyone was full of energy, pushing toward the finish with new motivation.
The most amazing thing was that the estimated completion time on our status tracker kept shrinking from the initial two years and finally locked onto a clear date. That visible progress did an immeasurable amount for team morale. I think that is the beauty of it.
I also want to emphasize that the machine learning work never stopped. Even after training started, this machine learning co-design process kept going. The machine learning team not only actively followed up on problems that had been tagged as "deal with later", but kept delivering improvements that genuinely reduced training time.
That perfectly embodies our team spirit: there is no "everyone sweeps only the snow in front of their own door" boundary here, just genuinely seamless collaboration. That cohesion is our greatest strength.
Meticulous Planning and Relentless Pursuit of Anomalies in GPT-4.5 Pre-Training
Daniel Selsam: The outside world has discussed at length the challenges of this training run and the accuracy of its predictions, but all of that was built on extremely meticulous planning - can you talk about that in more detail?
Alex Paino: This is definitely the most meticulous plan we have ever made. As I said, we began preparing for this project a year before training officially kicked off, and during that period we ran multiple large-scale de-risking test runs.
We paid special attention to introducing every improvement incrementally: starting from a high-confidence base configuration - essentially a mature architecture similar to GPT-4 that we had fully mastered at the machine learning level - and then layering new features on top like building blocks.
The key is to strictly verify the scalability of each improvement at different scales: not just whether it improves performance, but whether the improvement holds up as the model size grows. Many improvements look good in small-scale tests but fail in large-scale application.
So we stayed highly vigilant throughout the process and kept iterating on our scaling-law methodology. Through this de-risking practice we accumulated a lot of valuable experience that will continue to guide the development of future GPT models.
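A minimal sketch of what such a scale-by-scale check can look like, under assumptions of my own: fit a simple power law to a ladder of small runs with and without a candidate change, then compare the extrapolations at the target budget. The numbers, names, and the specific functional form below are illustrative, not OpenAI's internal methodology.

```python
# Illustrative scaling-law check (hypothetical data and names).
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, l_inf, a, b):
    # Loss as a function of relative compute x: irreducible term plus power-law decay.
    return l_inf + a * x ** (-b)

# Final losses from a ladder of small training runs; compute normalized to the smallest run.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
baseline_loss  = np.array([3.10, 2.95, 2.82, 2.71, 2.62])  # current recipe
candidate_loss = np.array([3.05, 2.91, 2.79, 2.69, 2.61])  # recipe with the new feature

def extrapolate(losses, target_compute):
    params, _ = curve_fit(power_law, compute, losses, p0=(2.0, 1.0, 0.3), maxfev=10000)
    return power_law(target_compute, *params)

target = 1e5  # hypothetical full-scale budget, in the same relative units
print("baseline  loss @ target:", round(extrapolate(baseline_loss, target), 3))
print("candidate loss @ target:", round(extrapolate(candidate_loss, target), 3))
# A small-scale win that shrinks or reverses at the target scale is exactly the
# failure mode described above: an improvement that does not survive scaling.
```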
Amin Tootoonchian: I remember a particularly interesting moment that I miss very much. You know, we almost inevitably encounter various bugs every time we start a training task, which is commonplace. But the key is to ensure that progress is not hindered, and we must always confirm whether the current progress is indeed on the right track and whether these bugs will have a fatal impact on the health of the training.
Even when we were fairly certain there was a major flaw somewhere, the monitoring system we had built allowed us to pin down the root cause precisely: was it a hardware failure, and of what type? Data corruption? A bug in the machine learning model itself? Or a race condition in the code?
At the time we had several problem threads open simultaneously, with a wide variety of symptoms. After a round of bug fixes we hit a deadlock: multiple unsolved problems were piled up in front of us, and everyone was racking their brains - were these caused by different bugs, or was a single bug responsible for all of them?
Later we held a vote and asked team members to pick the most likely root cause. The option almost nobody favored turned out to be right: the problem was upstream, in PyTorch's torch.sum function, a simple summation operation.
The bug is very interesting. We mostly use Triton kernels and only fall back to torch operations in some insignificant edge cases. Triggered by our particular code path, the torch.sum bug would occasionally cause an illegal memory access because of the characteristics of the data distribution: it miscalculated a memory offset.
The most dramatic thing is that when an engineer finally located the problem and submitted a fix, all the errors with different symptoms disappeared. Everyone excitedly changed the Slack channel from the “multi-bug theory” to the “single-bug theory”, and the scene was very happy.
How long had this bug been lurking? It had been there since the early stages of training, and it was not found until the progress bar had passed roughly 40%. The discovery itself was dramatic: a complex kernel was making a sequence of calls, and it was the second call that triggered the illegal memory access.
Although the crash frequency was extremely low (it occurred only once every few hundred or even few thousand training steps), it would have been easy to dismiss as an occasional glitch. But our team's principle is: never let any anomaly go. The best part of this story is that refusal to give up.
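To make the shape of the failure concrete, here is a purely illustrative sketch of the kind of dispatch pattern described: a custom kernel handles the common case, and rare edge cases fall back to the stock torch operation, so a latent bug on the fallback path fires only occasionally. All names and conditions are hypothetical; this is not the actual GPT-4.5 code path or the actual torch.sum defect.

```python
# Hypothetical illustration of a rare fallback path (not OpenAI's actual code).
import torch

def custom_fused_sum(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a hand-written Triton kernel used on the hot path.
    return x.float().sum()

def reduce_sum(x: torch.Tensor) -> torch.Tensor:
    # Common case: shapes/layouts the custom kernel supports.
    if x.is_contiguous() and x.numel() % 128 == 0:
        return custom_fused_sum(x)
    # Rare edge case: fall back to the stock torch op. A latent bug here
    # (e.g. a miscomputed memory offset for unusual data) would surface only
    # once every few hundred or few thousand steps, masquerading as flaky
    # hardware rather than a deterministic software defect.
    return torch.sum(x)

# The fallback branch is exercised only for the odd-shaped input.
print(reduce_sum(torch.randn(256)))   # custom-kernel path
print(reduce_sum(torch.randn(100)))   # fallback path
```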
The Quest for Ideal Systems: A Distant Horizon
Sam Altman: After GPT-4.5 pre-training starts, what else do you have to do?
Alex Paino: All of us need to watch the loss curve frequently. Beyond that, we keep optimizing the system and finishing the co-design work that was not completed before training started. We closely monitor all kinds of statistics during the run to make sure there are no unexpected anomalous trends, and we explore possible improvements from the machine learning side. Although the data-level work drops off once pre-training has started, there is still a great deal to do.
Amin Tootoonchian: I think machine learning depends to a large degree on the correctness of judgment. After pre-training starts, faced with a flood of noisy signals, we are like fortune tellers reading tea leaves: we have to judge whether the system is healthy. That is our responsibility.
Sam Altman: At the system level, what limits us from conducting model training? Is it chips, processors, memory, network, or power?
Amin Tootoonchian: The beauty of the system is that with co-design the workload can be adapted to the infrastructure you build. There is no universal answer that the network is the bottleneck, or memory bandwidth is the bottleneck, and so on. Even for a model of the same specification we can choose to shift resource demands around, and we can choose to build a more balanced system, though more memory bandwidth is always helpful. Without fixed constraints, this question is hard to answer.
When designing GPT-4.5, we may need the system to have certain properties, and those have to be arrived at under human guidance. So co-design is very important in shaping the model architecture and its elements, and to some extent it ties the systems and machine learning sides together. If the system ends up with a property we would rather not rely on, my ideal is for everything to be decoupled so each side has the maximum room to move.
Sometimes things do get coupled, and we have to meet the infrastructure's requirements, or that is simply how things have to be. Most of the time we need a balanced system with balanced communication, and the best lever we have for making those adjustments is all of this co-design.
Sam Altman: How far are we from this ideal system goal?
Amin Tootoonchian: We are still a long way from that goal. Building a system always works like this: first there is an idealized view of how things should work, and then you reconcile the differences with the resources you actually have.
I don't think we are doing theory for theory's sake; we are simply spelling out what we want the system to become, realizing it, and getting as close to that ideal as we can. That may be the most exciting part of the systems field. People used to say "this is an elegant system design," and in the end history will tell us whether the choice was right or wrong.
Sam Altman: If you could get an answer to a machine learning problem before the next large training, what would you most like to know?
Alex Paino: I would like to know what algorithms we should use under limited data and specific fields. Although this is a broad question, it is indeed the most critical one.
Sam Altman: Will you conduct synchronous pre-training with 10 million GPUs or more in the future?
Alex Paino: I think there will be, but it may not be a traditional pre-training model. Its form may be very different from existing technology, but it will still retain the core of unsupervised learning.
Amin Tootoonchian: I prefer semi-synchronous mode. Due to physical laws, complete synchronization is not realistic.
Daniel Selsam: I think it is more likely to be decentralized. There will definitely be 10 million GPUs working together in an AI system for learning and performing tasks, but like the various parts of the brain, they may not necessarily communicate with each other.
The Synergistic Power of Algorithmic Improvements and Data Efficiency
Sam Altman: How big is the gap between the most advanced algorithms and human data efficiency? Can we hope to catch up in the future?
Daniel Selsam: It is hard to compare the two directly. In language learning the gap is definitely enormous. The crux is how you define the amount of information arriving through the human visual system. In general, I think algorithms are far less data-efficient than humans.
For decades, deep learning has focused on compute efficiency. Beyond the growth of data and compute, what is really striking is the compounding effect of algorithmic improvements: each 10% or 20% gain in algorithmic performance has a significant effect when stacked on top of data efficiency. Until now there has been no concerted mobilization around data efficiency, because it was not worthwhile while data was not the binding constraint and compute was limited.
Now we are entering a new phase of AI research where wins in data efficiency will start to accumulate. I think it is a little silly to predict right now that we will hit some insurmountable obstacle. The human brain certainly works differently from our algorithmic improvements, and we should be cautious about that, but I think we should stay optimistic about the future development of algorithms.
Sam Altman: What is the correlation between larger-scale pre-training and the model’s stronger learning and reasoning abilities?
Alex Paino: What we have observed is that better pre-training and unsupervised learning tend to lift the model's overall intelligence and help enormously with generalization. That complements reasoning ability, whereas reasoning on its own may be slower at raising intelligence. I think the two are complementary.
Sam Altman: Pre-training seems to make a model general across many things, whereas other kinds of training only make it good at one class of things. Is that right?
Alex Paino: This is very interesting, but it stops being surprising once you look at the data used to train them. The pre-training dataset is enormous in scope, and breadth and diversity are exactly what we are after. With reinforcement learning, where the model has to get clean reward signals and a good training environment, I think it is hard to match that breadth of data.
Daniel Selsam: I agree, but I think there is another factor. Pre-training is essentially compressing data, and compression is about discovering connections between different things; it is about analogies and abstraction. Reasoning is a skill that requires careful thought on a specific problem, though it also yields solutions to many classes of problems. In pre-training, however, compressing data across many domains is what lets the model learn more abstract knowledge.
The Essence of Intelligence: Compression and the Long-Tail Effect
Sam Altman: Why is unsupervised learning effective?
Daniel Selsam: The key is compression. The ideal form of intelligence is Solomonoff induction: in principle, it considers every possible explanation, but it prefers to test the simpler programs first.
The essence of today's pre-training is a compression process: it approximates that ideal by finding the simplest program that explains all the data humans have produced so far.
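For reference, the textbook formulation of the ideal Daniel alludes to (standard notation, not something spelled out in the conversation) weights every program that reproduces the data, giving exponentially more weight to shorter programs.

```latex
% Solomonoff prior over strings x, with U a universal machine and |p| a program's length:
M(x) \;=\; \sum_{p \,:\, U(p)=x} 2^{-|p|},
\qquad
K(x) \;=\; \min_{p \,:\, U(p)=x} |p|
% Prediction under M is dominated by the shortest programs consistent with the data,
% which is the sense in which "finding the simplest program" is ideal compression.
```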
Sam Altman: How does next-token prediction help achieve compression?
Daniel Selsam: There is a paradox in statistics: why do deep networks generalize even though they do not appear to compress? Normally, when you have a lot of data and small models, the models are forced to compress in order to learn anything at all.
In pre-training, both the data and the model are enormous, so some people conclude that the training is just memorization plus interpolation. What they overlook is another way of understanding compression: prequential compression. A model acts like a compressor, and even though its weights are huge, the compressed stream never has to store them; the next-token predictions themselves let the useful information be recovered, and that is what improves the compression.
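One standard way to make this precise (ordinary prequential-coding notation, not taken from the conversation itself): the corpus can be encoded in a number of bits equal to the model's cumulative next-token log loss, with no need to transmit the weights, because the receiver retrains the identical model on the tokens decoded so far.

```latex
% Prequential code length of a corpus x_1 ... x_N under a model p_theta that is
% updated only on tokens already seen:
L_{\mathrm{preq}}(x_{1:N}) \;=\; \sum_{t=1}^{N} -\log_2 p_{\theta_{<t}}\!\left(x_t \mid x_{<t}\right)
% The code pays only for prediction errors, not for parameters, so a very large
% network whose predictions keep improving as data streams in is still an
% excellent compressor.
```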
Sam Altman: Training GPT-4.5 took enormous manpower, time, and money. It can really be seen as an experiment to verify the scaling law, and the results show that it holds and will keep holding for a long time. Why does the scaling law deserve to be called a law of the universe?
Daniel Selsam: The greater the compression, the greater the intelligence, and that has deep philosophical connotations. Why does training larger models for longer yield higher compression? There are many theories; my favorite involves sparse representations.
Key concepts in the real world follow a power-law distribution. For example, the 100th most important concept might appear only once in every 100 documents; there is a pronounced long tail. This distribution means that capturing all of the key concepts requires large-scale data and compute, and it also means the scaling law will keep holding for a long time to come.
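Stated in standard Zipf-style notation (my own gloss on the argument, not figures from the conversation): if concept frequency falls off as a power of rank, then covering ever-rarer concepts requires nearly proportionally more data.

```latex
% Power-law (Zipfian) concept frequencies: the k-th most important concept
% appears with frequency
f(k) \;\propto\; k^{-\alpha}, \qquad \alpha \approx 1
% so seeing the rank-k concept even a few times takes on the order of
N(k) \;\sim\; k^{\alpha}
% documents. Each additional slice of the long tail therefore demands roughly
% proportionally more data and compute, which is one reading of why scaling
% keeps paying off.
```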