Overcoming ‘Catastrophic Problems’ in Large-Scale Training
The development of GPT-4.5, a project initiated two years prior, represents OpenAI’s most ambitious endeavor to date. This massive undertaking involved the collaborative efforts of hundreds of individuals, with Sam Altman, CEO of OpenAI, noting that the project demanded near-total organizational engagement.
The journey to create GPT-4.5 was not without its hurdles. The team encountered numerous ‘catastrophic problems’ during the research and development phase. Utilizing a cluster of 100,000 GPUs exposed previously unseen, low-probability, yet profound infrastructure failures. To balance expediency with optimal performance, OpenAI’s system team was compelled to adopt a ‘fix-as-we-go’ approach. One particularly elusive bug plagued the cluster with frequent errors, remaining undetected until approximately 40% of the training process had elapsed.
Despite these challenges, the GPT-4.5 project catalyzed the development of a more robust technology stack. Today, a lean team of just 5-10 individuals can replicate a large model akin to GPT-4. The performance gains from GPT-4 to GPT-4.5 were approximately tenfold, yielding ‘intelligence that is difficult to quantify but enhanced in all aspects,’ a result that surprised even OpenAI’s own personnel.
Shifting Focus: From Computational Power to Data Efficiency
OpenAI has come to realize that achieving the next tenfold or hundredfold leap in performance hinges not on raw computational power but on data efficiency – specifically, the ability to extract more knowledge from the same quantity of data while harnessing greater computational resources.
The architecture is also evolving from a single-cluster to a multi-cluster paradigm. Future training iterations may involve collaborative learning across as many as 10 million GPUs, necessitating heightened fault tolerance.
Sam Altman’s Dialogue with the GPT-4.5 Team
The following is an edited compilation of a discussion between Sam Altman and the OpenAI GPT-4.5 team:
Sam Altman: What does it take to build a model as large as GPT-4.5?
Alex Paino: We started this project about two years ago. OpenAI was about to bring up a new large computing cluster, and our team saw it as an opportunity to run a series of exercises to work out which capabilities the model needed, and we carried out a large number of de-risking test runs.
We laid out a long plan for this, covering the entire technology stack from systems to machine learning. Reducing risk and preparing for training is a long execution process, and the training run itself is a very large project.
Amin Tootoonchian: I think this process requires close cooperation between the machine learning team and the systems team from the very beginning, continuing until we are clear about what model we want to train, and only then does training start.
We made predictions on both the machine learning and systems sides, trying to narrow the gap between expectation and reality as much as possible. But because we work at a fast rhythm and have to use the latest computing resources, model training is difficult to plan perfectly in advance.
We almost always start training with many unresolved problems and try to overcome challenges and make progress as the run proceeds. The main remedy is to add more computing resources.
The final stage is execution, which requires many people to sustain a great deal of energy and motivation over a long period to complete the training run.
Sam Altman: How large do you think the gap is between our expectations and reality?
Amin Tootoonchian: On the systems side, we are usually far from the expected state at the start. We always face a choice: postpone the launch and wait for problems to be solved, or launch early and solve them along the way. This is always a trade-off, made to avoid unreasonable delays.
But there are almost always unexpected problems, and our job is to handle them as best we can, deal with the unknowns, and lay out a plan for the training run.
Alex Paino: In this project, our goal was to make GPT-4.5, meaning a model 10 times smarter than GPT-4. That is the initial goal we set about two years ago.
A lot happened along the way. We kept asking whether we would do better or worse than expected. It was a very complicated process, but in the end, in terms of the effective compute we put in, we got a model that we believe is 10 times smarter than GPT-4.
Amin Tootoonchian: In terms of execution, the GPT-4.5 project took far longer than we initially expected.
Sam Altman: Why did you run into so many problems when the cluster grew from 10,000 GPUs to 100,000 GPUs?
Amin Tootoonchian: I think most problems can be observed at small scale, if the system developers are perceptive enough.
Some problems are not unique to large-scale training - they have occurred before - but they become catastrophic once the scale increases, especially when the team has not anticipated that they would worsen to that extent.
Sam Altman: What kinds of things caused catastrophic consequences?
Amin Tootoonchian: The infrastructure problems are well known: the failure rate, the variety of failure types, and the total volume of failures are all very high. The 100,000-GPU cluster is a huge sample pool, so we also uncovered problems the compute provider itself had never observed.
The network is one source of failures, and individual accelerators can fail too. But that is also the beauty of such a system - almost every component has to work as expected to produce the expected result. Our job is to minimize these problems as much as possible.
Sam Altman: It is indeed difficult to work at the limit of cluster size, but I have also noticed that things which are no longer at the technological frontier have become much easier. Training GPT-4.5 required hundreds of people, and OpenAI had almost everyone on board.
But today, if you were to select the smallest team from OpenAI and retrain GPT-4 from scratch with all the knowledge and system work we know, how many people would it take?
Alex Paino: I think it would take about 5 to 10 people to build a GPT-4-level model now. The technology stack improved greatly in the course of completing GPT-4.5.
In fact, we did something similar while training GPT-4.5: we trained GPT-4o, a GPT-4-level model, retraining it with much of the same content from the GPT-4.5 research program. Far fewer people were needed for that run.
Sam Altman: What about from your perspective, Dan? Why is it hard to train large models?
Daniel Selsam: I think it’s hard to do anything new. I think even just discovering that someone else has done something makes it much easier, because the hardest part is having the faith to do something in the first place. I think just knowing that something is feasible is a super cheat code that makes things much easier.
Alex Paino: We are expanding the GPT pre-training run to 10 times its previous size, and we always find some interesting new things that you can’t necessarily predict.
Sam Altman: What is needed to achieve the next 10x or 100x growth in pre-training scale?
Daniel Selsam: Data efficiency. The Transformer architecture (that is, GPT) is very efficient at using data: it absorbs and compresses information well and generalizes from it. Its defining feature is that it can absorb information efficiently given enough compute.
However, the depth of insight it extracts from data is limited. When compute grows rapidly while data grows relatively slowly, data becomes the bottleneck for this standard paradigm. That calls for algorithmic innovation: methods that use more compute to learn more knowledge from the same amount of data.
Sam Altman: What else do you think we need in order to keep scaling?
Amin Tootoonchian: My answer is on the systems side. I think the huge amount of work GPT-4.5 required was essentially the inevitable consequence of the model’s specifications; we could not have trained GPT-4.5 with exactly the same technical architecture as GPT-4.
In terms of state management, the required compute exceeded what a single cluster could bear, so we had to move to a multi-cluster training architecture. To get there, we had to integrate several different workflows in a short period of time.
Although that did deliver staged breakthroughs, the next order-of-magnitude performance improvement will require solving several known but temporarily shelved technical problems - they cannot be avoided. Trade-offs of this kind keep stretching out the development cycle of the perfect system; we are always making strategic compromises in pursuit of the best implementation plan.
To be clear, the system itself is not the end goal; what it actually delivers is the core consideration. For the next 10x performance improvement, I think a breakthrough in fault tolerance is crucial. We need fault-tolerance mechanisms deeply co-designed with the workload, so that operational anxiety drops dramatically. The operational complexity of today’s ultra-large-scale systems is fundamentally different from anything that came before.
Sam Altman: Do you know what percentage of failures during GPT-4.5 training were caused by particular components?
Amin Tootoonchian: I don’t have specific numbers to share, but in general, the early days of a new hardware generation bring many technical challenges that are not yet well understood. We chose to push the project forward before the problems were fully characterized, which made the initial failure rate high.
But experience shows that as root causes are identified and fixed, the failure rate drops significantly. This essentially reflects a deepening understanding of the infrastructure - some people call it cleaning up the infrastructure, or coming to understand its basic problems.
The early stages of execution are almost always painful. As the project advances we keep discovering and fixing new failure modes, but the failure rate gradually falls and uptime grows.
It is essentially a question of prioritization: early in the infrastructure’s life cycle, its failure risk is hard to estimate accurately, and if we over-pursue the ultimate ideal state, the system’s availability in those early stages can end up extremely poor.
Sam Altman: Although reasoning models are a key component of our future technology stack, let’s set them aside for a moment and focus on the development frontier of traditional pre-trained models. Suppose we had unlimited GPU compute, unlimited network bandwidth, and unlimited power, but were still constrained by today’s technical bottlenecks - including system reliability issues, the lack of fault-tolerant training methods, and the limitations of existing datasets.
Following our pattern of a roughly 100-fold scale increase with each major GPT version number, and given today’s technical boundaries, how far can pre-trained models go? For the GPT series specifically, with the knowledge we now have, what kind of model could we theoretically train? Could we make GPT-5.5?
Alex Paino: From the perspective of machine learning and algorithm development, we have not yet hit a clear theoretical ceiling. In fact, we are only beginning to explore more data-efficient algorithms and how to make fuller use of the data we already have. The situation is very interesting - even models like GPT-4 were developed largely under the constraint of limited compute, which shaped the direction of most earlier research.
But the situation is completely different now. Since GPT-4.5, along some key dimensions, data rather than compute has become the main constraint. That shift makes this line of research less exciting.
Sam Altman: But this is genuinely amazing progress, and the world may not fully appreciate that compute is no longer the main bottleneck on the best model we can build. That change is profound; after all, we have lived in a compute-constrained environment for a very long time.
Sam Altman: What is the most interesting machine learning lesson we learned while training GPT-4.5? Share whatever you would like.
Amin Tootoonchian: In general, the most thought-provoking cases are the ones that deviate from our predictions - especially when we try to understand why actual performance departs from the expected curve.
Alex Paino: One of the most surprising findings for us was how differently various machine learning components scale. Some parts scale well; others do not. We only truly realized this during the actual training run, and the experience gave us a lot of ideas.
Daniel Selsam: I think the two core properties of the GPT paradigm are these: first, test loss (a metric for how well the model performs on unseen test data) can be predicted accurately; second, performance improves predictably as scale grows. Even more magically, the reduction in test loss translates into intelligence that is enhanced across the board, in ways that are hard to quantify yet remarkable.
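As a rough illustration of that first property, here is a minimal sketch of fitting a saturating power law to test loss measured at small scales and extrapolating it to a far larger run. The functional form, the synthetic numbers, and the loss_curve helper are assumptions for demonstration, not OpenAI’s actual methodology.

```python
# Minimal sketch, not OpenAI's methodology: fit test loss measured at small scales
# to a saturating power law and extrapolate to a much larger run. The synthetic
# "measurements" below are generated from the same functional form.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(compute, a, b, floor):
    # Test loss falls as a power law in compute toward an irreducible floor.
    return a * compute ** (-b) + floor

# Hypothetical small-scale runs; compute in arbitrary units (say, 1e18 FLOPs).
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
test_loss = loss_curve(compute, 1.6, 0.15, 1.7)  # pretend these were measured

params, _ = curve_fit(loss_curve, compute, test_loss, p0=[1.0, 0.1, 1.0])
a, b, floor = params

# Predict the loss of a run 100x larger than the biggest measured point.
target = 10_000.0
print(f"fit: a={a:.2f}, b={b:.3f}, floor={floor:.2f}")
print(f"predicted test loss at {target:.0f} units of compute: "
      f"{loss_curve(target, a, b, floor):.3f}")
```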
Sam Altman: Are you absolutely optimistic about this? Do you fully agree with this view?
Daniel Selsam: Actually, what I would say is that we found particularly interesting phenomena when testing GPT-4.5 - on re-examination, the model showed many subtle abilities that exceeded everyone’s expectations.
We were sure it would become smarter in ways that cannot be specified in advance, and after deployment we could see these subtle improvements in user satisfaction: a deeper store of common sense, more accurate contextual understanding, a more delicate grasp of meaning - this is exactly the magic bought by that extra reduction in test loss. In my view, the Scaling Law has been perfectly validated along this dimension.
Sam Altman: What was the most positive moment during the whole training process? What is your favorite memory? Obviously there was a lot of pain, but I hope that pain has eased by now.
Alex Paino: I do have one. We did a lot of machine learning work during the run, and I think some of the changes we made mid-training had a rather good impact, possibly better than expected - that was a very exciting moment for us.
Amin Tootoonchian: For me, we were building infrastructure at the same time as training. We firmly believed we could get over that performance cliff, we had a plan, and everyone was executing it, but it took a long time. It was hard work, and definitely harder than I expected. My prediction was wrong; I underestimated how long it would take to solve those problems.
The moment the team finally cracked those key problems and performance improved markedly is still fresh in my memory. You could clearly feel the energy of the whole team shift - suddenly everyone was full of energy, pushing toward the finish with new motivation.
The most magical thing was that the estimated completion time on our status tracker kept shrinking from the initial two years, until it finally locked onto a clear date. That visible progress gave an immeasurable boost to team morale. I think that is the beauty of it.
I want to emphasize that the machine learning work never stopped. Even after training started, the machine-learning co-design process continued. The machine learning team not only actively followed up on the issues that had been marked ‘deal with later’, but also kept delivering improvements that genuinely reduced training time.
This perfectly reflects our team spirit - there is no ‘everyone sweeps only their own doorstep’ boundary here, just truly seamless collaboration, and that cohesion is our greatest strength.
Sam Altman: The outside world has discussed at length the challenges of this training run and the accuracy of its predictions, but all of it rested on extremely thorough planning - can you say more about that?
Alex Paino: This was definitely our most thorough plan to date. As I said, we began preparing a year before training officially started, and during that period we ran multiple large-scale de-risking tests.
We paid special attention to introducing improvements gradually: starting from a high-confidence base configuration - think of it as a mature architecture similar to GPT-4, one we had fully mastered at the machine learning level - and then layering new features on top like building blocks.
The key is to rigorously verify how each improvement scales: not just to see a performance gain, but to make sure the gain persists as model scale grows. Many improvements look good in small-scale tests but fall apart at large scale.
So we stayed highly vigilant throughout and kept iterating on our scaling-law methodology. Through this de-risking practice we accumulated a lot of valuable experience that will keep guiding the development of future GPT-series models.
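A minimal sketch of that verification step, using entirely hypothetical numbers: an improvement is trusted only if its gain over the baseline holds up as scale increases, not just at the smallest test size.

```python
# Hypothetical numbers only: check whether a candidate improvement's gain over the
# baseline persists as the de-risking runs approach full scale.
import numpy as np

scales = np.array([1e-4, 1e-3, 1e-2, 1e-1])           # fraction of full-run compute
baseline_loss = np.array([3.60, 3.20, 2.85, 2.55])
candidate_loss = np.array([3.52, 3.13, 2.80, 2.52])   # baseline + one new feature

gain = baseline_loss - candidate_loss
slope_per_decade = np.polyfit(np.log10(scales), gain, 1)[0]

print("gain at each scale:", gain)
print(f"trend of gain vs log10(compute): {slope_per_decade:+.4f} per decade")
if slope_per_decade < 0:
    print("gain shrinks with scale - re-verify before committing to the full run")
```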
Amin Tootoonchian: I remember a particularly interesting moment that I look back on fondly. You know, we run into all sorts of bugs almost every time we launch a training run; that is commonplace. The key is to keep progress from stalling and to keep confirming that we really are on track and that the bugs will not be fatal to the health of the run.
At first we were quite convinced there was a major defect, but through the monitoring system we had built we were able to pin down the root cause precisely: Was it a hardware failure, and of what type? Was it data corruption? Was it a bug in the machine learning model itself? Or was it a race condition in the code?
At the time we had several problem threads open at once, each with different symptoms. After a string of bug fixes we were stuck: multiple unsolved problems sat in front of us, and everyone was racking their brains - were these caused by different bugs, or was a single bug at work?
Later we held a vote, letting team members pick the most likely root cause. The least-favored option turned out to be right: there was a problem in the torch.sum function in upstream PyTorch, a simple summation operation.
This bug was particularly interesting. We mainly use Triton kernels, and only fall back to torch operations in a few unimportant edge cases. The torch.sum bug triggered by our specific code path would occasionally cause an illegal memory access because of characteristics of the data distribution - it miscalculated a memory offset.
The most dramatic part was that once an engineer finally located the problem and submitted a fix, all of the error reports with their different symptoms disappeared. Everyone excitedly renamed the Slack channel from ‘multi-bug theory’ to ‘single-bug theory’, and it was a very happy scene.
How long had this bug been lurking? It had been there since the early stages of training and was not identified until the progress bar had passed roughly 40%. The discovery itself was dramatic: a complex kernel was making a sequence of calls, and it was the second call that triggered the illegal memory access.
Although the crash frequency was extremely low (it occurred only once every few hundred or even few thousand steps) and could easily have been dismissed as a sporadic failure, our team’s rule is: never let an anomaly go. The best part of this story is that perseverance.
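As an illustration of that ‘never let an anomaly go’ mindset - and not a reconstruction of the actual bug or of OpenAI’s tooling - here is a minimal fuzz-testing sketch: repeatedly sum randomly shaped, non-contiguous tensors and compare against a float64 reference, so that a rare, data-dependent kernel error eventually surfaces.

```python
# Illustrative fuzz harness, not OpenAI's tooling: stress a summation op on
# non-contiguous views of random tensors and compare against a float64 reference,
# so a rare, data-dependent mismatch is caught instead of dismissed as noise.
import torch

def fuzz_sum(iterations=10_000, seed=0):
    gen = torch.Generator().manual_seed(seed)
    for i in range(iterations):
        shape = [int(torch.randint(1, 64, (1,), generator=gen)) for _ in range(3)]
        x = torch.randn(*shape, generator=gen)
        view = x[:, ::2, :]            # non-contiguous view: a less common code path
        got = view.sum()
        want = view.double().sum()     # higher-precision reference result
        if not torch.allclose(got.double(), want, rtol=1e-3, atol=1e-3):
            print(f"iteration {i}: mismatch {got.item():.6f} vs {want.item():.6f}")
            return shape
    print("no mismatches found")

if __name__ == "__main__":
    fuzz_sum()
```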
Sam Altman: What else do you need to do once GPT-4.5 pre-training has started?
Alex Paino: All of us watch the loss curve frequently. Beyond that, we keep optimizing the system and finishing the co-design work that was not completed before launch. We closely monitor all kinds of statistics during the run to make sure no unexpected trends appear, and at the same time we explore possible improvements from the machine learning side. Although data-side work drops off temporarily once pre-training starts, there is still a great deal to do.
Amin Tootoonchian: I think machine learning depends to a large extent on judgments of correctness. Once pre-training starts, faced with a flood of noisy signals, we are like fortune tellers reading tea leaves, having to judge whether the system is healthy. That is our responsibility.
Sam Altman: At the systems level, what limits our ability to train models? Chips, processors, memory, networking, or power?
Amin Tootoonchian: The beauty of systems work is that with co-design, the workload can be adapted to the infrastructure you build. There is no universal statement that the network is the bottleneck, or that memory bandwidth is the bottleneck. Even for models of the same specification we can choose to shift resource requirements around; we can choose to build a more balanced system, although more memory bandwidth is always welcome. The question is hard to answer without specifying the constraints.
When designing GPT-4.5, we may need the system to have some particular property, and that property has to be arrived at through human guidance. So co-design is very important in shaping the model architecture and its elements, and to some extent it ties the systems and machine learning sides together. If the system ends up with a property we would rather it did not have, that is not ideal. My ideal is that everything is decoupled, giving each side the greatest possible room to move.
Sometimes things are coupled, and we have to meet the requirements of the infrastructure, or things simply have to be a certain way. Most of the time we want a balanced system with balanced communication, and the best lever of adjustment we have is all of this co-design.
Sam Altman: How far are we from that ideal?
Amin Tootoonchian: Quite far. Building a system is always like this: you start with an idealized view of how things should work and then reconcile the differences with the resources you actually have.
I do not think we do it as theory for theory’s sake; we just discuss what we want the system to become, realize it, and get as close to that ideal as possible. That may be the most exciting part of the systems field. People may say ‘this is an elegant system design’, and history will ultimately tell us whether the choice was right or wrong.
Sam Altman: If you could have the answer to one machine learning question before the next big training run, what would you most want to know?
Alex Paino: I would want to know which algorithms we should use with limited data in specific domains. That is a broad question, but it really is the critical one.
Sam Altman: Will we run synchronous pre-training across 10 million GPUs or more in the future?
Alex Paino: I think we will, but it may not be pre-training in the traditional sense. Its form may be very different from today’s technology, but unsupervised learning will remain at its core.
Amin Tootoonchian: I lean toward a semi-synchronous model. Given the laws of physics, full synchronization is not very realistic.
Daniel Selsam: I think it is more likely to be decentralized. There will certainly be 10 million GPUs working together in an AI system that learns and performs tasks, but, like the different parts of the brain, they will not necessarily all communicate with one another.
Sam Altman: How big is the gap in data efficiency between today’s most advanced algorithms and humans? Can it be closed in the future?
Daniel Selsam: The two are hard to compare directly. In language learning the gap is certainly enormous; the crux is how you count the information arriving at the human optic nerve. Overall, I think algorithms are far less data-efficient than humans.
For decades, deep learning has focused on compute efficiency. Alongside the growth of data and compute, what is really striking is the compounding effect of algorithmic improvements: each 10% or 20% gain in algorithmic performance compounds into something significant on top of data efficiency. Until now there has been no comparable mobilization around data efficiency, because it was not worth it while data was not the binding constraint and compute was.
Now we are entering a new stage of AI research in which we will start stacking up data-efficiency wins. I think it is a bit foolish to predict that we will hit insurmountable obstacles. The human brain certainly works differently from our algorithms, and we should be cautious with the analogy, but I think we should remain optimistic about the future development of algorithms.
Sam Altman: How does larger-scale pre-training relate to a model’s stronger learning and reasoning abilities?
Alex Paino: What we have observed is that better pre-training and unsupervised learning tend to raise the model’s overall intelligence and help greatly with generalization, which complements reasoning ability nicely, whereas reasoning may be a bit slower at raising intelligence. I think the two are complementary.
Sam Altman: Pre-training seems to make a model general across many things, whereas training a model to reason only makes it good at one class of tasks - is that right?
Alex Paino: That is interesting, but once you look at the data used to train them, it is not surprising. The range of the pre-training dataset is enormous, and what we pursue is breadth and diversity. With reinforcement learning, where the model has to obtain clear, good reward signals and a good training environment, I think it is hard to match that breadth.
Daniel Selsam: I agree, but I think there is another factor. Pre-training is essentially compressing data, and in doing so it discovers connections between different things - it is about analogy, about the abstract. Reasoning is a skill of thinking carefully about a specific problem, and it also yields solutions to many kinds of problems. But in pre-training, compressing data across many domains lets the model learn more abstract knowledge.
Sam Altman: Why is unsupervised learning effective?
Daniel Selsam: The key is compression. The ideal form of intelligence is Solomonoff induction. In general terms, it considers every possibility, but prefers to test the simpler programs first.
The essence of today’s pre-training is a compression process: it approximates that ideal by looking for the simplest ‘program’ that explains all the data humans have produced so far.
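For reference, a standard textbook statement of that ideal (added here for clarity, not quoted from the interview) weights every program by its length, so shorter programs dominate the prediction:

$$
M(x) \;=\; \sum_{p\,:\,U(p)\,=\,x*} 2^{-|p|},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\,x_{t+1})}{M(x_{1:t})},
$$

where $U$ is a universal prefix machine, $|p|$ is a program’s length in bits, and the sum runs over programs whose output begins with $x$. ‘Finding the simplest program that explains the data’ is the practical approximation of this ideal.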
Sam Altman: How does next-token prediction help achieve compression?
Daniel Selsam: There is a paradox in statistics: deep networks seem not to compress anything, yet they generalize. Normally, when you have a lot of data and small models, the models are forced to compress in order to learn something.
In pre-training, both the data and the model are very large, so some people assume the training is just memorization plus interpolation. What they overlook is another way of understanding compression - prequential compression. It works like a compressor: even though the model’s weights are very large, the encoding does not need to store them, because the outcome of each next-token prediction lets the useful information be recovered on the fly, which improves compression efficiency.
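A minimal sketch of prequential coding, with a toy symbol-level predictor standing in for a language model: the code length of a sequence is the accumulated -log2 p(next symbol) under a model updated online, so a better next-token predictor yields a shorter code without the model itself ever being part of the message.

```python
# Minimal sketch of prequential coding with a toy Laplace-smoothed unigram model:
# the code length equals the accumulated -log2 of each prediction made *before*
# seeing the symbol, so the model's parameters never need to be transmitted.
import math

def prequential_bits(sequence, alphabet):
    counts = {s: 1 for s in alphabet}   # one pseudo-count per symbol (Laplace)
    seen = len(alphabet)
    bits = 0.0
    for symbol in sequence:
        p = counts[symbol] / seen       # predict first ...
        bits += -math.log2(p)           # ... pay the code-length cost ...
        counts[symbol] += 1             # ... then update the model online
        seen += 1
    return bits

text = "a" * 15 + "b"                   # a highly predictable toy sequence
alphabet = sorted(set(text))
uniform_bits = len(text) * math.log2(len(alphabet))
print(f"prequential code length: {prequential_bits(text, alphabet):.1f} bits "
      f"(vs {uniform_bits:.1f} bits for a uniform code)")
```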
Sam Altman: Training GPT-4.5 took enormous manpower, time, and money. It can really be seen as an experiment that tested the Scaling Law, and the results show that it holds and will keep holding for a long time. Why does the Scaling Law behave like a law of the universe?
Daniel Selsam: The greater the compression, the more powerful the intelligence, and that has profound philosophical implications. Why does training larger models for longer yield higher compression? Many theories bear on this; my favorite involves sparse representations.
Key concepts in the real world follow a power-law distribution. For example, the 100th most important concept might appear in only one of every 100 documents, so there is a pronounced long tail. Capturing all of the key concepts therefore requires massive data and compute, and it is also why the Scaling Law will hold for a long time.
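A minimal sketch of that long-tail argument, with an assumed Zipf-like exponent and purely illustrative counts: if a concept’s frequency falls off as one over its rank, the data needed to see it even a handful of times grows roughly in proportion to the rank.

```python
# Illustrative long-tail calculation: under a Zipf-like distribution (frequency of
# the k-th concept proportional to 1/k), estimate how many documents are needed to
# encounter a given concept about ten times. Exponent and counts are assumptions.
import numpy as np

num_concepts = 100_000
ranks = np.arange(1, num_concepts + 1)
freq = (1.0 / ranks) / np.sum(1.0 / ranks)     # normalized Zipf frequencies

for rank in (10, 100, 1_000, 10_000):
    docs_needed = 10 / freq[rank - 1]          # ~10 sightings of this concept
    print(f"concept rank {rank:>6}: ~{docs_needed:,.0f} documents")
```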