A Pioneer on a Non-Mainstream Technical Path
Could you briefly introduce yourself?
I’m Zhong Yiran, Senior Research Director at MiniMax, where I primarily oversee the design of network architectures and multimodal understanding large models. At MiniMax, my main responsibility is to lead the design of the MiniMax-01 network structure.
Previously, I served as a PI for the New Architecture Exploration Group at the Shanghai Artificial Intelligence Laboratory, focusing on efficient training modeling methods for non-transformer architectures and research on visual-audio-language multimodal fusion.
When did you begin researching linear attention, and why did you choose this technical route?
I started researching linear attention around July 2021. This stemmed from a paper I worked on for my PhD in 2020, ‘Invertible Attention.’ At the time, both invertible neural networks and attention mechanisms were quite popular, so we combined them in our research.
Later, some members of our team became very interested in mathematics. Efficient sequence modeling methods like linear attention require a strong mathematical foundation and involve numerous formula derivations, which aligned perfectly with the team’s interests, so we chose this direction.
What was the status of linear attention in the industry at that time?
It was very non-mainstream, with few people working on it. Most researchers were focused on transformers, which had essentially become the dominant force in NLP.
We thought that instead of being just another face in the crowd doing transformer research, we should do something different.
How did you assess the technical potential of the linear attention route?
Our initial motivation was straightforward: to address the quadratic computational complexity of transformers. We tested various methods, including sparse transformers and linear attention.
We found that sparse transformers did work, offering faster speed and lower memory usage compared to transformers. However, linear attention performed poorly and was also slow. Despite this, we chose to pursue linear attention.
One reason was its mathematical appeal – we believed its performance should be better. The other was that we felt the upper limit of sparse attention was full attention, making it difficult to surpass. Linear attention, on the other hand, had the potential to exceed it.
Could you explain what linear attention is?
Linear attention is essentially a kernel trick. In transformers, multiplying the Q, K, and V matrices involves different computational complexities depending on whether you multiply QK first or KV first, due to the different dimensions.
Multiplying KV first reduces the computational complexity to linear. The problem is that the QK product is followed by a softmax operation, which cannot be decomposed, so the computation cannot simply be rearranged to multiply KV first. Therefore, the first step in linear attention is to remove the softmax.
But removing the softmax affects the results. The subsequent task is to maintain consistency in the results without softmax, which is what linear attention aims to achieve.
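To make the complexity argument concrete, here is a minimal sketch (my own illustrative shapes and variable names, not any production kernel) comparing the two multiplication orders once the softmax is removed; real linear attention additionally applies a non-negative feature map to Q and K and a normalization term:

```python
import torch

n, d = 4096, 64                    # sequence length and head dimension (illustrative)
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

# Softmax attention must form the full n x n score matrix: O(n^2 * d).
scores = torch.softmax(Q @ K.T / d**0.5, dim=-1)    # (n, n)
out_softmax = scores @ V                            # (n, d)

# Without the softmax, the product can be re-associated: K^T V is only (d, d),
# so the total cost drops to O(n * d^2), i.e. linear in the sequence length.
kv = K.T @ V                                        # (d, d)
out_linear = Q @ kv                                 # (n, d)
```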
What are the fundamental differences between linear attention, sparse attention, and linear RNN architectures?
Sparse attention is still essentially softmax attention; it simply computes fewer entries than a dense attention matrix. For example, sliding window attention only computes attention scores within a window, achieving acceleration by reducing the amount of computation.
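As a small illustration of the sliding-window idea (a toy sketch of my own, not any particular implementation):

```python
import torch

n, w = 8, 3                              # sequence length and window size (illustrative)
i = torch.arange(n).unsqueeze(1)         # query positions
j = torch.arange(n).unsqueeze(0)         # key positions

# Each query attends only to the previous w positions (causal sliding window),
# so roughly n * w scores are computed instead of the full n^2.
mask = (j <= i) & (j > i - w)
print(mask.int())
```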
Linear RNNs and linear attention are essentially the same thing, just called RNNs by some and attention by others.
Everything can be written in RNN form. For example, lightning attention corresponds to RWKV-4, while RWKV-7 is an improved version of the gated delta net. Although they are similar in essence, their implementation details differ.
What are the key milestones in the research of linear attention mechanisms?
Around 2018-19, research showed that the theoretical computational complexity of transformer softmax attention could be reduced using kernel tricks, but the results were poor, and efficiency was low.
In 2019-20, sparse attention was dominant, with companies like Google proposing many sparse attention variants. Later, linear attention began to emerge, but it faced the challenge of poor performance and slow speed.
Researchers mainly adopted two approaches to improvement: one was to approximate the softmax function, making the distribution conform to softmax; the other, which we chose, was to model using completely different methods, without concerning ourselves with approximating softmax.
We published our first paper, ‘cosFormer: Rethinking Softmax in Attention,’ in October 2021, which replaced the softmax operation with a cosine function, allowing the computation to be split.
In the first half of 2022, we published a second paper, ‘The Devil in Linear Transformer,’ which analyzed the reasons for the performance degradation of linear attention and provided solutions. This was the precursor to lightning attention.
Later, we also researched position encodings specifically for linear attention and long convolutions, publishing TNN, ‘Toeplitz Neural Network for Sequence Modeling,’ a method similar to S4 (the predecessor of Mamba).
Finally, we launched lightning attention, which matched the performance of transformers through improved decay methods and network structures. We also used a tiling technique to make it faster.
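As a rough illustration of the recurrence involved (a minimal sketch of generic linear attention with a handcrafted decay, under my own naming; it is not the actual lightning attention kernel, which computes the same quantity block by block):

```python
import torch

def linear_attention_with_decay(q, k, v, decay=0.98):
    """Causal linear attention written as an RNN over a (d x d) state.
    q, k, v: (n, d); `decay` is a handcrafted scalar, purely illustrative."""
    n, d = q.shape
    state = torch.zeros(d, d)
    outputs = []
    for t in range(n):
        # Exponentially decay the accumulated history, then add the new key-value outer product.
        state = decay * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)       # read the state out with the current query
    return torch.stack(outputs)            # (n, d)

out = linear_attention_with_decay(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```

The tiling technique mentioned above evaluates this same recurrence in blocks, handling each block with dense matrix multiplications and carrying a decayed state across blocks, which is what makes it GPU-friendly.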
What are your thoughts on the current non-transformer architecture technical routes?
Linear attention is actually a non-transformer method. Currently, besides RNN-like approaches, other non-transformer architectures are declining.
For example, CNN-style approaches such as long convolutions and large-kernel convolutions seem to have been gradually phased out due to poor performance, but they are actually quite strong in certain respects and still work for some sequence modeling tasks, such as anomaly detection.
There are actually only three non-transformer architectures: linear attention, long convolutions, and linear RNNs.
But in reality, these three can be unified into one, which we call the linear complexity model. We wrote an article encompassing all three.
What are the core differences between lightning attention and Mamba and RWKV?
The most core difference is that lightning attention is the simplest linear attention. Mamba and RWKV both use data-dependent decay, while lightning attention uses handcrafted decay for speed.
Although learnable decay can achieve better results, it sacrifices speed. For example, RWKV-7 is 10-15% slower than the gated delta net, while the gated delta net is about half the speed of lightning attention.
RWKV’s modeling quality is indeed better than lightning attention’s, but it is slower and has not yet solved the retrieval problem.
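To make the decay distinction concrete, here is a minimal sketch (my own illustrative snippet, not any of these models’ actual code): handcrafted decay is a fixed constant, while data-dependent decay is predicted from the current token, which costs an extra projection at every step:

```python
import torch
import torch.nn as nn

d = 64
x_t = torch.randn(d)                                  # current token's hidden state (illustrative)

# Handcrafted decay (lightning-attention style): a constant chosen ahead of time, e.g. per head.
handcrafted_decay = 0.98

# Data-dependent decay (Mamba / RWKV / gated-delta-net style): a gate predicted from x_t.
to_decay = nn.Linear(d, 1)
data_dependent_decay = torch.sigmoid(to_decay(x_t))   # lies in (0, 1) and varies per token
```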
Is it now industry consensus that linear attention has a high and feasible upper limit?
No, it’s not consensus even now. If it were, everyone would be scaling up linear attention models, but as you can see, that’s not the case.
But for us, we already saw this in the second half of 2023. At the time, I talked with many people, and the most common point they raised was that they knew linear attention worked at small scale, but they felt it would fail once scaled up.
At the time, I thought I would scale it up for everyone to see. Now that MiniMax-01 is out, no one doubts the ability of linear attention on a large scale.
From Small Experiments to Large-Scale Implementation
Do you think the upper limit of linear attention can surpass full attention?
We can now see that hybrid architectures are better than pure transformers. But the biggest problem with pure linear attention is retrieval ability, which is a difficult problem for academia to solve.
Existing methods, although complex and slow, still cannot completely solve it, which is why it is necessary to move towards hybrid architectures.
What did you observe that made you decide to step out of the lab?
In May-June 2023, we already had lightning attention 2 internally, which was the world’s first linear attention implementation that was faster than FlashAttention.
We believed it had crossed the industrial red line: its technological maturity was high enough to scale up.
How do you define this industrial red line?
First, its quality is better than the transformer’s, and second, it is faster than the transformer. Together, these give it the ability to replace the transformer. We verified this on a 15B-scale dense model at the time.
When you stepped out of the lab, why did you ultimately end up working with MiniMax?
Actually, I had talked with some large companies at the time, but in the end I made this happen with MiniMax.
First of all, cosFormer is a paper I collaborated on with Junjie, so we already had a foundation for cooperation; Junjie was my boss when he was at SenseTime. At the end of 2023, Junjie invited me to dinner. He is quite confident in the possibilities of these cutting-edge technologies, and my understanding is that he was also looking for a technical breakthrough at the time.
At that time, MiniMax had completed its research on MoE, and there were actually very few technical breakthrough points left for the next step. Lightning attention had been released, and Mamba was also popular, so in his eyes it was a feasible direction.
Is this related to MiniMax’s interactive companion product?
There is no connection. Yan Junjie is more concerned about the upper limit of the model and how to further break through this ceiling.
In the public eye, linear attention may be seen more as a way to improve efficiency than as a way to break through the ceiling.
The point is that every company’s compute budget is fixed. The faster the model can be made, the more data it can consume, and the better the resulting model. With compute fixed, a faster model is a better model.
Have you observed a situation where data has peaked?
Not yet. Data is still scaling up, though perhaps not as aggressively as in 2023.
Data keeps growing: new data comes out every day, so the model always has new data to process. The Internet produces an enormous amount of data daily, and through cleaning we can still extract new training data from it.
Compared to the data that has existed for so many years of human development, has the data growth rate slowed down?
Actually, not necessarily. Look at China’s five thousand years of history: only those few books were accumulated. But with the development of the Internet, data volume has grown along a very steep curve. All the data generated before the Internet may not add up to what is now generated in a single year.
During the scale-up process, what challenges did lightning attention face?
To verify its scalability, we first did scaling law experiments, gradually expanding from small models to 7B, 9B, and finally scaling to models with more than 400B.
We also showed theoretically that the capacity of linear attention is larger than that of the transformer.
We define capacity as the size of the RNN’s current state. For the transformer, the capacity is O(d), where d is the hidden dimension; for linear attention, the capacity is d²/h, where h is the number of heads. Since d is much larger than h, linear attention’s capacity is larger.
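Plugging in illustrative numbers of my own choosing (not MiniMax-01’s actual dimensions) makes the gap concrete:

```python
d, h = 4096, 32                          # hypothetical hidden size and head count
transformer_capacity = d                 # O(d), per the definition above
linear_attention_capacity = d**2 // h    # d^2 / h
print(linear_attention_capacity / transformer_capacity)   # 128.0, i.e. 128x larger
```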
In the end, we also verified that the hybrid model is better than the pure transformer.
How is the 4M-token context window achieved?
For lightning attention, the training length can be arbitrary. As long as the compute is fully utilized, training at 8K, 32K, or 128K runs at the same speed, with the same TGS (tokens per GPU per second).
Because the transformer has O(n²) computational complexity, latency grows along a quadratic curve as the sequence gets longer. At 1M length, the latency of softmax attention is 2,700 times that of lightning attention.
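As a back-of-the-envelope sketch of why the gap gets so large (my own illustrative arithmetic, assuming cost simply scales as n² for softmax attention and as n for linear attention):

```python
# Going from 8K to 1M multiplies the sequence length by 128, so the quadratic
# cost grows by 128^2 while the linear cost grows only by 128.
n_short, n_long = 8_192, 1_048_576
print((n_long / n_short) ** 2)    # softmax attention cost grows ~16,384x
print(n_long / n_short)           # linear attention cost grows ~128x
```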
What technical challenges still need to be addressed to achieve an infinite context window in the future?
In our current hybrid architecture, there is still 1/8 softmax attention, and that becomes the bottleneck at 1M length: the latency contributed by that 1/8 is much higher than that of the remaining 7/8 of linear attention.
If we want to optimize long text, we must consider optimizing the softmax attention part. We can learn from sparse attention methods to make it faster and lighter.
In addition, we are also considering making the mix of softmax and linear attention more extreme, no longer 1/8 but possibly 1/16 or 1/32. The most radical option is to keep only a single softmax layer in the entire model, but to be safe we did not adopt it, mainly out of concern for its impact on retrieval ability.
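Schematically, the interleaving can be pictured like this (a minimal sketch under my own naming, not MiniMax-01’s actual implementation): one softmax attention block out of every softmax_ratio blocks, with everything else linear attention.

```python
def build_hybrid_stack(num_layers: int, softmax_ratio: int = 8) -> list:
    """Return a layer plan with one softmax-attention block per `softmax_ratio` blocks.
    Purely illustrative; real blocks would also contain FFN/MoE, norms, etc."""
    plan = []
    for i in range(num_layers):
        if (i + 1) % softmax_ratio == 0:
            plan.append("softmax_attention")   # kept for retrieval ability
        else:
            plan.append("linear_attention")    # keeps cost linear in sequence length
    return plan

print(build_hybrid_stack(16))        # the 1/8 mix
print(build_hybrid_stack(32, 16))    # the more extreme 1/16 mix discussed above
```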
Why is retrieval ability so important to the model?
Retrieval is the basis of in-context learning and is a necessary condition.
You must remember the information in the context to do in-context learning, and in-context learning underpins all the advanced capabilities of current large models. CoT (Chain of Thought), especially long CoT, relies on retrieval ability.
Decisive New Architecture
Have you paid attention to the latest architectural improvements in FFN and attention in the industry?
For the FFN, the improvement is MoE. I have also paid attention to ByteDance’s UltraMem, but I think it is a lossy thing, a lossy compression. There may be problems when it is scaled up in the future, but we have not scaled it up ourselves, so I can only say there may be problems.
For the FFN, that is basically it. Our improvements on the MoE side amount to moving from the previous large-expert setup to the current small-expert mode, making it sparser, and then doing some acceleration, which requires further research.
If you want to optimize it further, since the FFN is just matrix multiplication, the optimization can only happen at the CUDA level, with Nvidia doing low-level optimizations of matrix multiplication.
Have you paid attention to the improvements in the attention architecture in the industry?
The improvements on attention are basically linear attention. We are also considering whether to build a stronger linear attention in the future, and to further accelerate linear attention on top of the current version.
There are many ways to improve it: one is to change the decay, and another is to change some of the small tricks inside. You can look forward to our new paper.
Is MiniMax’s current trade-off between context length and inference cost relatively advanced?
Once longer sequences are involved, we have a very obvious compute cost advantage, and the longer the sequence, the more obvious the advantage, whether for inference or training.
For example, at 1M length, the compute consumed by linear attention is 1/2700 that of full attention. Overall, because we still keep 1/8 full attention, our cost is basically 1/8 that of a pure transformer architecture, since the linear attention part barely counts as an expense.
If the computation cost is so low, can the model actually become compute-bound?
Right now it is indeed a memory-access bottleneck. Decoding is bound by memory access, not by computation. Lightning attention is so fast that computation ends up using fewer resources than memory access, mainly because sequence lengths in real applications are not long enough yet.
Making it compute-bound in the future depends on how memory access is optimized, which is something the engineering team will be responsible for.
If linear architecture becomes the mainstream architecture of the next generation, what hardware adaptation improvements would be more suitable for it?
A very tricky thing here is that we have to consider the sequence length. If your sequence lengths are concentrated around 8K or 32K, then attention only accounts for a little over ten percent of the compute, and the rest is mostly the FFN part.
Even if you optimize attention to the extreme, down to zero, you have only removed a little over ten percent of the latency. But as the sequence gets longer, the proportion taken by attention grows larger and larger. That is the case for full attention; for linear attention, the proportion stays unchanged.
Because the FFN is linear and linear attention is also linear, its share stays at about 10%, almost unchanged, even at 1M.
But with full attention, the attention computation can account for 99% and the FFN for only 1%. So linear attention’s advantage really shows up in long texts.
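A rough cost sketch (my own simplified per-layer model with all constants, head counts, and MoE factors ignored, so only the trend is meaningful, not the absolute shares): FFN cost scales as n·d², full attention as n²·d, and linear attention as n·d².

```python
# Ratio of attention cost to FFN cost as the sequence length grows.
d = 4096                                            # hypothetical hidden size
for n in (8_192, 1_048_576):                        # 8K vs 1M tokens
    full_attn_over_ffn = (n**2 * d) / (n * d**2)    # grows linearly with n
    linear_attn_over_ffn = (n * d**2) / (n * d**2)  # stays constant
    print(f"n={n:>9}: full/FFN = {full_attn_over_ffn:6.1f}, linear/FFN = {linear_attn_over_ffn:.1f}")
```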
If the linear architecture becomes mainstream, then the pursuit may shift toward low-power hardware that simply reduces energy consumption. Spiking Neural Network (SNN) chips, for instance, may be a better fit, and some people are actually working on that.
Looking Forward to the Road to AGI
What are your expectations for the model open-source effect?
The first is the publicity effect. Personally, I think that beyond flexing some muscle, the most important thing about open source is seeing how everyone can actually use it. Open-sourcing small models may be what we focus on more in the future.
We also need to think about building infrastructure so that everyone can finetune the models. Open source is a long-term commitment for us, and flagship models should continue to be open-sourced.
Is it possible that a pure, non-hybrid architecture will win out in the future?
Currently, no method does better than the hybrid, especially in terms of speed. Adding a small portion of softmax attention barely hurts speed when the sequence length is not particularly long, especially after the emergence of FlashAttention.
Research on pure architectures is still ongoing, but it is very difficult, and there is no more low-hanging fruit. We have some technical solutions, but the implementation is not simple, and it ultimately depends on how long a sequence length we actually need.
Another question is, is there a strong demand for ultra-long texts? Although models like Claude have reached 200K context, users seem to be very satisfied with the current length. Agent applications may bring demand for ultra-long sequences in the future, but there is no mature benchmark yet.
But I think this is like Nvidia developing high-performance graphics cards for future games: even if they are not needed now, it is technology for the future.
For example, deep research requires the model to read the content of dozens of websites, with processing times on the order of tens of minutes; that may be one application direction for long texts.
What do you think the next big thing after CoT might be?
We have thought about this. Reasoning models are currently popular, and the mainstream this year will still be reasoning. Beyond that, it is hard for us to think of any particularly large changes ahead for pure language models.
I have also talked with other researchers, and their feeling is that everyone will keep driving down model cost, so that inference becomes faster and faster and the price lower and lower, reducing cost while maintaining quality.
Because the ceiling is approaching quickly, the vast majority of the work is now about patching gaps in large models’ capabilities. Even bigger technological breakthroughs may be relatively rare in the short term; we have not seen them yet.
After MiniMax explored linear attention, what might be the next direction to explore?
The next thing may be to explore multimodal architectures, specifically whether we want to build a natively unified large-model architecture for generation and understanding.
With AGI as the end point, which computational complexity, O(n²) or O(n), would be the better answer?
Of course, O(n). From an anthropomorphic perspective, humans must be O(n) complexity. If a person were O(n²), the speed at which I speak to you would get slower and slower.
For the transformer, inference has O(n²) computational complexity, which means the latency of producing the first token and the 100th token is different.
We humans cannot imagine such a thing: a person has never been restarted since birth and keeps producing output all the time, so a human’s per-step computational complexity is constant.
Are humans necessarily the optimal solution for intelligence?
We can only think so at the moment. There are also some people doing the route of bionic intelligence, but we have not paid too much attention to those directions.
With AGI as the end game, which areas of model improvement matter most?
In addition to language modeling, there is the problem of learning methods: how to learn, and in particular how to learn from interaction with the environment, is very important. After all, current multimodal understanding still lacks a great deal of data.
Even machines’ few-shot learning is currently label-based, whereas human learning is unlabeled. So how to unify everything under a self-constructed framework is also an open problem.