In March of this year, at NVIDIA’s 2025 Spring GTC conference, Jia Peng, Head of Autonomous Driving Technology R&D at Li Auto, introduced their latest achievement: the MindVLA large model.
This model is a Vision-Language-Action Model (VLA) with 2.2 billion parameters. Jia Peng further stated that they have successfully deployed the model in vehicles. Li Auto believes that VLA models are the most effective method for solving the challenges of AI interacting with the physical world.
Over the past year, end-to-end architecture has become a technological hotspot in intelligent driving, pushing car companies to shift from traditional modular, rule-based designs to integrated systems. Companies that previously led with rule-based algorithms face transitional pains, while latecomers have seized the opportunity to gain a competitive advantage.
Li Auto is a prime example of this.
Li Auto’s progress in intelligent driving last year can be described as rapid. In July, it took the lead in achieving nationwide no-map NOA (Navigation on Autopilot) and launched a unique “end-to-end (fast system) + VLM (slow system)” architecture, which has received widespread attention in the industry.
Tonight, with the second season of Li Auto AI Talk, we have gained a deeper understanding of what Li Xiang refers to as an "artificial intelligence company."
The “Driver Large Model” Is Also Your Driver
Li Xiang, CEO of Li Auto, first mentioned VLA in the AI Talk first season last December, in a conversation with Zhang Xiaojun, the chief technology writer of Tencent News. At that time, he said:
What we are doing with Li Auto Companion and autonomous driving are, by industry convention, actually two separate things, and both are in their early stages. The Mind GPT we are building is a large language model; the autonomous driving work we internally call behavioral intelligence, but as defined by Li Feifei (Stanford tenured professor, former Google chief scientist), it is called spatial intelligence. Only when you really work on them at scale do you realize that the two will certainly be connected one day. We call it VLA (Vision-Language-Action Model) internally.
Li Xiang believes that the base model will inevitably become a VLA at some point. The reason is that a language model can understand the three-dimensional world only through language and cognition, which is clearly not enough. "It needs to be truly vector-based, using diffusion models and generative methods to understand the world."
It can be said that the birth of VLA is not only a bold attempt to deeply integrate language intelligence and spatial intelligence, but also a reinterpretation of the concept of “intelligent car” by Li Auto.
Li Xiang further defined in tonight’s AI Talk: “VLA is a driver large model, working like a human driver.” It is not only a technology, but also an intelligent partner that can communicate naturally with users and make independent decisions.
So, what exactly is VLA? The core idea is straightforward: by integrating visual perception, natural language understanding, and action generation, the vehicle becomes a "driver agent" that can communicate with people and make its own decisions. This represents a significant leap from simple automation to genuine autonomy: a system that not only reacts to its environment but also understands and anticipates the needs of its passengers.
Imagine sitting in your car and casually saying, “I’m a little tired today, drive slower,” and the vehicle will not only understand what you mean, but also adjust its speed and even choose a smoother route. This natural and smooth interaction is exactly what VLA wants to achieve. Li Xiang revealed that all short commands are processed directly by the vehicle, while complex commands are parsed by the cloud-based 3.2 billion parameter model, ensuring both efficiency and intelligence. This hybrid approach allows for quick responses to simple requests while leveraging the power of the cloud for more complex scenarios. The car becomes more than just a vehicle; it becomes a responsive and considerate companion.
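The split described above, short commands handled directly on the vehicle and complex commands parsed by the larger cloud model, can be sketched as a simple router. This is a hypothetical illustration; the function name, threshold, and routing rule are assumptions, not Li Auto's actual implementation.

```python
# Hypothetical sketch of an on-device / cloud command router.
# The token threshold is an illustrative stand-in for whatever
# complexity measure a real system would use.

def route_command(command: str, max_onboard_tokens: int = 8) -> str:
    """Route short commands on-device; send complex ones to the cloud."""
    tokens = command.split()
    if len(tokens) <= max_onboard_tokens:
        return "onboard"   # low-latency path: handled by the in-car model
    return "cloud"         # complex parsing: sent to the larger cloud model

print(route_command("drive slower"))
print(route_command("find a quiet route with a coffee shop on the way and avoid highways"))
```

The design point is latency: simple requests never leave the car, while the cloud model is reserved for requests that need deeper parsing.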
Achieving this goal is not easy. The special thing about VLA is that it connects the three dimensions of vision, language, and action. A simple command from the user may involve real-time perception of the surrounding environment, accurate understanding of the language intent, and rapid adjustment of driving behavior. The three are indispensable. Each of these components presents its own unique challenges. Visual perception requires the system to accurately interpret the world around it, even in challenging conditions like low light or heavy rain. Natural language understanding needs to go beyond simple keyword recognition and grasp the nuances of human language, including intent, emotion, and context. And action generation must translate understanding into safe and effective driving maneuvers.
And the great thing about VLA is that it allows these three to work seamlessly together. The integration of these capabilities creates a synergistic effect, where the whole is greater than the sum of its parts. The VLA doesn’t just see, understand, and act; it combines these abilities to create a truly intelligent and responsive driving experience.
From vision to reality, the R&D of VLA is an uncharted territory. Li Xiang admitted: “The acquisition of visual and action data is the most difficult. No company can replace it.” This highlights the critical importance of real-world data in training and refining AI models for autonomous driving. The more data a system has, the better it can learn to handle the complexities and unpredictable nature of real-world driving scenarios.
To understand the technical background of VLA, we must also look at the evolution of Li Auto’s intelligent driving.
Li Xiang said that the early system was “insect-level” intelligence, with only millions of parameters, driven by rules and high-precision maps, and was helpless when encountering complex road conditions. Later, end-to-end architecture and visual-language models allowed the technology to leap to “mammal-level,” get rid of map dependence, and nationwide no-map NOA became a reality. This evolution illustrates the rapid progress that has been made in the field of autonomous driving. The move from rule-based systems to AI-powered models represents a paradigm shift, enabling cars to handle increasingly complex and unpredictable situations.
In fact, this step has already put Li Auto at the forefront of the industry, but they are obviously not satisfied with this. In Li Xiang’s view, the emergence of VLA marks that Li Auto’s intelligent driving technology has entered a new stage of “human intelligence.”
Compared with the previous system, VLA can not only perceive the 3D physical world, but also perform logical reasoning and even generate driving behaviors close to human level. This represents a significant advancement beyond simply reacting to stimuli. VLA can analyze situations, anticipate potential problems, and make decisions based on a deeper understanding of the environment and the intentions of other drivers.
Take a simple example: suppose you say "find a place to turn around" on a congested street. VLA will not execute the command mechanically; it will weigh road conditions, traffic flow, and traffic rules to find the most reasonable time and place to complete the U-turn. This shows VLA going beyond literal instructions to make intelligent decisions grounded in real-world context: a system that can not only drive a car but also navigate the complexities of human driving behavior.
Li Xiang said that VLA can quickly adapt to new scenarios by generating data: even when it encounters a complex road-repair scene for the first time, it can optimize its response within three days. This flexibility and judgment are VLA's core advantages. Such adaptability is crucial for autonomous driving systems to be truly reliable in the real world; the ability to learn from new experiences and adjust to changing conditions is what separates a genuinely intelligent system from a merely programmed one.
Li Auto’s Teacher Is DeepSeek
Supporting VLA is a complex and sophisticated technical system independently developed by Li Auto. This system allows the car to not only “understand” the world, but also think and act like a human driver.
The first is 3D Gaussian representation, which models a scene with many "Gaussian points," each carrying its own position, color, and size. The approach uses self-supervised learning on massive real-world data to train a powerful 3D spatial understanding model. With it, VLA can read the surrounding world much as a human does, knowing where the obstacles are and where the drivable areas lie. This is a critical component in enabling VLA to perceive its surroundings accurately: by building a detailed 3D model of the environment, the system can identify objects, estimate distances, and plan safe maneuvers.
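The "Gaussian point" idea can be illustrated with a toy: each point carries a position, color, and size, and contributes a Gaussian-shaped density to the scene. Real 3D Gaussian splatting uses anisotropic covariances, opacity, and differentiable rasterization; the isotropic version below is only a sketch under those simplifying assumptions.

```python
# Toy illustration of a scene built from "Gaussian points".
# Each point carries position, color, and size (scale), and contributes
# an isotropic Gaussian density; the real technique is far richer.
import math
from dataclasses import dataclass

@dataclass
class GaussianPoint:
    position: tuple[float, float, float]
    color: tuple[float, float, float]   # RGB in [0, 1], unused in this demo
    scale: float                        # isotropic "size" of the point

    def density(self, query: tuple[float, float, float]) -> float:
        # Unnormalized Gaussian falloff with distance from the point center.
        d2 = sum((q - p) ** 2 for q, p in zip(query, self.position))
        return math.exp(-d2 / (2 * self.scale ** 2))

def scene_density(points, query):
    # The scene is the sum of all points' contributions.
    return sum(p.density(query) for p in points)

pts = [GaussianPoint((0, 0, 0), (1, 0, 0), 1.0),
       GaussianPoint((3, 0, 0), (0, 1, 0), 0.5)]
print(round(scene_density(pts, (0, 0, 0)), 3))  # dominated by the first point
```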
Next is the Mixture of Experts (MoE) architecture, which consists of expert networks, a gating network, and a combiner. Once model parameters reach into the hundreds of billions, the traditional dense approach makes every neuron participate in every computation, wasting resources. In an MoE architecture, the gating network routes each task to different experts, so the number of activated parameters does not grow significantly. This architecture is essential for managing VLA's complexity: by dividing the model into specialized expert networks, the system can efficiently process different kinds of information and make decisions quickly and accurately.
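The gate-expert-combiner flow can be sketched in a few lines: the gate scores experts per input, only the top-k run, and their outputs are blended by the gate weights. Purely illustrative; real MoE layers (such as DeepSeek-V3's) are learned neural networks, not hand-written lambdas.

```python
# Minimal Mixture-of-Experts sketch: score experts, run only the top-k,
# and combine their outputs by softmaxed gate weights, so most of the
# model's parameters stay inactive on any one input.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def moe_forward(x, experts, gate_scores, top_k=2):
    # Keep only the top-k experts; the rest are never evaluated.
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    # Combiner: weighted sum of the chosen experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x]
out = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, 1.0], top_k=2)
print(round(out, 3))  # blend of the two highest-scoring experts
```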
Talking about this, Li Xiang also praised DeepSeek:
DeepSeek uses the best practices of mankind… When they were doing DeepSeek V3, V3 was also an MoE, a 671B model. I think MoE is a very good architecture: it is equivalent to combining a group of experts, each contributing its own expertise.
Finally, Li Auto introduced sparse attention into VLA. In layman's terms, VLA automatically concentrates its attention weights on key areas of the scene, improving on-device inference efficiency: the model spends its processing power on the most relevant parts of the input, gaining speed without sacrificing what matters.
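One common form of the sparse-attention idea is top-k masking: keep only the positions with the largest attention scores and zero out the rest before normalizing. The sketch below shows that mechanism in isolation; it is a generic illustration, not Li Auto's specific design.

```python
# Top-k sparse attention in miniature: mask all but the `keep` largest
# scores, then softmax only the survivors, so most positions contribute
# exactly zero weight and can be skipped at inference time.
import math

def sparse_attention_weights(scores, keep=2):
    threshold = sorted(scores, reverse=True)[keep - 1]
    masked = [s if s >= threshold else float("-inf") for s in scores]
    m = max(masked)
    e = [math.exp(s - m) for s in masked]   # exp(-inf) == 0.0
    total = sum(e)
    return [x / total for x in e]

w = sparse_attention_weights([0.1, 3.0, 0.2, 2.0], keep=2)
print([round(x, 3) for x in w])  # only two positions get non-zero weight
```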
Li Xiang said that in the training process of this new base model, Li Auto’s engineers spent a lot of time finding the best data ratio, integrating a large amount of 3D data and text and image data related to autonomous driving, and reducing the proportion of literary and historical data. The careful curation of training data is crucial for the performance of AI models. By focusing on data that is relevant to autonomous driving, Li Auto can ensure that the VLA learns to perform its tasks effectively.
From perception to decision-making, VLA draws on the fast and slow combination mode of human thinking. It can quickly output simple action decisions, such as emergency avoidance, and can also use short thinking chains to “think slowly” to deal with more complex scenarios, such as temporarily planning a route to bypass the construction area. This mimics the way humans handle different driving situations. Simple tasks are handled quickly and automatically, while more complex situations require more thought and planning.
To further improve real-time performance, VLA also introduced speculative decoding and parallel decoding, making full use of the vehicle-side chip's compute so that decision-making stays fast yet orderly. This is essential for responding quickly and safely to changing conditions on the road.
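The speculative idea can be sketched with toy models: a cheap draft model proposes several tokens ahead, the large model verifies them, and the longest agreeing prefix is accepted in one step. The deterministic integer "models" below are stand-ins for real networks, and this greedy variant is a simplification of production speculative decoding.

```python
# Greedy speculative decoding in miniature: the draft model proposes k
# tokens cheaply; the target model checks them and accepts the longest
# agreeing prefix, falling back to its own token at the first mismatch.

def speculative_step(prefix, draft_model, target_model, k=4):
    # Draft phase: propose k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify phase: the target model confirms or corrects each proposal.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_model(ctx))  # use the target's token instead
            break
    return accepted

# Toy deterministic "models" over integer tokens: they agree on short
# contexts and diverge once the sequence grows past length 5.
draft = lambda ctx: (ctx[-1] + 1) % 10
target = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) < 6 else 0

print(speculative_step([1, 2, 3], draft, target, k=4))
```

The payoff is that when draft and target agree, several tokens are committed per expensive verification pass instead of one.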
When generating driving behavior, VLA uses Diffusion models and Reinforcement Learning from Human Feedback (RLHF). The Diffusion model is responsible for generating optimized driving trajectories, while RLHF makes these trajectories closer to human habits, both safe and comfortable. For example, VLA will automatically slow down when turning, or leave enough safe distance when merging lanes. These details reflect the deep learning of human driving behavior. The use of RLHF is particularly important for ensuring that the VLA’s driving behavior is not only safe but also natural and comfortable for passengers.
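The refine-from-noise intuition behind diffusion-based trajectory generation can be caricatured with a fixed smoothing step: start from a noisy candidate path and iteratively pull it toward a smooth one. Real diffusion planners learn the denoiser from data and condition it on the scene; everything below, including the comfort metric, is an illustrative assumption.

```python
# Toy "denoising" loop for a 1-D trajectory: repeated neighbor averaging
# removes high-frequency noise, loosely mirroring how a diffusion model
# refines noise into a smooth driving trajectory. The jerk metric is a
# stand-in for a comfort objective.
import random

def denoise_step(traj):
    # Replace each interior point with the average of its neighbors.
    out = list(traj)
    for i in range(1, len(traj) - 1):
        out[i] = 0.5 * (traj[i - 1] + traj[i + 1])
    return out

random.seed(0)
target = [float(i) for i in range(8)]             # ideal straight path
noisy = [y + random.uniform(-1, 1) for y in target]
traj = noisy
for _ in range(20):                                # iterative refinement
    traj = denoise_step(traj)

# "Jerk": sum of absolute second differences, a crude comfort proxy.
jerk = lambda t: sum(abs(t[i + 2] - 2 * t[i + 1] + t[i]) for i in range(len(t) - 2))
print(jerk(noisy) > jerk(traj))  # True: the refined path is smoother
```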
The world model is another key technology. Li Auto provides a high-quality virtual environment for reinforcement learning through scene reconstruction and generation. Li Xiang revealed that the world model has reduced the verification cost from 170,000-180,000 yuan per 10,000 kilometers to 4,000 yuan. It allows VLA to continuously optimize in simulation and deal with complex scenarios with ease. This virtual environment allows the VLA to be trained and tested in a safe and cost-effective manner, without the need for extensive real-world driving.
Speaking of training, VLA’s growth process is also quite organized. The entire process is divided into three stages: pre-training, post-training, and reinforcement learning. “Pre-training is like learning knowledge, post-training is like learning to drive in a driving school, and reinforcement learning is like social practice,” said Li Xiang. This analogy helps to understand the different stages of the VLA’s training process.
In the pre-training stage, Li Auto created a visual-language base model for VLA, stuffing it with rich 3D visual data, 2D high-definition images, and driving-related corpora, allowing it to first learn to “see” and “hear”; after training, the action module is added, generating 4-8 second driving trajectories, and the model expands from 3.2 billion parameters to 4 billion.
Reinforcement learning is divided into two steps: first, use RLHF to align human habits, analyze takeover data, and ensure safety and comfort; then, use pure reinforcement learning to optimize, based on G-value (comfort), collision, and traffic rules feedback, so that VLA “drives better than humans.” Li Xiang mentioned that this stage is completed in the world model, simulating real traffic scenarios, and the efficiency is far better than traditional verification.
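The reward signal described above, combining comfort (G-value), collisions, and traffic-rule compliance, might be shaped roughly as follows. The function, weights, and thresholds are invented for illustration; Li Auto's actual reward design is not public.

```python
# Hypothetical reward shaping in the spirit of the description above:
# comfort, collision, and rule compliance collapse into one scalar that
# a reinforcement-learning agent maximizes. All numbers are made up.

def driving_reward(max_accel_g: float, collided: bool, rule_violations: int) -> float:
    reward = 1.0                                    # base reward for progress
    reward -= 0.5 * max(0.0, max_accel_g - 0.3)     # penalize uncomfortable g-forces
    if collided:
        reward -= 10.0                              # collisions dominate everything
    reward -= 1.0 * rule_violations                 # each violation costs
    return reward

print(driving_reward(0.2, False, 0))   # smooth, safe, legal
print(driving_reward(0.8, False, 1))   # harsh braking plus one violation
```

The asymmetry is deliberate: a collision penalty large enough to swamp all other terms encodes that no amount of comfort or progress offsets a crash.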
This training method not only guarantees the technical advancement, but also makes VLA reliable enough in practical applications.
Li Xiang admitted that the success of VLA is inseparable from the inspiration of industry benchmarks. DeepSeek’s MoE architecture not only improved training efficiency, but also provided valuable experience for Li Auto. He lamented: “We are standing on the shoulders of giants and accelerating the R&D of VLA.” This open learning attitude allows Li Auto to go further in the no-man’s land. This willingness to learn from others is a key factor in Li Auto’s success. By embracing the best practices in the industry, they have been able to accelerate their own development and push the boundaries of what is possible.
From “Information Tools” to “Production Tools”
At present, the AI industry is undergoing a profound transformation from “information tools” to “production tools.” With the maturity of large model technology, AI is no longer limited to processing data and providing suggestions, but begins to have the ability to make independent decisions and perform tasks. This shift is transforming the way AI is used in a wide range of industries, from healthcare to finance to transportation.
Li Xiang proposed in the second season of AI Talk that AI can be divided into information tools (such as search), auxiliary tools (such as voice navigation), and production tools. He emphasized: "Artificial intelligence becoming a production tool is the moment of its true outbreak." This represents a fundamental shift in the role of AI, from assisting humans to performing tasks autonomously.
This trend is particularly evident in the concept of “embodied intelligence” - AI systems are given physical entities, capable of sensing, understanding, and interacting with the environment.
Li Auto’s VLA model is a vivid practice of this trend. By integrating vision, language, and action intelligence, it transforms the car into an intelligent agent that can drive autonomously and interact naturally with users, perfectly interpreting the core concept of “embodied intelligence.” The car becomes more than just a machine; it becomes an intelligent entity that can understand and respond to its environment and the needs of its passengers.
As long as humans are willing to hire professional drivers, an AI that drives can become a production tool. And when AI becomes a production tool, artificial intelligence will truly explode.
Li Xiang’s remarks clarified the core value of VLA - it is no longer a simple auxiliary tool, but a “driver agent” that can independently perform tasks and assume responsibilities. This transformation not only improves the practical value of cars, but also opens up imagination space for the application of AI in other fields. The potential applications of embodied intelligence are vast and far-reaching.
Li Xiang's thinking on AI consistently steps outside conventional frames. He also noted: "VLA is not a process of sudden change, but a process of evolution." This sentence accurately summarizes Li Auto's technical path -
From early rule-driven, to end-to-end breakthroughs, to today’s VLA’s “human intelligence” level. This evolutionary thinking not only makes VLA more feasible in technology, but also provides a reference paradigm for the industry. Compared with some attempts that blindly pursue subversion, Li Auto’s pragmatic path may be more suitable for the complex Chinese market. This gradual approach allows for continuous learning and improvement, minimizing the risks associated with radical changes.
From technology to belief, Li Auto’s AI exploration is not smooth. Li Xiang admitted: “We have experienced many challenges in the AI field, like the darkness before dawn, but we believe that if we persevere, we will see the light.” The R&D of VLA faces problems such as computing power bottlenecks and data ethics, but Li Auto has gradually ushered in their technological dawn through self-developed base models and world models. These challenges are inherent in the development of cutting-edge technology. Overcoming them requires perseverance, innovation, and a commitment to ethical considerations.
Li Xiang also mentioned in the interview that the success of VLA is inseparable from the rise of Chinese AI.
He said that the emergence of models such as DeepSeek and Tongyi Qianwen has made China’s AI level rapidly approach the United States. Among them, the open source spirit upheld by DeepSeek is particularly encouraging, which directly prompted Li Auto to open source Xinghuan OS. Li Xiang said: “This is not out of company strategic considerations. DeepSeek has given us so much help, we should contribute something to society.” This collaborative spirit is essential for fostering innovation and progress in the AI field.
While pursuing technological breakthroughs, Li Auto has not ignored the safety and ethical issues of AI technology. The "super alignment" technique introduced with VLA uses Reinforcement Learning from Human Feedback (RLHF) to bring the model's behavior closer to human habits. Data shows that VLA has raised highway MPI (average mileage between human interventions) from 240 km to 300 km. This demonstrates Li Auto's commitment to developing safe and reliable autonomous driving technology.
More importantly, Li Auto emphasizes building “AI with human values” and regards morality and trust as the cornerstone of technological development. This commitment to ethical considerations is crucial for building public trust in AI and ensuring that it is used for the benefit of society.
From a more macro perspective, the significance of VLA lies in that it redefines the role of car companies.
In the past, cars were industrial-age means of transportation; today, they are evolving into “spatial robots” in the artificial intelligence era. Li Xiang mentioned in AI Talk: “Li Auto used to walk in the no-man’s land of cars, and will walk in the no-man’s land of artificial intelligence in the future.” This transformation of Li Auto brings new imagination space to the business model of the automotive industry. The car is no longer just a means of transportation; it is a platform for delivering a wide range of services and experiences.
Of course, the development of VLA is not without challenges. The continuous investment of computing power, data ethics, and the establishment of consumer trust in autonomous driving are all issues that Li Auto needs to face. These challenges are complex and require a multi-faceted approach.
In addition, competition in the AI industry is growing increasingly fierce. Giants at home and abroad, such as Tesla, Waymo, and OpenAI, are accelerating their push into multimodal models, and Li Auto must maintain its lead in both technology iteration and market rollout. "We have no shortcuts; we can only cultivate deeply," said Li Xiang. This commitment to continuous innovation is essential for staying ahead in a rapidly evolving landscape.
Undoubtedly, the landing of VLA will be a key node.
Li Auto plans to release VLA simultaneously with the pure electric SUV Li Auto i8 in July 2025, and achieve mass production in 2026. This is not only a comprehensive test of technology, but also an important touchstone for the market. This launch will be a critical test of the VLA’s capabilities and its acceptance by consumers.