Google's New AI Powers Dexterous Robots

The Quest for Embodied AI: A Moonshot Goal

The pursuit of “embodied AI” has been a long-standing ambition within the robotics community. The core concept revolves around creating artificial intelligence that can not only process information but also directly control a physical robot to interact with the world. This interaction needs to be autonomous, adaptable to novel and unpredictable situations, and, crucially, executed with both safety and precision. It’s a challenge that goes far beyond simply programming a robot to perform a specific, repetitive task. Embodied AI aims for a level of general intelligence and physical competence that allows a robot to function effectively in a wide range of dynamic environments.

Companies like Nvidia are actively pursuing this “holy grail” of robotics, recognizing its transformative potential. If successful, embodied AI would enable the creation of truly versatile robotic laborers, capable of performing a vast array of tasks in real-world settings, from assisting in homes to working in factories and even exploring hazardous environments. The difficulty lies in bridging the gap between understanding a command or situation (the AI component) and translating that understanding into precise, coordinated physical actions (the robotics component).

Gemini Robotics: Building on a Foundation of Language and Vision

Google’s approach, embodied in the Gemini Robotics model, leverages the power of large language models (LLMs), specifically building upon the Gemini 2.0 architecture. LLMs have demonstrated remarkable capabilities in understanding and generating human language, but Gemini Robotics extends this foundation to encompass the unique demands of controlling a physical robot. The key innovation is what Google terms “vision-language-action” (VLA) abilities.

This VLA framework allows the model to process multiple inputs simultaneously and integrate them into a coherent understanding that drives action. Visual input, typically from cameras mounted on the robot, provides a stream of information about the robot’s surroundings. Natural language commands, spoken or written, specify the desired task or goal. Gemini Robotics then synthesizes these inputs – the visual scene and the linguistic instruction – and translates them into a sequence of precise physical movements for the robot’s actuators (motors, joints, etc.).

For example, if a user says, “Move the red block to the blue box,” the robot’s cameras would capture the scene, identifying the red block and the blue box. The language model component would process the command, understanding the objects, the action (“move”), and the spatial relationship (“to”). The VLA system then combines these understandings to generate the necessary motor commands to move the robot’s arm and gripper to pick up the red block and place it in the blue box.
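
To make that flow concrete, the sketch below shows one way such a pipeline could be wired together. The interfaces here (a `vla_policy` function, a stub `Robot` class, and the specific joint-angle values) are hypothetical placeholders for illustration, not Google's actual model or API.

```python
# A minimal, hypothetical sketch of a vision-language-action (VLA) control loop.
# Class and function names are illustrative stand-ins, not Google's API.

from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """One low-level command for the robot's actuators."""
    joint_targets: List[float]  # target angle for each arm joint, in radians
    gripper_closed: bool        # whether the gripper should be closed


def vla_policy(image: bytes, instruction: str) -> List[Action]:
    """Stand-in for a VLA model: maps an image and an instruction to actions.

    A real model would ground "red block" and "blue box" in the image and
    emit a full trajectory; here we return a fixed two-step placeholder.
    """
    return [
        Action(joint_targets=[0.1, -0.4, 0.8, 0.0, 0.3, 0.0], gripper_closed=True),   # grasp
        Action(joint_targets=[0.5, -0.2, 0.6, 0.0, 0.3, 0.0], gripper_closed=False),  # release
    ]


class Robot:
    """Stub robot interface that just logs the commands it receives."""

    def execute(self, action: Action) -> None:
        print(f"joints -> {action.joint_targets}, gripper closed: {action.gripper_closed}")


def run_command(robot: Robot, image: bytes, instruction: str) -> None:
    """Observe, query the policy, and execute the returned actions in order."""
    for action in vla_policy(image, instruction):
        robot.execute(action)


run_command(Robot(), image=b"<camera frame>", instruction="Move the red block to the blue box")
```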

Gemini Robotics-ER: Enhanced Embodied Reasoning

While Gemini Robotics focuses on the direct translation of vision and language into action, Gemini Robotics-ER takes a slightly different approach, emphasizing “embodied reasoning.” This model boasts enhanced spatial understanding, allowing it to better comprehend the three-dimensional relationships between objects and the robot itself. This improved spatial awareness is crucial for navigating complex environments and interacting with objects in a more nuanced way.

Beyond this enhanced reasoning, Gemini Robotics-ER is designed to integrate seamlessly with existing robot control systems. Many robots already have sophisticated low-level control systems that handle tasks like maintaining balance, controlling joint angles, and preventing collisions. Gemini Robotics-ER is designed to work in conjunction with these systems, providing high-level reasoning and planning while leveraging the existing control infrastructure for precise execution. This allows for a more modular and adaptable approach, making it easier to integrate the AI with a variety of different robot platforms.
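
That division of labor might look something like the following sketch, in which a reasoning layer decides where the arm should go and hands those targets to the robot's existing low-level controller. All names, poses, and interfaces here are illustrative assumptions, not a description of Gemini Robotics-ER's actual interface.

```python
# Hypothetical split between a high-level embodied-reasoning layer and an
# existing low-level controller; interfaces are illustrative only.

from typing import List, Tuple

Pose = Tuple[float, float, float]  # (x, y, z) end-effector position in metres


def plan_grasp(scene_description: str, target_object: str) -> List[Pose]:
    """Stand-in for the reasoning layer: decide *where* the arm should go.

    A real embodied-reasoning model would localize the object in 3D from
    camera input; here we return a fixed approach-then-grasp sequence.
    """
    return [
        (0.40, 0.10, 0.30),  # pre-grasp pose above the object
        (0.40, 0.10, 0.12),  # grasp pose at the object
    ]


class LowLevelController:
    """Stand-in for the robot's existing controller (IK, balance, limits)."""

    def move_to(self, pose: Pose) -> None:
        # A real controller would run inverse kinematics and enforce joint
        # and torque limits; here we just log the commanded pose.
        print(f"moving end-effector to {pose}")


def pick(controller: LowLevelController, target_object: str) -> None:
    """Execute the high-level plan waypoint by waypoint via the controller."""
    for pose in plan_grasp("tabletop scene", target_object):
        controller.move_to(pose)


pick(LowLevelController(), "red block")
```

The point of the split is that the AI layer can be swapped onto a different robot body while the platform-specific control code underneath stays unchanged.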

From Understanding to Action: A New Era of Dexterity

The practical implications of these advancements – both Gemini Robotics and Gemini Robotics-ER – are significant, moving beyond simply understanding commands to executing complex physical manipulations. Previous models, like Google’s RT-2, made strides in enabling robots to understand language and adapt to new situations. However, RT-2 was primarily limited to repurposing pre-practiced physical movements. It could understand a new command, but its ability to execute novel physical actions was constrained.

Gemini Robotics, in contrast, demonstrates a marked improvement in dexterity. This newfound dexterity unlocks a range of tasks that were previously unattainable for robots, including delicate manipulations like origami folding and precisely packing items into containers. These tasks require not only an understanding of the goal but also the ability to execute fine motor control and adapt to subtle variations in the environment (e.g., the slight differences in how a piece of paper folds each time).

The ability to perform origami, for instance, highlights the model’s capacity for intricate, multi-step manipulation. Each fold requires precise positioning and pressure, and the robot must adapt to the changing shape of the paper as it progresses. Similarly, packing snacks into Ziploc bags requires careful handling to avoid crushing the contents and precise manipulation to seal the bag. These examples showcase a level of dexterity that represents a substantial leap forward in robotic capabilities.

Generalization: The Key to Real-World Adaptability

A crucial aspect of both Gemini Robotics and Gemini Robotics-ER is their improved generalization – the ability to perform tasks for which they were not explicitly trained. This is a fundamental requirement for robots to operate effectively in the real world, which is inherently unpredictable and diverse. A robot trained only on a specific set of tasks in a controlled environment will likely fail when faced with even slightly different conditions.

Generalization allows a robot to apply its learned knowledge and skills to new situations, adapting to variations in object appearance, environment layout, and even unexpected obstacles. DeepMind emphasizes that Gemini Robotics “more than doubles performance on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models.” This indicates a significant improvement in the robot’s ability to handle novel tasks and scenarios.

The ability to generalize is what separates a specialized, task-specific robot from a truly versatile and adaptable machine. It’s the key to creating robots that can be deployed in a wide range of environments and perform a variety of tasks without requiring extensive retraining for each new situation. This adaptability is essential for realizing the vision of robots as helpful assistants in homes, workplaces, and other dynamic settings.

A Generalist Robot Brain: Google’s Ambitious Vision

Google’s overarching goal is clearly to develop a “generalist robot brain” – a versatile AI capable of controlling a wide range of robotic platforms. This contrasts with the traditional approach of developing specialized AI for specific robots and tasks. A generalist brain would be able to adapt to different robot bodies and perform a variety of tasks without requiring significant modifications to the underlying AI.

In line with this vision, Google has announced a partnership with Apptronik, a leading robotics company specializing in humanoid robots. The collaboration aims to “build the next generation of humanoid robots with Gemini 2.0,” suggesting that Google’s LLM technology will be integrated into Apptronik’s hardware to create more intelligent and capable humanoid robots.

While the Gemini Robotics models were primarily trained on a bimanual robot platform (a robot with two arms) known as ALOHA 2, Google states that the system is versatile enough to control diverse robot types. This includes research-oriented robotic arms, like the Franka Emika Panda, and more sophisticated humanoid systems, such as Apptronik’s Apollo robot. This adaptability underscores the potential of Gemini Robotics to become a universal “brain” for a wide array of robotic applications, from industrial automation to healthcare and even space exploration.

The Humanoid Robotics Landscape: Hardware and Software Converge

The pursuit of humanoid robotics is a collaborative effort, with numerous companies contributing to different aspects of the challenge. Companies like Figure AI and Boston Dynamics (formerly an Alphabet subsidiary) have been focused on developing advanced humanoid robotics hardware, creating robots with increasingly sophisticated physical capabilities, including bipedal locomotion, balance, and manipulation.

However, a truly effective AI “driver” – the software component that imbues these robots with intelligence and autonomy – has remained a critical missing piece. This is where Google’s efforts with Gemini Robotics and Gemini Robotics-ER come into play. By developing advanced AI models specifically designed for robotic control, Google is aiming to provide the “brain” that can unlock the full potential of these sophisticated hardware platforms.

The collaboration between hardware and software developers is essential for accelerating progress in humanoid robotics. Google’s decision to grant leading robotics companies, including Boston Dynamics, Agility Robotics, and Enchanted Tools, limited access to Gemini Robotics-ER through a “trusted tester” program reflects this collaborative approach. By allowing these companies to experiment with the AI model and integrate it with their own hardware, Google is fostering a broader ecosystem of innovation and accelerating the development and deployment of truly capable humanoid robots.

Safety First: A Layered Approach to Responsible Robotics

Recognizing the paramount importance of safety in robotics, especially with increasingly capable and autonomous systems, Google emphasizes a “layered, holistic approach” to safety. This approach incorporates traditional robot safety measures, such as collision avoidance and force limitations, which are designed to prevent robots from causing harm to humans or their surroundings.

Collision avoidance systems use sensors (e.g., cameras, lidar) to detect obstacles and prevent the robot from colliding with them. Force limitations restrict the amount of force a robot can exert, preventing it from accidentally crushing or injuring someone. These measures are fundamental to ensuring that robots operate within safe parameters.
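
In practice, such safeguards often reduce to simple, conservative checks layered beneath the AI. The sketch below illustrates the idea with a hypothetical force clamp and clearance gate; the thresholds are made-up example values, not figures from Google's system.

```python
# Illustrative safety layer: clamp commanded force and halt on low clearance.
# Thresholds and interfaces are hypothetical examples, not vendor values.

MAX_FORCE_N = 20.0      # maximum force the gripper may exert, in newtons
MIN_CLEARANCE_M = 0.05  # block motion if an obstacle is closer than 5 cm


def clamp_force(commanded_force: float) -> float:
    """Limit the force command so the robot cannot exceed the safe bound."""
    return min(commanded_force, MAX_FORCE_N)


def safe_to_move(nearest_obstacle_distance: float) -> bool:
    """Collision-avoidance gate: only move if the sensed clearance is sufficient."""
    return nearest_obstacle_distance >= MIN_CLEARANCE_M


# Example: a 35 N grasp request is reduced to 20 N, and motion is blocked
# when a depth sensor reports an obstacle 3 cm away.
print(clamp_force(35.0))   # -> 20.0
print(safe_to_move(0.03))  # -> False
```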

In addition to these traditional measures, Google is developing a “Robot Constitution” framework, inspired by Isaac Asimov’s Three Laws of Robotics. This framework provides a set of guiding principles for the ethical and safe development and deployment of robots. While Asimov’s laws are fictional, they provide a useful starting point for thinking about the ethical implications of increasingly intelligent and autonomous robots.

Google’s Robot Constitution framework is intended to go beyond simply preventing physical harm and address broader ethical considerations, such as ensuring fairness, accountability, and transparency in the design and use of robots. This framework is a work in progress, and Google is actively engaging with researchers and ethicists to refine and develop it further.

The ASIMOV Dataset: Standardizing Safety Assessment

To support the development of safer robots and facilitate research in this area, Google has released a dataset called “ASIMOV.” This dataset is designed to help researchers evaluate the safety implications of robotic actions, extending beyond the prevention of direct physical harm. It aims to provide a standardized way to assess how well AI models understand the potential consequences of a robot’s actions in various scenarios.

The ASIMOV dataset includes a variety of scenarios, each with a description of the robot’s actions and potential outcomes. Researchers can use this dataset to train and evaluate AI models, assessing their ability to predict the safety implications of different actions. For example, a scenario might involve a robot handing a sharp object to a human. The dataset would include information about the object’s sharpness, the way the robot is holding it, and the potential risks involved.
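
Google has not published the dataset’s schema in this announcement, so the following is only a rough illustration of what such a scenario record, and a simple evaluation over it, might look like; the field names and toy model are assumptions.

```python
# Hypothetical illustration of the kind of record a safety benchmark like
# ASIMOV might contain and how a model could be scored against it.
# Field names, labels, and the toy predictor are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class SafetyScenario:
    description: str  # what the robot is about to do
    is_safe: bool     # ground-truth label assigned by human annotators


scenarios = [
    SafetyScenario("Hand a pair of scissors to a person, blades first.", False),
    SafetyScenario("Place a cup of water on a stable table surface.", True),
]


def model_predicts_safe(description: str) -> bool:
    """Stand-in for an AI model judging whether the described action is safe."""
    return "blades first" not in description  # toy heuristic, not a real model


correct = sum(model_predicts_safe(s.description) == s.is_safe for s in scenarios)
print(f"accuracy: {correct / len(scenarios):.0%}")  # -> accuracy: 100%
```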

According to Google’s announcement, the dataset will “help researchers to rigorously measure the safety implications of robotic actions in real-world scenarios.” This initiative underscores Google’s commitment to responsible innovation in the field of robotics and its recognition of the need for standardized methods to assess and mitigate potential risks.

The Future of Robotics: A Glimpse into the Possibilities

While Google has not yet announced specific timelines or commercial applications for the new AI models, which currently remain in a research phase, the advancements demonstrated are undeniably significant. The demo videos released by Google showcase remarkable progress in AI-driven robotic capabilities, including tasks that were previously considered extremely challenging, such as origami folding and delicate object manipulation.

However, it’s important to acknowledge that these demonstrations have been conducted in controlled research environments. The true test of these systems will lie in their ability to perform reliably and safely in the unpredictable and dynamic settings of the real world. Factors like variations in lighting, object appearance, and unexpected obstacles can all pose challenges for robots.

Despite these challenges, the development of Gemini Robotics and Gemini Robotics-ER represents a pivotal moment in the evolution of robotics. These models have the potential to unlock a new era of dexterity, adaptability, and autonomy, paving the way for robots to seamlessly integrate into our lives and contribute to a wide range of tasks. As research progresses and these technologies mature, we can anticipate a future where robots play an increasingly prominent role in our homes, workplaces, and communities.

The journey towards truly embodied AI is ongoing, but Google’s latest advancements offer a compelling glimpse into the exciting possibilities that lie ahead. The fusion of sophisticated hardware and increasingly intelligent software is poised to transform the robotics landscape, bringing us closer to a future where robots are not just tools, but versatile partners in our daily lives. The potential benefits are vast, ranging from increased productivity and efficiency to improved safety and quality of life. However, it’s crucial to continue to prioritize safety and ethical considerations as we develop and deploy these powerful technologies.