AI Video's Physics Problem

The Rise of Chinese Generative Video Models

If 2022 marked the year generative AI truly captured the public’s imagination, 2025 is shaping up to be the year when a new wave of generative video frameworks from China takes center stage. Tencent’s Hunyuan Video has already made significant waves in the hobbyist AI community. Its open-source release of a full-scale video diffusion model allows users to tailor the technology to their specific needs.

Following closely behind is Alibaba’s Wan 2.1, released more recently. This model stands out as one of the most powerful image-to-video Free and Open Source Software (FOSS) solutions currently available, and it now supports customization through Wan LoRAs.

In addition to these developments, we’re also anticipating the release of Alibaba’s comprehensive VACE video creation and editing suite, alongside the recently released human-centric foundation model SkyReels.

The generative video AI research scene is equally explosive. It’s still early March, yet Tuesday’s submissions to arXiv’s Computer Vision section (a key hub for generative AI papers) totaled nearly 350 entries – a number typically seen during the peak of conference season.

The first two years after Stable Diffusion’s launch in the summer of 2022 (and the subsequent development of DreamBooth and LoRA customization methods) were characterized by a relative lack of major breakthroughs. The last few weeks, however, have witnessed a surge of new releases and innovations, arriving at such a rapid pace that it’s nearly impossible to stay fully informed, let alone cover everything comprehensively.

Solving Temporal Consistency, But New Challenges Emerge

Video diffusion models like Hunyuan and Wan 2.1 have, at long last, addressed the issue of temporal consistency. After years of unsuccessful attempts from hundreds of research initiatives, these models have largely resolved the challenges related to generating consistent humans, environments, and objects over time.

There’s little doubt that VFX studios are actively dedicating staff and resources to adapting these new Chinese video models. Their immediate goal is to tackle pressing challenges such as face-swapping, despite the current absence of ControlNet-style ancillary mechanisms for these systems.

It must be a huge relief that such a significant hurdle has potentially been overcome, even if it wasn’t through the anticipated channels.

However, among the remaining problems, one stands out as particularly significant: All currently available text-to-video and image-to-video systems, including commercial closed-source models, have a tendency to produce physics-defying blunders. An example shows a rock rolling uphill, generated from the prompt: ‘A small rock tumbles down a steep, rocky hillside, displacing soil and small stones’.

Why Do AI Videos Get Physics Wrong?

One theory, recently proposed in an academic collaboration between Alibaba and the UAE, suggests that models might be learning in a way that hinders their understanding of temporal order. Even when training on videos (which are broken down into single-frame sequences for training), models might not inherently grasp the correct sequence of ‘before’ and ‘after’ images.

However, the most plausible explanation is that the models in question have employed data augmentation routines. These routines involve exposing the model to a source training clip both forwards and backwards, effectively doubling the training data.

It’s been known for some time that this shouldn’t be done indiscriminately: while some movements work in reverse, many do not. A 2019 study from the UK’s University of Bristol sought to develop a method for distinguishing equivariant, invariant, and irreversible source video clips within a single dataset, so that unsuitable clips could be filtered out of data augmentation routines.

The authors of that work clearly articulated the problem:

‘We find the realism of reversed videos to be betrayed by reversal artefacts, aspects of the scene that would not be possible in a natural world. Some artefacts are subtle, while others are easy to spot, like a reversed ‘throw’ action where the thrown object spontaneously rises from the floor.

‘We observe two types of reversal artefacts, physical, those exhibiting violations of the laws of nature, and improbable, those depicting a possible but unlikely scenario. These are not exclusive, and many reversed actions suffer both types of artefacts, like when uncrumpling a piece of paper.

‘Examples of physical artefacts include: inverted gravity (e.g. ‘dropping something’), spontaneous impulses on objects (e.g. ‘spinning a pen’), and irreversible state changes (e.g. ‘burning a candle’). An example of an improbable artefact: taking a plate from the cupboard, drying it, and placing it on the drying rack.’

This kind of re-use of data is very common at training time, and it can be beneficial – for example, in making sure that the model does not learn only one view of an image or object that can be flipped or rotated without losing its central coherence and logic.

This only works for objects that are truly symmetrical, of course; and learning physics from a ‘reversed’ video only works if the reversed version makes as much sense as the forward version.
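
To make the concern concrete, here is a minimal sketch, in PyTorch-style Python, of the augmentation step at issue. The reversible_flags input is hypothetical, standing in for the kind of reversibility classifier the Bristol work proposes; an indiscriminate pipeline would simply reverse every clip.

```python
import torch

def augment_with_reversal(clips, reversible_flags):
    """Return the original clips plus time-reversed copies, but only for
    clips whose motion remains plausible when played backwards.

    clips: list of tensors shaped (frames, channels, height, width)
    reversible_flags: list of bools; hypothetical output of a
    reversibility classifier of the sort the Bristol study proposes
    """
    augmented = []
    for clip, is_reversible in zip(clips, reversible_flags):
        augmented.append(clip)
        if is_reversible:
            # Flip along the temporal axis (dim 0) to create the reversed copy.
            augmented.append(torch.flip(clip, dims=[0]))
        # Irreversible actions (burning, dropping, pouring) stay forward-only,
        # rather than being reversed indiscriminately.
    return augmented
```

Dropping the flag check and reversing every clip is exactly the shortcut that could teach a model that rocks sometimes roll uphill.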

We don’t have concrete evidence that systems like Hunyuan Video and Wan 2.1 allowed arbitrary ‘reversed’ clips during training (neither research group has been specific about their data augmentation routines).

However, considering the numerous reports (and practical experience), the only other reasonable explanation is that the hyperscale datasets powering these models might contain clips that genuinely feature movements occurring in reverse.

The rock in the example video mentioned earlier was generated using Wan 2.1. It’s featured in a new study that investigates how well video diffusion models handle physics.

In tests for this project, Wan 2.1 achieved a score of only 22% in its ability to consistently adhere to physical laws.

Surprisingly, that’s the best score among all the systems tested, suggesting that we may have identified the next major hurdle for video AI.

Introducing VideoPhy-2: A New Benchmark for Physical Commonsense

The authors of the new work have developed a benchmarking system, now in its second iteration, called VideoPhy. The code is available on GitHub.

While the scope of the work is too broad to cover comprehensively here, let’s examine its methodology and its potential to establish a metric that could guide future model-training sessions away from these bizarre instances of reversal.

The study, conducted by six researchers from UCLA and Google Research, is titled VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. A comprehensive accompanying project site is also available, along with code and datasets on GitHub, and a dataset viewer on Hugging Face.

The authors describe the latest version, VideoPhy-2, as a ‘challenging commonsense evaluation dataset for real-world actions.’ The collection features 197 actions across a range of diverse physical activities, including hula-hooping, gymnastics, and tennis, as well as object interactions like bending an object until it breaks.

A large language model (LLM) is used to generate 3,940 prompts from these seed actions. These prompts are then used to synthesize videos using the various frameworks being tested.

Throughout the process, the authors have compiled a list of ‘candidate’ physical rules and laws that AI-generated videos should adhere to, using vision-language models for evaluation.

The authors state:

‘For example, in a video of [a] sportsperson playing tennis, a physical rule would be that a tennis ball should follow a parabolic trajectory under gravity. For gold-standard judgments, we ask human annotators to score each video based on overall semantic adherence and physical commonsense, and to mark its compliance with various physical rules.’

Curating Actions and Generating Prompts

Initially, the researchers curated a set of actions to evaluate physical commonsense in AI-generated videos. They started with over 600 actions sourced from the Kinetics, UCF-101, and SSv2 datasets, focusing on activities involving sports, object interactions, and real-world physics.

Two independent groups of STEM-trained student annotators (with a minimum undergraduate qualification) reviewed and filtered the list. They selected actions that tested principles such as gravity, momentum, and elasticity, while removing low-motion tasks like typing, petting a cat, or chewing.

After further refinement with Gemini-2.0-Flash-Exp to eliminate duplicates, the final dataset included 197 actions. 54 involved object interactions, and 143 centered on physical and sports activities.
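
The paper does not reproduce this curation code; purely as an illustration, a greedy LLM-based de-duplication pass might look like the sketch below, where ask_llm is a hypothetical wrapper around a Gemini-2.0-Flash-Exp call.

```python
def dedupe_actions(actions, ask_llm):
    """Keep an action only if the LLM judge says it is not a near-duplicate
    of anything already kept.

    actions: list of short action strings, e.g. 'pole vaulting'
    ask_llm: hypothetical callable that sends a prompt to an LLM and
             returns its text reply
    """
    kept = []
    for action in actions:
        prompt = (
            "Existing actions: " + "; ".join(kept) + "\n"
            f"Candidate action: {action}\n"
            "Answer DUPLICATE if the candidate describes essentially the same "
            "physical activity as an existing action, otherwise answer UNIQUE."
        )
        if not kept or "UNIQUE" in ask_llm(prompt).upper():
            kept.append(action)
    return kept
```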

In the second stage, the researchers used Gemini-2.0-Flash-Exp to generate 20 prompts for each action in the dataset, resulting in a total of 3,940 prompts. The generation process focused on visible physical interactions that could be clearly represented in a generated video. This excluded non-visual elements such as emotions, sensory details, and abstract language, but incorporated diverse characters and objects.

For example, instead of a simple prompt like ‘An archer releases the arrow’, the model was guided to produce a more detailed version such as ‘An archer draws the bowstring back to full tension, then releases the arrow, which flies straight and strikes a bullseye on a paper target’.

Since modern video models can interpret longer descriptions, the researchers further refined the captions using the Mistral-NeMo-12B-Instruct prompt upsampler. This added visual details without altering the original meaning.
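
The two prompt stages might be wired together roughly as follows; gemini_generate and mistral_upsample are hypothetical wrappers around the respective models, and the instructions are simplified paraphrases rather than the paper’s own templates.

```python
def build_prompts(actions, gemini_generate, mistral_upsample, per_action=20):
    """Stage 1: generate short, physics-visible prompts for each action.
    Stage 2: upsample each prompt with extra visual detail while
    preserving its original meaning.
    """
    dataset = []
    for action in actions:
        request = (
            f"Write {per_action} short video prompts for the action '{action}'. "
            "Each must describe visible physical interactions only; no "
            "emotions, sensory detail or abstract language."
        )
        for caption in gemini_generate(request):          # list of strings
            detailed = mistral_upsample(
                "Expand this caption with visual detail, without changing "
                f"its meaning: {caption}"
            )
            dataset.append({"action": action,
                            "caption": caption,
                            "upsampled_caption": detailed})
    return dataset
```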

Deriving Physical Rules and Identifying Challenging Actions

For the third stage, physical rules were derived not from text prompts but from generated videos. This is because generative models can struggle to adhere to conditioned text prompts.

Videos were first created using VideoPhy-2 prompts, then ‘up-captioned’ with Gemini-2.0-Flash-Exp to extract key details. The model proposed three expected physical rules per video. Human annotators reviewed and expanded these by identifying additional potential violations.
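
As a hedged sketch of that stage (the vlm callable below is a hypothetical wrapper; the paper uses Gemini-2.0-Flash-Exp for the up-captioning and rule proposal):

```python
def propose_rules(video_path, vlm):
    """Caption a generated video, then ask the vision-language model for
    three physical rules the depicted action should obey; annotators later
    review these and add further likely violations.
    """
    caption = vlm(video_path,
                  "Describe the key objects and motions in this video.")
    rules = vlm(
        video_path,
        "Based on the video, list exactly three physical rules the scene "
        "should obey, one per line, e.g. 'the ball follows a parabolic "
        "trajectory under gravity'."
    )
    return caption, rules.splitlines()
```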

Next, to identify the most challenging actions, the researchers generated videos using CogVideoX-5B with prompts from the VideoPhy-2 dataset. They then selected 60 out of 197 actions where the model consistently failed to follow both the prompts and basic physical commonsense.

These actions involved physics-rich interactions such as momentum transfer in discus throwing, state changes like bending an object until it breaks, balancing tasks like tightrope walking, and complex motions that included back-flips, pole vaulting, and pizza tossing, among others. In total, 1,200 prompts were chosen to increase the difficulty of the sub-dataset.
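
A simplified sketch of that selection step, assuming per-prompt scores from the probe model have already been collected (the pass threshold and combined score are illustrative, not the paper’s):

```python
def select_hard_actions(scores_by_action, pass_threshold=0.5, top_k=60):
    """Rank actions by the probe model's failure rate (CogVideoX-5B in the
    paper) and keep the `top_k` hardest.

    scores_by_action: dict mapping an action string to a list of combined
    semantic-adherence / physical-commonsense scores in [0, 1]
    """
    def failure_rate(action):
        scores = scores_by_action[action]
        return sum(s < pass_threshold for s in scores) / len(scores)

    hardest_first = sorted(scores_by_action, key=failure_rate, reverse=True)
    return hardest_first[:top_k]
```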

The VideoPhy-2 Dataset: A Comprehensive Evaluation Resource

The resulting dataset comprised 3,940 captions – 5.72 times more than the earlier version of VideoPhy. The original captions average 16 tokens, while upsampled captions reach 138 tokens – 1.88 times and 16.2 times longer than their VideoPhy counterparts, respectively.

The dataset also features 102,000 human annotations covering semantic adherence, physical commonsense, and rule violations across multiple video generation models.

Defining Evaluation Criteria and Human Annotations

The researchers then defined clear criteria for evaluating the videos. The main goal was to assess how well each video matched its input prompt and followed basic physical principles.

Instead of simply ranking videos by preference, they used rating-based feedback to capture specific successes and failures. Human annotators scored videos on a five-point scale, allowing for more detailed judgments. The evaluation also checked whether videos followed various physical rules and laws.
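
For illustration only, a single human judgment could be stored as a record along these lines (the field names are assumptions, not the published schema):

```python
from dataclasses import dataclass, field

@dataclass
class VideoJudgment:
    """One annotator's verdict on one generated video (illustrative schema)."""
    video_id: str
    model_name: str
    caption: str
    semantic_adherence: int    # 1-5: how closely the video matches the prompt
    physical_commonsense: int  # 1-5: overall physical plausibility
    # Per-rule verdicts, e.g. {"ball follows a parabolic arc": "violated"}
    rule_compliance: dict = field(default_factory=dict)
```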

For human evaluation, a group of 12 annotators was selected from trials on Amazon Mechanical Turk (AMT); they provided ratings after receiving detailed remote instructions. For fairness, semantic adherence and physical commonsense were evaluated separately (in the original VideoPhy study, they were assessed jointly).

The annotators first rated how well videos matched their input prompts, then separately evaluated physical plausibility, scoring rule violations and overall realism on a five-point scale. Only the original prompts were shown, to maintain a fair comparison across models.

Automated Evaluation: Towards Scalable Model Assessment

Though human judgment remains the gold standard, it’s expensive and comes with several caveats. Therefore, automated evaluation is essential for faster and more scalable model assessments.

The paper’s authors tested several video-language models, including Gemini-2.0-Flash-Exp and VideoScore, on their ability to score videos for semantic accuracy and for ‘physical commonsense.’

The models again rated each video on a five-point scale. A separate classification task determined whether physical rules were followed, violated, or unclear.

Experiments showed that existing video-language models struggled to match human judgments, mainly due to weak physical reasoning and the complexity of the prompts. To improve automated evaluation, the researchers developed VideoPhy-2-Autoeval, a 7B-parameter model designed to provide more accurate predictions across three categories: semantic adherence, physical commonsense, and rule compliance. It was fine-tuned from the VideoCon-Physics model using 50,000 human annotations*.
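
The actual inference entry points live in the project’s GitHub repository; purely as a sketch of how such a scorer slots into an evaluation loop (the autoeval.predict interface below is an assumption, not the documented API):

```python
def auto_evaluate(videos, autoeval):
    """Score each generated video for semantic adherence, physical
    commonsense (both on a 1-5 scale) and per-rule compliance, using an
    automatic judge in place of human annotators.
    """
    results = []
    for video in videos:
        scores = autoeval.predict(
            video_path=video["path"],
            caption=video["caption"],
            rules=video["candidate_rules"],
        )
        results.append({**video, **scores})
    return results
```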

Testing Generative Video Systems: A Comparative Analysis

With these tools in place, the authors tested a number of generative video systems, both through local installations and, where necessary, via commercial APIs: CogVideoX-5B; VideoCrafter2; HunyuanVideo-13B; Cosmos-Diffusion; Wan2.1-14B; OpenAI Sora; and Luma Ray2.

The models were prompted with upsampled captions where possible, except that Hunyuan Video and VideoCrafter2 operate under 77-token CLIP limitations and cannot accept prompts above a certain length.
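
A quick way to see whether an upsampled caption will be truncated by a 77-token CLIP text encoder is to count tokens with the Hugging Face tokenizer; the checkpoint below is a common choice and serves only to illustrate the check:

```python
from transformers import CLIPTokenizer

# Any CLIP text-encoder checkpoint will do for counting tokens.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def fits_clip_window(prompt, max_tokens=77):
    """True if the prompt (including start/end special tokens) fits the
    77-token CLIP context window without truncation."""
    return len(tokenizer(prompt).input_ids) <= max_tokens
```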

Generated videos were kept to less than 6 seconds, since shorter output is easier to evaluate.

The driving data came from the VideoPhy-2 dataset, which was split into a benchmark set and a training set. 590 videos were generated per model, except for Sora and Ray2, for which fewer videos were generated owing to cost.

The initial evaluation dealt with physical activities/sports (PA) and object interactions (OI) and tested both the general dataset and the aforementioned ‘harder’ subset.

The authors comment:

‘Even the best-performing model, Wan2.1-14B, achieves only 32.6% and 21.9% on the full and hard splits of our dataset, respectively. Its relatively strong performance compared to other models can be attributed to the diversity of its multimodal training data, along with robust motion filtering that preserves high-quality videos across a wide range of actions.

‘Furthermore, we observe that closed models, such as Ray2, perform worse than open models like Wan2.1-14B and CogVideoX-5B. This suggests that closed models are not necessarily superior to open models in capturing physical commonsense.

‘Notably, Cosmos-Diffusion-7B achieves the second-best score on the hard split, even outperforming the much larger HunyuanVideo-13B model. This may be due to the high representation of human actions in its training data, along with synthetically rendered simulations.’

The results showed that video models struggled more with physical activities like sports than with simpler object interactions. This suggests that improving AI-generated videos in this area will require better datasets – particularly high-quality footage of sports such as tennis, discus, baseball, and cricket.

The study also examined whether a model’s physical plausibility correlated with other video quality metrics, such as aesthetics and motion smoothness. The findings revealed no strong correlation, meaning a model cannot improve its performance on VideoPhy-2 just by generating visually appealing or fluid motion – it needs a deeper understanding of physical commonsense.
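
One way to run such a check is a simple rank correlation between per-video physics scores and a conventional quality metric; a minimal sketch (with assumed field names; the paper does not specify its correlation measure) might be:

```python
from scipy.stats import spearmanr

def physics_vs_quality(records):
    """Spearman rank correlation between physical-commonsense scores and a
    generic quality metric such as aesthetics or motion smoothness.
    A coefficient near zero means the two measure largely unrelated things.
    """
    physics = [r["physical_commonsense"] for r in records]
    quality = [r["aesthetic_score"] for r in records]
    rho, p_value = spearmanr(physics, quality)
    return rho, p_value
```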

Qualitative Examples: Highlighting the Challenges

Though the paper provides abundant qualitative examples, few of the static examples provided in the PDF seem to relate to the extensive video-based examples that the authors furnish at the project site. Therefore we will look at a small selection of static examples and then some more of the actual project videos.

Regarding qualitative tests, the authors comment:

‘[We] observe violations of physical commonsense, such as jetskis moving unnaturally in reverse and the deformation of a solid sledgehammer, defying the principles of elasticity. However, even Wan suffers from the lack of physical commonsense, as shown in [the clip mentioned at the start of this article].

‘In this case, we highlight that a rock starts rolling and accelerating uphill, defying the physical law of gravity.’

As mentioned at the outset, the volume of material associated with this project far exceeds what can be covered here. Therefore, please refer to the source paper, project site, and related sites mentioned earlier for a truly exhaustive outline of the authors’ procedures, and considerably more testing examples and procedural details.

* As for the provenance of the annotations, the paper only specifies ‘acquired for these tasks’ – it seems a lot to have been generated by 12 AMT workers.

First published Thursday, March 13, 2025