Debate (Part Three): Does Sora Truly Understand the Physical World?

Background: Although Sora has not yet been officially released to the public, discussion of its technical details and real-world impact has never ceased. Behind these discussions lies an exploration of fundamental questions in artificial intelligence.

Sora’s generated results are indeed impressive: high resolution, with subjects that remain coherent even across multiple changes of camera angle. Does this level of generation imply that Sora is a world model? If it can generate realistic videos, can it be said to understand the physical world?

(Continued from the previous article.)

Affirmative

The opposing debater argues that grasping the laws of physics requires counterintuitive thinking and assumptions, as well as intervening in the physical world and verifying hypotheses against it.

However, we believe this perspective fundamentally misunderstands what the physical world entails, as it is overly anthropocentric. Regardless of human presence, the world remains a physical entity; one cannot assert that only a physics understood by humans qualifies as the physical world.

Returning to the core of machine learning: one has a model with unknown parameters, defines a loss (or has humans evaluate outputs) on real-world data, and then optimizes. Physicists essentially follow the same paradigm.
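The paradigm described above — a model with unknown parameters, a loss on real-world data, and an optimization step — can be sketched in a few lines. The data, the "law" y = 3x + 1, and all numbers here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "real-world" observations of an unknown law y = 3x + 1 (plus noise).
x = rng.uniform(-1.0, 1.0, size=200)
y_obs = 3.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

# Model with unknown parameters (w, b), a loss on the data, then optimization.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    pred = w * x + b                 # the model's prediction
    err = pred - y_obs
    loss = np.mean(err ** 2)         # squared-error loss on the observed data
    w -= lr * np.mean(2 * err * x)   # gradient descent on each parameter
    b -= lr * np.mean(2 * err)
```

After the loop, w and b recover values close to the 3 and 1 that generated the data — the same shape of process, the affirmative argues, that a physicist follows when fitting a parameterized formula to experimental measurements.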

They propose formulas that defy intuition, introduce parameters, conduct experiments under idealized conditions, intervene in the world to gather data, evaluate how well the formulas fit, and then refine the model through careful, intelligent deliberation.

Today’s neural networks, when widened, are piecewise-linear functions capable of approximating continuous curves. As their depth increases, they can represent still more complex functions, beyond the scope of the formulas known to earlier physicists.
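A minimal sketch of the width claim, using only NumPy: a one-hidden-layer ReLU network is exactly a piecewise-linear function, with one "kink" per hidden unit, and with enough units it can track a smooth curve (here sin, chosen arbitrarily) to small error. The weights below are solved by hand rather than trained, just to make the structure visible:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

f = np.sin  # target continuous curve (arbitrary choice for illustration)

# One hidden ReLU unit per knot: a wider layer means more linear pieces.
knots = np.linspace(0.0, 2 * np.pi, 32)
y = f(knots)

# Slope of the piecewise-linear interpolant on each interval between knots;
# each unit's outgoing weight is the *change* of slope at its knot.
slopes = np.diff(y) / np.diff(knots)
a = np.concatenate([[slopes[0]], np.diff(slopes)])

def one_hidden_layer_net(x):
    # net(x) = f(k0) + sum_i a_i * relu(x - k_i): a 1-hidden-layer ReLU network.
    return y[0] + relu(x[:, None] - knots[:-1][None, :]) @ a

xs = np.linspace(0.0, 2 * np.pi, 1000)
err = np.max(np.abs(one_hidden_layer_net(xs) - f(xs)))
```

With 32 knots the maximum error is well under 0.01; doubling the width roughly quarters it, which is the sense in which widening the network lets the piecewise-linear function approximate the curve arbitrarily well.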

If such a learning process in neural networks cannot be deemed intelligent, can human learning be? Why must intelligence be confined to formulas proposed by humans and experiments conducted by humans?

Moreover, from the perspective of machine learning, if one makes idealized assumptions, runs experiments, and discovers so-called universal laws that are not in fact entirely universal, isn’t that akin to traditional feature engineering?

It’s like finding a particularly useful feature and running experiments to verify that it applies 99% of the time or more — which amounts to a narrower depiction of the physical world.

Currently, there are indeed phenomena in Sora’s outputs that contradict the physical world. However, understanding the physical world and understanding it precisely are not the same thing.

The same applies to humans: someone who can mentally simulate the scene of two pirate ships sailing in a coffee cup, as generated by Sora, cannot necessarily recreate the image precisely.

Negative

Firstly, there is no inherent connection between generating realistic videos and understanding the physical world. By analogy with the human world, architects and painters can depict and even create entities within it, yet that does not signify a true understanding of the physical world.

Primitive humans could build shelters with stones and create cave paintings before comprehending the physical world as we do today.

Back then, the concept of understanding the physical world as we know it may not have existed, yet they could still create corresponding artworks or physical objects. From this perspective, I don’t believe Sora’s current ability to generate realistic videos equates to an understanding of the physical world.

Secondly, human understanding of the physical world follows a strict methodology involving hypothesis, observation, and experimental verification to infer physical phenomena. However, what we currently observe in models like Sora is a data-driven learning paradigm.

Sora learns from the data it is fed, and that data is not necessarily collected under rigorous experimental conditions. If, under these circumstances, it understands the physical world, it does so in a manner beyond our current cognitive scope.

However, we have yet to see any AI truly match human levels of generality or understanding of the world.

Lastly, Sora’s immense capabilities might stem from its lack of understanding of the physical world. Based on its learning paradigm, it can grasp statistical regularities and integrate related entities.

For instance, it can generate surreal scenarios like a turtle with a shell resembling a crystal ball or pirate ships battling in a coffee cup, detached from the constraints of the physical world.

Earlier image generation models, such as Stable Diffusion, could produce scenes like riding horses in space or on Mars, clearly defying the laws of our physical world. Such phenomena do not constitute an understanding of the physical world.

Because it lacks comprehension of the physical world, Sora can construct its own world based on statistical correlations. Therefore, I believe Sora doesn’t understand the physical world.

(To be continued, please see the next post)
