Background: Although Sora has not yet been officially released to the public, discussion of its technical details and real-world impact has never ceased. Behind these discussions lies an exploration of fundamental questions in artificial intelligence.
Sora’s generated results are indeed impressive: the videos are high-resolution, and subjects stay consistent across multiple changes of camera angle. Does this level of generation imply that Sora is a world model? Given that it can generate realistic videos, can Sora be said to understand the physical world?
Affirmative
We believe Sora understands the physical world. In the Sora videos released so far, we can observe that however the camera rotates, the continuity of time, the consistency of the scene after a change of viewing angle, and the reflection and variation of light all conform to the laws of the physical world. From this perspective, if these are not physical laws, then what are they?
The second point we emphasize is that Sora understands physical laws, not necessarily physics laws. By basic physical laws we mean what most people experience directly in everyday life, such as free fall: anyone can watch a ball drop from a height to the ground. The vast majority of videos generated by Sora show motion that conforms to these everyday physical laws.
Physics laws, on the other hand, are the rigorous formulas or rules that physicists derive through experiment or theory.
The topic of today’s debate is whether Sora understands the physical world. Here, “physical world” does not mean the world as physicists formalize it, but the everyday physical world that the vast majority of people perceive and understand.
The third point concerns what it means to “understand” or “learn.” Some people conclude that Sora does not understand the physical world because it does not grasp physical formulas or the rigorous methods of physics. But is that the kind of understanding we should require of an AI?
Here, it is worth revisiting the Turing test. In the Turing test, an evaluator puts questions to a human and to a machine while separated from both. If most evaluators cannot tell the two apart from their responses, the AI system is considered to possess intelligence. From this perspective, generation equals intelligence, generation equals intelligence, generation equals intelligence (laughter).
As long as people judge what Sora generates to be real by common sense, and cannot tell whether it was produced by a human or by an AI, we believe it has learned and understands.
Negative
I regret that my colleague from the affirmative side has been deceived by Sora’s appearance (laughter), and I also regret my colleague’s misunderstanding of what it means to understand physical laws.
First, let’s correct the basic definition of the physical world. The physical world is the world governed by natural and physical laws, such as conservation laws and symmetries. It includes all observable matter and the basic phenomena of motion; it is, in effect, the objectively existing universe. If Sora understood the physical world, the videos it generates would have to obey the relevant laws, and Sora would have to be able to simulate and accurately depict them, which is clearly not the case today.
Second, Sora works by using a Diffusion Transformer to compress video and language data and learn their distribution. Video and language alone, however, are clearly insufficient to describe our objective three-dimensional world; as a representation they are strongly limited.
The evolution of many media, such as fluids, can only be described with dedicated state quantities, so training a model solely on finite-dimensional video and language data is not enough. Even if the generated content looks realistic, being realistic is conceptually quite different from being “true.”
Therefore, it is necessary to distinguish realism from reality. The videos generated by Sora are indeed very realistic, but that realism stays on the surface of the video and lacks substance. Traditional rendering techniques for animation can achieve similar effects, so realistic output alone does not mean that Sora can simulate and understand the real world.
However, we cannot deny the huge potential of Sora in areas such as creative design and visual effects.
(To be continued, please see the next post)