In recent years, AI has learned to write text, generate images, create videos and even produce working computer code. As those capabilities became mainstream, attention shifted to a deeper question within AI research: Can machines learn how the world actually works, and not just how to describe it?
For researchers, that question has real-world consequences, from how robots navigate homes to how self-driving cars anticipate what's likely to happen at an intersection. That's where world models come in.
World models are not a new concept. The term was first introduced in the 1950s, resurfaced in modern AI research around 2018 and gained wider attention in 2024 with models like OpenAI's Sora and Google DeepMind's Genie.
In 2025, the concept expanded further into world foundation models, popularized by Nvidia's Cosmos, which won Best AI at CES 2025. Meta's V-JEPA 2, also released in 2025, is said to understand physical rules like gravity.
So what exactly are world models, who is building them and why are they becoming one of the most important areas of AI research right now? Let's dive in.
World models vs. foundation models vs. world foundation models
We first need to clarify the terminology.
"World models" originally referred to AI systems built to understand and predict what happens inside a specific environment, such as a robotic arm workspace or a video game level. For example, an agent learning how objects move inside an Atari game.Â
Foundation models are large, general-purpose systems trained on massive datasets to handle multiple tasks simultaneously. This includes large language models, such as ChatGPT or Gemini, which learn broad patterns primarily from text, as well as multimodal models trained on images, audio or code.
World foundation models combine both ideas by taking the scale of foundation models and training them specifically to simulate physical reality using video and sensory data (think Nvidia's Cosmos or Genie 3).
However, the term "world models" is often used as shorthand for these larger world foundation models, rather than the narrower systems the phrase originally described.Â
From book-smart to world-smart
Large language models (LLMs) are good at sounding informed. However, that knowledge comes from reading vast amounts of text, not from direct experience of the world. They are trained to predict the next token, meaning the next word or piece of a word, based on patterns in text. So they can describe how gravity works or how traffic flows without ever having a sense of weight, motion or cause and effect.
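To make that concrete, here's a deliberately tiny sketch (our own toy example, not how any production LLM is built): it simply counts which word follows which in a scrap of training text, then predicts the most frequent continuation. Notice that it can output "falls" after "ball" without any notion of gravity.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which word follows which in the
# training text, then predict the most frequent continuation.
# Real LLMs use neural networks over subword tokens, but the training
# objective -- predict the next token -- is the same idea.
training_text = "the ball falls down the ball bounces up the ball falls down"
tokens = training_text.split()

follows = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    follows[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most common token seen after `token` in training."""
    candidates = follows.get(token)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("ball"))  # "falls" -- seen twice vs. "bounces" once
```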
Some say world models are a successor to LLMs. But Eric Landau, co-founder and CEO of AI data company Encord, told CNET, "It's hard to say it's like a next step, per se, but it's definitely a parallel track that's running."
Instead of focusing on sentences, world models focus on what happens next after an action. This can involve predicting how objects move through space, how a scene changes when something is blocked from view or, for an AI agent or robot, answering a question like, "If I turn left, what will the camera see?"
The key difference between language models and world models is what they are trained to predict. Language models predict text. World models predict changes in an environment. That environment can be physical, like a room or a road, or virtual, like a simulated world. By learning how actions lead to consequences, world models theoretically enable AI systems to reason before acting rather than reacting one step at a time.
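Here's an equally small toy to show the contrast (again our own illustration, with the room's layout hard-coded rather than learned from video and sensor data): instead of mapping words to a next word, the model maps a state and an action to a next state, which is what lets it answer the "If I turn left, what will the camera see?" question.

```python
from dataclasses import dataclass

# Hypothetical toy world model for a single room: given a state and an
# action, predict the next state and what the camera would observe.
# The names (State, step, observe) are illustrative, not from any real library.
HEADINGS = ["north", "east", "south", "west"]
SCENE = {"north": "a window", "east": "a door",
         "south": "a bookshelf", "west": "a lamp"}

@dataclass
class State:
    heading: str  # which way the agent is facing

def step(state: State, action: str) -> State:
    """Predict the next state after an action (the world-model core)."""
    i = HEADINGS.index(state.heading)
    if action == "turn_left":
        return State(HEADINGS[(i - 1) % 4])
    if action == "turn_right":
        return State(HEADINGS[(i + 1) % 4])
    return state  # unknown actions leave the state unchanged

def observe(state: State) -> str:
    """Answer: what does the camera see from this state?"""
    return SCENE[state.heading]

s = step(State("north"), "turn_left")
print(observe(s))  # "a lamp" -- facing west after turning left from north
```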
Jad Tarifi, CEO and co-founder of AI agent platform Integral AI, told CNET that large language models already contain a form of world knowledge, but it is incomplete.Â
"LLMs do learn a basic implicit world model hidden in their network weights," Tarifi said. "But it's a fractured world model."Â
Models trained directly as world models aim to build a cleaner and more direct representation of how the world works.
How do world models work?
At a basic level, world models attempt to predict how an environment changes when something occurs within it.
Researchers mainly use two approaches. In the first one, the world is generated in real time. As a person moves through a scene or interacts with objects, the model updates what happens next based on what it has learned about motion, objects and basic physics. It works a bit like a video game world that responds to your movements.
The second approach builds the whole world upfront, like a movie set. The model creates a fixed spatial environment with its own rules, and then you step inside. Because the structure is already there, you can explore it or change things around without the scene shifting or losing its logic.
Both approaches aim to do the same thing. They help AI understand how a world is put together and how actions lead to outcomes, rather than guessing based on language alone.
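And here's a rough sketch of why that pays off for agents (the step function below is a hand-written stand-in for a learned model, and the one-dimensional "world" is our own simplification): the system imagines every candidate plan with the model before committing to one, which is what reasoning before acting looks like in code.

```python
from itertools import product

# Sketch of planning with a world model: roll each candidate action
# sequence forward with the model's step function, score the predicted
# outcomes, and only then act. The state is a point on a line; the goal
# is to reach x = 3.
def step(x: int, action: str) -> int:
    """Stand-in predictive model: how an action changes the state."""
    return x + 1 if action == "right" else x - 1

def imagine(x: int, plan: tuple[str, ...]) -> int:
    """Roll the model forward through a whole plan without acting."""
    for action in plan:
        x = step(x, action)
    return x

goal, start = 3, 0
plans = list(product(["left", "right"], repeat=3))
best = min(plans, key=lambda p: abs(goal - imagine(start, p)))
print(best)  # ('right', 'right', 'right') -- chosen before taking a step
```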
From robotics to everyday use
Interest in world models has grown as AI moves beyond chatbots toward agents, robots and systems expected to operate with less supervision. Training those systems directly in the real world is expensive, slow and sometimes dangerous. World models offer a safer alternative, allowing AI to learn and fail in simulated environments while developing a deeper understanding of how reality behaves.
This is why world models matter most in robotics, autonomous driving and other forms of physical AI. Landau told CNET that robots and other embodied systems are the most obvious use case, whether they are deployed directly or used to train other AI systems in simulation.
Researchers expect those applications to expand quickly.
"World models will transition from pure video prediction to models capable of generating abstractions. We should expect these models to be deployed at scale in robotics, automation of science and human-computer interaction," Tarifi said. "I also think they will revolutionize medicine."
Landau agreed that medicine is "a very plausible use case." He pointed to the potential in drug discovery and in understanding how different conditions interact inside the human body, offering a more holistic way to explore treatments before real-world testing.
World models could also shape creative and educational tools. Instead of generating a single image or video, an AI system could generate an environment that responds as a person explores it, allowing designers to walk through prototypes or students to interact with complex systems rather than read about them.
Risks and limitations
Simulating reality is difficult. Small errors in how a model understands physics or cause and effect can grow over time.
Landau said compute is a major constraint. Today's world models are highly GPU-intensive and challenging to deploy in real-time systems, such as those used in robots or autonomous vehicles. He also pointed to data as another bottleneck: world models rely on trajectory-based, sensor-rich data, which is far harder to collect than the text used to train language models. And if simulated data fails to accurately reflect the real world, models can learn incorrect physics or causal relationships.
Tarifi pointed out that the risks are not just technical. He warned about unchecked incentives, adversarial misuse of autonomous agents as weapons and the need to safeguard human agency, especially as society prepares for what he describes as "a transition to an economy without labor as an economic lifesource for the majority of the population."
AI in the spotlight
The growing focus on AI is why Time named AI architects as its 2025 Person of the Year. It reflects how central AI innovation has become across industries and society. As Nvidia CEO Jensen Huang told Time, "This is the single most impactful technology of our time."
World models are a move away from AI that only responds and toward AI that reasons, plans and anticipates. The technology is still in development, but it points to where advanced AI research is headed.