For years, artificial intelligence systems have been prone to cognitive loops. Ask them to generate an image of, say, a full wine glass, and what do you get? A half-filled glass—every single time. While this might seem like a minor oversight, it reveals some pretty significant flaws in how traditional AI systems work.
The problem isn’t that AI secretly fears spillage; it’s more about how these models operate. Early large language models (LLMs) and image generators didn’t actually “understand” concepts like fullness, basic physics, or etiquette. Instead, they relied on statistical probabilities. If most of the images in their training data showed half-filled glasses, that’s what they’d produce. These models tended to get stuck in “eigenvector traps,” which is just a fancy way of saying they kept repeating dominant patterns from their training data.
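To make that “eigenvector trap” idea concrete, here is a toy sketch in Python (the two-state chain and all of its numbers are invented for illustration; nothing below comes from a real image model). Treat generation as a Markov chain over two concepts: no matter what mix you start with, repeatedly applying the transition matrix converges to its dominant eigenvector, i.e. the single pattern the system keeps reproducing.

```python
import numpy as np

# Toy "eigenvector trap": a two-state Markov chain over image concepts.
# State 0 = "half-filled glass", state 1 = "full glass".
# The transition probabilities below are invented for illustration.
P = np.array([[0.9, 0.1],   # half-filled mostly begets half-filled
              [0.6, 0.4]])  # full glasses drift back toward half-filled

dist = np.array([0.5, 0.5])  # start from an even mix of the two concepts
for _ in range(50):
    dist = dist @ P           # repeated application = power iteration

print(dist)  # ~[0.857, 0.143]: the chain's dominant (stationary) eigenvector
```

Whatever distribution you start from, the loop ends up in the same place. That is the trap.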
Lacking any real reasoning ability or model of the physical world, these early systems made probabilistic guesses that often missed the mark entirely. To make matters worse, as AI systems began training on their own outputs, they created feedback loops that reinforced their mistakes, a phenomenon researchers now call model collapse. It was like copying homework from someone who copied theirs from someone else who didn’t understand the assignment.
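A minimal sketch of that homework-copying loop, again with invented numbers rather than any real training pipeline: each “generation” of a toy model is trained only on samples of the previous generation’s output. Sampling noise compounds round after round, and a concept whose estimated frequency ever hits zero is never generated again.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy feedback loop: generation N of the "model" is trained only on
# the outputs of generation N-1. All frequencies here are invented.
concepts = ["half-filled", "full", "empty", "overflowing"]
probs = np.array([0.70, 0.15, 0.10, 0.05])

for _ in range(100):
    outputs = rng.choice(len(concepts), size=50, p=probs)   # generate
    counts = np.bincount(outputs, minlength=len(concepts))
    probs = counts / counts.sum()                           # retrain on outputs

for name, p in zip(concepts, probs):
    print(f"{name}: {p:.2f}")
# Rare concepts tend to die out within a few dozen generations, and a
# concept estimated at zero frequency can never come back: diversity
# only ever ratchets downward.
```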
In short, AI’s inability to fill a wine glass wasn’t just a silly quirk—it was a symptom of its broader limitations.
GPT-4o: The AI bartender who understands how the world works
Enter GPT-4o, OpenAI’s latest and greatest model, which has finally solved the problem. We now have an AI that can ‘pour’ a full glass of wine. But before you roll your eyes at what seems like a trivial achievement, take a moment to appreciate what this breakthrough actually represents.
Unlike its predecessors, GPT-4o doesn’t just guess based on patterns: it demonstrates an understanding of abstract concepts like “fullness” and liquid volume. This is no small feat; it marks a shift from simple pattern recognition to something resembling actual reasoning. The model can now grasp physical relationships and conceptual nuances in ways that older systems couldn’t.
What makes GPT-4o even more impressive is its ability to seamlessly integrate multiple modes of communication. Text and images can now “understand” each other. GPT-4o is a multimodal system capable of handling complex prompts with ease, whether it’s creating an image of a full wine glass or designing an intricate scene with multiple objects while maintaining visual consistency across iterations. In other words, it’s gone from reproducing a scene to truly understanding it.
In vino veritas? (in wine, there is truth)
This represents a significant step forward. However, achieving AGI (Artificial General Intelligence) will require an understanding of the world that goes far beyond the capacities of GPT-4o. AGI is a hypothetical form of AI capable of performing tasks in domains it has never encountered: it can reason abstractly about what to do in unfamiliar situations and execute those tasks as expertly as a human can.
Professor Fei-Fei Li believes that achieving AGI will require an even deeper understanding of the physical world. Rather than training AI solely on data scraped from the internet, she argues, systems need to interact directly with the real world. She calls this “Spatial Intelligence,” and her company, World Labs, has raised funding at a valuation of more than $1 billion to explore this groundbreaking idea.
This approach mirrors how humans learn during the sensorimotor stage of development: by playing with real objects, such as trying to fill (wine) glasses, we internalize underlying physical rules that we can later apply elsewhere.
This has the potential to revolutionize the way we “do” AI, unlocking capabilities that far surpass anything currently possible.
Sometimes progress starts with something as simple as pouring a proper glass of wine.
Cheers!
by David Reid, Associate Professor in Computer Science