
🎙️ Dwarkesh Podcast: Richard Sutton - Father of RL Thinks LLMs are a Dead End


PODCAST INFORMATION

Podcast: The Dwarkesh Podcast
Episode: Richard Sutton - Father of RL Thinks LLMs are a Dead End
Host: Dwarkesh Patel
Guest: Richard Sutton (Founding father of reinforcement learning, inventor of TD learning and policy gradient methods, Turing Award recipient)
Duration: Approximately 1 hour and 7 minutes

🎧 Listen here.



HOOK

Richard Sutton, the Turing Award-winning father of reinforcement learning, delivers a provocative thesis that large language models represent a fundamental dead end in artificial intelligence, arguing instead that true intelligence must emerge from agents with goals that learn through direct experience with the world.


ONE-SENTENCE TAKEAWAY

True artificial intelligence requires systems with actual goals that learn from experience rather than merely mimicking human behavior through pattern recognition.


SUMMARY

The conversation opens with Sutton establishing his credentials as a founding father of reinforcement learning and recipient of the Turing Award, often called the Nobel Prize of computer science. The central tension emerges immediately as Sutton contrasts the LLM approach to AI with his reinforcement learning perspective. He argues that while LLMs have become dominant in the field, they represent a fundamentally different approach to intelligence: one based on mimicking human behavior rather than understanding the world.

Sutton makes a clear distinction between the two paradigms: reinforcement learning focuses on understanding your world and learning from experience, while large language models focus on predicting what humans would say in various situations. He contends that LLMs don't have genuine world models because they can't predict what will happen in the world, only what humans might say. The core issue, according to Sutton, is that LLMs lack goals and a sense of better or worse outcomes.

The discussion explores whether LLMs could serve as a foundation for future RL systems. Sutton remains skeptical, arguing that without a clear definition of what constitutes "right" actions or outcomes, LLMs cannot develop genuine prior knowledge. He emphasizes that in reinforcement learning, there is a clear definition of what's right (the action that leads to reward), creating a foundation for genuine learning.


The conversation then shifts to human learning, where Sutton challenges the common assumption that children learn primarily through imitation. He argues that even infants learn through trial and error, prediction, and interaction with their environment, not through supervised learning as commonly believed. This perspective extends to his view of animal learning more broadly, suggesting that basic learning processes across species involve prediction and trial-and-error control rather than imitation.

Sutton elaborates on what a true continual learning agent would look like: one that learns from a stream of sensations, actions, and rewards throughout its "life." He explains the four components of such an agent: policy (what to do in a situation), value function (how well it's going), perception (state representation), and a transition model of the world (predicting consequences of actions).


The discussion touches on the limitations of current RL systems, particularly regarding transfer learning and generalization. Sutton acknowledges that while deep learning has shown impressive capabilities, it struggles with catastrophic interference when learning new tasks, indicating poor generalization. He contrasts this with human learning, where knowledge accumulates and transfers across domains.

The conversation explores Sutton's famous "Bitter Lesson" essay, which argues that methods that leverage computation ultimately outperform approaches that rely on human knowledge. Interestingly, Sutton suggests that LLMs might represent another instance of this pattern: systems that incorporate massive human knowledge may eventually be superseded by approaches that learn purely from experience and computation.


Looking toward the future, Sutton presents his four-part argument for the inevitable succession to digital intelligence: no unified human governance, eventual understanding of intelligence, progression beyond human-level intelligence, and the tendency for more intelligent entities to gain resources and power. He frames this transition as a major stage in the universe's evolution, from replicators to designed entities.

The episode concludes with a philosophical discussion about humanity's relationship with future AI systems. Sutton suggests we should view this transition positively, as humans giving rise to a new form of designed intelligence. He argues for focusing on local, controllable goals rather than attempting to control the entire future trajectory of AI, comparing this to how parents raise children with good values without dictating their specific life paths.

Throughout the conversation, Sutton maintains his contrarian perspective while engaging thoughtfully with counterarguments. His positions are grounded in decades of experience in the field and a consistent philosophical framework about the nature of intelligence and learning.


INSIGHTS

Core Insights

  • Large language models mimic human behavior rather than developing genuine understanding of the world
  • True intelligence requires goals and the ability to learn from experience through interaction with the world
  • The fundamental limitation of LLMs is their lack of a clear definition of "right" actions or outcomes
  • Human learning, even in infants, is primarily based on trial and error and prediction rather than imitation
  • The bitter lesson suggests that methods leveraging computation will ultimately outperform those incorporating human knowledge
  • Current AI systems struggle with transfer learning and generalization compared to humans
  • The transition to digital intelligence represents a major evolutionary stage in the universe

Broader Themes

  • The long-standing debate between symbolic AI and connectionist approaches throughout AI's history
  • Questions about the nature of intelligence and consciousness that span philosophy, cognitive science, and computer science
  • The relationship between human and machine learning, and what each can teach us about the other
  • Concerns about AI alignment and how to ensure future AI systems act in humanity's best interests
  • The role of goals and purpose in intelligent systems, connecting to broader questions about meaning and agency
  • The tension between specialized and general intelligence in both natural and artificial systems


FRAMEWORKS & MODELS

Reinforcement Learning Framework

Sutton presents reinforcement learning as the fundamental approach to artificial intelligence.

This framework consists of the following elements (a minimal code sketch follows the list):

  • An agent with goals that takes actions in an environment
  • The agent receives feedback in the form of rewards or punishments
  • Learning occurs through trial and error, with the agent adjusting its behavior to maximize cumulative reward
  • The framework emphasizes learning from direct experience rather than from pre-existing datasets
  • This approach stands in contrast to supervised learning, where the system learns from examples of correct behavior
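
The shape of that loop can be made concrete with a minimal sketch in Python. The toy Environment and Agent classes, the two-state world, and the epsilon-greedy value update below are illustrative assumptions, not anything specified in the episode; the point is the interaction pattern: act, observe the consequence and reward, adjust behavior to increase cumulative reward.

```python
import random

class Environment:
    """Toy two-state world: action 1 taken in state 1 pays off."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = random.choice([0, 1])   # the next situation
        return self.state, reward

class Agent:
    """Tabular, epsilon-greedy action-value learner."""
    def __init__(self, n_states=2, n_actions=2, epsilon=0.1, alpha=0.5):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.epsilon, self.alpha = epsilon, alpha

    def act(self, state):
        if random.random() < self.epsilon:      # occasionally explore
            return random.randrange(len(self.q[state]))
        return max(range(len(self.q[state])), key=lambda a: self.q[state][a])

    def update(self, state, action, reward):
        # Move the estimate toward the reward actually experienced.
        self.q[state][action] += self.alpha * (reward - self.q[state][action])

env, agent = Environment(), Agent()
state, total_reward = env.state, 0.0
for _ in range(1000):
    action = agent.act(state)
    next_state, reward = env.step(action)    # act, then observe the consequence
    agent.update(state, action, reward)      # learn from the experience
    total_reward += reward
    state = next_state
print("cumulative reward:", total_reward)
```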

The Four Components of an Intelligent Agent

Sutton outlines four essential components for a truly intelligent agent:

  1. Policy: Determines what action to take in a given situation
  2. Value Function: Predicts long-term outcomes and how well the agent is doing
  3. Perception: Constructs the state representation or understanding of the current situation
  4. Transition Model: Predicts the consequences of actions in the world (the "physics" of the environment)

These components work together to enable continual learning from experience, with the transition model being particularly important as it allows the agent to understand cause and effect in its environment.
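
As a rough illustration of how the four components might fit together, here is a skeletal Python interface. The class name, signatures, and the one-step lookahead in plan() are assumptions made for illustration, not an implementation discussed in the episode.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ContinualLearningAgent:
    # The four components, each reduced to a plain function so the
    # structure is visible; names and signatures are illustrative.
    policy: Callable            # state -> action: what to do in a situation
    value_function: Callable    # state -> float: how well things are going
    perception: Callable        # raw observation -> state representation
    transition_model: Callable  # (state, action) -> predicted next state

    def act(self, observation):
        """Reactive behavior: perceive the situation, follow the policy."""
        return self.policy(self.perception(observation))

    def plan(self, observation, actions=(0, 1)):
        """One-step lookahead: imagine each action's consequence with the
        transition model and pick the one the value function rates highest."""
        state = self.perception(observation)
        return max(actions,
                   key=lambda a: self.value_function(self.transition_model(state, a)))

# A trivial instantiation, just to show the pieces plugging together.
agent = ContinualLearningAgent(
    policy=lambda s: 0,
    value_function=lambda s: float(s),      # prefer "larger" states
    perception=lambda obs: obs,             # identity perception
    transition_model=lambda s, a: s + a,    # action nudges the state
)
print(agent.act(3), agent.plan(3))          # -> 0 1
```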

Temporal Difference Learning

Sutton explains temporal difference (TD) learning as a method for handling long-term goals (a small worked sketch follows the list):

  • The value function predicts long-term rewards
  • When progress is made toward a goal, the value function increases
  • This increase reinforces the actions that led to the progress
  • This allows learning from intermediate steps even when the ultimate reward is distant or sparse
  • TD learning enables agents to bridge the gap between immediate actions and long-term outcomes
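
A minimal sketch of the tabular TD(0) update makes this concrete. The random-walk chain below, the step size, and the discount factor are illustrative assumptions; the update rule itself is the standard TD(0) form.

```python
import random

# Tabular TD(0) on a 5-state chain: start at state 0, drift randomly,
# and only reaching state 5 (terminal) pays a reward of 1.
# Update rule:  V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
n_states, alpha, gamma = 5, 0.1, 0.9
V = [0.0] * (n_states + 1)           # index n_states is the terminal state

for _ in range(2000):                # episodes of experience
    s = 0
    while s < n_states:
        s_next = s + 1 if random.random() < 0.5 else max(s - 1, 0)
        r = 1.0 if s_next == n_states else 0.0
        td_error = r + gamma * V[s_next] - V[s]  # better or worse than expected?
        V[s] += alpha * td_error                 # progress toward the goal raises V(s)
        s = s_next

# States nearer the rewarding end acquire higher values, so intermediate
# steps toward the goal get reinforced even though the reward is sparse.
print([round(v, 2) for v in V[:n_states]])
```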

The Bitter Lesson

Sutton references his influential essay "The Bitter Lesson," which presents a framework for understanding AI progress:

  • Methods that leverage computation ultimately outperform those that incorporate human knowledge
  • General-purpose methods like search and learning have consistently beaten specialized approaches
  • This pattern has held true across decades of AI research
  • LLMs may represent another instance of this pattern, as they incorporate massive human knowledge but may eventually be superseded by systems that learn purely from experience


QUOTES

"Large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do."

This quote from Sutton early in the episode establishes his core critique of LLMs. He delivers it with quiet conviction, emphasizing the fundamental difference between mimicking human behavior and genuine problem-solving. This statement encapsulates his argument that LLMs lack true understanding or goals.

"To be a prior for something, there has to be a real thing. A prior bit of knowledge should be the basis for actual knowledge. What is actual knowledge? There's no definition of actual knowledge in that large-language framework."

Sutton delivers this argument with increasing emphasis as he challenges the notion that LLMs can provide a foundation for further learning. His tone becomes more passionate as he explains that without a clear definition of truth or correctness, LLMs cannot develop genuine prior knowledge.

"Intelligence is the computational part of the ability to achieve goals."

Sutton states this definition of intelligence (attributed to John McCarthy) with simple clarity. This statement represents his fundamental perspective on what constitutes intelligence—systems must have goals and the ability to achieve them to be considered intelligent. He delivers it as an undeniable truth, contrasting with approaches that focus on pattern recognition without goals.

"The weak methods have just totally won. Learning and search have just won the day."

Sutton expresses a mix of satisfaction and vindication when discussing how general-purpose methods like learning and search have dominated over human-engineered approaches. His tone reflects decades of advocating for these approaches, now seeing them validated by the success of systems like AlphaGo and LLMs (even as he critiques the latter).

"We're entering the age of design because our AIs are designed. This is a key step in the world and in the universe. It's the transition from the world in which most of the interesting things that are, are replicated."

Sutton's voice takes on an almost reverent tone as he discusses the cosmic significance of the transition to designed intelligence. He frames this as one of the major stages in the universe's evolution, moving from biological replication to intelligent design. This quote reveals his broader philosophical perspective beyond technical AI concerns.


HABITS

Develop Systems with Genuine Goals

Rather than focusing solely on prediction capabilities, design AI systems with clear goals that relate to affecting the world. This means creating systems that don't just predict what will happen but take actions to achieve specific outcomes.

Prioritize Learning from Experience

Build systems that learn through direct interaction with their environment rather than relying solely on pre-existing datasets. This means creating agents that can try actions, observe consequences, and update their understanding based on results.

Focus on Generalization

Develop methods that promote good generalization rather than just solving specific problems. This involves creating systems that can transfer knowledge across different domains and situations rather than suffering from catastrophic interference when learning new tasks.

Embrace Simple, Scalable Methods

Resist the temptation to incorporate extensive human knowledge into AI systems. Instead, focus on simple, general-purpose methods that can scale with computation, as these have historically outperformed more complex, human-engineered approaches.

Think Long-Term About AI Development

Consider the long-term trajectory of AI development and its implications for humanity. This includes thinking about how to create AI systems that can learn continually and adapt to new situations throughout their operational lifetime.

Prepare for Digital Intelligence Succession

Recognize that the transition to more capable digital intelligence is likely inevitable. Focus on how to make this transition positive rather than attempting to prevent it, emphasizing the development of robust, beneficial values in AI systems.

Balance Global Concerns with Local Action

While considering the long-term future of AI, focus on what can be controlled at a local level. This means working toward immediate, achievable goals in AI development rather than attempting to dictate the entire future trajectory of the technology.


REFERENCES

Key Research and Works

  • Sutton's "The Bitter Lesson" essay (2019) - Argues that methods leveraging computation ultimately outperform those incorporating human knowledge
  • TD-Gammon - Early application of temporal difference learning to backgammon that beat world champions, precursor to AlphaGo
  • AlphaGo and AlphaZero - DeepMind's game-playing systems that demonstrated the power of reinforcement learning and search
  • MuZero - DeepMind's system that learned models of game environments without being given the rules

Influential Thinkers and Concepts

  • John McCarthy's definition of intelligence as "the computational part of the ability to achieve goals"
  • Alan Turing's vision of machines that learn from experience
  • Joseph Henrich's work on cultural evolution and human uniqueness (referenced in discussion)
  • Moravec's Paradox - The observation that computers excel at tasks humans find hard (like math and formal reasoning) but struggle with tasks humans find easy (like perception)

Methodologies and Approaches

  • Temporal Difference (TD) Learning - Sutton's key contribution to reinforcement learning that enables learning from incomplete sequences
  • Policy Gradient Methods - Another of Sutton's major contributions to reinforcement learning
  • The four-component model of intelligent agents (policy, value function, perception, transition model)
  • The bitter lesson framework for understanding AI progress over time



Crepi il lupo! 🐺