🎙️ Machine Learning Street Talk (MLST): The Real Reason Huge AI Models Actually Work, with Professor Andrew Wilson
PODCAST INFORMATION
Channel: Machine Learning Street Talk (MLST) 
 Episode: The Real Reason Huge AI Models Actually Work 
Guest: Andrew Wilson, Professor at the Courant Institute of Mathematical Sciences and Center for Data Science at New York University 
 Duration: Approximately 2 hours and 7 minutes
HOOK
The episode reveals the counterintuitive principle that larger AI models often develop stronger biases toward simplicity, challenging conventional assumptions about model complexity and generalization.
ONE-SENTENCE TAKEAWAY
The remarkable success of huge AI models stems not just from their expressiveness but from an emergent simplicity bias that grows with scale, allowing them to generalize better despite having more parameters than data points.
SUMMARY
This episode of Machine Learning Street Talk features an in-depth conversation with Andrew Wilson, a professor at NYU's Courant Institute of Mathematical Sciences and Center for Data Science, who challenges fundamental misconceptions about why large AI models work so well. Wilson explains that deep learning's success cannot be attributed solely to model expressiveness; rather, it reflects a surprising phenomenon: as models scale up, they develop stronger biases toward simpler solutions.
The conversation begins with Wilson explaining that the conventional understanding of the bias-variance trade-off is flawed. Contrary to popular belief, there doesn't have to be a trade-off between bias and variance. Larger models can achieve both low bias and low variance simultaneously, which helps explain phenomena like double descent and benign overfitting that have puzzled the machine learning community.
Wilson introduces the concept that parameter counting is a poor proxy for model complexity. What matters more is the induced distribution over functions and the model's preferences for certain types of solutions. He illustrates this with the airline passenger dataset example, where most people initially prefer simpler models (linear or cubic polynomials) but can be convinced that a model with 10,000 parameters might actually be better.
A central theme of the discussion is the idea that we should "honestly represent our beliefs" when building models. This means embracing expressiveness while incorporating a simplicity bias (Occam's razor). Wilson argues that the real world is complex, but simple solutions that explain observations are more likely to be true. This philosophy leads to more adaptive models that perform well with both small and large datasets.
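To make this concrete, here is a minimal sketch of the idea (illustrative, not code from the episode): an expressive basis with more parameters than data points, combined with a penalty that grows with the order of each term, so the model can express complex functions but prefers simple ones.

```python
# Minimal sketch of a soft simplicity bias (illustrative, not from the
# episode): a 51-coefficient Legendre basis fit to 30 noisy points.
# Higher-order coefficients carry larger penalties, so the model can
# use them but prefers not to -- Occam's razor as a soft preference,
# not a hard constraint.
import numpy as np
from numpy.polynomial.legendre import legvander

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)                       # 30 noisy observations
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

degree = 50                                      # more parameters than data points
Phi = legvander(x, degree)                       # basis functions P_0 ... P_50

# Order-dependent prior: coefficient k is shrunk in proportion to k^2,
# so expressiveness is retained while complexity is discouraged.
penalty = np.diag([float(k**2) for k in range(degree + 1)])
w = np.linalg.solve(Phi.T @ Phi + 0.1 * penalty, Phi.T @ y)

x_new = np.linspace(-1, 1, 200)
y_hat = legvander(x_new, degree) @ w             # smooth fit despite 51 parameters
```

Despite having 51 coefficients for 30 observations, the order-weighted penalty pulls the fit toward a low-order explanation unless the data demand otherwise: the "expressive model plus simplicity bias" recipe in miniature.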
The episode explores several key phenomena in deep learning, including double descent, benign overfitting, and overparameterization. Wilson explains that in the second descent phase of double descent, all models fit the training data perfectly, yet larger models generalize better. This can only be explained by a simplicity bias rather than expressiveness.
Wilson also discusses the connection between deep learning and Solomonoff induction, suggesting that large transformers combine expressiveness with a strong preference for low Kolmogorov complexity solutions. This alignment with the real-world data-generating distribution, which itself appears biased toward low complexity, explains why increasingly general models like transformers have been successful.
The conversation touches on Bayesian principles, with Wilson emphasizing the importance of marginalization and uncertainty representation. He explains that Bayesian methods have been successful in deep learning but are often overlooked. Marginalization automatically incorporates Occam's razor, providing an elegant solution to model selection.
Wilson addresses practical implications for practitioners, suggesting that they should build models as large as possible while incorporating some form of simplicity bias. He acknowledges that making models bigger is currently the most effective way to enhance this simplicity bias, but expresses hope for more elegant approaches in the future.
The episode concludes with Wilson discussing future directions, particularly the development of AI systems that can discover new scientific theories. He emphasizes the importance of understanding why models work, not just that they work, to achieve lasting progress in the field.
INSIGHTS
Core Insights
- Larger models often develop stronger biases toward simpler solutions, contradicting the intuition that more parameters lead to overfitting.
 - The bias-variance trade-off is a misnomer; it's possible to achieve both low bias and low variance simultaneously.
 - Parameter counting is a poor proxy for model complexity; what matters is the induced distribution over functions and preferences for certain solutions.
 - Double descent occurs because larger models in the second descent phase have a simplicity bias, not because they're more expressive.
 - Solomonoff induction provides a theoretical framework for understanding why large models generalize well.
 - Bayesian marginalization automatically incorporates Occam's razor, providing an elegant approach to model selection.
 - The real-world data-generating distribution appears biased toward low Kolmogorov complexity, and successful models share this bias.
 
How This Connects to Broader Trends/Topics
- The success of foundation models and transformers can be explained by their alignment with the structure of real-world data.
 - The movement from specialized architectures (CNNs for vision, RNNs for sequences) to more general models (transformers) reflects a progression toward increasingly universal learning systems.
 - Understanding the principles behind model generalization is crucial for developing more efficient and capable AI systems.
 - The debate between feature engineering and end-to-end learning reflects deeper questions about the role of inductive biases in machine learning.
 
FRAMEWORKS & MODELS
Solomonoff Induction
- A formalization of Occam's razor that assigns exponentially stronger preferences to solutions with low Kolmogorov complexity.
 - Works by considering all possible computer programs that could generate the observed data, with shorter programs receiving higher prior probability.
 - Provides a theoretical foundation for understanding why large neural networks generalize well despite their expressiveness.
 - Evidence for its relevance comes from generalization bounds that tightly characterize the behavior of large models.
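 
For reference, the standard formulation of the Solomonoff prior (textbook material, not a formula quoted in the episode) makes the exponential preference explicit:

```latex
% Universal (Solomonoff) prior over strings x, for a fixed universal
% prefix machine U: every program p that outputs x contributes mass
% that decays exponentially in its length \ell(p).
M(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}
% The shortest program dominates the sum, so M(x) \approx 2^{-K(x)},
% where K(x) is the Kolmogorov complexity of x: an exponentially
% strong Occam's razor over explanations.
```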
 
Double Descent
- A phenomenon where generalization error first decreases, then increases, and then decreases again as model expressiveness increases.
 - In the second descent phase, all models fit the training data perfectly, yet larger models generalize better (see the simulation sketch after this list).
 - Demonstrates that larger models must have some bias (simplicity bias) beyond just expressiveness.
 - Challenges the conventional understanding of the relationship between model size and generalization.
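 
A standard way to reproduce the curve (a minimal sketch under illustrative assumptions, not the experiment discussed in the episode) is minimum-norm least squares on random ReLU features, where test error typically peaks near the interpolation threshold and then falls again as features are added:

```python
# Minimal double-descent sketch (illustrative, not the episode's
# experiment): minimum-norm least squares on random ReLU features.
# Past the interpolation threshold (features >= training points),
# every model fits the training data exactly, yet test error tends
# to fall again -- the implicit simplicity bias of the minimum-norm
# solution.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 10
w_true = rng.standard_normal(d)

X_tr = rng.standard_normal((n_train, d))
y_tr = X_tr @ w_true + 0.1 * rng.standard_normal(n_train)
X_te = rng.standard_normal((n_test, d))
y_te = X_te @ w_true

for n_feat in [5, 10, 20, 40, 80, 320, 1280]:
    W = rng.standard_normal((d, n_feat))          # random projection
    F_tr = np.maximum(X_tr @ W, 0)                # ReLU features
    F_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr            # minimum-norm fit
    mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"{n_feat:5d} features | test MSE = {mse:.3f}")
```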
 
Bayesian Marginalization
- A method for representing uncertainty by integrating over all possible solutions weighted by their posterior probabilities.
 - Automatically incorporates Occam's razor without manual specification (see the equations after this list).
 - Becomes increasingly important with more expressive models where many different parameter settings can explain the data.
 - Provides an elegant alternative to optimization-based approaches that bet everything on a single solution.
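 
In symbols (the textbook formulation, in the spirit of MacKay's treatment cited under REFERENCES, not a derivation from the episode):

```latex
% Posterior predictive: average over all parameter settings w weighted
% by their posterior, rather than betting on a single optimum \hat{w}.
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
% Occam's razor enters through the marginal likelihood (evidence):
% a model that can explain too many possible datasets spreads this
% integral thin and is automatically penalized in model comparison.
p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid w, \mathcal{M})\, p(w \mid \mathcal{M})\, dw
```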
 
Mode Connectivity
- The discovery that solutions found by training neural networks from different initializations, though seemingly distinct, are connected by paths in parameter space along which training loss stays low.
 - Reveals the existence of large, flat regions in loss landscapes that support good generalization.
 - Led to practical techniques like stochastic weight averaging (sketched below) that improve generalization by finding flatter solutions.
 - Challenges the notion that different solutions are isolated in the loss landscape.
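 
A minimal sketch of stochastic weight averaging using PyTorch's built-in `torch.optim.swa_utils` (the tiny network, synthetic data, and schedule are placeholders, not the original paper's setup):

```python
# Sketch of stochastic weight averaging (SWA): average the weights
# visited late in training to land in the middle of a flat region.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = torch.nn.Linear(10, 1)                    # placeholder network
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(10)]
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)                  # running average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
swa_start = 5                                     # epoch at which averaging begins

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    if epoch >= swa_start:
        # Averaging iterates pulls the solution toward the center of a
        # flat, connected region of the loss landscape.
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)                      # refresh BatchNorm statistics
```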
 
QUOTES
- "I think the bias variance trade-off is an incredible misnomer. There doesn't actually have to be a trade-off." - Andrew Wilson. This statement challenges a fundamental concept in statistics and machine learning, setting the stage for a reevaluation of how we think about model complexity and generalization.
 - "Parameter counting is a very bad proxy for model complexity. Really, what we care about is the properties of this sort of induced distribution over functions rather than just how many parameters the model happens to have." - Andrew Wilson. This quote captures a key insight that shifts focus from the number of parameters to the behavior of the model as a whole.
 - "My philosophy for model construction is just honestly represent your beliefs. And we believe the real world is a complicated place. And if we combine that belief with the idea that simple solutions that are consistent with our observations are more likely to be true, then we can often see desirable behavior in quite a variety of different settings." - Andrew Wilson. This quote encapsulates Wilson's approach to model building, emphasizing both expressiveness and simplicity.
 - "I'm really hoping to do research that will be relevant in hundreds of years from now. And so I think these questions around model selection for example in Occam's razor people will never stop asking." - Andrew Wilson. This quote reveals Wilson's motivation as a scientist, focusing on fundamental questions that have enduring relevance.
 - "If we can compress our data really effectively, then in order to do that, we're discovering regularities that are typically going to help enable generalization. And in some sense, physical laws, for example, are a great compressed representation of reality." - Andrew Wilson. This quote connects the concept of compression to intelligence and scientific understanding, suggesting a deep relationship between these ideas.
 
HABITS
- Embrace model expressiveness while incorporating a simplicity bias rather than using hard constraints.
 - Build models as large as computationally feasible, as scale tends to enhance simplicity bias.
 - Use techniques like stochastic weight averaging to find flatter, more compressible solutions in the loss landscape.
 - Consider Bayesian approaches to represent uncertainty honestly, especially when working with expressive models.
 - Focus on understanding why models work, not just that they work, to achieve lasting progress.
 - When possible, use soft inductive biases rather than hard constraints, allowing models to adapt to the data.
 - Evaluate models based on their induced distribution over functions rather than just parameter count.
 - Consider the entire training dynamics, not just final performance, when assessing model quality.
 
REFERENCES
- "Deep Learning is Not So Mysterious or Different" - Wilson's paper challenging misconceptions about deep learning.
 - "Bayesian Deep Learning and a Probabilistic Perspective of Generalization" - Paper discussing Bayesian principles in deep learning.
 - "Compute Optimal LLMs Provably Generalize Better with Scale" - Recent paper analyzing the source of simplicity bias arising from scale.
 - Solomonoff induction - A formal framework for induction that assigns higher prior probability to simpler explanations.
 - Double descent phenomenon - First observed in the late 1980s, long before its recent popularization, challenging conventional understanding of model size and generalization.
 - Mode connectivity - Discovery showing that different neural network solutions are connected through paths in parameter space.
 - David MacKay's "Information Theory, Inference, and Learning Algorithms" - Reference for Bayesian model selection and Occam's razor.
 - Radford Neal's work on Gaussian processes and Bayesian neural networks - Early work embracing expressiveness in machine learning.
 
Crepi il lupo! 🐺