Introduction

In the current era of Artificial Intelligence, Large Language Models (LLMs) are often hailed as “general-purpose learners.” We’ve seen them write code, compose sonnets, and even pass bar exams. This versatility has led to a growing assumption: if you throw enough data at a Transformer, it can learn the underlying model of almost anything.

But how true is this when we step away from language and move toward the physical world? Does an LLM actually “understand” the laws of physics governing a system, or is it just memorizing statistical correlations?

This is the core question behind the research paper “Exploring the Learning Capabilities of Language Models using LEVERWORLDS.” The researchers investigated whether language models can learn to predict the outcome of a simple physical system—a lever with weights—and, crucially, how efficiently they learn compared to classical statistical methods.

The results offer a fascinating reality check. While LLMs can learn these tasks, they are surprisingly inefficient compared to simpler, older algorithms. This post will break down the LEVERWORLDS framework, the tug-of-war between “Structure” and “Variability,” and why sometimes, a simple Logistic Regression model beats a billion-parameter giant.

The Two Pillars of Learning: Structure and Variability

To understand why this research matters, we first need to define what it means to “learn” a stochastic (random) setting. The authors argue that learning involves mastering two distinct challenges:

  1. Structure: The universal rules that never change. In a physical context, these are the laws of physics. For example, if you drop a ball, gravity dictates how it falls. This rule applies everywhere on Earth.
  2. Variability: The context-dependent randomness. If you drop balls from different buildings, the heights of the buildings vary. This is a specific distribution unique to that city or dataset.

When an AI model tries to predict how long a ball takes to hit the ground, it needs to implicitly learn both the law of gravity (Structure) and the distribution of building heights in that city (Variability).
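To make this concrete, here is a tiny Python sketch (mine, not the paper's) of the ball-drop example. The law \(t = \sqrt{2h/g}\) is the Structure; the distribution of building heights, which I simply assume to be roughly normal, is the Variability.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 9.81  # Structure: gravity is the same everywhere on Earth

# Variability: the building heights in *this* city. The normal distribution
# and its parameters are purely illustrative assumptions.
heights = rng.normal(loc=30.0, scale=10.0, size=10_000).clip(min=5.0)

# Structure: free-fall time from height h is t = sqrt(2h / g).
fall_times = np.sqrt(2 * heights / g)

print(f"average fall time in this city: {fall_times.mean():.2f} s")
```

A learner that already knows the formula only has to estimate the height distribution; a learner that knows nothing has to discover both from samples.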

This creates a tension known as the Bias-Variance Tradeoff, a fundamental concept in machine learning:

  • High Bias (Strong Assumptions): A model that “assumes” a lot. If you use a physics equation as your model, you assume the world follows that equation. It learns very fast but fails if the assumption is wrong.
  • High Variance (Flexible): A model that assumes nothing. It tries to fit the data perfectly. It is very flexible but requires massive amounts of data to distinguish between noise and actual rules.

Where do LLMs fit? They are typically high-variance, low-bias machines. They assume very little, which makes them flexible, but, as we will see, that flexibility comes at a cost in sample efficiency.
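A quick scikit-learn sketch (again mine, not the paper's) of this tradeoff: a linear model with a strong, correct assumption versus a very flexible polynomial model, trained on the same data. The data-generating function and model choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):
    # Assumed ground truth: y = 3x + noise.
    x = rng.uniform(-1, 1, size=(n, 1))
    return x, 3 * x.ravel() + rng.normal(scale=0.3, size=n)

x_test, y_test = sample(1_000)

for n_train in (10, 1_000):
    x_tr, y_tr = sample(n_train)
    models = {
        "high bias (linear)": LinearRegression(),
        "high variance (degree-12 poly)": make_pipeline(
            PolynomialFeatures(degree=12), LinearRegression()),
    }
    for name, model in models.items():
        model.fit(x_tr, y_tr)
        mse = mean_squared_error(y_test, model.predict(x_test))
        print(f"n={n_train:5d}  {name:32s}  test MSE = {mse:.3f}")
```

With only a handful of samples, the flexible model tends to chase noise; with more data, both converge. That gap is exactly the sample-efficiency cost of weak assumptions.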

The LEVERWORLDS Framework

To test these theories, the researchers couldn’t just use messy real-world text. They needed a controlled environment where they knew the “ground truth” perfectly. Enter LEVERWORLDS.

LEVERWORLDS is a framework that generates simple physics puzzles based on a lever (a seesaw) on a fulcrum.

Figure 1: Overview of our experiments. First, we generate a physical model, then we sample from the model and train a language model to predict the output. We then evaluate the model’s probability estimations.

As shown in Figure 1, the process is straightforward:

  1. Physical Model: A lever is created with weights (masses) placed at certain distances.
  2. Sampling: The system generates instances of this world.
  3. Language Formulation: These physical instances are converted into text strings (e.g., “mass1: 3, distance1: 3…”).
  4. Training: An LLM (or other model) tries to predict the outcome: will it tip Left (L) or Right (R)? A toy sketch of this pipeline follows below.
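Here is that pipeline as a toy Python sketch. The sampling ranges, field names, and sign convention are my assumptions for illustration; the paper's exact format may differ.

```python
import random

rng = random.Random(0)

def sample_instance(n_objects=2):
    """Sample one toy lever instance (value ranges are illustrative assumptions)."""
    objects = [{
        "mass": rng.randint(1, 5),       # kilograms
        "distance": rng.randint(1, 5),   # meters from the fulcrum
        "side": rng.choice([-1, +1]),    # -1 = left of the fulcrum, +1 = right
    } for _ in range(n_objects)]
    torque = sum(o["side"] * o["distance"] * o["mass"] for o in objects)
    balance = "L" if torque < 0 else "R"   # assumed convention: ties go Right
    return objects, balance

def to_text(objects, balance):
    """Turn an instance into a training string, roughly in the spirit of the paper."""
    parts = []
    for i, o in enumerate(objects, start=1):
        side = "left" if o["side"] < 0 else "right"
        parts.append(f"mass{i}: {o['mass']}, distance{i}: {o['distance']}, side{i}: {side}")
    return ", ".join(parts) + f", balance: {balance}"

print(to_text(*sample_instance()))
```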

The Anatomy of the Lever

The beauty of this setup is in the Causal Graph. We know exactly how the inputs relate to the outputs.

Figure 2: Causal graph for balance on a lever. Different worlds differ by the number of objects, by the optional use of density and volume, and by whether the intermediate variables are observed or not.

Looking at Figure 2:

  • Inputs: We have variables like Density (\(\rho\)), Volume (\(V\)), Mass (\(m\)), Distance (\(d\)), and Side (\(s\)).
  • Physics: Torque (\(T\)) is calculated as \(T = s \cdot d \cdot m\).
  • Output: The Balance (\(b\)) depends on the sign of the summed Torques; a quick worked example follows below.
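A worked example with made-up numbers, using the convention that the left side gets \(s = -1\) and the right side \(s = +1\): a 3 kg mass at 2 m on the left versus a 2 kg mass at 2 m on the right.

\[
T_{\text{left}} = (-1) \cdot 2 \cdot 3 = -6, \qquad T_{\text{right}} = (+1) \cdot 2 \cdot 2 = +4
\]
\[
\sum T = -6 + 4 = -2 < 0 \;\Rightarrow\; \text{the lever tips Left.}
\]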

The Twist: Latent Variables

If the model saw every number, the task would be trivial math. To make it a true learning challenge, the researchers hide some variables (making them Latent).

For example, the model might see the mass of the object, but not its distance from the center. It has to infer the probability of the lever tipping based on the distribution of distances it has seen in training. This forces the model to learn both the physics (heavier things tip scales) and the statistics (hidden objects usually sit 3 meters away).

The Contenders

The researchers pitted several types of learning algorithms against each other to see who could learn these worlds the fastest.

1. The Transformers (LLMs)

They used OPT models (ranging from 125 million to 6.7 billion parameters). These models were fine-tuned on the textual data. They treat the physics problem as a sentence completion task.
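The paper's exact training recipe isn't reproduced here, but a minimal fine-tuning sketch with the Hugging Face transformers library looks roughly like this, assuming the textual format shown earlier.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # the smallest OPT size in the paper's range
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy training strings in the assumed format; real ones come from LEVERWORLDS sampling.
texts = [
    "mass1: 3, distance1: 2, side1: left, mass2: 2, distance2: 2, side2: right, balance: L",
    "mass1: 1, distance1: 4, side1: left, mass2: 3, distance2: 3, side2: right, balance: R",
]

model.train()
for text in texts:   # one example at a time, single pass, purely for illustration
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective: predict every next token, including the
    # final "balance: ..." token, which is the outcome we care about.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```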

2. Naïve MLE (Maximum Likelihood Estimation)

This is the simplest baseline. It makes zero assumptions about physics. It treats every unique input combination as a separate bucket and just counts what happened.

\[
\hat{P}(b = L \mid x) \;=\;
\begin{cases}
\dfrac{N_x^{L}}{N_x} & \text{if } N_x > 0 \\[2ex]
\dfrac{1}{2} & \text{if } N_x = 0
\end{cases}
\]

As the equation above shows, if a specific input state \(x\) appears \(N_x\) times, it calculates the probability simply by dividing the “Left” outcomes by the total outcomes. If it hasn’t seen an input before, it guesses 50/50. This model has extremely high variance—it memorizes data but cannot generalize.
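A minimal sketch of that counting estimator (the class and method names are mine):

```python
from collections import defaultdict

class NaiveMLE:
    """Bucket-counting estimator: every distinct input gets its own bucket."""

    def __init__(self):
        self.left = defaultdict(int)
        self.total = defaultdict(int)

    def fit(self, inputs, outcomes):
        for x, b in zip(inputs, outcomes):
            self.total[x] += 1
            if b == "L":
                self.left[x] += 1

    def predict_proba_left(self, x):
        n = self.total.get(x, 0)
        if n == 0:
            return 0.5   # never seen this input: fall back to a 50/50 guess
        return self.left.get(x, 0) / n
```

Note that two inputs differing by a single digit land in different buckets, which is exactly why this estimator cannot generalize.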

3. Logistic Regression

This is a classic statistical method. It assumes a relationship between inputs and outputs is somewhat linear (or polynomial). It has stronger “inductive bias” than the Naive method because it assumes inputs interact in a specific mathematical way.
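The paper's exact feature encoding isn't spelled out here, but a scikit-learn sketch of the idea, assuming numeric features like (mass, distance, side) per object, looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed feature layout per sample: [mass1, distance1, side1, mass2, distance2, side2];
# y holds the observed outcomes. Two samples only, to keep the sketch short.
X = np.array([[3, 2, -1, 2, 2, +1],
              [1, 4, -1, 3, 3, +1]])
y = np.array(["L", "R"])

# Degree-2 polynomial features include products such as mass * distance,
# which is precisely the torque term the physics uses: a helpful inductive bias.
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict_proba(X[:1]))   # estimated [P(L), P(R)] for the first sample
```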

4. Structure MLE (The “Oracle”)

This model is given the “cheat codes.” It knows the laws of physics (\(T = m \cdot d\)).

Equation for structure MLE oracle

It only needs to learn the distribution of the hidden variables. Because it knows the structure, it should theoretically be the most sample-efficient.
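A minimal sketch of the idea (not the paper's exact estimator): the physics \(T = m \cdot d\) is hard-coded, and only the distribution of a hidden variable, here an unobserved right-hand distance, is estimated. For simplicity the latent distribution is counted from directly given values; in the real setting it would have to be inferred from the observed outcomes.

```python
from collections import Counter

def empirical_distribution(values):
    """Estimate a discrete distribution by counting."""
    counts = Counter(values)
    n = sum(counts.values())
    return {v: c / n for v, c in counts.items()}

def prob_left(mass_left, dist_left, mass_right, hidden_dist_probs):
    """P(tips Left) when the right-hand distance is hidden, using the known physics."""
    p = 0.0
    for d_right, prob in hidden_dist_probs.items():
        if mass_left * dist_left > mass_right * d_right:   # assumed: larger torque wins
            p += prob
    return p

# Toy usage: the hidden distance is 2, 3, or 4 with equal probability.
probs = empirical_distribution([2, 3, 4, 2, 3, 4])
print(prob_left(mass_left=3, dist_left=3, mass_right=2, hidden_dist_probs=probs))
```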

Experiments & Results

The researchers evaluated the models using Total-Variation (TV) Distance. Simply put, this measures how far the model’s predicted probability is from the real probability. Lower is better.
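For a binary outcome like this one, the standard definition collapses to a single absolute difference between the true and predicted probability of tipping Left, which is then aggregated over test inputs:

\[
\mathrm{TV}\big(P, \hat{P}\big) = \frac{1}{2} \sum_{b \in \{L, R\}} \big| P(b \mid x) - \hat{P}(b \mid x) \big| = \big| P(L \mid x) - \hat{P}(L \mid x) \big|
\]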

1. How did the Transformers do?

The Transformers did learn the task. As they saw more samples, their error rates dropped.

Figure 3: Results for OPT models. In the first row are the results for world-1 and in the second are the results for world-3. In all cases, we plot the metric as a function of the number of training samples.

Figure 3 shows the learning curves for OPT models.

  • Size Matters: Larger models (the darker lines) generally converged faster and reached lower errors than smaller models.
  • Convergence: They eventually get good at the task, but look at the X-axis (number of samples). They require thousands of examples to get there.

2. Transformers vs. Classical Models

Here is where the comparison gets spicy. Let’s look at Logistic Regression.

Figure 4: Results for Logistic Regression models.

In Figure 4, look at how quickly the curves drop. With just a few hundred samples, Logistic Regression (especially with polynomial features) achieves low error rates.

Now compare this to the MLE Models in Figure 5:

Figure 5: Results for MLE models.

  • Structure MLE (Red/Orange): Because it knows the physics, it learns almost instantly (near zero error immediately).
  • Naïve MLE (Blue): It struggles immensely. Because it treats every input as unique, it needs a massive amount of data to learn anything useful.

3. The Grand Tradeoff

The most illuminating result of the paper is the comparison of Structure Score vs. Error.

The Structure Score measures if the model understands the “direction” of physics. For example, if I increase the mass on the left, the probability of tipping Left should go up. If a model predicts the opposite, it has a low structure score.

Equation for Structure Score
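The paper's formal definition lives in that equation; purely as an illustration of the idea, one could operationalize it as a monotonicity check over matched input pairs (the pairing logic and names below are mine, not the paper's):

```python
def structure_score(prob_left, paired_inputs):
    """Illustrative proxy: fraction of matched pairs where adding mass on the left
    raises the model's predicted P(Left).

    `paired_inputs` holds (x, x_heavier_left) pairs that are identical except that
    the second one carries extra mass on the left side.
    """
    hits = sum(1 for x, x_heavier in paired_inputs
               if prob_left(x_heavier) >= prob_left(x))
    return hits / len(paired_inputs)
```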

The researchers plotted this score against the error rate in Figure 6:

Figure 6: Tradeoff between TV-distance and Structure score. OPT represents the average distances and average scores for all 4 sizes with 5 seeds each.

Key Takeaways from Figure 6:

  • The Bottom Right (Naive MLE): High error, low structure. It doesn’t understand physics; it’s just counting.
  • The Top Left (Structure MLE): Low error, perfect structure.
  • The Middle Ground: Notice that Logistic Regression (Green) is closer to the ideal top-left corner than the OPT (Blue) models.

This confirms the hypothesis: Transformers are less sample-efficient than simple regression models for this type of task. The regression model’s assumptions (inductive bias) align better with the physical world than the Transformer’s general-purpose architecture.

Can’t We Just Prompt GPT-4?

You might be thinking, “Why fine-tune? Just use GPT-4!”

The researchers tested this using In-Context Learning (ICL) (giving the model examples in the prompt) and a Pipeline approach (asking the model to write code to solve the problem).

Table 1: Results for the zero-shot experiments.

The results in Table 1 are sobering.

  • ICL (In-Context Learning): GPT-4o only achieved a low error (\(<0.1\)) in 3.7% of experiments. It largely fails to intuit the statistical distribution just by reading examples.
  • Pipeline: When asked to write a Python program to solve the problem (using scikit-learn’s Logistic Regression), GPT-4o succeeded 51% of the time.

This suggests that LLMs are poor statisticians on their own, but they are decent engineers—they can write code that uses the correct statistical tools.

Theoretical Limits

For the mathematically inclined, the paper provides bounds on why the “Naïve” approach fails so hard. The sample complexity (how many examples you need) grows exponentially with the number of variables.

\[
\mathbb{E}\!\left[\left(\hat{p}_x - p_x\right)^2\right] \;=\; \frac{p_x\,(1 - p_x)}{N_x} \;\le\; \frac{1}{4\,N_x}
\]

This equation shows that the expected squared error decreases inversely with the number of samples \(N_x\). If the model treats every input as unique (like Naïve MLE), \(N_x\) is small for any specific input, and the error remains high.

\[
\Pr\!\left(\left|\hat{p}_x - p_x\right| \ge \varepsilon\right) \;\le\; 2\, e^{-2 N_x \varepsilon^{2}}
\]

The researchers use concentration inequalities (like the one above) to prove that without structural assumptions, the amount of data required to learn these worlds becomes astronomical. Transformers sit somewhere between the Naïve approach and the Structural approach—they effectively learn a “soft” structure, but slowly.
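To put a number on "astronomical", consider an illustrative world (the numbers are mine) with \(k\) observed variables that each take \(v\) values. The naive estimator needs roughly \(N_x\) samples in every one of the \(v^k\) buckets:

\[
\text{total samples} \;\approx\; N_x \cdot v^{k}, \qquad \text{e.g. } v = 5,\; k = 6 \;\Rightarrow\; 5^{6} = 15{,}625 \text{ buckets.}
\]

Add one more variable and the requirement multiplies by another factor of \(v\); that is the exponential growth the bound captures.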

Conclusion: The Lesson for AI Practitioners

The LEVERWORLDS paper serves as an important reminder that “bigger isn’t always better” and “newer isn’t always smarter.”

  1. Transformers are Generalists, Not Specialists: They can learn physical distributions, but they are data-hungry. They effectively have to re-derive the concept of multiplication and addition from the text data.
  2. Inductive Bias is Powerful: Classical models like Logistic Regression perform incredibly well because their mathematical structure (weighted sums) mirrors the actual physics of the lever (torques are sums of products).
  3. The Hybrid Future: The most promising path isn’t forcing LLMs to do math in their “heads” (weights). It’s the Pipeline approach. Use the LLM to understand the problem and write code that utilizes classical statistical methods.

As we continue to push the boundaries of AI, distinguishing between learning the world model (physics) and the instance model (statistics) will be key to building efficient, robust systems. Sometimes, you don’t need a Transformer; you just need a lever and a fulcrum.