How do we learn new concepts so quickly? A child sees a “high-five” once or twice and can generalize to a low-five. A researcher hears about “few-shot prompting” and rapidly grasps the idea. From “1, 4, 16, 64,” we instantly infer the pattern is “powers of 4.” This ability to infer a general rule from a handful of specific examples—a process called induction—is a cornerstone of human intelligence. We do it effortlessly and across a seemingly infinite range of concepts.

For artificial intelligence, this remains a steep challenge. Machine learning models often require vast amounts of data, and their generalization can be brittle. This gap highlights a core tension in building intelligent systems: how can we search a vast, expressive space of concepts (as humans do) without being overwhelmed by the computation needed to evaluate them all?

A recent paper from Cornell University, “Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language,” offers a compelling approach. It argues that the secret may lie in something uniquely human: thinking in natural language. The authors introduce a model that learns new concepts by internally “talking to itself,” generating hypotheses in English and then leveraging Bayesian inference to decide which hypothesis best explains the data. This approach marries the expressive power of language with principled probabilistic reasoning, creating a model that learns efficiently and captures subtle nuances of human judgment.


The Bayesian Brain and the Problem of Scale

At its heart, induction is about managing uncertainty. If you see the number 16 as an example, the underlying concept could be “powers of two,” “square numbers,” “even numbers,” or even “only the numbers 16 and 97.”

A powerful way to model this reasoning is through Bayes’ Rule:

\[ p(C|D) \propto p(D|C) \, p(C) \]

Where:

  • \( p(C|D) \), the posterior: our belief in concept \( C \) after seeing data \( D \).
  • \( p(D|C) \), the likelihood: how likely we would be to see \( D \) if \( C \) were the true concept.
  • \( p(C) \), the prior: our initial belief in \( C \) before seeing any data, representing our bias toward certain types of explanations.
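
To make this concrete, compare “powers of two” with “even numbers” after seeing \( D = \{16\} \). Purely for illustration, assume a uniform prior over the two rules and a “size principle” likelihood in which each example is drawn uniformly from a rule’s extension over the numbers 1 to 100 (7 powers of two versus 50 even numbers); this likelihood is an assumption for the sketch, not the paper’s exact model:

\[ \frac{p(\text{powers of two} \mid D)}{p(\text{even numbers} \mid D)} = \frac{p(D \mid \text{powers of two})}{p(D \mid \text{even numbers})} = \frac{1/7}{1/50} \approx 7.1 \]

The narrower rule that still contains the data is favored, which is why “even numbers” feels like a weak explanation of 16.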

If our concepts are computer programs, this becomes Bayesian Program Learning (BPL)—expressing concepts in a domain-specific programming language (DSL) and inferring the program that matches the data. BPL is powerful but requires designing a DSL for each domain and conducting costly searches to find the right programs.

An alternative, Latent Language, skips DSLs and uses natural language descriptions directly. Given examples, it searches for an English description that helps a neural network solve the task—but typically finds only a single “best” description, lacking the richer uncertainty representation of a full Bayesian model.

The Cornell approach sits between these worlds: maintain the Bayesian framework but use natural language as the hypothesis space, and make searching it efficient with modern large language models (LLMs).


The Core Method: A “Conversation” with Bayes’ Rule

The proposed model captures a dual-process style of reasoning, akin to Kahneman’s “fast and slow” thinking:

  • A fast, bottom-up process proposes hypotheses quickly.
  • A slower, top-down process weighs and evaluates them.

Step 1: Proposing Hypotheses — The “Fast” Brain

The set of all concepts expressible in English is infinite, so we can’t check them all. The model uses a proposal distribution \( q(C \mid X_{1:K}) \) to generate a small, high-quality set of candidate hypotheses based on observed examples \( X_{1:K} \).

This is where an LLM like GPT-4 comes in: given inputs like 16, 8, 2, 64, it might propose concepts such as:

  • “a power of two”
  • “divisible by 2”
  • “a multiple of 8”
  • “numbers less than 100”

This acts like the intuitive “flash” recognition—narrowing infinite possibilities to a few dozen promising ones.
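
Here is a minimal sketch of that proposal step. The prompt wording, the sample count, and the `llm_sample` helper (a stand-in for whichever LLM client you use) are illustrative assumptions, not the paper’s exact prompts:

```python
# Sketch of the proposal distribution q(C | X_{1:K}): ask an LLM for short
# English hypotheses that could explain the observed examples.
# `llm_sample` is a hypothetical helper that returns one text completion per
# call; any LLM client can be wrapped to fit this signature.

def propose_concepts(examples, llm_sample, n_samples=30):
    prompt = (
        "Here are some example numbers that all follow a hidden rule: "
        + ", ".join(str(x) for x in examples)
        + "\nIn one short English phrase, what could the rule be?"
    )
    # Sample with nonzero temperature so repeated calls yield diverse guesses.
    raw = [llm_sample(prompt, temperature=1.0) for _ in range(n_samples)]

    # Deduplicate while preserving order, giving a small pool of candidates.
    seen, concepts = set(), []
    for c in raw:
        c = c.strip().lower()
        if c and c not in seen:
            seen.add(c)
            concepts.append(c)
    return concepts

# e.g. propose_concepts([16, 8, 2, 64], my_llm) might return hypotheses like
# ["a power of two", "divisible by 2", "a multiple of 8", ...]
```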



Figure 1: Model as Bayesian network. Natural language concept \( C \) is deterministically translated into a program, which is then used to evaluate data \( X_k \). The proposal distribution \( q \) suggests concepts based on observed data.


Step 2: Re-weighting Hypotheses — The “Slow” Brain

Given candidates \( \{C^{(1)}, \dots, C^{(S)}\} \), the model computes a weight for each hypothesis:

\[ \tilde{w}^{(s)} = p(C^{(s)}) \, p(X_{1:K} \mid C^{(s)}) \]

These are normalized so they sum to 1. High weights reflect concepts that are both plausible a priori (high \( p(C) \)) and fit the data well (high \( p(X_{1:K} \mid C) \)).

Crucially, to evaluate the likelihood \( p(X|C) \), the model translates each English description into a Python function using another LLM, then executes the program on the data.
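
The following sketch shows that computation under simplifying assumptions: the English-to-Python translation step is stubbed out with hand-written predicates, the prior is uniform for brevity, and the likelihood uses a size-principle form over the numbers 1 to 100. The paper’s own prior and likelihood details differ, but the “prior times likelihood, then normalize” structure is the same:

```python
import math

DOMAIN = range(1, 101)  # assumed finite number domain for this sketch

def log_likelihood(examples, predicate):
    """log p(X_{1:K} | C): examples drawn uniformly from the concept's extension
    (a size-principle assumption); -inf if any example falls outside it."""
    extension = [x for x in DOMAIN if predicate(x)]
    if not extension or any(not predicate(x) for x in examples):
        return float("-inf")
    return -len(examples) * math.log(len(extension))

def posterior_weights(examples, concepts, predicates, log_prior=None):
    """Normalized weights w^(s) proportional to p(C^(s)) p(X_{1:K} | C^(s))."""
    if log_prior is None:                       # uniform prior for this sketch
        log_prior = {c: 0.0 for c in concepts}
    log_w = [log_prior[c] + log_likelihood(examples, predicates[c]) for c in concepts]
    m = max(log_w)
    w = [math.exp(lw - m) if lw > float("-inf") else 0.0 for lw in log_w]
    total = sum(w)
    return {c: wi / total for c, wi in zip(concepts, w)}

# Hand-written predicates standing in for LLM-translated Python functions:
predicates = {
    "a power of two": lambda x: (x & (x - 1)) == 0,
    "divisible by 2": lambda x: x % 2 == 0,
}
print(posterior_weights([16, 8, 2, 64], list(predicates), predicates))
# "a power of two" dominates: it is the smaller extension that still fits the data.
```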


Step 3: Learning the Prior — Human-like Biases

A major innovation is tuning the prior \( p(C) \) to reflect human intuitive biases. The authors start with a pretrained language model and train a small linear layer on top to predict actual human judgments from behavioral experiments.

The optimization matches the model’s predictions to the fraction \( r \) of humans who judged a test item to belong to the concept:

\[ \arg\max_{\theta} \sum_{(X_{1:K},\, X_{\text{test}},\, r)} \Big[ r \log p_{\theta}(X_{\text{test}} \in C \mid X_{1:K}) + (1-r) \log\big(1 - p_{\theta}(X_{\text{test}} \in C \mid X_{1:K})\big) \Big] \]

This tunes the model to favor concepts humans find simple, natural, and compelling.
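
Below is a minimal PyTorch sketch of this tuning step, under assumptions flagged in the comments: concept descriptions arrive as frozen pretrained-LM embeddings, the prior is a single linear layer over them, and the model’s prediction for a test item is the posterior-weighted vote of the proposed hypotheses. The paper’s exact architecture and embedding choices may differ:

```python
import torch
import torch.nn as nn

class TunablePrior(nn.Module):
    """Log-prior over concepts: a linear layer on frozen LM embeddings of C."""
    def __init__(self, embed_dim):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 1)

    def forward(self, concept_embeddings):                    # (S, embed_dim)
        return self.linear(concept_embeddings).squeeze(-1)    # (S,) unnormalized log p(C)

def predict_membership(prior, concept_embeddings, log_likelihoods, memberships):
    """p_theta(X_test in C | X_{1:K}) = sum_s w^(s) * 1[X_test satisfies C^(s)]."""
    log_w = prior(concept_embeddings) + log_likelihoods       # (S,)
    w = torch.softmax(log_w, dim=0)                           # normalized posterior weights
    return (w * memberships).sum()                            # scalar in [0, 1]

def loss_fn(pred, r):
    """Cross-entropy against the fraction r of humans who answered 'yes'."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    return -(r * torch.log(pred) + (1 - r) * torch.log(1 - pred))

def train(prior, data, epochs=10, lr=1e-3):
    # Each datum: (concept_embeddings, log_likelihoods, memberships, r),
    # precomputed from proposals, program execution, and the human experiments.
    opt = torch.optim.Adam(prior.parameters(), lr=lr)
    for _ in range(epochs):
        for emb, loglik, member, r in data:
            loss = loss_fn(predict_membership(prior, emb, loglik, member), r)
            opt.zero_grad()
            loss.backward()
            opt.step()
```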


Experiment 1: The Number Game

The Number Game is a classic induction experiment: shown a few example numbers that follow a hidden rule, participants rate how likely other numbers are to follow the same rule.

For example, starting with “16” alone, people consider several rules plausible (“powers of two,” “square numbers,” even “numbers less than 20”), producing graded probabilities. Seeing “16, 8, 2, 64” pushes almost everyone toward “powers of two” with high confidence.



Figure 2: Model predictions vs. human judgments for the Number Game under different training examples. The model captures both nuanced uncertainty and sharp posterior beliefs.


Performance is measured by \( R^2 \) — agreement with human ratings — as the number of sampled hypotheses increases.
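
In sketch form (assuming \( R^2 \) here means the fraction of variance in the mean human ratings explained by the model’s predictions, which may differ slightly from the paper’s exact computation):

```python
import numpy as np

def r_squared(human_ratings, model_predictions):
    """Fraction of variance in mean human ratings explained by the model."""
    human = np.asarray(human_ratings, dtype=float)
    model = np.asarray(model_predictions, dtype=float)
    residual = np.sum((human - model) ** 2)
    total = np.sum((human - human.mean()) ** 2)
    return 1.0 - residual / total
```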



Figure 3: Fit to human ratings vs. sampling budget. Tuned prior (blue) reaches \( R^2 \approx 0.95 \) with only 100 samples, outperforming GPT-4, DreamCoder, and Latent Language baselines.


Key findings:

  • Tuned Prior Wins: Achieves near-perfect fit with only ~100 samples.
  • Natural Language Matters: Swapping English descriptions for Python code as the hypothesis space reduces performance.
  • Baselines Lag: GPT-4 alone and DreamCoder (even with 10,000 hypotheses) fall short.

Experiment 2: Learning Complex Logical Concepts

To push beyond simple math concepts, the authors tested the model on logical rules about shapes.

Participants saw batches of objects (various colors, shapes, sizes), with some labeled “wudsy.” Hidden rules could be:

  • Simple: “green objects”
  • Propositional: “a circle unless it is blue”
  • Relational: “the largest object of its shape in the batch”

Figure 4: Logical concept learning setup from [55]. Shapes vary in color, size, and type; only some are “wudsy,” and participants must guess which new objects are also wudsy.


The model was adapted for online learning: generating new proposals at each batch and performing Bayesian updates. It was compared against a strong, hand-crafted BPL baseline using a logical grammar and billions of MCMC proposals.
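
A sketch of that online loop, with assumptions stated up front: `propose_concepts` and `translate_to_predicate` stand in for the LLM-based proposal and English-to-Python translation steps, the prior is left uniform for brevity (the tuned prior would enter as an extra additive log term), and the likelihood is a simple noisy-classification model with an assumed error rate:

```python
import math

EPS = 0.05   # assumed label-noise probability in the likelihood (illustrative)

def log_likelihood(labeled, predicate):
    """log p(labels | C): each label agrees with the rule's prediction
    with probability 1 - EPS (a simple noisy-classification assumption)."""
    return sum(math.log(1 - EPS) if predicate(obj) == label else math.log(EPS)
               for obj, label in labeled)

def online_concept_learning(batches, propose_concepts, translate_to_predicate):
    concepts, predicates, seen, predictions = [], {}, [], []

    for batch in batches:                    # batch: list of (object, is_wudsy) pairs
        if concepts:
            # Re-weight every hypothesis on all data observed so far (uniform prior).
            log_w = [log_likelihood(seen, predicates[c]) for c in concepts]
            m = max(log_w)
            w = [math.exp(lw - m) for lw in log_w]
            z = sum(w)
            weights = {c: wi / z for c, wi in zip(concepts, w)}
            # Predict each incoming object as the weighted vote of the hypotheses.
            predictions.append([
                sum(wt for c, wt in weights.items() if predicates[c](obj))
                for obj, _ in batch
            ])
        seen.extend(batch)                   # first batch yields no prediction yet

        # Propose fresh hypotheses conditioned on everything seen so far,
        # keeping earlier ones in the pool.
        for c in propose_concepts(seen):
            if c not in predicates:
                concepts.append(c)
                predicates[c] = translate_to_predicate(c)   # English -> Python, via an LLM

    return predictions
```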



Figure 5: Agreement with human responses vs. samples per batch. Tuned prior outperforms GPT-4 and the BPL baseline.


With only 100 proposals per batch, the model explained 81% of the variance in human learning curves—slightly outperforming the massive BPL system. Tuned priors were crucial here because complex concepts can be expressed many different ways in language.



Figure 6: A. Model and humans sometimes learn the same wrong rule, revealing shared inductive biases. C. Task accuracy nearly reaches human performance. D. Able to learn novel concepts outside the BPL hypothesis space.


Flexibility advantage: The authors also tested two new concepts, “objects with the majority color” and “objects with the least common color.” Humans learned them quickly, but the BPL baseline could not even represent them because its grammar lacked the necessary primitives. The language-based Bayesian model had no trouble, highlighting a fundamental strength: language brings a vast repertoire of composable concepts without manual enumeration.


Conclusion: Thinking in Language

This work offers a plausible account of human-like induction:

  • A fast generative step proposes explanations in natural language.
  • A slower Bayesian step evaluates them with explicit priors and likelihoods.
  • Language acts as a flexible, compositional hypothesis space.
  • Bayesian reasoning constrains and rationalizes LLM-generated ideas.

Key Takeaways

  1. Natural Language as Hypothesis Space: Outperforms rigid DSLs, aligning well with human conceptual structure.
  2. Bayesian Wrapping for LLMs: Brings rationality to powerful but unpredictable proposal generators.
  3. Tuning Priors on Human Judgments: An effective recipe for instilling human inductive biases into AI systems.

While not a full model of all human cognition, for abstract, symbolic reasoning it bridges computational-level goals with algorithmic-level processes—offering a rational process model that is both efficient and psychologically plausible.

By combining the timeless principles of Bayesian inference with the modern capabilities of large language models, this approach not only illuminates how we learn so much from so little—it points the way to machines that can do the same.