Large Language Models (LLMs) like GPT-4 and PaLM have become astonishingly good at writing code. Give them a description, and they can generate a functional script, a web component, or even a complex algorithm. But writing code from a clear specification is one thing—designing something truly novel and high-performing from scratch is another. Can an LLM invent a new, state-of-the-art neural network architecture?
If you simply ask an LLM to “design a better neural network,” the results are often underwhelming. The task is too complex, the search space of possible architectures is astronomically vast, and the model lacks a structured way to iterate and improve. This is the challenge that a fascinating new paper, EvoPrompting: Language Models for Code-Level Neural Architecture Search, tackles head-on.
The researchers propose a brilliant solution: don’t use an LLM as a one-shot code generator—instead embed it within an evolutionary algorithm. By combining the iterative, optimizing power of evolution with the vast coding knowledge of an LLM, they create a system that can automatically discover neural network architectures that outperform both human-designed and existing state-of-the-art models. This new method, called EvoPrompting, isn’t just a clever trick—it’s a new paradigm for using LLMs as creative partners in complex design tasks.
The Challenge: Finding the Perfect Blueprint for AI
Before diving into how EvoPrompting works, let’s set the stage with a bit of background on Neural Architecture Search (NAS).
Imagine you’re building the fastest car in the world. You have countless components to choose from—different engines, transmissions, chassis types, aerodynamics—and performance depends not just on individual parts, but on how they work together. Finding the optimal combination is an incredibly difficult search problem.
NAS is similar, but for AI models. Instead of car parts, we have neural network components—convolutional layers, attention mechanisms, activation functions, and more. The goal is to automate the process of designing the network’s “blueprint” or architecture to achieve the best performance on a given task.
One of the most powerful NAS techniques uses Evolutionary Algorithms, mimicking natural selection:
- Initialize a population of random or simple architectures.
- Evaluate each one by training it and measuring performance (its “fitness”).
- Select the best-performing architectures to be “parents.”
- Generate children by applying mutations (small changes) and crossover (mixing features from parents).
- Repeat for many generations.
Over time, the population evolves toward better architectures. But traditional evolutionary NAS has a major limitation—it’s confined to a human-designed search space. Researchers must predefine the building blocks (e.g., “use 3×3 or 5×5 convolutions”) the algorithm can choose from. This caps creativity and can block truly novel designs.
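To make this concrete, here is a minimal sketch of classic evolutionary NAS over a small, hand-defined search space (mutation-only for brevity; the search space, mutation rule, and the `train_and_evaluate` callable are illustrative placeholders, not any specific system):

```python
import random

# Illustrative, hand-defined search space: the algorithm can only ever pick
# from these predefined options.
SEARCH_SPACE = {"kernel_size": [3, 5], "width": [32, 64, 128], "depth": [2, 4, 6]}

def random_arch():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(arch):
    child = dict(arch)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])  # still confined to the predefined choices
    return child

def evolve(train_and_evaluate, generations=10, pop_size=8):
    """train_and_evaluate: callable mapping an architecture dict to a fitness score."""
    population = [random_arch() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_evaluate, reverse=True)
        parents = ranked[: pop_size // 2]  # keep the fittest half
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children    # next generation
    return max(population, key=train_and_evaluate)
```

Notice that `mutate` can only swap in values from `SEARCH_SPACE`; nothing outside that menu can ever be discovered, which is exactly the limitation described above.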
What if we broke free from these constraints—letting mutation and crossover be guided by an intelligent agent that understands good code and network design? This is where LLMs come in.
The EvoPrompting Meta-Learning Loop
EvoPrompting reinvents the evolutionary process by replacing mutation and crossover operators with a code-generating LLM. The search space becomes the entire Python language—anything the LLM can write.
Here’s the self-improving loop behind EvoPrompting:
Figure 1: EvoPrompting overview. Seed architectures are used in-context by the LLM to generate candidate architectures, which are trained and evaluated. The top-performing candidates become prompt examples, and all evaluated models help prompt-tune the LLM for the next round.
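In pseudocode, the loop looks roughly like the sketch below (a hedged summary, not the authors' code; every helper callable stands in for one of the steps detailed next):

```python
def evoprompting(seeds, train_and_evaluate, build_prompt, query_code_llm, prompt_tune,
                 rounds=10, children_per_round=16, num_parents=4):
    """Hedged sketch of the EvoPrompting loop; all callables are placeholders."""
    population = [(code, train_and_evaluate(code)) for code in seeds]
    for _ in range(rounds):
        # 1) pick the fittest models as in-context "parents"
        parents = sorted(population, key=lambda x: x[1], reverse=True)[:num_parents]
        # 2) ask the code LLM for new candidate architectures
        prompt = build_prompt(parents)
        children = [query_code_llm(prompt) for _ in range(children_per_round)]
        # 3) train and score every child; drop candidates that fail to train
        evaluated = [(c, train_and_evaluate(c)) for c in children]
        evaluated = [e for e in evaluated if e[1] is not None]
        # 4) adapt the LLM on the newly evaluated (code, fitness) pairs
        prompt_tune(evaluated)
        population = parents + evaluated
    return max(population, key=lambda x: x[1])
```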
1. Initialization: Planting the Seeds
The search begins with a warm start: a small population of seed architectures—good, human-designed models known to perform well for the task. These seeds give the LLM a strong foundation to build upon.
2. Generation: The LLM as an Adaptive “Crossover” Operator
This is EvoPrompting’s heart. To create the next generation:
- Select Parents: Pick the highest-fitness models from the current population.
- Craft the Prompt: Build a few-shot prompt containing the Python code of the parent architectures, each preceded by a comment listing its performance metrics (for example, validation error and parameter count); an illustrative prompt builder is sketched below.
- Set Targets: At the end of the prompt, add target metrics (e.g., 10% smaller, 2% higher accuracy than the best parent).
- Generate Children: The LLM uses the prompt to output new, complete Python model classes. Conditioned on examples of high-performing architectures and ambitious targets, it produces plausible and often creative variations.
This is far more than random tweaking. The LLM leverages its pretraining to act as an intelligent mutation and crossover operator, crafting meaningful changes.
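As a concrete illustration, a prompt builder might look like the following sketch (the metric-comment format, target values, and the `query_code_llm` call are my assumptions, not the paper's exact prompt format):

```python
def build_prompt(parents, target_error, target_params):
    """parents: list of (source_code, val_error, num_params) tuples."""
    blocks = []
    for code, err, params in sorted(parents, key=lambda p: p[1]):
        # Each parent is shown as a metrics comment followed by its full code.
        blocks.append(f"# val_error: {err:.4f}, num_params: {params}\n{code}")
    # End with ambitious target metrics and let the LLM "complete" the next model.
    blocks.append(f"# val_error: {target_error:.4f}, num_params: {target_params}\n")
    return "\n\n".join(blocks)

# Example usage with a hypothetical LLM client:
# children = [query_code_llm(build_prompt(parents, 0.05, 12_000)) for _ in range(16)]
```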
3. Evaluation: Survival of the Fittest
Each new child architecture is trained on the dataset and validated. Fitness balances accuracy and efficiency:
\[ \text{fitness} = -\,\text{validation error} \times \text{model size} \]

Models that are untrainable or fall below an accuracy threshold are discarded, ensuring both quality and efficiency.
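In code, the scoring might look like this minimal sketch (the accuracy floor and the return convention are illustrative assumptions):

```python
def fitness(val_error, num_params, accuracy_floor=0.2):
    """Return a fitness score, or None if the candidate should be discarded."""
    if val_error is None or (1.0 - val_error) < accuracy_floor:
        return None  # untrainable, or accuracy below the cutoff (illustrative threshold)
    return -val_error * num_params  # higher is better: low error AND small model
```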
4. Selection & Adaptation: Learning from Experience
After evaluation:
- Select Next Generation Parents: Highest-fitness models become parents.
- Prompt-Tune the LLM: All evaluated children (code + fitness) are used to soft prompt-tune the LLM, adapting its generation style for better architectures in future rounds.
This two-way feedback—better prompt examples and a smarter LLM—drives steady improvement.
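For intuition, soft prompt-tuning trains only a small set of continuous "prompt" vectors while the LLM's weights stay frozen. Below is a minimal sketch, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`; it is not the paper's training code:

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends trainable prompt embeddings to the input; the base LLM stays frozen."""

    def __init__(self, base_model, prompt_length=20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # freeze the pretrained LLM
        hidden = base_model.get_input_embeddings().embedding_dim
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, hidden) * 0.02)

    def forward(self, input_ids, attention_mask=None, **kwargs):
        tok_emb = self.base_model.get_input_embeddings()(input_ids)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        if attention_mask is not None:
            pad = torch.ones(batch, self.soft_prompt.size(0),
                             dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat([pad, attention_mask], dim=1)
        return self.base_model(inputs_embeds=inputs_embeds,
                               attention_mask=attention_mask, **kwargs)
```

Only `soft_prompt` receives gradients, so each round's (code, fitness) pairs nudge the LLM toward generating stronger architectures without full fine-tuning.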
Experiment 1: MNIST-1D as a Proving Ground
The researchers first test EvoPrompting on MNIST-1D, a simple, low-compute version of the classic handwritten digit dataset, focusing on CNN architectures.
Baselines included:
- Naive few-shot prompting: Just use seeds in a prompt to generate models—no evolution.
- EvoPrompting without prompt-tuning: No adaptation between rounds.
- EvoPrompting with random parents: Parents chosen randomly instead of by fitness.
Figure 2: (a) Pareto frontiers of model size vs. test error for top 20 models per method—closer to the origin is better. (b) Maximum fitness over the search as more models are generated.
Findings:
- Full EvoPrompting consistently yielded models that were both smaller and more accurate than baselines and seeds (blue dots closest to the bottom-left in Figure 2a).
- EvoPrompting achieved higher maximum fitness with fewer evaluations (blue line in Figure 2b).
- Removing evolutionary selection or prompt-tuning drastically reduced performance—every component mattered.
Experiment 2: Graph Neural Networks for Algorithmic Reasoning
Next, they tackle a tougher challenge: the CLRS Algorithmic Reasoning Benchmark, a suite of 30 tasks that model classical algorithms such as BFS, sorting, and shortest paths. The state of the art here was the Triplet-GMPNN graph neural network (GNN). EvoPrompting's search targeted one key component: the triplet message function.
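For context, a triplet message function computes a message for every ordered node triplet (i, j, k) and aggregates over k to update edge representations. The sketch below is a rough illustration in the spirit of Triplet-GMPNN; the projection names, dimensions, and max aggregation are assumptions, not the benchmark's actual implementation:

```python
import torch
import torch.nn as nn

class TripletMessage(nn.Module):
    """Illustrative triplet message: combine nodes i, j, k and edge (i, j), max over k."""

    def __init__(self, dim, tri_dim=8):
        super().__init__()
        self.proj_i = nn.Linear(dim, tri_dim)
        self.proj_j = nn.Linear(dim, tri_dim)
        self.proj_k = nn.Linear(dim, tri_dim)
        self.proj_e = nn.Linear(dim, tri_dim)
        self.out = nn.Linear(tri_dim, dim)

    def forward(self, h, e):
        # h: [n, dim] node embeddings; e: [n, n, dim] edge embeddings
        t = (self.proj_i(h)[:, None, None, :]     # contribution of node i
             + self.proj_j(h)[None, :, None, :]   # contribution of node j
             + self.proj_k(h)[None, None, :, :]   # contribution of node k
             + self.proj_e(e)[:, :, None, :])     # contribution of edge (i, j)
        return self.out(t.max(dim=2).values)      # max over k -> [n, n, dim] edge update
```

This is the component EvoPrompting rewrites: the discovered variants below change how the triplet messages are formed and aggregated.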
Seeded with nine hand-tweaked triplet functions, EvoPrompting ran on three selected CLRS tasks.
Figure 3: EvoPrompting outperforms baselines in fitness when evolving triplet message functions for three CLRS tasks.
Novel architectures discovered included:
- QUADNODEMINMAX: Uses quadruplet node representations and max - min aggregation.
- CONCATREP: Concatenates projection outputs with feedforward layers for richer node/edge representations.
- DIV2MEAN: Halves node representations, using mean aggregation instead of max.
Evaluated across all 30 CLRS tasks:
Table 1: EvoPrompting-discovered models outperform the Triplet-GMPNN baseline on many CLRS tasks, often improving OOD accuracy without increasing model size.
Results were stunning: EvoPrompting set new state-of-the-art on 21 of 30 tasks. On Bubble Sort, accuracy jumped from 67.7% to 88.9%. Gains often came without size increases—proof it’s finding better designs, not just bigger models.
Conclusion: A New Chapter for AI-Assisted Design
EvoPrompting isn’t just another NAS method. It’s a blueprint for guiding the creative potential of LLMs:
- Search an open-ended space: Not limited by predefined components—anything expressible in Python is fair game.
- Iterative refinement: Evolutionary loops retain and build on the best ideas.
- Learning from history: Prompt-tuning makes the LLM progressively better at the specific design task.
While demonstrated in neural architecture search, EvoPrompting could be applied anywhere solutions are code/text—designing algorithms, discovering chemical compounds, optimizing complex systems. It marks a step toward AI-human collaboration for invention, uncovering designs we might never find alone.