Large Language Models (LLMs) like GPT-4 and PaLM have become astonishingly good at writing code. Give them a description, and they can generate a functional script, a web component, or even a complex algorithm. But writing code from a clear specification is one thing—designing something truly novel and high-performing from scratch is another. Can an LLM invent a new, state-of-the-art neural network architecture?
If you simply ask an LLM to “design a better neural network,” the results are often underwhelming. The task is too complex, the search space of possible architectures is astronomically vast, and the model lacks a structured way to iterate and improve. This is the challenge that a fascinating new paper, EvoPrompting: Language Models for Code-Level Neural Architecture Search, tackles head-on.
The researchers propose a brilliant solution: don’t use an LLM as a one-shot code generator—instead embed it within an evolutionary algorithm. By combining the iterative, optimizing power of evolution with the vast coding knowledge of an LLM, they create a system that can automatically discover neural network architectures that outperform both human-designed and existing state-of-the-art models. This new method, called EvoPrompting, isn’t just a clever trick—it’s a new paradigm for using LLMs as creative partners in complex design tasks.
The Challenge: Finding the Perfect Blueprint for AI
Before diving into how EvoPrompting works, let’s set the stage with a bit of background on Neural Architecture Search (NAS).
Imagine you’re building the fastest car in the world. You have countless components to choose from—different engines, transmissions, chassis types, aerodynamics—and performance depends not just on individual parts, but on how they work together. Finding the optimal combination is an incredibly difficult search problem.
NAS is similar, but for AI models. Instead of car parts, we have neural network components—convolutional layers, attention mechanisms, activation functions, and more. The goal is to automate the process of designing the network’s “blueprint” or architecture to achieve the best performance on a given task.
One of the most powerful NAS techniques uses Evolutionary Algorithms, mimicking natural selection:
- Initialize a population of random or simple architectures.
- Evaluate each one by training it and measuring performance (its “fitness”).
- Select the best-performing architectures to be “parents.”
- Generate children by applying mutations (small changes) and crossover (mixing features from parents).
- Repeat for many generations.
Over time, the population evolves toward better architectures. But traditional evolutionary NAS has a major limitation—it’s confined to a human-designed search space. Researchers must predefine the building blocks (e.g., “use 3×3 or 5×5 convolutions”) the algorithm can choose from. This caps creativity and can block truly novel designs.
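To make this concrete, here is a minimal sketch of classic evolutionary NAS over a small, hand-defined search space (mutation-only for brevity; the search space, mutation rule, and the `train_and_evaluate` callable are illustrative placeholders, not any specific system):

```python
import random

# Illustrative, hand-defined search space: the algorithm can only ever pick
# from these predefined options.
SEARCH_SPACE = {"kernel_size": [3, 5], "width": [32, 64, 128], "depth": [2, 4, 6]}

def random_arch():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(arch):
    child = dict(arch)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])  # still confined to the predefined choices
    return child

def evolve(train_and_evaluate, generations=10, pop_size=8):
    """train_and_evaluate: callable mapping an architecture dict to a fitness score."""
    population = [random_arch() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_evaluate, reverse=True)
        parents = ranked[: pop_size // 2]  # keep the fittest half
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children    # next generation
    return max(population, key=train_and_evaluate)
```

Notice that `mutate` can only swap in values from `SEARCH_SPACE`; nothing outside that menu can ever be discovered, which is exactly the limitation described above.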
What if we broke free from these constraints—letting mutation and crossover be guided by an intelligent agent that understands good code and network design? This is where LLMs come in.
The EvoPrompting Meta-Learning Loop
EvoPrompting reinvents the evolutionary process by replacing mutation and crossover operators with a code-generating LLM. The search space becomes the entire Python language—anything the LLM can write.
Here’s the self-improving loop behind EvoPrompting:
Figure 1: EvoPrompting overview. Seed architectures are used in-context by the LLM to generate candidate architectures, which are trained and evaluated. The top-performing candidates become prompt examples, and all evaluated models help prompt-tune the LLM for the next round.
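In pseudocode, the loop looks roughly like the sketch below (a hedged summary, not the authors' code; every helper callable stands in for one of the steps detailed next):

```python
def evoprompting(seeds, train_and_evaluate, build_prompt, query_code_llm, prompt_tune,
                 rounds=10, children_per_round=16, num_parents=4):
    """Hedged sketch of the EvoPrompting loop; all callables are placeholders."""
    population = [(code, train_and_evaluate(code)) for code in seeds]
    for _ in range(rounds):
        # 1) pick the fittest models as in-context "parents"
        parents = sorted(population, key=lambda x: x[1], reverse=True)[:num_parents]
        # 2) ask the code LLM for new candidate architectures
        prompt = build_prompt(parents)
        children = [query_code_llm(prompt) for _ in range(children_per_round)]
        # 3) train and score every child; drop candidates that fail to train
        evaluated = [(c, train_and_evaluate(c)) for c in children]
        evaluated = [e for e in evaluated if e[1] is not None]
        # 4) adapt the LLM on the newly evaluated (code, fitness) pairs
        prompt_tune(evaluated)
        population = parents + evaluated
    return max(population, key=lambda x: x[1])
```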
1. Initialization: Planting the Seeds
The search begins with a warm start: a small population of seed architectures—good, human-designed models known to perform well for the task. These seeds give the LLM a strong foundation to build upon.
2. Generation: The LLM as an Adaptive “Crossover” Operator
This is EvoPrompting’s heart. To create the next generation:
- Select Parents: Pick the highest-fitness models from the current population.
- Craft the Prompt: Build a few-shot prompt containing the Python code of the parent architectures, each preceded by a comment listing its performance metrics (for example, validation error and parameter count); an illustrative prompt builder is sketched below.
- Set Targets: At the end of the prompt, add target metrics (e.g., 10% smaller, 2% higher accuracy than the best parent).
- Generate Children: The LLM uses the prompt to output new, complete Python model classes. Conditioned on examples of high-performing architectures and ambitious targets, it produces plausible and often creative variations.
This is far more than random tweaking. The LLM leverages its pretraining to act as an intelligent mutation and crossover operator, crafting meaningful changes.
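As a concrete illustration, a prompt builder might look like the following sketch (the metric-comment format, target values, and the `query_code_llm` call are my assumptions, not the paper's exact prompt format):

```python
def build_prompt(parents, target_error, target_params):
    """parents: list of (source_code, val_error, num_params) tuples."""
    blocks = []
    for code, err, params in sorted(parents, key=lambda p: p[1]):
        # Each parent is shown as a metrics comment followed by its full code.
        blocks.append(f"# val_error: {err:.4f}, num_params: {params}\n{code}")
    # End with ambitious target metrics and let the LLM "complete" the next model.
    blocks.append(f"# val_error: {target_error:.4f}, num_params: {target_params}\n")
    return "\n\n".join(blocks)

# Example usage with a hypothetical LLM client:
# children = [query_code_llm(build_prompt(parents, 0.05, 12_000)) for _ in range(16)]
```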
3. Evaluation: Survival of the Fittest
Each new child architecture is trained on the dataset and validated. Fitness balances accuracy and efficiency:
\[ \text{fitness} = -\,\text{validation error} \times \text{model size} \]

Models that are untrainable or fall below an accuracy threshold are discarded, ensuring both quality and efficiency.
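In code, the scoring might look like this minimal sketch (the accuracy floor and the return convention are illustrative assumptions):

```python
def fitness(val_error, num_params, accuracy_floor=0.2):
    """Return a fitness score, or None if the candidate should be discarded."""
    if val_error is None or (1.0 - val_error) < accuracy_floor:
        return None  # untrainable, or accuracy below the cutoff (illustrative threshold)
    return -val_error * num_params  # higher is better: low error AND small model
```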
4. Selection & Adaptation: Learning from Experience
After evaluation:
- Select Next Generation Parents: Highest-fitness models become parents.
- Prompt-Tune the LLM: All evaluated children (code + fitness) are used to soft prompt-tune the LLM, adapting its generation style for better architectures in future rounds.
This two-way feedback—better prompt examples and a smarter LLM—drives steady improvement.
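For intuition, soft prompt-tuning trains only a small set of continuous "prompt" vectors while the LLM's weights stay frozen. Below is a minimal sketch, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`; it is not the paper's training code:

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends trainable prompt embeddings to the input; the base LLM stays frozen."""

    def __init__(self, base_model, prompt_length=20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # freeze the pretrained LLM
        hidden = base_model.get_input_embeddings().embedding_dim
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, hidden) * 0.02)

    def forward(self, input_ids, attention_mask=None, **kwargs):
        tok_emb = self.base_model.get_input_embeddings()(input_ids)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        if attention_mask is not None:
            pad = torch.ones(batch, self.soft_prompt.size(0),
                             dtype=attention_mask.dtype, device=attention_mask.device)
            attention_mask = torch.cat([pad, attention_mask], dim=1)
        return self.base_model(inputs_embeds=inputs_embeds,
                               attention_mask=attention_mask, **kwargs)
```

Only `soft_prompt` receives gradients, so each round's (code, fitness) pairs nudge the LLM toward generating stronger architectures without full fine-tuning.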
Experiment 1: MNIST-1D as a Proving Ground
The researchers first test EvoPrompting on MNIST-1D, a simple, low-compute version of the classic handwritten digit dataset, focusing on CNN architectures.
Baselines included:
- Naive few-shot prompting: Just use seeds in a prompt to generate models—no evolution.
- EvoPrompting without prompt-tuning: No adaptation between rounds.
- EvoPrompting with random parents: Parents chosen randomly instead of by fitness.
Figure 2: (a) Pareto frontiers of model size vs. test error for top 20 models per method—closer to the origin is better. (b) Maximum fitness over the search as more models are generated.
Findings:
- Full EvoPrompting consistently yielded models that were both smaller and more accurate than baselines and seeds (blue dots closest to the bottom-left in Figure 2a).
- EvoPrompting achieved higher maximum fitness with fewer evaluations (blue line in Figure 2b).
- Removing evolutionary selection or prompt-tuning drastically reduced performance—every component mattered.
Experiment 2: Graph Neural Networks for Algorithmic Reasoning
Next, they tackle a tougher challenge: the CLRS Algorithmic Reasoning Benchmark, a suite of 30 tasks that model classical algorithms such as BFS, sorting, and shortest paths. The state of the art here was the Triplet-GMPNN graph neural network (GNN). EvoPrompting's search targeted one key component: the triplet message function.
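For context, a triplet message function computes a message for every ordered node triplet (i, j, k) and aggregates over k to update edge representations. The sketch below is a rough illustration in the spirit of Triplet-GMPNN; the projection names, dimensions, and max aggregation are assumptions, not the benchmark's actual implementation:

```python
import torch
import torch.nn as nn

class TripletMessage(nn.Module):
    """Illustrative triplet message: combine nodes i, j, k and edge (i, j), max over k."""

    def __init__(self, dim, tri_dim=8):
        super().__init__()
        self.proj_i = nn.Linear(dim, tri_dim)
        self.proj_j = nn.Linear(dim, tri_dim)
        self.proj_k = nn.Linear(dim, tri_dim)
        self.proj_e = nn.Linear(dim, tri_dim)
        self.out = nn.Linear(tri_dim, dim)

    def forward(self, h, e):
        # h: [n, dim] node embeddings; e: [n, n, dim] edge embeddings
        t = (self.proj_i(h)[:, None, None, :]     # contribution of node i
             + self.proj_j(h)[None, :, None, :]   # contribution of node j
             + self.proj_k(h)[None, None, :, :]   # contribution of node k
             + self.proj_e(e)[:, :, None, :])     # contribution of edge (i, j)
        return self.out(t.max(dim=2).values)      # max over k -> [n, n, dim] edge update
```

This is the component EvoPrompting rewrites: the discovered variants below change how the triplet messages are formed and aggregated.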
Seeded with nine hand-tweaked triplet functions, EvoPrompting ran on three selected CLRS tasks.
Figure 3: EvoPrompting outperforms baselines in fitness when evolving triplet message functions for three CLRS tasks.
Novel architectures discovered included:
- QUADNODEMINMAX: Uses quadruplet node representations and max - min aggregation.
- CONCATREP: Concatenates projection outputs with feedforward layers for richer node/edge representations.
- DIV2MEAN: Halves node representations, using mean aggregation instead of max.
Evaluated across all 30 CLRS tasks:
Table 1: EvoPrompting-discovered models outperform the Triplet-GMPNN baseline on many CLRS tasks, often improving OOD accuracy without increasing model size.
Results were stunning: EvoPrompting set new state-of-the-art on 21 of 30 tasks. On Bubble Sort, accuracy jumped from 67.7% to 88.9%. Gains often came without size increases—proof it’s finding better designs, not just bigger models.
Conclusion: A New Chapter for AI-Assisted Design
EvoPrompting isn’t just another NAS method. It’s a blueprint for guiding the creative potential of LLMs:
- Search an open-ended space: Not limited by predefined components—anything expressible in Python is fair game.
- Iterative refinement: Evolutionary loops retain and build on the best ideas.
- Learning from history: Prompt-tuning makes the LLM progressively better at the specific design task.
While demonstrated in neural architecture search, EvoPrompting could be applied anywhere solutions are code/text—designing algorithms, discovering chemical compounds, optimizing complex systems. It marks a step toward AI-human collaboration for invention, uncovering designs we might never find alone.