Introduction

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4, Claude, and Gemini have become ubiquitous. They possess impressive reasoning capabilities, yet they remain prone to hallucinations, biases, and reasoning errors. For researchers and engineers, the standard remedy for these errors is to fine-tune or steer the model.

However, a significant barrier exists: most state-of-the-art models are black boxes. We interact with them via APIs, sending a prompt and receiving text. We do not have access to the model’s weights, gradients, or often even the output token probabilities (logits). This opacity makes traditional adaptation methods—which rely on accessing internal model states—impossible.

How do we “fix” a model we cannot touch?

Recent research has proposed a novel framework called CoBB (Correct for Black-Box LLMs). This approach treats the black-box model as a generator of imperfect reasoning, and trains a separate, smaller “Adaptation Model” to correct those errors. What makes CoBB distinct is its efficiency: it does not require expensive iterative verification processes during inference, nor does it require access to the black-box model’s probabilities.

In this post, we will deconstruct the CoBB paper, exploring how it uses genetic algorithms to curate training data and contrastive learning to teach a smaller model how to fix the mistakes of a larger one.

The Problem with Black-Box Adaptation

To understand why CoBB is necessary, we must look at the limitations of existing adaptation methods. When you want to improve an LLM’s performance on a specific task (like mathematical reasoning or avoiding bias) without access to its weights, you generally have three options:

  1. Probability-Based Adaptation: You train a small model to adjust the output probabilities of the large model. This fails if the API doesn’t return probabilities.
  2. Verifier/Reranking Methods: You sample many answers from the LLM, use a trained “verifier” to score them, and pick the best one. While effective, this is computationally expensive because it requires generating multiple outputs and running a verification step for every single query.
  3. The CoBB Approach: You train a separate, efficient model that takes the black-box LLM’s initial reasoning and transforms it into a correct output in a single pass.

The figure below illustrates these differences. Note how CoBB (c) offers a streamlined inference process compared to the probability-dependent (a) or cost-heavy (b) approaches.

Figure 1: Three adaptation approaches. (a) Probability-based adaptation, which relies on token probabilities. (b) Verifier/reranking, which samples multiple outputs and verifies them (costly). (c) The proposed CoBB method, which uses a direct seq2seq adaptation model.

The CoBB Framework

The core idea of CoBB is seq2seq adaptation. We treat the original, potentially flawed reasoning of the black-box model (\(y_o\)) as the input, and the corrected reasoning (\(y_a\)) as the target output.

The goal is to train an adaptation model, denoted as \(\pi_\theta\) (parameterized by \(\theta\)), to perform this mapping:

Equation 1: The adaptation model distribution, \(\pi_\theta(\mathbf{y}_a \mid \mathbf{x}, \mathbf{y}_o)\).

Here, \(\mathbf{x}\) is the question, and \(\mathbf{y}_o\) is the original output from the black-box model. The adaptation model is initialized from a smaller, open-source LLM (such as Mistral-7B).
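To make this mapping concrete, here is a minimal inference-time sketch. It assumes a Hugging Face causal LM as the adaptation model and a simple prompt template; both the checkpoint name and the template are illustrative, not the paper’s exact setup (in practice you would load the trained adaptation model’s weights).

```python
# Minimal single-pass correction sketch (illustrative, not the paper's exact code).
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; a real deployment would load the trained adaptation model.
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def correct(question: str, original_reasoning: str) -> str:
    # Condition on the question x and the black-box output y_o,
    # then decode the corrected reasoning y_a in a single pass.
    prompt = (
        f"Question: {question}\n"
        f"Original reasoning: {original_reasoning}\n"
        f"Corrected reasoning:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```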

The workflow consists of three distinct phases:

  1. Collection: Gathering reasoning chains from the black-box model.
  2. Dataset Construction: Using a genetic algorithm to select the most informative positive and negative pairs.
  3. Training: Using a contrastive learning objective to teach the model to prefer correct reasoning over incorrect reasoning.

Figure 2: An overview of CoBB. Step 1: Collect reasoning. Step 2: Subsample pairs using a genetic algorithm. Step 3: Train the adaptation model using contrastive likelihoods.

Phase 1: Collecting and Labeling Reasonings

First, the system needs raw material. For each question \(\mathbf{q}\) in a dataset (such as GSM8K for math or StrategyQA for reasoning), the researchers query the black-box model \(\mathcal{M}\) multiple times using Chain-of-Thought (CoT) prompting.

Equation 3: Sampling \(K\) Chain-of-Thought reasonings from the black-box model, \(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(K)} \sim \mathcal{M}(\cdot \mid \mathbf{q})\).

For each question, they generate \(K\) different reasoning paths. Using ground-truth labels provided by the dataset, these paths are sorted into two buckets:

  • Positive Set (\(\mathcal{Y}_{pos}\)): Reasoning chains that lead to the correct answer.
  • Negative Set (\(\mathcal{Y}_{neg}\)): Reasoning chains that lead to an incorrect answer.

If the model is so strong that it produces no errors for a question, errors are forced by sampling reasonings from other questions. If it is so weak that it produces no correct answers, “answer-augmented prompting” (giving the model the ground-truth answer and asking it to justify it) is used to generate synthetic positive examples.
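As a rough sketch of this collection step (the `query_black_box` and `extract_answer` helpers below are hypothetical stand-ins for the API call and answer parsing; the value of \(K\) is illustrative):

```python
K = 10  # number of CoT samples per question (illustrative)

def collect_reasonings(question, gold_answer, query_black_box, extract_answer):
    """Sample K CoT reasonings from the black-box model and bucket them by correctness."""
    positives, negatives = [], []
    for _ in range(K):
        reasoning = query_black_box(question)  # CoT-prompted, sampled with temperature > 0
        bucket = positives if extract_answer(reasoning) == gold_answer else negatives
        bucket.append(reasoning)
    return positives, negatives
```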

Phase 2: Optimizing the Dataset via Genetic Algorithms

This is arguably the most technically interesting part of the paper. Once we have a pile of correct and incorrect reasonings for a question, how do we train the model?

A naive approach would be to pair every correct reasoning with every incorrect reasoning and tell the model “prefer A over B.” However, the number of combinations grows quadratically. If you have 10 correct and 10 incorrect outputs, that’s 100 pairs per question. For a large dataset, this becomes computationally unmanageable and contains a lot of redundant information.

We need to subsample these pairs. We want a small subset of pairs that is statistically representative of the entire collection of all possible pairs.

The Optimization Objective

The researchers formulate this as minimizing the 2-Wasserstein distance between the likelihood distribution of the full set (\(\mathcal{P}\)) and the likelihood distribution of the selected subset (\(\mathcal{P}_{sub}\)).

First, let’s define the “value” of a pair. For any pair of positive (\(\mathbf{y}_p\)) and negative (\(\mathbf{y}_n\)) reasoning, the adaptation model \(\pi_\theta\) assigns a likelihood difference:

Equation 4: The set of likelihood differences between positive and negative reasonings.

The goal is to find a subset \(\mathcal{P}_{sub}\) such that the statistical properties (mean and variance) of these likelihood differences match the full set as closely as possible. The distance metric used is:

Equation 5: The 2-Wasserstein distance metric comparing the mean and variance of the subset against the full set.

Minimizing this distance ensures that the small subset we train on provides roughly the same “learning signal” as training on the full set of pairs would.
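As a concrete illustration, if we assume (as the mean-and-variance description suggests) that both distributions are treated as one-dimensional Gaussians, the 2-Wasserstein distance has a simple closed form, and the objective can be evaluated like this:

```python
import numpy as np

def likelihood_differences(logp_pos, logp_neg):
    """All pairwise differences between positive and negative reasoning log-likelihoods
    under the adaptation model (assumed here to be per-sequence log-probabilities)."""
    return np.array([lp - ln for lp in logp_pos for ln in logp_neg])

def w2_distance(full_diffs, subset_diffs):
    """Squared 2-Wasserstein distance between Gaussians fitted to the two sets."""
    mu_f, sigma_f = full_diffs.mean(), full_diffs.std()
    mu_s, sigma_s = subset_diffs.mean(), subset_diffs.std()
    return (mu_f - mu_s) ** 2 + (sigma_f - sigma_s) ** 2
```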

The Genetic Algorithm

Selecting the optimal subset is an NP-hard problem. To solve this efficiently, the authors employ a Genetic Algorithm.

Genetic algorithms mimic natural selection. They maintain a population of candidate solutions (subsets), perturb them (mutate), and select the ones that minimize the objective function (the Wasserstein distance).

Equation 10: The genetic algorithm function call.

The algorithm iterates for \(T\) steps, progressively refining the subset until it finds a combination of reasoning pairs that best represents the full distribution. This optimized subset is then added to the final training dataset \(\mathcal{D}\).

Equation 11: Constructing the final dataset by uniting the question with the optimized subset.
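The paper’s exact genetic-algorithm variant has its own operators and hyperparameters, but the basic idea — maintain a population of candidate subsets, mutate them, and keep the ones with the smallest Wasserstein distance — can be sketched as follows, reusing the `w2_distance` helper above. Here `diffs_of(indices)` is a hypothetical helper returning the likelihood differences of the selected pairs, and the population size, subset size, and mutation scheme are all illustrative assumptions.

```python
import random

def genetic_subset_selection(all_pairs, full_diffs, diffs_of, subset_size=8,
                             population=20, generations=50):
    """Evolve index subsets that minimize the W2 distance to the full pair set.
    Assumes there are more candidate pairs than subset_size."""
    n = len(all_pairs)
    pop = [random.sample(range(n), subset_size) for _ in range(population)]
    for _ in range(generations):
        # Score every candidate subset by how well it matches the full distribution.
        scored = sorted(pop, key=lambda idx: w2_distance(full_diffs, diffs_of(idx)))
        survivors = scored[: population // 2]
        # Mutate survivors by swapping one selected pair for an unselected one.
        children = []
        for parent in survivors:
            child = parent.copy()
            child[random.randrange(subset_size)] = random.choice(
                [i for i in range(n) if i not in child]
            )
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda idx: w2_distance(full_diffs, diffs_of(idx)))
```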

Phase 3: Learning to Correct by Contrasting Likelihoods

With the dataset constructed, we move to training the adaptation model \(\pi_\theta\).

Standard fine-tuning uses Supervised Fine-Tuning (SFT), which simply maximizes the likelihood of the correct output given the input:

Equation 2: The standard Supervised Fine-Tuning (SFT) loss, \(\mathcal{L}_{\text{SFT}} = -\log \pi_\theta(\mathbf{y}_a \mid \mathbf{x}, \mathbf{y}_o)\).

However, SFT only teaches the model what to do. It doesn’t explicitly teach the model what to avoid. Since CoBB has explicit pairs of correct and incorrect reasonings (thanks to Phase 2), the authors utilize a contrastive objective called Odds Ratio Preference Optimization (ORPO).

The Odds Ratio Loss

The training objective is a combination of standard SFT and the Odds Ratio (OR) loss. The OR loss forces the model to increase the odds of generating the positive reasoning (\(\mathbf{y}_p\)) relative to the negative reasoning (\(\mathbf{y}_n\)).

The odds of generating a sequence \(\mathbf{y}\) are defined as \(\frac{\pi_\theta(\mathbf{y}|\mathbf{x})}{1-\pi_\theta(\mathbf{y}|\mathbf{x})}\). The loss function is:

Equation 14: The Odds Ratio (OR) loss, \(\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(\mathbf{y}_p \mid \mathbf{x})}{\text{odds}_\theta(\mathbf{y}_n \mid \mathbf{x})}\right)\).

The total training loss combines these two:

Equation 13: The total training loss, \(\mathcal{L} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}\).

By using \(\lambda\) to weight the OR loss, the model learns to discriminate sharply between valid and invalid reasoning paths while still learning to generate the corrected output.
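A minimal sketch of this combined objective, assuming ORPO-style odds computed from length-normalized sequence log-probabilities (the paper’s exact normalization and \(\lambda\) value may differ):

```python
import torch
import torch.nn.functional as F

def cobb_loss(logp_pos, logp_neg, lam=0.5):
    """Combined SFT + odds-ratio loss.
    logp_pos / logp_neg: length-normalized log-likelihoods of the positive and
    negative reasonings under the adaptation model, shape (batch,). lam is illustrative."""
    # SFT term: maximize the likelihood of the positive (corrected) reasoning.
    sft_loss = -logp_pos.mean()
    # Odds-ratio term: log odds(y) = log p - log(1 - p), computed in log space.
    log_odds_pos = logp_pos - torch.log1p(-torch.exp(logp_pos))
    log_odds_neg = logp_neg - torch.log1p(-torch.exp(logp_neg))
    or_loss = -F.logsigmoid(log_odds_pos - log_odds_neg).mean()
    return sft_loss + lam * or_loss
```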

Visualizing the Impact of Contrastive Learning

Does this complex loss function actually matter? The results suggest it is crucial.

The figure below compares the training dynamics. In (a), without contrastive training, the likelihood of the negative reasoning (red line) actually increases alongside the positive reasoning—the model is just learning to generate text, not necessarily correct text. In (b), with the CoBB objective, the model successfully pushes the positive likelihood up while suppressing the negative likelihood.

Figure 3: Training dynamics. (a) Without contrastive loss, negative reasoning likelihood increases. (b) With contrastive loss, negative reasoning is suppressed. (c) Ablation on the lambda hyperparameter.

Experimental Results

The researchers evaluated CoBB on four diverse QA benchmarks: StrategyQA (reasoning), GSM8K (math), TruthfulQA (safety/hallucination), and ScienceQA. They used GPT-3.5-Turbo as the target black-box model and Mistral-7B-v2 as the adaptation model.

Accuracy Improvements

CoBB significantly improved performance across the board. In the table below, observe the difference between the “Target Black-box LLM” (the original GPT-3.5) and “CoBB (Ours).”

  • GSM8K: Accuracy jumped from 76.25% to 78.59% (Average) and 85.14% (Majority Vote).
  • ScienceQA: Accuracy improved from 81.24% to 88.00%.

Crucially, CoBB outperformed BBox-Adapter, the previous state-of-the-art method that relies on expensive beam search and verification.

Table 1: Main experimental results comparing CoBB against baselines like SFT, Distillation, and BBox-Adapter across four datasets.

Cost Efficiency

One of the strongest arguments for CoBB is economic. BBox-Adapter requires querying the black-box model and a verifier multiple times during inference (beam search). CoBB, once trained, is a single-pass model.

As shown in Table 3, CoBB reduces evaluation costs significantly—requiring roughly 1/10th the cost of BBox-Adapter for inference on datasets like StrategyQA.

Table 3: Cost efficiency comparison. CoBB is significantly cheaper in both training and evaluation compared to BBox-Adapter.

Transferability: Generalizing to Other LLMs

A fascinating question arises: If we train a CoBB adapter using data from GPT-3.5, can we use it to fix the errors of other models, like Claude-3 or Llama-2?

The answer appears to be yes. The researchers applied the adapter (trained on GPT-3.5 data) to the outputs of Claude-3-Haiku, Mistral-7B, Phi-3, and Gemma.

Table 2: Transferability results. The adaptation model trained on GPT-3.5 successfully improves the performance of other models like Claude-3 and Phi-3.

For example, Claude-3-Haiku’s performance on StrategyQA improved from 72.05% to 76.42% (Majority Vote) simply by passing its outputs through the CoBB adapter. This suggests that the adaptation model learns general principles of “correct reasoning” for a task, rather than just memorizing the specific quirks of GPT-3.5.

Qualitative Analysis

What does a “correction” actually look like? In the example below from ScienceQA, the original model (GPT-3.5) correctly identifies Newton’s third law but incorrectly identifies the force direction, leading to a wrong answer. CoBB takes this reasoning, identifies the logical flaw regarding the direction of the magnetic pull, and rewrites it to reach the correct conclusion.

Figure 7: Qualitative examples showing CoBB correcting reasoning errors in ScienceQA.

Why This Matters

The CoBB paper presents a compelling step forward in the “Control” phase of Large Language Models. As models become more integrated into critical workflows, the ability to wrap a black-box system in a “correction layer” is invaluable.

Key takeaways for students and practitioners:

  1. Don’t rely on Logits: We can effectively steer models even when we only have access to their text output.
  2. Data Selection is Key: We don’t need all the data; we need a representative subset. Combining statistical measures like the Wasserstein distance with optimization methods such as genetic algorithms is a powerful way to curate training data.
  3. Contrastive Learning: Teaching a model “what is wrong” is just as important as teaching it “what is right,” especially for reasoning tasks.
  4. Efficiency Wins: A method that improves accuracy while cutting inference costs by 90% is highly likely to see real-world adoption over complex, iterative verification methods.

CoBB demonstrates that with clever dataset construction and objective functions, smaller open-source models can effectively act as supervisors for their larger, closed-source counterparts.