Vocabulary acquisition is often the bane of a student’s existence. Whether preparing for the GRE, learning a new language, or mastering medical terminology, the sheer volume of new terms can be overwhelming. Cognitive science has long offered a solution: keyword mnemonics. These are memorable verbal links that connect a new, complex term to a simpler, familiar keyword, followed by an explanation that bridges the two.
For example, to learn the word Benevolent (meaning kind), you might link it to Benefit. The explanation? “A boss who gives their employees benefits is kind—or benevolent.”
While effective, creating these mnemonics is cognitively demanding. It requires creativity, phonetic awareness, and semantic reasoning. Naturally, researchers have looked to Large Language Models (LLMs) to automate this. But there is a catch: simply asking an LLM to “write a mnemonic” doesn’t guarantee it will actually help a student learn.
In this deep dive, we explore a fascinating paper titled “A SMART Mnemonic Sounds like ‘Glue Tonic’”. The researchers introduce SMART (Student Mnemonic Alignment for the Recall of Terms), a system that doesn’t just generate text—it learns from how students actually study.
The paper makes a provocative discovery: what students think helps them learn is often different from what actually helps them learn. By disentangling these two types of preferences and feeding them into a sophisticated Bayesian model, the researchers created an LLM that matches GPT-4 performance at a fraction of the cost.
Let’s unpack how they did it.
The Problem with Automated Mnemonics
Prior to this work, automatic mnemonic generation relied mostly on finding phonetically similar keywords (phonetic linking). However, a good mnemonic needs two parts:
- The Keyword: A word that sounds like the target term but is simpler.
- The Explanation: A memorable narrative linking the keyword’s meaning to the target term’s definition.
LLMs are excellent at writing explanations, but they can hallucinate associations or create links that are confusing rather than helpful. Furthermore, most LLM training (RLHF, Reinforcement Learning from Human Feedback) relies on “Expressed Preferences”—asking a human, “Which of these two responses do you like better?”
In education, “liking” a response doesn’t mean you learned from it. The researchers hypothesized that to build a truly educational model, they needed to align the LLM not just with what students said they liked, but with observed learning outcomes.
The SMART Pipeline: An Overview
The researchers designed a four-stage process to build the SMART model.

As shown in Figure 1, the pipeline is circular and iterative:
- Supervised Fine-Tuning (§2): They start by teaching a base model (LLaMA-2 70B) the basic format of a mnemonic using a curated dataset.
- Preference Collection (§3): They deploy the model in a real flashcard app to gather data from students.
- Bayesian Modeling (§5.1): They use advanced statistics to synthesize different types of feedback (ratings vs. learning speed).
- DPO (§5.2): They use the synthesized signal to refine the model, making it “smarter.”
Let’s break down each stage.
Stage 1: The Initial Model (Supervised Fine-Tuning)
You cannot align a model that produces garbage. Before collecting student feedback, the researchers needed a base model capable of generating decent mnemonics.
Since no large-scale dataset of mnemonics existed, they built one. They scraped MnemonicDictionary, a community website where users submit and vote on mnemonics. However, internet data is noisy. A mnemonic with 5 upvotes and 0 downvotes might be better than one with 100 upvotes and 90 downvotes, but simple ratios are misleading when sample sizes vary.
To solve this, they used a Bayesian approach to estimate the “true quality” (\(q_i\)) of a mnemonic based on upvotes (\(v_{u,i}\)) and downvotes (\(v_{d,i}\)).

They modeled the quality \(q_i\) as a Beta distribution. This allows for a “prior” belief (that about 20% of mnemonics are high quality) and updates that belief based on the observed votes. They selected the top 1,000 highest-quality mnemonics to fine-tune LLaMA-2 70B.
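To make that filtering step concrete, here is a minimal Python sketch, assuming a conjugate Beta-Binomial update with a prior whose mean is 20%; the prior strength and the top-k selection rule are illustrative assumptions, not the paper's reported values.

```python
from dataclasses import dataclass

# Hypothetical prior: mean ALPHA / (ALPHA + BETA) = 0.2, encoding the belief
# that roughly 20% of scraped mnemonics are high quality.
ALPHA, BETA = 2.0, 8.0

@dataclass
class Mnemonic:
    term: str
    text: str
    upvotes: int
    downvotes: int

def posterior_mean_quality(m: Mnemonic) -> float:
    """Posterior mean of q_i after updating the Beta prior with observed votes."""
    return (ALPHA + m.upvotes) / (ALPHA + BETA + m.upvotes + m.downvotes)

def top_k(mnemonics: list[Mnemonic], k: int = 1000) -> list[Mnemonic]:
    """Keep the k mnemonics with the highest estimated 'true quality' q_i."""
    return sorted(mnemonics, key=posterior_mean_quality, reverse=True)[:k]
```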
The training objective was standard Cross-Entropy Loss (\(\mathcal{L}_{CE}\)), teaching the model to predict the next token in a valid mnemonic sequence:

\[ \mathcal{L}_{CE} = -\sum_{t=1}^{|m|} \log p\!\left(m_t \mid m_{<t}, v\right) \]
This resulted in a model, \(p_0(m|v)\), that could take a vocabulary term \(v\) and generate a mnemonic \(m\).
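For intuition, a single fine-tuning step could look like the sketch below. The checkpoint, prompt template, and loss call are illustrative assumptions (a smaller LLaMA-2 variant is used for readability), not the paper's exact training setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper fine-tunes LLaMA-2 70B on its curated data.
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

term = "benevolent"
mnemonic = "Benevolent sounds like 'benefit'. A boss who gives benefits is kind, or benevolent."

# Hypothetical prompt template: condition on the term, learn to emit the mnemonic.
text = f"Term: {term}\nMnemonic: {mnemonic}{tokenizer.eos_token}"
batch = tokenizer(text, return_tensors="pt")

# Passing input_ids as labels gives the standard next-token cross-entropy loss.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```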
Stage 2: Collecting Preferences in the Wild
This is where the research breaks new ground. Instead of using paid crowd-workers to look at static text, the researchers built a fully functional web-based flashcard app.
They recruited 45 students preparing for exams like the GRE. The students were paid to study vocabulary terms using the app. When a student failed to recall a definition, the app would intervene by showing a mnemonic generated by the model.

This setup allowed the researchers to collect two distinct categories of feedback: Expressed Preferences and Observed Preferences.
1. Expressed Preferences (What they say)
These are the standard metrics used in LLM alignment.
- Likert Ratings: After seeing a mnemonic, students could rate it from 1 to 5 stars.

- Pairwise Comparisons: If a student eventually answered correctly, they were sometimes shown two mnemonics side-by-side and asked, “Which mnemonic do you think would help you learn better?”

2. Observed Preferences (What they do)
This is the hidden signal. Because the app tracked every interaction, the researchers could calculate the Learning Curve.
- Metric: The number of “turns” (attempts) a student needed to correctly recall the definition after seeing the mnemonic (see the sketch after this list).
- If Mnemonic A allowed a student to learn the word in 1 turn, but Mnemonic B took 5 turns, Mnemonic A is objectively more effective for short-term recall, regardless of how the student “rated” it.
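Computing this observed signal from the interaction log is straightforward; here is a minimal sketch, assuming a hypothetical per-term list of correct/incorrect attempts recorded after the mnemonic was shown:

```python
def turns_to_recall(attempts: list[bool]) -> int | None:
    """Turns needed until the first correct recall after the mnemonic was shown.
    `attempts` is the sequence of answer outcomes for one term; returns None if
    the student never recalled the definition during the session."""
    for turn, correct in enumerate(attempts, start=1):
        if correct:
            return turn
    return None

turns_to_recall([True])                              # -> 1 (like Mnemonic A above)
turns_to_recall([False, False, False, False, True])  # -> 5 (like Mnemonic B above)
```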
The Conflict: Students Don’t Know Best
Here is the crux of the paper. You might assume that if a student rates a mnemonic 5 stars, they learned the word quickly.
The data says otherwise.

As shown in Figure 5, the correlation between the Likert rating and the iterations until recall is essentially zero (\(r = -0.06\)).
- Expressed Preferences (\(y_{pair}\), \(y_{rate}\)) measure what users think is helpful (or perhaps what is entertaining).
- Observed Preferences (\(y_{learn}\)) measure actual cognitive utility.
The agreement between these signals is startlingly low. As seen in Table 1 below, while pairwise and rating preferences agree 67.5% of the time, the agreement between ratings and actual learning outcomes drops to near 50%—essentially random chance.

This implies that optimizing an LLM solely on human ratings (the standard RLHF approach) might result in models that users like but that fail to achieve the educational goal.
Stage 3: The Bayesian Model
The researchers now faced a dilemma. They had conflicting signals. They didn’t want to discard the “Expressed” preferences entirely—if a mnemonic is effective but offensive or bizarre, students might hate it (and rightly so). But they needed to prioritize learning.
To solve this, they built a Hierarchical Bayesian Model. Instead of treating the votes as the truth, they treated the “True Effectiveness” of a mnemonic as a latent (hidden) variable, \(\theta\).
They assumed every mnemonic has an effectiveness score \(\theta\), distributed uniformly:

\[ \theta_i \sim \mathrm{Uniform}(0, 1) \]
The model views the three data sources (Pairwise, Ratings, Learning) as noisy observations generated by this hidden effectiveness score.
Modeling Pairwise Choices: If Mnemonic A is more effective than Mnemonic B (\(\theta_A > \theta_B\)), the probability of a user choosing A is modeled via a sigmoid function and a Bradley-Terry model (a standard model for competitive rankings):

\[ p(A \succ B) = \sigma(\theta_A - \theta_B) = \frac{1}{1 + e^{-(\theta_A - \theta_B)}} \]
Modeling Ratings: Similarly, a higher \(\theta\) should lead to a higher distribution of star ratings.

Modeling Learning (The Geometric Approach): This is the most clever modeling choice. They treated the “turns to learn” as a series of failures followed by a success. This mathematically fits a Geometric Distribution. If a mnemonic is highly effective (high \(\theta\)), the probability of success on any given turn is high, meaning the number of turns (\(t_j\)) will be low.

\[ p(t_j \mid \theta) = (1 - \theta)^{\,t_j - 1}\,\theta \]
By running this model (using sampling methods like NUTS), the researchers could infer a single, synthesized “Effectiveness” score for every mnemonic (and thus a winner for each mnemonic pair), aggregating the wisdom of ratings with the reality of learning speed.
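To make the structure concrete, here is a minimal sketch of such a hierarchical model in NumPyro. It is a simplification, not the paper's exact specification: the ratings likelihood is omitted, and the priors, variable names, and the link between \(\theta\) and the geometric success probability are assumptions.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def effectiveness_model(n_mnemonics, pair_a, pair_b, choice, learn_idx, failures):
    # Latent "true effectiveness" of each mnemonic, matching the uniform prior above.
    theta = numpyro.sample("theta", dist.Uniform(0.0, 1.0).expand([n_mnemonics]))

    # Pairwise choices: Bradley-Terry, P(A chosen) = sigmoid(theta_A - theta_B).
    numpyro.sample("choice", dist.Bernoulli(logits=theta[pair_a] - theta[pair_b]),
                   obs=choice)

    # Turns to learn: geometric in the number of failed recalls before success,
    # with the per-turn success probability tied to effectiveness (an assumption).
    numpyro.sample("turns", dist.Geometric(probs=theta[learn_idx]), obs=failures)

# Toy data: 3 mnemonics, 2 pairwise judgments, 3 observed learning curves.
pair_a, pair_b = jnp.array([0, 1]), jnp.array([1, 2])
choice = jnp.array([1, 0])               # 1 = the first mnemonic of the pair won
learn_idx = jnp.array([0, 1, 2])
failures = jnp.array([0, 2, 4])          # failed attempts before the first success

mcmc = MCMC(NUTS(effectiveness_model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), 3, pair_a, pair_b, choice, learn_idx, failures)
theta_hat = mcmc.get_samples()["theta"].mean(axis=0)  # synthesized effectiveness
```

Posterior means of \(\theta\) (or posterior probabilities that \(\theta_A > \theta_B\)) can then supply the winner and loser labels needed for the next stage.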
Stage 4: Direct Preference Optimization (DPO)
With the Bayesian “effectiveness” labels in hand, the researchers moved to the final stage: Alignment.
They used Direct Preference Optimization (DPO). Unlike traditional RLHF, which requires training a separate Reward Model and then using PPO (Proximal Policy Optimization) to update the language model (a complex and unstable process), DPO optimizes the language model directly.
The loss function essentially encourages the model to increase the likelihood of the “winning” mnemonic (\(y_w\)) and decrease the likelihood of the “losing” mnemonic (\(y_l\)), weighted by how much the base model (\(\pi_0\)) already liked them:

\[ \mathcal{L}_{DPO} = -\,\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid v)}{\pi_0(y_w \mid v)} - \beta \log \frac{\pi_\theta(y_l \mid v)}{\pi_0(y_l \mid v)}\right)\right] \]
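A minimal PyTorch sketch of this objective, assuming the summed per-sequence log-probabilities have already been computed under both the policy being trained (\(\pi_\theta\)) and the frozen reference model (\(\pi_0\)):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss over a batch of (winning, losing) mnemonic pairs.
    Each argument is a tensor of summed token log-probs, i.e. log pi(y | v)."""
    margin_w = policy_logp_w - ref_logp_w   # log pi_theta(y_w|v) - log pi_0(y_w|v)
    margin_l = policy_logp_l - ref_logp_l   # log pi_theta(y_l|v) - log pi_0(y_l|v)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()

# Toy batch of two preference pairs derived from the Bayesian effectiveness labels.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
```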
They trained the SMART model on these Bayesian-derived preferences. This effectively “baked in” the combined signal of student ratings and learning outcomes into the weights of LLaMA-2.
Experiments and Results
Did all this math and user study actually result in better mnemonics? The researchers evaluated SMART against baselines, including the un-tuned LLaMA model and GPT-4.
Does Combining Preferences Help?
One major question was whether combining the conflicting signals (Expressed vs. Observed) was better than just using one.
The researchers found that using the Bayesian signal (combining everything) was superior to using just pairwise comparisons. Why?
- Breaking Ties: In many cases, students rated two mnemonics as “Equal” (a tie). Standard alignment throws these data points away. The Bayesian model, however, could look at the learning speed data for those “tied” mnemonics to find a winner.
- Data Augmentation: By resolving ties and missing labels using the other data sources, they increased their effective training data size.
As shown in Table 12 (below), the DPO model aligned with the full signal (\(p_{dpo}\)) significantly outperformed the base model (\(p_0\)) and was directionally better than models trained on partial data.

SMART vs. GPT-4 vs. Humans
The ultimate test is quality. The researchers employed experts (mnemonic researchers) to blindly rate mnemonics generated by:
- SMART (The aligned LLaMA-2 model)
- TransPhoner (A previous state-of-the-art non-LLM system)
- GPT-4 (10-shot prompted)
- Human (A professional creative writer)
The Results (Figure 6):

- SMART matches GPT-4: Despite being a smaller, open-source model (LLaMA-2 70B) compared to the proprietary giant GPT-4, SMART produced mnemonics of equal quality. This validates the power of aligning small models with domain-specific, high-quality human feedback.
- Humans are still champions: The professional writer consistently scored higher on Simplicity and Imageability.
- Simplicity: LLMs often choose keywords that are themselves obscure (e.g., explaining Pithy using Pythagoras). Humans choose simple words (e.g., Pithy \(\rightarrow\) Pit).
- Imageability: Humans write explanations that evoke vivid mental pictures, which is crucial for memory. LLMs tend to be more abstract.
What do SMART Mnemonics Look Like?
The model generates concise, two-part mnemonics. Here are a few examples of high-quality outputs from the final model:

- Term: Lionized
- Mnemonic: Lionized sounds like “lion-eyes,” envisioning a lion being admired for its eyes. Lionized means to be admired or treated like a celebrity.
- Term: Escalate
- Mnemonic: Escalate sounds like “escalator,” which goes up, representing an increase or rise.
These examples show the model successfully grasping the “sounds like” component and the semantic bridge.
Conclusion and Implications
The “SMART” paper is a significant step forward in educational technology for two reasons.
First, it democratizes high-quality instruction. By fine-tuning and aligning open-source models (like LLaMA) to match GPT-4’s performance, it opens the door for cheaper, offline, or private educational tools that don’t rely on expensive API calls to proprietary models.
Second, and perhaps more importantly, it challenges the standard paradigm of “Human Preference” in AI alignment. In education, the customer is not always right. Students often prefer mnemonics that are funny or short, even if they don’t help retention. By introducing Observed Preferences—metrics based on actual performance—the researchers demonstrated a safer, more effective way to align models for human utility.
The future of EdTech isn’t just about models that chat fluently; it’s about models that understand how we learn, even when we don’t understand it ourselves.
Key Takeaways:
- Expressed vs. Observed: What users say they like \(\neq\) what helps them achieve their goals.
- Bayesian Fusion: You can combine “soft” feedback (ratings) and “hard” metrics (learning speed) to create a robust training signal.
- Efficiency: A properly aligned open-source model can rival state-of-the-art closed models in specific domains.
This blog post is based on the research paper “A SMART Mnemonic Sounds like ‘Glue Tonic’: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick” by Balepur et al.