If you have ever interacted with a customer service chatbot, you have likely hit a wall. You ask a question, perhaps phrased slightly differently than the bot expects, or about a topic that feels relevant but is technically “new,” and you get the dreaded response: “I’m sorry, I didn’t understand that.”
This limitation stems from a fundamental design choice in traditional conversational AI: the static ontology. Most systems are built on a pre-defined list of things they can understand. If a user’s request falls outside that list, the system fails. But the real world is dynamic. New products launch, new terminologies emerge (think “social distancing” or specific vaccine brand names in 2020), and user needs evolve rapidly.
How do we build agents that don’t just fail when they encounter the unknown, but actually learn from it?
This brings us to the concept of Ontology Expansion (OnExp). In a comprehensive survey paper titled A Survey of Ontology Expansion for Conversational Understanding, researchers Liang et al. explore how conversational agents can dynamically update their knowledge base—discovering new user intents and new information slots—without requiring a human to manually reprogram them every time the world changes.
In this deep dive, we will unpack how OnExp works, the architecture behind discovering new intents and slots, and what this means for the future of AI.
The Problem: The Closed-World Assumption
To understand the solution, we must first define the bottleneck. Conversational Understanding (CU) modules are the ears and brains of a chatbot. Their job is to parse what you say into structured data.
Traditionally, this is handled via a predefined ontology, which consists of:
- Intents: What is the user trying to do? (e.g., `Book_Flight`, `Check_Balance`).
- Slots: What specific parameters are needed? (e.g., `Destination`, `Date`).
- Values: What are the possible inputs for those slots? (e.g., `London`, `Tomorrow`).
In a standard “Closed-World” setting, the AI treats intent detection as a classification task where the classes are fixed. If you introduce a new intent, the model tries to shove it into an existing box, leading to errors.
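To make this failure mode concrete, here is a minimal sketch of a closed-world intent classifier (toy data and a generic scikit-learn pipeline, purely illustrative): because the label set is fixed at training time, an out-of-scope query is always forced into one of the known classes.

```python
# Minimal illustration of the closed-world assumption: the classifier can only
# ever answer with one of the intents it was trained on. (Toy data, not from the paper.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = [
    "book me a flight to london",      # Book_Flight
    "i need a plane ticket tomorrow",  # Book_Flight
    "what is my account balance",      # Check_Balance
    "how much money do i have left",   # Check_Balance
]
train_intents = ["Book_Flight", "Book_Flight", "Check_Balance", "Check_Balance"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_utterances, train_intents)

# An out-of-scope request: the model has no "reject" option, so it is forced
# into one of the two known boxes, which is exactly the failure mode described above.
print(clf.predict(["are any pfizer or moderna vaccines available right now"]))
```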
The Solution: Ontology Expansion
Ontology Expansion (OnExp) flips the script. It assumes an “Open-World” setting where the ontology is fluid. The system analyzes user utterances to identify:
- Known items: Things it already knows (mapping them correctly).
- Novel items: New intents or values it hasn’t seen before (flagging them for addition to the ontology).

As shown in Figure 1, consider a system built for booking flights (the “Predefined Ontology” on the left). A user suddenly asks, “Can you tell me if there are any Pfizer or Moderna vaccines available right now?”
A traditional bot would crash or misclassify this as a “booking” request. An OnExp system, however, parses the utterance and recognizes that this does not fit existing categories. It identifies a new intent (Check Vaccination Status) and a new slot (Vaccine Brands) with specific values (Pfizer, Moderna). This feedback loop allows the system to update its own internal map of the world.
Mathematically, we can visualize the goal of OnExp as a mapping function. If we have an utterance \(x_i\), we want to map it to an intent (\(o^I\)), a slot (\(o^S\)), and a value (\(o^V\)), even if those items are currently unknown to the system.
\[ f_{\theta}^{OnExp}(\pmb{x}_i) \rightarrow (o_i^I, o_i^S, o_i^V, r), \]
Here, the function must not only classify known items but also output distinct representations for unknown ones, allowing them to be clustered and labeled later.
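In code, you can picture the output of that mapping as a small record per utterance: intent, slot, and value hypotheses plus a flag saying whether each item is already in the ontology. This is a minimal sketch with illustrative names, not an interface defined in the survey.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OnExpPrediction:
    """Hypothetical per-utterance output of an OnExp-style CU module."""
    intent: str             # e.g. "Check_Vaccination_Status"
    slot: Optional[str]     # e.g. "Vaccine_Brand"
    value: Optional[str]    # e.g. "Pfizer"
    is_novel: bool          # True if the item is not yet in the ontology

def update_ontology(known_intents: set, prediction: OnExpPrediction) -> set:
    """Feedback loop: novel intents are added to the ontology for later labeling."""
    return known_intents | {prediction.intent} if prediction.is_novel else known_intents

ontology = {"Book_Flight", "Check_Balance"}
pred = OnExpPrediction("Check_Vaccination_Status", "Vaccine_Brand", "Pfizer", is_novel=True)
print(update_ontology(ontology, pred))
```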
A Taxonomy of Ontology Expansion
The researchers propose a novel taxonomy to organize the chaotic landscape of OnExp research. They divide the field into three primary branches:
- New Intent Discovery (NID): Figuring out what the user wants to do.
- New Slot-Value Discovery (NSVD): Figuring out the details of the request.
- Joint OnExp: Doing both simultaneously.

Let’s break down the two most developed branches: NID and NSVD.
1. New Intent Discovery (NID)
New Intent Discovery is about clustering. If a chatbot receives 1,000 queries that it doesn’t recognize, NID algorithms attempt to group those queries. If 500 of them are similar, that group likely represents a specific new intent.
The survey categorizes NID methods based on how much “help” (supervision) the model gets.
Unsupervised NID
This is the hardest setting. The model has no labeled data. It must look at raw text and figure out the structure purely based on semantics.
- Statistical Methods: Early approaches used K-Means clustering or looked at click-through logs (e.g., if different queries lead to the same URL, they likely have the same intent).
- Neural Approaches: Modern methods use Deep Neural Networks (DNNs) to create vector representations (embeddings) of sentences. If two sentences are close to each other in vector space, they are clustered together. Techniques like Deep Embedded Clustering (DEC) iteratively refine these clusters.
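Here is a minimal sketch of the neural clustering variant, assuming the sentence-transformers and scikit-learn libraries; the encoder name, utterances, and cluster count are illustrative, and estimating the number of clusters is itself an open research problem.

```python
# Unsupervised NID sketch: embed unlabeled utterances, then cluster them.
# Each resulting cluster is a candidate new intent to be inspected and named.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

utterances = [
    "are pfizer vaccines available near me",
    "can i get a moderna shot this week",
    "cancel my flight to berlin",
    "i want to call off my booking",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here
embeddings = encoder.encode(utterances)            # one vector per utterance

# The number of clusters is usually unknown in practice.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
for text, cluster_id in zip(utterances, clusters):
    print(cluster_id, text)
```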
Zero-Shot NID
Here, the model is trained on a set of known intents but must identify new ones it has never seen, often using semantic descriptions of the labels.
- Transfer Learning: The assumption is that if a model learns the structure of intent detection in one domain (e.g., travel), it can apply that logic to a new domain (e.g., banking).
- Transformer-based Methods: BERT and similar models excel here. They can map utterances to an “intent label semantic space.” By checking if an utterance falls close to a known label or far away, the model determines if it is a known or new intent.
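A minimal sketch of that idea, under the same assumptions as before (generic sentence encoder, illustrative threshold): the utterance is compared against natural-language descriptions of the known labels, and a low best similarity is treated as evidence of a new intent.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works

# Natural-language descriptions of the intents the system already knows.
known_labels = {
    "Book_Flight": "the user wants to book a flight",
    "Check_Balance": "the user wants to check their account balance",
}
label_names = list(known_labels)
label_vecs = encoder.encode(list(known_labels.values()))   # one vector per label

def detect_intent(utterance: str, threshold: float = 0.4) -> str:
    """Place the utterance in the label semantic space; if it lands far from
    every known label description, flag it as a new intent."""
    u = encoder.encode([utterance])[0]
    sims = label_vecs @ u / (np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(u))
    best = int(np.argmax(sims))
    return label_names[best] if sims[best] >= threshold else "NEW_INTENT"

print(detect_intent("book me a seat on the next flight to rome"))  # known intent
print(detect_intent("are there any pfizer vaccines in stock"))     # likely NEW_INTENT
```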
Semi-Supervised NID
This is currently the most popular and effective approach. The model has a small set of labeled data (known intents) and a massive pile of unlabeled data (containing both known and new intents).
- The Challenge: The model must classify the knowns correctly while simultaneously clustering the unknowns. It needs to avoid “overfitting” to the known classes.
- Techniques:
- Pairwise Constraints: Using known data to teach the model what “similar” and “dissimilar” sentences look like (see the sketch after this list).
- Contrastive Learning: Pulling similar queries closer in vector space and pushing dissimilar ones apart.
- LLM Integration: Recent hybrid methods use Large Language Models (like GPT-4) to generate pseudo-labels or clean up the clusters formed by smaller models.
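As a concrete illustration of the pairwise-constraint and contrastive ideas, here is a minimal PyTorch sketch (not the exact loss of any surveyed method): labeled pairs supervise a similarity score so that same-intent utterances are pulled together and different-intent ones pushed apart, which makes the unlabeled pool easier to cluster afterwards.

```python
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(emb_a: torch.Tensor,
                             emb_b: torch.Tensor,
                             same_intent: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on the cosine similarity of utterance pairs:
    pairs with the same intent are pushed toward similarity 1, others toward 0."""
    sim = F.cosine_similarity(emb_a, emb_b)              # values in [-1, 1]
    prob_same = ((sim + 1.0) / 2.0).clamp(0.0, 1.0)      # squash into [0, 1]
    return F.binary_cross_entropy(prob_same, same_intent)

# Toy usage with random vectors standing in for encoder outputs.
emb_a = torch.randn(8, 128, requires_grad=True)
emb_b = torch.randn(8, 128, requires_grad=True)
same_intent = torch.randint(0, 2, (8,)).float()
loss = pairwise_constraint_loss(emb_a, emb_b, same_intent)
loss.backward()  # in a real setup this gradient would update the encoder
print(float(loss))
```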
2. New Slot-Value Discovery (NSVD)
While NID looks at the whole sentence, NSVD zooms in. It tries to extract specific spans of text that represent entities. This is harder because a single sentence can contain multiple slots (e.g., “Book a flight to Paris on Friday”).
Unsupervised NSVD
Without labels, how does a model know “Paris” is a destination?
- Linguistic Patterns: Models use frame-semantic parsers (tools that understand sentence structure) to guess which nouns act as objects or modifiers.
- Slot Schema Induction: Advanced methods use clustering to group extracted values. If “Paris,” “London,” and “Berlin” appear in similar contexts, they form a “City” slot cluster.
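A minimal sketch of that grouping step, with hand-picked candidate values standing in for the spans a parser would propose (the encoder and cluster count are illustrative):

```python
# Slot schema induction sketch: embed candidate value spans and group them;
# each resulting group is a candidate slot such as "City" or "Date".
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

candidates = ["Paris", "London", "Berlin", "Friday", "next Monday"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works
embeddings = encoder.encode(candidates)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
for value, slot_cluster in zip(candidates, labels):
    print(slot_cluster, value)   # expect cities in one cluster, dates in the other
```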
Partially Supervised NSVD
This area is nuanced because “partial” supervision can mean different things. The researchers identify four sub-settings:
- No New Slots (New Values Only): The slot `Restaurant_Name` exists, but the user mentions a new restaurant, “The Gourmet Kitchen,” which isn’t in the database. The system needs to recognize that this new value belongs to the existing slot.
- New Slot Type Known: The system knows there should be a slot for `Vaccine_Brand`, but it has never seen examples of it in training.
- New Slot Description Known: The system is given a description, like “The brand of the vaccine,” and must find the words in the sentence that match that description (similar to Reading Comprehension tests; see the sketch after this list).
- New Slot Unknown: The most “open” setting. The system must figure out that a new concept exists and extract the values for it without prior warning.
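The “New Slot Description Known” setting can be sketched with an off-the-shelf extractive question-answering model from the transformers library (the model choice and wording are illustrative, not the specific approach of any surveyed paper): the slot description is posed as a question, and the answer span becomes a candidate value.

```python
# "New Slot Description Known" sketch: treat the slot description as a question
# and extract the answer span from the utterance, reading-comprehension style.
from transformers import pipeline

qa = pipeline("question-answering")  # any extractive QA model will do

utterance = "Can you tell me if there are any Pfizer or Moderna vaccines available right now?"
slot_description = "The brand of the vaccine"

result = qa(question=slot_description, context=utterance)
print(result["answer"])  # candidate value for the new slot, e.g. "Pfizer"
```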
Measuring Success: Experiments and Metrics
How do we know if these expansions are accurate? The paper outlines specific metrics for intents and slots.
Metrics for Intents (NID)
For intents, we are mostly looking at clustering performance. Standard accuracy is not enough because the “labels” for new clusters are arbitrary (the model might call a cluster “Group 1” while the ground truth is “Check Balance”).
Researchers use the Hungarian Algorithm to map predicted clusters to true labels to calculate Accuracy (ACC):
\[ ACC = \frac{\sum_{i=1}^{N} \mathbb{1}_{y_i = \mathrm{map}(\hat{y}_i)}}{N}, \]
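A minimal sketch of that computation using SciPy’s implementation of the Hungarian algorithm (assuming integer cluster and label IDs; the toy arrays are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Clustering ACC: find the best one-to-one mapping between predicted
    cluster IDs and true labels, then count how many items it gets right."""
    n_classes = max(y_true.max(), y_pred.max()) + 1
    # counts[i, j] = how many items in predicted cluster i have true label j.
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # The Hungarian algorithm picks the mapping that maximizes the matched count.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])       # cluster IDs are arbitrary; mapping fixes that
print(clustering_accuracy(y_true, y_pred))  # 1.0
```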
They also use the Adjusted Rand Index (ARI), which measures the similarity between two data clusterings (ignoring labels entirely), and Normalized Mutual Information (NMI):
\[ ARI = \frac{\sum_{i,j}\binom{n_{i,j}}{2} - \left[\sum_{i}\binom{u_i}{2}\sum_{j}\binom{v_j}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{u_i}{2} + \sum_{j}\binom{v_j}{2}\right] - \left[\sum_{i}\binom{u_i}{2}\sum_{j}\binom{v_j}{2}\right]/\binom{N}{2}}, \]
\[ NMI(\hat{\pmb{y}}, \pmb{y}) = \frac{2 \cdot I(\hat{\pmb{y}}, \pmb{y})}{H(\hat{\pmb{y}}) + H(\pmb{y})}, \]
Metrics for Slots (NSVD)
For slots, precision and recall are paramount. We are checking if the extracted text span matches the ground truth.
\[ P = \frac{\sum_{i=1}^{n} |\varepsilon_i| P_i}{\sum_{j=1}^{n} |\varepsilon_j|}, \qquad R = \frac{\sum_{i=1}^{n} |M_i| R_i}{\sum_{j=1}^{n} |M_j|}. \]
These are combined into the Span-F1 score:
\[ F1 = \frac{2PR}{P + R}. \]
Finally, simply averaging performance can be misleading. A model might be great at old intents but terrible at new ones. The H-score is used to balance the performance between known classes and novel classes:
\[ \mathrm{H\text{-}score} = \frac{2}{1/\mathrm{Known\;ACC} + 1/\mathrm{Novel\;ACC}}. \]
Analyzing the Results
The survey compares dozens of methods across datasets like BANKING77 (banking queries) and StackOverflow (technical questions).
1. Unsupervised NID: Looking at Table 4, we see a massive leap in performance over time. Standard K-Means clustering performs poorly (around 30-45% accuracy). However, modern methods like IDAS (which uses abstractive summarization) achieve nearly double that performance. This highlights the power of using semantic understanding rather than simple keyword statistics.

2. Semi-Supervised NID: The results in Table 2 are even more impressive. Methods like ALUP (Active Learning with Uncertainty Propagation) reach accuracies in the 80s and 90s. This confirms a key takeaway: even a small amount of labeled data significantly guides the discovery of new intents. The integration of Large Language Models (LLMs) in hybrid methods pushes these boundaries further.

3. New Slot Discovery: Table 3 highlights the difficulty of the NSVD task. On complex datasets like WOZ-hotel, early methods struggled (F1 scores around 17-20). However, newer methods like the Bi-criteria active learning scheme have pushed this up to nearly 70%. This improvement is crucial for extracting detailed information from users without frustration.

Emerging Frontiers and Future Directions
The survey concludes by pointing out that while we have made great progress, we are not there yet. Several frontiers remain largely unexplored:
Early OnExp (The Cold Start Problem)
Most current research assumes we have a large batch of data to analyze (e.g., a month’s worth of chat logs). But in reality, businesses want to spot a new trend (like a sudden service outage or a new viral product) immediately. Early OnExp focuses on identifying new ontological items with minimal examples, preventing them from being drowned out by the noise of common queries.
Multi-modal OnExp
Humans don’t just text; we send screenshots, voice notes, and videos. Current OnExp is text-heavy. The future lies in Multi-modal OnExp, where a system might recognize a new product intent not just from the text “I want to return this,” but by analyzing the attached photo of a damaged item.
Holistic OnExp
Right now, research focuses on the “Understanding” module (CU). But what happens after the intent is discovered? If the bot learns a new intent, does it know how to respond? Holistic OnExp suggests that expansion must integrate with Dialogue Management (policy learning) and Natural Language Generation (response formulation). Finding the intent is only half the battle; serving the user is the ultimate goal.
Conclusion
The shift from static to dynamic ontologies represents a maturation point for Conversational AI. We are moving away from fragile, scripted bots toward resilient agents that learn from the very people they interact with.
As the experimental results show, leveraging pre-trained language models and semi-supervised techniques allows these agents to “expand their universe” with surprising accuracy. While challenges remain—particularly in real-time discovery and multi-modal integration—Ontology Expansion is the key to creating AI that genuinely understands the ever-changing nature of human conversation.