Unlocking the Black Box: A New Algorithm for Learning Deterministic Weighted Automata

In the world of computer science and Natural Language Processing (NLP), we are often faced with powerful “black box” models. We feed them an input, and they give us an output—often a probability score or a classification. But understanding how they arrived at that conclusion is notoriously difficult. This is the realm of automata extraction: the process of reverse-engineering a complex model into a simpler, interpretable Finite State Automaton (FSA).

For decades, Dana Angluin’s \(L^*\) algorithm (1987) has been the gold standard for learning regular languages. However, the original \(L^*\) was designed for Boolean languages—strings are either “in” or “out.” Modern NLP relies on weighted languages, where strings carry probabilities, costs, or scores.

In this post, we will dive deep into a recent paper, “An \(L^*\) Algorithm for Deterministic Weighted Regular Languages,” which bridges this gap. The authors present a faithful extension of Angluin’s algorithm that learns weighted deterministic finite-state automata (WDFSAs). The method is mathematically elegant, works with algebraic structures that support division (semifields), and is guaranteed to find the minimal automaton for the target language.

The Challenge: From Boolean to Weighted

To understand the contribution of this paper, we first need to set the stage with the objects we are trying to learn.

Weighted Regular Languages

In a standard Regular Language, a finite state automaton accepts or rejects a string. In a Weighted Regular Language, the automaton assigns a value (weight) to every string in the universe \(\Sigma^*\).

A Weighted Finite-State Automaton (WFSA) consists of states, transitions, and weights. When a string traverses a path through the automaton, we combine the weights of the transitions.

As illustrated below, the weight that the language \(\mathbf{L}_{\mathcal{A}}\) assigns to a string \(\boldsymbol{p}\) is the sum (\(\oplus\)) over all valid paths \(\pi\) labeled by \(\boldsymbol{p}\), where the weight of a path is the product (\(\otimes\)) of the initial weight, the transition weights, and the final weight.

Definition of the language generated by a weighted automaton.

Here, \(\lambda\) represents the initial weight function, and \(\rho\) represents the final weight function. A path is simply a sequence of transitions:

Definition of a path in an automaton.
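In standard WFSA notation, these two definitions read roughly as follows (my rendering; \(\Pi(\boldsymbol{p})\) denotes the set of paths labeled by \(\boldsymbol{p}\), and \(\sigma(\pi)\), \(\tau(\pi)\) are shorthand for a path’s start and end states):

\[
\pi \;=\; q_0 \xrightarrow{a_1 / w_1} q_1 \xrightarrow{a_2 / w_2} \cdots \xrightarrow{a_n / w_n} q_n,
\qquad
w(\pi) \;=\; w_1 \otimes w_2 \otimes \cdots \otimes w_n,
\]

\[
\mathbf{L}_{\mathcal{A}}(\boldsymbol{p}) \;=\; \bigoplus_{\pi \in \Pi(\boldsymbol{p})} \lambda\big(\sigma(\pi)\big) \otimes w(\pi) \otimes \rho\big(\tau(\pi)\big).
\]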

The Determinism Gap

There are existing algorithms for learning weighted automata, often using methods called “spectral learning.” However, these methods typically learn non-deterministic automata (where multiple paths can exist for the same string). While powerful, non-deterministic models are harder to interpret and analyze.

Furthermore, many previous approaches required the weights to be part of a field (allowing addition, subtraction, multiplication, and division). This paper relaxes that requirement to a semifield (subtraction is not required, but division is). This allows the algorithm to work with a broader class of weights, such as positive real numbers or probabilities.
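Concretely, a semifield is a semiring \((\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})\) in which every element except \(\bar{0}\) has a multiplicative inverse. The non-negative reals \((\mathbb{R}_{\geq 0}, +, \times, 0, 1)\) are a semifield: there is no subtraction, but \(1/x\) exists for every \(x \neq 0\). The tropical semiring \((\mathbb{R} \cup \{\infty\}, \min, +, \infty, 0)\) is another example, with \(-x\) serving as the “multiplicative” inverse of \(x\).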

The goal of this paper is to create an algorithm that:

  1. Learns weighted deterministic finite-state automata (WDFSAs).
  2. Uses active learning (querying an oracle), similar to Angluin’s original method.
  3. Is mathematically guaranteed to find the minimal automaton.

The Core Concept: Homothetic Equivalence

The brilliance of Angluin’s original \(L^*\) algorithm lies in how it decides when two prefixes lead to the “same state.” In the Boolean world, two prefixes \(u\) and \(v\) lead to the same state if they have identical futures—meaning for any suffix \(z\), \(uz\) is in the language if and only if \(vz\) is in the language.

In the weighted world, we cannot demand that future weights be identical. Because weights multiply along a path, a prefix \(u\) might reach a state with accumulated weight \(2\), while prefix \(v\) reaches the same state with accumulated weight \(4\). Their futures are then proportional rather than equal: every future weight for \(v\) is exactly twice the corresponding weight for \(u\).

The authors introduce Homothetic Equivalence to solve this. Two functions (or futures) are equivalent if one is a scalar multiple of the other.

Definition of Homothetic Equivalence.

Building on this, two prefixes \(\mathbf{x}\) and \(\mathbf{z}\) correspond to the same state in the underlying deterministic automaton precisely when their “right languages” (their futures) are homothetically equivalent:

Equivalence relation for right languages.
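In symbols (my rendering), two functions \(f, g\colon \Sigma^* \to \mathbb{K}\) are homothetically equivalent when

\[
f \sim g \;\iff\; \exists\, \alpha \in \mathbb{K} \setminus \{\bar{0}\} \ \text{such that}\ f(\boldsymbol{y}) = \alpha \otimes g(\boldsymbol{y}) \ \text{for all}\ \boldsymbol{y} \in \Sigma^*,
\]

and the right language of a prefix \(\boldsymbol{x}\) is the function mapping each suffix \(\boldsymbol{y}\) to \(\mathbf{L}(\boldsymbol{x}\boldsymbol{y})\).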

This simple yet powerful shift allows the algorithm to group prefixes whose behavior is identical up to a scalar factor into the same state.

The Weighted \(L^*\) Algorithm

The algorithm uses an Oracle (a teacher). The learner can ask two types of questions:

  1. Membership Query: “What is the weight of string \(x\)?”
  2. Equivalence Query: “Here is my hypothesis automaton. Is it correct?” (If not, the Oracle provides a counterexample).

The learner maintains a data structure called an Empirical Hankel Matrix (\(\mathbf{H}\)).

The Empirical Hankel Matrix

Think of the Hankel Matrix as a table of observations.

  • Rows (\(P\)): Represent prefixes (strings leading to a state).
  • Columns (\(S\)): Represent suffixes (future strings).
  • Cells: The value \(\mathbf{H}(p, s)\) is the weight the target language assigns to the concatenated string \(p \cdot s\).

The algorithm tries to organize this matrix so that rows representing the same state are grouped together. As established in the previous section, two rows \(p\) and \(q\) represent the same state if their row vectors are scalar multiples of each other.

Equivalence of rows in the Hankel Matrix.
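As a toy illustration (mine, not the paper’s), take the language over \(\Sigma = \{a\}\) with \(\mathbf{L}(a^n) = (1/2)^{n+1}\). With prefixes \(P = \{\varepsilon, a\}\) and suffixes \(S = \{\varepsilon, a\}\), the entries \(\mathbf{H}(p, s) = \mathbf{L}(p \cdot s)\) are

\[
\mathbf{H} \;=\;
\begin{pmatrix}
1/2 & 1/4 \\
1/4 & 1/8
\end{pmatrix},
\]

and the second row is exactly \(1/2\) times the first, so the prefixes \(\varepsilon\) and \(a\) belong to the same state.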

Building the Automaton from the Matrix

How do we turn a matrix into a machine? The algorithm constructs a “Naïve Hankel Automaton” directly from the data.

1. States: The states are the row indices \(P\). Specifically, each state corresponds to an equivalence class of rows under homothetic equivalence, with one representative row per class.

2. Transitions: To calculate the weight of a transition from state \(p\) on symbol \(a\), we look at the ratio of the “total weight” of the future of \(pa\) to that of \(p\). The authors define \(d_{\mathbf{H}}(p)\) as the sum of the entries in row \(p\):

Definition of d_H, the sum of a row in the Hankel matrix.

The transition weight \(w\) for a symbol \(a\) is derived by dividing the total weight of the destination row by the total weight of the source row:

Calculation of transition weights via division.
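In symbols (my rendering, writing \(\oslash\) for semifield division):

\[
d_{\mathbf{H}}(p) \;=\; \bigoplus_{s \in S} \mathbf{H}(p, s),
\qquad
w\big(p \xrightarrow{a} pa\big) \;=\; d_{\mathbf{H}}(pa) \oslash d_{\mathbf{H}}(p).
\]

In the toy example above, \(d_{\mathbf{H}}(\varepsilon) = 1/2 + 1/4 = 3/4\) and \(d_{\mathbf{H}}(a) = 1/4 + 1/8 = 3/8\), so the self-loop on the single state gets weight \((3/8) \div (3/4) = 1/2\).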

This is why the algebraic structure must be a semifield—we need to be able to divide (or multiply by an inverse) to normalize these weights.

3. Initial and Final Weights: Similarly, the initial and final weights are derived from specific entries in the matrix (such as the column for the empty string \(\varepsilon\)) normalized by the row sums.

Definition of initial weights.

Definition of final weights.
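One natural reconstruction consistent with that description (the paper’s exact form may differ) is

\[
\lambda \;=\; d_{\mathbf{H}}(\varepsilon),
\qquad
\rho(p) \;=\; \mathbf{H}(p, \varepsilon) \oslash d_{\mathbf{H}}(p).
\]

In the toy example, \(\lambda = 3/4\) and \(\rho = (1/2) \div (3/4) = 2/3\), so the automaton assigns \(a^n\) the weight \(3/4 \cdot (1/2)^n \cdot 2/3 = (1/2)^{n+1}\), recovering the target language exactly.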

The Learning Loop

The algorithm iteratively refines the Hankel Matrix until it represents a valid deterministic automaton. This process involves two main properties the matrix must satisfy: Consistency and Closedness.

Here is the high-level algorithm:

Algorithm 1: The Weighted L* Algorithm.
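Since the listing itself appears only as a figure, here is a hedged Python paraphrase of the control flow, specialized to the real semifield (plain floats). The oracle interface (`member`, `equiv`), the helper names, and the tolerance handling are my own illustration, not the paper’s; the inner functions mirror the roles of COMPLETE, MAKECONSISTENT, and MAKECLOSED from Algorithm 2:

```python
from itertools import product

TOL = 1e-9

def scale(u, v):
    """Return alpha such that u == alpha * v entrywise (homothetic test),
    or None if no such scalar exists. Two zero rows count as proportional."""
    if all(abs(b) <= TOL for b in v):
        return 1.0 if all(abs(a) <= TOL for a in u) else None
    a0, b0 = next((a, b) for a, b in zip(u, v) if abs(b) > TOL)
    alpha = a0 / b0
    return alpha if all(abs(a - alpha * b) <= TOL for a, b in zip(u, v)) else None

def weighted_lstar(member, equiv, alphabet):
    """member(w) returns the target weight of string w; equiv(hyp) returns
    None to accept the hypothesis, or a counterexample string otherwise."""
    P, S, H = [""], [""], {}           # prefix rows, suffix columns, Hankel cells

    def complete():                    # COMPLETE: fill cells via membership queries
        for p, a, s in product(P, [""] + list(alphabet), S):
            H.setdefault(p + a + s, member(p + a + s))

    def row(p):
        return [H[p + s] for s in S]

    def d(p):                          # d_H: row sum (assumed nonzero here)
        return sum(row(p))

    def fix_consistency():             # MAKECONSISTENT: add a separating suffix
        for p, q in product(P, repeat=2):
            alpha = scale(row(p), row(q)) if p != q else None
            if alpha is None:
                continue
            for a, s in product(alphabet, S):
                if abs(H[p + a + s] - alpha * H[q + a + s]) > TOL:
                    S.append(a + s)    # exposes the difference between p and q
                    return True
        return False

    def fix_closedness():              # MAKECLOSED: promote an unmatched prefix
        for p, a in product(P, alphabet):
            if all(scale(row(p + a), row(q)) is None for q in P):
                P.append(p + a)
                return True
        return False

    def build():                       # the naive Hankel automaton (see above)
        reps = []                      # one representative prefix per state
        for p in P:
            if all(scale(row(p), row(r)) is None for r in reps):
                reps.append(p)
        trans = {}
        for r, a in product(reps, alphabet):
            dest = next(t for t in reps if scale(row(r + a), row(t)) is not None)
            trans[(r, a)] = (dest, d(r + a) / d(r))
        final = {r: H[r] / d(r) for r in reps}   # H[r] is the cell H(r, epsilon)
        return {"start": "", "init": d(""), "trans": trans, "final": final}

    while True:
        complete()
        if fix_consistency() or fix_closedness():
            continue
        hypothesis = build()
        counterexample = equiv(hypothesis)
        if counterexample is None:
            return hypothesis
        for i in range(len(counterexample) + 1):   # add all of its prefixes
            if counterexample[:i] not in P:
                P.append(counterexample[:i])
```

A real implementation would wrap the returned dictionary in a proper automaton class and handle row sums that equal the semiring zero; the sketch only aims to show how completion, the consistency and closedness repairs, and the hypothesis construction interlock.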

Let’s break down the inner loop:

1. Making it Consistent

Consistency ensures that if two rows look equivalent (are scalar multiples), they must transition to equivalent rows for every symbol in the alphabet.

Mathematically, if \(p \sim_{\mathbf{H}} q\), then for all symbols \(a\), \(pa\) must be equivalent to \(qa\).

Consistency condition.

If the matrix violates this (i.e., \(p\) and \(q\) look alike now, but diverge after seeing symbol \(a\)), the algorithm calls MAKECONSISTENT. This adds a new suffix to \(S\) to expose the difference between \(p\) and \(q\), forcing them into separate states.
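In symbols (my rendering): if \(p \sim_{\mathbf{H}} q\) with scaling factor \(\alpha\), the check amounts to requiring \(\mathbf{H}(p, a \cdot s) = \alpha \otimes \mathbf{H}(q, a \cdot s)\) for every \(a \in \Sigma\) and \(s \in S\). If some pair \((a, s)\) violates this, appending the column \(a \cdot s\) to \(S\) is precisely what breaks the spurious equivalence between the rows \(p\) and \(q\).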

2. Making it Closed

Closedness ensures that for every known state (prefix \(p\)) and every symbol \(a\), the resulting state (prefix \(pa\)) is equivalent to some state we have already discovered in \(P\).

If the system discovers a “new” type of row via a transition, it calls MAKECLOSED to promote that new prefix into the set \(P\).
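In symbols (my rendering):

\[
\forall p \in P,\ \forall a \in \Sigma,\ \exists q \in P \ \text{such that}\ pa \sim_{\mathbf{H}} q.
\]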

3. Completing the Matrix

Whenever new rows or columns are added, the algorithm calls COMPLETE (using the subroutines in Algorithm 2) to fill in the blank cells by asking the Oracle for the weights of the new strings.

Algorithm 2: Subroutines for Consistency, Closedness, and Completion.

From Matrix to Minimal Automaton

Once the matrix is both closed and consistent, the algorithm generates a hypothesis automaton \(\tilde{\mathcal{A}}_{\mathbf{H}}\).

A key theoretical contribution of this paper is the proof that the automaton derived this way is not just a valid automaton, but the minimal deterministic weighted automaton for the language.

The authors define a transition-regular equivalence relation. This ensures that when we merge equivalent rows into single states, the transition weights remain consistent.

Transition regularity requirement.

Specifically, the transition weights must balance perfectly between equivalent paths:

Weight equality for transition regularity.
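In the row-sum notation from earlier, this balance plausibly amounts to the following (my reconstruction; the paper’s formulation may differ): whenever \(p \sim_{\mathbf{H}} q\),

\[
d_{\mathbf{H}}(pa) \oslash d_{\mathbf{H}}(p) \;=\; d_{\mathbf{H}}(qa) \oslash d_{\mathbf{H}}(q) \quad \text{for all } a \in \Sigma,
\]

so the transition weight computed from any representative of a state is the same.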

The algorithm essentially computes the quotient of the observed behaviors. The number of states in the resulting automaton is exactly equal to the number of unique “directions” (equivalence classes) in the Hankel matrix.

The number of states equals the number of equivalence classes in the Hankel matrix.
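In symbols: \(|Q| = |P /\!\sim_{\mathbf{H}}|\), the number of homothetic equivalence classes among the observed rows.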

Theoretical Guarantees

The paper provides rigorous proofs for two essential questions: Does it stop? And how fast is it?

Termination

The authors prove that the algorithm always terminates. In every iteration where the hypothesis is incorrect—whether due to an inconsistency, a lack of closure, or a counterexample from the Oracle—the “dimension” of the Hankel system increases.

Since the target language is recognized by a minimal deterministic weighted automaton with a finite number of states (say \(N\)), the dimension of the matrix cannot grow forever. It is bounded by the size of this minimal target automaton.

State complexity bound relative to the Hankel matrix.
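That is, the number of distinct row classes can never exceed \(N\):

\[
|P /\!\sim_{\mathbf{H}}| \;\leq\; N.
\]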

Complexity

The complexity depends on \(N\) (states in the target automaton), \(M\) (length of the longest counterexample), and \(|\Sigma|\) (alphabet size).

The dominant term in the runtime complexity is derived as:

Runtime complexity of the algorithm.

The algorithm runs in polynomial time relative to the size of the target automaton and the alphabet. This is a standard result for \(L^*\)-style algorithms, confirming that this weighted extension is computationally feasible for regular languages of reasonable size.

Conclusion and Implications

This paper successfully extends Angluin’s active learning paradigm to the world of Deterministic Weighted Automata over semifields. By utilizing homothetic equivalence, the authors provide a way to group “proportionally similar” states, allowing for the exact learning of weighted languages where weights represent probabilities, durations, or costs.

Why does this matter?

  1. Interpretability: We can use this algorithm to extract deterministic automata from black-box Recurrent Neural Networks (RNNs) or Transformers trained on regular-language tasks. Since the output is deterministic, it is much easier to analyze than the non-deterministic automata produced by spectral learning.
  2. Theoretical Foundation: It bridges the gap between algebraic automata theory (minimization via quotients) and machine learning (active learning from queries).
  3. Generality: By working with semifields (requiring division but not subtraction), it supports a wide range of weight types useful in modern AI applications.

This work serves as a robust tool for those looking to peek inside the black box of modern neural models, translating complex continuous vector spaces into clean, discrete, and understandable states.