Introduction
We are living in the era of Foundation Models (FMs). From chatbots to code assistants, Large Language Models (LLMs) have demonstrated incredible zero-shot and few-shot capabilities. However, there is a massive friction point in the current AI ecosystem: privacy.
Most powerful models reside in centralized data centers. To fine-tune these models on sensitive private data—like medical records, legal documents, or personal chat history—users typically have to upload their data to the cloud. This is a privacy nightmare that Federated Learning (FL) aims to solve. FL allows models to be trained across distributed devices (clients) without the data ever leaving the device.
But here is the catch: Foundation Models are huge. Even the smaller on-device variants, known as On-Device Foundation Models (ODFMs), carry parameter counts in the single-digit billions and are computationally heavy. Fine-tuning an entire model on a smartphone is often impossible due to memory and battery constraints.
Furthermore, not all devices are created equal. In a real-world network, you might have a high-end tablet, a brand-new flagship phone, and an aging budget smartphone all trying to contribute to the same model. This is known as system heterogeneity.
In this post, we will dive into a research paper that proposes a novel solution to this problem: HETLORA (Heterogeneous LoRA). The authors introduce a method that allows devices with different capabilities to train using different “ranks” of model complexity, intelligently aggregating the results to build a powerful global model.
Background: The Necessity of Fine-Tuning
Before dissecting the solution, we must understand why we can’t just use pre-trained models on devices. While massive models (like GPT-4 or PaLM-L) have amazing zero-shot capabilities, smaller “On-Device” models (ODFMs) often struggle without specific fine-tuning.
As shown in the table below, the performance of smaller models like PaLM 2 (XXS and XS sizes) drops significantly in zero-shot or few-shot settings compared to a model that has been fully fine-tuned via Federated Learning.

The difference in perplexity (where lower is better) is staggering. To get usable performance on edge devices, fine-tuning is not a luxury; it is a necessity.
The Challenge of Parameter Efficiency
Fine-tuning all parameters of a pre-trained model (Full Fine-Tuning) is computationally expensive. To solve this, researchers use Parameter-Efficient Fine-Tuning (PEFT). The most popular method today is LoRA (Low-Rank Adaptation).
What is LoRA?
Instead of retraining the massive weight matrix \(\mathbf{W}_0\) of a pre-trained model, LoRA freezes the original weights and injects trainable low-rank decomposition matrices.
Mathematically, the update \(\Delta \mathbf{W}\) is represented as the product of two smaller matrices, \(\mathbf{B}\) and \(\mathbf{A}\):
\[ \Delta \mathbf{W} = \mathbf{B} \mathbf{A} \]
where \(\mathbf{B} \in \mathbb{R}^{d \times r}\) and \(\mathbf{A} \in \mathbb{R}^{r \times l}\), matching the shape of the frozen weight \(\mathbf{W}_0 \in \mathbb{R}^{d \times l}\). The variable \(r\) is the rank.
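To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (my own illustration, not the paper’s code; the 4096-dimensional layer and rank 8 are arbitrary choices):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, d: int, l: int, rank: int):
        super().__init__()
        self.W0 = nn.Linear(l, d, bias=False)  # pre-trained weights, kept frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, l) * 0.01)  # r x l
        self.B = nn.Parameter(torch.zeros(d, rank))  # d x r, zero-init so the update starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying (W0 + ΔW) with ΔW = B A
        return self.W0(x) + x @ (self.B @ self.A).T

layer = LoRALinear(d=4096, l=4096, rank=8)
full_params = 4096 * 4096
lora_params = layer.A.numel() + layer.B.numel()
print(f"trainable fraction: {lora_params / full_params:.3%}")  # ~0.39%
```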
The magic of LoRA is that \(r\) can be very small. As the table below illustrates, using LoRA reduces the number of trainable parameters to a tiny fraction (often less than 1%) of the original model size.

The Problem: The Trade-off of Homogeneous LoRA
In a standard Federated Learning setup using LoRA, the server usually enforces a Homogeneous Rank. This means every smartphone, regardless of whether it is a powerful iPhone 15 or an older Android device, must use the exact same rank \(r\) (e.g., \(r=16\)).
This “one-size-fits-all” approach creates a difficult trade-off:
- High Rank: If everyone uses a high rank (more parameters), the model learns quickly but tends to overfit on local data. It also burdens weaker devices.
- Low Rank: If everyone uses a low rank (fewer parameters), the model is computationally light and stable, but it converges very slowly.
The authors demonstrated this trade-off empirically in the figure below. Notice how high ranks (green, \(r=50\)) drop in perplexity quickly but then stall or worsen (overfit), while the lowest rank (red, \(r=1\)) is slow but steady.

We need a “Goldilocks” solution: the speed of high ranks and the stability of low ranks, while accommodating devices of varying power.
Core Method: HETLORA
The authors propose HETLORA, a framework that enables Heterogeneous LoRA configurations. This allows different clients to train with different ranks (\(r_k\)) based on their system capabilities and data complexity.

As illustrated above, HETLORA allows the network to have a mix of ranks (\(r_1 < r_2 < r_3\)). The server manages a global model with a maximum rank, but individual devices only download and train what they can handle.
HETLORA consists of three distinct phases: Distribution via Truncation, Local Training with Self-Pruning, and Sparsity-Weighted Aggregation. Let’s break these down.
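Before walking through each phase, here is a compact NumPy toy of how one HETLORA round fits together (my own simplification, not the paper’s code: local training is stubbed out with random perturbations, and plain averaging stands in for the sparsity weighting covered in phase 3):

```python
import numpy as np

d = l = 64          # layer dimensions (illustrative)
r_max = 8           # maximum rank maintained by the server
rng = np.random.default_rng(0)
B_global = rng.standard_normal((d, r_max)) * 0.01
A_global = rng.standard_normal((r_max, l)) * 0.01
client_ranks = [2, 4, 8]  # heterogeneous device capabilities: r1 < r2 < r3

updates = []
for r_k in client_ranks:
    # (1) Distribution via truncation: client k only receives the first r_k components.
    B_k = B_global[:, :r_k].copy()
    A_k = A_global[:r_k, :].copy()
    # (2) Local training (with rank self-pruning) would update B_k, A_k here;
    #     a small random perturbation stands in for the local SGD steps.
    B_k += rng.standard_normal(B_k.shape) * 0.01
    A_k += rng.standard_normal(A_k.shape) * 0.01
    updates.append((B_k, A_k))

# (3) Aggregation: zero-pad each client's factors to r_max and combine them.
#     HETLORA uses sparsity weights here; plain averaging is shown for brevity.
B_new, A_new = np.zeros_like(B_global), np.zeros_like(A_global)
for B_k, A_k in updates:
    r_k = B_k.shape[1]
    B_new[:, :r_k] += B_k / len(updates)
    A_new[:r_k, :] += A_k / len(updates)
B_global, A_global = B_new, A_new
```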
1. Distribution via Truncation
The server maintains a global LoRA module with a maximum rank (\(r_{max}\)). When a client connects, it does not necessarily download the whole thing.
If a client has limited bandwidth or compute power, the server truncates the matrices. For a client requiring rank \(r_k\), the server sends:
- \(\mathbf{B}_{:, :r_k}\) (The first \(r_k\) columns of B)
- \(\mathbf{A}_{:r_k, :}\) (The first \(r_k\) rows of A)
This simple step ensures that weak devices aren’t forced to download or compute gradients for high-rank parameters they can’t handle.
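A hypothetical server-side helper makes the slicing explicit (a sketch; the function name and shapes are mine, not the paper’s):

```python
import numpy as np

def truncate_lora(B: np.ndarray, A: np.ndarray, r_k: int):
    """Slice the global LoRA factors down to a client's rank r_k.

    B has shape (d, r_max) and A has shape (r_max, l); the client only
    receives the first r_k columns of B and the first r_k rows of A.
    """
    return B[:, :r_k].copy(), A[:r_k, :].copy()

# Example: the global module has rank 50, but a weak client only handles rank 5.
d, l, r_max = 768, 768, 50
B_global = np.random.randn(d, r_max)
A_global = np.random.randn(r_max, l)
B_k, A_k = truncate_lora(B_global, A_global, r_k=5)
print(B_k.shape, A_k.shape)  # (768, 5) (5, 768)
```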
2. Local Training with Rank Self-Pruning
Once the client receives its LoRA module, it begins training on its local private data. However, the authors introduce a clever twist: Rank Self-Pruning.
Sometimes, a client might be assigned a high rank (e.g., \(r=50\)) because it has a powerful processor, but its data might be simple enough that it doesn’t need that complexity. High ranks on simple data lead to overfitting (noise).
To fix this, the local loss function includes a regularization term that penalizes the magnitude of the higher ranks.
\[ \text{Local Loss} + \lambda \, \big\| \text{higher-rank columns of } \mathbf{B} \big\| \cdot \big\| \text{higher-rank rows of } \mathbf{A} \big\| \]
During training, if the weights in these higher ranks shrink below a certain threshold (controlled by a decay factor \(\gamma\)), the client effectively “prunes” itself. It realizes, “I don’t need rank 50; rank 30 is enough,” and sends back a smaller update to the server.
This acts as a local noise filter, preventing clients from uploading overfitted parameters.
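A rough sketch of what the pruning decision might look like on the client. This is my simplification of the \(\gamma\)-based rule, not the paper’s exact criterion: the candidate rank is the current rank decayed by \(\gamma\), and the tail beyond it is dropped only if it carries a negligible share of the update’s magnitude.

```python
import numpy as np

def maybe_prune_rank(B: np.ndarray, A: np.ndarray, rank: int,
                     gamma: float = 0.99, tol: float = 1e-3) -> int:
    """Return a (possibly smaller) rank for the next local round.

    gamma = 1 keeps the rank fixed (no pruning); smaller gamma prunes more
    aggressively. `tol` is an illustrative cutoff, not a value from the paper.
    """
    candidate = max(1, int(np.floor(gamma * rank)))
    if candidate == rank:
        return rank
    # How much update magnitude lives in the ranks we would throw away?
    tail = np.linalg.norm(B[:, candidate:rank] @ A[candidate:rank, :])
    total = np.linalg.norm(B[:, :rank] @ A[:rank, :]) + 1e-12
    return candidate if tail / total < tol else rank

# Example: a rank-50 client checks whether it can shrink its rank this round.
B, A = np.random.randn(768, 50), np.random.randn(50, 768)
print(maybe_prune_rank(B, A, rank=50, gamma=0.99))
```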
3. Sparsity-Weighted Aggregation
This is the most mathematically critical part of HETLORA. The server now receives updates from varying ranks.
- Client A sends a rank-5 update.
- Client B sends a rank-50 update.
How do you combine them?
A naive approach would be to zero-pad the small matrices to match the size of the large ones and then average them. However, simple averaging is dangerous here. High-rank clients often introduce more noise (as seen in the overfitting problem). If we simply average them, the noisy high-rank updates might drown out the stable low-rank features.
The authors propose Sparsity-Weighted Aggregation.

The process works as follows (referencing the figure above):
- Zero-Padding (a): Small-rank updates are padded with zeros to match the global maximum rank.
- Calculate Sparsity Weights: The server calculates a weight \(p_k\) for each client, based on the Frobenius norm (magnitude) of that client’s update.
  - The intuition comes from singular value decomposition (SVD) theory: informative updates tend to have distinct singular value patterns.
  - Instead of running an expensive SVD, the authors use the Frobenius norm of the reconstructed update (\(\Delta \mathbf{W}_k = \mathbf{B}_k \mathbf{A}_k\)) as a computationally cheap proxy.
- Weighted Aggregation (b): The server aggregates the updates using these weights (the \(\mathbf{A}\) factors are combined analogously): \[ \overline{\mathbf{B}} = \sum_k p_k \mathbf{B}_k \] This ensures that stable, informative updates (often from lower ranks) are prioritized, while noisy, sparse updates from high-rank clients are de-emphasized.
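Here is a toy sketch of the server-side step, assuming the weights are proportional to each client’s Frobenius norm \(\|\mathbf{B}_k \mathbf{A}_k\|_F\) (the exact normalization is my guess, not taken from the paper):

```python
import numpy as np

def aggregate(client_updates, d, l, r_max):
    """Zero-pad heterogeneous-rank LoRA updates and combine them with sparsity weights.

    client_updates: list of (B_k, A_k) pairs with shapes (d, r_k) and (r_k, l).
    """
    # 1. Sparsity weights: the Frobenius norm of the reconstructed update is a
    #    cheap proxy for how informative each client's contribution is.
    norms = np.array([np.linalg.norm(B_k @ A_k) for B_k, A_k in client_updates])
    p = norms / norms.sum()

    # 2. Zero-pad each factor to the global maximum rank and take weighted sums.
    B_bar = np.zeros((d, r_max))
    A_bar = np.zeros((r_max, l))
    for p_k, (B_k, A_k) in zip(p, client_updates):
        r_k = B_k.shape[1]
        B_bar[:, :r_k] += p_k * B_k
        A_bar[:r_k, :] += p_k * A_k
    return B_bar, A_bar

# Example: one rank-5 client and one rank-50 client updating a 768x768 layer.
d = l = 768
updates = [(np.random.randn(d, 5), np.random.randn(5, l)),
           (np.random.randn(d, 50), np.random.randn(50, l))]
B_bar, A_bar = aggregate(updates, d, l, r_max=50)
print(B_bar.shape, A_bar.shape)  # (768, 50) (50, 768)
```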
Experiments & Results
The researchers evaluated HETLORA using the PaLM 2 model (XXS and XS sizes) on two realistic tasks:
- Multi-Session Chat (MSC): Generating dialogue responses.
- Reddit Summarization: Summarizing user posts.
Does HETLORA beat the baseline?
The results show that HETLORA significantly outperforms the naive approach.
First, let’s look at what happens if we just use heterogeneous ranks without the special aggregation or pruning (Naive Heterogeneous LoRA).

As seen in Figure 4 above, naive heterogeneous setups still suffer from overfitting (perplexity goes down, then shoots back up). This confirms that simply mixing ranks isn’t enough; you need the pruning and weighted aggregation mechanisms.
When the full HETLORA method is applied, the results are much stronger. The figure below compares HETLORA (green line) against Homogeneous LoRA at rank 5 (blue) and rank 50 (purple), as well as Full Fine-Tuning (red).

Key Takeaways from Figure 5:
- Speed: HETLORA learns much faster than the low-rank baseline.
- Stability: Unlike the high-rank baseline (purple), HETLORA does not spike in perplexity; it remains stable.
- Performance: It approaches the performance of Full Fine-Tuning (red dashed line) while training only a tiny fraction of the parameters.
Quantitative Comparison
The table below provides the final hard numbers. We can see that HETLORA achieves better (lower) perplexity on Chat data and better (higher) RougeL scores on Reddit data compared to Homogeneous LoRA (“HOMLoRA”).

Notably, look at the Recon+SVD row. This represents an alternative strategy where the server reconstructs the full matrix and redistributes it using SVD. HETLORA beats this method, proving that aggregating the decomposed matrices (\(\mathbf{B}\) and \(\mathbf{A}\)) preserves cross-client relationships better than reconstructing the full weights.
Efficiency Improvements
One of the main goals of using LoRA on devices is to reduce communication costs. The graph below shows the ratio of communicated parameters required to achieve a specific target performance.

HETLORA (the green bar/point) requires significantly less communication than full fine-tuning. In the Reddit task, Homogeneous LoRA (min rank) failed to reach the target accuracy (marked by X), while HETLORA succeeded with minimal overhead.
The Role of Pruning
Finally, the authors conducted an ablation study to see how important the Self-Pruning step was. They tested different values of \(\gamma\) (the decay factor).

The results indicate that \(\gamma = 0.99\) is the sweet spot.
- \(\gamma = 1\) (No pruning): Performance is worse because high-rank noise isn’t filtered.
- \(\gamma = 0.85\) (Aggressive pruning): Performance degrades because the model discards too much useful information.
- \(\gamma = 0.99\): Balances noise reduction with information retention.
Conclusion and Implications
The paper “Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models” presents a robust solution for deploying large models in the wild.
By acknowledging that heterogeneity is a feature, not a bug, HETLORA allows:
- Inclusivity: Older devices can contribute to training via low ranks.
- Performance: The system gains the learning speed of high ranks without the instability.
- Efficiency: It significantly cuts down on communication and computation compared to full fine-tuning.
This research bridges the gap between the massive potential of Foundation Models and the privacy/hardware constraints of the real world. As mobile hardware continues to evolve at different rates, techniques like HETLORA will be essential for creating personalized, private AI assistants that run entirely on the edge.
For students interested in Federated Learning, this paper highlights an important trend: moving away from “uniform” algorithms toward flexible, adaptive systems that respect the diverse nature of distributed computing.