If you follow the current trajectory of Artificial Intelligence, you might assume that Large Language Models (LLMs) have solved natural language. Models like GPT-4 can write poetry, code in Python, and summarize legal documents with ease. However, there is a hidden disparity in the AI landscape: the dominance of English.

While English-centric models flourish, languages with fewer speakers—and consequently less digitized training data—are often left behind. This category, known as Low-Resource Languages (LRLs), includes Norwegian, which is spoken by only about 5 million people. When we test mainstream models on these languages, we often find that translation is not the same as comprehension. A model might translate words correctly but fail spectacularly at understanding cultural nuance or local context.

Today, we are diving deep into a research paper titled “NLEBench+NorGLM”. This work represents a significant step forward for Nordic NLP. The researchers didn’t just test existing models; they built a suite of new Norwegian language models (NorGLM) from scratch and created a comprehensive benchmark suite (NLEBench) to properly evaluate them.

For students of NLP, this paper is a masterclass in how to approach language modeling when you don’t have the infinite resources of the English web, and why “bigger” isn’t always “better” when data is constrained.

The Problem: Why English Models Don’t Always Work for Norway

Before we get into the architecture, we need to understand the gap. Most benchmarks for low-resource languages have historically focused on discriminative tasks. These are tasks like classification or multiple-choice questions (e.g., “Is this sentence positive or negative?”).

However, the modern AI revolution is built on generative tasks—writing text, answering open-ended questions, and summarizing articles. There has been a lack of benchmarks that specifically test how well a model can generate Norwegian.

Furthermore, when researchers simply translate English benchmarks (like the famous GLUE benchmark) into Norwegian, they lose cultural context. A question about American history translated into Norwegian tests the model’s translation ability, not its knowledge of Norwegian culture.

To fill this gap, the authors of this paper pursued two massive initiatives:

  1. NorGLM: Training a suite of Generative Language Models specifically for Norwegian.
  2. NLEBench: Creating a benchmark that includes translation, summarization, and—crucially—cultural grounding.

Part 1: Building NorGLM (The Models)

Training an LLM requires a massive corpus of text. For English, this is easy. For Norwegian, it requires careful curation. The researchers compiled a dataset totaling 198.7 billion tokens.

As illustrated in the figure below, the data wasn’t just Norwegian. It included a mix of Norwegian (71%), Swedish (10%), Danish (8%), and English (6%). This inclusion of neighboring North Germanic languages is a strategic move to improve the model’s linguistic robustness through transfer learning.

Figure 1: The data distribution within the pre-training dataset. The inner segment represents languages, and the outer segment denotes various sourced datasets in Norwegian.

The Norwegian data itself came from diverse sources:

  • mC4 and OSCAR: Massive web-crawled corpora (cleaned and filtered).
  • Nasjonalbiblioteket: Non-copyrighted material from the Norwegian National Library.
  • News & Social Media: High-quality news articles (Schibsted) and informal text from Reddit and Twitter to capture conversational nuance.
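To give a feel for how such a language mix can be audited, here is a minimal sketch using the `langdetect` package. This is not the authors' tooling, and a real pipeline would weight by tokens rather than documents; it simply illustrates the kind of corpus-level language accounting behind Figure 1:

```python
# pip install langdetect
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make language detection deterministic


def language_mix(documents):
    """Estimate the language distribution of a corpus by document count."""
    counts = Counter()
    for doc in documents:
        try:
            counts[detect(doc)] += 1  # ISO codes: 'no', 'sv', 'da', 'en', ...
        except Exception:
            counts["unknown"] += 1    # very short or odd docs can fail detection
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


docs = [
    "Norge er et land i Nord-Europa.",     # Norwegian: "Norway is a country in Northern Europe."
    "Sverige är ett land i Norden.",       # Swedish: "Sweden is a country in the Nordics."
    "The quick brown fox jumps over it.",  # English
]
print(language_mix(docs))
```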

The Architecture

The researchers trained several models, collectively called NorGLM, with varying sizes to test how parameter scale affects performance in low-resource settings:

  1. NorGPT-369M: A smaller model based on GPT-2 architecture.
  2. NorGPT-3B: A mid-sized model (3 billion parameters).
  3. NorLlama-3B: A model using the Llama architecture to test if architecture changes (like different activation functions) impact performance.
  4. NorGPT-23B: A large 23-billion parameter model.

They also compared these against NB-GPT-J-6B (an existing model trained on English and fine-tuned on Norwegian) and the commercial giant GPT-3.5-Turbo.
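The NorGLM models have been released on Hugging Face. Here is a minimal sketch of generating Norwegian text with the smallest one, assuming the repository ID `NorGLM/NorGPT-369M` (verify the exact name on the NorGLM organization page):

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NorGLM/NorGPT-369M"  # assumed repo ID; check the Hub for exact naming
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Norge er kjent for"  # "Norway is known for"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```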

The specific training parameters, including layer counts and context windows, are detailed below. Note that training the 23B model required significantly more computational resources (global batch size of 112) compared to the smaller models.

Table 7: The training parameter settings of NorGLMs.
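As a hedged illustration of what such a configuration looks like in code, here is a sketch. Every value below is a placeholder except the 23B model's global batch size of 112, which is stated above; consult Table 7 for the real settings:

```python
from dataclasses import dataclass


@dataclass
class PretrainConfig:
    """Illustrative hyperparameters; these are NOT the paper's settings,
    except global_batch_size for the 23B model."""
    n_layers: int
    hidden_size: int
    n_heads: int
    context_window: int
    global_batch_size: int
    learning_rate: float


norgpt_23b = PretrainConfig(
    n_layers=48,            # placeholder
    hidden_size=8192,       # placeholder
    n_heads=64,             # placeholder
    context_window=2048,    # placeholder
    global_batch_size=112,  # from the paper's description above
    learning_rate=1e-4,     # placeholder
)
```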

Part 2: NLEBench (The Evaluation Suite)

You cannot improve what you cannot measure. The second major contribution of this paper is NLEBench. The researchers moved beyond simple classification tasks to include complex generative tasks.

They categorized their datasets into three types:

  1. Existing Datasets: Standard NLP tasks adapted for Norwegian (e.g., Sentiment analysis).
  2. Machine Translated Datasets: English benchmarks translated via the Google Translate API (e.g., CNN/DailyMail for summarization); a sketch of this step follows the list.
  3. Human Annotated Datasets: Freshly created data specifically for this project.
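For category 2, here is a minimal sketch of translating one English benchmark record with the Google Cloud Translation API (v2 client). It assumes credentials are configured and is an illustration of the approach, not the authors' exact pipeline:

```python
# pip install google-cloud-translate
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set


def translate_example(example: dict) -> dict:
    """Translate one CNN/DailyMail-style record from English to Norwegian."""
    return {
        key: client.translate(text, source_language="en",
                              target_language="no")["translatedText"]
        for key, text in example.items()
    }


sample = {
    "article": "The council approved the new budget on Tuesday.",
    "highlights": "Council approves budget.",
}
print(translate_example(sample))
```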

The table below provides a comprehensive overview of the benchmark. Notice the variety of tasks: from Instruction Fine-tuning (following orders) to Bias Detection and Multi-task Learning.

Table 1: Overview of the NLEBench dataset and evaluation setups.

The Innovation: Multi-Task Synergy (NO-Multi-QA-Sum)

One of the most interesting parts of NLEBench is the NO-Multi-QA-Sum dataset. The researchers argue that standard benchmarks utilize “single-task” data (e.g., just summarize this text). However, real understanding often involves connecting different tasks.

To test this, they hired human annotators to read news articles and perform two linked actions:

  1. Conduct a conversation (Q&A) about the article.
  2. Write a summary of the article.

This creates a dataset in which the questions, answers, and summaries are all grounded in the same source text. That linkage lets the researchers test Chain-of-Thought (CoT) reasoning: can the model write a better summary if it first answers questions about the text?
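To make the setup concrete, here is a sketch of what one record might look like and how the two prompting strategies differ. The field names and prompt wording are illustrative, not the paper's exact format:

```python
# One illustrative NO-Multi-QA-Sum record (field names are assumptions)
record = {
    "article": "...",    # the Norwegian news article
    "questions": ["Hva handler saken om?",   # "What is the story about?"
                  "Hvem er involvert?"],     # "Who is involved?"
    "summary": "...",    # the human-written reference summary
}


def direct_prompt(article: str) -> str:
    """Baseline: ask for the summary in a single step."""
    # "Article: ... Write a short summary."
    return f"Artikkel:\n{article}\n\nSkriv et kort sammendrag."


def cot_prompt(article: str, questions: list[str]) -> str:
    """Chain-of-Thought: answer the questions first, then summarize."""
    qs = "\n".join(f"- {q}" for q in questions)
    # "Article: ... First answer these questions: ...
    #  Then use the answers to write a short summary."
    return (
        f"Artikkel:\n{article}\n\n"
        f"Svar først på disse spørsmålene:\n{qs}\n\n"
        f"Bruk deretter svarene til å skrive et kort sammendrag."
    )
```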

The interface used for this human annotation process is shown below. It integrates GPT-4 suggestions which humans then verified and corrected, a “human-in-the-loop” approach to data generation.

Figure 7: The annotation interface for the multi-task benchmark.

Experiments and Key Results

The researchers ran extensive experiments comparing their NorGLM models against the existing NB-GPT-J-6B and OpenAI’s GPT-3.5. The results offered several counter-intuitive insights.

1. The “English-Centric” Bias of GPT-3.5

One might expect GPT-3.5 to crush the competition due to its sheer size. While it performed well on general tasks, it struggled significantly with Norwegian cultural context.

In the example below, the instruction asks: “Who wrote the song ‘Ut mot havet’?”

  • Human Truth: Finn Kalvik.
  • GPT-3.5: Jo Nesbø (a famous crime author).

GPT-3.5 hallucinates a connection to a famous Norwegian it does know (Jo Nesbø) rather than retrieving the correct cultural fact. This highlights that while massive multilingual models can speak the language grammatically, they often lack the “cultural knowledge graph” of a native model.

Figure 3: Example of GPT-3.5's generation on a Norwegian-culture instruction from NO-Alpaca-Plus. Translations are on the right.

Similarly, when asked about a specific localized slur/expression (“hestkuk”), GPT-3.5 treats it as a general vulgarity, whereas the human annotator correctly identifies it as a specific regional expression used in Northern Norway.

Figure 4: Example of GPT-3.5's generation on a Norwegian special-expression instruction from NO-Alpaca-Plus. Translations are on the right.

2. Size Isn’t Everything

When looking at the News Summarization tasks (NO-CNN/DailyMail), the researchers found that simply increasing parameters didn’t guarantee victory.

As shown in the table below, NB-GPT-J-6B (a 6 billion parameter model) often outperformed the much larger NorGPT-23B in ROUGE scores (a metric for text overlap).

Table 3: Experimental Results on the News Summarization Task.

Why? NB-GPT-J-6B was pre-trained on a massive English corpus before being fine-tuned on Norwegian. The NorGPT models were trained from scratch. This suggests that for low-resource languages, transfer learning (starting with a smart English model and teaching it Norwegian) might be more efficient than trying to build a massive native model from scratch if your native data is limited.
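ROUGE, the metric behind Table 3, counts n-gram overlap between a generated summary and a reference. A minimal sketch with the `rouge-score` package; note that its default tokenizer is English-oriented (it drops characters like æ, ø, å), so scores on Norwegian text are approximate:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

# Reference: "The municipality approved the new budget on Tuesday."
reference = "Kommunen vedtok det nye budsjettet tirsdag."
# Candidate: "The new budget was approved by the municipality on Tuesday."
candidate = "Det nye budsjettet ble vedtatt av kommunen tirsdag."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```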

3. Synergy and Chain-of-Thought (CoT)

The researchers tested the “Synergy” hypothesis using their multi-task dataset. They asked the models to:

  1. Task A: Answer questions about an article, then use those answers to write a summary.
  2. Task B: Write a summary, then answer questions based on it.

They found that Chain-of-Thought prompting (asking the model to reason through the questions first) significantly improved the factual consistency (Entailment Score) of the generated summaries.

Interestingly, GPT-3.5 saw a massive jump in performance with CoT, suggesting that while it lacks cultural knowledge, its reasoning engine is highly developed. The smaller Norwegian models also saw improvements, indicating that even modest models can reason if guided correctly.

Table 4: Experimental results on Task A (answer questions, then summarize) of the NO-Multi-QA-Sum dataset.
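The Entailment Score used here treats the source article as the premise and the generated summary as the hypothesis, then asks a natural language inference (NLI) model how strongly the premise entails the hypothesis. A sketch using a multilingual NLI checkpoint; the model choice is an assumption, and the paper's exact scorer may differ:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "joeddav/xlm-roberta-large-xnli"  # assumed multilingual NLI model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)


def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise (article) entails the hypothesis (summary)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment index from the model config instead of hardcoding it
    idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[idx].item()


# Premise: "The municipality approved the budget on Tuesday."
# Hypothesis: "The budget was approved."
print(entailment_score("Kommunen vedtok budsjettet tirsdag.",
                       "Budsjettet ble vedtatt."))
```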

4. Toxicity and Bias

Finally, the researchers evaluated the models for toxicity. They found something surprising regarding the source of toxic generations. We often assume toxicity comes from Reddit or Twitter data. However, the researchers traced high toxicity scores back to news articles describing crimes (e.g., “taken life from/kill”).
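Toxicity here is scored with Google's Perspective API, which returns a probability-like score per attribute. A minimal sketch of scoring one generation; it requires an API key with Perspective access, and the example text is illustrative:

```python
# pip install google-api-python-client
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    # "This is a perfectly ordinary sentence."
    "comment": {"text": "Dette er en helt vanlig setning."},
    "languages": ["no"],  # assumes Norwegian is supported; omit to auto-detect
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
print(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])
```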

The table below shows the toxicity scores. NorLlama-3B had the lowest toxicity scores, but the authors note this is partly because it often generated meaningless text, which the toxicity filter didn’t flag. This is a reminder to students: always check why a metric looks good!

Table 15: Experimental results on the toxicity of Norwegian generative language models. Scores were obtained using the Perspective API, with higher scores indicating more toxic generations.

Conclusion: The Future of Low-Resource NLP

The NLEBench+NorGLM paper provides a roadmap for democratizing AI. It shows that we cannot rely solely on Silicon Valley giants to solve natural language processing for every language on Earth.

Key Takeaways for Students:

  1. Context Matters: A model like GPT-3.5 can be grammatically perfect in Norwegian but culturally illiterate. We need native benchmarks to detect this.
  2. Data Quality > Model Size: Training a 23B parameter model on limited data yields diminishing returns. Smart data curation or transfer learning (from English models) is often more effective.
  3. Multi-Tasking is the Future: Isolated benchmarks (just summarization, just translation) are becoming obsolete. The future lies in synergy tasks where models must demonstrate reasoning across different modes of thought.

The researchers have released their models and datasets openly. This is crucial for the scientific community, ensuring that the preservation and development of languages like Norwegian happens in the open, not behind closed API doors.


This blog post summarizes the paper “NLEBench+NorGLM” by Liu et al., from the Norwegian University of Science and Technology.