Introduction
In the last few years, the hype surrounding Large Language Models (LLMs) like GPT-4 and Claude has been inescapable. We have seen them write poetry, debug code, and even pass the US Bar Exam. But as these models integrate deeper into educational and professional workflows, a critical question arises: Do they actually understand scientific concepts, or are they just really good at guessing multiple-choice answers?
Most current benchmarks used to evaluate AI, such as MMLU (Massive Multitask Language Understanding), rely heavily on multiple-choice questions. While efficient, this format has a major flaw: it doesn’t reflect how people actually apply science and engineering skills in the real world. In a university setting, a student isn’t just asked to pick option A, B, or C. They are asked to write a proof, design an algorithm, explain a complex theory in their own words, or interpret a diagram.
Enter SciEx (Scientific Exams), a new benchmark proposed by researchers at the Karlsruhe Institute of Technology (KIT). The paper introduces a rigorous evaluation framework consisting of real university-level Computer Science exams. Unlike previous tests, SciEx demands freeform text answers, requires interpreting images, and calls for deep reasoning. Furthermore, the researchers didn’t just use automated scripts to score the AI; they brought in the actual university lecturers to grade the AI’s answers as if they were students.
In this deep dive, we will explore how SciEx was built, how state-of-the-art models performed when pitted against computer science undergrads, and the discovery that while AI might be a mediocre student, it makes a surprisingly good professor.
Background: The Limits of Multiple Choice
To understand why SciEx is necessary, we first need to look at the limitations of existing benchmarks.
Benchmarks like SciQ or ScienceQA have been instrumental in tracking the progress of LLMs. However, they suffer from the “multiple-choice gap.” When a model is presented with four options, it can often use elimination strategies or statistical likelihood to guess the correct answer without truly deriving the solution. This creates a disconnect between a model’s test scores and its utility.
Real-world scientific tasks are open-ended. If you ask an engineer to “optimize this database query,” there is no list of options. They must generate a solution from scratch. Additionally, scientific education is multimodal. You cannot pass a Deep Learning or Computer Graphics course without understanding diagrams, plots, and visual schemas.
The authors of SciEx identified three key requirements for a true scientific benchmark:
- Freeform Answers: The model must generate the text, proof, or code itself.
- Multimodality: The test must include questions involving images.
- High-Quality Grading: Evaluating freeform answers is hard. It requires expert human judgment or highly advanced automated systems.
The SciEx Methodology
The core of this research is the dataset itself. The authors collected 10 genuine Computer Science exams from the Karlsruhe Institute of Technology, covering the 2022-2024 semesters.
1. The Curriculum
The exams cover a broad spectrum of the Computer Science discipline, ensuring that the AI isn’t just tested on one niche topic. The subjects include:
- Natural Language Processing (NLP)
- Deep Learning & Neural Networks
- Computer Vision
- Human-Computer Interaction (HCI)
- Databases (SQL, Relational Algebra)
- Computer Graphics
- Theoretical Foundations (Turing machines, proofs)
- Algorithms
2. The Format
The researchers took these exams—originally in PDF format—and converted them into a structured JSON format. This allowed them to feed questions systematically to the LLMs while preserving reference images.

As shown in Figure 4 above, the transformation preserves the complexity of the question. On the left is the original exam requiring the student to analyze a BERT model diagram. On the right, this is structured into a machine-readable format that includes the question text and a path to the image file.
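To make the structure concrete, here is a minimal sketch of what one such JSON record might look like. The field names, exam title, and file path are illustrative assumptions, not the paper’s exact schema:

```python
import json
from pathlib import Path

# A hypothetical record illustrating the structure described above:
# question text, metadata, and an optional path to a reference image.
# Field names are illustrative, not the exact SciEx schema.
example_question = {
    "exam": "Deep Learning, Summer Semester 2023",
    "question_id": "3b",
    "language": "en",
    "difficulty": "medium",
    "question": "Given the BERT architecture shown in the figure, explain the role of the [CLS] token.",
    "image_path": "images/deep_learning/q3b_bert_diagram.png",  # None for text-only questions
    "max_points": 4,
}

# Serialising to JSON keeps the exam machine-readable while preserving
# the link to the original figure.
Path("sciex_sample.json").write_text(json.dumps(example_question, indent=2))
print(json.loads(Path("sciex_sample.json").read_text())["question"])
```

Keeping the figure as a file path rather than embedding it lets the same question record be served to both text-only and vision-capable models.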
The resulting dataset is also notably diverse. It contains questions ranging from “Easy” to “Hard,” spanning both English and German.

Table 1 highlights the breakdown. Out of 154 unique questions, a significant portion (33) relies on images, and the difficulty is skewed towards “Medium,” which is typical for university exams intended to differentiate between average and excellent students.
3. The Examinees
Who took the test? The researchers evaluated a mix of proprietary (closed-source) and open-source models.

As detailed in Table 2, the lineup includes heavy hitters like GPT-4V (Vision) and Claude 3 Opus, as well as efficient open-source models like Mixtral and Llama. Note that only some models (Claude, GPT-4V, Llava) are multimodal, meaning they can actually “see” the images provided in the exams. Text-only models were given the text of the question but had to skip the visual context, putting them at a natural disadvantage—much like a student taking a test with their eyes closed during diagram questions.
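To illustrate that difference, here is a hedged sketch of how a question record might be packaged for a vision-capable model versus a text-only one. The payload format and field names are assumptions for illustration, not the paper’s actual prompting code, and real APIs each define their own message schema:

```python
import base64
from pathlib import Path

def build_prompt(question: dict, multimodal: bool) -> dict:
    """Package an exam question for a model.

    Illustrative payload only; the paper's exact prompts may differ.
    """
    payload = {
        "instruction": "Answer the following exam question as a student would.",
        "question": question["question"],
    }
    image_path = question.get("image_path")
    if multimodal and image_path:
        # Vision-capable models receive the figure alongside the text.
        payload["image_b64"] = base64.b64encode(Path(image_path).read_bytes()).decode()
    elif image_path:
        # Text-only models see the question but not the figure, mirroring
        # the disadvantage described above.
        payload["note"] = "A figure accompanies this question but is not shown."
    return payload
```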
4. The Grading Process
This is where SciEx shines. Grading freeform text is notoriously difficult for computers. A student might write a correct answer that looks completely different from the answer key.
To solve this, the authors employed Human Expert Grading. They asked the actual lecturers who designed the courses to grade the AI’s answers. The lecturers used the same criteria they would apply to university students. This provides a “Gold Standard” of evaluation.
However, recognizing that expert human grading is expensive and slow, the authors also experimented with Automatic Grading using “LLM-as-a-judge.” They fed the questions, the reference answers, and the AI’s attempts into a powerful model (like GPT-4V) and asked it to assign a score. We will discuss the reliability of this method in the results section.
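A minimal sketch of what such a grading prompt could look like is shown below. The wording and the score format are assumptions, not the authors’ exact prompt, and the few-shot examples mentioned above are omitted for brevity:

```python
def build_grading_prompt(question: str, reference_answer: str,
                         candidate_answer: str, max_points: int) -> str:
    """Sketch of an 'LLM-as-a-judge' grading prompt.

    The returned string would be sent to a judge model such as GPT-4V,
    and the extracted score compared with the lecturer's grade.
    """
    return (
        "You are a university lecturer grading an exam answer.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference_answer}\n\n"
        f"Student answer:\n{candidate_answer}\n\n"
        f"Assign between 0 and {max_points} points. Briefly justify the score, "
        "then output it on the last line in the form 'Score: X'."
    )
```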
Experiments & Results
So, did the AI pass the semester? The results paint a picture of technology that is impressive yet distinctly flawed.
Overall Performance
The headline result is that university exams remain a significant challenge for current LLMs.

Table 3 shows the average normalized grades.
- Claude took the top spot with 59.4%, followed closely by GPT-4V at 58.2%.
- In the German grading scale (where 1.0 is the best grade and 4.0 is the lowest passing grade), these scores correspond to a 2.4 and 2.5 respectively. That is a solid “Good” grade, essentially a B- or C+ student.
- The Student Average was 45.3%.
This is a fascinating finding: the best AI models outperformed the average student. However, they are far from perfect. A score of 59% implies that roughly 40% of the material was answered incorrectly or incompletely.
The drop-off for smaller, open-source models is steep. Mixtral achieved 41.1%, and Llava (a smaller vision model) scored only 21.5%, failing the exams.
The Difficulty Paradox
One might expect AI performance to degrade linearly as questions get harder. Interestingly, the data suggests a more complex relationship.

Figure 1 offers two crucial insights:
- Panel (a): Student performance (the grey bars) follows a logical trend—they score high on “Easy” questions and low on “Hard” ones.
- Panel (b): The strongest AI models (Claude and GPT-4V) actually outperform students by the widest margin on Hard questions.
Why does this happen? The researchers hypothesize that “Easy” questions in these exams often involve specific calculations or visual tasks (drawing a graph), which LLMs notoriously struggle with. “Hard” questions, conversely, often involve synthesizing theoretical knowledge or writing long explanations—tasks where LLMs excel because they have memorized vast amounts of textbook data.
The Modality Gap: Can AI See?
A major differentiator in this benchmark is the inclusion of images.

Figure 2 illustrates the gap between text-only and multimodal performance. The dark blue bars represent Image-related questions. Even the best models (Claude and GPT-4V) see their performance advantage over students shrink or disappear when images are involved.
For models that cannot see images (like Mixtral or GPT-3.5), the performance on image questions naturally collapses. But even for the vision-enabled models, the ability to interpret a complex scientific diagram and reason about it lags significantly behind their ability to process text.
The Language Barrier
SciEx is multilingual, containing exams in both English and German. Despite German being a high-resource language in training data, models showed a clear bias toward English.

Figure 3 shows that across the board, models performed significantly better on English exams (dark blue bars) compared to German exams (green bars). In many cases, models that beat the average student in English fell behind the student average in German. This highlights that for scientific reasoning, the language of instruction still plays a massive role in AI competency.
Qualitative Failures: “Hallucinating” Success
Beyond the raw numbers, the expert graders noted several behavioral quirks in the AI answers:
- Verbosity: AI tends to write too much. Without time constraints, models output lengthy explanations hoping to hit the right keywords.
- Math Blindness: Models frequently failed at basic arithmetic or counting tasks required for proofs (e.g., calculating the complexity of an algorithm).
- Superficial Reasoning: On questions asking “True or False, and explain why,” models would sometimes guess “True” but provide an explanation that argued for “False,” a contradiction a human student would rarely make.
Automated Grading: The “LLM-as-a-Judge”
Perhaps the most impactful contribution of this paper for the future of AI research is the evaluation of Automatic Grading.
Running a benchmark like SciEx is expensive because it requires university lecturers to spend hours grading. If LLMs could reliably grade themselves, research could accelerate. The authors tested this by having GPT-4V, Llama3, and Mixtral grade the exam answers and comparing those grades to the human experts.
The results were highly encouraging.

Table 9 shows the Pearson Correlation between the AI grader and the human grader. A score of 1.0 means perfect agreement.
- GPT-4V achieved a correlation of 0.948 when provided with reference answers and examples (few-shot).
- This is an incredibly high level of agreement, suggesting that while GPT-4V might struggle to answer every question perfectly, it understands the material well enough to grade it accurately.
The study also found that providing the AI grader with a “Gold Standard” reference answer significantly improves its grading reliability.
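For readers unfamiliar with the metric, the sketch below computes a Pearson correlation between two sets of per-question scores, which is how agreement between the human grader and the LLM judge is measured. The numbers are made up for illustration and are not data from the paper:

```python
from statistics import mean, stdev

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical per-question scores (not data from the paper).
human_scores = [4.0, 2.5, 0.0, 3.0, 1.5, 5.0]
llm_scores   = [3.5, 2.0, 0.5, 3.0, 1.0, 4.5]
print(f"Pearson r = {pearson(human_scores, llm_scores):.3f}")  # values near 1.0 mean strong agreement
```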

Table 5 confirms this reliability by ranking the “students.” On the left is the ranking based on human grades; on the right is the ranking based on GPT-4V’s grades. The rankings are almost identical. This validates SciEx as a sustainable benchmark: future researchers can use GPT-4V to grade new models on this dataset without needing to call up the professors at KIT every time.
However, not all graders are equal.

As shown in Table 7, the ability to grade accurately varies by difficulty. Weaker models like Mixtral struggle to grade “Hard” questions (correlation drops to 0.224), likely because they don’t fully understand the complex reasoning required. GPT-4V, however, maintains high accuracy even on hard questions (0.732), making it the only viable candidate for automated grading of university-level material.
Conclusion
The SciEx paper provides a sobering yet optimistic look at the state of AI in education and science.
The key takeaways:
- AI is a “Good” Student, but not a Genius: Top-tier models like Claude and GPT-4V can pass computer science exams and even beat the average student, but they still get nearly 40% of the material wrong.
- Multimodality is the Bottleneck: While text processing is strong, the ability to reason about diagrams and charts is the primary weakness holding these models back from true scientific mastery.
- Grading is Solved: GPT-4V’s grades correlate at nearly 0.95 (Pearson) with those of human professors, which implies that educational benchmarks can be scaled up massively and opens the door to AI tutors whose feedback closely tracks a lecturer’s.
SciEx raises the bar from “can the AI guess the right option” to “can the AI do the work.” As models evolve, benchmarks like this, grounded in the messy, difficult, open-ended reality of university education, will be the true measure of whether Artificial General Intelligence is on the horizon.