Introduction
Imagine a world where every student, regardless of their location or resources, has access to a personal tutor with the knowledge of Neil deGrasse Tyson, the mathematical intuition of Terence Tao, and the chemical expertise of Marie Curie. This is the promise of Large Language Models (LLMs) like GPT-4 and Llama-3. We have rapidly transitioned from using chatbots for writing emails to relying on them for summarizing complex research papers and explaining scientific concepts.
However, there is a hidden danger in this reliance. While an LLM can eloquently explain Newton’s laws, how does it fare when pushed to the bleeding edge of scientific inquiry? Does it know the difference between a proven fact and an open scientific debate? Or worse, does it confidently hallucinate answers to problems that humanity has not yet solved?
A recent paper titled “Can LLMs replace Neil deGrasse Tyson?” by researchers from the Indian Institute of Technology, Delhi, tackles these critical questions. The authors argue that current benchmarks for AI are insufficient because they often rely on rote memorization or simple reasoning. To truly test an AI’s capability as a science communicator, we must evaluate its nuanced understanding and, crucially, its “answer abstinence”—the ability to say, “I don’t know.”
The researchers introduce a grueling new benchmark called SCiPS-QA and expose a worrying trend: not only do models struggle with complex science, but they also possess a persuasive power that can deceive even human evaluators into believing incorrect information.

As shown in Figure 1 above, the models can produce highly convincing but factually incorrect reasoning across Physics, Chemistry, and Mathematics. Whether it is incorrectly claiming air can cast a shadow due to refractive indices or misidentifying the chirality of a chemical complex, the errors are subtle and wrapped in authoritative language.
The Problem with Current Benchmarks
Before diving into the new method, it is essential to understand why we need a new dataset. The AI field is awash with benchmarks. You may have heard of MMLU (Massive Multitask Language Understanding) or GSM8K (Grade School Math). These have served as the standard for measuring progress in LLMs.
However, these existing datasets have limitations:
- Complexity Ceiling: Many datasets focus on high-school or undergraduate level problems which, while difficult for early models, are becoming trivial for state-of-the-art systems like GPT-4.
- Lack of Nuance: Scientific inquiry is rarely binary. It involves caveats, conditions, and context. Standard multiple-choice questions often fail to capture this.
- The “Open Problem” Blind Spot: Perhaps the most significant flaw is the lack of “open” problems—questions to which the scientific community does not yet have an answer. If you ask an LLM, “Is P equal to NP?” (a famous unsolved problem in computer science), a reliable communicator should explain the debate, not flip a coin and argue for a definitive “Yes.”
The authors of this paper posit that for an LLM to be a faithful science communicator, it must possess self-awareness. It needs to recognize the boundaries of its own knowledge (and human knowledge in general). Overconfidence in the face of ignorance is the hallmark of a bad teacher.
The Core Method: Introducing SCiPS-QA
To rigorously test these capabilities, the researchers developed SCiPS-QA (Specially Challenging Problems in Science – Question Answering). This is not your average pop quiz. It is a curated collection of 742 complex Boolean (Yes/No) questions derived from niche research areas.
Dataset Composition
The dataset is meticulously structured to cover a wide range of scientific disciplines, with a heavy emphasis on hard sciences where precision is non-negotiable.

As detailed in Table 1, the dataset includes:
- Physics: 242 questions
- Mathematics: 283 questions
- Chemistry: 132 questions
- Others: Theoretical CS, Astronomy, Biology, and Economics.
A critical innovation here is the division between Closed and Open problems.
- Closed Questions (510 total): These have definitive answers rooted in established scientific literature. They test the model’s retrieval and reasoning capabilities.
- Open Questions (232 total): These are questions that currently have no definitive answer in science. They are designed to test “Answer Abstinence.” The correct behavior for an LLM here is to refuse to give a definitive Yes/No or to acknowledge the uncertainty.
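To make the abstinence requirement concrete, here is a minimal sketch of how scoring along these lines could work. The abstention keywords, labels, and function name are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical scoring sketch: abstaining counts as correct on open questions,
# while closed questions require committing to the correct Yes/No answer.
ABSTAIN_MARKERS = ("unknown", "open problem", "not yet known", "cannot be determined")

def is_correct(response: str, question_type: str, gold: str | None = None) -> bool:
    """Return True if the model behaved correctly for this question (illustrative only)."""
    text = response.strip().lower()
    abstained = any(marker in text for marker in ABSTAIN_MARKERS)

    if question_type == "open":
        # For unsolved problems, acknowledging uncertainty is the desired behavior.
        return abstained
    # For closed questions, the model must commit to the established answer, not abstain.
    return not abstained and text.startswith(gold.lower())

print(is_correct("Unknown; this remains an open problem.", "open"))                        # True
print(is_correct("Yes. This follows from the spin-statistics theorem.", "closed", "Yes"))  # True
```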
Topic Decomposition
The questions are not surface-level. They dive deep into sub-fields. For instance, in Physics, the questions aren’t just about gravity; they cover Quantum Mechanics, Statistical Mechanics, and Relativity. In Mathematics, they span Topology, Number Theory, and Combinatorics.

Figure 5 illustrates this breakdown. The dominance of Quantum Mechanics in Physics and Topology in Mathematics highlights the dataset’s focus on abstract and complex reasoning, areas where human intuition often fails, making accurate AI assistance potentially valuable—but only if it is reliable.
The Difficulty Gap
To prove that SCiPS-QA is indeed a harder challenge than existing benchmarks, the researchers compared the performance of GPT-4 Turbo on SCiPS-QA against MMLU-Pro and SciQ.

The results in Figure 2 are stark. While GPT-4 Turbo scores high accuracy (approaching or exceeding 80-90%) on SciQ and MMLU-Pro, its performance drops significantly on SCiPS-QA, hovering closer to 60-70%. This confirms that the new dataset successfully exposes limitations in the model’s reasoning that were previously masked by easier benchmarks.
Experimental Setup and Metrics
The researchers benchmarked a wide array of models, including proprietary giants like GPT-4 Turbo and GPT-3.5 Turbo, as well as open-access models from the Llama-2, Llama-3, and Mistral families.
They evaluated the models using several key metrics:
- MACC (Main Response Accuracy): Accuracy of the response generated at temperature 0 (deterministic).
- MSACC (Major Stochastic Response Accuracy): The model is asked the same question 10 times at temperature 1 (randomized). This measures the accuracy of the “majority vote” answer.
- VSR (Variation in Stochastic Responses): A measure of how consistent the model is. If it answers “Yes” 5 times and “No” 5 times, it has high variation (bad consistency).
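To make these metrics concrete, here is a minimal sketch that computes a majority-vote accuracy and a simple consistency measure from ten sampled answers. The function names and the exact form of VSR are assumptions for illustration; the paper's precise definitions may differ.

```python
from collections import Counter

def majority_vote_correct(samples: list[str], gold: str) -> bool:
    """MSACC-style check: is the most frequent of the sampled answers the gold answer?"""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == gold

def response_variation(samples: list[str]) -> float:
    """A simple VSR-style measure: the fraction of samples that disagree with the majority.
    0.0 means perfectly consistent; values near 0.5 mean the model is flip-flopping."""
    _, majority_count = Counter(samples).most_common(1)[0]
    return 1.0 - majority_count / len(samples)

samples = ["Yes"] * 6 + ["No"] * 4            # ten answers drawn at temperature 1
print(majority_vote_correct(samples, "Yes"))  # True
print(response_variation(samples))            # 0.4 -> poor consistency
```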
Hallucination Quantification: SelfCheckGPT
A major part of the methodology involved quantifying “hallucination”—when a model generates nonsensical or unfaithful text. To do this, they employed a technique called SelfCheckGPT.
The core idea of SelfCheckGPT is that if a model knows a fact, it will state it consistently. If it is hallucinating, its answers will vary wildly when sampled multiple times. The researchers used three scoring variants of this consistency-checking approach.
1. BERTScore Variant: This method measures the semantic similarity between a main response sentence (\(M_i\)) and stochastic sample sentences (\(S_j^k\)).

If the sentence \(M_i\) is semantically similar to the stochastic samples, the hallucination score is low. If it differs significantly, the score is high (closer to 1).
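The formula itself is not reproduced in the post; following the original SelfCheckGPT BERTScore scorer and the notation above, it can be reconstructed as

\[
\mathcal{S}_{\text{BERT}}(i) = 1 - \frac{1}{N}\sum_{k=1}^{N} \max_{j} \, \mathcal{B}\big(M_i, S_j^{k}\big),
\]

where \(N\) is the number of stochastic samples and \(\mathcal{B}(\cdot,\cdot)\) denotes the BERTScore between two sentences: for each sample, the sentence most similar to \(M_i\) is found, and the resulting similarities are averaged.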
2. NLI (Natural Language Inference) Variant: This uses a separate model to check if the stochastic samples “contradict” the main response. It calculates the probability of contradiction.

The final hallucination score is the average of these contradiction probabilities across all samples:
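The exact formulas are not reproduced in the post, but a reconstruction consistent with the original SelfCheckGPT NLI scorer, in the notation above, is

\[
P\big(\text{contradict} \mid M_i, S^{k}\big) = \frac{\exp(z_c)}{\exp(z_c) + \exp(z_e)},
\qquad
\mathcal{S}_{\text{NLI}}(i) = \frac{1}{N}\sum_{k=1}^{N} P\big(\text{contradict} \mid M_i, S^{k}\big),
\]

where \(z_c\) and \(z_e\) are the NLI model's logits for the “contradiction” and “entailment” classes, and \(S^{k}\) denotes the \(k\)-th stochastic sample.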

3. Prompt Variant: Here, an external LLM (like GPT-3.5) acts as a judge, explicitly asked if the sentences support each other.
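The scoring formula is not shown in the post; a reconstruction following the original SelfCheckGPT prompt-based scorer maps each judge verdict to a number and averages over the samples:

\[
x_i^{k} =
\begin{cases}
0 & \text{if the judge says sample } S^{k} \text{ supports } M_i,\\
1 & \text{otherwise,}
\end{cases}
\qquad
\mathcal{S}_{\text{Prompt}}(i) = \frac{1}{N}\sum_{k=1}^{N} x_i^{k}.
\]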

These scoring functions let the researchers move beyond a vague sense that a model is wrong and assign a concrete “Hallucination Score” to each generated scientific explanation.
Results and Analysis
The experiments yielded several groundbreaking insights into the current state of AI science communicators.
1. The Open-Source Contender
While proprietary models generally led the pack, the gap is closing. As shown in Table 2 below, Llama-3-70B emerged as a formidable competitor.

Notice the MACC (accuracy) column. Llama-3-70B achieves a score of 0.693, actually surpassing GPT-4 Turbo’s 0.646 in this specific metric. This is a significant moment for open-access models, suggesting that with sufficient parameter size and training, they can rival the industry leaders in complex scientific reasoning. However, GPT-4 Turbo generally maintained better consistency (lower VSR scores).
2. The Failure of Abstinence
One of the most concerning findings relates to the “Open Questions”—the unsolved scientific problems. A perfect science communicator, asked a question like “Do smooth solutions to the three-dimensional Navier–Stokes equations always exist?”, would acknowledge that this is an open problem rather than committing to a definitive “Yes” or “No.”
However, the OMACC (Open Main Accuracy) scores reveal a systemic failure. Most models performed poorly here. They tend to hallucinate a definitive answer rather than admitting ignorance. They lack the “scientific humility” required for research.
3. Verification is Broken
Can we just ask an LLM to double-check its own work? The researchers tested this by asking GPT-4 Turbo and GPT-3.5 Turbo to verify reasoning passages. They scored them on “Factuality,” “Convince-factor,” and “Information Mismatch.”
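The paper's exact verification prompt is not reproduced in this post. As a rough, hypothetical sketch of what such an LLM-as-verifier call might look like (the prompt wording and the `rate_factuality` helper are illustrative assumptions, not the authors' setup):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_factuality(question: str, reasoning: str, model: str = "gpt-4-turbo") -> str:
    """Hypothetical verifier: ask a model to rate a reasoning passage on a 1-5 scale."""
    prompt = (
        "You are verifying a scientific explanation.\n"
        f"Question: {question}\n"
        f"Reasoning: {reasoning}\n"
        "Rate the factuality of the reasoning from 1 (completely wrong) to 5 "
        "(fully correct), then briefly justify your rating."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```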

Figure 3 reveals a chaotic reality. The blue lines (correct responses) and red lines (incorrect responses) should be distinct. Ideally, all incorrect responses would get a score of 1, and correct ones a score of 5. Instead, we see extensive overlap. GPT-4 Turbo struggles to distinguish its own correct responses from its incorrect ones. If the model cannot reliably verify itself, it cannot be deployed as a standalone expert.
4. The Human Deception
Perhaps the most alarming result came from the human evaluation. The researchers asked human experts to rate how “convincing” the model’s reasoning was.

Look at the graph on the left in Figure 4 (“with answer”). The red line represents incorrect responses. A significant portion of incorrect responses received high “convince factor” scores (3, 4, or even 5) from humans.
This means GPT-4 Turbo is persuasive enough to trick human evaluators into accepting false scientific reasoning. The model adopts an authoritative, academic tone that masks logical fallacies, creating a “convincingness trap.”
5. Hallucination Detection is Inconsistent
The researchers analyzed the distribution of hallucination scores generated by the SelfCheckGPT methods.



Figures 6, 7, and 8 show the distribution of these scores. While there are statistical differences (confirmed by Welch’s t-tests shown in Table 6 below), the distributions often overlap significantly.

For example, in Figure 8 (SelfCheckGPT with Prompt), GPT-3.5 Turbo (red line) assigns lower hallucination scores than GPT-4 Turbo, implying it is more confident (or perhaps overconfident). The lack of a clear, binary separation between “hallucinated” and “faithful” text in these plots indicates that even our best automated detection methods are not yet a silver bullet for complex science.
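For readers who want to run this kind of comparison on their own score distributions, Welch's t-test (which does not assume equal variances) is a one-liner with SciPy; the arrays below are placeholders, not the paper's data:

```python
import numpy as np
from scipy import stats

# Placeholder hallucination scores for correct vs. incorrect responses.
scores_correct = np.array([0.21, 0.35, 0.18, 0.40, 0.27])
scores_incorrect = np.array([0.55, 0.62, 0.48, 0.71, 0.59])

# equal_var=False turns the standard t-test into Welch's t-test.
t_stat, p_value = stats.ttest_ind(scores_correct, scores_incorrect, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```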
Finally, Table 7 provides a raw look at the verification data.

The table reinforces the finding that models—particularly GPT-3.5 Turbo—assign high “Factuality” scores (4s and 5s) to a frightening number of incorrect responses (the red bars).
Conclusion and Implications
The paper “Can LLMs replace Neil deGrasse Tyson?” serves as a vital reality check in the AI hype cycle. While Large Language Models have made tremendous strides, they are not yet ready to replace human experts in science communication.
The introduction of SCiPS-QA provides the community with a necessary, high-bar benchmark. The results from this dataset show that while models like Llama-3-70B and GPT-4 Turbo are incredibly capable, they suffer from critical flaws:
- Lack of Humility: They struggle to identify open, unsolved problems.
- Self-Delusion: They cannot reliably verify their own outputs.
- Persuasiveness over Truth: They generate reasoning so convincing that it deceives human experts.
What does this mean for students and researchers? It means LLMs should be viewed as assistants, not authorities. When an LLM explains a complex concept in Quantum Mechanics or Topology, it may sound like Neil deGrasse Tyson, but there is a non-zero chance it is confidently making things up. Until models improve their “answer abstinence” and self-verification capabilities, the human element—critical thinking and skepticism—remains the most important tool in the scientific toolkit.