Introduction
In the last few years, the term “Large Language Model” (LLM) has become synonymous with chatbots that can write emails, debug code, or compose poetry. However, a quiet revolution is happening in a sector far more critical to human progress: the natural sciences.
Biology, chemistry, physics, and mathematics are drowning in data. The rate of publication has far outpaced any human’s ability to read, let alone synthesize, new information. Furthermore, scientific data is distinct; it isn’t just English text. It involves molecular graphs, protein sequences, mathematical formulas, and complex imagery.
This brings us to a fascinating problem: Can the same architecture that powers ChatGPT be taught the “languages” of nature? Can we treat a DNA sequence or a chemical reaction just like a sentence in a book?
The paper we are diving into today, A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery, argues that the answer is a resounding “yes.” Unlike previous surveys that focus narrowly on just medical LLMs or just chemical LLMs, this work provides a “Grand Unified Theory” of scientific AI. It explores how researchers are adapting the Transformer architecture to act as a universal engine for scientific discovery, covering over 260 models across every major scientific discipline.
In this deep dive, we will unpack the unified framework these researchers propose, explore how LLMs are being built for specific domains like quantum physics and genomics, and look at the real-world implications of “AI Scientists.”
Background: The Language of Science
To understand how LLMs apply to science, we first need to abstract our definition of “language.” To a computer, language is simply a sequence of tokens (discrete units of information) that follow specific statistical rules.
English is a sequence of words. But consider the alternatives:
- Proteins are sequences of amino acids (represented by letters like M, V, H, L…).
- Mathematics is a sequence of symbols and logic.
- Molecules can be represented as text strings (SMILES strings) like `C1=CC=CC=C1` (benzene).
If scientific data can be “linearized”—turned into a sequence—it can be fed into a Transformer. This realization has shifted the scientific workflow. We are moving from specialized, hand-crafted models for every specific task (like a model just for predicting toxicity) to unified foundation models that understand the underlying “grammar” of the scientific domain.
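To make this concrete, here is a minimal, self-contained Python sketch of “linearization.” The sequences and character-level vocabularies are purely illustrative (real scientific LLMs use far larger, carefully designed tokenizers), but the principle is the same: once data is a string, it becomes integer tokens a Transformer can consume.

```python
# A toy illustration of "science as a sequence": once a protein or molecule is
# written as a string, it can be tokenized much like an English sentence.
# The sequences and vocabularies here are illustrative, not from any real model.

protein = "MVHLTPEEK"      # a short protein fragment in single-letter amino acid codes
smiles = "C1=CC=CC=C1"     # benzene written as a SMILES string

def char_tokenize(sequence: str, vocab: dict) -> list:
    """Map each character to an integer ID, the form a Transformer consumes."""
    return [vocab[ch] for ch in sequence]

# Build tiny vocabularies on the fly from the characters we actually see.
protein_vocab = {ch: i for i, ch in enumerate(sorted(set(protein)))}
smiles_vocab = {ch: i for i, ch in enumerate(sorted(set(smiles)))}

print(char_tokenize(protein, protein_vocab))  # one integer per residue
print(char_tokenize(smiles, smiles_vocab))    # one integer per SMILES character
```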
Core Method: A Unified Architecture for Science
The authors of this paper have categorized the hundreds of existing scientific LLMs into a coherent framework. They observed that regardless of whether a model is studying galaxy clusters or enzyme folding, the engineering strategies fall into three distinct “columns.”
Understanding this framework is key to understanding the entire field.

As illustrated in Figure 1, the landscape is divided by the model architecture and the pre-training objective.
Column 1: The Encoder Era (Understanding)
On the left side of Figure 1, we see Encoder-only models. These are direct descendants of BERT (Bidirectional Encoder Representations from Transformers).
- The Mechanism: These models use Masked Language Modeling (MLM). You take a sequence (be it a sentence from a paper or a DNA strand), hide a specific token (`[MASK]`), and ask the model to guess what belongs there based on the context.
- Scientific Application:
- 1.A (Text): Models like SciBERT are trained on millions of scientific papers. They learn that “penicillin” relates to “bacteria” much like standard BERT learns “king” relates to “queen.”
- 1.C (Chemistry): Models like ChemBERTa take chemical strings (SMILES) and mask atoms. By guessing the missing atom, the model learns the rules of chemistry—valency, ring structures, and stability.
- 1.D (Biology): DNABERT treats DNA sequences as sentences, masking sections of the genetic code to learn evolutionary patterns.
These models are exceptional at understanding. They are typically used for tasks like classification (Is this molecule toxic? Is this paper about geology?) or property prediction.
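To see masked language modeling in action, the SciBERT checkpoint mentioned above is publicly available through the Hugging Face transformers library and can be queried with a fill-mask pipeline. The example sentence below is my own and the completions will vary; treat this as a quick sketch, not code from the survey.

```python
# A minimal masked-language-modeling demo with SciBERT via Hugging Face transformers.
# The checkpoint is the publicly released SciBERT; the sentence is an arbitrary example.
from transformers import pipeline

fill = pipeline("fill-mask", model="allenai/scibert_scivocab_uncased")

# Hide one token and let the bidirectional encoder guess it from context.
for candidate in fill("penicillin inhibits the growth of [MASK] bacteria."):
    print(f"{candidate['token_str']:>15}  score={candidate['score']:.3f}")
```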
Column 2: The Generative Era (Creation)
The middle column represents the current wave of Generative AI (Decoder-only or Encoder-Decoder). These are inspired by the GPT family and LLaMA.
- The Mechanism: These models use Next Token Prediction. They aren’t just filling in blanks; they are predicting the future. Given a sequence, what comes next? This allows them to generate new content.
- The Challenge of Structure: The biggest hurdle here is “Linearization.” Text is naturally linear. But how do you feed a 3D crystal structure or a 2D table into a GPT model?
- 2.B (Tables): Models like TableLlama flatten table cells into a text string so the LLM can “read” rows and columns to answer questions about data.
- 2.C (Crystals): CrystalLLM represents 3D materials by converting their atomic coordinates and lattice vectors into a text string, enabling the model to “hallucinate” stable new materials that don’t yet exist (a toy example of this linearization is sketched after this list).
- Instruction Tuning: Many of these models are further refined using “Instruction Tuning,” where they are trained on Q&A pairs (e.g., “Synthesize a molecule that targets Receptor X”) to act as helpful assistants.
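As a toy illustration of the linearization step, the sketch below flattens a simple cubic crystal (cesium chloride) into a text string. The serialization format here is invented for illustration; CrystalLLM’s actual format differs in its details.

```python
# A toy "linearization" of a crystal into plain text, in the spirit of CrystalLLM.
# Format and values are illustrative: CsCl in a simple cubic cell (~4.12 angstroms),
# listed as lattice lengths followed by each atom with fractional coordinates.
lattice = (4.12, 4.12, 4.12)
atoms = [("Cs", 0.0, 0.0, 0.0), ("Cl", 0.5, 0.5, 0.5)]

lines = [" ".join(f"{x:.2f}" for x in lattice)]
for symbol, a, b, c in atoms:
    lines.append(f"{symbol} {a:.2f} {b:.2f} {c:.2f}")

prompt = "\n".join(lines)
print(prompt)
# 4.12 4.12 4.12
# Cs 0.00 0.00 0.00
# Cl 0.50 0.50 0.50
# A model fine-tuned on many such strings can then generate new ones token by
# token, i.e. propose candidate materials that do not yet exist.
```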
Column 3: The Multi-Modal Bridge (Connection)
The right column addresses a fundamental problem: Text describes science, but it isn’t the thing itself. A description of a lung X-ray is not the X-ray.
- The Mechanism: This approach uses Contrastive Learning, similar to the CLIP model. It uses two encoders—one for text and one for the data (image, graph, or structure). The model is trained to move the mathematical representation of an image and its matching text caption closer together in vector space, while pushing mismatched pairs apart (a minimal version of this objective is sketched after this list).
- Scientific Application:
- 3.B (Proteins): ProtST aligns protein sequences with their textual descriptions (function, organism, etc.). This allows you to search for proteins using natural language.
- 3.C (Molecules): Text2Mol matches molecular graphs with their descriptions, allowing chemists to retrieve molecules based on text queries like “sweet-smelling ester.”
- 3.D (Vision): In medicine, models align chest X-rays with radiology reports. This connects the visual features of a tumor directly to the medical terminology used to describe it.
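The contrastive objective described above can be written down in a few lines of PyTorch. This is a generic CLIP-style loss with random tensors standing in for encoder outputs; ProtST and Text2Mol each have their own encoders and training details.

```python
# A minimal CLIP-style contrastive objective: pull matching (data, text) pairs
# together in embedding space and push mismatched pairs apart.
import torch
import torch.nn.functional as F

batch, dim = 8, 256
data_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in for a protein/graph/image encoder
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in for a text encoder

temperature = 0.07
logits = data_emb @ text_emb.T / temperature  # similarity of every data item with every caption

# The i-th data item should match the i-th caption, so the target is the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```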
The Evolution of Scientific LLMs: A Field-by-Field Analysis
The paper provides an exhaustive taxonomy of how these three architectures are applied across different scientific domains. Let’s analyze the specific breakthroughs in each field.
1. General Science: The Foundation
Before specializing, models need to understand the general discourse of science.

Table A1 highlights the progression in general science LLMs.
- Early days (2019): We see models like SciBERT (Type 1, Column 1 in Figure 1). These were relatively small (110 million parameters) and focused on parsing the dense vocabulary of scientific literature.
- The Shift: As we move down the table, the architectures shift to GPT and LLaMA variants (Type 2).
- The Giant: A standout here is Galactica (Type 2). Unlike SciBERT, which simply “read” papers, Galactica was trained to write them, solve equations, and even predict protein functions. It represents a move toward an “AI Research Assistant” that can handle multiple modalities.
- Graph Integration: Note the “L+G” (Language + Graph) entries like OAG-BERT. Science doesn’t exist in a vacuum; papers cite other papers. These models ingest the citation graph to understand the context and impact of research, not just the content.
2. Mathematics: From Calculation to Reasoning
Math poses a unique challenge: LLMs are notoriously bad at arithmetic because they predict tokens based on probability, not logic. However, the survey shows how this is changing.

Table A2 reveals two distinct approaches:
- Text-Based Reasoning (Type 2): Models like Minerva and Llemma are trained on massive repositories of mathematical web pages and arXiv papers. They leverage “Chain of Thought” prompting, where the model produces the steps of a solution rather than just the answer. This mimics human deduction (a schematic prompt of this kind appears after this list).
- Visual Geometry (Type 3/Multi-modal): Look at G-LLaVA and Inter-GPS. Geometry problems are visual. You cannot solve for \(x\) if you cannot see the triangle. These models use vision encoders to “see” the diagram and language models to reason through the theorem.
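Here is a schematic example of Chain-of-Thought prompting. The problems and wording are invented for illustration; they are not drawn from Minerva’s or Llemma’s actual training or evaluation data.

```python
# A schematic Chain-of-Thought prompt: the model is shown a worked example and
# asked to continue with intermediate steps rather than a bare answer.
prompt = """Q: A train travels 60 km in 45 minutes. What is its average speed in km/h?
A: Let's think step by step.
45 minutes is 45/60 = 0.75 hours.
Average speed = distance / time = 60 / 0.75 = 80 km/h.
The answer is 80 km/h.

Q: A rectangle has a perimeter of 36 and one side of length 10. What is its area?
A: Let's think step by step.
"""
# A decoder-only model is expected to continue with the worked steps
# (the other side is 36/2 - 10 = 8, so the area is 10 * 8 = 80)
# before stating the final answer.
print(prompt)
```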
3. Physics: The Frontier of Theory
Physics has been slightly slower to adopt LLMs compared to bio/chem, largely because physics relies heavily on complex equations that are difficult to tokenize effectively.

As shown in Table A3, the field is currently dominated by Astronomy.
- astroBERT and AstroLLaMA are fine-tuned on astrophysics literature (NASA ADS, arXiv).
- Why Astronomy? It is a highly descriptive field with massive text archives.
- Future Potential: The paper notes emerging applications in theoretical physics, such as using Transformers to predict coefficients in quantum field theory or to design quantum experiments (Type 2).
4. Chemistry: The Language of Matter
Chemistry is perhaps the most mature field for LLMs outside of pure language, because chemists invented “SMILES”—a way to write molecules as text strings—decades ago.

Table A4 is dense with innovation, highlighting the battle between representations:
- SMILES vs. Graphs: Early models (ChemBERTa) treated molecules as text (SMILES). However, molecules are actually 3D graphs. Newer models like Graph-Text Multi-modal LLMs (bottom of the table, like GIT-Mol) try to have it both ways: they ingest the 2D graph structure and the text description (see the SMILES-to-graph sketch after this list).
- The “Agent” Revolution: Models like ChemCrow and ChemLLM (Type 2) are not just passive predictors. They are agents capable of planning a synthesis pathway and, in some setups, interfacing with robotic labs to actually mix chemicals. This is a direct application of Instruction Tuning—teaching the model to act as a chemist, not just a chemistry encyclopedia.
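To make the SMILES-versus-graphs distinction concrete, the sketch below uses RDKit to expand a SMILES string (aspirin, chosen arbitrarily) into an explicit list of atoms and bonds, the kind of graph representation that graph-text models consume alongside a description.

```python
# Converting a SMILES string into an explicit molecular graph with RDKit.
# The molecule (aspirin) is an arbitrary example.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin

# Nodes: atoms with their element symbols.
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]

# Edges: bonds between atom indices, with bond types.
bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
         for b in mol.GetBonds()]

print(atoms)  # ['C', 'C', 'O', 'O', 'C', ...]
print(bonds)  # [(0, 1, 'SINGLE'), (1, 2, 'DOUBLE'), ...]
```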
5. Biology and Medicine: The Heavyweights
This domain has the largest number of models, driven by the immense value of drug discovery and automated healthcare.

Table A5 shows the sheer scale of this sector. We can split this into two worlds:
A. The Language of Doctors (Text/EHRs)
- Models like ClinicalBERT and Med-PaLM are trained on Electronic Health Records (EHRs) and medical exams.
- Med-PaLM 2 is a landmark here, achieving expert-level performance on the US Medical Licensing Exam. These models are moving from entity extraction (finding disease names in notes) to full diagnostic dialogue.
B. The Language of Life (Sequences)
- This is where Type 1 (Encoder) models shine. ESM-2 (Evolutionary Scale Modeling) is a massive protein language model.
- By masking amino acids in millions of protein sequences, ESM-2 learned the hidden patterns of biology. It became so effective that it can predict the 3D structure of a protein just from its sequence, rivaling the accuracy of physics-based simulations but at a fraction of the computational cost.
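A small ESM-2 checkpoint is publicly available through Hugging Face transformers, so the masked-residue idea can be demonstrated in a few lines. The protein fragment below is an arbitrary string of amino-acid letters, and this demo is my own sketch rather than an experiment from the survey.

```python
# Masked amino-acid prediction with a small public ESM-2 checkpoint.
# One residue is replaced by <mask>; the model proposes plausible amino acids
# purely from the surrounding sequence context.
from transformers import pipeline

unmask = pipeline("fill-mask", model="facebook/esm2_t6_8M_UR50D")

sequence = "MKTAYIAKQR<mask>ISFVKSHFSRQLEERLGLIEVQ"
for guess in unmask(sequence):
    print(guess["token_str"], round(guess["score"], 3))
```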
6. Geoscience: Modeling the Earth
Finally, we look at the macro scale.

Table A6 introduces a fascinating modality: Climate Time Series.
- Pangu-Weather and FourCastNet treat weather data (wind speed, pressure, temperature) as a sequence of tokens, similar to an image or a sentence.
- By applying Transformers to this data, these models can forecast global weather faster and often more accurately than traditional numerical weather prediction models, which require supercomputers to solve fluid dynamics equations (a minimal patch-tokenization sketch follows this list).
- Urban Planning: Models like UrbanCLIP align satellite imagery with urban indicators, helping researchers understand city development through a multi-modal lens.
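Here is a minimal sketch of the “weather as tokens” idea: cutting one gridded atmospheric snapshot into ViT-style patch tokens. The grid size and patching scheme are made up for illustration; Pangu-Weather and FourCastNet each use their own specialized architectures on top of this general principle.

```python
# Turning a gridded atmospheric field into a sequence of patch "tokens".
# Shapes are illustrative: 3 variables (e.g. wind speed, pressure, temperature)
# on a 64 x 128 latitude-longitude grid, cut into 8 x 8 patches.
import torch

field = torch.randn(3, 64, 128)
patch = 8

tokens = (field
          .unfold(1, patch, patch)   # split latitude into patches  -> (3, 8, 128, 8)
          .unfold(2, patch, patch)   # split longitude into patches -> (3, 8, 16, 8, 8)
          .permute(1, 2, 0, 3, 4)    # group values by patch location
          .reshape(-1, 3 * patch * patch))

print(tokens.shape)  # torch.Size([128, 192]): 128 tokens, each a flattened 3x8x8 patch
# Each token is then embedded and fed to a Transformer that predicts the next
# time step, much as a language model predicts the next word.
```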
Implications for Scientific Discovery
The survey moves beyond just listing models to discuss how they are actually changing the scientific process. The authors identify several stages where LLMs are intervening:
- Hypothesis Generation: Instead of a human reading 100 papers to find a gap in the research, an LLM can scan 10,000 and suggest novel connections. Tools like SciMon are already exploring this, generating “research ideas” based on prior literature.
- Experiment Design: In chemistry and biology, LLMs are being used to write the code for robotic laboratories. A researcher can type “Synthesize aspirin,” and the LLM translates that into the specific machine instructions required.
- Review and Evaluation: While controversial, LLMs are increasingly used to summarize papers and provide initial feedback, acting as a first-pass peer reviewer.
Challenges and Future Directions
Despite the optimism, the paper highlights significant hurdles:
- Hallucination: In creative writing, making things up is a feature. In science, it is a bug. A “plausible-sounding” but non-existent chemical reaction can be dangerous.
- Specialized “Tail” Knowledge: General scientific models might know “chemistry,” but do they know the specific properties of a rare alloy described in only two papers from 1985? The authors suggest that Knowledge Graphs must be integrated to anchor LLMs in factual reality.
- Modal Disconnect: While we have multi-modal models, true integration is still lagging. We need models that can seamlessly reason across text, a molecular graph, and a microscopy image simultaneously to solve complex biological problems.
Conclusion
The survey presented by Zhang et al. paints a picture of a scientific ecosystem on the brink of a major transformation. By viewing atoms, genes, and equations as “languages,” researchers have unlocked the ability to use the most powerful AI architectures available today for scientific discovery.
We are moving away from the era of “Text Mining”—where computers simply searched for keywords in papers—and into the era of Scientific Understanding. Whether it is predicting the weather, designing a new protein, or solving a math proof, the underlying architecture is converging. The siloed tools of the past are being replaced by unified, pre-trained scientific brains.
For the student of science, this means the future curriculum will likely involve not just learning the periodic table or the Krebs cycle, but understanding the attention mechanisms and tokenizers that will help navigate them.