Introduction
In the rapid evolution of Artificial Intelligence, we have become accustomed to Large Language Models (LLMs) performing feats of impressive logical deduction. Ask GPT-4 to solve a complex calculus problem or debug a Python script, and it usually shines. These are objective tasks—problems with a single, verifiable standard answer. The reasoning path is clear-cut, and success is binary: the answer is either right or wrong.
But what happens when we ask an AI a question that doesn’t have a right answer?
Consider questions like “How does literature shape cultural identity?” or “Does technology in education enhance or hinder learning?” These are subjective topics. To answer them well, one cannot simply follow a linear logical path. One must exercise comprehensive thinking (looking at the problem from all angles), reflective thinking (critiquing one’s own assumptions), and creative thinking (offering novel insights).
As it turns out, while LLMs are math whizzes, they are often philosophical novices. When faced with open-ended subjective questions, they tend to provide shallow, generic, or one-sided responses. They lack the ability to “chew” on a thought, critique it, and evolve it.

As illustrated in Figure 1, standard prompting techniques like Chain-of-Thought (CoT) work wonders for objective questions (left) but struggle to trigger the deep, creative, and reflective thinking required for subjective questions (right).
In this post, we will explore a fascinating research paper titled “Subjective Topic meets LLMs” by researchers from Tsinghua University and The Chinese University of Hong Kong. They propose a novel framework called NeoN, inspired by the philosophical principle of the “Negation of Negation.” This method teaches LLMs to argue against themselves, finding flaws and missing perspectives to spiral upward toward a more profound, “human-like” answer.
The Problem: The Subjectivity Gap
Current benchmarks for LLMs are heavily skewed toward objective reasoning—arithmetic, symbolic logic, and commonsense QA. These benchmarks reward models for converging on a single correct point. However, human intelligence is defined just as much by divergent thinking—the ability to explore a solution space where multiple conflicting truths might exist simultaneously.
The researchers identified that existing methods, such as Chain-of-Thought (CoT), rely on a linear progression of thought. While perfect for math, this linearity is a handicap for subjective topics. A linear path rarely captures the nuance of a complex social issue.
Introducing the SJTP Benchmark
To tackle this, the authors first needed a way to measure subjective reasoning. They created the SJTP (SubJective ToPics) dataset. Unlike standard datasets with multiple-choice answers, SJTP consists of open-ended questions that require free-form responses.
The dataset is structured around three specific types of subjective topics, each testing a different facet of higher-order thinking:
- Viewpoint Discourse: Tests comprehensiveness (e.g., “What is the impact of social media?”).
- Binary Dialectics: Tests in-depth analysis and reflection (e.g., “Should students be required to wear school uniforms?”).
- Practical Analysis: Tests creativity and constructive opinions (e.g., “How can we preserve cultural heritage?”).

As shown in Table 1 above, these topics span eight diverse fields, from Social Ethics to Technology and Education, ensuring the dataset covers a wide breadth of human knowledge. The goal isn’t just to generate text, but to score highly on dimensions like comprehensiveness, reflection, and creativity.

Figure 5 illustrates the distribution of these topics. By forcing models to engage with Law, Psychology, History, and Art, the benchmark ensures that an LLM cannot rely on memorized templates alone. It must engage in genuine reasoning.
The Solution: The NeoN Framework
The core contribution of this paper is the NeoN framework. The name stands for Negation of Negation, a concept borrowed directly from dialectical philosophy, specifically the works of Engels and Hegel.
The Philosophy Behind the Code
In Hegelian dialectics, truth is not a static destination. It is a process. Things develop through a cycle:
- Thesis: An initial state or proposition.
- Antithesis (Negation): The contradiction or critique of the thesis.
- Synthesis (Negation of Negation): The resolution that transcends the conflict, incorporating the best of both previous stages.
The authors applied this to Large Language Models. They hypothesized that if an LLM is forced to “negate” its own answer—to actively look for flaws, missing perspectives, or counter-arguments—it will break out of its established thinking patterns.
How NeoN Works
The NeoN framework operates in a three-stage pipeline. Importantly, this is a zero-shot framework. It does not require training the model on thousands of examples; it simply guides the model’s inference process using carefully designed prompts.

Figure 2 provides a high-level overview of the workflow (on the right side). Let’s break down the three distinct steps.
Step 1: Direct Answer (The Thesis)
First, the model is asked to generate a standard answer to the question using its inherent logical reasoning. This ensures the baseline response is coherent and grounded.
\[ \mathbf{r}_0 = \mathcal{M}(\mathcal{Q} \oplus \mathcal{P}_1), \]
Here, \(\mathcal{Q}\) is the question and \(\mathcal{P}_1\) is a direct prompt like “Let’s generate the answer.” The result, \(\mathbf{r}_0\), is the initial “Thesis.”
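To ground the notation, here is a minimal Python sketch of this first step. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion client you use (it is not an API from the paper); the prompt wording follows the \(\mathcal{P}_1\) example above.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call.
    Replace with your own OpenAI / LLaMA / Mistral client."""
    raise NotImplementedError

def direct_answer(question: str) -> str:
    """Thesis: r_0 = M(Q ⊕ P_1), where ⊕ is prompt concatenation."""
    p1 = "Let's generate the answer."
    return call_llm(f"{question}\n{p1}")
```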
Step 2: Iterative Negation (The Antithesis)
This is where NeoN diverges from standard prompting. Instead of accepting \(\mathbf{r}_0\), the system casts the LLM as a “negator.” The model is prompted to critique the previous response.
The prompt might look like: “Negate the above responses to deduce a more perfect answer.”
The model looks at its previous answer and asks:
- What is missing?
- Is this perspective too narrow?
- Are there counter-examples?
\[ \mathbf{r}_n = \mathcal{M}(\mathcal{Q} \oplus \mathbf{r}_0 \oplus \cdots \oplus \mathbf{r}_{n-1} \oplus \mathcal{P}_2), \]

As shown in the equation above, the new response \(\mathbf{r}_n\) is generated based on the history of all previous responses. This process can repeat multiple times (\(n\) rounds).
The Stopping Criteria: How does the model know when to stop arguing with itself? The system checks the semantic similarity between the current response and the previous ones.
- If the new response is very different, it means the model is finding new angles. The negation continues.
- If the new response is very similar to the previous one (similarity > threshold \(\epsilon\)), it implies the model has exhausted its ability to find flaws. The answer is approaching “perfection” relative to the model’s capabilities, and the loop stops.
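Here is a sketch of the negation loop and its stopping rule, under the same assumptions as before: `call_llm` is a hypothetical client, and `difflib.SequenceMatcher` stands in as a crude textual proxy for the semantic-similarity check the paper describes (an embedding-based cosine similarity would be closer to the real thing). The round cap and threshold values are illustrative, not the paper’s settings.

```python
from difflib import SequenceMatcher

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def similarity(a: str, b: str) -> float:
    # Crude textual proxy; the paper checks semantic similarity between responses.
    return SequenceMatcher(None, a, b).ratio()

def negation_loop(question: str, r0: str,
                  max_rounds: int = 3, epsilon: float = 0.9) -> list[str]:
    """Antithesis: r_n = M(Q ⊕ r_0 ⊕ ... ⊕ r_{n-1} ⊕ P_2)."""
    p2 = "Negate the above responses to deduce a more perfect answer."
    responses = [r0]
    for _ in range(max_rounds):
        history = "\n\n".join(responses)
        r_new = call_llm(f"{question}\n{history}\n{p2}")
        responses.append(r_new)
        # Stop once the new response barely differs from the previous one:
        # the model has exhausted its ability to find new flaws.
        if similarity(r_new, responses[-2]) > epsilon:
            break
    return responses
```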
Step 3: Integration and Unification (The Synthesis)
Once the negation loop finishes, the model has a collection of responses: the initial thought, the critique, the critique of the critique, and so on.
The final step is to synthesize these into one cohesive, high-quality response.
\[ \mathcal{R} = \mathcal{M}(\mathcal{Q} \oplus \mathbf{r}_0 \oplus \cdots \oplus \mathbf{r}_n \oplus \mathcal{P}_3), \]
Using a prompt \(\mathcal{P}_3\) (e.g., “Based on all the previous answers, generate a perfect answer”), the model integrates the diverse perspectives it generated into a final output \(\mathcal{R}\).
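A matching sketch of the synthesis step, with the same hypothetical `call_llm` client and the \(\mathcal{P}_3\) wording quoted above:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def synthesize(question: str, responses: list[str]) -> str:
    """Synthesis: R = M(Q ⊕ r_0 ⊕ ... ⊕ r_n ⊕ P_3)."""
    p3 = "Based on all the previous answers, generate a perfect answer."
    history = "\n\n".join(responses)
    return call_llm(f"{question}\n{history}\n{p3}")
```

Chaining the three sketches (direct answer, negation loop, synthesis) reproduces the zero-shot pipeline end to end: no fine-tuning, just three prompts and a similarity check.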
Why Negation is Better than Reflection
You might ask, “Isn’t this just self-reflection?”
The authors argue that “Negation” is stronger than standard “Self-Refine” or “Reflection” techniques. Standard reflection usually relies on feedback that checks for specific errors or follows a reward signal. Negation is broader: it forces an adversarial relationship with the previous text and demands the exploration of unconsidered perspectives rather than just fixing grammatical or factual errors. In effect, it simulates a debate with an unlimited number of opposing parties, forcing the model to spiral upward in quality.
Experiments and Results
To validate NeoN, the researchers tested it on several LLMs, including GPT-3.5, GPT-4, LLaMA-2, and Mistral. They compared NeoN against strong baselines like Zero-Shot-CoT (Chain of Thought), Self-Consistency, and Self-Refine.
Evaluation Metrics
Evaluating subjective text is notoriously difficult. To solve this, the authors developed three automated evaluation indicators, using GPT-4 as the judge:
- \(SCR_{dim}\): Scores the response based on six dimensions (Clarity, Logicality, Correctness, Comprehensiveness, Innovation, Depth).
- \(SCR_{point}\): Generates specific scoring points for the question and checks if the answer hits them.
- \(SCR_{sol}\): Compares the response to a high-quality “Gold Standard” solution generated by GPT-4.
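As an illustration of the LLM-as-judge setup, here is one way the \(SCR_{dim}\) scorer could be prompted. The rubric wording, the 1-to-10 scale, and the JSON output format are assumptions for the sketch, not the paper’s actual evaluation prompt; `call_llm` is again a hypothetical client pointed at the judge model (GPT-4 in the paper).

```python
import json

DIMENSIONS = ["Clarity", "Logicality", "Correctness",
              "Comprehensiveness", "Innovation", "Depth"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the judge model."""
    raise NotImplementedError

def scr_dim(question: str, answer: str) -> dict:
    """Score an answer along the six SCR_dim dimensions (assumed 1-10 scale)."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rate the answer from 1 to 10 on each of: {', '.join(DIMENSIONS)}. "
        "Reply with a JSON object mapping each dimension to its score."
    )
    return json.loads(call_llm(prompt))
```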
Performance on Subjective Topics
The results were compelling. NeoN consistently outperformed all baselines across different models.

Look at the radar chart in Figure 3 (left). The red line (NeoN) encloses the largest area.
- Innovation & Depth: Notice how NeoN scores significantly higher on “Innovation” and “Depth” compared to other methods. This confirms that the negation process successfully pushes the model out of shallow, generic reasoning.
- Comprehensiveness: By forcing the model to view the problem from opposing angles, the final answer naturally covers more ground.
Table 4 (right side of the image) offers an ablation study. It compares NeoN against:
- NeoN_direct: Just generating multiple answers without negation.
- NeoN_rethink: Just “rethinking” without the explicit instruction to “negate.”

The full NeoN framework wins, proving that the specific act of negation (challenging the premise) is the key driver of quality.
Does it help with Objective Tasks?
One of the most surprising findings of the paper is that this “philosophical” approach also improved performance on rigid, objective tasks like Math (GSM8K) and Commonsense Reasoning (CSQA).
Why would dialectics help with math?
It turns out that “negating” a math answer functions as a rigorous verification step. If the model tries to negate its answer and finds a contradiction, it catches a calculation error.

Table 5 shows the “False-to-True” (F2T) and “True-to-False” (T2F) ratios.
- F2T (25.64%): This is the percentage of times the model started with a wrong answer, applied NeoN, and fixed it. This is a massive improvement over Self-Refine (11.67%).
- T2F (0.13%): This is the danger zone—taking a right answer and “thinking” it into a wrong one. NeoN has an incredibly low rate here. Because negation requires a correct premise to be valid, it is very difficult to successfully negate a mathematically true statement. If the model tries to negate “2+2=4”, it fails, reinforcing the original answer.
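To pin down what these ratios measure, here is a tiny sketch of how F2T and T2F could be computed from per-question correctness labels before and after applying NeoN. Treating the full test set as the denominator is an assumption; the paper may normalize differently.

```python
def f2t_t2f(before: list[bool], after: list[bool]) -> tuple[float, float]:
    """Fraction of questions flipped wrong->right (F2T) and right->wrong (T2F).
    `before` / `after` are hypothetical per-question correctness labels."""
    n = len(before)
    f2t = sum((not b) and a for b, a in zip(before, after)) / n
    t2f = sum(b and (not a) for b, a in zip(before, after)) / n
    return f2t, t2f
```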
Efficiency and “Rounds” of Negation
How many times does the model need to argue with itself?

Figure 4(a) shows the performance relative to the number of negation rounds. Interestingly, the performance gains plateau around 2 to 3 rounds. This makes NeoN highly efficient compared to other debate-based methods that might require dozens of turns.
Figure 4(b) illustrates that NeoN generates a significantly higher number of unique viewpoints per answer compared to standard GPT-3.5, particularly in “Practical Issues” (Prac. Iss.), where creativity is most needed.
Case Study: Seeing NeoN in Action
To make this concrete, let’s look at a generated example from the dataset to understand what a “Subjective Topic” looks like and how the model tackles it.

Consider the Binary Dialectic topic in Table 14: “Does technology in education enhance learning or hinder it?”
A standard LLM might produce a generic “sandwich” essay: technology has pros and cons; it improves access but invites distraction; in conclusion, balance is key.
Under NeoN, the process looks different:
- Direct: The model produces that standard essay.
- Negation 1: The model attacks the essay. “The previous answer assumes access is universal, but ignores the digital divide. It also fails to mention that ‘distraction’ might actually be a symptom of outdated teaching methods, not the technology itself.”
- Negation 2: The model attacks again. “Critiquing the digital divide is valid, but we must also acknowledge that technology changes the neural pathways of learning itself, essentially rewiring how students process information, which isn’t just ‘good’ or ‘bad’ but a fundamental shift.”
- Synthesis: The final answer weaves these deep, conflicting insights into a nuanced discussion that touches on cognitive science, socioeconomic equity, and pedagogy, far surpassing the initial generic response.
Conclusion
The “Subjective Topic meets LLMs” paper presents a compelling argument: Logic is not enough. As we integrate LLMs deeper into human society, we need them to act as more than just calculators or encyclopedias. We need them to be thinkers.
The NeoN framework demonstrates that ancient philosophical principles can be translated into effective prompt engineering strategies. By forcing models to undergo the dialectical process of Thesis-Antithesis-Synthesis, we unlock a level of comprehensive and creative thinking that standard prompting methods miss.
Key Takeaways:
- Subjectivity Matters: We need benchmarks like SJTP to measure how well AI handles open-ended, human-centric problems.
- Conflict Creates Quality: The Negation of Negation process forces the model to challenge its own biases and find missing links, leading to deeper insights.
- Versatility: While designed for philosophy and debate, this method also makes models better at finding bugs in their own code or errors in their math.
As LLMs continue to evolve, frameworks like NeoN suggest that the path to “Superintelligence” might not just be more data and compute, but better ways of thinking—borrowing a page from the great philosophers of our past.