Towards General Reasoning

How far have we gone towards general reasoning, and how far do we still have to go?

Recently, our paper NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning was accepted to EMNLP. NOVER uses the LLM's perplexity of the ground truth, conditioned on the reasoning trajectory, as the reward. This extends the RLVR paradigm beyond math and code, enabling general reasoning to be learned on arbitrary text-to-text tasks without extra models or verifiers.

When we began NOVER in February, most RLVR work focused on mathematical and code reasoning; RLVR research that targets general or hard-to-verify domains was scarce. Nearly six months later many interesting related papers have emerged. Due to limited resources, many ideas in our experiments were not fully validated and were left out of the paper. This post organizes those ideas and surveys recent relevant work to assess how far we have come on general reasoning and how far we still must go.

What

NOVER extends RLVR to general reasoning tasks such as Science Explanation and Proof, Social Reasoning, Creative Writing, and Translation

NOVER targets the following problem: for tasks whose answers are unstructured natural text and therefore unsuitable for rule-based verifiers, how can we apply RLVR so that the model still acquires reasoning ability?

Such cases are common. A math problem often has a unique canonical answer that can be placed in a \boxed{}, while explanations of physical or chemical phenomena can legitimately take many different forms (think of long, opinionated answers on Quora-like sites). For entity-level QA we can use exact match, but for long-form generation such as document-level summarization/translation/creative writing, there is no reliable rule (ROUGE, BLEU and similar metrics have long been shown to be unreliable). The same applies across vertical domains (medicine, law, social sciences, psychology, literature, education) where many ground-truth instances are free-form.

A few concepts are easy to conflate; NOVER focuses only on the following:

  • The ground truth exists (this is not unsupervised learning); we do not want a reward model that directly issues a judgment, but rather a verifier that compares the ground truth with the prediction.
  • The ground truth may be subjective or objective, but the dataset contains at least one reference.
  • Even when the ground truth is objective, rule-based verifiers often fail: objective ground truths can be expressed in many textual forms (otherwise we would simply use exact match as the reward). Moreover, even for math, where answers are seemingly easy to verify (multiple-choice labels, numbers, formulas, short text, booleans), model responses vary widely and rule-based verifiers are not robust [1].
Error patterns on easy-to-verify tasks, from CompassVerifier [1]

Why

Why pursue general reasoning? Many non-math/code tasks (creative, humanities, or long-form scenarios) do not appear to be suitable targets for a “reasoning” model. But consider:

  • It is still unknown whether RLVR is the ultimate correct paradigm for training reasoning models.
  • It is still unknown whether CoT genuinely corresponds to human-style reasoning.
  • It is still unknown whether CoTs learned via RLVR truly represent a model’s internal reasoning.

Currently, RLVR is better seen as a method for training native CoT outputs, and CoT is simply “say something before the final answer.” That something is not necessarily reasoning (some works find that even repeating the question can increase accuracy); it is a sequence of model-generated tokens that, when produced before the answer, helps the LLM better exploit learned patterns and raises the prediction probability of the correct answer tokens.

DataAlchemy shows that CoT's gains can arise from reuse and interpolation of patterns near the training distribution [2]

From that practical viewpoint, producing a bit more text that improves answer quality does no harm (users often switch on a “deeper thinking” mode in chat systems expecting better answers).

Another reason to study general reasoning is that, for an LLM, task difficulty is tied to verification difficulty. We aim to keep pushing the frontier of problems that models can solve: some tasks that are hard for humans (e.g., olympiad math) might still be learnable by models if supplied with a correct, sufficiently informative reward. Conversely, tasks whose rewards are hard to formalize are harder for models to learn.

What it actually is

What we need for RLVR on free-form text is a good verified signal, which is really a reward function that measures semantic agreement between the ground truth and the model prediction. That is exactly what we have long pursued in Natural Language Generation. The most basic such objective is cross-entropy, i.e., perplexity (ppl). From this perspective NOVER essentially moves the SFT loss into the RLVR setting, and recent work shows that the differences between SFT and RL are often not large.
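For concreteness, this is the quantity I mean by “ppl” throughout the post (my notation, not copied from the paper): the conditional perplexity of a ground truth $y$ given the prompt $x$ and a reasoning trajectory $c$, i.e., the exponentiated average token-level cross-entropy under the policy $\pi_\theta$:

$$\mathrm{ppl}(y \mid x, c) = \exp\!\Big(-\frac{1}{|y|}\sum_{t=1}^{|y|}\log \pi_\theta\big(y_t \mid x, c, y_{<t}\big)\Big)$$

Minimizing this average cross-entropy over a dataset (with an empty $c$) is just the SFT objective; NOVER instead turns it into a per-rollout reward, which is part of why the two paradigms end up closer than they first appear.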

Although NOVER used ppl, perplexity may not be optimal. We can arrange verified signals along an axis from fine to coarse granularity: the coarser the signal, the more information is lost and the sparser the reward becomes. On this axis three main approaches appear:

  • Perplexity-based signals.
  • Rubrics / checklists.
  • Trained verifier models that yield binary (0/1) rewards.
The Axis of Verified Signals

Compared with binary rewards, ppl provides a denser signal, extends naturally to free-form text, and avoids reward saturation; but it loses the absolute correctness signal, i.e., the model never observes a strict correct/incorrect label and we cannot use pass@k-style metrics to assess sample difficulty. Rubrics/checklists sit between these extremes: they are more fine-grained than binary rewards but still sparser than ppl. High-quality rubrics typically require sample-wise, human expert annotation. Several recent works explore rubric-style solutions [3][4][5][6][7]. Baichuan-M2 in particular develops a fairly detailed Verifier System that functions as a model-driven environment, with a Patient Simulator (data generator) and a Rubrics Generator (rollout evaluator) [8].

Baichuan-M2's Verifier System[8]

Rubrics also enable controlled synthetic data generation for debiasing reward models [9], so the reward model focuses on true causal factors and resists hacks stemming from format, length, or tone. OpenAI’s Deliberative Alignment can be seen as an outcome-RL approach that uses safety-oriented rubrics [10].
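To make the rubric/checklist idea concrete, here is a minimal sketch (my own illustration, not the pipeline of any of [3]-[10]): score a response by the fraction of checklist items an arbitrary yes/no judge accepts. The judge callable is a placeholder for whatever you have, an LLM prompt or a hand-written rule.

from typing import Callable, Sequence

def checklist_reward(
    items: Sequence[str],
    response: str,
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of checklist items that the judge says the response satisfies."""
    if not items:
        return 0.0
    satisfied = sum(1 for item in items if judge(item, response))
    return satisfied / len(items)

def keyword_judge(item: str, response: str) -> bool:
    # Toy stand-in for an LLM yes/no judge: just check the item appears verbatim.
    return item.lower() in response.lower()

items = ["conservation of energy", "a numeric height estimate of 4.9"]
response = "By conservation of energy the ball reaches roughly 4.9 m."
print(checklist_reward(items, response, keyword_judge))  # 0.5

The partial credit (0.5 here) is exactly what makes checklists denser than a single 0/1 verdict, while still far coarser than token-level ppl.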

How

NOVER's reward is derived from the policy model's conditional ppl of the ground truth given the reasoning trajectory

NOVER applies a crude but direct approach: for a rollout, compute the policy model’s conditional ppl of the ground-truth answer given the rollout's reasoning trajectory as the reward.
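Here is a minimal sketch of that reward with Hugging Face transformers (illustrative, not NOVER's exact implementation): concatenate prompt, reasoning trajectory, and ground-truth answer, then average the cross-entropy over the answer tokens only. The model name below is just an example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reasoning_ppl(model, tokenizer, prompt: str, reasoning: str, answer: str) -> float:
    """Conditional ppl of the ground-truth answer given prompt + reasoning."""
    ctx_ids = tokenizer(prompt + reasoning, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1; keep only the answer positions.
    ans_logits = logits[:, ctx_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(ans_logits, dim=-1)
    token_logp = log_probs.gather(-1, ans_ids.unsqueeze(-1)).squeeze(-1)
    return torch.exp(-token_logp.mean()).item()

# Lower ppl means the reasoning made the ground truth more likely;
# NOVER turns this (after in-group normalization) into the rollout's reward.
model_name = "Qwen/Qwen2.5-0.5B"  # any causal LM works; this name is only an example
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
print(reasoning_ppl(lm, tok, "Q: ...\n", "<think>...</think>\n", "<answer>42</answer>"))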

The idea of reasoning advantage (RA).

The idea of reasoning-ppl-based rewards has appeared before. A short NeurIPS 2024 LanGame workshop paper called this notion reasoning advantage (RA): essentially the relative change in reasoning ppl compared to a no-reasoning baseline. That paper used RA for data selection, keeping CoT examples with high RA for SFT, so it can be viewed as an offline-RL-style method [11].

Coincidentally, I had experimented with relative reasoning ppl in NOVER and only later found the LanGame writeup: it is an intuitive and reasonable design.

The idea of longPPL.

Another related refinement of ppl is longPPL, which measures ppl only on a context-dependent subset of tokens: longPPL compares each token's probability with and without the long context, keeps the tokens whose probability genuinely improves, and thereby focuses evaluation on tokens that truly depend on long-range context [12]. RA shares the same spirit: we want the reward to come from those tokens in the ground truth that genuinely require CoT reasoning.
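A small sketch of that shared spirit (my own simplification, not the exact longPPL or RA formulation): given per-token log-probabilities of the ground truth with and without the reasoning/context, keep only the tokens whose probability genuinely improves, and score the rollout on those.

import numpy as np

def focused_ppl(logp_with: np.ndarray, logp_without: np.ndarray, gain_threshold: float = 0.5) -> float:
    """ppl over the tokens whose log-prob gain from the extra context/reasoning
    exceeds a threshold; falls back to all tokens if none qualify."""
    gain = logp_with - logp_without          # per-token improvement
    key = gain > gain_threshold              # "key tokens" in the longPPL sense
    selected = logp_with[key] if key.any() else logp_with
    return float(np.exp(-selected.mean()))

# Toy numbers: the 3rd and 4th tokens are the ones that actually need the reasoning.
logp_without = np.array([-0.2, -0.3, -4.0, -5.0, -0.1])
logp_with    = np.array([-0.2, -0.3, -0.8, -1.1, -0.1])
print(focused_ppl(logp_with, logp_without))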

More interestingly, in GRPO the group normalization makes relative ppl improvement and absolute ppl effectively equivalent in the advantage calculation, so absolute reasoning ppl itself is a solid reward signal.
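To see why, a two-line derivation in my notation: within a GRPO group every rollout shares the same prompt and ground truth, so the no-reasoning baseline ppl $b$ is a single constant for the whole group. Taking the relative reward $r_i = b - \mathrm{ppl}_i$ and normalizing within the group,

$$A_i=\frac{r_i-\operatorname{mean}(r)}{\operatorname{std}(r)}=\frac{(b-\mathrm{ppl}_i)-\big(b-\operatorname{mean}(\mathrm{ppl})\big)}{\operatorname{std}(\mathrm{ppl})}=\frac{\operatorname{mean}(\mathrm{ppl})-\mathrm{ppl}_i}{\operatorname{std}(\mathrm{ppl})},$$

which is exactly the advantage produced by the absolute reward $r_i=-\mathrm{ppl}_i$: the constant baseline cancels under group normalization.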

But applying ppl directly has issues.

  • First, ppl is numerically unstable: advantage estimates vary across batches and exhibit length bias. NOVER converted ppl into in-group quantiles to produce more stable rewards (a minimal sketch of this quantile transform appears after this list). QRPO applies quantile transforms more rigorously: it maps rewards to quantiles of the base-policy reward distribution across the dataset, making the partition function tractable and enabling numerically stable pointwise rewards even in offline RL [13].
  • Which model should compute the ppl? In principle a stronger external model could be a more accurate verifier, but the gap between a large verifier and a small policy causes problems, much like the poor distillation results seen when using DPO to train small models on GPT-distilled labels. NOVER uses the policy model itself to compute ppl, which avoids extra models and eases scaling. We found that using a separate large verifier (a closed-source SOTA model or a specialized verifier) often leads to LM-hack-LM issues, whereas using the policy model’s own ppl yields smoother learning curves.
  • With small batches and limited compute, training is unstable. NOVER introduced a policy-proxy sync: periodically copy policy parameters to a proxy model and compute ppl from the proxy during training. This effectively increases the batch size (similar in spirit to gradient accumulation) and stabilizes reward estimates.
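A minimal sketch of the in-group quantile idea mentioned above (illustrative; see the paper for NOVER's exact transform): replace each rollout's raw ppl by its quantile rank within the group, so the reward is bounded in [0, 1] and insensitive to the scale of ppl.

import numpy as np

def quantile_rewards(ppls: np.ndarray) -> np.ndarray:
    """Map each rollout's ppl to its in-group quantile (lower ppl -> higher reward)."""
    n = len(ppls)
    if n < 2:
        return np.ones(n)
    ranks = ppls.argsort().argsort()      # 0 = lowest ppl in the group
    return 1.0 - ranks / (n - 1)          # bounded in [0, 1], scale-free

ppls = np.array([3.2, 1.4, 8.9, 2.1])     # one GRPO group, same prompt
print(quantile_rewards(ppls))             # [0.333..., 1.0, 0.0, 0.666...]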
RLPR shows that ppl can accurately measure the reasoning advantage.

Several contemporaneous works adopt related ideas but differ in how they stabilize ppl numerics.

  • VeriFree [14] uses reasoning ppl directly, but restricts to short answers (≤7 tokens) where ppl is less unstable, and shows ppl can approach or exceed verifier-based baselines on short QA.
  • RLPR [15] uses relative token probabilities (the per-token mean probability, clipped, then advantage computed) rather than ppl and provides detailed ablations showing direct ppl can lose 20 points if used naively.
  • DRO [16] targets long answers and uses relative reasoning ppl with per-token weighting for high-variance ground-truth tokens and local weight decay.
  • DeepWriter [17] focuses on long-form writing but uses reasoning ppl purely as a scoring metric to filter and iteratively rewrite drafts (not an RL loop), avoiding numeric instability by staying in a supervised selection regime.

Observations

Collapse modes in training.

We experienced many collapse modes early in training: completion lengths exploding, degenerate rollouts where the model produces garbled text, and format rewards blowing up at the same time. We applied the tricks above to stabilize training (see the paper’s ablation for details on the “curse of proxy”).

The curse of proxy.

A small but useful trick is reward dependency: when multiple reward terms are simply summed, the model cannot tell which objective produced a given penalty or bonus. Practically, we found it effective to gate task rewards on a strict format reward: unless the format reward is earned, all other rewards are set to zero. When the format reward is earned, the model is usually “sane”, with no hallucination or gibberish. This dependency can also pull the model back from training collapse (a minimal sketch of this gating appears after the nested-tag example below).

We also found that excessive strictness in format rewards may hinder exploration [18]. For example, one interesting instance of format reward hacking we observed was nested <think> tags in the CoT: models can nest a sub-reasoning reflection inside an outer <think> block to game the signal, e.g.

<think>
inner thoughts
<think>
reflection on the earlier thoughts
</think>
continue reasoning
</think>
<answer>
...
</answer>
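A minimal sketch of the format gate discussed above, including a check that rejects this nested-tag hack (an illustrative regex of my own, not NOVER's exact checker):

import re

# Exactly one <think>...</think> followed by one <answer>...</answer>,
# with no nested <think> or </think> inside the thinking block.
FORMAT_RE = re.compile(
    r"^\s*<think>((?:(?!</?think>).)*)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)

def format_ok(completion: str) -> bool:
    return FORMAT_RE.match(completion) is not None

def gated_reward(completion: str, task_reward: float, format_bonus: float = 0.1) -> float:
    """Reward dependency: the task reward only counts when the format is satisfied."""
    if not format_ok(completion):
        return 0.0
    return format_bonus + task_reward

print(format_ok("<think>a<think>b</think>c</think><answer>x</answer>"))  # False: nested tags
print(format_ok("<think>reason here</think><answer>42</answer>"))        # True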

Stronger base models exploit dense semantic signals better. For example, we converted multiple-choice questions into free-form answers where the model must output both the option letter and the full option text; comparing 7B vs 3B, the 7B model better leverages ppl to rank rollouts:

  • rank 1: option letter and option text both correct
  • rank 2: letter wrong, option text correct
  • rank 3: letter correct, option text similar to another option
  • rank 4: letter correct, option text completely wrong
  • lowest: everything wrong

Looking beyond the final answer at the full rollouts, ppl can indirectly reflect differences in the reasoning details. In an astronomy example that required an explanation plus a numeric computation, we asked GPT to analyze each rollout (reasoning plus result), sorted by reasoning ppl; its qualitative analyses correlated with the ppl rankings.

The correlation between ppl rankings and GPT's qualitative analyses.

NOVER also partially works on non-Qwen models, though weaker bases (e.g., some Mistral checkpoints) show erratic behavior. Zero-shot CoT can be seen as the untrained base model's exploration strategy; if that zero-shot CoT baseline is close to or exceeds the base model's direct-answer performance, RL typically provides gains.

NOVER partially works on non-Qwen models.

We also observed (without exhaustive experiments) that many general-reasoning datasets are annotated by closed-source large models and thus are not perfectly objective or correct (loose definitions, symbol misuse). Perplexity can still provide a useful guiding signal: in some cases models learned complex reasoning patterns from the ppl signal that can produce arguably more correct answers than the original ground truth.

Is changing only the reward enough?

Some works on reproducing GRPO.

No, but reward design is the most obvious gap when extending rule-based verification to general reasoning. What's more, from the bitter-lesson viewpoint many algorithmic tricks are spurious: data and compute dominate. By March many people were reproducing GRPO and noting its fragility; our NOVER training surfaced similar issues. Many algorithmic “tricks” proposed in these papers have marginal effects compared with data and scale.

So advancing general reasoning faces larger challenges in data and base models; algorithmic work will be required later to make training more efficient and stable.

  • Data: existing general-reasoning datasets vary widely in quality; cleaning consumes substantial effort, and much data is LLM-annotated (distilled from GPT or similar) rather than human-curated. The data are static and finite. RL itself is sample-efficient in some senses; the cost-effective path to scaling is not simply more examples but higher-quality environments and feedback.
  • Base model: the base model governs exploration in RL. Practically, it should already possess zero-shot instruction following and CoT capability; richer knowledge helps. Debates over whether RL can raise the ultimate capability ceiling are not the key point: post-training often elicits latent abilities rather than creates them. Some works already explore combining memory and elicitation, and I believe the interplay between mid-training and post-training may form new positive feedback loops.

One more thing: Climbing the Solver–Verifier Asymmetry

The Solver-Verifier Asymmetry.

A central concept in RLVR is the solver-verifier asymmetry: for some tasks verification is easier than solving, while for others verification is harder. Much of RLVR excels when verification is simpler than solving. The opposite side, where verification is harder, includes:

  • General Reasoning with hard-to-verify free-form answers
  • Situations requiring long time horizons to obtain a return (e.g., a business plan whose real feedback arrives after weeks, months, or years). Those cases resemble deep conversion problems in recommender systems: we need accurate attribution and systems that handle extremely sparse feedback.
  • Scenarios that may require large human labeling efforts or hard-to-acquire real users to verify the solution, which motivates the development of effective user simulators.

The verifier-free design of NOVER introduces a new possibility (though not yet tested):

whether it is feasible to synchronize the intelligence of the policy model to the verifier model, thereby enabling co-evolution of solver and verifier along the Solver-Verifier Asymmetry diagonal.

A stronger policy model would lead to a stronger verifier model, which in turn could train an even stronger policy model. The key lies in the transmission of intelligence. NOVER’s design of using perplexity as the reward naturally unifies the form of intelligence in both solver and verifier: both aim to increase the probability of generating the ground truth on good reasoning trajectories. In this way, co-evolution can be achieved through standard RL without the need to design additional adversarial or self-play tasks. Here, the direction of intelligence transfer is from solving to verifying. A related work in the symmetric direction is LLaVA-Critic-R1, which found that a strong preference model can yield a strong policy model, though it required constructing an additional task [19].

If we want to achieve such fully automatic co-climbing, RL training performs the horizontal climb (fix verifier y, improve solver x), while intelligence sync would perform the vertical climb (fix solver x, improve verifier y). However, we also need a third variable: tasks and data. Each point in the solver–verifier grid corresponds to specific tasks and datasets. As argued in my earlier post on Scaling the Environment, beyond the solver and the verifier there is also the question generator. Most current reasoning-evolution work focuses on self-improvement via model consistency or entropy patterns; some approaches implement co-evolution of two modules, while a tri-evolution of all three modules has not been explored:

The Trinity of Solver-Verifier-Generator.
  • R-Zero and Self-Questioning Language Models consider adversarial generation between a generator and a solver [20][21].
  • URPO reframes verification as a solving task and unifies the training data for both roles. COOPER trains a verifier on positive/negative samples constructed from current policy rollouts. Both lines implement solver–verifier co-evolution [22][23].

Another route to continual solver improvement is self-play: with a suitable environment, two solvers can play against each other and improve without worrying about the asymmetry. For general reasoning such environments are hard to design because the “rules” are nebulous. Recent works have shown that models can learn rules [24] and combine atomic skills into new skills through synthetic data and tasks, but existing real general-reasoning datasets are limited enumerations rather than comprehensive rule sets. This is still essentially static dataset/benchmark-driven RL. In the AI “second half,” we should seek real-world environments and problems rather than static datasets.

Between static data and the real world lies a middleware: simulators. Like reward models or verifier models, simulators trade fidelity for feedback speed; for general reasoning a useful simulator might look like the patient simulator in medical domains (see Baichuan-M2’s case), since real patients raise ethical and regulatory issues and validation can be slow [8].

A different idea is to forgo task-specific environments and instead play games: self-play on games could improve math and general reasoning if reasoning patterns transfer across games and tasks [25][26]. If feasible, we could use game environments and self-play to continually evolve general-reasoning models.

Citation

If you found the topics in this blog post interesting and would like to cite it, you may use the following BibTeX entry:

@article{general_reasoning_202509,
author = {Wei Liu},
title = {Towards General Reasoning},
year = {2025},
month = {9},
url = {https://thinkwee.top/2025/09/13/general-reasoning/#more},
note = {Blog post}
}

  • [1] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward.
  • [2] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens.
  • [3] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning.
  • [4] Checklists Are Better Than Reward Models For Aligning Language Models.
  • [5] TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation.
  • [6] Reinforcement Learning with Rubric Anchors.
  • [7] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains.
  • [8] Baichuan-M2: Scaling Medical Capability with Large Verifier System.
  • [9] Robust Reward Modeling via Causal Rubrics.
  • [10] Deliberative Alignment: Reasoning Enables Safer Language Models.
  • [11] On Reward Functions For Self-Improving Chain-of-Thought Reasoning Without Supervised Datasets.
  • [12] What is Wrong with Perplexity for Long-context Language Modeling?
  • [13] Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions.
  • [14] Reinforcing General Reasoning without Verifiers.
  • [15] RLPR: Extrapolating RLVR to General Domains without Verifier.
  • [16] Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks.
  • [17] Reverse-Engineered Reasoning for Open-Ended Generation.
  • [18] SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild.
  • [19] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model.
  • [20] R-Zero: Self-Evolving Reasoning LLM from Zero Data.
  • [21] Self-Questioning Language Models.
  • [22] URPO: A Unified Reward & Policy Optimization Framework for Large Language Models.
  • [23] COOPER: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models.
  • [24] Large Language Models can Learn Rules.
  • [25] Play to Generalize: Learning to Reason Through Game Play.
  • [26] SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning.