What is the Next Step for Scaling in the Era of RL for LLM?
Once the bitter lesson has deleted the redundant designs we added in the pre-LLM era, we are ready to scale up. In the era of RL for LLMs, what should we scale up next?
Towards General Reasoning
- I've been following recent developments on Agents and the line of work on Incentive Training for Reasoning sparked by DeepSeek. A few days ago, the release of Kimi K2 caught my attention, particularly its section on general reasoning:
Going beyond verifiable rewards, our general RL system uses a self-judging mechanism where the model acts as its own critic, providing scalable, rubric-based feedback for non-verifiable tasks.
- It was at this moment that I realized the dots from the recent papers I had been reading started to connect—just as Steve Jobs once said, “You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future.”
- This post organizes those thoughts, as illustrated in the diagram above: the scaling of large models has never stopped—we have continued to scale up the knowledge and capabilities learned from next token prediction, in various ways and from different angles.
Next Token Prediction
- At the top: next token prediction is all you need. A general consensus seems to be that knowledge is primarily acquired during pretraining, while post-training mainly serves to steer and incentivize capabilities. Continued pretraining remains the preferred way to encode knowledge into the model; trying to add, edit, or delete knowledge during post-training is extremely challenging, and research on knowledge editing has remained relatively toy-level.
Knowledge Is Stored Messily in LLMs
- Once we clarify the roles of pretraining and post-training, we find that both are worth scaling. When the scaling of pretraining hits a bottleneck, OpenAI proposed inference-time scaling—which actually relies on post-training to elicit capabilities. In this sense, it is really scaling up post-training.
Scaling Up Human Feedback with Reward Models
- Moving to the lower left is the RLHF path proposed by OpenAI. Besides highlighting the importance of human feedback (beyond objective correctness, there is also the hard-to-verify distinction between good and bad), I believe it importantly demonstrates how reward models can be used to scale up human feedback. It's unrealistic to have humans annotate massive amounts of model rollouts, but we can use a small amount of high-quality human data to train a reward model, which can then provide feedback at scale for policy-model rollouts. This is essentially a tradeoff: sacrificing precision (human-labeled quality) for quantity (reward-model scalability). A reward model trained on a small dataset is sufficient to guide a strong policy, because judging an answer is easier than producing one (the Discriminator-Generator Gap). Once this cold start was done, OpenAI productized it (ChatGPT), enabling continuous data collection and spinning up a data flywheel.
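To make the precision-for-quantity tradeoff concrete, here is a minimal sketch of pairwise (Bradley-Terry style) reward-model training. It assumes a Hugging Face-style backbone that returns last_hidden_state and right-padded batches; all names are illustrative rather than taken from any particular RLHF codebase.

```python
# Minimal sketch of pairwise reward-model training (Bradley-Terry style).
# Assumptions: `backbone` is a Hugging Face-style LM returning last_hidden_state,
# and sequences are right-padded. Names are illustrative, not from a real codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)   # hidden state -> scalar reward

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # pool the representation of the last non-padding token of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)

def preference_loss(reward_model, chosen_batch, rejected_batch):
    """Fit on a small set of human comparisons; the trained reward model
    then scores an essentially unlimited number of policy rollouts."""
    r_chosen = reward_model(**chosen_batch)
    r_rejected = reward_model(**rejected_batch)
    # maximize log sigmoid of the reward margin between chosen and rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The expensive human signal lives only in the comparison pairs that fit this loss; the cheap, scalable signal is the scalar the reward model then emits for every new rollout.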
Verifiable Rewards
- Moving rightward, we see RLVR: DeepSeek's exploration of end-to-end RL with outcome rewards, undertaken in the broader effort to replicate OpenAI's o1 model. DeepSeek's work has offered me three insights:
- From DeepSeekMath to DeepSeek-R1, this line of work showed that when the pretrained checkpoint is strong enough, RL still has enormous potential: not only can it handle math problems and proofs, it can also elicit general reasoning capabilities;
- Simple, rule-based rewards can be used directly to train language models. This expands the range of possible environments for RL with LLMs, enabling them to serve as general-purpose models across tasks;
- With GRPO and rule-based rewards, DeepSeek removed the need for both a critic model and a separate reward model, making the approach extremely simple. The early media narrative focused on lower cost, but I believe the greater potential lies in more efficient scaling.
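As a rough illustration of that simplicity, here is a toy sketch of a rule-based reward combined with GRPO-style group-relative advantages. The boxed-answer check and the function names are assumptions for the example, not DeepSeek's actual implementation.

```python
# Toy sketch: rule-based reward + GRPO-style group-relative advantages.
# `rollouts` are G sampled completions for the same prompt; the reward is a
# simple exact-match check, so no learned critic or reward model is needed.
import re

def rule_reward(completion: str, gold_answer: str) -> float:
    # illustrative rule: extract \boxed{...} and compare to the reference answer
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rollouts, gold_answer, eps=1e-6):
    rewards = [rule_reward(c, gold_answer) for c in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # each sample's advantage is its reward standardized within the group,
    # which replaces the learned value baseline a critic would provide
    return [(r - mean) / (std + eps) for r in rewards]
```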
LLMs Implicitly Contain Reward Models
- Below RLHF, we see the realization that LLM-based policy models inherently contain reward models:
- The intuition behind DPO is straightforward: if I follow a pipeline of [train reward model on high-quality human preference data → train policy model using reward model], then surely there exists a way to directly train the policy model using high-quality preference data. DPO mathematically proves this. Although it overlooks on-policy vs off-policy distinctions, DPO offers a key insight: in RLHF post-training, an LLM policy model might also be a reward model;
- There's a paper not included in the diagram—PRIME: Process Reinforcement through Implicit Rewards by Tsinghua—which extends the implicit reward concept from DPO into outcome-reward tasks, extracting process reward signals. While PRIME is not central to this post, it’s very interesting, and a future combining outcome + process rewards could be promising;
- Finally, we have Generalist Reward Models: Endogenous Reward. Next token prediction on massive human-approved corpora is itself a way of learning a reward function. This continues DPO’s idea: LLMs are both policy and reward models, not just in post-training, but even during pretraining.
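To ground the implicit-reward claim, here is a minimal sketch of the DPO objective; the argument names are mine, and each log-probability is assumed to be the sum over the response tokens under the policy or the frozen reference model.

```python
# Minimal sketch of the DPO objective. The implicit reward it optimizes is
# beta * (log pi(y|x) - log pi_ref(y|x)); no explicit reward model is trained.
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument is a tensor of summed token log-probabilities for the
    chosen or rejected response under the policy or the reference model."""
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # maximize the margin between the implicit rewards of chosen and rejected
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The quantity beta * (log pi - log pi_ref) is exactly the implicit reward DPO recovers, which is why one can say the policy model doubles as a reward model.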
LLMs Implicitly Contain Verifier Models
- Further right, mirroring the idea of the Secret Reward Model under RLHF, is the RLVR counterpart: in RLVR, the LLM itself serves as a verifier model. This includes our recent work NOVER and several related papers. The motivation is straightforward: RLVR depends on verifiable tasks, but what if we only have freeform ground truth that cannot be verified by rules? A natural idea is to use the model's perplexity on those ground-truth answers as the reward. For incentive training, we can condition on the reasoning trajectory and compute that perplexity. The idea is simple, but echoing the Secret Reward Model, it supports a broader claim: whether RLHF or RLVR, alignment or incentive training, LLMs themselves are sufficient feedback-signal extractors. In RL terms, LLMs are good enough as both policy and reward.
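As a sketch of the general perplexity-as-verifier idea (an illustration of the principle, not NOVER's exact recipe), assuming a Hugging Face-style causal LM and tokenizer:

```python
# Score a rollout by how likely a frozen model finds the ground-truth answer
# when conditioned on the generated reasoning. Illustrative only; the exact
# conditioning and normalization in NOVER may differ.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity_reward(model, tokenizer, prompt, reasoning, gold_answer):
    prefix_ids = tokenizer(prompt + reasoning, return_tensors="pt").input_ids
    gold_ids = tokenizer(gold_answer, return_tensors="pt",
                         add_special_tokens=False).input_ids

    input_ids = torch.cat([prefix_ids, gold_ids], dim=1)
    logits = model(input_ids).logits
    # logits at positions prefix_len-1 .. end-2 predict the gold-answer tokens
    answer_logits = logits[:, prefix_ids.size(1) - 1 : -1, :]
    token_logps = F.log_softmax(answer_logits, dim=-1).gather(
        -1, gold_ids.unsqueeze(-1)).squeeze(-1)

    ppl = math.exp(-token_logps.mean().item())
    return 1.0 / ppl   # lower perplexity on the ground truth => higher reward
```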
Scaling Up Reinforcement Learning
- At the bottom, we see the community's recent efforts in scaling RL for LLMs:
- DeepSeek-GRM emphasizes that reward models are worth scaling up;
- ProRL suggests that RL training itself is worth scaling, with potential to surpass the pretrained ceiling;
- POLAR argues that reward models should not only be scaled up, but trained at pretraining scale.
Converging to a Single Point
- Looking back at the road we've traveled, we see everything converging to one point:
- Reward models should be scaled up
- Post-training should be scaled up
- LLMs themselves are reward models
- → We only need to scale up the LLM itself! It is both policy and reward, and RL enhances its capabilities in both roles. A stronger reward provides better guidance for a policy tackling harder problems. The simplest form accommodates the widest range of data and compute. Once tasks and data are well prepared, everything clicks into gear, spinning faster and leveraging bigger levers. This is the insight conveyed by Kimi K2's section on general reasoning. As Hyung Won Chung's slide suggests: less structure, more performance. We began by adding various models and structures, and now we are removing them one by one:
As a community we love adding structures but a lot less for removing them. We need to do more cleanup.
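To make "the LLM is both policy and reward" concrete, here is an illustrative, heavily simplified self-judging loop in the spirit of the Kimi K2 quote earlier; the rubric text and function names are my own assumptions, not any released system's pipeline.

```python
# Illustrative sketch only: the same model generates rollouts and scores them
# against a rubric, so scaling the model scales policy and reward together.
RUBRIC = ("Rate the response from 0 to 10 for factuality, helpfulness, and "
          "adherence to the instructions. Answer with a single number.")

def self_judged_reward(generate, prompt, response):
    """`generate` is any text-in/text-out call into the same LLM."""
    judge_prompt = f"{RUBRIC}\n\nInstruction:\n{prompt}\n\nResponse:\n{response}\n\nScore:"
    raw = generate(judge_prompt)
    try:
        return float(raw.strip().split()[0]) / 10.0   # normalize to [0, 1]
    except (ValueError, IndexError):
        return 0.0                                     # unparsable judgment -> no reward

def collect_rewards(generate, prompt, n_samples=8):
    rollouts = [generate(prompt) for _ in range(n_samples)]
    return [(r, self_judged_reward(generate, prompt, r)) for r in rollouts]
```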
The Next Step
- So what should we scale next? And how?
- One seemingly obvious answer is: from training to inference, to agentic, to multiagent—the next step is scaling up multiagent. (Grok 4 Heavy may currently be the strongest multiagent model/framework; multiagent has also become one of 2025’s hottest AI terms.)
We have made further progress on parallel test-time compute, which allows Grok to consider multiple hypotheses at once. We call this model Grok 4 Heavy, and it sets a new standard for performance and reliability.
- But why? Just because the terms are newer? My view is: we’re not scaling the terms or paradigms themselves. When new paradigms emerge, we’re keen to build around them—but ultimately, the benefits are internalized by the model itself. Scaling laws remain faithful to data and compute, only now they come in different forms.
- When identifying the next direction for scaling, the question is not just the agentic or multiagent format, but where the data to support such scaling comes from. Some scattered thoughts:
- Synthetic data is inevitable, but I suspect it will show patterns similar to the Discriminator-Generator Gap, with tradeoffs that continually produce harder and better data;
- In RL contexts, data may also appear as environments. A well-defined environment can in theory generate near-infinite data, which means the next phase is not just scaling the amount of data, but scaling its difficulty: an environment that assigns harder goals to stronger agents.
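A toy sketch of that last point: an environment whose difficulty knob turns as the agent improves. The arithmetic task is a stand-in for any generator of verifiable problems, and the thresholds are arbitrary.

```python
# Toy difficulty-scaling environment: when the agent solves most tasks at the
# current level, the environment moves the goalposts.
import random

class CurriculumEnv:
    def __init__(self, level=1, promote_at=0.8, window=50):
        self.level = level            # current difficulty
        self.promote_at = promote_at  # success rate required to level up
        self.window = window          # episodes per evaluation window
        self.recent = []

    def sample_task(self):
        # harder level => larger operands; returns (prompt, verifier)
        hi = 10 ** self.level
        a, b = random.randint(1, hi), random.randint(1, hi)
        return f"Compute {a} * {b}.", lambda answer: answer.strip() == str(a * b)

    def report(self, solved: bool):
        self.recent.append(solved)
        if len(self.recent) >= self.window:
            if sum(self.recent) / len(self.recent) >= self.promote_at:
                self.level += 1       # stronger agent gets harder goals
            self.recent = []
```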
Why you should stop working on RL research and instead work on product // The technology that unlocked the big scaling shift in AI is the internet, not transformers
- I agree with the training → inference → agentic → multiagent scaling roadmap, not just because those terms aim for higher intelligence, but because that path makes LLMs increasingly useful. And usefulness brings a key benefit: people are willing to provide more, and more diverse, data in order to use them.
- For multiagent, I'm particularly interested in heterogeneous, personal multiagent systems. LLMs have nearly exhausted all knowledge on the internet, which reflects an average of humanity’s collective information. But for individuals, each person’s micro-world continues to generate virtually infinite data and environments. Assigning each person an agent, and allowing society to mirror itself with a society of agents evolving through this data, may be how multiagent scaling becomes possible.
Citation
If you found the topics in this blog post interesting and would like to cite it, you may use the following BibTeX entry:
@article{next_scaling_202507,