Scaling the Environment
What I Talk About When I Talk About Scaling the Environment
Scaling Environments
- In the previous blog post, we mentioned the importance of scaling environments in the era of RL for LLMs.
- Similar to the pre-LLM era where we scaled the quantity and quality of data, in the RL era, we scale the difficulty of environments.
- We can evaluate the difficulty of an environment from three dimensions:
- Generation Difficulty: The difficulty of collecting new problems/goals/tasks within the environment
- Solving Difficulty: The difficulty of the problems assigned to the agent within the environment
- Verification Difficulty: The difficulty of verifying whether the agent’s output is correct after completing a task
- These difficulties determine how easy it is to build an environment, and whether the constructed environment is sufficient to train powerful agents.
- A precise terminological distinction: verification typically refers to checking whether a prediction matches a ground truth, often with a verifier model, whereas a reward model generally evaluates the quality of a prediction without access to a ground truth. The former emphasizes consistency checking against a reference; the latter emphasizes judging quality when no reference is available. For the sake of discussion, we do not strictly differentiate between the two in this post. After all, if ground truth is hard to obtain, verification is naturally hard as well, and tasks that are difficult to verify are, under a general reward-model view, also tasks for which it is difficult to model a reward function.
- Each dimension can be rated as either easy or difficult. This binary classification mainly captures relative difficulty (e.g., generation being harder than solving, or verification being easier than generation) and helps clarify the goals and direction when scaling environments. Under this classification, the three dimensions form an environment matrix with eight subspaces (enumerated in the sketch after this list).
- It’s worth noting that we ignore two subspaces where verification is harder than solving:
- Because most RL-applicable problems exhibit a generator-discriminator (generator-verifier) gap, i.e., it is easier to judge whether a policy is good than to obtain a good policy.
- If obtaining a policy is easy (e.g., proposing a mathematical conjecture or making a prediction about the future) but verification is hard (requiring greater intelligence or a long time), then in that domain the problem that actually needs solving is verification itself.
- From this perspective, verification becomes the policy that needs to be learned, while “policy generation” becomes more akin to problem generation.
- Therefore, if such subspaces are to be tackled with AI methods, they effectively fall into one of the other subspaces. For completeness, we still illustrate them here and mark them in gray to indicate that they do exist.
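To make the taxonomy concrete, the following sketch enumerates the eight subspaces of the environment matrix and marks the two that are set aside because verification would be harder than solving. The dimension names and the exclusion rule come from the discussion above; everything else is just illustrative Python.

```python
from itertools import product

LEVELS = ("easy", "hard")

# Enumerate the 2 x 2 x 2 environment matrix.
for generation, solving, verification in product(LEVELS, repeat=3):
    # Set aside the two subspaces where verification is harder than solving,
    # since they break the generator-verifier gap assumption discussed above.
    set_aside = solving == "easy" and verification == "hard"
    marker = "  <- set aside" if set_aside else ""
    print(f"generation={generation}, solving={solving}, verification={verification}{marker}")
```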
First Layer: Generation, Solving, and Verification Are All Easy
- Tasks in this category are simple in every aspect. Typically, humans already have mature and robust templates or rules for them—e.g., unit conversion, spelling correction, arithmetic, etc.
- Strategies for such tasks can be written as rules and do not require complex AI systems to learn them.
- Interestingly, although these are the simplest environment settings, LLMs are not necessarily the best strategies here, and in fact they do not need to be. For example, no matter how many attempts a language model makes at arithmetic, we have no theoretical guarantee that it never makes a mistake, whereas a calculator never does. Likewise, for arithmetic on billion-digit numbers, the operands cannot even fit in the model's context, but a simple big-integer algorithm handles them easily.
- Does this imply that language models are not good? Of course not. Rather, it means that different types of problems call for differently designed intelligence. A language model can solve such problems simply by calling a calculator or writing code (a minimal sketch follows this list). When simple problems challenge the robustness of high-level intelligence, the high-level intelligence can use induction and reasoning to orchestrate lower-level intelligences to solve them.
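As a minimal sketch of that delegation, assume the model emits a call such as `add(a, b)` instead of generating digits token by token. The tool below is just Python's exact big-integer arithmetic; the routing step and the function names are hypothetical, not any particular model's tool API.

```python
# A rule-based arithmetic tool: exact for arbitrarily large integers,
# which a sampled language model cannot guarantee on its own.
def add(a: int, b: int) -> int:
    # Python integers have arbitrary precision, so this never overflows.
    return a + b

def multiply(a: int, b: int) -> int:
    return a * b

# The high-level policy would route the subproblem here rather than
# generating the digits itself, e.g. for two million-digit numbers:
x = 10**1_000_000 + 7
y = 10**1_000_000 + 11
assert add(x, y) == 2 * 10**1_000_000 + 18
```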
Second Layer: Either Solving or Generation Is Very Difficult


- This layer corresponds to most current RL research for LLMs. Two representative directions are RLHF and RLVR (a sketch contrasting their reward signals follows this list):
- RLHF corresponds to scenarios where generating high-quality problems (data collection) is very difficult. For product-grade LLMs, we need to collect real-world queries from actual user logs rather than relying on simple datasets for preference learning, so constructing challenging tasks/goals is very hard. Initially, the challenge of RLHF seemed to be verification difficulty, but a series of works has shown that with high-quality human preference data, reward models can indeed learn accurate human preferences. Good data consists of two parts: good questions and good model answers (not merely good reference answers, but good rollouts from a better-trained policy LLM). All of this depends on deploying on-policy RL in product-grade LLMs and building a data flywheel.
- RLVR corresponds to scenarios where solving the problems is very difficult. It was only two years after RLHF became common in LLM post-training that the RLVR paradigm emerged. Before RLVR was applied to mathematics, there was never a shortage of math problems, and their results were easy to verify. However, when the base model's capabilities were insufficient and the search space was not well shaped, it was hard for early-stage RL to explore strategies that yielded positive feedback. It is like a monkey typing on a keyboard: it could theoretically type Shakespeare, but we do not know how long that would take. If RL instead starts from a strong base, it is like a PhD in literature at the keyboard, and the probability of producing literature is much higher. People now realize how important pretraining a strong base model is for RLVR, and mid-training efforts are also emerging.
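The two directions differ mainly in where the reward signal comes from. The sketch below is illustrative rather than any specific system's implementation: a Bradley-Terry-style pairwise loss for training a reward model on human preference pairs (the usual RLHF recipe), and a rule-based verifiable reward that compares a final extracted answer against ground truth (the RLVR setting). `reward_model` is a placeholder for any model mapping responses to scalars, and the answer-extraction regex is deliberately naive.

```python
import re
import torch.nn.functional as F

# --- RLHF: reward learned from human preference pairs (Bradley-Terry loss) ---
def preference_loss(reward_model, chosen, rejected):
    """`reward_model` is a placeholder mapping a batch of responses to scalar rewards."""
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    # Push the preferred response's reward above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# --- RLVR: reward computed by a rule against a known ground truth ---
def verifiable_reward(response: str, ground_truth: str) -> float:
    """Take the last number in the response as the final answer and compare it."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == ground_truth.strip() else 0.0
```

The preference loss is only as good as the pairs it is trained on, which is exactly the data-collection difficulty discussed above, while the verifiable reward is trivial to compute but only informative once the policy reaches correct answers often enough to receive it.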
Third Layer: Only Verification Is Easy / Only Generation Is Easy


- This layer corresponds to directions we are only beginning to explore. It is hard, but as more effort is invested, these environments will gradually be built and used to train more advanced intelligence.
- Only Verification Is Easy: Both generation and solving are difficult. A typical example is the highest-difficulty math problems. Math problems with standard answers are always easy to verify, and most high-difficulty math problems can currently be solved by LLMs. To further improve intelligence, we need even harder math problems, but where do we collect them? That is the difficulty: it requires the smartest human minds to keep producing more difficult (yet still solvable) problems to train the models. This process is clearly unsustainable and cannot scale up. Humanity's final mathematical frontier can be updated annually, but it will grow thinner and thinner, and the requirement that problems be solvable by humans caps the upper bound of this type of intelligence. If an AI both generates and solves the problems, the generator-verifier gap assumption is violated. Constructing this type of environment is therefore resource-constrained, specifically by human intellectual resources.
- Only Generation Is Easy: Both verification and solving are difficult. The main challenge here lies in verification: tasks that are subjective, require semantic understanding, lack unified evaluation standards, or carry high time and labor costs to verify, such as artistic and literary creation, policymaking, education, and healthcare. In these areas we have a vast number of problems to solve, but it is very difficult to determine whether AI has actually solved them. A key feature of this subspace is human participation: AI will become part of human civilization, participating in and influencing social activities and receiving feedback from human society. This is a far more challenging direction, namely optimizing a system that includes both humans and AI (a sketch of one stopgap for the verification side follows this list).
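When no rule-based verifier exists, a common stopgap, sketched below under loose assumptions, is to aggregate several noisy judgments into a scalar reward. Each `judge` is a placeholder for a human rater or an LLM scoring the response against a rubric; nothing here specifies how those judges are actually implemented.

```python
import statistics
from typing import Callable, Sequence

def noisy_judge_reward(
    prompt: str,
    response: str,
    judges: Sequence[Callable[[str, str], float]],
) -> float:
    """Aggregate several subjective scores in [0, 1] into a single reward.

    The median damps individual judge noise, but the signal remains
    expensive and imperfect, which is what makes this subspace hard to scale.
    """
    scores = [judge(prompt, response) for judge in judges]
    return statistics.median(scores)
```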
Final Layer: Expert-Level / Superhuman
- This subspace is difficult in all dimensions. We cannot take shortcuts by exploiting one dimension that is easier than the others to train intelligence. I currently cannot give an example of this subspace, but it must exist. Perhaps at this level, AI will develop AI, regulate AI, and leverage AI.
Citation
If you found the topics in this blog post interesting and would like to cite it, you may use the following BibTeX entry:
@article{next_scaling_202507,