[Some Questions asking Myself 2025.5]

The second post in my "some very personal questions to myself" series. It's been over a year since the last post, and academia and industry have made a lot of progress on LLMs, which partially answers my earlier questions. I will introduce these works and ask myself some new questions; see the last post here[17]. This post is about the pretraining ceiling, the second half, and scaling the environment.

Questions from a Year Ago

Can Compression Solve Everything?

  • One year later, it appears that mainstream AI research still adheres to the LLM compression paradigm: using pretraining to compress world knowledge, then relying on post-training to extract it.
  • As for whether LLMs can discover entirely new knowledge, research has now largely shifted to the AI4Science domain.
  • Regarding the example from the previous blog post involving mathematicians and biologists:
    • In foundational fields like mathematics, new breakthroughs can often be achieved via interpolation within existing research ideas and paradigms. For example, DeepMind’s AlphaEvolve[2] combines evolutionary algorithms with LLMs to discover new fast matrix-multiplication algorithms. The foundational algorithmic knowledge required was already encoded in the model via compression; through prompt engineering and evolutionary iteration, the system uncovered “low-hanging fruit” that humans had yet to explore. A toy sketch of this loop is shown after this list.
    • In empirical sciences like biology, which rely heavily on large amounts of new observations, an agentic approach can allow LLMs to interact with the real world using tools to synthesize new knowledge. In this paradigm, the LLM acts more like a scientist’s tool than a replacement. Another path is to bypass the reasoning abilities of LLMs altogether and build domain-specific models directly from field data—like Evo2[3], which trains on genome sequences. For naturally sequential data (like genomes), retraining domain-specific models makes sense; for non-sequential data, one can structure it as text, using language models for knowledge organization and reasoning.
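
Below is a minimal, hypothetical sketch of the evolutionary loop mentioned in the mathematics bullet above. It is not AlphaEvolve itself: `llm_propose` and `evaluate` are assumed placeholders for an actual LLM call and a task-specific scorer (e.g., correctness plus operation count for a matrix-multiplication routine). The LLM acts as the mutation operator; the evaluator supplies the selection pressure.

```python
# Hypothetical sketch of evolutionary search driven by an LLM.
# llm_propose and evaluate are placeholders, not AlphaEvolve's actual interfaces.

def llm_propose(parent_program: str, feedback: str) -> str:
    """Ask an LLM to rewrite `parent_program`, guided by evaluator feedback (placeholder)."""
    raise NotImplementedError  # stand-in for an actual LLM call

def evaluate(program: str) -> float:
    """Score a candidate; higher is better, e.g. correctness minus operation count (placeholder)."""
    raise NotImplementedError  # stand-in for a task-specific checker

def evolve(seed_program: str, generations: int = 50, population: int = 8):
    """Evolutionary loop: the LLM mutates programs, the evaluator selects survivors."""
    pool = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        parents = sorted(pool, reverse=True)[:population]
        children = []
        for score, program in parents:
            child = llm_propose(program, feedback=f"current score: {score}")
            children.append((evaluate(child), child))
        # Keep only the fittest candidates for the next generation.
        pool = sorted(pool + children, reverse=True)[:population]
    # Return the best (score, program) pair found.
    return max(pool)
```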

World Models: Data-Driven?

  • There has been no substantial breakthrough in world modeling so far.
  • Researchers have found that using LLMs to simulate both the world and the agent requires different capabilities[1].
  • More practical progress lies in the domain of LLM Agents: using models to construct interactive environments, such as video generation models or 3D space models—collectively referred to as world models. This reflects another extension trend in agent research: scaling the environment.
  • In Advances and Challenges in Foundation Agents, scholars provided a comprehensive overview of the current state of world model research[6] from the agent perspective; most current approaches rely on models or external simulators and treat world models as single-task modeling problems that ultimately reduce to traditional single-step prediction.

The "Bitter Lesson" of Agents?

  • The “bitter lesson” still holds true. For instance, Alita: Generalist Agent[5] minimizes prior design and maximizes freedom, autonomously building and invoking MCP tools and achieving impressive results on the GAIA benchmark.
  • Minimal priors and maximal freedom mean the agent’s capabilities are internalized within the base model, requiring no additional framework or scaffolding. We have yet to see truly “agent-native” application scenarios.
  • Since the release of OpenAI’s o1 and DeepSeek R1, the industry consensus is that even the most basic LLM response can shift from System 1 to System 2 reasoning.

Alignment and Feedback

  • As mentioned a year ago, post-training is essentially about steering LLMs: traditional alignment sacrifices some capability to guide models toward safer behavior; similarly, we can steer models toward more intelligent yet more hallucination-prone behavior, as demonstrated by DeepSeek R1’s "reasoning" capabilities.
  • Our understanding of post-training continues to deepen. Looking at the life cycle of LLMs[7], we find that knowledge is hard to truly add, remove, or edit during the post-training phase; retrieval remains the most viable option. Hence, the “incentivize, not memorize” principle seems reasonable.
  • Pretraining has already endowed models with powerful knowledge. Traditional supervised-learning-based post-training can only introduce limited new knowledge. If we wish to add knowledge and capability, it’s better to stick with next-token-prediction rather than simple input-output pairs. So far, reinforcement learning appears to be a more suitable direction for post-training, enabling models to explore autonomously and truly grow in ability, rather than merely memorizing narrowly defined tasks.
  • From a reward perspective, just as alignment uses human data to train reward models and inject values, we can also design rule-based rewards to encode natural laws into models.
  • Recent discussions about whether post-trained models can exceed the pretraining ceiling[8] suggest that if post-training only steers the model without injecting new knowledge, the outcome matches the entropy-reduction expectation. In RL, the policy rolls out candidate solutions and only the good ones get reinforced; if the correct solution never appears in the rollouts, the pretraining ceiling cannot be breached (a minimal sketch of this view follows this list). Traditional alignment data flywheels may face a similar bottleneck, though their knowledge scope is so broad that it is hard to investigate.
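
Here is a toy sketch of the rule-based-reward, rollout-and-reinforce view described above, loosely in the spirit of RLVR/GRPO-style training rather than any specific implementation; `policy_sample` is a hypothetical placeholder for sampling from the current policy. The point it illustrates: when no sampled rollout contains the correct solution, every reward and every advantage is zero, so the update provides no signal toward that solution.

```python
import re
from statistics import mean, pstdev

def policy_sample(prompt: str) -> str:
    """Placeholder: sample one completion from the current policy model."""
    raise NotImplementedError

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Verifiable rule: reward 1.0 only if the final boxed answer matches the reference."""
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold_answer else 0.0

def group_advantages(rewards: list) -> list:
    """Group normalization: advantage = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

def rollout_step(prompt: str, gold_answer: str, n_samples: int = 8):
    completions = [policy_sample(prompt) for _ in range(n_samples)]
    rewards = [rule_based_reward(c, gold_answer) for c in completions]
    # If the correct solution never appears in the rollouts, all rewards are 0,
    # all advantages are 0, and the update pushes the policy toward nothing new:
    # the pretraining-ceiling intuition discussed above.
    return list(zip(completions, group_advantages(rewards)))
```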

Beyond Language

  • The multimodal field has surged forward over the past year. Many advances are not in foundational techniques but in user experience: combining multiple modalities simply engages more of the senses. For example, Google’s Veo 3[9] can generate video and synchronized audio simultaneously.
  • At the same time, we’ve seen fascinating new pretraining paradigms such as fractal neural networks[10] and diffusion language models[11].
  • What do additional modalities mean for reasoning capabilities? Most current work answers by reusing text reasoning for visual tasks: other modalities are fed into a text-based chain of thought rather than reasoned over natively. What I hope to see is purely visual chain-of-thought reasoning. Some recent efforts, like Visual Planning: Let’s Think Only with Images[4], attempt image-only reasoning, but such tasks are often limited to navigation, maps, or FrozenLake-style scenarios, and they still require interpreting intermediate images into action commands. To achieve true “image-to-image” reasoning, clearer problem definitions and task setups may be needed.

New Questions

Beyond Pretraining and Post-training

  • The field continues to explore scaling laws, showing from perspectives such as communication and information theory[12] and physics[13] that knowledge can be transferred to models effectively via next-token prediction and that model capacity can be quantified and predicted.
  • Is it possible to go beyond the pretraining and post-training paradigm to truly enable adding, deleting, updating, and querying knowledge and capabilities? This is crucial for personalized LLMs.
  • Existing knowledge editing methods remain too simplistic and intrusive.

Self-Evolution

  • Recently, research on "model self-evolution" has grown rapidly. Nearly all unsupervised, weakly supervised, and self-supervised post-training approaches claim self-evolution capabilities. But is this truly self-evolution? Just as AlphaGo evolved through self-play, can LLMs under RLVR paradigms achieve genuine self-evolution? Or is this still just “self-entertainment” within the boundaries of pretraining?

What Am I Overlooking?

  • Both academia and industry are being driven by a blind arms race:
    • When industry makes a breakthrough with large-scale models or lowers research costs, academia quickly follows to harvest the benefits, effectively becoming the tester for industry’s breakthroughs. From “Can LLMs do X?” benchmark papers to “XXPO” studies applying DeepSeek’s GRPO to various domains, researchers sometimes don’t even test other domains and instead just overfit a few math benchmarks.
    • Industry faces competitive pressure too: like smartphone vendors shipping a new iPhone every year, LLM companies roll out new versions monthly. If one company introduces a new model feature, competitors often replicate it by the next release cycle; if it can be replicated that quickly, the breakthrough falls within the foreseeable range of the scaling law.
    • This causes researchers to be constrained by low-hanging fruit and predictable problems, overlooking broader questions. Problems can remain unsolved, but critical thinking should never stop. In this LLM era, what overlooked domains are still worth examining?
  • Don’t underestimate applications. Applications are the final link of science serving society, and they can also reverse-inspire new research trends. ChatGPT is a classic example: by combining a simple chat interface with post-training, it brought the value of LLMs into the homes of everyday users—and in turn spurred academic interest in LLM research.
  • Why did I once overlook large models? And what am I overlooking now?

If LLMs Are AGI

  • Then should we use LLMs to build real AGI applications? Suppose AGI has arrived, and we can operate an entity equivalent to an ordinary person or even superhuman. What valuable things could we attempt?
  • “Operate” sounds very negative, but operating a human being is actually simple—it doesn’t require science. Many industries throughout history have relied on humans functioning as operated entities to keep running.
  • Our first thought is workers: blue-collar, white-collar, industry employees. Can they be replaced by AI? But there is also a different angle, like recruiting ancient human subjects[14]. This line of thinking doesn’t just ask which jobs can be replaced; it explores tasks that require humans yet are infeasible for real humans, where perhaps AI can step in.

“The Second Half”

  • Recent works such as The Second Half[15] and Welcome to the Era of Experience[16] suggest that research should shift from “how to solve problems” to “how to define problems.”
  • I strongly agree, but I see this as a terminology update following the rise of powerful LLMs—the paradigm itself hasn’t changed: we still define tasks and solve them with models. What’s changed is that we’ve moved from constructing datasets to designing new environments, and from proposing new models to enabling models to learn online in environments and outperform others. We’re not scaling datasets—we’re scaling environments.
  • In what directions should we “scale the environments”?
    • A quality environment should be able to generate unlimited data; data volume should no longer be the only axis of expansion (a toy sketch of such an environment follows this list).
    • The environment’s difficulty should become the focus of expansion.
    • Beyond digital environments, real-world environments may be an important milestone—for instance, agents physically building cities on Earth.
    • Scientific environments may offer even higher ceilings.
  • In this “second half,” is RL the only thing we need?
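
As a concrete illustration of the directions above, here is a minimal, hypothetical sketch of an environment that generates unlimited task instances and exposes difficulty as an explicit knob. The arithmetic task is a toy placeholder; a real environment would pair a richer task generator with an equally rigorous verifier.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    prompt: str
    answer: str

class ProceduralEnv:
    """Toy environment: unlimited procedurally generated tasks with a difficulty knob."""

    def __init__(self, difficulty: int = 1, seed: Optional[int] = None):
        self.difficulty = difficulty        # the axis to scale: harder, not just more
        self.rng = random.Random(seed)

    def sample_task(self) -> Task:
        """Unlimited data: every call generates a fresh instance."""
        n_terms = 2 + self.difficulty
        terms = [self.rng.randint(1, 10 ** self.difficulty) for _ in range(n_terms)]
        return Task(prompt=" + ".join(map(str, terms)) + " = ?", answer=str(sum(terms)))

    def reward(self, task: Task, response: str) -> float:
        """Rule-based verifier: exact match on the final answer."""
        return 1.0 if response.strip() == task.answer else 0.0

# Curriculum idea: raise `difficulty` once the agent's success rate saturates,
# instead of collecting more data at a fixed difficulty.
env = ProceduralEnv(difficulty=2, seed=0)
task = env.sample_task()
print(task.prompt, "->", env.reward(task, task.answer))
```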

Citation

If you find this blog post interesting and wish to cite it, you may use the following bibtex:

@article{next_on_llm_2025_5,
author = {Wei Liu},
title = {[Some Questions asking Myself 2025.5] Pretrain Ceiling, Second Half, Scaling the Environment},
year = {2025},
month = {5},
url = {https://thinkwee.top/2025/05/21/next-on-llm-2/},
note = {Blog post}
}
  • [1] Li, M., Shi, W., Pagnoni, A., West, P., & Holtzman, A. (2024). Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling. In arXiv: Vol. abs/2407.02446. https://doi.org/10.48550/ARXIV.2407.02446
  • [2] Google DeepMind. (2025). AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms. https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
  • [3] Stanford University. (2025). Evo2: Generative AI tool marks a milestone in biology and accelerates the future of life sciences. https://news.stanford.edu/stories/2025/02/generative-ai-tool-marks-a-milestone-in-biology-and-accelerates-the-future-of-life-sciences
  • [4] Xu, K., Wang, Y., Zhang, R., & Chen, X. (2025). Visual Planning: Let's Think Only with Images. In arXiv: Vol. abs/2505.11409. https://arxiv.org/abs/2505.11409
  • [5] Qiu, H., Xiao, C., Yang, Y., Wang, H., & Zheng, L. (2025). Alita: Generalist Agent with Minimal Predefinition and Maximal Self-Evolution. In arXiv: Vol. abs/2505.20286. https://arxiv.org/abs/2505.20286
  • [6] Liu, M., Tian, Y., Yang, S., Chen, B., & Zhou, B. (2025). Advances and Challenges in Foundation Agents. In arXiv: Vol. abs/2504.01990. https://arxiv.org/abs/2504.01990
  • [7] Zhang, S., Chen, J., & Wang, W. Y. (2025). The LLM Knowledge Lifecycle: An AAAI 2025 Tutorial. https://llmknowledgelifecycle.github.io/AAAI2025_Tutorial_LLMKnowledge/
  • [8] Wei, J., Chen, X., & Bubeck, S. (2025). Can reasoning emerge from large language models? Investigating the limits of reasoning capabilities in pre-trained and fine-tuned models. In arXiv: Vol. abs/2504.13837. https://arxiv.org/abs/2504.13837
  • [9] Google DeepMind. (2025). Veo3: Advancing text-to-video generation with synchronized audio. https://deepmind.google/models/veo/
  • [10] Chen, L., Dai, Y., & He, K. (2025). Fractal Neural Networks: Scaling Deep Learning Beyond Linear Paradigms. In arXiv: Vol. abs/2502.17437. https://arxiv.org/abs/2502.17437
  • [11] Yang, M., Tian, Y., & Lin, Z. (2025). Diffusion Language Models: Toward Controllable Text Generation with Guided Diffusion. In arXiv: Vol. abs/2502.09992. https://arxiv.org/abs/2502.09992
  • [12] Rao, S., Knight, W., & Sutskever, I. (2024). Scaling Laws from an Information-Theoretic Perspective. In arXiv: Vol. abs/2411.00660. https://arxiv.org/abs/2411.00660v2
  • [13] Allen-Zhu, Z., & Li, Y. (2025). On the Connection between Physical Laws and Neural Scaling Laws. https://physics.allen-zhu.com/
  • [14] Jiang, L., Cohen, J., & Griffiths, T. L. (2024). Recruiting ancient human subjects with large language models. Proceedings of the National Academy of Sciences, 121(21), e2407639121. https://www.pnas.org/doi/10.1073/pnas.2407639121
  • [15] Chen, K., Liu, H., & Zhang, D. (2025). The Second Half: From Solving Problems to Defining Problems. https://ysymyth.github.io/The-Second-Half/
  • [16] DeepMind. (2025). Welcome to the Era of Experience. https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf
  • [17] Liu, W. (2024). [Some Questions asking Myself 2024.4] Compression, World Model, Agent and Alignment. https://thinkwee.top/2024/04/23/next-on-llm/