[Some Questions I Ask Myself, 2024.4]

Some very personal questions, assumptions, and predictions about the future after the large-model era. I hope to make a habit of writing such a forward-looking post every half year, to keep myself thinking about the "next token" of the AI era. This post is about compression, world models, agents, and alignment.

Is Compression Our Only Path to General Intelligence?

Is compression all we need?

  • The first question is about compression.
  • Large models compress all the textual data in the world into the parameters of a single model, enabling everyone to "extract" information through natural language interaction. This process undoubtedly alleviates knowledge or information asymmetry. For example, a dentist can query an LLM to write code, while a programmer can enhance their paper writing with the assistance of an LLM. Extracting pre-encoded knowledge from LLMs is always beneficial. However, our aspirations go beyond this simple query-based knowledge retrieval. We wonder:
    • Can new discoveries emerge from the compressed information/knowledge in these models? For instance, could a physicist uncover a new law from an LLM? Or could an LLM predict the content of this post? The answer is uncertain; there are arguments on both sides.
      • On the affirmative side, mathematicians provide an example—many discoveries in pure theoretical research arise solely from scientists' cognitive processes and prior knowledge. Compression-based large models excel at leveraging past knowledge. If they can effectively simulate the cognitive process of scientists, they might achieve groundbreaking discoveries.
      • On the negative side, some discoveries require empirical observation. They are "discovered" because someone observes them, such as the identification of new species in biology, which cannot be inferred merely from known information.
      • Another question worth pondering is whether new discoveries are even necessary. After all, perhaps 99.999% of the world's activities in the next second follow established patterns. A tool that efficiently extracts and applies these patterns can still profoundly impact humanity. While this is true, our pursuit of AGI compels us to strive for more than this pragmatic goal.
    • The core question hinges on "Is compression all we need?"[1] If I could compress all the world's myriad and diverse data into a model, could it predict the future? If the model could accurately simulate the entire world, the answer would be yes—fast-forwarding the simulation would reveal glimpses of the future. But does compression combined with conditional extraction truly equate to simulation?
    • Elon Musk once remarked that the focus should be on the transformation between energy and intelligence. Is compression the best method for such transformation? Perhaps it serves as an efficient intermediary between energy and compressed knowledge (instead of intelligence).
    • Related to this "compression question" is another: "Is predicting the next token all we need?" This question probes the limits of procedural and causal knowledge representation. (A minimal sketch of the prediction-compression link follows this list.)
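
As a toy illustration of the prediction-compression link (the sense in which [1] argues that compression is prediction), here is a minimal sketch. It assumes a hypothetical function next_token_logprob(context, token) standing in for any language model that returns log p(token | context); it is not a real API.

import math

def description_length_bits(tokens, next_token_logprob):
    """Shannon code length, in bits, that an arithmetic coder driven by a
    language model would spend to encode `tokens`.

    `next_token_logprob(context, token)` is a hypothetical stand-in for any
    model returning the natural-log probability log p(token | context).
    """
    total_bits = 0.0
    for i, token in enumerate(tokens):
        logp = next_token_logprob(tokens[:i], token)   # log-prob of the true next token
        total_bits += -logp / math.log(2)              # convert nats to bits
    return total_bits

A model with lower next-token loss assigns higher probability to what actually comes next, so it produces a shorter code: better prediction is better compression. Whether this lossless-compression view also buys forward simulation of the world is exactly the open question above.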

World Model: A Data-Driven Approach?

  • Regarding world models, a popular concept posits that intelligence comprises several interconnected subsystems (e.g., cognition, memory, perception, and world models), informed by human cognitive priors. The world model specifically refers to our brain's simulation of the world, enabling decision-making without waiting for real-world interaction.
  • The aspiration is to model these subsystems individually. However, most of our data is either unsupervised or end-to-end (holistic rather than divided into subsystems). Unsupervised data poses challenges in enabling all subsystem functionalities (e.g., language model pretraining struggles with instruction-following). End-to-end data might not train all subsystems effectively.
  • If we could segment and organize data to correspond to these subsystems, could we achieve a world model in the form of multi-agent or multi-LM systems? (A toy structural sketch of such a modular loop follows below.)
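
To make the subsystem picture concrete, below is a purely illustrative toy skeleton in Python. All components (perception, world model, policy, memory) are hypothetical placeholders, e.g., separate learned modules or LMs; it is a sketch of the general idea, not an implementation of any specific proposal such as [2].

from dataclasses import dataclass, field

@dataclass
class ModularAgent:
    """Toy skeleton of the subsystem view: perception encodes observations,
    memory stores them, a world model supports imagined rollouts, and a
    policy chooses actions based on those rollouts."""
    perception: callable          # observation -> state representation
    world_model: callable         # (state, action) -> predicted next state
    policy: callable              # (state, imagined futures) -> action
    memory: list = field(default_factory=list)

    def step(self, observation, candidate_actions):
        state = self.perception(observation)
        self.memory.append(state)
        # Decide by simulating outcomes internally, not by waiting for the
        # real world to respond to each candidate action.
        imagined = {a: self.world_model(state, a) for a in candidate_actions}
        return self.policy(state, imagined)

The open question above is really about data: whether unsupervised or end-to-end corpora can be segmented and routed so that each of these placeholders receives a training signal of its own.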

Agents

  • Could the "Bitter Lesson" of scaling, which OpenAI has embraced, overshadow many lines of research on large models? Will agent-based research meet a similar fate? In other words, once large models are scaled up further, will research on agents remain irreplaceable? This might depend on whether the raw outputs of LLMs can transition from "System 1" (intuitive responses) to "System 2" (deliberative reasoning)[2][3].

  • If an agent possesses all the actions and information of a human, can we consider it equivalent to a human?

Alignment and Feedback

  • Everything revolves around the data flywheel: the goal is for each round of alignment to yield a better training signal than the last.
  • Alignment demonstrates the importance of improving positive samples rather than focusing on negative samples, distinguishing it significantly from contrastive learning.
  • Alignment[4] can be beneficial or detrimental, depending on the goal to which the model is aligned.
  • Some interesting questions are:
    • How can we integrate various forms of feedback (human/non-human, textual/other modalities, social/physical)?
    • By connecting all these feedback types, we might align models with more powerful goals. Moreover, the laws governing this integration could reveal fundamental rules of the world.
    • Reward models exemplify the energy hidden in tradeoffs: by sacrificing some precision, we gain scalable training, rewarding, and labeling. This tradeoff results in stable improvements. Can we uncover more such "energy" within these processes? (A minimal sketch of the pairwise reward-model objective follows this list.)
      • For example, could cascading reward models (like interlocking gears) amplify the reward knowledge encoded by human annotations across datasets?
    • Similarly, the alignment tax[5] represents another tradeoff. Is there latent "energy" in these tradeoffs, where sacrificing A for B leads to overall intelligence gains?
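
As a concrete example of the reward-model tradeoff mentioned above, here is a minimal sketch of the standard pairwise (Bradley-Terry) objective used in preference learning [4]. reward_model(prompt, response) is a hypothetical scorer returning a scalar; nothing here is tied to a particular library.

import math

def pairwise_reward_loss(reward_model, prompt, chosen, rejected):
    """Loss on one human preference pair: push the scalar reward of the
    chosen response above that of the rejected one.

    `reward_model(prompt, response)` is a hypothetical scorer returning a
    float; minimizing this loss maximizes log sigmoid(r_chosen - r_rejected).
    """
    margin = reward_model(prompt, chosen) - reward_model(prompt, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

The tradeoff is visible here: each label is only a coarse binary preference, far less precise than a full demonstration, but the scalar reward trained from many such labels can then score and rank unlimited new samples cheaply, which is what lets alignment training scale.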

Beyond Language

  • Language is more intricate, abstract, and tied to reasoning than other modalities because it is fundamentally "unnatural": it is a human invention.
  • Nonetheless, researchers have identified an elegant objective for language: predicting the next token, a goal reflecting the entire history of computational linguistics.
  • Other modalities, like images, videos, and sounds, are "natural," as they convey raw information from the physical world. Could these modalities have objectives as intuitive or as powerful as predicting the next token? (A toy discretization sketch follows this list.)
  • What implications do multimodal capabilities have for the reasoning abilities of large models?
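
One concrete way to read this question: if other modalities can be mapped to discrete tokens, the same next-token objective applies unchanged. The sketch below shows a toy vector-quantization step over a hypothetical learned codebook; it is not any specific tokenizer, only an illustration of how "natural" signals could be discretized.

def quantize_to_tokens(patch_vectors, codebook):
    """Map continuous patches (image patches, audio frames, etc.) to the
    index of their nearest codebook vector, producing a discrete token
    sequence that the usual next-token objective can consume.

    `patch_vectors` and `codebook` are lists of equal-length float lists;
    the learned codebook itself is assumed rather than shown.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in patch_vectors]

Once discretized, "predict the next token" treats these tokens exactly like text; whether that objective is as natural and as powerful for pixels and sound as it is for words is the question above, and it bears directly on what multimodality contributes to reasoning.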

Cite This Post

If you find this post helpful or interesting, you can cite it as:

@article{next_on_llm_2024,
  author = {Wei Liu},
  title  = {[Some Questions I Ask Myself, 2024.4] Compression, World Model, Agent and Alignment},
  year   = {2024},
  month  = {4},
  url    = {https://thinkwee.top/2024/04/23/next-on-llm/#more},
  note   = {Blog post}
}
  • [1] Rae, J. Compression for AGI. Stanford MLSys Seminar #76. https://www.youtube.com/watch?v=dO4TPJkeaaU
  • [2] LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Version 0.9.2, 2022-06-27. Open Review, 62(1), 1-62.
  • [3] Bengio, Y. (2017). The Consciousness Prior. arXiv preprint arXiv:1709.08568.
  • [4] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems, 30.
  • [5] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., ... & Kaplan, J. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861.