[Some Questions I Am Asking Myself, 2024.4]
Some very personal questions, assumptions, and predictions about the future beyond the large-model era. I hope to make it a habit to write such a forward-looking post every half year, to keep myself thinking about the "next token" of the AI era. This post is about compression, world models, agents, and alignment.
Is Compression Our Only Path to General Intelligence?
Is compression all we need?
- The first question is about compression.
- Large models compress all the textual data in the world into the
parameters of a single model, enabling everyone to "extract" information
through natural language interaction. This process undoubtedly
alleviates knowledge or information asymmetry. For example, a dentist
can query an LLM to write code, while a programmer can enhance their
paper writing with the assistance of an LLM. Extracting pre-encoded
knowledge from LLMs is always beneficial. However, our aspirations go
beyond this simple query-based knowledge retrieval. We wonder:
- Can new discoveries emerge from the compressed
information/knowledge in these models? For instance, could a
physicist uncover a new law from an LLM? Or could an LLM predict the
content of this post? The answer is uncertain: it could be yes or no.
- On the affirmative side, mathematicians provide an example—many
discoveries in pure theoretical research arise solely from scientists'
cognitive processes and prior knowledge. Compression-based large models
excel at leveraging past knowledge. If they can effectively simulate the
cognitive process of scientists, they might achieve groundbreaking
discoveries.
- On the negative side, some discoveries require empirical
observation. They are "discovered" because someone observes them, such
as the identification of new species in biology, which cannot be
inferred merely from known information.
- Another question worth pondering is whether new discoveries are even necessary. After all, perhaps 99.999% of the world's activities in the next second follow established patterns. A tool that efficiently extracts and applies these patterns can still profoundly impact humanity. While this is true, our pursuit of AGI compels us to strive for more than this pragmatic goal.
- The core question hinges on "Is compression all we need?"[1] If I could compress all the world's myriad and diverse data into a model, could it predict the future? If the model could accurately simulate the entire world, the answer would be yes—fast-forwarding the simulation would reveal glimpses of the future. But does compression combined with conditional extraction truly equate to simulation?
- Elon Musk once remarked that the focus should be on the
transformation between energy and intelligence. Is compression
the best method for such transformation? Perhaps it serves as
an efficient intermediary between energy and compressed knowledge
(instead of intelligence).
- Related to this "compression question" is another: "Is predicting the next token all we need?" This question probes the limits of procedural and causal knowledge representation. A small numerical sketch after this list makes the prediction-as-compression link concrete.
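To ground the "compression = prediction" framing, here is a minimal sketch of my own (the toy tokenizer and placeholder model below are illustrative, not from any real system): the number of bits an ideal arithmetic coder spends on a sequence equals the model's negative log2-likelihood, so minimizing next-token cross-entropy is literally minimizing compressed size.

```python
import math

def compressed_size_bits(tokens, next_token_prob):
    """Bits an ideal arithmetic coder would spend encoding `tokens`
    when its probabilities come from a next-token predictor.
    next_token_prob(prefix, token) -> p(token | prefix)"""
    bits = 0.0
    for i, tok in enumerate(tokens):
        p = next_token_prob(tokens[:i], tok)
        bits += -math.log2(p)  # code length contributed by this token
    return bits

# Toy "model": a uniform distribution over a 4-symbol vocabulary.
# A real LLM assigns sharper probabilities to likely continuations,
# so the same text costs fewer bits, i.e., compresses better.
vocab = ["a", "b", "c", "d"]
uniform = lambda prefix, tok: 1.0 / len(vocab)

print(compressed_size_bits(list("abcabcab"), uniform))  # 8 tokens * 2 bits = 16.0
```

Under this view, "better prediction" and "better compression" are the same quantity measured in bits; whether that quantity also measures the ability to simulate the world is exactly the open question above.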
World Model: A Data-Driven Approach?
- Regarding world models, a popular framing posits that intelligence comprises several interconnected subsystems (e.g., cognition, memory, perception, and a world model), informed by human cognitive priors. The world model specifically refers to our brain's simulation of the world, enabling decision-making without waiting for real-world interaction (a toy "simulate before acting" loop is sketched after this list).
- The aspiration is to model these subsystems individually. However,
most of our data is either unsupervised or end-to-end (holistic rather
than divided into subsystems). Unsupervised data poses challenges in
enabling all subsystem functionalities (e.g., language model pretraining
struggles with instruction-following). End-to-end data might not train
all subsystems effectively.
- If we could segment and organize data to correspond to these subsystems, could we achieve a world model in the form of multi-agent or multi-LM systems?
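To make "simulate before acting" concrete, here is a minimal, hypothetical sketch: `plan_by_imagination`, `world_model`, and `reward_fn` are stand-in names introduced for illustration, and the random-shooting planner is only one of many ways to use a learned simulator.

```python
import random

def plan_by_imagination(state, world_model, reward_fn, actions,
                        horizon=5, n_rollouts=32):
    """Choose an action by imagining futures with a learned world model
    instead of interacting with the real environment.
    world_model(state, action) -> predicted next state
    reward_fn(state)           -> scalar score of a state"""
    best_action, best_value = None, float("-inf")
    for first_action in actions:
        value = 0.0
        for _ in range(n_rollouts):
            s = world_model(state, first_action)
            ret = reward_fn(s)
            for _ in range(horizon - 1):  # random-shooting rollout in imagination
                s = world_model(s, random.choice(actions))
                ret += reward_fn(s)
            value += ret / n_rollouts
        if value > best_value:
            best_action, best_value = first_action, value
    return best_action

# Toy world: the state is a number, actions nudge it, and the goal is 10.
dynamics = lambda s, a: s + a
reward = lambda s: -abs(10 - s)
print(plan_by_imagination(0, dynamics, reward, actions=[-1, 0, +1]))  # +1 (almost surely)
```

The question in the last bullet then becomes whether the `world_model` piece must be a dedicated module, or whether it can emerge inside a multi-agent / multi-LM system trained on suitably segmented data.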
Agents
Could the Bitter Lesson (the observation that scale and general methods eventually beat hand-crafted structure, an approach OpenAI embodies) overshadow many lines of research on large models? Will agent research meet a similar fate, or will it remain irreplaceable even after models are scaled up further? This might depend on whether the raw outputs of LLMs can transition from "System 1" (intuitive responses) to "System 2" (deliberative reasoning)[2][3]; a toy propose-and-critique loop illustrating that transition is sketched below.
If an agent possesses all the actions and information of a human, can we consider it equivalent to a human?
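As a thought experiment on the System 1 to System 2 transition, here is a minimal sketch of a propose-and-critique loop; `llm_propose` and `llm_critique` are hypothetical callables standing in for model calls, not any real API.

```python
def deliberate(question, llm_propose, llm_critique, max_rounds=3):
    """Wrap a one-shot "System 1" answer in a small "System 2" loop:
    propose, critique the proposal, and revise until the critique passes.
    llm_propose(question, feedback) -> candidate answer (str)
    llm_critique(question, answer)  -> (ok: bool, feedback: str)"""
    answer = llm_propose(question, None)  # fast, intuitive first pass
    for _ in range(max_rounds):
        ok, feedback = llm_critique(question, answer)
        if ok:  # deliberation converged
            return answer
        answer = llm_propose(question, feedback)  # revise using the critique
    return answer
```

If scaled-up models internalize such loops within a single forward pass, dedicated agent scaffolding may indeed be overshadowed; if they do not, the scaffolding stays irreplaceable.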
Alignment and Feedback
- Everything revolves around the data flywheel: the objective is for each round of alignment to yield better training signals for the next model update.
- Alignment demonstrates the importance of improving positive samples
rather than focusing on negative samples, distinguishing it
significantly from contrastive learning.
- Alignment[4]
can be beneficial or detrimental, depending on the goal to which the
model is aligned.
- Some interesting questions are:
- How can we integrate various forms of feedback (human/non-human,
textual/other modalities, social/physical)?
- By connecting all these feedback types, we might align models with
more powerful goals. Moreover, the laws governing this integration could
reveal fundamental rules of the world.
- Reward models exemplify the energy hidden in tradeoffs: by sacrificing some precision, we gain scalable training, rewarding, and labeling, and this tradeoff yields stable improvements. Can we uncover more such "energy" within these processes? (The pairwise loss behind this tradeoff is sketched after this list.)
- For example, could cascading reward models (like interlocking gears)
amplify the reward knowledge encoded by human annotations across
datasets?
- Similarly, the alignment tax[5] represents another tradeoff. Is there latent "energy" in these tradeoffs, where sacrificing A for B leads to overall intelligence gains?
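For reference, the precision-for-scalability tradeoff usually enters through a pairwise preference loss of the Bradley-Terry kind; the sketch below is a generic illustration of that loss, not the recipe of any particular lab.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss on one human comparison:
    push the reward model to score the preferred response higher.
    loss = -log sigmoid(r_chosen - r_rejected)"""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A finite batch of human comparisons trains the reward model once;
# afterwards it can score unlimited new samples, trading some precision
# (it is only a proxy for the annotators) for cheap, scalable labeling.
print(round(pairwise_reward_loss(2.0, 0.5), 3))  # 0.201
```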
Beyond Language
- Language is more intricate, reasoned, and abstract than other
modalities because it is fundamentally "unnatural"—a construct of human
invention.
- Nonetheless, researchers have identified an elegant objective for language: predicting the next token, an objective with roots running through the entire history of computational linguistics.
- Other modalities, like images, videos, and sounds, are "natural," as they convey raw information from the physical world. Could these modalities have objectives as intuitive or powerful as predicting the next token? (The text objective and one commonly explored visual analogue are written out after this list.)
- What implications do multimodal capabilities have for the reasoning abilities of large models?
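For concreteness, the language objective the section refers to is the autoregressive cross-entropy below; the second expression is one commonly explored candidate for continuous modalities (a denoising, diffusion-style objective), offered as an illustration rather than as the answer to the question above.

$$\mathcal{L}_{\text{text}}(\theta) = -\sum_{t}\log p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}_{\text{denoise}}(\theta) = \mathbb{E}_{x,\,t,\,\epsilon\sim\mathcal{N}(0,I)}\big[\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\big]$$

where, in the second objective, $x_t$ is the input corrupted with noise at level $t$ and $\epsilon_\theta$ is the model's estimate of that noise.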
Cite This Post
If you find this post helpful or interesting, you can cite it as:
@article{next_on_llm_2024, ...}
- [1] Rae, J. Compression for AGI. Stanford MLSys Seminar, Episode 76. https://www.youtube.com/watch?v=dO4TPJkeaaU
- [2] LeCun, Y. (2022). A path towards autonomous machine intelligence. Version 0.9.2, 2022-06-27. Open Review, 62(1), 1-62.
- [3] Bengio, Y. (2017). The consciousness prior. arXiv preprint arXiv:1709.08568.
- [4] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- [5] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., ... & Kaplan, J. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.