Thinkwee's Blog

Too Stupid to Give Up Learning

The second post in my "some very-personal questions to myself" series. It has been over a year since the last post, and much progress on LLMs has been made in academia and industry, which partially answers my questions. I will introduce these works and ask myself some new questions. See the last post here[previous_post]. This post is about the Pretrain Ceiling, the Second Half, and Scaling the Environment.

Read more »

Some very personal questions, assumptions, and predictions about the future after the large model era. I hope to make it a habit to write such a future-asking post every half year, to keep myself thinking about the "next token" of the AI era. This post is about Compression, World Models, Agents, and Alignment.

Read more »


  • Notes on recent template-based task reformulation methods, a particularly interesting direction since the appearance of GPT-3. These methods design prompts that recast samples and tasks as natural-language templates, which are fed directly into a pre-trained language model to generate text, thereby completing the task indirectly. Prompt construction unifies the form of downstream tasks with the pre-training task (language modeling) and achieves good results in few-shot learning; a minimal cloze-prompting sketch follows the paper list. The nine key papers are:
    • Early work that converts the task into natural language and uses a pre-trained language model to answer it:
      • (Harvard) Commonsense Knowledge Mining from Pretrained Models
      • (Heidelberg) Argumentative Relation Classification as Plausibility Ranking
      • (NVIDIA) Zero-shot Text Classification With Generative Language Models
    • The PET approach, Pattern Exploiting Training:
      • (LMU) Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
      • (LMU) It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
      • (UNC) Improving and Simplifying Pattern Exploiting Training
    • Automatically constructing prompts, Automatically Searching Prompts:
      • (UCI, UCB) AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts
      • (Princeton, MIT) Making Pre-trained Language Models Better Few-shot Learners
      • (THU) GPT Understands, Too
        Read more »
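As a concrete illustration of the template idea, here is a minimal sketch of zero-shot sentiment classification with a cloze-style prompt and a masked language model. The template, the verbalizer words, and the model name are my own illustrative choices, not settings taken from any of the papers above.

```python
# Minimal sketch of cloze-style prompting; illustrative only, not PET's actual code.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def classify_sentiment(sentence: str) -> str:
    # Wrap the sample in a natural-language template with a [MASK] slot.
    prompt = f"{sentence} It was [MASK]."
    # The "verbalizer" maps label words back to task labels.
    verbalizer = {"great": "positive", "terrible": "negative"}
    # Score only the verbalizer words and pick the higher-scored one.
    preds = fill(prompt, targets=list(verbalizer))
    best = max(preds, key=lambda p: p["score"])
    return verbalizer[best["token_str"]]

print(classify_sentiment("The movie was a waste of two hours."))  # -> "negative"
```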


  • Notes on edit-based seq2seq methods from recent years, which offer high efficiency (partially autoregressive or non-autoregressive decoding) and lower data hunger (a small output vocabulary) for tasks whose input and output are in the same language and differ only slightly (error correction, simplification, summarization); a minimal tag-and-realize sketch follows the paper list.
  • Mainly read five papers, sorted by their publication date on arXiv:
    • (LevT, Facebook) Levenshtein Transformer
    • (Huawei) EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
    • (LaserTagger, Google) Encode, Tag, Realize: High-Precision Text Editing
    • (PIE) Parallel Iterative Edit Models for Local Sequence Transduction
    • (Google) Felix: Flexible Text Editing Through Tagging and Insertion
Read more »
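To make the tagging idea concrete, below is a toy sketch in the spirit of the encode-tag-realize pipeline: a tagger (assumed to exist, not shown) labels each source token with KEEP, DELETE, or KEEP|phrase, and a small realizer applies the tags. The tag format and function name are my own illustrative choices, not LaserTagger's actual implementation.

```python
# Toy "realize" step for tag-based text editing; illustrative only.
def realize(tokens, tags):
    output = []
    for token, tag in zip(tokens, tags):
        op, _, phrase = tag.partition("|")
        if phrase:                       # insert the attached phrase before this token
            output.extend(phrase.split("_"))
        if op == "KEEP":                 # keep the source token as-is
            output.append(token)
        # op == "DELETE": drop the source token
    return " ".join(output)

# Example: sentence fusion with a tiny edit-tag sequence.
tokens = "Turing was born in 1912 . He died in 1954 .".split()
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP",
        "DELETE", "DELETE", "KEEP|and_he", "KEEP", "KEEP", "KEEP"]
print(realize(tokens, tags))  # -> "Turing was born in 1912 and he died in 1954 ."
```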

A brief review of the VC dimension. All discussions are based on the simple case of binary classification.

Read more »

Notes from reading Dr. Bang Liu's doctoral thesis, Natural Language Processing and Text Mining with Graph-Structured Representations, from the University of Alberta.

Read more »

Study notes for Stanford CS224W: Machine Learning with Graphs by Jure Leskovec.

Read more »

A brief note on the CLSciSumm Workshop that the CIST lab participated in; the main focus is on methods. The experiments are analyzed in detail in the papers. Papers:

Read more »

Notes on how Fairseq handles incremental decoding at inference time for parallel-decoding models such as CNN seq2seq and the Transformer; a toy caching sketch follows.

Read more »
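The core idea, shown below as a toy sketch rather than Fairseq's actual incremental_state API, is that each decoder layer keeps a per-step cache of previously computed keys/values, so at step t only the newest token is processed instead of re-running the whole prefix. All names here are illustrative.

```python
# Toy sketch of incremental decoding with a per-layer cache; not Fairseq's API.
import torch

class ToyDecoderLayer(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x_t, cache):
        # x_t: (batch, 1, dim) -- only the newest token's hidden state.
        # Append the new key/value to the cache from earlier steps, so
        # self-attention sees the whole prefix without recomputing it.
        kv = self.proj(x_t)
        cache["kv"] = kv if "kv" not in cache else torch.cat([cache["kv"], kv], dim=1)
        attn = torch.softmax(x_t @ cache["kv"].transpose(1, 2), dim=-1)
        return attn @ cache["kv"]

layer = ToyDecoderLayer(dim=8)
cache = {}
x = torch.randn(2, 5, 8)                                   # 5 decoding steps
outputs = [layer(x[:, t:t + 1], cache) for t in range(5)]  # one token per call
```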

Long time no see, SVM.

Read more »

Paper reading notes on:

  • GNN Pooling
  • Discourse-Aware Summarization
  • Siamese BERT
  • Large Chatbot
Read more »

  • Knowledge Graph Special Collection
    • Entity Alignment in Cross-lingual Knowledge Graphs
    • Knowledge Graph Language Model
    • Dynamic Knowledge Graph Dialogue Generation
    • Graph2Seq
    • Graph Matching Network
    • Dynamic Knowledge Graph Update
    • Attention-based Embeddings for Relation Prediction
Read more »

Notes on some recent work on heterogeneous information networks; the PathSim definition is recalled after the list.

  • PathSim
  • HGNN
  • HGAN
  • HGAN for text classification
  • Attributed Multiplex Heterogeneous Network
  • Meta-graph Guided Random Walks
Read more »
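For reference, the PathSim similarity between objects \(x\) and \(y\) under a symmetric meta-path \(\mathcal{P}\), as defined in the original PathSim paper:

$$s(x, y) = \frac{2 \times \left|\{p_{x \rightsquigarrow y} : p \in \mathcal{P}\}\right|}{\left|\{p_{x \rightsquigarrow x} : p \in \mathcal{P}\}\right| + \left|\{p_{y \rightsquigarrow y} : p \in \mathcal{P}\}\right|}$$

where \(p_{x \rightsquigarrow y}\) is a path instance of \(\mathcal{P}\) between \(x\) and \(y\); the self-path counts in the denominator normalize for how "visible" each node is.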

Selected readings of papers on graph-based automatic summarization

  • AMR-based abstractive summarization
  • Two papers on AMR multi-document summarization
  • PageRank in encoder attention
  • Building a graph from topic modeling and using ILP for extractive summarization
  • GCN-based multi-document extractive summarization
  • STRUCTURED NEURAL SUMMARIZATION
Read more »

RL study notes, minimalist style; a minimal Q-learning sketch follows the list.

  • Q-learning
  • Sarsa
  • Sarsa(\(\lambda\))
  • DQN
  • Double DQN
  • DQN with Prioritized Experience Replay
  • Dueling DQN
  • Policy Gradient
Read more »
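For reference, a minimal tabular Q-learning sketch for the first item in the list. The environment interface is assumed to follow the classic Gym convention (reset() returns a state, step(a) returns state, reward, done, info), and the hyperparameters are arbitrary.

```python
# Minimal tabular Q-learning; the env interface and hyperparameters are assumptions.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = random.randrange(n_actions) if random.random() < eps \
                else max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done, _ = env.step(a)
            # Q-learning update: bootstrap from the greedy value of the next state
            target = r + gamma * max(Q[s2]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```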

Selected Reading of ACL/NAACL 2019 Automatic Summarization Papers

  • DPPs Similarity Measurement Improvement

  • STRASS: Backpropagation for Extractive Summarization

  • Translate first, then generate the summary

  • Reading Comprehension + Automatic Summarization

  • BiSET: Retrieve + Fast Rerank + Selective Encoding + Template Based

Read more »

Selected readings from ACL 2019 award-winning papers.

  • Using Oracle for sentence-level teacher forcing
  • Speaker commitment
  • An evaluation framework for summaries that combines multiple metrics
  • Zero-Shot Entity Linking
Read more »



  • Notes on the mathematical derivation of GloVe word vectors: the original paper does not derive the model from a network architecture but instead constructs the objective function through pure mathematical reasoning. This design approach is very interesting, and the post also writes out and compares the mathematical essence of word2vec; the resulting objective is reproduced after the reference below.
  • GloVe: Global Vectors for Word Representation
Read more »
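For reference, the weighted least-squares objective that the derivation arrives at, as given in the GloVe paper:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2, \qquad f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

where \(X_{ij}\) is the word–word co-occurrence count, \(w_i\) and \(\tilde{w}_j\) are the word and context vectors, and \(b_i\), \(\tilde{b}_j\) are their biases.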

  • Convolutional Sequence to Sequence

  • Robust Unsupervised Cross-Lingual Word Embedding Mapping

Read more »

Course notes on Computational Linguistics; reference textbook: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Read more »