Thinkwee's Blog

Too Stupid to Give Up Learning

Some very personal questions, assumptions, and predictions about the future beyond the large-model era. I hope to make it a habit to write such a forward-looking post every half year, to keep myself thinking about the "next token" of the AI era. This post covers Compression, World Model, Agent, and Alignment.

Read more »


  • A record of recent template-based task reformulation methods, a particularly interesting direction since the appearance of GPT-3. These methods design a prompt for each task, converting samples and tasks into natural-language templates that are fed directly into a pre-trained language model to generate text, thereby completing the task indirectly. Prompt construction unifies the form of the downstream task with that of the pre-training task (language modeling) and achieves good results in few-shot learning; a minimal cloze-prompting sketch follows the paper list below. Nine key papers to read:
    • Early work that converts questions into natural language and uses pre-trained language models for answers:
      • (Harvard) Commonsense Knowledge Mining from Pretrained Models
      • (Heidelberg) Argumentative Relation Classification as Plausibility Ranking
      • (NVIDIA) Zero-shot Text Classification With Generative Language Models
    • The PET approach, Pattern Exploiting Training:
      • (LMU) Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
      • (LMU) It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
      • (UNC) Improving and Simplifying Pattern Exploiting Training
    • Automatically constructing prompts, Automatically Searching Prompts:
      • (UCI, UCB) AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts
      • (Princeton, MIT) Making Pre-trained Language Models Better Few-shot Learners
      • (THU) GPT Understands, Too
        Read more »
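
A minimal sketch of the cloze-prompting idea shared by these papers, assuming a Hugging Face masked language model; the model name, pattern wording, and verbalizer words below are illustrative choices, not the exact ones from the papers.

# Sketch of PET-style cloze prompting: wrap the sample in a pattern with a mask slot,
# then let a verbalizer map labels to single words the language model can score.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Pattern: turn the raw sample into a natural-language template with a [MASK] slot.
review = "The plot is thin but the acting saves the movie."
prompt = f"{review} All in all, it was a {tokenizer.mask_token} film."

# Verbalizer: map each label to a word (illustrative choices).
verbalizer = {"positive": "great", "negative": "terrible"}

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # vocabulary scores at the mask slot

# Compare only the verbalizer tokens: the label whose word scores highest wins.
scores = {label: logits[0, tokenizer.convert_tokens_to_ids(word)].item()
          for label, word in verbalizer.items()}
print(max(scores, key=scores.get))

The point is that classification is reduced to comparing the language model's scores for the verbalizer words at the mask position, so no task-specific head needs to be trained from scratch.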


  • A record of edit-based seq2seq methods from recent years. For tasks where the input and output are in the same language and differ only slightly (error correction, simplification, summarization), they offer high efficiency (partially autoregressive or non-autoregressive decoding) and are less data-hungry (small output vocabulary); a toy tag-and-realize sketch follows the paper list below.
  • Mainly five papers, sorted by their arXiv publication date:
    • (LevT, Facebook) Levenshtein Transformer
    • (Huawei) EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
    • (LaserTagger, Google) Encode, Tag, Realize: High-Precision Text Editing
    • (PIE) Parallel Iterative Edit Models for Local Sequence Transduction
    • (Google) Felix: Flexible Text Editing Through Tagging and Insertion
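
As a toy illustration of the tag-then-realize idea behind LaserTagger, PIE, and Felix (not their actual implementations), the sketch below applies hypothetical KEEP/DELETE tags, optionally carrying an inserted phrase, to the source tokens:

# The model only predicts one edit tag per source token (plus a small phrase vocabulary
# for insertions); a deterministic "realize" step then applies the tags.

def realize(tokens, tags):
    """Apply (KEEP | DELETE | KEEP|phrase | DELETE|phrase) tags to the source tokens."""
    out = []
    for token, tag in zip(tokens, tags):
        action, _, phrase = tag.partition("|")
        if phrase:                 # e.g. "KEEP|the" inserts a phrase before the token
            out.append(phrase)
        if action == "KEEP":
            out.append(token)
        # DELETE: drop the token entirely
    return " ".join(out)

source = "the the cat sat on on mat".split()
tags   = ["KEEP", "DELETE", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP|the"]
print(realize(source, tags))       # -> "the cat sat on the mat"
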
Read more »

A brief review of the VC dimension. All discussions are based on the simple case of binary classification.
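
As a reminder of the standard setup (binary classification): a hypothesis set \(\mathcal{H}\) shatters a sample of \(d\) points if it can realize all \(2^d\) labelings of them, and

\[
d_{\mathrm{VC}}(\mathcal{H}) = \max\{\, d : \text{some set of } d \text{ points is shattered by } \mathcal{H} \,\}.
\]

For example, half-planes in \(\mathbb{R}^2\) shatter some three points in general position but no set of four, so their VC dimension is 3. By Sauer's lemma the growth function is then polynomial, \(m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i}\), which yields a VC generalization bound of the usual form (constants vary by textbook): with probability at least \(1-\delta\),

\[
E_{\text{out}}(h) \le E_{\text{in}}(h) + \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}.
\]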

Read more »

Notes from reading Dr. Bang Liu's University of Alberta thesis, Natural Language Processing and Text Mining with Graph-Structured Representations.

Read more »

Study notes for Stanford CS224W: Machine Learning with Graphs by Jure Leskovec.

Read more »

A brief note on the CLSciSumm Workshop that the CIST lab participated in. The focus here is on methods; the experiments are analyzed in detail in the papers. Papers:

Read more »

A record of how Fairseq performs incremental decoding at inference time for parallel-decoding models such as CNN seq2seq and the Transformer; a generic sketch of the idea follows.
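
The sketch below is a generic illustration of the caching idea, not Fairseq's actual implementation: each decoding step feeds only the newest target token and appends its keys/values to a cache, so past positions are never recomputed.

# Minimal cached self-attention for step-by-step decoding (assumed toy dimensions).
import torch

class CachedSelfAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)

    def forward(self, x_new, cache):
        # x_new: (batch, 1, dim) -- only the latest target token
        k_new, v_new = self.k(x_new), self.v(x_new)
        if "k" in cache:  # append to what earlier steps already computed
            cache["k"] = torch.cat([cache["k"], k_new], dim=1)
            cache["v"] = torch.cat([cache["v"], v_new], dim=1)
        else:
            cache["k"], cache["v"] = k_new, v_new
        attn = torch.softmax(self.q(x_new) @ cache["k"].transpose(1, 2)
                             / cache["k"].size(-1) ** 0.5, dim=-1)
        return attn @ cache["v"]  # (batch, 1, dim): output for the new position only

layer, cache = CachedSelfAttention(16), {}
for step in range(5):                  # one token per call, cached states are reused
    y = layer(torch.randn(2, 1, 16), cache)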

Read more »

Long time no see, SVM.

Read more »

Paper reading notes on:

  • GNN Pooling
  • Discourse-Aware Summarization
  • Siamese BERT
  • Large Chatbot
Read more »

  • Knowledge Graph Special Collection
    • Entity Alignment in Cross-lingual Knowledge Graphs
    • Knowledge Graph Language Model
    • Dynamic Knowledge Graph Dialogue Generation
    • Graph2Seq
    • Graph Matching Network
    • Dynamic Knowledge Graph Update
    • Attention-based Embeddings for Relation Prediction
Read more »

A record of some recent methods for heterogeneous information networks; a toy meta-path random-walk sketch follows the list.

  • PathSim
  • HGNN
  • HGAN
  • HGAN for text classification
  • Attribute, Attributed Multiplex Heterogeneous Network
  • Meta-graph Guided Random Walks
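
As a toy illustration of how meta-path guided walks constrain a random walk on a heterogeneous graph (the tiny author-paper graph and the APA-style schema below are made up for illustration):

# Walk only to neighbors whose node type matches the next slot of the meta-path.
import random

# adjacency lists keyed by node; every node name is prefixed with its type
graph = {
    "A:alice": ["P:gnn"], "A:bob": ["P:gnn", "P:hin"], "A:carol": ["P:hin"],
    "P:gnn": ["A:alice", "A:bob"], "P:hin": ["A:bob", "A:carol"],
}

def meta_path_walk(start, meta_path, length):
    """Walk up to `length` nodes, moving only to neighbors of the required type."""
    walk, schema = [start], meta_path * length
    for next_type in schema[1:length]:
        candidates = [n for n in graph[walk[-1]] if n.startswith(next_type)]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

print(meta_path_walk("A:alice", ["A:", "P:"], 5))  # e.g. ['A:alice', 'P:gnn', 'A:bob', ...]
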
Read more »

Selected readings of papers on graph-based automatic summarization; a minimal TextRank-style sketch follows the list.

  • AMR-based abstractive summarization
  • Two papers on AMR multi-document summarization
  • PageRank in encoder attention
  • Building a graph from topic modeling and using ILP for extractive summarization
  • GCN-based multi-document extractive summarization
  • STRUCTURED NEURAL SUMMARIZATION
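
For reference, a minimal TextRank-style sketch of the generic graph-based recipe (sentences as nodes, similarity as edge weights, PageRank as importance). This illustrates the common idea rather than the method of any particular paper above, and it assumes networkx is available.

# Build a sentence graph weighted by word overlap and rank sentences with PageRank.
import networkx as nx

sentences = [
    "Graph based summarizers build a sentence graph .",
    "Edges encode similarity between sentences .",
    "PageRank then scores each sentence node .",
    "Top ranked sentences form the extractive summary .",
]

def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

g = nx.Graph()
for i, si in enumerate(sentences):
    for j in range(i + 1, len(sentences)):
        w = overlap(si, sentences[j])
        if w > 0:
            g.add_edge(i, j, weight=w)

scores = nx.pagerank(g, weight="weight")          # stationary importance of each sentence
summary = [sentences[i] for i in sorted(scores, key=scores.get, reverse=True)[:2]]
print(summary)
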
Read more »

RL study notes, minimalist style; a tabular Q-learning sketch follows the list.

  • Q-learning
  • Sarsa
  • Sarsa(\(\lambda\))
  • DQN
  • Double DQN
  • DQN with Prioritized Experience replay
  • Dueling DQN
  • Policy Gradient
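
A minimalist sketch of the tabular Q-learning update on a made-up five-state chain environment (the environment and hyperparameters are only for illustration):

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
import random
from collections import defaultdict

alpha, gamma, n_states = 0.1, 0.9, 5
Q = defaultdict(float)                      # Q[(state, action)]; actions: 0 = left, 1 = right

def step(s, a):                             # deterministic chain: reach the rightmost state for reward 1
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

for _ in range(300):                        # behave uniformly at random: Q-learning is
    s, done = 0, False                      # off-policy, so it still learns the greedy values
    while not done:
        a = random.choice([0, 1])
        s2, r, done = step(s, a)
        target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])     # TD update toward the bootstrapped target
        s = s2

print(max((0, 1), key=lambda a: Q[(0, a)]))  # greedy action at the start state -> 1 (move right)
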
Read more »

Selected Reading of ACL/NAACL 2019 Automatic Summarization Papers

  • DPPs Similarity Measurement Improvement

  • STRASS: Backpropagation for Extractive Summarization

  • Translate first, then summarize

  • Reading Comprehension + Automatic Summarization

  • BiSET: Retrieve + Fast Rerank + Selective Encoding + Template Based

Read more »

Selected readings from ACL 2019 award-winning papers.

  • Using Oracle for sentence-level teacher forcing
  • Speaker commitment
  • An evaluation framework for summaries that combines multiple metrics
  • Zero-Shot Entity Linking
Read more »


  • A record of the mathematical derivation of GloVe word vectors: the original paper does not build the model from an architecture diagram but instead derives the objective function through pure mathematical reasoning, which is a very interesting design approach. The post also writes out the mathematical essence of word2vec for comparison; the final GloVe objective is reproduced after the list below.
  • GloVe: Global Vectors for Word Representation
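
For reference, the final weighted least-squares objective from the GloVe paper, which the post derives step by step:

\[
J = \sum_{i,j=1}^{V} f(X_{ij})\left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise}, \end{cases}
\]

where \(X_{ij}\) is the word-word co-occurrence count, \(w\) and \(\tilde{w}\) are the two sets of word vectors, and \(b\), \(\tilde{b}\) are their bias terms.
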
Read more »

  • Convolutional Sequence to Sequence

  • Robust Unsupervised Cross-Lingual Word Embedding Mapping

Read more »

Course Notes on Computational Linguistics, Reference Textbook: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Read more »

Notes on John Mount's "The Equivalence of Logistic Regression and Maximum Entropy Models", explaining that this proof is a special case of the general derivation of the maximum entropy model introduced in Statistical Learning Methods; the key equations are restated after the conclusions below.

Conclusion

  • The maximum entropy model is softmax classification
  • Under the balance conditions of the generalized linear model, the mapping function that satisfies the maximum entropy condition is the softmax function
  • Statistical Learning Methods presents a maximum entropy model defined over feature functions, which, together with softmax regression, belongs to the family of log-linear models
  • When the feature function is extended from a binary indicator to the feature value itself, the maximum entropy model becomes softmax regression
  • Maximizing entropy here means maximizing the conditional entropy \(H(Y \mid X)\), not the entropy of the conditional distribution for a single \(x\), nor the entropy of the joint distribution
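
For reference, the maximum entropy model's conditional distribution (in notation close to Statistical Learning Methods) and its softmax special case:

\[
P_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_{\lambda}(x) = \sum_{y} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big).
\]

Taking the feature to be the feature value itself, \(f_{k,j}(x, y) = x_j \,\mathbb{1}[y = k]\), collapses the inner sum to \(w_k^{\top} x\) and gives \(P(y = k \mid x) = \exp(w_k^{\top} x) / \sum_{k'} \exp(w_{k'}^{\top} x)\), i.e. softmax regression.
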
Read more »

Attended a light salon at Tsinghua University's FIT building, which introduced some recent advances in machine reading comprehension. Interestingly, the PhD student who spoke at 9 a.m. also mentioned a then-unpublished work: BERT, which is very impressive and very well funded; reportedly it would take eight P100 GPUs a year to train. By 10:30, Machine Intelligence had already published a report, and by the afternoon Zhihu was buzzing with discussion, saying that a new era for NLP had arrived... This salon is part of a series, and there may be future sessions on machine translation, deep Bayesian methods, transfer learning, and knowledge graphs, so if you have the time, you might as well attend and take notes.

Read more »