Outstanding Papers Reading (ACL 2019)

Posted on 2019-07-28 Edited on 2025-07-16 In NLP Views: Word count in article: 29k Reading time ≈ 26 mins.

Selected readings from ACL 2019 award-winning papers.

Using Oracle for sentence-level teacher forcing
speaker commitment
A set of evaluation index frameworks applicable to abstracts, combining multiple indicators
Zero-Shot Entity Linking

Bridging the Gap between Training and Inference for Neural Machine Translation

Background

The best long papers, this direction is very attractive, it is very common, everyone knows but chooses to ignore, or cannot find an elegant and effective solution.
Attempts to address all the issues encountered by seq2seq, namely the inconsistency between training and inference, i.e., exposure bias.
exposure bias is the bias produced during decoding. Normally, we generate a sentence from left to right, character by character, how so? The model generates a character, and then this character is input into the decoder to decode the next character, that is, the preceding text used to decode each character is the previously decoded sentence fragment. However, this training converges very slowly and is prone to cumulative errors. Think about it, the model is already difficult to generate the correct character at the beginning, and now it has to generate the next character based on this incorrect character, which is adding insult to injury. Therefore, during general training, it is necessary to use the teacher forcing method: forcing the model to generate each character based on the correct preceding text in the training data, that is, regardless of the characters already generated, only generating the correct character based on the premise of correctness. However, this technique can only be used for training; during testing, there is no ground truth for teacher forcing.
This issue is neither particularly large nor particularly small; it has also been encountered in previous summarization tasks, leading to good training responses but poor testing performance or inexplicable biases. Today, seq2seq models have made significant progress in the encoding end, with feature extractors such as CNN and Transformer having moved beyond unidirectional extraction methods. However, regardless of the model, at the decoding end, they must generate from left to right in a straightforward manner, and exposure bias cannot be avoided.
For translation, exposure bias also packages another issue that affects the quality of translation: the cross-entropy loss calculated word by word. The model needs to learn to generate the correct word at the correct position, and this dual correctness standard is too stringent for translation, making it difficult for the model to learn flexible translation relationships, i.e., over correction.
What are the existing methods for solving exposure bias and word-level CrossEntropy Loss?
- In generating words, sometimes we use ground truth, sometimes our own predicted output, and sample a moderate amount, i.e., scheduled sampling
- Using pre-trained models, performing Masked Seq2seq pretraining
- Utilizing sentence-level loss functions, the goal is to achieve the highest score for the entire sentence, rather than greedy optimization on a word-by-word basis, which includes various optimization criteria and reinforcement learning methods, such as mixed incremental cross-entropy reinforcement
- Among them, the pre-trained method is a relatively new approach, while the other two types of methods were proposed as early as 2015, and the authors also compared their own method with theirs

Methods

This paper aims to address the above two issues, and at first glance, the approach is still the same: by sampling from the ground truth and predicted results to mitigate bias, and by using sentence-level optimization metrics to relax the constraints on loss.
How to sample specifically? The method provided by the authors is shown in the figure below (isn’t this the figure for scheduled sampling…):
- First select the oracle word, i.e., the word predicted by the model: Note that the word predicted by the model here is not very accurate, as the predicted word is deterministic, obtained by taking the maximum of the dictionary probability distribution decoded by the decoder (excluding beam search). However, the oracle here should be expressed as “not ground truth,” i.e., not the true word. If we directly use the predicted word, we will make mistakes on top of mistakes; if we use the ground truth, there will be exposure bias. Therefore, the author took a compromise, different from the previous probabilistic compromise (which may take the predicted word or the ground truth), and also optimized the word selection, not simply taking the predicted word as the oracle. Specifically:
  - If the word with the highest predicted probability by the decoder is directly taken as the Oracle, that is ordinary scheduled sampling.
  - However, the author adjusts the predictive probability distribution using the Gumbel-Max regularization method, introducing two parameters: one calculated from a uniform distribution variable $u$ as Gumbel noise $\eta$ ; and one temperature variable $\gamma$ . Assuming the original probability distribution is $o$ , the adjusted probability distribution $P$ is
    $\eta = - \log ( - \log u) \\ \overline{o} _{j-1} = (o_{j-1} + \eta) / \gamma \\ \overline{P} _{j-1} = softmax(\overline{o} _{j-1}) \\ y_{j-1}^{\text {oracle }}=y_{j-1}^{\mathrm{WO}}=\operatorname{argmax}\left(\tilde{P}_{j-1}\right) \\$
  - The process of adding noise only affects the selection of the oracle and not the model’s loss. The operation of adding Gumbel noise makes the argmax operation equivalent to a sampling operation based on the probabilities of softmax, making the probability distribution obtained by softmax meaningful rather than simply taking the maximum. Here, only Gumbel-Max is used (the softmax in the formula is actually not necessary). Another more common application of Gumbel is Gumbel-Softmax, which is used to achieve reparameterization when the distribution of the assumed latent variable is a categorical distribution. Compared to the ordinary softmax, Gumbel-Softmax’s effect is equivalent to calculating a series of samples using softmax, which are sampled probabilistically according to the original softmax probabilities.
- This is a word-level oracle selection, and it can also be done at the sentence level; the specific method is
  - Firstly, using a word-level method, combined with beam search, several candidate sentences are selected
  - Select the best sentence through BLEU, ROUGE, and other metrics, and take each word of this sentence as an oracle
  - There is an obvious issue here, which is to ensure that the oracle sentences generated by beam search are of the same length as the ground truth sentences. The authors introduce force decoding, where if the decoded sentence is still shorter than the ground truth length and an EOS is decoded, the EOS is excluded, and the beam search is performed on the top k words with the highest probabilities; if the length is already sufficient but EOS has not been decoded, the decoding is forced to EOS and terminated
- Re-calculate the probability to decide whether to use oracle or ground truth: Like scheduled sampling, it also involves setting a dynamic sampling probability. Initially, during training, more ground truth is used, and then the proportion of oracle is gradually increased. The probability setting given by the authors is:
  $p = \frac{\mu}{\mu + exp(e / \mu)}$
The results are undoubtedly better than those of naive RNN and Transformer, with a 2-point improvement in BLEU. The authors also conducted a large number of experiments to test the impact of hyperparameters. It’s simple and effective, especially the method of introducing sentence-level optimization is straightforward, much more intuitive than a bunch of changes to the objective functions.

Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment

Best short paper, studying a very interesting direction: speaker commitment, also known as event fact.
Speaker’s commitment refers to determining whether an event has occurred through the speaker’s description, specifically divided into three categories: factual, unfactual, and uncertain. The model needs to extract the factual status of the event from the speaker’s description. Traditional methods focus on modal verbs and verb phrases, but the author introduces the CommitmentBank dataset to test various existing models, indicating that existing datasets cannot capture the lexical and syntactic diversity of natural language, especially in spoken language, and finds that models incorporating linguistic knowledge are superior to LSTM, setting another goal for deep learning to conquer.
For example, to illustrate the issue of speaker commitment, consider the following two statements: “I never believed I would study NLP,” and “I do not believe I can study NLP.” Both sentences have “believe” as the verb and both contain the negative words “never” and “not.” The event in question is “I study NLP,” and whether this event has occurred. Clearly, the former suggests that the event has already happened, while the latter suggests that it has not yet occurred. There are also more complex scenarios, such as given the statements of two debaters, guessing whether a certain fact discussed by them has occurred. Generally, each sample would also have context, and the speaker commitment task is to provide context, speaker expression, and an event, and to judge whether the event is a fact.
Authors tested two models on the CommitmentBank dataset: rule-based and neural network-based
- Rule-Based: Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets. Linguistic knowledge is applied by manually assigning factual scores to various predicate words/phrases, identifying the hidden signature of the predicate, and connecting adjectives and modal verbs based on syntactic tree analysis to enhance or reverse the scores. Finally, the scores from various human knowledge bases and syntactic structures are input as features into the SVM regression model to calculate the scores.
- Based on Neural Networks: Neural Models of Factuality. Sentence modeling is performed using multi-layer bidirectional LSTM and tree-LSTM, followed by a multi-layer MLP to calculate regression scores. The authors tested three models: bidirectional, tree, and ensemble.
The main part of the article is in the results analysis, with rich data presentation. However, the authors do not provide excessive cause analysis; they merely state which types of facts, states, corpora, and modalities result in better performance for which types of models. Perhaps, since I do not work in this field, I do not feel that there are any research points that can be 挖掘 from these conclusions. In the end, the overall conclusion is drawn that human knowledge has stronger generalization ability, and deep models need to integrate human knowledge; the conclusion is somewhat broad.
This paper won an award, indicating that the academic community still values diversity in NLP research. Challenging tasks like this one are not undertaken by many, but once completed, they can greatly enhance downstream tasks such as information extraction and dialogue.

A Simple Theoretical Model of Importance for Summarization

One of the outstanding papers, simply because I also do summarization, I picked it out to read. The author presents a simple theoretical model for the quantitative analysis of the importance of abstracts, which had no direct, explicit definition before. The author integrates semantic knowledge into the concept of information entropy, proposes semantic units, and generalizes the three major concepts that have always been used in summarization: redundancy, relevance, and informativeness (Redundancy, Relevance, and Informativeness), unifying these three concepts under the category of importance. The author also points out that the importance indicator highly aligns with human judgments, unlike previous automatic measurement indicators that are difficult to ensure the quality of abstracts.
Firstly, it must be said that the relevant work in the paper is very thorough, extending from the 1950s to the present, weaving together several threads, and the reference list is well worth reading.

Definition

Semantic unit: the atomic unit of information, the set of semantic units is denoted as $\Omega$ , a document can be expressed as a probability distribution over the set of semantic units. Semantic units are applicable to many frameworks, such as frames, for example, topic models, and for example, embeddings commonly used in deep learning. All semantic units share a unified feature: they are discrete and independent, and the meaning of language is based on these semantic units. We mark documents and abstracts as $D$ and $S$ , respectively, and the corresponding probability distributions over the semantic units are denoted as $P_D, P_S$ .
entropy: Entropy can be calculated with the concept distribution: $H = - \sum _{w} P(w) \log (P(w))$
Redundancy (Redundancy): Redundancy is defined as the difference between maximum entropy and entropy:
$Red(S) = H_{max} - H(S)$
The maximum entropy is achieved at the uniform distribution. In fact, it is the conversion of the entropy, which measures uncertainty, into the redundancy, which measures determinacy. The abstract should have low redundancy, i.e., a small entropy, otherwise, the information obtained in the document collection is largely repetitive, and does not lead to a reduction in the abstract entropy. Since the maximum entropy is fixed for a given corpus, it can be abbreviated as $Red(S) = -H(S)$
Relevance: The author defines relevance as follows: When we observe an abstract to infer the information of the original text, the difference (loss) from the true information of the original text should be minimized. Therefore, we define relevance as the opposite of this loss. The simplest definition of loss is the cross-entropy between the semantic unit distributions of the document and the abstract:
$Rel(S,D) = - CrossEntrophy(S,D) \\ = \sum _{w_i} P_S(w_i) \log (P_D(w_i)) \\$
At the same time, we note that:
$KL(S||D) = Red(S) - Rel(S,D)$
Low redundancy and high relevance abstracts result in the minimum KL divergence between the abstract and the original text.
Informativeness: We define the informativeness of an abstract as the ability to alter one’s common sense or knowledge. The author introduces background knowledge $K$ and its probability distribution $P_K$ , and defines informativeness as
$Inf(S,K) = CrossEntrophy(S,K)$
High informativeness should be able to bring information that is not present in the background knowledge. Next is how to define background knowledge:
- Background knowledge should allocate known semantic units with a high probability, representing that these semantic units have a high intensity in the user’s memory
- Generally speaking, background knowledge can be set to none, i.e., uniformly distributed, but background knowledge provides Summarization with a controllable choice, that is, users can specify queries indicating the semantic units they are interested in, and then the background knowledge should assign low probabilities to these semantic units.
- In multi-document summarization, background knowledge can be simplified to documents that have already generated summaries
Next, we can define importance to integrate the above three indicators: Importance should measure the importance of semantic units; we want to retain only relatively important semantic units in the abstract, which means we need to find a probability distribution that unifies the document and background knowledge, and encode the expected semantic units that need to be retained in the abstract

Importance

Should be able to extract the useful parts from the information in document $D$ for users with background knowledge $K$ , we define
- Semantic unit $d_i = P_D(w_i)$ probability in the document $w_i$
- Semantic unit $k_i = P_K(w_i)$ probability in background knowledge $w_i$
- Function for encoding the importance of semantic units, which should satisfy:
  - Informational: $\forall i \not= j \ \text{if} \ d_i=d_j \ \text{and} \ k_i > k_j \ \text{then} \ f(d_i,k_i) < f(d_j,k_j)$
  - Relevance: $\forall i \not= j \ \text{if} \ d_i>d_j \ \text{and} \ k_i = k_j \ \text{then} \ f(d_i,k_i) > f(d_j,k_j)$
  - Additivity: $I(f(d_i,k_i)) \equiv \alpha I(d_i) + \beta I(k_i)$
  - Normality: $\sum _i f(d_i,k_i) = 1$
- The formulaic expression of the four properties is simple and easy to understand, where $I$ represents self-information. The first two properties describe semantic units that we want to be related to the document and that can bring new knowledge. Additivity ensures consistency with the definition of self-information, while normalization guarantees that this function is a probability distribution.
The importance coding function that satisfies the above properties is:
$P_{\frac DK}(w_i) = \frac 1C \frac {d_i^{\alpha}}{k_i^{\beta}} \\ C = \sum _i \frac {d_i^{\alpha}}{k_i^{\beta}}, \alpha, \beta \in \mathbb{R} ^{+} \\$
$\alpha$ and $\beta$ represent the intensity of relevance and informativeness
Based on the definition of importance, we can identify the criteria that the best abstract should meet:
$S^* = \text{argmax}_S \theta _I = \text{argmin} _S KL(S || P_{\frac DK})$
Therefore, we take $\theta _I$ as a measure of the quality of abstracts:
$\theta _I (S,D,K) = -KL(P_S||P_{\frac DK})$
Entropy of importance probability can measure the number of potential good abstract candidates
The measurement indicator $\theta _I$ can actually be divided into the three indicators mentioned earlier:
$\theta _I (S,D,K) \equiv -Red(S) + \alpha Rel(S,D) + \beta Inf(S,K)$

Results

Authors use the simplest words as semantic units, employ word frequency normalization as a probability distribution, and set both hyperparameters $\alpha$ and $\beta$ to 1. For incremental summarization, the background knowledge is the document that has already been summarized, while for general summarization, the background knowledge is set to none, i.e., uniformly distributed
The results show that the importance measurement indicators are closer to human judgments than traditional indicators and are more discriminative.
The author of this paper proposes only a framework, with background knowledge and the definition of semantic units being flexible according to the task and model. The evaluation issue of abstracts has always lacked good indicators, and this paper also tackles this tough problem, offering a simple and effective method.

Zero-Shot Entity Linking by Reading Entity Descriptions

Task Description

Outstanding paper, which investigates zero-shot learning in entity linking and proposes a domain-adaptive pre-training strategy to address the domain bias problem existing when linking unknown entities in new domains.
Entity linking task refers to given a query containing the entities to be linked, as well as a series of candidate entity descriptions, the model needs to establish correct entity linking and eliminate ambiguity.
The author provided an interesting example, in the game The Elder Scrolls, the description of the query is “The Burden spell is the opposite of Feather, increasing a character’s encumbrance……Clearly, here the ‘Burden’ describes the name of the spell. There are Burdens in the candidate entities as spell names, as spell effects, and of course, as other interpretations in the conventional dictionary. The model needs to link the ‘Burden’ in the query to the ‘Burden’ as a spell name. For such specific noun tasks, it is relatively simple; the difficulty lies in linking various pronouns, such as ‘he’ and ‘this person’, to specific individuals. Entity linking tasks are closely related to reading comprehension tasks.
Zeroth-order learning refers to the scenario where the training set is only trained on the domain dataset of The Elder Scrolls games, yet it is required to correctly predict test sets from other domains, such as the Lego game dataset and the Coronation Street TV series dataset.
This requires the model to achieve natural language understanding rather than simple domain-specific pattern memorization.
In the zero-shot learning entity linking task, there is no alias table or frequency prior to refer to; the model needs to read the description of each candidate entity and establish a correspondence with the context.
General entity linking tasks involve the following assumptions:
- Single entity set: Training and testing are performed on the same entity set
- Alias Table: For each query entity, there is a candidate entity table, or what is referred to as the alias table of the query entity, which does not require manual search
- Frequency statistical information: Information obtained from the statistics of a large annotated corpus, which can be used to estimate the popularity of entities and the probability of a text linking to an entity, can serve as an important prior knowledge supplement to the model
- Structured Data: Some systems provide relational tuples to assist models in disambiguation
However, zero-shot learning abandons all the above assumptions, assuming only the existence of an entity dictionary, that is, all entities have at least a corresponding description, reducing the anthropomorphic assumptions in the entity linking task to the minimum, which can be said to be the most difficult and extreme case. The task is obviously divided into two parts:
- For each query entity, find the candidate linked entity set
- Rank the candidate link entity set

Two-step approach

Candidate set generation adopted a simple and quick approach: all candidates were found using an information retrieval method. The authors used BM25 to measure the similarity between queries and documents, identifying the top-64 most similar documents as the candidate set.
The subsequent ranking task is similar to reading comprehension or natural language inference, and the authors used a transformer-based model as a strong baseline.
- Formal definition should be called Mention rather than query, referring to the context where the entity to be linked exists, denoted as $m$ ; while the description of the candidate entities is denoted as $e$
- Input $m$ and $e$ as sentence pairs to the BERT model, $m$ also adds additional embeddings to distinguish from $e$
- BERT encodes the sentence pairs, then computes the dot product between the encoded vectors and the word vectors of the entities to obtain scores
- To demonstrate the importance of the self-attention in the joint training of $m$ and $e$ , the authors also conducted two comparative naive BERT models with controlled variables, but that is not worth mentioning here, as the importance of self-attention is already a common knowledge and does not require further emphasis.
The above baseline is actually quite strong, because after pre-training, BERT has gained some ability for domain transfer, as can be seen from the results. The average accuracy of pre-trained and non-pretrained BERT differs by a factor of three, and the difference between using src, tgt, or both for pre-trained BERT is not significant, but it is much higher than traditional methods.

Zeroth-order learning

Next is the author’s proposed zero-shot learning method, which mainly still utilizes pre-training; there are two types of traditional pre-training transfer:
- Task-adaptive pretraining: Pretrain on unsupervised corpus of src and tgt, and fine-tune on supervised corpus of src
- Open Corpus Pretraining: This is like BERT, which pretrains on a large-scale unsupervised corpus regardless of src and tgt, and then fine-tunes on the supervised corpus of src
Authors propose domain adaptation: that is, to insert a pre-training process that is only on the tgt corpus after the above two pre-trainings, for the reason that the expression capacity of the model is limited, and the representation in the tgt domain should be optimized first

Results

The results are, of course, that the field adaptation effect proposed by the author is somewhat better, but the difference is not significant, at most 1 to 2 points, and the method proposed is not particularly new; it merely adds an additional pre-training process by changing the corpus. The entire paper seems to have been infused with a new field using BERT, just like the training guide for pre-trained models. Perhaps the key contribution is also the proposal of a dataset for a zero-shot learning entity linking task.

Bridging the Gap between Training and Inference for Neural Machine Translation

Background

最佳长论文，这个方向就很吸引人，属于很常见，大家都知道但都选择无视，或者找不出优雅有效解法的问题。
本文试图解决所有seq2seq都会遇到的问题，训练与推理的不一致，即exposure bias。
exposure bias是解码时产生的偏差。正常来讲，我们生成一句话，是从左往右逐字生成，怎么个逐字？模型生成一个字，然后这个字接着输入解码器解码出下一个字，也就是解码出每一个字时使用的上文是之前解码出的句子片段。但是这样训练收敛很慢，容易导致错误的累积。想想模型一开始本来就难以生成正确的字，现在还要基于这个错误的字生成接下来的字，那就是错上加错了。因此一般训练时，都需要使用teacher forcing的方法：forcing模型在生成每一个字的时候，依靠的是训练数据中正确的上文，也就是不管已经生成的字，只管前提正确的情况下去生成正确的字。但是这种技巧只能用于训练，测试的时候没有ground truth来teacher forcing。
这个问题说大不大，说小不小，之前做summarization也会遇到这个问题，导致训练的反应很好，但是测试效果差，或者出现难以解释的偏差。如今的seq2seq在编码端已经取得了长足的进步，CNN和Transformer等特征抽取器已经摆脱了单向的抽取方式，但是无论什么模型，在解码端，都得老老实实从左往右生成，都避免不了exposure bias。
对于翻译，exposure bias还和另一个问题打包影响了翻译的质量：逐字计算的交叉熵损失。模型需要学习到在正确的位置生成正确的词，这个双重正确的标准对于翻译来说太过苛刻，模型难以学到灵活的翻译关系，也就是over correction.
现有的解决exposure bias以及word-level CrossEntrophy Loss的方法有哪些？
- 在生成词的时候，有时用ground truth，有时用自己的预测的输出，采样中庸一下，即scheduled sampling
- 使用预训练模型，做Masked Seq2seq pretraining
- 使用句子级别的损失函数，目标是整个句子的分数最高，而不是逐字贪心，这里包括了各种各样的优化指标以及强化学习的方法，例如mixed incremental cross-entrophy reinforce
- 其中预训练是比较新的方法，其余两类方法早在2015年就已经提出，作者也把自己的方法与他们的方法做了对比

Methods

本文想要解决以上两个问题，粗看思路还是和以前一样：通过从ground truth 和 predicted results中采样来中和偏差，以及使用句子级别的优化指标来放宽损失的约束。
具体怎么采样？作者给出的方法如下图（这不就是scheduled sampling的图吗。。。。）：
- 先选出oracle word，即模型预测的词：注意，这里用模型预测的词其实不太准确，因为模型预测的词是确定的，是decoder解码出词典概率分布取最大得到的（不考虑束搜索的话），然而这里的oracle应该表述为not ground truth，即非真实词。假如我们直接用预测的词，那就会错上加错；假如我们用ground truth，那就会有exposure bias。因此作者取了个折中，不同于之前概率上的折中（可能取预测词可能取ground truth），还做了选词上的优化，不是简单的拿预测出的词作为oracle，具体而言：
  - 假如直接取decoder预测概率最大的词作为Oracle,那就是普通的scheduled sampling。
  - 然而作者使用Gumbel-Max正则化方法对预测概率分布调整，引入两个参数：一个由01均匀分布变量$u$计算得来的Gumbel noise $\eta$；以及一个温度变量$\gamma$。假设原始概率分布为$o$，则调整后的概率分布$P$为
    $\eta = - \log ( - \log u) \\ \overline{o} _{j-1} = (o_{j-1} + \eta) / \gamma \\ \overline{P} _{j-1} = softmax(\overline{o} _{j-1}) \\ y_{j-1}^{\text {oracle }}=y_{j-1}^{\mathrm{WO}}=\operatorname{argmax}\left(\tilde{P}_{j-1}\right) \\$
  - 这个加入噪音的过程只影响选择 oracle，而不影响模型的损失。增加Gumbel noise的操作可以使得argmax操作等效于依据softmax的概率进行采样操作，使得softmax得到的概率分布有意义，而不是单纯取最大。这里只是用了Gumbel-Max（式子里那个softmax其实不需要）。Gumbel的另一个更为常见的应用是Gumbel-Softmax，用于在假设隐变量分布为category distribution时实现重参数化(reparameterization),相比普通的softmax，Gumbel-Softmax的效果等价于用softmax计算出了一系列样本，这些样本是按照原始softmax概率依概率采样得到。
- 这是单词级别的oracle选择，还可以做句子级别的选择，具体做法是
  - 先用单词级别的方法，加上beam search，选出几个候选句
  - 通过BLEU，ROUGE等指标选出最好的句子，将这个句子的每一个词作为oracle
  - 显然这里有一个问题，就是得保证beam search出的oracle句子和ground truth的句子长度一致，作者引入了force decoding，当解码出的句子还不够ground truth长度时，假如解码出了EOS，就排除EOS，取剩下的概率最大前k个单词做beam search；假如长度已经够了，但是还没解码出EOS，就强制设置为EOS并结束解码
- 再计算概率，决定是用oracle还是ground truth：和scheduled sampling一样，也是要设置动态采样概率，刚开始训练的时候多用ground truth，然后慢慢提高oracle的比例，作者给出的概率设置为：
  $p = \frac{\mu}{\mu + exp(e / \mu)}$
结果当然是比naive RNN and Transformer要好，BLEU能有2个点的提升。作者也做了大量实验来测试超参数的影响。很简单很work，尤其是引入句子层级优化的方法简单明了，比一堆目标函数的改动要直观的多。

Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment

最佳短论文，研究了一个非常有意思的方向：speaker commitment，叫说话人承诺，或者叫事件事实。
说话人承诺是指，通过说话人的描述，来判断某一事件是否发生，具体而言分三类：事实、非事实、不确定。模型需要从说话人的描述当中挖掘出事件的事实状态。传统的方法关注情态动词、动词短语，但作者引入了CommitmentBank数据集来测试各种已有模型，说明已有的数据集不能捕捉自然语言，尤其是口语当中的词法和句法多样性，且发现引入语言学知识的模型要优于LSTM，为深度学习树立了另一个有待攻克的目标。
举个例子来形象说明一下说话人承诺问题，“我从没相信我会研究NLP”，“我不相信我可以研究NLP”，两句话都有“相信”作为动词，且都具有否定词“从没”、“不”，那么事件是“我研究NLP”，这个事件究竟有没有发生？显然前者倾向于事件已经发生，而后者倾向于事件还未发生。还有更复杂的情形，例如给定辩论双方的陈述，猜测双方讨论的某一事实是否发生。一般而言每一条样本还会有上下文，说话人承诺任务就是给定上下文、说话人表述和事件，判断事件是否是事实。
作者在CommitmentBank数据集上测试了两个模型：基于规则的和基于神经网络的
- 基于规则：Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets。基于语言学的知识即人为给各种谓语词语/短语打上事实分数，找到谓语的隐藏签名，并根据句法树剖析来联系上形容词和情态动词，进行分数的增强或者反转，最后将各种人类知识库得分和句法结构作为特征输入SVM回归模型，计算出分数
- 基于神经网络：Neural models of factuality。使用多层双向LSTM和tree-LSTM对句子建模，然后过一个多层MLP计算出回归分数。作者测试了双向、树、集成三种模型。
文章的主要部分在结果分析，数据展示很丰富，但是作者也没有给出过多的原因分析，只是在陈述哪类事实、哪类状态、哪类语料、哪类情态下哪类模型表现更好。可能是我不做这方面工作，没有感受到从这些结论里能有哪些可以挖掘的研究点。最后得出总的结论，人类知识具有更强的泛化能力，深度模型需要整合人类知识，结论有点宽泛。
这篇论文得了奖，表明学界还是希望NLP研究具有多样性，像这样具有挑战性的任务并不会有太多人做，但做好之后能给下游任务例如信息抽取、对话以极大的提升。

A Simple Theoretical Model of Importance for Summarization

杰出论文之一，单纯是因为我也做summarization才拎出来看。作者给出了一种简单的理论模型来定量分析文摘的重要性，在此之前重要性都没有直接的、显示的定义出来。作者将语义知识融入信息熵的概念，提出了语义单元，并泛化了之前summarization一直用的三大概念：冗余度、相关性和信息性（Redundancy, Relevance and Informativeness），将这三个概念统一于重要性之下，作者还指出重要性指标与人类直接高度吻合，而不像以前的自动衡量指标一样难以保证文摘质量，
首先得说论文的相关工作做的很足，从上世纪50年代一直做到现在，串起了几条线，参考文献列表都值得一读。

定义

语义单元：信息的原子单位，语义单元的集合记为$\Omega$，一篇文档可以表述为在语义单元集合上的概率分布。语义单元适用于许多框架，例如frame，例如主题模型，例如深度学习常用的embedding。所有的语义单元形式都具有统一的一个特征：他们离散且独立，语言的意义基于这些语义单元产生。我们把文档和文摘标记为$D$和$S$，对应的在语义单元上的概率分布是$P_D, P_S$。
熵：有了概念分布就可以计算熵:$H = - \sum _{w} P(w) \log (P(w))$
冗余度（Redundancy）：冗余度定义为最大熵与熵之差：
$Red(S) = H_{max} - H(S)$
最大熵在均匀分布取到。实际上就是将衡量不确定性的熵转成衡量确定性的冗余度。文摘应该具有低冗余度，即熵小，否则获取的信息在文档集中大量重复，并不能带来文摘熵的减少。由于对于给定语料，最大熵是固定的，因此可以简写为$Red(S) = -H(S)$
相关性（Relevance）：作者如此定义相关：当我们观察文摘来推断原文的信息时，与原文真实的信息之差（损失）应该最小。既然如此我们就用这个损失的相反数来定义相关性。损失的最简单定义就是文档和文摘的语义单元分布之间的交叉熵：
$Rel(S,D) = - CrossEntrophy(S,D) \\ = \sum _{w_i} P_S(w_i) \log (P_D(w_i)) \\$
同时我们注意到：
$KL(S||D) = Red(S) - Rel(S,D)$
低冗余，高相关的文摘，所带来的文摘与原文之间的KL散度最小。
信息性（Informativeness）：我们定义文摘的信息性为，能够改变人的常识或者知识。作者引入了背景知识$K$以及其概率分布$P_K$，并定义信息性为
$Inf(S,K) = CrossEntrophy(S,K)$
即高信息性应该能够带来背景知识里没有的信息。接下来就是如何定义背景知识：
- 背景知识应该分配已知的语义单元以高概率，代表这些语义单元在用户记忆中强度很高
- 一般来讲背景知识可以设为无，即均匀分布，但是背景知识给了Summarization一种可控的选择，即用户可以给出查询表明他们感兴趣的语义单元，那么背景知识就应该给这些语义单元低概率。
- 在多文档摘要中，背景知识可以简化为已经生成摘要的文档
接下来就可以定义重要性来整合以上三种指标：重要性应该是衡量语义单元的重要性，我们想在文摘中只保留相对重要的语义单元，这意味着我们需要找一个概率分布统一文档和背景知识，编码需要保留在文摘中的语义单元的期望

重要性

摘要$S$应该能够从文档$D$的信息里提取出对拥有背景知识$K$的用户有用的部分，我们定义
- $d_i = P_D(w_i)$：语义单元$w_i$在文档中的概率
- $k_i = P_K(w_i)$：语义单元$w_i$在背景知识中的概率
- $f(d_i,k_i)$：编码语义单元重要性的函数，这个函数应该满足：
  - 信息性：$\forall i \not= j \ \text{if} \ d_i=d_j \ \text{and} \ k_i > k_j \ \text{then} \ f(d_i,k_i) < f(d_j,k_j)$
  - 相关性：$\forall i \not= j \ \text{if} \ d_i>d_j \ \text{and} \ k_i = k_j \ \text{then} \ f(d_i,k_i) > f(d_j,k_j)$
  - 可加性：$I(f(d_i,k_i)) \equiv \alpha I(d_i) + \beta I(k_i)$
  - 归一性：$\sum _i f(d_i,k_i) = 1$
- 四条性质的公式表述很简单易懂，其中$I$是自信息。前两条说明我们想要与文档相关的，且能带来新知识的语义单元。可加性保证了与自信息定义的一致性，归一性保证这个函数是一个概率分布
满足以上性质的重要性编码函数为：
$P_{\frac DK}(w_i) = \frac 1C \frac {d_i^{\alpha}}{k_i^{\beta}} \\ C = \sum _i \frac {d_i^{\alpha}}{k_i^{\beta}}, \alpha, \beta \in \mathbb{R} ^{+} \\$
其中$\alpha$和$\beta$代表了相关性和信息性的强度
基于重要性的定义，我们可以找出最好的文摘应该满足：
$S^* = \text{argmax}_S \theta _I = \text{argmin} _S KL(S || P_{\frac DK})$
因此我们取$\theta _I$作为衡量文摘质量的指标：
$\theta _I (S,D,K) = -KL(P_S||P_{\frac DK})$
重要性概率的熵可以衡量可能的好文摘候选数量
衡量指标$\theta _I$其实可以拆分为之前提到的三个指标：
$\theta _I (S,D,K) \equiv -Red(S) + \alpha Rel(S,D) + \beta Inf(S,K)$

结果

作者用最简单的词作为语义单元，用词频归一化作为概率分布，两个超参数$\alpha$和$\beta$均设置为1，对于增量摘要，背景知识是已经生成摘要的文档，对于普通摘要，背景知识设置为无，即均匀分布
结果发现重要性衡量指标比传统指标更贴近人类的判断，且更具有区分性。
本文作者提出的只是一个框架，背景知识、语义单元的定义可根据任务、模型灵活定义。文摘的评价问题一直缺乏好的指标，本文也算是啃了这个硬骨头，而且给出的方法简单有效。

Zero-Shot Entity Linking by Reading Entity Descriptions

任务描述

杰出论文，研究了实体链接的零次学习，提出了领域自适应预训练策略来解决链接新领域未知实体时存在的领域偏差问题。
实体链接任务即给定一个query，其中包含待链接的实体，以及一系列候选实体描述，模型需要建立正确的实体链接，消除歧义。
作者给出了个很有趣的例子，在上古卷轴游戏里，query的描述时The Burden spell is the opposite of Feather , increasing a character ‘ s encumbrance…….显然这里的Burden描述的是法术的名字，待候选实体里有作为法术名字描述的Burden，也有作为法术效果描述的Burden，当然还有Burden在常规词典里的其他几种解释，模型需要将query里的Burden链接到作为法术名字描述的Burden上。对于这种具体名词任务还相对简单，困难的是将各种代词，例如“他”，“这个人”与具体的人物链接起来。实体链接任务与阅读理解任务紧密关联。
零次学习即，训练集只在上古卷轴游戏的领域数据集上训练，但是要正确预测其他领域的测试集，例如乐高游戏数据集、冠冕街电视剧数据集。
这就需要模型做到自然语言理解，而不是简单的领域内模式记忆。
在零次学习实体链接任务中，没有别名表或者频率先验可以参考，模型需要阅读每一个候选实体的描述并建立与上下文的对应关系。
一般的实体链接任务包含以下假设：
- 单一实体集：训练和测试是在同一实体集合上做的
- 别名表：对于每一个query实体，有一个候选实体表，或者叫query实体的别名表，不需要自己找
- 频率统计信息：从大型标注语料中统计得到的信息，可以用来估计实体的popularity和一段文本链接到实体的概率，可以作为很重要的先验知识补充给模型
- 结构化数据：一些系统提供了关系元组来帮助模型消歧
然而零次学习抛弃了以上所有假设，只假定存在实体词典，即所有的实体好歹有一个对应的描述，将实体链接任务的人为假设降到最低，可以说是最难最极限的情况了。接下来任务显然就分成了两部分：
- 对于每个query实体，找出其候选链接实体集合
- 将候选链接实体集合进行ranking

两步走

候选集生成采取了简单快速的方式：用信息检索的方式找出所有candidate。作者使用BM25来衡量query和document之间的相似度，找出来最相似的top-64篇文档作为候选集
接下来的ranking任务类似于阅读理解或者自然语言推断，作者使用了transformer based模型作为strong baseline。
- 正式的定义应该叫Mention而不是query，即需要去找链接的实体存在的上下文，记为$m$；而候选集实体的描述，记作$e$
- 将$m$和$e$作为句子对输入BERT模型，$m$还加上了额外的embedding以示区别于$e$
- BERT得到句子对编码之后，将编码的向量与entity的词向量内积得到分数
- 为了证明联合训练$m$和$e$的自注意力的重要性，作者还控制变量做了两个对比的naive bert模型，不过那就暂过不表了，毕竟self attention很重要已经是通识了，不需要再强调了。
以上的baseline其实很强大了，因为BERT经过预训练之后多少获得了领域迁移的能力，从结果也可以看出来，预训练和不预训练的BERT在平均准确度上差了3倍，而预训练的BERT无论使用src还是tgt还是都使用，差别都不大，不过都远远高于传统方法。

零次学习

接下来就是作者提出的零次学习方法，主要还是利用预训练，传统的预训练迁移有两种：
- 任务自适应预训练：在src和tgt的无监督语料上预训练，在src的监督语料上微调
- 开放语料预训练：就是BERT这一类的，不管src和tgt，自己先在超大规模无监督语料上预训练，再到src的监督语料上微调
作者提出领域自适应：即在以上两种预训练之后插入一段仅仅在tgt语料上预训练的过程，理由是模型的表达容量有限，应该优先优化tgt领域的表示

结果

结果当然是作者提出的领域自适应效果好一些，但其实差别不大，顶多1到2个点，而且提出的方法也不算新，仅仅改变语料多了一个预训练过程，整篇论文像是用BERT灌了一个新领域，和预训练模型训练指南一样。可能关键贡献也是提出了一个零次学习实体链接任务数据集吧。