Outstanding Papers Reading (ACL 2019)
Selected readings from ACL 2019 award-winning papers.
- Using oracle words (word- and sentence-level) in place of strict teacher forcing
- speaker commitment
- A theoretical evaluation framework for summarization that unifies multiple indicators
- Zero-Shot Entity Linking
Bridging the Gap between Training and Inference for Neural Machine Translation
Background
- Best long paper. The direction is very attractive: the problem is ubiquitous and well known, yet most people either ignore it or cannot find an elegant, effective solution.
- It tackles a problem shared by essentially all seq2seq models: the inconsistency between training and inference, i.e., exposure bias.
- Exposure bias is the bias introduced during decoding. Normally we generate a sentence from left to right, word by word: the model generates a word, that word is fed back into the decoder to decode the next one, so the context used to decode each word is the sentence fragment decoded so far. Training this way converges very slowly and accumulates errors: the model already struggles to produce the correct word at the start, and then has to generate the next word conditioned on that wrong word, which compounds the mistake. So training usually relies on teacher forcing: the model is forced to generate each word conditioned on the correct preceding text from the training data, i.e., regardless of what it has already generated, it always predicts under the assumption that the prefix is correct. But this trick is only available during training; at test time there is no ground truth to force. (A minimal decoder sketch contrasting the two modes appears at the end of this list.)
- This issue is neither particularly large nor particularly small; I have also run into it in earlier summarization work, where training looks fine but test performance is poor or shows inexplicable biases. Today's seq2seq models have made great progress on the encoder side, with feature extractors such as CNNs and Transformers moving beyond purely unidirectional extraction. But no matter the model, the decoder still has to generate from left to right, one word at a time, and exposure bias cannot be avoided.
- For translation, exposure bias is bundled with another issue that hurts translation quality: the word-by-word cross-entropy loss. The model has to learn to generate the correct word at the correct position, and this double standard of correctness is too strict for translation, making it hard for the model to learn flexible translation correspondences, i.e., overcorrection.
- What are the existing methods for addressing exposure bias and the word-level cross-entropy loss?
- When generating words, sometimes use the ground truth and sometimes the model's own predicted output, sampling between the two at a moderate rate, i.e., scheduled sampling
- Using pre-trained models, performing Masked Seq2seq pretraining
- Use sentence-level loss functions, where the goal is to maximize the score of the entire sentence rather than greedily optimizing word by word; this covers various optimization criteria and reinforcement-learning methods, such as mixed incremental cross-entropy REINFORCE
- Among these, pretraining is a relatively new approach, while the other two families were proposed as early as 2015, and the authors compare their method against them
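To make the training/inference mismatch above concrete, here is a minimal sketch (PyTorch) of a toy GRU decoder that can run either in teacher-forcing mode (training) or free-running mode (inference); the module and all names (ToyDecoder, embed, proj) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tgt, h, teacher_forcing=True):
        """tgt: (batch, seq_len) ground-truth ids; h: (batch, hidden) initial decoder state."""
        logits, prev = [], tgt[:, 0]              # start from the BOS token
        for t in range(1, tgt.size(1)):
            h = self.rnn(self.embed(prev), h)
            step_logits = self.proj(h)
            logits.append(step_logits)
            if teacher_forcing:
                prev = tgt[:, t]                  # training: feed the correct previous word
            else:
                prev = step_logits.argmax(-1)     # inference: feed the model's own prediction
        return torch.stack(logits, dim=1)
```

The only difference between the two modes is which word is fed back at each step, which is exactly the gap the paper tries to bridge.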
Methods
- This paper aims to address both issues, and at first glance the approach is familiar: sample between the ground truth and the model's predictions to mitigate the bias, and use sentence-level optimization metrics to relax the word-level loss constraint.
- How exactly to sample? The method the authors provide is shown in the figure below (isn't this just the figure for scheduled sampling...):
First, select the oracle word, i.e., the word predicted by the model. Note that "predicted by the model" is not quite accurate: the predicted word is deterministic, obtained by taking the argmax of the vocabulary distribution produced by the decoder (ignoring beam search), whereas "oracle" here really means "not the ground truth", i.e., not the true word. If we directly use the predicted word, we compound errors on top of errors; if we use the ground truth, we get exposure bias. So the authors take a middle path, different from the earlier probabilistic compromise (which takes either the predicted word or the ground truth), and also improve how the word is chosen rather than simply treating the argmax prediction as the oracle. Specifically:
If the word with the highest predicted probability from the decoder were taken directly as the oracle, that would be ordinary scheduled sampling.
However, the authors adjust the predictive distribution with the Gumbel-Max technique, introducing two quantities: Gumbel noise \(\eta\), computed from a uniform random variable \(u\), and a temperature \(\gamma\). Writing the original score vector (before softmax) as \(o\), the adjusted distribution \(\tilde{P}\) is
\[ \eta = - \log ( - \log u) \\ \tilde{o} _{j-1} = (o_{j-1} + \eta) / \gamma \\ \tilde{P} _{j-1} = softmax(\tilde{o} _{j-1}) \\ y_{j-1}^{\text {oracle }}=y_{j-1}^{\mathrm{WO}}=\operatorname{argmax}\left(\tilde{P}_{j-1}\right) \\ \]
Adding the noise only affects the selection of the oracle, not the model's loss. The Gumbel noise makes the argmax operation equivalent to sampling according to the softmax probabilities, so the softmax distribution is actually used rather than being bypassed by a plain argmax. Only Gumbel-Max is used here (the softmax in the formula is not strictly necessary). The other, more common application of Gumbel is Gumbel-Softmax, which enables reparameterization when the latent variable is assumed to follow a categorical distribution; compared with an ordinary softmax, it behaves like a differentiable stand-in for samples drawn according to the original softmax probabilities.
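A minimal sketch (NumPy) of this word-level oracle selection, following the formulas above; the function and argument names (word_level_oracle, logits, gamma) are mine, not the paper's.

```python
import numpy as np

def word_level_oracle(logits, gamma=1.0, rng=np.random.default_rng(0)):
    """Pick an oracle word id from the decoder's pre-softmax scores `logits` (1-D array)."""
    u = rng.uniform(size=logits.shape)      # u ~ Uniform(0, 1)
    eta = -np.log(-np.log(u))               # Gumbel noise
    o_tilde = (logits + eta) / gamma        # temperature-adjusted scores
    p_tilde = np.exp(o_tilde - o_tilde.max())
    p_tilde = p_tilde / p_tilde.sum()       # softmax (not strictly needed for the argmax)
    return int(np.argmax(p_tilde))          # equivalent to sampling a word from softmax(logits)
```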
This is word-level oracle selection; it can also be done at the sentence level. The specific method:
- First, use the word-level method combined with beam search to generate several candidate sentences
- Select the best sentence by BLEU, ROUGE, or other metrics, and take each word of that sentence as an oracle (see the sketch after this list)
- There is an obvious issue here: the oracle sentence produced by beam search must have the same length as the ground-truth sentence. The authors therefore introduce forced decoding: if the decoded sentence is still shorter than the ground truth and an EOS is produced, the EOS is excluded and beam search continues over the top-k most probable words; if the length is already sufficient but EOS has not yet been produced, decoding is forced to emit EOS and terminate
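A minimal sketch of the sentence-level oracle selection referenced above, assuming the beam candidates have already been length-matched by forced decoding; BLEU here comes from NLTK, and the helper name is mine.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def sentence_level_oracle(candidates, ground_truth):
    """candidates: list of token lists from beam search; ground_truth: token list."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([ground_truth], cand, smoothing_function=smooth)
              for cand in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]   # every word of this sentence then serves as an oracle word
```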
Then decide with a probability whether to use the oracle or the ground truth: as in scheduled sampling, a dynamic sampling probability is set. Early in training, more ground truth is used, and the proportion of oracle words is gradually increased. The probability given by the authors (with \(e\) the training epoch and \(\mu\) a hyperparameter) is:
\[ p = \frac{\mu}{\mu + exp(e / \mu)} \]
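A minimal sketch of this decay schedule; the value of \(\mu\) used here is arbitrary, and the helper names are mine.

```python
import math, random

def ground_truth_prob(epoch, mu=12.0):
    """p = mu / (mu + exp(epoch / mu)): close to 1 early on and decaying as training proceeds."""
    return mu / (mu + math.exp(epoch / mu))

def previous_word(ground_truth_word, oracle_word, epoch, mu=12.0):
    # With probability p feed the ground truth, otherwise feed the oracle word.
    return ground_truth_word if random.random() < ground_truth_prob(epoch, mu) else oracle_word
```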
- The results are, unsurprisingly, better than the vanilla RNN and Transformer baselines, with roughly a 2-point BLEU improvement. The authors also ran a large number of experiments on the impact of the hyperparameters. The method is simple and effective; in particular, the way sentence-level optimization is introduced is straightforward, much more intuitive than a pile of modified objective functions.
Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment
- Best short paper, studying a very interesting direction: speaker commitment, also known as event factuality.
- Speaker commitment means determining from the speaker's description whether an event holds, with three categories: factual, non-factual, and uncertain. The model needs to extract the factual status of the event from the speaker's wording. Traditional methods focus on modal verbs and verb phrases, but the authors introduce the CommitmentBank dataset to test a range of existing models, showing that existing datasets fail to capture the lexical and syntactic diversity of natural language, especially spoken language, and finding that models incorporating linguistic knowledge outperform LSTMs, which sets another goal for deep learning to conquer.
- As an example of the speaker-commitment problem, consider two statements: "I never believed I would study NLP" and "I do not believe I can study NLP." Both use the verb "believe" and both contain a negation ("never", "not"). The event in question is "I study NLP", and the task is to judge whether it has occurred. Clearly the former implies the event has happened, while the latter implies it has not. There are also more complex scenarios, such as guessing, given the statements of two debaters, whether some fact they discuss has occurred. In general each sample also comes with context, so the speaker-commitment task is: given the context, the speaker's utterance, and an event, judge whether the event is a fact.
- The authors tested two kinds of models on the CommitmentBank dataset: rule-based and neural-network-based
- Rule-based: "Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets". Linguistic knowledge is applied by manually assigning factuality scores to various predicate words and phrases, identifying the hidden signatures of predicates, and, based on syntactic-tree analysis, linking adjectives and modal verbs to enhance or reverse the scores. Finally, the scores from the various human knowledge bases and syntactic structures are fed as features into an SVM regression model to compute the final score.
- Neural-network-based: "Neural Models of Factuality". Sentences are modeled with multi-layer bidirectional LSTMs and tree-LSTMs, followed by a multi-layer MLP that computes a regression score. The authors tested three variants: bidirectional, tree, and ensemble (a minimal sketch of the bidirectional variant appears at the end of this section).
- The main body of the paper is the results analysis, with rich data presentation. However, the authors do not offer much causal analysis; they mostly state which kinds of facts, states, corpora, and modalities lead to better performance for which kinds of models. Perhaps because I do not work in this area, I do not see research directions that could be mined from these conclusions. The overall conclusion is that human knowledge generalizes better and deep models need to integrate it, which is somewhat broad.
- That this paper won an award suggests the academic community still values diversity in NLP research. Challenging tasks like this one attract few researchers, but once solved they can greatly benefit downstream tasks such as information extraction and dialogue.
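For the neural-network side, here is a minimal sketch (PyTorch) in the spirit of the bidirectional-LSTM regression model described above; the layer sizes, the mean pooling, and the [-3, 3] output scaling are my assumptions, not the exact setup of the cited paper.

```python
import torch
import torch.nn as nn

class FactualityRegressor(nn.Module):
    def __init__(self, vocab_size=30000, emb=300, hidden=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, token_ids):
        """token_ids: (batch, seq_len) -> one factuality score per sentence, scaled to [-3, 3]."""
        states, _ = self.lstm(self.embed(token_ids))
        pooled = states.mean(dim=1)          # simple mean pooling over time steps
        return 3.0 * torch.tanh(self.mlp(pooled)).squeeze(-1)
```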
A Simple Theoretical Model of Importance for Summarization
- One of the outstanding papers; since I also work on summarization, I picked it out to read. The authors present a simple theoretical model for quantitatively analyzing the importance of a summary, a notion that previously had no direct, explicit definition. They integrate semantic knowledge into the concept of information entropy, propose semantic units, and generalize the three concepts long used in summarization, Redundancy, Relevance, and Informativeness, unifying them under the notion of Importance. They also show that the importance measure aligns closely with human judgments, unlike earlier automatic metrics, which struggle to guarantee summary quality.
- First, it must be said that the related work in the paper is very thorough, weaving together several threads from the 1950s to the present; the reference list is well worth reading.
Definition
Semantic unit: the atomic unit of information. The set of semantic units is denoted \(\Omega\), and a document can be expressed as a probability distribution over this set. Semantic units fit many frameworks, such as frames, topic models, or the embeddings commonly used in deep learning. They share one unifying assumption: they are discrete and independent, and the meaning of language is built on them. We denote the document and the summary as \(D\) and \(S\), and their probability distributions over semantic units as \(P_D, P_S\).
Entropy: entropy can be computed from the semantic-unit distribution: \(H = - \sum _{w} P(w) \log (P(w))\)
Redundancy: redundancy is defined as the difference between the maximum entropy and the entropy:
\[ Red(S) = H_{max} - H(S) \]
The maximum entropy is attained by the uniform distribution. This in effect converts entropy, which measures uncertainty, into redundancy, which measures how concentrated (repetitive) the distribution is. A summary should have low redundancy, i.e., high entropy; otherwise much of the information it takes from the documents is repetitive. Since the maximum entropy is a constant for a given corpus, redundancy can be abbreviated as \(Red(S) = -H(S)\)
Relevance: the authors define relevance as follows: when we observe a summary to infer the information in the original document, the difference (loss) from the true information of the document should be minimal, so relevance is defined as the negative of this loss. The simplest loss is the cross-entropy between the semantic-unit distributions of the summary and the document:
\[ Rel(S,D) = - CrossEntropy(S,D) = \sum _{w_i} P_S(w_i) \log (P_D(w_i)) \]
At the same time, we note that:
\[ KL(S||D) = Red(S) - Rel(S,D) \]
A summary with low redundancy and high relevance minimizes the KL divergence between the summary and the original document.
Informativeness: the informativeness of a summary is defined as its ability to change one's common sense or knowledge. The authors introduce background knowledge \(K\) with probability distribution \(P_K\), and define informativeness as
\[ Inf(S,K) = CrossEntropy(S,K) \]
High informativeness means the summary brings information that is not already present in the background knowledge. Next, how to define background knowledge:
- Background knowledge should assign high probability to known semantic units, representing the strength of these units in the user's memory
- In general, background knowledge can be left empty, i.e., a uniform distribution, but it gives summarization a controllable handle: users can specify a query indicating the semantic units they are interested in, and the background knowledge then assigns low probabilities to those units.
- In multi-document summarization, background knowledge can simply be the documents that have already been summarized
Now we can define importance to unify the three indicators above. Importance should measure how important each semantic unit is: we want the summary to retain only the relatively important units, which means finding a probability distribution that combines the document and the background knowledge and encodes which semantic units the summary is expected to keep
Importance
It should extract, from the information in document \(D\), the parts useful to a user with background knowledge \(K\). We define:
- \(d_i = P_D(w_i)\) : the probability of semantic unit \(w_i\) in the document \(D\)
- \(k_i = P_K(w_i)\) : the probability of semantic unit \(w_i\) in the background knowledge \(K\)
- A function \(f(d_i,k_i)\) encoding the importance of semantic units, which should satisfy:
- Informativeness: \(\forall i \not= j \ \text{if} \ d_i=d_j \ \text{and} \ k_i > k_j \ \text{then} \ f(d_i,k_i) < f(d_j,k_j)\)
- Relevance: \(\forall i \not= j \ \text{if} \ d_i>d_j \ \text{and} \ k_i = k_j \ \text{then} \ f(d_i,k_i) > f(d_j,k_j)\)
- Additivity: \(I(f(d_i,k_i)) \equiv \alpha I(d_i) + \beta I(k_i)\)
- Normalization: \(\sum _i f(d_i,k_i) = 1\)
- The formal statements of the four properties are simple and intuitive, where \(I\) denotes self-information. The first two properties say we want semantic units that are relevant to the document and that bring new knowledge. Additivity keeps the function consistent with the definition of self-information, and normalization guarantees that it is a probability distribution.
The importance coding function that satisfies the above properties is:
\[ P_{\frac DK}(w_i) = \frac 1C \frac {d_i^{\alpha}}{k_i^{\beta}} \\ C = \sum _i \frac {d_i^{\alpha}}{k_i^{\beta}}, \alpha, \beta \in \mathbb{R} ^{+} \\ \]
\(\alpha\) and \(\beta\) represent the intensity of relevance and informativeness
Based on the definition of importance, we can state the criterion that the best summary should satisfy:
\[ S^* = \text{argmax}_S \, \theta _I(S,D,K) = \text{argmin} _S KL(P_S \| P_{\frac DK}) \]
Therefore, we take \(\theta _I\) as a measure of summary quality:
\[ \theta _I (S,D,K) = -KL(P_S||P_{\frac DK}) \]
The entropy of the importance distribution, \(H(P_{\frac DK})\), measures the number of potentially good candidate summaries
The quality measure \(\theta _I\) can in fact be decomposed into the three indicators introduced earlier:
\[ \theta _I (S,D,K) \equiv -Red(S) + \alpha Rel(S,D) + \beta Inf(S,K) \]
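A minimal sketch putting the definitions above together, with words as semantic units and normalized word frequencies as the distributions (the setup the authors use in their experiments); the smoothing constant is my addition to avoid log(0), and all function names are mine.

```python
import math
from collections import Counter

def distribution(tokens, vocab, eps=1e-12):
    """Normalized word frequencies over a shared vocabulary, with tiny smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p.values())

def cross_entropy(p, q):
    return -sum(p[w] * math.log(q[w]) for w in p)

def importance_distribution(p_d, p_k, alpha=1.0, beta=1.0):
    """P_{D/K}(w) proportional to P_D(w)^alpha / P_K(w)^beta."""
    unnorm = {w: p_d[w] ** alpha / p_k[w] ** beta for w in p_d}
    c = sum(unnorm.values())
    return {w: v / c for w, v in unnorm.items()}

def theta_i(p_s, p_d, p_k, alpha=1.0, beta=1.0):
    """theta_I = -Red(S) + alpha * Rel(S, D) + beta * Inf(S, K), with Red(S) = -H(S)."""
    red = -entropy(p_s)               # abbreviated redundancy
    rel = -cross_entropy(p_s, p_d)    # relevance
    inf = cross_entropy(p_s, p_k)     # informativeness
    return -red + alpha * rel + beta * inf
```

With no background knowledge, `p_k` can simply be the uniform distribution over the vocabulary, matching the generic-summarization setting described in the results below.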
Results
- The authors use plain words as semantic units, normalized word frequencies as the probability distributions, and set both hyperparameters \(\alpha\) and \(\beta\) to 1. For incremental summarization, the background knowledge is the documents already summarized; for generic summarization, the background knowledge is empty, i.e., a uniform distribution
- The results show that the importance-based measure agrees more closely with human judgments than traditional metrics and is more discriminative.
- This paper proposes only a framework; the background knowledge and the definition of semantic units remain flexible and can be adapted to the task and the model. Summarization evaluation has long lacked good metrics, and this paper tackles that hard problem with a simple and effective method.
Zero-Shot Entity Linking by Reading Entity Descriptions
Task Description
- Outstanding paper, which investigates zero-shot entity linking and proposes a domain-adaptive pretraining strategy to address the domain bias that arises when linking unseen entities in new domains.
- The entity linking task: given a query containing the mentions to be linked, together with a set of candidate entity descriptions, the model must establish the correct entity links and resolve the ambiguity.
- The authors give an interesting example from the game The Elder Scrolls, where the query reads "The Burden spell is the opposite of Feather, increasing a character's encumbrance......". Clearly, "Burden" here is the name of a spell. Among the candidate entities there are Burdens as spell names, Burdens as spell effects, and of course Burdens with ordinary dictionary meanings. The model must link the "Burden" in the query to the "Burden" that is a spell name. Such concrete nouns are relatively easy; the difficulty lies in linking pronouns such as "he" or "this person" to specific individuals. Entity linking is closely related to reading comprehension.
- Zero-shot learning here means the model is trained only on the domain dataset of The Elder Scrolls games, yet is required to predict correctly on test sets from other domains, such as a Lego game dataset or a Coronation Street TV-series dataset.
- This requires the model to achieve natural language understanding rather than simple domain-specific pattern memorization.
- In the zero-shot learning entity linking task, there is no alias table or frequency prior to refer to; the model needs to read the description of each candidate entity and establish a correspondence with the context.
- General entity linking tasks involve the following assumptions:
- Single entity set: Training and testing are performed on the same entity set
- Alias table: each query entity comes with a table of candidate entities, i.e., its alias table, so candidates do not have to be searched for manually
- Frequency statistics: information gathered from a large annotated corpus, used to estimate the popularity of entities and the probability of a mention linking to an entity; this serves as an important prior for the model
- Structured Data: Some systems provide relational tuples to assist models in disambiguation
- However, zero-shot entity linking abandons all of the assumptions above, assuming only the existence of an entity dictionary, i.e., every entity has at least a textual description. This reduces the assumptions of the entity linking task to a minimum and is arguably the hardest, most extreme setting. The task naturally splits into two parts:
- For each query entity, find the candidate linked entity set
- Rank the candidate link entity set
Two-step approach
- Candidate set generation adopts a simple and fast approach: candidates are found with an information-retrieval method. The authors use BM25 to measure the similarity between queries and documents and take the top-64 most similar documents as the candidate set (a BM25 sketch appears at the end of this list).
- The subsequent ranking task is similar to reading comprehension or natural language inference, and the authors use a Transformer-based model as a strong baseline.
- Formally, this should be called a mention rather than a query: the context in which the entity to be linked appears, denoted \(m\); the description of a candidate entity is denoted \(e\)
- \(m\) and \(e\) are fed into BERT as a sentence pair, with additional embeddings added to \(m\) to distinguish it from \(e\)
- BERT encodes the sentence pair, and the score is computed as the dot product between the encoded vector and the entity's vector (a cross-encoder sketch appears at the end of this list)
- To demonstrate the importance of self-attention between \(m\) and \(e\) when they are encoded jointly, the authors also ran two controlled comparisons with weaker BERT variants, but those are not worth dwelling on here; the importance of self-attention is by now common knowledge and needs no further emphasis.
- The baseline above is actually quite strong, because after pretraining BERT already has some capacity for domain transfer, as the results show: the average accuracy of pretrained versus non-pretrained BERT differs by roughly a factor of three, and whether the pretraining corpus is src, tgt, or both makes little difference, yet all variants are far above traditional methods.
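To make the two steps concrete, here are two minimal sketches. First, candidate generation with BM25 via the third-party rank_bm25 package (my choice of library; the paper only specifies BM25 and a top-64 candidate set):

```python
from rank_bm25 import BM25Okapi

def candidate_set(mention_text, entity_descriptions, k=64):
    """Return indices of the top-k entity descriptions most similar to the mention context."""
    tokenized_docs = [doc.lower().split() for doc in entity_descriptions]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(mention_text.lower().split())
    return sorted(range(len(entity_descriptions)),
                  key=lambda i: scores[i], reverse=True)[:k]
```

Second, a cross-encoder ranking sketch with Hugging Face transformers: each (mention context, candidate description) pair is encoded jointly by BERT and scored; scoring with a linear layer on the [CLS] vector is my simplification of the dot-product scoring described above, and it omits the extra mention-distinguishing embeddings.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class CandidateRanker(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = BertModel.from_pretrained(model_name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, mention_context, candidate_descriptions):
        """Return one score per candidate; the highest-scoring candidate is the predicted link."""
        enc = self.tokenizer([mention_context] * len(candidate_descriptions),
                             candidate_descriptions,
                             padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vector of each pair
        return self.score(cls).squeeze(-1)               # shape: (num_candidates,)
```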
Zero-shot learning
- Next comes the authors' proposed zero-shot method, which still relies mainly on pretraining; there are two conventional pretraining-transfer setups:
- Task-adaptive pretraining: Pretrain on unsupervised corpus of src and tgt, and fine-tune on supervised corpus of src
- Open Corpus Pretraining: This is like BERT, which pretrains on a large-scale unsupervised corpus regardless of src and tgt, and then fine-tunes on the supervised corpus of src
- The authors propose domain-adaptive pretraining: insert an additional pretraining stage on the tgt corpus only, after either of the two setups above, on the grounds that model capacity is limited and the representation of the tgt domain should be optimized first
Results
- The results, of course, show that the authors' domain-adaptive pretraining works somewhat better, but the margin is small, at most 1 to 2 points, and the method is not particularly novel; it merely adds one more pretraining stage on a different corpus. The whole paper reads like BERT being carried into yet another new area, a training guide for pretrained models. Perhaps the key contribution is the dataset proposed for the zero-shot entity linking task.