Paper Reading 4

Posted on 2019-12-16 Edited on 2025-01-30 In NLP Views: Word count in article: 13k Reading time ≈ 12 mins.

Paper reading on

GNN Pooling
Discourse-Aware Summarization
Siamese BERT
Large Chatbot

Edge Contraction Pooling for Graph Neural Networks

A new GNN pooling method that considers edges
Significance of pooling in GNNs:
- Identify clusters based on features or structure
- Reduce computational complexity
The authors’ edgepool method can improve graph classification and node classification performance.
There are two types of pooling: fixed and learned. The authors briefly introduce three learned pooling methods:
- DiffPool: DiffPool learns a probability allocation, using one GNN to learn embedding and another to learn cluster assignment, treating the cluster assignment as a soft assign matrix $S$. Nodes are assigned to clusters based on node features, with a predetermined number of clusters. Each layer pools both embedding and adjacency matrix simultaneously, as follows:
  $\begin{array}{l}{X^{(l+1)}=S^{(l)^{T}} Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d}} \\ {A^{(l+1)}=S^{(l)^{T}} A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}}\end{array} \\$
  Problems include: fixed cluster number; assignment based solely on node features without considering node distances; cluster assignment matrix linearly related to node count, difficult to scale; challenging to train
- TopKPool: A straightforward approach that learns a projection vector, projecting each node’s features to a single weighted value, retaining the top-k nodes. Issues include inability to modify the graph (add nodes) and potential information loss due to hard assignment
- SAGPool: An improvement on TopK, using attention-weighted neighborhood nodes before projection, but still maintaining a hard topk assignment
The edge pooling concept reduces sampling through edge contraction. Given an edge e with nodes $v_i$ and $v_j$, edge contraction means connecting all adjacent nodes of i and j to a new node $v_e$. This operation can be repeated multiple times, similar to expanding receptive field in CNNs.
How to select edges?
- First, calculate edge scores by concatenating and linearly transforming the embeddings of connected nodes
  $r(e_{ij}) = W (n_i || n_j) + b$
- Then normalize all scores using softmax, with the author adding 0.5 to ensure a mean of 1, explained as improving numerical stability and gradient propagation
  $s_{ij} = 0.5 + softmax_{r_{*j}}(R_{ij})$
- Begin contracting edges based on scores, avoiding contraction of already contracted edge nodes. This reduces nodes by half each time.
The new node score is directly obtained by weighted averaging of the two endpoint node features:
$\hat{n}_{i j}=s_{i j}\left(n_{i}+n_{j}\right)$

Discourse-Aware Hierarchical Attention Network for Extractive Single-Document Summarization

Using a hierarchical LSTM encoder + LSTM decoder for extractive summarization as a baseline, the authors added a three-layer attention to incorporate discourse information. Specifically, discourse information refers to sentence-level elaborate relationships, where one sentence provides detailed explanation or supplementary information about another. The authors argue that document summarization, as a discourse-level task, naturally requires discourse information.
The authors use attention to learn directed elaborate edges between sentences, as shown in the following diagram:
Three components:
- Parent Attention: Use hierarchical encoder to obtain sentence representations, then use attention to represent the probability of sentence k being the parent node of sentence i, with elaborate edges pointing from k to i (without using self-attention)
  $\begin{aligned} p(k | i, \mathbf{H}) &=\operatorname{softmax}(g(k, i)) \\ g(k, i) &=v_{a}^{\mathrm{T}} \tanh \left(U_{a} \cdot H_{k}+W_{a} H_{i}\right) \end{aligned}$
- Recursive Attention: Calculate multi-hop parent nodes, obtaining the probability of k being the d-hop parent node of i. This can be simply achieved by powering the attention matrix, with special handling for the root sentence (virtual node) which has no parent nodes:
  $\alpha_{d, k, i}=\left\{\begin{array}{ll}{p(k | i, \mathbf{H})} & {(d=1)} \\ {\sum_{l=0}^{N} \alpha_{d-1, k, l} \times \alpha_{1, l, i}} & {(d>1)}\end{array}\right.$
- Selective Attention: Combine attention information by first weighted summing parent node information for sentence i at each hop:
  $\gamma_{d, i}=\sum_{k=0}^{N} \alpha_{d, k, i} H_{k}$
  Then calculate hop weights using selective attention, depending on sentence i’s encoder and decoder states $H,s$, and encoder states of all parent nodes:
  $\beta_{d, i}=\operatorname{softmax}\left(\mathbf{W}_{\beta}\left[H_{i} ; s_{i} ; K\right]\right)$
  Obtain weighted information from all hops and append to decoder input
  $\Omega_{i}=\sum_{d} \beta_{d, i} \gamma_{d, i} \\ p\left(y_{i} | \mathbf{x}, \theta\right)=\operatorname{softmax}\left(\mathbf{W}_{o} \tanh \left(\mathbf{W}_{c^{\prime}}\left[H_{i} ; s_{t} ; K ; \Omega_{i}\right]\right)\right) \\$
The authors mention that Rhetorical Structure Analysis (RST) currently lacks good off-the-shelf tools with high accuracy. They propose a joint learning framework, which turns out to mean using existing RST Parsers to obtain elaborate edges during training to guide Parent Attention, with no parser needed during testing. The parser’s errors still significantly impact the model. The objective function is:
$-\log p(\mathbf{y} | \mathbf{x})-\lambda \cdot \sum_{k=1}^{N} \sum_{i=1}^{N} E_{k, i} \log \alpha_{1, k, i}$
The second term guides attention using parser-obtained edges
The authors first use HILDA parser to obtain RST discourse annotations, then convert them to dependency format using a method from “Single-document summarization as a tree knapsack problem”
Although still dependent on parser during training, the authors created two baselines: one without parser using the previous sentence as the elaborate parent, another letting attention learn independently. Results showed the parser-informed attention model outperformed baselines. The model showed more significant advantages on short texts (75 words) compared to long texts (275 words) in the Daily Mail dataset, partly due to ROUGE metric’s preference for longer texts, indicating discourse information indeed helps in extracting the most important information within word count constraints.
This paper can be seen as an attention model (self-attention + multi-blocks) injecting prior information to achieve better results in single-document extractive summarization.

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

A NAACL 2018 paper considering discourse information for abstractive summarization on research paper datasets
Here, discourse is narrowly defined as sections in research papers, essentially a hierarchical attention model built upon the pointer-generator, with the following structure:
Praiseworthy is the authors’ provision of two large-scale long-document research paper summary datasets, PubMed and arXiv, both reaching tens of thousands in scale, with average source document lengths over 3000 and 4900 words, and average summary lengths exceeding 100 words - valuable ultra-long single-document summary datasets.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Highlight: Sentence pair regression tasks with BERT are time-consuming. The authors propose a Siamese BERT network, improving inference speed by 1,123,200 times
Evidently, this speedup claim is ambiguous. Naive BERT is slow in semantic matching tasks because each pair requires sending two sentences through BERT to calculate scores. The authors slightly modify BERT to use embeddings as sentence feature vectors, directly using cosine distance for matching
Next, they prove that original BERT embeddings are not good semantic matching features. SBERT adds regression or classification layers after BERT and introduces triplet loss, significantly improving performance over original BERT. It can be seen as a fine-tuning of BERT for semantic matching tasks.

Towards a Human-like Open-Domain Chatbot

Highlight: Google-produced, large. Researching detailed design aspects
2.6 billion parameters. 40 billion token corpus. To capture multi-turn dialogue quality, the authors propose Sensibleness and Specificity Average (SSA) as a metric, finding that models optimized for perplexity achieve the best SSA
The authors use evolved transformer, training a seq2seq model with multi-turn dialogue input, vocabulary size of 8k (using BPE), achieving a test set perplexity of only 10.2. The model outperforms other dialogue systems supplemented with rules, systems, and knowledge, again proving that deep neural networks can achieve miracles with sufficient data and training
SSA measures two aspects: sensible and specific. It’s a human-evaluated metric where testers first judge if the response is sensible, and if so, then judge its specificity, as systems often scoring well on automatic metrics tend to give vague “I don’t know” responses. The authors found SSA correlates with human assessments of system human-likeness
SSA has two testing environments: a specified test set of 1,477 multi-turn dialogues, and direct chatting with the system for 14-28 turns
The authors provide numerous training and testing details, essentially highlighting the model’s scale: trained on one TPU-v3 Pod for 30 days, 164 epochs, observing 10T tokens in total
This powerful yet simple model doesn’t require complex decoding to ensure high-quality, diverse responses. The authors used sample and rank: dividing logits by temperature T, then softmax, randomly sampling multiple sequences based on probability and selecting the highest-probability sequence. Higher temperatures reduce logits’ differences, facilitating context-related rare word generation. Sample and rank surprisingly outperforms beam search, provided the model achieves low perplexity. The authors set temperature to 0.88, sampling 20 sentences
Statistical tests revealed a correlation coefficient exceeding 0.9 between perplexity and SSA

Edge Contraction Pooling for Graph Neural Networks

一种新的GNN池化方式，考虑了边
池化在GNN中的意义：
- 识别基于特征或者基于结构的聚类
- 减少计算量
作者提出的edgepool能够提高图分类和节点分类的性能。
pooling有两种，fixed和learned，作者简单介绍了三种learned pooling method
- DiffPool：DiffPool学习到一种概率分配，用一个GNN学习embedding，用一个GNN学习聚类分配，将聚类分配视为一个soft assign matrix$S$，基于节点特征将每个节点分配给一个聚类，聚类数量事先固定，每一层同时对embedding和邻接矩阵进行pooling，如下：
  $\begin{array}{l}{X^{(l+1)}=S^{(l)^{T}} Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d}} \\ {A^{(l+1)}=S^{(l)^{T}} A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}}\end{array} \\$
  问题在于：聚类数量不可变；基于节点特征分配而不考虑节点之间距离；聚类分配矩阵与节点数目成线性关系，难以scale；难以训练
- TopKPool：简单粗暴，学习到一个投影向量，将每个节点的特征投影加权为一个单值，取topk个节点保留作为Pooling，问题在于不能改变图（加节点），以及这种hard assignment容易丢失信息
- SAGPool：对TopK的改进，对邻域节点使用了注意力加权，再投影，不过依然是topk的hard assignment。
edge pooling的思想是通过边的收缩(edge contraction)来降采样，给定一条边e，两边节点$v_i$和$v_j$，边收缩指的是将i和j的所有邻接节点全部接到一个新节点$v_e$，这个操作显然是可以叠加多次，类似于CNN的不断扩大感受野。
如何选边？
- 先对边计算分数，这里简单的将边连接的两个节点的embedding拼接再线性变换
  $r(e_{ij}) = W (n_i || n_j) + b$
- 之后对所有的分数做softmax归一化，注意这里作者加了0.5使得均值为1，作者给出的解释是数值计算更稳定且梯度传导更好
  $s_{ij} = 0.5 + softmax_{r_{*j}}(R_{ij})$
- 按照分数开始收缩边，假如边连接了已经收缩的边节点那就不再收缩了。这样每次都能减少一半的节点。
新的节点分数直接用边分数加权两端节点特征和得到：
$\hat{n}_{i j}=s_{i j}\left(n_{i}+n_{j}\right)$

Discourse-Aware Hierarchical Attention Network for Extractive Single-Document Summarization

以hierarchical lstm encoder+lstm decoder的抽取式摘要作为baseline，添加了一个三层attention用来加入篇章信息，这里的篇章信息具体指的是句子级别的elaborate关系，即某一句详细阐述或者补充说明了另一句，作者认为document summarization这种篇章级别的任务当然需要篇章信息。
作者使用了attention来学习句子之间的elaborate有向边，具体如下图：
三个组件
- Parent Attention：使用hierarchical encoder得到每个句子的表示，之后用attention表示句子k是句子i父节点的概率，即elaborate的边由k指向i（作者没有用self attention）
  $\begin{aligned} p(k | i, \mathbf{H}) &=\operatorname{softmax}(g(k, i)) \\ g(k, i) &=v_{a}^{\mathrm{T}} \tanh \left(U_{a} \cdot H_{k}+W_{a} H_{i}\right) \end{aligned}$
- Recursive Attention：即计算多跳父节点，得到k是i的d跳父节点概率，这里简单的用注意力矩阵幂应该就可以得到，注意要对root句子（虚节点）做特殊处理，root没有父节点：
  $\alpha_{d, k, i}=\left\{\begin{array}{ll}{p(k | i, \mathbf{H})} & {(d=1)} \\ {\sum_{l=0}^{N} \alpha_{d-1, k, l} \times \alpha_{1, l, i}} & {(d>1)}\end{array}\right.$
- Selective Attention：综合得到的attention信息，首先将句子i某一跳所有父节点的信息加权求和：
  $\gamma_{d, i}=\sum_{k=0}^{N} \alpha_{d, k, i} H_{k}$
  之后再用selective attention计算该跳的权重，依赖于句子i的encoder和decoder state$H,s$，以及所有父节点的encoder state：
  $\beta_{d, i}=\operatorname{softmax}\left(\mathbf{W}_{\beta}\left[H_{i} ; s_{i} ; K\right]\right)$
  得到权重加权所有跳的信息，并补充进decoder input当中（拼接）
  $\Omega_{i}=\sum_{d} \beta_{d, i} \gamma_{d, i} \\ p\left(y_{i} | \mathbf{x}, \theta\right)=\operatorname{softmax}\left(\mathbf{W}_{o} \tanh \left(\mathbf{W}_{c^{\prime}}\left[H_{i} ; s_{t} ; K ; \Omega_{i}\right]\right)\right) \\$
这里，作者说提到了修辞结构分析（RST）目前没有很好的off-the-shelf tools，误差大，这是硬伤，因此提出了一个联合学习的框架，后来发现联合学习是指训练集上依然用已有的RST Parser得到elaborate edges，用以指导Parent Attention，之后测试集就不需要了，这样的话Parser当中的误差对模型的影响依然很大。目标函数为：
$-\log p(\mathbf{y} | \mathbf{x})-\lambda \cdot \sum_{k=1}^{N} \sum_{i=1}^{N} E_{k, i} \log \alpha_{1, k, i}$
其中第二项就是用parser得到的边指导attention
作者先用HILDA parser得到RST格式的篇章标注信息，然后用Single-document summarization as a tree knapsack
problem一文中的方法转换为dependency的格式
虽然依然依赖于parser进行训练，但是作者做了两个Baseline，一个是不用parser，直接将前一句作为下一句的elaborate parent，另一个也不用parser，让attention自己学习，结果发现baseline都不如注入了parser信息的attention模型。让attention自己学习最差，其次是学一个固定的前句父节点。作者提出的模型相比baseline在daily mail数据集上抽短文本（75)比抽长文本(275)优势更大，这里有ROUGE指标偏爱长文本的原因，也说明在字数限制下，抽最重要的信息方面，discourse的信息确实可以起到帮助。
这篇文章可以看成一个attention模型(self attention + multi-blocks)，注入了一些先验信息来帮助在单文档抽取式摘要获得更好的结果。

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

NAACL 2018的一篇论文，依然是考虑了篇章信息，不过是在科研论文数据集上做生成式摘要。
这里的discourse有些狭义了，指的是科研论文里的每一个section，其实还是一个hierarchical attention，作者也直接在pointer-generator上改了，结构如下：
值得称赞的是作者提供了两个大规模长文档的科研论文摘要数据集，pubmed以及arxiv，均达到十万规模即便，平均原文长度达到3000+和4900+，平均摘要长度也过百，是很有价值的超长单文档摘要数据集。

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

亮点：用BERT做句子对回归任务很耗时，作者提出孪生BERT网络，将推理速度提高了1123200倍
显然提高这么多倍的说法是有歧义的，naive bert在语义匹配任务上耗时，是因为每匹配一对就要将一对句子送进BERT计算出分数，而作者将BERT稍作修改，用BERT得到的embedding作为句子的特征向量，直接用向量的之间的cosine距离来做匹配，当然要快
接下来就是证明原始BERT得到的embedding并不能很好的作为语义匹配的特征向量，SBERT也就是在BERT之后加了回归层或者分类层，引入triplet loss，得到的效果就比原始BERT好很多。可以看成是BERT在语义匹配任务上的一种微调吧。

Towards a Human-like Open-Domain Chatbot

亮点：谷歌出品，大。研究一些细节设计。
26亿参数量。400亿token的语料。为了很好的捕捉多轮对话的质量，作者提出了Sensibleness and Specificity Average(SSA)作为指标，并且发现最优化perplexity的模型能够达到最好的SSA。
作者使用evolved transformer，多轮对话作为输入，训练了一个seq2seq，词标大小8k（用了BPE），最后测试集的困惑度只有10.2，且实际表现比其他的补充了规则、系统、知识的复杂的对话系统表现要好，再次证明了深度神经网络，只要数据够多，训得够好，就是可以大力出奇迹。
SSA衡量两个方面：合理且具体。这是一个人工衡量指标，首先问测试人员回答是否合理，假如合理，再问回答是否具体，因为很多时候回答不具体（总是回答i don’t know）的系统反而在自动指标上取得比较好的成绩。作者也实验发现SSA和人工检测系统是否human-like一致，SSA高的系统表现更加像人类。
SSA有两种测试环境，一种是指定测试集，作者收集了1477个多轮对话作为测试数据集；另一个就是让测试人员直接和系统闲聊，至少14轮，至多28轮
作者给出了很多训练细节和测试细节，具体可见论文，反正就是大，在一块TPU-v3 Pod上训练了30天，164个epoch，模型总共观察了10T个token。
这么强大而简单的模型，在decoding时不需要复杂的处理来保证生成高质量且多样化的回答。作者采用了sample and rank：生成Logits之后先除以温度T，再过softmax，按概率随机采样生成多个序列之后取概率最大的那一句作为输出。作者发现温度越高，即logits输出的差异性越小，容易生成与上下文相关的罕见词。作者比对发现sample and rank虽然简单但是比beam search表现更好，前提是能够训练到low perplexity。作者将温度设为0.88，采样20句。
统计测试发现perplexity和SSA的相关系数高达0.9以上。