Paper Reading 4
Paper reading on
- GNN Pooling
- Discourse-Aware Summarization
- Siamese BERT
- Large-Scale Open-Domain Chatbot
Edge Contraction Pooling for Graph Neural Networks
A new GNN pooling method that considers edges
Significance of pooling in GNNs:
- Identify clusters based on features or structure
- Reduce computational complexity
The authors' EdgePool method improves both graph classification and node classification performance.
There are two types of pooling: fixed and learned. The authors briefly introduce three learned pooling methods:
DiffPool: DiffPool learns a soft cluster assignment, using one GNN to learn node embeddings and another to learn the assignment, which is treated as a soft assignment matrix \(S\). Nodes are assigned to clusters based on their features, with a predetermined number of clusters. Each layer pools the embeddings and the adjacency matrix simultaneously, as follows:
\[ \begin{aligned} X^{(l+1)} &= S^{(l)^{T}} Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d} \\ A^{(l+1)} &= S^{(l)^{T}} A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}} \end{aligned} \]
Problems include: the number of clusters is fixed; assignment is based solely on node features without considering distances between nodes; the cluster assignment matrix grows linearly with the number of nodes, making it hard to scale; and the model is challenging to train.
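As a minimal sketch (not the authors' code), one DiffPool coarsening step implementing the equations above, assuming dense tensors and a soft assignment matrix `S` produced elsewhere by an assignment GNN:

```python
import torch

def diffpool_step(Z, A, S):
    """One DiffPool coarsening step.

    Z: (n_l, d)        node embeddings from the embedding GNN
    A: (n_l, n_l)      adjacency matrix at layer l
    S: (n_l, n_{l+1})  soft cluster assignment (rows softmax-normalized)
    """
    X_next = S.t() @ Z       # pooled features:  (n_{l+1}, d)
    A_next = S.t() @ A @ S   # pooled adjacency: (n_{l+1}, n_{l+1})
    return X_next, A_next

# toy usage: 6 nodes pooled into 2 clusters
Z = torch.randn(6, 16)
A = (torch.rand(6, 6) > 0.5).float()
S = torch.softmax(torch.randn(6, 2), dim=-1)
X2, A2 = diffpool_step(Z, A, S)   # shapes: (2, 16), (2, 2)
```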
TopKPool: A straightforward approach that learns a projection vector, projects each node's features onto a single scalar score, and keeps the top-k nodes. Issues include the inability to change the graph structure (e.g., add nodes) and potential information loss from the hard selection
SAGPool: An improvement on TopKPool that computes scores from attention-weighted neighborhood features before projection, but it still keeps the hard top-k selection
EdgePool downsamples the graph through edge contraction. Given an edge e between nodes \(v_i\) and \(v_j\), contracting e means merging them into a new node \(v_e\) and connecting all neighbors of \(v_i\) and \(v_j\) to it. The operation can be repeated multiple times, similar to expanding the receptive field in CNNs.
How to select edges?
First, compute an edge score by concatenating the embeddings of the two endpoint nodes and applying a linear transformation:
\[ r(e_{ij}) = W (n_i || n_j) + b \]
Then normalize the raw scores with a softmax over the edges incident to the same node; the authors add 0.5 so the resulting scores sit around 1, which they explain improves numerical stability and gradient propagation:
\[ s_{ij} = 0.5 + \operatorname{softmax}_{r_{*j}}(r_{ij}) \]
Begin contracting edges based on scores, avoiding contraction of already contracted edge nodes. This reduces nodes by half each time.
The features of the new node are obtained by summing the features of the two endpoints and scaling by the edge score:
\[ \hat{n}_{i j}=s_{i j}\left(n_{i}+n_{j}\right) \]
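A hedged sketch of the procedure above, assuming dense tensors, a directed `edge_index`, and no batching; torch_geometric ships an official `EdgePooling` layer, so this is only to make the scoring and greedy contraction steps concrete:

```python
import torch
import torch.nn as nn

class EdgePoolSketch(nn.Module):
    """Illustrative EdgePool step: score edges, then greedily contract them."""

    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(2 * dim, 1)          # r(e_ij) = W (n_i || n_j) + b

    def forward(self, x, edge_index):
        src, dst = edge_index                      # (2, E) directed edges
        raw = self.lin(torch.cat([x[src], x[dst]], dim=-1)).squeeze(-1)

        # softmax over the edges sharing the same target node, shifted by 0.5
        score = torch.zeros_like(raw)
        for j in dst.unique():
            mask = dst == j
            score[mask] = 0.5 + torch.softmax(raw[mask], dim=0)

        # greedy contraction in descending score order, skipping merged nodes
        merged, new_feats = set(), []
        cluster = torch.full((x.size(0),), -1, dtype=torch.long)
        for e in torch.argsort(score, descending=True).tolist():
            i, j = src[e].item(), dst[e].item()
            if i in merged or j in merged or i == j:
                continue
            merged.update((i, j))
            cluster[i] = cluster[j] = len(new_feats)
            new_feats.append(score[e] * (x[i] + x[j]))   # n_hat = s_ij (n_i + n_j)
        for i in range(x.size(0)):                        # keep leftover nodes as-is
            if i not in merged:
                cluster[i] = len(new_feats)
                new_feats.append(x[i])
        # the coarsened adjacency would be built by mapping old edges through `cluster`
        return torch.stack(new_feats), cluster
```

Each call roughly halves the number of nodes, which is why stacking a few EdgePool layers mimics the growing receptive field of CNN pooling.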
Discourse-Aware Hierarchical Attention Network for Extractive Single-Document Summarization
Using a hierarchical LSTM encoder + LSTM decoder extractive summarizer as the baseline, the authors add a three-component attention mechanism to incorporate discourse information. Specifically, discourse information here means sentence-level elaboration relations, where one sentence gives a detailed explanation of, or supplementary information about, another. The authors argue that document summarization, being a discourse-level task, naturally needs discourse information.
The authors use attention to learn directed elaboration edges between sentences (see the figure in the paper).
Three components:
Parent Attention: Use the hierarchical encoder to obtain sentence representations, then use attention to model the probability that sentence k is the parent of sentence i, with the elaboration edge pointing from k to i (self-attention is excluded)
\[ \begin{aligned} p(k | i, \mathbf{H}) &=\operatorname{softmax}(g(k, i)) \\ g(k, i) &=v_{a}^{\mathrm{T}} \tanh \left(U_{a} \cdot H_{k}+W_{a} H_{i}\right) \end{aligned} \]
Recursive Attention: Calculate multi-hop parent nodes, obtaining the probability of k being the d-hop parent node of i. This can be simply achieved by powering the attention matrix, with special handling for the root sentence (virtual node) which has no parent nodes:
\[ \alpha_{d, k, i}=\left\{\begin{array}{ll}{p(k | i, \mathbf{H})} & {(d=1)} \\ {\sum_{l=0}^{N} \alpha_{d-1, k, l} \times \alpha_{1, l, i}} & {(d>1)}\end{array}\right. \]
Selective Attention: Combine the attention information by first taking, at each hop, the attention-weighted sum of parent information for sentence i:
\[ \gamma_{d, i}=\sum_{k=0}^{N} \alpha_{d, k, i} H_{k} \]
Then compute the hop weights with a selective attention that depends on sentence i's encoder and decoder states \(H_i, s_i\) and on the encoder states of its parent nodes (denoted \(K\)):
\[ \beta_{d, i}=\operatorname{softmax}\left(\mathbf{W}_{\beta}\left[H_{i} ; s_{i} ; K\right]\right) \]
Finally, obtain the weighted information over all hops and append it to the decoder input:
\[ \begin{aligned} \Omega_{i} &= \sum_{d} \beta_{d, i} \gamma_{d, i} \\ p\left(y_{i} \mid \mathbf{x}, \theta\right) &= \operatorname{softmax}\left(\mathbf{W}_{o} \tanh \left(\mathbf{W}_{c^{\prime}}\left[H_{i} ; s_{i} ; K ; \Omega_{i}\right]\right)\right) \end{aligned} \]
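A compact sketch of the three components, assuming sentence encoder states `H` (index 0 is the virtual root) and aligned decoder states `s`; the layer names, the choice of `K` as the expected one-hop parent state, and the simplified root handling are my assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class DiscourseAttention(nn.Module):
    """Parent, recursive, and selective attention over sentence states."""

    def __init__(self, dim, max_hops=3):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)
        self.W = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)
        self.W_beta = nn.Linear(3 * dim, max_hops)
        self.max_hops = max_hops

    def forward(self, H, s):
        """H, s: (N+1, d) encoder / decoder states; index 0 is the virtual root."""
        n = H.size(0)

        # 1) parent attention: alpha[k, i] = p(k is the parent of i), no self-attention
        g = self.v(torch.tanh(self.U(H).unsqueeze(1) + self.W(H).unsqueeze(0))).squeeze(-1)
        g = g.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))
        alpha = torch.softmax(g, dim=0)
        # (the root is treated like any other sentence here; the paper handles it specially)

        # 2) recursive attention: d-hop parents via repeated matrix products
        alphas = [alpha]
        for _ in range(self.max_hops - 1):
            alphas.append(alphas[-1] @ alpha)      # alpha_d = alpha_{d-1} @ alpha_1

        # 3) selective attention: weight each hop, then aggregate
        K = alpha.t() @ H                           # expected 1-hop parent state (assumption)
        beta = torch.softmax(self.W_beta(torch.cat([H, s, K], dim=-1)), dim=-1)  # (N+1, hops)
        gamma = torch.stack([a.t() @ H for a in alphas], dim=1)                  # (N+1, hops, d)
        omega = (beta.unsqueeze(-1) * gamma).sum(dim=1)                          # (N+1, d)
        return omega, alpha
```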
The authors mention that Rhetorical Structure Theory (RST) parsing currently lacks accurate off-the-shelf tools. They propose a "joint learning" framework, which in practice means using an existing RST parser during training to obtain elaboration edges that guide the Parent Attention; no parser is needed at test time. Parser errors still have a noticeable impact on the model. The objective function is:
\[ -\log p(\mathbf{y} | \mathbf{x})-\lambda \cdot \sum_{k=1}^{N} \sum_{i=1}^{N} E_{k, i} \log \alpha_{1, k, i} \]
The second term guides attention using parser-obtained edges
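A minimal sketch of this objective, assuming `E` is a binary matrix of parser-derived elaboration edges and `alpha1` is the first-hop parent attention (names are illustrative):

```python
import torch

def joint_loss(nll_loss, alpha1, E, lam=1.0, eps=1e-9):
    """nll_loss: extraction loss -log p(y|x); alpha1, E: (N+1, N+1) matrices of
    first-hop parent attention and parser-derived elaboration edges."""
    guide = -(E * torch.log(alpha1 + eps)).sum()   # cross-entropy-style attention guidance
    return nll_loss + lam * guide
```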
The authors first use the HILDA parser to obtain RST discourse annotations, then convert them to a dependency format using the method from "Single-document summarization as a tree knapsack problem".
Although the model still depends on a parser during training, the authors compare against two baselines: one that uses no parser and simply treats the previous sentence as the elaboration parent, and one that lets the attention learn edges on its own. The parser-informed attention model outperformed both. On the Daily Mail dataset, its advantage was larger for short summaries (75 words) than for long ones (275 words), partly because ROUGE favors longer outputs; this suggests discourse information indeed helps extract the most important content under a tight length budget.
This paper can be seen as an attention model (self-attention with multiple hops) that injects prior discourse information to achieve better results on single-document extractive summarization.
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
- A NAACL 2018 paper considering discourse information for abstractive summarization on research paper datasets
- Here, discourse is narrowly defined as the sections of a research paper; the model is essentially a hierarchical attention model built on top of the pointer-generator (a rough sketch follows this list)
- A valuable contribution is the release of two large-scale long-document summarization datasets, PubMed and arXiv, each with tens of thousands of examples, average source lengths of over 3,000 and 4,900 words respectively, and average summary lengths exceeding 100 words; they remain useful ultra-long single-document summarization datasets.
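A rough sketch of the section-over-word hierarchical attention idea, with the pointer-generator copy mechanism omitted; the interfaces and the exact way the combined attention is formed are illustrative assumptions rather than the paper's precise equations:

```python
import torch
import torch.nn.functional as F

def discourse_aware_context(dec_state, section_states, word_states, score):
    """Section-level attention rescales word-level attention.

    dec_state:      (d,)             decoder hidden state
    section_states: (S, d)           one encoder state per section
    word_states:    list of (L_j, d) word encoder states per section
    score:          callable(query, keys) -> unnormalized attention scores
    """
    beta = F.softmax(score(dec_state, section_states), dim=0)   # which section to look at
    weighted = []
    for j, H_j in enumerate(word_states):
        alpha_j = F.softmax(score(dec_state, H_j), dim=0)       # which words within section j
        weighted.append(beta[j] * alpha_j)                      # rescale by the section weight
    attn = torch.cat(weighted)                                  # combined attention (sums to 1)
    ctx = attn @ torch.cat(word_states)                         # context vector for the decoder
    return ctx, attn

# toy usage with a simple dot-product scorer
d = 8
score = lambda q, keys: keys @ q
ctx, attn = discourse_aware_context(
    torch.randn(d), torch.randn(3, d),
    [torch.randn(5, d), torch.randn(4, d), torch.randn(6, d)], score)
```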
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Highlight: Sentence-pair regression with BERT is slow at inference time. The authors propose a Siamese BERT network that, by their accounting, speeds up inference by a factor of 1,123,200
- That speedup figure needs context: vanilla BERT is slow for semantic matching because every candidate pair requires feeding both sentences through BERT to get a score. The authors slightly modify BERT so that pooled token embeddings serve as a sentence feature vector, and matching is done directly with cosine similarity
- They then show that raw BERT embeddings are not good features for semantic matching. SBERT adds a regression or classification objective on top of BERT and also introduces a triplet loss, significantly outperforming raw BERT; it can be seen as fine-tuning BERT for semantic matching (a minimal usage sketch follows this list)
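A minimal usage sketch with the authors' sentence-transformers package (the checkpoint name is just one of many available SBERT-style models):

```python
from sentence_transformers import SentenceTransformer, util

# load a pretrained SBERT-style encoder (the checkpoint name is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing a guitar.",
             "Someone is performing music on a guitar.",
             "The weather is cold today."]

# each sentence is encoded once and independently, so embeddings can be cached
embeddings = model.encode(sentences, convert_to_tensor=True)

# pairwise cosine similarity replaces a quadratic number of cross-encoder passes
print(util.cos_sim(embeddings, embeddings))
```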
Towards a Human-like Open-Domain Chatbot
- Highlight: from Google, and big. The paper digs into detailed design choices
- 2.6 billion parameters trained on a 40-billion-token corpus. To capture multi-turn dialogue quality, the authors propose Sensibleness and Specificity Average (SSA) as a metric, and find that the models with the lowest perplexity also achieve the best SSA
- The authors use the Evolved Transformer to train a seq2seq model with multi-turn dialogue context as input and an 8k BPE vocabulary, reaching a test-set perplexity of just 10.2. The model outperforms other dialogue systems that rely on additional rules and knowledge, once again showing that deep neural networks can work wonders given enough data and training
- SSA measures two aspects: sensibleness and specificity. It is a human-evaluated metric: raters first judge whether a response is sensible and, only if it is, then judge whether it is specific, since systems that score well on automatic metrics often fall back on vague "I don't know" responses. The authors found that SSA correlates with human judgments of how human-like a system is
- SSA has two evaluation settings: a fixed test set of 1,477 multi-turn dialogue contexts, and interactive chats with the system lasting 14 to 28 turns
- The authors provide numerous training and testing details, essentially highlighting the model's scale: trained on one TPU-v3 Pod for 30 days, 164 epochs, observing 10T tokens in total
- This powerful yet simple model doesn't need complex decoding to produce high-quality, diverse responses. The authors use sample-and-rank: divide the logits by a temperature T, apply softmax, independently sample several sequences from the resulting distribution, and keep the one with the highest probability. A higher temperature flattens the logits, which helps generate contextually relevant rare words. Somewhat surprisingly, sample-and-rank outperforms beam search, provided the model reaches a low enough perplexity. The authors set the temperature to 0.88 and sample 20 candidates (a small sketch follows this list)
- Statistical tests revealed a correlation coefficient exceeding 0.9 between perplexity and SSA
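A hedged sketch of sample-and-rank with a generic autoregressive model; the `model(ids) -> logits` interface, `eos_id`, and `max_len` are placeholder assumptions, not Meena's actual decoding code:

```python
import torch

@torch.no_grad()
def sample_and_rank(model, context_ids, n_samples=20, temperature=0.88,
                    max_len=40, eos_id=1):
    """Sample N candidates with temperature T, return the most likely one."""
    best, best_logp = None, float("-inf")
    for _ in range(n_samples):
        ids, logp = list(context_ids), 0.0
        for _ in range(max_len):
            logits = model(torch.tensor([ids]))[0, -1]           # next-token logits
            probs = torch.softmax(logits / temperature, dim=-1)  # tempered sampling dist.
            tok = torch.multinomial(probs, 1).item()
            logp += torch.log_softmax(logits, dim=-1)[tok].item()  # untempered log-prob
            ids.append(tok)
            if tok == eos_id:
                break
        if logp > best_logp:                                     # rank by total log-probability
            best, best_logp = ids[len(context_ids):], logp
    return best
```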