Structured Neural Summarization, Paper Reading

Posted on 2020-02-28 Edited on 2025-01-30 In NLP Views: Word count in article: 12k Reading time ≈ 11 mins.

reading note for STRUCTURED NEURAL SUMMARIZATION.

Intuition

A natural idea of introducing GNN into seq2seq is to first obtain token representations using a sequence encoder, then construct relationships between tokens, build a graph, and feed the graph and token representations into a GNN to obtain new token representations.
The problem is that for summarization, the intuitive approach would be to use sentences as nodes and construct a relationship graph between sentences. This would result in sentence-level representations, lacking word-level granularity, which would make it difficult to implement an attention-based decoder.
The authors' approach is quite straightforward: directly create a word-level GNN, where a document with 900 words becomes a graph with 900 nodes. The edge relationships are heterogeneous, consisting of three types:
- All words in a sentence connect to an additional sentence node, and all words in an entity connect to an additional entity node. These edges are called IN
- Connecting the previous word to the next word, and the previous sentence to the next sentence. These edges are called NEXT
- Coreference resolution, referred to as REF The overall structure is shown in the following image:

GGNN

The GNN used by the authors is a Gated Graph Neural Network. The original paper: GATED GRAPH SEQUENCE NEURAL NETWORKS

\[ \begin{aligned} \mathbf{h}_{v}^{(1)} &=\left[\boldsymbol{x}_{v}^{\top}, \mathbf{0}\right]^{\top} \\ \mathbf{a}_{v}^{(t)} &=\mathbf{A}_{v:}^{\top}\left[\mathbf{h}_{1}^{(t-1) \top} \ldots \mathbf{h}_{|\mathcal{V}|}^{(t-1) \top}\right]^{\top}+\mathbf{b} \\ \mathbf{z}_{v}^{t} &=\sigma\left(\mathbf{W}^{z} \mathbf{a}_{v}^{(t)}+\mathbf{U}^{z} \mathbf{h}_{v}^{(t-1)}\right) \end{aligned} \\ \begin{aligned} \mathbf{r}_{v}^{t} &=\sigma\left(\mathbf{W}^{r} \mathbf{a}_{v}^{(t)}+\mathbf{U}^{r} \mathbf{h}_{v}^{(t-1)}\right) \\ \widetilde{\mathbf{h}_{v}^{(t)}} &=\tanh \left(\mathbf{W} \mathbf{a}_{v}^{(t)}+\mathbf{U}\left(\mathbf{r}_{v}^{t} \odot \mathbf{h}_{v}^{(t-1)}\right)\right) \\ \mathbf{h}_{v}^{(t)} &=\left(1-\mathbf{z}_{v}^{t}\right) \odot \mathbf{h}_{v}^{(t-1)}+\mathbf{z}_{v}^{t} \odot \widetilde{\mathbf{h}_{v}^{(t)}} \end{aligned} \\ \]
GGNN was published in 2015, improving upon the GNN model from 2009.
The original GNN model essentially uses the graph's topological relationships, masking some edges in a multi-layer linear network. The representation of nodes at each layer is obtained through linear transformations of neighboring nodes from the previous layer (propagation), with the final layer using a linear layer to output node labels.
GGNN considers directed and heterogeneous edges. Its adjacency matrix A, as shown in the figure, is a linear layer of \(\mathbf{A} \in \mathbb{R}^{D|\mathcal{V}| \times 2 D|\mathcal{V}|}\), with twice the width representing two output channels in both directions. The input node representation matrix is \(\mathbb{R}^{D|\mathcal{V}|}\), where \(A\) contains parameters that depend on edge type and direction, essentially obtained through embedding lookup. This is followed by a GRU-like update, with \(z\) and \(r\) being the update and reset gates, respectively. The formula meanings are as follows:
- 1: Initialize node embeddings by adding hand-crafted features according to the specific task and padding to the same length
- 2: Obtain propagated information through a linear layer containing adjacency information
- 3: Calculate the update gate based on the previous layer's state and propagated information
- 4: Calculate the reset gate based on the previous layer's state and propagated information
- 5,6: Similar to GRU
For the output part, a simple linear layer can be applied to each node for node-level tasks. To obtain the graph's representation, a gating mechanism can be used (originally described as attention):

\[ \mathbf{h}_{\mathcal{G}}=\tanh \left(\sum_{v \in \mathcal{V}} \sigma\left(i\left(\mathbf{h}_{v}^{(T)}, \boldsymbol{x}_{v}\right)\right) \odot \tanh \left(j\left(\mathbf{h}_{v}^{(T)}, \boldsymbol{x}_{v}\right)\right)\right) \]

GGSNN

The gated GNN can be extended to sequence output, namely GATED GRAPH SEQUENCE NEURAL NETWORK.
As shown in the figure, a typical seq2seq model needs to encode the input sequence and then decode step by step. However, in a graph, one step already contains all sequence token information, with multiple layers being a stacking of depth rather than temporal layers. Therefore, we can start decoding at any depth, similar to CRF, as shown in the figure: \(o\) is the output, \(X^k\) is the input embedding matrix at the k-th output step, \(H^{k,t}\) represents the k-th output, and the node embedding of the entire input is propagated t steps in depth. Similar to transition and emission matrices, the authors used two GGNNs \(F_o,F_x\) to complete the transfer and emission of hidden states. They can share parameters in the propagation part. Although only \(F_x\) transferring \(H\) to \(X\) is written, in practice, similar to LSTM, \(X^{k+1}\) is also determined by \(X^k\):

\[ \boldsymbol{x}_{v}^{(k+1)}=\sigma\left(j\left(\mathbf{h}_{v}^{(k, T)}, \boldsymbol{x}_{v}^{(k)}\right)\right) \]
Similarly, it's possible to directly input \(X\) for each decoding step, similar to teacher forcing
The experiments in the paper were on relatively small state spaces, different from text tasks. Refer to the usage in STRUCTURED NEURAL SUMMARIZATION

Sequence GNNs

The authors introduced GGNN into the encoding side, equivalent to supplementing the traditional seq2seq encoder with a GNN, but the encoder output remains unchanged, and the decoder remains unchanged (abandoning the GGSNN decoder design)
First, the authors described GGNN in clearer language, with each step including propagation and update
- Propagation: \(\boldsymbol{m}_{v}^{(i)}=g\left(\left\{f_{k}\left(\boldsymbol{h}_{u}^{(i)}\right) | \text { there is an edge of type } k \text { from } u \text { to } v\right\}\right.)\), i.e., collecting and summing neighboring node information using edge-related linear transformations, where \(f\) is a linear layer and \(g\) is summation
- Update: \(\boldsymbol{h}_{v}^{(i+1)}=\operatorname{GRU}\left(\boldsymbol{m}_{v}^{(i)}, \boldsymbol{h}_{v}^{(i)}\right)\)
In seq2seq, the encoder must provide two pieces of information: token representation and context representation. Token-level representation is obtained through GNN node embeddings, and for context-level representation, in addition to the gated weighted sum of nodes used in GGNN, they also concatenated the hidden state before and after inputting the graph, seemingly concatenating the hidden states of all nodes before and after graph input as the final node embedding. The RNN output is directly concatenated with the graph embedding and then passed through a linear layer. Note that the RNN output is essentially a representation of the graph (entire sequence), so it can be directly concatenated:

\[ \left[\mathbf{e}_{1}^{\prime} \ldots \mathbf{e}_{N}^{\prime}\right]=\operatorname{GNN}\left(\left(S,\left[R_{1} \ldots R_{K}\right],\left[\mathbf{e}_{1} \ldots \mathbf{e}_{N}\right]\right)\right) \\ \]

\[ \sigma\left(w\left(\boldsymbol{h}_{v}^{(T)}\right)\right) \in[0,1] \\ \]

\[ \hat{\mathbf{e}}=\sum_{1<i<N} \sigma\left(w\left(\mathbf{e}_{i}^{\prime}\right)\right) \cdot \aleph\left(\mathbf{e}_{i}^{\prime}\right) \\ \]

\[ Embedding_{graph} = W \cdot(\mathbf{e} \hat{\mathbf{e}}) \\ \]
In practical engineering implementation, batching graphs of different sizes is inconvenient. The authors used two tricks:
- Standard GNN approach: Combine small graphs into a large graph with multiple connected subgraphs as a batch
- Since copy and attention mechanisms require calculating weights across the entire input sequence, after combining into a large graph, the authors also preserved each node's index in the small graph. Then, using TensorFlow's unsorted segment * operator (performing operations on segments of different lengths), they can efficiently and numerically stably perform softmax over the variable number of node representations for each graph
The authors used a simple LSTM encoder and decoder configuration, mainly modifying the pointer generator code. GNN was stacked eight layers
The final results did not surpass the pointer generator, but the ablation with the pointer mechanism was quite significant, as shown in the following figure:
The authors did not provide much analysis of the results, as the paper used three datasets, with the other two being code summaries that have naturally structured data, thus performing well. On purely natural language datasets like CNNDM, the performance was not particularly outstanding
However, in the ablation experiments, it's worth noting that even without adding coreference information, simply using GNN to process sentence structure performed better than LSTM