Paper reading on Knowledge Graphs

  • Knowledge Graph Special Collection
    • Entity Alignment in Cross-lingual Knowledge Graphs
    • Knowledge Graph Language Model
    • Dynamic Knowledge Graph Dialogue Generation
    • Graph2Seq
    • Graph Matching Network
    • Dynamic Knowledge Graph Update
    • Attention-based Embeddings for Relation Prediction

Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network

  • Research: Entity Alignment in Cross-lingual Knowledge Graphs

  • Generally, existing approaches project the entities of each language's knowledge graph into low-dimensional vectors and then learn a similarity function between them

  • The problem is that this relies on the assumption that the same entity has an identical neighborhood structure in the different language knowledge graphs, which is not always true. As a result, traditional methods handle entities with few aligned neighbors, or few neighbors at all, poorly

  • The authors propose a topic entity graph to encode the context information of entity nodes, transforming the node embedding match into a graph match between topic entity graphs

  • Here, the topic entity is the entity to be aligned. Its topic entity graph is the subgraph formed by the entity itself and its one-hop neighbor entities; if two of these entities are not directly connected in the original knowledge graph, an edge is added between them in the topic entity graph

  • After obtaining the topic entity graphs, a four-layer network calculates the similarity between two topic entity graphs:

    • Input representation layer: learns an embedding for each entity in the topic entity graph. A word-level LSTM first produces initial embeddings; then, since the graph is directed, incoming and outgoing neighbors are aggregated separately (FFN + mean pooling), concatenated with the entity's previous embedding, and updated through another FFN; this is repeated for K iterations

    • Node-level local matching layer: matches all entity nodes of the two graphs against each other with attention-based matching. For a node i in topic entity graph 1, the cosine similarities to all nodes of topic entity graph 2 serve as attention weights; the weighted sum of graph 2's node embeddings gives an attentive graph-2 embedding, which is then compared with node i's embedding by a multi-perspective cosine distance. "Multi-perspective" means l perspectives, each represented by a d-dimensional weight vector (d is the embedding dimension); for one perspective, both embeddings are first multiplied element-wise by that weight vector and then the cosine similarity is computed. The l weight vectors form a matrix \(W \in R^{l \times d}\), as follows (a code sketch of this matching appears after this list):

      \[ score_{k} = \cos\left(W_{k} \circ \mathbf{e}_{1},\; W_{k} \circ \mathbf{e}_{2}\right) \]

    • Global matching layer: The local matching layer still has the previously mentioned problem of not being friendly to nodes with few co-occurring neighbors. Here, a global matching is needed. Specifically, use a GCN to propagate the local match embedding (the vector obtained from the previous layer's multi-perspective cosine score) to the entire topic entity graph, then perform an FFN + max/mean pooling on the entire graph to obtain the graph matching vector for the two graphs

    • Prediction layer: Concatenate the graph matching vectors of the two topic entity graphs and send them to a softmax for prediction
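
  • A minimal PyTorch sketch of the node-level attention plus multi-perspective cosine matching referenced above. The tensor names, the softmax normalization of the attention weights, and the sizes (l perspectives, dimension d) are illustrative assumptions, not the paper's code:

    ```python
    import torch
    import torch.nn.functional as F

    def multi_perspective_cosine(e1, e2, W):
        """e1, e2 are (d,) embeddings, W is (l, d): each row re-weights the
        embeddings element-wise before a cosine similarity."""
        p1 = W * e1            # (l, d) element-wise re-weighting per perspective
        p2 = W * e2
        return F.cosine_similarity(p1, p2, dim=-1)   # (l,) matching vector

    def node_local_match(g1_nodes, g2_nodes, W):
        """For every node in graph 1, attend over graph 2's nodes and compute the
        multi-perspective cosine match against the attentive graph-2 embedding."""
        # cosine similarity between every pair of nodes (n1, n2)
        att = F.cosine_similarity(g1_nodes.unsqueeze(1), g2_nodes.unsqueeze(0), dim=-1)
        att = F.softmax(att, dim=-1)        # normalization choice is an assumption
        g2_context = att @ g2_nodes         # (n1, d) attentive graph-2 embedding
        return torch.stack([multi_perspective_cosine(e, c, W)
                            for e, c in zip(g1_nodes, g2_context)])  # (n1, l)

    # toy example: 4 and 5 nodes, embedding dim 8, l = 3 perspectives
    g1, g2, W = torch.randn(4, 8), torch.randn(5, 8), torch.randn(3, 8)
    print(node_local_match(g1, g2, W).shape)  # torch.Size([4, 3])
    ```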

  • During training, 20 negative samples are generated heuristically for each positive sample, 10 in each direction: word-level average embeddings serve directly as entity feature vectors, and the 10 most similar entities are taken as negatives.

Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

  • The authors want to maintain a local knowledge graph in the language model to manage detected facts and use this graph to query unknown facts for text generation, called KGLM (Knowledge Graph Language Model)

  • Assuming the entity set is \(\mathcal{E}\), KGLM predicts:

    \[ p(x_t, \mathcal{E}_t \mid x_{1:t-1}, \mathcal{E}_{1:t-1}) \]

  • The process of generating the next word can be split as follows:

    • Next word is not an entity: Calculate probability in the normal vocabulary range

    • Next word is a completely new entity: Calculate probability across both normal vocabulary and all entities

    • Next word is an entity related to an already seen entity: First select a previously seen entity as the parent node, then select a child node, and then calculate probability across the normal vocabulary and all aliases of the child node

  • The authors use LSTM as the base model for the language model. All selections - selecting new entities, selecting parent nodes, selecting child nodes - are based on the LSTM's hidden state (divided into three parts), with pre-trained embeddings of entities and relations as input dependencies, followed by probability calculation via softmax

  • To implement such a model, the dataset should provide entity information. The authors proposed the Linked WikiText2 dataset, with the following construction process:

    • Create entity links based on Wikipedia links, use neural-el to identify additional links in the Wikidata database, and use Stanford CoreNLP for co-reference resolution

    • Construct local knowledge graph: Next, build parent relations between entities. For each entity a, add all related entities {b} in Wikidata as matching candidates. If a related entity b appears in subsequent paragraphs, set entity a as the parent node of entity b

    • The above method only creates an initial set and needs continuous expansion. The authors also created alias tables for dates, quantifiers, etc.

    • Below is a representation of a sentence in Linked WikiText2 (figure omitted). Compared to WikiText's API-based extraction, Linked WikiText2 works directly on the original HTML, preserving more link information

  • Train and Inference: First, use TransE algorithm for pre-training entities and relations. Given a triple (p,r,e), the objective is to minimize the distance:

    \[ \delta\left(\mathbf{v}_{p}, \mathbf{v}_{r}, \mathbf{v}_{e}\right)=\left\|\mathbf{v}_{p}+\mathbf{v}_{r}-\mathbf{v}_{e}\right\|^{2} \]

    A hinge loss is used so that negative samples score at least \(\gamma\) worse (a larger distance) than positive samples:

    \[ \mathcal{L}=\max \left(0, \gamma+\delta\left(\mathbf{v}_{p}, \mathbf{v}_{r}, \mathbf{v}_{e}\right)-\delta\left(\mathbf{v}_{p}^{\prime}, \mathbf{v}_{r}, \mathbf{v}_{e}^{\prime}\right)\right) \]
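
  • A minimal sketch of the TransE distance and the hinge loss above; the negative-sampling scheme (corrupting tails by shuffling within the batch) is an illustrative assumption:

    ```python
    import torch

    def transe_distance(head, rel, tail):
        """delta(p, r, e) = || v_p + v_r - v_e ||^2 (squared L2, as in the note)."""
        return ((head + rel - tail) ** 2).sum(dim=-1)

    def hinge_loss(pos_h, pos_r, pos_t, neg_h, neg_t, gamma=1.0):
        """Positive triples should score at least gamma lower (closer) than negatives."""
        pos = transe_distance(pos_h, pos_r, pos_t)
        neg = transe_distance(neg_h, pos_r, neg_t)   # same relation, corrupted entities
        return torch.clamp(gamma + pos - neg, min=0).mean()

    # toy batch: 16 triples with 50-dimensional embeddings
    d = 50
    h, r, t = torch.randn(16, d), torch.randn(16, d), torch.randn(16, d)
    neg_t = t[torch.randperm(16)]                    # corrupt tails by shuffling (assumption)
    print(hinge_loss(h, r, t, h, neg_t).item())
    ```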

  • Although the entire process is generative, all variables are visible, so it can be trained end-to-end. For entity nodes with multiple parent nodes, probability needs to be normalized

  • During inference, we have no annotation information. We want the marginal probability of the words \(\mathbf{x}\) with the entity annotations \(\mathcal{E}\) summed out, not the joint probability (we only want words, with the entity information marginalized), but there are too many possible annotations to enumerate and sum. The authors therefore use importance sampling:

    \[ \begin{aligned} p(\mathbf{x}) &=\sum_{\mathcal{E}} p(\mathbf{x}, \mathcal{E})=\sum_{\mathcal{E}} \frac{p(\mathbf{x}, \mathcal{E})}{q(\mathcal{E} | \mathbf{x})} q(\mathcal{E} | \mathbf{x}) \\ & \approx \frac{1}{N} \sum_{\mathcal{E} \sim q} \frac{p(\mathbf{x}, \mathcal{E})}{q(\mathcal{E} | \mathbf{x})} \end{aligned} \]

    where the proposal distribution is given by a discriminative KGLM, i.e., another KGLM trained to predict the annotation of the current token
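
  • A minimal, generic sketch of the importance-sampling estimate above. The callables `log_p_joint` and `log_q_posterior` stand in for the generative KGLM and the discriminative proposal model; both, and the toy numbers, are assumptions for illustration:

    ```python
    import math

    def importance_estimate(samples, log_p_joint, log_q_posterior):
        """Estimate p(x) = E_{E~q}[ p(x, E) / q(E | x) ] from N sampled annotations.

        `samples` are annotations E drawn from the proposal q(E | x); the two
        callables return log p(x, E) and log q(E | x) for a given annotation."""
        weights = [math.exp(log_p_joint(e) - log_q_posterior(e)) for e in samples]
        return sum(weights) / len(weights)

    # toy example with made-up log-probabilities for three sampled annotations
    annotations = ["E1", "E2", "E3"]
    p_hat = importance_estimate(
        annotations,
        log_p_joint=lambda e: {"E1": -10.0, "E2": -11.0, "E3": -12.5}[e],
        log_q_posterior=lambda e: {"E1": -1.0, "E2": -1.5, "E3": -2.0}[e],
    )
    print(p_hat)
    ```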

  • The results are impressive. KGLM, using only LSTM with a small number of parameters, shows a significant advantage in entity word prediction compared to the large-scale GPT-2 model.

DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs

  • The authors propose a new task of dialogue generation based on dynamic knowledge graphs, aiming to capture relationships and generalize knowledge graph-based dialogue generation to zero-shot scenarios

  • The task description is divided into two steps:

    • For each dialogue turn t, given input x and graph K, generate the correct answer y that includes the correct knowledge graph entities

    • When the knowledge graph is updated (only relations and recipients can be updated), the answer y should be correspondingly modified

  • To effectively measure the quality of dynamic knowledge graph dialogue, the authors propose two types of metrics:

    • Knowledge Entity Modeling: including accuracy of predicting known entities, entity word hit rate; true positive/false negative/false positive rates for distinguishing predicted entities from generic words; true positive/false negative/false positive rates for all entities in the knowledge graph

    • Graph Adaptability: The authors propose three methods of changing the graph, including shuffling and randomly replacing entities, observing whether the generated sequence is replaced and replaced correctly

  • The authors create a parallel corpus containing dialogue data from two TV series in Chinese and English, with detailed processing

  • The proposed Qadpt model modifies the seq2seq approach. First, the decoder's current state \(d_t\) generates a controller \(c_t\) to decide whether to select entities from the KG or generate general vocabulary. Similar to the copy mechanism, this selection is not hard but calculates probabilities separately, then concatenates two vocabulary lists and selects based on probability:

    \[ \begin{aligned} P\left(\{KB, \mathcal{W}\} \mid y_{1} y_{2} \ldots y_{t-1}, \mathbf{e}(x)\right) &=\operatorname{softmax}\left(\phi\left(\mathbf{d}_{t}\right)\right) \\ \mathbf{w}_{t} &=P\left(\mathcal{W} \mid y_{1} y_{2} \ldots y_{t-1}, \mathbf{e}(x)\right) \\ c_{t} &=P\left(KB \mid y_{1} y_{2} \ldots y_{t-1}, \mathbf{e}(x)\right) \\ \mathbf{o}_{t} &=\left\{c_{t} \mathbf{k}_{t} ; \mathbf{w}_{t}\right\} \end{aligned} \]
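
  • A minimal sketch of this output step: a two-way gate over {KB, W} mixes the entity distribution \(k_t\) with the word distribution. Scaling the word softmax by the gate's P(W) so that the concatenation sums to one is my reading of the formula; layer names and sizes are illustrative:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QadptOutput(nn.Module):
        """Mix a KG-entity distribution and a word distribution via a 2-way gate."""
        def __init__(self, hidden, vocab_size):
            super().__init__()
            self.gate = nn.Linear(hidden, 2)            # phi(d_t): P({KB, W} | ...)
            self.word_proj = nn.Linear(hidden, vocab_size)

        def forward(self, d_t, k_t):
            # d_t: (batch, hidden) decoder state;
            # k_t: (batch, num_entities) entity distribution from multi-hop reasoning
            gate = F.softmax(self.gate(d_t), dim=-1)     # [P(KB), P(W)]
            c_t, w_gate = gate[:, :1], gate[:, 1:]
            w_t = F.softmax(self.word_proj(d_t), dim=-1) * w_gate
            o_t = torch.cat([c_t * k_t, w_t], dim=-1)    # concatenated output distribution
            return o_t

    out = QadptOutput(hidden=64, vocab_size=100)
    d_t = torch.randn(2, 64)
    k_t = F.softmax(torch.randn(2, 20), dim=-1)          # 20 candidate entities
    print(out(d_t, k_t).sum(dim=-1))                     # sums to 1 per example
    ```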

  • Regarding how to generate entity candidate lists, they perform reasoning on the knowledge graph. Unlike typical attention-based graph embedding approaches, the authors use multi-hop reasoning

    • First, a path matrix and the adjacency tensor are combined into a transition matrix: the path matrix \(\mathbf{R}_t\), predicted from \(d_t\), gives the probability of selecting each relation for each entity, and multiplying it with the adjacency tensor yields the probability of moving to each recipient node:

    \[ \begin{aligned} \mathbf{R}_{t} &=\operatorname{softmax}\left(\theta\left(\mathbf{d}_{t}\right)\right) \\ \mathbf{A}_{i, j, \gamma} &=\left\{\begin{array}{ll}{1,} & {\left(h_{i}, r_{j}, t_{\gamma}\right) \in \mathcal{K}} \\ {0,} & {\left(h_{i}, r_{j}, t_{\gamma}\right) \notin \mathcal{K}}\end{array}\right.\\ \mathbf{T}_{t}=\mathbf{R}_{t} \mathbf{A} \end{aligned} \]

    • Then an initial distribution vector \(s\) (uniform?) is transformed n times by the transition matrix to obtain the probability of each entity appearing, which is passed to the controller (sketched below). A cross-entropy against the one-hot ground-truth entity is used as an auxiliary loss
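
  • A minimal sketch of the multi-hop reasoning step above: a per-entity relation distribution predicted from \(d_t\) is combined with the adjacency tensor into a transition matrix, which is applied n times to an initial distribution. Shapes, the `theta` linear layer, and the uniform initialization are assumptions:

    ```python
    import torch
    import torch.nn.functional as F

    def multi_hop_entity_distribution(d_t, theta, adjacency, n_hops=2):
        """adjacency: (num_entities, num_relations, num_entities) 0/1 tensor,
        A[i, j, k] = 1 iff (h_i, r_j, t_k) is in the KG."""
        num_entities, num_relations, _ = adjacency.shape
        # R_t: per-entity relation-selection probabilities predicted from d_t
        R_t = F.softmax(theta(d_t).view(num_entities, num_relations), dim=-1)
        # T_t[i, k] = sum_j R_t[i, j] * A[i, j, k]; rows are left unnormalized here
        T_t = torch.einsum('ij,ijk->ik', R_t, adjacency)
        s = torch.full((num_entities,), 1.0 / num_entities)   # initial distribution (uniform?)
        for _ in range(n_hops):
            s = s @ T_t                                       # one reasoning hop
        return s                                              # score for each entity

    # toy KG with 4 entities, 2 relations, decoder state of size 16
    E, R, H = 4, 2, 16
    adjacency = torch.zeros(E, R, E)
    adjacency[0, 0, 1] = adjacency[1, 1, 2] = adjacency[2, 0, 3] = 1.0
    theta = torch.nn.Linear(H, E * R)                         # stands in for theta(d_t)
    print(multi_hop_entity_distribution(torch.randn(H), theta, adjacency))
    ```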

Graph2Seq: Graph to Sequence Learning with Attention-based Neural Networks

  • As the name suggests, the input is graph-structured data, generating a sequence

  • Previous approaches linearize the graph into a sequence and feed it to Seq2Seq; the authors argue that this graph-to-sequence conversion before the encoder introduces an extra layer of information loss

  • A more natural approach would be for the decoder to perform attention on the encoded graph nodes, directly utilizing graph information

  • First, the graph encoder, which follows GraphSAGE. Notably, the authors handle directed graphs by splitting neighbors into the two edge directions, performing Aggregate and Update separately for each, and concatenating the results after k hops

  • For aggregation, the authors tried Mean, LSTM, and Pooling aggregators. Since neighbors are unordered, there is no natural sequence for an LSTM to exploit, so neighbors are fed to the LSTM aggregator in a random order (a sketch of the direction-aware mean aggregator follows)
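
  • A minimal sketch of the direction-aware mean aggregation mentioned above (GraphSAGE-style). The ReLU update, the per-hop concatenation, and all names are illustrative assumptions (the paper concatenates the two directions after k hops):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BiMeanAggregator(nn.Module):
        """One hop of mean aggregation over forward and backward neighbors."""
        def __init__(self, dim):
            super().__init__()
            self.fwd = nn.Linear(2 * dim, dim)   # update from [self ; mean(forward nbrs)]
            self.bwd = nn.Linear(2 * dim, dim)   # update from [self ; mean(backward nbrs)]

        def forward(self, h, adj):
            # adj: (n, n) 0/1 matrix, adj[i, j] = 1 for a directed edge i -> j
            deg_out = adj.sum(dim=1, keepdim=True).clamp(min=1)
            deg_in = adj.sum(dim=0, keepdim=True).clamp(min=1).t()
            fwd_mean = (adj @ h) / deg_out       # mean over forward (outgoing) neighbors
            bwd_mean = (adj.t() @ h) / deg_in    # mean over backward (incoming) neighbors
            h_fwd = F.relu(self.fwd(torch.cat([h, fwd_mean], dim=-1)))
            h_bwd = F.relu(self.bwd(torch.cat([h, bwd_mean], dim=-1)))
            return torch.cat([h_fwd, h_bwd], dim=-1)   # concatenate the two directions

    h = torch.randn(5, 8)                        # 5 nodes, dim 8
    adj = (torch.rand(5, 5) > 0.6).float()
    print(BiMeanAggregator(8)(h, adj).shape)     # torch.Size([5, 16])
    ```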

  • The authors believe that in addition to node embedding, graph embedding should be passed to the decoder. They adopted two methods to obtain graph embedding

    • Pooling-based: First pass all node embeddings through a fully connected layer, then perform element-wise max, min, average pooling. The authors found the three methods performed similarly and used max pooling as the default pooling method

    • Node-based: Add a super node to the graph, connected to all other nodes, using the embedding of this node after graph encoding as the graph embedding

  • Attention-based decoder: Graph embedding is input as the initial state of the decoder, and at each step, the decoder generates attention on all node embeddings and uses this as the decoder's hidden state for that time step

  • Focusing on NLG tasks, the authors evaluated on an SQL-to-text task, first building a graph from the SQL query and then applying Graph2Seq. The results were significantly better than a Seq2Seq model run directly on the SQL query text

  • In the Aggregate comparison experiment, they found Mean Pooling performed best, and for Graph Embedding, Pooling-based significantly outperformed Node-based

Graph Matching Networks for Learning the Similarity of Graph Structured Objects

  • A Google paper; the experiments and visualizations are as rich as ever
  • Two contributions:
    • Showed that GNNs can produce graph embeddings suitable for similarity computation
    • Proposed attention-based Graph Matching Networks, surpassing baselines

Graph Embedding Model

  • Baseline: Graph Embedding Model, a simple encode-propagation-aggregate model

  • Encode: node and edge features are passed through MLPs to obtain embeddings

  • Propagation: the embeddings of a center node, its neighboring nodes, and the connecting edges are combined into the center node's embedding at the next layer:

    \[ \begin{aligned} \mathbf{m}_{j \rightarrow i} &=f_{\text {message }}\left(\mathbf{h}_{i}^{(t)}, \mathbf{h}_{j}^{(t)}, \mathbf{e}_{i j}\right) \\ \mathbf{h}_{i}^{(t+1)} &=f_{\text {node }}\left(\mathbf{h}_{i}^{(t)}, \sum_{j:(j, i) \in E} \mathbf{m}_{j \rightarrow i}\right) \end{aligned} \]

  • Aggregate: the authors use a gating mechanism to weight and sum the node embeddings into the final graph embedding (a sketch follows the formula):

    \[ \mathbf{h}_{G}=\operatorname{MLP}_{G}\left(\sum_{i \in V} \sigma\left(\operatorname{MLP}_{\operatorname{gate}}\left(\mathbf{h}_{i}^{(T)}\right)\right) \odot \operatorname{MLP}\left(\mathbf{h}_{i}^{(T)}\right)\right) \]
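
  • A minimal sketch of the gated aggregation above, with single linear layers standing in for the MLPs (layer sizes are arbitrary):

    ```python
    import torch
    import torch.nn as nn

    class GatedGraphAggregator(nn.Module):
        """h_G = MLP_G( sum_i sigmoid(MLP_gate(h_i)) * MLP(h_i) )"""
        def __init__(self, node_dim, graph_dim):
            super().__init__()
            self.gate = nn.Linear(node_dim, graph_dim)       # MLP_gate (simplified)
            self.transform = nn.Linear(node_dim, graph_dim)  # MLP (simplified)
            self.out = nn.Linear(graph_dim, graph_dim)       # MLP_G (simplified)

        def forward(self, node_states):                      # (num_nodes, node_dim)
            gated = torch.sigmoid(self.gate(node_states)) * self.transform(node_states)
            return self.out(gated.sum(dim=0))                # (graph_dim,)

    agg = GatedGraphAggregator(node_dim=32, graph_dim=64)
    print(agg(torch.randn(10, 32)).shape)                    # torch.Size([64])
    ```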

Graph Matching Networks

  • GMN does not generate an embedding for each graph separately and then match them; instead it takes the two graphs as a joint input and directly outputs a similarity score

    \[ \begin{aligned} \mathbf{m}_{j \rightarrow i} &=f_{\text {message }}\left(\mathbf{h}_{i}^{(t)}, \mathbf{h}_{j}^{(t)}, \mathbf{e}_{i j}\right), \forall(i, j) \in E_{1} \cup E_{2} \\ \boldsymbol{\mu}_{j \rightarrow i} &=f_{\text {match }}\left(\mathbf{h}_{i}^{(t)}, \mathbf{h}_{j}^{(t)}\right) \\ \forall i \in V_{1}, j & \in V_{2}, \text { or } i \in V_{2}, j \in V_{1} \\ \mathbf{h}_{i}^{(t+1)} &=f_{\text {node }}\left(\mathbf{h}_{i}^{(t)}, \sum_{j} \mathbf{m}_{j \rightarrow i}, \sum_{j^{\prime}} \mu_{j^{\prime} \rightarrow i}\right) \\ \mathbf{h}_{G_{1}} &=f_{G}\left(\left\{\mathbf{h}_{i}^{(T)}\right\}_{i \in V_{1}}\right) \\ \mathbf{h}_{G_{2}} &=f_{G}\left(\left\{\mathbf{h}_{i}^{(T)}\right\}_{i \in V_{2}}\right) \\ s &=f_{s}\left(\mathbf{h}_{G_{1}}, \mathbf{h}_{G_{2}}\right) \end{aligned} \]

  • From the above formula, GMN made two modifications in the propagation phase

    • Since a pair of graphs is fed in together, neighbors in the message-passing step are drawn from the union of both edge sets \(E_1 \cup E_2\). In general, though, there are no edges between the two graphs, unless identical nodes in the two graphs are treated as sharing neighborhoods?

    • In addition to within-graph message passing, the authors compute a cross-graph match using a simple attention mechanism: the difference between two node embeddings is weighted by an attention coefficient derived from their similarity:

      \[ \begin{aligned} a_{j \rightarrow i} &=\frac{\exp \left(s_{h}\left(\mathbf{h}_{i}^{(t)}, \mathbf{h}_{j}^{(t)}\right)\right)}{\sum_{j^{\prime}} \exp \left(s_{h}\left(\mathbf{h}_{i}^{(t)}, \mathbf{h}_{j^{\prime}}^{(t)}\right)\right)} \\ \boldsymbol{\mu}_{j \rightarrow i} &=a_{j \rightarrow i}\left(\mathbf{h}_{i}^{(t)}-\mathbf{h}_{j}^{(t)}\right) \end{aligned} \]

    • When updating to the next layer of node embeddings, the match term effectively computes the attention-weighted difference between a node in one graph and all nodes of the other graph:

      \[ \sum_{j} \boldsymbol{\mu}_{j \rightarrow i}=\sum_{j} a_{j \rightarrow i}\left(\mathbf{h}_{i}^{(t)}-\mathbf{h}_{j}^{(t)}\right)=\mathbf{h}_{i}^{(t)}-\sum_{j} a_{j \rightarrow i} \mathbf{h}_{j}^{(t)} \]

    • This raises the complexity to \(O(|V_1| \cdot |V_2|)\), but the node-by-node comparison can pick up subtle differences and is more interpretable, so the algorithm is best suited to small graphs with high precision requirements
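
  • A minimal sketch of the cross-graph attention match, using a dot product for the similarity \(s_h\) (that choice, and the shapes, are assumptions):

    ```python
    import torch
    import torch.nn.functional as F

    def cross_graph_match(h1, h2):
        """Sum_j mu_{j->i} = h_i - sum_j a_{j->i} h_j, with attention from graph-1
        nodes (i) over graph-2 nodes (j). Returns one match vector per graph-1 node."""
        sim = h1 @ h2.t()                      # s_h as a dot product (assumption)
        a = F.softmax(sim, dim=-1)             # a_{j->i}, normalized over graph-2 nodes
        attended = a @ h2                      # sum_j a_{j->i} h_j
        return h1 - attended                   # weighted difference, fed into f_node

    h1 = torch.randn(4, 16)                    # graph 1: 4 nodes
    h2 = torch.randn(6, 16)                    # graph 2: 6 nodes
    print(cross_graph_match(h1, h2).shape)     # torch.Size([4, 16])
    ```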

  • For such matching problems, a pair or triplet loss can be used: the former labels a pair as similar or dissimilar, while the latter asks which of two candidates is more similar to an anchor graph. The authors give margin-based losses in both forms:

    \[ L_{\text {pair }}=\mathbb{E}_{\left(G_{1}, G_{2}, t\right)}\left[\max \left\{0, \gamma-t\left(1-d\left(G_{1}, G_{2}\right)\right)\right\}\right] \\ L_{\text {triplet }}=\mathbb{E}_{\left(G_{1}, G_{2}, G_{3}\right)}\left[\max \left\{0, d\left(G_{1}, G_{2}\right)-d\left(G_{1}, G_{3}\right)+\gamma\right\}\right] \\ \]

  • The authors specifically mention that, to speed up computation, graph embeddings can be binarized and compared with Hamming distance, sacrificing some of the precision of the continuous (Euclidean) space. Concretely, the embedding vector is passed through tanh and the averaged element-wise product is used as the graph similarity during training, with a loss that pushes this similarity toward 1 for positive pairs and toward -1 for negative pairs. This loss is more stable than the margin loss when used for retrieval (a sketch follows the formula):

    \[ s\left(G_{1}, G_{2}\right)=\frac{1}{H} \sum_{i=1}^{H} \tanh \left(h_{G_{1} i}\right) \cdot \tanh \left(h_{G_{2} i}\right) \\ L_{\text {pair }}=\mathbb{E}_{\left(G_{1}, G_{2}, t\right)}\left[\left(t-s\left(G_{1}, G_{2}\right)\right)^{2}\right] / 4 \\ \begin{aligned} L_{\text {triplet }}=\mathbb{E}_{\left(G_{1}, G_{2}, G_{3}\right)}\left[\left(s\left(G_{1}, G_{2}\right)-1\right)^{2}+\right.\\\left.\left(s\left(G_{1}, G_{3}\right)+1\right)^{2}\right] / 8 \end{aligned} \\ \]

    Dividing by 4 or 8 is to constrain the loss range to the [0,1] interval.
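
  • A minimal sketch of the approximate Hamming similarity and the corresponding pair/triplet losses above (batch size and dimensions are illustrative):

    ```python
    import torch

    def hamming_similarity(h_g1, h_g2):
        """s(G1, G2) = mean_i tanh(h_G1,i) * tanh(h_G2,i), a value in [-1, 1]."""
        return (torch.tanh(h_g1) * torch.tanh(h_g2)).mean(dim=-1)

    def pair_loss(h_g1, h_g2, t):
        """t is +1 for similar pairs, -1 for dissimilar pairs."""
        return ((t - hamming_similarity(h_g1, h_g2)) ** 2).mean() / 4

    def triplet_loss(h_g1, h_g2, h_g3):
        """G2 is similar to G1, G3 is dissimilar: push s(G1,G2) to 1, s(G1,G3) to -1."""
        s_pos = hamming_similarity(h_g1, h_g2)
        s_neg = hamming_similarity(h_g1, h_g3)
        return ((s_pos - 1) ** 2 + (s_neg + 1) ** 2).mean() / 8

    g1, g2, g3 = torch.randn(3, 8, 32).unbind(0)   # a batch of 8 graph embeddings, dim 32
    t = torch.tensor([1., -1., 1., 1., -1., -1., 1., -1.])
    print(pair_loss(g1, g2, t).item(), triplet_loss(g1, g2, g3).item())
    ```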

Learning to Update Knowledge Graphs by Reading News

  • An EMNLP 2019 paper. The authors are clearly basketball fans, using a very apt NBA player-transfer example to illustrate the problem the paper tackles: knowledge graph updating
  • For example, when a player transfers clubs, the player graphs of the two related clubs will change. The authors highlight two key points, as shown in Figure 1:
    • Knowledge graph updates only occur in the text subgraph, not the 1-hop subgraph
    • Traditional methods cannot extract hidden knowledge graph update information from text, such as a player's teammates changing after a transfer, which is not explicitly mentioned in the text but can be inferred
  • The overall structure is an encoder based on R-GCN and GAT, and a decoder based on DistMult, essentially modifying RGCN to an attention-based approach, with the decoder remaining unchanged, performing link prediction tasks

Encoder

  • Encoder: RGCN+GAT = RGAT. In RGCN, the forward propagation process is as follows:

    \[ \mathbf{H}^{l+1} = \sigma\left(\sum_{r \in \mathbf{R}} \hat{\mathbf{A}}_{r}^{l} \mathbf{H}^{l} \mathbf{W}_{r}^{l}\right) \]
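
  • A minimal dense-matrix sketch of the R-GCN propagation rule above; a real implementation would use sparse adjacency matrices and proper degree normalization, which are simplified here:

    ```python
    import torch
    import torch.nn as nn

    class RGCNLayer(nn.Module):
        """H^{l+1} = sigma( sum_r A_r H^l W_r ), one weight matrix per relation type."""
        def __init__(self, in_dim, out_dim, num_relations):
            super().__init__()
            self.weights = nn.Parameter(torch.randn(num_relations, in_dim, out_dim) * 0.01)

        def forward(self, H, A):
            # H: (n, in_dim); A: (num_relations, n, n) normalized adjacency per relation
            out = torch.einsum('rij,jd,rde->ie', A, H, self.weights)
            return torch.relu(out)

    n, R, d = 6, 3, 16
    A = torch.rand(R, n, n)
    A = A / A.sum(dim=-1, keepdim=True)            # row-normalize (stand-in for \hat{A}_r)
    layer = RGCNLayer(d, d, R)
    print(layer(torch.randn(n, d), A).shape)        # torch.Size([6, 16])
    ```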

  • This means assigning a parameter matrix to each type of heterogeneous edge, calculating independently, summing the results, and then applying an activation function. By replacing the adjacency matrix with an attention matrix, the attention calculation becomes:

    \[ a_{ij}^{lr} = \left\{ \begin{array}{ll} \frac{\exp \left(\text{att}^{lr}\left(\mathbf{h}_i^l, \mathbf{h}_j^l\right)\right)}{\sum_{k \in \mathcal{N}_i^r} \exp \left(\text{att}^{lr}\left(\mathbf{h}_i^l, \mathbf{h}_k^l\right)\right)}, & j \in \mathcal{N}_i^r \\ 0, & \text{otherwise} \end{array} \right. \]

  • The attention function \(\text{att}^{lr}\) is computed based on the text:

    • First, a bidirectional GRU encodes the sequence as \(u\).

    • Sequence attention is then used to obtain the contextual representation:

      \[ b_t^{lr} = \frac{\exp \left(\mathbf{u}_t^T \mathbf{g}_{\text{text}}^{lr}\right)}{\sum_{k=1}^{|S|} \exp \left(\mathbf{u}_k^T \mathbf{g}_{\text{text}}^{lr}\right)} \\ \mathbf{c}^{lr} = \sum_{t=1}^{|S|} b_t^{lr} \mathbf{u}_t \]

    • When applying attention, a trainable guidance vector \(g\) incorporates this contextual representation through simple linear interpolation:

      \[ \mathbf{g}_{\text{fin}}^{lr} = \alpha^{lr} \mathbf{g}_{\text{graph}}^{lr} + \left(1-\alpha^{lr}\right) \mathbf{U}^{lr} \mathbf{c}^{lr} \\ \text{att}^{lr}(\mathbf{h}_i^l, \mathbf{h}_j^l) = \mathbf{g}_{\text{fin}}^{lr}[\mathbf{h}_i^l \| \mathbf{h}_j^l] \]

  • In practical applications, aimed at the task of KG updates, the authors incorporated several techniques:

    • The number of parameters in RGCN/RGAT grows linearly with the number of edge (relation) types. To reduce parameters, the authors use basis decomposition: the \(k\) relation-specific weight matrices are expressed as linear combinations of \(b\) shared basis matrices (\(b < k\)); see the sketch after this list.
    • In sparse relational datasets, messages cannot aggregate effectively in one or two layers of RGAT. Thus, the authors added an artificial "SHORTCUT" relation between all entities in the graph. Using an information extraction tool, the "SHORTCUT" relation was refined into add, delete, and other categories to preliminarily capture transfer relationships (e.g., an individual leaving one club and joining another).
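
  • A minimal sketch of the basis decomposition referenced above: the per-relation weight matrices are built as linear combinations of a small set of shared bases (all sizes are illustrative):

    ```python
    import torch
    import torch.nn as nn

    class BasisDecomposedWeights(nn.Module):
        """W_r = sum_b coeff[r, b] * basis[b]: k relation matrices from b shared bases."""
        def __init__(self, num_relations, num_bases, in_dim, out_dim):
            super().__init__()
            self.bases = nn.Parameter(torch.randn(num_bases, in_dim, out_dim) * 0.01)
            self.coeffs = nn.Parameter(torch.randn(num_relations, num_bases))

        def forward(self):
            # (num_relations, in_dim, out_dim) without one full parameter matrix per relation
            return torch.einsum('rb,bio->rio', self.coeffs, self.bases)

    weights = BasisDecomposedWeights(num_relations=40, num_bases=5, in_dim=16, out_dim=16)
    print(weights().shape)   # torch.Size([40, 16, 16])
    ```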

Decoder

  • The paper Embedding Entities and Relations for Learning and Inference in Knowledge Bases summarized the task of learning relation embeddings in KGs. The method involves combining different linear/bilinear parameter matrices with various scoring functions to compute the margin triplet loss:

    (figure omitted: overview of knowledge graph embedding scoring functions)
  • DistMult is the simplest approach, where bilinear parameter matrices are replaced with diagonal matrices. The final score is obtained by element-wise multiplication of two entity embeddings, weighted by a relation-specific parameter. The formula used is:

    \[ P(y) = \operatorname{sigmoid}\left(\mathbf{h}_i^T \left(\mathbf{r}_k \circ \mathbf{h}_j\right)\right) \]
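
  • A minimal sketch of the DistMult score above (batch size and dimensions are illustrative):

    ```python
    import torch

    def distmult_score(h_i, r_k, h_j):
        """P(y) = sigmoid( h_i^T (r_k o h_j) ), with o the element-wise product."""
        return torch.sigmoid((h_i * r_k * h_j).sum(dim=-1))

    h_i, r_k, h_j = torch.randn(3, 5, 32).unbind(0)   # a batch of 5 triples, dim 32
    print(distmult_score(h_i, r_k, h_j))               # 5 probabilities in (0, 1)
    ```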

Results

  • Comparing several baselines (RGCN, PCNN), the authors used a GRU-based network to extract semantic similarities, which is data-intensive. While the dataset size was small, the final results were impressive. Notably, RGAT doubled the accuracy of RGCN in small-sample classes like add and delete.
  • A highlight of the paper was how it framed the link prediction problem, emphasizing continual learning and updates. By simplifying the task to link prediction, the model achieved high performance without extensive modifications.

Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

  • This study also focused on link prediction, again using GAT.

  • The authors emphasized the importance of relations in KGs, but direct feature assignment to edges is challenging. To address this, they incorporated edge features into node features indirectly.

  • A diagram explains this approach effectively:

    (figure omitted: attention-based embedding architecture)
  • From left to right:

    • Input: Node features are input into GAT. Each node feature is derived via self-attention over triplet features, which are created by concatenating node and relation features. Green indicates relation features, and yellow indicates node features:

      \[ c_{ijk} = \mathbf{W}_1\left[\vec{h}_i \| \vec{h}_j \| \vec{g}_k\right] \\ b_{ijk} = \operatorname{LeakyReLU}\left(\mathbf{W}_2 c_{ijk}\right) \\ \begin{aligned} \alpha_{ijk} &= \operatorname{softmax}_{jk}\left(b_{ijk}\right) \\ &= \frac{\exp \left(b_{ijk}\right)}{\sum_{n \in \mathcal{N}_i} \sum_{r \in \mathcal{R}_{in}} \exp \left(b_{inr}\right)} \end{aligned} \\ \overrightarrow{h_i'} = \Big\Vert_{m=1}^M \sigma\left(\sum_{j \in \mathcal{N}_i} \sum_{k \in \mathcal{R}_{ij}} \alpha_{ijk}^m c_{ijk}^m\right) \]

    • Intermediate Layer: GAT computes multi-head attention, concatenating the results. Features (6-dimensional gray) are transformed and concatenated with relation features (green) to form triplet representations for each node.

    • Final Layer: Average pooling is applied instead of concatenation. Input node features are incorporated again, concatenated with relation features, and loss is calculated based on the triplet distance: subject + predicate - object. Negative sampling involves randomly replacing the subject or object.

  • Decoder: The decoder uses ConvKB:

    \[ f\left(t_{ij}^k\right) = \left(\Big\Vert_{m=1}^{\Omega} \operatorname{ReLU}\left(\left[\vec{h}_i, \vec{g}_k, \vec{h}_j\right] * \omega^m\right)\right) \cdot \mathbf{W} \]

  • Loss: Soft-margin loss is used (although the values for 1 and -1 seem reversed):

    \[ \begin{aligned} \mathcal{L} &= \sum_{t_{ij}^k \in \{S \cup S'\}} \log \left(1 + \exp \left(l_{t_{ij}^k} * f\left(t_{ij}^k\right)\right)\right) + \frac{\lambda}{2}\|\mathbf{W}\|_2^2 \\ \text{where } l_{t_{ij}^k} &= \begin{cases} 1 & \text{for } t_{ij}^k \in S \\ -1 & \text{for } t_{ij}^k \in S' \end{cases} \end{aligned} \]
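
  • A minimal sketch of the ConvKB score and the soft-margin loss above. The 1D-convolution layout (filters sliding over the embedding dimension while seeing the h, g, t columns at once) is one common reading of the \([\vec{h}_i, \vec{g}_k, \vec{h}_j] * \omega^m\) operation, the filter count is illustrative, and the loss keeps the sign convention of the formula in the note:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvKB(nn.Module):
        """f(t_ij^k) = (concat_m ReLU([h; g; t] * omega^m)) . W"""
        def __init__(self, dim, num_filters=8):
            super().__init__()
            # each filter sees the (h, g, t) triple at one embedding position
            self.conv = nn.Conv2d(1, num_filters, kernel_size=(1, 3))
            self.w = nn.Linear(dim * num_filters, 1, bias=False)

        def forward(self, h, g, t):
            x = torch.stack([h, g, t], dim=-1).unsqueeze(1)   # (batch, 1, dim, 3)
            feats = F.relu(self.conv(x)).flatten(start_dim=1) # (batch, dim * num_filters)
            return self.w(feats).squeeze(-1)                  # (batch,) triple scores

    def soft_margin_loss(scores, labels, model, lam=1e-3):
        """labels: +1 for valid triples (S), -1 for corrupted ones (S'),
        following the note's formula (including the possibly reversed signs)."""
        loss = torch.log1p(torch.exp(labels * scores)).mean()
        reg = lam / 2 * model.w.weight.pow(2).sum()           # L2 on the scoring weights
        return loss + reg

    model = ConvKB(dim=16)
    h, g, t = torch.randn(3, 4, 16).unbind(0)                 # batch of 4 triples
    labels = torch.tensor([1., 1., -1., -1.])
    print(soft_margin_loss(model(h, g, t), labels, model).item())
    ```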

  • Additional improvements included adding edges between 2-hop nodes.

  • The results were excellent, achieving SOTA performance on FB15K-237 and WN18RR. The authors avoided integrating edge features directly into GAT's message passing, focusing instead on encoding these features effectively and ensuring that initial features influenced every model layer for robust gradient propagation.