Study Notes for Cognitive Graph
Note for paper "Cognitive Graph for Multi-Hop Reading Comprehension at Scale."
Task
- The framework proposed by the authors is called CogQA, a cognitive-graph-based framework for question answering in machine reading comprehension, covering both general questions (selecting an entity as the answer) and comparison questions (judging the relationship between two entities). Setting cognitive graphs aside for the moment, let's look at what this task is.
- The uniqueness of this QA task lies in its extension to multi-hop. Multi-hop is not so much a task type as a method for completing entity-based QA: find the entities in the question, locate clues in their corresponding context descriptions, and use those clues to find the next entity, i.e., the next hop. The next-hop entity leads to its own context, where clues are found again, and this process repeats until, after multiple hops, the correct entity is found in the correct description and returned as the answer. Any given question can be approached single-hop or multi-hop. In fact, most retrieval-based QA models are single-hop: they compare the question with the context to find the most relevant sentences and then extract entities from those sentences. Essentially this is pattern matching, and the problem is that if the question itself is multi-hop, single-hop pattern matching may not find the correct entity at all, because the answer is not among the candidate sentences.
- This closely resembles how humans answer questions. For example, if we ask who the authors are of the cognitive graph paper published at ACL 2019, we would first identify the two entities, ACL 2019 and cognitive graph, then separately find all authors of ACL 2019 papers and the various senses of "cognitive graph" (it may be neuroscience, education, or natural language processing), then expand to more entities and descriptions (authors of different papers, explanations of different senses), and finally arrive at one or more answers. A human might directly search for the term "cognitive graph" in all ACL 2019 paper titles, while a computer might expand and hop between ACL 2019 and cognitive graph several times before the paths merge into a single entity, the author's name, which is output as the answer.
- The multi-hop connections among multiple entities and their topological relationships constitute the cognitive graph, a directed graph. This directed graph supports inference and is interpretable, unlike black-box models, with a clear inference path. The problem therefore boils down to:
- How to construct the graph?
- Given the graph, how to reason over it?
- The authors first invoke the dual-process theory from cognitive science to explain their approach.
Dual-Process Model
- The dual-process model in cognitive science holds that humans solve problems in two steps:
- System 1: attention is first allocated through an implicit, unconscious, intuitive process to retrieve relevant information
- System 2: reasoning is then completed through another explicit, conscious, controllable process
- System 1 provides resources to System 2, System 2 guides System 1's search, and the two iterate in this manner
- These two processes correspond to the two major schools of artificial intelligence, connectionism and symbolism. The first process, although hard to explain and carried out by intuition, is not innate; it is acquired as hidden knowledge from life experience. It corresponds to today's black-box deep learning models, which learn from large amounts of data to produce models that are unexplainable yet serve their purpose. The second process, by contrast, requires causal relationships or explicit structure to assist reasoning.
- In the context of machine question answering, the authors naturally employ existing neural network models to accomplish these two tasks:
- The first requires attention to retrieve relevant information, so they directly use a self-attention model (BERT) to perform entity retrieval.
- The second requires an explicit structure, so they construct a directed cognitive graph and carry out reasoning on it.
How to Construct a Graph
- The authors use BERT to carry out the work of System 1. BERT by itself can serve as a single-hop machine reading comprehension model, and the authors follow that setup: the input is a sentence pair consisting of the question and the text to be annotated, and the output is a labeling probability, i.e., the probability that each token is the start or end position of an answer span. To achieve multi-hop, however, they make some modifications:
- Input sentence pairs are built not per question but per entity within each question. Specifically, sentence A of each input pair consists of the question plus the clues for a given entity in the question, while sentence B consists of all the sentences in that entity's descriptive context.
- sentence A:\([CLS]Question[SEP]clues(x,G)[SEP]\)
- sentence B:\(Para(x)\)
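As a concrete illustration of this input format, here is a minimal Python sketch (my own, not the authors' code) that packs the question, the clues, and the paragraph into one BERT-style sequence, with whitespace tokenization standing in for WordPiece:

```python
# Illustrative sketch: build [CLS] Question [SEP] clues(x,G) [SEP] Para(x)
# as a token list plus segment ids for a BERT-style model.
def build_input(question, clues, paragraph, max_len=512):
    """Return (tokens, segment_ids) for one entity x.

    `clues` is a list of sentences copied from the contexts of x's
    parent nodes in the cognitive graph; `paragraph` is Para(x).
    """
    sentence_a = ["[CLS]"] + question.split() + ["[SEP]"]
    for clue in clues:                      # clues(x, G)
        sentence_a += clue.split()
    sentence_a += ["[SEP]"]
    sentence_b = paragraph.split()          # Para(x)
    tokens = (sentence_a + sentence_b)[:max_len]
    # Segment ids distinguish sentence A (0) from sentence B (1).
    segments = ([0] * len(sentence_a) + [1] * len(sentence_b))[:max_len]
    return tokens, segments

tokens, segments = build_input(
    "Who made a movie in 2003 with a scene shot at the Quality Cafe?",
    ["This shop is famous as a filming location for several movies."],
    "Old School is an American comedy film released in 2003.",
)
```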
- What is an entity's clue? The clues are the sentences mentioning the entity, extracted from the contexts of all its parent nodes in the cognitive graph. This may sound a bit awkward, but the design runs through the entire system and is the essence of its cognitive reasoning, as the example in the paper illustrates:
- Who made a movie in 2003 with a scene shot at the Los Angeles Quality Cafe?
- We find the entity "Quality Cafe" and its introductory context: "... This is a coffee shop in Los Angeles. The shop is famous for being the filming location for several movies, including Old School, Gone in 60 Seconds, and so on."
- We then traverse these movie-name entities and find each movie's introductory context, such as "Old School is an American comedy film released in 2003, directed by Todd Phillips," and through further cognitive reasoning deduce that "Todd Phillips" is the correct answer. So what is the clue for this "Old School" entity? What clue do we need as supplementary input so that, from "Old School is an American comedy film released in 2003, directed by Todd Phillips," we can extract "Todd Phillips" as the answer?
- The answer is "This shop is famous for being the filming location for several movies, including Old School, Gone in 60 Seconds, etc." This sentence enters BERT's input in the following format:
- sentence A:
- Question: "Who made a movie in 2003 with a scene shot at the Los Angeles Quality Cafe?"
- clues(x,G): "This shop is famous for being the filming location for several movies, including 'Old School' and 'Gone in 60 Seconds' etc."
- sentence B:
- Para(x): "Old School is an American comedy film released in 2003, directed by Todd Phillips."
- Here the entity x is "Old School."
- This design completes the iterative interplay between System 1 and System 2, connecting the two. It allows System 2 to use the graph structure to guide System 1's retrieval. Over the cycles, System 2 may update the features of an entity's parent node or add a new parent node, either of which can yield new clues; System 1 can then use these clues to predict again and find answer entities or next-hop entities that were previously missed.
- How does System 2 depend on System 1's results? This again divides into two parts:
- Two span predictions: System 1's BERT predicts the answer-entity span and the next-hop-entity span separately, using four parameter vectors combined with the token features output by BERT to predict the start and end positions of the answer span and the start and end positions of the next-hop span, four quantities in total. The resulting answer and next-hop entities are added to the cognitive graph as child nodes of the current entity, with connecting edges.
- Of course, connecting edges alone is not enough; node features are also required. Just as BERT's position 0 extracts a feature for the whole sentence pair, the authors use it as the node feature \(sem(x,Q,clues)\) and add it to the graph.
- System 1 thus provides the topological relations and node features for expanding the graph, supplying resources to System 2.
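A toy numpy sketch of the four-vector span prediction (all sizes and random initializations here are made up for illustration; in CogQA these vectors are trained jointly with BERT):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8                        # toy sizes: 16 tokens, hidden size 8
H = rng.normal(size=(T, d))         # stand-in for BERT's output token features

# Four learned parameter vectors, one per predicted position:
# answer start/end and next-hop start/end.
w_ans_s, w_ans_e, w_hop_s, w_hop_e = rng.normal(size=(4, d))

def position_logits(H, w):
    """Per-token score for one boundary: dot product with a learned vector."""
    return H @ w

ans_start = position_logits(H, w_ans_s)     # answer span start scores
ans_end   = position_logits(H, w_ans_e)     # answer span end scores
hop_start = position_logits(H, w_hop_s)     # next-hop span start scores
hop_end   = position_logits(H, w_hop_e)     # next-hop span end scores

# Node feature sem(x, Q, clues): the position-0 ([CLS]) output vector.
sem_x = H[0]
```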
How to Reason on Graphs
This section directly employs a GNN to perform a spectral propagation step on the directed graph, extracting one layer of node features:
\[ \Delta = \sigma\left((AD^{-1})^{T}\sigma(XW_1)\right) \\ X^{1} = \sigma(XW_2 + \Delta) \]
The subsequent predictions only require adding a simple network on top of the transformed node features for regression or classification.
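The propagation step can be sketched in a few lines of numpy (my own illustration; \(\sigma\) is assumed to be ReLU and the degree matrix D is built from the adjacency matrix's column sums — choices these notes do not pin down):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gnn_layer(X, A, W1, W2):
    """One propagation step over the directed cognitive graph:
        Delta = sigma((A D^{-1})^T sigma(X W1)),  X^1 = sigma(X W2 + Delta)
    A is the adjacency matrix, D its diagonal degree matrix; sigma is
    taken to be ReLU here (an assumption, not necessarily the paper's).
    """
    deg = A.sum(axis=0)                            # degree of each node
    inv = np.where(deg > 0, 1.0 / np.maximum(deg, 1.0), 0.0)
    D_inv = np.diag(inv)                           # D^{-1}, 0 for isolated nodes
    Delta = relu((A @ D_inv).T @ relu(X @ W1))     # aggregate from neighbors
    return relu(X @ W2 + Delta)

rng = np.random.default_rng(0)
n, d = 5, 4                                        # 5 nodes, feature size 4
X = rng.normal(size=(n, d))
A = np.array([[0, 1, 1, 0, 0],                     # small directed graph
              [0, 0, 0, 1, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]], dtype=float)
W1, W2 = rng.normal(size=(2, d, d))
X1 = gnn_layer(X, A, W1, W2)
```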
Note that although System 1 extracts the answer span, both the answer span and the next-hop span are added as nodes to the cognitive graph, because there may be multiple answer nodes whose confidence must be judged by System 2. The reasons BERT predicts the answer and the next hop separately are:
- Both should have different features obtained through BERT and require independent parameter vectors to assist in updating
- Both are equally included in the cognitive graph, but only the next-hop node will continue to input into the system to make further predictions
The model's loss consists of two parts: System 1's span-prediction loss (answer & next hop) and System 2's answer-prediction loss. Both are relatively simple and can be looked up directly in the paper.
Data
- The authors use the full-wiki setting of HotpotQA for training and testing, where 84% of the questions require multi-hop reasoning. Each question in the training set provides two useful entities, along with several relevant context descriptions and 8 irrelevant descriptions for negative sampling. During validation and testing, only the question is given, and both the answer and the relevant context descriptions must be produced.
- To construct the gold-only cognitive graph, i.e., the initial cognitive graph, the authors fuzzy-match every other entity y against each sentence in the descriptive context of entity x; if a match is found, (x, y) is added as an edge of the initial graph
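A rough sketch of this initialization, using difflib for the fuzzy matching (the windowed matching heuristic and the 0.8 threshold are my assumptions; the paper's exact rule may differ):

```python
from difflib import SequenceMatcher

def fuzzy_mentions(sentence, entity, threshold=0.8):
    """True if `entity` fuzzily appears in `sentence`: slide a window the
    size of the entity over the sentence and compare with difflib."""
    words = sentence.lower().split()
    ent = entity.lower()
    k = len(ent.split())
    for i in range(len(words) - k + 1):
        window = " ".join(words[i:i + k])
        if SequenceMatcher(None, window, ent).ratio() >= threshold:
            return True
    return False

def init_graph(contexts):
    """Gold-only initial graph: add edge (x, y) whenever some sentence in
    entity x's descriptive context fuzzily mentions entity y."""
    edges = set()
    for x, sentences in contexts.items():
        for y in contexts:
            if y != x and any(fuzzy_mentions(s, y) for s in sentences):
                edges.add((x, y))
    return edges

edges = init_graph({
    "Quality Cafe": ["The shop was a filming location for Old School."],
    "Old School": ["Old School is a 2003 comedy film."],
})
```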
Overall Process
- Input: System 1, System 2, question, prediction network, wiki dataset
- Initialize the gold-only graph with the entities in the question, mark these entities as parent nodes, and add the entities found by fuzzy matching during initialization to the boundary queue (the queue of nodes to be processed)
- Repeat the following process
- Pop an entity x from the boundary queue
- Collect clues from all ancestors of x
- Feed the clues, the question, and the entity's descriptive context into System 1, obtaining the cognitive-graph node representation \(sem(x,Q,clues)\)
- If entity x is a hop node, then:
- Generate the answer spans and next-hop spans
- For each next-hop entity span: if it exists in the wiki database, create a new next-hop node in the graph and add an edge; if it is already in the graph but has no edge with the current entity x, add the edge and put the node back in the boundary queue
- For each answer entity span, nodes and edges are added directly without checking the wiki database, because the answer may not be in the database
- Update node features through System 2
- Until there are no nodes in the boundary queue, or the cognitive graph is sufficiently large
- Finally, return the results via the prediction network.
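The loop above can be sketched with stubbed-out systems as follows (the interfaces, names, and toy data are my own assumptions, not the authors' code):

```python
from collections import deque

def cogqa_loop(question, start_entities, system1, system2, wiki, max_nodes=50):
    """Toy sketch of the iterative procedure.
      system1(question, clues, para) -> (feature, hop_entities, answer_entities)
      system2(features, edges)       -> updated node features
    `wiki` maps an entity name to its descriptive paragraph."""
    edges, features = set(), {}
    parents = {e: set() for e in start_entities}
    frontier = deque(start_entities)                # the boundary queue
    while frontier and len(parents) < max_nodes:
        x = frontier.popleft()
        # clues(x, G): sentences taken from the paragraphs of x's parents
        clues = [wiki[p] for p in parents[x] if p in wiki]
        feature, hops, answers = system1(question, clues, wiki.get(x))
        features[x] = feature
        for y in hops:
            if y not in wiki:                       # hop nodes must exist in wiki
                continue
            if (x, y) not in edges:
                edges.add((x, y))
                parents.setdefault(y, set()).add(x)
                frontier.append(y)                  # new clue for y: reprocess it
        for a in answers:                           # answer nodes skip the wiki check
            edges.add((x, a))
            parents.setdefault(a, set()).add(x)
    return system2(features, edges), edges          # System 2: reason over graph

# Stub systems keyed on paragraph text, for illustration only.
wiki = {"Quality Cafe": "a filming location for Old School",
        "Old School": "a 2003 comedy directed by Todd Phillips"}

def toy_system1(question, clues, para):
    hops = ["Old School"] if para and "Old School" in para else []
    answers = ["Todd Phillips"] if para and "Todd Phillips" in para else []
    return len(clues), hops, answers                # "feature" = clue count

features, edges = cogqa_loop(
    "Who made a movie in 2003 with a scene shot at the Quality Cafe?",
    ["Quality Cafe"], toy_system1, lambda f, e: f, wiki)
```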
- From the above process it can be seen that, for each training example, the two systems interact iteratively several times before the prediction network produces a result, and iteration stops when the boundary queue is empty. So which nodes join the boundary queue? Nodes already in the graph that gain a new incoming edge may bring new clues, so all such nodes must be reprocessed, letting System 2 see all the clues before making predictions.
Other details
- In System 1, sentence B may be absent, i.e., an entity may have no descriptive context. In that case we simply obtain the node feature \(sem(x,Q,clues)\) through BERT without predicting answer or next-hop spans; such a node acts as a leaf of the directed graph and is not expanded further.
- At the initialization of the cognitive graph, it is not necessary to obtain node features; only the prediction of spans is needed to construct edges
- The author found that using the feature at position 0 of the last layer of BERT as node features was not very good, because the features of higher layers are transformed to be suitable for span prediction, so after experimentation, the author took the third-to-last layer of BERT to construct node features
- Span prediction actually specifies a maximum span length: the model predicts the top-k start positions, then predicts an end position within the maximum span length of each start
- The authors also employ negative sampling to prevent span prediction on irrelevant sentences. Specifically, irrelevant samples are drawn, and the probability at the [CLS] position is set to 0 for these samples and to 1 for positive samples; in this way BERT learns, at the [CLS] position, the probability that sentence B is a positive sample. Of the top-k spans selected above, only those whose start-position probability exceeds the [CLS]-position probability are retained.
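A small numpy sketch combining the top-k span extraction and the [CLS] threshold filter from the last two points (illustrative only; the logits are made-up toy values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def extract_spans(start_logits, end_logits, k=3, max_len=5):
    """Pick the k best start positions, keep only those whose start
    probability exceeds the [CLS] (position 0) probability, then choose
    the best end position within `max_len` tokens of each start."""
    start_p, end_p = softmax(start_logits), softmax(end_logits)
    spans = []
    for s in np.argsort(start_p)[::-1][:k]:
        if start_p[s] <= start_p[0]:        # below [CLS]: treated as irrelevant
            continue
        e = s + int(np.argmax(end_p[s:s + max_len]))
        spans.append((int(s), int(e)))
    return spans

spans = extract_spans(np.array([0.5, 3.0, -1.0, 2.0, 0.0, -2.0]),
                      np.array([-1.0, 0.0, 3.0, 1.0, 2.5, -1.0]))
# spans == [(1, 2), (3, 4)]
```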
- In the pseudo-algorithm described above, System 2 runs once every time System 1 updates the cognitive graph structure. In practice the authors found it equally effective, and more efficient, to let System 1 first traverse all the boundary nodes until the graph stops changing, and then run System 2 several times; the actual implementation adopts this scheme.
- HotpotQA contains special (wh-) questions, alternative questions, and general (yes/no) questions. The authors build a prediction network for each: special questions use a regression model, while the other two types use classification models.
- When initializing the cognitive graph, it is not only necessary to establish edges between entities and the next-hop entities, but also to mark the begin and end positions of the next-hop entities and feed them into the BERT model
- The author also conducted an ablation study, mainly focusing on the differences in the initial entity sets, and the experimental results show that the model is relatively dependent on the quality of the initial entities
Results
- The model dominated the HotpotQA leaderboard for several months until this April, when it was surpassed by a newer BERT-based model; at the least, it provides good interpretability, as shown in the three cognitive-graph reasoning scenarios in the following figure
Conclusion
- This model can be viewed simply as an extension of GNNs to NLP, with the powerful BERT used for node feature extraction. The difficulty of using GNNs in NLP, however, lies in defining the edge relationships. This paper gives a very natural definition of those relationships, consistent with human intuition in question answering, and BERT not only extracts node features but also completes the construction of edges. I feel this framework is a good way to combine black-box models with interpretable models, rather than insisting on explaining the black box. The black box is left to do what it is good at, including natural-language feature extraction and the reasoning network, while humans design explicit rules for adding edge relationships. The two work together, complementing rather than excluding each other.