In this topic, we'll return to question-answering systems. Existing QA methods are either knowledge-based or not knowledge-based (so-called raw-text-based). Systems that aren't based on a knowledge base are trained on ordinary text datasets; we will cover them in other topics. Here, we'll walk you through the methods of knowledge-based QA.
But first, let's clarify what a knowledge base is.
Knowledge base
A knowledge base (KB) is a large collection of information stored as structured data, ready for analysis or inference. Usually, a KB is stored as a graph, where nodes are entities and edges are relations between entities. A knowledge graph is a knowledge base that uses a graph-structured data model or topology to integrate data. Knowledge graphs are often used to store interlinked descriptions of entities (objects, events, situations, or abstract concepts) while also encoding the semantics underlying the terminology used.
For example, from the text Santiago is the capital of Chile, we can extract the triple <Santiago, is the capital of, Chile>. From Chile has borders with Peru, we get the named entities Chile and Peru with the relation has borders with.
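To make the triple representation concrete, here is a minimal sketch in plain Python; the triples are the ones from the example above, and the query helper is our own illustration, not part of any library:

# A toy knowledge base: a list of (subject, relation, object) triples.
triples = [
    ("Santiago", "is the capital of", "Chile"),
    ("Chile", "has borders with", "Peru"),
]

def query(subject=None, relation=None, obj=None):
    # Return all triples that match the given (possibly partial) pattern.
    return [(s, r, o) for (s, r, o) in triples
            if subject in (None, s) and relation in (None, r) and obj in (None, o)]

query(relation="is the capital of", obj="Chile")
# [('Santiago', 'is the capital of', 'Chile')]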
Below is data about capitals and countries in a directed edge-labeled graph and a heterogeneous graph:
Here is a more advanced graph, a knowledge graph extracted from twenty news articles about Google:
A knowledge base is useful, as it:
- is a comprehensive repository of information about a given domain or several domains;
- reflects the ways we model knowledge about a given subject or subjects, in terms of concepts, entities, properties, and relationships;
- enables us to use this structured knowledge where appropriate, for example, by answering factoid questions.
The most popular knowledge graph is Wikidata, which is closely connected to Wikipedia. DBpedia, extracted from Wikipedia, is also often mentioned in NLP articles, and Google Knowledge Graph is another popular knowledge graph. Freebase and YAGO2 are good knowledge bases too; RDF, often mentioned alongside them, is not a knowledge base itself but the data model in which many of these graphs are published.
Knowledge-based approaches
There are two mainstream approaches to complex KBQA. Both start by recognizing the subject mentioned in the question and linking it to an entity in the KB, called the topic entity. Then, they derive the answers within the KB neighborhood of the topic entity:
- by executing a parsed logic form, which is typical of semantic parsing-based methods (SP-based methods); these follow a parse-then-execute paradigm;
- by reasoning in a question-specific graph extracted from the KB and ranking all the entities in the extracted graph based on their relevance to the question, which is typical of information retrieval-based methods (IR-based methods); these follow a retrieval-and-rank paradigm.
Semantic parsing-based QA aims at parsing a question into a logical form. SP-based methods predict answers via the following steps (a sketch of the final execute step comes after the list):
- Parse a question into a logic form (for example, a SPARQL query template), which is a syntactic representation of the question without the grounding of entities and relations;
- Instantiate and validate the logic form by semantically aligning it to the structured KB via KB grounding (obtaining, for example, an executable SPARQL query);
- Execute the parsed logic form against the KB to generate the predicted answers.
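To illustrate the execute step, here is a sketch that runs an already-grounded SPARQL query against the public Wikidata endpoint using the third-party SPARQLWrapper package (pip install sparqlwrapper). The identifiers wd:Q35 (Denmark) and wdt:P36 (capital) stand for the groundings a semantic parser might produce for the question What is the capital of Denmark?:

from SPARQLWrapper import SPARQLWrapper, JSON

# An executable SPARQL query: the output of the parsing and grounding steps.
query = """
SELECT ?capitalLabel WHERE {
  wd:Q35 wdt:P36 ?capital .            # Denmark -> capital -> ?capital
  ?capital rdfs:label ?capitalLabel .
  FILTER(LANG(?capitalLabel) = "en")
}
"""

# Wikimedia asks clients to identify themselves with a custom user agent.
sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="kbqa-example")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

# Execute the logic form against the KB and read off the answer.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["capitalLabel"]["value"])
# Copenhagen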
IR-based QA, on the other hand, directly retrieves and ranks answers from the KB based on the information conveyed in the question. IR-based methods naturally fit popular end-to-end training, which makes them easy to train; however, the black-box style of the reasoning model makes the intermediate reasoning less interpretable. The approach follows this algorithm (a simplified sketch of the last two steps comes after the list):
- The system first extracts a question-specific graph from the KB;
- Then the system encodes the input question into vectors representing reasoning instructions;
- A graph-based reasoning module conducts semantic matching via vector-based computation to propagate and then aggregate the information along the neighboring entities within the graph;
- An answer ranking module ranks the entities in the graph according to their reasoning status at the end of the reasoning phase; the top-ranked entities are predicted as the answers to the question.
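Below is a deliberately simplified numpy sketch of the matching and ranking steps. In a real system, the vectors come from a trained encoder and the reasoning propagates over the extracted graph; here, random vectors merely stand in for those encodings:

import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Candidate entities from the question-specific subgraph.
entities = ["Copenhagen", "Aarhus", "Odense"]
entity_vecs = rng.normal(size=(len(entities), dim))  # stand-ins for entity encodings
question_vec = rng.normal(size=dim)                  # stand-in for the question encoding

# Semantic matching: cosine similarity between the question and each entity.
scores = entity_vecs @ question_vec
scores = scores / (np.linalg.norm(entity_vecs, axis=1) * np.linalg.norm(question_vec))

# Answer ranking: the top-scoring entity is predicted as the answer.
for entity, score in sorted(zip(entities, scores), key=lambda pair: -pair[1]):
    print(f"{entity}: {score:.3f}")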
Knowledge-based systems
The universal schema can support reasoning over the union of structured KBs and unstructured text by aligning them in a common embedded space. Traditionally, the universal schema has been applied to relation extraction. Some researchers propose using universal schema to extend the knowledge base with raw text, employing memory networks to attend to the large body of facts in the combined text and KB (the attention step is sketched below). The Dynamic Memory Network (DMN) is a good example of a memory-network-based QA model.
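The attention step at the heart of such memory-network models can be sketched in a few lines of numpy; again, random vectors stand in for the encoded question and the encoded facts:

import numpy as np

rng = np.random.default_rng(1)
dim, n_facts = 16, 5

memory = rng.normal(size=(n_facts, dim))   # encoded facts from the KB and raw text
question = rng.normal(size=dim)            # encoded question

# Attention: a softmax over question-fact similarities.
logits = memory @ question
weights = np.exp(logits - logits.max())
weights = weights / weights.sum()

# Readout: a weighted sum of fact embeddings that feeds the answer prediction.
readout = weights @ memory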
DeepPavlov offers a good knowledge-based QA system — KBQA. This model is available in English and Russian. Other notable Knowledge-based QA systems are DEANNA, gAnswer, and Eddie (a QA-based chatbot).
Some hybrid models combine training on knowledge bases with training on common datasets. YodaQA and HAWK are two of those. The first one is especially popular: you can ask it any question without writing any code. If you are interested in this model as an NLP researcher rather than an ordinary user, check out its GitHub page.
Take a look at an example of complex KBQA with the question Who is the first wife of the TV producer that was nominated for The Jeff Probst Show? This question requires constrained relations (we are looking specifically for a TV producer), multi-hop reasoning (we need to find the wives of that TV producer), and numerical operations (once we find his wives, we need to determine the first one). Here is a graph of the KB search:
RnG-KBQA is a model with astonishing results on the GrailQA leaderboard (1st place!), which enables answering questions over large-scale knowledge bases. The model can answer questions about topics never seen in the training data, which makes it generalizable to a broad range of domains.
So, apart from knowledge-based systems, there are also memory-network-based and raw-text-based ones. The implementation of raw-text-based systems will be discussed a little later; for now, let's talk about KBQA implementation.
Knowledge-based system implementation
To start working with DeepPavlov's KBQA, you should install the DeepPavlov library and then the knowledge base. DeepPavlov provides two models, English and Russian; the English model uses the English Wikidata. To install the English one, use the following code. To get the Russian one, change kbqa_cq_en to kbqa_cq_ru.
!pip install deeppavlov
!python -m deeppavlov install kbqa_cq_en  # install the requirements of the English KBQA config
!python -m deeppavlov interact kbqa_cq_en -d  # optional; -d downloads the model files
The last line may not work properly in Google Colab, so you can omit it.
Now it's necessary to build the model:
from deeppavlov import build_model

# download=True fetches the model files on the first run
kbqa_model = build_model('kbqa_cq_en', download=True)
Then, ask whatever question you want:
kbqa_model(['What is the capital of Denmark'])
## ['Copenhagen']
As you can see from the answer, this model is extractive: it returns the exact entity rather than a full sentence.
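Since the pipeline is batched, you should be able to pass several questions in one call (the second question here is our own example, not from DeepPavlov's docs):

kbqa_model(['What is the capital of Denmark',
            'Who wrote Hamlet'])
# one answer per question is returned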
DeepPavlov also offers an open-domain raw-text-based QA system, ODQA. As mentioned before, the advantage of an open-domain raw-text-based system is that it can answer more questions than a knowledge-based one. ODQA can give you an exact answer to any question whose answer is contained in the Wikipedia dataset.
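Loading ODQA looks much the same as the KBQA example above; en_odqa_infer_wiki is the English config name used in DeepPavlov's documentation. Keep in mind that it downloads a Wikipedia dump of several gigabytes:

from deeppavlov import build_model

odqa_model = build_model('en_odqa_infer_wiki', download=True)  # downloads a large Wikipedia index
odqa_model(['What is the capital of Denmark'])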
The article called On knowledge base question answering system (KBQA) by VIP Programming shows you how to make a KBQA on your own.
Conclusion
In this topic, we've discussed knowledge-based QA. To better understand what knowledge-based QA is, we've talked about knowledge bases and knowledge graphs. We have covered the two main approaches in KBQA: semantic parsing-based and IR-based. IR is strongly related to the problem of QA, so we recommend learning it if you are interested in this field. We showed the difference between knowledge-based and non-knowledge-based systems and learned how to implement knowledge-based QA in DeepPavlov.