
Open-domain and closed-domain QA


We continue our question-answering series. Just a reminder — Question Answering (QA) is a major NLP field devoted to creating systems that can automatically answer the user's question. Two main QA types are closed-domain and open-domain. We've talked about knowledge-based systems previously; here, we'll learn about open/closed domain systems!

Open vs. closed

The main difference between a closed-domain and an open-domain QA system is the dataset on which it was trained. If you train your model on the TweetQA dataset, you will eventually get a closed-domain QA system that can answer typical Twitter questions. Conversely, if you train a model on SQuAD, you will get a model that can answer almost any question.

We first use TF-IDF to find the most important documents in the pool of articles (e.g., the whole Web), then we find the top relevant paragraphs in those documents and read them to find the necessary information.

The image above shows how closed-domain systems work. There is a pool of articles; when a model retrieves information to answer the question, it processes only these articles. The model selects the 3-5 articles with the highest scores, extracts the most relevant paragraphs from them, and then extracts the answer. The main idea here is that the model's knowledge (dataset) is very limited, and so is the spectrum of questions the model can answer.
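To make the retrieval step more concrete, below is a minimal sketch of TF-IDF document ranking with scikit-learn; the article pool and the question are made up for illustration, and a real system would pass the top documents to a reader model afterwards.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny pool of articles; in a real system this could be thousands of documents
articles = [
    "Manhattan is the most densely populated borough of New York City.",
    "The Amazon rainforest covers much of the Amazon basin in South America.",
    "Python is a popular programming language for data science.",
]
question = "Which US county has the densest population?"

# Fit TF-IDF on the pool and score every article against the question
vectorizer = TfidfVectorizer()
article_vectors = vectorizer.fit_transform(articles)
question_vector = vectorizer.transform([question])
scores = cosine_similarity(question_vector, article_vectors)[0]

# Keep the top-scoring articles; a reader model would extract the answer from them
top_articles = [articles[i] for i in scores.argsort()[::-1][:2]]
print(top_articles)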

Hugging Face provides trained closed-domain systems, though you can fine-tune them for the open domain. LayoutLM for Visual Question Answering is an example of a closed-domain QA model: upload a .png document, and the model will answer document-related questions. Apart from this, there are also context-based QA models. We have not discussed them before because they are not very popular. Many conversational QA systems are context-based: for example, fractolego's conversation-qa. Such models are supposed to be integrated into chatbot systems or customer support services.
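For instance, a document QA model like LayoutLM can be called through the Transformers pipeline. Here is a minimal sketch; it assumes the impira/layoutlm-document-qa checkpoint, a recent Transformers version that supports the document-question-answering task, and the pytesseract OCR dependency, and the file name is hypothetical.

from transformers import pipeline

# Document (visual) question answering over a scanned page or a screenshot
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# The image path is made up; any .png/.jpg document should work
result = doc_qa(image="invoice.png", question="What is the invoice number?")
print(result)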

Other closed-domain models are Adaptnlp, Cdqa-suite, txtai, Haystack, and Ktrain simpleQA. Ready-made open-domain models are DeepPavlov's ODQA and Wikipedia QA; both are based on Wikipedia.

Approaches

As with any other NLP system, there are many ways of implementing this type of QA system. The most common architecture in NLP is the Transformer, and most Transformer models are available on the Hugging Face Hub.

Transformers are big models; they achieve better performance by increasing the model size as well as the amount of data they are pre-trained on.

The general Transformer architecture includes an encoder and a decoder. Each of these parts can be used independently, meaning that there can be encoder-only models, decoder-only models, and encoder-decoder models. Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, NER, and extractive question answering. Encoder models include BERT, DistilBERT, etc.

GPT, for instance, is a decoder model. It is better suited for tasks like text generation, summarization, chatbots, etc. It is also good for generative QA (when we need not just a brief answer but a well-formed sentence), because the GPT architecture is designed to predict the next token. Look at the picture showing the difference between the BERT and GPT architectures. In GPT, information from previous input tokens is passed only to the outputs of the following tokens; this helps to predict the next word and thus generate text. In BERT, on the other hand, every input token's embedding is passed to all token outputs, so the model sees the whole context. That is why BERT is excellent for text classification, QA, etc.

In the BERT architecture, information is transferred to all tokens, both before and after the present one. In GPT, information is passed only to the tokens to the right of the present one.
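To get a feel for generative QA with a decoder model, here is a minimal sketch that prompts GPT-2 through the text-generation pipeline; without fine-tuning on a QA dataset the answers are often unreliable, so treat it purely as an illustration of next-token generation.

from transformers import pipeline

# A decoder-only model continues the prompt one token at a time
generator = pipeline("text-generation", model="gpt2")

prompt = "Question: What is the capital of France?\nAnswer:"
output = generator(prompt, max_new_tokens=10, do_sample=False)
print(output[0]["generated_text"])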

BERT models rely on the masked language modeling (MLM) objective. BERT is therefore efficient at predicting masked tokens and at NLU in general, but it is not optimal for text generation. It is available on the Hugging Face Hub and TensorFlow Hub. These models are applicable to QA tasks: we can get excellent results by fine-tuning BERT-family models on QA datasets. Popular encoder models for QA include RoBERTa, MobileBERT, and ALBERT.
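For example, an encoder model that has already been fine-tuned on SQuAD can be plugged straight into the question-answering pipeline. A minimal sketch, assuming the deepset/roberta-base-squad2 checkpoint from the Hugging Face Hub:

from transformers import pipeline

# RoBERTa fine-tuned on SQuAD 2.0 for extractive question answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("Manhattan is coextensive with New York County, "
           "the most densely populated county in the United States.")
print(qa(question="Which US county has the densest population?", context=context)["answer"])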

A model named ReQA was trained on top of the base BERT model. ReQA stands for Retrieval Question Answering, another type of QA that we haven't mentioned in our Introduction topic. This type is closer to extractive QA than to generative QA: it extracts the necessary information, but returns the whole sentence from the source rather than just a brief span. Here is an example of ReQA:

Question: Which US county has the densest population?

Wikipedia Page: New York City

Answer: Geographically co-extensive with New York County, the borough of Manhattan's 2017 population density of 72,918 inhabitants per square mile (28,154/km²) makes it the highest of any county in the United States and higher than the density of any individual American city.

So, if it were a classic extractive QA system, the answer would be New York County. If it were a generative one, the model would generate an answer like this: New York County has the densest population.

The most important feature of ReQA is its ability to bypass common document retrieval systems. ReQA models are available on the Hugging Face Hub and TensorFlow Hub. In the implementation section, we'll discuss how to implement them in HF.
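To illustrate the idea of sentence-level retrieval, here is a minimal sketch with the universal-sentence-encoder-qa model from TensorFlow Hub, a dual encoder commonly used for ReQA-style retrieval; the sentences are made up, and the module URL and signature names are assumptions based on the TF Hub documentation.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Dual encoder: one tower embeds the question, the other embeds candidate answer sentences
module = hub.load("https://tfhub.dev/google/universal-sentence-encoder-qa/3")

question = ["Which US county has the densest population?"]
sentences = [
    "New York County is the most densely populated county in the United States.",
    "The Amazon rainforest covers much of the Amazon basin.",
]
contexts = sentences  # each sentence could also carry its surrounding paragraph as context

q_emb = module.signatures["question_encoder"](tf.constant(question))["outputs"]
s_emb = module.signatures["response_encoder"](
    input=tf.constant(sentences), context=tf.constant(contexts)
)["outputs"]

# The highest dot product points to the sentence most likely to answer the question
scores = np.inner(q_emb, s_emb)[0]
print(sentences[int(scores.argmax())])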

The last type of NN model we'll mention is ELECTRA. It is a method for self-supervised language representation learning that can be used to pre-train Transformer networks using relatively little computation. In QA, it produces only extractive answers. Take a look at an example of a basic ELECTRA model below.
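The usage mirrors the RoBERTa example above; this sketch assumes an ELECTRA checkpoint fine-tuned on SQuAD 2.0, such as deepset/electra-base-squad2.

from transformers import pipeline

# ELECTRA fine-tuned for extractive QA: the answer is always a span from the context
electra_qa = pipeline("question-answering", model="deepset/electra-base-squad2")

print(electra_qa(
    question="Who wrote Hamlet?",
    context="Hamlet is a tragedy written by William Shakespeare around 1600.",
)["answer"])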

Open-domain datasets

QA datasets can be of two types:

  • for knowledge-based models -> knowledge graphs

  • for common models (raw-text-based) -> common datasets

This section is dedicated to common datasets for raw-text-based models. Such datasets intersect with datasets for reading comprehension (RC) tasks. Automatic reading comprehension evaluates how well a computer can comprehend a text. Still, a cloze-format question could be asked by users as part of the QA task; that's why the two intersect.

For you, it's enough to remember that datasets for both QA and RC have four formats: simple questions, queries, cloze, and completion. See the examples below.

The picture shows 4 types of QA datasets with examples.

Questions are the most typical format for QA. They can be further described in terms of their syntactic structure: yes/no questions (Did it rain on Monday?), wh-questions (When did it rain?), tag questions (It rained, didn't it?), or declarative questions (It rained?).

Apart from the example datasets mentioned above, there are others: BoolQ, MS MARCO, CoQA, QuAC, QReCC, etc.

  • SQuAD is a very famous extractive QA dataset based on Wikipedia. The SQuAD website provides a leaderboard with the best QA models that have used the SQuAD dataset (ranked by their F1 measure). The leaderboard is live and changes over time;

  • BoolQ is a categorical dataset where the answer to each question is Yes or No. For example: Was Einstein born in 1880? No. It was collected from "natural" information-seeking questions in Google search queries, similar to Natural Questions;

  • QuAC is a collection of factual questions about a topic, asked by one worker and answered by another (who has access to a Wikipedia article);

  • QReCC is a dataset of dialogues with seed questions from QuAC, Natural Questions, and TREC CAsT;

  • CNN/Daily Mail is a cloze dataset based on news articles from CNN and Daily Mail;

  • RocStories is a completion dataset of five-sentence commonsense stories. This corpus is unique in two ways: it captures a rich set of causal and temporal common-sense relations between daily events, and it is a high-quality collection of everyday life stories that can also be used for story generation. This corpus was created together with the Story Cloze Test. RocStories is just a dataset of stories, while the Story Cloze Test is a collection of completion tests to check text understanding.

Those were open-domain datasets, or, to be more precise, conversational and general-knowledge question datasets. Now let's talk about closed-domain datasets.

Closed-domain datasets

A closed-domain dataset implies that it consists of texts on a particular topic. However, a dataset of questions from Twitter can be closed-domain, too, because such texts have a particular source. Moreover, tweets are usually stylistically specific.

So, these are the main groups of closed-domain datasets:

  • FICTION:

    1. CBT is an old cloze dataset based on fiction stories for kids;

    2. BookTest is a similar dataset. It is much bigger than CBT and encompasses all Gutenberg corpus stories;

    3. FairyTaleQA is a recent multi-choice fiction dataset based on school tests;

  • QUIZ:

    1. TriviaQA is a dataset based on human knowledge competitions that overlap with an encyclopedia in subject matter, but this is a separate genre: the questions are authored by domain experts specifically to be discriminative tests of human knowledge, and, unlike in academic tests, the participants engage in the QA activity for fun;

    2. Jeopardy;

    3. QuizBowl;

  • REVIEWS:

    1. AmazonQA is a dataset based on product questions, answers, and reviews from Amazon.com;

  • PROFESSIONAL:

    1. TechQA is a dataset of naturally occurring questions on tech expert forums;

  • SOCIAL NETWORK:

    1. TweetQA is a dataset of questions and their answers occurring on Twitter.

  • NEWS:

    1. NewsQA is based on CNN data. Given the increasing problem of online misinformation, news QA is a highly important area of research, but it is hampered by the lack of public-domain data.

    2. The Daily Mail cloze dataset is more RC-oriented.

Implementation

Let's start with the common QA model implementation available for fine-tuning, which can be converted to a closed domain. We have already mentioned that many QA systems available on the Hugging Face Hub are closed-domain and context-based. The default HF model is also closed-domain and context-based.

To initialize the default Hugging Face model, you don't need to specify a model name. All you need to do is specify the task name, "question-answering". Note that question_answering won't work.

from transformers import pipeline

qa_model = pipeline("question-answering")

Now, define the context and then ask a question:

context = "My name is Alex and I am 15 years old."
question = "How old are you?"

qa_model(question=question, context=context)['answer']

##  '15'

This model is closed-domain, but you can fine-tune it for the open domain.
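The pipeline can also return several candidate answers together with their scores, which helps to see how confident the model is. A small sketch reusing the qa_model, question, and context defined above (in older Transformers versions the argument is topk rather than top_k):

# Ask for the three best-scoring answer spans instead of a single one
for candidate in qa_model(question=question, context=context, top_k=3):
    print(candidate["answer"], round(candidate["score"], 3))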

Now, let's talk about open-domain model implementation. DeepPavlov offers an open-domain raw-text-based QA system called ODQA. This model is pretrained and can't be fine-tuned into a closed-domain one. As was mentioned before, the advantage of open-domain systems is that they can answer more questions than knowledge-based ones. ODQA can give you an exact answer to any question whose answer can be found in the Wikipedia dataset.

First, you need to install the model. Note that we use the English Wikipedia dataset; DeepPavlov also offers a Russian one. To install the Russian one, change en_odqa_infer_wiki to ru_odqa_infer_wiki.

!python -m deeppavlov install en_odqa_infer_wiki

Then, build a model:

from deeppavlov import build_model

odqa = build_model('en_odqa_infer_wiki', download=True)

And we can ask a question:

result = odqa(['What is the name of Darth Vader\'s son?'])
print(result)

##  Luke Skywalker

Conclusion

In this topic, we have discussed the two approaches to QA system technology: open-domain and closed-domain. We have discussed the difference between them and concluded that the main difference is the dataset a model is trained on (though you should keep in mind that DeepPavlov's ODQA is quite a specific model and cannot be converted into a CDQA system). We have also inspected relevant datasets for each of them. And, finally, we've learned how to implement common QA systems in Hugging Face and ODQA in DeepPavlov.
