Machine Reading Comprehension & Logical Reasoning QA

Teaching a computer to read a document and answer general questions about it is a challenging and still unsolved problem. One of the crucial subtasks is making computers understand text; this problem is known as Natural Language Understanding (NLU). NLU is a broader concept that covers spoken language and automatic speech recognition as well.

In this topic, we will discuss the problem of Machine Reading Comprehension (RC) and the datasets and models for this task. RC is also closely related to the problem of logical reasoning for AI, so we'll discuss logical reasoning QA too. Together, RC and logical reasoning QA make up the problem of getting AI to understand a text at the same level as a human being.

Machine reading comprehension

Computers lack the reasoning abilities and sensitivity that any human has. It is difficult to convey an emotion to artificial intelligence (though there are some attempts). In this sense, the most complicated task for AI is comprehending a novel or a poem.

Take, for example, Louis-Ferdinand Céline's Journey to the End of the Night. AI can extract the facts from this book: the main character, a French medical student, decides to take part in World War I, then ends up in a mental hospital, then becomes a French colonial officer in Africa, and so on. But AI cannot understand the feelings of the main character when he sees the atrocities of war, or his emotions during his later adventures in Africa and then in the US. Not to mention that literary analysis of a novel is impossible for AI: it cannot understand why Dante placed some Florentine politicians in Hell and others in Heaven. To answer such questions, AI needs to know Dante's biography and Renaissance Italian politics. Again, this can be done if we provide our AI system with a data source (Wikipedia is a popular choice), but that is a separate task: open-domain QA.

That was about reading comprehension of novels, but every text contains references of some kind, and AI cannot comprehend them the way an average educated person can.

Machine reading comprehension (or simply reading comprehension, RC) is the task of making artificial intelligence understand a text. This "understanding" is measured by having the system answer questions, such as cloze tests, over a dataset.

When machine comprehension takes place in a dialogue with multiple co-referenced questions, for example, when a later question logically follows from an earlier one, the challenge is called Conversational machine comprehension (CMC).

Datasets for RC

Genuine reading comprehension is challenging, since effective comprehension involves a thorough understanding of documents and sophisticated inference. To tackle the machine reading comprehension problem, several works in recent years have collected various datasets in the form of questions, paragraphs, and answers. A couple of large-scale cloze-style datasets have gained significant attention along with powerful deep learning models. Recall that cloze-style datasets consist of statements with a blank that should be filled in, for example, "Dante was born ___." These datasets are similar to the reading tests you complete in your foreign language classes.

Such datasets intersect with datasets for QA tasks. We recommend paying attention to four formats: simple questions, queries, cloze, and completion:

[Table: the four QA dataset formats (simple questions, queries, cloze, and completion), each with an example question and an example dataset.]
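To make the four formats more concrete, here is a sketch of what one record in each format could look like. All field names and examples below are hypothetical and only illustrate the shape of the data, not any specific dataset:

```python
# Illustrative records for the four QA dataset formats.
# Field names and contents are hypothetical, for demonstration only.
examples = {
    "simple question": {
        "context": "Dante Alighieri was born in Florence in 1265.",
        "question": "Where was Dante born?",
        "answer": "Florence",                  # extractive span, as in SQuAD
    },
    "query": {
        "query": "dante alighieri birthplace",  # search-engine-style query
        "answer": "Florence",
    },
    "cloze": {
        "context": "Dante was born in ___.",    # blank to be filled in
        "answer": "Florence",
    },
    "completion": {
        "prompt": "Dante Alighieri was born in the city of",
        "completion": " Florence",              # free-text continuation
    },
}

for fmt, record in examples.items():
    print(f"{fmt}: {record}")
```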

SQuAD and CNN/Daily Mail are actively used for both QA and RC. SQuAD is a famous extractive QA dataset. The SQuAD website provides a leaderboard of the best QA models evaluated on SQuAD, ranked by their F1 scores. The leaderboard is live, and it changes over time. The CNN/Daily Mail dataset contains news articles paired with cloze quizzes.
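If you'd like to inspect SQuAD yourself, it can be loaded with the Hugging Face datasets library (assuming you have it installed); a minimal sketch:

```python
from datasets import load_dataset

# Download SQuAD v1.1 from the Hugging Face Hub (requires `pip install datasets`).
squad = load_dataset("squad")

sample = squad["train"][0]
print(sample["context"][:200])   # the paragraph the answer must be extracted from
print(sample["question"])        # the question
print(sample["answers"])         # gold answer spans with character offsets
```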

Cloze datasets for training RC models are very popular. Recent approaches to cloze-style datasets can be separated into two categories: single-turn and multi-turn reasoning.

Some other RC datasets are:

  • RACE is collected from English reading comprehension exams for Chinese middle and high school students.

  • DuoRC is based on pairs of movie plot descriptions collected from Wikipedia and IMDb.

  • QAMR is built with Amazon's Mechanical Turk crowdsourcing platform. It represents the predicate-argument structure of a sentence as a set of question-answer pairs.

RC models

Single-turn reasoning models employ attention mechanisms to emphasize the parts of the document that are relevant to the query. These attention models then calculate the relevance between the query and the weighted representations of document subunits (for example, sentences or words) to score candidate answers. Such models include EpiReader.
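As a rough illustration (not EpiReader itself), here is a minimal NumPy sketch of single-turn attention: a query vector scores every document token in one pass, and candidate answers are ranked by the attention mass on their positions. The random vectors stand in for learned representations:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
doc_tokens = ["the", "observatory", "sits", "on", "mount", "shalko"]
doc = rng.normal(size=(len(doc_tokens), dim))  # stand-in token representations
query = rng.normal(size=dim)                   # stand-in query representation

# Single-turn attention: one softmax over query-token relevance scores.
scores = doc @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Score each candidate answer by the attention mass on its token positions
# (a simplification of how pointer-style readers rank candidates).
candidates = {"observatory": [1], "mount shalko": [4, 5]}
for candidate, positions in candidates.items():
    print(candidate, weights[positions].sum())
```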

Existing multi-turn reasoning models perform a predefined number of hops, or iterations, during inference, regardless of the complexity of each query or document.

The first RC model, QUALM, was developed back in 1977. This early work set a strong vision for language understanding, but the systems built at that time were tiny, limited to hand-coded scripts, and difficult to generalize to broader domains. DEEP READ, a rule-based bag-of-words solution with shallow linguistic processing such as stemming, semantic class identification, and pronoun resolution, was developed in 1999. QUARC followed in 2000; it is an RC model based on manual rules for lexical and semantic correspondence. Current RC models are mainly neural.

One of the first end-to-end neural RC models, the Attentive Reader, was developed in 2015. If you're interested, here is a short timeline of RC dataset and model development during 2015-2018.

The first large-scale QA dataset, CNN/Daily Mail, was released in 2015, and the Attentive Reader model was developed at the same time. In 2016, two QA datasets (the Children's Book Test and SQuAD 1.1) and one model (the Stanford Attentive Reader) were created. 2017 was more productive, with three new QA models (Match-LSTM, BiDAF, R-Net) and two datasets (TriviaQA and RACE). In 2018, researchers created three more datasets and three more models.

One of the more recently proposed models is the Reasoning Network (ReasoNet). It is a neural network model that tries to simulate the inference process of a human reader. It reads the question first, and then, keeping the question "in mind", it reads the text until it finds the answer.
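To contrast this with fixed-hop models, here is a toy sketch (with random stand-in weights, not ReasoNet's actual architecture): the reasoning state is refined hop by hop, and a simple termination gate decides when to stop reading instead of always running a fixed number of iterations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
dim = 8
doc = rng.normal(size=(20, dim))   # stand-in document token representations
state = rng.normal(size=dim)       # the question kept "in mind" as the initial state
W = rng.normal(size=(dim, dim))    # stand-in state-update weights
w_stop = rng.normal(size=dim)      # stand-in termination-gate weights

max_hops = 10                      # a fixed-hop model would always run all of these
for hop in range(max_hops):
    # Attend over the document with the current reasoning state.
    scores = doc @ state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ doc        # attention-weighted summary of the document
    # Refine the state with what was just read.
    state = np.tanh(W @ (state + context))
    # Termination gate: stop reading once the gate is confident enough.
    if sigmoid(w_stop @ state) > 0.9:
        break

print(f"stopped after {hop + 1} of {max_hops} hops")
```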

Other RC models include AoA Reader, Iterative Attention Reader, DER Network, CLER (Cross-task Learning with Expert Representation), D-NET, etc.

For cloze-style and multi-choice MRC, a common evaluation metric is accuracy.
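Accuracy here is simply the fraction of questions answered correctly; a minimal sketch:

```python
def accuracy(predictions, gold):
    """Fraction of exactly matching answers (multi-choice / cloze evaluation)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(accuracy(["D", "A", "B"], ["D", "C", "B"]))  # 2 of 3 correct -> 0.666...
```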

Logical reasoning QA

Logical reasoning is of vital importance to natural language understanding. Logical reasoning QA requires a machine to understand the logic behind the text, for example, identifying the logical components, logical relations, or fallacies.

Some researchers propose discourse-aware graph networks (DAGN) that build logical graphs and learn logic representations from them. Such models analyze a context passage, mark the facts, premises, and conclusion in it, and then try to answer an input question.

Below is an example of multi-choice logical reasoning QA and a logical structure-based solution. The logical units are sentences or clauses, and the model performs multi-hop reasoning from premises or refuting evidence to the conclusion.

Suppose we have a passage like this:

'Mount Shalko is the perfect site for the proposed astronomical observatory. The summit would accommodate the complex as currently designed, with some room left for expansion. There are no large cities near the mountain, so neither smog nor artificial light interferes with atmospheric transparency. Critics claim that Mount Shalko is a unique ecological site, but the observatory need not be a threat to endemic life-forms. In fact, since it would preclude recreational use of the mountain, it would be their salvation. It is estimated that 20,000 recreational users visit the mountain every year, posing a threat to the wildlife.'

After the passage, we pose the following question: which one of the following, if true, most weakens this argument?

  • A. More than a dozen insect and plant species endemic to Mount Shalko are found nowhere else on earth.

  • B. The building of the observatory would not cause the small towns near Mount Shalko eventually to develop into a large city, complete with smog, bright lights, and an influx of recreation seekers.

  • C. A survey conducted by a team of park rangers concluded that two other mountains in the same general area have more potential for recreational use than Mount Shalko.

  • D. Having a complex that covers most of the summit, as well as having the necessary security fences and access road on the mountain, could involve just as much ecological disruption as does the current level of recreational use.

The model analyzes the input passage and marks the logical facts, premises, and conclusion (underlined in red, blue, and green, respectively, in the original illustration). Finally, the model concludes that the right answer is D.

The proposed logic graph construction relies on generic textual clues and logic theories, so it is easily applied to new texts. The model is also handy for fine-tuning and, thus, practical to apply.
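As a toy illustration of graph construction from textual clues (a drastic simplification of DAGN, which uses proper discourse parsing and learned node representations), we can split a passage into clause-level units at common discourse connectives and link adjacent units:

```python
import re

# A small, hand-picked list of discourse connectives (illustrative only).
CONNECTIVES = r"\b(but|so|since|because|although|therefore)\b"

def build_logic_graph(passage):
    """Split on sentence ends and connectives; link adjacent clause units."""
    units = [u.strip() for u in re.split(rf"(?:[.!?]|{CONNECTIVES})", passage)
             if u and u.strip() and not re.fullmatch(CONNECTIVES, u.strip())]
    edges = [(i, i + 1) for i in range(len(units) - 1)]
    return units, edges

text = ("There are no large cities near the mountain, so neither smog nor "
        "artificial light interferes with atmospheric transparency.")
units, edges = build_logic_graph(text)
print(units)   # clause-level nodes
print(edges)   # edges between adjacent units
```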

Logical reasoning can be applied in fact-checking, text summarization, and other tasks. Reasoning QA validates a system's reasoning capability by asking it questions.

Datasets and models for Logical Reasoning QA

Datasets for logical reasoning QA tasks exist, too. One of the most famous is LogiQA, based on the Chinese Civil Servants Examination; answering its questions requires logical reasoning. Each entry in this dataset has a text the model should analyze, a multi-choice question with a marked correct answer, and a reasoning type that denotes the kind of reasoning involved according to the laws of Aristotelian logic.

There are five types of reasoning: categorical (the most frequent), sufficient conditional, necessary conditional, disjunctive, and conjunctive.
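To picture what a LogiQA-style entry contains, here is a hypothetical record; the field names and the example itself are invented for illustration and are not copied from the dataset:

```python
# Hypothetical structure of a LogiQA-style record (illustrative only).
record = {
    "reasoning_type": "Categorical",          # one of the five types above
    "context": "All observatory sites must have clear skies. "
               "Mount Shalko has clear skies.",
    "question": "Which conclusion follows from the statements above?",
    "options": [
        "A. Mount Shalko is an observatory site.",
        "B. Mount Shalko satisfies one requirement for an observatory site.",
        "C. No other mountain has clear skies.",
        "D. Observatory sites are rare.",
    ],
    "answer": "B",                            # marked correct option
}
```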

MuTual is a retrieval-based dataset for multi-turn dialogue reasoning. There are many other datasets: ReClor, CLUTRR, etc. ReClor is an excellent dataset for both RC and logical QA. What is more, ReClor provides a leaderboard. From this leaderboard, we can see that one of the best logical reasoning models at the time of writing is the Rational reasoner single model (provided by HFL & iFLYTEK). This model is based on MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, which lets it perform self-supervised pre-training on abundant unlabeled text data.

Some other models with good evaluation results on ReClor are LReasoner (based on RoBERTa), Focal Reasoner, and ABLBERT.

Some researchers also use GPT-2 as a base model for further fine-tuning, making it capable of complex logical reasoning.
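A common way to apply such a language model to multi-choice reasoning is to score each option by the likelihood the model assigns to the full sequence; here is a minimal sketch with the Hugging Face transformers library (it shows the scoring idea only, not fine-tuning, and averaging the loss over the whole sequence is a simplification):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = ("There are no large cities near the mountain. "
           "Question: Why is the night sky dark there? Answer:")
options = [" Because there is no artificial light.",
           " Because the mountain is tall."]

scores = []
with torch.no_grad():
    for option in options:
        ids = tokenizer(context + option, return_tensors="pt").input_ids
        # Average token-level negative log-likelihood of the full sequence;
        # higher (less negative) means the model finds the option more plausible.
        loss = model(ids, labels=ids).loss
        scores.append(-loss.item())

print(options[scores.index(max(scores))])  # option with the highest likelihood
```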

Conclusion

In this topic, we've discussed two main problems of natural language understanding: machine reading comprehension and logical reasoning question answering. The topic is mostly theoretical because there are few open-source models for these two tasks, and those available are pretty difficult to implement, as they demand a lot of coding.

We have discussed the theoretical basis behind these two problems, as well as their datasets and models. In later topics, we'll discuss Information Retrieval, a problem closely related to RC and QA.
