Teaching a computer to read a document and answer general questions about it is a challenging and still unsolved problem. One of the crucial tasks is to make computers understand text, a problem known as Natural Language Understanding (NLU). NLU is a broader concept that also covers automatic speech recognition.
In this topic, we will discuss the problem of Machine Reading Comprehension (RC) along with datasets and models for this task. RC is also closely related to the problem of logical reasoning for AI, so we'll discuss logical reasoning QA too. Together, RC and logical reasoning QA make up the problem of getting AI to understand a text at the level of a human being.
Machine reading comprehension
Computers lack the reasoning and sensitivity any human has. It's difficult to convey an emotion to artificial intelligence (though there have been some attempts). In this sense, the most complicated task for AI is to comprehend a novel or a poem.
Take, for example, Louis-Ferdinand Celine's Journey to the End of the Night. AI can extract facts from this book: the main character, a French student of medicine, decides to participate in World War I, then ends up in a hospital for the mentally ill, then becomes a French colonial officer in Africa, and so on. But AI cannot understand the feelings of the main character when he witnesses the atrocities of war, or his emotions during his later adventures in Africa and then in the US. Nor can AI perform a literary analysis of a novel: it cannot understand why Dante placed some Florentine politicians in Hell and others in Heaven. To answer such questions, AI needs to know Dante's biography and Renaissance Italian politics. Again, this can be done if we provide our AI system with some data source (Wikipedia is the best choice), but that is a separate task: open-domain QA.
That was about comprehension of novels, but every text contains references of some kind, and AI cannot grasp them the way an average educated person can.
Machine reading comprehension (or reading comprehension, RC) is the task of making artificial intelligence understand a text. This "understanding" is measured by having the AI answer questions, for example a cloze test, over a dataset.
When a machine comprehension dialog involves multiple co-referenced questions, for example when a later question logically follows from an earlier one, the challenge is called Conversational Machine Comprehension (CMC).
Datasets for RC
Genuine reading comprehension is challenging since effective comprehension involves a thorough understanding of documents and sophisticated inference. To tackle machine reading comprehension, several works in recent years have collected various datasets in the form of questions, paragraphs, and answers. A couple of large-scale cloze-style datasets have gained significant attention along with powerful deep-learning models. Recall that cloze-style datasets consist of statements with a blank to be filled in, for example: "Dante was born in ___." These datasets are similar to the reading tests you complete in your foreign language classes.
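To make the cloze format concrete, here is a minimal sketch that asks a masked language model to fill in such a blank. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by the datasets themselves:

```python
# A minimal cloze-style illustration, assuming the Hugging Face
# `transformers` library is installed (pip install transformers).
from transformers import pipeline

# Masked language models fill a [MASK] token, which mirrors the
# blank in a cloze question such as "Dante was born in ___."
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Dante was born in [MASK]."):
    # Each prediction carries a candidate token and its probability.
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```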
Such datasets intersect with datasets for QA tasks. We recommend paying attention to four formats: simple questions, queries, cloze, and completion:
SQuAD and the CNN/Daily Mail dataset are actively used for both QA and RC. SQuAD is a famous extractive QA dataset. The SQuAD website provides a live leaderboard that ranks the best QA models evaluated on SQuAD by their F1 score, and it changes over time. The CNN/Daily Mail dataset contains news articles and cloze quizzes.
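The F1 measure on the SQuAD leaderboard compares predicted and gold answer spans at the token level. Below is a simplified sketch of that metric; the official evaluation script additionally normalizes case, punctuation, and articles, which we omit here:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span.

    A simplified version of the SQuAD metric: the official script
    also normalizes case, punctuation, and articles before counting.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the city of Florence", "Florence"))  # ~0.333
```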
Cloze datasets are very popular for training RC models. Recent approaches to cloze-style datasets can be separated into two categories: single-turn and multi-turn reasoning.
Some other RC datasets are:
RACE is collected from English reading comprehension exams for Chinese middle and high school students;
DuoRC is based on pairs of movie plot descriptions from Wikipedia and IMDb;
QAMR was built with Amazon's crowdsourcing platform. It represents the predicate-argument structure of a sentence as a set of question-answer pairs.
RC models
Single-turn reasoning models employ attention mechanisms to emphasize specific parts of the document which are relevant to the query. These attention models subsequently calculate the relevance between a query and the corresponding weighted representations of document subunits (for example, sentences or words) to score target candidates. Such models include EpiReader.
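To illustrate the idea, here is a small NumPy sketch of the single-turn attention step. The random vectors stand in for real encoder outputs, so this is schematic, not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: one query vector and one vector
# per document sentence (in a real model these come from an RNN
# or transformer encoder, not from a random generator).
query = rng.normal(size=64)           # encoded question
sentences = rng.normal(size=(5, 64))  # encoded document subunits

# Relevance of each sentence to the query: dot product + softmax.
scores = sentences @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Query-aware document summary: the attention-weighted mixture
# that a single-turn reader would pass on to the answer scorer.
summary = weights @ sentences
print(weights.round(3), summary.shape)
```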
Existing multi-turn reasoning models have a predefined number of hops or iterations in their inference without regard to the complexity of each query or document.
The first RC model, QUALM, was developed in 1977. This early work set a strong vision for language understanding, but the systems built at that time were small, limited to hand-coded scripts, and difficult to generalize to broader domains. DEEP READ, a rule-based bag-of-words solution with shallow linguistic processing such as stemming, semantic class identification, and pronoun resolution, was developed in 1999. QUARC followed in 2000; it is an RC model based on manual rules for lexical and semantic correspondence. Current RC models are mainly neural.
One of the first end-to-end neural RC models, the Attentive Reader, was developed in 2015. If you're interested, take a look at the timeline of RC dataset and model development during 2015-2018. Models are marked in blue, and datasets in black.
A more recently proposed model is the Reasoning Network (ReasoNet). It is a neural network model that tries to simulate the inference process of a human reader: it reads the question first and then, keeping the question "in mind", re-reads the text until it finds the answer.
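The contrast between fixed-hop models and ReasoNet's learned stopping can be sketched roughly as follows. This is only a schematic: the real ReasoNet trains its termination gate, attention, and state update jointly (with reinforcement learning), whereas here the gate is faked with a simple confidence heuristic:

```python
import numpy as np

def read_with_hops(query, memory, max_hops=5, stop_threshold=0.9):
    """Schematic multi-turn reading loop in the spirit of ReasoNet.

    `query` and the rows of `memory` are vector encodings. A fixed-hop
    model would always run the loop `max_hops` times; ReasoNet instead
    learns a termination gate that decides when to stop reading.
    """
    state = query.copy()
    for hop in range(max_hops):
        scores = memory @ state
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        state = 0.5 * state + 0.5 * (weights @ memory)  # state update
        if weights.max() > stop_threshold:  # stand-in termination gate
            break
    return state, hop + 1

rng = np.random.default_rng(1)
state, hops = read_with_hops(rng.normal(size=32), rng.normal(size=(8, 32)))
print(f"stopped after {hops} hop(s)")
```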
Other RC models include AoA Reader, Iterative Attention Reader, DER Network, CLER (Cross-task Learning with Expert Representation), D-NET, etc.
For cloze-style and multi-choice MRC, a common evaluation metric is accuracy.
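Accuracy is simply the share of questions answered with the exact correct option, as in this toy snippet:

```python
# Accuracy for multi-choice or cloze MRC: the share of exact matches.
predictions = ["B", "A", "D", "C"]   # model's chosen options (toy data)
gold        = ["B", "C", "D", "C"]   # reference answers (toy data)

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 0.75
```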
Logical reasoning QA
Logical reasoning is of vital importance to natural language understanding. Logical reasoning QA requires a machine to understand the logic behind the text, for example, identifying the logical components, logical relations, or fallacies.
Some researchers propose discourse-aware graph networks (DAGN) to build logical graphs and learn logic representations accordingly. Such models analyze a context text (the "passage" on the left of the picture), underline facts, premises, and a conclusion, and then try to answer the input question.
Below is an example of multi-choice logical reasoning QA and a logical structure-based solution (on the right). The logical units are sentences or clauses, and the model performs a multi-hop reasoning process from supporting premises or refuting evidence to the conclusion.
The proposed logic graph construction uses generic textual clues and logic theories, so it is easily applied to new texts. This makes the model convenient to fine-tune and use in practice.
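As a rough illustration of the graph construction idea (not the actual DAGN implementation), one can split a passage into elementary units at discourse connectives and link consecutive units; the networkx library is assumed to be installed:

```python
# A rough, simplified illustration of discourse-based graph building,
# not the actual DAGN implementation. Assumes `networkx` is installed.
import re
import networkx as nx

passage = ("The museum raised ticket prices because attendance grew. "
           "Therefore, revenue increased, although some visitors left.")

# Split into elementary units at sentence ends and discourse connectives.
CONNECTIVES = r"\bbecause\b|\btherefore\b|\balthough\b|[.]"
units = [u.strip(" ,") for u in re.split(CONNECTIVES, passage.lower())
         if u.strip(" ,.")]

graph = nx.DiGraph()
graph.add_nodes_from(units)
# Link consecutive units; DAGN additionally types the edges by the
# connective and learns logic representations over the graph.
graph.add_edges_from(zip(units, units[1:]))

print(list(graph.edges))
```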
Logical reasoning can be applied in fact-checking, text summarization, and other tasks. Reasoning QA validates a system's reasoning capability by asking it questions.
Datasets and models for Logical Reasoning QA
Datasets for logical reasoning QA tasks exist, too. One of the most famous is LogiQA, based on questions from the Chinese Civil Servants Examination translated into English. Answering questions from this dataset requires logical reasoning. Below is a fragment from this dataset. It has a text that the model should analyze, followed by a multi-choice question with the correct answer marked. The reasoning type, on the far left, denotes the kind of reasoning involved according to the laws of Aristotelian logic.
MuTual is a retrieval-based dataset for multi-turn dialogue reasoning. There are many other datasets: ReClor, CLUTRR, etc. ReClor is an excellent dataset for both RC and logical QA, and it also provides a leaderboard. According to this leaderboard, one of the best logical reasoning models today is the Rational reasoner single model (provided by HFL & iFLYTEK). This model is based on MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, which lets it perform self-supervised pre-training on abundant unlabeled text data.
Some other models with good evaluation results on ReClor are LReasoner (based on RoBERTa), Focal Reasoner, and ABLBERT.
Some researchers also use GPT-2 as a base model for further fine-tuning, making it capable of a degree of complex logical reasoning.
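A common zero-shot baseline along these lines scores each answer option by GPT-2's language-model loss on the context plus that option and picks the most likely one. This is one possible approach, not the exact method of any specific paper; it assumes torch and transformers are installed:

```python
# Sketch of a GPT-2 multi-choice baseline: score each option by LM loss.
# One possible approach, not the method of any specific paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "All birds can fly. Penguins are birds. Therefore, penguins"
options = [" can fly.", " cannot fly.", " are fish."]

def option_loss(context: str, option: str) -> float:
    """Average LM loss of context + option; lower means more likely."""
    ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean next-token NLL
    return out.loss.item()

best = min(options, key=lambda o: option_loss(context, o))
print("GPT-2 picks:", best)
```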
Conclusion
In this topic, we've discussed two main problems of Natural Language Understanding: machine reading comprehension and logical reasoning question answering. The topic is mostly theoretical because there are few open-source models for these two tasks, and those available are fairly difficult to implement, as they demand a lot of coding.
We have covered the theoretical basis behind these two problems, along with their datasets and models. In later topics, we'll discuss Information Retrieval, a problem closely related to RC and QA.