
Information Retrieval and Question Answering


Information retrieval is a broad area of computer science whose main purpose is to obtain the information a user needs from a collection of documents. In this topic, we'll talk about the concepts behind it.

What is information retrieval?

Information retrieval is the process of collecting the required information from various sources. It is not limited to NLP applications: the information you're looking for may be images, videos, audio, geolocations, or texts. Because accurate search is needed in so many areas, information retrieval is used in:

  • Search engines (like Google or Bing);

  • Digital assistants (like Siri, Cortana, and Google Assistant);

  • Question-answering systems;

  • Spam filters, and so on.

Search engines are powerful tools we use almost every day to find information on the Internet. In the next section, we will consider them in more detail.

Search engines

We usually use search engines to find answers to questions, such as How many minutes to boil eggs? or What city is the capital of Pakistan? But there are search engines that perform more complex tasks, such as maintaining applications or finding materials for construction. Commercial search engines, such as Elasticsearch, help store petabytes of data in private storage and find the required information easily and quickly. This is useful for monitoring applications, supporting large databases, and so on. Another type is the industrial search engine, which helps you find suppliers and buyers of products and make big deals; Thomasnet is an example of an industrial search engine. Commercial and industrial search engines are used by large companies, such as Boeing, NASA, Audi, Lenovo, and many others.

The main question of information retrieval is how to find the required information in the data. There are many approaches to information retrieval; let's learn more about some of them!

Imagine that you think of a number from 1 to 100 (for example, 67), and your friend tries to guess it. Calling random numbers would take a long time. But if your friend starts from 50 and you give the cue the number is bigger than 50, your friend can immediately reject half of the candidates! If your friend then calls 75 and you say less, the range shrinks again. Repeating this process a few times leads to the correct answer much faster!

This is an example of binary search. The idea of this approach is to repeatedly take the middle value of a sorted list and decide whether to continue searching on the left or the right side, until the correct value is found. The values must be sorted for binary search to work, so it suits anything comparable, such as numbers or alphabetized text entries.
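The guessing game above can be sketched as a plain Python function (a minimal hand-rolled version for illustration):

```python
def binary_search(items, target):
    """Return the index of target in a sorted list, or -1 if it is absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            low = mid + 1    # the target is bigger: continue in the right half
        else:
            high = mid - 1   # the target is smaller: continue in the left half
    return -1

binary_search(list(range(1, 101)), 67)  # 66 -- the index of the number 67
```

Each iteration halves the remaining range, just like the guessing game, so at most about seven questions are needed for 100 candidates.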

Binary search could be implemented using just basic Python tools, like lists and loops, but there is the bisect package that implements this approach. It is a built-in package, so you don't need to install it; just import it.

Let's implement the binary search: we'll find a book title by the author's name. The bisect() function returns the index of the sought-for value, so we can use that index to look up the title of the book.

Remember that when bisect() finds an exact match, it returns the insertion point to the right of it, which is one greater than the element's index. You should take this into account when working with sequence indexes.

from bisect import bisect

# a list of (title, author) tuples sorted by author
library = [('Peter Pan', 'Barrie, J.M.'),
           ('Alice in Wonderland', 'Carroll, L.'),
           ('Oliver Twist', 'Dickens, Ch.'),
           ('Harry Potter and the Chamber of Secrets', 'Rowling, J.K.'),
           ('The Lord of the Rings', 'Tolkien, J.R.R.')]

title, author = zip(*library)
book_id = bisect(author, 'Carroll, L.')  # 2

# find the title of the book by index
book = title[book_id - 1]  # 'Alice in Wonderland'
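If the off-by-one bookkeeping feels error-prone, the same module offers bisect_left(), which returns the index of an existing element directly (a small variation on the example above):

```python
from bisect import bisect_left

# the same library, split into parallel tuples sorted by author
authors = ('Barrie, J.M.', 'Carroll, L.', 'Dickens, Ch.',
           'Rowling, J.K.', 'Tolkien, J.R.R.')
titles = ('Peter Pan', 'Alice in Wonderland', 'Oliver Twist',
          'Harry Potter and the Chamber of Secrets', 'The Lord of the Rings')

idx = bisect_left(authors, 'Carroll, L.')  # 1 -- the index of the match itself
book = titles[idx]                         # 'Alice in Wonderland'
```

Note that both functions return an insertion point even when the value is absent, so in real code you should check that the element at the returned index actually equals your query.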

The advantages of binary search:

  • It's quite easy to understand and implement;

  • Works well on large datasets;

  • Much faster than a linear scan on large sorted data.

But binary search is applicable only in limited settings: the array must be sorted, and the elements must be comparable. Keeping the data sorted as it changes also adds overhead. Let's consider another approach: keyword search.

A keyword search algorithm looks for the necessary information by one or more keywords. It is most useful when you have structured data, like a table or a database of questions and answers. The algorithm takes the keywords from the user's query and compares them with the content of the documents in the QA base. A real-life analogy of keyword search is looking for a book in a library: if you know the title and the author of the book, it's easy to find and get it.

Keyword search is the most powerful and well-suited of all information retrieval approaches if you are dealing with a knowledge-based question-answering system. Its disadvantage is that a relevant document may not be found if it contains only synonyms of a keyword rather than the keyword itself.
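A minimal sketch of this idea over a toy QA base (the data and the overlap-count scoring are invented for illustration):

```python
import re

def tokens(text):
    """Lowercased word set of a text, ignoring punctuation."""
    return set(re.findall(r'\w+', text.lower()))

def keyword_search(query, qa_base):
    """Rank stored questions by the number of keywords shared with the query."""
    query_words = tokens(query)
    scored = []
    for question, answer in qa_base.items():
        overlap = len(query_words & tokens(question))
        if overlap:  # keep only questions sharing at least one keyword
            scored.append((overlap, question, answer))
    scored.sort(reverse=True)  # most shared keywords first
    return scored

qa_base = {
    'How long should I boil eggs?': 'Around 7-10 minutes for hard-boiled eggs.',
    'What city is the capital of Pakistan?': 'Islamabad is the capital of Pakistan.',
}

keyword_search('capital of Pakistan', qa_base)
# [(3, 'What city is the capital of Pakistan?',
#   'Islamabad is the capital of Pakistan.')]
```

Note how a query phrased as "main city of Pakistan" would miss nothing here, but a query using only synonyms of the stored keywords would return an empty list, which is exactly the weakness described above.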

Let's imagine you want to find all uses of the word color on a web page. Of course, you can press Ctrl + F and get some results. But what if the word is capitalized? Or what if the page uses the British spelling colour? In that case, to find all uses of the word, you need a fuzzy search.

Fuzzy search, as you can tell from the name, does not look for a complete match but allows you to search for a string that does not exactly match the pattern. It can be done with the edit distance. Some fuzzy search packages are implemented using the Levenshtein distance, which supports character substitutions, insertions, and deletions. fuzzysearch is one of these packages in Python. Let's take a look at how it works!

You can install it using the pip install fuzzysearch command. The package has two functions, find_near_matches() and find_near_matches_in_file(), to search for a substring in strings and in files, respectively. You can set the maximum allowed number of mismatched characters with the max_l_dist parameter.

from fuzzysearch import find_near_matches
find_near_matches('Tim', 'Tim, Tom and Bob went for a walk', max_l_dist=1)

#[Match(start=0, end=3, dist=0, matched='Tim'),
# Match(start=5, end=8, dist=1, matched='Tom')]

Fuzzy search is a good solution when there are typos, spelling errors, or other spelling differences in the document. It can also help find words with different grammatical forms, such as verb tense (write - wrote - writes) or noun number (apple - apples). To do this, specify the word's lemma as the pattern and set the maximum number of changes that can occur with this lemma. Keep in mind that fuzzy search can sometimes return extra substrings that you did not plan to receive.
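The edit distance underlying such packages can be sketched in pure Python (a textbook dynamic-programming version, not fuzzysearch's actual implementation):

```python
def levenshtein(a, b):
    """Minimum number of character substitutions, insertions, and deletions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

levenshtein('color', 'colour')    # 1 -- one inserted character
levenshtein('kitten', 'sitting')  # 3 -- the classic textbook example
```

With max_l_dist=1, a fuzzy search for color would therefore also match colour, since the two spellings are one edit apart.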

Vector-space models

You are already familiar with vectors as a text representation in NLP. This approach is also used for information retrieval. The most popular vector-space models are TF-IDF, Word2Vec, Doc2Vec, and BERT. Let's look at how a vector-space model works in a question-answering system.

The questions in the database are stored as vectors. When a user inputs a query, it is converted to a vector and compared pairwise with each question vector. Using a text similarity metric, such as cosine similarity, the algorithm finds the question vector most similar to the user's query. The results are then ranked from most to least suitable according to the metric. By setting a threshold on the metric value (for example, keeping only documents that are 80% similar to the query), you can limit the number of documents in the output. Some vector-space models, like Word2Vec and Doc2Vec, also let you specify the number of returned documents with a special parameter.
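The ranking step can be sketched with simple bag-of-words vectors and cosine similarity (a toy illustration with made-up data; real systems use TF-IDF weights or neural embeddings):

```python
import math
import re

def bow_vector(text, vocab):
    """Bag-of-words counts of a text over a fixed vocabulary."""
    words = re.findall(r'\w+', text.lower())
    return [words.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

questions = ['How long should I boil eggs?',
             'What city is the capital of Pakistan?']
query = 'capital city of Pakistan'

# shared vocabulary over all stored questions plus the query
vocab = sorted({w for text in questions + [query]
                  for w in re.findall(r'\w+', text.lower())})
query_vec = bow_vector(query, vocab)

# rank stored questions from most to least similar to the query
ranked = sorted(questions,
                key=lambda q: cosine(bow_vector(q, vocab), query_vec),
                reverse=True)
# ranked[0] == 'What city is the capital of Pakistan?'
```

Thresholding on the cosine value instead of sorting would give the "keep only documents at least 80% similar" behavior described above.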

Vector-space models have a great advantage over the fuzzy search and keyword search approaches: they don't require an exact match with the query and can pick up synonyms of the query words, which makes this approach very flexible. But Doc2Vec, BERT, and other vector models require a large amount of training data, which can be a serious limitation in their use.

Conclusion

In this topic, you have learned more about information retrieval, the applications where it's used, and search engines. You also learned about information retrieval approaches, such as binary search, keyword search, fuzzy search, and vector-space models, and their pros and cons. You also saw how to implement some of these approaches using the bisect and fuzzysearch Python packages. Let's go and try to do it in practice!
