NLP helps with many practical tasks in business and industry. However, it also covers material with no immediately obvious benefit, and this material is the foundation of all NLP techniques. In this topic, we will discuss semantic and logical constructions. We will also take a look at the Google Books Ngram Viewer and Google Trends. Let's start with a purely theoretical subject: comparative linguistics.
Comparative linguistics
English mead (an alcoholic drink made from honey; IPA phonetic transcription: [miːd]) sounds similar to:
Greek μέθυ (wine; IPA: [ˈme.θi]),
Russian мёд (honey; IPA: [mʲɵt]) and медовуха (mead; IPA: [mʲɪdɐˈvuxə]),
Welsh meddw (drunk; IPA: [ˈmɛðu]),
Persian می (wine; IPA: [maj]),
Hindi मधु (honey; IPA: [mə.d̪ʱuː]), and so on.
This similarity stems from the fact that these languages are genetically related: they belong to one language family, the Indo-European family. All these words descend from the Proto-Indo-European word *médʰu (honey, wine). The asterisk (*) before a word means that the word is reconstructed rather than attested in writing.
The same applies to many other words. However, we cannot say that 100% of the words in English, Spanish, or Persian are Indo-European. Many languages have absorbed words from other language families. Persian, for example, has plenty of words from Semitic and Turkic languages. Spanish, Portuguese, and the southern dialects of Italian were heavily influenced by Semitic languages. Indo-Aryan languages have been in direct contact with Dravidian languages for centuries and were also influenced by Semitic and Turkic languages.
The Indo-European family consists of many groups:
Germanic: German, Swedish, Danish, Dutch, English, and others.
Romance: Italian, French, Spanish, Romanian, and others.
Slavic: Russian, Polish, Serbian, and others.
Indo-Iranian:
Iranian: Persian, Kurdish, Pashto, and others.
Indo-Aryan: Hindi-Urdu, Bengali, Sinhala, and others.
Other groups include the following: Armenian, Hellenic, Celtic, Baltic, Albanian, and Anatolian.
Other language families (this is not a full list):
Sino-Tibetan: Burmese, Mandarin Chinese, Tibetan, and others.
Afroasiatic: Arabic, Hebrew, Ancient Egyptian, and others.
Turkic: Turkish, Kazakh, Azerbaijani, Tatar, and others.
Koreanic: Korean and Jeju.
Japonic: Japanese and minor languages of the Japanese archipelago.
Uralic: Finnish, Hungarian, Estonian, and minor languages in Russia.
The section above is important for understanding how tokenization, lemmatization, and morphological parsing work. To better understand what is behind these processes, we need to talk about how words are formed in each language. For this, let's look at another language classification, one based on morphological features. According to it, all languages of the world can be:
Analytic — grammatical relationships are expressed through syntax: function words indicate grammatical categories. Example: English I will wait vs. Italian aspetterò. English needs two function words where Italian packs the same meaning into a single inflected form, so we can say that English is more analytic than Italian.
Synthetic — grammatical relationships are expressed inside a single word through affixation, suppletion, inflection, and word stress. This type has two main subgroups:
Agglutinative. Each affix represents only one grammatical category. Example: in Kazakh, dos means friend, dos-tar means friends (plural affix -tar), and dostar-im means my friends (possessive affix -im). If we want to say to my friends, we need to add the dative affix -a: dostarim-a.
Fusional. One morpheme can represent many grammatical categories at once. Example: in the Latin word lupus (wolf), the single ending -us marks the nominative case, singular number, and masculine gender.
We can generalize that the majority of Indo-European and Semitic languages are fusional. The exceptions are English, Danish, Swedish, Norwegian, Macedonian, and Bulgarian, which are largely analytic.
For analytic and agglutinative languages, stemming usually works well; for fusional languages, lemmatization is the better choice, as the sketch below illustrates.
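Here is a minimal sketch of the difference between the two techniques, using NLTK's Snowball stemmer and WordNet lemmatizer (this assumes the nltk package is installed and the wordnet data can be downloaded):

```python
# Comparing stemming and lemmatization on English words with NLTK.
import nltk
nltk.download("wordnet", quiet=True)  # data needed by the lemmatizer

from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

words = ["friends", "studies", "waiting"]
print([stemmer.stem(w) for w in words])          # ['friend', 'studi', 'wait']
print([lemmatizer.lemmatize(w) for w in words])  # ['friend', 'study', 'waiting']
```

The stemmer simply chops off suffixes, sometimes producing non-words like studi, while the lemmatizer maps each word to a dictionary form.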
There are also some features that characterize various languages:
Agglutinative and analytic languages are easier to parse than fusional ones;
Indo-Aryan languages are difficult for a computer to read because their scripts do not spell vowels out as separate letters. Take Sinhala අපි (we; IPA: [apiː]). This word consists of two syllables:
අ — a, and පි — pi.
Tokenization in Sino-Tibetan languages is even more difficult because there are no spaces between tokens. Exceptions are Burmese and Tibetan because they don't use Chinese characters; they use Indic-style scripts instead. A tokenization sketch for Chinese follows below.
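As an illustration of the no-spaces problem, here is a minimal sketch that segments a Chinese sentence with jieba, a popular third-party Chinese word segmentation package (the package choice is an assumption; any segmenter would do):

```python
# Splitting Chinese text into tokens: there are no spaces to split on,
# so a dictionary-based segmenter such as jieba is needed.
import jieba  # pip install jieba

text = "我爱自然语言处理"  # "I love natural language processing"
print(jieba.lcut(text))   # e.g. ['我', '爱', '自然语言', '处理']
```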
Semantic constructions
VerbNet is similar to WordNet, but it contains only verbs. Take a look at the official VerbNet website; there you can also find VerbNet versions for Arabic, Basque, and Spanish. VerbNet groups verbs with identical sets of syntactic frames and semantic predicate structures, as in the example below:
These classes may inherit frames from a parent class, resulting in a hierarchical structure.
There's also a spin-off project of VerbNet, Visual VerbNet, which provides a set of images for each action.
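NLTK ships a corpus reader for VerbNet, so you can query the classes programmatically. A minimal sketch, assuming nltk is installed and the verbnet corpus data has been downloaded:

```python
# Looking up VerbNet classes for a verb with NLTK's corpus reader.
import nltk
nltk.download("verbnet", quiet=True)

from nltk.corpus import verbnet

# Class IDs group verbs sharing syntactic frames and semantic predicates.
print(verbnet.classids(lemma="give"))  # e.g. ['give-13.1']
print(verbnet.lemmas("give-13.1"))     # other verbs in the same class
```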
FrameNet is like WordNet, but for frames. A frame is a schematic representation of a situation from the point of view of one of its participants. For example, the word sell evokes the frame of a commercial transaction from the seller's point of view, while the word buy evokes it from the buyer's. We can establish the following relationships between frames:
Inheritance
Perspectivized_in
Subframe
Precedes
Causative_of and Inchoative_of
Using
See_also
The original English FrameNet is available on this website. There you can see the frame graph for English. Let's draw the graph for the Abusing frame with the following settings:
show only inheritance relationship
maximum number of generations out from the current frame to see — 2
maximum number of peripheral children frames to see — 2
Let's see what we've got:
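FrameNet is also accessible from NLTK. A minimal sketch of inspecting the same frame in code, assuming nltk is installed and the framenet_v17 corpus data has been downloaded:

```python
# Exploring the Abusing frame with NLTK's FrameNet corpus reader.
import nltk
nltk.download("framenet_v17", quiet=True)

from nltk.corpus import framenet as fn

frame = fn.frame("Abusing")
print(frame.definition[:80])
# Frame-to-frame relations this frame participates in
# (Inheritance, Using, and so on):
for relation in frame.frameRelations:
    print(relation)
```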
Logical constructions
Occasionally, it is important to see the logical relationship between sentences. This is relevant, for example, if you need to align the sentences of two raw texts: you can use either a plain sentence aligner or Natural Language Inference (NLI).
NLI allows us to see the logical relations between two sentences, A and B. For example, suppose we have three sentences:
A: "Mary loves cooking"
B: "Mary eats spaghetti"
C: "Mary goes to the restaurant every evening".
NLI measures three scores: entailment, neutral, and contradiction. A high contradiction score means the two sentences conflict with each other, while a high neutral score means they are simply unrelated. In our example, we will get the following results:
A&B: the entailment score is moderate, the contradiction is low, the neutral is high.
A&C: the entailment score is low, the contradiction is high, the neutral is low.
B&C: the entailment score is very high, the contradiction is very low, the neutral is low.
NLI models may be one-way, two-way, or three-way. We have discussed a three-way model above. Two-way NLI models lack the neutral score, and one-way models provide only the entailment score. The sketch below runs a three-way model.
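Here is a minimal sketch of three-way NLI with the Hugging Face transformers library, using the publicly available roberta-large-mnli model (the model choice is an assumption; any NLI model would work):

```python
# Scoring an NLI pair: premise vs. hypothesis with a three-way MNLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "Mary loves cooking"
hypothesis = "Mary eats spaghetti"

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# The model config maps output indices to its three labels.
for idx, prob in enumerate(probs):
    print(model.config.id2label[idx], f"{prob:.3f}")
```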
Word of the year
Some Google tools can help you with statistics on word usage. This may be helpful for analyzing trends in a specific period or place.
Google N-grams is an online search engine that charts the frequencies of any set of search strings, using a yearly count of n-grams found in printed sources published between 1500 and 2019 (depending on the corpus you choose).
Google N-grams provides several corpora: English Fiction, American English, British English, etc. Most of them are dated 2019, 2012, and 2009. Available languages: English, Chinese, Hebrew, French, German, Russian, Italian, and Spanish.
You can enter any words in the search engine. Don't forget to separate them with commas (no spaces):
Here is a plot of the mention frequencies of three 20th-century authors: Thomas Mann, Gabriele D'Annunzio, and Louis-Ferdinand Céline. As you can see, Thomas Mann is referenced the most.
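If you want the raw numbers rather than the chart, the viewer also serves JSON from an unofficial endpoint. A minimal sketch (the endpoint and its parameters are undocumented assumptions and may change without notice):

```python
# Fetching raw frequency series from the Ngram Viewer's unofficial JSON endpoint.
import requests

params = {
    "content": "Thomas Mann,Gabriele D'Annunzio",  # comma-separated, no spaces
    "year_start": 1900,
    "year_end": 2019,
    "corpus": "en-2019",  # the English 2019 corpus
    "smoothing": 3,
}
response = requests.get("https://books.google.com/ngrams/json", params=params)
for series in response.json():
    print(series["ngram"], series["timeseries"][:5])  # first five yearly values
```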
Google Trends provides statistics on what people search for. In the example below, we look at how often those three authors were searched for over one year (October 2021 to October 2022) worldwide.
You can notice here that Gabriele D'Annunzio enjoyed unprecedented attention in June 2022.
Let's see who is searched more in each country:
On the left graph, we see that D'Annunzio is the most popular of the trio. On the right, we see a map with the most popular author in various regions. You can also customize Google Trends and adjust the category, region, timeline, and where to search (Google, YouTube, Google News Search, and so on).
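Google Trends data can also be pulled programmatically with pytrends, a third-party client (an assumption here; it scrapes an unofficial API and may break at any time):

```python
# Querying Google Trends through the unofficial pytrends client.
from pytrends.request import TrendReq  # pip install pytrends

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    ["Thomas Mann", "Gabriele D'Annunzio"],
    timeframe="2021-10-01 2022-10-01",  # the period used above
)
print(pytrends.interest_over_time().head())  # weekly search interest
print(pytrends.interest_by_region().head())  # interest per country
```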
Conclusion
In this topic, we've discussed comparative linguistics and how it's linked to such NLP tasks as tokenization, lemmatization, and morphological parsing. We found out that words in analytic and agglutinative languages are better stemmed, while words in fusional languages are better lemmatized.
We have also talked about semantic and logical constructions, which are fundamental to NLP.
We also saw how to use Google Books N-gram Viewer and Google Trends.
You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.