When we learn a new language, we not only learn words but also the rules of how to put these words together so that they make sense. This set of rules is called syntax, and it is unique to each language. We can teach a computer to "understand" words by giving it a dictionary, but how can a computer understand sentence structure? Let's find out!
Syntactic parsing
When we see a sentence in our native language, we don't think about its structure. Look at the following sentence:
John invited Jane to the cinema.
We know that it was John who invited Jane to the cinema, not the other way around. And we know that because we know the syntactic rules in English. In other languages, the exact words in the same order may mean that it was Jane who invited John to the cinema.
We also make mistakes when constructing a sentence. For example, you can say:
They runs really fast
instead of They run really fast.
The mistake lies in the agreement between the plural pronoun and the verb. Linguists call such errors syntactic. Now, let's define what syntax is. Syntax is a part of natural language grammar; we use it to analyze elements larger than one word (token).
Sometimes one sentence can be interpreted differently depending on the situation. Let's look at this sentence:
John saw the man on the mountain with a telescope.
What does it mean? John, using a telescope, saw a man who was on the mountain, or John saw a man who was on the mountain and had a telescope. These are only two possible sentence interpretations. The different interpretations come from the ambiguity of the sentence structure, which is why such a phenomenon is called syntactic or structural ambiguity.
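A parser makes this ambiguity tangible: given a grammar that allows a prepositional phrase to attach either to a noun or to a verb, it returns one tree per reading. Here is a minimal sketch with the NLTK library (which we will revisit later in this topic); the toy grammar is our own assumption, written just for this sentence:

import nltk

# A toy grammar in which a PP may attach to a noun phrase or to a verb phrase
grammar = nltk.CFG.fromstring('''
S -> NP VP
NP -> 'John' | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'mountain' | 'telescope'
V -> 'saw'
P -> 'on' | 'with'
''')
parser = nltk.ChartParser(grammar)
sent = 'John saw the man on the mountain with a telescope'.split()
for tree in parser.parse(sent):
    print(tree)  # each printed tree corresponds to one reading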
To "understand" the meaning of the sentence, which meaning is more probable, and check whether the sentence is correct, a computer needs to understand the structural relationships between words in a sentence. The task of automatically analyzing the sentence is called syntactic parsing. As a result of parsing, we get a parse tree.
The two most commonly used types of syntactic parsing are dependency and constituency parsing. We will learn more about each of them in the following sections.
Dependency parsing
The idea behind both types of algorithms is that words in a sentence are connected by some kind of relationship. To determine the sentence structure with dependency parsing, we need to determine the dependencies between the words in a sentence. In this case, the parse trees are graphs with words as nodes and relationships as edges. Each relationship has one head and a dependent that modifies the head.
In dependency parsing, the only relationship between two elements is the one between a dependent construction (a token or a phrase) and its superior. The main problem in dependency parsing stems from the need to decide which element is more important. Let's see the following example:
I love mom and dad.
In this case, the root is the predicate love. We have one subject in this sentence, I, so one branch will go to I. We have two objects: mom and dad. Which one is more significant? There are multiple solutions to this issue; here is how spaCy solves it.
This is incorrect, as the conjunction word cannot be the endpoint.
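You can inspect spaCy's analysis yourself; here is a minimal sketch, assuming the small English model en_core_web_sm is installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I love mom and dad.')
for token in doc:
    # print each word, its relation tag, and its head
    print(token.text, token.dep_, '<-', token.head.text)
# spaCy typically attaches 'dad' to 'mom' with the conj relation
# and the conjunction 'and' to 'mom' with the cc relation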
Here is a list of some other problematic aspects:
How to parse dependent clauses?
How to parse clauses with syntactic ambiguity?
Now let's parse a sentence with Stanza. Stanza's syntactic module analyzes the dependencies in the sentence using Universal Dependencies and then composes them into a tree-like structure:
import stanza
nlp = stanza.Pipeline(lang='en',
processors = 'tokenize,mwt,pos,lemma,depparse')
my_doc = nlp('I like this lesson')
my_doc.sentences[0].print_dependencies()
# ('I', 2, 'nsubj')
# ('like', 0, 'root')
# ('this', 4, 'det')
# ('lesson', 2, 'obj')

We can present the result in the following graph:
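In Stanza's output, each word stores the index of its head (0 stands for the root) and the relation name, so we can also print the edges directly via the word.head and word.deprel attributes:

sentence = my_doc.sentences[0]
for word in sentence.words:
    # a head index of 0 means the word is the root of the sentence
    head = sentence.words[word.head - 1].text if word.head > 0 else 'ROOT'
    print(f'{word.text} <--{word.deprel}-- {head}')
# I <--nsubj-- like
# like <--root-- ROOT
# this <--det-- lesson
# lesson <--obj-- like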
The relationship tags stand for:
nsubj — a nominal subject;
obj — an object;
det — a determiner.
Dependency parsing is also available in:

spaCy — use the token.dep_ attribute to get a relationship tag (ROOT, nsubj, obj, etc.);
NLTK — via StanfordDependencyParser;
UDPipe — a state-of-the-art stochastic syntactic parser that requires training; UDPipe uses annotations like in Universal Dependencies;
LinguaKit — this tool can parse texts up to 5000 symbols.
You can learn more about dependency relations on the Universal Dependencies site.
Context-free grammar
Before talking about constituency parsers, let's talk briefly about the rules behind them. Usually, such parsers rely on a formal grammar. Context-free grammar (CFG), the most popular type, is one of the methods to define parsing rules for constituency parsing; apart from context-free grammars, there are also context-sensitive grammars, regular grammars, affix grammars, and many others. A formal grammar is context-free if its production rules can be applied regardless of the context of a nonterminal: no matter which symbols surround it, the single nonterminal on the left-hand side can always be replaced by the right-hand side. This is what distinguishes it from context-sensitive grammar.
Context-sensitive grammars (CSG) are more general than context-free grammars and less general than unrestricted grammars: CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy. It's important to understand that there are no practical CSG parsers! A CFG looks like this:
G = (Σ, N, P, S),

where

Σ is an alphabet of terminal symbols (the words themselves);
N is an alphabet of nonterminal symbols (categories like NP or VP);
P is a set of production rules of the form A → α, where A is a single nonterminal and α is a string of terminals and/or nonterminals;
S is the starting symbol, S ∈ N.
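For example, take a toy grammar with the rules S → NP VP, NP → 'I' | 'lessons', and VP → 'love' NP. A derivation rewrites one nonterminal at a time, and, because the grammar is context-free, each rewrite ignores the surrounding symbols:

S ⇒ NP VP ⇒ I VP ⇒ I love NP ⇒ I love lessons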
In the NLTK Chart parser, which we will use below, you need to define your own context-free grammar. That means writing the grammar yourself. Let's imagine we have the following grammar for the sentence I love this Hyperskill lesson. First, define all the words in our sentence (the terminal symbols):
'this' / 'love' / 'I' / 'Hyperskill' / 'lesson'.
Then, we should compile an alphabet of nonterminal symbols and the production rules. We can write them the way we want the parse tree to look… or write out all possible ways of parsing. Then we should define our start symbol S and load our grammar into the NLTK CFG parser:
import nltk
from nltk import CFG
from nltk.draw.tree import draw_trees
grammar = nltk.CFG.fromstring('''
S -> NP VP NP | NP VP | VP NP
NP -> Noun | Noun Noun | Dep Noun | Dep Noun Noun
VP -> Verb | Verb NP
Dep -> 'this'
Verb -> 'love'
Noun -> 'I' | 'Hyperskill' | 'lesson'
''')
parser = nltk.ChartParser(grammar)

The tags stand for: S — sentence, NP — noun phrase, VP — verb phrase, PRP — a personal pronoun, VBP — verb, non-3rd person singular present tense, DT — determiner, NN — noun.
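Now we can run the parser on our sentence. The snippet below continues the code above; with this grammar, the sentence gets three different parse trees:

# NLTK's tokenizer may require a one-time download: nltk.download('punkt')
sent = 'I love this Hyperskill lesson'
for tree in parser.parse(nltk.word_tokenize(sent)):
    print(tree)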
In NLTK, you can also draw trees (note that you cannot draw them in Jupyter Notebook or Google Colab):
draw_trees(*(tree for tree in parser.parse(nltk.word_tokenize(sent))))

Constituency parsing
The most popular parsing algorithms are based on constituency grammars. The main idea behind constituency is that a group of words may behave as a single unit, called a constituent. Usually, the meaning of a group of words differs from the sum of the meanings of the words in this group.
Two main types of word groups are noun phrases (NP) and verb phrases (VP). Noun phrases denote an object or subject: they, a well-arranged kitchen, this Hyperskill topic. Verb phrases, on the contrary, denote an action: cook a meal, go by train, listen attentively.
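To get a feel for noun phrases as single units, you can print spaCy's noun_chunks; a minimal sketch, again assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('They cooked a meal in a well-arranged kitchen.')
for chunk in doc.noun_chunks:
    print(chunk.text)
# They
# a meal
# a well-arranged kitchen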
Each sentence can be represented as a tree. To parse the sentence, the computer needs to know part-of-speech tags for every word in the sentence and the rules of how the phrases and sentences can be formed.
Let's parse the following sentence: I like this lesson. Once again, we will use the Stanza Python library. Here is the code snippet for accomplishing this task:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
my_doc = nlp('I like this lesson')
print(my_doc.sentences[0].constituency)
# (ROOT (S (NP (PRP I)) (VP (VBP like) (NP (DT this) (NN lesson)))))

The resulting parse tree can be presented hierarchically:
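Since Stanza returns the tree in bracketed notation, you can also render the hierarchy yourself by feeding the string to NLTK's Tree class (a small sketch that assumes nltk is installed alongside stanza):

from nltk import Tree

# parse the bracketed string and print the tree as ASCII art
tree = Tree.fromstring(str(my_doc.sentences[0].constituency))
tree.pretty_print()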
So, now we see that the tree in Stanza is the same as NLTK's 3rd (last) tree. In NLP, there are two ways of parsing: top-down and bottom-up; the top-down one is more popular for constituency parsing. Top-down parsers, in turn, can be left-to-right or right-to-left: a left-to-right parser chooses the leftmost tree, while a right-to-left one chooses the rightmost. In our case, the parser is right-to-left because it selects the last available tree.
You can try constituency parsing in other NLP libraries too:
NLTK — you can use it in many ways: RegexpParser, ChartParser, etc.
Constituency vs. dependency
To recap, a constituency parse tree always contains words as terminal nodes. Usually, each word has a parent node containing its part-of-speech tag (noun, adjective, verb), although this may be omitted in other graphical representations. Unlike constituency parsing, dependency parsing doesn't employ phrasal constituents or sub-phrases. Instead, the sentence syntax is expressed in terms of dependencies between words — that is, directed, typed edges between words in a graph.
The main advantage of dependency parsers is their ability to identify the superior element and its dependents in the sentence. Dependency parsing also performs better when parsing non-projective (discontinuous) and fragmented sentences. Discontinuity occurs when a given word or phrase is separated from another word or phrase that it modifies, so that a direct connection cannot be established between the two without crossing lines in the tree structure. Constituency parsing's advantage over dependency parsing is its ability to display the entire structure of a sentence rather than simply the word associations.
When we use dependency parsing, we see only the relationships between words, not the whole sentence structure. Dependency trees are therefore usually more concise, showing only the relationships between head words and their dependents, while constituency parsing displays the entire sentence structure. Dependency parsing is often praised for being concise yet informative, but constituency parsing is usually easier to read and understand.
One key advantage of dependency parsing over constituency parsing is that it can handle relatively free word order. Constituency parsing is more suitable for English and German, while for languages such as Russian, Hungarian, Latin, and Persian, dependency parsing works better.
Ultimately, we should mention that both constituency and dependency structures preserve formal isomorphism. Isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping. This fact allowed a Stanford University team to develop a system that converts a constituency tree into a dependency tree. Note that the reverse conversion isn't possible because the dependency tree is too ambiguous.
Corpora and models
Universal Dependencies (UD) is a framework for syntactic annotation that includes POS tags and syntactic relationships. There are over 200 UD-styled treebanks for over a hundred languages.
You can read more about UD annotation in the official Guidelines and possible UD Relations.
A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. Nowadays, there are lots of treebanks for world languages like English:
Penn Treebank is an English syntactic corpus that comprises a fully tagged version of the Brown Corpus and one million words of 1989 Wall Street Journal material;
Nine Universal Dependencies treebanks;
And many others.
Treebanks for other world languages:
More than 10 treebanks for French, including Projet Rhapsodie;
11 treebanks for German, including Hamburg Dependency Treebank;
Over 15 treebanks for Italian, including Turin University Treebank;
Around seven treebanks for Russian, including SynTagRus.
SHR++ is an application for morphosyntactic annotation of Sanskrit texts.
Corpus of Contemporary Romanian Language (CoRoLa) offers a UD-styled treebank of Romanian sentences.
Still, many minor and medium-sized languages have no treebanks: Saami, Khmer, etc.
These corpora are very helpful in training a model (in our case, a parser). As you could've guessed, UD has its own parser, UDPipe. Another famous parser is the Berkeley Parser; you can use it either via the spaCy pipeline or download the parser itself. FactLex and the Stanford Parser were developed by Stanford University.
Practical applications
Dependency parsing can be more useful for several downstream tasks like information extraction or question answering.
For example, it makes it easy to extract subject-verb-object (SVO) triples, which are often indicative of semantic relations between predicates. Although we could also extract this information from a constituency parse tree, it would require additional processing, while the information is immediately available in a dependency parse tree.
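Here is a minimal sketch of such an extraction with spaCy (the en_core_web_sm model and the dobj label that English spaCy models use for direct objects are assumptions of this particular pipeline):

import spacy

nlp = spacy.load('en_core_web_sm')

def svo_triples(text):
    # collect (subject, verb, object) triples straight from the dependency edges
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.dep_ == 'nsubj' and token.head.pos_ == 'VERB':
            verb = token.head
            for child in verb.children:
                if child.dep_ in ('dobj', 'obj'):
                    triples.append((token.text, verb.text, child.text))
    return triples

print(svo_triples('John invited Jane to the cinema.'))
# [('John', 'invited', 'Jane')]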
We can use both parse tree types to extract features for a supervised machine-learning model. The syntactical information they provide can be very useful for tasks like semantic role labeling, question answering, and others. In general, we should analyze our situation and assess which type of parsing will best suit our needs.
Parsers can also be very useful in grammar checkers. If we can't parse a sentence, then there is something wrong with it.
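As a toy illustration, we can reuse NLTK's chart parser with a tiny grammar of our own: if the parser finds no tree, the sentence is ill-formed with respect to that grammar:

import nltk

grammar = nltk.CFG.fromstring('''
S -> NP VP
NP -> 'I' | 'lessons'
VP -> V NP
V -> 'love'
''')
parser = nltk.ChartParser(grammar)

def is_grammatical(tokens):
    # a sentence is acceptable if at least one parse tree exists
    return any(parser.parse(tokens))

print(is_grammatical(['I', 'love', 'lessons']))  # True
print(is_grammatical(['love', 'I', 'lessons']))  # False: no derivation exists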
Conclusion
In this topic, we discussed the basic theory of syntactic parsing and how to implement it in Stanza: we showed how to do dependency and constituency parsing. In addition, we discussed context-free grammar (an important part of constituency parsing) and showed how to write your own grammar in NLTK. We have also mentioned corpora and models that might help you in the future.
Remember that syntactic parsing is often used as a part of data preprocessing in information extraction, question answering, chatbot creation, and much more. Choosing an algorithm depends on many factors, such as the language and the task. As we have said, every language has its own syntax rules, and no universal parsing algorithm exists.
You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.