Computer scienceData scienceNLPFormal grammar

Deep understanding of context-free grammar

18 seconds read

In this topic, we will study some specific details of context-free grammar implementation. For example, we'll discuss the nature of syntactic ambiguity and recursion and some problems frequently appearing with parsing CFGs. After reading this topic, you will know how to handle bugs in CFG implementation.

This topic is not about studying a new technology of NLTK or any other NLP library. It's about developing more elaborate skills in CFG parsing and syntactic parsing.

Syntactic ambiguity

Syntactic ambiguity appears when we don't know how to assign an element for sure. For example, we have a statement I walk in the park. I is a pronoun, so it forms a noun phrase (NP). Walk is a verb, so it forms a verb phrase. But then we have in the park... One may decide to classify it as a prepositional phrase (PP), so the sentence will be like this:

Syntactic ambiguity - the first option

Others would classify it as a PP to VP since in the park is an object aligned to the verb. English has three types of objects: direct, indirect, and a for preposition. In this case, this is a for preposition. So, our sentence would look like this:

Syntactic ambiguity - the second option

The third linguist would say: No, we can form a separate noun phrase with the park inside a PP. So our sentence would be:

Syntactic ambiguity - the firrd option

There are many ways to parse this sentence. Syntactic ambiguity occurs when we have extensive grammar designed for many sentences. When you have one sentence, you can develop grammar so that there is only one tree in the output, but if there are eight such sentences, it's way more complicated. For 100 or 200 sentences, it's almost impossible to tackle ambiguity.

Recursion

A grammar is recursive if a category occurring on the left-hand side of a production also appears on the right-hand side of a production. Both of the following grammar rules are recursive:

A -> Ab
A -> bA

If you need a more realistic example:

VP -> VP NP

Recursion could be direct or indirect. The direct one looks like this:

NP -> PP NP

As for indirect:

NP -> PP Noun
PP -> Prep NP

In both cases, we started with NP and ended with NP. It doesn't matter what we have in the middle. When some grammar rules are designed in such a way, the grammar is recursive.

Recursion often appears in so-called nested nominals, the same as a nested named entity. For example, big European butterfly is a nested nominal:

Recursive grammar

The following grammar is called left-recursive (it occurs frequently with the parsing of English sentences):

Left - recursive grammar

If the parsing tree expanded on the right side, we would call it a right-recursive.

Parsing long sentences

Previously we showed how to parse short and simple sentences. Long sentences are parsed similarly. By a long sentence, we mean either a compound sentence or a complex one. A compound sentence is a sentence with two or more independent clauses, usually joined by or, and, or other conjunctions. An Independent clause is a statement that can stand as a separate sentence. A complex sentence is a sentence with one independent clause and one or more non-independent clauses. There are many non-independent clauses: conditional clauses, relative clauses, dependant clauses, and so on.

To see how to parse a long sentence, take the following sentence:

sent = '''Sometimes his face swelled purple with anger, 
          he pounded on the door till he was sobbing with exertion.'''.lower()

It's better to start the grammar by defining all terminal symbols — this will help us understand what production rules we need. We begin with the simplest, smallest element; the best way to parse is to go from the bottom to the top!

Noun -> 'face' | 'anger' | 'door' | 'exertion' 
Pronoun -> 'he'
PronAdj -> 'his'
Verb -> 'swelled' | 'pounded' | 'sobbing' 
BE -> 'was'

Adj -> 'purple'
Adverb -> 'sometimes' 
Prep -> 'with' | 'on' 
Prep_time -> 'till'
Det -> 'the'
Comma -> ','
Stop -> '.'

Now, let's define the production rules. One of the most common mistakes while parsing long sentences is that you've missed a minor aspect of the production rule. If you miss an element in the alphabet, NLTK will remind you of that, but if you miss a production rule, the program will print an empty tree.

S -> Adv S1 Comma S2

S1 -> NP VP 
S2 -> NP VP S3
S3 -> Prep_time NP VP
VP -> Verb Adj PP | Verb PP 
NP -> Pronoun | PronAdj Noun | Noun
PP -> Prep Noun | Prep Det Noun
Verb -> BE Verb

An attentive reader can notice a recursion in this grammar. Verb occurs both on the right and left sides in the last production rule. This rule makes possible present continuous forms. The alternative way is to make the same rule for the verb phrase:

$VP \to BE, Verb$

S1, S2, S3 stand for clauses. Also, note how we made a complex sentence in S2. S3 is a dependent clause, as it depends on S2, so we've added S3 to S2.

Finally, parse the sentence and draw a tree:

parser = nltk.ChartParser(grammar)
draw_trees(*(tree for tree in parser.parse(nltk.word_tokenize(sent))))

Unfortunately, this grammar doesn't work. But why?

Debugging

Now we know that it's not an error of the alphabet. In that case, it would be:

ValueError: Grammar does not cover some of the input words: "'swelled'".

It might be reasonable to chop the sentence into smaller parts (clauses) to detect the actual error. We will start from the right part: he pounded on the door till he was sobbing with exertion. For this, we need to change our production rules a little bit. In the end, we get a good tree in the output:

So, the grammar for the right part was correct. Now, let's check the other side — alas, no tree. This clause has the following elements: adverb, pronoun, adjective, noun, verb, adjective, preposition, and noun. You need to check all the rules for suspicious elements. In our case, the error is in a missed prepositional phrase. We have the following rule:

VP -> Verb Adj | Verb PP

It must be so:

VP -> Verb Adj PP | Verb PP

After correcting this error, we can finally have the syntax tree:

Conclusion

In this topic, we discussed:

Syntactic ambiguity;
Recursion;
Parsing complex sentences;
The fastest way to find bugs in CFGs.

After reading this topic, you have a more detailed understanding of how context-free grammar works. You can define a grammar for longer and several sentences at once. You know how to handle bugs that will inevitably occur in your life.

How did you like the theory?

Report a typo