
Context-free grammar implementation


Context-free grammar is a fundamental concept in computer science that describes the syntax of many programming languages, natural languages, and other formal languages. In this topic, we will revise the theory on this subject and then try to implement it in NLTK. We will explore the structure of context-free grammars, how they can generate and recognize languages, and some of the algorithms and techniques developed for working with them. By the end of this article, you will have a solid understanding of context-free grammars and their importance in computer science and related fields.

Theory behind context-free grammars

Context-free grammars are widely applied in syntactic parsing. A parse tree for a derivation in a grammar is an ordered labeled tree with the following properties:

  • Each node is labeled with either a non-terminal or a terminal symbol, i.e., a symbol from N ∪ Σ;
  • The root of the tree is labeled with the start symbol S;
  • Each inner node is labeled with a single non-terminal symbol;
  • If an internal node with label A has children labeled with symbols α₁, …, αₙ, then P contains the rule A → α₁ … αₙ.

Now imagine that terminal symbols aren't just words but standard tokens. Then we can parse a sentence.

S → NP VP
VP → VP Obj
VP → Verb
NP → Noun
Noun → I
Verb → play
Obj → hockey

Here {Noun, Verb, Obj, NP, VP} ⊂ N (that is, Noun, Verb, Obj, NP, and VP are non-terminals) and {I, play, hockey} ⊂ Σ. In NLTK, all terminal symbols should be wrapped in quotation marks (or apostrophes); otherwise, the parser will recognize them as non-terminals.
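To see how these rules generate the sentence I play hockey, here is a small plain-Python sketch of a leftmost derivation (the dict encoding of the rules is our own illustration, not an NLTK structure):

```python
# The toy rules above, encoded as non-terminal -> list of right-hand sides.
rules = {
    "S": [["NP", "VP"]],
    "VP": [["VP", "Obj"], ["Verb"]],
    "NP": [["Noun"]],
    "Noun": [["I"]],
    "Verb": [["play"]],
    "Obj": [["hockey"]],
}

def derive(symbols, choices):
    """Expand the leftmost non-terminal once per choice index."""
    for choice in choices:
        for i, sym in enumerate(symbols):
            if sym in rules:  # found the leftmost non-terminal
                symbols = symbols[:i] + rules[sym][choice] + symbols[i + 1:]
                break
    return symbols

# S -> NP VP -> Noun VP -> I VP -> I VP Obj -> I Verb Obj -> I play Obj -> I play hockey
sentence = derive(["S"], [0, 0, 0, 0, 1, 0, 0])
print(" ".join(sentence))  # I play hockey
```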

Implementation in NLTK

You can define your own context-free grammars in NLTK to customize syntactic parsing. Let's try it on a French sentence, Je fais du ski, which means I ski or I am skiing. First, let's define our grammar rules:

import nltk
from nltk import CFG
from nltk.draw.tree import draw_trees

nltk.download('punkt')

grammar = CFG.fromstring('''
S -> NP VP Stop
S -> NP VP NP Stop
NP -> Noun
NP -> Pronoun
NP -> Article Noun
VP -> Verb
VP -> Verb Obj
VP -> Verb Article Obj
VP -> Verb NP


Verb -> 'fais'
Noun -> 'ski'
Pronoun -> 'Je'
Article -> 'du'
Stop -> '.'
  ''')

Above, we described each rule on a separate line, but we can combine them:

S ->  NP VP Stop
S -> NP VP NP Stop
NP -> Noun | Pronoun | Article Noun
VP -> Verb | Verb Obj | Verb Article Obj | Verb NP

We've combined the alternatives with |, which means the same as or. Rules that begin with the start symbol are generally not combined. Also note that we used simplified tagging here: a noun phrase consisting only of a pronoun is sometimes called a pronoun phrase. Now, we need to create a parser for our grammar:

parser = nltk.ChartParser(grammar)
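As a side note, the | shorthand simply bundles several productions that share a left-hand side. A hypothetical helper (plain Python, not part of NLTK) makes the expansion explicit:

```python
def expand(line):
    """Expand 'A -> X | Y Z' into [('A', ['X']), ('A', ['Y', 'Z'])]."""
    lhs, rhs = line.split("->")
    return [(lhs.strip(), alt.split()) for alt in rhs.split("|")]

print(expand("NP -> Noun | Pronoun | Article Noun"))
# [('NP', ['Noun']), ('NP', ['Pronoun']), ('NP', ['Article', 'Noun'])]
```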

Apart from the ChartParser, there are other interesting parsers in NLTK: we will cover recursive-descent parsing and then shift-reduce parsing. Finally, define the sentence we want to parse and draw the parse trees:

sent = 'Je fais du ski.'

draw_trees(*(tree for tree in parser.parse(nltk.word_tokenize(sent))))

The parse trees

Here we have two parsing variants, which is not a problem: one sentence can usually be parsed in numerous ways. There are two variants in our case because we defined that a sentence could be either NP VP NP or NP VP. With the first rule, VP is decomposed into Verb, but with the second, VP becomes Verb NP. If we want only one parse, we should delete either S -> NP VP NP Stop or S -> NP VP Stop.

Recursive-descent parsing

Recursive-descent parsing is a top-down parsing technique. It is a type of predictive parsing that predicts the next input symbol by matching it against a set of production rules. The parser starts with the start symbol and expands it using a set of mutually recursive procedures, each corresponding to a non-terminal symbol in the grammar. As the parser reads each token, it decides which procedure to call by checking which production's FIRST set the token belongs to. During recursive-descent parsing, the parser recursively expands each non-terminal symbol into a sequence of terminal symbols until it matches the input or encounters a syntax error. This process continues until the entire input is parsed and a parse tree is built. One of the problems of such parsing is that backtracking can cost a lot of time and memory, and a left-recursive grammar can send the parser into an infinite loop.
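To make the procedure concrete, here is a minimal backtracking recursive-descent recognizer for our French grammar in plain Python (a sketch for illustration only, not NLTK's implementation; note that this grammar has no left recursion, so the recursion terminates):

```python
# Grammar encoded as non-terminal -> list of productions; anything not in
# the dict is treated as a terminal that must match the current token.
GRAMMAR = {
    "S": [["NP", "VP", "Stop"], ["NP", "VP", "NP", "Stop"]],
    "NP": [["Noun"], ["Pronoun"], ["Article", "Noun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Verb": [["fais"]], "Noun": [["ski"]],
    "Pronoun": [["Je"]], "Article": [["du"]], "Stop": [["."]],
}

def match(symbols, tokens):
    """Return True if the symbol sequence can derive exactly `tokens`."""
    if not symbols:
        return not tokens  # success only if all input is consumed
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:  # non-terminal: try each production, backtracking
        return any(match(prod + rest, tokens) for prod in GRAMMAR[head])
    # terminal: must match the next input token
    return bool(tokens) and tokens[0] == head and match(rest, tokens[1:])

print(match(["S"], ["Je", "fais", "du", "ski", "."]))  # True
```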

Recursive-descent parsing is available in NLTK. We use the above-described grammar and apply it to the same sentence (Je fais du ski). We initialize a parser with the given grammar as follows:

parser = nltk.RecursiveDescentParser(grammar)

draw_trees(*(tree for tree in parser.parse(nltk.word_tokenize(sent))))

Then we would get such a result:

Two parse trees

As you can see, the result is almost the same as with the ChartParser. The only difference is that the two trees appear in the opposite order.

Shift-reduce parsing

Shift-reduce parsing is a bottom-up parsing technique (in contrast to the top-down recursive-descent one) that starts with the input tokens and repeatedly reduces them based on a set of production rules until a parse tree is formed. With shift-reduce parsing, the parser maintains a stack of tokens and a set of states based on a parsing table. When a token is shifted onto the stack, the parser transitions to a new state, and if a specific condition is met, a reduction occurs: a sequence of symbols on top of the stack is replaced with a non-terminal symbol. This process continues until the entire input is parsed and a parse tree is built.

Unfortunately, Je fais du ski cannot be parsed with NLTK's shift-reduce parser (it performs no backtracking, so it can fail to find a parse that exists), so we will take another sentence (this time in English): Busan is a Korean city.

For this sentence, we need to rewrite the grammar:

grammar = CFG.fromstring('''
S ->  NP VP 
NP -> PropN | Det PropAdj Noun
VP -> Verb NP


PropN -> 'busan'
PropAdj -> 'korean'
Noun -> 'city'
Verb -> 'is'
Det -> 'a'
  ''')

This sentence contains a proper name (marked as PropN) and a proper adjective (marked here as PropAdj, though it is often tagged simply as Adj). Both start with capital letters. We can normalize them by turning all the letters into lowercase:

sent = 'Busan is a Korean city'.lower()

text = nltk.word_tokenize(sent)

Finally, initialize the parser and draw the tree:

parser = nltk.ShiftReduceParser(grammar)

draw_trees(*(tree for tree in parser.parse(text)))
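Under the hood, the shift and reduce steps can be sketched in plain Python. The following toy recognizer (an illustration only, not NLTK's implementation) greedily reduces whenever the top of the stack matches a production's right-hand side, and otherwise shifts the next token:

```python
# Productions as (left-hand side, right-hand side) pairs, lexical rules first.
PRODUCTIONS = [
    ("PropN", ["busan"]), ("PropAdj", ["korean"]), ("Noun", ["city"]),
    ("Verb", ["is"]), ("Det", ["a"]),
    ("NP", ["PropN"]), ("NP", ["Det", "PropAdj", "Noun"]),
    ("VP", ["Verb", "NP"]), ("S", ["NP", "VP"]),
]

def shift_reduce(tokens):
    """Return True if greedy shift-reduce reduces the input to S."""
    stack = []
    tokens = list(tokens)
    while True:
        for lhs, rhs in PRODUCTIONS:       # reduce while possible
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]  # replace the matched suffix
                break
        else:
            if not tokens:                 # nothing left to shift or reduce
                return stack == ["S"]
            stack.append(tokens.pop(0))    # shift the next token

print(shift_reduce("busan is a korean city".split()))  # True
```

Because the reductions are greedy and there is no backtracking, a parser like this can fail on grammars where an early reduction blocks a later one, which is also why NLTK's ShiftReduceParser fails on the French example above.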

Conclusion

In this topic, we have discussed context-free grammars and their implementation. Context-free grammars are a powerful and versatile tool for working with formal languages in computer science and related fields. By mastering the art of implementing context-free grammars, we can unlock the full potential of formal languages and develop more sophisticated algorithms and tools for a wide range of applications.
