In the syntactic parsing topic, we discussed constituency parsing and Context-Free Grammar (CFG) as the primary example of such parsing. In this topic, we will dive deeper into the universe of CFGs. To understand CFGs better, you first need to know the nature of formal grammars, so we'll start with those.
Syntax is a linguistic domain, while formal grammars belong to mathematical linguistics, a discipline that paved the way to NLP thanks to such superstars as Noam Chomsky.
Formal language
It all started with formal language theory. A formal language is a theoretical language based on an alphabet; it's marked by the symbol $L$. An alphabet in formal language theory (marked as $\Sigma$) is a finite, non-empty set. Elements of such an alphabet are called symbols or letters, and from them we build strings (words). A string (also, a word) is a finite sequence of elements from $\Sigma$.
Example: we have an alphabet $\Sigma = \{a, b, c\}$. Then, $caaa$ is a string in alphabet $\Sigma$. Such a string has a length of 4; if it were an empty string with a length equal to 0, we would mark it with the symbol $\varepsilon$. Using power notation for repeated symbols, "caaa" can be written as $ca^3$.
A set of all strings in the alphabet $\Sigma$ is marked as $\Sigma^*$. A set of non-empty strings is marked as $\Sigma^+$.
We say that $x$ is a prefix of $y$ if the following equation is valid for some string $z$: $y = xz$; and vice versa, $x$ is a suffix of $y$ if $y = zx$.
From that, we can say:
- $L \subseteq \Sigma^*$ — each element from $L$ (a formal language) is also an element of $\Sigma^*$;
- Then, $\Sigma^* = L \cup \bar{L}$, where $\bar{L}$ denotes additional elements beyond the formal language $L$.
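To make these definitions concrete, here is a minimal Python sketch (the function names are ours, purely illustrative): it enumerates a finite slice of $\Sigma^*$ and checks the prefix and suffix relations defined above.

```python
from itertools import product

# Illustrative helpers for the definitions above; not a standard library API.
sigma = {"a", "b", "c"}  # a finite, non-empty alphabet

def strings_up_to(alphabet, max_len):
    """Enumerate all strings over the alphabet with length <= max_len."""
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            yield "".join(symbols)  # n == 0 yields '', the empty string (epsilon)

def is_prefix(x, y):
    """x is a prefix of y if y = x + z for some string z."""
    return y.startswith(x)

def is_suffix(x, y):
    """x is a suffix of y if y = z + x for some string z."""
    return y.endswith(x)

print(list(strings_up_to(sigma, 1)))                     # ['', 'a', 'b', 'c']
print(is_prefix("ca", "caaa"), is_suffix("aa", "caaa"))  # True True
```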
Formal grammar
Formal grammar (marked as $G$) is a set of rules that describe the behavior of the elements of the alphabet mentioned above. Formal language is a wider concept than formal grammar.
Apart from ordinary grammar rules, it also contains the so-called start symbol (marked as $S$), from which the rewriting starts. Formal grammar is sometimes described as a language generator thanks to the start symbol. At other times, it can be called a recognizer, since it can detect whether an input string is correct.
Classical grammar has the following components:
- $N$ is a finite set of nonterminal symbols;
- $T$ is a finite set of terminal symbols; it is essentially the alphabet, but in the formal grammar context, it is usually called a set of terminal symbols;
- $P$ is a finite set of production rules;
- $S$ is a start symbol, sometimes called the sentence symbol too. The formula $S \in N$ means that the start symbol is one of the nonterminal symbols.
We define a generative grammar as a combination of nonterminal symbols, terminal symbols, rules, and a start symbol. In other words: $G = (N, T, P, S)$.
The main feature of such a grammar is that its set of nonterminal symbols shouldn't intersect with the terminal symbols: $N \cap T = \emptyset$.
The other important feature is that the set of rules should be a subset of the combination of non-empty strings over nonterminal and terminal symbols multiplied by the combination of all strings over nonterminal and terminal symbols: $P \subseteq (N \cup T)^+ \times (N \cup T)^*$.
A language of a formal generative grammar is often marked as $L(G)$. We can consider such grammar generative because it helps us generate instances (that possibly never existed before).
If we have the definitions $N = \{S\}$, $T = \{a, b, c\}$, $P = \{S \to aSbc,\ S \to \varepsilon\}$, then we can say that $G = (N, T, P, S)$. In other words, it is a generative grammar. We have only the start symbol in the nonterminal alphabet; in the terminal alphabet, we have three elements ($a$, $b$, $c$) and two grammar rules:
- $S \to aSbc$ — inside the rewriting of the start symbol, there is another start symbol;
- $S \to \varepsilon$ — the start symbol can become a string with length 0 (in other words, an empty string).
This allows us to say that $L(G) = \{a^n(bc)^n \mid n \geq 0\}$.
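As a minimal sketch (assuming the two rules above, $S \to aSbc$ and $S \to \varepsilon$), we can mimic this generation process in Python by rewriting the start symbol until no nonterminals remain:

```python
# Rewriting sketch for the toy grammar above: S -> aSbc and S -> epsilon.
def generate(n):
    """Apply the recursive rule n times, then erase S with the empty-string rule."""
    string = "S"
    for _ in range(n):
        string = string.replace("S", "aSbc")
    return string.replace("S", "")  # the rule S -> epsilon

print([generate(n) for n in range(3)])  # ['', 'abc', 'aabcbc']
```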
Types of formal languages
In 1956, Noam Chomsky devised an abstract representation of formal and natural languages, known as the Chomsky hierarchy. No one before him had made such an inclusive classification of formal languages and grammars; Chomsky also developed a formal representation of these grammars and languages.
Still, some attempts to classify formal languages and grammars were made before Chomsky. Panini, an ancient Indian linguist, made the first attempt. Axel Thue, a Norwegian mathematician, introduced the semi-Thue system (a string rewriting system) at the turn of the 19th and 20th centuries; Context-Free Grammars are a special form of the semi-Thue system.
Chomsky described four types of grammars. Each type can have several variants. Each Chomsky grammar generates only one type of formal language:
| Chomsky hierarchy | Grammar | Language | Abstract machine |
|---|---|---|---|
| Type 0 | Unrestricted | Recursively enumerable | Turing machine |
| Type 1 | Context-Sensitive | Context-Sensitive | Linear-bounded |
| (between 1 and 2) | Tree-adjoining | Tree-adjoining | Embedded pushdown |
| Type 2 | Context-Free | Context-Free | Nondeterministic pushdown |
|  | Deterministic Context-Free | Deterministic Context-Free | Deterministic pushdown |
|  | Visibly pushdown | Visibly pushdown | Visibly pushdown |
| Type 3 | Regular | Regular | Finite |
|  | Non-recursive | Finite | Acyclic finite |
Chomsky's hierarchy goes from more general languages to more restricted ones.
Unrestricted grammar
The first type (Type-0 in his hierarchy) is an unrestricted grammar. This is the semi-Thue system mentioned a little bit earlier. Such grammars produce a recursively enumerable language. Unrestricted grammar has almost no constraints. The only one is that $\alpha$ shouldn't be empty in the following rule: $\alpha \to \beta$, where $\alpha \in (N \cup T)^+$ and $\beta \in (N \cup T)^*$.
This type is the most general class of grammars. Every Chomsky grammar is of Type-0 by default: be it a context-free or a context-sensitive grammar, it is an unrestricted grammar too. That's because the concept of unrestricted grammar is wider than grammars of Type 1, 2, or 3.
In NLP, unrestricted grammars can generate and recognize natural language sentences, because natural language is highly complex and can be difficult to model using more restricted grammars. In compiler design, unrestricted grammars can be used to define programming languages. In AI, unrestricted grammars can represent knowledge and support reasoning about it, since they can define the complex concepts and relationships necessary for reasoning about the world.
Context-Sensitive Grammar
Type-1 is a Context-Sensitive Grammar (CSG). The only difference between context-sensitive grammars and unrestricted grammars is that the replacement $\gamma$ below can be empty in the unrestricted case, but not here. An exciting feature of CSGs (which is why they are called context-sensitive) is that in production rules, the rewritten element keeps its surrounding elements (context). We call a grammar Context-Sensitive if its rules comply with the following constraint:
$\alpha A \beta \to \alpha \gamma \beta$, where $A \in N$ and $\alpha \in (N \cup T)^*$ and $\beta \in (N \cup T)^*$ and $\gamma \in (N \cup T)^+$.
CSG produces context-sensitive language.
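To see what this buys us, consider a classic textbook example (a standard illustration, not taken from the text above): a context-sensitive grammar for the language $\{a^n b^n c^n \mid n \geq 1\}$, which no context-free grammar can generate. The rules are given in the equivalent non-contracting form:
- $S \to aSBC \mid aBC$
- $CB \to BC$
- $aB \to ab$
- $bB \to bb$
- $bC \to bc$
- $cC \to cc$
A sample derivation: $S \Rightarrow aSBC \Rightarrow aaBCBC \Rightarrow aaBBCC \Rightarrow aabBCC \Rightarrow aabbCC \Rightarrow aabbcC \Rightarrow aabbcc$.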
In natural language processing, context-sensitive grammars can model sentence structure. In programming language design, context-sensitive grammars can be used to define programming languages with complex constructs and features. Overall, context-sensitive grammars are a powerful tool that can be used in a wide range of applications in computer science where a more powerful grammar than context-free grammar is needed.
Context-Free grammar
Context-Free Grammar is a generative grammar where each rule has the following type: $A \to \gamma$, where $A \in N$ and $\gamma \in (N \cup T)^*$. To translate into English: a rule like $A \to \gamma$ is possible only if $A$ is a single nonterminal symbol and $\gamma$ is any string of nonterminal or terminal symbols (it could be any mix of the two!).
Here is an example of a CFG with two rules: $S \to aSb$ and $S \to ab$.
With such rules, we can generate the following instances: $ab$, $aabb$, $aaabbb$, and so on;
or just: $\{a^n b^n \mid n \geq 1\}$.
Deterministic context-free and visibly pushdown grammars belong to Type-2.
Many sentences have a hierarchical structure that can be captured by context-free grammar, so CFG is applicable in NLP. In parsing, context-free grammars can be used to parse input strings and determine whether they are valid according to the grammar. This is because context-free grammars can be used to generate parse trees that represent the structure of the input string.
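As a quick illustration, here is a sketch using NLTK (assuming the `nltk` package is installed; the toy grammar is our own, not one from this topic):

```python
import nltk

# A toy CFG in NLTK's notation: nonterminals are bare names, terminals are quoted.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

# A chart parser both recognizes the sentence and builds its parse tree.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
```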
Regular grammar
Type-3 is a regular grammar, which can be either right-regular or left-regular, depending on the position of the additional nonterminal symbol. Here is an example of right-regular grammar rules: $A \to a$ and $A \to aB$,
where $A, B \in N$ (nonterminal symbols) and $a \in T$ (a terminal symbol).
In the left-regular grammar, the second rule will look like this (the first one will be left unchanged): $A \to Ba$.
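For instance, take the concrete right-regular grammar $A \to aA \mid a$ (our own toy example), which generates the language $\{a^n \mid n \geq 1\}$. The table above pairs regular grammars with finite automata, and the correspondence is direct, as this minimal sketch shows: a rule $A \to aB$ becomes a transition from state $A$ to state $B$ on symbol $a$, and $A \to a$ becomes a transition into an accepting state.

```python
# Simulate the right-regular grammar A -> aA | a as a nondeterministic
# finite automaton. A -> aA gives the transition (A, a) -> A;
# A -> a gives (A, a) -> ACCEPT.
ACCEPT = "ACCEPT"
transitions = {
    ("A", "a"): ["A", ACCEPT],  # covers both A -> aA and A -> a
}

def accepts(string, start="A"):
    """Track every reachable state; accept if ACCEPT is reachable at the end."""
    states = {start}
    for symbol in string:
        states = {nxt for state in states
                  for nxt in transitions.get((state, symbol), [])}
    return ACCEPT in states

print(accepts("aaa"), accepts(""), accepts("ab"))  # True False False
```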
In pattern matching, regular grammars can match patterns in strings; they express the same class of patterns as regular expressions.
In lexical analysis, regular grammars can define the tokens of a programming language. This is because tokens often have a simple regular structure, such as identifiers, keywords, and operators.
In string processing, regular grammars can manipulate and transform strings. This is because regular grammars define patterns that can be matched and replaced in strings.
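A short sketch tying these uses together (the token names and patterns here are illustrative assumptions, not a real language's specification):

```python
import re

# Lexical analysis with a regular pattern: each named group is one token type.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while)\b"),  # must precede IDENT to win ties
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("NUMBER",  r"\d+"),
    ("OP",      r"[+\-*/=<>]"),
    ("SKIP",    r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(code):
    """Yield (token_type, lexeme) pairs; whitespace is skipped."""
    for match in TOKEN_RE.finditer(code):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("if x1 = 42")))
# [('KEYWORD', 'if'), ('IDENT', 'x1'), ('OP', '='), ('NUMBER', '42')]
```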
Formal grammars can be applied in many fields. They can be helpful in understanding the structure of HTML and how web browsers process it. Speaking of NLP, formal grammar is used in speech recognition, dialogue systems, and regular expressions (the set of strings matched by a regular expression is a formal language!).
But the most important contribution lies in the field of syntax analysis. Formal grammar is essential to understanding how syntax works. Most syntactic parsers are based on formal grammars. One of the most renowned parsing approaches, Context-Free Grammar parsing, is based on CFG, which, as we already know, is a formal grammar.
Conclusion
In conclusion, formal grammar is fundamental in NLP and computer science. By providing a systematic way to describe the structure of languages, formal grammars enable us to generate, recognize, and manipulate languages precisely and efficiently. From the simple rules of regular grammars to the complex rules of unrestricted grammars, formal grammars provide a powerful tool for modeling and analyzing a wide range of languages and systems. As NLP continues to evolve, formal grammars will remain an important concept that underpins many of the foundational principles of the field.