
Formal grammars


In the syntactic parsing topic, we discussed constituency parsing and Context-Free Grammar (CFG) as the primary example of such parsing. In this topic, we will dive deeper into the universe of CFGs. To understand CFGs well, you first need to know the nature of formal grammars, so we'll start there.

Syntax is a linguistic domain, while formal grammars belong to mathematical linguistics, a discipline that paved the way to NLP thanks to such superstars as Noam Chomsky.

Formal language

It all started with formal language theory. A formal language is a theoretical language built over an alphabet; it is denoted by the symbol L. An alphabet in formal language theory (Σ) is a finite, non-empty set. Elements of an alphabet are called symbols or letters, and symbols are combined into strings (words). A string (also, a word) is a finite sequence of elements from Σ.

Example: we have an alphabet Σ = {a, b, c}. Then caaa is a string over the alphabet Σ. This string has a length of 4; an empty string, with a length of 0, is denoted by the symbol ε. The string "caaa" can also be written as "ca³".

The set of all strings over the alphabet Σ is denoted Σ*. The set of all non-empty strings is denoted Σ⁺.

We say that x is a prefix of y if the equation y = xu holds for some string u; symmetrically, x is a suffix of y if y = ux.

From that, we can say:

  • L ⊆ Σ* — each element of L (a formal language) is also an element of Σ*;
  • Then, Σ* − L = complement(L), where complement(L) contains the strings over Σ that do not belong to the formal language L.
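These definitions are easy to mirror in code. Below is a minimal Python sketch (the names kleene_star, is_prefix, and is_suffix are our own) that enumerates a finite slice of Σ* and checks prefixes and suffixes:

```python
from itertools import product

# A hypothetical alphabet Σ (any finite non-empty set of symbols works)
sigma = {"a", "b", "c"}

def kleene_star(alphabet, max_len):
    """Enumerate strings of Σ* up to a given length (Σ* itself is infinite)."""
    for n in range(max_len + 1):
        for combo in product(sorted(alphabet), repeat=n):
            yield "".join(combo)

strings = list(kleene_star(sigma, 2))
assert strings[0] == ""        # ε, the single string of length 0
assert len(strings) == 13      # 1 + 3 + 9 strings of length <= 2

# x is a prefix of y if y = x·u for some u; a suffix if y = u·x
def is_prefix(x, y):
    return y.startswith(x)

def is_suffix(x, y):
    return y.endswith(x)

assert is_prefix("ca", "caaa") and is_suffix("aaa", "caaa")
```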

Formal grammar

A formal grammar (G) is a set of rules that describes how strings are built from the elements of the alphabet mentioned above. A formal language is a wider concept than a formal grammar.

Apart from ordinary grammar rules, it also contains the so-called start symbol (denoted S) from which the rewriting starts. A formal grammar is sometimes described as a language generator thanks to the start symbol. At other times, it can be called a recognizer, since it can detect whether an input string belongs to the language.

A classical grammar G has the following components:

  • N is a finite set of nonterminal symbols;
  • Σ is a finite set of terminal symbols; it plays the role of the alphabet, but in the formal grammar context it is usually called the set of terminal symbols;
  • P is a finite set of production rules;
  • S ∈ N is the start symbol, sometimes also called the sentence symbol. The formula S ∈ N means that the start symbol is one of the nonterminal symbols.

We define a generative grammar as a combination of nonterminal symbols, terminal symbols, rules, and a start symbol. In other words:

G = (N, Σ, P, S)

The main feature of such a grammar is that its set of nonterminal symbols shouldn't intersect with the terminal symbols:

N ∩ Σ = ∅

The other important feature is that the set of rules must be a subset of the Cartesian product of the set of non-empty strings over the nonterminal and terminal symbols and the set of all strings over them:

P ⊂ (N ∪ Σ)⁺ × (N ∪ Σ)*

The language of a formal generative grammar G is often denoted L(G). We can consider such a grammar generative because it helps us generate instances (that possibly never existed before).
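As a sketch, the four-tuple and its two constraints can be checked in a few lines of Python (the variable names are our own; the toy rules mirror the example grammar discussed in this topic):

```python
# A generative grammar G = (N, Σ, P, S) as plain data
N = {"S"}                           # nonterminal symbols
Sigma = {"a", "b", "c"}             # terminal symbols
P = [("S", "acSbcS"), ("cS", "")]   # production rules as (lhs, rhs); "" is ε
S = "S"                             # start symbol

# N and Σ must not intersect
assert N & Sigma == set()
# The start symbol must be a nonterminal
assert S in N
# Each rule maps a non-empty string over N ∪ Σ to a string over N ∪ Σ
alphabet = N | Sigma
for lhs, rhs in P:
    assert len(lhs) > 0 and set(lhs) <= alphabet and set(rhs) <= alphabet
```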

If we have the definitions N = {S}, Σ = {a, b, c}, P = {S → acSbcS, cS → ε}, then (N, Σ, P, S) is a generative grammar. We have only the start symbol in the nonterminal alphabet; in the terminal alphabet, we have three elements (a, b, c); and there are two grammar rules:

  1. The start symbol is rewritten into a string that contains the start symbol again (S → acSbcS);
  2. The string cS is rewritten into a string of length 0, in other words, an empty string (cS → ε).

This allows us to derive, for example:

S ⇒ acSbcS ⇒ aεbε = ab

cScScSb ⇒ εεεb = b
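With the two rules S → acSbcS and cS → ε, the derivation of ab can be replayed with plain string rewriting (a minimal sketch; apply_rule is our own helper):

```python
def apply_rule(string, lhs, rhs):
    """Rewrite the leftmost occurrence of lhs with rhs."""
    return string.replace(lhs, rhs, 1)

step0 = "S"
step1 = apply_rule(step0, "S", "acSbcS")   # S ⇒ acSbcS
step2 = apply_rule(step1, "cS", "")        # acSbcS ⇒ abcS (cS ⇒ ε)
step3 = apply_rule(step2, "cS", "")        # abcS ⇒ ab
assert step3 == "ab"
```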

Types of formal languages

In 1956, Noam Chomsky devised an abstract representation of formal and natural languages, known as the Chomsky hierarchy. Before him, no one had made an inclusive classification of formal languages and grammars. Chomsky, by the way, also developed a formal representation of these grammars and languages.

Some attempts to classify formal languages and grammars were made before Chomsky. Panini, an ancient Indian linguist, made the first attempt. Axel Thue, a Norwegian mathematician, introduced the semi-Thue system (a string rewriting system) at the turn of the 19th and 20th centuries; Context-Free Grammars are a special case of the semi-Thue system.

Chomsky described four types of grammars. Each type can have several variants. Each Chomsky grammar generates only one type of formal language:

| Chomsky hierarchy | Grammar | Language | Abstract machine |
|---|---|---|---|
| Type 0 | Unrestricted | Recursively enumerable | Turing machine |
| Type 1 | Context-Sensitive | Context-Sensitive | Linear-bounded |
| — | Tree-adjoining | Tree-adjoining | Embedded pushdown |
| Type 2 | Context-Free | Context-Free | Nondeterministic pushdown |
| — | Deterministic Context-Free | Deterministic Context-Free | Deterministic pushdown |
| — | Visibly pushdown | Visibly pushdown | Visibly pushdown |
| Type 3 | Regular | Regular | Finite |
| — | Non-recursive | Finite | Acyclic finite |

Chomsky's hierarchy (from more general languages to more restricted ones):

Chomsky's hierarchy

Unrestricted grammar

The first type (Type-0 in his hierarchy) is unrestricted grammar. This is the semi-Thue system mentioned a little bit earlier. Such grammars produce recursively enumerable languages. Unrestricted grammar has almost no constraints: the only one is that a must not be empty in a rule of the form a → b.

This type is the most general class of grammars. Every Chomsky grammar is of Type-0 by default: be it a context-free or a context-sensitive grammar, it is an unrestricted grammar too. That's because the concept of unrestricted grammar is wider than grammars of Types 1, 2, or 3.

In NLP, unrestricted grammars can generate and recognize natural language sentences; natural language is highly complex and can be difficult to model using more restricted grammars. In compiler design, unrestricted grammars can be used to define programming languages. In AI, unrestricted grammars can represent knowledge and support reasoning about it, since they can express the complex concepts and relationships necessary for reasoning about the world.

Context-Sensitive Grammar

Type-1 is Context-Sensitive Grammar (CSG). The only difference between context-sensitive grammars and unrestricted grammars is that b can be empty in the unrestricted case. An interesting feature of CSGs (and the reason they are called context-sensitive) is that in production rules the rewritten element keeps its surrounding elements (context). We call a grammar context-sensitive if its rules comply with the following constraint:

ηAθ → ηaθ, where A ∈ N, η, θ ∈ (N ∪ Σ)*, and a ∈ (N ∪ Σ)⁺.

CSG produces context-sensitive language.
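A rule's conformance to this shape can be tested mechanically. The following Python sketch (the function name is our own, and it treats every symbol as a single character for simplicity) tries each way of splitting the left side into η, A, θ:

```python
def is_context_sensitive_rule(lhs, rhs, nonterminals):
    """Check the shape ηAθ → ηaθ with A ∈ N, η, θ ∈ (N ∪ Σ)*, a ∈ (N ∪ Σ)⁺."""
    for i, symbol in enumerate(lhs):
        if symbol not in nonterminals:
            continue
        eta, theta = lhs[:i], lhs[i + 1:]          # candidate context
        core = rhs[len(eta):len(rhs) - len(theta)]  # candidate a
        if rhs.startswith(eta) and rhs.endswith(theta) and len(core) >= 1:
            return True
    return False

assert is_context_sensitive_rule("aSb", "aXYb", {"S"})   # S → XY in context a_b
assert not is_context_sensitive_rule("aSb", "ab", {"S"})  # erasing S is not allowed
```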

In natural language processing, context-sensitive grammars can model sentence structure. In programming language design, context-sensitive grammars can be used to define programming languages with complex constructs and features. Overall, context-sensitive grammars are a powerful tool that can be used in a wide range of applications in computer science where a more powerful grammar than context-free grammar is needed.

Context-Free grammar

Context-Free Grammar is a generative grammar in which each rule has the following form: A → a, where A ∈ N and a ∈ (N ∪ Σ)*. To put it in plain English: a rule A → a is possible only if A is a single nonterminal symbol, while a can be any string of nonterminal and terminal symbols (it could be a mix of the two!).
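The constraint is easy to express in code. A minimal sketch (our own function name, treating symbols as single characters):

```python
def is_context_free_rule(lhs, rhs, nonterminals):
    """A rule A → a is context-free iff its left side is a single nonterminal.
    The right side may be any string over N ∪ Σ, including the empty string."""
    return len(lhs) == 1 and lhs in nonterminals

assert is_context_free_rule("S", "AbA", {"S", "A"})
assert not is_context_free_rule("bA", "bb", {"S", "A"})  # left side too long
```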

Here is an example of a CFG:

  1. S → ASTA
  2. S → AbA
  3. A → a
  4. A → b
  5. T → ss

With such rules, we can generate the following instances:

S ⇒ ASTA ⇒ aSTa ⇒ aAbATa ⇒ aAbAssa ⇒ aabassa

S ⇒ ASTA ⇒ aSTa ⇒ aAbATa ⇒ aabbssa

or just:

S ⇒ AbA ⇒ abb
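The first derivation above can be replayed mechanically with leftmost string rewriting (a minimal sketch; apply_rule is our own helper):

```python
def apply_rule(sentence, lhs, rhs):
    """Rewrite the leftmost occurrence of lhs with rhs."""
    return sentence.replace(lhs, rhs, 1)

s = "S"
s = apply_rule(s, "S", "ASTA")  # S ⇒ ASTA
s = apply_rule(s, "A", "a")     # A → a on the first A
s = apply_rule(s, "A", "a")     # A → a on the last A: aSTa
s = apply_rule(s, "S", "AbA")   # S → AbA: aAbATa
s = apply_rule(s, "T", "ss")    # T → ss: aAbAssa
s = apply_rule(s, "A", "a")
s = apply_rule(s, "A", "a")     # A → a twice: aabassa
assert s == "aabassa"
```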

Deterministic context-free and visibly pushdown grammars belong to Type-2.

Many sentences have a hierarchical structure that can be captured by a context-free grammar, so CFGs are applicable in NLP. In parsing, context-free grammars can be used to parse input strings and determine whether they are valid according to the grammar, since a CFG can generate parse trees that represent the structure of the input string.

Regular grammar

Type-3 is regular grammar, which can be either right-regular or left-regular, depending on the position of the additional nonterminal symbol. Here is an example of right-regular grammar rules:

  1. D → d
  2. D → dB

where D, B ∈ N (nonterminal symbols) and d ∈ Σ (a terminal symbol).

In the left-regular grammar, the second rule will look like this (the first one is left unchanged):
D → Bd
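Right-regular grammars describe exactly the regular languages, so every such grammar has an equivalent regular expression. A hypothetical example: the grammar S → aS | b generates the language a*b, which the regex a*b also matches:

```python
import re

def generate(max_steps):
    """Generate all strings derivable from S within max_steps rewrites,
    using the right-regular rules S → aS and S → b."""
    results, frontier = set(), {"S"}
    for _ in range(max_steps):
        next_frontier = set()
        for s in frontier:
            if "S" in s:
                next_frontier.add(s.replace("S", "aS", 1))
                next_frontier.add(s.replace("S", "b", 1))
            else:
                results.add(s)  # fully terminal string
        frontier = next_frontier
    return results

words = generate(5)
# every generated word is matched by the equivalent regular expression
assert all(re.fullmatch(r"a*b", w) for w in words)
```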

In pattern matching, regular grammars can match patterns in strings. This is because regular grammars can express the same kinds of patterns that regular expressions describe.

In lexical analysis, regular grammars can define the tokens of a programming language. This is because tokens often have a simple regular structure, such as identifiers, keywords, and operators.

In string processing, regular grammars can manipulate and transform strings. This is because regular grammars define patterns that can be matched and replaced in strings.
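As a sketch of lexical analysis, here is a tiny tokenizer for a hypothetical toy language; each token class is defined by a regular pattern, i.e., a small regular grammar:

```python
import re

# Hypothetical token classes of a toy language, each a regular pattern
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while)\b"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("NUMBER",  r"\d+"),
    ("OP",      r"[+\-*/=<>]"),
    ("SKIP",    r"\s+"),           # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Split source text into (token class, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(code)
            if m.lastgroup != "SKIP"]

tokens = tokenize("if x1 > 42")
assert tokens == [("KEYWORD", "if"), ("IDENT", "x1"),
                  ("OP", ">"), ("NUMBER", "42")]
```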

Formal grammars can be applied in many fields. They can be helpful in understanding HTML and web page structure as a whole. Speaking of NLP, formal grammar is used in speech recognition, dialogue systems, and regular expressions (the set of strings matched by a regular expression is a formal language!).

But the most important contribution lies in the field of syntax analysis. Formal grammar is essential for understanding how syntax works: most syntactic parsers are based on formal grammars. Some of the most renowned parsing algorithms are based on CFG, which, as we already know, is a formal grammar.

Conclusion

In conclusion, formal grammar is fundamental in NLP and computer science. By providing a systematic way to describe the structure of languages, formal grammars enable us to generate, recognize, and manipulate languages precisely and efficiently. From the simple rules of regular grammars to the complex rules of unrestricted grammars, formal grammars provide a powerful tool for modeling and analyzing a wide range of languages and systems. As NLP continues to evolve, formal grammars will remain an important concept that underpins many of the foundational principles of the field.
