====== Context-Free Grammars ======
===== Motivation =====
Regular expressions are **insufficient** for describing the structure of complicated languages (e.g. programming languages). We recall our previously-used example: //simple arithmetic expressions with parentheses//:
<expr> ::= <atom> | <expr> + <expr> | (<expr>)
This language is **not regular**; however, it can be described via a //generator object// called a (context-free) **grammar** (**CFG** for short).
$def[CFG]
A **CFG** is a 4-tuple: $math[G=(V,\Sigma,R,S)] where:
* $math[V] is a finite set of **symbols** (non-terminals and terminals)
* $math[\Sigma\subseteq V] is the set of **terminals**
* $math[R] is a **relation** over $math[(V\setminus\Sigma)\times V^*]. Here $math[V\setminus\Sigma] is the set of **non-terminals** and $math[V^*] is the set of **words** over $math[V]. An element of $math[R] is called **production rule**. We explain productions below.
* $math[S\in V\setminus\Sigma] is the **start symbol**.
$end
As an example, consider the following CFG $math[G], where:
* $math[V=\{S\}]
* $math[\Sigma=\{a,b\}]
* $math[R=\{S\rightarrow aSb, S\rightarrow \epsilon\}]
This grammar contains a **single** non-terminal $math[S], which is also the start symbol. **Production rules** are written as follows:
$math[\displaystyle X \rightarrow Y]
where $math[X] is a **non-terminal** and $math[Y] is a string containing terminal and non-terminal symbols. Our grammar has two production rules:
* $math[S\rightarrow aSb]
* $math[S\rightarrow \epsilon]
We can also write production rules of the form:
$math[\displaystyle X \rightarrow Y_1, \; \ldots, \; X \rightarrow Y_n ]
in the more compact form:
$math[\displaystyle X \rightarrow Y_1 \mid \ldots \mid Y_n ]
In our example, we can write:
$math[S\rightarrow aSb \mid \epsilon].
As a general **convention**, we use **italic uppercase symbols** to designate **non-terminals**, and **lowercase symbols** (or occasionally, typewriter symbols, e.g. $math[\texttt{A}]) for **terminals**. Also, $math[S] is always used to designate the **start symbol**. Under this convention, we can completely define a grammar by giving only its set of productions.
For instance, we can define a CFG for expressions as follows:
$math[S \rightarrow S + S \mid (S) \mid A]
$math[A \rightarrow UVT]
$math[U \rightarrow \texttt{A} \mid \ldots \mid \texttt{Z}]
$math[V \rightarrow LV \mid \epsilon]
$math[L \rightarrow \texttt{a} \mid \ldots \mid \texttt{z}]
$math[T \rightarrow DT \mid \epsilon]
$math[D \rightarrow \texttt{0} \mid \ldots \mid \texttt{9}]
We have preserved the same convention for atoms: they must start with an uppercase letter, followed by zero-or-more lowercase letters, and then zero-or-more digits.
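The atom convention above is in fact regular, so it can be checked directly with a pattern. A minimal sketch in Python (the helper name ''is_atom'' is ours):

```python
import re

# An atom: one uppercase letter, zero-or-more lowercase letters,
# then zero-or-more digits (productions U, V/L and T/D above).
ATOM = re.compile(r"[A-Z][a-z]*[0-9]*")

def is_atom(word):
    return ATOM.fullmatch(word) is not None

print(is_atom("X0"))     # True
print(is_atom("Xab12"))  # True
print(is_atom("xab"))    # False - must start with an uppercase letter
```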
(Context-Free) Grammars are the **corner-stone** for writing parsers.
===== The language of a CFG =====
Let $math[\alpha A\beta] and $math[\alpha\gamma\beta] be strings from $math[V^*], where $math[A] is a **non-terminal**. Also, suppose we have a production $math[A\rightarrow\gamma] in a CFG $math[G]. Then we say:
$math[\alpha A\beta \Rightarrow_G \alpha\gamma\beta]
and read that $math[\alpha\gamma\beta] is a **one-step derivation** of $math[\alpha A\beta]. The relation over strings $math[\Rightarrow_G] is very similar in spirit to $math[\vdash_M]. We omit the subscript when the grammar $math[G] is understood from context, and write $math[\Rightarrow^*] to refer to the **reflexive and transitive closure** of $math[\Rightarrow]. $math[\Rightarrow^*] is the **zero-or-more steps derivation** relation.
As an example, consider the grammar for arithmetic expressions, and the following derivation:
$math[S\Rightarrow (S)\Rightarrow(S+S)\Rightarrow(A+S)\Rightarrow(A+(S))\Rightarrow(A+(S+S))\Rightarrow(A+(A+S))\Rightarrow(A+(A+A))]
Hence, we have $math[S\Rightarrow^*(A+(A+A))].
Notice that the string $math[(A+(A+A))] contains **non-terminals**.
One possible derivation for $math[A] is:
$math[A\Rightarrow UVT\Rightarrow \texttt{X}VT\Rightarrow\texttt{X}T\Rightarrow\texttt{X}DT\Rightarrow\texttt{X0}T\Rightarrow\texttt{X0}]
Similarly, we may write derivations that witness:
$math[A\Rightarrow^*\texttt{Y}] and $math[A\Rightarrow^*\texttt{Z}].
and finally: $math[S\Rightarrow^*(A+(A+A))\Rightarrow^*(\texttt{X0}+(\texttt{Y}+\texttt{Z}))]. Notice that $math[(\texttt{X0}+(\texttt{Y}+\texttt{Z}))] contains only **terminal symbols**.
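A one-step derivation is simply a string rewrite: replace the leftmost occurrence of a non-terminal with the body of one of its productions. The sketch below (helper name ''derive'' is ours) replays a derivation of $math[(A+(A+A))] in the expression grammar:

```python
# A one-step derivation rewrites the leftmost occurrence of a non-terminal
# using one production rule of S -> S+S | (S) | A.
def derive(string, head, body):
    """Apply the production head -> body to the leftmost occurrence of head."""
    assert head in string, f"no occurrence of {head} to rewrite"
    return string.replace(head, body, 1)

w = "S"
for body in ["(S)", "S+S", "A", "(S)", "S+S", "A", "A"]:
    w = derive(w, "S", body)
print(w)  # (A+(A+A))
```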
$def[Language of a grammar]
For a CFG $math[G], the **language generated by G** is defined as: $math[L(G)=\{w\in\Sigma^*\mid S\Rightarrow^*_G w\}]
$end
Informally, $math[L(G)] is the set of words that can be obtained via **zero-or-more** derivation steps from the start symbol of $math[G].
If a language is generated by a CFG, then it is called a **context-free language**.
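The definition of $math[L(G)] suggests a (naive) way of enumerating the language: explore derivations breadth-first, collecting the fully-terminal strings. A sketch for the grammar $math[S\rightarrow aSb \mid \epsilon], bounded by word length (function name is ours):

```python
from collections import deque

# Enumerate all terminal words of length <= limit generated by
# S -> aSb | epsilon, via breadth-first search over derivations.
RULES = {"S": ["aSb", ""]}

def language(limit):
    words, frontier, seen = set(), deque(["S"]), {"S"}
    while frontier:
        s = frontier.popleft()
        nts = [c for c in s if c in RULES]
        if not nts:                       # no non-terminals left: a word
            words.add(s)
            continue
        for body in RULES[nts[0]]:        # rewrite the leftmost non-terminal
            t = s.replace(nts[0], body, 1)
            # prune strings whose terminal part already exceeds the limit
            if len(t.replace("S", "")) <= limit and t not in seen:
                seen.add(t)
                frontier.append(t)
    return words

print(sorted(language(4), key=len))  # ['', 'ab', 'aabb']
```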
===== Parse trees =====
Informally, a parse tree is an //illustration// of //sequences of derivations//. We illustrate a parse tree for $math[(A+(A+A))] below:
S
/ | \
( S )
/ | \
S + S
| / | \
A ( S )
/ | \
S + S
| |
A A
Notice that there is not a **one-to-one** correspondence between sequences of derivations and parse-trees. For instance, we may first derive the left-hand side of ''+'', or the right-hand side. However, a parse-tree uniquely identifies the set of productions used in the derivation, and how they are applied.
The construction rules for parse trees are as follows:
* **the root** of the tree is the **start symbol**
* each **interior node** $math[X] having as children nodes $math[Y_1, \ldots, Y_n] corresponds to a **production rule** $math[X\rightarrow Y_1 \ldots Y_n]
* if **each leaf** is a **terminal**, then the parse-tree **yields** a word of $math[L(G)].
For instance, the following parse-tree:
S
/ | \
a S b
/ | \
a ε b
yields the word $math[aabb], which is obtained by concatenating each terminal leaf from left to right.
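Computing the yield of a parse tree is a simple left-to-right traversal. A sketch, representing trees as nested ''(symbol, children)'' tuples (this representation is ours):

```python
# A parse tree as nested tuples: (symbol, children). The yield is the
# left-to-right concatenation of terminal leaves (epsilon shown as "").
def tree_yield(node):
    symbol, children = node
    if not children:                      # a leaf: a terminal or epsilon
        return symbol
    return "".join(tree_yield(c) for c in children)

# The parse tree for aabb: S -> aSb, S -> aSb, S -> epsilon
tree = ("S", [("a", []),
              ("S", [("a", []), ("S", [("", [])]), ("b", [])]),
              ("b", [])])
print(tree_yield(tree))  # aabb
```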
Parse-trees are especially useful for parsing, because they reveal **the structure** of a parsed program (or word in general).
It is only natural that we require **the program structure to be //unique//**. However, it is quite easy to find grammars where **the same word** has **different** parse trees as yield.
Consider the following CFG:
$math[S \rightarrow S + S \mid S * S \mid \texttt{a}]
The word $math[a+a*a] has **two** different parse trees:
S
/ | \
S + S
| / | \
a S * S
| |
a a
and
S
/ | \
S * S
/ | \ |
S + S a
| |
a a
Incidentally, these two different //structures// reflect different interpretations of our arithmetic expression. Thus, our grammar is **ambiguous**. In general, a grammar is ambiguous if there **exist two different parse trees for the same word**.
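We can witness the ambiguity mechanically, by counting parse trees: try every production at the top node and recurse. A brute-force sketch for $math[S \rightarrow S + S \mid S * S \mid \texttt{a}] (function name is ours):

```python
from functools import lru_cache

# Count the parse trees of a word: rule S -> a matches the word "a";
# rules S -> S+S and S -> S*S split the word at each operator.
@lru_cache(maxsize=None)
def parses(w):
    n = 1 if w == "a" else 0
    for i, c in enumerate(w):
        if c in "+*":
            n += parses(w[:i]) * parses(w[i+1:])
    return n

print(parses("a"))      # 1
print(parses("a+a*a"))  # 2 - the grammar is ambiguous
```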
To remove ambiguity in our example, it is sufficient to:
- include **precedence-rules** in the grammar;
- enforce parsing to proceed //left-to-right//.
The result is:
$math[S\rightarrow M + S \mid M]
$math[M\rightarrow T * M \mid T]
$math[T\rightarrow a]
The first production rule enforces //left-to-right// parsing. Consider the alternative production:
$math[S \rightarrow S + S \mid M]
Via this production, a parse tree might unfold //to the left// ad-infinitum, **depending on how the parser implementation works**:
S
/ | \
S + S
/ | \
S + S
...
By instead using:
$math[S\rightarrow M + S \mid M], we are requiring that a suitable **multiplication term** be found at the left of ''+'', while any expression may occur at its right.
The second production rule describes **multiplication terms**. Note that, under this grammar, an addition cannot appear within a multiplication term. If we want to allow this, we need a new production rule which introduces parentheses. Can you figure out how this modification should be done?
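The unambiguous grammar above translates directly into a recursive-descent recognizer: one function per non-terminal, each consuming a prefix of the input and returning the rest (or ''None'' on failure). A sketch (function names are ours):

```python
# Recursive-descent recognizer for: S -> M+S | M,  M -> T*M | T,  T -> a
def parse_T(s):
    return s[1:] if s.startswith("a") else None   # T -> a

def parse_M(s):
    rest = parse_T(s)
    if rest is not None and rest.startswith("*"):
        return parse_M(rest[1:])                  # M -> T*M
    return rest                                   # M -> T

def parse_S(s):
    rest = parse_M(s)
    if rest is not None and rest.startswith("+"):
        return parse_S(rest[1:])                  # S -> M+S
    return rest                                   # S -> M

def accepts(s):
    return parse_S(s) == ""                       # all input consumed

print(accepts("a+a*a"))  # True
print(accepts("a+"))     # False
```

Note how the precedence of ''*'' over ''+'' is baked into the call structure: ''parse_S'' calls ''parse_M'', never the other way around.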
==== Solving ambiguity in general ====
Consider another example:
$math[L = \{a^nb^nc^md^m \mid n,m\geq 1\} \cup \{a^nb^mc^md^n \mid n,m\geq 1\}]
This language contains strings in $math[L(aa^*bb^*cc^*dd^*)] where (''number(a)=number(b)'' and ''number(c)=number(d)'') or (''number(a)=number(d)'' and ''number(b)=number(c)'').
One possible CFG is:
$math[S\rightarrow AB \mid C]
$math[A\rightarrow aAb \mid ab]
$math[B\rightarrow cBd \mid cd]
$math[C\rightarrow aCd \mid aDc]
$math[D\rightarrow bDc \mid bc]
This CFG is **ambiguous**: the word $math[aabbccdd] has two different parse trees:
S
/ \
A B
... ...
and
S
|
C
/ | \
... ...
The reason for ambiguity is that in $math[aabbccdd] both conditions of the grammar hold (''number(a)=number(b)=number(c)=number(d)'').
It is not straightforward how ambiguity can be lifted from this grammar. This particular example raises two interesting questions:
* can we **automatically** lift ambiguity from any CFG?
* how to find an unambiguous grammar for a Context-Free Language?
We cannot provide a general answer to either of the above questions. In fact:
**The problem of deciding whether an arbitrary CFG is ambiguous is undecidable**.
Also, **there exist context-free languages for which no unambiguous grammar exists** - such languages are called **inherently ambiguous**. Our above example is such a language.
===== Regular grammars =====
$def[Regular grammars]
A grammar is called **regular** iff **all its production rules** have **one of** the following forms:
$math[ X \rightarrow aA]
$math[ X \rightarrow A]
$math[ X \rightarrow a]
$math[ X \rightarrow \epsilon]
where $math[A,X] are non-terminals and $math[a] is a terminal.
$end
Formally:
$math[ R\subseteq (V\setminus\Sigma)\times(\Sigma^*((V\setminus\Sigma)\cup\{\epsilon\}))]
Thus:
* each production rule contains **at most one non-terminal** (the production body may contain a **string of terminals**)
* the non-terminal, if present, appears as //the last symbol of the production body//
As it turns out, regular grammars **precisely capture regular languages**:
$theorem[Regular grammars capture regular languages]
A language is **regular** iff it is generated by a regular grammar.
$end
$proof
Direction $math[\Rightarrow]. Suppose $math[L] is a regular language, i.e. it is accepted by a DFA $math[M=(K,\Sigma,\delta,q_0,F)]. We build a regular grammar $math[G] from $math[M]. Informally, each production of $math[G] mimics some transition of $math[M]. Formally, $math[G=(V,\Sigma,R,S)] where:
* $math[V = K \cup \Sigma] - the non-terminals are the states of $math[M], and the terminals are its input symbols;
* $math[S = q_0] - the start-symbol corresponds to the start state;
* for each transition $math[\delta(q,\texttt{c})=p], we build a production rule $math[q\rightarrow \texttt{c}p]. For each final state $math[q\in F], we build a production rule $math[q\rightarrow \epsilon]
The grammar is obviously regular.
To prove $math[L(M)=L(G)], we must show:
for all $math[w=c_1\ldots c_n\in\Sigma^*], $math[(q_0,c_1\ldots c_n)\vdash_M^* (p,\epsilon)] with $math[p\in F], **iff** $math[q_0\Rightarrow_G^* c_1\ldots c_np], where we recall that $math[p] is a non-terminal for which the production rule $math[p\rightarrow \epsilon] exists.
The above proposition can be easily proven by induction over the length of the word $math[w].
Direction $math[\Leftarrow]. Suppose $math[G=(V,\Sigma,R,S)] is a regular grammar. We build an NFA $math[M=(K,\Sigma,\Delta,q_0,F)] whose transitions //mimic// production rules:
* $math[K = (V\setminus \Sigma) \cup \{p\}]: for each non-terminal of $math[G], we build a state in $math[M]; additionally, we add a fresh final state $math[p].
* $math[q_0 = S]
* $math[F=\{p\}]
* for each production rule $math[A\rightarrow cB] in $math[G], where $math[B] is a non-terminal and $math[c\in\Sigma^*], we build a transition $math[(A,c,B)\in\Delta]. Also, for each production rule $math[A\rightarrow c], with $math[c\in\Sigma^*], we build a transition $math[(A,c,p)\in\Delta].
We must prove that:
for all $math[w=c_1\ldots c_n\in\Sigma^*], we have $math[S\Rightarrow_G^* c_1\ldots c_n] **iff** $math[(q_0,c_1\ldots c_n)\vdash_M^*(p,\epsilon)].
The proof is similar to the above one.
$end
The theorem also shows that Context-Free Languages are a **superset** of Regular Languages. The inclusion is **proper**: for instance, $math[\{a^nb^n \mid n\geq 0\}] is context-free (it is generated by our first example grammar) but not regular.
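The first construction in the proof is easy to mechanize: each DFA transition becomes a production, and each final state contributes an $math[\epsilon]-production. A sketch in Python (the example DFA, which accepts words over $math[\{a,b\}] with an even number of $math[a]'s, is ours):

```python
# From a DFA, build the regular grammar of the proof:
#   delta(q, c) = p  gives the production  q -> c p
#   q in F           gives the production  q -> epsilon
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
final = {"q0"}

rules = [(q, c + p) for (q, c), p in delta.items()]
rules += [(q, "") for q in final]          # q -> epsilon for every q in F

for head, body in rules:
    print(f"{head} -> {body or 'epsilon'}")
```

The start symbol of the resulting grammar is the DFA's start state, here ''q0''.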