====== Context-Free Grammars ======

===== Motivation =====

Regular expressions are **insufficient** for describing the structure of complicated languages (e.g. programming languages). We recall our previously-used example: //simple arithmetic expressions with parentheses//:

<code>
<expr> ::= <atom> | <expr> + <expr> | (<expr>)
</code>

This language is **not regular**; however, it can be described via a //generator object// called a (context-free) **grammar** (CFG, for short).

$def[CFG]
A **CFG** is a 4-tuple: $math[G=(V,\Sigma,R,S)] where:
  * $math[V] is a finite set whose elements are called **non-terminals and terminals**
  * $math[\Sigma\subseteq V] is the set of **terminals**
  * $math[R] is a **relation** over $math[(V\setminus\Sigma)\times V^*]. Here $math[V\setminus\Sigma] is the set of **non-terminals** and $math[V^*] is the set of **words** over $math[V]. An element of $math[R] is called a **production rule**. We explain productions below.
  * $math[S\in V\setminus\Sigma] is the **start symbol**.
$end

As an example, consider the following CFG $math[G], where:
  * $math[V=\{S,a,b\}]
  * $math[\Sigma=\{a,b\}]
  * $math[R=\{S\rightarrow aSb, S\rightarrow \epsilon\}]

This grammar contains a **single** non-terminal $math[S], which is also the start symbol. **Production rules** are written as follows: $math[\displaystyle X \rightarrow Y] where $math[X] is a **non-terminal** and $math[Y] is a string of terminal and non-terminal symbols. Our grammar has two production rules:
  * $math[S\rightarrow aSb]
  * $math[S\rightarrow \epsilon]

We can also write production rules of the form: $math[\displaystyle X \rightarrow Y_1, \ldots, X \rightarrow Y_n] in the more compact form: $math[\displaystyle X \rightarrow Y_1 \mid \ldots \mid Y_n]. In our example, we can write: $math[S\rightarrow aSb \mid \epsilon].

As a general **convention**, we use **italic uppercase symbols** to designate **non-terminals**, and **lowercase symbols** (or occasionally, typewriter symbols, e.g. $math[\texttt{A}]) to designate **terminals**. At the same time, $math[S] is always used to designate the **start symbol**. Under this convention, we can completely define a grammar by giving the set of productions only. For instance, we can define a CFG for expressions as follows:

$math[S \rightarrow S + S \mid (S) \mid A]
$math[A \rightarrow UVT]
$math[U \rightarrow \texttt{A} \mid \ldots \mid \texttt{Z}]
$math[V \rightarrow LV \mid \epsilon]
$math[L \rightarrow \texttt{a} \mid \ldots \mid \texttt{z}]
$math[T \rightarrow DT \mid \epsilon]
$math[D \rightarrow \texttt{0} \mid \ldots \mid \texttt{9}]

We have preserved the same convention for atoms: they must start with an uppercase letter, followed by zero-or-more lowercase letters, and then zero-or-more digits. (Context-free) grammars are the **cornerstone** for writing parsers.

===== The language of a CFG =====

Let $math[\alpha A\beta] and $math[\alpha\gamma\beta] be strings from $math[V^*], where $math[A] is a **non-terminal**. Also, suppose we have a production $math[A\rightarrow\gamma] in a CFG $math[G]. Then we say: $math[\alpha A\beta \Rightarrow_G \alpha\gamma\beta] and read that $math[\alpha\gamma\beta] is a **one-step derivation** of $math[\alpha A\beta]. The relation over strings $math[\Rightarrow_G] is very similar in spirit to $math[\vdash_M]. We omit the subscript when the grammar $math[G] is understood from context, and write $math[\Rightarrow^*] to refer to the **reflexive and transitive closure** of $math[\Rightarrow]. Thus, $math[\Rightarrow^*] is the **zero-or-more-steps derivation** relation.
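To make these definitions concrete, below is a minimal Python sketch (our own illustration; the names and the string-based encoding are assumptions, not part of the notes) of the one-step derivation relation for the grammar $math[S\rightarrow aSb \mid \epsilon]:

<code python>
# A minimal sketch: the grammar S -> aSb | epsilon as a set of productions,
# and the one-step derivation relation "=>" defined above. Each sentential
# form is encoded as a plain string over V; "" stands for epsilon.

GRAMMAR = {"S": ["aSb", ""]}          # R: non-terminal -> list of bodies
NON_TERMINALS = set(GRAMMAR)          # V \ Sigma

def one_step_derivations(word):
    """All strings alpha.gamma.beta such that word = alpha.A.beta => alpha.gamma.beta."""
    results = []
    for i, symbol in enumerate(word):
        if symbol in NON_TERMINALS:            # choose an occurrence of a non-terminal A
            for body in GRAMMAR[symbol]:       # and a production A -> gamma
                results.append(word[:i] + body + word[i + 1:])
    return results

# S => aSb => aaSbb => aabb (the last step applies S -> epsilon)
print(one_step_derivations("S"))     # ['aSb', '']
print(one_step_derivations("aSb"))   # ['aaSbb', 'ab']
</code>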
As an example, consider the grammar for arithmetic expressions, and the following derivation:

$math[S\Rightarrow (S)\Rightarrow(S+S)\Rightarrow(A+S)\Rightarrow(A+(S+S))\Rightarrow(A+(A+S))\Rightarrow(A+(A+A))]

Hence, we have $math[S\Rightarrow^*(A+(A+A))]. Notice that the string $math[(A+(A+A))] contains **non-terminals**. One possible derivation for $math[A] is:

$math[A\Rightarrow UVT\Rightarrow \texttt{X}VT\Rightarrow\texttt{X}T\Rightarrow\texttt{X}DT\Rightarrow\texttt{X0}T\Rightarrow\texttt{X0}]

Similarly, we may write derivations that witness $math[A\Rightarrow^*\texttt{Y}] and $math[A\Rightarrow^*\texttt{Z}], and finally: $math[S\Rightarrow^*(A+(A+A))\Rightarrow^*(\texttt{X0}+(\texttt{Y}+\texttt{Z}))]. Notice that $math[(\texttt{X0}+(\texttt{Y}+\texttt{Z}))] contains only **terminal symbols**.

$def[Language of a grammar]
For a CFG $math[G], the **language generated by G** is defined as: $math[L(G)=\{w\in\Sigma^*\mid S\Rightarrow^*_G w\}]
$end

Informally, $math[L(G)] is the set of words (over terminals) that can be obtained via **zero-or-more** derivation steps from the start symbol of $math[G]. If a language is generated by a CFG, then it is called a **context-free language**.

===== Parse trees =====

Informally, a parse tree is an //illustration// of a //sequence of derivations//. We illustrate a parse tree for $math[(A+(A+A))] below:

<code>
      S
    / | \
   (  S  )
     / | \
    S  +  S
    |    / | \
    A   (  S  )
         / | \
        S  +  S
        |     |
        A     A
</code>

Notice that there is not a **one-to-one** correspondence between a sequence of derivations and a parse tree. For instance, we may first derive the left-hand side of ''+'', or the right-hand side. However, a parse tree uniquely identifies the set of productions used in the derivation, and how they are applied.

The construction rules for parse trees are as follows:
  * **the root** of the tree is the **start symbol**
  * each **interior node** $math[X] having as children nodes $math[Y_1, \ldots, Y_n] corresponds to a **production rule** $math[X\rightarrow Y_1 \ldots Y_n]
  * if **each leaf** is a **terminal**, then the parse tree **yields** a word of $math[L(G)].

For instance, the following parse tree:

<code>
     S
   / | \
  a  S  b
    / | \
   a  S  b
      |
      ε
</code>

yields the word $math[aabb], which is obtained by concatenating each terminal leaf from left to right.

Parse trees are especially useful for parsing, because they reveal **the structure** of a parsed program (or word, in general). It is only natural that we require **the program structure to be //unique//**. However, it is quite easy to find grammars where **the same word** has **different** parse trees as yield. Consider the following CFG:

$math[S \rightarrow S + S \mid S * S \mid \texttt{a}]

The word $math[a+a*a] has two different parse trees:

<code>
     S
   / | \
  S  +  S
  |    / | \
  a   S  *  S
      |     |
      a     a
</code>

and

<code>
        S
      / | \
     S  *  S
   / | \    |
  S  +  S   a
  |     |
  a     a
</code>

Incidentally, these two different //structures// reflect different interpretations of our arithmetic expression. Thus, our grammar is **ambiguous**. In general, a grammar is ambiguous if there **exist two different parse trees for the same word**.

To remove ambiguity in our example, it is sufficient to:
  - include **precedence rules** in the grammar;
  - enforce parsing to proceed //left-to-right//.

The result is:

$math[S\rightarrow M + S \mid M]
$math[M\rightarrow T * M \mid T]
$math[T\rightarrow a]

The first production rule enforces //left-to-right// parsing. Consider the alternative production: $math[S \rightarrow S + S \mid M]. Via this production, a parse tree might unfold //to the left// ad infinitum, **depending on how the parser implementation works**:

<code>
        S
      / | \
     S  +  S
   / | \
  S  +  S
 ...
</code>
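The following runnable Python sketch (our own illustration; the parser and its names are hypothetical, not prescribed by these notes) shows the problem concretely: a naive top-down parser that follows $math[S \rightarrow S + S] recurses on $math[S] without consuming any input, and therefore never terminates:

<code python>
import sys
sys.setrecursionlimit(100)   # fail fast instead of after ~1000 stack frames

# A naive recursive-descent procedure for the left-recursive production
# S -> S + S. To recognise an S, it first recognises an S at the *same*
# input position, so no input is ever consumed and the recursion never ends.

def parse_S(tokens, pos=0):
    pos = parse_S(tokens, pos)                    # left operand: no progress!
    if pos < len(tokens) and tokens[pos] == "+":
        pos = parse_S(tokens, pos + 1)            # right operand
    return pos

try:
    parse_S(list("a+a"))
except RecursionError:
    print("S -> S + S loops forever in a naive top-down parser")
</code>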
By instead using $math[S\rightarrow M + S \mid M], we are requiring that a suitable **multiplication term** be found at the left of ''+'', while any expression may occur at its right. The second production describes **multiplication terms**. Note that, under this grammar, addition cannot appear within a multiplication term. If we want to allow this, we need a new production rule which includes parentheses. Can you figure out how this modification should be done?

==== Solving ambiguity in general ====

Consider another example:

$math[L = \{a^nb^nc^md^m \mid n,m\geq 1\} \cup \{a^nb^mc^md^n \mid n,m\geq 1\}]

This language contains strings in $math[L(aa^*bb^*cc^*dd^*)] where (''number(a)=number(b)'' and ''number(c)=number(d)'') or (''number(a)=number(d)'' and ''number(b)=number(c)''). One possible CFG is:

$math[S\rightarrow AB \mid C]
$math[A\rightarrow aAb \mid ab]
$math[B\rightarrow cBd \mid cd]
$math[C\rightarrow aCd \mid aDd]
$math[D\rightarrow bDc \mid bc]

This CFG is **ambiguous**: the word $math[aabbccdd] has two different parse trees:

<code>
     S
    / \
   A   B
  ...  ...
</code>

and

<code>
     S
     |
     C
   / | \
    ...
</code>

The reason for the ambiguity is that, in $math[aabbccdd], both conditions of the grammar hold (''number(a)=number(b)=number(c)=number(d)''). It is not straightforward how ambiguity can be lifted from this grammar. This particular example raises two interesting questions:
  * can we **automatically** lift ambiguity from any CFG?
  * how can we find an unambiguous grammar for a given context-free language?

We cannot provide a general answer to either of the above questions. In fact, **the problem of establishing whether a CFG is ambiguous is undecidable**. Also, **there exist context-free languages for which no unambiguous grammar exists**; such languages are called //inherently ambiguous//, and our above example is one of them.

===== Regular grammars =====

$def[Regular grammars]
A grammar is called **regular** iff **all its production rules** have **one of** the following forms:

$math[ X \rightarrow aA]
$math[ X \rightarrow A]
$math[ X \rightarrow a]
$math[ X \rightarrow \epsilon]

where $math[A,X] are non-terminals and $math[a] is a terminal.
$end

Formally: $math[ R\subseteq (V\setminus\Sigma)\times(\Sigma^*((V\setminus\Sigma)\cup\{\epsilon\}))]

(The formal statement is slightly more liberal: it allows several terminals before the final non-terminal, e.g. $math[X\rightarrow abcA]; such a rule can always be broken into rules of the above forms.) Thus:
  * each production rule contains **at most one non-terminal**
  * each non-terminal appears as //the last symbol in the production body//

As it turns out, regular grammars **precisely capture regular languages**:

$theorem[Regular grammars capture regular languages]
A language is **regular** iff it is generated by a regular grammar.
$end

$proof
Direction $math[\Rightarrow]. Suppose $math[L] is a regular language, i.e. it is accepted by a DFA $math[M=(K,\Sigma,\delta,q_0,F)]. We build a regular grammar $math[G] from $math[M]. Informally, each production of $math[G] mimics some transition of $math[M]. Formally, $math[G=(V,\Sigma,R,S)] where:
  * $math[V = K \cup \Sigma] - the set of non-terminals is the set of states, and the set of terminals is the set of symbols;
  * $math[S = q_0] - the start symbol corresponds to the initial state;
  * for each transition $math[\delta(q,\texttt{c})=p], we build a production rule $math[q\rightarrow \texttt{c}p]; for each final state $math[q\in F], we build a production rule $math[q\rightarrow \epsilon].

The grammar is obviously regular.
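This construction is easy to implement. The following Python sketch (our own illustration, using an assumed toy DFA for $math[L((ab)^*)]) builds the production rules exactly as above:

<code python>
# A sketch of the DFA -> regular grammar construction above.
# Toy DFA for L((ab)*): states q0 (initial, final) and q1; only the
# transitions along accepted inputs are listed, for brevity.

delta = {("q0", "a"): "q1",          # delta(q0, a) = q1
         ("q1", "b"): "q0"}          # delta(q1, b) = q0
final = {"q0"}                       # F

productions = []
for (q, c), p in delta.items():
    productions.append((q, c + p))   # delta(q, c) = p  becomes  q -> cp
for q in final:
    productions.append((q, ""))      # q in F           becomes  q -> epsilon

for head, body in productions:
    print(head, "->", body or "epsilon")
# q0 -> aq1
# q1 -> bq0
# q0 -> epsilon
</code>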
To prove $math[L(M)=L(G)], we must show: for all $math[w=c_1\ldots c_n\in\Sigma^*], $math[(q_0,c_1\ldots c_n)\vdash_M^* (p,\epsilon)] with $math[p\in F], **iff** $math[q_0\Rightarrow_G^* c_1\ldots c_np], where we recall that $math[p] is a non-terminal for which the production rule $math[p\rightarrow \epsilon] exists (hence $math[q_0\Rightarrow_G^* c_1\ldots c_np\Rightarrow_G c_1\ldots c_n]). The above proposition can be easily proven by induction over the length of the word $math[w].

Direction $math[\Leftarrow]. Suppose $math[G=(V,\Sigma,R,S)] is a regular grammar. We build an NFA $math[M=(K,\Sigma,\Delta,q_0,F)] whose transitions //mimic// production rules:
  * $math[K = (V\setminus \Sigma) \cup \{p\}]: for each non-terminal of $math[G], we build a state in $math[M]; additionally, we build a fresh final state $math[p];
  * $math[q_0 = S];
  * $math[F=\{p\}];
  * for each production rule $math[A\rightarrow cB] in $math[G], where $math[B] is a non-terminal and $math[c\in\Sigma^*], we build a transition $math[(A,c,B)\in\Delta]; also, for each production rule $math[A\rightarrow c], with $math[c\in\Sigma^*], we build a transition $math[(A,c,p)\in\Delta].

We must prove that: for all $math[w=c_1\ldots c_n\in\Sigma^*], we have $math[S\Rightarrow_G^* c_1\ldots c_n] **iff** $math[(q_0,c_1\ldots c_n)\vdash_M^*(p,\epsilon)]. The proof is similar to the above one.
$end

The theorem shows that context-free languages are a **superset** of regular languages. The inclusion is //proper//: our very first example, $math[\{a^nb^n \mid n\geq 0\}], generated by $math[S\rightarrow aSb\mid\epsilon], is context-free but **not regular**.
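The reverse construction is just as direct. Here is a Python sketch (again our own illustration) that turns the regular grammar built earlier back into an NFA; since non-terminal names like ''q1'' span several characters, production bodies are encoded as tuples of symbols:

<code python>
# A sketch of the regular grammar -> NFA construction above, applied to the
# grammar  q0 -> aq1 | epsilon,  q1 -> bq0  (for L((ab)*)). Bodies are tuples
# of symbols, since "q0" and "q1" are multi-character names.

productions = [("q0", ("a", "q1")),
               ("q1", ("b", "q0")),
               ("q0", ())]                  # () stands for epsilon
nonterminals = {"q0", "q1"}

states = nonterminals | {"p"}               # K = (V \ Sigma) u {p}
start, final = "q0", {"p"}                  # q0 = S, F = {p}

transitions = set()                         # Delta, with word labels in Sigma*
for head, body in productions:
    if body and body[-1] in nonterminals:   # A -> cB  becomes  (A, c, B)
        transitions.add((head, "".join(body[:-1]), body[-1]))
    else:                                   # A -> c   becomes  (A, c, p)
        transitions.add((head, "".join(body), "p"))

print(sorted(transitions))
# [('q0', '', 'p'), ('q0', 'a', 'q1'), ('q1', 'b', 'q0')]
</code>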