====== Context-Free Grammars ======

===== Motivation =====

Regular expressions are **insufficient** for describing the structure of complicated languages (e.g. programming languages). We recall our previously-used example: //simple arithmetic expressions with parentheses//:

<code>
<expr> ::= <atom> | <expr> + <expr> | (<expr>)
</code>

This language is **not regular**; however, it can be described via a //generator object// called a (context-free) **grammar** (CFG, for short).

$def[CFG]
A **CFG** is a 4-tuple: $math[G=(V,\Sigma,R,S)] where:
  * $math[V] is a finite set whose elements are called **non-terminals and terminals**
  * $math[\Sigma\subseteq V] is the set of **terminals**
  * $math[R] is a **relation** over $math[(V\setminus\Sigma)\times V^*]. Here $math[V\setminus\Sigma] is the set of **non-terminals** and $math[V^*] is the set of **words** over $math[V]. An element of $math[R] is called a **production rule**. We explain productions below.
  * $math[S\in V\setminus\Sigma] is the **start symbol**.
$end

As an example, consider the following CFG $math[G], where:
  * $math[V=\{S,a,b\}]
  * $math[\Sigma=\{a,b\}]
  * $math[R=\{S\rightarrow aSb, S\rightarrow \epsilon\}]

This grammar contains a **single** non-terminal $math[S], which is also the start symbol. **Production rules** are written as follows: $math[\displaystyle X \rightarrow Y] where $math[X] is a **non-terminal** and $math[Y] is a string of terminal and non-terminal symbols. Our grammar has two production rules:
  * $math[S\rightarrow aSb]
  * $math[S\rightarrow \epsilon]

We can also write production rules of the form: $math[\displaystyle X \rightarrow Y_1, \ldots, X \rightarrow Y_n] in the more compact form: $math[\displaystyle X \rightarrow Y_1 \mid \ldots \mid Y_n]. In our example, we can write: $math[S\rightarrow aSb \mid \epsilon].

As a general **convention**, we use **italic uppercase symbols** to designate **non-terminals**, and **lowercase symbols** (or occasionally, typewriter symbols, e.g. $math[\texttt{A}]) to designate **terminals**. At the same time, $math[S] is always used to designate the **start symbol**. Under this convention, we can completely define a grammar by giving the set of productions only. For instance, we can define a CFG for expressions as follows:

$math[S \rightarrow S + S \mid (S) \mid A]
$math[A \rightarrow UVT]
$math[U \rightarrow \texttt{A} \mid \ldots \mid \texttt{Z}]
$math[V \rightarrow LV \mid \epsilon]
$math[L \rightarrow \texttt{a} \mid \ldots \mid \texttt{z}]
$math[T \rightarrow DT \mid \epsilon]
$math[D \rightarrow \texttt{0} \mid \ldots \mid \texttt{9}]

We have preserved the same convention for atoms: they must start with an uppercase letter, followed by zero-or-more lowercase letters, and then zero-or-more digits. (Context-free) grammars are the **cornerstone** for writing parsers.

===== The language of a CFG =====

Let $math[\alpha A\beta] and $math[\alpha\gamma\beta] be strings from $math[V^*], where $math[A] is a **non-terminal**. Also, suppose we have a production $math[A\rightarrow\gamma] in a CFG $math[G]. Then we say: $math[\alpha A\beta \Rightarrow_G \alpha\gamma\beta] and read that $math[\alpha\gamma\beta] is a **one-step derivation** of $math[\alpha A\beta]. The relation over strings $math[\Rightarrow_G] is very similar in spirit to $math[\vdash_M]. We omit the subscript when the grammar $math[G] is understood from context, and write $math[\Rightarrow^*] to refer to the **reflexive and transitive closure** of $math[\Rightarrow]. Thus, $math[\Rightarrow^*] is the **zero-or-more-steps derivation** relation.
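To make these definitions concrete, below is a minimal Python sketch (our own illustration; the names and the string-based encoding are assumptions, not part of the notes) of the one-step derivation relation for the grammar $math[S\rightarrow aSb \mid \epsilon]:

<code python>
# A minimal sketch: the grammar S -> aSb | epsilon as a set of productions,
# and the one-step derivation relation "=>" defined above. Each sentential
# form is encoded as a plain string over V; "" stands for epsilon.

GRAMMAR = {"S": ["aSb", ""]}          # R: non-terminal -> list of bodies
NON_TERMINALS = set(GRAMMAR)          # V \ Sigma

def one_step_derivations(word):
    """All strings alpha.gamma.beta such that word = alpha.A.beta => alpha.gamma.beta."""
    results = []
    for i, symbol in enumerate(word):
        if symbol in NON_TERMINALS:            # choose an occurrence of a non-terminal A
            for body in GRAMMAR[symbol]:       # and a production A -> gamma
                results.append(word[:i] + body + word[i + 1:])
    return results

# S => aSb => aaSbb => aabb (the last step applies S -> epsilon)
print(one_step_derivations("S"))     # ['aSb', '']
print(one_step_derivations("aSb"))   # ['aaSbb', 'ab']
</code>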
As an example, consider the grammar for arithmetic expressions, and the following derivation:

$math[S\Rightarrow (S)\Rightarrow(S+S)\Rightarrow(A+S)\Rightarrow(A+(S+S))\Rightarrow(A+(A+S))\Rightarrow(A+(A+A))]

Hence, we have $math[S\Rightarrow^*(A+(A+A))]. Notice that the string $math[(A+(A+A))] contains **non-terminals**. One possible derivation for $math[A] is:

$math[A\Rightarrow UVT\Rightarrow \texttt{X}VT\Rightarrow\texttt{X}T\Rightarrow\texttt{X}DT\Rightarrow\texttt{X0}T\Rightarrow\texttt{X0}]

Similarly, we may write derivations that witness $math[A\Rightarrow^*\texttt{Y}] and $math[A\Rightarrow^*\texttt{Z}], and finally: $math[S\Rightarrow^*(A+(A+A))\Rightarrow^*(\texttt{X0}+(\texttt{Y}+\texttt{Z}))]. Notice that $math[(\texttt{X0}+(\texttt{Y}+\texttt{Z}))] contains only **terminal symbols**.

$def[Language of a grammar]
For a CFG $math[G], the **language generated by G** is defined as: $math[L(G)=\{w\in\Sigma^*\mid S\Rightarrow^*_G w\}]
$end

Informally, $math[L(G)] is the set of words (over terminals) that can be obtained via **zero-or-more** derivation steps from the start symbol of $math[G]. If a language is generated by a CFG, then it is called a **context-free language**.

===== Parse trees =====

Informally, a parse tree is an //illustration// of a //sequence of derivations//. We illustrate a parse tree for $math[(A+(A+A))] below:

<code>
      S
    / | \
   (  S  )
     / | \
    S  +  S
    |    / | \
    A   (  S  )
         / | \
        S  +  S
        |     |
        A     A
</code>

Notice that there is not a **one-to-one** correspondence between a sequence of derivations and a parse tree. For instance, we may first derive the left-hand side of ''+'', or the right-hand side. However, a parse tree uniquely identifies the set of productions used in the derivation, and how they are applied.

The construction rules for parse trees are as follows:
  * **the root** of the tree is the **start symbol**
  * each **interior node** $math[X] having as children nodes $math[Y_1, \ldots, Y_n] corresponds to a **production rule** $math[X\rightarrow Y_1 \ldots Y_n]
  * if **each leaf** is a **terminal**, then the parse tree **yields** a word of $math[L(G)].

For instance, the following parse tree:

<code>
     S
   / | \
  a  S  b
    / | \
   a  S  b
      |
      ε
</code>

yields the word $math[aabb], which is obtained by concatenating each terminal leaf from left to right.

Parse trees are especially useful for parsing, because they reveal **the structure** of a parsed program (or word, in general). It is only natural that we require **the program structure to be //unique//**. However, it is quite easy to find grammars where **the same word** has **different** parse trees as yield. Consider the following CFG:

$math[S \rightarrow S + S \mid S * S \mid \texttt{a}]

The word $math[a+a*a] has two different parse trees:

<code>
     S
   / | \
  S  +  S
  |    / | \
  a   S  *  S
      |     |
      a     a
</code>

and

<code>
        S
      / | \
     S  *  S
   / | \    |
  S  +  S   a
  |     |
  a     a
</code>

Incidentally, these two different //structures// reflect different interpretations of our arithmetic expression. Thus, our grammar is **ambiguous**. In general, a grammar is ambiguous if there **exist two different parse trees for the same word**.

To remove ambiguity in our example, it is sufficient to:
  - include **precedence rules** in the grammar;
  - enforce parsing to proceed //left-to-right//.

The result is:

$math[S\rightarrow M + S \mid M]
$math[M\rightarrow T * M \mid T]
$math[T\rightarrow a]

The first production rule enforces //left-to-right// parsing. Consider the alternative production: $math[S \rightarrow S + S \mid M]. Via this production, a parse tree might unfold //to the left// ad infinitum, **depending on how the parser implementation works**:

<code>
        S
      / | \
     S  +  S
   / | \
  S  +  S
 ...
</code>
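The following runnable Python sketch (our own illustration; the parser and its names are hypothetical, not prescribed by these notes) shows the problem concretely: a naive top-down parser that follows $math[S \rightarrow S + S] recurses on $math[S] without consuming any input, and therefore never terminates:

<code python>
import sys
sys.setrecursionlimit(100)   # fail fast instead of after ~1000 stack frames

# A naive recursive-descent procedure for the left-recursive production
# S -> S + S. To recognise an S, it first recognises an S at the *same*
# input position, so no input is ever consumed and the recursion never ends.

def parse_S(tokens, pos=0):
    pos = parse_S(tokens, pos)                    # left operand: no progress!
    if pos < len(tokens) and tokens[pos] == "+":
        pos = parse_S(tokens, pos + 1)            # right operand
    return pos

try:
    parse_S(list("a+a"))
except RecursionError:
    print("S -> S + S loops forever in a naive top-down parser")
</code>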
By instead using $math[S\rightarrow M + S \mid M], we are requiring that a suitable **multiplication term** be found at the left of ''+'', while any expression may occur at its right. The second production describes **multiplication terms**. Note that, under this grammar, addition cannot appear within a multiplication term. If we want to allow this, we need a new production rule which includes parentheses. Can you figure out how this modification should be done?

==== Solving ambiguity in general ====

Consider another example:

$math[L = \{a^nb^nc^md^m \mid n,m\geq 1\} \cup \{a^nb^mc^md^n \mid n,m\geq 1\}]

This language contains strings in $math[L(aa^*bb^*cc^*dd^*)] where (''number(a)=number(b)'' and ''number(c)=number(d)'') or (''number(a)=number(d)'' and ''number(b)=number(c)''). One possible CFG is:

$math[S\rightarrow AB \mid C]
$math[A\rightarrow aAb \mid ab]
$math[B\rightarrow cBd \mid cd]
$math[C\rightarrow aCd \mid aDd]
$math[D\rightarrow bDc \mid bc]

This CFG is **ambiguous**: the word $math[aabbccdd] has two different parse trees:

<code>
     S
    / \
   A   B
  ...  ...
</code>

and

<code>
     S
     |
     C
   / | \
    ...
</code>

The reason for the ambiguity is that, in $math[aabbccdd], both conditions of the grammar hold (''number(a)=number(b)=number(c)=number(d)''). It is not straightforward how ambiguity can be lifted from this grammar. This particular example raises two interesting questions:
  * can we **automatically** lift ambiguity from any CFG?
  * how can we find an unambiguous grammar for a given context-free language?

We cannot provide a general answer to either of the above questions. In fact, **the problem of establishing whether a CFG is ambiguous is undecidable**. Also, **there exist context-free languages for which no unambiguous grammar exists**; such languages are called //inherently ambiguous//, and our above example is one of them.

===== Regular grammars =====

$def[Regular grammars]
A grammar is called **regular** iff **all its production rules** have **one of** the following forms:

$math[ X \rightarrow aA]
$math[ X \rightarrow A]
$math[ X \rightarrow a]
$math[ X \rightarrow \epsilon]

where $math[A,X] are non-terminals and $math[a] is a terminal.
$end

Formally: $math[ R\subseteq (V\setminus\Sigma)\times(\Sigma^*((V\setminus\Sigma)\cup\{\epsilon\}))]

(The formal statement is slightly more liberal: it allows several terminals before the final non-terminal, e.g. $math[X\rightarrow abcA]; such a rule can always be broken into rules of the above forms.) Thus:
  * each production rule contains **at most one non-terminal**
  * each non-terminal appears as //the last symbol in the production body//

As it turns out, regular grammars **precisely capture regular languages**:

$theorem[Regular grammars capture regular languages]
A language is **regular** iff it is generated by a regular grammar.
$end

$proof
Direction $math[\Rightarrow]. Suppose $math[L] is a regular language, i.e. it is accepted by a DFA $math[M=(K,\Sigma,\delta,q_0,F)]. We build a regular grammar $math[G] from $math[M]. Informally, each production of $math[G] mimics some transition of $math[M]. Formally, $math[G=(V,\Sigma,R,S)] where:
  * $math[V = K \cup \Sigma] - the set of non-terminals is the set of states, and the set of terminals is the set of symbols;
  * $math[S = q_0] - the start symbol corresponds to the initial state;
  * for each transition $math[\delta(q,\texttt{c})=p], we build a production rule $math[q\rightarrow \texttt{c}p]; for each final state $math[q\in F], we build a production rule $math[q\rightarrow \epsilon].

The grammar is obviously regular.
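This construction is easy to implement. The following Python sketch (our own illustration, using an assumed toy DFA for $math[L((ab)^*)]) builds the production rules exactly as above:

<code python>
# A sketch of the DFA -> regular grammar construction above.
# Toy DFA for L((ab)*): states q0 (initial, final) and q1; only the
# transitions along accepted inputs are listed, for brevity.

delta = {("q0", "a"): "q1",          # delta(q0, a) = q1
         ("q1", "b"): "q0"}          # delta(q1, b) = q0
final = {"q0"}                       # F

productions = []
for (q, c), p in delta.items():
    productions.append((q, c + p))   # delta(q, c) = p  becomes  q -> cp
for q in final:
    productions.append((q, ""))      # q in F           becomes  q -> epsilon

for head, body in productions:
    print(head, "->", body or "epsilon")
# q0 -> aq1
# q1 -> bq0
# q0 -> epsilon
</code>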
To prove $math[L(M)=L(G)], we must show: for all $math[w=c_1\ldots c_n\in\Sigma^*], $math[(q_0,c_1\ldots c_n)\vdash_M^* (p,\epsilon)] with $math[p\in F], **iff** $math[q_0\Rightarrow_G^* c_1\ldots c_np], where we recall that $math[p] is a non-terminal for which the production rule $math[p\rightarrow \epsilon] exists (hence $math[q_0\Rightarrow_G^* c_1\ldots c_np\Rightarrow_G c_1\ldots c_n]). The above proposition can be easily proven by induction over the length of the word $math[w].

Direction $math[\Leftarrow]. Suppose $math[G=(V,\Sigma,R,S)] is a regular grammar. We build an NFA $math[M=(K,\Sigma,\Delta,q_0,F)] whose transitions //mimic// production rules:
  * $math[K = (V\setminus \Sigma) \cup \{p\}]: for each non-terminal of $math[G], we build a state in $math[M]; additionally, we build a fresh final state $math[p];
  * $math[q_0 = S];
  * $math[F=\{p\}];
  * for each production rule $math[A\rightarrow cB] in $math[G], where $math[B] is a non-terminal and $math[c\in\Sigma^*], we build a transition $math[(A,c,B)\in\Delta]; also, for each production rule $math[A\rightarrow c], with $math[c\in\Sigma^*], we build a transition $math[(A,c,p)\in\Delta].

We must prove that: for all $math[w=c_1\ldots c_n\in\Sigma^*], we have $math[S\Rightarrow_G^* c_1\ldots c_n] **iff** $math[(q_0,c_1\ldots c_n)\vdash_M^*(p,\epsilon)]. The proof is similar to the above one.
$end

The theorem shows that context-free languages are a **superset** of regular languages. The inclusion is //proper//: our very first example, $math[\{a^nb^n \mid n\geq 0\}], generated by $math[S\rightarrow aSb\mid\epsilon], is context-free but **not regular**.
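The reverse construction is just as direct. Here is a Python sketch (again our own illustration) that turns the regular grammar built earlier back into an NFA; since non-terminal names like ''q1'' span several characters, production bodies are encoded as tuples of symbols:

<code python>
# A sketch of the regular grammar -> NFA construction above, applied to the
# grammar  q0 -> aq1 | epsilon,  q1 -> bq0  (for L((ab)*)). Bodies are tuples
# of symbols, since "q0" and "q1" are multi-character names.

productions = [("q0", ("a", "q1")),
               ("q1", ("b", "q0")),
               ("q0", ())]                  # () stands for epsilon
nonterminals = {"q0", "q1"}

states = nonterminals | {"p"}               # K = (V \ Sigma) u {p}
start, final = "q0", {"p"}                  # q0 = S, F = {p}

transitions = set()                         # Delta, with word labels in Sigma*
for head, body in productions:
    if body and body[-1] in nonterminals:   # A -> cB  becomes  (A, c, B)
        transitions.add((head, "".join(body[:-1]), body[-1]))
    else:                                   # A -> c   becomes  (A, c, p)
        transitions.add((head, "".join(body), "p"))

print(sorted(transitions))
# [('q0', '', 'p'), ('q0', 'a', 'q1'), ('q1', 'b', 'q0')]
</code>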