====== Properties of Context-Free Languages ======

===== Chomsky Normal Form =====

$def[CNF]
A Context-Free Grammar $math[G] is in **Chomsky Normal Form** if **each** production has one of the following forms:
  * $math[A\rightarrow BC] where $math[A,B,C] are **non-terminals** 
  * $math[A\rightarrow a] where $math[A] is a non-terminal and $math[a] is a **terminal**

and there are no **useless** non-terminals.
$end
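
The shape condition of the definition is easy to check mechanically. Below is a minimal sketch, assuming a hypothetical encoding in which a grammar is a dictionary from non-terminals to lists of production bodies (tuples of symbols), and a symbol counts as a non-terminal exactly when it is a key of that dictionary. The condition on useless non-terminals is deliberately not checked.

```python
def has_cnf_shape(grammar):
    """Check that every production is A -> B C or A -> a.

    `grammar` maps each non-terminal to a list of bodies (tuples of symbols).
    A symbol is a non-terminal exactly when it is a key of `grammar`.
    The "no useless non-terminals" condition is NOT checked here.
    """
    for bodies in grammar.values():
        for body in bodies:
            two_nonterminals = len(body) == 2 and all(s in grammar for s in body)
            one_terminal = len(body) == 1 and body[0] not in grammar
            if not (two_nonterminals or one_terminal):
                return False
    return True


cnf = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}
not_cnf = {"S": [("A", "b")], "A": [("a",)]}   # S -> A b mixes both kinds
print(has_cnf_shape(cnf), has_cnf_shape(not_cnf))
```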

The following result holds:

$prop[CNF]
Let $math[L] be a Context-Free Language such that $math[L\neq \emptyset] and $math[\epsilon\not\in L]. Then there exists a grammar $math[G] in **Chomsky Normal Form** such that $math[L(G)=L].
$end

The Chomsky Normal Form is useful because it makes reasoning (proofs) about grammars much easier (fewer cases need to be considered). CNF is also the input format expected by some algorithms, such as the CYK bottom-up parsing algorithm.
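
As an illustration of why the restricted shape helps, here is a minimal sketch of a bottom-up membership test in the style of the CYK algorithm (the algorithm is not named or described in these notes, so treat it as an outside example). The grammar encoding (a dictionary from non-terminals to lists of tuples) and the sample grammar for $math[\{a^nb^n \mid n\geq 1\}] are assumptions made for this example.

```python
def cyk(grammar, start, word):
    """Return True iff `word` is derivable from `start` in the CNF `grammar`."""
    n = len(word)
    if n == 0:
        return False  # a CNF grammar never generates the empty word
    # table[i][l] = set of non-terminals deriving word[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        for head, bodies in grammar.items():
            if (ch,) in bodies:
                table[i][1].add(head)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for head, bodies in grammar.items():
                    for body in bodies:
                        if len(body) == 2 and body[0] in left and body[1] in right:
                            table[i][length].add(head)
    return start in table[0][n]


# Hypothetical CNF grammar for { a^n b^n | n >= 1 }:
# S -> A T | A B,  T -> S B,  A -> a,  B -> b
anbn = {
    "S": [("A", "T"), ("A", "B")],
    "T": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
print(cyk(anbn, "S", "aabb"), cyk(anbn, "S", "aab"))
```

Because every body has exactly two non-terminals or one terminal, only a single split point per sub-word has to be considered, which keeps the table-filling loop simple.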

We shall provide, without proof, a sequence of transformations which can turn any grammar $math[G] into one in CNF.
  * replace each production of the form $math[A\rightarrow B_1 \ldots a\ldots B_n] where $math[B_i] are non-terminals and $math[a] is a terminal, by two productions:
    * $math[A\rightarrow B_1\ldots C\ldots B_n] and $math[C\rightarrow a]
  * replace each production of the form $math[A\rightarrow B_1 \ldots B_n] with $math[n\geq 3] by $math[n-1] productions, where $math[C_1,\ldots,C_{n-2}] are fresh non-terminals:
    * $math[A\rightarrow B_1 C_1], $math[C_1 \rightarrow B_2 C_2], ..., $math[C_{n-2}\rightarrow B_{n-1}B_n]
  * recursively eliminate all productions of the form $math[A\rightarrow \epsilon]; what is the correct algorithm for implementing this elimination?
  * recursively eliminate all productions of the form $math[A\rightarrow B], by adding, for each production $math[B\rightarrow \alpha], the production $math[A\rightarrow \alpha].

Remarks:
  * the order in which these transformations are applied may affect the **correctness** of the result. (Can you identify an example?)
  * the order may also affect the size of the CNF grammar, which can be (worst-case) exponential w.r.t. the size of the original grammar.
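
The first two transformations can be sketched as follows. This is only an illustration under stated assumptions: grammars are dictionaries from non-terminals to lists of tuples, the fresh names `T_a` and `C1, C2, ...` are hypothetical and assumed not to clash with existing symbols, and the elimination of $math[\epsilon]-productions and of productions $math[A\rightarrow B] is not covered.

```python
def lift_terminals(grammar):
    """In bodies of length >= 2, replace each terminal a by a fresh
    non-terminal T_a with the single production T_a -> a."""
    result = {head: [] for head in grammar}
    for head, bodies in grammar.items():
        for body in bodies:
            if len(body) < 2:
                result[head].append(body)
                continue
            new_body = []
            for sym in body:
                if sym in grammar:              # non-terminal: keep as is
                    new_body.append(sym)
                else:                           # terminal: route through T_sym
                    fresh = "T_" + sym
                    result.setdefault(fresh, [(sym,)])
                    new_body.append(fresh)
            result[head].append(tuple(new_body))
    return result


def binarize(grammar):
    """Split each A -> B1 ... Bn with n >= 3 into n-1 binary productions."""
    result = {head: [] for head in grammar}
    counter = 0
    for head, bodies in grammar.items():
        for body in bodies:
            current = head
            while len(body) > 2:
                counter += 1
                fresh = f"C{counter}"
                result.setdefault(current, []).append((body[0], fresh))
                current, body = fresh, body[1:]
            result.setdefault(current, []).append(body)
    return result


example = {"S": [("A", "b", "A", "A")], "A": [("a",)]}
converted = binarize(lift_terminals(example))
for head, bodies in converted.items():
    print(head, "->", " | ".join(" ".join(b) for b in bodies))
```

The body of length 4 is replaced by 3 binary productions using 2 fresh non-terminals.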

$exercise[CNF]
Transform the following grammar into CNF:

$math[S\rightarrow ABC]

$math[B\rightarrow \epsilon]

$math[A\rightarrow aC\mid Ab\mid b]

$math[C \rightarrow AB \mid B]
$end 


===== Pumping Lemma for Context-Free Languages =====

In order to state and prove the pumping lemma, we need to investigate the **shape** and **size** of the parse-trees for each word. Chomsky Normal Form turns each parse-tree of a word from a CF language into a **binary tree**. (Recall that all productions have the form $math[A\rightarrow BC], except for those of the form $math[A\rightarrow a], which introduce the leaves.)

We exploit this shape in order to prove the pumping lemma, via the following observation:

$prop[Parse tree]
Let $math[G] be a grammar in CNF and $math[w\in L(G)]. If the **height** (maximal length of a path from the root to a leaf) of the parse-tree of $math[w] is $math[n], then $math[\mid w \mid \leq 2^{n-1}]
$end

$proof
The proof is by induction over $math[n]. 

**Basis** $math[n=1].

If the height of the parse-tree is one, then the tree consists of a single production $math[S\rightarrow a], so $math[w] contains a single symbol, hence $math[\mid w\mid = 1 \leq 2^0]. (No parse-tree of a CNF grammar has height zero.)

**Induction step** 

Suppose $math[n+1] is the height of a parse-tree for some $math[w], where $math[n\geq 1], and suppose the bound holds for all parse-trees of height at most $math[n]. Since $math[n+1>1] and $math[G] is in CNF, the root of the tree must use a production of the form $math[S\rightarrow AB]. Hence, $math[w=uv] such that $math[u] is derived from $math[A] and $math[v] is derived from $math[B]. Also, the parse-trees rooted at $math[A] (resp. $math[B]) have height **at most** $math[n]. By induction hypothesis, $math[\mid u\mid \leq 2^{n-1}] and $math[\mid v\mid \leq 2^{n-1}]. Thus, $math[\mid uv \mid\leq 2^{n-1} + 2^{n-1} = 2^{n} = 2^{(n+1)-1}].
$end
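
The recurrence behind the bound can be checked numerically. The sketch below (the function name `max_yield` is an assumption for this example) computes the largest possible word length for a CNF parse-tree of a given height, using the fact that a tree of height 1 is a single production $math[A\rightarrow a] and that a taller tree starts with $math[A\rightarrow BC].

```python
def max_yield(height):
    """Largest word length achievable by a CNF parse-tree of this height (>= 1)."""
    if height == 1:
        return 1                      # the tree is a single production A -> a
    # The root uses A -> B C; each sub-tree has height at most height - 1.
    return 2 * max_yield(height - 1)


for h in range(1, 6):
    print(h, max_yield(h), 2 ** (h - 1))
```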

==== Pumping Lemma Statement ====

Let $math[L] be a CF language. Then, there exists a **constant** $math[n\in\mathbb{N}] such that **every** word $math[z\in L] with $math[\mid z\mid \geq n] can be written as $math[z=uvwxy] such that:
  * $math[\mid vwx \mid \leq n] - the middle portion is not //too long//
  * $math[vx \neq \epsilon] - at least one of the portions $math[v] or $math[x] //to-be-pumped// is not empty
  * for all $math[i\geq 0], $math[uv^iwx^iy\in L]

$proof
We first deal with special cases:
  * $math[L=\emptyset]. The lemma trivially holds, since a universal condition over the empty set is vacuously true.
  * $math[\epsilon \in L]. We can always //ignore// $math[\epsilon] by choosing $math[n] larger than 0.

The rest of the proof deals with a **non-empty** CF language $math[L\setminus\{\epsilon\}].

Suppose $math[G] is a CNF grammar such that $math[L(G) = L\setminus\{\epsilon\}], and let $math[m] designate the **number of non-terminals**. We fix $math[n=2^{m}]. If a word has a parse-tree of **height** at most $math[m], then its length is **at most** $math[2^{m-1} = n/2]. Hence, **each** parse tree for a word $math[z] with $math[\mid z\mid\geq n] must have height **of at least** $math[m+1].

Consider a longest path through some parse-tree of $math[z], and let $math[A_1, A_2, \ldots, A_m, A_{m+1}, a] be its last $math[m+1] non-terminals followed by the leaf (they exist, since the height is at least $math[m+1]). Since there are only $math[m] **non-terminals**, there must be some $math[A_i = A_j = A] with $math[i < j]. Therefore, we can split $math[z] as follows:
  * fix $math[vwx] to be the word generated from $math[A_i]. The sub-tree rooted at $math[A_i] has height at most $math[m+1], hence $math[\mid vwx\mid \leq 2^{m} = n].
  * fix $math[w] to be the word generated from $math[A_j] (since $math[i<j], the node $math[A_j] lies below $math[A_i], so $math[w] is a part of $math[vwx]). Since no $math[A\rightarrow \epsilon] and no $math[A\rightarrow B] productions are allowed, $math[v] and $math[x] cannot be both $math[\epsilon].

Case $math[i=0]: $math[uwy\in L]. The non-terminal $math[A] occurs twice on the chosen path of the parse-tree of $math[z=uvwxy]: the upper occurrence generates $math[vwx] and the lower occurrence generates $math[w]. Replacing the sub-tree rooted at the upper occurrence by the sub-tree rooted at the lower occurrence, we get a valid parse-tree which generates $math[uwy]. Hence $math[uwy\in L].

Case $math[i>0]: $math[uv^iwx^iy \in L]. For $math[i=1] the word is $math[z] itself. Otherwise, we replace the sub-tree rooted at the lower occurrence of $math[A] by a copy of the sub-tree rooted at the upper occurrence (which also contains the lower one). In so doing, we generate the word $math[uvvwxxy] (for $math[i=2]). We can repeat this process as many times as needed.
$end
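
To make the statement concrete, here is a small sketch for the CF language $math[\{a^kb^k \mid k\geq 1\}] (an example chosen for illustration, not taken from the proof above). The value of `n` and the particular split are assumptions. The code only checks that this split satisfies the three conditions of the lemma for the first few values of $math[i].

```python
def in_anbn(word):
    """Membership in { a^k b^k | k >= 1 }."""
    k = len(word) // 2
    return k >= 1 and word == "a" * k + "b" * k


def pump(u, v, w, x, y, i):
    """Build the word u v^i w x^i y."""
    return u + v * i + w + x * i + y


n = 4                                    # hypothetical pumping constant
z = "a" * n + "b" * n
# Split around the middle: v is the last a, x is the first b, w is empty.
u, v, w, x, y = "a" * (n - 1), "a", "", "b", "b" * (n - 1)

for i in range(4):
    print(i, pump(u, v, w, x, y, i), in_anbn(pump(u, v, w, x, y, i)))
```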

==== Using the Pumping Lemma to show a language is not CF ====

The Pumping Lemma for CFLs is usually deployed to show languages are not CF. The //recipe// for its usage is similar to its counterpart for Regular languages. We illustrate it on the following example:

Let $math[L=\{a^n b^n c^n \mid n\geq 1\}].
  * **for any** $math[n], we get to choose,
  * **some** $math[z] such that $math[\mid z \mid\geq n]. We denote those word-choices by $math[z_n] to emphasize that they depend on $math[n]. For our language, let $math[z_n=a^n b^n c^n] (do not mistake this $math[n] with the index used in the definition of our language).
  * **for any** possible split of $math[z] into $math[uvwxy], such that $math[\mid vwx\mid \leq n] and $math[vx\neq\epsilon],
  * we must find **some** $math[i\geq 0] such that $math[uv^iwx^iy \not\in L].

Since $math[\mid vwx\mid \leq n], the portion $math[vwx] cannot contain both a's and c's (they are separated by $math[n] b's). We therefore distinguish two cases:
  * $math[vwx] contains only a's and b's. By fixing $math[i=0], the word $math[uwy] will contain fewer than $math[n] a's or fewer than $math[n] b's (but exactly $math[n] c's), since $math[vx\neq\epsilon]. Hence $math[uwy\not\in L].
  * $math[vwx] contains only b's and c's. The same line of reasoning applies: $math[uwy] contains exactly $math[n] a's, but fewer than $math[n] b's or fewer than $math[n] c's.
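
The case analysis can be cross-checked by brute force for small values of $math[n]. The sketch below enumerates every admissible split of $math[a^nb^nc^n] and verifies that pumping with $math[i=0] or $math[i=2] leaves the language. The function names are assumptions, and a check for a few small $math[n] is of course not a proof.

```python
def in_language(word):
    """Membership in { a^k b^k c^k | k >= 1 }."""
    k = len(word) // 3
    return k >= 1 and word == "a" * k + "b" * k + "c" * k


def every_split_fails(z, n):
    """True iff every split z = uvwxy with |vwx| <= n and vx != epsilon
    has some i in {0, 2} such that u v^i w x^i y is outside the language."""
    for start in range(len(z) + 1):                           # u = z[:start]
        for end in range(start, min(start + n, len(z)) + 1):  # vwx = z[start:end]
            for p in range(start, end + 1):                   # v = z[start:p]
                for q in range(p, end + 1):                   # w = z[p:q], x = z[q:end]
                    u, v, w = z[:start], z[start:p], z[p:q]
                    x, y = z[q:end], z[end:]
                    if v + x == "":
                        continue
                    if all(in_language(u + v * i + w + x * i + y) for i in (0, 2)):
                        return False
    return True


for n in range(1, 5):
    z = "a" * n + "b" * n + "c" * n
    print(n, every_split_fails(z, n))
```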


===== Closure properties =====

We start by formulating a //substitution theorem// which can be used to immediately prove most closure properties of Context-Free Languages.

A **substitution** over $math[\Sigma] is a mapping $math[s:\Sigma\rightarrow 2^{\Sigma^*}] which assigns a **language** to each symbol. As an example consider a substitution defined over $math[\{0,1\}] where:
  * $math[s(0)=\{a^nb^n \mid n \geq 1\}]
  * $math[s(1)=\{aa,bb\}]

Let $math[w=c_1\ldots c_n] be a word over $math[\Sigma] and $math[s] be a //substitution// over $math[\Sigma]. Then $math[s(w)] is the **concatenation of languages** $math[s(c_1)s(c_2)\ldots s(c_n)].

For instance, $math[s(01)] from our example is the language $math[\{a^nb^{n+2} \mid n\geq 1 \}\cup \{a^nb^naa \mid n \geq 1\}].

Finally, we can extend the semantics of a substitution to **languages**. Hence $math[s(L)] is the **union** of the languages $math[s(w)] where $math[w] is a word from $math[L]:

$math[s(L) = \bigcup_{w\in L} s(w)]
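
For finite languages, these definitions translate directly into set operations. The sketch below assumes a hypothetical cut-off `BOUND` that truncates $math[s(0)] to finitely many words, so that the result can be printed. It is not a representation of the full (infinite) language.

```python
def substitute_word(s, word):
    """s(w) = s(c_1) s(c_2) ... s(c_n), a concatenation of finite languages."""
    result = {""}                       # s of the empty word is {epsilon}
    for symbol in word:
        result = {u + v for u in result for v in s[symbol]}
    return result


def substitute_language(s, language):
    """s(L) = union of s(w) for all w in the finite language L."""
    return set().union(*(substitute_word(s, w) for w in language))


BOUND = 3   # hypothetical cut-off, only to keep s(0) finite
s = {
    "0": {"a" * n + "b" * n for n in range(1, BOUND + 1)},
    "1": {"aa", "bb"},
}
print(sorted(substitute_word(s, "01"), key=lambda w: (len(w), w)))
```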

We prove the following:

$prop[substitution]
Let $math[L] be a context-free language over $math[\Sigma] and $math[s] be a substitution over $math[\Sigma] such that, for all $math[a], the language $math[s(a)] is context-free. Then $math[s(L)] is a context-free language.
$end

$proof
Let $math[G=(V,\Sigma,R,S)] be a CFG for $math[L] and $math[G_a=(V_a,\Sigma,R_a,S_a)] be a CFG for the language $math[s(a)], for each $math[a\in\Sigma].

We build a grammar $math[G'=(V',\Sigma,R',S')] for the language $math[s(L)]:
  * $math[V'=V\cup\bigcup_{a\in\Sigma} V_a]. We assume all **non-terminals** in each $math[V_a] and in $math[V] are distinct (otherwise, we need to rename them);
  * We build $math[R'] by taking all productions $math[\bigcup_{a\in\Sigma} R_a], and also add all productions in $math[R], which we modify as follows: each occurrence of symbol $math[a] is replaced by $math[S_a].
  * $math[S'=S]

We show $math[L(G')\subseteq s(L)].
Suppose $math[w\in L(G')]. Then, we have a parse-tree of $math[w] which includes some non-terminals $math[S_a]. Hence, $math[w=u_1\ldots u_k] where each portion $math[u_i] is derived from some $math[S_{a_i}]. Hence, each $math[u_i \in L(G_{a_i}) = s(a_i)]. Moreover, cutting the parse-tree at the nodes $math[S_{a_i}] yields a parse-tree in $math[G] for $math[a_1\ldots a_k]. Hence $math[a_1\ldots a_k \in L]. By definition of $math[s], $math[w\in s(a_1\ldots a_k)\subseteq s(L)].

We show $math[s(L) \subseteq L(G')].
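Suppose $math[w\in s(L)]. Thus, $math[w=u_1\ldots u_k] such that each $math[u_i\in s(a_i)], i.e. each $math[u_i] is derived from the non-terminal $math[S_{a_i}], and $math[a_1\ldots a_k\in L]. Following in $math[G'] the derivation of $math[a_1\ldots a_k] from $math[G] yields $math[S_{a_1}\ldots S_{a_k}], and each $math[S_{a_i}] then derives $math[u_i]. Hence $math[w] can be derived in $math[G'], which ends the proof.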
$end
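
The construction of $math[G'] can be sketched as follows, using a hypothetical dictionary encoding of grammars (non-terminal to list of bodies, each body a tuple of symbols) and assuming that all non-terminal names are already pairwise distinct. The helper `words_up_to` is only there to inspect the result on short words and assumes there are no $math[\epsilon]-productions.

```python
def substitute_grammar(grammar, start, sub_grammars):
    """Build G' for s(L). `sub_grammars[a]` is the pair (G_a, S_a)."""
    new_rules = {}
    for g_a, _ in sub_grammars.values():          # keep all productions R_a
        for head, bodies in g_a.items():
            new_rules[head] = list(bodies)
    for head, bodies in grammar.items():          # R, with each a replaced by S_a
        new_rules[head] = [
            tuple(sub_grammars[sym][1] if sym in sub_grammars else sym for sym in body)
            for body in bodies
        ]
    return new_rules, start


def words_up_to(grammar, start, max_len):
    """All generated words of length <= max_len (no epsilon-productions assumed)."""
    seen, frontier, words = set(), [(start,)], set()
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:                           # only terminals left
            words.add("".join(form))
            continue
        for body in grammar[form[idx]]:           # expand leftmost non-terminal
            frontier.append(form[:idx] + body + form[idx + 1:])
    return words


g = {"S": [("0", "1")]}                                   # L = {01}
g0 = {"S0": [("a", "S0", "b"), ("a", "b")]}               # s(0) = {a^n b^n | n >= 1}
g1 = {"S1": [("a", "a"), ("b", "b")]}                     # s(1) = {aa, bb}
new_rules, new_start = substitute_grammar(g, "S", {"0": (g0, "S0"), "1": (g1, "S1")})
print(sorted(words_up_to(new_rules, new_start, 6)))
```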

==== Union ====

$prop[union] Let $math[A,B] be two CFLs. The language $math[A\cup B] is CF.
$end

$proof
Let $math[s(a)=A] and $math[s(b)=B] be a substitution. The finite language $math[\{a,b\}] is a CFL, and $math[s(\{a,b\}) = s(a)\cup s(b) = A\cup B]. By the substitution theorem, $math[A\cup B] is CF.
$end

==== Concatenation ====

$prop[concatenation] Let $math[A,B] be two CFLs. The language $math[AB] is CF.
$end

$proof
Let $math[s(a)=A] and $math[s(b)=B] be a substitution. The finite language $math[\{ab\}] is a CFL, and $math[s(\{ab\}) = s(a)s(b) = AB]. By the substitution theorem, $math[AB] is CF.
$end

==== Intersection ====

**Context-Free languages are not closed under intersection**.

Consider the languages $math[L_1=\{a^nb^nc^i \mid n,i\geq 1\}] and $math[L_2=\{a^ib^nc^n \mid n,i \geq 1\}].
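Informally, both $math[L_1] and $math[L_2] contain words from $math[L(a^+b^+c^+)]; in $math[L_1] the sequences of a's and b's have the same length, while in $math[L_2] the sequences of b's and c's have the same length. Both languages are context-free, since each is the concatenation of a language of the form $math[\{x^ny^n \mid n\geq 1\}] with a regular language.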

We have already shown that $math[L_1\cap L_2 = \{a^nb^nc^n \mid n\geq 1\}] is not a context-free language.

However:
$prop
If $math[A] is CF and $math[B] is regular, then $math[A\cap B] is **CF**.
$end

$proof
The proof follows a **product construction** similar to that for proving closure under intersection for regular languages.
$end

==== Complement ====

**CFLs are not closed under complement**.
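Suppose it were so. Then, since CFLs are closed under union, via de Morgan's laws we could show that $math[A\cap B= \overline{\overline{A}\cup\overline{B}}] is context-free for any two CFLs $math[A] and $math[B], which contradicts the fact that CFLs are not closed under intersection.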


==== Difference ====

**CFLs are not closed under difference**.
Consider $math[L_1=\{a^nb^ic^n\mid n,i\geq 1\}], $math[L_2=\{a^nb^mc^i\mid n,i\geq 1, m>n\}] and $math[L_3=\{a^nb^mc^i\mid n,i\geq 1, m<n\}]. In $math[L_2] the sequence of b's is longer than that of a's, while in $math[L_3] the sequence of b's is shorter than that of a's. 
All three languages are CF, hence $math[L_2 \cup L_3] is CF. The language $math[L_1\setminus (L_2 \cup L_3) = \{a^nb^nc^n\mid n \geq 1\}] is not CF.
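
The equality $math[L_1\setminus (L_2 \cup L_3) = \{a^nb^nc^n\mid n \geq 1\}] can be sanity-checked on all short words. The sketch below uses hypothetical membership predicates and a bound `MAX_LEN` chosen for illustration. It checks the set equality only up to that length.

```python
import re
from itertools import product


def blocks(word):
    """Return (#a, #b, #c) if word is in a+b+c+, else None."""
    m = re.fullmatch(r"(a+)(b+)(c+)", word)
    return tuple(len(g) for g in m.groups()) if m else None


def in_l1(word):                 # a^n b^i c^n
    t = blocks(word)
    return t is not None and t[0] == t[2]


def in_l2_or_l3(word):           # a^n b^m c^i with m > n or m < n
    t = blocks(word)
    return t is not None and t[0] != t[1]


def in_target(word):             # a^n b^n c^n
    t = blocks(word)
    return t is not None and t[0] == t[1] == t[2]


MAX_LEN = 8
words = ["".join(p) for k in range(MAX_LEN + 1) for p in product("abc", repeat=k)]
agree = all((in_l1(w) and not in_l2_or_l3(w)) == in_target(w) for w in words)
print(agree)
```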

However:
$prop
If $math[A] is CF and $math[B] is regular, then $math[A\setminus B] is **CF**.
$end

$proof
$math[A\setminus B = A \cap \overline{B}], which is CF, since $math[\overline{B}] is regular.
$end


==== Closure ====

$prop[closure] Let $math[A] be a CF language. The language $math[A^*] is CF.
$end

$proof
Let $math[s(a)=A] be a substitution. The language $math[\{a\}^*] is regular, hence a CFL, and $math[s(\{a\}^*) = A^*]. By the substitution theorem, $math[A^*] is CF.
$end

==== Reversal ====

$prop[reversal] Let $math[A] be a CF language. The language $math[A^R] is CF.
$end
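
The proposition is stated here without proof. One standard argument (an assumption of this note, not taken from the text above) reverses the body of every production of a grammar for $math[A]. A minimal sketch, using a hypothetical dictionary encoding of grammars:

```python
def reverse_grammar(grammar):
    """Turn every production A -> X1 ... Xn into A -> Xn ... X1."""
    return {
        head: [tuple(reversed(body)) for body in bodies]
        for head, bodies in grammar.items()
    }


# Hypothetical grammar for { a^n b^(2n) | n >= 1 }: S -> a S b b | a b b
g = {"S": [("a", "S", "b", "b"), ("a", "b", "b")]}
# Its reversal generates { b^(2n) a^n | n >= 1 }: S -> b b S a | b b a
print(reverse_grammar(g))
```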

==== Homomorphism and inverse homomorphism ====

CFLs are closed under both homomorphism and inverse homomorphism.

The first claim can be proved by building a substitution $math[s] from a homomorphism $math[h], as follows: $math[s(a)=\{h(a)\}] (i.e. the language containing a single word). Then, we have $math[h(L) = s(L)].
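
On individual words, the correspondence between $math[h] and $math[s] can be illustrated as follows. The homomorphism `h` and the sample words are hypothetical, and the check covers only this finite sample.

```python
def apply_h(h, word):
    """Homomorphic image of a word: concatenate the images of its symbols."""
    return "".join(h[c] for c in word)


def substitute_word(s, word):
    """s(w) as a concatenation of the finite languages s(c)."""
    result = {""}
    for c in word:
        result = {u + v for u in result for v in s[c]}
    return result


h = {"0": "ab", "1": ""}                          # hypothetical homomorphism
s = {c: {image} for c, image in h.items()}        # s(a) = {h(a)}
sample = ["", "0", "01", "0110", "111"]           # finite sample of words

for w in sample:
    print(repr(w), substitute_word(s, w), repr(apply_h(h, w)))
```

Each $math[s(w)] is a singleton containing exactly $math[h(w)], which is why $math[h(L) = s(L)].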