Properties of Context-Free Languages

Chomsky Normal Form

Definition (CNF):

A Context-Free Grammar $ G$ is in Chomsky Normal Form if each production has one of the following forms:

$ A\rightarrow BC$ where $ A,B,C$ are non-terminals

$ A\rightarrow a$ where $ A$ is a non-terminal and $ a$ is a terminal

and there are no useless non-terminals.
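The two production shapes in the definition can be checked mechanically. The sketch below is only an illustration: the dictionary representation of a grammar and the name `is_cnf_shape` are assumptions of this example, and the requirement that there are no useless non-terminals is not checked.

```python
def is_cnf_shape(productions):
    """Check that every production is A -> BC or A -> a.

    `productions` maps each non-terminal to a list of right-hand sides,
    each given as a tuple of symbols. A symbol counts as a non-terminal
    exactly when it is a key of the dictionary (assumption of this sketch).
    The "no useless non-terminals" requirement is NOT checked here.
    """
    nonterminals = set(productions)
    for bodies in productions.values():
        for body in bodies:
            two_nonterminals = len(body) == 2 and all(s in nonterminals for s in body)
            one_terminal = len(body) == 1 and body[0] not in nonterminals
            if not (two_nonterminals or one_terminal):
                return False
    return True


# S -> AB | a, A -> a, B -> b has the required shape
good = {"S": [("A", "B"), ("a",)], "A": [("a",)], "B": [("b",)]}
# S -> ABC (body too long) and B -> epsilon (empty body) do not
bad = {"S": [("A", "B", "C")], "A": [("a",)], "B": [()], "C": [("c",)]}
```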

We also prove that:

Proposition (CNF):

Let $ L$ be a Context-Free Language such that $ L\neq \emptyset$ and $ \epsilon\not\in L$ . Then there exists a grammar $ G$ in Chomsky Normal Form such that $ L(G)=L$ .

The Chomsky Normal Form is useful because it makes reasoning (proofs) about grammars much easier: fewer cases need to be considered. CNF is also the input format expected by several algorithms, such as bottom-up parsing.
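As an illustration of a bottom-up parsing algorithm that relies on CNF, here is a minimal sketch of the CYK membership test. The choice of CYK, the dictionary representation of the grammar and the sample grammar for $ \{a^nb^n \mid n\geq 1\}$ are assumptions of this example, not part of the text above.

```python
def cyk(productions, start, word):
    """Decide whether `word` is generated by a grammar in CNF (CYK algorithm).

    `productions` maps each non-terminal to a list of bodies (tuples of symbols).
    """
    n = len(word)
    if n == 0:
        return False  # a grammar in CNF, as defined above, never derives epsilon
    # table[i][l] = set of non-terminals deriving the factor word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, symbol in enumerate(word):
        for head, bodies in productions.items():
            if (symbol,) in bodies:
                table[i][1].add(head)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for head, bodies in productions.items():
                    for body in bodies:
                        if len(body) == 2 and body[0] in left and body[1] in right:
                            table[i][length].add(head)
    return start in table[0][n]


# Hypothetical CNF grammar for {a^n b^n | n >= 1}:
# S -> AB | AT, T -> SB, A -> a, B -> b
anbn = {
    "S": [("A", "B"), ("A", "T")],
    "T": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
```

Because every body has length one or two, each table cell only needs to combine two smaller cells, which is what keeps the algorithm this simple.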

We shall provide, without proof, a sequence of transformations which can turn any grammar $ G$ into one in CNF.

replace each production of the form $ A\rightarrow B_1 \ldots a\ldots B_n$ where $ B_i$ are non-terminals and $ a$ is a terminal, by two productions:
- $ A\rightarrow B_1\ldots C\ldots B_n$ and $ C\rightarrow a$
replace each production of the form $ A\rightarrow B_1 \ldots B_n$ with $ n>2$ by $ n-1$ productions, where $ C_1,\ldots,C_{n-2}$ are fresh non-terminals:
- $ A\rightarrow B_1 C_1$ , $ C_1 \rightarrow B_2 C_2$ , …, $ C_{n-2}\rightarrow B_{n-1}B_n$
recursively eliminate all productions of the form $ A\rightarrow \epsilon$ ; what is the correct algorithm for implementing this elimination?
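recursively eliminate all productions of the form $ A\rightarrow B$ , where $ B$ is a non-terminal: remove $ A\rightarrow B$ and, for each production $ B\rightarrow \alpha$ , add the production $ A\rightarrow \alpha$ .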
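The first two transformations can be sketched as follows. This is a minimal illustration under stated assumptions: grammars are dictionaries from non-terminals to lists of bodies (tuples of symbols), and the fresh names `T_a` and `C1`, `C2`, … are assumed not to clash with existing non-terminals. The elimination of $ \epsilon$ and unit productions is deliberately left out.

```python
from itertools import count


def replace_terminals(productions):
    """In bodies of length >= 2, replace each terminal a by a fresh
    non-terminal T_a and add the production T_a -> a."""
    nonterminals = set(productions)
    result = {head: [] for head in productions}
    fresh = {}  # terminal -> name of its fresh non-terminal
    for head, bodies in productions.items():
        for body in bodies:
            if len(body) < 2:
                result[head].append(body)
                continue
            new_body = []
            for symbol in body:
                if symbol in nonterminals:
                    new_body.append(symbol)
                else:
                    fresh.setdefault(symbol, f"T_{symbol}")
                    new_body.append(fresh[symbol])
            result[head].append(tuple(new_body))
    for terminal, name in fresh.items():
        result[name] = [(terminal,)]
    return result


def binarize(productions):
    """Split each A -> B1 ... Bn with n > 2 into a chain of n-1 binary productions."""
    result = {head: [] for head in productions}
    counter = count(1)
    for head, bodies in productions.items():
        for body in bodies:
            current = head
            while len(body) > 2:
                helper = f"C{next(counter)}"  # fresh helper non-terminal
                result.setdefault(current, []).append((body[0], helper))
                current, body = helper, body[1:]
            result.setdefault(current, []).append(body)
    return result


# S -> aSb | ab becomes S -> T_a C1 | T_a T_b, C1 -> S T_b, T_a -> a, T_b -> b
step1 = replace_terminals({"S": [("a", "S", "b"), ("a", "b")]})
step2 = binarize(step1)
```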

Remarks:

the order in which these transformations are applied may affect the correctness of the result. (Can you identify an example?)
the order may also affect the size of the CNF grammar, which could be (worst-case) exponential w.r.t. the size of the original grammar.

Exercise (CNF):

Transform the following grammar in CNF:

$ S\rightarrow ABC$

$ B\rightarrow \epsilon$

$ A\rightarrow aC\mid Ab\mid b$

$ C \rightarrow AB \mid B$

Pumping Lemma for Context-Free Languages

In order to state and prove the pumping lemma, we need to investigate the shape and size of the parse-trees for each word. Chomsky Normal Form turns each parse-tree (of a word) from a CF language into a binary tree. (Recall that productions have the form $ A\rightarrow BC$ , except for those producing leaves).

We exploit this shape in order to prove the pumping lemma, via the following observation:

Proposition (Parse tree):

Let $ G$ be a grammar in CNF and $ w\in L(G)$ . If the height (maximal length of a path from the root to a leaf) of the parse-tree of $ w$ is $ n$ , then $ \mid w \mid \leq 2^{n-1}$

Proof:

The proof is by induction over $ n$ .

Basis $ n=1$ .

If the height of the parse-tree is one, then the tree consists of a single production $ S\rightarrow a$ , so $ w$ contains a single symbol, hence $ \mid w\mid = 1 \leq 2^0$ .

Induction step

Suppose $ n+1$ is the height of a parse-tree for some $ w$ , with $ n\geq 1$ , and that the claim holds for all heights up to $ n$ . Since the height is larger than one and $ G$ is in CNF, the root of the tree must use a production $ S\rightarrow AB$ . Moreover, $ w=uv$ such that $ u$ is derived from $ A$ and $ v$ is derived from $ B$ . Also, the parse-trees rooted at $ A$ (resp. $ B$ ) have height at most $ n$ . By induction hypothesis, $ \mid u\mid \leq 2^{n-1}$ and $ \mid v\mid \leq 2^{n-1}$ . Thus, $ \mid uv \mid\leq 2^{n-1} + 2^{n-1} = 2^{n} = 2^{(n+1)-1}$ .

Pumping Lemma Statement

Let $ L$ be a CF language. Then, there exists a constant $ n\in\mathbb{N}$ such that every word $ z\in L$ with $ \mid z\mid \geq n$ can be written as $ z=uvwxy$ such that:

$ \mid vwx \mid \leq n$ - the middle portion is not too long
$ vx \neq \epsilon$ - at least one of the portions $ v$ or $ x$ to-be-pumped is not empty
for all $ i\geq 0$ , $ uv^iwx^iy\in L$
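To make the third condition concrete, the sketch below pumps one particular split of one particular word. The language $ \{a^nb^n \mid n\geq 1\}$ , the word `aabb` and its split are illustrative choices of this example, not taken from the statement.

```python
def pump(u, v, w, x, y, i):
    """Return the word u v^i w x^i y."""
    return u + v * i + w + x * i + y


def in_anbn(word):
    """Membership test for the CF language {a^n b^n | n >= 1} (illustrative choice)."""
    n = len(word) // 2
    return n >= 1 and word == "a" * n + "b" * n


# z = aabb, split as u = a, v = a, w = epsilon, x = b, y = b
pumped = [pump("a", "a", "", "b", "b", i) for i in range(4)]
```

For this split, every pumped word stays in the language, as the lemma requires for at least one split of each sufficiently long word.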

Proof:

We first deal with special cases:

$ L=\emptyset$ . The lemma trivially holds since the universal condition is over the empty set.

$ \epsilon \in L$ . We can always ignore $ \epsilon$ by looking at $ n$ larger than 0.

The rest of the proof therefore deals with the non-empty CF language $ L\setminus\{\epsilon\}$ .

Suppose $ G$ is a CNF grammar such that $ L(G) = L\setminus\{\epsilon\}$ , and let $ m$ designate the number of non-terminals. We fix $ n=2^{m}$ and take a word $ z\in L$ with $ \mid z\mid\geq n$ . If a word has a parse-tree of height at most $ m$ , then its length is at most $ 2^{m-1} = n/2 < n$ . Hence, each parse-tree for $ z$ must have height of at least $ m+1$ .

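Consider a longest path through some parse-tree of $ z$ , and let $ A_1, A_2, \ldots, A_m, A_{m+1}, a$ be its last $ m+1$ non-terminals, followed by the leaf $ a$ . Since there are only $ m$ non-terminals, there must be some $ A_i = A_j = A$ with $ i < j$ . Therefore, we can split $ z$ as follows:

fix $ w$ to be the word generated from $ A_j$

fix $ vwx$ to be the word generated from $ A_i$ (since $ i<j$ , the node $ A_j$ lies below $ A_i$ , so this word must include $ w$ ). Since no $ A\rightarrow \epsilon$ and no $ A\rightarrow B$ productions are allowed, $ v$ and $ x$ cannot be both $ \epsilon$ . Moreover, since the path is a longest one, the sub-tree rooted at $ A_i$ has height at most $ m+1$ , hence $ \mid vwx\mid \leq 2^{m} = n$ by the parse-tree proposition.

fix $ u$ and $ y$ to be the remaining prefix and suffix of $ z$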

Case $ i=0$ (here $ i$ is the pumping exponent, not a path index): $ uwy\in L$ . If $ S, \ldots, A_i, \ldots, A_j, \ldots$ is a path in the parse-tree of $ z=uvwxy$ and $ A_i=A_j$ , then by replacing the sub-tree rooted at $ A_i$ with the sub-tree rooted at $ A_j$ , we get a valid parse-tree which generates $ uwy$ , hence $ uwy\in L$ .

Case $ i>0$ : $ uv^iwx^iy \in L$ . We replace the sub-tree rooted at $ A_j$ by a copy of the sub-tree rooted at $ A_i$ (which also contains $ A_j$ ). In so doing, we generate the word $ uvvwxxy$ (for $ i=2$ ). We can repeat this process as many times as needed.

Using the Pumping Lemma to show a language is not CF

The Pumping Lemma for CFLs is usually deployed to show languages are not CF. The recipe for its usage is similar to its counterpart for Regular languages. We illustrate it on the following example:

Let $ L=\{a^n b^n c^n \mid n\geq 1\}$ .

for any $ n$ ,
we get to choose some $ z\in L$ such that $ \mid z \mid\geq n$ . We denote those word-choices by $ z_n$ to emphasize that they depend on $ n$ . For our language, let $ z_n=a^n b^n c^n$ (do not mistake $ n$ with the index used in the definition of our language).
for any possible split of $ z_n$ into $ uvwxy$ , such that $ \mid vwx\mid \leq n$ and $ vx\neq\epsilon$ ,
we must find some $ i\geq 0$ such that $ uv^iwx^iy \not\in L$ .

Since $ \mid vwx\mid \leq n$ , the portion $ vwx$ cannot contain both a's and c's, so we need to distinguish two cases:

$ vwx$ contains no c's. By fixing $ i=0$ , the word $ uwy$ will contain fewer than $ n$ a's or fewer than $ n$ b's (but exactly $ n$ c's). Hence $ uwy\not\in L$ .
$ vwx$ contains no a's. The same line of reasoning follows.
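The case analysis can be sanity-checked by brute force for small values of $ n$ . The sketch below enumerates every admissible split of $ a^nb^nc^n$ and checks that pumping with exponent 0 or 2 leaves $ L$ . The function names and the choice of exponents are assumptions of this example; such a finite check does not replace the proof.

```python
def in_language(word):
    """Membership in L = {a^n b^n c^n | n >= 1}."""
    n = len(word) // 3
    return n >= 1 and word == "a" * n + "b" * n + "c" * n


def splits(z, n):
    """Yield every split z = u v w x y with |vwx| <= n and vx != epsilon."""
    for start in range(len(z) + 1):
        for end in range(start, min(start + n, len(z)) + 1):
            for v_end in range(start, end + 1):
                for x_start in range(v_end, end + 1):
                    u, v = z[:start], z[start:v_end]
                    w, x, y = z[v_end:x_start], z[x_start:end], z[end:]
                    if v + x:  # at least one of v, x is non-empty
                        yield u, v, w, x, y


def every_split_fails(n, exponents=(0, 2)):
    """True if each admissible split of a^n b^n c^n leaves L for some exponent."""
    z = "a" * n + "b" * n + "c" * n
    return all(
        any(not in_language(u + v * i + w + x * i + y) for i in exponents)
        for u, v, w, x, y in splits(z, n)
    )


checked = [every_split_fails(n) for n in range(1, 5)]
```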

Closure properties

We start by formulating a substitution theorem which can be used to immediately prove most closure properties of Context-Free Languages.

A substitution over $ \Sigma$ is a mapping $ s:\Sigma\rightarrow 2^{\Sigma^*}$ which assigns a language to each symbol. As an example consider a substitution defined over $ \{0,1\}$ where:

$ s(0)=\{a^nb^n \mid n \geq 1\}$
$ s(1)=\{aa,bb\}$

Let $ w=c_1\ldots c_n$ be a word over $ \Sigma$ and $ s$ be a substitution over $ \Sigma$ . Then $ s(w)$ is the concatenation of languages $ s(c_1)s(c_2)\ldots s(c_n)$ .

For instance, $ s(01)$ from our example is the language $ \{a^nb^{n+2} \mid n\geq 1 \}\cup \{a^nb^naa \mid n \geq 1\}$ .
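The example can be reproduced on a finite fragment. In the sketch below, $ s(0)$ is truncated to $ 1\leq n\leq 3$ so that all sets stay finite. The truncation bound and the function names are assumptions of this example.

```python
def concat(first, second):
    """Concatenation of two finite languages."""
    return {p + q for p in first for q in second}


def substitute_word(s, word):
    """Compute s(word) = s(c_1) s(c_2) ... s(c_n) for finite languages."""
    result = {""}
    for symbol in word:
        result = concat(result, s[symbol])
    return result


BOUND = 3  # assumption: s(0) is truncated to 1 <= n <= BOUND
s = {
    "0": {"a" * n + "b" * n for n in range(1, BOUND + 1)},
    "1": {"aa", "bb"},
}
image = substitute_word(s, "01")
```

Every word of `image` is either of the form $ a^nb^{n+2}$ or of the form $ a^nb^naa$ , matching the two parts of the union above.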

Finally, we can extend the semantics of a substitution to languages. Hence $ s(L)$ is the union of the languages $ s(w)$ where $ w$ is a word from $ L$ :

$ s(L) = \bigcup_{w\in L} s(w)$

We prove the following:

Proposition (substitution):

Let $ L$ be a context-free language over $ \Sigma$ and $ s$ be a substitution over $ \Sigma$ such that, for all $ a$ , the language $ s(a)$ is context-free. Then $ s(L)$ is a context-free language.

Proof:

Let $ G=(V,\Sigma,R,S)$ be a CFG for $ L$ and $ G_a=(V_a,\Sigma,R_a,S_a)$ be a CFG for the language $ s(a)$ , for each $ a\in\Sigma$ .

We build a grammar $ G'=(V',\Sigma,R',S')$ for the language $ s(L)$ :

$ V'=V\cup\bigcup_{a\in\Sigma} V_a$ . We assume all non-terminals in each $ V_a$ and in $ V$ are distinct (otherwise, we need to rename them);

We build $ R'$ by taking all productions $ \bigcup_{a\in\Sigma} R_a$ , and also add all productions in $ R$ , which we modify as follows: each occurrence of symbol $ a$ is replaced by $ S_a$ .

$ S'=S$

We show $ L(G')\subseteq s(L)$ . Suppose $ w\in L(G')$ . Then, we have a parse-tree of $ w$ which includes some non-terminals $ S_a$ . Hence, $ w=u_1\ldots u_k$ where each portion $ u_i$ is derived from some $ S_{a_i}$ . Hence, each $ u_i \in L(G_{a_i}) = s(a_i)$ . Moreover, this means that, in $ G$ , we have a parse-tree for $ a_1\ldots a_k$ . Hence $ a_1\ldots a_k \in L$ . By definition of $ s$ , $ w\in s(a_1\ldots a_k)\subseteq s(L)$ .

We show $ s(L) \subseteq L(G')$ . Suppose $ w\in s(L)$ . Thus, $ w=u_1\ldots u_k$ such that $ a_1\ldots a_k\in L$ and each $ u_i\in s(a_i)$ , i.e. each $ u_i$ can be derived from the non-terminal $ S_{a_i}$ . Since $ a_1\ldots a_k$ can be derived from $ S$ in $ G$ , the word $ S_{a_1}\ldots S_{a_k}$ can be derived from $ S$ in $ G'$ , and from it we derive $ w$ , which ends the proof.
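The construction of $ G'$ used in the proof can be sketched in a few lines. The dictionary representation of grammars, the function name `substitution_grammar` and the sample grammars are assumptions of this example. Non-terminal names are assumed to be pairwise distinct, as in the proof.

```python
def substitution_grammar(grammar, start, sub_grammars):
    """Build the productions and start symbol of G' from G and the grammars G_a.

    `grammar` maps non-terminals to lists of bodies (tuples of symbols);
    `sub_grammars` maps each terminal a to a pair (productions of G_a, S_a).
    """
    combined = {}
    # productions of G, with each terminal a replaced by the start symbol S_a
    for head, bodies in grammar.items():
        combined[head] = [
            tuple(sub_grammars[sym][1] if sym in sub_grammars else sym for sym in body)
            for body in bodies
        ]
    # all productions of every G_a, unchanged
    for productions, _ in sub_grammars.values():
        for head, bodies in productions.items():
            combined[head] = list(bodies)
    return combined, start


# G: S -> 01, with s(0) generated by X -> aXb | ab and s(1) by Y -> aa | bb
g = {"S": [("0", "1")]}
subs = {
    "0": ({"X": [("a", "X", "b"), ("a", "b")]}, "X"),
    "1": ({"Y": [("a", "a"), ("b", "b")]}, "Y"),
}
g_prime, s_prime = substitution_grammar(g, "S", subs)
```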

Union

Proposition (union):

Let $ A,B$ be two CFLs. The language $ A\cup B$ is CF.

Proof:

Let $ s$ be the substitution with $ s(a)=A$ and $ s(b)=B$ , and consider the CFL $ \{a,b\}$ . By the substitution theorem, $ s(\{a,b\}) = A\cup B$ is CF.

Concatenation

Proposition (concatenation):

Let $ A,B$ be two CFLs. The language $ AB$ is CF.

Proof:

Let $ s$ be the substitution with $ s(a)=A$ and $ s(b)=B$ , and consider the CFL $ \{ab\}$ . By the substitution theorem, $ s(\{ab\}) = AB$ is CF.

Intersection

Context-Free languages are not closed under intersection.

Consider the languages $ L_1=\{a^nb^nc^i \mid n,i\geq 1\}$ and $ L_2=\{a^ib^nc^n \mid n,i \geq 1\}$ . Informally, $ L_1$ and $ L_2$ contain words from $ L(a^+b^+c^+)$ , only in $ L_1$ the sequences of a's and b's have the same length, while in $ L_2$ the sequences of b's and c's have the same length. Both languages are context-free.

We have already shown that $ L_1\cap L_2 = \{a^nb^nc^n \mid n\geq 1\}$ is not a context-free language.
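The equality $ L_1\cap L_2 = \{a^nb^nc^n \mid n\geq 1\}$ can be checked on all short words. The sketch below does so by enumeration. The length bound `MAX_LEN` and the helper names are assumptions of this example.

```python
import re
from itertools import product


def in_l1(word):
    """L1 = {a^n b^n c^i | n, i >= 1}."""
    m = re.fullmatch(r"(a+)(b+)(c+)", word)
    return bool(m) and len(m.group(1)) == len(m.group(2))


def in_l2(word):
    """L2 = {a^i b^n c^n | n, i >= 1}."""
    m = re.fullmatch(r"(a+)(b+)(c+)", word)
    return bool(m) and len(m.group(2)) == len(m.group(3))


MAX_LEN = 9  # assumption: only words of length <= MAX_LEN are inspected
words = ("".join(t) for k in range(MAX_LEN + 1) for t in product("abc", repeat=k))
both = {w for w in words if in_l1(w) and in_l2(w)}
```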

However:

Proposition (intersection with a regular language):

If $ A$ is CF and $ B$ is regular, then $ A\cap B$ is CF.

Proof:

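The proof follows a product construction similar to that for proving closure under intersection for regular languages: a push-down automaton for $ A$ is run in parallel with a DFA for $ B$ , and the resulting product is again a push-down automaton.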

Complement

CFLs are not closed under complement. Suppose they were. Then, since CFLs are closed under union, de Morgan's laws would show that $ A\cap B= \overline{\overline{A}\cup\overline{B}}$ is context-free for all CFLs $ A$ and $ B$ , which contradicts the fact that CFLs are not closed under intersection.

Difference

CFLs are not closed under difference. Consider $ L_1=\{a^nb^ic^n\mid n,i\geq 1\}$ , $ L_2=\{a^nb^mc^i\mid n,i\geq 1, m>n\}$ and $ L_3=\{a^nb^mc^i\mid n,i\geq 1, m<n\}$ . In $ L_2$ the sequence of b's is longer than that of a's, while in $ L_3$ the sequence of b's is shorter than that of a's. All three languages are CF, hence $ L_2 \cup L_3$ is CF. However, the language $ L_1\setminus (L_2 \cup L_3) = \{a^nb^nc^n\mid n \geq 1\}$ is not CF.
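The set difference in this argument can be checked on small exponents. The sketch below enumerates words $ a^nb^mc^i$ with exponents up to a bound and keeps those in $ L_1$ but in neither $ L_2$ nor $ L_3$ . The bound and the helper names are assumptions of this example.

```python
import re


def exponents(word):
    """Return (n, m, i) if word = a^n b^m c^i with n, m, i >= 1, else None."""
    match = re.fullmatch(r"(a+)(b+)(c+)", word)
    return tuple(len(group) for group in match.groups()) if match else None


def in_l1(word):
    e = exponents(word)
    return e is not None and e[0] == e[2]  # as many a's as c's


def in_l2(word):
    e = exponents(word)
    return e is not None and e[1] > e[0]  # more b's than a's


def in_l3(word):
    e = exponents(word)
    return e is not None and e[1] < e[0]  # fewer b's than a's


BOUND = 4  # assumption: exponents range over 1..BOUND
candidates = [
    "a" * n + "b" * m + "c" * i
    for n in range(1, BOUND + 1)
    for m in range(1, BOUND + 1)
    for i in range(1, BOUND + 1)
]
difference = [w for w in candidates if in_l1(w) and not (in_l2(w) or in_l3(w))]
```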

However:

Proposition (difference with a regular language):

If $ A$ is CF and $ B$ is regular, then $ A\setminus B$ is CF.

Proof:

$ A\setminus B = A \cap \overline{B}$ , which is CF by the previous proposition on intersection with a regular language, since $ \overline{B}$ is regular.

Closure

Proposition (closure):

Let $ A$ be a CF language. The language $ A^*$ is CF.

Proof:

Let $ s$ be the substitution with $ s(a)=A$ , and consider the CFL $ \{a\}^*$ . By the substitution theorem, $ s(\{a\}^*) = A^*$ is CF.
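Unfolding the substitution construction for the three languages $ \{a,b\}$ , $ \{ab\}$ and $ \{a\}^*$ gives direct grammar constructions for union, concatenation and closure. The sketch below spells them out. The dictionary representation, the function names and the fresh start symbols are assumptions of this example, and non-terminal names are assumed not to clash.

```python
def union_grammar(g1, s1, g2, s2, start="S_union"):
    """Fresh start symbol with S -> S1 | S2."""
    return {start: [(s1,), (s2,)], **g1, **g2}, start


def concatenation_grammar(g1, s1, g2, s2, start="S_concat"):
    """Fresh start symbol with S -> S1 S2."""
    return {start: [(s1, s2)], **g1, **g2}, start


def star_grammar(g, s, start="S_star"):
    """Fresh start symbol with S -> S S1 | epsilon (empty tuple = epsilon)."""
    return {start: [(start, s), ()], **g}, start


# Two small sample grammars: X -> aXb | ab and Y -> cY | c
g_a = {"X": [("a", "X", "b"), ("a", "b")]}
g_b = {"Y": [("c", "Y"), ("c",)]}

union, union_start = union_grammar(g_a, "X", g_b, "Y")
concatenation, concat_start = concatenation_grammar(g_a, "X", g_b, "Y")
star, star_start = star_grammar(g_a, "X")
```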

Reversal

Proposition (reversal):

Let $ A$ be a CF language. The language $ A^R$ is CF.
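The proposition is stated here without proof. A standard argument, not taken from the text above, reverses the right-hand side of every production of a grammar for $ A$ . The sketch below illustrates this. The dictionary representation and the name `reverse_grammar` are assumptions of this example.

```python
def reverse_grammar(productions):
    """Reverse the body of every production; the start symbol is unchanged."""
    return {
        head: [tuple(reversed(body)) for body in bodies]
        for head, bodies in productions.items()
    }


# S -> aSb | ab generates {a^n b^n | n >= 1};
# the reversed grammar S -> bSa | ba generates {b^n a^n | n >= 1}
anbn = {"S": [("a", "S", "b"), ("a", "b")]}
reversed_anbn = reverse_grammar(anbn)
```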

Homomorphism and inverse homomorphism

CFLs are closed under both homomorphism and inverse homomorphism.

The first claim can be proved by building a substitution $ s$ from a homomorphism $ h$ , as follows: $ s(a)=\{h(a)\}$ (i.e. the language containing a single word). Then, we have $ h(L) = s(L)$ .
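The equality $ h(L) = s(L)$ can be illustrated on a finite language. In the sketch below, the homomorphism `h`, the sample language and the function names are assumptions chosen for the example.

```python
def apply_homomorphism(h, word):
    """Apply h symbol by symbol and concatenate the images."""
    return "".join(h[c] for c in word)


def substitute_word(s, word):
    """Compute s(word) as a concatenation of finite languages."""
    result = {""}
    for symbol in word:
        result = {p + q for p in result for q in s[symbol]}
    return result


h = {"0": "ab", "1": ""}  # hypothetical homomorphism over {0, 1}
s = {symbol: {image} for symbol, image in h.items()}  # s(a) = {h(a)}

language = {"01", "0011", "000111"}
h_image = {apply_homomorphism(h, w) for w in language}
s_image = set().union(*(substitute_word(s, w) for w in language))
```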
