Pumping Lemma

Consider the language defined via BNF (Backus-Naur Form) as follows:

  • <expr> ::= <expr> + <expr> | (<expr> + <expr>) | <atom>

which describes simple arithmetic expressions with parentheses. Let us ignore the construction rules for atoms (let $ M_a$ be a DFA with a unique final state which accepts valid atoms).

Consider the following NFA built (with respect to $ M_a$ ) to accept words of the above language:

One can easily observe that the above automaton can only accept words of the following forms:

  • <atom>
  • <atom>+<atom>+ … +<atom>
  • <atom>+<atom>+ … +<atom>+(<atom>+<atom>+ … +<atom>)

There are several possible fixes to our construction:

  • add transitions to accommodate expressions of the form: (<sequence> | <par_sequence>) (+ (<sequence> | <par_sequence>))* (here parentheses and * should be interpreted as with regular expressions), where <sequence> ::= <atom> | <atom> + <sequence> and <par_sequence> ::= ( <sequence> )
  • add new states and transitions which allow defining nested parentheses…

However, no finite number of added states (and transitions) can accommodate an arbitrary number of parenthesis nestings (e.g. 1 + (2 + … ))))). Each automaton which we know how to build can only handle a bounded number of nestings, hence it can only describe arithmetic expressions with parenthesis nesting of up to some $ k\in\mathbb{N}$ .

This happens because our language is not regular.

The pumping lemma captures one particular trait of Regular languages:

  • words of an (infinite) Regular language must exhibit a specific form of regularity - a 'repeating pattern' which is simple-enough that it does not require counting

The previous language is a good counter-example: while arithmetic expressions do capture a 'repeating pattern' - identifying it requires counting: we must keep track of the number $ k$ of open parentheses and make sure that precisely $ k$ parentheses are eventually closed.

Put more formally, the repeating pattern found in words of a Regular Language looks as follows.

Let $ L$ be a regular language. Then there exists $ n$ (dependent on $ L$ ) such that every word $ w\in L$ with $ \mid w \mid \geq n$ has the following form:

  • $ w = xyz$ where
  • $ \mid xy \mid \leq n$
  • $ y \neq \epsilon$
  • for all $ k \geq 0$ , we have $ xy^kz \in L$


Informally, the Pumping lemma tells us that all 'large-enough' words of a regular language must have the form shown in the following image:

  • a prefix $ x$ ;
  • a non-empty word $ y$ which may repeat zero-or-more times
  • a suffix $ z$ , which may be any finite (possibly empty) word.

Note that arithmetic expressions cannot be broken-down in this way.

What does 'large-enough' mean?

Suppose $ L$ is a finite language. Then there exists a DFA accepting $ L$ which has the structure of a tree: each branch of the tree ending in a leaf describes one possible word of the language. The lemma trivially holds here, since $ n$ can be any number larger than the length of the longest word: no word of $ L$ satisfies $ \mid w \mid \geq n$ , so there is nothing to check.

Suppose $ M$ is a DFA which accepts $ L$ . 'Large-enough' words are those with $ \mid w \mid \geq \mid K\mid$ , where $ \mid K\mid$ is the number of states of $ M$ . For such words, the accepting path through $ M$ must visit some state at least twice.
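The repeated state directly yields the decomposition $ w = xyz$ : $ x$ leads to the first visit of that state, $ y$ labels the loop back to it, and $ z$ is the rest of the word. The Python sketch below illustrates this on a small, hypothetical DFA (states `even` and `odd`, accepting words with an even number of 1s); the helper names `accepts` and `pumping_split` are illustrative assumptions and do not come from the text above.

```python
def accepts(delta, start, finals, word):
    """Run the DFA on `word` and report whether it ends in a final state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


def pumping_split(delta, start, word):
    """Return (x, y, z) with word == x + y + z, where y labels a loop."""
    state = start
    first_seen = {state: 0}  # state -> number of symbols read when first reached
    for pos, symbol in enumerate(word, start=1):
        state = delta[(state, symbol)]
        if state in first_seen:
            i = first_seen[state]
            return word[:i], word[i:pos], word[pos:]
        first_seen[state] = pos
    return None  # no state repeats: the word is shorter than the number of states


# Hypothetical DFA over {0, 1} accepting words with an even number of 1s.
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}
x, y, z = pumping_split(delta, "even", "1001")
pumped = [x + y * k + z for k in range(4)]
print(x, y, z, all(accepts(delta, "even", {"even"}, w) for w in pumped))
```

For the word 1001 the split is x = 1, y = 0, z = 01. Because the first repetition must occur within the first $ \mid K\mid$ symbols, the split always satisfies $ \mid xy\mid \leq \mid K\mid$ .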

The Pumping Lemma is used as a technical instrument for proving that a language $ L$ is not regular. The proof scheme is as follows:

  • suppose $ L$ is regular. Hence the Pumping lemma must hold.
  • identify a word-construction which violates the Pumping Lemma, yielding a contradiction.

Consider the language $ L = \{0^i1^i \mid i > 1\}$ consisting of a sequence of zeros followed by the same number of ones. Suppose the language is regular.

  1. The pumping lemma tells us that there is some $ n$ such that all large-enough words of $ L$ have the aforementioned properties;
    • To violate the Pumping lemma, we must show that for any $ n$ , some large-enough word of $ L$ lacks these properties.
  2. The pumping lemma tells us that for any word $ w\in L$ such that $ \mid w \mid \geq n$ , certain properties hold;
    • To violate the pumping lemma, let us choose $ w_n$ (with respect to any $ n$ ) to be $ 0^n1^n$ (we may assume $ n\geq 2$ , so that $ w_n\in L$ ).
  3. The pumping lemma tells us that any large-enough word, including $ 0^n1^n$ , can be split into $ xyz$ where $ y\neq \epsilon$ and $ \mid xy\mid \leq n$ . Thus, even if we do not know $ x,y,z$ , we can argue that: $ x=0^{i}$ , $ y=0^{j}$ with $ j\neq 0$ and $ i+j\leq n$ , and $ z=0^{n-i-j}1^n$ .
  4. Finally, the pumping lemma tells us that any word $ xy^kz$ , with $ k\geq 0$ , is also in $ L$ .
    • To violate it, choose $ k=0$ . The word $ xy^0z=xz=0^{i}0^{n-i-j}1^n=0^{n-j}1^{n}$ has strictly fewer zeros than ones (since $ j\neq 0$ ). The pumping lemma tells us that this word should be in $ L$ ; however, it is not. Contradiction. The language $ L$ cannot be regular.
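The argument can be checked by brute force for small $ n$ . The Python sketch below is only an illustration, not a proof: it enumerates every split $ 0^n1^n = xyz$ with $ \mid xy\mid \leq n$ and $ y\neq\epsilon$ and confirms that $ xz$ falls outside $ L$ . The function names `in_L`, `splits` and `every_split_fails` are assumptions introduced here.

```python
def in_L(word):
    """Membership in L = { 0^i 1^i | i > 1 }."""
    i = len(word) // 2
    return i > 1 and word == "0" * i + "1" * i


def splits(word, n):
    """All decompositions word = xyz with |xy| <= n and y non-empty."""
    for end in range(1, min(n, len(word)) + 1):
        for begin in range(end):
            yield word[:begin], word[begin:end], word[end:]


def every_split_fails(n, k=0):
    """True if pumping with exponent k leaves L for every admissible split."""
    word = "0" * n + "1" * n
    return all(not in_L(x + y * k + z) for x, y, z in splits(word, n))


# Check the claim for a few small values of n (n >= 2 so that the word is in L).
results = {n: every_split_fails(n) for n in range(2, 9)}
print(results)
```

Every value in `results` is `True`, matching the case analysis: removing $ y=0^j$ always leaves fewer zeros than ones.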

The Pumping Lemma recipe

To use the Pumping lemma in order to prove that a language $ L$ is not regular:

  • for any $ n$ , select a word pattern $ w_n$ such that $ w_n \in L$ and $ \mid w_n\mid \geq n$
  • for any 'break-down' of $ w_n$ into $ xyz$ such that $ \mid xy\mid \leq n$ and $ y\neq \epsilon$ , select a value $ k$ such that $ xy^kz$ is not a member of $ L$ .

Thus, the Pumping lemma is contradicted.

We now apply the recipe to the language of arithmetic expressions. For simplicity, assume atoms can only be the one-letter word a, hence the alphabet of the language is $ \{a,+,(,)\}$ .

  • For any $ n$ , fix the word $ w_n=$ (( … (a+a)+a) … +a), which consists of $ n$ open parentheses, the atom a, and then $ n$ copies of +a). Such a word contains $ n+1$ atoms, $ n$ addition symbols, and $ n$ pairs of parentheses, hence $ \mid w_n \mid = 4n+1$ . A word which starts with an atom, such as a+(a+ … (a+a) … ), would not work: its prefix a+ can be removed or repeated without leaving the language.
  • In any 'break-down' $ w_n = xyz$ , the first $ n$ symbols of $ w_n$ are open parentheses. Since $ \mid xy\mid \leq n$ and $ y$ is nonempty, $ y$ consists of $ j\geq 1$ open parentheses. Fix $ k=0$ . In the word $ xy^0z = xz$ , there are $ j$ missing open parentheses:
    • the word $ xz$ has $ n-j$ open parentheses, but still $ n$ closed ones.
  • in any case, $ xz$ is unbalanced, so it cannot be an arithmetic expression.
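A brute-force check makes this kind of argument concrete. The Python sketch below assumes that atoms are the single letter a, uses a hand-written recognizer `is_expr` for the grammar (a hypothetical helper, not part of the construction above), and tests the word with $ n$ leading open parentheses, (( … (a+a)+a) … +a): for every admissible split, the word $ xz$ is rejected.

```python
def is_expr(word):
    """Recognise <expr> ::= <expr> + <expr> | (<expr> + <expr>) | a."""
    pos = 0

    def term():
        # A term is an atom or a parenthesised sum of at least two terms.
        nonlocal pos
        if pos < len(word) and word[pos] == "a":
            pos += 1
            return True
        if pos < len(word) and word[pos] == "(":
            pos += 1
            if terms() < 2:
                return False
            if pos < len(word) and word[pos] == ")":
                pos += 1
                return True
        return False

    def terms():
        # Parse term (+ term)*; return the number of terms, or 0 on failure.
        nonlocal pos
        if not term():
            return 0
        count = 1
        while pos < len(word) and word[pos] == "+":
            pos += 1
            if not term():
                return 0
            count += 1
        return count

    return terms() >= 1 and pos == len(word)


def splits(word, n):
    """All decompositions word = xyz with |xy| <= n and y non-empty."""
    for end in range(1, min(n, len(word)) + 1):
        for begin in range(end):
            yield word[:begin], word[begin:end], word[end:]


def nested(n):
    """The word ((...(a+a)+a)...+a) with n leading open parentheses."""
    return "(" * n + "a" + "+a)" * n


def every_split_fails(n, k=0):
    """True if pumping with exponent k leaves the language for every split."""
    return all(not is_expr(x + y * k + z) for x, y, z in splits(nested(n), n))


checks = {n: is_expr(nested(n)) and every_split_fails(n) for n in range(1, 8)}
print(checks)
```

The recognizer relies on the observation that the grammar generates sums of terms, where a term is either an atom or a parenthesised sum with at least one +; this is why (a) is rejected.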

Closure properties of regular languages

Union

Proposition (union):

Let $ A,B$ be two regular languages. The language $ A\cup B$ is regular.

Proof:

Let $ E_A$ and $ E_B$ be the regular expressions generating $ A$ and $ B$ respectively. The regular expression $ E_A\cup E_B$ generates the language $ A\cup B$ .

Concatenation

Proposition (concatenation):

Let $ A,B$ be two regular languages. The language $ AB$ is regular.

The proof follows the same idea as that of union.

Complement

Proposition (complement):

Let $ A$ be a regular language. The language $ \overline{A}=\Sigma^* \setminus A$ is regular.

Proof:

Let $ M_A = (K,\Sigma,\delta,q_0,F)$ be a DFA which accepts $ A$ . We build the DFA $ \overline{M_A} = (K,\Sigma,\delta,q_0,K\setminus F)$ . $ \overline{M_A}$ only differs from $ M_A$ in the accepting (or final) states: each final state in $ M_A$ is non-final in $ \overline{M_A}$ and vice-versa. It follows immediately that, for all words $ w$ , $ w$ is accepted by $ M_A$ iff $ w$ is not accepted by $ \overline{M_A}$ . Thus, $ \overline{M_A}$ accepts exactly the words not in $ A$ , i.e. the language $ \overline{A}$ .
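The construction amounts to a single change in the tuple, as the following Python sketch shows. It assumes a total transition function (every state has a transition on every symbol), which is what makes the swap of final states correct; the `DFA` type and the example automaton (words over {0, 1} that end in 1) are illustrative assumptions.

```python
from itertools import product
from typing import NamedTuple


class DFA(NamedTuple):
    states: frozenset
    alphabet: frozenset
    delta: dict        # (state, symbol) -> state, defined for every pair
    start: str
    finals: frozenset

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.finals


def complement(m):
    """Same automaton, with final and non-final states swapped."""
    return m._replace(finals=m.states - m.finals)


# Hypothetical example: words over {0, 1} that end in 1.
m = DFA(
    states=frozenset({"q0", "q1"}),
    alphabet=frozenset("01"),
    delta={("q0", "0"): "q0", ("q0", "1"): "q1",
           ("q1", "0"): "q0", ("q1", "1"): "q1"},
    start="q0",
    finals=frozenset({"q1"}),
)

# On every word of length < 5, exactly one of the two automata accepts.
words = ["".join(p) for n in range(5) for p in product("01", repeat=n)]
disagree_everywhere = all(m.accepts(w) != complement(m).accepts(w) for w in words)
print(disagree_everywhere)
```

The same swap applied to an NFA does not, in general, yield the complement, which is why the proof starts from a DFA.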

Intersection

Proposition (intersection):

Let $ A,B$ be two regular languages. The language $ A\cap B$ is regular.

Proof:

The language $ A\cap B$ can be defined as $ \overline{\overline{A}\cup\overline{B}}$ . By union and complement closure, we have that $ \overline{A}$ , $ \overline{B}$ , $ \overline{A}\cup\overline{B}$ and finally $ \overline{\overline{A}\cup\overline{B}}$ are regular languages.

An alternative and more useful proof is to construct, starting from DFAs $ M_A=(K_A,\Sigma,\delta_A,q_A,F_A)$ and $ M_B=(K_B,\Sigma,\delta_B,q_B,F_B)$ which accept languages $ A$ and $ B$ , respectively, a DFA for $ A\cap B$ . The construction is called the product automaton (written $ M_A\times M_B$ ), and is as follows:

  • the set of states is $ K_A\times K_B$ - each state in the product automaton is a pair of states;
  • the initial state is $ (q_A,q_B)$ ;
  • the transition function $ \delta_{A\times B}$ is defined as: $ \delta_{A\times B}((q_x,q_y),c) = (\delta_A(q_x,c),\delta_B(q_y,c))$ .
  • the set of final states is $ F_A \times F_B$

It is easy to prove by induction on the length of $ w$ that $ (q_A,w)\vdash_{M_A}^*(p,\epsilon)$ and $ (q_B,w)\vdash_{M_B}^*(r,\epsilon)$ hold iff, in the product automaton, $ ((q_A,q_B),w)\vdash^*_{M_{A\times B}}((p,r),\epsilon)$ . Since $ (p,r)$ is final exactly when $ p\in F_A$ and $ r\in F_B$ , the product automaton accepts $ w$ iff both $ M_A$ and $ M_B$ accept it.
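The product construction translates almost literally into code. The Python sketch below represents a DFA as a tuple `(states, delta, start, finals)` with a total transition dictionary; this representation, the function names, and the two example automata (even number of 0s, and words ending in 1) are assumptions made for illustration.

```python
from itertools import product


def accepts(dfa, word):
    """Run a DFA given as (states, delta, start, finals) on `word`."""
    _, delta, state, finals = dfa
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


def product_dfa(dfa_a, dfa_b, alphabet):
    """Build the product automaton accepting the intersection."""
    states_a, delta_a, start_a, finals_a = dfa_a
    states_b, delta_b, start_b, finals_b = dfa_b
    states = set(product(states_a, states_b))
    # Both components move in lockstep on the same input symbol.
    delta = {((p, r), c): (delta_a[(p, c)], delta_b[(r, c)])
             for (p, r) in states for c in alphabet}
    finals = set(product(finals_a, finals_b))
    return states, delta, (start_a, start_b), finals


# Hypothetical examples over {0, 1}.
even_zeros = ({"e", "o"},
              {("e", "0"): "o", ("e", "1"): "e",
               ("o", "0"): "e", ("o", "1"): "o"},
              "e", {"e"})
ends_in_one = ({"n", "y"},
               {("n", "0"): "n", ("n", "1"): "y",
                ("y", "0"): "n", ("y", "1"): "y"},
               "n", {"y"})

intersection = product_dfa(even_zeros, ends_in_one, "01")

# Compare against the two component automata on all words of length < 6.
words = ["".join(p) for n in range(6) for p in product("01", repeat=n)]
agrees = all(
    accepts(intersection, w) == (accepts(even_zeros, w) and accepts(ends_in_one, w))
    for w in words
)
print(len(intersection[0]), agrees)
```

The sketch builds all $ \mid K_A\mid \cdot \mid K_B\mid$ pairs for simplicity; a practical implementation would typically generate only the pairs reachable from $ (q_A,q_B)$ .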

Difference

Proposition (difference):

Let $ A,B$ be two regular languages. The language $ A\setminus B$ is regular.

Proof:

The language $ A\setminus B$ can be defined as $ A\cap\overline{B}$ , which is regular by the intersection and complement properties.

Closure

Proposition (closure):

Let $ A$ be a regular language. The language $ A^*$ is regular.

The proof is similar to that for union and concatenation.

Reversal

Proposition (reversal):

Let $ A$ be a regular language. The language $ A^R$ is regular.