Pumping Lemma
Motivation
Consider the language defined via BNF (Backus-Naur Form) as follows:
<expr> ::= <expr> + <expr> | (<expr> + <expr>) | <atom>
which describes simple arithmetic expressions with parentheses. Let us ignore the construction rules for atoms (let $ M_a$ be a DFA with a unique final state which accepts valid atoms).
Consider the following NFA built (with respect to $ M_a$ ) to accept words of the above language:
One can easily observe that the above automaton can only accept words of the following forms:
<atom>
<atom>+<atom>+ … +<atom>
<atom>+<atom>+ … +<atom>+(<atom>+<atom>+ … +<atom>)
There are several possible fixes to our construction:
- add transitions to accommodate expressions of the form:
(<sequence> | <par_sequence>) (+ (<sequence> | <par_sequence>))*
(here parentheses and * should be interpreted as in regular expressions), where
<sequence> ::= <atom> | <atom> + <sequence>
and
<par_sequence> ::= ( <sequence> )
- add new states and transitions which allow defining nested parentheses…
However, it is not possible to add a finite number of states (and transitions) which can accommodate an arbitrary finite number of parenthesis nestings, as in 1 + (2 + … )))). Each automaton that we know how to build can only handle nestings up to a fixed depth, hence it can only describe arithmetic expressions with parenthesis nesting of up to some $ k\in\mathbb{N}$ .
This happens because our language is not regular.
The pumping lemma
The pumping lemma captures one particular trait of Regular languages:
- words of an (infinite) Regular language must exhibit a specific form of regularity - a 'repeating pattern' which is simple-enough that it does not require counting
The previous language is a good counter-example: while arithmetic expressions do capture a 'repeating pattern' - identifying it requires counting: we must keep track of the number $ k$ of open parentheses and make sure that precisely $ k$ parentheses are eventually closed.
Put more formally, the repeating pattern found in words of a Regular Language looks as follows.
Let $ L$ be a regular language. Then there exists $ n$ (dependent on $ L$ ) such that every word $ w\in L$ of length at least $ n$ has the following form:
- $ w = xyz$ where
- $ \mid xy \mid \leq n$
- $ y \neq \epsilon$
- for all $ k \geq 0$ , we have $ xy^kz \in L$
Informally, the Pumping lemma tells us that all 'large-enough' words of a regular language must have the form shown in the following image:
- a prefix $ x$ ;
- a non-empty word $ y$ which may repeat zero-or-more times
- a suffix $ z$ (both $ x$ and $ z$ may be empty).
Note that arithmetic expressions cannot be broken-down in this way.
What does 'large-enough' mean?
Suppose $ L$ is a finite language. Then there exists a DFA accepting $ L$ which has the structure of a tree: each branch of the tree ending in a leaf describes one possible word of the language. The lemma trivially holds here, since $ n$ can be chosen larger than the length of the longest word: then there is no word satisfying $ \mid w \mid \geq n$ .
Suppose $ M$ is a DFA with set of states $ K$ which accepts $ L$ . 'Large-enough' words are those with $ \mid w \mid \geq \mid K\mid$ , i.e. words at least as long as the number of states of $ M$ . For such words, the accepting path through $ M$ must visit some state at least twice.
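The repeated state is exactly what produces the split $ w = xyz$ . The following is a minimal Python sketch of that idea, not a definitive implementation: the DFA (counting the symbol 1 modulo 3), the names run and pump_split, and the dictionary encoding of the transition function are all assumptions made for illustration.

```python
def run(delta, start, word):
    """Return the list of states visited while reading word (len(word) + 1 entries)."""
    states = [start]
    for symbol in word:
        states.append(delta[(states[-1], symbol)])
    return states


def pump_split(delta, start, word):
    """Split word as (x, y, z) at the first state visited twice, or None if no state repeats."""
    first_seen = {}
    for pos, state in enumerate(run(delta, start, word)):
        if state in first_seen:
            i = first_seen[state]
            # x leads to the repeated state, y loops on it, z is the rest.
            return word[:i], word[i:pos], word[pos:]
        first_seen[state] = pos
    return None


# Hypothetical example DFA: accepts words over {0, 1} whose number of 1s is a multiple of 3.
STATES = [0, 1, 2]
DELTA = {(q, "1"): (q + 1) % 3 for q in STATES}
DELTA.update({(q, "0"): q for q in STATES})
START, FINAL = 0, {0}


def accepts(word):
    return run(DELTA, START, word)[-1] in FINAL


x, y, z = pump_split(DELTA, START, "10110")
# Repeating the loop y any number of times (including zero) keeps the word accepted.
pumped_ok = all(accepts(x + y * k + z) for k in range(6))
```

Because the first repetition must occur within the first $ \mid K\mid + 1$ visited states, the split found this way satisfies $ \mid xy\mid \leq \mid K\mid$ .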
The pumping lemma in action
The Pumping Lemma is used as a technical instrument for proving that a language $ L$ is not regular. The proof scheme is as follows:
- suppose $ L$ is regular. Hence the Pumping lemma must hold.
- identify a word-construction which violates the Pumping Lemma, yielding a contradiction.
Consider the language $ L = \{0^i1^i \mid i > 1\}$ consisting of sequences of zeros followed by the same number of ones. Suppose the language is regular.
- The pumping lemma tells us that for some large-enough $ n$ , we can find words with the aforementioned properties in $ L$ ;
- To violate the Pumping lemma, we must show that for any $ n$ , words with the required properties cannot exist in $ L$
- The pumping lemma tells us that for any word $ w\in L$ such that $ \mid w \mid \geq n$ , certain properties hold;
- To violate the pumping lemma, let us choose $ w_n$ (with respect to any $ n$ ) to be $ 0^n1^n$ .
- The pumping lemma tells us that any large-enough word, including $ 0^n1^n$ , can be split into $ xyz$ where $ y\neq \epsilon$ and $ \mid xy\mid \leq n$ . Thus, even if we do not know $ x,y,z$ , we can argue that $ xy$ lies within the first $ n$ symbols, which are all zeros: $ x=0^{i}$ , $ y=0^{j}$ with $ j\neq 0$ and $ i+j\leq n$ , and $ z=0^{n-i-j}1^n$ .
- Finally, the pumping lemma tells us that any word $ xy^kz$ , with $ k\geq 0$ , is also in $ L$ .
- To violate it, choose $ k=0$ . The word $ xy^0z=xz=0^{i}0^{n-i-j}1^n=0^{n-j}1^{n}$ has strictly fewer zeros than ones (since $ j\neq 0$ ). The pumping lemma tells us that this word should be in $ L$ ; however, it obviously is not. Contradiction. The language $ L$ cannot be regular.
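The argument above can be sanity-checked by brute force for small values of $ n$ . The sketch below is only an illustration (a finite check is not a proof); the helper names in_L, splits and every_split_fails_at_zero are hypothetical, and $ n$ is taken to be at least 2 so that $ 0^n1^n$ satisfies the condition $ i > 1$ .

```python
def in_L(w):
    """Membership in L = { 0^i 1^i | i > 1 }."""
    i = len(w) // 2
    return i > 1 and w == "0" * i + "1" * i


def splits(w, n):
    """Yield every (x, y, z) with w = xyz, |xy| <= n and y non-empty."""
    for end in range(1, min(n, len(w)) + 1):   # end = |xy|
        for begin in range(end):               # begin = |x|
            yield w[:begin], w[begin:end], w[end:]


def every_split_fails_at_zero(n):
    """True if w_n = 0^n 1^n is in L and xz is outside L for every admissible split."""
    w = "0" * n + "1" * n
    return in_L(w) and all(not in_L(x + z) for x, y, z in splits(w, n))
```

Calling every_split_fails_at_zero(n) for a few values of $ n \geq 2$ enumerates all admissible break-downs and confirms that $ k=0$ works against each of them.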
The Pumping Lemma recipe
To use the Pumping lemma in order to prove that a language $ L$ is not regular:
- for any $ n$ , select a word pattern $ w_n$ such that $ w_n \in L$ and $ \mid w_n\mid \geq n$
- for any 'break-down' of $ w_n$ into $ xyz$ such that $ \mid xy\mid \leq n$ and $ y\neq \epsilon$ , select a value $ k$ such that $ xy^kz$ is not a member of $ L$ .
Thus, the Pumping lemma is contradicted.
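The recipe can be read as a game: we pick $ w_n$ , an adversary picks the break-down, and we answer with a $ k$ that may depend on that break-down. The Python sketch below checks one round of this game for a fixed $ n$ by enumeration. It is a sanity check under stated assumptions, not a proof: the function names, the bound max_k, and the example language $ \{0^i1^j \mid i < j\}$ are all hypothetical choices made for illustration.

```python
def splits(w, n):
    """Yield every (x, y, z) with w = xyz, |xy| <= n and y non-empty."""
    for end in range(1, min(n, len(w)) + 1):   # end = |xy|
        for begin in range(end):               # begin = |x|
            yield w[:begin], w[begin:end], w[end:]


def refutes(in_lang, witness, n, max_k=3):
    """True if every admissible split of witness(n) has some k <= max_k with x y^k z outside the language."""
    w = witness(n)
    if not in_lang(w) or len(w) < n:
        return False   # the witness must be a large-enough word of the language
    return all(
        any(not in_lang(x + y * k + z) for k in range(max_k + 1))
        for x, y, z in splits(w, n)
    )


# Hypothetical example language: { 0^i 1^j | i < j }.
def fewer_zeros(w):
    i = len(w) - len(w.lstrip("0"))            # number of leading zeros
    return w == "0" * i + "1" * (len(w) - i) and i < len(w) - i


def witness(n):
    return "0" * n + "1" * (n + 1)
```

For this example language the choice $ k=0$ does not help (removing zeros keeps the word in the language), whereas $ k=2$ adds enough zeros to leave it, which is why the recipe asks for a value of $ k$ rather than fixing one in advance.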
The language of arithmetic expressions is not regular
For simplicity assume atoms can only be the one-letter word a, hence the alphabet of the language is $ \{a,+,(,)\}$ .
- For any $ n$ , fix the word $ w_n=$
(( … (a+a)+a) … +a)
with $ n$ opened and closed parentheses, that is, $ n$ open parentheses, then a, then $ n$ copies of +a). Such a word contains $ n+1$ atoms, $ n$ addition symbols, and $ n$ pairs of parentheses, hence $ \mid w_n \mid = 4n+1$ , and its first $ n$ symbols are all open parentheses. A word starting with a+ would not do: deleting or repeating a leading a+ yields another valid expression.
- In any 'break-down' $ w_n = xyz$ with $ \mid xy\mid \leq n$ and $ y\neq\epsilon$ , the word $ y$ lies within the first $ n$ symbols, hence it cannot contain closed parentheses: it consists of $ j\geq 1$ open parentheses and nothing else.
- Fix $ k=0$ . In the word $ xy^0z = xz$ , exactly $ j$ open parentheses are missing, so we have an expression with more closing parentheses than open ones.
- Hence $ xz$ cannot be an arithmetic expression.
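The case analysis can be checked mechanically for small $ n$ . The sketch below is an assumption-laden illustration, not a proof: it uses a small recursive-descent recogniser for the grammar with the single atom a, and it takes as witness the word made of $ n$ open parentheses, then a, then $ n$ copies of +a), chosen so that its first $ n$ symbols are all open parentheses. All function names are hypothetical.

```python
def parse_term(s, i):
    """Return the index just after a term starting at i, or -1 if there is none."""
    if i < len(s) and s[i] == "a":
        return i + 1
    if i < len(s) and s[i] == "(":
        # A parenthesised term must contain at least one addition: ( <expr> + <expr> ).
        j = parse_sum(s, i + 1, min_terms=2)
        if j != -1 and j < len(s) and s[j] == ")":
            return j + 1
    return -1


def parse_sum(s, i, min_terms):
    """Parse term (+ term)* from i; return the end index, or -1 on failure."""
    j = parse_term(s, i)
    if j == -1:
        return -1
    count = 1
    while j < len(s) and s[j] == "+":
        j = parse_term(s, j + 1)
        if j == -1:
            return -1
        count += 1
    return j if count >= min_terms else -1


def is_expr(s):
    """True if s is a valid arithmetic expression over the alphabet {a, +, (, )}."""
    return parse_sum(s, 0, min_terms=1) == len(s)


def witness(n):
    return "(" * n + "a" + "+a)" * n


def all_splits_fail(n):
    """True if removing y (k = 0) breaks witness(n) for every split with |xy| <= n."""
    w = witness(n)
    if not is_expr(w) or len(w) < n:
        return False
    return all(
        not is_expr(w[:i] + w[i + j:])          # x = w[:i], y = w[i:i+j]
        for i in range(n)
        for j in range(1, n - i + 1)
    )
```

The choice of witness matters: for a word that begins with a+, the recogniser confirms that deleting the leading a+ leaves a valid expression, so such a word cannot be used to contradict the lemma.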
Closure properties of regular languages
Union
Proposition (union):
Let $ A,B$ be two regular languages. The language $ A\cup B$ is regular.
Proof:
Let $ E_A$ and $ E_B$ be the regular expressions generating $ A$ and $ B$ respectively. The regular expression $ E_A\cup E_B$ generates the language $ A\cup B$ .
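As a small illustration, the union of two expressions can be tried out with Python's re module, whose | operator plays the role of $ \cup$ . This is a sketch under assumptions: the two example expressions and the helper names union and generates are hypothetical, and Python's re syntax is richer than the regular expressions used in these notes, so only its basic operators are used here.

```python
import re
from itertools import product


def union(e_a, e_b):
    """Build an expression for the union, wrapping each side in a non-capturing group."""
    return f"(?:{e_a})|(?:{e_b})"


def generates(expr, word):
    """True if the whole word matches expr."""
    return re.fullmatch(expr, word) is not None


# Hypothetical example expressions over the alphabet {0, 1}.
E_A = "0*"        # A: words made only of zeros
E_B = "(?:01)*"   # B: repetitions of 01
E_UNION = union(E_A, E_B)

# All words over {0, 1} of length at most 5.
WORDS = ["".join(p) for n in range(6) for p in product("01", repeat=n)]
AGREES = all(
    generates(E_UNION, w) == (generates(E_A, w) or generates(E_B, w))
    for w in WORDS
)
```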
Concatenation
Proposition (concatenation):
Let $ A,B$ be two regular languages. The language $ AB$ is regular.
The proof follows the same idea as that of union.
Complement
Proposition (complement):
Let $ A$ be a regular language. The language $ \overline{A}=\Sigma^* \setminus A$ is regular.
Proof:
Let $ M_A = (K,\Sigma,\delta,q_0,F)$ be a DFA which accepts $ A$ . We build the DFA $ \overline{M_A} = (K,\Sigma,\delta,q_0,K\setminus F)$ . $ \overline{M_A}$ only differs from $ M_A$ in the accepting (or final) states: each final state in $ M_A$ is non-final in $ \overline{M_A}$ and vice-versa. It follows immediately that, for all words $ w$ , $ w$ is accepted by $ M_A$ iff $ w$ is not accepted by $ \overline{M_A}$ . Thus, $ \overline{M_A}$ accepts exactly the words not in $ A$ , i.e. the language $ \overline{A}$ .
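The construction amounts to swapping final and non-final states, as the following Python sketch shows. The tuple encoding (states, delta, start, final) and the example DFA are assumptions made for illustration; the construction relies on the transition function being total, as it is for a DFA.

```python
from itertools import product


def accepts(dfa, word):
    """Run the DFA on word and report whether it ends in a final state."""
    states, delta, start, final = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in final


def complement(dfa):
    """Same states, transitions and initial state; final states become K minus F."""
    states, delta, start, final = dfa
    return states, delta, start, states - final


# Hypothetical example DFA over {a, b}: accepts the words that end in a.
M_A = (
    {0, 1},
    {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0},
    0,
    {1},
)
M_A_BAR = complement(M_A)

# All words over {a, b} of length at most 4.
WORDS = ["".join(p) for n in range(5) for p in product("ab", repeat=n)]
# Every word is accepted by exactly one of the two automata.
FLIPPED = all(accepts(M_A, w) != accepts(M_A_BAR, w) for w in WORDS)
```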
Intersection
Proposition (intersection):
Let $ A,B$ be two regular languages. The language $ A\cap B$ is regular.
Proof:
The language $ A\cap B$ can be defined as $ \overline{\overline{A}\cup\overline{B}}$ . By union and complement closure, we have that $ \overline{A}$ , $ \overline{B}$ , $ \overline{A}\cup\overline{B}$ and finally $ \overline{\overline{A}\cup\overline{B}}$ are regular languages.
An alternative and more useful proof is to construct, starting from DFAs $ M_A=(K_A,\Sigma,\delta_A,q_A,F_A)$ and $ M_B=(K_B,\Sigma,\delta_B,q_B,F_B)$ which accept languages $ A$ and $ B$ , respectively, a DFA for $ A\cap B$ . The construction is called the product automaton (written $ M_A\times M_B$ ), and is as follows:
- the set of states is $ K_A\times K_B$ - each state in the product automaton is a pair of states;
- the initial state is $ (q_A,q_B)$ ;
- the transition function $ \delta_{A\times B}$ is defined as: $ \delta_{A\times B}((q_x,q_y),c) = (\delta_A(q_x,c),\delta_B(q_y,c))$ .
- the set of final states is $ F_A \times F_B$
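It is easy to prove by induction on the length of $ w$ that, for each word $ w$ and all states $ p\in K_A$ and $ r\in K_B$ , we have $ (q_A,w)\vdash_{M_A}^*(p,\epsilon)$ and $ (q_B,w)\vdash_{M_B}^*(r,\epsilon)$ if and only if, in the product automaton, $ ((q_A,q_B),w)\vdash^*_{M_{A\times B}}((p,r),\epsilon)$ . Since $ (p,r)$ is final exactly when $ p\in F_A$ and $ r\in F_B$ , the product automaton accepts $ w$ iff both $ M_A$ and $ M_B$ accept it.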
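The product construction translates almost literally into code. The Python sketch below is an illustration under assumptions: DFAs are encoded as tuples (states, delta, start, final) with delta a dictionary, and the two example automata (even number of a, and words ending in b) are hypothetical.

```python
from itertools import product


def accepts(dfa, word):
    """Run the DFA on word and report whether it ends in a final state."""
    states, delta, start, final = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in final


def product_automaton(dfa_a, dfa_b, alphabet):
    """Build the product DFA: pairs of states, component-wise transitions, F_A x F_B final."""
    k_a, delta_a, q_a, f_a = dfa_a
    k_b, delta_b, q_b, f_b = dfa_b
    states = set(product(k_a, k_b))
    delta = {
        ((p, r), c): (delta_a[(p, c)], delta_b[(r, c)])
        for (p, r) in states
        for c in alphabet
    }
    final = set(product(f_a, f_b))
    return states, delta, (q_a, q_b), final


# Hypothetical example DFAs over {a, b}.
EVEN_A = ({0, 1}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = ({0, 1}, {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})
BOTH = product_automaton(EVEN_A, ENDS_B, "ab")

# All words over {a, b} of length at most 5.
WORDS = ["".join(p) for n in range(6) for p in product("ab", repeat=n)]
AGREES = all(
    accepts(BOTH, w) == (accepts(EVEN_A, w) and accepts(ENDS_B, w))
    for w in WORDS
)
```

Unlike the proof via complement and union, this construction yields a DFA directly, with $ \mid K_A\mid \cdot \mid K_B\mid$ states.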
Difference
Proposition (difference):
Let $ A,B$ be two regular languages. The language $ A\setminus B$ is regular.
Proof:
The language $ A\setminus B$ can be defined as $ A\cap\overline{B}$ , which is regular via the intersection and complement properties.
Closure
Proposition (closure):
Let $ A$ be a regular language. The language $ A^*$ is regular.
The proof is similar to that for union and concatenation.
Reversal
Proposition (reversal):
Let $ A$ be a regular language. The language $ A^R$ is regular.