Consider the language defined via BNF (Backus-Naur Form) as follows:
<expr> ::= <expr> + <expr> | (<expr> + <expr>) | <atom>
which describes simple arithmetic expressions with parentheses. Let us ignore the construction rules for atoms (let $ M_a$ be a DFA with a unique final state which accepts valid atoms).
Consider the following NFA built (with respect to $ M_a$ ) to accept words of the above language:
One can easily observe that the above automaton can only accept words of the following forms:
<atom>
<atom>+<atom>+ … +<atom>
<atom>+<atom>+ … +<atom>+(<atom>+<atom>+ … +<atom>)
There are several possible fixes to our construction:
(<sequence> | <par_sequence>) (+ (<sequence> | <par_sequence>))*
(here parentheses and * should be interpreted as with regular expressions), where <sequence> ::= <atom> | <atom> + <sequence> and <par_sequence> ::= ( <sequence> )
However, it is not possible to add a finite number of states (and transitions) that can accommodate an arbitrary finite number of parenthesis nestings, for example 1 + (2 + … )))). Each automaton which we know how to build can only handle nestings up to a fixed finite depth, hence it can only describe arithmetic expressions with parenthesis nesting of up to some $ k\in\mathbb{N}$ .
This happens because our language is not regular.
The pumping lemma captures one particular trait of Regular languages:
The previous language is a good counter-example: while arithmetic expressions do capture a 'repeating pattern' - identifying it requires counting: we must keep track of the number $ k$ of open parentheses and make sure that precisely $ k$ parentheses are eventually closed.
Put more formally, the repeating pattern found in words of a Regular Language looks as follows.
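Let $ L$ be a regular language. Then there exists $ n$ (dependent on $ L$ ) such that every word $ w\in L$ with $ \mid w \mid \geq n$ has the following form: $ w = xyz$ , where $ y\neq\epsilon$ , $ \mid xy \mid \leq n$ , and $ xy^kz\in L$ for every $ k\geq 0$ .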
Informally, the Pumping lemma tells us that all 'large-enough' words of a regular language must have the form shown in the following image:
finite.
Note that arithmetic expressions cannot be broken-down in this way.
Suppose $ L$ is a finite language. Then there exists a DFA accepting $ L$ which has the structure of a tree: each branch of the tree ending in a leaf describes one possible word of the language. The lemma trivially holds here since the $ n$ at hand can be any number larger than the length of the longest word. Thus, there is no word satisfying $ \mid w \mid \geq n$ , and the condition of the lemma holds vacuously.
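Suppose $ M$ is a DFA which accepts $ L$ . 'Large-enough' words are those with $ \mid w \mid \geq \mid K\mid$ , where $ \mid K\mid$ is the number of states of $ M$ . For such words, the accepting path through $ M$ visits $ \mid w \mid + 1 > \mid K\mid$ states, so it must pass through some state at least twice. The part of $ w$ read between the first two visits to the first repeated state is $ y$ : it labels a loop, so it can be skipped or repeated any number of times without changing the state in which $ M$ ends.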
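The loop argument can be sketched in code. The following is a minimal illustration, not part of the original notes: the function names `pump_split` and `accepts` and the two-state example DFA are assumptions made for the example. It follows the path of a word through a DFA, stops at the first repeated state, and returns the corresponding split $ w = xyz$ .

```python
def pump_split(delta, start, word):
    """Split `word` into (x, y, z) such that y labels a loop in the DFA.

    `delta` maps (state, symbol) -> state. Returns None when no state
    repeats, which can only happen if len(word) < number of states.
    """
    seen = {start: 0}            # state -> position at which it was first reached
    state = start
    for i, symbol in enumerate(word, start=1):
        state = delta[(state, symbol)]
        if state in seen:        # first repeated state: the loop reads word[j:i]
            j = seen[state]
            return word[:j], word[j:i], word[i:]
        seen[state] = i
    return None


def accepts(delta, start, finals, word):
    """Run the DFA on `word` and report whether it ends in a final state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


# Hypothetical DFA over {a, b}: accepts words with an even number of a's.
delta = {("even", "a"): "odd", ("even", "b"): "even",
         ("odd", "a"): "even", ("odd", "b"): "odd"}
x, y, z = pump_split(delta, "even", "abab")
```

Because the split is taken at the first repeated state, $ \mid xy \mid$ never exceeds the number of states, and every word `x + y * k + z` ends in the same state as the original word.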
The Pumping Lemma is used as a technical instrument for proving that a language $ L$ is not regular. The proof scheme is as follows:
Consider the language $ L = \{0^i1^i \mid i > 1\}$ consisting of sequences of zeros followed by the same number of ones. Suppose the language is regular.
To use the Pumping lemma in order to prove that a language $ L$ is not regular:
Thus, the Pumping lemma is contradicted.
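For the language $ \{0^i1^i \mid i > 1\}$ the contradiction can be checked by brute force for one candidate value of $ n$ . The sketch below is only an illustration under assumptions: the helper names `in_L` and `surviving_splits` and the value `n = 5` are chosen for the example, and a check for one $ n$ is not a proof. It enumerates every split of $ 0^n1^n$ with $ y\neq\epsilon$ and $ \mid xy \mid \leq n$ , and keeps those for which both $ xy^0z$ and $ xy^2z$ stay in the language.

```python
def in_L(w):
    """Membership test for L = { 0^i 1^i | i > 1 }."""
    i = len(w) // 2
    return i > 1 and w == "0" * i + "1" * i


def surviving_splits(w, n, ks=(0, 2)):
    """Splits w = xyz (y non-empty, |xy| <= n) whose pumped words
    x y^k z remain in L for every k in `ks`."""
    survivors = []
    for j in range(n):                   # j = |x|
        for i in range(j + 1, n + 1):    # i = |xy| <= n, so y is non-empty
            x, y, z = w[:j], w[j:i], w[i:]
            if all(in_L(x + y * k + z) for k in ks):
                survivors.append((x, y, z))
    return survivors


n = 5                      # candidate pumping length, assumed for illustration
w = "0" * n + "1" * n      # the word 0^n 1^n, which has length 2n >= n
splits = surviving_splits(w, n)
```

No split survives: since $ \mid xy \mid \leq n$ , the substring $ y$ consists only of zeros, so pumping changes the number of zeros while the number of ones stays the same. The same reasoning applies to every $ n$ .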
For simplicity, assume atoms can only be the one-letter word a, hence the alphabet of the language is $ \{a,+,(,)\}$ .
a+(a+ … (a+a) … )
with $ n$ opened and closed parentheses. Such a word contains $ n+2$ atoms, $ n+1$ addition symbols, and $ n$ pairs of parentheses, hence $ \mid w_n \mid = 4n+3 \geq n$ . Since $ \mid xy \mid \leq n$ , the substring $ y$ lies within the first $ n$ symbols of $ w_n$ and so contains only the symbols (, a or +. Fix $ k=0$ . In the word $ xy^0z = xz$ , there is a missing atom, addition symbol or open parenthesis. If an open parenthesis (as in +(a+) is missing, then we have an expression with more closing parentheses than open ones (and possibly an invalid one too).
Proposition (union):
Let $ A,B$ be two regular languages. The language $ A\cup B$ is regular.
Proof:
Let $ E_A$ and $ E_B$ be the regular expressions generating $ A$ and $ B$ respectively. The regular expression $ E_A\cup E_B$ generates the language $ A\cup B$ .
Proposition (concatenation):
Let $ A,B$ be two regular languages. The language $ AB$ is regular.
The proof follows the same idea as that of union.
Proposition (complement):
Let $ A$ be a regular language. The language $ \overline{A}=\Sigma^* \setminus A$ is regular.
Proof:
Let $ M_A = (K,\Sigma,\delta,q_0,F)$ be a DFA which accepts $ A$ . We build the DFA $ \overline{M_A} = (K,\Sigma,\delta,q_0,K\setminus F)$ . $ \overline{M_A}$ only differs from $ M_A$ in the accepting (or final) states: each final state in $ M_A$ is non-final in $ \overline{M_A}$ and vice-versa. It follows immediately that, for all words $ w$ , $ w$ is accepted by $ M_A$ iff $ w$ is not accepted by $ \overline{M_A}$ . Thus, $ \overline{M_A}$ accepts exactly the words not in $ A$ , i.e. the language $ \overline{A}$ .
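The complement construction is short enough to sketch directly. In the illustration below, the tuple encoding of a DFA, the function names `complement` and `accepts`, and the example automaton are all assumptions made for the example. The construction relies on $ \delta$ being total (every state has a transition on every symbol), which holds for a DFA as defined here.

```python
def complement(states, delta, start, finals):
    """Return the DFA with final and non-final states swapped."""
    return states, delta, start, states - finals


def accepts(dfa, word):
    """Run the DFA on `word` and report whether it ends in a final state."""
    states, delta, start, finals = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


# Hypothetical DFA over {0, 1}: accepts the words that end in 1.
M = ({"p", "q"},
     {("p", "0"): "p", ("p", "1"): "q",
      ("q", "0"): "p", ("q", "1"): "q"},
     "p",
     {"q"})
M_bar = complement(*M)
```

For every word, exactly one of `accepts(M, w)` and `accepts(M_bar, w)` is true, since both automata follow the same path and disagree only on whether its last state is final.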
Proposition (intersection):
Let $ A,B$ be two regular languages. The language $ A\cap B$ is regular.
Proof:
The language $ A\cap B$ can be defined as $ \overline{\overline{A}\cup\overline{B}}$ . By union and complement closure, we have that $ \overline{A}$ , $ \overline{B}$ , $ \overline{A}\cup\overline{B}$ and finally $ \overline{\overline{A}\cup\overline{B}}$ are regular languages.
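An alternative and more useful proof is to construct, starting from DFAs $ M_A=(K_A,\Sigma,\delta_A,q_A,F_A)$ and $ M_B=(K_B,\Sigma,\delta_B,q_B,F_B)$ which accept languages $ A$ and $ B$ , respectively, a DFA for $ A\cap B$ . The construction is called product automaton (written $ M_A\times M_B$ ), and is as follows: $ M_A\times M_B = (K_A\times K_B,\Sigma,\delta,(q_A,q_B),F_A\times F_B)$ , where $ \delta((p,r),c) = (\delta_A(p,c),\delta_B(r,c))$ for all $ p\in K_A$ , $ r\in K_B$ and $ c\in\Sigma$ .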
It is easy to prove by induction that for each word $ w$ such that $ (q_A,w)\vdash_{M_A}^*(p,\epsilon)$ and $ (q_B,w)\vdash_{M_B}^*(r,\epsilon)$ , with $ p\in F_A$ and $ r\in F_B$ , we also have in the product automaton $ ((q_A,q_B),w)\vdash^*_{M_A\times M_B}((p,r),\epsilon)$ .
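The product automaton can be sketched as follows. This is an illustration under assumptions: the tuple encoding of a DFA, the names `product_dfa` and `accepts`, and the two example automata are chosen for the example. States of the product are pairs, each component steps with its own transition function, and a pair is final when both components are final.

```python
from itertools import product as pairs


def product_dfa(A, B, alphabet):
    """Build a DFA that runs A and B in lockstep and accepts when both do."""
    states_a, delta_a, start_a, finals_a = A
    states_b, delta_b, start_b, finals_b = B
    states = set(pairs(states_a, states_b))
    delta = {((p, r), c): (delta_a[(p, c)], delta_b[(r, c)])
             for (p, r) in states for c in alphabet}
    finals = set(pairs(finals_a, finals_b))
    return states, delta, (start_a, start_b), finals


def accepts(dfa, word):
    """Run the DFA on `word` and report whether it ends in a final state."""
    states, delta, start, finals = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


# Hypothetical DFAs over {0, 1}.
A = ({"p", "q"},                                   # words that end in 1
     {("p", "0"): "p", ("p", "1"): "q",
      ("q", "0"): "p", ("q", "1"): "q"},
     "p", {"q"})
B = ({"e", "o"},                                   # words of even length
     {("e", "0"): "o", ("e", "1"): "o",
      ("o", "0"): "e", ("o", "1"): "e"},
     "e", {"e"})
AB = product_dfa(A, B, "01")
```

Here `AB` has four states and accepts exactly the words of even length that end in 1. Replacing the final states with the pairs in which at least one component is final would give an automaton for the union instead.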
Proposition (difference):
Let $ A,B$ be two regular languages. The language $ A\setminus B$ is regular.
Proof:
The language $ A\setminus B$ can be defined as $ A\cap\overline{B}$ , which is regular by closure under intersection and complement.
Proposition (closure):
Let $ A$ be a regular language. The language $ A^*$ is regular.
The proof is similar to that for union and concatenation.
Proposition (reversal):
Let $ A$ be a regular language. The language $ A^R$ is regular.