====== Pumping Lemma ======

===== Motivation =====

Consider the language defined via BNF (Backus-Naur Form) as follows:

  *''<expr> ::= <expr> + <expr> | (<expr> + <expr>) | <atom>''

which describes simple arithmetic expressions with parentheses. Let us ignore the construction rules for atoms (let $math[M_a] be a DFA with a unique final state which accepts valid atoms).

Consider the following NFA built (with respect to $math[M_a]) to accept words of the above language:

{{:lfa:parenexpr.jpg?600|}}

One can easily observe that the above automaton can only accept words of the following forms:
  * ''<atom>''
  * ''<atom>+<atom>+ ... +<atom>''
  * ''<atom>+<atom>+ ... +<atom>+(<atom>+<atom>+ ... +<atom>)''

There are several possible fixes to our construction:
  * add transitions to accommodate expressions of the form: '' (<sequence> | <par_sequence>) (+ (<sequence> | <par_sequence>))* '' (here parentheses and ''*'' should be interpreted as with regular expressions), where ''<sequence> ::= <atom> | <atom> + <sequence>'' and ''<par_sequence> ::= ( <sequence> )''

  * add new states and transitions which allow defining nested parentheses...

**However**, it is **not possible** to add a **finite** number of states (and transitions) which can accommodate an **//arbitrary// finite** number of parenthesis nestings (e.g. ''1 + (2 + ...  ))))''). Each automaton which we know how to build can only accept a **//fixed// finite** number of nestings, hence it can only describe arithmetic expressions with parenthesis nesting of up to some $math[k\in\mathbb{N}].

This happens because our language **is not regular**.

===== The pumping lemma =====

The //pumping lemma// captures one particular trait of Regular languages:
  * //words of an (infinite) Regular language must exhibit a specific form of **regularity** - a 'repeating pattern' which is simple-enough that it does not require **counting**//

The previous language is a good counter-example: while arithmetic expressions do capture a 'repeating pattern' - identifying it requires **counting**: we must keep track of the number $math[k] of open parentheses and make sure that precisely $math[k] parentheses are eventually closed.

Put more formally, the ''repeating pattern'' found in words of a Regular Language looks as follows.

$theorem[Pumping Lemma]
Let $math[L] be a regular language. Then there exists $math[n] (dependent on $math[L]) such that every word $math[w\in L] of length at least $math[n] has the following form:
  * $math[w = xyz] where
  * $math[\mid xy \mid \leq n]
  * $math[y \neq \epsilon]
  * for all $math[k \geq 0], we have $math[xy^kz \in L]
$end

Informally, the Pumping lemma tells us that all '//large-enough//' words of a regular language must have the form shown in the following image:

{{:lfa:pumping_lemma_fin.jpg?600|}}

  * a prefix $math[x];
  * a **non-empty** word $math[y] which may repeat zero-or-more times;
  * a suffix $math[z] which, like $math[x], may be empty.

Note that arithmetic expressions cannot be broken-down in this way.

==== What does 'large-enough' mean? ====

Suppose $math[L] is a **finite** language. Then there exists a DFA accepting $math[L] which has the structure of a **tree**: each branch of the tree ending in a leaf describes one possible word of the language.
The lemma trivially holds here since the $math[n] at hand can be any number **larger** than the length of the **longest** word. **Thus, there is no word satisfying $math[\mid w \mid \geq n]**, and the conditions of the lemma hold vacuously.

In general, suppose $math[M] is a DFA with set of states $math[K] which accepts $math[L]. //'Large-enough'// words are those with $math[\mid w \mid \geq \mid K\mid], i.e. at least as long as the number of states of $math[M]. For such words, the accepting path through $math[M] visits $math[\mid w\mid + 1] states, hence it must explore some state **at least twice**.
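
The repeated state yields the decomposition $math[w = xyz] directly. The following is a minimal Python sketch, under the assumption that the DFA is given as a dictionary ''delta'' from (state, symbol) pairs to states; the names ''pump_split'' and ''accepts'', as well as the two-state example automaton, are hypothetical. It cuts $math[w] at the first state which the path visits twice, so that $math[y] is the part of the word read along the loop.

```python
def pump_split(delta, q0, w):
    """Split w into (x, y, z) at the first repeated state on its path.

    delta maps (state, symbol) -> state and is assumed to be total.
    Returns None when no state repeats on the path of w.
    """
    seen = {q0: 0}  # state -> number of symbols read when it was first reached
    q = q0
    for pos, c in enumerate(w, start=1):
        q = delta[(q, c)]
        if q in seen:  # the path came back to q: w[seen[q]:pos] labels a loop
            i = seen[q]
            return w[:i], w[i:pos], w[pos:]
        seen[q] = pos
    return None


def accepts(delta, q0, finals, w):
    """Run the DFA on w and report whether it stops in a final state."""
    q = q0
    for c in w:
        q = delta[(q, c)]
    return q in finals


# Hypothetical example DFA over {a, b}: words with an even number of a's.
delta = {("even", "a"): "odd", ("even", "b"): "even",
         ("odd", "a"): "even", ("odd", "b"): "odd"}

x, y, z = pump_split(delta, "even", "abab")
pumped = [x + y * k + z for k in range(4)]
```

Since a path which reads $math[\mid K\mid] symbols visits $math[\mid K\mid + 1] states, the first repetition occurs within the first $math[\mid K\mid] symbols, which is where the bound $math[\mid xy\mid \leq n] comes from.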

===== The pumping lemma in action =====

The Pumping Lemma is used as a technical instrument for proving that a language $math[L] **is not regular**. The proof scheme is as follows:
  * suppose $math[L] is **regular**. Hence the Pumping lemma must hold.
  * identify a word-construction which **violates** the Pumping Lemma, yielding a contradiction.


Consider the language $math[L = \{0^i1^i \mid i > 1\}] consisting of sequences of zeros followed by ones in the same number. Suppose the language is regular.

  - The pumping lemma tells us that there exists **some** $math[n] such that every word of $math[L] of length at least $math[n] has the aforementioned properties;
    * To violate the Pumping lemma, we must show that **for any** $math[n], some word of $math[L] of length at least $math[n] lacks them.
  - The pumping lemma tells us that **for any** word $math[w\in L] such that $math[\mid w \mid \geq n], certain properties hold;
    * To violate the pumping lemma, let us choose $math[w_n] (with respect to **any** $math[n]) to be $math[0^n1^n]. We may assume $math[n > 1], since any number larger than $math[n] also satisfies the lemma.
  - The pumping lemma tells us that any large-enough word, including $math[0^n1^n], can be split into $math[xyz] where $math[y\neq \epsilon] and $math[\mid xy\mid \leq n]. Thus, even if we do not know $math[x,y,z], we can argue that $math[xy] lies within the first $math[n] symbols, which are all zeros: $math[x=0^{i}], $math[y=0^{j}] with $math[j\neq 0] and $math[i+j\leq n], and $math[z=0^{n-i-j}1^n].
  - Finally, the pumping lemma tells us that any word $math[xy^kz], with $math[k\geq 0], is also in $math[L].
    * To violate it, choose $math[k=0]. The word $math[xy^0z=xz=0^{i}0^{n-i-j}1^n=0^{n-j}1^{n}] has **strictly fewer** zeros than ones (since $math[j\neq 0]). The pumping lemma tells us that this word should be in $math[L], however it obviously is not. Contradiction. The language $math[L] cannot be regular.
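
The argument above can be checked mechanically for small values of $math[n]. The following Python sketch is an illustration only; the helper names ''in_L'', ''splits'' and ''pumping_down_fails'' are hypothetical. It enumerates every split of $math[0^n1^n] with $math[\mid xy\mid \leq n] and $math[y\neq\epsilon], and verifies that removing $math[y] (the case $math[k=0]) always yields a word outside $math[L]. A finite check of this kind supports the proof but does not replace it.

```python
import re


def in_L(w):
    """Membership in L = { 0^i 1^i | i > 1 }."""
    m = re.fullmatch(r"(0*)(1*)", w)
    if m is None:
        return False
    zeros, ones = len(m.group(1)), len(m.group(2))
    return zeros == ones and zeros > 1


def splits(w, n):
    """Yield all (x, y, z) with w = xyz, |xy| <= n and y non-empty."""
    for i in range(n + 1):                 # i = |x|
        for j in range(1, n - i + 1):      # j = |y| >= 1 and i + j <= n
            if i + j <= len(w):
                yield w[:i], w[i:i + j], w[i + j:]


def pumping_down_fails(n):
    """True if 0^n 1^n is in L and every admissible split leaves L for k = 0."""
    w = "0" * n + "1" * n
    return in_L(w) and all(not in_L(x + z) for x, y, z in splits(w, n))


checked = {n: pumping_down_fails(n) for n in range(2, 9)}
```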

==== The Pumping Lemma recipe ====

To use the Pumping lemma in order to prove that a language $math[L] is not regular:
  * for any $math[n], select a //word pattern// $math[w_n] such that $math[w_n \in L] and $math[\mid w_n\mid \geq n]
  * for any 'break-down' of $math[w_n] into $math[xyz] such that $math[\mid xy\mid \leq n] and $math[y\neq \epsilon], select a value $math[k] such that $math[xy^kz] is **not** a member of $math[L].

Thus, the Pumping lemma is contradicted.

===== The language of arithmetic expressions is not regular =====

For simplicity assume atoms can only be the one-letter word ''a'', hence the alphabet of the language is $math[\{a,+,(,)\}].

  * For any $math[n], fix the word $math[w_n=]''( ... ( (a+a) +a) ... +a)'', which consists of $math[n] open parentheses, followed by ''a'', followed by $math[n] copies of ''+a)''. Such a word is a valid arithmetic expression which contains $math[n+1] atoms, $math[n] addition symbols, and $math[n] pairs of parentheses, hence $math[\mid w_n \mid = 4n+1 \geq n]. Starting the word with parentheses matters: a word such as ''a+(a+ ... (a+a) ... )'' can be pumped, since its prefix ''a+'' may be removed or repeated without leaving the language.
  * In any 'break-down' $math[w_n = xyz] with $math[\mid xy\mid \leq n] and $math[y\neq\epsilon], the word $math[xy] lies within the first $math[n] symbols of $math[w_n], which are all open parentheses. Hence $math[y] consists of $math[j] open parentheses, for some $math[j > 0].
  * Fix $math[k=0]. The word $math[xy^0z = xz] has only $math[n-j] open parentheses, but still $math[n] closed ones.
  * In any case, $math[xz] cannot be an arithmetic expression, since in every word generated by the grammar the number of open parentheses equals the number of closed ones.
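
The non-regularity argument can be checked mechanically for small $math[n]. The sketch below is an illustration only: the recursive-descent recogniser ''is_expr'' and the helpers ''splits'', ''word'' and ''pumping_down_fails'' are hypothetical names, atoms are restricted to ''a'' as assumed above, and the word tested is an assumption of this sketch, namely $math[n] open parentheses, then ''a'', then $math[n] copies of ''+a)'', so that its first $math[n] symbols are all open parentheses. A finite check of this kind supports the proof but does not replace it.

```python
def is_expr(w):
    """Recognise <expr> ::= <expr> + <expr> | (<expr> + <expr>) | a."""

    def term(i):
        # Index just past one term (an atom or a parenthesised sum), or -1.
        if i < len(w) and w[i] == "a":
            return i + 1
        if i < len(w) and w[i] == "(":
            j, terms = summation(i + 1)
            # The grammar requires at least one '+' inside parentheses.
            if j != -1 and terms >= 2 and j < len(w) and w[j] == ")":
                return j + 1
        return -1

    def summation(i):
        # Parses term (+ term)*; returns (end index, number of terms).
        j = term(i)
        if j == -1:
            return -1, 0
        terms = 1
        while j < len(w) and w[j] == "+":
            nxt = term(j + 1)
            if nxt == -1:
                return -1, 0
            j, terms = nxt, terms + 1
        return j, terms

    end, _ = summation(0)
    return end == len(w)


def splits(w, n):
    """Yield all (x, y, z) with w = xyz, |xy| <= n and y non-empty."""
    for i in range(n + 1):
        for j in range(1, n - i + 1):
            if i + j <= len(w):
                yield w[:i], w[i:i + j], w[i + j:]


def word(n):
    """n open parentheses, one atom, then n copies of '+a)'."""
    return "(" * n + "a" + "+a)" * n


def pumping_down_fails(n):
    """True if word(n) is valid and every admissible split breaks it for k = 0."""
    w = word(n)
    return is_expr(w) and all(not is_expr(x + z) for x, y, z in splits(w, n))


checked = {n: pumping_down_fails(n) for n in range(1, 8)}
```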

====== Closure properties of regular languages ======

==== Union ====

$prop[union] Let $math[A,B] be two regular languages. The language $math[A\cup B] is regular.
$end

$proof
Let $math[E_A] and $math[E_B] be the regular expressions generating $math[A] and $math[B] respectively. The regular expression $math[E_A\cup E_B] generates the language $math[A\cup B].
$end
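
As a small illustration of the proof, the union of two regular expressions can be formed syntactically. The sketch below uses Python's ''re'' module, whose pattern language is richer than regular expressions in the formal sense; only alternation, grouping, ''*'' and ''+'' are used here. The patterns ''E_A'' and ''E_B'' and the helper names are hypothetical examples.

```python
import re
from itertools import product


def union(e_a, e_b):
    """Pattern generating the union of the languages of e_a and e_b."""
    return f"(?:{e_a})|(?:{e_b})"


def generates(pattern, w):
    """True if the whole word w matches the pattern."""
    return re.fullmatch(pattern, w) is not None


E_A = "a*"        # hypothetical example: words consisting of a's only
E_B = "(?:ab)+"   # hypothetical example: one or more repetitions of ab
E_UNION = union(E_A, E_B)

# Compare both sides of the claim on all words over {a, b} of length <= 4.
all_words = ["".join(t) for n in range(5) for t in product("ab", repeat=n)]
agree = all(
    generates(E_UNION, w) == (generates(E_A, w) or generates(E_B, w))
    for w in all_words
)
```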

==== Concatenation ====

$prop[concatenation] Let $math[A,B] be two regular languages. The language $math[AB] is regular.
$end

The proof follows the same idea as that of union.

==== Complement ====

$prop[complement] Let $math[A] be a regular language. The language $math[\overline{A}=\Sigma^* \setminus A] is regular.
$end

$proof
Let $math[M_A = (K,\Sigma,\delta,q_0,F)] be a DFA which accepts $math[A]. We build the DFA $math[\overline{M_A} = (K,\Sigma,\delta,q_0,K\setminus F)]. $math[\overline{M_A}] only differs from $math[M_A] in the accepting (or final) states: each **final** state in $math[M_A] is **non-final** in $math[\overline{M_A}] and vice-versa. It follows immediately that, for all words $math[w], $math[w] is accepted by $math[M_A] iff $math[w] is not accepted by $math[\overline{M_A}]. Thus, $math[\overline{M_A}] accepts exactly the words not in $math[A], i.e. the language $math[\overline{A}].
$end
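
The construction from the proof amounts to swapping final and non-final states. A minimal Python sketch follows, assuming a DFA is represented as a tuple ''(states, delta, q0, finals)'' with a total transition dictionary ''delta''; this representation, the helper names and the example automaton are hypothetical. Note that the construction relies on the automaton being deterministic and total: applied to an NFA, it does not yield the complement in general.

```python
from itertools import product


def accepts(dfa, w):
    """Run the DFA on w and report whether it stops in a final state."""
    states, delta, q0, finals = dfa
    q = q0
    for c in w:
        q = delta[(q, c)]
    return q in finals


def complement(dfa):
    """Same automaton, with final and non-final states swapped."""
    states, delta, q0, finals = dfa
    return states, delta, q0, states - finals


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


# Hypothetical example DFA over {a, b}: words with an even number of a's.
EVEN_A = (
    {"even", "odd"},
    {("even", "a"): "odd", ("even", "b"): "even",
     ("odd", "a"): "even", ("odd", "b"): "odd"},
    "even",
    {"even"},
)
ODD_A = complement(EVEN_A)

# Every word is accepted by exactly one of the two automata.
exactly_one = all(accepts(EVEN_A, w) != accepts(ODD_A, w) for w in words("ab", 5))
```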


==== Intersection ====

$prop[intersection] Let $math[A,B] be two regular languages. The language $math[A\cap B] is regular.
$end

$proof
The language $math[A\cap B] can be defined as $math[\overline{\overline{A}\cup\overline{B}}]. By union and complement closure, we have that $math[\overline{A}], $math[\overline{B}], $math[\overline{A}\cup\overline{B}] and finally $math[\overline{\overline{A}\cup\overline{B}}] are regular languages.
$end

An alternative and more useful proof is to construct, starting from DFAs $math[M_A=(K_A,\Sigma,\delta_A,q_A,F_A)] and $math[M_B=(K_B,\Sigma,\delta_B,q_B,F_B)] which accept languages $math[A] and $math[B], respectively, a DFA for $math[A\cap B]. The construction is called **product automaton** (written $math[M_A\times M_B]), and is as follows:

  * the set of states is $math[K_A\times K_B] - each state in the product automaton is a pair of states;
  * the initial state is $math[(q_A,q_B)];
  * the transition function $math[\delta_{A\times B}] is defined as: $math[\delta_{A\times B}((q_x,q_y),c) = (\delta_A(q_x,c),\delta_B(q_y,c))].
  * the set of final states is $math[F_A \times F_B]

It is easy to prove by induction that for each word $math[w] such that $math[(q_A,w)\vdash_{M_A}^*(p,\epsilon)] and $math[(q_B,w)\vdash_{M_B}^*(r,\epsilon)], with $math[p\in F_A] and $math[r\in F_B], we also have in the product automaton $math[((q_A,q_B),w)\vdash^*_{M_{A\times B}}((p,r),\epsilon)], and conversely.
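
The product construction translates almost literally into code. The following Python sketch assumes that a DFA is represented as a tuple ''(states, delta, q0, finals)'' with a total transition dictionary; this representation, the function names and the two example automata are hypothetical. It builds the product automaton and compares it with the two components on all short words.

```python
import itertools


def accepts(dfa, w):
    """Run the DFA on w and report whether it stops in a final state."""
    states, delta, q0, finals = dfa
    q = q0
    for c in w:
        q = delta[(q, c)]
    return q in finals


def product_automaton(dfa_a, dfa_b, alphabet):
    """DFA whose states are pairs and whose components move in lockstep."""
    ka, da, qa, fa = dfa_a
    kb, db, qb, fb = dfa_b
    states = {(p, r) for p in ka for r in kb}
    delta = {((p, r), c): (da[(p, c)], db[(r, c)])
             for (p, r) in states for c in alphabet}
    finals = {(p, r) for p in fa for r in fb}
    return states, delta, (qa, qb), finals


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=n):
            yield "".join(letters)


# Hypothetical example DFAs over {a, b}.
EVEN_A = (  # words with an even number of a's
    {"even", "odd"},
    {("even", "a"): "odd", ("even", "b"): "even",
     ("odd", "a"): "even", ("odd", "b"): "odd"},
    "even",
    {"even"},
)
ENDS_B = (  # words ending in b
    {"no", "yes"},
    {("no", "a"): "no", ("no", "b"): "yes",
     ("yes", "a"): "no", ("yes", "b"): "yes"},
    "no",
    {"yes"},
)
BOTH = product_automaton(EVEN_A, ENDS_B, "ab")

agree = all(
    accepts(BOTH, w) == (accepts(EVEN_A, w) and accepts(ENDS_B, w))
    for w in words("ab", 6)
)
```

Choosing a different set of final states over the same pairs of states gives other Boolean combinations of $math[A] and $math[B].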

==== Difference ====

$prop[difference] Let $math[A,B] be two regular languages. The language $math[A\setminus B] is regular.
$end

$proof
The language $math[A\setminus B] can be defined as $math[A\cap\overline{B}], which is regular via the intersection and complement properties.
$end

==== Closure ====

$prop[closure] Let $math[A] be a regular language. The language $math[A^*] is regular.
$end

The proof is similar to that for union and concatenation.

==== Reversal ====

$prop[reversal] Let $math[A] be a regular language. The language $math[A^R] is regular.
$end