====== Pumping Lemma ======

===== Motivation =====

Consider the language defined via BNF (Backus-Naur Form) as follows:
  * '' ::= + | ( + ) | ''
which describes simple arithmetic expressions with parentheses. Let us ignore the construction rules for atoms (let $math[M_a] be a DFA with a unique final state which accepts valid atoms).

Consider the following NFA built (with respect to $math[M_a]) to accept words of the above language:

{{:lfa:parenexpr.jpg?600|}}

One can easily observe that the above automaton can only accept words of the following forms:
  * ''''
  * ''++ ... +''
  * ''++ ... ++(++ ... +)''

There are several possible fixes to our construction:
  * add transitions to accommodate expressions of the form: '' ( | ) (+ ( | ))* '' (here parentheses and ''*'' should be interpreted as with regular expressions), where '' ::= | + '' and '' ::= ( )''
  * add new states and transitions which allow defining nested parentheses...

**However**, it is **not possible** to add a **finite** number of states (and transitions) which can accommodate an **//arbitrary// finite** number of parenthesis nestings (e.g. ''1 + (2 + ... ))))''). Each automaton which we know how to build can only accept a **//fixed// finite** number of nestings, hence it can only describe arithmetic expressions with parenthesis nesting of up to some $math[k\in\mathbb{N}]. This happens because our language **is not regular**.

===== The pumping lemma =====

The //pumping lemma// captures one particular trait of Regular languages:
  * //words of an (infinite) Regular language must exhibit a specific form of **regularity** - a 'repeating pattern' which is simple enough that it does not require **counting**//

The previous language is a good counter-example: while arithmetic expressions do exhibit a 'repeating pattern', identifying it requires **counting**: we must keep track of the number $math[k] of open parentheses and make sure that precisely $math[k] parentheses are eventually closed.
Put more formally, the ''repeating pattern'' found in words of a Regular Language looks as follows.

$theorem[Pumping Lemma]
Let $math[L] be a regular language. Then there exists $math[n] (dependent on $math[L]) such that every word $math[w\in L] with $math[\mid w \mid \geq n] can be written as $math[w = xyz], where:
  * $math[\mid xy \mid \leq n]
  * $math[y \neq \epsilon]
  * for all $math[k \geq 0], we have $math[xy^kz \in L]
$end

Informally, the Pumping lemma tells us that all '//large-enough//' words of a regular language must have the form shown in the following image:

{{:lfa:pumping_lemma_fin.jpg?600|}}

  * a prefix $math[x];
  * a **non-empty** word $math[y] which may repeat zero-or-more times;
  * a suffix $math[z] which may be empty.

Note that arithmetic expressions cannot be broken down in this way.

==== What does 'large-enough' mean? ====

Suppose $math[L] is a **finite** language. Then there exists a DFA accepting $math[L] which has the structure of a **tree**: each branch of the tree ending in a leaf describes one possible word of the language. The lemma trivially holds here, since the $math[n] at hand can be any number **larger** than the length of the **longest** word. **Thus, there is no word satisfying $math[\mid w \mid \geq n]**, and the lemma holds vacuously.

In general, suppose $math[M] is a DFA with set of states $math[K] which accepts $math[L]. //'Large-enough'// words are those with $math[\mid w \mid \geq \mid K\mid], i.e. at least as long as the number of states of $math[M]. For such words, the accepting path through $math[M] visits $math[\mid w \mid + 1 > \mid K \mid] states, hence it must explore some state **at least twice**.

===== The pumping lemma in action =====

The Pumping Lemma is used as a technical instrument for proving that a language $math[L] **is not regular**. The proof scheme is as follows:
  * suppose $math[L] is **regular**. Hence the Pumping lemma must hold.
  * identify a word-construction which **violates** the Pumping Lemma, yielding a contradiction.

Consider the language $math[L = \{0^i1^i \mid i > 1\}] consisting of sequences of zeros followed by the same number of ones.
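
Before the formal argument, a brute-force experiment may help build intuition. The Python sketch below is illustrative only: the helper names ''in_L'', ''splits'' and ''pumpable'' are assumptions introduced here, not part of the lemma. For a word $math[w] and a bound $math[n], it enumerates every split $math[w = xyz] with $math[\mid xy\mid \leq n] and $math[y \neq \epsilon], and checks whether some split keeps $math[xy^kz] inside $math[L] for $math[k = 0, \ldots, 3]. Testing finitely many values of $math[k] can only refute pumpability, never establish it, which is all that is needed for this language.

```python
def in_L(w: str) -> bool:
    """Membership test for L = { 0^i 1^i | i > 1 }."""
    i = len(w) // 2
    return i > 1 and w == "0" * i + "1" * i


def splits(w: str, n: int):
    """Yield every split w = xyz with |xy| <= n and y non-empty."""
    for end in range(1, min(n, len(w)) + 1):   # end = |xy|
        for start in range(end):               # start = |x|
            yield w[:start], w[start:end], w[end:]


def pumpable(w: str, n: int, max_k: int = 3) -> bool:
    """True if some split keeps x y^k z in L for every k in 0..max_k.

    A False result is conclusive (every split fails for some tested k);
    a True result would only mean that no counterexample was found.
    """
    return any(
        all(in_L(x + y * k + z) for k in range(max_k + 1))
        for x, y, z in splits(w, n)
    )


if __name__ == "__main__":
    for n in range(2, 8):
        w = "0" * n + "1" * n
        print(n, in_L(w), pumpable(w, n))
```

For each tested $math[n], the word $math[0^n1^n] belongs to $math[L] but none of its admissible splits survives pumping, which is exactly what the argument that follows establishes for every $math[n].
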
Suppose the language is regular.
  - The pumping lemma tells us that for **some** large-enough $math[n], all words of $math[L] of length at least $math[n] have the aforementioned properties;
    * To violate the Pumping lemma, we must show that **for any** $math[n], some word of $math[L] of length at least $math[n] does not have these properties.
  - The pumping lemma tells us that **for any** word $math[w\in L] such that $math[\mid w \mid \geq n], certain properties hold;
    * To violate the pumping lemma, let us choose $math[w_n] (with respect to **any** $math[n]) to be $math[0^n1^n]. We may assume $math[n > 1], since any larger $math[n] also satisfies the lemma.
  - The pumping lemma tells us that any large-enough word, including $math[0^n1^n], can be split into $math[xyz] where $math[y\neq \epsilon] and $math[\mid xy\mid \leq n].
    * Since $math[\mid xy\mid \leq n], the word $math[xy] lies within the first $math[n] symbols of $math[0^n1^n], which are all zeros. Thus, even if we do not know $math[x,y,z], we can argue that: $math[x=0^{i}], $math[y=0^{j}] with $math[j\neq 0] and $math[i+j\leq n], and $math[z=0^{n-i-j}1^n].
  - Finally, the pumping lemma tells us that any word $math[xy^kz], with $math[k\geq 0], is also in $math[L].
    * To violate it, choose $math[k=0]. The word $math[xy^0z=xz=0^{i}0^{n-i-j}1^n=0^{n-j}1^{n}] has **strictly fewer** zeros than ones (since $math[j\neq 0]). The pumping lemma tells us that this word should be in $math[L], however it obviously is not.

Contradiction. The language $math[L] cannot be regular.

==== The Pumping Lemma recipe ====

To use the Pumping lemma in order to prove that a language $math[L] is not regular:
  * for any $math[n], select a //word pattern// $math[w_n] such that $math[w_n \in L] and $math[\mid w_n\mid \geq n]
  * for any 'break-down' of $math[w_n] into $math[xyz] such that $math[\mid xy\mid \leq n] and $math[y\neq \epsilon], select a value $math[k] such that $math[xy^kz] is **not** a member of $math[L].

Thus, the Pumping lemma is contradicted.

===== The language of arithmetic expressions is not regular =====

For simplicity assume atoms can only be the one-letter word ''a'', hence the alphabet of the language is $math[\{a,+,(,)\}].
  * For any $math[n], fix the word $math[w_n=]''(a+(a+ ... (a+a) ... ))'' with $math[n] opened and closed parentheses. Such a word contains $math[n+1] atoms, $math[n] addition symbols, and $math[n] pairs of parentheses, hence $math[\mid w_n \mid = 4n+1 \geq n].
  * In any 'break-down' $math[w_n = xyz], the word $math[y] cannot contain closed parentheses, since $math[\mid xy\mid \leq n] and the first closed parenthesis of $math[w_n] is preceded by $math[3n+1] other symbols. Since $math[y] is nonempty, it is a sequence of the symbols ''('', ''a'' and ''+''. Fix $math[k=0]. In the word $math[xy^0z = xz], there is a missing atom, addition symbol or open parenthesis:
    * if only an atom and/or an addition symbol is missing (i.e. $math[y] is ''a'', ''+'' or ''a+''), then $math[xz] contains ''(+'', ''a('' or a parenthesised expression wrapped in a second, redundant pair of parentheses, none of which the grammar generates, so the word is not a valid arithmetic expression;
    * if a sequence which includes ''('' (such as ''+(a+'') is missing, then we have an expression with more closing parentheses than open ones (and possibly an invalid one too);
    * in any case, $math[xz] cannot be an arithmetic expression.

====== Closure properties of regular languages ======

==== Union ====

$prop[union]
Let $math[A,B] be two regular languages. The language $math[A\cup B] is regular.
$end

$proof
Let $math[E_A] and $math[E_B] be the regular expressions generating $math[A] and $math[B] respectively. The regular expression $math[E_A\cup E_B] generates the language $math[A\cup B].
$end

==== Concatenation ====

$prop[concatenation]
Let $math[A,B] be two regular languages. The language $math[AB] is regular.
$end

The proof follows the same idea as that of union.

==== Complement ====

$prop[complement]
Let $math[A] be a regular language. The language $math[\overline{A}=\Sigma^* \setminus A] is regular.
$end

$proof
Let $math[M_A = (K,\Sigma,\delta,q_0,F)] be a DFA which accepts $math[A]. We build the DFA $math[\overline{M_A} = (K,\Sigma,\delta,q_0,K\setminus F)]. $math[\overline{M_A}] only differs from $math[M_A] in the accepting (or final) states: each **final** state in $math[M_A] is **non-final** in $math[\overline{M_A}] and vice-versa.
It follows immediately that, for all words $math[w], $math[w] is accepted by $math[M_A] iff $math[w] is not accepted by $math[\overline{M_A}]. Thus, $math[\overline{M_A}] accepts exactly the words not in $math[A], i.e. the language $math[\overline{A}].
$end

==== Intersection ====

$prop[intersection]
Let $math[A,B] be two regular languages. The language $math[A\cap B] is regular.
$end

$proof
The language $math[A\cap B] can be defined as $math[\overline{\overline{A}\cup\overline{B}}]. By union and complement closure, we have that $math[\overline{A}], $math[\overline{B}], $math[\overline{A}\cup\overline{B}] and finally $math[\overline{\overline{A}\cup\overline{B}}] are regular languages.
$end

An alternative and more useful proof is to construct, starting from DFAs $math[M_A=(K_A,\Sigma,\delta_A,q_A,F_A)] and $math[M_B=(K_B,\Sigma,\delta_B,q_B,F_B)] which accept languages $math[A] and $math[B], respectively, a DFA for $math[A\cap B]. The construction is called **product automaton** (written $math[M_A\times M_B]), and is as follows:
  * the set of states is $math[K_A\times K_B] - each state in the product automaton is a pair of states;
  * the initial state is $math[(q_A,q_B)];
  * the transition function $math[\delta_{A\times B}] is defined as: $math[\delta_{A\times B}((q_x,q_y),c) = (\delta_A(q_x,c),\delta_B(q_y,c))];
  * the set of final states is $math[F_A \times F_B].

It is easy to prove by induction on the length of $math[w] that for each word $math[w] such that $math[(q_A,w)\vdash_{M_A}^*(p,\epsilon)] and $math[(q_B,w)\vdash_{M_B}^*(r,\epsilon)], with $math[p\in F_A] and $math[r\in F_B], we also have in the product automaton $math[((q_A,q_B),w)\vdash^*_{M_{A\times B}}((p,r),\epsilon)], and conversely.

==== Difference ====

$prop[difference]
Let $math[A,B] be two regular languages. The language $math[A\setminus B] is regular.
$end

$proof
The language $math[A\setminus B] can be defined as $math[A\cap\overline{B}], which is regular via the intersection and complement closure properties.
$end

==== Closure ====

$prop[closure]
Let $math[A] be a regular language.
The language $math[A^*] is regular.
$end

The proof is similar to that for union and concatenation.

==== Reversal ====

$prop[reversal]
Let $math[A] be a regular language. The language $math[A^R] is regular.
$end
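
The DFA-based constructions from the Complement, Intersection and Difference sections can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the ''DFA'' class, its field names and the two example automata (''even_zeros'', ''ends_in_one'') are hypothetical and introduced only for this sketch, and the transition function is assumed to be total.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict        # total function: (state, symbol) -> state
    start: object
    finals: frozenset

    def accepts(self, word) -> bool:
        q = self.start
        for c in word:
            q = self.delta[(q, c)]
        return q in self.finals


def complement(m: DFA) -> DFA:
    """Same automaton with final and non-final states swapped."""
    return DFA(m.states, m.alphabet, m.delta, m.start, m.states - m.finals)


def intersection(a: DFA, b: DFA) -> DFA:
    """Product automaton: both DFAs run in lockstep on pairs of states."""
    assert a.alphabet == b.alphabet
    states = frozenset(product(a.states, b.states))
    delta = {
        ((p, q), c): (a.delta[(p, c)], b.delta[(q, c)])
        for (p, q) in states
        for c in a.alphabet
    }
    finals = frozenset(product(a.finals, b.finals))
    return DFA(states, a.alphabet, delta, (a.start, b.start), finals)


def difference(a: DFA, b: DFA) -> DFA:
    """A minus B, built as the intersection of A with the complement of B."""
    return intersection(a, complement(b))


# Hypothetical example automata over the alphabet {0, 1}.
SIGMA = frozenset("01")
even_zeros = DFA(
    frozenset({"e", "o"}), SIGMA,
    {("e", "0"): "o", ("e", "1"): "e", ("o", "0"): "e", ("o", "1"): "o"},
    "e", frozenset({"e"}),
)
ends_in_one = DFA(
    frozenset({"n", "y"}), SIGMA,
    {("n", "0"): "n", ("n", "1"): "y", ("y", "0"): "n", ("y", "1"): "y"},
    "n", frozenset({"y"}),
)

both = intersection(even_zeros, ends_in_one)
only_even = difference(even_zeros, ends_in_one)
```

Here ''both'' accepts the words with an even number of zeros which also end in ''1'', while ''only_even'' accepts the words with an even number of zeros which do not end in ''1''. The product has $math[\mid K_A\mid \cdot \mid K_B\mid] states, including pairs which may be unreachable from the initial state; a practical implementation would usually explore only the reachable ones.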