====== 3. Regular expressions ======

===== 3.1. Formation rules (concatenation, union, Kleene star) =====

**3.1.1.** $math[A=\{ 0^{2k} \mid k \geq 1 \}] \\ $ B = \{0, \epsilon \} $ \\ $ AB = ? $ \\

A = {00, 0000, 000000, ...}

AB = {00, 00**0**, 0000, 0000**0**, ...} <- every word of AB is a word of A followed by a word of B: one concatenation for each pair in the cartesian product of the sets (languages) A and B, where the element of A comes first.
  * the words of even length are obtained by combining a word from A with the word ε from B
  * the words of odd length are obtained by combining a word from A with the word 0 from B

So $math[AB = \{ 0^n \mid n \geq 2 \}].

**3.1.2.** $math[A = \{ 0^n 1^n \mid n \geq 1 \}] \\ $ B = \{ 1^n \mid n \geq 1 \} $ \\ $ AB = ? $ \\ $ BA = ? $

A is the language of words made of a block of zeros followed by a block of ones, where the number of ones is equal to the number of zeros (the same value of n is used for both). The n in the definition of B is completely **unrelated** to the n used to define A. So, B is the language of words made only of ones, with a length of at least 1, so basically B = L(11*).

A = {01, 0011, 000111, 00001111 ...}

B = {1, 11, 111 ...}

AB = {01**1**, 0011**1**, 000111**1**, 00001111**1**, ..., 01**11**, 0011**11**, 000111**11**, 00001111**11**, ... }

BA = {**1**01, **1**0011, **1**000111, ..., **11**01, **11**0011, ..., **111**01, **111**0011 ...}

**3.1.3.** $ A = \emptyset $ \\ $ B = \{ 1^n \mid n \geq 1 \} $ \\ $ AB = ? $ \\ $ A^* = ? $ \\ $ B^* = ? $ \\

AB = ∅ (because A is empty, there is no pair of words to concatenate, so the result is the empty set)

A* = {ε} (ε is always part of the Kleene star)

B* = {ε} U B U BB U BBB U ... = {ε} U $ \{ 1^n \mid n \geq 1 \} $ = $ \{ 1^n \mid n \geq 0 \} $ (ε is always part of the Kleene star, and concatenating words made only of ones gives again words made only of ones). So basically B* = L(1*).

**3.1.4.** $math[A = \{ 0^n 1^n 0^m \mid m \geq n \geq 1 \}] \\ $ B = \{ 0^n \mid n \geq 1 \} $ \\ $ AB = ? $ \\ $ BA = ? $

$math[AB = \{ 0^n 1^n 0^{m+k} \mid m \geq n \geq 1, k \geq 1 \}]. So $math[AB \subsetneq A]: every word of AB is also in A, but the words of A with m = n, such as 010, are not in AB (the last block of zeros in AB always has more than n zeros).
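The concatenation in 3.1.4 can be explored by brute force on short words. The sketch below is only an illustration: the names ''MAX_LEN'' and ''concat'' are made up for this example, and both languages are truncated to words of length at most ''MAX_LEN'', so it says nothing about longer words.

```python
from itertools import product

MAX_LEN = 12  # assumption: only words up to this length are generated and compared


def concat(X, Y, max_len=MAX_LEN):
    """Concatenation XY, keeping only the words of length <= max_len."""
    return {x + y for x, y in product(X, Y) if len(x) + len(y) <= max_len}


# A = { 0^n 1^n 0^m | m >= n >= 1 }, truncated to length <= MAX_LEN
A = {
    "0" * n + "1" * n + "0" * m
    for n in range(1, MAX_LEN)
    for m in range(n, MAX_LEN)
    if 2 * n + m <= MAX_LEN
}
# B = { 0^n | n >= 1 }, truncated to length <= MAX_LEN
B = {"0" * n for n in range(1, MAX_LEN + 1)}

AB = concat(A, B)
BA = concat(B, A)

# Is every (short) word of AB also a word of A? Is 010 in A? Is it in AB?
print(AB <= A, "010" in A, "010" in AB)
# A few of the shortest words of BA
print(sorted(BA, key=lambda w: (len(w), w))[:5])
```

On this bounded sample, every word of AB is also in A, while 010 belongs to A but not to AB, because a word of AB always gets at least one extra trailing zero from B.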
Note that the n in the definition of language A **is different** from the n in the definition of B; they are **independent** when used in defining different sets/languages. However, when n is used several times in the definition of one language, such as the 2 times it appears in language A, it is **the same** value.

$math[BA = \{ 0^{n+k} 1^n 0^m \mid m \geq n \geq 1, k \geq 1 \}]. Equivalently: $math[BA = \{0^x 1^y 0^z \mid x > y \geq 1 \text{ and } z \geq y \}] (the inequality $ x > y $ is strict because $ k \geq 1 $).

===== 3.2. Writing Regular Expressions =====

**3.2.1.** Write a regular expression for the language of arithmetic expressions containing +, * and numbers.

**Hint:** you can abbreviate $ 0 \cup 1 \cup ... \cup 9 $ by $ [0-9] $

We start by defining the regex for a number:
  * a number can be a single digit from 0 to 9 => [0-9]
  * a number can have several digits, but the first one can't be 0 => [1-9][0-9]*
  * so we have either one of these options: [0-9] U [1-9][0-9]*
  * but we can write it in a more concise way: 0 U [1-9][0-9]* <= either 0 itself, or a number of 1 or more digits that doesn't start with 0

Having decided on the regex for the number, we can write a regex for expressions:

> (0 U [1-9][0-9]*) ( ('+' U '*') (0 U [1-9][0-9]*) )*

This can be understood with a clearer, more intuitive notation, which however is **not formally correct** (it is not a regex):

> number ( ('+' U '*') number )*

**3.2.2.** Write a regular expression for $ L = \{ \omega \text{ in } \text{{0,1}} ^* \text{ | EVERY sequence of two or more consecutive zeros appears before ANY sequence of two or more consecutive ones} \} $

> (1 U ε) ( 0 0* (1 U ε) )* (0 U ε) ( 1 1* (0 U ε) )*

We can start with either 1 or 0. In the first half of the regex we can have sequences of zeros of any length, but no sequence of ones with a length bigger than 1. Then we repeat the same logic on the right, making sure that no sequence of 2 or more zeros appears on that side. This way we make sure that every sequence of 2 or more zeros precedes any sequence of 2 or more ones.
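The regex for 3.2.2 can be sanity-checked against the definition of L on all short words. This is a minimal sketch, assuming a translation to Python's ''re'' syntax ("X U ε" becomes ''X?'', "0 0*" becomes ''0+''); the helper names ''in_language'' and ''mismatches'' are invented for this example. A bounded check like this can reveal a wrong regex, but it does not prove the regex correct.

```python
import re
from itertools import product

# The regex from 3.2.2, rewritten in Python syntax (an assumed translation).
PATTERN = re.compile(r"1?(?:0+1?)*0?(?:1+0?)*")


def in_language(w):
    """Direct reading of the definition: the last "00" starts before the first "11"."""
    last_zeros = w.rfind("00")
    first_ones = w.find("11")
    return last_zeros == -1 or first_ones == -1 or last_zeros < first_ones


def mismatches(max_len):
    """Words over {0,1} of length <= max_len where the regex and the definition disagree."""
    words = ("".join(p) for n in range(max_len + 1) for p in product("01", repeat=n))
    return [w for w in words if bool(PATTERN.fullmatch(w)) != in_language(w)]


print(mismatches(10))
```

An empty list means that no counterexample exists among the words of length at most 10.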
**3.2.3.** Write a DFA for $ L(( 10 \cup 0) ^* ( 1 \cup \epsilon )) $

{{:lfa:2022:lfa2022_lab_3.2.3.png?300|}}

**3.2.4.** Write a regular expression which generates the accepted language of A. Then try to find the simplest and easiest to understand way to write it.

{{:lfa:graf1.png?400|}}

Looking at the DFA we can tell that state 3 is a sink state. We can simplify the DFA's drawing by ignoring it.

Let's see what words are accepted:
  * ab*ab (when we don't loop on state 1)
  * ab*()*ab => ab*(aab*)*ab
  * anything that repeats the previous expression several times
  * ε (the initial state is also a final state)
  * from the previous 2 observations => we can use Kleene star

So, the regex is:

> ( ab*(aab*)*ab )*

For now, we determine the equivalence between a regex and a DFA intuitively, but this is not a rigorous approach. We will learn later (and you can review this exercise with the future knowledge) that once we find a regex intuitively, we should check that the DFA and the regex are actually equivalent by transforming the regex into an NFA and then checking for non-distinguishable states, OR we can use a DFA to regex conversion algorithm. (Ask your TA about this if you want more details right now, or wait until you're actually studying for the exam and reviewing all the courses.)

**3.2.5.** Describe as precisely as possible the language generated by $math[((0(1 \cup 0)(1 \cup 0)) \cup 100)1((0(1 \cup 0)(1 \cup 0)) \cup 100)1(((0(1 \cup 0)(1 \cup 0)) \cup 100)0)*] (hint: BCD)

OK, ok. We get it. It looks **bad**. BUT, let's first find repeating patterns:

( ( 0 ( 1 U 0 ) ( 1 U 0 ) ) U 100 ) 1

Ugly, yes. But let's see the language of this little regex: 0001, 0011, 0101, 0111, 1001

What are these in BCD? 1, 3, 5, 7, 9. So, what we have so far is a single odd BCD digit, BUT this pattern appears twice in a row. So, we have BCD numbers starting with 2 odd digits.
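The language of the repeating sub-pattern is small enough to enumerate mechanically. The sketch below assumes a Python spelling of ( ( 0 ( 1 U 0 ) ( 1 U 0 ) ) U 100 ) 1 (the name ''ODD_DIGIT'' is invented here) and decodes each accepted 4-bit word as one BCD digit.

```python
import re
from itertools import product

# Assumed Python spelling of ((0(1 U 0)(1 U 0)) U 100)1
ODD_DIGIT = re.compile(r"(?:0[01][01]|100)1")

# Every word of the sub-pattern has exactly 4 bits, so try all 16 candidates.
words = ["".join(bits) for bits in product("01", repeat=4)]
accepted = [w for w in words if ODD_DIGIT.fullmatch(w)]

# Read each accepted word as a binary-coded decimal digit.
digits = [int(w, 2) for w in accepted]

print(accepted)  # ['0001', '0011', '0101', '0111', '1001']
print(digits)    # [1, 3, 5, 7, 9]
```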
NEXT: ( ( ( 0 ( 1 U 0 ) ( 1 U 0 ) ) U 100 ) 0 )*

We have a Kleene star, so something repeats any number of times. And, inside this Kleene star, there is ( ( 0 (1 U 0) (1 U 0) ) U 100 ) 0, which accepts the language: 0000, 0010, 0100, 0110, 1000. These are the even BCD digits 0, 2, 4, 6, 8.

So, this ugly regex encodes BCD numbers that start with exactly 2 odd digits, which are followed by 0 or more even digits.

===== 3.3. Regex Equivalence =====

Are the following pairs of regexes equivalent?

** 3.3.1 ** $ E1 = ab|a|b $ \\ $ E2 = (a|\epsilon)(b|\epsilon) $

We can observe that E2 accepts ε, while E1 does not, so they are not equivalent. \\ Another approach is to compute the language of each expression (since both are finite) and check whether the two languages are equal.

** 3.3.2 ** $ E1 = a(b|c)(d|e)|abb|abc $ \\ $ E2 = ab(b|c|d|e)|acd|ace $

Since both E1 and E2 have a finite language, we can just compute the two languages and check whether they are equal. Both languages are L = {abb, abc, abd, abe, acd, ace}, therefore the expressions are equivalent.

** 3.3.3 ** $ E1 = (a\mid b)^*aa^* \mid \epsilon $ \\ $ E2 = (a\mid ba)^*(b\mid ba)^* $

Both E1 and E2 have an infinite language, so listing all the words and comparing them is not an option. However, a single word accepted by only one of the expressions is enough to show that they are different. For example, E2 accepts b, while E1 does not (every non-empty word of E1 ends with a), so the expressions are not equivalent.

Fun fact: E1 was proposed by a student as a solution for 3.2.2 last year, while E2 is the actual solution.

** 3.3.4 ** $ E1 = ((ab^*a)^+b)^* $ \\ $ E2 = (a(b\mid aa)^*ab)^* $

Both E1 and E2 have an infinite language, so listing all the words and comparing them is not an option. We can try looking for words that are accepted by one and not by the other, but we can't easily find such words.

! This does not mean they are equivalent !

We should transform each expression into its minimal DFA and check whether the two DFAs are the same (number of states, transitions, alphabet, initial/final states), up to a renaming of the states.

Plot twist: They are the same.
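The search for a distinguishing word can be automated for short words. The sketch below is a bounded brute-force comparison written with Python's ''re'' module; the names ''language'', ''difference'' and ''PAIRS'' are invented for this example, and ε is written as an empty alternative. Finding a word proves that two expressions differ, but finding nothing only means that no counterexample exists up to the chosen length, so it does not replace the minimal DFA comparison.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All words over `alphabet` of length <= max_len that `pattern` matches entirely."""
    rx = re.compile(pattern)
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return {w for w in words if rx.fullmatch(w)}


def difference(e1, e2, alphabet, max_len=6):
    """Up to 3 of the shortest words accepted by exactly one pattern (bounded search)."""
    diff = language(e1, alphabet, max_len) ^ language(e2, alphabet, max_len)
    return sorted(diff, key=lambda w: (len(w), w))[:3]


# (E1, E2, alphabet) for each exercise, in an assumed Python spelling.
PAIRS = {
    "3.3.1": ("ab|a|b", "(a|)(b|)", "ab"),
    "3.3.2": ("a(b|c)(d|e)|abb|abc", "ab(b|c|d|e)|acd|ace", "abcde"),
    "3.3.3": ("(a|b)*aa*|", "(a|ba)*(b|ba)*", "ab"),
    "3.3.4": ("((ab*a)+b)*", "(a(b|aa)*ab)*", "ab"),
}

for name, (e1, e2, alphabet) in PAIRS.items():
    # An empty list means: no distinguishing word of length <= 6 was found.
    print(name, difference(e1, e2, alphabet))
```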
The purpose of this exercise is to understand how to approach regex equivalence, not how to solve this given comparison per se.

==== Conclusion ====

What have we learned today?

**Plural of regex is regrets**