====== Languages and Regular Expressions ======

==== Motivation ====

Regular expressions are a means for **defining tokens** (roles) during the lexical phase. Token definition is instrumental in the development of **parser generators** (e.g. ANTLR).

Let $math[\Sigma] be an **alphabet**, i.e. a finite set whose elements we call **symbols** or **characters**. In parsing, the alphabet we work with is naturally the (possibly extended) set of ASCII symbols. During the lecture, we will often use simpler alphabets such as the binary alphabet, $math[\{a,b,c\}], etc.

==== Regular expressions - tentative definition ====

Let $math[\Sigma] be an alphabet.
  * $math[\emptyset] is a **regular expression** (abbreviated **reg.exp**).
  * any $math[c\in\Sigma] is a **regular expression**.
  * if $math[e] and $math[e'] are **regular expressions** then:
    * $math[(ee')] or simply $math[ee'] (concatenation) is a **regular expression**
    * $math[(e\cup e')] (union) is a **regular expression**
  * if $math[e] is a regular expression then $math[(e)^*] (Kleene Star) is a **regular expression**

It may be convenient to view regular expressions as //members of an Abstract Datatype//, and each formation rule as a **constructor rule**.

An example:
  * $math[(A\cup B)(a\cup b)^*(0 \cup 1)^*] may be used to describe variable names, which must be any string starting with ''A'' or ''B'', followed by **a sequence** of ''a'''s and ''b'''s of **any length** (including 0), and followed by a sequence of ''0'''s and ''1'''s (again of any length).
    * Thus, ''Abbb1'' is a correct variable name.
    * ''Ba'' is also a correct variable name (''a'' matches $math[(a \cup b)^*], while **the empty string** matches $math[(0 \cup 1)^*]).
    * ''aaa01'' is **invalid** since it starts with neither ''A'' nor ''B''.
    * ''Aa01a'' is also **invalid** since symbols such as ''a'' or ''b'' are not allowed after the digit sequence.

=== Precedence ===

Regular expressions such as $math[01^*\cup 1] may be ambiguous. If the ADT notation were employed, e.g.
$math[union(kleene(concat(0,1)),1)], there would be no ambiguity. However, such a notation is cumbersome, and for this reason we adopt the following order of precedence for the construction rules, from highest to lowest:
  * Kleene Star
  * concatenation
  * union
Under this convention, $math[01^*\cup 1] is read as $math[(0(1^*))\cup 1], i.e. $math[union(concat(0,kleene(1)),1)]. We shall also use parentheses wherever necessary.

$justexercise Write a regular expression capturing all sequences of alternating 0s and 1s. $end

$sol One tentative solution would be $math[(01)^*]; however, it is incomplete, as sequences such as $math[1010] cannot be generated. A complete solution is $math[(01)^*\cup(10)^*\cup 0(10)^*\cup 1(01)^*], which can also be written as $math[(1\cup \epsilon)(01)^*\cup(0\cup \epsilon)(10)^*]. Here the regular expression $math[\epsilon] accepts only the empty word; it can be expressed as $math[\emptyset^*]. We may further simplify the solution to $math[(1\cup \epsilon)(01)^*(0\cup \epsilon)]. Notice that several regular expressions may capture exactly the same sequences. $solend

$justexercise Write a regular expression capturing all sequences which do **not** contain adjacent ones. $end

$sol One alternative is $math[0^*(100^*)^*(\epsilon\cup 1)], but $math[(10\cup 0)^*(\epsilon\cup 1)] is also a correct answer. The final $math[(\epsilon\cup 1)] is needed in both cases, since a sequence may end with a single one. Note that going from one regular expression to the other is not trivial in this particular case. $solend

In the above examples, we looked at several **words** and checked whether they are **accepted** by a regular expression.
  * A **word** $math[w] (over $math[\Sigma]) is a finite, possibly 0-length, sequence of symbols ($math[w\in\Sigma^*]). See the Algorithms and Complexity Theory lecture [1].

===== Languages =====

Formally, a **language** is a **subset** of $math[\Sigma^*], that is, a **possibly infinite set** of words. Formal languages are a powerful instrument which finds usage beyond compilers:
  - So far, we have seen that a **formal language** models a valid set of tokens (specified e.g. using Regular Expressions). Thus, the membership $math[w\in L] tells us that token $math[w] has indeed **the role modelled by $math[L]**.
  - At the same time, **formal languages** are models for programming languages, and words are models for programs. The membership $math[w\in L] models the fact that program $math[w] is indeed a valid program of the programming language $math[L]. It remains to be seen whether Regular Expressions can express the constraints suitable for modern programming languages.
  - **Formal languages** are models for natural language(s), and a great deal of interest in them came from linguistics.
  - Finally, **formal languages** are models for **decision problems**. Recall that each word $math[w] can be viewed as describing a **problem instance** (e.g. a graph of the k-Vertex-Cover problem, together with a value k, or a CNF formula of the SAT problem). In this case, the membership $math[w\in L] models the fact that the answer to the problem instance $math[w] is //yes//. Thus $math[L] is the set of all //yes//-instances of a problem.

Also, it is interesting to have a picture of the **//space of languages//**. We already know that a **language** is an enumerable set (possibly finite). However, is the **set of all languages** enumerable?

$justprop The set $math[2^{\Sigma^*}] of languages is not enumerable. $end

$proof Suppose the set of languages is enumerable, and consider $math[L_1, L_2, \ldots, L_n, \ldots] and $math[w_1, w_2, \ldots, w_n, \ldots], the enumerations of languages and words, respectively. We construct the following language:

$math[L^* = \{w_i \mid w_i\not\in L_i\}]

(here the star is only part of the name of the language, not a Kleene star). Informally, we obtain $math[L^*] by taking each word $math[w_i] from $math[\Sigma^*] and checking whether $math[w_i\in L_i]. If this is so, then we ignore $math[w_i] and move on. Otherwise, we add $math[w_i] to $math[L^*].

Since we assumed that $math[2^{\Sigma^*}] is enumerable, $math[L^*] must be some language $math[L_k] in the enumeration.
So, we select $math[w_k] (the word with the same index as $math[L_k = L^*]), and we ask whether $math[w_k \in L_k]:
  * if the answer is //yes//, then by definition of $math[L^*], we have $math[w_k \not \in L^*]. Since $math[L^* = L_k], this is a contradiction.
  * if the answer is //no//, then by the very same definition, we have $math[w_k \in L^*]. Since $math[L^* = L_k], this is again a contradiction.
Hence the set of languages is not enumerable. $end

This very simple proof has powerful implications. For example, regular expressions are essentially words over some alphabet, hence they are enumerable. Our proof shows that there are strictly more languages than regular expressions: the set of regular expressions is enumerable, while the set of languages is not. Hence, some languages cannot be captured by any regular expression. Intuitively, such languages are //harder to define// and require a more complex apparatus. Many interesting questions arise from this observation:
  * can we afford to define any kind of programming language, without affecting the parsing process?
  * can we build parsers for natural language?
  * what is the relationship between Languages and Complexity Theory?

We shall address most of these questions throughout the lecture. But for now, we return to regular expressions.

==== Computing L(e) - a semantics for regular expressions ====

We easily notice that:
  * a **regular expression** $math[e] **uniquely identifies** the **language** $math[L(e)] consisting of the words which are accepted by $math[e].
  * therefore there is a **map** which assigns to each regular expression $math[e] the language $math[L(e)].

In order to define the aforementioned map, we introduce a few **operations on languages**. Let $math[A,B\subseteq \Sigma^*] be two languages:
  * **concatenation**: the language $math[AB] is defined as $math[\{ww' \mid w \in A, w'\in B\}], i.e. the set of words consisting of a word from $math[A] followed by a word from $math[B].
  * **union**: the language $math[A\cup B] is simply the union of the two languages, that is, the set of words which belong to $math[A] or to $math[B].
  * **kleene-star**: the language $math[A^*] is the set $math[\{w \in\Sigma^* \mid w = w_1\ldots w_n, n\geq 0, w_1, \ldots, w_n \in A\}] of all concatenations of zero or more words from $math[A]. In particular, $math[A^*] always contains the empty word (the case $math[n=0]).

Next, we can define rules for determining the language generated by a regular expression.
  * the language generated by the regular expression $math[\emptyset] is simply the empty language $math[\emptyset].
  * the language generated by the regular expression $math[c], for $math[c\in\Sigma], is $math[L(c)=\{c\}] (a single-word set).
  * $math[L(e^*)] is the language $math[(L(e))^*], i.e. the **kleene star** of the language $math[L(e)].
  * $math[L(ee')] is $math[L(e)L(e')]
  * $math[L(e\cup e')] is $math[L(e) \cup L(e')]

Returning to our previous example, the language $math[L((A\cup B)(a\cup b)^*(0 \cup 1)^*)] is:
  * $math[L(A\cup B) L((a\cup b)^*) L((0 \cup 1)^*)], i.e.
  * $math[L(A\cup B) (L(a\cup b))^* (L(0 \cup 1))^*], i.e.
  * $math[(L(A)\cup L(B)) (L(a) \cup L(b))^* (L(0) \cup L(1))^*], i.e.
  * $math[(\{A\}\cup \{B\}) (\{a\} \cup \{b\})^* (\{0\} \cup \{1\})^*]

==== Properties of languages ====

$math[2^{\Sigma^*}] is the **set of languages** over $math[\Sigma]. Let $math[E] be the set of regular expressions. We have defined **a semantics** for regular expressions, i.e. the map:
  * $math[L : E \rightarrow 2^{\Sigma^*}]

This map is powerful because it shows that we can **assign a finite representation** (the regular expression) to a possibly **infinite object** (the language).

An interesting question, which we shall examine in detail in a later lecture, is **which languages** can be described by a **regular expression**; the counting argument above already shows that not all of them can.

===== References =====

  - [[https://ocw.cs.pub.ro/courses/aa|Algorithms & Complexity Theory]]
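
===== Appendix: a sketch of the semantics in Python =====

The constructor view of regular expressions and the rules for computing $math[L(e)] can be illustrated with a short Python sketch. It is only an illustration and not part of the lecture material: the class names (''Empty'', ''Sym'', ''Concat'', ''Union'', ''Star'') and the functions ''language'' and ''accepts'' are hypothetical names chosen here. Since $math[L(e)] may be infinite, the sketch computes only the words of $math[L(e)] up to a given length.

```python
from dataclasses import dataclass

# One class per constructor rule (illustrative names, not from the lecture).
@dataclass(frozen=True)
class Empty:          # the regular expression "emptyset"
    pass

@dataclass(frozen=True)
class Sym:            # a single symbol c of the alphabet
    c: str

@dataclass(frozen=True)
class Concat:         # e e'
    left: object
    right: object

@dataclass(frozen=True)
class Union:          # e U e'
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # e*
    inner: object


def language(e, max_len):
    """Return the set of words of L(e) whose length is at most max_len."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Sym):
        return {e.c} if max_len >= 1 else set()
    if isinstance(e, Union):
        return language(e.left, max_len) | language(e.right, max_len)
    if isinstance(e, Concat):
        a, b = language(e.left, max_len), language(e.right, max_len)
        return {u + v for u in a for v in b if len(u + v) <= max_len}
    if isinstance(e, Star):
        # Dropping the empty word from the base guarantees termination.
        base = language(e.inner, max_len) - {""}
        words = {""}              # zero concatenations give the empty word
        frontier = {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - words
            words |= frontier
        return words
    raise TypeError(f"not a regular expression: {e!r}")


def accepts(e, w):
    """Decide whether w is in L(e) by enumerating L(e) up to len(w)."""
    return w in language(e, len(w))


def sym_star(x, y):
    """Helper building (x U y)* for two symbols x and y."""
    return Star(Union(Sym(x), Sym(y)))


# (A U B)(a U b)*(0 U 1)*
ident = Concat(Concat(Union(Sym("A"), Sym("B")), sym_star("a", "b")),
               sym_star("0", "1"))

# Two solutions of the "alternating 0s and 1s" exercise.
eps = Star(Empty())               # epsilon expressed as emptyset*
zero, one = Sym("0"), Sym("1")
s01, s10 = Star(Concat(zero, one)), Star(Concat(one, zero))
long_sol = Union(Union(s01, s10),
                 Union(Concat(zero, s10), Concat(one, s01)))
short_sol = Concat(Concat(Union(one, eps), s01), Union(zero, eps))

if __name__ == "__main__":
    for w in ["Abbb1", "Ba", "aaa01", "Aa01a"]:
        print(w, accepts(ident, w))
    print(language(long_sol, 6) == language(short_sol, 6))
```

The brute-force enumeration is exponential in general and is meant only to mirror the definitions; it is not how lexers match tokens in practice. Comparing two expressions up to a bounded length, as in the last line, can expose a difference between them but does not prove that they are equivalent.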