====== Regular languages ======

===== Definition =====

In the previous lectures, we have introduced **regular expressions**, **NFAs** and **DFAs** as finite representations for languages, and showed the following links between them.
  * $math[E \rightarrow NA] - each regular expression $math[e] can be transformed to an **NFA** $math[M] such that $math[L(e) = L(M)].
  * $math[NA \rightarrow DA] - each **NFA** $math[M] can be transformed to a **DFA** $math[M'] such that $math[L(M) = L(M')].

We introduce the following:
  * a language $math[L] is called **regular** if it can be generated by a regular expression, i.e. $math[L=L(E)] for some regular expression $math[E]. Denote by $math[LR\subset 2^{\Sigma^*}] **the set of regular languages**;
  * denote by $math[L(NFA)\subset 2^{\Sigma^*}] the set of languages which can be accepted by NFAs, and
  * denote by $math[L(DFA)\subset 2^{\Sigma^*}] the set of languages which can be accepted by DFAs.

It is both formally and practically important to understand the limits of regular expressions and automata (of different types) in capturing languages.

We have already seen that regular expressions are countable while languages are not. Can automata capture more languages than regular expressions? Our lecture has so far proven the following:
  * $math[LR \subseteq L(NFA) \subseteq L(DFA)]

To this we add the following observation:
  * $math[L(DFA) \subseteq L(NFA)]. If a language can be accepted by a DFA, it also can (trivially) be accepted by an NFA, since the latter automata extend the former. 

Therefore, we have shown that **NFAs and DFAs accept the same languages**, i.e. $math[L(NFA) = L(DFA)]. In other words, if a language $math[L] is accepted by some DFA $math[M] ($math[L=L(M)]), then it can also be accepted by some NFA, and vice-versa.
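
The inclusion $math[L(DFA) \subseteq L(NFA)] can be illustrated with a minimal Python sketch. It assumes a DFA transition function stored as a dictionary from (state, symbol) pairs to states, and an NFA without $math[\epsilon]-transitions stored as a dictionary from such pairs to //sets// of states. The names ''dfa_to_nfa'' and ''nfa_accepts'', as well as the example automaton, are hypothetical and used for illustration only.

```python
def dfa_to_nfa(delta):
    """Wrap each DFA target state in a singleton set of NFA target states."""
    return {(q, c): {target} for (q, c), target in delta.items()}


def nfa_accepts(nfa_delta, start, finals, word):
    """Simulate an NFA (no epsilon-transitions) by tracking all reachable states."""
    current = {start}
    for c in word:
        current = {t for q in current for t in nfa_delta.get((q, c), set())}
    return bool(current & finals)


# Hypothetical DFA over {a, b}: state 1 is initial and final,
# and the automaton accepts words with an even number of a's.
dfa_delta = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}
nfa_delta = dfa_to_nfa(dfa_delta)
```

Since every set of target states is a singleton, the NFA simulation follows exactly the single run of the original DFA.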

It remains to establish the relationship between $math[LR] and $math[L(DFA)] (or equivalently $math[L(NFA)]).

===== Equivalence between Regular Expressions and Automata =====

$justtheorem
Let $math[M] be a DFA. There exists a **regular expression** $math[E], such that $math[L(E)=L(M)]. 
$end

To prove the theorem, we rely on:
  * a //naming scheme// for states. We assume a state $math[q_i] is identified by its **index** $math[i]. **The indexes in our proof start with 1**, hence $math[1] is the initial state. How states are ordered, or their //kind// (final/nonfinal) is unimportant, however we use the same ordering throughout the proof;
  * a //naming scheme// for //partial// regular expressions: We label $math[R_{ij}^{(k)}] the **regular expression** such that its **language** is the set of **words** that label a **path** from state $math[i] to $math[j]. Moreover, the intermediate states of the path (all states other than its endpoints $math[i] and $math[j]) cannot have an **index larger than $math[k]**.

We prove the following:

$justprop
Given a DFA $math[M] with $math[n] states, for all states $math[i,j] of $math[M] and all $math[k \in \{0, \ldots, n\}], there exists a **regular expression** $math[R_{ij}^{(k)}], which satisfies the above conditions.
$end

$proof
The proof is by induction **over $math[k]**.
**Basis case:** $math[k=0].
If $math[i\neq j], then a path from $math[i] to $math[j] with no intermediate states consists of **exactly** one transition, hence:
  * $math[R^{(0)}_{ij} = \emptyset] if no transition exists from $math[i] to $math[j]
  * $math[R^{(0)}_{ij} = c_1 \cup \ldots \cup c_m] if **one or more** transitions exist from $math[i] to $math[j], on symbols $math[c_1] to $math[c_m].

If $math[i = j], then:
  * a path from $math[i] to $math[i] may contain **zero** transitions, which contributes $math[\epsilon]
  * a path from $math[i] to $math[i] may contain exactly one transition (a self-loop), and the construction follows the above rules, yielding some regular expression $math[E_0].
We combine the two situations in a single one: $math[R^{(0)}_{ii} = \epsilon \cup E_0], where $math[E_0] is constructed as above.

**Induction step**:
By //induction hypothesis//, we assume there exist regular expressions $math[R^{(k-1)}_{ij}] that satisfy our designated constraints, in $math[M].

We build $math[R^{(k)}_{ij}], for each possible pair of states $math[i,j] in $math[M].
  - a path from $math[i] to $math[j] may pass only through intermediate states whose index is **smaller than $math[k]**. In this case: $math[R^{(k)}_{ij} = R^{(k-1)}_{ij}]
  - a path from $math[i] to $math[j] **passes through $math[k] one or more times**. This path can be decomposed into the following parts:
     * a path from $math[i] to $math[k] which only visits states $math[<k], identified by the regular expression $math[R_{ik}^{(k-1)}]
     * **zero or more** paths from $math[k] to $math[k] which only visit states $math[<k], **each** identified by: $math[R_{kk}^{(k-1)}]
     * a path from $math[k] to $math[j] which only visits states $math[<k], identified by: $math[R_{kj}^{(k-1)}]. 

The induction hypothesis guarantees that all regular expressions involved in the above construction(s) can be properly built. Hence, we assemble $math[R_{ij}^{(k)}] by combining the two aforementioned cases:

$math[\displaystyle R_{ij}^{(k)} = R_{ij}^{(k-1)} \cup R_{ik}^{(k-1)}(R_{kk}^{(k-1)})^*R_{kj}^{(k-1)}]
$end

The proof of our theorem consists in building the regular expression:

$math[\displaystyle E = \bigcup_{i\in F}R_{1i}^{(n)}]

where $math[n] is the total number of states in $math[M] and $math[F] is its set of final states. According to our proposition, this expression describes the words labelling **all paths that start in the initial state, end in a final state, and may visit any state of $math[M]**.
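
The construction from the proof can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the lecture: regular expressions are represented as strings in the syntax of Python's ''re'' module, $math[\emptyset] is represented by ''None'', $math[\epsilon] by the empty string, and the helper names (''union'', ''concat'', ''star'', ''dfa_to_regex'') are hypothetical. States are numbered from 1 to $math[n], with 1 the initial state, as in the proof.

```python
import re

EMPTY = None  # stands for the empty language
EPS = ""      # stands for the regular expression epsilon


def union(a, b):
    if a is EMPTY:
        return b
    if b is EMPTY:
        return a
    if a == b:
        return a
    return f"(?:{a}|{b})"


def concat(a, b):
    # Concatenating anything with the empty language yields the empty language.
    if a is EMPTY or b is EMPTY:
        return EMPTY
    return a + b


def star(a):
    # The star of the empty language, and of epsilon, is epsilon.
    if a is EMPTY or a == EPS:
        return EPS
    return f"(?:{a})*"


def dfa_to_regex(n, delta, finals):
    """delta maps (state, symbol) -> state; states are 1..n, state 1 is initial."""
    states = range(1, n + 1)
    # Basis case k = 0: epsilon on the diagonal, plus all direct transitions.
    R = {(i, j): (EPS if i == j else EMPTY) for i in states for j in states}
    for (i, c), j in sorted(delta.items()):
        R[i, j] = union(R[i, j], re.escape(c))
    # Induction step: additionally allow state k as an intermediate state.
    for k in states:
        R = {(i, j): union(R[i, j],
                           concat(concat(R[i, k], star(R[k, k])), R[k, j]))
             for i in states for j in states}
    # E is the union of R_{1i}^{(n)} over all final states i.
    E = EMPTY
    for i in sorted(finals):
        E = union(E, R[1, i])
    return E


# Hypothetical DFA over {a, b} accepting words with an even number of a's.
delta = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}
E = dfa_to_regex(2, delta, {1})
```

The resulting expression can be checked against the DFA with ''re.fullmatch(E, word)''. No simplification is performed beyond the trivial cases, so the expression grows quickly with the number of states.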

===== Regular languages =====

We have completed an extensive investigation into languages defined via:
  * regular expressions
  * nondeterministic FA
  * deterministic FA

and established that these three instruments for defining languages are **equivalent**. An important observation is that languages in general support two //kinds// of definitions:
  * via **generators**: for instance, regular expressions are **generators** for regular languages. They describe how **words of the given language** can be built;
  * via **acceptors**; they are, informally, //machines//. NFAs and DFAs are **acceptors** for regular languages. They describe how **words can be tested for membership** in a given language.

Both generators and acceptors are useful when working with a particular class of languages.
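
The difference between the two views can be illustrated with a small Python sketch for the language $math[L((ab)^*)], chosen here only as an example. The function names and the DFA encoding are hypothetical: the generator builds the words of the language one after another, while the acceptor runs a DFA on a given word.

```python
from itertools import islice


def generate_ab_star():
    """Generator view: build the words of L((ab)*) one after another."""
    word = ""
    while True:
        yield word
        word += "ab"


def accepts_ab_star(word):
    """Acceptor view: run a DFA for L((ab)*) on the input word."""
    # State 1 is initial and final; state 2 means an 'a' was just read.
    # Missing entries stand for transitions into an implicit sink state.
    delta = {(1, "a"): 2, (2, "b"): 1}
    state = 1
    for c in word:
        state = delta.get((state, c))
        if state is None:  # the run fell into the sink state
            return False
    return state == 1


first_words = list(islice(generate_ab_star(), 4))
```

Every word produced by the generator is accepted by the acceptor, and words such as ''aba'' are rejected.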

===== When is a language regular? =====

We already know that a language is regular iff it can be defined via a regular expression, or an automaton of either kind. However, what //intrinsic feature// do regular languages capture?
  * although we shall explore this in more detail later, we can already state that words in a regular **language** exhibit '//regularities//' which can be observed without **being required to count**, or to have some form of **memory** available. We shall return to this intuition.

Interesting questions regarding languages arise:
  - When is a language **regular**? 
  - When is a language **not** regular?

We can answer the first question by constructing a regular expression, NFA or DFA to capture the language. However, in practice, a few other tools serve this purpose better.

===== Closure properties of languages =====

Although $math[LR \subseteq L(DFA)] has already been proven in the former two lectures, there is another way of establishing this, which has further applications. This second means is related to **closure properties** of languages.

Generally, a **set** $math[A \subseteq X] is **closed under a transformation ($math[T:X\rightarrow X]) or operation ($math[O:X\times X \rightarrow X])** iff, by performing the transformation/operation on member(s) $math[a] ($math[b]) of $math[A], we obtain an element in the same set, i.e. $math[T(a) \in A] or $math[O(a,b) \in A].

Here, the set at hand is $math[L(DFA)], and the transformations are:
  * Kleene Star
  * complement

and the operations are:
  * union
  * concatenation
  * intersection

$justtheorem
If $math[L\in L(DFA)], then $math[L^*\in L(DFA)] and also $math[\overline{L}\in L(DFA)]. By $math[\overline{L}], we refer to the complement of the language $math[L], with respect to $math[\Sigma^*]: $math[\overline{L}=\Sigma^* \setminus L].

If $math[L_1,L_2\in L(DFA)] then the languages $math[L_1 \cup L_2], $math[L_1L_2] (language concatenation) and $math[L_1 \cap L_2] are also members of $math[L(DFA)].
$end
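
For complement, intersection and union, the closure can be illustrated with two standard constructions on complete DFAs: swapping final and non-final states, and running two DFAs in lockstep (the product construction). The Python sketch below is only an illustration and not necessarily the proof used in this lecture. It assumes a DFA is stored as a tuple ''(states, alphabet, delta, start, finals)'' with a total transition function, and all names and example automata are hypothetical. Concatenation and Kleene star are not covered by this sketch.

```python
def complement(dfa):
    """Same complete DFA, with final and non-final states swapped."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)


def product(dfa1, dfa2, keep):
    """Run both DFAs in lockstep; keep(in_f1, in_f2) decides which pairs are final."""
    q1, alphabet, d1, s1, f1 = dfa1
    q2, _, d2, s2, f2 = dfa2
    states = {(p, q) for p in q1 for q in q2}
    delta = {((p, q), c): (d1[p, c], d2[q, c])
             for (p, q) in states for c in alphabet}
    finals = {(p, q) for (p, q) in states if keep(p in f1, q in f2)}
    return (states, alphabet, delta, (s1, s2), finals)


def accepts(dfa, word):
    """Simulate a complete DFA on the given word."""
    _, _, delta, start, finals = dfa
    state = start
    for c in word:
        state = delta[state, c]
    return state in finals


# Hypothetical example DFAs over {a, b}.
sigma = {"a", "b"}
even_a = ({1, 2}, sigma,
          {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}, 1, {1})
ends_b = ({1, 2}, sigma,
          {(1, "a"): 1, (1, "b"): 2, (2, "a"): 1, (2, "b"): 2}, 1, {2})

both = product(even_a, ends_b, lambda x, y: x and y)   # intersection
either = product(even_a, ends_b, lambda x, y: x or y)  # union
odd_a = complement(even_a)                             # complement
```

The only difference between the intersection and the union automaton is the choice of final states in the product.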