====== An introduction to Flex ======

===== Compiling Flex =====

Consider the following Flex application, which counts newlines, words and bytes, in a fashion similar to the ''wc'' (word count) Unix tool.

<code>
%option noyywrap
%{
#include <unistd.h>

/* counters for bytes, words and newlines */
int chars = 0;
int words = 0;
int lines = 0;
%}

%%

[a-zA-Z]+  { words++; chars += strlen(yytext); }
\n         { chars++; lines++; }
.          { chars++; }

%%

int main(int argc, char **argv) {
    yylex();
    printf("%8d%8d%8d\n", lines, words, chars);
    return 0;
}
</code>

===== The structure of a Flex file =====

A Flex file contains three sections, separated by lines containing only ''%%''.

  - The **first** section contains declarations and option settings. In our example, the option ''noyywrap'' (discussed later) has been set. The code between ''%{'' and ''%}'' is copied as-is into the generated C code (details below). In our example, we include ''unistd.h'' (which contains type definitions such as ''size_t'' and POSIX operations relevant for Flex), and define three variables which hold the counters for newlines, words and bytes.
  - The **second** section contains **patterns** (similar to regular expressions). Each pattern starts at the beginning of a line. A pattern is immediately followed by its **action** (C code to execute when the pattern is matched). We shall discuss patterns and actions in detail later.
  - The **third** section contains C code (here, the ''main'' function) which is copied to the generated source file.

===== Compiling a Flex file =====

Suppose the name of our file is ''sample.lex''. The command:

<code>
flex sample.lex
</code>

will generate the C program ''lex.yy.c''. The ''yy'' in the name is related to the fact that Flex was designed to be used in conjunction with **Yacc** (or Bison). Yacc (//Yet another compiler compiler//) is a **parser generator** written for Unix. ''lex.yy.c'' contains the resulting lexical analyser; in our case, a newline, word and byte counter.
==== The option noyywrap ====

By default, a Flex-generated scanner must be linked with a small library called ''libfl'' (linker flag ''-lfl''), which contains a default ''main'' function as well as a default ''yywrap'' function. Once the input of the analyser has been completely processed, ''yywrap'' is called. If multiple files are to be processed, ''yywrap'' returns 0 (after pointing the scanner at the next input), in order to resume scanning. Otherwise (a single file was processed), it returns 1.

Nowadays, ''libfl'' is kept in Flex mainly for backwards compatibility. Programmers can avoid linking against it by setting the option ''noyywrap'', as our example does. The alternative is to define a ''yywrap'' function which returns 0 or 1 as desired.

The command:

<code>
gcc -o exefile lex.yy.c
</code>

will produce an executable file ''exefile'' which, once executed, behaves as follows:

  * it reads input text from the console;
  * ''Ctrl+D'' signals the end of the input text; the output of the analyser is subsequently shown.

==== Reading input from file ====

To read the input from a file, modify the ''main'' function as follows:

<code>
int main(int argc, char **argv) {
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "Missing input file name\n");
        return 1;
    }
    f = fopen(argv[1], "r");
    if (f == NULL) {
        fprintf(stderr, "Cannot open %s\n", argv[1]);
        return 1;
    }
    yyrestart(f);   /* make the scanner read from f */
    yylex();
    fclose(f);
    printf("%8d%8d%8d\n", lines, words, chars);
    return 0;
}
</code>

Here, the call ''yyrestart(f)'' switches the scanner input to the file pointer ''f''.

===== Writing your own Flex file =====

The interesting part of a Flex file is the pattern (or regular-expression) definitions.

==== Patterns ====

In our example, we have defined the patterns:

  * ''[a-zA-Z]+'' : one or more alphabetic symbols, lower-case or upper-case (a word);
  * ''\n'' : the newline symbol;
  * ''.'' : any single character except the newline.

==== Actions ====

We can assign to each pattern an **action**, i.e. C code to be executed when the pattern is matched. For instance, the action:

<code>
{ words++; chars += strlen(yytext); }
</code>

assigned to the word pattern, increments the word counter and adds the length of the matched word to the character (byte) counter. ''yytext'' is a pointer to the string matched by the pattern at hand.
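To make the interplay between patterns and actions more concrete, the following plain-C sketch reproduces by hand what the three rules of the counter do. It is only an illustration: the names ''Counts'', ''is_letter'' and ''count_text'' are hypothetical and not part of Flex, and a generated scanner works with a table-driven automaton rather than with code of this shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the three counters of the example. */
typedef struct Counts {
    int chars;
    int words;
    int lines;
} Counts;

/* Plays the role of the character class [a-zA-Z]. */
static int is_letter(char ch)
{
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}

/* Hand-written stand-in for the three rules of the counter. */
Counts count_text(const char *text)
{
    Counts c = {0, 0, 0};
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0') {
        if (is_letter(text[i])) {
            /* rule [a-zA-Z]+ : take the longest run of letters */
            size_t len = 0;
            while (is_letter(text[i + len]))
                len++;
            c.words++;
            c.chars += (int)len;   /* corresponds to strlen(yytext) */
            i += len;
        } else if (text[i] == '\n') {
            /* rule \n */
            c.chars++;
            c.lines++;
            i++;
        } else {
            /* rule . : any other single character */
            c.chars++;
            i++;
        }
    }
    return c;
}
```

For the text ''"hello world\nfoo 42\n"'', ''count_text'' reports 2 lines, 3 words and 19 characters. The digits ''4'' and ''2'' are counted as characters only, since they are matched by the ''.'' rule and not by the word rule.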
===== Writing an analyser for arithmetic expressions =====

==== Stage 1 - Recognising tokens ====

One possible implementation for a token recogniser is given below. Explanations follow:

<code>
%option noyywrap
%{
#include <stdio.h>

/* prints the matched text, using s as a printf format */
void show_text(char s[]){
    printf(s,yytext);
}
%}

op          "+"|"MOD"
alfastream  [a-zA-Z]+
digitstream [0-9]+
var         [A-Z]{alfastream}?{digitstream}?

%%

{op}    {show_text("Op(%s)");}
{var}   {show_text("Var(%s)");}
"("     {show_text("((");}
")"     {show_text("))");}
.       { /* ignore any other character */ }

%%

int main(int argc, char **argv) {
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "Cannot open input file\n");
        return 1;
    }
    yyrestart(f);
    yylex();
    fclose(f);
    return 0;
}
</code>

==== Pattern names ====

It is often convenient to assign names to specific patterns. For instance, in the **declarations part**, we have created the names ''alfastream'', ''digitstream'', ''var'' and ''op''. These pattern **names** can be freely reused later. For instance, the pattern:

<code>
var [A-Z]{alfastream}?{digitstream}?
</code>

defines strings that start with an upper-case letter, followed by **zero or one** occurrence of the pattern ''alfastream'', followed by **zero or one** occurrence of the pattern ''digitstream''. Similarly,

<code>
op "+"|"MOD"
</code>

defines the pattern ''op'', which can be either the string "+" or the string "MOD". (Careful: introducing whitespace in a pattern definition, e.g. ''"+" | "MOD"'', will produce syntax errors.)

==== Actions and Ambiguous Patterns ====

The actual rules are found in the second part of the Flex file. The only action for each matched pattern is to show it, via the ''show_text'' function.

It is possible for patterns to contain ambiguities. For instance, the input:

<code>
Variable01
</code>

may match the pattern ''{alfastream}'' (which has no rule of its own in the code above) with ''Variable'', as well as the pattern ''{var}'' with ''Variable01''.

  * **Flex will always match the longest possible string**.

In our example, assuming a rule for ''{alfastream}'' were defined, ''{var}'' would be matched and not ''{alfastream}''.

Similarly, the input:

<code>
MOD
</code>

may simultaneously match ''{var}'' as well as ''{op}'', with strings of the same length. In such cases:

  * **Flex will always prefer the first pattern which appears in the program**.

In our example, ''{op}'' appears before ''{var}'' in the //pattern// section, hence it is preferred.

We have also introduced the pattern ''.''. Since it is the last rule and matches a single character (the shortest possible match), it is selected only when no other pattern matches. We expect ''.'' to match whitespace. Note that ''.'' does not match the newline character; a newline is handled by Flex's default rule, which copies it to the output.

==== Stage 2 - An ad-hoc Flex-based parser ====

=== Data-structures ===

To represent expressions, we have opted for an ADT-style representation:

<code>
Atom   : String -> Expr
Par    : Expr -> Expr
Binary : Expr x Expr -> Expr
</code>

with three constructors, one for each kind of expression.

In C, we do not have inheritance, or other means for expressing super/sub-types. Hence, we assume a value of type ''Expr'' is a ''void''-pointer. Thus, we define:

<code>
typedef struct Atom{
    char* name;
} Atom;

typedef struct Par{
    void* inner;
    int sz;
} Par;

typedef struct Binary{
    void* left, *right;
    int left_sz, right_sz;
    int op;
} Binary;
</code>

In order to be able to recover type information from a ''void*'' value, we also add integer values ''sz'' (resp. ''left_sz'' and ''right_sz'') which hold the ''sizeof'' value of the contained object. For instance, in representing ''X + Y'', the values ''left'' and ''right'' will point to objects of type ''Atom'', hence ''left_sz''=''right_sz''=''sizeof(struct Atom)''.
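Using ''sizeof'' as a type tag only works if the three structures have pairwise different sizes. This holds on common platforms, but it is not something the C language guarantees. The sketch below, which repeats the three structure definitions to stay self-contained, shows how a size can be mapped back to a constructor and how the assumption can be checked. The helpers ''kind_of'', ''left_kind'' and ''tags_are_usable'' are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Atom { char *name; } Atom;
typedef struct Par { void *inner; int sz; } Par;
typedef struct Binary {
    void *left, *right;
    int left_sz, right_sz;
    int op;
} Binary;

/* Hypothetical helper: maps a size tag back to a constructor name. */
const char *kind_of(int sz)
{
    if (sz == (int)sizeof(Atom))   return "Atom";
    if (sz == (int)sizeof(Par))    return "Par";
    if (sz == (int)sizeof(Binary)) return "Binary";
    return "unknown";
}

/* Hypothetical helper: constructor of the left operand of a Binary. */
const char *left_kind(const Binary *b)
{
    assert(b != NULL);
    return kind_of(b->left_sz);
}

/* Returns 1 if the sizes can serve as tags: they must be pairwise
   different. We additionally require them to be larger than 3, on the
   assumption that small integers (0 to 3) are reserved for other
   markers stored in the same kind of field. */
int tags_are_usable(void)
{
    size_t a = sizeof(Atom), p = sizeof(Par), b = sizeof(Binary);
    return a != p && a != b && p != b && a > 3 && p > 3 && b > 3;
}
```

For ''X + Y'', a ''Binary'' whose ''left_sz'' is ''sizeof(Atom)'' yields ''"Atom"'' from ''left_kind''. If ''tags_are_usable'' ever returned 0 on some platform, an explicit integer tag per constructor would be a safer encoding.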
We also introduce helper functions for creating the respective objects:

<code>
void* make_atom (char s[]){...}
void* make_par (void* inner, int sz){...}
void* make_binary (void* left, int left_sz, int op, void* right, int right_sz){...}
</code>

In order to display an object (and hence test the correctness of the parsing), we rely on the following functions:

<code>
void show (void* ob, int sz){
    switch (sz){
        case ATOM: show_atom((Atom*)ob); break;
        case PAR: show_par((Par*)ob); break;
        case BINARY: show_binary((Binary*)ob); break;
    }
}

void show_atom (Atom* a){
    printf("Atom{%s}",a->name);
}

void show_par (Par* p){
    printf("(");
    show(p->inner,p->sz);
    printf(")");
}

void show_binary (Binary* b){
    show(b->left,b->left_sz);
    if (b->op == PLUS)
        printf(" + ");
    else
        printf(" MOD ");
    show(b->right,b->right_sz);
}
</code>

The interesting function is ''void show (void* ob, int sz)'', which, using the expression type stored in ''sz'', //downcasts// the object to be displayed. The rest of the functions simply display the different types of expressions, relying on recursive calls to ''show''.

Finally, the interesting part of the parser consists in verifying whether a correct expression has been read. In order to do so, our program relies on a **stack**. The possible values on the stack are:

  * an expression (i.e. a ''void*'' and an ''int'');
  * an open parenthesis (encoded as a null pointer and the pre-defined integer ''PAR_OPEN'');
  * a closed parenthesis (a null pointer and ''PAR_CLOSE'');
  * an operator (a null pointer and ''PLUS'' or ''MOD'').

Whenever an **open parenthesis** is read by Flex, we check the top of the stack:

  * if it contains another **open parenthesis** (thus we have ''...(('') or an **operator** (e.g. ''+(''), or if the stack is empty, we place the parenthesis on the stack;
  * otherwise, we have a syntax error.

Whenever an **operator** is read by Flex, we check the top of the stack:

  * if it contains an **expression** (of any type), then we place the operator on the stack;
  * otherwise, we signal a syntax error.

When a **variable** is read, we check the **top** of the stack:

  * if it contains an **open parenthesis**, it means we have read something similar to ''...(V''; we place the variable on the stack and continue. We do the same thing if **the stack is empty**.
  * if it contains an operator, it means we have read e.g. ''<expr> + V''; in this case, we pop the ''+'' and the ''<expr>'', we build the binary expression ''<expr> + V'', and place it on the stack. The correctness checks guarantee that the stack will have these values available.
  * otherwise, we signal a syntax error.

When a **closed parenthesis** is read, we look at the top of the stack again:

  * if the **last two** values of the stack correspond to ''( <expr>'', then we pop the expression and the open parenthesis, and push a new expression corresponding to ''(<expr>)'';
  * in any other case, we signal a syntax error (incorrectly matched parentheses).

Below, you can find an implementation which covers **some** of the syntax verifications mentioned above. The implementation should be able to build correctly written expressions.
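Before turning to the Flex file itself, the checks listed above can be summarised as a single decision function. The following is a minimal sketch, independent of the implementation given afterwards: the enumerations ''Top'' and ''Token'' and the function ''accepts'' are hypothetical names, and the function only decides whether a token is acceptable, without building any expression.

```c
#include <assert.h>

/* What the top of the stack may hold (hypothetical names). */
typedef enum { TOP_EMPTY, TOP_EXPR, TOP_OPEN, TOP_OPERATOR } Top;

/* The kinds of tokens delivered by the scanner (hypothetical names). */
typedef enum { TOK_OPEN, TOK_CLOSE, TOK_OPERATOR, TOK_VARIABLE } Token;

/* Returns 1 if token `tok` is acceptable when the top of the stack is
   `top`, and 0 if it is a syntax error. `below` describes the element
   right under the top; it only matters for a closed parenthesis. */
int accepts(Top top, Top below, Token tok)
{
    /* an empty stack has nothing below its (missing) top */
    assert(top != TOP_EMPTY || below == TOP_EMPTY);

    switch (tok) {
    case TOK_OPEN:
        /* "((", "+(" or an empty stack */
        return top == TOP_EMPTY || top == TOP_OPEN || top == TOP_OPERATOR;
    case TOK_OPERATOR:
        /* an operator must follow an expression */
        return top == TOP_EXPR;
    case TOK_VARIABLE:
        /* "(V", "<expr> + V" or an empty stack */
        return top == TOP_EMPTY || top == TOP_OPEN || top == TOP_OPERATOR;
    case TOK_CLOSE:
        /* the last two values must be "( <expr>" */
        return top == TOP_EXPR && below == TOP_OPEN;
    }
    return 0;
}
```

For example, ''accepts(TOP_EXPR, TOP_OPEN, TOK_CLOSE)'' returns 1, while a variable directly following an expression, ''accepts(TOP_EXPR, TOP_EMPTY, TOK_VARIABLE)'', returns 0.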
<code>
%option noyywrap
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* data Expr = Atom | (Expr) | Expr OP Expr */

/* markers stored in the sz field of stack elements which are not expressions */
#define PLUS 0
#define MOD 1
#define PAR_OPEN 2
#define PAR_CLOSE 3

/* expression tags: the size of a structure identifies its type */
#define ATOM sizeof(struct Atom)
#define PAR sizeof(struct Par)
#define BINARY sizeof(struct Binary)

typedef struct Atom{
    char* name;
} Atom;

typedef struct Par{
    void* inner;
    int sz;
} Par;

typedef struct Binary{
    void* left, *right;
    int left_sz, right_sz;
    int op;
} Binary;

void show_atom(Atom*);
void show_par(Par*);
void show_binary(Binary*);

void* make_atom (char s[]){
    Atom* a = (Atom*)malloc(sizeof(struct Atom));
    /* reserve and copy one extra byte for the string terminator */
    a->name = (char*)malloc(strlen(s) + 1);
    memcpy(a->name,s,strlen(s) + 1);
    return (void*)a;
}

void* make_par (void* inner, int sz){
    Par* p = (Par*)malloc(sizeof(struct Par));
    p->inner = inner;
    p->sz = sz;
    return (void*)p;
}

void* make_binary (void* left, int left_sz, int op, void* right, int right_sz){
    Binary* b = (Binary*)(malloc(sizeof(struct Binary)));
    b->left = left;
    b->right = right;
    b->left_sz = left_sz;
    b->right_sz = right_sz;
    b->op = op;
    return (void*)b;
}

void show (void* ob, int sz){
    switch (sz){
        case ATOM: show_atom((Atom*)ob); break;
        case PAR: show_par((Par*)ob); break;
        case BINARY: show_binary((Binary*)ob); break;
    }
}

void show_atom (Atom* a){
    printf("Atom{%s}",a->name);
}

void show_par (Par* p){
    printf("(");
    show(p->inner,p->sz);
    printf(")");
}

void show_binary (Binary* b){
    show(b->left,b->left_sz);
    if (b->op == PLUS)
        printf(" + ");
    else
        printf(" MOD ");
    show(b->right,b->right_sz);
}

/* the stack is a linked list of (value, tag) pairs */
typedef struct Stack_elem{
    void* val;
    int sz;
    struct Stack_elem* next;
}* Stack;

Stack push (void* val, int sz, Stack s){
    Stack sp = (Stack)malloc(sizeof(struct Stack_elem));
    sp->val = val;
    sp->sz = sz;
    sp->next = s;
    return sp;
}

Stack pop (Stack s){
    Stack sp = s->next;
    free(s);
    return sp;
}

int isEmpty (Stack s){
    return s==0;
}

Stack stack = 0;

void error (char s[]){
    printf("Error: %s \n",s);
    //yyterminate();
    //yyerror(s);
}

/*
   Takes an expression e and places it on the stack.
   As long as the top of the stack holds "e' op", we pop the operator
   and e', and push the binary expression e' op e instead.
*/
void prune_stack(void* e, int sz){
    if (isEmpty(stack)){
        stack = push(e, sz, stack);
    }
    else if (stack->sz == PLUS || stack->sz == MOD){
        int op = stack->sz;
        stack = pop(stack);
        if (isEmpty(stack)){
            error("Operator is not preceded by an operand");
            return;   /* nothing left on the stack to combine with */
        }
        Binary* b = make_binary(stack->val,stack->sz,op,e,sz);
        stack = pop(stack);
        prune_stack(b,BINARY);
    }
    else
        stack = push(e,sz,stack);
}

void add_variable(){
    prune_stack(make_atom(yytext),ATOM);
}

void close_par(){
    if (isEmpty(stack)){
        error("Closed parenthesis ends abruptly");
    }
    else{
        Par* p = make_par(stack->val,stack->sz);
        stack = pop(stack);
        if (isEmpty(stack))
            error("Closed parenthesis preceding expression ends abruptly");
        else{
            if (stack->sz != PAR_OPEN){
                error("Closed parenthesis preceding expression does not match an open one");
            }
            else{
                stack = pop(stack); //remove parenthesis
                prune_stack((void*)p,PAR);
            }
        }
    }
}
%}

alfastream  [a-zA-Z]+
digitstream [0-9]+
var         [A-Z]{alfastream}?{digitstream}?

%%

"+"     {stack = push(0,PLUS,stack); }
"MOD"   {stack = push(0,MOD,stack); }
{var}   {add_variable(); }
"("     {stack = push(0,PAR_OPEN,stack); }
")"     {close_par();}
.       { /* ignore any other character */ }

%%

int main(int argc, char **argv) {
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "Cannot open input file\n");
        return 1;
    }
    yyrestart(f);
    yylex();
    fclose(f);
    if (isEmpty(stack)){
        printf("Stack empty!\n");
    }
    else{
        show(stack->val,stack->sz);
        printf("\n");
    }
    return 0;
}
</code>

====== References ======

  - John Levine, //Flex & bison//, O'Reilly publishing. [[http://web.iitd.ac.in/~sumeet/flex__bison.pdf|link]]