An introduction to Flex

Consider the following Flex application, which counts newlines, words and bytes, in a fashion similar to the wc (word count) Unix tool.

%option noyywrap 
 
%{
 
#include <stdio.h>   /* printf */
#include <string.h>  /* strlen */
#include <unistd.h>
 
int chars = 0;
int words = 0;
int lines = 0;
 
%}
 
%%
 
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n { chars++; lines++; }
. { chars++; }
 
%%
 
int main(int argc, char **argv)
{
 yylex();   /* scan standard input until end of file */
 printf("%8d%8d%8d\n", lines, words, chars);
 return 0;
}

A Flex file contains three sections, separated by lines which contain only the %% symbols.

  1. The first section contains declarations and option settings. In our example, the option noyywrap (discussed later) has been set. The code between %{ and %} is copied verbatim into the generated C file (details below). In our example, we include unistd.h (which contains type definitions, e.g. size_t, and POSIX operations used by the Flex-generated code), and define three variables which hold the counters for newlines, words and bytes.
  2. The second section contains patterns (similar to regular expressions). Each pattern starts at the beginning of a line and is followed, on the same line, by its action (C code to execute when the pattern is matched). We shall discuss actions and patterns in detail later.
  3. The third section contains C code (here, the main function) which is copied verbatim into the generated source file.

Suppose the name of our file is sample.lex. The command:

flex sample.lex

will generate the C program lex.yy.c. The yy in the file name is related to the fact that Flex was designed to be used in conjunction with Yacc (or Bison). Yacc (Yet Another Compiler-Compiler) is a parser generator written for Unix. lex.yy.c contains the resulting lexical analyser: in our case, a newline, word and byte counter.

The option noyywrap

By default, a Flex-generated scanner must be linked with a small library called libfl (linker flag -lfl), which contains a default main function as well as a default yywrap function. Once the input of the analyser is completely processed, yywrap is called. If more input is to be processed (e.g. another file), yywrap sets up the next input and returns 0, so that scanning resumes. Otherwise (we processed a single file), it returns 1.

libfl is kept in Flex mainly for backwards compatibility. Programmers can avoid linking against libfl by setting the option noyywrap, as our example does; the scanner then behaves as if yywrap always returned 1. The alternative is to define a yywrap function which returns 0 or 1 as desired.
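
As an illustration of the second alternative, here is a minimal sketch of a user-defined yywrap which moves on to the next input file. It assumes that the option noyywrap is not set. The helper set_input_files and the variables remaining_files and remaining_count are hypothetical names introduced only for this sketch, and the definition of yyin is a stand-in which lets the sketch compile on its own (a real scanner already defines yyin in lex.yy.c).

```c
#include <stdio.h>

/* Stand-in so that this sketch compiles on its own. In a real scanner,
   lex.yy.c already defines yyin and this definition must be removed. */
FILE *yyin = NULL;

/* Hypothetical bookkeeping: the input files that have not been scanned yet. */
static char **remaining_files = NULL;
static int remaining_count = 0;

/* Hypothetical helper: registers the files that yywrap should move on to. */
void set_input_files(char **files, int count)
{
    remaining_files = files;
    remaining_count = count;
}

/* Called by the scanner when it reaches the end of its input.
   Returns 0 if yyin now points to more input, 1 if scanning is finished. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);              /* done with the previous file */
        yyin = NULL;
    }
    while (remaining_count > 0) {
        const char *name = *remaining_files++;
        remaining_count--;
        yyin = fopen(name, "r");
        if (yyin != NULL)
            return 0;              /* resume scanning on the next file */
        perror(name);              /* skip files that cannot be opened */
    }
    return 1;                      /* no more input */
}
```

In a scanner built this way, main would presumably open the first file itself, register the remaining ones with set_input_files, and then call yylex once; the counters would then accumulate over all files.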

The command:

gcc -o exefile lex.yy.c

will produce an executable file exefile which, when run, behaves as follows:

  • it reads input text from the console (standard input);
  • Ctrl+D signals the end of the input text, after which the output of the analyser is shown.

Reading input from file

To read the input from a file, modify the main function as follows:

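int main(int argc, char **argv)
{
	FILE *f = (argc > 1) ? fopen(argv[1], "r") : NULL;
	if (f == NULL) {   /* missing argument or unreadable file */
		fprintf(stderr, "usage: %s <readable-file>\n", argv[0]);
		return 1;
	}
	yyrestart(f);      /* make the scanner read from f */
	yylex();
	fclose(f);
	printf("%8d%8d%8d\n", lines, words, chars);
	return 0;
}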

Here, the call yyrestart(f) switches the scanner input to the file pointer f.

The interesting part of a Flex file is the pattern (or regular-expression) definitions.

Patterns

In our example, we have defined the patterns:

  • [a-zA-Z]+ : one or more alphabetic symbols, lower-case or upper-case;
  • \n : the newline symbol;
  • . : any single character except the newline.

Actions

We can assign to each pattern an action, i.e. C code to be executed when the pattern is matched. For instance, the action:

{ words++; chars += strlen(yytext); }

assigned to the word pattern, increments the word count and adds the length of the matched text to the character (byte) count. yytext is a pointer to the string matched by the pattern at hand.
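
To make the combined effect of the three rules concrete, the following hand-written C sketch is intended to mimic them on a string held in memory. It is not what Flex generates (Flex builds a table-driven automaton); the names Counts, is_ascii_letter and count_text are hypothetical and serve only this illustration.

```c
#include <stddef.h>

/* Hypothetical result type holding the three counters of the example. */
typedef struct {
    int lines;
    int words;
    int chars;
} Counts;

/* Mirrors the character class [a-zA-Z] (assumes an ASCII-compatible charset). */
static int is_ascii_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Applies the three rules of the example to the string s. */
Counts count_text(const char *s)
{
    Counts c = { 0, 0, 0 };
    size_t i = 0;

    while (s[i] != '\0') {
        if (is_ascii_letter(s[i])) {
            /* rule [a-zA-Z]+ : consume the longest run of letters */
            size_t start = i;
            while (is_ascii_letter(s[i]))
                i++;
            c.words++;
            c.chars += (int)(i - start);   /* same value as strlen(yytext) */
        } else if (s[i] == '\n') {
            /* rule \n */
            c.chars++;
            c.lines++;
            i++;
        } else {
            /* rule . : any other single character */
            c.chars++;
            i++;
        }
    }
    return c;
}
```

Note that, with these rules, digits and punctuation are counted as characters but never as words, so an input such as a1b counts as two words.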

Stage 1 - Recognising tokens

One possible implementation for a token recogniser is given below. Explanations follow:

%option noyywrap 

%{
#include <stdio.h>
#include <unistd.h>

/* Prints the current token; fmt may contain one %s, which receives yytext. */
void show_text(const char *fmt){
	printf(fmt, yytext);
}

%}


op          "+"|"MOD"
alfastream  [a-zA-Z]+
digitstream [0-9]+
var         [A-Z]{alfastream}?{digitstream}?

%%
{op}         {show_text("Op(%s)");}
{var}        {show_text("Var(%s)");}
"("          {show_text("((");}
")"          {show_text("))");}
.            { /* ignore any other character */ }
%%

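int main(int argc, char **argv)
{
	FILE *f = (argc > 1) ? fopen(argv[1], "r") : NULL;
	if (f == NULL) {   /* missing argument or unreadable file */
		fprintf(stderr, "usage: %s <readable-file>\n", argv[0]);
		return 1;
	}
	yyrestart(f);
	yylex();
	fclose(f);
	return 0;
}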

Pattern names

It is often convenient to assign names to specific patterns. For instance, in the declarations section, we have defined the names alfastream, digitstream, var and op. These names can be reused later by writing them between braces. For instance, the pattern:

var         [A-Z]{alfastream}?{digitstream}?

defines strings that start with an upper-case letter, followed by zero or one occurrence of the pattern alfastream, followed by zero or one occurrence of the pattern digitstream.

Similarly,

op          "+"|"MOD"

defines the pattern op, which matches either the string "+" or the string "MOD". (Careful: introducing whitespace in a pattern definition, e.g. "+" | "MOD", will produce syntax errors.)

Actions and Ambiguous Patterns

The actual pattern definitions are found in the second part of the Flex file. The only action we perform for each matched pattern is to display it, via the show_text function.

It is possible for patterns to contain ambiguities. For instance, the input:

Variable01

may match the pattern {alfastream} (for which our code defines no rule), with Variable, as well as the pattern {var}, with Variable01.

  • Flex will always match the longest possible string. In our example, even if a rule for {alfastream} were present, {var} would be matched and not {alfastream}.

Similarly, the input:

MOD

may simultaneously match {var} as well as {op}. In such cases:

  • Flex will always prefer the first pattern which appears in the program. In our example, {op} appears before {var} in the pattern section, hence it is preferred.
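
The two disambiguation rules can be sketched in plain C as follows. This is only an illustration, assuming hand-written matchers for the two named patterns op and var; the names match_op, match_var, Rule and pick_rule are hypothetical, and Flex itself resolves the choice inside a generated automaton rather than by trying each rule in turn.

```c
#include <stddef.h>
#include <string.h>

static int is_upper(char c) { return c >= 'A' && c <= 'Z'; }
static int is_alpha(char c) { return is_upper(c) || (c >= 'a' && c <= 'z'); }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Length of the longest prefix of s matching op = "+"|"MOD" (0 if none). */
static size_t match_op(const char *s)
{
    if (s[0] == '+')
        return 1;
    if (strncmp(s, "MOD", 3) == 0)
        return 3;
    return 0;
}

/* Length of the longest prefix of s matching
   var = [A-Z]{alfastream}?{digitstream}? (0 if none). */
static size_t match_var(const char *s)
{
    size_t n;
    if (!is_upper(s[0]))
        return 0;
    n = 1;
    while (is_alpha(s[n])) n++;   /* optional alfastream */
    while (is_digit(s[n])) n++;   /* optional digitstream */
    return n;
}

/* Rules numbered in the order in which they appear in the Flex file. */
enum Rule { RULE_NONE = -1, RULE_OP = 0, RULE_VAR = 1 };

/* Selects a rule for the text at s: the longest match wins, and a tie goes
   to the rule listed first. The length of the match is stored in *len. */
enum Rule pick_rule(const char *s, size_t *len)
{
    size_t lens[2];
    enum Rule best = RULE_NONE;
    int i;

    lens[RULE_OP] = match_op(s);
    lens[RULE_VAR] = match_var(s);
    *len = 0;
    for (i = 0; i < 2; i++) {
        if (lens[i] > *len) {     /* strictly longer only: earlier rules keep ties */
            *len = lens[i];
            best = (enum Rule)i;
        }
    }
    return best;
}
```

Under these assumptions, MOD is attributed to op (both patterns match three characters and op is listed first), whereas MODULE is attributed to var, whose match is longer.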

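We have also introduced the pattern . which matches any single character except the newline. Since it is the last rule and only ever matches one character, it applies only when no other pattern matches. We expect . to match whitespace. Newlines are not matched by any rule, so Flex's default rule copies them to the output unchanged.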

Stage 2 - An ad-hoc Flex-based parser

Data-structures

To represent expressions, we have opted for an ADT-style representation:

Atom : String -> Expr
Par : Expr -> Expr
Binary : Expr x Expr -> Expr 

with one constructor for each of the three types of expression.

In C, we do not have inheritance, or other means for expressing super/sub-types. Hence, we assume a value of type Expr is a void-pointer. Thus, we define:

typedef struct Atom{
	char* name;
} Atom;

typedef struct Par{
	void* inner;
	int sz;
} Par;

typedef struct Binary{
	void* left, *right;
	int left_sz, right_sz;
	int op;
} Binary;

In order to be able to recover type information from a void* value, we also add integer fields sz (resp. left_sz and right_sz) which hold the sizeof value of the object pointed to. For instance, in representing X + Y, left and right point to objects of type Atom, hence left_sz = right_sz = sizeof(struct Atom). This encoding works only because the three structures have different sizes.
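
Since the sizeof value doubles as a type tag, the scheme relies on the three structures having pairwise different sizes, and on those sizes not colliding with the small integer markers (PLUS, MOD, PAR_OPEN, PAR_CLOSE) which the full listing stores in the same kind of sz field. A minimal sketch of how these assumptions could be checked at compile time is given below; it uses the C11 _Static_assert keyword, and kind_name is a hypothetical helper which is not part of the original program.

```c
#include <stddef.h>

#define PLUS 0
#define MOD 1
#define PAR_OPEN 2
#define PAR_CLOSE 3

typedef struct Atom { char *name; } Atom;
typedef struct Par { void *inner; int sz; } Par;
typedef struct Binary {
    void *left, *right;
    int left_sz, right_sz;
    int op;
} Binary;

/* Compilation fails if two expression types cannot be told apart by size. */
_Static_assert(sizeof(Atom) != sizeof(Par), "Atom and Par have the same size");
_Static_assert(sizeof(Atom) != sizeof(Binary), "Atom and Binary have the same size");
_Static_assert(sizeof(Par) != sizeof(Binary), "Par and Binary have the same size");

/* Atom is the smallest structure; it must not collide with the markers 0..3. */
_Static_assert(sizeof(Atom) > PAR_CLOSE, "a size tag collides with a stack marker");

/* Hypothetical helper: names the expression type encoded by a size tag. */
const char *kind_name(size_t sz)
{
    if (sz == sizeof(Atom))   return "Atom";
    if (sz == sizeof(Par))    return "Par";
    if (sz == sizeof(Binary)) return "Binary";
    return "unknown";
}
```

A more robust design would store an explicit enumeration tag instead of a size, but that would be a different representation from the one used in this text.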

We also introduce helper functions for creating the respective objects:

void* make_atom (char s[]){...}
void* make_par (void* inner, int sz){...}
void* make_binary (void* left, int left_sz, int op, void* right, int right_sz){...}

In order to display an object (and hence test the parsing correctness), we rely on four functions:

void show (void* ob, int sz){
	switch (sz){
		case ATOM: show_atom((Atom*)ob); break;
		case PAR: show_par((Par*)ob); break;
		case BINARY: show_binary((Binary*)ob); break;
	}
}

void show_atom (Atom* a){
	printf("Atom{%s}",a->name);
}

void show_par (Par* p){
	printf("(");
	show(p->inner,p->sz);
	printf(")");
}

void show_binary (Binary* b){
	show(b->left,b->left_sz);
	if (b->op == PLUS) 
	    printf(" + ");
	else printf(" MOD ");
	show(b->right,b->right_sz);
}

The interesting function is void show (void* ob, int sz), which, using the expression type stored in sz, downcasts the object to be displayed. The rest of the functions simply display different types of expressions, relying on recursive calls to show.
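
As a usage illustration, the sketch below builds the expression X + (Y) by hand and renders it with the same size-based dispatch. To keep the example easy to check, it writes into a character buffer instead of printing; render and append are hypothetical counterparts of the show functions, not part of the original program, and the objects are created on the stack rather than with the make_ helpers.

```c
#include <stdio.h>
#include <string.h>

#define PLUS 0
#define MOD 1

typedef struct Atom { char *name; } Atom;
typedef struct Par { void *inner; int sz; } Par;
typedef struct Binary {
    void *left, *right;
    int left_sz, right_sz;
    int op;
} Binary;

#define ATOM   ((int)sizeof(Atom))
#define PAR    ((int)sizeof(Par))
#define BINARY ((int)sizeof(Binary))

/* Appends text to the string in buf without writing past cap bytes. */
static void append(char *buf, size_t cap, const char *text)
{
    size_t used = strlen(buf);
    if (used + 1 < cap)
        snprintf(buf + used, cap - used, "%s", text);
}

/* Hypothetical counterpart of show: same dispatch on sz, output goes to buf. */
void render(const void *ob, int sz, char *buf, size_t cap)
{
    switch (sz) {
    case ATOM:
        append(buf, cap, "Atom{");
        append(buf, cap, ((const Atom *)ob)->name);
        append(buf, cap, "}");
        break;
    case PAR: {
        const Par *p = (const Par *)ob;
        append(buf, cap, "(");
        render(p->inner, p->sz, buf, cap);    /* recurse on the inner expression */
        append(buf, cap, ")");
        break;
    }
    case BINARY: {
        const Binary *b = (const Binary *)ob;
        render(b->left, b->left_sz, buf, cap);
        append(buf, cap, b->op == PLUS ? " + " : " MOD ");
        render(b->right, b->right_sz, buf, cap);
        break;
    }
    }
}
```

For instance, with Atom objects x and y named X and Y, a Par p wrapping y (with sz set to ATOM) and a Binary b with left x, right p and op PLUS, a call to render(&b, BINARY, buf, sizeof buf) on an empty buffer yields the text Atom{X} + (Atom{Y}).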

Finally, the interesting part of the parser is verifying that a correct expression has been read. To do so, our program relies on a stack. The possible values on the stack are:

  • an expression (i.e. a void* and an int)
  • an opened parenthesis (encoded as a null-pointer value and a pre-defined integer PAR_OPEN)
  • a closed parenthesis (a null pointer and PAR_CLOSE)
  • an operator (a null pointer and PLUS or MOD)

Whenever an open parenthesis is read by Flex, we check the top of the stack:

  • if the top is another open parenthesis (we have read "…( ("), or an operator (we have read e.g. "+ ("), or if the stack is empty, we place the parenthesis on the stack;
  • otherwise, we have a syntax error;

Whenever an operator is read by Flex, we check the top of the stack:

  • if it contains an expression (of any type), then we place the operator on the stack
  • otherwise, we signal a syntax error;

When a variable is read, we check the top of the stack:

  • if it contains an open parenthesis, it means we have read something similar to …(V; we place the variable on the stack and continue. We do the same thing if the stack is empty.
  • if it contains an operator, it means we have read e.g. <expr> + V; in this case, we pop the + and the <expr>, build the binary expression <expr> + V, and place it on the stack. The correctness checks guarantee that the stack will have these values available.
  • otherwise we signal a syntax error;

When a closed parenthesis is read, we look at the stack top again and:

  • if the last two values of the stack correspond to ( <expr>, then we pop the expression, the opened parenthesis, and push a new expression corresponding to (<expr>).
  • in any other case, we signal a syntax error (incorrectly-matched parentheses).
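
The four cases above can be sketched as a small stand-alone checker. This is only an illustration under simplifying assumptions: tokens are single characters ('V' for a variable, '+' and 'M' for the two operators, '(' and ')'), blanks are skipped, only the kind of each stack entry is stored, and no expression tree is built. After a closed parenthesis, the new expression is also combined with a pending operator, as for a variable. The names Kind, push_expr and is_valid_expression are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

enum Kind { K_EXPR, K_OPEN, K_OP };

/* Pushes an expression on the stack st of height *n. If an operator is on
   top, "<expr> op <expr>" is folded into one expression instead.
   Returns 0 on a syntax error, 1 otherwise. */
static int push_expr(enum Kind *st, size_t *n)
{
    if (*n > 0 && st[*n - 1] == K_EXPR)
        return 0;                      /* two expressions in a row */
    if (*n >= 2 && st[*n - 1] == K_OP) {
        *n -= 2;                       /* pop the operator and its left operand */
        return push_expr(st, n);       /* push the resulting binary expression */
    }
    st[(*n)++] = K_EXPR;
    return 1;
}

/* Returns 1 if tokens encodes exactly one well-formed expression, else 0. */
int is_valid_expression(const char *tokens)
{
    size_t n = 0, i;
    int ok = 1;
    enum Kind *st = malloc((strlen(tokens) + 1) * sizeof *st);

    if (st == NULL)
        return 0;
    for (i = 0; ok && tokens[i] != '\0'; i++) {
        switch (tokens[i]) {
        case '(':                      /* allowed on an empty stack, after ( or an operator */
            if (n > 0 && st[n - 1] == K_EXPR) ok = 0;
            else st[n++] = K_OPEN;
            break;
        case '+':
        case 'M':                      /* an operator must follow an expression */
            if (n > 0 && st[n - 1] == K_EXPR) st[n++] = K_OP;
            else ok = 0;
            break;
        case 'V':
            ok = push_expr(st, &n);
            break;
        case ')':                      /* the stack must end with ( <expr> */
            if (n >= 2 && st[n - 1] == K_EXPR && st[n - 2] == K_OPEN) {
                n -= 2;
                ok = push_expr(st, &n);
            } else {
                ok = 0;
            }
            break;
        case ' ':
            break;
        default:
            ok = 0;
        }
    }
    ok = ok && n == 1 && st[0] == K_EXPR;   /* exactly one expression must remain */
    free(st);
    return ok;
}
```

With this encoding, "(V + V) M V" is accepted, whereas "V +", "(V" and "V V" are rejected.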

Below, you can find an implementation which covers some of the syntax verifications mentioned above. The implementation should be able to build correctly-written expressions.

%option noyywrap 

%{

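#include <stdio.h>   /* printf */
#include <stdlib.h>  /* malloc, free */
#include <string.h>  /* strlen, memcpy */
#include <unistd.h>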

/*
    data Expr = Atom | (Expr) | Expr OP Expr 
*/

#define PLUS 0
#define MOD 1
#define PAR_OPEN 2
#define PAR_CLOSE 3

#define ATOM sizeof(struct Atom)
#define PAR sizeof(struct Par)
#define BINARY sizeof(struct Binary)

typedef struct Atom{
	char* name;
} Atom;

typedef struct Par{
	void* inner;
	int sz;
} Par;

typedef struct Binary{
	void* left, *right;
	int left_sz, right_sz;
	int op;
} Binary;

void show_atom(Atom*);
void show_par(Par*);
void show_binary(Binary*);

void* make_atom (char s[]){
	Atom* a = (Atom*)malloc(sizeof(struct Atom));
	size_t len = strlen(s) + 1;   /* +1 for the terminating '\0' */
	a->name = (char*)malloc(len);
	memcpy(a->name, s, len);      /* copy the text, terminator included */
	return (void*)a;
}

void* make_par (void* inner, int sz){
	Par* p = (Par*)malloc(sizeof(struct Par));
	p->inner = inner;
	p->sz = sz;
	return (void*)p;
}

void* make_binary (void* left, int left_sz, int op, void* right, int right_sz){
	Binary* b = (Binary*)(malloc(sizeof(struct Binary)));
	b->left = left;
	b->right = right;
	b->left_sz = left_sz;
	b->right_sz = right_sz;
	b->op = op;
	return (void*)b;
}



void show (void* ob, int sz){
	switch (sz){
		case ATOM: show_atom((Atom*)ob); break;
		case PAR: show_par((Par*)ob); break;
		case BINARY: show_binary((Binary*)ob); break;
	}
}

void show_atom (Atom* a){
	printf("Atom{%s}",a->name);
}

void show_par (Par* p){
	printf("(");
	show(p->inner,p->sz);
	printf(")");
}

void show_binary (Binary* b){
	show(b->left,b->left_sz);
	if (b->op == PLUS) 
	    printf(" + ");
	else printf(" MOD ");
	show(b->right,b->right_sz);
}


typedef struct Stack_elem{
	void* val;
	int sz;
	struct Stack_elem* next;
}* Stack;

Stack push (void* val, int sz, Stack s){
	Stack sp = (Stack)malloc(sizeof(struct Stack_elem));
	sp->val = val;
	sp->sz = sz;
	sp->next = s;
	return sp;
}
Stack pop (Stack s){
	Stack sp = s->next;
	free(s);
	return sp;
}
int isEmpty (Stack s){
	return s==0;
}

Stack stack = 0;

void error (char s[]){
	
	printf("Error: %s \n",s);
	//yyterminate();
	//yyerror(s);
}

/*
	Adds the expression e to the stack.
	As long as the top of the stack is "e' op", we pop the operator and e',
	and push the binary expression e' op e instead.
*/
void prune_stack(void* e, int sz){
	if (isEmpty(stack)){
		stack = push(e, sz, stack);
	}
	else {
		if(stack->sz == PLUS || stack->sz == MOD){
			int op = stack->sz;
			stack = pop(stack);
			if (isEmpty(stack)){
				error("Operator is not preceded by an operand");
				return;   /* avoid dereferencing the empty stack below */
			}
			Binary* b = make_binary(stack->val,stack->sz,op,e,sz);
			stack = pop(stack);
			prune_stack(b,BINARY);
		}
		else
			stack = push(e,sz,stack);
	}
}

void add_variable(){
	prune_stack(make_atom(yytext),ATOM);
}

void close_par(){
	if (isEmpty(stack)){
		error("Closed parenthesis ends abruptly");
	}
	else{
		Par* p = make_par(stack->val,stack->sz);
		stack = pop(stack);
		if (isEmpty(stack))
			error("Closed parenthesis preceding expression ends abruptly");
		else{
			
			if(stack->sz != PAR_OPEN){
				error("Closed parenthesis preceding expression does not match an open one");
			}
			else{
				stack = pop(stack); //remove parenthesis
				prune_stack((void*)p,PAR);
			}
		}
	}
}

%}

alfastream  [a-zA-Z]+
digitstream [0-9]+
var         [A-Z]{alfastream}?{digitstream}?

%%

"+"     {stack = push(0,PLUS,stack); }
"MOD"   {stack = push(0,MOD,stack); }
{var}	{add_variable(); }
"("     {stack = push(0,PAR_OPEN,stack); }     
")"     {close_par();}
.|\n    { /* ignore any other character, including newlines */ }

%%

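int main(int argc, char **argv)
{
	FILE *f = (argc > 1) ? fopen(argv[1], "r") : NULL;
	if (f == NULL) {   /* missing argument or unreadable file */
		fprintf(stderr, "usage: %s <readable-file>\n", argv[0]);
		return 1;
	}
	yyrestart(f);
	yylex();
	fclose(f);
	if (isEmpty(stack)){
		printf("Stack empty!\n");
		return 0;
	}
	show(stack->val,stack->sz);   /* display the expression left on top of the stack */
	printf("\n");
	return 0;
}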
