Compiler Tutorial, part 2 – Lexing
In Part 1 of my compilers tutorial, I went over context-free grammars. Read that first, unless you only care about lexing.
Lexing, short for “lexical analyzing”, basically means “turn this string into something I can use”. Since I’m assuming you’ve programmed before, I’m sure you’ve had an occasion to take input from the user and turn it into smaller components you can use. Even if it’s a simple expression like “5+2”, you don’t get it as input in a nice format – but rather an obtuse string. If you’ve ever accepted input from the user and done anything other than print it right back out, you’ve written a lexer. Any sort of string-splitting or “understanding” the contents of a string requires a lexer. Even Java’s String.split(“, “) method is lexing, in a simple form.
Lexing is necessary for compilation, but also for many other programs. I’ve written too much input-handling code in my time, and I wish somebody had told me about lexing long ago. So this section of the tutorial is probably the most useful to areas other than compiler writing.
Let’s get started.Since compilers get programs as strings from a file on disk, the first step is making sense of the input. The lexer is the source of some of the final characteristics of the language – particularly with regards to whitespace handing. For example, a lexer for C or Java would discard all unnecessary whitespace, but a Python lexer would not. It’s also the starting point in the compilation process.
The key point about designing a lexer is that you need to be able to reconstruct the exact input given, minus unnecessary differences like whitespace (as noted above). That is, the lexer categorically can not “change” the input.
Let’s look at ‘flex’ – this is a faster (hence the ‘f’) version of ‘lex’. Lexers take input and return a sequence of tokens, or ‘lexemes’. I prefer the word tokens. Consider them arbitrary language-level values that can be easily matched in the rest of your program.
Examples are better. See below, for a lexer for arithmetic expressions:
%top{ #include <stdio.h> #include <stdlib.h> } DIGITÂ [0-9] WHITESPACEÂ Â [ \t\n] %% {DIGIT}+ { printf("%d ", atoi(yytext)); } "X" { printf("[Var X] "); } "+" { printf("[Plus] "); } "-" { printf("[Minus] "); } "*" { printf("[Times] "); } "/" { printf("[Divide] "); } "(" { printf("( "); } ")" { printf(") "); } {WHITESPACE}+ {} . printf("Unrecognized character %s", yytext); %% int main() { yylex(); return 0; }
Go ahead – save this as a file “arithmetic-1.l” (that’s a lowercase L) and run “flex arithmetic-1.l”. L is the canonical extension for ‘lex’ source code. This will produce a file named ‘lex.yy.c’, which you can compile with the command ‘gcc -lfl lex.yy.c’. If you’re not sure how to use ‘flex’, you may need to download it. ‘Lex’ is part of the POSIX standard, so on Unixen you’re guaranteed its availability, but your machine probably has ‘flex’ instead. It’s also available for Windows, but I won’t help you set up a build environment there (though look at Cygwin). If you don’t know how to use a C compiler, this is the wrong tutorial for you.
Go ahead and run the resulting ‘a.out’ and type some algebraic expression, followed by a newline.
Let’s look at what happens:
> 5 + 2 - 3*3(7)423341 5 [Plus] 2 [Minus] 3 [Times] 3 ( 7 ) 423341
Let’s look at what’s happening, starting from the top. Lex source code is of the form:
Definition section %% Rules section %% C code section
The code above follows this structure. Let’s examine it. First, some notes – indented text, or text enclosed by ‘%{‘ and ‘%}’ is copied verbatim, as is the ‘%top{‘ block (which goes at the beginning of the output). We use it to include “stdio.h” because we need printf, and “stdlib.h” for atoi. Both of these are probably included in the generated code, but it’s good practice to include it yourself if you’re using it.
Next, we have two definitions – DIGIT and WHITESPACE. Each defines a ‘pattern’, in the form of a regular expression that matches particular text – any digit, and all whitespace, respectively. Regular expressions, if you’ve never used them, are ways to specify simple patterns of characters. [0-9] means ‘any character between 0 and 9 inclusive’, while [Â \t\n] means ‘space, tab, or newline’. The key here is that you’ve named the regular expression, which makes it look cleaner – you could do without these and just put it right in the rules section
Now for rules. This goes “see a regular expression and run this code”. The first rule, ‘{DIGIT}+’, specifies that for at least one digit in the input, run the code following (which parses it to a digit and prints it out). As you can see, you can use the matched text with the variable ‘yytext’. The following rules work similarly – in these, the regular expressions are the single characters, and all that happens is a textual representation is printed. Finally, any and all whitespace is discarded (the empty action) and the final ‘.’ matches any single character (Sidenote: We can’t use ‘.+’ because it would match the whole input – lex always uses the biggest match, which is why a number like 1234 isn’t processed separately)
Now that we’ve defined our input, and what to do when we see it, we can run the lexer. yylex() starts the lexer on the standard input (by default; you can change it) and runs until it returns – by default, when the end of the file is reached.
We’ll look at parsing next, but fiddle around with this simple lexer for a while first. Try writing a line- and character-counting program (you’ll need to define some global variables). In particular, think about whether you can disallow bad inputs like ‘*/*+*’ or unmatched parentheses. (Hint: what’s a regular language and what are the natures of its limitations?)
Bear with me – it gets interesting next.