Gelex: a generic lexer in JavaScript
I practice TDD (Test-Driven Development) almost every day, in personal projects written in different programming languages: Java, JavaScript, C#, and so on. One of my favorite exercises is writing an interpreter, a compiler, or a transpiler. After many such projects, I now have a clear picture of what I need from a lexer.
A lexer takes a text and separates it into tokens: each token is like a word or a number, i.e., it has a value and a type. For example, a lexer that processes this Java line:
int answer = 42;
could produce a list of tokens (in JavaScript notation) like:
{ type: 'name', value: 'int' }
{ type: 'name', value: 'answer' }
{ type: 'operator', value: '=' }
{ type: 'integer', value: '42' }
{ type: 'delimiter', value: ';' }
To cover additional use cases, it is useful to add the position of each token in the original text, e.g.:
{ type: 'name', value: 'int', begin: 0, end: 2 }
{ type: 'name', value: 'answer', begin: 4, end: 9 }
{ type: 'operator', value: '=', begin: 11, end: 11 }
{ type: 'integer', value: '42', begin: 13, end: 14 }
{ type: 'delimiter', value: ';', begin: 15, end: 15 }
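Those begin and end offsets make it easy, for example, to point at the offending token when reporting errors. A minimal sketch (the reportToken helper is hypothetical, not part of gelex; it assumes begin and end are inclusive, as in the tokens above):
// Hypothetical helper: underline a token inside its source line,
// using the inclusive begin/end offsets of the token
function reportToken(text, token) {
    console.log(text);
    console.log(' '.repeat(token.begin) + '^'.repeat(token.end - token.begin + 1));
}

reportToken('int answer = 42;', { type: 'integer', value: '42', begin: 13, end: 14 });
// int answer = 42;
//              ^^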
So, I wrote a generic lexer; the source code is at https://github.com/ajlopez/gelex
It is published as an npm module, and can be installed with the command:
npm install gelex
To use the generic lexer, the first step is to create a lexer definition:
const gelex = require('gelex');
const ldef = gelex.definition();
Then, define the token types and the expressions that describe the character sequences making up the values of each type, e.g.:
ldef.define('zero', '0');
ldef.define('one', '1');
Instead of only one character, you can use a sequence of characters:
ldef.define('forkeyword', 'for');
You can use limited regular expressions:
ldef.define('integer', '[0-9][0-9]*');
ldef.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
To define a comment (treated as whitespace and skipped), declare its start and end sequences:
ldef.defineComment('/*', '*/');
For a line comment, specify only the start sequence:
ldef.defineComment('//');
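Comments are then skipped while tokenizing. A quick sketch, using the lexer API described below (the begin/end fields are elided in the comments):
const gelex = require('gelex');

const ldef = gelex.definition();
ldef.define('integer', '[0-9][0-9]*');
ldef.defineComment('/*', '*/');
ldef.defineComment('//');

const lexer = ldef.lexer('1 /* skipped */ 2 // also skipped');
console.log(lexer.next()); // { type: 'integer', value: '1', ... }
console.log(lexer.next()); // { type: 'integer', value: '2', ... }
console.log(lexer.next()); // null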
For strings, you must declare the start and end sequences:
ldef.defineString('string', '"', '"');
In an additional argument, you can specify the escape character and the escaped (mapped) characters:
ldef.defineString('string', '"', '"', {
    escape: '/',
    escaped: { 'n': '\n', ...
});
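With that definition, and under these assumptions about the escape options, lexing a quoted source text like "a/nb" would produce a token whose value contains a real newline instead of the two characters '/' and 'n'.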
Once the lexer definition is done, you can process a text:
const lexer = ldef.lexer('1 beer');
lexer.next(); // integer 1
lexer.next(); // name beer
lexer.next(); // null
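Putting it all together, here is a sketch of a definition that reproduces the Java-like example from the beginning of the post (treating int as a plain name, as in the token list above):
const gelex = require('gelex');

const ldef = gelex.definition();
ldef.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
ldef.define('integer', '[0-9][0-9]*');
ldef.define('operator', '=');
ldef.define('delimiter', ';');

const lexer = ldef.lexer('int answer = 42;');

// consume tokens until next() returns null
let token;
while ((token = lexer.next()) != null)
    console.log(token.type, token.value);
// name int
// name answer
// operator =
// integer 42
// delimiter ;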
For more detailed explanations, see the project README. There is also a tokenizer example.
In a previous project, I wrote a similar solution, but it covered both the lexer and the parser; now I prefer to keep these concerns separated. Eating my own dog food, I started to use this new lexer in my own projects, like selang (a new programming language for smart contracts on Ethereum and RSK), erlie, simpletalk, and SolidityCompiler.
In a future post, I will describe the generic parser I wrote to complement this project, and how to use it.