Gelex: a generic lexer in JavaScript

Angel Java Lopez
Mar 31, 2019


I practice TDD (Test-Driven Development) every day in personal projects written in different programming languages, such as Java, JavaScript, and C#. One of my favorite topics is writing an interpreter, a compiler, or a transpiler. After many projects, I now have a clear picture of what I need from a lexer.
A lexer takes a text and separates it into tokens: each token is like a word or a number, that is, it has a value and a type. For example, a lexer that processes this Java line:

int answer = 42;

could produce a list of tokens (in JavaScript notation) like:

{ type: 'name', value: 'int' }
{ type: 'name', value: 'answer' }
{ type: 'operator', value: '=' }
{ type: 'integer', value: '42' }
{ type: 'delimiter', value: ';' }

To cover additional use cases, it is useful to add the position of each token in the original text, for example:

{ type: 'name', value: 'int', begin: 0, end: 2 }
{ type: 'name', value: 'answer', begin: 4, end: 9 }
{ type: 'operator', value: '=', begin: 11, end: 11 }
{ type: 'integer', value: '42', begin: 13, end: 14 }
{ type: 'delimiter', value: ';', begin: 15, end: 15 }
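
To make the token shape above concrete, here is a minimal hand-written sketch that produces tokens with inclusive begin and end positions. It is only an illustration and not how gelex is implemented: the rule table, the tokenize function, and the use of native sticky regular expressions are assumptions made for this example.

```javascript
// Illustrative only: a tiny hand-written tokenizer, not gelex's implementation.
// The rule table and the tokenize() name are assumptions made for this sketch.
const rules = [
    { type: 'name', pattern: /[a-zA-Z_][a-zA-Z0-9_]*/y },
    { type: 'integer', pattern: /[0-9]+/y },
    { type: 'operator', pattern: /=/y },
    { type: 'delimiter', pattern: /;/y }
];

function tokenize(text) {
    const tokens = [];
    let position = 0;

    while (position < text.length) {
        // Skip whitespace between tokens
        if (/\s/.test(text[position])) {
            position++;
            continue;
        }

        let matched = false;

        for (const rule of rules) {
            // A sticky regex only matches exactly at lastIndex
            rule.pattern.lastIndex = position;
            const match = rule.pattern.exec(text);

            if (match) {
                tokens.push({
                    type: rule.type,
                    value: match[0],
                    begin: position,
                    end: position + match[0].length - 1 // inclusive end
                });
                position += match[0].length;
                matched = true;
                break;
            }
        }

        if (!matched)
            throw new Error(`Unexpected character '${text[position]}' at ${position}`);
    }

    return tokens;
}

const tokens = tokenize('int answer = 42;');
console.log(tokens);
```

Keeping the end position inclusive matches the listing above, where a one-character token such as '=' has the same begin and end.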

So I wrote a generic lexer; the source code is at https://github.com/ajlopez/gelex
It is published as an npm module, which can be installed with the command:

npm install gelex

To use the generic lexer, the first step is to create a lexer definition:

const gelex = require('gelex');
const ldef = gelex.definition();

Then define the types and the expressions that describe the character sequences that make up the values of each type, for example:

ldef.define('zero', '0');
ldef.define('one', '1');

Instead of a single character, you can use a sequence of characters:

ldef.define('forkeyword', 'for');

You can use limited regular expressions:

ldef.define('integer', '[0-9][0-9]*');
ldef.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
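
To see which character sequences these two expressions accept, the sketch below checks a few words against them using JavaScript's native RegExp as a stand-in. This is an assumption for illustration only: gelex interprets its own limited expression syntax, and the classify helper is hypothetical.

```javascript
// Stand-in check with native RegExp; gelex's own pattern engine may differ.
// classify() is a hypothetical helper for this sketch, not part of gelex.
const integerPattern = new RegExp('^[0-9][0-9]*$');
const namePattern = new RegExp('^[a-zA-Z_][a-zA-Z0-9_]*$');

function classify(word) {
    if (integerPattern.test(word))
        return 'integer';

    if (namePattern.test(word))
        return 'name';

    // Neither pattern accepts the whole word
    return null;
}

console.log(classify('42'));     // integer
console.log(classify('answer')); // name
console.log(classify('_tmp1'));  // name
console.log(classify('1beer'));  // null
```

A name cannot start with a digit, so a text like '1beer' is not a single name; a lexer would typically split it into an integer followed by a name.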

To define a comment (treated as whitespace to be skipped), declare the start and end sequences:

ldef.defineComment('/*', '*/');

For a line comment, specify only the start sequence:

ldef.defineComment('//');
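
What skipping a comment means can be sketched as follows. The skipComment function is hypothetical and is not gelex code; it only illustrates the idea that a comment with an end sequence runs up to and including that sequence, while a comment without one runs to the end of the line.

```javascript
// Hypothetical helper for illustration; not part of gelex.
// Returns the position right after the comment that starts at `position`,
// or `position` unchanged when no comment starts there.
function skipComment(text, position, start, end) {
    if (!text.startsWith(start, position))
        return position;

    if (end === undefined) {
        // Line comment: stop at the newline, or at the end of the text
        const newline = text.indexOf('\n', position);
        return newline === -1 ? text.length : newline;
    }

    // Delimited comment: look for the end sequence after the start sequence
    const close = text.indexOf(end, position + start.length);

    if (close === -1)
        throw new Error('Unclosed comment');

    return close + end.length;
}

console.log(skipComment('/* hi */ x', 0, '/*', '*/')); // 8
console.log(skipComment('// note\nx', 0, '//'));       // 7
console.log(skipComment('x // y', 0, '//'));           // 0
```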

For strings, you must declare the start and end sequences:

ldef.defineString('string', '"', '"');

An additional argument can specify the escape character and the mapped (escaped) characters:

ldef.defineString('string', '"', '"', {
    escape: '/',
    escaped: { 'n': '/n', ... }
});

Once the lexer definition is done, you can process a text:

const lexer = ldef.lexer('1 beer');
lexer.next(); // integer 1
lexer.next(); // name beer
lexer.next(); // null
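
Since next() returns null when the text is exhausted, all tokens can be gathered with a simple loop. The sketch below assumes only that behavior; collectTokens is a hypothetical helper, and stubLexer is a stand-in so the example runs without gelex installed.

```javascript
// collectTokens is a hypothetical helper, not part of gelex.
// It works with any object whose next() returns a token or null at the end.
function collectTokens(lexer) {
    const tokens = [];

    for (let token = lexer.next(); token !== null; token = lexer.next())
        tokens.push(token);

    return tokens;
}

// Stand-in lexer so the sketch runs without gelex installed
function stubLexer(tokens) {
    let index = 0;

    return {
        next: () => index < tokens.length ? tokens[index++] : null
    };
}

const all = collectTokens(stubLexer([
    { type: 'integer', value: '1' },
    { type: 'name', value: 'beer' }
]));

console.log(all.length); // 2
```

With the real module, the same helper could presumably be called as collectTokens(ldef.lexer('1 beer')).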

For more detailed explanations, see the README of the project. There is also a tokenizer example.
In a previous project, I wrote a similar solution that covered both the lexer and the parser. Now I prefer to keep these concerns separate. Dogfooding it, I have started to use this new lexer in my own projects, such as selang (a new programming language for smart contracts in Ethereum and RSK), erlie, simpletalk, and SolidityCompiler.
In a future post, I will describe the generic parser that I wrote to complement this project, and how to use it.
