In program analysis, parsing is often needed. ANTLR is a great parser generator, but it can be frustrating if your grammar is ambiguous. I have found that lexer ambiguities can cost you quite some time. Here is a little trick that may help.

An example: Graphstream DGS

I was writing a parser for Graphstream DGS (after all, in program analysis, graphs are important too). The formal specification was a bit incomplete and ambiguous, but with some help of the source code I managed to get something working, until I came across the following rules:

fragment HEXBYTE : ([0-9a-fA-F]) ([0-9a-fA-F]) ;
COLOR : '#' HEXBYTE HEXBYTE HEXBYTE HEXBYTE? ;

COMMENT : '#' .*? EOL;

These lexer rules overlap. ANTLR’s ambiguity resolution takes the first alternative, but only if both rules would match the same input. This works as expected if you have for example reserved words:

ENUM : 'enum';
ID: [a-z]+;

For the input enum Categories { ... }, the input enum will be matched by both rules ENUM and ID. However, if the input would contain enumerate(x) then both ENUM and ID will match, but they will match a different part of the input: enum and enumerate. In that case, the longest match wins. This is also what we usually want.

The situation here is similar. For the input color=#ff0000,length=4, both COLOR and COMMENT will match, but on a different input string. In this case, COMMENT will win, and this is not what we want.

The fix: shortening lexer tokens

A trick is to make the COMMENT token shorter than the COLOR token. We can do this by using ANTLR’s lexer modes, which are used for so-called island grammars.

The COMMENT token will be split in two tokens, START_COMMENT and COMMENT. The first is hidden with skip, but it puts the lexer into ‘comment mode’ (CMT). In comment mode, we match everything including the newline into the token COMMENT. To the parser there is still only one token, COMMENT.

START_COMMENT : '#' -> pushMode(CMT), skip;

mode CMT;

COMMENT : .*? EOL -> popMode ;

You cannot use lexer modes in combined grammars, you will need to split your grammar into a lexer grammar and a parser grammar.

About lexer modes

Lexer modes may make things easier or more readable, but sometimes it is better to fix the lexer patterns or create rules in the parser. Too many lexer modes can make one lose the overview, so use them wisely. I prefer to use them only if it makes sense, such as is the case for island grammars. The above hack could have been solved with a parse rule, but it would have cluttered my parse tree. In this case, the trick with lexer modes made more sense.

Ambiguous ANTLR lexer rules

and how to fix them

An example: Graphstream DGS

The fix: shortening lexer tokens

About lexer modes