In program analysis, parsing is often needed. ANTLR is a great parser generator, but it can be frustrating if your grammar is ambiguous. I have found that lexer ambiguities can cost you quite some time. Here is a little trick that may help.
An example: Graphstream DGS
I was writing a parser for Graphstream DGS (after all, in program analysis, graphs are important too). The formal specification was a bit incomplete and ambiguous, but with some help of the source code I managed to get something working, until I came across the following rules:
fragment HEXBYTE : ([0-9a-fA-F]) ([0-9a-fA-F]) ;
COLOR : '#' HEXBYTE HEXBYTE HEXBYTE HEXBYTE? ;
COMMENT : '#' .*? EOL;
These lexer rules overlap. ANTLR’s ambiguity resolution takes the first alternative, but only if both rules would match the same input. This works as expected if you have for example reserved words:
ENUM : 'enum';
ID: [a-z]+;
For the input enum Categories { ... }
, the input enum
will be matched by both rules ENUM
and ID
. However, if the input would contain enumerate(x)
then both ENUM
and ID
will match, but they will match a different part of the input: enum
and enumerate
. In that case, the longest match wins. This is also what we usually want.
The situation here is similar. For the input color=#ff0000,length=4
, both COLOR
and COMMENT
will match, but on a different input string. In this case, COMMENT
will win, and this is not what we want.
The fix: shortening lexer tokens
A trick is to make the COMMENT
token shorter than the COLOR
token. We can do this by using ANTLR’s lexer modes, which are used for so-called island grammars.
The COMMENT
token will be split in two tokens, START_COMMENT
and COMMENT
. The first is hidden with skip
, but it puts the lexer into ‘comment mode’ (CMT
). In comment mode, we match everything including the newline into the token COMMENT
. To the parser there is still only one token, COMMENT
.
START_COMMENT : '#' -> pushMode(CMT), skip;
mode CMT;
COMMENT : .*? EOL -> popMode ;
You cannot use lexer modes in combined grammars, you will need to split your grammar into a lexer grammar and a parser grammar.
About lexer modes
Lexer modes may make things easier or more readable, but sometimes it is better to fix the lexer patterns or create rules in the parser. Too many lexer modes can make one lose the overview, so use them wisely. I prefer to use them only if it makes sense, such as is the case for island grammars. The above hack could have been solved with a parse rule, but it would have cluttered my parse tree. In this case, the trick with lexer modes made more sense.