ANTLR 4

Having once upon a time had to use Lex & Yacc, and their eventual successors Flex & Bison, I was more than passingly familiar with writing and debugging grammars.

And all their weirdness, like shift/reduce conflicts 😛

At Xojo I’d looked through the grammar for the language to make sure the various editors and parsers conformed to it, especially in places like the method signature editor, where you could paste a whole signature into the method name and it would pull it apart. There were other places where the grammar had to be known to try to do the right thing.

And when a side project came up that required writing a grammar, I knew I was capable of writing one using those tools, but I really didn’t want to use them: they have some weirdnesses and they only turned out C or C++ code, maybe Java too. If you wanted some other language, that wasn’t an option.

Enter ANTLR!

One of the things I really like about it is that not only can you read the grammar very easily, it also has some handy test modes, like showing you a visual parse tree, dumping all the tokens, printing a tree of all the tokens, and piles of other nice debug info to help you write a better grammar and therefore a better parser.

For instance, originally I had a single rule for reading a Xojo DIM statement. It looked much like

// Declare any number of variables/arrays of varying types.
dimStatement
    : (DIM | VAR) (arrayDecl | IDENTIFIER | ME) (',' (arrayDecl | IDENTIFIER | ME))* AS NEW? fqName('(' arguments ')')? (EQUALS expr)?
      (',' ((arrayDecl | IDENTIFIER | ME) (',' (arrayDecl | IDENTIFIER | ME))* AS NEW? fqName ('(' arguments ')')? ) (EQUALS expr)? )* comment? #DeclareVariables
    ;

Quite a beast to read, but it permitted any sort of single-line dim statement to be parsed

HOWEVER, it had real issues when it came to trying to deal with it in code. Why? Well, in ANTLR the generated code would have lists of possible arrayDecl objects, identifier objects, and fqNames for the types, and none of them were easily associated with each other. That made it hard to know which arrays were defined as what type, which identifiers were defined as what type, and so on.

What do I mean by that? Given code like

dim k() , I as integer, s() as string, b, bb() as boolean

I’d get

array decls : k, s, bb
identifiers : I, b
fqnames : integer, string, boolean

Very hard to tell which arrays and identifiers are integers, which are strings, and which are booleans
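To make the problem concrete, here is a minimal Python sketch. It is not ANTLR-generated code: `Clause` and `flatten` are hypothetical stand-ins that only model the three flat lists, and it ignores the token positions a real parse tree would also carry. It shows that two different dim statements collapse to exactly the same lists, so the lists alone cannot say which name has which type.

```python
from typing import NamedTuple


class Clause(NamedTuple):
    """Hypothetical model of one '... as <type>' group in a dim statement."""
    arrays: list
    identifiers: list
    type_name: str


def flatten(clauses):
    """Collapse the groups into three flat lists, like the single big rule did."""
    arrays = [name for c in clauses for name in c.arrays]
    identifiers = [name for c in clauses for name in c.identifiers]
    types = [c.type_name for c in clauses]
    return arrays, identifiers, types


# dim k() , I as integer, s() as string, b, bb() as boolean
original = [
    Clause(["k"], ["I"], "integer"),
    Clause(["s"], [], "string"),
    Clause(["bb"], ["b"], "boolean"),
]

# dim k(), s(), I as integer, bb() as string, b as boolean
different = [
    Clause(["k", "s"], ["I"], "integer"),
    Clause(["bb"], [], "string"),
    Clause([], ["b"], "boolean"),
]

# Different declarations, identical flat lists: the grouping is lost.
same_lists = flatten(original) == flatten(different)
print(same_lists)
```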

Now the grammar looks like

// Declare any number of variables/arrays of varying types.
dimStatement
    : (DIM | VAR) declClause ( ',' declClause)* comment? 
    ;

declClause
    : (arrayDecl | simplevarDecl) (',' (arrayDecl | simplevarDecl))* AS NEW? fqName ('(' arguments ')')? (EQUALS expr)?
    ;

arrayDecl
    : simplevarDecl '(' ( MINUS? number (',' MINUS? number)* )? ')'
    ;

simplevarDecl
    : (IDENTIFIER | ME)
    ;

And parsing that same code I now get an array of decl clauses like

decl clause
     array decls : k
     identifiers : I
     fqname : integer
decl clause
     array decls : s
     identifiers :
     fqname : string
decl clause
     array decls : bb
     identifiers : b
     fqname : boolean

and this is much easier to sort out and way easier to handle in code
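As a rough illustration of why the grouped shape is easier to handle, here is a small Python sketch that turns decl clauses into a name-to-type table. `DeclClause` and `symbol_table` are hypothetical names of my own, not ANTLR-generated classes; in a real listener or visitor the same loop would run over the context objects ANTLR generates for the `declClause` rule. The clauses are modeled by hand from the dim line above.

```python
from typing import NamedTuple


class DeclClause(NamedTuple):
    """Hypothetical stand-in for one parsed declClause."""
    array_decls: list
    identifiers: list
    fq_name: str


def symbol_table(clauses):
    """Map each declared name to (type name, is_array)."""
    table = {}
    for clause in clauses:
        # Everything inside one clause shares that clause's type.
        for name in clause.array_decls:
            table[name] = (clause.fq_name, True)
        for name in clause.identifiers:
            table[name] = (clause.fq_name, False)
    return table


# dim k() , I as integer, s() as string, b, bb() as boolean
clauses = [
    DeclClause(["k"], ["I"], "integer"),
    DeclClause(["s"], [], "string"),
    DeclClause(["bb"], ["b"], "boolean"),
]

for name, (type_name, is_array) in symbol_table(clauses).items():
    kind = "array of " if is_array else ""
    print(f"{name}: {kind}{type_name}")
```

Because each clause carries its own type, there is no guessing about which names belong to which `as` part.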

As I work more on this project I’ll have to keep an eye out for these sorts of things

ANTLR makes some of this super easy to deal with

Now I need to find other books on designing grammars for ANTLR 4 to see what else I might be missing out on