As a project, a student wants to implement a parser for a small language in C/C++. I considered reaching back into my ancient experience with YACC/Bison, but instead we decided to try ANTLR. This is an LL(*) parser generator that produces a recursive-descent (top-down) implementation. YACC/Bison, in contrast, are LR (bottom-up).
Let me interject here that I never considered parsing a very interesting problem in language implementation. I'm open to arguments to the contrary; maybe it just isn't to my taste. As a programmer, I appreciate clean and regular syntax, but as a language implementer I see the appeal of Lisp's non-syntax – just give me the AST directly.
ANTLR is written in Java, but supports several other languages for its output. The documentation for these other languages is a bit sparse, so it may be best to learn its capabilities in Java and then port them. Ultimately, I was reading the antlr3.h files to figure out the capabilities of tree and string types.
ANTLR has some useful operators for structuring and rewriting the resulting AST, which is especially helpful given the restriction against left-recursive grammars. Some examples also use a separate specification for a tree grammar that interprets the output of the parser in some way. We definitely want a grammar that produces an AST, but for further processing we'll stick to straight C/C++ without needing the tool.
Here is the grammar file. One of the tricks for later post-processing is that you want to define as many named tokens as possible: replacing something like ('+'|'-') with (PLUS|MINUS) is beneficial because you can switch on the token type using case PLUS in the interpreter.
grammar ExprCppTree;
options {
language = C;
output = AST;
ASTLabelType=pANTLR3_BASE_TREE;
}
@header {
#include <assert.h>
}
// The suffix '^' means make it a root.
// The suffix '!' means ignore it.
expr: multExpr ((PLUS^ | MINUS^) multExpr)*
;
PLUS: '+';
MINUS: '-';
multExpr
: atom (TIMES^ atom)*
;
TIMES: '*';
atom: INT
| ID
| '('! expr ')'!
;
stmt: expr NEWLINE -> expr // tree rewrite syntax
| ID ASSIGN expr NEWLINE -> ^(ASSIGN ID expr) // tree notation
| NEWLINE -> // ignore
;
ASSIGN: '=';
prog
: (stmt {pANTLR3_STRING s = $stmt.tree->toStringTree($stmt.tree);
assert(s->chars);
printf(" tree \%s\n", s->chars);
}
)+
;
ID: ('a'..'z'|'A'..'Z')+ ;
INT: '~'? '0'..'9'+ ;
NEWLINE: '\r'? '\n' ;
WS : (' '|'\t')+ {$channel = HIDDEN;};
One other thing I ought to point out here is that I'm using the tilde ~ as the negation operator, as in SML. This avoids having to figure out conflicts between subtraction and negation.
#include "ExprCppTreeLexer.h"
#include "ExprCppTreeParser.h"
#include <cassert>
#include <map>
#include <string>
#include <iostream>
using std::map;
using std::string;
using std::cout;
class ExprTreeEvaluator {
map<string,int> memory;
public:
int run(pANTLR3_BASE_TREE);
};
pANTLR3_BASE_TREE getChild(pANTLR3_BASE_TREE, unsigned);
const char* getText(pANTLR3_BASE_TREE tree);
int main(int argc, char* argv[])
{
pANTLR3_INPUT_STREAM input;
pExprCppTreeLexer lex;
pANTLR3_COMMON_TOKEN_STREAM tokens;
pExprCppTreeParser parser;
assert(argc > 1);
input = antlr3AsciiFileStreamNew((pANTLR3_UINT8)argv[1]);
lex = ExprCppTreeLexerNew(input);
tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
TOKENSOURCE(lex));
parser = ExprCppTreeParserNew(tokens);
ExprCppTreeParser_prog_return r = parser->prog(parser);
pANTLR3_BASE_TREE tree = r.tree;
ExprTreeEvaluator eval;
int rr = eval.run(tree);
cout << "Evaluator result: " << rr << '\n';
parser->free(parser);
tokens->free(tokens);
lex->free(lex);
input->close(input);
return 0;
}
int ExprTreeEvaluator::run(pANTLR3_BASE_TREE tree)
{
pANTLR3_COMMON_TOKEN tok = tree->getToken(tree);
if(tok) {
switch(tok->type) {
case INT: {
const char* s = getText(tree);
if(s[0] == '~') {
return -atoi(s+1);
}
else {
return atoi(s);
}
}
case ID: {
string var(getText(tree));
return memory[var];
}
case PLUS:
return run(getChild(tree,0)) + run(getChild(tree,1));
case MINUS:
return run(getChild(tree,0)) - run(getChild(tree,1));
case TIMES:
return run(getChild(tree,0)) * run(getChild(tree,1));
case ASSIGN: {
string var(getText(getChild(tree,0)));
int val = run(getChild(tree,1));
memory[var] = val;
return val;
}
default:
cout << "Unhandled token: #" << tok->type << '\n';
return -1;
}
}
else {
int k = tree->getChildCount(tree);
int r = 0;
for(int i = 0; i < k; i++) {
r = run(getChild(tree, i));
}
return r;
}
}
pANTLR3_BASE_TREE getChild(pANTLR3_BASE_TREE tree, unsigned i)
{
assert(i < tree->getChildCount(tree));
return (pANTLR3_BASE_TREE) tree->getChild(tree, i);
}
const char* getText(pANTLR3_BASE_TREE tree)
{
return (const char*) tree->getText(tree)->chars;
}
Given the text file:
a = 3
b = 14
(a+b)-2 * ~4
This program outputs:
tree (= a 3)
tree (= b 14)
tree (- (+ a b) (* 2 ~4))
Evaluator result: 25
This works with ANTLR version 3.2 (debian stable), but I also learned today there are some significant C API changes in the latest, version 3.4. For one, antlr3AsciiFileStreamNew must be replaced with antlr3FileStreamNew(filename, ANTLR3_ENC_8BIT).