Source-to-source translation with ANTLR's TokenRewriteStream

by Martin Monperrus
I recently discovered org.antlr.runtime.TokenRewriteStream, an ANTLR class that makes certain source-to-source translations easy. For instance, Terence Parr's v2v3 (+patch) uses it to migrate ANTLR2 grammars to ANTLR3.

Basics

TokenRewriteStream is very powerful and its usage is very regular: a grammar uses a single TokenRewriteStream object (say tokens), and grammar actions can call four methods on tokens:
* delete deletes one token or a set of tokens
* replace replaces one token or a set of tokens with a string
* insertAfter inserts a string after a token
* insertBefore inserts a string before a token
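
To make the semantics of these four operations concrete, here is a minimal sketch in plain Java. It is a toy stand-in, not ANTLR's implementation: the class ToyRewriteBuffer and its index-based methods are hypothetical names, and tokens are addressed by position instead of by Token object. What it does share with the description above is that the original tokens are never modified and the recorded operations are applied only when the text is requested.

```java
import java.util.Arrays;

/** Toy stand-in for a rewrite buffer over a token list; not ANTLR's implementation. */
class ToyRewriteBuffer {
    private final String[] tokens;       // original token texts, never modified
    private final String[] before;       // text queued before token i
    private final String[] after;        // text queued after token i
    private final String[] replacement;  // replacement for token i, or null if untouched

    ToyRewriteBuffer(String... tokens) {
        this.tokens = tokens.clone();
        this.before = new String[tokens.length];
        this.after = new String[tokens.length];
        this.replacement = new String[tokens.length];
        Arrays.fill(before, "");
        Arrays.fill(after, "");
    }

    void insertBefore(int i, String text) { before[i] += text; }

    void insertAfter(int i, String text) { after[i] += text; }

    /** Replaces tokens from..to (inclusive) with a single string. */
    void replace(int from, int to, String text) {
        for (int i = from; i <= to; i++) replacement[i] = "";
        replacement[from] = text;
    }

    /** Deleting a range is replacing it with nothing. */
    void delete(int from, int to) { replace(from, to, ""); }

    String toOriginalString() { return String.join("", tokens); }

    /** Applies the recorded operations only now, when the text is requested. */
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            out.append(before[i]);
            out.append(replacement[i] != null ? replacement[i] : tokens[i]);
            out.append(after[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyRewriteBuffer buf =
            new ToyRewriteBuffer("program", " ", "Hello", "(", "output", ")", ";");
        buf.replace(0, 0, "class");   // PROGRAM becomes "class"
        buf.delete(3, 5);             // drop "(", "output", ")"
        buf.insertAfter(6, "\n");     // newline after ";"
        System.out.println(buf.toOriginalString()); // program Hello(output);
        System.out.print(buf);                      // class Hello;
    }
}
```

The toy ignores details that a real implementation has to handle, such as conflicting operations on overlapping ranges.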


For instance, one can write:
programHeading
    : PROGRAM identifier LPAREN identifierList RPAREN SEMI 
    {
    programName = $identifier.text;
    tokens.replace($PROGRAM,"class"); // replaces PROGRAM by "class"
    tokens.delete($LPAREN,$RPAREN); // deletes everything from LPAREN to RPAREN inclusive (incl. identifierList, which can be long)
    }
    ;
Or:
ifStatement
    : i=IF expression t=THEN s=statement 
      {
      tokens.insertBefore($expression.start," ("); // refers to the start token of expression
      tokens.insertAfter($expression.stop,") "); // refers to the stop token of expression
      tokens.insertAfter($t,"\n"); // refers to t, a label for THEN (t=THEN)
      }
    ;
The parameters of those methods are generally references to rule elements (or to their labels). For example, consider the rule ifStatement : i=IF expression:
* $IF and $i both refer to the same Token object
* $expression.start refers to the first token object of the expression
* $expression.stop refers to the last token object of the expression
* $expression.text refers to the text of the expression (a String)
* $expression.tree refers to the tree of the expression (typed as Object, so it must be cast, e.g. (Tree) $expression.tree)
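
The relation between start, stop and text can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: ToyRuleMatch is a hypothetical class, tokens are plain strings, and a matched rule is modeled as an inclusive range of token positions.

```java
/** Toy model of a matched rule as an inclusive token range; names are hypothetical. */
class ToyRuleMatch {
    final String[] tokens; // the whole token list
    final int start;       // position of the first token matched by the rule
    final int stop;        // position of the last token matched by the rule

    ToyRuleMatch(String[] tokens, int start, int stop) {
        this.tokens = tokens;
        this.start = start;
        this.stop = stop;
    }

    /** Concatenated text of tokens start..stop, analogous to $rule.text. */
    String text() {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= stop; i++) {
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"if", " ", "x", ">", "0", " ", "then"};
        ToyRuleMatch expression = new ToyRuleMatch(tokens, 2, 4);
        System.out.println(tokens[expression.start]); // x
        System.out.println(tokens[expression.stop]);  // 0
        System.out.println(expression.text());        // x>0
    }
}
```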

Other examples using text and tree:
assignmentStatement
    : variable ASSIGN expression
    {
      if ($variable.text.equals("foo")) { .... }
    }
    ;
parameterGroup
    : identifierList COLON t=typeIdentifier
    {
      System.err.println("Warning: "+$t.text+", line "+((Tree)$t.tree).getLine());
    }
    ;

Main


The main method of a source-to-source translator using TokenRewriteStream looks like this:
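import org.antlr.runtime.ANTLRFileStream;

public class Translator {
  public static void main(String[] args) throws Exception {
    PascalLexer lexer = new PascalLexer(new ANTLRFileStream(args[0]));
    // MyTokenRewriteStream: a TokenRewriteStream subclass that provides toChangedString(),
    // e.g. the EfficientTokenRewriteStream of the next section
    MyTokenRewriteStream tokens = new MyTokenRewriteStream(lexer);
    PascalParser parser = new PascalParser(tokens);
    parser.tokens = tokens; // field assumed to be declared in the grammar's @members section
    parser.program(); // parse, recording the rewrites along the way
    System.out.print(tokens.toChangedString()); // emit after changes
  }
}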

Performance considerations

TokenRewriteStream is lazy: it records the rewrite operations and applies them only when toString() is called. However, when an action uses .text (for instance in $variable.text), ANTLR generates code that actually calls toString() on the TokenRewriteStream object, so .text returns the rewritten String. As a result, if parsing triggers many .text accesses and many rewrites, it becomes extremely slow.
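
The following toy cost model illustrates why this combination can hurt. It is not a benchmark of ANTLR: ToyLazyTextCost is a hypothetical class, and it assumes that every rewritten-text query has to revisit all operations recorded so far, which is a simplification. Under that assumption, a parse that records one rewrite and reads .text once per rule does an amount of work that grows quadratically with the number of rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy cost model of lazy rewriting; an illustration, not a benchmark of ANTLR. */
class ToyLazyTextCost {
    private final List<String> pendingOps = new ArrayList<>();
    long opVisits = 0; // how many recorded operations were revisited so far

    void recordRewrite(String description) {
        pendingOps.add(description);
    }

    /** Models a rewritten-text query: every pending operation is revisited (assumption). */
    void queryRewrittenText() {
        opVisits += pendingOps.size();
    }

    /** Models an original-text query: pending operations are ignored. */
    void queryOriginalText() {
        // no per-operation work
    }

    /** Simulates a parse where each rule records one rewrite and reads .text once. */
    static long simulate(int rules, boolean rewrittenText) {
        ToyLazyTextCost stream = new ToyLazyTextCost();
        for (int i = 0; i < rules; i++) {
            stream.recordRewrite("op" + i);
            if (rewrittenText) {
                stream.queryRewrittenText();
            } else {
                stream.queryOriginalText();
            }
        }
        return stream.opVisits;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1000, true));  // 500500
        System.out.println(simulate(1000, false)); // 0
    }
}
```

In this model, the only way to avoid the quadratic term is to stop answering text queries from the rewritten view.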

A workaround is to make .text return the original String. This can be achieved with the following class:
import org.antlr.runtime.TokenRewriteStream;

public class EfficientTokenRewriteStream extends TokenRewriteStream {

  public EfficientTokenRewriteStream(PascalLexer lexer) {
    super(lexer);
  }

  // used when .text is evaluated: returns the original text, does not rewrite
  @Override
  public String toString(int start, int end) {
    return toOriginalString(start, end);
  }

  // rewritten text for the tokens a..b
  public String toChangedString(int a, int b) {
    fill();
    return super.toString(a, b);
  }

  // rewritten text for the whole stream
  public String toChangedString() {
    return toChangedString(MIN_TOKEN_INDEX, size() - 1);
  }
}