In a previous post, I briefly described how the Groovy compiler takes advantage of ANTLR in the compilation process. In this post, I would like to elaborate on that. Simply put, Groovy goes through the following general phases:
In the past, Groovy had its own internal parser. When ANTLR was added to Groovy, it seems that the compiler still first reads the source into the old representation, called the CST (concrete syntax tree), and then uses the ANTLR parser plugin to read it and adapt it to the newer one. After the source code is parsed, a conversion facility turns the CST into the AST of ANTLR.
Groovy introduces its own complete set of data structures to represent a compilation unit, which is usually a script file containing at least one class or some Groovy statements. The CST is built to represent the whole structure of the script and is then converted to the AST. More clarification on this is required.
Byte Code Generation
When you are first introduced to ANTLR, the recommended approach to using it in programming language development is:
- Develop a parser grammar from which ANTLR generates the lexer and the parser for the language. It is recommended to use ANTLR's AST output format, which yields a tree for the parsed input according to the grammar.
- Develop a tree grammar that matches the parser grammar, so that when a sample program is given as input, you can traverse its structure and attach the required actions to different nodes of the program.
- Develop actions, or take advantage of string templates, to produce some form of output such as a translation to lower-level opcodes.
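To make the three steps concrete, here is a minimal sketch in plain Java. It does not use ANTLR: the lexer and parser are written by hand as a stand-in for what ANTLR would generate from a grammar, and the class names (Node, TinyParser, TreeWalker) and the toy grammar are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical AST node: an operator or integer token plus its children.
final class Node {
    final String text;
    final List<Node> children = new ArrayList<>();

    Node(String text, Node... kids) {
        this.text = text;
        for (Node k : kids) children.add(k);
    }
}

// Step 1: hand-written stand-in for the lexer and parser that ANTLR would
// generate from:  expr : term ('+' term)* ;  term : INT ('*' INT)* ;
final class TinyParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    TinyParser(String source) {
        // Lexer: split the input into integer and operator tokens.
        Matcher m = Pattern.compile("\\d+|[+*]").matcher(source);
        while (m.find()) tokens.add(m.group());
    }

    Node expr() {
        Node left = term();
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++;
            left = new Node("+", left, term());
        }
        return left;
    }

    private Node term() {
        Node left = atom();
        while (pos < tokens.size() && tokens.get(pos).equals("*")) {
            pos++;
            left = new Node("*", left, atom());
        }
        return left;
    }

    private Node atom() {
        return new Node(tokens.get(pos++));
    }
}

// Steps 2 and 3: walk the tree and run an action for each kind of node.
final class TreeWalker {
    static int evaluate(Node n) {
        switch (n.text) {
            case "+": return evaluate(n.children.get(0)) + evaluate(n.children.get(1));
            case "*": return evaluate(n.children.get(0)) * evaluate(n.children.get(1));
            default:  return Integer.parseInt(n.text);
        }
    }

    public static void main(String[] args) {
        Node ast = new TinyParser("2 + 3 * 4").expr();
        System.out.println(evaluate(ast)); // prints 14
    }
}
```

Here the action is simply evaluation; in a real translator the same walk would emit code instead.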
Well, Groovy does not take the recommended approach. Instead, and because of the CST, it relies heavily on the Visitor pattern. Groovy introduces a GroovyCodeVisitor that declares a method for every construct of the Groovy language that the compile process has to handle. One implementation of this visitor is AsmClassGenerator. As its name says, it uses ASM to generate bytecode, and it is built on the visitor pattern.
Specifically, when bytecode generation begins, AsmClassGenerator receives an instance of ClassNode, which is the root of the whole source unit after it has been parsed and converted to the AST. It traverses the children of the root node and visits every node in the tree. At every node, Groovy takes advantage of an ASM facility called ClassVisitor, which is itself based on the visitor pattern. So, for instance, at the level of a statement or a class declaration, the Groovy class generator calls the different visit methods of ASM's ClassVisitor to generate the bytecode for the current node of the AST. The concrete ClassVisitor that Groovy uses is ClassWriter; it has methods to generate the bytecode for different constructs according to the JVM bytecode specification.
So, in the end, when the starting class node has been completely visited, ASM's class visitor on the other side holds all the bytecode for the whole class.
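The idea of a generator that visits each AST node and emits instructions can be sketched in a few lines of Java. This is only an illustration and not Groovy's actual code: CodeVisitor, ConstantExpr, BinaryExpr and InstructionGenerator are hypothetical names, and a list of mnemonic strings stands in for ASM's ClassWriter so that the example runs without the ASM library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor with one method per construct, in the spirit of GroovyCodeVisitor.
interface CodeVisitor {
    void visitConstant(ConstantExpr e);
    void visitBinary(BinaryExpr e);
}

// Hypothetical AST nodes; each one dispatches to the matching visit method.
interface Expr {
    void accept(CodeVisitor v);
}

final class ConstantExpr implements Expr {
    final int value;
    ConstantExpr(int value) { this.value = value; }
    public void accept(CodeVisitor v) { v.visitConstant(this); }
}

final class BinaryExpr implements Expr {
    final char op;
    final Expr left, right;
    BinaryExpr(char op, Expr left, Expr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }
    public void accept(CodeVisitor v) { v.visitBinary(this); }
}

// Plays the role of the class generator. The string list is an assumption
// standing in for the bytecode writer on the ASM side.
final class InstructionGenerator implements CodeVisitor {
    final List<String> code = new ArrayList<>();

    public void visitConstant(ConstantExpr e) {
        code.add("LDC " + e.value);
    }

    public void visitBinary(BinaryExpr e) {
        // Operands first, then the operator: the order a stack machine expects.
        e.left.accept(this);
        e.right.accept(this);
        code.add(e.op == '+' ? "IADD" : "IMUL");
    }
}

final class GeneratorDemo {
    public static void main(String[] args) {
        Expr ast = new BinaryExpr('+', new ConstantExpr(2),
                new BinaryExpr('*', new ConstantExpr(3), new ConstantExpr(4)));
        InstructionGenerator generator = new InstructionGenerator();
        ast.accept(generator);
        System.out.println(generator.code); // [LDC 2, LDC 3, LDC 4, IMUL, IADD]
    }
}
```

Once the root node has been visited, the generator holds the instructions for the whole tree, which mirrors how the class visitor ends up with the bytecode for the whole class.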
Regardless of the target runtime, a language is first specified [1] with a formal syntax and semantics, usually known as the grammar of the language. The most important goal of a language specification is to create common ground for understanding the language, discussing it, and implementing it. Accordingly, a language implementation essentially creates another software system that runs programs expressed according to the specification [2]. There are two general schemes in this regard, namely translation and interpretation. In translation, the program is directly translated into code the underlying machine understands, while in interpretation the program is basically run line by line by a runtime environment. Moreover, some languages such as Java take advantage of both methods; this is where the concept of a virtual machine [3] comes in.
Classically, a language comes with a set of tools including a compiler or an interpreter. The programmer writes a program and then compiles it into machine code that is directly executable. In the case of VM languages, however, the programmer first translates (compiles) the code into an intermediate bytecode [4] that the virtual machine understands. In the second pass, the virtual machine is responsible for running the bytecode on the underlying platform, which is a form of interpretation. The first pass produces the same bytecode on all platforms, so the programming language becomes platform-independent. For the language designers, however, this creates the additional task of implementing a virtual machine for every target platform and OS.
Now, VM languages such as Java have created such a good platform that they have become a primary target in the design and implementation of new programming languages. Instead of compiling the new language directly to machine code, designers tend to produce output that is compatible with the JVM, so the JVM takes care of executing the program. Two different approaches arise here:
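The second pass can be pictured as a loop that reads one instruction at a time and updates a stack. The following is a minimal sketch of such an interpreter in Java; the opcodes PUSH, ADD and MUL and the class TinyVm are made up for illustration and are not real JVM bytecode.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical miniature virtual machine with a made-up instruction set.
final class TinyVm {
    static final int PUSH = 0, ADD = 1, MUL = 2;

    // Second pass: interpret the platform-independent instruction stream.
    static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case PUSH: stack.push(code[++pc]); break; // operand follows the opcode
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                default:   throw new IllegalStateException("unknown opcode " + code[pc]);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // What a first-pass compiler might emit for 2 + 3 * 4.
        int[] program = { PUSH, 2, PUSH, 3, PUSH, 4, MUL, ADD };
        System.out.println(run(program)); // prints 14
    }
}
```

The same program array gives the same result wherever TinyVm itself can run, which is the platform independence described above; the cost is that the VM has to exist on every platform.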
- Directly create bytecode for JVM
- Create Java sources and then compile them into bytecode for JVM
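The difference between the two approaches can be sketched with one toy AST and two back ends. The classes below (ToyNode, Num, Add, BackEndDemo) are hypothetical, and the first back end only prints JVM-style mnemonics as text rather than writing a real class file.

```java
// Hypothetical AST for a toy language with integer literals and addition only.
abstract class ToyNode {
    // Approach 1: append JVM-style stack instructions for this node.
    abstract void emitInstructions(StringBuilder out);

    // Approach 2: render this node as a Java expression.
    abstract String toJavaSource();
}

final class Num extends ToyNode {
    final int value;
    Num(int value) { this.value = value; }

    void emitInstructions(StringBuilder out) {
        out.append("ldc ").append(value).append('\n');
    }

    String toJavaSource() {
        return Integer.toString(value);
    }
}

final class Add extends ToyNode {
    final ToyNode left, right;
    Add(ToyNode left, ToyNode right) {
        this.left = left;
        this.right = right;
    }

    void emitInstructions(StringBuilder out) {
        left.emitInstructions(out);
        right.emitInstructions(out);
        out.append("iadd\n");
    }

    String toJavaSource() {
        return "(" + left.toJavaSource() + " + " + right.toJavaSource() + ")";
    }
}

final class BackEndDemo {
    // Wraps the expression in a class that a Java compiler could then turn into bytecode.
    static String toJavaClass(String className, ToyNode body) {
        return "public class " + className + " {\n"
             + "    public static int eval() { return " + body.toJavaSource() + "; }\n"
             + "}\n";
    }

    public static void main(String[] args) {
        ToyNode program = new Add(new Num(1), new Add(new Num(2), new Num(3)));

        StringBuilder instructions = new StringBuilder();
        program.emitInstructions(instructions);
        System.out.print(instructions);                      // approach 1
        System.out.print(toJavaClass("Generated", program)); // approach 2
    }
}
```

The second back end is usually simpler to write and debug because the Java compiler does the remaining work, while the first gives direct control over what ends up in the class file.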
Case Study: Groovy
Groovy is a dynamic language that supports and favors the functional programming paradigm and is now widely used in domain-specific applications. It is interesting to take a look at its design and implementation. Groovy makes good use of ANTLR [5], a set of tools for language processing. For instance, the Groovy grammar and syntax are specified in ANTLR's grammar language. Based on this grammar, ANTLR can produce various tools, including a lexer and a parser, both of which are fundamental elements of a compiler for any language. Another advantage of ANTLR is that it lets you choose what to produce from the grammar provided. One of the options is the abstract syntax tree (AST) [6], an intermediate data structure that is very useful while parsing and compiling source code [7]. After the Groovy compiler has scanned the source code of the program, it receives an AST instance for the program through a parser plugin generated by ANTLR. Briefly, the compilation of a source code unit in Groovy is as follows:
- Scanning the source and parsing it using the parser plugin generated by ANTLR based on the grammar
- Obtaining the AST instance for the source code unit
- Applying additional phases such as code optimization, code semantic analysis and verification
- Generating output
The whole process is heavily based on the Visitor pattern [8]. In step 4, the Groovy compiler uses the visitor pattern and another library called ASM [9] to generate bytecode for the JVM. ASM is a library that helps manipulate or dynamically generate bytecode, i.e. the “.class” files that contain the instructions for the JVM. In the Groovy compiler, the whole AST instance is visited, and each node of the tree is known to have a specific representation as JVM bytecode; thus, after all the nodes in the AST have been visited, the root of the tree can collect the bytecode for the whole source unit. This is a very neat and modular way of creating a language on top of the Java Virtual Machine. In addition, the Groovy compiler also has the option to generate Java source along with the bytecode. Clearly, once the AST instance is available, there are a number of things that can easily be done.
Case Study: Scala
Scala is a statically typed, functional, scalable language built on the Java platform. Like Groovy, it runs on the JVM. However, speaking of language implementation, Scala takes another interesting approach. As Scala is a functional language, it takes advantage of this feature in compiling its source units. Specifically, Scala introduces its own parser, which is itself written in Scala. The parser declares all the syntax rules defined in the language; there is also a repository for all the rules in the Scala language [10]. Apart from this, Scala provides a compiler and an interpreter for the language.
Finally, the case studies show that, beyond the classical ones, there are further decisions to make when implementing a language on top of a virtual machine such as the JVM. Groovy and Scala each take a different approach; while different, each has its own advantages and applications.
- [1]: http://en.wikipedia.org/wiki/Programming_language_specification
- [2]: http://en.wikipedia.org/wiki/Programming_language_implementation
- [3]: http://en.wikipedia.org/wiki/Virtual_machine
- [4]: http://en.wikipedia.org/wiki/Bytecode
- [5]: http://www.antlr.org/
- [6]: http://en.wikipedia.org/wiki/Abstract_syntax_tree
- [7]: http://www.antlr.org/wiki/display/ANTLR3/Interfacing+AST+with+Java
- [8]: http://en.wikipedia.org/wiki/Visitor_pattern
- [9]: http://asm.ow2.org/
- [10]: http://code.google.com/p/scala-rules/