Getting started with ANTLR: building a simple expression language

This post is the first in a series. The goal of the series is to describe how to create a useful language together with all of its supporting tools.

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language, but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very little on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will let us define variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the use of parentheses

An example of a valid file:

var a = 10 / 3
var b = (5 + 3) * 2
var c = a / b

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • Gradle as our build system
  • Kotlin to write the code. It will be very basic Kotlin, given that I have just started learning it.

Setting up the project

Our build.gradle file will look like this:

buildscript {
   ext.kotlin_version = '1.0.3'
  
   repositories {
     mavenCentral()
     maven {
        name 'JFrog OSS snapshot repo'
     }
     jcenter()
   }
  
   dependencies {
     classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
   }
}
  
apply plugin: 'kotlin'
apply plugin: 'java'
apply plugin: 'idea'
apply plugin: 'antlr'
  
repositories {
  mavenLocal()
  mavenCentral()
  jcenter()
}
  
dependencies {
  antlr "org.antlr:antlr4:4.5.1"
  compile "org.antlr:antlr4-runtime:4.5.1"
  compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
  compile "org.jetbrains.kotlin:kotlin-reflect:$kotlin_version"
  testCompile "org.jetbrains.kotlin:kotlin-test:$kotlin_version"
  testCompile "org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version"
  testCompile 'junit:junit:4.12'
}
  
generateGrammarSource {
    maxHeapSize = "64m"
    arguments += ['-package', 'me.tomassetti.langsandbox']
    outputDirectory = new File("generated-src/antlr/main/me/tomassetti/langsandbox".toString())
}
compileJava.dependsOn generateGrammarSource
sourceSets {
    generated {
        java.srcDir 'generated-src/antlr/main/'
    }
}
compileJava.source sourceSets.generated.java, sourceSets.main.java
  
clean{
    delete "generated-src"
}
  
idea {
    module {
        sourceDirs += file("generated-src/antlr/main")
    }
}

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:

lexer grammar SandyLexer;
  
// Whitespace
NEWLINE            : '\r\n' | '\r' | '\n' ;
WS                 : [\t ]+ ;
  
// Keywords
VAR                : 'var' ;
  
// Literals
INTLIT             : '0'|[1-9][0-9]* ;
DECLIT             : ('0'|[1-9][0-9]*) '.' [0-9]+ ;
  
// Operators
PLUS               : '+' ;
MINUS              : '-' ;
ASTERISK           : '*' ;
DIVISION           : '/' ;
ASSIGN             : '=' ;
LPAREN             : '(' ;
RPAREN             : ')' ;
  
// Identifiers
ID                 : [_]*[a-z][A-Za-z0-9_]* ;

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.
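
To get an intuition for how a set of lexer rules like this turns text into tokens, here is a minimal hand-written sketch in Kotlin. It is not the code ANTLR generates: the names rules and tokenize are made up for this illustration, and the patterns are regular-expression approximations of the intent of the rules above. It relies on the two disambiguation rules an ANTLR lexer applies: the longest match wins, and on a tie the rule defined first wins.

```kotlin
// Illustrative only: a hand-written tokenizer mimicking how the lexer rules
// are applied. This is NOT what ANTLR generates.

// Hypothetical rule table: rule name paired with a regex approximating it.
// The order matters, because it is used to break ties.
val rules: List<Pair<String, Regex>> = listOf(
    "NEWLINE" to Regex("\r\n|\r|\n"),
    "WS" to Regex("[\t ]+"),
    "VAR" to Regex("var"),
    "INTLIT" to Regex("0|[1-9][0-9]*"),
    // The integer part is grouped so the alternation applies only to it.
    "DECLIT" to Regex("(0|[1-9][0-9]*)\\.[0-9]+"),
    "PLUS" to Regex("\\+"),
    "MINUS" to Regex("-"),
    "ASTERISK" to Regex("\\*"),
    "DIVISION" to Regex("/"),
    "ASSIGN" to Regex("="),
    "LPAREN" to Regex("\\("),
    "RPAREN" to Regex("\\)"),
    "ID" to Regex("_*[a-z][A-Za-z0-9_]*")
)

// Returns the names of the tokens found in the input, skipping WS and
// ending with "EOF", in the same shape the lexer tests use.
fun tokenize(input: String): List<String> {
    val tokens = mutableListOf<String>()
    var pos = 0
    while (pos < input.length) {
        var bestName: String? = null
        var bestLength = 0
        for ((name, regex) in rules) {
            // Only accept a match that starts exactly at the current position.
            val match = regex.find(input, pos)
            if (match != null && match.range.first == pos) {
                val length = match.value.length
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (length > bestLength) {
                    bestName = name
                    bestLength = length
                }
            }
        }
        val name = bestName
            ?: throw IllegalArgumentException("Unexpected character '${input[pos]}' at offset $pos")
        if (name != "WS") tokens.add(name)
        pos += bestLength
    }
    tokens.add("EOF")
    return tokens
}

fun main() {
    // "var" matches both VAR and ID with the same length: VAR is defined first.
    println(tokenize("var a = 10 / 3"))  // [VAR, ID, ASSIGN, INTLIT, DIVISION, INTLIT, EOF]
    // "variable" is a longer match for ID than "var" is for VAR, so ID wins.
    println(tokenize("variable = 0.5"))  // [ID, ASSIGN, DECLIT, EOF]
}
```

The second call shows why the order of the rules alone is not enough: VAR comes before ID, yet variable is still an identifier because the longer match is preferred.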

Testing the lexer

Testing is always important, but when building languages it is absolutely critical: if the tools supporting your language are not correct, every program written in that language could be affected. So let's start by testing the lexer: we will simply verify that the sequence of tokens the lexer produces is the one we expect.

package me.tomassetti.sandy
  
import me.tomassetti.langsandbox.SandyLexer
import org.antlr.v4.runtime.ANTLRInputStream
import java.io.*
import java.util.*
import org.junit.Test as test
import kotlin.test.*
  
class SandyLexerTest {
  
    fun lexerForCode(code: String) = SandyLexer(ANTLRInputStream(StringReader(code)))
  
    fun lexerForResource(resourceName: String) = SandyLexer(ANTLRInputStream(this.javaClass.getResourceAsStream("/${resourceName}.sandy")))
  
    fun tokens(lexer: SandyLexer): List<String> {
        val tokens = LinkedList<String>()
        do {
            val t = lexer.nextToken()
            when (t.type) {
                -1 -> tokens.add("EOF")
                else -> if (t.type != SandyLexer.WS) tokens.add(lexer.ruleNames[t.type - 1])
            }
        } while (t.type != -1)
        return tokens
    }
  
    @test fun parseVarDeclarationAssignedAnIntegerLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                tokens(lexerForCode("var a = 1")))
    }
  
    @test fun parseVarDeclarationAssignedADecimalLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "DECLIT", "EOF"),
                tokens(lexerForCode("var a = 1.23")))
    }
  
    @test fun parseVarDeclarationAssignedASum() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "PLUS", "INTLIT", "EOF"),
                tokens(lexerForCode("var a = 1 + 2")))
    }
  
    @test fun parseMathematicalExpression() {
        assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "MINUS", "INTLIT", "EOF"),
                tokens(lexerForCode("1 + a * 3 / 4 - 5")))
    }
  
    @test fun parseMathematicalExpressionWithParenthesis() {
        assertEquals(listOf("INTLIT", "PLUS", "LPAREN", "ID", "ASTERISK", "INTLIT", "RPAREN", "MINUS", "DECLIT", "EOF"),
                tokens(lexerForCode("1 + (a * 3) - 5.12")))
    }
}

Conclusions and next steps

We started with the first small step: we set up the project and built the lexer.

There is a long way to go before the language is usable in practice, but we have started. Next we will work on the parser with the same approach: building something simple that we can test and compile from the command line.


Federico Tomassetti

Federico has a PhD in Polyglot Software Development. He is fascinated by all forms of software development with a focus on Model-Driven Development and Domain Specific Languages.
2 Comments
Igor Ganapolsky
8 years ago

Thank you for this introduction on Lexers. I am trying to find the link to the next post in your blog series. Where is it located?

Federico Tomassetti
8 years ago

Hi Igor, you are welcome. Here is the 8th post of the series. It has links to all the previous posts at the top:
https://tomassetti.me/generating-bytecode/

Also, I reworked this series, expanded it and updated it and wrote a book about building languages:
https://leanpub.com/create_languages