Getting started with ANTLR: building a simple expression language
This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.
- Building a lexer
- Building a parser
- Creating an editor with syntax highlighting
- Build an editor with autocompletion
- Mapping the parse tree to the abstract syntax tree
- Model to model transformations
- Validation
- Generating bytecode
After writing this series of posts I refined my method, expanded it, and clarified it into a book titled How to create pragmatic, lightweight languages.
In this post, we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.
I think that tool support is vital for a language: for this reason we will start with an extremely simple language but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very little material on building the rest of the infrastructure needed to make using a language practical and effective.
I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.
The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.
The language
The language will let us define variables and expressions. We will support:
- integer and decimal literals
- variable definition and assignment
- the basic mathematical operations (addition, subtraction, multiplication, division)
- the usage of parentheses
Example of a valid file:
    var a = 10 / 3
    var b = (5 + 3) * 2
    var c = a / b
The tools we will use
We will:
- use ANTLR to generate the lexer and the parser
- use Gradle as our build system
- write the code in Kotlin. It will be very basic Kotlin, given that I have just started learning it.
Set up the project
Our build.gradle file will look like this:
    buildscript {
        ext.kotlin_version = '1.3.70'

        repositories {
            mavenCentral()
            maven {
                name 'JFrog OSS snapshot repo'
            }
            jcenter()
        }

        dependencies {
            classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
        }
    }

    apply plugin: 'kotlin'
    apply plugin: 'java'
    apply plugin: 'idea'
    apply plugin: 'antlr'

    repositories {
        mavenLocal()
        mavenCentral()
        jcenter()
    }

    dependencies {
        antlr "org.antlr:antlr4:4.8"
        compile "org.antlr:antlr4-runtime:4.8"
        compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
        compile "org.jetbrains.kotlin:kotlin-reflect:$kotlin_version"
        testCompile "org.jetbrains.kotlin:kotlin-test:$kotlin_version"
        testCompile "org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version"
        testCompile 'junit:junit:4.13'
    }

    generateGrammarSource {
        maxHeapSize = "64m"
        arguments += ['-package', 'me.tomassetti.langsandbox']
        outputDirectory = new File("generated-src/antlr/main/me/tomassetti/langsandbox".toString())
    }
    compileJava.dependsOn generateGrammarSource

    sourceSets {
        generated {
            java.srcDir 'generated-src/antlr/main/'
        }
    }
    compileJava.source sourceSets.generated.java, sourceSets.main.java

    clean {
        delete "generated-src"
    }

    idea {
        module {
            sourceDirs += file("generated-src/antlr/main")
        }
    }
We can run:
- ./gradlew idea to generate the IDEA project files
- ./gradlew generateGrammarSource to generate the ANTLR lexer and parser
Implementing the lexer
We will build the lexer and the parser in two separate files. This is the lexer:
    lexer grammar SandyLexer;

    // Whitespace
    NEWLINE            : '\r\n' | '\r' | '\n' ;
    WS                 : [\t ]+ ;

    // Keywords
    VAR                : 'var' ;

    // Literals
    INTLIT             : '0' | [1-9][0-9]* ;
    // The integer part is grouped so that the fraction applies to '0' as well (e.g. 0.5)
    DECLIT             : ('0' | [1-9][0-9]*) '.' [0-9]+ ;

    // Operators
    PLUS               : '+' ;
    MINUS              : '-' ;
    ASTERISK           : '*' ;
    DIVISION           : '/' ;
    ASSIGN             : '=' ;
    LPAREN             : '(' ;
    RPAREN             : ')' ;

    // Identifiers
    ID                 : [_]*[a-z][A-Za-z0-9_]* ;
Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.
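One detail worth checking in the DECLIT rule is operator precedence. In ANTLR, as in regular expressions, `|` binds more loosely than sequencing, so whether the integer part is grouped decides if an input such as 0.5 is accepted as a single decimal literal. The following is a minimal sketch that uses plain Kotlin regexes as stand-ins for the two readings of the rule. It does not use ANTLR, and the names `ungrouped`, `grouped`, `matchesUngrouped` and `matchesGrouped` are hypothetical, introduced only for illustration.

```kotlin
// Regex stand-ins for two readings of the DECLIT rule (this is not ANTLR syntax).
// Ungrouped: "0"  OR  "nonzero-leading digits, dot, digits".
val ungrouped = Regex("0|[1-9][0-9]*\\.[0-9]+")
// Grouped: ("0" OR "nonzero-leading digits"), then dot, then digits.
val grouped = Regex("(0|[1-9][0-9]*)\\.[0-9]+")

// Hypothetical helpers: true when the whole text matches the pattern.
fun matchesUngrouped(text: String): Boolean = ungrouped.matches(text)
fun matchesGrouped(text: String): Boolean = grouped.matches(text)
```

Under these assumptions, `matchesUngrouped("0.5")` is false while `matchesGrouped("0.5")` is true, and both accept `1.23`. The ungrouped reading also accepts a bare `0`, which is meant to be an integer literal.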
Testing the lexer
Testing is always important, but while building languages it is absolutely critical: if the tools supporting your language are not correct, this could affect every program you build with it. So let's start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we expect.
    package me.tomassetti.sandy

    import me.tomassetti.langsandbox.SandyLexer
    import org.antlr.v4.runtime.CharStreams
    import java.util.*
    import kotlin.test.assertEquals
    import org.junit.Test as test

    class SandyLexerTest {

        // Builds a lexer over an in-memory string
        fun lexerForCode(code: String) = SandyLexer(CharStreams.fromString(code))

        // Builds a lexer over a .sandy file found on the test classpath
        fun lexerForResource(resourceName: String) =
            SandyLexer(CharStreams.fromStream(this.javaClass.getResourceAsStream("/${resourceName}.sandy")))

        // Collects the names of all tokens, skipping whitespace and ending with EOF
        fun tokens(lexer: SandyLexer): List<String> {
            val tokens = LinkedList<String>()
            do {
                val t = lexer.nextToken()
                when (t.type) {
                    -1 -> tokens.add("EOF")
                    // Token types start at 1, while ruleNames is indexed from 0
                    else -> if (t.type != SandyLexer.WS) tokens.add(lexer.ruleNames[t.type - 1])
                }
            } while (t.type != -1)
            return tokens
        }

        @test fun parseVarDeclarationAssignedAnIntegerLiteral() {
            assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                    tokens(lexerForCode("var a = 1")))
        }

        @test fun parseVarDeclarationAssignedADecimalLiteral() {
            assertEquals(listOf("VAR", "ID", "ASSIGN", "DECLIT", "EOF"),
                    tokens(lexerForCode("var a = 1.23")))
        }

        @test fun parseVarDeclarationAssignedASum() {
            assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "PLUS", "INTLIT", "EOF"),
                    tokens(lexerForCode("var a = 1 + 2")))
        }

        @test fun parseMathematicalExpression() {
            assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "MINUS", "INTLIT", "EOF"),
                    tokens(lexerForCode("1 + a * 3 / 4 - 5")))
        }

        @test fun parseMathematicalExpressionWithParenthesis() {
            assertEquals(listOf("INTLIT", "PLUS", "LPAREN", "ID", "ASTERISK", "INTLIT", "RPAREN", "MINUS", "DECLIT", "EOF"),
                    tokens(lexerForCode("1 + (a * 3) - 5.12")))
        }
    }
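The expected sequences in these tests follow from how an ANTLR lexer picks a rule: the rule matching the longest piece of input wins, and when two rules match the same length the one defined first wins. That is why `var` comes out as VAR rather than ID, and `1.23` as DECLIT rather than INTLIT. The sketch below emulates that behaviour in plain Kotlin, without ANTLR. The `rules` table and the `tokenize` function are hypothetical names, the patterns are regex approximations of the grammar rules, and DECLIT is assumed to have its integer part grouped.

```kotlin
import java.util.regex.Pattern

// Regex approximations of the lexer rules, in grammar order (order breaks ties).
val rules: List<Pair<String, Pattern>> = listOf(
    "NEWLINE" to Pattern.compile("\r\n|\r|\n"),
    "WS" to Pattern.compile("[\t ]+"),
    "VAR" to Pattern.compile("var"),
    "INTLIT" to Pattern.compile("0|[1-9][0-9]*"),
    "DECLIT" to Pattern.compile("(0|[1-9][0-9]*)\\.[0-9]+"),
    "PLUS" to Pattern.compile("\\+"),
    "MINUS" to Pattern.compile("-"),
    "ASTERISK" to Pattern.compile("\\*"),
    "DIVISION" to Pattern.compile("/"),
    "ASSIGN" to Pattern.compile("="),
    "LPAREN" to Pattern.compile("\\("),
    "RPAREN" to Pattern.compile("\\)"),
    "ID" to Pattern.compile("_*[a-z][A-Za-z0-9_]*")
)

// Hypothetical helper: returns token names, skipping WS and ending with EOF.
fun tokenize(input: String): List<String> {
    val names = mutableListOf<String>()
    var pos = 0
    while (pos < input.length) {
        var bestName: String? = null
        var bestLength = 0
        for ((ruleName, pattern) in rules) {
            val matcher = pattern.matcher(input)
            matcher.region(pos, input.length)
            // lookingAt() only succeeds for a match starting exactly at pos
            if (matcher.lookingAt()) {
                val length = matcher.end() - pos
                // Strictly longer wins; on equal length the earlier rule is kept
                if (length > bestLength) {
                    bestName = ruleName
                    bestLength = length
                }
            }
        }
        val matched = bestName ?: error("No rule matches at offset $pos")
        if (matched != "WS") names.add(matched)
        pos += bestLength
    }
    names.add("EOF")
    return names
}
```

With this sketch, `tokenize("var a = 1.23")` returns `[VAR, ID, ASSIGN, DECLIT, EOF]`, while `tokenize("variable")` returns `[ID, EOF]` because ID matches more characters than VAR.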
Conclusions and next steps
We started with the first small step: we set up the project and built the lexer.
There is still a long way to go before the language is usable in practice, but we have started. Next we will work on the parser with the same approach: building something simple that we can test and compile from the command line.
Published on Java Code Geeks with permission by Federico Tomassetti, partner at our JCG program. See the original article here: Getting started with ANTLR: building a simple expression language. Opinions expressed by Java Code Geeks contributors are their own.