Software Development

Antlr 4 grammar with spaces

Intro

This post discusses the difficulties and solutions of creating an Antlr grammar without removing spaces from the text being parsed. But first, a brief explanation of what Antlr really is.

Here is the link https://github.com/antlr/antlr4. Antlr can take a text with a well-defined rules and create a syntax tree out of it. It is basically what it is so good for. The user must create the grammar of the DSL (domain-specific language) and feed it to the tool. The tool generates the parser code. It can be a Java code or Javascript code. Antlr has no less than 5 targets (supported languages). This generated parser code can parse the text and create an abstract syntax tree. Then the user can use the generated visitor (it is the well-known visitor pattern) to walk the AST to get something out of it. 

Problem

 Most Antlr grammars remove the whitespace in the lexer like this:

WS
 : [ \r\n\t] + -> skip
 ;

This definitely makes parsing somewhat easier as it transforms this expression:

    q1 + q2 / f2 – ( 4 + 5 )

into a much simpler form:

    q1+q2/f2-(4+5)

Without these spaces it is much easier to write the rules for the Antlr grammar.

But in some cases, it is not possible to remove all spaces because they may be part of a variable name. In the case, I was working on variables were with single quotes like this: ‘Variable 1’. Spaces were part of the variable name. The grammar had to take into account that a space could be part of the variable name but also could be a “useless” character that the user put in for readability like a space after a comma. Special handling of spaces is required to parse this expression:

‘Variable ABC’ * 5 + ‘Variable Q’*( 1 + ‘Variable W’)

Solution

The idea is as follows: if we cannot remove spaces from the text we have to “pull” spaces into parenthesis, operation signs, etc. 

The rules for the operators and parenthesis must change to pull in spaces. Firstly, we need to declare the tokens for spaces:

ANY_SPACE:                    SINGLE_SPACE+;

SINGLE_SPACE:                 ‘ ‘;

We cannot use SINGLE_SPACE* because the Antlr tool correctly shows that SINGLE_SPACE* can match an empty string. 

Then incorporate these tokens into rules for operators (MUL is ‘*’ and Div is ‘/’, rParameterSeparator is for the comma that separates function actual parameters
):

rLeftVarParenthesis:          L_VAR_PAREN ANY_SPACE | L_VAR_PAREN;

rArithmeticOperatorMulDiv:    ANY_SPACE (MUL | DIV) ANY_SPACE | ANY_SPACE (MUL | DIV) | (MUL | DIV) ANY_SPACE | (MUL | DIV);

rParameterSeparator:          ANY_SPACE ‘,’ ANY_SPACE | ANY_SPACE ‘,’ | ‘,’ ANY_SPACE | ‘,’;

Now, this grammar can handle variables with spaces in their names. The visitor when visiting the operator rule can join the text from the children and use trim to remove all unnecessary space. In Javascript it may look like this:

return ctx.children.join(”).trim();

This simple and elegant solution allows for parsing variable names with spaces. 

Published on Java Code Geeks with permission by Vadim Korkin, partner at our JCG program. See the original article here: Antlr 4 grammar with spaces

Opinions expressed by Java Code Geeks contributors are their own.

Vadim Korkin

Vadim is a senior software engineer with lots of experience with different software technologies including Java, Javascript, databases (Oracle and Postgres), HTML/CSS and even machine learning. He likes learning new technologies and using the latest libraries in his work. Vadim also has some personal projects on Github
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Agneshka
Agneshka
10 months ago

Oh, this is a helpful article for me

Back to top button