ANTLR – Semantic Predicates
Parsing simple grammar with antlr is simple. All you have to do is to use regular expressions to describe your language and let antlr generate lexer and parser. Parsing big or complicated languages occasionally require more, because describing them in regular expressions only can be hard or even impossible.
Semantic predicates are java (or C++, or JavaScript, or …) conditions written inside the grammar. Antlr uses them either to choose between multiple alternatives or as additional assertions to be checked. They can be placed inside both lexer and parser, but this post focus only on their usage within the parser. They add a lot of power to antlr.
This post assumes that you have general idea on what is antlr and how generated parser works. If you do not, please read linked posts first. They contain everything needed.
First chapter contains two motivational use cases. Second chapter describes syntax, terminology and shows simple failed semantic predicate. The post then explains how semantic predicates influence prediction and generated code. It also shows how to write useful conditions and how to solve initial use cases. Final chapter wraps it all into short conclusion.
All examples and grammars used in this post are available on Github.
Table of Contents
- Use Cases
- Basics
- Hoisting and Prediction
- Backtracking
- Writing Conditions
- Solving Initial Use Cases
- Wrapping It Up
- Resources
Use Cases
As we spend some time parsing css-like language, both our use cases describe problem we had to solve while writing css part of that parser. First one is about issue encountered while working on pseudo classes and second deals with tricky whitespaces in selectors.
If you never worked with css or are not interested in use cases, skip this chapter.
Keywords – nth
Some css pseudo classes require parameter which can be either a number, an identifier, a selector or something called nth expression. Nth expressions are allowed only inside some pseudo classes and names of these pseudo classes are not reserved keywords in css.
Nth expression is an expression of the form an+b
where a
and b
are optional integers. Examples of valid nth expressions: 5n+2
, -5n-2
, -5n
, -2
, -n
, n
.
We wanted our grammar to accept nth expressions, but only as parameters of pseudo classes where it is actually allowed. We wanted it to reject nth expressions as parameters of all remaining pseudo classes.
All names normally correspond to IDENT
tokens. Creating special token corresponding to nth pseudo class name is unpractical, because they are not reserved keywords. For example, they are also perfectly valid class names or element names. Having special token would force us to replace almost all IDENT
occurrences by IDENT | NTH
.
Therefore, we are left with general identifier IDENT
which can be either normal or nth pseudo class name. Standard syntactical regular expressions are unable to distiguish between them, but semantic predicates can.
Significant Whitespaces
Css has semi-important whitespaces. Semi-important means that most of them represent only ends of tokens and that is where their usefulness ends. For example, whitespaces in declaration are irrelevant. Following declarations are equal:
padding : 2; padding:2;
Most of CSS grammar behave the above way, so there is strong temptation to throw all whitespaces away. However, if we do that, then the next two selectors end up as the same IDENT COLON IDENT LPAREN COLON IDENT RPAREN COLON IDENT LPAREN COLON IDENT RPAREN LBRACE
token stream:
div :not(:enabled) :not(:disabled) {} div:not(:enabled):not(:disabled) {}
Whitespaces in selectors are significant. The first selector is equivalent to div *:not(:enabled) *:not(:disabled)
while the second is not.
Note:
CSS 2.1 grammar available from antlr site ignores this issue. If you want to use it, you have to fix it first.
One solution would be to stop hiding whitespaces. This would force us to add explicit whitespace handing WS*
into all possible places of all parser rules. That would be a lot of work and the resulting grammar would be less readable.
It is also possible to give up on selectors tree building in antlr parser and write custom hand made tree builder for it. This is how we did it originally and we can safely tell that it works, but requires more time and debugging then the final semantic predicates based solution.
Basics
We start with semantic predicates syntax and some needed terminology. This chapter also outlines basics of what happens if predicate fails. We will not go into details, those are described in the next chapter.
Syntax
Semantic predicate is always enclosed inside curly braces followed either by question mark or question mark and double arrow:
{ condition }?
{ condition }?=>
First example uses simple
1+2==3
and 2+3==5
conditions. The grammar is stored inside the Basics.g file:
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER; NUMERAL: '0'..'9'; number: {2+3==5}?=> NUMERAL NUMERAL;
Terminology
Depending on which syntax is used and where it is placed, semantic predicates can be called by one of three different names:
- disambiguating semantic predicate,
- validating semantic predicate,
- gated semantic predicate.
Disambiguating Semantic Predicate
Disambiguating predicates use the shorter {...}?
syntax. However, a predicate is called disambiguating only if it is placed in the beginning of a rule or in the beginning of an alternative.
Disambiguating semantic predicate:
LETTER : 'a'..'z' | 'A'..'Z'; // beginning of a rule rule: {1+2==3}? LETTER LETTER; // beginning of an alternative alternatives: LETTER ( {2+3==5}? LETTER* | {2+3==5}? LETTER+ );
Validating Semantic Predicate
Validating predicates also use the shorter {...}?
syntax. The difference against disambiguating predicates is only in placement. Validating predicates are placed in the middle of a rule or in the middle of an alternative:
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER;
Gated Semantic Predicate
Gated semantic predicates use the longer {...}?=>
syntax. The condition can be placed anywhere.
Gated semantic predicate:
NUMERAL: '0'..'9'; number: {2+3==5}?=> NUMERAL NUMERAL;
Failed Predicates
As we explained in expression language tutorial post, parser starts knowing which rule should correspond to the input and then tries to match it to the input. Matching always starts from left-most element of the rule and continues to the right.
If matching encounters semantic predicate, then it tests whether the condition is satisfied. If it is not satisfied, FailedPredicateException
exception is thrown.
Consider Basics.g grammar shown at the beginning of this chapter:
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER; NUMERAL: '0'..'9'; number: {2+3==5}?=> NUMERAL NUMERAL;
If you open generated BasicsParser
class, you will find that each rule has corresponding method with the same name. Both of them contains a copy of the predicate and both of them throws an exception if the condition is not satisfied:
// inside the generated word() method if ( !((1+2==3)) ) { throw new FailedPredicateException(input, 'word', '1+2==3'); } // inside the generated number() method if ( !((2+3==5)) ) { throw new FailedPredicateException(input, 'number', '2+3==5'); }
Prediction, e.g. what happens if the parser encounters rule with both multiple alternatives and predicates, for example
start: word | number
rule, is described in the next chapter.
Hoisting and Prediction
Depending on where and how you use semantic predicates, the parser may try to avoid failed predicate exceptions. Used strategy is called ‘hoisting’ and it is what makes predicates useful.
This chapter explains what hoisting is and what consequences it has. Then we will explain when it is used and when it is not used.
What It Is
A parser that encountered rule with multiple alternatives has to decide which of these alternatives should be used. If some of them start with a predicate, parser may use that predicate to help with the decision.
Consider the grammar stored in DisambiguatingHoistingNeeded.g file:
LETTER : 'a'..'z' | 'A'..'Z'; word: {1+2==3}? LETTER LETTER; sameWord: {1+2!=3}? LETTER LETTER; start: word | sameWord;
Both word()
and sameWord()
methods of the generated parser contains the usual failed predicate check.
DisambiguatingHoistingNeededParser
class extract:
//inside the word() method if ( !((1+2==3)) ) { throw new FailedPredicateException(input, 'word', '1+2==3'); } //inside the sameWord() method if ( !((1+2!=3)) ) { throw new FailedPredicateException(input, 'sameWord', '1+2!=3'); }
In addition, the code corresponding to the start
rule contains copies of both word
and sameWord
semantic predicates. The part that chooses which rule to use next contains following code (comments are mine):
int LA1_2 = input.LA(3); //predicate copied from the word rule if ( ((1+2==3)) ) { alt1=1; } //predicate copied from the sameWord rule else if ( ((1+2!=3)) ) { alt1=2; } else { NoViableAltException nvae = new NoViableAltException('', 1, 2, input); throw nvae; }
The act of copying the predicate into prediction part of the generated parser is called hoisting.
Consequences
If there would be no hoisting, predicates would act as assertions only. We could use them to validate some conditions and that would be it. The above grammar would be illegal – it has two syntactically equivalent alternatives if you ignore predicates.
As hoisting copies predicates all over the grammar, it has also several limiting consequences for them. It is not just something that happens on the background you can safely ignore:
- each predicate can run several times,
- order in which predicates run may be hard to predict,
- local variables or parameters may not be available in hoisted copies.
Conditions must be without side effects, repeatable and their evaluation order should not matter. If they are hoisted into other rules, then they can not reference local variables or parameters.
Only predicates placed in the beginning of a rule are hoisted into other rules. Hoisting in the case of alternatives is only within the rule. Therefore, you can break the third rule if the predicate is not placed in the beginning of the rule.
When It Is Used
Hoisting is used only when the parser has to decide between multiple rules or alternatives and some of them begin with a predicate. If it is gated predicate e.g., condition inside the {...}?=>
syntax, then the predicate is hoisted no matter what.
If it is disambiguating predicate e.g., condition inside the {...}?
syntax, then the predicate is hoisted only if it is actually needed. The term ‘actually needed’ means that multiple alternatives could match the same input. Otherwise said, it is used only if multiple alternatives are ambiguous for some input.
Predicates placed in the middle of rules or in the middle of alternatives are never hoisted.
If Needed Hoisting – Disambiguating Predicate
Consider the rule
start
in the DisambiguatingHoistingNotNeeded.g grammar:
LETTER : 'a'..'z' | 'A'..'Z'; NUMBER : '0'..'9'; word: {1+2==3}? LETTER LETTER; differentWord: {1+2!=3}? LETTER NUMBER; start: word | differentWord;
The rule
start
has to choose between the word
and differentWord
rules. Both of them start with predicate, but the predicate is not needed in order to differentiate between them. The second token of the word
is LETTER
while the second token of the differentWord
is NUMBER
.
Hoisting will not be used. Instead, the grammar will look into upcoming 2 tokens to distinguish between these rules. To verify, open the start()
method of the generated DisambiguatingHoistingNotNeededParser
class in our sample project: neither 1+2==3
nor 1+2!=3
condition was copied into it.
int alt1=2; switch ( input.LA(1) ) { case LETTER: { switch ( input.LA(2) ) { case LETTER: { alt1=1; } break; case NUMBER: { alt1=2; } break; default: NoViableAltException nvae = new NoViableAltException('', 1, 1, input); throw nvae; }
On the other hand, consider the rule start
in the DisambiguatingHoistingNeeded.g grammar:
LETTER : 'a'..'z' | 'A'..'Z'; word: {1+2==3}? LETTER LETTER; sameWord: {1+2!=3}? LETTER LETTER; start: word | sameWord;
The rule start
has to choose between the word
and sameWord
rules. These two rules match exactly the same sequence of tokens and differ only by the predicate.
Hoisting will be used. To verify, open the start()
method of the generated DisambiguatingHoistingNeededParser
class in our sample project. It contains copies of both 1+2==3
and 1+2!=3
conditions.
int alt1=2; switch ( input.LA(1) ) { case LETTER: { switch ( input.LA(2) ) { case LETTER: { int LA1_2 = input.LA(3); if ( ((1+2==3)) ) { alt1=1; } else if ( ((1+2!=3)) ) { alt1=2; } else { /* ... */ } } break; default: // ... } } break; default: // ... }
The exactly same thing is happening with disambiguating predicates in alternatives.
This will not be hoisted (DisambiguatingHoistingNotNeeded.g grammar):
LETTER : 'a'..'z' | 'A'..'Z'; alternatives: LETTER ( {2+3==5}? LETTER | {2+3==5}? NUMBER );
This will be hoisted (DisambiguatingHoistingNeeded.g grammar):
LETTER : 'a'..'z' | 'A'..'Z'; alternatives: LETTER ( {2+3==5}? LETTER | {2+3==5}? LETTER );
Always Hoisting – Gated Rules
Look at the start
rule in the GatedHoisting.g grammar:
LETTER : 'a'..'z' | 'A'..'Z'; NUMBER: '0'..'9'; word: {1+2==3}?=> LETTER LETTER; differentWord: {1+2!=3}?=> LETTER NUMBER; start: word | differentWord;
The rule start
has to choose between the word
and differentWord
words. Both of them starts with predicate and that predicate is not needed in order to differentiate between them.
However, hoisting will be used because we used gated semantic predicate. To verify, open the start()
method of the generated GatedHoisting
class in our sample project. It contains copies of both 1+2==3
and 1+2!=3
conditions.
int LA1_0 = input.LA(1); if ( (LA1_0==LETTER) && (((1+2==3)||(1+2!=3)))) { int LA1_1 = input.LA(2); if ( (LA1_1==LETTER) && ((1+2==3))) { alt1=1; } else if ( (LA1_1==NUMBER) && ((1+2!=3))) { alt1=2; } else { NoViableAltException nvae = new NoViableAltException('', 1, 1, input); throw nvae; } } else { NoViableAltException nvae = new NoViableAltException('', 1, 0, input); throw nvae; }
The exactly same thing is happening with gated predicates in alternatives. This will be hoisted (GatedHoisting.g grammar):
LETTER : 'a'..'z' | 'A'..'Z'; NUMBER: '0'..'9'; alternatives: LETTER ( {2+3==5}?=> LETTER | {2+3==5}?=>NUMBER );
Never Hoisting – Middle of a Rule
Hoisting is never used if the predicate is located in the middle of a rule or an alternative. It does not matter which predicate type is used. Therefore, if your rules differ only by the predicate, that predicate must be placed in the beginning of a rule or an alternative.
Non-hoisted gated predicate (GatedNoHoisting.g):
LETTER: 'a'..'z' | 'A'..'Z'; NUMBER: '0'..'9'; //gated predicate in the middle of a rule word: LETTER {1+2==3}?=> LETTER; differentWord: LETTER {1+2!=3}?=> NUMBER; start: word | differentWord;
Another non-hoisted gated predicate (GatedNoHoisting.g):
LETTER: 'a'..'z' | 'A'..'Z'; NUMBER: '0'..'9'; //gated predicate in the middle of an alternative alternatives: LETTER ( LETTER {2+3==5}?=> LETTER | LETTER {2+3==5}?=> NUMBER );
Generated parser is in GatedNoHoistingParser
class.
The most important point is that if your rules differ only by the predicate and that predicate is placed in the middle of a rule, antlr will refuse to generate corresponding parser. Next expandable box contains several examples of syntactically incorrect grammars along with antr errors they cause.
Incorrect grammar (SyntacticallyIncorrect.g):
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER; sameWord: LETTER {1+2!=3}? LETTER; start: word | sameWord;
Error in console:
warning(200): org\meri\antlr\predicates\SyntacticallyIncorrect.g:28:6: Decision can match input such as 'LETTER LETTER' using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input error(201): org\meri\antlr\predicates\SyntacticallyIncorrect.g:28:6: The following alternatives can never be matched: 2
Another incorrect grammar (SyntacticallyIncorrect.g):
alternativesStart: LETTER ( LETTER {1+2==3}? | LETTER {1+2!=3}? );
Error in console:
warning(200): org\meri\antlr\predicates\SyntacticallyIncorrect.g:31:27: Decision can match input such as 'LETTER' using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input error(201): org\meri\antlr\predicates\SyntacticallyIncorrect.g:31:27: The following alternatives can never be matched: 2
Yet another incorrect grammar (SyntacticallyIncorrect.g):
LETTER : 'a'..'z' | 'A'..'Z'; gatedWord: LETTER {1+2==3}?=> LETTER; gatedSameWord: LETTER {1+2!=3}?=> LETTER; gatedStart: gatedWord | gatedSameWord;
Error in console:
warning(200): org\meri\antlr\predicates\SyntacticallyIncorrect.g:40:11: Decision can match input such as 'LETTER LETTER' using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input error(201): org\meri\antlr\predicates\SyntacticallyIncorrect.g:40:11: The following alternatives can never be matched: 2
Last incorrect grammar (SyntacticallyIncorrect.g):
LETTER : 'a'..'z' | 'A'..'Z'; gatedAlternativesStart: LETTER ( LETTER {1+2==3}?=> LETTER | LETTER {1+2!=3}?=> LETTER );
Error in console:
warning(200): org\meri\antlr\predicates\SyntacticallyIncorrect.g:43:32: Decision can match input such as 'LETTER LETTER' using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input error(201): org\meri\antlr\predicates\SyntacticallyIncorrect.g:43:32: The following alternatives can never be matched: 2
Nuances
Previous ‘When It Is Used’ sub-chapter shown how predicates behave in clearly hoisted and clearly non-hoisted situations. We selected examples to show as clear and simple situations as possible.
This sub-chapter contains different set of examples. We picked most potentially confusing we have been aware of. All used examples are located in Nuances.g file.
Disambiguating Predicates – Advanced
Hoisted disambiguating predicates are used only if multiple alternatives are ambiguous for current input. Otherwise said, hoisted copy of the predicate is run only if the actual input could be parsed by multiple alternatives.
Example: alternatives in the following rule are not syntactically equivalent, because they do not match the same set of inputs. First alternative matches exactly two letters and second alternative matches any number of letters:
advancedDisambiguating: LETTER ( {1+2==3}? LETTER LETTER | {1+2!=3}? LETTER* );
If the input starts with exactly one LETTER
, then it can not be parsed by the first alternative. As only second alternative matches it, predicate will not be used. Parser will use second alternative and if 1+2!=3
condition happen to be false
, parser will throw failed predicate exception.
However, if the input starts with two letters, then it could be matched by both alternatives and predicate will be used. This is how generated code looks:
int alt4=2; switch ( input.LA(1) ) { case LETTER: { switch ( input.LA(2) ) { case LETTER: { int LA4_3 = input.LA(3); //predicate is used only if first two tokens are LETTER if ( ((1+2==3)) ) { alt4=1; } else if ( ((1+2!=3)) ) { alt4=2; } else { // ... irrelevant code ... } } break; //if the second token is not LETTER, predicate is not used case EOF: { alt4=2; } break; default: // ... } } break; //if the first token is not LETTER, predicate is not used case EOF: // ... default: // ... }
Compare it to very similar gated rule:
compareGated: LETTER ( {1+2==3}?=> LETTER LETTER | {1+2!=3}?=> LETTER* );
Parser will use the predicate no mater what. The second alternative will never be entered, because the predicate 1+2!=3
is never satisfied:
int alt6=2; int LA6_0 = input.LA(1); if ( (LA6_0==LETTER) && (((1+2==3)||(1+2!=3)))) { int LA6_1 = input.LA(2); if ( (LA6_1==LETTER) && (((1+2==3)||(1+2!=3)))) { int LA6_3 = input.LA(3);
Gated predicate causes antlr to throw different kind of exception in this case. As we will show later in this post, different hoisting of gated and disambiguating predicates can make much bigger difference with more complicated predicates. Namely, it can make difference between accepting and rejecting the input.
LoopsAlthough it does not look like that at the first sight, loops are alternatives too. They use prediction to guess whether they should do one more round or whether they should end.
Use predicate to stay in the loop only while it returns true:
LETTER : 'a'..'z' | 'A'..'Z'; loop: ( {somePredicate()}?=> LETTER )*;
The loop
rule will match letters until the function somePredicate()
returns false
or until the rule runs out of LETTER
tokens.
loop1: do { int alt1=2; int LA1_0 = input.LA(1); // predicate is used during the prediction if ( (LA1_0==LETTER) && ((somePredicate()))) { alt1=1; } //matching: either jump out or match another LETTER switch (alt1) { case 1: { if ( !((somePredicate())) ) { throw new FailedPredicateException(...); } // ... match LETTER ... } break; default: break loop1; } } while (true);
Disambiguating predicate can not be used for this purpose. Next predicate will not be used to decide whether parser should stay in the loop or not:
LETTER : 'a'..'z' | 'A'..'Z'; loopDisambiguating: ( {somePredicate()}? LETTER )*;
Technically, the loop is deciding between LETTER
and <nothing>
alternatives. Those are syntactically different and prediction uses disambiguating predicates only if it has to decide between syntactically ambiguous alternatives.
The loopDisambiguating
rule will match letters until it runs out of LETTER
tokens. If the function somePredicate()
returns false
during that time, the rule will throw FailedPredicateException
exception.
Generated code is very similar to the previous one, only the prediction part changes. Predicate is not used:
loop2: do { int alt2=2; //prediction ignores the predicate switch ( input.LA(1) ) { case LETTER: { alt2=1; } break; } //matching: either jump out or match another LETTER switch (alt2) { case 1: { if ( !((somePredicate())) ) { throw new FailedPredicateException(...); } // ... match LETTER ... } break; default: break loop2; } } while (true);
Uncovered Alternatives
It is perfectly ok to leave some alternatives uncovered. Alternatives with predicates will work as expected. Gated predicate is always hoisted and disambiguated predicate is hoisted only if there are multiple ambiguous alternatives.
Gated predicate is always hoisted:
uncoveredGated: LETTER ( {3+4==7}?=> LETTER | NUMBER );
Hoisted disambiguated predicate:
uncoveredDisambiguatingHoisted: LETTER ( {2+5==7}? LETTER* | LETTER+ );
Non hoisted disambiguated predicate:
uncoveredDisambiguatingNotHoisted: LETTER ( {2+4==6}? LETTER | NUMBER );
Combination of Gated and Disambiguated PredicatesIf one alternative is gated and the other is disambiguated, then gated predicate is always hoisted and disambiguated predicate is hoisted only if it is actually needed.
Gated predicate is hoisted while disambiguated predicate is not hoisted:
combinationDisambiguatingNotHoisted: LETTER ( {1+4==5}?=> LETTER | {1+4!=5}? NUMBER );
Both predicates are hoisted:
combinationDisambiguatingHoisted: LETTER ( {1+5==6}?=> LETTER* | {1+5!=6}? LETTER+ );
Additional Parenthesis
If you close disambiguating predicate in parenthesis, the predicate is still be treated like disambiguating predicate.
Another way to write disambiguating predicate:
stillDisambiguating: ({2+2==4}?) LETTER; testStillDisambiguating: stillDisambiguating | LETTER;
If you put additional parenthesis around gated predicate, the predicate will be ignored:
ignored: ({3+3==6}?)=>LETTER;
Backtracking
Predicates run even if the parser is backtracking e.g, if it is inside syntactical predicate. If the parser is backtracking and the predicate fails, backtracking fails too. Failed predicate exception is thrown only if the parser is not backtracking.
The backtrack
rule initiates backtracking (Backtracking.g):
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER; number: LETTER {2+3!=5}? LETTER; backtrack: (number)=> number | word;
Since backtracking is possible, generated predicate check is different:
if ( !((2+3!=5)) ) { if (state.backtracking>0) {state.failed=true; return retval;} throw new FailedPredicateException(input, 'number', '2+3!=5'); }
Backtracking is yet another reason why conditions must not have side effects, must be repeatable and their evaluation order must not matter.
Writing Conditions
This chapter shows how to write advanced conditions for semantic predicates. First, we will show how to access and use input token stream. We will also explain how to reference labeled tokens. Last part is about local variables in non-hoisted conditions.
Unless specified otherwise, all used examples are located in Environnement.g file.
Input Token Stream
Each generated parser has public TokenStream input
field. This field provides access to the whole input token stream and also to current position in that token stream. Its most important method is the Token LT(int k)
method. The parameter k
contains relative position of the token you are interested in.
The number 1 means ‘look ahead one token’, 2 means ‘second token ahead’ and so on. Negative numbers reference passed tokens: -1 will return previous token, -2 returns the one before it and so on. Do not use 0. Its meaning is undefined and the default parser returns null
.
Note: relative referencing works correctly even when the grammar is in backtracking state. -1 is always previous token and 1 is always the next token.
Examples
Disambiguating: if the
word
starts with letter
a
, then it must have at least two letters:
word: LETTER ( { input.LT(-1).getText().equals('a')}? LETTER+ | { !input.LT(-1).getText().equals('a')}? LETTER* );
Gated: if the second numeral of the number
is 9
, then it must have exactly 3 numerals:
number: NUMERAL ( {input.LT(1).getText().equals('9')}?=> NUMERAL NUMERAL | {!input.LT(1).getText().equals('9')}?=> NUMERAL* );
Note: the choice of predicate slightly matter in this case. It influences what kind of error will be thrown if the input does not match the rule.
Labels and Lists
Predicates can reference and use any previously defined label or label list the same way as actions can.
Label Example
If the first letter of the word is a
, then the word must have at least two letters:
labeledWord: a=LETTER ( { $a.getText().equals('a')}? LETTER+ | { !$a.getText().equals('a')}? LETTER* );
Label List ExampleIf the word starts with less then 3 letters, then it must end with a number:
labeledListWord: a+=LETTER+ ( { $a.size() < 3 }?=> NUMERAL | { $a.size() >= 3}?=> );
Note: the choice of predicate does matter in this case. The above example works correctly only if it uses gated {...}?=>
predicate instead of disambiguating {...}?
one. The NUMERAL
and <nothing>
are syntactically different. Disambiguating predicate would not be used for prediction e.g., it would not be hoisted. The parser would base its decision solely on the next token (is it NUMERAL
?).
The condition would be used as an afterwards assertion to check whether the number of letters was right. Such grammar would throw an exception on the abcd9
input, while ours would accept it.
Undefined Labels
Predicate can NOT reference not-yet-defined labels. The parser is generated, but the first attempt to use the rule throws null pointer exception in runtime:
//this would cause null pointer exception nullPointerAtPredicate: LETTER { $a.getText().equals('a') }? a=LETTER;
Labels and Hoisting Into Other Rules
As the label has to be defined before being used in the predicate and predicates are copied into other rules only if they are located in the very beginning of a rule, you do not have to worry about hoisting into other rules.
Access to Local Variables
Antlr allows you to define custom local variables and use them withing one rule. If you are sure that predicate will not be copied into other rules, it can use them. Of course, using local variables if the predicate can be copied into other rules will result in faulty parser.
Create local variables and use them in the predicate. If the word starts with less then 10 letters, then it must end with a number
localVariables @init {int num=0;} //define local variable num : (LETTER { num++; })* //raise the num variable by 1 for each letter ( // what should follow depends on the variable value { num < 10 }?=> NUMERAL | { num >= 10}?=> );
Note: The same warning as before applies. We must use gated predicates.
You must be especially careful not to use local variables in potentialy hoisted predicates. For example, Antlr Reference book recommends following rule to match only numbers composed of less then four numerals (ANTLRReference3Error.g):
localVariablesWarning @init {int n=1;} // n becomes a local variable : ( {n<=4}?=> NUMERAL {n++;} )+ // enter loop only if n<=4 ;
The above rule works well in isolation, that is when it is not used in other rules. Unfortunately, if you include it into other rules, the predicate may be hoisted into that other rule (ANTLRReference3Error.g):
// syntax error in generated parser syntaxError: localVariablesWarning | LETTER;
The n<=4
predicate will be copied into the syntaxError
rule. The variable n
is not accessible inside that rule and generated parser will be syntactically incorrect.
Solving Initial Use Cases
Finally, we are going to solve both use cases described in the motivational chapter.
Keywords – nth
Link to original use case.
We created function isInsideNth
that returns true
only if previous token matched name of some nth pseudo class. The function is used as condition inside gated predicate. Generated parser will assume that input contains nth expression if and only if it is inside nth pseudo class.
UseCasesNth.g file:
@parser::members { private static Set<String> NTH_PSEUDOCLASSES = new HashSet<String>(); static { NTH_PSEUDOCLASSES.add('nth-child'); NTH_PSEUDOCLASSES.add('nth-last-child'); NTH_PSEUDOCLASSES.add('nth-of-type'); NTH_PSEUDOCLASSES.add('nth-last-of-type'); } public boolean isInsideNth() { return isNthPseudoClass(input.LT(-1)); } private boolean isNthPseudoClass(Token a) { if (a == null) return false; String text = a.getText(); if (text == null) return false; return NTH_PSEUDOCLASSES.contains(text.toLowerCase()); } } LPAREN: '('; RPAREN: ')'; COLON: ':'; COMMA: ','; IDENT : ('a'..'z' | 'A'..'Z')+; //pseudoparameters and nth with dummy syntax pseudoparameters: IDENT (COMMA IDENT)*; nth: IDENT; //real nth syntax ommited for simplicity sake // pseudoclass pseudo : COLON COLON? IDENT (( { isInsideNth()}?=> LPAREN nth RPAREN | LPAREN pseudoparameters RPAREN )?) ;
An alternative solution with labels and rewrite rule:
//different solution - note that we need to use rewrite rules in this case pseudoDifferentSolution : COLON COLON? name=IDENT (( { isNthPseudoClass($name)}?=> LPAREN nthParameters=nth RPAREN | LPAREN parameters=pseudoparameters RPAREN )?) -> $name $nthParameters? $parameters? ;
Significant Whitespaces
Link to original use case.
Css selectors can be composed of multiple parts separated by combinators >
, +
, ~
and <space>
. Each part called simple selector starts with an optional element name and may be followed by multiple pseudo classes, attributes and similar structures.
Ignoring the space as combinator problem, simplified simple selector grammar can look like this:
COLON: ':'; STAR: '*'; NUMBER: ('0'..'9')+; IDENT : ('a'..'z' | 'A'..'Z')+; //some options have been removed from following rules for simplicity sake elementName: IDENT | STAR | NUMBER; pseudoClass: COLON COLON? IDENT; elementSubsequent: pseudoClass; simpleSelectorWrong: (elementName elementSubsequent*) | elementSubsequent+ ;
The above simpleSelectorWrong
rule matches valid simple selectors: h1
, h1:first-child:hover
, :first-child:hover
and :hover
.
Unfortunately, as whitespaces are thrown away, the above rule matches more then that. For example, it would match also h1:first-child :hover
which should be interpreted exactly the same way as h1:first-child *:hover
selector e.g., as two simple selectors joined by <space>
.
We created method that returns true only if there is no hidden token between previous and next tokens. Unless configured otherwise, all tokens are instance of CommonToken
class. Since common token knows its start and stop index, we can cast and compare them to see whether there was something between them.
New parser methods (UseCasesSelectors.g):
@parser::members { public boolean onEmptyCombinator(TokenStream input) { return !directlyFollows(input.LT(-1), input.LT(1)); } private boolean directlyFollows(Token first, Token second) { CommonToken firstT = (CommonToken) first; CommonToken secondT = (CommonToken) second; if (firstT.getStopIndex() + 1 != secondT.getStartIndex()) return false; return true; } }
Fixed simple selector uses gated predicate to check whether it should or should not continue adding subsequent elements (UseCasesSelectors.g):
simpleSelector: ( elementName ({!onEmptyCombinator(input)}?=>elementSubsequent)* ) | ( elementSubsequent ({!onEmptyCombinator(input)}?=>elementSubsequent)* );
We have to use gated predicates in this case. If we would use disambiguating predicate, generated parser would not use our predicate to decide whether to stay inside the loop or not. It is because the loop is technically deciding between elementSubsequent
and <nothing>
alternatives and those are syntactically different. The {...}?
predicate would not be used during the prediction, it would just occasionally throw exceptions.
Wrapping It Up
Semantic predicates are java conditions written inside the grammar. They are copied into generated parser as they are, without any changes. If token matching algorithm reaches semantic predicate and that predicate fails, FailedPredicateException
exception is thrown.
If a rule or an alternative starts with semantic predicate, that semantic predicate can be used during prediction phase. Failed predicates during the prediction phase never throw exceptions, but they may disable some alternatives. This is called hoisting.
Conditions must be without side effects, repeatable and their evaluation order should not matter. If they are hoisted into other rules, then they can not reference local variables or parameters.
Semantic predicates are used in three different ways: as validating semantic predicates, as disambiguating semantic predicates and as gated semantic predicates.
Validating Semantic Predicates
Validating semantic predicates act as assertions only. As a result, validating semantic predicates are never hoisted.
Condition is closed inside curly braces followed by question mark { condition }?
. It must be placed either in the middle of a rule or in the middle of an alternative:
LETTER : 'a'..'z' | 'A'..'Z'; word: LETTER {1+2==3}? LETTER;
Disambiguating Semantic Predicates
Disambiguating semantic predicates help to choose between syntactically equivalent alternatives. As a result, disambiguating semantic predicates are hoisted only if the parser has to choose between multiple ambiguous alternatives.
Disambiguating semantic predicates use exactly the same syntax as validating predicates. Condition is closed inside curly braces followed by question mark { condition }?
. However, they must be placed either in the beginning of a rule or in the beginning of an alternative:
LETTER : 'a'..'z' | 'A'..'Z'; // beginning of an alternative alternatives: LETTER ( {2+3==5}? LETTER* | {2+3==5}? LETTER+ );
Gated Semantic Predicates
Gated semantic predicates are used to dynamically turn on and off portions of grammar. As a result, all gated predicates placed in the beginning of a rule or an alternative are hoisted. Gated predicates placed in the middle of a rule or an alternative are never hoisted.
Condition is closed inside curly braces followed by question mark and double arrow { condition }?=>
:
NUMERAL: '0'..'9'; number: {2+3==5}?=> NUMERAL NUMERAL;
Reference: ANTLR – Semantic Predicates from our JCG partner Maria Jurcovicova at the This is Stuff blog.