Pyleri Tutorial: Parsing with Ease
Welcome to a tutorial on Pyleri, aka Python Left-Right Parser, a simple parsing tool to use when you need something more than a regular expression, but less than a full parser generator. In this tutorial we are going to show you how to use the tool and cover the basics of parsing.
Why Learn Pyleri?
We have written a very popular series of posts about parsing libraries and tools: Java, C#, Python, JavaScript. We have also written much loved tutorials on using ANTLR, a great industrial-strength parser generator that is perfect for parsing real, complex languages. However, we think that we were missing a good tutorial on a simple parsing tool, one that you can use when you need something more than a regular expression but you are not ready for a full parser. That is where Pyleri comes in.
We chose Pyleri because it is a nice and effective tool. It makes it easy to create parsers and can also be a quick way to support features like auto-completion. Plus, the same grammar can generate parsers for multiple languages: JavaScript, C, Python, Go and Java. It is also well-tested, given that it was designed to be used with SiriDB, a highly-scalable, robust and super fast time series database.
The Workflow
A grammar for Pyleri must be defined using Python expressions. These must be part of a class that derives from the base class Grammar, like in the following example.
class BookGrammar(Grammar):
    k_book = Keyword('book')
    r_title = Regex('[a-zA-Z 0-9\']+')
Once it is defined, the grammar can be exported as a file that defines the grammar in Python or in any other supported language. There are several export methods, one for each language (e.g., export_java, export_js).
For example, you can define the grammar in Python, export it to JavaScript and then use the JavaScript version of Pyleri (jsleri) to run it.
grammarString = BookGrammar().export_js()
# now grammarString contains the JS version of BookGrammar
# class BookGrammar extends Grammar {
#     static k_book = Keyword('book');
#     static r_title = Regex('^[a-zA-Z 0-9\']+');
# etc.
You cannot do the inverse, i.e., you cannot create a grammar in jsleri and then export it to Python. That is because jsleri and the other *leri libraries do not support this feature. So, Pyleri is clearly the main version of the *leri family. Therefore it is better to create the grammar in Python and then export it to the language you need.
Effectively Pyleri and the other libraries are not parts of one tool; they are more a family of similar libraries. This is slightly confusing, but we can live with it.
The Book Format Basics
In this tutorial we are going to create a parser for a simple data format to describe books. Let’s start with the simplest basic version of the grammar: a rule to indicate the title of the book.
# Create a Grammar Class to define the format
class BookGrammar(Grammar):
    k_book = Keyword('book')
    r_title = Regex('[a-zA-Z 0-9\']+')
    START = Sequence(k_book, r_title)
The rule START must always be present: you need a rule with this name in your grammar. This way Pyleri knows the rule from which it should start parsing the input.
In practical terms there are two kinds of parsing rules: simple rules and combinations of other rules. If you have previous experience with parsing you might think that simple rules are like lexer rules, while complex rules are like parser rules. This is not strictly true, since there are no technical differences between the two.
However, it can be useful to organize your grammar as if this were true. So, you can use the simple ones to recognize the basic elements and the complex ones to organize them. The simple ones can be thought of as tokens created with regular expressions, while the complex ones are created using ready-to-use functions that combine them (e.g., Sequence to parse a sequence of elements). This will help you design your grammar and make it easier to jump to a more complex parsing tool, if your grammar becomes so complex that you need one.
In this example there are two simple rules: one that represents a keyword (i.e., book) and another one for the title. Then there is a complex rule that combines the simple ones to create the basic version of our format: the keyword followed by the title.
Testing the Grammar
# Compile your grammar by creating an instance of the Grammar Class.
book_grammar = BookGrammar()

# Use the compiled grammar to parse 'strings'
print(book_grammar.parse('book The fall of Rome').is_valid) # => True
print(book_grammar.parse('book The Karamazov Brothers').as_str()) # => parsed successfully
Next we create an object from our grammar and test a few inputs. Everything works perfectly. The object returned by the parse method is of type Result and it has a few useful properties:
- the is_valid field is a boolean that says whether the input is valid
- the as_str() method returns a message about the result of the parsing
- the tree field returns the parse tree
Let’s try an invalid input to better see what the as_str() method does.
print(book_grammar.parse('story Farewell to Arms').is_valid) # => False
print(book_grammar.parse('story Farewell to Arms').as_str()) # => error at position 0, expecting: book
As you can see, when the input is invalid this method returns an error message that explains what went wrong. When the parsing fails you can also look up the field pos to see exactly where the parsing stopped.
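For instance, here is a minimal check of pos on the failing input above, reusing the book_grammar instance we created earlier:

# the parsing fails on the very first word, so pos points at position 0
result = book_grammar.parse('story Farewell to Arms')
print(result.is_valid)  # => False
print(result.pos)       # => 0, the position where the parsing stopped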
The Parse Tree
Finally we are going to see the parse tree.
import pprint

# Returns properties of a node object as a dictionary:
def node_props(node, children):
    return {
        'start': node.start,
        'end': node.end,
        'name': node.element.name if hasattr(node.element, 'name') else None,
        'element': node.element.__class__.__name__,
        'string': node.string,
        'children': children}

# Recursive method to get the children of a node object:
def get_children(children):
    return [node_props(c, get_children(c.children)) for c in children]

# View the parse tree:
def view_parse_tree(res):
    start = res.tree.children[0] \
        if res.tree.children else res.tree
    return node_props(start, get_children(start.children))

# let's print the parse tree
pp = pprint.PrettyPrinter()
pp.pprint(view_parse_tree(book_grammar.parse('book The Karamazov Brothers')))
The functions to navigate the parse tree come from the documentation of Pyleri.
This is the output for the previous input phrase book The Karamazov Brothers.
{'children': [{'children': [],
               'element': 'Keyword',
               'end': 4,
               'name': 'k_book',
               'start': 0,
               'string': 'book'},
              {'children': [],
               'element': 'Regex',
               'end': 27,
               'name': 'r_title',
               'start': 5,
               'string': 'The Karamazov Brothers'}],
 'element': 'Sequence',
 'end': 27,
 'name': 'START',
 'start': 0,
 'string': 'book The Karamazov Brothers'}
The parse tree is simple but effective: you get the info on the position in the input string and on the rule that was triggered. Notice that the title of the book does not include the leading space between book and the title. That shows that whitespace is generally discarded, although we can define rules in which it is kept.
Describing a Book
Now that we understand the basics of Pyleri grammars, we can develop our grammar for describing a book.
class BookGrammar(Grammar):
    k_book = Keyword('book')
    r_title = Regex('[a-zA-Z 0-9\']+')
    k_author = Keyword('authors')
    r_author = Regex('[a-zA-Z 0-9\']+')
    k_publication = Keyword('publication_date')
    k_pub = Keyword('pub_date')
    k_pub_date = Choice(k_publication, k_pub)
    r_pub_date = Regex('[0-9]+')
    k_description = Keyword('description')
    r_description = Regex('"([^"]*)"')
    DESCRIPTION = Sequence(k_description, r_description)
    AUTHORS = Sequence(k_author, List(r_author, delimiter=',', mi=1, ma=None))
    PUB_DATE = Sequence(k_pub_date, r_pub_date)
    TITLE = Sequence(k_book, r_title)
    START = Sequence(TITLE, Optional(AUTHORS), Optional(PUB_DATE), Optional(DESCRIPTION))
Our book format shows just a few of the elements that can be used to build rules.
The first of them is the Choice element, which allows you to select one among a few options. In our case we use it to accept the keyword for the publication date in either the long or the short form.
The List element represents a series of at least mi and at most ma elements. In our case we use it to indicate that we need at least one author, with no upper limit on the number of authors.
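For example, with this rule an entry with more than one author simply separates the names with a comma, which is the delimiter of the List (an illustrative input, not from the original article):

book An Example Book
authors First Author, Second Author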
Finally, the Optional element indicates an element that may be present but is not required. We use it to compose the final element: there is always a TITLE, but the rest of the metadata is optional.
This simple example shows how to design a grammar with Pyleri, or any parser combinator: first you design the individual rules, then you combine them to form a rule that describes the whole document. It is pretty easy once you get the mindset.
This is an example input that our grammar can parse.
book 1984
authors George Orwell
pub_date 1949
description "A dystopian novel by English writer George Orwell published in June 1949, whose themes center on the risks of government overreach, totalitarianism and repressive regimentation of all persons and behaviors within society"
I Have More Than One Book
This grammar works fine when you just have one book, but usually you have more; that is, unless you are a really slow reader. So, we must change the grammar to support a collection of books. Actually, we also want to support multiple collections of books, in case we want to separate our books into different groups.
For example we want to support something like this input.
collection 1 = [
    // book 1
    // book 2
    collection 2 = [
        // book 3
        // book 4
    ]
]
Supporting such a structure is actually quite easy; we just need a few modifications.
r_name = Regex('[a-zA-Z 0-9]+')
t_equals = Token('=')

[..]

START = Ref()
BOOK = Sequence(TITLE, Optional(AUTHORS), Optional(PUB_DATE), Optional(DESCRIPTION))
ELEM = Choice(BOOK, START)
START = Sequence(r_name, t_equals, '[', List(ELEM, delimiter=k_end_book), ']')
The first thing we do is create a couple of rules to support naming a collection. We use the new type of rule Token to parse the equals sign. This rule is used for operators and the like. There is also a version that accepts multiple tokens.
Tokens('+ / )')
You just need to separate each of them with a space.
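As a sketch, Tokens matches any one of the listed tokens, so it is handy for things like operators. The following hypothetical rule is not part of our book grammar:

# hypothetical example: accepts inputs like "1 + 2" or "3 / 4",
# because Tokens('+ /') matches either the '+' or the '/' token
r_number = Regex('[0-9]+')
OPERATION = Sequence(r_number, Tokens('+ /'), r_number)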
The big change however is for the rule START. We use the special rule Ref to indicate a forward reference; you can use it to create recursive rules. First you declare the forward reference with Ref, then you define the real rule as normal. In our example the starting rule contains a collection of elements with a potential name. In turn each element can be an individual book or the starting element. Since the starting element is a collection of elements, you see how all of this is recursive.
In short, our starting rule is a collection of elements, each of which can be either a single book or another collection of elements.
Using the Grammar
Now that we have the definitive version of our grammar, we can use it. The issue is that Pyleri does not offer any ready-to-use method to navigate the parse tree, as more sophisticated tools like ANTLR do. That means there is no easy way to create a visitor or listener to perform some operation on the tree. Luckily, it is not hard to do it manually.
Since we just need to get our list of books grouped in collections, we can simply perform a depth-first search and call the proper function whenever we meet a node. Basically, if we meet an author node, we add the author to the current book, and so on. All the information we need is stored in fields of the class that are manipulated during the traversal of the tree.
We can start by reusing the functions to navigate the tree, putting them in a class called VisitTree.
# Returns properties of a node object as a dictionary:
def node_props(self, node, children):
    self.read_info(node)
    return {
        'start': node.start,
        'end': node.end,
        'name': node.element.name if hasattr(node.element, 'name') else None,
        'element': node.element.__class__.__name__,
        'string': node.string,
        'children': children
    }

# Recursive method to get the children of a node object:
def get_children(self, children):
    return [self.node_props(c, self.get_children(c.children)) for c in children]

# The main function that we call to visit all the leaves of the tree
def navigate_parse_tree(self, res):
    # store the names of the collections
    self.names_collections = []
    start = res.tree.children[0] \
        if res.tree.children else res.tree
    return self.node_props(start, self.get_children(start.children))
As you can see, these functions are mostly the same; we just did a couple of things:
- we modified them to be part of a class
- we added a call to the function read_info inside node_props and added a field called names_collections at the beginning of the visit
The function read_info will add all the information to a field containing our collections of books. We added a field to store all the names of the collections because collections can be nested: this way we can recover the name of the parent collection when we exit a child collection.
Let’s see where the magic happens.
Getting the Data We Need
Actually, we divided the work in two functions: one that gathers the information about the book and another that manages the collections.
def read_info(self, node):
    self.add_book(node)
    self.manage_collection(node)
We have done this to make it easier to understand how to operate on the parse tree.
def add_book(self, node):
    if hasattr(node.element, 'name'):
        if node.element.name == 'r_title':
            self.book = {}
            self.book['title'] = node.string
        if node.element.name == 'r_pub_date':
            self.book['publication_date'] = node.string
        # we create the authors collection when we find the author keyword
        if node.element.name == 'k_author':
            self.book['authors'] = []
        # we then add each author to the collection
        if node.element.name == 'r_author':
            self.book['authors'].append(node.string)
        if node.element.name == 'r_description':
            # we remove the delimiting double quotes
            self.book['description'] = node.string[1:-1]
All that this function does is check the name of the element/node of the tree and add the information to a book field.
We initialize the current book object when we find the title of the book. Notice that we mostly ignore the nodes for the keywords: they are necessary to correctly divide the different nodes, but they do not contain any information.
The only exception is the author keyword. The keyword per se also does not contain any information. However, since there can be more than one author, the book['authors'] value is a list. Therefore we initialize the list when we find the keyword, and then we add a value to the list for each r_author we find.
Managing Collections
The function to manage collections is technically divided in two parts: one looks at the names of the nodes, while the other looks for the delimiters of the collection (i.e., [ and ]).
Let’s start with the first part.
def manage_collection(self, node):
    if hasattr(node.element, 'name'):
        # we have found the end of a book
        if node.element.name == 'k_end_book':
            self.books[self.name_collection].append(self.book)
            del self.book
        # set the name of the collection
        if node.element.name == 'r_name':
            # set the current name of the collection
            self.name_collection = str.strip(node.string)
The first part does two things: it looks for the end of a book, to add it to the current collection, and it searches for the node containing the name of the collection.
The second part looks for the delimiters of the collection.
    [..]

    # start of the collection
    if not hasattr(node.element, 'name') and node.string == '[':
        # let's check whether this is the first collection
        if not hasattr(self, 'books'):
            self.books = {}
        # let's store the name of the collection, in case of nested collections
        self.names_collections.append(self.name_collection)
        # create the new collection
        self.books[self.name_collection] = []

    # end of the collection
    if not hasattr(node.element, 'name') and node.string == ']':
        # let's check whether the last book was added to the collection
        if hasattr(self, 'book'):
            self.books[self.name_collection].append(self.book)
            del self.book
        # let's delete the name of the current collection
        del self.name_collection
        # let's recover the name of the parent collection, if any
        if len(self.names_collections) > 0:
            self.name_collection = self.names_collections.pop()
There is nothing complicated here. The only novel thing is how we search for the delimiters. Since they are tokens, their nodes have no names, so we look for nodes with no name and with the text we are looking for, and voilà, we have found them.
Putting it All Together
Now we just have to put everything together.
def main():
    # Compile your grammar by creating an instance of the Grammar Class.
    book_grammar = BookGrammar()

    # read the data from a file
    in_file = open("data.books", "r")
    text = in_file.read()
    in_file.close()

    # if the file is valid we navigate the parse tree
    if book_grammar.parse(text).is_valid:
        tree = VisitTree()
        tree.navigate_parse_tree(book_grammar.parse(text))
        # print the resulting tree
        print(tree.get_collections())
In the main function we set up the grammar, parse an example file and print the results.
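The get_collections method is not shown above; here is a minimal sketch, assuming it simply exposes the dictionary of collections built during the visit:

# inside the VisitTree class: a minimal sketch of get_collections (an assumption,
# since the method is not shown in the original code); it returns the dictionary
# of collections that manage_collection builds while visiting the tree
def get_collections(self):
    return self.books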
If you are a keen observer, or if you print the parse tree, you will notice something interesting about the way we collect the data. While our format allows for nested collections, this is not reflected in our use of them. What I mean is that each collection contains only books; it does not contain any other collection. That is true even though the way the data is stored in the format would suggest otherwise.
This is an example of the difference between syntax and semantics. A parser just reads the data in a well-defined format. There are no hard rules on how to use the data described by the format.
For instance, you might design the format in a way that is easier to parse and then do more work after the parsing phase. Generally speaking, it is a good idea to make the syntax more flexible than the semantics strictly requires. That is because changing the format is cumbersome for the users: you do not want to force them to change past documents written in the previous version of the format. So it is better to make the format more flexible, leaving room to add new features without changing the formal description of the format.
Did You Expect Something More?
Another neat feature of a Pyleri result object is the property expecting, which lists the elements that it can accept at that particular position. This is very useful if you are building auto-completion functionality, but it can also be used to get more information in case of a parsing error. We modify the main function to print the content of the property in case the parsing is not valid.
if book_grammar.parse(text).is_valid:
    tree = VisitTree()
    tree.navigate_parse_tree(book_grammar.parse(text))
else:
    print(book_grammar.parse(text).expecting)
Let’s imagine that we forgot to add the closing square bracket. That would be the output.
{end_book, ]}
A set containing the end_book keyword and the closing square bracket.
This property can be a non-empty set even when the parsing is valid, in case there are optional elements that could still be added. So, it is not always a sign of an error, but it is useful in case of errors, that is to say when the property is_valid is False.
Summary
We have seen how to use Pyleri, a good tool that is a cross between a parser generator and a parser combinator. It is more powerful than a traditional parser combinator and can also generate a parse tree. This mixture of simple syntax and powerful features can be quite attractive for people who need something powerful but are not used to a traditional parser generator.
In short, Pyleri is a good solution for whoever needs to build a parser quickly without learning new, complex parsing concepts. You just need to learn a small API and gain experience in how to use the library. In return you will get a solid foundation to build an effective parser.
You can read more about Pyleri at the official GitHub repository.