A Tutorial for jparsec Framework
In this tutorial, we will use the classical example: calculator, to show how jparsec's declarative API can be used to construct a parser.
Our calculator should be able to:
- support calculation of decimal numbers.
- support Operators such as '+', '-', '*', '/'.
- '(' and ')' can be used to group expression.
- To make the parser more interesting, we'll allow java style single-line comment started by a "//" and block comment enclosed by "/*" and "*/".
- In math, we typically omit the '*' operator for multiplication. For example: "3a" is equivalent to "3*a". Though for simplicity, we don't include symbols in the calculator, we still want to allow whitespace as an alternative syntax for multiplication.
A typical parsing process takes 2 steps:
- Scan the input, create a list of tokens. These tokens include keywords, literals, identifiers,operators etc.
The program that recognizes a single token is called "tokenizer".
- Parse the tokens based on the grammar of the language.
First, let's create our tokenizer. The tokenizer should be able to recognize decimal number, parenthesis and the operators. Input character sequence will be recognized and translated to token.
The prebuilt Terminals.DecimalLiteral can be used to tokenize decimal literals.
An instance of Terminals can be created to manage our operators:
This calculator allows any number of white spaces before and after any number or operator.
single-line comment and block comment are also allowed and treated as white space.
Put them together
With the number literal tokenizer and the OPERATORS constant, our tokenizer will be:
And given any syntactical parser, we will be able to chain it with the tokenizer and ignore the whitespaces and comments:
Parser and Interpreter
With the tokenizer taking care of lexical analysis, we can now focus on grammar rules.
From the tokenizer, we could have two different kinds of tokens: decimal numbers and operators. To convert the decimal number token to a Double object, we use the corresponding syntactic parser for decimal literal and transform the string result:
Before proceeding to parsing the operators, let's first model them using enum. The binary operators (+, -, *, /) implement Binary to facilitate calculation; and negative operator (-) implement Unary. We will later on declare them in a OperatorTable.
All the operators are managed by the OPERATORS constant. The Terminals.token() method returns a parser that recognizes operator. A convenience method can be created on top of it to save a few keystrokes for us:
For each operator, the parser will first recognize the token using the term() method and then return the corresponding BinaryOperator/UnaryOperator instance, as in:
At the beginning, we said we want to allow whitespace to be an alternative syntax for multiplication. Well, formally speaking, whitespace itself is not enough to make an operator. For example, "2 -3" should still be parsed as a "minus", not a multiplication between "2" and "-3". Therefore, our whitespace operator should only happen when none of "+", "-", "*", "/" is present.
Therefore we have:
In a classic recursive-desent parser, operators with different precedence are handled by defining different production rules, for example:
This solution can lead to messy production rules when the number of operators and precedence levels scale up.
It is more desirable to be able to specify the operator precedence declaratively.
Another drawback of recursive-desent parser is left recursion.
In order to preserve the left-associative nature of the operators, the above production rules, are defined left-resursively (the first rule at the right hand side of muldiv_expr is muldiv_expr itself).
Such left-recursive definition will lead to infinite-recursion in a naive recursive-desent parser implementation.
Jparsec provides support for operator precedence grammar and automatically handles left recursion incured by left-associative operators.
Our target syntax should look like:
The higher the precedence number, the more tightly it binds the operands.
Parens and Recursion
In order to support parens, recursion needs to be involved. An expression inside a pair of parens is itself another expression, which can be nested inside another pair of parens. The way to handle recursion is to use Parser.Reference:
If you run the "parenthesized" parser above, it will fail with some error message about the reference is not set yet. And that is accurately what we are missing here. The reference needs to finally be set with the parser for expression, because, well, literally any expression can be nested within a pair of parens.
So, putting it together with the operator table and whitespace operator, we have:
And that is almost everything we need. Oh, wait, let's pass the NUMBER parser in and hook it up with the lexer to get the final parser:
And it doesn't hurt to run it and see how it happily calculates, does it?
The final parser code is as following:
In this tutorial, we chose to run the calculation directly in the semantic actions.
We could defer this calculation and create intermediary abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with the code that creates the abstract syntax tree (of course, we also need to replace <Double> with <whatever your expression type>).
This short tutorial cannot cover every detail of the jparsec API. There are many other useful utility methods provided to help the process of constructing a complex parser.