A Tutorial for NParsec Framework
In this tutorial, we will use the classical example: calculator, to show how NParsec's declarative API can be used to create a parser. NParsec can be downloaded here.
The tutorial will use C# 2.0 syntax (generics and anonymous delegates) since NParsec is built exclusively for C# 2.0 or higher.
Our calculator should be able to:
- support calculation of decimal numbers.
- support Operators such as '+', '-', '*', '/'.
- '(' and ')' can be used to group expression.
- To make the parser more interesting, we'll allow java style single-line comment started by a "//" and block comment enclosed by "/*" and "*/".
- In math, we typically omit the '*' operator for multiplication. For example: "3a" is equivalent to "3*a". Though for simplicity, we don't include symbols in the calculator, we still want to allow whitespace as an alternative syntax for multiplication (just for fun).
A typical parsing process takes 2 steps:
- Scan the input, create a list of tokens. These tokens include keywords, literals, identifiers,operators etc.
The program that does this step is called "scanner" or "lexer".
- Parse the tokens based on the grammar of the language.
The program that does this step is called "parser".
Import relavant NParsec classes
The following code imports all relevant NParsec types.
- A Scanner is a Parser that doesn't return anything. We use the special "_" type to denote this.
- A Lexer is a Parser object that returns a Tok object.
- A Lexeme is a Parser object that returns a Tok object.
- A Binary operation is a Map that maps two double value to another double value.
- A Unary operation is a Map that maps one double value to another double value.
- A Term is a Parser object that recognizes a token. We do not really care the return value.
- A Grammar is a Parser object that evaluates to a double value.
- We are calculating expression directly to a double value, an "Expr" is then a synonym of double.
First, let's create our lexer. The lexer should be able to recognize decimal number, parenthesis and the operators. Input character sequence will be recognized and translated to tokens.
To recognize a decimal number, we could use regular expression. NParsec has support for regular expression-based scanner.
We will use straight NParsec API, however, to present a more intuitive way.
Decimal Number Scanner
The Parsec.Pattern.Pattern and Parsec.Pattern.Patterns are two useful classes to represent any sophisticated pattern of input characters.
The following code represents a pattern of integer:
- The Patterns.InRange('0', '9') matches any digit character.
- The Many(1) tells the Pattern object to match 1 or more digit characters.
A decimal number is 1 or more digits optionally followed by a '.' and 1 or more digits as the fractional part:
- The digits.Seq(...) tells the Pattern object to match a second pattern after digits are matched.
- Patterns.IsChar('.') matches the '.' character only.
- .Optional() dictates an optional pattern.
Finally, we use this Pattern object to create our a Scanner object:
- Scanners.isPattern(...) is a function to adapt a Pattern object to a Scanner object.
In fact, NParsec does have built-in support for frequently used lexers such as integer, string literal, decimal number etc. They can be found in the Lexers class.
To better serve the "tutorial" purpose, we chose to write the lexers from scratch instead of using these pre-built functions.
Decimal Number Lexer
A Scanner object only matches input. It knows the position and the length of the match, but it doesn't return any token.
We need to use a Parsec.Tokenizer object to convert the matched sub-sequence to a token object:
- Tokenizers.ForDecimal returns a Tokenizer object that converts the matched input sequence to a TypedToken object whose type is TokenType.Decimal.
- Lexers.Lex(...) uses a Tokenizer object to create a Lexer object, which returns a token object.
Lexer for Operators
To create lexers for the operators, the helper class Terms can be used:
- Terms.getOperatorsInstance(...) creates a Terms object that encapsulates the lexing and parsing of the provided operators.
Now, we are done with all the necessary tokens of the calculator. The only thing left is the white spaces.
This calculator allows any number of white spaces before and after any number or operator.
single-line comment and block comment are also allowed and treated as white space.
- Scanners.IsJavaLineComment() creates a Scanner object that recognizes the single-line comment started by "//".
- Scanners.IsBlockComment("/*", "*/") creates a Scanner object that recognizes the block comment denoted by "/*" and "*/".
- Scanners.IsWhitespaces() creates a Scanner object that recognizes any white space character.
The calculator will ignore any one of the above 3, so:
- The "|" operator means "alternative".
- Many_() repeats a parser for unlimited times, but ignores the return value.
- This s_delim is our real white space that the parser should ignore.
Put them together
The lexer that recognize and return tokens is:
- ops.Lexer returns the Lexer object for operators.
The lexer object should greedily scan the entire input, ignore the whitespaces and comments, recognize the tokens and generate an array of the tokens.
The Lexers.lexeme() function is the ideal helper function to create such Parser object:
- Parsers.Eof() will fail if there's still any input left. This is to ensure that the entire input is scanned.
- Lexers.Lexeme(...) repeatedly ignores the whitespaces and comments and collect the recognized tokens into an array.
- FollowedBy(...) is similar to Seq(...), the only difference is, "FollowedBy" keeps the return value of the first Parser object and discards the second.
Parser and Interpreter
A Parser object can be a character level - which is fed a sequence of characters, or a token level - which is fed with an array of tokens.
The lexer is a character level object, we need a token level Parser object to finish the second phase of the job.
The token level Parser object should recognize different tokens and does the interpretation based on the grammar rules.
Decimal Number Parser
From the lexer, we could have two different kinds of tokens: decimal numbers and operators. To convert the decimal number token to a double value, we can use the helper class Terms:
- Terms.OnDecimal(...) creates a Parser object, which recognizes a TypedToken object whose type is TokenType.Decimal, it then uses the delegate object to convert the recognized string to a double value.
C# compiler can't always infer type parameters properly, so explicitly specifying the type parameter may be needed.
The lexer already recognizes the operators and converted them into special tokens. It is now time to apply grammar rules to these tokens.
Using the Terms object created in the lexer section, we could easily get the Grammar objects for these operators. For example, the "+" operator can be obtained as:
The above Term object only recognizes the "+" operator token, we also need to attach a semantic action to it(in our case, apply the plus operation). The same thought applies to all other operators.
For binary operators, a Binary object is used to represent the semantic action:
For unary operators, a Unary object can be used to represent semantic action:
We can easily attach a Binary object to an operator as such:
Similar code can be applied to all the other operators. To avoid code duplication, we can create a function:
Thus, the plus operator becomes:
In a classic recursive-desent parser, operators with different precedence are handled by defining different production rules, for example:
This solution can lead to messy production rules when the number of operators and precedence levels scales up.
It is more desirable to be able to specify the operator precedence declaratively.
Another drawback of recursive-desent parser is left recursion.
In order to preserve the left-associative nature of the operators, the above production rules, are defined left-resursively (the first rule at the right hand side of muldiv_expr is muldiv_expr itself).
Such left-recursive definition will lead to infinite-recursion in a naive recursive-desent parser implementation.
NParsec provides support for operator precedence grammar and automatically handles left recursion incured by left-associative operators.
Our target syntax should look like:
where all the operator precedence and associativity are specified declaratively.
- Infixl(p_plus, 10) declares that p_plus is an infix binary operator that is left-associative with precedence level "10";
- Prefix(p_neg, 30) declares that p_neg is a prefix unary operator with precedence level "30".
At the beginning, we said we want to allow whitespace to be an alternative syntax for multiplication. Well, formally speaking, whitespace itself is not enough to make an operator. For example, "2 -3" should still be parsed as a "minus", not a multiplication between "2" and "-3". Therefore, our whitespace operator should only happen when none of "+", "-", "*", "/" is present.
The Parser object that recognizes any of the four arithmetic operators is:
And the parser that tells us "none of the four operators are present, this is a valid whitespace operator" is:
- Cast is necessary here because Not() method returns a Parser<_>, not a Parser<object> that we need, though we don't care about the return value anyway. Indeed, all Parser objects are covariant about the return type. The lack of support for covariance annotation in C# leaves us developers responsible for using the Cast method explicitly whenever a covariance is needed.
There are some preparation to do though.
Firstly, the Infixl(...) method expects a Parser<Binary> object.
The Binary object is responsible for implementing the semantic action of the binary operator.
We can either create an abstract syntax tree object, or directly apply the operator to do the calculation.
Similarly, the Prefix(...)| method expects a Parser<Unary>, where the Unary object is responsible for implementing the semantic action of the unary operator.
These Parser<Binary> and Parser<Unary> objects can be easily obtained from the getOperator() method we just defined.
Now, with the OperatorTable object we created earlier, we can use the Expressions helper class to build a parser that handles the operators correctly:
where the p_number is the parser that parses a decimal number and evaluates to a double value; the optable is the OperatorTable object that declares all the operator precedences and associativity.
Executing Parser object
With the lexer and the parser objects ready, we can execute them to calculate real expressions.
First, we need to concatenate the lexer and the parser objects so that the parser object will take as input the output of the lexer object:
- p_expr.FollowedBy(Parsers.eof()) ensures that all tokens are parsed.
- Parsers.ParseTokens(...) creates a character level object that first scans the character input, then runs the token level Parser object to further parse the tokens generated by the lexer.
The parser object can then be used to actually parse/interpret expression:
- Parsers.RunParser(...) executes a Parser object against a string object.
- The 3th parameter of RunParser is the name of the current source code module.
- The value returned by RunParser is the evaluation result of the expression.
Now if we run the above code, we will get -7 as the result.
Our Parser object works except there's one thing missing: parenthesis.
We wanted the parser to be able to parse sub-expression grouped by parenthesis.
The production rule is:
- The 'term' and 'expr' are mutual-recursive.
It is impossible to represent such mutual-recursive production with declarations of local Parser variables. Lazy evaluation is necessary in this case:
- The lazy_expr array is a place holder of the expression parser to enable lazy evaluation.
- Parsers.lazy(...) creates a Parser object that will not evaluate the delegate until the parser is executed.
- The p_term variable represents the 'term' production rule.
Thus, the expression parser can be changed to:
- p_number is changed to p_term to accommodate the new production rule in the expression.
And There We Go
The final parser code is as following:
In this tutorial, we chose to run the calculation directly in the semantic actions.
We could defer this calculation and create intermediary abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with the code to create the abstract syntax tree.
This short tutorial cannot cover every detail of the NParsec API. There are many other useful utility functions provided to help the process of constructing a sophisticated parser.
You are welcome to shoot me a note to email@example.com should you have any comment or question.