In this tutorial, we will use the classical example: calculator, to show how NParsec's declarative API can be used to create a parser. NParsec can be downloaded here.
The tutorial will use C# 2.0 syntax (generics and anonymous delegates) since NParsec is built exclusively for C# 2.0 or higher.
Our calculator should be able to:
A typical parsing process takes 2 steps:
The following code imports all relevant NParsec types.
using Scanner = Parser<_>; using Lexer = Parser<Tok>; using Lexeme = Parser<Tok[]>; using Binary = Map<double, double, double>; using Unary = Map<double, double>; using Term = Parser<object>; using Grammar = Parser<double>; using Expr = System.Double; using Parsec; using System; |
First, let's create our lexer. The lexer should be able to recognize decimal number, parenthesis and the operators. Input character sequence will be recognized and translated to tokens.
To recognize a decimal number, we could use regular expression. NParsec has support for regular expression-based scanner.
We will use straight NParsec API, however, to present a more intuitive way.
The Parsec.Pattern.Pattern and Parsec.Pattern.Patterns are two useful classes to represent any sophisticated pattern of input characters.
The following code represents a pattern of integer:
Pattern digits = Patterns.InRange ('0', '9').Many (1);
|
A decimal number is 1 or more digits optionally followed by a '.' and 1 or more digits as the fractional part:
Pattern number = digits.Seq (Patterns.IsChar ('.').Seq (digits).Optional ());
|
Finally, we use this Pattern object to create our a Scanner object:
Scanner s_number = Scanners.IsPattern ("number", number, "number expected");
|
In fact, NParsec does have built-in support for frequently used lexers such as integer, string literal, decimal number etc. They can be found in the Lexers class.
To better serve the "tutorial" purpose, we chose to write the lexers from scratch instead of using these pre-built functions.
A Scanner object only matches input. It knows the position and the length of the match, but it doesn't return any token.
We need to use a Parsec.Tokenizer object to convert the matched sub-sequence to a token object:
Lexer l_number = Lexers.Lex (s_number, Tokenizers.ForDecimal); |
To create lexers for the operators, the helper class Terms can be used:
Terms ops = Terms.getOperatorsInstance("+","-","*","/", "(", ")");
|
Now, we are done with all the necessary tokens of the calculator. The only thing left is the white spaces.
This calculator allows any number of white spaces before and after any number or operator.
single-line comment and block comment are also allowed and treated as white space.
Scanner s_line_comment = Scanners.IsJavaLineComment ();
Scanner s_block_comment = Scanners.IsBlockComment ("/*", "*/");
Scanner s_whitespace = Scanners.IsWhitespaces ();
|
The calculator will ignore any one of the above 3, so:
Scanner s_delim = (s_line_comment|s_block_comment|s_whitespace).Many_(); |
The lexer that recognize and return tokens is:
Lexer l_tok = ops.Lexer | l_number; |
The lexer object should greedily scan the entire input, ignore the whitespaces and comments, recognize the tokens and generate an array of the tokens.
The Lexers.lexeme() function is the ideal helper function to create such Parser object:
Lexeme lexeme = Lexers.Lexeme (s_delim, l_tok).FollowedBy (Parsers.Eof ()); |
A Parser object can be a character level - which is fed a sequence of characters, or a token level - which is fed with an array of tokens.
The lexer is a character level object, we need a token level Parser object to finish the second phase of the job.
The token level Parser object should recognize different tokens and does the interpretation based on the grammar rules.
From the lexer, we could have two different kinds of tokens: decimal numbers and operators. To convert the decimal number token to a double value, we can use the helper class Terms:
Grammar p_number = Terms.OnDecimal<double> (
delegate (int from, int len, string s) {
return double.Parse (s);
}
);
|
|
C# compiler can't always infer type parameters properly, so explicitly specifying the type parameter may be needed. |
The lexer already recognizes the operators and converted them into special tokens. It is now time to apply grammar rules to these tokens.
Using the Terms object created in the lexer section, we could easily get the Grammar objects for these operators. For example, the "+" operator can be obtained as:
Term plus_op = ops.GetParser("+");
|
The above Term object only recognizes the "+" operator token, we also need to attach a semantic action to it(in our case, apply the plus operation). The same thought applies to all other operators.
For binary operators, a Binary object is used to represent the semantic action:
Binary plus = delegate (double a, double b) { return a + b; };
Binary minus = delegate (double a, double b) { return a - b; };
Binary mul = delegate (double a, double b) { return a * b; };
Binary div = delegate (double a, double b) { return a / b; };
|
For unary operators, a Unary object can be used to represent semantic action:
Unary neg = delegate (double n) { return -n; };
|
We can easily attach a Binary object to an operator as such:
Parser<Binary> p_plus = ops.getParser("+").Seq(Parsers.Return(plus));
|
Similar code can be applied to all the other operators. To avoid code duplication, we can create a function:
static Parser<T> getOperator<T> (Terms ops, string op, T v) {
return ops.GetParser (op).Seq (Parsers.Return (v));
}
|
Thus, the plus operator becomes:
Parser<Binary> p_plus = getOperator(ops, "+", plus); |
In a classic recursive-desent parser, operators with different precedence are handled by defining different production rules, for example:
term_expr ::= ...
muldiv_expr ::= muldiv_expr ('*'|'/') term_expr
expr = expr ('+'|'-') muldiv_expr
|
This solution can lead to messy production rules when the number of operators and precedence levels scales up.
It is more desirable to be able to specify the operator precedence declaratively.
Another drawback of recursive-desent parser is left recursion.
In order to preserve the left-associative nature of the operators, the above production rules, are defined left-resursively (the first rule at the right hand side of muldiv_expr is muldiv_expr itself).
Such left-recursive definition will lead to infinite-recursion in a naive recursive-desent parser implementation.
NParsec provides support for operator precedence grammar and automatically handles left recursion incured by left-associative operators.
Our target syntax should look like:
OperatorTable<Expr> optable = new OperatorTable<Expr>() .Infixl(p_plus, 10) .Infixl(p_minus, 10) .Infixl(p_mul, 20) .Infixl(p_div, 20) .Prefix(p_neg, 30); ... |
where all the operator precedence and associativity are specified declaratively.
At the beginning, we said we want to allow whitespace to be an alternative syntax for multiplication. Well, formally speaking, whitespace itself is not enough to make an operator. For example, "2 -3" should still be parsed as a "minus", not a multiplication between "2" and "-3". Therefore, our whitespace operator should only happen when none of "+", "-", "*", "/" is present.
The Parser object that recognizes any of the four arithmetic operators is:
Term p_binary_ops = ops.GetParser("+","-","*","/");
|
And the parser that tells us "none of the four operators are present, this is a valid whitespace operator" is:
Term p_whitespace_mul = p_binary_ops.Not().Cast<object>(); |
There are some preparation to do though.
Firstly, the Infixl(...) method expects a Parser<Binary> object.
The Binary object is responsible for implementing the semantic action of the binary operator.
We can either create an abstract syntax tree object, or directly apply the operator to do the calculation.
Similarly, the Prefix(...)| method expects a Parser<Unary>, where the Unary object is responsible for implementing the semantic action of the unary operator.
These Parser<Binary> and Parser<Unary> objects can be easily obtained from the getOperator() method we just defined.
Now, with the OperatorTable object we created earlier, we can use the Expressions helper class to build a parser that handles the operators correctly:
Grammar p_expr = Expressions.BuildExpressionParser (p_number, optable); |
where the p_number is the parser that parses a decimal number and evaluates to a double value; the optable is the OperatorTable object that declares all the operator precedences and associativity.
With the lexer and the parser objects ready, we can execute them to calculate real expressions.
First, we need to concatenate the lexer and the parser objects so that the parser object will take as input the output of the lexer object:
Grammar parser = Parsers.ParseTokens (lexeme, p_expr.FollowedBy (Parsers.Eof ()), "calculator"); |
The parser object can then be used to actually parse/interpret expression:
string src = "1+1*2-10";
double result = Parsers.RunParser(src, parser, "test1");
System.Console.WriteLine(""
|
Now if we run the above code, we will get -7 as the result.
Our Parser object works except there's one thing missing: parenthesis.
We wanted the parser to be able to parse sub-expression grouped by parenthesis.
The production rule is:
op ::= '+' | '-' | '*' | '/'
term ::= number | '(' expr ')'
expr ::= expr op term
|
It is impossible to represent such mutual-recursive production with declarations of local Parser variables. Lazy evaluation is necessary in this case:
Grammar[] lazy_expr = new Grammar[1];
Grammar p_lazy_expr = Parsers.Lazy<Expr> (delegate () {
return lazy_expr[0];
});
Grammar p_term = p_lazy_expr.Between (p_lparen, p_rparen) | p_number;
|
Thus, the expression parser can be changed to:
Grammar p_expr = Expressions.BuildExpressionParser (p_term, optable); |
The final parser code is as following:
namespace Codehaus.Parsec.test
{
using Scanner = Parser<_>;
using Lexer = Parser<Tok>;
using Lexeme = Parser<Tok[]>;
using Binary = Map<double, double, double>;
using Unary = Map<double, double>;
using Term = Parser<object>;
using Grammar = Parser<double>;
using Expr = System.Double;
using Parsec;
using System;
public class Calculator
{
public Grammar Parser {
get {
Pattern digits = Patterns.InRange ('0', '9').Many (1);
Pattern number = digits.Seq (Patterns.IsChar ('.').Seq (digits).Optional ());
Scanner s_number = Scanners.IsPattern ("number", number, "number expected");
Scanner s_line_comment = Scanners.IsJavaLineComment ();
Scanner s_block_comment = Scanners.IsBlockComment ("/*", "*/");
Scanner s_whitespace = Scanners.IsWhitespaces ();
Lexer l_number = Lexers.Lex (s_number, Tokenizers.ForDecimal);
Terms ops = Terms.GetOperatorsInstance ("+", "-", "*", "/", "(", ")");
Scanner s_delim = (s_line_comment|s_block_comment|s_whitespace).Many_ ();
Lexer l_tok = ops.Lexer | l_number;
Lexeme lexeme = Lexers.Lexeme (s_delim, l_tok).FollowedBy (Parsers.Eof ());
Binary plus = delegate (double a, double b) { return a + b; };
Binary minus = delegate (double a, double b) { return a - b; };
Binary mul = delegate (double a, double b) { return a * b; };
Binary div = delegate (double a, double b) { return a / b; };
Unary neg = delegate (double n) { return -n; };
Term p_binary_ops = ops.GetParser ("+", "-", "*", "/");
Term p_whitespace_mul = p_binary_ops.Not ().Cast<object>();
Parser<Binary> p_plus = getOperator (ops, "+", plus);
Parser<Binary> p_minus = getOperator (ops, "-", minus);
Parser<Binary> p_mul = (ops.GetParser ("*") | p_whitespace_mul).Seq (Parsers.Return (mul));
Parser<Binary> p_div = getOperator (ops, "/", div);
Parser<Unary> p_neg = getOperator (ops, "-", neg);
Term p_lparen = ops.GetParser ("(");
Term p_rparen = ops.GetParser (")");
Grammar p_number = Terms.OnDecimal<double> (
delegate (int from, int len, string s) {
return double.Parse (s);
}
);
Grammar[] lazy_expr = new Grammar[1];
Grammar p_lazy_expr = Parsers.Lazy<Expr> (delegate () {
return lazy_expr[0];
});
Grammar p_term = p_lazy_expr.Between (p_lparen, p_rparen) | p_number;
OperatorTable<Expr> optable = new OperatorTable<Expr> ()
.Infixl (p_plus, 10)
.Infixl (p_minus, 10)
.Infixl (p_mul, 20)
.Infixl (p_div, 20)
.Prefix (p_neg, 30);
Grammar p_expr = Expressions.BuildExpressionParser (p_term, optable);
lazy_expr[0] = p_expr;
return Parsers.ParseTokens (lexeme, p_expr.FollowedBy (Parsers.Eof ()), "calculator");
}
}
static Parser<T> getOperator<T> (Terms ops, string op, T v) {
return ops.GetParser (op).Seq (Parsers.Return (v));
}
}
}
|
In this tutorial, we chose to run the calculation directly in the semantic actions.
We could defer this calculation and create intermediary abstract syntax tree for further analysis and processing (such as type checking and code optimisation). But that makes no real difference in the parser code. We just need to replace the direct calculation code with the code to create the abstract syntax tree.
This short tutorial cannot cover every detail of the NParsec API. There are many other useful utility functions provided to help the process of constructing a sophisticated parser.
You are welcome to shoot me a note to ajoo.email@gmail.com should you have any comment or question.