Skip to end of metadata
Go to start of metadata

Parser Combinators

The idea that underlies parser combinators is that all tasks that involve
reading complex input can ultimately be reduced to one primitive: read
a character from the input, and figure out what to do with it.

Of course, to figure out what to do with it, you might need to look at
more than one character; you might need to look at a bunch, to choose between
options.

And, of course, you don't really want to think of it in terms of reading
characters. In fact, the way that you'd generally want to think it is
as a formatted object which was serialized in some way.

So - how do geeks talk about what a serialized format looks like? Well,
typically, we do one of two things: we write a regular expression, or
we write a grammar.

Here's where parser combinators start to make sense. A few years ago, some
really bright functional language geeks realized that if you say that
a parser is a function - then you can write a set of meta-functions that
compose those parsers - and in fact, you can write those meta-functions so
that you end up with something that looks darned similar to the EBNF grammar.

For some examples, I'm going to follow the lead of the guys who originally
came up with the idea, and pretend that we can create new operator
symbols if we want. I'm going to use triple-symbols for my new operators:

So here's the basic list of combinators:

Sequencing

This returns a new parser for a followed by b.

Alternation

This returns a new parser for an a or a b.

Optional

this returns a new parser which parses an a if it can, and succeeds no matter what. (That is, an optional a in the
grammar.)

Lists

These return parsers for repetitions, either requiring
at least one or not.

So, take a simple grammar, like one for Lisp lists:

Basic Lisp Grammar
Combinator Parser for Lisp Code

Of course, we don't have any definitions of lparen, rparen, number, or
symbol. That's because we haven't talked about how we get down to the
low level primitives. It's pretty simple really: there's exactly two
primitives: one takes a set of characters, and returns a parser that
succeeds if the current input character contains a character in the set;
the other primitive just reads one character and succeeds. We'll call the
first one "chars", and the second one "any". So now, we can fill in a bit:

BNF for tokens in Lisp grammar

In parser combinators, this becomes:

Parser Combinators for Tokens

That's almost the whole show. There's one little bit left - which is attaching
actions to parsing elements. That is something that you'd typically want to
do. There have been a bunch of ways that people have come up with for handling
actions. I'll just toss one in, for the sake of simplicity; they all have
roughly the same general idea.

(Actually, I'm out of time for the evening, so I'll save this for tomorrow.)

  • No labels