Tree @master (Download .tar.gz)
- ..
- 2014-02-03
- 4d-chess.md
- a-small-haskell-record-complaint
- cherenkov.md
- codata
- commonplace
- daemontools.html
- daemontools.md
- data-files-and-backpack.md
- distributed-git.md
- file-system
- formats.md
- higher-rank-trait-bounds.md
- idiomaticity.md
- in-defense-of-hm
- madlibs-intro.md
- madlibs-ml.md
- madlibs-scheme.md
- multiwhile.md
- orwell.md
- resources-laziness-and-cps.md
- script-based-data-analysis
- semigroup.html
- semigroup.md
- sexprs.md
- shift-resolve-parsing.md
- some-rust-errors.md
- static-versus-dyn
- tansu
- tc-post.md
- tc-post.md.bak
- telml.md
- typeclasses.md
- writer-leak.md
shift-resolve-parsing.md @master — view markup · raw · history · blame
Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain Schmitz, and Jacques Farré, promises linear-time parsing with unbounded lookahead. Unfortunately for many, the paper is difficult and abstruse, filled with terrifying charts and obscure notation.
Well, I read it so you don't have to. It's not actually all that bad. I'm gonna start with a basic overview of how shift-reduce parsers work, and then go into how theirs differs. If you already are comfortable with shift/reduce, go and skip ahead.
Shift-Reduce Parsing
A shift-reduce parser operates by maintaining two stacks and performing a series of simple actions on those stacks. For this section, I'll talk about the simple grammar of addition of numbers with parenthesization, i.e. something like
Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'
This is often given in a slightly different format:
Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'
Part of the reason here is that we aren't producing, we're
parsing. It's very easy to look at the above format and
mentally reverse it: that is, instead of looking at our
grammar as, "An Expr
is either a Term
or an Expr
followed by a plus sign followed by a Term
," we can read
our grammar as, "Once we have parsed an Expr
followed by a
plus sign followed by a Term
, we have pased an Expr
."
When we run a shift-reduce parser for this grammar, we start with all the input tokens on a stack, and an empty stack for processing those:
input | processing | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 ) | |
----------------+----------------------+------------------------------
Depending on what we see on the top of the stack and the current state of the parser, we'll either shift or reduce. The first thing we do is shift, in which case we take a pop a token from the input stack and push it onto the processing stack
input | processing | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 ) | | shift '2'
input | processing | action
----------------+----------------------+------------------------------
Once the processing stack is in the right state, we then
perform a reduce step, which works like the grammar rules
above run in reverse. In the above example, we're looking
at a +
on the top of the stack, and that expected an
Expr
on the left-hand side, so we can reduce based on
our grammar rules, turning the digit 2
into an Expr
.
input | processing | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 ) | | shift '2'
+ ( 3 + 4 ) | 2 | reduce Digit to Term
+ ( 3 + 4 ) | Term | reduce Term to Expr
----------------+----------------------+------------------------------
we can then keep scanning and reducing until we parse the full tree:
input | processing | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 ) | | shift '2'
+ ( 3 + 4 ) | 2 | reduce Digit to Term
+ ( 3 + 4 ) | Term | reduce Term to Expr
+ ( 3 + 4 ) | Expr | shift '+'
( 3 + 4 ) | + Expr | shift '('
3 + 4 ) | ( + Expr | shift '3'
+ 4 ) | 3 ( + Expr | reduce Digit to Term
+ 4 ) | Term ( + Expr | reduce Term to Expr
+ 4 ) | Expr ( + Expr | shift '+'
4 ) | + Expr ( + Expr | shift '4'
) | 4 + Expr ( + Expr | reduce Digit to Term
) | Term + Expr ( + Expr | reduce Term to Expr
) | Expr + Expr ( + Expr | reduce Expr '+' Expr to Expr
) | Expr ( + Expr | shift '('
| ) Expr ( + Expr | reduce '(' Expr ')' to Term
| Term + Expr | reduce Term to Expr
| Expr + Expr | reduce Expr '+' Expr to Expr
| Expr | done
----------------+----------------------+------------------------------
Now, I've completely elided how we actually build the state machine that lets us do this. The process is straightforward and is discussed in great detail elsewhere. There is, however, a problem with shift-reduce grammars.
Unlimited Lookahead
Above, our grammar was simple: we could determine what the next rule
to apply was based entirely on the top token of the input stack. But
what if that isn't true? We can imagine grammars in which the
meaning of what you're doing isn't clear until much later in the
input string. Imagine that you're designing a Go-like language with
tuples, and you use :=
as shorthand for declaring variables. Our
code might look like this.
(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();
You design it so that any expression is also a valid statement, so even though it's a little silly, you could write
(this, that);
as a bare statement. Well, now we have a problem. A parser for this language is parsing something and gets this far into the input string:
'(' 'a' [ ... ]
^
Is this an expression, or a declaration? Well, that depends on the context. if this is the beginning of
(a, b, c) := some_expr();
then we're parsing the left-hand side of a declaration, and a
should be an identifier. But if it's the beginning of
(a, 2+2, foo());
then it's the beginning of an expression! We need to look further ahead to find out which. But in this case, we have no idea how much further to look ahead—it might be an arbitrarily long number of tokens in the future.
If we want to continue using shift/reduce parsing, we have
to get around this somehow. For example, Rust solves the
problem mentioned here by using the keyword let
to
introduce declarations, which means anything after the
let
keyword is going to be a declaration, but otherwise
it'll be an expression. But what if we wanted to keep
our grammar the way it was?
Linear Time
Well, we'd lose some efficiency. The shift/reduce algorithms are guaranteed to walk along the input string directly, doing a limited number of steps per token they observe: they will do either one shift, or several reductions, and the number of reductions done is fixed by the grammar. That's a nice property to have!
But most of the algorithms that handle unbounded lookahead do so with backtracking, which means you might need to move back in the input string and then move forward again. Parser combinators work like this: they take a path and keep at it, and if it turned out to be wrong, they then go back and try another one. Depending on your circumstance, that might be okay, or it might be terrible.
Packrat parsers keep the linear-time guarantee, but do so by keeping around a lot of extra data: your parser is no longer just two stacks plus a state, it is now a two-dimensional table with entries for both every token in your input string and every rule in your grammar. This is inefficient enough that, even though it is technically linear time, the constant factors are large enough to make it slow on reasonably-sized inputs and intractably slow or just impossible to run on large inputs.
It looks like we have a choice: we can either simplify our grammar and keep our efficient, linear-time parsing, or we can keep our grammar and get slower or more memory-hungry parsing algorithms. Is there any way we can have our cake and eat it, too?
Shift-Resolve Parsing
Let's start with the same basic idea, but add an extra wrinkle: we no longer just move from the input to the processing stack, we also sometimes move from the processing stack and push back to the input stack. Before, we had two operations:
- shift an item from the input stack to the processing stack.
- reduce several items from the processing stack, use a grammar rule and put the resulting item on the processing stack.
We now have three operations:
- shift an item from the input stack to the processing stack
- reduce several items from the processing stack, use a grammar rule and put the resulting items on the input stack.
- pushback several items from the processing stack to the input stack.