gdritter repos documents / master posts / shift-resolve-parsing.md
master

Tree @master (Download .tar.gz)

shift-resolve-parsing.md @masterview rendered · raw · history · blame

Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain
Schmitz, and Jacques Farré, promises linear-time parsing with
unbounded lookahead. Unfortunately for many, the paper is difficult
and abstruse, filled with terrifying charts and obscure notation.

Well, I read it so you don't have to. It's not actually all that
bad. I'm gonna start with a basic overview of how shift-reduce
parsers work, and then go into how theirs differs. If you already
are comfortable with shift/reduce, go and skip ahead.

# Shift-Reduce Parsing

A _shift-reduce parser_ operates by maintaining two stacks and
performing a series of simple actions on those stacks. For
this section, I'll talk about the simple grammar of addition of
numbers with parenthesization, i.e. something like

```
Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'
```

This is often given in a slightly different format:

```
Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'
```

Part of the reason here is that we aren't producing, we're
_parsing_. It's very easy to look at the above format and
mentally reverse it: that is, instead of looking at our
grammar as, "An `Expr` is either a `Term` or an `Expr`
followed by a plus sign followed by a `Term`," we can read
our grammar as, "Once we have parsed an `Expr` followed by a
plus sign followed by a `Term`, we have pased an `Expr`."

When we run a shift-reduce parser for this grammar, we start
with all the input tokens on a stack, and an empty stack for
processing those:

```
   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      |
----------------+----------------------+------------------------------
```

Depending on what we see on the top of the stack and the
current state of the parser, we'll either
_shift_ or _reduce_. The first thing we do is _shift_, in
which case we take a pop a token from the input stack and
push it onto the processing stack

```
   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
   input        |   processing         | action
----------------+----------------------+------------------------------
```

Once the processing stack is in the right state, we then
perform a _reduce_ step, which works like the grammar rules
above run in reverse. In the above example, we're looking
at a `+` on the top of the stack, and that expected an
`Expr` on the left-hand side, so we can _reduce_ based on
our grammar rules, turning the digit `2` into an `Expr`.

```
   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
----------------+----------------------+------------------------------
```

we can then keep scanning and reducing until we parse the
full tree:

```
   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
 + ( 3 + 4 )    | Expr                 | shift '+'
 ( 3 + 4 )      | + Expr               | shift '('
 3 + 4 )        | ( + Expr             | shift '3'
 + 4 )          | 3 ( + Expr           | reduce Digit to Term
 + 4 )          | Term ( + Expr        | reduce Term to Expr
 + 4 )          | Expr ( + Expr        | shift '+'
 4 )            | + Expr ( + Expr      | shift '4'
 )              | 4 + Expr ( + Expr    | reduce Digit to Term
 )              | Term + Expr ( + Expr | reduce Term to Expr
 )              | Expr + Expr ( + Expr | reduce Expr '+' Expr to Expr
 )              | Expr ( + Expr        | shift '('
                | ) Expr ( + Expr      | reduce '(' Expr ')' to Term
                | Term + Expr          | reduce Term to Expr
                | Expr + Expr          | reduce Expr '+' Expr to Expr
                | Expr                 | done
----------------+----------------------+------------------------------
```

Now, I've completely elided _how_ we actually build the state machine
that lets us do this. The process is straightforward and is discussed
in great detail elsewhere. There is, however, a problem with
shift-reduce grammars.

# Unlimited Lookahead

Above, our grammar was simple: we could determine what the next rule
to apply was based entirely on the top token of the input stack. But
what if that isn't true? We can imagine grammars in which the
meaning of what you're doing isn't clear until much later in the
input string. Imagine that you're designing a Go-like language with
tuples, and you use `:=` as shorthand for declaring variables. Our
code might look like this.

~~~
(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();
~~~

You design it so that any expression is _also_ a valid statement,
so even though it's a little silly, you could write

~~~
(this, that);
~~~

as a bare statement. Well, now we have a problem. A parser for
this language is parsing something and gets this far into the
input string:

~~~
'(' 'a' [ ... ]
     ^
~~~

Is this an expression, or a declaration? Well, that depends on
the context. if this is the beginning of

~~~
(a, b, c) := some_expr();
~~~

then we're parsing the left-hand side of a declaration, and `a`
should be an identifier. But if it's the beginning of

~~~
(a, 2+2, foo());
~~~

then it's the beginning of an expression! We need to look
further ahead to find out which. But in this case, we have
_no idea_ how much further to look ahead—it might be an
arbitrarily long number of tokens in the future.

If we want to continue using shift/reduce parsing, we have
to get around this somehow. For example, Rust solves the
problem mentioned here by using the keyword `let` to
introduce declarations, which means anything after the
`let` keyword is going to be a declaration, but otherwise
it'll be an expression. But what if we wanted to keep
our grammar the way it was?

# Linear Time

Well, we'd lose some efficiency. The shift/reduce algorithms
are guaranteed to walk along the input string directly, doing
a limited number of steps per token they observe: they will
do either one shift, or several reductions, and the number
of reductions done is fixed by the grammar. That's a nice
property to have!

But most of the algorithms that handle unbounded lookahead
do so with backtracking, which means you might need to
move back in the input string and then move forward again.
Parser combinators work like this: they take a path and
keep at it, and if it turned out to be wrong, they then
go back and try another one. Depending on your circumstance,
that might be okay, or it might be terrible.

Packrat parsers keep the linear-time guarantee, but do so
by keeping around a _lot_ of extra data: your parser is
no longer just two stacks plus a state, it is now a
two-dimensional table with entries for both every token in
your input string and every rule in your grammar. This
is inefficient enough that, even though it is technically
linear time, the constant factors are large enough to make
it slow on reasonably-sized inputs and intractably slow
or just impossible to run on large inputs.

It looks like we have a choice: we can either simplify
our grammar and keep our efficient, linear-time parsing,
or we can keep our grammar and get slower or more
memory-hungry parsing algorithms.
Is there any way we can have our cake and eat it, too?

# Shift-Resolve Parsing

Let's start with the same basic idea, but add an extra wrinkle: we
no longer just move from the input to the processing stack, we also
sometimes move from the processing stack and push back to the input
stack. Before, we had two operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, use a grammar rule
  and put the resulting item on the processing stack.

We now have three operations:

- _shift_ an item from the input stack to the processing stack
- _reduce_ several items from the processing stack, use a grammar rule
  and put the resulting items on the _input stack_.
- _pushback_ several items from the processing stack to the input stack.