Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain Schmitz, and Jacques Farré, promises linear-time parsing with unbounded lookahead. Unfortunately for many, the paper is difficult and abstruse, filled with terrifying charts and obscure notation.

Well, I read it so you don't have to. It's not actually all that bad. I'm gonna start with a basic overview of how shift-reduce parsers work, and then go into how theirs differs. If you're already comfortable with shift/reduce parsing, feel free to skip ahead.

Shift-Reduce Parsing

A shift-reduce parser operates by maintaining two stacks and performing a series of simple actions on those stacks. For this section, I'll talk about the simple grammar of addition of numbers with parenthesization, i.e. something like

Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'

This is often given in a slightly different format:

Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'

Part of the reason here is that we aren't producing, we're parsing. It's very easy to look at the above format and mentally reverse it: that is, instead of reading our grammar as, "An Expr is either a Term or an Expr followed by a plus sign followed by a Term," we can read it as, "Once we have parsed an Expr followed by a plus sign followed by a Term, we have parsed an Expr."
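
To make that reading concrete, here's the same set of rules written out as plain data in Python (the representation is just something I made up for this post, not anything from the paper):

# Each rule is (left-hand side, right-hand side); read it in reverse as
# "once we've parsed the right-hand side, we've parsed the left-hand side".
# 'Digit' here stands in for any single-digit token.
GRAMMAR = [
    ("Expr", ["Expr", "+", "Term"]),
    ("Expr", ["Term"]),
    ("Term", ["Digit"]),
    ("Term", ["(", "Expr", ")"]),
]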

When we run a shift-reduce parser for this grammar, we start with all the input tokens on a stack, and an empty stack for processing those:

   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      |
----------------+----------------------+------------------------------

Depending on what we see on the top of the stack and the current state of the parser, we'll either shift or reduce. The first thing we do is shift, in which case we pop a token from the input stack and push it onto the processing stack:

   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    |
----------------+----------------------+------------------------------
Once the processing stack is in the right state, we perform a reduce step, which works like one of the grammar rules above run in reverse. In the example above, the digit 2 is on top of the processing stack and the next input token is a +, and the rule Expr '+' Term expects an Expr to the left of that +, so we reduce according to our grammar rules: first the digit 2 becomes a Term, and then that Term becomes an Expr.

   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
----------------+----------------------+------------------------------

We can then keep shifting and reducing until we've parsed the full tree:

   input        |   processing         |   action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
 + ( 3 + 4 )    | Expr                 | shift '+'
 ( 3 + 4 )      | + Expr               | shift '('
 3 + 4 )        | ( + Expr             | shift '3'
 + 4 )          | 3 ( + Expr           | reduce Digit to Term
 + 4 )          | Term ( + Expr        | reduce Term to Expr
 + 4 )          | Expr ( + Expr        | shift '+'
 4 )            | + Expr ( + Expr      | shift '4'
 )              | 4 + Expr ( + Expr    | reduce Digit to Term
 )              | Term + Expr ( + Expr | reduce Expr '+' Term to Expr
 )              | Expr ( + Expr        | shift ')'
                | ) Expr ( + Expr      | reduce '(' Expr ')' to Term
                | Term + Expr          | reduce Expr '+' Term to Expr
                | Expr                 | done
----------------+----------------------+------------------------------
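
Here's a tiny sketch of that whole loop in Python. The shift-or-reduce decisions are hard-coded for this one grammar rather than driven by a generated table, so treat it as an illustration of the two stacks, not as what a real parser generator would emit:

def parse(tokens):
    # A hand-rolled shift-reduce parser for:
    #   Expr ::= Expr '+' Term | Term
    #   Term ::= Digit | '(' Expr ')'
    # Each processing-stack entry is a (symbol, value) pair, so the final value
    # ends up being a small parse tree built out of nested tuples.
    inp = list(tokens)   # input stack: inp[0] is the top
    proc = []            # processing stack: proc[-1] is the top
    def tops(n):
        return [sym for sym, _ in proc[-n:]]
    while True:
        if proc and proc[-1][0].isdigit():
            _, d = proc.pop()
            proc.append(("Term", d))             # reduce Digit to Term
        elif tops(3) == ["(", "Expr", ")"]:
            proc.pop()                           # ')'
            _, e = proc.pop()                    # Expr
            proc.pop()                           # '('
            proc.append(("Term", e))             # reduce '(' Expr ')' to Term
        elif tops(3) == ["Expr", "+", "Term"]:
            _, t = proc.pop()                    # Term
            proc.pop()                           # '+'
            _, e = proc.pop()                    # Expr
            proc.append(("Expr", ("+", e, t)))   # reduce Expr '+' Term to Expr
        elif tops(1) == ["Term"] and (len(proc) == 1 or proc[-2][0] == "("):
            _, t = proc.pop()
            proc.append(("Expr", t))             # reduce Term to Expr
        elif inp:
            tok = inp.pop(0)
            proc.append((tok, tok))              # shift
        else:
            break
    assert tops(1) == ["Expr"] and len(proc) == 1, "parse error"
    return proc[0][1]
print(parse("2+(3+4)"))   # ('+', '2', ('+', '3', '4'))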

Now, I've completely elided how we actually build the state machine that lets us decide when to shift and when to reduce. The process is straightforward and is discussed in great detail elsewhere. There is, however, a problem with shift-reduce parsing.

Unlimited Lookahead

Above, our grammar was simple: we could decide what to do next based entirely on the parser's state and the top token of the input stack. But what if that isn't true? We can imagine grammars in which the meaning of what you're parsing isn't clear until much later in the input string. Imagine that you're designing a Go-like language with tuples, and you use := as shorthand for declaring variables. Your code might look like this:

(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();

You design it so that any expression is also a valid statement, so even though it's a little silly, you could write

(this, that);

as a bare statement. Well, now we have a problem. A parser for this language is parsing something and gets this far into the input string:

'(' 'a' [ ... ]
     ^

Is this an expression, or a declaration? Well, that depends on the context. If this is the beginning of

(a, b, c) := some_expr();

then we're parsing the left-hand side of a declaration, and a should be an identifier. But if it's the beginning of

(a, 2+2, foo());

then it's the beginning of an expression! We need to look further ahead to find out which. But in this case, we have no idea how much further to look ahead: the token that settles the question might be arbitrarily far away in the input.
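
To make the problem concrete, the relevant corner of such a grammar might look something like this (the rule names here are made up for illustration, not taken from any real language):

Stmt    ::= Expr ';' | Pattern ':=' Expr ';'
Pattern ::= Ident | '(' Patterns ')'
Expr    ::= Ident | Digit | '(' Exprs ')' | ...

where Patterns and Exprs are comma-separated lists of Patterns and Exprs. After a '(' and an identifier, both the Pattern rule and the Expr rule are still live, and they stay live until the parser sees either something only an Expr can contain or the := or ; at the end.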

If we want to continue using shift/reduce parsing, we have to get around this somehow. For example, Rust solves the problem mentioned here by using the keyword let to introduce declarations, which means anything after the let keyword is going to be a declaration, but otherwise it'll be an expression. But what if we wanted to keep our grammar the way it was?

Linear Time

Well, we'd lose some efficiency. The shift/reduce algorithms are guaranteed to walk along the input string directly, never backtracking: they do one shift per token plus some number of reductions, and the total amount of work stays proportional to the length of the input. That's a nice property to have!

But most of the algorithms that handle unbounded lookahead do so with backtracking, which means you might need to move back in the input string and then move forward again. Parser combinators work like this: they pick a path and keep at it, and if it turns out to be wrong, they go back and try another one. Depending on your circumstances, that might be okay, or it might be terrible.
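
For instance, here's a toy version of that idea in Python (my own sketch, not any particular combinator library): the alternation combinator tries its first branch, and if that fails, it falls back to the same input position and tries the second:

# A parser here is a function from (tokens, position) to either
# (result, new position) or None on failure.
def token(expected):
    def p(toks, pos):
        if pos < len(toks) and toks[pos] == expected:
            return toks[pos], pos + 1
        return None
    return p
def alt(first, second):
    # Try `first`; if it fails, try `second` from the same position.
    def p(toks, pos):
        result = first(toks, pos)
        if result is not None:
            return result
        return second(toks, pos)   # pos is unchanged: this is the backtracking
    return p
def seq(*parsers):
    # Run parsers one after another, threading the position through.
    def p(toks, pos):
        results = []
        for parser in parsers:
            r = parser(toks, pos)
            if r is None:
                return None
            value, pos = r
            results.append(value)
        return results, pos
    return p
# 'a' followed by either "b c" or "b d": the first branch gets partway through
# the input, fails, and the parser backs up and tries the second branch.
parser = seq(token("a"), alt(seq(token("b"), token("c")),
                             seq(token("b"), token("d"))))
print(parser(["a", "b", "d"], 0))   # (['a', ['b', 'd']], 3)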

Packrat parsers keep the linear-time guarantee, but do so by keeping around a lot of extra data: your parser is no longer just two stacks plus a state, it is now a two-dimensional table with an entry for every combination of position in your input string and rule in your grammar. Even though this is technically linear time, the constant factors are large enough to make it slow on reasonably-sized inputs, and intractably slow or just impossible to run on large ones.
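
At its core, that table is just memoization keyed on a (rule, position) pair. Here's a stripped-down sketch of the idea in Python, using the same little grammar from before with the left recursion rewritten away (again, my own illustration rather than a faithful packrat implementation):

from functools import lru_cache
def make_packrat(toks):
    # Memoize every (rule, position) pair: conceptually, this is the table.
    # Each rule function returns the position after a successful parse, or None.
    @lru_cache(maxsize=None)
    def apply(rule, pos):
        return RULES[rule](pos)
    def term(pos):
        # Term ::= Digit | '(' Expr ')'
        if pos < len(toks) and toks[pos].isdigit():
            return pos + 1
        if pos < len(toks) and toks[pos] == "(":
            after = apply("Expr", pos + 1)
            if after is not None and after < len(toks) and toks[after] == ")":
                return after + 1
        return None
    def expr(pos):
        # Expr ::= Term ('+' Term)*  (left recursion rewritten away, since a
        # naive memoizing recursive-descent parser can't handle it directly)
        after = apply("Term", pos)
        while after is not None and after < len(toks) and toks[after] == "+":
            nxt = apply("Term", after + 1)
            if nxt is None:
                break
            after = nxt
        return after
    RULES = {"Expr": expr, "Term": term}
    return apply
parse = make_packrat(list("2+(3+4)"))
print(parse("Expr", 0))   # 7: the whole seven-token input was consumed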

It looks like we have a choice: we can either simplify our grammar and keep our efficient, linear-time parsing, or we can keep our grammar and get slower or more memory-hungry parsing algorithms. Is there any way we can have our cake and eat it, too?

Shift-Resolve Parsing

Let's start with the same basic idea, but add an extra wrinkle: we no longer just move items from the input stack to the processing stack; we also sometimes move items from the processing stack back onto the input stack. Before, we had two operations:

  • shift an item from the input stack to the processing stack.
  • reduce several items from the processing stack using a grammar rule, and put the resulting item on the processing stack.

We now have three operations (there's a rough sketch of them in code after this list):

  • shift an item from the input stack to the processing stack.
  • reduce several items from the processing stack using a grammar rule, and put the resulting item on the input stack.
  • pushback several items from the processing stack to the input stack.
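
Here's a rough sketch of the stack plumbing for those three operations in Python. This is only the plumbing: in the actual algorithm, a table computed from the grammar decides which operation to apply and with which rule, and that construction is the clever part of the paper:

# Both stacks hold grammar symbols (tokens or nonterminals). As in the tables
# above, inp[0] is the next thing the parser will look at, and the end of proc
# is the top of the processing stack.
def shift(inp, proc):
    # Move one item from the input stack to the processing stack.
    proc.append(inp.pop(0))
def reduce(inp, proc, rule_length, nonterminal):
    # Pop the right-hand side of a rule off the processing stack and push the
    # resulting nonterminal back onto the *input* stack, not the processing one.
    del proc[len(proc) - rule_length:]
    inp.insert(0, nonterminal)
def pushback(inp, proc, count):
    # Move `count` items from the processing stack back onto the input stack.
    for _ in range(count):
        inp.insert(0, proc.pop())
inp, proc = ["2", "+", "3"], []
shift(inp, proc)               # proc: ['2']   inp: ['+', '3']
reduce(inp, proc, 1, "Term")   # proc: []      inp: ['Term', '+', '3']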