Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain
Schmitz, and Jacques Farré, promises linear-time parsing with
unbounded lookahead. Unfortunately for many, the paper is difficult
and abstruse, filled with terrifying charts and obscure notation.

Well, I read it so you don't have to. It's not actually all that
bad. I'm gonna start with a basic overview of how shift-reduce
parsers work, and then go into how theirs differs. If you're
already comfortable with shift/reduce parsing, feel free to skip
ahead.

# Shift-Reduce Parsing

A _shift-reduce parser_ operates by maintaining two stacks and
performing a series of simple actions on those stacks. For
this section, I'll talk about the simple grammar of addition of
numbers with parenthesization, i.e. something like

```
Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'
```

This is often given in a slightly different format:

```
Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'
```

Part of the reason here is that we aren't producing, we're
_parsing_. It's very easy to look at the above format and
mentally reverse it: that is, instead of looking at our
grammar as, "An `Expr` is either a `Term` or an `Expr`
followed by a plus sign followed by a `Term`," we can read
our grammar as, "Once we have parsed an `Expr` followed by a
plus sign followed by a `Term`, we have parsed an `Expr`."

When we run a shift-reduce parser for this grammar, we start
with all the input tokens on a stack, and an empty stack for
processing those:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      |
----------------+----------------------+------------------------------
```

Depending on what we see on the top of the stack and the
current state of the parser, we'll either _shift_ or
_reduce_. The first thing we do is _shift_, in which case we
pop a token from the input stack and push it onto the
processing stack:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    |
----------------+----------------------+------------------------------
```

Once the processing stack is in the right state, we then
perform a _reduce_ step, which works like the grammar rules
above run in reverse. In the above example, we're looking at
a `+` at the top of the input stack; the rule
`Expr -> Expr '+' Term` expects an `Expr` to the left of that
`+`, so we can _reduce_ based on our grammar rules, turning
the digit `2` into a `Term` and then into an `Expr`.

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    | reduce Digit to Term
  + ( 3 + 4 )   | Term                 | reduce Term to Expr
----------------+----------------------+------------------------------
```

We can then keep shifting and reducing until we have parsed the
full tree:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    | reduce Digit to Term
  + ( 3 + 4 )   | Term                 | reduce Term to Expr
  + ( 3 + 4 )   | Expr                 | shift '+'
    ( 3 + 4 )   | + Expr               | shift '('
      3 + 4 )   | ( + Expr             | shift '3'
        + 4 )   | 3 ( + Expr           | reduce Digit to Term
        + 4 )   | Term ( + Expr        | reduce Term to Expr
        + 4 )   | Expr ( + Expr        | shift '+'
          4 )   | + Expr ( + Expr      | shift '4'
            )   | 4 + Expr ( + Expr    | reduce Digit to Term
            )   | Term + Expr ( + Expr | reduce Expr '+' Term to Expr
            )   | Expr ( + Expr        | shift ')'
                | ) Expr ( + Expr      | reduce '(' Expr ')' to Term
                | Term + Expr          | reduce Expr '+' Term to Expr
                | Expr                 | done
----------------+----------------------+------------------------------
```

Now, I've completely elided _how_ we actually build the state machine
that lets us do this. The process is straightforward and is discussed
in great detail elsewhere. There is, however, a problem with
shift-reduce parsing.

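Before moving on, the loop above is simple enough to sketch in Python.
This is a toy: a real shift-reduce parser drives its shift/reduce
decisions from a state table generated from the grammar, while this
sketch hard-codes the decisions for our little grammar just to show
the mechanics.

```python
def parse(tokens):
    """Toy shift-reduce parser for the Expr/Term grammar above.

    The reduction logic is hard-coded for this one grammar; a real
    parser generator would derive it as a state table.
    """
    input_stack = list(reversed(tokens))  # top of stack = end of list
    stack = []                            # the processing stack

    def try_reduce():
        # Order matters: check the longer rules first, so that a
        # `Term` completing `Expr '+' Term` isn't reduced to `Expr`.
        if stack[-3:] == ["Expr", "+", "Term"]:
            del stack[-3:]                # Expr -> Expr '+' Term
            stack.append("Expr")
        elif stack[-3:] == ["(", "Expr", ")"]:
            del stack[-3:]                # Term -> '(' Expr ')'
            stack.append("Term")
        elif stack and stack[-1].isdigit():
            stack[-1] = "Term"            # Term -> Digit
        elif stack[-1:] == ["Term"]:
            stack[-1] = "Expr"            # Expr -> Term
        else:
            return False
        return True

    while True:
        if try_reduce():                  # reduce as long as we can...
            continue
        if input_stack:                   # ...otherwise shift...
            stack.append(input_stack.pop())
            continue
        return stack                      # ...stop when both fail

print(parse(list("2+(3+4)")))  # ['Expr']
```

The order of checks in `try_reduce` is what stands in for the state
machine here: it ensures `Term` only reduces to `Expr` when it isn't
about to complete `Expr '+' Term`.
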
# Unlimited Lookahead

Above, our grammar was simple: we could determine what the next rule
to apply was based entirely on the top token of the input stack. But
what if that isn't true? We can imagine grammars in which the
meaning of what you're doing isn't clear until much later in the
input string. Imagine that you're designing a Go-like language with
tuples, and you use `:=` as shorthand for declaring variables. Our
code might look like this.

~~~
(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();
~~~

You design it so that any expression is _also_ a valid statement,
so even though it's a little silly, you could write

~~~
(this, that);
~~~

as a bare statement. Well, now we have a problem. A parser for
this language is parsing something and gets this far into the
input string:

~~~
'(' 'a' [ ... ]
    ^
~~~

Is this an expression, or a declaration? Well, that depends on
the context. If this is the beginning of

~~~
(a, b, c) := some_expr();
~~~

then we're parsing the left-hand side of a declaration, and `a`
should be an identifier. But if it's the beginning of

~~~
(a, 2+2, foo());
~~~

then it's the beginning of an expression! We need to look
further ahead to find out which. But in this case, we have
_no idea_ how much further to look ahead: it might be
arbitrarily many tokens in the future.

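To make the conflict concrete, a fragment of this hypothetical
language's grammar (the rule names here are invented for
illustration) might look something like:

```
Stmt    ::= Expr ';' | Pattern ':=' Expr ';'
Pattern ::= '(' Idents ')'
Idents  ::= Ident | Idents ',' Ident
Expr    ::= '(' Exprs ')' | Ident | Digit | Expr '+' Expr
Exprs   ::= Expr | Exprs ',' Expr
```

Both `Pattern` and a parenthesized `Expr` begin with `'('` followed
by an identifier, so the parser can't know which rule it's in until
it either reaches a `:=` or sees something, like `2+2`, that only an
expression allows.
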
If we want to continue using shift/reduce parsing, we have
to get around this somehow. For example, Rust solves the
problem mentioned here by using the keyword `let` to
introduce declarations, which means anything after the
`let` keyword is going to be a declaration, but otherwise
it'll be an expression. But what if we wanted to keep
our grammar the way it was?

# Linear Time

Well, we'd lose some efficiency. The shift/reduce algorithms
are guaranteed to walk along the input string directly, doing
a bounded number of steps per token they observe: they will
do either one shift, or several reductions, and the number
of reductions per token is bounded by the grammar. That's a
nice property to have!

But most of the algorithms that handle unbounded lookahead
do so with backtracking, which means you might need to
move back in the input string and then move forward again.
Parser combinators work like this: they pick a path and
stick with it, and if it turns out to be wrong, they go
back and try another one. Depending on your circumstances,
that might be okay, or it might be terrible.

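For a concrete picture of that backtracking, here is a minimal
sketch of an "alternative" combinator in Python (the names are mine,
not any particular combinator library): each parser takes the token
list and a position, and failure means the caller simply retries
from the old position.

```python
def tok(expected):
    # A parser that matches exactly one expected token.
    # Parsers return (value, new_pos) on success, or None on failure.
    def parser(tokens, pos):
        if pos < len(tokens) and tokens[pos] == expected:
            return (expected, pos + 1)
        return None
    return parser

def alt(p, q):
    # Try p; if it fails, backtrack to the original position
    # and try q against the same input.
    def parser(tokens, pos):
        result = p(tokens, pos)
        if result is not None:
            return result
        return q(tokens, pos)   # re-reads from `pos`: backtracking
    return parser

p = alt(tok('a'), tok('b'))
print(p(['b'], 0))  # ('b', 1): tok('a') failed, so we retried from 0
```

When alternatives nest, a failure deep inside one branch can force
re-reading a long stretch of input, which is where the worst-case
cost comes from.
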
Packrat parsers keep the linear-time guarantee, but do so
by keeping around a _lot_ of extra data: your parser is
no longer just two stacks plus a state, it is now a
two-dimensional table with entries for every token in
your input string and every rule in your grammar. Even
though this is technically linear time, the constant
factors are large enough to make it slow on
reasonably-sized inputs and intractably slow, or simply
impossible to run, on large inputs.

It looks like we have a choice: we can either simplify
our grammar and keep our efficient, linear-time parsing,
or we can keep our grammar and get slower or more
memory-hungry parsing algorithms. Is there any way we
can have our cake and eat it, too?

# Shift-Resolve Parsing

Let's start with the same basic idea, but add an extra wrinkle: we
no longer just move tokens from the input stack to the processing
stack; we also sometimes move items from the processing stack back
onto the input stack. Before, we had two operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, applying a grammar
  rule and putting the resulting item on the processing stack.

We now have three operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, applying a grammar
  rule and putting the resulting item on the _input stack_.
- _pushback_ several items from the processing stack to the input stack.
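As plain stack manipulation, the three operations look like the
sketch below (the function names and signatures are mine; the real
content of the algorithm, the table built from the grammar that
decides which operation to apply when, is elided entirely):

```python
# The three shift-resolve stack operations as list manipulation.
# Both stacks keep their top at the end of the list.

def shift(input_stack, processing):
    # Move the top of the input stack onto the processing stack.
    processing.append(input_stack.pop())

def reduce(input_stack, processing, rule_head, body_len):
    # Pop the rule body off the processing stack, but push the
    # resulting nonterminal onto the *input* stack, so it will be
    # re-examined as if it were input.
    del processing[-body_len:]
    input_stack.append(rule_head)

def pushback(input_stack, processing, count):
    # Return `count` items from the processing stack to the
    # input stack.
    for _ in range(count):
        input_stack.append(processing.pop())
```

Note how _reduce_ differs from the shift-reduce version above: the
resulting item goes back onto the input stack rather than staying on
the processing stack.
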