One of my favorite quotes about parsing is the title of
\link{http://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt|a blog post by Laurence Tratt},
who referred to parsing as, "The Solved Problem That Isn't."
Parsing is tricky and weird and non-intuitive, and yet it seems
like it should be a simple task: you take text, you put it into

Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain
Schmitz, and Jacques Farré, promises linear-time parsing with
unbounded lookahead. Unfortunately for many, the paper is difficult
and abstruse, filled with terrifying charts and obscure notation.

Well, I read it so you don't have to. It's not actually all that
bad. I'm gonna start with a basic overview of how shift-reduce
parsers work, and then go into how theirs differs. If you're
already comfortable with shift/reduce parsing, feel free to skip ahead.

# Shift-Reduce Parsing

A _shift-reduce parser_ operates by maintaining two stacks and
performing a series of simple actions on those stacks. For
this section, I'll talk about the simple grammar of addition of
numbers with parenthesization, i.e. something like

```
Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'
```

This is often given in a slightly different format:

```
Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'
```

Part of the reason here is that we aren't producing, we're
_parsing_. It's very easy to look at the above format and
mentally reverse it: that is, instead of reading our
grammar as, "An `Expr` is either a `Term` or an `Expr`
followed by a plus sign followed by a `Term`," we can read
it as, "Once we have parsed an `Expr` followed by a
plus sign followed by a `Term`, we have parsed an `Expr`."
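
One way to see the "producing" reading concretely is as a datatype:
each grammar rule becomes a constructor. Here's a sketch in Rust (the
type and constructor names are my own, purely for illustration):

```rust
// Each rule of the grammar above becomes one constructor.
#[derive(Debug)]
enum Expr {
    Add(Box<Expr>, Term), // Expr ::= Expr '+' Term
    Term(Term),           // Expr ::= Term
}

#[derive(Debug)]
enum Term {
    Digit(u32),           // Term ::= Digit
    Paren(Box<Expr>),     // Term ::= '(' Expr ')'
}

fn main() {
    // The tree for: 2 + ( 3 + 4 )
    let e = Expr::Add(
        Box::new(Expr::Term(Term::Digit(2))),
        Term::Paren(Box::new(Expr::Add(
            Box::new(Expr::Term(Term::Digit(3))),
            Term::Digit(4),
        ))),
    );
    println!("{:?}", e);
}
```

Producing means building values of this type top-down; parsing means
discovering, from a flat token string, which constructors to apply.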

When we run a shift-reduce parser for this grammar, we start
with all the input tokens on a stack, and an empty stack for
processing those:

```
 input          | processing           | action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      |
----------------+----------------------+------------------------------
```

Depending on what we see on the top of the stack and the
current state of the parser, we'll either
_shift_ or _reduce_. The first thing we do is _shift_, in
which we pop a token from the input stack and
push it onto the processing stack:

```
 input          | processing           | action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    |
----------------+----------------------+------------------------------
```

Once the processing stack is in the right state, we then
perform a _reduce_ step, which works like one of the grammar
rules above run in reverse. In the example above, the next
input token is a `+`, which in our grammar appears only after
an `Expr`, so we can _reduce_ based on our grammar rules,
turning the digit `2` first into a `Term` and then into an `Expr`.

```
 input          | processing           | action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
----------------+----------------------+------------------------------
```

We can then keep shifting and reducing until we parse the
full tree:

```
 input          | processing           | action
----------------+----------------------+------------------------------
 2 + ( 3 + 4 )  |                      | shift '2'
 + ( 3 + 4 )    | 2                    | reduce Digit to Term
 + ( 3 + 4 )    | Term                 | reduce Term to Expr
 + ( 3 + 4 )    | Expr                 | shift '+'
 ( 3 + 4 )      | + Expr               | shift '('
 3 + 4 )        | ( + Expr             | shift '3'
 + 4 )          | 3 ( + Expr           | reduce Digit to Term
 + 4 )          | Term ( + Expr        | reduce Term to Expr
 + 4 )          | Expr ( + Expr        | shift '+'
 4 )            | + Expr ( + Expr      | shift '4'
 )              | 4 + Expr ( + Expr    | reduce Digit to Term
 )              | Term + Expr ( + Expr | reduce Expr '+' Term to Expr
 )              | Expr ( + Expr        | shift ')'
                | ) Expr ( + Expr      | reduce '(' Expr ')' to Term
                | Term + Expr          | reduce Expr '+' Term to Expr
                | Expr                 | done
----------------+----------------------+------------------------------
```
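
To make the loop above concrete, here's a toy shift-reduce parser for
this exact grammar, written in Rust. This is my own illustrative
encoding, not a table generated by a parser generator: it decides what
to do by pattern-matching on the top few symbols of the processing
stack, and, to keep the example checkable, the `Expr` and `Term`
symbols carry their evaluated value rather than a parse tree.

```rust
// Toy shift-reduce parser for:  Expr ::= Expr '+' Term | Term
//                               Term ::= Digit | '(' Expr ')'
#[derive(Clone, Copy, Debug, PartialEq)]
enum Sym {
    Digit(u32),
    Plus,
    LParen,
    RParen,
    Expr(u32), // nonterminal, carrying its evaluated value
    Term(u32), // nonterminal, carrying its evaluated value
}

fn parse(tokens: &[Sym]) -> Option<u32> {
    use Sym::*;
    // Input stack, reversed so that `pop` yields the next token.
    let mut input: Vec<Sym> = tokens.iter().rev().copied().collect();
    let mut stack: Vec<Sym> = Vec::new();
    loop {
        let n = stack.len();
        // Copy out (up to) the top three symbols of the processing stack.
        let top3 = (
            if n >= 3 { Some(stack[n - 3]) } else { None },
            if n >= 2 { Some(stack[n - 2]) } else { None },
            stack.last().copied(),
        );
        match top3 {
            // reduce Digit to Term
            (_, _, Some(Digit(d))) => {
                stack.pop();
                stack.push(Term(d));
            }
            // reduce Expr '+' Term to Expr
            (Some(Expr(a)), Some(Plus), Some(Term(b))) => {
                stack.truncate(n - 3);
                stack.push(Expr(a + b));
            }
            // reduce '(' Expr ')' to Term
            (Some(LParen), Some(Expr(v)), Some(RParen)) => {
                stack.truncate(n - 3);
                stack.push(Term(v));
            }
            // reduce Term to Expr (when no '+' reduction applies)
            (_, _, Some(Term(v))) => {
                stack.pop();
                stack.push(Expr(v));
            }
            // otherwise: shift, or finish when the input runs out
            _ => match input.pop() {
                Some(tok) => stack.push(tok),
                None => {
                    return match stack.as_slice() {
                        [Expr(v)] => Some(*v),
                        _ => None, // ill-formed input
                    };
                }
            },
        }
    }
}

fn main() {
    use Sym::*;
    // 2 + ( 3 + 4 )
    let toks = [Digit(2), Plus, LParen, Digit(3), Plus, Digit(4), RParen];
    assert_eq!(parse(&toks), Some(9));
    println!("parsed: {:?}", parse(&toks));
}
```

Running it on `2 + ( 3 + 4 )` performs the same kinds of shifts and
reductions as the trace above, and evaluates the expression to `9`.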

Now, I've completely elided _how_ we actually build the state machine
that lets us do this. The process is straightforward and is discussed
in great detail elsewhere. There is, however, a problem with
shift-reduce parsing.

# Unlimited Lookahead

Above, our grammar was simple: we could determine what the next rule
to apply was based entirely on the top token of the input stack. But
what if that isn't true? We can imagine grammars in which the
meaning of what you're doing isn't clear until much later in the
input string. Imagine that you're designing a Go-like language with
tuples, and you use `:=` as shorthand for declaring variables. Our
code might look like this:

~~~
(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();
~~~

You design it so that any expression is _also_ a valid statement,
so even though it's a little silly, you could write

~~~
(this, that);
~~~

as a bare statement. Well, now we have a problem. A parser for
this language is parsing something and gets this far into the
input string:

~~~
'(' 'a' [ ... ]
     ^
~~~

Is this an expression, or a declaration? Well, that depends on
the context. If this is the beginning of

~~~
(a, b, c) := some_expr();
~~~

then we're parsing the left-hand side of a declaration, and `a`
should be an identifier. But if it's the beginning of

~~~
(a, 2+2, foo());
~~~

then it's the beginning of an expression! We need to look
further ahead to find out which. But in this case, we have
_no idea_ how much further to look ahead: it might be
arbitrarily many tokens in the future.

If we want to continue using shift/reduce parsing, we have
to get around this somehow. For example, Rust solves the
problem mentioned here by using the keyword `let` to
introduce declarations, which means anything after the
`let` keyword is going to be a declaration, but otherwise
it'll be an expression. But what if we wanted to keep
our grammar the way it was?

# Linear Time

Well, we'd lose some efficiency. The shift/reduce algorithms
are guaranteed to walk along the input string directly, doing
a limited number of steps per token they observe: they will
do either one shift, or several reductions, and the number
of reductions done is fixed by the grammar. That's a nice
property to have!

But most of the algorithms that handle unbounded lookahead
do so with backtracking, which means you might need to
move back in the input string and then move forward again.
Parser combinators work like this: they take a path and
keep at it, and if it turns out to be wrong, they
go back and try another one. Depending on your circumstances,
that might be okay, or it might be terrible.

Packrat parsers keep the linear-time guarantee, but do so
by keeping around a _lot_ of extra data: your parser is
no longer just two stacks plus a state, it is now a
two-dimensional table with entries for every token in
your input string and every rule in your grammar. This
is technically linear time, but the constant factors are
large enough to make it slow on reasonably-sized inputs
and intractably slow, or simply impossible, to run on
large inputs.

It looks like we have a choice: we can either simplify
our grammar and keep our efficient, linear-time parsing,
or we can keep our grammar and settle for slower or more
memory-hungry parsing algorithms.
Is there any way we can have our cake and eat it, too?

# Shift-Resolve Parsing

Let's start with the same basic idea, but add an extra wrinkle: we
no longer just move from the input to the processing stack; we also
sometimes move items from the processing stack back onto the input
stack. Before, we had two operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, using a grammar
  rule, and put the resulting item on the processing stack.

We now have three operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, using a grammar
  rule, and put the resulting item on the _input stack_.
- _pushback_ several items from the processing stack to the input stack.

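As a rough sketch of the mechanics (my own toy encoding of the three
operations listed above, not the paper's notation), here is what the
stack moves look like on plain `Vec` stacks; the key difference is that
_reduce_ and _pushback_ feed symbols back into the input stack, where
they can be re-examined later:

```rust
// shift: move one token from the input stack to the processing stack.
fn shift(input: &mut Vec<&'static str>, work: &mut Vec<&'static str>) {
    if let Some(tok) = input.pop() {
        work.push(tok);
    }
}

// reduce: pop `rhs_len` symbols off the processing stack and push the
// resulting nonterminal onto the *input* stack.
fn reduce(input: &mut Vec<&'static str>, work: &mut Vec<&'static str>,
          rhs_len: usize, lhs: &'static str) {
    work.truncate(work.len() - rhs_len);
    input.push(lhs);
}

// pushback: return `n` symbols from the processing stack to the input stack.
fn pushback(input: &mut Vec<&'static str>, work: &mut Vec<&'static str>, n: usize) {
    for _ in 0..n {
        if let Some(sym) = work.pop() {
            input.push(sym);
        }
    }
}

fn main() {
    // Input stack with its top at the end of the Vec,
    // i.e. the next tokens are: '(' '3' ')'
    let mut input = vec![")", "3", "("];
    let mut work = Vec::new();
    shift(&mut input, &mut work);              // '(' moves over
    shift(&mut input, &mut work);              // '3' moves over
    reduce(&mut input, &mut work, 1, "Term");  // Digit -> Term, result goes to input
    println!("input: {:?}, processing: {:?}", input, work);
}
```

Note how, after the reduce, `Term` sits on top of the input stack and
will be seen again by a later shift.
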
I was an early evangelist of the [Rust language] at my office, having
followed its development for a few years, but I still haven't written
any large programs in it. So I don't yet have a really strong opinion
on the language in practice.

However, [a coworker] has started writing a program in Rust, and it
has given me the opportunity to better understand the language so I
can answer his questions. Whenever something would go wrong, he would
send me the code, or a representative snippet, and demand that I
explain why it wasn't working. Some of these questions were actually
quite difficult, and in at least one case, I was briefly convinced
that I had found a compiler bug.

Because these are some tricky interactions with the language, I wanted
to write this up as a second-hand experience report, describing the
problems that came up and explaining why they were problems.[^1]

[^1]: Some of these I suspect will be alleviated by better error
messages, but are probably niche enough that improving these errors
hasn't been a high priority.

# Problem One: Determining Temporary Lifetimes

Rust has a simple rule for determining the lifetimes of temporary values.
A temporary value is any value which is not directly bound to a name, but
is created somewhere in your program: for example, the return value of
a function call that is not directly assigned, or a struct
created for the express purpose of taking a reference to it.

Rust's rule is that, generally, temporaries live only for the statement
in which they are created. For illustration's sake, let's create an
empty struct with a noisy constructor and destructor, so we can see when
values are being created or destroyed:

~~~
struct Thing;

impl Thing {
    fn new() -> Thing {
        println!("Created Thing");
        Thing
    }
}

impl Drop for Thing {
    fn drop(&mut self) {
        println!("Destroyed Thing");
    }
}
~~~

If I create something and don't bind it to a name, then it'll get destroyed
before the next line executes:

~~~
fn main() {
    Thing::new();
    println!("fin.");
}
/* This prints:
 > Created Thing
 > Destroyed Thing
 > fin.
*/
~~~

The exception to this rule happens if I bind _a reference to the thing_.
The thing itself is still a temporary: even though we can access
it, nothing in scope has ownership over it, so we can't pass
ownership elsewhere or force it to drop. However, if we
have a reference to it (or some part of it), then it will live as long
as the reference does. For example, because of the reference here, this
temporary will live longer than before:

~~~
fn main() {
    let r = &Thing::new();
    println!("fin.");
}
/* This prints:
 > Created Thing
 > fin.
 > Destroyed Thing
*/
~~~

This heuristic _only fires if you're directly binding the temporary to
a reference in that expression_. This came up for my coworker because
he had some initialization logic that he thought would be better served
by pushing it into a function, so he wrote the equivalent of

~~~
fn mk_ref(thing: &Thing) -> &Thing {
    thing
}

fn main() {
    let r = mk_ref(&Thing::new());
    println!("fin.");
}
~~~

This refactor means that the temporary returned by `Thing::new()` is no
longer directly being bound to a reference, and therefore the rule
no longer applies: the result of `Thing::new()` will die before the
next line. This is a problem, because `r` continues to exist after that
line, which means this program is rejected by the Rust compiler:

~~~
temp.rs:21:21: 21:33 error: borrowed value does not live long enough
temp.rs:21     let r = mk_ref(&Thing::new());
                               ^~~~~~~~~~~~
temp.rs:21:35: 23:2 note: reference must be valid for the block suffix following statement 0 at 21:34...
temp.rs:21     let r = mk_ref(&Thing::new());
temp.rs:22     println!("fin.");
temp.rs:23 }
temp.rs:21:5: 21:35 note: ...but borrowed value is only valid for the statement at 21:4
temp.rs:21     let r = mk_ref(&Thing::new());
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
temp.rs:21:5: 21:35 help: consider using a `let` binding to increase its lifetime
temp.rs:21     let r = mk_ref(&Thing::new());
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error: aborting due to previous error
~~~
In this case, the error is clear enough, but my coworker was
trying to encapsulate much more elaborate initialization logic
that abstracted away gritty details, and he was confused that what
should have been an equivalent refactor no longer worked. It didn't
help that he was initializing something with closures, leading him
to believe that it was the closures that were at fault.
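
For completeness, the `help` line in that error message points at the
workaround: bind the temporary to a name with its own `let`, so that
there is an owner that lives for the whole block. A sketch of the fixed
version of the small example, using the same `Thing` struct as above:

```rust
struct Thing;

impl Thing {
    fn new() -> Thing {
        println!("Created Thing");
        Thing
    }
}

impl Drop for Thing {
    fn drop(&mut self) {
        println!("Destroyed Thing");
    }
}

fn mk_ref(thing: &Thing) -> &Thing {
    thing
}

fn main() {
    let t = Thing::new(); // named binding: `t` owns the Thing until the end of main
    let r = mk_ref(&t);   // borrowing from `t` is fine; no temporary involved
    println!("fin.");
    let _ = r;
}
/* This prints:
 > Created Thing
 > fin.
 > Destroyed Thing
*/
```

With the owner named, the borrow checker can see that the reference
never outlives the value it points at.
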

# Problem Two: Lifetimes of Trait Objects

A different lifetime problem came up elsewhere: