Shift-Resolve Parsing, as described by José Fortes Gálvez, Sylvain
Schmitz, and Jacques Farré, promises linear-time parsing with
unbounded lookahead. Unfortunately for many, the paper is difficult
and abstruse, filled with terrifying charts and obscure notation.

Well, I read it so you don't have to. It's not actually all that
bad. I'm gonna start with a basic overview of how shift-reduce
parsers work, and then go into how theirs differs. If you're
already comfortable with shift/reduce parsing, feel free to skip
ahead.

# Shift-Reduce Parsing

A _shift-reduce parser_ operates by maintaining two stacks and
performing a series of simple actions on those stacks. For
this section, I'll talk about the simple grammar of addition of
numbers with parenthesization, i.e. something like

```
Expr ::= Expr '+' Term | Term
Term ::= Digit | '(' Expr ')'
```

This is often given in a slightly different format:

```
Expr -> Expr '+' Term
Expr -> Term
Term -> Digit
Term -> '(' Expr ')'
```

Part of the reason here is that we aren't producing, we're
_parsing_. It's very easy to look at the above format and
mentally reverse it: that is, instead of looking at our
grammar as, "An `Expr` is either a `Term` or an `Expr`
followed by a plus sign followed by a `Term`," we can read
our grammar as, "Once we have parsed an `Expr` followed by a
plus sign followed by a `Term`, we have parsed an `Expr`."

When we run a shift-reduce parser for this grammar, we start
with all the input tokens on a stack, and an empty stack for
processing those:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      |
----------------+----------------------+------------------------------
```

Depending on what we see on the top of the stack and the
current state of the parser, we'll either _shift_ or
_reduce_. The first thing we do is _shift_, in which case we
pop a token from the input stack and push it onto the
processing stack:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    |
----------------+----------------------+------------------------------
```

Once the processing stack is in the right state, we then
perform a _reduce_ step, which works like the grammar rules
above run in reverse. In the above example, we're looking at
a `+` at the top of the input stack; the rule
`Expr -> Expr '+' Term` expects an `Expr` to the left of that
`+`, so we can _reduce_ based on our grammar rules, turning
the digit `2` into a `Term` and then into an `Expr`.

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    | reduce Digit to Term
  + ( 3 + 4 )   | Term                 | reduce Term to Expr
----------------+----------------------+------------------------------
```

We can then keep shifting and reducing until we have parsed the
full tree:

```
input           | processing           | action
----------------+----------------------+------------------------------
2 + ( 3 + 4 )   |                      | shift '2'
  + ( 3 + 4 )   | 2                    | reduce Digit to Term
  + ( 3 + 4 )   | Term                 | reduce Term to Expr
  + ( 3 + 4 )   | Expr                 | shift '+'
    ( 3 + 4 )   | + Expr               | shift '('
      3 + 4 )   | ( + Expr             | shift '3'
        + 4 )   | 3 ( + Expr           | reduce Digit to Term
        + 4 )   | Term ( + Expr        | reduce Term to Expr
        + 4 )   | Expr ( + Expr        | shift '+'
          4 )   | + Expr ( + Expr      | shift '4'
            )   | 4 + Expr ( + Expr    | reduce Digit to Term
            )   | Term + Expr ( + Expr | reduce Expr '+' Term to Expr
            )   | Expr ( + Expr        | shift ')'
                | ) Expr ( + Expr      | reduce '(' Expr ')' to Term
                | Term + Expr          | reduce Expr '+' Term to Expr
                | Expr                 | done
----------------+----------------------+------------------------------
```

Now, I've completely elided _how_ we actually build the state machine
that lets us do this. The process is straightforward and is discussed
in great detail elsewhere. There is, however, a problem with
shift-reduce parsing.

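Before moving on, the loop above is simple enough to sketch in Python.
This is a toy: a real shift-reduce parser drives its shift/reduce
decisions from a state table generated from the grammar, while this
sketch hard-codes the decisions for our little grammar just to show
the mechanics.

```python
def parse(tokens):
    """Toy shift-reduce parser for the Expr/Term grammar above.

    The reduction logic is hard-coded for this one grammar; a real
    parser generator would derive it as a state table.
    """
    input_stack = list(reversed(tokens))  # top of stack = end of list
    stack = []                            # the processing stack

    def try_reduce():
        # Order matters: check the longer rules first, so that a
        # `Term` completing `Expr '+' Term` isn't reduced to `Expr`.
        if stack[-3:] == ["Expr", "+", "Term"]:
            del stack[-3:]                # Expr -> Expr '+' Term
            stack.append("Expr")
        elif stack[-3:] == ["(", "Expr", ")"]:
            del stack[-3:]                # Term -> '(' Expr ')'
            stack.append("Term")
        elif stack and stack[-1].isdigit():
            stack[-1] = "Term"            # Term -> Digit
        elif stack[-1:] == ["Term"]:
            stack[-1] = "Expr"            # Expr -> Term
        else:
            return False
        return True

    while True:
        if try_reduce():                  # reduce as long as we can...
            continue
        if input_stack:                   # ...otherwise shift...
            stack.append(input_stack.pop())
            continue
        return stack                      # ...stop when both fail

print(parse(list("2+(3+4)")))  # ['Expr']
```

The order of checks in `try_reduce` is what stands in for the state
machine here: it ensures `Term` only reduces to `Expr` when it isn't
about to complete `Expr '+' Term`.
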
# Unlimited Lookahead

Above, our grammar was simple: we could determine what the next rule
to apply was based entirely on the top token of the input stack. But
what if that isn't true? We can imagine grammars in which the
meaning of what you're doing isn't clear until much later in the
input string. Imagine that you're designing a Go-like language with
tuples, and you use `:=` as shorthand for declaring variables. Our
code might look like this.

~~~
(a, b) := (1, 2);
(c, d) := foo(a + b);
bar();
~~~

You design it so that any expression is _also_ a valid statement,
so even though it's a little silly, you could write

~~~
(this, that);
~~~

as a bare statement. Well, now we have a problem. A parser for
this language is parsing something and gets this far into the
input string:

~~~
'(' 'a' [ ... ]
    ^
~~~

Is this an expression, or a declaration? Well, that depends on
the context. If this is the beginning of

~~~
(a, b, c) := some_expr();
~~~

then we're parsing the left-hand side of a declaration, and `a`
should be an identifier. But if it's the beginning of

~~~
(a, 2+2, foo());
~~~

then it's the beginning of an expression! We need to look
further ahead to find out which. But in this case, we have
_no idea_ how much further to look ahead: it might be
arbitrarily many tokens in the future.

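To make the conflict concrete, a fragment of this hypothetical
language's grammar (the rule names here are invented for
illustration) might look something like:

```
Stmt    ::= Expr ';' | Pattern ':=' Expr ';'
Pattern ::= '(' Idents ')'
Idents  ::= Ident | Idents ',' Ident
Expr    ::= '(' Exprs ')' | Ident | Digit | Expr '+' Expr
Exprs   ::= Expr | Exprs ',' Expr
```

Both `Pattern` and a parenthesized `Expr` begin with `'('` followed
by an identifier, so the parser can't know which rule it's in until
it either reaches a `:=` or sees something, like `2+2`, that only an
expression allows.
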
If we want to continue using shift/reduce parsing, we have
to get around this somehow. For example, Rust solves the
problem mentioned here by using the keyword `let` to
introduce declarations, which means anything after the
`let` keyword is going to be a declaration, but otherwise
it'll be an expression. But what if we wanted to keep
our grammar the way it was?

# Linear Time

Well, we'd lose some efficiency. The shift/reduce algorithms
are guaranteed to walk along the input string directly, doing
a bounded number of steps per token they observe: they will
do either one shift, or several reductions, and the number
of reductions per token is bounded by the grammar. That's a
nice property to have!

But most of the algorithms that handle unbounded lookahead
do so with backtracking, which means you might need to
move back in the input string and then move forward again.
Parser combinators work like this: they pick a path and
stick with it, and if it turns out to be wrong, they go
back and try another one. Depending on your circumstances,
that might be okay, or it might be terrible.

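For a concrete picture of that backtracking, here is a minimal
sketch of an "alternative" combinator in Python (the names are mine,
not any particular combinator library): each parser takes the token
list and a position, and failure means the caller simply retries
from the old position.

```python
def tok(expected):
    # A parser that matches exactly one expected token.
    # Parsers return (value, new_pos) on success, or None on failure.
    def parser(tokens, pos):
        if pos < len(tokens) and tokens[pos] == expected:
            return (expected, pos + 1)
        return None
    return parser

def alt(p, q):
    # Try p; if it fails, backtrack to the original position
    # and try q against the same input.
    def parser(tokens, pos):
        result = p(tokens, pos)
        if result is not None:
            return result
        return q(tokens, pos)   # re-reads from `pos`: backtracking
    return parser

p = alt(tok('a'), tok('b'))
print(p(['b'], 0))  # ('b', 1): tok('a') failed, so we retried from 0
```

When alternatives nest, a failure deep inside one branch can force
re-reading a long stretch of input, which is where the worst-case
cost comes from.
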
Packrat parsers keep the linear-time guarantee, but do so
by keeping around a _lot_ of extra data: your parser is
no longer just two stacks plus a state, it is now a
two-dimensional table with entries for every token in
your input string and every rule in your grammar. Even
though this is technically linear time, the constant
factors are large enough to make it slow on
reasonably-sized inputs and intractably slow, or simply
impossible to run, on large inputs.

It looks like we have a choice: we can either simplify
our grammar and keep our efficient, linear-time parsing,
or we can keep our grammar and get slower or more
memory-hungry parsing algorithms. Is there any way we
can have our cake and eat it, too?

# Shift-Resolve Parsing

Let's start with the same basic idea, but add an extra wrinkle: we
no longer just move tokens from the input stack to the processing
stack; we also sometimes move items from the processing stack back
onto the input stack. Before, we had two operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, applying a grammar
  rule and putting the resulting item on the processing stack.

We now have three operations:

- _shift_ an item from the input stack to the processing stack.
- _reduce_ several items from the processing stack, applying a grammar
  rule and putting the resulting item on the _input stack_.
- _pushback_ several items from the processing stack to the input stack.
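As plain stack manipulation, the three operations look like the
sketch below (the function names and signatures are mine; the real
content of the algorithm, the table built from the grammar that
decides which operation to apply when, is elided entirely):

```python
# The three shift-resolve stack operations as list manipulation.
# Both stacks keep their top at the end of the list.

def shift(input_stack, processing):
    # Move the top of the input stack onto the processing stack.
    processing.append(input_stack.pop())

def reduce(input_stack, processing, rule_head, body_len):
    # Pop the rule body off the processing stack, but push the
    # resulting nonterminal onto the *input* stack, so it will be
    # re-examined as if it were input.
    del processing[-body_len:]
    input_stack.append(rule_head)

def pushback(input_stack, processing, count):
    # Return `count` items from the processing stack to the
    # input stack.
    for _ in range(count):
        input_stack.append(processing.pop())
```

Note how _reduce_ differs from the shift-reduce version above: the
resulting item goes back onto the input stack rather than staying on
the processing stack.
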