Commit a0cf672d3a65c5a4f54a33b3032e3f76d1726848 - s-cargot

Updated README to reflect library changes Getty Ritter 10 years ago

1 changed file(s) with 265 addition(s) and 69 deletion(s).

README.md

13	13	s-expressions or to extend it in various ways to accommodate new
14	14	flavors.
15	15
	16	## What Are S-Expressions?
	17
	18	S-expressions were originally the data representation format in
	19	Lisp implementations, but have found broad uses outside of that as
	20	a data representation and storage format. S-expressions are often
	21	understood as a representation for binary trees with optional values
	22	in the leaf nodes: an empty leaf is represented with empty
	23	parens `()`, a non-empty leaf is represented as the scalar value
	24	it contains (often tokens like `x` or other programming language
	25	literals), and an internal node is represented as `(x . y)` where
	26	`x` and `y` are standing in for other s-expressions. In Lisp
	27	parlance, an internal node is called a _cons cell_, and the first
	28	and second elements inside it are called the _car_ and the _cdr_,
	29	for historical reasons. Non-empty leaf nodes are referred to
	30	in the s-cargot library as _atoms_.
	31
	32	Often, s-expressions are used to represent lists, in which case
	33	the list is treated as a right-branching tree with an empty leaf as
	34	the far right child of the tree. S-expression languages have a
	35	shorthand way of representing these lists: instead of writing successively
	36	nested pairs, as in `(1 . (2 . (3 . ())))`, they allow the sugar
	37	`(1 2 3)`. This is the most common way of writing s-expressions,
	38	even in languages that allow raw cons cells (or "dotted pairs") to
	39	be written.
	40
	41	The s-cargot library refers to expressions where every right-branching
	42	sequence ends in an empty leaf as _well-formed s-expressions_. Note that
	43	any s-expression which can be written without using a dotted pair is
	44	necessarily well-formed.
	45
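As a rough illustration of the tree view described above, here is a minimal, self-contained sketch. The constructor names `SCons`, `SAtom`, and `SNil` are borrowed from the examples in this README, but the type and the helpers `fromList` and `isWellFormed` are hypothetical stand-ins written for this note, not the library's own definitions.

```haskell
-- A hypothetical stand-in for a cons-cell s-expression type; this is an
-- assumption for illustration, not the library's definition.
data SExpr atom
  = SCons (SExpr atom) (SExpr atom)  -- internal node: (car . cdr)
  | SAtom atom                       -- non-empty leaf
  | SNil                             -- empty leaf: ()
  deriving (Eq, Show)

-- Build the right-branching tree that the list sugar (a b c) stands for.
fromList :: [SExpr atom] -> SExpr atom
fromList = foldr SCons SNil

-- An expression is well-formed when every right-branching sequence ends
-- in an empty leaf, i.e. no atom ever appears in a cdr position.
isWellFormed :: SExpr atom -> Bool
isWellFormed SNil        = True
isWellFormed (SAtom _)   = True
isWellFormed (SCons h t) = isWellFormed h && properTail t
  where
    properTail SNil          = True
    properTail (SAtom _)     = False
    properTail (SCons h' t') = isWellFormed h' && properTail t'
```

Under these assumptions, `fromList [SAtom 1, SAtom 2, SAtom 3]` is the tree for `(1 2 3)` and is well-formed, while `SCons (SAtom 1) (SAtom 2)`, the dotted pair `(1 . 2)`, is not.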
	46	Unfortunately, while in common use, s-expressions do not have a single
	47	formal standard. They are often defined in an ad-hoc way, which means
	48	that s-expressions used in different contexts will, despite sharing a common
	49	parentheses-delimited structure, differ in various respects. Additionally,
	50	because s-expressions are used as the concrete syntax for languages of
	51	the Lisp family, they often have conveniences (such as comment syntaxes)
	52	and other bits of syntactic sugar (such as _reader macros_, which are
	53	described more fully later) that make parsing them much more complicated.
	54	Even ignoring those features, the _atoms_ recognized by a given
	55	s-expression variation can differ widely.
	56
	57	The s-cargot library was designed to accommodate several different kinds
	58	of s-expression formats, so that an s-expression format can be easily
	59	expressed as a combination of existing features. It includes a few basic
	60	variations on s-expression languages as well as the tools for parsing
	61	and emitting more elaborate s-expression variations without having to
	62	reimplement the basic plumbing yourself.
	63
	64	## Using the Library
	65
16	66	The central way of interacting with the S-Cargot library is by creating
17		and modifying a _spec_, which is a value that represents a given
18		family of S-expressions. A _spec_, which is of type `SExprSpec`,
19		contains the information necessary to implement reader macros, arbitrary
20		kinds of comments, and various processing steps. A `SExprSpec` has two
21		type parameters:
22
23		~~~~
24		+------ the type that represents a SExpr atom
25		|
26		| +- the Haskell representation of the SExpr value
27		| |
28		someSpec :: SExprSpec atom carrier
	67	and modifying datatypes which represent specifications for parsing and
	68	printing s-expressions. Each of those types has two type parameters, which
	69	are often called `atom` and `carrier`:
	70
	71	~~~~
	72	                        +------ the type that represents an atom or value
	73	                        |
	74	                        |    +- the Haskell representation of the SExpr itself
	75	                        |    |
	76	parser  :: SExprParser  atom carrier
	77	printer :: SExprPrinter atom carrier
29	78	~~~~
30	79
31	80	Various functions will be provided that modify the carrier type (i.e. the
32	81	output type of parsing or input type of serialization) or the language
33		recognized by the parsing. ~~Examples will be shown below.~~
	82	recognized by the parsing.
34	83
35	84	## Representing S-expressions
36	85
37	86	There are three built-in representations of S-expression lists: two of them
38	87	are isomorphic, as one or the other might be better for processing
39	88	S-expression data in a particular circumstance, and the third represents
40		only a subset of possible S-expressions.
	89	only the well-formed subset of possible S-expressions.
41	90
42	91	~~~~.haskell
43	92	-- cons-based representation

71	120	functions.
72	121
73	122	~~~~.haskell
74		> ~~decode spec~~ "(a b)"
	123	>>> decode basicParser "(a b)"
75	124	Right [SCons (SAtom "a") (SCons (SAtom "b") SNil)]
76		> ~~decode (asRich spec~~) "(a b)"
	125	>>> decode (asRich basicParser) "(a b)"
77	126	Right [RSList [RSAtom "a",RSAtom "b"]]
78		> ~~decode (asWellFormed spec~~) "(a b)"
	127	>>> decode (asWellFormed basicParser) "(a b)"
79	128	Right [WFSList [WFSAtom "a",WFSAtom "b"]]
80		> ~~decode spec~~ "(a . b)"
	129	>>> decode basicParser "(a . b)"
81	130	Right [SCons (SAtom "a") (SAtom "b")]
82		> ~~decode (asRich spec~~) "(a . b)"
	131	>>> decode (asRich basicParser) "(a . b)"
83	132	Right [RSDotted [RSAtom "a"] "b"]
84		> ~~decode (asWellFormed spec~~) "(a . b)"
	133	>>> decode (asWellFormed basicParser) "(a . b)"
85	134	Left "Found atom in cdr position"
86	135	~~~~
87	136

93	142	you plan on working with:
94	143
95	144	~~~~.haskell
96		> A 2 ::: A 3 ::: A 4 ::: Nil
97		SCons (SCons (SCons (SAtom 2) (SAtom 3)) (SAtom 4)) SNil
98		~~~~
99
100		~~~~.haskell
101		> L [A 1,A 2,A 3]
	145	>>> import Data.SCargot.Repr.Basic
	146	>>> A 2 ::: A 3 ::: A 4 ::: Nil
	147	SCons (SAtom 2) (SCons (SAtom 3) (SCons (SAtom 4) SNil))
	148	~~~~
	149
	150	~~~~.haskell
	151	>>> import Data.SCargot.Repr.WellFormed
	152	>>> L [A 1,A 2,A 3]
102	153	WFSList [WFSAtom 1,WFSAtom 2,WFSAtom 3]
103		> let sexprSum (L xs) = sum (map sexprSum xs); sexprSum (A n) = n
104		> :t sexprSum
	154	>>> let sexprSum (L xs) = sum (map sexprSum xs); sexprSum (A n) = n
	155	>>> :t sexprSum
105	156	sexprSum :: Num a => WellFormedSExpr a -> a
106		> sexprSum (L [A 2, L [A 3, A 4]])
	157	>>> sexprSum (L [A 2, L [A 3, A 4]])
107	158	9
108	159	~~~~
109	160
	161	If you are using GHC 7.10, several of these will be powerful
	162	bidirectional pattern synonyms that allow both constructing and
	163	pattern-matching on s-expressions in non-trivial ways:
	164
	165	~~~~.haskell
	166	>>> import Data.SCargot.Repr.Basic
	167	>>> L [ A 2, A 3, A 4 ]
	168	SCons (SAtom 2) (SCons (SAtom 3) (SCons (SAtom 4) SNil))
	169	~~~~
	170
110	171	## Atom Types
111	172
112	173	Any type can serve as an underlying atom type provided that it has
113		a~~n Parsec parser and~~ a serializer (i.e. a way of turning it
	174	a Parsec parser or a serializer (i.e. a way of turning it
114	175	into `Text`.) For these examples, I'm going to use a very simple
115	176	serializer that is roughly like the one found in `Data.SCargot.Basic`,
116	177	which parses symbolic tokens of letters, numbers, and some

118	179	is just the identity function:
119	180
120	181	~~~~.haskell
121		spec :: SExprSpec Text (SExpr Text)
122		spec = mkSpec (pack <$> many1 (alphaNum <|> oneOf "+-*/!?")) id
	182	parser :: SExprParser Text (SExpr Text)
	183	parser = mkParser (pack <$> many1 (alphaNum <|> oneOf "+-*/!?"))
	184
	185	printer :: SExprPrinter Text (SExpr Text)
	186	printer = flatPrint id
123	187	~~~~
124	188
125	189	A more elaborate atom type would distinguish between different

139	203	sAtom (Ident t) = t
140	204	sAtom (Num n) = pack (show n)
141	205
142		mySpec :: SExprSpec Atom (SExpr Atom)
143		mySpec = mkSpec pAtom sAtom
	206	myParser :: SExprParser Atom (SExpr Atom)
	207	myParser = mkParser pAtom
	208
	209	myPrinter :: SExprPrinter Atom (SExpr Atom)
	210	myPrinter = flatPrint sAtom
144	211	~~~~
145	212
146	213	We can then use this newly created atom type within an S-expression
147	214	for both parsing and serialization:
148	215
149	216	~~~~.haskell
150		> ~~decode mySpec~~ "(foo 1)"
	217	>>> decode myParser "(foo 1)"
151	218	Right [SCons (SAtom (Ident "foo")) (SCons (SAtom (Num 1)) SNil)]
152		> encode mySpec [SCons (SAtom (Num 0)) SNil]
153		"(0)"
	219	>>> encode myPrinter [L [A (Num 0), A (Ident "bar")]]
	220	"(0 bar)"
154	221	~~~~
155	222
156	223	## Carrier Types

188	255	the `SExprSpec`:
189	256
190	257	~~~~.haskell
191		> ~~decode (convertSpec toExpr fromExpr (asRich spec)) "(+ 1 2)"~~
	258	>>> let parser' = setCarrier toExpr (asRich myParser)
	259	>>> :t parser'
	260	SExprParser Atom Expr
	261	>>> decode parser' "(+ 1 2)"
192	262	Right [Add (Num 1) (Num 2)]
193		> ~~decode (convertSpec toExpr fromExpr (asRich spec))~~ "(0 1 2)"
	263	>>> decode parser' "(0 1 2)"
194	264	Left "Unrecognized s-expr"
195	265	~~~~
196	266
197	267	## Comments
198	268
199		By default, an S-expression ~~spec~~ does not include a comment syntax, but
	269	By default, an S-expression parser does not include a comment syntax, but
200	270	the provided `withLispComments` function will cause it to understand
201	271	traditional Lisp line-oriented comments that begin with a semicolon:
202	272
203	273	~~~~.haskell
204		> ~~decode spec~~ "(this ; has a comment\n inside)\n"
	274	>>> decode basicParser "(this ; has a comment\n inside)\n"
205	275	Left "(line 1, column 7):\nunexpected \";\"\nexpecting space or atom"
206		> ~~decode (withLispComments spec~~) "(this ; has a comment\n inside)\n"
	276	>>> decode (withLispComments basicParser) "(this ; has a comment\n inside)\n"
207	277	Right [SCons (SAtom "this") (SCons (SAtom "inside") SNil)]
208	278	~~~~
209	279

218	288	For example, the following adds C++-style comments to an S-expression format:
219	289
220	290	~~~~.haskell
221		> let cppComment = string "//" >> manyTill newline >> return ()
222		> decode (setComment cppComment spec) "(a //comment\n b)\n"
	291	>>> let cppComment = string "//" >> manyTill anyChar newline >> return ()
	292	>>> decode (setComment cppComment basicParser) "(a //comment\n b)\n"
223	293	Right [SCons (SAtom "a") (SCons (SAtom "b") SNil)]
224	294	~~~~
225	295
226	296	## Reader Macros
227	297
228		A _reader macro_ is a Lisp macro ~~which is invoked during read time. This~~
	298	A _reader macro_ is a Lisp macro---a function that operates on syntactic
	299	structures---which is invoked during the scanning phase of a Lisp parser. This
229	300	allows the _lexical_ syntax of a Lisp to be modified. The most commonly
230		seen reader macro is the quote, which allows the syntax `'expr` to stand
231		in for the s-expression `(quote expr)`. The S-Cargot library accomodates
232		this by keeping a map of characters to Parsec parsers that can be used as
	301	seen reader macro is the quote, which allows the syntax `'expr` to stand as sugar
	302	for the s-expression `(quote expr)`. The S-Cargot library accommodates
	303	this by keeping a map from characters to Haskell functions that can be used as
233	304	readers. There is a special case for the aforementioned quote, but that
234	305	could easily be written by hand as
235	306
236	307	~~~~.haskell
237		> let quoteExpr c = SCons (SAtom "quote") (SCons c SNil)
238		> let withQuote = addReader '\'' (\ p -> fmap quoteExpr p)
239		> ~~decode (withQuote mySpec) "'foo"~~
	308	>>> let quoteExpr expr = SCons (SAtom "quote") (SCons expr SNil)
	309	>>> let addQuoteReader = addReader '\'' (\ parse -> fmap quoteExpr parse)
	310	>>> decode (addQuoteReader basicParser) "'foo"
240	311	Right [SCons (SAtom "quote") (SCons (SAtom "foo") SNil)]
241	312	~~~~
242	313
243	314	A reader macro is passed the parser that invoked it, so that it can
244		perform recursive calls, and can return any `SExpr` it would like. It
	315	perform recursive calls into the parser, and can return any `SExpr` it would like. It
245	316	may also take as much or as little of the remaining parse stream as it
246	317	would like; for example, the following reader macro does not bother
247	318	parsing anything else and merely returns a new token:
248	319
249	320	~~~~.haskell
250		> let qmReader = addReader '?' (\ _ -> pure (SAtom "huh"))
251		> decode (qmReader mySpec) "(?1 2)"
	321	>>> let qmReader = addReader '?' (\ _ -> pure (SAtom "huh"))
	322	>>> decode (qmReader basicParser) "(?1 2)"
252	323	Right [SCons (SAtom "huh") (SCons (SAtom "1") (SCons (SAtom "2") SNil))]
253	324	~~~~
254	325
255		Reader macros in S-Cargot can be used to define ~~common~~ bits of Lisp
	326	Reader macros in S-Cargot can be used to define bits of Lisp
256	327	syntax that are not typically considered the purview of S-expression
257	328	parsers. For example, to allow square brackets as a substitute for
258		proper lists, we could define a reader macro that is in~~itializ~~ed by the
	329	proper lists, we could define a reader macro that is indicated by the
259	330	`[` character and repeatedly calls the parser until a `]` character
260	331	is reached:
261	332
262	333	~~~~.haskell
263		> let pVec p = (char ']' *> pure SNil) <|> (SCons <$> p <*> pVec p)
264		> let vec = addReader '[' pVec
265		> ~~decode (asRich (vec mySpec)) "(1 [2 3])"~~
	334	>>> let vec p = (char ']' *> pure SNil) <|> (SCons <$> p <*> vec p)
	335	>>> let withVecReader = addReader '[' vec
	336	>>> decode (asRich (withVecReader basicParser)) "(1 [2 3])"
266	337	Right [RSList [RSAtom "1",RSList [RSAtom "2",RSAtom "3"]]]
267	338	~~~~
268	339
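To make the dispatch mechanism concrete, the following is a small sketch of a character-to-reader table built on `Text.ParserCombinators.ReadP` from `base` rather than on Parsec or on S-Cargot itself. Everything here (the simplified `SExpr` type, the `Reader` alias, `readers`, `sexpr`, and `parseSExpr`) is a hypothetical stand-in chosen for illustration; the library's real types and functions may differ.

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP

-- Simplified, hypothetical s-expression type with String atoms.
data SExpr = SCons SExpr SExpr | SAtom String | SNil
  deriving (Eq, Show)

-- A reader receives the top-level parser so that it can recurse into it.
type Reader = ReadP SExpr -> ReadP SExpr

-- Hypothetical reader table: '\'' wraps the next expression in (quote ...),
-- and '[' keeps calling the parser until a closing ']' is reached.
readers :: [(Char, Reader)]
readers =
  [ ('\'', \p -> fmap (\e -> SCons (SAtom "quote") (SCons e SNil)) p)
  , ('[', vec)
  ]
  where
    vec p = (skipSpaces *> char ']' *> pure SNil) <++ (SCons <$> p <*> vec p)

-- Try a reader macro first, then a parenthesised list, then a bare atom.
sexpr :: ReadP SExpr
sexpr = skipSpaces *> (macro <++ list <++ atom)
  where
    macro = do
      c <- satisfy (`elem` map fst readers)
      maybe pfail ($ sexpr) (lookup c readers)
    list  = char '(' *> items
    items = (skipSpaces *> char ')' *> pure SNil) <++ (SCons <$> sexpr <*> items)
    atom  = SAtom <$> munch1 isAlphaNum

-- Parse exactly one s-expression, allowing trailing whitespace.
parseSExpr :: String -> Maybe SExpr
parseSExpr s = case readP_to_S (sexpr <* skipSpaces <* eof) s of
  [(e, "")] -> Just e
  _         -> Nothing
```

With this sketch, `parseSExpr "'foo"` yields `Just (SCons (SAtom "quote") (SCons (SAtom "foo") SNil))`, and the bracketed form in `"(1 [2 3])"` parses to the same tree as `"(1 (2 3))"`.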
	340	## Pretty-Printing and Indentation
	341
	342	The s-cargot library also includes a simple but often adequate
	343	pretty-printing system for s-expressions. A printer that prints a
	344	single-line s-expression is created with `flatPrint`:
	345
	346	~~~~.haskell
	347	>>> let printer = flatPrint id
	348	>>> :t printer
	349	SExprPrinter Text (SExpr Text)
	350	>>> Text.putStrLn $ encode printer [L [A "foo", A "bar"]]
	351	(foo bar)
	352	~~~~
	353
	354	A printer that tries to pretty-print an s-expression to fit
	355	attractively within an 80-character limit can be created with
	356	`basicPrint`:
	357
	358	~~~~.haskell
	359	>>> let printer = basicPrint id
	360	>>> let sentence = "this stupendously preposterously supercalifragilisticexpialidociously long s-expression"
	361	>>> let longSexpr = L [A word | word <- Text.words sentence ]
	362	>>> Text.putStrLn $ encodeOne printer longSexpr
	363	(this
	364	stupendously
	365	preposterously
	366	supercalifragilisticexpialidociously
	367	long
	368	s-expression)
	369	~~~~
	370
	371	A printer created with `basicPrint` will "swing" things that are too
	372	long onto the subsequent line, indenting it a fixed number of spaces.
	373	We can modify the number of spaces with `setIndentAmount`:
	374
	375	~~~~.haskell
	376	>>> let printer = setIndentAmount 4 (basicPrint id)
	377	>>> Text.putStrLn $ encodeOne printer longSexpr
	378	(this
	379	    stupendously
	380	    preposterously
	381	    supercalifragilisticexpialidociously
	382	    long
	383	    s-expression)
	384	~~~~
	385
	386	We can also modify what counts as the 'maximum width', which for a
	387	`basicPrint` printer is 80 by default:
	388
	389	~~~~.haskell
	390	>>> let printer = setMaxWidth 8 (basicPrint id)
	391	>>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three"])
	392	(one
	393	two
	394	three)
	395	~~~~
	396
	397	Or remove the maximum, which will put the whole s-expression onto one
	398	line, regardless of its length:
	399
	400	~~~~.haskell
	401	>>> let printer = removeMaxWidth (basicPrint id)
	402	>>> Text.putStrLn $ encodeOne printer longSexpr
	403	(this stupendously preposterously supercalifragilisticexpialidociously long s-expression)
	404	~~~~
	405
	406	We can also specify an _indentation strategy_, which decides how to
	407	indent subsequent expressions based on the head of a given
	408	expression. The default is to always "swing" subsequent expressions
	409	to the next line, but we could also specify the `Align` constructor, which
	410	will print the first two expressions on the same line and then any subsequent
	411	expressions horizontally aligned with the second one, like so:
	412
	413	~~~~.haskell
	414	>>> let printer = setIndentStrategy (\ _ -> Align) (setMaxWidth 8 (basicPrint id))
	415	>>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three", A "four"])
	416	(one two
	417	     three
	418	     four)
	419	~~~~
	420
	421	Or we could choose to keep some number of expressions on the same line and afterwards
	422	swing the subsequent ones:
	423
	424	~~~~.haskell
	425	>>> let printer = setIndentStrategy (\ _ -> SwingAfter 1) (setMaxWidth 8 (basicPrint id))
	426	>>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three", A "four"])
	427	(one two
	428	three
	429	four)
	430	~~~~
	431
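The three behaviours described above (always swing, align with the second expression, swing after some number of expressions) can be sketched with a toy layout function. The constructor names `Align` and `SwingAfter` come from the text; the `Swing` constructor, the `Indent` type, and the `layout` function are hypothetical names for this illustration, and the sketch always breaks lines instead of checking a maximum width as a real printer would.

```haskell
import Data.List (intercalate)

-- Hypothetical strategy type mirroring the three behaviours described above.
data Indent = Swing | SwingAfter Int | Align

-- Lay out a list of already-rendered atoms, always breaking lines.
-- The first argument is the fixed indent used when swinging.
layout :: Int -> Indent -> [String] -> String
layout _ _ [] = "()"
layout amount strategy (hd : rest) =
  "(" ++ intercalate "\n" (firstLine : map (pad ++) below) ++ ")"
  where
    -- Which expressions stay on the head's line, and which go below it.
    (sameLine, below) = case strategy of
      Swing        -> ([], rest)
      SwingAfter n -> splitAt n rest
      Align        -> splitAt 1 rest
    firstLine = unwords (hd : sameLine)
    pad = case strategy of
      Align -> replicate (length hd + 2) ' '  -- column of the second expression
      _     -> replicate amount ' '           -- fixed swing indent
```

For example, `layout 2 Align ["one", "two", "three", "four"]` keeps `two` on the first line and starts `three` and `four` in the same column as `two`, whereas `layout 2 (SwingAfter 1)` on the same input indents them by two spaces.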
	432	For lots of situations, we might want to choose a different indentation strategy based
	433	on the first expression within a proper list: for example, Common Lisp source code is often
	434	formatted so that, following a `defun` token, we have the function name and arguments
	435	on the same line, and then the body of the function indented some amount subsequently.
	436	We can express an approximation of that strategy like this:
	437
	438	~~~~.haskell
	439	>>> let strategy (A ident) | "def" `Text.isPrefixOf` ident = SwingAfter 2; strategy _ = Align
	440	>>> let printer = setIndentStrategy strategy (setMaxWidth 20 (basicPrint id))
	441	>>> let fact = L [A "defun", A "fact", L [A "x"], L [A "product", L [A "range", A "1", A "x"]]]
	442	>>> Text.putStrLn $ encodeOne printer fact
	443	(defun fact (x)
	444	(product (range 1 x)))
	445	>>> let app = L [A "apply", L [A "lambda", L [A "y"], L [A "fact", A "y"]], L [A "+", A "2", A "3"]]
	446	(apply (lambda (y) (fact y)
	447	(+ 2 3))
	448	~~~~
	449
269	450	## Putting It All Together
270	451
271	452	Here is a final example which implements a limited arithmetic language
272		with Haskell-style line comments and a special reader to understand hex
	453	with Haskell-style line comments and a special reader macro to understand hex
273	454	literals:
274	455
275	456	~~~~.haskell
	457	-- Our operators are going to represent addition, subtraction, or
	458	-- multiplication
276	459	data Op = Add | Sub | Mul deriving (Eq, Show)
	460
	461	-- The atoms of our language are either one of the aforementioned
	462	-- operators, or positive integers
277	463	data Atom = AOp Op | ANum Int deriving (Eq, Show)
	464
	465	-- Once parsed, our language will consist of the applications of
	466	-- binary operators with literal integers at the leaves
278	467	data Expr = EOp Op Expr Expr | ENum Int deriving (Eq, Show)
279	468
280		-- Conversions ~~for~~ our Expr type
	469	-- Conversions to and from our Expr type
281	470	toExpr :: SExpr Atom -> Either String Expr
282	471	toExpr (A (AOp op) ::: l ::: r ::: Nil) = EOp op <$> toExpr l <*> toExpr r
283	472	toExpr (A (ANum n)) = pure (ENum n)
284		toExpr sexpr = Left ("~~Invalid parse~~: " ++ show sexpr)
	473	toExpr sexpr = Left ("Unable to parse expression: " ++ show sexpr)
285	474
286	475	fromExpr :: Expr -> SExpr Atom
287	476	fromExpr (EOp op l r) = A (AOp op) ::: fromExpr l ::: fromExpr r ::: Nil

300	489	sAtom (AOp Mul) = "*"
301	490	sAtom (ANum n) = T.pack (show n)
302	491
303		-- Our comment syntax
	492	-- Our comment syntax is going to be Haskell-like:
304	493	hsComment :: Parser ()
305	494	hsComment = string "--" >> manyTill anyChar newline >> return ()
306	495
307		-- Our custom reader macro
	496	-- Our custom reader macro: grab the parse stream and read a
	497	-- hexadecimal number from it:
308	498	hexReader :: Reader Atom
309	499	hexReader _ = (A . ANum . rd . T.unpack) <$> takeWhile1 isHexDigit
310	500	  where isHexDigit c = isDigit c || c `elem` "AaBbCcDdEeFf"
311	501	        rd = fst . head . readHex
312	502
313		-- Our final s-expression family
314		myLangSpec :: SExprSpec Atom Expr
315		myLangSpec
	503	-- Our final s-expression parser and printer:
	504	myLangParser :: SExprParser Atom Expr
	505	myLangParser
316	506	= setComment hsComment -- set comment syntax to be Haskell-style
317	507	$ addReader '#' hexReader -- add hex reader
318		$ convertSpec toExpr fromExpr -- convert final repr to Expr
319		$ mkSpec pAtom sAtom -- create spec with Atom type
	508	$ setCarrier toExpr -- convert final repr to Expr
	509	$ mkParser pAtom -- create parser with Atom type
	510
	511	myLangPrinter :: SExprPrinter Atom Expr
	512	myLangPrinter
	513	= setFromCarrier fromExpr
	514	$ setIndentStrategy (const Align)
	515	$ basicPrint sAtom
320	516	~~~~
321	517
322	518	Keep in mind that you often won't need to write all this by hand,