gdritter repos s-cargot / 16694de
Updated README to reflect library changes Getty Ritter 9 years ago
1 changed file(s) with 265 addition(s) and 69 deletion(s). Collapse all Expand all
1313 s-expressions or to extend it in various ways to accomodate new
1414 flavors.
1515
16 ## What Are S-Expressions?
17
18 S-expressions were originally the data representation format in
19 Lisp implementations, but have found broad uses outside of that as
20 a data representation and storage format. S-expressions are often
21 understood as a representation for binary trees with optional values
22 in the leaf nodes: an empty leaf is represented with empty
23 parens `()`, a non-empty leaf is represented as the scalar value
24 it contains (often tokens like `x` or other programming language
25 literals), and an internal node is represented as `(x . y)` where
26 `x` and `y` are standing in for other s-expressions. In Lisp
27 parlance, an internal node is called a _cons cell_, and the first
28 and second elements inside it are called the _car_ and the _cdr_,
29 for historical reasons. Non-empty lef nodes are referred to
30 in the s-cargot library as _atoms_.
31
32 Often, s-expressions are used to represent lists, in which case
33 the list is treated as a right-branching tree with an empty leaf as
34 the far right child of the tree. S-expression languages have a
35 shorthand way of representing these lists: instead of writing successsively
36 nested pairs, as in `(1 . (2 . (3 . ()))`, they allow the sugar
37 `(1 2 3)`. This is the most common way of writing s-expressions,
38 even in languages that allow raw cons cells (or "dotted pairs") to
39 be written.
40
41 The s-cargot library refers to expressions where every right-branching
42 sequence ends in an empty leaf as _well-formed s-expressions_. Note that
43 any s-expression which can be written without using a dotted pair is
44 necessarily well-formed.
45
46 Unfortunately, while in common use, s-expressions do not have a single
47 formal standard. They are often defined in an ad-hoc way, which means
48 that s-expressions used in different contexts will, despite sharing a common
49 parentheses-delimited structure, differ in various respects. Additionally,
50 because s-expressions are used as the concrete syntax for languages of
51 the Lisp family, they often have conveniences (such as comment syntaxes)
52 and other bits of syntactic sugar (such as _reader macros_, which are
53 described more fully later) that make parsing them much more complicated.
54 Even ignoring those features, the _atoms_ recognized by a given
55 s-expression variation can differ widely.
56
57 The s-cargot library was designed to accomodate several different kinds
58 of s-expression formats, so that an s-expression format can be easily
59 expressed as a combination of existing features. It includes a few basic
60 variations on s-expression languages as well as the tools for parsing
61 and emitting more elaborate s-expressions variations without having to
62 reimplement the basic plumbing yourself.
63
64 ## Using the Library
65
1666 The central way of interacting with the S-Cargot library is by creating
17 and modifying a _spec_, which is a value that represents a given
18 family of S-expressions. A _spec_, which is of type `SExprSpec`,
19 contains the information necessary to implement reader macros, arbitrary
20 kinds of comments, and various processing steps. A `SExprSpec` has two
21 type parameters:
22
23 ~~~~
24 +------ the type that represents a SExpr atom
25 |
26 | +- the Haskell representation of the SExpr value
27 | |
28 someSpec :: SExprSpec atom carrier
67 and modifying datatypes which represent specifications for parsing and
68 printing s-expressions. Each of those types has two type parameters, which
69 are often called @atom@ and @carrier@:
70
71 ~~~~
72 +------ the type that represents an atom or value
73 |
74 | +- the Haskell representation of the SExpr itself
75 | |
76 parser :: SExprParser atom carrier
77 printer :: SExprPrinter atom carrier
2978 ~~~~
3079
3180 Various functions will be provided that modify the carrier type (i.e. the
3281 output type of parsing or input type of serialization) or the language
33 recognized by the parsing. Examples will be shown below.
82 recognized by the parsing.
3483
3584 ## Representing S-expressions
3685
3786 There are three built-in representations of S-expression lists: two of them
3887 are isomorphic, as one or the other might be better for processing
3988 S-expression data in a particular circumstance, and the third represents
40 only a subset of possible S-expressions.
89 only the well-formed subset of possible S-expressions.
4190
4291 ~~~~.haskell
4392 -- cons-based representation
71120 functions.
72121
73122 ~~~~.haskell
74 > decode spec "(a b)"
123 >>> decode basicParser "(a b)"
75124 Right [SCons (SAtom "a") (SCons (SAtom "b") SNil)]
76 > decode (asRich spec) "(a b)"
125 >>> decode (asRich basicParser) "(a b)"
77126 Right [RSList [RSAtom "a",RSAtom "b"]]
78 > decode (asWellFormed spec) "(a b)"
127 >>> decode (asWellFormed basicParser) "(a b)"
79128 Right [WFSList [WFSAtom "a",WFSAtom "b"]]
80 > decode spec "(a . b)"
129 >>> decode basicParser "(a . b)"
81130 Right [SCons (SAtom "a") (SAtom "b")]
82 > decode (asRich spec) "(a . b)"
131 >>> decode (asRich basicParser) "(a . b)"
83132 Right [RSDotted [RSAtom "a"] "b"]
84 > decode (asWellFormed spec) "(a . b)"
133 >>> decode (asWellFormed basicParser) "(a . b)"
85134 Left "Found atom in cdr position"
86135 ~~~~
87136
93142 you plan on working with:
94143
95144 ~~~~.haskell
96 > A 2 ::: A 3 ::: A 4 ::: Nil
97 SCons (SCons (SCons (SAtom 2) (SAtom 3)) (SAtom 4)) SNil
98 ~~~~
99
100 ~~~~.haskell
101 > L [A 1,A 2,A 3]
145 >>> import Data.SCargot.Repr.Basic
146 >>> A 2 ::: A 3 ::: A 4 ::: Nil
147 SCons (SAtom 2) (SCons (SAtom 3) (SComs (SAtom 4) SNil))
148 ~~~~
149
150 ~~~~.haskell
151 >>> import Data.SCargot.Repr.WellFormed
152 >>> L [A 1,A 2,A 3]
102153 WFSList [WFSAtom 1,WFSAtom 2,WFSAtom 3]
103 > let sexprSum (L xs) = sum (map sexprSum xs); sexprSum (A n) = n
104 > :t sexprSum
154 >>> let sexprSum (L xs) = sum (map sexprSum xs); sexprSum (A n) = n
155 >>> :t sexprSum
105156 sexprSum :: Num a => WellFormedSExpr a -> a
106 > sexprSum (L [A 2, L [A 3, A 4]])
157 >>> sexprSum (L [A 2, L [A 3, A 4]])
107158 9
108159 ~~~~
109160
161 If you are using GHC 7.10, several of these will be powerful
162 bidirectional pattern synonyms that allow both constructing and
163 pattern-matchhing on s-expressions in non-trivial ways:
164
165 ~~~~.haskell
166 >>> import Data.SCargot.Repr.Basic
167 >>> L [ A 2, A 3, A 4 ]
168 SCons (SAtom 2) (SCons (SAtom 3) (SComs (SAtom 4) SNil))
169 ~~~~
170
110171 ## Atom Types
111172
112173 Any type can serve as an underlying atom type provided that it has
113 an Parsec parser and a serializer (i.e. a way of turning it
174 a Parsec parser or a serializer (i.e. a way of turning it
114175 into `Text`.) For these examples, I'm going to use a very simple
115176 serializer that is roughly like the one found in `Data.SCargot.Basic`,
116177 which parses symbolic tokens of letters, numbers, and some
118179 is just the identity function:
119180
120181 ~~~~.haskell
121 spec :: SExprSpec Text (SExpr Text)
122 spec = mkSpec (pack <$> many1 (alphaNum <|> oneOf "+-*/!?")) id
182 parser :: SExprParser Text (SExpr Text)
183 parser = mkParser (pack <$> many1 (alphaNum <|> oneOf "+-*/!?"))
184
185 printer :: SExprPrinter Text (SExpr Text)
186 printer = flatPrint id
123187 ~~~~
124188
125189 A more elaborate atom type would distinguish between different
139203 sAtom (Ident t) = t
140204 sAtom (Num n) = pack (show n)
141205
142 mySpec :: SExprSpec Atom (SExpr Atom)
143 mySpec = mkSpec pAtom sAtom
206 myParser :: SExprParser Atom (SExpr Atom)
207 myParser = mkParser pAtom
208
209 myPrinter :: SExprPrinter Atom (SExpr Atom)
210 myPrinter = flatPrint sAtom
144211 ~~~~
145212
146213 We can then use this newly created atom type within an S-expression
147214 for both parsing and serialization:
148215
149216 ~~~~.haskell
150 > decode mySpec "(foo 1)"
217 >>> decode myParser "(foo 1)"
151218 Right [SCons (SAtom (Ident "foo")) (SCons (SAtom (Num 1)) SNil)]
152 > encode mySpec [SCons (SAtom (Num 0)) SNil]
153 "(0)"
219 >>> encode mySpec [L [A (Num 0), A (Ident "bar")]]
220 "(0 bar)"
154221 ~~~~
155222
156223 ## Carrier Types
188255 the `SExprSpec`:
189256
190257 ~~~~.haskell
191 > decode (convertSpec toExpr fromExpr (asRich spec)) "(+ 1 2)"
258 >>> let parser' = setCarrier toExpr (asRich myParser)
259 >>> :t parser'
260 SExprParser Atom Expr
261 >>> decode parser' "(+ 1 2)"
192262 Right [Add (Num 1) (Num 2)]
193 > decode (convertSpec toExpr fromExpr (asRich spec)) "(0 1 2)"
263 >>> decode parser' "(0 1 2)"
194264 Left "Unrecognized s-expr"
195265 ~~~~
196266
197267 ## Comments
198268
199 By default, an S-expression spec does not include a comment syntax, but
269 By default, an S-expression parser does not include a comment syntax, but
200270 the provided `withLispComments` function will cause it to understand
201271 traditional Lisp line-oriented comments that begin with a semicolon:
202272
203273 ~~~~.haskell
204 > decode spec "(this ; has a comment\n inside)\n"
274 >>> decode basicParser "(this ; has a comment\n inside)\n"
205275 Left "(line 1, column 7):\nunexpected \";\"\nexpecting space or atom"
206 > decode (withLispComments spec) "(this ; has a comment\n inside)\n"
276 >>> decode (withLispComments basicParser) "(this ; has a comment\n inside)\n"
207277 Right [SCons (SAtom "this") (SCons (SAtom "inside") SNil)]
208278 ~~~~
209279
218288 For example, the following adds C++-style comments to an S-expression format:
219289
220290 ~~~~.haskell
221 > let cppComment = string "//" >> manyTill newline >> return ()
222 > decode (setComment cppComment spec) "(a //comment\n b)\n"
291 >>> let cppComment = string "//" >> manyTill newline >> return ()
292 >>> decode (setComment cppComment basicParser) "(a //comment\n b)\n"
223293 Right [SCons (SAtom "a") (SCons (SAtom "b") SNil)]
224294 ~~~~
225295
226296 ## Reader Macros
227297
228 A _reader macro_ is a Lisp macro which is invoked during read time. This
298 A _reader macro_ is a Lisp macro---a function that operates on syntactic
299 structures---which is invoked during the scanning phase of a Lisp parser. This
229300 allows the _lexical_ syntax of a Lisp to be modified. The most commonly
230 seen reader macro is the quote, which allows the syntax `'expr` to stand
231 in for the s-expression `(quote expr)`. The S-Cargot library accomodates
232 this by keeping a map of characters to Parsec parsers that can be used as
301 seen reader macro is the quote, which allows the syntax `'expr` to stand as sugar
302 for the s-expression `(quote expr)`. The S-Cargot library accomodates
303 this by keeping a map from characters to Haskell functions that can be used as
233304 readers. There is a special case for the aforementioned quote, but that
234305 could easily be written by hand as
235306
236307 ~~~~.haskell
237 > let quoteExpr c = SCons (SAtom "quote") (SCons c SNil)
238 > let withQuote = addReader '\'' (\ p -> fmap quoteExpr p)
239 > decode (withQuote mySpec) "'foo"
308 >>> let quote expr = SCons (SAtom "quote") (SCons expr SNil)
309 >>> let addQuoteReader = addReader '\'' (\ parse -> fmap quoteExpr parse)
310 >>> decode (addQuoteReader basicParser) "'foo"
240311 Right [SCons (SAtom "quote") (SCons (SAtom "foo") SNil)]
241312 ~~~~
242313
243314 A reader macro is passed the parser that invoked it, so that it can
244 perform recursive calls, and can return any `SExpr` it would like. It
315 perform recursive calls into the parser, and can return any `SExpr` it would like. It
245316 may also take as much or as little of the remaining parse stream as it
246317 would like; for example, the following reader macro does not bother
247318 parsing anything else and merely returns a new token:
248319
249320 ~~~~.haskell
250 > let qmReader = addReader '?' (\ _ -> pure (SAtom "huh"))
251 > decode (qmReader mySpec) "(?1 2)"
321 >>> let qmReader = addReader '?' (\ _ -> pure (SAtom "huh"))
322 >>> decode (qmReader basicParser) "(?1 2)"
252323 Right [SCons (SAtom "huh") (SCons (SAtom "1") (SCons (SAtom "2") SNil))]
253324 ~~~~
254325
255 Reader macros in S-Cargot can be used to define common bits of Lisp
326 Reader macros in S-Cargot can be used to define bits of Lisp
256327 syntax that are not typically considered the purview of S-expression
257328 parsers. For example, to allow square brackets as a subsitute for
258 proper lists, we could define a reader macro that is initialized by the
329 proper lists, we could define a reader macro that is indicated by the
259330 `[` character and repeatedly calls the parser until a `]` character
260331 is reached:
261332
262333 ~~~~.haskell
263 > let pVec p = (char ']' *> pure SNil) <|> (SCons <$> p <*> pVec p)
264 > let vec = addReader '[' pVec
265 > decode (asRich (vec mySpec)) "(1 [2 3])"
334 >>> let vec p = (char ']' *> pure SNil) <|> (SCons <$> p <*> vec p)
335 >>> let withVecReader = addReader '[' vec
336 >>> decode (asRich (withVecReader basicParser)) "(1 [2 3])"
266337 Right [RSList [RSAtom "1",RSList [RSAtom "2",RSAtom "3"]]]
267338 ~~~~
268339
340 ## Pretty-Printing and Indentation
341
342 The s-cargot library also includes a simple but often adequate
343 pretty-printing system for s-expressions. A printer that prints a
344 single-line s-expression is created with `flatPrint`:
345
346 ~~~~.haskell
347 >>> let printer = flatPrint id
348 >>> :t printer
349 SExprPrinter Text (SCargot Text)
350 >>> Text.putStrLn $ encode printer [L [A "foo", A "bar"]]
351 (foo bar)
352 ~~~~
353
354 A printer that tries to pretty-print an s-expression to fit
355 attractively within an 80-character limit can be created with
356 `basicPrint`:
357
358 ~~~~.haskell
359 >>> let printer = basicPrint id
360 >>> let sentence = "this stupendously preposterously supercalifragilisticexpialidociously long s-expression"
361 >>> let longSexpr = L [A word | word <- Text.words sentence ]
362 >>> Text.putStrLn $ encodeOne printer longSexpr
363 (this
364 stupendously
365 preposterously
366 supercalifragilisticexpialidociously
367 long
368 s-expression)
369 ~~~~
370
371 A printer created with `basicPrint` will "swing" things that are too
372 long onto the subsequent line, indenting it a fixed number of spaces.
373 We can modify the number of spaces with `setIndentAmount`:
374
375 ~~~~.haskell
376 >>> let printer = setIndentAmount 4 (basicPrint id)
377 >>> Text.putStrLn $ encodeOne printer longSexpr
378 (this
379 stupendously
380 preposterously
381 supercalifragilisticexpialidociously
382 long
383 s-expression)
384 ~~~~
385
386 We can also modify what counts as the 'maximum width', which for a
387 `basicPrint` printer is 80 by default:
388
389 ~~~~.haskell
390 >>> let printer = setMaxWidth 8 (basicPrint id)
391 >>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three"])
392 (one
393 two
394 three)
395 ~~~~
396
397 Or remove the maximum, which will put the whole s-expression onto one
398 line, regardless of its length:
399
400 ~~~~.haskell
401 >>> let printer = removeMaxWidth (basicPrint id)
402 >>> Text.putStrLn $ encodeOne printer longSexpr
403 (this stupendously preposterously supercalifragilisticexpialidociously long s-expression)
404 ~~~~
405
406 We can also specify an _indentation strategy_, which decides how to
407 indent subsequent expressions based on the head of a given
408 expression. The default is to always "swing" subsequent expressions
409 to the next line, but we could also specify the `Align` constructor, which
410 will print the first two expressions on the same line and then any subsequent
411 expressions horizontally aligned with the second one, like so:
412
413 ~~~~.haskell
414 >>> let printer = setIndentStrategy (\ _ -> Align) (setMaxWidth 8 (basicPrint id))
415 >>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three", A "four"])
416 (one two
417 three
418 four)
419 ~~~~
420
421 Or we could choose to keep some number of expressions on the same line and afterwards
422 swing the subsequent ones:
423
424 ~~~~.haskell
425 >>> let printer = setIndentStrategy (\ _ -> SwingAfter 1) (setMaxWidth 8 (basicPrint id))
426 >>> Text.putStrLn $ encodeOne printer (L [A "one", A "two", A "three", A "four"])
427 (one two
428 three
429 four)
430 ~~~~
431
432 For lots of situations, we might want to choose a different indentation strategy based
433 on the first expression within a proper list: for example, Common Lisp source code is often
434 formatted so that, following a `defun` token, we have the function name and arguments
435 on the same line, and then the body of the function indented some amount subsequently.
436 We can express an approximation of that strategy like this:
437
438 ~~~~.haskell
439 >>> let strategy (A ident) | "def" `Text.isPrefixOf` ident = SwingAfter 2; strategy _ = Align
440 >>> let printer = setIndentStrategy strategy (setMaxWidth 20 (basicPrint id))
441 >>> let fact = L [A "defun", A "fact", L [A "x"], L [A "product", L [A "range", A "1", A "x"]]]
442 >>> Text.putStrLn $ encodeOne printer fact
443 (defun fact (x)
444 (product (range 1 x)))
445 >>> let app = L [A "apply", L [A "lambda", L [A "y"], L [A "fact", A "y"]], L [A "+", A "2", A "3"]]
446 (apply (lambda (y) (fact y)
447 (+ 2 3))
448 ~~~~
449
269450 ## Putting It All Together
270451
271452 Here is a final example which implements a limited arithmetic language
272 with Haskell-style line comments and a special reader to understand hex
453 with Haskell-style line comments and a special reader macro to understand hex
273454 literals:
274455
275456 ~~~~.haskell
457 -- Our operators are going to represent addition, subtraction, or
458 -- multiplication
276459 data Op = Add | Sub | Mul deriving (Eq, Show)
460
461 -- The atoms of our language are either one of the aforementioned
462 -- operators, or positive integers
277463 data Atom = AOp Op | ANum Int deriving (Eq, Show)
464
465 -- Once parsed, our language will consist of the applications of
466 -- binary operators with literal integers at the leaves
278467 data Expr = EOp Op Expr Expr | ENum Int deriving (Eq, Show)
279468
280 -- Conversions for our Expr type
469 -- Conversions to and from our Expr type
281470 toExpr :: SExpr Atom -> Either String Expr
282471 toExpr (A (AOp op) ::: l ::: r ::: Nil) = EOp op <$> l <*> r
283472 toExpr (A (ANum n)) = pure (ENum n)
284 toExpr sexpr = Left ("Invalid parse: " ++ show sexpr)
473 toExpr sexpr = Left ("Unable to parse expression: " ++ show sexpr)
285474
286475 fromExpr :: Expr -> SExpr Atom
287476 fromExpr (EOp op l r) = A (AOp op) ::: fromExpr l ::: fromExpr r ::: Nil
300489 sAtom (AOp Mul) = "*"
301490 sAtom (ANum n) = T.pack (show n)
302491
303 -- Our comment syntax
492 -- Our comment syntax is going to be Haskell-like:
304493 hsComment :: Parser ()
305494 hsComment = string "--" >> manyTill newline >> return ()
306495
307 -- Our custom reader macro
496 -- Our custom reader macro: grab the parse stream and read a
497 -- hexadecimal number from it:
308498 hexReader :: Reader Atom
309499 hexReader _ = (Num . readHex . T.unpack) <$> takeWhile1 isHexDigit
310500 where isHexDigit c = isDigit c || c `elem` "AaBbCcDdEeFf"
311501 rd = readHex . head . fst
312502
313 -- Our final s-expression family
314 myLangSpec :: SExprSpec Atom Expr
315 myLangSpec
503 -- Our final s-expression parser and printer:
504 myLangParser :: SExprParser Atom Expr
505 myLangParser
316506 = setComment hsComment -- set comment syntax to be Haskell-style
317507 $ addReader '#' hexReader -- add hex reader
318 $ convertSpec toExpr fromExpr -- convert final repr to Expr
319 $ mkSpec pAtom sAtom -- create spec with Atom type
508 $ setCarrier toExpr -- convert final repr to Expr
509 $ mkParser pAtom -- create spec with Atom type
510
511 mkLangPrinter :: SexprPrinter Atom Expr
512 mkLangPrinter
513 = setFromCarrier fromExpr
514 $ setIndentStrategy (const Align)
515 $ basicPrint sAtom
320516 ~~~~
321517
322518 Keep in mind that you often won't need to write all this by hand,