two more unfinished posts
Getty Ritter
8 years ago
I did a terrible thing. It's something lots of programmers do at
some point in their lives, but one I had kind of hoped to avoid,
and so I was kind of shocked when I realized what had happened.

I wrote a key-value store library.

Okay, so, I didn't actually write the key-value store _itself_.
What happened was, I wanted to use some kind of simple on-disk
key-value-store library like Berkeley DB or Tokyo Cabinet
in a Haskell program. There are some _really nice_ bindings
to these kinds of libraries in languages like Python:

~~~.python
from tokyocabinet import *

# using Tokyo Cabinet's b-tree implementation
bdb = BDB()
bdb.open("sample.tcb", BDBOCREAT)
bdb["foo"] = "bar"
bdb.close()
~~~

By contrast, almost all the existing Haskell bindings were thin
wrappers over the C implementations. For example, here's an analogous
program in Haskell, using Berkeley DB:

~~~.haskell
import Database.Berkeley.Db

main = do
  -- create and open an environment rooted in the current directory
  env <- dbEnv_create []
  dbEnv_open [DB_CREATE] 0 env "."
  -- create and open a b-tree database inside that environment
  db <- db_create [] env
  db_open [DB_CREATE] DB_BTREE 0 db Nothing "sample.bdb" Nothing
  db_put [] db Nothing "foo" "bar"
  db_close [] db
  dbEnv_close [] env
~~~

Yeesh. There's a lot of boilerplate for what is fundamentally a
simple operation: "Open a database and store this mapping."

## The Basics of Tansu

So I wrote a simple wrapping library. Here's an analogous program
with my Tansu library:

~~~.haskell
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb

main = withBerkeleyDb "sample.bdb" $ \db ->
  run db ("foo" =: "bar")
~~~

This is a pretty huge improvement in terms of readability and
code size. But there's more! Keys and values are transparently
converted into strings of bytes using the `Serialize` typeclass
from the `cereal` library. Consequently, we can store values of
any type that has a `Serialize` instance, and index by values of
any such type as well:

~~~.haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Control.Monad (forM_)
import Data.Serialize (Serialize)
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb
import GHC.Generics (Generic)

-- Define a `Person` type with a `Serialize` instance
data Person = Person
  { fullName      :: String
  , currentAge    :: Int
  , favoriteColor :: String
  } deriving (Eq, Show, Generic, Serialize)

-- Create our people list
people :: [(String, Person)]
people = [ ("alex", Person "Alex Xie" 22 "mauve")
         , ("blake", Person "Blake MacPool" 33 "chartreuse")
         , ("cal", Person "Cal Lopez" 44 "pearl")
         ]

main :: IO ()
main = withBerkeleyDb "sample.bdb" $ \db ->
  run db $ forM_ people (\ (k,v) -> k =: v)
~~~
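
To give a rough idea of what that `Serialize` instance buys you, here is
a minimal sketch that uses only `cereal` itself, with no Tansu involved.
It takes the same `Person` record through `encode` and `decode`, which is
presumably the kind of conversion that happens on the way to and from the
store. The `roundTrip` helper is a name made up for this illustration.

~~~.haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Data.Serialize (Serialize, decode, encode)
import GHC.Generics (Generic)

-- Same record as above; the instance comes from the Generic defaults
data Person = Person
  { fullName      :: String
  , currentAge    :: Int
  , favoriteColor :: String
  } deriving (Eq, Show, Generic, Serialize)

-- Hypothetical helper: serialize to bytes, then parse the bytes back
roundTrip :: Person -> Either String Person
roundTrip = decode . encode

main :: IO ()
main = do
  let alex = Person "Alex Xie" 22 "mauve"
  -- Check that decoding the encoded bytes recovers the original value
  print (roundTrip alex == Right alex)
~~~

Note that `decode` returns an `Either String`, so a store holding
malformed or foreign bytes shows up as a `Left` rather than an exception.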

I've glossed over another part, too: Tansu is also parametric
in the _backend_. I've been using the `BerkeleyDb` backend, but
the `Tansu` operations are written in an abstract way that allows
backends to be swapped out without requiring any other changes
to the program. The Berkeley DB backend is actually kept in
the separate package `tansu-berkeleydb`[^gpl], while the core operations
are kept in the `tansu` package. The `tansu` package exposes two
very basic backends: the `Filesystem` backend, which represents
a key-value mapping as files in a directory, and the `Ephemeral`
backend, which doesn't save the mapping but just keeps it in memory
and throws it away at the end.

[^gpl]: This has the extra advantage that, while the `tansu-berkeleydb`
library must be released under the GPL because Berkeley DB is also
under the GPL, the `tansu` package itself can be released under
a less restrictive BSD license.

In addition to the `tansu-berkeleydb` backend, I've also written
one that uses a table in a SQLite database to store its data.
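
To illustrate the general idea of swappable backends, here is a small
sketch of how such an abstraction could look. It is emphatically _not_
Tansu's actual definition: the `Backend` record, its field names, and
`newEphemeral` are all invented for this example. The point is only that
code written against the record never needs to know which storage sits
behind it.

~~~.haskell
import Data.IORef (modifyIORef', newIORef, readIORef)
import qualified Data.ByteString.Char8 as BS
import qualified Data.Map.Strict as Map

-- Hypothetical backend interface (not Tansu's real type):
-- a backend is just a pair of operations over raw bytes
data Backend = Backend
  { backendSet :: BS.ByteString -> BS.ByteString -> IO ()
  , backendGet :: BS.ByteString -> IO (Maybe BS.ByteString)
  }

-- An in-memory backend in the spirit of `Ephemeral`:
-- the mapping lives in an IORef and is lost when the program exits
newEphemeral :: IO Backend
newEphemeral = do
  ref <- newIORef Map.empty
  pure Backend
    { backendSet = \k v -> modifyIORef' ref (Map.insert k v)
    , backendGet = \k -> Map.lookup k <$> readIORef ref
    }

-- Program logic that depends only on the interface
storeAndFetch :: Backend -> IO (Maybe BS.ByteString)
storeAndFetch b = do
  backendSet b (BS.pack "foo") (BS.pack "bar")
  backendGet b (BS.pack "foo")

main :: IO ()
main = newEphemeral >>= storeAndFetch >>= print
~~~

A filesystem or SQLite backend would then be another function returning
a `Backend`, and `storeAndFetch` would run against it unchanged.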

## Some Drawbacks and Caveats

The goal of `tansu` was to build a quick and easy library for use
in new Haskell programs. Consequently, the library is designed in
a way that makes it fast and easy to use in Haskell, but at the
cost of making it more difficult to use across languages or with
existing key/value stores.

A concrete example of this is that the serialization used is the
`cereal` library's serialization routines, which means that, even
when storing plain ASCII `String`s for keys and values, the bytes
that actually get stored are not the raw ASCII bytes of those
strings. Each string is first run through `cereal`'s `encode` function,
which adds a 64-bit big-endian length to the start:

~~~
0000 0000 0000 0003 666f 6f
[  64-bit length  ] [chars]
~~~
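
The layout in that dump can be checked directly against `cereal`. The
short sketch below, which again involves no Tansu code, encodes the
`String` `"foo"` and prints the resulting bytes as decimal numbers.

~~~.haskell
import qualified Data.ByteString as B
import Data.Serialize (decode, encode)

main :: IO ()
main = do
  let bytes = encode ("foo" :: String)
  -- Eight bytes of big-endian length, then one byte per ASCII character
  print (B.unpack bytes)
  -- prints [0,0,0,0,0,0,0,3,102,111,111]

  -- Decoding needs a type annotation to pick the right instance
  print (decode bytes :: Either String String)
  -- prints Right "foo"
~~~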

In order to use a `tansu`-generated database from another language,
you would probably have to reimplement the serialization and
deserialization logic from the `Serialize` typeclass, which would
be a non-trivial amount of work. One way around this is to use the
`RawString` newtype wrapper exposed in `Database.Tansu.RawString`,
which is a `ByteString` whose `Serialize` instance simply dumps
and reads the full raw bytestring. This violates several other
`Serialize` assumptions, so it should be used with caution.
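
As a rough sketch of what such a wrapper might look like, and of why it
calls for caution, here is a stand-in newtype written against `cereal`
alone. The name `Raw` and this instance are assumptions made for
illustration; the real `RawString` may well be defined differently.

~~~.haskell
import qualified Data.ByteString.Char8 as BS
import Data.Serialize
  (Serialize (..), decode, encode, getByteString, putByteString, remaining)

-- Hypothetical stand-in for a raw-bytes wrapper
newtype Raw = Raw BS.ByteString
  deriving (Eq, Show)

instance Serialize Raw where
  -- Write the bytes as-is, with no length prefix
  put (Raw bs) = putByteString bs
  -- Read back however many bytes are left in the input
  get = Raw <$> (remaining >>= getByteString)

main :: IO ()
main = do
  -- On its own, the encoding is exactly the original bytes
  print (encode (Raw (BS.pack "foo")) == BS.pack "foo")

  -- Inside a larger structure the instance misbehaves: with no length
  -- prefix, the first component swallows all of the input, so the
  -- pair that comes back is not the pair that went in
  let pair = (Raw (BS.pack "ab"), Raw (BS.pack "cd"))
  print ((decode (encode pair) :: Either String (Raw, Raw)) == Right pair)
~~~

The second check is expected to come out `False`, which is one concrete
way a wrapper like this breaks the usual expectation that `decode`
undoes `encode`. It is safe as a whole key or a whole value, but not as
a field of something bigger.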