two more unfinished posts
Getty Ritter
9 years ago
| 1 | I did a terrible thing. It's something lots of programmers do at | |
| 2 | some point in their lives, but something I had kind of hoped to avoid, | |
| 3 | and so I was kind of shocked when I realized what had happened. | |
| 4 | ||
| 5 | I wrote a key-value store library. | |
| 6 | ||
| 7 | Okay, so, I didn't actually write the key-value store _itself_. | |
| 8 | What happened was, I wanted to use some kind of simple on-disk | |
| 9 | key-value-store library like Berkeley DB or Tokyo Cabinet | |
| 10 | in a Haskell program. There are some _really nice_ bindings | |
| 11 | to these kinds of libraries in languages like Python: | |
| 12 | ||
| 13 | ~~~.python | |
| 14 | from tokyocabinet import * | |
| 15 | ||
| 16 | # using Tokyo Cabinet's b-tree implementation | |
| 17 | bdb = BDB() | |
| 18 | bdb.open("sample.tcb", BDBOCREAT) | |
| 19 | bdb["foo"] = "bar" | |
| 20 | bdb.close() | |
| 21 | ~~~ | |
| 22 | ||
| 23 | Almost all the existing Haskell bindings were thin wrappers | |
| 24 | over the C implementations. For example, here's an analogous | |
| 25 | program in Haskell, using Berkeley DB: | |
| 26 | ||
| 27 | ~~~.haskell | |
| 28 | import Database.Berkeley.Db | |
| 29 | ||
| 30 | main = do | |
| 31 |   env <- dbEnv_create [] | |
| 32 |   dbEnv_open [DB_CREATE] 0 env "." | |
| 33 |   db <- db_create [] env | |
| 34 |   db_open [DB_CREATE] DB_BTREE 0 db Nothing "sample.bdb" Nothing | |
| 35 |   db_put [] db Nothing "foo" "bar" | |
| 36 |   db_close [] db | |
| 37 |   dbEnv_close [] env | |
| 38 | ~~~ | |
| 39 | ||
| 40 | Yeesh. There's a lot of boilerplate for what is fundamentally a | |
| 41 | simple operation: "Open a database and store this mapping." | |
| 42 | ||
| 43 | ## The Basics of Tansu | |
| 44 | ||
| 45 | So I wrote a simple wrapping library. Here's an analogous program | |
| 46 | with my Tansu library: | |
| 47 | ||
| 48 | ~~~.haskell | |
| 49 | import Database.Tansu | |
| 50 | import Database.Tansu.Backend.BerkeleyDb | |
| 51 | ||
| 52 | main = withBerkeleyDb "sample.bdb" $ \db -> | |
| 53 |   run db ("foo" =: "bar") | |
| 54 | ~~~ | |
| 55 | ||
| 56 | This is a pretty huge improvement in terms of readability and | |
| 57 | code size. But there's more! Tansu transparently uses the | |
| 58 | `Serialize` typeclass from the `cereal` library to convert | |
| 59 | keys and values into strings of bytes. Consequently, we can | |
| 60 | store values of any serializable type and index by keys of any | |
| 61 | serializable type as well: | |
| 62 | ||
| 63 | ~~~.haskell | |
| 64 | {-# LANGUAGE DeriveGeneric, DeriveAnyClass #-} | |
| 65 | ||
| 66 | import Control.Monad (forM_) | |
| 67 | import Data.Serialize (Serialize) | |
| 68 | import Database.Tansu | |
| 69 | import Database.Tansu.Backend.BerkeleyDb | |
| 70 | import GHC.Generics (Generic) | |
| 71 | ||
| 72 | -- Define a `Person` type with a `Serialize` instance | |
| 73 | data Person = Person | |
| 74 |   { fullName :: String | |
| 75 |   , currentAge :: Int | |
| 76 |   , favoriteColor :: String | |
| 77 |   } deriving (Eq, Show, Generic, Serialize) | |
| 78 | ||
| 79 | -- Create our people list | |
| 80 | people :: [(String, Person)] | |
| 81 | people = [ ("alex", Person "Alex Xie" 22 "mauve") | |
| 82 |          , ("blake", Person "Blake MacPool" 33 "chartreuse") | |
| 83 |          , ("cal", Person "Cal Lopez" 44 "pearl") | |
| 84 |          ] | |
| 85 | ||
| 86 | main :: IO () | |
| 87 | main = withBerkeleyDb "sample.bdb" $ \db -> | |
| 88 |   run db $ forM_ people (\ (k,v) -> k =: v) | |
| 89 | ~~~ | |
| 90 | ||
| 91 | I've glossed over another part, too: Tansu is also parametric | |
| 92 | in the _backend_. I've been using the `BerkeleyDb` backend, but | |
| 93 | the `Tansu` operations are written in an abstract way that allows | |
| 94 | backends to be swapped out without requiring any other changes | |
| 95 | to the program. The Berkeley DB backend is actually kept in | |
| 96 | the separate package `tansu-berkeleydb`[^gpl], while the core operations | |
| 97 | are kept in the `tansu` package. The `tansu` package exposes two | |
| 98 | very basic backends: the `Filesystem` backend, which represents | |
| 99 | a key-value mapping as files in a directory, and the `Ephemeral` | |
| 100 | backend, which doesn't save the mapping but just keeps it in memory | |
| 101 | and throws it away at the end. | |
| 102 | ||
| 103 | [^gpl]: This has the extra advantage that, while the `tansu-berkeleydb` | |
| 104 | library must be released under the GPL because Berkeley DB is also | |
| 105 | under the GPL, the `tansu` package itself can be released under | |
| 106 | a less restrictive BSD license. | |
| 107 | ||
| 108 | In addition to the `tansu-berkeleydb` backend, I've also written | |
| 109 | one that uses a table in a SQLite database to store its data. | |
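
To make the backend idea a bit more concrete, here's a minimal sketch of how this kind of abstraction can work. This is not Tansu's actual code: the `Backend` record, `newMemoryBackend`, `withLogging`, and `storeGreeting` are all hypothetical names made up for illustration, and plain `String`s stand in for serialized bytes.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Hypothetical: a backend is a record of operations on raw keys and values.
-- (Not Tansu's real types; Strings stand in for serialized bytes.)
data Backend = Backend
  { putRaw :: String -> String -> IO ()
  , getRaw :: String -> IO (Maybe String)
  }

-- An in-memory backend: the mapping lives in an IORef and is gone once
-- the program exits.
newMemoryBackend :: IO Backend
newMemoryBackend = do
  ref <- newIORef []
  pure Backend
    { putRaw = \k v ->
        modifyIORef' ref (\kvs -> (k, v) : filter ((/= k) . fst) kvs)
    , getRaw = \k -> lookup k <$> readIORef ref
    }

-- A second "backend" that wraps another one and logs every operation.
withLogging :: Backend -> Backend
withLogging inner = Backend
  { putRaw = \k v -> putStrLn ("put " ++ show k) >> putRaw inner k v
  , getRaw = \k -> putStrLn ("get " ++ show k) >> getRaw inner k
  }

-- Program logic only mentions `Backend`, never a concrete store.
storeGreeting :: Backend -> IO (Maybe String)
storeGreeting backend = do
  putRaw backend "foo" "bar"
  getRaw backend "foo"

main :: IO ()
main = do
  mem <- newMemoryBackend
  storeGreeting mem >>= print
  storeGreeting (withLogging mem) >>= print
```

Because `storeGreeting` only ever touches the record, swapping one store for another is a change at the call site and nowhere else, which is the property the paragraphs above describe.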
| 110 | ||
| 111 | ## Some Drawbacks and Caveats | |
| 112 | ||
| 113 | The goal of `tansu` was to build a quick and easy library for use | |
| 114 | in new Haskell programs. Consequently, the library is designed in | |
| 115 | a way that makes it fast and easy to use in Haskell, but at the | |
| 116 | cost of making it more difficult to use across languages or with | |
| 117 | existing key/value stores. | |
| 118 | ||
| 119 | A concrete example of this is that the serialization used is the | |
| 120 | `cereal` library's serialization routines, which means that, even | |
| 121 | when storing plain ASCII `String`s for keys and values, the bytes | |
| 122 | that actually get stored are not the raw ASCII bytes. The strings | |
| 123 | are first run through `cereal`'s `encode` function, | |
| 124 | which adds a 64-bit length to the start: | |
| 125 | ||
| 126 | ~~~ | |
| 127 | 0000 0000 0000 0003 666f 6f | |
| 128 | [  64-bit length  ] [chars] | |
| 129 | ~~~ | |
| 130 | ||
| 131 | In order to use a `tansu`-generated database from another language, | |
| 132 | you would probably have to reimplement the serialization and | |
| 133 | deserialization logic from the `Serialize` typeclass, which would | |
| 134 | be a non-trivial amount of work. One way around this is to use the | |
| 135 | `RawString` newtype wrapper exposed in `Database.Tansu.RawString`, | |
| 136 | which is a `ByteString` whose `Serialize` instance simply dumps | |
| 137 | and reads the full raw bytestring. This violates several other | |
| 138 | `Serialize` assumptions, so it should be used with caution. | |
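
To give a feel for what another language would need to reproduce for the simple `String` case, here's a hand-rolled sketch of the framing from the hex dump earlier: a big-endian 64-bit length followed by one byte per character. It doesn't use `cereal` at all, and `frameAscii`, `unframeAscii`, and `hex` are names made up for illustration. It also assumes ASCII-only input and says nothing about how other types are encoded.

```haskell
import Data.Bits (shiftR)
import Data.Char (chr, ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Hypothetical helper: frame an ASCII string as a big-endian 64-bit
-- length followed by one byte per character.
frameAscii :: String -> [Word8]
frameAscii s = lengthBytes ++ map (fromIntegral . ord) s
  where
    n = length s
    -- Most significant byte first; fromIntegral truncates to the low 8 bits.
    lengthBytes = [ fromIntegral (n `shiftR` (8 * i)) | i <- [7, 6 .. 0] ]

-- Inverse of frameAscii: read the 8-byte length, then check that exactly
-- that many payload bytes follow.
unframeAscii :: [Word8] -> Maybe String
unframeAscii bytes
  | length header < 8   = Nothing
  | length payload /= n = Nothing
  | otherwise           = Just (map (chr . fromIntegral) payload)
  where
    (header, payload) = splitAt 8 bytes
    n = foldl (\acc b -> acc * 256 + fromIntegral b) 0 header :: Int

-- Render bytes as two lowercase hex digits each.
hex :: [Word8] -> String
hex = concatMap byte
  where
    byte b = case showHex b "" of
      [c] -> ['0', c]
      cs  -> cs

main :: IO ()
main = do
  putStrLn (hex (frameAscii "foo"))
  print (unframeAscii (frameAscii "foo"))
```

For `"foo"` this produces the same eleven bytes as the dump. The hard part of reading a real database from another language is everything beyond plain strings, like the derived `Person` instance from earlier.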
| 139 |