I did a terrible thing. It's something lots of programmers do at
some point in their lives, but one I had kind of hoped to avoid,
so I was a little shocked when I realized what had happened.

I wrote a key-value store library.

Okay, so, I didn't actually write the key-value store _itself_.
What happened was, I wanted to use some kind of simple on-disk
key-value-store library like Berkeley DB or Tokyo Cabinet
in a Haskell program. There are some _really nice_ bindings
to these kinds of libraries in languages like Python:

~~~.python
from tokyocabinet import *

# using Tokyo Cabinet's b-tree implementation
bdb = BDB()
bdb.open("sample.tcb", BDBOCREAT)
bdb["foo"] = "bar"
bdb.close()
~~~

Almost all the existing Haskell bindings were thin wrappers
over the C implementations. For example, here's an analogous
program in Haskell, using Berkeley DB:

~~~.haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.Berkeley.Db

main = do
  -- create and open a database environment
  env <- dbEnv_create []
  dbEnv_open [DB_CREATE] 0 env "."
  -- create and open a b-tree database inside that environment
  db <- db_create [] env
  db_open [DB_CREATE] DB_BTREE 0 db Nothing "sample.bdb" Nothing
  -- store a single key-value pair, then tear everything down
  db_put [] db Nothing "foo" "bar"
  db_close [] db
  dbEnv_close [] env
~~~

Yeesh. There's a lot of boilerplate for what is fundamentally a
simple operation: "Open a database and store this mapping."

## The Basics of Tansu

So I wrote a simple wrapping library. Here's an analogous program
with my Tansu library:

~~~.haskell
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb

main = withBerkeleyDb "sample.bdb" $ \db ->
  run db ("foo" =: "bar")
~~~

This is a pretty huge improvement in terms of readability and
code size. But there's more! Keys and values are transparently
serialized using the `Serialize` typeclass from the `cereal`
library, so we can store values of any serializable type and
index by keys of any serializable type as well:

~~~.haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Control.Monad (forM_)
import Data.Serialize (Serialize)
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb
import GHC.Generics (Generic)

-- Define a `Person` type with a `Serialize` instance
data Person = Person
  { fullName      :: String
  , currentAge    :: Int
  , favoriteColor :: String
  } deriving (Eq, Show, Generic, Serialize)

-- Create our people list
people :: [(String, Person)]
people = [ ("alex",  Person "Alex Xie" 22 "mauve")
         , ("blake", Person "Blake MacPool" 33 "chartreuse")
		 , ("cal",   Person "Cal Lopez" 44 "pearl")
		 ]

main :: IO ()
main = withBerkeleyDb "sample.bdb" $ \db ->
  run db $ forM_ people (\ (k,v) -> k =: v)
~~~
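
Reading a value back out is just as terse. The following is only a
sketch: it assumes a `get` operation dual to `=:` (not shown
elsewhere in this post), and that `run` hands back the result of
the final action in its block:

~~~.haskell
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb

main = withBerkeleyDb "sample.bdb" $ \db -> do
  _ <- run db ("greeting" =: "hello, world")
  -- `get` here is an assumed lookup operation, dual to `=:`
  val <- run db (get "greeting")
  print val
~~~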

I've glossed over another part, too: Tansu is also parametric
in the _backend_. I've been using the `BerkeleyDb` backend, but
the `Tansu` operations are written in an abstract way that allows
backends to be swapped out without requiring any other changes
to the program. The Berkeley DB backend is actually kept in the
separate package `tansu-berkeleydb`[^gpl], while the core operations
are kept in the `tansu` package. The `tansu` package exposes two
very basic backends: the `Filesystem` backend, which represents
a key-value mapping as files in a directory, and the `Ephemeral`
backend, which doesn't save the mapping at all, but just keeps it
in memory and throws it away when the program ends.
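
For example, switching the very first example over to the in-memory
backend is a one-line change. A sketch, assuming the `Ephemeral`
backend exposes a `withEphemeralDb` entry point analogous to
`withBerkeleyDb`:

~~~.haskell
import Database.Tansu
import Database.Tansu.Backend.Ephemeral

-- identical to the first example, except that the mapping now
-- lives only in memory and is discarded when the program exits
main = withEphemeralDb $ \db ->
  run db ("foo" =: "bar")
~~~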

[^gpl]: This has the extra advantage that, while the `tansu-berkeleydb`
library must be released under the GPL because Berkeley DB is also
under the GPL, the `tansu` package itself can be released under
the less restrictive BSD license.

In addition to the `tansu-berkeleydb` backend, I've also written
one that uses a table in a SQLite database to store its data.

## Some Drawbacks and Caveats

The goal of `tansu` was to build a quick and easy library for use
in new Haskell programs. Consequently, the library is designed in
a way that makes it fast and easy to use in Haskell, but at the
cost of making it more difficult to use across languages or with
existing key-value stores.

A concrete example of this is that the serialization used is the
`cereal` library's serialization routines, which means that, even
when storing plain ASCII `String`s for keys and values, the actual
bytes that are stored are not the raw ASCII bytes. They are first
run through `cereal`'s `encode` function, which adds a 64-bit
big-endian length to the start:

~~~
0000 0000 0000 0003 666f 6f
[  64-bit length  ] [chars]
~~~
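
We can check this directly, with no `tansu` in the picture at all,
by calling `cereal`'s `encode` function ourselves. A minimal sketch
that hex-dumps the encoded form of the string `"foo"`:

~~~.haskell
import qualified Data.ByteString as BS
import           Data.Serialize (encode)
import           Text.Printf (printf)

main :: IO ()
main = do
  -- encode the String "foo" the same way tansu would
  let bytes = encode ("foo" :: String)
  -- prints: 00 00 00 00 00 00 00 03 66 6f 6f
  mapM_ (printf "%02x ") (BS.unpack bytes)
  putStrLn ""
~~~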

In order to use a `tansu`-generated database from another language,
you would probably have to reimplement the serialization and
deserialization logic from the `Serialize` typeclass, which would
be a non-trivial amount of work. One way around this is to use the
`RawString` newtype wrapper exposed in `Database.Tansu.RawString`,
which is a `ByteString` whose `Serialize` instance simply dumps
and reads the full raw bytestring. This violates assumptions that
other `Serialize` instances get to rely on (decoding a `RawString`
has to consume all remaining input, for example, so it can't be
composed with anything that follows it), and so it should be used
with caution.
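
Still, for a database meant to be read by non-Haskell code, wrapping
both keys and values on the way in does the job. A sketch, assuming
`RawString` is a plain newtype constructor over a strict
`ByteString`:

~~~.haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb
import Database.Tansu.RawString (RawString(..))

-- the bytes stored on disk are now literally `foo` and `bar`,
-- with no length prefix in front of them
main = withBerkeleyDb "sample.bdb" $ \db ->
  run db (RawString "foo" =: RawString "bar")
~~~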