I did a terrible thing. It's something lots of programmers do at
some point in their lives, but something I had kind of hoped to avoid,
and so I was a little shocked when I realized what had happened.

I wrote a key-value store library.

Okay, so, I didn't actually write the key-value store _itself_.
What happened was, I wanted to use some kind of simple on-disk
key-value-store library like Berkeley DB or Tokyo Cabinet
in a Haskell program. There are some _really nice_ bindings
to these kinds of libraries in languages like Python:
~~~.python
from tokyocabinet import *
# using Tokyo Cabinet's b-tree implementation
bdb = BDB()
bdb.open("sample.tcb", BDBOCREAT)
bdb["foo"] = "bar"
bdb.close()
~~~
Almost all the existing Haskell bindings were thin wrappers
over the C implementations. For example, here's an analogous
program in Haskell, using Berkeley DB:
~~~.haskell
import Database.Berkeley.Db
main = do
  env <- dbEnv_create []
  dbEnv_open [DB_CREATE] 0 env "."
  db <- db_create [] env
  db_open [DB_CREATE] DB_BTREE 0 db Nothing "sample.bdb" Nothing
  db_put [] db Nothing "foo" "bar"
  db_close [] db
  dbEnv_close [] env
~~~
Yeesh. There's a lot of boilerplate for what is fundamentally a
simple operation: "Open a database and store this mapping."
## The Basics of Tansu
So I wrote a simple wrapping library. Here's an analogous program
with my Tansu library:
~~~.haskell
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb
main = withBerkeleyDb "sample.bdb" $ \db ->
run db ("foo" =: "bar")
~~~
This is a pretty huge improvement in terms of readability and
code size. But there's more! Tansu transparently uses the
`Serialize` typeclass from the `cereal` library to convert keys
and values into sequences of bytes: consequently, we can store
values of any serializable type and index by keys of any
serializable type as well:
~~~.haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Control.Monad (forM_)
import Data.Serialize (Serialize)
import Database.Tansu
import Database.Tansu.Backend.BerkeleyDb
import GHC.Generics (Generic)

-- Define a `Person` type with a `Serialize` instance
data Person = Person
  { fullName      :: String
  , currentAge    :: Int
  , favoriteColor :: String
  } deriving (Eq, Show, Generic, Serialize)

-- Create our people list
people :: [(String, Person)]
people = [ ("alex",  Person "Alex Xie" 22 "mauve")
         , ("blake", Person "Blake MacPool" 33 "chartreuse")
         , ("cal",   Person "Cal Lopez" 44 "pearl")
         ]

main :: IO ()
main = withBerkeleyDb "sample.bdb" $ \db ->
  run db $ forM_ people (\ (k, v) -> k =: v)
~~~
I've glossed over another part, too: Tansu is also parametric
in the _backend_. I've been using the `BerkeleyDb` backend, but
the `Tansu` operations are written in an abstract way that allows
backends to be swapped out without requiring any other changes
to the program. The Berkeley DB backend is actually kept in the
separate package `tansu-berkeleydb`[^gpl], while the core operations
are kept in the `tansu` package. The `tansu` package exposes two
very basic backends: the `Filesystem` backend, which represents
a key-value mapping as files in a directory, and the `Ephemeral`
backend, which doesn't save the mapping but just keeps it in memory
and throws it away at the end.

[^gpl]: This has the extra advantage that, while the `tansu-berkeleydb`
library must be released under the GPL because Berkeley DB is itself
under the GPL, the `tansu` package itself can be released under the
less restrictive BSD license.

In addition to the `tansu-berkeleydb` backend, I've also written
one that uses a table in a SQLite database to store its data.
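To give a sense of how little changes when a backend is swapped out,
here is a sketch of the first example running against the two built-in
backends. The module names follow the `Database.Tansu.Backend.*` pattern
used above, but the entry points `withFilesystemDb` and `withEphemeralDb`
are assumptions on my part, named by analogy with `withBerkeleyDb`, as
is the `"sample-db"` directory name; check the packages' actual exports
before relying on them.

~~~.haskell
import Database.Tansu
-- Both backends ship with the core `tansu` package; the entry-point
-- names used below are assumed by analogy with `withBerkeleyDb`.
import Database.Tansu.Backend.Ephemeral
import Database.Tansu.Backend.Filesystem

main :: IO ()
main = do
  -- Store the mapping as files under the "sample-db" directory.
  _ <- withFilesystemDb "sample-db" $ \db ->
         run db ("foo" =: "bar")
  -- Keep the mapping in memory only; it is discarded once the action returns.
  _ <- withEphemeralDb $ \db ->
         run db ("foo" =: "bar")
  return ()
~~~

The point is that only the `import` and the `with...Db` wrapper change;
the `run db ("foo" =: "bar")` part stays exactly the same.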
## Some Drawbacks and Caveats
The goal of `tansu` was to build a quick and easy library for use
in new Haskell programs. Consequently, the library is designed in
a way that makes it fast and easy to use in Haskell, but at the
cost of making it more difficult to use across languages or with
existing key-value stores.
A concrete example of this is the serialization itself: Tansu uses
the `cereal` library's serialization routines, which means that, even
when storing plain ASCII `String`s for keys and values, the actual
values that are stored are not the same sequence of bytes as the
raw ASCII keys. They are first run through `cereal`'s `encode` function,
which prefixes them with a 64-bit length:
~~~
0000 0000 0000 0003 666f 6f
[ 64-bit length ] [chars]
~~~
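
You can check this yourself by calling `encode` directly. Here's a
minimal sketch that hex-dumps the encoded form of the key `"foo"`;
assuming the encoding described above, it should print the eleven
bytes shown.

~~~.haskell
import qualified Data.ByteString as BS
import Data.Serialize (encode)
import Numeric (showHex)

-- Hex-dump the bytes `cereal` produces for the String "foo": an 8-byte
-- big-endian length prefix followed by one byte per ASCII character.
main :: IO ()
main = putStrLn (unwords (map toHex (BS.unpack (encode ("foo" :: String)))))
  where
    toHex b = case showHex b "" of
                [c] -> ['0', c]
                s   -> s
~~~
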
In order to use a `tansu`-generated database from another language,
you would probably have to reimplement the serialization and
deserialization logic from the `Serialize` typeclass, which would
be a non-trivial amount of work. One way around this is to use the
`RawString` newtype wrapper exposed in `Database.Tansu.RawString`,
which is a `ByteString` whose `Serialize` instance simply dumps
and reads the full raw bytestring. This violates several other
`Serialize` assumptions, so it should be used with caution.
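
For the curious, an instance along those lines looks roughly like the
sketch below. This is not the actual definition from
`Database.Tansu.RawString` (the newtype here is named `RawBytes` to make
that clear), but it shows why such an instance can't be composed with
others: its `get` has no way to know where its own bytes end, so it
simply consumes everything remaining.

~~~.haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Serialize

-- A hypothetical raw-bytes wrapper: serialization is the identity on the
-- underlying bytes, with no length prefix or tagging of any kind.
newtype RawBytes = RawBytes ByteString
  deriving (Eq, Show)

instance Serialize RawBytes where
  put (RawBytes bs) = putByteString bs
  -- Greedily read every byte that is left in the input.
  get = RawBytes <$> (getByteString =<< remaining)

-- Round-trip demo: encoding adds nothing, decoding consumes everything.
main :: IO ()
main = do
  let bytes = encode (RawBytes (BC.pack "foo"))
  print bytes                                     -- "foo", no length prefix
  print (decode bytes :: Either String RawBytes)  -- Right (RawBytes "foo")
~~~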