Fulcrum was born out of the fact that I like pivot tables but hate
spreadsheets.

Pivot tables, if you haven't used them, are a very nice feature of most
spreadsheet programs. They're an interactive tool for analyzing and
working with tabular data: you list which labels you want along
the columns and rows and how to reduce your data down (e.g. whether you
want to sum or average the numbers in question), and you can simply
drag and drop field identifiers to get a new view on your data. It's
very nice.
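
To make the idea concrete, here is a minimal sketch of what a pivot table computes, written with plain `Data.Map` rather than Fulcrum. The `Sale` type, the `pivot` function, and the sample data are all made up for illustration: rows are grouped by a (row label, column label) pair, and each group is reduced to a single value.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical row type, used only for this illustration.
data Sale = Sale
  { region  :: String
  , quarter :: String
  , amount  :: Double
  }

-- Group rows by (row label, column label), then reduce each cell.
-- 'flip (++)' keeps the values of a cell in their input order.
pivot :: (Ord r, Ord c)
      => (a -> r)      -- label along the rows
      -> (a -> c)      -- label along the columns
      -> (a -> v)      -- value to collect
      -> ([v] -> w)    -- reduction, e.g. sum or an average
      -> [a]
      -> Map.Map (r, c) w
pivot rowKey colKey value reduce rows =
  Map.map reduce $
    Map.fromListWith (flip (++))
      [ ((rowKey x, colKey x), [value x]) | x <- rows ]

sales :: [Sale]
sales =
  [ Sale "East" "Q1" 10
  , Sale "East" "Q1" 5
  , Sale "West" "Q1" 7
  , Sale "East" "Q2" 3
  ]

-- Total sales per region and quarter.
totals :: Map.Map (String, String) Double
totals = pivot region quarter amount sum sales
```

Swapping `sum` for another reduction, or `region` for another accessor, gives a different view of the same data, which is the part a spreadsheet lets you do by dragging fields around.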

Unfortunately, spreadsheet software is very tedious. Quite often, I
would like to do some kind of automated processing on the data I've
produced, and doing so is an exercise in horrible frustration.
The scripting languages included with spreadsheet software are
of uniformly terrible quality, and the APIs provided are generally
quite inflexible.

So, I wrote my own pivot-table tool from scratch in Haskell. It was a
reasonably nice design: it used the lens package to select particular
fields from a data structure, and applied sets of reductions to the data.

In fact, it was nice enough that it took only trivial changes to turn
it from a program for specific data processing to a library for
data processing in general. It is (almost) totally agnostic as to what
kind of data you put in and what kind of data you get out, with a few
caveats that I'll get into.

So, here's what Fulcrum looks like:

# A Short Example

Let's assume we have an `Employee` type as follows:

~~~~
> import Data.Time (NominalDiffTime, UTCTime, diffUTCTime)
>
> data Department = Marketing | Sales | HR
>   deriving (Eq, Show, Ord)

> data Employee = Employee
>   { name      :: String
>   , birthdate :: UTCTime
>   , dept      :: Department
>   } deriving (Eq, Show)
~~~~

We want to find out the average employee age for each department.
(Fulcrum is overkill for this, but bear with me.)
The first step is to find out the age of an employee _at all_, which
requires knowing what today is. So, we'll write a function we can
pass the current day to:

~~~~
> getAge :: UTCTime -> Employee -> NominalDiffTime
> getAge today e = diffUTCTime today (birthdate e)
~~~~
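
Note that `getAge` returns a `NominalDiffTime`, which counts seconds, so an average of these values comes out in seconds as well. If you'd rather read ages in years, a conversion along these lines would do. The `ageInYears` helper is hypothetical and not part of Fulcrum, and it assumes a mean Gregorian year of 365.2425 days:

```haskell
import Data.Time (NominalDiffTime)

-- Assumption: one year is the mean Gregorian year, 365.2425 days
-- of 86400 seconds each.
secondsPerYear :: NominalDiffTime
secondsPerYear = 365.2425 * 86400

-- Hypothetical helper: convert a duration in seconds to fractional years.
ageInYears :: NominalDiffTime -> Double
ageInYears age = realToFrac (age / secondsPerYear)
```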

Now that we can find an employee's age, we can build a plan around
it. Again, we don't know what _today_ is, so we'll actually write
a function that, given a current day, will produce a plan for it:

~~~~
> myPlan :: UTCTime -> Plan Employee NominalDiffTime
> myPlan today = Plan
>   { planName   = "Average Employee Age"
~~~~
If we graph it, we'll use the plan name as the title. It can be
basically anything we want.
~~~~
>   , planFCName = "age"
>   , planFocus  = getAge today
~~~~
The focus is whatever field we'd like to extract as the relevant
dependent variable. We pass in a function which takes us from
the row type to the focus type; here, that is `getAge` applied to
the current day.
~~~~
>   , planFilters = const True
~~~~
We don't want to filter out any rows at all, so the filter function
just returns `True` for every row.
~~~~
>   , planMaps    = id
~~~~
We also don't want to do any pre-emptive modification to the data,
so `planMaps` just returns the rows unchanged.
~~~~
>   , planAxis    = select dept "department"
~~~~
The `planAxis` field also takes a getter (wrapped in a `Select`
constructor, which I'll explain below) to select based on an
independent variable. In this case, we're selecting on which
department the employee is in, so we'll turn the record accessor
function into a `Getter`. The `select` function also wants to
know the name of the axis, for graphing and metadata purposes,
so we'll give it that, too.
~~~~
>   , planLines   = mempty
~~~~
If we wanted to have multiple data series on a chart, we'd use
the `planLines` field, but this example doesn't require it, so we
can supply the empty value, `mempty`. `Select` values form a monoid in
which combining two `Select`s produces a `Select` that produces
a pair of values, so their identity is an "empty" `Select` that
always produces `()`.
~~~~
>   , planMerge   = avg
>   }
~~~~
And here we supply the function `avg`, which takes the average of
a list of numbers, as the relevant merge function.
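
For reference, a merge function of this kind could be as simple as the sketch below. This is not necessarily how Fulcrum defines `avg`; the name `avgSketch` is made up to keep the two apart.

```haskell
-- A minimal arithmetic mean over a list; an assumption about what
-- an 'avg'-style merge function might look like, not Fulcrum's code.
-- On an empty list this divides by zero: NaN for Double, and a
-- runtime error for fixed-precision types such as NominalDiffTime.
avgSketch :: Fractional a => [a] -> a
avgSketch xs = sum xs / fromIntegral (length xs)
```

Because it only needs `Fractional`, the same function works on `Double` and on the `NominalDiffTime` values that `getAge` produces.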



# Plans

Fulcrum is based around the idea of a `Plan`, which is a recipe for
taking a dataset and producing specific values. The `Plan` type is
a record containing a bunch of fields which you have to specify,
and can produce a table-ish (if you squint) dataset in the form of
a map using the `runPlanToMap` function. A few auxiliary graphing
functions take the `Plan` and dataset directly, as well.

`Plan`s are parameterized by two types: the first is the type of the
individual entries in your dataset, and the second is the type of
the value that your plan will extract. If you're going to graph
your plan, the latter has to be numeric, but otherwise, it can just
be any Haskell type.
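
To give a feel for how the pieces fit together, here is a deliberately simplified model of a plan and a function that runs it. Everything in it is an assumption made for illustration: `MiniPlan`, its field types, and `runMiniPlan` are stand-ins, the axis is a plain function to `String` instead of a `Select`, and the order of operations (filter, then map, then group and merge) is a guess rather than a description of `runPlanToMap`.

```haskell
import qualified Data.Map.Strict as Map

-- Simplified stand-in for a Plan: 'row' is the type of a dataset entry,
-- 'focus' is the type of the extracted value.
data MiniPlan row focus = MiniPlan
  { mpName   :: String            -- title, e.g. for a graph
  , mpFocus  :: row -> focus      -- dependent variable
  , mpFilter :: row -> Bool       -- which rows to keep
  , mpMap    :: row -> row        -- per-row preprocessing
  , mpAxis   :: row -> String     -- independent variable (stand-in for Select)
  , mpMerge  :: [focus] -> focus  -- reduction for each group
  }

-- Assumed pipeline: filter rows, preprocess them, group the focus
-- values by axis label, then merge each group.
runMiniPlan :: MiniPlan row focus -> [row] -> Map.Map String focus
runMiniPlan p rows =
  Map.map (mpMerge p) $
    Map.fromListWith (flip (++))
      [ (mpAxis p r, [mpFocus p r])
      | r <- map (mpMap p) (filter (mpFilter p) rows)
      ]

-- Rows here are (department, age in years) pairs, to keep things short.
example :: Map.Map String Double
example = runMiniPlan plan [("Sales", 30), ("Sales", 40), ("HR", 50)]
  where
    plan = MiniPlan
      { mpName   = "Average Employee Age"
      , mpFocus  = snd
      , mpFilter = const True
      , mpMap    = id
      , mpAxis   = fst
      , mpMerge  = \xs -> sum xs / fromIntegral (length xs)
      }
```

In this model, `example` maps `"HR"` to 50 and `"Sales"` to 35: one merged value per axis label, which is the table-ish shape described above.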