# Data

In most statically-typed functional languages like ML or Haskell, the
primary mechanism you have for data types is _named sums_. There are
two important aspects of data types in general: the way you
_construct_ them, and the way you _destruct_ them. Let's start with
how you construct them.

To define a type `T`, you generally list its possible constructors and
their arguments:

```
data T = I Int | B Bool
```

This example gives you two possible constructors, `I` and
`B`. Equivalently, you can think of this as defining a type `T`
and two injector functions, `I` and `B`:

```
T : Type
I : Int  -> T
B : Bool -> T
```

But that's not all you get, because you also need to be able to
_destruct_ the data in some way. Using the `case` construct, you can
determine _which_ constructor was used to create a particular datum of
type `T`, and extract the pieces of data that were packed into it:

```
f : T -> Int
f e = case e of
  I i -> i
  B b -> if b then 1 else 0
```

If we look at these definitions, it should be clear that

```
f (I n) == n
```

and that

```
f (B b) == if b then 1 else 0
```

All the constructors `I` and `B` do is inject data, and all the
function `f` does is extract that data and (in the case of `B`)
perform a trivial computation.
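
For concreteness, the same type and eliminator transcribe directly
into OCaml (a straightforward sketch; only the capitalization
conventions change):

```
type t = I of int | B of bool

let f (e : t) : int =
  match e with
  | I i -> i
  | B b -> if b then 1 else 0
```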

# Codata

So: data is generally defined by listing its constructors. What's the
opposite of data?  Why, codata, of course! And if data is defined
using constructors, then codata is defined using its _destructors_.
In this case, a _destructor_ is effectively an accessor to a piece of
data: some way of taking a value of a codata type and pulling useful
information from it.

Consider the following pseudocode codata declaration:

```
codata CT = CI Int & CB Bool
```

This gives you a type as well as two possible _destructors_, `CI` and
`CB`, which means you can think of them as two functions:

```
CT : Type
CI : CT -> Int
CB : CT -> Bool
```

They're not constructors, they're _accessor_ functions. How do we
create a value, then? Let's invent a dual form of `case`, which I'll
call `merge`. Instead of listing each possible constructor and
dispatching on it, we'll list each possible _destructor_, and
associate it with a value.

    cf : Int -> CT
    cf e = merge
        CI <- e
        CB <- if e /= 0 then True else False

How do we interpret this snippet of code? It defines a function `cf`
which takes an integer and returns a value of type `CT` (the opposite
of what `f` does using `case`). The `merge` construct lists each
possible destructor alongside the value that will be returned when
that destructor is applied to the result.

We can do the same algebraic manipulations as above, but this time
with the accessors. It should be the case that

```
CI (cf n) == n
```

and also that

```
CB (cf n) == if n /= 0 then True else False
```

Where data naturally maps to sums, codata naturally maps to
products. In an SML-like language, we can easily define the basic sum
type `a + b` in terms of the named-sum mechanism

    data a :+: b = InL a | InR b

but cannot do the same with products, as all constructors of a type
`T` must be of the form `a -> T` for some specific `a`; consequently,
we cannot define a product type using this named-sums mechanism unless
we already have a product type. (In Haskell, we can, because Haskell
allows its constructors to be curried: a constructor of the form
`a * b -> T` can be curried to `a -> b -> T`. But let us restrict
ourselves to single-argument constructors for this discussion.)

In a language with this `codata` mechanism, we could easily define that
product type in terms of its two destructors:

    codata a :*: b = Fst a & Snd b
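
In OCaml, say, this product is directly definable as a record whose
two fields play the role of the destructors (a sketch; the names
`prod`, `first`, `second`, `fst'`, and `snd'` are all mine):

```
type ('a, 'b) prod = { first : 'a; second : 'b }

(* the two destructors, as functions *)
let fst' p = p.first   (* ('a, 'b) prod -> 'a, the analogue of Fst *)
let snd' p = p.second  (* ('a, 'b) prod -> 'b, the analogue of Snd *)
```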

In practice, it's often convenient to express codata as records and
its destructors as projections from those records. Renaming our `CT`
example above to `MyType'` (and `cf` to `f'`), with destructors
`MyInt'` and `MyBool'`, it would look like

    codata MyType' = { MyInt' :: Int, MyBool' :: Bool }

    f' :: Int -> MyType'
    f' e = { MyInt' = e, MyBool' = if e /= 0 then True else False }

and, for a single-destructor type, constructing a value and then
immediately projecting from it is the identity:

    codata X' = { X' :: Int }
    i == { X' = i }.X'

but while the syntax here is different, the semantics remain identical.
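
OCaml's ordinary records realize this record presentation directly (a
sketch, with the names lowercased as OCaml requires):

```
type my_type' = { my_int' : int; my_bool' : bool }

let f' (e : int) : my_type' =
  { my_int' = e; my_bool' = e <> 0 }

(* (f' n).my_int' = n, and (f' n).my_bool' = (n <> 0) *)
```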

# Row Polymorphism

Note that each destructor here unambiguously identifies the type of the
codata it projects from: if we use `MyInt'`, then we are necessarily
projecting from a value of type `MyType'`. This is very much how
Haskell records work, but people generally would prefer something
more parametric for records. One system for doing this is called _row
polymorphism_.

In a row-polymorphic record system, your accessors don't name a specific
type, but rather, _some record type for which that accessor holds_. That
means an accessor like `MyInt'` would have a type like

    MyInt' :: (MyInt' Int <~ r) => r -> Int

This means that we can write functions which are agnostic as to the rest
of the fields in the records, so long as they contain the fields we want:

    h :: (MyInt' Int <~ r, MyBool' Bool <~ r) => r -> Int
    h e = if e.MyBool' then e.MyInt' else 0
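
This is not hypothetical: OCaml's object types are row-polymorphic in
exactly this sense. In the sketch below (names lowercased), the `..`
in the inferred type is the row variable, standing for whatever other
methods an object might carry:

```
let h e = if e#my_bool' then e#my_int' else 0
(* inferred: val h : < my_bool' : bool; my_int' : int; .. > -> int *)

(* an object with extra methods is accepted just fine: *)
let two = h (object
  method a = "foo"
  method b = 7.7
  method my_bool' = true
  method my_int' = 2
end)
```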

With a row-polymorphic `h`, we could supply a record like
`{ A = "foo", B = 7.7, MyBool' = True, MyInt' = 2 }` and everything
would be fine. Notice that we've dispensed entirely with the type
`MyType'`. This could, hypothetically, cause some usability problems
with respect to type inference: before, if we had omitted the type
declaration and misspelled `MyInt'` as `MyItn'`, our program would
have immediately failed to compile, because no such accessor existed;
but now it's possible for `h` to compile successfully and infer the
incorrect type

    h :: (MyItn' Int <~ r, MyBool' Bool <~ r) => r -> Int

and either fail to compile because of a failure of inference elsewhere,
or succeed with unintended semantics (e.g. if `h` is an exposed library
function that is not used elsewhere in the module).

Additionally, row-polymorphic records will incur a higher runtime cost
than our previous treatment. If a record is unambiguously identified by
a set of destructors, then we can compile those records as a fixed-size
chunk of memory with known offsets for each destructor. This is not the
case with row-polymorphic records.

Now we've developed codata as the dual of data and identified the record-like
representation of codata. What's the dual of row-polymorphic records?

# Dualizing Row Polymorphism

For convenience's sake, let's restate row polymorphism in our earlier
syntax. Each accessor `A`, when used to extract a value of
type `t`, has the type

    A :: (A t <~ r) => r -> t

and we construct row-polymorphic codata with a set of destructors and our
`merge` construct, which infers the type produced by the set of destructors
it is given:

    g :: Int -> { A :: Int & B :: Bool }
    g e =
      merge
        A <- e
        B <- if e == 1 then True else False

That means that the dual of row polymorphism would have polymorphic
constructors, so that a constructor `C`, when used with a value of
type `t`, has the type

    C :: (C t ~> s) => t -> s

and our `case` construct infers the type _destructed_ by the constructors
it matches over:

    g' :: [ A :: Int | B :: Bool ] -> Int
    g' e =
      case e of
        A i -> i
        B b -> if b then 1 else 0

As it turns out, this already exists in OCaml, where data types of this
sort are called _polymorphic variants_.
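
Our `g'` transcribes almost directly (a sketch; OCaml writes these
variant tags with a leading backtick, and infers the variant type from
the match itself):

```
let g' e = match e with
  | `A i -> i
  | `B b -> if b then 1 else 0

(* inferred: val g' : [< `A of int | `B of bool ] -> int *)
```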
The problems I mentioned above with row-polymorphic records apply here
as well: if I match over `MyItn` instead of `MyInt`, then that function
might compile and its type will fail to unify elsewhere, or it might
be fine. This is particularly insidious with catch-all patterns: a
programmer who wrote the following and missed the typo might believe
that `x` is `True` when it is in fact `False`:

    isInt e =
      case e of
        MyItn _ -> True
        _       -> False

    x = isInt (MyInt 5)
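
In real OCaml, this pitfall compiles without complaint (a sketch, with
`isInt` renamed to `is_int`):

```
let is_int e = match e with
  | `MyItn _ -> true      (* typo: should have been `MyInt *)
  | _        -> false

let x = is_int (`MyInt 5)   (* x is false, not true *)
```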

This could be mitigated by forbidding catch-all patterns in a `case`.
The definition below would still compile, but its application to
`MyInt 5` would not even type-check:

    isInt e =
      case e of
        MyItn _  -> True
        MyBool _ -> False

    x = isInt (MyInt 5)
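
OCaml's polymorphic variants behave exactly this way once the match is
exhaustive over named tags (a sketch):

```
let is_int e = match e with
  | `MyItn _  -> true
  | `MyBool _ -> false

(* val is_int : [< `MyItn of 'a | `MyBool of 'b ] -> bool *)

let x = is_int (`MyInt 5)   (* type error: `MyInt is not accepted *)
```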

Like row-polymorphic records, these sums are also harder to represent at
runtime than traditional sums.

# Are They Useful?

Yes. In their own way.