gdritter repos when-computer / master drafts / dec-types.telml
master

Tree @master (Download .tar.gz)

dec-types.telml @masterraw · history · blame

\meta{( "early-algebraic-data-types" "early algebraic data types" ("programming") 1451674812)}
Algebraic data types are a very convenient and expressive way
of modeling data. You're familiar with them if you've ever
used any ML-like language: you have \em{products},
which are pairs or records:

\code{("hello", 7) : String * Int}

and \em{sums}, which are discriminated unions:

\code{Left "hello" : String + Int}
\code{Right 7      : String + Int}

Usually, modern ML-like languages provide a mechanism for
creating \em{named} sums, where you provide your own names for
each discriminant, or \em{constructor}:

\code{\ttkw{data} \ttcn{IntList} = \ttcn{Cons} (Int * IntList) \| \ttcn{Nil}}

You also have the so-called \em{unit} or \em{top} type, which
is the type with a single possible value, usually written \tt{()},
andat least in the underlying theorythe so-called \em{bottom}
type, which has no values. The \em{top} and \em{bottom} types
are sometimes written with the symbols \\(\\top\\) and \\(\\bot\\), and sometimes
with the numerals \tt{1} and \tt{0}\ref{alg}.
\sidenote{
This is because
\link{http://chris-taylor.github.io/blog/2013/02/10/the-algebra-of-algebraic-data-types/|they act as the identity elements in the algebra of types.}
The types \tt{a+0} and \tt{0+a} are both isomorphic to \tt{a}.
Likewise, \tt{a*1} and \tt{1*a} are also both isomorphic to \tt{a}.
}

Algebraic data types first appeared as a usable feature in the
\link{https://en.wikipedia.org/wiki/Hope_(programming_language)|Hope programming language},
which was a very early statically typed functional
language. Despite some syntactic weirdness, they worked very
much like modern algebraic data types: the operator for product
types was written as \tt{#} instead of \tt{*}, and sum constructors
were written with \tt{++} in between variants, but
\link{https://web.archive.org/web/20140825021150/http://www.soi.city.ac.uk/~ross/Hope/hope_tut/node18.html|Hope type declarations}
still look remarkably familiar 45 years later:

\code{\ttkw{data} \ttcn{numlist} == \ttcn{cons} ( num # numlist ) ++ \ttcn{nil} ;}

Pattern-matching on those types also looks remarkably similar to
modern functional languages, complete with definition-level
pattern-matching as found in languages like Haskell:

\code{\ttkw{dec} \ttcn{sum} : numlist -> num ;
--- \ttcn{sum} ( nil )            <= 0 ;
--- \ttcn{sum} ( cons ( x, xs ) ) <= x + sum xs ;}

Algebraic data types also featured in early ML implemenations, but in a
very awkward way. The first DEC-10 ML implementation didn't have a
mechanism for defining \em{named} sums; instead, you had to use
the built-in sum operator \tt{+} with its constructors\ref{syn}:
\sidenote{The symbols \tt{*} and \tt{**} in these signatures are
type variables, like \tt{'a} in modern versions of ML.}

\code{\ttcn{inl} : *  -> * + **
\ttcn{inr} : ** -> * + **}

You couldn't pattern-match on those values: you had to use the
accessor functions \tt{outl} and \tt{outr} instead, which would throw
exceptions if you tried to access the wrong value. You could
explicitly check the tag with the function \tt{isl}:

\code{\ttcn{outl} : * + ** -> *
\ttcn{outr} : * + ** -> **
\ttcn{isl}  : * + ** -> bool}

We could create a data type by giving the algebraic
representation of the type in terms of products, sums, and existing
types. Declarations used the \tt{\ttkw{abstype}} or
\tt{\ttkw{absrectype}} keywords, depending on whether the type was
recursive or not. Like Hope, ML used \tt{#} for pairs, and
also used the symbol \tt{.} to stand for \tt{unit}:

\code{\ttkw{absrectype} \ttcn{intlist} = (int # intlist) + . ;;}

This definition would create a new type as well as two
functions, \tt{absintlist} and \tt{repintlist}, with
the following types:

\code{\ttcn{absintlist} : ((int # intlist) + .) -> intlist
\ttcn{repintlist} : intlist -> ((int # intlist) + .)}

Whenever we defined a type \tt{T} as isomorphic to a
representation \tt{R}, the language would
define the functions \tt{absT : R -> T} and
\tt{repT : T -> R}, filling in whatever particular type
name and representation we give.
Those functions allow us to, for example, construct values
of type \tt{intlist} by "wrapping" a value of the
representation type:

\code{\ttkw{val} \ttcn{mylist} = absintlist(inl(1, absintlist(inr ())));;}

Of course, constructing all our values like that would be
a nightmare, so DEC-10 ML programs often used the
\tt{with} construct.
The \tt{with}
keyword is preceeded by a set of declarations and followed
by a different set of declarations. The value definitions
before the \tt{with} keyword are available to the
rest of the declarations in the \tt{with} construct, but
only the ones after the \tt{with} keyword are introduced
into the surrounding scope. In our case, this allows us
to write a type with a set of constructor and accessor
functions, but keep the \tt{abs} and \tt{rep} functions
hidden:

\code{\ttkw{absrectype} \ttcn{intlist} = (int # intlist) + .
   \ttkw{with} \ttcn{cons}(x, xs) = absintlist(inl(x, xs))
    \ttkw{and} \ttcn{nil}         = absintlist(inr())
    \ttkw{and} \ttcn{head} lst    = fst(outl(repintlist lst))
    \ttkw{and} \ttcn{tail} lst    = snd(outl(repintlist lst))
    \ttkw{and} \ttcn{isempty} lst = not(isl(repintlist lst));;}

The semantics of \tt{with} ensure that the \tt{absintlist} and
\tt{repintlist} functions don't escape the declaration block,
and lists can \em{only} be constructed the user-written
constructor functions. We can then construct our values much more
tersely:

\code{\ttkw{val} \ttcn{mylist} = cons(1, nil)}

The successor to DEC-10 ML, VAX ML\ref{vax}, did include the same datatype
\sidenote{
DEC-10 ML and VAX ML are both collectively referred to as Edinburgh ML.
The names come via Luca Cardelli, who wrote VAX ML as a successor to and
\link{http://sml-family.org/history/macqueen-lucafest.pdf|improvement on} the original DEC-10 ML.
}
mechanism as described above,
but also included both named sums and named products. Named products
corresponded to what are called \em{structs} or \em{records}, but
unlike structs in languages like C or Pascal, they didn't need to
have their types named or declared, as structural descriptions of
the types would suffice. The VAX ML syntax used
so-called "decorated parentheses":

\code{(\| a = "foo"; b = 7 \|) : (\| a: string; b: int \|)}

The named sum mechanism \em{also} did not need any declarations: the
type instead was written as the set of possible constructor names along
with their argument types:

\code{[\| c = "foo" \|] : [\| c: string; d: int \|]
[\| d = 7     \|] : [\| c: string; d: int \|]}

Unlike sums and products in DEC-10 ML, these sums and products
\em{could} be picked apart by pattern matching, using a syntax
very similar to the syntax used to create the values:

\code{\ttkw{case} e : [\| c: string; d: int \|] \ttkw{of}
  [\| c = s . 0;
     d = n . n + n
   \|]}

The effort to create Standard ML started in the early 80's, and
it drew on features from both Hope and the existing ML languages. In
\link{http://sml-family.org/history/SML-proposal-11-83.pdf|\em{A Proposal for Standard ML}}, Robin Milner explains that:

\blockquote{
a difficult decision had to be made concerning HOPE's treatment of
data types—present only in embryonic form in the original ML—and
the labelled records and variants which Cardelli introduced in his
VAX version. The latter have definite advantages which the former
lack; on the other hand, the HOPE treatment is well-rounded in its
own terms. Though a combination of these features is possible, it
seemed (at least to me, but some disagreed!) to entail too rich a
language for the present proposal. Thus the HOPE treatment is
fully adopted here.
}

Consequently, the first versions of SML drew their algebraic data
types from Hope's implemenation, abandoning the tedious and difficult
DEC-10 ML versions. Later versions did end up drawing on Cardelli's
named products to create records, with a change in notation
from \tt{(\| ... \|)} to \tt{ \{ ... \} }, and thus we get the
algebraic data types we all know and love.