| 1 |
\meta{( "early-algebraic-data-types" "early algebraic data types" ("programming"))}
|
| 2 |
Algebraic data types are a very convenient and expressive way
|
| 3 |
of modeling data. You're familiar with them if you've ever
|
| 4 |
used any ML-like language: you have \em{products},
|
| 5 |
which are pairs or records:
|
| 6 |
|
| 7 |
\code{("hello", 7) : String * Int}
|
| 8 |
|
| 9 |
and \em{sums}, which are discriminated unions:
|
| 10 |
|
| 11 |
\code{Left "hello" : String + Int}
|
| 12 |
\code{Right 7 : String + Int}
|
| 13 |
|
| 14 |
Usually, modern ML-like languages provide a mechanism for
|
| 15 |
creating \em{named} sums, where you provide your own names for
|
| 16 |
each discriminant, or \em{constructor}:
|
| 17 |
|
| 18 |
\code{\ttkw{data} \ttcn{IntList} = \ttcn{Cons} (Int * IntList) \| \ttcn{Nil}}
|
| 19 |
|
| 20 |
You also have the so-called \em{unit} or \em{top} type, which
|
| 21 |
is the type with a single possible value, usually written \tt{()},
|
| 22 |
and—at least in the underlying theory—the so-called \em{bottom}
|
| 23 |
type, which has no values. The \em{top} and \em{bottom} types
|
| 24 |
are sometimes written with the symbols \\(\\top\\) and \\(\\bot\\), and sometimes
|
| 25 |
with the numerals \tt{1} and \tt{0}\ref{alg}.
|
| 26 |
\sidenote{
|
| 27 |
This is because
|
| 28 |
\link{http://chris-taylor.github.io/blog/2013/02/10/the-algebra-of-algebraic-data-types/|they act as the identity elements in the algebra of types.}
|
| 29 |
The types \tt{a+0} and \tt{0+a} are both isomorphic to \tt{a}.
|
| 30 |
Likewise, \tt{a*1} and \tt{1*a} are also both isomorphic to \tt{a}.
|
| 31 |
}
|
| 32 |
|
| 33 |
Algebraic data types first appeared as a usable feature in the
|
| 34 |
\link{https://en.wikipedia.org/wiki/Hope_(programming_language)|Hope programming language},
|
| 35 |
which was a very early statically typed functional
|
| 36 |
language. Despite some syntactic weirdness, they worked very
|
| 37 |
much like modern algebraic data types: the operator for product
|
| 38 |
types was written as \tt{#} instead of \tt{*}, and sum constructors
|
| 39 |
were written with \tt{++} in between variants, but
|
| 40 |
\link{https://web.archive.org/web/20140825021150/http://www.soi.city.ac.uk/~ross/Hope/hope_tut/node18.html|Hope type declarations}
|
| 41 |
still look remarkably familiar 45 years later:
|
| 42 |
|
| 43 |
\code{\ttkw{data} \ttcn{numlist} == \ttcn{cons} ( num # numlist ) ++ \ttcn{nil} ;}
|
| 44 |
|
| 45 |
Pattern-matching on those types also looks remarkably similar to
|
| 46 |
modern functional languages, complete with definition-level
|
| 47 |
pattern-matching as found in languages like Haskell:
|
| 48 |
|
| 49 |
\code{\ttkw{dec} \ttcn{sum} : numlist -> num ;
|
| 50 |
--- \ttcn{sum} ( nil ) <= 0 ;
|
| 51 |
--- \ttcn{sum} ( cons ( x, xs ) ) <= x + sum xs ;}
|
| 52 |
|
| 53 |
Algebraic data types also featured in early ML implemenations, but in a
|
| 54 |
very awkward way. The first DEC-10 ML implementation didn't have a
|
| 55 |
mechanism for defining \em{named} sums; instead, you had to use
|
| 56 |
the built-in sum operator \tt{+} with its constructors\ref{syn}:
|
| 57 |
\sidenote{The symbols \tt{*} and \tt{**} in these signatures are
|
| 58 |
type variables, like \tt{'a} in modern versions of ML.}
|
| 59 |
|
| 60 |
\code{\ttcn{inl} : * -> * + **
|
| 61 |
\ttcn{inr} : ** -> * + **}
|
| 62 |
|
| 63 |
You couldn't pattern-match on those values: you had to use the
|
| 64 |
accessor functions \tt{outl} and \tt{outr} instead, which would throw
|
| 65 |
exceptions if you tried to access the wrong value. You could
|
| 66 |
explicitly check the tag with the function \tt{isl}:
|
| 67 |
|
| 68 |
\code{\ttcn{outl} : * + ** -> *
|
| 69 |
\ttcn{outr} : * + ** -> **
|
| 70 |
\ttcn{isl} : * + ** -> bool}
|
| 71 |
|
| 72 |
We could create a data type by giving the algebraic
|
| 73 |
representation of the type in terms of products, sums, and existing
|
| 74 |
types. Declarations used the \tt{\ttkw{abstype}} or
|
| 75 |
\tt{\ttkw{absrectype}} keywords, depending on whether the type was
|
| 76 |
recursive or not. Like Hope, ML used \tt{#} for pairs, and
|
| 77 |
also used the symbol \tt{.} to stand for \tt{unit}:
|
| 78 |
|
| 79 |
\code{\ttkw{absrectype} \ttcn{intlist} = (int # intlist) + . ;;}
|
| 80 |
|
| 81 |
This definition would create a new type as well as two
|
| 82 |
functions, \tt{absintlist} and \tt{repintlist}, with
|
| 83 |
the following types:
|
| 84 |
|
| 85 |
\code{\ttcn{absintlist} : ((int # intlist) + .) -> intlist
|
| 86 |
\ttcn{repintlist} : intlist -> ((int # intlist) + .)}
|
| 87 |
|
| 88 |
Whenever we defined a type \tt{T} as isomorphic to a
|
| 89 |
representation \tt{R}, the language would
|
| 90 |
define the functions \tt{absT : R -> T} and
|
| 91 |
\tt{repT : T -> R}, filling in whatever particular type
|
| 92 |
name and representation we give.
|
| 93 |
Those functions allow us to, for example, construct values
|
| 94 |
of type \tt{intlist} by "wrapping" a value of the
|
| 95 |
representation type:
|
| 96 |
|
| 97 |
\code{\ttkw{val} \ttcn{mylist} = absintlist(inl(1, absintlist(inr ())));;}
|
| 98 |
|
| 99 |
Of course, constructing all our values like that would be
|
| 100 |
a nightmare, so DEC-10 ML programs often used the
|
| 101 |
\tt{with} construct.
|
| 102 |
The \tt{with}
|
| 103 |
keyword is preceeded by a set of declarations and followed
|
| 104 |
by a different set of declarations. The value definitions
|
| 105 |
before the \tt{with} keyword are available to the
|
| 106 |
rest of the declarations in the \tt{with} construct, but
|
| 107 |
only the ones after the \tt{with} keyword are introduced
|
| 108 |
into the surrounding scope. In our case, this allows us
|
| 109 |
to write a type with a set of constructor and accessor
|
| 110 |
functions, but keep the \tt{abs} and \tt{rep} functions
|
| 111 |
hidden:
|
| 112 |
|
| 113 |
\code{\ttkw{absrectype} \ttcn{intlist} = (int # intlist) + .
|
| 114 |
\ttkw{with} \ttcn{cons}(x, xs) = absintlist(inl(x, xs))
|
| 115 |
\ttkw{and} \ttcn{nil} = absintlist(inr())
|
| 116 |
\ttkw{and} \ttcn{head} lst = fst(outl(repintlist lst))
|
| 117 |
\ttkw{and} \ttcn{tail} lst = snd(outl(repintlist lst))
|
| 118 |
\ttkw{and} \ttcn{isempty} lst = not(isl(repintlist lst));;}
|
| 119 |
|
| 120 |
The semantics of \tt{with} ensure that the \tt{absintlist} and
|
| 121 |
\tt{repintlist} functions don't escape the declaration block,
|
| 122 |
and lists can \em{only} be constructed the user-written
|
| 123 |
constructor functions. We can then construct our values much more
|
| 124 |
tersely:
|
| 125 |
|
| 126 |
\code{\ttkw{val} \ttcn{mylist} = cons(1, nil)}
|
| 127 |
|
| 128 |
The successor to DEC-10 ML, VAX ML\ref{vax}, did include the same datatype
|
| 129 |
\sidenote{
|
| 130 |
DEC-10 ML and VAX ML are both collectively referred to as Edinburgh ML.
|
| 131 |
The names come via Luca Cardelli, who wrote VAX ML as a successor to and
|
| 132 |
\link{http://sml-family.org/history/macqueen-lucafest.pdf|improvement on} the original DEC-10 ML.
|
| 133 |
}
|
| 134 |
mechanism as described above,
|
| 135 |
but also included both named sums and named products. Named products
|
| 136 |
corresponded to what are called \em{structs} or \em{records}, but
|
| 137 |
unlike structs in languages like C or Pascal, they didn't need to
|
| 138 |
have their types named or declared, as structural descriptions of
|
| 139 |
the types would suffice. The VAX ML syntax used
|
| 140 |
so-called "decorated parentheses":
|
| 141 |
|
| 142 |
\code{(\| a = "foo"; b = 7 \|) : (\| a: string; b: int \|)}
|
| 143 |
|
| 144 |
The named sum mechanism \em{also} did not need any declarations: the
|
| 145 |
type instead was written as the set of possible constructor names along
|
| 146 |
with their argument types:
|
| 147 |
|
| 148 |
\code{[\| c = "foo" \|] : [\| c: string; d: int \|]
|
| 149 |
[\| d = 7 \|] : [\| c: string; d: int \|]}
|
| 150 |
|
| 151 |
Unlike sums and products in DEC-10 ML, these sums and products
|
| 152 |
\em{could} be picked apart by pattern matching, using a syntax
|
| 153 |
very similar to the syntax used to create the values:
|
| 154 |
|
| 155 |
\code{\ttkw{case} e : [\| c: string; d: int \|] \ttkw{of}
|
| 156 |
[\| c = s . 0;
|
| 157 |
d = n . n + n
|
| 158 |
\|]}
|
| 159 |
|
| 160 |
The effort to create Standard ML started in the early 80's, and
|
| 161 |
it drew on features from both Hope and the existing ML languages. In
|
| 162 |
\link{http://sml-family.org/history/SML-proposal-11-83.pdf|\em{A Proposal for Standard ML}}, Robin Milner explains that:
|
| 163 |
|
| 164 |
\blockquote{
|
| 165 |
…a difficult decision had to be made concerning HOPE's treatment of
|
| 166 |
data types—present only in embryonic form in the original ML—and
|
| 167 |
the labelled records and variants which Cardelli introduced in his
|
| 168 |
VAX version. The latter have definite advantages which the former
|
| 169 |
lack; on the other hand, the HOPE treatment is well-rounded in its
|
| 170 |
own terms. Though a combination of these features is possible, it
|
| 171 |
seemed (at least to me, but some disagreed!) to entail too rich a
|
| 172 |
language for the present proposal. Thus the HOPE treatment is
|
| 173 |
fully adopted here.
|
| 174 |
}
|
| 175 |
|
| 176 |
Consequently, the first versions of SML drew their algebraic data
|
| 177 |
types from Hope's implemenation, abandoning the tedious and difficult
|
| 178 |
DEC-10 ML versions. Later versions did end up drawing on Cardelli's
|
| 179 |
named products to create records, with a change in notation
|
| 180 |
from \tt{(\| ... \|)} to \tt{ \{ ... \} }, and thus we get the
|
| 181 |
algebraic data types we all know and love.⏎
|