I want to get away from syntax (because really, who actually likes
bikeshedding?), but first I want to give the details of a few
syntaxes I have implemented: three different
languages for markup and data storage that I've used in personal
projects, split apart into standalone implementations in case
they're useful for anyone else.

I didn't invent any of these formats, but I did implement and
formalize them, which implicates me at least a little bit in
their creation. Of course, whenever you invent a markup/data
storage format, you should ask yourself two questions:

1. What purpose does this serve that isn't served by an existing
format?
2. What purpose does this serve that isn't served by
[S-expressions](http://en.wikipedia.org/wiki/S-expression)
specifically?

For each of these formats, I will include the following:

- A description of the abstract (parsed) representation that
  the format describes, as well as an example of the format
  and its corresponding abstract structure.
- Answers to the above questions: i.e., why does this format
  exist, and why not use an existing format for the same
  purpose?
- A _formal grammar_ which defines the language. There should
  be no ambiguity in the resulting grammar, and in particular,
  it should be the case that, for all `x`, `x = decode(encode(x))`.
  Note that the converse needn't hold: `encode(decode(s))` may differ
  from the source text `s` (e.g. because of comments, indentation, &c.).
- A C implementation that relies on _no external libraries_.
  This _includes_ libc, which means that the implementations
  _do not rely on the presence of_ `malloc`. This is done
  by only allowing access to the parsed structure through a
  callback, and threading the parsed structure through the
  stack.
- A C implementation that relies on _no external libraries_
  but expects the user to pass in their own allocator, which
  can (and will be, in most cases) `malloc`, but doesn't need
  to be.
- A Haskell implementation whose only imports outside `base` are
  the packages that supply types exposed by its API,
  e.g. `ByteString`, `Text`, `Vector`, and in one case, `XML`.
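
The round-trip requirement in the grammar bullet above can be illustrated with a small sketch. It uses Python's standard `json` module purely as a stand-in for one of these formats; the helper name `round_trips` is hypothetical and not part of any implementation described here.

```python
import json


def round_trips(value, encode=json.dumps, decode=json.loads):
    """Return True when decoding the encoded value gives the value back."""
    return decode(encode(value)) == value


# Values survive an encode/decode cycle...
sample = {"a": [1, 2, 3], "b": "text"}
value_ok = round_trips(sample)

# ...but source text need not survive a decode/encode cycle, because
# the decoder throws away insignificant whitespace.
source = '{ "a" :   1 }'
text_ok = json.dumps(json.loads(source)) == source
```

Here `value_ok` is true while `text_ok` is false, which is the asymmetry the bullet describes.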

# NML

## Structure

All NML documents have the same abstract structure as an
XML document. The following NML document:

~~~~
<one<two|three>|
  four
  <five|six>
>
~~~~

corresponds to the following XML document:

~~~~
<one two="three">
  four
  <five>six</five>
</one>
~~~~

## Why

1. NML is a format originally proposed by Erik Naggum, who is
well-known for his vituperative rants in support of Common
Lisp and against XML. NML was originally intended as a
demonstration of why you _don't_ need an attribute/element
distinction, but variations on it were briefly adopted by
some people as a somewhat more human-friendly way of writing
XML.

    It was never fleshed out into a full specification by
Naggum—and indeed, most of the users modified it in some
way or other—but I have followed Naggum's examples and extended
it with specifications that deal with the full generality of
XML, including features like namespaces. (Hence, GNML, or
Generalized NML.) Finally, I specified
that GNML allows for Unicode input and in fact does not understand
character references at all: escaped Unicode characters in GNML
correspond to character references in XML, and vice versa.

    My original use for GNML was as a tool for viewing and
producing XML, i.e. I could write GNML and convert it on the fly
to valid XML that tools could understand, and dump XML
as pleasantly human-readable GNML. If your data will never be
seen by human eyes, it would be better to simply use XML: the
tools for working with XML are far more numerous than tools
for working with my particular dialect of GNML.

2. If you can use S-Expressions instead, then use S-Expressions.
It is in fact trivial to map S-Expressions to XML and vice
versa, and almost every language has mature libraries for
parsing and emitting S-Expressions.
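
To make the S-Expression mapping concrete, here is a small sketch that represents S-Expressions as nested Python lists and converts them to and from `xml.etree.ElementTree` elements. The encoding is an assumption borrowed from the SXML convention (a leading `@` list holds attributes); the function names `sexp_to_xml` and `xml_to_sexp` are hypothetical.

```python
import xml.etree.ElementTree as ET


def sexp_to_xml(sexp):
    """Convert [tag, ["@", [key, value], ...], child, ...] to an Element."""
    tag, *rest = sexp
    elem = ET.Element(tag)
    last = None  # most recent child element; later text becomes its tail
    for item in rest:
        if isinstance(item, str):
            if last is None:
                elem.text = (elem.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        elif item and item[0] == "@":
            for key, value in item[1:]:
                elem.set(key, value)
        else:
            last = sexp_to_xml(item)
            elem.append(last)
    return elem


def xml_to_sexp(elem):
    """Convert an Element back to the nested-list form."""
    sexp = [elem.tag]
    if elem.attrib:
        sexp.append(["@", *([k, v] for k, v in elem.attrib.items())])
    if elem.text:
        sexp.append(elem.text)
    for child in elem:
        sexp.append(xml_to_sexp(child))
        if child.tail:
            sexp.append(child.tail)
    return sexp


doc = ["one", ["@", ["two", "three"]], "four", ["five", "six"]]
as_xml = ET.tostring(sexp_to_xml(doc), encoding="unicode")
back = xml_to_sexp(sexp_to_xml(doc))
```

One consequence of this particular encoding is that an element literally named `@` cannot be represented; a real mapping would need to pick its attribute marker with that in mind.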

## Alternatives

- [XML]()
- [S-Expressions]()

## Formal Grammar
## Implementations

# NDBL

## Structure

All NDBL documents take the form of a sequence of sequences
of key-value pairs, where a _key_ is a string of non-whitespace
characters and a _value_ is any string. NDBL source files must
be UTF-8 encoded. The following NDBL document:

~~~~
a=b # x
  c="d e"
f= g=h
~~~~

corresponds to the following abstract structure expressed
in Python syntax:

~~~~
[ [ ("a", "b"), ("c", "d e") ]
, [ ("f", ""), ("g", "h") ]
]
~~~~

## What

NDBL keys are always unquoted strings of Unicode characters
which contain neither whitespace nor the character `=`, but
may contain other punctuation. NDBL values are strings, which
may be unquoted (in which case they are terminated by
whitespace) or may be quoted (in which case they are terminated
by the first quotation mark character not preceded by a
backslash). Comments are started by whitespace followed by a
pound sign `#` and continue to the end of a line. A new group
is started by a key-value pair at the beginning of a line; so
long as subsequent key-value pairs do not start immediately
after a newline, they are added to the previous group.
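
The rules above are small enough to sketch as a hand-written scanner. The following Python is an illustration under stated assumptions, not the C or Haskell implementation: it treats a `#` at the very start of a line as a comment too, it treats a backslash inside a quoted value as making the next character literal, and the names `parse_ndbl` and `_pair` are hypothetical.

```python
def _pair(src, pos):
    """Parse one key=value pair starting at pos; return (key, value, new_pos)."""
    n = len(src)
    start = pos
    while pos < n and src[pos] != "=" and not src[pos].isspace():
        pos += 1
    key = src[start:pos]
    if not key or pos >= n or src[pos] != "=":
        raise ValueError(f"expected key=value at offset {start}")
    pos += 1
    if pos < n and src[pos] == '"':
        # Quoted value: ends at the first quote not preceded by a backslash.
        pos += 1
        chars = []
        while pos < n and src[pos] != '"':
            if src[pos] == "\\" and pos + 1 < n:
                pos += 1  # assumption: backslash makes the next char literal
            chars.append(src[pos])
            pos += 1
        if pos >= n:
            raise ValueError("unterminated quoted value")
        return key, "".join(chars), pos + 1
    # Unquoted value: runs until whitespace (and may be empty).
    start = pos
    while pos < n and not src[pos].isspace():
        pos += 1
    return key, src[start:pos], pos


def parse_ndbl(src):
    """Return a list of groups, each a list of (key, value) pairs."""
    groups = []
    pos, n = 0, len(src)
    at_line_start = True
    while pos < n:
        ch = src[pos]
        if ch == "\n":
            at_line_start = True
            pos += 1
        elif ch.isspace():
            at_line_start = False
            pos += 1
        elif ch == "#":
            # Comment: skip to (but not past) the end of the line.
            while pos < n and src[pos] != "\n":
                pos += 1
        else:
            key, value, pos = _pair(src, pos)
            if at_line_start or not groups:
                groups.append([])  # a pair at column zero opens a new group
            groups[-1].append((key, value))
            at_line_start = False
    return groups


example = 'a=b # x\n  c="d e"\nf= g=h\n'
parsed = parse_ndbl(example)
```

Run on the example document from the Structure section, this produces the two groups shown there.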

## Why

1. NDBL is so-named because it is based on the configuration
format of Plan 9's NDB utility, with some minor modifications.
NDBL is a _configuration_ format, not a data storage or markup
format, and as such does not have a complicated internal
structure. In fact, the grammar of NDBL is _regular_, and could
be recognized by a regular expression. (It cannot be _parsed_
by a regular expression as that would require the ability to
perform a submatch under a Kleene star.)

    The simplest configuration format is really just key-value
pairs (which is one of the reasons that environment variables
are often used for configuration). NDBL adds only a bare
minimum on top of that: it adds a notion of grouping to
sequences of key-value pairs. Crucially, the groups are _not_
maps, because the same key may appear more than once:

    ~~~~
    # a single group with repeated keys:
    user=terry
      file=foo.txt
      file=bar.txt
      file=baz.txt
    ~~~~

    The reason NDBL might be desirable is that many other
configuration formats are _very_ heavyweight. TOML was
originally created as a response to the complexity of YAML,
and TOML is itself quite complicated. Simpler formats,
such as JSON, lack the ability to have comments (a must for
configuration), and many other configuration formats are
ad-hoc and have no formal specification (e.g. INI, which
TOML greatly resembles). NDBL was codified as a
response to this.

2. You should probably just use S-Expressions.
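
Two of the points from item 1 can be checked in a few lines of Python. The first part shows why a group is a list of pairs rather than a map; the second shows why a regular expression can recognize a run of pairs but cannot parse it. The pattern used is a deliberately simplified stand-in for the NDBL grammar, and `lookup_all` is a hypothetical helper.

```python
import re

# A group with repeated keys, as in the user/file example.
group = [
    ("user", "terry"),
    ("file", "foo.txt"),
    ("file", "bar.txt"),
    ("file", "baz.txt"),
]


def lookup_all(group, key):
    """Return every value stored under key, in document order."""
    return [value for k, value in group if k == key]


files = lookup_all(group, "file")

# Converting the group to a map silently keeps only the last "file".
as_map = dict(group)

# A simplified pair pattern under a Kleene star matches the whole
# input, but the capture groups only remember the final iteration.
match = re.fullmatch(r"(?:(\w+)=(\w*)\s*)*", "a=b c=d")
last_only = match.groups()
```

After running this, `files` holds all three file names, `as_map["file"]` holds only `baz.txt`, and `last_only` holds only the final pair, `("c", "d")`.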

## Alternatives

- [YAML]()
- [INI]()
- [TOML]()
- [S-Expressions]()

## Formal Grammar
## Implementations

# TeLML

## Structure

All TeLML documents take the form of sequences of _fragments_,
where a fragment is either a chunk of Unicode text or a
_TeLML tag_, which consists of a tag name and a sequence
of sequences of fragments.

The following TeLML document:

~~~~
a \b c \d{e,\f{}, g }
~~~~

corresponds to the following abstract structure expressed
in Python syntax:

~~~~
[ "a "
, Tag(name="b")
, " c "
, Tag(name="d",
      args=[ "e"
           , Tag(name="f",
                 args=[""])
           , " g "
           ])
]
~~~~

## What

If a document never contains any instance of the characters
`\`, `{`, or `}`, then it is a TeLML document of a single text
fragment. Tags with an empty argument list are written as
`\tagname`, while tags with an argument list are followed by
curly braces, with individual arguments separated by commas,
e.g., `\tagname{arg1,arg2,...,argn}`. Curly braces without
a preceding tag are used for grouping, and do not appear in
the parsed structure; this may sometimes be useful to
separate argument-less tags from adjacent text fragments:
`\foobar` is a single tag called `foobar`, and `\foo{bar}`
is a single tag `foo` with an argument `bar`, but
`{\foo}bar` is two fragments: a nullary tag `foo` followed
by a text fragment `bar`.
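
These rules can be sketched as a short recursive-descent parser. The Python below is an illustration under several assumptions: tag names consist of alphanumerics and underscores, there is no escape syntax, commas separate arguments only directly inside a tag's braces, and each argument is kept as its own list of fragments (so `\f{}` yields one empty argument, written `[[]]` rather than the `[""]` shown in the structure above). The names `Tag`, `parse_telml`, `_fragments`, `_tag`, and `_add` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    args: list = field(default_factory=list)  # list of fragment lists


def _add(frags, frag):
    """Append a fragment, merging adjacent text fragments."""
    if isinstance(frag, str) and frags and isinstance(frags[-1], str):
        frags[-1] += frag
    else:
        frags.append(frag)


def _tag(src, pos):
    """Parse a tag name and optional {arg,...} list; pos is just past the backslash."""
    start = pos
    while pos < len(src) and (src[pos].isalnum() or src[pos] == "_"):
        pos += 1
    if pos == start:
        raise ValueError(f"expected a tag name at offset {start}")
    tag = Tag(src[start:pos])
    if pos < len(src) and src[pos] == "{":
        pos += 1
        while True:
            arg, pos = _fragments(src, pos, ",}")
            tag.args.append(arg)
            if pos >= len(src):
                raise ValueError("unterminated argument list")
            if src[pos] == "}":
                pos += 1
                break
            pos += 1  # skip the comma between arguments
    return tag, pos


def _fragments(src, pos, stops):
    """Parse fragments until end of input or an unconsumed character in stops."""
    frags = []
    while pos < len(src) and src[pos] not in stops:
        ch = src[pos]
        if ch == "\\":
            tag, pos = _tag(src, pos + 1)
            frags.append(tag)
        elif ch == "{":
            # Bare braces only group: splice their contents in place.
            inner, pos = _fragments(src, pos + 1, "}")
            if pos >= len(src):
                raise ValueError("unterminated group")
            pos += 1
            for frag in inner:
                _add(frags, frag)
        elif ch == "}":
            raise ValueError(f"unbalanced '}}' at offset {pos}")
        else:
            start = pos
            while pos < len(src) and src[pos] not in "\\{}" + stops:
                pos += 1
            _add(frags, src[start:pos])
    return frags, pos


def parse_telml(src):
    """Parse a whole TeLML document into a list of fragments."""
    frags, _ = _fragments(src, 0, "")
    return frags


example = parse_telml(r"a \b c \d{e,\f{}, g }")
separated = parse_telml(r"{\foo}bar")
```

With this sketch, `\foobar` parses to a single nullary tag, `\foo{bar}` to a tag with one argument, and `{\foo}bar` to a nullary tag followed by the text `bar`, matching the three cases described above.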

## Why

1. TeLML grew out of Markdown's lack of extensibility. What
I wanted was a user-extensible markup language that had
the ability to include rich structures.

## Alternatives

- [SGML]()
- [DocBook]()
- [MediaWiki]()
- [S-Expressions]()

## Formal Grammar
## Implementations