I want to get away from syntax (because really, who actually likes bikeshedding?), but first I want to give the details of a few syntaxes I have implemented. In particular, these are three different languages for markup and data storage that I've used in personal projects, split apart into standalone implementations in case they're useful to anyone else.

I didn't invent any of these formats, but I did implement and formalize them, which implicates me at least a little bit in their creation. Of course, whenever you invent a markup/data storage format, you should ask yourself two questions:

  1. What purpose does this serve that isn't served by an existing format?
  2. What purpose does this serve that isn't served by S-expressions specifically?

For each of these formats, I will include the following:

  • A description of the abstract (parsed) representation that the format describes, as well as an example of the format and its corresponding abstract structure.
  • Answers to the above questions: i.e., why does this format exist, and why not use an existing format for the same purpose?
  • A formal grammar which defines the language. There should be no ambiguity in the resulting grammar, and in particular, it should be the case that, for all x, x = decode(encode(x)). Note that the converse, y = encode(decode(y)), needn't hold (e.g. because of comments, indentation, &c.).
  • A C implementation that relies on no external libraries. This includes libc, which means that the implementations do not rely on the presence of malloc. This is done by only allowing access to the parsed structure through a callback, and threading the parsed structure through the stack.
  • A C implementation that relies on no external libraries but expects the user to pass in their own allocator, which can (and will be, in most cases) malloc, but doesn't need to be.
  • A Haskell implementation whose only imports from outside base are the packages that provide the types exposed in its API, e.g. ByteString, Text, Vector, and in one case, XML.

NML

Structure

All NML documents have the same abstract structure as an XML document. The following NML document:

<one<two|three>|
  four
  <five|six>
>

corresponds to the following XML document:

<one two="three">
  four
  <five>six</five>
</one>
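To make the correspondence concrete, here is a minimal Python sketch that turns the example above into an XML tree. It is not one of the C or Haskell implementations, and its grammar is only an assumption inferred from that single example: an element is `<name`, then zero or more nested elements read as attributes, then optionally `|` followed by text and child elements, then `>`. Escaping, namespaces, and the rest of XML are not modelled, and the names `parse_nml` and `_element` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_nml(src):
    """Parse one NML element (assumed grammar) into an ElementTree element."""
    elem, pos = _element(src, 0)
    if src[pos:].strip():
        raise ValueError("unexpected text after the root element")
    return elem


def _element(src, pos):
    n = len(src)
    if pos >= n or src[pos] != "<":
        raise ValueError("expected '<'")
    pos += 1
    # The element name runs up to whitespace or one of the delimiters.
    start = pos
    while pos < n and src[pos] not in "<|>" and not src[pos].isspace():
        pos += 1
    if pos == start:
        raise ValueError("expected an element name")
    elem = ET.Element(src[start:pos])
    # Head: nested elements before the '|' are treated as attributes.
    while True:
        while pos < n and src[pos].isspace():
            pos += 1
        if pos < n and src[pos] == "<":
            attr, pos = _element(src, pos)
            elem.set(attr.tag, attr.text or "")
        else:
            break
    # Body: text and child elements between '|' and the closing '>'.
    if pos < n and src[pos] == "|":
        pos += 1
        last_child = None
        while pos < n and src[pos] != ">":
            if src[pos] == "<":
                last_child, pos = _element(src, pos)
                elem.append(last_child)
            else:
                start = pos
                while pos < n and src[pos] not in "<>":
                    pos += 1
                # ElementTree stores text after a child as that child's tail.
                if last_child is None:
                    elem.text = src[start:pos]
                else:
                    last_child.tail = src[start:pos]
    if pos >= n or src[pos] != ">":
        raise ValueError("expected '>'")
    return elem, pos + 1


nml_source = "<one<two|three>|\n  four\n  <five|six>\n>"
xml_text = ET.tostring(parse_nml(nml_source), encoding="unicode")
```

Under these assumptions, `xml_text` holds the XML document shown above, with whitespace preserved exactly as written in the NML source.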

Why

  1. NML is a format originally proposed by Erik Naggum, who is well-known for his vituperative rants in support of Common Lisp and against XML. NML was originally intended as a demonstration of why you don't need an attribute/element distinction, but variations on it were briefly adopted as a somewhat more human-friendly way of writing XML.

    It was never fleshed out into a full specification by Naggum—and indeed, most of the users modified it in some way or other—but I have followed Naggum's examples and extended it with specifications that deal with the full generality of XML, including features like namespaces. (Hence, GNML, or Generalized NML.) Finally, I specified that GNML allows for Unicode input and in fact does not understand character references at all: escaped Unicode characters in GNML correspond to character references in XML, and vice versa.

    My original use for GNML was as a tool for viewing and producing XML: I could write GNML and convert it on the fly to valid XML that tools could understand, and dump XML as pleasantly human-readable GNML. If your data will never be seen by human eyes, it would be better to simply use XML, since the tools for working with XML are far more numerous than the tools for working with my particular dialect of GNML.

  2. If you can use S-Expressions instead, then use S-Expressions. It is in fact trivial to map S-Expressions to XML and vice versa, and almost every language has mature libraries for parsing and emitting S-Expressions.
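As a rough illustration of how direct the S-Expression mapping can be, the sketch below converts a nested Python list, standing in for an already-parsed S-Expression, into XML. The convention is an assumption borrowed from SXML: the first item is the element name, a sublist whose head is `"@"` carries attributes, strings are text, and other sublists are child elements. Only one direction is shown, and `sexp_to_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def sexp_to_xml(node):
    """Convert a nested list such as ["one", ["@", ["two", "three"]], "four"]."""
    name, *rest = node
    elem = ET.Element(name)
    last_child = None
    for item in rest:
        if isinstance(item, list) and item and item[0] == "@":
            # Attribute list: ["@", [key, value], ...]
            for key, value in item[1:]:
                elem.set(key, value)
        elif isinstance(item, list):
            last_child = sexp_to_xml(item)
            elem.append(last_child)
        elif last_child is None:
            # Text before any child element belongs to the element itself.
            elem.text = (elem.text or "") + item
        else:
            # Text after a child element is stored as that child's tail.
            last_child.tail = (last_child.tail or "") + item
    return elem


# (one (@ (two "three")) "four" (five "six"))
doc = ["one", ["@", ["two", "three"]], "four", ["five", "six"]]
xml_text = ET.tostring(sexp_to_xml(doc), encoding="unicode")
```

Here `xml_text` is the same document as the NML example, minus the insignificant whitespace.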

Alternatives

Formal Grammar

Implementations

NDBL

Structure

All NDBL documents take the form of a sequence of sequences of key-value pairs, where a key is a string of non-whitespace characters and a value is any string. NDBL source files must be UTF-8 encoded. The following NDBL document:

a=b # x
  c="d e"
f= g=h

corresponds to the following abstract structure expressed in Python syntax:

[ [ ("a", "b"), ("c", "d e") ]
, [ ("f", ""), ("g", "h") ]
]

What

NDBL keys are always unquoted strings of Unicode characters which contain neither whitespace nor the character =, but may contain other punctuation. NDBL values are strings, which may be unquoted (in which case they are terminated by whitespace) or may be quoted (in which case they are terminated by the first quotation mark character not preceded by a backslash.) Comments are started by whitespace followed by a pound sign # and continue to the end of a line. A new group is started by a key-value pair at the beginning of a line; so long as subsequent key-value pairs do not start immediately after a newline, they are added to the previous group.
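The rules above can be sketched as a small hand-written scanner. This is a minimal Python illustration, not the C or Haskell implementation. Two details are assumptions that the text does not settle: a `#` at the very start of the input is treated as a comment, and inside a quoted value the sequence `\"` is unescaped to a bare quotation mark. The name `parse_ndbl` is hypothetical.

```python
def parse_ndbl(src):
    """Parse NDBL text into a list of groups, each a list of (key, value) pairs."""
    groups = []
    pos, n = 0, len(src)
    at_line_start = True
    while pos < n:
        ch = src[pos]
        if ch.isspace():
            # Only a pair that directly follows a newline starts a new group.
            at_line_start = (ch == "\n")
            pos += 1
        elif ch == "#" and (pos == 0 or src[pos - 1].isspace()):
            # Comment: skip to the end of the line.
            while pos < n and src[pos] != "\n":
                pos += 1
        else:
            # Key: non-whitespace characters up to the first '='.
            start = pos
            while pos < n and src[pos] != "=" and not src[pos].isspace():
                pos += 1
            if pos == start or pos >= n or src[pos] != "=":
                raise ValueError(f"expected key=value at offset {start}")
            key = src[start:pos]
            pos += 1
            if pos < n and src[pos] == '"':
                # Quoted value: ends at the first '"' not preceded by a backslash.
                pos += 1
                end = pos
                while end < n and not (src[end] == '"' and src[end - 1] != "\\"):
                    end += 1
                if end >= n:
                    raise ValueError("unterminated quoted value")
                value = src[pos:end].replace('\\"', '"')  # assumed unescaping
                pos = end + 1
            else:
                # Unquoted value: runs to the next whitespace and may be empty.
                start = pos
                while pos < n and not src[pos].isspace():
                    pos += 1
                value = src[start:pos]
            if at_line_start or not groups:
                groups.append([])
            groups[-1].append((key, value))
            at_line_start = False
    return groups


example = 'a=b # x\n  c="d e"\nf= g=h\n'
parsed = parse_ndbl(example)
```

Run on the example document from the Structure section, `parsed` is the two-group structure shown there. Groups are kept as lists of pairs rather than dictionaries so that a key can repeat within a group.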

Why

  1. NDBL is so-named because it is based on the configuration format of Plan 9's NDB utility, with some minor modifications. NDBL is a configuration format, not a data storage or markup format, and as such does not have a complicated internal structure. In fact, the grammar of NDBL is regular, and could be recognized by a regular expression. (It cannot be parsed by a regular expression as that would require the ability to perform a submatch under a Kleene star.)

    The simplest configuration format is really just key-value pairs (which is one of the reasons that environment variables are often used for configuration.) NDBL adds only a bare minimum on top of that: it adds a notion of grouping to sequences of key-value pairs. Crucially, the groups are not maps, because the same key may appear more than once:

    # a single group with repeated keys:
    user=terry file=foo.txt file=bar.txt file=baz.txt

    The reason NDBL might be desirable is that many other configuration formats are very heavyweight. TOML was originally created as a response to the complexity of YAML, and TOML is itself quite complicated. Simpler formats, such as JSON, lack the ability to have comments—a must for configuration—and many other configuration formats are ad-hoc and have no formal specification (e.g. INI, which TOML greatly resembles.) So, NDBL was codified as a response to this.

  2. You should probably just use S-Expressions.

Alternatives

Formal Grammar

Implementations

TeLML

Structure

All TeLML documents take the form of sequences of fragments, where a fragment is either a chunk of Unicode text or a TeLML tag, which consists of a tag name and a sequence of sequences of fragments.

The following TeLML document:

a \b c \d{e,\f{}, g }

corresponds to the following abstract structure expressed in Python syntax:

[ "a "
, Tag(name="b")
, " c "
, Tag(name="d",
      args=[ "e"
           , Tag(name="f",
                 args=[""])
           , " g "
           ])
]

What

If a document never contains any instance of the characters \, {, or }, then it is a TeLML document consisting of a single text fragment. Tags with an empty argument list are written as \tagname, while tags with an argument list are followed by curly braces, with individual arguments separated by commas, e.g., \tagname{arg1,arg2,...,argn}. Curly braces without a preceding tag are used for grouping and do not appear in the parsed structure; this is sometimes useful for separating argument-less tags from adjacent text fragments: \foobar is a single tag called foobar, and \foo{bar} is a single tag foo with an argument bar, but {\foo}bar is two fragments: a nullary tag foo followed by a text fragment bar.
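A recursive-descent reading of these rules might look like the following Python sketch. It is an illustration under assumptions rather than the reference implementation: tag names are assumed to consist of letters, digits, `-`, and `_`, escape sequences for literal `\`, `{`, `}`, and `,` are not modelled, and every argument is kept as its own list of fragments, so the nesting is one level deeper than in the abbreviated listing in the Structure section. `Tag`, `parse_telml`, and `_fragments` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    args: list = field(default_factory=list)  # one list of fragments per argument


def parse_telml(src):
    """Parse a TeLML document into a list of text fragments and Tag objects."""
    fragments, _ = _fragments(src, 0, "")
    return fragments


def _fragments(src, pos, stop):
    """Read fragments until a character in `stop` or the end of the input."""
    n = len(src)
    out, text = [], []

    def flush():
        if text:
            out.append("".join(text))
            text.clear()

    while pos < n and src[pos] not in stop:
        ch = src[pos]
        if ch == "\\":
            flush()
            end = pos + 1
            while end < n and (src[end].isalnum() or src[end] in "-_"):
                end += 1
            if end == pos + 1:
                raise ValueError("expected a tag name (escapes are not modelled)")
            name, pos, args = src[pos + 1:end], end, []
            if pos < n and src[pos] == "{":
                pos += 1
                while True:
                    arg, pos = _fragments(src, pos, ",}")
                    args.append(arg)
                    if pos >= n:
                        raise ValueError("unterminated argument list")
                    pos += 1
                    if src[pos - 1] == "}":
                        break
            out.append(Tag(name, args))
        elif ch == "{":
            # Grouping braces: splice the contents in; the braces vanish.
            inner, pos = _fragments(src, pos + 1, "}")
            if pos >= n:
                raise ValueError("unterminated group")
            pos += 1
            for frag in inner:
                if isinstance(frag, str):
                    text.append(frag)
                else:
                    flush()
                    out.append(frag)
        elif ch == "}":
            raise ValueError("unmatched '}'")
        else:
            text.append(ch)
            pos += 1
    flush()
    return out, pos


parsed = parse_telml(r"a \b c \d{e,\f{}, g }")
```

With this representation, `\b` parses to a tag with no arguments, while `\f{}` parses to a tag with one empty argument, which keeps the two spellings distinguishable.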

Why

  1. TeLML grew out of Markdown's lack of extensibility. What I wanted was a user-extensible markup language that had the ability to include rich structures

Alternatives

Formal Grammar

Implementations