\meta{( "structural-res" "structural regular expressions" ("programming") )}
One of the biggest tragedies of computer science is how we so often forget our field has a history. It's not at all hard to read papers from a decade or two ago and find \link{2016-03-15/subjects-and-entities/|amazing ideas that never got popular}, or even ideas which are currently being popularized in blog posts and tweets without any idea that there's prior art. As Ron Minnich once said:
\blockquote
{
You want to make your way in the CS field? Simple. Calculate rough time of amnesia (hell, 10 years is plenty, probably 10 months is plenty), go to the dusty archives, dig out something fun, and go for it. It's worked for many people, and it can work for you.
}
A good source of mostly-forgotten CS ideas is the \link{https://en.wikipedia.org/wiki/plan_9_from_Bell_Labs|Plan 9 from Bell Labs operating system}, which is a lost utopia among computer systems. It was an operating system that took the same ideals that Unix paid lip-service to and realized them more fully than anyone else had. A lot has been written about the \link{http://debu.gs/entries/inferno-part-0-namespaces|amazing networking portions of Plan 9}, and the way that it considers the Unix adage \em{everything is a file} and takes it a hundred times further than Unix ever did. That's not what I'm going to go into here, but you should absolutely investigate it.
Instead, I'm going to talk about a minor but very interesting feature of the Plan 9 user-space utilities\ref{user} that has been undeservedly forgotten by many programmers: something called \em{Structural Regular Expressions}.
\sidenote{Plan 9's user-land utilities \em{are} available for Unix-like operating systems: they are known as \link{https://swtch.com/plan9port/|Plan 9 from User Space}, or \tt{plan9port} for short, and you can easily download them and experiment with the features I describe here.}
You can read about them \link{http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf|in Rob Pike's paper describing them} or by reading the documentation for the \tt{ed}-inspired editor \tt{sam}, but I'm going to take my own approach to describing them, one that's at least partly inspired by a now-deleted blog post that first introduced me to them.
I suspect at least part of the reason structural regular expressions aren't well-remembered is that the name is both misleading and thoroughly boring. It makes it sound like they're an alternative to regular expressions, which isn't the case: they're an alternative to the \em{command language} found in tools like \tt{ed} or \tt{sed}. Their implementation of regular expressions is (with a small but important exception) the same one you can find everywhere, but structural regular expressions build a more powerful set of actions on top of regular expressions.
\h2{\tt{ed} and \tt{sed}}
The most well-known of \tt{ed}'s regex commands is almost certainly regex substitution, which takes the form
\code{\ttcom{(ed)} \ttkw{s}/<regex>/<replacement>/<flags>}
For example, the command \tt{\ttkw{s}/this/that/g} will replace every instance of the string \tt{this} on the current line with the string \tt{that}. (Omitting the \tt{g} results in a command that only replaces the first instance.) Another \tt{ed} command is \tt{\ttkw{g}/<regex>/<command>}, which loops over every line in the file, and if a line matches the supplied \tt{<regex>}, then it runs \tt{<command>} over that line. The inverse command, which runs the \tt{<command>} if the line \em{doesn't} match, is \tt{\ttkw{v}/<regex>/<command>}. For example, a command like\ref{syn}
\sidenote{This syntax is a bit opaque: you can imagine it as piping the result from \tt{\ttkw{g} cake} into \tt{\ttkw{s} this that}, but it might be more accurate to think of the latter command as a function or continuation passed to the earlier one: it's a command that represents \em{what to do next}, which is the essence of continuations.}
\code{\ttcom{(ed)} \ttkw{g}/cake/\ttkw{s}/this/that/g}
can be read as \em{find every line that contains \tt{cake}, then only on those lines, replace every instance of \tt{this} with \tt{that}}. A classic simple command is finding and printing lines that match a pattern. Using \tt{re} to stand in for an arbitrary regular expression, the command would be written \tt{\ttkw{g}/re/\ttkw{p}}. This command is useful enough that it inspired a tool specifically for printing lines matched by a regex, which was creatively named after the equivalent \tt{ed} command: \tt{grep}.
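
If it helps to see that reading spelled out, here is a rough Python sketch of the line-oriented model. This is my own illustration and not how \tt{ed} is implemented: \tt{ed_global} and \tt{substitute_all} are hypothetical names, and Python's \tt{re} syntax stands in for \tt{ed}'s.

```python
import re

# Hypothetical helpers for illustration; these names do not come from ed.

def ed_global(lines, pattern, command):
    """Roughly g/pattern/command: run `command` on each line that matches."""
    return [command(line) if re.search(pattern, line) else line
            for line in lines]

def substitute_all(pattern, replacement):
    """Roughly s/pattern/replacement/g, packaged as a per-line command."""
    return lambda line: re.sub(pattern, replacement, line)

text = [
    "this cake is good",
    "this pie is fine",
    "eat this cake, not this one",
]

# Roughly g/cake/s/this/that/g
result = ed_global(text, r"cake", substitute_all(r"this", "that"))
```

Notice that the second command is handed to the first as a value, which is the "what to do next" reading of the syntax.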
\h2{Structural regular expressions}
Structural regular expressions build on a similar but non-identical command language, but the \em{first} deficiency identified in traditional Unix regexp-ey tools was that they were \em{necessarily} line-oriented. This isn't a feature of the theory of regular languages, but rather a practical API choice for Unix programs, which often deal with newline-delimited text files. While practical for some applications, this does create a weird edge case where some hopefully-straightforward uses of regular expressions don't suffice. For example, I might want to write a short script to search my prose for accidentally repeated instances of common words like \em{the}. A regex like \tt{the +the} would suffice for most cases, but would completely fail to match the string \tt{"the\\nthe"}.
Structural regular expressions begin by tossing out line-orientedness: a regular expression is matched against the whole file rather than one line at a time, so a single match can span newlines. The regular expressions allow for the escape sequence \tt{ \\n } to represent a newline, and \tt{.} still matches any character \em{except} a newline, so if I wanted to match a single line, I could write the regular expression \tt{.*\\n} to describe it. Consequently, I can handle the \tt{"the\\nthe"} case by writing \tt{the[ \\n]+the}, and replace all instances of repeated \em{the}, even across newlines, with the command\ref{sam}
\sidenote{I'm marking these snippets with \link{http://doc.cat-v.org/plan_9/4th_edition/papers/sam/|\tt{sam}}, which is the \tt{ed}- and \tt{ex}-inspired text editor that appeared in Plan 9. There's a bit more complexity to actually using \tt{sam} which I'm eliding for the sake of explanation.}
\code{\ttcom{(sam)} \ttkw{s}/the[ \\n]+the/the/g}
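
To make the difference concrete, here is a small Python sketch that contrasts the two models on the same input. It only mimics the behaviour with Python's \tt{re} module. It is not \tt{sam} itself, and the sample text is made up.

```python
import re

text = "I saw the the cat\nsit on the\nthe mat.\n"

# Line-at-a-time processing, as in ed or sed: each line is handled on its
# own, so the pattern never gets to see a newline.
per_line = [re.sub(r"the +the", "the", line) for line in text.split("\n")]
line_oriented = "\n".join(per_line)

# Whole-buffer processing, as in the structural model: the pattern runs
# over the entire text, so one match may span a line boundary.
whole_buffer = re.sub(r"the[ \n]+the", "the", text)
```

The line-oriented pass fixes the doubled word on the first line but leaves the pair split across lines two and three alone. The whole-buffer pass catches both.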
That said, line-oriented commands \em{are} often very useful, and it'd be a shame if we lost out on the ability to do things on a per-line basis! Luckily, structural regular expressions have a trick up their sleeves: the \tt{\ttkw{x}} command, which can be thought of as a sort of \em{for-each} over every place where a regular expression matches the input. It takes the form \tt{\ttkw{x}/<regex>/<command>}, and will find every instance of \tt{<regex>} in the input and then run \tt{<command>} over \em{only the portion of the input that matched the regex}. We can trivially combine this with commands like \tt{\ttkw{p}} for printing:
\code{\ttcom{(sam)} \ttkw{x}/cake/\ttkw{p} }
That command was pretty boring: all it will do is print out every instance of the word \tt{cake}, and \em{only} that: none of the characters before, none of the surrounding lines, just a bunch of \tt{cake}s, as many as appear in the input. If, instead, we wanted to print out \em{every line that contained the word \tt{cake}}, we could write something like this:
\code{\ttcom{(sam)} \ttkw{x}/.*cake.*\\n/\ttkw{p} }
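
One way to picture \tt{\ttkw{x}} is as a higher-order function. The Python sketch below is my own model, not \tt{sam}'s implementation: the functions \tt{x} and \tt{p} are hypothetical stand-ins, and "printing" is modelled as returning a list of strings. Conveniently, Python's \tt{.} also refuses to match a newline by default.

```python
import re

# Hypothetical stand-ins for sam commands, modelled as plain functions.

def x(pattern, command):
    """Sketch of x/pattern/command: run `command` on each match in the text."""
    def run(text):
        printed = []
        for match in re.finditer(pattern, text):
            printed.extend(command(match.group(0)))
        return printed
    return run

def p(text):
    """Sketch of p: 'print' the focused text by returning it."""
    return [text]

text = "carrot cake\napple pie\ncake, cake, cake\n"

just_cakes = x(r"cake", p)(text)          # roughly x/cake/p
cake_lines = x(r".*cake.*\n", p)(text)    # roughly x/.*cake.*\n/p
```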
But there's a more elegant way. In the \tt{sam} command language, the \tt{\ttkw{g}} command does something different than it does in \tt{ed}: it looks like \tt{\ttkw{g}/<regex>/<command>} and acts like a filter, running the supplied command over \em{the entire text currently in focus} (not just the part the regex matched) if its regex matches some part of that text. So, if we want to print every line that contains \tt{cake}, we could first focus on each line of input with \tt{\ttkw{x}}, use \tt{\ttkw{g}} to filter down to just the lines that contain the string \tt{cake}, and then print those lines:
\code{\ttcom{(sam)} \ttkw{x}/.*\\n/\ttkw{g}/cake/\ttkw{p} }
There's a corresponding negative match command \tt{\ttkw{v}}, which runs the command if it \em{doesn't} find a match, so the following command prints every line which doesn't contain \tt{cake}:
\code{\ttcom{(sam)} \ttkw{x}/.*\\n/\ttkw{v}/cake/\ttkw{p} }
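
The focus-then-filter pipeline can be sketched in Python as nested function calls. As before, this is an illustrative model under my own assumptions: \tt{x}, \tt{g}, \tt{v}, and \tt{p} are hypothetical helpers named after the commands they imitate, and printing is modelled as returning a list.

```python
import re

# Hypothetical helpers that model sam-style commands as plain functions.
# Each command takes the focused text and returns a list of printed pieces.

def x(pattern, command):
    """x/pattern/command: run `command` on every match inside the text."""
    def run(text):
        printed = []
        for match in re.finditer(pattern, text):
            printed.extend(command(match.group(0)))
        return printed
    return run

def g(pattern, command):
    """g/pattern/command: run `command` on the whole focus if pattern matches."""
    return lambda text: command(text) if re.search(pattern, text) else []

def v(pattern, command):
    """v/pattern/command: run `command` on the whole focus if pattern does not match."""
    return lambda text: [] if re.search(pattern, text) else command(text)

def p(text):
    """p: 'print' the focused text by returning it."""
    return [text]

text = "carrot cake\napple pie\ncake, cake, cake\n"

with_cake = x(r".*\n", g(r"cake", p))(text)      # roughly x/.*\n/g/cake/p
without_cake = x(r".*\n", v(r"cake", p))(text)   # roughly x/.*\n/v/cake/p
```

Each command only ever sees the text its parent handed it, which is why the pieces nest so freely.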
There are also commands for inserting before (\tt{\ttkw{i}}), appending after (\tt{\ttkw{a}}), changing (\tt{\ttkw{c}}), or deleting (\tt{\ttkw{d}}) the text matched by previous commands. We can still use our old friend the \tt{\ttkw{s}} command to replace a regular expression, so the structural equivalent of our replace-\tt{this}-with-\tt{that}-on-lines-containing-\tt{cake} example above could be written by focusing on lines, filtering by \tt{cake}, and then running a traditional \tt{\ttkw{s}} command:
\code{\ttcom{(sam)} \ttkw{x}/.*\\n/\ttkw{g}/cake/\ttkw{s}/this/that/g}
But we could also write this a slightly different way: by first focusing on lines, then filtering by \tt{cake}, then focusing on instances of the string \tt{this}, and using the \tt{\ttkw{c}} command to change the focused text to \tt{that}:
\code{\ttcom{(sam)} \ttkw{x}/.*\\n/\ttkw{g}/cake/\ttkw{x}/this/\ttkw{c}/that/}
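
To check that the two spellings really do the same thing, here is a Python sketch in which each command rewrites the text it is focused on and hands the result back. The helpers \tt{x}, \tt{g}, \tt{s}, and \tt{c} are hypothetical models of the commands, written for this post. Real \tt{sam} tracks edits against a buffer rather than rebuilding strings.

```python
import re

# Hypothetical models of sam's editing commands: each one takes the focused
# text and returns its replacement.

def x(pattern, command):
    """x/pattern/command: rewrite every match with whatever `command` returns."""
    return lambda text: re.sub(pattern, lambda m: command(m.group(0)), text)

def g(pattern, command):
    """g/pattern/command: apply `command` to the whole focus if pattern matches."""
    return lambda text: command(text) if re.search(pattern, text) else text

def s(pattern, replacement):
    """s/pattern/replacement/g: substitute within the focused text."""
    return lambda text: re.sub(pattern, replacement, text)

def c(replacement):
    """c/replacement/: change the focused text wholesale."""
    return lambda text: replacement

text = "this cake\nthis pie\nnot this cake\n"

# roughly x/.*\n/g/cake/s/this/that/g
via_s = x(r".*\n", g(r"cake", s(r"this", "that")))(text)

# roughly x/.*\n/g/cake/x/this/c/that/
via_c = x(r".*\n", g(r"cake", x(r"this", c("that"))))(text)
```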
\h2{A high-level view}
A major advantage of structural regular expressions is that they're more \em{compositional} than the traditional \tt{ed}-like command language. In \tt{ed}, we might be tempted to find lines that contain both \tt{this} and \tt{that} by writing a command like
\code{\ttcom{(ed)} \ttkw{g}/this/\ttkw{g}/that/\ttkw{p} \ttcom{# bad!}}
but we can't do this: \tt{\ttkw{g}} commands aren't allowed to invoke other \tt{\ttkw{g}} commands, only simpler commands like \tt{\ttkw{p}}rinting or \tt{\ttkw{d}}eletion. But in structural regular expressions, the primitive components are \em{designed} to be used in a recursive way, composing complicated commands out of simple regular expressions and sets of commands, which has the wonderful side-effect that the regular expressions you actually write are much simpler. To borrow a few examples from Rob Pike's paper: with structural regular expressions, if we wanted to print every line that contained \tt{rob} but not \tt{robot}, we could write a command to focus on lines, keep only those that contain \tt{rob}, filter out those that contain \tt{robot}, and print them:
\code{\ttcom{(sam)} \ttkw{x}/.*\\n/\ttkw{g}/rob/\ttkw{v}/robot/\ttkw{p}}
The same thing in an \tt{ed}-like command language would require making the regular expression much more complicated, to filter by the string \tt{rob} when not followed by \tt{o}, or when followed by \tt{o} but not by \tt{t}, and so forth:
\code{\ttcom{(ed)} \ttkw{g}/rob($\|[^o]\|o[^t])/\ttkw{p}}
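
Here is a Python sketch that runs both styles side by side. The helpers are hypothetical models of \tt{\ttkw{x}}, \tt{\ttkw{g}}, \tt{\ttkw{v}}, and \tt{\ttkw{p}} that I wrote for illustration, the sample text is made up, and Python's \tt{re} stands in for both tools' regex dialects.

```python
import re

# Hypothetical models of sam-style commands; "printing" returns a list.

def x(pattern, command):
    def run(text):
        printed = []
        for match in re.finditer(pattern, text):
            printed.extend(command(match.group(0)))
        return printed
    return run

def g(pattern, command):
    return lambda text: command(text) if re.search(pattern, text) else []

def v(pattern, command):
    return lambda text: [] if re.search(pattern, text) else command(text)

def p(text):
    return [text]

text = "rob was here\na robot\nrobust code\nno match\n"

# Composed filters: contains "rob", does not contain "robot".
composed = x(r".*\n", g(r"rob", v(r"robot", p)))(text)

# One monolithic regex, in the style of the ed command.
monolithic = x(r".*\n", g(r"rob($|[^o]|o[^t])", p))(text)

# A line that contains both words.
mixed = "robot and rob\n"
composed_mixed = x(r".*\n", g(r"rob", v(r"robot", p)))(mixed)
monolithic_mixed = x(r".*\n", g(r"rob($|[^o]|o[^t])", p))(mixed)
```

On the simple input the two agree. On the \tt{mixed} line, at least in this model, they part ways: the composed version drops it because it contains \tt{robot}, while the single regex keeps it because it finds a \tt{rob} that isn't the start of \tt{robot}. That kind of subtlety is much easier to see, and to fix, in the composed form.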
Another advantage of structural regular expressions is that they alleviate a common problem with submatches. I neglected to mention it above, but the \tt{\ttkw{s}} command also allows you to use parens within the supplied regex to select some subpart of the matched input, and use that within the output, e.g.
\code{\ttcom{(sam)} \ttkw{s}/([A-Za-z]+) ([A-Za-z]+)/\\2 \\1/g}
is a command which will match two words and swap their order: \tt{ \\1 } referring to the item matched within the first set of parens, and \tt{ \\2 } within the second. One problem with submatches is that they are no longer valid when you put them underneath a \tt{*}:
\code{\ttcom{(ed)} \ttkw{s}/words: ([A-Za-z]+ )*/got: \\1/g \ttcom{# also bad!}}
What does this invocation mean? Well, nothing: \tt{ \\1 } can't unambiguously refer to any submatch, because the parenthesized group may match zero or more times.
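
Engines have to pick \em{something} when this happens, and the choice is rarely what you wanted. As one data point, here is how Python's \tt{re} module resolves it. This says nothing about what \tt{ed} or \tt{sam} do, and the input line is made up.

```python
import re

line = "words: alpha beta gamma "

# The group sits under a *, so it matches three times on this input.
match = re.match(r"words: ([A-Za-z]+ )*", line)

# Python keeps only the final repetition of the group.
last_only = match.group(1)

# So the substitution silently discards the earlier words.
rewritten = re.sub(r"words: ([A-Za-z]+ )*", r"got: \1", line)
```

Python keeps only the last repetition, so \tt{alpha} and \tt{beta} simply vanish from the output.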
Because structural regular expressions contain \em{for-each}-like constructs, we can start to articulate commands that perform the same repetitions, but with no ambiguity about what replaces what:
\code{\ttcom{(sam)} \ttkw{x}/words: [A-Za-z ]+\\n/\ttkw{x}/[A-Za-z]+ /\ttkw{i}/got: /}
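
A Python sketch of that nested loop, using hypothetical helpers \tt{x} and \tt{i} that model the commands as text-rewriting functions (my own simplification, with made-up input):

```python
import re

# Hypothetical models of sam commands: each takes the focused text and
# returns its replacement.

def x(pattern, command):
    """x/pattern/command: rewrite every match with whatever `command` returns."""
    return lambda text: re.sub(pattern, lambda m: command(m.group(0)), text)

def i(prefix):
    """i/prefix/: insert `prefix` before the focused text."""
    return lambda text: prefix + text

text = "words: alpha beta gamma \nother: delta \n"

# Outer loop: focus on each "words: ..." line.
# Inner loop: focus on each word (with its trailing space) and prefix it.
result = x(r"words: [A-Za-z ]+\n", x(r"[A-Za-z]+ ", i("got: ")))(text)
```

Every word on the \tt{words:} line gets its own \tt{got: } prefix, and the other line is untouched, because the inner loop only ever sees what the outer loop focused on.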
Structural regular expressions are a powerful and expressive tool: separating the ability to focus, filter, and edit into different but complementary commands provides a surprising amount of power while also \em{simplifying} the regular expressions needed to perform these operations. They are sadly not as popular as they could be (as far as I know, the only editors that support them are part of the Plan 9 tools), but implementing them in other tools would be a wonderful and easy way of making those tools more powerful. So, editor-writers and tool-writers, keep them in mind!