Clarified RE syntax in examples
Getty Ritter
8 years ago
30 | 30 | |
31 | 31 | \h2{Structural regular expressions} |
32 | 32 | |
33 | Structural regular expressions build on a similar but non-identical command language. The \em{first} deficiency identified in traditional Unix regexp-ey tools, though, was that they were \em{necessarily} line-oriented. This isn't a feature of the theory of regular languages, but rather a practical API choice for Unix programs, which often deal with newline-delimited text files. While practical for some applications, this does create a weird edge case where some hopefully-straightforward uses of regular expressions don't suffice. For example, I might want to write a short script to search my prose for accidentally repeated instances of common words like \em{the}: a regex like \tt{/the +the/} would suffice for most cases, but would completely fail to match the string \tt{"the\\nthe"}. | |
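To make the failure concrete, here is a minimal sketch in Python rather than in any of the Unix tools themselves. It imitates a line-oriented scan by splitting the input on newlines before matching; the helper name \tt{line_oriented_hits} and the sample strings are invented for illustration, and Python's \tt{re} syntax only approximates what \tt{grep} or \tt{sed} would accept.

```python
import re

# The naive pattern from the text: "the", one or more spaces, "the".
REPEATED_THE = re.compile(r"the +the")


def line_oriented_hits(text):
    """Return the lines that contain a match, checking one line at a time.

    This imitates a grep-style tool: the pattern never sees a newline,
    because the input is split into lines before any matching happens.
    """
    return [line for line in text.split("\n") if REPEATED_THE.search(line)]


same_line = "I saw the the cat"
split_pair = "I saw the\nthe cat"

print(line_oriented_hits(same_line))   # the repeat is on one line, so it is found
print(line_oriented_hits(split_pair))  # the repeat straddles a newline, so it is missed
```

The second call comes back empty: each line contains only a single \em{the}, so no per-line match can ever see the pair.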
34 | 34 | |
35 | Structural regular expressions begin by tossing out line-orientedness: a regular expression like \tt{.*} would match the entire file, newlines and all. The regular expression syntax allows for the escape sequence \tt{ \\n } to represent a newline, so if I wanted to match a single line, I could write the regular expression \tt{.*\\n} to describe it; consequently, I can handle the \tt{"the\\nthe"} case by writing \tt{/the[ \\n]+the/}, and replace all instances of repeated \em{the} (even across newlines) with the command\ref{sam} | |
36 | 36 | \sidenote{I'm marking these snippets with \link{http://doc.cat-v.org/plan_9/4th_edition/papers/sam/|\tt{sam}}, which is the \tt{ed}- and \tt{ex}-inspired stream editor that appeared in Plan 9. There's a bit more complexity to actually using \tt{sam} which I'm eliding for the sake of explanation.} |
37 | 37 | |
38 | 38 | \code{\ttcom{(sam)} \ttkw{s}/the[ \\n]+the/the/g} |
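For readers without \tt{sam} at hand, the same substitution can be approximated in Python by treating the whole text as one string instead of a sequence of lines. This is only a rough analogue, not a description of how \tt{sam} works internally: the function name \tt{collapse_repeated_the} is my own, and Python's \tt{re.sub} replaces every non-overlapping match, which is what I am assuming the trailing \tt{g} asks for in the command above.

```python
import re

# Same pattern as the sam command: "the", then one or more spaces
# or newlines, then "the".
REPEATED_THE = re.compile(r"the[ \n]+the")


def collapse_repeated_the(text):
    """Replace each repeated "the" pair with a single "the".

    The whole text is matched as one string, so a pair split across
    a line break is caught as well.
    """
    return REPEATED_THE.sub("the", text)


print(collapse_repeated_the("I saw the the cat"))
print(collapse_repeated_the("I saw the\nthe cat"))
```

Both calls produce \tt{I saw the cat}. Note that the newline between the two words is consumed along with the duplicate, so the line break disappears from the result.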