A Regular Expression Matcher in 30 lines of C

gjm11 · on July 3, 2011

It's very nice pattern-matching code, but it isn't a regular expression matcher. The class of patterns it's able to match is much smaller than the class of all regular expressions -- not because of missing "sugar" (character classes, +, non-greedy wildcards, ...) but because it doesn't support groups or the | operator.

As a result, the language supported by this matcher has, e.g., no way to match the pattern (ab)* or the pattern a|b. It's much, much less powerful than an actual regular expression matcher would be, and much of what makes it possible to do it in 30 lines of C is that loss of power.

(I've written extremely similar code before: this level of functionality -- basically, glob or DOS wildcards -- is pretty useful. I'm not Rob Pike, and my code for similar functionality would probably be longer than his. But my code, or even Rob Pike's code, for even the simplest thing that could honestly be called a regular expression matcher, would be longer than this by a bigger factor.)

chanux · on July 3, 2011

I see the OP has discovered [http://www.hnsearch.com/search#request/submissions&q=cod...] r/tinycode which was posted on HN few days back [http://news.ycombinator.com/item?id=2717279].

antirez · on July 3, 2011

I wrote a similar glob-style string matching, but it's far more than 30 lines (130 lines of C):

https://github.com/antirez/redis/blob/unstable/src/util.c#L1...

I use it in Redis, but in general being "glob matching" more standard compared to a regexp subset people are more likely to know the syntax.

ihodes · on July 3, 2011

That was a superb read. Given the author(s) (who I noticed only after reading), I suppose that's to be expected. I'd like to see a similarly elegant version that translates the RE to a DFA–if anyone has any suggestions, I'd be quite happy!

slug · on July 3, 2011

This an excerpt of the excellent book http://oreilly.com/catalog/9780596510046

perlgeek · on July 3, 2011

Yes, matching regular expressions is easy, if you don't care about speed or features. Like, being able to match a literal dot or asterisk. Or char classes, or getting case insensitivity right (like the German ß that matches SS, so case folding needs not to preserve string length).

Still nice to see that it can be done in so few lines.

shangaslammi · on July 3, 2011

Here's my attempt at implementing the same functionality in Haskell: https://gist.github.com/1062472

I tried to make it short, but still somewhat readable, so it's not completely "golfed".

sheffield · on July 3, 2011

"In 1967, Ken [Thompson] applied for a patent ..."

Wow, he pioneered that, too...

swah · on July 3, 2011

That code is dense.