Tuesday, June 19, 2012

Regular Expressions that Cheat

Sometimes I see a regular expression, and I think "Cheater!" Here's an example: (\d+\.){3}\d+ used to match an IPv4 address. It cheats because it matches a broader set of strings than the set of all legal IPv4 addresses. It will match 127.0.0.1 (legal) as well as 999.999.999.999 (illegal).
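To make the difference concrete, here is a minimal Python sketch that compares the cheating pattern with a stricter one. The strict pattern and the names OCTET, STRICT_IPV4, is_loose and is_strict are my own illustration, not taken from any particular source. It also assumes that octets with leading zeros (like 01) should be rejected, which is a judgment call.

```python
import re

# The "cheating" pattern from the text: four dot-separated runs of digits.
LOOSE_IPV4 = re.compile(r"(\d+\.){3}\d+")

# Hypothetical stricter octet: 0-255, no leading zeros (an assumption).
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
STRICT_IPV4 = re.compile(rf"({OCTET}\.){{3}}{OCTET}")


def is_loose(s):
    """True if the whole string matches the cheating pattern."""
    return LOOSE_IPV4.fullmatch(s) is not None


def is_strict(s):
    """True if the whole string is four dot-separated octets in 0-255."""
    return STRICT_IPV4.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "999.999.999.999"):
        print(candidate, is_loose(candidate), is_strict(candidate))
```

Both patterns accept 127.0.0.1, but only the cheating one accepts 999.999.999.999.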

Recently, I was writing some expressions to normalize a log file, and found myself "cheating." The question then is "When is it OK to cheat?" I think it's OK to cheat when you know the input well, and know that it is well formed. This might be the case, for example, if I'm parsing the output of a program that I wrote. Then I should have a pretty good idea of all the possible outputs of the program.

As it happens, the log file I was normalizing was not very well known to me. I went back and rewrote several expressions. Here's one of the rewritten expressions:

^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}

This was to match a timestamp with microseconds. There's a similar regex to match IPv4 addresses in the SELF slides.
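Here is a rough Python sketch of how the rewritten timestamp expression behaves next to a cheating version. The strict pattern is copied from above; LOOSE_TS, starts_with_timestamp and the sample log lines are made up for illustration and are not from the actual log file.

```python
import re

# The rewritten expression from the post, split across two lines for readability.
STRICT_TS = re.compile(
    r"^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) "
    r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}"
)

# A hypothetical "cheating" version: right shape, any digits.
LOOSE_TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{5}")


def starts_with_timestamp(line, pattern=STRICT_TS):
    """True if the line begins with a timestamp accepted by the pattern."""
    return pattern.match(line) is not None


if __name__ == "__main__":
    samples = (
        "2012-06-19 14:05:09,12345 INFO started",  # invented sample line
        "2012-13-19 14:05:09,12345 INFO bad month",
        "2012-06-19 24:05:09,12345 INFO bad hour",
    )
    for line in samples:
        print(starts_with_timestamp(line), starts_with_timestamp(line, LOOSE_TS), line)
```

Even the rewritten expression still cheats a little: the day range does not depend on the month, so it accepts 2012-02-31.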

If you feel differently about cheating, feel free to comment.
