Tuesday, June 19, 2012

Regular Expressions that Cheat

Sometimes I see a regular expression, and I think "Cheater!" Here's an example: (\d+\.){3}\d+ used to match an IPv4 address. It cheats because it matches a broader set of strings than the set of all legal IPv4 addresses. It will match 127.0.0.1 (legal) as well as 999.999.999.999 (illegal).
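Here's a quick way to see the cheat in action, as a minimal Python sketch. LOOSE is the expression above; STRICT is just one possible tighter version, and its octet pattern is my own assumption for illustration (it is not necessarily the one in the SELF slides). The helper names are made up too.

```python
import re

# The "cheating" pattern from the post: runs of digits separated by dots.
LOOSE = re.compile(r"(\d+\.){3}\d+")

# One possible stricter octet, 0-255 with no leading zeros.
# This alternation is an assumption made for illustration.
OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT = re.compile(rf"({OCTET}\.){{3}}{OCTET}")


def is_loose_match(text):
    """True if the whole string matches the loose pattern."""
    return LOOSE.fullmatch(text) is not None


def is_strict_match(text):
    """True if the whole string matches the stricter pattern."""
    return STRICT.fullmatch(text) is not None
```

With these, is_loose_match("999.999.999.999") is true while is_strict_match("999.999.999.999") is false, and both accept 127.0.0.1.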

Recently, I was writing some expressions to normalize a log file, and found myself "cheating." The question then is "When is it OK to cheat?" I think it's OK to cheat when you know the input well, and know that it is well formed. This might be the case, for example, if I'm parsing the output of a program that I wrote. Then I should have a pretty good idea of all the possible outputs of the program.

As it happens, the log file I was normalizing was not very well known to me. I went back and rewrote several expressions. Here's one of the rewritten expressions:

^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}

This was to match a timestamp with microseconds. There's a similar regex to match IPv4 addresses in the SELF slides.
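To try the timestamp expression out, here is a small Python sketch. The pattern is copied from above as-is (including the five digits after the comma); the function name and the sample log lines are invented for illustration, since I can't show the real log file.

```python
import re

# Pattern copied verbatim from the post, split across two string
# literals only to keep the line short.
TIMESTAMP = re.compile(
    r"^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) "
    r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}"
)


def starts_with_timestamp(line):
    """True if the line begins with a timestamp the pattern accepts."""
    return TIMESTAMP.match(line) is not None
```

A line like "2012-06-19 14:05:59,12345 INFO started" is accepted, while month 13 or hour 24 is rejected. The expression still cheats a little: it accepts 2012-02-31, because checking days per month is more than I'd want to ask of a regex.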

If anyone feels differently about cheating, feel free to comment.

Monday, June 11, 2012

SELF 2012 Retrospective

The conference has come and gone. I think I had even more fun this year than last. Next year I'll definitely bring an offline copy of the slides, just in case. If all else had failed, I suppose I could have asked everyone to pull the slides up on their smartphones - wouldn't that have been a laugh.

The ride home to Georgia on the motorcycle had some exciting moments too. I got rained on a couple of times, but nothing so bad as to make me want to stop and put on my rain gear. I did notice I'd lost one of the bolts holding on my windshield. I occasionally had these little panicky thoughts - what if it flies off and hits me while I'm going 75 mph down the highway? Still, if I'd wanted a completely boring trip, I'd have taken a Greyhound bus.

Here's a review of some things I said that weren't in the slides:

  • + is like * but means 1 or more repetitions
  • {N} means exactly N repetitions
  • {N,} means N or more repetitions
  • {,N} means 0 up to N repetitions
  • {N,M} means from N to M repetitions

Depending on the flavor of regular expressions you're dealing with, you may need to put a \ in front of the curly braces to use the above behavior.
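The quantifiers listed above can be checked quickly in Python, whose re module takes the curly braces unescaped. This is only a sketch; the matches helper is a made-up name, and it uses fullmatch so the whole string has to fit the pattern.

```python
import re


def matches(pattern, text):
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Each tuple is (pattern, text, whether the whole text should match).
EXAMPLES = [
    (r"a*", "", True),         # * allows zero repetitions
    (r"a+", "", False),        # + needs at least one
    (r"a{3}", "aaa", True),    # exactly 3
    (r"a{3}", "aa", False),
    (r"a{2,}", "aaaaa", True), # 2 or more
    (r"a{,2}", "", True),      # 0 up to 2
    (r"a{,2}", "aaa", False),
    (r"a{2,4}", "aaa", True),  # from 2 to 4
    (r"a{2,4}", "aaaaa", False),
]

for pattern, text, expected in EXAMPLES:
    assert matches(pattern, text) == expected
```

In a flavor that wants the backslashes, such as POSIX basic regular expressions in grep without -E, the third pattern would be written a\{3\} instead.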

I'd love to hear people's thoughts on the conference, or any questions. Thanks for sharing.