Tuesday, June 19, 2012

Regular Expressions that Cheat

Sometimes I see a regular expression, and I think "Cheater!" Here's an example: (\d+\.){3}\d+ used to match an IPv4 address. It cheats because it matches a broader set of strings than the set of all legal IPv4 addresses. It will match 127.0.0.1 (legal) as well as 999.999.999.999 (illegal).
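
For comparison, here is a rough sketch in Ruby of the cheating expression next to a stricter one. The names LOOSE_IPV4, OCTET, and STRICT_IPV4 are made up for illustration, and the strict version makes an assumption of its own: it rejects octets with leading zeros such as 01.

```ruby
# The "cheating" expression from above, anchored to the whole string.
LOOSE_IPV4 = /\A(\d+\.){3}\d+\z/

# One octet, 0 through 255, without leading zeros (an assumption of this sketch).
OCTET = /(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/

# Four octets separated by dots, anchored to the whole string.
STRICT_IPV4 = /\A#{OCTET}(?:\.#{OCTET}){3}\z/

# Show how each expression treats a legal and two illegal addresses.
%w[127.0.0.1 999.999.999.999 256.0.0.1].each do |candidate|
  puts format("%-16s loose=%-5s strict=%s",
              candidate,
              LOOSE_IPV4.match?(candidate),
              STRICT_IPV4.match?(candidate))
end
```

The loose expression accepts all three candidates; the strict one accepts only 127.0.0.1.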

Recently, I was writing some expressions to normalize a log file, and found myself "cheating." The question then is "When is it OK to cheat?" I think it's OK to cheat when you know the input well, and know that it is well formed. This might be the case, for example, if I'm parsing the output of a program that I wrote. Then I should have a pretty good idea of all the possible outputs of the program.

As it happens, I did not know the log file I was normalizing very well, so I went back and rewrote several expressions. Here's one of the rewritten expressions:

^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}

This was to match a timestamp with microseconds. There's a similar regex to match IPv4 addresses in the SELF slides.
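
To see what the stricter expression buys, here is a small Ruby sketch that runs it against a few lines. The sample lines and the LOOSE_TS expression are hypothetical, invented for this comparison; only STRICT_TS is the expression shown above.

```ruby
# The rewritten expression from above, copied verbatim.
STRICT_TS = /^20\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],\d{5}/

# A hypothetical "cheating" version that only checks the shape of the line.
LOOSE_TS = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+/

# Made-up sample lines; the real log format is an assumption here.
samples = [
  "2012-06-19 14:03:59,123456 INFO ok",
  "2012-13-19 14:03:59,123456 INFO month 13",
  "2012-06-19 25:03:59,123456 INFO hour 25",
  "2012-06-31 14:03:59,123456 INFO June 31",
]

# Report which expressions accept each line.
samples.each do |line|
  puts format("loose=%-5s strict=%-5s %s",
              LOOSE_TS.match?(line), STRICT_TS.match?(line), line)
end
```

The strict expression rejects month 13 and hour 25, but it still accepts June 31, so even the rewrite cheats a little.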

If anyone feels differently about cheating, feel free to comment.

Monday, June 11, 2012

SELF 2012 Retrospective

The conference has come and gone. I think I had even more fun this year than last. Next year I'll definitely bring an offline copy of the slides, just in case. If all else had failed, I suppose I could have asked everyone to pull the slides up on their smart phones - wouldn't that have been a laugh.

The ride home to Georgia on the motorcycle had some exciting moments too. I got rained on a couple of times, but nothing so bad as to make me want to stop and put on my rain gear. I did notice I'd lost one of the bolts holding on my windshield. I occasionally had these little panicky thoughts: what if it flies off and hits me while I'm going 75 mph down the highway? Still, if I'd wanted a completely boring trip, I'd have taken a Greyhound bus.

Here's a quick review of things I said that weren't in the slides:

  • + is like * but means 1 or more repetitions
  • {N} means exactly N repetitions
  • {N,} means N or more repetitions
  • {,N} means 0 up to N repetitions
  • {N,M} means from N to M repetitions

Depending on the flavor of regular expressions you're dealing with, you may need to put a \ in front of the curly braces to use the above behavior.
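
The repetition operators above can be tried out with a few lines of Ruby. This is only a sketch in Ruby's flavor, where the braces need no backslash; whole_match? is a hypothetical helper written for this example, and {,N} in particular is not understood by every engine.

```ruby
# Hypothetical helper: true if the pattern matches the entire string.
def whole_match?(pattern, text)
  /\A(?:#{pattern})\z/.match?(text)
end

# Each pattern is paired with strings of zero to four a's to try.
checks = {
  "a+"     => ["", "a", "aaa"],
  "a{3}"   => ["aa", "aaa", "aaaa"],
  "a{2,}"  => ["a", "aa", "aaaa"],
  "a{,2}"  => ["", "aa", "aaa"],
  "a{2,3}" => ["a", "aa", "aaa", "aaaa"],
}

# Print which strings each pattern accepts in full.
checks.each do |pattern, strings|
  strings.each do |text|
    puts format("%-7s %-6s %s", pattern, text.inspect, whole_match?(pattern, text))
  end
end
```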

I'd love to hear people's thoughts on the conference, or any questions. Thanks for sharing.

Friday, November 11, 2011

Quick Tips - grep ps output

I use regular expressions in plenty of programs and scripts, but even more than that I use them on the command line. Here's a quick tip on grep'ing the output of ps.

Instead of this tired old command line trope:
ps -ef | grep foo | grep -v grep
try this:
ps -ef | grep '[f]oo'

$ ps -ef | grep vim
dg 14823 14739 0 15:12 pts/0 00:00:00 grep vim
dg 21295 13905 0 Nov09 pts/0 00:00:01 vim -R main.c
$ ps -ef | grep vim | grep -v grep
dg 21295 13905 0 Nov09 pts/0 00:00:01 vim -R main.c
$ ps -ef | grep '[v]im'
dg 21295 13905 0 Nov09 pts/0 00:00:01 vim -R main.c
This works because the regex [v]im contains a character class matching the single letter v. The pattern matches the string vim in the ps output, but it does not match the grep process's own entry: that command line contains the literal text [v]im, and the character class [v] does not match the string [v].
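
The same effect can be reproduced away from the shell. The Ruby sketch below filters two made-up ps lines, modeled on the transcript above; PS_LINES is a hypothetical stand-in for real ps output.

```ruby
# Made-up ps output: the grep process itself, then the real vim process.
PS_LINES = [
  "dg 14823 14739 0 15:12 pts/0 00:00:00 grep [v]im",
  "dg 21295 13905 0 Nov09 pts/0 00:00:01 vim -R main.c",
]

# The character class [v] matches the letter v, so the pattern needs "vim".
# The grep line only contains "[v]im", which has no "vim" substring.
found = PS_LINES.grep(/[v]im/)

puts found
```

Only the vim -R main.c line survives the filter, with no need for a second grep -v grep stage.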

Friday, June 17, 2011

A Rant: Comments in Regex

I've followed a tenet for a long time that comments in code should explain why, not how. This tenet is founded on years of debugging code. When a programmer writes a comment explaining how something works, they are actually saying how they think it should work, which is rarely how it actually works. These types of comments are actually harmful in that they can prejudice the reader's understanding and hide flaws which might otherwise be visible. Since I can't really know what's in a comment until I read it, I try to read the code first without reading the comments, to form my own idea of its workings.

Comments in regular expressions typically say how something matches, not why. When a regular expression is not matching as expected, these sorts of comments are worse than useless, tainting expectations of the code at hand. Given the terseness and complexity of large regular expressions, it hardly matters whether the author of the expression and comments is separate from the reader or one and the same. When looking at my old code, I regularly have to stop and think about a regex, but I believe that it is time well spent. Every time I find a bug, it reinforces that programming is a human endeavor, and as such is never perfect. Recently, I used one of my old scripts as an example in the SELF 2011 talk. This is a script I've used for years, and as I was looking at it on the slide, I saw where I had put s,^./,, when I meant s,^\./,,.
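
The difference between those two substitutions is easy to miss, so here is a small Ruby sketch of it. The file names are made up, and BUGGY and FIXED are just labels for this example: the unescaped dot matches any character, so the buggy version also strips a one-letter directory name.

```ruby
# Unescaped dot: matches ANY single character followed by a slash.
BUGGY = %r{^./}

# Escaped dot: matches only a literal "./" prefix.
FIXED = %r{^\./}

# Made-up paths: one with the intended "./" prefix, one without.
["./main.c", "a/main.c"].each do |path|
  puts format("%-9s buggy=%-9s fixed=%s",
              path, path.sub(BUGGY, ""), path.sub(FIXED, ""))
end
```

Both versions turn ./main.c into main.c, which is why the bug can go unnoticed for years; only the buggy one also turns a/main.c into main.c.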

Abstraction is a tool for managing complexity, and it can be used with regular expressions too. Here's an example from a Ruby class:

numrange = /\d+(?:-\d+)?/
numlist = /#{numrange}(?:,#{numrange})*/
step = /\/\d+/
numspec = /(?:\*|#{numlist})(?:#{step})?/

This code tries to be self-documenting, so the intent is explicit in the choice of names. This is akin to a comment, but is simple enough to be instantly validated. Since more complex patterns are built from prior abstractions, each can be understood and validated with little effort.
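
One way to validate each layer is to anchor the final pattern and throw a few strings at it. The sketch below repeats the four definitions so it runs on its own; the anchored pattern whole_spec and the sample strings are additions for this example, not part of the original class.

```ruby
# The four building blocks, repeated from above.
numrange = /\d+(?:-\d+)?/
numlist = /#{numrange}(?:,#{numrange})*/
step = /\/\d+/
numspec = /(?:\*|#{numlist})(?:#{step})?/

# Hypothetical addition: anchor numspec so the whole string must match.
whole_spec = /\A#{numspec}\z/

# Made-up inputs: three well-formed specs, then three malformed ones.
["*", "*/15", "1-5,10,20-30/2", "1-", "5,,6", ""].each do |text|
  puts format("%-16s %s", text.inspect, whole_spec.match?(text))
end
```

Because each named piece is small, a failing case points fairly directly at the layer that needs a second look.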

Monday, June 13, 2011

My fond regards to PHP

Advanced regex with PHP

Some love for the PHP world in this one... and a reminder to myself that a rant on comments in regex is overdue.