Sunday, April 24, 2011

Regex for Bash-style Variables

Returning to the regex from Friday's post, the first thing to understand is context.  What should it match and what should it not match?  Here, I want to match key-value pairs.  These pairs are written using Bash (or rather Bourne shell) syntax, for example:
FOO=42
BAR='easter bunny'
Trivially, .* will match these lines, or any line for that matter, so what should our regex not match?  I'll start by excluding comments at the end of the line.
FOO=42 # answer to the question
BAR='easter bunny' # hippity hoppity
A regex of [^#]* would do the trick, except that a # character might appear within a quoted string.
FOO=42 # answer to the question
BAR='easter bunny #2' # hippity hoppity
A regex of '[^']*' works for the quoted string, but not for the text leading up to the quote and not for the line without quotes.  These two regexes can be combined with alternation (|), but the results aren't quite right.
$ cat in
FOO=42 # answer to the question
BAR='easter bunny #2' # hippity hoppity
$ perl -ne 'print "$&\n" if '"m/[^#]*|'[^']*'/" in
FOO=42
BAR='easter bunny
$ perl -ne 'print "$&\n" if '"m/'[^']*'|[^#]*/" in
FOO=42
BAR='easter bunny
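The matches are the same as for [^#]* alone, regardless of the ordering.  Perl takes the leftmost match, and at the start of the line only [^#]* can match, because the line does not begin with a quote.  That branch then runs straight through the opening quote and stops at the first #, so the '[^']*' branch never gets the chance to consume the whole quoted string.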
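For readers who want to poke at this outside of Perl, here is a minimal sketch of the same experiment in Python, whose re module also tries alternatives left to right.  The names lines, patterns, and results are my own, not from the original one-liners, and re.match is used on the assumption that the match starts at the beginning of the line, which is where Perl finds it here.

```python
import re

# The two sample lines from the "in" file above.
lines = [
    "FOO=42 # answer to the question",
    "BAR='easter bunny #2' # hippity hoppity",
]

# The combined regex in both orderings.
patterns = [r"[^#]*|'[^']*'", r"'[^']*'|[^#]*"]

results = {}
for pattern in patterns:
    # re.match anchors at position 0, where only the [^#]* branch can
    # succeed because neither line starts with a quote.
    results[pattern] = [re.match(pattern, line).group(0) for line in lines]
    print(pattern, results[pattern])
```

Both orderings give the same two matches, each cut off just before the first #.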

My favorite solution to this problem is to weight the combined regex in favor of the quote-matching portion.  I do this by removing the * from [^#]*.  The combined regex then matches either a whole quoted string or a single non-# character.  I then put a * on the combined regex, so it repeatedly consumes quotes or single non-# characters.
$ perl -ne 'print "$&\n" if '"m/('[^']*'|[^#])*/" in
FOO=42
BAR='easter bunny #2'
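The same idea can be sketched in Python, again assuming its re module behaves like Perl for this pattern.  The helper strip_comment and the names quote_first and char_first are hypothetical, chosen for this illustration.  The sketch also tries the two alternatives in the opposite order, which is worth comparing.

```python
import re

lines = [
    "FOO=42 # answer to the question",
    "BAR='easter bunny #2' # hippity hoppity",
]

# Quote-matching alternative first, as in the Perl one-liner above.
quote_first = re.compile(r"('[^']*'|[^#])*")
# Same alternatives in the opposite order, for comparison.
char_first = re.compile(r"([^#]|'[^']*')*")

def strip_comment(pattern, line):
    """Return the leading part of line matched by pattern.

    rstrip() is an addition of this sketch: it drops the space that
    the match leaves in front of the comment.
    """
    return pattern.match(line).group(0).rstrip()

good = [strip_comment(quote_first, line) for line in lines]
bad = [strip_comment(char_first, line) for line in lines]
print(good)
print(bad)
```

With the quote alternative first, the whole quoted string survives, including its #.  With the single-character alternative first, the opening quote is consumed as an ordinary character and the match stops at the # inside the string.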
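Now the order of the subexpressions is important: the quote-matching alternative must come first, because Perl tries alternatives left to right, and [^#] listed first would consume the opening quote as an ordinary character.  Next time, I'll expand this to handle double-quoted strings.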
