I was writing a regular expression yesterday and this popped up. It's just a quick note about greedy vs. non-greedy mode in regular expression matching. Say I have a regular expression that looks something like this:
/(\[.*\])/
In English that says something roughly like: find an opening bracket [, followed by 0 or more of any character, followed by a closing bracket ]. The backslashes escape the brackets, and the parentheses specify grouping so we can later access the matched text.
Greedy mode comes into play with the 0 or more characters part of the match (the .* part of the expression). Greedy is the default, and it means the regex engine will gobble up as many characters as it can and match the very last closing bracket. So if you have text like this (and . is allowed to match newlines, which in Ruby takes the /m modifier):
a = [:foo, :bar]
b = [:hello, :world]
The resulting grouped match would be this:
[:foo, :bar]
b = [:hello, :world]
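Here is a small sketch of that greedy behavior in Ruby. It assumes the two lines sit in one string (the variable names are just for illustration). In Ruby the dot does not match a newline unless the /m modifier is given, so the match only runs across both lines with /m:

```ruby
# Two-line sample text, held in a single string.
text = "a = [:foo, :bar]\nb = [:hello, :world]"

# Without /m, "." stops at the newline, so the match stays on the first line.
single_line = text[/(\[.*\])/, 1]

# With /m, "." also matches newlines and the greedy .* runs to the last "]".
greedy = text[/(\[.*\])/m, 1]

puts single_line
puts greedy
```

The same thing happens on a single line that contains more than one pair of brackets, with no modifier needed.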
If you just wanted the [:foo, :bar] part, the solution is to match in non-greedy mode by adding a ? after the *. This means that the match stops at the first closing bracket it finds. The modified regular expression looks like this:
/(\[.*?\])/
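A quick sketch of the non-greedy version against the same sample text, again assuming it lives in one string. The /m modifier is kept only to show that the match stops at the first closing bracket even when the dot is allowed to cross lines:

```ruby
text = "a = [:foo, :bar]\nb = [:hello, :world]"

# .*? takes as few characters as possible, so it stops at the first "]".
lazy = text[/(\[.*?\])/m, 1]

# scan returns every non-overlapping match, one per bracketed list.
all_lists = text.scan(/\[.*?\]/m)

puts lazy
p all_lists
```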
I love the regular expression engine in Ruby. It's one of the best things it ripped off from Perl. The one thing I don't like is the magic global variables that it places matched groups into. You can access the first group through the $1 variable. If you're unfamiliar with regular expressions, a good place to start is the Camel book (Programming Perl). It's about Perl, but the way regexes work in Ruby is very similar. I actually haven't seen good coverage of regexes in a Ruby book.
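For what it's worth, the global isn't the only way in. As a rough sketch, String#match returns a MatchData object that holds the groups, so you can skip $1 if it bothers you. The group name list below is made up for the example:

```ruby
text = "a = [:foo, :bar]"

# $1 is set as a side effect of a successful match.
text =~ /(\[.*?\])/
via_global = $1

# String#match returns a MatchData object; index 1 is the first group.
m = text.match(/(\[.*?\])/)
via_match_data = m[1]

# Named groups are another option ("list" is a hypothetical name).
named = text.match(/(?<list>\[.*?\])/)[:list]

puts via_global
puts via_match_data
puts named
```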