Groovy: Don’t Fear the RegExp
2008/05/19
UPDATE: if you’re using Groovy 1.6.1 or greater (released April 2009), check out the new find and findAll methods in this post.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions!”
Now they have two problems. - Jamie Zawinski
There is a common and well-earned aversion in the Java world to
regular expressions. Prior to Java 1.4, regular expressions weren’t
even part of the standard library. Post 1.4, using regular expressions is
still a painful task of working with Pattern and Matcher objects, and a
lot of typing is involved to make anything happen. It’s difficult enough
that most Java devs don’t use them often enough to remember
how to read a regular expression, so they need to dig up the JavaDocs (or cut and paste an old example) every time they want to use one.
This aversion has persisted into the Groovy community to a level that
I haven’t seen in other dynamic scripting languages like Ruby, Python,
and (obviously) Perl.
The regexp docs that currently pop up in a Google search
are all outdated and don’t use any of the best techniques
available in the Groovy 1.5.x and 1.6-beta releases.
The recent Groovy Recipes book doesn’t have an entry for regular
expressions in the index, and I was unable to find a single example
of a regular expression in the entire book.
This is unfortunate because Groovy makes using regular expressions much
easier than in Java. Under the covers, you’re still working with the
same old Java Pattern and Matcher objects, but the Groovy syntax and
additions to those classes are pleasant to work with.
String Escaping with Slashy Strings
Groovy adds a new type of string escaping, Slashy Strings, that can
be used to make your regular expressions easier to read. Forward
slashes around text create String objects, just like quotes do. Unlike
quoted strings, you don’t have to escape backslashes with another
backslash in a Slashy String:
assert java.lang.String == /foo/.class
assert ( /Count is \d/ == "Count is \\d" )
You can also use Groovy expressions in Slashy Strings, just like double-quoted GStrings:
def name = "Ted Naleid"
assert ( /$name/ == "Ted Naleid" )
assert ( /$name/ == "$name" )
There isn’t anything specific to regular expressions with Slashy
Strings, but many regular expressions use shorthand character classes
such as \d (digit), \s (whitespace character), \b (word boundary),
etc. The JavaDocs for Pattern actually have a nice reference for regular expression character classes if you’re not familiar with them.
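Here is a minimal sketch of those shorthand classes in action, runnable as a plain Groovy script. The sample strings are made up for illustration.

```groovy
// Slashy Strings keep the backslashes readable; the quoted form needs them doubled
assert "\\d\\s\\b" == /\d\s\b/

// \d matches a single digit, so this pattern matches exactly three digits
assert "612" ==~ /\d\d\d/

// \s matches a whitespace character such as a space or a tab
assert "a b" ==~ /a\sb/
assert "a\tb" ==~ /a\sb/

// \b matches the zero-width position between a word character and a non-word character,
// so "two" is only replaced when it stands alone as a word
assert "one two".replaceAll(/\btwo\b/, "2") == "one 2"
assert "network".replaceAll(/\btwo\b/, "2") == "network"
```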
Groovy Regular Expression Operators
Groovy adds 3 new operators:
- “~”: used before a string, it causes the string to be compiled to a Pattern for later use
- “=~”: creates a Matcher out of the String on the left hand side and the Pattern on the right
- “==~”: returns a boolean that specifies if the full String matches the Pattern
// \b means word boundary, [A-Z] means any capital letter, + means one or more,
// so this matches any run of one or more capital letters with a word boundary on either side of it
def shoutedWord = ~/\b[A-Z]+\b/
def matcher = ("EUREKA" =~ shoutedWord)
assert matcher.matches() // TRUE
def numberMatcher = "1234" =~ /\d+/
assert numberMatcher.matches() // TRUE
assert "1234" ==~ /\d+/ // TRUE
assert !("FOO2" ==~ /\d+/) // FALSE!!! the whole string has to match
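The difference between a partial and a full match is easy to trip over, so here is a small sketch comparing the three operators on the same input. It relies on my understanding of Groovy truth for Matchers: used as a boolean, a Matcher is true when the pattern is found anywhere in the string.

```groovy
def text = "FOO2"

// ==~ requires the whole string to match the pattern
assert !(text ==~ /\d+/)

// =~ builds a Matcher; used as a boolean it is true when the pattern is found anywhere
assert text =~ /\d+/

// matches() on that Matcher still checks the whole string, just like ==~
assert !(text =~ /\d+/).matches()

// ~ only compiles the pattern; the matching happens later
def digits = ~/\d+/
assert digits instanceof java.util.regex.Pattern
assert digits.matcher(text).find()
```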
Enhancements to the String Class
In Groovy, the String class offers a few “replace*” methods that allow
you to leverage regular expressions. The basic forms come from Java’s
own String class, where they mirror methods on the Matcher class, and
Groovy adds variants such as a replaceAll that takes a closure. Having
them directly on String puts them right at your fingertips.
replaceFirst will replace the first substring matched by a regular expression within the specified String:
assert "Green Eggs and Spam" == "Spam Spam".replaceFirst(/Spam/, "Green Eggs and")
replaceAll will replace all matching substrings within the specified String:
assert "The armor was colored silver" == "The armour was coloured silver".replaceAll(/ou/, "o")
There is an alternate version of replaceAll that takes a
closure for the second parameter. This is especially useful in the
situations where you want to manipulate the matched value, or groups
within the match to dynamically determine the replacement text.
For example, if we wanted to turn a dashed phrase
(“foo-bar”) into a camel case word (“fooBar”), we can’t just remove all
dash characters; we also need to capitalize the first letter after the
dash (the “B” in “fooBar”).
To do this, we can use a regular expression that captures the first letter after a dash in a group using parentheses.
def dashedToCamelCase(orig) {
    // the regular expression is a dash, followed by parentheses that form a group holding the word's first character
    orig.replaceAll(/-(\w)/) { fullMatch, firstCharacter -> firstCharacter.toUpperCase() }
}
assert "firstName" == dashedToCamelCase("first-name")
assert "oneTwoThreeFourFiveSixSevenEight" == dashedToCamelCase("one-two-three-four-five-six-seven-eight")
Using the version of replaceAll that takes a closure
gives us a chance to manipulate the first character of the word and
capitalize it. This closure is always passed the full matched text of
the regular expression as the first value, and then any groups as
subsequent values.
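As a further sketch of the same technique, the conversion can be run in the other direction. The camelToDashed helper is a hypothetical name invented for this example, and it only handles simple ASCII camel case.

```groovy
// Hypothetical inverse of dashedToCamelCase: "firstName" -> "first-name"
def camelToDashed(String orig) {
    // the group captures each capital letter; the closure receives the
    // full match first and then the captured group
    orig.replaceAll(/([A-Z])/) { fullMatch, capital ->
        "-" + capital.toLowerCase()
    }
}

assert camelToDashed("firstName") == "first-name"
assert camelToDashed("oneTwoThree") == "one-two-three"
```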
Here we modify a phone number and keep the area code group, but replace the exchange and station number with hash marks:
assert "612-###-####" == "612-555-1212".replaceAll(/(\d{3})-(\d{3})-(\d{4})/) { fullMatch, areaCode, exchange, stationNumber ->
    assert fullMatch == "612-555-1212"
    assert areaCode == "612"
    assert exchange == "555"
    assert stationNumber == "1212"
    return "$areaCode-###-####"
}
Enhancements to Collections
Groovy also makes significant additions to what you can do with
Collections. In addition to each, collect, inject, etc., there is a
regular-expression-aware method called grep
that will pass each item in the Collection through a filter and return the
subset of items that match the filter. We can use a regular expression
as a filter:
// the regular expression says 0 or more characters (".*") followed by the string "bar" at the end of the string ("$")
assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].grep(~/.*bar$/)
You can achieve the same thing with findAll but it takes a little more typing:
assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].findAll { it ==~ /.*bar$/ }
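One detail worth spelling out is that grep with a Pattern appears to require the whole item to match, the same as ==~, whereas =~ inside findAll only needs the pattern to be found somewhere. This sketch reuses the word list from above to show the difference.

```groovy
def words = ["foobar", "bazbar", "barquux"]

// grep with a Pattern keeps only the items that the pattern matches in full
assert words.grep(~/bar/) == []
assert words.grep(~/bar.*/) == ["barquux"]

// findAll with =~ keeps the items where the pattern is found anywhere
assert words.findAll { it =~ /bar/ } == words

// findAll with ==~ behaves like grep: the whole item must match
assert words.findAll { it ==~ /bar.*/ } == ["barquux"]
```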
Working with Matchers
As we’ve seen, using the =~ operator will return a
Matcher object. Many of the existing regular expression examples on the
web work by treating the Matcher as a list and getting the first
(zero-based) element out of the list:
def matcher = "foobazaarquux" =~ "o(b.*r)q"
assert ["obazaarq", "bazaar"] == matcher[0]
assert "bazaar" == matcher[0][1] // get the first grouping of the first match
This is a little fragile as matcher[0] will throw an error if there was not actually a match. Calling matches() doesn’t help as matches only checks if the regular expression matches the WHOLE string:
assert !("foobazaarquux" =~ "o(b.*r)q").matches() // false! the pattern only covers part of the string
assert ("foobazaarquux" =~ ".*(b.*r).*").matches() // true, ".*" matches 0 or more chars of any type
You can check getCount() to see how many matches there were for some safety:
def m = "foobar" =~ /quux/
if (m.getCount()) {
    // example won't get here as "quux" doesn't exist in "foobar", the count is 0
    println m[0]
}
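Another way to guard against a missing match is to ask the Matcher whether it can find anything before reading a group. The firstGroup helper below is a hypothetical name for this sketch; find() and group() are the standard java.util.regex.Matcher methods.

```groovy
// Hypothetical helper: returns the first captured group of the first match,
// or null when the pattern is not found anywhere in the text
def firstGroup(String text, String regex) {
    def m = text =~ regex
    // find() looks for the next match; group(1) is the first parenthesized group
    m.find() ? m.group(1) : null
}

assert firstGroup("foobazaarquux", /o(b.*r)q/) == "bazaar"
assert firstGroup("foobar", /(quux)/) == null
```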
A groovier way to work with Matchers leverages collection iterators
and the built-in closures that Groovy provides to them. Matcher
supports the iterator() method and, with that, gets everything else that any Groovy List or Collection would have, including collect, inject, findAll, etc.
def paragraph = """
Lorem ipsum dolor 12:30 AM sit amet, consectetuer adipiscing 1:15 AM elit.
Nunc rutrum diam sagittis nisi 9:22 PM.
"""
def HOUR = /10|11|12|[0-9]/
def MINUTE = /[0-5][0-9]/
def AM_PM = /AM|PM/
def time = /($HOUR):($MINUTE) ($AM_PM)/
assert ["12:30 AM", "1:15 AM", "9:22 PM"] == (paragraph =~ time).collect { it }
assert ["12:30 AM", "1:15 AM"] == (paragraph =~ time).grep(~/.*AM$/)
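Here is a smaller sketch of treating a Matcher like a collection. The pattern deliberately has no groups, so each item handed to the closure should be the matched text itself; the input string is invented for the example.

```groovy
def text = "foo1 bar30 foo27 baz9 foo600"

// with no groups in the pattern, each item is the matched String
def numbers = (text =~ /\d+/).collect { it.toInteger() }

assert numbers == [1, 30, 27, 9, 600]
assert numbers.sum() == 667
assert numbers.findAll { it > 20 } == [30, 27, 600]
```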
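A limitation of the iterator-based methods in the Groovy versions
discussed here is that they don’t give you access to the individual
groups (hour, minute, am/pm), just the full matched string (“12:30 AM”).
(Newer Groovy releases instead pass a list of the full match plus its
groups when the pattern contains groups, so check how your version
behaves.) The each method is more powerful because, as it iterates
through, it passes the full match as well as each of the individual
groups into the closure.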
("foo1 bar30 foo27 baz9 foo600" =~ /foo(\d+)/).each { match, digit -> println "+$digit" }
// result:
// +1
// +27
// +600
Another example (using the paragraph and time pattern from above) showing how to pretty print all of the timestamps:
(paragraph =~ time).each { match, hour, minute, amPm ->
    println "$hour:$minute ${amPm == 'AM' ? 'this morning' : 'this evening'}"
}
// result:
// 12:30 this morning
// 1:15 this morning
// 9:22 this evening
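The same each technique works for pulling structured data out of text. This is a minimal sketch that parses key=value pairs into a map; the settings string and variable names are assumptions made up for the example.

```groovy
def settings = "width=640, height=480, depth=24"
def parsed = [:]

def pairs = settings =~ /(\w+)=(\d+)/
// each match is passed as the full match followed by its two groups
pairs.each { match, key, value ->
    parsed[key] = value.toInteger()
}

assert parsed == [width: 640, height: 480, depth: 24]
```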
Regular expressions are a powerful tool that Groovy makes as
accessible as any other top-tier scripting language. Using techniques
to break more complicated regular expressions into their component
pieces can make them much more readable (as in the time example above).
If you’re doing any sort of string processing beyond a simple contains or split, regular expressions in Groovy can turn mountains of Java into a couple of lines of code.