I could never quite use Java's regular expression classes, Pattern and Matcher, properly, so I looked up the material below and have been putting it to good use.

The source is as follows:

Ted Naleid

http://naleid.com/blog/2008/05/19/dont-fear-the-regexp/

I did not specifically ask for permission, send an email, or say anything along the lines of "I beg your pardon".
Because that would have meant having to use English…

Start
============================================================================================

Groovy: Don’t Fear the RegExp

2008/05/19

UPDATE: if you’re using Groovy 1.6.1 or greater (released April 2009), check out the new find and findAll methods in this post.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions!”
Now they have two problems. — Jamie Zawinski

There is a common and well-earned aversion in the Java world to
regular expressions. Prior to Java 1.4, regular expressions weren’t
even part of the core language. Post 1.4, using regular expressions is
still a painful task of working with Pattern and Matcher objects. Lots
of typing is involved to make anything happen. It’s difficult enough
that most Java devs don’t end up using them enough to actually remember
how to read a regular expression, and they need to dig up the JavaDocs (or cut and paste an old example), every time they want to use them.

This aversion has persisted into the Groovy community to a level that
I haven’t seen in other dynamic scripting languages like Ruby, Python,
and (obviously) Perl.

The current regexp docs that pop up when doing a Google search
are all outdated and don’t use any of the best techniques that are
available in the Groovy 1.5.X and 1.6-beta code that is now available.
The recent Groovy Recipes
book doesn’t have an entry for regular expressions in the index, and I
was unable to find a single example of a regular expression in the
entire book.

This is unfortunate because Groovy makes using regular expressions much
easier than in Java. Under the covers, you’re still working with the
same old Java Pattern and Matcher objects, but the Groovy syntax and
additions to those classes are pleasant to work with.
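
As a rough illustration of that claim, the sketch below builds a Matcher the long Java way and then with Groovy's =~ operator (introduced further down), and checks that both are the same class. The sample text and variable names are my own assumptions, not part of the original post:

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

// Java style: compile a Pattern, ask it for a Matcher, then search.
Pattern p = Pattern.compile("\\d+")
Matcher m = p.matcher("order 66")
assert m.find()
assert m.group() == "66"

// Groovy style: the =~ operator builds the same kind of object in one step.
def gm = "order 66" =~ /\d+/
assert gm.find()
assert gm.group() == "66"

// Both are plain java.util.regex.Matcher instances.
assert gm.class == m.class
```

Nothing Groovy-specific is stored in the result, so any existing Java code that accepts a Matcher should be able to take either one.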

String Escaping with Slashy Strings

Groovy adds a new type of string escaping, Slashy Strings, that can
be used to make your regular expressions easier to read. Forward
slashes around text create String objects, just like quotes do. Unlike
quoted strings, you don’t have to escape backslashes with another
backslash in a Slashy String:

assert java.lang.String == /foo/.class
assert ( /Count is \d/ == "Count is \\d" )

You can also use Groovy expressions in Slashy Strings, just like double-quoted GStrings:

def name = "Ted Naleid"
assert ( /$name/ == "Ted Naleid" )
assert ( /$name/ == "$name" )

There isn’t anything specific to regular expressions with Slashy
Strings, but many regular expressions use shorthand character classes
such as \d (digit), \s (whitespace character), \b (word boundary)
etc. The JavaDocs for Pattern actually have a nice reference for regular expression character classes if you’re not familiar with them.
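
To make the escaping difference concrete, here is a small sketch that writes one pattern both ways and tries it on two strings. The pattern and the sample strings are invented for illustration:

```groovy
// The same pattern written two ways; the slashy form avoids doubled backslashes.
def quoted = "\\b\\d{3}\\s\\w+\\b"
def slashy = /\b\d{3}\s\w+\b/
assert quoted == slashy

// \d = digit, \s = whitespace, \w = word character, \b = word boundary.
// A Matcher used as a boolean is true when the pattern is found somewhere in the text.
assert "room 101 east" =~ slashy

// No word boundary before "101" and no whitespace after it, so nothing is found.
assert !("room101east" =~ slashy)
```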

Groovy Regular Expression Operators

Groovy adds 3 new operators:

  • “~” – used before a string, it causes the string to be compiled to a Pattern for later use:

    // \b means word boundary, [A-Z] means any capital letter, + means one or more
    // so this matches any string of one or more capital letters with a word boundary (the start or end of a word) on either side of it
    def shoutedWord = ~/\b[A-Z]+\b/

  • “=~” – creates a Matcher out of the String on the left hand side and the Pattern on the right:

    def matcher = ("EUREKA" =~ shoutedWord)
    assert matcher.matches()         // TRUE

    def numberMatcher = "1234" =~ /\d+/
    assert numberMatcher.matches()   // TRUE

  • “==~” – returns a boolean that specifies if the full String matches the Pattern:

    assert "1234" ==~ /\d+/       // TRUE
    assert !("FOO2" ==~ /\d+/)    // "FOO2" ==~ /\d+/ is FALSE, hence the negation
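
The three operators are easy to mix up, so here is a compact sketch that puts them side by side on one input. The variable names are assumptions of mine; the point is that =~ yields a Matcher that can find a partial match, while ==~ only answers whether the whole string matches:

```groovy
def digits = ~/\d+/                        // "~" compiles a Pattern
assert digits instanceof java.util.regex.Pattern

def m = "FOO2" =~ digits                   // "=~" creates a Matcher
assert m instanceof java.util.regex.Matcher
assert m.find()                            // a substring matches...
assert m.group() == "2"                    // ...namely the trailing digit

assert !("FOO2" ==~ digits)                // "==~" requires the whole string to match
assert "2" ==~ digits
```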

Enhancements to the String Class

In Groovy, the String class has been enhanced with a few “replace*”
methods that allow you to leverage regular expressions. These methods
originally come from the Matcher class, but attaching them directly to
String puts them right at your fingertips.

replaceFirst will replace the first substring matched by a regular expression within the specified String:

assert "Green Eggs and Spam" == "Spam Spam".replaceFirst(/Spam/, "Green Eggs and")

replaceAll will replace all matching substrings within the specified String:

assert "The armor was colored silver" == "The armour was coloured silver".replaceAll(/ou/, "o")

There is an alternate version of replaceAll that takes a
closure for the second parameter. This is especially useful in the
situations where you want to manipulate the matched value, or groups
within the match to dynamically determine the replacement text.

For example, if we wanted to be able to turn a dashed phrase
(“foo-bar”) into a camel case word (“fooBar”) we can’t just remove all
dash characters, we also need to make the first letter after the dash
capitalized (the “B” in “fooBar”).

To do this, we can use a regular expression that captures the first letter after a dash in a group using parentheses.

 
def dashedToCamelCase(orig) {
    // the regular expression is a dash followed by parentheses, which form a group holding the word's first character
    orig.replaceAll(/-(\w)/) { fullMatch, firstCharacter -> firstCharacter.toUpperCase() }
}
 
assert "firstName" == dashedToCamelCase("first-name")
 
assert "oneTwoThreeFourFiveSixSevenEight" == dashedToCamelCase("one-two-three-four-five-six-seven-eight")

Using the version of replaceAll that takes a closure
gives us a chance to manipulate the first character of the word and
capitalize it. This closure is always passed the full matched text of
the regular expression as the first value, and then any groups as
subsequent values.
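
As a further sketch of the same mechanism, the reverse conversion (camel case back to a dashed phrase) can be written with a single group. The name camelToDashed is hypothetical and not from the original post:

```groovy
// Hypothetical helper: each capital letter is captured in a group and
// replaced with a dash followed by its lowercase form.
def camelToDashed = { String orig ->
    orig.replaceAll(/([A-Z])/) { fullMatch, upper -> "-" + upper.toLowerCase() }
}

assert camelToDashed("firstName") == "first-name"
assert camelToDashed("oneTwoThree") == "one-two-three"
```

Here fullMatch and upper happen to hold the same text, because the group spans the entire match.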

Here we modify a phone number and keep the area code group, but replace the exchange and station number with hash marks:

assert "612-###-####" == "612-555-1212".replaceAll(/(\d{3})-(\d{3})-(\d{4})/) { fullMatch, areaCode, exchange, stationNumber ->
    assert fullMatch == "612-555-1212" 
    assert areaCode == "612"
    assert exchange == "555"
    assert stationNumber == "1212"
    return "$areaCode-###-####"
}
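
The groups can also be reordered rather than masked. The following sketch, with a made-up isoToUs closure and sample sentence, rewrites every yyyy-MM-dd date in a string as MM/dd/yyyy; whatever the closure returns replaces the whole match, not just one group:

```groovy
// Hypothetical helper: reorder the captured year, month and day groups.
def isoToUs = { String text ->
    text.replaceAll(/(\d{4})-(\d{2})-(\d{2})/) { full, year, month, day ->
        "$month/$day/$year"
    }
}

assert isoToUs("Released 2008-05-19, updated 2009-04-07") ==
       "Released 05/19/2008, updated 04/07/2009"
```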

Enhancements to Collections

Groovy also makes significant additions to what you can do with
Collections. In addition to each, collect, inject, etc, there is a
regular expression aware iterator called grep
that will pass each item in the Collection through a filter and return a
subset of items that match the filter. We can use a regular expression
as a filter:

// regular expression says 0 or more characters (".*") followed by the string "bar" that is at the end of the string ("$")
assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].grep(~/.*bar$/)

You can achieve the same thing with findAll but it takes a little more typing:

assert ["foobar", "bazbar"] == ["foobar", "bazbar", "barquux"].findAll { it ==~ /.*bar$/ }
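
One detail worth spelling out is that grep with a Pattern only keeps items that the pattern matches in full, which is why the examples above start with ".*". The sketch below uses an invented list of file names to contrast that with a findAll closure using =~, which accepts partial matches:

```groovy
def files = ["Main.groovy", "Main.java", "README", "build.gradle"]

// grep with a Pattern keeps items the pattern matches from start to end.
assert files.grep(~/.*\.groovy/) == ["Main.groovy"]

// Without the leading ".*" nothing matches in full, so nothing is kept.
assert files.grep(~/\.groovy/) == []

// findAll with =~ keeps items where the pattern is found anywhere in the item.
assert files.findAll { it =~ /\.g/ } == ["Main.groovy", "build.gradle"]
```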

Working with Matchers

As we’ve seen, using the =~ operator will return a
Matcher object. Many of the existing regular expression examples on the
web work by treating the Matcher as a list and getting the first
(zero-based) element out of the list:

def matcher = "foobazaarquux" =~ "o(b.*r)q"
assert ["obazaarq", "bazaar"] == matcher[0]
assert "bazaar" == matcher[0][1] // get the first grouping of the first match

This is a little fragile as matcher[0] will throw an error if there was not actually a match. Calling matches() doesn’t help as matches only checks if the regular expression matches the WHOLE string:

("foobazaarquux" =~ "o(b.*r)q").matches()  // returns false!
("foobazaarquux" =~ ".*(b.*r).*").matches()  // returns true, ".*" matches 0 or more chars of any type

You can check getCount() to see how many matches there were for some safety:

def m = "foobar" =~ /quux/
if (m.getCount()) {
    // example won't get here as "quux" doesn't exist in "foobar", the count is 0
    println m[0]
}
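
Another way to get the same safety, sketched here with a hypothetical firstGroup closure, is to call the standard Matcher.find() and only read a group when it returns true:

```groovy
// Hypothetical helper: return the first capture group of the first match,
// or null when the pattern is not found anywhere in the text.
def firstGroup = { String text, String regex ->
    def m = text =~ regex
    m.find() ? m.group(1) : null
}

assert firstGroup("foobazaarquux", "o(b.*r)q") == "bazaar"
assert firstGroup("foobar", "quux(.)") == null
```

This relies only on java.util.regex.Matcher methods, so it avoids the exception that indexing into a Matcher with no matches would raise.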

A groovier way to work with Matchers leverages collection iterators
and the built in closures that Groovy provides to them. Matcher
supports the iterator() method and with that, gets everything else that any Groovy List or Collection would have, including collect, inject, findAll, etc.

def paragraph = """
    Lorem ipsum dolor 12:30 AM sit amet, 
    consectetuer adipiscing 1:15 AM elit. 
    Nunc rutrum diam sagittis nisi 9:22 PM.
"""
 
def HOUR = /10|11|12|[0-9]/
def MINUTE = /[0-5][0-9]/
def AM_PM = /AM|PM/
def time = /($HOUR):($MINUTE) ($AM_PM)/
 
assert ["12:30 AM", "1:15 AM", "9:22 PM"] == (paragraph =~ time).collect { it }
 
assert ["12:30 AM", "1:15 AM"] == (paragraph =~ time).grep(~/.*AM$/)

A limitation of the iterator-based methods is that they don’t give
you access to the individual groups (hour, minute, am/pm), just the full
matched string (“12:30 AM”). The each method is more
powerful because as it iterates through, it passes the full match as
well as each of the individual groups into the closure.

("foo1 bar30 foo27 baz9 foo600" =~ /foo(\d+)/).each { match, digit -> println "+$digit" }
 
// result:
// +1
// +27
// +600

Another example (using the paragraph and time Matcher from above) showing how to pretty print all of the timestamps:

 
(paragraph =~ time).each {match, hour, minute, amPm -> 
    println "$hour:$minute ${amPm == 'AM' ? 'this morning' : 'this evening' }"
}
 
// result: 
// 12:30 this morning
// 1:15 this morning
// 9:22 this evening
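
The groups passed to each can also feed a calculation instead of a println. This sketch, using its own sample text and a simplified time pattern (both assumptions, not from the original post), converts each timestamp to minutes since midnight:

```groovy
def log = "start 9:05 AM, lunch 12:30 PM, end 5:45 PM"
def time = /(\d{1,2}):(\d{2}) (AM|PM)/

def minutesSinceMidnight = []
(log =~ time).each { match, hour, minute, amPm ->
    // 12 AM is hour 0 and 12 PM is hour 12, hence the modulo before adding 12 for PM.
    def h = hour.toInteger() % 12 + (amPm == 'PM' ? 12 : 0)
    minutesSinceMidnight << (h * 60 + minute.toInteger())
}

assert minutesSinceMidnight == [545, 750, 1065]
```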

Regular expressions are a powerful tool that Groovy makes as
accessible as any other top-tier scripting language. Using techniques
to break more complicated regular expressions into their component
pieces can make them much more readable (as in the time example above).

If you’re doing any sort of string processing beyond a simple contains or split, regular expressions in Groovy can turn mountains of Java into a couple of lines of code.

There are 17 comments in this article:

  1. 2008/05/19 Andres Almiray says:

    Great write up, keep’em coming!

  2. 2008/05/19 Berkay says:

    This is really helpful, thanks for taking the time to write it up!

  3. 2008/05/20 kodeninja says:

    An interesting corner case that could come up here 🙂

    def shoutedWord = "NASA"
    def shoutedWordP = ~/\b[A-Z]+\b/ // creates a pattern
    def shoutedWordM = "EUREKA" =~ /\b[A-Z]+\b/ // creates a matcher

    shoutedWord =~/\b[A-Z]+\b/ // what does this create?

    -kodeninja

  4. 2008/05/20 codecraig says:

    Great information, thanks for the post.

  5. 2008/05/20 tednaleid says:

    Thanks guys I’m glad you found the post useful!

    @kodeninja that is an interesting situation that I hadn’t considered
    previously. It looks like “=~” takes precedence over the “~” in the
    groovy parser.

    So that in your example where you don’t have a space between any of “=~/”, the groovy parser is picking “=~” over the “~/”.

    This demonstrates what’s going on:

    shoutedWord = "NASA"
     
    // no space between = and ~ so it treats it as a =~
    // returns a matcher object and doesn't modify shoutedWord
    def noSpaceCreatesMatcher() { shoutedWord =~/\b[A-Z]+\b/ }
     
    // space between = and ~ so this assigns the Pattern to shoutedWord 
    def spaceAssignsPattern() { shoutedWord = ~/\b[A-Z]+\b/ }
     
    assert "NASA" == shoutedWord
    assert java.util.regex.Matcher == noSpaceCreatesMatcher().class
    assert "NASA" == shoutedWord
    assert java.util.regex.Pattern == spaceAssignsPattern().class
    assert java.util.regex.Pattern == shoutedWord.class
    assert ( /\b[A-Z]+\b/ == shoutedWord.toString() )

    I guess the moral of the story is to be explicit where you put your
    spaces to ensure you’re getting the right intention. If you want a
    matcher, use “=~ /”. If you want a pattern, use “= ~/”.

  6. 2008/05/21 Dave Klein says:

    After seeing your enhancement to my blog post, I decided to look
    into RegEx more (it always just looked like comic strip swearing to
    me). So thanks for giving me a starting point!

    On the down side, I can’t get that song out of my head and it’s getting depressing 🙂

  7. 2008/05/27 Grails Podcast Episode 57: Newscast for May 24th 2008 « Sven Haiges’ Personal Blog says:

    […] Don’t fear the RegExp – new features in Groovy regular expressions. […]

  8. 2008/05/27 Ted Naleid » Syntactic Sugar in Groovy and Ruby says:

    […] used to think that the regular expression support in
    Groovy was lacking compared to Ruby, but after a little digging, I found
    that it’s just about as easy and powerful as Ruby […]

  9. 2008/06/19 Mibmib says:

    Great article, thx !

    I was struggling to get groups out of a find until I read this :

    “A limitation of the iterator-based methods is that they don’t give you access to the individual groups”

    Why doesn’t this work (why is x a string instead of a list):

    matcher = ( "test1 test2" =~ /test([0-9]+)/ )

    for ( x in matcher ){
    println x
    }

  10. 2008/06/19 tednaleid says:

    @Mibmib: I think the example you give doesn’t work because the
    for loop is using the iterator (just like the ones I have above, like
    collect, inject, etc). If you want access to the groups, you just have
    to use .each:

    matcher = ( "test1 test2" =~ /test([0-9]+)/ )
    matcher.each { fullMatch, number ->
        println "full match: $fullMatch, number: $number"
    }

    prints:

    full match: test1, number: 1
    full match: test2, number: 2

  11. 2009/01/29 Mike says:

    Just have to say that I keep coming back to this post over and
    over for reference when I’m dealing with regex. Thanks for the great
    examples!

  12. 2009/02/28 Tom says:

    This was helpful.

    Thanks

  13. 2009/03/4 Steve says:

    Thanks for the great info, much better than my Groovy book.

    One thing I’m still confused by — for replaceAll with a closure, you
    get passed the whole match, and then all the groups, right? And it seems
    from your example that the closure’s result replaces the first group if
    there are groups?

    What if there’s more than one group? How can I replace individual
    ones? Or does it replace the whole match, and you have to construct it
    out of all the parameters representing the groups?

    An example for that would be great.

    Thanks!

  14. 2009/03/7 tednaleid says:

    @Steve

    Your statement regarding replaceAll: “does it replace the whole
    match, and you have to construct it out of the parameters representing
    the groups” is the correct one.

    The entire thing that was matched by the regexp statement is replaced
    by whatever is returned. You just have access to the groups to help
    you parse out the different pieces to help you construct the updated
    string. A couple of examples:

    without groups, it replaces the whole match:

    assert "foo qux qux" == "foo bar baz".replaceAll(/b../) { wholeMatch ->
       return "qux"
    }

    with groups, you can use the groups to help you construct the result,
    but it still replaces the whole match (and leaves the start of the
    string, “foo”, untouched):

    assert "foo car caz" == "foo bar baz".replaceAll(/b(..)/) { wholeMatch, lastTwoChars ->
       return "c" + lastTwoChars
    }

    Hope that helps!

  15. 2009/10/29 Nikolas says:

    Hi, I need to include a parameter (a String variable) in a Matcher
    pattern so that the Groovy script can search dynamically for my needs.
    Could you suggest how to do that? It should look like this:
    gmatcher = evp =~ "myparam ([0-9/-]+)([0-9/-]+)" where
    my param = ([1] [2] [3])

    Any help is greatly appreciated.

    Thanks:
    Nikolas

  16. 2009/10/30 tednaleid says:

    @Nikolas

    You can use GStrings in the patterns just like in any other string, so for your example, this should work:

    def myparam = "([1][2][3])"
    def pattern = ~"$myparam ([0-9/-]+)([0-9/-]+)"
     
    evp = "123 456"
     
    // asserts just show what will be matched for example string
    // returns the group matching what's in myparam
    evp.find(pattern) { full, matchedMyParam, firstDigitGroup, secondDigitGroup ->
        assert full == "123 456"
        assert matchedMyParam == "123"
        assert firstDigitGroup == "45"
        assert secondDigitGroup == "6"
        return matchedMyParam
    }

    Hope that helps!
