
...

One of the most useful features of Groovy is the ability to use regular expressions to "capture" data from a string. Let's say, for example, that we wanted to extract the location data for Liverpool, England from the following string:

Code Block

locationData = "Liverpool, England: 53d 25m 0s N 3d 0m 0s"

We could use the split() method of String and then strip out the comma between Liverpool and England, along with all the special location characters. Or we could do it all in one step with a regular expression. The syntax for doing this is a little strange. First, we define a regular expression, putting anything we are interested in inside parentheses. Each parenthesized part is a capture group.

Code Block

myRegularExpression = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

Next, we have to define a "matcher" which is done using the =~ operator:

Code Block

matcher = ( locationData =~ myRegularExpression )

The variable matcher contains a java.util.regex.Matcher as enhanced by Groovy. You can access your data just as you would in Java from a Matcher object. A groovier way to get your data is to use the matcher as if it were a two-dimensional array, that is, an array of arrays. The first "dimension" corresponds to each match of the regular expression in the string. In this example the regular expression matches only once, so the first dimension has a single element. So consider the following code:

Code Block

matcher[0]

That expression should evaluate to:

Code Block

["Liverpool, England: 53d 25m 0s N 3d 0m 0s", "Liverpool", "England", "53", "25", "0", "N", "3", "0", "0"]

And then we use the second dimension of the array to access the capture groups that we're interested in:

Code Block

if (matcher.matches()) {
	println(matcher.getCount()+ " occurrence of the regular expression was found in the string.");
	println(matcher[0][1] + " is in the " + matcher[0][6] + " hemisphere. (According to: " + matcher[0][0] + ")")
}
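To make the role of the first dimension more visible, here is a minimal sketch using a hypothetical input string (not part of the data above) in which the regular expression matches twice. The names `text` and `m` are assumptions chosen for illustration.

```groovy
// Hypothetical input containing two "City, Country" pairs
def text = "Liverpool, England; Paris, France"
def m = (text =~ /([a-zA-Z]+), ([a-zA-Z]+)/)

// First dimension: one entry per match of the regular expression
println(m.getCount())   // number of matches found in the string

// Second dimension: index 0 is the whole match, 1..n are the capture groups
println(m[0])           // whole first match followed by its two groups
println(m[1][1] + " is in " + m[1][2])
```

Each `m[i]` is a list, so the usual list operations are available on it.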

...

Sometimes it is desirable to group an expression without marking it as a capture group. You can do this by enclosing the expression in parentheses with ?: as the first two characters. For example, if we wanted to reformat the names of some people, ignoring any middle names, we might write:

Code Block

names = [
    "Graham James Edward Miller",
    "Andrew Gregory Macintyre",
    "No MiddleName"
]

printClosure = {
	matcher = (it =~ /(.*?)(?: .*)* (.*)/);  // notice the non-capturing group in the middle
	if (matcher.matches())
		println(matcher[0][2]+", "+matcher[0][1]);
}
names.each(printClosure);

Should output:

Code Block

Miller, Graham
Macintyre, Andrew
MiddleName, No

That way, we always know that the last name is the second capture group, no matter how many middle names there are.
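The effect of ?: on group numbering can be checked directly. The sketch below uses a hypothetical name and a simplified pattern (both are assumptions, not taken from the example above) and compares the same expression with and without the non-capturing marker.

```groovy
// Hypothetical sample name, used only for illustration
def name = "John Q Public"

// The middle group captures, so the last name becomes group 3
def withCapture = (name =~ /(\w+)( \w+)? (\w+)/)

// The middle group only groups, so the last name stays group 2
def withoutCapture = (name =~ /(\w+)(?: \w+)? (\w+)/)

println(withCapture.groupCount())     // 3
println(withoutCapture.groupCount())  // 2
println(withoutCapture[0][2] + ", " + withoutCapture[0][1])
```

Because the middle group no longer takes a number, code that reads the last name does not have to change when the optional part is present or absent.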

...

So let's say we want to replace all occurrences of Harry Potter's name so that we can resell J.K. Rowling's books as Tanya Grotter novels (yes, someone tried this; Google it if you don't believe me).

Code Block

excerpt = "At school, Harry had no one. Everybody knew that Dudley's gang hated that odd Harry Potter "+
          "in his baggy old clothes and broken glasses, and nobody liked to disagree with Dudley's gang.";
matcher = (excerpt =~ /Harry Potter/);
excerpt = matcher.replaceAll("Tanya Grotter");

matcher = (excerpt =~ /Harry/);
excerpt = matcher.replaceAll("Tanya");
println("Publish it! "+excerpt);
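The order of the two replacements matters: the full name has to be replaced before the first name alone. As a minimal sketch, the same steps can be written with String.replaceAll, which takes the regular expression and the replacement directly. The shortened sentence and the variable names here are assumptions made for illustration.

```groovy
// Hypothetical shortened excerpt
def sample = "Harry had no one. Everybody hated that odd Harry Potter."

// Full name first, then the remaining first names
def rightOrder = sample.replaceAll(/Harry Potter/, "Tanya Grotter")
                       .replaceAll(/Harry/, "Tanya")

// First name first: "Harry Potter" is gone before the second pattern runs
def wrongOrder = sample.replaceAll(/Harry/, "Tanya")
                       .replaceAll(/Harry Potter/, "Tanya Grotter")

println(rightOrder)
println(wrongOrder)
```

In the second variant the surname is left untouched, because by the time the longer pattern is tried there is no "Harry Potter" left to find.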

...

The operators ?, +, and * are "greedy" by default. That is, they attempt to match as much of the input as possible. Sometimes this is not what we want. Consider the following list of fifth-century popes:

Code Block

popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]

A first attempt at a regular expression to parse out the name (without the sequence number or modifier) and years of each pope might be as follows:

Code Block

/Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/

...

Clearly the first capture group is capturing too much of the input. We only want it to capture Anastasius, and the modifiers should be matched by the non-capturing group that follows it. Another way to put this is that the first capture group should capture as little of the input as possible while still allowing a match. In this case that would be everything up to the next space. Java regular expressions allow us to do this using "reluctant" versions of the *, + and ? operators. To make one of these operators reluctant, simply add a ? after it (giving *?, +? and ??). So our new regular expression would be:

Code Block

/Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/
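The difference between the two expressions can be seen side by side on a single entry. This is a small sketch; the variable names `pope`, `greedy` and `reluctant` are assumptions chosen for illustration.

```groovy
def pope = "Pope Anastasius I 399-401"

// Greedy: the first group takes as much as it can while still allowing a match
def greedy = (pope =~ /Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/)

// Reluctant: the first group takes as little as it can while still allowing a match
def reluctant = (pope =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/)

println(greedy[0][1])      // Anastasius I
println(reluctant[0][1])   // Anastasius
println(reluctant[0][2] + " to " + reluctant[0][3])
```

The year groups are the same in both cases; only the amount of text claimed by the first capture group changes.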

...

So to test this out, we would use the code:

Code Block

popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]

myClosure = {
	myMatcher = (it =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/);
	if (myMatcher.matches())
		println(myMatcher[0][1]+": "+myMatcher[0][2]+" to "+myMatcher[0][3]);
}
popesArray.each(myClosure);

...