Capture groups

One of the most useful features of Groovy is the ability to use regular expressions to "capture" data out of a string. Let's say, for example, we wanted to extract the location data of Liverpool, England from the following data:

We could use the split() method of String and then go through and strip out the comma between Liverpool and England, along with all the special location characters. Or we could do it all in one step with a regular expression. The syntax for doing this is a little strange. First, we define a regular expression, putting anything we are interested in inside parentheses.
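The page's original sample data and pattern are not preserved here, so the following is only a minimal sketch of this first step. The `locationData` line, the variable names, and the exact pattern are assumptions made for illustration.

```groovy
// Hypothetical sample line (an assumption, not the page's original data).
// \u00B0, \u2032 and \u2033 are the degree, minute and second symbols.
def locationData = "Liverpool, England: 53\u00B0 25\u2032 0\u2033 N 3\u00B0 0\u2032 0\u2033"

// Each pair of parentheses marks one piece we want to pull out.
// A bare '.' matches any single character, which absorbs the special symbols.
def locationRegex = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

// A slashy string without interpolation is an ordinary String until it is used as a pattern.
assert locationRegex instanceof String

// Compiling the pattern lets us count how many capture groups were declared.
def groupCount = java.util.regex.Pattern.compile(locationRegex).matcher("").groupCount()
println groupCount
```

Writing the pattern as a slashy string (`/.../`) avoids having to double every backslash, which is why it is the usual choice for regular expressions in Groovy.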

Next, we have to define a "matcher", which is done using the =~ operator:
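As a sketch of this step, reusing the same hypothetical `locationData` and `locationRegex` (both assumptions, since the original listing is missing):

```groovy
// Hypothetical sample data and pattern, as assumed for illustration.
def locationData = "Liverpool, England: 53\u00B0 25\u2032 0\u2033 N 3\u00B0 0\u2032 0\u2033"
def locationRegex = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

// The =~ operator builds a java.util.regex.Matcher for the string on its left.
def matcher = (locationData =~ locationRegex)
assert matcher instanceof java.util.regex.Matcher

// Nothing has been matched yet: the matcher only pairs the pattern with the input.
println matcher.pattern().pattern() == locationRegex
```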

Then we have to check whether the pattern matches, and then access the pieces using array notation. The first index selects the match (0 for the first one); within that match, the zeroth element is the whole matched text and the remaining elements are the capture groups of the regular expression:
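A minimal sketch of the check and the indexed access, again using the hypothetical sample data and pattern (assumptions, not the page's original listing):

```groovy
// Hypothetical sample data and pattern, as assumed for illustration.
def locationData = "Liverpool, England: 53\u00B0 25\u2032 0\u2033 N 3\u00B0 0\u2032 0\u2033"
def locationRegex = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

def matcher = (locationData =~ locationRegex)
String summary = null

// matches() succeeds only if the whole string fits the pattern.
if (matcher.matches()) {
    def match = matcher[0]   // the first (and here only) match
    // match[0] is the whole matched text; match[1] onward are the capture groups.
    summary = "${match[1]} (${match[2]}) lies in the ${match[6]} hemisphere"
    println summary
}
```

In this sketch group 1 is the city, group 2 the country, and group 6 the hemisphere letter, following the order of the parentheses in the assumed pattern.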

Notice the extra benefit that we get from using regular expressions: we can see whether the data is well-formed. That is, if locationData contained the string "Could not find location data for Lima, Peru", the body of the if statement would not execute.
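The well-formedness check can be sketched as follows. The pattern is the same hypothetical one assumed for the Liverpool data, and `wellFormed` is a name introduced only for this illustration.

```groovy
// Hypothetical pattern, as assumed for illustration.
def locationRegex = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

// The malformed input quoted in the text.
def badData = "Could not find location data for Lima, Peru"

def matcher = (badData =~ locationRegex)
def wellFormed = matcher.matches()

if (wellFormed) {
    println matcher[0][1]
} else {
    // No coordinates follow the place name, so the pattern cannot match.
    println "Skipping malformed line: " + badData
}
```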

Non-capturing Groups

Sometimes it is desirable to group an expression without marking it as a capture group. You can do this by enclosing the expression in parentheses with ?: as the first two characters. For example, if we wanted to reformat the names of some people, ignoring middle names if any, we might write:
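The original listing is not preserved, so this is a sketch under stated assumptions: the names are hypothetical, and the pattern is one possible way to skip zero or more middle names, not necessarily the one the page used.

```groovy
// Hypothetical input names (assumptions for illustration).
def names = ["John Ronald Reuel Tolkien", "Jane Austen"]

def reformatted = names.collect { name ->
    // (?: \S+)* groups " middle" so that * can repeat it,
    // but ?: keeps it from being numbered as a capture group.
    def matcher = (name =~ /(\S+)(?: \S+)* (\S+)/)
    matcher.matches() ? matcher[0][2] + ", " + matcher[0][1] : name
}

reformatted.each { println it }
// Tolkien, John
// Austen, Jane
```

Because the middle group is non-capturing, the first name stays in group 1 and the last name stays in group 2 no matter how many middle names appear.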

Should output:

That way, we always know that the last name is the second capture group.


One of the simpler but more useful things you can do with regular expressions is to replace the matching part of a string. You do that using the replaceFirst() and replaceAll() methods on java.util.regex.Matcher (this is the type of object you get when you do something like myMatcher = ("a" =~ /b/); ).
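A small sketch of the two methods side by side; the input string and variable names are made up for illustration.

```groovy
// Hypothetical input text.
def text = "one fish, two fish"
def matcher = (text =~ /fish/)

// replaceFirst() swaps only the first occurrence of the pattern.
def firstOnly = matcher.replaceFirst("cat")

// replaceAll() swaps every occurrence. Both methods reset the matcher
// before they run, so the earlier call does not affect this one.
def everyOne = matcher.replaceAll("cat")

println firstOnly   // one cat, two fish
println everyOne    // one cat, two cat
```

Neither call modifies `text`; each returns a new string.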

So let's say we want to replace all occurrences of Harry Potter's name so that we can resell J.K. Rowling's books as Tanya Grotter novels (yes, someone tried this, Google it if you don't believe me).

In this case, we do it in two steps, one for Harry Potter's full name, one for just his first name.
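The two-step replacement might look like the sketch below. The `excerpt` text is invented for illustration and is not taken from the books or from the original listing.

```groovy
// Hypothetical excerpt, written only for this example.
def excerpt = "Harry Potter was late. Nobody waited for Harry."

// Step 1: replace the full name first, so the surname is swapped too.
def matcher = (excerpt =~ /Harry Potter/)
excerpt = matcher.replaceAll("Tanya Grotter")

// Step 2: replace any remaining bare first names.
matcher = (excerpt =~ /Harry/)
excerpt = matcher.replaceAll("Tanya")

println excerpt   // Tanya Grotter was late. Nobody waited for Tanya.
```

The order matters: replacing the bare first name before the full name would leave "Tanya Potter" behind.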

Reluctant Operators

The operators ?, +, and * are by default "greedy". That is, they attempt to match as much of the input as possible. Sometimes this is not what we want. Consider the following list of fifth-century popes:

A first attempt at a regular expression to parse out the name (without the sequence number or modifier) and years of each pope might be as follows:
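Neither the list nor the pattern survives on this page, so the sketch below rests on assumptions: the line format of `popes` is guessed (the names and years are historical), and `greedyRegex` is a reconstruction consistent with the fragments that remain, not a confirmed copy of the original.

```groovy
// Assumed line format: "Pope <name> [<number/modifier>] <from>-<to>".
def popes = [
    "Pope Anastasius I 399-401",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468"
]

// Assumed first attempt: a greedy (.*) for the name, an optional
// non-capturing group for the number or modifier, then the two years.
def greedyRegex = /Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/

def greedyNames = popes.collect { line ->
    def matcher = (line =~ greedyRegex)
    matcher.matches() ? matcher[0][1] : null
}

println greedyNames   // [Anastasius I, Leo I the Great, Hilarius]
```

The greedy `(.*)` swallows the sequence number and modifier along with the name, which is the problem this section goes on to address.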

Which splits up as:

[Table truncated in the source; surviving column headers: / | Pope | (.*) | (?]

We would hope that the first capture group contains just the name of the pope in each example but, as it turns out, it captures too much of the input. For example, the first pope breaks up as follows:

[Table truncated in the source; surviving column headers: / | Pope | (.*) | (?]

Clearly the first capture group is capturing too much of the input. We only want it to capture Anastasius, and the modifiers should be matched by the second group. Another way to put this is that the first capture group should capture as little of the input as possible while still allowing a match. In this case, that would be everything up to the next space. Java regular expressions allow us to do this using "reluctant" versions of the *, + and ? operators. To make one of these operators reluctant, simply add a ? after it (to make *?, +? and ??). So our new regular expression would be:
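As a sketch, here are the greedy and reluctant versions side by side on the first pope. Both patterns are assumed reconstructions (only fragments of the originals survive), and the input line format is likewise an assumption.

```groovy
// Assumed line format for the first pope in the list.
def line = "Pope Anastasius I 399-401"

// Assumed patterns: identical except for the ? that makes the first group reluctant.
def greedy    = (line =~ /Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/)
def reluctant = (line =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/)

assert greedy.matches() && reluctant.matches()

println greedy[0][1]      // Anastasius I
println reluctant[0][1]   // Anastasius
```

With `(.*?)` the first group stops at the earliest point that still lets the rest of the pattern match, so the sequence number falls into the optional non-capturing group instead.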

Now let's look at how our new regular expression handles the most difficult of the inputs, the one before Pope Hilarius (a real jokester). It breaks up as follows:

[Table truncated in the source; surviving column headers: / | Pope | (.*?) | (?]

Which is what we want.

So to test this out, we would use the code:
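The original test listing is missing, so what follows is a sketch under the same assumptions as before: a guessed line format for `popes` (historical names and years) and a reconstructed reluctant pattern in `popeRegex`.

```groovy
// Assumed line format; the entry before Hilarius carries the longest modifier.
def popes = [
    "Pope Anastasius I 399-401",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468"
]

// Assumed reluctant pattern: (.*?) keeps the name as short as possible.
def popeRegex = /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/

def results = popes.collect { line ->
    def matcher = (line =~ popeRegex)
    // Group 1 is the name, groups 2 and 3 are the first and last year.
    matcher.matches() ? matcher[0][1] + ": " + matcher[0][2] + " to " + matcher[0][3] : "no match: " + line
}

results.each { println it }
// Anastasius: 399 to 401
// Leo: 440 to 461
// Hilarius: 461 to 468
```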

Try this code with the original regular expression as well to see the broken output.
