Capture groups

One of the most useful features of Groovy is the ability to use regular expressions to "capture" data out of a string. Let's say, for example, that we wanted to extract the location data of Liverpool, England from the following data:

Code Block
locationData = '''Liverpool, England: 53° 25' 0" N 3° 0' 0"'''      // note: 53° 25' 0" N latitude, 3° 0' 0" longitude; the data omits the hemisphere letter for the longitude

We could call the split() method of String repeatedly and then go through and strip out the comma between Liverpool and England, and all the special location characters. Or we could do it all in one step with a regular expression. The syntax for doing this is a little bit strange. First, we have to define a regular expression, enclosing each piece of data we are interested in in parentheses.

Code Block
myRegularExpression = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

Next, we have to define a "matcher", which is done using the =~ operator:

Code Block
matcher = ( locationData =~ myRegularExpression )

The variable matcher contains a java.util.regex.Matcher as enhanced by Groovy. You can access your data just as you would in Java from a Matcher object. A groovier way to get your data is to use the matcher as if it were a two-dimensional array, that is, an array of arrays. The first dimension of the array corresponds to each match of the regular expression in the string. In this example the regular expression matches only once, so the first dimension has a single element. So consider the following code:

Code Block
matcher[0]

That expression should evaluate to:

Code Block
["Liverpool, England: 53° 25' 0\" N 3° 0' 0\"", "Liverpool", "England", "53", "25", "0", "N", "3", "0", "0"]

And then we use the second dimension of the array to access the capture groups that we're interested in:

Code Block
if (matcher.matches()) {
	println(matcher.getCount()+ " occurrence of the regular expression was found in the string.");
	println(matcher[0][1] + " is in the " + matcher[0][6] + " hemisphere. (According to: " + matcher[0][0] + ")")
}

Notice the extra benefit that we get from using regular expressions: we can check whether the data is well-formed. That is, if locationData contained the string "Could not find location data for Lima, Peru", the body of the if statement would not execute.
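
As a rough sketch of the two points above, the following shows Java-style access through group() next to the index notation, and the well-formedness check on a string that does not fit the pattern. The variable names pattern, m, and bad are chosen here for illustration only, and the degree sign is written as the escape \u00B0 so that the snippet does not depend on the encoding of the source file.

```groovy
// Same data and pattern as in the text; \u00B0 is the degree sign
def locationData = '''Liverpool, England: 53\u00B0 25' 0" N 3\u00B0 0' 0"'''
def pattern = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./

def m = (locationData =~ pattern)

// Java style: matches() must succeed before group(n) is called
def city = null
def hemisphere = null
if (m.matches()) {
    city = m.group(1)        // first capture group
    hemisphere = m.group(6)  // sixth capture group
    println(city + " is in the " + hemisphere + " hemisphere.")
}

// Groovy style: the same values through index notation
println(m[0][1] + " / " + m[0][6])

// A string that is not well-formed location data does not match at all
def bad = "Could not find location data for Lima, Peru"
def badMatches = (bad =~ pattern).matches()
println("Malformed input matches: " + badMatches)
```

Both styles read the same capture groups; the index notation simply saves the explicit calls to group().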

Non-capturing Groups

Sometimes it is desirable to group an expression without marking it as a capture group. You can do this by enclosing the expression in parentheses with ?: as the first two characters. For example, if we wanted to reformat the names of some people, ignoring their middle names, we might write:

Code Block
names = [
    "Graham James Edward Miller",
    "Andrew Gregory Macintyre"
]

printClosure = {
	matcher = (it =~ /(.*?)(?: .+)+ (.*)/);  // notice the non-capturing group in the middle
	if (matcher.matches())
		println(matcher[0][2]+", "+matcher[0][1]);
}
names.each(printClosure);

Should output:

Code Block
Miller, Graham
Macintyre, Andrew

That way, we always know that the last name is the second capture group.
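
To see what the ?: changes, here is a small sketch that runs the same name through two variants of the pattern. The variable names capturing and nonCapturing are made up for this illustration; the only difference between the two expressions is the ?: in the middle group.

```groovy
def name = "Graham James Edward Miller"

// Middle group is a capture group: it takes index 2, so the last name moves to index 3
def capturing = (name =~ /(.*?)( .+)+ (.*)/)

// Middle group is non-capturing: the last name stays at index 2
def nonCapturing = (name =~ /(.*?)(?: .+)+ (.*)/)

// Index 0 is always the whole match, so the sizes are 4 and 3
println("capturing:     " + capturing[0].size() + " entries, last name at [0][3] = " + capturing[0][3])
println("non-capturing: " + nonCapturing[0].size() + " entries, last name at [0][2] = " + nonCapturing[0][2])
```

With the non-capturing form, the middle group can be added or removed later without renumbering the groups that the rest of the code relies on.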

Replacement

One of the simpler but more useful things you can do with regular expressions is to replace the matching part of a string. You do that using the replaceFirst() and replaceAll() methods of java.util.regex.Matcher (this is the type of object you get when you do something like myMatcher = ("a" =~ /b/); ).

So let's say we want to replace all occurrences of Harry Potter's name so that we can resell J.K. Rowling's books as Tanya Grotter novels (yes, someone tried this; Google it if you don't believe me).

Code Block
excerpt = "At school, Harry had no one. Everybody knew that Dudley's gang hated that odd Harry Potter "+
          "in his baggy old clothes and broken glasses, and nobody liked to disagree with Dudley's gang.";
matcher = (excerpt =~ /Harry Potter/);
excerpt = matcher.replaceAll("Tanya Grotter");

matcher = (excerpt =~ /Harry/);
excerpt = matcher.replaceAll("Tanya");
println("Publish it! "+excerpt);

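In this case, we do it in two steps: one for Harry Potter's full name, then one for just his first name. The order matters: replacing the first name first would turn "Harry Potter" into "Tanya Potter".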
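
The example above only uses replaceAll(). As a minimal sketch of how replaceFirst() differs, consider the following; the sample sentence and the variable names text, first, and all are invented for this illustration.

```groovy
def text = "Harry met Harry Potter."

// replaceFirst() changes only the first match
def first = (text =~ /Harry/).replaceFirst("Tanya")

// replaceAll() changes every match
def all = (text =~ /Harry/).replaceAll("Tanya")

println(first)  // Tanya met Harry Potter.
println(all)    // Tanya met Tanya Potter.
```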

Reluctant Operators

The operators ?, +, and * are by default "greedy". That is, they attempt to match as much of the input as possible. Sometimes this is not what we want. Consider the following list of fifth century popes:

Code Block
popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]

A first attempt at a regular expression to parse out the name (without the sequence number or modifier) and years of each pope might be as follows:

Code Block
/Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/

Which splits up as:

/           begin expression
Pope        Pope
(.*)        capture some characters
(?: .*)?    non-capture group: space and some characters
([0-9]+)    capture a number
-           -
([0-9]+)    capture a number
/           end expression

We hope that then the first capture group would just be the name of the pope in each example, but as it turns out, it captures too much of the input. For example the first pope breaks up as follows:

/           begin expression
Pope        Pope
(.*)        Anastasius I
(?: .*)?    (nothing)
([0-9]+)    399
-           -
([0-9]+)    401
/           end expression

Clearly the first capture group is capturing too much of the input. We only want it to capture Anastasius, and the modifiers should be consumed by the non-capture group that follows it. Another way to put this is that the first capture group should capture as little of the input as possible while still allowing a match. In this case that would be everything up to the next space. Java regular expressions allow us to do this using "reluctant" versions of the *, + and ? operators. To make one of these operators reluctant, simply add a ? after it (to make *?, +? and ??). So our new regular expression would be:

Code Block
/Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/

Now let's look at how our new regular expression breaks up the most difficult of the inputs, the one before Pope Hilarius (a real jokester):

/           begin expression
Pope        Pope
(.*?)       Leo
(?: .*)?    I the Great
([0-9]+)    440
-           -
([0-9]+)    461
/           end expression

Which is what we want.

So to test this out, we would use the code:

Code Block
popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]

myClosure = {
	myMatcher = (it =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/);
	if (myMatcher.matches())
		println(myMatcher[0][1]+": "+myMatcher[0][2]+" to "+myMatcher[0][3]);
}
popesArray.each(myClosure);

Try this code with the original regular expression as well to see the broken output.
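
For a quick side-by-side comparison, the following sketch applies both expressions to a single entry from the list. The variable names pope, greedy, and reluctant are chosen here for illustration.

```groovy
def pope = "Pope Leo I the Great 440-461"

// Original expression: the greedy (.*) swallows the modifiers as well
def greedy = (pope =~ /Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/)

// Revised expression: the reluctant (.*?) stops at the first space that still allows a match
def reluctant = (pope =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/)

println(greedy[0][1])     // Leo I the Great
println(reluctant[0][1])  // Leo

// The years are captured identically by both expressions
println(reluctant[0][2] + " to " + reluctant[0][3])  // 440 to 461
```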