<h1>Capture groups</h1>
<p>One of the most useful features of Groovy is the ability to use regular expressions to "capture" data out of a string. Let's say, for example, we wanted to extract the location data of Liverpool, England from the following data:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>locationData = "Liverpool, England: 53d 25m 0s N 3d 0m 0s"</pre></td></tr></table>
<p>We could use the split() method of String and then go through and strip out the comma between Liverpool and England, and all the special location characters. Or we could do it all in one step with a regular expression. The syntax for doing this is a little bit strange. First, we have to define a regular expression, wrapping each part we are interested in in parentheses.</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>myRegularExpression = /([a-zA-Z]+), ([a-zA-Z]+): ([0-9]+). ([0-9]+). ([0-9]+). ([A-Z]) ([0-9]+). ([0-9]+). ([0-9]+)./</pre></td></tr></table>
<p>Next, we have to define a "matcher", which is done using the <strong>=~</strong> operator:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>matcher = ( locationData =~ myRegularExpression )</pre></td></tr></table>
<p>The variable matcher contains a java.util.regex.Matcher as enhanced by Groovy.
You can access your data just as you would in Java from a Matcher object. A groovier way to get your data is to use the matcher as if it were an array, or a two-dimensional array, to be exact. A two-dimensional array is simply an array of arrays. In this case the first "dimension" of the array corresponds to each match of the regular expression in the string. With this example, the regular expression only matches once, so there is only one element in the first dimension of the two-dimensional array. So consider the following code:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>matcher[0]</pre></td></tr></table>
<p>That expression should evaluate to the following list, in which element 0 is the entire matched text and the remaining elements are the capture groups in order:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>["Liverpool, England: 53d 25m 0s N 3d 0m 0s", "Liverpool", "England", "53", "25", "0", "N", "3", "0", "0"]</pre></td></tr></table>
<p>And then we use the second dimension of the array to access the capture groups that we're interested in:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>if (matcher.matches()) {
    println(matcher.getCount() + " occurrence of the regular expression was found in the string.");
    println(matcher[0][1] + " is in the " + matcher[0][6] + " hemisphere. (According to: " + matcher[0][0] + ")")
}</pre></td></tr></table>
<p>Notice that an extra benefit of using regular expressions is that we can check whether the data is well-formed. That is, if <strong>locationData</strong> contained the string "Could not find location data for Lima, Peru", the body of the if statement would not execute.</p>
<h1>Non-capturing Groups</h1>
<p>Sometimes it is desirable to group an expression without marking it as a capture group. You can do this by enclosing the expression in parentheses with ?: as the first two characters. For example, if we wanted to reformat the names of some people, ignoring middle names if any, we might write:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>names = [
    "Graham James Edward Miller",
    "Andrew Gregory Macintyre",
    "No MiddleName"
]

printClosure = {
    matcher = (it =~ /(.*?)(?: .*)* (.*)/); // notice the non-capturing group in the middle
    if (matcher.matches())
        println(matcher[0][2] + ", " + matcher[0][1]);
}
names.each(printClosure);</pre></td></tr></table>
<p>Should output:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>Miller, Graham
Macintyre, Andrew
MiddleName, No</pre></td></tr></table>
<p>That way, we always know that the last name is the second capture group.</p>
<h2>Replacement</h2>
<p>One of the simpler but more useful things you can do with regular expressions is to replace the matching part of a string.
You do that using the replaceFirst() and replaceAll() methods on java.util.regex.Matcher (this is the type of object you get when you do something like myMatcher = ("a" =~ /b/); ).</p>
<p>So let's say we want to replace all occurrences of Harry Potter's name so that we can resell J.K. Rowling's books as Tanya Grotter novels (yes, someone tried this, Google it if you don't believe me).</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>excerpt = "At school, Harry had no one. Everybody knew that Dudley's gang hated that odd Harry Potter "+
          "in his baggy old clothes and broken glasses, and nobody liked to disagree with Dudley's gang.";
matcher = (excerpt =~ /Harry Potter/);
excerpt = matcher.replaceAll("Tanya Grotter");
matcher = (excerpt =~ /Harry/);
excerpt = matcher.replaceAll("Tanya");
println("Publish it! "+excerpt);</pre></td></tr></table>
<p>In this case, we do it in two steps, one for Harry Potter's full name, one for just his first name.</p>
<h2>Reluctant Operators</h2>
<p>The operators ?, +, and * are by default "greedy". That is, they attempt to match as much of the input as possible. Sometimes this is not what we want.
Consider the following list of fifth-century popes:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]</pre></td></tr></table>
<p>A first attempt at a regular expression to parse out the name (without the sequence number or modifier) and years of each pope might be as follows:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>/Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/</pre></td></tr></table>
<p>Which splits up as:</p>
<table class="confluenceTable"><tbody><tr><td class="confluenceTd"><p>/</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>(.*)</p></td><td class="confluenceTd"><p>(?: .*)?</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>/</p></td></tr><tr><td class="confluenceTd"><p>begin expression</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>capture some characters</p></td><td class="confluenceTd"><p>non-capturing group: space and some characters</p></td><td class="confluenceTd"><p>capture a number</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>capture a number</p></td><td class="confluenceTd"><p>end expression</p></td></tr></tbody></table>
<p>We would hope that the first capture group is just the name of the pope in each example, but as it turns out, it captures too much of the input. For example, the first pope breaks up as follows:</p>
<table class="confluenceTable"><tbody><tr><td class="confluenceTd"><p>/</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>(.*)</p></td><td class="confluenceTd"><p>(?: .*)?</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>/</p></td></tr><tr><td class="confluenceTd"><p>begin expression</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>Anastasius I</p></td><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p>399</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>401</p></td><td class="confluenceTd"><p>end expression</p></td></tr></tbody></table>
<p>Clearly the first capture group is capturing too much of the input.
We only want it to capture Anastasius, and the modifiers should be consumed by the non-capturing group that follows it. Another way to put this is that the first capture group should capture as little of the input as possible while still allowing a match. In this case it would be everything until the next space. Java regular expressions allow us to do this using "reluctant" versions of the *, + and ? operators. In order to make one of these operators reluctant, simply add a ? after it (to make *?, +? and ??). So our new regular expression would be:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>/Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/</pre></td></tr></table>
<p>Now let's look at how our new regular expression handles the most difficult of the inputs, the one before Pope Hilarius (a real jokester). It breaks up as follows:</p>
<table class="confluenceTable"><tbody><tr><td class="confluenceTd"><p>/</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>(.*?)</p></td><td class="confluenceTd"><p>(?: .*)?</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>([0-9]+)</p></td><td class="confluenceTd"><p>/</p></td></tr><tr><td class="confluenceTd"><p>begin expression</p></td><td class="confluenceTd"><p>Pope</p></td><td class="confluenceTd"><p>Leo</p></td><td class="confluenceTd"><p>I the Great</p></td><td class="confluenceTd"><p>440</p></td><td class="confluenceTd"><p>-</p></td><td class="confluenceTd"><p>461</p></td><td class="confluenceTd"><p>end expression</p></td></tr></tbody></table>
<p>Which is what we want.</p>
<p>So to test this out, we would use the code:</p>
<table class="wysiwyg-macro" data-macro-name="code" style="background-image: url(/plugins/servlet/confluence/placeholder/macro-heading?definition=e2NvZGV9&locale=en_GB&version=2); background-repeat: no-repeat;" data-macro-body-type="PLAIN_TEXT"><tr><td class="wysiwyg-macro-body"><pre>popesArray = [
    "Pope Anastasius I 399-401",
    "Pope Innocent I 401-417",
    "Pope Zosimus 417-418",
    "Pope Boniface I 418-422",
    "Pope Celestine I 422-432",
    "Pope Sixtus III 432-440",
    "Pope Leo I the Great 440-461",
    "Pope Hilarius 461-468",
    "Pope Simplicius 468-483",
    "Pope Felix III 483-492",
    "Pope Gelasius I 492-496",
    "Pope Anastasius II 496-498",
    "Pope Symmachus 498-514"
]

myClosure = {
    myMatcher = (it =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/);
    if (myMatcher.matches())
        println(myMatcher[0][1] + ": " + myMatcher[0][2] + " to " + myMatcher[0][3]);
}
popesArray.each(myClosure);</pre></td></tr></table>
<p>Try this code with the original regular expression as well to see the broken output.</p>
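<p>As a quick way to see the difference side by side, the following minimal sketch runs both the greedy and the reluctant expression against a single entry from the list. The variable names (pope, greedy, reluctant) are only illustrative, and group(1) is the plain java.util.regex.Matcher call for the first capture group rather than the array-style access used above.</p>

```groovy
// One entry from the list above, used as the only input.
def pope = "Pope Leo I the Great 440-461"

// The original (greedy) and the corrected (reluctant) expressions from this section.
def greedy    = (pope =~ /Pope (.*)(?: .*)? ([0-9]+)-([0-9]+)/)
def reluctant = (pope =~ /Pope (.*?)(?: .*)? ([0-9]+)-([0-9]+)/)

// matches() requires the whole string to match; both expressions satisfy that here.
assert greedy.matches()
assert reluctant.matches()

// After a successful matches(), group(1) returns the first capture group.
println("greedy:    " + greedy.group(1))    // the name plus its modifiers
println("reluctant: " + reluctant.group(1)) // just the name
```

<p>Both expressions match the whole line, so matches() alone cannot tell them apart. Only the contents of the first capture group differ.</p>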