  • Documentation for Java classes can be found here
  • Documentation for Groovy extensions to Java classes can be found here

Regular Expressions

Regular expressions are the Swiss Army knife of text processing. They give the programmer the ability to match patterns in strings and to extract the matching text. The simplest example of a regular expression is a string of letters and numbers, and the simplest expression involving a regular expression uses the ==~ operator. For example, to match Dan Quayle's spelling of 'potato':

Code Block

"potatoe" ==~ /potatoe/

If you put that in the groovyConsole and run it, it will evaluate to true. There are a couple of things to notice. First is the ==~ operator, which is similar to the == operator, but tests whether the whole string matches a pattern instead of computing exact equality. Second is that the regular expression is enclosed in slashes. Groovy calls this a slashy string: it is still an ordinary String, but backslashes inside it do not need to be doubled, and the slashes signal to anyone reading your code that the text is meant as a regular expression.
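The difference between == and ==~ can be checked directly in the groovyConsole. The following is a minimal sketch; the variable name pattern is only an illustration and does not appear elsewhere in this tutorial.

```groovy
def pattern = /potatoe?/

// A slashy string is still a plain java.lang.String.
assert pattern instanceof String

// == compares the two strings character by character, so this is false.
assert !("potato" == pattern)

// ==~ treats the right-hand side as a regular expression and is true
// only when the whole left-hand string matches it.
assert "potato" ==~ pattern
assert "potatoe" ==~ pattern
assert !("potatoes" ==~ pattern)   // the trailing 's' is not covered by the pattern
```

If every assert passes, the script finishes silently; a failing assert would stop it with an error showing the values involved.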

But let's say that we also wanted to match the correct spelling. We could add a '?' after the 'e' to say that the 'e' is optional. The following will still evaluate to true.

Code Block

"potatoe" ==~ /potatoe?/

And the correct spelling will also match:

Code Block

"potato" ==~ /potatoe?/

But anything else will not match:

Code Block

"motato" ==~ /potatoe?/

So this is how you define a simple boolean expression involving a regular expression. But let's get a little bit more tricky. Let's define a method that tests a regular expression. So for example, let's write some code to match Pete Wisniewski's last name:

Code Block

def checkSpelling(spellingAttempt, spellingRegularExpression)
{
        if (spellingAttempt ==~ spellingRegularExpression)
        {
                println("Congratulations, you spelled it correctly.")
        } else {
                println("Sorry, try again.")
        }
}

theRegularExpression = /Wisniewski/
checkSpelling("Wisniewski", theRegularExpression)
checkSpelling("Wisnewski", theRegularExpression)

There are a couple of new things we have done here. First is that we have defined a function (actually a method, but I'll use the two words interchangeably). A function is a collection of code similar to a closure. Functions always have names, whereas closures can be "anonymous". Once we define this function we can use it over and over later.
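For comparison, here is a rough sketch of the same check written as a closure. The name spellingMessage is hypothetical, and unlike checkSpelling it returns the message instead of printing it, so the result can be reused.

```groovy
// Hypothetical closure counterpart of checkSpelling: the closure itself has
// no name; it is only stored in a variable called spellingMessage.
def spellingMessage = { spellingAttempt, spellingRegularExpression ->
    if (spellingAttempt ==~ spellingRegularExpression) {
        return "Congratulations, you spelled it correctly."
    }
    return "Sorry, try again."
}

println spellingMessage("Wisniewski", /Wisniewski/)
println spellingMessage("Wisnewski", /Wisniewski/)
```

The first call prints the congratulation and the second prints the "try again" message, mirroring the two calls to checkSpelling.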

In this function, the if statement uses the ==~ operator to test whether the parameter spellingAttempt matches the regular expression passed to the function.

Now let's get a little bit more tricky. Let's say we also want to match the string if the name does not have the 'w' in the middle. We might write:

Code Block

theRegularExpression = /Wisniew?ski/
checkSpelling("Wisniewski", theRegularExpression)
checkSpelling("Wisnieski", theRegularExpression)
checkSpelling("Wisniewewski", theRegularExpression)

The single ? that was added to theRegularExpression says that the item directly before it (the character 'w') is optional. Try running this code with different spellings passed as spellingAttempt to prove to yourself that the only two spellings accepted are now "Wisniewski" and "Wisnieski". (Note that you'll have to leave the definition of checkSpelling at the top of your groovyConsole.)
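One way to try many spellings at once is to filter a list of attempts. This is only a sketch; the list contents and the variable names attempts and accepted are made up for illustration.

```groovy
def pattern = /Wisniew?ski/

// A few spelling attempts, chosen arbitrarily for this example.
def attempts = ["Wisniewski", "Wisnieski", "Wisniewewski", "Wisniewwski"]

// Keep only the attempts that the pattern matches in full.
def accepted = attempts.findAll { it ==~ pattern }

println accepted   // prints [Wisniewski, Wisnieski]
```

The ? makes a single 'w' optional, so "Wisniewwski" (two w's) is rejected along with "Wisniewewski".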

The ? is one of the characters that have special meaning in the world of regular expressions. You should probably assume that any punctuation has special meaning.

Now let's also make it accept the spelling if "ie" in the middle is transposed. Consider the following:

Code Block

theRegularExpression = /Wisn(ie|ei)w?ski/
checkSpelling("Wisniewski", theRegularExpression)
checkSpelling("Wisnieski", theRegularExpression)
checkSpelling("Wisniewewski", theRegularExpression)

Once again, play around with the spelling. There should be only four spellings that work, "Wisniewski", "Wisneiwski", "Wisnieski" and "Wisneiski". The bar character '|' says that either the thing to the left or the thing to the right is acceptable, in this case "ie" or "ei". The parentheses are simply there to mark the beginning and end of the interesting section.
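The claim about the four accepted spellings can be checked with a short script. This is a sketch under the assumption that a handful of sample misspellings is enough to make the point; the lists accepted and rejected are illustrative.

```groovy
def pattern = /Wisn(ie|ei)w?ski/

// The four spellings the text says should work.
def accepted = ["Wisniewski", "Wisneiwski", "Wisnieski", "Wisneiski"]

// A few spellings that should still be refused.
def rejected = ["Wisnewski", "Wisniewewski", "Wisnieiwski"]

assert accepted.every { it ==~ pattern }
assert rejected.every { !(it ==~ pattern) }
```

Without the parentheses, the | would apply to everything on either side of it, so /Wisnie|eiw?ski/ would mean "Wisnie" or "eiw?ski", which is not what we want.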

One last interesting feature is the ability to specify a set of characters, any one of which is acceptable at that position. This is done using square brackets [ ]. Try the following regular expressions with various misspellings of Pete's last name:

Code Block

theRegularExpression = /Wis[abcd]niewski/ // requires exactly one of 'a', 'b', 'c' or 'd'
theRegularExpression = /Wis[abcd]?niewski/ // allows one of 'a', 'b', 'c' or 'd', but does not require it
theRegularExpression = /Wis[a-zA-Z]niewski/ // requires exactly one upper- or lower-case letter
theRegularExpression = /Wis[^abcd]niewski/ // requires exactly one character that is not 'a', 'b', 'c' or 'd'

The last one warrants some explanation. If the first character in the square brackets is a ^ then it means anything but the characters specified in the brackets.
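The behaviour of the four character-class patterns above can be illustrated with a few checks. The misspellings used here are invented for the example.

```groovy
// [abcd] demands exactly one of the listed characters at that position.
assert "Wisaniewski" ==~ /Wis[abcd]niewski/
assert !("Wisniewski" ==~ /Wis[abcd]niewski/)

// Adding ? makes that extra character optional.
assert "Wisniewski" ==~ /Wis[abcd]?niewski/
assert "Wisbniewski" ==~ /Wis[abcd]?niewski/

// A range accepts any single letter, upper or lower case.
assert "WisXniewski" ==~ /Wis[a-zA-Z]niewski/

// A leading ^ negates the class: any single character except a, b, c or d.
assert "Wisxniewski" ==~ /Wis[^abcd]niewski/
assert "Wis-niewski" ==~ /Wis[^abcd]niewski/
assert !("Wisaniewski" ==~ /Wis[^abcd]niewski/)
```

Note that a negated class still requires a character to be present, so plain "Wisniewski" does not match /Wis[^abcd]niewski/.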

The operators

So now that you have a sense for how regular expressions work, here are the operators that you will find helpful, and what they do:

Regular Expression Operators

Operator      What it does                                          What it matches
a?            matches 0 or 1 occurrence of a                        'a' or empty string
a*            matches 0 or more occurrences of a                    empty string or 'a', 'aa', 'aaa', etc
a+            matches 1 or more occurrences of a                    'a', 'aa', 'aaa', etc
a|b           match a or b                                          'a' or 'b'
.             match any single character                            'a', 'q', 'l', '_', '+', etc
[woeirjsd]    match any of the named characters                     'w', 'o', 'e', 'i', 'r', 'j', 's', 'd'
[1-9]         match any of the characters in the range              '1', '2', '3', '4', '5', '6', '7', '8', '9'
[^13579]      match any characters not named                        even digits, or any other character
(ie)          group an expression (for use with other operators)    'ie'
^a            match an a at the beginning of a line                 'a'
a$            match an a at the end of a line                       'a'
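A few of these operators can be tried out with the ==~ operator used earlier. This sketch covers only the operators that are easy to show with a whole-string match; the test strings are arbitrary.

```groovy
// * allows zero or more repetitions; + insists on at least one.
assert "" ==~ /a*/
assert "aaa" ==~ /a*/
assert !("" ==~ /a+/)
assert "aaa" ==~ /a+/

// | chooses between alternatives; . stands for exactly one character.
assert "b" ==~ /a|b/
assert !("ab" ==~ /a|b/)
assert "q" ==~ /./
assert !("qq" ==~ /./)

// Ranges and negated classes each match a single character.
assert "7" ==~ /[1-9]/
assert !("0" ==~ /[1-9]/)
assert "4" ==~ /[^13579]/
assert "x" ==~ /[^13579]/

// Parentheses group "ie" so that + repeats the whole group.
assert "ieieie" ==~ /(ie)+/
```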

There are a couple of other things you should know. If you want one of the operators above to stand for the actual character (say you want to match a question mark), you need to put a '\' in front of it. For example:

Code Block

// evaluates to true, as it will for any string that ends in a question mark and has no other question mark in it
"How tall is Angelina Jolie?" ==~ /[^\?]+\?/

This is your first really ugly regular expression. (The frequent use of these in Perl is one of the reasons it is considered a "write only" language.) By the way, Google knows how tall she is. The only way to understand expressions like this is to pick them apart:

/         begin expression
[^\?]     any character other than '?'
+         one or more of those
\?        a question mark
/         end expression

So the use of the \ in front of the ? makes it refer to an actual question mark.
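To see the effect of the escaping, the pattern can be run against a few more strings. The sample sentences and the variable name question are made up for this sketch.

```groovy
def question = /[^\?]+\?/

// One or more non-question-mark characters followed by a single '?'.
assert "How tall is Angelina Jolie?" ==~ question

// A question mark in the middle breaks the match.
assert !("Really? How tall?" ==~ question)

// At least one other character must come before the final '?'.
assert !("?" ==~ question)

// Without a trailing '?', there is no match at all.
assert !("No question mark here" ==~ question)
```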