Regular Expressions

Regular expressions are the Swiss Army knife of text processing. They give the programmer the ability to match and extract patterns from strings. The simplest example of a regular expression is a string of letters and numbers, and the simplest expression involving a regular expression uses the ==~ operator. For example, to match Dan Quayle's spelling of 'potato':
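A minimal sketch of that expression, assuming the misspelling in question is "potatoe":

```groovy
// The string on the left is compared against the pattern between the slashes.
// ==~ is true only when the whole string matches the pattern.
"potatoe" ==~ /potatoe/
```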

If you put that in the groovyConsole and run it, it will evaluate to true. There are a couple of things to notice. First is the ==~ operator, which is similar to the == operator, but matches patterns instead of computing exact equality. Second is that the regular expression is enclosed in /'s. This tells Groovy (and also anyone else reading your code) that this is a regular expression and not just a string.

But let's say that we also wanted to match the correct spelling. We could add a '?' after the 'e' to say that the 'e' is optional. The following will still evaluate to true.

And the correct spelling will also match:

But anything else will not match:
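The three cases above can be sketched as follows. The string "motato" is only an assumed stand-in for "anything else"; any other non-matching string would do.

```groovy
// The '?' makes the preceding 'e' optional, so both spellings match.
println("potatoe" ==~ /potatoe?/)   // prints true: the trailing 'e' is present
println("potato" ==~ /potatoe?/)    // prints true: the trailing 'e' is left out
println("motato" ==~ /potatoe?/)    // prints false: the first letter differs
```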

So this is how you define a simple boolean expression involving a regular expression. But let's get a little bit more tricky. Let's define a method that tests a regular expression. So for example, let's write some code to match Pete Wisniewski's last name:
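A sketch of what such a method might look like. The names checkSpelling, spellingAttempt and spellingRegularExpression come from the text; the printed messages and the boolean return value are assumptions added for illustration.

```groovy
// Prints a verdict and returns whether the attempt matched the pattern.
// The messages and the return value are illustrative assumptions.
def checkSpelling(spellingAttempt, spellingRegularExpression) {
    if (spellingAttempt ==~ spellingRegularExpression) {
        println("Congratulations, you spelled it correctly.")
        return true
    } else {
        println("Sorry, try again.")
        return false
    }
}

def spellingRegularExpression = /Wisniewski/
checkSpelling("Wisniewski", spellingRegularExpression)  // prints the congratulation
checkSpelling("Wisnewski", spellingRegularExpression)   // prints "Sorry, try again."
```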

There are a couple of new things we have done here. First is that we have defined a function (actually a method, but I'll use the two words interchangeably). A function is a collection of code similar to a closure. Functions always have names, whereas closures can be "anonymous". Once we define this function we can use it over and over later.

In this function the if statement uses the ==~ operator to test whether the parameter spellingAttempt matches the regular expression given to the function.

Now let's get a little bit more tricky. Let's say we also want to match the string if the name does not have the 'w' in the middle. We might write:
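A sketch of that change, assuming the pattern becomes /Wisniew?ski/. A compact stand-in for checkSpelling is repeated here only so the snippet runs on its own.

```groovy
// Compact stand-in for checkSpelling (messages are illustrative assumptions).
def checkSpelling(spellingAttempt, spellingRegularExpression) {
    def ok = spellingAttempt ==~ spellingRegularExpression
    println(ok ? "Congratulations, you spelled it correctly." : "Sorry, try again.")
    return ok
}

def spellingAttempt = "Wisnieski"
def spellingRegularExpression = /Wisniew?ski/   // the 'w' before "ski" is now optional
checkSpelling(spellingAttempt, spellingRegularExpression)
```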

The single ? that was added to the spellingRegularExpression says that the item directly before it (the character 'w') is optional. Try running this code with different spellings in the variable spellingAttempt to prove to yourself that the only two spellings accepted are now "Wisniewski" and "Wisnieski". (Note that you'll have to leave the definition of checkSpelling at the top of your groovyConsole.)

The *?* is one of the characters that have special meaning in the world of regular expressions. You should probably assume that any punctuation has special meaning.

Now let's also make it accept the spelling if "ie" in the middle is transposed. Consider the following:
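A sketch of such a pattern, assuming it is written as /Wisn(ie|ei)w?ski/. The list of attempts and the compact checkSpelling stand-in are illustrative assumptions so the snippet runs on its own.

```groovy
// Compact stand-in for checkSpelling (messages are illustrative assumptions).
def checkSpelling(spellingAttempt, spellingRegularExpression) {
    def ok = spellingAttempt ==~ spellingRegularExpression
    println(ok ? "Congratulations, you spelled it correctly." : "Sorry, try again.")
    return ok
}

// (ie|ei) accepts either letter order; w? keeps the 'w' optional.
def spellingRegularExpression = /Wisn(ie|ei)w?ski/

def attempts = ["Wisniewski", "Wisneiwski", "Wisnieski", "Wisneiski", "Wisnewski"]
// Keep only the attempts that the pattern accepts.
def accepted = attempts.findAll { checkSpelling(it, spellingRegularExpression) }
println(accepted)
```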

Once again, play around with the spelling. There should be only four spellings that work: "Wisniewski", "Wisneiwski", "Wisnieski" and "Wisneiski". The bar character '|' says that either the thing to the left or the thing to the right is acceptable, in this case "ie" or "ei". The parentheses are simply there to mark the beginning and end of the interesting section.

One last interesting feature is the ability to specify a group of characters all of which are ok. This is done using square brackets *[ ]*. Try the following regular expressions with various misspellings of Pete's last name:
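A few patterns of this kind, as a sketch. The specific character sets, the variable names and the sample spellings are assumptions chosen for illustration.

```groovy
def requiresOneOf = /Wis[abcd]niewski/    // requires one of 'a', 'b', 'c' or 'd' after "Wis"
def optionalOneOf = /Wis[abcd]?niewski/   // allows one of those letters, but does not require it
def anyLetter     = /Wis[a-zA-Z]niewski/  // requires any single letter after "Wis"
def noneOf        = /Wis[^abcd]niewski/   // requires any single character except 'a', 'b', 'c' or 'd'

println("Wisaniewski" ==~ requiresOneOf)  // prints true
println("Wisniewski" ==~ requiresOneOf)   // prints false: the extra letter is required
println("Wisniewski" ==~ optionalOneOf)   // prints true: the extra letter is optional
println("Wisxniewski" ==~ anyLetter)      // prints true
println("Wisxniewski" ==~ noneOf)         // prints true: 'x' is not in the excluded set
println("Wisaniewski" ==~ noneOf)         // prints false: 'a' is excluded
```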

The last one warrants some explanation. If the first character in the square brackets is a *^* then it means anything but the characters specified in the brackets.

The operators

So now that you have a sense for how regular expressions work, here are the operators that you will find helpful, and what they do:

Regular Expression Operators

Operator      What it does                                          Examples of what it matches
a?            matches 0 or 1 occurrence of *a*                      'a' or empty string
a*            matches 0 or more occurrences of *a*                  empty string or 'a', 'aa', 'aaa', etc
a+            matches 1 or more occurrences of *a*                  'a', 'aa', 'aaa', etc
a|b           match *a* or *b*                                      'a' or 'b'
.             match any single character                            'a', 'q', 'l', '_', '+', etc
[woeirjsd]    match any of the named characters                     'w', 'o', 'e', 'i', 'r', 'j', 's', 'd'
[1-9]         match any of the characters in the range              '1', '2', '3', '4', '5', '6', '7', '8', '9'
[^13579]      match any single character not named                  even digits, or any other character
(ie)          group an expression (for use with other operators)    'ie'
^a            match an *a* at the beginning of a line               'a'
a$            match an *a* at the end of a line                     'a'
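Several of the operators above can be tried out with ==~ as in the following sketch. The sample strings and the trailing 'b' in the first few patterns are assumptions chosen for illustration. Remember that ==~ is true only when the whole string matches.

```groovy
println("b" ==~ /a?b/)        // prints true: zero occurrences of 'a'
println("aab" ==~ /a?b/)      // prints false: more than one 'a'
println("aaab" ==~ /a*b/)     // prints true: any number of 'a's
println("b" ==~ /a+b/)        // prints false: at least one 'a' is required
println("q" ==~ /./)          // prints true: any single character
println("qq" ==~ /./)         // prints false: '.' stands for exactly one character
println("7" ==~ /[1-9]/)      // prints true: '7' is in the range
println("0" ==~ /[1-9]/)      // prints false: '0' is outside the range
println("4" ==~ /[^13579]/)   // prints true: '4' is not one of the named characters
println("3" ==~ /[^13579]/)   // prints false: '3' is one of the named characters
```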

There are a couple of other things you should know. If you want one of the operators above to mean the actual character, for instance to match a literal question mark, you need to put a '\' in front of it. For example:
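A sketch of such an expression. The sample sentences and the variable name are hypothetical; the pattern matches any string that ends in a question mark and contains no other question mark.

```groovy
// [^\?]+ : one or more characters that are not a question mark
// \?     : a literal question mark
def endsWithOneQuestionMark = /[^\?]+\?/

println("How tall is she?" ==~ endsWithOneQuestionMark)   // prints true
println("How tall is she" ==~ endsWithOneQuestionMark)    // prints false: no question mark at the end
println("Is she? Tall?" ==~ endsWithOneQuestionMark)      // prints false: a question mark appears mid-string
```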

This is your first really ugly regular expression. (The frequent use of these in Perl is one of the reasons it is considered a "write only" language.) By the way, Google knows how tall she is. The only way to understand expressions like this is to pick them apart:

Piece     What it means
/         begin expression
[^\?]     any character other than '?'
+         one or more of those
\?        a question mark
/         end expression

So the use of the \ in front of the ? makes it refer to an actual question mark.
