Documenting Regular Expressions in Groovy

Here are some suggestions for using regular expressions in Groovy. This page mainly focuses on documenting regular expressions in Groovy; however, the suggestions apply to any programming language that supports regular expressions, such as Perl and Java.

Here are some useful reference links that you may want to open alongside this page:

Documenting the RegEx

It's important to document any regular expression, or "regex" for short, that is more than a trivial match. Documenting regexes is the key to making them understandable so they can be debugged and modified either by someone else or by you after you've had time to forget the details.

Overview

  • Include a sample of text to match
    • give a plain English description of your goal
    • omit excess lines of sample if long
  • Use extended patterns with comments
    • mark capturing groups by number
    • include "landmark" keys in the pattern
  • Include debugging feedback
    • use debugging lines

Include a sample of text to match

Having a sample of the input right on screen, next to the regular expression that is applied to it, is a great help in deciphering what the pattern is trying to match. This is best done in a block comment placed just before the regular expression pattern is defined.

  • give a plain English description of your goal
    • note "landmark" keys in the pattern that you rely on to reliably parse the data
    • list any sub-parts (captured groups) of the pattern you wish to use after the match
  • omit excess lines of sample if long

For example, on a system that has remotely mounted disk space with names like "/nfs/data" or "/nfs/DATA" we wish to gather the space free in kilobytes and the name on which the space is mounted. The output from the "df -k" (disk free space in kilobytes, on linux/mac/unix systems) could be parsed by this pattern:
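A minimal sketch of such a pattern, assembled from the constructs summarized below. The variable names and the sample line (a fragment like the ones in the debugging output later on this page) are illustrative assumptions:

```groovy
// Slashy-string regex; the leading ~ compiles it to a java.util.regex.Pattern.
// Assumption: reconstructed from the parts listed below, not an exact listing.
def pattern = ~/(?i)(\d+)\s+\d+%\s+(\/nfs\/data[^\/]*)/

// Illustrative "df -k" fragment: available KB, percent used, mount point.
def line = '3885824 63% /nfs/data_d2'

def matcher = pattern.matcher(line)
assert matcher.find()

def freeKb = matcher.group(1)      // first captured group: free space in KB
def mountPoint = matcher.group(2)  // second captured group: partition name
println "free=${freeKb} mounted on ${mountPoint}"
```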

The "(?i)" is a match flag that means the pattern is case insensitive.

To summarize the parts (regular expression constructs) here:

  • (\d+) - One or more digits, captured for later use, "+" means 1 or more repetitions, see the Pattern API
  • \s+ - One or more whitespace characters
  • \d+% - One or more digits followed by "%", the percentage of disk used
  • (\/nfs\/data[^\/]*) - look for a partition name that starts out with "/nfs/data", captured for later use
    • \/ - A literal "/", escaped by "\" since a slash by itself starts or ends the pattern
    • [^\/]* - Matches 0 or more characters that are NOT "/", "*" means 0 or more repetitions

Following the suggestions above, a header comment is added:
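One way such a header comment might look. The sample "df -k" lines are hypothetical (device name and sizes are made up for illustration), and the pattern is a reconstruction from the constructs listed above:

```groovy
/*
 * Goal: total the free space (in KB) on remotely mounted partitions whose
 * mount point starts with "/nfs/data" (any letter case), and note each name.
 *
 * Sample "df -k" output (hypothetical values, excess lines omitted):
 *
 *   Filesystem     1K-blocks      Used Available Use% Mounted on
 *   server:/d2      10485760   6599936   3885824  63% /nfs/data_d2
 *   ...
 *
 * Landmarks: the "NN%" column sits between the number we want and the
 *            mount point; the mount point starts with "/nfs/data".
 * Captured groups: 1 = available space in KB, 2 = mount point name.
 */
def pattern = ~/(?i)(\d+)\s+\d+%\s+(\/nfs\/data[^\/]*)/

// Quick check against the sample line quoted in the comment.
def matcher = pattern.matcher('server:/d2  10485760  6599936  3885824  63% /nfs/data_d2')
assert matcher.find()
println "group 1 = ${matcher.group(1)}, group 2 = ${matcher.group(2)}"
```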

Use extended patterns with comments

The extended match mode is enabled by a pattern match flag which allows white space and comments to be embedded into the pattern. You can then describe, piece by piece, the parts of the regular expression without dumping those details into the already large header comment suggested above. Pattern match flags are discussed in more detail later in this document.

In Groovy, this match flag is "(?x)", and it can be combined with other flags you wish to turn on, such as "(?ix)" for both extended and case-insensitive modes. This is done in conjunction with Groovy "here" documents (triple quoting), which are handled somewhat differently from the "slashy" quoting used for regular expression patterns. The three examples below are equivalent. Compared with the first, the second and third drop the back slashes that escaped the forward slashes and double each remaining back slash.

// slashy regex

pattern = ~/(?i)(\d+)\s+\d+%\s+(\/nfs\/data.*)/

// string converted to a regex

regex = "(?i)(\\d+)\\s+\\d+%\\s+(/nfs/data.*)"
pattern = ~regex

// here document string converted to a regex

regex = '''(?ix)(\\d+)\\s+\\d+%\\s+(/nfs/data.*)'''
pattern = ~regex

Essentially:

  • Forward slashes don't need to be escaped by back slashes so "\/" becomes "/"
  • Double the remaining back slashes. Back slashes need to be escaped by back slashes when quoting strings (either normal or here documents)
  • If you want to match whitespace, then you must use "\\s", because literal whitespace in an extended pattern is ignored
  • You can match "#" with "\\#" so that it's not interpreted as the beginning of a comment
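The rules above can be checked with a short sketch. The variable names and the sample line are assumptions chosen for illustration; the three patterns are the slashy, quoted, and here-document spellings discussed in this section:

```groovy
// Three spellings of the same pattern (names are illustrative).
def slashy  = ~/(?i)(\d+)\s+\d+%\s+(\/nfs\/data.*)/
def quoted  = ~"(?i)(\\d+)\\s+\\d+%\\s+(/nfs/data.*)"
def hereDoc = ~'''(?ix)
    (\\d+) \\s+ \\d+% \\s+   # whitespace between tokens is ignored
    (/nfs/data.*)            # partition name'''

def line = '3885824 63% /nfs/DATA-1/cat_data'

// Collect [group 1, group 2] from each pattern, or null if it does not match.
def results = [slashy, quoted, hereDoc].collect { p ->
    def m = p.matcher(line)
    m.find() ? [m.group(1), m.group(2)] : null
}
assert results.every { it == ['3885824', '/nfs/DATA-1/cat_data'] }

// In extended mode a literal space or "#" has to be written as \s or \#.
def hashPattern = ~'''(?x) item \\s \\# (\\d+)  # matches "item #42"'''
assert hashPattern.matcher('item #42').matches()
```

The quoted and here-document forms trade the escaped forward slashes for doubled back slashes; only the here-document form leaves room for line breaks and comments.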

What does the third example buy you? Now newlines and comments can be included. The third example (here document) above can also be written:

// here document string converted to a regex

regex = '''(?ix) # comments are now allowed!
(\\d+) # disk space
\\s+
\\d+% # one or more numbers followed by "%"
\\s+
(/nfs/data.*) # partition name'''
pattern = ~regex

This allows you to:

  • mark capturing groups by number
    • I mark these with a numbered comment like "# 1: The disk space we want"
  • explain "landmark" keys in the pattern
    • For example "\\d+% # a number followed by %". Not every line needs a comment, but don't leave out any important key matches.

Expanding on the example above:
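A sketch of how the expanded example might look, combining a header comment with the commented pattern that appears in the debugging output further down. The sample line in the header (device name and sizes) and the variable names are assumptions:

```groovy
/*
 * Goal: find the free space (KB) and the mount point of every partition
 *       mounted under "/nfs/data" (any letter case) in "df -k" output.
 *
 * Sample line (hypothetical device name and sizes):
 *   server:/export/d2  10485760  6599936  3885824  63% /nfs/data_d2/dog_data
 *
 * Landmarks: "NN%" just before the mount point; mount point starts with /nfs/data
 * Groups:    1 = free space in KB, 2 = partition name
 */
def regex = '''(?ix)       # enable case-insensitive matches, extended patterns
    (\\d+)          # 1: The disk space we want
    \\s+            # some whitespace
    \\d+%           # a number followed by %
    \\s+            # some more whitespace
    (/nfs/data.*)   # 2: partition name'''
def pattern = ~regex

// Try the pattern on the sample line from the header comment.
def sample = 'server:/export/d2  10485760  6599936  3885824  63% /nfs/data_d2/dog_data'
def matcher = pattern.matcher(sample)
assert matcher.find()
println "group 1 = ${matcher.group(1)}, group 2 = ${matcher.group(2)}"
```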

If that's not any easier to understand than what we started out with, I'll just assume you're the sort of person who never reads code comments.

Include debugging feedback

  • use debugging lines
    • they are easy to turn on/off with a flag variable
    • they verify the regular expression is working

While developing regular expressions, you will probably want to be able to easily test the result. An easy option is to add debugging lines, which can be turned on or off with a boolean flag. For little development programs and snippets in the groovyConsole, this is easier than setting up logging. The example above can be expanded with a 'debugging' flag and debugging lines like this:
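A minimal sketch of that idea, not the original listing. The "df -k" excerpt is hypothetical test data (device names and sizes are assumptions), and the exact wording and spacing of the printed lines may differ from the output shown below:

```groovy
def debugging = true   // set to false to print only the result

// Hypothetical "df -k" excerpt; long device names wrap onto a second line.
def dfOutput = '''Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             41284928  35010000   4177776  90% /
server:/export/d2/dog_data
                      10485760   6599936   3885824  63% /nfs/data_d2/dog_data
server:/export/data1/cat_data
                     316669952  56986752 259683200  18% /nfs/DATA-1/cat_data
'''

def regex = '''(?ix)       # enable case-insensitive matches, extended patterns
    (\\d+)          # 1: The disk space we want
    \\s+            # some whitespace
    \\d+%           # a number followed by %
    \\s+            # some more whitespace
    (/nfs/data.*)   # 2: partition name'''
def pattern = ~regex

// Collect every match first so the count can be reported before the details.
def matcher = pattern.matcher(dfOutput)
def matches = []
while (matcher.find()) {
    matches << [text: matcher.group(0), freeKb: matcher.group(1), name: matcher.group(2)]
}

if (debugging) {
    println "matcher pattern:\n${regex}"
    println "match count=${matches.size()}"
    matches.eachWithIndex { m, i ->
        println "text matched in matcher${i}: '${m.text}'"
        println "free space in (group 1): '${m.freeKb}'"
        println "partition name (group 2): '${m.name}'"
    }
}

// Sum the captured free-space values.
long totalKb = 0
for (m in matches) {
    totalKb += m.freeKb.toLong()
}
println "KB available=${totalKb}"
```

Collecting the matches into a list before printing keeps the debugging output separate from the calculation, so flipping the flag changes only what is printed.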

With 'debugging = true', this prints some information to show the regular expression is working on the test data:

matcher pattern:
/---------------------------------\
(?ix) # enable case-insensitive matches, extended patterns
(\d+) # 1: The disk space we want
\s+ # some whitespace
\d+% # a number followed by %
\s+ # some more whitespace
(/nfs/data.*) # 2: partition name
\---------------------------------/
match count=2
text matched in matcher0: '3885824 63% /nfs/data_d2/dog_data'
free space in (group 1): '3885824'
partition name (group 2): '/nfs/data_d2/dog_data'
text matched in matcher1: '259683200 18% /nfs/DATA-1/cat_data'
free space in (group 1): '259683200'
partition name (group 2): '/nfs/DATA-1/cat_data'
KB available=263569024

You can see by the output above that most of the input is ignored because it doesn't meet the described pattern. For those entries that are split across two lines, it turns out that all the information we want is in the second line, which still meets the pattern criteria, and the first line is ignored for not matching.

And if you set 'debugging = false', only the result is printed:

KB available=263569024
