Documenting Regular Expressions in Groovy
This page offers suggestions for using regular expressions in Groovy. It focuses on documenting them in Groovy, but the suggestions apply to any programming language that supports regular expressions, such as Perl and Java.
Reference Links
Here are some useful reference links that you may want to keep open alongside this page:
- Groovy Regular Expressions
- java.util.regex.Pattern API
- java.util.regex.Matcher API
- PLEAC Pattern Matching (PLEAC, the Programming Language Examples Alike Cookbook, covers many programming languages)
Documenting the RegEx
It's important to document any regular expression, or "regex" for short, that is more than a trivial match. Documenting regexes is the key to making them understandable so they can be debugged and modified either by someone else or by you after you've had time to forget the details.
Overview
- Include a sample of text to match
  - give a plain English description of your goal
  - omit excess lines of sample if long
- Use extended patterns with comments
  - mark capturing groups by number
  - include "landmark" keys in the pattern
- Include debugging feedback
  - use debugging lines
Include a sample of text to match
Having a sample of the input that the regular expression is applied to right on screen is a great help in deciphering what the pattern is trying to match. This is best done in a block comment before the regular expression pattern is defined.
- give a plain English description of your goal
- note "landmark" keys in the pattern that you rely on to reliably parse the data
- list any sub-parts (captured groups) of the pattern you wish to use after the match
- omit excess lines of sample if long
For example, on a system that has remotely mounted disk space with names like "/nfs/data" or "/nfs/DATA", we wish to gather the free space in kilobytes and the name on which the space is mounted. The output of "df -k" (disk free space in kilobytes, on Linux/Mac/Unix systems) could be parsed by this pattern:
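As a minimal sketch, assuming the pattern is assembled from the constructs summarized on this page, the Groovy below applies it to a hypothetical `df -k` sample. The variable names (`dfOutput`, `dfPattern`, `freeByMount`) and all sample values are illustrative assumptions, not taken from a real system.

```groovy
// Hypothetical `df -k` output; hosts, sizes, and mount points are invented for illustration.
def dfOutput = '''\
Filesystem          1K-blocks      Used Available Use% Mounted on
/dev/sda1            41251136  18874368  20264960  49% /
server1:/export/d1  103212320  52428800  45547520  54% /nfs/data1
server2:/export/d2  206424640  20971520 174966784  11% /nfs/DATA2
'''

// Assumed pattern: free kilobytes, the Use% column, then the mount point.
def dfPattern = /(?i)(\d+)\s+\d+%\s+(\/nfs\/data[^\/]*)/

// Collect mount point -> free kilobytes, testing one line at a time.
def freeByMount = [:]
dfOutput.eachLine { line ->
    def m = line =~ dfPattern
    if (m.find()) {
        freeByMount[m.group(2)] = m.group(1).toLong()
    }
}

freeByMount.each { mount, freeKb -> println "${mount}: ${freeKb} KB free" }
```

Matching line by line is a deliberate choice in this sketch: `[^\/]*` also matches line breaks, so running the pattern over the whole output at once could let the second group spill onto the following line.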
The "(?i)" is a match flag that means the pattern is case insensitive.
To summarize the parts (regular expression constructs) here:
- (\d+) - One or more digits, captured for later use; "+" means 1 or more repetitions (see the Pattern API)
- \s+ - One or more whitespace characters
- \d+% - One or more digits followed by "%", the percentage of disk used
- (\/nfs\/data[^\/]*) - A partition name that starts with "/nfs/data", captured for later use
- \/ - A literal "/", escaped by "\" since a slash by itself starts or ends the pattern
- [^\/]* - Zero or more characters that are NOT "/"; "*" means 0 or more repetitions
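The constructs listed above can also be written one per line with a comment beside each. The sketch below assumes the `(?x)` "comments" flag of java.util.regex.Pattern, under which whitespace in the pattern is ignored and "#" starts a comment. The names `dfPatternCommented` and `sampleLine`, and the sample values, are hypothetical.

```groovy
// Same constructs as the summary, laid out with inline comments via the (?x) flag.
// Literal slashes are still escaped because the pattern is a Groovy slashy string.
def dfPatternCommented = /(?ix)
    (\d+)               # group 1: free space in kilobytes, one or more digits
    \s+                 # one or more whitespace characters
    \d+%                # percentage of disk used, a landmark before the name
    \s+                 # one or more whitespace characters
    (\/nfs\/data[^\/]*) # group 2: mount point starting with the nfs data prefix
/

// Hypothetical sample line, invented for illustration.
def sampleLine = 'server2:/export/d2  206424640  20971520 174966784  11% /nfs/DATA2'
def matcher = sampleLine =~ dfPatternCommented

if (matcher.find()) {
    println "free KB: ${matcher.group(1)}, mounted on: ${matcher.group(2)}"
}
```

Numbering the groups in the comments keeps the later `group(1)` and `group(2)` calls easy to trace back to the pattern.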
Following the suggestions above, a header comment is added:
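A header comment along these lines might look like the sketch below. It assumes the pattern is built from the constructs summarized above; the sample `df -k` lines, the name `nfsFreeSpacePattern`, and the test line are hypothetical.

```groovy
/*
 * Goal: for each remotely mounted partition named like /nfs/data or /nfs/DATA,
 *       gather the free space in kilobytes and the name it is mounted on.
 *
 * Sample `df -k` output (hypothetical values, excess lines omitted):
 *
 *   Filesystem          1K-blocks      Used Available Use% Mounted on
 *   server1:/export/d1  103212320  52428800  45547520  54% /nfs/data1
 *   ...
 *
 * Landmarks: the "%" after the usage figure, then the "/nfs/data" prefix.
 * Captured groups:
 *   1 - free space in kilobytes
 *   2 - name on which the space is mounted
 */
def nfsFreeSpacePattern = /(?i)(\d+)\s+\d+%\s+(\/nfs\/data[^\/]*)/

// Quick check against the sample line quoted in the header comment.
def headerSample = 'server1:/export/d1  103212320  52428800  45547520  54% /nfs/data1'
def m = headerSample =~ nfsFreeSpacePattern
if (m.find()) {
    println "${m.group(2)} has ${m.group(1)} KB free"
}
```

Keeping the sample, the goal, the landmarks, and the group numbers in one comment directly above the pattern means a reader can check the pattern against the sample without leaving the source file.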