Matching Strings to Patterns

We can define string patterns, also known as "regular expressions" or "regexes", and test whether a String matches one:
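A minimal sketch of whole-string matching with the ==~ operator. The sample strings and variable names are illustrative, not taken from the original page:

```groovy
// ==~ succeeds only when the ENTIRE string matches the pattern
def exact   = ('abc' ==~ /abc/)
def partial = ('abcd' ==~ /abc/)       // trailing 'd' is not covered by the pattern
def dotted  = ('Groovy' ==~ /Gr..vy/)  // . stands for any one character

assert exact
assert !partial
assert dotted
```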

Twelve characters have special meaning in regexes and must be quoted with a backslash to be matched literally:
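As an illustration, the sketch below assumes the twelve characters are the usual set \ ^ $ . | ? * + ( ) [ { and checks that each matches itself once it is backslash-quoted. Pattern.quote is shown as one way to quote a whole string at once:

```groovy
import java.util.regex.Pattern

// Assumed list of the twelve special characters
def special = ['\\', '^', '$', '.', '|', '?', '*', '+', '(', ')', '[', '{']
def allQuotable = special.every { ch -> ch ==~ ('\\' + ch) }
assert allQuotable

assert 'a.b' ==~ /a\.b/       // quoted dot matches only a literal dot
assert !('axb' ==~ /a\.b/)
assert 'axb' ==~ /a.b/        // unquoted dot matches any character

// Pattern.quote treats every character in the string as a literal
def quoted = ('1+1=2' ==~ Pattern.quote('1+1=2'))
assert quoted
```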

The escapes \c@, \cA, \cB, ..., \cZ, \c[, \c], \c^, and \c_ map to the control characters 0x0 to 0x1f, except 0x1c:
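A short sketch of a few of these escapes. The helper closure ctrl is hypothetical and only builds a one-character string from a character code:

```groovy
// Hypothetical helper: one-character string for a given code point
def ctrl = { int code -> String.valueOf((char) code) }

assert ctrl(0x00) ==~ /\c@/
assert ctrl(0x01) ==~ /\cA/
assert ctrl(0x1a) ==~ /\cZ/
assert ctrl(0x1b) ==~ /\c[/
assert ctrl(0x1f) ==~ /\c_/

def tabIsCtrlI = (ctrl(0x09) ==~ /\cI/)   // tab is control-I
assert tabIsCtrlI
```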

We have special pattern syntax for whitespace \s, word characters \w, digits \d, and their complements:
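For instance, a sketch with made-up sample strings:

```groovy
assert ' '  ==~ /\s/    // whitespace
assert '\t' ==~ /\s/
assert 'x'  ==~ /\S/    // anything that is not whitespace
assert '_'  ==~ /\w/    // word characters: letters, digits, underscore
assert '-'  ==~ /\W/
assert '7'  ==~ /\d/
assert 'x'  ==~ /\D/

def mixed = ('a1_ ' ==~ /\w\d\w\s/)
assert mixed
```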

There are certain characters that the dot . doesn't match, namely line terminators, unless the (?s) flag is used:
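A small sketch using a newline as the line terminator:

```groovy
def withoutFlag = ('a\nb' ==~ /a.b/)       // . does not match the newline
def withFlag    = ('a\nb' ==~ /(?s)a.b/)   // (?s) lets . match line terminators too

assert !withoutFlag
assert withFlag
```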

Some other flags:
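The original examples are not available here, so the following sketch assumes the flags meant are ones such as (?i), (?x), and (?m):

```groovy
def caseInsensitive = ('ABC' ==~ /(?i)abc/)       // (?i) ignores case
def freeSpacing     = ('abc' ==~ /(?x) a b c/)    // (?x) ignores whitespace in the pattern
def multiline       = ('a\nb' =~ /(?m)^b/).find() // (?m) lets ^ match after a line terminator

assert caseInsensitive
assert freeSpacing
assert multiline
```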

Some other ways to use flags:
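One possible illustration: flags can be passed to Pattern.compile, switched on or off part-way through a pattern, or scoped to a group. The sample patterns are assumptions:

```groovy
import java.util.regex.Pattern

// Flags as constants combined with |
def p = Pattern.compile('abc', Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
def compiledMatches = p.matcher('ABC').matches()
assert compiledMatches

assert 'aBC' ==~ /a(?i)bc/         // flag switched on mid-pattern
assert !('ABC' ==~ /a(?i)bc/)      // the leading a is still case-sensitive
assert 'aBc' ==~ /a(?i:b)c/        // flag applies only inside the group
assert !('aBC' ==~ /a(?i:b)c/)
assert 'Abc' ==~ /(?i)a(?-i)bc/    // flag switched off again
```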

A character class is a set of characters, one of which may be matched. We've already seen the predefined character classes \s, \w, \d, \S, \W, \D. We can also define our own:

The only meta-characters inside a character class are \, [, ^ (in the first position), ] (not in the first position or after the ^), - (not in the first position, after the ^, or before the ]), and &&. Quote them with a \ to get the literal character. The other usual meta-characters are normal characters inside a character class, and do not need to be quoted with a backslash, though they can be. Character class precedences are, from highest: literal escapes (eg \s), grouping (eg [abc]), ranges (eg a-g), unions (eg [abc][xyz]), then intersections ([a-z&&[gjpqy]]).
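A sketch of user-defined character classes covering the forms described above, with sample characters chosen for illustration:

```groovy
assert 'b' ==~ /[abc]/               // one of a, b, c
assert 'd' ==~ /[^abc]/              // anything except a, b, c
assert 'g' ==~ /[a-g]/               // range
assert 'x' ==~ /[abc[xyz]]/          // union
assert 'p' ==~ /[a-z&&[gjpqy]]/      // intersection
assert !('a' ==~ /[a-z&&[gjpqy]]/)

assert '-' ==~ /[a\-z]/              // quoted - is a literal hyphen
assert !('b' ==~ /[a\-z]/)
assert ']' ==~ /[\]]/                // quoted ]
assert '^' ==~ /[a^]/                // ^ is literal when not first

def dotIsLiteral = ('.' ==~ /[.]/) && !('x' ==~ /[.]/)
assert dotIsLiteral                  // . is an ordinary character inside a class
```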

We can use the alternation operator | to give some options:
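For example, assuming these sample words:

```groovy
assert 'cat' ==~ /cat|dog/
assert 'dog' ==~ /cat|dog/
assert !('catdog' ==~ /cat|dog/)   // only one alternative is chosen

// Parentheses limit the scope of the alternation
def grey = ('grey' ==~ /gr(a|e)y/)
def gray = ('gray' ==~ /gr(a|e)y/)
assert grey && gray
```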

We use ? to mark the preceding character or group as optional:
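A brief sketch with illustrative words:

```groovy
def american = ('color'  ==~ /colou?r/)   // the u may be absent
def british  = ('colour' ==~ /colou?r/)   // or present once
assert american && british
assert !('colouur' ==~ /colou?r/)

// Applied to a group, ? makes the whole group optional
assert 'Feb'      ==~ /Feb(ruary)?/
assert 'February' ==~ /Feb(ruary)?/
```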

Use {n} to match a character exactly n times:
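A quick sketch; the group example is an addition that assumes {n} applies to a parenthesised group in the same way:

```groovy
def three = ('aaa' ==~ /a{3}/)
assert three
assert !('aa'   ==~ /a{3}/)
assert !('aaaa' ==~ /a{3}/)

assert 'abab' ==~ /(ab){2}/        // the count applies to the whole group
assert '2024' ==~ /\d{4}/
```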

We can match a character a variable number of times. Use the * operator to match any number of a character:
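As a sketch, with + shown alongside for contrast since it requires at least one occurrence:

```groovy
def none = ('b'     ==~ /a*b/)   // zero occurrences of a
def many = ('aaaab' ==~ /a*b/)   // several occurrences of a
assert none && many
assert '' ==~ /a*/               // * can match the empty string

def plusNeedsOne = ('b' ==~ /a+b/)
assert !plusNeedsOne
assert 'ab' ==~ /a+b/
```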

By using longhand syntax, we see that the * operator is greedy, repeating the preceding token as often as possible, returning the leftmost longest match:
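A minimal sketch, taking "longhand" to mean the underlying java.util.regex.Matcher methods:

```groovy
def m = 'aaab' =~ /a*/       // =~ creates a java.util.regex.Matcher
assert m.find()
def longest = m.group()      // text of the match just found
def span = [m.start(), m.end()]

assert longest == 'aaa'      // * took every available a
assert span == [0, 3]
```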

Anything between parentheses is a capturing group, whose matched values can be accessed later:
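For illustration, a date pattern with three groups (the sample date is made up):

```groovy
def m = '2024-05-17' =~ /(\d{4})-(\d{2})-(\d{2})/
assert m.matches()
assert m.groupCount() == 3

def year  = m.group(1)
def month = m.group(2)
def whole = m.group(0)       // group 0 is the entire match
assert year == '2024' && month == '05' && whole == '2024-05-17'

// Groovy indexing returns the whole match followed by each group
def parts = m[0]
assert parts == ['2024-05-17', '2024', '05', '17']
```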

\1 through \9 in patterns are always interpreted as group references, and a backslash-escaped number greater than 9 is treated as a group reference if at least that many groups exist at that point in the string pattern. Otherwise digits are dropped until either the number is smaller than or equal to the existing number of groups or it is one digit. Grouping parentheses and group references cannot be used inside character classes.
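A sketch of these rules with small made-up patterns. The last case has only one group, so \11 is read as \1 followed by a literal 1:

```groovy
def doubled = ('abab' ==~ /(ab)\1/)      // \1 repeats whatever group 1 matched
assert doubled
assert !('abac' ==~ /(ab)\1/)
assert 'aa-bb' ==~ /(a)\1-(b)\2/

// Only one group exists, so the second digit of \11 is dropped from the reference
def droppedDigit = ('aa1' ==~ /(a)\11/)
assert droppedDigit
```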

Some miscellaneous methods:

Finding Patterns in Strings

As well as matching an entire string to a pattern, we can also find a pattern within a string using =~ syntax:
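A short sketch contrasting the two operators on an illustrative sentence:

```groovy
def text = 'the quick fox'

def m = text =~ /qu\w+/          // =~ returns a Matcher that can search within the string
assert m.find()
def found = m.group()
assert found == 'quick'

assert !(text ==~ /qu\w+/)       // ==~ would need the whole string to match
def missing = (text =~ /slow/).find()
assert !missing
```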

There can be more than one occurrence of the pattern:
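For example, Groovy lets us count and index the occurrences on the Matcher, or collect them all with findAll (sample text is illustrative):

```groovy
def m = 'a1b22c333' =~ /\d+/

def howMany = m.count            // number of occurrences
assert howMany == 3
assert m[0] == '1'               // index into the occurrences
assert m[1] == '22'
assert m[2] == '333'

def all = 'a1b22c333'.findAll(/\d+/)
assert all == ['1', '22', '333']
```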

Some longhand syntax, with various methods:
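One way the longhand form might look, using Pattern and Matcher directly and recording each match with its start and end offsets:

```groovy
import java.util.regex.Pattern

def m = Pattern.compile(/\d+/).matcher('a1b22')
def found = []
while (m.find()) {
    // group() is the matched text; start() and end() are its offsets
    found << [m.group(), m.start(), m.end()]
}
assert found == [['1', 1, 2], ['22', 3, 5]]
```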

We can group when finding with =~ just as we do when matching with ==~:
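A sketch with two occurrences, each carrying two groups (the key=value text is made up):

```groovy
def m = 'x=1, y=22' =~ /(\w)=(\d+)/

// Each occurrence is a list: whole match, then the groups
def first  = m[0]
def second = m[1]
assert first  == ['x=1', 'x', '1']
assert second == ['y=22', 'y', '22']
assert m[1][2] == '22'
```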

Calling collect() and each() requires some special tricks to work:

Aggregate functions we can use are:

The order of the text joined by operators such as | ? * + {} has no effect on whether the ==~ matcher succeeds, but does affect what the =~ finder finds. The first alternative of | is tried first, and backtracking to the second alternative is only tried if necessary. The optional item of ? is tried first, and backtracking to skip it is only tried if necessary. As much as possible of * + {} is found first, and backtracking to find less text is only tried if necessary.
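The effect of ordering can be sketched as follows, using a made-up string:

```groovy
// Matching the whole string succeeds whichever alternative comes first
assert 'abcd' ==~ /ab|abcd/
assert 'abcd' ==~ /abcd|ab/

// Finding returns the first alternative that works at that position
def shortFirst = ('abcd' =~ /ab|abcd/)[0]
def longFirst  = ('abcd' =~ /abcd|ab/)[0]
assert shortFirst == 'ab'
assert longFirst  == 'abcd'

// ? and * take their text when they can
def optionalTaken = ('ab' =~ /ab?/)[0]
assert optionalTaken == 'ab'
```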

Because the ? and * operators can match nothing, they may not always be intuitive to understand:
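Two illustrative cases: the finder is satisfied with an empty match at the first position, and a pattern that can match nothing matches between every character:

```groovy
def m = 'baaa' =~ /a*/
assert m.find()
def firstHit = m.group()
assert firstHit == ''            // empty match at position 0, before the b
assert m.start() == 0

// x* matches the empty string at every position of 'abc'
def dashed = 'abc'.replaceAll(/x*/, '-')
assert dashed == '-a-b-c-'
```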

By putting a ? after the operators ? * + {}, we can make them "lazy" instead of "greedy", that is, as little as possible is found first, and backtracking to find MORE text is tried if necessary:
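A sketch comparing greedy and lazy forms on sample text:

```groovy
def text = '<b>x</b>'

def greedy = (text =~ /<.+>/)[0]
def lazy   = (text =~ /<.+?>/)[0]
assert greedy == '<b>x</b>'      // as much as possible
assert lazy   == '<b>'           // as little as possible

assert ('aaa' =~ /a??/)[0] == ''         // lazy ? prefers to match nothing
assert ('aaa' =~ /a{1,3}?/)[0] == 'a'    // lazy {} takes the minimum
assert 'aaa' ==~ /a+?/                   // a whole-string match still succeeds
```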

We've seen some longhand methods such as 'find', 'matches', 'start', and 'end'. There are many more such methods:
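A few of them, as a sketch; this selection (lookingAt, reset, replaceAll, replaceFirst) is an assumption about which methods are of interest:

```groovy
def m = 'foo123' =~ /[a-z]+/
def prefixMatches = m.lookingAt()    // matches at the start, not necessarily to the end
assert prefixMatches
assert m.end() == 3
assert !m.matches()                  // the whole string does not match

m.reset('bar')                       // reuse the matcher on new input
assert m.matches()

def allReplaced   = 'a1b2'.replaceAll(/\d/, '#')
def firstReplaced = 'a1b2'.replaceFirst(/\d/, '#')
assert allReplaced == 'a#b#'
assert firstReplaced == 'a#b2'
```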

Similarly to back-references in patterns, $1 through $9 in replacement strings are always interpreted as group references, and a dollar-escaped number greater than 9 is treated as a group reference if at least that many groups exist in the string pattern. Otherwise digits are dropped until either the number is smaller than or equal to the existing number of groups or it is one digit.
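A sketch of group references in replacement strings. Single-quoted strings are used so that Groovy does not treat $ as string interpolation:

```groovy
def swapped = 'John Smith'.replaceAll(/(\w+) (\w+)/, '$2, $1')
assert swapped == 'Smith, John'

// Only two groups exist, so $11 is read as $1 followed by a literal 1
def dropped = 'ab'.replaceAll(/(a)(b)/, '$11')
assert dropped == 'a1'

// A backslash before $ gives a literal dollar sign
def literal = 'cost'.replaceAll(/cost/, '\\$5')
assert literal == '$5'
```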

We've already seen the greedy and lazy operators. There are also possessive operators, which act like greedy operators except that they never backtrack. Choosing between greedy and lazy operators affects the efficiency of a ==~ match but not whether it succeeds. Possessive operators, however, can change the outcome of a match:
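A minimal sketch, assuming the possessive form is written by appending + to the operator:

```groovy
def greedy     = ('aaa' ==~ /a*a/)    // a* gives one a back so the final a can match
def possessive = ('aaa' ==~ /a*+a/)   // a*+ keeps every a, so the final a has nothing left
assert greedy
assert !possessive

assert 'aaab' ==~ /a*+b/              // no backtracking needed, so this still matches
```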

Atomic grouping, a more general form of possessiveness, enables everything in the atomic group to be considered as one token. No backtracking occurs within the group, only outside of it:
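A sketch using the (?>...) syntax for an atomic group, next to an ordinary non-capturing group for comparison:

```groovy
// Ordinary group: after bc fails to leave a c, the engine backtracks and tries b
def ordinary = ('abc' ==~ /a(?:bc|b)c/)
// Atomic group: once bc has matched, the alternative b is never tried
def atomic   = ('abc' ==~ /a(?>bc|b)c/)
assert ordinary
assert !atomic

assert 'abcc' ==~ /a(?>bc|b)c/     // succeeds when the first choice works out
```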

Atomic grouping and possessiveness are handy with nested repetition, allowing much faster match failures.

Finding Positions in Strings

We can use ^ and $ to match the beginning and end of each line using flag m:
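For instance, with a three-line sample string:

```groovy
def text = 'one\ntwo\nthree'

def perLine   = text.findAll(/(?m)^\w+$/)   // ^ and $ match at each line
def wholeOnly = text.findAll(/^\w+$/)       // without (?m), only at the ends of the input
assert perLine == ['one', 'two', 'three']
assert wholeOnly == []

def linesStartingWithT = (text =~ /(?m)^t\w+/).count
assert linesStartingWithT == 2
```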

At the end of strings with \n at the end, $ matches twice:
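A sketch that records where $ matches. The pattern is written as a single-quoted string to keep the lone $ clear of Groovy string interpolation:

```groovy
def m = 'abc\n' =~ '$'
def positions = []
while (m.find()) {
    positions << m.start()
}
// once just before the final \n, and once at the very end
assert positions == [3, 4]

def marked = 'abc\n'.replaceAll('$', '!')
assert marked == 'abc!\n!'
```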

We can use \A, \Z, and \z to match the beginning and end of input, even in multiline mode:
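A sketch showing the difference from ^ in multiline mode, and between \Z and \z, on a sample string that ends with a newline:

```groovy
def text = 'one\ntwo\n'

assert text.findAll(/(?m)^\w+/) == ['one', 'two']   // ^ matches at every line
def inputStartOnly = text.findAll(/(?m)\A\w+/)      // \A only at the start of input
assert inputStartOnly == ['one']

def beforeFinalNewline = (text =~ /(?m)two\Z/).find()   // \Z allows a final line terminator
def atVeryEnd          = (text =~ /(?m)two\z/).find()   // \z is the absolute end
assert beforeFinalNewline
assert !atVeryEnd
assert (text =~ /(?m)two\n\z/).find()
```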

We can match at word boundaries:
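For example, assuming \b for a word boundary and \B for its complement:

```groovy
def replaced = 'cat concat cat'.replaceAll(/\bcat\b/, 'dog')
assert replaced == 'dog concat dog'     // the cat inside concat is untouched

def insideWord  = ('concat' =~ /\Bcat/).find()   // \B: not at a word boundary
def atWordStart = ('concat' =~ /\bcat/).find()
assert insideWord
assert !atWordStart
```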

We can look behind or ahead of a position, ie, find a position based on text that precedes or follows it, but without matching that text itself. We can only use strings of bounded length when looking behind, ie, literal text, character classes, finite repetition ( {length} and ? ), and alternation where each string in it is also of bounded length, because the maximum length of the match must be able to be predetermined:
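A sketch of the four lookaround forms with illustrative text: (?=...) and (?!...) look ahead, (?<=...) and (?<!...) look behind:

```groovy
def amount = 'price: 100 USD'.findAll(/\d+(?= USD)/)    // digits followed by ' USD'
assert amount == ['100']                                // ' USD' itself is not part of the match

def notBar = 'foobar foobaz'.findAll(/foo(?!bar)/)      // foo not followed by bar
assert notBar == ['foo']

def afterB    = 'a1 b2'.findAll(/(?<=b)\d/)             // digit preceded by b
def notAfterB = 'a1 b2'.findAll(/(?<!b)\d/)             // digit not preceded by b
assert afterB == ['2']
assert notAfterB == ['1']
```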

Matching positions in a string is useful for splitting the string, and for inserting text:
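Two illustrative uses of zero-width matches: inserting thousands separators, and splitting before each capital letter. The first pattern is a single-quoted string, hence the doubled backslashes:

```groovy
// Insert a comma at each position followed by a multiple of three digits
def grouped = '1234567'.replaceAll('(?<=\\d)(?=(\\d{3})+$)', ',')
assert grouped == '1,234,567'

// Split at the position before each capital letter
def words = ('camelCaseWord'.split(/(?=[A-Z])/) as List)
assert words == ['camel', 'Case', 'Word']
```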

We can split a string in many ways:
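Some of the possibilities, as a sketch. The original examples are not available, so this selection of split variants and tokenize is an assumption:

```groovy
def byDelims = ('a, b;c'.split(/\s*[,;]\s*/) as List)
assert byDelims == ['a', 'b', 'c']

def byDigit = ('a1b2c3'.split(/\d/) as List)        // trailing empty strings are dropped
assert byDigit == ['a', 'b', 'c']

def limited = ('a1b2c3'.split(/\d/, 2) as List)     // at most two pieces
assert limited == ['a', 'b2c3']

def keepEmpty = ('a,b,,'.split(/,/, -1) as List)    // negative limit keeps trailing empties
assert keepEmpty == ['a', 'b', '', '']

// tokenize() is a Groovy method that splits on whitespace without a regex
assert 'one two  three'.tokenize() == ['one', 'two', 'three']
```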

Restricting a String to a Region for a Pattern

We can set the limit of the part of the input string that will be searched to find a match:
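A sketch using the region method of java.util.regex.Matcher, with made-up input. By default the region bounds also act as anchors for ^ and $:

```groovy
def m = 'aaa bbb aaa' =~ /aaa/
m.region(4, 11)                  // search only from offset 4 up to offset 11
assert m.find()
def foundAt = m.start()
assert foundAt == 8              // the first aaa lies outside the region
assert !m.find()

def anchored = 'xxabcxx' =~ /^abc$/
anchored.region(2, 5)            // the region covers just abc
def regionMatches = anchored.matches()
assert regionMatches
```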
