Using Regular Expressions (lynda.com)
http://www.regexpal.com/
file:///Users/stedwar1/Desktop/Lynda.com/Ex_Files_UsingRegEx/Exercise%20Files/regexpal/index.html
Modes
Characters
Literal Characters
Regular expressions are eager. They are eager to return a match, so the earliest match is preferred.
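A minimal Python sketch of eagerness, assuming the standard `re` module (the sample string is made up for illustration). The literal "cat" occurs twice, and the search reports the earliest occurrence:

```python
import re

# Hypothetical sample text; "cat" occurs inside "concatenate" and on its own.
text = "concatenate the cat"

# re.search scans left to right and returns the earliest match it finds.
first = re.search(r"cat", text)
print(first.start(), first.group())  # 3 cat
```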
Characters with special meaning
Only a few metacharacters to learn
Can have more than one meaning
Variation between regex engines
Allows use of metacharacters as literal characters
Only for metacharacters
Quotation marks are not metacharacters, do not need to be escaped
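As a rough illustration of escaping, the sketch below uses Python's `re` module and a made-up sample string. It compares an unescaped `.` (a metacharacter) with an escaped `\.` (a literal period):

```python
import re

text = "file.txt filextxt"  # made-up sample

# Unescaped, "." is a metacharacter that matches any character.
unescaped = re.findall(r"file.txt", text)
# Escaped with a backslash, "." matches only a literal period.
escaped = re.findall(r"file\.txt", text)

print(unescaped)  # ['file.txt', 'filextxt']
print(escaped)    # ['file.txt']
```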
Other Special Characters
Spaces (type a literal space to match)
Tabs (\t)
Line returns (\r, \n, \r\n)
Non-printable characters
Character Sets
Defining a Character Set
Character Ranges
Negative Character Sets
Shorthand Character Sets
| Shorthand | Meaning | Equivalent |
| \d | Digit | [0-9] |
| \w | Word character | [a-zA-Z0-9_] |
| \s | Whitespace | [ \t\r\n] |
| \D | Not digit | [^0-9] |
| \W | Not word character | [^a-zA-Z0-9_] |
| \S | Not whitespace | [^ \t\r\n] |
\w
Examples
/\d\d\d\d/ matches “1984”, but not “text”
/\w\w\w/ matches “ABC”, “123”, and “1_A”
/\w\s\w\w/ matches “I am”, but not “Am I”
/[\w\-]/ matches any word character or hyphen
/[\d\s]/ matches any digit or whitespace character
/[^\d]/ is the same as /\D/ and /[^0-9]/
Caution
/[^\d\s]/ is not the same as /[\D\S]/
/[^\d\s]/ = NOT (digit OR space character)
/[\D\S]/ = Either NOT digit OR NOT space character, which is true of every character
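The caution above can be checked with a short Python sketch (standard `re` module, made-up four-character sample). The negated set excludes digits and whitespace, while the union of the two negated shorthands ends up matching everything:

```python
import re

text = "a1 b"  # letter, digit, space, letter

# Negated set: characters that are neither a digit nor whitespace.
neither = re.findall(r"[^\d\s]", text)
# Union of two negated shorthands: non-digit OR non-whitespace.
# Every character is at least one of those, so everything matches.
either = re.findall(r"[\D\S]", text)

print(neither)  # ['a', 'b']
print(either)   # ['a', '1', ' ', 'b']
```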
Support
Originates with Perl
All modern regex engines
Not in many Unix tools
POSIX Bracket Expressions
| Class | Meaning | Equivalent |
| [:alpha:] | Alphabetic characters | A-Za-z |
| [:digit:] | Numeric characters | 0-9 |
| [:alnum:] | Alphanumeric characters | A-Za-z0-9 |
| [:lower:] | Lowercase alphabetic characters | a-z |
| [:upper:] | Uppercase alphabetic characters | A-Z |
| [:punct:] | Punctuation characters | |
| [:space:] | Space Characters | \s |
| [:blank:] | Blank characters (space, tab) | |
| [:print:] | Printable characters, spaces | |
| [:graph:] | Printable characters, no spaces | |
| [:cntrl:] | Control characters (non-printable) | |
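| [:xdigit:] | Hexadecimal characters | A-Fa-f0-9 |
ps aux | grep --regexp="s[[:digit:]]" works
ps aux | grep --regexp="s[:digit:]" returns s followed by any one of :, d, i, g, t (the class is only recognized inside an outer pair of brackets; some grep versions report a syntax error here instead)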
Repetition Expressions
| Metacharacter | Meaning |
| * | Preceding item zero or more times |
| + | Preceding item one or more times |
| ? | Preceding item zero or one time |
Examples
`/apples*/` matches “apple”, “apples”, and “applessssssssss”
`/apples+/` matches “apples”, “applessssssss”, but not “apple”
`/apples?/` matches “apple”, “apples”, but not “applesssssssss”
`/\d\d\d\d*/` matches numbers with three digits or more
`/\d\d\d+/` matches numbers with three digits or more
`/colou?r/` matches “color” and “colour”
`\w+s` matches “apples” and “applesssssss” (and parts of words that contain an s, such as the “sens” of “sensation”), but not “apple”
`[a-z]+\d[a-z]*` matches “abc9xyz” and “a9xyz”, but not “9xyz”. It also matches “abc9z” and “abc9”.
`\w+s` matches “apples” in “We picked apples”, even if it is spelled “applessssss”. It matches any word ending in one or more s characters.
Support
* is supported in all regex engines
+ and ? are not supported in BREs (i.e., old Unix programs); plain grep without -E does not treat + and ? as metacharacters
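A small Python sketch of the three repetition metacharacters, using the "apple" examples from above. It uses `re.fullmatch` (an assumption of this sketch, so that the whole word has to fit the pattern) to make the difference between `*`, `+`, and `?` visible:

```python
import re

words = ["apple", "apples", "applesss"]
patterns = [r"apples*", r"apples+", r"apples?"]

# fullmatch requires the whole word to fit the pattern, which makes the
# difference between the three quantifiers easy to see.
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, hits in results.items():
    print(pattern, hits)
# apples* ['apple', 'apples', 'applesss']
# apples+ ['apples', 'applesss']
# apples? ['apple', 'apples']
```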
Quantified repetition
| Metacharacter | Meaning |
| { | Start quantified repetition of preceding item |
| } | End quantified repetition of preceding item |
`{min,max}`
Three syntaxes
`\d{4,8}` matches numbers with four to eight digits
`\d{4}` matches numbers with exactly four digits (with no comma and no max, min is also the max)
`\d{4,}` matches numbers with four or more digits (max is infinite)
Examples
`\d{0,}` is the same as `\d*`
`\d{1,}` is the same as `\d+`
`/\d{3}-\d{3}-\d{4}/` matches most U.S. phone numbers
`/A{1,2} bonds/` matches “A bonds” and “AA bonds”, not “AAA bonds”
More Examples
`\w+\s` (almost) any word with a space after
`\w{5}\s` any 5 word characters followed by a space
`\w{2,5}\s` a minimum of 2 and a maximum of 5 word characters followed by a space
`\w{5,}\s` 5 or more word characters followed by a space
`\w+_\d{2,4}-\d{2}` finds report_1997-04, budget_03-04, but not memo_71239-100
`\w+_\d+-\d+` finds memo_71239-100 also
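The quantified patterns above can be tried out with a short Python sketch. The file names come from the notes; the phone number is a made-up placeholder:

```python
import re

names = ["report_1997-04", "budget_03-04", "memo_71239-100"]

# Bounded counts: 2 to 4 digits, a hyphen, then exactly 2 digits.
bounded = re.compile(r"\w+_\d{2,4}-\d{2}")
# Unbounded counts: one or more digits on each side of the hyphen.
unbounded = re.compile(r"\w+_\d+-\d+")

bounded_hits = [n for n in names if bounded.search(n)]
unbounded_hits = [n for n in names if unbounded.search(n)]

print(bounded_hits)    # ['report_1997-04', 'budget_03-04']
print(unbounded_hits)  # ['report_1997-04', 'budget_03-04', 'memo_71239-100']

# Hypothetical phone number in the common U.S. 3-3-4 layout.
phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-867-5309") is not None
print(phone_ok)  # True
```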
Greedy expressions
Example 1
Example 2
Standard repetition quantifiers are greedy
Expression tries to match the longest possible string (the repetition quantified part of the expression)
Defers to achieving overall match
`/.+\.jpg/` matches “filename.jpg”
The + is greedy, but “gives back” the “.jpg” to make the match
Think of it as rewinding or backtracking
Gives back as little as possible to try to make the match
Regular expression engines are eager. (The engine is eager to return a result. If an attempt does not work out, it backtracks and tries again to match the last part of the expression.)
Regular expression engines are greedy.
Lazy Expressions
| Metacharacter | Meaning |
| ? | Make preceding quantifier lazy |
Syntax
`*?`
`+?`
`{min,max}?`
`??`
Example 1
Example 2
`“Milton”, “Waddams”, “Initech, Inc.”`
`/“.+?”, “.+?”/`
Now it matches “Milton”, “Waddams”
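To see greedy and lazy quantifiers side by side, here is a minimal Python sketch using the Milton example. It uses straight double quotes instead of the curly quotes in the notes, which is an assumption made so the string is easy to type:

```python
import re

text = '"Milton", "Waddams", "Initech, Inc."'

# Greedy: each .+ grabs as much as it can and gives back as little as possible,
# so the match stretches to the final closing quote.
greedy = re.search(r'".+", ".+"', text).group()
# Lazy: each .+? takes as little as it can, so the match stops at the second name.
lazy = re.search(r'".+?", ".+?"', text).group()

print(greedy)  # "Milton", "Waddams", "Initech, Inc."
print(lazy)    # "Milton", "Waddams"
```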
Efficiency When Using Repetition
`/\w+s/`
We picked apples
In the “picked” part of the string, p, i, c, k, e, and d are all word characters. The engine consumes them until it reaches the space, then tests whether the next character is an “s”, which fails. It backtracks through the word looking for an “s”, finds none, then starts over at the “i” and goes through the rest of the word again, then again from the “c”, again from the “k”, etc.
The parser cannot look globally like a human can.
`/\w*s/`
We picked apples.
Starts like above, but after failing at the space and backtracking, it goes back to the “p”, lets `\w*` match zero characters, and tests whether the “p” itself is the “s” (which also fails).
Efficient matching + less backtracking = speedy results
Define the quantity of repeated expressions
Narrow the scope of the repeated expression
Provide clearer starting and ending points
When looking for something inside `<>` brackets,
`/<.+>/` looks for `<`, then any character one or more times, then `>`
`/<[^>]+>/` looks for `<`, then any character that is not a `>` one or more times, then `>`
Use anchors and word boundaries
Example
`/\w*s/` would be improved as `/\w+s/`
`/\w+s/` would be improved as `/[A-Za-z]+s/`
Perhaps as `/[a-z]+s/` or as `/[A-Z][a-z]+s/` (1 Uppercase followed by 1 or more other letters followed by s)
Search for whole words only
Spaces, anchors, or word boundaries
Scans “picked” but not “icked”, “cked”, “ked”, “ed”, or “d”
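The sketch below, in Python with made-up sample strings, shows how narrowing the repeated expression and adding word boundaries changes what is matched. It only demonstrates the results; it does not measure speed:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up sample

# Broad and greedy: .+ runs to the last ">" in the string.
broad = re.findall(r"<.+>", html)
# Narrowed: [^>]+ cannot run past the first ">" it meets.
narrow = re.findall(r"<[^>]+>", html)

print(broad)   # ['<b>bold</b> and <i>italic</i>']
print(narrow)  # ['<b>', '</b>', '<i>', '</i>']

# Word boundaries restrict the search to whole words ending in "s".
whole_words = re.findall(r"\b[A-Za-z]+s\b", "We picked apples")
print(whole_words)  # ['apples']
```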
Grouping and Alternation Expressions
| Metacharacter | Meaning |
| ( | Start grouped expression |
| ) | End grouped expression |
Group portions of the expression
Apply repetition operator to a group
Makes expressions easier to read
Captures group for use in matching and replacing
Examples
`/(abc)+/` matches “abc” and “abcabcabc”
`/(in)?dependent/` matches “independent” and “dependent”
`/run(s)?/` is the same as `/runs?/` but the former might be easier to understand
| Metacharacter | Meaning |
| `|` | Match previous or next expression |
a.k.a Pipe, OR
`|` is an OR operator
Either match expression on the left or match expression on the right
Ordered, leftmost expression gets precedence
Multiple choices can be daisy-chained, but it stops when first match is found if global search is not specified.
Group alternation expressions to keep them distinct
Examples
`/apple|orange/` matches “apple” and “orange”
`/abc|def|ghi|jkl/` matches “abc”, “def”, “ghi”, and “jkl”
`/apple(juice|sauce)/` is not the same as `/applejuice|sauce/`
`/w(ei|ie)rd/` matches “weird” and “wierd”.
Writing Logical and Efficient Alternations
So far we have learned that:
`/(peanut|peanutbutter)/` matches “peanut” in “peanutbutter”. It is eager to return a result and the leftmost item gets priority.
`/peanut(butter)?/` matches “peanutbutter”: even though “butter” is optional, the `?` is greedy, so including “butter” is preferred.
`/(\w+|FY\d{4}_report\.xls)/` This is an alternation: word character one or more times, or the second choice is “FY four digits, _report.xls”. Using `FY2003_report.xls`, it matches “FY2003_report” and “xls”, not the whole thing, because it is eager to return a result and never tried the second part.
`/abc|def|ghi|jkl/` using string “abcdefghijklmnopqrstuvwxyz” matches “abc” with global off.
`/xyz|abc|def|ghi|jkl/` using string “abcdefghijklmnopqrstuvwxyz” also matches “abc”: at the first position “xyz” fails, the engine tries the second alternative, and it succeeds long before the scan ever gets to the “xyz” at the end of the string.
`/(three|see|thee|tree)/` using string “I think those are thin trees.” moves forward one character at a time, trying each of the four options at every position and backtracking whenever one fails.
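The alternation behavior described above can be reproduced with a short Python sketch (standard `re` module; the sample strings are the ones used in the notes):

```python
import re

# Leftmost alternative wins, even though a longer one would also fit.
eager = re.search(r"(peanut|peanutbutter)", "peanutbutter").group()
# An optional group is greedy, so "butter" is included when it is there.
greedy = re.search(r"peanut(butter)?", "peanutbutter").group()
print(eager, greedy)  # peanut peanutbutter

# \w+ succeeds first at every position, so the second alternative is never used.
pieces = re.findall(r"\w+|FY\d{4}_report\.xls", "FY2003_report.xls")
print(pieces)  # ['FY2003_report', 'xls']

# Position in the string matters more than position in the alternation.
alphabet = "abcdefghijklmnopqrstuvwxyz"
first = re.search(r"xyz|abc|def|ghi|jkl", alphabet).group()
print(first)  # abc
```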
Repeating and nesting alternations
Repeating
Nesting
Check nesting carefully
`/(\d{2}([A-Z]{2}|-\d\w\d\w)|\d{4}(-\d{2}-[A-Z]{2,8}|_x[A-F]))/`
Trade-off between precision, readability, and efficiency.
`/(\d\d|[A-Z][A-Z]){3}/` matches “112233”, “AABBCC”, “AA66ZZ”, and “11AA44”
`/(apple (juice|sauce)|milk(shake)?|sweet (peas|corn|potatoes))/`
milk
apple juice
sweet peas
yogurt (no match)
sweet corn
apple sauce
milkshake
sweet potatoes
`/(apple juice|apple sauce|milkshake|milk|sweet peas|sweet corn|sweet potatoes)/` is the same, just not nested. (“milkshake” is listed before “milk” because the leftmost alternative wins.)
`/[\w ]+/` matches all words.
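A Python sketch of the nested and repeated alternations above. It uses the grocery list from the notes, with `milk(shake)?` as the dairy alternative; the extra string "1A2B3C" is a made-up non-matching case:

```python
import re

grocery = re.compile(r"(apple (juice|sauce)|milk(shake)?|sweet (peas|corn|potatoes))")
items = ["milk", "apple juice", "sweet peas", "yogurt",
         "sweet corn", "apple sauce", "milkshake", "sweet potatoes"]

# fullmatch is used so that each item has to match in its entirety.
matched = [item for item in items if grocery.fullmatch(item)]
unmatched = [item for item in items if not grocery.fullmatch(item)]
print(unmatched)  # ['yogurt']

# Repeated alternation: three pairs, each either two digits or two capitals.
pairs = re.compile(r"(\d\d|[A-Z][A-Z]){3}")
candidates = ["112233", "AABBCC", "AA66ZZ", "11AA44", "1A2B3C"]
pair_hits = [s for s in candidates if pairs.fullmatch(s)]
print(pair_hits)  # ['112233', 'AABBCC', 'AA66ZZ', '11AA44']
```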
Additional notes from steve
Anchored Expressions
Start and end anchors
| Metacharacter | Meaning |
| `^` | Start of string/line. Note this is a dual meaning for `^`. The other is to negate a character set if it is the first character in the set, as in `[^abcd]` |
| `$` | End of string/line |
| \A | Start of string, never start of line |
| \Z | End of string, never end of line |
Reference a position, not an actual character
Zero-width
Examples
`/^apple/` or `/\Aapple/` matches “apple” at the beginning
`/apple$/` or `/apple\Z/` matches “apple” at the end
`/^apple$/` or `/\Aapple\Z/` matches “apple” alone on a line
Support
^ and $ are supported in all regex engines
`\A` and `\Z` are supported in Java, .NET, Perl, PHP, Python, Ruby
Line Breaks and Multiline Mode
Single-line mode
^ and $ do not match at line breaks
`\A` and `\Z` do not match at line breaks
Many Unix tools support only single line
Multiline mode
^ and $ will match at the start and end of line
`\A` and `\Z` do not match at line breaks
Languages usually offer a multiline option
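Java: `Pattern.compile("^regex$", Pattern.MULTILINE)`
JavaScript: `/^regex$/m`
.NET: `Regex.Match("string", "^regex$", RegexOptions.Multiline)`
Perl: `m/^regex$/m`
PHP: `preg_match('/^regex$/m', 'string')`
Python: `re.search("^regex$", "string", re.MULTILINE)`
Ruby: `string.match(/^regex$/m)` (note that in Ruby, ^ and $ always match at line breaks; the /m flag makes `.` match newlines instead)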
Examples
^[a-z ]+ With multiline disabled, it only matches the item in the first line. With multiline enabled, it matches each line separately.
[a-z ]+$ With multiline disabled, it only matches the item in the last line. With multiline enabled, it matches each line separately.
^[a-z ]+$ works similarly.
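The effect of multiline mode can be sketched in Python, where it is switched on with `re.MULTILINE`. The three-line sample is made up for illustration:

```python
import re

text = "milk\napple juice\nsweet peas"  # three lines, made-up sample

# Multiline off: ^ matches only at the start of the string, $ only at the end.
start_single = re.findall(r"^[a-z ]+", text)
end_single = re.findall(r"[a-z ]+$", text)
print(start_single, end_single)  # ['milk'] ['sweet peas']

# Multiline on: ^ and $ also match at each line break.
start_multi = re.findall(r"^[a-z ]+", text, re.MULTILINE)
end_multi = re.findall(r"[a-z ]+$", text, re.MULTILINE)
print(start_multi)  # ['milk', 'apple juice', 'sweet peas']

# \A ignores the multiline option and still means start of string only.
a_multi = re.findall(r"\A[a-z ]+", text, re.MULTILINE)
print(a_multi)  # ['milk']
```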
Word Boundaries
| Metacharacter | Meaning |
| `\b` | Word boundary (start/end of word) |
| `\B` | Not a word boundary |
Reference a position, not an actual character
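A brief Python sketch of `\b`, using a made-up sentence in which "is" appears both as a whole word and inside other words:

```python
import re

text = "This island is beautiful"  # made-up sample

# Without boundaries, "is" is found inside "This" and "island" as well.
anywhere = re.findall(r"is", text)
# With \b on both sides, only the standalone word qualifies.
whole = re.search(r"\bis\b", text)

print(len(anywhere))                 # 3
print(whole.start(), whole.group())  # 12 is
```

Because `\b` is zero-width, the match is still just "is"; the boundaries add no characters to it.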
Capturing Groups and Backreferences
Backreferences
| Metacharacter | Meaning |
| `\1 through \9` | Backreference for positions 1 to 9 |
| `\10 through \99` | Backreference for positions 10 to 99 |
Usage
Can be used in the same expression as the group
Can be accessed after the match is complete (would need to be inside some programming language to refer to the variables)
Cannot be used inside character classes
Support
Most regex engines support `\1` through `\9`
Some regex engines support `\10` through `\99`
Some regex engines use $1 through $9 instead
Examples
`/(apples) to \1/` matches “apples to apples”
`/(ab)(cd)(ef)\3\2\1/` matches “abcdefefcdab”
`/<(i|em|b|strong)>.+?<\/\1>/` matches `<i>Hello</i>` and `<em>Hello</em>`
finds “i”, “em”, “b”, or “strong”, then any text inside the tag, not greedy so it does not skip to the next tag
Does not match `<i>Hello</em>` or `<em>Hello</i>`
`/\b[A-Z][a-z]+/`
`\b([A-Z][a-z]+)\b\s\b\1(s|d)on\b`
matches “John Johnson” and “Evan Evanson”
in “Steve Smith, John Johnson, Eric Erikson, Evan Evanson”
`/\b(\w+)\s+\1\b/`
Paris in the
the spring
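The backreference examples above can be tried in Python, which uses the `\1` syntax inside the pattern. The sketch assumes the standard `re` module and writes the closing-tag slash unescaped, since Python patterns are not delimited by slashes:

```python
import re

# The closing tag has to repeat whatever group 1 captured.
tag = re.compile(r"<(i|em|b|strong)>.+?</\1>")
same_tag = tag.search("<i>Hello</i>") is not None
mixed_tag = tag.search("<i>Hello</em>") is not None
print(same_tag, mixed_tag)  # True False

# Doubled word across a line break, as in "Paris in the / the spring".
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the\nthe spring")
print(doubled.group(1))  # the

# Groups are numbered by their opening parenthesis.
mirrored = re.fullmatch(r"(ab)(cd)(ef)\3\2\1", "abcdefefcdab") is not None
print(mirrored)  # True
```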
Backreferences to optional expressions
Optional elements
Captures occur on zero-width matches
Backreferences become zero-width too
Captures do not always occur on optional groups
Backreference is to a group that failed to match
Element is optional, group/capture is not optional
Element is not optional, group/capture is optional
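The difference between an optional element and an optional group can be seen in a small Python sketch. The patterns are made up for illustration, and engines differ here, so this shows only what Python's `re` module does:

```python
import re

# Element is optional, group is not: (a?) always captures, possibly an
# empty string, so \1 can match as zero-width.
element_optional = re.compile(r"(a?)b\1")
# Group is optional: (a)? may not capture at all, and in Python a
# backreference to a group that did not participate fails to match.
group_optional = re.compile(r"(a)?b\1")

e_empty = element_optional.fullmatch("b") is not None
e_full = element_optional.fullmatch("aba") is not None
g_empty = group_optional.fullmatch("b") is not None
g_full = group_optional.fullmatch("aba") is not None

print(e_empty, e_full)  # True True
print(g_empty, g_full)  # False True
```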
Finding and replacing using backreferences
Create a regular expression that matches target data
Test regular expression and revise as needed
Add capturing groups
Write the replacement string
Use all captures
Add back anything not captured but still needed, like commas and spaces
May need to use $1 instead of \1
U.S. Presidents example
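The find-and-replace steps above might look like this in Python. The exercise data is not reproduced in these notes, so the name list below is a made-up stand-in, and the "Last, First" target format is an assumption:

```python
import re

# Hypothetical input: one "First Last" name per line.
names = "George Washington\nJohn Adams\nThomas Jefferson"

# Step 1-3: match each line and capture the two parts that will be reused.
pattern = re.compile(r"^(\w+) (\w+)$", re.MULTILINE)

# Step 4: the replacement uses both captures and adds back the comma and
# space, which were not captured. Python accepts \1 and \2 here; other
# tools may need $1 and $2 instead.
swapped = pattern.sub(r"\2, \1", names)
print(swapped)
# Washington, George
# Adams, John
# Jefferson, Thomas
```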
Non-capturing group expressions
Third use of the ? mark. The first was to make a character optional, the second was to make a quantifier lazy (non-greedy).
| Metacharacter | Meaning |
| `?:` | Specify a non-capturing group |
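A short Python sketch of the difference `?:` makes, reusing the "independent" example from the grouping section. The second capturing group around "dependent" is added here only to show how the numbering shifts:

```python
import re

# Capturing: both groups are numbered and stored.
capturing = re.search(r"(in)?(dependent)", "independent").groups()
# Non-capturing: (?:in)? still groups for the ?, but stores nothing,
# so "dependent" becomes group 1.
non_capturing = re.search(r"(?:in)?(dependent)", "independent").groups()

print(capturing)      # ('in', 'dependent')
print(non_capturing)  # ('dependent',)
```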
8. Lookaround Assertions
Positive lookahead assertions