Using Regular Expressions (lynda.com)
file:///Users/stedwar1/Desktop/Lynda.com/Ex_Files_UsingRegEx/Exercise%20Files/regexpal/index.html
Modes
- Standard: `/re/`
- Global: `/re/g`
- Case-insensitive: `/re/i`
- Multiline: `/re/m`
- Dot-matches-all: `/re/s`
Characters
Literal Characters
- Strings
- /car/ matches “car”
- /car/ matches the first three letters of “carnival”
- Similar to searching in a word processor
- Simplest match there is
- Case-sensitive (by default)
- Standard (non-global) matching
- Earliest (leftmost) match is always preferred
- /zz/ matches the first set of z's in “pizzazz”
- Global matching
- All matches are found throughout the text
- /zz/g matches both sets of z's in “pizzazz”
Regular expressions are eager. They are eager to return a match, so the earliest match is preferred.
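A minimal Python sketch of standard versus global matching. Python has no `/g` flag, so mapping standard matching to `re.search` and global matching to `re.findall` is an assumption made here for illustration.

```python
import re

text = "pizzazz"

# Standard (non-global) matching: re.search returns only the leftmost match.
first = re.search(r"zz", text)
first_span = first.span()

# Global matching: re.findall collects every non-overlapping match.
all_matches = re.findall(r"zz", text)

print(first_span)   # (2, 4)
print(all_matches)  # ['zz', 'zz']
```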
Metacharacters
- Characters with special meaning
- Like mathematical operators
- Transform literal characters into powerful expressions
- Only a few metacharacters to learn
- \ . * + - {} [] ^ $ | ? () : ! =
- Can have more than one meaning
- Variation between regex engines
The Wildcard Metacharacter
- . Matches any one character except newline
- Original Unix regex tools were line-based
- /h.t/ matches “hat”, “hot”, and “hit”, but not “heat”
- Broadest match possible
- Most common metacharacter
- Most common mistake
- /9.00/ matches “9.00”, “9500”, and “9-00”
Escaping Metacharacters
- Allows use of metacharacters as literal characters
- Match a period with \.
/9\.00/
matches “9.00”, but not “9500” or “9-00”
- Match a backslash by escaping a backslash (\\)
- Only for metacharacters
- Literal characters should never be escaped; escaping can give them a special meaning
- Quotation marks are not metacharacters, do not need to be escaped
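The difference between the wildcard and an escaped period can be checked with a short Python sketch. `re.fullmatch` is used so that each candidate string must match in its entirety; the candidate list simply reuses the strings from the notes.

```python
import re

candidates = ["9.00", "9500", "9-00"]

# Unescaped dot: a wildcard, so every candidate matches.
wildcard_hits = [s for s in candidates if re.fullmatch(r"9.00", s)]

# Escaped dot: only a literal period matches.
literal_hits = [s for s in candidates if re.fullmatch(r"9\.00", s)]

print(wildcard_hits)  # ['9.00', '9500', '9-00']
print(literal_hits)   # ['9.00']
```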
Other Special Characters
- Spaces (type a literal space to match)
- Tabs (\t)
- Line returns (\r, \n, \r\n)
- Non-printable characters
- bell (\a), escape (\e), form feed (\f), vertical tab (\v)
- ASCII or ANSI codes
- Codes that control appearance of a text terminal
- 0xA9 = \xA9
Character Sets
Defining a Character Set
- [ Begin a character set
- ] End a character set
- Any one of several characters
- But only one character
- Order of characters does not matter
- Examples
- /[aeiou]/ matches any one vowel
- /gr[ea]y/ matches “grey” and “gray”
- /gr[ea]t/ does not match “great”
Character Ranges
- - indicates a range of characters
- Range metacharacter
- Represents all characters between two characters
- Only a metacharacter inside a character set, a literal dash otherwise
- Examples
- [0-9]
- [A-Za-z]
- [a-ek-ou-y]
- Phone number [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
- Caution
- [50-99] does not mean the numbers from 50 to 99; it is the same as [0-9], with 5 and 9 listed again individually
Negative Character Sets
- ^ Negate a character set
- Not any one of several characters
- Add ^ as the first character inside a character set
- still represents one character
- Examples
- /[^aeiou]/ matches any one character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)
- /see[^mn]/ matches “seek” and “sees” but not “seem” or “seen”
- Caution
- /see[^mn]/ matches “see ” but not “see”
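The character set examples above can be sketched in Python as follows. `re.search` stands in for the notes' `/.../` tester, which is an assumption about how the examples were originally run.

```python
import re

# A set consumes exactly one character: "grey" and "gray" match, "great" does not.
gray_hits = [w for w in ["grey", "gray", "great"] if re.search(r"gr[ea]y", w)]
great_match = re.search(r"gr[ea]t", "great")  # None: "ea" is two characters

# [50-99] is just the set {5, 0-9, 9}, so it matches any single digit.
seven_matches = re.fullmatch(r"[50-99]", "7") is not None

# A negated set must still consume one character, so bare "see" fails.
see_hits = [s for s in ["seek", "sees", "seem", "seen", "see ", "see"]
            if re.search(r"see[^mn]", s)]

print(gray_hits)      # ['grey', 'gray']
print(seven_matches)  # True
print(see_hits)       # ['seek', 'sees', 'see ']
```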
Metacharacters inside character sets
- Metacharacters inside character sets are already escaped
- Do not need to escape them again
- /h[abc.xqz]t/ matches “hat” and “h.t”, but not “hot”, since the dot is a literal period inside the brackets.
- Exceptions
- ] - ^ \
- Examples
- /var[[(][0-9][\])]/
- /2003[-/]10[-/]05/ the hyphen does not require escaping because it is the first character in the set; the slash may need escaping in engines that use / as the delimiter
- /file[0\-\\_]1/ does require escaping
Shorthand Character Sets
| Shorthand | Meaning | Equivalent |
|---|---|---|
| \d | Digit | `[0-9]` |
| \w | Word character | `[a-zA-Z0-9_]` |
| \s | Whitespace | `[ \t\r\n]` |
| \D | Not digit | `[^0-9]` |
| \W | Not word character | `[^a-zA-Z0-9_]` |
| \S | Not whitespace | `[^ \t\r\n]` |
- \w
- Underscore is a word character
- Hyphen is not a word character
- Examples
- /\d\d\d\d/ matches “1984”, but not “text”
- /\w\w\w/ matches “ABC”, “123”, and “1_A”
- /\w\s\w\w/ matches “I am”, but not “Am I”
- /[\w\-]/ matches any word character or hyphen
- /[^\d]/ is the same as /\D/ and /[^0-9]/
- Caution
- /[^\d\s]/ is not the same as [\D\S]
- /[^\d\s]/ = NOT (digit OR space character), that is, neither a digit nor a space character
- /[\D\S]/ = Either NOT digit or NOT space character, which every character satisfies
- Support
- Originates with Perl
- All modern regex engines
- Not in many Unix tools
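The caution about `[^\d\s]` versus `[\D\S]` is easy to confirm with a small Python sketch. The sample string is a made-up mix of a letter, a digit, a space, an underscore, and a hyphen.

```python
import re

sample = "a1 _-"  # hypothetical sample: letter, digit, space, underscore, hyphen

# [^\d\s]: characters that are neither digits nor whitespace.
neither = re.findall(r"[^\d\s]", sample)

# [\D\S]: non-digit OR non-whitespace, which every character satisfies.
either = re.findall(r"[\D\S]", sample)

print(neither)  # ['a', '_', '-']
print(either)   # ['a', '1', ' ', '_', '-']
```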
POSIX Bracket Expressions
| Class | Meaning | Equivalent |
|---|---|---|
| [:alpha:] | Alphabetic characters | A-Za-z |
| [:digit:] | Numeric characters | 0-9 |
| [:alnum:] | Alphanumeric characters | A-Za-z0-9 |
| [:lower:] | Lowercase alphabetic characters | a-z |
| [:upper:] | Uppercase alphabetic characters | A-Z |
| [:punct:] | Punctuation characters | |
| [:space:] | Space characters | \s |
| [:blank:] | Blank characters (space, tab) | |
| [:print:] | Printable characters, spaces | |
| [:graph:] | Printable characters, no spaces | |
| [:cntrl:] | Control characters (non-printable) | |
| [:xdigit:] | Hexadecimal characters | A-Fa-f0-9 |
- Use inside a character class, not standalone
- Correct `[[:alpha:]]` or `[^[:alpha:]]`
- Incorrect `[:alpha:]`
- Good idea not to mix POSIX set and other shorthand sets
- Support
- Yes: Perl, PHP, Ruby, Unix
- No: Java, JavaScript, .NET, Python
- `ps aux | grep --regexp="s[[:digit:]]"` works
- `ps aux | grep --regexp="s[:digit:]"` is read as an ordinary character set: “s” followed by any one of “:”, “d”, “i”, “g”, “t”
Repetition Expressions
Repetition Metacharacters
| Metacharacter | Meaning |
|---|---|
| * | Preceding item zero or more times |
| + | Preceding item one or more times |
| ? | Preceding item zero or one time |
- Examples
- `/apples*/` matches “apple”, “apples”, and “applessssssssss”
- `/apples+/` matches “apples” and “applessssssss”, but not “apple”
- `/apples?/` matches “apple”, “apples”, but not “applesssssssss”
- `/\d\d\d\d*/` matches numbers with three digits or more
- `/\d\d\d+/` matches numbers with three digits or more
- `/colou?r/` matches “color” and “colour”
- `\w+s` matches “apples” and “applesssssss”, and inside longer words it matches up to the last “s” (“sens” in “sensation”), but it does not match “apple”
- `[a-z]+\d[a-z]*` matches “abc9xyz”, “a9xyz”, “abc9z”, and “abc9”, but not “9xyz”
- `\w+s` matches “apples” in “We picked apples”, and would still match if the word were “applessssss”: any word ending in one or more “s”
- Support
- * is supported in all regex engines
- + and ? are not supported in BREs (i.e., old Unix programs); plain grep treats them as literal characters unless extended syntax (`grep -E` or egrep) is used
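A short Python sketch of the three repetition metacharacters. It uses `re.fullmatch` so that the whole word has to match, which is an assumption that mirrors the "matches ... but not ..." wording of the examples.

```python
import re

words = ["apple", "apples", "applessss"]

star = [w for w in words if re.fullmatch(r"apples*", w)]      # zero or more "s"
plus = [w for w in words if re.fullmatch(r"apples+", w)]      # one or more "s"
optional = [w for w in words if re.fullmatch(r"apples?", w)]  # zero or one "s"

colour = [w for w in ["color", "colour"] if re.fullmatch(r"colou?r", w)]

print(star)      # ['apple', 'apples', 'applessss']
print(plus)      # ['apples', 'applessss']
print(optional)  # ['apple', 'apples']
print(colour)    # ['color', 'colour']
```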
Quantified repetition
| Metacharacter | Meaning |
|---|---|
| { | Start quantified repetition of preceding item |
| } | End quantified repetition of preceding item |
- `{min,max}`
- min and max are positive numbers
- min must always be included, can be zero
- max is optional
- Three syntaxes
- `\d{4,8}` matches numbers with four to eight digits
- `\d{4}` matches numbers with exactly four digits (with no comma, min is also the max)
- `\d{4,}` matches numbers with four or more digits (max is infinite)
- Examples
- `\d{0,}` is the same as `\d*`
- `\d{1,}` is the same as `\d+`
- `/\d{3}-\d{3}-\d{4}/` matches most U.S. phone numbers
- `/A{1,2} bonds/` matches “A bonds” and “AA bonds”, not “AAA bonds”
- More Examples
- `\w+\s` (almost) any word with a space after
- `\w{5}\s` any 5 word characters followed by a space
- `\w{2,5}\s` a minimum of 2 and up to 5 word characters followed by a space
- `\w{5,}\s` 5 or more word characters followed by a space
- `\w+_\d{2,4}-\d{2}` finds report_1997-04, budget_03-04, but not memo_71239-100
- `\w+_\d+-\d+` finds memo_71239-100 also
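The file name examples above can be checked with the following Python sketch. The names are the ones from the notes; `fullmatch` is an assumption used to require that the whole name fits the pattern.

```python
import re

names = ["report_1997-04", "budget_03-04", "memo_71239-100"]

# Bounded repetition: 2 to 4 digits, a hyphen, then exactly 2 digits.
strict = [n for n in names if re.fullmatch(r"\w+_\d{2,4}-\d{2}", n)]

# Unbounded repetition: any number of digits on both sides of the hyphen.
loose = [n for n in names if re.fullmatch(r"\w+_\d+-\d+", n)]

print(strict)  # ['report_1997-04', 'budget_03-04']
print(loose)   # ['report_1997-04', 'budget_03-04', 'memo_71239-100']
```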
Greedy expressions
- Example 1
- `01_FY_07_report_99.xls`
- `/\d+\w+\d+/`
- Matches 01_FY_07_report_99
- Example 2
- `“Milton”, “Waddams”, “Initech, Inc.”`
- `/“.+”, “.+”/`
- Matches the whole string
- Standard repetition quantifiers are greedy
- Expression tries to match the longest possible string (the repetition quantified part of the expression)
- Defers to achieving overall match
- `/.+\.jpg/` matches “filename.jpg”
- The + is greedy, but “gives back” the “.jpg” to make the match
- Think of it as rewinding or backtracking
- Gives back as little as possible to try to make the match
- `/.*[0-9]+/` matches “Page 266”
- The `.*` part matches “Page 26” while the `[0-9]+` part matches only “6”
- Regular expression engines are eager: they return the first result they can. If an attempt does not work out, the engine backtracks and tries again to match the last part of the expression.
- regular expression engines are greedy.
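To see how much a greedy quantifier keeps and how much it gives back, the pieces can be wrapped in capturing groups. The parentheses are an addition for illustration only; they are not part of the expressions in the notes.

```python
import re

# Greedy .* takes everything, then gives back just one digit for [0-9]+.
page = re.match(r"(.*)([0-9]+)", "Page 266")
page_groups = page.groups()

# Greedy .+ gives back exactly ".jpg" so the overall match can succeed.
jpg = re.match(r"(.+)(\.jpg)", "filename.jpg")
jpg_groups = jpg.groups()

print(page_groups)  # ('Page 26', '6')
print(jpg_groups)   # ('filename', '.jpg')
```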
Lazy Expressions
| Metacharacter | Meaning |
|---|---|
| ? | Make preceding quantifier lazy |
- Syntax
- `*?`
- `+?`
- `{min,max}?`
- `??`
- Instructs quantifier to use a “lazy strategy” for making choices
- Greedy strategy
- Match as much as possible before giving control to the next expression part
- Lazy strategy
- Match as little as possible before giving control to the next expression part
- Still defers to overall match
- Not necessarily faster or slower
- Examples
- `/\w*?\d{3}/`
- `/[A-Za-z-]+?\./`
- `/.{4,8}?_.{4,8}/`
- `/apples??/` (rarely meaningful: the lazy `?` prefers to skip the “s”, so the expression matches only “apple” even when the text is “apples”)
- Support
- Not supported in most Unix tools (BRE, ERE)
- `/.*?[0-9]+/`
- The lazy `.*?` first matches nothing and lets the digit part try to match. If the digit part fails, `.*?` takes one more character and passes control back to the digit part, and so on.
- `/.*?[0-9]*?/` matches the empty string: both lazy quantifiers accept zero repetitions, and matching nothing counts as “success” in the search's logic.
- Example 1
- `01_FY_07_report_99.xls`
- `/\d+\w+?\d+/`
- Now it matches “01_FY_07”
- Example 2
- `“Milton”, “Waddams”, “Initech, Inc.”`
- `/“.+?”, “.+?”/`
- Now it matches “Milton”, “Waddams”
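A side-by-side Python sketch of greedy and lazy matching on the two examples. Straight double quotes replace the curly quotes shown in the notes, on the assumption that the curly quotes are a typographic artifact.

```python
import re

filename = "01_FY_07_report_99.xls"
greedy_name = re.search(r"\d+\w+\d+", filename).group()
lazy_name = re.search(r"\d+\w+?\d+", filename).group()

quoted = '"Milton", "Waddams", "Initech, Inc."'
greedy_quoted = re.search(r'".+", ".+"', quoted).group()
lazy_quoted = re.search(r'".+?", ".+?"', quoted).group()

print(greedy_name)    # 01_FY_07_report_99
print(lazy_name)      # 01_FY_07
print(greedy_quoted)  # "Milton", "Waddams", "Initech, Inc."
print(lazy_quoted)    # "Milton", "Waddams"
```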
Efficiency When Using Repetition
- `/\w+s/`
- We picked apples
- In “picked”, the characters p, i, c, k, e, d are all word characters, so `\w+` consumes them until it reaches the space. It then tests the space against “s”, which fails, and backtracks one character at a time looking for an “s”. When that fails, the engine starts over at “i” and scans the rest of the word again, then again from “c”, from “k”, and so on.
- The parser cannot look globally like a human can.
- `/\w*s/`
- We picked apples.
- Starts like above, but because `\w*` can also match zero characters, at each starting position the engine additionally tries an empty match and tests the current character itself against “s” before moving on.
- Efficient matching + less backtracking = speedy results
- Define the quantity of repeated expressions
- `/.+/` is faster than `/.*/`
- `/.{5}/` and `/.{3,7}/` are even faster
- Narrow the scope of the repeated expression
- `/.+/` can become `/[A-Za-z]+/`
- Provide clearer starting and ending points
- When looking for something inside `<>` brackets,
- `/<.+>/` looks for `<`, then any character one or more times, then `>`
- `/<[^>]+>/` looks for `<`, then any character that is not a `>` one or more times, then `>`
- Use anchors and word boundaries
- Example
- `/\w*s/` would be improved as `/\w+s/`
- `/\w+s/` would be improved as `/[A-Za-z]+s/`
- Perhaps as `/[a-z]+s/` or as `/[A-Z][a-z]+s/` (one uppercase letter followed by one or more lowercase letters, followed by s)
- Search for whole words only
- Spaces, anchors, or word boundaries
- Scans “picked” but not “icked”, “cked”, “ked”, “ed”, or “d”
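Narrowing the repeated expression changes what is matched as well as how much backtracking is needed. The sketch below compares the two bracket patterns on a made-up HTML fragment; it illustrates the clearer ending point and does not measure speed.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Broad: the greedy dot runs to the last ">" in the string.
broad = re.findall(r"<.+>", html)

# Narrow: [^>]+ cannot cross a ">", so each tag ends at its own bracket.
narrow = re.findall(r"<[^>]+>", html)

print(broad)   # ['<b>bold</b> and <i>italic</i>']
print(narrow)  # ['<b>', '</b>', '<i>', '</i>']
```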
Grouping and Alternation Expressions
Grouping Metacharacters
| Metacharacter | Meaning |
|---|---|
| ( | Start grouped expression |
| ) | End grouped expression |
- Group portions of the expression
- Apply repetition operator to a group
- Makes expressions easier to read
- Captures group for use in matching and replacing
- Examples
- `/(abc)+/` matches “abc” and “abcabcabc”
- `/(in)?dependent/` matches “independent” and “dependent”
- `/run(s)?/` is the same as `/runs?/` but the former might be easier to understand
Alternation metacharacter
| Metacharacter | Meaning |
|---|---|
| `\|` | Match previous or next expression |
a.k.a. Pipe, OR
- `|` is an OR operator
- Either match expression on the left or match expression on the right
- Ordered, leftmost expression gets precedence
- Multiple choices can be daisy-chained; without global search, matching stops at the first match found.
- Group alternation expressions to keep them distinct
- Examples
- `/apple|orange/` matches “apple” and “orange”
- `/abc|def|ghi|jkl/` matches “abc”, “def”, “ghi”, and “jkl”
- `/apple(juice|sauce)/` is not the same as `/applejuice|sauce/`
- `/w(ei|ie)rd/` matches “weird” and “wierd”.
Writing Logical and Efficient Alternations
So far we have learned that:
- Regular expression engines are eager.
- Regular expression engines are greedy.
`/(peanut|peanutbutter)/` matches “peanut” in “peanutbutter”. It is eager to return a result and the leftmost item gets priority.
`/peanut(butter)?/` matches “peanutbutter”: “butter” is optional, but the greedy `?` prefers to match it when it can.
`/(\w+|FY\d{4}_report\.xls)/` This is an alternation: one or more word characters, or “FY”, four digits, and “_report.xls”. Using `FY2003_report.xls`, it matches “FY2003_report” and “xls” separately, not the whole file name, because the engine is eager to return a result and never tries the second alternative.
`/abc|def|ghi|jkl/` using the string “abcdefghijklmnopqrstuvwxyz” matches “abc” with global off. `/xyz|abc|def|ghi|jkl/` on the same string also matches “abc”, because the engine tries every alternative at the current position before moving forward, so it finds “abc” at the start long before it reaches “xyz” at the end.
`/(three|see|thee|tree)/` on “I think those are thin trees.”: the engine moves forward one character at a time and, at each position, starts over and tries the four options in order, backing up whenever one fails partway.
- Put simplest (most efficient) expression first
- `/\w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2}/`
- `/export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}/` is more efficient because the more permissive alternations are last.
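The eager and greedy behaviours of alternation can be reproduced with a few Python calls. `re.search` plays the role of a non-global match and `re.findall` the role of a global one, which is an assumption about how the examples map onto Python.

```python
import re

# Eager: the leftmost alternative that works wins, even if a longer one exists.
eager = re.search(r"peanut|peanutbutter", "peanutbutter").group()

# Greedy: the optional group is matched when it can be.
greedy = re.search(r"peanut(butter)?", "peanutbutter").group()

# Position beats order: "abc" at index 0 is found before "xyz" at the end.
alphabet = "abcdefghijklmnopqrstuvwxyz"
leftmost = re.search(r"xyz|abc|def|ghi|jkl", alphabet).group()

# The permissive first alternative hides the specific second one.
parts = re.findall(r"\w+|FY\d{4}_report\.xls", "FY2003_report.xls")

print(eager)     # peanut
print(greedy)    # peanutbutter
print(leftmost)  # abc
print(parts)     # ['FY2003_report', 'xls']
```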
Repeating and nesting alternations
- Repeating
- First matched alternation does not affect the next matches
- `/(AA|BB|CC){6}/` matches “AABBAACCAABB”
- Nesting
- Check nesting carefully
- `/(\d{2}([A-Z]{2}|-\d\w\d\w)|\d{4}(-\d{2}-[A-Z]{2,8}|_x[A-F]))/`
- Trade-off between precision, readability, and efficiency.
`/(\d\d|[A-Z][A-Z]){3}/` matches “112233”, “AABBCC”, “AA66ZZ”, and “11AA44”
`/(apple (juice|sauce)|milk(shake)?|sweet (peas|corn|potatoes))/`
- milk
- apple juice
- sweet peas
- yogurt (no match)
- sweet corn
- apple sauce
- milkshake
- sweet potatoes
`/(apple juice|apple sauce|milk|milkshake|sweet peas|sweet corn|sweet potatoes)/` is the same, just not nested.
`/[\w ]+/` matches all words.
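A Python sketch comparing the nested and flat alternations against the list above. It uses `fullmatch`, an assumption that requires each item to match as a whole; with a plain search, the flat version would stop at “milk” inside “milkshake” because “milk” is listed first.

```python
import re

nested = re.compile(r"(apple (juice|sauce)|milk(shake)?|sweet (peas|corn|potatoes))")
flat = re.compile(
    r"(apple juice|apple sauce|milk|milkshake|sweet peas|sweet corn|sweet potatoes)"
)

items = ["milk", "apple juice", "sweet peas", "yogurt",
         "sweet corn", "apple sauce", "milkshake", "sweet potatoes"]

nested_hits = [i for i in items if nested.fullmatch(i)]
flat_hits = [i for i in items if flat.fullmatch(i)]

# Repeated alternation: each repetition may pick a different alternative.
repeated = re.fullmatch(r"(AA|BB|CC){6}", "AABBAACCAABB") is not None

print(nested_hits == flat_hits)  # True
print("yogurt" in nested_hits)   # False
print(repeated)                  # True
```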
Additional notes from steve
`(?>...)` is a non-backtracking (atomic) group: http://stackoverflow.com/questions/15413594/what-does-mean-in-a-pcre-regex
Anchored Expressions
Start and end anchors
| Metacharacter | Meaning |
|---|---|
| `^` | Start of string/line. Note that `^` has a dual meaning: it also negates a character set when it is the first character in the set, as in `[^abcd]` |
| `$` | End of string/line |
| `\A` | Start of string, never start of line |
| `\Z` | End of string, never end of line |
- Reference a position, not an actual character
- Zero-width
- Examples
- `/^apple/` or `/\Aapple/` matches “apple” at the beginning
- `/apple$/` or `/apple\Z/` matches “apple” at the end
- `/^apple$/` or `/\Aapple\Z/` matches “apple” alone on a line
- Support
- ^ and $ are supported in all regex engines
- `\A` and `\Z` are supported in Java, .NET, Perl, PHP, Python, Ruby
Line Breaks and Multiline Mode
- Single-line mode
- ^ and $ do not match at line breaks
- `\A` and `\Z` do not match at line breaks
- Many Unix tools support only single line
- Multiline mode
- ^ and $ will match at the start and end of line
- `\A` and `\Z` do not match at line breaks
- Languages usually offer a multiline option
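- Java: `Pattern.compile("^regex$", Pattern.MULTILINE)`
- JavaScript: `/^regex$/m`
- .NET: `Regex.Match("string", "^regex$", RegexOptions.Multiline)`
- Perl: `m/^regex$/m`
- PHP: `preg_match("/^regex$/m", "string")`
- Python: `re.search("^regex$", "string", re.MULTILINE)`
- Ruby: `"string".match(/^regex$/)` (in Ruby, `^` and `$` always match at line breaks; the `/m` flag instead makes the dot match newlines)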
- Examples
- ^[a-z ]+ With multiline disabled, it only matches the item in the first line. With multiline enabled, it matches each line separately.
- [a-z ]+$ With multiline disabled, it only matches the item in the last line. With multiline enabled, it matches each line separately.
- ^[a-z ]+$ works similarly.
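The effect of multiline mode can be sketched in Python with `re.MULTILINE`. The three-line text is a hypothetical sample, not the exercise file used in the course.

```python
import re

text = "milk\napple juice\nsweet peas"  # hypothetical three-line sample

# Without multiline mode, ^ matches only at the start of the string.
single_start = re.findall(r"^[a-z ]+", text)

# With multiline mode, ^ also matches after each line break.
multi_start = re.findall(r"^[a-z ]+", text, re.MULTILINE)

# Without multiline mode, $ matches only at the end of the string.
single_end = re.findall(r"[a-z ]+$", text)

# \A ignores multiline mode and still matches only at the very start.
absolute_start = re.findall(r"\A[a-z ]+", text, re.MULTILINE)

print(single_start)    # ['milk']
print(multi_start)     # ['milk', 'apple juice', 'sweet peas']
print(single_end)      # ['sweet peas']
print(absolute_start)  # ['milk']
```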
Word Boundaries
| Metacharacter | Meaning |
|---|---|
| `\b` | Word boundary (start/end of word) |
| `\B` | Not a word boundary |
- Reference a position, not an actual character
- Conditions for matching
- Before the first word character in the string
- After the last word character in the string
- Between a word character and a non-word character
- Word characters: [A-Za-z0-9_]
- Support
- Most regex engines, not in the early Unix tools (BREs)
- egrep
- but not grep
- Boundary examples
- `/\b\w+\b/` finds four matches in “This is a test.”
- `/\b\w+\b/` matches all of “abc_123” but only part of “top-notch”
- Not a boundary examples
- `/\BThis/` does not match “This is a test.” because the position before the “T” at the start of the string is a word boundary.
- `/\B\w+\B/` finds two matches in “This is a test.” (“hi” and “es”)
- examples
- `/\b\w+\b/` finds most words, but not “summer's” which has a word boundary after summer and before the “s” after the apostrophe.
- `/\b[\w']+\b/` adding the apostrophe counts the whole word because it is greedy.
- `/\b[\w']+?\b/` causes the apostrophe to be a separate word
- `/\w+s/` matches “apples” in “We picked apples.”
- `/\b\w+s\b/` puts boundaries around the word and makes the expression more efficient by forcing it to look for whole words. It no longer checks “icked”, “cked”, “ked”, “ed”, and “d”, because each attempt must start at a boundary rather than at any word character.
- Caution
- A space is not a word boundary
- Word boundaries reference a position
- Not an actual character
- Zero-length
- Examples
- String: “apples and oranges”
- No Match: `/apples\band\boranges/`
- Match: `/apples\b \band\b \boranges/` There are boundaries around the words and spaces between the words causing a boundary between the space (which is a non-word character) and the next word.
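The boundary examples above can be run in Python as a quick check. `re.findall` is assumed to stand in for a global search.

```python
import re

sentence = "This is a test."

words = re.findall(r"\b\w+\b", sentence)              # whole words
inner = re.findall(r"\B\w+\B", sentence)              # interiors of words only
hyphen = re.findall(r"\b\w+\b", "abc_123 top-notch")  # hyphen splits the word

# \b is zero-width, so it cannot stand in for the space itself.
no_space = re.search(r"apples\band\boranges", "apples and oranges")
with_space = re.search(r"apples\b \band\b \boranges", "apples and oranges")

print(words)                   # ['This', 'is', 'a', 'test']
print(inner)                   # ['hi', 'es']
print(hyphen)                  # ['abc_123', 'top', 'notch']
print(no_space is None)        # True
print(with_space is not None)  # True
```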
Capturing Groups and Backreferences
Backreferences
- Grouped expressions are captured
- Stores the matched portion in parentheses
- `/a(p{2}l)e/` matches “apple” and stores “ppl”
- Stores the data matched, not the expression
- Automatically, by default
- Backreferences allow access to captured data
- Refer to first backreference with \1
| Metacharacter | Meaning |
|---|---|
| `\1 through \9` | Backreference for positions 1 to 9 |
| `\10 through \99` | Backreference for positions 10 to 99 |
- Usage
- Can be used in the same expression as the group
- Can be accessed after the match is complete (you would need to be inside some programming language to refer to the variables)
- Cannot be used inside character classes
- Support
- Most regex engines support `\1` through `\9`
- Some regex engines support `\10` through `\99`
- Some regex engines use $1 through $9 instead
- Examples
- `/(apples) to \1/` matches “apples to apples”
- `/(ab)(cd)(ef)\3\2\1/` matches “abcdefefcdab”
- `/<(i|em|b|strong)>.+?<\/\1>/` matches `<i>Hello</i>` and `<em>Hello</em>`
- finds the tag name (“i”, “em”, “b”, or “strong”), then any text inside the tag; the match is lazy so it does not skip ahead to a later closing tag
- Does not match `<i>Hello</em>` or `<em>Hello</i>`
- `/\b[A-Z][a-z]+/`
- `\b([A-Z][a-z]+)\b\s\b\1(s|d)on\b`
- matches “John Johnson” and “Evan Evanson”
- in “Steve Smith, John Johnson, Eric Erikson, Evan Evanson”
- `/\b(\w+)\s+\1\b/`
- matches “the the” in “Paris in the the spring”
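A Python sketch of the backreference examples. The tag pattern is written with an explicit `>` after the group, which is an assumption about the intended expression.

```python
import re

# \1 must repeat exactly what group 1 captured.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring").group()

# Opening and closing tags must agree.
tag = re.compile(r"<(i|em|b|strong)>.+?</\1>")
same_tag = tag.search("<i>Hello</i>") is not None
mixed_tag = tag.search("<i>Hello</em>") is not None

# First name repeated inside the surname.
people = "Steve Smith, John Johnson, Eric Erikson, Evan Evanson"
name_pattern = re.compile(r"\b([A-Z][a-z]+)\b\s\b\1(s|d)on\b")
names = [m.group() for m in name_pattern.finditer(people)]

print(doubled)    # the the
print(same_tag)   # True
print(mixed_tag)  # False
print(names)      # ['John Johnson', 'Evan Evanson']
```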
Backreferences to optional expressions
- Optional elements
- `/A?B/` matches “AB” and “B”
- Captures occur on zero-width matches
- `/(A?)B/` matches “AB” and captures “A”
- `/(A?)B/` matches “B” and captures “”
- Backreferences become zero-width too
- `/(A?)B\1/` matches “ABA” and “B”
- `/(A?)B\1C/` matches “ABAC” and “BC”
- Captures do not always occur on optional groups
- `/(A)?B/` matches “AB” and captures “A”
- `/(A)?B/` matches “B” and does not capture anything
- Backreference is to a group that failed to match
- `/(A)?B\1/` matches “ABA” but not “B”
- Except in JavaScript
- Element is optional, group/capture is not optional
- `/(A?)B/` matches “B” and captures “”
- Element is not optional, group/capture is optional
- `/(A)?B/` matches “B” and does not capture anything
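The difference between an optional element inside a group and an optional group shows up directly in Python, where an unmatched group is reported as `None` and a backreference to it fails.

```python
import re

# Optional element inside the group: the group always captures, possibly "".
inside = re.fullmatch(r"(A?)B", "B").group(1)

# Optional group: nothing is captured when the group is skipped.
outside = re.fullmatch(r"(A)?B", "B").group(1)

# A backreference to an empty capture matches the empty string.
empty_backref = re.fullmatch(r"(A?)B\1", "B") is not None

# A backreference to a group that never matched fails (in Python).
failed_backref = re.fullmatch(r"(A)?B\1", "B") is not None
matched_backref = re.fullmatch(r"(A)?B\1", "ABA") is not None

print(repr(inside))     # ''
print(outside)          # None
print(empty_backref)    # True
print(failed_backref)   # False
print(matched_backref)  # True
```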
Finding and replacing using backreferences
- TextMate works for this, but not the JavaScript tester
- Create a regular expression that matches target data
- Test regular expression and revise as needed
- Use anchors and specificity to narrow scope
- Add capturing groups
- Capture anything that varies row-to-row
- Write the replacement string
- Use all captures
- Add back anything not captured but still needed, like commas and spaces
- May need to use $1 instead of \1
U.S. Presidents example
- Using TextMate
- Find: `^(\d{1,2}),([\w .]+?) ([\w ]+?),(\d{4})`
- Replace `$1,$3,$2,$4`
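The same find and replace can be sketched in Python with `re.sub`, which uses `\1` rather than `$1` in the replacement string. The two rows are a hypothetical sample in an assumed "number,first last,year" layout; the actual exercise file is not shown in the notes.

```python
import re

# Hypothetical sample rows in an assumed "number,first last,year" layout.
rows = "1,George Washington,1789\n2,John Adams,1797"

pattern = re.compile(r"^(\d{1,2}),([\w .]+?) ([\w ]+?),(\d{4})", re.MULTILINE)

# Swap first and last name; commas are added back by the replacement string.
swapped = pattern.sub(r"\1,\3,\2,\4", rows)

print(swapped)
# 1,Washington,George,1789
# 2,Adams,John,1797
```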
Non-capturing group expressions
This is the third use of the ? mark. The first was to make an item optional, and the second was to make a quantifier lazy (non-greedy).
| Metacharacter | Meaning |
|---|---|
| `?:` | Specify a non-capturing group |
- Syntax
- `/(\w+)/` becomes `/(?:\w+)/`
- Turns off capture and backreferences
- Optimize for speed
- Preserve space for more captures
- Support
- Most regex engines except Unix tools
- `/(?:regex)/`
- ? = “Give this group a different meaning”
- : = “The meaning is non-capturing”
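A brief Python sketch of how `?:` changes what is captured. The input string is made up for illustration.

```python
import re

text = "apples oranges"  # hypothetical input

capturing = re.search(r"(\w+) (\w+)", text).groups()
non_capturing = re.search(r"(?:\w+) (\w+)", text).groups()

print(capturing)      # ('apples', 'oranges')
print(non_capturing)  # ('oranges',)
```

With the first group made non-capturing, the remaining group becomes group 1, so `\1` would refer to “oranges”.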