Using Regular Expressions (lynda.com)

http://www.regexpal.com/

file:///Users/stedwar1/Desktop/Lynda.com/Ex_Files_UsingRegEx/Exercise%20Files/regexpal/index.html

Modes

Standard: `/re/`
Global: `/re/g`
Case-insensitive: `/re/i`
Multiline: `/re/m`
Dot-matches-all: `/re/s`

Characters

Literal Characters

Strings
- /car/ matches “car”
- /car/ matches the first three letters of “carnival”
- Similar to searching in a word processor
- Simplest match there is
Case-sensitive (by default)

Standard (non-global) matching
- Earliest (leftmost) match is always preferred
- /zz/ matches the first set of z's in “pizzazz”
Global matching
- All matches are found throughout the text
- /zz/g matches both sets of z's in “pizzazz”

Regular expressions are eager. They are eager to return a match, so the earliest match is preferred.
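A minimal sketch of standard versus global matching, assuming Python's `re` module as the engine (the notes themselves use a JavaScript tester). Here `re.search` plays the role of `/zz/` and `re.finditer` the role of `/zz/g`.

```python
import re

text = "pizzazz"

# Standard (non-global) matching: only the earliest (leftmost) match is returned.
first_span = re.search(r"zz", text).span()

# Global matching: every match in the text is found, like /zz/g.
all_spans = [m.span() for m in re.finditer(r"zz", text)]

print(first_span)  # (2, 4)
print(all_spans)   # [(2, 4), (5, 7)]
```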

Metacharacters

Characters with special meaning
- Like mathematical operators
- Transform literal characters into powerful expressions
Only a few metacharacters to learn
- \ . * + - {} [] ^ $ | ? () : ! =
Can have more than one meaning
Variation between regex engines

The Wildcard Metacharacter

. Matches any one character except newline
- Original Unix regex tools were line-based
- /h.t/ matches “hat”, “hot”, and “hit”, but not “heat”
- Broadest match possible
- Most common metacharacter
- Most common mistake
  - /9.00/ matches “9.00”, “9500”, and “9-00”

Escaping Metacharacters

Allows use of metacharacters as literal characters
- Match a period with \.
  - /9\.00/ matches “9.00”, but not “9500” or “9-00”
- Match a backslash by escaping a backslash (\\)
Only for metacharacters
- Literal characters should never be escaped; a backslash may give them a special meaning
Quotation marks are not metacharacters, do not need to be escaped
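A short sketch of the wildcard mistake and its fix, assuming Python's `re` module. `re.fullmatch` is used so each string must match from start to end, and `re.escape` is shown as one way to escape a literal string automatically.

```python
import re

prices = ["9.00", "9500", "9-00"]

# Unescaped dot is a wildcard, so all three strings match.
loose = [p for p in prices if re.fullmatch(r"9.00", p)]

# Escaped dot matches only a literal period.
strict = [p for p in prices if re.fullmatch(r"9\.00", p)]

# re.escape adds the backslash for us when the text is meant literally.
escaped = re.escape("9.00")

print(loose)    # ['9.00', '9500', '9-00']
print(strict)   # ['9.00']
print(escaped)  # 9\.00
```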

Other Special Characters

Spaces (type a literal space to match)
Tabs (\t)
Line returns (\r, \n, \r\n)
Non-printable characters
- bell (\a), escape (\e), form feed (\f), vertical tab (\v)
ASCII or ANSI codes
- Codes that control appearance of a text terminal
- 0xA9 = \xA9

Character Sets

Defining a Character Set

[ Begin a character set
] End a character set
Any one of several characters
- But only one character
- Order of characters does not matter
Examples
- /[aeiou]/ matches any one vowel
- /gr[ea]y/ matches “grey” and “gray”
- /gr[ea]t/ does not match “great”

Character Ranges

- indicates a range of characters
Range metacharacter
- Represents all characters between two characters
- Only a metacharacter inside a character set, a literal dash otherwise
Examples
- [0-9]
- [A-Za-z]
- [a-ek-ou-y]
- Phone number [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
Caution
- [50-99] is not all numbers from 50 to 99, it is the same as [0-9] with 5 and 9 repeated individually
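The phone number pattern and the `[50-99]` caution can be checked with a small sketch, assuming Python's `re` module. The phone number used here is made-up sample data.

```python
import re

phone = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
is_phone = bool(phone.fullmatch("555-867-5309"))

# [50-99] is the characters 5, 0-9, and 9, so it still matches ONE digit only.
one_digit = bool(re.fullmatch(r"[50-99]", "7"))
two_digits = bool(re.fullmatch(r"[50-99]", "75"))

print(is_phone, one_digit, two_digits)  # True True False
```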

Negative Character Sets

^ Negate a character set
Not any one of several characters
- Add ^ as the first character inside a character set
- Still represents one character
Examples
- /[^aeiou]/ matches any one character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)
- /see[^mn]/ matches “seek” and “sees” but not “seem” or “seen”
Caution
- /see[^mn]/ matches “see ” but not “see”
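A quick check of the caution above, assuming Python's `re` module: a negative set must still consume one character, so a bare “see” with nothing after it does not match.

```python
import re

words = ["seek", "sees", "seem", "seen", "see ", "see"]

# [^mn] needs one character that is not m or n; a space qualifies,
# but the end of the string does not.
matched = [w for w in words if re.search(r"see[^mn]", w)]

print(matched)  # ['seek', 'sees', 'see ']
```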

Metacharacters inside character sets

Metacharacters inside character sets are already escaped
- Do not need to escape them again
- /h[abc.xqz]t/ matches “hat” and “h.t”, but not “hot” since the dot is escaped inside the brackets.
Exceptions
- ] - ^ \
Examples
- /var[[(][0-9][\])]/
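- /2003[-/]10[-/]05/ the hyphen does not require escaping because it is the first character in the set; the slash may require escaping (\/) in engines that use / as the pattern delimiter
- /file[0\-_]1/ the hyphen does require escaping here, because an unescaped 0-_ would be read as a range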

Shorthand Character Sets

Shorthand	Meaning	Equivalent
\d	Digit	[0-9]
\w	Word character	[a-zA-Z0-9_]
\s	Whitespace	[ \t\r\n]
\D	Not digit	`[^0-9]`
\W	Not word character	`[^a-zA-Z0-9_]`
\S	Not whitespace	`[^ \t\r\n]`

\w
- Underscore is a word character
- Hyphen is not a word character
Examples
- /\d\d\d\d/ matches “1984”, but not “text”
- /\w\w\w/ matches “ABC”, “123”, and “1_A”
- /\w\s\w\w/ matches “I am”, but not “Am I”
- /[\w\-]/ matches any word character or hyphen
- /[^\d]/ is the same as /\D/ and /[^0-9]/
Caution
- /[^\d\s]/ is not the same as /[\D\S]/
- /[^\d\s]/ = NOT (digit OR space character)
- /[\D\S]/ = EITHER not a digit OR not a space character, which matches every character
Support
- Originates with Perl
- All modern regex engines
- Not in many Unix tools
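The caution about `[^\d\s]` versus `[\D\S]` is easy to see on a three-character sample. This is a sketch assuming Python's `re` module.

```python
import re

sample = "a1 "  # a letter, a digit, and a space

# Negated set: neither a digit nor whitespace, so only the letter is left.
neither = re.findall(r"[^\d\s]", sample)

# Set of negated shorthands: "not a digit" OR "not whitespace" covers everything.
either = re.findall(r"[\D\S]", sample)

print(neither)  # ['a']
print(either)   # ['a', '1', ' ']
```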

POSIX Bracket Expressions

Class	Meaning	Equivalent
[:alpha:]	Alphabetic characters	A-Za-z
[:digit:]	Numeric characters	0-9
[:alnum:]	Alphanumeric characters	A-Za-z0-9
[:lower:]	Lowercase alphabetic characters	a-z
[:upper:]	Uppercase alphabetic characters	A-Z
[:punct:]	Punctuation characters
[:space:]	Space Characters	\s
[:blank:]	Blank characters (space, tab)
[:print:]	Printable characters, spaces
[:graph:]	Printable characters, no spaces
[:cntrl:]	Control characters (non-printable)
[:xdigit:]	Hexadecimal characters	A-Fa-f0-9

Use inside a character class, not standalone
- Correct: `[[:alpha:]]` or `[^[:alpha:]]`
- Incorrect: `[:alpha:]`
Good idea not to mix POSIX sets and other shorthand sets
Support
- Yes: Perl, PHP, Ruby, Unix
- No: Java, JavaScript, .NET, Python

ps aux | grep --regexp="s[[:digit:]]" works
ps aux | grep --regexp="s[:digit:]" returns s followed by any one of :, d, i, g, t

Repetition Expressions

Repetition Metacharacters

Metacharacter	Meaning
*	Preceding item zero or more times
+	Preceding item one or more times
?	Preceding item zero or one time

Examples
- `/apples*/` matches “apple”, “apples”, and “applessssssssss”
- `/apples+/` matches “apples”, “applessssssss”, but not “apple”
- `/apples?/` matches “apple”, “apples”, but not “applesssssssss”
- `/\d\d\d\d*/` matches numbers with three digits or more
- `/\d\d\d+/` matches numbers with three digits or more
- `/colou?r/` matches “color” and “colour”
- `\w+s` matches “apples” and “applesssssss” (and the part of a longer word up to its last “s”, such as “sens” in “sensation”), but not “apple”
- `[a-z]+\d[a-z]*` matches “abc9xyz”, “a9xyz”, “abc9z”, and “abc9”, but not “9xyz”
- `\w+s` matches “apples” in “We picked apples”, even if it is spelled “applessssss”: any run of word characters ending in one or more “s”

Support
- * is supported in all regex engines
- + and ? are not supported in BREs (i.e., old Unix programs) (grep does not support + and ?)
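The three repetition metacharacters can be compared side by side. This is a sketch assuming Python's `re` module; `whole_matches` is a hypothetical helper that uses `re.fullmatch`, so “matches” here means the entire string matches, not just a part of it.

```python
import re

candidates = ["apple", "apples", "applesss"]

def whole_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = whole_matches(r"apples*")      # zero or more "s"
plus = whole_matches(r"apples+")      # one or more "s"
optional = whole_matches(r"apples?")  # zero or one "s"

# \w+ first runs to the end of the word, then backs up to the last "s".
partial = re.search(r"\w+s", "sensation").group()

print(star)      # ['apple', 'apples', 'applesss']
print(plus)      # ['apples', 'applesss']
print(optional)  # ['apple', 'apples']
print(partial)   # sens
```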

Quantified repetition

Metacharacter	Meaning
{	Start quantified repetition of preceding item
}	End quantified repetition of preceding item

`{min,max}`
- min and max are positive numbers
- min must always be included, can be zero
- max is optional
Three syntaxes
- `\d{4,8}` matches numbers with four to eight digits
- `\d{4}` matches numbers with exactly four digits (min is also the max when no comma or max is given)
- `\d{4,}` matches numbers with four or more digits (max is infinite)
Examples
- `\d{0,}` is the same as `\d*`
- `\d{1,}` is the same as `\d+`
- `/\d{3}-\d{3}-\d{4}/` matches most U.S. phone numbers
- `/A{1,2} bonds/` matches “A bonds” and “AA bonds”, not “AAA bonds”
More Examples
- `\w+\s` (almost) any word with a space after
- `\w{5}\s` exactly 5 word characters followed by a space
- `\w{2,5}\s` a minimum of 2 and up to 5 word characters followed by a space
- `\w{5,}\s` 5 or more word characters followed by a space
- `\w+_\d{2,4}-\d{2}` finds report_1997-04, budget_03-04, but not memo_71239-100
- `\w+_\d+-\d+` finds memo_71239-100 also
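The file name examples above can be verified with a small sketch, assuming Python's `re` module. `re.fullmatch` is used so that a partial match inside a longer string does not count.

```python
import re

names = ["report_1997-04", "budget_03-04", "memo_71239-100"]

# Bounded quantifiers: 2 to 4 digits, a hyphen, then exactly 2 digits.
bounded = [n for n in names if re.fullmatch(r"\w+_\d{2,4}-\d{2}", n)]

# Unbounded quantifiers accept any number of digits on both sides.
unbounded = [n for n in names if re.fullmatch(r"\w+_\d+-\d+", n)]

# {1,2} allows one or two A's, so three A's fail as a whole string.
bonds = [s for s in ["A bonds", "AA bonds", "AAA bonds"]
         if re.fullmatch(r"A{1,2} bonds", s)]

print(bounded)    # ['report_1997-04', 'budget_03-04']
print(unbounded)  # ['report_1997-04', 'budget_03-04', 'memo_71239-100']
print(bonds)      # ['A bonds', 'AA bonds']
```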

Greedy expressions

Example 1
- `01_FY_07_report_99.xls`
- `/\d+\w+\d+/`
- Matches 01_FY_07_report_99
Example 2
- `“Milton”, “Waddams”, “Initech, Inc.”`
- `/“.+”, “.+”/`
- Matches the whole string
Standard repetition quantifiers are greedy
Expression tries to match the longest possible string (the repetition quantified part of the expression)
Defers to achieving overall match
- `/.+\.jpg/` matches “filename.jpg”
- The + is greedy, but “gives back” the “.jpg” to make the match
- Think of it as rewinding or backtracking
Gives back as little as possible to try to make the match
- `/.*[0-9]+/` matches “Page 266”
- The `.*` part matches “Page 26” while the `[0-9]+` part matches only “6”
Regular expression engines are eager. (The engine is eager to return a result. If a path does not work out, it backtracks and tries again to match the last part of the expression.)
Regular expression engines are greedy.
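To see how little a greedy quantifier gives back, the two parts of the “Page 266” pattern can be wrapped in capturing groups. This is a sketch assuming Python's `re` module; the groups are added only for inspection and are not part of the pattern in the notes.

```python
import re

m = re.match(r"(.*)([0-9]+)", "Page 266")
greedy_part, digit_part = m.groups()

# The greedy .+ first swallows the whole name, then gives back ".jpg".
jpg = re.match(r".+\.jpg", "filename.jpg").group()

print(repr(greedy_part))  # 'Page 26'
print(repr(digit_part))   # '6'
print(jpg)                # filename.jpg
```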

Lazy Expressions

Metacharacter	Meaning
?	Make preceding quantifier lazy

Syntax
- `*?`
- `+?`
- `{min,max}?`
- `??`

Instructs quantifier to use a “lazy strategy” for making choices
Greedy strategy
- Match as much as possible before giving control to the next expression part
Lazy strategy
- Match as little as possible before giving control to the next expression part
- Still defers to overall match
- Not necessarily faster or slower
Examples
- `/\w*?\d{3}/`
- `/[A-Za-z-]+?\./`
- `/.{4,8}?_.{4,8}/`
- `/apples??/` (rarely meaningful: the lazy `??` prefers to skip the “s”, and nothing after it forces the “s” to match, so only “apple” is ever matched)
Support
- Not supported in most Unix tools (BRE, ERE)

`/.*?[0-9]+/`
- The lazy `.*?` starts by matching nothing and lets the digit part try to match. If the digit part fails, the `.*?` takes one more character and hands control back to the digit part, and so on.
`/.*?[0-9]*?/` matches nothing at all (an empty string), because an empty match still counts as “success” in the search's logic.

Example 1
- `01_FY_07_report_99.xls`
- `/\d+\w+?\d+/`
- Now it matches “01_FY_07”
Example 2
- `“Milton”, “Waddams”, “Initech, Inc.”`
- `/“.+?”, “.+?”/`
- Now it matches “Milton”, “Waddams”
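Both examples can be run greedy and lazy side by side. This is a sketch assuming Python's `re` module, with straight quotes standing in for the curly quotes in the notes.

```python
import re

filename = "01_FY_07_report_99.xls"
greedy_file = re.search(r"\d+\w+\d+", filename).group()
lazy_file = re.search(r"\d+\w+?\d+", filename).group()

quoted = '"Milton", "Waddams", "Initech, Inc."'
greedy_quote = re.search(r'".+", ".+"', quoted).group()
lazy_quote = re.search(r'".+?", ".+?"', quoted).group()

print(greedy_file)   # 01_FY_07_report_99
print(lazy_file)     # 01_FY_07
print(greedy_quote)  # "Milton", "Waddams", "Initech, Inc."
print(lazy_quote)    # "Milton", "Waddams"
```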

Efficiency When Using Repetition

`/\w+s/`
We picked apples
In the “picked” part of the string, p, i, c, k, e, and d are all word characters, so `\w+` consumes them up to the space. The engine then tests whether the space is an “s”, which fails, and gives back one character at a time looking for an “s”. When that fails too, it starts over at the “i” and goes through the rest of the word again, then again from the “c”, again from the “k”, and so on.
The parser cannot look globally like a human can.
`/\w*s/`
We picked apples.
Starts like above, but because `\w*` can also match zero characters, after giving back every character the engine additionally tests the “p” itself as the “s” before moving on, so there is even more backtracking.
Efficient matching + less backtracking = speedy results
Define the quantity of repeated expressions
- `/.+/` is faster than `/.*/`
- `/.{5}/` and `/.{3,7}/` are even faster
Narrow the scope of the repeated expression
- `/.+/` can become `/[A-Za-z]+/`
Provide clearer starting and ending points
- When looking for something inside `<>` brackets,
- `/<.+>/` looks for `<`, then any character one or more times, then `>`
- `/<[^>]+>/` looks for `<`, then any character that is not a `>` one or more times, then `>`
- Use anchors and word boundaries
Example
- `/\w*s/` would be improved as `/\w+s/`
- `/\w+s/` would be improved as `/[A-Za-z]+s/`
- Perhaps as `/[a-z]+s/` or as `/[A-Z][a-z]+s/` (1 uppercase letter followed by 1 or more lowercase letters followed by s)
- Search for whole words only
  - Spaces, anchors, or word boundaries
  - Scans “picked” but not “icked”, “cked”, “ked”, “ed”, or “d”
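Narrowing the scope of a repeated expression changes what is matched as well as how much backtracking is needed. A sketch of the angle-bracket example, assuming Python's `re` module and a made-up HTML snippet:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Broad: the greedy .+ runs to the end of the line and backtracks to the last ">".
broad = re.findall(r"<.+>", html)

# Narrow: [^>]+ can never run past the first ">", so each tag is found on its own.
narrow = re.findall(r"<[^>]+>", html)

print(broad)   # ['<b>bold</b> and <i>italic</i>']
print(narrow)  # ['<b>', '</b>', '<i>', '</i>']
```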

Grouping and Alternation Expressions

Grouping Metacharacters

Metacharacter	Meaning
(	Start grouped expression
)	End grouped expression

* Group portions of the expression

Apply repetition operator to a group
Makes expressions easier to read
Captures group for use in matching and replacing

Examples
- `/(abc)+/` matches “abc” and “abcabcabc”
- `/(in)?dependent/` matches “independent” and “dependent”
- `/run(s)?/` is the same as `/runs?/` but the former might be easier to understand

Alternation metacharacter

Metacharacter	Meaning
`|`	Match previous or next expression

a.k.a Pipe, OR

`|` is an OR operator
- Either match expression on the left or match expression on the right
- Ordered, leftmost expression gets precedence
- Multiple choices can be daisy-chained, but it stops when first match is found if global search is not specified.
- Group alternation expressions to keep them distinct
Examples
- `/apple|orange/` matches “apple” and “orange”
- `/abc|def|ghi|jkl/` matches “abc”, “def”, “ghi”, and “jkl”
- `/apple(juice|sauce)/` is not the same as `/applejuice|sauce/`
- `/w(ei|ie)rd/` matches “weird” and “wierd”.
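Why grouping matters for alternation can be checked directly. This is a sketch assuming Python's `re` module; `re.fullmatch` is used so the whole string has to match.

```python
import re

# Grouped: "apple" followed by either "juice" or "sauce".
grouped = bool(re.fullmatch(r"apple(juice|sauce)", "applesauce"))

# Ungrouped: either the whole word "applejuice" or the whole word "sauce".
ungrouped = bool(re.fullmatch(r"applejuice|sauce", "applesauce"))
sauce_only = bool(re.fullmatch(r"applejuice|sauce", "sauce"))

print(grouped, ungrouped, sauce_only)  # True False True
```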

Writing Logical and Efficient Alternations

So far we have learned that:

Regular expression engines are eager.
Regular expression engines are greedy.

`/(peanut|peanutbutter)/` matches “peanut” in “peanutbutter”. It is eager to return a result and the leftmost item gets priority.

`/peanut(butter)?/` matches “peanutbutter”: although “butter” is optional, the `?` quantifier is greedy and prefers to match it.

`/(\w+|FY\d{4}_report\.xls)/` This is an alternation: one or more word characters, or “FY”, four digits, then “_report.xls”. Using `FY2003_report.xls` with global matching, it matches “FY2003_report” and “xls” rather than the whole file name, because the engine is eager to return a result and never tries the second alternative.

`/abc|def|ghi|jkl/` using the string “abcdefghijklmnopqrstuvwxyz” matches “abc” with global off. `/xyz|abc|def|ghi|jkl/` using the same string also matches “abc”: every alternative is tried at the first position before the engine moves further along the string, so the earliest match wins over the leftmost alternative.
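The eager and greedy behaviors described in these examples can be reproduced with a short sketch, assuming Python's `re` module. The capturing parentheses around the file name alternation are dropped so that `re.findall` returns whole matches.

```python
import re

# Eager: the leftmost alternative that works is taken.
eager = re.search(r"(peanut|peanutbutter)", "peanutbutter").group()

# Greedy: the optional group is matched when it can be.
greedy = re.search(r"peanut(butter)?", "peanutbutter").group()

# \w+ succeeds first, so the more specific alternative is never tried.
pieces = re.findall(r"\w+|FY\d{4}_report\.xls", "FY2003_report.xls")

# The earliest position in the string beats the order of the alternatives.
alphabet = "abcdefghijklmnopqrstuvwxyz"
earliest = re.search(r"xyz|abc|def|ghi|jkl", alphabet).group()

print(eager)     # peanut
print(greedy)    # peanutbutter
print(pieces)    # ['FY2003_report', 'xls']
print(earliest)  # abc
```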

`/(three|see|thee|tree)/` “I think those are thin trees.” Moves forward and backward checking each character one at a time starting over each time as it tries the four options.

Put simplest (most efficient) expression first
- `/\w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2}/`
- `/export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}/` is more efficient because the more permissive alternations are last.

Repeating and nesting alternations

Repeating
- First matched alternation does not affect the next matches
- `/(AA|BB|CC){6}/` matches “AABBAACCAABB”
Nesting
- Check nesting carefully
- `/(\d{2}([A-Z]{2}|-\d\w\d\w)|\d{4}(-\d{2}-[A-Z]{2,8}|_x[A-F]))/`
- Trade-off between precision, readability, and efficiency.

`/(\d\d|[A-Z][A-Z]){3}/` matches “112233”, “AABBCC”, “AA66ZZ”, and “11AA44”
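A sketch of the repeated alternation, assuming Python's `re` module. The extra sample “1A2B3C” is added here as a counterexample, and the last line shows that a repeated capturing group keeps only its final repetition.

```python
import re

pattern = re.compile(r"(\d\d|[A-Z][A-Z]){3}")
samples = ["112233", "AABBCC", "AA66ZZ", "11AA44", "1A2B3C"]

# Each repetition chooses its alternative independently of the others.
accepted = [s for s in samples if pattern.fullmatch(s)]

# Only the last repetition of the group is kept as the capture.
last_capture = pattern.fullmatch("AA66ZZ").group(1)

print(accepted)      # ['112233', 'AABBCC', 'AA66ZZ', '11AA44']
print(last_capture)  # ZZ
```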

milk
apple juice
sweet peas
yogurt (no match)
sweet corn
apple sauce
milkshake
sweet potatoes

`/[\w ]+/` matches all words.

Additional notes from steve

`?>` is a non-backtracking group: http://stackoverflow.com/questions/15413594/what-does-mean-in-a-pcre-regex

Anchored Expressions

Start and end anchors

Metacharacter	Meaning
`^`	Start of string/line. Note that `^` has a dual meaning: it also negates a character set when it is the first character in the set, as in `[^abcd]`.
`$`	End of string/line
\A	Start of string, never start of line
\Z	End of string, never end of line

* Reference a position, not an actual character

Zero-width
Examples
`/^apple/` or `/\Aapple/` matches “apple” at the beginning
`/apple$/` or `/apple\Z/` matches “apple” at the end
`/^apple$/` or `/\Aapple\Z/` matches “apple” alone on a line
Support
^ and $ are supported in all regex engines
`\A` and `\Z` are supported in Java, .NET, Perl, PHP, Python, Ruby

Line Breaks and Multiline Mode

Single-line mode
- ^ and $ do not match at line breaks
- `\A` and `\Z` do not match at line breaks
- Many Unix tools support only single line
Multiline mode
- ^ and $ will match at the start and end of line
- `\A` and `\Z` do not match at line breaks
- Languages usually offer a multiline option
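  - Java: `Pattern.compile("^regex$", Pattern.MULTILINE)`
  - JavaScript: `/^regex$/m`
  - .NET: `Regex.Match("string", "^regex$", RegexOptions.Multiline)`
  - Perl: `m/^regex$/m`
  - PHP: `preg_match("/^regex$/m", "string")`
  - Python: `re.search("^regex$", "string", re.MULTILINE)`
  - Ruby: `string.match(/^regex$/m)` (note: in Ruby, `^` and `$` always match at line breaks; its `/m` option makes the dot match newlines)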
Examples
- ^[a-z ]+ With multiline disabled, it only matches the item in the first line. With multiline enabled, it matches each line separately.
- [a-z ]+$ With multiline disabled, it only matches the item in the last line. With multiline enabled, it matches each line separately.
- ^[a-z ]+$ works similarly.
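The effect of multiline mode can be reproduced with a three-line sample. This is a sketch assuming Python's `re` module and its `re.MULTILINE` flag; the text is borrowed from the grocery list earlier in these notes.

```python
import re

text = "milk\napple juice\nsweet peas"

# Multiline off: ^ matches only at the start of the string, $ only at the end.
first_only = re.findall(r"^[a-z ]+", text)
last_only = re.findall(r"[a-z ]+$", text)

# Multiline on: ^ and $ also match at every line break.
every_line = re.findall(r"^[a-z ]+$", text, re.MULTILINE)

# \A ignores multiline mode and still matches only at the start of the string.
start_only = re.findall(r"\A[a-z ]+", text, re.MULTILINE)

print(first_only)  # ['milk']
print(last_only)   # ['sweet peas']
print(every_line)  # ['milk', 'apple juice', 'sweet peas']
print(start_only)  # ['milk']
```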

Word Boundaries

Metacharacter	Meaning
`\b`	Word boundary (start/end of word)
`\B`	Not a word boundary

* Reference a position, not an actual character

Conditions for matching
- Before the first word character in the string
- After the last word character in the string
- Between a word character and a non-word character
Word characters: [A-Za-z0-9_]
Support
- Most regex engines, not in the early Unix tools (BREs)
  - egrep
  - but not grep
Boundary examples
- `/\b\w+\b/` finds four matches in “This is a test.”
- `/\b\w+\b/` matches all of “abc_123” but only part of “top-notch”
Not a boundary examples
- `/\BThis/` does not match “This is a test.” because the position before the leading “T” at the start of the string counts as a word boundary.
- `/\B\w+\B/` finds two matches in “This is a test.” (“hi” and “es”)
Examples
- `/\b\w+\b/` finds most words, but not “summer's”, which has word boundaries on both sides of the apostrophe, so “summer” and “s” match separately.
- `/\b[\w']+\b/` adding the apostrophe to the set matches the whole word because the quantifier is greedy.
- `/\b[\w']+?\b/` making the quantifier lazy causes “summer”, the apostrophe, and “s” to match separately.
- `/\w+s/` matches “apples” in “We picked apples.”
- `/\b\w+s\b/` puts boundaries around the word, which makes the expression more efficient because it looks for whole words only. It no longer checks “icked”, “cked”, “ked”, “ed”, and “d”, because a match can start only at a word boundary.
Caution
- A space is not a word boundary
- Word boundaries reference a position
  - Not an actual character
  - Zero-length
Examples
- String: “apples and oranges”
- No Match: `/apples\band\boranges/`
- Match: `/apples\b \band\b \boranges/` There are boundaries around the words and spaces between the words causing a boundary between the space (which is a non-word character) and the next word.
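The boundary examples in this section can be confirmed with a short sketch, assuming Python's `re` module.

```python
import re

sentence = "This is a test."
words = re.findall(r"\b\w+\b", sentence)
inner = re.findall(r"\B\w+\B", sentence)

# The apostrophe is not a word character, so it splits the word in two.
split_word = re.findall(r"\b\w+\b", "summer's")
whole_word = re.findall(r"\b[\w']+\b", "summer's")

# \b is zero-width: it does not consume the space between the words.
phrase = "apples and oranges"
without_spaces = re.search(r"apples\band\boranges", phrase)
with_spaces = re.search(r"apples\b \band\b \boranges", phrase)

print(words)       # ['This', 'is', 'a', 'test']
print(inner)       # ['hi', 'es']
print(split_word)  # ['summer', 's']
print(whole_word)  # ["summer's"]
print(without_spaces is None, with_spaces is not None)  # True True
```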

Capturing Groups and Backreferences

Backreferences

Grouped expressions are captured
- Stores the matched portion in parentheses
  - `/a(p{2}l)e/` matches “apple” and stores “ppl”
  - Stores the data matched, not the expression
- Automatically, by default
Backreferences allow access to captured data
- Refer to first backreference with \1

Metacharacter	Meaning
`\1 through \9`	Backreference for positions 1 to 9
`\10 through \99`	Backreference for positions 10 to 99

Usage
- Can be used in the same expression as the group
- Can be accessed after the match is complete (this requires a programming language that exposes the captures as variables)
- Cannot be used inside character classes
Support
- Most regex engines support `\1` through `\9`
- Some regex engines support `\10` through `\99`
- Some regex engines use $1 through $9 instead
Examples
- `/(apples) to \1/` matches “apples to apples”
- `/(ab)(cd)(ef)\3\2\1/` matches “abcdefefcdab”
- `/<(i|em|b|strong)>.+?<\/\1>/` matches `<i>Hello</i>` and `<em>Hello</em>`
 - finds “i”, “em”, “b”, or “strong”, then any text inside the tag; the lazy `.+?` keeps it from skipping ahead to a later closing tag
 - Does not match `<i>Hello</em>` or `<em>Hello</i>`
- `/\b[A-Z][a-z]+/`
- `\b([A-Z][a-z]+)\b\s\b\1(s|d)on\b`
 - matches “John Johnson” and “Evan Evanson”
 - in “Steve Smith, John Johnson, Eric Erikson, Evan Evanson”
- `/\b(\w+)\s+\1\b/`
 - matches “the the” in

Paris in the
the spring
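A sketch of the backreference examples above, assuming Python's `re` module, where `\1` inside the pattern refers back to whatever text the first group actually matched.

```python
import re

same_fruit = re.search(r"(apples) to \1", "apples to apples").group()

names = "Steve Smith, John Johnson, Eric Erikson, Evan Evanson"
name_pattern = r"\b([A-Z][a-z]+)\b\s\b\1(s|d)on\b"
patronymics = [m.group() for m in re.finditer(name_pattern, names)]

# \s+ also matches the line break, so the doubled word is found across lines.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the\nthe spring").group()

print(same_fruit)     # apples to apples
print(patronymics)    # ['John Johnson', 'Evan Evanson']
print(repr(doubled))  # 'the\nthe'
```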

Backreferences to optional expressions

Optional elements
- `/A?B/` matches “AB” and “B”
Captures occur on zero-width matches
- `/(A?)B/` matches “AB” and captures “A”
- `/(A?)B/` matches “B” and captures “”
Backreferences become zero-width too
- `/(A?)B\1/` matches “ABA” and “B”
- `/(A?)B\1C/` matches “ABAC” and “BC”
Captures do not always occur on optional groups
- `/(A)?B/` matches “AB” and captures “A”
- `/(A)?B/` matches “B” and does not capture anything
Backreference is to a group that failed to match
- `/(A)?B\1/` matches “ABA” but not “B”
- Except in JavaScript
Element is optional, group/capture is not optional
- `/(A?)B/` matches “B” and captures “”
Element is not optional, group/capture is optional
- `/(A)?B/` matches “B” and does not capture anything
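The difference between an optional element and an optional group shows up in what the capture holds. This sketch assumes Python's `re` module, which follows the non-JavaScript behavior described above: a backreference to a group that did not take part in the match makes the match fail.

```python
import re

# Optional element inside the group: the group always takes part.
inside = re.fullmatch(r"(A?)B", "B").group(1)         # empty string
inside_backref = re.fullmatch(r"(A?)B\1", "B")         # matches

# Optional group: the group may not take part at all.
outside = re.fullmatch(r"(A)?B", "B").group(1)         # None
outside_backref = re.fullmatch(r"(A)?B\1", "B")        # no match
outside_backref_aba = re.fullmatch(r"(A)?B\1", "ABA")  # matches

print(repr(inside), inside_backref is not None)      # '' True
print(outside, outside_backref is None)              # None True
print(outside_backref_aba is not None)               # True
```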

Finding and replacing using backreferences

TextMate works for this, but not the javascript tester

Create a regular expression that matches target data
Test regular expression and revise as needed
- Use anchors and specificity to narrow scope
Add capturing groups
- Capture anything that varies row-to-row
Write the replacement string
- Use all captures
- Add back anything not captured but still needed, like commas and spaces
- May need to use $1 instead of \1

U.S. Presidents example

Using TextMate
Find: `^(\d{1,2}),([\w .]+?) ([\w ]+?),(\d{4})`
Replace `$1,$3,$2,$4`
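The same find and replace can be sketched with Python's `re.sub`, which uses `\1` rather than `$1` in the replacement string. The two sample rows are an assumption about the data layout (number, full name, first year in office); the notes do not show the original file.

```python
import re

# Hypothetical sample rows in an assumed "number,First Last,year" layout.
rows = "1,George Washington,1789\n16,Abraham Lincoln,1861"

pattern = r"^(\d{1,2}),([\w .]+?) ([\w ]+?),(\d{4})"

# Swap first and last name; the commas are added back in the replacement.
swapped = re.sub(pattern, r"\1,\3,\2,\4", rows, flags=re.MULTILINE)

print(swapped)
# 1,Washington,George,1789
# 16,Lincoln,Abraham,1861
```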

Non-capturing group expressions

Third use of a ? mark. First was an optional character, the second was to signify non-greedy.

Metacharacter	Meaning
`?:`	Specify a non-capturing group

Syntax
- `/(\w+)/` becomes `/(?:\w+)/`
Turns off capture and backreferences
- Optimize for speed
- Preserve space for more captures
Support
- Most regex engines except Unix tools
`/(?:regex)/`
- ? = “Give this group a different meaning”
- : = “The meaning is non-capturing”
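A sketch of what turning off capture changes, assuming Python's `re` module: the non-capturing group still groups, but it is skipped when captures are numbered.

```python
import re

text = "hello world"

# Two capturing groups: both words are stored.
both = re.search(r"(\w+) (\w+)", text).groups()

# The first group is non-capturing, so the second word becomes group 1.
second_only = re.search(r"(?:\w+) (\w+)", text).groups()

print(both)         # ('hello', 'world')
print(second_only)  # ('world',)
```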
