Regular Expression Tutorial
https://www.lynda.com/Regular-Expressions-tutorials
1
- There are different types of RegEx engines that will behave slightly different than others.
2
-
Literal characters
-
Regular expressions are eager - important to keep in mind. It wants to return a match as fast as possible. Meta characters
-
Characters with special meaning:
\ . * + - {} [] ^ $ | ? () : ! =
- A wildcard match:
.
- Escape the next character:
\
Escaping characters like this is generally used inside programming languages. Take note of the slash period.
/\9.00/ matches "9.00" but not "9500" or "9-00"
\/home\/user\/document\/text\.txt ```
-
Other special characters:
- Tabs:
\t
- Line returns:
\r \n \r\n
- Non-printable characters. Not commonly used. Bell, escape, form feed, vertical tab:
\a \e \f \v
- ASCII or ANSI Codes
3
-
Defining a character set
-
Match any one of several characters but only one character
[]
-
This does not match "great".
/gr[ea]t/
-
Character ranges
- Match zero to nine:
[0-9]
-
Negative character sets
- Do not match character:
^
- Matches any one non-vowel:
/[^aeiou]/
- Matches "seek" and "sees" but not "seem" or "seen" or "see". "See " is a match so be aware
/see[^mn]/
-
Metacharacters inside character sets
- Metacharacters inside character sets are already escaped with the exception of
[ - ^
- Metacharacters inside character sets are already escaped with the exception of
-
Shorthand character sets
- Digit:
\d or [0-9]
- Word character:
\w or [a-zA-Z0-9_]
- Whitespace:
\s or [ \t\r\n]
- Not digit:
\D or [^0-9]
- Not word character:
\W or [^a-zA-Z0-9_]
- Not whitespace:
\S or [^ \t\r\n]
4
-
Repetition metacharacters
- Preceding item zero or more times:
*
- Preceding item one or more times:
+
- Preceding item zero or more times:
?
- Repetition example 1:
/apples*/ matches "apple", "apples", and "applessssss"
- Repetition example 2:
/apples+/ matches "apples" and "applessssss"
- Repetition example 3:
/apples?/ matches "apple" and "apples"
-
Quantified repetition
min
must be included,max
is optional:
{min,max}
- Matches numbers with 4 to 8 digits:
\d{4,8}
- Matches numbers with exactly 4 digits:
\d{4}
- Matches numbers with 4 or more digits:
\d{4,}
-
Greedy expressions
-
Regular expressions are greedy - important to keep in mind.
-
Greedy strategy - match as much as possible before giving control to the next expression part
-
-
Lazy expressions
*?, +?, {min,max}?, ??
- Lazy strategy - match as little as possible before giving control to the next expression part
/\w*?\d{3}/
/[A-Za-z-]+?./
/.{4-8}?_.{4,8}/
-
Using repetition efficiently
-
Efficient matching + less backtracking = speedy results
-
Define the quantity of repeated expressions
-
/.+/
is faster than/.*/
-
/.{5}/
and/.{3,7}/
are even faster
-
-
-
Narrow the scope of the repeated expression
/.+/
can become/[A-Za-z]+/
-
Provide clearer starting and ending points
/<.+>/
can become/<[^>]+>/
5: Grouping and Alternation
-
Grouping metacharacters
- Grouping
( )
-
Apply repetition operators to a group
-
Makes expressions easier to read
-
Captures group for use in matching and replacing
-
Cannot be used inside character set
-
/(abc)+/
matches "abc" and "abcabcabc" -
/(in)?dependent/
matches "independent" and "dependent"
-
Alternation metacharacter
- OR operator:
| -
-
Match expression on left or right side
-
Ordered, leftmost expression gets precedence. If first expression matches, it will stop searching.
-
Multiple choices can be daisy-chained
-
/apple|orange/
matches "apple" and "orange" -
/abc|def|ghi|jkl/
matches "abc", "def", "ghi", and "jkl" -
/w(ei|ie)rd/
matches "weird" and "wierd"
-
Writing local and efficient alternations
-
peanut(butter)?
matches "peanutbutter" and not peanut first because (butter) is greedy. -
(\w+|FY\d{4}_report.xls)
matches "FY2003_report.xls" using the first expressionw+
because it's greedy. It doesn't understand the second expression even though it's more precise -
/xyz|abc|def|ghi/
matches "abc" first and not "xyz" because it doesn't scan the entire string. It will scan the first few characters first and see if it matches the first alternation -
Put simplest (most efficient) expression first
/\w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2}/
(inefficent,backtrack) vs/export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}/
(efficient,simple)
-
-
Repeating and nesting alternations
-
Nesting means when you have an expression that looks like
(apple(juice|sauce))
and not(applejuice|applesauce)
-
Repeating alternations: first matched alternation does not affect the next matches meaning.
(AA|BB|CC){6}
matches "AABBCCAABBCC"
-
6: Anchored Expressions
-
Start and end anchors
-
Reference a position, not an actual character
-
Start of string/line:
^
- End of string/line:
$
- Start of string, never end of line (not supported by some programming languages):
\A
- End of string, never end of line (not supported by some programming languages):
\Z
/^apple/ or /\Aapple/ /apple$/ or /apple\Z/ /^apple$/ or /\Aapple\Z/
- If you have
^
and$
in a single expression, it will have to match the whole string
-
-
Line breaks and multiline mode
-
/[a-z]+/
matches the first line on the list but does the next lines after it. This is due to being in single line mode- To match multiple lines, you need to add
/m
to end of line. Just like adding a global (/g) switch.
- To match multiple lines, you need to add
-
-
Word boundaries
-
Word boundaries references a position, not a character.
-
Word boundary (start/end word boundary):
\b
- Not a word boundary:
\B
-
Watch out, it doesn't match certain characters such as
.
or-
or spaces unless specified -
/\b/w+s\b/
matches "apples" - This expression is efficient because you are exactly specifying the boundaries. -
/apples\b
\band\b
\boranges/
matches "apples and oranges"
-
7: Capturing groups and Backreferences
-
Backreferences
-
Allow access to captured data
-
How it works:
/a(p{2}l)e/
matches "apple" and the regex engine will store "ppl" automatically. -
\1
through\9
-
/(apple)
to\1/
matches "apples to apples" -
/<(i|em)>.+?</\1>/
matches Hello and Hello
-
-
Non-capturing group expressions
- Turns off capture and backreferences to optimize for speed and preserve space for captures
?:
8: Lookaround Assertions
-
Positive lookahead assertions
- Positive lookahead assertion
?=
-
/(?=seashore)sea/
matches "sea" in "seashore" but not "seaside" -
/\b[A-Za-z']+\b(?=,)
matches words with commas after it but does not capture the comma -
This is a good way to sift through large documents for specific words with symbols before or after it
-
Double-testing with lookahead assertions
(?=^[0-5-]+$)(?=.*4321)\d{3}-\d{3}-\d{4}
- matches "555-302-4321" and not "555-245-1312". Pay attention how many times it goes through and check the line
-
Negative lookahead assertions
- Negative lookahead assertion:
?!
-
Used to rule out cases you specific to not search for
-
/(?!seashore)sea/
matches "sea" in "seaside" but not "seashore" -
/online (?!.*training)/
does not match "online video training"
-
Lookbehind assertions
- Lookbehind positive assertion:
?<=
- Lookbehind negative assertion:
?<!
-
/(?<=base)ball/
matches the "ball" in "baseball" but not "football" -
/(?!=base)ball/
matches the "ball" in "football" but not "baseball" -
Alternation only with fixed-length items
-
Allowed:
(?<=cat|dog|rat)
-
Not allowed:
(?<=apple|banana|plum)
-
-
The power of positions
-
Zero-width means zero position movement
-
/(?<![$\d])(?=\d+\.\d\d)/
- doesn't match either "54.00" or "$54.00". It moves the cursor infront of the 5 instead, no insertion of characters or anything, just the cursor. This is an ideal situation where you don't need to do a text replacement but instead text insertion like commas between large numbers 123,123,123.00 rather than 123123123.00
-
9: Unicode and Multibyte characters
- Briefly glanced at it, not interested as it doesn't apply to me yet. Never had an experience with unicode issues yet.