Regular Expression Tutorial
https://www.lynda.com/Regular-Expressions-tutorials
1
- There are different types of RegEx engines that will behave slightly different than others.
2
-
Literal characters
-
Regular expressions are eager - important to keep in mind. It wants to return a match as fast as possible. Meta characters
-
Characters with special meaning:
\ . * + - {} [] ^ $ | ? () : ! =- A wildcard match:
.- Escape the next character:
\Escaping characters like this is generally used inside programming languages. Take note of the slash period.
/\9.00/ matches "9.00" but not "9500" or "9-00"\/home\/user\/document\/text\.txt ``` -
Other special characters:
- Tabs:
\t- Line returns:
\r \n \r\n- Non-printable characters. Not commonly used. Bell, escape, form feed, vertical tab:
\a \e \f \v- ASCII or ANSI Codes
3
-
Defining a character set
-
Match any one of several characters but only one character
[] -
This does not match "great".
/gr[ea]t/ -
Character ranges
- Match zero to nine:
[0-9] -
Negative character sets
- Do not match character:
^- Matches any one non-vowel:
/[^aeiou]/- Matches "seek" and "sees" but not "seem" or "seen" or "see". "See " is a match so be aware
/see[^mn]/ -
Metacharacters inside character sets
- Metacharacters inside character sets are already escaped with the exception of
[ - ^
- Metacharacters inside character sets are already escaped with the exception of
-
Shorthand character sets
- Digit:
\d or [0-9]- Word character:
\w or [a-zA-Z0-9_]- Whitespace:
\s or [ \t\r\n]- Not digit:
\D or [^0-9]- Not word character:
\W or [^a-zA-Z0-9_]- Not whitespace:
\S or [^ \t\r\n]
4
-
Repetition metacharacters
- Preceding item zero or more times:
*- Preceding item one or more times:
+- Preceding item zero or more times:
?- Repetition example 1:
/apples*/ matches "apple", "apples", and "applessssss"- Repetition example 2:
/apples+/ matches "apples" and "applessssss"- Repetition example 3:
/apples?/ matches "apple" and "apples" -
Quantified repetition
minmust be included,maxis optional:
{min,max}- Matches numbers with 4 to 8 digits:
\d{4,8}- Matches numbers with exactly 4 digits:
\d{4}- Matches numbers with 4 or more digits:
\d{4,} -
Greedy expressions
-
Regular expressions are greedy - important to keep in mind.
-
Greedy strategy - match as much as possible before giving control to the next expression part
-
-
Lazy expressions
*?, +?, {min,max}?, ??- Lazy strategy - match as little as possible before giving control to the next expression part
/\w*?\d{3}//[A-Za-z-]+?.//.{4-8}?_.{4,8}/ -
Using repetition efficiently
-
Efficient matching + less backtracking = speedy results
-
Define the quantity of repeated expressions
-
/.+/is faster than/.*/ -
/.{5}/and/.{3,7}/are even faster
-
-
-
Narrow the scope of the repeated expression
/.+/can become/[A-Za-z]+/
-
Provide clearer starting and ending points
/<.+>/can become/<[^>]+>/
5: Grouping and Alternation
-
Grouping metacharacters
- Grouping
( )-
Apply repetition operators to a group
-
Makes expressions easier to read
-
Captures group for use in matching and replacing
-
Cannot be used inside character set
-
/(abc)+/matches "abc" and "abcabcabc" -
/(in)?dependent/matches "independent" and "dependent"
-
Alternation metacharacter
- OR operator:
| --
Match expression on left or right side
-
Ordered, leftmost expression gets precedence. If first expression matches, it will stop searching.
-
Multiple choices can be daisy-chained
-
/apple|orange/matches "apple" and "orange" -
/abc|def|ghi|jkl/matches "abc", "def", "ghi", and "jkl" -
/w(ei|ie)rd/matches "weird" and "wierd"
-
Writing local and efficient alternations
-
peanut(butter)?matches "peanutbutter" and not peanut first because (butter) is greedy. -
(\w+|FY\d{4}_report.xls)matches "FY2003_report.xls" using the first expressionw+because it's greedy. It doesn't understand the second expression even though it's more precise -
/xyz|abc|def|ghi/matches "abc" first and not "xyz" because it doesn't scan the entire string. It will scan the first few characters first and see if it matches the first alternation -
Put simplest (most efficient) expression first
/\w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2}/(inefficent,backtrack) vs/export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}/(efficient,simple)
-
-
Repeating and nesting alternations
-
Nesting means when you have an expression that looks like
(apple(juice|sauce))and not(applejuice|applesauce) -
Repeating alternations: first matched alternation does not affect the next matches meaning.
(AA|BB|CC){6}matches "AABBCCAABBCC"
-
6: Anchored Expressions
-
Start and end anchors
-
Reference a position, not an actual character
-
Start of string/line:
^- End of string/line:
$- Start of string, never end of line (not supported by some programming languages):
\A- End of string, never end of line (not supported by some programming languages):
\Z/^apple/ or /\Aapple/ /apple$/ or /apple\Z/ /^apple$/ or /\Aapple\Z/- If you have
^and$in a single expression, it will have to match the whole string
-
-
Line breaks and multiline mode
-
/[a-z]+/matches the first line on the list but does the next lines after it. This is due to being in single line mode- To match multiple lines, you need to add
/mto end of line. Just like adding a global (/g) switch.
- To match multiple lines, you need to add
-
-
Word boundaries
-
Word boundaries references a position, not a character.
-
Word boundary (start/end word boundary):
\b- Not a word boundary:
\B-
Watch out, it doesn't match certain characters such as
.or-or spaces unless specified -
/\b/w+s\b/matches "apples" - This expression is efficient because you are exactly specifying the boundaries. -
/apples\b\band\b\boranges/matches "apples and oranges"
-
7: Capturing groups and Backreferences
-
Backreferences
-
Allow access to captured data
-
How it works:
/a(p{2}l)e/matches "apple" and the regex engine will store "ppl" automatically. -
\1through\9 -
/(apple)to\1/matches "apples to apples" -
/<(i|em)>.+?</\1>/matches Hello and Hello
-
-
Non-capturing group expressions
- Turns off capture and backreferences to optimize for speed and preserve space for captures
?:
8: Lookaround Assertions
-
Positive lookahead assertions
- Positive lookahead assertion
?=-
/(?=seashore)sea/matches "sea" in "seashore" but not "seaside" -
/\b[A-Za-z']+\b(?=,)matches words with commas after it but does not capture the comma -
This is a good way to sift through large documents for specific words with symbols before or after it
-
Double-testing with lookahead assertions
(?=^[0-5-]+$)(?=.*4321)\d{3}-\d{3}-\d{4}- matches "555-302-4321" and not "555-245-1312". Pay attention how many times it goes through and check the line
-
Negative lookahead assertions
- Negative lookahead assertion:
?!-
Used to rule out cases you specific to not search for
-
/(?!seashore)sea/matches "sea" in "seaside" but not "seashore" -
/online (?!.*training)/does not match "online video training"
-
Lookbehind assertions
- Lookbehind positive assertion:
?<=- Lookbehind negative assertion:
?<!-
/(?<=base)ball/matches the "ball" in "baseball" but not "football" -
/(?!=base)ball/matches the "ball" in "football" but not "baseball" -
Alternation only with fixed-length items
-
Allowed:
(?<=cat|dog|rat) -
Not allowed:
(?<=apple|banana|plum)
-
-
The power of positions
-
Zero-width means zero position movement
-
/(?<![$\d])(?=\d+\.\d\d)/- doesn't match either "54.00" or "$54.00". It moves the cursor infront of the 5 instead, no insertion of characters or anything, just the cursor. This is an ideal situation where you don't need to do a text replacement but instead text insertion like commas between large numbers 123,123,123.00 rather than 123123123.00
-
9: Unicode and Multibyte characters
- Briefly glanced at it, not interested as it doesn't apply to me yet. Never had an experience with unicode issues yet.