John Tuyen

Adventures of Functional Mediocrity

01 Mar 19

Regular Expression Tutorial

https://www.lynda.com/Regular-Expressions-tutorials

1

  • There are different types of RegEx engines that will behave slightly different than others.

2

  • Literal characters

  • Regular expressions are eager - important to keep in mind. It wants to return a match as fast as possible. Meta characters

  • Characters with special meaning:

    \ . * + - {} [] ^ $ | ? () : ! =
    
    • A wildcard match:
    .
    
    • Escape the next character:
    \
    

    Escaping characters like this is generally used inside programming languages. Take note of the slash period.

    /\9.00/ matches "9.00" but not "9500" or "9-00"
    
    \/home\/user\/document\/text\.txt
    ```	
    
    
  • Other special characters:

    • Tabs:
    \t
    
    • Line returns:
    \r \n \r\n
    
    • Non-printable characters. Not commonly used. Bell, escape, form feed, vertical tab:
    \a \e \f \v
    
    • ASCII or ANSI Codes

3

  • Defining a character set

  • Match any one of several characters but only one character

    []
    
  • This does not match "great".

    /gr[ea]t/
    
  • Character ranges

    • Match zero to nine:
    [0-9]
    
  • Negative character sets

    • Do not match character:
    ^
    
    • Matches any one non-vowel:
    /[^aeiou]/
    
    • Matches "seek" and "sees" but not "seem" or "seen" or "see". "See " is a match so be aware
    /see[^mn]/ 
    
  • Metacharacters inside character sets

    • Metacharacters inside character sets are already escaped with the exception of [ - ^
  • Shorthand character sets

    • Digit:
    \d or [0-9]
    
    • Word character:
    \w or [a-zA-Z0-9_]
    
    • Whitespace:
    \s or [ \t\r\n]
    
    • Not digit:
    \D or [^0-9]
    
    • Not word character:
    \W or [^a-zA-Z0-9_]
    
    • Not whitespace:
    \S or [^ \t\r\n]
    

4

  • Repetition metacharacters

    • Preceding item zero or more times:
    *
    
    • Preceding item one or more times:
    +
    
    • Preceding item zero or more times:
    ?
    
    • Repetition example 1:
    /apples*/ matches "apple", "apples", and "applessssss"
    
    • Repetition example 2:
    /apples+/ matches "apples" and "applessssss"
    
    • Repetition example 3:
    /apples?/ matches "apple" and "apples"
    
  • Quantified repetition

    • min must be included, max is optional:
    {min,max}
    
    • Matches numbers with 4 to 8 digits:
    \d{4,8}
    
    • Matches numbers with exactly 4 digits:
    \d{4}
    
    • Matches numbers with 4 or more digits:
    \d{4,}
    
  • Greedy expressions

    • Regular expressions are greedy - important to keep in mind.

    • Greedy strategy - match as much as possible before giving control to the next expression part

  • Lazy expressions

    *?, +?, {min,max}?, ??
    
    • Lazy strategy - match as little as possible before giving control to the next expression part
    /\w*?\d{3}/
    
    /[A-Za-z-]+?./
    
    /.{4-8}?_.{4,8}/
    
  • Using repetition efficiently

    • Efficient matching + less backtracking = speedy results

    • Define the quantity of repeated expressions

      • /.+/ is faster than /.*/

      • /.{5}/ and /.{3,7}/ are even faster

  • Narrow the scope of the repeated expression

    • /.+/ can become /[A-Za-z]+/
  • Provide clearer starting and ending points

    • /<.+>/ can become /<[^>]+>/

5: Grouping and Alternation

  • Grouping metacharacters

    • Grouping
    ( )
    
    • Apply repetition operators to a group

    • Makes expressions easier to read

    • Captures group for use in matching and replacing

    • Cannot be used inside character set

    • /(abc)+/ matches "abc" and "abcabcabc"

    • /(in)?dependent/ matches "independent" and "dependent"

  • Alternation metacharacter

    • OR operator:
    | -
    
    • Match expression on left or right side

    • Ordered, leftmost expression gets precedence. If first expression matches, it will stop searching.

    • Multiple choices can be daisy-chained

    • /apple|orange/ matches "apple" and "orange"

    • /abc|def|ghi|jkl/ matches "abc", "def", "ghi", and "jkl"

    • /w(ei|ie)rd/ matches "weird" and "wierd"

  • Writing local and efficient alternations

    • peanut(butter)? matches "peanutbutter" and not peanut first because (butter) is greedy.

    • (\w+|FY\d{4}_report.xls) matches "FY2003_report.xls" using the first expression w+ because it's greedy. It doesn't understand the second expression even though it's more precise

    • /xyz|abc|def|ghi/ matches "abc" first and not "xyz" because it doesn't scan the entire string. It will scan the first few characters first and see if it matches the first alternation

    • Put simplest (most efficient) expression first

      • /\w+_\d{2,4}|\d{4}_\d{2}_\w+|export\d{2}/ (inefficent,backtrack) vs /export\d{2}|\d{4}_\d{2}_\w+|\w+_\d{2,4}/ (efficient,simple)
  • Repeating and nesting alternations

    • Nesting means when you have an expression that looks like (apple(juice|sauce)) and not (applejuice|applesauce)

    • Repeating alternations: first matched alternation does not affect the next matches meaning.

      • (AA|BB|CC){6} matches "AABBCCAABBCC"

6: Anchored Expressions

  • Start and end anchors

    • Reference a position, not an actual character

    • Start of string/line:

    ^
    
    • End of string/line:
    $
    
    • Start of string, never end of line (not supported by some programming languages):
    \A
    
    • End of string, never end of line (not supported by some programming languages):
    \Z
    
    /^apple/ or /\Aapple/
    
    /apple$/ or /apple\Z/
    
    /^apple$/ or /\Aapple\Z/
    
    • If you have ^ and $ in a single expression, it will have to match the whole string
  • Line breaks and multiline mode

    • /[a-z]+/ matches the first line on the list but does the next lines after it. This is due to being in single line mode

      • To match multiple lines, you need to add /m to end of line. Just like adding a global (/g) switch.
  • Word boundaries

    • Word boundaries references a position, not a character.

    • Word boundary (start/end word boundary):

    \b
    
    • Not a word boundary:
    \B
    
    • Watch out, it doesn't match certain characters such as . or - or spaces unless specified

    • /\b/w+s\b/ matches "apples" - This expression is efficient because you are exactly specifying the boundaries.

    • /apples\b \band\b \boranges/ matches "apples and oranges"

7: Capturing groups and Backreferences

  • Backreferences

    • Allow access to captured data

    • How it works: /a(p{2}l)e/ matches "apple" and the regex engine will store "ppl" automatically.

    • \1 through \9

    • /(apple) to \1/ matches "apples to apples"

    • /<(i|em)>.+?</\1>/ matches Hello and Hello

  • Non-capturing group expressions

    • Turns off capture and backreferences to optimize for speed and preserve space for captures
    ?:
    

8: Lookaround Assertions

  • Positive lookahead assertions

    • Positive lookahead assertion
    ?=
    
    • /(?=seashore)sea/ matches "sea" in "seashore" but not "seaside"

    • /\b[A-Za-z']+\b(?=,) matches words with commas after it but does not capture the comma

    • This is a good way to sift through large documents for specific words with symbols before or after it

  • Double-testing with lookahead assertions

    • (?=^[0-5-]+$)(?=.*4321)\d{3}-\d{3}-\d{4} - matches "555-302-4321" and not "555-245-1312". Pay attention how many times it goes through and check the line
  • Negative lookahead assertions

    • Negative lookahead assertion:
    ?!
    
    • Used to rule out cases you specific to not search for

    • /(?!seashore)sea/ matches "sea" in "seaside" but not "seashore"

    • /online (?!.*training)/ does not match "online video training"

  • Lookbehind assertions

    • Lookbehind positive assertion:
    ?<=
    
    • Lookbehind negative assertion:
    ?<!
    
    • /(?<=base)ball/ matches the "ball" in "baseball" but not "football"

    • /(?!=base)ball/ matches the "ball" in "football" but not "baseball"

    • Alternation only with fixed-length items

      • Allowed: (?<=cat|dog|rat)

      • Not allowed: (?<=apple|banana|plum)

  • The power of positions

    • Zero-width means zero position movement

    • /(?<![$\d])(?=\d+\.\d\d)/ - doesn't match either "54.00" or "$54.00". It moves the cursor infront of the 5 instead, no insertion of characters or anything, just the cursor. This is an ideal situation where you don't need to do a text replacement but instead text insertion like commas between large numbers 123,123,123.00 rather than 123123123.00

9: Unicode and Multibyte characters

  • Briefly glanced at it, not interested as it doesn't apply to me yet. Never had an experience with unicode issues yet.