Tuesday, November 1, 2016

Regular Expression




A regular expression is a sequence of characters that defines a search pattern.
It is usually written between forward slashes , for example /abc/


There are four matching modes :
a) Global match (g)
b) Case insensitive (i)
c) Multiline anchors (m)
d) Dot matches all (s)

Literal character matching :
non-global matching : only the earliest (leftmost) match is returned.
global matching : every match in the text is returned.
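
A small Python sketch of the modes and of global vs non-global matching. Python's re module has no g flag, so this assumes re.search stands in for the non-global case and re.findall for the global one; the sample strings and variable names are made up for illustration.

```python
import re

text = "car, cart, scar"

# Non-global: re.search stops at the earliest (leftmost) match.
first = re.search(r"car", text)

# Global: re.findall returns every non-overlapping match.
all_matches = re.findall(r"car", text)

# Case insensitive (i): re.IGNORECASE ignores letter case.
any_case = re.findall(r"car", "Car CAR car", flags=re.IGNORECASE)

# Dot matches all (s): by default . does not match a line break.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", flags=re.DOTALL)

print(first.start(), all_matches, any_case)
print(dot_default is None, dot_all is not None)
```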

Meta characters :
These are characters with a special meaning , much like mathematical operators. They turn
literal characters into powerful expressions.

 \ . * +  -  { } [ ]  ^  $ |  ?  ( )  : ! =


Wildcard meta character :

 " . "  matches any single character except a line break

for example -  c.r  will match car , cor , cbr , cxr etc
               resume.\.txt  will match resume1.txt , resume2.txt , resume3.txt etc (the \. is a literal dot)
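
The two wildcard examples can be tried out roughly like this in Python. The word and file lists are made-up test data, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

words = ["car", "cor", "cxr", "cr", "caar"]
# c.r needs exactly one character between c and r.
wildcard_hits = [w for w in words if re.fullmatch(r"c.r", w)]

files = ["resume1.txt", "resume2.txt", "resume1_txt", "resume.txt"]
# The bare . matches any one character; the escaped \. only matches a real dot.
file_hits = [f for f in files if re.fullmatch(r"resume.\.txt", f)]

print(wildcard_hits)  # ['car', 'cor', 'cxr']
print(file_hits)      # ['resume1.txt', 'resume2.txt']
```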


spaces : c a t  matches c a t
tab (\t)  : a\tb matches a    b
new line (\n) :  a\nd  matches a
                                                d

carriage return (\r)

character set  [ ]   :  matches any one of the characters listed inside the brackets

/g[aeiuo]t/  matches gat , get , git , gut, got


Character range :

Represents all the characters between two characters , ex - [0-9]
Ranges can be combined and their order does not matter : [A-Za-z] is the same as [a-zA-Z] , and [0-9A-Z] is the same as [A-Z0-9]

Negative character set :  /[^abcdefg]/  matches any one character that is not in the set
                          see[^abcd]  matches sees , seen but does not match seed
                          see[^a-d]  matches sees , seen but not seea , seeb , seec or seed

A metacharacter inside a character set is treated as a literal , except ] , - , ^ and \ , which may need escaping
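
A quick Python check of character sets , negated sets and literals inside a set. The word lists and the "1+2-3.0" string are invented sample data.

```python
import re

words = ["gat", "get", "git", "got", "gut", "gxt"]
# One character from the set must sit between g and t.
vowel_hits = [w for w in words if re.fullmatch(r"g[aeiou]t", w)]

candidates = ["sees", "seen", "seed", "seea"]
# [^a-d] accepts any single character outside the range a to d.
negated_hits = [w for w in candidates if re.fullmatch(r"see[^a-d]", w)]

# Inside a set, . and + are literal; the hyphen is escaped to be safe.
literal_hits = re.findall(r"[.+\-]", "1+2-3.0")

print(vowel_hits)    # ['gat', 'get', 'git', 'got', 'gut']
print(negated_hits)  # ['sees', 'seen']
print(literal_hits)  # ['+', '-', '.']
```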

Repetition Expressions : 

repetition metacharacters : 

*  -   /apples*/  matches apple , apples , applessssssss        :   0 or more repetitions

+  -  /apples+/  matches apples , applessss , but not apple   :  1 or more repetitions

?  -  /apples?/   matches apple , apples but not applessss     :   0 or 1 repetition (the s is optional)
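
The three repetition metacharacters side by side , as a Python sketch. The helper fully_matches is a hypothetical name ; it uses re.fullmatch so the quantifier has to account for every trailing s.

```python
import re

def fully_matches(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["apple", "apples", "applessss"]

star = fully_matches(r"apples*", words)      # s repeated 0 or more times
plus = fully_matches(r"apples+", words)      # s repeated 1 or more times
optional = fully_matches(r"apples?", words)  # s present 0 or 1 time

print(star)      # ['apple', 'apples', 'applessss']
print(plus)      # ['apples', 'applessss']
print(optional)  # ['apple', 'apples']
```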


Quantified repetition :

{min,max}

min and max are non-negative integers that define the repetition range. min must always be given and can be as low as 0 ; if max is left out after the comma , there is no upper limit.

for ex :
/\d{5}/  matches exactly 5 digits
/\d{5,}/  matches 5 or more digits
/\d{2,5}/ matches between 2 and 5 digits
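
The same three quantified patterns in Python , run against a made-up list of digit strings. re.fullmatch is assumed here so that the count applies to the whole string.

```python
import re

codes = ["1", "12", "12345", "1234567"]

exactly_five = [c for c in codes if re.fullmatch(r"\d{5}", c)]
five_or_more = [c for c in codes if re.fullmatch(r"\d{5,}", c)]
two_to_five = [c for c in codes if re.fullmatch(r"\d{2,5}", c)]

print(exactly_five)  # ['12345']
print(five_or_more)  # ['12345', '1234567']
print(two_to_five)   # ['12', '12345']
```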

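Lazy expressions :

Adding ? after a quantifier makes it lazy : it matches as few characters as possible instead of as many as possible.

*?
+?
{min,max}?
??

/apples??/   the first ? makes the s optional (greedy by default) and the second ? makes that choice lazy . On its own , apples? is greedy and matches apples when it can ; with the extra ? it prefers the shorter match apple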
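
Greedy vs lazy , sketched in Python. The html string is an invented sample that shows the difference more visibly than apples does.

```python
import re

# Greedy ? takes the s when it can; lazy ?? leaves it out when it can.
greedy = re.search(r"apples?", "apples").group()
lazy = re.search(r"apples??", "apples").group()

html = "<b>one</b><b>two</b>"
# Greedy .* runs to the last </b>; lazy .*? stops at the first one.
greedy_tag = re.findall(r"<b>.*</b>", html)
lazy_tag = re.findall(r"<b>.*?</b>", html)

print(greedy, lazy)  # apples apple
print(greedy_tag)    # ['<b>one</b><b>two</b>']
print(lazy_tag)      # ['<b>one</b>', '<b>two</b>']
```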

Capturing groups and Back referencing :

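( ) is the grouping metacharacter : it groups part of the pattern so it can be treated as one unit , and captures what that part matched.
/(abc)+/  matches abcabcabc (the whole group repeats)
/abc+/ matches abcccccc (only the c repeats)

/(in)?dependent/ matches both independent and dependent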
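
A short Python sketch of how grouping changes what a quantifier applies to. Variable names are illustrative only.

```python
import re

# With a group, + repeats the whole "abc".
group_repeat = re.fullmatch(r"(abc)+", "abcabcabc") is not None

# Without a group, + repeats only the "c".
char_repeat = re.fullmatch(r"abc+", "abcccccc") is not None
no_group_repeat = re.fullmatch(r"abc+", "abcabcabc") is not None

# An optional group: "in" may or may not be there.
optional_prefix = [
    w for w in ["independent", "dependent", "undependent"]
    if re.fullmatch(r"(in)?dependent", w)
]

print(group_repeat, char_repeat, no_group_repeat)  # True True False
print(optional_prefix)  # ['independent', 'dependent']
```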

Anchored expressions :

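^  : caret sign means start of string/line
$  :  dollar sign means end of string/line
\A : start of string , never start of line
\Z : end of string , never end of line

In single line mode , ^ and \A match at the start of the string and $ and \Z at its end ; none of them match at line breaks inside the text. In multiline mode (re.MULTILINE in Python) ^ and $ also match at the start and end of every line , while \A and \Z still match only at the start and end of the whole string.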
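
The anchor behaviour can be checked in Python roughly like this. The two-line sample text is made up.

```python
import re

text = "apples\noranges"

# Single line mode: ^ and $ only see the start and end of the whole string,
# and \w+ cannot cross the line break, so nothing matches.
single = re.findall(r"^\w+$", text)

# Multiline mode: ^ and $ also match around each line break.
multi = re.findall(r"^\w+$", text, flags=re.MULTILINE)

# \A and \Z ignore multiline mode and stay tied to the whole string.
start_only = re.findall(r"\A\w+", text, flags=re.MULTILINE)
end_only = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

print(single)      # []
print(multi)       # ['apples', 'oranges']
print(start_only)  # ['apples']
print(end_only)    # ['oranges']
```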

word boundaries :  \b word boundary , \B not a word boundary
A word boundary sits before the very first and after the very last word character of the text , and between a word character and a non-word character.
This helps to avoid unnecessary backtracking while matching.
A space is not a word boundary ; the boundary is the zero-width position at the start or end of a word.

apples and oranges
matches /\bapples\b/ and /\boranges\b/ , but does not match /\bapples\band\boranges\b/ because the spaces between the words are missing from the pattern
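
A Python sketch of the word boundary example. The last pattern , with the spaces written out , is an added illustration and not part of the original notes.

```python
import re

text = "apples and oranges"

whole_word = re.search(r"\bapples\b", text) is not None
# "apple" is followed by "s", so there is no boundary after it.
inside_word = re.search(r"\bapple\b", text) is not None
# \b matches a position, not the space itself, so this fails.
glued = re.search(r"\bapples\band\boranges\b", text) is not None
# Writing the spaces explicitly makes it match.
spaced = re.search(r"\bapples\b \band\b \boranges\b", text) is not None

print(whole_word, inside_word, glued, spaced)  # True False False True
```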

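Alternation metacharacter ->   |   pipe character

Here it acts as an OR operator : the pattern matches either the expression on its left or the one on its right.

Regular expressions are eager (they return the first alternative that works , reading left to right) and greedy , so put the preferred alternative first.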
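
Why the order of alternatives matters , as a small Python sketch. The pea / peanut pair is an invented example.

```python
import re

fruits = re.findall(r"apple|orange", "apple juice and orange juice")

# Eager: the first alternative that works wins, even if a later one is longer.
eager = re.search(r"pea|peanut", "peanut").group()
longer_first = re.search(r"peanut|pea", "peanut").group()

print(fruits)               # ['apple', 'orange']
print(eager, longer_first)  # pea peanut
```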

Back referencing : stores the matched data , not the expression. It gives access to the captured data
through the back reference positions \1 to \9

/(A?)B\1/ matches "ABA" and "B"
/(A?)B\1C/ matches "ABAC" and "BC" : the group is optional , so the back reference may hold an empty string
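
The back reference repeats whatever the group actually captured , which this Python sketch tries to show. The doubled-word search at the end is an extra illustration with a made-up sentence.

```python
import re

pattern = re.compile(r"(A?)B\1")
samples = ["ABA", "B", "AB", "BA"]
# \1 must repeat exactly what group 1 captured: "A" or the empty string.
hits = [s for s in samples if pattern.fullmatch(s)]

# A common use: find a word that is typed twice in a row.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

print(hits)     # ['ABA', 'B']
print(doubled)  # is
```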

find and replace using back reference :

for ex
^(\d{1,2}), ([\w.]+?) (\w+?), (\d{4})   -> find

matches :
1, Leonardo Dicaprio, 2016

replace with
$1, $3, $2, $4

result : 1, Dicaprio, Leonardo, 2016
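
The same find and replace as a Python sketch. Python's re.sub writes back references in the replacement as \1 rather than $1 ; the pattern below assumes a space after each comma and adds a closing $ anchor.

```python
import re

line = "1, Leonardo Dicaprio, 2016"

# Groups: 1 = number, 2 = first name, 3 = last name, 4 = year.
pattern = r"^(\d{1,2}), ([\w.]+?) (\w+?), (\d{4})$"

# Swap groups 2 and 3 in the replacement.
swapped = re.sub(pattern, r"\1, \3, \2, \4", line)

print(swapped)  # 1, Dicaprio, Leonardo, 2016
```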

Non capturing group expression : ?: at the start of a group turns off capturing

/(?:\w+)/  -> slightly faster , and leaves the numbered back references free for the groups that need them.
ex: find (oranges) and (apples) , replace with \1
oranges and apples becomes oranges

find (?:oranges) and (apples) , replace with \1

oranges and apples becomes apples
 ? -> gives this group a different meaning
 : -> non capturing
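
How a non-capturing group shifts the numbering , sketched with Python's re.sub.

```python
import re

text = "oranges and apples"

# Both groups capture, so \1 is "oranges".
capturing = re.sub(r"(oranges) and (apples)", r"\1", text)

# The first group no longer captures, so \1 is now "apples".
non_capturing = re.sub(r"(?:oranges) and (apples)", r"\1", text)

print(capturing)      # oranges
print(non_capturing)  # apples
```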

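LookAround Assertions :
two types : LookAhead and LookBehind assertions . They check the text around the current position without making it part of the match.

positive lookahead :  checks what must come next ,  /(?=regex)/
/(?=seashore)sea/ matches the sea in seashore but not the sea in seaside . ?= looks ahead to check that the expression is present , but the text it checks does not become part of the match

Negative lookahead : /(?!regex)/ , checks what must not come next

/(?!seashore)sea/ matches the sea in seaside but not the sea in seashore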
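
Both lookaheads on one sample string , as a Python sketch. The start offsets show which sea was matched : 0 is inside seashore and 9 is inside seaside.

```python
import re

text = "seashore seaside"

# Positive lookahead: only the "sea" that begins "seashore".
positive = [m.start() for m in re.finditer(r"(?=seashore)sea", text)]

# Negative lookahead: only the "sea" that does not begin "seashore".
negative = [m.start() for m in re.finditer(r"(?!seashore)sea", text)]

# The match itself is still just "sea"; the lookahead consumes nothing.
matched_text = re.search(r"(?=seashore)sea", text).group()

print(positive, negative, matched_text)  # [0] [9] sea
```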

Lookbehind Assertion :
/(?<=regex)/  positive lookbehind assertion
/(?<!regex)/  negative lookbehind assertion

/(?<=base)ball/ matches the ball in baseball but not the ball in football
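
The lookbehind pair in Python. Note that Python's re module only accepts fixed-width lookbehind patterns , which "base" is. Offsets 4 and 13 are the ball in baseball and in football.

```python
import re

text = "baseball football"

# Positive lookbehind: "ball" only when "base" comes right before it.
after_base = [m.start() for m in re.finditer(r"(?<=base)ball", text)]

# Negative lookbehind: "ball" only when "base" does not come right before it.
not_after_base = [m.start() for m in re.finditer(r"(?<!base)ball", text)]

print(after_base, not_after_base)  # [4] [13]
```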


position :

This costs 54.00 or $54.00.

(?<![$\d])\d+\.\d\d   matches the first 54.00 , the one without a $ in front
(?<![$\d])(?=\d+\.\d\d)  matches no characters at all : it only moves the regex position to just in front of that 54.00 , which helps when inserting text there
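
Using the position match to insert a missing $ , as a Python sketch. Replacing a zero-width match with "$" works as an insertion because nothing is removed.

```python
import re

text = "This costs 54.00 or $54.00."

# Only the amount that has no $ (and no digit) directly in front of it.
amounts = re.findall(r"(?<![$\d])\d+\.\d\d", text)

# Zero-width match: lookbehind plus lookahead, no characters consumed.
fixed = re.sub(r"(?<![$\d])(?=\d+\.\d\d)", "$", text)

print(amounts)  # ['54.00']
print(fixed)    # This costs $54.00 or $54.00.
```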

Unicode and Multibyte Characters :

single byte : uses 8 bits to represent a character , which allows for 256 characters
A-Z , a-z , 0-9 , punctuation , common symbols

Double byte :  16 bits ; 2 to the power of 16 , 65,536 characters

Unicode encodings are variable byte size : 1 , 2 or more bytes per character , which allows over 1 million characters
Unicode itself is a mapping between characters and numbers (code points)
a code point is written as "U+" followed by a hexadecimal number , usually four digits
infinity (∞) is written as U+221E

"cafe","café"

café can be encoded as four or five code points : the é is either the single code point U+00E9 , or a plain e followed by the combining acute accent U+0301

unicode indicator : \u
\u followed by a four digit hexadecimal number (0000-FFFF)
/caf\u00E9/  matches café , not cafe
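
A Python sketch of the two ways to store café and what /caf\u00E9/ does with each. The normalization step with unicodedata.normalize is an added suggestion , not something from the notes above.

```python
import re
import unicodedata

composed = "caf\u00e9"     # é as one code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by combining acute accent (U+0301)

lengths = (len(composed), len(decomposed))

pattern = r"caf\u00E9"
hit_composed = re.fullmatch(pattern, composed) is not None
hit_decomposed = re.fullmatch(pattern, decomposed) is not None
hit_plain = re.fullmatch(pattern, "cafe") is not None

# NFC normalization folds e + accent into the single code point first.
normalized = unicodedata.normalize("NFC", decomposed)
hit_normalized = re.fullmatch(pattern, normalized) is not None

print(lengths)  # (4, 5)
print(hit_composed, hit_decomposed, hit_plain, hit_normalized)
# True False False True
```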

unicode wildcard and properties :
Unicode wildcard : \X  matches any single character , including an accented character stored as a letter plus combining marks , and always matches line breaks like /./s
/caf\X/ matches café  and cafe

Unicode property : \p
/\p{Mark}/ or /\p{M}/ matches any mark  (just accents)

Letter L
Mark M
Separator Z
Punctuation P
Number N
Symbol S
Other C

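unicode not-property : \P  matches any character that does not have the property

/caf\P{M}\p{M}/ matches café when the é is stored as e followed by a combining accent

Unicode properties are supported by many regex engines (for example Perl , PCRE , Java and .NET) , but not by all of them : Python's built-in re module supports neither \p nor \X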
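
A rough stand-in for \p{M} in Python , using the standard unicodedata module to read each character's general category. The helpers is_mark and matches_caf_letter_mark are hypothetical names , and the second one is only a hand-written imitation of /caf\P{M}\p{M}/ , not a real regex feature.

```python
import unicodedata

def is_mark(ch):
    """True if the character's general category is a Mark (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")

def matches_caf_letter_mark(s):
    """Imitate /caf\\P{M}\\p{M}/ on a whole string: caf, a non-mark, a mark."""
    return (
        len(s) == 5
        and s.startswith("caf")
        and not is_mark(s[3])
        and is_mark(s[4])
    )

decomposed = "cafe\u0301"  # e + combining acute accent
composed = "caf\u00e9"     # single code point é

marks = [ch for ch in decomposed if is_mark(ch)]

print(len(marks))                           # 1
print(matches_caf_letter_mark(decomposed))  # True
print(matches_caf_letter_mark(composed))    # False
```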

Now Practice!
