A regular expression is a sequence of characters that defines a search pattern.
It is usually written between slashes, for example /abc/
There are four matching modes :
a) Global match (g)
b) Case insensitive (i)
c) Multiline anchors (m)
d) Dot matches all (s)
Literal character matching :
non-global matching : only the earliest (leftmost) match is returned.
global matching : every match in the text is returned.
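A minimal sketch of the two behaviors in Python. Python's built-in re module has no g flag, so this assumes re.search as the non-global case and re.findall as the global case; the sample text is made up for illustration.

```python
import re

text = "cat bat cat"

# Non-global: re.search stops at the earliest (leftmost) match.
first = re.search(r"cat", text)
print(first.group(), first.start())  # cat 0

# Global: re.findall collects every non-overlapping match.
all_matches = re.findall(r"cat", text)
print(all_matches)  # ['cat', 'cat']

# The i mode maps to the re.IGNORECASE flag.
any_case = re.findall(r"cat", "Cat cAT", flags=re.IGNORECASE)
print(any_case)  # ['Cat', 'cAT']
```

The m and s modes map to re.MULTILINE and re.DOTALL in the same way.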
Meta Character :
These are characters with a special meaning, like mathematical operators. They turn
literal characters into powerful expressions.
\ . * + - { } [ ] ^ $ | ? ( ) : ! =
Wildcard meta character :
" . "
for example - c.r will match car , cor , cbr , cxr etc
resume.\.txt will match resume1.txt , resume2.txt resume3.txt etc
spaces : c a t matches c a t
tab (\t) : a\tb matches a b
new line (\n) : a\nd matches a
d
carriage return (\r)
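The wildcard and the escaped characters above can be tried out with a short Python sketch. The word and file lists are invented for the example; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

words = ["car", "cor", "cxr", "cr", "caar"]
# "." stands for exactly one character (any character except a newline by default).
wildcard_hits = [w for w in words if re.fullmatch(r"c.r", w)]
print(wildcard_hits)  # ['car', 'cor', 'cxr']

files = ["resume1.txt", "resume2.txt", "resume1_txt"]
# "\." is a literal dot, so only a real ".txt" ending matches.
escaped_hits = [f for f in files if re.fullmatch(r"resume.\.txt", f)]
print(escaped_hits)  # ['resume1.txt', 'resume2.txt']

# \t and \n inside the pattern match a real tab and a real newline.
tab_match = re.search(r"a\tb", "a\tb")
newline_match = re.search(r"a\nd", "a\nd")
print(tab_match is not None, newline_match is not None)  # True True
```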
character set [ ] :
/g[aeiuo]t/ matches gat , get , git , gut, got
Character range :
Represents all the characters between two characters , e.g. [0-9]
[A-Za-z] , [a-zA-Z] , [0-9A-Z] , [A-Z0-9]
Negative character set : /[^abcdefg]/
see[^abcd] matches sees , seen but does not match seed
see[^a-d] matches sees , seen but not seed , seea , seeb or seec
a metacharacter inside a character set is treated as a literal , except ] , - , ^ , \
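A small Python sketch of character sets, ranges and negated sets, using the same words as above. The last line shows that metacharacters such as . + * are plain literals inside a set.

```python
import re

# A set matches exactly one character out of the listed ones.
set_hits = re.findall(r"g[aeiou]t", "gat get git got gut gxt")
print(set_hits)  # ['gat', 'get', 'git', 'got', 'gut']

# A negated set with a range: any single character except a, b, c or d.
negated_hits = re.findall(r"see[^a-d]", "sees seen seed seea")
print(negated_hits)  # ['sees', 'seen']

# Inside a set, ".", "+" and "*" lose their special meaning.
literal_hits = re.findall(r"[.+*]", "a.b+c*d")
print(literal_hits)  # ['.', '+', '*']
```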
Repetition Expressions :
repetition metacharacters :
* - /apples*/ matches apple , apples , applessssssss : 0 or more repetition
+ - /apples+/ matches apples , applessss , but not apple : 1 or more repetition
? - /apples?/ matches apple , apples but not applessss : 0 or 1 repetition (optional)
Quantified repetition :
{min, max}
min and max are non-negative integers defining the repetition range. min must always be given and can be as low as 0 ; if max is left out after the comma the upper bound is infinite , and {n} without a comma means exactly n.
for ex :
/\d{5}/ matches exactly 5 digits
/\d{5,}/ matches 5 or more digits
/\d{2,5}/ matches between 2 and 5 digits
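The repetition metacharacters and the {min,max} form can be compared side by side in Python. This is only a sketch: full_matches is a hypothetical helper written for this example, and re.fullmatch is used so the counts apply to the whole string.

```python
import re

samples = ["apple", "apples", "applesss"]

def full_matches(pattern, candidates):
    """Hypothetical helper: keep the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = full_matches(r"apples*", samples)      # 0 or more "s"
plus = full_matches(r"apples+", samples)      # 1 or more "s"
optional = full_matches(r"apples?", samples)  # 0 or 1 "s"
print(star)      # ['apple', 'apples', 'applesss']
print(plus)      # ['apples', 'applesss']
print(optional)  # ['apple', 'apples']

digits = ["1", "12", "12345", "1234567"]
exactly_five = full_matches(r"\d{5}", digits)
five_or_more = full_matches(r"\d{5,}", digits)
two_to_five = full_matches(r"\d{2,5}", digits)
print(exactly_five)  # ['12345']
print(five_or_more)  # ['12345', '1234567']
print(two_to_five)   # ['12', '12345']
```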
Lazy expressions :
*?
+?
{min,max}?
??
/apples??/ the first ? is the optional quantifier (greedy by default) and the second ? makes it lazy. On its own apples? matches apple or apples and greedily prefers apples ; the extra ? makes it prefer apple
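Greedy and lazy quantifiers can be compared directly in Python. A minimal sketch, using re.search so that each quantifier is free to take as much or as little as it wants:

```python
import re

# Greedy "?" takes the optional s, lazy "??" leaves it out.
greedy_optional = re.search(r"apples?", "apples").group()
lazy_optional = re.search(r"apples??", "apples").group()
print(greedy_optional, lazy_optional)  # apples apple

# Greedy "+" takes every digit, lazy "+?" stops after the first one.
greedy_plus = re.search(r"\d+", "12345").group()
lazy_plus = re.search(r"\d+?", "12345").group()
print(greedy_plus, lazy_plus)  # 12345 1

# Lazy {min,max}? stops at the minimum when nothing forces it further.
lazy_range = re.search(r"\d{2,4}?", "12345").group()
print(lazy_range)  # 12
```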
Capturing groups and Back referencing :
( ) groups the things that need to be kept together and captures what they match . it is the grouping metacharacter.
/(abc)+/ matches abcabcabc
abc+ matches abcccccc
/(in)?dependent/ matches both independent and dependent
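A short Python sketch of what grouping changes: with the parentheses the quantifier applies to the whole group, without them only to the last character.

```python
import re

# The + repeats the whole group "abc".
group_repeat = re.fullmatch(r"(abc)+", "abcabcabc") is not None
# Without the group the + repeats only the "c".
char_repeat = re.fullmatch(r"abc+", "abcabcabc") is not None
char_repeat_c = re.fullmatch(r"abc+", "abcccccc") is not None
print(group_repeat, char_repeat, char_repeat_c)  # True False True

# An optional group: group(1) is None when the group did not take part.
with_prefix = re.fullmatch(r"(in)?dependent", "independent")
without_prefix = re.fullmatch(r"(in)?dependent", "dependent")
print(with_prefix.group(1), without_prefix.group(1))  # in None
```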
Anchored expressions :
^ : caret sign means start of string/line
$ : dollar sign means end of string/line
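\A : start of string , never start of line : same as ^ outside multiline mode
\Z : end of string , never end of line : same as $ outside multiline mode
in single line mode ^ , \A , $ and \Z do not match at line breaks. in multiline mode ^ and $ also match at the start and end of every line
(\A and \Z still match only at the ends of the string) . In Python the flag is re.MULTILINE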
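The difference between the anchors shows up on a two-line string. A minimal Python sketch with re.MULTILINE; the sample text is made up and has no trailing newline.

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ matches only at the start of the string.
start_single = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after every line break.
start_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
# \A ignores the flag and still matches only at the start of the string.
start_abs = re.findall(r"\A\w+", text, flags=re.MULTILINE)
print(start_single, start_multi, start_abs)
# ['first'] ['first', 'second'] ['first']

# The same holds for $ and \Z at the end.
end_single = re.findall(r"\w+$", text)
end_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)
end_abs = re.findall(r"\w+\Z", text, flags=re.MULTILINE)
print(end_single, end_multi, end_abs)
# ['line'] ['line', 'line'] ['line']
```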
word boundaries : \b word boundary , \B not a word boundary
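a boundary is found before the first and after the last word character , and between a word character and a non-word character
Anchoring a pattern with \b lets the engine reject many positions early , which can cut down on backtracking
a space itself is not a word boundary ; the boundary is the zero-width position at the start or end of a word
for the text : apples and oranges
/\bapples\b/ and /\boranges\b/ match , but /\bapples\band\boranges\b/ does not , because the spaces between the words are missing from the pattern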
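Word boundaries are zero-width, which a short Python sketch makes visible. Note that the boundary is written \b with a backslash, and the raw string prefix r"..." keeps Python from reading it as a backspace character.

```python
import re

text = "apples and oranges"

apples = re.findall(r"\bapples\b", text)
oranges = re.findall(r"\boranges\b", text)
print(apples, oranges)  # ['apples'] ['oranges']

# \b consumes nothing, so the spaces still have to be matched explicitly.
no_spaces = re.search(r"\bapples\band\boranges\b", text)
with_spaces = re.search(r"\bapples\b \band\b \boranges\b", text)
print(no_spaces is None, with_spaces is not None)  # True True

# \b keeps "cat" from matching inside longer words.
whole_words = re.findall(r"\bcat\b", "cat concat category cat.")
print(whole_words)  # ['cat', 'cat']
```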
Alternation metacharacter -> | pipe character
here it acts as an OR operator.
regular expressions are eager (the first alternative that works wins) and greedy , so put the preferred alternative first
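Why the order of the alternatives matters can be seen in a small Python sketch. The sample strings are invented; Python's re module tries the alternatives from left to right and keeps the first one that succeeds.

```python
import re

# Either alternative may match.
either = re.findall(r"apple|orange", "apple juice and orange juice")
print(either)  # ['apple', 'orange']

# Eager: "pea" succeeds first, so the longer "peanut" is never tried.
short_first = re.search(r"pea|peanut", "peanut butter").group()
# Putting the longer alternative first gives the intended result.
long_first = re.search(r"peanut|pea", "peanut butter").group()
print(short_first, long_first)  # pea peanut
```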
Back referencing : stores the matched data , not the expression. It gives access to the captured data
through the back reference positions \1 to \9
/(A?)B\1/ matches "ABA" and "B"
/(A?)B\1C/ matches "ABAC" and "BC" , the back reference repeats whatever the group captured : one A or nothing
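That a back reference repeats the captured text, not the expression, can be checked in Python. A minimal sketch; the candidate strings and the doubled-word sentence are made up for illustration.

```python
import re

pattern = re.compile(r"(A?)B\1")
candidates = ["ABA", "B", "AB", "BA"]

# \1 must repeat exactly what (A?) captured: "A" in "ABA", nothing in "B".
full_hits = [c for c in candidates if pattern.fullmatch(c)]
print(full_hits)  # ['ABA', 'B']

# A common use: finding a word that is accidentally written twice.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is
```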
find and replace using back reference :
for ex
^(\d{1,2}), ([\w.]+?) ([\w]+?), (\d{4})$ -> find
matches :
1, Leonardo Dicaprio, 2016
replace with
$1, $3, $2, $4
result : 1, Dicaprio, Leonardo, 2016
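The same find and replace can be sketched in Python with re.sub. Two assumptions here: the pattern is a cleaned-up form with the spaces after the commas and {4} for the year written out, and Python writes the back references in the replacement as \1 rather than $1.

```python
import re

line = "1, Leonardo Dicaprio, 2016"

# Groups: 1 = number, 2 = first name, 3 = last name, 4 = four-digit year.
pattern = r"^(\d{1,2}), ([\w.]+?) (\w+?), (\d{4})$"

# Swap first and last name by changing the order of the back references.
result = re.sub(pattern, r"\1, \3, \2, \4", line)
print(result)  # 1, Dicaprio, Leonardo, 2016
```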
Non capturing group expression : ?: at the start of a group turns capturing off
/(?:\w+)/ -> can be slightly faster , and leaves more back reference slots free for the groups that matter.
ex: (oranges) and (apples) to \1
oranges and apples to oranges
(?:oranges) and (apples) to \1
oranges and apples to apples
? -> give this group different meaning
: -> non capturing
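How ?: shifts the group numbers can be verified in Python. A short sketch with the same oranges and apples patterns:

```python
import re

capturing = re.compile(r"(oranges) and (apples) to \1")
non_capturing = re.compile(r"(?:oranges) and (apples) to \1")

# With two capturing groups, \1 refers to "oranges".
m1 = capturing.fullmatch("oranges and apples to oranges")
print(m1.groups())  # ('oranges', 'apples')

# With the first group non-capturing, \1 refers to "apples".
m2 = non_capturing.fullmatch("oranges and apples to apples")
print(m2.groups())  # ('apples',)

# The non-capturing version no longer accepts the first sentence.
m3 = non_capturing.fullmatch("oranges and apples to oranges")
print(m3)  # None
```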
LookAround Assertions :
two types : LookAhead and LookBehind assertions
positive lookahead : checks what must come next without consuming it , /(?=regex)/
/(?=seashore)sea/ matches the sea in seashore but not in seaside. ?= looks ahead to check that the text is present , but does not include it in the match
Negative lookahead : /(?!regex)/
/(?!seashore)sea/ matches the sea in seaside but not in seashore
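Both lookaheads can be tried on one string in Python. The sketch records where each match starts, which shows that only sea is matched and that the lookahead itself consumes nothing.

```python
import re

text = "seashore seaside"  # "seashore" starts at index 0, "seaside" at index 9

# Positive lookahead: match "sea" only where "seashore" starts.
positive = [(m.start(), m.group()) for m in re.finditer(r"(?=seashore)sea", text)]
print(positive)  # [(0, 'sea')]

# Negative lookahead: match "sea" only where "seashore" does not start.
negative = [(m.start(), m.group()) for m in re.finditer(r"(?!seashore)sea", text)]
print(negative)  # [(9, 'sea')]
```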
Lookbehind Assertion :
/(?<=regex)/ positive look behind assertion
/(?<!regex)/ negative look behind assertion
/(?<=base)ball/ matches the ball in baseball but not in football
position :
This costs 54.00 or $54.00.
(?<![$\d])\d+\.\d\d matches 54.00
(?<![$\d])(?=\d+\.\d\d) matches no characters ; it puts the regular expression pointer just in front of the first 54.00 , which helps when inserting text there
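The lookbehind examples can be run in Python as a sketch. The last step uses the zero-width position to insert a missing dollar sign; one caveat is that Python's re module only accepts lookbehinds of fixed width, which these are.

```python
import re

ball = re.findall(r"(?<=base)ball", "baseball football")
print(ball)  # ['ball']

text = "This costs 54.00 or $54.00."

# Negative lookbehind: a price not preceded by "$" or by another digit.
bare_prices = re.findall(r"(?<![$\d])\d+\.\d\d", text)
print(bare_prices)  # ['54.00']

# Lookbehind plus lookahead match a position, so re.sub inserts text there.
fixed = re.sub(r"(?<![$\d])(?=\d+\.\d\d)", "$", text)
print(fixed)  # This costs $54.00 or $54.00.
```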
Unicode and Multibyte Characters :
single byte : uses 8 bits to represent a character , allows for 256 characters
A-Z , a-z , 0-9 , punctuation , common symbols
Double byte : 16 bits ; 2 to the power of 16 , 65,536 characters
Unicode encodings are variable in byte size : 1 , 2 or more bytes per character , which allows over 1 million characters
it is a mapping between characters and numbers
"U+" followed by a hexadecimal number of at least four digits
infinity is written as U+221E
"cafe","café"
café can be encoded as four or five characters (é as one character , or e followed by a combining accent)
unicode indicator : \u
\u followed by hexadecimal number (0000-FFFF)
/caf\u00E9/ matches café not cafe
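The two encodings of café and the \u escape can be explored in Python. A sketch, assuming the standard unicodedata module is used to convert the five character form into the four character one; the variable names are made up.

```python
import re
import unicodedata

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent
print(len(composed), len(decomposed))  # 4 5

pattern = re.compile(r"caf\u00E9")
matches_composed = pattern.fullmatch(composed) is not None
matches_plain = pattern.fullmatch("cafe") is not None
matches_decomposed = pattern.fullmatch(decomposed) is not None
print(matches_composed, matches_plain, matches_decomposed)  # True False False

# Normalizing to NFC merges e + accent into the single code point first.
normalized = unicodedata.normalize("NFC", decomposed)
matches_normalized = pattern.fullmatch(normalized) is not None
print(matches_normalized)  # True
```

Normalizing the input before matching is one way to avoid writing a pattern for each encoding.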
unicode wildcard and properties :
Unicode wildcard : \X matches any single character including its combining marks , always matches line breaks like /./s
/caf\X/ matches café and cafe
Unicode property : \p
/\p{Mark}/ or /\p{M}/ matches any mark (just accents)
Letter L
Mark M
Separator Z
Punctuation P
Number N
Symbol S
Other C
unicode not-property : \P
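/caf\P{M}\p{M}/ matches café in its five character form (e followed by a combining accent)
Unicode properties are supported by many engines (Perl , PCRE , Java , .NET) but not by all ; Python's built-in re module , for example , does not support \p or \X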
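Python's built-in re module does not understand \p or \P, so the sketch below swaps in a different technique: it reads the property of each character with unicodedata.category, whose result starts with M for marks. The helpers is_mark and matches_caf_base_mark are hypothetical names written only for this example.

```python
import unicodedata

def is_mark(ch):
    """Hypothetical helper: True when the character has a Mark (M) category."""
    return unicodedata.category(ch).startswith("M")

def matches_caf_base_mark(s):
    """Hand-written stand-in for /caf\\P{M}\\p{M}/ applied to the whole string."""
    return (
        len(s) == 5
        and s.startswith("caf")
        and not is_mark(s[3])  # \P{M}: a character that is not a mark
        and is_mark(s[4])      # \p{M}: a mark, here the combining accent
    )

decomposed = "cafe\u0301"  # five character form
composed = "caf\u00e9"     # four character form

marks = [ch for ch in decomposed if is_mark(ch)]
print(len(marks))                         # 1
print(matches_caf_base_mark(decomposed))  # True
print(matches_caf_base_mark(composed))    # False
```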
Now Practice!