TECHNO-TUNING: Regular Expression

A regular expression is a sequence of characters that defines a search pattern.
It is usually written between slashes, as in /abc/

There are four matching modes :
a) Global match (g)
b) Case insensitive (i)
c) Multiline anchors (m)
d) Dot matches all (s)
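
As a rough illustration of the i, m and s modes, here is a minimal sketch using Python's built-in re module. The sample text is made up for this example. Python has no g flag; global matching is done by choosing a function such as re.findall instead.

```python
import re

text = "Apple pie\napple tart\nAPPLE cake"

# Case insensitive (i): re.IGNORECASE
case_hits = re.findall(r"apple", text, flags=re.IGNORECASE)

# Multiline anchors (m): ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Dot matches all (s): with re.DOTALL the dot also matches a newline
dot_default = re.search(r"pie.apple", text)
dot_all = re.search(r"pie.apple", text, flags=re.DOTALL)

print(case_hits)            # ['Apple', 'apple', 'APPLE']
print(line_starts)          # ['Apple', 'apple', 'APPLE']
print(dot_default)          # None
print(repr(dot_all.group()))  # 'pie\napple'
```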

Literal character matching :
Non-global matching : only the earliest (leftmost) match is returned.
Global matching : every match in the text is returned.
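
The difference can be sketched in Python, where re.search plays the role of a non-global match and re.findall the role of a global one. The sample string is an assumption made for the example.

```python
import re

text = "cat scatter concat"

first = re.search(r"cat", text)    # non-global: leftmost match only
every = re.findall(r"cat", text)   # global: all non-overlapping matches

print(first.start())  # 0
print(every)          # ['cat', 'cat', 'cat']
```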

Metacharacters :
These are characters with a special meaning, like mathematical operators. They turn
literal characters into powerful expressions.

\ . * + - { } [ ] ^ $ | ? ( ) : ! =

Wildcard metacharacter :

" . " matches any single character except a line break

for example - c.r will match car , cor , cbr , cxr etc
resume.\.txt will match resume1.txt , resume2.txt , resume3.txt etc (the escaped \. matches a literal dot)
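
A small Python sketch of both wildcard examples. The word and file lists are invented for illustration, and re.fullmatch is used so the whole string has to fit the pattern.

```python
import re

words = ["car", "cor", "cbr", "cxr", "cr"]
wildcard_hits = [w for w in words if re.fullmatch(r"c.r", w)]

files = ["resume1.txt", "resume2.txt", "resume.txt", "resume1_txt"]
# The first dot is a wildcard, the escaped \. is a literal dot
file_hits = [f for f in files if re.fullmatch(r"resume.\.txt", f)]

print(wildcard_hits)  # ['car', 'cor', 'cbr', 'cxr']
print(file_hits)      # ['resume1.txt', 'resume2.txt']
```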

spaces : c a t matches c a t
tab (\t) : a\tb matches a , a tab , then b
new line (\n) : a\nd matches a and d on consecutive lines :
a
d

carriage return (\r)

character set [ ] :

/g[aeiuo]t/ matches gat , get , git , gut, got

Character range :

Represents all the characters between two characters, ex - [0-9]
Ranges can be combined in any order : [A-Za-z] , [a-zA-Z] , [0-9A-Z] , [A-Z0-9]

Negative character set : /[^abcdefg]/ matches any one character that is not in the set
see[^abcd] matches sees , seen but not seed
see[^a-d] matches sees , seen but not seea , seeb , seec or seed

Metacharacters inside a character set are treated as literals , except ] , - , ^ , \
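
The character set rules above can be tried out with a short Python sketch. The word lists are assumptions chosen for the example. Note that a negative set still has to match one character, so plain "see" does not match see[^a-d].

```python
import re

# A set matches exactly one of the listed characters
vowel_words = re.findall(r"g[aeiou]t", "gat get git got gut gxt")

# A negated range matches one character outside the range
candidates = ["sees", "seen", "seed", "seea", "see"]
negated = [w for w in candidates if re.fullmatch(r"see[^a-d]", w)]

# Inside a set the dot is a literal dot, not a wildcard
literal_dot = re.findall(r"[.]", "a.b")

print(vowel_words)  # ['gat', 'get', 'git', 'got', 'gut']
print(negated)      # ['sees', 'seen']
print(literal_dot)  # ['.']
```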

Repetition Expressions :

repetition metacharacters :

* - /apples*/ matches apple , apples , applessssssss : 0 or more repetitions

+ - /apples+/ matches apples , applessss , but not apple : 1 or more repetitions

? - /apples?/ matches apple , apples , but not applessss : 0 or 1 repetition
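
A minimal Python sketch comparing the three repetition metacharacters on the same sample words. The helper full_matches is hypothetical, written only for this example.

```python
import re

samples = ["apple", "apples", "applesss"]

def full_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"apples*")      # s repeated 0 or more times
plus = full_matches(r"apples+")      # s repeated 1 or more times
optional = full_matches(r"apples?")  # s repeated 0 or 1 time

print(star)      # ['apple', 'apples', 'applesss']
print(plus)      # ['apples', 'applesss']
print(optional)  # ['apple', 'apples']
```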

Quantified repetition :

{min, max}

min and max are non-negative integers defining the repetition range. min is always required and can be as low as 0 ; if max is left out after the comma, there is no upper limit. {n} with no comma means exactly n repetitions.

for ex :
/\d{5}/ matches exactly 5 digits
/\d{5,}/ matches 5 or more digits
/\d{2,5}/ matches 2 to 5 digits
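
The three quantified forms can be checked with a quick Python sketch. The list of numbers is made up, and re.fullmatch is used so that each string is tested as a whole.

```python
import re

numbers = ["7", "42", "12345", "1234567"]

exactly_five = [n for n in numbers if re.fullmatch(r"\d{5}", n)]
five_or_more = [n for n in numbers if re.fullmatch(r"\d{5,}", n)]
two_to_five = [n for n in numbers if re.fullmatch(r"\d{2,5}", n)]

print(exactly_five)  # ['12345']
print(five_or_more)  # ['12345', '1234567']
print(two_to_five)   # ['42', '12345']
```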

Lazy expressions :

*?
+?
{min,max}?
??

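A lazy quantifier matches as little as possible, while a greedy one matches as much as possible.
/apples??/ : the first ? makes the s optional and, on its own, is greedy, so /apples?/ tried against apples matches apples. The second ? makes that quantifier lazy, so /apples??/ tried against apples matches only apple.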
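
To see greedy and lazy matching side by side, here is a small Python sketch. The HTML-like string is an assumption used only because it makes the difference easy to spot.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)   # .+ grabs as much as it can
lazy = re.findall(r"<.+?>", html)    # .+? stops at the first possible >

greedy_q = re.search(r"apples?", "apples").group()
lazy_q = re.search(r"apples??", "apples").group()

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(greedy_q)  # apples
print(lazy_q)    # apple
```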

Capturing groups and Back referencing :

( ) is the grouping metacharacter : it groups part of the expression together and captures what that part matches.
/(abc)+/ matches abcabcabc
/abc+/ matches abcccccc

/(in)?dependent/ matches both independent and dependent
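
A short Python sketch of how grouping changes what a quantifier applies to. The variable names are chosen for this example only.

```python
import re

grouped = re.fullmatch(r"(abc)+", "abcabcabc")   # + repeats the whole group
ungrouped = re.fullmatch(r"abc+", "abcabcabc")   # + repeats only the c
trailing_c = re.fullmatch(r"abc+", "abcccccc")

optional_prefix = [
    w for w in ["independent", "dependent", "undependent"]
    if re.fullmatch(r"(in)?dependent", w)
]

print(grouped is not None)     # True
print(ungrouped)               # None
print(trailing_c is not None)  # True
print(optional_prefix)         # ['independent', 'dependent']
```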

Anchored expressions :

^ : caret sign means start of string/line
$ : dollar sign means end of string/line
\A : start of string , never start of line
\Z : end of string , never end of line

In single-line mode , ^ and $ match only at the start and end of the whole string , the same as \A and \Z.
In multiline mode (re.MULTILINE in Python) , ^ and $ also match at the start and end of every line ; \A and \Z are unaffected.
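
The anchor behaviour can be sketched with Python's re module. The two-line sample text is an assumption and has no trailing newline, which keeps the \Z case simple.

```python
import re

text = "first line\nsecond line"

single = re.findall(r"^\w+", text)                       # start of string only
multi = re.findall(r"^\w+", text, re.MULTILINE)          # start of every line
start_only = re.findall(r"\A\w+", text, re.MULTILINE)    # \A ignores the mode
line_ends = re.findall(r"\w+$", text, re.MULTILINE)      # end of every line
end_only = re.findall(r"\w+\Z", text, re.MULTILINE)      # \Z ignores the mode

print(single)      # ['first']
print(multi)       # ['first', 'second']
print(start_only)  # ['first']
print(line_ends)   # ['line', 'line']
print(end_only)    # ['line']
```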

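Word boundaries : \b is a word boundary , \B is not a word boundary
A word boundary sits at the start or end of the string when the first or last character is a word character , and between a word character and a non-word character
Anchoring with \b lets the engine reject bad starting positions early , which cuts down on backtracking
A space is not a word boundary ; the boundaries are the zero-width positions at the start and end of each word

apples and oranges
matches /\bapples\b/ and /\boranges\b/ , but not /\bapples\band\boranges\b/ , because \b does not consume the spaces between the words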
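
The word boundary example can be reproduced in Python as a quick sketch. The extra patterns with explicit spaces and the "pineapples" check are additions made for illustration.

```python
import re

text = "apples and oranges"

apples = re.search(r"\bapples\b", text)
oranges = re.search(r"\boranges\b", text)

# \b is zero-width, so nothing here matches the spaces between the words
joined = re.search(r"\bapples\band\boranges\b", text)

# Writing the spaces explicitly makes the pattern match
spaced = re.search(r"\bapples\b \band\b \boranges\b", text)

# No boundary before "apples" inside a longer word
inside = re.search(r"\bapples\b", "pineapples")

print(apples.group(), oranges.group())  # apples oranges
print(joined)                           # None
print(spaced.group())                   # apples and oranges
print(inside)                           # None
```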

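Alternation metacharacter -> | pipe character

Here it is an OR operator : /apple|orange/ matches apple or orange.

Regular expressions are eager and greedy : alternatives are tried left to right and the first one that matches wins , so put the preferred (usually the longest or most specific) alternative first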
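
Why the order of alternatives matters can be seen in a tiny Python sketch. The word "peanutbutter" is an invented sample; Python's re engine tries alternatives left to right.

```python
import re

text = "peanutbutter"

# The first alternative that succeeds is used, even if a later one is longer
short_first = re.search(r"peanut|peanutbutter", text).group()
long_first = re.search(r"peanutbutter|peanut", text).group()

print(short_first)  # peanut
print(long_first)   # peanutbutter
```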

Back referencing : a group stores the matched data , not the expression. Back references \1 through \9
give access to the captured data

/(A?)B\1/ matches "ABA" and "B"
/(A?)B\1C/ matches "ABAC" and "BC" ; when A? matches nothing , the back reference is the empty string
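
A minimal Python sketch of the first back reference example. The candidate strings "AB" and "BA" are added here to show what does not match, since \1 must repeat exactly what the group captured.

```python
import re

pattern = re.compile(r"(A?)B\1")

candidates = ["ABA", "B", "AB", "BA"]
matched = [s for s in candidates if pattern.fullmatch(s)]

# For "B" the group captured an empty string, so \1 is empty too
captured = pattern.fullmatch("B").group(1)

print(matched)         # ['ABA', 'B']
print(repr(captured))  # ''
```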

Find and replace using back references :

for ex
^(\d{1,2}), ([\w.]+?) (\w+?), (\d{4}) -> find

matches :
1, Leonardo Dicaprio, 2016

replace with
$1, $3, $2, $4

result : 1, Dicaprio, Leonardo, 2016
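
The same swap can be sketched in Python with re.sub. Two things here are assumptions of this example rather than part of the notes: the pattern uses ,\s* so that it tolerates the space after each comma, and Python writes back references in the replacement as \1 instead of $1.

```python
import re

line = "1, Leonardo Dicaprio, 2016"

# Groups: 1 = number, 2 = first name, 3 = last name, 4 = year
pattern = r"^(\d{1,2}),\s*([\w.]+?) (\w+?),\s*(\d{4})$"

# Swap groups 2 and 3 in the output
swapped = re.sub(pattern, r"\1, \3, \2, \4", line)

print(swapped)  # 1, Dicaprio, Leonardo, 2016
```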

Non-capturing group expression : ?: at the start of a group turns off capturing

/(?:\w+)/ -> slightly faster , and leaves the numbered back references free for the groups you actually need
ex: find (oranges) and (apples) , replace with \1
oranges and apples becomes oranges

find (?:oranges) and (apples) , replace with \1

oranges and apples becomes apples
? -> gives this group a different meaning
: -> non-capturing
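
The effect on group numbering can be checked with a short Python sketch of the oranges and apples example.

```python
import re

text = "oranges and apples"

# Two capturing groups: \1 is oranges
both = re.sub(r"(oranges) and (apples)", r"\1", text)
groups_both = re.search(r"(oranges) and (apples)", text).groups()

# First group is non-capturing, so \1 is now apples
one = re.sub(r"(?:oranges) and (apples)", r"\1", text)
groups_one = re.search(r"(?:oranges) and (apples)", text).groups()

print(both, groups_both)  # oranges ('oranges', 'apples')
print(one, groups_one)    # apples ('apples',)
```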

Lookaround assertions :
There are two types : lookahead and lookbehind assertions. Both check the surrounding text without making it part of the match.

Positive lookahead : asserts what must come next , /(?=regex)/
/(?=seashore)sea/ matches the sea in seashore but not the sea in seaside. ?= looks ahead to check that the pattern is present , but does not include it in the match

Negative lookahead : /(?!regex)/

/(?!seashore)sea/ matches the sea in seaside but not the sea in seashore
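
A small Python sketch of both lookaheads, reporting where each match starts. The combined sample string is an assumption; "seashore" begins at index 0 and "seaside" at index 9.

```python
import re

text = "seashore seaside"

# Positive lookahead: only the sea that begins "seashore"
positive = [m.start() for m in re.finditer(r"(?=seashore)sea", text)]

# Negative lookahead: only the sea that does not begin "seashore"
negative = [m.start() for m in re.finditer(r"(?!seashore)sea", text)]

print(positive)  # [0]
print(negative)  # [9]
```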

Lookbehind assertion :
/(?<=regex)/ positive lookbehind assertion
/(?<!regex)/ negative lookbehind assertion

/(?<=base)ball/ matches the ball in baseball but not the ball in football

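Matching a position :

This costs 54.00 or $54.00.

(?<![$\d])\d+\.\d\d matches only the first 54.00 , the one not preceded by $ or a digit
(?<![$\d])(?=\d+\.\d\d) matches no characters at all , only the position just before that 54.00 , which helps when inserting text (for example the missing $)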
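
The lookbehind examples can be sketched in Python as follows. Inserting a "$" at the matched position is an assumed use case for this example; in Python's re.sub a "$" in the replacement string is a plain character.

```python
import re

words = "baseball football"
ball_positions = [m.start() for m in re.finditer(r"(?<=base)ball", words)]

text = "This costs 54.00 or $54.00."

# Amounts that are not preceded by $ or by another digit
bare_prices = re.findall(r"(?<![$\d])\d+\.\d\d", text)

# Zero-width match: the position in front of a bare amount, used for inserting
with_dollar = re.sub(r"(?<![$\d])(?=\d+\.\d\d)", "$", text)

print(ball_positions)  # [4]
print(bare_prices)     # ['54.00']
print(with_dollar)     # This costs $54.00 or $54.00.
```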

Unicode and Multibyte Characters :

Single byte : uses 8 bits to represent a character , which allows for 256 characters
A-Z , a-z , 0-9 , punctuation , common symbols

Double byte : 16 bits ; 2 to the power of 16 = 65,536 characters

Unicode encodings use a variable byte size : 1 , 2 or more bytes ; Unicode allows over 1 million characters
It is a mapping between characters and numbers (code points)
A code point is written as "U+" followed by a hexadecimal number of at least four digits
infinity is written as U+221E

"cafe","café"

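café can be encoded as four or five code points : é is either the single code point U+00E9 or a plain e followed by the combining acute accent U+0301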

unicode indicator : \u
\u followed by hexadecimal number (0000-FFFF)
/caf\u00E9/ matches café not cafe
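
The two encodings of café and the \u escape can be explored with a minimal Python sketch. It relies on the standard unicodedata module to produce the composed and decomposed forms; the variable names are chosen for this example.

```python
import re
import unicodedata

# Composed form: é is the single code point U+00E9
composed = unicodedata.normalize("NFC", "cafe\u0301")
# Decomposed form: plain e followed by combining accent U+0301
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

pattern = re.compile(r"caf\u00E9")

print(len(composed), len(decomposed))  # 4 5
print(bool(pattern.search(composed)))  # True
print(bool(pattern.search("cafe")))    # False
# The escape names one code point, so the decomposed form is not matched
print(bool(pattern.search(decomposed)))  # False
```

Normalizing text to one form before matching is a common way to avoid missing the other encoding.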

Unicode wildcard and properties :
Unicode wildcard : \X matches any single character together with its combining marks , and always matches line breaks , like /./s
/caf\X/ matches café and cafe

Unicode property : \p
/\p{Mark}/ or /\p{M}/ matches any mark (just accents)

Letter L
Mark M
Separator Z
Punctuation P
Number N
Symbol S
Other C

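Unicode not-property : \P

/caf\P{M}\p{M}/ matches café when é is stored as e plus a combining accent , but not cafe

Unicode properties are supported by many engines (Perl , PCRE , Java and .NET , for example) but not all ; Python's built-in re module does not support \p or \X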

Now Practice!

Tuesday, November 1, 2016
