Cracking IT Interview



Regular Expressions : BRE, ERE 


A regular expression is a sequence of ordinary and special characters that describes a pattern. The special characters are called special because each one carries a meaning beyond its literal value. Some of them look the same as the shell's wildcard (meta) characters, but their meaning is different. A regular expression must be interpreted by the command (for example grep), not by the shell, so we enclose it in quotes to hide it from the shell. The examples here use a pair of double quotes (""); single quotes ('') are safer in general, because inside double quotes the shell still treats "$", the backquote and the backslash specially. In this way, we ensure that the shell does not interpret these special characters.
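The effect of quoting can be seen without grep at all. The following is a minimal sketch, assuming a scratch directory that contains one file named Parker.txt (the directory and the file name are made up for this illustration):

```shell
# Scratch directory so the demo does not touch real files (hypothetical setup).
demo_dir=$(mktemp -d)
printf 'Parker\nprker\n' > "$demo_dir/Parker.txt"

# Unquoted: the shell expands Pa* to matching file names before echo runs.
unquoted=$(cd "$demo_dir" && echo Pa*)

# Single-quoted: the pattern reaches the command unchanged.
quoted=$(cd "$demo_dir" && echo 'Pa*')

echo "unquoted: $unquoted"   # prints the file name Parker.txt
echo "quoted:   $quoted"     # prints the pattern text Pa*

rm -rf "$demo_dir"
```

If grep had received the unquoted form, it would have been given a file name as its pattern instead of the regular expression we intended.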


Regular expressions may be a little confusing at first, but they are very flexible and powerful for pattern searching. In some situations they are the only practical way to get the desired result. Take a scenario where we have to search for 10-digit mobile contact numbers in a large file containing millions of records. This becomes possible when we combine the special characters in the correct sequence inside the pattern.


There are two types of regular expressions:


  1. Basic Regular Expression (BRE)
  2. Extended Regular Expression (ERE) 







Basic Regular Expression (BRE)


The following special character sets are used as BRE in pattern search:


Characters              Brief Descriptions
* (Asterisk)            Zero or more occurrences of the preceding character
. (Period)              Represents any single character in pattern search
^ (Caret)               Represents the beginning of the line in pattern search
$ (Dollar)              Represents the end of the line in pattern search
[] (Character class)    Any single one of the characters listed inside the character class












Here we are going to elaborate more with suitable examples:


The "*" Asterisk : 


It matches zero or more occurrences of the immediately preceding character. For example, "a*" matches nothing at all, "a", "aa", "aaa" or any longer run of "a" in the text.


Example:


$cat worker.txt
1  Parker
2  Peter Paarker
3  Blaze
4  Brain
5  Dorothy
6  Linda
7  prker


We have to search for the lines containing the worker name "Parker", "Paarker" or "prker". The "-i" option used below makes grep ignore case, so "P" also matches "p" (please refer to the grep filter for this option).


$grep -i "Pa*rker" worker.txt
1  Parker
2  Peter Paarker
7  prker
$


The "." Period / Dot :


It matches any single character. Let's say we have to search the "worker.txt" file for a worker whose name starts with the letter "L" and consists of 5 letters:


$grep "L...." worker.txt
6  Linda
$
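Note that "L...." on its own only requires "L" followed by any four characters somewhere in the line, so longer names would match too. A minimal sketch of a stricter pattern, assuming a temporary file with one extra, made-up record ("8  Lindsay") that is not part of worker.txt:

```shell
# Temporary sample; the Lindsay record is hypothetical and only for illustration.
names=$(mktemp)
printf '%s\n' '6  Linda' '8  Lindsay' > "$names"

# Loose: L followed by any four characters anywhere in the line.
loose=$(grep -c 'L....' "$names")

# Strict: a space, then L plus exactly four characters, then end of line.
strict=$(grep ' L....$' "$names")

echo "loose match count: $loose"   # both records match
echo "strict match: $strict"       # only the 5-letter name matches

rm -f "$names"
```

The strict form relies on the name being the last field of the line, which holds for this sample but is an assumption about the data.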


When the "." period and the "*" asterisk are used together as ".*", the pattern matches any number of any characters, which is very useful in a pattern string:


$grep -i "p.*" worker.txt
1  Parker
2  Peter Paarker
7  prker
$


Similarly:


$grep -i "b.*" worker.txt
3  Blaze
4  Brain
$







The "^" Caret :


It anchors the pattern match to the beginning of the line. Let's say we have to search for the line which starts with serial number 4 in the "worker.txt" file:


$grep "^4" worker.txt
4  Brain
$


Similarly, to find the worker names starting with the letter "B" (the three dots skip the serial number and the two spaces after it):


$grep "^...B" worker.txt
3  Blaze
4  Brain
$


The "$" Dollar:


It anchors the pattern match to the end of the line. Let's say we have to search for the lines which end with the letter "r" in the "worker.txt" file:


$grep "r$" worker.txt
1  Parker
2  Peter Paarker
7  prker
$


When "^" and "$" are used together with no other characters between them ("^$"), the pattern represents a blank line in a file. This is very useful for deleting blank lines from a file (refer to the grep filter).
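A minimal sketch of the blank-line pattern, assuming a small temporary file whose contents are made up for this illustration:

```shell
# Temporary sample file containing two blank lines (illustrative data).
sample=$(mktemp)
printf 'Parker\n\nBlaze\n\nLinda\n' > "$sample"

# Count blank lines: ^ immediately followed by $ means nothing in between.
blank_count=$(grep -c '^$' "$sample")

# Print everything except the blank lines (-v inverts the match).
non_blank=$(grep -v '^$' "$sample")

echo "blank lines: $blank_count"
echo "$non_blank"

rm -f "$sample"
```

A line that contains only spaces or tabs is not empty, so "^$" does not match it.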



The "[]" character class:

It matches any single one of the characters listed inside the brackets.


$grep -i "P[ar]rker" worker.txt
1  Parker

$
$grep -i "P[ar]*rker" worker.txt
1  Parker
2  Peter Paarker
7  prker
$


The "*" asterisk allows multiple occurrences but gives no way to set an upper bound. With the interval expression "{}", written as "\{ \}" in BRE, we can specify the lower and upper bounds on the number of occurrences of the preceding character or character class.


Let's add a few contact numbers to the "worker.txt" file:


$cat worker.txt
1  Parker            04264554792
2  Peter Paarker     04653476238
3  Blaze
4  Braain
5  Dorothy           4768256981
6  Linda
7  prker

Now we have to list the records of the workers who have contact numbers in the database.


$grep "[0]*[0-9]\{10\}" worker.txt
1  Parker            04264554792
2  Peter Paarker     04653476238
5  Dorothy           4768256981
$


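Here:

[0]* - matches the optional zeros before the contact number.

[0-9] - matches any one digit from 0 to 9; a hyphen inside a character class specifies a range of characters like this.

\{10\} - requires exactly 10 occurrences of the preceding [0-9]. In BRE the braces must be written with backslashes (\) to get this special meaning; without the backslashes they are ordinary characters. The backslashes are not there to hide anything from the shell, the quotes already do that.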
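The example above uses a fixed count. A lower and an upper bound are written as "\{m,n\}". The following is a minimal sketch, assuming a temporary file in which records 8 and 9 are made up to show numbers that are too short and too long:

```shell
# Temporary sample; records 8 and 9 are hypothetical.
contacts=$(mktemp)
cat > "$contacts" <<'EOF'
1  Parker            04264554792
5  Dorothy           4768256981
8  Short             12345
9  Long              123456789012
EOF

# \{10,11\}: at least 10 and at most 11 digits.
# The leading space and the trailing $ stop a longer run of digits
# from matching on just a part of it.
bounded=$(grep ' [0-9]\{10,11\}$' "$contacts")

# Without anchors, the 12-digit number also matches,
# because it contains a run of 10 digits.
unanchored=$(grep -c '[0-9]\{10\}' "$contacts")

echo "$bounded"
echo "unanchored match count: $unanchored"

rm -f "$contacts"
```

The bounds only limit the repetition itself. Whether longer numbers are rejected depends on what surrounds the interval in the pattern, which is why the sketch anchors it on both sides.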







Extended Regular Expression (ERE)

As the name implies, this is the extended form of BRE. Its additional special characters make it easier to match dissimilar patterns. These expressions have to be used with the "-E" option of the grep command. Another option is the "egrep" command, which is used without the "-E" option; note that "egrep" is considered obsolescent and "grep -E" is the preferred form. ERE also supports the BRE special characters described above, although the interval is written as "{}" without backslashes.



Characters          Brief Description
+ (Plus)            One or more occurrences of the preceding character
? (Question mark)   Zero or one occurrence of the preceding character
| (Pipe)            Used as "OR"
() (Parentheses)    Used to group the pattern
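The relationship between the two syntaxes can be checked directly. A minimal sketch, assuming a temporary copy of a few records from worker.txt, that runs the same interval search once as BRE and once as ERE:

```shell
# Temporary copy of a few worker.txt records.
workers=$(mktemp)
cat > "$workers" <<'EOF'
1  Parker            04264554792
2  Peter Paarker     04653476238
3  Blaze
5  Dorothy           4768256981
EOF

# BRE: the interval braces need backslashes.
bre=$(grep '[0-9]\{10\}' "$workers")

# ERE: the same interval is written with plain braces.
ere=$(grep -E '[0-9]{10}' "$workers")

echo "$ere"
if [ "$bre" = "$ere" ]; then
    echo "BRE and ERE forms select the same lines"
fi

rm -f "$workers"
```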





 







The "+" Plus


It matches one or more occurrences of the preceding character. Let's take the "worker.txt" file to understand this:

$cat worker.txt
1  Parker            04264554792
2  Peter Paarker     04653476238
3  Blaze
4  Braain
5  Dorothy           4768256981
6  Linda
7  prker

$
$grep -E "Pa+rker" worker.txt
1  Parker            04264554792
2  Peter Paarker     04653476238
$
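For comparison, the same search can be written in BRE by repeating the character before the asterisk: "aa*" means one "a" followed by zero or more "a". A minimal sketch, assuming a temporary copy of the worker.txt records shown above:

```shell
# Temporary copy of the worker.txt records.
workers=$(mktemp)
cat > "$workers" <<'EOF'
1  Parker            04264554792
2  Peter Paarker     04653476238
3  Blaze
4  Braain
5  Dorothy           4768256981
6  Linda
7  prker
EOF

# ERE: a+ means one or more "a".
ere_plus=$(grep -E 'Pa+rker' "$workers")

# BRE: "aa*" is one "a" followed by zero or more "a".
bre_equiv=$(grep 'Paa*rker' "$workers")

echo "$ere_plus"

rm -f "$workers"
```

Both commands select records 1 and 2; "prker" is left out because at least one "a" is required.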


The "?" Question mark


It matches zero or one occurrence of the preceding character in the defined pattern.

$grep -E "Pa?rker" worker.txt
1  Parker            04264554792
$
$grep -i -E "Pa?rker" worker.txt
1  Parker            04264554792
7  prker
$

The "|" pipe

It acts as an OR operator between patterns. We can give two or more patterns separated by "|", and grep displays every line of the given file that matches any one of them.

$grep -E "Linda|Blaze" worker.txt
3  Blaze
6  Linda
$
$grep -E "Do?rothy|L+inda" worker.txt
5  Dorothy           4768256981
6  Linda
$
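One detail worth checking is that spaces inside an ERE are literal characters, including any spaces written around "|". A minimal sketch, assuming a temporary file holding two records without contact numbers, so that nothing follows the names:

```shell
# Temporary sample: two records with nothing after the names.
workers=$(mktemp)
printf '%s\n' '5  Dorothy' '6  Linda' > "$workers"

# The alternatives here are "Dorothy " (trailing space) and " Linda" (leading space).
with_spaces=$(grep -E 'Dorothy | Linda' "$workers")

# Without the spaces, both names match as written.
no_spaces=$(grep -E 'Dorothy|Linda' "$workers")

echo "with spaces:"
echo "$with_spaces"      # only the Linda record: no space follows "Dorothy"
echo "without spaces:"
echo "$no_spaces"        # both records

rm -f "$workers"
```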

The "()" parentheses

It gives you the ability to group parts of the pattern together.

$grep -E "(p|Pa)rker" worker.txt
1  Parker            04264554792
7  prker
$
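Grouping matters because "|" otherwise splits the whole pattern into alternatives. A minimal sketch, assuming a temporary copy of the worker names without contact numbers, that compares an ungrouped and a grouped pattern:

```shell
# Temporary copy of the worker names.
workers=$(mktemp)
cat > "$workers" <<'EOF'
1  Parker
2  Peter Paarker
3  Blaze
4  Brain
5  Dorothy
6  Linda
7  prker
EOF

# Ungrouped: the alternatives are "P" and "prker",
# so any line containing a capital P matches.
ungrouped=$(grep -c -E 'P|prker' "$workers")

# Grouped: "P" or "p", immediately followed by "rker".
grouped=$(grep -c -E '(P|p)rker' "$workers")

echo "ungrouped match count: $ungrouped"   # records 1, 2 and 7
echo "grouped match count:   $grouped"     # record 7 only

rm -f "$workers"
```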

