Getting Started With Regex
Is regex the hardest thing to learn in any programming language?
“Yes” was my answer, until I actually started learning it.
Regex is like the end slice of a loaf of bread: almost everyone skips it while learning to program. In this article I cover the regex basics with source code and examples. Follow along, and I am sure that by the end your perception of regex will change, just as mine did.
1. What Is Regex and Where Can It Be Used?
- A regex, or regular expression, is a sequence of characters that forms a search pattern. The pattern can be used to search a string or a text file and to find one or more substrings that satisfy it.
- For example, consider a scenario where you need to extract all the email IDs or contact numbers from a text file. Doing this manually means going through every line of the file and checking whether it contains an email ID or a contact number. With regex, we can find them all in just two or three lines of code.
Here are some common use cases for regex:
- Extracting emails from a text document.
- Web scraping (data collection).
- Working with date-time features.
- Text preprocessing in NLP (natural language processing), and many more.
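To make the first use case concrete, here is a minimal sketch of pulling email IDs and contact numbers out of a block of text. The sample text, the addresses and the phone numbers are made up for illustration, and both patterns are deliberately simplified; they are not complete validators.

```python
import re

# Hypothetical sample text; the addresses and numbers are invented.
text = """
Contact alice@example.com or call 555-123-4567.
Backup contact: bob.smith@mail.example.org, phone 555-987-6543.
"""

# Simplified patterns: fine for a demo, not a full email/phone validator.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

emails = email_pattern.findall(text)
phones = phone_pattern.findall(text)

print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
print(phones)  # ['555-123-4567', '555-987-6543']
```

The individual pieces of these patterns (sets, quantifiers, special sequences) are covered in the sections that follow.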
2. How Can Regex Be Implemented in Python?
- Python’s vast collection of built-in libraries is one of the reasons many newcomers to programming choose Python as their first language.
- The re module in Python supports Perl-style (Perl-flavor) regex. (Perl is the language that popularized this regex syntax.)
- Let us look at a simple Python program that extracts an email address from a given string.
import re

text = "my mail id is xyz123@gmail.com"

# create a pattern to extract the mail id from the above text
pattern = re.compile(r"[A-Za-z]+[0-9]*@gmail\.com")
print(pattern.findall(text))
OUTPUT :
['xyz123@gmail.com']
Don’t be intimidated by the program above; it will be your cup of tea once you learn the regex basics.
3. Meta Characters in Regex
- Meta characters are characters that have a special meaning, unlike literals, which simply match themselves.
- List of meta characters:
[] - A set of characters - "[a-z]"
\ - Signals a special sequence, or escapes a special character - "\d"
. - Any character (except newline) - "he..o"
^ - Starts with - "^hi"
$ - Ends with - "bye$"
* - Zero or more occurrences of the preceding item - "hi*"
+ - One or more occurrences of the preceding item - "hi+"
| - Either or - "ravi|ramu"
{} - Exactly the specified number of occurrences - "a{2}"
Here is a simple program that checks whether a string starts with “Hi”.
import re

# create a compiled pattern object
pattern = re.compile("^Hi")
text = "Hi All Happy Learning !"

if pattern.search(text):
    print("The string starts with Hi.")
else:
    print("The string doesn't start with Hi.")
OUTPUT :
The string starts with Hi.
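The other meta characters can be tried out in the same way. The following is a small sketch, using a made-up sample string, that runs several of them through re.findall() and re.search().

```python
import re

# Made-up sample string for illustration.
text = "hi hello hiii bye"

any_char = re.findall(r"h.", text)          # "." matches any one character after "h"
one_or_more = re.findall(r"hi+", text)      # "+" needs at least one "i"
zero_or_more = re.findall(r"hi*", text)     # "*" also accepts "h" with no "i"
either = re.findall(r"hello|bye", text)     # "|" matches either alternative
exactly_two = re.findall(r"i{2}", text)     # "{2}" matches exactly two "i"s in a row
ends_with_bye = bool(re.search(r"bye$", text))  # "$" anchors at the end

print(any_char)       # ['hi', 'he', 'hi']
print(one_or_more)    # ['hi', 'hiii']
print(zero_or_more)   # ['hi', 'h', 'hiii']
print(either)         # ['hello', 'bye']
print(exactly_two)    # ['ii']
print(ends_with_bye)  # True
```

Comparing the results of "hi+" and "hi*" shows the difference between the two quantifiers: only "hi*" picks up the bare "h" in "hello".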
4. Special sequences in Regex
- A special sequence starts with a “\” followed by a character, and it has a special meaning.
- In Python, write these patterns as raw strings (for example r"\bain"), so that Python itself does not interpret the backslash; in a normal string, "\b" is a backspace character.
- List of special sequences available in regex:
\A - matches if the specified characters are at the start of the string --> "\Athe"
\b - matches where the specified characters are at the beginning or at the end of a word --> r"\bain" or r"ain\b"
\B - matches where the specified characters are present, but NOT at the beginning (or at the end) of a word --> r"\Bain" or r"ain\B"
\d - matches any digit (0 to 9) --> "\d"
\D - matches any character that is NOT a digit --> "\D"
\s - matches any whitespace character --> "\s"
\S - matches any character that is NOT a whitespace character --> "\S"
\w - matches any word character (letters a to z and A to Z, digits 0 to 9, and the underscore _) --> "\w"
\W - matches any character that is NOT a word character --> "\W"
\Z - matches if the specified characters are at the end of the string --> "Spain\Z"
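The special sequences above can be checked with a short sketch. The sample sentence is made up, and the patterns are written as raw strings so the backslashes reach the re module unchanged.

```python
import re

# Made-up sample sentence for illustration.
text = "The rain in Spain lasted 3 days"

starts_with_the = bool(re.search(r"\AThe", text))   # \A: start of the string
words_with_s = re.findall(r"\bS\w+", text)          # \b: "S" at the start of a word
inside_word = re.findall(r"\Bain", text)            # \B: "ain" not at the start of a word
digits = re.findall(r"\d", text)                    # \d: any digit
chunks = re.findall(r"\S+", text)                   # \S+: runs of non-whitespace
ends_with_days = bool(re.search(r"days\Z", text))   # \Z: end of the string

print(starts_with_the)  # True
print(words_with_s)     # ['Spain']
print(inside_word)      # ['ain', 'ain']
print(digits)           # ['3']
print(chunks)           # ['The', 'rain', 'in', 'Spain', 'lasted', '3', 'days']
print(ends_with_days)   # True
```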
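Here is a sample program that extracts student names and their roll numbers.
import re

text = """
Stud1: Rupesh Keesaram:176
Stud2: Rajasekhar Reddy:872
Stud3: Harshavardhan Reddy:923
Stud4: Sumanth Dasariraju:999
"""

# pattern to extract student names
name_pattern = re.compile(r"Stud\d:\s(\w+\s\w+)")
# pattern to extract student roll numbers (\d+ requires at least one digit)
rollnumber_pattern = re.compile(r":(\d+)")

print(name_pattern.findall(text))
print(rollnumber_pattern.findall(text))
OUTPUT :
['Rupesh Keesaram', 'Rajasekhar Reddy', 'Harshavardhan Reddy', 'Sumanth Dasariraju']
['176', '872', '923', '999']
Explanation :
- name_pattern matches “Stud”, one digit, a colon and a whitespace character, then captures two words separated by a whitespace character. Because the pattern contains a group, findall() returns only the captured part.
- rollnumber_pattern captures the digits that follow a colon. With \d* instead of \d+, the colon after “Stud1” would also match and produce an empty string in the result, because * allows zero digits.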
5. Sets in Regex (Regular Expressions)
- A set is a collection of characters inside a pair of square brackets [], and it has a special meaning.
[abc] --> matches if one of the characters a, b or c is present.
[a-z] --> matches any lowercase letter.
[A-N] --> matches any uppercase letter from A to N.
[^arn] --> matches any character EXCEPT a, r and n.
[0124] --> matches if any of the digits 0, 1, 2 or 4 is present.
[0-9] --> matches any digit from 0 to 9.
[0-5][0-9] --> matches any two-digit number from 00 to 59.
[A-Za-z] --> matches any letter, lowercase or uppercase.
[+] --> inside a set, +, * and similar characters have no special meaning, so this matches a literal + sign.
Sample program to extract phone numbers
import re

text = """
530-896-5970
Landline, from Chico, CA(state),USA
814-765-5032
Landline, from Clearfield, PA(state),USA
714-921-5629
Landline, from Orange, CA(state),USA
863-303-9189
Landline, from Arcadia, FL(state),USA
470-375-4326
Landline, from Lewisville, TX(state),USA
"""

# pattern for phone numbers
phone_pattern = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
print(phone_pattern.findall(text))
OUTPUT :
['530-896-5970', '814-765-5032', '714-921-5629', '863-303-9189', '470-375-4326']
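A few more sets from the list can be exercised with a short sketch. The sample strings are invented, and the time pattern is only a rough illustration: it restricts the minutes to 00 to 59 but would still accept an hour such as 29.

```python
import re

# Invented sample sentence; "12:75" is deliberately not a valid time.
text = "Meeting at 09:45, break at 12:75, cost 3+4 dollars"

# [0-5][0-9] only accepts minutes from 00 to 59, so "12:75" is skipped.
times = re.findall(r"[0-2][0-9]:[0-5][0-9]", text)

# Inside a set, + is a literal plus sign.
plus_signs = re.findall(r"[+]", text)

# [^arn] matches every character except a, r and n.
not_arn = re.findall(r"[^arn]", "brain")

print(times)       # ['09:45']
print(plus_signs)  # ['+']
print(not_arn)     # ['b', 'i']
```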
6. Regex Functions
The re module provides a number of functions that let us search a string for a match. Here we discuss five major functions that are used frequently.
1. compile() :
compile() converts a regular expression into a regex pattern object, on which we can later perform searching and matching operations.
Syntax :-
re.compile(pattern, flags=0)
Example program :-
import re

s = "hello world"
pattern = re.compile("^hello")
print(pattern.findall(s))
OUTPUT :
['hello']
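A compiled pattern is most useful when the same regex is applied many times. Here is a small sketch, with made-up input lines, that compiles a pattern once with the standard re.IGNORECASE flag and reuses it in a loop.

```python
import re

# Compile once; re.IGNORECASE makes the match case-insensitive.
greeting = re.compile(r"^hello", re.IGNORECASE)

# Made-up input lines for illustration.
lines = ["hello world", "Hello again", "say hello"]

# Reuse the same pattern object for every line.
matches = [line for line in lines if greeting.search(line)]

print(matches)  # ['hello world', 'Hello again']
```

"say hello" is left out because ^ anchors the match at the start of the string.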
2. search() :
- The re.search() function searches the string for a match and returns a match object if there is one.
- If there is more than one match, only the first occurrence is returned.
- If no match is found, None is returned.
- Methods such as start(), end(), span(), group() and groupdict(), and attributes such as lastindex, are available on the match object.
NOTE : Practice different permutations and combinations of examples to get a good understanding of regex.
Syntax :-
re.search(pattern, string, flags=0)
Example program :-
import re

s = "The rain in spain"
search_obj = re.search(r"ain\b", s)
print(search_obj.start())
print(search_obj.end())
OUTPUT :
5
8
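The match object carries more than the start and end positions. The sketch below uses an invented order string and named groups to show group(), span(), groupdict() and lastindex, along with the None check that search() makes necessary.

```python
import re

# Invented sample string for illustration.
s = "Order 66 shipped on 2024-01-15"

# Named groups (?P<name>...) make the captured parts easier to read back.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", s)

if m:  # search() returns None when nothing matches, so check first
    print(m.group())         # the whole match: 2024-01-15
    print(m.group("year"))   # one named group: 2024
    print(m.span())          # (start, end) of the match
    print(m.groupdict())     # all named groups as a dict
    print(m.lastindex)       # index of the last matched group: 3

no_match = re.search(r"refund", s)
print(no_match)  # None
```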
3. split() :
- The split function returns a list in which the string has been split at each match.
Syntax :
re.split(pattern, string, maxsplit=0, flags=0)
Example program :
import re

s = "The rain in spain"
parts = re.split(r"\s", s)
print(type(parts))
print(parts)
OUTPUT :
<class 'list'>
['The', 'rain', 'in', 'spain']
- We can control the number of splits by specifying the maxsplit parameter.
import re

s = "The rain in spain"
parts = re.split(r"\s", s, maxsplit=1)
print(parts)
OUTPUT :
['The', 'rain in spain']
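re.split() becomes more useful than str.split() when there are several possible delimiters. The following sketch, on made-up strings, splits on commas, semicolons and whitespace at once, and shows that a capturing group in the pattern keeps the delimiters in the result.

```python
import re

# Made-up string with mixed delimiters.
s = "apple, banana;cherry  orange"

# A set plus + treats any run of commas, semicolons or whitespace as one separator.
parts = re.split(r"[,;\s]+", s)

# With a capturing group, the matched delimiters are returned as list items too.
with_delims = re.split(r"([,;])", "a,b;c")

print(parts)        # ['apple', 'banana', 'cherry', 'orange']
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```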
4. sub() :
- The sub function replaces the matches with the text of your choice.
Syntax :
re.sub(pattern, repl, string, count=0, flags=0)
Example program :
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "-", txt)
print(x)
OUTPUT :
The-rain-in-Spain
- We can also control the number of replacements by specifying the count parameter.
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "-", txt, count=2)
print(x)
OUTPUT :
The-rain-in Spain
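The replacement in re.sub() does not have to be fixed text. Here is a minimal sketch of two further options: backreferences such as \1 in the replacement string, and a function as the replacement. The date string and the helper name shout are made up for illustration.

```python
import re

# Made-up date string for illustration.
date = "2024-01-15"

# \1, \2 and \3 in the replacement refer to the three captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

def shout(match):
    # Hypothetical helper: receives the match object, returns the replacement text.
    return match.group().upper()

# Upper-case every four-letter word.
loud = re.sub(r"\b\w{4}\b", shout, "The rain in Spain")

print(swapped)  # 15/01/2024
print(loud)     # The RAIN in Spain
```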
5. findall() :
- The findall function returns a list containing all the matches.
Syntax :
re.findall(pattern, string, flags=0)
Example program :
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
OUTPUT :
['ai', 'ai']
- The list contains the matches in the order in which they are found.
- If no matches are found, an empty list is returned.
import re

txt = "The rain in Spain"
x = re.findall("money", txt)
print(x)
OUTPUT :
[]
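One detail worth trying out: when the pattern contains more than one group, findall() returns a list of tuples instead of a list of strings. The sketch below shows this, and also uses re.finditer(), a related function in the re module that yields match objects, to get the position of each match.

```python
import re

txt = "The rain in Spain"

# Two groups in the pattern: each result is a tuple with one item per group.
pairs = re.findall(r"(\w)(ain)", txt)

# finditer() yields match objects, so positions are available as well.
positions = [(m.group(), m.start()) for m in re.finditer(r"ain", txt)]

print(pairs)      # [('r', 'ain'), ('p', 'ain')]
print(positions)  # [('ain', 5), ('ain', 14)]
```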
Thanks for reading!
Click the link below to access my GitHub file, in which I have written different permutations and combinations of regex examples (I will be updating this file regularly).