Regular Expression

Mehmet Ali Baykara
Oct 17, 2020

Regular expression = regex

A few weeks ago I faced a task that required parsing a text and extracting certain pieces of information. There are many ways to handle this kind of task in programming. However, I realized that using a regular expression would be much simpler than writing a parsing function or importing an external library. Especially if you are working in C++, pulling in a third-party parsing library for a small task is an unnecessary dependency. I also noticed that I was not confident with regular expression operations, which pushed me to read some documentation and work through a few examples. In this post I will share my impressions of regular expressions.

https://www.pexels.com/photo/alphabet-close-up-communication-conceptual-278887/

A regular expression is simply a developer-defined pattern that is run over a string to search for matching text. As a developer, you define a pattern describing what should be extracted from the text and then search based on that pattern. Most programming languages provide regex features. Although my latest regex experience was in C++, I decided to use Python as the sample language in this tutorial. Nonetheless, regex usage in other languages is very similar.

Below are the most common characters used to build regex patterns.

\d  Any numeric digit from 0 to 9.
\D  Any character that is not a numeric digit from 0 to 9.
\w  Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.)
\W  Any character that is not a letter, numeric digit, or the underscore character.
\s  Any space, tab, or newline character. (Think of this as matching "space" characters.)
\S  Any character that is not a space, tab, or newline.
+   The plus sign matches one or more occurrences (at least one).
*   The star matches zero or more occurrences; the sought text may or may not occur.
?   The question mark matches zero or one occurrence of the preceding expression.
^   The caret says the sought text must be at the beginning.
$   The dollar sign is the opposite of the caret: the sought text must be at the end.
.   The dot matches any character (except a newline, by default).
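
To get a feel for these symbols, here is a small sketch using Python's re module. The sample strings and variable names are made up for illustration only.

```python
import re

sample = "Room 42, floor_3\tok"

digits = re.findall(r"\d+", sample)   # runs of one or more digits
words = re.findall(r"\w+", sample)    # runs of letters, digits, or underscores
spaces = re.findall(r"\s", sample)    # single whitespace characters
starts_with_room = re.search(r"^Room", sample) is not None  # ^ anchors at the start
ends_with_ok = re.search(r"ok$", sample) is not None        # $ anchors at the end
optional_u = re.findall(r"colou?r", "color colour")         # ? makes the 'u' optional

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'floor_3', 'ok']
print(optional_u)  # ['color', 'colour']
```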

Python has very comprehensive documentation about regular expressions. In Python, regex support is provided by the re module of the standard library (Lib/re.py).

The re library provides the following methods to perform regular expression operations.

import re as regex

pattern = regex.compile(your_pattern)  # define the pattern you want to find
match = pattern.search(your_string)    # search your_string for the pattern
match.group()                          # return the matched text

Let’s take an example and see how it works.

Task: Extract a phone number from a given text. If you implement this task without regex, you have to use many if-else statements and a for or while loop to iterate over the entire text. With a regular expression, it is possible to write the same functionality in far fewer lines of code, and it is usually efficient enough for a task like this.

So we assume the phone number looks like 175-555-10-94, that is, groups of 3, 3, 2, and 2 digits separated by hyphens.
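
For comparison, here is a rough sketch of what the hand-written, regex-free version might look like. It assumes the number is made of groups of 3, 3, 2, and 2 digits separated by hyphens; the function name find_phone_manually is hypothetical and only serves this illustration.

```python
def find_phone_manually(text):
    """Return the first substring shaped like ddd-ddd-dd-dd, or None."""
    shape = "ddd-ddd-dd-dd"  # 'd' = one digit, '-' = a literal hyphen
    for start in range(len(text) - len(shape) + 1):
        candidate = text[start:start + len(shape)]
        matches = True
        for ch, kind in zip(candidate, shape):
            if kind == "d" and ch not in "0123456789":
                matches = False
                break
            if kind == "-" and ch != "-":
                matches = False
                break
        if matches:
            return candidate
    return None


print(find_phone_manually("Call me at 175-555-10-94 tomorrow."))  # 175-555-10-94
```

Even for this fixed format, the loop and the flag handling take more than a dozen lines, and every change to the format means touching the logic again.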

Steps:

  1. Import the regex library, which is called re
  2. Create the wanted pattern
  3. Pass the text to the search method
  4. Invoke group() to get the matched part

A. Extracting the number using \d, which stands for any numeric digit from 0 to 9

import re as regex

s = 'Call me at 175-555-10-94 tomorrow.'
# \d stands for any numeric digit from 0 to 9
exp = regex.compile(r'\d\d\d-\d\d\d-\d\d-\d\d')
mo = exp.search(s)
print('wanted number: ' + mo.group())

The output will be:

wanted number: 175-555-10-94
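
One thing to watch out for: search returns None when nothing matches, so calling group() directly on the result would raise an AttributeError. A minimal sketch of a safer version follows; the helper name extract_phone is hypothetical and not part of the re module.

```python
import re

phone_pattern = re.compile(r"\d\d\d-\d\d\d-\d\d-\d\d")


def extract_phone(text):
    """Return the first phone number found in text, or None if there is none."""
    match = phone_pattern.search(text)
    if match is None:  # search() found nothing
        return None
    return match.group()


print(extract_phone("Call me at 175-555-10-94 tomorrow."))  # 175-555-10-94
print(extract_phone("No number here."))                     # None
```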

We can write this more compactly in two lines of code:

mo = regex.search(r'\d\d\d-\d\d\d-\d\d-\d\d', 'Call me at 175-555-10-94 tomorrow.')
print('wanted number: ' + mo.group())

We can even replace the repeated \d with curly brackets: {number of repetitions}

mo = regex.search(r'\d{3}-\d{3}-\d{2}-\d{2}', 'Call me at 175-555-10-94 tomorrow.')
print('wanted number: ' + mo.group())
# Above, the curly brackets set the number of digits.
:::::::::::output:::::::::::::
wanted number: 175-555-10-94
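
If the text may contain more than one number, re.findall returns every non-overlapping match as a list. The snippet below is a small sketch with a made-up sample text that reuses the same pattern.

```python
import re

text = "Office: 175-555-10-94, mobile: 212-555-01-23."

# findall returns all non-overlapping matches, in order of appearance
numbers = re.findall(r"\d{3}-\d{3}-\d{2}-\d{2}", text)
print(numbers)  # ['175-555-10-94', '212-555-01-23']
```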

B. Using *, called the asterisk, which means the value you search for may occur zero or more times. For example:

import re
wanted = re.search(r'(\S)*(\d)+', 'Call me at 175-555-10-94 tomorrow.')
print('wanted number: ' + wanted.group())
:::::::::::output:::::::::::::
wanted number: 175-555-10-94

In the code above:

wanted is the variable name
re is the module imported from the Python standard library
search is the method from the re library
(\S)* matches zero or more non-whitespace characters, so the match never crosses a space, newline, or tab
(\d)+ matches one or more numeric digits
wanted.group() retrieves the matched text from the string
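
Keep in mind that this pattern is much looser than the explicit digit pattern from part A: it accepts any run of non-whitespace characters that ends in a digit. The sketch below, with a made-up sample sentence, shows the difference.

```python
import re

loose = re.compile(r"(\S)*(\d)+")                 # any non-space run ending in a digit
strict = re.compile(r"\d{3}-\d{3}-\d{2}-\d{2}")   # exactly the phone number shape

text = "Room B12 is free."

loose_match = loose.search(text)
strict_match = strict.search(text)

print(loose_match.group())  # B12
print(strict_match)         # None
```

So the loose pattern is fine when the text is known to contain only a phone number, but the strict one is the safer choice for arbitrary input.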

Each programming language offers, through its library, different methods to retrieve the extracted expression. You may not need to extract a substring at all; instead, you can use the search result in a conditional statement to check whether the sought substring occurs and perform different tasks accordingly. Before we wrap up this regex post, let's look at the pros and cons.
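
As a quick illustration of the conditional use, here is a minimal sketch that relies on the fact that a match object is truthy and None is falsy. The function name describe and its return strings are hypothetical.

```python
import re


def describe(message):
    """Branch on whether the message contains a phone number."""
    if re.search(r"\d{3}-\d{3}-\d{2}-\d{2}", message):
        return "contains a phone number"
    return "no phone number found"


print(describe("Call me at 175-555-10-94 tomorrow."))  # contains a phone number
print(describe("See you tomorrow."))                   # no phone number found
```

This kind of check is often all a small task needs, which is worth keeping in mind when weighing the trade-offs below.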

PROS

  • They save you a lot of lines of code.
  • They match exactly the pattern you describe.

CONS

  • They are hard to read and maintain.
  • They can break the KISS (keep it simple, stupid) principle.
  • They can be inefficient, especially for complex patterns.

Nevertheless, like any other topic in programming, regex has its pros and cons. In my view, the choice depends on many factors, and developers should decide for themselves together with their team. By the way, regex also lets you perform really complicated parsing.

Resources
* https://docs.python.org/3/library/re.html
* https://en.cppreference.com/w/cpp/regex
* https://nostarch.com/automatestuff2
* https://users.pja.edu.pl/~jms/qnx/help/watcom/wd/regexp.html
