Regular Expressions (RegEx)

Sometimes we have a lot of data to sift through and we have a rough idea of what information we are looking for. Regular Expressions help us find that information relatively quickly.

In this blog, I'm going to take you through what RegEx is, the different use cases for RegEx, and some basic examples using qualifiers and quantifiers.

So, without further ado...


A regular expression is a sequence of characters put together to specify a pattern. We can then look for this pattern within any body of text.

If I have a document that contains UK phone numbers, I can write a regular expression that brings back all of them, because UK phone numbers typically have 11 digits.

One could simply search for the exact sequence, but that's very limiting. Searching for a specific number like 07123 456789 would return only that one number and no other phone number. With regex, we can set out rules and get back all the data that matches them. This can be done for emails, postcodes and just about anything else we might want to isolate.
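To make the difference concrete, here is a minimal sketch in Python using the standard `re` module. The sample text is made up for illustration, and the symbols in the pattern are explained in the sections that follow.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "Call Sam on 07123 456789 or Alex on 07987 654321."

# Searching for the exact sequence only finds that one number.
exact_matches = re.findall(r"07123 456789", text)

# A pattern (five digits, a space, six digits) finds every number in that format.
pattern_matches = re.findall(r"\d{5}\s\d{6}", text)

print(exact_matches)    # ['07123 456789']
print(pattern_matches)  # ['07123 456789', '07987 654321']
```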

The first thing we need to know is the regex qualifiers. We won't go through all of them, just the most important. Qualifiers are the "symbols" that tell our regex tool what we are looking for.

Qualifiers:

1. If we want to find an alphanumeric character (a letter or a number) we use \w. Note that \w also matches the underscore (_). We can see this in the image below. All individual numbers and letters have been identified, and all the symbols and spaces have been left behind.

2. If we want just numbers then we use \d (d as in digits).

3. If we want just letters we can use [a-z] or [A-Z]. There are two because regex is case sensitive; to match both cases at once we can use [a-zA-Z]. The two images below show the different characters that are picked up depending on what is used.

With lowercase [a-z]
With UPPERCASE [A-Z]

4. If we want a space we use \s. This way only the spaces are highlighted.

5. If we want to find a character that could be anything (a wildcard) then we use a . (a full stop). In most tools this matches any character except a line break. It is important to note that this is NOT a backslash dot (\.) because that would only bring out the full stops.

Using a . (dot)
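If you'd like to try the qualifiers above outside of a regex tool, here is a minimal sketch in Python with the standard `re` module. The sample string is hypothetical, and `re.findall` simply returns a list of every match.

```python
import re

# Hypothetical sample string, made up for illustration.
sample = "Hi Bob. 2 cats!"

word_chars = re.findall(r"\w", sample)    # letters, digits and underscore
digits = re.findall(r"\d", sample)        # digits only
lowercase = re.findall(r"[a-z]", sample)  # lowercase letters only
uppercase = re.findall(r"[A-Z]", sample)  # uppercase letters only
spaces = re.findall(r"\s", sample)        # whitespace only
anything = re.findall(r".", sample)       # any character except a line break
full_stops = re.findall(r"\.", sample)    # literal full stops only

print(digits)      # ['2']
print(lowercase)   # ['i', 'o', 'b', 'c', 'a', 't', 's']
print(uppercase)   # ['H', 'B']
print(full_stops)  # ['.']
```

Notice that the plain dot picks up every character in the string, while the backslash dot only picks up the single full stop.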

To put this in a little bit more context, let's use the example from above about finding phone numbers. We know that phone numbers would only be digits and we know that UK phone numbers have 11 digits so we can put in \d eleven times.

If you do try this, you might run into a problem. In the photo below I have rewritten the phone number in a slightly different format. Because of the space between the 5th and 6th digits, the first pattern can no longer recognise the phone number. We'd have to put a space in the pattern as well, and the result of this can be seen in the second picture: \d\d\d\d\d\s\d\d\d\d\d\d

Without the space: \d\d\d\d\d\d\d\d\d\d\d
With the space: \d\d\d\d\d\s\d\d\d\d\d\d
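The same two patterns can be tried in Python. This is only a sketch, and the sample text (the same number written in both formats) is made up for illustration.

```python
import re

# Hypothetical sample text with the same number in two formats.
text = "Old format: 07123456789. New format: 07123 456789."

# Eleven digits in a row: only matches the number written without a space.
no_space = re.findall(r"\d\d\d\d\d\d\d\d\d\d\d", text)

# Five digits, a space, six digits: only matches the number written with a space.
with_space = re.findall(r"\d\d\d\d\d\s\d\d\d\d\d\d", text)

print(no_space)    # ['07123456789']
print(with_space)  # ['07123 456789']
```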

You may find it very inefficient to type out \d so many times, and this is where quantifiers come in. They specify how many characters in the qualified format we want to match.

Quantifiers:

1. If we want a specified number of characters we can use curly brackets {}. For example, if we want 3 digits we can type in \d{3} and this will bring back every set of 3 consecutive digits that can be found.

2. If we don't know how many characters we need to return but we know that it will be one or more, then we can use a + symbol. In the image below I have requested numbers to be returned but I don't know how many digits there will be in each sequence so I have used \d+.

At first glance, the image below looks just like the \d example above. However, if you look closely, you'll notice that the numbers in this example are split into groups of one or more digits, whereas in the example above each digit was selected individually.

This is the grouping of the numbers when \d+ is used.
This is how the numbers are returned when just \d is used without the +
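Here is a small Python sketch of the quantifiers above, again using a made-up sample string. It compares \d on its own with \d+ and \d{3}.

```python
import re

# Hypothetical sample string, made up for illustration.
sample = "Order 12 has 345 items and costs 6789 pounds"

singles = re.findall(r"\d", sample)     # each digit on its own
groups = re.findall(r"\d+", sample)     # runs of one or more digits
triples = re.findall(r"\d{3}", sample)  # exactly three digits in a row

print(singles)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
print(groups)   # ['12', '345', '6789']
print(triples)  # ['345', '678']
```

One thing worth noticing: \d{3} also matches the first three digits of 6789, because it only asks for three digits in a row and does not care what comes after them.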

Let's put all this information into better context.

In the image below, 3 people have sent me texts with their name, phone number, house address and email address. I'm going to identify the different things I need from the text. See if you can work it out; the answers are in the image captions. (There are many ways to answer the questions, so your answers may differ. The important thing is to have the right data extracted.)

1. I need the phone numbers so I can add them to a group chat. Thankfully they are all in the same format, so one regular expression can be used. Can you work it out?

Regular expression: \d{5}\s\d{6}

2. I'd like to know how old they are as well. (Remember that if you have a specific string (word) you can type it out directly as well)

\d{2}\syears

3. I would like to post some invitations to their houses so I need their complete house address.

\d+\s[A-Z]\w+\s[A-Z]\w+
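To check the three answers above, here is a minimal Python sketch. The original texts are only shown in the image, so the messages, names, numbers and addresses below are all hypothetical stand-ins written to follow the same formats.

```python
import re

# Hypothetical messages, made up to mimic the formats described above.
messages = (
    "Hi, I'm Priya. I am 27 years old, my number is 07123 456789, "
    "I live at 12 Baker Street and my email is priya@example.com\n"
    "Hello, Tom here. I am 31 years old, my number is 07987 654321, "
    "I live at 8 Mill Lane and my email is tom@example.com\n"
    "Hey, it's Amy. I am 24 years old, my number is 07555 123456, "
    "I live at 105 Park Road and my email is amy@example.com\n"
)

# 1. Phone numbers: five digits, a space, six digits.
phones = re.findall(r"\d{5}\s\d{6}", messages)

# 2. Ages: two digits, a space, then the literal word "years".
ages = re.findall(r"\d{2}\syears", messages)

# 3. Addresses: a house number followed by two capitalised words.
addresses = re.findall(r"\d+\s[A-Z]\w+\s[A-Z]\w+", messages)

print(phones)     # ['07123 456789', '07987 654321', '07555 123456']
print(ages)       # ['27 years', '31 years', '24 years']
print(addresses)  # ['12 Baker Street', '8 Mill Lane', '105 Park Road']
```

The address pattern relies on the street name being exactly two capitalised words, so a different address format would need a different pattern.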

If you'd like to practice some regex, this is a good website to use: https://regexr.com/

I hope you've had a little bit of fun with this 😊

Author:
Angelica Obi
© 2024 The Information Lab