‘RegEx’, or the Regular Expression, is used across many programming languages and software packages.
Using RegEx in Alteryx, we can look at strings of text and try to find matches for specific patterns of characters, or replace parts of the string that match our patterns. We have the option of using the REGEX_Match, REGEX_CountMatch, or REGEX_Replace functions within the formula tool to complete these tasks.
Sometimes we might want to separate out parts of a string into new fields.
In many cases, our first port of call for separating out strings will be the Text To Column tool, but what should be use when there is not an obvious delimiter to use in this tool?
Example: messy strings with contact information
Here we have an example of some messy strings that contain a first name, a phone number, and an email address; in order to use this data we will need to split out the strings. However, there is no common delimiter between the three rows: the Text to Columns tool is no good, so let’s use the RegEx tool here.
Note: there are two output options in the RegEx tool that allow us to create new fields containing the contact info we want to use.
Tokenize can be used to find all matches for one specific string; for example, if we had multiple phone numbers in each row in similar formats, we could put every phone number into a new column.
Here, however, we want to create a new column for first name, phone number, and email address: we will need to use the Parse tool instead.
To work through this problem, I used the free regex101 tool to test my pattern out on the three text strings, and use their helpful reference pane to figure out a solution.
Let’s consider what we have here:
The names Angie and Greg are made up of one or more word characters (or 'alphanumeric' characters), and come at the start of the whole string.
We can use the start anchor (^), the word character (\w), and the non-greedy one or more quantifier (^?) to start building out our expression.
However, Jean-Luc’s name has an inconvenient hyphen in it: we’ll need to take this into account too.
We can use a wildcard character for things like hyphens (.), and since there are no hyphens in the other two names, we should use the non-greedy zero or more quantifier (*?) to tell Alteryx that there might or might not be a character after our first few letters. To include the last few letters of Jean-Luc’s name, we can finish off this expression by stating that there will be zero or more word characters after the hyphen (\w*?).
In very plain English, the expression we have built so far tells Alteryx “we are looking for some letters at the start of the string, followed by maybe a hyphen and maybe some more letters”.
We can enclose this expression in brackets to let Alteryx know that matches for this expression should form the first of our three new columns.
^(\w+?.*?\w*?)
Next, the phone numbers: we have two mobile numbers and one landline to consider, with slightly different formats.
The mobile numbers start with a +, a non-alphanumeric character (\W), followed by digits (\d). The landline number is only made up of digits. In plain English, “we are looking for zero or one non-alphanumeric characters, followed by at least one digit”, again enclosed in brackets, to make up our second new column.
(\W*\d+)
Finally, the email addresses: each person has formatted the first part of their email slightly differently, with a combination of letters, numbers and non-alphanumeric characters. All of them end with an @, and the domain name, including some full stops. In plain English, "we are looking for a string with a mix of characters followed by an @, followed by some more mixed characters".
In this instance, we might want to use a wildcard (.) for the first part of the email, and tell Alteryx that there will be at least one character of any kind (+), followed by an @ character, and a mix of letters and dots, which another wildcard character will match (.+). Once again, we can bracket off this expression to form our final new column.
(.+@.+)
Our final regular expression looks like this:
We can build out our workflow in Alteryx as below, and run it to see the output:
It is worth noting that RegEx can be used in different ways, and so there are likely to be several alternative - or maybe even better - solutions to this problem!