Using RegEx in Alteryx

RegEx (short for Regular Expression) is a sequence of characters that allows a user to extract a subset of a string from a field. It's a powerful data preparation and transformation tool to check for incorrect data formats (think e-mail addresses, phone numbers, postal codes), scan data for key words (e.g., in customer feedback), and/or extract characters of interest (like a domain name or codes).

In practice, RegEx acts quite similarly to the CONTAINS() function, with which a user can search a string in a field and receive a Boolean (True/False) output. With RegEx, expressions can be customized much more than with the CONTAINS() function. And in Alteryx, the built-in RegEx tool has four distinct functions that allow users to specify an action in addition to looking for a given expression.

In this blog, I'll provide a practical example using each of the four RegEx tool functions.

An Alteryx workflow using RegEx will look something like the image below: A given tool (Text Input, in this case) on the left that feeds into the RegEx tool on the right.

For this exercise we'll be working with this small dataset:

Replace: Replaces a given expression with a string of your choice. To replace the 'Miss' and 'Mrs' in Field1 with 'Ms,' we can enter the following:

What this RegEx expression does:

| looks for fields containing either the expression to the left OR right

Tokenize: Allows you to split text to columns. Each time the expression is found in a field, a new column is created. Here we want to create 3 new columns for 'Title,' 'First Name,' and 'Last Name.'

What this RegEx expression does:

[A-Z] looks for uppercase characters

\w looks for alphanumeric characters

+ looks for multiple alphanumeric characters

Parse: Extracts a given expression from a string. To extract each person's age from the Approximate Age field and output an integer data type, we can add an expression in parentheses:

What this RegEx expression does:

() indicates that we want to create a new column for our output

\d looks for digits

+ looks for multiple digits

Match: Checks whether the input expression can be found in a field (whether there are any matches). This function outputs a True/False in a new column, which you can then clean up using the Filter tool to extract only those that are True.

To find all people whose name starts with an M and does not end in a space, we enter the following:

What this RegEx expression does:

^M looks for words beginning with M

. looks for wildcards (basically anything)

+ looks for multiple wildcards

\S looks for anything not ending in a space

Author:

Britt van der Poel

View Profile