A few weeks ago, I came across an exciting Alteryx Challenge that presented me with the challenge of having three different types of information in one column.
As this phenomenon is not all too rare in data analysis, but often sounds easier than it is, I would like to share my solution with you. (Note: This is of course only one of many possible solutions.)
If you want to try it yourself, you can find the challenge here.
Starting point:
Specifically, the challenge was about the Oscars 2024 and among other things finding out which actors and actresses had the most nominations so far. The challenge had four subtasks, whereby cleaning the data was only a secondary task, which I would like to emphasise here.
The data set I want to focus on in this blog looks like this:
You have three different pieces of information in one column. This includes the name of the actors/actresses, the name of the film and in columns 1 and 7 you also get the information as to whether the category is Best actor or Best actress. To be able to analyse the information better, it would of course be nice if we could untangle the information and get three columns. But what steps do we need to take to do this?
Step 1: Formula Tool
After an initial data check, the Formula tool can be used in a first step. It allows us to use two formulas in one step, which enable us to create two new columns, one with only the category information and the other with only the names + films.
The formulas look like this:
Our interim result looks like this:
Now we have separated the first information, but we have many zero values. For this reason, we take the second step.
Step 2: Multi-Row Formula
We can use the Multi-Row Formula Tool to create a fill-down of the header column. This requires the following formulas in the expression area of the tool:
After that, we are left with the Nulls in the data column, as you can see here:
Step 3: Filter & Text To Columns
Our aim in the third step is to get rid of the zeros in the data column and also to put the names of the actors/actresses and the film titles in two different columns. To do this, we first need the filter tool. With this tool we can filter out the rows that have zeros in the data column if the tool is configured as follows:
Then we need the Text to Columns tool to separate the film title from the actors/actresses name. It should be set as followed:
After this step, your data record will look like this:
Step 4: Cleaning & Select
In a final step, our data set still needs to be tidied up, as we still have a lot of duplicate information and the column headers don't have infpormative/useful names. We therefore use the cleaning tool and the select tool in the fourth step.
With the help of the cleaning tool, we can remove whitespaces and the inverted commas around the film name. To do this, the tool can be set as followed:
By subsequently using the Select tool, we can get rid of the superfluous columns Test and Data and also change the name of the remaining columns to something meaningful:
Result:
Your final data set should look like this:
And the corresponding alteryx flow should look like this: