Web Scraping in Alteryx

Today DS40 were tasked with scraping the results of the London Marathon over the past ten years from the TCS London Marathon website to create a dashboard in Tableau. With 50,000 runners a year, this was not an easy task.

I started the day by taking a look at what data was available on the website. This included the names of the runners, their final position across various categories, and the time they ran the marathon. Having learnt from past experiences, I next sketched out my dashboard. This meant that whilst web scraping, I only had to worry about certain columns. In the end, I decided to just take the participants' names and finishing times, as I could then create their race finishing position in Alteryx.
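In Alteryx, deriving the finishing position is just a Sort followed by a Record ID tool. As a rough sketch of the same idea in Python (the names and times below are made up for illustration, not real results):

```python
# Hypothetical sample of scraped results: (name, finishing time as hh:mm:ss).
results = [
    ("Alice", "02:45:10"),
    ("Bob", "03:12:45"),
    ("Carol", "02:58:30"),
]

def time_to_seconds(t):
    """Convert an hh:mm:ss string to total seconds so times sort correctly."""
    h, m, s = (int(part) for part in t.split(":"))
    return h * 3600 + m * 60 + s

# Sort by time, then number the rows 1, 2, 3, ...
# (equivalent to Sort + Record ID in Alteryx).
ranked = sorted(results, key=lambda row: time_to_seconds(row[1]))
positions = [(pos, name, time) for pos, (name, time) in enumerate(ranked, start=1)]

print(positions)
# → [(1, 'Alice', '02:45:10'), (2, 'Carol', '02:58:30'), (3, 'Bob', '03:12:45')]
```

This is why the position column didn't need to be scraped: it can always be rebuilt from the times.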

For me, the key to undertaking a task as large as this is to break the challenge up into small manageable chunks.

  1. Download the first page of data for one year
  2. Figure out how to parse the data so I can extract the columns I need: Name and Time
  3. In a copy of the workflow, convert it to a macro so it can download each page of the website. I used a batch macro, but an iterative one would probably work too. During development, only try to bring in the first two pages: if this works, it will most likely work for all of the pages
  4. In a copy of the workflow where the first macro works, integrate the need to cycle through the different years to create a second macro (which has the first one nested in it). Again, just try two or three years to test that it works
  5. Now, you can put it all together. I still tested it with just a couple of pages and years to check it was working as expected.
  6. Hit run and go for lunch 🥳
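The nested-macro structure above (a year macro wrapping a page macro) is really just two loops. A minimal Python sketch of that shape, using a made-up URL pattern since the real TCS results pages are structured differently:

```python
import urllib.request

# Hypothetical URL pattern; the real London Marathon results site uses
# different parameters, so treat this as a sketch of the loop structure only.
BASE_URL = "https://example.com/results?year={year}&page={page}"

def download_page(year, page):
    """Download one page of results for one year (step 1 of the breakdown)."""
    url = BASE_URL.format(year=year, page=page)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def scrape_all(years, pages_per_year):
    """Outer loop over years, inner loop over pages: the same shape as a
    batch macro (pages) nested inside a second batch macro (years)."""
    for year in years:
        for page in range(1, pages_per_year + 1):
            yield year, page, download_page(year, page)
```

The point of building it in stages is that each loop can be tested on its own before nesting them.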

In general, my best advice when building macros is to start with a workflow which does the thing you want it to do once. If you can make this work, then start converting your work into a macro. ALWAYS make copies of your workflows once you are happy they're working as expected. This way, even if you mess everything up, you can easily get back to something that works. Finally, only test with a few values to make sure each step works: my final workflow took about half an hour to run, so I definitely wouldn't suggest debugging with this much data.
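One simple way to bake "test small first" into a script is a debug flag that caps the number of years and pages. The counts below are assumptions for illustration (roughly 50,000 runners split across many pages), not figures from the actual site:

```python
# A sketch of the "test with a few values" habit: a debug flag that limits
# the scrape to two years and two pages before committing to the full run.
DEBUG = True

ALL_YEARS = list(range(2014, 2024))   # ten years of results (illustrative)
PAGES_PER_YEAR = 2000                 # assumed page count for ~50,000 runners

years = ALL_YEARS[:2] if DEBUG else ALL_YEARS
pages = 2 if DEBUG else PAGES_PER_YEAR

print(f"Scraping {len(years)} year(s), {pages} page(s) each")
```

Flip `DEBUG` to `False` only once the small run produces exactly the columns you sketched for the dashboard.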

Author:
Lydia Wren