Web Scraping in Alteryx

We recently looked at web scraping in Alteryx, so I thought I would write this blog to give some context to anyone interested in using it to extract data. First, we need to understand HTML, the standard markup language for web pages. HTML consists of elements that tell the browser how to display content and that define a web page's structure. You can find more info on this here: https://www.w3schools.com/

HTML uses tags, written inside angle brackets <>, to separate the code into sections, with content nested between an opening and a closing tag. For example, <body>...</body> contains the visible contents of the document.
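To make the nesting idea concrete, here is a small Python sketch (outside Alteryx, using only the standard library) that walks a made-up snippet of HTML and records how deeply each tag is nested. The sample markup and the TagOutline class name are assumptions for illustration only.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical sample page; every tag here has a matching closing tag.
sample = "<html><body><h1>Title</h1><p>Some text</p></body></html>"

parser = TagOutline()
parser.feed(sample)

# Print an indented outline of the tags.
for depth, tag in parser.outline:
    print("  " * depth + f"<{tag}>")
```

This simple depth counter assumes every tag is closed; tags without a closing partner, such as <br>, would need extra handling.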

To inspect this, we just need to right-click on a webpage and click Inspect (see below).

This will bring up a window on the right-hand side with the HTML code for the page. Note that content generated by JavaScript will not work with Alteryx web scraping, because the download returns the page's HTML without running any scripts, so be careful of video and other dynamic elements as they will not be scraped. From here we can click the element selector button in the inspect window, and then click on the webpage to find where the element we want to scrape sits in the HTML. Note this for later reference, as it will help narrow down our regex search.
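The JavaScript caveat can be illustrated with a short Python sketch. The page below is invented: its <div> only gets filled in once a browser runs the script, so a search of the raw HTML finds the element empty.

```python
import re

# Hypothetical page: the <div> stays empty until a browser runs the script.
static_html = (
    '<div id="prices"></div>'
    '<script>document.getElementById("prices").textContent = "4.50";</script>'
)

# Search the raw HTML, as a scraper that does not run scripts would see it.
div_match = re.search(r'<div id="prices">(.*?)</div>', static_html)
div_content = div_match.group(1)

print(repr(div_content))  # the element has no content in the raw HTML
```

The value only exists inside the script text, which is why it is worth confirming in the inspect window that the data you want is present in the HTML itself.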

Our next step is to set up Alteryx so that it downloads the HTML from the webpage and allows us to start exploring.

Drag in a Text Input tool and enter the URL of the page that we want to scrape. In this case, I wanted to bring in the data from multiple pages, so each URL has to be entered on a new line. Next, drag a Download tool onto the canvas and set its URL field to the field that we just created with the Text Input tool. (The Download tool comes from the Developer set of tools in Alteryx.)
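For readers who think in code, the Text Input plus Download step is roughly equivalent to the Python sketch below. It is not how Alteryx works internally, just an analogy: the example.com URLs, the function names, and the shortened header text are all assumptions, and a stand-in fetcher is used so the sketch runs without touching a real website.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download one page and return (status_text, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        status_text = f"{response.status} {response.reason}"
        body = response.read().decode("utf-8", errors="replace")
    return status_text, body


def download_all(urls, fetch_page=fetch):
    """Build one row per URL, mirroring the columns of the Download tool."""
    rows = []
    for url in urls:
        headers, data = fetch_page(url)
        rows.append({"URL": url, "DownloadHeaders": headers, "DownloadData": data})
    return rows


# One URL per line, as in the Text Input tool (hypothetical addresses).
urls = ["https://example.com/page1", "https://example.com/page2"]


def fake_fetch(url):
    """Stand-in for fetch() so the example runs offline."""
    return "200 OK", f"<html><body>Page at {url}</body></html>"


rows = download_all(urls, fetch_page=fake_fetch)
for row in rows:
    print(row["URL"], row["DownloadHeaders"])
```

Passing fetch instead of fake_fetch would perform real requests; note that urlopen raises an error for responses such as 404 rather than returning them.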

Our output will look something like this. The DownloadHeaders column lets us check that the pages downloaded okay, which is the case when 200 OK shows up. If you run into a 404 or other errors, check that the URL is correct and try limiting the download rate to something slower. Note that running a web scrape too many times against a website could result in an IP block, as you will be overloading the server with connection requests and could possibly crash it. Always check that the website is okay with web scraping!
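The two checks described here, reading the status out of the headers and slowing down the requests, can be sketched in Python as follows. The header strings are made-up examples of what a status line typically looks like, and the helper names and the one-second default delay are assumptions rather than Alteryx settings.

```python
import re
import time

# Matches a status line such as "HTTP/1.1 200 OK" and captures the code.
STATUS_LINE = re.compile(r"HTTP/\S+\s+(\d{3})")


def status_code(download_headers):
    """Return the HTTP status code found in the headers, or None."""
    match = STATUS_LINE.search(download_headers)
    return int(match.group(1)) if match else None


def throttled(urls, delay_seconds=1.0, sleep=time.sleep):
    """Yield URLs one at a time, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        yield url


# Hypothetical header values for two downloaded pages.
good = "HTTP/1.1 200 OK\r\nContent-Type: text/html"
bad = "HTTP/1.1 404 Not Found\r\nContent-Type: text/html"

print(status_code(good) == 200)  # page downloaded okay
print(status_code(bad) == 200)   # page needs another look
```

The sleep function is passed in as a parameter so the pause can be swapped out when testing; in normal use the default time.sleep applies.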

The DownloadData column will contain all of the information we need. Use a Browse tool and double-click on a row in this column to inspect the data further.

Now that this is ready to go, we can use the RegEx tool to split the data we want out of these rows. If you want to find a specific part, I would recommend copying this data from the view in Alteryx and pasting it into a tool such as Visual Studio Code, as it will allow us to inspect the HTML and search for the specific part that we want. We can also use the search function in the browser's inspect window to narrow this down further. Web scraping this way requires quite a good understanding of regex, so be sure to practice with https://regex101.com/ to help get the expressions correct.
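As a rough idea of what such an expression looks like, here is a minimal Python sketch that pulls two values out of each table row. The HTML, the class names, and the pattern are invented for illustration; a real page will need a pattern built around whatever you noted in the inspect window.

```python
import re

# Hypothetical DownloadData contents for one page.
download_data = """
<table>
  <tr><td class="name">Widget</td><td class="price">4.50</td></tr>
  <tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
</table>
"""

# Non-greedy groups (.*?) stop at the first closing tag instead of
# running on to the last one in the document.
ROW = re.compile(
    r'<td class="name">(.*?)</td>\s*<td class="price">(.*?)</td>',
    re.DOTALL,
)

extracted = ROW.findall(download_data)
for name, price in extracted:
    print(name, price)
```

Anchoring the pattern on something distinctive, such as a class attribute, keeps it from matching unrelated parts of the page.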

Author:
James Driver