Web scraping 101 on Alteryx

Today we covered web scraping for the first time, and as usual I'm writing a blog post about it so I can come back to it if I ever forget any of the steps.

I will go through the steps I took in Alteryx to scrape this website, which lists multiple books and their prices: http://books.toscrape.com/catalogue/page-2.html

STEPS:

1) Copy URL: Paste the URL of the page you want to scrape into a Text Input tool.

2) Download tool: Insert a Download tool into your Alteryx workflow and configure it in the following way:

3) Add Browse tool: Adding a Browse tool ensures that all the downloaded data is available to copy in the following step.
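For comparison, steps 1 to 3 boil down to fetching the raw HTML of the page as one long string. Here is a minimal Python sketch of that idea, assuming only the standard library. The function name `download_page` is made up for illustration, and this is not what the Download tool does internally.

```python
import urllib.request


def download_page(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the response body as text.

    Hypothetical helper: a rough stand-in for Text Input + Download,
    not Alteryx's own implementation.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs internet access, so it is left commented out):
# html = download_page("http://books.toscrape.com/catalogue/page-2.html")
# print(html[:200])
```

The returned string plays the same role as the download data you inspect in the Browse tool.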

4) Copy download data without headers: Click on the DownloadData field in the Results pane, then go to Actions, copy to clipboard, and choose selected cells without headers.

5) Tokenize with RegEx on the delimiting tag (<>) that separates the sections: Go to the website, right-click and select Inspect. Then look for a delimiter that repeats for every section you want to pull data from. In this example it was <h3>.

For this reason I configured the RegEx tool in the following manner:
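Outside Alteryx, the same tokenizing idea can be sketched in Python. This is only an illustration under assumptions: the HTML below is a simplified, made-up snippet rather than the site's exact markup, `split_on_delimiter` is a hypothetical helper name, and the pattern is not necessarily the expression used in the RegEx tool.

```python
import re
from typing import List

# Simplified, made-up HTML standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<h3><a title="Book One">Book One</a></h3>
<p class="price_color">£10.00</p>
<h3><a title="Book Two">Book Two</a></h3>
<p class="price_color">£20.50</p>
"""


def split_on_delimiter(html: str, delimiter: str = "<h3>") -> List[str]:
    """Return one chunk of HTML per occurrence of the delimiter."""
    escaped = re.escape(delimiter)
    # Lazily capture everything after a delimiter, stopping just before
    # the next delimiter or at the end of the text.
    pattern = escaped + r"(.*?)(?=" + escaped + r"|\Z)"
    # DOTALL lets "." match newlines, since each section spans several lines.
    chunks = re.findall(pattern, html, flags=re.DOTALL)
    return [chunk.strip() for chunk in chunks]


sections = split_on_delimiter(SAMPLE_HTML)
print(len(sections))  # one section per <h3> found
```

Each entry in `sections` holds the HTML for one book, which is what the next step parses.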

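6) RegEx parse all the info wanted: The way to do this is to copy the HTML surrounding the value, paste it in as the regular expression, and replace the part you want to extract with (.*?). This is a lazy capture group: it matches as little text as possible, so it stops at the first occurrence of whatever follows it.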

Here is how I used it to extract the price from the following text:

And another example for the rating:
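To make the copy, paste and replace trick concrete, here is a small Python sketch. It rests on assumptions: the section text is made up, the class names `price_color` and `star-rating` are what I would expect to see when inspecting the page but should be checked there, and `parse_first` is a hypothetical helper name.

```python
import re
from typing import Optional

# Made-up section of HTML for a single book (assumption, not the exact markup).
SECTION = (
    '<a title="Book One">Book One</a></h3>'
    '<p class="star-rating Three"></p>'
    '<p class="price_color">£51.77</p>'
)

# Copy the surrounding HTML and swap the wanted value for (.*?).
PRICE_PATTERN = r'<p class="price_color">(.*?)</p>'
RATING_PATTERN = r'<p class="star-rating (.*?)">'


def parse_first(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group, or None when nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


price = parse_first(PRICE_PATTERN, SECTION)
rating = parse_first(RATING_PATTERN, SECTION)
print(price, rating)
```

Because (.*?) is lazy, the price capture stops at the first closing </p> and the rating capture stops at the first closing quote, instead of running on to the end of the section.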

TIP: Right-click on the website and select Inspect, then click on the element picker icon (the arrow in the corner of the inspector panel) and hover over the data on the website you want to find.

Then press Ctrl+F on your keyboard and type the word you are looking for, and the inspector will find and highlight it for you.

Author:
Eugenia Losada Gamst