Today we covered web scraping for the first time, and as usual I'm writing a blog post about it so I can come back to it if I ever forget any of the steps.
I will go through the steps I took in Alteryx to scrape this website, which lists multiple books and their prices: http://books.toscrape.com/catalogue/page-2.html
STEPS:
1) Copy URL: Copy the URL of the page you want to scrape into a Text Input tool, like so:
[Screenshot: Text Input tool containing the URL]
2) Download tool: Insert a Download tool into your Alteryx workflow and configure it in the following way:
[Screenshot: Download tool configuration]
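For comparison, here is roughly what the Download tool is doing, written as a minimal Python sketch with only the standard library. The helper name `fetch_html` and the UTF-8 fallback are my own assumptions for illustration; they are not Alteryx settings.

```python
from urllib.request import urlopen

# The page scraped in this post
URL = "http://books.toscrape.com/catalogue/page-2.html"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares; fall back to UTF-8 (assumption)
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch_html(URL)` should return the whole page as one long string, which is similar to what ends up in the DownloadData field.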
3) Add Browse tool: Adding a Browse tool ensures that all of the downloaded data is available to copy in the next step.
[Screenshot: Browse tool added to the workflow]
4) Copy download data without headers: Click on the DownloadData field in the results pane, then go to Actions, Copy to Clipboard, and choose Selected Cells without Headers.
[Screenshot: copying the selected cells without headers]
5) Tokenize with RegEx on the delimiting tag (<>) that separates the sections: Go to the website, right-click, and select Inspect. Then look for a tag that repeats for every section you want to pull data from. In this example it was <h3>.
[Screenshot: the repeating <h3> tag in the page source]
For this reason I configured the RegEx tool as follows:
[Screenshot: RegEx tool configuration for tokenizing]
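The idea behind this step can be sketched in Python. This is a rough equivalent of splitting the page into one chunk per book, not the exact expression from my RegEx tool; the sample HTML and the helper name `split_records` are made up for illustration.

```python
import re

# Hypothetical, simplified sample standing in for the downloaded HTML
sample_html = (
    "<ol>"
    "<li><h3><a title='Book A'>Book A</a></h3><p class='price_color'>£10.00</p></li>"
    "<li><h3><a title='Book B'>Book B</a></h3><p class='price_color'>£12.50</p></li>"
    "</ol>"
)


def split_records(html: str, delimiter: str = "<h3>") -> list[str]:
    """Split HTML into one chunk per record, each starting at the delimiter."""
    # A lookahead splits just before the delimiter, so each chunk keeps it
    chunks = re.split(f"(?={re.escape(delimiter)})", html)
    # Anything before the first delimiter is page header, not a record
    return [chunk for chunk in chunks if chunk.startswith(delimiter)]


records = split_records(sample_html)
print(len(records))  # one chunk per <h3> in the sample
```

Each chunk then plays the role of one row after tokenizing, ready to be parsed field by field.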
6) RegEx parse all the info you want: Copy the piece of HTML that surrounds the value, paste it into the RegEx tool, and replace the part you want to extract with (.*?)
Here is how I used it to extract the price from the following text:
[Screenshot: the source text containing the price]
[Screenshot: RegEx tool configuration for the price]
And another example, this time for the rating:
[Screenshot: the source text containing the rating]
[Screenshot: RegEx tool configuration for the rating]
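The same copy, paste, and replace-with-(.*?) trick can be tried out in Python. This is only a sketch: the HTML fragment is a made-up sample, and the class names `price_color` and `star-rating` are assumptions about the page markup that you should check with Inspect before relying on them.

```python
import re
from typing import Optional

# Hypothetical fragment modeled on one book's HTML; class names are assumptions
record = (
    '<p class="star-rating Three"></p>'
    '<h3><a href="#" title="Example Book">Example Book</a></h3>'
    '<p class="price_color">£51.77</p>'
)

# Paste the surrounding HTML and swap the wanted part for (.*?)
PRICE_PATTERN = r'<p class="price_color">(.*?)</p>'
RATING_PATTERN = r'<p class="star-rating (.*?)">'


def extract(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group, or None when the pattern is absent."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


price = extract(PRICE_PATTERN, record)    # "£51.77" for this sample
rating = extract(RATING_PATTERN, record)  # "Three" for this sample
```

The `?` makes the group lazy, so it stops at the first closing text that follows instead of running on to the last one in the chunk.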
TIP: Right-click on the website and select Inspect, then click on this icon and hover over the data on the page that you want to find.
[Screenshot: the icon in the browser's Inspect panel]
Then press Ctrl+F and type the word you are looking for, and the Inspect panel will find and highlight it for you.