The Data School - Dashboard Week | Day 4: Web Scraping

Waking up this morning, I knew I would find today difficult. When Lorna asked us what we wanted to do during Dashboard Week, web scraping was one of the options and I chose it as I wanted to challenge myself. Oh, was I challenged.

The task was to scrape the NRL Match Centre games from 2017 to 2022 to obtain game data (score, play-by-play commentary and player statistics) for two home teams (everyone had two different ones). Mine were Storm and Titans. Once everyone had completed this, the data was to be stored in one big shared data source.

I thought that referring to my notes on web scraping would see me through today. However, the website contained JavaScript inside the HTML, and this threw me completely. Copying and pasting the code into VS Code also didn’t help, as it copied as one looooooooonnnnnggg line. So, all in all, I really struggled to identify the code that I needed to Regex in Alteryx. Eventually, after some help, I managed to get the data downloaded into Alteryx, and from there things started looking more positive. See my score workflow below (for 2017 – both teams):

This worked perfectly for the one team that only played in the regular Rounds. However, for the team that made it to the Grand Final, the data wasn’t pulling through and I couldn’t find it in the code, so I parked it with a note to come back to it in the future.
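For anyone who prefers code to a workflow, here is a minimal Python sketch of the general idea of pulling data out of JavaScript that sits inside the HTML. It is not the Alteryx workflow I built. The page fragment, the variable name `matchData` and the field names are all made up for illustration, and the real Match Centre pages may well be laid out differently.

```python
import json
import re

# Hypothetical page fragment: a script tag that assigns a JSON object to a
# JavaScript variable. The names here are assumptions, not the real page.
SAMPLE_HTML = (
    "<html><body><script>"
    'var matchData = {"homeTeam": {"name": "Storm", "score": 24}, '
    '"awayTeam": {"name": "Titans", "score": 12}};'
    "</script></body></html>"
)


def extract_embedded_json(html, variable):
    """Return the JSON object assigned to `variable` in the page, or None."""
    # Regex only locates where the assignment starts.
    start = re.search(r"var\s+" + re.escape(variable) + r"\s*=\s*", html)
    if start is None:
        return None
    # The JSON decoder then reads exactly one object from that position,
    # so nested braces do not need to be handled in the regex itself.
    try:
        obj, _end = json.JSONDecoder().raw_decode(html, start.end())
    except json.JSONDecodeError:
        return None
    return obj


data = extract_embedded_json(SAMPLE_HTML, "matchData")
if data is not None:
    print(data["homeTeam"]["name"], data["homeTeam"]["score"])
    print(data["awayTeam"]["name"], data["awayTeam"]["score"])
```

Letting a JSON parser do the heavy lifting, with Regex only finding the starting point, is one way to avoid writing a single huge pattern against one very long line of code.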

I also tried to make the workflow dynamic by using all the seasons and both teams (see below).

This presented some problems due to the way I used Regex and the Multi-Row Formula tool to get my Round number and my Finals Week number. So, I made a note to fix this (as I don’t think it will be difficult) and moved on.
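As a rough illustration of the kind of fix I have in mind, the sketch below separates Round numbers from Finals Week numbers in Python rather than in Alteryx. The label formats ("Round 5", "Finals Week 1", "Grand Final") are my assumption about how the stage of the season might be written, so the patterns would need adjusting to whatever the scraped text actually looks like.

```python
import re

# Assumed label formats; adjust the patterns to match the real scraped text.
ROUND_RE = re.compile(r"^Round\s+(\d+)$", re.IGNORECASE)
FINALS_RE = re.compile(r"^Finals\s+Week\s+(\d+)$", re.IGNORECASE)


def parse_stage(label):
    """Return a (stage, number) pair for a label such as 'Round 5'."""
    text = label.strip()
    match = ROUND_RE.match(text)
    if match:
        return ("round", int(match.group(1)))
    match = FINALS_RE.match(text)
    if match:
        return ("finals_week", int(match.group(1)))
    if text.lower() == "grand final":
        return ("grand_final", None)
    # Anything unexpected is flagged instead of being silently mislabelled.
    return ("unknown", None)


for label in ["Round 5", "Finals Week 1", "Grand Final"]:
    print(label, parse_stage(label))
```

Keeping the stage and the number as two separate values means a Round 1 and a Finals Week 1 can never be confused with each other when all seasons and both teams run through at once.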

At this point I think I had 15 minutes before presentations, so I thought I’d try the same approach as before, this time for the player statistics. After scanning through the horrible-looking Word doc (which I had copied my data into so that I could see it properly), most of the code seemed useful. So, I used ALL of it. I managed to get this downloaded into Alteryx and tried to find an identifier to differentiate between all the data. I found something called key, which had the team next to it, so I used this and that was that.
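To show what using key as an identifier could look like outside Alteryx, here is a small Python sketch that splits player records by that field. The records, the player names and the stat columns are invented for the example, and I am assuming each record carries a `key` entry holding the team.

```python
from collections import defaultdict

# Invented sample records; only the idea of a "key" field comes from the page.
SAMPLE_RECORDS = [
    {"key": "Storm", "player": "Player A", "tackles": 30},
    {"key": "Titans", "player": "Player B", "tackles": 25},
    {"key": "Storm", "player": "Player C", "tackles": 18},
]


def group_by_key(records, field="key"):
    """Group records into a dict of lists using the value of `field`."""
    grouped = defaultdict(list)
    for record in records:
        # Records without the field land in an "unknown" bucket for review.
        grouped[record.get(field, "unknown")].append(record)
    return dict(grouped)


by_team = group_by_key(SAMPLE_RECORDS)
for team, players in by_team.items():
    print(team, [p["player"] for p in players])
```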

I didn’t finish today, but neither did many others. It was a challenge indeed, but a good one. Going forward, I want to practise web scraping, as the more experience I have, the less daunting it will seem.