In this edition of dashboard week, our cohort was tasked with web scraping national rugby league data...
The tasks were set as follows:
- Each person will have 2 home teams they need to collect data for
- For each game they should get the score, play by play commentary and Player Statistics for both teams in that game
- Store the data into one big shared datasource
- Each person should work independently
- Write a blog about a specific technique or section from the web scraping challenge
- Presentations should be different: each person should talk about a specific section of their workflows, the how and the why
While it seemed a fairly daunting task, with (approximately) a million regex tools used, the end result was a success: I was able to fulfil all the technical and output requirements.
The first workflow scraped all the match information. Although successful, it was not dynamic, so more work would be needed to obtain all the matches between every team across the 5 years that were required.
The next workflow combined individual player records with their match-specific statistics into a single output.
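For readers who think in code rather than workflow tools, the join described above could be sketched roughly as follows. This is only an illustration in Python, not the workflow I built: the field names (`player_id`, `match_id`, `tries`, `tackles`) and the sample records are all hypothetical.

```python
# Hypothetical records; the field names are illustrative, not the site's real schema.
players = [
    {"player_id": 1, "name": "Player A", "team": "Eels"},
    {"player_id": 2, "name": "Player B", "team": "Knights"},
]
match_stats = [
    {"match_id": "m1", "player_id": 1, "tries": 2, "tackles": 18},
    {"match_id": "m1", "player_id": 2, "tries": 0, "tackles": 25},
    {"match_id": "m2", "player_id": 1, "tries": 1, "tackles": 20},
]


def join_player_stats(players, match_stats):
    """Attach each player's record to every one of their match stat rows."""
    by_id = {p["player_id"]: p for p in players}
    rows = []
    for stat in match_stats:
        player = by_id.get(stat["player_id"])
        if player is None:
            continue  # stat row with no matching player record is dropped
        rows.append({**player, **stat})
    return rows


for row in join_player_stats(players, match_stats):
    print(row)
```

The result has one row per player per match, which is the shape you would want before loading into a shared datasource.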
And lastly, I wanted to create a 'play-by-play' commentary for each match, in a way that could be repeated for every match that was played.
At first I generated all the possible URL combinations for each match and then filtered to keep only the matches which actually took place. I then used this as the input for the matches I was required to look at, in this case those of both the Eels and the Knights. This was followed by more regex parsing to obtain every notable action within each match (for example substitutions, penalties and tries). To save running the workflow again (it takes 15 minutes...) I will just show the workflow, but just imagine the download tool is connected on both sides.
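The same idea can be sketched in Python for anyone who wants to try it outside of a workflow tool. Treat this as a rough sketch under assumptions rather than a copy of my workflow: the URL pattern, the team and year lists, the use of an HTTP 200 status to decide whether a match took place, and the commentary line format are all made up for illustration.

```python
import re
from itertools import permutations

# Hypothetical values: the real site, URL pattern and team list are not shown in this post.
BASE_URL = "https://example.com/match/{year}/{home}-v-{away}"
TEAMS = ["eels", "knights", "storm"]
YEARS = [2018, 2019]
MY_HOME_TEAMS = {"eels", "knights"}


def candidate_urls(teams, years, home_teams):
    """Build every home/away pairing per year, keeping only the assigned home teams."""
    return [
        BASE_URL.format(year=year, home=home, away=away)
        for year in years
        for home, away in permutations(teams, 2)
        if home in home_teams
    ]


def keep_played(urls, status_by_url):
    """Assumption: a URL that downloads with HTTP 200 is a match that took place."""
    return [url for url in urls if status_by_url.get(url) == 200]


# Assumed commentary line format: "<minute>' <EVENT> - <description>"
EVENT_RE = re.compile(
    r"^(?P<minute>\d+)'\s+(?P<event>TRY|PENALTY|SUBSTITUTION)\s+-\s+(?P<detail>.+)$",
    re.IGNORECASE,
)


def parse_commentary(text):
    """Pull the notable actions out of raw commentary text, one dict per event."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(
                {
                    "minute": int(match["minute"]),
                    "event": match["event"].upper(),
                    "detail": match["detail"].strip(),
                }
            )
    return events
```

In practice `status_by_url` would be filled in by whatever does the downloading, and the regex would need adjusting to match the real page markup, which is where most of those million regex tools went.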
All in all, today surpassed my expectations, as I was much more proficient in web scraping and regex than I had given myself credit for. So if the reader could take one thing away from this blog, it would be this: despite any lack of experience or technical ability, what you can actually achieve will almost certainly exceed your own expectations, and those of others.