This is a R conversion of a tutorial by FC Python. I take no credit for the idea and have their blessing to make this conversion. All text is a direct copy unless changes were relevant. Please follow them on twitter and if you have a desire to learn Python then they are a fantastic resource!
Before starting the article, I'm obliged to mention that web scraping is a grey area legally and ethicaly in lots of circumstances. Please consider the positive and negative effects of what you scrape before doing so!
Warning over. Web scraping is a hugely powerful tool that, when done properly, can give you access to huge, clean data sources to power your analysis. The applications are just about endless for anyone interested in data. As a professional analyst, you can scrape fixtures and line-up data from around the world every day to plan scouting assignments or alert you to youth players breaking through. As an amateur analyst, it is quite likely to be your only source of data for analysis.
This tutorial is just an introduction for R scraping. It will take you through the basic process of loading a page, locating information and retrieving it. Combine the knowledge on this page with for loops to cycle through a site and HTML knowledge to understand a web page, and you'll be armed with just about any data you can find.
Let's fire up our package & get started. We'll need the 'rvest' package, so make sure you that installed.
Our process for extracting data is going to go something like this:
For this example, we are going to take the player names and values for the most expensive players in a particular year. You can find the page that we'll use here.
This is pretty easy with rvest. Just set the page variable to the url we want to scrape and the pass in through the read_html function.
page <- "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000" scraped_page <- read_html(page)
To fully appreciate what we are doing here, you probably need a basic grasp of HTML - the language that structures a webpage. As simply as I can put it for this article, HTML is made up of elements, like a paragraph or a link, that tell the browser what to render. For scraping, we will use this information to tell our program what information to take.
You can inspect the source code of the page or use SelectorGadget to help us tell our scraping code where the information is that we want to grab.
Take another look at the page we are scraping. We want two things - the player name and the transfer value.
Using SelectorGadget we can find the following node locations:
Player Name = #yw2 .spielprofil_tooltip
Transfer Value = .rechts.hauptlink a
Extracting the data is then quiet easy. Reading the code left to right the_page -> the_nodes -> the_text -> as_text. Each time storing them as objects with <-
PlayerNames <- scraped_page %>% html_nodes("#yw2 .spielprofil_tooltip") %>% html_text() %>% as.character() TransferValue <- scraped_page %>% html_nodes(".rechts.hauptlink a") %>% html_text() %>% as.character()
This is pretty simple as we now have two objects PlayersNames and TransferValues. So we just add them to the construction of a dataframe.
df <- data.frame(PlayerNames, TransferValue)
and then display the top 5 items of the dataframe with :
This article has gone through the absolute basics of scraping, we can now load a page, identify elements that we want to scrape and then process them into a dataframe.
There is more that we need to do to scrape efficiently though. Firstly, we can apply a for loop to the whole program above, changing the initial webpage name slightly to scrape the next year - I'll let you figure out how!
You will also need to understand more about HTML, particularly class and ID selectors, to get the most out of scraping. Regardless, if you've followed along and understand what we've achieved and how, then you're in a good place to apply this to other pages.
require(rvest) page <- "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000" scraped_page <- read_html(page) PlayerNames <- scraped_page %>% html_nodes("#yw2 .spielprofil_tooltip") %>% html_text() #%>% as.character() TransferValue <- scraped_page %>% html_nodes(".rechts.hauptlink a") %>% html_text() %>% as.character() df <- data.frame(PlayerNames, TransferValue)