UFC Data Scrape using R (UFC Data Analysis Part I)
As an amateur martial artist, I’m interested in the various styles, like Wing Chun, Krav Maga, Jiu Jitsu, and others. Although it's highly unlikely that one can draw a definite conclusion about which martial art is the most effective, I am interested to see what findings I can get through this project. The entire project can be found in my GitHub repository: https://github.com/Jian-Qiao/UFC-Data-Scrapping
The first part is getting the data, which is the main topic of this post. I will explain in detail how I scraped and processed the relevant data.
The Ultimate Fighting Championship (UFC) is a worldwide mixed martial arts competition based in Las Vegas, Nevada. Its first competition was held in November 1993. As of the day I scraped the data, 390 events and 4058 matches had been held around the globe. The purpose of the early Ultimate Fighting Championship competitions was to identify the most effective martial art in a contest with minimal rules between competitors of different fighting disciplines like boxing, Brazilian jiu-jitsu, Sambo, wrestling, Muay Thai, karate, judo, and other styles.
The sport's popularity was also noticed by the sports betting community: BodogLife.com, an online gambling site, stated in July 2007 that the UFC would surpass boxing in betting revenue for the first time that year. In fact, the UFC had already broken the pay-per-view industry's all-time records for a single year of business, generating over $222,766,000 in revenue in 2006, surpassing both WWE and boxing.
That makes it a great data set for me to analyze different martial art styles.
- R (rvest)
- SelectorGadget (Chrome add-on)
P.S. SelectorGadget is a very useful tool for picking out the elements you need on a web page; a demo can be found here.
As the screenshot below shows, the events data is spread across 4 pages.
Each page has a table containing the links to each event. From that table, I was able to get the Date, Name, Location, and URL of each event. I began by looping through all 4 pages and scraping the events data.
I'm going to append all structured links into an array for later use.
P.S. The reason I exclude the first 8 data points is that they are upcoming events rather than historical ones.
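The original code is not reproduced here, so the following is only a minimal sketch of the listing step. It assumes each page holds one events table with a single link per row; the inline HTML, the `table.events` selector, and the helper name `parse_events_page` are all hypothetical stand-ins for the real site.

```r
library(rvest)

# Hypothetical listing page; the real site's markup and selectors will differ.
sample_page <- '<table class="events">
  <tr><th>Date</th><th>Name</th><th>Location</th></tr>
  <tr><td>Jan 14 2017</td><td><a href="/events/upcoming-1">Upcoming 1</a></td><td>Las Vegas</td></tr>
  <tr><td>Dec 30 2016</td><td><a href="/events/past-1">Past 1</a></td><td>Las Vegas</td></tr>
  <tr><td>Dec 10 2016</td><td><a href="/events/past-2">Past 2</a></td><td>Toronto</td></tr>
</table>'

# Parse one listing page into a data frame with Date, Name, Location, URL.
parse_events_page <- function(src) {
  page   <- read_html(src)  # src may be a URL or literal HTML
  tbl    <- html_node(page, "table.events")
  events <- as.data.frame(html_table(tbl), stringsAsFactors = FALSE)
  # Assumes exactly one link per data row, in row order.
  events$URL <- html_attr(html_nodes(tbl, "td a"), "href")
  events
}

# In the real scrape this list would hold the 4 listing-page URLs.
pages <- list(sample_page)
all_events <- do.call(rbind, lapply(pages, parse_events_page))

# Drop the upcoming events at the top (8 in the post, 1 in this sample).
n_upcoming <- 1
historical <- all_events[seq_len(nrow(all_events)) > n_upcoming, ]
```

Filtering with `seq_len(nrow(all_events)) > n_upcoming` rather than a negative index keeps every row when `n_upcoming` is 0.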
After getting the URLs of all the event pages, the next step is to scrape the match data from each one. Inspecting one of the event pages shows that the match data is separated into several areas:
Note that the page is structured as 'winner on the left', so I could read the result of each match directly from the table I scraped. Otherwise, I would have had to scrape the 'Win' or 'Loss' tag separately to get the result.
The code is posted below:
The code is separated into 2 parts, corresponding to the 2 parts of the data. The first part gets the gold area in my screenshot, which is structured as a table. The second part scrapes the red area in my screenshot, which needed more work to pull information from different tags together.
After some cleaning work, I added the Time and Name of the event.
With both parts ready, I was able to bind them together and create a complete data set.
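As an illustration of the two-part approach, here is a hedged sketch that reads one table of matches, assembles one extra match from separate tags, and binds the two. The HTML, the class names, and the event name and date are invented for the example; only the overall shape follows the description above.

```r
library(rvest)

# Hypothetical event page: one match spread over separate tags, the rest in a table.
event_html <- '<div class="main">
  <span class="left">Fighter A</span><span class="right">Fighter B</span>
  <span class="method">KO (Punch)</span><span class="round">2</span><span class="time">3:10</span>
</div>
<table class="card">
  <tr><th>Match</th><th>Fighter1</th><th>Fighter2</th><th>Method</th><th>Round</th><th>Time</th></tr>
  <tr><td>2</td><td>Fighter C</td><td>Fighter D</td><td>Decision (Unanimous)</td><td>3</td><td>5:00</td></tr>
  <tr><td>1</td><td>Fighter E</td><td>Fighter F</td><td>Submission (Armbar)</td><td>1</td><td>4:02</td></tr>
</table>'

page <- read_html(event_html)

# Helper: trimmed text of the first node matching a CSS selector.
text_of <- function(css) html_text(html_node(page, css), trim = TRUE)

# Part 1: the matches that already come as a table.
card <- as.data.frame(html_table(html_node(page, "table.card")),
                      stringsAsFactors = FALSE)

# Part 2: the match whose details live in separate tags.
main <- data.frame(
  Match    = max(card$Match) + 1L,
  Fighter1 = text_of(".main .left"),
  Fighter2 = text_of(".main .right"),
  Method   = text_of(".main .method"),
  Round    = as.integer(text_of(".main .round")),
  Time     = text_of(".main .time"),
  stringsAsFactors = FALSE
)

# Bind both parts, then attach the event details.
matches <- rbind(main, card)
matches$Event_Name <- "Sample Event"          # hypothetical value
matches$Event_Date <- as.Date("2016-12-30")   # hypothetical value

# 'Winner on the left': Fighter1 is the winner of each row.
matches$Winner <- matches$Fighter1
```

Draws and no-contests would need separate handling, for example by checking the Method column, since the left-hand fighter is not a winner in those cases.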
Fighter data is relatively easy to scrape. Although it's not in one table, each detail is put in a specific tag. With the help of SelectorGadget, I can easily pinpoint each area needed and retrieve it.
However, since the UFC has around 24 years of history, some of the data is missing. When a field is missing, the values that follow it shift into the wrong columns. A data cleaning process is necessary if I want to use such data in the future, so I did some cleaning and formatting using regular expressions.
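A small sketch of both ideas, in base R only. It assumes the scraped fields can be keyed by label and that heights look like `5'11"` and weights like `155 lbs`; those raw formats, and the helper name `parse_height`, are assumptions rather than the formats in the actual data.

```r
# Hypothetical labelled fields for one fighter; Birth_Date is missing on the page.
scraped <- c(Height = "5'11\"", Weight = " 155 lbs ")
wanted  <- c("Birth_Date", "Height", "Weight")

# Look fields up by label so a missing one becomes NA instead of shifting the rest.
record <- setNames(trimws(scraped[wanted]), wanted)

# Split a height such as 5'11" into feet and inches; anything else becomes NA.
parse_height <- function(x) {
  m <- regmatches(x, regexec("^([0-9]+)' *([0-9]+)\"?$", x))
  pick <- function(i) {
    vapply(m, function(p) if (length(p) == 3) as.integer(p[i]) else NA_integer_,
           integer(1))
  }
  data.frame(Feet = pick(2), Inch = pick(3))
}

heights    <- parse_height(c(record[["Height"]], "6'2\"", "N/A"))
weight_lbs <- as.numeric(sub(" *lbs$", "", record[["Weight"]]))
```

Keying by label is what prevents the misplacement: the position of a value no longer depends on whether the fields before it were present.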
Get Stadium Data:
Because I want to see where each event was held, I need the geographic coordinates of each stadium. Since I only have the name and the address of each event, I need another tool to get them. I used the geocode function to retrieve the latitude and longitude from Google Maps. Because Google limits the query rate, the code keeps querying until a request fails, then waits for one hour and continues.
Also, some of the addresses were either misspelled or couldn't be found on Google Maps. I looked those up on Wikipedia and entered them by hand.
I saved the data for each event, match, fighter, and stadium location as CSV files for later use. I also went over the data one last time to check that everything was formatted as I wanted, with no typos and no unwanted spaces.
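The retry loop described above could look roughly like this. To keep the sketch runnable without an API key, the geocoder is passed in as a function; `geocode_with_retry`, `fake_geocode`, and the `max_waits` cap are hypothetical, and treating a thrown error as the sign of a rate limit is an assumption about how the real geocoder behaves.

```r
# Geocode addresses one by one; on failure, wait and retry the same address.
geocode_with_retry <- function(addresses, geocode_fn,
                               wait_seconds = 3600, max_waits = 24) {
  result <- data.frame(
    Address   = addresses,
    Latitude  = rep(NA_real_, length(addresses)),
    Longitude = rep(NA_real_, length(addresses)),
    stringsAsFactors = FALSE
  )
  i <- 1
  waits <- 0
  while (i <= length(addresses)) {
    coords <- tryCatch(geocode_fn(addresses[i]), error = function(e) NULL)
    if (is.null(coords)) {
      if (waits >= max_waits) break   # give up instead of looping forever
      waits <- waits + 1
      Sys.sleep(wait_seconds)         # one hour by default, as in the post
      next                            # retry the same address
    }
    result$Latitude[i]  <- coords[["lat"]]
    result$Longitude[i] <- coords[["lon"]]
    i <- i + 1
  }
  result
}

# Stand-in geocoder that fails on its second call to mimic a rate limit.
calls <- 0
fake_geocode <- function(address) {
  calls <<- calls + 1
  if (calls == 2) stop("over query limit")
  list(lon = -115.17, lat = 36.10)
}

stadiums <- geocode_with_retry(c("Venue A", "Venue B"), fake_geocode,
                               wait_seconds = 0)
```

Addresses that still have `NA` coordinates afterwards are the ones that need the manual lookup.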
As a result, I have scraped the following:
- 390 events (Time, Name, Location, URL)
- 4058 matches (Event_Name, Match index, Fighter1, Fighter2, Method, Method_Detail, Round, Time, Referee, Fighter1_Url, Fighter2_Url, Event_id, Event Date, Event_Location)
- 1641 fighters (Name, Birth_Date, Age, Birth_Place, Country, Height, Weight, Association, Class, Fighter_id, Url, Nick Name, Feet (height), Inch (height), PhothUrl)
- 199 stadiums (Address, frequency (how many events were held there), Latitude, Longitude)
If you want an up-to-date data set in the future, you are very welcome to build one. You can simply download my data, modify my code a little, and scrape only the part that is missing from my data.
Now I'm ready to use this data to write a Shiny App and see if there are any interesting findings. Please see UFC Data Analysis - Shiny App (UFC Data Analysis Part II)
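One possible way to do such an incremental update, sketched in base R: load the saved events, compare them with a fresh listing by URL, and scrape only the new ones. The data frames and the temporary file below are made up for the example, and using the URL column as the key is an assumption.

```r
# Hypothetical previously saved events, written to a temporary CSV for the demo.
saved <- data.frame(Name = c("Event 2", "Event 1"),
                    URL  = c("/events/2", "/events/1"),
                    stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")
write.csv(saved, csv_path, row.names = FALSE)

# Hypothetical fresh listing scraped today: one event is new.
latest <- data.frame(Name = c("Event 3", "Event 2", "Event 1"),
                     URL  = c("/events/3", "/events/2", "/events/1"),
                     stringsAsFactors = FALSE)

# Keep only events whose URL is not in the saved file.
saved      <- read.csv(csv_path, stringsAsFactors = FALSE)
new_events <- latest[!latest$URL %in% saved$URL, ]

# Only new_events need their match pages scraped; then append and save.
updated <- rbind(new_events, saved)
write.csv(updated, csv_path, row.names = FALSE)
```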