UFC Data Scrape using R (UFC Data Analysis Part I)

Jian Qiao
Posted on Aug 24, 2017

Introduction:

As an amateur martial artist, I’m interested in the various styles, such as Wing Chun, Krav Maga, Jiu Jitsu, and others. Although it is highly unlikely that anyone can draw a definitive conclusion about which martial art is the most effective, I am interested to see what this project can reveal. The entire project can be found in my GitHub repository: https://github.com/Jian-Qiao/UFC-Data-Scrapping

The first part is getting the data, which is going to be the main topic of this post. I will explain how I scrape relevant data and process it in detail.

What's UFC:

The Ultimate Fighting Championship (UFC) is a worldwide mixed martial arts competition based in Las Vegas, Nevada. Its first competition was held in November 1993. As of the day I performed the data scraping, 390 events and 4058 matches had been held around the globe. The purpose of the early Ultimate Fighting Championship competitions was to identify the most effective martial art in a contest with minimal rules between competitors of different fighting disciplines such as boxing, Brazilian jiu-jitsu, Sambo, wrestling, Muay Thai, karate, judo, and other styles.

The sport's popularity was also noticed by the sports betting community: BodogLife.com, an online gambling site, stated in July 2007 that the UFC would surpass boxing in betting revenues for the first time that year. In fact, the UFC had already broken the pay-per-view industry's all-time record for a single year of business, generating over $222,766,000 in revenue in 2006 and surpassing both WWE and boxing.

That makes it a great data set for me to analyze different martial art styles.

Scraping process:

The website I scrape is Sherdog.com, a site focusing on mixed martial arts competitions. I use its UFC section as the source for my data. The tools I used:

  • R (rvest)
  • SelectorGadget (Chrome add-on)

P.S. SelectorGadget is a very useful tool for picking out the CSS selector of an element on a web page; a demo can be found on the SelectorGadget website.

Scraping Events:

The events data is spread across 4 pages.

Each page has a table containing the links to every event. From that table, I was able to get the Date, Name, Location, and URL of each event. I began by looping through all 4 pages and scraping the events data.

https://gist.github.com/Jian-Qiao/e9aa355fa3a14510a04c4b097a51911e

I'm going to append all structured links into an array for later use.

P.S. The reason I exclude the first 8 data points is that they are upcoming events rather than historical ones.
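As a rough illustration of the loop described above (this is not the code in the gist), the sketch below parses one listing page into a data frame and stacks several pages. The table markup, the column order (date, name, location), and the helper names `parse_events_page` and `scrape_events` are assumptions made for this example; the real selectors have to be found by inspecting the page, for instance with SelectorGadget.

```r
library(rvest)

# Parse one events listing page (an xml_document) into a data frame.
# Assumes each event is a <tr> with three cells: date, linked name, location.
parse_events_page <- function(page, base_url = "") {
  rows <- html_elements(page, "table.events tr")
  # Keep only rows that contain a link, which skips the header row.
  rows <- rows[!is.na(html_attr(html_element(rows, "a"), "href"))]
  data.frame(
    Date     = html_text2(html_element(rows, "td:nth-child(1)")),
    Name     = html_text2(html_element(rows, "td:nth-child(2)")),
    Location = html_text2(html_element(rows, "td:nth-child(3)")),
    URL      = paste0(base_url, html_attr(html_element(rows, "a"), "href")),
    stringsAsFactors = FALSE
  )
}

# Loop over the listing pages and stack the results.
# `page_urls` is a character vector of listing URLs supplied by the caller.
scrape_events <- function(page_urls, n_upcoming = 0, base_url = "") {
  pages <- lapply(page_urls, function(u) parse_events_page(read_html(u), base_url))
  events <- do.call(rbind, pages)
  # Drop the leading rows that list upcoming events (8 at the time of this post).
  if (n_upcoming > 0) events <- events[-seq_len(n_upcoming), , drop = FALSE]
  rownames(events) <- NULL
  events
}
```

Calling `scrape_events(page_urls, n_upcoming = 8)` with the four listing URLs would then mirror the exclusion mentioned in the P.S.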

Scraping Matches:

After getting the URLs of all event pages, the next step is scraping the match data from each one. Inspecting one of the event pages shows that the match data is separated into two areas, which I highlighted in gold and red on a screenshot.

Note that the page is structured as 'winner on the left', so I could read the result of each match directly from the table I scraped. Otherwise, I would have had to scrape the 'Win' or 'Loss' tag separately to get the result.

The code is posted below:

https://gist.github.com/Jian-Qiao/5369372cc86c817b210c2cb46ec01a2e

The code is separated into two parts, corresponding to the two parts of the data. The first part gets the gold area on my screenshot, which is structured as a table. The second part scrapes the red area on my screenshot, which needed more work to pull information from different tags together.

After some cleaning work, I added the date and name of the event.

With both parts set, I was able to bind them together and create a complete data set.
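The table part and the final bind can be sketched as follows. This is an illustration under assumptions, not the gist's code: the `tr.match` selector, the cell order, and the function name `parse_matches` are all hypothetical. It only relies on the 'winner on the left' layout described above to treat the first fighter as the winner.

```r
library(rvest)

# Parse the table of matches on one event page.
# Assumed layout: one <tr class="match"> per bout, with cells
# fighter 1 | fighter 2 | method | round | time.
parse_matches <- function(page, event_name, event_date) {
  rows <- html_elements(page, "tr.match")
  cell_text <- function(css) html_text2(html_element(rows, css))
  cell_href <- function(css) html_attr(html_element(rows, css), "href")
  data.frame(
    Event_Name   = rep(event_name, length(rows)),
    Event_Date   = rep(event_date, length(rows)),
    Fighter1     = cell_text("td:nth-child(1) a"),  # left side: the winner
    Fighter2     = cell_text("td:nth-child(2) a"),
    Method       = cell_text("td:nth-child(3)"),
    Round        = as.integer(cell_text("td:nth-child(4)")),
    Time         = cell_text("td:nth-child(5)"),
    Fighter1_Url = cell_href("td:nth-child(1) a"),
    Fighter2_Url = cell_href("td:nth-child(2) a"),
    stringsAsFactors = FALSE
  )
}
```

As long as the second part is assembled into a data frame with the same column names, `rbind(part_one, part_two)` produces the combined data set.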

Scraping Fighters:

Fighter data is relatively easy to scrape. Although it's not in one table, each detail is put in a specific tag. With the help of SelectorGadget, I can easily pinpoint each area needed and retrieve it.

https://gist.github.com/Jian-Qiao/b1d760d08af480896be754a47bb805a5
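A minimal sketch of this one-selector-per-field approach is shown below. The selectors in `fighter_selectors` are placeholders invented for the example, not Sherdog's real markup, and `parse_fighter` is a hypothetical helper. Each field is looked up with `html_element()`, which yields `NA` when a tag is absent.

```r
library(rvest)

# One CSS selector per field. These selectors are placeholders.
fighter_selectors <- c(
  Name        = ".fighter-name",
  Birth_Date  = ".birthday",
  Height      = ".height",
  Weight      = ".weight",
  Association = ".association"
)

# Pull one value per selector from a fighter page (an xml_document)
# and return a one-row data frame; absent fields come back as NA.
parse_fighter <- function(page, selectors = fighter_selectors) {
  values <- vapply(
    selectors,
    function(css) html_text2(html_element(page, css)),
    character(1)
  )
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}
```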

However, since the UFC has around 24 years of history, some of the data is missing. When a field is missing, the values that follow it shift into the wrong columns. A data cleaning step is therefore necessary if I want to use this data later, so I cleaned and reformatted the data using regular expressions.

https://gist.github.com/Jian-Qiao/4971347d512fd085e79438165318e0bc
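To give a flavor of regular-expression cleaning, here is a small base R sketch that splits a height string into feet and inches and extracts a weight in pounds. The input formats (`5'11"` and `170 lbs`) and the function names are assumptions for illustration; the gist may handle different formats.

```r
# Split a height string such as 5'11" into feet and inches.
# Values that do not match the pattern become NA.
parse_height <- function(x) {
  pattern <- "^\\s*([0-9]+)'\\s*([0-9.]+)\".*$"
  ok <- grepl(pattern, x)
  feet <- ifelse(ok, sub(pattern, "\\1", x), NA)
  inch <- ifelse(ok, sub(pattern, "\\2", x), NA)
  data.frame(Feet = as.integer(feet), Inch = as.numeric(inch))
}

# Extract the number from a weight string such as "170 lbs".
parse_weight_lbs <- function(x) {
  pattern <- "^\\s*([0-9.]+)\\s*lbs.*$"
  as.numeric(ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA))
}
```

Returning `NA` for anything that does not match keeps malformed or missing values visible instead of silently turning them into wrong numbers.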

Get Stadium Data:

Because I want to see where each event was held, I need the geographic coordinates of each stadium. Since I only have the name and address of each venue, I need another tool to get them. I used the geocode function to retrieve the latitude and longitude from Google Maps. Because Google limits the query rate, the code keeps querying until a request fails, then waits for one hour and continues.

https://gist.github.com/Jian-Qiao/5bc9cc30ada1109aa069d6dfc38009ae
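The query-until-failure-then-wait pattern can be sketched in base R as below. The geocoder is passed in as `geocode_fn`, a stand-in for whatever function performs the lookup (for example a wrapper around `ggmap::geocode()`, which is an assumption about the package used). The name `geocode_with_wait` and the `max_attempts` cap are additions for this example, so that an address that can never be resolved does not block the loop forever.

```r
# Geocode addresses one by one, pausing when the geocoder fails.
# `geocode_fn` takes one address and returns c(lon = ..., lat = ...);
# it may throw an error or return NAs when the quota is exhausted.
geocode_with_wait <- function(addresses, geocode_fn,
                              wait_seconds = 3600, max_attempts = 3) {
  n <- length(addresses)
  result <- data.frame(
    Address   = addresses,
    Longitude = rep(NA_real_, n),
    Latitude  = rep(NA_real_, n),
    stringsAsFactors = FALSE
  )
  for (i in seq_len(n)) {
    for (attempt in seq_len(max_attempts)) {
      coords <- tryCatch(
        geocode_fn(addresses[i]),
        error = function(e) c(lon = NA_real_, lat = NA_real_)
      )
      if (!anyNA(coords)) {
        result$Longitude[i] <- coords[["lon"]]
        result$Latitude[i]  <- coords[["lat"]]
        break
      }
      # Failure: assume the quota is exhausted and wait before retrying.
      if (attempt < max_attempts) Sys.sleep(wait_seconds)
    }
  }
  result
}
```

Addresses that still have `NA` coordinates after all attempts are the ones that need a manual lookup.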

Also, some of the addresses were either misspelled or could not be found on Google Maps. I looked these venues up on Wikipedia and typed in their coordinates manually.

https://gist.github.com/Jian-Qiao/a5af03845f1285edb0a7e6d7d2ba9a16
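One way to apply such manual corrections, sketched here in base R, is to keep them in a small lookup table and overwrite the matching rows. The function name `apply_manual_coords` and the column names are assumptions chosen to match the columns listed in the results.

```r
# Overwrite coordinates in `stadiums` with hand-entered values from `fixes`.
# Both data frames are assumed to have Address, Latitude, Longitude columns.
apply_manual_coords <- function(stadiums, fixes) {
  idx <- match(fixes$Address, stadiums$Address)
  found <- !is.na(idx)  # ignore fixes whose address is not in `stadiums`
  stadiums$Latitude[idx[found]]  <- fixes$Latitude[found]
  stadiums$Longitude[idx[found]] <- fixes$Longitude[found]
  stadiums
}
```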

Result:

I got the data for each event, match, fighter, and stadium location and saved it as CSV files for later use. I also went over the data one last time to confirm that everything was formatted as I wanted, with no typos and no unwanted spaces.

https://gist.github.com/Jian-Qiao/002db7b20014b5d591b89575e5842c40
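A simple version of the whitespace check and the CSV export might look like this in base R. The helper `tidy_text_columns` and the sample data are hypothetical, and the file is written to a temporary directory only for the sake of the example.

```r
# Trim leading/trailing spaces and collapse repeated spaces
# in every character column of a data frame.
tidy_text_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) gsub("\\s+", " ", trimws(col)))
  df
}

events <- data.frame(Name = c(" UFC 1 ", "UFC  2"), stringsAsFactors = FALSE)
clean <- tidy_text_columns(events)
write.csv(clean, file.path(tempdir(), "events.csv"), row.names = FALSE)
```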

As a result, I have scraped the following:

390 events (Time, Name, Location, URL)

4058 matches (Event_Name, Match index, Fighter1, Fighter2, Method, Method_Detail, Round, Time, Referee, Fighter1_Url, Fighter2_Url, Event_id, Event Date, Event_Location)

1641 fighters (Name, Birth_Date, Age, Birth_Place, Country, Height, Weight, Association, Class, Fighter_id, Url, Nick Name, Feet (height), Inch (height), PhothUrl)

199 stadiums (Address, frequency (how many events were held there), Latitude, Longitude)

If you want an up-to-date data set in the future, you are very welcome to build one. You can simply download my data, modify my code a little, and scrape only the part that is missing from my data.

Now I'm ready to use this data to write a Shiny App and see if there are any interesting findings. Please see UFC Data Analysis - Shiny App (UFC Data Analysis Part II).
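For instance, assuming the saved events and a freshly scraped events table both carry the URL column listed above, the rows still missing from the saved data could be found like this (the function name `find_new_events` is hypothetical):

```r
# Return the rows of `fresh` whose URL is not yet in `saved`.
find_new_events <- function(saved, fresh) {
  fresh[!(fresh$URL %in% saved$URL), , drop = FALSE]
}
```

Only the event pages returned by this function would then need to be scraped for matches and fighters.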

About Author

Jian Qiao


Jian Qiao is a recent graduate of 12-weeks Online Data Science Boot-camp from NYC Data Science Academy. He has earned his M.S. in Quantitative Finance in 2015. Currently working as a data analyst in Almod Diamonds Ltd, he...