Learning R: Analyzing the English Premier League (III)

Posted on May 29, 2014


Contributed by Bryan Valentini. Bryan took R003 class with Vivian Zhang(Data Science by R, Intensive beginner level) in Mar-Apr, 2014 and did great in class.The post was based on his week 4 Homework submission.


This is a follow-up post to “Analyzing the English Premier League” parts 1 and 2. This part is going to concern with the fun part of analysis, as in all the data has been cleaned and we're ready to look at team performance. You can see all code examples and the sample data on my Learning R GitHub page.


Practicing the basics

Let's keep things simple, and just use the console to explore the data we cleaned up in part 2. We want to get a sense of it before building some visualizations. We will reload the arsenal.rda file, which has all the cleaned up data from the Arsenal perspective. Specifically, we're looking at the actual Premier League matches from 2001 to 2012.

 ## Analysis with Arsenal Data ##
 > names(arsenal)
    [1] "Date" "Status" "Attendance" "Competition" "Season" "Visiting"
    [7] "Opponent" "Goals" "OppGoals" "WinLoss"
 > summary(arsenal)
Date Status Attendance Competition Season
Length:739 Length:739 Min. : 8562 Length:739 Min. :2001
Class :character Class :character 1st Qu.:31756 Class :character 1st Qu.:2003
Mode :character Mode :character Median :38186 Mode :character Median :2006
    Mean :43179   Mean :2006
    3rd Qu.:60004   3rd Qu.:2009
    Max. :95000   Max. :2012
        NA's :93
Visiting Opponent Goals OppGoals WinLoss
Mode :logical Length:739 Min. :0.000 Min. :0.0000 D :161
FALSE:372 Class :character 1st Qu.:1.000 1st Qu.:0.0000 L :132
TRUE :367 Mode :character Median :2.000 Median :1.0000 W :401
NA's :0   Mean :1.915 Mean :0.9769 NA's: 45
    3rd Qu.:3.000 3rd Qu.:1.0000  
    Max. :7.000 Max. :8.0000  
    NA's :45 NA's :45  

Here we see the table summary, which might differ somewhat if you've pre-excluded other competitions when executing part 2. Otherwise, we can clean focus on that subset of data really quickly, by running some filters.

# Look at full time, regular league play - if you don't do this, have to deal with NA results 
# from postponed games.
epl <- arsenal[!is.na(arsenal$Goals) & arsenal$Competition == "Premier League",]

As a warm up, let's look at the calculation below of ratios of wins, losses, and draw. We use the daply function from the plyr2 package to group the fixture results by Season, and then run the ratios function we defined over each grouping.

 ratios <- function(df) {
   total <- length(df$WinLoss)
   w <- sum(df$WinLoss=="W", na.rm=TRUE)/total
   l <- sum(df$WinLoss=="L", na.rm=TRUE)/total
   d <- sum(df$WinLoss=="D", na.rm=TRUE)/total
   ret <- c(w,l,d)
   names(ret) <- c("Wins", "Losses", "Draws")

 # get win/loss/draw ratios for regular play
 (winPercentage <- daply(epl, .(Season), ratios))

Season      Wins      Losses      Draws
  2001 0.6842105  0.07894737  0.2368421
  2002 0.6052632  0.15789474  0.2368421
  2003 0.6842105  0.00000000  0.3157895
  2004 0.6578947  0.13157895  0.2105263
  2005 0.5263158  0.28947368  0.1842105
  2006 0.5000000  0.21052632  0.2894737
  2007 0.6315789  0.07894737  0.2894737
  2008 0.5263158  0.15789474  0.3157895
  2009 0.6052632  0.23684211  0.1578947
  2010 0.5000000  0.21052632  0.2894737
  2011 0.5526316  0.26315789  0.1842105
  2012 0.5526316  0.18421053  0.2631579

Notice how 2003 didn't have any losses? What a year that was!  Similarly, we want to see how a team performs, and outright wins, at home versus away. That's pretty simple, we just add another grouping criteria to the daply function (see .(Season, Visiting)). By summing the count of wins over the length of the input vector, we get the ratio of wins to total games.

 # get winning rate by whether home(FALSE) or away (TRUE)
(visitingWin <- daply(epl, .(Season, Visiting), function(x) { 
      sum(x$WinLoss=="W")/length(x$WinLoss) }))
Season     FALSE       TRUE
  2001 0.6315789  0.7368421
  2002 0.7894737  0.4210526
  2003 0.7894737  0.5789474
  2004 0.6842105  0.6315789
  2005 0.7368421  0.3157895
  2006 0.6315789  0.3684211
  2007 0.7368421  0.5263158
  2008 0.5789474  0.4736842
  2009 0.7894737  0.4210526
  2010 0.5789474  0.4210526
  2011 0.6315789  0.4736842
  2012 0.5789474  0.5263158

If we want to look at the number of wins, losses, or draws in a row, we can use the rle function to calculate the run-length encoding of the groupings for each type of game outcome. For example, if we look at the 2001 season's data, and we run the rle function on the $WinLoss column, then we will see the following output.

  [1] W L W D W D W W D D L D W W W D L W W W D D W W D W W W W W W W W W W W W W
Levels: D L W

 [1] 38

Run-length encoding is a simple data compression algorithm, but here it is used to calculate the {Wins,Losses,Draws}-in-a-row. Looking at the vector of factors above, you can see the string of Ws at the end of the season. These Ws could be summarized by string '13W' instead, compacting the representation of the data, or in other words, telling us that there was a streak of 13 won games. Turns out this is the longest win streak of the season, and a good candidate to see what happened late in the season in terms of players on the field, or strategies used by the the manager Arsene Wenger ('the Frenchman'). Here we see the entire encoding:

  [1]  1  1  1  1  1  1  2  2  1  1  3  1  1  3  2  2  1 13
  [1] "W" "L" "W" "D" "W" "D" "W" "D" "L" "D" "W" "D" "L" "W" "D" "W" "D" "W"

Packaging this into a helper function, we want to process each season and pull out the streaks for wins, losses, and draws, respectively.

# get wins/losses/draws in a row, do existence checks on each category
winsInARow <- function(x) {
  ds <- rle(as.vector(x))    #run length encoding
  win  <- ifelse(any(ds$values == "W"), max(ds$lengths[ds$values == "W"]), 0)
  loss <- ifelse(any(ds$values == "L"), max(ds$lengths[ds$values == "L"]), 0)
  draw <- ifelse(any(ds$values == "D"), max(ds$lengths[ds$values == "D"]), 0)
  return(c(win, loss, draw))
result <- ddply(epl, .(Season), function(df) winsInARow(df$WinLoss))
names(result)[c(2,3,4)] <- c("Wins_in_a_row", "Losses_in_a_row", "Draws_in_a_row")

   Season Wins_in_a_row Losses_in_a_row Draws_in_a_row
1    2001            13               1              2
2    2002             5               2              2
3    2003             9               0              3
4    2004             5               1              2
5    2005             4               3              2
6    2006             5               3              2
7    2007             7               1              4
8    2008             5               2              5
9    2009             6               2              1
10   2010             3               2              3
11   2011             7               3              3
12   2012             4               2              2

A common complaint about soccer, especially from Americans, is that there is very little scoring. This is a fair point, but only if you care about that. Watching the beautiful game is more about watching the skill of the team or the individual talents. Still, the really good teams can average at least one or two goals per game, and in Arsenal's case, it consistently averages above 1.8 goals per game, as the following analysis shows.

# do summary stats on points per season
(goals <- daply(epl, .(Season), 
         function(x) each(mean,sd,max,min)(x$Goals, na.rm=TRUE))) 
Season     mean        sd max min
  2001 2.078947 0.9967943   4   1
  2002 2.236842 1.3443417   6   0
  2003 1.921053 1.2165876   5   0
  2004 2.289474 1.6259366   7   0
  2005 1.789474 1.7109804   7   0
  2006 1.657895 1.3411636   6   0
  2007 1.947368 1.2723115   6   0
  2008 1.789474 1.3980275   4   0
  2009 2.184211 1.5571155   6   0
  2010 1.894737 1.3514651   6   0
  2011 1.947368 1.5412766   7   0
  2012 1.894737 1.8127100   7   0

What about fan attendance? Can we gage how consistently fans attend home versus away games? We'll use the each function from the plyr package to calculate the mean, standard deviation, max, and min function. Rather than manually calculating each inside a helper function and manually cbind the results, the each does this automatically for use.

# do summary stats on attendance per season
(attendance <- daply(epl, .(Visiting, Season), 
          function(x) each(mean,sd,max,min)(x$Attendance, na.rm=TRUE)))
, , = mean

Visiting 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
 FALSE 38054.53 38041.84 38078.84 37979.05 38184.16 60045.42 60070.26 60039.74 59927.00 60025.26
 TRUE 35412.42 36665.63 36040.28 34647.58 35052.95 33821.74 35919.53 35093.74 34208.37 34676.79
Visiting 2011 2012
 FALSE 60000.32 60079.32
 TRUE 33474.11 35436.74

[rest removed for brevity]

Looking at the mean data above, we see that 2006 had a huge jump in attendance at home. Curious, I checked Wikipedia and turns out that their new stadium opened on 22 July 2006, with a new capacity of 60,338 seats. Taking a step back, we see that a lot of information can be gleaned by just examining the numbers. On the other hand, it's not always the ideal way to learn about data. Eventually, you want to visualize it to help guide the non-obvious exploration. In the next section, we'll look at few examples on how to do visual analysis.

Using the ggplot2 package

Explaining how to completely use the ggplot2 package is outside the scope of this post, but many tutorials can be found online and in all of the R-based books. The main attraction of using this package is the themes that are practically ready out-of-the-box with the additional ggthemes package. For this entry, we'll stick to one theme, but you can check out the help section on the package to see the various options like the "Economist" theme, the "Wall Street Journal" theme, the "Edward Tufte" them, just to name a few.

Let's assume we want to look at Arsenal's mean goal performance vs all other opponents. We want to calculate the column-wise mean for Arsenal's Goals versus other teams OppGoals. Rather than pulling out the individual columns, we use the colwise ('column wise') functor to help us build another function that then calculates the respective means.

#Arsenal avg # goals vs opponents
opponents <- ddply(epl,.(Season), function(x) colwise(mean,c('Goals','OppGoals'))(x))

p <- ggplot(opponents, aes(x=Season)) +

 geom_line(aes(y=Goals, group=1), linetype=4) +
 geom_point(aes(y=Goals), size=4, color='red') +
 annotate("text", x = 2005.4, y = 2.3, label = "Gunner goals") +

 geom_line(aes(y=OppGoals, group=1), linetype=2) +
 geom_point(aes(y=OppGoals), size=4, color='gray') +
 annotate("text", x = 2004, y = 1.04, label = "Opponent goals") +

 ggtitle("Arsenal Goal Average vs Opponents") +

In the code above, I've introduced some spaces to make it clear how we treat each column (y=Goals,OppGoals) for all rows (x=Season). The graph this produces is:


From this we can see that, overall, both the Arsenal defense and the offense seem to do their jobs well. Towards the later years, we see a worrying trend in the opponents able to score more goals on Arsenal, so there might be problems in the back line. Let's dive into this issue a bit. Let's look at which opponents are better against Arsenal. We'll plot the ratio of losses versus the goal difference to look at how often a team beats Arsenal and by how many goals.

# Which opponent is stronger?
# get winning rate
output1 <- ddply(epl, .(Opponent), function(x) sum(x$WinLoss == 'L') / length(x$WinLoss))

# get points difference
output2 <- ddply(epl, .(Opponent), function(x) mean(x$OppGoals - x$Goals))
# merge two datasets
opponents <- join(output1, output2, by='Opponent')
names(opponents)[2:3] <- c('WinningRate','PointsDiff')

# !add legend and lines
p <- ggplot(opponents, aes(x=PointsDiff, y=WinningRate)) +
 geom_point(color='#014d64', size=4) +
 geom_hline(y=0.28, colour='grey20', size=0.5, linetype=2) +
 geom_vline(x=0, colour='grey20', size=0.5, linetype=2) +
 geom_point(data=opponents[opponents$WinningRate > 0.3 & opponents$PointsDiff >= 0,],
 aes(x=PointsDiff, y=WinningRate), color='red3', size=5) +
 geom_point(data=opponents[opponents$WinningRate > 0.3 & opponents$PointsDiff < 0,],
 aes(x=PointsDiff, y=WinningRate), color='skyblue', size=5) +
 annotate("text", x = 0.30, y= 0.35, label = "Chelsea") +
 annotate("text", x = 0.583, y = 0.44, label = "ManU") +
 annotate("text", x = 0.35, y = 0.52, label = "Swansea") +
 annotate("text", x = -1.5, y = 0.36, label = "Leeds") +
 annotate("text", x = -1, y = 0.48, label = "Sheffield") +
 labs(x="Mean Goal Difference",
 y="Winning Rate",
 title ="Ranking Opponent's Performance vs. Arsenal") +


At first, looking at the top right corner of the graph, we see the opponents that consistently score more goals or about the same number of goals that Arsenal scores on them, plotted against the opponents' winning percentage. We see the usual top-of-the-table teams with Manchester United and Chelsea, but I almost fell out of my chair when I saw Swansea and Sheffield United. No offense to those teams, they have loyal supporters, but that graph didn't make sense at first. Thinking an error in my calculation, I stared at the data for a bit until it dawned on me that I wasn't taking into account the EPL's relegation system. Teams that consistently perform poorly are booted from the top-flight league down to the next rung, and the best team(s) from that league replace the relegated teams. Without getting crazy, we need to take the number of seasons played against Arsenal into account with respect to performance. I came up with a linear relationship where each loss is multiplied times 3 for heavier weighting of the opponent's win, summed with each draw, and all this multiplied by the total number of seasons played against that opponent. I'm not sure if this is strictly correct, but it made intuitive sense.

# new methodolgy taking draws and # of seasons into account
output1 <- ddply(epl, .(Opponent), 
 function(x) {
 seasons = max(x$Season) - min(x$Season)
 points <- (sum(x$WinLoss == 'L') * 3 + sum(x$WinLoss == 'D')) * seasons
 return (points / (length(x$WinLoss) * 4 * 13)) #normalize

# re-merge two datasets
opponents2 <- join(output1, output2, by='Opponent')
names(opponents2)[2:3] <- c('WeightedWinningRate','PointsDiff')

p2 <- ggplot(opponents2, aes(x=PointsDiff, y=WeightedWinningRate)) +
 geom_point(color='#014d64', size=4) +
 geom_hline(y=0.28, colour='grey20', size=0.5, linetype=2) +
 geom_vline(x=0, colour='grey20', size=0.5, linetype=2) +
 geom_point(data=opponents2[opponents2$WeightedWinningRate > 0.28 & opponents2$PointsDiff >= 0,],
 aes(x=PointsDiff, y=WeightedWinningRate), color='red3', size=5) +
 annotate("text", x = 0.41, y= 0.28, label = "Chelsea") +
 annotate("text", x = 0.56, y = 0.33, label = "ManU") +
 annotate("text", x = 0.41, y = 0.032, label = "Swansea") +
 annotate("text", x = -1.5, y = 0.05, label = "Leeds") +
 annotate("text", x = -1, y = 0.1, label = "Sheffield") +
 labs(x="Mean Goal Difference", y="Winning Rate",
 title ="Ranking Opponent's Performance vs. Arsenal") +


Looks good! We see that Leeds, Sheffield, and Swansea now drop in strength against Arsenal where one might expect. Turns out that Swansea only played against Arsenal four times over the 10+ seasons, with two draws, which explains Swansea's initial strength in the first chart. I glossed over some details about how I built the charts, as in the text for individual data points doesn't automatically populate itself like in many other charting packages. This can be a little tedious to line up the label with the data point, but it does provide some presentation flexibility. I was also pleased with how nice the charts look, and flexibility to switch to a cleaner design, like the Tufte theme, when needed.


This concludes this series of posts looking at how to clean up data and plot some basic charts using R. It was really interesting to see how easy the plyr package makes data manipulation easy with a few lines of code. The important part to remember is that, due to how terse this code can become, its important to test each step and verify you're getting the results you expect. As my professors used to say, "Think before you code."

About Author

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp