Learning R: Analyzing the English Premier League (III)
Contributed by Bryan Valentini. Bryan took R003 class with Vivian Zhang(Data Science by R, Intensive beginner level) in Mar-Apr, 2014 and did great in class.The post was based on his week 4 Homework submission.
--------------------------------------------------
This is a follow-up post to “Analyzing the English Premier League” parts 1 and 2. This part is going to concern with the fun part of analysis, as in all the data has been cleaned and we're ready to look at team performance. You can see all code examples and the sample data on my Learning R GitHub page.
--------------------------------------------------
Practicing the basics
Let's keep things simple, and just use the console to explore the data we cleaned up in part 2. We want to get a sense of it before building some visualizations. We will reload the arsenal.rda file, which has all the cleaned up data from the Arsenal perspective. Specifically, we're looking at the actual Premier League matches from 2001 to 2012.
library(plyr) library(ggplot2) library(ggthemes)
## Analysis with Arsenal Data ##
load("arsenal.rda")
> names(arsenal)
[1] "Date" "Status" "Attendance" "Competition" "Season" "Visiting"
[7] "Opponent" "Goals" "OppGoals" "WinLoss"
> summary(arsenal)
Date | Status | Attendance | Competition | Season |
---|---|---|---|---|
Length:739 | Length:739 | Min. : 8562 | Length:739 | Min. :2001 |
Class :character | Class :character | 1st Qu.:31756 | Class :character | 1st Qu.:2003 |
Mode :character | Mode :character | Median :38186 | Mode :character | Median :2006 |
Mean :43179 | Mean :2006 | |||
3rd Qu.:60004 | 3rd Qu.:2009 | |||
Max. :95000 | Max. :2012 | |||
NA's :93 | ||||
Visiting | Opponent | Goals | OppGoals | WinLoss |
Mode :logical | Length:739 | Min. :0.000 | Min. :0.0000 | D :161 |
FALSE:372 | Class :character | 1st Qu.:1.000 | 1st Qu.:0.0000 | L :132 |
TRUE :367 | Mode :character | Median :2.000 | Median :1.0000 | W :401 |
NA's :0 | Mean :1.915 | Mean :0.9769 | NA's: 45 | |
3rd Qu.:3.000 | 3rd Qu.:1.0000 | |||
Max. :7.000 | Max. :8.0000 | |||
NA's :45 | NA's :45 |
Here we see the table summary, which might differ somewhat if you've pre-excluded other competitions when executing part 2. Otherwise, we can clean focus on that subset of data really quickly, by running some filters.
# Look at full time, regular league play - if you don't do this, have to deal with NA results # from postponed games. epl <- arsenal[!is.na(arsenal$Goals) & arsenal$Competition == "Premier League",]
As a warm up, let's look at the calculation below of ratios of wins, losses, and draw. We use the daply function from the plyr2 package to group the fixture results by Season, and then run the ratios function we defined over each grouping.
ratios <- function(df) {
total <- length(df$WinLoss)
w <- sum(df$WinLoss=="W", na.rm=TRUE)/total
l <- sum(df$WinLoss=="L", na.rm=TRUE)/total
d <- sum(df$WinLoss=="D", na.rm=TRUE)/total
ret <- c(w,l,d)
names(ret) <- c("Wins", "Losses", "Draws")
return(ret)
}
# get win/loss/draw ratios for regular play
(winPercentage <- daply(epl, .(Season), ratios))
Season Wins Losses Draws
2001 0.6842105 0.07894737 0.2368421
2002 0.6052632 0.15789474 0.2368421
2003 0.6842105 0.00000000 0.3157895
2004 0.6578947 0.13157895 0.2105263
2005 0.5263158 0.28947368 0.1842105
2006 0.5000000 0.21052632 0.2894737
2007 0.6315789 0.07894737 0.2894737
2008 0.5263158 0.15789474 0.3157895
2009 0.6052632 0.23684211 0.1578947
2010 0.5000000 0.21052632 0.2894737
2011 0.5526316 0.26315789 0.1842105
2012 0.5526316 0.18421053 0.2631579
Notice how 2003 didn't have any losses? What a year that was! Similarly, we want to see how a team performs, and outright wins, at home versus away. That's pretty simple, we just add another grouping criteria to the daply function (see .(Season, Visiting)). By summing the count of wins over the length of the input vector, we get the ratio of wins to total games.
# get winning rate by whether home(FALSE) or away (TRUE)
(visitingWin <- daply(epl, .(Season, Visiting), function(x) {
sum(x$WinLoss=="W")/length(x$WinLoss) }))
Visiting
Season FALSE TRUE
2001 0.6315789 0.7368421
2002 0.7894737 0.4210526
2003 0.7894737 0.5789474
2004 0.6842105 0.6315789
2005 0.7368421 0.3157895
2006 0.6315789 0.3684211
2007 0.7368421 0.5263158
2008 0.5789474 0.4736842
2009 0.7894737 0.4210526
2010 0.5789474 0.4210526
2011 0.6315789 0.4736842
2012 0.5789474 0.5263158
If we want to look at the number of wins, losses, or draws in a row, we can use the rle function to calculate the run-length encoding of the groupings for each type of game outcome. For example, if we look at the 2001 season's data, and we run the rle function on the $WinLoss column, then we will see the following output.
epl[epl$Season=='2001',]$WinLoss [1] W L W D W D W W D D L D W W W D L W W W D D W W D W W W W W W W W W W W W W Levels: D L W length(epl[epl$Season=='2001',]$WinLoss) [1] 38
Run-length encoding is a simple data compression algorithm, but here it is used to calculate the {Wins,Losses,Draws}-in-a-row. Looking at the vector of factors above, you can see the string of Ws at the end of the season. These Ws could be summarized by string '13W' instead, compacting the representation of the data, or in other words, telling us that there was a streak of 13 won games. Turns out this is the longest win streak of the season, and a good candidate to see what happened late in the season in terms of players on the field, or strategies used by the the manager Arsene Wenger ('the Frenchman'). Here we see the entire encoding:
rle(as.vector(epl[epl$Season=='2001',]$WinLoss))$lengths [1] 1 1 1 1 1 1 2 2 1 1 3 1 1 3 2 2 1 13 rle(as.vector(epl[epl$Season=='2001',]$WinLoss))$values [1] "W" "L" "W" "D" "W" "D" "W" "D" "L" "D" "W" "D" "L" "W" "D" "W" "D" "W"
Packaging this into a helper function, we want to process each season and pull out the streaks for wins, losses, and draws, respectively.
# get wins/losses/draws in a row, do existence checks on each category
winsInARow <- function(x) {
ds <- rle(as.vector(x)) #run length encoding
win <- ifelse(any(ds$values == "W"), max(ds$lengths[ds$values == "W"]), 0)
loss <- ifelse(any(ds$values == "L"), max(ds$lengths[ds$values == "L"]), 0)
draw <- ifelse(any(ds$values == "D"), max(ds$lengths[ds$values == "D"]), 0)
return(c(win, loss, draw))
}
result <- ddply(epl, .(Season), function(df) winsInARow(df$WinLoss))
names(result)[c(2,3,4)] <- c("Wins_in_a_row", "Losses_in_a_row", "Draws_in_a_row")
(result)
Season Wins_in_a_row Losses_in_a_row Draws_in_a_row
1 2001 13 1 2
2 2002 5 2 2
3 2003 9 0 3
4 2004 5 1 2
5 2005 4 3 2
6 2006 5 3 2
7 2007 7 1 4
8 2008 5 2 5
9 2009 6 2 1
10 2010 3 2 3
11 2011 7 3 3
12 2012 4 2 2
A common complaint about soccer, especially from Americans, is that there is very little scoring. This is a fair point, but only if you care about that. Watching the beautiful game is more about watching the skill of the team or the individual talents. Still, the really good teams can average at least one or two goals per game, and in Arsenal's case, it consistently averages above 1.8 goals per game, as the following analysis shows.
# do summary stats on points per season
(goals <- daply(epl, .(Season),
function(x) each(mean,sd,max,min)(x$Goals, na.rm=TRUE)))
Season mean sd max min
2001 2.078947 0.9967943 4 1
2002 2.236842 1.3443417 6 0
2003 1.921053 1.2165876 5 0
2004 2.289474 1.6259366 7 0
2005 1.789474 1.7109804 7 0
2006 1.657895 1.3411636 6 0
2007 1.947368 1.2723115 6 0
2008 1.789474 1.3980275 4 0
2009 2.184211 1.5571155 6 0
2010 1.894737 1.3514651 6 0
2011 1.947368 1.5412766 7 0
2012 1.894737 1.8127100 7 0
What about fan attendance? Can we gage how consistently fans attend home versus away games? We'll use the each function from the plyr package to calculate the mean, standard deviation, max, and min function. Rather than manually calculating each inside a helper function and manually cbind the results, the each does this automatically for use.
# do summary stats on attendance per season
(attendance <- daply(epl, .(Visiting, Season),
function(x) each(mean,sd,max,min)(x$Attendance, na.rm=TRUE)))
, , = mean
Season
Visiting 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
FALSE 38054.53 38041.84 38078.84 37979.05 38184.16 60045.42 60070.26 60039.74 59927.00 60025.26
TRUE 35412.42 36665.63 36040.28 34647.58 35052.95 33821.74 35919.53 35093.74 34208.37 34676.79
Season
Visiting 2011 2012
FALSE 60000.32 60079.32
TRUE 33474.11 35436.74
[rest removed for brevity]
Looking at the mean data above, we see that 2006 had a huge jump in attendance at home. Curious, I checked Wikipedia and turns out that their new stadium opened on 22 July 2006, with a new capacity of 60,338 seats. Taking a step back, we see that a lot of information can be gleaned by just examining the numbers. On the other hand, it's not always the ideal way to learn about data. Eventually, you want to visualize it to help guide the non-obvious exploration. In the next section, we'll look at few examples on how to do visual analysis.
Using the ggplot2 package
Explaining how to completely use the ggplot2 package is outside the scope of this post, but many tutorials can be found online and in all of the R-based books. The main attraction of using this package is the themes that are practically ready out-of-the-box with the additional ggthemes package. For this entry, we'll stick to one theme, but you can check out the help section on the package to see the various options like the "Economist" theme, the "Wall Street Journal" theme, the "Edward Tufte" them, just to name a few.
Let's assume we want to look at Arsenal's mean goal performance vs all other opponents. We want to calculate the column-wise mean for Arsenal's Goals versus other teams OppGoals. Rather than pulling out the individual columns, we use the colwise ('column wise') functor to help us build another function that then calculates the respective means.
#Arsenal avg # goals vs opponents opponents <- ddply(epl,.(Season), function(x) colwise(mean,c('Goals','OppGoals'))(x)) p <- ggplot(opponents, aes(x=Season)) + geom_line(aes(y=Goals, group=1), linetype=4) + geom_point(aes(y=Goals), size=4, color='red') + annotate("text", x = 2005.4, y = 2.3, label = "Gunner goals") + geom_line(aes(y=OppGoals, group=1), linetype=2) + geom_point(aes(y=OppGoals), size=4, color='gray') + annotate("text", x = 2004, y = 1.04, label = "Opponent goals") + ggtitle("Arsenal Goal Average vs Opponents") + theme_economist() print(p)
In the code above, I've introduced some spaces to make it clear how we treat each column (y=Goals,OppGoals) for all rows (x=Season). The graph this produces is:
From this we can see that, overall, both the Arsenal defense and the offense seem to do their jobs well. Towards the later years, we see a worrying trend in the opponents able to score more goals on Arsenal, so there might be problems in the back line. Let's dive into this issue a bit. Let's look at which opponents are better against Arsenal. We'll plot the ratio of losses versus the goal difference to look at how often a team beats Arsenal and by how many goals.
# Which opponent is stronger? # get winning rate output1 <- ddply(epl, .(Opponent), function(x) sum(x$WinLoss == 'L') / length(x$WinLoss)) # get points difference output2 <- ddply(epl, .(Opponent), function(x) mean(x$OppGoals - x$Goals)) # merge two datasets opponents <- join(output1, output2, by='Opponent') names(opponents)[2:3] <- c('WinningRate','PointsDiff') # !add legend and lines p <- ggplot(opponents, aes(x=PointsDiff, y=WinningRate)) + geom_point(color='#014d64', size=4) + geom_hline(y=0.28, colour='grey20', size=0.5, linetype=2) + geom_vline(x=0, colour='grey20', size=0.5, linetype=2) + geom_point(data=opponents[opponents$WinningRate > 0.3 & opponents$PointsDiff >= 0,], aes(x=PointsDiff, y=WinningRate), color='red3', size=5) + geom_point(data=opponents[opponents$WinningRate > 0.3 & opponents$PointsDiff < 0,], aes(x=PointsDiff, y=WinningRate), color='skyblue', size=5) + annotate("text", x = 0.30, y= 0.35, label = "Chelsea") + annotate("text", x = 0.583, y = 0.44, label = "ManU") + annotate("text", x = 0.35, y = 0.52, label = "Swansea") + annotate("text", x = -1.5, y = 0.36, label = "Leeds") + annotate("text", x = -1, y = 0.48, label = "Sheffield") + labs(x="Mean Goal Difference", y="Winning Rate", title ="Ranking Opponent's Performance vs. Arsenal") + theme_economist() print(p)
At first, looking at the top right corner of the graph, we see the opponents that consistently score more goals or about the same number of goals that Arsenal scores on them, plotted against the opponents' winning percentage. We see the usual top-of-the-table teams with Manchester United and Chelsea, but I almost fell out of my chair when I saw Swansea and Sheffield United. No offense to those teams, they have loyal supporters, but that graph didn't make sense at first. Thinking an error in my calculation, I stared at the data for a bit until it dawned on me that I wasn't taking into account the EPL's relegation system. Teams that consistently perform poorly are booted from the top-flight league down to the next rung, and the best team(s) from that league replace the relegated teams. Without getting crazy, we need to take the number of seasons played against Arsenal into account with respect to performance. I came up with a linear relationship where each loss is multiplied times 3 for heavier weighting of the opponent's win, summed with each draw, and all this multiplied by the total number of seasons played against that opponent. I'm not sure if this is strictly correct, but it made intuitive sense.
# new methodolgy taking draws and # of seasons into account output1 <- ddply(epl, .(Opponent), function(x) { seasons = max(x$Season) - min(x$Season) points <- (sum(x$WinLoss == 'L') * 3 + sum(x$WinLoss == 'D')) * seasons return (points / (length(x$WinLoss) * 4 * 13)) #normalize }) # re-merge two datasets opponents2 <- join(output1, output2, by='Opponent') names(opponents2)[2:3] <- c('WeightedWinningRate','PointsDiff') p2 <- ggplot(opponents2, aes(x=PointsDiff, y=WeightedWinningRate)) + geom_point(color='#014d64', size=4) + geom_hline(y=0.28, colour='grey20', size=0.5, linetype=2) + geom_vline(x=0, colour='grey20', size=0.5, linetype=2) + geom_point(data=opponents2[opponents2$WeightedWinningRate > 0.28 & opponents2$PointsDiff >= 0,], aes(x=PointsDiff, y=WeightedWinningRate), color='red3', size=5) + annotate("text", x = 0.41, y= 0.28, label = "Chelsea") + annotate("text", x = 0.56, y = 0.33, label = "ManU") + annotate("text", x = 0.41, y = 0.032, label = "Swansea") + annotate("text", x = -1.5, y = 0.05, label = "Leeds") + annotate("text", x = -1, y = 0.1, label = "Sheffield") + labs(x="Mean Goal Difference", y="Winning Rate", title ="Ranking Opponent's Performance vs. Arsenal") + theme_economist() print(p2)
Looks good! We see that Leeds, Sheffield, and Swansea now drop in strength against Arsenal where one might expect. Turns out that Swansea only played against Arsenal four times over the 10+ seasons, with two draws, which explains Swansea's initial strength in the first chart. I glossed over some details about how I built the charts, as in the text for individual data points doesn't automatically populate itself like in many other charting packages. This can be a little tedious to line up the label with the data point, but it does provide some presentation flexibility. I was also pleased with how nice the charts look, and flexibility to switch to a cleaner design, like the Tufte theme, when needed.
Conclusion
This concludes this series of posts looking at how to clean up data and plot some basic charts using R. It was really interesting to see how easy the plyr package makes data manipulation easy with a few lines of code. The important part to remember is that, due to how terse this code can become, its important to test each step and verify you're getting the results you expect. As my professors used to say, "Think before you code."