Analysis of Olympic Games by web scraping
Contributed by Shuo Zhang. She is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from July 5th to September 23rd, 2016.
Introduction
I really like to watch the summer Olympics. It’s simply breathtaking to watch the world's best athletes compete in the various sports! I also love the Olympics because of the plethora of data available. From judging to timing to preliminary rounds to finals to the various Olympic records, there is data for every sport and country at the Olympic Games. Most of it, especially more recent data, is free and easy to find. The Rio Olympics are almost over, but we can be confident of one thing: the US, the United Kingdom and China will top the medals table when it's all over. One question we might want to ask is why these countries are so successful. Have the Olympic Games achieved gender equality among competitors? Does age affect the number of medals athletes win? Data can help answer these questions.
Web Scraping
Past summer Olympics data can be found at: http://www.sports-reference.com/olympics/.
The website is well organized and consistently structured. I wrote a web scraper in Python using the package "Beautiful Soup" and extracted the following data for further analysis:
- Events of country medal leaders across 28 summer Olympics: 1188 in total
- Events of country medal leaders of 3 sports among 28 summer Olympics: 1141 in total
- Events of sports, athletes and medalists of 84 countries during 2012 Olympics: 9457 in total
Here's an example of the code used to do the scraping:
https://gist.github.com/shuozhang1985/388551d7417d5877219462e23ba39f44
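As a rough illustration of the table-extraction step (not the author's actual scraper, which is in the gist above), the sketch below turns an HTML table into a list of records. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup; with Beautiful Soup the same idea would be expressed by looping over `soup.find_all("tr")`. The HTML snippet, the column names and the `parse_medal_table` helper are all hypothetical, and the real page layout may differ.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a medal table; placeholder values only.
SAMPLE_HTML = """
<table id="medals">
  <tr><th>Country</th><th>Gold</th><th>Silver</th><th>Bronze</th></tr>
  <tr><td>AAA</td><td>3</td><td>2</td><td>1</td></tr>
  <tr><td>BBB</td><td>0</td><td>1</td><td>4</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_medal_table(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_medal_table(SAMPLE_HTML)
for record in records:
    print(record)
```

In a real scraper the HTML would come from an HTTP request to each results page, and the numeric columns would be converted to integers before analysis.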
Data Exploration and Modeling
What affects the total medals won by different countries?
Analyzing the number of Olympic medals won by geographic region by year reveals the true impact and extent of medal diversification. For example, whilst more countries are winning Olympic medals, how many medals are they capturing compared to traditionally strong Olympic nations? Is their success fairly minor or more pronounced? What are the possible contributing factors to their success in the Olympics? I listed the history of total medals won by the top 13 leading countries.
This graph shows that since the first modern Olympic Games, the landscape of medal-winning nations has changed markedly. Before World War II, Olympic success was dominated by the United States and Europe. Afterwards, more African and Asian countries began to participate in the Olympics, and the medal standings were marked by the arrival and growth of many countries including Japan, South Korea, China and Hungary. Here are three factors that affect medal standings, as revealed by the analysis:
Past Olympic success: Medals won in the past can be seen as an indicator of a "sports culture". The United States, for example, consistently performs well. Sporting prowess is important there, so many people take part.
Host-country effect: The United States hosted the 1904 Olympics and won 231 medals compared to 48 at the previous games. The phenomenon occurs again and again. For instance, China hosted the 2008 Olympics and collected 100 medals compared to 63 at the previous Olympics. This is a recognized pattern. Performing in front of a home crowd combined with extra investment in sport gives the host country a medals boost.
Future-host effect: Australia won 27 medals in 1992 followed by 41 medals four years later. This was probably due to increased investment in sport in the run-up to the 2000 Sydney Games. The UK, as another example, increased its medal haul from 30 to 47 between 2004 and 2008, prior to hosting the 2012 Games.
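To put the host and future-host examples above on a common scale, the short sketch below computes the percentage change in medals for each pair of figures quoted in the text. The labels and the `percent_change` helper are my own; the medal counts are the ones given above.

```python
# (label, medals at the earlier Games, medals at the later Games),
# using only the counts quoted in the text.
examples = [
    ("USA, hosting in 1904", 48, 231),
    ("China, hosting in 2008", 63, 100),
    ("Australia, 1992 to 1996 (before Sydney 2000)", 27, 41),
    ("UK, 2004 to 2008 (before London 2012)", 30, 47),
]


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


changes = {label: percent_change(before, after) for label, before, after in examples}

for label, change in changes.items():
    print(f"{label}: {change:+.1f}%")
```

Every example shows an increase of more than 50%, which is consistent with the host and future-host effects described above, although four data points cannot establish them on their own.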
This graph also suggests that some national indicators, such as GDP, population, GDP growth and life expectancy, may play a role in the battle for Olympic medals:
Wealth: Countries with a high GDP, like Germany or the USA, can afford to invest in sports facilities and their populations have enough leisure time and money to take part in sports. This may not be the case in poorer countries.
Population: A big population means a big talent pool to choose athletes from - in China's case, 1.36 billion people.
Planned economies: Countries with high GDP growth tend to invest more in sport because they value the prestige that sporting success brings. China is a good example.
Health: Countries with a high life expectancy, such as Japan, have a large pool of healthy people to choose athletes from.
To take a closer look, I extracted the data from the 2012 Olympics - including the medal-winning record of all the participating countries - and analyzed the relationship between the total medals won by each country and its population, GDP, GDP growth and life expectancy.
This correlation graph shows that the total number of medals won is primarily correlated with population, and that other variables such as GDP and life expectancy also affect it. The scatterplots also show a cluster pattern among the five variables. I therefore applied a K-means clustering model to find the underlying pattern.
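For readers who want to reproduce this kind of correlation table, here is a minimal sketch of a pairwise Pearson correlation matrix in plain Python. The column names are hypothetical and the numbers are made-up placeholders chosen to give obvious results; they are not the 2012 data and say nothing about real countries.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder values only, NOT real country data.
columns = {
    "medals": [2, 4, 6, 8, 10],
    "population": [1, 2, 3, 4, 5],
    "gdp_growth": [5, 4, 3, 2, 1],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

With real data, population and GDP are heavily skewed across countries, so a log transform before computing the correlations may be worth considering.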
K-means clustering
K-means clustering is an unsupervised learning algorithm that groups data points based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. The key to K-means clustering is determining the number of clusters, K. I used the average silhouette method to determine K: for each candidate K, the data is clustered, and the K with the highest average silhouette width is chosen.
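The sketch below shows the idea of choosing K by average silhouette on a tiny two-dimensional toy dataset. It is not the analysis from this post: the points are made up, the K-means implementation is a bare-bones one written for illustration, and it uses a deterministic farthest-point initialization (my assumption) instead of the usual random starts so that the result is reproducible.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Plain K-means; returns one cluster label per point."""
    # Farthest-point initialization: start from the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def average_silhouette(points, labels):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

scores = {k: average_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

On real country data the features sit on very different scales (population versus life expectancy, for example), so they would normally be standardized before clustering.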
The graph indicates that the optimal number of clusters is 5. I chose the two dimensions that capture the most variation in the data and plotted the data to see the clusters.
The distribution of the 5 clusters by different variables is illustrated below. Cluster 5 has the highest average total medals, a relatively high average population, a relatively high average GDP per capita, a relatively low average GDP growth and a relatively long average life expectancy. Cluster 4 has the lowest average total medals, a relatively small average population, the lowest average GDP per capita, the highest average GDP growth and the shortest average life expectancy. Which countries belong to these clusters?
To get a clear view of the country distribution, I labeled only the top 5 countries in the total medal standings in clusters 2, 3 and 4 in the following graph. We can see that K-means clustering separates the countries well into developing and developed countries. While developed countries win the most medals, China, as a developing country, catches the eye with its successful performance in the Olympics.
Applying the same K-means clustering to the data from the 2008 Olympics shows the same pattern in the graph below.
The analysis can be found in the following link:
https://gist.github.com/shuozhang1985/a646354e92dadd010f08cc63f0d70a78
Which countries dominate the 3 traditional sports: athletics, swimming and gymnastics?
I broke down which countries are likely to excel in which sports by examining their past performance across 28 summer Olympics, using a points total in which a gold medal is worth three points, silver two and bronze one.
athletics:
swimming:
gymnastics:
The three graphs show that the United States, Russia and Jamaica are likely to dominate athletics events; the United States, Australia and China excel in swimming; and athletes from China, the United States and Russia are strong in gymnastics.
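The points weighting used for these rankings (gold 3, silver 2, bronze 1) can be sketched as follows. The input rows are placeholders with made-up country codes, and the `medal_points` helper is hypothetical rather than taken from the author's code.

```python
from collections import defaultdict

# Weights as described in the text.
POINTS = {"gold": 3, "silver": 2, "bronze": 1}


def medal_points(results):
    """Sum weighted medal points per country, highest total first."""
    totals = defaultdict(int)
    for country, medal in results:
        totals[country] += POINTS[medal]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder rows: (country code, medal type). Not real results.
results = [
    ("AAA", "gold"), ("AAA", "silver"),
    ("BBB", "silver"), ("BBB", "bronze"),
    ("CCC", "bronze"),
]

ranking = medal_points(results)
print(ranking)
```

Filtering the rows to a single sport before calling the function would give a per-sport ranking of the kind plotted above.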
The analysis can be found in the following link:
https://gist.github.com/shuozhang1985/df2bdcb3ee74827de0cd02e530072ddf
Does the Olympics show gender equality?
This graph shows that more women participate in the Olympics. My focus is on the popular Olympic sports. What differences remain between the ways that male and female athletes are involved in Olympic competitions? I analyzed all of the men’s and women’s events at the London 2012 Olympics to identify gender differences in the structure and rules of the sports, and in the opportunities for male and female athletes.
There were 132 women’s events and 162 men’s events at the London 2012 Olympics. Of these, 57 events on the program were gender exclusive (i.e. there were medal opportunities for men but not for women, or vice versa): 42 events were open only to men (25.9% of men’s events), and 15 events were open only to women (11.4% of women’s events). Together, these exclusive events made up 18.9% of the Olympic program.
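The percentages above can be checked with a few lines of arithmetic. One assumption is needed: the 18.9% figure only works out if the denominator is the 302 medal events mentioned later in this post, not the sum of the men's and women's event counts.

```python
# Event counts quoted in the text.
men_events, women_events = 162, 132
men_only, women_only = 42, 15
total_events = 302  # assumption: total medal events at London 2012

exclusive = men_only + women_only
share_men_only = 100 * men_only / men_events
share_women_only = 100 * women_only / women_events
share_exclusive = 100 * exclusive / total_events

print(f"gender-exclusive events: {exclusive}")
print(f"men-only share of men's events: {share_men_only:.1f}%")
print(f"women-only share of women's events: {share_women_only:.1f}%")
print(f"exclusive share of all events: {share_exclusive:.1f}%")
```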
This graph shows the distribution of these exclusive events by sport. For example, there are more men's events than women's in wrestling and canoeing, while there are more women's events than men's in synchronized swimming and rhythmic gymnastics.
This graph shows that a total of 10,903 athletes competed in the 302 medal events: 6,068 men and 4,835 women, so 1,233 more men than women competed in London. Male athletes competed for 991 medals, while female athletes competed for 872. In conclusion, the data shows that there is still some way to go to achieve gender equality.
Does age impact the number of medals?
I wanted to dig a little deeper, so I created density plots of the distribution of athletes’ age by 4 types: all athletes, athletes who won gold medals, athletes who won silver medals, and athletes who won bronze medals.
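A density plot of this kind is typically a kernel density estimate of the ages in each group. The sketch below shows a minimal Gaussian kernel density estimator in plain Python; the ages are made-up placeholders, the bandwidth of 2 years is an arbitrary choice, and `gaussian_kde` is a hypothetical helper, not the plotting code used for this post.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Placeholder ages per group, NOT real athlete data.
ages = {
    "all athletes": [17, 19, 22, 24, 24, 27, 31, 36],
    "gold medalists": [19, 22, 24, 24, 27, 31],
}

densities = {group: gaussian_kde(values, bandwidth=2.0) for group, values in ages.items()}

# Evaluate each curve on a grid of ages; these values are what would be plotted.
step = 0.1
grid = [i * step for i in range(0, 601)]  # ages 0.0 to 60.0
curves = {group: [f(x) for x in grid] for group, f in densities.items()}

# A density should integrate to roughly 1 over the grid.
areas = {group: sum(curve) * step for group, curve in curves.items()}
print(areas)
```

Plotting one curve per medal type on the same axes makes it easy to compare where each group's ages concentrate.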
You can see that very young medalists (early teens), and older medalists (late 30s-over 40) tend to win fewer medals than medalists within the “sweet spot” of late teens-early 30s.
Some sports, like equestrianism, have older athletes winning medals, whereas a sport like gymnastics has a peak age-range of early-to-late teens and early 20s.
Which sports produce the most medals in 2012 Olympics?
We can see which sports produce the most medals. Athletics is number one, followed by swimming, rowing, and football. So if you’re an aspiring Olympian, but you’re not quite sure what sport to train for, you’ll increase your chances of medaling if you choose athletics.
The analysis can be found in the following link:
https://gist.github.com/shuozhang1985/da03c78e983b56f305650c4ac628b719
Future work
If I had more time, I would apply K-means clustering to all the data from 1960-2004 to investigate whether the same pattern exists and, if there is a difference, what explains it.