Meta Kaggle Shiny App
Contributed by Hayes Cozart. He is currently enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which runs from April 11th to July 1st, 2016. This post is based on his second class project, a Shiny R visualization (due in the 4th week of the program).
May 15, 2016
What is Kaggle?
Kaggle is a community of data scientists who compete against each other to solve complex data science problems. Competitors are ranked by how well their predictive models determine the outcomes of those problems.
Using Kaggle's Meta Kaggle dataset, which has information on these competitions along with other data from the site, I designed an app to explore why certain teams competing in Kaggle competitions might perform better than others. I wanted the app to answer two questions: Is the size of a team associated with a higher rank? And is a team's experience associated with the team's rank?
The following is an early version of the app, which I will update over time: Shiny App
What can we learn from the app?
My main concern with this app was making sure that users could glean information from it without my presence. This is why the first tab of the app, shown below, is a foreword telling the user about Kaggle and the data I used to make the app. One of the main details in the foreword is that my data only includes users who have participated in Kaggle competitions. This may seem like an obvious decision, but it is important to let users know what data they are actually viewing.
The next tab is a graph of ranks in Kaggle competitions by team size. Team size is weighted by population, meaning each group is shown relative to its own total number of teams, so that the different groups can be compared on equal terms. On the left side of the tab you can choose which ranks you want to look at out of the top hundred. Below this scale you can choose which team sizes you want to view. Currently, you can choose any of the team sizes and compare them. I designed this feature to give the user freedom to compare different possibilities. However, the data does not say much about team sizes beyond 3 and 4, because fewer teams of those sizes competed. For this reason, in a future version of the app, I will change it so that you can either view all team sizes or group some of them to make the information more understandable.
If you want to compare team sizes, you can select the two groups you are interested in and select ranks one to a hundred. This will let you view their trends. The graph shows that teams with only 1 member stay fairly constant across ranks, while teams of two are more likely to get a higher rank.
In the middle of the tab you can choose to compare the team sizes either by the percent of the team size that got that rank or proportionally. Below this selection there are some tooltips to help the user. Finally, on the right-hand side is a table that shows how many teams of each size there were in total across all the competitions. It is there to help the user interpret the percentages and not be misled by high percentages for groups with a small number of teams.
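As a rough illustration of the percent view described above, the following is a minimal sketch in Python rather than the app's own R code. The records are made up, and the name `percent_at_rank` and the `(team_size, rank)` layout are assumptions for illustration, not the structure of the Meta Kaggle tables.

```python
from collections import Counter

# Hypothetical (team_size, rank) records; made-up data, not Meta Kaggle.
teams = [
    (1, 1), (1, 2), (1, 2), (1, 3),
    (2, 1), (2, 1),
]

def percent_at_rank(records, min_rank=1, max_rank=100):
    """Percent of all teams of a given size that finished at each rank.

    The denominator is the total number of teams of that size, so small
    and large groups are put on the same scale.
    """
    size_totals = Counter(size for size, _ in records)
    rank_counts = Counter(
        (size, rank) for size, rank in records if min_rank <= rank <= max_rank
    )
    return {
        (size, rank): 100.0 * count / size_totals[size]
        for (size, rank), count in rank_counts.items()
    }

shares = percent_at_rank(teams)
```

In this toy data only two teams have two members, so a single result moves that group's percentage by 50 points. That is the situation the table of team counts is meant to warn about.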
The third tab shows very similar information to the second tab, except that you can choose which specific competition you would like to view. It also gives a brief description of the competition and tells you when it started. You can type in the selection bar to find a specific competition. There are several changes I would like to make to this tab in the next iteration of the app. First, I want to find a way to filter the list so that competitions without the selected team sizes cannot be chosen. Next, I want to allow the user to select multiple competitions or otherwise compare competitions against each other.
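One possible way to approach the competition filter mentioned above is sketched below in Python, not the app's R code. The competition names, the `competition_sizes` mapping, and the function name are all hypothetical. Requiring every selected team size to be present, rather than any one of them, is also an assumption.

```python
# Hypothetical mapping from competition name to the team sizes seen in it.
competition_sizes = {
    "Competition A": {1, 2, 3},
    "Competition B": {1},
    "Competition C": {1, 2},
}

def competitions_with_sizes(competition_sizes, selected_sizes):
    """Return, sorted by name, the competitions containing every selected size."""
    wanted = set(selected_sizes)
    return sorted(
        name for name, sizes in competition_sizes.items() if wanted <= sizes
    )

# Only competitions that have both solo teams and two-person teams remain.
selectable = competitions_with_sizes(competition_sizes, [1, 2])
```

The resulting list could then be used to populate the competition selection bar, so that the user never picks a competition that would produce an empty graph.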
The fourth tab is designed exactly like the second, except that it shows the number of competitions the team leader has participated in. I chose the team leader because, while working with the data, I found that teams are formed per competition, so the same team never appears in multiple competitions. The leader of each team, however, is information I had access to in the data set. The tabs are designed so similarly so that the user can go from tab to tab and easily understand the information given. A disadvantage I can see in this format is that the user might get confused about what differs between the tabs. The main observation I gleaned from this information is that a team leader who has been in more competitions does seem to be associated with a higher rank. However, at around 50 to 60 competitions there seems to be a drop-off. This could be because too few team leaders have participated that many times to show a real difference.
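The leader-based experience measure can be sketched as follows, again in Python rather than the app's R code and with made-up rows. The `(team_id, competition_id, leader_id, rank)` layout and the function name are assumptions. The sketch counts every competition a leader appears in, which may differ from how the app counts them.

```python
from collections import defaultdict

# Hypothetical rows: (team_id, competition_id, leader_id, rank).
rows = [
    ("t1", "c1", "alice", 5),
    ("t2", "c2", "alice", 3),
    ("t3", "c2", "bob", 10),
    ("t4", "c3", "alice", 1),
]

def leader_experience(rows):
    """Count the distinct competitions each team leader has taken part in."""
    competitions = defaultdict(set)
    for _team, competition, leader, _rank in rows:
        competitions[leader].add(competition)
    return {leader: len(comps) for leader, comps in competitions.items()}

experience = leader_experience(rows)

# Attach the leader's experience to each team, since teams themselves
# exist for only one competition.
team_experience = [
    (team, experience[leader], rank) for team, _comp, leader, rank in rows
]
```

Because each team exists for a single competition, the leader's count is the only experience figure that carries over from one competition to the next in this sketch.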
The final tab looks at the same information as tab four, but once again I give the user the ability to look at specific competitions. There is not much more to say about this tab beyond what I said above. Because I am looking at the competitions a leader has participated in, there need to be past competitions for this information to reveal anything new. In later competitions, experience seems to have more of an effect than what is shown in the total view. With time being so important to this information, I believe I should look at differences over time and let the user compare different time periods instead of specific competitions. This is something important to look into in the future.
There is a lot of information in this data set, and I will most likely return to it to see what else can be gleaned from it. I looked at the data in several ways before deciding to represent it this way. The main one I want to note is a version in which, instead of weighting the data by the total population of each group, I weighted it by the rankings the user selected. I did this because I thought it would be interesting to see what differences there might be between the selected populations and the total. What I found is that there was not much difference between the two. Though this is an interesting finding, it would not add to what a user could learn, so I took it out of the final product. I hope this app is helpful, and I look forward to refining it as time goes by. Just know that when taking part in a Kaggle competition, experience is important, and it helps to find a good small team to work with.