Meta Kaggle Shiny App

Hayes Cozart
Posted on May 16, 2016

Contributed by Hayes Cozart. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from April 11th to July 1st, 2016. This post is based on his second class project, a Shiny R visualization (due in the 4th week of the program).

What is Kaggle?

Kaggle is a community of data scientists who compete with each other to solve complex data science problems. Competitors are ranked based on how well their predictive models are able to determine the outcomes of various problems.

Using Kaggle's Meta Kaggle dataset, which has information on these competitions along with other data from the site, I designed an app to better explore why certain teams competing in Kaggle competitions might perform better than others. The two questions I wanted to answer with this app were: Is the size of a team associated with getting a higher rank? And is a team's experience associated with the team's rank?

The following is an early version of the app, which I will update over time: Shiny App

What can we learn from the app?

My main concern with this app was making sure that information could be gleaned from it without my presence. This is why the first tab of the app, shown below, is a foreword telling the user about Kaggle and what data I used to make this app. One of the main details is that my data only included users who had participated in Kaggle competitions. This may seem like an obvious decision, but it is important to let the user know what data they are actually viewing.

First Tab

The next tab is a graph showing ranks in the Kaggle competitions by the size of the team. The team size variable is weighted by population so that the different groups can be compared equally. On the left side of the tab you can choose which ranks you want to look at out of the top hundred. Below this scale you can choose which team sizes you want to view. Currently, you can choose any of the team sizes and compare them. I designed this feature to give the user freedom to compare different possibilities. However, the data does not tell much about team sizes beyond 3 and 4 because a smaller number of those teams competed. For this reason, in a future version of this app, I will change it so that you can either view all of them or group some of the team sizes to make the information more understandable.

If you want to compare team sizes, you can select the two groups you are interested in and select ranks one to a hundred. This will allow you to view their trends. The graph shows that teams with only 1 member stay pretty constant, while a team of two is more likely to get a higher rank.

In the middle of the tab you can choose to compare the team sizes either by the percent of teams of that size that got a given rank or proportionally. There are also some useful tooltips below this selection to help the user. Finally, on the right-hand side is a table that tells you how many teams of each size there were in total across all the competitions. This is here to help the user understand the percents for each group and not be misled by high percents for groups with a small number of teams.

Second Tab
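The population weighting described for this tab can be sketched in a few lines. This is only an illustration, written in Python rather than the R used for the app, and it assumes that "weighted by population" means dividing the number of teams of a given size at each rank by the total number of teams of that size. The sample records and the function name share_at_rank are hypothetical and are not taken from the Meta Kaggle data.

```python
from collections import Counter

# Hypothetical sample records: (team_size, final_rank). Not real Meta Kaggle data.
teams = [
    (1, 1), (1, 2), (1, 2), (1, 3),
    (2, 1), (2, 1),
]

def share_at_rank(teams, rank_min=1, rank_max=100):
    """For each team size, map rank -> percent of ALL teams of that size at that rank."""
    # Denominator: number of teams of each size across every rank (the "population").
    totals = Counter(size for size, _ in teams)
    # Numerator: teams of each size at each rank, restricted to the selected rank range.
    counts = Counter((size, rank) for size, rank in teams
                     if rank_min <= rank <= rank_max)
    shares = {}
    for (size, rank), n in counts.items():
        shares.setdefault(size, {})[rank] = 100.0 * n / totals[size]
    return shares

shares = share_at_rank(teams)
```

Because each size is divided by its own total, a group with very few teams can show a large percent from a handful of results, which is why the table of team counts on the right-hand side matters.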

The third tab shows very similar information to the second tab, except that you can choose specifically which competitions you would like to view. It also gives a brief description of the competition and tells you when that competition started. You can also type in the selection bar to find a specific competition. This tab has many changes that I would like to make in the next iteration of this app. First, I want to figure out a way to filter out competitions that do not have the team sizes that have been selected on the right. Next, I want to allow some way to compare multiple competitions against each other or allow the user to select multiple competitions.

Third Tab
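One possible way to approach the competition filter mentioned above is sketched here. It is not part of the app; it is a Python illustration that assumes a competition should stay in the list only if it has at least one team of every selected size. The records and the function name competitions_with_sizes are hypothetical.

```python
# Hypothetical records: (competition_name, team_size). Not real Meta Kaggle data.
records = [
    ("Comp A", 1), ("Comp A", 2),
    ("Comp B", 1),
    ("Comp C", 3),
]

def competitions_with_sizes(records, selected_sizes):
    """Return competitions that contain every one of the selected team sizes."""
    sizes_by_comp = {}
    for comp, size in records:
        sizes_by_comp.setdefault(comp, set()).add(size)
    wanted = set(selected_sizes)
    # Keep a competition only if the wanted sizes are a subset of the sizes it has.
    return sorted(comp for comp, sizes in sizes_by_comp.items() if wanted <= sizes)

both_sizes = competitions_with_sizes(records, [1, 2])
solo_only = competitions_with_sizes(records, [1])
```

Requiring any one of the selected sizes instead of all of them would be an equally reasonable rule; that would mean testing whether the two sets intersect rather than using the subset check.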

This tab is designed exactly the same as the second, except that it now shows the number of competitions the team leader has participated in. I chose the team leader because I found through working with the data that teams are formed per competition, so the same team is not part of multiple competitions. However, the team leader of each team is information I had access to in the data set. The tabs are designed so similarly so that the user can go from tab to tab and easily understand the information given. A disadvantage I can see in choosing this format is that the user might get confused about what is different between each tab. The main observation I gleaned from this information is that a team leader who has been in more competitions does seem to be associated with a higher rank. However, at around 50 to 60 competitions there seems to be a drop-off. This could be because there are not enough team leaders who have participated that many times to show a real difference.

Fourth Tab
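The experience measure used in this tab can be illustrated with a short sketch. It is written in Python rather than R and assumes that experience is the number of distinct competitions in which a user appears as a team leader, counted over the whole data set. The field layout, the sample rows, and the function name leader_experience are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (team_id, competition_id, leader_user_id, rank).
entries = [
    ("t1", "c1", "u1", 5),
    ("t2", "c2", "u1", 3),
    ("t3", "c2", "u2", 10),
    ("t4", "c3", "u1", 1),
]

def leader_experience(entries):
    """Count the distinct competitions each team leader has taken part in."""
    competitions = defaultdict(set)
    for _team, competition, leader, _rank in entries:
        competitions[leader].add(competition)
    return {leader: len(comps) for leader, comps in competitions.items()}

experience = leader_experience(entries)

# Teams exist for a single competition, so experience is attached through the leader.
team_experience = {team: experience[leader] for team, _comp, leader, _rank in entries}
```

Counting over the whole data set gives a leader the same value in their first competition as in their last. Counting only the competitions that came before each entry would be a different, time-aware measure.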

The final tab looks at the same information as the fourth tab, but once again gives the user the ability to look at specific competitions. There is not much more for me to say about this tab beyond what I said earlier. Because I am looking at the competitions that a leader has participated in, there need to be past competitions for this information to give any new indicators. In later competitions, experience seems to have more of an effect than what is shown in the total view. With time being so important to this information, I believe I should look at differences over time and allow the user to compare different time periods instead of specific competitions. This is something important to look into in the future.

Fifth Tab

Closing Remarks

There is a lot of information in this data set, and I will most likely return to it to see what else can be gleaned from it. I looked at the data in different ways before deciding to represent it in this way. The main one I want to note is a version in which, instead of weighting the data by the total population of each group, I weighted the data by the rankings the user selected. I did this because I thought it would be interesting to see what differences there could be between the selected populations and the total. What I found is that there was not much difference between the two. Though this is an interesting finding, it was not something that would add to what a user could learn, so it was taken out of the final product. I hope this app is helpful, and I look forward to refining it as time goes by. Just know that when taking part in a Kaggle competition, experience is important, and it helps to find a good small team to work with.
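The two weighting schemes compared above differ only in the denominator. The sketch below is a Python illustration, not the app's code. It assumes that weighting by the total population means dividing by all teams of a given size, and that weighting by the selected rankings means dividing only by the teams of that size whose rank falls inside the selected range. The sample data and the function name percent_at_rank are hypothetical.

```python
from collections import Counter

# Hypothetical sample records: (team_size, final_rank). Not real Meta Kaggle data.
teams = [(1, 1), (1, 2), (1, 50), (1, 150), (2, 1), (2, 200)]

def percent_at_rank(teams, rank_min, rank_max, weight_by="total"):
    """Percent of teams of each size at each rank, under one of two denominators."""
    selected = [(size, rank) for size, rank in teams if rank_min <= rank <= rank_max]
    if weight_by == "total":
        # Denominator: every team of that size, whatever its rank.
        denominator = Counter(size for size, _ in teams)
    elif weight_by == "selected":
        # Denominator: only teams of that size inside the selected rank range.
        denominator = Counter(size for size, _ in selected)
    else:
        raise ValueError("weight_by must be 'total' or 'selected'")
    counts = Counter(selected)
    return {key: 100.0 * n / denominator[key[0]] for key, n in counts.items()}

by_total = percent_at_rank(teams, 1, 100, weight_by="total")
by_selected = percent_at_rank(teams, 1, 100, weight_by="selected")
```

On this toy data the two versions differ noticeably because the sample is tiny; the post reports that on the real data they did not differ much.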

About Author

Hayes Cozart

Rigorous analysis has been the foundation for Hayes’s educational and professional experiences to date, as an undergraduate psychology major at the College of William and Mary, and more recently in his work as a Pricing and Data Analyst...

