Popularity at Kaggle.com

Rifat Dincer
Posted on Feb 4, 2019

Background:
Kaggle is an online community of data scientists and machine learners, owned by Google. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Scraping Work:
For this project, I wanted to scrape a website that hadn't been scraped many times before. I started with Kaggle's JSON calls, but they made collecting the data so easy that it wasn't really web scraping per se, so I dropped that idea. I then tried Scrapy and Beautiful Soup, and finally settled on Selenium because the infinite scroll on the Kaggle Kernels page caused problems with the other frameworks. You can see my code here.

Data Collected:
I scraped over 5,000 kernels, sorted by vote count, as well as ~1,500 user pages. I also collected ~100K kernels as supporting data to combine with my scraped data so that I could see more fine-grained results. The data included:

  • Kernels
    • User Name
    • Author Performance Tier
    • Best Public Score
    • Language
    • Execution Time
    • Medal
    • Output Types
    • Date Created
    • Kernel Name
    • Comment Count
    • Vote Count
  • Users (Utilized the user profile URL from the kernels I scraped)
    • User Name
    • Date Joined
    • Follower Count
    • Title
    • Location
    • User Type
    • Bio

Challenges:

My first try with Scrapy left much to be desired: the website uses infinite scroll, so I would have needed a Scrapy plugin called Scrapy Splash. My research into Splash suggested that it wasn't very easy to use or efficient, so I needed another framework that would work more quickly and efficiently. Thus, Selenium came into the mix.
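As a rough illustration of how infinite scroll can be handled with Selenium, here is a minimal sketch, not the code used for this project. It relies only on the WebDriver `execute_script` method; the function name `scroll_until_stable`, its parameters, and the `KERNELS_URL` constant in the usage comment are assumptions made for this example.

```python
import time


def scroll_until_stable(driver, pause_seconds=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page requests its next batch of items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause_seconds)  # give the new items time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume we reached the end
        last_height = new_height
    return rounds


# Hypothetical usage (requires Selenium and a browser driver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(KERNELS_URL)  # KERNELS_URL is a placeholder for the Kernels page
# scroll_until_stable(driver)
```

The `max_rounds` cap is there so the loop cannot run forever on a page that keeps growing, and the pause between scrolls also keeps the request rate low.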

Another challenge was uncovering the tags. I had difficulty at first because of the way Kaggle's HTML is structured: the tags are buried deep under other tags. Consequently, I didn't always get the text or attributes I needed. With a little tweaking and a new loop, I started getting the data I needed, albeit slowly, as I didn't want to overload the website with my requests.


Scraping Work:
I wanted to scrape both the top Kernels page and the profiles of the users who submitted those Kernels. To achieve that, I created two loops. The first scraped the Kernels page while collecting the profile URLs of the users, and the second scraped the user pages using the URLs from the first loop. This approach made it easy to track what was being scraped while maintaining data integrity.
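The two-loop structure can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's actual code: `kernel_cards` stands for kernel records already extracted from the page as dictionaries, the `profile_url` key is a made-up field name, and `fetch_profile` is a hypothetical callable that loads and parses one user page.

```python
import time


def scrape_kernels(kernel_cards):
    """First pass: keep every kernel record and collect unique profile URLs."""
    kernels = []
    profile_urls = []
    seen = set()
    for card in kernel_cards:
        kernels.append(card)
        url = card.get("profile_url")  # hypothetical field name
        if url and url not in seen:
            seen.add(url)
            profile_urls.append(url)  # preserve first-seen order
    return kernels, profile_urls


def scrape_users(profile_urls, fetch_profile, delay_seconds=1.0):
    """Second pass: visit each profile URL once, pausing between requests."""
    users = []
    for url in profile_urls:
        users.append(fetch_profile(url))  # fetch_profile is supplied by the caller
        time.sleep(delay_seconds)  # stay polite to the website
    return users
```

Collecting the URLs first means each profile is fetched only once, even when the same user wrote several of the top kernels.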

Analysis:
Once the data was cleaned, the first step was to look at the correlations between the numeric fields to see what I should focus on in my analysis.

As expected, the vote count and the comment count are positively correlated. I thought about including the follower count of users in this analysis but decided against it because only a handful of top users have a decent number of followers, and even those numbers are too small to be meaningful.
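For readers who want to see what such a correlation check computes, here is a small sketch of the Pearson correlation coefficient in plain Python. The numbers are invented for illustration and are not taken from the scraped data; the function name `pearson` is also just a label chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up vote and comment counts, only to show the call.
votes = [10, 20, 30, 40]
comments = [1, 2, 3, 4]
r = pearson(votes, comments)  # close to 1.0: the two columns rise together
```

With the data in a pandas DataFrame, `DataFrame.corr()` computes the same coefficient for every pair of numeric columns by default.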

From the ~100K kernels I used in my analysis, it was interesting to see that there are 5 times more tier 5 users than tier 4 users. Kaggle defines performance tiers based on the amount of work a user puts into kernels, competitions, and discussions.

It is not surprising that Kaggle has more kernels written in Python, as Python is arguably the most popular language for data science.

Vote count per performance tier

This graph shows the mean vote count per performance tier. This is very interesting, as I'd expect tier 5 users to collect more votes than other users. In the second part of this project, I will look more deeply into the reasons for this finding.

The graph above shows the mean vote count for each language. Unexpectedly, kernels written in R collected more votes on average than kernels written in Python. What we learn here is that even though Python is the most widely used language in kernels, that popularity doesn't always translate into more votes. There may be outliers that skew the data, so I will spend more time on this in the second part of my project.
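One simple way to check whether outliers are driving a result like this is to compare the mean with the median for each group. The sketch below uses invented numbers, not the scraped data, and the `language` and `votes` field names and the `votes_by_group` helper are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def votes_by_group(kernels, key):
    """Summarize vote counts per group with mean, median, and count."""
    groups = defaultdict(list)
    for kernel in kernels:
        groups[kernel[key]].append(kernel["votes"])
    return {
        group: {"mean": mean(votes), "median": median(votes), "count": len(votes)}
        for group, votes in groups.items()
    }


# Made-up records: one heavily upvoted R kernel pulls the R mean far
# above its median, while the Python mean and median agree.
sample = [
    {"language": "R", "votes": 2},
    {"language": "R", "votes": 4},
    {"language": "R", "votes": 300},
    {"language": "Python", "votes": 10},
    {"language": "Python", "votes": 12},
    {"language": "Python", "votes": 14},
]
summary = votes_by_group(sample, "language")
```

If a group's mean sits far above its median, a few very popular kernels are likely responsible for most of the difference. The same helper works for performance tiers by passing a different `key`.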

Another surprising result: for visualizations, the Kaggle community prefers R over Python. Again, this is an area I'll be investigating further.

In the last part of my analysis, I wanted to create a map of the locations of the top 1000 Kaggle users. The map is meant to be interactive, and the interactive version will be added to this post later. In the second part of my project, I will do a deeper dive and create more maps that show where different user groups (based on their Kaggle status) contribute to Kaggle.

To be investigated further (with more data):

  • How do the follower and following counts impact the distribution of votes and comments a user receives?
  • More detailed user tier distribution
  • Distribution of vote count per user tier, and why do Tier 4 users collect more votes per kernel?
  • Why do kernels written in R collect more votes than kernels written in Python? Are there outlier kernels that have an impact on this finding?
  • Why do R users create more visualizations than Python users?
  • The distribution of user types from each country. For example, does the USA produce more kernel experts than India, or vice versa?


About Author

Rifat Yuce Dincer (U.J), spent the last 10 years in business development working for AT&T, Salesforce & HackerRank. He worked with companies that ranged from small startups to large enterprises by partnering with their C suite and solving...