Data Study on the Popularity of Kaggle.com

Posted on Feb 4, 2019

Background:

Kaggle is an online community of data scientists and machine learners, owned by Google. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Data Scraping Work:

For this scraping project, I wanted to scrape a website that had not already been scraped many times. I started with the site's JSON calls, but they made the job so easy that it was not really web scraping per se, so I scrapped that idea. I then tried Scrapy and Beautiful Soup, and finally settled on Selenium because the infinite scroll on the Kaggle Kernels page was causing problems for the other frameworks. You can see my code here.


Data Collected:

I scraped over 5,000 kernels, sorted by vote count, as well as ~1,500 user pages. I also scraped ~100K kernels as supporting data to combine with the main scrape so that I could see more finely tuned results. The data included:

  • Kernels
    • User Name
    • Author Performance Tier
    • Best Public Score
    • Language
    • Execution Time
    • Medal
    • Output Types
    • Date Created
    • Kernel Name
    • Comment Count
    • Vote Count
  • Users (Utilized the user profile URL from the kernels I scraped)
    • User Name
    • Date Joined
    • Follower Count
    • Title
    • Location
    • User Type
    • Bio

Challenges:

My first try with Scrapy left much to be desired: because the website uses infinite scroll, I would have needed a Scrapy plug-in called Scrapy Splash. My research suggested that Splash was neither easy to use nor efficient, so I needed another framework that would work more quickly and efficiently. Thus, Selenium came into the mix.
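The infinite-scroll problem described above is commonly handled by scrolling to the bottom of the page until its height stops growing. The following is a minimal sketch of that pattern, not the code used in this project. The function name, the pause length and the round limit are assumptions; the only Selenium call it relies on is `execute_script`.

```python
import time


def scroll_until_stable(driver, pause_seconds=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is expected to be a Selenium WebDriver (or anything exposing the
    same `execute_script` method). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page requests its next batch of items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Pause so new content can load and the site is not hammered.
        time.sleep(pause_seconds)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; assume the end of the list
        last_height = new_height
    return rounds
```

With a real browser, this would be called after creating a driver (for example `webdriver.Chrome()`) and loading the page with `driver.get(...)`. The `max_rounds` cap keeps a very long listing from scrolling forever.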

Another challenge was uncovering the tags. I had difficulty at first because, the way Kaggle's HTML is structured, the tags are buried deep under other tags. Consequently, I didn't always get the text or attributes I needed. With a little tweaking and a new loop, I started getting the data I needed, albeit slowly, as I didn't want to overload the website with my requests.

 

Data Scraping Work:

I wanted to scrape both the top Kernels page and the profiles of the users who submitted those kernels. To achieve that, I created two loops. The first scraped the Kernels page while collecting the profile URLs of the users, and the second scraped the user pages using the URLs gathered by the first. This approach made it easy to track what was being scraped while maintaining data integrity.
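The two-loop approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `fetch_kernel_rows`, `fetch_user_row` and the `"profile_url"` key are hypothetical stand-ins for the real page-scraping logic.

```python
def scrape_in_two_passes(fetch_kernel_rows, fetch_user_row):
    """Run the kernel pass, then the user pass over the profile URLs it found.

    Both arguments are hypothetical callables standing in for the real
    page-scraping code:
      fetch_kernel_rows() -> iterable of dicts, each with a "profile_url" key
      fetch_user_row(url) -> dict of user fields for that profile page
    """
    kernels = []
    profile_urls = []   # ordered, so the user pass is reproducible
    seen = set()        # avoids requesting the same profile twice

    # Pass 1: collect kernel rows and remember each author's profile URL.
    for row in fetch_kernel_rows():
        kernels.append(row)
        url = row.get("profile_url")
        if url and url not in seen:
            seen.add(url)
            profile_urls.append(url)

    # Pass 2: visit each distinct profile exactly once.
    users = {url: fetch_user_row(url) for url in profile_urls}
    return kernels, users
```

Keying the user records by profile URL means each kernel row can later be joined back to its author, which is one way to keep the two data sets consistent.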

Data Analysis:

Once the data was cleaned, the first step was to take a look at the correlation between the numbers in the data to see what I should be focusing on in my analysis.
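As a rough illustration of this step, the sketch below computes Pearson correlations between numeric columns using only the standard library. In practice a data-frame library would normally do this (pandas offers `DataFrame.corr()`, for instance); the function names here are made up for the example, and nothing in it reflects the actual scraped values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}
```

A value near 1 for a pair of columns means they rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.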


As expected, the vote count and the comment count are positively correlated. I thought about including the follower count of users in this analysis but decided against it, because only a handful of top users have a decent number of followers, and even those numbers are too small to be meaningful.


Among the ~100K kernels I used in my analysis, it was interesting to see that there are 5 times more tier 5 users than tier 4 users. Kaggle defines performance tiers based on the amount of work a user puts into kernels, competitions and discussions.


Kaggle having more kernels written in Python is not surprising as Python is arguably the most popular language for data science.

Vote count per performance tier

This graph shows the mean vote count per performance tier. This is very interesting, as I'd expect tier 5 users to collect more votes than other users. In the second part of this project, I will look more deeply into the reasons for this finding.

The graph above shows us the mean vote count for each language. Unexpectedly, we see that kernels written with R collected more votes compared to kernels written with Python. What we learn here is that even though Python is the most preferred language in kernels, it didn't always translate into more votes. I will be spending more time on this data in the second part of my project. There may be outliers which skew the data.
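One simple way to check whether outliers are driving a per-group mean is to compare it with the per-group median. The sketch below does that with the standard library. The rows are hypothetical toy values chosen to show the effect, not figures from the scraped data, and the field names are assumptions.

```python
from collections import defaultdict
from statistics import mean, median


def votes_by_group(rows, key):
    """Return {group: (mean_votes, median_votes, count)} for the given key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["votes"])
    return {g: (mean(v), median(v), len(v)) for g, v in groups.items()}


# Hypothetical toy rows, not values from the scraped data.
rows = [
    {"language": "R", "votes": 2},
    {"language": "R", "votes": 3},
    {"language": "R", "votes": 400},   # a single viral kernel
    {"language": "Python", "votes": 10},
    {"language": "Python", "votes": 12},
    {"language": "Python", "votes": 14},
]

summary = votes_by_group(rows, "language")
```

In this toy data the R mean is far above the Python mean while the R median is below the Python median, which is the signature of a few very popular kernels pulling the average up. The same function can be called with a tier field as the key to examine votes per performance tier.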

Another surprising result: for visualizations, the Kaggle community prefers R over Python. Again, this is an area I'll be investigating further.

In the last part of my analysis, I wanted to create a map of the locations of the top 1000 Kaggle users. Normally, this is an interactive map, and the interactive part will be updated later in the blog. In the second part of my project, I will do a deeper dive and create more maps that show where different user groups (based on their Kaggle status) contribute to Kaggle.

To be investigated further: (with more data)

  • How do the follower and following counts affect the distribution of votes and comments a user receives?
  • More detailed user tier distribution
  • The distribution of vote count per user tier, and why do Tier 4 users collect more votes per kernel?
  • Why do kernels written in R collect more votes than kernels written in Python? Are there outlier kernels that have an impact on this finding?
  • Why do R users create more visualizations than Python users?
  • The distribution of user types from each country. For example, does the USA produce more kernel experts than India, or vice versa?

 

About Author

Rifat Dincer

Rifat Yuce Dincer (U.J) spent the last 10 years in business development working for AT&T, Salesforce & HackerRank. He worked with companies that ranged from small startups to large enterprises by partnering with their C suite and solving...