Popularity at Kaggle.com
Kaggle is an online community of data scientists and machine learners, owned by Google. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
For this scraping project, I wanted to scrape a website that wasn't scraped many times before. I started working with Json calls, and seeing that this made it very easy to scrape (and it's not really web scraping per se), I scrapped that idea and moved on to trying Scrapy, Beautiful Soup, and finally made the decision of utilizing Selenium for this project as the infinite scroll of Kaggle Kernels page was creating problems with other frameworks.
I scraped over 5000 kernels, sorted by vote count as well as ~1500 user pages. I also scraped ~100K kernels as supporting data to combine with the data I scraped so that I could see more finely tuned results. The data included:
- User Name
- Author Performance Tier
- Best Public Score
- Execution Time
- Output Types
- Date Created
- Kernel Name
- Comment Count
- Vote Count
- Users (Utilized the user profile URL from the kernels I scraped)
- User Name
- Date Joined
- User Type
Kaggle having a public API is a gift to the data science community, but the goal of the project was to learn web scraping. My first try with Scrapy left much to desire since the website uses infinite scroll, and I needed to utilize a plug in of Scrapy called Scrapy Splash. My research about the Splash package showed me that it wasn't very easy to use or time efficient, so I needed to find a framework that would scrape the website quickly and efficiently. Thus, Selenium came into the mix. Also, the way Kaggle's html is structured is that the tags are buried deep under other tags, which meant that I didn't always get the text/attributes I needed. With a little tweaking and a new loop, I started getting the data I needed, albeit slowly as I didn't want to overload the website with my requests.
I wanted to scrape both the top Kernels page as well as the user profiles who submitted the Kernels. So, I created 2 loops that first scraped the Kernels page while collecting the URL of the profiles of users, and the second loop scraped the user pages with URLs from the kernels scrape work. This approach made it easy to track what was being scraped as well making sure that I had data integrity.
Once the data was cleaned, the first step was to take a look at the correlation between numerical data to see what I should be focusing on in my analysis.
As expected, the vote count and the comment count are directly correlated. I thought about including the follower count of users in this analysis, but decided against it since only the handful top users have decent number of followers and even then, the follower numbers are too small to be meaningful.
From the ~100k kernels I used in my analysis, it was interesting to see that there are 5 times more tier 5 users than tier 4 users. Kaggle defines performance tiers based on the amount of work a user puts in in kernels, competitions and discussions.
Kaggle having more kernels written in Python is not surprising as Python is arguably the most popular language for data science.
This graph shows the mean vote count per performance tier. This is very interesting as I'd expect the performance tier 5 users to collect more votes compared to other users. As the 2nd part of this project, I will be looking deeper into the reasons of this finding.
Above graph shows us the mean vote count for each language. Unexpectedly, we see that kernels written with R collected more votes compared to kernels written with Python. What we learn here is that even though Python is the most preferred language in kernels, it didn't always translate to more votes. I will be spending more time on this data in the 2nd part of my project. There may be outliers which skew the data.
Another surprising result. For visualizations, Kaggle community prefers R over Python. Again, this will be an area I'll be investigating further.
In the last part of my analysis, I wanted to create a map of where the top 1000 users of Kaggle were hailing from. Normally, this is an interactive map and the interactive part will be updated later in the blog. In the 2nd part of my project, I will be doing a deeper dive and create more maps that show where different user groups (based on their Kaggle status) contribute to Kaggle.
To be investigated further: (with more data)
- How does the follower & following count impacts the distribution of votes & comments a user receives?
- More detailed user tier distribution
- Distribution of vote count per user tier and why Tier 4 users collect more votes per kernel?
- Why kernels written in R collect more votes compared to kernels written in R? Are there outlier kernels that have an impact on this finding?
- Why the R users create more visualizations compared to Python users?
- The distribution of user types from each country. For example, does USA produce more kernel experts compared to India? or vice versa?