Medium: Scraping Data from the Six Main Categories
Project GitHub | LinkedIn: Niki Moritz Hao-Wei Matthew Oren
Background
For my project, I decided to scrape data from Medium.com. In their own words, Medium taps into the brains of the world's most insightful writers, thinkers, and storytellers to bring you the smartest takes on topics that matter. So whatever your interest, you can always find fresh thinking and unique perspectives.
https://adlardee.shinyapps.io/webscrape_inshiny/ (Shiny App Link)
https://github.com/adlardee/webscrape_inshiny (Github Link)
Data Scraped
I pulled 1,026 articles from Medium's six main reading categories (Tech, Startups, Self, Politics, Health, Design). The category pages load more articles only as you scroll, so I used Selenium to handle the infinite scroll and told it to scroll down seven times for each category. This left me with the following article totals: Tech: 183, Startups: 188, Self: 204, Politics: 140, Health: 118, Design: 193. For each article I pulled the title, author, date, length, claps, and tags. I use the term 'claps' in this blog post because that is what Medium calls them; a clap is the same thing as a 'like' or 'thumbs up'.
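The scrolling step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name, the two-second pause, and the choice to record page heights are all assumptions. It only relies on the driver exposing `execute_script`, which Selenium WebDriver objects do.

```python
import time


def scroll_to_bottom(driver, times=7, pause_seconds=2.0):
    """Scroll to the bottom of the page `times` times so more articles load.

    `driver` is assumed to be a Selenium WebDriver (any object exposing
    `execute_script` works). Returns the page height seen after each scroll.
    """
    heights = []
    for _ in range(times):
        # Jump to the current bottom of the document to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the next batch of articles.
        # The pause length is a guess; a slow connection may need more.
        time.sleep(pause_seconds)
        heights.append(driver.execute_script("return document.body.scrollHeight;"))
    return heights
```

With Selenium installed, this would be called after creating a driver (for example `webdriver.Chrome()`) and opening a category page with `driver.get(...)`. If the recorded height stops growing between scrolls, the page has probably run out of articles to load.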
Findings
One of the first things I did was visualize the overall traffic of the website's categories in the graphs below. The top graph shows the number of claps per day for each of the six categories, so we can see which categories generate the most claps. What is interesting to me is how much the categories shift in what people deemed clap-worthy. To me, that suggests the quality of the articles in each category is quite variable from day to day.
The second graph in the picture above captures the number of articles posted each day per category. Both graphs cover the same timeframe, and you can see some interesting interaction between the two in the middle and on the right side.
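The two series behind these graphs, claps per day and articles per day for each category, could be computed along these lines. This is a sketch under assumptions: the field names (`date`, `category`, `claps`) and the sample rows are made up for illustration and are not the scraped data.

```python
from collections import defaultdict


def daily_totals(articles):
    """Return two dicts keyed by (date, category): total claps and article count."""
    claps_per_day = defaultdict(int)
    articles_per_day = defaultdict(int)
    for article in articles:
        key = (article["date"], article["category"])
        claps_per_day[key] += article["claps"]
        articles_per_day[key] += 1
    return dict(claps_per_day), dict(articles_per_day)


# Made-up rows for illustration only; not the scraped data.
sample_articles = [
    {"date": "2018-05-01", "category": "Design", "claps": 1200},
    {"date": "2018-05-01", "category": "Design", "claps": 300},
    {"date": "2018-05-01", "category": "Tech", "claps": 50},
    {"date": "2018-05-02", "category": "Tech", "claps": 75},
]
claps, counts = daily_totals(sample_articles)
```

Keeping the two totals under the same key makes it easy to spot days where claps spike without a matching rise in the number of articles.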
In the middle, we see a spike in overall claps, particularly in the Design category, even though few articles were posted in that timeframe. On the far right of both graphs, by contrast, a spike in the number of articles posted coincides with an increase in claps. I wasn't expecting such wide variation in the number of articles posted per day, and I was surprised to see the drastic uptick in the number of articles.
One of the next interesting things to look at was article length: how it was distributed across the categories and how it related to the number of claps. In the first chart, around the 10-minute mark, you start to see a significant drop-off in the number of articles with a high number of claps. This was expected, as I had read a Medium article highlighting that the optimal blog post is seven minutes long in order to keep people's attention. So you have both the naturally dissipating attention of the reader and the informed blog post writer keeping their articles in the range of seven minutes.
The second chart, a boxplot, does a great job of illustrating article length as well. The box for each category captures the lengths of the middle 50 percent of articles, and as you can see, all of them fall in the 4-8 minute range. The solid line in the middle of each box represents the median (the second quartile), which further supports the idea that most authors write articles in the 7-minute range.
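For readers who want the numbers behind a box, the three quartile cut points can be computed with Python's standard library. This is only a sketch: the sample reading times are invented, and the "inclusive" quantile method is one of several conventions, so plotting libraries may draw slightly different box edges.

```python
import statistics


def box_summary(reading_minutes):
    """Return (q1, median, q3) for a list of article lengths in minutes."""
    # n=4 splits the data into quartiles and yields three cut points.
    q1, median, q3 = statistics.quantiles(reading_minutes, n=4, method="inclusive")
    return q1, median, q3


# Made-up reading times for one category; not the scraped data.
sample_minutes = [3, 4, 5, 6, 7, 8, 9]
q1, median, q3 = box_summary(sample_minutes)
print(f"box spans {q1} to {q3} minutes, median {median}")
```

The box in the chart runs from `q1` to `q3`, and the solid line sits at `median`.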
Finally, the last item that was interesting to me was the use of tags by authors. Medium lets you tag your article with keywords to help identify its content. In the visuals below you can see the most common tags used on the website. The chart on the left gives a count of the 15 most used tags, and the word cloud on the right gives a visual interpretation of the most commonly used tag words.
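A count like the one behind the top-15 chart can be sketched with `collections.Counter`. The `tags` field name and the sample tags below are assumptions made for illustration, not the scraped data.

```python
from collections import Counter


def top_tags(articles, n=15):
    """Count how often each tag appears and return the n most common."""
    counts = Counter(tag for article in articles for tag in article["tags"])
    return counts.most_common(n)


# Made-up tag lists for illustration only; not the scraped data.
sample_tagged = [
    {"tags": ["Design", "UX"]},
    {"tags": ["Design", "Startup"]},
    {"tags": ["Design", "UX"]},
    {"tags": []},
]
most_used = top_tags(sample_tagged, n=2)
```

The same counts could also feed a word cloud, where each tag is sized by how often it appears.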
As I looked into tagging more, I decided to split out the articles that had tags from those that did not. Of the 1,026 articles, 963 had tags and 63 did not have any associated tags. The average number of claps for articles without tags was 2,476, while the average for articles with tags was 387. I found it extremely interesting that this smaller subset of articles without tags had a much larger average number of claps.
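The split itself is simple to express in code. The sketch below assumes each article is a dict with `tags` and `claps` fields; those names and the sample values are illustrative, not the project's data.

```python
from statistics import mean


def average_claps_by_tagging(articles):
    """Return {"tagged": (count, mean claps), "untagged": (count, mean claps)}."""
    tagged = [a["claps"] for a in articles if a["tags"]]
    untagged = [a["claps"] for a in articles if not a["tags"]]
    return {
        # mean() raises on an empty list, so report None when a group is empty.
        "tagged": (len(tagged), mean(tagged) if tagged else None),
        "untagged": (len(untagged), mean(untagged) if untagged else None),
    }


# Made-up rows for illustration only; not the scraped data.
sample_rows = [
    {"tags": ["Design"], "claps": 100},
    {"tags": ["Tech", "AI"], "claps": 200},
    {"tags": ["Health"], "claps": 300},
    {"tags": [], "claps": 1000},
    {"tags": [], "claps": 3000},
]
summary = average_claps_by_tagging(sample_rows)
```

Reporting the group sizes next to the averages matters here, because an average over a small group can be pulled up by a handful of very popular articles.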
The whole point of tagging is to link your article to related content and help drive traffic to it. But for some reason, there seems to be a small subset of authors who are not abiding by this general rule but are nonetheless writing very well-liked content.
Conclusion
There were a couple of takeaways from this project. First, I think Medium should look into the fluctuations in the number of articles published each day. Maybe there is a better way to release content that would sustain user interest.
Next was the importance of article length if you want to reach a large number of people. Finally, tagging is another area that Medium should look into a bit more: why was there a sub-group of articles that performed extremely well but were not using any of the tagging features? I hope you found some of these insights as interesting as I did. Thank you for taking the time to read!