User interest on Medium.com

Eric Adlard
Posted on Mar 4, 2019

Background

For my project, I decided to scrape data from Medium.com. In their own words, Medium taps into the brains of the world's most insightful writers, thinkers, and storytellers to bring you the smartest takes on topics that matter. So whatever your interest, you can always find fresh thinking and unique perspectives.

https://adlardee.shinyapps.io/webscrape_inshiny/  (Shiny App Link)

https://github.com/adlardee/webscrape_inshiny  (Github Link)

Data Scraped

I pulled 1,026 articles from Medium's six main reading categories (Tech, Startups, Self, Politics, Health, Design). Because Medium loads articles through infinite scroll, I used Selenium to scroll down seven times in each category. This left me with the following article totals: Tech 183, Startups 188, Self 204, Politics 140, Health 118, Design 193. For each article I pulled the title, author, date, article length, claps, and tags. I use the term 'claps' in this post because that is what Medium calls them; a clap is the same thing as a 'like' or 'thumbs up'.
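As a rough illustration of the scrolling step, here is a minimal sketch of an infinite-scroll loop in Python. It is not the project's actual code: the function name `scroll_page`, the `pause` value, and the placeholder URL in the comments are assumptions made for this example. The function only relies on the driver's `execute_script` method, which Selenium WebDriver objects provide.

```python
import time


def scroll_page(driver, times=7, pause=2.0):
    """Scroll to the bottom of the page `times` times so more articles load.

    `driver` is expected to be a Selenium WebDriver (any object exposing
    `execute_script` works). Returns the page height seen after each scroll.
    """
    heights = []
    for _ in range(times):
        # Jump to the bottom; an infinite-scroll page then loads more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new content time to load (assumed delay)
        heights.append(driver.execute_script("return document.body.scrollHeight"))
    return heights


# Hypothetical usage with Selenium installed (placeholder URL):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://example.com/some-category")
#   scroll_page(driver, times=7)
#   html = driver.page_source
#   driver.quit()
```

If the returned heights stop growing between scrolls, the page has probably run out of new content or needs a longer pause.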

Findings

One of the first things I did was visualize the overall traffic across the categories in the graphs below. The top graph shows the number of claps per day for each of the six categories, so we can see which categories are generating the most claps. What is interesting to me is how much the categories shift in what people deem clap-worthy. To me, that suggests the quality of the articles in each category varies quite a bit from day to day.

The second graph in the picture above captures the number of articles posted each day per category. Both graphs cover the same timeframe, and you can see some interesting interaction between them in the middle and on the right side. In the middle, there is a spike in overall claps, particularly in the Design category, even though few articles were posted in that timeframe. On the far right of both graphs, by contrast, a spike in the number of articles posted coincides with an increase in claps. I wasn't expecting such a wide variation in the number of articles posted per day, and I was surprised to see such a drastic uptick.
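The two graphs boil down to one aggregation: for each day and category, count the articles and sum the claps. A minimal sketch of that step is below, using only the Python standard library. The field names (`date`, `category`, `claps`) and the sample rows are made up for illustration and are not the scraped data.

```python
from collections import defaultdict


def daily_totals(articles):
    """Return {(date, category): (article_count, total_claps)}."""
    totals = defaultdict(lambda: [0, 0])
    for art in articles:
        key = (art["date"], art["category"])
        totals[key][0] += 1              # one more article that day
        totals[key][1] += art["claps"]   # add its claps to the daily sum
    return {key: tuple(val) for key, val in totals.items()}


# Made-up sample rows, only for illustration.
sample = [
    {"date": "2019-02-20", "category": "Design", "claps": 1200},
    {"date": "2019-02-20", "category": "Design", "claps": 300},
    {"date": "2019-02-20", "category": "Tech", "claps": 50},
    {"date": "2019-02-21", "category": "Tech", "claps": 75},
]

for (day, category), (n_articles, claps) in sorted(daily_totals(sample).items()):
    print(day, category, n_articles, claps)
```

Plotting the counts and the clap sums on a shared date axis is what makes a mismatch visible, such as many claps on a day with few new articles.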

Next, I looked at article length and how it was distributed across the categories, as well as how it related to the number of claps. In the first chart, around the 10-minute mark, you start to see a significant drop-off in the number of articles with a high number of claps. This was expected, as I had read a Medium article highlighting that the optimal blog post length is seven minutes to keep people's attention. So you have both the naturally dissipating attention of the reader and informed writers keeping their articles in the range of seven minutes.

The second chart, a boxplot, does a great job illustrating article length as well. The box for each category captures the length of the middle 50 percent of articles, and as you can see, all of them fall in the 4-8 minute range. The solid line in the middle of each box represents the median (the second quartile), which further suggests that most authors write articles in the 7-minute range.
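For readers who want the numbers behind a box plot, the sketch below computes the three quartiles that define the box and its middle line. It uses `statistics.quantiles` from the Python standard library; the function name `box_summary` and the sample reading times are assumptions for illustration, and plotting libraries may use a slightly different quartile method.

```python
import statistics


def box_summary(minutes):
    """Return (q1, median, q3) for a list of reading times in minutes.

    q1 and q3 are the edges of the box (the middle 50 percent of articles);
    the median is the solid line drawn inside it.
    """
    q1, median, q3 = statistics.quantiles(minutes, n=4, method="inclusive")
    return q1, median, q3


# Made-up reading times for one category, only for illustration.
reading_times = [3, 4, 5, 6, 7, 8, 12]
q1, median, q3 = box_summary(reading_times)
print(f"middle 50%: {q1} to {q3} minutes, median {median} minutes")
```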

Finally, the last item that interested me was how authors use tags. Medium lets you tag your article with keywords to help identify its content. The visuals below show the most common tags used on the website. The chart on the left gives a count of the top 15 most-used tags, and the word cloud on the right gives a visual interpretation of the most commonly used tag words.
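Counting tags for a top-15 chart can be sketched with `collections.Counter`. The `tags` field name and the sample articles are hypothetical, and lower-casing tags before counting is an assumption made here so that "Design" and "design" are treated as the same tag.

```python
from collections import Counter


def top_tags(articles, n=15):
    """Return the n most common tags as (tag, count) pairs."""
    counts = Counter(
        tag.lower()  # assumption: treat tags case-insensitively
        for art in articles
        for tag in art["tags"]
    )
    return counts.most_common(n)


# Made-up sample articles, only for illustration.
sample_articles = [
    {"tags": ["Design", "UX"]},
    {"tags": ["design"]},
    {"tags": []},  # an article without tags contributes nothing
]

print(top_tags(sample_articles, n=15))
```

The same counts can feed both a bar chart and a word cloud, since a word cloud is just another rendering of tag frequencies.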

As I looked into tagging more, I decided to split the articles that had tags from those that did not. Of the 1,026 articles, 963 had tags and 63 did not. The average number of claps for articles without tags was 2,476, while the average for articles with tags was 387. It was extremely interesting to me that this smaller subset of articles without tags had a much larger average number of claps. The whole point of tagging is to link your article to related content and help drive traffic to it. But for some reason, there is a small subset of authors who are not following this general rule but are nonetheless writing very well-liked content.
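The tagged versus untagged comparison can be sketched as a simple group-and-average step. The field names and sample values below are made up for illustration. The median is reported next to the mean as an optional extra check, since in a small group a few very popular articles can pull the mean up sharply.

```python
from statistics import mean, median


def claps_by_tag_status(articles):
    """Summarize claps separately for articles with and without tags."""
    groups = {"tagged": [], "untagged": []}
    for art in articles:
        key = "tagged" if art["tags"] else "untagged"
        groups[key].append(art["claps"])
    return {
        name: {"n": len(claps), "mean": mean(claps), "median": median(claps)}
        for name, claps in groups.items()
        if claps  # skip a group that has no articles
    }


# Made-up sample articles, only for illustration.
sample_articles = [
    {"tags": ["tech"], "claps": 100},
    {"tags": ["design", "ux"], "claps": 300},
    {"tags": ["health"], "claps": 500},
    {"tags": [], "claps": 50},
    {"tags": [], "claps": 4950},
]

for name, stats in claps_by_tag_status(sample_articles).items():
    print(name, stats)
```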

Conclusion

There were a couple of takeaways from this project. First, I think Medium should look into the fluctuations in the number of articles published each day; maybe there is a better way to release content to sustain user interest. Next was the importance of article length if you want to reach a large number of people. Finally, tagging is another area Medium should look into a bit more: why was there a sub-group of articles that performed extremely well but did not use any of the tagging features? I hope you found some of these insights as interesting as I did. Thank you for taking the time to read!

