Medium: Scraping Data from the Six Main Categories

Posted on Mar 4, 2019


The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Background

For my project, I decided to scrape data from Medium.com. In their own words, Medium taps into the brains of the world's most insightful writers, thinkers, and storytellers to bring you the smartest takes on topics that matter. So whatever your interest, you can always find fresh thinking and unique perspectives.

https://adlardee.shinyapps.io/webscrape_inshiny/  (Shiny App Link)

https://github.com/adlardee/webscrape_inshiny  (GitHub Link)

Data Scraped

I pulled 1,026 articles from Medium's six main reading categories (Tech, Startups, Self, Politics, Health, Design). The category pages only load more articles as you scroll, so I used Selenium to drive the infinite scroll and told it to scroll down 7 times for each category. This left me with the following article totals: Tech 183, Startups 188, Self 204, Politics 140, Health 118, Design 193. For each article I pulled the title, author, date, article length, claps, and tags. I use the term 'claps' in this blog post because that is what Medium calls them; a clap is the same thing as a 'like' or 'thumbs up'.
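The scrolling step can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the function names, the 2-second pause, and the choice of Chrome are assumptions. It relies only on standard Selenium WebDriver calls (`get`, `execute_script`, `page_source`, `quit`).

```python
import time


def scroll_to_load(driver, times=7, pause=2.0):
    """Scroll to the bottom of the page `times` times so more articles load.

    `driver` is expected to be a Selenium WebDriver; only its
    execute_script method is used here.
    """
    for _ in range(times):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # assumed pause so the next batch has time to load


def fetch_category_html(url, times=7, pause=2.0):
    """Open `url`, scroll `times` times, and return the rendered HTML.

    Assumes Selenium and a Chrome driver are installed (an assumption,
    not something stated in the post).
    """
    from selenium import webdriver  # imported lazily so the helper above works without it

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_load(driver, times=times, pause=pause)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_category_html` once per category URL would return the HTML after seven scrolls, which could then be parsed for title, author, date, length, claps, and tags.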

Findings

One of the first things I did was visualize the overall traffic across the website's categories in the graphs below. The top graph shows the number of claps per day for each of the six categories, so we can see which categories are generating the most claps. What is interesting to me is how much the categories shift in what people deemed clap-worthy. To me, that means the perceived quality of the articles in each category varies quite a bit from day to day.

The second graph in the picture above is capturing the number of articles posted each day per category. Both graphs cover the same timeframe and you can see some interesting interaction between the two graphs in the middle and on the right side.
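The two series behind these graphs are simple daily aggregates. The sketch below shows one way to compute them with the standard library; the sample rows are made up for illustration and are not the scraped data, and the `(category, date, claps)` layout is an assumption.

```python
from collections import defaultdict

# Hypothetical sample rows: (category, date, claps). Not the scraped data.
articles = [
    ("Design", "2019-02-10", 1200),
    ("Design", "2019-02-10", 300),
    ("Tech", "2019-02-10", 50),
    ("Tech", "2019-02-11", 75),
]


def daily_totals(rows):
    """Return total claps and article counts keyed by (date, category)."""
    claps = defaultdict(int)
    counts = defaultdict(int)
    for category, date, n_claps in rows:
        claps[(date, category)] += n_claps
        counts[(date, category)] += 1
    return dict(claps), dict(counts)


claps_per_day, articles_per_day = daily_totals(articles)
for key in sorted(claps_per_day):
    print(key, claps_per_day[key], articles_per_day[key])
```

Plotting `claps_per_day` and `articles_per_day` against the same date axis gives the two views compared here.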

In the middle, we see a spike in overall claps, particularly in the Design category, but not many articles posted in that timeframe. On the far right of both graphs, by contrast, a spike in the number of articles posted coincides with an increase in claps. I wasn't expecting such wide variation in the number of articles posted per day, and I was surprised to see such a drastic uptick.

The next interesting thing to look at was article length: how it was distributed across the categories and how it related to the number of claps. In the first chart, around the 10-minute mark, you start to see a significant drop-off in the number of articles with a high number of claps. This was expected, as I had read a Medium article highlighting that the optimal blog post is seven minutes long if you want to keep people's attention. So two things are at work: the reader's naturally dissipating attention, and informed writers keeping their articles in the range of seven minutes.

The second chart, a box plot, does a great job of illustrating article length as well. The box for each category spans the middle 50 percent of article lengths, and as you can see, all of them fall in the 4-8 minute range. The solid line in the middle of each box represents the median (the second quartile), which further supports the idea that most authors write articles in the 7-minute range.
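For readers less familiar with box plots, the three numbers that define each box can be computed directly. This is a small sketch using Python's `statistics.quantiles`; the reading times are invented for illustration, and plotting libraries may use a slightly different quartile convention.

```python
from statistics import quantiles

# Hypothetical reading times in minutes; not the scraped data.
reading_minutes = {
    "Tech": [3, 4, 5, 6, 7, 8, 9],
    "Self": [2, 4, 5, 6, 7, 8, 12],
}


def box_stats(values):
    """Return (Q1, median, Q3): the bottom edge, middle line, and top edge of the box."""
    q1, median, q3 = quantiles(values, n=4)
    return q1, median, q3


for category, minutes in reading_minutes.items():
    q1, median, q3 = box_stats(minutes)
    print(f"{category}: middle 50% spans {q1}-{q3} min, median {median} min")
```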

Finally, the last item that was interesting to me was how authors use tags. Medium lets you tag your article with keywords to help identify its content. The visuals below show the most common tags used on the website. The chart on the left gives a count of the 15 most used tags, and the word cloud on the right gives a visual interpretation of the most commonly used tag words.
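Counting the most used tags is a natural fit for `collections.Counter`. The sketch below assumes each article's tags are stored as a list of strings and lowercases them before counting; both the data layout and the sample tags are illustrative assumptions.

```python
from collections import Counter

# Hypothetical tag lists, one per article; not the scraped data.
article_tags = [
    ["Design", "UX"],
    ["Politics"],
    ["design", "Startup"],
    [],  # an article with no tags
]


def top_tags(tag_lists, n=15):
    """Return the n most common tags as (tag, count) pairs, ignoring case."""
    counts = Counter(tag.lower() for tags in tag_lists for tag in tags)
    return counts.most_common(n)


print(top_tags(article_tags))
```

The same counts could feed both a bar chart of the top 15 tags and a word cloud.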

As I looked into tagging more, I decided to split out the articles that had tags from those that did not. Of the 1,026 articles, 963 had tags and 63 did not have any associated tags. The average number of claps for articles without tags was 2,476, while the average for articles with tags was 387. It was extremely interesting to me that this smaller subset of articles without tags had a much larger average number of claps.
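The tagged versus untagged comparison amounts to splitting the articles on whether the tag list is empty and averaging claps in each group. A minimal sketch, with made-up rows and an assumed `{"claps": ..., "tags": [...]}` record layout:

```python
from statistics import mean

# Hypothetical records; not the scraped data.
articles = [
    {"claps": 3000, "tags": []},
    {"claps": 2000, "tags": []},
    {"claps": 400, "tags": ["Design"]},
    {"claps": 200, "tags": ["Tech", "AI"]},
]


def mean_claps_by_tagging(rows):
    """Return {group: (article_count, mean_claps)} for tagged and untagged articles."""
    tagged = [r["claps"] for r in rows if r["tags"]]
    untagged = [r["claps"] for r in rows if not r["tags"]]
    return {
        "tagged": (len(tagged), mean(tagged) if tagged else None),
        "untagged": (len(untagged), mean(untagged) if untagged else None),
    }


print(mean_claps_by_tagging(articles))
```

With a group as small as 63 articles, a few very popular posts can dominate the mean, so comparing medians as well may be worthwhile.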

The whole point of tagging is to link your article to related content and help drive traffic to it. But for some reason, there seems to be a small subset of authors who do not follow this general rule and are nonetheless writing very well-liked content.

Conclusion

There were a couple of takeaways from this project. First, I think Medium should look into the fluctuations in the number of articles being published each day. Maybe there is a better way to release content that sustains user interest.

Next was the importance of article length if you want to reach a large number of people. Finally, tagging is another area that Medium should look into a bit more: why was there a sub-group of articles that performed extremely well but did not use any of the tagging features? I hope you found some of these insights as interesting as I did. Thank you for taking the time to read!

About Author

Eric Adlard

Eric is an aspiring data scientist with a track record of using data to drive business insights in financial services. He has hands-on experience in R and Python in web-scraping, data visualization, supervised and unsupervised machine learning, as...
