Web Scraping GoodReads: The Power of Words

Posted on Oct 15, 2019
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

More than 3 million cases of clinical depression are diagnosed each year. The condition is characterized by a persistently depressed mood or loss of interest in activities, causing significant impairment in daily life. Good words can be an antidote to depression, as captured by the quote below:

“Words are singularly the most powerful force available to humanity. We can choose to use this force constructively with words of encouragement, or destructively using words of despair. Words have energy and power with the ability to help, to heal, to hinder, to hurt, to harm, to humiliate and to humble.”

Yehuda Berg

I used this as motivation to scrape GoodReads Quotes, analyze several categories of quotes, and look for correlations that could provide useful insight into the words used to compose a quote.

Scraping

The GoodReads website was scraped using a spider built in Scrapy, a fast and powerful scraping and web crawling tool. The spider scraped four categories of quotes, namely, humor quotes, inspirational quotes, life quotes and love quotes. For each category of quotes, several pages were crawled yielding a total dataset of 3 MB. The code related to scraping and cleaning of data can be found in my GitHub repository. The scraping process extracted the following information:

  • Author Name
  • Length of Quote
  • Number of Likes
  • Category of Quote
  • Tags associated with the quote
  • Quote Text

After successfully scraping the raw data, I cleaned and formatted it for analysis in several steps. I parsed the scraped URL to determine the category of each quote and included it in the extracted data. In addition, the quote text and author name were trimmed of newline characters and stripped of quotation marks. Finally, the likes field was parsed to extract only the numeric value.
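The cleaning steps described above can be sketched with the standard library alone. The function names and the shape of the raw record (keys `url`, `quote_raw`, `author_raw`, `likes_raw`, `tags`) are hypothetical, and the URL layout with the category as the last path segment is an assumption.

```python
import re
from urllib.parse import urlparse


def category_from_url(url: str) -> str:
    """Return the last path segment, assumed to be the quote category."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def clean_text(raw: str) -> str:
    """Collapse newlines and extra whitespace, then strip quotation marks."""
    collapsed = " ".join(raw.split())
    return collapsed.strip('"“”').strip()


def parse_likes(raw: str) -> int:
    """Extract the numeric part of a string such as '1,234 likes'."""
    match = re.search(r"\d[\d,]*", raw)
    return int(match.group().replace(",", "")) if match else 0


def clean_item(item: dict) -> dict:
    """Turn one hypothetical raw record into an analysis-ready record."""
    text = clean_text(item["quote_raw"])
    return {
        "author": clean_text(item["author_raw"]),
        "text": text,
        "length": len(text),  # quote length in characters
        "likes": parse_likes(item["likes_raw"]),
        "category": category_from_url(item["url"]),
        "tags": [tag.strip() for tag in item["tags"]],
    }
```

Measuring length in characters is a choice made here; the post does not say whether length was counted in characters or words.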

As a first research question, I wanted to see whether the number of likes for a quote correlates with the length of its text, with the hypothesis that readers would prefer shorter quotes. An initial plot of likes vs length was dominated by outliers and was not conclusive. I therefore plotted the log values of likes and length and ran a linear regression on them. The fit yielded an R-squared value of 0.0553 and a p-value of 0.99, indicating that any correlation between likes and length is too weak to be conclusive.

Likes vs Length of Quote

Log scale view of likes vs length of quote
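For readers who want to reproduce this kind of fit, here is a minimal sketch of an ordinary least-squares regression on log-transformed values, written with the standard library only. The function name `log_log_fit` and the use of base-10 logarithms are assumptions, and the sample numbers in the usage lines are made up rather than taken from the scraped data.

```python
import math


def log_log_fit(lengths, likes):
    """Fit log10(likes) = slope * log10(length) + intercept by least squares.

    Pairs with a non-positive value are skipped because their log is undefined.
    Returns (slope, intercept, r_squared).
    """
    pairs = [
        (math.log10(x), math.log10(y))
        for x, y in zip(lengths, likes)
        if x > 0 and y > 0
    ]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up toy data in which likes fall as length grows.
slope, intercept, r_squared = log_log_fit([10, 100, 1000], [1000, 100, 10])
```

This sketch does not compute a p-value; a statistics package such as SciPy (`scipy.stats.linregress`) reports one alongside the slope and correlation.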

The second research question concerned the authors of the quotes. In an era of social media influence, fan followings, and motivational speakers, it is worth sampling the authors with the most popular quotes. The top twenty-five are shown in the visual below. The results placed Roy T. Bennett (476), Steve Maraboli (212), Cassandra Clare (210), J.K. Rowling (153) and Rick Riordan (152) as the top authors.

Top Authors with most popular quotes
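One way to produce such a ranking is to count how many scraped quotes each author has. That reading of the numbers in parentheses is an assumption, as are the function name `top_authors` and the `author` key on each record.

```python
from collections import Counter


def top_authors(quotes, n=25):
    """Return the n authors with the most quotes as (author, count) pairs."""
    return Counter(quote["author"] for quote in quotes).most_common(n)
```

If popularity were instead measured by likes, the same idea applies with the likes summed per author rather than the quotes counted.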

As the final research question, I wanted to determine which tags are most commonly associated with quotes. This would be a first step toward labeling quotes, which in turn could be used to build a predictor for quote search and matching given tags and keywords.

Top Tags associated with Quotes
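Tag frequencies can be tallied in a similar way, with the extra step that each quote carries a list of tags. The sketch below assumes each record has a `tags` list; lowercasing the tags and counting a tag at most once per quote are choices made here, not steps described in the post.

```python
from collections import Counter


def top_tags(quotes, n=10):
    """Return the n most frequent tags as (tag, number_of_quotes) pairs."""
    counts = Counter()
    for quote in quotes:
        # A set ensures a tag repeated on one quote is counted only once.
        counts.update({tag.strip().lower() for tag in quote["tags"] if tag.strip()})
    return counts.most_common(n)
```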

Conclusion

The preliminary review of the extracted data has revealed some relationships, but none of them strong. The data is interesting, and deeper analysis could uncover correlations that are more useful. A good future direction would be to build a predictor that, given some tags and keywords from the text of a quote, returns the existing quotes in the knowledge base that match most closely.
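As a rough illustration of that future direction, the sketch below ranks existing quotes by how much their tags and words overlap with a set of query terms. Jaccard overlap is a simple stand-in chosen here, not a method the post built or evaluated, and the function names and record keys (`text`, `tags`) are hypothetical.

```python
import re


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_quotes(query_terms, quotes, k=3):
    """Return up to k (score, text) pairs, best match first."""
    query = {term.lower() for term in query_terms}
    scored = []
    for quote in quotes:
        # Each quote is represented by its tags plus the words in its text.
        terms = {tag.lower() for tag in quote["tags"]}
        terms |= set(re.findall(r"[a-z']+", quote["text"].lower()))
        scored.append((jaccard(query, terms), quote["text"]))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A real predictor would likely weight rare terms more heavily, for example with tf-idf, but the overlap score is enough to show the matching idea.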

About Author

Muhammad Ihsanulhaq Sarfraz

Ihsan is an NYC Data Science Academy Fellow currently pursuing his PhD in Computer Engineering from Purdue University with a dissertation on analyzing patterns of learner behaviors in MOOCs. He has a passion for building dashboards and interfaces...
