Web Scraping GoodReads: The Power of Words

Muhammad Ihsanulhaq Sarfraz
Posted on Oct 15, 2019

Introduction

More than 3 million cases of clinical depression are diagnosed each year. The condition is characterized by a persistently depressed mood or loss of interest in activities, causing significant impairment in daily life. The antidote to depression is good words, as captured by the quote below:

“Words are singularly the most powerful force available to humanity. We can choose to use this force constructively with words of encouragement, or destructively using words of despair. Words have energy and power with the ability to help, to heal, to hinder, to hurt, to harm, to humiliate and to humble.”

Yehuda Berg

We use this as motivation to scrape GoodReads Quotes, analyze several categories of quotes, and look for correlations that could provide useful insight into the words used to compose a quote.

Scraping

The GoodReads website was scraped using a spider built in Scrapy, a fast and powerful scraping and web crawling tool. The spider scraped four categories of quotes, namely, humor quotes, inspirational quotes, life quotes and love quotes. For each category of quotes, several pages were crawled yielding a total dataset of 3 MB. The code related to scraping and cleaning of data can be found in my GitHub repository. The scraping process extracted the following information:

  • Author Name
  • Length of Quote
  • Number of Likes
  • Category of Quote
  • Tags associated with the quote
  • Quote Text

After successfully scraping the raw data, I cleaned and formatted it for analysis in several steps. I parsed the scraping URL to determine the category of each quote and included it as part of the extracted data. In addition, the quote text and author name were trimmed of newline characters and stripped of quotation marks. Finally, the likes field was parsed to extract only the numeric value.
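The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the repository. The URL shape (category as the last path segment), the raw field formats, the choice to measure length in characters, and all function names are assumptions made for the example.

```python
import re
from urllib.parse import urlparse


def category_from_url(url):
    """Return the last path segment of the URL (assumed to be the category)."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def clean_text(raw):
    """Collapse newlines and extra whitespace, then strip quotation marks."""
    text = " ".join(raw.split())
    return text.strip("\"'\u201c\u201d ")


def parse_likes(raw):
    """Extract the numeric part of a string such as '1,234 likes'."""
    match = re.search(r"\d[\d,]*", raw)
    return int(match.group().replace(",", "")) if match else 0


def clean_record(url, raw_quote, raw_author, raw_likes):
    """Build one cleaned row from the raw scraped strings."""
    quote = clean_text(raw_quote)
    return {
        "category": category_from_url(url),
        "quote": quote,
        # Assumed: the raw author string may carry a trailing comma.
        "author": clean_text(raw_author).rstrip(","),
        # Assumed: quote length is counted in characters.
        "length": len(quote),
        "likes": parse_likes(raw_likes),
    }
```

Keeping each step in its own small function makes it easy to adjust one rule (for example, counting words instead of characters) without touching the others.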

As a first research question, I wanted to see whether the number of likes for a quote correlates with the length of its text, with the hypothesis that readers would prefer shorter quotes. An initial plot of likes vs length was dominated by outliers and was not conclusive. To that end, I plotted the log values of likes and length and then ran a linear regression. The regression yielded an R-squared value of 0.0553 and a p-value of 0.99, indicating that while there may be some correlation, it is not strong enough to be conclusive: the fit explains only about 5.5% of the variance in the log of likes.

Likes vs Length of Quote
Log scale view of likes vs length of quote
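For readers who want to reproduce this kind of check, a log-log regression can be sketched with the standard library alone. This is an illustrative ordinary least squares fit, not the author's analysis code; the function name is hypothetical, it drops rows with non-positive values (an assumption, since their logarithm is undefined), and it reports R-squared only, not a p-value.

```python
import math


def log_log_fit(lengths, likes):
    """Least-squares fit of log(likes) on log(length).

    Returns (slope, intercept, r_squared). Pairs containing a
    non-positive value are skipped because log() is undefined there.
    """
    pairs = [
        (math.log(x), math.log(y))
        for x, y in zip(lengths, likes)
        if x > 0 and y > 0
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two usable data points")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("values must not all be identical")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

A negative slope would be consistent with the hypothesis that shorter quotes collect more likes, but a low R-squared means length alone says little about popularity.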

The second research question concerned the authors of the quotes. In an era of social media influence, fan followings, and motivational speakers, it is worth looking at which authors have the most popular quotes. To that end, the top twenty-five are shown in the visual below. The results placed Roy T. Bennett (476), Steve Maraboli (212), Cassandra Clare (210), J.K. Rowling (153) and Rick Riordan (152) as the top authors.

Top Authors with most popular quotes
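A ranking like this one can be produced with a simple tally. The sketch below assumes that the figures in parentheses are per-author quote counts in the scraped dataset and that each row is a dict with an `author` key; both are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def top_authors(records, n=25):
    """Return the n authors with the most quotes as (author, count) pairs."""
    counts = Counter(row["author"] for row in records)
    return counts.most_common(n)
```

If popularity were instead measured by total likes, the same `Counter` could be filled by adding each row's like count to its author's entry.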

As the final research question, I wanted to determine the most popular tags associated with quotes. This would be a first step toward labeling quotes, which in turn could be used to build a predictor for quote search and matching given tags and keywords.

Top Tags associated with Quotes
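Tag popularity can be tallied in much the same way. The sketch below assumes each row carries a `tags` list of strings; the normalization (trimming and lowercasing) and the rule of counting a tag at most once per quote are illustrative choices, not details taken from the original analysis.

```python
from collections import Counter


def top_tags(records, n=10):
    """Return the n most frequent tags as (tag, count) pairs."""
    counts = Counter()
    for row in records:
        # A set ensures a tag repeated on one quote is counted once.
        normalized = {tag.strip().lower() for tag in row["tags"] if tag.strip()}
        counts.update(normalized)
    return counts.most_common(n)
```

Normalizing before counting keeps variants such as "Love" and "love " from being reported as separate tags.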

Conclusion

The preliminary review of the extracted data has revealed some relationships, but none of them strong. The data is interesting, and deeper analysis could uncover correlations that are more useful to learn from. A good future direction would be to build a predictor that, given some tags and keywords from the text of a quote, identifies which existing quotes in the knowledge base are a close match.

About Author

Muhammad Ihsanulhaq Sarfraz

Ihsan is an NYC Data Science Academy Fellow currently pursuing his PhD in Computer Engineering from Purdue University with a dissertation on analyzing patterns of learner behaviors in MOOCs. He has a passion for building dashboards and interfaces...
