Web Scraping GoodReads: The Power of Words

Muhammad Ihsanulhaq Sarfraz

Posted on Oct 15, 2019

The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

More than 3 million cases of clinical depression are diagnosed each year. The condition is characterized by a persistently depressed mood or loss of interest in activities, causing significant impairment in daily life. Good words can act as an antidote to depression, as the quote below captures:

“Words are singularly the most powerful force available to humanity. We can choose to use this force constructively with words of encouragement, or destructively using words of despair. Words have energy and power with the ability to help, to heal, to hinder, to hurt, to harm, to humiliate and to humble.”
Yehuda Berg

With this as motivation, I scraped GoodReads Quotes to analyze several categories of quotes and look for correlations that could offer useful insight into the words used to compose a quote.

Scraping

The GoodReads website was scraped using a spider built in Scrapy, a fast and powerful scraping and web crawling tool. The spider scraped four categories of quotes, namely, humor quotes, inspirational quotes, life quotes and love quotes. For each category of quotes, several pages were crawled yielding a total dataset of 3 MB. The code related to scraping and cleaning of data can be found in my GitHub repository. The scraping process extracted the following information:

Author Name
Length of Quote
Number of Likes
Category of Quote
Tags associated with the quote
Quote Text

After successfully scraping the raw data, I cleaned and formatted it for analysis in several steps. I parsed the scraped URL to determine the category of each quote, which was included as part of the extracted data. The quote text and author name were trimmed of newline characters and stripped of quotation marks. Finally, the likes field was parsed to extract only the numeric value.
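The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names are hypothetical, and the URL shape (`.../quotes/tag/<category>?page=N`) and the likes format (`"12,345 likes"`) are assumptions about what the scraper saw.

```python
import re
from urllib.parse import urlparse


def category_from_url(url: str) -> str:
    """Return the last path segment of a listing URL, e.g. 'humor'.

    Assumes the category is the final path segment (an assumption).
    """
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def clean_text(raw: str) -> str:
    """Collapse newlines and extra whitespace, then strip quote marks."""
    collapsed = " ".join(raw.split())
    return collapsed.strip("\"'\u201c\u201d ").strip()


def parse_likes(raw: str) -> int:
    """Extract the integer from a string such as '12,345 likes'."""
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return 0  # no digits found; treat as zero likes
    return int(match.group().replace(",", ""))
```

Keeping each step in its own small function makes it easy to apply them field by field, whether inside the spider or in a separate post-processing pass.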

As a first research question, I wanted to see whether the number of likes for a quote correlates with the length of its text, with the hypothesis that readers would prefer shorter quotes. An initial plot of likes vs length was dominated by outliers and was not conclusive. I therefore plotted the log values of likes and length and ran a linear regression. The regression yielded an R-squared value of 0.0553 and a p-value of 0.99, so length explains only about 5.5% of the variance in log likes. While there may be some correlation, it is not strong enough to be conclusive.

Likes vs Length of Quote

Figure: Log scale view of likes vs length of quote
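For readers who want to reproduce this kind of fit, here is a minimal sketch of an ordinary least squares regression on log-transformed values, written with the standard library only. The function name and the toy numbers are illustrative assumptions, not the scraped data, and the p-value reported above is not computed here.

```python
import math


def log_log_fit(lengths, likes):
    """Ordinary least squares of log(likes) on log(length).

    Returns (slope, intercept, r_squared). All inputs must be positive,
    so quotes with zero likes would need to be dropped beforehand.
    """
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in likes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Toy values for illustration only; these are NOT the scraped data.
toy_lengths = [40, 85, 120, 260, 510]
toy_likes = [900, 1500, 700, 1100, 800]
slope, intercept, r2 = log_log_fit(toy_lengths, toy_likes)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.4f}")
```

A negative slope would point toward a preference for shorter quotes, but with an R-squared as low as the one reported above, the sign of the slope alone says little.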

The second research question concerned the authors of the quotes. In an era of social media influence, fan followings, and motivational speakers, it is worth sampling the authors with the most popular quotes. The top twenty-five are shown in the visual below. The results placed Roy T. Bennett (476), Steve Maraboli (212), Cassandra Clare (210), J.K. Rowling (153) and Rick Riordan (152) as the top authors.

Figure: Top Authors with most popular quotes
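A ranking like this can be produced with a simple tally. The post does not state exactly how popularity was measured, so the sketch below assumes it is the number of scraped quotes per author. The function name, the `author` field name, and the sample records are all hypothetical.

```python
from collections import Counter


def top_authors(records, n=25):
    """Return the n authors with the most quotes as (author, count) pairs."""
    counts = Counter(rec["author"] for rec in records)
    return counts.most_common(n)


# Hypothetical records; the field name "author" is an assumption.
sample = [
    {"author": "Author A"},
    {"author": "Author B"},
    {"author": "Author A"},
    {"author": "Author C"},
    {"author": "Author A"},
    {"author": "Author B"},
]
print(top_authors(sample, n=2))  # [('Author A', 3), ('Author B', 2)]
```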

As the final research question, I wanted to determine the most popular tags associated with quotes. This would be a first step toward labeling quotes, which in turn could be used to build a predictor for quote search and matching given tags and keywords.

Figure: Top Tags associated with Quotes
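The tag counts behind a word cloud like this can be computed in a few lines. The sketch below assumes each quote's tags were stored as a single comma-separated string, which is an assumption about the scraped format. The function name and sample values are illustrative.

```python
from collections import Counter


def tag_frequencies(tag_strings):
    """Count tags across quotes; each input is a comma-separated tag string."""
    counts = Counter()
    for raw in tag_strings:
        # Normalize case and whitespace, and skip empty fragments.
        tags = {t.strip().lower() for t in raw.split(",") if t.strip()}
        counts.update(tags)  # a set, so each tag counts once per quote
    return counts


sample_tags = ["love, life", "Life, humor", "life , inspirational,"]
freqs = tag_frequencies(sample_tags)
print(freqs.most_common(1))  # [('life', 3)]
```

Normalizing case and whitespace before counting keeps variants such as "Life" and "life " from being treated as separate tags.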

Conclusion

The preliminary review of the extracted data has revealed some relationships, but none of them strong. The data is interesting, and deeper analysis could yield correlations that are more useful to learn from. A good future direction would be to build a predictor that, given some tags and keywords from the text of a quote, identifies which existing quotes in the knowledge base are a close match.
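As one possible starting point for such a predictor, the sketch below ranks stored quotes by Jaccard similarity between the query terms and each quote's tags and words. This is only an illustration of the idea, not something built for this post. The function names, the record fields (`text`, `tags`), and the sample quotes are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens of a string, as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_quotes(query_terms, quotes, k=3):
    """Rank quotes by Jaccard overlap between the query and tags + text."""
    query = {t.lower() for t in query_terms}
    scored = []
    for q in quotes:
        features = {t.lower() for t in q["tags"]} | tokens(q["text"])
        scored.append((jaccard(query, features), q["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Placeholder records for illustration; not taken from the scraped data.
knowledge_base = [
    {"text": "Laughter makes the day lighter", "tags": ["humor"]},
    {"text": "Kind words heal", "tags": ["inspirational", "life"]},
]
print(closest_quotes(["kind", "life"], knowledge_base, k=1))
```

Set overlap is a deliberately simple baseline. A real predictor would probably weight rare terms more heavily, for example with TF-IDF.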

About Author

Muhammad Ihsanulhaq Sarfraz

Ihsan is an NYC Data Science Academy Fellow currently pursuing his PhD in Computer Engineering from Purdue University with a dissertation on analyzing patterns of learner behaviors in MOOCs. He has a passion for building dashboards and interfaces...
