Web Scraping GoodReads: The Power of Words
There are more than 3 million cases per year recorded for clinical depression diagnosis. This diagnosis can be characterized by persistent depressed mood or loss of interest in activities, causing significant impairment in daily life. The anti-dote of depression is good words as captured by the quote below:
“Words are singularly the most powerful force available to humanity. We can choose to use this force constructively with words of encouragement, or destructively using words of despair. Words have energy and power with the ability to help, to heal, to hinder, to hurt, to harm, to humiliate and to humble.”Yehuda Berg
We use this as a motivation to scrape GoodReads Quotes to analyze several categories of quotes and try to learn correlations that could provide some useful insight into the words used to compose a quote.
The GoodReads website was scraped using a spider built in Scrapy, a fast and powerful scraping and web crawling tool. The spider scraped four categories of quotes, namely, humor quotes, inspirational quotes, life quotes and love quotes. For each category of quotes, several pages were crawled yielding a total dataset of 3 MB. The code related to scraping and cleaning of data can be found in my GitHub repository. The scraping process extracted the following information:
- Author Name
- Length of Quote
- Number of Likes
- Category of Quote
- Tags associated with the quote
- Quote Text
After successfully scraping the raw data, I used several processes data was cleaned and formatted for data analysis. I parsed the scraping url to determine the category of quote to be included as part of the extracted data. In addition, the quote text and author named were trimmed for new line character and stripped of quotes. Finally, the likes were parsed to only extract the numeric value.
As a first research question, I wanted to see if there is a correlation between the number of likes for a quote and the length of text for the quote with the hypothesis that there would be a preference for shorter quotes. An initial plot of likes vs length simply showed outliers and was not conclusive. To that end, I created a plot by taking into consideration the log values of likes and length and then ran linear regression. The results yielded a R-squared value of 0.0553 and a p-value of 0.99, indicative that while there may be some correlation, it is not strong enough to be conclusive.
The second research question was around the authors of the quotes. In an era of social media influence, fan following, and motivational speakers, it would add value to sample the authors with most popular quotes. To that end, the top twenty-five are shown in the visual below. The results placed Roy Bennet (476), Steve Maraboll (212), Cassandra Clare (210), J.K Rowling (153) and Rick Riordan (152) as the top authors.
As the final research question, I wanted to determine what are the most popular tags associated with quotes. This would ultimately be the first step in labeling of quotes which in turn could be used for building a predictor for quote search/matching given tags and keywords.
The premilinary review of data extracted has yielded some relationships in data but not strong relationships. The data is interesting and further deeper analysis can be done to yield correlations that add utility and the ability to learn from data. A good future direction can be to build a predictor that given some tags and keywords from the text of a quote, relevant and accurate predictions can be made as to which existing quotes in the knowledge base would be a close match.