More than 3 million cases of clinical depression are diagnosed each year. The condition is characterized by a persistent depressed mood or loss of interest in activities, causing significant impairment in daily life. One antidote to depression is good words, as captured by the quote below:
“Words are singularly the most powerful force available to humanity. We can choose to use this force constructively with words of encouragement, or destructively using words of despair. Words have energy and power with the ability to help, to heal, to hinder, to hurt, to harm, to humiliate and to humble.”
Yehuda Berg
With this as motivation, I scraped GoodReads Quotes to analyze several categories of quotes and look for correlations that could provide useful insight into the words used to compose a quote.
Scraping
The GoodReads website was scraped using a spider built in Scrapy, a fast and powerful scraping and web crawling tool. The spider scraped four categories of quotes, namely, humor quotes, inspirational quotes, life quotes and love quotes. For each category of quotes, several pages were crawled yielding a total dataset of 3 MB. The code related to scraping and cleaning of data can be found in my GitHub repository. The scraping process extracted the following information:
Author Name
Length of Quote
Number of Likes
Category of Quote
Tags associated with the quote
Quote Text
After successfully scraping the raw data, I cleaned and formatted it for analysis in several steps. I parsed the scraping URL to determine the category of each quote and included it as part of the extracted data. In addition, the quote text and author name were trimmed of newline characters and stripped of quotation marks. Finally, the likes field was parsed to extract only the numeric value.
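The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names, the URL shape (`.../quotes/tag/<category>?page=N`) and the raw likes format (`"12,345 likes"`) are assumptions made for the example.

```python
import re
from urllib.parse import urlparse


def category_from_url(url: str) -> str:
    """Return the last path segment of the scraped URL as the category.

    Assumes URLs shaped like '.../quotes/tag/humor?page=2' (hypothetical).
    """
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def clean_text(raw: str) -> str:
    """Collapse newlines and extra whitespace, then strip surrounding quote marks."""
    collapsed = " ".join(raw.split())
    # Strip straight and curly double quotes plus any leftover spaces.
    return collapsed.strip('"\u201c\u201d ')


def parse_likes(raw: str) -> int:
    """Extract the numeric value from a string such as '12,345 likes'."""
    match = re.search(r"\d[\d,]*", raw)
    return int(match.group().replace(",", "")) if match else 0
```

Each helper handles one field, so a single malformed value (for example a likes string with no digits) falls back to a default instead of stopping the whole cleaning pass.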
As a first research question, I wanted to see whether the number of likes for a quote correlates with the length of its text, with the hypothesis that readers would prefer shorter quotes. An initial plot of likes vs length was dominated by outliers and was not conclusive. I therefore plotted the log values of likes and length and ran a linear regression on them. The regression yielded an R-squared value of 0.0553 and a p-value of 0.99, indicating that any correlation between the two is too weak to be conclusive.
Likes vs Length of Quote
Log scale view of likes vs length of quote
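The log-log regression described above can be sketched as an ordinary least-squares fit on the logged values. This is an illustrative sketch only: the function name `log_log_fit` is hypothetical, and the natural log is an assumption, although the slope and R-squared do not depend on the log base as long as both axes use the same one.

```python
import math


def log_log_fit(lengths, likes):
    """Fit log(likes) = slope * log(length) + intercept by least squares.

    Returns (slope, intercept, r_squared). Pairs with a non-positive
    value are skipped because their logarithm is undefined.
    """
    pairs = [
        (math.log(x), math.log(y))
        for x, y in zip(lengths, likes)
        if x > 0 and y > 0
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid (length, likes) pairs")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("all lengths are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R-squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared
```

The sketch does not compute a p-value, which needs a t distribution; `scipy.stats.linregress` returns one alongside the slope, intercept and correlation coefficient.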
The second research question concerned the authors of the quotes. In an era of social media influence, fan followings, and motivational speakers, it is useful to look at the authors with the most popular quotes. To that end, the top twenty-five are shown in the visual below. The results placed Roy T. Bennett (476), Steve Maraboli (212), Cassandra Clare (210), J.K. Rowling (153) and Rick Riordan (152) as the top authors.
Top Authors with most popular quotes
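A ranking like this can be produced with a simple tally. The sketch below assumes the ranking counts how many scraped quotes each author has, which the post does not state explicitly; the record layout (a dict with an `"author"` key) and the function name `top_authors` are also hypothetical.

```python
from collections import Counter


def top_authors(records, n=25):
    """Return the n authors with the most quotes as (author, count) pairs.

    `records` is an iterable of dicts with an 'author' key (assumed layout).
    Records with a missing or empty author are ignored.
    """
    counts = Counter(r["author"] for r in records if r.get("author"))
    return counts.most_common(n)
```

If popularity were instead measured by total likes, the same structure would work with the counter incremented by each quote's like count rather than by one.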
As the final research question, I wanted to determine the most popular tags associated with quotes. This would be a first step toward labeling quotes, which in turn could be used to build a predictor for quote search and matching given tags and keywords.
Top Tags associated with Quotes
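Because each quote carries several tags, counting them means flattening the per-quote tag lists first. The following is a minimal sketch under assumed names: the `"tags"` key, the list-of-strings layout and the function name `top_tags` are illustrative, not taken from the original code.

```python
from collections import Counter


def top_tags(records, n=10):
    """Return the n most frequent tags as (tag, count) pairs.

    Tags are lower-cased and trimmed so 'Love' and ' love ' count together,
    and a tag repeated within one quote is counted once for that quote.
    """
    counts = Counter()
    for record in records:
        tags = {t.strip().lower() for t in record.get("tags", []) if t.strip()}
        counts.update(tags)
    return counts.most_common(n)
```

Normalizing before counting matters here because user-supplied tags often differ only in case or stray whitespace.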
Conclusion
The preliminary review of the extracted data has revealed some relationships, but none of them strong. The data is interesting, and deeper analysis could uncover correlations that are more useful to learn from. A good future direction would be to build a predictor that, given some tags and keywords from the text of a quote, returns the existing quotes in the knowledge base that most closely match.
About Author
Muhammad Ihsanulhaq Sarfraz
Ihsan is an NYC Data Science Academy Fellow currently pursuing his PhD in Computer Engineering from Purdue University with a dissertation on analyzing patterns of learner behaviors in MOOCs. He has a passion for building dashboards and interfaces...