Data Analysis on Recommendation System and Spam Review
Contributed by Xu Gao. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from January 9th to March 30th, 2017. This post is based on his web scraping project. See the IPython notebook at:
I. Introduction
This post introduces a recommendation system and a spam review analysis built on reviews of the Top Charting apps in the Google App Store (Google Play). The Google App Store is the largest Android app store and offers not only numerous applications but also movies and songs. For this analysis, I scraped the apps listed on the Top Charting page (https://play.google.com/store/apps/top). Scrapy is the key tool in this project, and I describe the details in Part II.
The recommendation system is based on the "doc2vec" algorithm, introduced by researchers at Google in 2014 as an extension of their 2013 word2vec work, which converts text into numeric vectors. I then use cosine similarity to find matching reviews and apps.
The spam review analysis is based on "Sentiment Analysis and Opinion Mining" by Bing Liu, published in May 2012, in which he describes the different types of spam reviews. In this project, due to some limitations, I select spam reviews manually and analyze the results.
II. Data Web Scraping on Google App Store
In this web scraping project, the item structure is listed in the following table:
where Top Charting is a categorical variable that takes the values Top Free in Android Apps, Top Paid in Android Apps, Top Grossing Android Apps, Top Free Games, Top Paid Games, and Top Grossing Games.
The source page has three layers to scrape. I first use the basic parse function to parse the source page. Then I follow the "See More" button, located in the upper right corner in the screenshots, to reach the next page. Finally, I follow the picture of each app to reach the target pages and collect the user reviews.
Source Code: Scrapy spider.py
In total, I collected 8972 reviews from these apps. To show which words matter most, I plot three word clouds for three Top Charting categories: Top Free in Android Apps, Top Paid in Android Apps, and Top Grossing Games.
Free Apps in Android
Top Paid Apps in Android
From these three word cloud graphs, we can see that "game", "money", "work" and a few other words stand out. If I were an application developer, I might look to these words for the direction of my next app.
III. Recommendation System
Natural Language Processing is one of the most important topics in machine learning. There are several algorithms for representing text, such as term frequency-inverse document frequency (TF-IDF), LDA and LSI. What I use in this project is the "doc2vec" algorithm, which was introduced by researchers at Google in 2014 as an extension of their 2013 word2vec work and is freely available in the "gensim" library for Python.
This algorithm converts text into a numeric vector. I then use cosine similarity to compare the input vector with every vector in the dataset and find the most similar reviews.
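As a minimal sketch of this retrieval step, assuming the reviews have already been converted to vectors, the ranking by cosine similarity could look like the following. The three-dimensional vectors and review IDs are made up for illustration (real doc2vec vectors have many more dimensions), and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_vec, review_vecs, topn=3):
    """Rank stored review vectors by cosine similarity to the query vector."""
    scored = [(rid, cosine_similarity(query_vec, vec))
              for rid, vec in review_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors standing in for doc2vec output.
review_vecs = {
    "Review-1": [0.9, 0.1, 0.0],
    "Review-2": [0.0, 1.0, 0.2],
    "Review-3": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

ranked = most_similar(query_vec, review_vecs, topn=2)
for review_id, score in ranked:
    print(review_id, round(score, 4))
```

Cosine similarity only compares the direction of two vectors, not their length, which is why it is a common choice for comparing text embeddings of reviews with very different lengths.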
First, a quick test: find the top 3 single words across all reviews that are similar to "game" and "great" but different from "bad".
The result is "app", "job" and "love", each with its similarity score. Not bad! The word "app" is quite similar to "game" in this dataset, and "love" expresses a positive attitude, the opposite of "bad".
Then I enter the input "great puzzle game!" After finding the highest similarity, the algorithm returns the following review.
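A query of this kind combines word vectors: the vectors of the "similar to" words are added, the vectors of the "different from" words are subtracted, and the rest of the vocabulary is ranked by cosine similarity to the combined vector. The sketch below illustrates the idea on made-up two-dimensional vectors. The values and the `similar_words` helper are hypothetical, and this mirrors the general approach rather than any particular library's implementation or the trained model from this project.

```python
import math

def unit(v):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def similar_words(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    query_words = set(positive) | set(negative)
    scored = [
        (word, sum(t * x for t, x in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in query_words  # do not return the query words themselves
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors: first axis ~ "about apps/games", second axis ~ "sentiment".
vectors = {
    "game":  [1.0, 0.1],
    "great": [0.2, 1.0],
    "bad":   [0.1, -1.0],
    "app":   [0.9, 0.3],
    "love":  [0.1, 0.9],
    "crash": [0.5, -0.8],
}

result = similar_words(vectors, positive=["game", "great"], negative=["bad"], topn=2)
print(result)
```

Subtracting the vector for "bad" pushes the query away from negative words, so a word like "crash" in the toy data ranks last even though it is related to apps.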
(Review-6784, Similarity:0.8765): The game is very fun, but the amount of little bugs in the game bothers me; tricks wont register or multiplier decides to not go up. I love the game but just wish it would run smooth, especially after its been out this long
App name is: Block! Hexa Puzzle , and the similarity is equal to: 0.7963.
This is quite good. "Block! Hexa Puzzle" is obviously a puzzle game, and in this review the user shows a positive attitude, which suggests the game is at least not bad.
But before continuing, I notice that the similarity is only 0.7963, which makes me doubt the result. The first example looks good, but what about other inputs? So I try another input, "Great RPG Game". This time I get a totally different result.
(Review-5845, Similarity:0.9248): New version loses all stats. -- New version is low quality pixelated graphics. -- New version loses all paid purchases. -- New version is slow & laggy. -- No support, thieves. -- GARBAGE -- Advertisements in a paid application. Go rot in a pit. -- Seriously, you got paid, I don't care if it is $0.01 or $1000.00. -- You got paid, therefore advertisements should be removed. -- I'll take lessons taught here & keep them in mind to decide if I should purchase future titles from publisher & developers involved.
App name is: My Talking Angela , and the similarity is equal to: 0.7891.
"My Talking Angela" is a virtual pet game like "My Talking Tom". It is not a typical role-playing game at all, so there must be some shortcomings or mistakes in this recommendation system.
In my opinion, there are three possible reasons for this bad result. The first is the small dataset. 8972 is not a large number for a review dataset. Given the complexity of natural language, research on reviews usually needs far more data, and 30,000 reviews might be the minimum.
The second reason is spam reviews. The recommendation system assumes that all the reviews in the dataset are genuine and informative. However, people sometimes post biased or fake reviews for private reasons, which makes the review dataset unreliable. The last reason is a shortcoming of the algorithm. Generally speaking, natural language processing cannot be 100% correct. Since the model depends on the training set and on labels that are set manually, its accuracy is quite uncertain.
IV. Spam Review Data Analysis
According to the book "Sentiment Analysis and Opinion Mining" by Bing Liu of the University of Illinois at Chicago, spam reviews can be divided into two types: real spam reviews (which I call trash reviews) and fake reviews. Trash reviews are reviews whose comments, for some reason, are biased away from the user's real attitude, or whose comments show no emotion or evaluation at all. Fake reviews are just what the name says: they are created by someone for profit or other purposes.
The best way to detect fake reviews is to look at the users. Duplicates suggest that a review was very likely fabricated. (P.S. In my dataset, I removed reviews that were exactly identical.) The definition of trash reviews, however, is still not clear. So I use NLP to run a sentiment analysis, compare the sentiment with the rating, and fit a logistic regression of the rating on several review-related factors.
From this boxplot, we can see that sentiment and rating are correlated on average, but a large amount of data falls outside the 25% to 75% quantile range.
The details of the logistic regression are listed below:
Rating > 3: Positive
Rating <= 3: Negative
Factors in this regression:
1. Length of review; 2. Date index; 3. Number of mentions of a brand/company name; 4. Compound sentiment; 5. Negative sentiment; 6. Neutral sentiment; 7. Positive sentiment
Accuracy On Training Set: 69.9%
Accuracy On Test Set: 41.8%
From these two results, we can see that rating and sentiment are not truly correlated in this dataset, although they are supposed to be. As a result, I define a trash review as a review for which the absolute difference between the rating and the adjusted sentiment (rescaled to 1-5) is larger than 3.
Based on each review's app category, I make a pie chart to show which kinds of apps have the most spam reviews. In this graph, we can clearly see that Casino apps, such as card games and other betting apps, have the most trash reviews, followed by Action and Strategy apps.
Another topic is fake reviews. From the user side, we can divide duplicate fake reviews into four types:
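The trash review rule can be sketched as follows, assuming the sentiment is a compound score in [-1, 1] that is mapped linearly onto the 1-5 rating scale. The linear mapping, the function names and the sample values are assumptions made for illustration; the exact adjustment used in the project is not spelled out here.

```python
def adjust_sentiment(compound):
    """Map a compound sentiment score in [-1, 1] onto the 1-5 scale.

    Assumed linear mapping: -1 -> 1, 0 -> 3, +1 -> 5.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound sentiment must lie in [-1, 1]")
    return 3.0 + 2.0 * compound

def is_trash_review(rating, compound, threshold=3.0):
    """Flag a review whose rating and rescaled sentiment differ by more than the threshold."""
    return abs(rating - adjust_sentiment(compound)) > threshold

# Hypothetical (rating, compound sentiment) pairs.
samples = [(5, -0.9), (5, 0.8), (1, 0.95), (3, 0.0)]
flags = [is_trash_review(rating, compound) for rating, compound in samples]
print(flags)
```

Because both quantities live on a 1-5 scale, the largest possible gap is 4, so a threshold of 3 only flags reviews whose text and star rating point in nearly opposite directions.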
First Type: Duplicates from different users on the same app
Second Type: Duplicates from the same user on different apps
Third Type: Duplicates from the same user on the same app
Fourth Type: Duplicates from different users on different apps
In this dataset, I search for duplicates and find that they make up approximately 19.77% of the reviews.
This is the pie chart of duplicate reviews by category. We can see that Casino, Action, Strategy and Casual are still the top four, but this time Puzzle games receive a larger share of duplicate fake reviews.
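One way to sketch this duplicate search is to group reviews by their text and label each group by whether its members share a user and an app. The `(user, app, text)` schema, the sample reviews and the exact-match-after-normalization rule are assumptions for illustration. Labeling a whole group rather than individual pairs is also a simplification.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def classify_duplicates(reviews):
    """Group duplicated review texts and label each group with one of the four types.

    `reviews` is a list of (user, app, text) tuples (an assumed schema).
    Returns (labels, share): labels maps each duplicated text to its type,
    share is the fraction of all reviews that belong to a duplicate group.
    """
    groups = defaultdict(list)
    for user, app, text in reviews:
        groups[normalize(text)].append((user, app))

    labels = {}
    duplicated = 0
    for text, members in groups.items():
        if len(members) < 2:
            continue  # a text that appears once is not a duplicate
        duplicated += len(members)
        same_user = len({user for user, _ in members}) == 1
        same_app = len({app for _, app in members}) == 1
        if not same_user and same_app:
            labels[text] = "type 1: different users, same app"
        elif same_user and not same_app:
            labels[text] = "type 2: same user, different apps"
        elif same_user and same_app:
            labels[text] = "type 3: same user, same app"
        else:
            labels[text] = "type 4: different users, different apps"
    share = duplicated / len(reviews) if reviews else 0.0
    return labels, share

# Hypothetical sample reviews.
reviews = [
    ("u1", "AppA", "Great game!"),
    ("u2", "AppA", "great  game!"),
    ("u3", "AppB", "Love it"),
    ("u3", "AppC", "Love it"),
    ("u4", "AppD", "Too many ads"),
]
labels, share = classify_duplicates(reviews)
print(labels)
print(f"duplicate share: {share:.2%}")
```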
V. Conclusion
In this project, I use Scrapy to scrape reviews from the Google App Store, build a recommendation system, and perform a manual analysis of spam reviews. There are several results we can take from this project. First, across all the top apps people download, "game" is the most prominent word. Second, "doc2vec" is a very useful algorithm for converting text into numeric vectors. Third, we can define and detect trash reviews manually. And the conclusion from all the techniques above is "Casino Apps are not reliable."
VI. Future Work
- Bigger Datasets
- The way I find duplicates in the dataset does not generalize, because users may change a few words. Would cosine similarity be a good way to address this problem?
- User-Interface Recommendation System
- With a big dataset, it would be possible to build a spam review classifier based on NLP.
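The cosine similarity question raised in the list above could be explored along the following lines: represent each review as a bag of word counts and flag pairs whose cosine similarity exceeds a threshold. This is only a sketch of the idea, not something done in this project. The 0.8 threshold, the tokenizer and the sample texts are arbitrary assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens in a review."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def near_duplicates(texts, threshold=0.8):
    """Return (i, j, score) for every pair of texts at or above the threshold."""
    bags = [bag_of_words(t) for t in texts]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = cosine(bags[i], bags[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

# Hypothetical reviews: the first two differ by a single word.
texts = [
    "This game is great and very fun to play",
    "This game is great and really fun to play",
    "Too many ads, uninstalling now",
]
pairs = near_duplicates(texts)
print(pairs)
```

Comparing every pair is quadratic in the number of reviews, which is manageable for a few thousand reviews but would need a smarter index for a much bigger dataset.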
Bing Liu, Sentiment Analysis and Opinion Mining (draft), Morgan & Claypool Publishers, May 2012
Ben Popken, 30 Ways You Can Spot Fake Online Reviews, April 14, 2010