Data Analysis on Recommendation System and Spam Review
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Xu Gao. He is currently enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp running from January 9th to March 30th, 2017. This post is based on his web scraping project. See the IPython notebook at:
https://github.com/RayyGao/RayyGao.github.io/tree/master/Xu%20Gao%20Webscrap
I. Introduction
This post introduces a recommendation system and a spam review analysis based on reviews of Top Charting apps from the Google App Store. The Google App Store is the largest Android app store: it hosts not only numerous applications but also movies and music. For this analysis, I scraped the apps listed on the Top Charting page (https://play.google.com/store/apps/top). Scrapy is the key tool in this project, and I describe the details in Part II.
The recommendation system is based on the "doc2vec" algorithm, published by Google researchers in 2014, which converts text into numeric vectors. I then use cosine similarity to find the matching reviews and apps.
The spam review analysis is based on Bing Liu's "Sentiment Analysis and Opinion Mining", published in May 2012, in which he categorizes the different types of spam reviews. In this project, due to some limitations, I select spam reviews manually and analyze those results.
II. Data Web Scraping on Google App Store
Source: https://play.google.com/store/apps/top
In this web scraping project, each scraped item has the following fields:

| Top Charting | Category | Company | App Name |
| Username | Date | Rating | Review |

where Top Charting is a categorical variable with six levels: Top Free in Android Apps, Top Paid in Android Apps, Top Grossing Android Apps, Top Free Games, Top Paid Games, and Top Grossing Games.
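The item above can be sketched as a plain Python class. The field names here are my own; the actual Scrapy item in spider.py may name them differently.

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    # One record per scraped user review; field names are illustrative.
    top_charting: str  # e.g. "Top Free in Android Apps"
    category: str      # e.g. "Puzzle", "Casino"
    company: str
    app_name: str
    username: str
    date: str
    rating: int        # 1-5 stars
    review: str

item = ReviewItem("Top Free in Android Apps", "Puzzle", "SomeDev",
                  "Block! Hexa Puzzle", "user1", "2017-02-01", 5,
                  "great puzzle game!")
print(item.app_name)  # → Block! Hexa Puzzle
```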
The source page requires three layers of scraping. I first use the basic parse function to parse the source page. Next, the "See More" button, located in the upper-right corner of the screenshot, leads to the list page for each chart. Finally, following each app's icon leads to the target pages containing the user reviews.
Source Code: Scrapy spider.py
Word Clouds
Finally, I collected 8,972 reviews from these apps. To show which words carry the most weight, I plotted word clouds for three Top Charting categories: Top Free in Android Apps, Top Paid in Android Apps, and Top Grossing Games.
[Word clouds: Free Apps in Android · Top Paid Apps in Android · Grossing Games]
From these three word clouds, we can see that "game", "money", "work", and a few other words stand out as important. If I were an application developer, I might find the direction for my next app in these words.
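A word cloud is just a visualization of term frequency; the underlying counting can be sketched with the standard library (the stop-word list and tokenization here are simplified stand-ins for whatever the notebook actually used):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "it", "and", "to", "i", "this", "not", "does"}

def top_words(reviews, n=3):
    """Count word frequencies across reviews, ignoring stop words."""
    words = []
    for text in reviews:
        words += [w for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)

reviews = ["This game is great", "Great game, worth the money",
           "The app does not work"]
print(top_words(reviews, 2))  # → [('game', 2), ('great', 2)]
```

The bigger the count, the bigger the word appears in the cloud.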
III. Recommendation System
Natural Language Processing is one of the most important topics in machine learning. There are several algorithm choices for representing text, such as term frequency-inverse document frequency (TF-IDF), LDA, and LSI. What I used in this project is the "doc2vec" algorithm, published by Google researchers in 2014 and freely available in the "gensim" library in Python.
This algorithm converts text into numeric vectors. I then use cosine similarity to compare the input vector against every vector in the dataset and find the most similar reviews.
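The matching step can be sketched with plain NumPy; the toy vectors below stand in for the doc2vec embeddings of the reviews:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_review(query_vec, review_vecs):
    """Return (index, similarity) of the review closest to the query."""
    sims = [cosine_similarity(query_vec, v) for v in review_vecs]
    best = int(np.argmax(sims))
    return best, sims[best]

# Toy stand-ins for doc2vec review vectors.
review_vecs = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.8, 0.3],
                        [0.4, 0.4, 0.8]])
query = np.array([1.0, 0.0, 0.1])
idx, sim = most_similar_review(query, review_vecs)
print(idx)  # → 0
```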
First, a quick test: find the top 3 single words in all reviews that are similar to "game" and "great" but dissimilar to "bad".
Results
The results are "app", "job", and "love", along with their similarity scores. Not bad! The word "app" is quite similar to "game" in this dataset, and "love" expresses a positive attitude, the opposite of "bad".
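This kind of "similar to X and Y but not Z" query works by averaging the positive word vectors, subtracting the negative ones, and ranking the vocabulary by cosine similarity, which is what gensim's `most_similar` does under the hood. A sketch with a tiny made-up embedding space:

```python
import numpy as np

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to mean(positive) - mean(negative)."""
    target = (np.mean([vectors[w] for w in positive], axis=0)
              - np.mean([vectors[w] for w in negative], axis=0))
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # skip the query words, as gensim does
        scores[word] = float(np.dot(target, vec)
                             / (np.linalg.norm(target) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

# Made-up word vectors, chosen so the result mirrors the post's example.
vectors = {
    "game":  np.array([1.0, 0.2, 0.0]),
    "great": np.array([0.2, 1.0, 0.0]),
    "bad":   np.array([0.0, 0.0, 1.0]),
    "app":   np.array([0.9, 0.5, 0.1]),
    "love":  np.array([0.3, 0.9, 0.2]),
    "slow":  np.array([0.1, 0.0, 0.9]),
}
result = most_similar(vectors, ["game", "great"], ["bad"], topn=2)
print([w for w, _ in result])  # → ['app', 'love']
```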
Then I enter the input "great puzzle game!". After computing the similarities, the algorithm returns the following review as the best match.
(Review-6784, Similarity: 0.8765): The game is very fun, but the amount of little bugs in the game bothers me; tricks wont register or multiplier decides to not go up. I love the game but just wish it would run smooth, especially after its been out this long
App name: Block! Hexa Puzzle, with similarity 0.7963.
This is quite good: "Block! Hexa Puzzle" is obviously a puzzle game, and the review shows a positive attitude, which means the game is at least not bad.
Observations
But before continuing, I notice the similarity is only 0.7963, which makes me doubt the result. The first example looks cool, but what about other inputs? So I try another input, "Great RPG Game", and this time I get a totally different result.
(Review-5845, Similarity: 0.9248): New version loses all stats. -- New version is low quality pixelated graphics. -- New version loses all paid purchases. -- New version is slow & laggy. -- No support, thieves. -- GARBAGE -- Advertisements in a paid application. Go rot in a pit. -- Seriously, you got paid, I don't care if it is $0.01 or $1000.00. -- You got paid, therefore advertisements should be removed. -- I'll take lessons taught here & keep them in mind to decide if I should purchase future titles from publisher & developers involved.
App name: My Talking Angela, with similarity 0.7891.
"My Talking Angela" is a virtual pet game like "My Talking Tom". It is not a typical role-playing game at all. So there must be some shortcomings or mistakes on this recommendation system.
Findings
In my opinion, there are three possible reasons for this bad result. The first is the small dataset: 8,972 is not a large number for a review dataset. In research on reviews, 30,000 is often considered the minimum because of the complexity of natural language.
The second reason is spam reviews. The recommendation system assumes that all reviews in the dataset are real and informative. However, people sometimes post biased or fake reviews for private reasons, which makes the review dataset unreliable. The last reason is a shortcoming of the algorithm itself. Generally speaking, natural language processing cannot be 100% correct: since it depends on the training set and on manually set labels, the accuracy of the algorithm is quite uncertain.
IV. Spam Review Data Analysis
According to the book "Sentiment Analysis and Opinion Mining" by Bing Liu of the University of Illinois at Chicago, spam reviews can be divided into two types: real spam reviews (which I call trash reviews) and fake reviews. Trash reviews are comments that, for various reasons, are biased away from the user's real attitude, or that show no emotion or evaluation at all, which makes them spam. Fake reviews are just what their name suggests: reviews created by someone for profit or other purposes.
The best way to detect fake reviews is on a per-user basis: duplicates suggest a review was very likely fabricated. (In my dataset, I removed exact duplicate reviews.) The definition of trash reviews, however, is still not clear. So I use NLP to run a sentiment analysis, compare it with the rating, and fit a logistic regression of the rating on several review-related factors.
Box Plot
From this box plot, we can see that sentiment and rating are correlated in their means, but a large amount of data falls outside the 25%-75% quantile range.
The logistic regression details are listed below:

Rating > 3: Positive
Rating <= 3: Negative

Factors in this regression:
1. Length of review
2. Date index
3. Number of brand/company name mentions
4. Compound sentiment
5. Negative sentiment
6. Neutral sentiment
7. Positive sentiment
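The first three factors come directly from each review; the four sentiment factors would come from a sentiment analyzer such as NLTK's VADER, which is stubbed out here as an input dict. A sketch of the feature extraction, with assumed field names:

```python
def extract_features(review, date_index, brand, sentiment):
    """Build the 7-factor feature vector for one review.

    `sentiment` is a dict with compound/neg/neu/pos scores, e.g. as
    returned by a VADER-style analyzer.
    """
    text = review.lower()
    return [
        len(review),                # 1. length of review
        date_index,                 # 2. date index
        text.count(brand.lower()),  # 3. brand/company name mentions
        sentiment["compound"],      # 4. compound sentiment
        sentiment["neg"],           # 5. negative sentiment
        sentiment["neu"],           # 6. neutral sentiment
        sentiment["pos"],           # 7. positive sentiment
    ]

def label(rating):
    """Rating > 3 is positive (1), otherwise negative (0)."""
    return 1 if rating > 3 else 0

feats = extract_features("Supercell made a great game", 42, "Supercell",
                         {"compound": 0.6, "neg": 0.0, "neu": 0.5, "pos": 0.5})
print(feats[2], label(4))  # → 1 1
```

These feature vectors and labels would then be fed to a standard logistic regression fit.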
Result:
Accuracy On Training Set: 69.9%
Accuracy On Test Set: 41.8%
From these two numbers, we can see that rating and sentiment are not truly correlated in this dataset, although they are supposed to be. Given this, I decided to define a trash review as one where the absolute difference between the rating and the adjusted sentiment (rescaled to 1-5) is at least 3:
abs(Adjusted Sentiment - Rating) ≥ 3
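Assuming the compound sentiment score lies in [-1, 1], one way to rescale it onto the 1-5 rating range and apply the rule above (the exact rescaling is my assumption, not stated in the post):

```python
def adjusted_sentiment(compound):
    """Linearly map a compound score in [-1, 1] onto the 1-5 rating scale."""
    return 2 * (compound + 1) + 1

def is_trash(rating, compound):
    """Flag a review whose rating and sentiment disagree by 3 or more."""
    return abs(adjusted_sentiment(compound) - rating) >= 3

print(is_trash(5, -1.0))  # 5 stars but fully negative text → True
print(is_trash(4, 0.5))   # rating matches sentiment → False
```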
Based on each review's app category, I made a pie chart showing which kinds of apps have the most trash reviews. The chart clearly shows that Casino apps, such as card games and other betting apps, have the most trash reviews, followed by Action and Strategy apps.
Fake Reviews
Another topic is fake reviews. From the user side, we can divide duplicate fake reviews into four types:
First type: duplicates from different users on the same app
Second type: duplicates from the same user on different apps
Third type: duplicates from the same user on the same app
Fourth type: duplicates from different users on different apps
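The four types fall out of comparing the user and the app across any two reviews with identical text; a minimal sketch, where each review record is a (user, app, text) tuple:

```python
from itertools import combinations

def duplicate_type(r1, r2):
    """Classify a pair of identical-text reviews into the four types."""
    (u1, a1, t1), (u2, a2, t2) = r1, r2
    if t1 != t2:
        return None  # not duplicates at all
    same_user, same_app = u1 == u2, a1 == a2
    if not same_user and same_app:
        return 1  # different users, same app
    if same_user and not same_app:
        return 2  # same user, different apps
    if same_user and same_app:
        return 3  # same user, same app
    return 4      # different users, different apps

reviews = [("alice", "AppA", "best game ever"),
           ("bob",   "AppA", "best game ever"),
           ("alice", "AppB", "best game ever")]
types = [duplicate_type(r1, r2) for r1, r2 in combinations(reviews, 2)]
print(types)  # → [1, 2, 4]
```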
In this dataset, I searched for duplicates and found that approximately 19.77% of the reviews are duplicated.
This pie chart shows the categories of the duplicate reviews. Casino, Action, Strategy, and Casual are still the top four, but this time Puzzle games receive more duplicate fake reviews.
V. Summary
In this project, I used Scrapy to scrape reviews from the Google App Store, built a recommendation system, and performed a manual analysis of spam reviews. There are several takeaways. First, across all the top apps people download, "game" is the top word. Second, "doc2vec" is a very useful algorithm for turning text into numeric vectors. Third, we saw how trash reviews can be defined and detected manually. And the conclusion from all of the above is: "Casino apps are not reliable."
What's More:
- Bigger datasets
- The way I find duplicates is not general, because users may change a few words. Would cosine similarity be a good way to handle this?
- A user interface for the recommendation system
- With a bigger dataset, it would be possible to build an NLP-based spam review classifier.
References:
Bing Liu, Sentiment Analysis and Opinion Mining (draft), Morgan & Claypool Publishers, May 2012.
https://www.cs.uic.edu/~liub/FBS/SentimentAnalysis-and-OpinionMining.pdf
Ben Popken, "30 Ways You Can Spot Fake Online Reviews", April 14, 2010.
https://consumerist.com/2010/04/14/how-you-spot-fake-online-reviews/