Round 7 Of The Yelp Dataset Challenge

Deepak Khurana

Posted on Jul 25, 2016

In this post I document my exploration of the data from "The Yelp Dataset Challenge". In particular, I am interested in finding peculiar patterns and interesting insights the data could lend us.

Contents

YELP Dataset Challenge
Dataset
Data Munging
Visual Exploratory Data Analysis
Conclusions
Future Work

YELP Dataset Challenge

Yelp is a crowd-sourced review platform for local businesses, where people can rate and review a variety of services. Every year since 2013, Yelp has organized the "Yelp Dataset Challenge", in which it releases some of the data collected from its website and offers prizes for the data science projects it judges to be strong. Possible research directions, such as cultural trends, location mining and urban planning, are listed on the challenge's website.

Dataset

The data is available on Yelp's website as five JSON files: business, users, checkin, tips and reviews. Yelp's description gives a concise summary of the contents:

2.2M reviews and 591K tips by 552K users for 77K businesses
566K business attributes, e.g., hours, parking availability, ambience.
Social network of 552K users for a total of 3.5M social edges.
Aggregated check-ins over time for each of the 77K businesses
200,000 pictures from the included businesses

Data Munging

This was my first time working with JSON data in R, and it was difficult. With some research, though, I eventually got the data loaded and transformed.
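As a rough illustration of the loading step, here is a minimal sketch in Python (the post itself used R). It assumes each file stores one JSON object per line rather than a single JSON array, which is a common stumbling block with this kind of export. The function name `read_json_lines`, the field names and the two sample records are made up for illustration.

```python
import io
import json


def read_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Hypothetical two-record sample standing in for the business file.
sample = io.StringIO(
    '{"business_id": "b1", "state": "AZ", "stars": 4.0, "review_count": 12}\n'
    '{"business_id": "b2", "state": "NV", "stars": 3.5, "review_count": 40}\n'
)
businesses = read_json_lines(sample)
print(len(businesses), businesses[0]["state"])
```

With a real file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points at the downloaded business file.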

Visual Exploratory Data Analysis

I first explored the distribution of business ratings. Most businesses are rated between 3.5 and 4.5 out of 5. Either these ratings are warranted, or reviewers lean towards positive ratings.

Figure: Distribution of business ratings
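The rating distribution can be tallied along these lines. This is a sketch in Python rather than the R used for the post, and it assumes each business record carries a numeric `stars` field. The sample records are invented.

```python
from collections import Counter


def rating_distribution(businesses):
    """Count how many businesses fall on each star rating."""
    counts = Counter(b["stars"] for b in businesses)
    return dict(sorted(counts.items()))


# Invented sample records; only the "stars" field is used.
sample = [{"stars": 4.0}, {"stars": 3.5}, {"stars": 4.0}, {"stars": 2.0}]
dist = rating_distribution(sample)
print(dist)
```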

A similar pattern appears in the average number of reviews per rating. On average, businesses rated between 3.5 and 4.5 stars have more reviews. I suspect this is due to a positive feedback loop: people look for highly rated businesses on Yelp to visit and then leave high ratings themselves, increasing the number of reviews.

Figure: Average number of reviews by rating
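One way to compute the average number of reviews per rating is to group businesses by their star rating and divide the summed review counts by the group size. The sketch below is Python, not the R used for the post, and assumes `stars` and `review_count` fields. The sample values are invented.

```python
from collections import defaultdict


def mean_reviews_by_rating(businesses):
    """Average review_count over the businesses sharing each star rating."""
    totals = defaultdict(lambda: [0, 0])  # stars -> [review sum, business count]
    for b in businesses:
        entry = totals[b["stars"]]
        entry[0] += b["review_count"]
        entry[1] += 1
    return {stars: total / n for stars, (total, n) in sorted(totals.items())}


# Invented sample records.
sample = [
    {"stars": 4.0, "review_count": 10},
    {"stars": 4.0, "review_count": 30},
    {"stars": 2.0, "review_count": 4},
]
averages = mean_reviews_by_rating(sample)
print(averages)
```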

Looking at the state level, Arizona has the most businesses listed on Yelp in this dataset.

Figure: Top 10 states by number of businesses

At the city level, Las Vegas took the crown for most businesses on Yelp. This is unsurprising given the city’s many casinos, hotels and restaurants.

Figure: Top 10 cities by number of businesses

The Las Vegas effect is perhaps responsible for Nevada having the highest number of reviews even though Arizona has more Yelp business listings. Again, I think this is a knock-on effect of its service industry.

Figure: Top 10 states by number of reviews
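The gap between listings and reviews can be checked by tallying both per state. The sketch below is Python (the post used R) and assumes that summing each business's `review_count` is an acceptable stand-in for counting rows in the reviews file. The numbers are invented to mimic the pattern described, not taken from the dataset.

```python
from collections import Counter


def state_totals(businesses):
    """Return (listings per state, summed review counts per state)."""
    listings = Counter()
    reviews = Counter()
    for b in businesses:
        listings[b["state"]] += 1
        reviews[b["state"]] += b["review_count"]
    return listings, reviews


# Invented numbers: more listings in AZ, more reviews in NV.
sample = [
    {"state": "AZ", "review_count": 10},
    {"state": "AZ", "review_count": 10},
    {"state": "AZ", "review_count": 10},
    {"state": "NV", "review_count": 50},
    {"state": "NV", "review_count": 50},
]
listings, reviews = state_totals(sample)
print(listings.most_common(1), reviews.most_common(1))
```

Grouping on a `city` field instead of `state` would give the city-level ranking in the same way.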

When it comes to the businesses themselves, the field is dominated by restaurants.

Figure: Top 10 business categories

The distribution of the various restaurant subcategories is more uniform.

Figure: Top 10 restaurant categories
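Both category rankings can be produced with one counting routine. The sketch below is Python (the post used R) and assumes each business has a `categories` field holding a list of strings. The helper name `top_categories`, its `within` parameter and the sample records are hypothetical.

```python
from collections import Counter


def top_categories(businesses, n=10, within=None):
    """Most common categories; if `within` is set, count only the other
    categories of businesses that carry that parent category."""
    counts = Counter()
    for b in businesses:
        cats = b.get("categories") or []
        if within is not None and within not in cats:
            continue
        counts.update(c for c in cats if c != within)
    return counts.most_common(n)


# Invented sample records.
sample = [
    {"categories": ["Restaurants", "Pizza"]},
    {"categories": ["Restaurants", "Mexican"]},
    {"categories": ["Shopping"]},
    {"categories": ["Restaurants", "Pizza", "Italian"]},
]
overall = top_categories(sample, n=1)
restaurant_subcategories = top_categories(sample, within="Restaurants")
print(overall, restaurant_subcategories)
```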

This is only a preliminary analysis of the data. There is definitely room for a deeper look at its intricacies, which could reveal some of America’s attitudes towards various businesses. Here are some things that I learned and possible directions for future analysis:

Conclusions

Most businesses are rated above 3.
Businesses rated between 3.5 and 4.5 have the most reviews on average.
Arizona and Nevada have the most businesses listed, with Arizona first.
Nevada and Arizona have the most reviews, with Nevada first.
Restaurants are the most frequently represented business category in the data set.

Future Work

Explore the other three datasets provided: user check-ins, tips and reviews
Apply statistical analysis and create more visualizations
Apply predictive modeling when round 8 begins

About Author

Deepak Khurana

Deepak holds a Master's degree in Physics from the Indian Institute of Technology Kharagpur, one of the top engineering schools in India. He was then awarded the Henry M. MacCracken fellowship at New York University to pursue a...
