Round 7 Of The Yelp Dataset Challenge
In this post I document my exploration of the data from the Yelp Dataset Challenge. In particular, I am interested in finding peculiar patterns and interesting insights the data can offer.
- YELP Dataset Challenge
- Data Munging
- Visual Exploratory Data Analysis
- Future Work
YELP Dataset Challenge
Yelp is a crowd-sourced review platform for local businesses where people can rate and review a variety of services. Every year since 2013, Yelp has organized the "Yelp Dataset Challenge", in which it releases some of the data collected from its website and offers prizes for the data science projects it judges best. Possible research directions, such as cultural trends, location mining, and urban planning, are listed on the challenge's website.
The data is available on Yelp's website as five JSON files, one each for businesses, users, check-ins, tips, and reviews. Yelp's description gives a concise summary of the contents:
- 2.2M reviews and 591K tips by 552K users for 77K businesses
- 566K business attributes, e.g., hours, parking availability, ambience.
- Social network of 552K users for a total of 3.5M social edges.
- Aggregated check-ins over time for each of the 77K businesses
- 200,000 pictures from the included businesses
Data Munging
This was my first time working with JSON data in R, and it was difficult. With some research, however, I eventually got the data loaded and transformed.
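To give a sense of what the loading step involves, here is a minimal sketch. It is written in Python rather than the R I used for this post, and it assumes that each Yelp file stores one JSON object per line and that nested fields (such as business attributes) should be flattened into columns. The function names and the example attribute names are my own illustrations, not anything defined by Yelp.

```python
import json


def load_json_lines(path):
    """Read a file holding one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted key names.

    For example {"attributes": {"Parking": {"lot": True}}} becomes
    {"attributes.Parking.lot": True}. The attribute names are illustrative.
    """
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat
```

With these helpers, something like `[flatten(r) for r in load_json_lines("business.json")]` would give one flat record per business, where `business.json` is a placeholder for wherever the business file was saved.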
Visual Exploratory Data Analysis
I first explored the distribution of business ratings. Most are rated between 3.5 and 4.5 out of 5. Either these ratings are warranted, or reviewers lean towards positive ratings.
A similar pattern appears in the average number of reviews at each rating. On average, businesses rated between 3.5 and 4.5 stars have more reviews. I suspect this is due to a positive feedback loop: people look for highly rated businesses on Yelp to visit and then leave high ratings themselves, increasing the number of reviews.
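The two summaries behind these observations can be sketched as below. This is an illustrative Python version, not the R code I actually ran; it assumes each business record is a dict with `stars` and `review_count` fields, and the sample records are made up purely to show the input shape.

```python
from collections import Counter, defaultdict


def rating_distribution(businesses):
    """Count how many businesses fall at each star rating."""
    return Counter(b["stars"] for b in businesses)


def mean_reviews_by_rating(businesses):
    """Average the review count of businesses that share a star rating."""
    totals = defaultdict(lambda: [0, 0])  # stars -> [review total, business count]
    for b in businesses:
        totals[b["stars"]][0] += b["review_count"]
        totals[b["stars"]][1] += 1
    return {stars: total / n for stars, (total, n) in sorted(totals.items())}


# Made-up records, only to demonstrate the calls.
sample = [
    {"stars": 4.0, "review_count": 120},
    {"stars": 4.0, "review_count": 80},
    {"stars": 2.0, "review_count": 10},
]
print(rating_distribution(sample))
print(mean_reviews_by_rating(sample))
```

Plotting the first result as a bar chart gives the rating distribution, and plotting the second gives the average number of reviews per rating.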
Going deeper into the data, a cursory glance at the state level revealed that Arizona has the most businesses listed on Yelp.
At the city level, Las Vegas took the crown for the most businesses on Yelp. This is unsurprising given the city’s many casinos, hotels, and restaurants.
The Las Vegas effect is perhaps responsible for Nevada having the highest number of reviews even though Arizona has more Yelp business listings. Again, I think this is a knock-on effect of the city's service industry.
When it comes to the businesses themselves, the field is dominated by restaurants.
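The comparison between listings and reviews comes down to grouping businesses by state (or city) and keeping two tallies per group. Here is a rough Python sketch of that idea, again not my original R code. It assumes each business record has `state`, `city`, and `review_count` fields, and the sample numbers are invented to mimic the pattern described above rather than taken from the dataset.

```python
from collections import defaultdict


def summarize_by(businesses, key):
    """Return {group: (number of businesses, total reviews)} for a field."""
    summary = defaultdict(lambda: [0, 0])
    for b in businesses:
        group = summary[b[key]]
        group[0] += 1
        group[1] += b["review_count"]
    return {name: tuple(counts) for name, counts in summary.items()}


def top(summary, index, n=5):
    """Rank groups by listings (index=0) or by total reviews (index=1)."""
    return sorted(summary, key=lambda name: summary[name][index], reverse=True)[:n]


# Invented records: more listings in AZ, more reviews in NV.
sample = [
    {"state": "AZ", "city": "Phoenix", "review_count": 20},
    {"state": "AZ", "city": "Phoenix", "review_count": 15},
    {"state": "AZ", "city": "Scottsdale", "review_count": 10},
    {"state": "NV", "city": "Las Vegas", "review_count": 300},
    {"state": "NV", "city": "Las Vegas", "review_count": 150},
]
by_state = summarize_by(sample, "state")
print(top(by_state, index=0))  # ranked by number of listings
print(top(by_state, index=1))  # ranked by total reviews
```

Calling `summarize_by(sample, "city")` produces the same two tallies at the city level.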
The distribution of the various restaurant subcategories is more uniform.
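Category counts like these can be produced along the following lines. This Python sketch assumes each business record carries a `categories` field holding a list of label strings, and that restaurant subcategories are simply the other labels found on businesses tagged `Restaurants`. Both are my assumptions about the data layout, and the sample records are invented.

```python
from collections import Counter


def category_counts(businesses):
    """Count how many businesses carry each category label."""
    return Counter(c for b in businesses for c in (b.get("categories") or []))


def subcategory_counts(businesses, parent="Restaurants"):
    """Count the other labels on businesses tagged with the parent category."""
    counts = Counter()
    for b in businesses:
        labels = b.get("categories") or []
        if parent in labels:
            counts.update(label for label in labels if label != parent)
    return counts


# Invented records, only to show the input shape.
sample = [
    {"categories": ["Restaurants", "Mexican"]},
    {"categories": ["Restaurants", "Pizza"]},
    {"categories": ["Shopping"]},
    {"categories": None},
]
print(category_counts(sample).most_common(3))
print(subcategory_counts(sample))
```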
This is a preliminary analysis of the data. There is definitely room for a deeper look at its intricacies, which could reveal some of America’s attitudes towards various businesses. Here are some things that I learned and possible directions for future analysis:
- Most businesses are rated above 3.
- Businesses rated 3.5-4.5 have the most reviews on average.
- Arizona and Nevada have the most businesses listed.
- Nevada and Arizona have the most reviews, with Nevada ahead despite fewer listings.
- Restaurants are the most frequently represented business in the data set
Future Work
- Explore the other three datasets provided: user check-ins, tips, and reviews
- Apply statistical analysis and create more visualizations
- Apply predictive modeling when round 8 begins