Web Scraping OpenTable
The goal of this web scraping project is to identify a restaurant type with an above-average success rate based on data from OpenTable. This project treats the number of reviews as a proxy for popularity and the average rating as a measure of how well a restaurant is liked. Creating a successful restaurant is difficult, so the standard deviation is a useful statistic when considering how likely a restaurant is to succeed or fail. Here, we compare the standard deviation of ratings and number of reviews for different restaurant groupings.
How OpenTable Operates
OpenTable is a SaaS company that focuses on connecting people with restaurants throughout the world. It provides information on the location, cuisine, price range, rating and number of reviews for many restaurants. A key feature behind OpenTable's success is its online reservation system, which includes rewards. Restaurants pay to be a part of the OpenTable system and are charged a dollar per seat reserved through the website. This relationship skews OpenTable's review data toward higher ratings. The lowest rating a restaurant on OpenTable can receive is 3 out of 5 stars.
The Web Scraping and Cleaning Process
The data was collected using Scrapy. Getting the data to analyze required sifting through many paths. Starting on OpenTable's main page, I collected the URLs for each area, sixty areas in total. The areas ranged from 6,580 restaurants in Miami, United States to 68 restaurants in Kyoto, Japan. Once each area was collected, the URLs for the locations within each area needed to be generated. These URLs included a latitude, longitude and metroID, which were collected from the main page of each area and then formatted to redirect to each location. Finally, I scraped every restaurant on OpenTable and collected the area, location, restaurant name, review count, rating, cuisine, cost, link to the restaurant, position on the website page, booking information for that day and whether or not it was currently being promoted. In total, 163,831 restaurants were collected.
After collecting this information, the data needed to be cleaned. All areas, locations, restaurant names and cuisines were converted to lowercase and any whitespace was removed. The cost was converted from dollar signs to the price ranges associated with those dollar signs. For example, $$$ was converted to $31 to $50. The rating was converted to a percent by subtracting 3, dividing by 2 and multiplying by 100, since all restaurant scores fell between 3 and 5 stars. Finally, the promoted field was converted to a boolean.
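The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the code used for the project: the field names (`area`, `location`, `name`, `cuisine`, `cost`, `rating`, `promoted`) and the helper names are assumptions, "whitespace was removed" is interpreted here as stripping leading and trailing whitespace, and only the `$$$` entry of the price table comes from the text.

```python
# Price-tier labels. Only the "$$$" entry is taken from the write-up;
# the other two entries are assumed placeholders for illustration.
PRICE_RANGES = {
    "$$": "$30 and under",   # assumption
    "$$$": "$31 to $50",     # from the text
    "$$$$": "$50 and over",  # assumption
}


def rating_to_percent(stars):
    """Rescale a rating on the 3-to-5 star range onto a 0-to-100 scale."""
    return (stars - 3) / 2 * 100


def clean_record(raw):
    """Return a cleaned copy of one scraped restaurant record (a dict)."""
    cleaned = dict(raw)
    # Lowercase the text fields and strip surrounding whitespace.
    for field in ("area", "location", "name", "cuisine"):
        cleaned[field] = raw[field].strip().lower()
    # Replace the dollar-sign tier with its price range when it is known.
    cleaned["cost"] = PRICE_RANGES.get(raw["cost"], raw["cost"])
    cleaned["rating_pct"] = rating_to_percent(raw["rating"])
    # Any non-empty promoted marker counts as True.
    cleaned["promoted"] = bool(raw["promoted"])
    return cleaned
```

With this rescaling, a 4.5-star restaurant maps to 75 percent, and the 3-star floor maps to 0.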
The Data Analysis
For the majority of the data analysis, the 163,831 restaurants are reduced to 23,727 by eliminating all restaurants with fewer than 50 reviews. This prevents outliers with very few or no ratings from heavily impacting the analysis. However, it also skews the analysis toward a subset of restaurants that may not be representative of the entire population, which is important to keep in consideration. This issue is illustrated in the pie charts below: a larger percentage of expensive restaurants is represented in the final dataset.
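A minimal sketch of this filtering step, and of the check on how it shifts the mix of price tiers, might look like the following. The record layout (dicts with `review_count` and `cost` keys) and the function names are assumptions made for illustration; the 50-review threshold is the one described above.

```python
from collections import Counter

MIN_REVIEWS = 50  # threshold described in the text


def filter_by_reviews(records, min_reviews=MIN_REVIEWS):
    """Keep only restaurants with at least `min_reviews` reviews.

    A missing or empty review count is treated as zero reviews.
    """
    return [r for r in records if (r.get("review_count") or 0) >= min_reviews]


def cost_shares(records):
    """Return the fraction of restaurants that fall in each cost tier."""
    counts = Counter(r["cost"] for r in records)
    total = sum(counts.values())
    return {cost: count / total for cost, count in counts.items()}
```

Calling `cost_shares` on the full dataset and again on the filtered one gives the two sets of proportions that a pair of pie charts would compare.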
This graph illustrates how skewed the OpenTable data is toward higher ratings. Only 8.32% of all restaurants in the dataset have a rating below 4 stars. To identify the best location to open a restaurant, we use a funneling approach. Looking only at the four major cities in the United States, we compare ratings and popularity.
These graphs indicate that Chicago restaurants are more likely to be successful on OpenTable. They receive more reviews on average and have a higher average rating. Now we will investigate the top neighborhoods within Chicago.
River North and Gold Coast are comparable in popularity, so let's investigate further by splitting each neighborhood by cost. The Gold Coast in Chicago seems to perform particularly well in the upper-end price range.
So what food would be best to sell? Below we look at the top cuisines represented on OpenTable and compare their performance within Chicago.
From these graphs, it is clear that steakhouses outperform other cuisine types. Finally, we compare Chicago steakhouses with all restaurants in our dataset and with all steakhouses to see if our proposal holds up.
These graphs support the claim. While this approach is not exhaustive and does not represent the full investigation that a prospective owner might do before opening a new location, it demonstrates an approach to investigating large data to come to a conclusion.
Using the number of reviews and the ratings of restaurants in the four major cities in the United States as criteria for success, the data suggests that, if you were considering investing in a restaurant, a steakhouse in Chicago would be a likely successful choice. With a smaller standard deviation in both ratings and review count, as well as higher average ratings and review counts, steakhouses in Chicago are less likely to fail than other restaurant types throughout the United States.
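The comparison behind this conclusion comes down to computing the mean and standard deviation of ratings and review counts for each grouping. The sketch below shows one way to do that with the Python standard library. The field names (`rating_pct`, `review_count`) and the `summarize_by_group` helper are assumptions for illustration, and the sample standard deviation is used here, which may differ from the exact statistic behind the graphs.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_group(records, key):
    """Group records with `key` and report mean and spread per group.

    `key` is a function that maps a record to its group label,
    for example a (city, cuisine) tuple.
    """
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    summary = {}
    for label, rows in groups.items():
        if len(rows) < 2:
            continue  # the sample standard deviation needs two or more values
        ratings = [r["rating_pct"] for r in rows]
        reviews = [r["review_count"] for r in rows]
        summary[label] = {
            "count": len(rows),
            "rating_mean": mean(ratings),
            "rating_std": stdev(ratings),
            "reviews_mean": mean(reviews),
            "reviews_std": stdev(reviews),
        }
    return summary
```

For example, `summarize_by_group(records, key=lambda r: (r["area"], r["cuisine"]))` yields one row of statistics per city and cuisine pair. A group with a higher mean and a lower standard deviation than the overall dataset is the pattern the conclusion describes.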
Future Work
- Collect review data to identify words associated with positively reviewed restaurants
- Compare restaurant ratings with menu items
- Collect restaurant data over time to identify trends, assuming that restaurants that are removed from the website went out of business
- Use regular expressions to identify chain restaurants by name and compare their success in varying locations
- Identify the impact of position on the page on the number of online bookings for OpenTable
- Identify the impact of being promoted on the number of online bookings for OpenTable
- Examine the relationship between bookings and the average income of an area