Web Scraping Seamless.com: East Village Restaurant Ratings
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Seamless is a food ordering service that allows users to place delivery or takeout orders online. It operates in New York City, among other major metropolitan areas in the US and Europe. I scraped data on 1,513 restaurants that deliver in my neighborhood in Manhattan, and analyzed how cuisine type, order accuracy, order timeliness, and food quality correlate with overall user restaurant ratings.
During my MBA program, a colleague and I were admitted to a program within the Yale School of Management that allowed students to work on new ventures for course credit. Our venture offered demand-based pricing analysis to small independent restaurants and burgeoning chains. Our hypothesis was that while large chains may have sophisticated pricing procedures, small restaurants price non-optimally based on loose competitor analysis and general intuition, thus leaving money on the table (pun intended).
Our hypothesis turned out to be correct, but our service never gained traction, as restaurateurs proved highly risk averse when it came to experimenting with price adjustments. We felt that richer data on competitor pricing might have allowed us to overcome this barrier to adoption, but we had no systematic way to collect this data.
Building on this experience, my goal for this project was to explore how the tools I have been learning at NYC Data Science Academy might have helped our venture (or a similar venture) succeed in its earliest stages.
Starting small, I decided to investigate the following three key questions instead of pricing considerations:
- Are user ratings for restaurants that deliver to the East Village in Manhattan via Seamless.com significantly different by cuisine type?
- How are order accuracy, timeliness, and food quality correlated with ratings, if at all?
- Do the correlations mentioned above differ by cuisine type?
The Scraping Procedure and Data
To answer these three questions, I filtered the Seamless website for all restaurants that deliver to my address in the East Village. This yielded 76 pages of results with 20 restaurants listed per page. I then scraped the Seamless URL for each restaurant listed on each page by identifying the HTML tag containing the address, as shown below.
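The URL-collection step can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the scraper actually used for this project; the `restaurant-link` class name and the sample HTML are hypothetical stand-ins for whatever markup the results pages used.

```python
from html.parser import HTMLParser


class RestaurantLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class  # hypothetical class name, not real Seamless markup
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.link_class in classes and href:
            self.urls.append(href)


def extract_restaurant_urls(page_html, link_class="restaurant-link"):
    """Return every matching restaurant URL found in one results page."""
    parser = RestaurantLinkParser(link_class)
    parser.feed(page_html)
    return parser.urls


# Made-up results-page fragment, for illustration only.
sample_page = """
<div class="result"><a class="restaurant-link" href="/menu/example-one">Example One</a></div>
<div class="result"><a class="other" href="/about">About</a></div>
<div class="result"><a class="restaurant-link featured" href="/menu/example-two">Example Two</a></div>
"""

urls = extract_restaurant_urls(sample_page)
print(urls)
```

Running the same function over the HTML of each results page and concatenating the lists would give the full set of restaurant URLs to feed into the second scraper.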
Using Python and Selenium, I then built another web scraper to pull the data listed in the table below from each web address. I chose Selenium because the data on each page is retrieved via Ajax calls, so an active browser session was required; I also built in a wait time before scraping to allow the data to load. The HTML tag for the star rating only reported the range of pixels to be filled in with yellow, so I converted this range to the corresponding rating during the scraping procedure.
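The pixel-to-rating conversion might look something like the sketch below. The five-star scale, the 100-pixel full width, and the inline `width: ...px` style format are all assumptions made for illustration; the actual values depend on the page markup at the time of scraping.

```python
import re

MAX_STARS = 5          # assumption: ratings are on a five-star scale
FULL_WIDTH_PX = 100.0  # assumption: pixel width of a completely filled star bar


def parse_filled_width(style):
    """Pull the pixel width out of an inline style such as 'width: 68px'."""
    match = re.search(r"width:\s*([\d.]+)px", style)
    if match is None:
        raise ValueError(f"no pixel width found in {style!r}")
    return float(match.group(1))


def pixels_to_rating(filled_px, full_width_px=FULL_WIDTH_PX, max_stars=MAX_STARS):
    """Scale the filled pixel width linearly onto the star scale."""
    if not 0 <= filled_px <= full_width_px:
        raise ValueError("filled width must lie between 0 and the full width")
    return max_stars * filled_px / full_width_px


# Hypothetical style attribute: 80 of 100 pixels filled.
rating = pixels_to_rating(parse_filled_width("width: 80px"))
print(rating)
```

The linear mapping assumes the yellow fill grows evenly across the stars with no gaps between them; if the stars are padded, the conversion would need to account for that spacing.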
Question 1: Impact of Cuisine Type
The bar chart above shows the differences in average rating by restaurant cuisine type. While these differences appear minute, an ANOVA test (F = 9.57, p < 0.01) indicated that at least one group's mean rating differs significantly from the others. A Bartlett test for equal variance and visual inspection of the ratings distribution confirmed that the ANOVA assumptions were met.
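For readers unfamiliar with the test, here is a minimal pure-Python sketch of how the one-way ANOVA F statistic is computed. The ratings below are made up and are not the scraped data, and the sketch stops short of the p-value, which requires the F distribution; in practice a library routine such as SciPy's `scipy.stats.f_oneway` (with `scipy.stats.bartlett` for the variance check) would typically handle all of this.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up ratings for three hypothetical cuisine groups.
ratings_by_cuisine = {
    "cuisine_a": [4.0, 4.5, 5.0],
    "cuisine_b": [3.0, 3.5, 4.0],
    "cuisine_c": [3.5, 4.0, 4.5],
}

f_stat, df_between, df_within = one_way_anova_f(list(ratings_by_cuisine.values()))
print(f_stat, df_between, df_within)
```

A large F means the spread between group means is large relative to the spread within groups, which is why small-looking differences in the bar chart can still be statistically significant when there are many restaurants per group.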
While not all pairwise t-tests were performed, those performed showed that lower Manhattan consumers rated Asian cuisines significantly higher than all other types except for Italian. This may be due to restaurant quality, consumer preference, or demographic mix in the area.
Question 2: Importance of Delivery Quality, Timeliness, and Accuracy
The graphs above show that delivery quality, timeliness, and accuracy all show moderate positive correlation with restaurant rating. However, a separate analysis indicated that these factors have a high degree of multicollinearity among themselves, suggesting that they are not all independent explanatory factors for restaurant rating.
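One simple way to check for this kind of overlap is to compute pairwise Pearson correlations among the predictors themselves. The sketch below does that in plain Python; the percentages are invented for five hypothetical restaurants, and this is not necessarily the separate analysis referred to above.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores (percent of positive responses) for five hypothetical restaurants.
factors = {
    "quality": [80, 85, 90, 95, 100],
    "timeliness": [70, 80, 85, 90, 100],
    "accuracy": [75, 85, 88, 92, 99],
}

# Correlate every pair of predictors with each other.
pairwise = {
    (a, b): pearson_r(factors[a], factors[b])
    for a, b in combinations(factors, 2)
}
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

When the predictors are this strongly correlated with one another, their individual correlations with rating cannot be read as independent effects, which is the motivation for the regression modeling described next.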
The next step in my analysis will be to build a multivariate linear or penalized linear regression model to isolate the best predictor(s). This has not yet been completed. Given the limited variance and interval nature of the rating variable, a classification model may also be useful.
Question 3: Differences in Correlations by Cuisine Type
For the third question, I decided to look just at two cuisine types - Chinese and American. One might hypothesize that for Chinese delivery, customers would care more about accuracy and timeliness, relative to American food. The above graphs only show correlation, but suggest that this hypothesis may be worth investigating further. Accuracy and timeliness are more strongly correlated with rating among Chinese restaurants vs. American restaurants.
However, quality is also more highly correlated with ratings for Chinese restaurants, though the gap in correlation coefficients is smaller. An additional caveat is that the looser correlations among American restaurants may be due to the wide variability in quality, accuracy, and timeliness for poorly rated American restaurants.
Again, the next step in this analysis would be to create two multivariate linear models, one per cuisine type, and compare coefficients across the models.
Conclusions and Next Steps
This project showed that ratings differ significantly by cuisine type for restaurants that deliver to the East Village in Manhattan. Order accuracy, quality, and timeliness are also correlated with ratings, and this correlation may differ by cuisine type. As mentioned above, additional modeling is required to tease out the nuances of these relationships.
In addition, a broader business analysis is required to understand the import of these findings. For example, how does rating relate to demand? The number of reviews could be used as a proxy for demand in this scenario. How much might it cost a restaurant to improve order accuracy or timeliness, and would this lead to a bump in demand that makes such an investment worthwhile?
Finally, by allotting a longer time to run my scraper, I would be able to collect data on all restaurants in Manhattan and beyond, and could also collect the prices of all menu items offered online. Building this type of pricing database would be highly valuable to restaurant owners, as competitive dynamics are a key input to price determination.
I would not be surprised to learn that Seamless already returns similar analytics to its partners as an incentive to list on the site. Regardless, it is clear that the ability to scrape and analyze restaurant data would have been highly valuable to my (failed) restaurant consulting venture.