This is the first part of the Yelper_Helper capstone project blog post. Please find the second part here.
1. Introduction
Nowadays every company and individual can make use of a recommender system -- not just customers buying things on Amazon, watching movies on Netflix, or looking for food nearby on Yelp. In fact, one fundamental driver of data science’s skyrocketing popularity is the overwhelming amount of information facing anyone trying to make a good decision.
This is the capstone project sitting at the end of our 12-week journey in the data science bootcamp. For this project, we wanted to work on something that:
- has clear and strong business demand
- requires handling Big Data (gigabytes at least)
- allows us to demonstrate a few Machine Learning techniques
Based on these criteria, we decided on the ‘Yelper Helper’ -- a real-time restaurant recommender system built on Yelp’s open dataset. We believe the experience we gained from this project will be widely applicable.
2. Data Source
The source of our data was courtesy of the Yelp dataset challenge. The purpose of the challenge is to use the provided data to produce innovative and creative insights, and Yelp recognizes the top submissions with a $5,000 prize.
Included in the dataset were five json files, encompassing users, check-ins, tips, reviews, and businesses. Yelp also included 200,000 photos with their associated labels, though we did not incorporate any image analysis in our end product. The total size of the text data was roughly 5 GB. While this barely qualifies as 'big data,' it was large enough to justify using big data tools and techniques. The richness and variety of the data gave us the freedom to apply a wide range of machine learning techniques to create what we called “Yelper Helper.” The goal of Yelper Helper was to build a recommendation app that serves users based on keyword inputs, location, social networks, and reviews. This will be explained in more detail in the following sections.
3. Project Summary
What does our app do?
Knowing the user’s location and other optional information (user ID, keywords), our engine can recommend nearby restaurants and visualize them on a map. The engine is a hybrid recommender. For new or anonymous users, we can provide base-case recommendations using only location information. With additional keywords like "spicy" or "tacos," an NLP (Natural Language Processing) module is turned on that offers similarity-based recommendations; finally, with a user ID as input, the collaborative filtering and social network modules provide more personalized results based on historical ratings and friends’ opinions.
A quick demo
Here are the main components of the system we built:
- Front End: a Flask powered web page
- Stream Processor: a scalable, fault-tolerant structured stream server built with Kafka
- Recommendation Engine: a real-time recommendation generator that resides completely in the Databricks cloud environment (Databricks + Apache Spark + AWS), with four different recommending approaches:
- Content Based: Natural Language Processing (NLP) cosine similarity
- Collaborative Filtering: Alternating Least Squares algorithm
- Social Network: friends’ opinions
- Location-Based: no cold start problem
- Remote Databases:
- AWS RDS for structured data (processed tables)
- AWS S3 for unstructured data (json files, trained models, etc.)
4. Data Prep and EDA
Before building the app and training the models, we wanted to explore a subset of the data. Among the cities available in the Yelp data was Las Vegas, which our team thought would be a good training ground due to the number of restaurants, visitors, and reviews, and the variety in the data. Once the relational database was set up, we dove in and performed some exploratory data analysis on the reviews.
To get a better understanding of how the words in a review related to the rating that a user gave, we performed a high-level sentiment analysis to identify which words were associated with positive and negative reviews. We took a sample of 1000 negative reviews and 1000 positive reviews from restaurants in Las Vegas, and analyzed words or phrases that were used most frequently for those reviews.
Words most often associated with bad reviews are on the top left. They include: worst, awful, poor, blah, and tasteless. Words most frequently associated with good reviews are on the bottom right. Among them: great experience, highly recommend, karaoke, steakhouses, and unique.
Additional exploratory data analysis could be performed, such as restaurant trends, unique attributes per location, etc. Since the data provided by Yelp was already clean with minimal missingness, we did not impute any data. Finally, because we were able to convert the data from json to csv easily, we loaded the data into a MySQL database for easy storage and extraction.
5. Data Storage and Access Strategy
Our team believed that it would be wise to load the converted csv files into a MySQL database hosted by Amazon’s Relational Database Service. Having the data stored in a relational database would then allow us to extract and access the data with efficiency.
MySQL provides many advantages for our recommendation app. First, some of the data files contained millions of rows, so we spun up the MySQL instance and loaded each dataset into a separate table. This made data extraction easy and fast, depending on what we wanted to analyze. Initially, five tables were created: one each for reviews, businesses, tips, check-ins, and users. A visualization of the schema can be seen here:
From there, we created subtables containing only the relevant Las Vegas restaurant data. An additional advantage of MySQL was that it allowed us to easily establish connections to Spark and Python via sqlalchemy and MySQLdb, making it unnecessary to create multiple intermediate/temporary csv files.
6. The Hybrid Recommender Engine
6.1 Natural Language Processing (NLP)
6.1.1 Sentiment Analysis
First, we decided to analyze the Las Vegas subset using Natural Language Processing (NLP), a machine learning technique that aims to understand human language with all its intricacies and nuances. The 800,000 reviews were divided into low ratings of 1 and 2 stars and high ratings of 5 stars. Our goal was to create a supervised learning neural network that could predict the sentiment of a review as positive or negative based on the language used. Restaurant reviews with ratings of 3 or 4 stars were thrown out due to the lack of consistency between reviewers (i.e., one reviewer’s 3 star review could be another’s 5 star review).
The low-rated reviews were given a label of zero and the high-rated ones a label of one. The text of each review was then tokenized and converted to a sequence using the keras package. A neural network was also set up in keras using a convolutional filter, a pooling filter, and both ReLU and sigmoid activation functions to predict the sentiment of each review as either 0 or 1. The final model was 94% accurate and was spot-checked on newly written reviews and on the remaining 3 and 4 star reviews. The model did quite well for our purposes, so we ran all 800,000 reviews through the sentiment network, averaged the predicted sentiment by restaurant, and used the results as a new feature in the location-based recommendation (see section 6.4). The overall process for the sentiment analysis is outlined below.
6.1.2 Content-Based Recommendations
We also wanted to use NLP to make content-based recommendations for our app. Item-to-item similarity is calculated to make these types of recommendations. Think of when you read an article online: there is usually a side menu or a section at the bottom of the page that recommends new articles based on your interest in the current article. Often, these new articles are written on the same or a similar topic. That’s a content-based recommendation.
Content-based recommenders can take a user’s profile and past ratings to make new recommendations to the user. This, however, presents the cold start problem, an issue that arises for a brand-new user who has no rating history. To combat this, our NLP recommendation is based on a keyword soft match (similarity calculation) rather than the user profile. So, typing “tacos” into our app should bring up a ranking of Mexican joints, with the possibility to also recommend some non-Mexican places serving taco-like food.
To make this happen, we once again decided to implement a neural network to understand the language used in the reviews. We used the Word2Vec function from Spark MLlib to create the model. Our process is outlined in the figure below.
To pre-process the data, we concatenated the reviews for each business together for a total of 6,199 restaurants. Then, we tokenized the reviews and removed all stop words. Stop words are common words in the English language that serve a grammatical function within a sentence but carry little content, such as "of," "the," and "has."
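In our pipeline this pre-processing ran through Spark MLlib's tokenizer and stop-word remover; the single-machine sketch below shows the same two steps in plain Python, with a deliberately tiny stop-word list for illustration (Spark ships a much longer one).

```python
import re

# A small illustrative stop-word list; the real list is much longer.
STOP_WORDS = {"of", "the", "has", "a", "an", "and", "is", "was", "to", "in"}

def preprocess(review):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The tacos at this place were the best of the strip!"))
# ['tacos', 'at', 'this', 'place', 'were', 'best', 'strip']
```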
Word2Vec then translates each word into a vector of 100 features. These vectors are located in a feature space of 100-dimensions, with similar words closer together and unrelated words farther apart. For example, the vectors for “ice cream” and “frozen yogurt” should be pointing in nearly the same direction, but the vectors for “delicious” and “disgusting” would be far apart.
Once the words are translated into vectors, in order to check if the result makes sense (and have some fun), we can perform some word algebra. We can add or subtract the vectors from one another to find a new word.
The word vectors above have been flattened into a 2D feature space. Beef and filet mignon are both foods from a cow, while seafood and lobster tail both come from the sea. Lobster tail and filet mignon are exquisite and expensive types of beef and seafood. If we have filet mignon, take away the fact that it is beef and add in a new category of seafood, we end up with lobster tail.
In addition to using word algebra, we can determine how similar two words or documents are using cosine similarity. Mathematically, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, which measures the angle between them. The smaller the angle, the more similar the words and the closer the value is to 1; larger angles indicate dissimilar words, with values approaching negative one. Vectors perpendicular to one another have a cosine similarity of zero.
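The definition above can be written directly in a few lines. The word vectors below are made-up 2-D toys (real Word2Vec vectors have 100 dimensions), chosen only to show the sign and magnitude behavior:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D word vectors (hypothetical, for illustration only):
burger = [1.0, 0.8]
sandwich = [0.9, 1.1]
disgusting = [-1.0, 0.2]

print(round(cosine_similarity(burger, sandwich), 2))   # similar direction, close to 1
print(round(cosine_similarity(burger, disgusting), 2)) # opposing direction, negative
```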
In the figure above, "burger" and "sandwich" point in somewhat similar directions and have a similarity of about 0.6. Below, we can see the results of a similarity search for the word "Chinese."
Since the business reviews are more than a single word and a user may want to search using multiple words as well, Word2Vec averages the vectors of all the words together and then calculates the similarity between the user’s keywords and each of the available restaurants’ reviews. The results are then ranked from most similar to least and returned to the user on the map.
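The averaging-and-ranking step can be sketched end to end. Everything here is a toy stand-in: the word vectors are hypothetical 2-D values and the "restaurants" are two hand-made token lists, but the flow (average word vectors into a document vector, score each restaurant against the query by cosine similarity, sort) mirrors what our engine does at scale:

```python
import math

# Hypothetical 2-D word vectors (real Word2Vec vectors have 100 dimensions).
WORD_VECS = {
    "spicy": [0.9, 0.1], "tacos": [0.8, 0.3],
    "noodles": [0.1, 0.9], "dumplings": [0.2, 0.8],
}

def doc_vector(words):
    """Average the vectors of all known words in a document or query."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Each restaurant is represented by its concatenated, tokenized reviews.
restaurants = {
    "Taqueria Uno": ["spicy", "tacos"],
    "Noodle House": ["noodles", "dumplings"],
}

query = doc_vector(["tacos"])
ranked = sorted(restaurants,
                key=lambda name: cosine(query, doc_vector(restaurants[name])),
                reverse=True)
print(ranked)  # the taco place should rank first
```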
6.2 Collaborative Filtering
Collaborative filtering (CF) is commonly used for recommender systems. These techniques aim to predict user interests by collecting preferences or taste information from many users. In other words, CF fills in the missing entries of a user-item association matrix. The underlying assumption is that if person A agrees with person B on one issue, A is more likely to have B's opinion on another issue than that of a randomly chosen person.
Below is a great visual example from Wikipedia. In order to predict the unknown rating marked with ‘?’, we rely more on the opinions of other users with similar rating histories (green rows), thereby arriving at a negative rating prediction (thumbs down).
Mathematically, this is done by low-rank matrix factorization, posed as a minimization problem (see picture below). The often-sparse user-item rating matrix R is approximated as the product of a user matrix U and the transpose of an item matrix P (Pᵀ), both built from latent factors. We then form the cost function J and try to minimize it. In the spark.ml library, the alternating least squares (ALS) algorithm is implemented to learn these latent factors. Additionally, since we rely directly on the user ratings themselves, our approach is referred to as "explicit."
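To make the "alternating" part concrete, here is a minimal sketch with a single latent factor on a fully observed 2×2 toy matrix. With one factor the least-squares update for each user (and, symmetrically, each item) has a closed form; spark.ml's ALS does the same idea with many factors, regularization, and missing entries:

```python
# Minimal ALS sketch: one latent factor, fully observed toy matrix.
# Rows are users, columns are items; R happens to be exactly rank 1.
R = [[2.0, 4.0],
     [1.0, 2.0]]

n_users, n_items = len(R), len(R[0])
u = [1.0] * n_users   # user latent factors
v = [1.0] * n_items   # item latent factors

for _ in range(10):
    # Fix v, solve for each u_i by least squares: u_i = sum_j R_ij*v_j / sum_j v_j^2
    u = [sum(R[i][j] * v[j] for j in range(n_items)) / sum(x * x for x in v)
         for i in range(n_users)]
    # Fix u, solve for each v_j symmetrically
    v = [sum(R[i][j] * u[i] for i in range(n_users)) / sum(x * x for x in u)
         for j in range(n_items)]

# The predicted rating for user i on item j is u[i] * v[j]
print(round(u[0] * v[1], 4))  # recovers R[0][1] = 4.0
```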
For our project, we pre-train the model and save it on our Amazon S3 server. When the recommendation engine boots up, it will load the model from S3 and use it for prediction. This architecture is designed so that we can keep training multiple models offline as new data comes in. Once a new model is ready, the recommender engine will make the switch by editing one line of code.
6.3 Social Network
This is a digital version of the classic Word-of-Mouth recommender system -- what people have been using for thousands of years.
The Yelp dataset is unique in that it contains an embedded social network. In the user json file, each row describes one user in a dictionary format; for the “friends” key, the value stored is a list of encrypted user IDs. In order to quickly convert the unstructured social network data into a structured "node" and "edge" set (often required by graph theory related packages), we employed the Spark distributed computing ecosystem. This allowed us to finish the task within a minute, compared to hours in a single-machine Python environment.
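The transformation itself is simple; it is the data volume that calls for Spark. The single-machine sketch below shows the shape of the conversion, using made-up user IDs in the layout of Yelp's user file:

```python
import json

# Toy rows in the shape of Yelp's user file (IDs are made up).
user_rows = [
    '{"user_id": "u1", "friends": ["u2", "u3"]}',
    '{"user_id": "u2", "friends": ["u1"]}',
    '{"user_id": "u3", "friends": []}',
]

nodes, edges = set(), set()
for row in user_rows:
    record = json.loads(row)
    nodes.add(record["user_id"])
    for friend in record["friends"]:
        # Sort the pair so each undirected friendship is stored only once
        edges.add(tuple(sorted((record["user_id"], friend))))

print(sorted(nodes))  # ['u1', 'u2', 'u3']
print(sorted(edges))  # [('u1', 'u2'), ('u1', 'u3')]
```

In production the same per-row logic runs as a Spark map over the full user file, followed by a distinct.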
With the social network data now structured and easily searchable, a few SQL join-groupby-aggregate commands can quickly answer questions like: “What are the average ratings of the nearby restaurants based only on my friends’ opinions?” This interesting feature has the potential both to provide conversation-triggering recommendations and to improve user stickiness. As an extension of the algorithm, one can easily come up with other intuitive rating estimation schemes, such as expanding the network to second- and third-degree connections and applying a weighted average.
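The join-groupby-aggregate pattern looks like this. The sketch uses an in-memory sqlite3 database with simplified, hypothetical table and column names standing in for our RDS tables:

```python
import sqlite3

# Tiny in-memory stand-in for our RDS tables (schema simplified).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (user_id TEXT, friend_id TEXT);
    CREATE TABLE reviews (user_id TEXT, business_id TEXT, stars REAL);
    INSERT INTO friends VALUES ('me', 'u1'), ('me', 'u2');
    INSERT INTO reviews VALUES
        ('u1', 'taco_place', 5.0),
        ('u2', 'taco_place', 4.0),
        ('u9', 'taco_place', 1.0);  -- a stranger's rating, ignored below
""")

# Average restaurant ratings using only my friends' opinions
rows = conn.execute("""
    SELECT r.business_id, AVG(r.stars) AS friend_avg
    FROM friends f
    JOIN reviews r ON r.user_id = f.friend_id
    WHERE f.user_id = 'me'
    GROUP BY r.business_id
""").fetchall()
print(rows)  # [('taco_place', 4.5)]
```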
6.4 Location-Based Recommendations
Users who come to our webpage may want a quick suggestion of nearby restaurants based only on their location. Unlike the methods mentioned so far, this requires no additional information and returns recommendations fastest, improving the user experience.
While Yelp provides aggregated ratings for each business, these are not always indicative of a restaurant’s quality. For example, a restaurant with a single five-star rating would be ranked ahead of a restaurant with ten ratings averaging 4.9 stars. Another problem is that star ratings vary from person to person and are integer-based. Finally, do we want to take into account reviews that could be irrelevant due to their age, e.g. those written over ten years ago?
Our strategy for dealing with these problems is as follows:
- On the review level, apply a time weight and a sentiment weight to get an up-to-date, accurate representation of each review
- On the business level, modify the adjusted star rating with popularity features
How old is too old for a review to matter? As this dataset spans 2004 to the present, we needed to define a reasonable cutoff. Instead of a hard filter, we weighted each review by its age with a sigmoid function centered on 2012, so a new review receives a weight of 1.0 while a review from 2012 receives a weight of 0.5. Because the rate of review writing is steadily increasing, approximately 60% of all reviews are unaffected by the time weighting. The effectiveness of the filter is demonstrated in the figure shown below.
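The weighting function is a one-liner. The sigmoid's center year comes from the text above; the steepness value here is an assumption for illustration (our actual constant may differ):

```python
import math

def time_weight(review_year, center=2012.0, steepness=0.5):
    """Sigmoid age weight: near 1.0 for recent reviews, 0.5 at the center year.
    The steepness constant is illustrative, not the production value."""
    return 1.0 / (1.0 + math.exp(-steepness * (review_year - center)))

print(round(time_weight(2012), 2))  # 0.5 at the center year
print(round(time_weight(2017), 2))  # recent review, near 1.0
print(round(time_weight(2005), 2))  # old review, heavily discounted
```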
Based on the NLP sentiment analysis described above, each review was assigned a sentiment value between zero and one, indicating how positive it is. In the final rating scatterplot below, it is clear that the predicted sentiment and the star rating given by the user are strongly related.
On the business level, we need to address the popularity measures: review and check-in counts. A high review count indicates either a popular business or an exceptional one. Check-ins are a more direct measure of popularity, but since they are a more recent feature, even the most popular businesses have counts around 160. With this in mind, check-ins are assigned approximately twice the weight of reviews.
These distributions are extremely skewed, with most businesses having a limited number of reviews and check-ins. By taking a log transformation, we get a reasonable multiplicative factor that acknowledges popularity without overstating it.
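A minimal sketch of this weighting, with the doubled check-in weight from above and illustrative (not production) constants:

```python
import math

def popularity_factor(review_count, checkin_count):
    """Log-transformed popularity multiplier; check-ins get roughly twice
    the weight of reviews. The exact weighting here is illustrative."""
    return math.log1p(review_count + 2 * checkin_count)

# The log keeps a viral restaurant from dominating purely on volume:
quiet = popularity_factor(5, 2)     # small neighborhood spot
busy = popularity_factor(500, 160)  # heavily reviewed strip restaurant
print(round(quiet, 2), round(busy, 2))
```

Note how the busy restaurant's factor is only about three times the quiet one's, despite having two orders of magnitude more activity.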
With our final score metric fully defined, we can remap stars to percentiles (in Las Vegas). In the chart below, it is clear that our new stars need a different interpretation from what is usually assumed. In this system, four stars is considered one of the best restaurants in the area, and three stars is a good if not great restaurant. Though skewed, perhaps this final score distribution is more realistic than a uniformly or normally distributed score. In reality, there are only a handful of exceptional—and an enormous amount of average—restaurants.
Regardless of the model used, we don’t want to check every restaurant in our database if the user is requesting information about a specific area. The most accurate metric for this is haversine distance: the minimum distance between two points on the surface of a sphere. For reference, the defining equations are:
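The standard haversine formula can be implemented directly from those equations:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Sanity check: one degree of latitude is roughly 111 km.
print(round(haversine_km(36.0, -115.0, 37.0, -115.0), 1))  # ≈ 111.2
```

In practice we filter the restaurant table to rows whose haversine distance from the user falls under a radius threshold before any ranking happens.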
6.5 Recommendation Engine Summary
Putting together all of the individual parts described in section 6, our recommendation engine can react to whatever user input is passed in. In its fundamental location-only mode, it returns recommendations within half a second. Since the keyword and user-based models require additional computation, they take roughly 8 and 20 seconds, respectively.
7. Data Pipeline
See blog post part 2
8. Conclusion and Future Directions
Yelper Helper is a user-friendly interface powered by a robust and varied recommender and supported by an efficient data pipeline, making it a quick and easy way to find nearby restaurants a customer will love.
Having recommendations available for all levels of interaction with the app provides quick suggestions for the casual digital passerby and yet promotes more consistent user engagement with the benefit of more personalized results. The location-based recommendations aim to provide a quick and dirty service for passing users. The NLP neural networks of the content-based recommendations allow users to filter restaurants by cuisine or dish. Collaborative filtering and social network recommendations provide individualized recommendations based on personalized taste and friends’ opinions.
Building the machine learning models using Apache Spark and setting up a Flask-Kafka-RDS-Databricks pipeline creates a powerful and scalable system robust to working with big data and a continuous stream of user requests.
With more time, we would improve Yelper Helper with the following ideas.
- Scale the recommender to include all cities, restaurants, and businesses.
- Convert it to a mobile app to automatically input the user’s location for quicker recommendations
- Use A/B testing to design an optimal user-interface, possibly including pop-ups of helpful tips or friends’ reviews of recommended restaurants.
- Increase the speed of the engine
- Collect data on users’ final decisions and subsequent ratings to potentially provide local restaurants with insights on how to improve their business and attract customers.
- Build a Latent Dirichlet Allocation (LDA) model or Doc2Vec model to improve and vary the NLP recommendation method.
- Allow users to personalize their experience with favorite lists or the option of planning meals for a vacation.
In a short two-week period, we learned a great deal about working effectively as a team by using an agile approach to divide up the labor and meeting regularly to discuss progress and problems. We gained experience with SQL queries, Amazon Web Services, PySpark programming, and Kafka streaming. We implemented several machine learning techniques and built an entire data pipeline to create a useful and professional product for the modern consumer.
Guidance from Shu Yan, Yvonne Lau, Zeyu Zhang
Inspiration from Chuan Sun and Aiko Liu