Yelper: A Collaborative Filtering Based Recommendation System

Chuan Sun
Posted on September 25, 2016

1. Why we need recommendation?

Recommendation is a big topic. Netflix was willing to spend 1 million dollars to summon the best movie recommendation algorithm for their service. Facebook devised hundreds of variables just to carefully gauge the recommended feeds on the first page when you log in your page. Not to mention myriads of other applications in the market.

"Getting information off the internet is like taking a drink from a fire hydrant" (Mitchell Kapor). Information overload is a real phenomenon preventing us from making good decisions or taking actions. This is why recommendation systems are becoming common and extremely useful in products such as Netflix, Amazon Echo, and Facebook News Feed in recent year.

Besides, in the big data era, real-time recommendations may become a new norm because real-time recommendations:

  • Allows both customers and businesses to gain real-time insights
  • Enable rapid development
  • Perform real-time analytics

2 What is "Yelper"?

Based on the dataset provided by the "Yelp Challenge 2016", "Yelper" is a system that:

  • Performs preprocessing by dividing business data by cities to allow fine tuned and customized recommendations
  • Uses collaborative filtering based recommendation using Spark MLlib
  • Generates user-business graph visualizations using D3 and graph-tool library
  • Performs user-business graph analysis using Spark GraphX in Scala
  • Handles real-time user request handling simulation using Spark Streaming and Apache Kafka
  • Creates a Google Map view to recommend high rated businesses for users

The rationale for building Yelper is that:

Data scientists should not only know how machine learning works, but they also need to be prepared for the incoming new norm: the large scale streaming data mining and analytics in the big data era.

The GitHub page can be found here:

Slides: Presentation slides

3. Components and technical stack to build Yelper

There are 5 major components to build Yelper, as shown below. Here is the overall flowchart of what happens.


3.1 Data preprocessing

Yelper was built based on the Yelp Challenge 2016 Dataset, which includes:

  • 2.7M reviews and 649K tips by 687K users for 86K businesses
  • 566K business attributes, e.g., hours, parking availability, ambiance
  • Social network of 687K users for a total of 4.2M social edges
  • Aggregated check-ins over time for each of the 86K businesses
  • 200,000 pictures from the included businesses

Our system mainly relies on two data sets: the 687K users and the 86K businesses. We performed two preprocessing steps (Python source code):

  • Map the raw IDs in ASCII string format for both user data and business data into integer indices
  • Split the business data by nine major cities (see sample structure of splitting by cities): us_charlotte, us_lasvegas, us_madison, us_phoenix, us_pittsburgh, us_urbana_champaign, canada_montreal, germany_karlsruhe, uk_edinburgh

In Yelper, we deliberately created all business IDs as integers in range [0, 1M] (M means million), while all user IDs are in range [10M, +∞], because we have much more user IDs than business IDs. In this way, we can use the cutoff value 10M (=10000000) to check if an ID is associated to a user. To sum up, using integer number as ID (such as 3234) instead of a raw string (eg., "G8qH6TbfEhoYmS9KZM2Hfg") has several advantages:

  • Building the graph using Spark GraphX requires that all nodes and edges have an integer ID
  • Building the required nodes and edges for D3 Javascript visualization was easier to program and debug
  • Exported CSV or JSON files of either user data or business data became much smaller, leading to faster loading and manipulation
  • It was easy to differentiate the user id and business id just by a threshold. 

Splitting business data by cities has the following advantages:

  • Making each city have its own recommendation model allows customized recommendation
  • Model training becomes faster since city-wise data is smaller than the totality
  • City-wise model tuning becomes manageable
  • Scalability is possible. Remember there are thousands of cities all over the world
  • It does not make sense to build one single recommendation model to handle all world cities since, like human, each city has its own "gene"

3.2 Low-rank matrix factorization

We want to recommend the most highly-rated businesses to users. There are mainly two types of algorithms used for recommendation systems:

  • Collaborative filtering based algorithm, such as low-rank matrix factorization, SVD, etc
  • Content-based recommendations

Both algorithms have pros and cons. For simplicity and fast prototype, we explored the first one in Yelper. If we have more time, it is not difficult to integrate the second one into our system, which will be discussed in the final section.

Technically, we use Spark MLlib to train the ALS-based collaborative filtering models in Yelper (Python source code). Below are several brief steps:

  • Load the city-wise user-business-star tuple file F by Spark (sample tuple for Las Vegas)
  • Divide the F into three parts, 60% for training, 20% for testing, and 20% for validation using Spark MLlib
  • Setup several rank values (such as 4, 8, 12) as the tuning parameter, perform cross-validation to find the best rank that leads to minimal Residual Mean Squared Error (RMSE)
  • Generate the best model using the best rank
  • Export the best model to a directory under the corresponding city name (sample trained model for Las Vegas)

For the last step above, persisting the city-wise models to disk (or cloud) is a good choice, because we can compress, transfer, distribute, or cache those models.

3.3 Dynamic network visualization

We are particularly interested in analyzing the user-business interaction and gaining city-level insights. Why do we want to do this? Below are two assumptions:

  • Generally speaking, the more businesses a city has, the more prosperous a city will be
  • The more customers (residents, tourists, population, etc) a city has, the more businesses a city will need to have to sustain the needs

We build city-wise user-business network graphs in the following steps:

  • Extract connected components using Spark GraphX (Source code in Scala)
  • Generate dynamic user-business network graph using Python (code) to generate the Javascript JS file (js sample) for D3 visualization (HTML)

As an example, the dynamic user-business network graph of the city Madison (US) is shown below. Each node is either a user (in green color) or a business (in blue color). If a user u rated a business b, then there is an edge from u to b, namely, u --> b. For the city of Madison, there are in total more than 10K edges, and unfortunately, the library D3.js cannot handle and render such a large number of nodes and edges. Thus we randomly selected only a small number of edges to be rendered in Chrome.

What can we gain from this dynamic network? Well, just from a single network, we can gain the topological relationship between users and businesses. The density of edges (in red) reflects, to some extent, how the users rate all the businesses in a city. If we generated those dynamic graphs for all cities, we may find that each city's network is distinct. We can further distill in-depth information just from the networks themselves, such as in/out degree, clustering, page rank analysis, min cut, community discovery, etc.


3.4 Google Map API based frontend

We implemented a web server to allow the user to interact with Yelper to get recommendations. This web server was built using:

  • Spark
  • Flask
  • cherrypy
  • Python paste

Below are several features of the frontend:

  • Users can search any keywords they are interested in (not just restaurants) (Python code)
  • Recommended business will be shown in the Google Map interface (Javascript to interact with map)
  • User scan change the city and how many recommended businesses to return on the map
  • RESTful API is supported. User can change city/topk/keywords from the URL

As an example, we show the user interaction of Yelper through the image below. The user with ID "10081786" requested recommendations for the keywords "restaurants", "book store", "library", and "ice cream" in the city of Charlotte. The Yelper returned the "topK" recommended results and showed the businesses locations via Google Map API.


3.5 Simulation of real-time customer requests handling

Now, imagine that there are one thousand or even one million of customers who need to access Yelper to get recommendations in different cities. Before we show how Yelper handles this challenging scenario, let us first delve a little bit deeper into why this scenario is challenging.

  • More user requests means Yelper needs the ability of handle a large volume of requests from various sources (phone, iPad, laptop, etc)
  • Users' recommendation requests need to be handled promptly, and asynchronously
  • Yelper cannot lose the request from any users. Otherwise, customer satisfaction and loyalty will be compromised
  • An army of computing resources in the cloud need to be deployed to handle requests
  • Customers have different tastes and needs

Those are just a few factors out of many that drive the need for a strong recommendation infrastructure. A detailed enumeration is definitely out of the scope of this post.

One neat solution is to view all incoming request from users across many cities as a non-stopping stream, and use Apache Kafka as a message broker to redirect all requests to Kafka, then let the Spark Streaming handle anything that is piped from Kafka into Spark streaming in a fault-tolerant and scalable manner. The advantages are:

  • The streaming is fault-tolerant to any sudden uncertainties
  • Allows different types of messages being piped to piper
  • Modulization
  • Easy to handle

We thus simulate the real-time handling of a large amount of users' recommendation requests locally (Python source code), as shown in the figure below. On the left side of the terminal window, we run the Spark job to handle the requests in Spark Streaming at a frequency of 1.5 seconds. Note the text message showing the time stamp, such as "Time: 2016-09-25 14:37:22.500000". The Yelper logo is shown before each of the recommended results.

On the right side of the terminal, we run a python script to generate users' recommendation requests at a random time interval. Those requests were piped into Apache Kafka such that Spark Streaming can consume those requests.


(Click image above to see animated simulation)

4. Insights from user-business networks

We build static user-business network graph using a Python library called graph-tool (Python code) (sample network images). Now we show briefly what it looks like for the city-wise user-business network. Again, if a user rated a business, there will be an edge connecting the user node and the business node in the network.

In terms of visualization those edges, there is a dilemma. The more nodes and edges that are included in the graph, the more precise the graph will be, but the visualization will become a mess, since it is nearly impossible for many graph layout algorithms to elegantly visualize clearly the graph details when the edge number is above, say, 0.1 million. For this reason, we adopted two simple strategies: out of all the edges in a city, we either randomly selected 1% (or 2%) of all the edges, or sequentially selected the first 1% (or 2%) of all edges.

It is interesting to mention that, even if we randomly select only 2% of the entire edges, the generated networks still have the power to reveal many insights about a city if we dig deeper. Let us now briefly go over three cities in the US: Charlotte, Las Vegas, and Pittsburgh.

4.1 Charlotte, US

We randomly selected 3312 edges, which is only 1% of the total edges in this city.


4.2 Las Vegas, US

We randomly selected 11547 edges, which is only 1% of the total edges in this city.


What if we randomly select 23095 edges (2%)?


Instead of randomly, what if we sequentially select23095 edges out of all the edge list?


4.3 Pittsburgh, US

What about Pittsburgh? We randomly choose 2230 edges (2%) out of all edges lists.


4.4 What can we gain from the networks?

Again, similar to the dynamic network mentioned in the previous section, a lot of insights can be revealed from those static graphs if we delve deeper using graph-based algorithms and analytics techniques. For example, as we all already know, the economy of Las Vegas is higher than Pittsburgh. From the network, there are clearly more nodes with a high number of incoming edges in Las Vegas than in Pittsburgh.

If I have more time, below are several potential directions that I can think of to dig deeper into those networks:

  • Use the edges with high ratings (i.e., rating star > 4.5) to generate a network of high-quality businesses, comparing the differences amongst cities
  • Build the graph with low ratings (i.e., rating star < 2.5) businesses, see their distribution and figure out what led to low-ratings
  • Add time stamps into consideration. Build networks that evolve with time such that we know how a city evolves over time
  • Find influential businesses in cities via network community discovery algorithm
  • Build user-business networks by business category. i.e, networks of restaurants, networks of libraries, etc. This allows gaining insights by sectors

5. Future work

○ More graph analysis

  • Graph page rank analysis using Spark GraphX and Python graph-tool library
  • Community discovery (similar to Facebook social network)

○ Improve recommendation

  • Content-based recommendation
  • Clustering all businesses
  • Extract objects from business photos using Convolutional Neural Network

About Author

Chuan Sun

Chuan Sun

Chuan is interested in uncovering the relationship of things. He likes to seek order from chaos. Previously, he worked on a unannounced project in Amazon Seattle as a software engineer. The project is related to machine learning and...
Read more

Leave Responses

Your email address will not be published. Required fields are marked *

summoners war cheat January 13, 2018
hello there and thank you for your information – I have definitely picked up something new from right here. I did however expertise some technical points using this web site, since I experienced to reload the web site many times previous to I could get it to load properly. I had been wondering if your web host is OK? Not that I am complaining, but slow loading instances times will very frequently affect your placement in google and could damage your quality score if ads and marketing with Adwords. Anyway I am adding this RSS to my e-mail and could look out for much more of your respective intriguing content. Make sure you update this again soon.
Wholesale NHL Jerseys Online June 14, 2017
An advertising marketing campaign can churn out benefits only when the message used to persuade customers is ready to generate sales. He was awarded the Society for Marketing Advances Distinguished Scholar Award for 2002 and is one of the most frequently cited marketing professors worldwide, his Google Scholar citation rate exceeded 560 in 2011. If you opt for commercial investment properties, you should learn a few things in order to reap full benefits. Radio stations was fundamentally used for the military during World War I. However, you need to steer clear of any dropship scam to keep your business's reputation up.
cats and dogs June 4, 2017
{Interesting|Lovely!!!!! :D|Beautiful|Congrat for almost reachhing 2k subbies|You're very welcome!  :) Of course!!!!|+My EcoKids Club Thank you ! So nice of you to take the time to watch ourr article and comment >Pleasde keep watching and let us know if you would like a specific reading .|Thank you !!|Thank you!! We made it!! Yay|no dis likes wow YOU MUST BE A GREAT PERSON and this was posted last year|I love dogs|i like trainz|i love cats|i love cats|this reminds me of Yurio Plisetsky|I love cats|I can't have a cat my dad's allergic. :'(|In our place theirs a cat running around in our house named sunny you shoulkd get her or him.|i'm allergic to cats|im bunny and every anikal and plushie lover|i love cats or you|like meeeeeeee|I ccan relate so much from this|me too bloody i FREAKING LOVE CATS IM A CRAZY CAT GIRLLLL!!!!!!|yaaaasssss me I love kitty's lol|I love CATs too there so CUTE|Alexis: THIS IS HOW I FEEL ABOUT CATSSSSSSSSS (Ima srry life -.-)|and i luv u!!!!!!!|there's know a game and the ats sleep walk play and I ggot it foor free|* gives u a neko cat a real one not the ones u seee at China house *|Clicking subsscribe now.....|I cann be your cat :3|meow~ (=^・ω・^=)|LOL ME TOO I LUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUV CATS|my cat has 1 eye. itt ssaadd|me tooo I am cat lover|awwwwwwwwww, that's so cute!|Are you somehow related to Sebastian in black butler? XD|I love cats too :3|I OHMYGAWD I LOVE CATS SO MUCH I GOTTA MAKE THIS OMG!!!! (i used too have a cat....)|It's not your fault, i have two pet cats, the black one nammed Moonlight, the second one is named Ginger i love them so much. X3|Sqedeeeeee SAME|I love cats too |im a dog lover and cat lover DONT JUDGE ME!|GEEZZ!! TAKE MY FREAKING CAT!!!!!! PLZZZ!!!btw i lloved it!|I'm A CAT LOVER TOO!!!! |same im a cat lover|oh this so cut QuQ|you haven't made my request yet ;-;|OMG I LOVE CAT(ALL ANIMALS)|awwww this is just cute, and my grandmothernever wanted me to have a cat cuz I just keep on loosing them >^<3|Ma sono meravigliosiiii! *_*|:) sei gentilissima come sempre ^^|Grazie mille <3|thanks! ^.^|Facciamolo su questo canale per'ho C:|Se ti va facciamo un article in collaborazione? scegli tu il tema! Mandami un mex su posta personale sarebbe divertente.. P.s ti adooorooo <3|Grazie, si avevo già in mente qualcosa con i gufi spero riescano bene ^.^|che cariniiiiiiii *_* la feresti con dei gufi diversi tra loro? *_*|Sono felice che ti piacciano ^.^ <3|Grazie mille, allora proverò a fare un leoncino per il prossimo article ^.^|carinissimi! Brava! sono davvero dolcissimi... mmmm un... leone? :)|Visiti il mio canale??|Grazie sono felice ti piacciano ^.^|I love coat hangers too.....Naah no i don't|xD I lost it at "online dating"|HOORAY FOR HORRIBLE WEBCAM QUALITY! GO FRAWN! |Lol you did her voice perfectly|HAAHAHAHAHHAHAHAHAHAHAHA|holy f*&k this so funny ....loved it.|
Kurtiskix February 5, 2017
CFFHOOK лучший чит для варфейс. cheats for warface : скачать читы на warface Наш сайт: Читы варфейс