
Yelper: A Collaborative Filtering Based Recommendation System

Chuan Sun
Posted on Sep 25, 2016

1. Why do we need recommendation?

Recommendation is a big topic. Netflix offered a $1 million prize for the best movie recommendation algorithm for its service. Facebook devised hundreds of variables just to carefully rank the recommended feed on the first page when you log in. Not to mention the myriad other applications on the market.

"Getting information off the internet is like taking a drink from a fire hydrant" (Mitchell Kapor). Information overload is a real phenomenon preventing us from making good decisions or taking actions. This is why recommendation systems are becoming common and extremely useful in products such as Netflix, Amazon Echo, and Facebook News Feed in recent year.

Besides, in the big data era, real-time recommendations may become the new norm because they:

  • Allow both customers and businesses to gain real-time insights
  • Enable rapid development
  • Support real-time analytics

2. What is "Yelper"?

Based on the dataset provided by the "Yelp Challenge 2016", "Yelper" is a system that:

  • Preprocesses the data by dividing businesses by city, allowing fine-tuned and customized recommendations
  • Performs collaborative filtering based recommendation using Spark MLlib
  • Generates user-business graph visualizations using D3 and the graph-tool library
  • Performs user-business graph analysis using Spark GraphX in Scala
  • Simulates real-time user request handling using Spark Streaming and Apache Kafka
  • Creates a Google Map view to recommend highly rated businesses to users

The rationale for building Yelper is that:

Data scientists should not only know how machine learning works; they also need to be prepared for the coming new norm: large-scale streaming data mining and analytics in the big data era.

The GitHub page can be found here: https://github.com/sundeepblue/yelper_recommendation_system

Slides: Presentation slides

3. Components and technical stack to build Yelper

Yelper consists of five major components, as shown in the overall flowchart below.

[Figure: overall flowchart of the Yelper system]

3.1 Data preprocessing

Yelper was built based on the Yelp Challenge 2016 Dataset, which includes:

  • 2.7M reviews and 649K tips by 687K users for 86K businesses
  • 566K business attributes, e.g., hours, parking availability, ambiance
  • Social network of 687K users for a total of 4.2M social edges
  • Aggregated check-ins over time for each of the 86K businesses
  • 200,000 pictures from the included businesses

Our system mainly relies on two data sets: the 687K users and the 86K businesses. We performed two preprocessing steps (Python source code):

  • Map the raw IDs in ASCII string format for both user data and business data into integer indices
  • Split the business data by nine major cities (see sample structure of splitting by cities): us_charlotte, us_lasvegas, us_madison, us_phoenix, us_pittsburgh, us_urbana_champaign, canada_montreal, germany_karlsruhe, uk_edinburgh

In Yelper, we deliberately assigned all business IDs integers in the range [0, 1M] (M means million), while all user IDs are in the range [10M, +∞), because there are many more user IDs than business IDs. In this way, we can use the cutoff value 10M (= 10000000) to check whether an ID belongs to a user. To sum up, using an integer ID (such as 3234) instead of a raw string (e.g., "G8qH6TbfEhoYmS9KZM2Hfg") has several advantages (a sketch of this scheme follows the list below):

  • Building the graph with Spark GraphX requires that all nodes and edges have integer IDs
  • Building the nodes and edges needed for the D3 JavaScript visualization was easier to program and debug
  • Exported CSV or JSON files of user or business data became much smaller, leading to faster loading and manipulation
  • It was easy to differentiate a user ID from a business ID with a single threshold
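
A minimal sketch of this ID scheme (the actual preprocessing code is linked above; the file layout and field names here are assumptions for illustration):

    # Sketch of the raw-string-to-integer ID remapping described above (illustrative names).
    import json

    USER_ID_OFFSET = 10000000  # cutoff: IDs >= 10M are users, IDs < 10M are businesses

    def build_id_maps(business_file, user_file):
        """Map raw string IDs (e.g., 'G8qH6TbfEhoYmS9KZM2Hfg') to integer indices."""
        business_map, user_map = {}, {}
        with open(business_file) as f:
            for i, line in enumerate(f):
                business_map[json.loads(line)["business_id"]] = i           # businesses: [0, 1M)
        with open(user_file) as f:
            for i, line in enumerate(f):
                user_map[json.loads(line)["user_id"]] = USER_ID_OFFSET + i  # users: [10M, +inf)
        return business_map, user_map

    def is_user(node_id):
        """Tell users and businesses apart with a single threshold."""
        return node_id >= USER_ID_OFFSET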

Splitting business data by cities has the following advantages:

  • Giving each city its own recommendation model allows customized recommendations
  • Model training becomes faster, since city-wise data is smaller than the full dataset
  • City-wise model tuning becomes manageable
  • Scalability is possible; remember there are thousands of cities all over the world
  • It does not make sense to build one single recommendation model to handle all cities in the world, since, like a person, each city has its own "gene"

3.2 Low-rank matrix factorization

We want to recommend the most highly-rated businesses to users. There are mainly two types of algorithms used for recommendation systems:

  • Collaborative filtering based algorithms, such as low-rank matrix factorization, SVD, etc.
  • Content-based recommendation algorithms

Both approaches have pros and cons. For simplicity and fast prototyping, we explored the first one in Yelper. Given more time, it would not be difficult to integrate the second into our system, as discussed in the final section.

Technically, we use Spark MLlib to train the ALS-based collaborative filtering models in Yelper (Python source code). Below are several brief steps:

  • Load the city-wise user-business-star tuple file F with Spark (sample tuple for Las Vegas)
  • Divide F into three parts using Spark MLlib: 60% for training, 20% for testing, and 20% for validation
  • Set up several rank values (such as 4, 8, 12) as the tuning parameter and perform cross-validation to find the best rank, i.e., the one that leads to the minimal Root Mean Squared Error (RMSE)
  • Generate the best model using the best rank
  • Export the best model to a directory under the corresponding city name (sample trained model for Las Vegas)

For the last step above, persisting the city-wise models to disk (or the cloud) is a good choice, because we can then compress, transfer, distribute, or cache those models.
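
To make the steps above concrete, here is a minimal sketch of the per-city ALS training loop using the pyspark.mllib API (paths, parameter values, and variable names are illustrative, not the project's actual code):

    # Sketch of the city-wise ALS training described above (illustrative paths and names).
    import math
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="yelper_als_sketch")

    # Each line of the tuple file: user_id,business_id,stars
    ratings = (sc.textFile("data/us_lasvegas_user_business_star.csv")
                 .map(lambda line: line.split(","))
                 .map(lambda t: Rating(int(t[0]), int(t[1]), float(t[2]))))

    train, test, validation = ratings.randomSplit([0.6, 0.2, 0.2], seed=42)
    val_pairs = validation.map(lambda r: (r.user, r.product))

    best_rank, best_rmse, best_model = None, float("inf"), None
    for rank in [4, 8, 12]:
        model = ALS.train(train, rank, iterations=10, lambda_=0.1)
        preds = model.predictAll(val_pairs).map(lambda r: ((r.user, r.product), r.rating))
        truth = validation.map(lambda r: ((r.user, r.product), r.rating))
        rmse = math.sqrt(truth.join(preds).map(lambda x: (x[1][0] - x[1][1]) ** 2).mean())
        if rmse < best_rmse:
            best_rank, best_rmse, best_model = rank, rmse, model

    # Persist the best model under the corresponding city name
    best_model.save(sc, "models/us_lasvegas")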

3.3 Dynamic network visualization

We are particularly interested in analyzing user-business interactions and gaining city-level insights. Why do we want to do this? Below are two assumptions:

  • Generally speaking, the more businesses a city has, the more prosperous it will be
  • The more customers (residents, tourists, etc.) a city has, the more businesses it will need to sustain their needs

We build city-wise user-business network graphs in the following steps:

  • Extract connected components using Spark GraphX (source code in Scala)
  • Generate the dynamic user-business network graph using Python (code), which produces the JavaScript file (js sample) for the D3 visualization (HTML)

As an example, the dynamic user-business network graph for the city of Madison (US) is shown below. Each node is either a user (green) or a business (blue). If a user u rated a business b, then there is an edge from u to b, namely u --> b. For the city of Madison there are more than 10K edges in total, and unfortunately D3.js cannot handle and render such a large number of nodes and edges. Thus we randomly selected only a small number of edges to be rendered in Chrome.
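
A rough sketch of how such a sampled nodes/links file for D3 could be produced (the real generation code is linked above; names and the output format are assumptions):

    # Sketch of exporting a sampled user-business edge list for a D3 force-directed graph.
    import json
    import random

    USER_ID_OFFSET = 10000000  # same user/business cutoff as in the preprocessing step

    def export_d3_graph(edges, sample_size, out_path):
        """edges: list of (user_id, business_id); writes a D3-style nodes/links JS file."""
        sampled = random.sample(edges, min(sample_size, len(edges)))
        nodes = sorted({n for edge in sampled for n in edge})
        index = {n: i for i, n in enumerate(nodes)}
        graph = {
            "nodes": [{"id": n, "group": "user" if n >= USER_ID_OFFSET else "business"}
                      for n in nodes],
            "links": [{"source": index[u], "target": index[b]} for u, b in sampled],
        }
        with open(out_path, "w") as f:
            f.write("var graph = " + json.dumps(graph) + ";")  # loaded by the D3 HTML page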

What can we gain from this dynamic network? Even from a single network, we can see the topological relationship between users and businesses. The density of edges (in red) reflects, to some extent, how users rate the businesses in a city. If we generated these dynamic graphs for all cities, we would likely find that each city's network is distinct. We can further distill in-depth information from the networks themselves, such as in/out degree, clustering, PageRank analysis, min cut, community discovery, etc.

[Figure: dynamic user-business network graph for Madison, US]

3.4 Google Map API based frontend

We implemented a web server to allow users to interact with Yelper and get recommendations. This web server was built using:

  • Spark
  • Flask
  • CherryPy
  • Python Paste

Below are several features of the frontend:

  • Users can search for any keywords they are interested in (not just restaurants) (Python code)
  • Recommended businesses are shown in the Google Map interface (JavaScript to interact with the map)
  • Users can change the city and how many recommended businesses to return on the map
  • A RESTful API is supported: users can change the city, topK, and keywords from the URL, as sketched below
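
As a hedged sketch of what such a RESTful endpoint might look like (the route pattern and the recommend_businesses helper are hypothetical, not Yelper's actual API, which combines Flask, CherryPy, and Paste with Spark):

    # Hypothetical Flask route sketch for the RESTful API described above.
    from flask import Flask, jsonify

    app = Flask(__name__)

    def recommend_businesses(city, user_id, top_k, keywords):
        """Hypothetical stand-in: in Yelper this would query the city's trained ALS model."""
        return [{"city": city, "user": user_id, "keywords": keywords, "rank": i + 1}
                for i in range(top_k)]

    @app.route("/<city>/<int:user_id>/recommend/<int:top_k>/<keywords>")
    def recommend(city, user_id, top_k, keywords):
        results = recommend_businesses(city, user_id, top_k, keywords.split(","))
        return jsonify(results=results)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)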

As an example, the image below shows a user interaction with Yelper. The user with ID "10081786" requested recommendations for the keywords "restaurants", "book store", "library", and "ice cream" in the city of Charlotte. Yelper returned the topK recommended results and showed the business locations via the Google Maps API.

[Figure: Yelper frontend showing recommended businesses on Google Maps for user 10081786 in Charlotte]

3.5 Simulation of real-time customer requests handling

Now, imagine that one thousand or even one million customers need to access Yelper to get recommendations in different cities. Before we show how Yelper handles this challenging scenario, let us first delve a little deeper into why it is challenging.

  • More user requests mean Yelper needs the ability to handle a large volume of requests from various sources (phone, iPad, laptop, etc.)
  • Users' recommendation requests need to be handled promptly and asynchronously
  • Yelper cannot lose a request from any user; otherwise, customer satisfaction and loyalty will be compromised
  • An army of computing resources in the cloud needs to be deployed to handle the requests
  • Customers have different tastes and needs

Those are just a few of the many factors that drive the need for a strong recommendation infrastructure; a detailed enumeration is beyond the scope of this post.

One neat solution is to view all incoming requests from users across many cities as a never-ending stream, use Apache Kafka as a message broker to collect all requests, and then let Spark Streaming consume whatever is piped in from Kafka in a fault-tolerant and scalable manner. The advantages are:

  • The stream processing is fault-tolerant to sudden failures
  • Different types of messages can be piped through the same broker
  • The design is modular
  • The requests are easy to manage

We thus simulated the real-time handling of a large number of users' recommendation requests locally (Python source code), as shown in the figure below. On the left side of the terminal window, we run the Spark job that handles the requests in Spark Streaming with a 1.5-second batch interval. Note the text messages showing the timestamps, such as "Time: 2016-09-25 14:37:22.500000". The Yelper logo is shown before each of the recommended results.

On the right side of the terminal, we run a Python script that generates users' recommendation requests at random time intervals. Those requests are piped into Apache Kafka so that Spark Streaming can consume them.
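
A minimal sketch of the consuming side of this setup, assuming the Spark 1.x/2.x receiver-based KafkaUtils API and a topic named "yelper_requests" (both assumptions; the project's actual scripts are linked above):

    # Sketch of a Spark Streaming job consuming recommendation requests from Kafka.
    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # available in Spark 1.x/2.x

    sc = SparkContext(appName="yelper_streaming_sketch")
    ssc = StreamingContext(sc, 1.5)  # 1.5-second batch interval, as in the simulation above

    # Receiver-based stream: (ZooKeeper quorum, consumer group, {topic: partitions})
    stream = KafkaUtils.createStream(ssc, "localhost:2181", "yelper-group", {"yelper_requests": 1})
    requests = stream.map(lambda kv: json.loads(kv[1]))  # messages are JSON request payloads
    requests.pprint()  # in Yelper, each request would be answered with the city's ALS model

    ssc.start()
    ssc.awaitTermination()

On the producer side, a separate script could push JSON-encoded requests at random intervals with a Kafka client such as kafka-python (e.g., producer.send("yelper_requests", json.dumps(request).encode("utf-8"))), mimicking the request generator shown on the right side of the terminal.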

[Figure: animated simulation of real-time request handling; Spark Streaming consumer on the left, request generator on the right]

4. Insights from user-business networks

We also built static user-business network graphs using a Python library called graph-tool (Python code) (sample network images). Below we briefly show what the city-wise user-business networks look like. Again, if a user rated a business, there is an edge connecting the user node and the business node in the network.

In visualizing those edges, there is a dilemma. The more nodes and edges included in the graph, the more faithful the graph will be, but the visualization becomes a mess: it is nearly impossible for most graph layout algorithms to render the details clearly once the number of edges exceeds, say, 100,000. For this reason, we adopted two simple strategies: out of all the edges in a city, we either randomly selected 1% (or 2%) of the edges, or sequentially selected the first 1% (or 2%).
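
A minimal sketch of this sample-then-draw step with graph-tool (function and file names are illustrative; the project's actual plotting code is linked above):

    # Sketch of building and drawing a randomly sampled user-business graph with graph-tool.
    import random
    from graph_tool.all import Graph, graph_draw, sfdp_layout

    def draw_sampled_graph(edge_list, fraction, out_png):
        """edge_list: list of (user_id, business_id); draws a random sample of the edges."""
        sampled = random.sample(edge_list, int(len(edge_list) * fraction))
        g = Graph(directed=True)
        vmap = {}  # raw integer ID -> graph-tool vertex
        for u, b in sampled:
            for node in (u, b):
                if node not in vmap:
                    vmap[node] = g.add_vertex()
            g.add_edge(vmap[u], vmap[b])  # edge u --> b means "user u rated business b"
        graph_draw(g, pos=sfdp_layout(g), output=out_png, output_size=(1200, 1200))

    # e.g., draw_sampled_graph(las_vegas_edges, 0.02, "us_lasvegas_graph_random_2pct.png")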

Interestingly, even if we randomly select only 2% of all edges, the generated networks still reveal many insights about a city if we dig deeper. Let us now briefly go over three cities in the US: Charlotte, Las Vegas, and Pittsburgh.

4.1 Charlotte, US

We randomly selected 3312 edges, which is only 1% of the total edges in this city.

[Figure: Charlotte user-business network, 3312 randomly selected edges]

4.2 Las Vegas, US

We randomly selected 11547 edges, which is only 1% of the total edges in this city.

[Figure: Las Vegas user-business network, 11547 randomly selected edges]

What if we randomly select 23095 edges (2%)?

[Figure: Las Vegas user-business network, 23095 randomly selected edges]

Instead of selecting randomly, what if we sequentially select the first 23095 edges from the edge list?

[Figure: Las Vegas user-business network, first 23095 edges selected sequentially]

4.3 Pittsburgh, US

What about Pittsburgh? We randomly chose 2230 edges (2%) out of all the edges in this city.

[Figure: Pittsburgh user-business network, 2230 randomly selected edges]

4.4 What can we gain from the networks?

Again, similar to the dynamic network discussed in the previous section, a lot of insight can be revealed from these static graphs if we delve deeper using graph-based algorithms and analytics techniques. For example, as we already know, the economy of Las Vegas is larger than that of Pittsburgh, and indeed the Las Vegas network clearly has more nodes with a high number of incoming edges than the Pittsburgh network.

Given more time, below are several potential directions for digging deeper into these networks:

  • Use the edges with high ratings (i.e., rating > 4.5 stars) to generate a network of high-quality businesses, and compare the differences among cities
  • Build the graph of low-rated businesses (i.e., rating < 2.5 stars), see how they are distributed, and figure out what led to the low ratings
  • Take timestamps into consideration and build networks that evolve with time, so that we know how a city changes over time
  • Find influential businesses in cities via network community discovery algorithms
  • Build user-business networks by business category, i.e., networks of restaurants, networks of libraries, etc. This allows gaining insights by sector

5. Future work

○ More graph analysis

  • Graph PageRank analysis using Spark GraphX and the Python graph-tool library
  • Community discovery (similar to Facebook's social network)

○ Improve recommendation

  • Content-based recommendation
  • Clustering all businesses
  • Extract objects from business photos using a Convolutional Neural Network

About Author

Chuan Sun

Chuan is interested in uncovering the relationships between things. He likes to seek order from chaos. Previously, he worked as a software engineer on an unannounced project at Amazon in Seattle. The project is related to machine learning and...