Applying Machine Learning to Amazon Reviews

Yan Qi
Posted on Aug 8, 2019

Overview

As a long-time Amazon Prime member, I rely heavily on the product reviews for my Amazon purchases. I consider online reviews a very valuable source of information for consumers.

Typically, I would first look at the number of reviews, then the overall numeric rating and its distribution, and read through some text reviews to pick out the customersโ€™ likes and dislikes and key features of the product. When reading the review text, unconsciously, Iโ€™m also filtering for features that matter to me.

Although very useful, digesting and synthesizing qualitative comments this way is rather time consuming and subject to bias, depending on which reviews I decide to read. Wouldnโ€™t it be great if we could be presented with a concise and statistically sound summary that captures the essence of all the reviews?

The cognitive burden is aggravated even more when one tries to compare numerous similar products, each with a large number of reviews. As a good comparison shopper, I often find myself evaluating all the options against a set of common features, explicitly or implicitly. For example, when purchasing shoes, one may consider material, style, fit and price. Obviously, different people may place different importance on those features.

Reading reviews is a great way to tease out how well the product performs in different aspects. Can machine learning help us systematically, efficiently, and objectively digest the vast number of reviews so that we can match product features with our personal preferences?  

Goals

These are the problems that I set out to explore in my capstone project, where I used natural language processing techniques to extract meaningful features (which Iโ€™d also refer to as topics) common to products within a similar category, based on customer reviews. The output from this project would enable the following use cases:

  • A customer would be able to:
    • Browse top features learned from reviews across all products in a category and focus further evaluation on only products with features that he/she cares about
    • Look at the most salient features of an individual product and read a few representative reviews for each feature of interest
  • A manufacturer or seller would be able to treat top features extracted from reviews as real-world customer perception of the product. Comparing that against the intended positioning or against offerings from competitors, and potentially tracking the evolution of perception over time, can inform sales & marketing strategy, as well as product design and customer service.

Reviews summary

Currently, on each productโ€™s page, Amazon provides โ€œreview highlightsโ€ in the form of a list of phrases summarized from its reviews (Figure 1). Most likely, this is based on topic modeling of reviews for one product at a time, so each product gets a different list of unique topics. In my project, a common list of topics is learned from the reviews of all products in the category, so comparisons across similar products can more easily be made along those features.

For some products, a โ€œRating by featuresโ€ section is available (Figure 2). It is not clear whether customers were asked specifically to rate each feature, or whether statistical models were used to obtain the โ€œdecomposedโ€ ratings. Although scoring along individual features has not been implemented in this project, I see it as a natural next step in future development. An approach that unifies the overall numeric rating with sentiments on each feature based on the review text would be a very interesting direction to explore. The code for this project can be found in my GitHub repository.

Data selection & basic cleaning

For this project, I used the Amazon datasets that Dr. Julian McAuley shared on his labโ€™s website. The dataset contains Amazon product reviews and metadata spanning May 1996 - July 2014. The review data includes numeric ratings, text, helpfulness votes, time of review, etc., and the product metadata includes product category, brand, name, and description. I used the 5-core dataset, in which each user and each product has at least 5 reviews.

The Amazon datasets are classified into broad categories, such as โ€œclothing, shoes and jewelry,โ€ โ€œelectronics,โ€ โ€œbooks,โ€ etc. The metadata includes the hierarchical categories that a product belongs to. For the business use I envisioned, I wanted to focus on a subcategory that is fairly specific but still includes a large number of products and reviews, where users can benefit from machine learning.

Then I remembered the time when a friend was shopping for a Bluetooth headset on Amazon and complained that the number of choices was overwhelming and it was too laborious to read through so many reviews. With these considerations in mind, I picked โ€œBluetooth Headsets,โ€ a subcategory within โ€œCell Phones & Accessories,โ€ which includes 10,284 reviews for 422 products from 2004 to 2014.

From there, some simple cleaning steps were applied: removing HTML tags, dropping a few records with missing review text or product name, and filtering out products that are not truly Bluetooth headsets based on keywords in product names. Missing values in the original โ€œbrandโ€ column were filled with the brand from the โ€œtitleโ€ (product name) column.
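The cleaning steps above can be sketched as follows. This is a minimal illustration using hand-made records and field names (`reviewText`, `title`, `brand`) modeled on the dataset's schema; the keyword filter and brand-filling rule are simplified stand-ins for the actual cleaning logic.

```python
import re

# Hypothetical records mimicking the review/metadata fields described above.
records = [
    {"reviewText": "Great <a href='x.html'>headset</a>!",
     "title": "Jabra BT2045 Bluetooth Headset", "brand": ""},
    {"reviewText": None, "title": "Plantronics M55", "brand": "Plantronics"},  # missing review text
    {"reviewText": "Nice case", "title": "Leather Carrying Case", "brand": "Acme"},  # not a headset
]

TAG_RE = re.compile(r"<[^>]+>")

def clean_records(records):
    cleaned = []
    for r in records:
        if not r.get("reviewText") or not r.get("title"):
            continue  # drop records with missing review text or product name
        title = r["title"].lower()
        if "headset" not in title and "bluetooth" not in title:
            continue  # keep only products that are truly Bluetooth headsets
        r = dict(r)
        r["reviewText"] = TAG_RE.sub("", r["reviewText"])  # strip HTML tags
        if not r["brand"]:
            r["brand"] = r["title"].split()[0]  # fill missing brand from product name
        cleaned.append(r)
    return cleaned
```

Only the first record survives: it is a Bluetooth headset with review text, its HTML tags are stripped, and its missing brand is filled from the product name.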

Review text processing

The first step in natural language processing is always to transform text into a format the computer can understand, i.e., numbers. The transformation depends on the type of modeling intended. As I was planning to do topic modeling on the reviews, I treated each review as one document and the reviews for the 390 Bluetooth headsets as the collection. I then used the Python packages NLTK and spaCy, along with regex-based custom functions, to process each review following the steps in Figure 4.

There were quite a lot of HTML entity numbers and tags in the review text, such as โ€œ&quot;โ€ and โ€œ<a href='xyz.html'>โ€. These can cause problems in subsequent steps, when punctuation is removed and nonsensical words could form. Applying the HTML parser from the BeautifulSoup package got rid of the HTML entity numbers and tags.

Expanding contractions means mapping โ€œarenโ€™tโ€ to โ€œare not,โ€ โ€œitโ€™sโ€ to โ€œit is,โ€ etc., while removing punctuation means getting rid of symbols such as !, &, and %. Since our aim is to extract topics from text, we are only interested in words and phrases, so the symbols can be safely removed. Punctuation may be useful for other NLP tasks, however; for example, double exclamation marks could signal more intense emotion in sentiment analysis.

Next, each review is split into individual words (tokens) and converted to lowercase. To remove stopwords, it is often necessary to extend NLTKโ€™s list of English stopwords with frequently occurring but trivial words specific to the dataset at hand. In this case, โ€œheadset,โ€ โ€œBluetooth,โ€ โ€œheadphone,โ€ and โ€œheadphonesโ€ appear in almost every review but have little value as part of a topic, so those words were added to the stopword list and removed.
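A minimal sketch of this processing pipeline is below. To keep it self-contained, it uses a tiny hand-made contraction map and a small stand-in stopword set rather than NLTK's full list; the dataset-specific stopwords are the ones named above.

```python
import re

# Small sample contraction map (the real map would be much larger).
CONTRACTIONS = {"aren't": "are not", "it's": "it is", "don't": "do not"}

# Stand-in for NLTK's English stopword list, extended with the
# dataset-specific words named in the text.
BASE_STOPWORDS = {"the", "a", "is", "are", "and", "not", "this", "it", "to"}
EXTRA_STOPWORDS = {"headset", "bluetooth", "headphone", "headphones"}
STOPWORDS = BASE_STOPWORDS | EXTRA_STOPWORDS

def preprocess(review):
    text = review.lower()                             # lowercase
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)    # expand contractions
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation
    tokens = text.split()                             # tokenize
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

tokens = preprocess("It's a great Bluetooth headset!")
```

After expansion, punctuation removal, tokenization, and stopword filtering, only the content word "great" survives from this short example.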

Lemmatization

Lemmatization refers to the process of converting various forms of a word to its canonical form, also known as the lemma. For example, the lemma of โ€œteaches, teaching, taughtโ€ is โ€œteach,โ€ the lemma of โ€œmiceโ€ is โ€œmouse,โ€ and the lemma of โ€œbestโ€ is โ€œgoodโ€ (with the spaCy lemmatizer). I chose the spaCy lemmatizer over others because a wordโ€™s lemma depends on its POS (part of speech), and the spaCy lemmatizer includes a step to determine the POS before assigning the corresponding lemma.

As lemmatization produced some additional stopwords, another round of stopword removal was applied. Numbers were also removed, as they were not likely to be part of an interesting topic. Apart from the more standard text processing above, some customized cleaning was done to handle special โ€œnoiseโ€ in this dataset, such as the missing space in โ€œThis is great.Recommend to anyoneโ€ and patterns like โ€œwordAโ€ฆwordBโ€ฆwordCโ€ and โ€œwordA/wordBโ€.
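The POS-dependence of lemmatization can be illustrated with a toy lookup table. This is not how spaCy works internally (spaCy combines lookup tables with rule-based suffix handling after POS tagging); it is only a sketch of why the same surface form can map to different lemmas.

```python
# Tiny hand-built lemma table keyed by (word, POS) -- a stand-in for
# spaCy's POS-aware lemmatizer used in the project.
LEMMAS = {
    ("teaches", "VERB"): "teach",
    ("teaching", "VERB"): "teach",
    ("taught", "VERB"): "teach",
    ("mice", "NOUN"): "mouse",
    ("best", "ADJ"): "good",
    # The same surface form lemmatizes differently by POS:
    ("left", "VERB"): "leave",   # "I left the store"
    ("left", "ADJ"): "left",     # "the left earbud"
}

def lemmatize(word, pos):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get((word, pos), word)
```

For example, "left" as a verb becomes "leave," while "left" as an adjective stays "left," which is why determining POS first matters.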

Figure 5 shows the top 25 most frequent words in all the review text before and after the text processing steps. The most frequent words in the cleaned text provide far more insight about Bluetooth headsets than those from the original reviews.

Topic Modeling Reviews with LDA

The purpose of topic modeling is to extract โ€œtopicsโ€ from a collection of documents. Applications include document classification and summarization. On our Bluetooth headsets review data, the goal is to discover a list of meaningful, non-overlapping, exhaustive โ€œtopicsโ€ that best reflect the product features and aspects that customers commented on.

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for a collection of documents (a corpus). Each document is generated from a random mixture of latent topics, and each word in a document is generated from multinomial distributions conditioned on those topics. The goal of LDA topic modeling is then to infer the unobserved โ€œtopicsโ€ and the associated probability distributions from the observed data.

In the model output, each topic is represented by a weighted list of words, and each document is assigned a weighted list of topics, where the weights represent multinomial probabilities. For each topic, the weights of all its words sum to 1; for each document, the weights of all topics sum to 1. This probabilistic model makes sense for review data, where each review may address several topics with different amounts of emphasis, and different reviewers may talk about the same topic using slightly different wording.

Gensim

In my implementation, I used Gensim (a Python package) for preparing the corpus and Mallet (a Java package) with Gensimโ€™s wrapper for LDA modeling. I experimented with both Gensim and Mallet for LDA and ultimately decided to go with Mallet for this dataset. Gensimโ€™s LDA has many nice features, such as online learning, a multicore mode, and a constant memory requirement, which make it well suited for large-scale applications.

However, as some articles have pointed out, Mallet often outperforms Gensim in the quality of the topics learned. Indeed, in my own experiments on two Amazon review datasets (Bluetooth headsets and laptops), Mallet LDA with default settings gave results better than or comparable to Gensim LDA with optimized parameters. Here, goodness of fit was assessed by the topic coherence score in Gensim, a mutual-information-based quantitative measure of how coherent a list of words is for describing a common topic.

Mallet LDA produced topics with higher coherence scores than those from Gensim LDA, and they were also more interpretable upon human inspection. This better performance is likely due to the fact that Mallet uses an optimized version of collapsed Gibbs Sampling, which is more precise than the variational Bayesian optimization used by Gensim.

We can use the topic coherence score as the metric for optimizing the number of topics in an LDA model. On the Bluetooth headset data, we obtained the best model with 20 topics.
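To make the coherence idea concrete, below is a simplified UMass-style coherence computation in plain Python. The project used Gensim's coherence implementation; this sketch only illustrates the underlying intuition: words that co-occur in many documents score higher than words that rarely appear together. The documents and topic word lists are made up for illustration.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence: sum over word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents
    containing the given word(s). Higher = more coherent."""
    doc_sets = [set(d) for d in docs]
    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for w_j, w_i in combinations(topic_words, 2):
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Toy tokenized "reviews".
docs = [["sound", "quality", "clear"],
        ["sound", "clear", "bass"],
        ["battery", "life", "charge"]]

coherent = umass_coherence(["sound", "clear"], docs)      # co-occur twice
incoherent = umass_coherence(["sound", "battery"], docs)  # never co-occur
```

"sound" and "clear" appear together in two documents and score positively, while "sound" and "battery" never co-occur and score negatively. Sweeping the number of topics and picking the model with the best coherence is the selection procedure described above.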

Reviews interpretation and labeling

The topics returned by LDA are lists of words. Some human inspection is still needed to turn those lists into more interpretable โ€œtopics.โ€ When the quality of the topic model is good, the word list itself can be very telling. Can you guess from the lists in Figure 6 what those topics are?

To be more precise and get a bit more color, I reviewed the topic word lists, as well as the reviews with the highest probabilities of being generated from each topic, in order to label each topic with a short name and a detailed description (Table 1).

Out of the 20 topics, I could easily interpret 17. Of these 17 topics, two identify the brands Plantronics and Jabra (blue rows in Table 1). The โ€œPlantronicsโ€ topic mostly captured high praise from loyalists who have owned many products from the brand.

The remaining 15 topics cover diverse aspects highly relevant to Bluetooth headsets. Ten topics are about individual product features (green rows in Table 1), such as the quality of voice command functionality, sound quality, and noise cancellation capability. The model also correctly separated fit of the headset on the head vs. fit in/on the ear into two topics. The other 5 topics (yellow rows in Table 1) did not refer to just one product feature but to more holistic assessments.

Topic 4

For example, topic 4 is about โ€œmiscellaneous issues and problems,โ€ and the corresponding reviews often contained language such as โ€œissuesโ€ and โ€œproblem.โ€ Topic 17 was not about the product itself but the overall customer service experience, especially shipping speed and the ease of exchanges and returns, through the manufacturer or Amazon. Topic 18 highlights product usage in an active setting, such as during running or gym workouts. Topic 8 captures an expression of high overall satisfaction.

Those 15 topics thus succinctly summarize the aspects that customers consider important when evaluating Bluetooth headsets, based on their reviews of all products in this category.

Drawing insights from topic distribution in reviews and products

Now that we have a set of well-labeled high-quality topics learned from the data, there is a lot we can do to derive insights on the products. We start by looking at what topics are present in each review, then calculate the extent to which each topic is present in a productโ€™s review.

From the trained LDA model, we can obtain a matrix of document-topic probabilities X, where each row corresponds to a document and each column corresponds to a topic. Element X_ik represents the probability that review i is generated from topic k. Each row contains the probabilities that a particular review is generated from each of the 20 topics, and those probabilities sum to 1. Each column contains the probabilities of every review being generated from a particular topic.

It turns out that the probabilities in all 20 columns follow a similar distribution. For each topic, more than 80% of all reviews have a probability of less than 0.1. Based on these observations and inspection of review text across a range of probabilities, we arrived at a cutoff of 0.09. With that, we can turn the probability matrix X into an indicator matrix Y, where Y_ik is 1 only if review i contains topic k.
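Thresholding X into Y is a one-liner; the sketch below uses two hypothetical document-topic rows (with fewer than 20 topics, for brevity) and the 0.09 cutoff from the text.

```python
CUTOFF = 0.09  # chosen from the observed probability distributions

# Hypothetical document-topic probability rows (each row sums to 1).
X = [
    [0.70, 0.12, 0.05, 0.13],   # review 0: topics 0, 1, and 3 pass the cutoff
    [0.02, 0.03, 0.92, 0.03],   # review 1: only topic 2 passes
]

# Indicator matrix: Y[i][k] = 1 iff review i "contains" topic k.
Y = [[1 if p > CUTOFF else 0 for p in row] for row in X]
```

The first review is marked as containing three topics, the second only one, mirroring how short, focused reviews concentrate their probability mass on few topics.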

So how many topics does a review contain according to our LDA model? Well, that depends on the length of the review. Most reviews are short. More than 50% of all reviews have fewer than 40 tokens each. There are also, however, some extremely detailed reviews. About 7% of reviews are more than 200 tokens long. Some techie users are very serious about their Bluetooth headsets and create head-to-toe meticulous reviews.

Figure 8

Figure 8 shows that our model does not capture any topic when the review is extremely short and simply does not contain enough occurrences of the keywords that comprise a topic. Longer reviews do tend to contain more topics, with a maximum of 5 on this dataset. But longer reviews are not always better. Since for each review, the probabilities for all topics are forced to add to 1, topics have to compete with each other to stand out.

Only the few strongest ones win. This explains why only 1 topic was detected in each of the two longest reviews. Reading through these two reviews, one finds that they simply ramble on and on, seeming to talk about everything while nothing really stands out. Even though the reviews definitely touched upon several topics, those topics were inevitably buried in too much noise.

Conversely, the reviews with 5 topics are on average 200-300 words long, with focused discussion on each topic. To the human eye, those are โ€œwell writtenโ€ reviews, conveying clear messages, providing just the right amount of details, but not so much to induce cognitive fatigue.

The presence/absence of topics in reviews can be aggregated at the product level to create a matrix Z, where rows correspond to products, columns correspond to topics, and element Z_jk represents the proportion of reviews for product j that contain topic k. This โ€œproduct-topic proportionโ€ matrix Z allows us to ask many interesting business questions. We demonstrate a few below.
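The aggregation from review-level indicators Y to product-level proportions Z can be sketched as follows, using hypothetical product names and indicator rows.

```python
from collections import defaultdict

# Hypothetical indicator rows from Y, keyed by the product each review belongs to.
reviews = [
    ("headset_A", [1, 0, 1]),
    ("headset_A", [1, 1, 0]),
    ("headset_B", [0, 1, 0]),
]

def product_topic_proportions(reviews):
    """For each product, the fraction of its reviews containing each topic."""
    counts = {}
    totals = defaultdict(int)
    for product, y in reviews:
        if product not in counts:
            counts[product] = [0] * len(y)
        counts[product] = [c + v for c, v in zip(counts[product], y)]
        totals[product] += 1
    return {p: [c / totals[p] for c in counts[p]] for p in counts}

Z = product_topic_proportions(reviews)
```

Both of headset_A's reviews mention topic 0, so its proportion for that topic is 1.0, while topics 1 and 2 each appear in half of its reviews.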

Case 1 - Topics that predict numeric rating

Letโ€™s start with a simple topic, โ€œgreat_overallโ€. Typically, reviews with this topic express a high level of overall satisfaction, with strong positive phrases like โ€œawesomeโ€ or โ€œhighly recommendโ€. The scatterplot in Figure 10 shows that products with a higher percentage of reviews containing the topic โ€œgreat_overallโ€ tend to have a higher overall numeric rating.

The 6 products with more than 10% of reviews containing โ€œgreat_overallโ€ are among the top 10 most highly rated Bluetooth headsets. This feature can be a good predictor of a productโ€™s overall high rating. Similarly, the percentage of reviews mentioning the topic โ€œmisc_issues_problemsโ€ correlates negatively with a productโ€™s numeric rating.

Case 2 โ€“ Headsets most suitable for an active lifestyle

As another example, a customer might be looking for a Bluetooth headset to use while running or working out in the gym. We would look at all products with a minimum number of reviews (say, 50) and check the top 10 with the highest product-topic proportions for the topic โ€œactive_lifestyleโ€. Two of the top products, from Jabra and Jaybird, indeed have SPORT in their names and are designed for this setting.

Looking at the overall rating, the two LG products are the only ones with more than 4 stars. They also have larger numbers of reviews than the other 8 products. The customer may decide to consider only the headsets in this list with a rating above 3.7 and evaluate further based on price and other features.
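This kind of query is a filter-and-sort over the product-topic proportion matrix. The sketch below uses invented product names, review counts, topic proportions, and ratings purely for illustration; only the min-review threshold of 50 comes from the text.

```python
# Hypothetical product stats:
# (name, n_reviews, proportion of reviews with "active_lifestyle", avg rating)
products = [
    ("Jabra SPORT",    120, 0.35, 3.9),
    ("Jaybird SPORT",   80, 0.30, 3.6),
    ("LG Tone",        400, 0.22, 4.2),
    ("Niche Clip-on",   12, 0.50, 4.8),   # excluded: too few reviews
]

MIN_REVIEWS = 50

def top_for_topic(products, k=3):
    """Rank products by topic proportion, requiring a minimum review count."""
    eligible = [p for p in products if p[1] >= MIN_REVIEWS]
    ranked = sorted(eligible, key=lambda p: p[2], reverse=True)
    return [p[0] for p in ranked[:k]]
```

The minimum-review filter matters: without it, a niche product with a handful of reviews and one enthusiastic mention would dominate the ranking.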

Case 3 โ€“ Perception of most reviewed products

In the last case, we will look at how the most reviewed Bluetooth headsets are perceived by users, as reflected by the topics most strongly associated with those products. LG and Plantronics are two renowned brands in the Bluetooth headset space. Among the 5 most reviewed products, 2 are from LG and 2 are from Plantronics. If we take our product-topic proportion matrix Z and use a cutoff of 0.08, we get 4 or 5 top mentioned topics for each product (Figure 11).

The topics for products from the same brand are fairly concordant, while clear differences appear between brands. Customers of LG often talked about use in an active setting, phone connection capability, and audio performance, whereas Plantronics headphones seem to stand out for their noise cancellation and voice command functionality.

Conclusions and future development

The business cases explored here demonstrate how topic modeling on customer review data is useful for both customers and e-commerce companies. Other potential uses include product segmentation, rating prediction, and predicting topics for unseen reviews. A really neat expansion of this work would be to calculate customer satisfaction along each of the identified topics (features) via sentiment analysis.

This would allow us to see not only which features are most talked about for each product but also whether customers viewed that aspect of the product negatively or positively. Given more time, it would also be valuable to visualize the results through a web app powered by Flask or R Shiny, where users could explore the different business cases to mine insights.

About Author

Yan Qi

Yan is an experienced business analytics professional with well-balanced skills in quantitative analysis, strategic thinking, and communications. She received a Ph.D. in biomedical engineering from the Johns Hopkins University. During graduate studies, Yan innovated statistical inference and machine...
