NYC Data Science Academy| Blog

Data Gathering Your Next Trip in US

Ali Rishty and Matthew Boccio
Posted on Jul 13, 2020
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

GitHub Repository

To top off our intensive three-month-long experience at NYC Data Science Academy, we chose to assist a US travel recommendation start-up by building intelligence and automation into its operations to save hours of manual work. The company targets prospective travelers living in the US who know approximately when they’d like to take a trip within the country but want recommendations for exact locations and itineraries, rather than already having a destination in mind. Like Netflix, the company’s website is designed to get to know its users so that it can give them curated recommendations for their vacations.

Upon navigating to the website, users fill out a short quiz, including prompts that ask them to select which images inspire them and what kinds of activities they enjoy when traveling. They also indicate approximately how soon they’d like to travel (e.g. “3-6 months from now”). After a user completes the quiz, the company sends an email containing 3 separate trip ideas, each based around events happening during the specified time window.

Objective

For example, if it is evident that a certain user’s preferred trip includes live shows, their itineraries will include events tagged with the category “Arts & Entertainment”. Since the trip recommendations are centered around linking user profiles to recurring events that happen around the country, categorizing these events accurately and efficiently is key. After speaking to the start-up founder, it was clear that we’d have a few interesting tasks to explore in our project:

  1. Automate the process of categorizing new events added to the company database. 
  2. Find a new event categorization scheme intelligently.
  3. Automate the end-to-end process of taking in user inputs and outputting appropriate event recommendations from the database.

Data

To tackle task #1, we would keep the current labelling scheme and predict its labels with supervised classification models. Task #2, on the other hand, would involve reconstructing the categories and perhaps finding more specific ones, using unsupervised clustering techniques to find structure in the event data from scratch.

After finalizing event tags, end-to-end automation could be simulated rather simply by selecting the events in the database that match the specified inputs. We decided to focus primarily on the first two tasks, since full automation would require more integration with the rest of the platform, including potential changes to the website’s front-end.

We were given access to the company’s compiled event database, which includes details for about 2000 events around the US, including location data, the event’s recurring annual timeframe, event title, event description, and more. It also includes an event category column which the company manually filled in with one of four categories for each event: Arts & Entertainment, Active, Tastebuds, and Learn.

Since we started off with the manual entries of categories for these 2000 events, our first supervised task was straightforward: to train a model on a portion of the event data and test it on the rest in order to predict these categories. With an accurate enough supervised model, new events would be categorized according to this scheme without much manual intervention. 

Preprocessing Data

In order to properly analyze the event descriptions using natural language processing techniques, we started with classic text preprocessing methods.

We first decided to concatenate the event titles with their corresponding descriptions so that the model would treat these as one combined feature when classifying events. We then cleaned the newly generated description texts to remove punctuation and special characters. Our next task was to remove stopwords, the common filler words that wouldn’t tell us anything about what each event was about.

In addition to filtering out the general list of English stopwords that comes with Python’s Natural Language Toolkit (NLTK), we also removed certain words that were present in nearly every event but didn’t add any detail to the event description. These were words that just described the presence of an event (“event”, “festival”, “celebration”) and time-specific words referring to whether an event took place during a specific month or week (although we kept words that referred to specific seasons).

After this, we chose to lemmatize the remaining words within each event description, so that inflected forms of a word were reduced to the same base form rather than being counted as different ideas. For example, “Drink”, “Drinks”, and “Drank” would all be reduced to “Drink”. This made us more confident that the frequency of a specific word, as it relates to the theme of an event, would be represented accurately.
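The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the stopword set, the list of event words, and the lemma table are tiny hypothetical stand-ins for NLTK's stopword list and a real lemmatizer, and `preprocess` is a name chosen for this example.

```python
import re

# Tiny stand-in lists (assumptions for illustration only). The project used
# NLTK's English stopword list and a real lemmatizer instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "at", "with", "for", "to"}
EVENT_WORDS = {"event", "festival", "celebration"}  # appear in nearly every event
LEMMAS = {"drinks": "drink", "drank": "drink", "bands": "band"}  # toy lemma table


def preprocess(title: str, description: str) -> list[str]:
    """Combine title and description, clean the text, drop stopwords, lemmatize."""
    text = f"{title} {description}".lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace punctuation, digits, symbols
    tokens = [tok for tok in text.split() if tok not in STOPWORDS | EVENT_WORDS]
    return [LEMMAS.get(tok, tok) for tok in tokens]  # map each word to its base form


tokens = preprocess("Summer Wine Festival", "Live bands, and drinks for all!")
print(tokens)
```

In this toy run, "festival" is dropped as an event word, "and" and "for" are dropped as stopwords, and "bands" and "drinks" are reduced to their base forms.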

Exploratory Data Analysis

Once our text was processed, it was time to start analyzing it. We decided to begin by creating a simple word cloud in order to visualize the most frequently appearing words within all the events. 

Word Cloud

Our word cloud convinced us to look at word frequency a bit more carefully, since “music”, “art”, and “food” largely dominated the frequency tallies. By combining all the event descriptions into one block of text and converting it into a dictionary with words as keys and their frequencies as values, we were able to look at word frequencies as proportions of the total word count. Sure enough, the top 5 words in all event descriptions (“music”, “food”, “art”, “live”, “craft”) made up 35% of all the words left in our event descriptions after preprocessing.

The top 10 words made up 54% of all the words. Additionally, the spread of the top words in each event was important to analyze. We realized that while events that were music-specific had the word “music” in them, there were also events that mentioned the word music without being predominantly focused on music.

For example, an art gallery event may include a live performance by a local string quartet. We also found that the top 5 most frequent words each occurred in more than 20% of events, sometimes as high as 30% of events. To avoid concept dilution, we decided to look at the number of times each of these top five words occurred within each event, relative to all events. This is what TF*IDF vectorization does, which we explored in the next step.
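The two measures discussed in this section, a word's share of the total word count and the share of events that mention it, can be sketched with the standard library. The `events` list below is made-up toy data, not the company's database.

```python
from collections import Counter

# Hypothetical preprocessed events (lists of tokens); not the project's data.
events = [
    ["music", "food", "craft"],
    ["music", "art", "gallery"],
    ["food", "wine", "chef"],
    ["run", "trail", "music"],
]

# Share of the total word count taken up by each word.
word_counts = Counter(tok for ev in events for tok in ev)
total_words = sum(word_counts.values())
proportions = {word: n / total_words for word, n in word_counts.items()}

# Share of events that mention each word at least once.
doc_freq = Counter(tok for ev in events for tok in set(ev))
doc_share = {word: n / len(events) for word, n in doc_freq.items()}

top2 = [word for word, _ in word_counts.most_common(2)]
print(top2, proportions["music"], doc_share["music"])
```

Here "music" accounts for a quarter of all words but appears in three quarters of the events, which is the kind of gap that motivates weighting words by how widespread they are.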

Automation of Event Labeling

With our preprocessed data ready, we were ready to tackle our first main task: automating the categorization of events using the current classification scheme. To do so, we explored two supervised learning classification methods: Naive Bayes and Doc2Vec.

Naive Bayes

In order to apply any predictive models to our data, we first needed to convert it from text to a format a machine can work with: numerical vectors. This process is called vectorization. There are several flavors of vectorization, but we tried two: plain count vectorization (also known as the “bag of words” method), and TF*IDF vectorization.

Using a count vectorizer with our event data, we generated a word frequency matrix, where each row represents one event description, and each column represents one of the words present in the entire event corpus. The values of the matrix represent the count of a particular word (column value) in an event description (row value).

TF*IDF vectorization works almost the same way, except that it down-weights words that appear across many documents in order to mitigate their effect on the model. TF, the term frequency of a word in an event, is multiplied by IDF, the inverse document frequency of that word, which shrinks the weight more the larger the share of documents that contain the word.
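To make the weighting concrete, here is a small hand-rolled sketch of the textbook formulation, where the weight is the within-document term frequency times log(N / document frequency). It is an illustration under that assumption only: vectorizer libraries typically apply smoothed or normalized variants, so their numbers will differ, and the function name `tfidf` and the toy documents are invented for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


docs = [["music", "jazz"], ["music", "wine"], ["music", "trail"]]
weights = tfidf(docs)
print(weights[0])
```

Because "music" occurs in every toy document, its inverse document frequency is log(1) = 0 and it contributes nothing, while "jazz", which occurs in only one document, keeps a positive weight.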

Models

We built two separate Naive Bayes models using each of these vectorization techniques in order to compare their performances before choosing a final one; we hypothesized in advance that TF*IDF vectorization would do better with our data because of its mitigation of our highly present words.

With our vectorized matrices ready, we started building our first predictive model: a Naive Bayes classifier. A Naive Bayes classifier uses conditional probabilities and Bayes’ Theorem to learn from previously labeled examples and then calculate the probability that a new data point belongs to a certain class.

For example, after calculating the probability that the word “run” appears in an “Active” event, the model calculates the reverse probability that a new event description containing the word “run” should belong to the “Active” category, by combining the conditional probabilities of each word in the description.

"Naive"

This method is called “naive” because it assumes that its tokens are independent of one another given the class and treats them as a simple bag of words with no grammar rules. However, for our purposes of figuring out the topic of an event based on the appearance of words rather than wording styles, this classifier is especially appropriate. Furthermore, we chose to implement a multinomial version of the classifier, which means the model looks at the word count in each event description rather than just whether a word was present at all (the Bernoulli variant), since a higher frequency of a word might imply stronger relevance to the event.
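The idea can be illustrated with a compact, from-scratch multinomial Naive Bayes. This is a sketch under stated assumptions, not the model the project trained: it uses raw word counts with add-one (Laplace) smoothing, and the function names `train_nb` and `predict_proba` as well as the toy events are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)  # counts, not just presence (multinomial)
    model = {}
    for label, n_docs in class_sizes.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {w: math.log((word_counts[label][w] + alpha) / denom)
                        for w in vocab},
        }
    return model


def predict_proba(model, doc):
    """Combine per-word log-likelihoods per class, then normalize to probabilities."""
    scores = {
        label: p["log_prior"] + sum(p["log_lik"][t] for t in doc if t in p["log_lik"])
        for label, p in model.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    expd = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(expd.values())
    return {label: v / norm for label, v in expd.items()}


docs = [["run", "trail", "run"], ["hike", "run"],
        ["wine", "chef"], ["wine", "tasting", "food"]]
labels = ["Active", "Active", "Tastebuds", "Tastebuds"]
model = train_nb(docs, labels)
probs = predict_proba(model, ["run", "run", "wine"])
print(max(probs, key=probs.get), probs)
```

Each occurrence of a word adds its log-likelihood once, so the repeated "run" pulls the toy description toward "Active" more strongly than a single mention would.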

Finally, we used cross-validation to randomly split our data into 5 different folds so that the model could be trained and tested on 5 different combinations of data, to reduce potential noise dependence. The aggregated results for our TF*IDF vectorized multinomial Naive Bayes model were as follows:

[Figure: classification report (precision, recall, f1-score, accuracy) for the TF*IDF multinomial Naive Bayes model]

Analysis

To break down this set of scores, we’ll start with the precision column. This describes the proportion of events assigned to a certain category that indeed should’ve been under that category. Recall describes a subtly different quantity: the proportion of a certain category’s events that were correctly assigned to that category, rather than ending up in a different one.

For example, 78% of the model’s guesses for Arts & Entertainment were indeed supposed to be that, while 22% of those should’ve been guessed as other categories (precision); 74% of actual Arts & Entertainment events were guessed by the model to be under that category, while 26% ended up being guessed as other categories (recall).

The f1-score column combines precision and recall into a single number by taking their harmonic mean, and the accuracy row of the chart gives the share of all events that the model categorized correctly, providing an overall estimate of the model’s accuracy: 74%. It’s worth noting that the model using regular count vectorization produced a very comparable score.
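For readers who want to see how these columns are computed, here is a small sketch that assumes the report follows the common convention in which F1 is the harmonic mean of precision and recall and accuracy is the fraction of correct predictions. The labels below are invented toy data, not the project's results.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly assigned
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly assigned to label
    fn = sum(t == label and p != label for t, p in pairs)  # missed members of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
y_true = ["A&E", "A&E", "A&E", "Active", "Learn", "Active"]
y_pred = ["A&E", "A&E", "Active", "Active", "A&E", "Active"]

precision, recall, f1 = class_metrics(y_true, y_pred, "A&E")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```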

Doc2Vec

Another method we explored to both classify and cluster events was Doc2Vec, an extension of Word2Vec that learns a numerical vector for each document. The algorithm uses a continuous bag of words (a sliding window of words in each document) in order to make probabilistic predictions about the most likely words that occur before and after each word in a given document. The probabilities of the preceding and following words allow us to look at specific words in the context of a sentence or a full event description, and to associate specific words or phrases with a class based on where and how they appear in the description.

We would then measure differences in context through cosine distance: once our contextual associations are converted into a series of vectors, we look at the angle between the vector representing one phrase and the vector representing another. Words and phrases that are closer in context have a smaller angle between their respective vectors, and therefore a higher cosine similarity (a smaller cosine distance). We believed a context-based approach to looking at each event description would be able to classify events more accurately, given that the overwhelming presence of specific words out of context might throw off our classifications. We also attempted to cluster events based on context.
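The angle-based comparison can be shown in a few lines. The three-dimensional vectors below are made up for illustration; real Doc2Vec embeddings have many more dimensions and would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical document vectors, invented for this example.
concert = [0.9, 0.1, 0.0]
gig = [0.8, 0.2, 0.0]
hike = [0.0, 0.1, 0.9]

print(cosine_similarity(concert, gig))   # close to 1: small angle
print(cosine_similarity(concert, hike))  # close to 0: large angle
```

Cosine distance is usually defined as one minus this similarity, so a small angle means high similarity and small distance.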

By using hierarchical clustering methods, and bucketing event descriptions together based on how similar they were in context, we were hoping to find groups of events that were closely related by the specific word orderings or phrases used in the event text. This, ideally, would group events together by how words were used, not just by what words were used.

Problems 

Unfortunately, this method wasn’t giving us useful results. We theorized that because of our necessary preprocessing, the event descriptions themselves were shrunk so much that the algorithm wasn’t able to infer different contexts from different event descriptions. Because the algorithm wasn’t able to differentiate between events based on the limited words available to infer context, the cosine distance between somewhat different events wound up being very small, meaning that our model would guess that events were similar to each other when they weren’t.

The cosine distances between many event descriptions were so similar that our model was only able to correctly classify an event around 50% of the time. This led to problems with hierarchical clustering as well.

Since this clustering method was meant to group events together based on similar cosine distance, and our model assumed that many events had very similar cosine distances, the clustering method grouped most events together into one large cluster. We wound up having one large cluster with 80% of all event descriptions put inside of it, and 9 clusters made up of mostly outliers representing the other 20%. With context-based prediction and clustering methods failing to give us useful results, we turned back to our frequency-based methods. 

Classification Threshold

Our final prediction model is able to return a class prediction based on a probability distribution - for each of the four classes, the model assigns a percentage likelihood that an event belongs to a specific class, based on the frequency of specific words in the description. The class that receives the highest percentage likelihood is our prediction.

However, because event descriptions can often include words that are associated with multiple classes, the best prediction might not always have a high probability associated with it. If an event seems equally likely to belong to three out of the four classes, for example, the percentage likelihood that an event belongs to each class will be about 33.3% for each.

Cutoff Threshold

This raised two important questions: is there a cutoff point where the probability is too low, and we don’t classify an event into a specific category? And what is the ideal cutoff point? To answer both of these questions, we built a tool to check the tradeoff between misclassification and non-classification (assigning no class based on a low probability) as we approached different cutoff points.

Our tool first split up our model’s training and test data via 10-fold cross-validation, to reduce dependence on any single split. Once we ran our model on each split dataset and received probability values associated with each prediction, our tool began to actually classify events based on this probability distribution.

What was different from our normal predictive model is that we were able to assign a “cutoff probability” between 0% and 100%. If the highest percentage likelihood of an event fell below that cutoff, the event was not put into any class, and received a “none” value. If our cutoff was 50%, for example, and the highest percentage likelihood that an event received was 45%, that event would not be classified. From there, our tool was able to tally correct, incorrect, and “none” classifications across a range of cutoff probabilities, in order to see the tradeoff between misclassification and non-classification as the cutoff probability got higher. The following represents this tradeoff graphically:

[Figure: counts of correct, incorrect, and “none” classifications across cutoff probabilities]
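The tallying logic of such a cutoff tool can be sketched as follows. This is an assumption-laden illustration rather than the tool itself: it skips the cross-validation step and starts from already-computed class probabilities, and the function name `tally` and the probability rows are invented.

```python
from collections import Counter


def tally(prob_rows, true_labels, cutoff):
    """Count correct, incorrect, and unclassified ("none") events at one cutoff."""
    counts = Counter(correct=0, incorrect=0, none=0)
    for probs, truth in zip(prob_rows, true_labels):
        label, p = max(probs.items(), key=lambda kv: kv[1])  # most likely class
        if p < cutoff:
            counts["none"] += 1        # top probability too low: leave unclassified
        elif label == truth:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return dict(counts)


# Made-up predicted probabilities for three events, with their true categories.
prob_rows = [
    {"A&E": 0.70, "Active": 0.10, "Tastebuds": 0.10, "Learn": 0.10},
    {"A&E": 0.45, "Active": 0.05, "Tastebuds": 0.40, "Learn": 0.10},
    {"A&E": 0.30, "Active": 0.35, "Tastebuds": 0.20, "Learn": 0.15},
]
true_labels = ["A&E", "Tastebuds", "Active"]

for cutoff in (0.30, 0.40, 0.50):
    print(cutoff, tally(prob_rows, true_labels, cutoff))
```

Raising the cutoff moves events from the correct and incorrect tallies into "none", which is the tradeoff the graph summarizes.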

Analysis

It stands to reason that with 4 classes, if our algorithm were completely unsure how to classify an event, it would assign each class an equal likelihood of 25%. The highest class probability can therefore never fall below 25%, so cutoffs of 25% and below were not considered. One can see from this graph that at a 40% cutoff, our non-classifications match our misclassifications, meaning that at this cutoff point you’re as likely to leave an event unclassified as you are to misclassify it. The number of correct guesses remains fairly high as well.

Based on this analysis, we were able to confidently recommend classifying an event into a class only if it is 40% or more likely to belong to that class, relative to all the others. This cutoff point has a very interesting application in multi-label classification as well. If an event is 70% likely to be in class A, but only 30% likely to be in class B, our cutoff threshold suggests that you should only classify the event as class A. However, if an event is 55% likely to be in class A, and 45% likely to be in class B, you could classify the event into both class A and class B.

Given the categories that the owner of this company was using to classify events (A&E, Tastebuds, Learn, Active), sometimes it might make sense to classify an event into two buckets rather than one based on this probability distribution (an event might fit the description for A&E and Tastebuds at the same time, for example).
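A rule of this kind is easy to express in code. The sketch below assumes the 40% cutoff discussed above; the helper name `labels_above_cutoff` and the example probabilities are hypothetical.

```python
def labels_above_cutoff(probs, cutoff=0.40):
    """Return every class whose probability meets the cutoff, most likely first."""
    kept = [label for label, p in probs.items() if p >= cutoff]
    return sorted(kept, key=probs.get, reverse=True)


# Illustrative probability distributions, not real model output.
clear_case = {"A&E": 0.70, "Tastebuds": 0.30}
split_case = {"A&E": 0.55, "Tastebuds": 0.45}
unsure_case = {"A&E": 0.30, "Tastebuds": 0.30, "Learn": 0.25, "Active": 0.15}

print(labels_above_cutoff(clear_case))
print(labels_above_cutoff(split_case))
print(labels_above_cutoff(unsure_case))
```

With four classes and a 40% cutoff, at most two classes can qualify at once, so an event ends up with zero, one, or two labels.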

Finding New Data Categories

After developing a successful supervised predictive classifier, we were excited to try detecting a new clustering scheme altogether for the classifier to predict on. As mentioned previously, we explored Doc2Vec for this purpose, but it didn’t end up working well with our data to generate meaningful clusters.

LDA

We then tried another unsupervised learning method called Latent Dirichlet Allocation (LDA). This is a popular technique for topic modeling, or clustering text data by topic. In short, it creates a user-specified number of topic clusters based on likelihoods of term co-occurrences, rather than semantic definitions. Each topic is modeled as a distribution of words, and each document is modeled as a distribution of the created topics. Because we cared about mitigating the effects of overly common words just like earlier, we cut out any words that appeared in more than 20% of events (which essentially cut out “music”, “food”, and “live”).
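The document-frequency filter mentioned here can be sketched without any modeling library. This is only the filtering step, under the assumption that "appeared in more than 20% of events" means the share of events containing the word at least once; the name `drop_common_words` and the toy events are invented, and fitting the LDA model itself is not shown.

```python
from collections import Counter


def drop_common_words(docs, max_doc_share=0.20):
    """Remove words that occur in more than max_doc_share of the documents."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    limit = max_doc_share * len(docs)
    too_common = {tok for tok, n in doc_freq.items() if n > limit}
    filtered = [[tok for tok in doc if tok not in too_common] for doc in docs]
    return filtered, too_common


# Toy token lists for illustration only.
docs = [
    ["music", "jazz"],
    ["music", "wine"],
    ["food", "chef"],
    ["food", "bacon"],
    ["yoga", "trail"],
]
filtered, too_common = drop_common_words(docs)
print(sorted(too_common), filtered)
```

Many vectorizer implementations expose an equivalent maximum document frequency setting, which would achieve the same effect before the topic model is fit.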

We also applied TF*IDF to word frequency values like before. The following visualization depicts an LDA model we fine-tuned after playing around with different parameter combinations and ideas:

Shown here, 6 topic clusters were created by the LDA model, and topic “5” is defined as a frequency distribution of the words in the event data, with relatively high appearances of the words shown above. Scanning all the words that made it to this top-30 list, we noticed a good amount of food-related words (bacon, wine, chocolate, gluten, sweet, chef), especially compared to the distributions of the other clusters.

Evaluation

After evaluating each cluster’s word distribution, we tried coming up with a label for each category that best described the topic being portrayed. Though challenging, finding these labels during the topic modeling process is the human element needed to make sense of the unsupervised model’s distributions. We came up with the following scheme:

  • Category 1 - outdoor fairs
  • Category 2 - artsy “world” expos
  • Category 3 - local/ethnic culture
  • Category 4 - physical and mental wellness
  • Category 5 - food parties
  • Category 6 - southern cultural

Constructing this LDA model was indeed interesting, and something we worked on for a nice chunk of our project time. After showing the finalized categorization scheme to our client, we didn’t feel that it was fitting or robust enough to replace the current categorization scheme.

We concluded that using the basic set-up of our LDA model with a larger, more robust database of events in the future might allow it to pick up on more interesting, contrasting patterns. Another interesting future direction might be to home in on just one of the current categories (perhaps the largest, most general one) in the database and run an LDA model on that subset to divide it into subclusters. This would increase the total number of categories and thus yield a more nuanced categorization scheme.

Our Command Line Interface Tool

After working on both supervised and unsupervised models and polishing off a working predictive classifier, we wanted to provide an easy-to-use deliverable to our client that would allow them to quickly feed in new events and receive predictions of the most appropriate categories for them. We built a CLI tool using Python and loaded our models into it by a process called “pickling”. The tool allows users to input a new event description and outputs the probabilities of it belonging to each of the four categories. Different levels of output verbosity can be specified, with ‘2’ being the most verbose on a scale from 0 - 2.

In this example, the user inputted a blurb about a Green Day concert, which the model preprocessed into the shortened text shown. The model then predicted A&E (Arts and Entertainment) as the category with the highest probability to match this event, which is accurate! The probability of .48 is above our calculated threshold of 40%, and thus the user can confidently categorize this event under A&E without concern!
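A command-line tool of this shape could look roughly like the sketch below. Everything in it is a hypothetical stand-in: the `-v/--verbosity` flag, the meaning of each verbosity level, the toy keyword weights that play the role of the pickled model, and the function names are all assumptions made for illustration, not the actual tool's interface.

```python
import argparse
import pickle

# Toy "model": per-category keyword weights (an assumption for this sketch).
TOY_PARAMS = {
    "A&E": {"concert": 3, "live": 2},
    "Active": {"run": 3},
    "Tastebuds": {"wine": 3},
    "Learn": {"lecture": 3},
}


def predict_proba(params, text):
    """Score each category by summed keyword weights, then normalize."""
    tokens = text.lower().split()
    scores = {cat: 1 + sum(w.get(t, 0) for t in tokens) for cat, w in params.items()}
    total = sum(scores.values())
    return {cat: s / total for cat, s in scores.items()}


def build_parser():
    parser = argparse.ArgumentParser(description="Categorize an event description.")
    parser.add_argument("description", help="event title and description text")
    parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2], default=0)
    return parser


def run(argv, params, cutoff=0.40):
    """Return output lines; more detail is added as verbosity increases."""
    args = build_parser().parse_args(argv)
    probs = predict_proba(params, args.description)
    best = max(probs, key=probs.get)
    lines = [best if probs[best] >= cutoff else "none"]
    if args.verbosity >= 1:
        lines.append(f"{best}: {probs[best]:.2f}")
    if args.verbosity == 2:
        lines.extend(f"{cat}: {p:.2f}" for cat, p in sorted(probs.items()))
    return lines


# Round-trip the parameters through pickle, as a saved model would be loaded.
params = pickle.loads(pickle.dumps(TOY_PARAMS))
output = run(["Green Day live concert", "-v", "1"], params)
print("\n".join(output))
```

Serializing a trained model once and loading it at start-up means the tool never has to retrain; pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.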

Conclusions & Future Work

It was a pleasure working with our client to tremendously increase their efficiency and introduce automation and machine intelligence into their daily work. While our LDA model may not be put to use right away, it has potential to be improved down the line as the database grows to include a more robust variety of events. A new LDA model can be used to subdivide one or more of the existing clusters as well to provide a more nuanced categorization scheme. Semi-supervised clustering methods, such as “Guided LDA”, are another interesting future direction to explore.

Our client was delighted to receive a tool that would automate the categorization of new events, saving hours of work and providing a more seamless pipeline. While we didn’t spend much time on end-to-end automation for the application, we demonstrated that filtering the event database according to user inputs from the front-end can be done without much complication, once a rule-based scheme for matching is decided on and finalized by the company.

About Authors

Ali Rishty

Coming from a diversified background in neuroscience and computer science, Ali is passionate about programmatically analyzing how humans think, behave, and develop their personal tastes. She is enjoying her time at NYCDSA, where she is immersing herself in...

Matthew Boccio

Data Scientist with the ability to quickly dissect complex problems, provide thoughtful detailed analysis using data-driven tools and business intuition, and communicate solutions clearly to relevant stakeholders. I believe in the power of data science and its potential...
