Team DataUniversalis' Higgs Boson Kaggle Competition

Bin Fang, Miaozhi Yu, Chuan Sun, and Shuo Zhang
Posted on Sep 2, 2016

Contributed by Shuo Zhang, Chuan Sun, Bin Fang, and Miaozhi Yu. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from July 5th to September 23rd, 2016.

All R code can be found here:
https://github.com/nycdatasci/bootcamp006_project/tree/master/Project4-MachineLearning/DataUniversalis/R_code

1. Introduction

The discovery of the long-awaited Higgs boson was announced on July 4, 2012, and confirmed six months later. But for physicists, the discovery of a new particle means the beginning of a long and difficult quest to measure its characteristics and determine whether it fits the current model of nature. The ATLAS experiment has recently observed a signal of the Higgs boson decaying into two tau particles, but this decay is a small signal buried in background noise. The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment.

Using simulated data with features characterizing events detected by ATLAS, our task is to classify events into "tau tau decay of a Higgs boson" versus "background."

2. Work Flow

[Figure: project workflow diagram]

3. Pre-processing

3.1 Exploratory Data Analysis

3.1.1 Data Distribution Pattern

Previous research on the Higgs boson revealed that the derived mass is an important indicator of its presence. By analyzing the distribution of the variable DER_mass_MMC by label, we can see that the mass observations for the "s" label have a narrower distribution and a higher peak than those for the "b" label. The mass observations of the two label groups have similar mean values.

[Figure: density of DER_mass_MMC by label]
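
The comparison above can be reproduced with a few lines of R. The snippet below is a minimal sketch, not the code from the linked repo; the file name "training.csv" (the Kaggle training file) is an assumption.

    # Density of DER_mass_MMC by label (sketch)
    library(ggplot2)

    train   <- read.csv("training.csv")              # path is an assumption
    plot_df <- subset(train, DER_mass_MMC != -999)   # drop the -999 sentinels

    ggplot(plot_df, aes(x = DER_mass_MMC, fill = Label)) +
      geom_density(alpha = 0.4) +
      labs(x = "DER_mass_MMC", y = "density",
           title = "DER_mass_MMC by label (s = signal, b = background)")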

3.1.2 Missingness

We noticed that only about a quarter of the dataset consists of complete cases. Eleven of the 33 variables have missing records, which are coded as -999. Ten of those eleven variables are related to the jets or to the number of jets.

By carefully examining missingness by jet number, we found that these variables are entirely missing when the jet number is 0, partially missing when it is 1, and fully observed when it is 2 or 3.

[Figures: missingness patterns for PRI_jet_num = 0, PRI_jet_num = 1, and PRI_jet_num = 2/3]
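
The missingness pattern can be tabulated directly from the -999 sentinels. The snippet below is a rough sketch under the same file-name assumption as above.

    # Share of -999 sentinel values per variable, within each jet-number group
    train    <- read.csv("training.csv")             # path is an assumption
    num_cols <- sapply(train, is.numeric)
    lapply(split(train[num_cols], train$PRI_jet_num),
           function(g) round(colMeans(g == -999), 2))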

3.1.3 Correlation Analysis

After imputing all -999 values within each jet-number group, we found strong positive or negative correlations among a few variables, especially those with the prefix "DER", and these correlations differ across jet-number groups. This suggests that these variables may derive from the same underlying quantities.

[Figures: correlation matrices for PRI_jet_num = 0, PRI_jet_num = 1, and PRI_jet_num = 2/3]
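
A per-group correlation plot of this kind can be produced as follows. This is a sketch that assumes the corrplot package and, for brevity, runs on the raw jet-0 subset; in practice it should be run on the imputed subset described in Section 3.2.

    # Correlation heat map for one jet-number group (sketch, not original code)
    library(corrplot)

    train0   <- subset(read.csv("training.csv"), PRI_jet_num == 0)   # assumption
    num_vars <- train0[sapply(train0, is.numeric)]
    num_vars <- num_vars[, sapply(num_vars, function(x) sd(x) > 0)]  # drop constant columns
    corrplot(cor(num_vars), method = "color", tl.cex = 0.6)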

3.1.4 Principal Component Analysis

Principal component analysis (PCA) helps us better understand the importance of, and relationships among, the variables. The scree plot shows that the first component captures most of the information and that the data dimension can be reduced to 8 variables. The loadings show that some variables with the prefixes "PRI_jet" or "DER" have relatively high loadings on the first two components, PC1 and PC2, which may indicate their importance.

[Figures: PCA scree plot and loadings on PC1 and PC2]
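
A short sketch of the PCA step on one subset (our reconstruction; the object names and the subset choice are assumptions):

    # PCA on the scaled numeric features of one subset (sketch)
    train0 <- subset(read.csv("training.csv"), PRI_jet_num == 0)     # assumption
    feats  <- train0[sapply(train0, is.numeric)]
    feats$EventId <- feats$Weight <- NULL                            # not physics features
    feats  <- feats[, sapply(feats, function(x) sd(x) > 0)]          # drop constant columns

    pca <- prcomp(feats, center = TRUE, scale. = TRUE)
    screeplot(pca, type = "lines", main = "PCA scree plot")
    round(pca$rotation[, 1:2], 2)                                    # loadings on PC1 and PC2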

3.2 Imputation

After splitting the data into 3 parts based on jet number and dropping, for each subset, the missing columns that have no physical meaning, one important feature ("DER_mass_MMC") still has missing values in the subsets with jet number 0 and 1. We applied KNN imputation with k = sqrt(nrow(subset)) and distance = 2.
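
A minimal sketch of this imputation step, assuming the VIM package (the original code in the repo may use a different KNN implementation; VIM's kNN uses its own distance metric, so the distance = 2 choice above is only mirrored approximately):

    # KNN imputation of DER_mass_MMC within one jet-number subset (sketch)
    library(VIM)

    train0 <- subset(read.csv("training.csv"), PRI_jet_num == 0)     # assumption
    train0$DER_mass_MMC[train0$DER_mass_MMC == -999] <- NA           # recode sentinel as NA

    k_val  <- round(sqrt(nrow(train0)))                              # k = sqrt(n), as above
    train0 <- kNN(train0, variable = "DER_mass_MMC",
                  dist_var = setdiff(names(train0),
                                     c("EventId", "Weight", "Label", "DER_mass_MMC")),
                  k = k_val, imp_var = FALSE)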

 

4. Learning Algorithm Training

4.1 Overview of Machine Learning Methods

We tried a number of different machine learning methods. Below is a quick review of these methods, with a general description and their pros and cons.

[Table: overview of machine learning methods, with pros and cons]

4.2 Random Forests

Let us have a look at our first tree-based model, the random forest (abbreviated as RF below). There are two parameters we need to tune for RF: ntree and mtry. ntree controls how many trees to grow, and mtry controls how many variables to draw at each split. Due to the large size of the data, our first try was very ambitious: ntree = [2000, 5000, 8000] and mtry = [3, 4, 5, 6]. However, the computation turned out to be prohibitively expensive and crashed R. Our second try was less ambitious, with ntree = [500, 800, 1000] and mtry = [3, 4, 5].
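
A caret-style sketch of this grid search (an assumption on our part; the original scripts are in the linked repo). caret's "rf" method tunes mtry, while ntree is passed straight through to randomForest, so each ntree value is run separately:

    # Random forest tuning sketch for one subset
    library(caret)

    train0 <- subset(read.csv("training.csv"), PRI_jet_num == 0)     # assumption
    train0$Label   <- factor(train0$Label, levels = c("s", "b"))     # signal as first level
    train0$EventId <- train0$Weight <- NULL                          # drop non-features

    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    rf_fit <- train(Label ~ ., data = train0,
                    method    = "rf",
                    metric    = "ROC",
                    ntree     = 500,                                 # one value from our grid
                    tuneGrid  = expand.grid(mtry = c(3, 4, 5)),
                    trControl = ctrl)
    rf_fit$bestTune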

From the graphs below, we can see that mtry = 3 is the optimal value for each subset of the data.

[Figures: AMS vs. mtry for each subset]

However, the ROC curves look abnormal: the graphs show that the RF predicts every single observation correctly. Why is that?

[Figures: ROC curves for the random forest models]

In our case, the weight is estimated from a probability space. To make a prediction, the RF model repeatedly bisects that probability space until a stopping criterion is met (e.g., no more than 5 observations in each sub-region) and then uses the mean of each sub-region as the prediction. Our probability space is very sensitive, and ntree = 1000 is so large that it makes the RF fit the data almost perfectly.

What is more, by looking at only 3 variables (especially the transverse-mass variable), we can clearly tell whether the label is going to be an 's' or a 'b'. So the first few splits of the tree model are already sufficient to make the prediction. By setting ntree to 2 or 10, we can get a 'normal' graph; however, doing so is not very meaningful, because it simply makes a fine model coarser.

[Figures: ROC curves for the jet 2/3 subset with ntree = 10 and ntree = 2]

4.3 GBM

To improve on the RF, the next step is to apply gradient boosting. There are 4 tuning parameters: interaction.depth, n.trees, shrinkage, and n.minobsinnode. interaction.depth controls the maximum number of splits per tree (starting from a single node); for example, interaction.depth = 1 gives an additive model, and interaction.depth = 2 allows two-way interactions. n.trees is the number of trees (the number of gradient boosting iterations); increasing it reduces the error on the training set, but setting it too high may lead to overfitting.

Shrinkage acts as a learning rate: it reduces, or shrinks, the impact of each additional fitted base learner (tree). It reduces the size of the incremental steps and thus penalizes the importance of each consecutive iteration. The intuition behind this technique is that it is better to improve a model by taking many small steps than by taking fewer large steps. If one of the boosting iterations turns out to be erroneous, its negative impact can easily be corrected in subsequent steps.

n.minobsinnode controls the minimum number of observations in the trees' terminal nodes: during tree building, any split that would lead to a node containing fewer training instances than this number is ignored. Imposing this limit helps to reduce the variance of the predictions at the leaves. The 4 tuning parameters interact with one another, so changing one has an impact on the others.

Let's take subset 1 (jet number equal to 1) as an example. Due to the size of the data and the computational cost, our first try used a small number of trees (200, 500, 800) with a low interaction depth (3, 4, 5), a wide range of shrinkage values (0.1, 0.05, 0.01), and small values of n.minobsinnode (10, 50, 100). The parameters with the highest AMS were n.trees = 800, interaction.depth = 5, shrinkage = 0.1, and n.minobsinnode = 100, which gave an AUC of 0.88 (as shown in the graph).
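
This first grid can be set up in caret roughly as follows. This is a sketch under the same data-preparation assumptions as the random forest example above, and the full 81-combination grid is computationally heavy.

    # GBM grid search for the jet-number-1 subset (sketch, not the original script)
    library(caret)

    train1 <- subset(read.csv("training.csv"), PRI_jet_num == 1)     # assumption
    train1$Label   <- factor(train1$Label, levels = c("s", "b"))
    train1$EventId <- train1$Weight <- NULL

    gbm_grid <- expand.grid(n.trees           = c(200, 500, 800),
                            interaction.depth = c(3, 4, 5),
                            shrinkage         = c(0.1, 0.05, 0.01),
                            n.minobsinnode    = c(10, 50, 100))

    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    gbm_fit <- train(Label ~ ., data = train1,
                     method    = "gbm",
                     metric    = "ROC",
                     tuneGrid  = gbm_grid,
                     trControl = ctrl,
                     verbose   = FALSE)
    gbm_fit$bestTune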

Our second try therefore fixed n.trees and increased interaction.depth to (5, 7, 10). The setting with the highest AMS had interaction.depth = 10, but it produced a lower AUC of 0.86, likely because of overfitting from introducing too much interaction between the features. Our third try therefore fixed interaction.depth at 5, increased n.trees to (800, 2000, 5000), and lowered the shrinkage to 0.01 to reduce overfitting.

With 5000 trees, the AUC rises to 0.91. Our last try increased n.trees to (5000, 7500, 10000) while fixing interaction.depth at 5 and shrinkage at 0.01. The best parameters (interaction.depth = 5, shrinkage = 0.01, n.trees = 10000, n.minobsinnode = 100) give the best result, with an AUC of 0.91.

[Figures: GBM tuning results for subset 1]

The same process of tuning parameters is applied to the other 2 subsets.

For subset 0 (jet number equal to 0), the best tuning parameters are interaction.depth = 5, shrinkage = 0.01, n.trees = 10000, and n.minobsinnode = 100, resulting in an AUC of 0.91.

For subset 2 (jet number equal to 2 or 3), the best tuning parameters are interaction.depth = 5, shrinkage = 0.05, n.trees = 800, and n.minobsinnode = 100, resulting in an AUC of 0.91.

[Figure: GBM tuning results for subsets 0 and 2]

4.4 Neural Networks

The data set was split into the same three subsets for neural network (NN) training. We tested several neural network tools available through the "caret" library and chose "nnet", as we thought one hidden layer would be appropriate for training. Two tuning parameters are available for controlling prediction accuracy: the number of hidden units, which connect the inputs to the output layer, and the weight decay, which controls how fast the weights grow after each update.

We assumed that a small weight decay and a number of hidden units similar to the number of variables would generate accurate predictions. We tried hidden-unit counts ranging from 6 to 22 in increments of 2, as well as weight decay values between 0 and 0.1.
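
A caret sketch of this grid, assuming the nnet back end described above (the object names and the exact set of decay values are assumptions):

    # Single-hidden-layer NN tuning on the jet-number-0 subset (sketch)
    library(caret)

    train0 <- subset(read.csv("training.csv"), PRI_jet_num == 0)     # assumption
    train0$Label   <- factor(train0$Label, levels = c("s", "b"))
    train0$EventId <- train0$Weight <- NULL

    nnet_grid <- expand.grid(size  = seq(6, 22, by = 2),             # hidden units
                             decay = c(0, 0.0001, 0.01, 0.1))        # weight decay

    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    nnet_fit <- train(Label ~ ., data = train0,
                      method     = "nnet",
                      metric     = "ROC",
                      tuneGrid   = nnet_grid,
                      preProcess = c("zv", "center", "scale"),       # drop constants, standardize
                      trControl  = ctrl,
                      maxit      = 200, MaxNWts = 2000, trace = FALSE)
    nnet_fit$bestTune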

Below is the architecture of the neural network with 1 hidden layer and 20 hidden nodes, where blue lines indicate positive weights and red lines indicate negative weights.

[Figure: neural network architecture with 1 hidden layer and 20 hidden nodes]

Relative variable importance was obtained from the neural network trained with the label as the response variable. Several variables with the "DER" prefix have higher importance (ranging from 0.6 to 0.9) than the other variables, which is consistent with the PCA results.

[Figure: relative variable importance from the neural network]

 

The NN parameter tuning on the PRI_jet_num = 0 subset shows that training performance, as reflected by the AMS score, fluctuates considerably as the number of hidden units increases. Larger weight decay values (0.1, 0.01) achieve better AMS scores than smaller values, and a clearer increasing trend in AMS can be observed; the AMS stays near 0 when the weight decay is set to 0.0001. We found the best combination for this dataset to be 10 hidden units and weight decay = 0.01, with a prediction accuracy of 82.54%.

The first round of tuning suggested that a higher weight decay contributes positively to the AMS score, so a new weight decay range of 0.001 to 0.15 was used when tuning the NN on the PRI_jet_num = 2/3 subset. The tuning results show that most of the AMS curves increase monotonically as more hidden units are added, except when weight decay = 0.01.

In general, the NN performs better with weight decay = 0.1 or 0.15 than with smaller values. The best setting for this dataset is 22 hidden units and weight decay = 0.1, with a prediction accuracy of 80.4%.

[Figures: NN tuning results for PRI_jet_num = 0 and PRI_jet_num = 2/3]

 

4.5 XGBoost

There are 6 parameters to tune for XGBoost; the ranges we searched are listed below.

  • nrounds: the number of boosting rounds. Range: [1, ∞]. Values tried: 20, 60, 100, 150, 200, 250.
  • eta (default 0.3): step-size shrinkage used in each update to prevent overfitting. After each boosting step we can directly get the weights of the new features, and eta shrinks these feature weights to make the boosting process more conservative; a value below 1 means we do not fully optimize at each step and reserve some room for future rounds. Range: [0, 1]. Values tried: 0.001, 0.003, 0.02, 0.05, 0.15.
  • gamma (default 0): minimum loss reduction required to make a further partition on a leaf node; the larger it is, the more conservative the algorithm. Range: [0, ∞]. Values tried: 1, 5.
  • max_depth (default 6): maximum depth of a tree; increasing it makes the model more complex and more likely to overfit. Range: [1, ∞]. Values tried: 7, 10, 15, 20.
  • min_child_weight (default 1): minimum sum of instance weights (hessian) needed in a child. If a split produces a leaf node whose instance-weight sum is below this value, the building process gives up further partitioning; in linear regression mode this simply corresponds to the minimum number of instances per node. The larger it is, the more conservative the algorithm. Range: [0, ∞]. Value used: 0.1.
  • colsample_bytree (default 1): subsample ratio of columns when constructing each tree. Range: (0, 1]. Value used: 0.7.

Our tuning strategy for XGBoost was as follows (a short code sketch follows the list):

  • Get an initial estimate of the range of each parameter, run the scripts to find the best values, and look at the "trend" in the saved graphs.
  • Drop "bad" parameter values that will not improve the result, set up a new tuning grid, and run the scripts again.
  • Repeat steps 1 and 2.
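
The sketch below sets this grid up with caret's "xgbTree" method (an assumption; the original scripts may call xgboost directly). caret additionally requires a subsample value, fixed here at 1, and the full 240-combination grid is what gets pruned iteratively as described above.

    # XGBoost grid search (sketch); the grid mirrors the table above
    library(caret)

    train0 <- subset(read.csv("training.csv"), PRI_jet_num == 0)     # assumption
    train0$Label   <- factor(train0$Label, levels = c("s", "b"))
    train0$EventId <- train0$Weight <- NULL

    xgb_grid <- expand.grid(nrounds          = c(20, 60, 100, 150, 200, 250),
                            eta              = c(0.001, 0.003, 0.02, 0.05, 0.15),
                            gamma            = c(1, 5),
                            max_depth        = c(7, 10, 15, 20),
                            min_child_weight = 0.1,
                            colsample_bytree = 0.7,
                            subsample        = 1)                    # required by caret

    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    xgb_fit <- train(Label ~ ., data = train0,
                     method    = "xgbTree",
                     metric    = "ROC",
                     tuneGrid  = xgb_grid,
                     trControl = ctrl)
    xgb_fit$bestTune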

5. Combining Predicted Probabilities

5.1 Analysis

We combined the training results from the three split datasets into one submission file, as illustrated in the hand-drawn picture below.

[Figure: hand-drawn illustration of combining the three subsets' predictions into one submission]

After the split, each subset has its own optimally tuned parameters. We then faced two strategies for combining the three vectors of predicted probabilities (a short sketch follows the list):

  • Strategy 1: Simply concatenate the three vectors of predicted probabilities (V0, V1, and V23) into one vector V, then choose a uniform cutoff value C (e.g., C = 15%) to map V into a vector of labels containing 's' and 'b'. Finally, submit this vector of labels to Kaggle to get a ranking.
  • Strategy 2: Treat each subset differently. The classifier of each subset has its own optimal cutoff value based on its ROC curve. For example, t0 = 0.35 on the ROC curve is the best threshold for jet #0. Using this threshold t0 as the cutoff for all the predicted probabilities of jet #0 (e.g., P01, P02, P03, etc.), we can map those probabilities to labels (signal 's' or background 'b') for all the test observations with jet number 0. As shown in the graph, P02 = 0.4 > t0 = 0.35, so P02 maps to 's'. Similarly, for the observations of jet #1, since P12 = 0.36 < t1 = 0.4, P12 maps to 'b'. We performed this procedure for jet #1 and jet #2/3 as well, concatenated the three vectors of mapped labels, and generated the submission file for Kaggle.
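
The two strategies reduce to the small sketch below (illustrative stand-in numbers only; in practice p0, p1, and p23 come from predict() on the three fitted models):

    # Sketch of the two combination strategies
    set.seed(1)
    p0  <- runif(5); p1 <- runif(5); p23 <- runif(5)   # predicted P(signal) per subset
    id0 <- 1:5;      id1 <- 6:10;    id23 <- 11:15     # matching EventIds (illustrative)

    # Strategy 1: one common cutoff C on the concatenated probabilities
    C       <- 0.15
    labels1 <- ifelse(c(p0, p1, p23) > C, "s", "b")

    # Strategy 2: a subset-specific cutoff taken from each ROC curve
    t0 <- 0.35; t1 <- 0.40; t23 <- 0.40                # t23 is illustrative
    labels2 <- c(ifelse(p0  > t0,  "s", "b"),
                 ifelse(p1  > t1,  "s", "b"),
                 ifelse(p23 > t23, "s", "b"))

    submission <- data.frame(EventId = c(id0, id1, id23), Class = labels2)

Either way, the three probability vectors are implicitly treated as if they lived on a common scale, which is exactly the assumption questioned in Section 5.2 below.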

We tried both strategies multiple times and submitted our results to Kaggle. However, the rankings fluctuated around 1000 on the private leaderboard, which did not meet our expectations. Indeed, at the submission stage it would have been possible to keep tuning a "magic" cutoff value, say the value C in Strategy 1 above, and obtain a submission with a relatively higher AMS score or leaderboard ranking by submitting perhaps hundreds of times to Kaggle.

However, we did not go in that direction, because we do not think repeatedly submitting results to Kaggle to chase a high ranking is an elegant solution. Trial and error may work on Kaggle, but in reality we may only get one shot.

We sat back and started to figure out the root cause. Soon our team realized that this combining process is mathematically flimsy.

5.2 Why splitting the data by jet number is not a good idea

Let's briefly revisit what made us want to split the dataset:

  • This dataset happens to have 4 possible jet numbers
  • Through EDA, we found that the missingness pattern differed by jet number
  • We noticed that someone on the Kaggle forum achieved a high AMS score (~3.5) after splitting the data
  • Splitting the data did allow customized imputation and faster parameter tuning
  • We subjectively assumed that the jet number was a critical physics parameter

Let's again look at the examples in the hand-drawn graph. Although P03 = 0.36 in the jet #0 subset and P12 = 0.36 in the jet #1 subset, the two probabilities of 0.36 have totally different meanings:

  • P03 = P(p03 | jet#=0) = 0.36
  • P12 = P(p12 | jet#=1) = 0.36

In other words, the predicted probabilities in each subset are all conditional probabilities, i.e., conditioned on the jet number. Two identical probability values from two subsets therefore mean different things and are not necessarily comparable, because we had no prior knowledge of the internal distribution of each subset. Even if we did, would we know the signal distribution of the test dataset? Not at all. The test dataset could contain 15% observations with jet #0, or no jet #0 at all.

To sum up, splitting the dataset by jet number is indeed a possible way to achieve a relatively acceptable ranking or AMS score. However, based on our analysis, it also has drawbacks:

  • It carries the cost of finding another "magic" cutoff parameter at the final submission stage
  • It is mathematically shaky when combining predicted probabilities from each split

This means that splitting the dataset into 3 subsets was not actually a good strategy. Recognizing this was important, because it made us think more deeply about our decision process as we navigated the project.

6. Takeaways

  • Imputation turns out to be necessary after splitting the data (for EDA, feature importance, NN, RF)
  • Try larger search grids for parameter tuning if possible (our grids were still too sparse)
  • Use cloud computing for all team members instead of their own laptops (to avoid crashes and uncertainty)
  • Generally speaking, splitting the data by jet number is not a good idea, because combining predicted probabilities from the splits is mathematically shaky, even though relatively acceptable Kaggle rankings can still be achieved

About Authors

Bin Fang

With a multi-disciplinary background in earth science, electrical engineering, and satellite technology, Bin has spent more than ten years in scientific research and teaching at universities and research institutes. His previous study aimed to integrate and interpret remote...
View all posts by Bin Fang >

Miaozhi Yu

Miaozhi recently received her Master's degree in Mathematics from New York University. Before that she received a Bachelor's Degree in both Mathematics and Statistics with a minor in Physics from UIUC. Her research interests lie in random graphs...
View all posts by Miaozhi Yu >

Chuan Sun

Chuan is interested in uncovering the relationships between things. He likes to seek order from chaos. Previously, he worked as a software engineer on an unannounced project at Amazon in Seattle. The project is related to machine learning and...
View all posts by Chuan Sun >

Shuo Zhang

Shuo Zhang graduated from Columbia University with a Ph.D. degree in Chemical Engineering, and the focus of her academic research was to design a protocol to synthesize layer-by-layer polymer films on nano-surfaces, investigate dynamics and kinetics, construct quantitative...
View all posts by Shuo Zhang >
