
Predicting Housing Tax with Machine Learning Models

Steven Jongerden, I-Peng Liu, Jing Wang and Huanghaotian Fu
Posted on Aug 22, 2017

Californians who buy a house often experience sticker shock when they receive their property tax bill. The reason for the dramatic spike in real estate tax is California's Proposition 13 of 1978, which sets the assessed tax value equal to the purchase price. In a rising market, that will always result in a steep increase in the real estate tax value, especially when the house has not changed ownership for an extended period of time. While equating the tax value with the market value may seem fair, the lack of yearly assessments creates disadvantages for multiple parties.

  • New property owners have to pay a higher effective tax rate, as the real estate tax value is assessed on the latest market value, giving homeowners who have owned their house for a longer period of time a tax advantage.
  • In declining markets, the real market value of the real estate might be lower than the real estate tax value, generating an artificially higher tax load because the tax value is not adjusted.
  • In a rising real estate market, the real estate tax value is not adjusted yearly, creating a loss of tax revenue for the governing body.

As these disadvantages indicate, the problem is a double-edged sword that hurts both the real estate owner and the governing body. A possible solution would be to re-assess all real estate every year to ensure that the tax value equals the market value; however, such a process would be very cumbersome and time consuming given that there are 3.5 million houses in Los Angeles alone. Consequently, to assist in determining the market value of real estate with the aim of setting the tax value equal to the market value, the following research question is posed: can machine learning algorithms help predict the real estate tax value and thereby reduce both the imbalance in tax load and the governmental loss of tax revenue?

Methodology

In order to create a model that might assist in predicting the real estate tax value, market information is required. A current Kaggle competition provides information on 2.9 million houses in three counties in the United States, one of which is Los Angeles (https://www.kaggle.com/c/zillow-prize-1). That makes it an excellent source of information, though we must begin by addressing the generalizability of the sample.

Generalization of results

The concept of generalization, from an academic point of view, implies that a sample does not differ significantly from the population on a number of characteristics. Additionally, the concept of generalization indirectly assumes that the data is randomly sampled, which is required for any statistical test. To compare the characteristics of the sample and the population (US Census Bureau), chi-square tests were performed on the building construction year and the estimated value of the house (the house tax value); both test statistics were found insignificant. However, even if these tests indicate that the data is comparable to the population, it is not a perfect match. The data was conveniently sampled, because transaction information was used to select the houses, which reduces the generalizability of the data to transferability. The difference is slight: conclusions from the sample dataset can still be applied to the population, but enlarging the sample would not make it resemble the population more closely.
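As a hedged illustration, such a goodness-of-fit comparison maps onto base R's chisq.test; the year bins and census shares below are placeholders, not the values used in the original analysis.

```r
# Goodness-of-fit check of the sample's construction years against assumed
# census proportions; `bins` and `census_props` are illustrative placeholders.
bins <- c(-Inf, 1940, 1960, 1980, 2000, Inf)
observed <- table(cut(df$yearbuilt, breaks = bins))
census_props <- c(0.15, 0.25, 0.30, 0.20, 0.10)  # assumed census shares
chisq.test(observed, p = census_props)
```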

Cleaning the dataset

The quality of the input data is one of the key factors in ensuring accurate machine learning predictions. In order to ensure sufficient data quality, the following cleaning workflow was performed:

  1. Calculate the amount (percentage) of missing data in each variable (feature)
  2. Impute missing data
  3. Check that each variable has the correct data type
  4. Detect and winsorize outliers

Overall, 41.3% of the values were missing. The variables with the highest number of missing values are the building class type, the story type, the size of the basement and the size of the garden. For most of these variables, the value is not missing at random, as the house may simply not have a garden or basement. In other cases, like the building class type, the value is clearly missing, as every house can be classified by a certain building type. In order to use this information in machine learning models, these missing values must be imputed, as most machine learning algorithms cannot handle missing values.

To make appropriate and reasonable decisions on the imputation methodology and data types, each variable was compared with the description provided by Kaggle. Additionally, logical reasoning was used for imputation. The following methods were used for some of the manipulations (a hedged sketch in R follows the list):

  1. For variables with over 99% missing data, the variable was deleted due to the lack of useful information.
  2. For numeric variables, the median was imputed.
  3. For factor variables, 0 or the mode was imputed.
  4. For variables such as the pool type and basement size, zero was imputed, on the assumption that the data was not recorded because the property does not have that feature.
  5. For variables missing completely at random, the mode was imputed, in order to prevent the creation of additional levels within factor variables.
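Below is a minimal sketch of these rules in R, assuming the raw data sits in a data frame df with the Zillow column names; the helper itself is illustrative, not the authors' code.

```r
# Illustrative imputation pass over a data frame `df` (Zillow columns).
impute <- function(df) {
  for (col in names(df)) {
    miss <- mean(is.na(df[[col]]))
    if (miss > 0.99) {                      # rule 1: drop near-empty columns
      df[[col]] <- NULL
    } else if (is.numeric(df[[col]])) {     # rule 2: median for numerics
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    } else {                                # rules 3/5: mode for factors
      counts <- table(df[[col]])
      df[[col]][is.na(df[[col]])] <- names(counts)[which.max(counts)]
    }
  }
  df
}
# Rule 4 ("feature absent" variables such as poolcnt) is applied first:
df$poolcnt[is.na(df$poolcnt)] <- 0
df <- impute(df)
```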

Next, winsorization was applied to all numerical variables. This transformation eliminates the potential adverse effect that extreme values (outliers) may have on the prediction model. Observations above the 97.5th percentile or below the 2.5th percentile were replaced with the mean value.
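A minimal sketch of this step, assuming the data frame df from above. Note that classic winsorization clips values to the boundary percentiles; the post replaces them with the mean, and the sketch follows the post.

```r
# Replace values outside the 2.5th-97.5th percentile band with the mean,
# as described in the text (rather than clipping to the band edges).
winsorize <- function(x, lower = 0.025, upper = 0.975) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE)
  x[x < q[1] | x > q[2]] <- mean(x, na.rm = TRUE)
  x
}
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], winsorize)
```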

Data Exploration

As part of any data analysis, exploratory data analysis must be performed. It ensures that the researcher understands the data and can use that understanding, and any findings, as input for machine learning modelling. One could even state that exploration is a prerequisite for machine learning modelling. The exploratory data analysis can be split into two sections: (1) analyzing the dependent variable, and (2) analyzing the independent variables.

Dependent Variable

Investigating the distribution of the dependent variable, the real estate tax value, reveals two distinct peaks, which suggests that multiple processes drive the outcome. The first peak is the land tax, which differs strongly from the house tax (the second peak) and might even be called zero inflated. This occurs because the land tax is, in many cases, equal to zero, for example for apartments or houses with a very low land value. Attempting to predict the real estate tax value in its current shape could cause problems, as the (log-transformed) data is not normally distributed. Therefore, it is advisable to split the dependent variable, the real estate tax, into land tax and house tax and predict each individually. Here, we focus on the house tax prediction in order to reduce the complexity and length of this article.

The two processes driving the dependent variable


Independent Variables

Reviewing the independent variables offers a glimpse of the real estate market in Los Angeles. With over 50 independent variables, only a small selection is presented in this blog post. For example, the real estate total tax value is concentrated between $150,000 and $200,000. Second, most houses in the dataset were built before 1959, with a decline in the total number of constructions in later years.

Demographics comparison

Another interesting variable is the living area. The data indicates that the majority of the houses have around 1,500 square feet of living space, and that the house size ranges between roughly 200 and 2,400 square feet, excluding outliers.


Based on these insights, and on other exploratory findings not described in this blog post, hypotheses can be formulated.

Creating new variables based on Clustering

In order to predict the real estate tax value, a dataset is required that captures the influential factors which, combined, result in an accurate prediction. However, as indicated in the previous sections, the large number of missing values resulted in the loss of many columns that might have held valuable information. In order to retain some of this lost information, a K-means clustering analysis was executed, in which eight groups were identified. The number of groups was determined by observing the reduction in the gradient of the within-cluster variation as a function of the number of clusters. As this process can be arbitrary in the absence of a strong inflection point, the overall reduction of the within-cluster variation was inspected for different values of K, resulting in K = 8. Further analysis of this new variable indicated that the groups differ significantly with regard to the tax value and could therefore provide additional information that was not available in the existing variables. This new variable can now be used in the real estate tax prediction models.
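A minimal sketch of this elbow selection, assuming X is a scaled numeric matrix of the cleaned features; the scan range is illustrative, while K = 8 follows the text.

```r
# Elbow scan: total within-cluster variation for K = 1..15.
set.seed(42)
wss <- sapply(1:15, function(k) {
  kmeans(X, centers = k, nstart = 10)$tot.withinss
})
plot(1:15, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster variation")
# The curve flattens around K = 8; keep the label as a new factor variable.
df$cluster <- factor(kmeans(X, centers = 8, nstart = 10)$cluster)
```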

Causal model of hypothesized relationships for the housing tax

Hypothesis formulation and testing

With a robust dataset that has been cleaned and investigated for patterns, hypothesis formulation and testing can be performed. In daily practice, this process is mostly based on exploratory data analysis; from a statistical standpoint, however, it should be based on causal relationships, which can then be investigated through correlation studies. Therefore, a set of hypotheses was defined and subsequently tested through bivariate analysis.

In order to evaluate these hypotheses, Pearson's product-moment correlation, the Welch two-sample t-test and the Kruskal-Wallis rank sum test were used.
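The test names in the table below are exactly how base R labels the output of cor.test, t.test and kruskal.test, so the bivariate analysis plausibly looked like the following sketch (column names are from the Zillow data; the pool grouping is illustrative):

```r
# One example call per test family, on the cleaned data frame `df`;
# structuretaxvaluedollarcnt is the house (structure) tax value.
cor.test(df$bathroomcnt, df$structuretaxvaluedollarcnt)        # Pearson
t.test(structuretaxvaluedollarcnt ~ (poolcnt > 0), data = df)  # Welch t-test
kruskal.test(structuretaxvaluedollarcnt ~ factor(propertylandusetypeid),
             data = df)
```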

Variable                      Test                                  Statistic  P value    Correlation  Alt. hypothesis  Conclusion
airconditioningtypeid         Welch two-sample t-test               54.269     < 2.2e-16               Positive         Reject H0
bathroomcnt                   Pearson's product-moment correlation  179.33     < 2.2e-16  0.5962435    Positive         Reject H0
bedroomcnt                    Pearson's product-moment correlation  80.7       < 2.2e-16  0.3170245    Positive         Reject H0
buildingqualitytypeid         Pearson's product-moment correlation  -32.534    < 2.2e-16  -0.1335384   Positive         Cannot reject H0
calculatedfinishedsquarefeet  Pearson's product-moment correlation  175.53     < 2.2e-16  0.5880057    Positive         Reject H0
poolcnt                       Welch two-sample t-test               32.77      < 2.2e-16               Positive         Reject H0
yearbuilt                     Pearson's product-moment correlation  111.12     < 2.2e-16  0.4180533    Positive         Reject H0
unitcnt                       Pearson's product-moment correlation  -0.11402   0.9092     -0.0004722   Positive         Cannot reject H0
propertylandusetypeid         Kruskal-Wallis rank sum test          2231.5     < 2.2e-16               Difference       Reject H0
heatingorsystemtypeid         Kruskal-Wallis rank sum test          11822      < 2.2e-16               Difference       Reject H0

It is interesting to note that the building quality type, which was hypothesised to be higher for higher levels of structural tax, turned out to be negatively correlated. Consequently, we were not able to reject the null hypothesis, and the variable was not added to the model because the relationship was not logical from a causal perspective. Except for the unit count, which showed insignificant results, the other hypotheses indicated significant relations that allowed the null hypotheses to be rejected. These variables were therefore added to the various machine learning models.

Machine Learning to Predict the Housing Tax

In constructing a machine learning model, a researcher has a large number of options, so an initial selection must be made. This selection first focuses on the type of variable to be predicted, in this case a numeric variable, which makes this a regression problem. Within the regression family, various models are available, ranging from linear regression and lasso regression to random forests and boosted trees. With the aim of constructing a parsimonious model that predicts the real estate tax as accurately as possible, these four machine learning models are investigated. For all of the applied techniques, the data was split into training and test sets with an 80/20 ratio, and K-fold cross-validation was used for tuning.
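A minimal sketch of the split, assuming the cleaned data frame df; the seed is arbitrary, and train/test are reused by the sketches below.

```r
# 80/20 train/test split used for all models below.
set.seed(42)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
```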

Multiple Linear Regression

A multiple linear regression model aims to find the best linear unbiased estimators under the Gauss-Markov assumptions. In this model, multiple variables are combined to predict a single outcome, and the relationships between the independent variables and the dependent variable are assumed to be linear. Initial results from a linear regression model, in which the real estate tax value is modeled on the size of the house, the construction year, the property land type, the number of bathrooms and bedrooms, the type of air conditioning, the number of pools and the clustering variable introduced earlier, indicate that this model explains 66.4% of the variance in the data, which can be considered a medium fit. However, verification of the Gauss-Markov assumptions shows that the assumption of constant variance is violated, which makes the estimators' significance unreliable and raises the risk of an overfitted model. Consequently, to correct this violation of equal variance, which is driven by the violation of normality, a Box-Cox transformation is performed.
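A hedged sketch of this initial model; the predictor names are the matching Zillow columns, and cluster is the K-means variable from the previous section.

```r
# Initial linear model of the house (structure) tax value.
fit <- lm(structuretaxvaluedollarcnt ~ calculatedfinishedsquarefeet +
            yearbuilt + factor(propertylandusetypeid) + bathroomcnt +
            bedroomcnt + factor(airconditioningtypeid) + poolcnt + cluster,
          data = train)
summary(fit)$r.squared   # ~0.66 reported in the post
plot(fit, which = 1)     # residuals vs fitted: the constant-variance check
```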

The Box-Cox transformation aims to reduce the skewness of the dependent variable, which in turn should reduce the unequal variance in the model. The model results indicate that the R² decreases to 60.05%, compared with 66.4% without the transformation. This decrease can be explained by certain variables losing significance and no longer contributing to the explained variance. Consequently, the Box-Cox transformed model can be considered more parsimonious than the untransformed model.
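A minimal sketch of the correction using MASS::boxcox, which assumes a strictly positive response; the grid search for lambda and the refit are illustrative.

```r
# Pick the Box-Cox lambda that maximizes the profile likelihood, then refit.
library(MASS)
bc <- boxcox(fit, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
train$tax_bc <- (train$structuretaxvaluedollarcnt^lambda - 1) / lambda
fit_bc <- update(fit, tax_bc ~ .)   # same predictors, transformed response
```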

In a further attempt to create the best parsimonious model, automatic variable selection can be performed. This modelling technique is based on a multiple linear regression model in which the Bayesian information criterion (BIC) is used to determine the most parsimonious model out of all possible variable combinations. The downside of this technique is that the selected variables are no longer driven by underlying causal relations, but only by their contribution to reducing the residual error. This can result in models that are parsimonious but prone to overfitting. Nonetheless, the results indicate an R² of 60.08%, making this model as strong as the Box-Cox transformed model, although it is based on a different set of variables, most of which have no causal relation to the outcome. Consequently, of these three models, the Box-Cox transformed model is the most reliable and parsimonious.
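A sketch of such a search with leaps::regsubsets, which scores subsets of predictors by BIC; nvmax, the cap on model size, is an assumed setting rather than a value from the post.

```r
# Best-subset search scored by BIC; `nvmax = 15` is an assumption.
library(leaps)
search <- regsubsets(structuretaxvaluedollarcnt ~ ., data = train, nvmax = 15)
bic <- summary(search)$bic
coef(search, id = which.min(bic))   # coefficients of the BIC-optimal model
```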

Lasso Regression

In the previous section, variable selection was performed through the BIC; however, there are other options for selecting variables, such as Lasso regression. Lasso performs variable selection through shrinkage/regularization: it attempts to minimize the prediction error while also shrinking coefficients, driving some of them to zero and thereby limiting the number of variables used for prediction. The balance between goodness of fit and the prevention of overfitting is determined by the tuning parameter lambda, which was chosen through 10-fold cross-validation (see the figure below) as the value that minimizes the mean squared prediction error. Through Lasso regression, it was possible to improve the prediction to an R² of 68.1%, compared with 60.08% for the Box-Cox transformed model, an increase of eight percentage points.

Cross validation for Lasso Regression
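A sketch of the tuned Lasso with glmnet; building the numeric model matrix from all predictors is an assumption.

```r
# 10-fold cross-validated Lasso (alpha = 1) on a numeric model matrix.
library(glmnet)
x <- model.matrix(structuretaxvaluedollarcnt ~ ., data = train)[, -1]
y <- train$structuretaxvaluedollarcnt
cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
plot(cv)                     # the cross-validation curve shown above
coef(cv, s = "lambda.min")   # variables retained at the best lambda
```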

Random Forest

In the previous two machine learning approaches, the focus lay on using numerical variables for linear prediction, with categorical variables encoded as dummies.

However, as the dataset contains a multitude of categorical variables, the Random Forest machine learning method is introduced. Random Forest models are considered an important statistical pattern recognition tool for prediction with categorical variables. Like Lasso regression, the Random Forest algorithm requires cross-validation to determine its tuning parameters: the number of variables tried at each tree split and the total number of trees. Cross-validation indicated that trying four variables at each split provides the best fit while limiting computation time, and that 100 trees are sufficient to capture the total reduction in prediction error. Based on these tuning parameters, the Random Forest model predicts with an R² comparable to that of the Lasso regression model.

The selection process for the number of variables selected per tree split
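A sketch with the randomForest package, using the tuning values reported above (four variables per split, 100 trees); the last lines show how a test-set R² could be computed.

```r
# Random Forest with the reported tuning values: mtry = 4, ntree = 100.
library(randomForest)
rf <- randomForest(structuretaxvaluedollarcnt ~ ., data = train,
                   mtry = 4, ntree = 100)
pred <- predict(rf, newdata = test)
sst <- sum((test$structuretaxvaluedollarcnt -
            mean(test$structuretaxvaluedollarcnt))^2)
1 - sum((test$structuretaxvaluedollarcnt - pred)^2) / sst  # test-set R^2
```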

Boosting

The mean squared error for the number of trees

With the aim of constructing a model that predicts the real estate tax value as closely as possible, a boosting model is used. Boosting is based on tree bagging, which reduces the prediction variance, but in addition each new tree is built on the errors of the previous model. This technique enhances predictive power on the training dataset but is prone to overfitting, which hurts performance on the test dataset. Fitting a boosting model requires three tuning parameters: the shrinkage, the tree depth and the number of trees. Through cross-validation, calculation of the mean squared error and the boosting test error plot (presented above), the tuning parameters were determined to be a shrinkage of 0.001 and a depth of 4. Based on these parameters, the boosting model's R² on the training data is 88.1%, a strong improvement over the random forest model. However, as indicated earlier, boosting is prone to overfitting the training data, which implies weak out-of-sample prediction; validation on the test dataset yields an R² of only 37.8%, a very sharp decline in predictive power.
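A sketch of this fit with the gbm package; the shrinkage and depth are the reported values, while the tree budget of 5,000 and the CV-based stopping rule are assumptions.

```r
# Boosted trees with the reported shrinkage (0.001) and depth (4).
library(gbm)
set.seed(42)
boost <- gbm(structuretaxvaluedollarcnt ~ ., data = train,
             distribution = "gaussian", n.trees = 5000,
             shrinkage = 0.001, interaction.depth = 4, cv.folds = 10)
best_iter <- gbm.perf(boost, method = "cv")   # trees minimizing CV error
pred <- predict(boost, newdata = test, n.trees = best_iter)
```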

Conclusions and limitations

Within this project, a multitude of machine learning algorithms was used with the aim of predicting the real estate tax value, in order to automate the real estate valuation process and reduce the bias in the California tax system. Overall, it can be concluded that the models predict the real estate tax value with medium accuracy, as shown in the discussion of the machine learning models, with the Random Forest model presenting the best results. This medium fit is partly the result of the poor quality of the dataset used in this analysis; with better information, in particular fewer missing values, higher accuracy could be reached. However, the analysis of the California tax system also revealed an underlying problem that prevents accurate prediction of the tax value: as indicated in the introduction, the tax value of a house is determined at the moment the house is sold. This implies that two identical properties of equal value can show a great amount of variation in their assessed value, even if they stand next to each other. Consequently, with this dataset, it is impossible to capture 100% of the variance, even with overfitted models.

Overall, this research project can serve as a proof of value. It points out multiple shortcomings in the California tax system and indicates that predicting the real estate tax value might be a good way to automate the real estate tax evaluation process. Nonetheless, further research would benefit from a complete dataset containing the actual market value of each house, in order to prevent misclassification of a household's real estate tax value due to the time dimension in the current tax value assessment.

About Authors

Steven Jongerden

Steven graduated summa cum laude from the Delft University of Technology with a Master's degree in Engineering and Policy Analysis and a Bachelor's degree in Aerospace Engineering. He is currently a Data Science Consultant employed by Capgemini Netherlands....

I-Peng Liu


Jing Wang

Jing received his PhD in Biology from the City University of New York in 2008. He then continued his post-doctoral research at Mount Sinai Medical Center, investigating the molecular and cellular mechanisms of neurodegenerative diseases. He has authored numerous research...

Huanghaotian Fu

