
Linear and Random Forest Regression Analysis on House Prices

Gary Simmons
Posted on Oct 21, 2022

Introduction

Having trouble finding affordable housing in your city? Setting aside the recent run of inflation, house prices typically vary based on the many features attached to each house. This project focuses on training, testing, and comparing a linear regression and a random forest regressor on the famous Ames housing and real estate data, a dataset used to challenge data scientists on how to train regression models. The dataset I used came from a New York City Data Science Academy machine learning project folder. It draws on the Ames Housing Data for the main features of interest and the Ames Real Estate Data for MapRefNo, which is used later to obtain coordinates and calculate each house's distance from Iowa State University (ISU). Information on what each feature represents can be found in the data documentation on Kaggle (see References).

Methods

I. Data Cleaning Pipeline

First, I focus on the Ames Housing Data and convert MSSubClass and MoSold to strings. Afterwards, I fill all nan values in string columns with the literal string 'nan'. Once that is done, I make dummy columns for all string columns, dropping the first dummy level of each. I build a dictionary called relatedDummiesDictionary that stores each dummy column as a key whose value is a list containing that column name along with the names of its sibling dummy columns. I then save relatedDummiesDictionary as relatedDummiesDictionary.json. Now that the dataset has almost all the columns I need, I perform the train-test split and save the train and test data as .csv files.
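
The cleaning steps above roughly translate to the pandas/scikit-learn sketch below. The file names, the housing variable, and the prefix-based sibling lookup are illustrative assumptions, not the project's exact code.

```python
import json
import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.read_csv('AmesHousingData.csv')  # hypothetical file name

# Treat MSSubClass and MoSold as categorical strings
housing[['MSSubClass', 'MoSold']] = housing[['MSSubClass', 'MoSold']].astype(str)

# Fill missing values in string columns with the literal string 'nan'
stringCols = housing.select_dtypes(include='object').columns
housing[stringCols] = housing[stringCols].fillna('nan')

# One-hot encode the string columns, dropping the first dummy level of each
dummies = pd.get_dummies(housing[stringCols], drop_first=True)

# Map every dummy column to the list of itself plus its sibling dummies
relatedDummiesDictionary = {}
for col in stringCols:
    siblings = [c for c in dummies.columns if c.startswith(col + '_')]
    for dummy in siblings:
        relatedDummiesDictionary[dummy] = siblings

with open('relatedDummiesDictionary.json', 'w') as f:
    json.dump(relatedDummiesDictionary, f)

housing = pd.concat([housing.drop(columns=stringCols), dummies], axis=1)
train, test = train_test_split(housing, test_size=0.2, random_state=42)
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
```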

Now it is time to get the distance between each house and Iowa State University with the method util.returnDFWithISUDistance. In that method, I take the PID column of the Ames Housing Data and turn it into a data frame. I then merge this new data frame with the MapRefNo and Prop_Addr columns of the Ames Real Estate Data, joining the housing PID to the real estate MapRefNo. Then, using the GoogleV3 geocoder, I get the coordinates of ISU ('Iowa State University, Ames, USA') and of each house address (Prop_Addr + ', Ames, IA, USA'). With those coordinates, I calculate the distance between ISU and each house in miles via geopy.distance.great_circle. These distances are saved as the ISUDistance column in the merged data frame. Lastly, I attach ISUDistance back to the original data frame by joining PID to MapRefNo.
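
A condensed reconstruction of what util.returnDFWithISUDistance does, based only on the description above; the API key argument and missing-address handling are assumptions rather than the project's exact implementation.

```python
from geopy.geocoders import GoogleV3
from geopy.distance import great_circle

def returnDFWithISUDistance(housing, realEstate, apiKey):
    geolocator = GoogleV3(api_key=apiKey)
    isu = geolocator.geocode('Iowa State University, Ames, USA')
    isuCoords = (isu.latitude, isu.longitude)

    # Join the housing PID to the real estate MapRefNo to recover addresses
    merged = housing[['PID']].merge(
        realEstate[['MapRefNo', 'Prop_Addr']],
        left_on='PID', right_on='MapRefNo', how='left')

    def distanceInMiles(address):
        location = geolocator.geocode(f'{address}, Ames, IA, USA')
        if location is None:
            return None
        return great_circle(isuCoords, (location.latitude, location.longitude)).miles

    merged['ISUDistance'] = merged['Prop_Addr'].apply(distanceInMiles)

    # Attach ISUDistance back to the original frame via PID and MapRefNo
    return housing.merge(merged[['MapRefNo', 'ISUDistance']],
                         left_on='PID', right_on='MapRefNo', how='left')
```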

After acquiring distances, I can drop all string columns. I gather the modes of OverallQual and OverallCond and the means of YearBuilt, YearRemodAdd, GarageYrBlt, YrSold, and ISUDistance. The collected mode and mean values become the nan replacement values for the columns they represent; all other nan values become zeros. These nan replacement values are saved as a dictionary and json file to be reused by the method util.replaceNansWithTrainingDataValues during nan replacement for both the training and testing datasets.
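
A minimal sketch of building and reusing the nan-replacement dictionary; the json file name and the body of util.replaceNansWithTrainingDataValues are reconstructed from the description, and train/test refer to the frames saved earlier.

```python
import json

modeCols = ['OverallQual', 'OverallCond']
meanCols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold', 'ISUDistance']

nanReplacementValues = {col: float(train[col].mode().iloc[0]) for col in modeCols}
nanReplacementValues.update({col: float(train[col].mean()) for col in meanCols})

with open('nanReplacementValues.json', 'w') as f:  # hypothetical file name
    json.dump(nanReplacementValues, f)

def replaceNansWithTrainingDataValues(df, replacementValues):
    df = df.copy()
    for col, value in replacementValues.items():
        df[col] = df[col].fillna(value)
    return df.fillna(0)  # all other nan values become zeros

train = replaceNansWithTrainingDataValues(train, nanReplacementValues)
test = replaceNansWithTrainingDataValues(test, nanReplacementValues)
```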

Here lies the fork in the road between Linear Regression Preparation (Methods II) and Random Forest Regression Preparation (Methods III).

II. Linear Regression Preparation

II.a) Elastic Net

Adding all columns to a linear regression is not optimal due to issues like heteroskedasticity and collinearity. To help select the best columns, I started with an elastic net with an alpha penalty of 1000 and an l1-ratio of 0.5, which is half Lasso and half Ridge. Originally, I wanted to use GridSearchCV to find the best alpha and l1-ratio, but that led to an alpha of 1 and an l1-ratio of 0. Since that is a ridge regression, which never zeroes out coefficients, it would make this filtering step useless. I wanted to keep this step to ensure model accuracy, so I imposed a heavy alpha penalty of 1000. As for the 0.5 l1-ratio, I wanted the best of both worlds between Lasso and Ridge: pure Lasso eliminates unnecessary columns by shrinking their coefficients to zero, while Ridge keeps small non-zero coefficients on non-essential columns. Whatever survives the elastic net is used for a pure, penalty-free linear regression. The final model is a pure linear regression because it is simpler to relate coefficient values to dollar values without having to explain error penalties.
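
For reference, this screening step amounts to something like the following scikit-learn snippet, where X_train and y_train stand in for the prepared feature matrix and sale prices.

```python
from sklearn.linear_model import ElasticNet

# Heavy penalty (alpha=1000), half Lasso / half Ridge (l1_ratio=0.5)
enet = ElasticNet(alpha=1000, l1_ratio=0.5)
enet.fit(X_train, y_train)  # X_train, y_train are placeholders

# Only columns whose coefficients survive the penalty move on to the
# penalty-free linear regression
survivingColumns = X_train.columns[enet.coef_ != 0]
```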

II.a.i) Making Gross Living Area the base of the model

The base columns of this model start with Gross Living Area, the total finished living area above ground. The NYCDSA proposal folder suggested looking at Gross Living Area because it was the feature with the highest f_regression score in relation to price. The same folder also expressed that homes above and below the 80th percentile of Gross Living Area within each neighborhood sell at different rates, and that this difference amounts to a discount for large houses in a neighborhood. Following the suggestion, I group the data frame by neighborhood and find the 80th percentile of each neighborhood. Afterward, I create two new columns, smallGrLivArea and largeGrLivArea. For each house, if GrLivArea is smaller than the 80th percentile of its neighborhood, its smallGrLivArea equals GrLivArea and its largeGrLivArea is zero. In contrast, houses at or above the 80th percentile of their neighborhood have a smallGrLivArea of zero and a largeGrLivArea equal to GrLivArea. Once both columns are established, I drop GrLivArea from the dataframe.
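
A short pandas sketch of the 80th-percentile split, assuming the training frame is called train and contains Neighborhood and GrLivArea columns.

```python
# 80th percentile of GrLivArea within each house's neighborhood
neighborhood80th = train.groupby('Neighborhood')['GrLivArea'].transform(
    lambda s: s.quantile(0.8))

isSmall = train['GrLivArea'] < neighborhood80th
train['smallGrLivArea'] = train['GrLivArea'].where(isSmall, 0)
train['largeGrLivArea'] = train['GrLivArea'].where(~isSmall, 0)
train = train.drop(columns='GrLivArea')
```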

The proposal document also notes that both of these new columns follow power laws, found by taking the slope of their log-log plots. The power-law relation for each column is found, two new columns are created and relabeled with the power laws, and the old columns are dropped. These two columns form the base for the rest of the linear portion of the project.
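
One way to estimate those power-law exponents is the slope of a log-log fit, sketched below; the relabeling convention and the use of SalePrice as the target are assumptions.

```python
import numpy as np

def powerLawExponent(x, y):
    # Slope of the log-log plot approximates the power-law exponent
    mask = (x > 0) & (y > 0)
    slope, intercept = np.polyfit(np.log(x[mask]), np.log(y[mask]), 1)
    return slope

smallPower = powerLawExponent(train['smallGrLivArea'], train['SalePrice'])
train[f'smallGrLivArea^{smallPower:.2f}'] = train['smallGrLivArea'] ** smallPower
train = train.drop(columns='smallGrLivArea')  # repeat for largeGrLivArea
```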

II.a.ii) Storing significant correlations

I calculate Spearman correlations between all pairs of variables. If the p-value is greater than 0.05, the correlation is considered insignificant and discarded. The remaining correlations are then ranked by absolute correlation value. I experimented and picked a percentile cutoff; for this study I chose the 97.5th percentile. The pairs with absolute R values larger than the 97.5th percentile are stored in significantCorrelationsDictionary and significantCorrelationsDictionary.json. Also, if a pair involves a dummy variable with siblings defined by relatedDummiesDictionary, mentioned above, the pairings of those sibling dummies with the partnered variable are also recorded in significantCorrelationsDictionary. For example, suppose FeatureA has a significant correlation with FeatureB_4, which has sibling dummies FeatureB_2, FeatureB_3, and FeatureB_5. We would then record the pairings FeatureA-FeatureB_2, FeatureA-FeatureB_3, FeatureA-FeatureB_4, and FeatureA-FeatureB_5 in significantCorrelationsDictionary.
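
The correlation bookkeeping could look roughly like this; the pairwise loop and dictionary layout are reconstructed from the description rather than taken from the project code.

```python
import json
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

pairs = []
for colA, colB in combinations(train.columns, 2):
    rho, p = spearmanr(train[colA], train[colB])
    if p <= 0.05:                      # keep only significant correlations
        pairs.append((colA, colB, abs(rho)))

cutoff = np.percentile([r for _, _, r in pairs], 97.5)

significantCorrelationsDictionary = {}
for colA, colB, r in pairs:
    if r > cutoff:
        # Expand dummy variables on either side to include their siblings
        for a in relatedDummiesDictionary.get(colA, [colA]):
            for b in relatedDummiesDictionary.get(colB, [colB]):
                significantCorrelationsDictionary.setdefault(a, []).append(b)

with open('significantCorrelationsDictionary.json', 'w') as f:
    json.dump(significantCorrelationsDictionary, f)
```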

II.a.iii) Applying the model lift algorithm

Seeing that f_regression was the main reason GrLivArea was suggested as the first feature to consider, I sorted the remaining features by f_regression score from highest to lowest. The features created from smallGrLivArea and largeGrLivArea served as the base model. I then looped through the sorted f_regression list with the following algorithm, which can be broken into two parts.
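
Assuming the standard scikit-learn f_regression was used, the ranking step looks roughly like this:

```python
from sklearn.feature_selection import f_regression

candidates = [c for c in train.columns
              if c != 'SalePrice' and 'GrLivArea' not in c]
fScores, _ = f_regression(train[candidates], train['SalePrice'])
sortedFeatures = [c for _, c in
                  sorted(zip(fScores, candidates), reverse=True)]
```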

II.a.iii.A) Create a new model by adding the next variable(s) from the sorted f_regression list to the base model

I first check whether the selected feature correlates with any feature in the base model. If the correlation in question is in significantCorrelationsDictionary, or is a correlation with one of its sibling dummies, I skip the feature and move on to the next one. After passing the correlation checks, I either add the feature(s) directly or transform the feature before adding it to a new model composed of the base model plus the feature(s) in question. If it is a dummy variable, I add the dummy feature and its sibling dummies directly to the new model and move to Methods II.a.iii.B. If the feature has fewer than 15 unique values, I use util.engineerSmallFeature, which tries to fit three simple linear regressions between price and three versions of the feature: price vs feature, price vs feature^power, and price vs boxcox(feature). If the power or boxcox version is chosen, the column is relabeled with the transformation parameters to be used for descriptive analysis later. The best-scoring of these three small regressions gets added to the new model.
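
A rough reconstruction of util.engineerSmallFeature under those rules; the candidate power grid and the labeling scheme are hypothetical.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression

def engineerSmallFeature(x, y):
    """Return the best of price vs feature, feature^power, and boxcox(feature)."""
    candidates = {'raw': x.values.reshape(-1, 1)}
    for power in (0.5, 2, 3):                      # hypothetical power grid
        candidates[f'power_{power}'] = (x.values ** power).reshape(-1, 1)
    if (x > 0).all():                              # boxcox needs positive values
        transformed, lam = boxcox(x.values)
        candidates[f'boxcox_{lam:.2f}'] = transformed.reshape(-1, 1)

    scores = {name: LinearRegression().fit(X, y).score(X, y)
              for name, X in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best]                  # label carries the parameters
```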

All other continuous variables go through a more rigorous process with util.IsHomoskedastic and util.engineerFeature. The method util.IsHomoskedastic fits a simple linear regression with the x and y values it is given, makes predictions on the x values, and sends the residuals to two methods, util.IsCentered and util.retrieveSpreadValue. util.IsCentered splits the residuals into three even bins along the predicted axis; if zero falls within the 95% confidence interval of the residuals in all three sections, the feature passes the centering test. util.retrieveSpreadValue sections the residuals into the same three bins as util.IsCentered, but compares the largest and smallest residual variances of the three sections via a spread value (largest variance divided by smallest variance).

If the spread value is less than 1.5, the feature passes the spread test. If the feature passes both the centering and spread tests in util.IsHomoskedastic, it is added directly to the new model. If the feature fails either test, it is analyzed via util.engineerFeature, which tries to fit a simple linear regression of price vs feature^power. The modified power feature is then retested with util.IsHomoskedastic. If it passes, the new power relation is passed to section II.a.iii.B; if not, the feature is boxcoxed and then passed to II.a.iii.B.
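
Putting those checks together, a sketch of util.IsHomoskedastic might look like the function below; the exact confidence-interval construction and the equal-count binning are my interpretation of the description.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def isHomoskedastic(x, y, spreadThreshold=1.5):
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    model = LinearRegression().fit(x, y)
    predictions = model.predict(x)
    residuals = y - predictions

    # Three even sections along the predicted axis
    order = np.argsort(predictions)
    sections = np.array_split(residuals[order], 3)

    variances = []
    for section in sections:
        # Centering test: zero must fall inside the section's 95% confidence
        # interval (interpreted here as the CI of the mean residual)
        mean = section.mean()
        halfWidth = 1.96 * section.std(ddof=1) / np.sqrt(len(section))
        if not (mean - halfWidth <= 0 <= mean + halfWidth):
            return False
        variances.append(section.var(ddof=1))

    # Spread test: largest variance / smallest variance must stay below 1.5
    return max(variances) / min(variances) < spreadThreshold
```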

II.a.iii.B) Cross validate new model and determine whether to keep or discard the new feature(s)

Now it is time to cross-validate the model with 5-fold validation. If the cross-validated test score of the new model is better than that of the base model, the new model becomes the new base. If not, the new feature(s) are discarded and the algorithm restarts with the next feature(s) in the list.
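
The keep-or-discard decision reduces to a comparison of 5-fold cross-validation scores, roughly as sketched below, with X_base, X_new, and y standing in for the base design matrix, the candidate design matrix, and sale prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

baseScore = np.mean(cross_val_score(LinearRegression(), X_base, y, cv=5))
newScore = np.mean(cross_val_score(LinearRegression(), X_new, y, cv=5))

if newScore > baseScore:
    X_base, baseScore = X_new, newScore   # the new model becomes the base
# otherwise the new feature(s) are discarded and the loop moves on
```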

II.b) Regular linear regression

After the algorithm completed, I trained a regular linear regression model with the final collection of columns chosen in II.a and used the trained model for testing and for the feature-description analysis.

III. Random Forest Regression Preparation

Originally, I used GridSearchCV to optimize the random forest parameters; however, I got impatient and switched to the following stepwise methodology. First, I looped over the number of estimators, n_estimators, set to 100, 500, and 1000, and kept whichever value got the best cross-validated test score. That value happened to be 1000.

Keeping that n_estimators value, I followed the same methodology, but now looping through the maximum tree depth, max_depth, set to 10, 20, and None. When max_depth is None, the trees in the regressor keep splitting until their leaves are pure (or too small to split further). Providing None as an option for max_depth is one reason why GridSearchCV was taking so long.

The best max_depth was 20. With n_estimators = 1000 and max_depth = 20, I did one final loop, now over max_features, the number of features considered at each split, set to 'sqrt', 'log2', and None. At each split, 'sqrt' looks at sqrt(number of columns), 'log2' examines log2(number of columns), and None reviews every column. Providing the None option for max_features also bogged down GridSearchCV. The final random forest used n_estimators = 1000, max_depth = 20, and max_features = None.
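
The stepwise search described in this section can be sketched as follows; the helper function and random_state are assumptions, with X_train and y_train as the prepared training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def bestValue(paramName, values, fixedParams, X, y):
    # Try each candidate value while holding the already-chosen parameters fixed
    scores = {}
    for value in values:
        kwargs = dict(fixedParams, random_state=42)
        kwargs[paramName] = value
        model = RandomForestRegressor(**kwargs)
        scores[value] = np.mean(cross_val_score(model, X, y, cv=5))
    return max(scores, key=scores.get)

params = {}
params['n_estimators'] = bestValue('n_estimators', [100, 500, 1000],
                                   params, X_train, y_train)   # -> 1000
params['max_depth'] = bestValue('max_depth', [10, 20, None],
                                params, X_train, y_train)      # -> 20
params['max_features'] = bestValue('max_features', ['sqrt', 'log2', None],
                                   params, X_train, y_train)   # -> None

forest = RandomForestRegressor(**params, random_state=42).fit(X_train, y_train)
```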

Results

I. Random Forest Regression Feature Importances

Above are the top 10 features of importance used by the Random Forest to determine house prices. In order, these features represent overall quality, gross living area, 1st floor square footage, total basement square footage, garage area, finished basement type 1 square footage, lot area, year built, 2nd floor square footage, and garage size in car capacity. Unfortunately, many of these features could not appear in the linear regression because of the rigorous significant-correlation analysis; however, their correlated partners did survive. For Results II, I focus on the top 5 deciding continuous variables (a-c), the top 2 categorical variables (d-e), and the surviving correlated variables that are related to the random forest's top 10 features of importance (f-k).

II. Feature Analysis and Correlations

II.a) Gross Living Area

Houses with a Gross Living Area smaller than the 80th percentile of their neighborhood had a Gross Living Area price contribution of $70.17/ft^2 x GrLivArea.

Homes with a Gross Living Area at or above the 80th percentile of their neighborhood had a Gross Living Area contribution of $39.55/ft^2.16 x GrLivArea^1.08.

II.b) Basement Full Bathrooms & Basement Half Bathrooms

The basement half bath contribution was $6136 x (number of basement half baths)^0.7.

When looking at the number of basement full bathrooms, the price increased by $15561 for each basement full bath.

II.c) Distance from Iowa State University

Ames house prices increase by $800.13 per mile in distance away from Iowa State University.

II.d) Exterior Quality

The model defaulted the exterior quality of the house to excellent. Average/typical exterior quality dropped the price by $77k, fair exterior quality dropped it by $70k, and good exterior quality dropped it by $64k. There were no houses with poor exterior quality in the dataset.

II.e) Home Functionality

The home functionality contribution used Major Deductions 1 as its default. Homes with Major Deductions 2 dropped in price by $19k, and homes with Salvage Only functionality dropped by $40k. Minor Deductions 2 cases had a price increase of $13k, Minor Deductions 1 cases increased by $14k, and Moderate Deductions cases increased by $15k.

II.f) Garage Type (correlates with 1st floor sqft and year built)

More than one type of garage was the default selected for Garage Type. Attached to home garage types increased the price by $12k. Detached from home garage types gained $9k. Homes with no garage gained $14k. The houses with a basement garage dropped $4k. Carport garages increased the price by $2k. Built-in garages dropped the default price by $75.

II.g) MSZoning Code (correlates with lot area)

Agricultural zoning is the default for this model. Building in the Floating Village Residential area gives a $157.87 boost, and houses in commercial areas have a $6k boost. Industrially zoned houses gained an additional $10k on top of the default price. Residential low, medium, and high density locations increased the price by $30k, $19k, and $30k respectively.

II.h) Enclosed Porch (correlates with year built)

The house price value goes up by $2.53 for every square foot of the enclosed porch.

II.i) Overall Condition (correlates with year built)

For each rating increase of overall condition, the price increased by $4k.

II.j) Paved Driveway (correlates with year built)

The model defaulted with dirt/gravel driveways. Partial pavement driveways increased the price by $2k. Paved drives increased in price by $5k.

II.k) Garage Condition (correlates with size garage in car space (garagecars))

The linear model selected excellent garage condition as its default. Good garages lowered the price by $14k. Typical/average garages demoted the price by $20k. Fair garages received a $23k discount. Poor garages reduced the price by $24k. Lastly, having no garage diminishes the house price by $31k.

III. Comparisons

Taking the square root of the mean squared error gives an average price error between the actual and predicted values of a house. The linear regression model had an average price error of $26501.33 on the training data and $25268.16 on the test data. For the random forest, the training set had an average price error of $9489.71 and the testing set had an average price error of $20701.15.
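
For reference, each reported error is simply the root mean squared error of a model's predictions, e.g. for the test set (model, X_test, and y_test are placeholders):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```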

Discussion

I. Linear Regression Cost analysis

When comparing the Gross Living Area rates between small and large houses, the discount for larger houses does exist, as the proposal suggested.

Originally, I assumed that, since the proposal suggested most residents work at Iowa State University, houses closer to campus would be more expensive. However, discussions with audience members when I presented this work suggested that closer houses are cheaper due to student housing rates.

With regard to MSZoning, the Floating Village Residential area is a retirement community, so the price increase over the agricultural default is small. Commercial and industrial areas are more desirable than agricultural areas but less preferred than residential areas, so a $6k-$10k increase over agricultural land is understandable.

The interesting observation regarding zoning is that residential medium density is only a $19k increase over agricultural land, while the high and low density residential areas carry a $30k increase. I speculate that this discount for residential medium density comes from balancing a combination of other factors, good and bad, that differ from its extreme counterparts; for that reason, it sits in a Goldilocks zone.

Everything else mentioned follows standard expectations. Adding more bathrooms or improving the quality or condition of a certain part of the house increases the price value as expected, while fewer bathrooms, lower quality, or worse conditions lower it. More expensive features mean a higher house value.

II. Model comparison

My Random Forest model predicts prices more accurately than my linear regression model. However, comparing the $9k error on training data with the $20k error on test data shows that the random forest has high variance and is overfit. The linear regression model, in contrast, has a slight bias, with a $27k error on training data and a $25k error on test data. This small bias is caused by the combination of cross validation and only including columns that give the model a lift.

Conclusion and Future Works

It should be noted that random forests are good predictors but are difficult to interpret. Linear regression models, on the other hand, are well suited to descriptive modeling; however, if columns with high feature importance are excluded from the linear regression, we might not capture the true reasons why houses are priced at those values. Thus, for the next iteration of this project, I will start with a random forest regressor on the training data and return to the original idea of optimizing the forest's parameters with a grid search, but with finite values for the number of estimators, the maximum depth of the trees, and the maximum number of features considered at each split. In addition to excluding the None option from the previously mentioned parameters, I will add the splitting criterion as a parameter to the grid search.

After optimizing the random forest via the grid search, I will perform the same column addition process as mentioned in the methods, with the following changes:

  1. Any feature with fewer than 15 unique values should be considered categorical and turned into dummy columns. Analysis of these small numerical sets shows that ratings, qualities, and conditions may be ordinal, but their influence on price is not exactly linear.
  2. Dummy-variable correlation analysis with other features will be excluded. Important features were excluded in this iteration because of dummy-variable correlation, so I will permit these features to skip correlation analysis and let them attempt cross validation.
  3. After starting the linear model with the small and large Gross Living Area contributions, column addition via elastic net (alpha = 1000, l1-ratio = 0.5) and cross validation will be done in order of the Random Forest's feature importances.

Once satisfied with the training of both models, I will evaluate them on the testing data and look at the descriptive analysis of the linear model.

Nevertheless, the final models of this iteration achieved price errors of $20k-25k; the linear model is safe to use for descriptive analysis, and the logic behind the price contributions makes sense. Bathroom counts, areas, the conditions and quality of particular features, and location play a significant part in how a house is typically priced, excluding inflation. Hopefully, our economy will improve and people will be able to buy houses that they can turn into homes.

References

Featured Image : "Aerial photography villa complex luxury resort" - Image by dashu83 on Freepik

All data used in this project comes from a NYCDSA Machine Learning Project proposal folder, but the dataset can be reconstructed by combining the following documents:

Suggested Ames Real Estate Data: https://www.cityofames.org/home/showpublisheddocument/58715/637843112781470000

Suggested Ames Housing Data: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset

Data Documentation: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=data_description.txt

Github: https://github.com/GGSimmons1992/AmesMachineLearningProject

 

About Author

Gary Simmons

Open-minded and tenacious data scientist and machine learning programmer familiar with large dataset analysis, Angular user interface enhancement, .NET Core REST API problem solving, and relational database management. My Applied Physics BS, Physics MS, and software development background...
