Team Machine Learning Project

Benjamin Roberts, Andrew Brainerd, Yufeng, Ning Sun and Dev Dabbara
Posted on Dec 15, 2017

A quick introduction

Kaggle is a sort of data science candyland in which one can pick and choose from almost any data set imaginable. It is also a website that lets data scientists compete to see whose models are most effective at predicting a given target. Though the competition had already closed, our team's goal was to predict the housing prices of Ames, Iowa, submitting our models to the public leaderboard.

This was an exciting opportunity because such information can be quite useful: real estate agents, bankers, home buyers, and home sellers could all benefit from knowing the most powerful predictive features identified by the most successful models.

Data Described

The data was composed of 80 variables, 79 of which were predictors. These ranged from categorical variables, such as the neighborhood in which the house was sold, to quantitative variables, such as the square footage of the first floor. The target was a single quantitative variable: sale price in U.S. dollars. The data came pre-split into a training set and a test set, each given as a CSV file, totaling 1,460 rows in the training set and 1,459 rows in the test set (the latter without sale prices).
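As a minimal sketch of this setup (file names assumed from the Kaggle competition), the two sets can be loaded and checked with pandas:

import pandas as pd

# Load the pre-split competition data (file names are assumptions)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape, test.shape)   # 1,460 training rows; 1,459 test rows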

Data Visualized

It's always great to start with a correlation plot. You can see here that the features most strongly related to sale price are, not surprisingly, overall quality (OverallQual) and above-ground living area (GrLivArea). The feature pairs most strongly related to each other are the year the house was built with the year the garage was built, and the first-floor square footage with the total basement square footage. Neither relationship came as a surprise.

[Correlation plot]
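A plot like the one above can be reproduced with a sketch along these lines (seaborn assumed; 'train' comes from the loading snippet earlier):

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of pairwise correlations among the numeric features
corr = train.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()

# Features most correlated with the target
print(corr['SalePrice'].sort_values(ascending=False).head(10))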

The graph below shows, in greater detail, the features that appear to have strong linear relationships with sale price. The scatterplot visualization lets you see the linearity of the data and the degree of variance in each feature. Linear features follow the line of best fit, as in the case of overall quality predicting sale price.

Although linear, this relationship appears to vary more than the one between total square footage and basement square footage, since the data points do not cluster as tightly around the line of best fit. Perhaps total square footage and total basement square footage could even be collapsed into a single variable. Overall, this is a very nice visualization that gives a more detailed picture of the relationships between the features.

[Scatter plots of the strongly correlated features]

The two graphs below show the missing data. The first shows the features with the most missing values; the second is merely an extension of the first. Don't be deceived by the relative size of the bars: the tallest bar in the first graph, PoolQC, represents over 2,500 missing values, while the tallest bar in the second graph, MSZoning, represents only 4.

Some of these missing values are not really missing at all: they indicate the absence of a feature, as in the case of fences. It's not that the fence records are incomplete; it's that certain homes simply have no fence. This distinction plays a major role in how the data should be cleaned. Missing garage values follow the same pattern.

Features with a lot of missing values.

Features with only a few missing values.
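The counts behind these two graphs can be tallied with a sketch like the following (the combined frame house_all matches the naming used in the snippets later in the post):

# Combine train and test before counting gaps (test has no SalePrice)
house_all = pd.concat([train.drop('SalePrice', axis=1), test])
missing = house_all.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])   # PoolQC leads with over 2,500 missing values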

 

Data Cleaned and Fun

The first task was to identify which features were quantitative. For categorical features, NA values were imputed in Python, either with the most common value or with a sentinel marking the absence of the feature, e.g.:

all_data.loc[all_data.Alley.isnull(), 'Alley'] = 'NoAlley'

Numeric basement features were filled with their medians:

house_all.loc[house_all.BsmtUnfSF.isnull(), 'BsmtUnfSF'] = house_all.BsmtUnfSF.median()
house_all.loc[house_all.BsmtFinSF1.isnull(), 'BsmtFinSF1'] = house_all.BsmtFinSF1.median()

Two one-off cases remained: a single null among the garage features, belonging to a house with GarageType 'Detchd', and a single null SaleType, which we filled with the most common option, 'WD'.

Many columns have missing values, and not all can be treated the same way. Some are missing because they simply do not apply to the house in question (e.g., columns describing fireplaces, pools, or garages for houses that have none). The rest of the numerical values were imputed with the mean or median. With the gaps filled, we could move on to feature transformations.
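A hedged sketch of both imputation strategies, using real Ames column names but our own choice of sentinel:

# Sentinel imputation: NA here means the house lacks the feature entirely
for col in ['Fence', 'PoolQC', 'FireplaceQu', 'GarageType']:
    house_all.loc[house_all[col].isnull(), col] = 'None'

# Remaining numeric gaps get the column median
num_cols = house_all.select_dtypes(include='number').columns
house_all[num_cols] = house_all[num_cols].fillna(house_all[num_cols].median())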

Feature transformation is critical for meeting the assumptions of several models, including ridge and lasso regression. One assumption of these models is that the data are normally (bell-curve) distributed, with points clustered around the mean. The target variable had to be transformed first.

We found that the original sale price was skewed to the right, but that a log transformation (log of 1 + sale price) did an excellent job of making the data symmetric around the mean. Lastly, some of the categorical variables needed to be dummified to incorporate them into the regression models. Dummification converts each category into a 1 or 0 depending on whether the condition holds: if a house had a garage, it was assigned a 1; if not, a 0. This made all of the data numeric, so it could be fed into regression models.
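In code, the target transformation and the dummification might look like this (numpy's log1p computes log(1 + x)):

import numpy as np

# The right-skewed target becomes roughly symmetric under log(1 + x)
y = np.log1p(train['SalePrice'])

# Dummification: expand each categorical column into 0/1 indicator columns
house_all = pd.get_dummies(house_all)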

The same idea was applied to skewed predictors via a skewness cutoff, and any remaining categorical gaps were filled, e.g.:

SKEWNESS_CUTOFF = 0.75
house_all.loc[house_all.SaleCondition.isnull(), 'SaleCondition'] = 'Normal'
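One common way to apply such a cutoff, sketched here with scipy's skew (an assumption, not necessarily the team's exact code):

from scipy.stats import skew
import numpy as np

# Log-transform every numeric feature whose skewness exceeds the cutoff
numeric = house_all.select_dtypes(include='number').columns
skewness = house_all[numeric].apply(lambda s: skew(s.dropna()))
skewed_cols = skewness[skewness > SKEWNESS_CUTOFF].index
house_all[skewed_cols] = np.log1p(house_all[skewed_cols])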

The combined data, of course, must then be split back into the training and test sets.
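Since the combined frame preserves row order, the split is a simple slice:

# Recover the original train/test rows from the combined frame
ntrain = train.shape[0]     # 1,460 training rows
X_train = house_all[:ntrain]
X_test = house_all[ntrain:]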

Once the data had been thoroughly cleaned, we could attempt feature engineering to see whether we could improve a model's predictions. "YrSold" and "MoSold", although stored as numbers, are really categorical variables and were converted accordingly. As mentioned in the data visualization section, many of the square-footage variables appeared to be related to each other.

This makes common sense: you would expect the floors of a house to have similar square footage, and a second floor significantly larger than the first would be unlikely. Total living-area square footage (Total_Liv_AreaSF) combined the features 1stFlrSF, 2ndFlrSF, LowQualFinSF, and GrLivArea. Our hypothesis was that collapsing these related features into one would reduce model complexity and thereby improve overall predictions, as sketched below.
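A sketch of both engineering steps (in practice, the type conversion would happen before dummification):

# Year and month sold behave as categories, not quantities
house_all['YrSold'] = house_all['YrSold'].astype(str)
house_all['MoSold'] = house_all['MoSold'].astype(str)

# Collapse the related square-footage columns into a single feature
house_all['Total_Liv_AreaSF'] = (house_all['1stFlrSF'] + house_all['2ndFlrSF'] +
                                 house_all['LowQualFinSF'] + house_all['GrLivArea'])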

Models Tested

We officially chose four models to test, although one of our team members also dabbled with XGBoost. For each model, the data was broken into 5 folds, with the fifth fold reserved for testing and the rest for training. Our initial hypothesis was that the tree-based models (gradient boosting and random forest) would be most successful at predicting sale prices, but we included ridge and lasso regression because they are the easiest to interpret and because we are still learning.
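A sketch of the evaluation harness we describe (the shuffle and seed are our assumptions; scores are RMSE on the log-scale target):

from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)

def rmse_cv(model, X, y):
    # cross_val_score returns negated MSE; flip the sign and take the root
    mse = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    return mse ** 0.5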

Ridge and lasso regression are generally better than a simple linear model because they introduce a regularization parameter, which helps the model avoid overfitting due to increased complexity (here, the number of features). Overfitting results in higher variance, and with this many features our team needed regularization to rein that variance in. The main reason our group used ridge regression was the model's ability to alleviate multicollinearity, i.e., the lack of independence among the variables.

Too much multicollinearity can result in an overwhelming amount of variance and destructive overfitting. Our best ridge regression had a root mean squared error of 0.12623 with an alpha (regularization parameter) of 0.10. We found that the residuals were, for the most part, centered around zero, which meets the assumption of relatively independent errors.
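A sketch of the ridge fit (the alpha grid is ours; the post reports a best alpha of 0.10 and an RMSE of 0.12623):

from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
ridge.fit(X_train, y)
print(ridge.alpha_)                        # chosen regularization strength
print(rmse_cv(ridge, X_train, y).mean())   # cross-validated RMSE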

The lasso model is useful because of its tendency to prefer solutions with fewer parameters: it eliminates features that contribute nothing to the prediction. Variables such as RoofMatl_Metal and Street_Pave were eliminated from this particular model entirely.
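Inspecting which coefficients the lasso zeroes out is straightforward; a sketch using scikit-learn's LassoCV:

from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5).fit(X_train, y)
coef = pd.Series(lasso.coef_, index=X_train.columns)
print((coef == 0).sum(), 'features eliminated')   # RoofMatl_Metal, Street_Pave among them
print(coef[coef != 0].abs().sort_values(ascending=False).head(10))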

Tree-based models are advantageous because they do not assume linearity, so fewer assumptions must be met than for linear models. Single decision trees tend to overfit rapidly, so we chose the more complex ensemble models, random forests and gradient boosting. A random forest deals well with a large number of features since it automatically identifies which features are most significant for predicting the target. It works by bootstrap-sampling the rows (bagging) and considering a random subset of features at each split.

In our case, the RMSE was 0.14198 with a minimum of 1 sample per leaf, a minimum split size of 4, a maximum of 67 features, and a random state of 0. The most important features are surfaced in the sketch below.
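A sketch with the reported hyperparameters (n_estimators is left at scikit-learn's default, which the post does not state):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(min_samples_leaf=1, min_samples_split=4,
                           max_features=67, random_state=0)
rf.fit(X_train, y)
print(rmse_cv(rf, X_train, y).mean())   # the post reports 0.14198
print(pd.Series(rf.feature_importances_, index=X_train.columns).nlargest(10))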

Gradient boosting "learns" by having each successive tree improve on the last, minimizing the residuals. We chose gradient boosting because it frequently outperforms random forest models. After cross-validation, our optimal parameters were a learning rate of 0.05, a max depth of 5, a minimum samples split of 4, and 1,000 estimators. The tuned model achieved an RMSE of 0.12916.
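A sketch with the tuned parameters as reported:

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(learning_rate=0.05, max_depth=5,
                                min_samples_split=4, n_estimators=1000)
gbm.fit(X_train, y)
print(rmse_cv(gbm, X_train, y).mean())  # the post reports 0.12916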

 

Conclusion

Our original goal was to find the best model for predicting housing prices in Ames, Iowa. Our initial hypothesis turned out to be incorrect: the linear models were the most effective at predicting house prices. In particular, the lasso model proved the strongest, with an RMSE of 0.12290, likely because it was able to drop unnecessary variables and because linear models tend to work well with relatively few rows and columns of data. A future path of research would be to limit the number of features to reduce the complexity of the tree-based models.

We could also use principal components to reduce complexity by lowering the number of dimensions, and develop more effective feature engineering by gaining a better grasp of the real estate market in this area.

The skills we demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

 

 
