Data Driven Predictions of House Prices in Ames, Iowa

Posted on Jun 29, 2022

The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Ames, Iowa, is a college town centered around Iowa State University. It has many distinct neighborhoods spanning a range of socio-economic classes, and it represents a good opportunity for real estate investment at multiple financial levels, especially house rentals, thanks to a steady stream of potential renters from the university. When market conditions are right, or the prospect of being a landlord is no longer attractive, we want to make sure our investment returns a high value. Machine learning can help us choose homes with high resale value by building a predictive model of sale price based on house features. To use machine learning, we need data.

Luckily, we have a robust dataset of housing prices in this area spanning 2006 to 2010, including sale price and about 80 different features, from house size to neighborhood, number of rooms, and even shingle type.

Data Wrangling

Although this dataset is very useful, it is not perfect: some data is missing and some observations are encoded incorrectly. About a dozen features use NaN (not a number) to signify that a feature is absent. It is a trivial matter to re-encode these values to 'none'.
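A minimal sketch of this re-encoding, assuming the data lives in a pandas DataFrame and using an illustrative subset of the affected column names from the Ames dataset:

```python
import pandas as pd

df = pd.read_csv('ames_housing.csv')  # assumed file name

# Columns where NaN means "not present" rather than "unknown"
# (an illustrative subset, not the full dozen).
none_features = ['Alley', 'Fence', 'FireplaceQu', 'MiscFeature', 'PoolQC']
df[none_features] = df[none_features].fillna('none')
```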

Other observations have genuinely missing data. Although the simplest approach is to drop these records before training, the more data we train our model on, the more accurate it will become.

Seven features remain with missing data; we can tackle them one by one.

Zoning

A little investigation reveals that how a lot is zoned is highly correlated with the neighborhood the lot is in. In other words, most lots in a given neighborhood share a single zoning category. Therefore, we can confidently fill in the missing zoning data with the most common zone in that neighborhood.
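A sketch of this group-wise imputation, assuming the Ames column names 'MSZoning' and 'Neighborhood':

```python
# Fill each missing zoning value with the most common zoning
# class observed in that lot's neighborhood.
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
                   .transform(lambda s: s.fillna(s.mode()[0]))
```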

Utilities

There are two missing values in the feature describing the utilities hookup of the house. Simple observation reveals that all other houses have a 'public' utilities hookup. Therefore, the probability that these two missing values are anything other than 'public' is extremely low.

Kitchen Quality

Correlation analysis reveals that kitchen quality maps almost perfectly to the overall quality of the house: if a house is listed as 'good' quality, over 95% of the time the kitchen quality will also be 'good', and the same holds for every other category. We can therefore substitute the overall house quality for any missing values in kitchen quality.
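One way this substitution could look, assuming the Ames columns 'OverallQual' (a 1-10 score) and 'KitchenQual' (five ordinal labels); the bin edges mapping one scale onto the other are my assumption:

```python
import pandas as pd

# Map the 1-10 overall quality onto the five-label kitchen scale,
# then use it to fill the gaps (the bin edges are assumptions).
qual_map = pd.cut(df['OverallQual'], bins=[0, 2, 4, 6, 8, 10],
                  labels=['Po', 'Fa', 'TA', 'Gd', 'Ex'])
df['KitchenQual'] = df['KitchenQual'].fillna(qual_map)
```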

Functionality

This feature describes the overall functionality of the property. Over 95% of the other observations have a functionality of 'typical', and the records with missing data also have an overall house condition of good or better. Reason dictates that these houses very likely have 'typical' functionality as well.

Garage Car Capacity and Garage Area

There are also two values missing from each of these features. In all, our dataset contains five variables relating to garages, and a simple inspection of these two observations reveals that these houses do not have a garage on the property. We can encode the missing values to reflect that.

Lot Frontage

Lot frontage is described as "Linear feet of street connected to property". Over 15% of our data has missing values for this feature. Since this is a numerical feature instead of a categorical one, we have to take special steps to impute it. We find that lot frontage is correlated with lot area; therefore, we can compute the average ratio of lot frontage to lot area and use it to impute the missing lot frontage values.
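A sketch of the ratio-based imputation, using the Ames column names 'LotFrontage' and 'LotArea':

```python
# Average frontage-to-area ratio over the rows where frontage is known,
# then scale each missing row's lot area by that ratio.
known = df['LotFrontage'].notna()
ratio = (df.loc[known, 'LotFrontage'] / df.loc[known, 'LotArea']).mean()
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotArea'] * ratio)
```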

Feature Engineering and Dimensionality Reduction for the Data

It is informative to look at the distribution of the sale prices of the houses in our dataset.

[Figure: distribution of house sale prices]

Our predictions will be more accurate if our model's target variable, sale price, is normally distributed. We can perform a Box-Cox transformation on the sale price, and reverse the transform after making predictions to get interpretable values.
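A minimal sketch of the transform and its inverse with scipy, assuming the target column is 'SalePrice':

```python
from scipy import stats
from scipy.special import inv_boxcox

# boxcox returns the transformed values and the fitted lambda;
# keep lambda so predictions can be mapped back to dollars.
y_trans, lam = stats.boxcox(df['SalePrice'])

# ... fit a model on y_trans, then invert its (hypothetical) predictions:
# y_pred_dollars = inv_boxcox(model.predict(X_test), lam)
```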

[Figure: distribution of sale prices after Box-Cox transformation]

Combining Variables

There are three features related to indoor living area: first-floor, second-floor, and basement square footage. Plotting these variables against sale price reveals a bimodal pattern: many observations have second-floor or basement square footage of zero, and these zeros skew the linear relationship between square footage and sale price.

[Figure: first-floor, second-floor, and basement square footage vs. sale price]

The outdoor square footage variables show the same behavior.

[Figure: outdoor square footage variables vs. sale price]

This bimodal nature also holds true for the number of different types of bathrooms.

[Figure: bathroom counts vs. sale price]

We can combine these three sets of related variables into three distinct variables to discover the true relationship with sale price.
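A sketch of the three combined variables using the Kaggle Ames column names; the exact set of outdoor columns and the half-weighting of half baths are my assumptions:

```python
# Indoor living area: first floor + second floor + basement.
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']

# Outdoor area: an assumed set of four deck/porch columns.
df['TotalOutdoorSF'] = (df['WoodDeckSF'] + df['OpenPorchSF']
                        + df['EnclosedPorch'] + df['ScreenPorch'])

# Bathrooms: half baths counted as 0.5 (an assumption).
df['TotalBath'] = (df['FullBath'] + 0.5 * df['HalfBath']
                   + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])
```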

The combined indoor square footage shows a very nice linear relationship to sale price, making it very useful in modeling house prices.

[Figure: combined indoor square footage vs. sale price]

The bimodal nature (observations with zero outdoor square footage) is still present, but significantly reduced by combining all four outdoor square footage variables.

The relationship is not as clear for the total number of bathrooms; further analysis will be needed to determine whether this variable will be useful in modeling.

We have just eliminated nine variables without losing any information, potentially improving the predictive capability of our models.

Quality and Condition

There are a dozen variables in our dataset related to the quality and condition of various features of the house. Although it is natural to consider the quality and condition of an item before purchasing it, the process of assigning these ratings is subjective. There is high potential for random influence based on individual preference.

Let's start by looking at the relationship of the six quality variables with sale price.

Garage and pool quality seem to have very little to do with the final sale price of a house; even fireplace quality is on the edge of usefulness. We can explore this relationship in a little more detail visually.

A linear relationship is evident in the features with high correlation to sale price; I feel confident in dropping the remainder.

We can perform the same analysis on the features relating to the condition of the house.

These features do not look useful at all for modeling purposes. Perhaps a visual inspection of condition versus sale price will reveal hidden relationships.

The flatness of the trendlines further cements the idea that the condition variables are not useful. Perhaps the condition of a feature really is too subjective to be informative.

Further Dimensionality Reduction

Using the same methods illustrated above, bivariate and correlation analysis, we can eliminate further features that are not useful:

  • Number of Bedrooms
  • Number of Kitchens
  • Finished Square Feet
  • Garage Year Built
  • Month Sold
  • Lot Frontage
  • Pool Area

We have successfully eliminated nearly 50% of the features from our data. This will result in a much more stable model.

Variable Redundancy in the Data

Exploratory data analysis serves many functions: increasing subject competency, finding obvious relationships between variables, and sometimes discovering redundancy. This particular dataset has two pairs of variables that describe the same feature.

Garage Area and Number of Cars

These two variables describe the same thing: the size of the garage. To confirm this, we can calculate the variance inflation factor (VIF) for these features.
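A sketch of the VIF calculation with statsmodels, using the Ames column names 'GarageCars' and 'GarageArea':

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF for each garage-size variable (plus an intercept term).
X = add_constant(df[['GarageCars', 'GarageArea']].dropna())
vif = pd.Series([variance_inflation_factor(X.values, i)
                 for i in range(X.shape[1])], index=X.columns)
print(vif)
```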

A high VIF indicates a redundant variable. The only question is, which one to keep?

A simple plot of these two variables against sale price shows a linear relationship for both, and the trendlines have similar R-squared values, leading to the conclusion that it does not matter which variable we keep. I decided to keep the number-of-cars variable because it is the simpler of the two.

Year House Built and House Remodel Date

When calculating the variance inflation factor for the previous example, I discovered a mystery.

A VIF in the tens of thousands is an oddity. Plotting a histogram of the two variables reveals an even stranger occurrence.

The two variables show the exact same distribution (aside from a spike at 1950). A little more analysis shows that if a house was never remodeled, the relevant data field is populated with the year the house was built; if the house was built before 1950 and never remodeled, the value is encoded as 1950.

Filtering the data to only accurate values of the year a house was remodeled shows there is no significant linear relationship between this variable and sale price. So as not to eliminate potentially relevant information, I added a binary flag indicating whether a house was remodeled and dropped the year-remodeled field.
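A sketch of the flag, using the Ames column names 'YearBuilt' and 'YearRemodAdd' and the encoding quirk described above:

```python
# 'Never remodeled' means the remodel year equals the build year, or a
# pre-1950 build carries the 1950 placeholder value.
never = ((df['YearRemodAdd'] == df['YearBuilt'])
         | ((df['YearBuilt'] < 1950) & (df['YearRemodAdd'] == 1950)))
df['Remodeled'] = (~never).astype(int)
df = df.drop(columns=['YearRemodAdd'])
```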

Trimming Categorical Features

Mathematical analysis of numerical features is inherently simpler than of categorical ones. To determine the usefulness of nominally encoded features, we can encode them numerically and perform a chi-squared test.
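One way to run such a test is to bin sale price into quartiles and score each nominal feature against the bins with a chi-squared test of independence; the quartile binning is my assumption:

```python
import pandas as pd
from scipy.stats import chi2_contingency

price_bins = pd.qcut(df['SalePrice'], q=4, labels=False)

def chi2_score(feature):
    """Chi-squared statistic and p-value for feature vs. price quartile."""
    table = pd.crosstab(df[feature], price_bins)
    stat, p, dof, _ = chi2_contingency(table)
    return stat, p

print(chi2_score('Neighborhood'))  # example Ames column
```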

A low chi-squared score alone is not necessarily an accurate measure of a feature's usefulness, but the five lowest-scoring features do not logically seem to be good indicators of house price. I feel safe in dropping them.

Using Lasso Coefficients for Dimensionality Reduction

Now that our data has been reduced to its most logical and useful features through manual analysis, we can attempt to create a stable Lasso model to simplify it further. Lasso's penalty term in linear regression forces the beta coefficients of non-relevant features to zero, indicating they will not be useful in our predictive modeling.

We can plot the lasso coefficients to approximate the impact each variable will have on our final model.
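A sketch of fitting a cross-validated Lasso and plotting the surviving coefficients, assuming X is a scaled feature DataFrame and y the transformed sale price:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5).fit(X, y)

# Features whose coefficients were not forced to zero.
coefs = pd.Series(lasso.coef_, index=X.columns).sort_values()
coefs[coefs != 0].plot.barh(figsize=(8, 10))
plt.title('Non-zero Lasso coefficients')
plt.tight_layout()
plt.show()
```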

Modeling

Our data exploration has shown a strong linear relationship between many variables and sale price. This indicates linear regression may be an appropriate tactic for predicting house price.

Data Preprocessing

After scaling the numerical features with sklearn's standard scaler, dummifying the categorical features with pandas, and creating a 70/30 train/test split, we can begin fitting linear models to our data.

All hyperparameters will be determined through a thorough cross-validated grid search and verified by scoring the fit on the test data set.
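A sketch of these preprocessing and tuning steps, with placeholder variable names (features_df, y) and an assumed alpha grid:

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Dummify categoricals, then a 70/30 split.
X = pd.get_dummies(features_df)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scale using statistics learned from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Cross-validated grid search, verified against the held-out test set.
grid = GridSearchCV(Lasso(), {'alpha': [0.001, 0.01, 0.1, 1.0]}, cv=5)
grid.fit(X_train_s, y_train)
print(grid.best_params_, grid.score(X_test_s, y_test))
```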

Lasso

Plotting predicted versus actual sale price shows a normal distribution of residuals, a Q-Q plot shows a dense straight line for the bulk of our data, and an R-squared of 0.921 on the test data set signifies a reasonably good fit. The mean absolute error of $15,710 is the average dollar amount by which each prediction is off.

Elastic-Net

The next logical step after Lasso regression is to attempt an Elastic Net model.

There is a slight improvement over the lasso model: a 0.005 increase in R-squared and a $442 reduction in mean absolute error. This is most likely the limit of the accuracy of strictly linear models.

The next step is tree-based models. The preprocessing is slightly different for these: pandas dummification is not necessary, and one-hot encoding is used instead to numerically encode the categorical features.

Random Forest

A random forest model performs strictly worse than our elastic net, with a lower R-squared and a higher mean absolute error.

This does not necessarily mean tree-based models are a poor fit for our data; we can try different frameworks, such as gradient-boosted tree models.

XGBoost

XGBoost gives the best fit to our data, with the highest R-squared and lowest mean absolute error.

One benefit of tree-based models is the ease with which we can extract feature importance.
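A sketch of fitting the model and reading off the importances; the hyperparameters here are placeholders rather than the tuned values:

```python
import pandas as pd
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
xgb.fit(X_train, y_train)

# Importance of each feature, assuming X_train is a DataFrame.
importances = pd.Series(xgb.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```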

Conclusion

Our tree-based models show the same ranking of importance for our numerical features and illustrate that overall quality and total square footage are the two most influential features for sale price.

Whether the house has central air conditioning and the neighborhood it is located in are the most important categorical features for home sale price.

Our modeling shows that in order to maximize house resale value we need a large house with good overall quality and central air conditioning in the right neighborhood.

