Kaggle Ames House Pricing Competition
Kaggle's Ames House Pricing Competition is a data science competition that challenges participants to build predictive models that accurately estimate the sale price of houses in Ames, Iowa. The competition provides a dataset of 79 features describing 1,460 homes sold between 2006 and 2010, including the size of the house, its location, the number of bedrooms and bathrooms, the age of the house, and many other attributes.
Before applying any models, the data was cleaned to improve its usability across different modeling techniques. Multiple numerical and categorical variables were investigated to assess their correlation with sale price. Among the numerical variables, the most linear and applicable were overall quality, above-grade living area, first-floor square footage, and property age. Categorical standouts included neighborhood, building type, and garage type. A sketch of this screening step is shown below.
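The sketch below shows one way that screening might look in pandas. It assumes the competition's train.csv with its standard column names (OverallQual, GrLivArea, 1stFlrSF, YearBuilt, YrSold, SalePrice, Neighborhood, BldgType, GarageType); it is illustrative rather than the project's exact code.

```python
import pandas as pd

# Assumes the Kaggle competition's train.csv with standard column names.
df = pd.read_csv("train.csv")

# Derive property age at the time of sale from the year columns.
df["Age"] = df["YrSold"] - df["YearBuilt"]

# Rank numeric features by Pearson correlation with sale price.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()["SalePrice"].drop("SalePrice")
print(corr.sort_values(ascending=False).head(10))

# For categorical candidates, compare median sale price across levels.
for col in ["Neighborhood", "BldgType", "GarageType"]:
    print(df.groupby(col)["SalePrice"].median().sort_values(ascending=False))
```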
After constructing a revised dataset that included the positively correlated numerical features and the categorical standouts, the data was split for training and testing. The modeling techniques used were linear regression, decision trees, random forests, gradient boosting, and support vector machine regression.
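A minimal sketch of that train/test comparison, using scikit-learn's implementations of the five techniques. The feature list here is hypothetical, built from the standouts above; categoricals are one-hot encoded so every model sees purely numeric input.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

df = pd.read_csv("train.csv")
df["Age"] = df["YrSold"] - df["YearBuilt"]

# Hypothetical feature list based on the screening above.
features = ["OverallQual", "GrLivArea", "1stFlrSF", "Age",
            "Neighborhood", "BldgType", "GarageType"]
X = pd.get_dummies(df[features])  # one-hot encode the categoricals
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
    "support vector regression": SVR(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")
```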
Results were evaluated using prediction accuracy and cross-validation scores. Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a cross-validation procedure, the dataset is partitioned into complementary subsets, and the model is trained on one subset and tested on the other.
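As a sketch of that procedure, scikit-learn's cross_val_score automates the partitioning; this reuses the X and y built in the previous sketch.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation: the data is partitioned into five complementary
# subsets; each fold trains on four and tests on the held-out fifth, so
# every row is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                         X, y, cv=cv, scoring="r2")
print(f"mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```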
Gradient boosting produced the best cross-validation scores and the highest accuracy, achieving 85% accuracy in predicting home prices and a cross-validation score of 0.9 (with 1 being the highest rating). Why did gradient boosting work so well? Gradient boosting combines multiple weak models, each one fit to the residual errors of the ensemble built so far, to progressively minimize the difference between predictions and actual values. In a dataset with some missing values, feature outliers, and categorical variables to account for, gradient boosting handles these conditions well.
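To make the residual-fitting idea concrete, here is a toy from-scratch sketch of boosting on synthetic data (not the project's actual model): each shallow tree corrects the errors left by the ensemble so far, and the combined prediction improves step by step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
pred = np.full_like(y_demo, y_demo.mean())   # start from a constant guess
trees = []
for _ in range(100):
    residuals = y_demo - pred                 # what the ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    pred += learning_rate * tree.predict(X_demo)  # small corrective step
    trees.append(tree)

print(f"training MSE after boosting: {np.mean((y_demo - pred) ** 2):.4f}")
```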
This project leaves room for improvement, particularly in more versatile feature engineering. More outliers could have been removed, different feature lists could have been tried during model preparation, and engineered features such as combinations of square footage and living areas could have been created (a sketch follows). It would also be possible to try different regression techniques, such as hierarchical regression, where mixed models could handle nested data at different levels.
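As one example of the feature engineering suggested above, a few hypothetical combinations using the dataset's standard square-footage columns (TotalBsmtSF, 1stFlrSF, 2ndFlrSF, GrLivArea, TotRmsAbvGrd); the outlier threshold here is an assumption, not a value from the project.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Combine square-footage columns into a single total-size feature.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Living area per above-grade room, a rough proxy for spaciousness.
df["LivAreaPerRoom"] = df["GrLivArea"] / df["TotRmsAbvGrd"]

# Drop the few extreme living-area outliers (assumed threshold).
df = df[df["GrLivArea"] < 4500]
```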