The Ames Data Set: Sales Price Tackled With Diverse Models
A Machine Learning Journey
When one thinks of a "journey", one thinks of an experience with a beginning, a middle, and an end. In a way, machine learning is such an experience for the user, who creates an algorithm for making predictions. No longer are instructions predefined; rather, the algorithm learns as the data are split into sets for training and testing. The journey itself is a process of transforming these data sets into workable states (the beginning), creating machine learning models from those states (the middle), and then finding out which model works best for the situation at hand (the end). Our journey started with the Ames data set (featured image made with Stable Diffusion).
What Is the Ames Data Set?
Home to Iowa State University, the college town of Ames is the subject of a Kaggle competition to predict the sale prices of houses in that particular area. The data set itself consists of housing sale records from 2006 to 2010, covering everything from concrete numeric features like lot area and square footage to more abstract attributes like quality and condition. From these 81 columns and 2,580 rows, our goal was to follow the steps of the machine learning journey and create the "most accurate machine learning model with the best score", where score is defined as the coefficient of determination, or R-squared.
While we did predict sale price in the end, our main goal was to create a variety of models, each with its own unique capabilities. No model is "one size fits all", and this data set let us test that, especially across "similar" regularization models and their tuned variants. All of our work can be found on our GitHub, including older attempts made before feature engineering.
Data Preprocessing
Preprocessing is integral to the journey and requires a fundamental understanding of the data one is working with. In our initial trials, noted in the GitHub repository linked above, all preprocessing was done separately and individually. The benefit of that approach is that it breaks a large problem into smaller pieces that can be worked on one at a time. The drawback, however, is evident: it is inefficient and invites data leakage. The alternative, and the path we eventually chose, is to use pipelines.
Pipelines: Profiling, Cleansing, and Reduction
A pipeline is a processing method in which the steps are connected in series: the output of each step becomes the input of the next. The fundamental difference between using a pipeline and preprocessing individually is that every operation is applied through a single chained sequence of transformers, which simplifies the code, avoids errors, and, most importantly, lets the whole sequence be rerun or reused when needed. To create a valid pipeline, and as a first step in preprocessing, one has to profile the data. A typical data set is separated into categorical features (strings and objects) and numeric features (integers and floats). Ordinal features, that is, features with a "ranked" quality to them, are handled separately when it comes to transformations.
We knew from the data set that several of the features were coded "incorrectly", so their types had to be switched. This matters for one-hot encoding, where we convert categorical data into a numeric form so the machine learning program can work with it. We also discovered numerous missing values. Under normal circumstances, one deals with a missing value based on its context, such as imputing the mean, median, or most likely value. For simplicity's sake, we opted to impute the mean for numeric features and the value "none" for categorical ones.
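As a concrete illustration, here is a minimal sketch of that profiling and imputation step, assuming pandas and scikit-learn; the file name and the MSSubClass recode are illustrative stand-ins rather than our exact code.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

ames = pd.read_csv("Ames_Housing_Data.csv")  # hypothetical file name

# Example of a "wrongly" coded feature: MSSubClass is stored as a number
# but is really a categorical building-class code.
ames["MSSubClass"] = ames["MSSubClass"].astype(str)

# Profile the data by splitting the columns into numeric and categorical groups.
numeric_cols = ames.select_dtypes(include=["int64", "float64"]).columns.drop("SalePrice")
categorical_cols = ames.select_dtypes(include=["object"]).columns

# Simple imputation choices: the mean for numeric columns, the literal string
# "none" for categorical columns (many NAs in Ames mean "feature not present").
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="constant", fill_value="none")
```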
Finally comes the reduction. Using a helper function, we can find at least "some" of the redundant features in the data. These included features like "BsmtFinSF1" and "1stFlrSF": the former is already accounted for in "TotalBsmtSF" and the latter in "GrLivArea".
Pipelines: Data Transformation
Transformations, in the context of a pipeline, are the steps applied to a specific data type. For example, our first step was to handle the null and NA values. Then, depending on the data type, we encoded the categorical values or normalized the numeric ones. When this was finished, a combined preprocessor of all transformations and features was created.
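A minimal sketch of what that combined preprocessor might look like in scikit-learn is below; it assumes the column lists and imputers from the profiling sketch above, and the exact steps in our project may have differed slightly.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One small pipeline per data type: impute first, then transform.
numeric_pipe = Pipeline([
    ("impute", num_imputer),        # mean imputation, defined above
    ("scale", StandardScaler()),    # normalize the numeric values
])

categorical_pipe = Pipeline([
    ("impute", cat_imputer),        # constant "none" imputation, defined above
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# The combined preprocessor applies each sub-pipeline to its own columns.
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
```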
However, the one caveat was that the Random Forest and Gradient Boosting models used later require ordinal encoding. We opted to manually order the values from an arbitrary "least to greatest". For example:
```python
'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
```
Note how the feature "Electrical" doesn't exactly have a defined "least-to-greatest" ordering, which is why it is treated as a separate category even in the other models, such as the regularization models. Once the transformations are combined, the preprocessing pipeline can be applied to the original data set. This avoids a hurdle we originally ran into: having to save each transformation to pickle files in order to keep formats consistent between different Python files. Unless a new model requires an exception, this transformation can be reused as a foundation, which is a massive time saver.
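To make the ordinal step concrete, here is a minimal sketch of how such a hand-ordered mapping could feed scikit-learn's OrdinalEncoder; the dictionary covers only the three example features above, and the unknown-value handling is an illustrative choice rather than our exact setting.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hand-ordered categories, "least to greatest" (excerpt; the real mapping
# covers every ordinal feature).
ordinal_maps = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
}

ordinal_cols = list(ordinal_maps.keys())
ordinal_encoder = OrdinalEncoder(
    categories=[ordinal_maps[col] for col in ordinal_cols],
    handle_unknown="use_encoded_value",  # map unseen labels to -1 instead of erroring
    unknown_value=-1,
)
```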
Multiple Linear Regression
Overview and a Comparison
An MLR is used to determine the strength of relationships between variables, as well as their statistical significance. When used in machine learning, it predicts the target from more than one independent variable. An MLR model is unique in that it deals with multiple predictors; an OLS (Ordinary Least Squares) model from a statistics package, by contrast, will not produce multivariate results, nor will it allow for testing coefficients across equations.
An MLR model, while it can handle multiple predictors, is still subject to the problem of multicollinearity: the more features are added, the more likely it becomes that those features influence each other as well as the outcome.
MLR as a Model
The MLR tells us about the relationship between the multiple features and the target value. This brings us to the midpoint of our journey, where we add the model to our pipeline. In this case, that means creating a new combined pipeline out of the preprocessor mentioned before and our model, which was literally "LinearRegression()". To test the default regressor, we used cross-validation, which measures the model's performance on unseen data.
The data is split into groups of roughly equal size, called folds; the model is trained and validated on different combinations of these folds to get a more reliable picture of accuracy and performance. We initially used five folds, but then opted for ten to showcase a wider distribution; there was no significant difference between the two. The mean cross-validation score of the MLR model was .90. It did not require any "tuning", because any adjustments we made tended to lower the R-squared. Although not shown, the MLR also told us that the feature with the highest predictive power was above-ground living area (GrLivArea).
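The sketch below shows roughly how that evaluation could be wired up, assuming the preprocessor and data frame from the earlier sketches; the random state shown is illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Combined pipeline: preprocessing plus the (default) model.
mlr_pipe = Pipeline([
    ("preprocess", preprocessor),     # from the preprocessing sketch above
    ("model", LinearRegression()),
])

X = ames.drop(columns="SalePrice")
y = ames["SalePrice"]

# Ten folds with a fixed random state so reruns shuffle the data the same way.
folds = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(mlr_pipe, X, y, cv=folds, scoring="r2")
print(scores.mean())  # the mean CV R-squared reported above was about .90
```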
The Regularization Models
Regularization models attempt to deal with the problem of overfitting. Overfitting occurs when the test error stops approaching the training error as model complexity increases. When one or more coefficients are too large, the model's output becomes sensitive to minor alterations in the input data. A telltale sign of overfitting is a high test error paired with a low training error.
We used three regularization techniques in our project: the Lasso, the Ridge, and the Elastic Net. Each of these models has its own positives and negatives, and each produced different, though similar, scores. Regularization models are useful for dealing with multicollinearity and tend to work better as the number of features grows.
Default vs Tuned Models
A note should be made about the difference between a "default" and a "tuned" model. The MLR model above relied exclusively on the default settings for cross-validation, since tuning did not improve its score. A default model simply uses the library's out-of-the-box parameter values, untouched by tuning.
"Tuning" is an attempt to control the values of the model's parameters during the learning process in order to achieve better results. It is an iterative process: one inputs ranges of parameter values, adjusts the ends of the ranges once the "best parameters" are found, and settles on a value when the ends no longer change. For the regularization models, only the alpha (the regularization strength) was tuned, and the search over candidate values was performed with a grid search.
Lasso Regression
The Lasso is most effective in situations where a subset of the features contributes most of the signal. It adds a penalty term, based on the absolute values of the coefficients, to the least-squares cost function, shrinking some coefficients to exactly zero. This keeps the coefficients small at the cost of higher bias and weaker predictive power. In return, the Lasso is known for its feature selection, keeping only one variable from a group of highly correlated variables. Because it retains just that one variable, a data set with many highly correlated features can end up with lower accuracy and a loss of information.
As expected, the default has a lower score than the tuned model, and by a large margin. Because the Lasso arbitrarily keeps one variable from each correlated group, its performance begins to degrade as the folds progress. The sharp drop at fold four can still plausibly be attributed to the model becoming accustomed to the data via the validation set.
Ridge Regression
Ridge regression is similar to the Lasso, except that instead of shrinking coefficients to exactly zero, it shrinks them toward, but not to, zero. This allows the weight to be spread more evenly across correlated variables and makes it better suited to dealing with multicollinearity, at the cost of losing the ability to perform feature selection.
Once again, the tuned model has a superior score to the default model. It should be noted, however, that the tuned Ridge is only a "slight" improvement over the tuned Lasso, even though the gap between the Ridge's own default and tuned versions is more substantial.
The Elastic Net
The final regularization technique we used, the Elastic Net, is a hybrid of Ridge and Lasso regression. On top of alpha it has a mixing parameter, which lets it select a subset of predictors that are correlated but not redundant. It is used when the Lasso becomes too dependent on a particular variable. The Elastic Net is computationally expensive due to the extra parameter tuning, and it may fail to select relevant features if there are too many predictors.
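A minimal sketch of the Elastic Net step is below; in scikit-learn the mixing parameter is l1_ratio, where 1.0 reproduces the Lasso penalty and 0.0 the Ridge penalty. Since only alpha was tuned in our project, l1_ratio is left at an assumed fixed value here.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

enet_pipe = Pipeline([
    ("preprocess", preprocessor),                         # from the preprocessing sketch above
    ("model", ElasticNet(l1_ratio=0.5, max_iter=10000)),  # fixed, illustrative blend of L1 and L2
])

# Tune alpha only, as with the other regularization models.
enet_grid = {"model__alpha": np.logspace(-3, 2, 20)}
enet_search = GridSearchCV(enet_pipe, enet_grid, cv=folds, scoring="r2")
enet_search.fit(X, y)
```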
There is a stark contrast between the default Elastic Net and the other default models. The simplest explanation is that the Elastic Net's instability can be a result of the data itself, not necessarily the model. Once tuned, it looks much more like the other models, including the dip at fold four.
Comparison
Among the defaults, the Elastic Net had the lowest mean CV score, while the Ridge had the highest. The Elastic Net showed considerable instability, indicated by the wide spread of points on the boxplot, while the Lasso had the lowest single-fold score of all the default models.
The tuned models were all superior to the default models, with the Elastic Net having the best mean CV score of all, at .9028. Notably, the tuned models are also much more compact than the defaults, and the lowest fold score "reverses": the Lasso, technically the worst model, now posts one of the highest fold scores, while the Elastic Net has one of the lowest.
Under normal circumstances, an analyst would choose the model with the highest score. The problem is that the differences between these scores sit in the ten-thousandths place, a gap that can be attributed to data shuffling or randomization (note that the random state was specified every time for the KFold).
Random Decision Forests
A Random Forest model is used to find feature importance, that is, the degree to which the model's predictions depend on each variable. It is a combination of multiple decision trees, the hierarchical structures that use nodes to test attributes. The higher a feature's importance, the larger its effect on the model. It should be noted, however, that decision trees are also subject to the overfitting mentioned before: as trees grow in size, so does the possibility of overfitting.
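The sketch below shows one way the forest and its feature importances could be wired up. The "tree_preprocessor" is a hypothetical variant of the earlier preprocessor that swaps one-hot encoding for the ordinal encoder; any remaining nominal columns would be handled along the same lines.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Hypothetical tree-oriented preprocessor: numeric pipeline as before,
# ordinal encoding (from the earlier sketch) instead of one-hot encoding.
tree_preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("ord", Pipeline([("impute", cat_imputer),
                      ("encode", ordinal_encoder)]), ordinal_cols),
])

rf_pipe = Pipeline([
    ("preprocess", tree_preprocessor),
    ("model", RandomForestRegressor(random_state=42)),  # illustrative random state
])

rf_pipe.fit(X, y)
# Importances come straight from the fitted forest, one value per encoded feature.
importances = rf_pipe.named_steps["model"].feature_importances_
```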
The Random Forest produced a default score of 0.8863, while the tuned model had a CV score of 0.8864. The differences in score, and in results, were so minutely inconsequential that it is difficult to pinpoint where they are. Some of the features are ranked slightly differently in the default version, but the graphs were otherwise identical, which is why they are not shown. This matters because tuning can still change the score (unlike with the MLR model) while not visibly changing anything else.
Gradient Boosting
Gradient Boosting is an ensemble technique that takes multiple "weaker" models and combines them into a collectively more accurate model. The weak learners are added sequentially, with each new one correcting the errors made by the previous ones. This sequencing allows for greater accuracy, especially on larger data sets. The final prediction is an aggregate sum of the individual predictions.
Gradient Boosting is similar to a Random Forest in that it also produces feature importances and also requires ordinal encoding. In fact, it is so similar that its default model is visually nearly identical to the Random Forest's. The default score, however, was 0.8947, which is superior to both the default and tuned versions of the Random Forest.
The tuned model is visually distinct, and the reason is the "max_features" parameter. For the Gradient Boosting we used "square root", which considers only the square root of the number of features at each split. The Random Forest, on the other hand, used "none", meaning all of the features, which made it extremely computationally expensive, even more so than the Elastic Net. Still, the tuned Gradient Boosting model had a superior score of 0.9056, the best of all our models.
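A sketch of that tuned step is below; in scikit-learn "square root" corresponds to max_features="sqrt", the preprocessor is the same hypothetical tree-oriented one as in the forest sketch, and the remaining settings are illustrative rather than our exact best parameters.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

gb_pipe = Pipeline([
    ("preprocess", tree_preprocessor),  # same ordinal-encoded preprocessor as the forest
    ("model", GradientBoostingRegressor(max_features="sqrt",  # "square root" of the features per split
                                        random_state=42)),
])

gb_scores = cross_val_score(gb_pipe, X, y, cv=folds, scoring="r2")
print(gb_scores.mean())
```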
Summary of Results
The models all have their strengths and weaknesses, but the results show which ones may be preferred over the others. Compared side by side, the hypothetical "best" model would be the Gradient Boosting model. The "best" regression model was the Elastic Net, although not by much, and the "worst" model was the Random Forest. The MLR was the only model that actually performed worse when tuned, so it was left untouched.
However, all of these choices are situational, depending on what one is trying to find and the kind of data one is working with. For example, some models handle highly correlated data better than others, while others cope better with larger data sets. And choosing one model over another doesn't necessarily mean the lower-scoring model is "bad"; rather, much like parameter tuning, you are trading a small loss in score for the model's other benefits.
Final Thoughts
Regularization Techniques
The linear models were interesting perhaps in their lack of variety, with the exception of the Elastic Net's default. Each model has set advantages and disadvantages suited to specific situations, but after using three of them, none stands out well above the rest. This results in a "pick-and-choose" situation, where one selects the regression models that suit the situation and compares them alongside one another. That is fine because, in our experience, these models are compact, easy to use, and similar enough to each other that they can be compared with ease. In fact, the regularization models were so similar that they were combined into one pipeline and could be run in one line. The catch is that one doesn't want to feel like they wasted their time running multiple methods.
Random Forest and Gradient Boosting
The Random Forest model is impressive in that it is a streamlined result of multiple decision trees. This sets it apart from the linear models in that ordinal encoding actually matters, but more importantly, a lot of information can be extracted from just the one model. In our case, the Gradient Boosting was essentially a "better" Random Forest. This was compounded by what each chose as its "best parameter": the Random Forest's best was "none", meaning all features were included as is, which was computationally expensive. When "square root" was used instead, its feature-importance plot became much more similar to the Gradient Boosting model's.
A similar situation applies to the Gradient Boosting model: if "none" is used instead, the model becomes computationally expensive and the score decreases, but the feature importances become visually similar to those of the Random Forest. This demonstrates, once again, how critical it is to choose the right model for the data and how much parameter tuning can affect the results. The Random Forest had the worst score and the Gradient Boosting the best, yet the two are structurally interchangeable.
Room For Improvement
From a Modeling Standpoint
Our stated goal for the project was to find the R-squared using cross-validation, whether default or tuned. We wanted to compare the models using that R-squared to find the "best score", and then address the strengths and weaknesses of the models. Compared with a real-world scenario, the "final phase" of our journey was more of a "dead stop" than an exploration of what could actually be achieved with the "best model". We briefly discussed predictions and feature importance, but did not use them to address, in depth, how they influence sale prices.
Furthermore, as previously hinted, other models can work better in certain situations, and the models we used may not be the best for this data set. Alternatives include classification modeling (we used only regressors), XGBoost and AdaBoost, and bootstrap aggregating (bagging).
Finally, the data itself needs to be addressed. Many factors are at play, such as the nature of the model, and future experimentation with both the models and the data could be adjusted based on that nature. For example, the models could be built on neighborhood-based quintiles instead of treating sale prices as a single pool.
From a Data Standpoint
We never actually used the MLR or Lasso models for feature selection. With those, it is more of a "yes" or "no" decision on whether to keep features based on a coefficient cutoff. What could actually aid such feature selection is more feature engineering, accounting for each feature's usefulness and redundancy. We mentioned in the beginning that certain features were made redundant by others, such as "total basement square feet" accounting for all the basement square-footage features. Something similar could apply to, say, the bathrooms, as sketched below.
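As a hypothetical example of that kind of engineering, the four bathroom columns in Ames could be collapsed into a single combined count; the half-bath weighting here is an assumption, not something we actually implemented.

```python
# Combine the four bathroom columns into one engineered feature,
# counting each half bath as half a bathroom, then drop the originals.
ames["TotalBath"] = (
    ames["FullBath"] + 0.5 * ames["HalfBath"]
    + ames["BsmtFullBath"] + 0.5 * ames["BsmtHalfBath"]
)
ames = ames.drop(columns=["FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"])
```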
Pipelines were used to organize all of the transformations. However, two things can be noted: outliers and structure. Outliers were never addressed in the pipelines, even though they clearly affected the data. Furthermore, certain choices, such as imputing a median instead of a mean, have a different effect on the end results. There are also alternatives to the StandardScaler, such as the MinMaxScaler, though a scaler cannot be used with every model.
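Either change would be a one-line swap inside the numeric pipeline, along the lines of the sketch below; median imputation and MinMaxScaler are shown purely as illustrative alternatives.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Alternative numeric pipeline: median imputation and min-max scaling
# in place of the mean/StandardScaler combination used above.
numeric_pipe_alt = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
```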
Conclusion
The machine learning journey is a process that requires full knowledge of the data set one is working with and careful consideration of the process used to arrive at the results. As demonstrated, a model cannot be arbitrarily assigned to just any data set without risking redundancy, overfitting, or poor results. One wants the model that adapts best to the data set at hand, based on factors like the number of correlated features, the features themselves, and their values. We may have found the Gradient Boosting model to be the "best" by R-squared, but it was not "that" much better than the other models. The regularization models barely differed from one another, and certain tuning, such as in the Random Forest model, also barely did anything. It should therefore be reiterated once more that just because a model had a lower score does not mean it is "inferior" if it suits the user's needs better from a functional standpoint.