My First Machine Learning Project

Austin Cheng
Posted on Dec 16, 2019

The Beginning of My Machine Learning Journey 

In this blog I will walk through how my teammates (Aron, Ashish, Gabriel) and I approached our very first machine learning project. The purpose of this blog is mostly record keeping: tracking my journey as an aspiring data scientist and noting down the thought process and reasoning behind the steps we took to arrive at our predictive models. I will keep the reasoning as general as possible because the intention is to establish a generalized workflow that I can build on. The ultimate goal is to someday return to this data set, apply better predictive models, see what I would have done differently, and see my own growth as a data scientist.

The Data Set

The data set was taken from kaggle.com. It consists of 79 features describing practically everything about houses in Ames, Iowa. It is meant to be used as a toy example for aspiring machine learning practitioners to play with. The main lesson to be learnt from this data set is that simple linear models can be very powerful and that they can easily outperform high-complexity models in the right scenarios. In the following, I will describe the workflow that we followed to tackle this data set, and verify that linear models indeed should always be in one's arsenal.

Workflow for Data Pre-Processing

Data Pre-Processing and Transformation

We adhered to the advice we were given right away: transform the target variable (sale price) so that it follows an approximately normal distribution, and remove outliers. The former is important because it helps the residuals come out closer to normally distributed (an underlying assumption of linear inference models), and the latter ensures that our model result doesn't get skewed (or wrongly biased) by anomalous observations, particularly those with high influence and leverage. Below we illustrate the log transformation (the special case of the Box-Cox transformation with lambda = 0, which we applied manually):

At the top we show the highly skewed untreated data. At the bottom we show the log-transformed data, where we can see the drastic improvement in the distribution.
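
As a rough illustration of the transformation described above, here is a minimal sketch on synthetic data. The prices are drawn from a log-normal distribution as a stand-in for the real sale prices, and the `skewness` helper is a hypothetical name introduced only for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "sale prices" (assumption: not the Ames data).
rng = np.random.default_rng(0)
sale_price = np.exp(rng.normal(loc=12.0, scale=0.4, size=5000))

log_price = np.log(sale_price)   # Box-Cox with lambda = 0
recovered = np.exp(log_price)    # invert before reporting prices in dollars

skew_raw = skewness(sale_price)
skew_log = skewness(log_price)
print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

A model trained on the log target predicts log prices, so its predictions have to be exponentiated before they are compared with actual prices.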

One step that we chose not to take was transforming the features so that they would also become normally distributed. Machine learning models could potentially benefit from normally distributed features, but this would compromise the interpretability of the resulting model. For this reason we did not pursue it and instead moved on to treating outliers. Below, we show the effect of removing outliers for a particular variable:

On the left is the untreated data, and on the right is the treated data. The effect of removing the outlier is obvious: the fitted line shifts significantly.

Missingness and Imputation

The second step, on which a large portion of our time was spent, was treating missingness. Imputation is tricky because it requires deeper insight into each feature. Whether to impute with the mean, median, mode, zero, or none, or simply to remove the observation or the feature itself, depended on guidelines that we decided beforehand were acceptable. This is where a lot of human intuition is used. Below we show a qualitative summary of the missingness.

The top shows the amount of missingness for each feature and the bottom shows the correlation between missingness.

I will not delve into the specifics of how we treated the missingness of each variable (the reader can refer to our code posted on GitHub for the exact treatment) but will instead briefly go over the general idea. First, any variable with over 95% definite missingness can in principle be safely discarded, but we must take caution before doing so because the missingness may not be actual missingness. Drawing on the correlation between the missingness of variables, we could deduce what some of the missingness meant. For instance, the highly correlated missingness among the garage-related variables suggests that these houses simply do not have a garage. One variable that is almost completely missing is pool area. For this, we took observations with missing pool area to mean that these houses do not have a pool. These examples give a flavor of how we treated variables with a significant level of missingness (in general we picked the conservative option, which was to keep any information we could).

For variables where the amount of missingness is relatively low (less than 5% of the observations), we chose to impute with the mean if the variable was continuous (or ordinal) and with the mode if the variable was categorical. The reasoning behind mean imputation is that the imputed data does not alter the fitted slopes and hence does not bias the model result. As for the mode or median (for categorical or numeric variables respectively), there is no particularly good reason except to think that the observations belong to the most representative groups. There may be flaws in this choice, but sometimes convenience outweighs precision, especially when the amount of missingness is small (the number of missing values for these features is in the low tens). To be more precise in this imputation process, I would choose to impute based on k-nearest neighbors or other machine learning models.

Another widely accepted imputation method is to impute with a grossly outlying number such as -999 (if all observations are real positive numbers). However, this imputation does not work with inference models where analytical equations are fitted. Because of this, imputing with -999 was avoided.
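
The rules above can be sketched in pandas as follows. This is only an illustration on a four-row toy frame: the column names and values are made up for the example, and the exact treatment of each variable is in the project code rather than here.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "PoolArea": [np.nan, np.nan, 512.0, np.nan],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],
})

imputed = df.copy()

# Missingness that really means "the house does not have this":
# fill with zero (numeric) or an explicit "None" level (categorical).
imputed["PoolArea"] = imputed["PoolArea"].fillna(0.0)
imputed["GarageType"] = imputed["GarageType"].fillna("None")

# Low missingness: mean for continuous, mode for categorical.
imputed["LotFrontage"] = imputed["LotFrontage"].fillna(df["LotFrontage"].mean())
imputed["Electrical"] = imputed["Electrical"].fillna(df["Electrical"].mode()[0])

print(imputed)
```

Filling with an explicit "None" level keeps the information that the garage is absent, instead of letting it be confused with a genuinely unknown value.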

First Round of Feature Selection 

The curse of dimensionality is often preached to us. High dimensionality likely means collinear variables, which cause inaccurate fitted coefficients as well as high variance. High dimensionality may also mean sparsity in the data, or an unnecessary number of features, which can cause overfitting. Both of these are highly undesirable because they lead to a poorly performing model.

Correlation Investigation: Getting Rid of Multi-Collinearity

The first attempt in feature selection is driven by the need to reduce multi-collinearity within the system. The methodology is to perform correlation investigation while either combining or removing features. Below we show the correlation plots before and after the treatment of multi-collinearity:

On the left is the correlation plot of the raw data. On the right is that of the treated data where features were either removed or combined.

One can see that the correlation (represented by dark blue) is drastically reduced. This was achieved through the removal and/or combination of features. To check whether we were making the right decisions, we constantly evaluated the R-squared of each feature regressed on the remaining features:

On the left plot, the living-area-related variables (fifth from last to third from last) all have an R-squared greater than 0.8 (equivalent to a VIF of 5, since VIF = 1 / (1 - R-squared)). On the right plot, after combining the features appropriately, the R-squared values related to living area have decreased.
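
To make the link between R-squared and VIF concrete, here is a small NumPy sketch. The `vif` helper and the synthetic floor-area columns are assumptions made for illustration; they are not the project's actual features or code.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-ins: living area is nearly the sum of the two floors.
rng = np.random.default_rng(0)
first = rng.normal(1000, 200, 500)
second = rng.normal(500, 150, 500)
living = first + second + rng.normal(0, 50, 500)
lot = rng.normal(9000, 2000, 500)

before = vif(np.column_stack([first, second, living, lot]))
after = vif(np.column_stack([living, lot]))  # floors folded into one feature

print("VIF before:", before.round(1))
print("VIF after: ", after.round(1))
```

In this toy case, keeping only the combined living area removes the redundancy, which is the same effect the R-squared plots are tracking.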

Clustering Sub-categories

There are categorical variables whose sub-categories can be clustered together. Below we show an example:

In this plot we can see that all the irregular sub-categories (IR1 to IR3) have means very close to each other but far from regular (Reg). This is a hint that we should perhaps cluster the IRs together to reduce dimensionality after dummification.

In this particular example, it may be beneficial to group all the irregulars (IR1 to IR3) together into one big sub-category. After dummification, the feature space then stays relatively small compared with leaving the sub-categories unclustered. The clustering was not done manually but with K-means (despite it being an unsupervised method), by clustering the sub-categories according to a variable that was correlated with the target variable (in this dataset we used Gr living area).
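
One way this clustering could look in scikit-learn is sketched below. It assumes each sub-category is summarized by its mean Gr living area before clustering, which is a guess at the details; the column names and toy values are also illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data; values are made up so that IR1-IR3 sit close together.
df = pd.DataFrame({
    "LotShape": ["Reg"] * 4 + ["IR1"] * 3 + ["IR2"] * 2 + ["IR3"] * 2,
    "GrLivArea": [1200, 1250, 1180, 1220,
                  1700, 1750, 1720,
                  1760, 1780,
                  1800, 1760],
})

# Summarize each sub-category by its mean of the chosen variable.
level_means = df.groupby("LotShape")["GrLivArea"].mean()

# Cluster the sub-categories (not the houses) on that one-dimensional summary.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(level_means.to_numpy().reshape(-1, 1))

level_to_cluster = dict(zip(level_means.index, labels))
df["LotShapeGroup"] = df["LotShape"].map(level_to_cluster)
print(level_to_cluster)
```

Dummifying `LotShapeGroup` instead of `LotShape` then yields fewer columns. The number of clusters is a choice that still has to be made per variable.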

Note on Feature Engineering: Which Arithmetic Operation to Use? 

Feature engineering can be done through interaction. This interaction can be expressed as an arithmetic operation on two or more features. What we learnt was that one must take care in choosing the type of operation. Multiplication and addition, for instance, can lead to drastically different final model results. A good guideline that we arrived at is that one must always respect the natural physical units of the variables. For instance, number of garages and garage area, if combined, should be combined through multiplication and not addition. Addition in this case would not make physical sense, and as a matter of fact, a test of the two operations did show that multiplying the two resulted in a drastic decrease in VIF whereas adding them did not.

Another example that deserves a description is the behavior for each neighborhood. 

Different neighborhoods have different behavior in sale price. Each deserves to have its own model.

Looking at the neighborhood plots, we can see that each neighborhood behaves distinctly and each follows a very well-defined behavior. The neighborhoods warrant their own models. To achieve this, we created a switch-like interaction parameter by multiplying the dummified neighborhood categories by Gr living area. This way, instead of being a simple intercept shifter (which is what categorical variables do in generalized linear models), each neighborhood can have its own set of coefficients, that is, its own equation. Implementing this feature engineering led us to a drop in our Kaggle ranking.
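
The switch-like interaction can be sketched as follows. The two neighborhoods, the prices, and the column naming scheme are invented for the example; the fit uses plain least squares only to show that each neighborhood ends up with its own intercept and slope.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods with different price-per-area behavior.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "StoneBr", "StoneBr"],
    "GrLivArea": [1000.0, 1500.0, 1000.0, 1500.0],
    "SalePrice": [150000.0, 175000.0, 270000.0, 330000.0],
})

# Dummies act as per-neighborhood intercepts.
dummies = pd.get_dummies(df["Neighborhood"], prefix="Nbhd", dtype=float)

# Dummy times area is zero outside the neighborhood, so each product
# column carries a slope that only "switches on" for its own neighborhood.
slopes = dummies.mul(df["GrLivArea"], axis=0).add_suffix("_x_GrLivArea")

X = pd.concat([dummies, slopes], axis=1)
coef, *_ = np.linalg.lstsq(X.to_numpy(), df["SalePrice"].to_numpy(), rcond=None)
print(dict(zip(X.columns, coef.round(2))))
```

With many neighborhoods this doubles the number of neighborhood-related columns, so it works best together with a penalized model.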

The Pipeline

Our pipeline can be summarized as follows:

The data set is split into a train and test set where the train set is then sent into five models: three linear (Lasso, Ridge, Elastic Net) and two nonlinear (Random Forest, Gradient Boosting). An extensive grid search is performed for each model where the best hyperparameters are chosen. With the best hyperparameters, we use the models to predict on the test set and compare the test scores. The following shows a summary of the initial feature engineering performed using the pipeline outlined above:
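
A compact version of such a pipeline might look like the sketch below. It runs on synthetic linear data from `make_regression`, and the hyperparameter grids are small placeholders chosen for speed, not the grids we actually searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed housing features (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model name -> (estimator, placeholder hyperparameter grid).
search_space = {
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "elastic_net": (
        ElasticNet(max_iter=10000),
        {"alpha": [0.01, 0.1], "l1_ratio": [0.2, 0.8]},
    ),
    "random_forest": (
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
    "gradient_boosting": (
        GradientBoostingRegressor(random_state=0),
        {"learning_rate": [0.05, 0.1]},
    ),
}

test_mse = {}
for name, (model, grid) in search_space.items():
    # Cross-validated grid search on the train set only.
    search = GridSearchCV(model, grid, scoring="neg_mean_squared_error", cv=3)
    search.fit(X_train, y_train)
    # The refit best estimator is then scored once on the held-out test set.
    test_mse[name] = mean_squared_error(y_test, search.predict(X_test))

for name, mse in sorted(test_mse.items(), key=lambda item: item[1]):
    print(f"{name:>18}: {mse:.1f}")
```

Because the synthetic target is linear in the features, the comparison here favors the linear models by construction; it only demonstrates the mechanics of the pipeline.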

We tried many types of feature engineering and selection, but apart from the ones shown above (up to data set C, where the feature engineering steps are applied sequentially on top of each other starting with A), all of them yielded a worse Kaggle ranking, though our own test MSE score is not always consistent with the Kaggle ranking. Below we show the results for data sets A through D:

We can see here that the elastic net has a slight edge over all the other models. The linear models all perform much better than the non-linear tree models. This is a good verification of the statement made at the beginning, that linear models will always have their place. In this particular data set, the target variable behaves largely linearly with the features, which gives the linear models good reason to outperform the non-linear ones. All this being said, even with the linear models, our Kaggle and MSE scores can certainly still be improved. The reason we know this is from the plots below:

On the left are the test and train data set MSE scores for the penalized linear model, and on the right are those for random forest. Both exhibit signs of overfitting.

The plots above show that a huge discrepancy exists between the test and train data set MSE scores. For a tree model this may make sense because tree models tend to overfit (though of course the point of random forest is precisely to avoid this problem); however, the penalized linear model should have mitigated this problem, and it did not. This means that we can definitely improve on our feature selection and engineering. However, we tried numerous feature selection actions, while also taking into account the suggestions based on the feature importance shown below, and none of them improved our results.

The feature importance from Lasso (left) and Random Forest (right). Note that the feature importances differ between the two models. Random forest places more importance on continuous variables than Lasso does because high cardinality results in a bigger drop in error or entropy. This is why it is important to label encode as opposed to one-hot encode, though when we tested the two they yielded similar results (label encoding did give a slight edge).

Seeing the futility of our manual feature engineering, we chose to improve our models by brute force, recursively getting rid of features. The idea is demonstrated below:

The left schematic shows the procedure by which we recursively removed features. The right shows the MSE as features are sequentially removed. The sudden jump is most likely due to an important feature being removed.

The optimal number of features is indicated by the position where the test error suddenly jumps. With this recursive method, we were able to further improve our MSE score:

We can see that recursively removing features, i.e., trusting the machinery to do the job, does help. Error scores go down significantly after the recursive feature removal.
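
A minimal sketch of recursive feature removal is given below. It assumes one particular rule, dropping the feature with the smallest absolute Ridge coefficient at each step, and it runs on synthetic data; our actual procedure followed the schematic above and may differ in the details.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 12 features, only 4 of which carry signal (assumption).
X, y = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

remaining = list(range(X.shape[1]))
history = []  # (number of features kept, test MSE)

while remaining:
    model = Ridge(alpha=1.0).fit(X_train[:, remaining], y_train)
    mse = mean_squared_error(y_test, model.predict(X_test[:, remaining]))
    history.append((len(remaining), mse))
    # Drop the feature the current model relies on least.
    weakest = int(np.argmin(np.abs(model.coef_)))
    remaining.pop(weakest)

best_n, best_mse = min(history, key=lambda item: item[1])
for n_kept, mse in history:
    print(f"{n_kept:>2} features: test MSE = {mse:.1f}")
```

Comparing coefficient magnitudes only makes sense when the features are on a common scale, which holds for this synthetic data but would require standardization on real features.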

Finally, we chose to put everything together by ensembling all the different models. We did it as follows:

Diagram showing the ensembling technique we employed. This apparently is referred to as stacking.

The ensembling is simply a linear combination of the predicted values of the different models. The weights of the different models are chosen as the set of weights that minimizes the test error score. Submitting our final result to Kaggle, we got a final score of 0.1214.
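
The weighted combination can be sketched as a search over weights that sum to one. Everything here is a stand-in: the three "models" are just the true values plus noise, and the 0.05 grid step is an arbitrary choice for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
y_holdout = rng.normal(12.0, 0.4, size=200)  # e.g. log sale prices

# Stand-ins for three models' predictions: truth plus independent noise.
preds = np.column_stack([
    y_holdout + rng.normal(0, 0.10, 200),
    y_holdout + rng.normal(0, 0.12, 200),
    y_holdout + rng.normal(0, 0.20, 200),
])

def mse(weights):
    """MSE of the weighted combination of the model predictions."""
    blended = preds @ np.asarray(weights, dtype=float)
    return float(np.mean((blended - y_holdout) ** 2))

# All non-negative weight triples on a 0.05 grid that sum to one.
steps = np.linspace(0.0, 1.0, 21)
candidates = [
    (a, b, max(0.0, 1.0 - a - b))
    for a, b in itertools.product(steps, steps)
    if a + b <= 1.0 + 1e-9
]

best_weights = min(candidates, key=mse)
single_model_mse = [mse(w) for w in np.eye(3)]

print("best weights:", np.round(best_weights, 2))
print("blend MSE:", mse(best_weights), "single:", single_model_mse)
```

Choosing the weights on the same data that is used to report the final error tends to make that error look optimistic, so a separate validation split for the weight search is the safer variant.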

New Things to Try and My Conclusion

Being our first machine learning project, we certainly learnt a lot. First and foremost, we saw with our own eyes the power of linear models. This was a fact we saw coming. The second and tougher lesson was seeing the limitation of human intuition. The many hours of futile feature engineering are a memorable lesson for us. In these machine learning problems, there should always be a balance between reliance on human intuition and reliance on the machinery. We wasted too much time being faithful to the data set, trying to figure out what was statistically significant or not, and being too hesitant in dropping features. These actions might have been good if we had been decisive, but the problem is that the conclusions from EDA and statistical tests are never black and white; they seldom lead to actionable responses. What we should have done is look more quickly into the feature importance given by the linear and non-linear models while comparing that importance with a random dummy variable. At the same time, we should have spent more time looking into performing PCA on subsets of associated features. We clearly still suffered from multi-collinearity at the end despite all the effort in manual feature engineering. We needed to be cleverer with the machine learning techniques. And so the lesson here is clear to us. Certainly, we will be much better next time around.
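
The random dummy variable idea mentioned above could be tried along these lines. This is a sketch of something we did not do in the project: the data is synthetic, and the rule of keeping only features that beat the random column is one possible reading of the idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic features: columns 0 and 1 drive the target, 2 and 3 are noise.
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Append a purely random column as the reference ("dummy") variable.
probe = rng.normal(size=(n, 1))
X_probe = np.hstack([X, probe])

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_probe, y)

importances = forest.feature_importances_
probe_importance = importances[-1]

# Keep only the real features that are more important than random noise.
keep = [j for j in range(X.shape[1]) if importances[j] > probe_importance]
print("importances:", importances.round(3))
print("features kept:", keep)
```

A noise feature can beat the random column by chance in a single run, so repeating this with several random columns or seeds gives a steadier cutoff.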

About Author

Austin Cheng

Austin is an experienced researcher with a PhD in applied physics from Harvard University. His most notable work is engineering the first single electronic guided mode and explaining it with computational simulation. He is passionate about the growing...