Data Study on U.S. Residential Energy Use
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January-May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.
Data shows the residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.
The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented - a comprehensive overview of RECS can be found in the program’s technical documentation.
This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:
- Description - identification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
- Prediction - feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage
EDA and Data Pre-Processing
RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to cover a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.
Since most of the independent variables were collected through occupant interviews - as opposed to exact physical examination of the residence - the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:
Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained/imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low r-statistics in correlation to kWh and having non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots.
Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.
As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous.
Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot-encoding).
In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the mid range of each bin as number of years). This factor-to-integer recoding was applied to additional binned variables relating to age of appliances and frequency of use of those appliances.
New factor variables were also created by consolidating information from existing factors into one-hot-encoded binary values such as presence of heated pool, use of electricity for space heating, etc.
We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.
To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features - those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.
Despite variable recoding and de-correlation using the methods previously described, a large number of features remained in the dataset, the majority of them being factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Therefore, feature reduction was continued instead through the use of an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive non-significant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.
Modeling With H2o
The processed and cleaned dataset was migrated to open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted in slower native R packages (caret, glm) in the local R environment, R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed among multiple clusters.
H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times - lessening the large burden of computational speed in the model fitting and tuning process. It also provides handy, built-in functions for pre-processing such as training/validation/test set splitting, in this case chosen to be 70/20/10.
To best understand variable importance and interpret linear effects of the features on kWh usage, a simple multivariate regression model was fit using h2o’s glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large - an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely significant statistical evidence of a relationship to kWh, with an additional ~100 of the features (not shown) also meeting significance at the p = 0.05 threshold.
Variables beginning with “hasElec” were feature engineered (previously described), validating the approach of using domain knowledge to add or recode valuable information in new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot-encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh; each additional TOTCSQFT (air conditioned square feet, numeric) adds 1.146 kWh/year.
ElasticNet Regression with Grid Search
While default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random search functionality to ensure that computation time is not wasted for very small, marginal gains in validation accuracy.
A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric - in this case root-mean-squared-error (RMSE) of kWh prediction on the validation set - and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining 70% train and 20% validation sets) and performance was evaluated on the 10% holdout test data, which had not yet been seen by the model.
Gradient Boosted Machine
To increase predictive accuracy over generalized linear model-based approaches, I used h2o’s implementation of gradient boosted machines. Boosting, a nonlinear tree-based approach, sequentially fits many decorrelated decision trees to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:
- learn_rate - how much error should each tree “remember” from the previous
- max_depth - max tree depth (rule of thumb - sqrt of # of features)
- sample_rate - percentage of rows/observations to be chosen to fit a given tree
- col_sample_rate - percentage of columns to be chosen to fit a given tree
Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was re-trained on the full 90% of training data and tested on the final 10% holdout split.
Results from the two modeling strategies are summarized in the table below.
Additionally, predicted vs actual kWh values for either modeling strategy are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.
Final predictive accuracy was similar for both linear-based (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over an RMSE of 7,560, which was established as a baseline accuracy by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used - using domain knowledge to feature engineer highly significant variables and grid search for hyperparameter tuning.
Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energy-related classification questions about their customers - type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.