Modeling Valuation Multiples of Companies with Gradient Boost
Motivation
The main goal of a financier in investment management is to build and manage a portfolio with a strong return on investment that ideally outperforms the market. One of the most central concepts in constructing robust stock portfolios is relative valuation: looking at how a publicly traded company is valued in the current market compared to its peer firms, a collection of companies with similar valuation characteristics. If we can select a cluster of peer firms and determine how one firm's valuation compares to the average of those peers, we can, at least in theory, predict how that firm's stock price should trend in the near term. That means a company overvalued relative to its peers should see its stock price go down, and conversely the stock price of an undervalued company should go up. It therefore becomes crucial to determine which firms are most similarly valued to each other at any given time in order to manage stock portfolios accurately and maximize investment returns.
Past academic research has utilized NLP-based techniques to first select peer firms in order to determine relative valuation. For example, in the 2010s some researchers developed methods of calculating similarity scores for companies based on their business descriptions filed with the Securities and Exchange Commission (SEC). In 2015, a group of researchers examined sequential search activities by investors on the SEC's EDGAR website, which enables querying company earnings reports, to determine which firms were most similar.
However, such techniques for peer firm selection fail to account for the wealth of quantitative fundamental data that describe the financial performance of companies and how it changes over time. If a machine learning model can first predict valuation metrics from earnings data and then assign peer firms, it should prove not only accurate but also more adaptable to changes in a company's performance and in the broader market.
In 2022, Paul Geertsema and Helen Lu from the University of Auckland published a paper that demonstrated how a gradient boosting model based on data reflecting fundamental earnings performance and stock volatility on a monthly basis can predict specific valuation metrics and determine how the valuation of firms in three major American stock exchanges (NYSE, NASDAQ and AMEX) evolves relative to their peers. A gradient boosting model, as will be explained later, is an accurate regression technique that can be trained on high-dimensional data with more interaction between features than a linear regression model can capture, while also avoiding overfitting the data like a standard decision tree model. That makes gradient boosting a strong candidate for predicting relative valuation for publicly traded companies.
The goal of this project is to replicate the research on a commercial scale by using different kinds of gradient boosting techniques to 1) predict valuation multiples from fundamental earnings data and 2) determine whether this method can accurately predict whether a company will correct its valuation in the short term based on its valuation relative to its peer firms.
Theoretical Background
In machine learning, boosting is a slow-learning technique that sequentially fits simple estimators to the residuals of the previous iteration and aggregates them into a composite ensemble model. The process for fitting a gradient boosting regressor is as follows:
- Start with a base model with the output 0, which means the residuals for all the training data points are the target values.
- Fit a weak estimator to the residuals to minimize the mean squared error (MSE). (In many cases, as in this project, that estimator is a depth-one decision tree with a single split and two output leaves, also known as a stump.)
- Update the model by adding the fitted estimator, scaled by a learning rate ε (usually a small value in the range of 0.001-0.01), to the current model. This is where the "gradient" aspect applies: much like gradient descent in linear regression, the learning rate prevents overshooting the MSE minimum.
- Calculate the new residuals of the updated model. These will become the new training set for the next cycle of model training.
- Repeat steps 2 through 4 for the desired number of iterations or until the mean squared error is below a specified threshold.
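To make these steps concrete, here is a minimal from-scratch sketch of the procedure using Scikit-Learn decision stumps as the weak estimators; the hyperparameter values shown are illustrative, not the ones used later in this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_estimators=500, learning_rate=0.01):
    """Minimal gradient boosting for MSE loss using depth-1 trees (stumps)."""
    prediction = np.zeros(len(y))                        # step 1: base model outputs 0
    stumps = []
    for _ in range(n_estimators):
        residuals = y - prediction                       # residuals of the current model
        stump = DecisionTreeRegressor(max_depth=1)       # step 2: fit a stump to the residuals
        stump.fit(X, residuals)
        prediction += learning_rate * stump.predict(X)   # step 3: scaled update
        stumps.append(stump)                             # step 4 happens at the top of the loop
    return stumps

def boosted_predict(stumps, X, learning_rate=0.01):
    """Sum the scaled contributions of all fitted stumps."""
    return learning_rate * sum(stump.predict(X) for stump in stumps)
```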
The advantage of ensemble-based machine learning models is that building a composite model out of weaker estimators carries less risk of overfitting the training data. A single, more complex decision tree, on the other hand, is more likely to overfit the training data and generalize poorly to the testing data, resulting in a model with poor predictive value. Gradient boosting also improves performance by sequentially training each weak estimator on the residuals rather than on the target values themselves, which gives the data points with the worst errors greater weight in the model as more estimators are fitted.
Another version of gradient boosting, known as extreme gradient boosting (XGBoost), runs a similar algorithm but adds regularization to further optimize the fit on both the training and testing data. Much like how Lasso regression penalizes model complexity by adding a term to the optimization function that grows with the magnitude of the coefficients, XGBoost augments the MSE objective with terms that penalize both the number of leaves in the decision tree estimators and the output values of those leaves. Below is a formulation of the objective function that combines the MSE with the regularization terms.
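(This is the standard regularized objective with squared-error loss; the notation follows the XGBoost documentation rather than the original paper.)

\[
\mathcal{L} \;=\; \sum_{i} \left(y_i - \hat{y}_i\right)^2 \;+\; \sum_{k} \Omega(f_k),
\qquad
\Omega(f) \;=\; \gamma T \;+\; \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2
\]

Here \(T\) is the number of leaves in a tree \(f\), \(w_j\) are the leaf output values, and \(\gamma\) and \(\lambda\) control how strongly the leaf count and the leaf outputs are penalized.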
Model Development: Methodology and Results
Data Processing
In order to have an expansive distribution of all the data variables, I queried data from companies listed in two of the major American stock exchanges, the New York Stock Exchange (NYSE) and Nasdaq, that had earnings reports available from at least the last 10 years, using the QuickFS database API. Earnings reports are broken down into three components: the income statement for detailing revenues and expenses, the balance sheet for outlining assets and liabilities, and the cash flow statement for showing how cash and cash equivalents were earned and spent through various business activities. An example of an income statement for Microsoft from Q1 (the first fiscal quarter) of 2024 is shown below. It outlines Microsoft's total revenue as well as how much the company spent in various categories during that period.
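As a rough sketch of the data pull (the endpoint path, authentication parameter, and ticker format below are assumptions based on QuickFS's public API; consult their documentation for the exact interface):

```python
import requests

API_KEY = "YOUR_QUICKFS_API_KEY"                 # assumed: a personal QuickFS API key
BASE_URL = "https://public-api.quickfs.net/v1"   # assumed base URL for the QuickFS public API

def fetch_company_data(ticker: str) -> dict:
    """Pull the full set of available fundamentals (income statement, balance
    sheet, cash flow statement) for one company from the QuickFS API."""
    url = f"{BASE_URL}/data/all-data/{ticker}"    # assumed endpoint path
    response = requests.get(url, params={"api_key": API_KEY}, timeout=30)
    response.raise_for_status()
    return response.json()

# Example: Microsoft (QuickFS tickers are typically suffixed with a country code)
msft = fetch_company_data("MSFT:US")
```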
After consolidating all available fundamental data from the last 10 years, I annualized the features from the income and cash flow statements. Those data fields pertain to the flow of money earned and spent, which makes them susceptible to seasonal variance; summing the values over four-quarter windows evens out those short-term fluctuations and gives a clearer picture of how a company earns and spends money over a longer period. I then cleaned the data further by removing any rows that shared a company and time period but carried different data values, which likely indicated erroneously entered data.
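A minimal pandas sketch of the annualization and de-duplication steps, assuming a long-format DataFrame with hypothetical column names (company, period_end, and flow fields such as revenue and capex):

```python
import pandas as pd

FLOW_COLUMNS = ["revenue", "cogs", "sga", "rnd", "capex"]  # hypothetical flow fields

def annualize_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Replace quarterly income/cash flow values with trailing four-quarter sums."""
    df = df.sort_values(["company", "period_end"])
    rolled = (
        df.groupby("company")[FLOW_COLUMNS]
          .rolling(window=4, min_periods=4)
          .sum()
          .reset_index(level=0, drop=True)
    )
    out = df.copy()
    out[FLOW_COLUMNS] = rolled
    return out

def drop_conflicting_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then remove any (company, period) pairs that
    still appear more than once, i.e. duplicates whose data values disagree."""
    df = df.drop_duplicates()
    return df.drop_duplicates(subset=["company", "period_end"], keep=False)
```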
With the queried earnings data cleaned up, I then engineered the input features used to train the model to predict company valuation, using the full table in Geertsema and Lu, Appendix A.1 (pp. 367-369) as a guide. These features are common accounting ratios that describe aspects of a company's finances, such as the relative strength of its assets and liabilities or how efficiently it spends money and accrues debt to earn its revenue. I omitted some of their features from my model: beta, a measure of stock price volatility that introduces a degree of external risk not captured by earnings data, and any metrics relying on inventory, which were prone to many missing values.
The target features of the model were engineered to reflect three key valuation ratios: market capitalization to book equity, enterprise value to sales, and enterprise value to total assets. To tighten the distribution and lessen the impact of outliers, I replicated the approach of Geertsema and Lu by setting the natural log of each ratio as the target feature. After engineering these features, I dropped all data rows with null values, whether caused by division by zero or by taking the natural log of a non-positive value, and then filtered the dataset to companies with at least 20 quarters' worth of non-null data.
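A sketch of the feature and target engineering under assumed column names (revenue, total_assets, ebit, and so on); only a handful of the input ratios are shown:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute a few of the accounting-ratio inputs and the log-multiple target."""
    out = df.copy()
    # Input features (a small subset of the Geertsema & Lu ratios)
    out["asset_turnover"] = out["revenue"] / out["total_assets"]
    out["ebit_margin"] = out["ebit"] / out["revenue"]
    out["operating_leverage"] = (out["cogs"] + out["sga"]) / out["total_assets"]
    out["rnd_to_sales"] = out["rnd"] / out["revenue"]
    out["gross_margin"] = (out["revenue"] - out["cogs"]) / out["revenue"]
    # Target: natural log of enterprise value to sales; non-positive ratios become NaN
    ev_to_sales = out["enterprise_value"] / out["revenue"]
    out["log_ev_to_sales"] = np.log(ev_to_sales.where(ev_to_sales > 0))
    # Drop rows that picked up nulls along the way
    return out.dropna(subset=["log_ev_to_sales"])
```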
Model Training
Next, I trained two tree-based gradient boosting models, one using Scikit-Learn's GradientBoostingRegressor (referred to below as GradBoost) and another using the XGBoost Python package. The input features included all the aforementioned engineered features from Geertsema and Lu's paper, the business sector, and the year and quarter of the earnings period. Categorical features like the sector and earnings quarter were encoded ordinally for the GradBoost model, whereas XGBoost accepts them natively in their raw categorical form. The target feature was the natural log of the enterprise value (EV) to sales ratio.
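A minimal sketch of the two model setups, assuming X is a DataFrame of the engineered features (plus sector, year, and quarter columns) and y is the log EV-to-sales target from the previous step:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

categorical_cols = ["sector", "quarter"]   # assumed names of the categorical columns

# GradBoost: categorical columns must be encoded ordinally first
X_sk = X.copy()
X_sk[categorical_cols] = OrdinalEncoder().fit_transform(X_sk[categorical_cols])
gradboost = GradientBoostingRegressor(random_state=0).fit(X_sk, y)

# XGBoost: categorical columns can be passed natively as pandas 'category' dtype
X_xgb = X.copy()
X_xgb[categorical_cols] = X_xgb[categorical_cols].astype("category")
xgb = XGBRegressor(enable_categorical=True, tree_method="hist", random_state=0).fit(X_xgb, y)
```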
To optimize each model, I ran a grid search over a specific set of hyperparameter values and chose the combination that yielded the best five-fold cross-validation R-squared score. The table below shows which hyperparameters were tuned for each model and what values were tested.
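The grid search itself (continuing from the snippet above) can be run with Scikit-Learn's GridSearchCV; the grid values here are placeholders, since the values actually tested are listed in the table:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {                      # placeholder values; the tested grid is in the table above
    "n_estimators": [200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
}
search = GridSearchCV(
    XGBRegressor(enable_categorical=True, tree_method="hist", random_state=0),
    param_grid,
    scoring="r2",                   # five-fold cross-validated R-squared
    cv=5,
    n_jobs=-1,
)
search.fit(X_xgb, y)
print(search.best_params_, search.best_score_)
```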
Running the grid search to tune the models yielded the optimized hyperparameters shown below, along with the corresponding R-squared scores and cross-validation times.
Ultimately, I chose XGBoost as the optimal model. Although its R-squared score was about 1% lower than GradBoost's, the XGBoost model was more than 50 times faster to train and cross-validate, a trade-off well worth the very slight loss in cross-validation score. The model used for the rest of this project is therefore the tuned XGBoost model with the above hyperparameters.
Feature Importance
With the tuned XGBoost model in hand, I used a SHapley Additive exPlanations (SHAP) analysis to determine which features have the greatest impact on predicting EV-to-sales valuation. SHAP is a technique rooted in cooperative game theory that seeks to explain how data features additively contribute to the predicted value of a target variable. The greater the mean absolute SHAP value for a given feature, the larger that feature's average contribution to the model's output. Below is a plot showing the mean absolute SHAP values for the top features, i.e. how much each feature shifts the predicted log of EV-to-sales on average.
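A sketch of the SHAP computation on the tuned model (continuing from the snippets above); depending on the SHAP version, natively categorical columns may first need to be encoded numerically:

```python
import shap

# TreeExplainer works directly with fitted tree ensembles such as XGBoost models
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_xgb)

# Bar plot of mean |SHAP value| per feature, i.e. each feature's average contribution magnitude
shap.summary_plot(shap_values, X_xgb, plot_type="bar")
```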
The top five features identified by the SHAP analysis were also what Geertsema and Lu found to be most impactful in predicting EV-to-sales in their research:
- Asset turnover - a measure of efficiency in how a company uses its assets to generate revenue, calculated as the ratio of sales to assets.
- EBIT margin - also known as operating margin, defined as the ratio of earnings before interest and taxes to sales. This can better reflect the return on sales than profit margin since it removes the impact of non-business expenses.
- Operating leverage - the ratio of the sum of COGS (cost of goods sold) and SG&A (selling, general and administrative expenses) to total assets. This efficiency metric measures how much a company's operating expenditures translate into growth.
- R&D to sales - the ratio of R&D (research and development) spending to revenue, reflecting how heavily a company invests in R&D to drive revenue growth
- Gross profit margin - similar to EBIT margin, but more simply the ratio of gross profit (revenue minus COGS) to revenue
These five financial ratios fall under the categories of efficiency (asset turnover and operating leverage), profitability (EBIT and gross profit margins) and growth (R&D to sales), so we can see why these metrics would have an outsized effect on business valuation. If a company can efficiently utilize its assets to generate sales, maintain strong profitability even after accounting for interest and taxes, and/or grow its revenue through strong investment in R&D, its enterprise value can be expected to grow more strongly relative to its sales.
Corrections in Relative Valuation
Now that we have a strongly predictive model whose feature importance is also better understood, we can turn to our next question: do companies across the market tend to see corrections in their relative valuation over time? In other words, does the actual valuation metric of a company generally trend towards the predicted value over time, diminishing the residual between the actual and predicted quantities?
The principle is simple: an overvalued company whose EV-to-sales is greater than what the model predicts should see it decrease, while an undervalued company with a lower EV-to-sales than predicted should see it increase. Geertsema and Lu were able to demonstrate that their technique of modeling relative valuation could also predict corrections in stock price, which in turn informed how to construct stock portfolios with strong returns on investment. While we don't have a metric directly reflecting stock price, we can verify whether the valuation metric predicted by our model follows a similar principle as in the research.
To examine this hypothesis, I calculated the target variable residuals, i.e. the difference between the predicted and actual log of EV-to-sales values, for each firm across all quarters with available data. Overvalued companies for a given quarter had negative residuals, while undervalued companies had positive residuals. Then I examined the quintiles of the residuals, which are calculated as the 0.2, 0.4, 0.6 and 0.8 quantiles, for each quarter to see how they trended over the entire timespan encompassed by the dataset.
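A sketch of that computation, assuming a hypothetical results DataFrame with one row per company and filing period holding the actual and model-predicted log EV-to-sales:

```python
import matplotlib.pyplot as plt

# residual > 0: undervalued (predicted above actual); residual < 0: overvalued
results["residual"] = results["predicted"] - results["actual"]

quintile_bounds = (
    results.groupby("period")["residual"]
           .quantile([0.2, 0.4, 0.6, 0.8])
           .unstack()                 # one column per quantile, one row per filing quarter
)
quintile_bounds.plot(title="Quintile boundaries of valuation residuals by quarter")
plt.show()
```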
Overall, the quintiles of the residuals follow similar patterns over the last ten full years of earnings without trending to zero over any period, which means we cannot conclusively say that predicting EV-to-sales with this model also helps determine whether a company corrects its valuation based on that metric alone. It is important to note that a given quintile may not be represented by the same company across all filing quarters. However, the comparison still matters because we would expect multiple companies across the broader market to see corrections under the same market conditions, even if there's variance in their specific fundamentals.
(One noteworthy trend involves the 0.2 quantile of the valuation residuals, which represents the most overvalued companies. That quintile dips to markedly more negative values over the course of 2020 than the other quintiles before climbing back up over the following two years. This is likely because sales dropped so precipitously during the pandemic that the actual EV-to-sales sat well above what the fundamentals implied for those filing periods.)
To better understand the likelihood of a given company's relative valuation trending in the correct direction, I plotted the distribution across all firms of the probability that a firm's log EV-to-sales residual moves in the correct direction (negative residuals go up, positive ones go down) from one filing quarter to the next. This helps distinguish whether a firm's changes in relative valuation are short-term fluctuations or real corrections over the long run. If our method of predicting relative valuation were no better than randomly picking the trend to be up or down, we would expect the average probability of a correction between two consecutive quarters to be around 50%.
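A sketch of that per-firm probability, continuing with the hypothetical results DataFrame from above:

```python
import matplotlib.pyplot as plt

def correction_probability(group):
    """Fraction of consecutive-quarter transitions in which a firm's residual
    moves toward zero (negative residuals rise, positive residuals fall)."""
    g = group.sort_values("period")
    residual = g["residual"]
    change = residual.diff().shift(-1)   # change from this quarter to the next
    corrected = ((residual > 0) & (change < 0)) | ((residual < 0) & (change > 0))
    return corrected[change.notna()].mean()

probabilities = results.groupby("company").apply(correction_probability)
print(probabilities.mean())              # roughly 0.6 on this project's data, per the discussion below
probabilities.plot(kind="hist", bins=30, title="Probability of quarter-to-quarter correction")
plt.show()
```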
In fact, the average probability of a firm correcting its relative valuation between two consecutive quarters is around 60%, a clear improvement over randomly guessing whether that quantity will trend in the correct or wrong direction. However, we cannot conclusively say that this method of predicting relative valuation also predicts corrections in relative valuation in a statistically significant manner. The distribution of probabilities across all firms in our data spans roughly 45% to 80%, which is too broad to conclude definitively that the mean probability of 60% is significantly better than random chance. Therefore, while our model is a valuable method of predicting valuation multiples from fundamental economic data in quarterly earnings reports, more work is needed to find a robust way to predict corrections in relative valuation, manifested as the residuals trending toward zero.
Conclusions and Future Work
In summation, I have demonstrated that an XGBoost model, which combines gradient boosting with regularization, can accurately describe a given valuation metric for companies across the NYSE and NASDAQ. This model identifies the key features that are most determinative of a company's valuation, and those features are confirmed by peer-reviewed research. Finally, while I do not yet see this model predicting corrections in a firm's valuation relative to the market significantly better than random chance, there are further avenues I can explore to assess trends in relative valuation.
There are several avenues we can pursue to expand the scope of this work:
- Build upon data volume by including more firms from other major stock indices and over a greater span of time, then verify the statistical significance of the model fitting results
- Similar to Geertsema and Lu, construct ML models for different valuation metrics such as market capitalization to book equity or enterprise value to total assets and assess their ability to predict corrections in relative valuation
- Include features pertaining to stock price volatility so that the models can account for systematic risk beyond fundamental earnings data
- Construct separate ML models across all firms for each earnings period as an alternate way to assess trends in relative valuation, which can work best with a greater volume of data