Exploring the Data on Financial Statements of US Companies
“You have to understand accounting and you have to understand the nuances of accounting. It’s the language of business and it’s an imperfect language, but unless you are willing to put in the effort to learn accounting – how to read and interpret financial statements – you really shouldn’t select stocks yourself” – Warren Buffett
Introduction
After spending 9 years staring at data from financial statements and the like, I still don’t fully understand accounting.
I spent my time planning and analyzing the financials for small- and mid-cap companies. These companies were deeply complex and had plenty of interesting nuances, but I knew there was a vast world of companies and industries that I hadn’t explored.
With that in mind, I built an exploratory data analysis tool in R using data.table and ggplot2 and an interactive dashboard in Shiny.
The analysis explores the SEC’s corpus of financial statement data, representing the majority of US public companies from 2012 to 2016. For consistency and dimension reduction, I focused on the 10-K filings, which are the primary annual reports for public companies.
In the end, I was only able to scratch the surface of the insights and trends that could be found in the data. Visit my GitHub project here if you’d like to go deeper.
Let’s get started.
The Data
The data was originally 8GB, covering 20 quarters, with separate raw text files for company submissions, numeric values, descriptions of the values, and the presentation layout of the values.
Using the data.table library in R, which is quite fast for large datasets, I joined the tables and filtered the result down to a more manageable 200MB flat table. Along the way I removed quarterly reports, less common financial statement accounts and other extraneous info, and the presentation-related layout of the financial statements. The original files can be found on the sec.gov site here.
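The join-and-filter step described above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the R and data.table code used for the project. The field names (adsh, form, tag, value), the list of accounts to keep, and every row are assumptions made up for the example, not taken from the actual pipeline.

```python
# Hypothetical miniature versions of the submission and numeric tables.
# All rows are invented for illustration.
submissions = [
    {"adsh": "0001", "name": "Alpha Corp", "form": "10-K", "fy": 2015},
    {"adsh": "0002", "name": "Alpha Corp", "form": "10-Q", "fy": 2015},
    {"adsh": "0003", "name": "Beta Inc", "form": "10-K", "fy": 2015},
]
numbers = [
    {"adsh": "0001", "tag": "Revenues", "value": 500.0},
    {"adsh": "0001", "tag": "ObscureAccount", "value": 1.0},
    {"adsh": "0002", "tag": "Revenues", "value": 120.0},
    {"adsh": "0003", "tag": "NetIncomeLoss", "value": -30.0},
]

# Assumed list of "common" accounts to keep.
KEEP_TAGS = {"Revenues", "NetIncomeLoss", "Assets", "StockholdersEquity"}


def build_flat_table(submissions, numbers, keep_tags=KEEP_TAGS):
    """Join numeric values to their submission and keep only annual
    (10-K) filings and the selected accounts."""
    # Index annual filings by submission id so the join is a dict lookup.
    annual = {s["adsh"]: s for s in submissions if s["form"] == "10-K"}
    flat = []
    for n in numbers:
        sub = annual.get(n["adsh"])
        if sub is None or n["tag"] not in keep_tags:
            continue  # quarterly filing or uncommon account: drop it
        flat.append({
            "name": sub["name"],
            "fy": sub["fy"],
            "tag": n["tag"],
            "value": n["value"],
        })
    return flat


flat_table = build_flat_table(submissions, numbers)
for row in flat_table:
    print(row)
```

Indexing the filings by id before looping over the much larger numeric table keeps the join to a single pass, which is the same idea a keyed join in data.table relies on.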
Exploratory Data Visualization
First Look
The data represents financial statement disclosures from most of the public companies that file in the United States from 2012 to 2016. While some companies may file in slightly different ways, the majority (especially the more established ones) report certain common items on their financial statements, such as Revenue, Net Income / (Loss), Assets, and Stockholders’ Equity.
Approximately 5,000 companies file 10-Ks with the SEC each year, although that number is declining:
On the other hand, all major accounts have grown since 2012, despite some volatility in Net Income:
Many might consider this a worrying trend: as the size of the market grows, the number of companies is shrinking. This leaves power in the hands of very few.
The largest companies dominate the market in terms of pure size – more than I realized. As seen from the chart below, the top 100 companies account for about 75% of total “market” revenue (“market” defined as those companies that reported “Revenues”). The top 30 companies account for 50% of revenue!
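For readers who want to reproduce this kind of concentration figure, here is a small sketch of the calculation in plain Python. The function names and the sample revenues are hypothetical; only the idea (rank companies by revenue, then take the cumulative share) comes from the chart described above.

```python
def top_n_share(revenues, n):
    """Fraction of total revenue held by the n largest companies."""
    ranked = sorted(revenues, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


def companies_needed(revenues, target_share):
    """Smallest number of top companies whose combined revenue
    reaches target_share (e.g. 0.5) of the total."""
    ranked = sorted(revenues, reverse=True)
    total = sum(ranked)
    running = 0.0
    for count, revenue in enumerate(ranked, start=1):
        running += revenue
        if running >= target_share * total:
            return count
    return len(ranked)


# Made-up revenues for ten companies (total = 1000).
sample_revenues = [400, 250, 150, 80, 50, 30, 20, 10, 6, 4]

print(top_n_share(sample_revenues, 3))         # share held by the top 3
print(companies_needed(sample_revenues, 0.5))  # companies needed for 50%
```

On the real data, the same two functions would be applied to the reported "Revenues" values for a single fiscal year.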
Peering at Overall Market Financials
Looking back at the growth in the major accounts, we notice that Net Income was particularly volatile while assets, book shareholder equity, and revenue were relatively stable.
Let’s take a closer look at Net Income as a use-case for the Shiny app. I select the Financial St Data tab, then select “NetIncomeLoss” as my metric:
The line graph helps us see more clearly which industries drove the volatility in net income – before looking, can you guess which industry experienced the most? (hint: black gold)
There! Natural Resources have had a tough time recently, which makes sense with the volatility of the commodity markets, especially oil.
Financial services (top line) also bounced around quite a bit – I suspect it is due to the razor-thin base interest rates in recent times and a few major fines.
The Power Law of Business
We’ve seen that the size of the largest companies dominates the rest of the market, and you’d be right to suspect that something far from linear is going on.
“In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.” (from Wikipedia)
After examining the charts below, I think you’ll see the power law between various financial metrics. While the app allows you to see the values for individual companies, I display the charts here to illustrate that a percentage change in one account is usually associated with a proportional percentage change in another account.
Note that the first chart is on a standard scale where you can see how most companies play in a much smaller arena, while the second chart is on a log/log scale (where you can see the proportional growth of accounts):
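The reason a log/log scale helps: if y = c * x^k, then log(y) = log(c) + k * log(x), so a power law shows up as a straight line whose slope is the exponent k. The sketch below fits that line with ordinary least squares in plain Python. The function name and the synthetic "assets" and "revenue" numbers are assumptions for illustration; they are not values from the SEC data, and this is not necessarily how the charts were produced.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**k by least squares on log(x), log(y).

    Returns (c, k). Pairs with non-positive values are skipped,
    since their logarithm is undefined (e.g. a net loss).
    """
    pairs = [(math.log(x), math.log(y))
             for x, y in zip(xs, ys) if x > 0 and y > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive (x, y) pairs")
    n = len(pairs)
    mean_lx = sum(lx for lx, _ in pairs) / n
    mean_ly = sum(ly for _, ly in pairs) / n
    sxx = sum((lx - mean_lx) ** 2 for lx, _ in pairs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in pairs)
    k = sxy / sxx                          # slope on the log/log chart
    c = math.exp(mean_ly - k * mean_lx)    # intercept, back on the raw scale
    return c, k


# Synthetic data that follows an exact power law: revenue = 2 * assets**0.9
assets = [1e6, 1e7, 1e8, 1e9]
revenue = [2.0 * a ** 0.9 for a in assets]
c, k = fit_power_law(assets, revenue)
print(round(c, 6), round(k, 6))

# The square example from the quote: area = side**2, so the exponent is 2.
sides = [1.0, 2.0, 4.0, 8.0]
areas = [s * s for s in sides]
c_sq, k_sq = fit_power_law(sides, areas)
print(round(c_sq, 6), round(k_sq, 6))
```

Real company data will scatter around the line rather than sit on it, so the fitted exponent is only a summary of the average proportional relationship.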
Thank You!
Anyway, this was a fun dip into a vast pool of public company data, made easier by R, data.table, ggplot2, and Shiny. In the future, I’d like to examine the underlying structure of how companies grow and shrink using machine learning.
For now though, I think these companies can rest easy:
We only took a sample of what’s in the data, so I encourage you to go to my GitHub repo, run the Shiny dashboard app, and explore the data for yourself: