Exploring the Financial Statements of US Companies

Posted on Dec 11, 2017

“You have to understand accounting and you have to understand the nuances of accounting. It’s the language of business and it’s an imperfect language, but unless you are willing to put in the effort to learn accounting – how to read and interpret financial statements – you really shouldn’t select stocks yourself” – Warren Buffett

After spending 9 years staring at financial statements and the like, I still don’t fully understand accounting.

I spent that time planning and analyzing the financials of small- and mid-cap companies. These companies were deeply complex and had plenty of interesting nuances, but I knew there was a vast world of companies and industries that I hadn’t explored.

With that in mind, I built an exploratory data analysis tool in R using data.table and ggplot2 and an interactive dashboard in Shiny.

The analysis explores the SEC’s corpus of financial statement data, representing the majority of US public companies from 2012 to 2016. For consistency and dimension reduction, I focused on the 10-K filings, which are the primary annual reports for public companies.

In the end, I was only able to scratch the surface of the insights and trends that could be found in the data. Visit my GitHub project here if you’d like to go deeper.

Let’s get started.

The Data

The raw data was originally 8GB, spread across 20 quarters of separate raw text files for company submissions, numeric values, descriptions of the values, and the presentation layout of the values.

Using the data.table library in R, which is quite fast on large datasets, I joined the tables and filtered the result down to a more manageable 200MB flat table. Along the way I removed quarterly reports, less common financial statement accounts and other extraneous info, and the presentation-related layout of the financial statements. The original files can be found on the sec.gov site here.
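The join-and-filter step can be sketched roughly as follows. This is a small Python illustration, not the data.table code behind the app. The column names (adsh, name, form, fy, tag, value), the list of accounts to keep, and the tiny inline samples are all assumptions standing in for the much larger tab-delimited submission and numeric files.

```python
import csv
import io

# Hypothetical miniature versions of the submission and numeric files.
# The real files are far larger; the column names here are assumptions.
SUB_TXT = (
    "adsh\tname\tform\tfy\n"
    "0001\tAlpha Corp\t10-K\t2015\n"
    "0002\tAlpha Corp\t10-Q\t2015\n"
    "0003\tBeta Inc\t10-K\t2015\n"
)
NUM_TXT = (
    "adsh\ttag\tvalue\n"
    "0001\tRevenues\t1200\n"
    "0001\tNetIncomeLoss\t150\n"
    "0001\tObscureTag\t7\n"
    "0002\tRevenues\t300\n"
    "0003\tRevenues\t800\n"
    "0003\tNetIncomeLoss\t-40\n"
)

# Accounts to keep (an assumed list); everything else counts as "less common".
KEEP_TAGS = {"Revenues", "NetIncomeLoss", "Assets", "StockholdersEquity"}


def read_tsv(text):
    """Parse tab-delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_flat_table(sub_rows, num_rows, keep_tags=KEEP_TAGS):
    """Join numeric values to annual (10-K) submissions on the filing id."""
    annual = {r["adsh"]: r for r in sub_rows if r["form"] == "10-K"}
    flat = []
    for n in num_rows:
        s = annual.get(n["adsh"])
        if s is None or n["tag"] not in keep_tags:
            continue  # drop quarterly filings and uncommon accounts
        flat.append({
            "name": s["name"],
            "fy": int(s["fy"]),
            "tag": n["tag"],
            "value": float(n["value"]),
        })
    return flat


flat_table = build_flat_table(read_tsv(SUB_TXT), read_tsv(NUM_TXT))
for row in flat_table:
    print(row)
```

Indexing the annual submissions by filing id first keeps the join to a single pass over the numeric rows, which is the same idea as a keyed join in data.table.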


Exploratory Data Visualization

First Look

The data represents financial statement disclosures from most of the public companies that filed in the United States from 2012 to 2016. While some companies may file in slightly different ways, the majority (especially the more established ones) report certain core items on their financial statements, such as Revenue, Net Income / (Loss), Assets, and Stockholders’ Equity.

Approximately 5,000 companies file 10-Ks with the SEC each year, although that number is declining:

On the other hand, all major accounts have grown since 2012, despite some volatility in Net Income:

Many might consider this a worrying trend: as the size of the market grows, the number of companies is shrinking. This leaves power in the hands of very few.

The largest companies dominate the market in terms of pure size – more than I realized. As seen from the chart below, the top 100 companies account for about 75% of total “market” revenue (“market” defined as those companies that reported “Revenues”). The top 30 companies account for 50% of revenue!
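A concentration figure like this comes from ranking companies by revenue and dividing the sum of the top N by the total. Here is a minimal Python sketch of that arithmetic. The function name and the sample revenues are made up for illustration and are not taken from the SEC data.

```python
def top_n_share(revenues, n):
    """Fraction of total revenue earned by the n largest companies."""
    ranked = sorted(revenues, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


# Made-up revenues for ten hypothetical companies (not SEC data).
sample_revenues = [500, 250, 100, 50, 40, 20, 15, 10, 10, 5]

share_top_1 = top_n_share(sample_revenues, 1)  # 500 / 1000
share_top_3 = top_n_share(sample_revenues, 3)  # 850 / 1000
print(f"Top 1: {share_top_1:.0%}, top 3: {share_top_3:.0%}")
```

In this toy sample a single company earns half of all revenue, which is the same kind of skew the chart shows for the top 30 filers.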


Peering at Overall Market Financials

Looking back at the growth in the major accounts, we notice that Net Income was particularly volatile while assets, book shareholder equity, and revenue were relatively stable.

Let’s take a closer look at Net Income as a use case for the Shiny app. I select the Financial St Data tab, then select “NetIncomeLoss” as my metric:

The line graph helps us see more clearly which industries drove the volatility in net income – before looking, can you guess which industry experienced the most? (hint: black gold)

There! Natural Resources have had a tough time recently, which makes sense with the volatility of the commodity markets, especially oil.

Financial services (top line) also bounced around quite a bit – I suspect it is due to the razor-thin base interest rates in recent times and a few major fines.

The Power Law of Business

We’ve seen that the size of the largest companies dominates the rest of the market, and you’d be right to suspect that there’s something exponential going on.

“In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.” - From Wikipedia

After examining the charts below, I think you’ll see the power law between various financial metrics. While the app allows you to see the values for individual companies, I display the charts here to illustrate that the percentage growth in one account is usually related to the percentage growth in another account.

Note that the first chart is on a standard scale where you can see how most companies play in a much smaller arena, while the second chart is on a log/log scale (where you can see the proportional growth of accounts):
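The reason a log/log scale helps is that a power law y = a * x^k becomes a straight line there: log(y) = log(a) + k * log(x), with the exponent k as the slope. As a rough Python sketch of that idea (not part of the app, and with a hypothetical function name), an ordinary least-squares fit on the logged values recovers the exponent. It is shown here on the square example from the quote above rather than on company data.

```python
import math


def fit_power_law(xs, ys):
    """Estimate a and k in y = a * x**k by least squares on log-log values."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    k = sxy / sxx  # slope of the line on the log/log chart
    a = math.exp(mean_y - k * mean_x)  # intercept, mapped back from logs
    return a, k


# The square example: side length vs. area, so the exponent should be 2.
sides = [1, 2, 4, 8]
areas = [s ** 2 for s in sides]
a, k = fit_power_law(sides, areas)
print(f"a = {a:.3f}, k = {k:.3f}")
```

Logarithms are only defined for positive numbers, so a fit like this would need to exclude zero or negative values (losses, for example) before it could be applied to real accounts.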


Thank You!

Anyways, this was a fun dip into a vast pool of public company data, made easier by R, data.table, ggplot2, and Shiny. In the future, I’d like to examine the underlying structure of how companies grow and shrink using machine learning.

For now though, I think these companies can rest easy:


We only took a sample of what’s in the data, so I encourage you to go to my GitHub repo, run the Shiny Dashboard app, and explore the data for yourself.
