Exploring the Data on Financial Statements of US Companies

Posted on Dec 11, 2017
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
“You have to understand accounting and you have to understand the nuances of accounting. It’s the language of business and it’s an imperfect language, but unless you are willing to put in the effort to learn accounting – how to read and interpret financial statements – you really shouldn’t select stocks yourself” – Warren Buffett


After spending 9 years staring at data from financial statements and the like, I still don’t fully understand accounting.

I spent that time planning and analyzing the financials of small- and mid-cap companies. These companies were deeply complex and had plenty of interesting nuances, but I knew there was a vast world of companies and industries that I hadn’t explored.

With that in mind, I built an exploratory data analysis tool in R using data.table and ggplot2 and an interactive dashboard in Shiny.

The analysis explores the SEC’s corpus of financial statement data, representing the majority of US public companies from 2012 to 2016. For consistency and dimension reduction, I focused on the 10-K filings, which are the primary annual reports for public companies.

In the end, I was only able to scratch the surface of the insights and trends that could be found in the data. Visit my GitHub project here if you’d like to go deeper.

Let’s get started.

The Data

The data was originally 8GB, spread across separate raw text files for company submissions, numeric values, descriptions of the values, and the presentation layout of the values, covering 20 quarters.

I filtered it down to a more manageable 200MB flat table using the data.table library in R, which is quite fast for large datasets. I joined the tables and removed quarterly reports, less common financial statement accounts and other extraneous info, and the presentation-related layout of the financial statements. The original files can be found on the sec.gov site here.
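For readers who want a feel for that filtering step, here is a minimal sketch in Python using only the standard library. It is not the R/data.table code behind this project. The tiny inline tables, the column names (adsh, name, form, fy, tag, value), and the helper names read_tsv and build_flat_table are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical, tiny stand-ins for the tab-delimited submission and numeric
# files. Column names and values are assumptions for illustration only.
SUB_TXT = (
    "adsh\tname\tform\tfy\n"
    "0001\tACME CORP\t10-K\t2015\n"
    "0002\tACME CORP\t10-Q\t2015\n"
    "0003\tBETA INC\t10-K\t2015\n"
)
NUM_TXT = (
    "adsh\ttag\tvalue\n"
    "0001\tRevenues\t500\n"
    "0001\tObscureTag\t7\n"
    "0002\tRevenues\t120\n"
    "0003\tNetIncomeLoss\t-40\n"
)

# Accounts to keep; everything else is treated as "less common".
KEEP_TAGS = {"Revenues", "NetIncomeLoss", "Assets", "StockholdersEquity"}


def read_tsv(text):
    """Parse tab-delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_flat_table(sub_rows, num_rows, keep_tags=KEEP_TAGS):
    """Join numeric values to annual (10-K) submissions, keeping common tags."""
    # Index annual filings by submission id so the join is a dict lookup.
    annual = {row["adsh"]: row for row in sub_rows if row["form"] == "10-K"}
    flat = []
    for num in num_rows:
        sub = annual.get(num["adsh"])
        if sub is None or num["tag"] not in keep_tags:
            continue  # drops quarterly filings and less common accounts
        flat.append({
            "name": sub["name"],
            "fy": int(sub["fy"]),
            "tag": num["tag"],
            "value": float(num["value"]),
        })
    return flat


flat_table = build_flat_table(read_tsv(SUB_TXT), read_tsv(NUM_TXT))
for row in flat_table:
    print(row)
```

The idea is the same as a keyed join: filter the submissions first, then look up each numeric row against the much smaller set of annual filings.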


Exploratory Data Visualization

First Look

The data represents financial statement disclosures from most of the public companies that filed in the United States from 2012 to 2016. While some companies may file in slightly different ways, the majority (especially the more established ones) report certain accounts on their financial statements, such as Revenue, Net Income / (Loss), Assets, and Stockholders’ Equity.

Approximately 5,000 companies file 10-Ks with the SEC each year, although that number is declining:

On the other hand, all major accounts have grown since 2012, despite some volatility in Net Income:

Many might consider this a worrying trend: as the size of the market grows, the number of companies is shrinking. This leaves power in the hands of very few.

The largest companies dominate the market in terms of pure size – more than I realized. As seen from the chart below, the top 100 companies account for about 75% of total “market” revenue (“market” defined as those companies that reported “Revenues”). The top 30 companies account for 50% of revenue!
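The concentration figure is simple to compute once each company's revenue is in a list. Below is a small Python sketch of the calculation, not the code behind the chart. The function name top_n_share and the revenue numbers are made up for illustration and do not come from the SEC data.

```python
def top_n_share(revenues, n):
    """Return the fraction of total revenue earned by the n largest companies."""
    ranked = sorted(revenues, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("total revenue is zero")
    return sum(ranked[:n]) / total


# Made-up revenues for ten hypothetical companies (not real data).
example_revenues = [400, 250, 150, 80, 50, 30, 20, 10, 6, 4]

# Share of the hypothetical market held by the three largest companies.
print(top_n_share(example_revenues, 3))
```

Running the same function for n = 30 and n = 100 on the companies that reported “Revenues” would give the kind of shares quoted above.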


Peering at Overall Market Financials

Looking back at the growth in the major accounts, we notice that Net Income was particularly volatile while assets, book shareholder equity, and revenue were relatively stable.

Let’s take a closer look at Net Income as a use-case for the Shiny app. I select the Financial St Data tab, then select “NetIncomeLoss” as my metric:

The line graph helps us see more clearly which industries drove the volatility in net income – before looking, can you guess which industry experienced the most? (hint: black gold)

There! Natural Resources have had a tough time recently, which makes sense with the volatility of the commodity markets, especially oil.

Financial services (top line) also bounced around quite a bit – I suspect it is due to the razor-thin base interest rates in recent times and a few major fines.
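One rough way to put a number on “which industry bounced around the most” is to total net income by industry and year, then compare the spread of those yearly totals. The Python sketch below does that with the population standard deviation. This is my own illustrative choice of measure, not something the Shiny app computes, and the rows (including the “Retail” industry) are made-up numbers.

```python
from collections import defaultdict
from statistics import pstdev


def net_income_volatility(rows):
    """Map each industry to the std dev of its yearly net income totals.

    rows is an iterable of (industry, year, net_income) tuples.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for industry, year, net_income in rows:
        totals[industry][year] += net_income  # sum companies within a year
    return {
        industry: pstdev(list(by_year.values()))
        for industry, by_year in totals.items()
    }


# Made-up rows for illustration only (not real filings).
example_rows = [
    ("Natural Resources", 2014, 100.0),
    ("Natural Resources", 2015, -60.0),
    ("Natural Resources", 2016, -20.0),
    ("Retail", 2014, 30.0),
    ("Retail", 2014, 20.0),
    ("Retail", 2015, 55.0),
    ("Retail", 2016, 60.0),
]

volatility = net_income_volatility(example_rows)
for industry, spread in sorted(volatility.items(), key=lambda kv: -kv[1]):
    print(industry, round(spread, 2))
```

Standard deviation in raw dollars favors large industries, so a relative measure (for example, dividing by average revenue) may be fairer when comparing industries of very different sizes.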

The Power Law of Business

We’ve seen that the size of the largest companies dominates the rest of the market, and you’d be right to suspect that there’s something exponential going on.

“In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.” - From Wikipedia

After examining the charts below, I think you’ll see the power law between various financial metrics. While the app allows you to see the values for individual companies, I display the charts here to illustrate that the percentage growth in one account is usually related to the percentage growth in another account.

Note that the first chart is on a standard scale where you can see how most companies play in a much smaller arena, while the second chart is on a log/log scale (where you can see the proportional growth of accounts):
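To make the log/log idea concrete: if y = a * x^b, then log(y) = log(a) + b * log(x), so a power law shows up as a straight line with slope b on log/log axes. The Python sketch below fits that line by ordinary least squares, using the square example from the Wikipedia quote as test data. The function name loglog_fit is hypothetical, and this is an illustration of the technique, not an analysis that was run on the SEC data.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    # Logs are only defined for positive values, so skip anything else.
    pairs = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive (x, y) pairs")
    n = len(pairs)
    mean_lx = sum(lx for lx, _ in pairs) / n
    mean_ly = sum(ly for _, ly in pairs) / n
    sxx = sum((lx - mean_lx) ** 2 for lx, _ in pairs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in pairs)
    b = sxy / sxx                         # exponent: slope on log/log axes
    a = math.exp(mean_ly - b * mean_lx)   # scale factor
    return a, b


# The square from the quote: area = side ** 2, so the exponent should be 2.
sides = [1, 2, 4, 8]
areas = [s ** 2 for s in sides]
a, b = loglog_fit(sides, areas)
print(round(a, 6), round(b, 6))
```

An exponent near 1 between two accounts would mean they grow roughly in proportion; note that losses and zero values drop out of this fit because their logs are undefined.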


Thank You!

Anyways, this was a fun dip into a vast pool of public company data, made easier by R, data.table, ggplot2, and Shiny. In the future, I’d like to examine the underlying structure of how companies grow and shrink using machine learning.

For now though, I think these companies can rest easy:


We only took a sample of what’s in the data, so I encourage you to go to my GitHub repo, run the Shiny dashboard app, and explore the data for yourself.
