Analyzing Global Public Companies with R

Posted on Jul 23, 2014

Moyi Dang took Data Analysis with R - Intensive Beginner Level with Vivian Zhang in May-June, 2014. This post is based on her final project for the class.


Note: you can click on the pictures for a clearer image.

With the power of R, I wanted to run some regressions on the universe of all listed companies in the world and identify the 20 most undervalued and the 20 most overvalued companies, given the freedom to pick them anywhere in the world. At the end of the post I will show you the results, but remember that these are not stock recommendations, just 40 names spit out by a regression. 🙂 Regressions can be powerful tools, but they are only a starting point for further research.

I got my raw data from Bloomberg. There are 53,776 companies in total. These include only the "primary ticker" of each company and are restricted to companies that currently have market cap data, so a stock that is currently suspended, for example, would have no market cap data and would not be in the data set. I also converted all financial data into US dollars for ease of comparison across countries.

Then I picked out 11 metrics to describe these companies:

Return on Invested Capital (ROIC)
Net Margin
EPS Growth
Leverage
Market Cap
Liquidity
Insider Ownership
Capital Expenditure
Share Repurchase
Region (i.e. North America, Emerging Asia, etc.)

Then I got the valuation ratios and the past one-year return. My plan was to run a multiple regression of each of the variables below on the 11 factors above:

Price to Earnings ratio (PE)
Price to Book ratio (PB)
Enterprise Value to Earnings Before Interest, Taxes, Depreciation and Amortization (EV/EBITDA)
One Year Return

But before getting into the regressions, here is a visualization of the data set:

Each company is plotted based on its country and city of domicile. The color of the bubbles is on a heat scale, where the red bubbles are the companies with the largest market cap and the yellow bubbles are the companies with the smallest market cap. The size of each bubble corresponds to the PB ratio. Some of the largest bubbles are yellow, while a lot of small bubbles are red, so some small companies have the highest PB, and some large ones have low PB.

all companies

To make this map, I first needed the longitude and latitude of each company's location. To get them, I found a very comprehensive list of cities in the world with their latitude and longitude, and merged it with my data set.
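Here is a minimal sketch of that merge step using base R's merge(). The two small data frames and their column names (Ticker, City, Country, lat, long) are made up for illustration; the real city list and company data are not shown in this post.

```r
# Hypothetical company data: one row per company, with city and country of domicile
companies <- data.frame(
  Ticker  = c("AAA", "BBB", "CCC"),
  City    = c("Toronto", "Taipei", "Nowhere"),
  Country = c("Canada", "Taiwan", "Canada"),
  stringsAsFactors = FALSE
)

# Hypothetical city list with approximate coordinates
cities <- data.frame(
  City    = c("Toronto", "Taipei"),
  Country = c("Canada", "Taiwan"),
  lat     = c(43.7, 25.0),
  long    = c(-79.4, 121.6),
  stringsAsFactors = FALSE
)

# Left join: keep every company, even if its city is not in the list
located <- merge(companies, cities, by = c("City", "Country"), all.x = TRUE)

# Companies without a match get NA coordinates
unmatched <- located[is.na(located$lat), "Ticker"]
print(located)
```

Merging on both city and country, rather than on city alone, avoids mixing up cities that share a name across countries. Checking the unmatched rows shows how many companies would be missing from the map.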

Once I had that, I used mapBubbles from the rworldmap package.

Here is my code:

library(rworldmap)

mapBubbles(dF = alldata5,            # alldata5 is my cleaned up data set
           nameZSize = "PB",
           nameZColour = "logmc",    # logmc is the natural log of market cap
           colourPalette = "heat",
           oceanCol = "lightblue",
           landCol = "wheat",
           addLegend = FALSE,
           addColourLegend = FALSE,
           nameX = "long",           # longitude and latitude of the city of domicile
           nameY = "lat")

Now onto the regressions.

I ran the regression code below for PB, PE, EV/EBITDA, and One Year Return:

regression1 <- lm(formula = PB ~
                    ROIC + net.margin + EPS.Growth
                    + leverage + logmc + logliquid
                    + Pct.Insider.Shares + capextosales + + region,
                  data = alldata2,
                  na.action = na.omit)
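For readers who want to try this without the Bloomberg data, here is a small self-contained sketch on simulated data. The data frame `toy` and the coefficients used to build it are invented; only the lm() and summary() calls mirror the approach above.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three of the factors (not real company data)
toy <- data.frame(
  ROIC       = rnorm(n),
  net.margin = rnorm(n),
  logmc      = rnorm(n)
)

# By construction PB depends on ROIC and logmc; net.margin is pure noise
toy$PB <- 2 + 1.5 * toy$ROIC + 0.5 * toy$logmc + rnorm(n)

fit <- lm(PB ~ ROIC + net.margin + logmc, data = toy, na.action = na.omit)

# One p value per coefficient, taken from the coefficient table
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
print(round(pvals, 4))
```

The named vector `pvals` has one entry for the intercept and one for each factor, which is the raw material for comparing how significant each factor is.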

It turns out that PB gave the largest number of factors with significant p values. The p value of each factor also varies widely by region. To visualize this, I decided to use a star plot (radar chart). I don't think this is a very well-known type of chart, so if you're unfamiliar with it, here is a description:

I have 10 factors (excluding region), so I take a regular decagon and assign each factor to one vertex. Then I plot (1 - p value) along the line from the center of the decagon to that vertex: if 1 - p is close to 1, the point lies very close to the vertex; if 1 - p is close to 0, it lies close to the center. Then I connect the ten points to form an irregular decagon and shade it in. The larger the shaded area, the more factors are statistically significant.
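The geometry can be sketched in a few lines of base R. The p values below are made up; the point is only to show how 1 - p becomes a distance from the center along each spoke of the decagon.

```r
# Made-up p values for 10 factors (for illustration only)
p <- c(0.001, 0.20, 0.03, 0.60, 0.0005, 0.01, 0.90, 0.04, 0.35, 0.08)
k <- length(p)

# One spoke per factor, evenly spaced around the circle, starting at the top
angle  <- pi / 2 + 2 * pi * (0:(k - 1)) / k
radius <- 1 - p   # near 1 = significant, near 0 = not significant

# Vertices of the irregular decagon
x <- radius * cos(angle)
y <- radius * sin(angle)

# Shaded area via the shoelace formula
nxt  <- c(2:k, 1)
area <- abs(sum(x * y[nxt] - x[nxt] * y)) / 2

# Largest possible area: the regular decagon reached when every p value is 0
max_area <- k / 2 * sin(2 * pi / k)

plot(c(-1, 1), c(-1, 1), type = "n", asp = 1, xlab = "", ylab = "")
polygon(cos(angle), sin(angle), border = "grey")  # outer regular decagon
polygon(x, y, col = "lightblue")                  # the 1 - p polygon
```

The ratio `area / max_area` gives a rough single-number summary of how much of the decagon is filled, although it depends on the order in which the factors are placed around the circle.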

Anyway, here it is - the significance of the ten metrics as predictors of PB, separated by region of the world:



The plot is generated using radarchart in the fmsb package. Here is my code:

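library(fmsb)

for (i in 1:8) {
  # assumed: subset to the companies in region i
  data1 <- alldata2[alldata2$region == regions[i], ]
  print(dim(data1))
  print(head(data1, 1))
  # region is left out here because each fit covers a single region
  regression <- lm(formula = PB ~
                     ROIC + net.margin + EPS.Growth
                     + leverage + logmc + logliquid
                     + Pct.Insider.Shares + capextosales,
                   data = data1,
                   na.action = na.omit)
  # p value of each factor, without the intercept
  result <- summary(regression)$coefficients[-1, "Pr(>|t|)"]
  # assumed layout: with maxmin = TRUE, radarchart reads row 1 as the axis
  # maximum and row 2 as the minimum, so 1 - p goes in row 3
  p <- as.data.frame(rbind(1, 0, 1 - result))
  radarchart(df = p,
             maxmin = TRUE,
             centerzero = TRUE,
             pcol = topo.colors(2),
             pfcol = topo.colors(2),
             title = regions[i])
}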

For the sake of illustration, here is the same plot generated by regressing One Year Return. You can see that the shaded areas are much smaller, so the factors are not as statistically significant:

(Interestingly, Africa is the exception. So if I had picked African stocks one year ago based on these 10 metrics, I might be happy right now!)


So which countries have the most undervalued companies? Here is a treemap where the size of each rectangle corresponds to the number of undervalued companies in that country, and the color corresponds to how undervalued these companies are (the average residual for that country).

Undervalued by country

I made this diagram by first splitting the data by country using ddply from the plyr package, and then using treemap from the treemap package. Here is my code:

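library(plyr)
library(treemap)

# assumed: underdata is a placeholder name for the undervalued companies,
# with columns Country and resid
#counting residuals by country
counts <- count(underdata, 'Country')
head(counts, 5)
undervalued_mean <- ddply(.data = underdata,
                          .variables = 'Country',
                          .fun = function(x) mean(x$resid))
names(undervalued_mean)[2] <- "mean"
head(undervalued_mean, 5)
undervalued <- merge(counts, undervalued_mean, by = "Country")
head(undervalued, 5)
treemap(undervalued,
        index = "Country",
        vSize = "freq",
        vColor = "mean",
        type = "value",
        title = "Undervalued Companies by Country")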
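The same count-and-average step can be sketched in base R, for readers without plyr. The data frame `resid_data` and its values are invented, and "undervalued" is assumed here to mean a negative residual (actual PB below the PB the regression predicts).

```r
# Invented residuals for a handful of companies
resid_data <- data.frame(
  Country = c("US", "US", "US", "Taiwan", "Taiwan", "Japan"),
  resid   = c(-1.2, 0.8, -0.4, -2.0, 1.5, -0.6),
  stringsAsFactors = FALSE
)

# Assumption: undervalued = actual PB below fitted PB, i.e. a negative residual
under <- resid_data[resid_data$resid < 0, ]

# Number of undervalued companies per country (rectangle size)
counts <- as.data.frame(table(Country = under$Country),
                        responseName = "freq",
                        stringsAsFactors = FALSE)

# Average residual per country (rectangle color)
means <- aggregate(resid ~ Country, data = under, FUN = mean)
names(means)[names(means) == "resid"] <- "mean"

undervalued <- merge(counts, means, by = "Country")
print(undervalued)
```

The result has one row per country with the columns `freq` and `mean`, matching the column names that the treemap call refers to through vSize and vColor.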

Here is the diagram for overvalued companies by country. I'm surprised by the size of Taiwan here, and it is not so small in the undervalued diagram either. I checked, and it turns out there are 1,773 companies based in Taiwan in my data set. That's a lot of companies, but why does it rival the US here? One reason, I think, is that this diagram excludes any company whose residual is NA (error), so maybe Taiwanese companies generated fewer errors... Anyway, maybe this means something. Sell Taiwanese stocks?

overvalued by country

Here are the undervalued companies by sector:

undervalued by sector


Here are the overvalued companies by sector:

Overvalued Companies by Sector

Now for the moment we've all been waiting for...

the 20 most undervalued companies, and the 20 most overvalued companies...






...according to this regression



...don't get too excited...



... it's just a regression...



... you should still do your own work...



... remember you can click to enlarge the image...




results by names
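Here is a hedged sketch of how such lists can be pulled out of a fitted model: sort the companies by residual and take both ends. The simulated data and the names `toy`, `fit`, and `ranked` are placeholders, and the sketch assumes that "undervalued" means the lowest residuals (actual PB furthest below fitted PB).

```r
set.seed(42)
n <- 100

# Simulated stand-in data (not real companies)
toy <- data.frame(
  Ticker = sprintf("CO%03d", 1:n),
  ROIC   = rnorm(n),
  logmc  = rnorm(n),
  stringsAsFactors = FALSE
)
toy$PB <- 2 + 1.0 * toy$ROIC + 0.5 * toy$logmc + rnorm(n)

# na.exclude keeps the residuals aligned with the rows of toy
# (NA for any row that was dropped because of missing values)
fit <- lm(PB ~ ROIC + logmc, data = toy, na.action = na.exclude)
toy$resid <- residuals(fit)

# Sort from most negative to most positive residual, dropping NA residuals
ranked <- toy[order(toy$resid), c("Ticker", "PB", "resid")]
ranked <- ranked[!is.na(ranked$resid), ]

most_undervalued <- head(ranked, 20)  # most negative residuals
most_overvalued  <- tail(ranked, 20)  # most positive residuals
```

With na.omit, residuals() returns one value per row that was actually used in the fit, so attaching the residuals back to the full data frame needs care when rows have missing values; na.exclude pads the result with NA instead.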

Here is a plot of where these 40 companies are in the world. The green bubbles are overvalued; the red ones are undervalued. The size of each bubble corresponds to its PB ratio, and as expected the green bubbles are bigger than the red ones.

final results map


Is this regression perfect? I don't think any regression is perfect, and certainly not this one. It picks companies based only on PB, so the undervalued companies tend to be in sectors that have lower PB, while the overvalued ones tend to be in sectors with higher PB.

But I hope you enjoyed my post nevertheless!
