# Analyzing Global Public Companies with R

Moyi Dang took Data Analysis with R - Intensive Beginner Level with Vivian Zhang in May-June, 2014. This post is based on her final project for the class.

**Note:** you can click on the pictures for a clearer image.

With the power of R, I wanted to run some regressions on the universe of all listed companies in the world and identify the 20 most undervalued and the 20 most overvalued companies, assuming one has the freedom to pick them anywhere in the world. At the end of the post I will show you the results, but remember that these are not stock recommendations, just 40 names spit out by a regression. 🙂 Regressions can be powerful tools, but they may be just a starting point for further research.

I got my raw data from Bloomberg. There are 53,776 companies in total. These include only the "primary tickers" of each company and are restricted to companies that currently have market cap data, so, for example, a stock that is currently suspended would have no market cap data and would not be included in the data set. I also converted all financial data into US dollars for ease of comparison across countries.

Then I picked out 11 metrics to describe these companies:

**Return on Invested Capital**

**Net Margin**

**EPS Growth**

**Leverage**

**Market Cap**

**Liquidity**

**Insider Ownership**

**Capital Expenditure**

**Share Repurchase**

**Region (i.e. North America, Emerging Asia, etc.)**

Then I got the valuation ratios and the past one-year return. I planned to run a multivariate regression of each of the variables below on the above 11 factors:

**Price to Earnings ratio (PE)**

**Price to Book ratio (PB)**

**Enterprise Value to Earnings Before Interest, Taxes, Depreciation and Amortization (EV/EBITDA)**

**One Year Return**

But before getting into the regressions, here is a visualization of the data set:

Each company is plotted at its city and country of domicile. The color of the bubbles is on a heat scale: red bubbles are the companies with the largest market cap, and yellow bubbles are the companies with the smallest. The size of each bubble corresponds to the PB ratio. Some of the largest bubbles are yellow, while a lot of small bubbles are red, so some small companies have the highest PB and some large ones have low PB.

To make this map, I first needed the latitude and longitude of each company's location. I got a very comprehensive list of the world's cities with their latitude and longitude from http://dev.maxmind.com/geoip/legacy/geolite/ and merged it with my data set.
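The merge step can be sketched in base R roughly as follows. This is a minimal illustration on toy data, not the author's actual code: the column names (`ticker`, `country`, `city`, `lat`, `long`) and the helper `norm` are assumptions made for the example.

```r
# Hypothetical toy data: these column names are assumptions, not the
# author's actual schema.
companies <- data.frame(
  ticker  = c("AAA", "BBB", "CCC"),
  country = c("US", "JP", "US"),
  city    = c("New York", "Tokyo", "Atlantis"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  country = c("US", "JP"),
  city    = c("New York", "Tokyo"),
  lat     = c(40.71, 35.69),
  long    = c(-74.01, 139.69),
  stringsAsFactors = FALSE
)

# Normalise the join keys so that "new york " and "New York" still match.
norm <- function(x) tolower(trimws(x))
companies$key <- paste(norm(companies$country), norm(companies$city), sep = "|")
cities$key    <- paste(norm(cities$country), norm(cities$city), sep = "|")

# Left join: keep every company, even if its city is missing from the lookup.
merged <- merge(companies, cities[, c("key", "lat", "long")],
                by = "key", all.x = TRUE)

# Companies without coordinates cannot be plotted, so list them for review.
unmatched <- merged$ticker[is.na(merged$lat)]
print(unmatched)
```

Joining on country plus city rather than city alone avoids mixing up cities that share a name. If the city list contains duplicate country/city pairs, the merge would duplicate companies, so the lookup table may need de-duplicating first.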

Once I had that, I just used `mapBubbles` from the `rworldmap` package.

Here is my code:

    library(rworldmap)

    mapBubbles(dF = alldata5,           # alldata5 is my cleaned-up data set
               nameZSize = "PB",        # bubble size: PB ratio
               nameZColour = "logmc",   # logmc is the natural log of market cap
               colourPalette = "heat",
               oceanCol = "lightblue",
               landCol = "wheat",
               addLegend = FALSE,
               addColourLegend = FALSE,
               nameX = "long",          # longitude and latitude of the city of domicile
               nameY = "lat")

Now onto the regressions.

I ran the regression code below for PB, PE, EV/EBITDA, and One Year Return:

    regression1 <- lm(formula = PB ~
                        ROIC + net.margin + EPS.Growth
                        + leverage + logmc + logliquid
                        + Pct.Insider.Shares + capextosales + Decrease.in.Capital.Stocks + region,
                      data = alldata2,
                      na.action = na.omit)

    summary(regression1)
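To compare several response variables, one option is to refit the same model for each and count the coefficients whose p value falls below a threshold. The sketch below does this on synthetic data. The helper `count_significant`, the 0.05 threshold, and the toy columns are assumptions for illustration, not part of the original analysis.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the real data set. Some column names mirror the
# regression above, but the values are random and purely illustrative.
toy <- data.frame(ROIC = rnorm(n), net.margin = rnorm(n), logmc = rnorm(n))
toy$PB <- 1 + 2 * toy$ROIC + 0.5 * toy$logmc + rnorm(n)
toy$PE <- 15 + rnorm(n)   # unrelated to the predictors by construction

predictors <- c("ROIC", "net.margin", "logmc")
responses  <- c("PB", "PE")

# Hypothetical helper: fit one model and count significant slopes.
count_significant <- function(response, data, alpha = 0.05) {
  fit <- lm(reformulate(predictors, response = response),
            data = data, na.action = na.omit)
  p <- summary(fit)$coefficients[, "Pr(>|t|)"]
  sum(p[names(p) != "(Intercept)"] < alpha)   # ignore the intercept
}

sig_counts <- sapply(responses, count_significant, data = toy)
print(sig_counts)
```

`reformulate` builds the formula from character vectors, so the response can be swapped without retyping the right-hand side.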

It turns out that PB was the response for which the most factors had significant p values. The p values of each factor also vary widely by region. To visualize this, I decided to use a star plot (radar chart). I don't think this type of chart is very well known, so if you're unfamiliar with it, here is a description:

I have 10 factors (excluding region), so I take a regular decagon and assign each factor to one vertex. Then I plot (1 - p value) along the line from the center of the decagon to that vertex: if 1 - p is close to 1, the point lies very close to the vertex; if 1 - p is close to 0, it lies close to the center. Then I connect the points to form an irregular decagon and shade it in. The larger the shaded area, the more factors are statistically significant.
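The geometry described above can be worked out by hand in base R. This is only a sketch of the idea with made-up p values. It is not how `radarchart` is implemented internally, and the starting angle at the top of the circle is an arbitrary choice.

```r
# Hypothetical p values for 10 factors (illustrative numbers only).
p_values <- c(0.001, 0.20, 0.03, 0.75, 0.0001, 0.50, 0.04, 0.90, 0.01, 0.33)
k <- length(p_values)

# One axis per factor, evenly spaced around the circle, starting at the top.
angles <- pi / 2 + 2 * pi * (seq_len(k) - 1) / k
radius <- 1 - p_values        # distance from the center along each axis

# Vertices of the outer regular decagon and of the inner irregular one.
outer <- data.frame(x = cos(angles), y = sin(angles))
inner <- data.frame(x = radius * cos(angles), y = radius * sin(angles))

plot(outer$x, outer$y, type = "n", asp = 1, axes = FALSE, xlab = "", ylab = "")
polygon(outer$x, outer$y, border = "grey50")            # regular decagon
segments(0, 0, outer$x, outer$y, col = "grey80")        # one axis per factor
polygon(inner$x, inner$y, col = "lightblue", border = "blue")  # shaded area
```

A factor with a tiny p value pushes its vertex almost to the outer decagon, while a factor with p near 1 collapses toward the center.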

Anyway, here it is: the significance of the ten metrics as predictors of PB, separated by region of the world:

The plot is generated using `radarchart` in the `fmsb` package. Here is my code:

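    library(fmsb)

    par(mfrow = c(2, 4))   # 2 x 4 grid, one panel per region

    for (i in 1:8) {
      # NOTE: the right-hand side of this assignment was lost when the post was
      # published; subsetting by region is an assumed reconstruction.
      data1 <- alldata2[alldata2$region == regions[i], ]
      dim(data1)
      head(data1, 1)
      regression <- lm(formula = PB ~
                         ROIC + net.margin + EPS.Growth
                         + leverage + logmc + logliquid
                         + Pct.Insider.Shares + capextosales + Decrease.in.Capital.Stocks,
                       data = data1,
                       na.action = na.omit)
      summary(regression)
      result <- summary(regression)$coefficients[, "Pr(>|t|)"]
      # NOTE: the next two assignments were also lost; these are assumed.
      # With maxmin = TRUE, radarchart expects row 1 = max, row 2 = min, then data.
      result <- 1 - result
      p <- as.data.frame(rbind(1, 0, result))
      radarchart(df = p,
                 maxmin = TRUE,
                 centerzero = TRUE,
                 pcol = topo.colors(2),
                 pfcol = topo.colors(2),
                 title = regions[i])
    }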

For the sake of illustration, here is the same plot generated by regressing One Year Return. You can see that the shaded areas are much smaller, so the factors are not as statistically significant:

(Interestingly, Africa is the exception. So if I had picked African stocks one year ago based on these 10 metrics, I might be happy right now!)

So which countries have the most undervalued companies? Here is a treemap where the size of each rectangle corresponds to the number of undervalued companies in that country, and the color corresponds to how undervalued those companies are (the average residual for that country).

This diagram is made by first splitting the data by country using `ddply`, and then using `treemap` from the `treemap` package. Here is my code:

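    # counting residuals by country
    # NOTE: the right-hand sides of the assignments below were lost when the
    # post was published. The calls are an assumed reconstruction, and
    # 'undervalued_data' is a hypothetical name for the undervalued companies.
    library(plyr)
    counts <- count(undervalued_data, 'Country')   # gives columns Country, freq
    head(counts, 5)
    undervalued_mean <- ddply(.data = undervalued_data,
                              .variables = 'Country',
                              .fun = function(x) mean(x$resid))
    names(undervalued_mean) <- c("Country", "mean")
    head(undervalued_mean, 5)
    undervalued <- merge(counts, undervalued_mean, by = "Country")
    head(undervalued, 5)

    # treemap
    library(treemap)
    ?treemap
    treemap(undervalued,
            index = "Country",
            vSize = "freq",
            vColor = "mean",
            type = "value",
            title = "Undervalued Companies by Country")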
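The per-country summary that feeds the treemap can also be sketched in base R without `plyr`. The example below uses toy residuals and assumes that "undervalued" means a negative residual (actual PB below the fitted value). That definition and the toy numbers are assumptions for illustration.

```r
# Toy residuals; in the post these would come from the PB regression.
# The column names 'Country' and 'resid' follow the code above.
toy <- data.frame(
  Country = c("US", "US", "US", "Japan", "Japan", "Taiwan"),
  resid   = c(-1.5, 0.4, -0.5, -2.0, 1.0, NA),
  stringsAsFactors = FALSE
)

# Assumption: undervalued = negative residual. Rows with NA residuals drop out.
under <- toy[!is.na(toy$resid) & toy$resid < 0, ]

# Number of undervalued companies per country.
freq <- aggregate(resid ~ Country, data = under, FUN = length)
names(freq)[2] <- "freq"

# Average residual of those companies per country.
avg <- aggregate(resid ~ Country, data = under, FUN = mean)
names(avg)[2] <- "mean"

undervalued <- merge(freq, avg, by = "Country")
print(undervalued)
```

The resulting columns `freq` and `mean` line up with the `vSize` and `vColor` arguments in the `treemap` call.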

Here is the diagram for overvalued companies by country. I'm surprised by the size of Taiwan here, and it is not so small in the undervalued diagram either. I checked, and it turns out there are 1773 companies based in Taiwan in my data set. That's a lot of companies, but why does it rival the US here? One reason, I think, is that this diagram doesn't include any companies where the residual is NA (error), so maybe Taiwanese companies generated fewer errors... Anyway, maybe this means something. Sell Taiwanese stocks?
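One way to check the NA hypothesis is to measure, per country, what share of companies actually entered the regression. The sketch below uses toy data and relies on `na.action = na.exclude`, which pads the residuals with `NA` so they stay aligned with the original rows. The data and column names are made up for illustration.

```r
set.seed(42)
toy <- data.frame(
  Country = rep(c("Taiwan", "US"), each = 5),
  x = rnorm(10),
  y = rnorm(10),
  stringsAsFactors = FALSE
)
toy$x[c(7, 9)] <- NA   # two US companies with a missing predictor

fit_omit    <- lm(y ~ x, data = toy, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = toy, na.action = na.exclude)

length(residuals(fit_omit))     # 8: incomplete rows are dropped
length(residuals(fit_exclude))  # 10: padded with NA, aligned with 'toy'

# Share of each country's companies that produced a residual.
coverage <- tapply(!is.na(residuals(fit_exclude)), toy$Country, mean)
print(coverage)
```

If one country's coverage is much higher than another's, it will be over-represented in any diagram built from the residuals, independent of valuation.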

Here are the undervalued companies by sector:

Here are the overvalued companies by sector:

Now for the moment we've all been waiting for...

the 20 most undervalued companies, and the 20 most overvalued companies...

...

...

...drumroll...

...

...

...according to this regression

...

...

...don't get too excited...

...

...

... it's just a regression...

...

...

... you should still do your own work...

...

...

... remember you can click to enlarge the image...

...

...

are:

Here is a plot of where they are in the world. The green bubbles are overvalued; the red ones are undervalued. The size of each bubble corresponds to its PB ratio, and as expected the green bubbles are bigger than the red ones.
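Picking the two lists amounts to sorting companies by residual and taking both ends. Here is a minimal sketch on made-up data. The tickers and residuals are hypothetical, and it again assumes that a negative residual means undervalued.

```r
set.seed(7)

# Toy data: tickers and residuals are invented for illustration.
toy <- data.frame(
  ticker = sprintf("T%03d", 1:100),
  resid  = rnorm(100),
  stringsAsFactors = FALSE
)
toy$resid[c(5, 50)] <- NA   # companies that dropped out of the regression

# Sort ascending by residual; na.last = NA removes the NA rows entirely.
ranked <- toy[order(toy$resid, na.last = NA), ]

n_pick <- 20
most_undervalued <- head(ranked, n_pick)                           # most negative
most_overvalued  <- head(ranked[order(-ranked$resid), ], n_pick)   # most positive
```

Dropping the NA rows before ranking matters, because otherwise companies with missing residuals could end up at one end of the sorted table.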

Is this regression perfect? I don't think any regression is perfect, and this one in particular is not. It picks companies based only on PB, so the undervalued companies tend to be in sectors that have lower PB, while the overvalued ones tend to be in sectors with higher PB.

But I hope you enjoyed my post nevertheless!