World Health and Economy Data Visualization: A Web Scraping Project

Posted on Sep 9, 2019

Introduction

I have traveled, studied, and worked in many places across the world, yet I still find it fascinating to discover facts about places that I never knew before.

For this project, I combined my web scraping and data analytics skills with my curiosity about different countries in the world.

The website I chose to scrape is the CIA World Factbook, a site produced by the Central Intelligence Agency that provides comprehensive information on all world entities.

For this project, I focused on three areas: people and society (especially health), economy, and education.

Web Scraping

The layout of the CIA World Factbook is organized and straightforward: users select a country, and the website shows a comprehensive list of information about it. Users can then expand and collapse different topics to see all kinds of facts about that country.

On the technical side, I used Beautiful Soup to scrape the data. The structure of the website is simple enough that I didn’t need a more complicated scraping tool such as Scrapy or Selenium. Beautiful Soup is an HTML parsing library that worked well for this project. I used RStudio to produce all the graphs, most of them with ggplot2, a data visualization package.
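To give a feel for the parsing step, here is a minimal sketch in Python. It is not the code used for this project: it relies on the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything, and the tag names, ids, and values in the sample markup are hypothetical placeholders rather than the real Factbook page structure.

```python
from html.parser import HTMLParser

# Hypothetical markup and invented values: the real Factbook pages
# use different tags, ids, and numbers.
SAMPLE_HTML = """
<div id="field-life-expectancy-at-birth">
  <span class="subfield-name">total population:</span>
  <span class="subfield-number">80.1 years</span>
</div>
<div id="field-median-age">
  <span class="subfield-name">total:</span>
  <span class="subfield-number">38.2 years</span>
</div>
"""


class FieldParser(HTMLParser):
    """Collect the first 'subfield-number' text inside each 'field-*' div."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current_field = None
        self._in_number = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element_id = attrs.get("id") or ""
        if tag == "div" and element_id.startswith("field-"):
            self._current_field = element_id[len("field-"):]
        elif tag == "span" and attrs.get("class") == "subfield-number":
            self._in_number = True

    def handle_data(self, data):
        if self._in_number and self._current_field is not None:
            # Keep only the first value found for each field.
            self.fields.setdefault(self._current_field, data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_number = False
        elif tag == "div":
            self._current_field = None


def parse_fields(html):
    """Return a dict mapping field name to its raw text value."""
    parser = FieldParser()
    parser.feed(html)
    return parser.fields


fields = parse_fields(SAMPLE_HTML)
for name, value in fields.items():
    print(name, "->", value)
```

With Beautiful Soup the same idea takes fewer lines, since the library finds elements by tag and attribute for you. The overall shape is the same: locate a field's container, then read the value inside it.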

Topic 1 : People and Health

The first general topic I would like to explore is People and Health. Specifically, I want to look at prominent global health issues that have been the subject of attention for decades, namely mortality, life expectancy and aging, and health expenditure. I would also like to explore, through correlation studies, whether health expenditure actually helps alleviate those problems.

1. Mortality

Maternal mortality

Maternal mortality is a general indicator of the overall health of a population and of the status of women in society. The boxplot shows that Africa has the widest dispersion and the highest distribution of maternal mortality rates. It’s not surprising that countries in Africa, especially those torn by war and hunger, have the highest maternal mortality rates. On the other side, it’s also not surprising that European countries with good social welfare systems have the lowest.
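The "dispersion" a boxplot shows can be put into numbers with the quartiles and the interquartile range (IQR). The sketch below is a small Python illustration using only the standard library. The charts in this post were made in R with ggplot2, and the rates here are invented placeholders, not Factbook figures.

```python
from statistics import median, quantiles

# Placeholder maternal mortality rates (deaths per 100,000 live births).
# These numbers are invented for illustration, not scraped values.
rates_by_continent = {
    "Continent A": [120, 340, 560, 810, 1100, 250, 430],
    "Continent B": [3, 4, 5, 6, 8, 9, 12],
}


def box_stats(values):
    """Return the median and interquartile range that a boxplot draws."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return {"median": median(values), "iqr": q3 - q1}


summary = {name: box_stats(vals) for name, vals in rates_by_continent.items()}
for name, stats in summary.items():
    print(f"{name}: median={stats['median']}, IQR={stats['iqr']:.1f}")
```

A taller box (larger IQR) means the countries in that group differ more from each other, which is what "widest dispersion" refers to.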

I was surprised by 2 facts: (1) Kuwait is on the list, and (2) the US is not on the list. Some further research online backed up my observations:

(1) About US$2–2.5 billion of the Kuwaiti government’s US$10 billion annual health budget goes to sending some 2,000 patients abroad each year to North America or the European Union to receive extended or specialized care.
(2) The maternal mortality rate in the US has more than doubled over the past decade; it is the only developed country with this trend.

Infant mortality

Preserving the lives of newborns is a long-standing issue in public health. This is why infant mortality is a very important marker of the overall health of a society.

Again, Africa has the widest dispersion and the highest distribution of infant mortality rates. Among individual countries, Afghanistan has the highest infant mortality rate, most likely because of its constant wars. On the other side, countries we consider politically and socially stable, with good social welfare systems, have the lowest rates. Again, the US is not on this list.

2. Life expectancy and aging

Life expectancy at birth

On average, Africa has the lowest life expectancy. All the other continents are doing pretty well, mostly surpassing the 70-year threshold, with Europe slightly ahead of the rest of the world. Countries/regions in Europe and East Asia have the longest life expectancy. Afghanistan and countries in Africa have the shortest. There is nothing surprising here, as I would expect life expectancy to have an inverse relationship with mortality rate.

Median age

Median age is a very simple idea: half the people are younger than this age and half are older. As a rough rule of thumb, if ages were spread evenly and life expectancy were 70 years, the median would be around 35.

Africa has the lowest median age. Europe and North America have strikingly higher median ages than the rest of the world. Looking at individual countries, the result is quite surprising. The chart implies that populations in Europe and North America are aging. That led me to look further into 2 important indicators of an aging population: population growth rate and fertility rate.

Population growth rate

Moving on to the population growth rate, we see the opposite dynamics: Africa has the highest population growth rate, with Europe at the bottom.

Total fertility rate

That contrast between Africa and the rest of the world is even greater when we are looking at the fertility rate.

So what does this tell us?

  • African countries with the highest population growth and fertility rates are also the countries with the highest maternal and infant mortality rates and the lowest median ages.
  • The story is the opposite for Europe and East Asia: low mortality rates and high median ages, but coupled with low population growth and low fertility rates.
  • Africa is facing a health quality problem, whereas Europe and Asia are facing a population aging problem. The aging problem in Europe has been going on for quite a while, but it will be interesting to see its future impact in Asia, given that Asia is playing an increasingly important role in the world economy and politics.

3. Health expenditure

So far, we’ve explored a few prominent health issues. Next, I would like to explore how much money countries are spending on health, and whether that spending is actually solving those health issues.

Health Expenditure by $

Health expenditure, as defined by the CIA, includes both public and private health expenditures. It should come as no surprise that North America has strikingly high health expenditure in dollar terms compared with the rest of the world (the outlier on the upper right is the US).

Looking at the dollar values, I realized it might not be fair to compare dollar values alone, because smaller countries in Australia-Oceania obviously do not have as many economic resources as wealthier Western countries. It makes more sense to look at health expenditure as a % of GDP.

Health Expenditure by % of GDP

After expressing health expenditure as a percentage of GDP, the differences among continents are less dramatic.

The US still has the largest health expenditure by % of GDP. Quite surprisingly, the Australia-Oceania island countries spend a significant percentage of their GDP on health care, even more than some European countries. Meanwhile, the countries that spend the lowest % of their GDP on health care are mostly in Asia (East Asia, Southeast Asia, South Asia, and Central Asia).
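The normalization itself is one line of arithmetic: divide spending by GDP and multiply by 100. Here is a tiny Python sketch with invented figures (the function name and numbers are illustrative assumptions, not values from the dataset) showing how a small economy can rank below a large one in dollars yet above it as a share of GDP.

```python
def share_of_gdp(expenditure_usd, gdp_usd):
    """Express spending as a percentage of GDP."""
    if gdp_usd <= 0:
        raise ValueError("GDP must be positive")
    return 100.0 * expenditure_usd / gdp_usd


# Invented figures for a large and a small economy (US dollars).
large = share_of_gdp(expenditure_usd=500e9, gdp_usd=5_000e9)
small = share_of_gdp(expenditure_usd=0.06e9, gdp_usd=0.4e9)

print(f"large economy: {large:.1f}% of GDP")
print(f"small economy: {small:.1f}% of GDP")
```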

4. The question: Does higher health expenditure necessarily help solve health concerns/problems?

  1. Maternal mortality
  2. Infant mortality
  3. Life expectancy at birth
  4. Median age
  5. Total fertility rate
  6. HIV/AIDS adult prevalence rate
  7. Obesity adult prevalence rate
  8. Children under the age of 5 years underweight

In addition to the health problems explored above, I added three more prominent worldwide health problems: HIV/AIDS, obesity, and underweight children.

My key question is: Are higher health expenditures necessarily correlated with better health conditions?

So I created a correlation heatmap that shows the correlation between health expenditures and each of the 8 major health problems for each continent.
(Red = positive correlation, blue = negative correlation, the darker the color, the stronger the correlation)
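Each cell of such a heatmap is one correlation coefficient computed within one continent. The Python sketch below shows that calculation for a single indicator. It assumes the Pearson coefficient (the post does not say which coefficient was used), the original analysis was done in R, and the rows are invented placeholders rather than Factbook data.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Invented rows: (continent, health spending as % of GDP, infant mortality).
rows = [
    ("Continent A", 3.0, 60.0), ("Continent A", 5.0, 45.0),
    ("Continent A", 7.0, 38.0), ("Continent A", 9.0, 20.0),
    ("Continent B", 8.0, 4.0), ("Continent B", 9.5, 3.5),
    ("Continent B", 11.0, 3.8), ("Continent B", 12.0, 3.0),
]

# Group the two columns by continent.
grouped = defaultdict(lambda: ([], []))
for continent, spending, mortality in rows:
    grouped[continent][0].append(spending)
    grouped[continent][1].append(mortality)

# One heatmap cell per continent for this indicator.
heatmap_cells = {c: pearson(xs, ys) for c, (xs, ys) in grouped.items()}
for continent, r in heatmap_cells.items():
    print(f"{continent}: r = {r:+.2f}")
```

Repeating this for each of the 8 indicators gives the full grid. A value near -1 or +1 would be drawn in a dark color, and a value near 0 in a pale one.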

Let’s first look at the correlations that meet my initial expectations:
• The more money spent on health, the lower the maternal and infant mortality rates. We can observe negative correlations across continents, with relatively stronger correlations for Australia-Oceania and Central Asia.

• The more money spent on health, the longer the life expectancy. We can observe positive correlations across continents, with a relatively stronger correlation for Australia-Oceania.

• The more money spent on health, the higher the median age. We can observe positive correlations across continents, with a relatively stronger correlation for Central Asia.

The directions of these correlations meet my expectations, with some continents showing stronger correlations than others.

Next, there are some correlations that were unexpected to me:

  • I would think the HIV rate should have a negative correlation with health expenditure, but that’s not true for South Asia, Central Asia, and Africa.
    • Further research online showed that Central Asia is the only region in the world where the HIV epidemic continued to rise rapidly, with a 60% increase in infections between 2010 and 2015.
    • Then what about Africa and South Asia? One possible explanation is that those countries are not allocating enough money to cure/address HIV.
  • In terms of fertility rate, it is a relief to see that health spending has a generally positive correlation with it in Europe. However, I see strong negative correlations for Central Asia, Australia-Oceania, and South America. One possible explanation is that even though money has been spent, it was not spent on addressing fertility issues.
  • In terms of obesity, I am very surprised by the almost perfect positive correlation between health spending and obesity rate in North America; it is also very high in Central Asia. This came as a surprise because I know that the medical costs of obesity are extremely high in the US. It raises the question of whether enough health spending specifically targets obesity-related disease.

Additionally, this correlation study shows me that there may be a lot of room for further study of Central Asia. Sometimes we automatically assume that Africa has the most health problems, and we automatically group all the countries in Asia together. This could cause us to overlook prominent health issues in certain Asian countries.

Thoughts:

• It would be tempting to propose a universal hypothesis: wherever I don’t see the correlation I expected, those countries are simply not allocating enough health expenditure to that particular issue.

• However, that would not be a scientific way to address the patterns observed here. Global health is a very complicated subject without a universal answer. For each of these topics, there are people who spend their entire careers trying to figure out patterns, correlations, and cures. A single correlation study is far from enough to explain why I am not seeing what I expected to see.

• Still, this correlation study does give me (and hopefully you) a sense of direction for further study into the topics where unexpected patterns or correlations were observed.

Topic 2: Education, Communication and Economy

The next general topic I would like to explore encompasses education, communication, and economy.

5. Size of economy

GDP & GDP real growth rate

GDP (Gross Domestic Product) is the total market value of all final goods and services produced in a country in a given period.

To compare GDP across countries with different currencies, we need to convert it into a common currency. There are 2 common conversion methods. One is the nominal GDP method, which uses market exchange rates. The other, more robust method is purchasing power parity (PPP), which takes into account the relative cost of local goods, services, and inflation rates of the country. For this project, I used PPP exclusively.

The GDP chart shows nothing unexpected, and the order is very similar to what I would get if I ranked by nominal GDP, except that Indonesia has a really high GDP after I adjusted for cost of living using PPP.
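The difference between the two conversions comes down to which divisor is used. The Python sketch below uses an invented country to make that concrete. The function names and all figures are illustrative assumptions, and the Factbook already publishes GDP in PPP terms, so nothing like this was needed for the project itself.

```python
def gdp_nominal_usd(gdp_local, exchange_rate):
    """Convert using the market exchange rate (local currency units per USD)."""
    return gdp_local / exchange_rate


def gdp_ppp_usd(gdp_local, ppp_factor):
    """Convert using a PPP factor: the number of local currency units
    needed to buy what 1 USD buys in the United States."""
    return gdp_local / ppp_factor


# Invented country: 1 USD trades for 10 local units on the market,
# but only 4 local units are needed to buy what 1 USD buys in the US.
gdp_local = 2_000.0
nominal = gdp_nominal_usd(gdp_local, exchange_rate=10.0)
ppp = gdp_ppp_usd(gdp_local, ppp_factor=4.0)

print(f"nominal GDP: {nominal:.0f} USD, PPP GDP: {ppp:.0f} USD")
```

When local prices are low relative to the exchange rate, as in this example, the PPP figure comes out larger than the nominal one. That is why lower-cost countries tend to move up in a PPP ranking.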

6. Okun's law: Does it apply to all countries?

Unemployment rate vs. GDP per capita

We all learnt in Econ 101 about Okun’s law, which states that a 1% increase in unemployment is associated with a 2% fall in GDP. In other words, I should see a slope of -0.5 for this chart, with unemployment rate on the y-axis and GDP (PPP) on the x-axis.

Okun's law is a very straightforward way to investigate the relationship between economic growth and unemployment. As with all economic rules of thumb, it doesn’t hold 100% of the time. It has been known to shift over time and to be affected by unusual economic climates, such as jobless recoveries and the more recent financial crisis.

As you can see, we do have a negative slope (signaling the inverse relationship between unemployment and GDP), even though it is not necessarily close to -0.5.
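The slope discussed above is the slope of a straight line fitted through the scatter plot. Below is a minimal Python sketch of an ordinary least-squares fit. It assumes a simple linear fit (the post does not say how the line in the chart was fitted), and the points are invented and placed exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented points in arbitrary units, lying exactly on y = -0.5 * x + 10.
gdp = [2.0, 4.0, 6.0, 8.0]
unemployment = [9.0, 8.0, 7.0, 6.0]

slope, intercept = fit_line(gdp, unemployment)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Real data will not sit on a line, so the fitted slope only summarizes the overall direction and steepness of the cloud of points.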

Then I tried something more experimental: I divided all countries into 3 brackets of low, middle, and high GDP. Okun's law holds better once countries are split into these 3 GDP buckets.
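One way to form the 3 brackets is to rank countries by GDP and cut the ranking into equal thirds. The Python sketch below does that. The equal-thirds rule is an assumption, since the post does not state where the bracket boundaries were drawn, and the GDP values are invented.

```python
def gdp_bracket_labels(gdp_values):
    """Label each value 'low', 'middle', or 'high' by rank-based thirds."""
    n = len(gdp_values)
    # Indices of the countries, ordered from smallest to largest GDP.
    order = sorted(range(n), key=lambda i: gdp_values[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        if rank < n / 3:
            labels[idx] = "low"
        elif rank < 2 * n / 3:
            labels[idx] = "middle"
        else:
            labels[idx] = "high"
    return labels


# Invented GDP values in arbitrary units.
gdp = [5, 50, 1, 20, 100, 10]
labels = gdp_bracket_labels(gdp)
for value, label in zip(gdp, labels):
    print(value, label)
```

A separate line can then be fitted within each bracket to compare the slopes.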

7. Internet usage

GDP per capita vs internet users

One interesting piece of data in the dataset I scraped from the CIA World Factbook is the number of internet users.

As we can see, there is a positive correlation between the number of internet users and GDP per capita for low to mid GDP countries. That could mean access to internet helps the lower income countries to grow faster and improve the average condition of their citizens more than it does for the higher income nations.

8. Education expenditure

Education expenditures include all public expenditure on education. It is an important figure that shows a country's emphasis on nurturing and educating the next generation.

Education expenditure by $

Looking at dollar values alone, the US and India are the top two countries by amount spent on education.

Education expenditure by % of GDP

If we look at education expenditure as a percentage of GDP, the perspective changes: smaller countries in Central America and Australia-Oceania spend a significant percentage of their GDP on education, even more than the European countries traditionally known for good social welfare systems.

GDP per capita vs. education expenditure

There is a strong positive correlation between education expenditure and GDP per capita for lower GDP countries. That could mean access to education helps the lower income countries to grow faster and improve the average condition of their citizens more than it does for the higher income nations.

9. Health expenditure problem revisited

At the end of the project, I revisited the question I raised earlier, by dividing countries into 3 GDP brackets instead of continents. I would like to explore if the correlation between health expenditure and the prevalence (or lack) of the eight major health concerns changes if I choose to group countries by GDP brackets. The result is as follows:

It turns out that the correlation becomes weaker for almost every area of health concern if I group countries by GDP brackets. In other words, grouping countries by geographical continent gives me a better correlation matrix.

Conclusion

People & Health:

  • Countries in Africa face prominent issues with health care and quality of life.
  • Countries in Europe and Asia face the problem of population aging.
  • Larger health expenditures do not necessarily alleviate specific global health concerns.
  • Data visualization helps us to see that global health is a topic that is relevant to all of us.

Education, Communication and Economy

  • The inverse relationship between GDP and unemployment (Okun’s law) holds for our data set.
  • Stronger correlation could be found if we group countries into 3 GDP brackets.
  • There is a positive relationship between GDP per capita and number of internet users for mid to lower GDP countries. This relationship is negative for high GDP countries.
  • There is a positive relationship between GDP per capita and education expenditures for mid to lower GDP countries. This relationship is negative for high GDP countries.

Lastly, thank you for reading through this post. Here is a PDF version of my presentation if you are interested in knowing more about this project:

https://drive.google.com/file/d/146sDn-E0VJcMBEInw6-yhEd62XxYVAjk/view
