Covid-19 Global Analysis: 2021

Posted on Sep 28, 2022


Back in the summer of 2021, vaccines were ready and available in the United States but not all over the globe. As my family in Trinidad and colleagues in India reported, people in those areas were on waiting lists to receive vaccines that were in very limited supply. I wondered why this was so. One can ponder various approaches to vaccinating the world. One way is to think like a capitalist and let each country figure out how to purchase or produce the vaccine. Another way is to think like a socialist and redistribute the vaccines regardless of economic or population factors.

In this project I investigated the Global Covid 19 Vaccination Tracker database from Kaggle and the 2020 World Bank Population data, mainly looking at correlations among population, GDP, vaccine availability for the percentage of the population, the daily dose rate, and percentage of the population fully vaccinated. Larger populations may take longer to get fully vaccinated. Countries with better economies may be able to pay for more doses. Higher supply might trigger faster distribution, which in turn could help a country achieve full vaccination. With these assumptions, I expected the following correlations prior to analysis:

  • population to be negatively correlated with vaccine availability for percentage of the population, daily dose rate, and percentage of population being fully vaccinated
  • GDP to be positively correlated with vaccine availability for percentage of the population, daily dose rate, and percent of population being fully vaccinated
  • vaccine availability for percentage of the population would be positively correlated with daily dose rate and percent of the population being fully vaccinated
  • daily dose rate would be positively correlated with percent of the population being fully vaccinated

The null hypothesis for each of the nine listed correlations is that the two features are not correlated.


I. Data Manipulation

I followed the notebook Vaccine Gap in the World to fix the country names in both Global COVID19 Vaccination Tracker CSVs: Global_COVID_Vaccination_Tracker.csv, which I will call VaccineTrackerData (vaccinationNumbers in my GitHub notebook), and GDP_PerCapita.csv, which I will call GDPData (gdp in my GitHub notebook). With GDPData, I removed all unrecognizable symbols by replacing all instances of "�" with an empty string and replaced "United Arab Emirates" with "UAE". For VaccineTrackerData, I replaced "U.S." with "United States", "U.K." with "United Kingdom", "Mainland China" with "China", and "Republic of the Congo" with "Congo". After these replacements, I merged the dataframes on GDPData["Country/Territory"] and VaccineTrackerData["Countries and regions"] into a variable called datamerge.

I now needed population data. I took the "Country Name" and "2020" columns of the World Bank Population data and assigned them to a variable called popdata. The specific .csv file I used from the World Bank website is API_SP.POP.TOTL_DS2_en_csv_v2_2918012.csv. The last 7 digits in the file name are timestamp related, so that number may change upon download. I renamed the "2020" column to "population" and attempted to merge popdata with datamerge. As that led to a lot of mismatched countries, I performed an outer merge of these data frames on datamerge["Country/Territory"] and popdata["Country Name"] with indicator=True, and dubbed the merged data frame tempMergeForMismatches.
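The mismatch check described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the helper name find_mismatches and the toy rows are hypothetical, while the column names and the outer merge with indicator=True come from the description above.

```python
import pandas as pd


def find_mismatches(datamerge, popdata):
    """List country names that appear in only one of the two data frames."""
    temp = datamerge.merge(
        popdata,
        how="outer",
        left_on="Country/Territory",
        right_on="Country Name",
        indicator=True,  # adds a "_merge" column: left_only / right_only / both
    )
    only_kaggle = temp.loc[temp["_merge"] == "left_only", "Country/Territory"]
    only_world_bank = temp.loc[temp["_merge"] == "right_only", "Country Name"]
    return sorted(only_kaggle), sorted(only_world_bank)


# Toy rows with placeholder population values, for illustration only.
datamerge = pd.DataFrame({"Country/Territory": ["Russia", "India"]})
popdata = pd.DataFrame({
    "Country Name": ["Russian Federation", "India"],
    "population": [2, 1],
})
only_kaggle, only_world_bank = find_mismatches(datamerge, popdata)
print(only_kaggle)      # names present only in datamerge
print(only_world_bank)  # names present only in popdata
```

Reading the two lists side by side makes it easier to spot pairs that refer to the same country under different names.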

With tempMergeForMismatches I was able to see which countries did not match. A good portion of the countries should have matched but did not because the Kaggle database and the World Bank database named them differently. After extensive searching and learning about certain countries, I made the following replacements in popdata:

  • "Bahamas, The" to "Bahamas"
  • "Brunei Darussalam" to "Brunei"
  • "Cabo Verde" to "Cape Verde"
  • "Congo, Dem. Rep." to "DR Congo"
  • "Egypt, Arab Rep." to "Egypt"
  • "Gambia, The" to "Gambia"
  • "Hong Kong SAR, China" to "Hong Kong"
  • "Iran, Islamic Rep." to "Iran"
  • "Kyrgyz Republic" to "Kyrgyzstan"
  • "Lao PDR" to "Laos"
  • "Russian Federation" to "Russia"
  • "Korea, Rep." to "South Korea"
  • "Syrian Arab Republic" to "Syria"
  • "Macao SAR, China" to "Macau"
  • "United Arab Emirates" to "UAE"
  • "Venezuela, RB" to "Venezuela"
  • "Yemen, Rep." to "Yemen"
  • "Congo, Rep." to "Congo"
  • "Cote d'Ivoire" to "Ivory Coast"
  • "Sint Maarten (Dutch part)" to "Sint Maarten"
  • "Slovak Republic" to "Slovakia"
  • "St. Kitts and Nevis" to "Saint Kitts and Nevis"

I reattempted the outer merge, saw that no other countries could be matched, and proceeded with an inner join on datamerge["Country/Territory"] and popdata["Country Name"]. I reassigned the merged data frame to datamerge, which now held the GDP, population, and vaccination related data I needed to proceed to data analysis.
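The renaming and the final inner join can be sketched as below. This is a hedged illustration, not the notebook's code: merge_population and NAME_FIXES are hypothetical names, the dictionary holds only a few of the replacements listed above, and the population values are placeholders.

```python
import pandas as pd

# A few of the replacements listed above (World Bank name -> Kaggle name).
NAME_FIXES = {
    "Bahamas, The": "Bahamas",
    "Korea, Rep.": "South Korea",
    "Lao PDR": "Laos",
    "Russian Federation": "Russia",
}


def merge_population(datamerge, popdata, name_fixes):
    """Rename World Bank countries, then inner-join population onto datamerge."""
    fixed = popdata.copy()
    # Series.replace with a dict swaps exact matches and leaves other names alone.
    fixed["Country Name"] = fixed["Country Name"].replace(name_fixes)
    return datamerge.merge(
        fixed,
        how="inner",
        left_on="Country/Territory",
        right_on="Country Name",
    )


# Toy rows with placeholder population values, for illustration only.
datamerge = pd.DataFrame({"Country/Territory": ["Russia", "Laos", "India"]})
popdata = pd.DataFrame({
    "Country Name": ["Russian Federation", "Lao PDR", "India", "World"],
    "population": [3, 1, 4, 10],
})
merged = merge_population(datamerge, popdata, NAME_FIXES)
print(merged[["Country/Territory", "population"]])
```

The inner join silently drops rows without a partner, such as the "World" aggregate here, which is why the outer merge check is worth running first.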

II. Data Analysis

II.a) Choropleths

Maps are helpful for displaying data in a geographic fashion, so I decided to apply the data to choropleths. The first attempted choropleth was GDP. However, the wealthiest countries are tiny countries that don't appear on the choropleth, like Monaco. Due to this wealth disparity, the whole map became red, and no green could be seen unless one zoomed into Europe for the tiny countries with great wealth. Accordingly, for GDP I excluded Monaco and Liechtenstein and set the max value with Luxembourg. Even though Luxembourg is still a tiny, invisible country, lowering the max data value created more color diversity on the rest of the map.

Choropleth creation happens in the makeChoropleth method of src/. It takes the data frame of interest, the name of the column to examine, the preferred title of the choropleth image, and an optional color range given as a two element list (colorRange). If the color range is not given, it defaults to the range from zero to the maximum value of the dataframe column of interest. It applies the inputs to a choropleth with range_color=colorRange, and saves the choropleth as a .png in the images/choropleths folder.
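A function with the behavior described above might look like the sketch below. It assumes Plotly Express, which I infer only from the range_color parameter name; the function names, the country_column default, the "country names" location mode, and the red-to-green color scale are all illustrative assumptions rather than the notebook's code.

```python
import pandas as pd


def default_color_range(df, column):
    """Fallback color range: zero to the maximum value of the column."""
    return [0, df[column].max()]


def make_choropleth(df, column, title, color_range=None,
                    country_column="Country/Territory", output_path=None):
    """Build a world choropleth of df[column], optionally saving it as an image."""
    # Imported here so default_color_range can be used without Plotly installed.
    import plotly.express as px

    if color_range is None:
        color_range = default_color_range(df, column)
    fig = px.choropleth(
        df,
        locations=country_column,
        locationmode="country names",       # match rows by country name (assumed)
        color=column,
        range_color=color_range,            # values above the max share the top color
        color_continuous_scale="RdYlGn",    # red for low, green for high (assumed)
        title=title,
    )
    if output_path is not None:
        fig.write_image(output_path)  # static export needs the kaleido package
    return fig
```

Passing an explicit color_range is what the GDP map relies on: capping the range at a lower value than the true maximum lets the rest of the map spread across more of the color scale.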

II.b) Bar Charts

To supply more information on the maximum and minimum values of each data feature of interest, I created top 5 and bottom 5 bar charts on each feature. Bar chart creation happens in the createBarCharts method of src/, which requires the dataframe, the name of the column of examination, and preferred title prefix for the Top 5 and Bottom 5 .png files (e.g. createBarCharts(datamerge,'GDP Estimate in USD per capita as per UN','GDP') creates GDPTop5.png and GDPBottom5.png from datamerge['GDP Estimate in USD per capita as per UN']). The createBarCharts method saves the generated .png files in images/barCharts.
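The top 5 and bottom 5 selection can be sketched as follows. This is a minimal sketch with matplotlib assumed as the plotting library; the helper top_and_bottom, the country_column default, and the chart styling are hypothetical, while the file naming follows the GDPTop5.png and GDPBottom5.png example above.

```python
import os

import matplotlib
matplotlib.use("Agg")  # render to files only, no display needed
import matplotlib.pyplot as plt
import pandas as pd


def top_and_bottom(df, column, n=5):
    """Return the n rows with the largest and the n rows with the smallest values."""
    ranked = df.dropna(subset=[column])
    return ranked.nlargest(n, column), ranked.nsmallest(n, column)


def create_bar_charts(df, column, prefix, country_column="Country/Territory",
                      out_dir="images/barCharts"):
    """Save <prefix>Top5.png and <prefix>Bottom5.png and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    top, bottom = top_and_bottom(df, column)
    paths = []
    for label, subset in (("Top5", top), ("Bottom5", bottom)):
        fig, ax = plt.subplots()
        ax.bar(subset[country_column], subset[column])
        ax.set_title(f"{prefix} {label}")
        ax.set_ylabel(column)
        ax.tick_params(axis="x", rotation=45)
        path = os.path.join(out_dir, f"{prefix}{label}.png")
        fig.savefig(path, bbox_inches="tight")
        plt.close(fig)
        paths.append(path)
    return paths
```

Dropping missing values before ranking keeps countries without data out of the bottom 5.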

II.c) Check for Normalcy in Data

For the correlation analysis, I needed to decide whether to use the Pearson or the Spearman method. Pearson assumes normally distributed data. For this sanity check I invoke the checkNormality method, which only requires the dataframe and a column name. It saves a .png file of a Seaborn kde histplot and a qqplot. I can then visually check for normality by examining the kde shape and the straightness of the qqplot.

II.d) Examine Correlations

Without spoiling too much of Results Section II, I determined that all features of interest required a Spearman correlation, which is carried out by spearmanCorrelate. The correlation method draws two overlapping Seaborn regplots: one to show the scattered data points, and one to show the orange lowess regression line. I used the lowess regression line because none of the selected data is normally distributed.
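The correlation step can be sketched as follows, assuming scipy.stats.spearmanr for the statistic; I do not know which function the notebook actually calls. The function names and the toy data are hypothetical, while the two overlapping regplots and the orange lowess line follow the description above.

```python
import pandas as pd
from scipy import stats


def spearman_correlate(df, x_column, y_column):
    """Return the Spearman R value and p-value for two columns."""
    pair = df[[x_column, y_column]].dropna()
    rho, p_value = stats.spearmanr(pair[x_column], pair[y_column])
    return rho, p_value


def plot_with_lowess(df, x_column, y_column, ax=None):
    """Overlay a scatter regplot and an orange lowess regplot on one axis."""
    # Imported here so spearman_correlate works without plotting libraries;
    # lowess=True additionally requires the statsmodels package.
    import seaborn as sns

    pair = df[[x_column, y_column]].dropna()
    ax = sns.regplot(data=pair, x=x_column, y=y_column, fit_reg=False, ax=ax)
    sns.regplot(data=pair, x=x_column, y=y_column, scatter=False,
                lowess=True, color="orange", ax=ax)
    return ax


# Toy data: monotonic but not linear, so the ranks agree perfectly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})
rho, p_value = spearman_correlate(toy, "x", "y")
print(round(rho, 2))
```

Because Spearman works on ranks, any strictly increasing relationship scores as a perfect correlation, which is why it suits data that is not normally distributed.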

             | Population | GDP            | Enough for %  | Daily Dose    | % fully vac.
Population   | Identity   | Not interested | Results III.d | Results III.f | Results III.e
GDP          | Duplicate  | Identity       | Results III.a | Results III.c | Results III.b
Enough for % | Duplicate  | Duplicate      | Identity      | Results III.h | Results III.g
Daily Dose   | Duplicate  | Duplicate      | Duplicate     | Identity      | Results III.i

The table above shows the correlations studied in this project and the Results subsections where each correlation can be found. I excluded the identity and duplicate correlations because they add no information. I also excluded the GDP-Population correlation because this project focuses on Covid-19 vaccine distribution and administration, not the wealth disparity amongst countries in the world.


Results

I. Choropleths and Bar Charts

I.a) Population

The five most populous countries are China, India, the United States, Indonesia, and Pakistan. The five least populous are Nauru, San Marino, Monaco, Sint Maarten, and Saint Kitts and Nevis.

The Chinese and Indian populations are each about 1.4 billion, making them the only two green countries in this choropleth.

I.b) GDP

For GDP, Monaco, Bermuda, Luxembourg, Cayman Islands, and Switzerland are the top 5, and Somalia, Malawi, South Sudan, Central African Republic, and Afghanistan are in the bottom 5.

According to the global view of this choropleth, most of the wealth is found in the U.S., Canada, Western Europe, and Australia.

I.c) Enough Vaccines for Percent of Population

At the time of this study, Maldives, Greenland, UAE, Bahrain, and Uruguay had the highest vaccine availability for percent of the people, and DR Congo, South Sudan, Haiti, Chad, and Burkina Faso had the lowest.

Seeing the choropleth, one can observe that Africa struggled with getting enough vaccines for its population.

I.d) Daily Dose Rate

India, China, Indonesia, Brazil, and Japan had the highest dose administration per day rate, while San Marino, Saint Kitts and Nevis, Guinea-Bissau, Sint Maarten, and Bermuda had the lowest.

India and China outdid the third-place country tenfold, as they did in population, and are again the only green countries in this choropleth.

I.e) Percent Fully Vaccinated

Regarding the percent fully vaccinated race, the top 5 countries are Malta, Maldives, Qatar, Singapore, and Portugal, and the bottom 5 countries are South Sudan, Haiti, Burkina Faso, Yemen, and Chad. 

There seems to be a struggle to get full vaccination in South Asia, the Middle East, and Africa, according to the map.

II. Normalcy in Data

None of the selected features displays a Gaussian kde distribution plot or a linear qqplot. Hence I used Spearman correlation rather than Pearson correlation for the correlation analysis.

III. Correlations

III.a) Does GDP correlate with vaccine availability for percent of the population?

Country GDP and vaccine availability for percent of the population received a Spearman R value of 0.83. With a p-value of 0.00000, the null hypothesis is rejected: this correlation is significant.

III.b) Does GDP correlate with the percent of people fully vaccinated?

Percent fully vaccinated correlates with GDP with a Spearman R value of 0.83. This correlation is significant with a p-value of 0.00000, rejecting the null hypothesis.

III.c) Does GDP correlate with the number of doses per day?

Daily Dose rate and GDP produce a Spearman correlation value of 0.05. A p-value of 0.52846 fails to reject the null hypothesis, indicating that this correlation is not significant.

III.d) Does population correlate with vaccine availability for percent of the population?

The correlation between vaccine availability for percentage of the population and country population receives a Spearman R value of -0.25. With a p-value of 0.00067, this is a significant anti-correlation, and the null hypothesis is rejected.

III.e) Does population correlate with the percent of people fully vaccinated?

Population and percent fully vaccinated also demonstrate a Spearman R value of -0.25. The p-value here is 0.00070. As a result, the significant anti-correlation rejects the null hypothesis.

III.f) Does population correlate with the number of doses per day?

The correlation between daily dose rate and population has a Spearman R value of 0.75. Here, the null hypothesis is rejected with a p-value of 0.00000.

III.g) Does vaccine availability for percent of the population correlate with percent of population fully vaccinated?

The Spearman correlation between percent fully vaccinated and vaccine availability for percentage of the population yields an R value of 0.99 and a p-value of 0.0000; the correlation is significant, and the null hypothesis is rejected.

III.h) Does vaccine availability for percent of the population correlate with doses per day?

Daily dose rate and vaccine availability for percent of the population receive a Spearman R value of 0.14 and a p-value of 0.05324; this correlation is not significant, and the null hypothesis is not rejected.

III.i) Does number of doses per day correlate with percent of population being fully vaccinated?

Finally, the Spearman correlation between percent fully vaccinated and daily dose rate has an R value of 0.09. With a p-value of 0.20716, the null hypothesis is not rejected.
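The nine results can be collected and checked against a single threshold, as in the short sketch below. The R values and p-values are the ones reported in III.a through III.i; the 0.05 significance level is my assumption, chosen because it is consistent with every significant and not significant call above, and the variable names are illustrative.

```python
ALPHA = 0.05  # assumed significance level; not stated explicitly in the text

# subsection: (feature 1, feature 2, Spearman R, p-value as reported)
RESULTS = {
    "III.a": ("GDP", "Enough for %", 0.83, 0.00000),
    "III.b": ("GDP", "% fully vac.", 0.83, 0.00000),
    "III.c": ("GDP", "Daily Dose", 0.05, 0.52846),
    "III.d": ("Population", "Enough for %", -0.25, 0.00067),
    "III.e": ("Population", "% fully vac.", -0.25, 0.00070),
    "III.f": ("Population", "Daily Dose", 0.75, 0.00000),
    "III.g": ("Enough for %", "% fully vac.", 0.99, 0.0000),
    "III.h": ("Enough for %", "Daily Dose", 0.14, 0.05324),
    "III.i": ("Daily Dose", "% fully vac.", 0.09, 0.20716),
}


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


significant = [key for key, (_, _, _, p) in RESULTS.items() if is_significant(p)]
not_significant = [key for key in RESULTS if key not in significant]

for key, (first, second, r_value, p) in RESULTS.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{key}: {first} vs {second}: R={r_value:+.2f}, p={p:.5f} ({verdict})")
```

Under this threshold, III.h is the closest call: its p-value of 0.05324 falls just above 0.05.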


Discussion

I. Significant correlation with vaccine availability for percent of the population and percent of population fully vaccinated

The significant correlation between vaccine availability for percent of the population and percent of population fully vaccinated in Results III.g suggests that vaccines are being administered at maximum capacity; i.e., once the vaccine is available, it is administered. I will refer to these two features as "the two highly correlated targets".

II. Significant correlation with GDP and the two highly correlated targets

The significant GDP to vaccine feature correlations in Results III.a and III.b indicate that GDP affects how many doses a population gets. In turn, GDP also affects the rate of progress towards full vaccination.

III. Significant anti-correlation with population and the two highly correlated targets

There are significant anti-correlations between population and the vaccine features in Results III.d and III.e. These anti-correlations suggest that the larger a country's population is, the harder it will be to become fully vaccinated.

IV. No significant correlation between daily doses and the two highly correlated targets

It can be seen from Results III.h and III.i that neither vaccine availability for percent of the population nor percent of population fully vaccinated is significantly correlated with the daily dose rate. In hindsight, I see that the two highly correlated targets are percentage values; each country will require a different number of total doses in order to reach 100% vaccination status. This topic is discussed further in discussion section V.

V. Other Correlations Related to Daily Dose

Result III.c shows no significant correlation between daily dose rate and GDP. However, Result III.f shows a significant correlation between daily dose rate and population. Regardless of the wealth and status of a country, a higher population indicates a larger daily dose rate. In relation to discussion sections III and IV, more populous countries might be administering more vaccines, but it is still going to be difficult for them to reach full vaccination status because they require more doses.


In this study we saw the following significant correlations:

  1. Enough vaccines for percent of population significantly correlates with percent fully vaccinated.
  2. GDP significantly correlates with enough vaccines for percent of population and percent fully vaccinated.
  3. Population significantly anti-correlates with enough vaccines for percent of population and percent fully vaccinated.
  4. Daily dose numbers significantly correlate with population.

Conclusion statements 1-3 agree with the introductory assumptions. According to conclusion point 1, we can infer that we’re vaccinating at full capacity; as soon as supply is there, the vaccine is administered. GDP, which is mentioned in conclusion point 2, influences vaccine creation and purchasing, which in turn helps well-off countries get vaccinated faster. The third conclusion suggests that more populous countries face a higher demand for vaccines.

The fourth conclusion statement disagrees with the introductory assumption of population being negatively correlated with daily dose. Most countries, including countries with higher populations, will administer vaccines as soon as they are available. Even though highly populous countries have low vaccine availability for a percent of their citizens, a low vaccine availability for percent of a large population does not mean that the daily dose rate is small. In other words, a small percent of a large number could still be a large number.

Based on these conclusions, GDP and population are the strongest factors in getting enough vaccines to a country's population and achieving full vaccination. How can this be improved? Rather than suggesting the pure socialist or pure capitalist solution mentioned in the introduction, I suggest the oxygen mask approach: get well-off countries vaccinated to the point of herd immunity, then funnel vaccines to poorer countries. Hopefully, when the next global pandemic happens in a hundred years, this approach will be taken to curb variant mutation faster than we did in this pandemic, despite other socio-political choices of defiant individuals.

Future Work

The last conclusion point, that daily dose numbers significantly correlate with population, suggests that the next step should look into "doses administered", which is the total doses administered, not the daily dose rate. This was overlooked since a larger number of total doses administered may not indicate that the population is fully vaccinated. This should be tested, not assumed.


Github Link

About Author

Gary Simmons

Open-minded and tenacious data scientist and machine learning programmer familiar with large dataset analysis, Angular user interface enhancement, .NET Core REST API problem solving, and relational database management. My Applied Physics BS, Physics MS, and software development background...