Covid-19 Global Analysis: 2021
Back in the summer of 2021, vaccines were readily available in the United States but not across the globe. As my family in Trinidad and colleagues in India reported, people in those areas were on waiting lists for vaccines that were in very limited supply. I wondered why this was so. One can ponder various approaches to vaccinating the world. One way is to think like a capitalist and let each country figure out how to purchase or produce the vaccine. Another way is to think like a socialist and redistribute the vaccines regardless of economic or population factors.
In this project I investigated the Global COVID19 Vaccination Tracker database from Kaggle and the 2020 World Bank Population data, mainly looking at correlations between population, GDP, vaccine availability for the percentage of the population, the daily dose rate, and percentage of the population fully vaccinated. Larger populations may take longer to get fully vaccinated. Countries with better economies may be able to pay for more doses. Higher supply might trigger faster distribution, which in turn could help a country achieve full vaccination. With these assumptions, I expected the following correlations prior to analysis:
- population to be negatively correlated with vaccine availability for percentage of the population, daily dose rate, and percentage of population being fully vaccinated
- GDP to be positively correlated with vaccine availability for percentage of the population, daily dose rate, and percent of population being fully vaccinated
- vaccine availability for percentage of the population would be positively correlated with daily dose rate and percent of the population being fully vaccinated
- daily dose rate would be positively correlated with percent of the population being fully vaccinated
The null hypothesis for each of the nine correlations listed above is that there is no significant correlation between the two features.
I. Data Manipulation
I followed the notebook Vaccine Gap in the World to fix the country names in both Global COVID19 Vaccination Tracker CSVs: Global_COVID_Vaccination_Tracker.csv, which I will call VaccineTrackerData (vaccinationNumbers in my github notebook), and GDP_PerCapita.csv, which I will call GDPData (gdp in my github notebook). With GDPData, I removed all unrecognizable symbols by replacing all instances of "�" with an empty string and replaced "United Arab Emirates" with "UAE". For VaccineTrackerData, I replaced "U.S." with "United States", "U.K." with "United Kingdom", "Mainland China" with "China", and "Republic of the Congo" with "Congo". After these replacements, I merged the dataframes on GDPData["Country/Territory"] and VaccineTrackerData["Countries and regions"] into a variable called datamerge.
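The cleaning and merge described above might look like the following minimal sketch. The two tiny data frames stand in for the real CSVs: their rows and the "GDP" and "Doses" columns are placeholders I made up for illustration, and only the key column names and the replacement pairs come from the text.

```python
import pandas as pd

# Tiny stand-ins for GDP_PerCapita.csv and Global_COVID_Vaccination_Tracker.csv.
# All values are placeholders, not real data.
GDPData = pd.DataFrame({
    "Country/Territory": ["United States", "United Arab Emirates", "China\ufffd"],
    "GDP": [1.0, 2.0, 3.0],
})
VaccineTrackerData = pd.DataFrame({
    "Countries and regions": ["U.S.", "UAE", "Mainland China"],
    "Doses": [10, 20, 30],
})

# Strip the unrecognizable replacement character, then shorten one name.
GDPData["Country/Territory"] = (
    GDPData["Country/Territory"]
    .str.replace("\ufffd", "", regex=False)
    .replace({"United Arab Emirates": "UAE"})
)

# Align the tracker's country names with the GDP table.
VaccineTrackerData["Countries and regions"] = VaccineTrackerData[
    "Countries and regions"
].replace({
    "U.S.": "United States",
    "U.K.": "United Kingdom",
    "Mainland China": "China",
    "Republic of the Congo": "Congo",
})

# merge() defaults to an inner join, so only names present in both tables survive.
datamerge = GDPData.merge(
    VaccineTrackerData,
    left_on="Country/Territory",
    right_on="Countries and regions",
)
print(datamerge)
```

Because the two key columns have different names, both are kept in the merged frame; either one can be dropped afterwards.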
I now needed population data. I took the "Country Name" and "2020" columns of the World Bank Population data and assigned them to a variable called popdata. The specific .csv file I used from the World Bank website is API_SP.POP.TOTL_DS2_en_csv_v2_2918012.csv. The last seven digits of the file name are timestamp related, so that number may change upon download. I renamed the "2020" column to "population" and attempted to merge popdata with datamerge. As that led to a lot of mismatched countries, I performed an outer merge of these data frames on datamerge["Country/Territory"] and popdata["Country Name"] with indicator=True, and dubbed the merged data frame tempMergeForMismatches.
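As a rough illustration of this diagnostic step, the sketch below runs an outer merge with indicator=True on placeholder rows. Only the column names and the variable names datamerge, popdata, and tempMergeForMismatches come from the text; the values and the mismatches variable are assumptions for the example.

```python
import pandas as pd

# Placeholder rows; only the column names come from the text.
datamerge = pd.DataFrame({
    "Country/Territory": ["Bahamas", "Russia", "India"],
    "GDP": [1.0, 2.0, 3.0],
})
popdata = pd.DataFrame({
    "Country Name": ["Bahamas, The", "Russian Federation", "India"],
    "2020": [100, 200, 300],
}).rename(columns={"2020": "population"})

# indicator=True adds a "_merge" column holding "both", "left_only" or "right_only".
tempMergeForMismatches = datamerge.merge(
    popdata,
    how="outer",
    left_on="Country/Territory",
    right_on="Country Name",
    indicator=True,
)

# Rows that found no partner in the other table are the naming mismatches.
mismatches = tempMergeForMismatches[tempMergeForMismatches["_merge"] != "both"]
print(mismatches[["Country/Territory", "Country Name", "_merge"]])
```

Sorting the mismatched rows by name puts pairs such as "Russia" and "Russian Federation" near each other, which makes the manual comparison easier.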
With tempMergeForMismatches I was able to see which countries did not match. A good portion of the countries should have matched but did not because the Kaggle database and the World Bank database named them differently. After extensive searching and learning about certain countries, I made the following replacements in popdata:
- "Bahamas, The" to "Bahamas"
- "Brunei Darussalam" to "Brunei"
- "Cabo Verde" to "Cape Verde"
- "Congo, Dem. Rep." to "DR Congo"
- "Egypt, Arab Rep." to "Egypt"
- "Gambia, The" to "Gambia"
- "Hong Kong SAR, China" to "Hong Kong"
- "Iran, Islamic Rep." to "Iran"
- "Kyrgyz Republic" to "Kyrgyzstan"
- "Lao PDR" to "Laos"
- "Russian Federation" to "Russia"
- "Korea, Rep." to "South Korea"
- "Syrian Arab Republic" to "Syria"
- "Macao SAR, China" to "Macau"
- "United Arab Emirates" to "UAE"
- "Venezuela, RB" to "Venezuela"
- "Yemen, Rep." to "Yemen"
- "Congo, Rep." to "Congo"
- "Cote d'Ivoire" to "Ivory Coast"
- "Sint Maarten (Dutch part)" to "Sint Maarten"
- "Slovak Republic" to "Slovakia"
- "St. Kitts and Nevis" to "Saint Kitts and Nevis"
I reattempted the outer merge, saw that no other countries could be matched, and proceeded with an inner join on datamerge["Country/Territory"] and popdata["Country Name"]. I reassigned the merged data frame to datamerge, which now held the GDP, population, and vaccination related data that would allow me to proceed to data analysis.
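One way to apply the replacements listed above is a single dictionary passed to Series.replace, followed by the inner join. This is a sketch under assumptions: the dictionary name WORLD_BANK_TO_KAGGLE is hypothetical, only three of the pairs are shown, and the numeric values are placeholders.

```python
import pandas as pd

# A few of the replacements listed above; the remaining pairs go in the same dict.
WORLD_BANK_TO_KAGGLE = {
    "Bahamas, The": "Bahamas",
    "Korea, Rep.": "South Korea",
    "Russian Federation": "Russia",
}

popdata = pd.DataFrame({
    "Country Name": ["Bahamas, The", "Korea, Rep.", "Russian Federation", "World"],
    "population": [1, 2, 3, 4],  # placeholder values
})
datamerge = pd.DataFrame({
    "Country/Territory": ["Bahamas", "South Korea", "Russia"],
    "GDP": [1.0, 2.0, 3.0],  # placeholder values
})

# Rename World Bank spellings to the Kaggle spellings in one pass.
popdata["Country Name"] = popdata["Country Name"].replace(WORLD_BANK_TO_KAGGLE)

# The inner join drops rows without a counterpart, such as the aggregate "World".
datamerge = datamerge.merge(
    popdata, how="inner", left_on="Country/Territory", right_on="Country Name"
)
print(datamerge)
```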
II. Data Analysis
II.a) Choropleths
Maps are helpful for displaying data in a geographic fashion, so I decided to apply the data to choropleths. The first attempted choropleth was GDP. However, the wealthiest countries turned out to be tiny countries that are barely visible on the choropleth, like Monaco. Due to this wealth disparity, the whole map became red, and no green could be seen unless one zoomed into Europe to find the tiny countries with great wealth. Accordingly, for GDP I excluded Monaco and Liechtenstein and set the max value with Luxembourg. Even though Luxembourg is still a tiny, nearly invisible country, lowering the max data value created more color diversity on the rest of the map.
Choropleth creation happens in the makeChoropleth method of src/utils.py. It takes the data frame of interest, the name of the column under examination, the preferred title of the choropleth image, and an optional color range given as a two-element list (colorRange). If the color range is not given, the range defaults to zero through the maximum value of the dataframe column of interest. It passes the inputs to plotly.express.choropleth with range_color=colorRange, and saves the choropleth as a .png in the images/choropleths folder.
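The following is a minimal sketch of how such a function could be put together, not the actual code in src/utils.py. The names resolveColorRange and makeChoroplethSketch, the countryColumn and outDir parameters, the use of locationmode="country names", and naming the file after the title are all assumptions for illustration.

```python
import os

import pandas as pd


def resolveColorRange(df, column, colorRange=None):
    """Return colorRange, or [0, column maximum] when none is given."""
    if colorRange is None:
        return [0, df[column].max()]
    return list(colorRange)


def makeChoroplethSketch(df, column, title, colorRange=None,
                         countryColumn="Country/Territory",
                         outDir="images/choropleths"):
    # Imported lazily so the range helper above works without plotly installed.
    import plotly.express as px

    fig = px.choropleth(
        df,
        locations=countryColumn,
        locationmode="country names",  # assumption: rows are matched by country name
        color=column,
        range_color=resolveColorRange(df, column, colorRange),
        title=title,
    )
    os.makedirs(outDir, exist_ok=True)
    # Static export through write_image needs the optional kaleido package.
    fig.write_image(os.path.join(outDir, f"{title}.png"))
    return fig


# Placeholder values to show the default range.
demo = pd.DataFrame({"GDP": [1.0, 5.0, 3.0]})
print(resolveColorRange(demo, "GDP"))
print(resolveColorRange(demo, "GDP", [0, 2]))
```

Passing an explicit colorRange is how an outlier such as Monaco can be kept from stretching the color scale: the upper bound is set to a smaller value than the column maximum.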
II.b) Bar Charts
To supply more information on the maximum and minimum values of each data feature of interest, I created top 5 and bottom 5 bar charts on each feature. Bar chart creation happens in the createBarCharts method of src/utils.py, which requires the dataframe, the name of the column of examination, and preferred title prefix for the Top 5 and Bottom 5 .png files (e.g. createBarCharts(datamerge,'GDP Estimate in USD per capita as per UN','GDP') creates GDPTop5.png and GDPBottom5.png from datamerge['GDP Estimate in USD per capita as per UN']). The createBarCharts method saves the generated .png files in images/barCharts.
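The selection step behind the top 5 and bottom 5 charts can be sketched with pandas alone. The helper names topAndBottom and barChartFileNames are hypothetical, the values are placeholders, and the plotting call itself is left out because the text does not say which plotting library createBarCharts uses.

```python
import pandas as pd


def topAndBottom(df, column, n=5):
    """Return the n largest and the n smallest rows of df by column."""
    return df.nlargest(n, column), df.nsmallest(n, column)


def barChartFileNames(prefix, outDir="images/barCharts"):
    """Build the two output paths, e.g. GDPTop5.png and GDPBottom5.png."""
    return f"{outDir}/{prefix}Top5.png", f"{outDir}/{prefix}Bottom5.png"


# Placeholder values, not real data.
demo = pd.DataFrame({"country": list("ABCDEFG"), "value": [7, 1, 5, 3, 6, 2, 4]})
top5, bottom5 = topAndBottom(demo, "value")
print(top5["country"].tolist())
print(bottom5["country"].tolist())
print(barChartFileNames("GDP"))
```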
II.c) Check for Normality in Data
To choose between the Pearson and Spearman correlation methods, I first needed to check the data for normality, since Pearson's significance test assumes approximately normally distributed data. For this sanity check I invoke the checkNormality method in utils.py, which only requires the dataframe and a column name. It saves a .png file of a Seaborn kde histplot and a qqplot. I can then visually check for normality by examining the kde shape and the straightness of the qqplot.
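The visual check can be complemented with a number. This sketch is not what checkNormality does; it swaps the plot for scipy.stats.probplot, which fits a least-squares line through the normal Q-Q points and returns its correlation coefficient r, so values near 1 mean a straight qqplot. The function name qqStraightness and both samples are synthetic assumptions.

```python
import numpy as np
from scipy import stats


def qqStraightness(values):
    """Correlation coefficient r of the least-squares line through the normal Q-Q points."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r


rng = np.random.default_rng(0)
normalSample = rng.normal(size=500)               # synthetic, bell-shaped
skewedSample = rng.lognormal(sigma=2, size=500)   # synthetic, heavy right tail

print(qqStraightness(normalSample))
print(qqStraightness(skewedSample))
```

Country-level quantities such as population or daily doses tend to resemble the skewed sample, which is consistent with a bent qqplot.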
II.d) Examine Correlations
Without giving away too much of Results Section II, I determined that all features of interest required a Spearman correlation, which is carried out by spearmanCorrelate of utils.py. The correlation method draws two overlapping Seaborn regplots: one to show the scattered data points, and one to show the orange lowess regression line. I used the lowess regression line because none of the selected data is normally distributed.
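The numeric part of such a correlation step could look like the sketch below, which uses scipy.stats.spearmanr. The function name spearmanSketch, the rounding to two and five decimals, and the sample values are assumptions; the regplot drawing is omitted.

```python
from scipy import stats


def spearmanSketch(x, y):
    """Return Spearman's rho and its two-sided p-value, rounded for reporting."""
    rho, pValue = stats.spearmanr(x, y)
    return round(float(rho), 2), round(float(pValue), 5)


# Placeholder values: y rises with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]
rho, pValue = spearmanSketch(x, y)
print(rho, pValue)
```

Spearman works on ranks, so a relationship that is monotonic but curved still scores as a perfect correlation here. Rounding to five decimals is also why a very small p-value can print as 0.00000.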
| | Population | GDP | Enough for % | Daily Dose | % fully vac. |
|---|---|---|---|---|---|
| Population | Identity | Not interested | Results III.d | Results III.f | Results III.e |
| GDP | Duplicate | Identity | Results III.a | Results III.c | Results III.b |
| Enough for % | Duplicate | Duplicate | Identity | Results III.h | Results III.g |
| Daily Dose | Duplicate | Duplicate | Duplicate | Identity | Results III.i |
The table above shows the correlations studied in this project and the results subsections where each correlation result can be found. I excluded the identity and duplicate correlations because they add no information. I also excluded the GDP-Population correlation because this project focuses on Covid-19 vaccine distribution and administration, not the wealth disparity amongst countries in the world.
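The set of studied pairs can be enumerated in a few lines. This sketch uses the table's own feature labels; the names FEATURES and pairs are assumptions.

```python
from itertools import combinations

FEATURES = ["Population", "GDP", "Enough for %", "Daily Dose", "% fully vac."]

# Unordered pairs skip the identity and duplicate cells of the table;
# the Population-GDP pair is dropped because it is out of scope.
pairs = [p for p in combinations(FEATURES, 2) if p != ("Population", "GDP")]

for first, second in pairs:
    print(f"{first} vs {second}")
print(len(pairs))
```

Five features give ten unordered pairs, and removing the one excluded pair leaves the nine correlations tested.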
I. Choropleths and Bar Charts
I.a) Population
The five most populous countries are China, India, United States, Indonesia, and Pakistan. The five least populous are Nauru, San Marino, Monaco, Sint Maarten, and Saint Kitts and Nevis.
The Chinese and Indian populations are each about 1.4 billion, making them the only two green countries in this choropleth.
I.b) GDP
For GDP, Monaco, Bermuda, Luxembourg, Cayman Islands, and Switzerland are the top 5, and Somalia, Malawi, South Sudan, Central African Republic, and Afghanistan are in the bottom 5.
According to the global view of this choropleth, most of the wealth is found in the U.S., Canada, Western Europe, and Australia.
I.c) Enough Vaccines for Percent of Population
At the time of this study, Maldives, Greenland, UAE, Bahrain, and Uruguay had the highest vaccine availability for percent of the people, and DR Congo, South Sudan, Haiti, Chad, and Burkina Faso had the lowest.
Seeing the choropleth, one can observe that Africa struggled with getting enough vaccines for its population.
I.d) Daily Dose Rate
India, China, Indonesia, Brazil, and Japan had the highest dose administration per day rate, while San Marino, Saint Kitts and Nevis, Guinea-Bissau, Sint Maarten, and Bermuda had the lowest.
India and China outdid the third-place country tenfold, much as they led in population, and are again the only green countries in this choropleth.
I.e) Percent Fully Vaccinated
Regarding the percent fully vaccinated race, the top 5 countries are Malta, Maldives, Qatar, Singapore, and Portugal, and the bottom 5 countries are South Sudan, Haiti, Burkina Faso, Yemen, and Chad.
There seems to be a struggle to get full vaccination in South Asia, the Middle East, and Africa, according to the map.
II. Normality in Data
None of the selected features shows a Gaussian kde distribution plot or a linear qqplot. Hence I used Spearman correlation rather than Pearson correlation for the correlation analysis.
III. Correlations
III.a) Does GDP correlate with vaccine availability for percent of the population?
Country GDP and vaccine availability for percent of the population have a Spearman R value of 0.83. With a p-value that rounds to 0.00000, the null hypothesis is rejected: this correlation is significant.
III.b) Does GDP correlate with the percent of people fully vaccinated?
Percent fully vaccinated correlates with GDP with a Spearman R value of 0.83. With a p-value that rounds to 0.00000, this correlation is significant and the null hypothesis is rejected.
III.c) Does GDP correlate with the number of doses per day?
Daily Dose rate and GDP produce a Spearman correlation value of 0.05. A p-value of 0.52846 fails to reject the null hypothesis, indicating that this correlation is not significant.
III.d) Does population correlate with vaccine availability for percent of the population?
The correlation between vaccine availability for percentage of the population and country population has a Spearman R value of -0.25. With a p-value of 0.00067, this is a significant anti-correlation, and the null hypothesis is rejected.
III.e) Does population correlate with the percent of people fully vaccinated?
Population and percent fully vaccinated also demonstrate a Spearman R value of -0.25. The p-value here is 0.00070. As a result, the significant anti-correlation rejects the null hypothesis.
III.f) Does population correlate with the number of doses per day?
The correlation between daily dose rate and population yields a Spearman R value of 0.75. Here, the null hypothesis is rejected with a p-value that rounds to 0.00000.
III.g) Does vaccine availability for percent of the population correlate with percent of population fully vaccinated?
The Spearman correlation between percent fully vaccinated and vaccine availability for percentage of the population yields an R value of 0.99 and a p-value that rounds to 0.0000; the correlation is significant and the null hypothesis is rejected.
III.h) Does vaccine availability for percent of the population correlate with doses per day?
Daily dose rate and vaccine availability for percent of the population have a Spearman R value of 0.14 and a p-value of 0.05324; this correlation is not significant, so the null hypothesis is not rejected.
III.i) Does number of doses per day correlate with percent of population being fully vaccinated?
Finally, the Spearman correlation between percent fully vaccinated and daily dose rate has an R value of 0.09. A p-value of 0.20716 fails to reject the null hypothesis.
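The nine results can be collected and split by significance in a short script. The R and p-values below are copied from the subsections above; the threshold ALPHA = 0.05 is an assumption, since the text does not state one, though it is consistent with treating p = 0.05324 as not significant.

```python
# (subsection, Spearman R, p-value) as reported in Results III.a through III.i.
RESULTS = [
    ("III.a", 0.83, 0.00000),
    ("III.b", 0.83, 0.00000),
    ("III.c", 0.05, 0.52846),
    ("III.d", -0.25, 0.00067),
    ("III.e", -0.25, 0.00070),
    ("III.f", 0.75, 0.00000),
    ("III.g", 0.99, 0.0000),
    ("III.h", 0.14, 0.05324),
    ("III.i", 0.09, 0.20716),
]

ALPHA = 0.05  # assumed threshold; the text does not state one

significant = [name for name, _, p in RESULTS if p < ALPHA]
notSignificant = [name for name, _, p in RESULTS if p >= ALPHA]

print("significant:", significant)
print("not significant:", notSignificant)
```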
I. Significant correlation with vaccine availability for percent of the population and percent of population fully vaccinated
The significant correlation between vaccine availability for percent of the population and percent of population fully vaccinated in Results III.g suggests that we are distributing the vaccine at maximum capacity; i.e., once the vaccine is available, it is administered. I will refer to these two features as "the two highly correlated targets".
II. Significant correlation with GDP and the two highly correlated targets
The significant correlations between GDP and the vaccine features in Results III.a and III.b indicate that GDP affects how many doses a population gets. In turn, GDP also affects the pace towards full vaccination.
III. Significant anti-correlation with population and the two highly correlated targets
There are significant anti-correlations between population and the vaccine features in Results III.d and III.e. These anti-correlations suggest that the larger a country's population, the harder it is to become fully vaccinated.
IV. No significant correlation between daily doses and the two highly correlated targets
Results III.h and III.i show that neither vaccine availability for percent of the population nor percent of population fully vaccinated is significantly correlated with the daily dose rate. In hindsight, I see that the two highly correlated targets are percentage values; each country requires a different total number of doses to reach 100% vaccination status. This topic is discussed further in discussion section V.
V. Other Correlations Related to Daily Dose
Result III.c shows no significant correlation between daily dose rate and GDP. However, Result III.f shows a significant correlation between daily dose rate and population. Regardless of the wealth and status of a country, a higher population goes with a larger daily dose rate. In relation to discussion sections III and IV, more populous countries might be administering more vaccines, but it is still going to be difficult for them to reach full vaccination status because they require more doses in total.
In this study we saw the following significant correlations:
1. Enough vaccines for percent of population significantly correlates with percent fully vaccinated
2. GDP significantly correlates with enough vaccines for percent of population and percent fully vaccinated
3. Population significantly anti-correlates with enough vaccines for percent of population and percent fully vaccinated
4. Daily dose numbers significantly correlate with population
Conclusion statements 1-3 agree with the introductory assumptions. According to conclusion point 1, we can infer that we're vaccinating at full capacity; as soon as supply is there, the vaccine is administered. GDP, which is mentioned in conclusion point 2, influences vaccine creation and purchasing, which in turn helps well-off countries get vaccinated faster. The third conclusion indicates that more populous countries face a higher demand for vaccines.
The fourth conclusion statement disagrees with the introductory assumption of population being negatively correlated with daily dose. Most countries, including countries with higher populations, will administer vaccines as soon as they are available. Even though highly populous countries have low vaccine availability for a percent of their citizens, a low vaccine availability for percent of a large population does not mean that the daily dose rate is small. In other words, a small percent of a large number could still be a large number.
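A quick worked example makes the last point concrete. The two countries below are hypothetical and the numbers are chosen only for round arithmetic; the helper name peopleCovered is an assumption.

```python
def peopleCovered(population, sharePercent):
    """Number of people that sharePercent of the population corresponds to."""
    return population * sharePercent // 100


# Hypothetical countries, not taken from the data sets.
largeCountry = peopleCovered(1_000_000_000, 5)   # 5% of one billion people
smallCountry = peopleCovered(1_000_000, 80)      # 80% of one million people

print(largeCountry)
print(smallCountry)
```

Under these made-up numbers, the large country covers far more people in absolute terms while sitting at a much lower percentage, which is the pattern behind the positive correlation between population and daily dose rate.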
Based on these conclusions, GDP and population are the strongest factors in getting enough vaccines to a country's population and achieving full vaccination. How can this be improved? Rather than the pure socialist or pure capitalist solutions mentioned in the introduction, I suggest the oxygen mask approach: get well-off countries vaccinated enough to reach herd immunity, then funnel vaccines to poorer countries. Hopefully, when the next global pandemic happens in a hundred years, this approach will be taken to curb variant mutation faster than we did in this pandemic, despite other socio-political choices of defiant individuals.
Following up on the last conclusion point, that daily dose numbers correlate with population, the next step should look into "doses administered", which is the total number of doses administered rather than the daily dose rate. I overlooked this feature because a larger number of total doses administered may not indicate that the population is fully vaccinated. This should be tested, not assumed.
- Global COVID19 Vaccination Tracker: https://www.kaggle.com/kamal007/global-covid19-vaccination-tracker?select=Global_COVID_Vaccination_Tracker.csv
- World Bank Population, total: https://data.worldbank.org/indicator/SP.POP.TOTL
- Vaccine Gap in the World: https://www.kaggle.com/code/sasakitetsuya/vaccine-gap-in-the-world/notebook