Data Analysis Minority and Non-Minority Business Creation
Contributed by Shelby Ahern. Shelby took our Data Science with R - Beginner Level - class with Vivian Zhang in June, 2014. This post was based on her final project submission.
The focus of this exploration was reviewing the level of new business creation in New York City by minority and non-minority populations from 2005-2013. In this post, I'll review the process of preparing the data and conducting two hypothesis tests on the primary measure: the number of incorporations per capita. Data sources and definitions may be found as notes at the bottom of this post. Also, the presentation deck for this project is available on SlideShare.
Preparing the Data
I created a CSV file for the Active Corporations data and XLSX files for the MBE Directory and population data. I then combined the three sources into data frames by borough and year, and added four columns of calculated measures to each data frame. The calculated measures are:
- Incorporations per Capita
- MBE Incorporations per Capita
- MBE Incorporations per Minority capita
- Non-MBE Incorporations per Non-Minority capita
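As a rough sketch of how these four measures could be computed, the snippet below uses a small made-up data frame. The input column names (`NwCorps`, `NwMBECorps`, `Pop`, `MinorityPop`) and all numbers are assumptions for illustration only; the output column names mirror the ones used later in this post. It also assumes that non-MBE incorporations are total incorporations minus MBE incorporations, and that the non-minority population is total population minus minority population.

```r
# Illustrative numbers only (not the project's data); input column names are assumed.
Manhattan_demo <- data.frame(
  year        = c(2005, 2006),
  NwCorps     = c(20000, 21000),     # new incorporations (hypothetical)
  NwMBECorps  = c(10, 12),           # new MBE-certified incorporations (hypothetical)
  Pop         = c(1600000, 1610000), # total population (hypothetical)
  MinorityPop = c(700000, 705000)    # minority population (hypothetical)
)

# Add the four calculated measures as new columns.
Manhattan_demo <- within(Manhattan_demo, {
  NwNonMBECorpsperNonMBECap <- (NwCorps - NwMBECorps) / (Pop - MinorityPop)
  NwMBECorpsperMBECap       <- NwMBECorps / MinorityPop
  NwMBECorpsperCap          <- NwMBECorps / Pop
  NwCorpsperCap             <- NwCorps / Pop
})

Manhattan_demo
```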
Here's an example of a data frame for Manhattan:
> Manhattan
Data frames for Brooklyn, Queens, Bronx, and Staten Island follow the same structure.
Immediately, we see that the numbers of Incorporations and MBE Incorporations per year differ by a factor of more than 1,000. This is also reflected in the per capita measures. Even when compared to the minority population alone, the number of MBE Incorporations per Minority capita is tens of thousands of times smaller than the number of Non-MBE Incorporations per Non-Minority capita.
Initial Analysis
Graphing the per capita measures shows a similar disparity in the other boroughs as well. In fact, the MBE per capita values don't even register on the vertical scale.
Here is the code for the graph:
library(ggplot2)
p3 <- ggplot() +
geom_point(data=MH_Corps, aes(year, NwCorpsperCap), colour="coral3") +
geom_point(data=MH_Corps, aes(year, NwMBECorpsperCap), colour="coral") +
geom_point(data=BK_Corps, aes(year, NwCorpsperCap), colour="aquamarine3") +
geom_point(data=BK_Corps, aes(year, NwMBECorpsperCap), colour="aquamarine") +
geom_point(data=QN_Corps, aes(year, NwCorpsperCap), colour="royalblue3") +
geom_point(data=QN_Corps, aes(year, NwMBECorpsperCap), colour="royalblue") +
geom_point(data=BX_Corps, aes(year, NwCorpsperCap), colour="orange3") +
geom_point(data=BX_Corps, aes(year, NwMBECorpsperCap), colour="orange") +
geom_point(data=SI_Corps, aes(year, NwCorpsperCap), colour="mediumpurple3") +
geom_point(data=SI_Corps, aes(year, NwMBECorpsperCap), colour="mediumpurple") +
xlab('Year') +
ylab('New Corporations per Capita') +
facet_wrap(~ County, ncol=1)
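One possible alternative to the layer-per-borough code above is to stack the borough data frames into a single long data frame and plot it with one layer on a log scale, so that both series stay visible. This is only a sketch: `MH_demo`, `BK_demo`, and their values are made-up stand-ins, and the plot is built only if ggplot2 is installed.

```r
# Hypothetical stand-ins for two of the borough data frames; values are made up.
MH_demo <- data.frame(County = "Manhattan", year = 2005:2007,
                      NwCorpsperCap = c(0.012, 0.013, 0.014),
                      NwMBECorpsperCap = c(6e-06, 7e-06, 5e-06))
BK_demo <- data.frame(County = "Brooklyn", year = 2005:2007,
                      NwCorpsperCap = c(0.006, 0.007, 0.008),
                      NwMBECorpsperCap = c(2e-06, 3e-06, 2e-06))

all_demo <- rbind(MH_demo, BK_demo)

# One row per County/year/measure, so a single layer can draw everything.
long_demo <- data.frame(
  County  = rep(all_demo$County, times = 2),
  year    = rep(all_demo$year, times = 2),
  measure = rep(c("NwCorpsperCap", "NwMBECorpsperCap"), each = nrow(all_demo)),
  value   = c(all_demo$NwCorpsperCap, all_demo$NwMBECorpsperCap)
)

# Build the plot only if ggplot2 is available; the log scale keeps both series visible.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_demo <- ggplot(long_demo, aes(year, value, colour = measure)) +
    geom_point() +
    scale_y_log10() +
    xlab("Year") +
    ylab("New Corporations per Capita (log scale)") +
    facet_wrap(~ County, ncol = 1)
}
```

The trade-off is that colour now encodes the measure rather than the borough; the facets already separate the boroughs.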
Other findings from the borough comparisons show:
- The per-capita incidence of incorporations increased across all boroughs from 2005 to 2013.
- Manhattan, Queens, and Brooklyn had the highest per-capita incorporations.
- Queens appears to have the steepest increase in corporation filings.
To investigate MBE incorporation in the boroughs, I created a second graph of just the MBE Incorporations/capita:
p4 <- ggplot() +
geom_point(data=MH_Corps, aes(year, NwMBECorpsperCap), colour="coral") +
geom_point(data=BK_Corps, aes(year, NwMBECorpsperCap), colour="aquamarine") +
geom_point(data=QN_Corps, aes(year, NwMBECorpsperCap), colour="royalblue") +
geom_point(data=BX_Corps, aes(year, NwMBECorpsperCap), colour="orange") +
geom_point(data=SI_Corps, aes(year, NwMBECorpsperCap), colour="mediumpurple") +
xlab('Year') +
ylab('New Corporations per Capita') +
facet_wrap(~ County, ncol=1)
- The per-capita incidence of MBE incorporations varied by borough (led by Manhattan), and trended downward after 2009.
The glaring difference in scale between the number of MBEs and non-MBEs led me to look more closely at the MBE data, specifically the process and purpose of MBE certification. MBE certification is marketed by the city and the state primarily as a program that facilitates opportunities to compete for government contracts.
That aim may not be relevant to the majority of businesses, which is likely a significant explanation of why the MBE numbers are comparatively low. Further, the certification process is comprehensive - requiring personal financial statements, for example, which may be a barrier to other owners pursuing the certification. In short, MBE certifications do not represent the universe of businesses in NYC that are launched and operated by minorities.
However, without any other public data sets (as of now) that include the race of the primary owner/controlling partners, we'll proceed with the analysis of new business starts by owner race with the data we've been using (let's consider it an academic exercise).
Comparative Analysis
Q: Does the frequency of incorporations vary significantly between minority and non-minority populations?
The approach:
- Select value to test:
  - MBE Corps per Minority capita
  - Non-MBE Corps per Non-Minority capita
- Utilize data from all years and boroughs (5 boroughs x 9 years x 2 categories = 90 obs.)
- Evaluate which test(s) to conduct:
  - Parametric vs. Non-parametric
  - Means test vs. Other
- Conduct test and analyze results.
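To make the observation count in the approach concrete, here is a minimal sketch of a long-format layout with one row per borough, year, and category. The `values` and `Set` column names follow the ones used in the test code later in this post; the category labels `"MBE"` and `"NonMBE"`, the `County` and `year` columns, and the `obs_grid` name are assumptions for illustration.

```r
boroughs <- c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island")
years    <- 2005:2013
sets     <- c("MBE", "NonMBE")   # hypothetical labels for the two categories

# One row per borough/year/category; `values` would hold the per-capita measure.
obs_grid <- expand.grid(County = boroughs, year = years, Set = sets,
                        stringsAsFactors = FALSE)
obs_grid$values <- NA_real_      # placeholder, to be filled from the borough data frames

nrow(obs_grid)        # 5 boroughs * 9 years * 2 categories
table(obs_grid$Set)   # number of observations per category
```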
Having determined which dimension will serve as the common measure (corps per capita; Minority vs. Non-Minority), let's take a look at the data distribution for both series. The purpose is to determine whether the data is normally distributed, which is an assumption of some standard means tests, like a Z- or T-test.
Here's the code and output for the three ways I reviewed the MBE per Minority capita data for normality:
1. Histogram. Occurrences of MBE Incorporations per Minority capita for All Boroughs, 2005-2013.
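# Refer to the column by name inside aes() rather than with `$`
MBE_Hist <- ggplot(NwMBECorpsperMBECap_l, aes(x = values))
# Very small bin width because the per-capita values are tiny
MBE_Hist + geom_histogram(binwidth = .000001)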
We can see that the data is concentrated on the left side of the histogram, below where the mean would fall on a standard bell curve; this is a right-skewed (positively skewed) distribution.
2. QQPlot Occurrences of MBE Incorporations per Minority capita vs. Normal Distribution (All Boroughs, 2005-2013)
# qqline() draws on base graphics, so it is paired here with qqnorm() instead of lattice's qqmath()
qqnorm(NwMBECorpsperMBECap_l$values)
qqline(NwMBECorpsperMBECap_l$values, col = 2)
The plot shows that the points do not fall along a straight line, which is the criterion for evaluating the normality of the data.
3. Shapiro-Wilk Test
Lastly, I conducted a Shapiro-Wilk test for normality, and the p-value returned was less than the significance level of 0.1. Thus the null hypothesis (that the data is normally distributed) is rejected.
shapiro.test(NwMBECorpsperMBECap_l$values)
# Shapiro-Wilk normality test
# data: NwMBECorpsperMBECap_l$values
# W = 0.89, p-value = 0.0004636
# p.value < 0.1
#Null hypothesis (data is normally distributed) is rejected
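The same decision rule can be applied programmatically by reading the p-value from the test object. The sketch below runs it on simulated right-skewed data, not the project's data; `skewed_demo`, the log-normal parameters, and the seed are assumptions chosen only to produce a small, skewed sample of 45 values.

```r
set.seed(1)
# Simulated stand-in: 45 right-skewed values (log-normal), not the project's data.
skewed_demo <- rlnorm(45, meanlog = -12, sdlog = 1)

sw <- shapiro.test(skewed_demo)
alpha <- 0.1                               # the threshold used in this post
reject_normality <- sw$p.value < alpha     # TRUE means normality is rejected

sw$statistic       # the W statistic
sw$p.value
reject_normality
```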
Having concluded that the data for MBE Incorporations per Minority capita is not normally distributed (nor was the data for Non-MBE Incorporations per Non-Minority capita), I sought population comparison tests that do not assume normally distributed data. Mood's Median Test and the Mann-Whitney-Wilcoxon Test both satisfied that condition and were applicable.
Hypothesis Testing
Mood's Median Test
A nonparametric test of the null hypothesis that the medians of the populations from which two or more samples are drawn are identical. (Wikipedia)
H0: Medians of MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita are equivalent.
H1: Medians of MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita are NOT equivalent.
median.test <- function(x, y) {
z <- c(x, y)
g <- rep(1:2, c(length(x), length(y)))
m <- median(z)
fisher.test(z < m, g)$p.value
}
median.test(NwMBECorpsperMBECap_l$values, NwNonMBECorpsperNonMBECap_l$values)
# [1] 1.9e-26  (p-value)
# p.value < 0.05
# Null hypothesis (medians are equal) is rejected
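To see what the function above is doing, it can help to run the same logic on a tiny, fully separated toy example. The sketch below is not the project's data: `median_test_demo` is a hypothetical variant that also returns the 2x2 table passed to Fisher's exact test, and the inputs `1:10` and `101:110` are made up.

```r
median_test_demo <- function(x, y) {
  z <- c(x, y)
  g <- rep(1:2, c(length(x), length(y)))   # group label for each value
  below <- z < median(z)                   # is each value below the pooled median?
  list(table   = table(below, g),          # 2x2 table: below/above median by group
       p.value = fisher.test(below, g)$p.value)
}

res <- median_test_demo(1:10, 101:110)
res$table     # all of group 1 falls below the pooled median, none of group 2
res$p.value   # equals 2 / choose(20, 10) for this fully separated toy example
```

With complete separation, only the two most extreme tables are as unlikely as the observed one, which is why the two-sided p-value reduces to 2 / choose(20, 10).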
Results
The null hypothesis was rejected; thus, we can conclude that the median of MBE Incorporations per Minority capita across all boroughs and years is significantly different from the median of Non-MBE Incorporations per Non-Minority capita in the same geographies during the same period.
Mann-Whitney-Wilcoxon Test
A nonparametric test of the null hypothesis that two populations are the same against an alternative hypothesis, especially that a particular population tends to have larger values than the other. (Wikipedia)
H0: MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita could be representative of the same set of data.
H1: MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita could NOT be representative of the same set of data.
wilcox.test(values ~ Set, data=AllPerCapObs)
#Wilcoxon rank sum test
#data: values by Set
#W = 0, p-value < 2.2e-16
#alternative hypothesis: true location shift is not equal to 0
The null hypothesis was rejected, which we interpret as evidence that the per-capita incorporations per year for the two groups (by minority status) were not drawn from the same population.
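The reported W = 0 has a direct reading: W counts the pairs of observations in which a value from the first group exceeds a value from the second group, so W = 0 means no value in the first group exceeded any value in the second. A minimal sketch on made-up data (the `toy` data frame and its `"MBE"`/`"NonMBE"` labels are assumptions for illustration):

```r
# Toy data: every "MBE" value lies below every "NonMBE" value.
toy <- data.frame(
  values = c(1, 2, 3, 10, 11, 12, 13),
  Set    = factor(c("MBE", "MBE", "MBE", "NonMBE", "NonMBE", "NonMBE", "NonMBE"))
)

wt <- wilcox.test(values ~ Set, data = toy)
wt$statistic   # W counts (MBE, NonMBE) pairs in which the MBE value is larger

# The same count, computed directly from all 3 x 4 pairs.
mbe    <- toy$values[toy$Set == "MBE"]
nonmbe <- toy$values[toy$Set == "NonMBE"]
sum(outer(mbe, nonmbe, ">"))
```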
Conclusion
There is a significant difference between the levels of new business creation by race, but within this exploration the difference is primarily attributable to how minority entrepreneurship was defined (by MBE certification). Without other ways of tracking minority business development, MBE certification serves as an inaccurate measure. Separately, unless the objective of the MBE certification program is broadened to include stronger incentives for prospective participants, the City may not see increased rates of participation, nor will MBE certification be an effective tool for encouraging minority business creation.
Data Sources and Definitions
The data include:
- Active Corporations: Beginning 1800. Source: New York State Department of State, Division of Corporations, State Records & UCC. Accessed 7/1/2014.
- NYC Online Directory of Certified Businesses: Minority-Owned Business Enterprises (MBE). Source: New York City Department of Small Business Services. Accessed 7/9/2014.
- 2010 Census and American Community Survey 1-year Population Estimates (2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013). Source: U.S. Census Bureau and SocialExplorer.com. Accessed 7/15/2014.
Definitions:
- "New Businesses" and Incorporations, used interchangeably herein, are the following Entity Types (from Active Corporations dataset):
  - Domestic Business Corporation
  - Domestic Cooperative Corporation
  - Domestic Professional Corporation
- Borough = County (used for all data sets)
  - Manhattan = New York County
  - Brooklyn = Kings County
  - Queens = Queens County
  - Bronx = Bronx County
  - Staten Island = Richmond County
- Minority Business Enterprise: "Under Article 15-A of the Executive Law, an MBE is a business enterprise in which at least fifty-one percent (51%) is owned, operated and controlled by citizens or permanent resident aliens who are meeting the ethnic definitions: Black, Hispanic, Asian-Pacific, Asian-Indian Subcontinent, Native American." http://www.esd.ny.gov/MWBE/Qualifications.html. Accessed 7/16/2014.
- Minority population, and/or Minority capita: the term minority is based on the count of the populations of the following races (U.S. Census Bureau): Black or African American Alone, American Indian and Alaska Native Alone, Asian Alone, Native Hawaiian and Other Pacific Islander Alone, Some Other Race Alone, and "Two or More races".
  - Note: in 2013, the U.S. Census Bureau stopped the practice of bucketing "Some Other Race Alone," which is a variation in the data between 2005-2012 and 2013.
- Non-minority population, and/or Non-Minority capita: the remainder of the population after excluding the minority population, typically "White Alone."