Data Analysis: Minority and Non-Minority Business Creation

Posted on Jul 27, 2014



Contributed by Shelby Ahern. Shelby took our Data Science with R - Beginner Level - class with Vivian Zhang in June, 2014. This post was based on her final project submission.

The focus of this exploration was reviewing the level of new business creation in New York City by minority and non-minority populations from 2005 to 2013. In this post, I’ll review the process of preparing the data and conducting two hypothesis tests on the primary measure: the number of incorporations per capita. Data sources and definitions may be found as notes at the bottom of this post. Also, the presentation deck for this project is available on SlideShare.

Preparing the Data

I saved the Active Corporations data as a CSV file and the MBE Directory and population data as XLSX files. I then created data frames combining the data from the Active Corporations, MBE Directory, and population files, by borough and year, and added four columns of calculated measures to each data frame. The calculated measures are:

  1. Incorporations per Capita
  2. MBE Incorporations per Capita
  3. MBE Incorporations per Minority capita
  4. Non-MBE Incorporations per Non-Minority capita
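
As a rough sketch of how the four measures could be computed for one borough: the raw column names (NwCorps, NwMBECorps, Pop, MinorityPop) and all of the numbers below are made up for illustration. The sketch also assumes that Non-MBE incorporations are total incorporations minus MBE incorporations, and that the non-minority population is the total population minus the minority population.

```r
# Illustrative values only; this is not the project's actual data.
MH_example <- data.frame(
  year        = c(2005, 2006),
  NwCorps     = c(20000, 22000),      # new incorporations (hypothetical column)
  NwMBECorps  = c(10, 12),            # new MBE incorporations (hypothetical column)
  Pop         = c(1600000, 1610000),  # total population (hypothetical column)
  MinorityPop = c(700000, 705000)     # minority population (hypothetical column)
)

# Add the four calculated measures as new columns.
MH_example <- within(MH_example, {
  NwCorpsperCap             <- NwCorps / Pop
  NwMBECorpsperCap          <- NwMBECorps / Pop
  NwMBECorpsperMBECap       <- NwMBECorps / MinorityPop
  # Assumption: Non-MBE = total - MBE; non-minority = total - minority.
  NwNonMBECorpsperNonMBECap <- (NwCorps - NwMBECorps) / (Pop - MinorityPop)
})

print(MH_example)
```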

Here’s an example of a data frame for Manhattan:

> Manhattan

[Image: Manhattan Corporation data frame]

[Images: data frames for Brooklyn, Queens, Bronx, and Staten Island]

Immediately, we see that the number of Incorporations and MBE Incorporations per year differ by a factor of 1,000 or more. This is also reflected in the per capita measures. Even when measured against the minority population alone, MBE Incorporations per Minority capita are tens of thousands of times lower than Non-MBE Incorporations per Non-Minority capita.


Initial Analysis

Graphing the per capita measures shows a similar disparity in the other boroughs. In fact, the MBE per capita measures don’t even register on the vertical scale.

[Figure: Corporations per Capita, 5 Boroughs]

Here is the code for the graph:

library(ggplot2)

# One layer per borough and measure: the darker shade is all new
# corporations per capita, the lighter shade is new MBE corporations per capita.
p3 <- ggplot() +
  geom_point(data=MH_Corps, aes(year, NwCorpsperCap), colour="coral3") +
  geom_point(data=MH_Corps, aes(year, NwMBECorpsperCap), colour="coral") +
  geom_point(data=BK_Corps, aes(year, NwCorpsperCap), colour="aquamarine3") +
  geom_point(data=BK_Corps, aes(year, NwMBECorpsperCap), colour="aquamarine") +
  geom_point(data=QN_Corps, aes(year, NwCorpsperCap), colour="royalblue3") +
  geom_point(data=QN_Corps, aes(year, NwMBECorpsperCap), colour="royalblue") +
  geom_point(data=BX_Corps, aes(year, NwCorpsperCap), colour="orange3") +
  geom_point(data=BX_Corps, aes(year, NwMBECorpsperCap), colour="orange") +
  geom_point(data=SI_Corps, aes(year, NwCorpsperCap), colour="mediumpurple3") +
  geom_point(data=SI_Corps, aes(year, NwMBECorpsperCap), colour="mediumpurple") +
  xlab('Year') +
  ylab('New Corporations per Capita') +
  facet_wrap(~ County, ncol=1)  # one panel per borough
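
A possible alternative, offered as a sketch rather than as the code behind the figure: stack the borough data frames into one long data frame so that a single geom_point() layer can draw every borough and measure. The toy values and the Measure column name are made up; year, County, NwCorpsperCap, and NwMBECorpsperCap are the names used in the plotting code. Note that this version colours points by measure rather than by borough.

```r
# Toy stand-ins for two of the borough data frames; values are made up.
MH_Corps <- data.frame(year = 2005:2007, County = "New York",
                       NwCorpsperCap    = c(0.012, 0.013, 0.014),
                       NwMBECorpsperCap = c(6e-06, 7e-06, 8e-06))
BK_Corps <- data.frame(year = 2005:2007, County = "Kings",
                       NwCorpsperCap    = c(0.008, 0.009, 0.010),
                       NwMBECorpsperCap = c(2e-06, 3e-06, 3e-06))

# Stack the boroughs, then stack the two measures into one value column.
all_corps <- rbind(MH_Corps, BK_Corps)
long <- data.frame(
  year   = rep(all_corps$year, 2),
  County = rep(all_corps$County, 2),
  stack(all_corps[c("NwCorpsperCap", "NwMBECorpsperCap")])
)
names(long)[names(long) == "ind"] <- "Measure"  # hypothetical column name

# Draw the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_long <- ggplot2::ggplot(long, ggplot2::aes(year, values, colour = Measure)) +
    ggplot2::geom_point() +
    ggplot2::facet_wrap(~ County, ncol = 1) +
    ggplot2::xlab("Year") +
    ggplot2::ylab("New Corporations per Capita")
}
```

With the data in this shape, adding a borough means adding one more data frame to rbind() instead of two more plot layers.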

Other findings from the borough comparisons show:

  • The per-capita incidence of incorporations increased across all boroughs from 2005 to 2013.
  • Manhattan, Queens, and Brooklyn had the highest per-capita incorporations.
  • Queens appears to have the steepest increase in corporation filings.


To investigate MBE incorporation in the boroughs, I created a second graph of just the MBE Incorporations/capita:

[Figure: MBE Corporations per Cap, 5 Boroughs]

# Same layout as the previous graph, restricted to the MBE measure.
p4 <- ggplot() +
  geom_point(data=MH_Corps, aes(year, NwMBECorpsperCap), colour="coral") +
  geom_point(data=BK_Corps, aes(year, NwMBECorpsperCap), colour="aquamarine") +
  geom_point(data=QN_Corps, aes(year, NwMBECorpsperCap), colour="royalblue") +
  geom_point(data=BX_Corps, aes(year, NwMBECorpsperCap), colour="orange") +
  geom_point(data=SI_Corps, aes(year, NwMBECorpsperCap), colour="mediumpurple") +
  xlab('Year') +
  ylab('New Corporations per Capita') +
  facet_wrap(~ County, ncol=1)  # one panel per borough

  • The per-capita incidence of MBE incorporations varied by borough (led by Manhattan), and trended downward after 2009.


The glaring difference in scale between the number of MBEs and non-MBEs led me to take a closer look at the MBE data, specifically the process and purpose of MBE certification. MBE certification is marketed by the city and the state primarily as a program that helps businesses compete for government contracts.

That aim may not be relevant to the majority of businesses, which likely explains much of why the MBE numbers are comparatively low. Further, the certification process is comprehensive (it requires personal financial statements, for example), which may be a barrier to owners who might otherwise pursue the certification. In short, MBE certifications do not represent the universe of businesses in NYC that are launched and operated by minorities.

However, without any other public data sets (as of now) that include the race of the primary owner/controlling partners, we’ll proceed with the analysis of new business starts by owner race with the data we’ve been using (let’s consider it an academic exercise).


Comparative Analysis

Q: Does the Frequency of Incorporations vary significantly between Minority and Non-Minority Populations?

The approach:

      1. Select Value to test:
        1. MBE Corps per Minority capita
        2. Non-MBE Corps per Non-Minority capita
        3. Utilize data from all years and boroughs (5 boroughs x 9 years x 2 categories = 90 obs.)
      2. Evaluate which test(s) to conduct.
        1. Parametric vs. Non-parametric
        2. Means test vs. Other
      3. Conduct test and analyze results.
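
To make the shape of the test data concrete, here is a minimal sketch of how the 90 observations could be assembled into one long data frame. The values are random placeholders, and the use of stack() is an assumption; only the names AllPerCapObs, values, and Set come from the test code later in this post.

```r
set.seed(1)  # placeholder values only, not the project's data

boroughs <- c("New York", "Kings", "Queens", "Bronx", "Richmond")
years    <- 2005:2013
grid     <- expand.grid(County = boroughs, year = years)  # 5 x 9 = 45 rows

# One column per category, one row per borough-year.
wide <- data.frame(
  NwMBECorpsperMBECap       = runif(nrow(grid), 1e-06, 1e-05),
  NwNonMBECorpsperNonMBECap = runif(nrow(grid), 1e-02, 5e-02)
)

# stack() puts both columns into a single 'values' column and records
# the source column in 'ind', which is renamed to 'Set' here.
stacked <- stack(wide)
AllPerCapObs <- data.frame(
  County = rep(grid$County, 2),
  year   = rep(grid$year, 2),
  values = stacked$values,
  Set    = stacked$ind
)

table(AllPerCapObs$Set)  # 45 observations per category, 90 in total
```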

Having determined which dimension will serve as the common measure (corp per capita; Minority vs. Non-Minority), let’s take a look at the data distribution for both series. The purpose for doing so is to determine whether the data is normally distributed, which is an assumption required by some standard means tests, like a Z- or T-test.

Here are the code and output for the three ways I reviewed the MBE per Minority capita data for normality:

1. Histogram. Occurrences of MBE Incorporations per Minority capita for All Boroughs, 2005-2013.
[Figure: Histogram, MBE per Minority Capita]

MBE_Hist <- ggplot(NwMBECorpsperMBECap_l, aes(x = values))
MBE_Hist + geom_histogram(binwidth = 0.000001)

We can see that the data is skewed: most of the values pile up on the left, below where the mean would fall on a standard bell curve.

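2. Q-Q Plot. Occurrences of MBE Incorporations per Minority capita vs. Normal Distribution (All Boroughs, 2005-2013).

[Figure: Q-Q plot, MBE per Minority Capita]

qqnorm(NwMBECorpsperMBECap_l$values)           # sample quantiles vs. normal quantiles
qqline(NwMBECorpsperMBECap_l$values, col = 2)  # reference line added to the plot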

The plot shows that the points do not fall along a straight line, which is the criterion for judging whether the data is normally distributed.

3. Shapiro-Wilk Test
Lastly, I conducted a Shapiro-Wilk test for normality. The p-value returned was less than the significance level of 0.1, so the null hypothesis (that the data is normally distributed) is rejected.

shapiro.test(NwMBECorpsperMBECap_l$values)
# Shapiro-Wilk normality test
# data: NwMBECorpsperMBECap_l$values
# W = 0.89, p-value = 0.0004636
# p.value < 0.1
# Null hypothesis (data is normally distributed) is rejected
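
To see how the Shapiro-Wilk p-value separates bell-shaped data from skewed data, here is a small sketch on two made-up samples of 45 values each (the number of borough-year observations per category). The samples are built from theoretical quantiles so the result does not depend on a random seed; they are stand-ins, not the project's data.

```r
# Evenly spaced probabilities for 45 "ideal" observations.
probs <- ppoints(45)

bell_shaped  <- qnorm(probs)  # follows a normal curve closely
right_skewed <- qexp(probs)   # most values small, with a long right tail

p_bell   <- shapiro.test(bell_shaped)$p.value
p_skewed <- shapiro.test(right_skewed)$p.value

alpha <- 0.1  # same significance level as used above

# TRUE means the null hypothesis of normality is rejected.
reject <- c(bell_shaped = p_bell < alpha, right_skewed = p_skewed < alpha)
print(reject)
```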

Having concluded that the data for MBE Incorporations per Minority capita is not normally distributed (nor was the data for Non-MBE/Non-Minority capita), I looked for tests that compare two populations without assuming normally distributed data. Mood’s Median Test and the Mann-Whitney-Wilcoxon Test both satisfied that condition and were applicable.


Hypothesis Testing

Mood’s Median Test

A nonparametric test of the null hypothesis that the medians of the populations from which two or more samples are drawn are identical. (Wikipedia)

H0: Medians of MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita are equivalent.
H1: Medians of MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita are NOT equivalent.

median.test <- function(x, y) {
  z <- c(x, y)                             # pool both samples
  g <- rep(1:2, c(length(x), length(y)))   # group label for each pooled value
  m <- median(z)                           # median of the pooled data
  fisher.test(z < m, g)$p.value            # Fisher's exact test: below-median vs. group
}
median.test(NwMBECorpsperMBECap_l$values, NwNonMBECorpsperNonMBECap_l$values)
# [1] 1.9e-26 # p-value
# p.value < 0.05
# Null hypothesis (medians are equal) is rejected


The null hypothesis was rejected. Thus, we can conclude that the median of MBE Incorporations per Minority capita across all boroughs and years is significantly different from the median of Non-MBE Incorporations per Non-Minority capita in the same geographies during the same period of time.
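
A toy example may help show what the median test does under the hood: it pools the two samples, marks each value as below the pooled median or not, and tests whether that split is independent of group membership. The two samples here are made up and deliberately do not overlap.

```r
x <- 1:10      # toy sample A (made up)
y <- 101:110   # toy sample B (made up), entirely above sample A

z <- c(x, y)                             # pooled values
g <- rep(1:2, c(length(x), length(y)))   # group labels: 1 = A, 2 = B
m <- median(z)                           # pooled median

# 2 x 2 contingency table: is the value below the pooled median, by group.
below <- z < m
tab <- table(below, g)
print(tab)

# Fisher's exact test on that table gives the median-test p-value.
p <- fisher.test(tab)$p.value
print(p)
```

When every value in one group falls below the pooled median and every value in the other falls above it, the table has two empty cells, which is the most extreme split possible and yields a very small p-value.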

Mann-Whitney-Wilcoxon Test

A nonparametric test of the null hypothesis that two populations are the same against an alternative hypothesis, especially that a particular population tends to have larger values than the other. (Wikipedia)

H0: MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita could be representative of the same set of data.
H1: MBE Incorporations per Minority capita and Non-MBE Incorporations per Non-Minority capita could NOT be representative of the same set of data.

wilcox.test(values ~ Set, data=AllPerCapObs)
#Wilcoxon rank sum test
#data: values by Set
#W = 0, p-value < 2.2e-16
#alternative hypothesis: true location shift is not equal to 0

The null hypothesis was rejected, which we interpret as evidence that the per-capita incorporations per year for the two groups (by minority status) were not drawn from the same population.
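
The W = 0 in the output deserves a note. In R, the reported W counts the pairs in which a value from the first group exceeds a value from the second group, so W = 0 means that no value in the first group is larger than any value in the second. The sketch below checks this on two small made-up samples; it does not use the project's data.

```r
x <- c(1.2, 2.5, 3.1, 4.8)       # toy "low" group (made up)
y <- c(10.4, 11.9, 12.3, 15.0)   # toy "high" group (made up)

res <- wilcox.test(x, y)

# Count the (x, y) pairs where the x value is larger than the y value.
W_by_hand <- sum(outer(x, y, ">"))

print(unname(res$statistic))  # W as reported by wilcox.test
print(W_by_hand)              # same count computed directly
```

Read this way, a W of 0 across the 45 x 45 pairs would indicate complete separation between the two series, which is consistent with the scale gap seen in the graphs.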


There is a significant difference between the levels of new business creation by race, but, within this exploration, that difference is primarily attributable to how minority entrepreneurship was defined (by MBE certification). Without other ways of tracking minority business development, MBE certification serves as an inaccurate measure. Separately, unless the objective of the MBE certification program is broadened to include other, stronger incentives for prospective participants, the City may not see increased rates of participation, nor will MBE certification serve as an effective tool for encouraging minority business creation.


Data Sources and Definitions

The data include:

  • NYC Online Directory of Certified Businesses: Minority-Owned Business Enterprises (MBE).  Source: New York City Department of Small Business Services. Accessed 7/9/2014.
  • 2010 Census and American Community Survey 1-year Population Estimates (2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013).  Source: U.S. Census Bureau. Accessed 7/15/2014.


  • “New Businesses” and Incorporations, used interchangeably herein, are the following Entity Types (from Active Corporations dataset):
    • Domestic Business Corporation
    • Domestic Cooperative Corporation
    • Domestic Professional Corporation
  • Borough = County (used for all data sets)
    • Manhattan: New York County
    • Brooklyn: Kings County
    • Queens: Queens County
    • Bronx: Bronx County
    • Staten Island: Richmond County
  • Minority Business Enterprise: “Under Article 15-A of the Executive Law, an MBE is a business enterprise in which at least fifty-one percent (51%) is owned, operated and controlled by citizens or permanent resident aliens who are meeting the ethnic definitions: Black, Hispanic, Asian-Pacific, Asian-Indian Subcontinent, Native American.” Accessed 7/16/2014.

  • Minority population, and/or Minority capita: the term minority is based on the count of the populations of the following races (U.S. Census Bureau): Black or African American Alone, American Indian and Alaska Native Alone, Asian Alone, Native Hawaiian and Other Pacific Islander Alone, Some Other Race Alone, and “Two or More races”.
    • Note: in 2013, the U.S. Census Bureau stopped the practice of bucketing “Some Other Race Alone,” which is a variation in the data between 2005-2012 and 2013.
    • Non-minority population (and non-minority capita) is the complement of the minority population, typically “White Alone.”
