Data Study on Healthcare Innovation in the Startup World

Posted on Aug 22, 2016

Anyone with one foot inside the medical world can tell you that there's a lot of buzz about the future of medicine in big data and genomics.  Our improving ability to derive insight from medical data on larger and larger scales is auspicious for the precision of clinical medicine as well as the effectiveness of health policy.  And of course, unlocking the treasure trove of information stored away in the human genome has the potential to reshape the delivery of healthcare in a myriad of ways--some predictable, some not.

However, from my brief foray into the medical education system (first year of medical school), it became very clear to me that this type of innovation was certainly not being bred in typical medical school curricula; clinicians will not be leading the charge on advanced analytics in medicine.

Having witnessed a number of friends entering the big data gold rush of other industries, predominantly in the startup sphere, I set out to understand what sort of involvement healthcare startups are having in medical innovation--with a particularly curious eye on the big data analytics model. What questions are healthcare startups seeking to answer? What problems are they attempting to solve--and how? Where is this all happening?

And then, could I make generalizations about which types of companies seem to do well?  What type of healthcare innovation lends itself well to the startup model?



Using Selenium and BeautifulSoup in Python, I scraped data from the healthcare marketplace page of AngelList. The page provides links to detail pages of 400 healthcare startups. From these pages, I scraped company name, location, market tags, the "blurb," and number of employees.


Each company's detail page lists a "blurb" under its name, as well as its location, several market tags, and an employee-count range.

Additionally, I had my scraper navigate to each company's "Activity" page and take everything recorded under the "People" tab.  The "People" tab contains dated activity entries of predominantly investors, but also incubators, advisors and more.  
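As a rough illustration of the field extraction step, here is a minimal sketch. It is not the scraper used for this post: it swaps in the standard library's html.parser for BeautifulSoup so it runs without dependencies, it skips the Selenium navigation entirely, and the sample markup, class names, and company ("Acme Health") are all invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names and this company are invented
# for illustration and do not reflect the real page structure.
SAMPLE_HTML = """
<div class="company">
  <h1 class="name">Acme Health</h1>
  <p class="blurb">Mobile platform for managing chronic conditions</p>
  <span class="location">San Francisco</span>
  <a class="tag">Mobile Health</a>
  <a class="tag">Elder Care</a>
  <span class="employees">11-50</span>
</div>
"""


class CompanyParser(HTMLParser):
    """Collects one company's fields from a detail page."""

    SINGLE_FIELDS = {"name", "blurb", "location", "employees"}

    def __init__(self):
        super().__init__()
        self.record = {"tags": []}
        self._field = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.SINGLE_FIELDS or cls == "tag":
            self._field = cls

    def handle_data(self, data):
        text = data.strip()
        if self._field is None or not text:
            return
        if self._field == "tag":
            self.record["tags"].append(text)  # a company can have many tags
        else:
            self.record[self._field] = text

    def handle_endtag(self, tag):
        self._field = None


def parse_company(html):
    parser = CompanyParser()
    parser.feed(html)
    return parser.record


record = parse_company(SAMPLE_HTML)
print(record)
```

In a real run, the HTML string would come from the browser session for each of the 400 detail pages rather than from a hard-coded sample.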

With the data in workable form, I was able to begin to tackle the questions I set out to answer. First:

Where are healthcare startups?  

Plotting the frequency of each location on a horizontal bar chart gives the following:

[Figure: horizontal bar chart of startup counts by location]

Clearly, no other city comes close to San Francisco's count of just under 150 companies (of the 400 scraped). However, the lopsidedness is even more extreme than it first appears: looking closer at the list, one notices how many of the other top locations are near San Francisco. In fact, Palo Alto, Mountain View, Menlo Park, Silicon Valley, San Mateo, Redwood City, San Carlos, and Oakland are all within a short drive of San Francisco! Binning them all together into the "Bay Area," we see the sheer dominance of that region in the healthcare startup world--almost 200 of 400 entries!
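The binning step can be sketched roughly as follows. The set of cities comes from the list above; the function names and the sample locations are assumptions made up for the example, not the scraped data.

```python
from collections import Counter

# Cities grouped with San Francisco as the "Bay Area" (taken from the text).
BAY_AREA = {
    "San Francisco", "Palo Alto", "Mountain View", "Menlo Park",
    "Silicon Valley", "San Mateo", "Redwood City", "San Carlos", "Oakland",
}


def bin_location(city):
    """Map any Bay Area city to a single 'Bay Area' label."""
    return "Bay Area" if city in BAY_AREA else city


def location_counts(locations):
    """Count companies per location after binning."""
    return Counter(bin_location(city) for city in locations)


# Made-up sample, for illustration only.
sample = ["San Francisco", "Palo Alto", "New York", "Oakland", "Boston", "New York"]
counts = location_counts(sample)

for place, n in counts.most_common():
    print(f"{place}: {n}")
```

The output of `most_common()` is already sorted by count, so it can be passed straight to a horizontal bar chart.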


Next Question:

What are healthcare startups?

The data taken from AngelList provided two sources of information regarding this question: market tags and the "blurb."


After removing the top three tags, which were uninformative (health, healthcare system, and medicine), the most frequent market tags among healthcare startups listed on AngelList are as follows:

[Figure: frequency of market tags]

Looking at the top tags, two categories jump out: mobile/software and personal health/fitness.  It appears that a large portion of young healthcare companies are seeking to improve healthcare through the development of mobile apps.  And interestingly, personal fitness seems to be a problem that many entrepreneurs feel equipped to take on.

Perhaps the most unexpected finding on this list was the "Elder Care" market tag, especially given how high it ranks. Because of our aging population, geriatrics is a field of medicine in incredibly high demand--yet it consistently attracts little interest from graduating medical students. It's very interesting to see its strong presence in the medical startup sphere.

"Big data" ranked fairly high--and even higher if you combine it with "big data analytics."

NB: Many (if not most) companies list multiple market tags.  There is undoubtedly much overlap among these categories.
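A minimal sketch of the tag count, assuming each company's tags are available as a list of strings. The three excluded tags are the ones named above; the function name, the lowercase normalization, and the sample data are assumptions for the example.

```python
from collections import Counter

# The three uninformative tags named in the text, lowercased.
UNINFORMATIVE = {"health", "healthcare system", "medicine"}


def tag_frequencies(companies_tags):
    """Count how many companies list each tag, skipping uninformative ones."""
    counts = Counter()
    for tags in companies_tags:
        # A set ensures a tag is counted at most once per company.
        normalized = {t.strip().lower() for t in tags}
        counts.update(normalized - UNINFORMATIVE)
    return counts


# Made-up sample, for illustration only.
sample_tags = [
    ["Health", "Mobile", "Fitness"],
    ["Medicine", "Big Data", "Mobile"],
    ["Elder Care", "Health"],
]
tag_counts = tag_frequencies(sample_tags)
print(tag_counts.most_common())
```

Because a company contributes to every tag it lists, the counts sum to more than the number of companies, which is the overlap the note above describes.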

Considering that "Big Data" and "Big Data Analytics" appeared with high frequency, genomics--medicine's premier big data analytics problem--was notably absent.  Aware of the possibility that genomics simply wasn't a market tag, I analyzed the language of each company's "blurb" to see if a focus on genomics would appear.


The blurbs of all the companies were combined into one string in order to view each word by its frequency. Only words longer than five letters were included, and some of the top words were removed (health, medical, etc.).
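The word count described above could look something like this sketch. The length cutoff follows the text; the exact list of removed words is not given in full, so the `REMOVED` set here is an assumption, as are the function name and the sample blurbs.

```python
import re
from collections import Counter

# The text names "health" and "medical"; the full removed list is unknown,
# so this set is an assumption.
REMOVED = {"health", "medical"}


def blurb_word_counts(blurbs, min_len=6, removed=REMOVED):
    """Join all blurbs into one string and count the longer words."""
    text = " ".join(blurbs).lower()
    words = re.findall(r"[a-z]+", text)
    # "Longer than five letters" means at least six.
    return Counter(w for w in words if len(w) >= min_len and w not in removed)


# Made-up sample, for illustration only.
sample_blurbs = [
    "Mobile fitness tracking for personal health",
    "Big data analytics for cancer care",
    "Personal fitness coaching on mobile",
]
word_counts = blurb_word_counts(sample_blurbs)
print(word_counts.most_common(5))
```

Note that a pure length cutoff drops short but meaningful words such as "big" and "data", which is why they cannot appear in the result.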


The "blurb" analysis looked very similar to the market tag analysis. Mobile apps and personal health/fitness are featured at the top once again, and there's no indication of genome analysis. "Big" and "data" were too short to make the cut, but "analytics" is featured. Interestingly, "cancer" made the list as well. A substantial number of healthcare companies focus on cancer specifically.


What kinds of healthcare startups do well?

To evaluate how well a company had done, a growth rate metric for each startup (number of employees/age) was calculated from the data.  The following plot of market tag by the growth metric was helpful in understanding which kinds of healthcare startups have done well.  Since many companies listed multiple market tags, each company's top market tag (its tag with the highest general frequency) was used.


The growth rate is in employees added per month.
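A rough sketch of the growth metric and the top-tag rule follows. The text does not say how the employee range was turned into a single number or how company age was measured, so using the midpoint of the range and a known start date are both assumptions, as are all the function names and sample values.

```python
from collections import Counter
from datetime import date


def employees_midpoint(size_range):
    """Turn a range like '11-50' into one number (assumption: the midpoint)."""
    low, high = (int(part) for part in size_range.split("-"))
    return (low + high) / 2


def age_in_months(started, as_of):
    """Whole months between two dates, ignoring the day of month."""
    return (as_of.year - started.year) * 12 + (as_of.month - started.month)


def growth_rate(size_range, started, as_of):
    """Employees per month of company age."""
    months = max(age_in_months(started, as_of), 1)  # avoid dividing by zero
    return employees_midpoint(size_range) / months


def top_tag(tags, overall_counts):
    """Pick the company's tag with the highest frequency across all companies."""
    return max(tags, key=lambda t: overall_counts[t])


# Made-up values, for illustration only.
rate = growth_rate("11-50", date(2014, 8, 1), date(2016, 8, 1))
overall = Counter({"mobile": 5, "elder care": 2})
chosen = top_tag(["elder care", "mobile"], overall)
print(f"{rate:.2f} employees per month, top tag: {chosen}")
```

Averaging `growth_rate` over the companies that share a top tag would give one value per tag to plot.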

Notably, the order of market tags by growth rate is not the same as the order by frequency. Mobile, fitness, and personal health companies--all top of the list by count--are overtaken by the previously quiet "health and insurance" tag. From this we can see that the most common kinds of healthcare startups are not those with the highest growth rates.


It's all happening in the Bay Area.

Healthcare startups are going mobile.

Healthcare entrepreneurs think it's time to focus on your fitness.

Elder care--finally!

Though healthcare startups are focusing on big data, companies focusing on genome analysis--medicine's Mount Everest of data analytics problems--are rare.

With more time...

Look for patterns in investment activity among particular types of healthcare companies--what kind of startup attracts the most financial support?

Conduct a time series analysis on early investment activity.  Can we predict growth and valuation based on early investment and the company profile variables discussed so far?






About Author

William Bartlett

Will Bartlett is a History of Science and Medicine Major from Yale University who recently took a leave of absence from medical school to explore data science. As an undergraduate, he studied the role of data in medicine...