Seed Accelerators and Social Media: What Made VCs Fund These Startups?

Posted on Aug 22, 2016

Contributed by Shu Liu. Shu is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from July 5th to September 23rd, 2016. This post is based on his third class project, Web Scraping (due in the 6th week of the program).

You may also explore this project via the R and Python code and data on GitHub.


The success of a startup depends on many factors, such as the founders, the funding, and the state of the industry it enters. Startups never stop looking for ways to improve their odds of success. The same is true of venture capitalists (VCs), who work hard to select the best investment targets to maximize their profit.
Navigating the early part of its existence well is crucial to a startup’s success. A good seed accelerator can provide both mentorship and funding. Mentorship helps founders clearly understand what they want and what they should focus on. This may be why many successful startups come out of the same seed accelerators.


After the seed accelerator stage, VCs play an important role in helping startups grow stronger. However, it’s difficult for VCs to know whether a young startup will succeed or fail. The Discounted Cash Flow method is commonly applied to public companies, but it can’t readily be applied to a startup, because most startups lack clear financial records and formal financial reports. Relative Valuation is therefore a common choice for evaluating fast-growing startups. A critical part of Relative Valuation for online companies is finding a comparable company and assessing, based on the number of users, whether the two are similar. We can also explore how startups behave on social media to indirectly assess their number of users.

This project focuses on web scraping data from two sources. The first contains data about seed accelerators, while the second, Twitter, serves as the source of social media data on startups.
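As a rough illustration of the scraping step, the sketch below parses an HTML table into a list of records using only Python's standard library. The markup in `SAMPLE_HTML`, the column names, and the accelerator names are all hypothetical stand-ins; the real page layout is not described in this post, so this is not the project's actual scraper.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup; the real site's structure and values will differ.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>established</th><th>startups_funded</th></tr>
  <tr><td>Accelerator A</td><td>2008</td><td>12</td></tr>
  <tr><td>Accelerator B</td><td>2011</td><td>7</td></tr>
</table>
"""

records = parse_table(SAMPLE_HTML)
```

In practice the HTML would come from an HTTP request rather than a string literal, and a dedicated library such as Beautiful Soup or Scrapy would likely be more convenient for anything beyond a single table.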

Data Source:

Seed accelerator data: extracted using web scraping

Twitter data: collected using the Tweepy API

Variables Selection:

Startups (funding > 1 million dollars):

name, website, number of followers, number of friends, number of statuses, amount of funding, rounds of funding

Corresponding Seed Accelerators:

name, address, established year, website, amount exited, amount funded, number of startups exited, number of startups funded
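The Twitter variables listed for startups could be collected along these lines. This is a minimal sketch, not the project's code: it assumes an already authenticated `tweepy.API` object is passed in as `api`, and it relies on the `followers_count`, `friends_count`, `statuses_count`, and `favourites_count` attributes that Tweepy exposes on user objects. The helper names and the `fake_user` numbers are made up for illustration.

```python
from types import SimpleNamespace

# Variable names used in this post, mapped to attributes of a Tweepy user object.
PROFILE_FIELDS = {
    "followers_num": "followers_count",
    "friends_num": "friends_count",
    "statuses_num": "statuses_count",
    "favourites_num": "favourites_count",
}


def profile_counts(user):
    """Return the four count variables from a Tweepy-style user object."""
    return {name: getattr(user, attr) for name, attr in PROFILE_FIELDS.items()}


def fetch_profile_counts(api, screen_name):
    """Look up one account through an authenticated tweepy.API instance."""
    user = api.get_user(screen_name=screen_name)
    return profile_counts(user)


# Offline stand-in for a user object so the sketch runs without credentials.
# The numbers are invented, not real account data.
fake_user = SimpleNamespace(
    followers_count=1200,
    friends_count=300,
    statuses_count=4500,
    favourites_count=80,
)
counts = profile_counts(fake_user)
```

Calling `fetch_profile_counts(api, name)` for each startup's Twitter handle would then yield one row of social media variables per startup, subject to the API's rate limits.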

Initial Analysis:

[Figures: Rplot02 (left) and Rplot03 (right)]

I first extracted the ten seed accelerators with the most funding to date. Y Combinator dominates the field in this respect, partly because it is the oldest: according to Wikipedia, it was the first seed accelerator. Y Combinator’s creation was followed by TechStars (2006) and Seedcamp (2007). The bar chart on the left illustrates the importance of age for seed accelerators.

The top ten startups among those with more than 1 million dollars in funding are ordered by their total amount of funding. Most of them are very popular today. All of them are online companies, which is consistent with the importance of the number of users in the valuation of startups.


The data scraped from Twitter contains some interesting insights. friends_num, statuses_num, and favourites_num are more correlated with each other than with followers_num, but these three variables are less correlated with funding (total amount of funding) than followers_num is. This suggests that followers_num is more directly related to how much funding a startup can get. That is reasonable, since the number of followers on Twitter depends on how popular the startup is, and people who follow a company’s Twitter account are more likely to be users of its product. The other three variables, however, are not direct indicators of how popular the startup’s business is: a startup can post as many statuses as it likes on Twitter even if it has only a few followers.
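The comparison described here boils down to a matrix of pairwise Pearson correlations. The sketch below shows one way to compute it in plain Python. The `toy` values are invented purely to make the example runnable and are not the project's data, and the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented toy values (funding in millions of dollars), for illustration only.
toy = {
    "followers_num": [100, 250, 400, 900, 1500],
    "statuses_num": [900, 300, 1200, 500, 700],
    "funding": [1.0, 2.5, 3.0, 8.0, 12.0],
}
matrix = correlation_matrix(toy)

for name in ("followers_num", "statuses_num"):
    print(name, "vs funding:", round(matrix[name]["funding"], 3))
```

With real data, the row for `funding` is the one to read: the variable with the largest absolute correlation in that row is the strongest single linear predictor, although correlation alone does not establish that followers cause funding.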

Further Steps: Multiple Linear Regression

The analyses above serve as a guide for applying multiple linear regression to this problem. The initial form of the model is outlined below.

Dependent Variables:

Amount of funding/rounds of funding

Independent Variables:

Seed factor (year, state, amount funded (number of startups funded))

Users factor (number of followers)
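Written out, one possible form of this model is shown below. It is only a sketch: it takes amount of funding as the response (rounds of funding could be substituted), the coefficient symbols are introduced here for illustration, and since state is categorical, the single state term stands in for a set of dummy variables.

```latex
\begin{align*}
\text{funding}_i ={}& \beta_0
  + \beta_1\,\text{year}_i
  + \beta_2\,\text{state}_i
  + \beta_3\,\text{amount\_funded}_i \\
  &+ \beta_4\,\text{followers}_i
  + \varepsilon_i
\end{align*}
```

Here $i$ indexes startups, the first three predictors describe the startup's seed accelerator, and $\varepsilon_i$ is the error term.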

However, regression assumptions, such as the absence of strong multicollinearity among the independent variables, need to be checked before building an effective regression model. It is also possible that the model will explain only a small share of the variance if the variables do not have enough predictive power.
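One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this in plain Python with a small least-squares solver. The predictor names and values are invented for illustration, and the post does not say which diagnostic the author intends to use.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of an ordinary least squares fit of y on columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(predictors):
    """VIF_j = 1 / (1 - R_j^2), where predictor j is regressed on the others."""
    vifs = {}
    for name, values in predictors.items():
        others = [v for other, v in predictors.items() if other != name]
        vifs[name] = 1.0 / (1.0 - r_squared(values, others))
    return vifs


# Invented toy predictors; followers and friends are deliberately near-collinear.
predictors = {
    "followers_num": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "friends_num": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "accelerator_age": [11.0, 7.0, 9.0, 4.0, 10.0, 6.0],
}
vifs = variance_inflation_factors(predictors)
```

A VIF near 1 means a predictor is essentially unrelated to the others, while values above roughly 5 to 10 are often treated as a warning sign. A predictor flagged this way could be dropped or combined with its correlated partner before fitting the funding model.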

About Author


Shu is currently a master’s student studying financial engineering at University of Southern California, and he has a multidisciplinary background in math, economics, and financial engineering. Being able to look at problems from both marketing and technical perspectives,...