Seed Accelerators and Social Media: What Made VCs Fund These Startups?

Posted on Aug 22, 2016

Contributed by Shu Liu. Shu is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from July 5th to September 23rd, 2016. This post is based on his third class project, Web Scraping (due in the 6th week of the program).

You may also explore this project via the R and Python code and data on GitHub.


The success of a startup depends on many factors, such as the founders, funding, and the environment of the industry in which it is established. Startups never stop searching for ways to improve their probability of success. The same is true of venture capitalists (VCs), who work hard to select the best targets to invest in so as to maximize their profit.
Navigating the early part of its existence well is crucial to a startup’s success. A good seed accelerator can provide mentorship and funding support for startups. Mentorship helps founders clearly understand what they want and what they should focus on. This may be why many successful startups come out of the same seed accelerators.


After a seed accelerator, VCs play an important role in helping startups become stronger. However, it’s difficult for VCs to know whether a young startup will succeed or fail. It’s common to apply the Discounted Cash Flow method to value a public company, but this method can’t readily be applied to a startup, since most startups don’t have clear financial records or formal financial reports. Relative Valuation is therefore an option for evaluating fast-growing startups. A critical part of Relative Valuation for online companies is finding a comparable company and assessing whether the two are similar based on the number of users. We can also explore how startups behave on social media to indirectly assess their number of users.
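As a rough illustration of the Relative Valuation idea described above, the sketch below applies a comparable company's value-per-user multiple to a target startup. The function names and all figures are hypothetical and chosen only for illustration; they are not taken from the scraped data, and a per-user multiple is just one simple way to relate two companies by user count.

```python
def value_per_user(comparable_valuation, comparable_users):
    """Implied value of a single user at the comparable company."""
    if comparable_users <= 0:
        raise ValueError("comparable_users must be positive")
    return comparable_valuation / comparable_users


def relative_valuation(target_users, comparable_valuation, comparable_users):
    """Estimate the target's value by applying the comparable's per-user multiple."""
    return target_users * value_per_user(comparable_valuation, comparable_users)


# Hypothetical figures for illustration only (not from the scraped data).
comparable_valuation = 50_000_000  # dollars
comparable_users = 2_000_000
target_users = 300_000

estimate = relative_valuation(target_users, comparable_valuation, comparable_users)
print(f"Estimated value: ${estimate:,.0f}")  # prints "Estimated value: $7,500,000"
```

In practice the user count of a private startup is rarely published, which is why a proxy such as the number of Twitter followers becomes interesting.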

This project focuses on web scraping data from two sources. The first contains data about seed accelerators, while the second, Twitter, serves as the source for social media data on startups.


Data Source:

Seed accelerator data: extracted using web scraping

Twitter data: retrieved through the Twitter API using Tweepy

Variables Selection:

Startup (funding > 1 million): 

name, website, number of followers, number of friends, number of statuses, amount of funding, rounds of funding

Corresponding Seed Accelerators:

name, address, established year, website, amount exited, amount funded, number of startups exited, number of startups funded
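The Twitter variables listed above for each startup could be collected roughly as follows. This is a minimal sketch, not the author's actual script: it assumes an already authenticated Tweepy client is passed in as `api`, and the helper names `extract_profile` and `fetch_profile`, the `StubAPI` class, and the `example_startup` account are hypothetical. The output keys reuse the variable names that appear later in this post (`followers_num`, `friends_num`, `statuses_num`, `favourites_num`).

```python
from types import SimpleNamespace


def extract_profile(user):
    """Map a Twitter user object to the startup variables used in this post."""
    return {
        "name": user.name,
        "followers_num": user.followers_count,
        "friends_num": user.friends_count,
        "statuses_num": user.statuses_count,
        "favourites_num": user.favourites_count,
    }


def fetch_profile(api, screen_name):
    """Look up one account through a tweepy.API-like client and extract its counts."""
    user = api.get_user(screen_name=screen_name)
    return extract_profile(user)


# Stand-in client so the sketch runs without credentials; the counts are made up.
class StubAPI:
    def get_user(self, screen_name):
        return SimpleNamespace(
            name=screen_name,
            followers_count=1200,
            friends_count=300,
            statuses_count=4500,
            favourites_count=80,
        )


profile = fetch_profile(StubAPI(), "example_startup")
print(profile)
```

With real credentials, the stub would be replaced by a `tweepy.API` instance built from an OAuth handler holding your own keys; the exact handler class depends on the Tweepy version installed.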


Initial Analysis:

Figures: top ten seed accelerators by total funding (left) and top ten startups by total funding (right).

I first extracted the top ten seed accelerators by total past funding. Y Combinator dominates the field in this respect, partly because it is older: according to Wikipedia, it was the first seed accelerator (2005). Its creation was followed by TechStars (2006) and Seedcamp (2007). The bar chart on the left illustrates the importance of age when it comes to seed accelerators.

The top ten startups, among companies with more than 1 million dollars in funding, are ordered by their total amount of funding. Most of them are very popular today. All of them are online companies, which suggests the importance of the number of users in the valuation of startups.


The data scraped from Twitter contains some interesting insights. Friends_num, statuses_num, and favourites_num are more correlated with each other than with followers_num, but these three variables are less correlated with funding (total amount of funding) than followers_num is. This suggests that followers_num is the variable most directly related to how much funding a startup can get. This is plausible, since the number of followers on Twitter depends on how popular the startup is, and people who follow the company’s Twitter account are more likely to be users of its business. The other three variables, however, are not direct indicators of how popular the startup’s business is, because a startup can post as many statuses as it likes on Twitter even if it has only a few followers.
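The comparison above rests on pairwise correlations between the Twitter counts and funding. The sketch below shows one way such a correlation matrix could be computed with the Python standard library alone. The `pearson` and `correlation_matrix` helpers are hypothetical, and the five rows of data are invented for illustration, so the printed numbers say nothing about the real dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations for a dict of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented values for five hypothetical startups (not the scraped data).
data = {
    "followers_num": [1200, 5400, 300, 9800, 2500],
    "friends_num": [300, 800, 150, 400, 900],
    "statuses_num": [4500, 9000, 2000, 5200, 11000],
    "funding": [2.0, 8.5, 1.2, 15.0, 3.1],  # millions of dollars
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

If the data already lives in a pandas DataFrame, `DataFrame.corr()` produces the same kind of matrix in one call.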


Further Steps: Multiple Linear Regression

The analyses above serve as a guide for applying multiple linear regression to this problem. The initial form of the model is given below.

Dependent variables:

Amount of funding/rounds of funding

Independent Variables:

Seed factor (year, state, amount funded (number of startups funded)) &

Users factor (number of followers)

However, assumptions such as the absence of strong multicollinearity among the independent variables need to be checked before building an effective regression model. It is also possible that the model will explain only a small share of the variance if the variables do not have enough predictive power.
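One common way to carry out such a check is the variance inflation factor (VIF): each independent variable is regressed on the others, and VIF = 1 / (1 - R²). The sketch below, which assumes NumPy is available, fits an ordinary least-squares model and computes VIFs on synthetic data. The helper names, the synthetic columns, and the choice of VIF as the diagnostic are assumptions made for illustration, not the author's stated method.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


def variance_inflation_factors(X):
    """VIF for each column of X: regress it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, r2 = fit_ols(others, X[:, j])
        # A perfectly collinear column has R^2 = 1, so its VIF is unbounded.
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


# Synthetic predictors (not the scraped data): the third column is built to be
# nearly collinear with the second, mimicking "amount funded" versus
# "number of startups funded".
rng = np.random.default_rng(0)
followers = rng.normal(size=200)
amount_funded = rng.normal(size=200)
startups_funded = amount_funded + 0.1 * rng.normal(size=200)
X = np.column_stack([followers, amount_funded, startups_funded])

names = ["followers", "amount_funded", "startups_funded"]
for name, vif in zip(names, variance_inflation_factors(X)):
    print(f"{name}: VIF = {vif:.1f}")
```

As a rule of thumb, VIF values well above roughly 5 to 10 are often read as a sign that two predictors carry largely the same information, in which case keeping only one of them is a reasonable choice.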

About Author



Shu is currently a master’s student studying financial engineering at University of Southern California, and he has a multidisciplinary background in math, economics, and financial engineering. Being able to look at problems from both marketing and technical perspectives,...