M3 - How to fund your startup

Posted on Dec 12, 2015

Contributed by John Montroy and Avi Yashchin. They took NYC Data Science Academy 12 week full time Data Science Bootcamp program between Sept 23 to Dec 18, 2015. The post was based on their fourth class project(due at 8th week of the program).


Try it here! | See the code

The startup world is not for the faint of heart. Behind every success story lies a dozen failures, and the odds can still turn for even the most robust startup. Countless resources have been poured into aiding would-be founders - websites like CrunchBase or AngelList, new podcasts like StartUp, not to mention countless how-to guides and videos.

All well and good, but what remains largely unaddressed is the question of funding. The general process is known - line up your investors, hone your pitch, and be prepared for rejection. But that first crucial step: how do you find would-be investors? There are resources available, but it often still feels like a stab in the dark. Perhaps we can do better.

M3 - Investor / Startup Matching

This is M3 - a tool meant to aid in exactly this process. What does it do?

We create text-based "fingerprints" of your startup, run that against our database of venture capital firms, and return your best bets for funding. These are firms that have funded startups like yours in the past, and will fund similar ones in the future. In short - we're a shortcut for finding the right venture capital firm. How exactly this fingerprint is generated and used will be discussed in more detail below.

On the technical side, here's what you're looking at:

  • Python-based app, Flask framework
  • Designed with heavily tweaked Bootstrap templates
    • (So we definitely had to get deep into the HTML/CSS/Javascript!)
  • MySQL back-end on Google Cloud Services
  • Site hosted on an Amazon EC2 server
  • 5000+ websites scraped with Scrapy
  • Asynchronous Scrapy jobs running through Celery

Lots of infrastructure. This was a challenging app for us to get up and running in just under two weeks - at times we were much closer to being full-stack engineers rather than data scientists!

Let's take a second to talk algorithms, though - what are we actually using to pair startups with VCs?

Semantic Fingerprinting

Note: this section is primarily theoretical background - if you’re just interested in our project, skip to the next section.

An aside: Shakespeare scholars often discuss issues of authorship and authenticity in Shakespeare's oeuvre. Example: the scene involving the Greek goddess Hecate in MacBeth is often attributed to fellow English playwright Thomas Middleton. There are historical arguments to be made, about its role as song and dance interlude, but there are also stylistic and syntactic arguments. It just doesn't read quite like the rest of Shakespeare.

If you'll permit a slight stretch, these scholars are doing what we're doing. Semantic fingerprinting is a technique that maps bodies of text into comparable, analyzable "fingerprints" - a fingerprint being simply a large, sparse vector, otherwise known as a Sparsely Distributed Representation (SDR). Once you have an SDR for two bodies of text, whether they be Shakespeare or company descriptions, comparisons become simultaneously insightful and simple.

The underlying theory of the semantic fingerprinting technique used here comes courtesy of Cortical.io, an AI company led by researcher and author Jeff Hawkins. Cortical.io is seeking to create a new kind of artificial intelligence by taking cues from the most powerful intelligence engine we know of - namely, the human brain. In brief, Hawkins and researchers at Cortical.io believe that the human neocortex (the largest and most evolutionarily-recent area of the brain) is underpinned by one universal learning algorithm. The stands in opposition to a more compartmentalized understanding of intelligence in the brain (this area corresponds to language, here music, here math), but modern research supports the idea. A 2009 article from Scientific American discusses technology allowing a blind man to see with his tongue - a strong example of the brain’s ability to adapt by employing a common learning algorithm across all senses and experiences.

A full explanation of Hawkin’s theory is beyond the scope of this blog post - interested readers are enthusiastically directed towards his excellent 2004 book On Intelligence, which has remained relevant despite a full decade of progress in AI. Two shorter but more complex reads are available on the statistical properties of Sparse Distributed Representations and a white paper on Semantic Folding Theory.

We can outline the basic pieces fairly quickly, however:

  1. Define your source corpus / dictionary - a random sampling of Wikipedia articles would serve as a General English dictionary, whereas an assortment of medical papers would be a General Medical dictionary.
  2. Create an appropriate base vector representation out of your source corpus. This involves generic tokenization, lemmatization, and importance weighting of your texts, followed by an unsupervised algorithm for the extraction of keywords and phrases. These keywords are weighted and linked to one another (think PageRank), and then used in the construction of a 128 x 128 grid, where each pixel represents a “context”. One context could roughly be “things involved with opera” - expected keywords then might be “opera, concert hall, singing, classical, costumes”, etc. Semantically similar contexts are placed near each other on the grid.
  3. An input corpus is entered for vectorization. Similar keyword extraction takes place, and those keywords / contexts are mapped to the original 128 x 128 grid, where an ON-bit represents a recognized context in your input corpus.

As outlined in the mathematical properties paper above, two vectorized corpora are related to each other in a simple way. The core of it is overlapping bits - take the union of two SDRs, and the more bits they share, the more closely related they are. It’s not quite so simple, of course - some common distance metrics employed are:

  1. Euclidean Distance
  2. Cosine Similarity
  3. Hamming Distance
  4. Jaccard Index

Last but certainly not least - how are we generating these fingerprints of a given input text? That's where Cortical.io's API comes in. Play around with this fingerprinting demo here - cortical.io provides API keys upon request, and the API provides everything you need to begin fingerprinting and comparing. The default base corpus is a generic English language corpus - this was the corpus used for all fingerprinting in our project.

Stage 1: Adventures in Scraping

So our procedure is now clear: gather text data on startups and venture capital firm, generate fingerprints using cortical.io's API, and given a startup, select the best match. This best match is the venture capital firm whose text indicates the closest semantic similarity to the text of your own product - the a priori assumption here being that a venture capital firm who describes themselves similarly to your startup (OR: a venture capital firm whose portfolio includes similar startups to yours) is a venture capital firm that is more likely to fund your startup.

This procedure was broken into several stages:

  1. Scrape websites for data; clean and analyze data
  2. Produce front-end for querying data

Stage 1 - how do we get this data? We scrape. Lots and lots and lot of websites. 5000+ websites, to be more specific. We do this with lots of infrastructure, and a powerful enough scraper / crawler to take care of most of the dirty work. Simultaneously, we begin to analyze the text that comes in. Jumping straight into it, our workflow for stage 1:

Screen Shot 2015-12-12 at 3.20.23 PM


Without lingering too long, the gist is: Scrapy goes out and scrapes lots of pages on lots of websites. We used Goose for text extraction as well as some generic XPath / CSS selector manipulations. A few cleaning functions were defined, and pages were consolidated into one body of text per website. A Scrapy pipeline was built to dump into a MySQL database, hosted on Google Cloud Services.

Meanwhile, as text was being dumped, a simultaneous process was running to analyze the extracted text per site. We used a few different APIs for analyzing - cortical.io primarily, but also TextBlob and OpenCalais, just so we could play a bit.

This scraping/analyzing cycle went on for several days. We launched 10+ spiders simultaneously on an AWS instance, all scraping and dumping different websites. We ran these processes in the background, handled connectivity issues, checked for bad data, and so on. It was a lot of spinning cogs, but it came together, and we had our data set.

What sorts of issues did we have? How good was our data? Can we do better? Here are some issues / thoughts we had while in stage 1.

  • Goose for text extraction - how robust?
  • Min / Max page scraping depth - 2, 3, 4 pages deep? How much is enough?
  • ASCII vs UTF-8 - how to handle? How is a website encoded? The MySQL database? Python data types like unicode vs str?
  • Python relative paths - how best to navigate a project?
  • Boilerplate text - how much of the scraped text has real meaning? How much of it is generic legal / VC text?

This last bullet is worth lingering on. Not all text on a website is meaningful - in fact, most of it probably isn't. What we're actually interested in how a VC / startup describes itself, its projects, its mission. So how do we get there? Well, one way could be with an algorithm called tf-idf (text-frequency inverse-document frequency), which allows you to properly weight the importance of word based on a ratio of its frequency in one document vs a set of documents. You could also do TextRank, an algorithm based on PageRank that uses a graph approach to discover important words and phrases, unsupervised. Or perhaps we generate a fingerprint using cortical.io of just boilerplate finance / legal, and subtract out that fingerprint from our VCs and startups.

All of these are worth pondering and investigating, and so will be in future iterations of M3. They're especially worthwhile, because right now, the keywords for some websites are nothing more than "legal, agreement, finance, disclosure" and so on. Not particularly meaningful. Stay tuned!

Stage 2: Adventures in Flask

Now we need our queryable front-end. For this, we chose Flask, a fairly light-weight Python framework meant for full-stack development. Flask provides interactivity between Python and the web dev side using a language called Jinja, so the challenge was established. Figure out Flask, figure out Jinja, figure out enough web dev to get by, and put it all together in a speedy front-end.

For the visual component, we used Bootstrap - specifically, the simple but elegant Cover template, freely available. Bootstrap comes with a little setup required, especially when working with Flask. There are a handful of default components, including the Bootstrap minimum CSS / Javascript files, plus a few other things here and there. Mostly though, Bootstrap was delightful. The entire power of the internet is at your fingertips, with a little elbow grease.

Flask and Jinja was another story. Flask is compact compared to Django, but a full-stack framework is complex regardless. Many tutorials were read, many long hours debugging silly web dev problems, and so on. A few main points:

  • Originally, distance metrics were calculated via API calls to cortical.io - mainly Cosine and Euclidean to start. This was slowing down our app tremendously, so we reverse-engineered the algorithms used for these metrics (NOT just plain Euclidean distance - tell me how you calculate that distance for vectors of unequal length?) and now run them locally. Much faster.
  • Streaming images are a bit tricky and involve two routings instead of one for a page.
  • Our site scrapes an input website using Scrapy in stand-alone mode - unfortunately Scrapy has to run in its own thread, and so enter Celery. Celery is an asynchronous job module, allowing functions to be run asynchronously in a queue with a simple decorator attached to the module. A Celery instance does need to be initialized, however, along with a "broker", which is simply a back-end Celery interacts with while running jobs. We used Redis as our broker.
  • MySQL is a nightmare. Flask has its own MySQL connector module (two, actually), and it's terribly unwieldy. We also are guilty of abusing proper database usage with this product - SQL queries are sprinkled throughout the code, instead of being called as Stored Procedures. The database calls are properly parameterized to protect from SQL injection, but the attempt we made to move to Stored Procedures was an exercise in frustration (if you're curious - one cursor per proc call, then you gotta dump the cursor). We're also guilty of re-initializing too many connections and cursors. Main point - next time, a different back-end. Also, our data isn't huge, but it's getting there, and so MySQL will be out-scaled pretty soon.
  • A lot of work was done in the AWS environment - simple shell scripts, resource monitoring, logging, the works. To kick off the app, we needed a Redis server instance, a Celery worker, and the Flask app itself (which needed connectivity to the MySQL DB).
  • There's a lot of mess in the code. We're working on it.

To be honest, there's simply too much code to go into here. You're welcome to explore the code on GitHub - we're continually making efforts to increase commentary, modularization, etc. The total code output is well over 1000 lines of Python, and that doesn't begin to touch the HTML, CSS, and Javascript that was customized for the app.

How it works in the end is simple - enter your website OR a text description of your product, and we'll find the top 3 best matches based on a variety of metrics and return them to you That's it. We'll be working on a brief video tour of our product in the near future.

M3 Results.

M3 Results.

So what?

Does it work? Yes (mostly). The algorithm works very well, and the data is all there. We're working on getting better representative texts as discussed above - subtract out boilerplate, scrape supplementary sites, and so on. One success story is the company kaplancleantech.com - enter in that site, and your best bets for VC funding all come from VCs with a history of clean tech entrepreneurship.

Example M3 results, with keywords

Example M3 results for kaplancleantech.com, with keywords

This technology is incredibly powerful if used well. We've used it, but we too have a lot of work to do before the algorithm is truly robust. But once it is, this matching algorithm could be used in any industry, on any two bodies of text, for any time there are buyers and sellers. Site content creation, medical community, fiction - the sky's the limit.

This project was an exercise is putting together a very complex workflow from scratch - we started with nothing more than an API and an idea, and now we have Python, Flask, Bootstrap, Scrapy, MySQL, and more all woven together into a presentable product. There is certainly much more work to be done, but we're just getting started.

About Authors

John Montroy

John Montroy is a graduate of Middlebury College with a B.A. in Physics. After a summer of particle physics at CERN with the Harvard ATLAS team, he began his career as a data analyst in the auto industry....
View all posts by John Montroy >

Avi Yashchin

Avi Yashchin is a serial entrepreneur in the technology, finance and education businesses. After a summer of working on the Sloan Digital Sky Survey at NASA, he began his career as a high frequency algorithmic trader in the...
View all posts by Avi Yashchin >

Related Articles

Leave a Comment

Google May 27, 2020
Google Always a massive fan of linking to bloggers that I like but do not get lots of link enjoy from.
Google October 12, 2019
Google Very few internet websites that take place to become comprehensive beneath, from our point of view are undoubtedly very well really worth checking out.
Brigette April 15, 2016
Excellent items from you, man. I hzve take ihto accout your stuff prtior to and you're simply ttoo magnificent. I actually liie what you've bought right here, really like what you're stting and the best wayy wherein you say it. You makje it entertaining and you continue to take care of to keep it wise. I can't wait to read mhch more frtom you. That is actuslly a wonderful site.
Active Release March 7, 2016
If you are suffering from lower again ache, then you might get aid in the event you go to a professional chiropractor. There is also out and about whether would be the correct individual that will help you together with your particular worries. Patients with chronic lower back and neck pain have long sought treatment from chiropractors for good reason.
Marie-Pierre Garnier February 12, 2016
Hi John and Avi! it's a very interesting use case you have developed here, using the Semantic Fingerprinting approach. As you explain in the post, this technology has been developed by Cortical.io, but this company is not led by Jeff Hawkins. It is managed by the two founders, Francisco Webber and Daniel Schreiber. There is a partnership between Cortical.io and Numenta, Jeff Hawkins's company. The Semantic Folding theory underlying the Semantic Fingerprinting technology has been developed by Francisco Webber, who built up on Jeff Hawkins' theories about the brain to create a computer model that understands the meaning of text. Best, Marie-Pierre
webdesign January 10, 2016
Hi to all, the contents present at this web site are in fact amazing for people experience, well, keep up the nice work fellows.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI