
Funding Goals: Which Kickstarter Projects Reached It?

Michael Griffin
Posted on Jan 15, 2020


Summary of Kickstarter Project Success Article

  • I looked at the latest data from the crowdfunding site Kickstarter to analyze trends in project success, using a mixture of quantitative analysis and natural language processing.
  • Around 40% of the projects meet their funding targets - this varies significantly across time and project type. I built models to predict which projects succeed, focusing on technology projects which typically have high funding goals and low success rates. 
  • I explored a variety of models on the project and text data, using tree-based methods, deep learning and AutoML to improve the f1 score from 57% to 75%.
  • Tools used: Python, Google Colab, sklearn, shap, pdpbox, autosklearn, FastAI libraries, Bokeh, GitHub here

Introduction to Kickstarter

Kickstarter is one of the leading global platforms for crowdfunding creative projects, with diverse campaigns ranging from $1000 craft experiences to $5 million tech product releases. Creators pitch new projects to "backers" who can contribute towards a project for various levels of rewards, typically including the product or service itself. The funding model differs from many other crowdfunding sites as it does not involve any transfer of equity and the pledges are only collected if the entire funding goal is reached.

The website is a great source of information on interesting ideas and public responses, and can be readily scraped to explore several interesting questions:

  1. What are the patterns of project success across time and categories?
  2. How likely is a given project to succeed? How would you design a project to maximize the chance of success?
  3. How much is likely to be raised? What goal should be set?

NLP                                   

For now, I'll focus on questions (1) and (2) - these fit neatly with NLP approaches explored in my ML research project and the risk modelling principles from my previous job.

Unsurprisingly, there are plenty of data science blogs, papers and repos that examine Kickstarter data already. Note that I am looking at the pure prediction problem, which makes good results harder to obtain. My aim is to model whether a new project is likely to succeed using a subset of the information available at launch, so I try to be strict about maintaining temporal structure.

In the future, I would like to add more information about each project like the risks and challenges or specific pledge details. Examining the "trajectory" of a project and comparing to data from Indiegogo (a similar tech-focused crowdfunding site) would also be interesting.

Applications

Predictions could be useful for a few different audiences:

  • Project creators – how should project pitches be designed to maximize the chance of success?
  • Backers – could potential backers filter projects to focus their attention on new projects with a reasonable chance of success? Backers may also have a "portfolio" of project pledges - they can offer no support, small support or significant support across a wide range of live projects, and the best action could depend on the likelihood of success for each project.
  • Kickstarter – could recommendation engines help promote projects which are likely to be near the success threshold?

Dataset

I use data scraped from the Kickstarter website in mid-December – the dataset is fairly large, capturing around 200,000 individual projects which span back to the inception of the organization in 2009. This contains a mix of quantitative data on funding goals and timings, categorical information on the project types and raw text like the project description. Quite a bit of pre-processing is needed to build a clean dataset with useful features:

  • Language – since I'll be using NLP techniques, I focus on the projects described in English, cutting out around 6,000 using the langdetect library
  • Project organizer - projects can be created and run by individuals or organizations and it seems likely this could interact with success rates. However, the dataset only contains the name of the creator, so I used a combination of the NameDataset library and entity detection to flag names which are likely to be organizations. I also add a metric to capture the number of historic projects run by each organizer
  • Funding window – I added variables to capture the time gap between project creation, launch and funding deadlines, as well as the funding rate ($/day requirement)
  • Sentiment – I used the textblob library to label the subjectivity and polarity of the project pitch – perhaps enthusiasm and devotion matter to project success?
  • Duplicates – I found there were approximately 2,700 duplicate blurbs in the dataset, often capturing where a project spans categories or is relaunched. To remove this distortion, I cut all duplicate rows
  • Currencies - I convert goal amounts to USD equivalents for a consistent size metric. I also take natural logs of these variables to help capture the broad range in project scale - this does not directly affect tree-based logic but does aid visualization
  • Encoding – standard dummification approaches for categorical variables
  • Dropping variables - all information which would only be available after the project launch is removed, e.g. staff favorite flags.
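As a rough illustration of the funding window, duplicate and log-scale steps above, here is a minimal sketch using only the Python standard library. The field names (`blurb`, `goal_usd`, `created`, `launched`, `deadline`) and the sample records are hypothetical and do not reflect the scraped schema. It reads "cut all duplicate rows" as dropping every copy of a repeated blurb, which is an assumption; keeping the first copy would be an equally plausible reading.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical records; field names are illustrative, not the scraped schema.
projects = [
    {"blurb": "A smart mug", "goal_usd": 50000.0,
     "created": date(2019, 1, 1), "launched": date(2019, 1, 15),
     "deadline": date(2019, 2, 14)},
    {"blurb": "A smart mug", "goal_usd": 50000.0,  # repeated blurb (e.g. a relaunch)
     "created": date(2019, 3, 1), "launched": date(2019, 3, 5),
     "deadline": date(2019, 4, 4)},
    {"blurb": "Open-source drone kit", "goal_usd": 1000.0,
     "created": date(2019, 2, 1), "launched": date(2019, 2, 11),
     "deadline": date(2019, 3, 3)},
]

def drop_duplicate_blurbs(rows):
    """Drop every row whose blurb appears more than once (assumed reading)."""
    counts = Counter(r["blurb"] for r in rows)
    return [r for r in rows if counts[r["blurb"]] == 1]

def add_features(row):
    """Add funding-window, funding-rate and log-goal features to one project."""
    out = dict(row)
    out["days_create_to_launch"] = (row["launched"] - row["created"]).days
    out["days_launch_to_deadline"] = (row["deadline"] - row["launched"]).days
    # Funding rate: dollars that must be raised per day of the campaign.
    out["funding_rate_usd_per_day"] = row["goal_usd"] / max(out["days_launch_to_deadline"], 1)
    # Natural log compresses the wide range of project scales.
    out["log_goal_usd"] = math.log(row["goal_usd"])
    return out

features = [add_features(r) for r in drop_duplicate_blurbs(projects)]
```

Every feature here is computed only from fields known at launch, in line with the final bullet above.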

Point in Time Nature

There are some issues relating to the point-in-time nature of the data – the dataset is scraped at a specific time, but posts may be removed or edited after launch. You might expect this to positively skew success rates as unsuccessful campaigns could be hidden or removed from the site. To explore this, I looked at the projects listed by year against a historic dataset scraped through a similar process in 2018 – this analysis indicates that many of the pre-2015 projects do not appear in the latest dataset.

Despite this, the success rate profile does not look materially different between these two cuts, which may mean the skew is not systematic. Inaccuracies in the historic project database could still be an issue – in an ideal world I would aggregate a set of unique projects across scraped data at all available points. For now, I'll proceed with 2019 data on the assumption that historic projects are missing completely at random.

Exploratory Analysis on Data on Kickstarter Project Success Rate

Simple analysis of the dataset is revealing - the most popular categories to launch a Kickstarter project relate to music, film & video and art.


Focusing instead on the flow of funding offers a different view - significantly more money is pledged towards technology, design and games with an average of $30k+ per project, compared to <$12k in all other categories. There is significant variation in success rates by category: more than 75% of comics/dance projects succeed, while fewer than 30% of food/journalism projects are successful.


Modelling Technology Project Success

My aim is to build a classification model to predict which projects succeed – I will focus on the f1-score to judge the results as this reflects a useful mix of precision and recall. To deal with the small class imbalance I undersample the majority class (failure), which makes results across approaches more comparable.
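For reference, the f1-score is the harmonic mean of precision and recall, and undersampling simply discards random majority-class rows until the classes match. The sketch below shows both with the standard library; the `success` label key and the fixed seed are assumptions made for the example, not details from the project code.

```python
import random

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = success)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and/or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def undersample_majority(rows, label_key="success", seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy data: 4 successes and 6 failures, roughly the 40% success rate in the text.
toy = [{"id": i, "success": 1 if i < 4 else 0} for i in range(10)]
balanced = undersample_majority(toy)
```

Undersampling would normally be applied to the training data only, so that validation and test metrics reflect the real class mix.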

To partition the dataset for modelling purposes, I want to set this up as a pure prediction problem – so for model tuning/selection I use data up to 2017 for training with validation on 2018 data. For the final performance test, data up to 2018 is used with testing on 2019 information. This does create challenges with setting appropriate thresholds since the aggregate success rate drops from around 75% before 2013 to below 50% in 2014-16.
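The temporal partition described above can be sketched as a split on launch year. The `launch_year` field name and the toy rows are hypothetical.

```python
def temporal_split(rows, last_train_year, year_key="launch_year"):
    """Train on everything up to `last_train_year`; hold out the following year."""
    train = [r for r in rows if r[year_key] <= last_train_year]
    holdout = [r for r in rows if r[year_key] == last_train_year + 1]
    return train, holdout

# Hypothetical rows, one per year, to show the two splits.
rows = [{"id": i, "launch_year": year} for i, year in enumerate(range(2015, 2020))]

# Model tuning/selection: train up to 2017, validate on 2018.
train_rows, valid_rows = temporal_split(rows, last_train_year=2017)

# Final performance test: train up to 2018, test on 2019.
final_train_rows, test_rows = temporal_split(rows, last_train_year=2018)
```

Splitting on time rather than at random keeps later projects out of the training data, so the evaluation mimics predicting genuinely new launches.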

For efficiency I focus on the technology category – this contains about 18,000 projects across 16 subcategories with an average success rate of about 40% over the period 2009-2017.

Results across a set of Models

My approach considers results across a set of models:

  1. Baseline heuristic model which uses aggregate success rates by category - this simple approach predicts that all 'Technology', 'Hardware' and 'Gadget' projects will succeed, and others fail
  2. Random forest using all project parameters excluding the project description
  3. Neural net based on tabular project data with two layers of 500 and 50 nodes with dropout, batchnorm and ReLU activations
  4. NLP model trained on the descriptions only, using an RNN language model and encoder trained on wikitext
  5. A combined random forest model using (2) and (4)
  6. A combined neural net using (3) and (4)
  7. An AutoML pipeline using the auto-sklearn package on the same data as (4)
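To make the baseline in (1) concrete, here is a minimal sketch of a heuristic that predicts success for every subcategory whose historic success rate clears a cutoff. The 50% threshold, the field names and the toy rows are assumptions for illustration; the article does not state how the winning subcategories were chosen.

```python
from collections import defaultdict

def fit_category_baseline(train_rows, threshold=0.5):
    """Return the subcategories whose training success rate is at least `threshold`."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in train_rows:
        totals[row["subcategory"]] += 1
        wins[row["subcategory"]] += row["success"]
    return {cat for cat in totals if wins[cat] / totals[cat] >= threshold}

def predict_baseline(rows, winning_subcategories):
    """Predict 1 (success) for rows in a winning subcategory, otherwise 0."""
    return [1 if r["subcategory"] in winning_subcategories else 0 for r in rows]

# Hypothetical training rows.
train = [
    {"subcategory": "Hardware", "success": 1},
    {"subcategory": "Hardware", "success": 1},
    {"subcategory": "Hardware", "success": 0},
    {"subcategory": "Web", "success": 0},
    {"subcategory": "Web", "success": 0},
    {"subcategory": "Web", "success": 1},
]
winners = fit_category_baseline(train)
preds = predict_baseline(
    [{"subcategory": "Hardware"}, {"subcategory": "Web"}, {"subcategory": "Apps"}],
    winners,
)
```

Subcategories unseen in training fall through to a failure prediction, which matches the "others fail" rule in the list.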

Insights

These results show that the NLP encoder, neural net and random forest models offer similar improvements over the baseline model; these are complementary approaches based on different inputs, so the combination brings a significant jump in performance. Auto ML does offer a further marginal improvement, although the complex pipeline means this is at the expense of interpretability.

The final f1 score is a significant improvement on the baseline model. Whilst this is not a particularly accurate model, this result is probably unsurprising - assessing the creativity, appeal and viability of a new project pitch is surely a very difficult task. But it does demonstrate that there are patterns in the dataset which can help differentiate likely winners.

To understand the model output, I focus on the combined NLP and random forest model and use explainability tools from the shap and pdpbox libraries. The top features are visualized in order of importance below showing:

  1. The NLP model output ("prob_success_lang_model") is most important.
  2. Year of launch is important with early years adding to the likelihood of success (since aggregate rates were higher).
  3. If the creator is an organization rather than an individual, this generally improves the modelled odds of success.
  4. High funding rate requirements reduce the chance of success.
  5. Gadgets and hardware projects are more likely to succeed while web projects are penalized.

PDP

Partial dependence plots (PDPs) better demonstrate some of these trends - the chart below shows how the modelled likelihood of success drops with increasing (logged) funding requirements, assuming all other variables are held equal. So, if you want to design a successful project, picking the funding target and window is critical.
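The idea behind a partial dependence curve is simple enough to compute by hand: force one feature to each value on a grid, keep every other feature as observed, and average the model's predictions. The sketch below does this for a toy scoring function; `toy_model`, its coefficients and the sample rows are hypothetical stand-ins, not the fitted model or the pdpbox API.

```python
import math

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier's success probability."""
    score = 4.0 - 0.5 * row["log_goal_usd"] + 0.8 * row["is_organization"]
    return 1.0 / (1.0 + math.exp(-score))

sample = [
    {"log_goal_usd": 7.0, "is_organization": 0},
    {"log_goal_usd": 9.5, "is_organization": 1},
    {"log_goal_usd": 11.0, "is_organization": 0},
]
grid = [6.0, 8.0, 10.0, 12.0]
curve = partial_dependence(toy_model, sample, "log_goal_usd", grid)
```

With this toy model the curve falls as the logged goal rises, which is the same shape of relationship described above.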

The most useful output of my work is probably the modelled probability of success, rather than the predicted label. This could help stakeholders assess the likelihood of success and act accordingly. The shap library offers useful visuals to show the key factors behind individual predictions: for instance, the 2019 project below was predicted to have an 8% chance of success, mostly driven by the comparatively high goal/funding rate and unfavorable project description.

Whereas another 2019 project generated a 70% likelihood of success as a result of the favourable project description, gadget categorisation and long creation window.

The utility of probabilistic predictions is best viewed across populations. In the bar chart below, you can see the predicted distribution of success % for the 2018 validation set – this demonstrates how the model produces a good distribution of success bands across the full range from 0 to 100%, although very high likelihood predictions are rare. The overlaying line shows the actual realized success % within each bucket, which has the expected linear trend. As an example, there are around 310 projects which were modelled to have a 60-70% chance of success; of these, 65% actually raised their target.
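The comparison of predicted bands against realized success rates can be sketched as a simple bucketing exercise. The ten equal-width bands follow the 60-70% example above; the function name and the toy inputs are illustrative only.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Bucket predicted probabilities and report the realized success rate per bucket."""
    counts = [0] * n_bins
    wins = [0] * n_bins
    for p, y in zip(probs, outcomes):
        # A probability of exactly 1.0 is placed in the top bucket.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        wins[idx] += y
    table = []
    for i in range(n_bins):
        realized = wins[i] / counts[i] if counts[i] else None
        table.append({"band": (i / n_bins, (i + 1) / n_bins),
                      "n": counts[i],
                      "realized": realized})
    return table

# Toy predictions and outcomes (1 = funding goal reached).
probs = [0.05, 0.65, 0.62, 0.68, 1.0]
outcomes = [0, 1, 1, 0, 1]
table = calibration_table(probs, outcomes)
```

For a well-calibrated model, the realized rate in each band should sit close to the band's midpoint, which is the linear trend described above.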

A similar pattern is seen on the 2019 test dataset, indicating that the model does offer predictive power for future projects.

Application

To allow users to experiment with the dataset and predictions, I created a dynamic interface using the Bokeh library. This has three tabs showing the project success over time, the key prediction variables and the model outcome.

The app can be run locally by downloading the Bokeh folder in my GitHub folder. Then navigate to the directory and use the terminal to run the following command: bokeh serve --show main.py

About Author

Michael Griffin

Mike Griffin is training at the NYC Data Science Academy and has several years of experience in strategy/analytics roles in finance. He studied Natural Sciences (Physics) at the University of Cambridge and Management at the Judge Business School. Mike...
