Scraping Kickstarter

Gordon Fleetwood
Posted on Nov 10, 2015

Contributed by Gordon. Gordon took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).

The Problem

"Kickstarter is the world's largest funding platform for creative projects," says the first line of the description on the company's website. Creators post projects on Kickstarter hoping for their work to be crowdfunded by interested parties. If the project's goal is met before the project's expiry date, the money promised becomes money to spend. If not, the pledges go unfulfilled.

Kickstarter has a private API to access data from these projects, and several people have written their own APIs as wrappers over this hidden conduit. From a scraping perspective, getting data becomes a bit harder. Python's famous web scraping library, Beautiful Soup, is powerless against the JavaScript foundations upon which Kickstarter's website is built: it only parses the HTML it is handed and does not execute scripts. Thus, my first hurdle was to find a library powerful enough to glean data from websites built on JavaScript.

I first looked at a library called Grab, but its poor English-language support proved to be too large a barrier to overcome. Next to enter my gaze was Scrapy, but that was also discarded due to its perceived over-complexity. I finally settled on Selenium as my tool.

Methodology

Selenium's tagline is terse: it automates browsers. Out of the box, Selenium allows one to open a web browser, go to a page, and perform any action a human could (clicking buttons, filling in forms, etc.), in addition to the base task of parsing HTML for information. One can pair Selenium with PhantomJS to do this surfing without opening a browser window, but, for some reason, that proved to be slower on my machine.

The code below starts Selenium, navigates to Kickstarter's website, and then stores all the project categories and their URLs.

from selenium import webdriver

# Note: the find_element(s)_by_* helpers used here are the Selenium 2/3 API;
# Selenium 4 replaces them with find_element(By.CLASS_NAME, ...).
browser = webdriver.Firefox()
browser.get('https://www.kickstarter.com/discover?ref=nav')
categories = browser.find_elements_by_class_name('category-container')
category_links = []
for category_link in categories:
    # Each item in the list is a tuple of the category's name and its link.
    category_links.append((
        str(category_link.find_element_by_class_name('h3').text),
        category_link.find_element_by_class_name('bg-white').get_attribute('href'),
    ))

For each category, I use its URL to navigate to its page. The default is to show the first 20 active projects in a given category. Using Selenium, I expand the results to show the first 20 projects of all the projects ever submitted for that category.

for category in category_links:
    browser.get(category[1])
    browser.find_element_by_class_name('sentence-open').click()
    browser.find_element_by_id('category_filter').click()

I then went to each project's page and scraped the data I wanted. This included the project's name, funding goal, current money garnered, and description. There was some branching in the code to account for different project states: funded and finished, funded and not finished, etc.
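
The branching on project state might look roughly like the sketch below. It is not the code from the project: the function name, its arguments, and the rule that "funded" means the pledged amount has reached the goal are assumptions made for illustration.

```python
def project_state(pledged, goal, finished):
    """Label a project by funding status and whether its deadline has passed.

    Hypothetical helper: the arguments and labels are illustrative
    assumptions, not taken from the original scraper.
    """
    funded = pledged >= goal
    if funded and finished:
        return "funded and finished"
    if funded:
        return "funded and not finished"
    if finished:
        return "not funded and finished"
    return "not funded and not finished"


print(project_state(pledged=12500.0, goal=10000.0, finished=True))
```

Each state can then be routed to whichever parsing logic fits the layout of that kind of project page.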

The eagle-eyed reader will notice a key omission here. I did indeed scrape only the first 20 projects of the 15 categories due to time restrictions. My conservative estimate put the time to scrape data on all of Kickstarter's 200,000 plus projects at four days. The difference between scraping data on 600 instead of 200,000+ projects was five lines of code.

while True:
    try:
        browser.find_element_by_class_name('load_more').click()
    except Exception:  # no "Load More" button left (or it cannot be clicked)
        break

What this snippet does is click the "Load More" button at the bottom of the category's page until every project is loaded, after which the data for each project can be scraped.
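
For a run that could last days, a variant with a cap on the number of clicks and a pause between them may be safer. The sketch below is an assumption-laden illustration, not the project's code: click_until_gone, pause, and max_clicks are hypothetical names, and the function takes a lookup callable so the loop can be tried out without a browser.

```python
import time


def click_until_gone(find_button, pause=0.0, max_clicks=10000):
    """Click the element returned by find_button() until the lookup fails.

    find_button is any zero-argument callable that returns an object with a
    click() method and raises an exception once the button is gone.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except Exception:
            break  # lookup or click failed: assume everything is loaded
        clicks += 1
        time.sleep(pause)  # optional wait for the next batch to render
    return clicks
```

With the Selenium session above, it could be called as click_until_gone(lambda: browser.find_element_by_class_name('load_more'), pause=1.0).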

Next Steps: Short and Long Term

Once I have the full data, I intend to do some extensive Machine Learning on it to try to build a predictive model that tells whether or not a Kickstarter project will be funded. Finally, I will build a web app that allows a user to input the description of their Kickstarter project and receive a prediction of whether or not it will be funded.

That's still a long way off, though. Still, I decided to do some basic Machine Learning of the type that I want to do later.

My process involved separating the data into two sets: one with numeric data and the other with textual data. With the numeric data, I used vanilla Logistic Regression on the entire data set to achieve an 83% accuracy rate, a 23% increase over the baseline accuracy. Next, I used Natural Language Processing to build a model on training data, and then tried to predict the test data set. Surprisingly, the accuracy was 97%.
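
To make "increase over the baseline accuracy" concrete: a common baseline is the accuracy of always predicting the most frequent class. The sketch below assumes that definition (the post does not say which baseline was used) and runs on made-up labels, not the scraped data.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only: 1 = funded, 0 = not funded.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

print(majority_baseline(y_true))  # 0.6
print(accuracy(y_true, y_pred))   # 0.8
```

A model is only doing useful work to the extent that its accuracy exceeds this kind of baseline.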

I'm excited to see what the more robust process will produce.

Links

Slides: https://slides.com/gfleetwood/kickstarter-project/
GitHub: https://github.com/gfleetwood/nyc-data-science-academy/tree/master/scraping_kickstarter

About Author

Gordon Fleetwood

Gordon has a B.A. in Pure Mathematics and an M.A. in Applied Mathematics from CUNY Queens College. He briefly worked for an early-stage startup where he was involved in building an algorithm to analyze financial data. However,...