Web Scraping for New Business

Paul Grech
Posted on Nov 28, 2015

Contributed by Paul Grech. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on his third class project, due in the 6th week of the program.

GitHub: the full source code can be found here.

Background & Inspiration

Along with being an electrical engineer and now attending the NYC Data Science Academy 12-week bootcamp to become a data scientist, I am also co-owner, with my wife, of a wedding and event coordination company, LLG Events Inc. As you can imagine, my weeks are filled with exploring fun and interesting data sets and my weekends are filled with running lavish events, so at some point during the bootcamp I sat back and thought to myself... hmm, wouldn't it be nice to combine the two? Inspired to merge these two parts of my life, I examined the challenges my wife and I face as business owners and realized that the most difficult part of a service-oriented business is meeting clients. We do not believe in costly advertising that deems vendors "Platinum Vendors" just because they spend the most; we let our reputation sell us, but that has its limitations. How do we meet new clients outside of our network? How do we expand our network?

Fortunately and unfortunately, we are IMPATIENT, so how do we speed up the process? With this challenge ahead, a lightbulb went off. I remembered that during the planning process for my own wedding, my wife and I designed a fun and simple website, hosted by The Knot, with basic information about us and our event... and then it clicked! Why don't I scrape The Knot for all of these websites and reach out to those couples as potential new clients? Many of our clients didn't even know they wanted a planner until they were stressed out, so why not help them out ahead of time?

Plan of Action

Well, The Knot contains more than one million wedding websites across past and future dates. My plan was to scrape all user information in Python using BeautifulSoup to collect fields such as location, date, and wedding website URL, in order to gain some insight into the wedding industry as a whole and hopefully find some potential clients.

Seems easy enough, right? Just create a couple of for loops to run through all combinations of First Name 'A-Z A-Z' and Last Name 'A-Z A-Z'...

# Create an iterable list of two-letter first- and last-name combinations
def iterate(let):
    first_names = []
    last_names = []
    for i in range(len(let)):
        for j in range(len(let)):
            for k in range(len(let)):
                for l in range(len(let)):
                    first = let[k] + let[l]
                    last = let[i] + let[j]
                    # Skip doubled first letters ('bb', 'cc', ...), which no
                    # real name starts with, but keep 'aa' (e.g. Aaron)
                    if first == 'aa' or first[0] != first[1]:
                        first_names.append(first)
                        last_names.append(last)

    return(first_names, last_names)

feed them into The Knot's search utility, as seen below, and pull the results into a database or even a .csv file. What could possibly go wrong?

[Screenshot: The Knot couple search utility]

Obstacles

Well, the question wasn't what could go wrong but how much would go wrong. The answer: A LOT! As I learned, web scraping is an interesting task because one small change to a website can mean rewriting your entire script. Below I outline the main challenges I faced throughout this process:

  1. Embedded JavaScript: How do you deal with HTML that is not visible because it is rendered by embedded JavaScript?
  2. API Limitations: Although I have a delay in my script, sometimes the web servers just don't let you access them beyond a certain point. How do I get around this?
  3. Testing: Large projects must be scaled down, or testing and development will take forever. What is the best way to handle this?
  4. Error Handling: What happens when an error is returned to your script?
  5. Changing HTML: The webpage I am scraping changed; how do I handle these small changes?

Solution to Embedded JavaScript

Let me first start by saying that Chrome's developer tools are amazing! You can select any component of a webpage, and they show you the exact HTML responsible for rendering that portion. The solution to the embedded JavaScript issue therefore had two parts. First, I had to recognize the issue and understand why the HTML that should have contained the data wasn't in the page source. The answer... EMBEDDED JAVASCRIPT! Well, there goes my scraping idea. However, fear not: this is actually good news. I used Chrome's developer tools to find The Knot's API call. To do so, I entered one search into the "couple search" and watched the Network tab. From there I was able to find the API call to the database, and instead of web scraping, my Python project turned into an API call with a JSON return. Thanks to pandas' json_normalize, my problem actually made life a bit easier.

# Imports needed for the request (Python 2; use urllib.request in Python 3)
import json
import time
from pandas.io.json import json_normalize
from urllib2 import Request, urlopen

# Delay between requests to ease the load on the server
time.sleep(1)
path = 'https://www.theknot.com/registry/api/couple/search?firstName='+fn[iter]+\
       '&lastName='+ln[iter]+\
       '&eventMonth='+month+\
       '&eventYear='+year+\
       '&eventType=Wedding&reset=true&track=true&limit=20&offset=0'
request = Request(path)
response = urlopen(request)
data = response.read()
data_json = json.loads(data)

# JSON to DataFrame
couples = json_normalize(data_json['Couples'])
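To illustrate what json_normalize does with a payload shaped like this one, here is a small sketch; the field names in the sample are invented for the example, not The Knot's actual schema, and in recent pandas versions the function lives at pd.json_normalize:

```python
import pandas as pd

# Mock payload shaped like the API response described above;
# the nested 'Registrant1' field is an illustrative assumption.
sample = {'ResultCount': 1,
          'Couples': [{'Id': 123,
                       'EventDate': '2016-06-18',
                       'Registrant1': {'FirstName': 'Ann'}}]}

# Nested keys are flattened into dotted column names
couples = pd.json_normalize(sample['Couples'])
print(list(couples.columns))  # includes 'Registrant1.FirstName'
```

Each record becomes one row, so the nested JSON drops straight into a flat DataFrame ready for filtering and export.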


Solution to API Limit

Before encountering this problem, I first had to push the limits. Because of my laziness, I did not want to wait for all iterations of first and last name to process. Let's estimate the scale (first name = fn, last name = ln, L = letter):

26 (fnL1) * 26 (fnL2) * 26 (lnL1) * 26 (lnL2) = 456,976 possibilities

Now let's remove all the doubled letters, since no real names start with those (i.e. bb, cc, dd, etc.), except for aa, since you can have names such as Aaron.

456,976 - 16,900 = 440,076 combinations at 1 per second is 440,076 seconds

440,076 seconds / 60 sec.per.min / 60 min.per.hr / 24 hr.per.day ~ 5 days
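That estimate can be double-checked in a few lines:

```python
# Back-of-the-envelope check of the iteration count above
pairs = 26 * 26                  # all two-letter combinations: 676
firsts = pairs - 25              # drop 'bb'..'zz' but keep 'aa': 651
total = firsts * pairs           # every first paired with every last
days = total / 86400.0           # at one request per second
print(total, round(days, 1))     # 440076 combinations, about 5.1 days
```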

5 days is way above my patience level, so what next? Well, let's make 10 different files that each run a specific slice of the combinations. In doing so, yes, I reached the limit and my instances would eventually be shut down; to counter this, a simple delay was added, as seen in the code above. As can be seen below, each file uses bins that change depending on its index. These bins create the list of first-name and last-name combinations that are put into the API call. Also, in case a script shuts down, each instance outputs a .csv file with the current data at 25% intervals.

    # API first name / last name parameters:
    #   Create letter bin depending on file designator (0-9)
    letters_bin = len(fn) / 10   # integer division under Python 2 (// in Python 3)
    start_letters = file * letters_bin
    if start_letters + letters_bin > len(fn):
        end_letters = len(fn)
    else:
        end_letters = start_letters + letters_bin

    # Slices exclude the end index, so no "- 1" is needed here
    fn = fn[start_letters:end_letters]
    ln = ln[start_letters:end_letters]

    # Create bins for the intermediate .csv outputs (25% intervals)
    length = len(fn)
    cbin = 0
    bin_size = len(fn) / 4 - 1
    bin1 = 0 + bin_size
    bin2 = bin1 + bin_size + 1
    bin3 = bin2 + bin_size + 1
    bin4 = length - 1
    bins = [bin1, bin2, bin3, bin4]
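The per-file slicing is easier to verify when pulled into a helper; this is a sketch (the function name is mine, not from the project code) in which the last file absorbs the remainder so no name combination is dropped:

```python
def file_slice(names, file_index, n_files=10):
    # Partition the combination list evenly across n_files parallel scripts;
    # Python slices exclude the end index, so adjacent bins never overlap.
    size = len(names) // n_files
    start = file_index * size
    end = len(names) if file_index == n_files - 1 else start + size
    return names[start:end]

names = range(440076)  # stand-in for the (fn, ln) combination list
parts = [file_slice(names, i) for i in range(10)]
print(sum(len(p) for p in parts))  # 440076 -- every combination is covered
```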


Solution to Scalability

It's hard to test a script that needs about a day to run, so the code was broken into run and test configurations as seen below:

if test == 'test':
    # Test Parameters
    letters = ['a', 'b']
    (fn, ln) = iterate(letters)
    month = '0'
    year = '2016'

    # Initialization printouts
    print "First Name: ", fn
    print "Last Name: ", ln
    print "Month: ", month
    print "Year: ", year

    # Save data once complete
    length = len(fn)
    bins = [length-1]
    cbin = 0

else:
    # API Time parameters:
    letters = list(string.ascii_lowercase)
    (fn, ln) = iterate(letters)
    month = '0'
    year = '0'

One of the best practices I learned from my time as an electrical engineer in the defense industry, whether doing development or software verification, was to make code readable, scalable, and testable. This project was a true testament to how useful it is to write code in sections that can be individually run and verified.

Solution to Error Handling and a Changing Webpage

These two go hand in hand, because error handling adds robustness that allows your code to continue running even if errors are occasionally returned. Often an API call would return no results, so to speed up the process, error handling was included that checked for an empty result, skipped the rest of the script, and simply moved on to the next iteration.

    try:
        data_json['ResultCount'] > 0
    except:
        # No result count in the response: record the searched name
        # with a placeholder Id and move on
        couples['MatchedFirstName'] = fn[iter]
        couples['MatchedLastName'] = ln[iter]
        couples['Id'] = 0
    else:
        if data_json['ResultCount'] > 0:

            # JSON to DataFrame
            couples = json_normalize(data_json['Couples'])
Not only did this save time, but it made the code more efficient. The try/except/else pattern above is a common way to handle errors in Python. Practices like this also make the code more robust: it will not crash if small adjustments are made to the webpage, and the goal of this project is not to get every result, just most results.
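The bare except above does the job, but a slightly tighter sketch (the helper name is hypothetical, not from the project code) catches KeyError specifically, so unrelated bugs aren't silently swallowed:

```python
def parse_result(data_json, first, last):
    # Sketch of the error-handling pattern above with a narrower except
    try:
        count = data_json['ResultCount']
    except KeyError:
        count = 0
    if count > 0:
        return data_json['Couples']
    # No results: record the searched name with a sentinel Id of 0
    return [{'MatchedFirstName': first, 'MatchedLastName': last, 'Id': 0}]

print(parse_result({}, 'aa', 'bb'))  # sentinel record for 'aa bb'
```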

Data Processing

Well, now that we have collected as many results as possible (~3,500 from this date forward in NY, CT, and PA), there is some cleaning to be done:

import pandas as pd
import re

# Input for file to be cleaned...
f = raw_input('File Number for cleaning: ')
csv = pd.read_csv('data/'+f+'_couples_prenltk.csv')


# Retaining columns of data frame needed for analysis
csv = csv[['Id', 'EventDate', 'City', 'Location',
             'MatchedFirstName', 'MatchedLastName',
             'Registrant2FirstName', 'Registrant2LastName',
             'RegistriesSummary', 'Websites']]

# Remove all observations that do not have a website
csv.dropna(subset = ['Websites'], inplace = True)

# Remove extra characters from API output wrapped around website address
csv['Websites'] = csv['Websites'].apply(lambda x:''.join(re.findall("(h.*)'", str(x.split(',')[0]))))

# Ensure website formatting is correct - testing
#print csv.Websites.head(10)

# Extract file number and save to new csv file
f = f[0]
csv.to_csv(f+'_filtered.csv')

Output files from the API script were read into a cleaning script that kept only the relevant data, such as names, locations, dates, and websites, and presented the information in a clean format. This prepared the data for a Shiny application that lets the customer (well, my wife) view the results of all my effort!
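For reference, the website-extraction lambda in the cleaning script works roughly like this on a raw value; the string below is a mock-up with an invented key name and URL, not actual API output:

```python
import re

# Mock-up of the stringified website field returned by the API
raw = "[{'WebsiteUrl': 'https://www.theknot.com/us/ann-and-bob'}]"

# Take everything up to the first comma, then capture from the first 'h'
# (the start of http...) through to the last single quote
url = ''.join(re.findall("(h.*)'", raw.split(',')[0]))
print(url)  # https://www.theknot.com/us/ann-and-bob
```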

Presenting the Data and Future Work

Just like any great product, all the fancy tech work needs to stay in the background with the results front and center. To display my results, a Shiny Dashboard app was created that allows the user to filter each variable and select which variables to show in the results table. Embedded URL links were added for easy click-and-go presentation as well.

[Screenshot: Shiny Dashboard results table]

In the future, I would like to create a crawler that goes through each couple's website and looks for specific venue names, as a way of growing our business in locations of our choice. Forecasting an approximate "engagement season" would also allow for accurate business planning and preparation on our part.

About Author

Paul Grech


Paul Grech is a Data Scientist with a passion for exploring insight in big data. He is eager to advance his skills and build value in a professional environment. Previous experience includes several years of professional consulting experience in...
