Data Web Scraping for New Business
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Paul Grech. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).
Github: Full source code found here
Background & Inspiration
Along with being an electrical engineer and now attending the NYC Data Science Academy 12-week Data Science Bootcamp to become a data scientist, I am also co-owner, with my wife, of a wedding and event coordination company known as LLG Events Inc. As you can imagine, my weeks are filled with exploring fun and interesting data sets and my weekends are filled with running lavish events. At some point during the bootcamp I sat back and thought to myself... hmm, wouldn't it be nice to combine the two!
Inspired to combine these two elements of my life, I examined the challenges that my wife and I face as business owners and realized that the most difficult part of a service-oriented business is meeting clients. We do not believe in costly advertising that deems vendors "Platinum Vendors" just because they spend the most. We let our reputation sell us, but this also has its limitations... How do we meet new clients outside of our network? How do we expand our network?
Fortunately and unfortunately we are IMPATIENT, so how do we speed up the process?! With this challenge ahead, a lightbulb went off. I remembered that while planning my own wedding, my wife and I designed a fun and simple website, hosted by The Knot, with basic information about us and our event... and then it clicked! Why don't I web scrape The Knot for all of these websites and reach out to those people as potential new clients? Many of our clients didn't even know they wanted a planner until they were stressed out, so why not help them out ahead of time!
Plan of Action
Well, The Knot contains more than 1 million wedding websites covering past and future dates. My plan was to scrape all user information in Python using BeautifulSoup so that I could get information such as location, date and wedding website URL, gain some insight into the wedding industry as a whole and hopefully find some potential clients.
Seems easy enough right... Just create a couple of for loops to loop through all possibilities of First Name: 'A-Z A-Z' and Last Name: 'A-Z A-Z'...
    # Create iterable list of first and last names (2 letters)
    def iterate(let):
        first_names = []
        last_names = []
        for i in range(len(let)):
            for j in range(len(let)):
                for k in range(len(let)):
                    for l in range(len(let)):
                        first = let[k] + let[l]
                        last = let[i] + let[j]
                        if first == 'aa' or first[0] != first[1]:
                            first_names.append(first)
                            last_names.append(last)
        return(first_names, last_names)
Then feed each combination into The Knot's couple search utility and pull the results into a database or even a .csv file. What could possibly go wrong?
Obstacles
Well, the question wasn't what could go wrong but how much would go wrong. The answer is A LOT! As I learned, web scraping is an interesting task because one small change to a website could mean rewriting your entire script. Below I outline the main challenges I faced throughout this process:
- Embedded JavaScript: How do I deal with data that is missing from the page's HTML because it is loaded by embedded JavaScript?
- API Limitations: Although I have a delay in my script, sometimes the web servers just don't allow you to access them beyond a certain point. How do I get around this?
- Testing: Large projects must be scaled down, or else testing and development will take forever. What is the best way to handle this?
- Error Handling: What happens when an error is returned to your script?
- Changing HTML: The webpage I am scraping changed. How do I handle these small changes?
Solution to Embedded JavaScript Data
Let me first start by saying Google Chrome's developer tools are amazing! You can select any component of a webpage and it shows you the exact HTML responsible for rendering that portion. There is a two-part solution to the embedded JavaScript issue. First, I had to recognize the issue and understand why the data was not in the HTML that should have contained it. The answer... EMBEDDED JAVASCRIPT!
Well, there goes my scraping idea. However, fear not, this is actually good news. Second, I used the Chrome developer tools to find The Knot's API call. To do so, I entered one search into the "couple search" and watched the Network tab of the Chrome developer tools. From there, I was able to find the API call to the database, and instead of web scraping, my Python project turned into an API call with a JSON response. With pandas' json_normalize to turn that response into a data frame, my problem actually made life a bit easier.
    # Pull request delay
    time.sleep(1)
    path = 'https://www.theknot.com/registry/api/couple/search?firstName='+fn[iter]+\
           '&lastName='+ln[iter]+\
           '&eventMonth='+month+\
           '&eventYear='+year+\
           '&eventType=Wedding&reset=true&track=true&limit=20&offset=0'
    request = Request(path)
    response = urlopen(request)
    data = response.read()
    data_json = json.loads(data)
    # Json to DF
    couples = json_normalize(data_json['Couples'])
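A minimal sketch of the same request in Python 3 (the original excerpt is Python 2). The endpoint and query parameters are copied from the excerpt above and may no longer be live; `build_search_url`, `parse_couples` and the sample payload are hypothetical names and data used only for illustration. No network call is made here.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the post; it may have changed or been retired since.
BASE_URL = "https://www.theknot.com/registry/api/couple/search"

def build_search_url(first, last, month="0", year="0", limit=20, offset=0):
    """Build the couple-search URL with properly escaped query parameters."""
    params = {
        "firstName": first,
        "lastName": last,
        "eventMonth": month,
        "eventYear": year,
        "eventType": "Wedding",
        "reset": "true",
        "track": "true",
        "limit": limit,
        "offset": offset,
    }
    return BASE_URL + "?" + urlencode(params)

def parse_couples(raw):
    """Return the list under 'Couples' in a JSON response body (empty if absent)."""
    return json.loads(raw).get("Couples", [])

url = build_search_url("pa", "gr", year="2016")

# Made-up payload for illustration only; the real response has more fields.
sample = '{"ResultCount": 1, "Couples": [{"Id": 1, "MatchedFirstName": "pa"}]}'
couples = parse_couples(sample)
```

The resulting `url` could then be passed to `urllib.request.urlopen`, and the list returned by `parse_couples` to `pandas.json_normalize`, mirroring the steps in the excerpt.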
Solution to API Limit
Before encountering this problem, I first had to push the limits. Because of my laziness, I did not want to wait for all iterations of first and last name to process. Let's estimate this (first name = fn, last name = ln, L = letter):
26 (fnL1) * 26 (fnL2) * 26 (lnL1) * 26 (lnL2) = 456,976 possibilities
Removing Repetitive Letters
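Now let's remove the first-name pairs with a repeated letter, since hardly any real names start with those (i.e. bb, cc, dd etc...), except for aa, since you can have names such as Aaron. That removes 25 first-name pairs, each combined with 26 * 26 = 676 last-name pairs, or 25 * 676 = 16,900 combinations: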
456,976 - 16,900 = 440,076 combinations at 1 per second is 440,076 seconds
440,076 seconds / 60 sec.per.min / 60 min.per.hr / 24 hr.per.day ~ 5 days
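The arithmetic above can be checked with a short sketch. It is written in Python 3 rather than the Python 2 of the original script, and the helper name `count_queries` is hypothetical; it applies the same filter as `iterate` (first-name pairs with a doubled letter are dropped, except 'aa').

```python
import string
from itertools import product

def count_queries(letters=string.ascii_lowercase):
    """Return (all combinations, combinations kept after the first-name filter)."""
    pairs = [a + b for a, b in product(letters, repeat=2)]
    # Keep 'aa' plus every first-name pair whose two letters differ.
    firsts = [p for p in pairs if p == "aa" or p[0] != p[1]]
    return len(pairs) ** 2, len(firsts) * len(pairs)

total, kept = count_queries()
days = kept / 60 / 60 / 24  # assuming one request per second
print(total, kept, round(days, 2))  # 456976 440076 5.09
```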
5 days is way above my patience level, so what next... well, let's make 10 different files that each run specific combinations. In doing so, yes, I reached the limit and my instances would eventually be shut down. To counter this, a simple delay was added, as can be seen in the code above. As can be seen below, each file has bins that change depending on the index of the file. These bins create the list of first name and last name combinations that are put into the API call. Also, in case the script shuts down, each instance outputs a .csv file with the current data at 25% intervals.
    # API first name / last name parameters:
    # Create letter bin depending on file designator (0-9)
    letters_bin = len(fn) / 10
    start_letters = file * letters_bin
    if start_letters + letters_bin - 1 > len(fn):
        end_letters = len(fn)
    else:
        end_letters = start_letters + letters_bin - 1
    fn = fn[start_letters:end_letters]
    ln = ln[start_letters:end_letters]

    # Create bins for output
    length = len(fn)
    cbin = 0
    bin_size = len(fn) / 4 - 1
    bin1 = 0 + bin_size
    bin2 = bin1 +bin_size + 1
    bin3 = bin2 +bin_size + 1
    bin4 = length - 1
    bins = [bin1, bin2, bin3, bin4]
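The binning idea can also be sketched in Python 3 as two small helpers. This is an illustration under assumptions, not the script's actual code: `split_bounds` and `checkpoint_indices` are hypothetical names, integer division (`//`) replaces the Python 2 `/`, and because slice ends are exclusive the sketch does not subtract 1 from the end index, so every name pair lands in exactly one chunk.

```python
def split_bounds(n_items, n_chunks, chunk_index):
    """Return (start, end) so that items[start:end] is chunk `chunk_index`.

    Slice ends are exclusive, and the last chunk absorbs any remainder.
    """
    size = n_items // n_chunks
    start = chunk_index * size
    end = n_items if chunk_index == n_chunks - 1 else start + size
    return start, end

def checkpoint_indices(n_items, n_checkpoints=4):
    """0-based indices after which partial results would be written out.

    Assumes n_items >= n_checkpoints.
    """
    return [(n_items * k) // n_checkpoints - 1 for k in range(1, n_checkpoints + 1)]

# Example: 440,076 name pairs spread over 10 script instances.
bounds = [split_bounds(440076, 10, i) for i in range(10)]
print(bounds[0], bounds[-1])
print(checkpoint_indices(100))
```

Each instance would then slice its own share with `fn[start:end]` and `ln[start:end]` and save a .csv whenever the loop index reaches one of the checkpoint indices.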
Solution to Scalability
It is kind of hard to test a script that needs about a day or so to run, so the code was broken into run and test code as seen below:
    if test == 'test':
        # Test Parameters
        letters = ['a', 'b']
        (fn, ln) = iterate(letters)
        month = '0'
        year = '2016'

        #Initialization print outs
        print "First Name: ", fn
        print "Last Name: ", ln
        print "Month: ", month
        print "Year: ", year

        # Save data once complete
        length = len(fn)
        bins = [length-1]
        cbin = 0
    else:
        # API Time parameters:
        letters = list(string.ascii_lowercase)
        (fn, ln) = iterate(letters)
        month = '0'
        year = '0'
One of the best practices I learned from my time as an electrical engineer in the defense industry, whether doing development or software verification, was to make your code readable, scalable and testable. This project was a true testament to how useful it is to write your code in sections that can be individually run and verified.
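One way to keep the test and run settings in a single, individually verifiable place is a small function that returns them. This is a sketch in Python 3 under assumptions: `run_parameters` is a hypothetical name, and the values are taken from the excerpt above.

```python
import string

def run_parameters(mode):
    """Return the letters and date filters for a quick test run or a full run."""
    if mode == "test":
        # Two letters give at most 2**4 = 16 name combinations: seconds, not days.
        return {"letters": ["a", "b"], "month": "0", "year": "2016"}
    # Full run: all 26 letters, with month and year left at '0' as in the post.
    return {"letters": list(string.ascii_lowercase), "month": "0", "year": "0"}

test_params = run_parameters("test")
full_params = run_parameters("run")
```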
Solution to Data Error Handling and a Changing Webpage
These two go hand in hand because error handling adds robustness to your code, allowing it to continue running even if errors are occasionally returned. Oftentimes an API call would return no results, so to speed up the process, error handling was included that checks for an empty result, skips the rest of the loop body and simply moves on to the next iteration.
    try:
        data_json['ResultCount'] > 0
    except:
        couples['MatchedFirstName'] = fn[iter]
        couples['MatchedLastName'] = ln[iter]
        couples['Id'] = 0
    else:
        if data_json['ResultCount'] > 0:
            # Json to DF
            couples = json_normalize(data_json['Couples'])
Not only did this save time, it also made the code cleaner. The try/except/else block above is a common technique for handling errors in Python. Practices like this also make the code more robust, in that it will not crash in case small adjustments are made to the webpage, since the goal of this project is not to get every result but most results.
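A variation on this pattern, sketched in Python 3, catches only the exceptions a missing or malformed response would raise instead of using a bare `except`. The function name `extract_couples` is hypothetical, and returning a placeholder row for an empty result as well as for a malformed one is an assumption made for this sketch, not something the excerpt shows.

```python
def extract_couples(data_json, first, last):
    """Return a list of couple records, or a single placeholder row if none."""
    try:
        count = int(data_json["ResultCount"])
        couples = data_json["Couples"]
    except (KeyError, TypeError, ValueError):
        # Response missing or malformed: treat it as "no results".
        count, couples = 0, []
    if count > 0 and couples:
        return list(couples)
    # Placeholder mirrors the excerpt: record the query and an Id of 0.
    return [{"MatchedFirstName": first, "MatchedLastName": last, "Id": 0}]

found = extract_couples({"ResultCount": 1, "Couples": [{"Id": 7}]}, "pa", "gr")
empty = extract_couples({}, "pa", "gr")
```

Catching named exceptions keeps the loop running on bad responses while still letting unrelated bugs surface during testing.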
Data Processing
Well, now that we have collected as many results as possible (on the scale of ~3,500 from this date forward in NY, CT and PA), there is some cleaning to be done:
    import pandas as pd
    import re

    # Input for file to be cleaned...
    f = raw_input('File Number for cleaning: ')
    csv = pd.read_csv('data/'+f+'_couples_prenltk.csv')

    # Retaining columns of data frame needed for analysis
    csv = csv[['Id', 'EventDate', 'City', 'Location',
               'MatchedFirstName', 'MatchedLastName',
               'Registrant2FirstName', 'Registrant2LastName',
               'RegistriesSummary', 'Websites']]

    # Remove all observations that do not have a website
    csv.dropna(subset = ['Websites'], inplace = True)

    # Remove extra characters from API output wrapped around website address
    csv['Websites'] = csv['Websites'].apply(lambda x:''.join(re.findall("(h.*)'", str(x.split(',')[0]))))

    # Ensure website formatting is correct - testing
    #print csv.Websites.head(10)

    # Extract file number and save to new csv file
    f = f[0]
    csv.to_csv(f+'_filtered.csv')
Output files from the API script that contained all data were read into a cleaning script that kept only the relevant data such as names, locations, dates and websites and presented the information in a clean format. This was used in order to prepare the data for a Shiny application that would allow the customer (well, my wife) to view the results of all my effort!
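The website-cleaning step can be illustrated on its own with a small Python 3 sketch. It assumes the raw `Websites` value is a stringified list of quoted URLs, which is what the regular expression in the cleaning script suggests; the sample value, the pattern and the name `first_url` are all hypothetical.

```python
import re

# Matches the first http(s) URL, stopping at a quote, comma, whitespace or bracket.
URL_PATTERN = re.compile(r"""https?://[^'",\s\]]+""")

def first_url(raw):
    """Return the first URL found in a raw 'Websites' value, or '' if none."""
    match = URL_PATTERN.search(str(raw))
    return match.group(0) if match else ""

# Made-up value imitating a stringified list; the real API output may differ.
sample = "[u'http://example.com/couple-one', u'http://example.com/couple-two']"
print(first_url(sample))  # http://example.com/couple-one
```

With pandas, the same function could be applied to the whole column via `csv['Websites'].apply(first_url)`.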
Presenting the Data and Future Work
Just like any great product, all the fancy tech work needs to be in the background with the results front and center. As a way of displaying my results, a Shiny Dashboard app was created that allowed the user to filter each variable and select which variables to show in the results table. Embedded URL links were added for easy click and go presentation as well.
In the future, I would like to create a crawler that goes through each couple's website and looks for specific venue names as a way of growing our business in locations of our choice. Also, forecasting an approximate "engagement season" would allow for accurate business planning and preparation on our part.