
Data Web Scraping for New Business

Paul Grech
Posted on Nov 28, 2015
The skills the author demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Paul Grech. He took NYC Data Science Academy's 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).

GitHub: Full source code found here

Background & Inspiration

Along with being an electrical engineer now attending NYC Data Science Academy's 12-week Data Science bootcamp to become a data scientist, I am also co-owner, with my wife, of a wedding and event coordination company known as LLG Events Inc. As you can imagine, my weeks are filled with exploring fun and interesting data sets and my weekends are filled with running lavish events, so at some point during the bootcamp I sat back and thought to myself... wouldn't it be nice to combine the two?

Inspired to combine these two elements of my life, I examined the challenges that my wife and I face as business owners and realized that the most difficult part of a service-oriented business is meeting clients. We do not believe in costly advertising that deems vendors "Platinum Vendors" just because they spend the most; we use our reputation to sell us, but this also has its limitations... How do we meet new clients outside of our network? How do we expand our network?

Fortunately and unfortunately, we are IMPATIENT, so how do we speed up the process? With this challenge ahead, a lightbulb went off. I remembered the planning process for my own wedding, when my wife and I designed a fun and simple website with basic information about us and our event, hosted by The Knot... and then it clicked! Why don't I web scrape The Knot for all of these websites and reach out to those people as potential new clients? Many of our clients didn't even know they wanted a planner until they were stressed out, so why not help them out ahead of time?

Plan of Action

Well, The Knot contains more than 1 million wedding websites spanning past and future dates. My plan was to scrape all user information in Python using BeautifulSoup so that I could get information such as location, date, and wedding website URL, gain some insight into the wedding industry as a whole, and hopefully find some potential clients.

Seems easy enough, right? Just create a couple of for loops to loop through all possibilities of First Name: 'A-Z A-Z' and Last Name: 'A-Z A-Z'...

# Create iterable lists of two-letter first- and last-name prefixes
def iterate(let):
    first_names = []
    last_names = []
    for i in range(len(let)):
        for j in range(len(let)):
            for k in range(len(let)):
                for l in range(len(let)):
                    first = let[k] + let[l]
                    last = let[i] + let[j]
                    # Skip first names with a doubled letter ('bb', 'cc', ...),
                    # keeping 'aa' for names such as Aaron
                    if first == 'aa' or first[0] != first[1]:
                        first_names.append(first)
                        last_names.append(last)

    return (first_names, last_names)

Then feed those combinations into The Knot's search utility, as seen below, and pull the results into a database or even a .csv file. What could possibly go wrong?

[Image: The Knot's couple search utility]

Obstacles

Well, the question wasn't what could go wrong but how much would go wrong. The answer: A LOT! As I learned, web scraping is an interesting task because one small change to a website could mean rewriting your entire script. Below I outline the main challenges I faced throughout this process:

  1. Embedded JavaScript: How do you deal with HTML that is not visible because it is actually embedded JavaScript?
  2. API Limitations: Although I have a delay in my script, sometimes the web page servers just don't allow you to access them beyond a certain point. How do I get around this?
  3. Testing: Large projects must be scaled or else testing and development time will take forever. What is the best way to handle this?
  4. Error Handling: What happens when an error is returned to your script?
  5. Changing HTML: The webpage I am scraping changed; how do I handle these small changes?

Solution to Embedded JavaScript Data

Let me first start by saying Chrome's developer tools are amazing! You can select any component of a webpage and they show you the exact HTML responsible for rendering that portion. There is a two-part solution to the embedded JavaScript issue. First, I had to recognize the issue and understand why the data that should have been in the HTML was not there. The answer... EMBEDDED JAVASCRIPT!

Well, there goes my scraping idea. However, fear not, this is actually good news. I used the Chrome developer tools to find The Knot's API call. To do so, I entered one search into the "couple search" and watched the Network tab of the developer tools. From there, I was able to find the API call to the database, and instead of web scraping, my Python project turned into an API call with a JSON return. Using pandas' json_normalize, this turn of events actually made life a bit easier.

    # Required imports (Python 2)
    import time
    import json
    from urllib2 import Request, urlopen
    from pandas.io.json import json_normalize

    # Pull request delay
    time.sleep(1)
    path = 'https://www.theknot.com/registry/api/couple/search?firstName='+fn[iter]+\
           '&lastName='+ln[iter]+\
           '&eventMonth='+month+\
           '&eventYear='+year+\
           '&eventType=Wedding&reset=true&track=true&limit=20&offset=0'
    request = Request(path)
    response = urlopen(request)
    data = response.read()
    data_json = json.loads(data)

    # JSON to DataFrame
    couples = json_normalize(data_json['Couples'])
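
Each successful response yields one small DataFrame. Here is a minimal sketch of how those frames can be accumulated across iterations; the accumulator name couples_all is my own assumption, not taken from the original source:

    import pandas as pd

    couples_all = pd.DataFrame()  # running collection across all name pairs

    # ... then, after each successful API call inside the name loop
    # (couples is the frame built from the response above):
    couples_all = pd.concat([couples_all, couples], ignore_index=True)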

Solution to API Limit

Before encountering this problem, I had to first push the limits. Because of my laziness, I did not want to wait for all iterations of first and last name to process. Let's estimate this (first name = fn, last name = ln, L = letter):

26 (fnL1) * 26 (fnL2) * 26 (lnL1) * 26 (lnL2) = 456,976 possibilities

Removing Repetitive Letters

Now let's remove all the doubled letters, since no real first names start with those (i.e., bb, cc, dd, etc.), except for aa, since you can have names such as Aaron.

456,976 - 16,900 = 440,076 combinations at 1 per second is 440,076 seconds

440,076 seconds / 60 sec per min / 60 min per hr / 24 hr per day ≈ 5 days
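
As a quick sanity check, the estimate can be reproduced in a few lines of Python:

total = 26 ** 4                      # all two-letter first/last combinations
doubled = 25 * 26 ** 2               # doubled first-name prefixes ('bb'..'zz') times all last-name pairs
remaining = total - doubled          # 440,076 combinations
days = remaining / 86400.0           # at one request per second
print total, doubled, remaining, days   # 456976 16900 440076 ~5.09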

Five days is way above my patience level, so what next? Well, let's make 10 different files that each run specific combinations. In doing so, yes, I reached the limit and my instances would eventually be shut down. To counter this, a simple delay was added, as can be seen in the code above. As can be seen below, we had bins that would change depending on the index of the file. These bins would create the list of first name and last name combinations to be put into the API call. Also, in case the script shut down, each instance outputs a .csv file with the current data at 25% intervals.

    # API first name / last name parameters:
    #   Slice the name lists into bins by file designator (0-9)
    letters_bin = len(fn) / 10
    start_letters = file * letters_bin
    # Python slices exclude the end index, so no -1 is needed here
    if start_letters + letters_bin > len(fn):
        end_letters = len(fn)
    else:
        end_letters = start_letters + letters_bin

    fn = fn[start_letters:end_letters]
    ln = ln[start_letters:end_letters]

    # Create bins for intermediate CSV output at roughly 25% intervals
    length = len(fn)
    cbin = 0
    bin_size = len(fn) / 4 - 1
    bin1 = 0 + bin_size
    bin2 = bin1 + bin_size + 1
    bin3 = bin2 + bin_size + 1
    bin4 = length - 1
    bins = [bin1, bin2, bin3, bin4]
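
For completeness, here is a sketch of how those bins could trigger the periodic dump inside the main iteration loop; the accumulator name couples_all and the output path are assumptions of mine, not the original code:

    # Inside the loop over name combinations (iter is the loop index):
    # dump accumulated results whenever the index crosses a bin boundary.
    if iter == bins[cbin]:
        couples_all.to_csv('data/' + str(file) + '_couples_' + str(cbin) + '.csv')
        cbin += 1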

Solution to Scalability

It is kind of hard to test a script that needs about a day to run, so the code was broken into run and test configurations, as seen below:

if test == 'test':
    # Test Parameters
    letters = ['a', 'b']
    (fn, ln) = iterate(letters)
    month = '0'
    year = '2016'

    # Initialization printouts
    print "First Name: ", fn
    print "Last Name: ", ln
    print "Month: ", month
    print "Year: ", year

    # Save data once complete
    length = len(fn)
    bins = [length-1]
    cbin = 0

else:
    # API Time parameters:
    letters = list(string.ascii_lowercase)
    (fn, ln) = iterate(letters)
    month = '0'
    year = '0'

One of the best practices I learned from my time as an electrical engineer in the defense industry, whether doing development or software verification, was to make your code readable, scalable, and testable. This project was a true testament to how useful it is to write your code in sections that can be individually run and verified.

Solution to Data Error Handling and a Changing Webpage

These two go hand in hand, because error handling adds robustness that can allow your code to continue running even if errors are occasionally returned. Oftentimes there would be no results returned from an API call, so to speed up the process, error handling was included that checked for an empty result, skipped the rest of the script, and simply moved on to the next iteration.

try:
    data_json['ResultCount'] > 0
except:
    # No usable result: record the attempted name pair with a placeholder Id
    couples['MatchedFirstName'] = fn[iter]
    couples['MatchedLastName'] = ln[iter]
    couples['Id'] = 0
else:
    if data_json['ResultCount'] > 0:

        # JSON to DataFrame
        couples = json_normalize(data_json['Couples'])

Not only did this save time, it also made the code more efficient. The try/except/else structure above is a common technique for handling errors in Python. Practices like this also make the code more robust, so it will not crash if small adjustments are made to the webpage, since the goal of this project is not to get every result but most results.

Data Processing

Well, now that we had collected as many results as possible (on the scale of ~3,500 from this date forward in NY, CT, and PA), there was some cleaning to be done:

import pandas as pd
import re

# Input for file to be cleaned...
f = raw_input('File Number for cleaning: ')
csv = pd.read_csv('data/'+f+'_couples_prenltk.csv')


# Retaining columns of data frame needed for analysis
csv = csv[['Id', 'EventDate', 'City', 'Location',
             'MatchedFirstName', 'MatchedLastName',
             'Registrant2FirstName', 'Registrant2LastName',
             'RegistriesSummary', 'Websites']]

# Remove all observations that do not have a website
csv.dropna(subset = ['Websites'], inplace = True)

# Remove extra characters from API output wrapped around website address
csv['Websites'] = csv['Websites'].apply(lambda x:''.join(re.findall("(h.*)'", str(x.split(',')[0]))))

# Ensure website formatting is correct - testing
#print csv.Websites.head(10)

# Extract file number and save to new csv file
f = f[0]
csv.to_csv(f+'_filtered.csv')
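
To illustrate what that Websites lambda is doing, here is a small example. The raw string below is my own guess at the field's shape, inferred from the regex, not taken from the actual API output:

import re

# Hypothetical raw value of a Websites cell (assumption, for illustration only)
raw = "[{'Uri': 'http://www.theknot.com/us/jane-and-john', 'Type': 'Wedding'}]"

# Keep the text before the first comma, then capture from the first 'h'
# up to the last single quote -- leaving just the bare URL
url = ''.join(re.findall("(h.*)'", raw.split(',')[0]))
print url  # http://www.theknot.com/us/jane-and-john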

Output files from the API script containing all the data were read into a cleaning script that kept only the relevant fields, such as names, locations, dates, and websites, and presented the information in a clean format. This prepared the data for a Shiny application that would allow the customer (well, my wife) to view the results of all my effort!

Presenting the Data and Future Work

Just like any great product, all the fancy tech work needs to stay in the background with the results front and center. To display my results, a Shiny dashboard app was created that allowed the user to filter each variable and select which variables to show in the results table. Embedded URL links were added for easy click-and-go presentation as well.

[Image: Shiny dashboard of the scraped wedding data]

In the future, I would like to create a crawler that goes through each couple's website and looks for specific venue names, as a way of growing our business in locations of our choice. Forecasting an approximate "engagement season" would also allow for accurate business planning and preparation on our part.

About Author

Paul Grech

Paul Grech is a Data Scientist with a passion for exploring insights in big data. He is eager to advance his skills and build value in a professional environment. Previous experience includes several years of professional consulting in...
