Scraping Event Data with Selenium

Posted on Mar 26, 2018
Introduction

I have been blessed to grow up in the most convenient time in history for organizing information, an era of ubiquitous one-stop-shop aggregators populating the web. When I wanted to watch soccer highlights as a kid, I could venture over to Footytube. In my late teens, Reddit became a great source of user-curated information and data across a wide array of genres.

When I bought my first iPhone in college, Flipboard gave me a launching pad for the biggest headlines of the day from a number of different news sources. When I got tired of my hometown, sites like Skiplagged and Priceline helped me find the best deals across a number of different sources.

Finally, as an avid concert attendee, I have been inspired by SeatGeek, which aggregates concert tickets to find the best deals for local events. For my second project at the NYC Data Science Academy, I put my skills to the test and began building an aggregator of my own. The first step was to put together a web scraper.

Objective

My plan was to scrape ticket sites, all of which have very sophisticated anti-scraping mechanisms to deter scalpers. I would like to begin this post by stating that I have no malicious intent to purchase and resell an oversized block of tickets to any particular event. There is no way I would have the time to keep up with that kind of business while going through this bootcamp! I began by scraping AXS.com. AXS is a digital marketing platform for purchasing tickets to sports and entertainment events in the US and overseas.

Data Scraping

I decided to use Python's Selenium WebDriver library for scraping, since I would encounter a lot of JavaScript features on these sites. With Scrapy, I would need to navigate the site by URL. However, some click actions on my targeted sites do not change the URL, so I had to mimic a user's actions in a test browser instead. I also ran into challenges with the website detecting my bot and serving me CAPTCHAs.
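To illustrate the click-driven navigation described above, here is a minimal sketch. It is not the code from this project: the URL, the CSS selectors, and the function names `click_and_read` and `demo` are hypothetical placeholders, and the fixed wait is only one simple way to let the page render.

```python
import time

CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def click_and_read(driver, button_selector, item_selector,
                   wait_seconds=2.0, sleep=time.sleep):
    """Click an element the way a user would, then read the items that appear.

    The click may load new content through JavaScript without changing the
    URL, which is why a purely URL-driven crawler would miss it.
    """
    driver.find_element(CSS, button_selector).click()
    sleep(wait_seconds)  # give the page's JavaScript time to render
    return [el.text for el in driver.find_elements(CSS, item_selector)]


def demo():
    # Imported here so the helper above can be read without a browser installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get("https://example.com/events")  # hypothetical URL
        # Both selectors below are hypothetical placeholders.
        print(click_and_read(driver, "button.load-more", "div.event-title"))
    finally:
        driver.quit()
```

Calling `demo()` opens a real browser window. The helper takes the driver as an argument, so the same function works with any WebDriver-compatible browser.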

In response, I lengthened my time.sleep() pauses and spread them between a large number of steps in my script. I also realized a VPN would be helpful, so that every time I kicked off a new process, my requests would not appear to come from the same IP address the site had just identified as a bot.
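The pausing idea can be sketched roughly as follows. The helper names `polite_pause` and `run_steps` are hypothetical, and randomizing the delay is my own assumption for illustration; the post itself only describes longer, more widely dispersed time.sleep() calls.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def run_steps(steps, min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Run each step (a zero-argument callable), pausing after every one."""
    results = []
    for step in steps:
        results.append(step())
        polite_pause(min_seconds, max_seconds, sleep=sleep)
    return results
```

Each scraping action (opening a page, clicking a button, scrolling) would be passed in as one step, so no two actions run back to back without an idle period between them.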

Finally, I am an avid user of these websites and I understood that I was running the risk of being banned if caught. It could take months to clear my name through customer service by explaining my project, or I may never be able to use an account on these sites again without circumventing a ban. To prevent these consequences, I made sure to log out of any accounts before scraping. I also knew I should not use my home IP address for this reason.

Challenges:

  1. JavaScript
  2. CAPTCHA
  3. Dynamic Element Loading
  4. Ban from Site

Solutions:

  1. Mimic User Actions Via QA Tool (Selenium)
  2. Set Idle Time Liberally in Script and Use VPN
  3. Group Scrolling and Remove Duplicates in Script
  4. Never Log In Using Primary Account and Use VPN
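As one possible reading of the third solution, the sketch below scrolls the page in steps, collects the visible items after each scroll, and drops duplicates along the way. The function name `scroll_and_collect`, the selector argument, and the scroll count are hypothetical; the project's actual script may do this differently.

```python
import time

CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def scroll_and_collect(driver, item_selector, max_scrolls=5,
                       pause=2.0, sleep=time.sleep):
    """Collect item text across several scrolls, keeping first-seen order."""
    seen = set()
    ordered = []
    for i in range(max_scrolls + 1):
        # Elements loaded earlier are usually still on the page, so the same
        # text can show up again; the set filters those repeats out.
        for el in driver.find_elements(CSS, item_selector):
            text = el.text.strip()
            if text and text not in seen:
                seen.add(text)
                ordered.append(text)
        if i < max_scrolls:
            # Scroll to the bottom so the page loads its next batch of elements.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(pause)
    return ordered
```

Deduplicating on the item text is a simplification. If two different events can share a title, a combination such as title, venue, and date would be a safer key.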

Through this project, I learned a lot about events taking place in my city. In the future, I would like to add artist profiles by connecting to the Spotify API. I am also in the process of building a user interface using Django. This will be complete by the end of April, after we go through our machine learning and big data modules in the bootcamp. I will update this post when those portions of the project are complete.

To review the work I have completed so far, feel free to view my GitHub project repository and project presentation.

Side Note: If any of the companies whose data I am scraping would like me to stop, feel free to let me know by emailing [email protected]. The purpose of my activity is purely exploratory and academic. I am not conducting any revenue-generating activity with your products.

About Author

Kweku Ulzen

A lover of technology and data, Kweku is always interested in exploring how the two intersect to affect society and provide insights into the most pressing and interesting issues. He is a graduate of the University of Alabama...