Scraping Event Data with Selenium
The skills the author demoed here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Introduction
I have been blessed to grow up in the most convenient period in history for organizing information. I grew up in an era of ubiquitous one-stop-shop aggregators populating the web. When I wanted to watch soccer highlights as a kid, I could venture over to Footytube. In my late teens, Reddit became a great source of user-curated information and data across a wide array of genres.
When I bought my first iPhone in college, Flipboard gave me a launching pad for the biggest headlines of the day from a number of different news sources. When I got tired of my hometown, sites like Skiplagged and Priceline helped me find the best travel deals from a number of different sources.
Finally, as an avid concert attendee, I have drawn inspiration from SeatGeek, which aggregates concert tickets to surface the best deals for local events. For my second project at the NYC Data Science Academy, I put my skills to the test and began building an aggregator of my own. The first step was to put together a web scraper.
Objective
My plan was to scrape ticket sites, all of which have very sophisticated anti-scraping mechanisms in place to deter scalpers. I would like to begin this post by stating that I have no malicious intent to purchase and re-sell an oversized block of tickets to any particular event. There is no way I would have the time to keep up with that kind of business while going through this bootcamp! I began by scraping AXS.com. AXS is a digital marketing platform for purchasing tickets to sports and entertainment events in the US and overseas.
Data Scraping
I decided to use Python's Selenium WebDriver library for scraping since I would encounter a lot of JavaScript features on these sites. Using Scrapy, I would need to navigate the site through URLs; however, some click actions on my targeted sites would not change the URL at all. This required me to mimic a user's actions in a test browser. I also ran into challenges with the website detecting my bot and serving me CAPTCHAs.
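Below is a minimal sketch of what that looks like, assuming a hypothetical search box on the AXS landing page; the CSS selector and query are illustrative stand-ins, not the ones from my actual script:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Drive a real Chrome window so the site's JavaScript runs exactly as it
# would for a human visitor.
driver = webdriver.Chrome()
driver.get("https://www.axs.com")

# Wait for a JavaScript-rendered element to appear before interacting with
# it. The selector below is illustrative, not the page's actual markup.
wait = WebDriverWait(driver, 10)
search_box = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='search']"))
)
# Interactions like this can update the page without ever changing the URL,
# which is exactly where a URL-driven framework such as Scrapy falls short.
search_box.send_keys("Brooklyn")
```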
To get past the CAPTCHAs, I increased my time.sleep() calls and dispersed them between a large number of steps in my script. I also realized a VPN would be helpful: every time I kicked off a new process, my requests would no longer come from the same IP address the site had just flagged as a bot.
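As a rough illustration of the idle-time idea, the pauses can be randomized so the script never acts on a fixed clock; the helper below and its bounds are a hypothetical example, not the exact values I used:

```python
import random
import time

def idle(low=2.0, high=6.0):
    """Sleep for a random interval so actions do not fire at a fixed cadence.
    The bounds are illustrative; in practice they get tuned by trial and error."""
    time.sleep(random.uniform(low, high))

# Dispersed between steps in the scraping script, for example:
#   driver.get(event_url)
#   idle()
#   driver.find_element(By.LINK_TEXT, "Tickets").click()
#   idle(4.0, 10.0)
```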
Finally, as an avid user of these websites, I understood that I was running the risk of being banned if caught. It could take months to clear my name through customer service by explaining my project, or I might never be able to use an account on these sites again without circumventing a ban. To avoid these consequences, I made sure to log out of any accounts before scraping. For the same reason, I knew I should not use my home IP address.
Challenges and Solutions:
- JavaScript: mimic user actions via a QA tool (Selenium)
- CAPTCHA: set idle time liberally in the script and use a VPN
- Dynamic element loading: group scrolling and remove duplicates in the script (see the sketch after this list)
- Ban from site: never log in using a primary account and use a VPN
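For the dynamic element loading point, here is a rough sketch of the grouped-scrolling approach: scroll in full-page jumps, harvest rows after each jump, and drop rows that were already captured on an earlier pass. The collect_listings name and the div.listing selector are hypothetical placeholders for the real page structure:

```python
import time

from selenium.webdriver.common.by import By

def collect_listings(driver, pause=2.5, max_scrolls=20):
    """Scroll an infinite-scroll page in large jumps and de-duplicate rows.
    'div.listing' is a hypothetical selector standing in for the real markup."""
    seen, listings = set(), []
    for _ in range(max_scrolls):
        # Harvest whatever is currently rendered; re-rendered rows will show
        # up again, so a set of seen keys filters the duplicates out.
        for row in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
            key = row.text
            if key and key not in seen:
                seen.add(key)
                listings.append(key)
        last_height = driver.execute_script("return document.body.scrollHeight")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded rows time to render
        if driver.execute_script("return document.body.scrollHeight") == last_height:
            break  # page height stopped growing, so nothing new loaded
    return listings
```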
Through this project, I learned a lot about events taking place in my city. In the future, I would like to add artist profiles by connecting to the Spotify API. I am also in the process of building a user interface using Django. This will be complete by the end of April, after we go through the machine learning and big data modules in the bootcamp. I will update this post when those portions of the project are complete.
To review the work I have completed so far, feel free to view my GitHub project repository and project presentation.
Side Note: If any of the companies whose data I am scraping would like me to stop, feel free to let me know by emailing q@kwekuulzen.com. The purpose of my activity is purely exploratory and academic. I am not conducting any revenue-generating activity with your products.