Scraping Event Data with Selenium
The skills the author demoed here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Introduction
I have been blessed to grow up in the most convenient period in history for organizing information. I grew up in an era of ubiquitous one-stop-shop aggregators populating the web. When I wanted to watch soccer highlights as a kid, I could venture over to Footytube. In my late teens, Reddit became a great source of user-curated information and data across a wide array of genres.
When I bought my first iPhone in college, Flipboard gave me a launching pad for the biggest headlines of the day from a number of different news sources. When I got tired of my hometown, sites like Skiplagged and Priceline helped me find the best travel deals from a number of different sources.
Finally, as an avid concert attendee, I have drawn inspiration from SeatGeek, which aggregates concert tickets to surface the best deals for local events. For my second project at the NYC Data Science Academy, I put my skills to the test and began building an aggregator of my own. The first step was to put together a web scraper.
Objective
My plan was to scrape ticket sites, all of which have very sophisticated anti-scraping mechanisms in place to deter scalpers. I would like to begin this post by stating that I have no malicious intent to purchase and re-sell an oversized block of tickets to any particular event. There is no way I would have the time to keep up with that kind of business while going through this bootcamp! I began by scraping AXS.com. AXS is a digital marketing platform for purchasing tickets to sports and entertainment events in the US and overseas.
Data Scraping
I decided to use Python's Selenium WebDriver library for scraping since I would encounter a lot of JavaScript features on these sites. Using Scrapy, I would need to navigate the site through URLs; however, some click actions on my targeted sites would not change the URL at all. This required me to mimic a user's actions in a test browser. I also ran into challenges with the website detecting my bot and serving me CAPTCHAs.
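Below is a minimal sketch of what that looks like, assuming a hypothetical search box on the AXS landing page; the CSS selector and query are illustrative stand-ins, not the ones from my actual script:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Drive a real Chrome window so the site's JavaScript runs exactly as it
# would for a human visitor.
driver = webdriver.Chrome()
driver.get("https://www.axs.com")

# Wait for a JavaScript-rendered element to appear before interacting with
# it. The selector below is illustrative, not the page's actual markup.
wait = WebDriverWait(driver, 10)
search_box = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='search']"))
)
# Interactions like this can update the page without ever changing the URL,
# which is exactly where a URL-driven framework such as Scrapy falls short.
search_box.send_keys("Brooklyn")
```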
To get past the CAPTCHAs, I increased my time.sleep() calls and dispersed them between a large number of steps in my script. I also realized a VPN would be helpful: every time I kicked off a new process, my requests would no longer come from the same IP address the site had just flagged as a bot.
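As a rough illustration of the idle-time idea, the pauses can be randomized so the script never acts on a fixed clock; the helper below and its bounds are a hypothetical example, not the exact values I used:

```python
import random
import time

def idle(low=2.0, high=6.0):
    """Sleep for a random interval so actions do not fire at a fixed cadence.
    The bounds are illustrative; in practice they get tuned by trial and error."""
    time.sleep(random.uniform(low, high))

# Dispersed between steps in the scraping script, for example:
#   driver.get(event_url)
#   idle()
#   driver.find_element(By.LINK_TEXT, "Tickets").click()
#   idle(4.0, 10.0)
```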
Finally, as an avid user of these websites, I understood that I was running the risk of being banned if caught. It could take months to clear my name through customer service by explaining my project, or I might never be able to use an account on these sites again without circumventing a ban. To avoid these consequences, I made sure to log out of any accounts before scraping. For the same reason, I knew I should not use my home IP address.
Challenges and Solutions:
- JavaScript: mimic user actions via a QA tool (Selenium)
- CAPTCHA: set idle time liberally in the script and use a VPN
- Dynamic element loading: group scrolling and remove duplicates in the script (see the sketch after this list)
- Ban from site: never log in using a primary account and use a VPN
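For the dynamic element loading point, here is a rough sketch of the grouped-scrolling approach: scroll in full-page jumps, harvest rows after each jump, and drop rows that were already captured on an earlier pass. The collect_listings name and the div.listing selector are hypothetical placeholders for the real page structure:

```python
import time

from selenium.webdriver.common.by import By

def collect_listings(driver, pause=2.5, max_scrolls=20):
    """Scroll an infinite-scroll page in large jumps and de-duplicate rows.
    'div.listing' is a hypothetical selector standing in for the real markup."""
    seen, listings = set(), []
    for _ in range(max_scrolls):
        # Harvest whatever is currently rendered; re-rendered rows will show
        # up again, so a set of seen keys filters the duplicates out.
        for row in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
            key = row.text
            if key and key not in seen:
                seen.add(key)
                listings.append(key)
        last_height = driver.execute_script("return document.body.scrollHeight")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded rows time to render
        if driver.execute_script("return document.body.scrollHeight") == last_height:
            break  # page height stopped growing, so nothing new loaded
    return listings
```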
Through this project, I learned a lot about events taking place in my city. In the future, I would like to add artist profiles by connecting to the Spotify API. I am also in the process of building a user interface using Django. This will be complete by the end of April, after we go through the machine learning and big data modules in the bootcamp. I will update this post when those portions of the project are complete.
To review the work I have completed so far, feel free to view my GitHub project repository and project presentation.
Side Note: If any of the companies whose data I am scraping would like me to stop, feel free to let me know by emailing q@kwekuulzen.com. The purpose of my activity is purely exploratory and academic. I am not conducting any revenue-generating activity with your products.