Scraping Event Data with Selenium

Posted on Mar 26, 2018

I have been blessed to grow up in the most convenient time period in history for organizing information. I grew up in an era of ubiquitous one-stop-shop aggregators populating the web. When I wanted to watch soccer highlights as a kid, I could venture over to Footytube. In my late teens, Reddit became a great source of user-curated information across a wide array of genres. When I bought my first iPhone in college, Flipboard gave me a launching pad for the biggest headlines of the day from a number of different news sources. When I would get tired of my hometown, sites like Skiplagged and Priceline helped me crunch the best travel deals across providers. Finally, as an avid concert attendee, SeatGeek has served as an inspiration for concert ticket aggregation, finding the best deals for local events. For my second project at the NYC Data Science Academy, I put my skills to the test and began building an aggregator of my own. The first step was to put together a web scraper.

My plan was to scrape ticket sites, all of which have very sophisticated anti-scraping mechanisms to deter scalpers. I would like to begin this post by stating that I have no malicious intent of purchasing and re-selling an oversized block of tickets to any particular event. There is no way I would have the time to keep up with that kind of business while going through this bootcamp! I began by scraping AXS.com. AXS is a digital marketing platform for purchasing tickets to sports and entertainment events in the US and overseas.

I decided to use Python's Selenium WebDriver library for scraping since I would encounter a lot of JavaScript features on these sites. With Scrapy, I would have needed to navigate the site through URLs; however, some click actions on my target sites did not change the URL at all. This required me to mimic a user's actions in a test browser. I also ran into challenges with the website detecting my bot and serving me CAPTCHAs.
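To give a flavor of this approach, here is a minimal sketch of reading a JavaScript-rendered listing page through a real browser with Selenium. The `.event-card` selector, the example URL, and the fixed pause are hypothetical placeholders for illustration, not AXS's actual markup or my production script.

```python
import time

def collect_event_titles(driver, url, pause=3.0):
    """Load a listing page in a real browser and read the
    JavaScript-rendered event cards. `driver` is any Selenium
    WebDriver instance, e.g. webdriver.Chrome()."""
    driver.get(url)
    time.sleep(pause)  # crude wait for the JavaScript to render the cards
    # "css selector" is the raw locator strategy string (same as By.CSS_SELECTOR);
    # ".event-card" is a hypothetical selector -- inspect the page for the real one
    cards = driver.find_elements("css selector", ".event-card")
    return [card.text for card in cards]
```

In practice, Selenium's `WebDriverWait` with expected conditions is a more robust way to wait for dynamic content than a fixed sleep; the sketch keeps a sleep here to match the pacing approach described below.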

To work around this, I increased the duration of my time.sleep() calls and dispersed them between a large number of steps in my script. I also realized a VPN would be helpful: every time I kicked off a new process, the site would not see my requests coming from the same IP address it had just flagged as a bot.
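The pacing idea can be sketched as a small helper: rather than one fixed time.sleep() value, every step pauses for a randomized interval so requests do not arrive on a machine-like clock. The bounds below are arbitrary illustrative choices, not the values from my actual script.

```python
import random
import time

def polite_pause(min_s=2.0, max_s=6.0):
    """Sleep for a random interval between min_s and max_s seconds,
    and return the delay that was actually used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Sprinkling `polite_pause()` between page loads, clicks, and scrolls spreads the traffic out; a VPN then takes care of not reusing an IP address the site has already flagged.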

Finally, I am an avid user of these websites, and I understood that I was running the risk of being banned if caught. It could take months to clear my name through customer service by explaining my project, or I might never be able to use an account on these sites again without circumventing a ban. To avoid these consequences, I made sure to log out of any accounts before scraping. I also knew I should not use my home IP address, for the same reason.

Challenges:

  1. JavaScript
  2. CAPTCHA
  3. Dynamic Element Loading
  4. Ban from Site

Solutions:

  1. Mimic User Actions Via QA Tool (Selenium)
  2. Set Idle Time Liberally in Script and Use VPN
  3. Group Scrolling and Remove Duplicates in Script
  4. Never Log In Using Primary Account and Use VPN
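Solution 3 above can be sketched as two small helpers: scrolling in grouped batches so lazily loaded results have time to render, then dropping the duplicate records that repeated passes over the same list produce. The `(title, date, venue)` record shape and the sample values are illustrative, not AXS's actual schema.

```python
import time

def scroll_in_batches(driver, batches=5, pause=1.5):
    """Scroll a Selenium-driven page to the bottom in grouped steps,
    pausing between batches so dynamically loaded events can render."""
    for _ in range(batches):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

def dedupe_events(events):
    """Remove duplicate event records while preserving first-seen order."""
    seen = set()
    unique = []
    for event in events:
        key = (event["title"], event["date"], event["venue"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Grouping the scrolls means the same event cards get picked up more than once across batches, which is why the dedup pass runs afterward in the script.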

Through this project, I learned a lot about events taking place in my city. In the future I would like to add artist profiles by connecting to the Spotify API. I am also in the process of building a user interface with Django. This will be complete by the end of April, after we go through our machine learning and big data modules in the bootcamp. I will update this post when those portions of the project are complete.

To review my work completed so far, feel free to view my GitHub project repository and project presentation.

Side Note: If any of the companies whose data I am scraping would like me to stop, feel free to let me know by emailing [email protected]. The purpose of my activity is purely exploratory and academic. I am not conducting any revenue-generating activity with your products.

 

About Author


Kweku Ulzen

A lover of technology and data, Kweku is always interested in exploring how the two intersect to affect society and provide insights into the most pressing and interesting issues. He is a graduate of the University of Alabama...


