Scraping And Analyzing Ticketing Software Reviews

Posted on Sep 5, 2019
The skills I demoed here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Many different types of ticketing software have sprung up in the past decade. For this project, after coming across the website capterra.com, I wanted to focus on ticketing software for merchants who need a platform to sell tickets to their events.

These merchants pay the ticketing software company varying fees, such as a percentage of each ticket price or a fixed deduction from each ticket sale. Although these metrics could be used to deduce how a company should structure its business flow, I was more interested in the customer (merchant-side) reviews for each of the platforms. Therefore, I delved into each ticketing software company's page and scraped all of its reviews.


Scraping

To scrape the ticketing software companies' review pages, I used Selenium to drive a browser. Initially, I planned to combine the Scrapy package with Selenium, but could not because the review pages load their content dynamically.

After navigating to the main page listing all of the ticketing software companies, I first had the driver sort the page by most reviews (Figure 1).

Figure 1. Driver had to sort companies by most reviews
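
As an illustration, here is a minimal sketch of that sorting step. The listing URL and the selectors are assumptions made for this post; Capterra's actual markup differed and has changed since.

```python
# A minimal sketch of the sorting step; the URL and selectors are
# hypothetical, not Capterra's actual markup.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.capterra.com/ticketing-software/")  # assumed listing URL

# Open the sort menu and choose "Most Reviews" (placeholder selectors).
driver.find_element(By.CSS_SELECTOR, "button.sort-toggle").click()
driver.find_element(By.XPATH, "//li[contains(., 'Most Reviews')]").click()
```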

I then grabbed the URL of each company's review page and stored these links in a csv file; this first step was done in the urls.py file. One complication of this project was that Capterra was running some A/B testing, so I had to omit the companies served a new website layout, as well as those with fewer than five reviews.
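
A sketch of what urls.py might look like; the card and link selectors are hypothetical, and the five-review filter follows the description above.

```python
# Sketch of the urls.py step: collect each company's review-page URL, keep
# only companies with five or more reviews, and save the links to a CSV.
# All selectors and class names are assumed.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.capterra.com/ticketing-software/")  # assumed listing URL

rows = []
for card in driver.find_elements(By.CSS_SELECTOR, "div.product-card"):
    link = card.find_element(By.CSS_SELECTOR, "a.reviews-link").get_attribute("href")
    n_reviews = int(card.find_element(By.CSS_SELECTOR, "span.review-count").text.strip("()"))
    if n_reviews >= 5:  # omit companies with fewer than five reviews
        rows.append([link])

with open("urls.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```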

Subsequently, I read the urls from the csv file in a new script called reviews.py and looped over them so that the Selenium driver visited each company's review page. On each page, I located the elements I wanted to scrape within each review cell, such as the title, name, date, and more. I looped over all of the review cells, compiled the fields into dictionaries, and finally wrote the values out to a csv file called reviews.csv (Figure 2).

Figure 2. Driver scraped various elements of the review
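
A sketch of the reviews.py loop under the same caveat: the review-cell selectors and field names here stand in for whatever the page actually used.

```python
# Sketch of the reviews.py loop: visit each saved URL, pull the fields of
# every review cell into dictionaries, then write reviews.csv.
# Selectors and field names are hypothetical.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
with open("urls.csv") as f:
    urls = [row[0] for row in csv.reader(f)]

records = []
for url in urls:
    driver.get(url)
    for cell in driver.find_elements(By.CSS_SELECTOR, "div.review-card"):
        records.append({
            "title": cell.find_element(By.CSS_SELECTOR, "h3.review-title").text,
            "name": cell.find_element(By.CSS_SELECTOR, "span.reviewer-name").text,
            "date": cell.find_element(By.CSS_SELECTOR, "time").text,
        })

with open("reviews.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "name", "date"])
    writer.writeheader()
    writer.writerows(records)
```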

Data Cleaning

I used a Jupyter notebook to clean my data so that I could visually inspect each column efficiently. After importing the csv files as pandas DataFrames, I concatenated them into one complete dataset so that the cleaning could be performed in one place rather than across many separate datasets. I then fixed column values containing improperly formatted strings using regular expressions.

These included commas and other punctuation marks that had crept into the values during scraping. Many columns also had the column name, followed by a colon, prefixed to the values that needed to be extracted; I took care of this with the string replace function, converting the matching expressions into empty strings.
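
For illustration, a sketch of that cleaning pass; the file pattern and column names are assumptions.

```python
# Sketch of the cleaning pass: concatenate the scraped CSVs into one
# DataFrame, then strip "Column Name:" prefixes and stray punctuation
# with regexes. File and column names are illustrative.
import glob
import pandas as pd

df = pd.concat((pd.read_csv(p) for p in glob.glob("*.csv")), ignore_index=True)

# Remove a "Pros:"-style prefix that the scrape carried into the values.
df["pros"] = df["pros"].str.replace(r"^Pros:\s*", "", regex=True)
# Drop stray commas and quotation marks picked up during scraping.
df["title"] = df["title"].str.replace(r"[\"',]+", "", regex=True)
```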

Once the data was clean, the snippets of code in the Jupyter notebook were compiled into a Python script, enabling a one-step cleaning process by passing in all of the csvs as parameters.
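
The compiled script might look roughly like this, with clean() standing in for the notebook snippets:

```python
# Sketch of the compiled cleaning script: pass all of the CSVs in at once
# and write a single cleaned file. clean() stands in for the notebook code.
import sys
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # ... regex replacements from the notebook go here ...
    return df

if __name__ == "__main__":
    frames = [pd.read_csv(path) for path in sys.argv[1:]]
    clean(pd.concat(frames, ignore_index=True)).to_csv("cleaned.csv", index=False)
```

It would be invoked as, e.g., python clean.py reviews1.csv reviews2.csv.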

Exploratory Data Analysis

For exploratory data analysis, I first looked at the null values present in my data. At first glance there were so many missing values that I suspected an error in the scraping process. On further examination, however, I realized they came from profile information that users had never filled in, and from empty comment categories created when I segmented the comments section into multiple features. Checking the missing values for the overall rating confirmed that my scraping process was successful.
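
A quick sketch of that missingness check, assuming the hypothetical file and column names used above:

```python
# Per-column missingness: profile fields and the segmented comment
# categories are sparse, while the overall rating should not be.
import pandas as pd

df = pd.read_csv("cleaned.csv")
print(df.isnull().sum().sort_values(ascending=False))
print(df["overall_rating"].isnull().mean())  # assumed column name
```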

Subsequently, I examined the median overall rating for each of the companies. A majority of the companies were so saturated with 5-star ratings that the median of their overall ratings ended up being a 5 (Figure 3).

Figure 3. Median overall ratings for all companies
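
A sketch of the per-company median, under the same assumed column names:

```python
# Median overall rating per company; "company" and "overall_rating"
# are assumed column names.
import pandas as pd

df = pd.read_csv("cleaned.csv")
medians = df.groupby("company")["overall_rating"].median()
print(medians.value_counts())  # most companies sit at a median of 5
```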

Therefore, I wanted to check for potential differences between more established companies and those with fewer reviews. I subset the dataset, filtering for companies with 100 or more reviews, and overlaid the plot of the subset on the plot of the complete dataset. The pattern was very similar for both datasets (Figure 4), which showed that the companies that had scaled were operating in a similar fashion to those that were not as big.

Figure 4. Overall ratings for all companies vs. over 100 reviews companies
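
A sketch of that subset-and-overlay comparison, again with assumed column names:

```python
# Keep companies with 100+ reviews and overlay their rating distribution
# on the full dataset's.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("cleaned.csv")
counts = df["company"].value_counts()
big = df[df["company"].isin(counts[counts >= 100].index)]

df["overall_rating"].plot.hist(alpha=0.5, label="all companies")
big["overall_rating"].plot.hist(alpha=0.5, label="100+ reviews")
plt.xlabel("overall rating")
plt.legend()
plt.show()
```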

Because the overall ratings were so similar, I wanted to look at other metrics that could provide more insight into how the users (merchants) felt about a particular company. Inspecting the recommendation levels for all companies and for the subsetted companies showed very similar distributions for both datasets (Figures 5 and 6).

Figure 5. Recommendation levels for all companies
Figure 6. Recommendation levels for companies with 100 or more reviews

Thus, I looked into other ways to classify the reviews. One of the more interesting metrics was the paid status of each review, which took four values: NGC (Nominal Gift Card), NO (no compensation), VRI (vendor referred incentive), and PGC (Non-Nominal Gift Card). According to the descriptions, a VRI was a nominal incentive to write a review, and a PGC was an entry into a raffle in exchange for supplying a review. I plotted the value counts for the paid status feature (Figure 7).

Figure 7. Counts of different paid statuses
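
A sketch of Figure 7, assuming a hypothetical "paid_status" column holding the four codes:

```python
# Counts of reviews per paid status.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("cleaned.csv")
df["paid_status"].value_counts().plot.bar()
plt.ylabel("number of reviews")
plt.show()
```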

The plot showed that NGC and NO were the predominant categories that most reviews fell under. I therefore aggregated the mean overall ratings and recommendations for each paid status. Initially, I thought there would be a great discrepancy; looking at the averages, those for the NO category were noticeably higher than the averages for the NGC category.

I did not take the PGC or VRI categories into consideration: with so few counts (151 and 2, compared to 1,886 and 2,315 for NO and NGC), their averages could have had very high variance.
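
A sketch of that comparison, restricted to the well-populated categories; "recommendation" is an assumed column name:

```python
# Mean overall rating and recommendation per paid status, limited to
# the NO and NGC categories.
import pandas as pd

df = pd.read_csv("cleaned.csv")
subset = df[df["paid_status"].isin(["NO", "NGC"])]
print(subset.groupby("paid_status")[["overall_rating", "recommendation"]].mean())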

Natural Language Processing

Using natural language processing, I wanted to analyze the sentiment of the reviews. The reviews themselves were split into features such as "Pros" and "Cons," so I was able to aggregate the results for each of these categories. Although the scope of my NLP analysis was small, I captured aggregated polarity and subjectivity for each software company's reviews in each category (Figure 8).

Figure 8. Polarity vs. Subjectivity for each Ticketing Platform
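
The post does not name the sentiment library; polarity and subjectivity are the two scores TextBlob reports, so this sketch assumes TextBlob along with the hypothetical column names used above.

```python
# Sample some "Pros" text per company, score polarity and subjectivity
# with TextBlob (an assumption), and average per company.
import pandas as pd
from textblob import TextBlob

df = pd.read_csv("cleaned.csv")

def scores(text):
    s = TextBlob(str(text)).sentiment
    return pd.Series({"polarity": s.polarity, "subjectivity": s.subjectivity})

sample = df.groupby("company", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 30), random_state=0)
)
agg = pd.concat([sample[["company"]], sample["pros"].apply(scores)], axis=1)
print(agg.groupby("company").mean())
```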

One flaw of this approach was that it did not capture the full variety of results across all of the reviews. I had taken a sample of each company's pros, cons, etc. to come up with the aggregated totals, and the reviews that were not part of the sample could have carried much more polarity and subjectivity than the reviews that ended up being chosen.

Further Improvements

There are many methods to improve this project further.

In data analysis, I could take a look at each of the specific ratings while adjusting for the sparsity of their values. The reviewers' positions could also be taken into consideration for further analysis.

For the NLP side of this project, I should have approached the task with a more robust method, namely sampling with more scrutiny. I could also perform the sampling multiple times and aggregate the results in order to minimize the bias of any one sample.

Furthermore, although this project let me practice web scraping and data analysis, I want to delve deeper into the machine learning aspects. Especially with NLP, I would love to implement more sophisticated techniques using neural networks, with TensorFlow as my go-to framework.

Feel free to check out this project on my Github repository!

About Author

Wonchan Kim

Wonchan graduated from the University of California, Los Angeles with a B.S. degree in Materials Engineering and a minor in Statistics in 2019. Wonchan has proven acumen in business applications and experience tailoring data analysis and machine learning...
