Scraping And Analyzing Ticketing Software Reviews
The skills demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Many different types of ticketing software have sprung up in the past decade. For this project, after coming across the website capterra.com, I wanted to focus on ticketing software aimed at merchants who need a platform to sell tickets to their events.
These merchants pay the ticketing software company varying fees, such as a percentage of each ticket or a fixed deduction from each ticket sale. Although these metrics could be used to deduce how a company should structure its business, I was more interested in the customer (merchant-side) reviews of each platform. I therefore delved into each ticketing software company's page and scraped all of its reviews.
To scrape the ticketing software companies' review pages, I used Selenium as my driver. Initially, I planned to combine the Scrapy package with Selenium, but the dynamic loading of the review pages ruled that out.
After navigating to the main page listing all of the ticketing software companies, I first had the driver sort the page by most reviews (Figure 1).
I then grabbed the URL of each company's review page and stored these links in a csv file; this first step was done in the urls.py file. One complication was that Capterra was running A/B tests, so I had to omit the companies shown a new website layout as well as those with fewer than five reviews.
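The urls.py step can be sketched as follows. This is a minimal illustration, not the post's actual code: the CSS selectors, the category URL, and the function names are all assumptions, since Capterra's real markup is not shown here. The Selenium import is deferred into the scraping function so the csv helper works without a browser.

```python
import csv

def save_urls(urls, path):
    """Write one review-page URL per row to a csv file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])

def collect_review_urls(category_url):
    """Visit the category page, sort by most reviews, and return each
    company's review-page link.  The selectors below are placeholders;
    the live page's markup will differ."""
    from selenium import webdriver          # deferred: only needed when scraping
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(category_url)
    driver.find_element(By.CSS_SELECTOR, "a.sort-most-reviews").click()
    links = driver.find_elements(By.CSS_SELECTOR, "a.reviews-link")
    urls = [link.get_attribute("href") for link in links]
    driver.quit()
    return urls

if __name__ == "__main__":
    save_urls(collect_review_urls("https://www.capterra.com/ticketing-software/"),
              "urls.csv")
```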
Subsequently, I read the URLs from the csv file in a new script called reviews.py and looped over them so that the Selenium driver visited each company's review page. From there, I located all of the elements I wanted to scrape within each review cell, such as the title, name, date, and more. I repeated this for every review cell, compiled the results into a dictionary, and wrote the values out to a csv file called reviews.csv (Figure 2).
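A sketch of the reviews.py loop is below. Again, the selectors, field names, and function names are illustrative assumptions rather than the post's real code; only the overall shape (visit each page, extract fields per review cell, write reviews.csv) follows the text.

```python
import csv

FIELDS = ["company", "title", "name", "date", "overall_rating"]

def write_reviews(rows, path="reviews.csv"):
    """Write a list of review dicts to a csv with a fixed header."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

def scrape_company(driver, url, company):
    """Visit one company's review page and extract each review cell.
    The selectors are placeholders, not Capterra's real markup."""
    from selenium.webdriver.common.by import By   # deferred import
    driver.get(url)
    rows = []
    for cell in driver.find_elements(By.CSS_SELECTOR, "div.review-card"):
        rows.append({
            "company": company,
            "title": cell.find_element(By.CSS_SELECTOR, "h3").text,
            "name": cell.find_element(By.CSS_SELECTOR, ".reviewer-name").text,
            "date": cell.find_element(By.CSS_SELECTOR, ".review-date").text,
            "overall_rating": cell.find_element(By.CSS_SELECTOR, ".rating").text,
        })
    return rows
```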
I used a Jupyter notebook to clean my data so that I could visually inspect each of the columns efficiently. After importing the csv files as pandas DataFrames, I concatenated them so that the cleaning could be performed on one complete dataset rather than many separate ones. I fixed improperly formatted string values using regular expressions. These included commas and other punctuation marks that had crept into the values during scraping. Many columns also had the column name and a colon prefixed to the values that needed to be extracted; I removed these with the string replace function, converting matching expressions into empty strings.
After the data was cleaned, the snippets of code in the Jupyter notebook were compiled into a Python script so that all of the csvs could be passed in as parameters and cleaned in a single pass.
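The cleaning steps above can be sketched with pandas. The column names here ("name", "date", "overall_rating") are assumptions for illustration; the actual scraped schema may differ.

```python
import pandas as pd

def clean_reviews(frames):
    """Concatenate per-company DataFrames and normalize string columns."""
    df = pd.concat(frames, ignore_index=True)
    # Remove "Column Name:" prefixes scraped along with the values,
    # e.g. "Date: June 2019" -> "June 2019".
    for col in ("date", "overall_rating"):
        df[col] = df[col].str.replace(r"^[A-Za-z ]+:\s*", "", regex=True)
    # Strip stray commas and semicolons picked up during scraping.
    df["name"] = df["name"].str.replace(r"[,;]", "", regex=True)
    return df
```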
Exploratory Data Analysis
For exploratory data analysis, I first looked at the null values present in my data. At first glance there were so many missing values that I suspected an error in the scraping process. On further examination, however, I realized they stemmed from missing profile information (a lack of user input) and from missing comment categories, since I had segmented the comments section into multiple features. By checking the missing values for the overall rating, I was able to verify that my scraping process was successful.
Subsequently, I examined the median overall rating for each of the companies. A majority of the companies were so saturated with 5-star ratings that the median of their overall ratings was a 5 (Figure 3).
Therefore, I wanted to check for potential differences between more reputable companies and those with fewer reviews. I subset the dataset, filtering for companies with 100 or more reviews, and overlaid the plot of the subset on the plot of the complete dataset. The pattern was very similar for both (Figure 4), which suggested that the companies that had scaled were operating in much the same fashion as those that had not.
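The per-company medians and the 100-review subset can be sketched in pandas, assuming a cleaned DataFrame with "company" and numeric "overall_rating" columns:

```python
import pandas as pd

def median_ratings(df):
    """Median overall rating per company (most came out as a flat 5)."""
    return df.groupby("company")["overall_rating"].median()

def subset_large(df, min_reviews=100):
    """Keep only reviews from companies with at least `min_reviews` reviews."""
    counts = df["company"].value_counts()
    return df[df["company"].isin(counts[counts >= min_reviews].index)]
```

The Figure 4 overlay then amounts to plotting `median_ratings(df)` and `median_ratings(subset_large(df))` on the same axes.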
Because the overall ratings were so similar, I looked at other metrics that could provide more insight into how the users (merchants) felt about a particular company. Comparing the recommendation levels for all companies against those for the subsetted companies, the distributions again looked very similar (Figures 5 and 6).
Thus, I looked for other ways to classify the reviews. One of the more interesting metrics was the paid status of each review, which took four values: NGC (Nominal Gift Card), NO (no compensation), VRI (vendor referred incentive), and PGC (Non-Nominal Gift Card). According to the descriptions, VRI was a nominal incentive for writing a review, and PGC was an entry into a raffle for supplying one. I plotted the value counts for the paid status feature (Figure 7).
The plot showed that NGC and NO were the predominant categories. I therefore aggregated the mean overall rating and mean recommendation level for each paid status. I had initially expected a great discrepancy; in the end, the averages for the NO category were noticeably higher than those for the NGC category.
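This aggregation is a one-line groupby in pandas. The column names ("paid_status", "overall_rating", "recommendation") are assumed for illustration:

```python
import pandas as pd

def means_by_paid_status(df):
    """Mean overall rating and recommendation level for each paid status."""
    return df.groupby("paid_status")[["overall_rating", "recommendation"]].mean()
```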
I did not take the PGC or VRI categories into consideration: with so few reviews (151 and 2, compared with 1886 and 2315 for NO and NGC), their averages could have carried very high variance.
Natural Language Processing
With natural language processing, I wanted to analyze the sentiment of the reviews. The reviews were split into features such as "Pros" and "Cons," so I was able to aggregate results per category. Although the scope of my NLP analysis was small, I captured aggregated polarity and subjectivity for each software company's reviews in each category (Figure 8).
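The post does not name its sentiment library, but TextBlob is one tool that reports exactly these two scores (polarity and subjectivity), so a sketch under that assumption looks like this. The TextBlob import is deferred so the aggregation helper works on its own.

```python
import statistics

def aggregate_scores(scores):
    """Average a list of (polarity, subjectivity) pairs into one pair."""
    pol, subj = zip(*scores)
    return statistics.mean(pol), statistics.mean(subj)

def score_texts(texts):
    """Score each review snippet with TextBlob (assumed library;
    deferred import keeps aggregate_scores usable without it)."""
    from textblob import TextBlob
    return [(TextBlob(t).sentiment.polarity, TextBlob(t).sentiment.subjectivity)
            for t in texts]
```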
One flaw in this approach was that it did not capture the full variety of the reviews. I had taken a sample of the reviews' pros, cons, etc. to compute the aggregated total for each company, but reviews outside the sample could have carried far more polarity and subjectivity than those that were chosen.
There are many ways to improve this project further.
For the data analysis, I could look into each of the specific ratings while adjusting for the sparsity of the values. The reviewers' positions could also be taken into account in further analysis.
For the NLP side of this project, I should have approached the task with a more robust method, namely sampling with more scrutiny. I could also sample multiple times and aggregate the results to minimize the bias of any single sample.
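The repeated-sampling idea can be sketched as a simple bootstrap over sentiment scores. The function name and parameters are illustrative, not from the post:

```python
import random
import statistics

def bootstrap_mean(scores, n_rounds=1000, sample_size=None, seed=0):
    """Resample the sentiment scores n_rounds times and average the
    per-sample means, reducing dependence on any single sample."""
    rng = random.Random(seed)
    sample_size = sample_size or len(scores)
    means = [statistics.mean(rng.choices(scores, k=sample_size))
             for _ in range(n_rounds)]
    return statistics.mean(means)
```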
Furthermore, although this project let me practice web scraping and data analysis, I want to delve deeper into the machine learning aspects. Especially for NLP, I would love to implement more sophisticated techniques using neural networks, with TensorFlow as my go-to framework.
Feel free to check out this project on my Github repository!