Web Scraping: Building an App to Find the Perfect Tennis String
I scraped tennis string review data and built an app that allows players to find the string best suited to their particular preferences, skill level and playing style.
My app is an improvement over any resource previously available for researching and ranking tennis strings because:
- Users can filter reviews based on string, reviewer and racquet criteria so that rankings are based only on relevant data. With 17,500+ total reviews, there is room for significant filtering while still leaving enough data for an accurate ranking.
- Rankings are based on a weighted average of all the string characteristics, rather than only one characteristic, and scores are easily interpretable. I also implement a second way of ranking strings, based on adjectives in their reviews.
- Users can sort and visualize reviews for a selected string, and get an analysis of how the string compares within filtered and full datasets.
--- (CLICK HERE TO SEE THE APP) ---
Problem: Choosing the right tennis string is complicated and highly specific
Finding the ideal tennis racquet string is a challenge for many players because there are thousands of different types available, varying in material, construction, shape/texture and thickness. To add confusion, two strings with the same ‘specs’ can play very differently, and the same string can play differently when used by different people or strung at a different tension. Because of this dizzying array of possibilities, many players just ask their stringer to pick a string for them and do not put much thought into optimizing this important piece of equipment – the only part that actually touches the ball during play.
For players who do try to find the best string for their game, the only thing to do is test out a variety of strings until you find what works. As an avid tennis player who tinkers with different string combinations, I have found stringforum.net to be the best resource for finding strings to try because the site has so many reviews (17500+ reviews by 4400+ unique reviewers) and covers almost every string on the market (2350+ varieties). However, even though it is the best resource currently available, the website has very basic search, filtering and ranking capabilities that limit its usefulness.
People don’t have the right tool to help them research and rank strings, but the data are out there to build one
This is especially tragic because the site gathers great data! For each review, stringforum not only collects information about the string (ratings across seven categories, an overall satisfaction rating, the adjectives that best describe the string, and a text review), but also about the reviewer (gender, age bracket, playing style, ability level, swing speed, and how much spin they use in their strokes), and the racquet used in testing (manufacturer, model, frame size, string pattern, string tension level). In addition to data gathered from reviews, the site also has general information about the strings (price, thickness, material, construction and features).
Solution: So let’s scrape the data and build an app!
I scraped review data from stringforum and built an app that leads users through a three step process for finding the right tennis string:
- User filters reviews according to string, tester and racquet criteria, leaving the relevant ones. Only these filtered reviews are used for rankings.
- User inputs weights for desired and undesired string attributes, and ranks strings based on these weights.
- User views detailed review information about the highest ranked strings in order to select ones to test.
Not all tennis string reviews are relevant to all players. A beginner uses strings under very different conditions than an expert, so the opinions of one may not be informative for the other. The same goes for players using heavy vs. light spin, slow vs. fast swing speeds, and many of the other tester attributes. A useful string ranking system would allow users to filter reviews to select only those from players similar to themselves (as long as they leave enough data for an accurate analysis).
The same also goes for many of the racquet and string attributes. Users whose racquets are strung at low tensions would want to filter out reviews by testers using high tensions. Users who like thinner-gauge strings would want to filter out reviews about thicker-gauge strings. I'm sure most users would want to filter by price.
My app allows users to filter the dataset by 20 review criteria (7 string criteria, 7 tester criteria and 6 racquet criteria). The table in the lower panel is dynamically updated when the user adjusts preferences in any of the three input tabs.
Reviews on stringforum include ratings in eight categories, which I will call 'string characteristics', and users of the website can rank strings by any of them: ‘comfort’, ‘control’, ‘durability’, ‘feel’, ‘power’, ‘spin’, ‘tension stability’ and ‘overall rating’. The site's rankings are not very useful, however, because users can only sort by one characteristic at a time. This would be fine for a player interested in only maximizing control or only maximizing power, but any player who has preferences for more than one characteristic is out of luck.
The problem is, every tennis player I know has some degree of preference for all the characteristics - the only question is how much. Instead of asking which characteristic the player prefers, a better ranking system would list all the characteristics and ask the user which weights to put on each. One player may place a high emphasis on comfort and control, low emphasis on durability and power, and medium emphasis on the others. Another may assign entirely different weights. The point is that it's natural to take all the characteristics into account when deciding what makes for a good string, and a ranking system should reflect this reality.
It’s also a shame that users on stringforum aren’t able to rank strings based on adjectives. Each review includes a list of adjectives to describe the string being evaluated, from a list of 22 possibilities (e.g., ‘soft’, ‘lively’, ‘explosive’, ‘spongy’, ‘springy’, ‘stiff’, ‘precise’, ‘dull’, ‘boring’). Since there are a finite number of options, it would be easy to rank strings according to how often reviewers chose to describe them by an adjective. Just like for string characteristics, users should be given a list of all 22 adjectives, asked to provide weights for each, and get an individualized ranking based on those preferences. In this case, however, the user should be able to provide negative weights in case he/she wants to penalize strings for certain adjectives.
My app allows users to rank strings in three ways: using characteristics, adjectives or both. Users are able to input preferred weights for up to 30 categories and view a ranked table. The scores are easily interpretable and color coded to show percentile.
After getting a ranking, it's time for the user to look at detailed review information for the top strings.
The best format for reading reviews is a single table that displays all the review data in separate columns. This way, the user can sort the dataset by any desired variable and find out, for example, what reviewers who rated a string poorly for spin had to say about it (and also scan info about those reviewers and their racquets to spot patterns).
It would be nice to also give the user a graphical representation of the review text. Word clouds aren't the most informative visualizations, but they are easy to implement and are suited to this case.
For the characteristics and adjectives ratings, users should be able to view an analysis of how a selected string compares with the filtered and full datasets. The comparison should be displayed in both percent and absolute terms, with percentile and z-score being my choices (I prefer z-score over rank, both conceptually and aesthetically, because users can interpret it without seeing the sample size).
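As a rough illustration of the two comparison measures, here is a minimal Python sketch (the app itself is written in R). The function names, the sample scores, the use of the population standard deviation and the midpoint rule for ties are all assumptions made for the example, not details taken from the app.

```python
from statistics import mean, pstdev


def z_score(value, population):
    """Number of standard deviations that `value` lies from the population mean."""
    sigma = pstdev(population)  # population standard deviation (an assumption)
    if sigma == 0:
        return 0.0  # every string has the same score, so there is no spread
    return (value - mean(population)) / sigma


def percentile_rank(value, population):
    """Percent of scores below `value`, counting ties as half (an assumption)."""
    below = sum(1 for v in population if v < value)
    equal = sum(1 for v in population if v == value)
    return 100.0 * (below + 0.5 * equal) / len(population)


# Hypothetical mean 'spin' scores (0-100) for five strings in a filtered dataset.
spin_scores = [40.0, 50.0, 60.0, 70.0, 80.0]
z = z_score(80.0, spin_scores)
pct = percentile_rank(80.0, spin_scores)
```

Running the same two functions against the filtered and the full dataset would give the two comparisons described above.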
My app displays detailed review data for a selected string in four ways: a table for reading reviews, a word cloud, and separate tables for characteristics and adjectives analyses. As with the ratings table, scores are color coded by percentile.
You can find the code at my GitHub page
For scraping the review data, I used Scrapy, which is a web crawling framework in Python. The main task was to instruct a web crawling 'spider' how to navigate through the site URLs, and provide it the XPath code to identify the data to collect on each page. The review pages on the site followed a predictable URL pattern and were organized in tables, which made the job relatively straightforward. One slight twist was that the site often encodes its data as symbols (smiley faces, plusses, etc.) rather than text or numbers, so I had to identify those and encode them as numbers.
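To give a feel for the extraction step, here is a small sketch that pulls values out of a table row with XPath-style queries and turns symbols into numbers. It uses Python's standard-library ElementTree rather than Scrapy so that it runs on its own, and the markup, class names and `alt` values are hypothetical stand-ins, not the real stringforum page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified review row; the real page markup is not shown here.
ROW = """
<table>
  <tr class="review">
    <td class="string">Example String 1.25</td>
    <td class="comfort"><img alt="++"/></td>
    <td class="power"><img alt="-"/></td>
    <td class="spin"><img alt="o"/></td>
  </tr>
</table>
"""


def symbol_to_number(symbol):
    """Map '+++' to 3, '--' to -2, 'o' (neutral) to 0, anything else to None."""
    if symbol == "o":
        return 0
    if symbol and set(symbol) == {"+"}:
        return len(symbol)
    if symbol and set(symbol) == {"-"}:
        return -len(symbol)
    return None


def parse_row(row):
    """Return the string name and a dict of numeric ratings for one table row."""
    name = row.find("./td[@class='string']").text.strip()
    ratings = {}
    for cell in row.findall("./td"):
        img = cell.find("./img")
        if img is not None:  # only symbol cells carry an <img>
            ratings[cell.get("class")] = symbol_to_number(img.get("alt", ""))
    return name, ratings


table = ET.fromstring(ROW)
name, ratings = parse_row(table.find(".//tr[@class='review']"))
```

In a Scrapy spider the same idea would live inside the parse callback, with the response's XPath selectors doing the work that ElementTree does here.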
The initial dataset, fresh from scraping, had 17517 observations (reviews) of 19 variables. After wrangling, the variables almost tripled to 55. Much of the work involved text string manipulation - extracting the various pieces of tester, racquet and tennis string information as separate variables. I also created separate variables for each of the 22 adjective choices (this allows for faster and more efficient processing than working with a single variable containing all the selected adjectives).
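The adjective step could look something like the sketch below, which splits one text field into a 0/1 flag per adjective. The comma delimiter, the function name and the five-adjective subset are assumptions for illustration; the real dataset has 22 adjectives and its raw format may differ.

```python
# A subset of the 22 adjectives, for brevity.
ADJECTIVES = ["soft", "lively", "stiff", "precise", "dull"]


def adjective_flags(selected, adjectives=ADJECTIVES):
    """Turn a comma-separated adjective field into one 0/1 flag per adjective."""
    chosen = {part.strip().lower() for part in selected.split(",") if part.strip()}
    return {adj: int(adj in chosen) for adj in adjectives}


flags = adjective_flags("soft, precise")
no_flags = adjective_flags("")
```

With one column per adjective, counting or averaging an adjective across reviews becomes a simple column operation instead of repeated text searches.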
I allowed the user to decide how to deal with missing data. For each of the 20 criteria in the filtering section, the user is given a choice whether to include or exclude reviews with missing values.
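A minimal sketch of that include/exclude choice, in Python rather than the app's R: each criterion carries the set of allowed values plus a flag saying whether reviews with a missing value pass. The field names, sample reviews and the `criteria` structure are hypothetical.

```python
def keep_review(review, field, allowed, include_missing):
    """Decide whether one review passes one filter criterion."""
    value = review.get(field)
    if value is None:  # missing data: follow the user's choice
        return include_missing
    return value in allowed


def filter_reviews(reviews, criteria):
    """Keep reviews that pass every criterion: field -> (allowed values, include_missing)."""
    return [
        review
        for review in reviews
        if all(
            keep_review(review, field, allowed, include_missing)
            for field, (allowed, include_missing) in criteria.items()
        )
    ]


reviews = [
    {"string": "A", "ability": "advanced", "spin": "heavy"},
    {"string": "B", "ability": None, "spin": "heavy"},
    {"string": "C", "ability": "beginner", "spin": None},
]
criteria = {"ability": ({"advanced"}, True), "spin": ({"heavy"}, False)}
kept = [review["string"] for review in filter_reviews(reviews, criteria)]
```

Here review B survives because missing ability levels are included, while C is dropped for not matching the ability filter.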
On the stringforum site, ratings for the string characteristics were displayed as plusses or minuses, with the number designating the degree (three plusses meant 'amazing', three minuses meant 'terrible', and a white circle was 'neutral'). When scraping the data, I encoded these on a scale from -3 to +3, for the number of plusses or minuses (neutral was 0). However, I wanted to convert these to a more intuitive scale in the app.
My solution was to convert these numbers to a %max. This way, when a user sees a score of 100, it's intuitive that the string has a perfect score (whereas a score of 3 could mean anything without context), and a score of 0 is the lowest score (whereas 0 was the middle score in the earlier encoding). The scale is also easily interpretable. A score of 44, for example, means that the string received exactly 44% of the maximum possible score for the metric.
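One way to read the %max conversion is as a linear rescale of the -3 to +3 range onto 0 to 100. The sketch below assumes exactly that; the function name and the linear formula are my illustration, not code taken from the app.

```python
def to_percent_of_max(rating, low=-3, high=3):
    """Linearly rescale a rating from [low, high] onto a 0-100 scale."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside [{low}, {high}]")
    return 100.0 * (rating - low) / (high - low)


worst = to_percent_of_max(-3)    # three minuses
neutral = to_percent_of_max(0)   # white circle
best = to_percent_of_max(3)      # three plusses
```

Under this reading, the old neutral value of 0 lands at 50, and a string's mean rating can be rescaled the same way to get its displayed score.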
Adjective scores need to be encoded because these will also be used for ranking, and this encoding obviously needs to be in percentage terms (an absolute score would not be fair to strings with fewer reviews).
The simplest option would be to list all the reviews for a string and calculate the '% of reviews' in which each adjective is mentioned. Another, similar, option is to list all the adjectives for a string and calculate the '% of adjectives' for each. However, I chose not to use either of these methods because I wanted every review to count the same towards the string's overall scores. With both these methods, a review that selects two adjectives would count twice as much as a review with one adjective (and, in the extreme case, a review with all 22 adjectives selected would count 22 times as much!). I did not want to give some reviewers more influence than others.
Instead of these methods, I calculated the scores as a '% of vote'. I treated each review as a vote where the user can choose to allocate 100 points among adjectives. If a reviewer selects only one adjective, 100% of the vote goes to it. If a user selects two adjectives, 50% of the vote goes to each, and so on. This way, each review counts the same, but the user can choose to spread that vote over several adjectives.
Notice that, for each review, the scores of all 22 adjectives add up to 100. This also holds true when all the reviews for a string are averaged, which makes for easy interpretation: you could just read the scores as percentages. For example, a user could read the adjective scores for a string and interpret them as: the reviewers voted this string 6% soft, 8% comfortable, 11% precise, and so on, adding up to 100%.
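The '% of vote' calculation can be sketched in a few lines of Python. The function names and sample reviews are made up for the example, and the sketch assumes every review selects at least one adjective (a review with none would contribute zero points and break the add-up-to-100 property).

```python
def vote_shares(selected, adjectives):
    """Split one review's 100 points evenly among the adjectives it selected."""
    shares = dict.fromkeys(adjectives, 0.0)
    for adj in selected:
        shares[adj] = 100.0 / len(selected)
    return shares


def string_adjective_scores(reviews, adjectives):
    """Average the per-review vote shares over all reviews of one string."""
    totals = dict.fromkeys(adjectives, 0.0)
    for selected in reviews:
        for adj, share in vote_shares(selected, adjectives).items():
            totals[adj] += share
    return {adj: total / len(reviews) for adj, total in totals.items()}


adjectives = ["soft", "precise", "stiff", "lively"]
# Four hypothetical reviews of the same string, each a list of chosen adjectives.
reviews = [["soft"], ["soft", "precise"], ["precise", "stiff"], ["stiff"]]
scores = string_adjective_scores(reviews, adjectives)
```

Each of the four reviews contributes exactly 100 points, so the averaged scores still sum to 100 and can be read as percentages.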
Building the App
I built the app using the Shiny package for R and the shinydashboard sidebar layout. The three menu items in the sidebar are Review Criteria, String Rankings, and String Profiles, and they correspond to the three core functions of the app (filter, rank and research).
Ranking - Weights and Defaults
For characteristics, users are allowed to assign weights from 0 - 10, with a default of 5. This means that each characteristic counts a medium amount in the overall rankings by default and the user can decide if it should, instead, count for nothing, a small amount or a large amount. It makes sense that the default is medium because all eight are important components of good strings. A user who uses the default ranking would still get a perfectly acceptable (although bland) ranking, with all characteristics weighted equally.
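Here is a minimal sketch of a weighted-average ranking over characteristics, written in Python for illustration (the app is in R). The function names, the three-characteristic subset and the sample scores are assumptions; dividing by the total weight is my reading of 'weighted average'.

```python
def weighted_score(scores, weights):
    """Weighted average of a string's characteristic scores (each 0-100)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # the user zeroed every weight, so nothing counts
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank_strings(strings, weights):
    """Return string names sorted from highest to lowest weighted score."""
    return sorted(
        strings,
        key=lambda name: weighted_score(strings[name], weights),
        reverse=True,
    )


# Hypothetical mean scores for two strings on three of the eight characteristics.
strings = {
    "A": {"comfort": 80.0, "control": 60.0, "power": 40.0},
    "B": {"comfort": 50.0, "control": 70.0, "power": 90.0},
}
default_weights = {"comfort": 5, "control": 5, "power": 5}
comfort_first = {"comfort": 10, "control": 5, "power": 0}

default_ranking = rank_strings(strings, default_weights)
custom_ranking = rank_strings(strings, comfort_first)
```

With the default weights, string B comes out on top; once comfort is weighted at 10 and power at 0, string A overtakes it.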
For the adjectives, users are allowed to assign weights from -2 to 2, with a default of 0. This means that no adjectives count in the overall rankings by default, and the user must specifically choose which ones should count at all, how strongly, and whether to reward or penalize a string for having reviews with that adjective. With 22 adjectives, and some of them negative, it makes sense to let the user initiate the ranking and not count any by default.
Rankings - Output Tables
For string rankings, the user has a choice of three output tables: ranking by characteristics, ranking by adjectives, and ranking by both. If the user chooses to display the combined ranking, a panel appears asking for weights to assign each of the components (characteristics and adjectives) on a scale of 1-10.
I gave the user the choice of which table to display because the adjectives table, although fun to play with, does not work well on its own - it’s more useful as part of a combined ranking. I have found that a combined ranking providing a high weight for characteristics and a low one for adjectives provides good results.
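The post does not spell out the exact formulas, so the following Python sketch shows just one plausible way to score adjectives and blend the two components. Both formulas (a weighted sum of vote shares, then a weighted average of the two component scores) and all names and numbers are assumptions for illustration.

```python
def adjective_score(shares, weights):
    """Sum of each adjective's % of vote times its user weight (-2 to 2)."""
    return sum(shares.get(adj, 0.0) * w for adj, w in weights.items())


def combined_score(char_score, adj_score, char_weight, adj_weight):
    """Weighted average of the two component scores (component weights 1-10)."""
    total = char_weight + adj_weight
    return (char_weight * char_score + adj_weight * adj_score) / total


# Hypothetical % of vote for one string, and a user who rewards 'soft'
# and penalizes 'dull'; unweighted adjectives default to 0 and drop out.
shares = {"soft": 30.0, "dull": 10.0, "precise": 20.0}
adj_weights = {"soft": 2, "dull": -2}

adj = adjective_score(shares, adj_weights)
combined = combined_score(70.0, adj, char_weight=9, adj_weight=1)
```

The example uses a high weight for characteristics and a low one for adjectives, mirroring the combination that worked well in practice.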
Rankings - Color Coding by Percentile
The cell backgrounds of each String Rankings output table are colored according to that string’s percentile within the filtered dataset, in 5% increments. If the mean value for a string is above the 55th percentile for a metric, its cell will be green. If it’s below the 45th percentile, then its cell will be red. The middle percentiles are white, and shades of color get darker toward the extremes.
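The banding logic can be sketched as a small function that maps a percentile to a hue and a shade number. The exact band edges (what happens exactly at 45 and 55, and nine shades per color) are assumptions; only the 5% increments, the white middle and the green/red split come from the description above.

```python
import math


def color_band(percentile):
    """Map a percentile (0-100) to (hue, shade); shade 0 is white, 9 is darkest."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if 45 <= percentile <= 55:
        return ("white", 0)
    if percentile > 55:
        # One shade per 5 points above the 55th percentile.
        return ("green", math.ceil((percentile - 55) / 5))
    # One shade per 5 points below the 45th percentile.
    return ("red", math.ceil((45 - percentile) / 5))


middle = color_band(50)
slightly_good = color_band(56)
top = color_band(100)
bottom = color_band(0)
```

In the app, each (hue, shade) pair would correspond to one background color in the table's styling.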
I succeeded in building an app that significantly improves upon the best resource previously available for researching and ranking tennis racquet strings. However, the app is still a prototype and its functionality and user interface can be further improved.
In terms of functionality, a 'string comparisons' feature showing a head-to-head analysis between strings would be useful. So would a 'find similar strings' feature, for when a user has a string he/she likes and wants to find other ones like it. 'Tester profile' and 'racquet profile' menu items would allow a user to play around with the data and explore how a particular tester rated different strings, and how a particular racquet brand or model performed against others. I also hope to add an 'EDA' menu item that allows users to visualize and plot the data.
There are several tweaks to make in terms of the user interface - moving some information out of tables and into information boxes is one obvious improvement, and the general 'look' could benefit from some text styling and CSS wrappers. I am eager to make these changes as I continue to develop the app.
For this project I explored using the scraped data for building the string finder app, but the same data could be used, for example, to gain insights about what players of different types like and dislike about tennis strings. This type of information would be fun to extract, and useful for manufacturers and marketers to know. I hope to pursue an analysis along these lines in another blog post.