Data Web Scraping: Finding the Perfect Tennis String

Iman Singh
Posted on Mar 8, 2018

Summary

I scraped tennis string review data and built an app that allows players to find the string best suited to their particular preferences, skill level and playing style.

My app is an improvement over any resource previously available for researching and ranking tennis strings because:

  1. Users can filter reviews based on string, reviewer and racquet criteria so that rankings are based only on relevant data. With 17,500+ total reviews, there is room for significant filtering while still leaving enough data for an accurate ranking.
  2. Rankings are based on a weighted average of all the string characteristics, rather than only one characteristic, and scores are easily interpretable. I also implement a second way of ranking strings, based on adjectives in their reviews.
  3. Users can sort and visualize reviews for a selected string, and get an analysis of how the string compares within filtered and full datasets.

--- (CLICK HERE TO SEE THE APP) ---

Concept

Problem: Choosing the right tennis string is complicated and highly specific

Finding the ideal tennis racquet string is a challenge for many players because there are thousands of different types available, varying in material, construction, shape/texture and thickness. To add confusion, two strings with the same 'specs' can play very differently, and the same string can play differently when used by different people or strung at a different tension. Because of this dizzying array of possibilities, many players just ask their stringer to pick a string for them and do not put much thought into optimizing this important piece of equipment, the only part that actually touches the ball during play.

For players who do try to find the best string for their game, the only thing to do is test out a variety of strings until you find what works. As an avid tennis player who tinkers with different string combinations, I have found stringforum.net to be the best resource for finding strings to try because the site has so many reviews (17500+ reviews by 4400+ unique reviewers) and covers almost every string on the market (2350+ varieties). However, even though it is the best resource currently available, the website has very basic search, filtering and ranking capabilities that limit its usefulness.

Screenshot of the stringforum ratings page. Note that users can only search string names (not within reviews), only sort by one metric at a time, and filtering is limited to string type and availability.

People don't have the right tool to help them research and rank strings, but the data are out there to build one

Gathering Data

This is especially tragic because the site gathers great data!

For each review, stringforum collects information not only about the string (ratings across seven categories, an overall satisfaction rating, the adjectives that best describe the string, and a text review), but also about the reviewer (gender, age bracket, playing style, ability level, swing speed, and how much spin they use in their strokes) and the racquet used in testing (manufacturer, model, frame size, string pattern, string tension level). In addition to data gathered from reviews, the site also has general information about the strings (price, thickness, material, construction and features).

Screenshot of a single review on stringforum. Lots of great data, but difficult to search, filter, and compare reviews.

Solution: So let's scrape the data and build an app!

I scraped review data from stringforum and built an app that leads users through a three step process for finding the right tennis string:

  1. User filters reviews according to string, tester and racquet criteria, leaving the relevant ones. Only these filtered reviews are used for rankings.
  2. User inputs weights for desired and undesired string attributes, and ranks strings based on these weights.
  3. User views detailed review information about the highest ranked strings in order to select ones to test.

Filtering

Not all tennis string reviews are relevant to all players. A beginner uses strings under very different conditions than an expert, so the opinions of one may not be informative for the other. The same goes for players using heavy vs. light spin, slow vs. fast swing speeds, and many of the other tester attributes. A useful string ranking system would allow users to filter reviews to select only those from players similar to themselves (as long as they leave enough data for an accurate analysis).

The same also goes for many of the racquet and string attributes. Users whose racquets are strung at low tensions would want to filter out reviews by testers using high tensions. Users who like thinner-gauge strings would want to filter out reviews about thicker-gauge strings. I'm sure most users would want to filter by price.

My app allows users to filter the dataset by 20 review criteria (7 string criteria, 7 tester criteria and 6 racquet criteria). The table in the lower panel is dynamically updated when the user adjusts preferences in any of the three input tabs.

Screenshot of my app's 'Review Criteria' menu item, with the 'Tester Criteria' input tab selected. The current criteria, from all three input tabs, leave around 6500 relevant reviews out of 17,500+. In the app, users are able to scroll to see the full table.

Ranking

Reviews on stringforum include ratings in eight categories, which I will call 'string characteristics', and users of the website can rank strings by any of them: 'comfort', 'control', 'durability', 'feel', 'power', 'spin', 'tension stability' and 'overall rating'. The site's rankings are not very useful, however, because users can only sort by one characteristic at a time. This would be fine for a player interested in only maximizing control or only maximizing power, but any player who has preferences for more than one characteristic is out of luck.

Preferences

The problem is, every tennis player I know has some degree of preference for all the characteristics - the only question is how much. Instead of asking which characteristic the player prefers, a better ranking system would list all the characteristics and ask the user which weights to put on each. One player may place a high emphasis on comfort and control, low emphasis on durability and power, and medium emphasis on the others. Another may assign entirely different weights. The point is that it's natural to take all the characteristics into account when deciding what makes for a good string, and a ranking system should reflect this reality. 

It's also a shame that users on stringforum aren't able to rank strings based on adjectives. Each review includes a list of adjectives to describe the string being evaluated, from a list of 22 possibilities (e.g., 'soft', 'lively', 'explosive', 'spongy', 'springy', 'stiff', 'precise', 'dull', 'boring').

Since there are a finite number of options, it would be easy to rank strings according to how often reviewers chose to describe them by an adjective. Just like for string characteristics, users should be given a list of all 22 adjectives, asked to provide weights for each, and get an individualized ranking based on those preferences. In this case, however, the user should be able to provide negative weights in case he/she wants to penalize strings for certain adjectives.

Characteristics

My app allows users to rank strings in three ways: using characteristics, adjectives or both. Users are able to input preferred weights for up to 30 categories and view a ranked table. The scores are easily interpretable and color coded to show percentile.

Screenshot of my app's 'String Rankings' menu item, with the 'String Characteristics' input tab selected. Under 'Table Options', the user chose to rank based on both characteristics and adjectives, which is why the string in the top position does not have the highest characteristics score. When ranking by both metrics, users can scroll right to see the adjective rankings, adjectives score and overall score for the weights selected.

Researching

After getting a ranking, it's time for the user to look at detailed review information for the top strings.

Reading Reviews

The best format for reading reviews is a single table that displays all the review data in separate columns. This way, the user can sort the dataset by any desired variable and find out, for example, what reviewers who rated a string poorly for spin had to say about it (and also scan info about those reviewers and their racquets to spot patterns).

Word Clouds

It would be nice to also give the user a graphical representation of the review text. Word clouds aren't the most informative visualizations, but they are easy to implement and are suited to this case.

Characteristics and Adjective Ratings

For the characteristics and adjectives ratings, users should be able to view an analysis of how a selected string compares with the filtered and full datasets. The comparison should be displayed in both percent and absolute terms, with percentile and z-score being my choices (I prefer z-score over rank, both conceptually and aesthetically, because users can interpret it without seeing the sample size).
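To make the two comparison metrics concrete, here is a minimal sketch in Python of how a string's score could be placed within a dataset. The function names are hypothetical, and two details are assumptions rather than anything stated in this post: the z-score uses the population standard deviation, and the percentile is the share of scores strictly below the selected one. Other definitions are equally reasonable.

```python
from statistics import mean, pstdev


def z_score(value, population):
    """Number of standard deviations `value` sits above the population mean.

    Assumption: population (not sample) standard deviation is used.
    """
    sd = pstdev(population)
    if sd == 0:
        return 0.0  # every score is identical, so nothing stands out
    return (value - mean(population)) / sd


def percentile_rank(value, population):
    """Percent of population scores strictly below `value` (one of several conventions)."""
    below = sum(v < value for v in population)
    return 100.0 * below / len(population)


# Hypothetical mean 'spin' scores for five strings in a filtered dataset
spin_scores = [40, 50, 60, 70, 80]
print(z_score(80, spin_scores), percentile_rank(80, spin_scores))
```

Running the same two functions once on the filtered reviews and once on the full dataset would give the side-by-side comparison described above.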

Rating Tables

My app displays detailed review data for a selected string in four ways: a table for reading reviews, a word cloud, and separate tables for characteristics and adjectives analyses. As with the ratings table, scores are color coded by percentile.

Screenshot of my app's 'String Profiles' menu item, displaying the 'Characteristics Analysis' output tab for WeissCANNON Scorpion 1.22 (the highest ranked string based on my review criteria and ranking preferences)

 

Details

You can find the code at my GitHub page

Scraping

For scraping the review data, I used Scrapy, which is a web crawling framework in Python. The main task was to instruct a web crawling 'spider' how to navigate through the site URLs, and provide it the XPath code to identify the data to collect on each page. The review pages on the site followed a predictable URL pattern and were organized in tables, which made the job relatively straightforward. One slight twist was that the site often encodes its data as symbols (smiley faces, plus signs, etc.) rather than text or numbers, so I had to identify those and encode them as numbers.
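The symbol-to-number step can be sketched roughly as follows. This is an illustration only, not the spider code from the repository: the function name `encode_rating` is hypothetical, and so is the assumption that a rating cell can be reduced to a short text token ('+' repeated, '-' repeated, or 'o' for the neutral circle). The real site markup may represent these symbols differently, for example as images.

```python
from typing import Optional


def encode_rating(symbols: str) -> Optional[int]:
    """Convert a run of plus/minus symbols into an integer from -3 to +3.

    Assumed (hypothetical) representation:
      '+', '++', '+++'  -> 1, 2, 3
      '-', '--', '---'  -> -1, -2, -3
      'o'               -> 0 (neutral)
      ''                -> None (rating missing)
    """
    token = symbols.strip()
    if not token:
        return None
    if token == "o":
        return 0
    if set(token) == {"+"} and len(token) <= 3:
        return len(token)
    if set(token) == {"-"} and len(token) <= 3:
        return -len(token)
    raise ValueError(f"unrecognized rating symbols: {symbols!r}")


print([encode_rating(s) for s in ["+++", "o", "--", ""]])
```

Raising an error on anything unexpected, rather than silently returning 0, makes it easier to spot symbols the mapping has not accounted for.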

Wrangling

The initial dataset, fresh from scraping, had 17,517 observations (reviews) of 19 variables. After wrangling, the variables almost tripled to 55. Much of the work involved text string manipulation: extracting the various pieces of tester, racquet and tennis string information as separate variables. I also created separate variables for each of the 22 adjective choices (this allows for faster and more efficient processing than working with a single variable containing all the selected adjectives).
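Splitting one adjectives field into separate indicator variables might look like the sketch below. It is written in Python for illustration and is not taken from the project code. The comma separator, the function name, and the shortened vocabulary (only the nine adjectives named in this post, out of 22) are all assumptions.

```python
# Subset of the 22 adjectives; the full list is not reproduced in this post.
ADJECTIVES = [
    "soft", "lively", "explosive", "spongy", "springy",
    "stiff", "precise", "dull", "boring",
]


def adjective_indicators(raw, vocabulary=ADJECTIVES, sep=","):
    """Turn 'soft, precise' into {'soft': 1, 'lively': 0, ..., 'precise': 1, ...}.

    Assumes (hypothetically) that the scraped field is a single
    separator-delimited string of the adjectives a reviewer selected.
    """
    selected = {part.strip().lower() for part in raw.split(sep) if part.strip()}
    return {adjective: int(adjective in selected) for adjective in vocabulary}


row = adjective_indicators("Soft, precise")
print(row)
```

With one 0/1 column per adjective, counting how often an adjective was chosen becomes a simple column sum instead of a text search on every review.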

Missingness

I allowed the user to decide how to deal with missing data. For each of the 20 criteria in the filtering section, the user is given a choice whether to include or exclude reviews with missing values.
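One way to express this per-criterion choice is sketched below in Python. The field names, the dictionary layout and the function names are hypothetical. The only idea carried over from the app is that each criterion has its own include-or-exclude setting for missing values.

```python
def keep_review(review, field, allowed, include_missing):
    """Decide whether one review passes one filter criterion."""
    value = review.get(field)  # None stands for a missing value
    if value is None:
        return include_missing
    return value in allowed


def filter_reviews(reviews, criteria):
    """Keep reviews that pass every criterion.

    criteria maps a field name to (allowed_values, include_missing).
    """
    return [
        review
        for review in reviews
        if all(
            keep_review(review, field, allowed, include_missing)
            for field, (allowed, include_missing) in criteria.items()
        )
    ]


# Hypothetical reviews; the second has no ability level recorded
reviews = [
    {"level": "advanced", "tension": "low"},
    {"level": None, "tension": "low"},
    {"level": "beginner", "tension": "low"},
]
print(len(filter_reviews(reviews, {"level": ({"advanced"}, True)})))
print(len(filter_reviews(reviews, {"level": ({"advanced"}, False)})))
```

Including missing values keeps more data for the rankings, while excluding them keeps the filtered set strictly relevant, which is why leaving the choice to the user is a reasonable default.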

Encoding

Characteristics Scores

On the stringforum site these were displayed as plusses or minuses, with the number designating the degree (three plusses meant 'amazing', three minuses meant 'terrible', and a white circle was 'neutral'). When scraping the data, I encoded these on a scale from -3 to +3, for the number of plusses or minuses (neutral was 0). However, I wanted to encode these into a more intuitive scale in the app.

My solution was to convert these numbers to a %max. This way, when a user sees a score of 100, it's intuitive that the string has a perfect score (whereas a score of 3 could mean anything without context), and a score of 0 is the lowest score (whereas 0 was the middle score in the earlier encoding). The scale is also easily interpretable. A score of 44, for example, means that the string received exactly 44% of the maximum possible score for the metric.
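A minimal sketch of such a conversion, assuming a simple linear rescaling of the -3 to +3 range onto 0 to 100 (the post states only the endpoints, so the linear form and the function name are assumptions):

```python
def pct_of_max(rating, lowest=-3, highest=3):
    """Linearly rescale a rating (or a mean rating) from [lowest, highest] to [0, 100]."""
    return 100.0 * (rating - lowest) / (highest - lowest)


for raw in (-3, 0, 3):
    print(raw, pct_of_max(raw))
```

Under this assumed mapping, the old neutral rating of 0 lands at 50, three minuses land at 0, and three plusses land at 100. Because the mapping is linear, rescaling a string's mean rating gives the same result as averaging the rescaled ratings.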

Adjective Scores

Adjective scores need to be encoded because these will also be used for ranking. I needed a metric that was in percentage terms (to be fair to strings with few reviews) and that gave each review equal influence in the rankings. Reviewers are allowed to select as many adjectives as they want from a list of 22, and I did not want a review with 1 adjective selected to count less than a review with 5, 10, or 22 adjectives selected.

My solution was to calculate the prevalence of each adjective: the % of reviews in which the adjective was selected. It is fair because it treats each review as having 22 votes, one for each adjective, and both 'yes' (selected) and 'no' (not selected) votes are counted. Prevalence is also easily interpretable and on the same 0-100 scale as our other ranking metric.
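Prevalence as defined above is straightforward to compute. The sketch below is in Python with hypothetical names, and represents each review as the set of adjectives its reviewer selected:

```python
def adjective_prevalence(reviews, adjective):
    """Percent of a string's reviews in which `adjective` was selected.

    `reviews` is a list of sets, one set of selected adjectives per review.
    Returns None when the string has no reviews.
    """
    if not reviews:
        return None
    return 100.0 * sum(adjective in selected for selected in reviews) / len(reviews)


# Four hypothetical reviews of one string
reviews = [{"soft"}, {"soft", "lively"}, set(), {"stiff"}]
print(adjective_prevalence(reviews, "soft"))
print(adjective_prevalence(reviews, "lively"))
```

Note that the review with no adjectives selected still counts in the denominator, which is what gives every review the same influence regardless of how many adjectives it lists.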

Building the App

I built the app using the Shiny package for R and the shinydashboard sidebar layout. The three menu items in the sidebar are Review Criteria, String Rankings, and String Profiles, and they correspond to the three core functions of the app (filter, rank and research).

For the output tables I used the 'DT' interface to the JavaScript DataTables library, which has excellent built-in search and sort capabilities. Users are able to search for text found anywhere in a table, and sort by any column.

Rankings - Weights and Defaults

For characteristics, users are allowed to assign weights from 0 - 10, with a default of 5. This means that each characteristic counts a medium amount in the overall rankings by default and the user can decide if it should, instead, count for nothing, a small amount or a large amount. It makes sense that the default is medium because all eight are important components of good strings. A user who uses the default ranking would still get a perfectly acceptable (although bland) ranking, with all characteristics weighted equally.

For the adjectives, users are allowed to assign weights from -2 to 2, with a default of 0. This means that no adjectives count in the overall rankings by default, and the user must specifically choose which ones should count at all, how strongly, and whether to reward or penalize a string for having reviews with that adjective. With 22 adjectives, and some of them negative, it makes sense to let the user initiate the ranking and not count any by default.
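The post does not spell out the exact scoring formula, so the following Python sketch shows just one plausible way to turn weights into a score: a weighted sum divided by the sum of the absolute weights. For non-negative weights (the characteristics case) this is an ordinary weighted average. For adjectives it lets negative weights pull the score down. The function name and the normalization are assumptions.

```python
def weighted_score(scores, weights):
    """Weighted sum of scores, normalized by the total absolute weight.

    scores  -- {category: score on a 0-100 scale}
    weights -- {category: user weight}, possibly negative for adjectives
    Returns None when every weight is 0 (nothing has been selected to rank by).
    """
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        return None
    return sum(w * scores[category] for category, w in weights.items()) / total


# Characteristics: weights 0 to 10, default 5 (equal weighting)
print(weighted_score({"comfort": 80, "control": 60}, {"comfort": 5, "control": 5}))

# Adjectives: weights -2 to 2, here rewarding 'soft' and penalizing 'dull'
print(weighted_score({"soft": 40, "dull": 10}, {"soft": 2, "dull": -1}))
```

Returning None for all-zero weights mirrors the adjective default described above, where nothing counts until the user opts in.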

 

Rankings - Output Tables

For string rankings, the user has a choice of three output tables: ranking by characteristics, ranking by adjectives, and ranking by both. If the user chooses to display the combined ranking, a panel appears asking for weights to assign each of the components (characteristics and adjectives) on a scale of 1-10.
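As a rough illustration of the combined table, the sketch below blends a characteristics score and an adjectives score using the two component weights and then sorts. The weighted-average form, the function names and the example strings are assumptions made for this sketch. The app's actual formula may differ.

```python
def combined_score(char_score, adj_score, char_weight=5, adj_weight=5):
    """Blend the two component scores using user-chosen weights from 1 to 10."""
    return (char_weight * char_score + adj_weight * adj_score) / (char_weight + adj_weight)


def rank_strings(scores, char_weight=5, adj_weight=5):
    """Return string names ordered from best to worst combined score.

    scores maps a string name to (characteristics_score, adjectives_score).
    """
    return sorted(
        scores,
        key=lambda name: combined_score(*scores[name], char_weight, adj_weight),
        reverse=True,
    )


# Two hypothetical strings: A wins on characteristics, B on adjectives
scores = {"String A": (80, 10), "String B": (70, 40)}
print(rank_strings(scores))                      # equal component weights
print(rank_strings(scores, char_weight=10, adj_weight=1))
```

This also shows why the top string in a combined ranking need not have the highest characteristics score: a strong adjectives score can outweigh a small deficit, depending on the component weights.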

The cell backgrounds of each String Rankings output table are colored according to that string's percentile within the filtered dataset, in 5% increments. If the mean value for a string is above the 55th percentile for a metric, its cell will be green. If it's below the 45th percentile, then its cell will be red. The middle percentiles are white, and shades of color get darker toward the extremes.
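The coloring rule can be sketched as a small bucketing function. The version below is in Python rather than the app's R/DT code. The thresholds (45 and 55) and the 5-point step come from the description above, while the function name, the handling of exact boundary values and the cap of nine shade levels are assumptions.

```python
def color_bucket(percentile):
    """Map a percentile (0-100) to (color, shade), where a higher shade is darker.

    45 through 55 is white. Above 55 is green and below 45 is red,
    darkening by one shade level for every 5 percentile points.
    """
    if 45 <= percentile <= 55:
        return ("white", 0)
    if percentile > 55:
        shade = int((percentile - 55) // 5) + 1
        return ("green", min(shade, 9))
    shade = int((45 - percentile) // 5) + 1
    return ("red", min(shade, 9))


for p in (10, 44, 50, 57, 99):
    print(p, color_bucket(p))
```

In a Shiny app, the resulting bucket would typically be translated into a background color for each table cell.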

Conclusion

I succeeded in building an app that significantly improves upon the best resource previously available for researching and ranking tennis racquet strings. However, the app is still a prototype and its functionality and user interface can be further improved.

In terms of functionality, a 'string comparisons' feature showing a head-to-head analysis between strings would be useful. So would a 'find similar strings' feature, for when a user has a string he/she likes and wants to find other ones like it. 'Tester profile' and 'racquet profile' menu items would allow a user to play around with the data and explore how a particular tester rated different strings, and how a particular racquet brand or model performed against others. I also hope to add an 'EDA' menu item that allows users to visualize and plot the data.

Improvements

There are several tweaks to make in terms of the user interface - moving some information out of tables and into information boxes is one obvious improvement, and the general 'look' could benefit from some text styling and CSS wrappers. I am eager to make these changes as I continue to develop the app.

For this project I explored using the scraped data for building the string finder app, but the same data could be used, for example, to gain insights about what players of different types like and dislike about tennis strings. This type of information would be fun to extract, and useful for manufacturers and marketers to know. I hope to pursue an analysis along these lines in another blog post.

About Author

Iman Singh

A logical and creative problem-solver who combines a strong understanding of statistics and machine learning with the coding skills to query, wrangle, visualize and model data using multiple languages