Web Scraping Without a Paddle

Brenna Botzheim
Posted on Jan 24, 2020

Introduction

What better way to explore a lake, a river, or a bay than from the serene perspective of a kayak silently gliding through the water, propelled by one's own power and leaving nothing but rippling water in its wake? Or to harness that silence to creep along the surface of the water and catch fish unaware, a technique indigenous hunters in the Arctic north developed millennia ago? Or even to use the maneuverability offered by a small boat and paddle to conquer whitewater river rapids? The kayak is a versatile instrument, capable of many different roles depending on its design and user.

I originally became interested in learning about kayaks when I moved to the Elkhorn Slough area in Northern California. The Elkhorn Slough is a protected wetland river delta. Not only is kayaking common on the slough, but also around the nearby Monterey Bay. The scenery and wildlife (including seals, sea otters, a plentiful variety of birds, jellyfish, etc.) make it an idyllic place to explore. However, I had only kayaked a handful of times in my life through kayak rental services. So, I turned to the internet to research what would be required and how to get started. I quickly found out kayaks are immensely more complicated than I originally anticipated. Here is a quick rundown on what I learned.

The primary types of kayaks are recreational kayaks, fishing kayaks, sea kayaks, and whitewater kayaks. Though individual products do vary, in general their differences are described in the table below:

Length

A longer kayak is faster and tracks better (requires less adjustment to stay straight) than a shorter kayak but sacrifices maneuverability. Kayaks designed for long distance kayaking or sea kayaking where speed is preferred tend to be longer, while whitewater kayaks are shorter because they require greater maneuverability and are propelled by the river current.

Width / Depth – Stability

Stability is an important concept in kayaking and can be broken down into primary and secondary stability. Primary stability refers to how much a kayak rocks in response to the kayaker's movements. A wider kayak has greater primary stability, which matters most to the inexperienced kayaker because a wider kayak will not tip as easily in response to the paddler's movements. Primary stability is also important for fishing from a kayak, which inherently requires more movement from the kayaker. Secondary stability refers to how much a kayak rocks in response to waves or water currents and depends on the curvature and depth of a kayak's hull. Secondary stability is a concern in all types of kayaking.

Why not make kayaks both wide and deep to max out the benefits? That makes a kayak very large, increasing the drag while decreasing maneuverability and speed, forcing the kayaker to exert more effort.

What next?

I quickly fell down the rabbit hole of information overload. There was so much more to it than finding a good quality kayak. You have to find the kayak with the right combination of features to fit your needs. On top of that, kayaks are not cheap. A wrong decision could cost me a lot down the road. But there was all this data on the web: why not harness it and apply my analytical skills to determine what a good investment would be? Better yet, why not generalize the problem to one that could benefit others looking to purchase kayaks, while simultaneously offering consumer feedback to kayak retailers?

The goal of the project then became to collect freely available data on kayaks from the internet, and use that data to gain insight into the kayak market: what products are available on the market, what do consumers like in the products, and most importantly, what is the most popular? This would be accomplished by finding the most popular kayaks by type and technical features and assessing what consumers like about them.

Data Collection

The data used in this project was scraped from online retailers using Selenium in Python. The outdoor-gear retailer REI turned out to be the best source of uniform data, having a consistent section outlining the kayak “technical specifications” which would allow me to compare ratings and reviews not only to the type of kayak, but also to the more nuanced differences between kayaks like dimensions. Towards this end, the data collection focused on gathering product information including kayak dimensions, kayak type, price, ratings, and reviews.
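As a rough illustration of what collecting the "technical specifications" section might look like with Selenium (a sketch only, not the code used for this project): the CSS selector `table.tech-specs tr` and the function names below are hypothetical stand-ins, and the real page markup would need to be inspected first.

```python
def normalize_specs(rows):
    """Convert (label, value) text pairs from a spec table into a dict.

    Labels are lower-cased and stripped so that "Length " and "length"
    end up under the same key; rows with an empty label are skipped.
    """
    specs = {}
    for label, value in rows:
        key = label.strip().lower()
        if key:
            specs[key] = value.strip()
    return specs


def scrape_product(driver, url):
    """Load one product page and return its technical specs as a dict."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(url)
    rows = []
    # Hypothetical selector: an assumption, not the retailer's actual markup.
    for tr in driver.find_elements(By.CSS_SELECTOR, "table.tech-specs tr"):
        cells = tr.find_elements(By.CSS_SELECTOR, "th, td")
        if len(cells) >= 2:
            rows.append((cells[0].text, cells[1].text))
    return normalize_specs(rows)
```

With a live browser session this would be called as `scrape_product(webdriver.Chrome(), url)` for each product page. Keeping the text clean-up in its own function means it can be checked without opening a browser.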

Analysis

Graphical EDA

The dominant feature I assess in this section is kayak type (i.e. recreational, fishing, whitewater, sea). I will look at how the data compares to the theoretical differences between kayak types, as well as compare the average rating and price of each category of kayak. The data refers to recreational kayaks as 'flatwater' kayaks, but the terms can be considered synonymous.

As described in the introduction, kayak types have distinct builds that aid them in their intended purpose. The following scatter plot shows how types of kayaks in the sample vary by width and length. First, we can see that sea kayaks (orange) are the narrowest, and tend towards longer crafts, which helps on long trips in choppy water. We can also see that fishing kayaks (blue) tend to be wider and split the difference in kayak length, which provides better primary stability while fishing. Finally, recreational kayaks (green) are the most messy, as expected, but tend toward the middle in both width and length, as recreational crafts attempt to compromise between the benefits of both. Though there are only two whitewater kayaks in the sample, they both are very short, as expected, with average to large width, which is a surprise. However, overall the graph is very similar to what we would expect the differences in kayak types to look like.

Then below we can see that as kayak length increases, kayak width decreases, and kayak depth similarly decreases. Important to note is that not all products had depth listed, so this graph is a subset of the original data.

The next graph shows the general distribution of number of products by kayak type. Recreational kayaks are very clearly the most prevalent category and the data notably contains very few whitewater kayaks.

Is the prevalence of a product the best way to tell popularity? Perhaps, since it seems likely that sellers will supply more of what is in demand to consumers. However, I don't have direct insights into the seller's business model here. Another way to look at popularity is to compare the number of reviews each kayak category has, as seen below. Flatwater kayaks again have a very large share of the reviews and whitewater kayak reviews are non-existent.

Furthermore, if we compare each kayak type's share of the total products with its share of the total reviews, perhaps we can see if a category is outperforming the others. This idea is investigated in the following table:

It does look like there is a disproportionate number of reviews for recreational kayaks compared to other kayaks. This could reflect higher demand for recreational kayaks, assuming the likelihood a customer leaves a review is equal for all kayak types.
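The comparison of product share against review share can be sketched in a few lines of plain Python. The function name `share_table` and the toy numbers are invented for illustration and are not the scraped data.

```python
from collections import Counter


def share_table(products):
    """Return {kayak_type: (product_share, review_share)}.

    Each product is a (kayak_type, review_count) pair.
    """
    product_counts = Counter()
    review_counts = Counter()
    for kayak_type, n_reviews in products:
        product_counts[kayak_type] += 1
        review_counts[kayak_type] += n_reviews

    total_products = sum(product_counts.values())
    total_reviews = sum(review_counts.values())
    table = {}
    for kayak_type in product_counts:
        product_share = product_counts[kayak_type] / total_products
        # Guard against a data set that contains no reviews at all.
        review_share = (
            review_counts[kayak_type] / total_reviews if total_reviews else 0.0
        )
        table[kayak_type] = (product_share, review_share)
    return table


# Made-up numbers purely for illustration.
toy = [("flatwater", 30), ("flatwater", 10), ("sea", 5), ("fishing", 5)]
for kayak_type, (p, r) in share_table(toy).items():
    print(f"{kayak_type:10s} products {p:.0%}  reviews {r:.0%}")
```

A category whose review share is well above its product share is drawing more reviews per product than the others, which is the pattern described for recreational kayaks.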

Next, I'll inspect if price has an impact on ratings:

Overall, the price of kayaks follows a roughly bell-shaped distribution with a median around $1000 and a right skew. In the below scatter plot I compare average rating to price. While there isn't much to glean here, more expensive kayaks do tend to be reviewed better. Fishing and sea kayaks also tend to be reviewed better than recreational kayaks or multi-purpose "Fishing/Flatwater" kayaks.

Finally, the last graph looks at the distribution of ratings for each category:

Similar to the previous graph, we can see that sea and fishing kayaks have very high ratings. Recreational kayaks have the biggest range of values and "Fishing/Flatwater" kayaks were rated the worst, though still fairly highly. However, ratings by themselves offer limited insights. For further analysis I need to probe the reviews themselves.

NLP

We've seen an overview of the collected data and consumer preferences, but so far I haven't answered the fundamental question of this project: What do consumers prefer in a kayak? Why are recreational kayaks popular? Customer ratings are useful but they only offer a snapshot of the overall sentiment of the reviews. To dig deeper I can utilize natural language processing to try to understand the content of the reviews and what consumers like about a product – not just if they like it.

I'll be using a topic modeling technique called LDA (Latent Dirichlet Allocation). LDA is a probabilistic model that takes the collection of reviews and generates topics that are essentially clusters of words that describe hidden structures within the text. The topics represent repeating patterns within the reviews that have the greatest probability of generating the observed collection of reviews.
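For readers who want the model behind that description, the standard LDA generative process can be written as below. The symbols are the usual textbook notation rather than anything defined in this post: K topics, D reviews, and Dirichlet hyperparameters alpha and eta.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

Here beta_k is topic k's distribution over words, theta_d is review d's mixture of topics, z_{d,n} is the topic assigned to the n-th word of review d, and w_{d,n} is the observed word. Fitting the model means inferring beta and theta from the observed words alone.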

I'll start by inspecting the overall word frequency for all reviews for all products. Prior to the NLP stage, the reviews were cleaned up to remove punctuation and noisy words like 'the' or 'I'. This makes the results more reliable and relevant.

In the word cloud above lots of positive adjectives pop out: great, love, easy, good, comfortable, and stable. More ambiguous terms also stand out: seat, handle, time, inflatable, cockpit, purchase, REI, work, track, and pump.

Below is a frequency distribution of the top 20 words in the data set. The most frequent words reflect what was most mentioned in the reviews. Here words like pump, easy, floor, bag, dry, and small all come up frequently.
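The clean-up and counting steps described above can be sketched with the standard library alone. The stop-word list here is a tiny hand-picked set and the sample reviews are invented; both are assumptions for illustration, not what was used on the real data.

```python
import re
from collections import Counter

# A deliberately small stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "i", "a", "an", "and", "is", "it", "to", "in", "this", "my", "of"}


def clean(review):
    """Lower-case a review, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_words(reviews, n=20):
    """Return the n most frequent words across all cleaned reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(clean(review))
    return counts.most_common(n)


# Invented sample reviews, not taken from the scraped data.
sample = ["I love this kayak, it is easy to pump!", "The pump is small. Easy setup."]
print(top_words(sample, n=3))
```

The token lists returned by `clean` are also the form that most topic-modeling libraries expect as input.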

For the next stage, I fit the data with an LDA model and the results show that the top 5 topics can be described as follows:

  1. 0.028*kayak + 0.012*water + 0.010*boat + 0.008*great + 0.008*seat + 0.007*well + 0.007*back + 0.006*easy + 0.005*comfortable + 0.005*paddle
  2. 0.023*kayak + 0.008*water + 0.008*back + 0.008*boat + 0.006*paddle + 0.006*lake + 0.006*seat + 0.006*pump + 0.005*time + 0.005*use
  3. 0.011*kayak + 0.006*service + 0.005*oru + 0.005*rei + 0.004*customer + 0.004*replacement + 0.004*time + 0.003*first + 0.003*seat + 0.003*much
  4. 0.017*boat + 0.008*water + 0.007*floor + 0.005*kayak + 0.005*like + 0.004*lbs + 0.004*back + 0.004*paddle + 0.004*small + 0.004*great
  5. 0.029*boat + 0.014*kayak + 0.012*water + 0.011*well + 0.009*easy + 0.006*great + 0.005*bag + 0.005*pump + 0.005*stable + 0.005*like

The topics are a little messy in this format, but using the pyLDAvis library in Python I can visualize the model in 2-d space, which allows a deeper look at each topic and allows us to examine the relationship between topics.

  • Topic 1 and Topic 3 are the most related to each other. Topic 1 seems to describe the topic of how it feels to sit in the kayak with terms like "seat, back, comfortable." Topic 3 looks rather different, and looks to refer to the topic of customer service or ordering replacements with terms like "REI, customer, replacement, time, first"
  • Topic 4 may speak to kayak size with words like "lbs, small, back"
  • Topic 5 looks to refer to kayak features, with words like "bag, pump, stable"

So far I've used NLP to look at the reviews as a whole. But are there any differences between results when only looking at the most popular type of kayaks? From the previous section where I did an exploratory look at the data, I found that the most popular category seemed to be flatwater (recreational) kayaks. Let's look at only reviews in this category.

NLP–Recreational Kayaks

A large percentage of the reviews are made up of recreational reviews already, but can we see any differences? The one that stands out to me is 'stable', which now appears in the top 20 words.

I again train an LDA model and see what the top topics are:

  1. 0.023*kayak + 0.014*boat + 0.009*water + 0.007*easy + 0.006*floor + 0.005*paddle + 0.005*back + 0.005*stable + 0.005*like + 0.005*well
  2. 0.019*kayak + 0.009*use + 0.009*pump + 0.008*boat + 0.007*easy + 0.007*water + 0.006*back + 0.006*great + 0.005*dry + 0.005*bag
  3. 0.021*kayak + 0.014*water + 0.009*seat + 0.007*back + 0.007*like + 0.007*love + 0.006*little + 0.005*boat + 0.004*well + 0.004*storage
  4. 0.019*kayak + 0.016*boat + 0.012*water + 0.007*pump + 0.006*back + 0.006*like + 0.006*well + 0.005*really + 0.005*time + 0.005*small
  5. 0.021*kayak + 0.015*boat + 0.012*well + 0.011*water + 0.011*great + 0.009*paddle + 0.006*seat + 0.006*kayaks + 0.006*comfortable + 0.006*back

It looks like a lot more positive descriptors have shown up in the topics. And again, we can visualize these topics:

  • Topic 1 seems to refer to how the kayak handles in the water with words like "easy, floor, stable, paddle"
  • Topics 2 & 3 look to describe kayak features with words like "pump, bag, storage, dry, seat"
  • Topic 5 refers to the comfort of the kayak

The results of the two LDA models are similar, but there are slight changes in the topics once we only select recreational kayak reviews. It looks like recreational kayaks are liked for their stability, ease of use, and comfort while on the water.

Summary

The goal of this project was to gain insight into the market of kayaks on the internet and understand what is popular among consumers. By collecting product information online and looking for trends in the data, I did find that recreational kayaks seem to be the most popular category, with the most popular price range between $500 and $1000. Fishing kayaks and sea kayaks are the highest rated, but they have relatively few ratings to support being the most popular category.

Next I implemented topic modeling (LDA) to begin to understand what was being discussed in the reviews and find out what customers liked about a product. Customers seemed to like how a kayak handles in the water, ease of use, reliable features/good design, and comfort. When filtering for the most popular category (recreational kayaks), stability, comfort, and enjoyment became more dominant in the topics.

Ultimately, most consumers buying kayaks seem to be beginners, and they look for kayaks they can handle and that make them feel safe, comfortable, and stable in the water. It's a fair intuition to say that most kayakers are casually into the hobby. They don't need to go crazy for the best performing kayak, or get a kayak that can handle harder conditions–like whitewater rivers. As long as it's easy to pull out and put back together for a weekend trip, it's perfect. Or at least that seems to be the story behind customers looking for kayaks on REI. From my results, it's likely that experienced kayakers look elsewhere for their purchases.

So what ultimately exemplifies the most popular kayak? Meet the Advanced Elements AdvancedFrame Inflatable Kayak:

It is a recreational, inflatable kayak available for the reasonable price of $499. It's transportable, easy to set up, comfortable to sit in, wide for stability, and perfect for the casual kayaker.

About Author

Brenna Botzheim

Brenna Botzheim is an associate EOV Analyst at StormGeo. Brenna holds a Bachelors degree from San Francisco State University where she studied sociology and mathematics. In her spare time, Brenna continues to develop her skills in statistical data...