National Park Web Scraping

Wann-Jiun Ma

Posted on Dec 21, 2016

Contributed by Wann-Jiun Ma. He is currently attending the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his third class project - Web Scraping.

Introduction

We are planning a trip to national parks. With so many adventures to choose from, I thought it would be a good idea to scrape national park information from websites and use the scraped data to build a national park recommendation system for myself. The idea is simple: 1) scrape national park features from websites; 2) perform EDA and data wrangling; 3) build models based on the scraped data; 4) evaluate and enjoy the results. I use both Scrapy and Beautiful Soup to scrape park information from two websites: Wikipedia and TripAdvisor. All code can be found at https://github.com/Wann-Jiun/nycdsa_project_3_web_scraping.

Wikipedia Web Scraping Using Scrapy

First, let's get an overview of the national parks in the US. I use Scrapy to scrape park information from Wikipedia. Scrapy is a web scraping framework written in Python. A Scrapy project is built around ‘spiders’, which are self-contained crawlers that follow a set of instructions to scrape information from websites. The information that I scrape from Wikipedia consists of park name, location, date established, park size, and number of visitors (2014). After scraping the data, I perform some data wrangling to prepare it for EDA in the next stage of the analysis.

Let's see which state has the most national parks. I group the parks by state and plot the result. The figure shows that California has the most national parks (9), which is not surprising. Let's see if we can find any other interesting facts in the data. I also compute the mean number of visitors per park in each state. The figure shows that Tennessee had the highest value in 2014. This is interesting, since there is only one national park (Great Smoky Mountains) in Tennessee.
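The two aggregations described above can be sketched with pandas as follows. The rows are made-up placeholder values rather than the scraped data, and the column names (`park`, `state`, `visitors`) are assumptions.

```python
import pandas as pd

# Toy rows with made-up visitor counts, standing in for the scraped table.
parks = pd.DataFrame(
    {
        "park": ["Park A", "Park B", "Park C", "Park D"],
        "state": ["California", "California", "Tennessee", "Utah"],
        "visitors": [1_000, 3_000, 9_000, 2_500],
    }
)

# Number of parks in each state, largest first.
parks_per_state = parks.groupby("state")["park"].count().sort_values(ascending=False)

# Mean number of visitors per park within each state, largest first.
mean_visitors = parks.groupby("state")["visitors"].mean().sort_values(ascending=False)

print(parks_per_state)
print(mean_visitors)
```

Either series can then be plotted as a bar chart, for example with `parks_per_state.plot(kind="bar")`, provided matplotlib is installed.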

TripAdvisor Web Scraping Using Beautiful Soup

Now, let's scrape more information about national parks. I use Beautiful Soup to scrape data from TripAdvisor. Beautiful Soup is an easy-to-use Python package designed for web scraping. The information I scrape includes park name, star rating, number of reviews, location, park feature (hiking trails, valleys, volcanoes, etc.), and URLs.
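As a rough illustration of how Beautiful Soup pulls such fields out of a page, the sketch below parses a small hand-written HTML snippet. The tag names, class names, and values are invented for this example and do not reflect TripAdvisor's real markup, which would have to be inspected in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag and class names are invented for this sketch.
html = """
<div class="listing">
  <a class="title" href="/park-a">Park A</a>
  <span class="rating">4.5</span>
  <span class="reviews">1,234 reviews</span>
</div>
<div class="listing">
  <a class="title" href="/park-b">Park B</a>
  <span class="rating">5.0</span>
  <span class="reviews">87 reviews</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

parks = []
for listing in soup.find_all("div", class_="listing"):
    title = listing.find("a", class_="title")
    rating_text = listing.find("span", class_="rating").get_text(strip=True)
    review_text = listing.find("span", class_="reviews").get_text(strip=True)
    parks.append(
        {
            "name": title.get_text(strip=True),
            "url": title["href"],
            "stars": float(rating_text),
            # "1,234 reviews" -> 1234
            "n_reviews": int(review_text.split()[0].replace(",", "")),
        }
    )

print(parks)
```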

Note that the URLs are scraped based on the "GET" request information, which is provided by web browsers. Using these URLs, we can also visit each individual park's page to scrape more information. The following figure shows the information I scrape from each individual park's page.
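One way to turn scraped links into GET requests for the individual park pages is sketched below, using only the standard library. The relative paths and the `User-Agent` string are made up for illustration, and the requests are only built here, not sent.

```python
from urllib.parse import urljoin
from urllib.request import Request

BASE_URL = "https://www.tripadvisor.com"

# Hypothetical relative links, standing in for hrefs scraped from a listing page.
relative_links = ["/park-a.html", "/park-b.html"]


def build_get_request(relative_link):
    """Return an unsent GET request for one park page."""
    url = urljoin(BASE_URL, relative_link)
    # The User-Agent value is an assumed, made-up identifier for this sketch.
    return Request(url, headers={"User-Agent": "park-scraper-sketch/0.1"}, method="GET")


requests_to_send = [build_get_request(link) for link in relative_links]
for request in requests_to_send:
    print(request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would download the page, whose HTML could then be handed to Beautiful Soup.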

Finally, I collect park features including name, number of reviews, star rating, location, park feature, and number of things to do. The nominal categorical variables, park feature and location (state), are encoded using pandas' "get_dummies" function.
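The encoding step might look like the sketch below. The rows and column names are placeholders rather than the scraped data; `pd.get_dummies` replaces each listed categorical column with one indicator column per category.

```python
import pandas as pd

# Placeholder rows; the values and column names are assumptions.
parks = pd.DataFrame(
    {
        "name": ["Park A", "Park B", "Park C"],
        "n_reviews": [1234, 87, 560],
        "state": ["CA", "TN", "CA"],
        "feature": ["Hiking Trails", "Valleys", "Volcanoes"],
    }
)

# One indicator column per state and per park feature; other columns are kept.
encoded = pd.get_dummies(parks, columns=["state", "feature"])

print(encoded.columns.tolist())
```

The resulting indicator columns are named after the original column and the category, such as `state_CA` and `feature_Valleys`.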

The features are fed into the k-means clustering algorithm to explore the underlying structure of the park data. k-means clustering partitions the observations into k clusters such that each observation belongs to the cluster with the nearest mean (centroid). Using k-means clustering, we can recommend parks similar to the one a user provides as input, namely the other parks in the same cluster.
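A minimal sketch of cluster-based recommendation, assuming scikit-learn's `KMeans`, is shown below. The feature matrix is made up, the `recommend` helper is hypothetical, and the standardization step is an assumption of this sketch rather than something the post describes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

names = ["Park A", "Park B", "Park C", "Park D"]

# Made-up features: star rating, number of reviews, has_hiking_trails, has_volcanoes.
X = np.array(
    [
        [4.5, 1200, 1, 0],
        [4.6, 1100, 1, 0],
        [4.0, 90, 0, 1],
        [4.1, 110, 0, 1],
    ],
    dtype=float,
)

# Assumption: standardize so that review counts do not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# k = 2 is chosen by hand for this toy data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)


def recommend(park):
    """Hypothetical helper: return the other parks in the same cluster."""
    cluster = labels[names.index(park)]
    return [n for n, label in zip(names, labels) if label == cluster and n != park]


print(recommend("Park A"))
```

On real data, k would need to be chosen more carefully, for example by comparing the within-cluster sum of squares across several values of k.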

About Author

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...
