National Park Web Scraping

Wann-Jiun Ma
Posted on Dec 21, 2016

Contributed by Wann-Jiun Ma. He is currently attending the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his third class project - Web Scraping.

Introduction

We are planning a trip to national parks. With so many adventures to choose from, I thought it would be a good idea to scrape national park information from websites and use the scraped data to build a national park recommendation system for myself. The idea is pretty simple: 1) scrape national park features from websites; 2) perform EDA and data wrangling; 3) build models based on the scraped data; 4) evaluate and enjoy the results. I use both Scrapy and Beautiful Soup to scrape park information from two websites, Wikipedia and TripAdvisor. All code can be found at https://github.com/Wann-Jiun/nycdsa_project_3_web_scraping.

Wikipedia Web Scraping Using Scrapy

First, let's get an overview of the parks in the US. I use Scrapy to scrape park information from Wikipedia. Scrapy is a web scraping framework written in Python. A Scrapy project is built around ‘spiders’, which are self-contained crawlers that follow a set of instructions to scrape information from websites. The information that I scrape from Wikipedia consists of park name, location, date established, park size, and number of visitors (2014). After scraping the data, I perform some data wrangling to prepare it for EDA in the next stage of the analysis.

[Figure: wiki]

Let's see which state has the most national parks. I group the parks by state and plot the result. The figure shows that California has the most national parks (9), which is not surprising. Let's see if we can find any other interesting facts in the data. I also compute the mean number of visitors per park in each state. The figure shows that Tennessee had the most visitors in 2014. This is interesting because there is only one national park (Great Smoky Mountains) in Tennessee.
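The grouping step can be sketched with pandas as follows. The rows are made-up placeholders rather than the scraped data, the column names are assumptions, and the original plots may well have been produced with a different tool.

```python
import pandas as pd

# Illustrative rows only (made-up numbers, not the scraped Wikipedia data).
parks = pd.DataFrame({
    "park": ["Park A", "Park B", "Park C", "Park D"],
    "state": ["California", "California", "Tennessee", "Maine"],
    "visitors": [1_000_000, 3_000_000, 10_000_000, 2_500_000],
})

# Number of parks in each state, largest first.
parks_per_state = (
    parks.groupby("state")["park"].count().sort_values(ascending=False)
)

# Mean number of visitors per park in each state, largest first.
mean_visitors = (
    parks.groupby("state")["visitors"].mean().sort_values(ascending=False)
)
```

A state with a single, very popular park can top the mean-visitors ranking even though it is near the bottom of the park-count ranking, which is the pattern described above for Tennessee.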

[Figure: rplot]

TripAdvisor Web Scraping Using Beautiful Soup

Now, let's scrape more information about national parks. I use Beautiful Soup to scrape data from TripAdvisor. Beautiful Soup is an easy-to-use Python package for pulling data out of HTML pages. The information I scrape includes park name, star rating, number of reviews, location, park features (hiking trails, valleys, volcanoes, etc.), and URLs.
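To give a feel for this kind of parsing, here is a minimal Beautiful Soup sketch. The HTML snippet, its class names, and the `parse_listings` helper are invented for illustration; TripAdvisor's real markup is different and changes over time, so the selectors would have to be adapted after inspecting the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are assumptions, not TripAdvisor's HTML.
SAMPLE_HTML = """
<div class="listing">
  <a class="park-name" href="/park-a">Park A</a>
  <span class="rating">4.5</span>
  <span class="review-count">1,234 reviews</span>
</div>
<div class="listing">
  <a class="park-name" href="/park-b">Park B</a>
  <span class="rating">4.0</span>
  <span class="review-count">56 reviews</span>
</div>
"""


def parse_listings(html):
    """Extract name, URL, star rating, and review count from each listing."""
    soup = BeautifulSoup(html, "html.parser")
    parks = []
    for listing in soup.find_all("div", class_="listing"):
        link = listing.find("a", class_="park-name")
        rating = listing.find("span", class_="rating").get_text(strip=True)
        count = listing.find("span", class_="review-count").get_text(strip=True)
        parks.append({
            "name": link.get_text(strip=True),
            "url": link["href"],
            "stars": float(rating),
            # "1,234 reviews" -> 1234
            "n_reviews": int(count.split()[0].replace(",", "")),
        })
    return parks


parks = parse_listings(SAMPLE_HTML)
```

In a real scraper the HTML string would come from an HTTP response rather than an inline sample.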

[Figure: slide1]

Note that the URLs are scraped based on the "GET" request information, which web browsers provide. Following these URLs, we can also visit each individual park's page to scrape more information. The following figure shows the information I scrape from each individual park's page.

[Figure: slide2]

Finally, I collect the park features: name, number of reviews, star rating, location, park feature, and number of things to do. The nominal categorical variables, park feature and location (state), are encoded using pandas' "get_dummies" function.
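The encoding step might look like the sketch below. The rows and column names are placeholders chosen for illustration, not the scraped data.

```python
import pandas as pd

# Made-up rows; column names are assumptions.
parks = pd.DataFrame({
    "name": ["Park A", "Park B", "Park C"],
    "state": ["California", "Utah", "California"],
    "feature": ["Hiking Trails", "Canyons", "Volcanoes"],
    "n_reviews": [1200, 800, 300],
})

# Replace each nominal column with one indicator column per category.
encoded = pd.get_dummies(parks, columns=["state", "feature"])
```

Each category becomes its own indicator column (for example `state_California`), so the categorical variables can be used by algorithms that expect numeric input.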

The features are fed into the k-means clustering algorithm to explore the underlying structure of the park data. k-means clustering partitions the observations into k clusters such that each observation belongs to the cluster with the nearest mean (centroid), so parks in the same cluster have similar features. Using k-means clustering, we can recommend similar parks to a user based on the input that the user provides.
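One way to turn the clusters into recommendations is sketched below with scikit-learn. The feature rows are made up, the `recommend` helper is hypothetical, and standardizing the features before clustering is an assumption on my part rather than something stated in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows: [star rating, number of reviews, has_hiking_trails, has_volcanoes]
names = ["Park A", "Park B", "Park C", "Park D"]
features = np.array([
    [4.5, 5000, 1, 0],
    [4.5, 4800, 1, 0],
    [4.0, 300, 0, 1],
    [4.0, 350, 0, 1],
], dtype=float)

# Assumption: put all features on a comparable scale so that the review
# count does not dominate the Euclidean distances used by k-means.
scaled = StandardScaler().fit_transform(features)

# Assign each park to the cluster with the nearest centroid.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)


def recommend(park):
    """Return the other parks that fall in the same cluster as `park`."""
    target = labels[names.index(park)]
    return [n for n, label in zip(names, labels) if label == target and n != park]
```

Calling `recommend` with a park the user already likes returns the other parks in its cluster. The number of clusters is a free parameter that would need tuning on the real data.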

About Author

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...
