National Park Web Scraping

Posted on Dec 21, 2016

Contributed by Wann-Jiun Ma. He is currently attending the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his third class project - Web Scraping.

Introduction

We are planning a trip to national parks. With so many adventures to choose from, I thought it would be a good idea to scrape national park information from websites and use the scraped data to build a national park recommendation system for myself. The idea is pretty simple: 1) scrape national park features from websites; 2) perform EDA and data wrangling; 3) build models based on the scraped data; 4) evaluate and enjoy the results. I use both Scrapy and Beautiful Soup to scrape park information, and the websites I scrape are Wikipedia and TripAdvisor. All code can be found at https://github.com/Wann-Jiun/nycdsa_project_3_web_scraping.

Wikipedia Web Scraping Using Scrapy

First, let's get an overview of the national parks in the US. I use Scrapy to scrape park information from Wikipedia. Scrapy is a web scraping framework written in Python. A Scrapy project is built around 'spiders': self-contained crawlers that follow a set of instructions to scrape information from websites. The information I scrape from Wikipedia consists of the park name, location, date established, park size, and number of visitors (2014). After scraping the data, I perform some data wrangling to prepare for EDA in the next stage of the analysis.
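A minimal sketch of such a spider is shown below. The start URL, the CSS selectors, and the column order are assumptions about the Wikipedia page's markup, not the project's exact code.

# A minimal Scrapy spider sketch (not the project's exact spider).
# The table selector and column order are assumptions about the Wikipedia markup.
import scrapy


class WikiParkSpider(scrapy.Spider):
    name = "wiki_parks"
    start_urls = [
        "https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States"
    ]

    def parse(self, response):
        # Walk the rows of the main parks table ("table.wikitable tr" is an assumption).
        for row in response.css("table.wikitable tr"):
            cells = [c.strip() for c in row.css("th ::text, td ::text").extract() if c.strip()]
            if len(cells) < 5:
                continue  # skip the header row and any malformed rows
            yield {
                "name": cells[0],
                "location": cells[1],
                "date_established": cells[2],
                "area": cells[3],
                "visitors_2014": cells[4],
            }

Running the spider with scrapy runspider wiki_parks.py -o parks.csv would write the scraped rows to a CSV file for the wrangling step.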

 

[Figure: table of national park data scraped from Wikipedia]

Let's see which state has the most national parks. I group the parks by state and plot the result. The figure shows that California has the most national parks (9), which I guess is not surprising. Let's see if we can find any interesting facts in the data. I also compute the mean number of 2014 visitors for the parks in each state. The figure shows that Tennessee had the most visitors in 2014, which is interesting since there is only one national park (Great Smoky Mountains) in Tennessee.

[Figure: number of national parks and mean 2014 visitors by state]
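A sketch of that grouping in Pandas follows, assuming the spider output was saved to parks.csv with columns named state, name, and visitors_2014 (the file and column names are assumptions).

# Sketch of the grouping/aggregation described above.
# The file name and column names ("state", "name", "visitors_2014") are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

parks = pd.read_csv("parks.csv")

# Number of national parks per state.
parks_per_state = parks.groupby("state")["name"].count().sort_values(ascending=False)

# Mean 2014 visitors for the parks in each state.
visitors_per_state = parks.groupby("state")["visitors_2014"].mean().sort_values(ascending=False)

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
parks_per_state.plot(kind="bar", ax=axes[0], title="Number of national parks by state")
visitors_per_state.plot(kind="bar", ax=axes[1], title="Mean 2014 visitors by state")
plt.tight_layout()
plt.show()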

TripAdvisor Web Scraping Using Beautiful Soup

Now, let's scrape more information about the national parks. I use Beautiful Soup to scrape data from TripAdvisor. Beautiful Soup is a Python package designed for web scraping, and it is easy to use. The information I scrape includes the park name, review stars, number of reviews, location, park features (hiking trails, valleys, volcanoes, etc.), and URL links.
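A minimal Beautiful Soup sketch of this kind of listing scrape is shown below. The URL is a placeholder and every tag/class selector is an assumption; TripAdvisor's actual markup has to be inspected, and this is not the project's exact code.

# Beautiful Soup sketch for a listing page (not the project's exact code).
# The URL is a placeholder and the tag/class selectors are assumptions.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # many sites reject requests without a user agent
listing_url = "https://www.tripadvisor.com/Attractions-example"  # placeholder URL

soup = BeautifulSoup(requests.get(listing_url, headers=headers).text, "html.parser")

parks = []
for listing in soup.find_all("div", class_="listing"):         # class name is an assumption
    name_tag = listing.find("a", class_="title")                # assumption
    rating_tag = listing.find("span", class_="rating")          # assumption
    reviews_tag = listing.find("span", class_="review_count")   # assumption
    parks.append({
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "stars": rating_tag.get("alt") if rating_tag else None,
        "num_reviews": reviews_tag.get_text(strip=True) if reviews_tag else None,
        "url": name_tag.get("href") if name_tag else None,      # relative link to the park page
    })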

[Figure: TripAdvisor listing page and the fields scraped from it]

Note that the URL links are scraped from the "GET" request information provided by the web browser. Using these links, we can also visit each individual park's page to scrape more information, as sketched below; the next figure shows the information I scrape from each individual park's page.
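Continuing the sketch above, visiting one of the scraped links and pulling a few details could look like the snippet below; the park URL is a placeholder and the selectors are again assumptions about TripAdvisor's markup.

# Sketch of scraping an individual park page reached via one of the scraped links.
# The URL is a placeholder and the selectors are assumptions.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
park_url = "https://www.tripadvisor.com/Attraction_Review-example"  # placeholder park page

park_soup = BeautifulSoup(requests.get(park_url, headers=headers).text, "html.parser")

# Park features such as "Hiking Trails" or "Volcanoes" (class name is an assumption).
features = [tag.get_text(strip=True)
            for tag in park_soup.find_all("div", class_="detail")]

# Count the "things to do" entries listed on the page (class name is an assumption).
things_to_do = len(park_soup.find_all("div", class_="attraction_element"))

print(features, things_to_do)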

[Figure: information scraped from an individual park's page]

Finally, I collect park features including the name, number of reviews, review stars, location, park features, and the number of things to do. The nominal categorical variables, park feature and location (state), are coded as indicator columns using Pandas' "get_dummies" function.
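As a small illustration of that encoding step, assuming the collected data sit in a CSV with columns named state and feature (hypothetical names):

# Sketch of dummy-coding the nominal categorical columns with Pandas.
# The file name and column names ("state", "feature") are assumptions.
import pandas as pd

parks = pd.read_csv("tripadvisor_parks.csv")  # placeholder file of scraped park features

# get_dummies turns each category level into its own 0/1 indicator column.
encoded = pd.get_dummies(parks, columns=["state", "feature"])
print(encoded.head())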

The features are fed into the k-means clustering algorithm to explore the underlying structure of the park data. K-means clustering partitions the observations into k clusters such that each observation belongs to the cluster with the nearest mean. Using k-means clustering, we are able to recommend similar parks to a user based on the input the user provides.
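A minimal sketch of this step with scikit-learn's KMeans is shown below; the number of clusters, the feature columns, and the recommend-from-the-same-cluster helper are illustrative assumptions rather than the project's exact model.

# K-means sketch with scikit-learn (k, column names, and file name are assumptions).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

encoded = pd.read_csv("encoded_parks.csv")  # placeholder: dummy-coded park features
feature_cols = [c for c in encoded.columns if c != "name"]

# Scale the features so no single column dominates the Euclidean distances.
X = StandardScaler().fit_transform(encoded[feature_cols])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
encoded["cluster"] = kmeans.fit_predict(X)

def recommend_parks(park_name, n=5):
    """Recommend parks that fall in the same cluster as the park the user likes."""
    cluster = encoded.loc[encoded["name"] == park_name, "cluster"].iloc[0]
    same = encoded[(encoded["cluster"] == cluster) & (encoded["name"] != park_name)]
    return same["name"].head(n).tolist()

print(recommend_parks("Great Smoky Mountains National Park"))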

About Author

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...
