Happy Postcrossing

Chuan Hong
Posted on Nov 21, 2016

Contributed by Chuan Hong. Chuan is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from September 26th to December 23rd, 2016. This post is based on her class project, Web Scraping.

Introduction

People collect many types of things, such as movie posters, toys, or magnets, for many reasons. My project uses web scraping and data visualization to understand a special group of collectors: postcrossers, who are addicted to exchanging postcards all over the world.

 

Data: Postcrossing.com

On July 14th, 2005, the Postcrossing Project website opened to the public. The goal of the website is to allow people to receive postcards from all over the world, for free. If you send a postcard, you will receive one back from a random postcrosser somewhere in the world. How does it work? The procedure is interesting because it is based on mutual trust. The first step is to request to send a postcard. The website sends you an email with the address of another member and a Postcard ID. You then mail a postcard to that member. Once the member receives the postcard and registers it using the Postcard ID written on it, you become eligible to receive a postcard from another user. You are then in line for the next member who requests to send a postcard. Where the postcard comes from is a surprise!

The statistics listed on the website show that it has millions of members from more than 200 countries, and that a large number of postcards are sent and received within a short period.

postcards

Summary of postcards exchanged (source: Postcrossing.com)

 

Web Scraping

I was personally curious about the distribution of members and about the postcards exchanged in a single day. So, using the Scrapy, Selenium, and Google API packages in Python, I scraped the attributes I needed from the postcrossing.com website.

Here are the methods of web scraping (see the flow chart below):

- Scrapy was used to obtain a list of variables from the website.

- Selenium was used to log in and navigate to the next page.

- Finally, the Google API was used to save all the desired information to a Google spreadsheet.

slide1

Flowchart of web scraping data from Postcrossing.com

 

The web scraping in this project had two parts. Each part and the data it collected are described below.

1. Member Distribution

Data collection: code, country.name, #members, #postcards.sent, and #population

webscrapy1
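To make the shape of this first data set concrete, here is a minimal sketch of turning a country table into records with the five fields listed above. It is not the project's Scrapy spider: it swaps in Python's standard-library `html.parser` so it runs without extra packages, and the markup in `SAMPLE_HTML`, the placeholder countries, and the names `TableRowParser` and `parse_country_table` are all assumptions made for illustration. The real page structure on postcrossing.com may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup with placeholder values; not copied from postcrossing.com.
SAMPLE_HTML = """
<table>
  <tr><th>Code</th><th>Country</th><th>Members</th><th>Postcards sent</th><th>Population</th></tr>
  <tr><td>AA</td><td>Country A</td><td>1,200</td><td>96,000</td><td>5,000,000</td></tr>
  <tr><td>BB</td><td>Country B</td><td>300</td><td>4,500</td><td>60,000,000</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:      # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_country_table(html):
    """Return one dict per country with numeric fields converted to int."""
    parser = TableRowParser()
    parser.feed(html)
    records = []
    for code, name, members, sent, population in parser.rows:
        records.append({
            "code": code,
            "country_name": name,
            "members": int(members.replace(",", "")),
            "postcards_sent": int(sent.replace(",", "")),
            "population": int(population.replace(",", "")),
        })
    return records


records = parse_country_table(SAMPLE_HTML)
```

Stripping the thousands separators before calling `int` matters here, because counts displayed as `1,200` would otherwise fail to convert.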

2. Postcards Exchanged in One Day

First, I scraped each postcard's card_url from the "Postcrossing Gallery" page.

webscrapy2_1

Then, I used Selenium to open the page linked to each postcard and scraped the desired information, which is shown inside the black boxes.

webscrapy2_2
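The first of these two steps, collecting each postcard's card_url from a gallery page, can be sketched as follows. This is an illustration rather than the project's code: it uses the standard-library `html.parser` instead of Scrapy and Selenium, and the gallery markup, the `/postcards/` link prefix, the `BASE_URL` value, and the class name `CardLinkParser` are assumptions. In the actual project, each collected URL would then be opened in a browser session to read the postcard's details.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed address of the gallery page, used only to resolve relative links.
BASE_URL = "https://www.postcrossing.com/gallery"

# Hypothetical gallery markup with placeholder postcard IDs.
GALLERY_HTML = """
<div class="gallery">
  <a href="/postcards/AA-101"><img src="a.jpg"></a>
  <a href="/postcards/BB-202"><img src="b.jpg"></a>
  <a href="/about">About</a>
  <a href="/postcards/AA-101">duplicate link</a>
</div>
"""


class CardLinkParser(HTMLParser):
    """Collect absolute URLs of links whose href starts with a given prefix."""

    def __init__(self, base_url, prefix="/postcards/"):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.card_urls = []  # unique URLs in order of first appearance

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            url = urljoin(self.base_url, href)
            if url not in self.card_urls:  # skip repeated links to the same card
                self.card_urls.append(url)


parser = CardLinkParser(BASE_URL)
parser.feed(GALLERY_HTML)
card_urls = parser.card_urls
```

Keeping the list free of duplicates means the second step visits each postcard page only once.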

Exploratory Data Analysis

1. Visualizing distributions of postcrossers and postcards (sent)

First, let's look at the distribution of postcards (sent). The bar chart shows that over the past 11 years, millions of postcards were sent, mainly from Germany, followed by Russia, the USA, and the Netherlands.

postcards-by-country

Russia, which also ranks high for postcards sent, has the largest number of postcrossers. However, both the USA and Germany have fewer postcrossers than Taiwan and China.

postcrossers-by-country

When I took population into account, the bar chart of density gave a different ranking. It looks like people in Finland, the Vatican, or Taiwan are more likely to be postcrossers.

density-of-members

Meanwhile, the bar chart of postcards sent per postcrosser shows that postcrossers in Finland are especially active senders. On average, each postcrosser in Finland sent more than 100 postcards between 2005 and 2016.

postcards-per-person
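The two derived measures used in these charts, member density and postcards sent per postcrosser, are simple ratios of the scraped fields. The sketch below assumes density means members per 100,000 inhabitants; the post does not state the unit, so that scaling, the function name `add_rates`, and the placeholder countries and numbers are assumptions for illustration only.

```python
PER = 100_000  # assumed scaling: members per 100,000 inhabitants


def add_rates(records):
    """Return copies of the records with density and per-member rates added."""
    rated = []
    for r in records:
        row = dict(r)
        # Multiply before dividing so round numbers stay exact.
        row["members_per_100k"] = r["members"] * PER / r["population"]
        row["postcards_per_member"] = r["postcards_sent"] / r["members"]
        rated.append(row)
    return rated


# Placeholder values, not real Postcrossing statistics.
countries = [
    {"country_name": "Country A", "members": 1_200,
     "postcards_sent": 96_000, "population": 5_000_000},
    {"country_name": "Country B", "members": 300,
     "postcards_sent": 4_500, "population": 60_000_000},
]

rated = add_rates(countries)

# Rank countries by density, highest first, as in the density bar chart.
by_density = sorted(rated, key=lambda r: r["members_per_100k"], reverse=True)
```

Ranking by a rate instead of a raw count is what lets a small country rise to the top of the chart. A real data set would also need a guard for countries with zero members or a missing population figure.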

2. Postcards versus postcrossers

The scatter plot below shows that, across these 210 countries, the number of postcards (sent) correlates positively with the number of postcrossers.

cards-vs-person

Pearson's correlation test: rho=0.96, p<0.05, n=210
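For readers who want to see what the coefficient measures, here is a minimal sketch of Pearson's correlation written with the standard library only. The function name `pearson_r` and the sample lists are illustrative assumptions, not the project's data or code, and the sketch computes only the coefficient, not the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values standing in for per-country counts.
members = [10, 20, 30, 40]
postcards = [120, 190, 330, 410]
r = pearson_r(members, postcards)
```

A value close to 1 means that countries with more postcrossers also tend to have sent more postcards. Getting the significance test as well requires a t-distribution, which a statistics package such as SciPy provides through `scipy.stats.pearsonr`.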

3. Locating postcards registered in one day

A total of 6202 postcards were registered within one day. To visualize how widely the postcards exchanged that day were distributed, I created a map using Leaflet to locate every postcard geographically. Because each postcard corresponds to two postcrossers, the sender and the recipient, I added two options to select: "From" (P sign) and "To" (blue sign).

The distribution of postcards registered in one day (Nov 12, 2016)

4. Mapping postcard routes: Postcards Connecting the World

Using the geographic information for "From" and "To", I mapped the postcards' flight routes. The map shows that Europe, Asia, and North America are the three main regions for postcrossing. The routes are also consistent with the distributions of postcards and postcrossers described above.

mapping

Postcards Connecting the World

5. Travel distance versus travel time

The scatter plot below shows a positive relationship between postcard travel distance and travel time. Compared with postcards exchanged between different countries (red), postcards exchanged within the same country (blue) had relatively shorter travel times and distances.

dist-vs-time

Pearson's correlation test: rho=0.20, p<0.05, n=6202
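The post does not say how travel distance and travel time were obtained, so the following is only one plausible way to derive them from "From" and "To" coordinates and from sent and received dates. It assumes a great-circle (haversine) distance on a sphere with the mean Earth radius; the function names `haversine_km` and `travel_days` are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, treating the Earth as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin outside its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def travel_days(sent, received):
    """Whole days between the date a card was sent and the date it was registered."""
    return (received - sent).days


# Usage with placeholder values: two points on the equator, half a turn apart.
half_way_round = haversine_km(0.0, 0.0, 0.0, 180.0)
days = travel_days(date(2016, 11, 1), date(2016, 11, 12))
```

A straight great-circle distance ignores the actual postal route, which may be one reason distance alone explains only part of the variation in travel time.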

 

About Author

Chuan Hong

Chuan Hong is a Ph.D. Candidate majoring in Public Health at the University of South Carolina. Her main research areas are environmental health sciences, with a focus on environmental epidemiology. By using a series of data collection, statistical...