Contributed by Chuan Hong. Chuan is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from September 26th to December 23rd, 2016. This post is based on her class project on web scraping.
People collect many kinds of things, such as movie posters, toys, or magnets, for many reasons. My project uses web scraping and data visualization to understand a special group of collectors: postcrossers, who are addicted to exchanging postcards with people all over the world.
On July 14th, 2005, the Postcrossing Project website opened to the public. Its goal is to let people receive postcards from all over the world, for free: if you send a postcard, you will receive one back from a random postcrosser somewhere in the world. How does it work? The procedure is simple and rests on mutual trust. First, you request to send a postcard. The website emails you the address of another member along with a Postcard ID. You then mail a postcard to that member. Once the member receives the postcard and registers it using the Postcard ID written on it, you become eligible to receive a postcard from another user: you are now in line for the next member who requests to send a postcard. Where your postcard comes from is a surprise!
The statistics listed on the website show that it has millions of members from more than 200 countries, and that a large number of postcards are sent and received within a short period of time.
I was curious about the distribution of members and about the postcards exchanged in a single day. So, using the scrapy, selenium and GoogleAPI packages in Python, I scraped the relevant attributes from the postcrossing.com website.
The scraping workflow used three tools (see the flow chart below):
- Scrapy was used to extract a list of variables from the website.
- Selenium was used to log in and navigate to the next page.
- Finally, GoogleAPI was used to save all of the collected information to a Google spreadsheet.
The web scraping in this project had two parts. The data collected in each part is listed below.
1. Member Distribution
Data collection: code, country.name, #members, #postcards.sent, and #population
2. Postcards Exchanged in One Day
First, I scraped each postcard's card_url from the "Postcrossing Gallery" page.
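As a rough illustration of the first scraping part, the sketch below pulls the five member-distribution fields out of an HTML table. The project itself used Scrapy for this step. To keep the example dependency-free, it swaps in Python's standard-library html.parser, and the table markup, the sample rows, and the names CountryRowParser, parse_countries, and FIELDS are all hypothetical rather than taken from postcrossing.com.

```python
from html.parser import HTMLParser

# Field names follow the list in the post; the table layout is an assumption.
FIELDS = ["code", "country.name", "members", "postcards.sent", "population"]


class CountryRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, one dict per country
        self._cells = None     # cells of the row being read, None outside a row
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Keep only rows with exactly one cell per field (skips header rows).
            if self._cells is not None and len(self._cells) == len(FIELDS):
                self.rows.append(dict(zip(FIELDS, self._cells)))
            self._cells = None
            self._in_cell = False


def parse_countries(html):
    """Return one dict per country row, with numeric columns converted to int."""
    parser = CountryRowParser()
    parser.feed(html)
    for row in parser.rows:
        for key in ("members", "postcards.sent", "population"):
            row[key] = int(row[key].replace(",", ""))  # drop thousands separators
    return parser.rows


# Made-up markup and numbers, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Code</th><th>Country</th><th>Members</th><th>Sent</th><th>Population</th></tr>
  <tr><td>AA</td><td>Exampleland</td><td>1,200</td><td>34,500</td><td>5,000,000</td></tr>
  <tr><td>BB</td><td>Samplestan</td><td>80</td><td>950</td><td>250,000</td></tr>
</table>
"""

countries = parse_countries(SAMPLE_HTML)
```

With Scrapy, the same extraction would typically be written with CSS or XPath selectors inside a spider's parse method. The row-to-dict step would look much the same.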
Exploratory Data Analysis
1. Visualizing distributions of postcrossers and postcards (sent)
First, let's look at the distribution of postcards sent. The bar chart shows that, of the millions of postcards sent during the past 11 years, the largest share came from Germany, followed by Russia, the USA, and the Netherlands.
The ranking of postcrossers is similar to that of postcards: Russia has the largest number of postcrossers. However, both the USA and Germany have fewer postcrossers than Taiwan and China.
When I took population into account, the bar chart of density gave a different ranking. It looks like people in Finland, the Vatican, or Taiwan are more likely to be postcrossers.
Meanwhile, the bar chart of postcards sent per postcrosser shows that postcrossers in Finland tend to send more postcards than those elsewhere. On average, each postcrosser in Finland sent more than 100 postcards from 2005 to 2016.
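The two derived measures above are simple ratios of the scraped columns. The sketch below shows one way to compute them and why the rankings can change. The post does not give the exact density formula, so "members per 100,000 inhabitants" is an assumption, and the three countries and all of their numbers are made up for illustration.

```python
def add_ratios(rows):
    """Return copies of the rows with two derived columns added.

    members_per_100k: postcrossers per 100,000 inhabitants (assumed density).
    sent_per_member:  postcards sent per postcrosser.
    """
    enriched_rows = []
    for row in rows:
        enriched = dict(row)
        enriched["members_per_100k"] = 100_000 * row["members"] / row["population"]
        enriched["sent_per_member"] = row["postcards_sent"] / row["members"]
        enriched_rows.append(enriched)
    return enriched_rows


def rank_by(rows, key):
    """Country names ordered from the largest to the smallest value of `key`."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    return [r["country"] for r in ordered]


# Hypothetical countries and counts, not real Postcrossing statistics.
SAMPLE = [
    {"country": "A", "members": 1_000, "postcards_sent": 50_000, "population": 1_000_000},
    {"country": "B", "members": 5_000, "postcards_sent": 100_000, "population": 100_000_000},
    {"country": "C", "members": 200, "postcards_sent": 24_000, "population": 50_000},
]

stats = add_ratios(SAMPLE)
by_members = rank_by(stats, "members")
by_density = rank_by(stats, "members_per_100k")
by_activity = rank_by(stats, "sent_per_member")
```

In this toy data, country B has the most members in absolute terms, but the small country C comes first once population is taken into account. This is the same kind of reordering described above.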
2. Postcards versus postcrossers
The scatter plot below shows that, across these 210 countries, the number of postcards sent is positively correlated with the number of postcrossers.
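A positive correlation like this can be quantified with Pearson's correlation coefficient. The post does not say which statistic, if any, was computed, so the sketch below is only one reasonable choice, and the four data points are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-country counts, not the scraped data.
members = [10, 200, 3_000, 45_000]
postcards_sent = [150, 4_000, 90_000, 2_000_000]

r = pearson_r(members, postcards_sent)
```

A value of r close to 1 indicates a strong positive linear relationship. Because country sizes span several orders of magnitude, it can also be worth checking the correlation on log-transformed counts.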
3. Locating postcards registered in one day
A total of 6202 postcards were registered within one day. To visualize how widely these postcards were distributed, I created a map using Leaflet to locate every postcard geographically. Because each postcard corresponds to two postcrossers, the Sender and the Recipient, I added two options to select: "From" (P sign) and "To" (blue sign).
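One way to prepare such point data for a Leaflet map is to convert each postcard into two GeoJSON point features, one per role, which Leaflet's JavaScript API can load with L.geoJSON. The post does not describe how the map data was actually prepared, so this is only a sketch under that assumption. The postcard IDs are hypothetical and the coordinates are approximate city locations used for illustration.

```python
import json


def postcard_features(cards):
    """Build a GeoJSON FeatureCollection with a 'from' and a 'to' point per card."""
    features = []
    for card in cards:
        for role in ("from", "to"):
            lat, lon = card[role]
            features.append({
                "type": "Feature",
                # GeoJSON stores coordinates as [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"postcard_id": card["id"], "role": role},
            })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical postcards; each location is a (latitude, longitude) pair.
SAMPLE_CARDS = [
    {"id": "XX-0001", "from": (52.52, 13.405), "to": (40.7128, -74.006)},
    {"id": "XX-0002", "from": (60.1699, 24.9384), "to": (25.033, 121.5654)},
]

collection = postcard_features(SAMPLE_CARDS)
geojson_text = json.dumps(collection)  # text that a web map could load
```

Keeping the role in each feature's properties makes it easy to filter the markers into the "From" and "To" layers on the map side.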
4. Mapping postcards routes: Postcards Connecting the World
Using the geographic information of "From" and "To", I mapped the postcards' flight routes. The map shows that Europe, Asia, and North America are the three main regions for postcrossing. These routes are consistent with the distributions of postcards and postcrossers described above.
5. Travel distance versus travel time
The scatter plot below shows a positive relationship between postcard travel distance and travel time. Meanwhile, compared to postcards exchanged between different countries (red), postcards exchanged within the same country (blue) had relatively shorter travel times and distances.
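The post does not say how travel distance and travel time were obtained. One common approach, sketched below as an assumption, is to compute the great-circle (haversine) distance between the "From" and "To" coordinates, take the number of days between the sent and received dates, and then compare domestic and international postcards. The two sample postcards, their dates, and the field names are made up.

```python
import math
from datetime import date
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def summarize(cards):
    """Mean distance and travel days, split into domestic and international cards."""
    groups = {"domestic": [], "international": []}
    for card in cards:
        same_country = card["from_country"] == card["to_country"]
        kind = "domestic" if same_country else "international"
        km = haversine_km(*card["from"], *card["to"])
        days = (card["received"] - card["sent"]).days
        groups[kind].append((km, days))
    return {
        kind: {
            "mean_km": mean(km for km, _ in pairs),
            "mean_days": mean(days for _, days in pairs),
        }
        for kind, pairs in groups.items()
        if pairs  # skip empty groups
    }


# Hypothetical postcards with approximate city coordinates (lat, lon).
SAMPLE_CARDS = [
    {"from": (52.52, 13.405), "to": (48.1351, 11.582),
     "from_country": "DE", "to_country": "DE",
     "sent": date(2016, 11, 1), "received": date(2016, 11, 3)},
    {"from": (52.52, 13.405), "to": (40.7128, -74.006),
     "from_country": "DE", "to_country": "US",
     "sent": date(2016, 11, 1), "received": date(2016, 11, 15)},
]

summary = summarize(SAMPLE_CARDS)
```

The haversine formula treats the Earth as a sphere, which is accurate enough for comparing postcard routes at this scale.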