Data Scraping the Skyscrapers using Scrapy
Contributed by Conred Wang. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from September 26th to December 23rd, 2016. This post is based on his third class project - the Web Scraping Project (due in the 7th week of the program).
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
The Skyscraper Center publishes various types of data and information about the world's skyscrapers.
For example, the 100 Tallest Completed Buildings in the World by Height to Architectural Top:
The main page even includes a downloadable PDF. However, some data, such as the year each skyscraper was Proposed and its Construction Start year, is only available on the secondary pages:
In order to obtain all the data we needed from the main page and all the secondary pages, we used Scrapy:
> An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Data
These 100 skyscrapers are located in 13 countries:

| Count | Code | Country | Count | Code | Country | Count | Code | Country |
|---|---|---|---|---|---|---|---|---|
| 21 | AE | United Arab Emirates | 01 | AU | Australia | 45 | CN | China |
| 01 | GB | United Kingdom | 01 | KR | South Korea | 01 | KW | Kuwait |
| 03 | MY | Malaysia | 03 | RU | Russia | 02 | SA | Saudi Arabia |
| 02 | TH | Thailand | 02 | TW | Taiwan | 17 | US | USA |
| 01 | VN | Vietnam | | | | | | |
Statistics about these 100 skyscrapers:
<Usages>
- 40 are multipurpose.
- 74 are used for office.
- 43 are used for hotel.
- 29Β are used for residential.
- 2Β are used for retail.
<Totals>
- 7,758 floors combined.
- 118,653 feet of combined height.
<Time>
- 46 do not have a Proposed Year listed.
- 3 do not have a Construction Start Year listed.
- From Proposed to Construction Start:
  - Cannot be computed for the 46 with no Proposed Year.
  - Shortest took 0 years.
  - Longest took 9 years.
- From Construction Start to Completion:
  - Cannot be computed for the 3 with no Construction Start Year.
  - Shortest took 1 year.
  - Longest took 11 years.
Q: One year to build a skyscraper! Really?
A: No kidding. There are actually two such skyscrapers:
(Image: the two skyscrapers that went from construction start to completion in one year.)
About using Scrapy
Scrapy is simple and easy to use.
As depicted in the "A dataflow overview" diagram below (which can be found at The ITC Prog Blog), we only needed to write three short Python scripts ("items.py", "pipelines.py", and "skyscraper_spider.py"); Scrapy handled all the data extraction from the Skyscraper Center web pages for us:
We include all three Python scripts below.
It is worth mentioning that:
- "scrapy shell <url>" and Google's Chrome inspect are the two indispensable tools when web scraping with Scrapy.
- Although we love Scrapy, it is not perfect yet. For example, Scrapy will not tell you your Python code indentation is improper.
- With UTF-8-encoded pages, applying the str function to text containing non-ASCII Unicode characters (for example, u'\u2026', the horizontal ellipsis) will raise an exception. <object>.encode('ascii', 'ignore') can be used instead; you can find an example on line 21 of "pipelines.py".
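As an illustration of the first point above, the typical workflow is to locate a candidate element with Chrome's Inspect tool and then try a selector interactively in scrapy shell before putting it into the spider. The XPath below is a placeholder we chose for illustration, not the Skyscraper Center's real markup:

```python
# Started from the command line with:  scrapy shell <url of the ranking page>
# Inside the shell, `response` holds the downloaded page, so candidate
# selectors (found with Chrome's Inspect) can be tested interactively.
# The XPath below is a placeholder, not the site's actual markup.
response.xpath('//table//tr[td]/td[2]//a/text()').extract_first()   # a building name
response.xpath('//table//tr[td]/td[2]//a/@href').extract()          # links to secondary pages
```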
1. items.py
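Since the script itself is not reproduced here, below is a minimal sketch of what an items.py for this project could look like. The field names are assumptions based on the data discussed above, not necessarily the author's exact definitions.

```python
# A minimal sketch of items.py. The field names are assumptions based on the
# data discussed in this post, not necessarily the author's exact definitions.
import scrapy


class SkyscraperItem(scrapy.Item):
    # Scraped from the main "100 Tallest Completed Buildings" page
    rank = scrapy.Field()
    name = scrapy.Field()
    city = scrapy.Field()
    height_ft = scrapy.Field()
    floors = scrapy.Field()
    completed_year = scrapy.Field()
    functions = scrapy.Field()
    # Scraped from each building's secondary (detail) page
    proposed_year = scrapy.Field()
    construction_start_year = scrapy.Field()
```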
2. pipelines.py
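Again as a sketch only: a pipeline that writes each item to a CSV file, using the encode('ascii', 'ignore') workaround mentioned above. The file name, column order, and the Python 2 environment are assumptions on our part, so the line numbers will not match the "line 21" mentioned in the bullet list.

```python
# A minimal sketch of pipelines.py, assuming the items are written to a CSV
# file under Python 2 (the environment of the original 2016 post). The file
# name and column order are illustrative only.
import csv

FIELDS = ['rank', 'name', 'city', 'height_ft', 'floors', 'proposed_year',
          'construction_start_year', 'completed_year', 'functions']


class SkyscraperPipeline(object):

    def open_spider(self, spider):
        self.file = open('skyscrapers.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(FIELDS)

    def process_item(self, item, spider):
        row = []
        for field in FIELDS:
            value = item.get(field, '')
            if isinstance(value, unicode):
                # str() on text containing non-ASCII characters such as
                # u'\u2026' raises UnicodeEncodeError, so drop those bytes.
                value = value.encode('ascii', 'ignore')
            row.append(value)
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()
```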
3. skyscraper_spider.py
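Finally, a sketch of the spider itself: it parses the ranking table on the main page, then follows each building's link to its secondary page to pick up the Proposed and Construction Start years. The start URL and every XPath below are placeholders we chose for illustration; the real Skyscraper Center markup (and the author's selectors) will differ.

```python
# A minimal sketch of skyscraper_spider.py. The start URL and all XPath
# selectors are placeholders for illustration; the real page markup differs.
import scrapy

from skyscrapers.items import SkyscraperItem  # hypothetical project module


class SkyscraperSpider(scrapy.Spider):
    name = 'skyscraper'
    start_urls = ['http://www.skyscrapercenter.com/buildings']  # assumed URL

    def parse(self, response):
        # Each row of the ranking table holds one building plus a link to
        # its secondary (detail) page.
        for row in response.xpath('//table//tr[td]'):
            item = SkyscraperItem()
            item['rank'] = row.xpath('./td[1]//text()').extract_first()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            item['city'] = row.xpath('./td[3]//text()').extract_first()
            item['completed_year'] = row.xpath('./td[4]//text()').extract_first()
            item['height_ft'] = row.xpath('./td[5]//text()').extract_first()
            item['floors'] = row.xpath('./td[6]//text()').extract_first()
            item['functions'] = row.xpath('./td[7]//text()').extract_first()
            detail_url = row.xpath('./td[2]//a/@href').extract_first()
            # Follow the secondary page to get the Proposed and Construction
            # Start years, carrying the partially filled item along.
            yield scrapy.Request(response.urljoin(detail_url),
                                 callback=self.parse_building,
                                 meta={'item': item})

    def parse_building(self, response):
        item = response.meta['item']
        item['proposed_year'] = response.xpath(
            '//div[contains(., "Proposed")]/following-sibling::div[1]/text()'
        ).extract_first()
        item['construction_start_year'] = response.xpath(
            '//div[contains(., "Construction Start")]/following-sibling::div[1]/text()'
        ).extract_first()
        yield item
```

With these three scripts in place, the crawl can be started from the project directory with "scrapy crawl skyscraper" (using the spider name assumed above), and the pipeline writes the results out as they arrive.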
(end)