Data Scraping Trulia and Zillow
For this project I used the following Python packages:
- Beautiful Soup
- Scrapy
- Selenium
The main goal of this project was to gather data, preprocess it, and prepare it for further analysis.
Introduction
To scrape real estate listing information from zillow.com I used the Selenium Python bindings. Selenium itself is designed for creating robust, browser-based regression automation suites and tests; in other words, it is an automated testing suite. The Selenium Python bindings provide access to Selenium WebDriver, which lets the user communicate directly with the web browser and write functions and execute tasks from a Python programming environment.
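To give an idea of what driving a browser with the Selenium bindings looks like, here is a minimal sketch. It assumes Chrome with a matching chromedriver on the PATH, and the search-box name is a placeholder I made up rather than Zillow's actual markup.

```python
# Minimal Selenium sketch: launch a browser, open the site, type a search query.
# Assumes Chrome + chromedriver are installed; "searchbar" is a placeholder name.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()              # start a browser the script can control
driver.get("https://www.zillow.com/")    # navigate exactly as a user would

search = driver.find_element(By.NAME, "searchbar")   # locate the search box (placeholder name)
search.send_keys("New York, NY", Keys.ENTER)         # type the area of interest and submit

driver.quit()                            # close the browser when finished
```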
When one goes to zillow.com and types in the area where one wants to buy or rent real estate, she is presented with an interactive webpage showing a map of the area dotted with the locations of the listings and, on the right side, 20 listings per page.
In order to understand what it was that I wanted to automate with Selenium, I first had to browse the listings manually, observing and recording my own actions as I went. This step gave me an initial idea of the algorithm to be written for automation.
There are two aspects to scraping zillow.com with Selenium:
- automating the process of navigating to the final page containing the information of interest
- retrieving the information and repeating the first step
Data Scraping Process
To reach the final web page containing all of the descriptions and information for any one particular listing, I had to go through several actions:
1. click on a listing: this opens an embedded webpage with some preliminary information
2. wait until the page loads
3. scroll down and click on "More" to "expand" the page and get access to all the features and the full description of the listed apartment/house
4. read the content
5. close the embedded webpage of that particular listing (hence returning to the original web page with the first 20 listings and the map of the area)
6. click on the next listing
7. repeat steps 1 through 5
8. after looking through all the listings on the first page, click on the link to the next page
9. wait until the next page with 20 more listings loads
10. repeat steps 1 through 6
This is a rough representation of the initial chain of actions I wanted to automate with Selenium; a condensed code sketch of that loop follows below.
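The snippet below is only a sketch of that loop, under some loud assumptions: the class names, button texts, and XPaths ("listing-card", "More", the next-page link) are placeholders for illustration, and the real selectors have to be found by inspecting zillow.com.

```python
# Sketch of the click / wait / expand / read / close loop described above.
# All selectors below are placeholders, not Zillow's real markup.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homes/New-York-NY_rb/")   # placeholder search URL
wait = WebDriverWait(driver, 10)

while True:                                                   # one iteration per results page
    for card in driver.find_elements(By.CLASS_NAME, "listing-card"):   # step 1
        card.click()                                          # open the embedded listing page
        more = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//button[text()='More']")))           # step 2: wait for the page to load
        driver.execute_script("arguments[0].scrollIntoView();", more)
        more.click()                                          # step 3: expand the hidden details
        # step 4: read the now-visible fields here (see the next section)
        driver.find_element(
            By.XPATH, "//button[@aria-label='Close']").click()  # step 5: back to the results
        time.sleep(1)                                         # steps 6-7 happen on the next pass
    try:
        driver.find_element(By.XPATH, "//a[@rel='next']").click()   # step 8: next results page
        time.sleep(2)                                         # step 9: let the new page load
    except Exception:
        break                                                 # no more pages, so stop
```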
Steps 3 and 4
The actual scraping and writing of information happens mainly in steps 3 and 4.
Step 3 is required because, when inspecting the webpage, the XPaths to the information are hidden. They only become visible (hence "scrapable") when we click on the "More" button.
Step 4 mainly consists of finding the correct XPaths to all the different bits of information of interest.
Step 4 can be broken down into the following smaller steps:
a. for each listing, initialize an empty Python dictionary
b. find the XPath of each bit of information, e.g. price, number of bedrooms, floor size, etc.
c. store this information as a value of the dictionary initialized in step a, under a descriptive key name, e.g. {"price": "$256,000"}, where "$256,000" was extracted in step b
d. after constructing the dictionary of all the {key: value} pairs of interest, write it to a CSV file: each dictionary gets one row, where the column names are the keys of the dictionary and the values populating the columns are the values of the dictionary
e. enclose steps a through d in a for loop to iterate through each listing
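A small sketch of steps a through d is given below. It assumes a Selenium driver already showing an expanded listing; the URL and every XPath in it are made-up placeholders, and the point is only the dictionary-to-CSV mechanics using Python's csv.DictWriter.

```python
# Sketch of steps a-d: one dictionary per listing, one CSV row per dictionary.
# The listing URL and the XPaths are placeholders, not Zillow's real markup.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/placeholder-listing/")   # placeholder URL

def scrape_listing(drv):
    # steps a-c: start from an empty dict and fill it with one XPath lookup per field
    info = {}
    info["price"] = drv.find_element(By.XPATH, "//span[@class='price']").text
    info["bedrooms"] = drv.find_element(By.XPATH, "//span[@class='beds']").text
    info["floor_size"] = drv.find_element(By.XPATH, "//span[@class='sqft']").text
    return info

with open("zillow_listings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["price", "bedrooms", "floor_size"])
    writer.writeheader()                     # column names are the dictionary keys
    writer.writerow(scrape_listing(driver))  # step d: the dictionary becomes one CSV row
    # step e: in the full script this last call sits inside the loop over all listings
```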
Below is the GitHub link to the script implementing the algorithm described above:
https://github.com/Vacun/Vacun.github.io/blob/master/Selenium/zillow%20with%20Selenium/zillow.py
Trulia.com
The website's UI is similar to zillow.com, with listings on the left half of the page and the map on the right side.
The key trick to simplifying the scraping process was the following:
If the website has its metadata stored in a JSON dictionary format, that's a score!
Steps of discovery:
- Open the browser's inspector while on the webpage with the listings in it.
- Press Command + F to activate the search bar in the inspector.
- Type "json".
- Inspect each of the search results (15 to 20 results).
- Find the tag that contains the metadata of the website in JSON format.
Conclusion
After inspecting each one of the search results, I was able to find the tag that contained a relatively large JSON dictionary: a sign of useful information. Closer inspection revealed that it did indeed contain all the information I was interested in regarding each listing on that particular page. To be more precise, the tag contained several concatenated JSON dictionaries with different metadata. That meant that, after scraping this information, I would have to use regular expressions and Python's string manipulation to extract the dictionary of interest.
I used Python's json package to help me parse the scraped information into a Python dictionary.
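Put together, the extraction looks roughly like the sketch below. The spider name, start URL, the "appState" marker, the regular expression, and the key names inside the parsed dictionary are all assumptions for illustration; the real tag and keys are whatever the inspector search above turns up.

```python
# Sketch of a Scrapy spider that pulls embedded JSON metadata out of a <script> tag.
# The URL, the "appState" marker, the regex, and the key names are illustrative only.
import json
import re
import scrapy

class TruliaSpider(scrapy.Spider):
    name = "trulia"
    start_urls = ["https://www.trulia.com/NY/New_York/"]

    def parse(self, response):
        # collect the text of every <script> tag and keep the one holding the metadata
        scripts = response.xpath("//script/text()").getall()
        raw = next(s for s in scripts if "appState" in s)

        # the tag holds several concatenated objects, so cut out just the one of interest
        match = re.search(r"appState\s*=\s*(\{.*?\});", raw, re.DOTALL)
        data = json.loads(match.group(1))            # parse the JSON string into a dict

        for listing in data.get("listings", []):     # key name is an assumption
            yield {
                "price": listing.get("price"),
                "bedrooms": listing.get("beds"),
                "address": listing.get("address"),
            }
```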
Below is the GitHub link to the Scrapy spider for trulia.com:
https://github.com/Vacun/Vacun.github.io/tree/master/Scraping/trulia/trulia