Data Scraping Trulia and Zillow
For this project I used the following Python packages:
- Beautiful Soup
- Scrapy
- Selenium
The main goal of this project was to gather data, preprocess it, and prepare it for further analysis.
Introduction
To scrape real estate listing information from zillow.com I used the Selenium Python bindings. Selenium itself is designed for creating robust, browser-based regression automation suites and tests; in other words, it is an automated testing suite. The Selenium Python bindings provide access to Selenium WebDriver, which lets the user communicate directly with the web browser and write functions and execute tasks from a Python programming environment.
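To give an idea of what driving a browser with the Selenium bindings looks like, here is a minimal sketch. It assumes Chrome with a matching chromedriver on the PATH, and the search-box name is a placeholder I made up rather than Zillow's actual markup.

```python
# Minimal Selenium sketch: launch a browser, open the site, type a search query.
# Assumes Chrome + chromedriver are installed; "searchbar" is a placeholder name.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()              # start a browser the script can control
driver.get("https://www.zillow.com/")    # navigate exactly as a user would

search = driver.find_element(By.NAME, "searchbar")   # locate the search box (placeholder name)
search.send_keys("New York, NY", Keys.ENTER)         # type the area of interest and submit

driver.quit()                            # close the browser when finished
```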
When one goes to zillow.com and types in the area where one wants to buy or rent real estate, she is presented with an interactive webpage showing a map of the area dotted with the locations of the listings and, on the right side, 20 listings per page.
In order to understand what it was that I wanted to automate with Selenium, I first had to browse the listings manually, observing and recording my own actions as I went. This step gave me an initial idea of the algorithm to be written for automation.
There are two aspects to scraping zillow.com with Selenium:
- automating the process of navigating to the final page containing the information of interest
- retrieving the information and repeating the first step
Data Scraping Process
To reach the final web page containing all of the descriptions and information for any one particular listing, I had to go through several actions:
1. click on a listing: this opens an embedded webpage with some preliminary information
2. wait until the page loads
3. scroll down and click on "More" to "expand" the page and get access to all the features and the full description of the listed apartment/house
4. read the content
5. close the embedded webpage of that particular listing (hence returning to the original web page with the first 20 listings and the map of the area)
6. click on the next listing
7. repeat steps 1 through 5
8. after looking through all the listings on the first page, click on the link to the next page
9. wait until the next page with 20 more listings loads
10. repeat steps 1 through 6
This is a rough representation of the initial chain of actions I wanted to automate with Selenium; a condensed code sketch of that loop follows below.
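The snippet below is only a sketch of that loop, under some loud assumptions: the class names, button texts, and XPaths ("listing-card", "More", the next-page link) are placeholders for illustration, and the real selectors have to be found by inspecting zillow.com.

```python
# Sketch of the click / wait / expand / read / close loop described above.
# All selectors below are placeholders, not Zillow's real markup.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homes/New-York-NY_rb/")   # placeholder search URL
wait = WebDriverWait(driver, 10)

while True:                                                   # one iteration per results page
    for card in driver.find_elements(By.CLASS_NAME, "listing-card"):   # step 1
        card.click()                                          # open the embedded listing page
        more = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//button[text()='More']")))           # step 2: wait for the page to load
        driver.execute_script("arguments[0].scrollIntoView();", more)
        more.click()                                          # step 3: expand the hidden details
        # step 4: read the now-visible fields here (see the next section)
        driver.find_element(
            By.XPATH, "//button[@aria-label='Close']").click()  # step 5: back to the results
        time.sleep(1)                                         # steps 6-7 happen on the next pass
    try:
        driver.find_element(By.XPATH, "//a[@rel='next']").click()   # step 8: next results page
        time.sleep(2)                                         # step 9: let the new page load
    except Exception:
        break                                                 # no more pages, so stop
```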
Steps 3 and 4
The actual scraping and writing of information happens mainly in steps 3 and 4.
Step 3 is required because, when inspecting the webpage, the XPaths to the information are hidden. They only become visible (hence "scrapable") when we click on the "More" button.
Step 4 mainly consists of finding the correct XPaths to all the different bits of information of interest.
Step 4 can be broken down into the following smaller steps:
a. for each listing, initialize an empty Python dictionary
b. find the XPath of each bit of information, e.g. price, number of bedrooms, floor size, etc.
c. store this information as a value of the dictionary initialized in step a, under a descriptive key name, e.g. {"price": "$256,000"}, where "$256,000" was extracted in step b
d. after constructing the dictionary of all the {key: value} pairs of interest, write it to a CSV file: each dictionary gets one row, where the column names are the keys of the dictionary and the values populating the columns are the values of the dictionary
e. enclose steps a through d in a for loop to iterate through each listing
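A small sketch of steps a through d is given below. It assumes a Selenium driver already showing an expanded listing; the URL and every XPath in it are made-up placeholders, and the point is only the dictionary-to-CSV mechanics using Python's csv.DictWriter.

```python
# Sketch of steps a-d: one dictionary per listing, one CSV row per dictionary.
# The listing URL and the XPaths are placeholders, not Zillow's real markup.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/placeholder-listing/")   # placeholder URL

def scrape_listing(drv):
    # steps a-c: start from an empty dict and fill it with one XPath lookup per field
    info = {}
    info["price"] = drv.find_element(By.XPATH, "//span[@class='price']").text
    info["bedrooms"] = drv.find_element(By.XPATH, "//span[@class='beds']").text
    info["floor_size"] = drv.find_element(By.XPATH, "//span[@class='sqft']").text
    return info

with open("zillow_listings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["price", "bedrooms", "floor_size"])
    writer.writeheader()                     # column names are the dictionary keys
    writer.writerow(scrape_listing(driver))  # step d: the dictionary becomes one CSV row
    # step e: in the full script this last call sits inside the loop over all listings
```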
Below is the GitHub link to the script implementing the algorithm described above:
https://github.com/Vacun/Vacun.github.io/blob/master/Selenium/zillow%20with%20Selenium/zillow.py
Trulia.com
The website's UI is similar to zillow.com, with listings on the left half of the page and the map on the right side.
The key trick to simplifying the scraping process was the following:
If the website has its metadata stored in a JSON dictionary format, that's a score!
Steps of discovery:
- Open the browser's inspector while on the webpage with the listings in it.
- Press Command + F to activate the search bar in the inspector.
- Type "json".
- Inspect each of the search results (15 to 20 results).
- Find the tag that contains the metadata of the website in JSON format.
Conclusion
After inspecting each one of the search results, I was able to find the tag that contained a relatively large JSON dictionary: a sign of useful information. Closer inspection revealed that it did indeed contain all the information I was interested in regarding each listing on that particular page. To be more precise, the tag contained several concatenated JSON dictionaries with different metadata. That meant that, after scraping this information, I would have to use regular expressions and Python's string manipulation to extract the dictionary of interest.
I used Python's json package to help me parse the scraped information into a Python dictionary.
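Put together, the extraction looks roughly like the sketch below. The spider name, start URL, the "appState" marker, the regular expression, and the key names inside the parsed dictionary are all assumptions for illustration; the real tag and keys are whatever the inspector search above turns up.

```python
# Sketch of a Scrapy spider that pulls embedded JSON metadata out of a <script> tag.
# The URL, the "appState" marker, the regex, and the key names are illustrative only.
import json
import re
import scrapy

class TruliaSpider(scrapy.Spider):
    name = "trulia"
    start_urls = ["https://www.trulia.com/NY/New_York/"]

    def parse(self, response):
        # collect the text of every <script> tag and keep the one holding the metadata
        scripts = response.xpath("//script/text()").getall()
        raw = next(s for s in scripts if "appState" in s)

        # the tag holds several concatenated objects, so cut out just the one of interest
        match = re.search(r"appState\s*=\s*(\{.*?\});", raw, re.DOTALL)
        data = json.loads(match.group(1))            # parse the JSON string into a dict

        for listing in data.get("listings", []):     # key name is an assumption
            yield {
                "price": listing.get("price"),
                "bedrooms": listing.get("beds"),
                "address": listing.get("address"),
            }
```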
Below is the GitHub link to the Scrapy spider for trulia.com:
https://github.com/Vacun/Vacun.github.io/tree/master/Scraping/trulia/trulia