Scraping Trulia and Zillow

Vahe Voskerchyan
Posted on Apr 21, 2017

For this project I used the following Python packages:

Beautiful Soup

Scrapy

Selenium

The main goal of this project was to gather data, preprocess it, and prepare it for further analysis.

zillow.com: Selenium.

To scrape real estate listing information from zillow.com I used the Selenium Python bindings. Selenium itself is a tool for creating robust, browser-based regression automation suites and tests; in other words, it is an automated testing suite. The Python bindings give access to Selenium WebDriver, which lets the user communicate directly with the web browser, write functions, and execute tasks from a Python programming environment.
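As a minimal illustration (not the project script itself), this is roughly how a Selenium session is started and pointed at a search. The element name used for the search box and the search term are placeholders, not Zillow's actual markup:

```python
# Minimal sketch of driving a browser with the Selenium Python bindings
# (the 2017-era API). The element name is an illustrative placeholder.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()            # launch a Chrome session controlled by WebDriver
driver.get("https://www.zillow.com")   # open the landing page

search_box = driver.find_element_by_name("citystatezip")  # placeholder element name
search_box.send_keys("New York, NY")
search_box.send_keys(Keys.RETURN)      # submit the search and load the listings page
```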

When one goes to zillow.com and types in the area of interest for buying or renting real estate, she is presented with an interactive webpage that has a map of the area dotted with the locations of the listings and, on the right side, 20 listings per page.

To understand what it was that I wanted to automate with Selenium, I first had to browse the listings and observe and record my own actions while browsing. This step gave me an initial idea of the algorithm to be written for the automation.

There are two aspects of scraping zillow.com with Selenium.

  1. Automating the process of navigating to the final page containing the information of interest
  2. Retrieving the information and repeating step 1.

To reach the final web page containing all the descriptions and information for any one particular listing, I had to go through several actions:

  1. Click on a listing: this opens an embedded webpage with some preliminary information.
  2. Wait until the page loads.
  3. Scroll down and click on ‘More’ to ‘expand’ the page and get access to all the features and the full description of the listed apartment/house.
  4. Read the content.
  5. Close the embedded webpage of that particular listing (hence returning to the original web page with the first 20 listings and the map of the area).
  6. Click on the next listing.
  7. Repeat steps 1 through 5.
  8. After looking through all the listings on the first page, click on the link to the next page.
  9. Wait until the next page with 20 more listings loads.
  10. Repeat steps 1 through 9 until all pages have been visited.

This is a rough representation of the initial chain of actions I wanted to automate with Selenium.
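A rough sketch of that loop is below, using the Selenium API as it looked around 2017. Every locator is an illustrative placeholder rather than Zillow's actual markup, and expand_listing() / read_listing() are hypothetical helpers sketched in the next sections; the working script is linked at the end of this section.

```python
# Rough sketch of the navigation loop described above. All locators are
# placeholders; expand_listing() and read_listing() are hypothetical helpers
# sketched further down.
import time

def scrape_search_results(driver, pages=5):
    records = []
    for _ in range(pages):
        cards = driver.find_elements_by_class_name("zsg-photo-card")  # placeholder locator
        for i in range(len(cards)):
            # re-find the cards each iteration to avoid stale element references
            card = driver.find_elements_by_class_name("zsg-photo-card")[i]
            card.click()                          # step 1: open the embedded listing page
            time.sleep(3)                         # step 2: crude wait for the page to load
            expand_listing(driver)                # step 3: click 'More' (sketched below)
            records.append(read_listing(driver))  # step 4: read the content
            driver.back()                         # step 5: return to the list of listings
            time.sleep(2)
        driver.find_element_by_link_text("Next").click()  # step 8: go to the next page
        time.sleep(3)                             # step 9: wait for the next 20 listings
    return records
```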

The actual scraping and writing of information happens mainly in steps 3 and 4.

Step 3 is required because, when inspecting the webpage, the XPaths to the information are hidden. They only become visible (and hence ‘scrapable’) when we click on the ‘More’ button.
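A hedged sketch of step 3 follows, assuming an explicit wait is used so the ‘More’ control is clickable before it is scrolled to and clicked; the LINK_TEXT locator is a placeholder:

```python
# Sketch of step 3: wait for the 'More' control, scroll to it, and click it so
# the hidden fields become visible (and hence scrapable). Locator is a placeholder.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def expand_listing(driver, timeout=10):
    wait = WebDriverWait(driver, timeout)
    more_button = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "More")))
    driver.execute_script("arguments[0].scrollIntoView();", more_button)
    more_button.click()   # after this click the full description is in the DOM
```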

Step 4 mainly consists of finding the correct XPaths to all the different bits of information of interest.

Step 4 can be broken down into the following smaller steps:

  1. For each listing, initiate an empty Python dictionary.
  2. Find the XPath of each bit of information, e.g. price, number of bedrooms, floor size, etc.
  3. Store this information as a value of the dictionary initiated in step 1, under a descriptive key name, e.g. {“price”: “$256,000”}, where “$256,000” was extracted in step 2.
  4. After constructing the dictionary consisting of all the {key: value} pairs of interest, write it into a CSV file. Each dictionary gets one row in the CSV file, where the column names are the keys of the dictionary and the values populating the columns are the values of the dictionary.
  5. Enclose steps 1 to 4 in a for loop to iterate through each listing.
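The sketch below illustrates these smaller steps with placeholder XPaths and field names; the real XPaths live in the script linked below.

```python
# Sketch of steps 1-4. The XPaths and field names are illustrative placeholders.
import csv

FIELDS = ["price", "bedrooms", "floor_size", "address"]   # illustrative keys

def read_listing(driver):
    listing = {}                                           # step 1: empty dictionary
    listing["price"] = driver.find_element_by_xpath(
        "//span[@class='price']").text                     # step 2: placeholder XPath
    listing["bedrooms"] = driver.find_element_by_xpath(
        "//span[@class='beds']").text
    # ... the remaining fields are extracted the same way (step 3)
    return listing

def write_listings(rows, path="zillow_listings.csv"):
    # step 4: every dictionary becomes one CSV row; its keys become column names
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)    # step 5, the loop over listings, happens in the caller
```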

 

Below is the GitHub link to the script implementing the algorithm described above.
https://github.com/Vacun/Vacun.github.io/blob/master/Selenium/zillow%20with%20Selenium/zillow.py

 

trulia.com: Python’s Scrapy package.

The website’s UI is similar to zillow.com with listings on the left half of the page and the map on the right side.

The key trick to simplifying the scraping process was the following:

If the website has its metadata stored in a JSON dictionary format, that's a score!

Steps of discovery:

  1. Open the browser's inspector while on the webpage with the listings in it.
  2. Press Command + F to activate the search bar in the inspector.
  3. Type ‘json’.
  4. Inspect each of the search results (15 to 20 results).
  5. Find the tag that contains the metadata of the website in JSON format.

After inspecting each one of the search results, I was able to find the tag that contained a relatively large JSON dictionary: a sign of useful information. Closer inspection revealed that it did in fact contain all the information I was interested in for each listing on that particular page. To be more precise, the tag contained several concatenated JSON dictionaries with different metadata. That meant that after scraping this information, I would have to use regular expressions and Python's string manipulation to extract the dictionary of interest.

I used Python’s JSON package to help me with parsing the scraped information into a Python dictionary.
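A condensed sketch of that idea as a Scrapy spider is below. The start URL, the regular expression, and the key name are illustrative placeholders; the spider in the linked repository uses the selectors and pattern that matched Trulia's markup at the time.

```python
# Sketch: find the script tags holding JSON metadata, isolate the dictionary of
# interest with a regular expression, and parse it with the json module.
import json
import re
import scrapy

class TruliaSpider(scrapy.Spider):
    name = "trulia"
    start_urls = ["https://www.trulia.com/NY/New_York/"]   # example search page

    def parse(self, response):
        for raw in response.xpath("//script/text()").extract():
            # placeholder pattern for the listings dictionary embedded among
            # several concatenated JSON dictionaries in the tag
            match = re.search(r'\{"listings":.*\}', raw)
            if match:
                yield json.loads(match.group(0))   # one Python dict per parsed block
```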

Below is the GitHub link to the Scrapy spider for trulia.com.
https://github.com/Vacun/Vacun.github.io/tree/master/Scraping/trulia/trulia
