Data Scraping Trulia and Zillow

Posted on Apr 21, 2017
The skills the author demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

For this project I used the following Python packages:

Beautiful Soup

Scrapy

Selenium

The main goal of this project was to gather data, preprocess it, and prepare it for further analysis.

Introduction

To scrape real estate listing information from zillow.com I used the Selenium Python bindings. Selenium itself is a tool for creating robust, browser-based regression automation suites and tests; in other words, it is an automated testing suite. The Selenium Python bindings give access to Selenium WebDriver, which lets the user communicate directly with the web browser and write functions and execute tasks in the Python programming environment.
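As a minimal illustration of what the bindings provide, here is a sketch that simply launches a browser and hands control to Python (it assumes a local chromedriver installation; nothing here is specific to my scraper):

```python
# A minimal sketch of the Selenium Python bindings: launch a browser and
# drive it from Python. Assumes chromedriver is installed and on the PATH.
from selenium import webdriver

driver = webdriver.Chrome()             # opens a Chrome window Selenium controls
driver.get("https://www.zillow.com/")   # navigate exactly as a user would
print(driver.title)                     # the live page is now scriptable
driver.quit()                           # close the browser when done
```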

When one goes to zillow.com and types in the area of interest to buy or rent real estate, she is presented with an interactive webpage that has a map of the area dotted with the locations of the listings and, on the right side, 20 listings per page.

In order to understand what it is that I wanted to automate with Selenium, I first had to browse the listings myself, observing and registering my own actions along the way. This step gave me an initial idea of the algorithm to be written for automation.

There are two aspects to scraping zillow.com with Selenium:

  1. Automating the process of navigating to the final page containing the information of interest
  2. Retrieving the information and repeating step 1.

Data Scraping Process

In order for me to reach the final web page where there are all the descriptions and information for any one particular listing, I had to go through several actions such as:

  1. click on a listing: this opens an embedded webpage with some preliminary information
  2. wait until the page loads
  3. scroll down and click on 'More' to expand the page and get access to all the features and the full description of the listed apartment/house
  4. read the content
  5. close the embedded webpage of that particular listing (hence returning to the original web page with the first 20 listings and the map of the area)
  6. click on the next listing
  7. repeat steps 1 through 5
  8. after looking through all the listings on the current page, click on the link to the next page
  9. wait until the next page with 20 more listings loads
  10. repeat steps 1 through 7.

This is a rough representation of the initial chain of actions I wanted to automate with Selenium.
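The sketch below shows how that chain of actions translates into Selenium calls. Every locator in it (class names, link text) is a placeholder I made up for illustration; Zillow's real markup differs and changes over time:

```python
# A sketch of the click/wait/expand loop described above. All locators are
# placeholders, not Zillow's actual class names or link text.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homes/New-York-NY_rb/")   # example search URL

while True:                                                  # one pass per results page
    cards = driver.find_elements(By.CLASS_NAME, "photo-card")   # placeholder locator
    for card in cards:                                       # steps 1 through 7
        card.click()                                         # step 1: open the listing
        time.sleep(2)                                        # step 2: crude page-load wait
        driver.find_element(By.LINK_TEXT, "More").click()    # step 3: expand the details
        # step 4: read the content (see the extraction sketch further down)
        driver.find_element(By.CLASS_NAME, "close").click()  # step 5: placeholder locator
    try:
        driver.find_element(By.LINK_TEXT, "Next").click()    # step 8: next results page
    except NoSuchElementException:
        break                                                # no more pages to visit
    time.sleep(3)                                            # step 9: let the page load
driver.quit()
```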

Steps 3 and 4

The actual scraping and writing of information happens mainly in steps 3 and 4.

Step 3 is required because, when inspecting the webpage, the XPaths to the information are hidden. They only become visible (hence scrapable) when we click on the 'More' button.
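In practice, step 3 is fragile unless the script waits for the button to be ready before clicking. A minimal sketch, reusing the `driver` from the navigation sketch above (the XPath is again a placeholder, not Zillow's real markup):

```python
# Wait until the 'More' control is actually clickable before clicking it.
# The XPath is a placeholder; `driver` comes from the navigation sketch above.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)   # poll for up to 10 seconds
more_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'More')]")))
driver.execute_script("arguments[0].scrollIntoView();", more_button)  # scroll to it
more_button.click()                # the hidden details are now in the DOM and scrapable
```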

Step 4 mainly consists of finding the correct XPaths to all the different pieces of information of interest.

Step 4 can be broken down into the following smaller steps:

  1. for each listing, initiate an empty Python dictionary
  2. find the XPath of each piece of information, e.g. price, number of bedrooms, floor size, etc.
  3. store this information as a value in the dictionary initiated in step 1 under a descriptive key name, e.g. {"price": "$256,000"}, where "$256,000" was extracted in step 2
  4. after constructing the dictionary of all the {key: value} pairs of interest, write it into a csv file. Each dictionary gets one row in the csv file, where the column names are the keys of the dictionary and the values populating the columns are the values of the dictionary
  5. enclose steps 1 through 4 in a for loop to iterate through each listing (see the sketch after this list).
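Here is a minimal sketch of those steps using Python's csv.DictWriter, which maps dictionary keys to column names exactly as described. The XPaths are illustrative stand-ins, and `driver` and `cards` refer to the navigation sketch above:

```python
# A sketch of steps 1-5: one dictionary per listing, one CSV row per dictionary.
# The XPaths are placeholders found by inspecting the page, not real markup.
import csv

from selenium.webdriver.common.by import By

FIELDS = ["price", "bedrooms", "floor_size"]

def read_listing(driver):
    """Steps 1-3: build one dictionary from the currently expanded listing."""
    row = {}                                         # step 1: an empty dict per listing
    row["price"] = driver.find_element(By.XPATH, "//span[@class='price']").text
    row["bedrooms"] = driver.find_element(By.XPATH, "//span[@class='beds']").text
    row["floor_size"] = driver.find_element(By.XPATH, "//span[@class='sqft']").text
    return row                                       # e.g. {"price": "$256,000", ...}

with open("listings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)    # dict keys become column names
    writer.writeheader()
    for card in cards:                               # step 5: iterate over the listings
        # ... open and expand the listing as in the navigation sketch above ...
        writer.writerow(read_listing(driver))        # step 4: one row per dictionary
```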

 

Below is the GitHub link to the script implementing the algorithm described above:
https://github.com/Vacun/Vacun.github.io/blob/master/Selenium/zillow%20with%20Selenium/zillow.py

 

Trulia.com

The website's UI is similar to zillow.com's, with listings on the left half of the page and the map on the right side.

The key trick to simplifying the scraping process was the following:

If the website has its metadata stored in a JSON dictionary format, that's a score!

Steps of discovery:

  1. Open the inspector of the browser while on the webpage with the listings in it.
  2. Command + F to activate the search bar in the inspector.
  3. Type 'json'.
  4. Inspect each of the search results (15 to 20 results).
  5. Find the tag that contains the metadata of the website in JSON format.

Conclusion

After inspecting each one of the search results, I was able to find the tag that contained a relatively large JSON dictionary in it: a sign of useful information. Closer inspection revealed that it did indeed contain all the information I was interested in regarding each listing on that particular page. To be more precise, the tag contained several concatenated JSON dictionaries with different metadata. That meant that after scraping this information, I would have to use regular expressions and Python's string manipulation to extract the dictionary of interest.

I used Python's json package to help me parse the scraped information into a Python dictionary.
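Because the tag holds several back-to-back dictionaries, one clean way to separate them (a sketch of the general technique, not necessarily what the linked spider does) is json.JSONDecoder.raw_decode, which parses a single JSON value from a string and reports where it ended:

```python
# Peel concatenated JSON dictionaries out of scraped tag text, one at a time.
# The sample string at the bottom is invented for illustration.
import json

def extract_json_dicts(text):
    """Return every top-level JSON dictionary embedded in `text`."""
    decoder = json.JSONDecoder()
    dicts, index = [], text.find("{")
    while index != -1:
        try:
            obj, end = decoder.raw_decode(text, index)  # parse one JSON value
            dicts.append(obj)
            index = text.find("{", end)                 # jump past it to the next candidate
        except ValueError:
            index = text.find("{", index + 1)           # not valid JSON here; keep looking
    return dicts

# invented sample: two concatenated dictionaries, as described above
sample = '{"price": "$256,000", "beds": 2}{"site": "trulia"}'
for d in extract_json_dicts(sample):
    print(d)
```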

Below is the GitHub link to the Scrapy spider for trulia.com:
https://github.com/Vacun/Vacun.github.io/tree/master/Scraping/trulia/trulia

About Author

Vahe Voskerchyan

My main interest in Mathematics, in conjunction with my studies in Behavioral Economics and Philosophy, helped me hone in on Data Science as a career. I see data science as an excellent 'experimental lab' where I...
