Demographic-Based Real Estate Investing

Posted on Jun 17, 2023

Organizations are constantly seeking new ways of leveraging data to guide strategic decision-making and increase returns on investment (ROI). One of the areas of investment that benefits from a data-driven approach is real estate. To explore the impact data science can have on this form of investment, we partnered with Haystacks, a real estate investment strategy company. This blog post dives into our innovative approach, specifically focusing on the interplay between investor portfolios and Points of Interest (POIs) in real estate areas. We also detail the obstacles we encountered and the crucial insights acquired along the journey.

Project Context and Challenges

The task at hand was to explore how POIs affect a real estate portfolio, specifically by using a non-time-series approach to automate correlation tests and visualize potential correlations between POIs and real estate returns. POIs are of significant interest in real estate for several reasons. First, they provide insight into the amenities and resources available in a neighborhood. For potential homeowners or investors, the proximity of schools, hospitals, restaurants, parks, and other essential facilities can greatly influence decision-making. Access to quality education, healthcare, and recreational areas is often an important factor in the desirability and value of a property. Additionally, POIs can provide a sense of the overall infrastructure and development in an area, indicating its potential for future growth and investment opportunities. By analyzing the relationship between POIs and real estate portfolios, investors can gain valuable insights into the attractiveness and livability of a location, helping them make informed investment decisions.

Our journey commenced with extensive explorations around Atlanta, GA, employing data from Google Places and correlating it with real estate listings data in the same area. However, this initial approach led to overfitting issues. Consequently, despite the sizable sample, our models yielded low accuracy scores.

Once we realized that our initial approach, which focused solely on linear relationships, did not yield satisfactory results, we reassessed our understanding of the interplay between POIs and home values. Further analysis, including scatter plots and correlation matrices, showed that the influence of POIs on real estate returns is not strictly linear. While proximity to POIs can be a contributing factor, other variables such as neighborhood demographics, local market trends, and property characteristics also play significant roles. Acknowledging the multifaceted nature of this relationship prompted us to explore alternative methods for capturing the impact of POIs on real estate portfolios.

So instead, we opted for a shift in perspective.

Reframing the Question

Our exploration led us to an intriguing alternative approach. Instead of correlating POIs directly to home value, we aimed to correlate the target demographics of businesses to zip-code and tenant demographics. The premise was simple: Businesses, large or small, conduct meticulous market research before selecting a location. This choice can offer valuable demographic insights to real estate investors.

To give some broader context, real estate investment invariably requires a comprehensive understanding of numerous factors influencing the success of an investment. These include market dynamics, supply and demand, economic conditions, and demographics such as population growth, age distribution, and income levels. Traditional mortgage data, despite its value, is often limited by the frequency of updates. In rapidly changing markets, this sluggishness in data collection and delayed updates can impede investors' ability to make timely, accurate investment decisions.

Our Novel Approach

To mitigate these challenges, we propose a novel methodology, leveraging points of interest (POIs) and their correlation with zip codes as proxies for demographic insights to match investors' portfolio preferences. POIs, which can range from businesses and amenities to landmarks and entertainment venues, reflect the characteristics and preferences of the local population dictated by market demands. Analyzing the distribution and types of POIs in a given area can give us a thorough understanding of the target audience and their preferences.

The rationale behind using POIs as proxies for demographic insights is that businesses invest considerable time, money, and research into understanding their target market before deciding where to open new locations. In the case of larger businesses, they possess proprietary data like customer profiles, buying patterns, and market trends, which help them identify areas with the highest potential for success.

Businesses select locations in proximity to their target demographics. If these target demographics line up with an investor's portfolio demographics, then that area could be a viable investment. Conversely, if a business closes, it may indicate shifts in demographics or market conditions. By piggybacking on businesses' extensive research and expertise, we can tap into this valuable knowledge, creating a symbiotic relationship with an investor's profile.

As an illustrative example, let's consider a real estate investor heavily invested in an area 20 minutes outside of Atlanta, GA. They own an apartment complex and various rental properties scattered throughout a specific zip code, which have particular demographics. This investor aims to expand their portfolio into other areas with similar returns.

In parallel, Starbucks, which conducts thorough market research to identify areas with high success probabilities, shares a similar interest in these demographics. It follows that if the presence of certain POIs like Starbucks, Michaels, and Jimmy John's aligns with the investor's profile, we can infer that the demographics and market conditions are conducive to successful real estate investments for the investor.

Data Sources and Application

The data utilized for this project comprises a combination of Census data, HMDA data, and Google Places data. These datasets provide valuable insights into the demographics, housing market, and amenities of different areas within the Atlanta Metro region. While we focused on over 200 zip codes in Atlanta as a proof of concept, the methodology can be applied universally to other regions.

The Census data, summarized by zip code, offered a wealth of demographic details such as population density, income levels, education levels, and housing statistics. This information helps paint a comprehensive picture of the characteristics and composition of each neighborhood or area under analysis. Understanding the demographics of an area is crucial for real estate investors as it can provide insights into the target market, rental demand, and potential property value appreciation.

The HMDA, or Home Mortgage Disclosure Act, data is another essential dataset for our project. This data, also summarized by zip code, provides information about loan applications, approvals, rates, and types. It is particularly useful for real estate use cases because it lets us analyze loan approval rates, which can serve as an indicator of the economic stability of an area and the creditworthiness of its borrowers.

To illustrate the significance of HMDA data, let's examine the map below, which displays the HMDA approval rates in the Atlanta Metro region. Lighter shades represent higher approval rates. A discernible pattern emerges: areas north of Atlanta, such as Buckhead, exhibit notably higher approval rates than the rest of the region.

Now, let's consider another map below, which shows the percentage of people earning more than $200,000 per year in each zip code. The overlap with the HMDA approval rates map is striking. Once again, the zip codes north of Atlanta appear in lighter shades, indicating both higher approval rates and a larger share of high-income residents.

This overlap suggests that these areas not only have a relatively affluent population but also enjoy a robust economic environment. The combination of high income levels and high approval rates points to a healthy real estate market and potentially lucrative investment opportunities.

So, how does this information serve an investor? High approval rates are often indicative of stable economic conditions and lower credit risk, which could suggest higher property values. Furthermore, the knowledge that high approval rates align with areas where a larger percentage of the population earns over $200,000 adds an additional layer of confidence for an investor. It indicates that the area attracts a wealthier demographic that is likely to maintain steady rent payments and have a lower risk of defaulting on their mortgages. Thus, incorporating HMDA data into our strategy further refines our approach and allows for a more nuanced understanding of investment potentials.

In short, the HMDA dataset gives a comprehensive view of mortgage activity and lending patterns across different areas. Analyzing it helps investors and real estate professionals gauge the lending environment, the creditworthiness of borrowers, and the overall economic stability of specific regions or neighborhoods. That, in turn, supports identifying areas with investment potential, assessing market conditions, and making decisions informed by credit risk and loan approval rates.

Gross Rental Yield and POI Categories: Untapped Opportunities

Continuing our exploration, we turned our attention to the Gross Rental Yield. This crucial metric provides an idea of how much an investor could make on an investment property before considering expenses like property management, taxes, and insurance. It enables investors to evaluate the potential return on investment based solely on rental income.

To calculate gross rental yield, we divided the property's annual rental income by its purchase price or market value, then expressed it as a percentage. For instance, a gross rental yield of 7% means the rental income is approximately 7% of the property's value. Since we didn't have the sale and rental price of individual properties, we used the mean prices for each zip code.
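The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, and the dollar figures are hypothetical, and it assumes the rent figure is a monthly mean that must be annualized before dividing by the mean home price.

```python
def gross_rental_yield(mean_monthly_rent, mean_home_price):
    """Return gross rental yield as a percentage of property value."""
    if mean_home_price <= 0:
        raise ValueError("mean_home_price must be positive")
    annual_rent = 12 * mean_monthly_rent  # assumption: rent is quoted per month
    return 100.0 * annual_rent / mean_home_price


# Hypothetical zip-level averages, not figures from the project data.
zip_stats = {
    "zip_a": {"mean_monthly_rent": 1750.0, "mean_home_price": 300000.0},
    "zip_b": {"mean_monthly_rent": 2400.0, "mean_home_price": 640000.0},
}

yields = {name: gross_rental_yield(**stats) for name, stats in zip_stats.items()}
for name, value in yields.items():
    print(f"{name}: {value:.1f}%")
```

Note that the zip code with the higher rent ends up with the lower yield, because its home prices are proportionally higher still.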

After establishing the gross rental yield, we explored various POI categories with at least 100 locations across the Atlanta Metro region. The bottom ten categories, displayed in red on the right of the bar chart below, are most prevalent in zip codes with a low gross rental yield. These categories include industries such as real estate, which thrive in areas with high home values and high rental prices.

Conversely, the top ten categories found in zip codes with a high gross rental yield are displayed in green on the left of the chart. The 'Trucking Company' category emerged as a standout, with an impressive 7.5% average gross rental yield across locations. Interestingly, other POI categories such as 'Pawn Shops,' 'Warehouses,' and 'Laundromats' were also among the top ten.
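One plausible way to produce such a ranking is to attach each POI location to the gross rental yield of its zip code, average per category, and drop categories with too few locations. The sketch below assumes that layout; the function name, the sample rows, and the yields are hypothetical, and the threshold is lowered from 100 to 2 so the toy data survives the filter.

```python
from collections import defaultdict


def rank_categories(poi_rows, zip_yield, min_locations=100):
    """Average the zip-level yield over every location in each category."""
    by_category = defaultdict(list)
    for category, zip_code in poi_rows:
        if zip_code in zip_yield:  # skip locations outside the study area
            by_category[category].append(zip_yield[zip_code])
    means = {
        category: sum(values) / len(values)
        for category, values in by_category.items()
        if len(values) >= min_locations
    }
    # Highest average yield first.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (category, zip code) rows and zip-level yields in percent.
poi_rows = [
    ("Trucking Company", "z1"), ("Trucking Company", "z1"),
    ("Real Estate Agency", "z2"), ("Real Estate Agency", "z2"),
    ("Laundromat", "z1"), ("Laundromat", "z2"),
    ("Florist", "z2"),  # only one location, so it is filtered out
]
zip_yield = {"z1": 7.5, "z2": 4.5}

ranked = rank_categories(poi_rows, zip_yield, min_locations=2)
for category, mean_yield in ranked:
    print(f"{category}: {mean_yield:.2f}%")
```

The top and bottom ten categories would then simply be the first and last ten entries of the ranked list.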

This data implies that these areas, despite potentially being seen as temporary living spaces, may offer lucrative investment opportunities. The presence of these business types may indicate an underserved or transient demographic. While they may not be long-term residents, they still represent a sector with housing needs. In other words, high rental yields are not necessarily linked to high-end POIs like luxury retail or gourmet restaurants. Instead, practical and essential services seem to dominate the list.

Understanding the distribution and types of POIs in these high yield areas can serve as a strong indicator of the kind of tenant an investor can expect. Consequently, it will enable investors to tailor their properties to cater to these specific demographics, leading to higher occupancy rates, stable rental income, and, ultimately, higher returns on their investment.

It is worth mentioning that while gross rental yield is a valuable metric for assessing investment potential, it is essential to consider other factors such as property expenses, market trends, and local regulations to make well-informed investment decisions. In our analysis, we focused on gross rental yield as a starting point, and by not accounting for expenses, we aimed to highlight the untapped opportunities in certain POI categories. However, in practice, investors should thoroughly evaluate all relevant aspects before finalizing their investment strategies.

Exploring Business Correlations with Census and Mortgage Data

Continuing our analysis with Atlanta, GA as a proof of concept, we delve into how various businesses in the area correlate with an assortment of census and mortgage data. This investigation helps us identify the demographics of the customers these businesses serve in the zip codes they occupy and align these with investor preferences.

For instance, let's consider Dollar General. The heatmap below, which displays the POI name on the X-axis and census data on the Y-axis, reveals that Dollar General is most strongly correlated with areas that have low rental and property values, a large percentage of car commuters (particularly those traveling 60 minutes or more to work), and a household income of $35-50k. Conversely, Dollar General stores do not typically show up in areas with incomes higher than $200k or high median property values.

On the other hand, a store like 'Hollywood Feed,' a pet store chain, correlates highly with households that make over $200k. These correlations allow investors to match their interests with specific demographics served by these businesses.
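A heatmap of this kind can be built from pairwise correlations between the number of locations a business has in each zip code and each census attribute for the same zip codes. The following is a minimal sketch using the Pearson coefficient; the choice of coefficient, the variable names, and all numbers are assumptions for illustration, not the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Hypothetical store counts per zip code (one entry per zip code).
store_counts = {"discount_store": [4, 3, 1, 0]}

# Hypothetical census shares for the same four zip codes, in the same order.
census = {
    "income_35k_50k": [0.22, 0.18, 0.10, 0.05],
    "income_200k_plus": [0.02, 0.05, 0.20, 0.35],
}

# One row per census attribute, one column per business: the heatmap's cells.
corr_matrix = {
    feature: {poi: pearson(counts, values) for poi, counts in store_counts.items()}
    for feature, values in census.items()
}
for feature, row in corr_matrix.items():
    print(feature, {poi: round(r, 2) for poi, r in row.items()})
```

In this toy data the store count rises with the share of $35-50k households and falls with the share of $200k+ households, which is the kind of pattern the heatmap makes visible at a glance.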

Applying K Nearest Neighbor for Strategic Recommendations

To streamline this concept, we have created a function using K-Nearest Neighbors (KNN). This function accepts an investor's current zip code and desired demographic profile, and it outputs a recommendation of k zip codes that share a similar demographic profile. Additionally, it suggests prevalent POIs within those zip codes.

The process begins with the input of a zip code and the features an investor is interested in. These inputs feed the KNN algorithm, which identifies the k zip codes whose profiles are most similar across the selected features. The function also scales to large datasets, making it suitable for real-world scenarios.
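The core of such a function can be sketched as a brute-force nearest-neighbor search. This is not the project's implementation: the function names, the zip code labels, and the feature values are hypothetical, and the sketch assumes features are z-scored and compared with Euclidean distance. A library such as scikit-learn's NearestNeighbors would be a common choice for larger data.

```python
import math


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid /0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def similar_zip_codes(profiles, target_zip, features, k=3):
    """Return the k zip codes closest to target_zip on the chosen features."""
    zips = list(profiles)
    rows = [[profiles[z][f] for f in features] for z in zips]
    scaled = dict(zip(zips, standardize(rows)))
    target = scaled[target_zip]
    distances = [
        (math.dist(target, vector), z)
        for z, vector in scaled.items()
        if z != target_zip
    ]
    return [z for _, z in sorted(distances)[:k]]


# Hypothetical zip-level profiles; labels A-E stand in for real zip codes.
profiles = {
    "A": {"median_income": 60000, "pct_renters": 0.55},
    "B": {"median_income": 62000, "pct_renters": 0.50},
    "C": {"median_income": 150000, "pct_renters": 0.20},
    "D": {"median_income": 57000, "pct_renters": 0.58},
    "E": {"median_income": 145000, "pct_renters": 0.25},
}

matches = similar_zip_codes(profiles, "A", ["median_income", "pct_renters"], k=2)
print(matches)
```

Standardizing first matters here: without it, income in dollars would swamp the renter share, and the search would effectively ignore every feature measured on a smaller scale.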

Visualizing Results with PCA and KNN

To make these concepts more digestible, we have visualized the results of the KNN algorithm using Principal Component Analysis (PCA) in a 2D space. PCA helps us capture the most important patterns and variances in the data. This visual aid confirms that we are identifying the “closest” relationships between zip codes. The gray dots represent all the zip codes in our dataset, the red dot is the selected zip code, and the blue dots represent the “closest” zip codes.
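The projection step can be sketched with NumPy alone by standardizing the features and taking the first two principal directions from a singular value decomposition. The function name and the sample profiles below are hypothetical, and standardizing before PCA is an assumption rather than a detail confirmed by the project.

```python
import numpy as np


def pca_2d(features):
    """Project rows onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std
    # Rows of `components` are the principal directions, strongest first.
    _, singular_values, components = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ components[:2].T
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return coords, explained


# Hypothetical profiles: median income, renter share, median age.
profiles = [
    [60000, 0.55, 34.0],
    [62000, 0.50, 36.0],
    [150000, 0.20, 44.0],
    [57000, 0.58, 33.0],
    [145000, 0.25, 42.0],
    [90000, 0.40, 38.0],
]

coords, explained = pca_2d(profiles)
print(coords.shape)        # one (x, y) pair per zip code
print(explained.round(3))  # share of variance kept by each component
```

The resulting coordinates can be passed to any scatter plot, coloring the selected zip code and its nearest neighbors differently from the rest. The explained-variance shares indicate how faithfully the 2D picture reflects distances in the full feature space.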

A New Tool for Real Estate Investors

We used Plotly Dash to develop a tool that leverages businesses' proprietary information and demographic analysis expertise for the benefit of real estate investors. It is an inexpensive solution to identify areas that align with investor profiles and demonstrate significant investment potential. It also uncovers areas of future growth potential by identifying areas that match the preferences of successful businesses.

Our approach is cost-efficient, as it capitalizes on businesses' extensive data, reducing the need for expensive data acquisition. Furthermore, by using POIs as proxies for demographic information, we overcome the limitations of slow and infrequent data updates, enabling investors to stay ahead of market trends.  


Our project underscores the power of data science in revolutionizing real estate investment. By aligning data from various sources and correlating them in novel ways, we have unearthed valuable insights that promise to redefine real estate investment strategies.

We have not only drawn connections between seemingly unrelated variables but also created a powerful, user-friendly tool that offers real-time, granular insights to real estate investors. This interactive dashboard allows investors to explore various aspects of zip codes, from housing and commuting attributes to demographic and earnings attributes.

Special Thanks

  • Joe Lee from for his sponsorship and mentorship on this project.
  • Cole Ingraham from NYC Data Science Academy for his mentorship on this project.

About Authors

Brian Ralston

Experienced Data Scientist and Database Administrator with 3 years experience in SQL, Python, and R. Strong understanding of data warehousing, modeling, mining techniques, and dedicated to staying up to date with the latest technologies and industry trends. Excited...

Jason Phillip

As a versatile professional, I bring a rich and varied background in sales, real estate, entrepreneurship, and military leadership to the table. Having successfully owned and operated a business for a decade, I am now channeling my enthusiasm...