Where to Live? An Interactive Geospatial Data Digestion Framework Implemented in R with Shiny
Where are the best places to live? How do you answer this question?
If you turn to Google, there are many "top 10" lists, generated by someone else who does not know your personal needs. If you have retired, you might not care much about salary; if you do not have kids, education costs might mean nothing to you; if you have strong political views, you might not want them to clash with your neighbors'; if you have serious lung issues, air quality might be more important than anything else. We don't all share the same concerns, and one size does not fit all.
The question is: how do we go about finding the answer that does fit individual need?
How about choosing among a variety of available data, assigning your priority, and ranking the candidates based on your specific input?
Here is a prototype of the described solution, built with R and Shiny. I invite you to play with this interactive map-generating app and make your own judgement.
An interactive spatial data digestion framework has been implemented in R with Shiny to help answer questions of the form "where is the best place to ...?" The web app ranks US counties based on user input and visualizes the results on a Leaflet map, along with other quantitative plots that support the "where's best" decision-making process. Assuming a matrix-like data structure, the computational core of the framework combines a row-and-column filtering system with a weighted-average score generator. The delivered prototype is hosted at shinyapps.io, with its source code shared on GitHub. For people interested in the thought process behind the scenes, there is a short version of this blog: the project wrap-up presentation given at the end of the two-week project period. This blog post completes the planned documentation package, with special effort devoted to the motivation, background and discussion sections.
Motivation & Background
My vision for this project is based on the following considerations:
1) The strengths of R and Shiny
R is an open source programming language and environment known for its statistical analysis capability, vectorized programming style and abundant plotting packages. Shiny is a fast-growing extension to the R family that provides a web app development and hosting environment without requiring HTML or JavaScript knowledge. While browsing the R Shiny gallery, I was immediately impressed by the featured map generation functionality built on Leaflet or googleVis.
2) Personal appreciation of the power of maps
I come from a computational geoscience background. Geoscientists deeply appreciate the power of maps and love making them. Decisions are made on maps, from everyday-life decisions (with a GPS in your hand) to billion-dollar decisions (think about offshore drilling).
3) Interest in the decision-making process
One of the biggest personal decisions using maps is where to live. Is it possible to quantify the criteria and measure the trade-offs? Before we dive into this specific problem, let's generalize the typical decision-making process.
When facing a large amount of input data and a variety of choices, you start by filtering the data down to reduce what goes into the decision process (shortlist generation). This filtering step is often well described by clear quantitative rules (keyword count, acceptable range of measurements, and so on), but the next step is much more labor intensive and less clear. When a committee needs to pick the top one from three shortlisted candidates, the process might involve a lot more than merely shortening the list from 300 to 3, with consistency and recordability as additional challenges.
A few slides were developed to demonstrate one common decision-making workflow that can be quantified and automated. The specific version shown is the one implemented in my project. It starts with a general column-and-row filtering system during the "300 candidates to 3" stage, and ends with a weighted-average score generation and ranking step during the "shortlisted to winner" stage. The weighted-average linear system offers the benefit not only of simplicity but also of scalability: the routine naturally handles data expansion as I keep adding new columns to the data set. These advantages come with some limitations, which are detailed in the discussion section.
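The two stages can be sketched in a few lines. This is a minimal illustration in Python rather than the app's actual R code; the county names, column names, values and the helper functions `shortlist` and `rank` are all made up for the example, and the values are assumed to be already normalized with higher meaning better.

```python
# Hypothetical candidates: one row per county, one column per measure.
# Values are made up and assumed to be already normalized to [0, 100],
# with higher meaning better.
candidates = {
    "County A": {"air": 80.0, "income": 40.0, "safety": 90.0},
    "County B": {"air": 30.0, "income": 95.0, "safety": 60.0},
    "County C": {"air": 70.0, "income": 70.0, "safety": 20.0},
}


def shortlist(rows, ranges):
    """Row filter: keep rows whose values fall inside every slider range."""
    return {
        name: values
        for name, values in rows.items()
        if all(low <= values[col] <= high for col, (low, high) in ranges.items())
    }


def rank(rows, weights):
    """Column filter plus scoring: only weighted columns enter the score."""
    total = sum(weights.values())
    scores = {
        name: sum(w * values[col] for col, w in weights.items()) / total
        for name, values in rows.items()
    }
    # Sort best-first by score.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "Many to few": a safety slider set to [50, 100] drops County C.
survivors = shortlist(candidates, {"safety": (50.0, 100.0)})

# "Shortlisted to winner": air quality weighted three times as much as income.
ranking = rank(survivors, {"air": 3.0, "income": 1.0})
print(ranking)
```

With these made-up numbers, County A ranks first with a score of 70.0 against 46.25 for County B. Adding a new data column only means adding one more key to the weights, which is the scalability benefit mentioned above.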
Project Scope and Deliverables
After the brainstorming and design stage, the project moved into the detailed scoping and execution stage.
For any project, on-time delivery is usually the top priority, yet passionate developers always want to do more and keep adding sophisticated functionality. A good general practice is to plan checkpoints and fast-track deliverables at the beginning. At any given point past the proof-of-concept fast-track effort, make sure to always have a deliverable working version (and yes, use version control like Git).
At the halfway point, I created the proof-of-concept version. This happens to be a good place to talk about one key data preparation step for this project.
Data Preparation Key Steps
The three data sets were carefully chosen for the minimum deliverable version, as they demonstrate a few key challenges for the weighted-average scoring system:
0) All numerical data?
To begin with, all data sets here are numerical. Another level of complexity arises when trying to include categorical data in this scoring and ranking framework. We will save this for the discussion section.
1) The normalization step
Let's look at the first and third choices in the picture, which are Air Quality and Income. The air quality data measures particle concentration in the air, with values ranging from 4.44 to 15.96. The income data ranges from 22,894 to 125,900. (The units are not relevant here: we are blending unrelated quantities like air quality and income, so the result is a number that lacks physical meaning and serves only for ranking.) The issue is that, without normalization, the variation in air quality is barely reflected in the final score; its impact is several orders of magnitude weaker than that of income. This is why each quantity needs to be normalized to the same range (in this case I chose [0,100]) before being pushed through the weighted-average step.
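One common way to do this rescaling is min-max normalization. The sketch below is in Python rather than the app's R and is only an assumption about how the step could look; the function name `normalize_0_100` is hypothetical, the endpoints are the ranges quoted above, and the middle values are made up.

```python
def normalize_0_100(values):
    """Min-max rescale a column so its minimum maps to 0 and its maximum to 100."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption: a constant column carries no ranking information,
        # so every entry is placed at the midpoint.
        return [50.0 for _ in values]
    return [100.0 * (v - low) / (high - low) for v in values]


# Endpoints are the ranges quoted in the text; the middle values are made up.
air_quality = [4.44, 10.20, 15.96]
income = [22894, 74397, 125900]

air_scaled = normalize_0_100(air_quality)
income_scaled = normalize_0_100(income)
print(air_scaled, income_scaled)
```

After this step both columns span the same [0, 100] range, so a weight of 1 on each gives them comparable influence on the score.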
2) The "good" direction
Again looking at air quality and income: after both are normalized to [0,100], another issue is that their "good" directions are opposite. For air quality, a smaller number is better; for income, a bigger number is better. Is it possible to simply "flip" the numbers behind the scenes without bothering the user? Yes, but not completely. The political data set was chosen to demonstrate another layer of complexity: the data itself is the 2016 election voting difference for each county, and the "good" direction is completely subjective. In the final version, population density is another column that needs user input to define the positive direction, since some prefer to live close to the crowd while others prefer more space.
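The flip itself is a one-line operation once the data is on [0,100]. Here is a minimal Python sketch (not the app's R code); the function `orient`, the column names and the direction choices are hypothetical, with the subjective columns standing in for whatever the user selects.

```python
def orient(scaled, higher_is_better):
    """Return scores where 100 is always the preferred end of the range."""
    if higher_is_better:
        return list(scaled)
    return [100.0 - v for v in scaled]


# Hypothetical user choices; the column names are illustrative only.
directions = {"income": True, "air_pollution": False, "population_density": False}

# Made-up values, already normalized to [0, 100].
columns = {
    "income": [0.0, 50.0, 100.0],
    "air_pollution": [0.0, 25.0, 100.0],
    "population_density": [10.0, 60.0, 90.0],
}

oriented = {name: orient(vals, directions[name]) for name, vals in columns.items()}
print(oriented)
```

For objective columns such as air pollution the direction can be fixed in the code; for subjective ones such as population density, the `directions` entry would come from a user control.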
After sorting out these key issues, I began to enrich the data by adding more variables that people might care about when making a "where to live" decision. The data sources are listed at the end. Without time constraints, this effort can go on forever.
Most of the effort past the midpoint was spent on automatically visualizing analytical insights in a user-friendly manner, building UI features such as linking the filtering sliders to the tab 2 data table, and adding special calculations based on the user's mouse click input. In the end, the delivered product looks like this; it is the hosted web app that, hopefully, you have just played with.
UI Features and Design Philosophy
1) the "distance to" calculation
While all other data sets are preloaded into memory and are in a sense 'static', this "distance to" calculation reads (lat, long) from the user's mouse click location on the map and does a 'dynamic' spherical distance calculation from each county to the given point. This incorporates the "I want to live close to / away from" decision component into the weighted-average score generation scheme. The resulting distances have to be normalized too.
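The post does not say which spherical distance formula the app uses; the haversine formula is one common choice and is what the Python sketch below assumes. The click location, county centroids and function name are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the Earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical click location and county centroids, as (lat, long) in degrees.
click = (40.0, -100.0)
centroids = {
    "County A": (40.0, -100.0),
    "County B": (41.0, -100.0),
    "County C": (45.0, -90.0),
}

distances = {
    name: haversine_km(click[0], click[1], lat, lon)
    for name, (lat, lon) in centroids.items()
}
print(distances)
```

The resulting distance column can then be rescaled to [0,100] like any other measure, with the "close to" or "away from" choice deciding which end counts as good.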
2) the pie chart
As I offer more and more optional data sets to be included in the calculation, it is nice to have a visual reminder of data considered and their assigned weights.
3) the radar chart
The radar chart (or spider web plot) is designed to help in understanding the strengths and weaknesses of the top-ranked locations. I decided to plot many measurements for the chosen location even if the user has excluded certain input data by leaving some boxes in the control panel unchecked.
I prefer to look at the full strengths and weaknesses near the end, when a small number of final candidates have survived the filtering and digestion funnel and we are very close to a final conclusion (a winner). Some neglected aspect may be so extreme that it should trigger second thoughts. In other words, this is a "what have we missed before I sign the contract" visualization.
The 25%, 50% and 75% quantiles are also plotted, as a visual reminder of what the benchmark crowd looks like. When trying to choose the top 5-10 from 3000+ counties, we are often really looking for outliers, so I think it is helpful to remind the user what an "average America" looks like.
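The benchmark polygons amount to computing three quantiles per measure across all counties. A minimal Python sketch, using the standard library and made-up normalized values (the function name `quartile_benchmarks` is hypothetical, and the interpolation method is an assumption, not necessarily what the app uses):

```python
import statistics

# Made-up normalized scores, one list per measure across all counties.
columns = {
    "air": [0.0, 25.0, 50.0, 75.0, 100.0],
    "income": [0.0, 10.0, 20.0, 30.0, 100.0],
}


def quartile_benchmarks(cols):
    """Return the 25%, 50% and 75% quantiles of every column."""
    return {
        name: statistics.quantiles(values, n=4, method="inclusive")
        for name, values in cols.items()
    }


benchmarks = quartile_benchmarks(columns)
print(benchmarks)
```

Each benchmark list becomes one polygon on the radar chart, drawn alongside the polygons of the top-ranked counties.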
The chartJSRadar chart is flexible enough that one can click on the name tags of plotted polygons to toggle them on and off. This is a very nice design by the author of the package, as radar charts are very hard to read when too many polygons are plotted.
4) A data table tab linked to the interactive map, with "highlight-to-drop-marker" function
The data table on tab 2 is the filtered result linked to all the sliders on tab 1. The newly calculated score after each "Update Plots" button click is added to the left side of the filtered static data table, so the user can easily sort by the fresh score and drop markers on the map by clicking on rows of data. It is worth noting that interactive features like this are being developed by RStudio in the "crosstalk" package. Eventually these features will be much easier to build than with my solution in the source code.
Many of us might ask this question, especially after playing with the filtering sliders for a while -- how are some of the input data correlated to each other? The tab 3 "Correlation analysis" is dedicated to answering this question.
A few comments on the off-diagonal strong correlations:
- Median house values for 2015-2017 correlate with each other very well, as housing values in general change slowly.
- Political voting stats correlate well with each other because in the US most people do not vote for third-party candidates.
- The income to cost ratio has a strong negative correlation with unemployment rate, while income has a much larger variation compared to cost. This seems to suggest that, in a statistical sense, we should chase high salary without worrying too much about the living cost.
- There is a strong correlation between living cost / housing value and political voting results. Housing value and living cost are naturally correlated to each other, since house value is an input used to calculate living cost. In the end it reveals that high living cost regions tend to vote for DEM, while low cost regions for GOP.
- However, there is no strong correlation between income and politics! The 2016 vote is much more cost sensitive than income sensitive.
- Population density is correlated to housing value; therefore, it plugs into the same observation regarding politics.
- Population density has an intuitive correlation with air quality, so statistically speaking, if you choose the large cities, forget about clean air.
- Longitude has a few interesting correlations with other parameters. This is due to American geography -- think about where the mountains are.
Most of these observations are intuitive. The question now is (if you accept the statement that, when looking for the best places to live, we are really looking for outliers): do statistics matter here? If you find that golden place that scores high on everything, do you care about whether it makes statistical sense?
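For readers who want to see what sits behind a correlation matrix plot, here is a minimal Python sketch of a pairwise Pearson correlation matrix. It is an illustration with made-up columns named a, b and c, not the app's R code, and the post does not state which correlation coefficient the app uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns: b rises with a, c falls as a rises.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(toy)
print(matrix["a"]["b"], matrix["a"]["c"])
```

The off-diagonal entries are the ones discussed in the comments above: values near +1 or -1 mark strongly correlated pairs.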
There is a discussion section recording my additional thoughts during this project. I appreciate your attention so far if you are still with me. Before I bore you completely I'd like to mention that there are acknowledgement and data source sections near the end.
The Delivered Product
This section is a bit scattered. I am merely documenting some of my additional thoughts during this project. The more important goal for this section is to trigger some additional thoughts on your side, which I hope you will share with me.
Extended thoughts from the correlation matrix plot
Let's look at these three variables: income, living cost, and income/cost ratio. The ratio is derived from the first two base quantities. Ratio=1 means making ends meet, which is a convenient parameter for life-related decisions. However, it is worth noting that, when this ratio is offered as a third variable, it does not add any additional information to this income-cost system; the degrees of freedom remain two.
People with statistical knowledge might choose only up to two of the three parameters in this group, in any given calculation. The consideration here is: when all three are checked, we are double dipping into something. But this is not unique to this situation.
As many things in life are somewhat correlated, it is a common problem.
What does this ranking app do?
Initially, I set out to produce a correlation matrix that would let me digest the result, think about how to include or drop certain variables, and run a PCA. In the end I dropped the last few steps. The reason is that I am not trying to come up with an "engineered feature," a linear combination of the available data columns, that could predict the likelihood of achieving what the user wants to achieve in life.
Instead, I simply ask users to bring their own "Happiness = f(x)" function to the app (with the assumption that the input variables are among the ones I offer here). The "happiness" or "achievement" is subjectively defined, and very hard to measure. My current understanding is that: for what this "where to live" app is designed to do, I do not have to worry about correlation at all; I just dig for any quantity that the users might care about during the decision making process, and offer them all as options.
Am I right?
The limitations of a linear framework
As mentioned earlier, the weighted-average engine was chosen because of its simplicity and scalability. At this point I would like to share my thoughts on its limitations, in the context of the "happiness" discussion. By choosing the linear framework, I am essentially saying, for example: while holding other variables constant, your happiness grows linearly with how much money you make. At any given point, if you take derivatives with respect to a certain variable, you'll find the first order derivative is a constant, and any higher order derivatives are 0 -- because I defined a model space with only first degree polynomials.
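Written out, with notation that is assumed here rather than taken from the app (x_i for the normalized value of the i-th selected measure, w_i for its user-assigned weight, and S for the score), the statement about derivatives reads:

```latex
S(x_1, \dots, x_n) = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{j=1}^{n} w_j},
\qquad
\frac{\partial S}{\partial x_i} = \frac{w_i}{\sum_{j=1}^{n} w_j},
\qquad
\frac{\partial^2 S}{\partial x_i \, \partial x_k} = 0 .
```

The first derivative depends only on the weights, not on where the county sits along x_i, which is exactly the "happiness grows linearly" assumption.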
It is always good practice, when choosing basis functions, to first think about all the possible mathematical behavior your governing equation might require. For this project I assume, for example, that I would never look at how much happier people become when they make another $10,000/year. The app itself is meant to use explicit linear functions to go through a forward problem, rather than an inverse problem that tries to analyze the quantitative impact of money, air quality or safety on our happiness.
Another train of thought is: I really wonder what happens when I plug in some nonlinear functions into this system, like this
Thoughts on filtering
While playing with the filters of this app, very often you find there are only a handful of counties (out of 3000+) surviving the filtering system, especially when you combine multiple sliders. When the whole data set is small enough for the memory to handle, the suggested approach should be: after choosing which columns of data should be included (by checking boxes), keep all rows (by not touching the sliders at all), then generate the score. Gain an understanding of where the bull's eyes are, then play with filters.
It would be nice to have a real-time visualization showing, as I play with the slider of one parameter, how the possible ranges of the other parameters change. This should be easily achievable with the 'crosstalk' package in the future, if it is not already.
The challenge of including categorical data
All data sets used are numerical. Categorical data sets are avoided since it's hard to define the "distance" among different categories. When people try to decide where to eat, and tell me their favorite cuisine is Italian, it takes lots of assumptions and framework building to define the score of American, Chinese, Thai, etc.
For this project, one categorical data set I found that's potentially easy to include is climate zone data, since it's often derived from numerical measurements like how many days in a year people turn on heating vs. cooling. Most other categorical data are hard to incorporate.
Data with gaps -- interpolation is rarely just a math problem
As real estate value is often a key consideration for people comparing places, I really wanted to include that info in the calculation. However, though Zillow.com offers a lot of good real estate data, its county data set only has statistics for slightly more than 1800 counties, far fewer than the total (3000+). I did not want to do a quick but wrong interpolation of the data, and so was compelled to give up using the set. Proper real estate value interpolation for the other 1200+ counties is a complex project by itself.
For other data sets with only a few missing data items, I often just googled for the needed data points and manually fixed them. But for the 150+ counties reporting 0 crime rate, I suspect quite a few of them are wrong. I did not have time to validate the data, so they were used as is.
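A small first step toward validating such a column is simply to list which rows are missing and which report an exact zero, so they can be reviewed by hand. The Python sketch below is an illustration only, with made-up county names and rates and a hypothetical helper `flag_for_review`; it is not part of the data cleaning code on GitHub.

```python
# Made-up crime rates per county; None marks a value missing from the source.
crime_rate = {
    "County A": 2.4,
    "County B": 0.0,
    "County C": None,
    "County D": 0.0,
}


def flag_for_review(rates):
    """Split counties into missing values and suspicious exact zeros."""
    missing = sorted(name for name, r in rates.items() if r is None)
    zeros = sorted(name for name, r in rates.items() if r is not None and r == 0)
    return missing, zeros


missing, zeros = flag_for_review(crime_rate)
print("missing:", missing)
print("reported as zero:", zeros)
```

Whether a flagged zero is a true zero or an unreported value still has to be checked against the original source.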
The data cleaning steps are also included in the GitHub code; check it out if you are interested in the details.
How to potentially make a profit?
There are two future directions leading to business potential:
- When the dollar value at stake in the decision is huge, automating parts of the decision-making process and generating easy-to-digest visualizations might be desired by large corporations.
- The app is naturally suited to being a web service for individual consumers: we learn more and more about the users as they use the app, and the least we can easily do is pick relevant ads and show them beside the map.
The need for efficient data digestion as a general opportunity for data scientists
As the ocean of data grows quickly, user-friendly digestion solutions will be in high demand, from corporations to individual consumers. Eventually all intellectually active human beings will need to do some 'data science', while a large amount (if not the majority) of data will be quickly thrown away without being stored. This leads to a prediction that real-time data stream digestion will soon become mainstream. I think the future is very interesting, but should I also be scared?
Acknowledgements
- NYC Data Science: Shu Yan, Zeyu Zhang
- Inspiration from the ‘SuperZip’ example by Joe Cheng
- Leaflet mapping examples on datascienceriot.com
- Correlation Matrix app ‘shinyCorrplot’ by saurfang
- Developers of all other packages I used for this project