Tree Troubles -- Predicting Sidewalk Damage Resulting From Trees In NYC
Introduction
Tree roots growing under sidewalks often crack or lift the pavement once the tree surpasses a certain size. This creates significant tripping hazards for pedestrians and liability issues for property owners. Furthermore, the cost of repairing such damage exceeds $100 million per year in the United States. As such, this project seeks to:
- Predict the likelihood that a particular tree will result in sidewalk damage.
- Elucidate the factors most involved in causing such damage.
- Develop an application that recommends tree species and other steps to reduce the likelihood of future sidewalk damage.
Dataset
In 2015, NYC conducted a volunteer-powered campaign to map, count, and care for all of the city's street trees. This dataset consists of:
The final dataset consisted of the following features: "tree_id", "year", "tree_dbh", "health", "spc_latin", "spc_common", "root_stone", "root_grate", "root_other", "trunk_wire", "address", "zipcode", "boro_name", "longitude", "latitude", "block_code", "sidewalk". Details on these terms can be found at the dataset link above.
Technology Pipeline
The technology employed is a mixture of Python, R, and Java. Python scripts are used for performing data cleaning and merging, as well as web scraping tree species data from leafsnap.com. R scripts are used for performing numerical and visual EDA and for running machine learning algorithms. For the desktop analysis application and to generate heatmaps (see below), Java is used. Java will also be used to develop a mobile application.
Visual Overview Of Data
To quickly check for a relationship between tree diameter and sidewalk damage, a heatmap was generated by sorting the data by increasing diameter (from 3 to 70 inches). The magenta color represents the different species of trees. The red and green pixels represent "damage" and "no damage" to the sidewalk, respectively. One key takeaway is that there is no obvious relationship between sidewalk condition and tree diameter. Another is that, although the dataset contains 132 tree species, only a small number account for most of the trees planted (see bar plot below).
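The sort-then-colour idea behind the heatmap can be sketched in a few lines. This is a minimal Python illustration, not the project's Java implementation: the sample records are hypothetical, the exact RGB values are placeholders, and the "Damage"/"NoDamage" labels are an assumption about how the "sidewalk" column is encoded.

```python
# Colours used in this sketch (RGB); the original heatmap's exact palette is not known.
RED = (255, 0, 0)
GREEN = (0, 255, 0)


def heatmap_pixels(records):
    """Return one colour per tree, ordered by increasing trunk diameter."""
    ordered = sorted(records, key=lambda r: r["tree_dbh"])
    # Assumption: the "sidewalk" column holds "Damage" or "NoDamage".
    return [RED if r["sidewalk"] == "Damage" else GREEN for r in ordered]


# Hypothetical sample records, for illustration only.
sample = [
    {"tree_dbh": 30, "sidewalk": "Damage"},
    {"tree_dbh": 3, "sidewalk": "NoDamage"},
    {"tree_dbh": 12, "sidewalk": "Damage"},
]
pixels = heatmap_pixels(sample)
```

If diameter strongly predicted damage, the resulting strip would shift from mostly green at one end to mostly red at the other; a well-mixed strip suggests no obvious relationship.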
Variable Association
The association between each predictor variable in the dataset and sidewalk condition was also measured using either a Cramér's V function or the ICC package in R. The strength of association ranges from 0 to 1, with a value of 1 indicating perfect association between two variables.
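For readers unfamiliar with Cramér's V, the statistic is derived from the chi-squared statistic of a contingency table between two categorical variables. The following is a minimal Python sketch of the uncorrected form, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). It is not the R function used in the project, and the function name `cramers_v` is chosen here for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    rows = Counter(xs)             # row marginals
    cols = Counter(ys)             # column marginals

    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(rows), len(cols)) - 1
    if k == 0:
        # One variable is constant, so no association can be measured.
        return 0.0
    return sqrt(chi2 / (n * k))
```

For example, `cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])` gives 1.0 because each category of the first variable maps to exactly one category of the second.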
Clustering (Unsupervised Learning)
Clustering was done by first generating a dissimilarity matrix using the “gower” distance, then using the “pam” function to find the best number of clusters. Using sample datasets (1,000 observations) containing all the geolocation-related features (i.e., address, zipcode, boro name, longitude, latitude), the optimal number of clusters was found to be 6. These clusters correspond roughly to the borough in which the trees are located. See image below.
With all geolocation-related features removed except longitude and latitude, the optimal number of clusters was found to be 2, which corresponds to the sidewalk condition (damaged or not damaged).
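The Gower distance is useful here because the dataset mixes numeric features (such as "tree_dbh") with categorical ones (such as "health"). The following is a minimal Python sketch of the basic, unweighted form with no missing-value handling. It is not the R routine used in the project, and the function names and sample values are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    numeric_ranges maps each numeric feature to (max - min) over the dataset;
    every other feature is treated as categorical.
    """
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            rng = numeric_ranges[feature]
            # Range-normalised absolute difference, in [0, 1].
            total += abs(a[feature] - b[feature]) / rng if rng else 0.0
        else:
            # Simple matching: 0 if equal, 1 otherwise.
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)


def dissimilarity_matrix(records, numeric_ranges):
    """Full symmetric matrix of pairwise Gower distances."""
    return [[gower_distance(p, q, numeric_ranges) for q in records]
            for p in records]


# Hypothetical records, for illustration only.
t1 = {"tree_dbh": 10, "health": "Good"}
t2 = {"tree_dbh": 40, "health": "Poor"}
d = gower_distance(t1, t2, {"tree_dbh": 60})
```

A matrix of this kind is what a k-medoids method such as PAM takes as input, which is why the dissimilarity step comes first.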
Classification (Supervised Learning)
The R caret package was used to run various machine learning classification algorithms on the full dataset, using a typical 80/20 train/test split for validation. The accuracy results are outlined below.
- Logistic Regression
  - GLM: 75.8%
  - GLMNet: 75.8%
- KNN: 76.9%
- Naive Bayes: 73.2%
- Tree Based Classification
  - GBM: 77.3%
  - XGBoost: 77.9%
- SVM (radial kernel): 77.1%
- Neural Net: 77.4%
Overall, the accuracy results for these algorithms were fairly close, and given the nature of the problem, the simple Logistic Regression models were found to be well suited for use in the analysis application described below. As for which features are most important in determining sidewalk damage, both the tree-based and logistic regression models broadly agree that having blocks around the trees (root_stone), tree diameter, and location play important roles.
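The 80/20 hold-out evaluation behind the accuracy figures above can be illustrated with a short Python sketch. The project itself used R's caret package, so this is only an assumption-laden outline: the helper names, the fixed seed, and the "Damage"/"NoDamage" labels are illustrative.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical usage: 100 row indices split 80/20.
rows = list(range(100))
train, test = train_test_split(rows)

# Hypothetical predictions versus true labels on a four-row test set.
score = accuracy(["Damage", "NoDamage", "Damage", "Damage"],
                 ["Damage", "NoDamage", "NoDamage", "Damage"])
```

The model is fitted on the training portion only, and the reported accuracy is computed on the held-out 20%, so it reflects performance on trees the model has not seen.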
Analysis Application “NYC Tree Insights”
To make the models useful to non-technical users, a desktop application has been developed that analyzes the “dead trees” data to predict the potential for sidewalk damage at various points in the future (10, 20, 30, 50, and 75 years).
The application also allows rapid visual analysis through bar plots and links to Google Maps for viewing the area, and even the dead tree in question.
Conclusion
Using the NYC 2015 Tree Census dataset, a classification model with an accuracy of over 75% in predicting root-induced sidewalk damage was developed. Moreover, a Java-based desktop application was built around this model to help stakeholders assess the likelihood of future sidewalk damage if a certain species of tree is planted at a particular location. The next steps for this project are:
- Migrate machine learning backend to Amazon services
- Create web application
- Create mobile application