Tree Troubles -- Predicting Sidewalk Damage Resulting From Trees In NYC

Nathan Stevens

Posted on Dec 21, 2016

Introduction

Tree roots growing under sidewalks often cause cracking or lifting of the pavement once the tree surpasses a certain size. This creates significant tripping hazards for pedestrians and liability issues for property owners. Furthermore, the cost of repairing such damage exceeds $100 million per year in the United States. With that in mind, this project seeks to:

Predict the likelihood that a particular tree will result in sidewalk damage.
Elucidate the factors most involved in causing such damage.
Develop an application that recommends tree species and other steps to reduce the likelihood of future sidewalk damage.

Dataset

In 2015, NYC conducted a volunteer-powered campaign to map, count, and care for all of the city's street trees. This dataset consists of:

432,564 Live Trees
14,099 Dead Trees
40% Sidewalk Damage
Average DBH of 11.6 inches
132 Different Species

The final dataset consisted of the following features: "tree_id", "year", "tree_dbh", "health", "spc_latin", "spc_common", "root_stone", "root_grate", "root_other", "trunk_wire", "address", "zipcode", "boro_name", "longitude", "latitude", "block_code", "sidewalk". Details on these terms can be found at the dataset link above.
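As a rough illustration of how summary figures like the ones listed above (tree count, average DBH, share of damaged sidewalks, species counts) can be derived from these columns, here is a minimal Python sketch. The sample rows are made up for the example and are not taken from the census, and the "Damage" / "NoDamage" label values are an assumption about how the "sidewalk" column is encoded.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample that reuses column names from the feature list.
# The values are illustrative only and are NOT real census records.
# ASSUMPTION: the "sidewalk" column holds "Damage" or "NoDamage".
SAMPLE_CSV = """tree_id,tree_dbh,spc_common,sidewalk
1,3,red maple,NoDamage
2,21,pin oak,Damage
3,11,honeylocust,NoDamage
4,14,pin oak,Damage
5,6,honeylocust,NoDamage
"""


def summarize(rows):
    """Return tree count, mean DBH, damage share, and per-species counts."""
    rows = list(rows)
    n = len(rows)
    mean_dbh = sum(float(r["tree_dbh"]) for r in rows) / n
    damage_share = sum(r["sidewalk"] == "Damage" for r in rows) / n
    species = Counter(r["spc_common"] for r in rows)
    return {
        "n": n,
        "mean_dbh": mean_dbh,
        "damage_share": damage_share,
        "species": species,
    }


summary = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary["n"], summary["mean_dbh"], summary["damage_share"])
print(summary["species"].most_common(2))
```

For the real file, the same function could be fed a csv.DictReader opened on the downloaded census export instead of the in-memory sample.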

Technology Pipeline

The technology employed is a mixture of Python, R, and Java. Python scripts are used for performing data cleaning and merging, as well as web scraping tree species data from leafsnap.com. R scripts are used for performing numerical and visual EDA and for running machine learning algorithms. For the desktop analysis application and to generate heatmaps (see below), Java is used. Java will also be used to develop a mobile application.

Visual Overview Of Data

To quickly check for a relationship between tree diameter and sidewalk damage, a heatmap is generated by sorting the data by increasing diameter (from 3 to 70 inches). The magenta color represents the different species of trees. The red and green pixels represent "damage" and "no damage" to the sidewalk, respectively. One key takeaway is that there isn't an obvious relationship between sidewalk condition and tree diameter. Another takeaway is that even though there are 132 tree species in the dataset, only a small number make up most of the trees planted (see bar plot below).
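The sorting step behind such a heatmap can be sketched in a few lines of Python. This is not the Java code used for the project; it is a simplified illustration that orders trees by diameter, assigns one color per tree, and adds a per-bin damage rate as a numeric cross-check on the picture. The RGB values, the 10-inch bin width, the "Damage" / "NoDamage" labels, and the sample trees are all assumptions made for the example.

```python
# ASSUMPTION: colors chosen to match the description (red = damage, green = no damage).
RED = (255, 0, 0)
GREEN = (0, 160, 0)


def damage_strip(trees):
    """Sort trees by diameter and return one color per tree, smallest first."""
    ordered = sorted(trees, key=lambda t: t["tree_dbh"])
    return [RED if t["sidewalk"] == "Damage" else GREEN for t in ordered]


def damage_rate_by_bin(trees, bin_width=10):
    """Share of damaged sidewalks within each diameter bin."""
    totals, damaged = {}, {}
    for t in trees:
        b = (t["tree_dbh"] // bin_width) * bin_width  # lower edge of the bin
        totals[b] = totals.get(b, 0) + 1
        damaged[b] = damaged.get(b, 0) + (t["sidewalk"] == "Damage")
    return {b: damaged[b] / totals[b] for b in sorted(totals)}


# Made-up trees, for illustration only.
trees = [
    {"tree_dbh": 25, "sidewalk": "NoDamage"},
    {"tree_dbh": 4, "sidewalk": "Damage"},
    {"tree_dbh": 12, "sidewalk": "Damage"},
    {"tree_dbh": 8, "sidewalk": "NoDamage"},
]
strip = damage_strip(trees)
rates = damage_rate_by_bin(trees)
print(strip)
print(rates)
```

If the colors in the sorted strip look evenly mixed from one end to the other, and the per-bin rates stay roughly flat, that matches the observation that diameter alone does not obviously separate damaged from undamaged sidewalks.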

Variable Association

The association between each predictor variable in the dataset and sidewalk condition is also measured, using either a Cramér's V function or the ICC package in R. The strength of association ranges from 0 to 1, with a value of 1 indicating perfect association between two variables.
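For readers unfamiliar with Cramér's V, the following Python sketch shows one common way to compute it for two categorical variables from the chi-squared statistic of their contingency table. It is the plain (uncorrected) form and is not the R function used in the project; the ICC-based comparison for numeric variables is not covered here.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Made-up labels: the first pair is perfectly associated, the second is independent.
perfect = cramers_v(["Yes", "Yes", "No", "No"],
                    ["Damage", "Damage", "NoDamage", "NoDamage"])
independent = cramers_v(["Yes", "Yes", "No", "No"],
                        ["Damage", "NoDamage", "Damage", "NoDamage"])
print(perfect, independent)
```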

Clustering (Unsupervised Learning)

Clustering was done by first generating a dissimilarity matrix using the “gower” distance, then using the “pam” function to find the best number of clusters. Using sample datasets (1,000 observations) containing all the geolocation-related features (i.e. address, zipcode, boro name, longitude, latitude), the optimal number of clusters found is 6. These clusters more or less correspond to the borough in which the trees are located. See image below.

After removing all geolocation-related features except longitude and latitude, the optimal number of clusters is found to be 2, which corresponds to the sidewalk condition of either damaged or not damaged.
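The Gower distance is useful here because the dataset mixes numeric features (such as tree_dbh) with categorical ones (such as boro_name). A minimal Python sketch of the idea follows: numeric differences are scaled by the feature's range, categorical features contribute 0 when equal and 1 otherwise, and the results are averaged over all features. This is a simplified illustration, not the R implementation used in the project; it ignores missing values and feature weights, and the sample rows are made up.

```python
def gower_matrix(rows, numeric_keys, categorical_keys):
    """Pairwise Gower dissimilarity for records with mixed feature types."""
    # Range of each numeric feature, used to scale differences into [0, 1].
    ranges = {}
    for key in numeric_keys:
        values = [row[key] for row in rows]
        ranges[key] = max(values) - min(values)

    n_features = len(numeric_keys) + len(categorical_keys)
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for key in numeric_keys:
                if ranges[key] > 0:  # a constant feature contributes nothing
                    total += abs(rows[i][key] - rows[j][key]) / ranges[key]
            for key in categorical_keys:
                total += 0.0 if rows[i][key] == rows[j][key] else 1.0
            matrix[i][j] = matrix[j][i] = total / n_features
    return matrix


# Made-up rows, for illustration only.
rows = [
    {"tree_dbh": 3, "boro_name": "Queens", "sidewalk": "NoDamage"},
    {"tree_dbh": 13, "boro_name": "Queens", "sidewalk": "Damage"},
    {"tree_dbh": 23, "boro_name": "Bronx", "sidewalk": "Damage"},
]
dist = gower_matrix(rows, ["tree_dbh"], ["boro_name", "sidewalk"])
for line in dist:
    print(line)
```

A medoid-based method such as PAM can then work directly on this matrix, since it only needs pairwise dissimilarities rather than coordinates.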

Classification (Supervised Learning)

The R caret package was used to run various machine learning classification algorithms on the full dataset, using the typical 80/20 (train/test) split validation method. The accuracy results are outlined below.

Logistic Regression
- GLM: 75.8%
- GLMNet: 75.8%
KNN: 76.9%
Naive Bayes: 73.2%
Tree Based Classification
- GBM: 77.3%
- XGBoost: 77.9%
SVM (radial kernel): 77.1%
Neural Net: 77.4%
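The 80/20 split validation behind these accuracy figures can be sketched in plain Python. The project itself used R for this step, so the code below is only an illustration of the procedure: shuffle the rows, hold out 20% for testing, fit on the rest, and report the share of correct predictions on the held-out part. The "model" here is a trivial majority-class baseline, and the labels, seed, and sample data are assumptions made for the example.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_label(labels):
    """Most frequent label; a trivial baseline 'model'."""
    return Counter(labels).most_common(1)[0][0]


# Made-up labels, for illustration only.
labels = ["NoDamage"] * 7 + ["Damage"] * 3
train, test = train_test_split(labels)
baseline = majority_label(train)
baseline_accuracy = accuracy([baseline] * len(test), test)
print(len(train), len(test), baseline, baseline_accuracy)
```

Comparing a real classifier against a baseline like this helps put an accuracy figure in context, since a model that always predicts the more common class already gets a non-trivial score.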

Overall, the accuracy results for these algorithms were fairly close, and given the nature of the problem, the simple Logistic Regression models were found to be well suited for use in the analysis application described below. As for which features are most important in determining sidewalk damage, both the tree-based and logistic regression models broadly agree that having blocks around the tree (root_stone), tree diameter, and location play important roles.
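To make concrete how a fitted logistic regression turns features such as tree diameter and root_stone into a damage probability, here is a small Python sketch. The coefficients are HYPOTHETICAL placeholders invented for this example, not values fitted in the project, and the "Yes" / "No" encoding of root_stone is likewise an assumption.

```python
import math

# HYPOTHETICAL coefficients, invented for illustration only.
# They are NOT the values estimated in this project.
INTERCEPT = -1.5
COEF_TREE_DBH = 0.05     # change in log-odds per inch of diameter
COEF_ROOT_STONE = 0.8    # added to the log-odds when root_stone == "Yes"


def damage_probability(tree_dbh, root_stone):
    """Logistic model: probability of sidewalk damage for one tree."""
    z = INTERCEPT + COEF_TREE_DBH * tree_dbh
    if root_stone == "Yes":  # ASSUMPTION: column is encoded as "Yes" / "No"
        z += COEF_ROOT_STONE
    return 1.0 / (1.0 + math.exp(-z))


p_without = damage_probability(30, "No")
p_with = damage_probability(30, "Yes")
print(p_without, p_with)
```

Because the prediction is a single weighted sum passed through a sigmoid, a model of this shape is easy to port into another language once the coefficients are known, which may be one reason a simple model suits a desktop application.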

Analysis Application “NYC Tree Insights”

To make the models accessible to non-technical users, a desktop application has been developed that analyzes the “dead trees” data to predict the potential for sidewalk damage at various points in the future (10, 20, 30, 50, and 75 years).

The application also allows for rapid visual analysis through bar plots and links to Google Maps for viewing the area, and even the dead tree in question.

Conclusion

Using the NYC 2015 Tree Census dataset, a classification model with an accuracy of over 75% in predicting root-induced sidewalk damage was developed. Moreover, a Java-based desktop application was built around this model to help stakeholders assess the likelihood of future sidewalk damage if a certain species of tree is planted at a particular location. The next steps for this project are:

Migrate machine learning backend to Amazon services
Create web application
Create mobile application

About Author

Nathan Stevens

Nathan holds a Ph.D. in Nanotechnology and Materials Science from the City University of New York graduate school, and has worked on numerous software and scientific research projects over the last 10 years. Software projects have ranged from...


