Exploratory Data Visualization of Manhattan Street Trees
Contributed by Belinda Kanpetch. She is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on her first class project, R visualization (due in the 2nd week of the program).
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Why Street Trees?
The New York City street tree can sometimes be taken for granted or go unnoticed. Located along paths of travel, street trees stand steady and patient, quietly going about their business of filtering pollutants out of our air, bringing us oxygen, providing shade during the warmer months, blocking winds during cold seasons, and relieving our sewer systems during heavy rainfall, all while beautifying our streets and neighborhoods. Some recent data studies have found a link between the presence of street trees and lower stress levels in urban citizens.
So what makes a street tree different from any other tree? Mainly its location. A street tree is defined as any tree that lives within the public right of way, not in a park or on private property. Although street trees reside in the public right of way (that is, within the jurisdiction of the Department of Transportation), they are the property of, and cared for by, the NYC Department of Parks and Recreation.
With the intent to understand the data and explore what it was telling me, I started with some very basic questions:
- How many street trees are there in Manhattan?
- How many different species are there?
- What is the general condition of the street trees?
- What is the distribution of species by community district?
- Is there a connection between the median income of a community district and the number of street trees?
The Data Set
The dataset used for this exploratory visualization was downloaded from the NYC Open Data Portal and was collected as part of TreeCount! 2015, a street tree census maintained by the NYC Department of Parks and Recreation. The first census was conducted in 1995, and it has been repeated every 10 years by trained volunteers.
The dataset posed several challenges. There were 2285 observations with an unclassifiable species type and 487 observations with an unclassifiable community district. Geographic information (longitude and latitude) was stored as character strings that had to be split into separate variables. Species were recorded as four-letter codes with no reference to genus, species, or cultivar, so I had to find another dataset to decipher them.
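The two cleaning steps described above (splitting a coordinate string and decoding species codes) could be sketched as follows. The original project was done in R; this is a minimal Python illustration using only the standard library. The field names, the "(lat, lon)" string format, and the contents of the lookup table are assumptions made for the example, not the actual TreeCount! 2015 schema.

```python
import re

# Hypothetical raw records: field names, coordinate format, and codes are
# illustrative assumptions, not the real dataset layout.
raw_trees = [
    {"species_code": "GLTR", "location": "(40.7484, -73.9857)"},
    {"species_code": "PYCA", "location": "(40.7831, -73.9712)"},
    {"species_code": "0", "location": "(40.8000, -73.9500)"},
]

# Hypothetical lookup standing in for the second dataset that maps codes to names.
species_lookup = {
    "GLTR": "Honeylocust",
    "PYCA": "Ornamental Pear",
}

# Two signed decimal numbers separated by a comma.
COORD_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def split_location(location):
    """Split a '(lat, lon)' string into two floats, or (None, None) if unparseable."""
    match = COORD_PATTERN.search(location)
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))


def clean_tree(record):
    """Return a record with a readable species name and numeric coordinates."""
    latitude, longitude = split_location(record["location"])
    return {
        # Codes missing from the lookup are kept as an explicit "Unknown" group.
        "species": species_lookup.get(record["species_code"], "Unknown"),
        "latitude": latitude,
        "longitude": longitude,
    }


cleaned_trees = [clean_tree(record) for record in raw_trees]
```

Keeping undecodable codes as an explicit "Unknown" group, rather than dropping those rows, keeps the total tree count intact for later summaries.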
Visualizing the data
A quick summary of the dataset revealed a total of 51,660 trees in Manhattan across 91 identifiable species, plus one ‘species’ group made up of missing values.
A bar plot of all 92 species groups gave an interesting snapshot of the range in the total number of trees per species. It was quite obvious that one species had a dominant presence. To get a better understanding of the counts and of which species were common, I broke the species down by quartile and plotted each group.
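A summary of this kind comes down to counting rows per species. The sketch below shows the idea in Python (the original work was done in R) on a handful of made-up species codes; the code "0" is assumed to mark the missing-species group.

```python
from collections import Counter

# Toy species codes (hypothetical); "0" stands in for the missing-species group.
species_codes = ["GLTR", "PYCA", "GLTR", "0", "GIBI", "GLTR", "PYCA"]

# Number of trees per species code.
species_counts = Counter(species_codes)

total_trees = sum(species_counts.values())
species_groups = len(species_counts)  # counts the "0" group as well
identified_species = sum(1 for code in species_counts if code != "0")

print(f"{total_trees} trees, {identified_species} identified species, "
      f"{species_groups} groups including missing values")
```

Reporting the identified species separately from the total number of groups makes it clear why one figure is one larger than the other.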
Plotting the first quartile (total < 3.75) revealed several species represented by only a single tree in all of Manhattan!
Within the IQR (3.75 < total < 181.75) there was still quite a range, but nothing stood out as a species to investigate.
The distribution within the 4th quartile (181.75 < total ≤ 11,529) was informative in that it helped to visualize the dominance of two specific species, the Honeylocust and the Ornamental Pear, which make up 23% and 15% of all the trees in Manhattan, respectively. Next came Ginkgo trees with 9.47% and London Plane with 7.8%. This quartile also contained the missing species group ‘0’.
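The quartile breakdown described above can be sketched as follows. This is a Python illustration (the original analysis used R) on made-up per-species counts. It assumes the cut points come from linear interpolation between data points, which is what `statistics.quantiles` with `method="inclusive"` does, and which should correspond to the default behaviour of R's `quantile()`.

```python
from statistics import quantiles

# Toy per-species tree counts (hypothetical values, not the census figures).
species_counts = {
    "A": 1, "B": 1, "C": 4, "D": 12,
    "E": 60, "F": 150, "G": 400, "H": 1200,
}

# Quartile cut points of the per-species totals.
q1, median, q3 = quantiles(sorted(species_counts.values()), n=4, method="inclusive")

# Split species into the three groups that were plotted separately.
first_quartile = {s: c for s, c in species_counts.items() if c < q1}
interquartile = {s: c for s, c in species_counts.items() if q1 <= c <= q3}
fourth_quartile = {s: c for s, c in species_counts.items() if c > q3}

# Percentage share of all trees held by each species.
total_trees = sum(species_counts.values())
shares = {s: 100 * c / total_trees for s, c in species_counts.items()}

print(f"Q1 = {q1}, Q3 = {q3}")
print("Rare species:", sorted(first_quartile))
print("Dominant species:", sorted(fourth_quartile))
```

Plotting each group on its own axis avoids the problem seen in the full bar plot, where one dominant species flattens every other bar.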
A palette of the top 4 species in Manhattan.
Looking at data on trees by Community District
I wanted to look at community districts as opposed to zip codes because, in my opinion, community districts are more representative of community cohesiveness and character. So I plotted the distribution of trees by community district and tree condition.
Plotting the species distribution by community district using a facet grid helped visualize other species that did not appear dominant in the previous graphs. It would be interesting to look further into what those species are and why they are more dominant within some community districts than others.
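The counts behind a district-by-condition plot amount to a two-way tally. A minimal Python sketch (the original plots were made in R) is shown below; the district numbers, the condition labels "Good", "Fair", and "Poor", and the field names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy records (hypothetical district IDs, condition labels, and field names).
trees = [
    {"community_district": 101, "condition": "Good"},
    {"community_district": 101, "condition": "Good"},
    {"community_district": 101, "condition": "Poor"},
    {"community_district": 102, "condition": "Fair"},
    {"community_district": 102, "condition": "Good"},
]

# Tally tree condition within each community district.
condition_by_district = defaultdict(Counter)
for tree in trees:
    condition_by_district[tree["community_district"]][tree["condition"]] += 1

# Print one row per district, ready to feed into a grouped or stacked bar plot.
for district in sorted(condition_by_district):
    print(district, dict(condition_by_district[district]))
```

The same tally, keyed on species instead of condition, would give the counts for a per-district species breakdown.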
Attempts at mapping
The ultimate goal was to plot each individual tree location on a map of Manhattan with the community districts outlined or shaded in. I attempted this using leaflet and using ggplot, bringing in shapefiles and converting them to a data frame, but neither approach yielded anything useful. The only visualization I was able to get was with qplot, which took over 2 hours to render.
Next steps:
I would like to take this into our next project, which uses Shiny. For that project I would:
- Map tree locations with community district outlines.
- Scale the analysis to include all 5 boroughs for comparison.
- Add variables, such as drought and soil tolerances, to better understand the tree characteristics by geographic location.
- Look at previous years' census data to compare tree condition over time.