Visualizing New York City's Parking Violation Data
Contributed by Steven Ginsberg. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his first class project, R visualization, due in the 2nd week of the program.
For my first R development project, I set out to visualize the parking violations issued in New York City during fiscal year 2016. My goal was to find patterns in parking ticket issuances and find the perfect parking spot. Of course, past performance is no indicator of future performance, so be warned!
Loading the Data
Once downloaded, the data was easy to import, though time-consuming (1.4 GB, 7.3 million records). Working with a large dataset, regardless of the software, can cause problems during the development cycle. For instance, when I joined tables incorrectly, I often ran out of memory. With those bugs fixed, the data loads smoothly in a reasonable amount of memory.
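One common way to keep memory use down with a file this size is to read only the columns the analysis needs. The following is a minimal base R sketch, not the project's actual code: it writes a tiny stand-in CSV to a temporary path, and the column names are assumptions modeled on typical headers in this kind of dataset.

```r
# Stand-in for the 1.4 GB file: a tiny CSV written to a temp path.
# The column names below are illustrative assumptions.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Summons Number,Issue Date,Violation Code,Street Name,Plate ID",
  "1001,07/01/2015,21,BROADWAY,ABC123",
  "1002,07/02/2015,38,5TH AVE,XYZ789"
), csv_path)

# Read a single row first, just to learn the column names as written in the file.
header <- names(read.csv(csv_path, nrows = 1, check.names = FALSE))

# Keep only the columns we need; "NULL" tells read.csv to skip a column entirely.
keep <- c("Issue Date", "Violation Code", "Street Name")
col_classes <- ifelse(header %in% keep, "character", "NULL")

tickets <- read.csv(csv_path, colClasses = col_classes,
                    check.names = FALSE, stringsAsFactors = FALSE)
str(tickets)
```

Reading everything as character avoids type-guessing surprises on messy fields; the columns can be converted once the data has been checked.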
The primary data file is meant to cover ticket issuances for the 2016 NYC fiscal year, that is, July 2015 through June 2016 (a partial year). Unfortunately, the actual data includes tickets dated from January 2015 through December 2016. Further investigation is required to identify the exact cause of the problem. Other data problems include missing and incorrect fields and inconsistencies in the addresses. I have assumed that the out-of-range dates are typos and that those tickets were actually issued within the fiscal year.
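A quick way to see how many records fall outside the fiscal year is to parse the dates and compare them against the July 2015 to June 2016 window. This is a sketch on made-up sample values; the MM/DD/YYYY text format and the variable names are assumptions, not taken from the project's code.

```r
# Hypothetical sample of issue dates, assumed to be MM/DD/YYYY text.
issue_date_raw <- c("07/01/2015", "01/15/2015", "06/30/2016",
                    "12/25/2016", "03/10/2016")
issue_date <- as.Date(issue_date_raw, format = "%m/%d/%Y")

# Fiscal year 2016 window: July 1, 2015 through June 30, 2016.
fy_start <- as.Date("2015-07-01")
fy_end   <- as.Date("2016-06-30")
in_fy <- !is.na(issue_date) & issue_date >= fy_start & issue_date <= fy_end

# How many records fall inside and outside the fiscal year.
print(table(in_fiscal_year = in_fy))

# Which months the out-of-range records claim, to help diagnose the problem.
print(table(format(issue_date[!in_fy], "%Y-%m")))
```

This only flags the out-of-range records; it does not drop or correct them.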
Cleaning and Supplementing the Data
The code can be found HERE.
To clean and supplement the data, the code performs the following data manipulations:
- Load the main file, 'Parking_Violations_Issued_-_Fiscal_Year_2016.csv'
- Load the definitions for the violation codes
- Load the precinct/borough/coordinates data
- Clean up some of the field names
- Join the tables
- Remove some temporary variables
After these steps are finished, we have 7.3 million records with 64 fields.
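The join step can be sketched in base R with merge(). This is an illustration under stated assumptions rather than the project's code: the three small tables, their column names, the violation descriptions, and the coordinates are all made-up placeholders.

```r
# Hypothetical miniature versions of the three tables; all names and values
# are illustrative placeholders.
tickets <- data.frame(
  violation_code = c(21, 38, 21, 14),
  precinct       = c(1, 13, 1, 19)
)
violation_codes <- data.frame(
  violation_code = c(14, 21, 38),
  description    = c("No standing", "Street cleaning", "Meter receipt")
)
precincts <- data.frame(
  precinct = c(1, 13, 19),
  borough  = c("Manhattan", "Manhattan", "Manhattan"),
  lat      = c(40.72, 40.74, 40.77),    # placeholder coordinates
  lon      = c(-74.01, -73.98, -73.96)
)

# Duplicate keys in a lookup table would multiply rows in the join,
# which is an easy way to exhaust memory at 7.3 million records.
stopifnot(!anyDuplicated(violation_codes$violation_code),
          !anyDuplicated(precincts$precinct))

n_before <- nrow(tickets)

# Left joins: keep every ticket and attach lookup columns where the key matches.
joined <- merge(tickets, violation_codes, by = "violation_code", all.x = TRUE)
joined <- merge(joined, precincts, by = "precinct", all.x = TRUE)

# A lookup join should leave the number of tickets unchanged.
stopifnot(nrow(joined) == n_before)

# Remove temporary objects once the joined table exists.
rm(violation_codes, precincts, n_before)
```

Checking the row count before and after each join is a cheap guard against the kind of incorrect join that blows up memory.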
Visualizing the Results
One distinction in R and ggplot that I had to get familiar with in a hurry is the difference between discrete and continuous data. This dataset has no continuous fields, which limits the types of plots available. The code can be viewed HERE.
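To illustrate the distinction, here is a minimal sketch, assuming the ggplot2 package is installed and using a made-up sample with a hypothetical borough column. A discrete field maps naturally onto a bar chart of counts, while geoms such as geom_histogram() are designed for a continuous x.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Hypothetical sample: the only field is categorical, as in the ticket data.
tickets <- data.frame(
  borough = c("Manhattan", "Manhattan", "Brooklyn",
              "Queens", "Manhattan", "Bronx")
)

# A discrete variable: geom_bar() counts the rows in each category.
p <- ggplot(tickets, aes(x = borough)) +
  geom_bar() +
  labs(x = "Borough", y = "Tickets issued")

# Counting per category produces a numeric summary, which is the usual way
# to get a continuous value out of purely discrete fields.
borough_counts <- as.data.frame(table(borough = tickets$borough))
print(borough_counts)
```

Printing p (or calling print(p)) draws the chart.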
First, I took a look at the issuances across the boroughs.
Not surprisingly, Manhattan has the most issuances and accounts for about half of the database. The remaining visuals show Manhattan only (sorry BBQS). I created a variable to switch between Manhattan only and the whole city (in PVI Graphs.R, lines 24-25). However, the scales of all the charts are set for the Manhattan-only data.
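The borough breakdown and the Manhattan/whole-city switch might look something like the sketch below. The sample data and the names (borough, manhattan_only, plot_data) are hypothetical and are not copied from PVI Graphs.R.

```r
# Hypothetical sample of the joined data; column and variable names are
# illustrative.
tickets <- data.frame(
  borough = c("Manhattan", "Manhattan", "Brooklyn", "Queens",
              "Manhattan", "Bronx", "Manhattan", "Staten Island")
)

# Tickets per borough, largest first, and each borough's share of the total.
borough_counts <- sort(table(tickets$borough), decreasing = TRUE)
borough_share  <- round(prop.table(borough_counts), 2)
print(borough_share)

# One flag switches the remaining charts between Manhattan and the whole city.
manhattan_only <- TRUE
plot_data <- if (manhattan_only) {
  tickets[tickets$borough == "Manhattan", , drop = FALSE]
} else {
  tickets
}
nrow(plot_data)
```

Because axis limits tuned for one setting will not suit the other, any fixed scales would need to change along with the flag.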
Next, I took a look at the types of violations.
The third chart is a look at the top 10 dates on which tickets were issued.
While these 10 days account for only 10% of the tickets, their counts are up to 10 times the daily average (represented by the dots, at 2,500 tickets). Also, all but one are in July 2015. I tried and failed to find any news that would explain this. Because the spike falls right at the change of the fiscal year, it raises questions about the quality of the data.
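The comparison between the busiest days and the daily average can be sketched as follows. The dates are made up, only the top 2 days are taken to keep the sample small, and the average is computed over days with at least one ticket, which is an assumption about how such a figure might be defined.

```r
# Hypothetical sample: one entry per ticket, holding its issue date.
issue_date <- as.Date(c(
  rep("2015-07-01", 6), rep("2015-07-02", 5),
  rep("2015-09-10", 2), rep("2015-11-03", 1),
  rep("2016-02-17", 2)
))

# Tickets per day, busiest days first.
daily_counts <- sort(table(issue_date), decreasing = TRUE)
top_days <- head(daily_counts, 2)   # the post looks at the top 10

# Average tickets per day, over days with at least one ticket (an assumption).
daily_avg <- mean(as.numeric(daily_counts))

# Share of all tickets issued on the top days, and each top day's
# count relative to the daily average.
top_share    <- sum(top_days) / sum(daily_counts)
ratio_to_avg <- as.numeric(top_days) / daily_avg

print(top_days)
print(round(c(daily_avg = daily_avg, top_share = top_share), 3))
print(round(ratio_to_avg, 2))
```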
Finally, I charted the top 20 streets on which the tickets were issued.
While this is an interesting view, it doesn't go as far as I'd like. Broadway wins the prize, but it is also the longest street in the city, and it would be nice to know where on Broadway the tickets were issued. Unfortunately, I was unable to find latitude/longitude information to identify specific hot spots. I was able to attach latitude/longitude coordinates to the police precincts, as shown in chart 5. That chart is a jitter plot, but since every point is centered on its precinct rather than on the actual street location, it's not helpful.