Data Visualization on Birds
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
The American Birder
For the millions of bird watchers in America, relevant and useful data resources are always welcome. Range maps and ecological histories enhance the bird watching experience by adding a layer of conservation awareness and help hobbyists become more acquainted with the birds they observe.
As a birder myself, I am always looking for new applications that help me achieve a greater understanding of the birds I observe on a daily basis. Learning more about these fascinating and beautiful animals helps to bring the bigger picture into focus; they have a much larger role in our world than just the momentary glimpse you get when observing them in a park or when walking down the street.
I chose to use the eBird dataset from the Cornell Lab of Ornithology to construct a Shiny application in R that allows a birder to "zoom out" from an isolated bird observation. Crowd-sourced data from around the world is what enables eBird to track the locations and times of bird observations. Using their online interface, a user can view observations of many different species of birds, explore bird watching hot-spots, and even see a real time observation submission map. Observations are available from as far back as the year 1900, and the increasing accessibility of technology translates to an ever increasing avalanche of data pouring in. The total data set today consists of over 500 million observations!
Using the Shiny package in R, I was able to build an application that explored observations from 2016 of 10 species of birds in the United States. I aggregated the observations by county and produced a graph using ggplot2 to show the frequency of observations throughout the US. The observations can be filtered by month using a slider bar to inspect the distribution of sightings during a particular season.
I also added a feature that allows the user to filter the observations by breeding season, which I implemented by using estimated breeding season ranges from the Cornell Lab of Ornithology Birds of North America website. A feature that I found really interesting and was excited to add to my application is the “play” button which shows an animated map that cycles through the months of the year and displays the bird sightings accordingly. This provides the user with an important perspective on the movement of different species throughout the US which can sometimes be lost when looking at separate monthly range maps one at a time.
In addition to visualizing the movement of birds throughout the US, I wanted to add additional functionality for bird watchers. I thought an interesting question to ask would be: at what time are birds most often being seen?
To answer this question, I added a histogram which shows the frequency of times that a certain species was observed. What I found was that 8:00 AM was by far the most frequent time that an eBird user submitted an observation. What I have concluded is that the data is biased; many more people are actively bird watching around 8:00 AM, thus, the amount of observations spike around that time. This does not necessarily mean that a species is more likely to be seen at 8:00 AM, only that there are more people actively looking.
However, I did find a different trend for the only owl species (the Short-eared Owl) that I included in my list of species. This species showed a maximum sighting frequency at around 5:00 PM, which is in agreement with the fact that owls become more active around dusk. This leads me to believe that the functionality of this feature is mainly relevant for determining very basic activity levels for certain species.
When viewing the range map I found that regional movement of bird sightings was apparent, however it was easy to overlook state-level observation trends. I elected to add a bar graph that brakes down observations by month for every state where there were sightings. This feature allows the user to see a clear pattern in sighting frequency over the course of a year.
The visualization enabled by the graph makes simple work of detecting whether the bird is a year-round resident or only present for certain seasons. This is important for understanding seasonal distribution of species. Although I have not implemented this functionality yet, a daily breakdown of sightings could yield important information on bird migration stopover sites (i.e. where birds temporarily stop to refuel during migration).
The final section of my application allows the user to inspect the data behind the graphics. This can be useful if the user wishes to extract a specific value of observations at a certain time or in a certain location. The data table has a search function that allows the user to filter the data by county, state, or time of day.
Although the application is functional, there are several potential areas for improvement that I would like to address in the future.
- The size of the data for some species is quite large which leads to issues with loading where graphics can take several seconds to render. This makes the play function of the map difficult to use effectively in some cases. I believe that these cases would benefit tremendously from either further optimization of the code or incorporating graphics packages with quicker rendering capabilities (ideally a combination of both)
- I would like to allow the user to inspect a much larger list of species and range of years. Due to the size of the data, storage is a significant issue which may not be avoidable without establishing a dedicated server to host the data.
- the observations by county are currently displayed using a log scale. I decided to use a log scale over the raw number of observations because many areas have observations of 1-100 and were vastly overshadowed by areas that had observations in the thousands. These areas with lower observations can still show significant trends, and I wanted to make sure they were not ignored. Still, this system does not address the issue of there being a bias in sighting frequency strictly due to larger numbers of available birders.
- Areas with larger populations will produce higher numbers of sightings simply due to the fact that there are more people actively looking for birds (similar to the issue I have with the time histogram). I would like to implement a system that standardizes county sightings by county population which would give a more normalized representation of sighting frequency.
- my current system uses counties as groups for aggregating observations by location. This can introduce problems in location detail. For example, many counties in Western United States are very large and can diminish the granularity of the map. I have seen other bird range mapping tools (such as the eBird map) that instead use rectangular areas denoted by longitude and latitude and do not rely on human designated borders. Implementing this method could increase the level of accuracy of the map's representation of sighting hot-spots.
Thank you for taking the time to read about my project! As someone who is passionate about ecology and animal behavior, I found building this application to be very rewarding. I really appreciated the new insights I gained as a result of running it. I am always trying to think about new ways to marry technology and environmental biology, and data science is an incredibly powerful tool that I can use to ask and answer questions in a field that I think is fascinating. Feel free to check out my application and I welcome any feedback! Here is a link to my GitHub repository if you would like to explore my code.