Web Scraping Bike Index to Analyze Stolen Bike Data
During my junior and senior years of high school, on many occasions I would hop on the downtown 6 as soon as the school day ended and head to Theatre 80 St. Marks. Theatre 80 was a cozy revival cinema house that often featured double-bills of film classics, a great bargain for a high school student with little more than pocket change.
I was fortunate to be a patron of this cinema house before it stopped showing films in the summer of 1994, for it was here that I was introduced to one of my favorite films, the Italian neorealist classic “Ladri di Biciclette” (“Bicycle Thieves”) by Vittorio De Sica. Although the story, music, and acting all affected me deeply when I first saw it, the scene in which the thief steals the main character’s bicycle and sets the entire movie in motion did not fully resonate with me until more than 20 years later, when I myself had my bicycle stolen, also while working.
Similar to Antonio Ricci, the main character, I witnessed the moment when the thief took the bicycle and even futilely gave chase on foot until I lost him in the shadows of the Williamsburg Bridge. It appeared at the time that even the quickly fading autumn light was conspiring against me.
It was after this incident two years ago that I looked into resources that were available to people who had their bikes stolen. One of these resources I came across was an online bicycle registry called “Bike Index” that was founded in 2013. The website has close to 150,000 bikes registered from around the world, though predominantly in the United States, and approximately a third of these bicycles are marked as stolen. In addition to allowing individuals to search the database, Bike Index partners with local businesses and organizations, law enforcement agencies, and other apps to alert the community when a bicycle is stolen. It was nice to discover an organization out there dedicated to helping people recover their stolen bicycles.
Given my experience as a bicycle theft victim, I decided it would be an interesting and worthwhile undertaking to web scrape data on the stolen bicycles registered on Bike Index and see whether I could detect any themes in the larger community affected by this crime. These include geographical distribution of the data, features of the bicycles, the circumstances under which they were stolen, and demographic analyses of certain areas with a high incidence of bicycle thefts. Some potential uses of the analysis include pinpointing locations where Bike Index may be deficient in registering stolen bicycles or, conversely, areas that have a high incidence of bicycle thefts that may merit further attention from local law enforcement agencies and the wider public.
Web Scraping Data
For the web scraping portion of the project, I used Python’s Scrapy, a web crawling framework that allows for extraction of data from HTML documents. Bike Index’s database essentially has two levels of information. The first level is a thumbnail list of search results, with ten bicycles per page and some distinguishing attributes for each bicycle. After clicking on the image or header, one arrives at the second level, which provides more detailed information on the bicycle and the circumstances of the theft. Here is an example.
For this project, I intended to scrape all bikes stolen in the United States in the past five years, from October 2012 to October 2017, or approximately 3,700 pages of search results. As a practice run, I first scraped only the information on the first level with random download delays of one to three seconds. I was able to download only a little more than half of the bicycles, apparently because many of my requests were met with HTTP 403 errors from the server.
Therefore, when I web scraped the second level, I added a constant download delay of 3 seconds. The process ended up being too slow, so I aborted the attempt early and iteratively lowered the delay time, based on a small subset of pages, until I reached 0.25 seconds without receiving any HTTP 403 errors. It is unclear why this constant delay of 0.25 seconds was more successful than the first attempt using a random download delay. In any case, I was able to download information on at least 80 percent of the bicycles.
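For reference, the two throttling approaches described above correspond to two Scrapy settings, `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY`. The sketch below is only an assumption about how such settings could be written, not the configuration actually used in this project. When randomization is enabled, Scrapy waits between 0.5 and 1.5 times `DOWNLOAD_DELAY`, so a base delay of 2 seconds would give the one-to-three-second range. The helper `delay_range` is a made-up name that simply makes this arithmetic explicit.

```python
# Hypothetical Scrapy settings; the project's real settings are not shown in this post.

# Randomized delay: Scrapy draws each wait from 0.5x to 1.5x DOWNLOAD_DELAY.
RANDOM_DELAY_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,
    "RANDOMIZE_DOWNLOAD_DELAY": True,
}

# Constant delay: consecutive requests to the site are spaced DOWNLOAD_DELAY seconds apart.
CONSTANT_DELAY_SETTINGS = {
    "DOWNLOAD_DELAY": 0.25,
    "RANDOMIZE_DOWNLOAD_DELAY": False,
}


def delay_range(settings):
    """Return the (min, max) wait in seconds implied by the settings."""
    base = settings["DOWNLOAD_DELAY"]
    # Scrapy enables randomization by default, hence the True fallback.
    if settings.get("RANDOMIZE_DOWNLOAD_DELAY", True):
        return (0.5 * base, 1.5 * base)
    return (base, base)


print(delay_range(RANDOM_DELAY_SETTINGS))    # (1.0, 3.0)
print(delay_range(CONSTANT_DELAY_SETTINGS))  # (0.25, 0.25)
```

Either dictionary could be assigned to a spider's `custom_settings` attribute to apply it to that spider only.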
But when I examined the information, I realized that the field that contained the color description of the bicycle was missing because I did not have “color” in upper case in my code. It felt incomplete without this information, so I web scraped the website once again and ended up finally with the following fields, in no particular order:
- Serial number
- Bike manufacturer
- Bike model
- Model year
- Frame material
- Location of theft
- Lock description
- How it was stolen
- Date of theft
- Detailed description of incident
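The missing color field mentioned above is a reminder that label matching in a scraper is usually case-sensitive. One way to guard against this is to normalize labels before looking them up, as in the minimal sketch below. The label/value pairs and the function name `normalize_fields` are invented for illustration and do not reflect the actual markup of Bike Index pages.

```python
def normalize_fields(pairs):
    """Build a dict keyed by trimmed, lower-cased labels without trailing colons."""
    fields = {}
    for label, value in pairs:
        key = label.strip().rstrip(":").strip().lower()
        fields[key] = value.strip()
    return fields


# Hypothetical label/value pairs as they might come off a detail page.
scraped = [
    ("Serial:", " ABC123 "),
    ("Color:", "Black"),
    ("Frame Material:", "Steel"),
]
fields = normalize_fields(scraped)

# The lookup succeeds regardless of how the label is capitalized on the page.
print(fields.get("color"))           # Black
print(fields.get("frame material"))  # Steel
```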
Description of Data Set
From Bike Index, I was able to scrape information on approximately 30,000 bicycles. After cleaning the data by excluding those observations that were outside the United States or did not contain zip code information, I ended up with approximately 25,000 bicycles. In addition to the data from Bike Index, I also used 2015 demographic data sourced from the American Community Survey (“ACS”), which is conducted by the United States Census Bureau. Demographic information for select counties was used in the analysis, as will be discussed later in this blog post.
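The cleaning step described above can be sketched in a few lines. This is an illustration under assumptions, not the code used for the project: the field names (`country`, `zip`), the country code `"US"`, and the sample records are all made up, and a zip code is treated as valid if it has five digits with an optional four-digit extension.

```python
import re

# Five digits, optionally followed by a ZIP+4 extension.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def clean_records(records):
    """Keep records located in the United States that have a plausible zip code."""
    cleaned = []
    for rec in records:
        zip_code = (rec.get("zip") or "").strip()
        if rec.get("country") == "US" and ZIP_RE.match(zip_code):
            # Store only the five-digit part so records group cleanly by zip code.
            cleaned.append({**rec, "zip": zip_code[:5]})
    return cleaned


# Made-up records for illustration only.
raw = [
    {"serial": "A1", "country": "US", "zip": "10002"},
    {"serial": "B2", "country": "CA", "zip": "M5V 2T6"},
    {"serial": "C3", "country": "US", "zip": ""},
    {"serial": "D4", "country": "US", "zip": "97214-1234"},
]
print([r["serial"] for r in clean_records(raw)])  # ['A1', 'D4']
```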
Geographical Data Analysis
The first analysis I performed was to map all the zip codes where at least one bicycle was stolen in the last five years to get an idea of how the data is geographically distributed. The map uses OpenStreetMap tiles rendered with Leaflet, and the following image is a screenshot from the Shiny app I used to present my analysis and findings.
Hovering over each zip code, one can ascertain information on the city, the zip code, and the number of bicycles stolen in the last five years. It is not surprising that we see a lot of red dots around the major metropolitan areas on the coasts, especially on the West Coast and in the Northeast. However, I was a bit surprised that a large portion of the country between the coasts was so sparsely marked with red dots. The banner across the top of the image is a list of the top three cities by number of bikes stolen, all on the West Coast.
Anecdotally, I understand that San Francisco, Seattle, and Portland have thriving bicycle scenes, but nevertheless, I was surprised that New York City was not among the top three for any of the years selected. In fact, it did not even make the top ten.
This inspired me to do a state by state comparison, which more clearly revealed to me that we might not be dealing with a representative sample of stolen bicycles. The following is a map of the country with each state shaded according to the number of bicycles stolen.
The map clearly demonstrates that the dataset is heavily skewed towards the West Coast, in particular the states of California, Oregon, and Washington, which together accounted for more than 60 percent of the bicycles in the dataset.
According to the data, the State of New York had only 895 stolen bicycles over this time period, which clearly did not make sense if one were to view the data as a geographical representation of the total population of stolen bicycles in this country. This was further confirmed when I performed additional research on the founders of Bike Index, both of whom are from or are living on the West Coast. Quite possibly the website first gained traction among communities along the West Coast and remained a popular service for bicycle owners there relative to those from other parts of the country.
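The regional share quoted above is simple arithmetic on per-state counts. The following sketch shows the calculation on a made-up list of state labels; the numbers are placeholders and are not the scraped data.

```python
from collections import Counter

# Made-up state labels, one per stolen bicycle, for illustration only.
states = ["CA"] * 5 + ["OR"] * 2 + ["WA"] * 1 + ["NY"] * 2

counts = Counter(states)
total = sum(counts.values())

# Combined share of the three West Coast states.
west_coast = sum(counts[s] for s in ("CA", "OR", "WA"))
print(f"{west_coast / total:.0%}")  # 80%

# States ranked by number of bicycles, largest first.
print(counts.most_common(2))  # [('CA', 5), ('OR', 2)]
```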
Variables Data Analysis
I then proceeded to perform some descriptive statistical analyses of some of the attributes I scraped from Bike Index. First was the distribution of bikes stolen across the various manufacturers. The top three bicycle manufacturers (Trek, Specialized, and Giant) account for approximately a third of the stolen bikes registered on Bike Index.
As an interesting factoid, over half of the bicycles stolen in the past five years were either black or multi-colored.
And the data clearly shows a downward trend in the number of bikes stolen as the weather gets colder.
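Grouping thefts by season requires a rule for mapping each theft date to a season. The post does not say which rule was used, so the sketch below assumes meteorological seasons (December through February as winter, and so on) and uses made-up dates.

```python
from collections import Counter
from datetime import date

# Assumed meteorological seasons; the original analysis may have used other boundaries.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def season_of(theft_date):
    """Return the season name for a theft date."""
    return SEASON_BY_MONTH[theft_date.month]


# Made-up theft dates for illustration only.
thefts = [date(2016, 7, 4), date(2016, 8, 15), date(2017, 1, 9)]
season_counts = Counter(season_of(d) for d in thefts)
print(season_counts)  # Counter({'summer': 2, 'winter': 1})
```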
Approximately a third of the bicycles that were stolen were locked using a cable lock. The second most frequent case is the one in which the bicycle was not locked at all, accounting for about 20 percent of the bicycles stolen. This was surprising to me initially, as I couldn’t understand how anyone would not lock up their bicycle. After reflecting on my own personal history, however, I can attest that not locking up one’s personal property is an issue when it comes to bicycle thefts: I did not lock my bicycle two out of the three times it was stolen.
In terms of how the bicycle was stolen, more than 50 percent of the time the lock was cut. It should be noted that in the graph below, the “Other” category is a catch-all category for those who wanted to provide more detail of the incident in another field where they were not limited to pre-defined choices. Therefore, the “Other” category may overlap with some of the other bars in the chart below.
As mentioned previously, registered users of the website can provide further details of the incident. Curious to know what some of the more common words in this field were, I created a word cloud. The following are the top 100 most frequent words with at least 50 occurrences.
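The word counts behind such a word cloud can be sketched with the standard library. The function below is an illustration rather than the code used here: the tokenization rule, the stopword list, and the sample descriptions are all assumptions, and the thresholds are lowered in the example so that the tiny sample produces output.

```python
import re
from collections import Counter


def top_words(descriptions, top_n=100, min_count=50, stopwords=frozenset()):
    """Return up to top_n (word, count) pairs that occur at least min_count times."""
    counts = Counter()
    for text in descriptions:
        # Lower-case the text and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return [(w, n) for w, n in counts.most_common(top_n) if n >= min_count]


# Made-up incident descriptions for illustration only.
docs = [
    "Stolen from my basement",
    "Reward offered, stolen from patio",
    "Taken from basement",
]
print(top_words(docs, top_n=3, min_count=2, stopwords={"from", "my"}))
# [('stolen', 2), ('basement', 2)]
```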
Understandably, it appears that many people offered rewards in this field for the return of their bicycle. It is also interesting to note how many words were associated with the home: “home”, “house”, “storage”, “patio”, “apartment”, “building”, “basement”.
Data Seasonality in Bike Thefts?
Going back to the analysis in which I examined the number of bicycles stolen by season of the year, I was curious to know whether one could conclude from the data that seasonal differences were statistically significant. In particular, can one conclude from the data that the average number of bicycles stolen per year over the past five years was different for at least one of the seasons? The hypothesis testing was framed as follows:
Null Hypothesis: The average number of bicycles stolen per year in the population is the same for all four seasons.
Alternative Hypothesis: The average number of bicycles stolen per year in the population is different for at least one of the four seasons.
The following were the numbers of bicycles stolen per year over the five-year period:
Before I could perform a one-way analysis of variance (“ANOVA”), I had to reach some level of comfort that the assumptions of the test were satisfied. There was nothing in the data to suggest that the number of bicycles stolen in each season would be dependent on another. Also, based on the following qq-plot of the observations, I got comfortable that the number of bicycles stolen in each season was approximately normally distributed, as the points all fell close to a straight line.
Although technically a qq-plot of each season was required, I decided that a qq-plot of the entire set of observations was sufficient for the purpose of the test, based on my understanding that the one-way ANOVA is robust with respect to normality and that a qq-plot of only five points for each season would be sparse. Lastly, I applied both Bartlett’s test and Levene’s test to see whether we could reject the null hypothesis that the variances were the same across all four seasons. Because of the high p-values, we couldn’t reject the null at the 5 percent significance level.
After performing these preliminary analyses, I then proceeded to the one-way ANOVA. I arrived at a p-value of .0566, just above the 5% threshold.
Therefore, I could not reject the null hypothesis that the average number of bicycles stolen per year in the population is the same for all four seasons. Despite the vast difference in the average number of bicycles stolen per year between summer (1,621) and winter (942), it appears that there were wide variations within each season across the five-year period, driving down the F-statistic. The website received funding through a Kickstarter campaign in late 2013 and this may have accelerated the growth in registered users over the past several years, increasing the variance in each season.
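To see why wide variation within each season pushes the F-statistic down, it helps to compute the statistic by hand. The sketch below implements the standard one-way ANOVA F-statistic (between-group mean square divided by within-group mean square) in plain Python and applies it to two made-up datasets that share the same group means but differ in spread. The numbers are illustrative only and are not the Bike Index counts. In practice a library routine such as `scipy.stats.f_oneway` would report the same statistic together with a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up data: identical group means (10, 20, 30) but different spread.
tight = [[9, 10, 11], [19, 20, 21], [29, 30, 31]]
wide = [[0, 10, 20], [10, 20, 30], [20, 30, 40]]

print(one_way_anova_f(tight)[0])  # 300.0
print(one_way_anova_f(wide)[0])   # 3.0
```

With four seasons and five yearly observations per season, the same function would return 3 and 16 degrees of freedom.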
Demographics Data Analysis
The next set of questions I wanted to research was whether one could detect any demographic differences between bicycle theft “hotspots” and other areas. To focus my research, I looked at the top ten counties by number of bicycles stolen in the past five years. Seven out of the top ten counties were on the West Coast. I then divided each county between the top quartile of zip codes and the bottom three quartiles of zip codes ranked by number of bicycles stolen. Finally, I joined the Bike Index data set with the demographic data set from the ACS.
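The quartile split and join described above can be sketched as follows for a single county. This is an illustration under assumptions: the zip code labels, theft counts, and demographic percentages are made up, the top quartile is taken as the first quarter of zip codes (rounded up) after sorting by theft count, and ties are left in their original order. The post does not specify how rounding or ties were handled.

```python
import math
from statistics import mean


def split_top_quartile(theft_counts):
    """Split zip codes into the top quartile by theft count and the remainder."""
    ranked = sorted(theft_counts, key=theft_counts.get, reverse=True)
    cutoff = math.ceil(len(ranked) / 4)
    return ranked[:cutoff], ranked[cutoff:]


# Made-up theft counts per zip code within one county.
thefts_by_zip = {"Z1": 40, "Z2": 25, "Z3": 12, "Z4": 9, "Z5": 7, "Z6": 5, "Z7": 3, "Z8": 1}

# Made-up demographic field keyed by zip code (percent of population aged 25-34).
pct_age_25_34 = {"Z1": 30, "Z2": 26, "Z3": 18, "Z4": 20, "Z5": 16, "Z6": 14, "Z7": 15, "Z8": 13}

top, rest = split_top_quartile(thefts_by_zip)

# Join on zip code, skipping any zip code missing from the demographic data.
top_avg = mean(pct_age_25_34[z] for z in top if z in pct_age_25_34)
rest_avg = mean(pct_age_25_34[z] for z in rest if z in pct_age_25_34)

print(top)                # ['Z1', 'Z2']
print(top_avg, rest_avg)  # 28 16
```

A simple average of zip-level percentages ignores differences in population between zip codes; weighting by population would be a natural refinement.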
The first demographic field I compared was the percent of males in the population. As the following chart shows, there does not seem to be much difference between the top quartile and bottom three quartiles in each county.
Next I examined the distribution of the population by age. Across all ten counties, a higher percentage of the population in the top quartile fell within the 25-34 and the 35-44 age groups compared to the bottom three quartiles. The difference appeared to be the largest for the 25-34 age group. This is consistent with my impression that a larger portion of the young adult population rides bicycles compared to the other age groups and therefore has a greater chance of having a bicycle stolen. For illustrative purposes, the following is the chart for Cook County, which had the highest differential in the 25-34 age group.
Lastly, I examined the distribution of the population by race. Across all ten counties, the top quartile had a higher percentage of people classified as “White” versus the bottom three quartiles. The biggest difference was in Orleans Parish, which is shown in the chart below:
Sadly, in this country, income and wealth are highly correlated with race and I wonder if the differences between the top quartile and bottom three quartiles in terms of race might also be related to the differences in the income and wealth distribution of the various zip codes.
The web scraping project provided a wonderful opportunity to further examine an issue that is personally relevant using data that in all likelihood would not have been available otherwise.
Furthermore, if I had not scraped the data and combined it with visualization, I would not have discovered how skewed the data was geographically. Either the West Coast is really a hotbed of bicycle thefts compared to the rest of the country or, in the more likely scenario, the users of the website over the past five years have been disproportionately represented by people living on the West Coast. One way to test this would have been to web scrape the rest of the website by including those bicycles that were not marked as stolen.
If this subset were also skewed towards the West Coast, it would have provided further evidence that the latter scenario was the case. This information would also indicate that Bike Index has an opportunity to deepen their footprint in the rest of the country.
The demographics and variables analyses also hinted at future work that, with a more robust dataset of bicycle thefts, might yield useful insights. One could test, for example, the statistical significance of the differences in age and race distributions mentioned earlier across a larger portion of the country.
Another possibility is to use variables such as the demographic profile of a geographic location to predict the number of bike thefts over a certain period. This information could then be used to concentrate recovery and enforcement efforts at certain “hotspots”. Bike thefts will not be entirely eliminated but a strategic effort to analyze the issue using data science could hopefully help turn the tide against this perennial bane of bicycle owners.