Clustering and Classifying Violent Behavior From Criminal Records

Posted on Apr 19, 2018



It's nice to know that there are applications of data science that go beyond business settings and corporate profits. A recent pro bono project I took on for the National Network for Safe Communities gave me first-hand experience applying machine learning methods in service of our community. The research arm of New York's John Jay College of Criminal Justice shared data provided by the district attorney of a city I shall not name for non-disclosure agreement reasons.

The focus was on intimate partner violence and on providing outreach programs for such cases. The problem was that, given so many different case records, manually fishing out the relevant ones is quite inefficient, so the goal was to develop an easier way to do this.



The data spans the years 2015 to 2017. The DA (district attorney) data contains features describing the details of previous cases, for example victim/suspect names, location of crime, suspect actions, etc. In its raw form, the data does not have labels that fit the client's exact definition of 'intimate', but there are columns that can indicate it.

However, since such detailed information is given about these cases, I thought I'd first apply an unsupervised learning method to summarize the data as best I could. I decided to cluster the cases based on all the recorded suspect actions. The goal was to group cases together based on how similarly the suspects behaved. If behavioral profiles can be created, then we may be able to more effectively assign outreach programs to suspects based on which cluster they belong to.



There are 39 features describing suspect behaviors to work with, including actions like "Impaired", "Pushing", and "Threw Items". The catch is that these are binary features, and clustering binary data can be a bit tricky.

After multiple failed experiments with hierarchical clustering and different dissimilarity measures, I found that applying K-Means clustering AFTER transforming the variables with Principal Component Analysis (PCA) resulted in very interpretable clusters.

The overall process can be summarized in the below diagram:




Normally we think of PCA as a means of dimension reduction: we keep the number of principal components that explain enough of the variance. Such a plot for our data would look like this:

However, our goal isn't to reduce the number of features we have; it's to transform them into numerical data we can cluster. To do this, we keep ALL 39 principal component scores (the projections of each case onto the eigenvectors), retaining 100% of the original variance, and cluster on them.
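The PCA-then-K-Means step can be sketched as below; a minimal sketch with stand-in random binary data, since the actual DA data and variable names are not public:

```python
# Sketch of the PCA -> K-Means pipeline described above.
# `X` stands in for the 39 binary suspect-action columns;
# the data here is random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 39))  # stand-in binary case data

# Keep ALL 39 components: no variance is discarded, the data is
# only rotated into uncorrelated numeric coordinates.
scores = PCA(n_components=39).fit_transform(X)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(scores)  # one cluster label per case
```

Because all components are kept, pairwise distances between cases are preserved; the transform only gives K-Means uncorrelated numeric axes to work with.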


The objective function in K-Means clustering minimizes the within-cluster variation. Looking at the scree (elbow) plot, 5 or 6 clusters seemed about right. After experimenting with both, I concluded that 5 clusters was most interpretable.
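An elbow plot like the one described can be built by recording K-Means inertia (the within-cluster variation it minimizes) over a range of k; a sketch with stand-in data:

```python
# Elbow-plot sketch: inertia vs. number of clusters.
# `scores` stands in for the PCA-transformed data; it is random here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 39))  # stand-in for the PCA scores

# Within-cluster variation for each candidate k; the k where the
# curve flattens (the "elbow") suggests a reasonable cluster count.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores).inertia_
    for k in range(2, 9)
}
```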


After assigning each observation a cluster label and matching them back to the original data set, it was pretty easy to start profiling the clusters. Working back with the binary data, we can simply add up all the 1's for each feature within each cluster. The features with higher sums carry more weight in describing that cluster. Below is an example of the top 3 most prominent features for the cluster profiled as "Encounters under the influence":
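The profiling step amounts to a group-by sum over the binary columns; a small sketch with made-up rows (the real data has 39 action columns):

```python
# Profiling sketch: sum the binary action flags within each cluster.
# The rows and cluster labels below are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "Impaired":    [1, 1, 0, 0, 1],
    "Pushing":     [0, 0, 1, 1, 0],
    "Threw Items": [0, 1, 0, 1, 0],
    "cluster":     [0, 0, 1, 1, 0],
})

# Feature sums per cluster; larger sums describe the cluster better.
profile = df.groupby("cluster").sum()
top = profile.loc[0].sort_values(ascending=False)  # cluster 0's top features
```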

The final clusters, profiled as in the example above, look like this! Each case is labeled with one of these clusters, and the suspect in each case is assumed to take on that profile.





Before I could start any supervised learning, I first needed a feature to guide the learning. To reiterate, the goal is to classify whether a case is intimate or not, since it was found that not all cases had been classified correctly.



The National Network for Safe Communities defines "intimacy" as any relationship that is currently or was formerly intimate. This includes marriage, dating, or having a child together. Using various columns in the data set, I engineered the supervising feature "Intimacy" by defining a python function:
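The original function isn't reproduced here; below is a hypothetical sketch of what such a labeling rule might look like. The column name `relationship` and the relationship categories are assumptions, not the actual fields in the DA data:

```python
# Hypothetical "Intimacy" labeling rule; the real column names and
# categories in the DA data set are not shown here.
def label_intimacy(row):
    intimate_relationships = {
        "spouse", "ex-spouse", "boyfriend", "girlfriend",
        "ex-boyfriend", "ex-girlfriend", "child-in-common",
    }
    relationship = str(row.get("relationship", "")).lower()
    return "Intimate" if relationship in intimate_relationships else "Not Intimate"
```

Applied row by row (e.g. via `DataFrame.apply`), this produces the "Intimate"/"Not Intimate" label used as the supervising target below.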



Now that each case was labeled ("Intimate"/"Not Intimate"), I needed to build a classification model. I decided on a Multinomial Naive Bayes classifier over Logistic Regression for several reasons:

  • Given a smaller training size, the generative Naive Bayes model will outperform the discriminative Logistic Regression model, as shown by Andrew Ng in this paper
  • Naive Bayes and its assumption of independent features make the model simpler, more general, and thus lower in variance
  • Naive Bayes is famous for its use with text data and spam email detection.

The last point was added because, interestingly, the data set contains a text narrative from the officer who was on the scene. Can we detect intimacy for a case given the words an officer chose in a narrative? I decided to find out.



In order to run the Naive Bayes classifier, I needed to clean the text data first. Using the NLTK module in Python, the following steps were taken:

  1. Tokenizing the narrative (using RegexpTokenizer)
  2. Removing stopwords (using stopwords)
  3. Applying lemmatization (using WordNetLemmatizer)

The resulting narratives look like this thus far:

After processing the text, I transformed the tokenized and lemmatized narratives into a matrix where each word is its own feature and each row is a narrative, or "document". This is known as a document-term matrix.



The document-term matrix had 3,805 features. To reduce this, and so improve the model's accuracy, words that are not frequent enough were eliminated. Using a validation set, I concluded that eliminating any word that does not appear at least twice would suffice. This lowered the dimensionality to 1,861 features.



The Multinomial Naive Bayes classifier sets alpha to 1 by default. Alpha is a smoothing parameter that handles words appearing in the hold-out set that were never seen during training. Using GridSearchCV, alpha was tuned to 2.53.
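The tuning step can be sketched as below; the stand-in count data and the grid of alpha values are illustrative, not the original ones:

```python
# Sketch of tuning Multinomial NB's alpha with GridSearchCV.
# X/y are random stand-ins for the word counts and intimacy labels.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(120, 30))  # stand-in word-count features
y = rng.integers(0, 2, size=120)        # stand-in binary labels

# Cross-validated search over candidate smoothing values.
grid = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": np.linspace(0.5, 5.0, 10)},
    cv=5,
)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```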



After splitting the data into a train and test set and refitting the Multinomial Naive Bayes with alpha = 2.53, the results were surprisingly good.

  • Training Accuracy: 84%
  • Test Accuracy: 81%

The confusion matrix is below:



In conclusion, the model's true positive rate, aka sensitivity, is 80.4% (82/(82+20)), and its true negative rate, aka specificity, is 82.4% (14/(3+14)). Both are well above what I expected, since the narratives are written by many different police officers, each with their own writing style. But the model shows that certain key words, regardless of the officer, are used more frequently in intimate cases. This was key to the model's performance.
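The two rates follow directly from the confusion-matrix counts quoted in those fractions:

```python
# Recomputing the reported rates from the confusion-matrix counts
# quoted above: 82 true positives, 20 false negatives,
# 14 true negatives, 3 false positives.
tp, fn, tn, fp = 82, 20, 14, 3

sensitivity = tp / (tp + fn)  # true positive rate: 82/102
specificity = tn / (tn + fp)  # true negative rate: 14/17

print(round(sensitivity, 3))  # 0.804
print(round(specificity, 3))  # 0.824
```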

Hopefully, the models above can be put to great use for our community. Clustering is very useful, especially when we do not have enough data to label our cases. It gives a quick understanding of what kind of people our suspects are (I would actually argue for calling them patients). And given the fortunate situation where we do have enough data, the Naive Bayes classifier can greatly reduce the time it takes to filter out potential patients for a treatment program. Identifying early intervention opportunities to reduce violence in our community is just one of the many rewarding things data science can do. Can't wait to see what else is out there!

About Author

Kenny Moy

Kenny has years of experience providing data driven solutions in industries such as marketing, healthcare, real estate, and public service. In addition to machine learning, he loves the AHA! moments, storytelling, and the creativity aspects of data science.
