Studying Data on Violent Behavior From Criminal Records
INTRO
It's nice to know that there are applications of data science that go beyond business settings and corporate profits. A recent pro bono project I took on for the National Network for Safe Communities let me experience firsthand how machine learning methods can serve our community. The network, a research arm of New York's John Jay College of Criminal Justice, shared data provided by the district attorney of a city I cannot name due to a non-disclosure agreement.
The focus was on intimate partner violence and on providing outreach programs for such cases. The problem: with so many different case records, fishing out the relevant ones by hand is inefficient, so the goal was to develop an easier way to find them.
UNSUPERVISED LEARNING
The data spans the years 2015 to 2017. The DA (district attorney) data contains features that describe the details of previous cases, for example victim/suspect names, location of the crime, suspect actions, etc. In its raw form, the data does not have labels that fit the client's exact definition of 'intimate,' but there are columns that can indicate it.
However, since such detailed information is given about these cases, I decided to start with an unsupervised learning method to summarize the data as best I could. I performed clustering on the cases based on all the recorded suspect actions, with the goal of grouping cases by how similarly the suspects behaved. If behavioral profiles can be created, then we may be able to more effectively assign various outreach programs to suspects based on which cluster they belong to.
PCA DATA TRANSFORMATION
There are 39 features describing suspect behaviors to work with. These include actions like "Impaired", "Pushing" and "Threw Items". The only catch is that these are binary features, and clustering binary data can be tricky: common distance measures such as Euclidean distance are not very meaningful on 0/1 columns.
After multiple failed experiments with hierarchical clustering and various dissimilarity measures, I found that applying K-means clustering AFTER transforming the variables with Principal Component Analysis (PCA) resulted in very interpretable clusters.
The overall process can be summarized in the diagram below:
Normally we think of PCA as a means of dimension reduction: we pick the number of principal components that explain enough of the variance. Such a plot for our data would look like this:
However, our goal isn't to reduce the number of features we have; it is to transform them into numerical data we can cluster. To do this, we keep ALL 39 principal component scores (the data projected onto the eigenvectors of the covariance matrix), retaining 100% of our original variance, and cluster on them.
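This transformation can be sketched in a few lines with scikit-learn. Since the DA data is confidential, the snippet below uses a random binary stand-in for the 39 behavior columns; everything else mirrors the step described above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
behaviors = rng.integers(0, 2, size=(500, 39))  # 0/1 stand-in for the real cases

pca = PCA(n_components=None)              # None = keep all 39 components
pc_scores = pca.fit_transform(behaviors)  # numerical scores to cluster on

print(pca.explained_variance_ratio_.cumsum()[-1])  # ~1.0, i.e. 100% of variance
```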
DATA CLUSTERING
The objective function in K-means clustering minimizes the within-cluster variation. Looking at the scree (elbow) plot of that variation against the number of clusters, 5 or 6 clusters seemed about right. After experimenting with both, I concluded that 5 clusters were the most interpretable.
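Continuing from the PCA sketch above, the elbow check and the final fit might look like this (the range of k values tried is an assumption):

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pc_scores)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster variation (inertia)")
plt.show()

# Refit with the 5 clusters settled on above
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pc_scores)
```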
CLUSTER PROFILES
After assigning each observation its cluster label and matching the labels back to the original data set, it was pretty easy to start profiling the clusters. Working back with the binary data, we can simply add up all the 1's for each feature within each cluster; the features with higher sums carry more weight in describing that cluster. Below is an example of the top 3 most prominent features for the cluster profiled as "Encounters under the influence":
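Those per-cluster sums amount to a simple groupby; continuing with the stand-in names from the sketches above:

```python
import pandas as pd

# Sum each binary behavior feature within each cluster
profile = pd.DataFrame(behaviors).assign(cluster=labels).groupby("cluster").sum()

# The 3 most prominent behaviors per cluster
for cluster_id, row in profile.iterrows():
    print(cluster_id, row.nlargest(3).index.tolist())
```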
The final clusters, profiled as in the example above, look like this! Each case is labeled with one of these clusters, and the suspect in each case is assumed to take on the corresponding profile.
SUPERVISED LEARNING
Before I could start any supervised learning, I first needed a feature to guide the machine's learning. To reiterate, the goal is to classify whether or not a case is intimate, since it was found that not all cases had been classified correctly.
FEATURE ENGINEERING - TAGGING INTIMACY
The National Network for Safe Communities defines "intimacy" as any relationship that is currently or was formerly intimate. This includes marriage, dating, or having a child together. Using various columns in the data set, I engineered the supervising feature "Intimacy" by defining a Python function:
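The original function appeared as an image in the post, so what follows is only a hypothetical reconstruction; the column names (relationship, child_in_common) and the relationship values are assumptions, not the real schema.

```python
# Hypothetical reconstruction -- the real column names and values are not public
INTIMATE_RELATIONSHIPS = {
    "spouse", "ex-spouse", "boyfriend", "girlfriend",
    "ex-boyfriend", "ex-girlfriend", "dating",
}

def tag_intimacy(row):
    """Label a case 'Intimate' if any column indicates a current or
    former intimate relationship, per the NNSC definition."""
    if str(row.get("relationship", "")).lower() in INTIMATE_RELATIONSHIPS:
        return "Intimate"
    if row.get("child_in_common") == 1:  # having a child together counts
        return "Intimate"
    return "Not Intimate"

# df["Intimacy"] = df.apply(tag_intimacy, axis=1)  # df: the case DataFrame
```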
CHOOSING A MODEL
Now that each case was labeled ("Intimate"/"Not Intimate"), I needed to build a classification model. I decided on a Multinomial Naive Bayes classifier over Logistic Regression for several reasons:
- Given a smaller training size, the generative Naive Bayes model can outperform the discriminative Logistic Regression model, as shown by Andrew Ng and Michael Jordan in their paper "On Discriminative vs. Generative Classifiers"
- Naive Bayes and its assumption of conditionally independent features make the model simpler, more general, and thus lower in variance
- Naive Bayes is famous for its use with text data and spam email detection.
The last point was added because, interestingly, the data set contains a text narrative written by the officer who was on the scene. Can we detect intimacy for a case from the words an officer chooses in the narrative? I decided to find out.
NATURAL LANGUAGE PROCESSING
To run the Naive Bayes classifier, I needed to clean the text data first. Using the NLTK module in Python, the following steps were taken (a sketch follows the list):
- Tokenizing the narrative (using RegexpTokenizer)
- Removing stopwords (using the stopwords corpus)
- Applying lemmatization (using WordNetLemmatizer)
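A minimal sketch of those three steps (lower-casing and the exact tokenizer pattern are assumptions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

nltk.download("stopwords")
nltk.download("wordnet")

tokenizer = RegexpTokenizer(r"\w+")           # 1. tokenize on word characters
stop_words = set(stopwords.words("english"))  # 2. standard English stopwords
lemmatizer = WordNetLemmatizer()              # 3. reduce words to their lemmas

def clean_narrative(text):
    tokens = tokenizer.tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(clean_narrative("The officers responded to a dispute between partners."))
# ['officer', 'responded', 'dispute', 'partner']
```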
The resulting narratives look like this:
After processing the text, I transformed the tokenized and lemmatized narratives into a matrix where each word is its own feature and each row is a narrative, or "document." This is known as a document-term matrix.
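Scikit-learn's CountVectorizer is one straightforward way to build that matrix; the two toy narratives below stand in for the real, confidential ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-ins for the cleaned narratives, re-joined into strings
narratives = [
    "officer responded dispute partner apartment",
    "suspect threw item victim apartment",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(narratives)  # rows: documents, columns: words
print(vectorizer.get_feature_names_out())   # each word is its own feature
```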
DIMENSION REDUCTION
The document-term matrix resulted in 3,805 features. To reduce this number and improve the model's accuracy, words that were not frequent enough were eliminated. Using a validation set, I concluded that dropping any word that does not appear at least twice would suffice. This alone lowered the dimensionality to 1,861 features.
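Assuming the cutoff was applied at vectorization time on document frequency, CountVectorizer's min_df argument is one way to express it:

```python
# min_df=2 drops any word appearing in fewer than two narratives;
# on the real data this cut the features from 3,805 down to 1,861
vectorizer = CountVectorizer(min_df=2)
dtm = vectorizer.fit_transform(narratives)
```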
TUNING MULTINOMIAL NAIVE BAYES
The Multinomial Naive Bayes classifier sets alpha to 1 by default. Alpha is an additive smoothing parameter that handles words that appear in a held-out set but were never seen during training. Using GridSearchCV, alpha was tuned to 2.53.
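A sketch of that search, assuming y holds the engineered Intimacy labels and dtm is the document-term matrix from above (the candidate grid is an assumption):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# y: the 'Intimate'/'Not Intimate' labels from the feature-engineering step
grid = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": np.arange(0.01, 5.01, 0.01)},  # grid includes 2.53
    cv=5,
)
grid.fit(dtm, y)
print(grid.best_params_)  # the post reports alpha = 2.53
```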
MODEL EVALUATION
After splitting the data into a train and test set and refitting the Multinomial Naive Bayes with alpha = 2.53, the results were surprisingly good (a sketch of this step follows the numbers below).
- Training Accuracy: 84%
- Test Accuracy: 81%
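The evaluation can be sketched the same way; the 75/25 split ratio and random seed are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    dtm, y, test_size=0.25, random_state=0)

model = MultinomialNB(alpha=2.53).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~0.84 in the post
print("test accuracy:", model.score(X_test, y_test))     # ~0.81 in the post
```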
The confusion matrix on the test set is below (reconstructed from the counts cited in the conclusion):

|                     | Predicted Intimate | Predicted Not Intimate |
|---------------------|--------------------|------------------------|
| Actual Intimate     | 82                 | 20                     |
| Actual Not Intimate | 3                  | 14                     |
CONCLUSION
In conclusion, the model's true-positive rate, a.k.a. sensitivity, is 80.4% (82/(82+20)), and its true-negative rate, a.k.a. specificity, is 82.4% (14/(3+14)). Both are well above what I expected, since the narratives are written by many different police officers, each with their own writing style. But the model shows that certain key words, regardless of the officer, appear noticeably more often in intimate cases. This was key to the model's performance.
Hopefully, the models above can be put to great use for our community. Clustering can be very useful, especially when we do not have enough data to label our cases; it gives a quick understanding of what kinds of profiles our suspects fit (I would actually argue for calling them patients). And given the fortunate situation where we do have enough data, the Naive Bayes classifier can greatly reduce the time it takes to filter potential patients into a treatment program. Identifying early intervention opportunities to reduce violence in our community is just one of the many rewarding things data science can do. I can't wait to see what else is out there!