The skills we demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
NYC Data Science Academy Kaggle Competition Write-up
Julian Asano
March 25, 2015
This post is meant to showcase some of the work of the students of the first cohort of the NYC Data Science Academy Bootcamp. We competed in the AXA Driver Telematics Analysis Kaggle competition and this is a write-up done by a few of our team members.
Who might be interested in this blog post:
- Other AXA Kaggle teams that wish to compare and contrast results and techniques of our team with their own.
- Potential future Kagglers who want to hear the experiences of a first time competitor.
- Potential employers looking for an introduction to the hard work and creative problem solving skills of my fellow students.
The Telematics Competition Set-up:
The goal of the competition was to develop an algorithm that creates driver-specific signatures based on nothing but the x-y coordinates contained in a collection of CSV files. We were given 2,736 directories, corresponding to 2,736 different drivers. Each directory contained 200 CSVs, one for each trip assigned to that driver.
Each row of a CSV corresponds to one second of a driver’s trip. To test our driver-signature algorithm, the organizers substituted an unknown number of trips in each driver’s directory with false trips driven by another driver. We don’t know which trips were driven by the driver of interest; all we know is that the driver of interest accounts for the majority of the trips in that folder.
Our goal was to submit a file giving, for each trip, the probability that it belonged to the driver associated with its directory. This took the form of a CSV with 200 trips * 2,736 drivers = 547,200 rows, each containing a trip identifier and the probability that the trip was driven by the driver of interest.
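To make the submission format concrete, here is a minimal sketch in R of what such a file could look like. The driver_trip and prob column names and the flat 0.5 placeholder probabilities are assumptions for illustration, not our actual model output.

# A minimal sketch of the submission layout (assumed column names; 0.5 is a placeholder)
drivers <- list.files("./drivers/")  # one entry per driver directory, e.g. "1", "2", ...
submission <- data.frame(
  driver_trip = unlist(lapply(drivers, function(d) paste(d, 1:200, sep = "_"))),
  prob        = 0.5  # replace with each trip's modeled probability
)
write.csv(submission, "submission.csv", row.names = FALSE)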
Project Development
My primary goal for this project was to gain experience managing a data science project. I wanted to apply an agile project development model to our problem, one that would iterate through the development cycle multiple times until the project deadline.
This is in contrast to the waterfall development method, which I believe would not have appropriately utilized our team’s size and talent and would not have lent itself to the competitive framework we were constrained by. The waterfall method would also have limited us to our initial vision of the final model without giving us ample room to revise our requirements based upon interim results.
With the iterative model, by contrast, we were able to create multiple working models and take advantage of many opportunities to test them on the public leaderboard (71 submissions, to be exact). It also allowed us to redefine basic assumptions late in the game, which bumped our score up tremendously in the final week of the competition.
Other Considerations
Git
We were instructed on the Git framework in our first week of the bootcamp. This allowed us to immediately implement a distributed version control system and allowed us to get used to working with our code as a team.
R
Alongside with Git, the first two weeks of the bootcamp were an intense introduction to R and we used R exclusively for the competition. It was the ideal choice for the data manipulation, visualization, and statistical learning methods we needed to implement.
Student Goals
As project manager I had to keep in mind that this was indeed a bootcamp student project. We were not a professional Kaggle team, and as such, I couldn’t put demands on team members like I would in a purely professional context. My main goal was to make sure that this was an educational experience for every member of the team.
After that requirement was met, I had to take a laissez-faire approach when it came to making demands on team members’ available time (especially my own). Being in the middle of an intensive bootcamp experience meant that free time was very hard to come by. Luckily, our team was blessed with some extremely motivated and capable members, whom you will hear from shortly. They went above and beyond the call of duty and won us all a Kaggle finish we can be proud of.
Inception
From my own perspective, I knew that there would be a drastic difference between my old job and my future job. I wasn’t exactly sure how guiding a large team of document review attorneys would translate to managing a team of PhDs, but I was sure I couldn’t pass up the opportunity to try it in a relatively low-stakes educational environment with a bunch of courageous boot campers.
And so, in the second week of the boot camp, I pitched my idea to the class and offered a position on the team to anyone that wanted to join. I couldn’t have ventured a guess as to who would want to join my team after only the second week, but the topic must have sounded interesting because everyone in the room wanted to get involved. What I thought would be a team of 2-5 turned out to be a team of 14.
In order to keep things manageable I decided to split the team into two sub-groups. This kept our weekly team meetings focused and allowed multiple people to step into similar vital roles within each group, ensuring that everyone was involved near the center of the project. Eventually we merged the teams back together to maximize our performance by combining the best techniques and code from both groups.
Meetings
For the first few weekly meetings I placed myself in charge of guiding the conversation and delegating responsibilities for both sub-groups. Over the course of the next few weeks, I noticed that Jason (Jiayi) very naturally fell into a leadership position within his group. His dedication to the project and prior experience with machine learning algorithms made it a very easy decision for me to hand him a leadership role within his group.
This freed up a large portion of my time and it allowed Jason to more easily work on concepts I was not familiar with at the time. I was mostly excited to see what he could accomplish with his group when he was in charge. As you’ll see, Jason continued to serve as the lead architect of our final algorithm and managed to tweak the code to give us a finish in the top 9 percent of all the teams in the competition.
In Practice
Our workflow revolved around weekly meetings at which members of each sub-group would first go over the research or code they had been working on during the previous week. The second portion of the meetings involved an open-table style brainstorming session to decide on ways to improve our model. The meetings concluded with task assignments for the next week, handled mainly on a volunteer basis in order to be sensitive to everyone’s schedules and needs.
Meetings in the first two weeks consisted of brainstorming wild ideas as well as creating a framework for the simplest model we could come up with. In order to reap the benefits of an iterative project development model, we needed to create a working algorithm as soon as possible. Within a few weeks we were making submissions using the same basic framework that we ended up employing in our final submission.
Benefit
The benefit of this model was that we could send people down possible rabbit holes without it holding up the whole group. As long as we were trying new models and evolving our skill set, we were making progress. We could afford to chase down ideas like Fourier Transformation Matrices and Dynamic Time Warping (no, I didn’t just make that up) with minimal risk to the overall success of the project.
In the end, between the multiple trip matching algorithms and Sundar’s work integrating Support Vector Machines into an unsupervised problem, I was sure we made the right choice by using the agile development model.
RESULTS
I am very happy to report that the team exceeded our own expectations and posted a very respectable score of 0.90926, landing us in position #134 out of 1,528. We seem to have overfit our model less than our immediate competitors, since we jumped up 8 places once our score was computed on the true (private) test data. WOOT!
THE CODE
I’ll get the code section of our blog posts started with my modest buildData.R script. One problem I wanted to avoid early on was a lack of interoperability: if everyone had their own idea about how to structure their data, it would be difficult to incorporate everyone’s code in the end. To keep everyone as consistent as possible, I created a script for the team that reads in the 547,200 CSVs and saves them as easily loadable binaries on our individual machines.
It was based upon Lauri Koobas’ rebuild-data.R code in the Kaggle forum. I figured that if we all started from the same foundation it would simplify the process of working with each other’s code later on down the line.
# Set WD to directory containing the 'drivers' folder.
require(data.table)
# Use fread from the data.table package to read in x and y coords
# Apply trip ID to new third column in data frame
fread.and.modify <- function(file.number, driver) {
  tmp <- fread(paste0("drivers/", driver, "/", file.number, ".csv"),
               header = TRUE, sep = ",")
  tmp[, tripID := file.number]
  return(tmp)
}
system.time({
  # Pull down the list of driver directories and create a home for the binaries
  driverlist <- list.files("./drivers/")
  dir.create("./data/", showWarnings = TRUE, recursive = FALSE, mode = "0777")
  # Loop through the driver list, use rbindlist to combine the 200 trips
  # (x, y, tripID) into one data.table per driver, and save it as a binary
  for (i in 1:length(driverlist)) {
    onedriver <- driverlist[i]
    drives <- rbindlist(lapply(1:200, fread.and.modify, onedriver))
    save(drives, file = paste0("./data/DriverData", onedriver))
  }
})
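As a quick sanity check after running the script, one of the saved binaries can be loaded back in; driver "1" below is just an illustrative ID.

# Example usage (not part of buildData.R): reload one driver's binary
load("./data/DriverData1")  # restores the data.table named 'drives'
head(drives)                # columns: x, y, tripID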
With the preliminaries out of the way, I’d like to guide you to our next blog post where Timothy Schmeier will tell you about his trip visualization algorithm when it is ready.