
Kaggle Telematics Competition Write-up March 25

jasano
Posted on Apr 2, 2015

NYC Data Science Academy Kaggle Competition Write-up

Julian Asano

March 25, 2015

This post is meant to showcase some of the work of the students of the first cohort of the NYC Data Science Academy Bootcamp. We competed in the AXA Driver Telematics Analysis Kaggle competition and this is a write-up done by a few of our team members.

Blog Post Authors (in order of appearance):

Julian Asano - Project Management Framework
Timothy Schmeier - Trip Animator
Sylvie Lardeux - Attribute Engineer
Alex Adler - Trip Matching Maestro
Jiayi Liu - Feature Creator and GBM Mechanic

Who might be interested in this blog post:
  • Other AXA Kaggle teams that wish to compare and contrast results and techniques of our team with their own.
  • Potential future Kagglers who want to hear the experiences of a first-time competitor.
  • Potential employers looking for an introduction to the hard work and creative problem-solving skills of my fellow students.

The Telematics Competition Set-up:

The goal of the competition was to develop an algorithm that creates driver-specific signatures from nothing but x-y coordinates stored in a large number of csv's. We were given 2736 directories, corresponding to 2736 different drivers. Each directory contained 200 csv's, one for each trip assigned to that driver.

Each row of a csv corresponds to one second of a driver's trip. To test our driver signature algorithm, the organizers replaced an unknown number of trips in each driver's directory with false trips that were driven by another driver. We don't know which trips were driven by the driver of interest; all we know is that the driver of interest drove the majority of the trips in that folder.

Our goal was to submit a file giving, for each trip, the probability that it belongs to the driver associated with its directory. This took the form of a csv with 200*2736 = 547,200 rows, each pairing a trip with the probability that it was driven by the driver of interest.
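
To make the shape of that submission concrete, here is a minimal sketch in base R of a skeleton with one row per trip. The helper name make_submission and the column names driver_trip and prob are assumptions chosen for illustration, and the constant default probability is only a placeholder for real model output, not our actual method.

```r
# Hypothetical helper: build a submission skeleton with one row per trip.
# The column names `driver_trip` and `prob` are assumptions for illustration.
make_submission <- function(driver_ids, trips_per_driver = 200, default_prob = 1) {
  # expand.grid varies its first argument fastest, so trips cycle within each driver
  grid <- expand.grid(trip = seq_len(trips_per_driver), driver = driver_ids)
  data.frame(
    driver_trip = paste(grid$driver, grid$trip, sep = "_"),
    prob = rep(default_prob, nrow(grid)),  # placeholder until a model fills it in
    stringsAsFactors = FALSE
  )
}

# Three toy drivers instead of the full 2736
submission <- make_submission(driver_ids = c(1, 2, 3))
nrow(submission)
head(submission, 3)
```

With all 2736 driver directories passed in, the same call would yield the 547,200 rows described above, which could then be written out with write.csv(submission, "submission.csv", row.names = FALSE).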

Project Development

My primary goal for this project was to get experience managing a data science project. I wanted to apply an agile project development model to our problem which would eventually iterate through the development cycle multiple times until the project deadline.
[Figure: Traditional Waterfall Development Method vs. Iterative Development Method]
This is in contrast to the waterfall development method, which I believe would not have made good use of our team size and talent, and would not have lent itself to the competitive framework we were constrained by. The waterfall method would also have limited us to our initial vision of the final model without giving us ample room to revise our requirements based upon interim results.

With the iterative model, by contrast, we were able to create multiple working models and take advantage of many opportunities to test them against the public leaderboard (71 submissions, to be exact). It allowed us to redefine basic assumptions late in the game, which bumped our score tremendously in the final week of the competition.

Other Considerations

Git

We were instructed on the Git framework in our first week of the bootcamp. This allowed us to immediately implement a distributed version control system and allowed us to get used to working with our code as a team.

R

Alongside Git, the first two weeks of the bootcamp were an intense introduction to R, and we used R exclusively for the competition. It was the ideal choice for the data manipulation, visualization, and statistical learning methods we needed to implement.

Student Goals

As project manager I had to keep in mind that this was indeed a bootcamp student project. We were not a professional Kaggle team, and as such, I couldn't put demands on team members like I would in a purely professional context. My main goal was to make sure that this was an educational experience for every member of the team.

After that requirement was met, I had to take a laissez-faire approach when it came to dominating the available time of the team members (especially my own time). Being in the middle of an intensive bootcamp experience meant that free time was very hard to come by. Luckily, our team was blessed with some extremely motivated and capable members who you will hear from shortly. They went above and beyond the call of duty and won us all a Kaggle finish we are proud of.

Inception

From my own perspective, I knew that there would be a drastic difference between my old job and my future job. I wasn't exactly sure how guiding a large team of document review attorneys would translate to managing a team of PhD's, but I was sure I couldn't pass up the opportunity to try it in a relatively low-stakes educational environment with a bunch of courageous boot campers.

And so, in the second week of the boot camp, I pitched my idea to the class and offered a position on the team to anyone that wanted to join. I couldn't have ventured a guess as to who would want to join my team after only the second week, but the topic must have sounded interesting because everyone in the room wanted to get involved. What I thought would be a team of 2-5 turned out to be a team of 14.

To keep things manageable, I decided to split the team into two sub-groups. This simplified our weekly team meetings and allowed multiple people to step into similar vital roles within each group, so that everyone was involved near the center of the project. Eventually we merged the teams back together to maximize our performance by utilizing the best techniques and code from both groups.

Meetings

For the first few weekly meetings I placed myself in charge of guiding the conversation and delegating responsibilities for both sub-groups. Over the course of the next few weeks, I noticed that Jason (Jiayi) very naturally fell into a leadership position within his group. His dedication to the project and prior experience with machine learning algorithms made it a very easy decision for me to hand him a leadership role within his group.

This freed up a large portion of my time and it allowed Jason to more easily work on concepts I was not familiar with at the time. I was mostly excited to see what he could accomplish with his group when he was in charge. As you'll see, Jason continued to serve as the lead architect of our final algorithm and managed to tweak the code to give us a finish in the top 9 percent of all the teams in the competition.

In Practice

Our workflow revolved around weekly meetings at which members of each sub-group would first go over the research or code they had been working on for the previous week. The second portion of the meetings involved an open table style brainstorming session to decide on ways to improve our model. The meetings concluded with task assignments for the next week, which were made mainly on a volunteer basis in order to be sensitive to our schedules and needs.

Meetings in the first two weeks consisted of brainstorming wild ideas as well as creating a framework for the simplest model we could come up with. In order to reap the benefits of an iterative project development model, we needed to create a working algorithm as soon as possible. Within a few weeks we were making submissions using the same basic framework that we ended up employing in our final submission.

Benefit

The benefit of this model was that we could send people down possible rabbit holes without it holding up the whole group. As long as we were trying new models and evolving our skill set, we were making progress. We could afford to chase down ideas like Fourier Transformation Matrices and Dynamic Time Warping (no, I didn't just make that up) with minimal risk to the overall success of the project.

In the end, between the multiple trip matching algorithms and Sundar's work integrating Support Vector Machines into an unsupervised problem, I was sure we made the right choice by using the agile development model.

RESULTS

I am very happy to report that the team exceeded our own expectations and posted a very respectable score of .90926, landing us in position #134 out of 1528. We seem to have overfit our model less than our immediate competitors did, since we jumped up 8 places once our score was computed on the true (private) test data. WOOT!

THE CODE

I'll get the code section of our blog posts started with my modest buildData.R script. One problem I wanted to avoid early on was a lack of interoperability. If everyone had their own idea about how to structure their data, it would be difficult to incorporate everyone's code in the end. To keep everyone as consistent as possible, I created a script for the team to use to read in the roughly 547,000 csv's and save them as easily loadable binaries on our individual machines.

It was based upon Lauri Koobas' rebuild-data.R code in the Kaggle Forum. I figured that if we all started using the same foundation it would simplify the process of working with each other's code later on down the line.


# Set the working directory to the folder containing the 'drivers' directory.

library(data.table)

# Use fread from the data.table package to read in one trip's x and y coords,
# then record the trip ID in a new third column.
fread.and.modify <- function(file.number, driver) {
  tmp <- fread(paste0("drivers/", driver, "/", file.number, ".csv"),
               header = TRUE, sep = ",")
  tmp[, tripID := file.number]
  return(tmp)
}

system.time({
  # Pull down the list of driver directories and create a home for the binaries
  driverlist <- list.files("./drivers/")
  dir.create("./data/", showWarnings = TRUE, recursive = FALSE, mode = "0777")

  # Loop through the driver list and use rbindlist to bind the x, y and tripID
  # columns of all 200 trips into one table per driver, saved as a binary
  for (onedriver in driverlist) {
    drives <- rbindlist(lapply(1:200, fread.and.modify, onedriver))
    save(drives, file = paste0("./data/DriverData", onedriver))
  }
})
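
As a rough illustration of how a teammate might build on those binaries, the sketch below computes a per-second speed for each trip. It relies on the fact that each row is one second of a trip, so the distance between consecutive points is the distance covered per second. The helper trip_speed and the toy drives table are assumptions standing in for real data; with the real files, a call such as load("./data/DriverData1") (for a driver directory assumed to be named 1) would restore the drives object instead.

```r
# Hypothetical helper: per-second speed for one trip from its x-y coordinates.
# Each row is one second, so the step length between rows is distance per second.
trip_speed <- function(x, y) {
  sqrt(diff(x)^2 + diff(y)^2)
}

# Toy stand-in for the `drives` table saved by the script above (invented values)
drives <- data.frame(
  x = c(0, 3, 6, 0, 0, 0),
  y = c(0, 4, 8, 0, 5, 10),
  tripID = c(1, 1, 1, 2, 2, 2)
)

# Split the table by trip and summarise each trip's speed profile
speeds <- lapply(split(drives, drives$tripID),
                 function(trip) trip_speed(trip$x, trip$y))
mean_speed <- sapply(speeds, mean)
mean_speed
```

Because every driver's table shares the same x, y and tripID columns, a feature function like this can be applied to any driver's binary without modification, which was the point of agreeing on one data format up front.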

With the preliminaries out of the way, I'd like to guide you to our next blog post, where Timothy Schmeier will tell you about his trip visualization algorithm when it is ready.
