Kaggle Competition project: Grasp-and-Lift EEG

Posted on Aug 28, 2015

Authors: Eszter Schoell, Teresa Venezia, Dani Ismail, Joseph Russo

Team: Eszter Schoell, Teresa Venezia, Dani Ismail, Alejandra Jaimes, and Joseph Russo are Data Science Fellows of Data Science Bootcamp #2 (B002), Data Science, Data Mining and Machine Learning, held from June 1st to August 24th, 2015. Teachers: Andrew, Bryan, Jason, Sam, and Vivian. This post is based on the Kaggle competition team project submitted on behalf of Eszter, Teresa, Dani, Joseph, and Alejandra.

Machine Learning with Brain-Wave Patterns

An EEG (electroencephalogram) is a non-invasive method for recording electrical activity in the brain. The challenge of the Grasp-and-Lift EEG Detection Kaggle project (https://www.kaggle.com/c/grasp-and-lift-eeg-detection) was to build a model that identifies when a hand is grasping, lifting, and replacing an object, using EEG data recorded from healthy subjects as they performed these activities. Our team was attracted to the broader goal of the competition's sponsor, the WAY Consortium: to better understand the relationship between EEG signals and hand movements in order to develop a BCI (brain-computer interface) device that would give patients with neurological disabilities the ability to move through the world with greater autonomy.

Exploring the EEG Data

The data was collected from a study of 12 right-handed participants between the ages of 19 and 35. An EEG cap with 32 electrodes was placed on each subject's head, and signals were collected while the subject performed a series of grasp-and-lift trials. Between trials, the object's weight, surface friction, or both were changed. The subject's task in each trial was to perform these sequential steps on an object: (1) reach for it, (2) grasp it with the thumb and index finger, (3) lift it, (4) hold it for a couple of seconds, (5) place it back on the support surface, and (6) release it, before returning the hand to a designated rest position.

The objective of this Kaggle competition was to detect, from the EEG data, the following 6 events that occurred during the grasp-and-lift tasks: (1) Hand Start, (2) First Digit Touch, (3) Both Start Load Phase, (4) Lift Off, (5) Replace, and (6) Both Released. For each subject, EEG data was recorded for 10 series of trials, with approximately 30 trials within each series. Each observation was given a unique ID composed of the subject, series, and frame. Each frame was 0.002 seconds (2 ms). The training set contained the first 8 series (1–8) for each subject, in data files and events files respectively, totaling 17,985,754 frames or observations. The test set contained the last two series (9–10) and totaled 3,144,171 frames. To illustrate, Figure 1 below plots one channel (electrode Fp1) for one subject and one trial of the grasp-and-lift task:


Figure 1. Plotting Channel Fp1: Subject 1, Series 1, Frames 0 – 5000

The 6 events were provided for the training set as 6 columns with labels of either zero or one, depending on whether the corresponding event occurred within ±150 ms (±75 frames). The events for the test set were not provided and had to be predicted. For this challenge, a perfect submission would predict a probability of one for the entire event window.
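The ±150 ms labeling scheme can be sketched as follows; `label_window` is a hypothetical helper (not from the competition code) that marks the frames around a single event:

```python
import numpy as np

def label_window(event_frame, n_frames, half_width=75):
    """Mark the frames within +/-75 frames (+/-150 ms at 2 ms/frame) of one event."""
    labels = np.zeros(n_frames, dtype=int)
    lo = max(0, event_frame - half_width)
    hi = min(n_frames, event_frame + half_width + 1)
    labels[lo:hi] = 1
    return labels

# An event at frame 500 yields a window of 151 labeled frames (425..575)
labels = label_window(event_frame=500, n_frames=1000)
```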

Preprocessing the Data

Given that deciphering signals to characterize brain activity requires expertise in signal processing, we searched the Kaggle forum to gain a better understanding of how to preprocess and analyze this type of data. Most helpful were the Python scripts authored by Alexandre Barachant, which we adjusted for our purposes to extract the most important features for classification. First, we normalized the data per series to remove series-related effects. Next, because EEG signals are noisy, we needed to consider the best frequency band and channel selection (spatial filter) for classification. With respect to bandpass filtering, since our data involved hand movements, we used a [7, 30] Hz Butterworth bandpass filter. To do this, we installed the mne package used for EEG processing in Python. Figure 2 below illustrates pre- and post-filtered data for one channel.


Figure 2. Pre- and Post-Filtering [one channel before and after the Butterworth filter]
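A minimal sketch of the bandpass step, using scipy.signal directly rather than the mne helpers we used; the 500 Hz sampling rate follows from the 2 ms frames:

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_filter(data, low=7.0, high=30.0, fs=500.0, order=4):
    # Normalize cutoffs by the Nyquist frequency; fs = 500 Hz since
    # each frame is 2 ms
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return lfilter(b, a, data)

# Filter 2 seconds of a synthetic one-channel signal
signal = np.random.randn(1000)
filtered = bandpass_filter(signal)
```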

Based on the forum discussions and our external research, we then focused on common spatial pattern (CSP) filtering. Spatial filtering algorithms are methods that combine several channels into a single one. The goal of CSP is to design spatial filters that result in optimal discriminatory power between two populations. When applied to our data, filtering with CSP produced a total of 4 features from our 32 EEG channels. Finally, we preprocessed both the training data and test data in the same way (normalize per series, extract frequencies between 7 and 30 Hz, and apply CSP filtering to extract 4 features) in order to run machine learning algorithms and predict the probabilities of the events.
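For illustration, the CSP idea can be sketched from scratch as a generalized eigendecomposition of class covariance matrices. This simplified two-class version (our actual pipeline followed Barachant's scripts) takes hypothetical trials from two conditions and returns 4 spatial filters over 32 channels:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X_a, X_b, n_components=4):
    """Spatial filters maximizing variance for class a while minimizing it for class b.

    X_a, X_b: (n_trials, n_channels, n_samples) arrays of EEG trials.
    Returns an (n_components, n_channels) filter matrix.
    """
    mean_cov = lambda X: np.mean([np.cov(trial) for trial in X], axis=0)
    C_a, C_b = mean_cov(X_a), mean_cov(X_b)
    # Generalized eigenvalue problem: C_a w = lambda (C_a + C_b) w
    vals, vecs = eigh(C_a, C_a + C_b)
    # Keep filters from both ends of the eigenvalue spectrum
    # (the most discriminative directions for the two classes)
    order = np.argsort(vals)
    picks = np.concatenate([order[:n_components // 2], order[-(n_components // 2):]])
    return vecs[:, picks].T

rng = np.random.default_rng(0)
X_a = rng.standard_normal((20, 32, 256))   # hypothetical trials, class a
X_b = rng.standard_normal((20, 32, 256))   # hypothetical trials, class b
W = csp_filters(X_a, X_b)                  # 4 filters over 32 channels
features = W @ X_a[0]                      # one trial reduced to 4 virtual channels
```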

Final Model Selection: Deep Learning with Neural Net

Deep Learning

After experimenting with other models (see Earlier Models below), we chose deep learning as our final model, as it had the best AUC (0.91 on the test set). Our initial incentive for choosing deep learning was simply that we were analyzing neural time-series data, and it would be fun to learn. We are heavily indebted to Tim Hochberg and the Python NNet script he shared on the site.

At the core of deep learning is a hierarchical framework of linked (stacked) neural networks. To be deep, a network must have more than one layer (stage) between the input and output layers. Figure 3 presents one possible neural network for a subset of our data (subject 1, CSP-preprocessed data, created using the R packages neuralnet, caret, and e1071). On the far left are the inputs, in this case the 4 CSP features; in the middle are 2 hidden layers with 16 and 8 nodes; and on the far right are the 6 events as outputs. Compare this to the illustration (Figure 4) of a deep learning model with convolution and pooling layers, made by Michael Nielsen (http://neuralnetworksanddeeplearning.com/chap6.html).

Figure 3. Example neural net with 4 inputs, 6 outputs and 2 hidden layers. Bias is in blue.



Figure 4. Visualization of Deep Learning from Michael Nielsen.
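The architecture in Figure 3 can be sketched as a plain NumPy forward pass; the weights here are random placeholders, and the sigmoid outputs stand in for per-event probabilities:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes matching Figure 3: 4 CSP inputs -> 16 -> 8 -> 6 event outputs
sizes = [4, 16, 8, 6]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]  # the bias nodes shown in blue

def forward(x):
    """One forward pass; each sigmoid output is a per-event probability."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

probs = forward(rng.standard_normal(4))  # six values in (0, 1), one per event
```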

One core concept of deep learning is to build complex functions from simple ones: each layer provides a non-linear function, or feature transformation, that enables complex feature generation. This modular hierarchy of composed functions leads to a hierarchy of feature abstraction at each layer. A striking example is provided by deep networks from the ImageNet model [Ref 1, Figure 2]. In this case, when observing the output of the sequential layers, the feature representation is indeed hierarchical: pixel -> edge -> texton (texture unit) -> motif -> part -> object [Ref 2].

Another core concept is that the network should be able to build its own features, or representation of the data, from the incoming data itself, rather than relying on features hand-crafted by individuals for each source of data. For example, scale-invariant feature transform (SIFT) features are hand-crafted low-level features, versus having the network learn similar features, such as edge detectors. This idea stems from the ability of different parts of the animal brain to learn from different stimuli (e.g., the auditory cortex can learn to see), which suggests that there may be just one learning algorithm used by the brain [Ref 3].

Deep neural networks have become the benchmark for performance, and are commercially used by numerous companies including Microsoft, Google, and Facebook [Ref 2]. These deep networks scale to large data sets with millions of examples that require optimization over billions of parameters.

However, training a deep neural network to achieve good performance is not trivial, as there are many variables to consider [Ref 3]:

-Architecture of the network

-Loss function (regression: squared error; classification: cross-entropy)

-Optimization of non-convex functions (unlike convex problems, non-convex functions carry no guarantee of global minimization)

-Initialization of the network

-Supervised fine-tuning (back-propagation)

Our first model was trained on the raw data (all 32 channels), down-sampled by a factor of 10 and run on each individual subject for series 1 through 8. The performance on the test data (series 9 and 10) was an AUC of 0.89. We then tried the pre-processed data (normalized, Butterworth-filtered, CSP features), down-sampled by 10, and the performance dropped to an AUC of 0.53. We therefore went back to the original data and trained on all data for each subject (deep learning does better with more data). The AUC increased to 0.91. However, this run took about 6 hours on a MacBook Pro with 16GB of memory.
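Down-sampling by a factor of 10 can be done with a simple stride slice; the array shape here is hypothetical:

```python
import numpy as np

# Hypothetical preprocessed data: 5000 frames x 4 CSP features
eeg = np.arange(20000, dtype=float).reshape(-1, 4)

# Down-sampling by 10 keeps every 10th frame
downsampled = eeg[::10]
```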

We then wanted to run the deep learning model in PySpark. Since Spark MLlib does not have a deep learning algorithm, we hypothesized that the easiest approach would be to load all the data into an RDD, group by subject (there are 12), and run the Python deep learning script on each subject in parallel; in other words, an 'embarrassingly parallel' problem.

As a first step, we set a Spark context that would use all 4 available cores and 10GB of memory:

conf = SparkConf().setAppName("Simple App").setMaster("local[4]").set('spark.executor.memory','10g')

We then pulled in a csv to create an RDD to be grouped by subject:

# load file as RDD
data = sc.textFile('filename.csv')

# Remove header (column names).
data = data.filter(lambda l: "subject" not in l)

# Write function and run to create key/value pairs, 
# ignore first column which is index strings

def parsePoint(line):
    elements = line.split(",")
    key = elements[1]
    values = elements[2:]
    pairs = (key,values)
    return pairs

modeldata = data.map(parsePoint)

# Group by subject 
groupdata = modeldata.groupByKey()

# Test that the information was as expected
framedata = groupdata.map(lambda x: pd.DataFrame.from_records(x[1]))

# Raises: TypeError: 'ResultIterable' object does not support indexing

As we couldn't check whether our grouping had gone as expected, we tried a different approach: we grouped outside of PySpark and then pulled the data in as an RDD directly with key/value pairs:

# Pull in as a pandas data frame
import pandas as pd
pd_df = pd.read_csv('filename.csv')

# Extract unique keys from the subject column
keys = pd_df.iloc[:, 0].unique()

# Create list of lists to represent key/value pairs
data_ls = []
for i in keys:
    bools = pd_df.iloc[:, 0] == i
    data_ls.append([i, pd_df[bools].values.tolist()])
This creates the new object data_ls, a list of lists, with each sublist pairing a subject with that subject's data. In other words, a key/value mapping as a list, which could then be pulled into Spark as an RDD with 12 partitions:

data_rdd = sc.parallelize(data_ls,12)

The next step will be to push the partitioned RDD through the deep learning model. This requires changing the deep learning Python script to read its input as key/value pairs rather than pulling in CSVs.

Earlier Models and Challenges with the Spark Machine Learning Library (MLlib)

Logistic Regression

We chose logistic regression as it is a common choice for classification problems and would give us a benchmark for comparison with later models. As explained above, we used the filtered CSP data to build and test a model, first with the Python scikit-learn package and then with the Spark machine learning library (MLlib).

Using scikit-learn, we trained a logistic regression model for each subject on a subset of the training data and then tested the model on the test data, checking accuracy by submitting to Kaggle. After some experimentation applying various models to the test data, it became clear that there were some important factors beyond parameter tuning that affected the model's performance. For example, since this was time-series data, it was very important to sample the data sequentially rather than take a completely random sample. Additionally, scores were lower when all of the subjects' data was aggregated to train the model; it was necessary to train a separate model for each subject in order to get acceptable training errors. Ultimately, the scikit-learn model performed reasonably well with an AUC of 0.7. Notably, the AUC of a file submitted to Kaggle with every prediction set to zero was 0.5.
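The per-subject setup can be sketched as follows; the feature array and labels are hypothetical stand-ins for one subject's CSP data and one event column. Note the sequential split, and that Kaggle scored probabilities (for AUC) rather than hard labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical CSP features and one event's labels for a single subject
X = rng.standard_normal((1000, 4))
y = (X[:, 0] > 0.5).astype(int)

# Split sequentially, not randomly: this is time-series data
split = int(0.8 * len(X))
clf = LogisticRegression().fit(X[:split], y[:split])

# predict_proba gives the per-frame event probabilities Kaggle expects
event_probs = clf.predict_proba(X[split:])[:, 1]
```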

We also tried to create a model using Spark with its integrated machine learning library (MLlib). Unfortunately, there are some limitations in the current Spark release that prevented us from submitting results to Kaggle. While it was possible to predict classification labels, classification probabilities, which Kaggle required in order to calculate AUC and evaluate results, will not be available until the next version of Spark. However, we decided to build a logistic model in Spark anyway for our own experience. We are currently able to train a model using Spark, but there is still some debugging to be done, since the model always predicts zero when applied to a set of test data. One possible reason is that Spark returns an error when more than 10,000 rows of training data are used, which is less than 1% of the training data for each subject. This issue has yet to be resolved before we can successfully create a Spark model to compare with scikit-learn.

Support Vector Machine

As a comparison to logistic regression, we decided to also run a support vector machine (SVM) model on the CSP features. Logistic regression is usually the better choice when the data is very noisy; in this case, the data were not particularly noisy, and the separating hyperplanes may not be linear.

As a first step, overlapping classes were removed. The entire training data set over all subjects and series has 17,985,754 data points (time frames). Of those, 478,939 were not uniquely classified (the frame fell within two event windows); this 2.6% was removed from the training sample.
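The removal of overlapping frames can be sketched with a boolean mask over the 6-column event matrix (random placeholder data here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 6-column event-label matrix (one row per frame)
events = rng.integers(0, 2, size=(1000, 6))

# Keep only frames that fall within at most one event window
mask = events.sum(axis=1) <= 1
clean = events[mask]
```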

Using cross-validation with 3 folds, several parameters of the SVM model were tested using the grid search function in the scikit-learn package in Python. Quadratic and cubic separating hyperplanes were tested, as well as the radial basis function (RBF) kernel. The penalty parameter C, which forces the training error toward zero as it increases, was set at 1 and 10. Gamma was tested at 0.001 and 0.0001 for the RBF kernel. In order to run the model on a laptop (RAM = 16GB, SSD = 1TB), a subset of the first subject's data was used (280,000 data points).
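The grid described above can be sketched with scikit-learn's GridSearchCV; the feature matrix and labels are hypothetical stand-ins, and the parameter values are the ones listed in the text:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical CSP features with a binary event label
X = rng.standard_normal((200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Quadratic/cubic polynomial kernels and the RBF kernel, over the
# C and gamma values described above, with 3-fold cross-validation
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
best = search.best_params_
```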

To run this faster, we decided to use Spark MLlib. However, its SVM implementation currently supports only binary classification.

Conclusion: Strong Teamwork

When we embarked on this final project our hope was to use the tools that we learned during the Data Science Bootcamp from beginning to end. We are thrilled to report that we accomplished that goal! A critical ingredient to our success was teamwork. At the outset, we structured our project for optimal transparency. We agreed upon strategies for analyzing the data, visualizations, preprocessing, and machine learning. We developed a project workflow to manage discrete tasks and assign ownership. We created team GitHub, Dropbox, and Slack accounts to share code, documents, images, and most importantly, updates. We met daily to share progress reports, problem solving, and changes to workflow. We used both Python and R to accomplish our tasks. We relied on each other to help debug code and research other solutions. In the end, we were proud of the way we worked creatively, independently and as part of a team, and how we inspired each other to deliver. Go team!


1. Zeiler, M. and Fergus, R. Visualizing and Understanding Convolutional Networks. http://arxiv.org/pdf/1311.2901.pdf

2. LeCun, Y. Deep Learning: The Theoretician's Nightmare or Paradise? (NYU, August 2012)

3. Ng, A. Bay Area Vision Meeting: Unsupervised Feature Learning and Deep Learning.

