Kaggle Competition project: Grasp-and-Lift EEG
Authors: Eszter Schoell, Teresa Venezia, Dani Ismail, Joseph Russo
Team: Eszter Schoell, Teresa Venezia, Dani Ismail, Alejandra Jaimes and Joseph Russo are Data Science Fellows of the Data Science Bootcamp#2 (B002) – Data Science, Data Mining and Machine Learning – from June 1st to August 24th 2015. Teachers: Andrew, Bryan, Jason, Sam and Vivian. The post is based on the Kaggle Competition team project submitted on behalf of Eszter, Teresa, Dani, Joseph, and Alejandra.
Machine Learning with Brain-Wave Patterns
An EEG (electroencephalogram) is a non-invasive method that displays electrical activity in the brain. The challenge of the Grasp-and-Lift EEG Detection Kaggle project was to build a model to “identify when a hand is grasping, lifting, and replacing an object using EEG data that was taken from healthy subjects as they performed these activities.” (https://www.kaggle.com/c/grasp-and-lift-eeg-detection). Our team was attracted to the broader goal of the competition’s sponsor, WAY Consortium, which was to better understand the relationship between EEG signals and hand movements in order to develop a “BCI [brain computer interface] device that would give patients with neurological disabilities the ability to move through the world with greater autonomy.”
Exploring the EEG Data
The data was collected from a study of 12 right-handed participants between the ages of 19 and 35. An EEG cap with 32 electrodes was placed on each subject’s head, and the signals were collected while the subject performed a series of grasp-and-lift trials. Across these trials, the object’s weight, surface friction, or both were changed. The subject’s task in each trial was to perform these sequential steps on an object: (1) reach for it, (2) grasp it with the thumb and index finger, (3) lift it, (4) hold it for a couple of seconds, (5) place it back on the support surface, and (6) release it, before returning the hand to a designated rest position.
The objective of this Kaggle competition was to detect, from the EEG data, the following 6 events that occurred during the grasp-and-lift tasks: (1) Hand Start, (2) First Digit Touch, (3) Both Start Load Phase, (4) Lift Off, (5) Replace, and (6) Both Released. For each subject, EEG data was recorded for 10 series of trials, with approximately 30 trials in each series. Each observation was given a unique ID composed of the subject, series, and frame. Each frame was 0.002 seconds (2 ms). The training set contained the first 8 series (1-8) for each subject, provided as paired data files and events files, totaling 17,985,754 frames or observations. The test set contained the last two series (9-10) and totaled 3,144,171 frames. To illustrate, Figure 1 below plots one channel (electrode Fp1) for one subject and one trial of the grasp-and-lift task:
The 6 events were provided for the training set as 6 columns with labels of either zero or one, depending on whether the corresponding event occurred within ±150 ms (±75 frames) of that frame. The events for the test set were not provided and had to be predicted. For this challenge, a perfect submission would predict a probability of one for the entire event window.
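The labelling rule can be made concrete with a small sketch. It relies on the 2 ms frame length given above; the helper name `label_event_window` is hypothetical, and treating both ends of the window as inclusive is an assumption, since the competition's own labelling code is not shown here.

```python
FRAME_MS = 2      # each frame lasts 2 ms
WINDOW_MS = 150   # an event is labelled within +/-150 ms of its onset

def label_event_window(n_frames, event_frame, half_width=WINDOW_MS // FRAME_MS):
    """Return one 0/1 label per frame: 1 inside the window around event_frame."""
    start = max(0, event_frame - half_width)
    stop = min(n_frames, event_frame + half_width + 1)  # inclusive end (assumption)
    return [1 if start <= i < stop else 0 for i in range(n_frames)]

# One event at frame 400 of a 1000-frame recording
labels = label_event_window(n_frames=1000, event_frame=400)
print(sum(labels))  # frames labelled 1
```

Under these assumptions a single event marks frames 325 through 475, so a model has roughly 0.3 seconds of positive labels per event to learn from.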
Preprocessing the Data
Deciphering signals that characterize brain activity requires expertise in signal processing, so we searched the Kaggle forum to better understand how to preprocess and analyze this type of data. Most helpful were the Python scripts authored by Alexandre Barachant, which we adapted to extract the features most useful for classification. First, we normalized the data per series to remove series-related effects. Next, because EEG signals are “noisy”, we needed to choose the best frequency band and channel selection (spatial filter) for classification. Since our data involved hand movements, we applied a Butterworth bandpass filter of 7 to 30 Hz. To do this, we installed the mne package, which is used for EEG processing in Python. Figure 2 below illustrates pre- and post-filtered data for one channel.
Based on the forum discussions and our external research, we then focused on common spatial pattern (CSP) filtering. Spatial filtering algorithms combine several channels into a single one. The goal of CSP is to design spatial filters that give optimal discriminatory power between two populations. When applied to our data, filtering with CSP produced a total of 4 features from our 32 EEG channels.
Finally, we preprocessed the training data and the test data in the same way (normalize per series, extract frequencies between 7 and 30 Hz, and apply CSP filtering to extract 4 features) in order to run machine learning algorithms and predict the probabilities of the events.
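The three preprocessing steps can be sketched as follows. This is not the script we adapted: it uses NumPy and SciPy instead of mne, runs on synthetic data, and the filter order, the function names, and the choice of taking two filters from each end of the eigenvalue spectrum are all assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.signal import butter, sosfilt

FS = 500.0  # sampling rate in Hz (one frame every 2 ms)

def standardize(x):
    """Z-score each channel of one series; x has shape (n_frames, n_channels)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def bandpass(x, low=7.0, high=30.0, order=5):
    """Causal Butterworth band-pass along the time axis (order is an assumption)."""
    nyquist = FS / 2
    sos = butter(order, [low / nyquist, high / nyquist], btype="bandpass", output="sos")
    return sosfilt(sos, x, axis=0)

def csp_filters(x_a, x_b, n_filters=4):
    """Spatial filters that maximise variance for one class relative to the other."""
    cov_a = np.cov(x_a, rowvar=False)
    cov_b = np.cov(x_b, rowvar=False)
    # Generalised eigenproblem cov_a w = lambda (cov_a + cov_b) w, eigenvalues ascending
    _, vecs = eigh(cov_a, cov_a + cov_b)
    half = n_filters // 2
    picked = np.hstack([vecs[:, :half], vecs[:, -half:]])  # both extremes discriminate
    return picked.T  # shape (n_filters, n_channels)

# Synthetic demo: 4 s of 8-channel noise; class "a" has extra 15 Hz power on channel 0
rng = np.random.default_rng(0)
t = np.arange(2000) / FS
x_a = rng.normal(size=(2000, 8))
x_a[:, 0] += 3 * np.sin(2 * np.pi * 15 * t)
x_b = rng.normal(size=(2000, 8))

x_a, x_b = (bandpass(standardize(x)) for x in (x_a, x_b))
W = csp_filters(x_a, x_b)
features = x_a @ W.T  # (n_frames, 4) spatially filtered signals
```

On real recordings the two classes would be frames inside and outside an event window, and the filtered signals would typically be turned into power features before classification.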
Final Model Selection: Deep Learning with Neural Net
After experimenting with other models (see below: Earlier Models), we chose 'deep learning' as our final model, as it had the best AUC (0.91 on the test set). We were initially drawn to deep learning simply because we were analyzing neural time-series data and it would be fun to learn. We are heavily indebted to Tim Hochberg and the Python NNet script he shared on the site.
At the core of deep learning is a hierarchical framework of linked (stacked) neural networks. To be “deep”, a network has to have more than one neural net layer (stage) between the input and output layers. Figure 3 presents what a neural network could look like for a subset of our data (subject 1, CSP-preprocessed data, created using the R packages neuralnet, caret and e1071). On the far left are the inputs, in this case the 4 CSP features; in the middle are 2 hidden layers with 16 and 8 nodes; and on the far right are the 6 events as outputs. Please compare this to the illustration (Figure 4) from http://neuralnetworksanddeeplearning.com/chap6.html made by Michael Nielsen of a deep learning model with convolution and pooling layers.
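To make the 4-16-8-6 layout concrete, here is a minimal forward pass in plain Python. It is only a sketch: the weights are random rather than trained, the sigmoid activation and the helper names are assumptions, and it is not the R model behind the figure.

```python
import math
import random

LAYER_SIZES = [4, 16, 8, 6]  # 4 CSP inputs, hidden layers of 16 and 8, 6 event outputs

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate one input vector through every layer in turn."""
    for weights, biases in layers:
        x = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

event_probs = forward([0.2, -1.3, 0.7, 0.05], layers)
print(len(event_probs), n_params)  # 6 270
```

Even this small network has 270 trainable parameters (80 + 136 + 54), which gives a sense of how quickly the count grows once convolution layers and all 32 raw channels are involved.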
One core concept of deep learning is to build complex functions from simple ones: each layer provides a non-linear function, or “feature transformation”, that enables complex feature generation. This modular combination of multiple levels of functions leads to a hierarchy of feature abstraction across the layers. A striking example is provided by deep networks trained on ImageNet [Ref 1, Figure 2]. In this case, when observing the output of the sequential layers, the feature representation is indeed hierarchical: pixel -> edge -> “texton” or texture unit -> motif -> part -> object [Ref 2].
Another core concept is that the network should be able to build its own features, or representation of the data, from the incoming data itself, rather than relying on features hand-crafted by individuals for each source of data. For example, scale-invariant feature transform (SIFT) features are hand-crafted low-level features, whereas a network can learn similar features, such as edge detectors, on its own. This idea stems from the ability of different parts of the animal brain to learn from different stimuli; for example, the auditory cortex can learn to see when visual input is rerouted to it. This suggests that there may be just one learning algorithm used by the brain [Ref 3].
Deep neural networks have become the benchmark for performance, and are commercially used by numerous companies including Microsoft, Google, and Facebook [Ref 2]. These deep networks scale to large data sets with millions of examples that require optimization over billions of parameters.
However, training a deep neural network to achieve good performance is not trivial, as there are many variables to consider [Ref 3]:
-Architecture of the network
-Loss function (regression: squared error, classification: cross-entropy)
-Optimization of non-convex functions
  -Unlike convex problems, non-convex problems offer no guarantee of reaching a global minimum
-Initialization of the network
-Supervised fine-tuning (back-propagation)
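The two loss functions named in the list can be written out in a few lines. This is a generic sketch in plain Python with made-up predictions, not code from the network we trained; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error, the usual regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]  # mostly right
wrong = [0.1, 0.9, 0.1, 0.9]      # confidently wrong

print(binary_cross_entropy(labels, confident), binary_cross_entropy(labels, wrong))
print(squared_error(labels, confident), squared_error(labels, wrong))
```

Cross-entropy punishes confident mistakes far more steeply than squared error does, which is one reason it is the usual choice when the outputs are event probabilities.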
Our first model was trained on the raw data (all 32 channels), down-sampled by a factor of 10, separately for each subject on series 1 through 8. Its performance on the test data (series 9 and 10) was an AUC of 0.89. We then tried the pre-processed data (normalized, Butterworth filtered, CSP features), also down-sampled by a factor of 10, and the AUC dropped to 0.53. We therefore went back to the original data and trained on all of the data for each person (deep learning does better with more data). The AUC increased to 0.91. However, running this took about 6 hours on a MacBook Pro with 16GB memory.
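Down-sampling by a factor of 10 can be as simple as keeping every tenth frame, as in the sketch below. Whether the shared script sliced, averaged, or filtered before decimating is not stated here, so plain slicing is an assumption, and the toy rows are invented.

```python
def downsample(rows, factor=10):
    """Keep every `factor`-th frame, starting with the first."""
    return rows[::factor]

# Toy recording: 1000 frames of (time in seconds, value) with one event window
frames = [[i * 0.002, i % 3] for i in range(1000)]
labels = [1 if 400 <= i <= 550 else 0 for i in range(1000)]

small_frames = downsample(frames)
small_labels = downsample(labels)
print(len(small_frames), sum(small_labels))
```

At 500 frames per second, a factor of 10 leaves 50 frames per second, so features and labels must be sliced identically to stay aligned.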
We next wanted to run the deep learning model in PySpark. Since Spark MLlib does not have a deep learning algorithm, we hypothesized that the easiest approach would be to load all of the data into an RDD, group by person (there are 12), and run the Python deep learning script on each subject in parallel. In other words, this is an 'embarrassingly parallel' problem.
As a first step, we set a Spark context that would use all 4 available cores and 10GB of memory:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("Simple App").setMaster("local[4]").set('spark.executor.memory', '10g')
sc = SparkContext(conf=conf)
We then pulled in a csv to create an RDD to be grouped by subject:
import pandas as pd

# Load file as RDD
data = sc.textFile('filename.csv')

# Remove header (column names)
data = data.filter(lambda l: "subject" not in l)

# Write function and run to create key/value pairs,
# ignore first column which is index strings
def parsePoint(line):
    elements = line.split(",")
    key = elements[1]
    values = elements[2:]
    pairs = (key, values)
    return pairs

modeldata = data.map(parsePoint)

# Group by subject
groupdata = modeldata.groupByKey()

# Test that the information was as expected
framedata = groupdata.map(lambda x: pd.DataFrame.from_records(x))
framedata.take(1)
# TypeError: 'ResultIterable' object does not support indexing
As we couldn't check whether our grouping had gone as expected, we tried a different approach. We grouped the data outside of PySpark and then pulled it in as an RDD directly with key/value pairs:
import pandas as pd

# Pull in as pandas data frame
pd_df = pd.read_csv('filename.csv')

# Extract unique keys from subject column
keys = pd_df.iloc[:, 0].unique()

# Create list of lists to represent key/value pairs
data_ls = []
for i in keys:
    bools = pd_df.iloc[:, 0] == i
    data_ls.append([i, [pd_df.loc[bools].iloc[:, 1:]]])
This creates the new object data_ls, a list of lists in which each sublist holds a subject and that subject's data. In other words, it is a key/value mapping stored as a list, which can then be pulled into Spark as an RDD with 12 partitions:
data_rdd = sc.parallelize(data_ls,12)
The next step will be to push the partitioned RDD through the deep learning model. This requires changing the deep learning Python script to read its input as a key/value data frame rather than pulling in CSV files.
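The per-subject pattern can be sketched without Spark. In the example below, `train_subject` is a hypothetical stand-in for the adapted training script, the toy pairs only mimic the shape of the key/value list, and a thread pool from the standard library plays the role that an RDD `map` would play on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def train_subject(pair):
    """Hypothetical stand-in for the per-subject training script."""
    subject, rows = pair
    # Placeholder "model": the mean of each column; a real run would fit the network here
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    return subject, means

# Toy key/value pairs: (subject, rows of two features) for 12 subjects
pairs = [(s, [[float(s + f), float(s * f)] for f in range(5)]) for s in range(1, 13)]

# Each subject is independent, so the work can be mapped over the pairs
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(train_subject, pairs))

print(results[1])
```

Because the function takes one (key, value) pair and returns one result, the same function could in principle be passed to `map` on the partitioned RDD; threads are used here only to keep the sketch portable.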
Earlier Models and Challenges with Spark Machine Learning Library (MLlib)
We chose logistic regression as it is a common choice for classification problems and would give us a benchmark for comparison with later models. As explained above, we used the filtered CSP data to build and test a model, first in the Python scikit-learn package and then in the Spark machine learning library (Spark MLlib).
Using scikit-learn, we trained a logistic regression model for each subject on a subset of the training data and then tested the model on the test data, checking accuracy by submitting to Kaggle. After some experimentation applying various models to the test data, it became clear that there were some important factors beyond parameter tuning that affected the model’s performance. For example, since this was time series data, it was very important to sample the data sequentially rather than take a completely random sample. Additionally, scores were lower when all of the subjects’ data was aggregated to train the model. It was necessary to train separate models on each subject in order to get acceptable training errors. Ultimately, the scikit-learn model performed reasonably well with an AUC of 0.7. Notably, the AUC of a file submitted to Kaggle with every prediction set to zero was 0.5.
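The 0.5 baseline follows directly from how AUC is defined. The sketch below is a plain-Python, pairwise version written for clarity rather than speed; averaging the AUC over the six event columns reflects our understanding of the competition metric and should be read as an assumption.

```python
def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(true_cols, score_cols):
    """Average the AUC over the event columns."""
    scores = [auc(t, s) for t, s in zip(true_cols, score_cols)]
    return sum(scores) / len(scores)

labels = [0, 0, 1, 1]
print(auc(labels, [0.0, 0.0, 0.0, 0.0]))   # 0.5
print(auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A constant prediction ties every positive with every negative, so it scores exactly 0.5 no matter how rare the events are. This is why AUC is more informative than accuracy for labels that are zero most of the time.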
We also tried to create a model using Spark with its integrated machine learning library (MLlib). Unfortunately, limitations in the current Spark release prevented us from submitting results to Kaggle. While it was possible to predict classification labels, classification probabilities, which is the form Kaggle required in order to calculate AUC and evaluate results, will not be available until the next version of Spark. However, we decided to go ahead and build a logistic regression model in Spark anyway for our own experience. We are currently able to train a model using Spark, but there is still some debugging to be done, since the model always predicts zero when applied to a set of test data. One possible reason is that Spark returns an error when more than 10,000 rows of training data are used, which is less than 1% of the training data for each subject. This issue has yet to be resolved before we can successfully create a Spark model to compare with scikit-learn.
Support Vector Machine
As a comparison to logistic regression, we decided to also run a support vector machine (SVM) model on the CSP features. Logistic regression is usually a better choice when the data is very noisy. In this case, the data were not particularly noisy and the separating boundaries may not be linear.
As a first step, overlapping classes were removed. The entire training data set over all subjects and series has 17,985,754 data points (time frames). Of those, 478,939 were not uniquely classified (the data point fell within the windows of 2 events); these data points, about 2.7% of the total, were removed from the training sample.
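The filtering step can be sketched in plain Python as follows. The helper names are hypothetical, and keeping event-free frames as a separate class 0 is an assumption about how the single-label target was built.

```python
def drop_overlaps(rows):
    """Keep frames with at most one active event; rows are lists of six 0/1 labels."""
    return [r for r in rows if sum(r) <= 1]

def to_class(row):
    """Map a six-column label row to one class: 0 = no event, 1..6 = event index."""
    return row.index(1) + 1 if 1 in row else 0

rows = [
    [0, 0, 0, 0, 0, 0],  # no event
    [1, 0, 0, 0, 0, 0],  # Hand Start only
    [0, 1, 1, 0, 0, 0],  # two overlapping events, dropped
    [0, 0, 0, 0, 0, 1],  # Both Released only
]
kept = drop_overlaps(rows)
classes = [to_class(r) for r in kept]

# Share of training frames removed, from the two counts quoted in the text
removed_share = 478_939 / 17_985_754
print(classes, round(100 * removed_share, 2))
```

Collapsing the six columns into one label is what allows a standard multi-class SVM to be fitted, at the cost of discarding the frames where two event windows overlap.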
Using 3-fold cross-validation, several parameters of the SVM model were tested using the grid search function in the scikit-learn package in Python. Polynomial kernels of degree 2 and 3 (quadratic and cubic decision boundaries) were tested, as well as the radial basis function (RBF) kernel. C, the penalty on misclassified training points (larger values push the training error toward zero), was set at 1 and 10. Gamma was tested at 0.001 and 0.0001 for the RBF kernel. In order to run the model on a laptop (RAM = 16GB, SSD = 1TB), a subset of the first subject's data was used (280,000 data points).
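The search space described above can be written down and counted without fitting anything. The grid below uses the list-of-dicts layout that scikit-learn's grid search accepts; pairing both values of C with every kernel, and leaving gamma at its default for the polynomial kernels, are assumptions, and `expand` is a hypothetical helper.

```python
from itertools import product

# Same layout that scikit-learn's GridSearchCV accepts as param_grid
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": [0.001, 0.0001], "C": [1, 10]},
]

def expand(grid):
    """List every parameter combination described by the grid."""
    combos = []
    for block in grid:
        names = sorted(block)
        for values in product(*(block[n] for n in names)):
            combos.append(dict(zip(names, values)))
    return combos

candidates = expand(param_grid)
n_fits = len(candidates) * 3  # one fit per candidate per cross-validation fold
print(len(candidates), n_fits)
```

With scikit-learn installed, the same `param_grid` could be passed as `GridSearchCV(SVC(), param_grid, cv=3)`. Under these assumptions that is 8 candidates and 24 model fits, which helps explain why only a subset of one subject's data was practical on a laptop.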
In order to run this faster, we decided to use Spark MLlib. However, the package's SVM currently supports only binary classification.
Conclusion: Strong Teamwork
When we embarked on this final project our hope was to use the tools that we learned during the Data Science Bootcamp from beginning to end. We are thrilled to report that we accomplished that goal! A critical ingredient to our success was teamwork. At the outset, we structured our project for optimal transparency. We agreed upon strategies for analyzing the data, visualizations, preprocessing, and machine learning. We developed a project workflow to manage discrete tasks and assign ownership. We created team GitHub, Dropbox, and Slack accounts to share code, documents, images, and most importantly, updates. We met daily to share progress reports, problem solving, and changes to workflow. We used both Python and R to accomplish our tasks. We relied on each other to help debug code and research other solutions. In the end, we were proud of the way we worked creatively, independently and as part of a team, and how we inspired each other to deliver. Go team!
1 Zeiler, M. and Fergus, R. “Visualizing and Understanding Convolutional Networks” http://arxiv.org/pdf/1311.2901.pdf
2 LeCun, Y. “Deep Learning: The Theoretician's Nightmare or Paradise?” (NYU, August 2012)
3 Ng, A. “Unsupervised Feature Learning and Deep Learning” (Bay Area Vision Meeting)