Home Depot Kaggle: Feature Engineering Section
[Note: I was part of the three-person NYC Data Science Academy team that participated in the Home Depot Kaggle competition. As of this writing, our group has secured a top 15% finish. I was primarily responsible for the team's feature engineering work. This is the portion of our report that I wrote about feature engineering on the project. Please feel free to also review the team's full report.]
Feature Engineering
One interesting aspect of this project was that "feature engineering" here was essentially equivalent to "feature creation." That's because the data set that Home Depot provided contained no actual features that we could use as inputs to a model. Instead, our task was to take the data provided (search queries and product titles/descriptions/attributes) and use that data to derive all the features to use as predictors.
From the very beginning of the feature engineering process, our primary challenge was relatively clear: fix the upper left problem.
Ultimately, the features we fed into our model fell into four categories, shown at left. "Direct Match"
The last category of features is probably worth some explanation. Certain features we designed were derived only from data in the training set, and were therefore "disconnected" from the test set. For example, we devised a methodology for assigning a "word power" score to words contained in search queries. Specifically, for every word in a training set search term (after the data cleansing performed in the first phase, of course), we computed the average relevancy score across the observations where it appeared. This allowed us to create a dictionary of key-value pairs, with each search word as the key and its score as the value. We then applied this dictionary to the test set. That is, we looked up each word in the test set search queries and assigned it the word power score that word had earned in the training set. We used the sum of these word scores to create a word power score for each search in the test set.
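To make the word power idea concrete, here is a minimal sketch in Python. It is not our team's actual code (we worked in R), and the function names, the toy queries, and the relevancy values are all made up for illustration. It also assumes simple whitespace tokenization, counts a word once per training observation, and gives a score of zero to test set words that never appeared in the training set. The description above does not pin down any of those choices.

```python
from collections import defaultdict


def build_word_power(train_queries, relevance):
    """Map each search word to the mean relevance of the training rows containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for query, score in zip(train_queries, relevance):
        # Assumption: a word counts once per observation, even if repeated in the query.
        for word in set(query.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def query_word_power(query, word_power, default=0.0):
    """Sum the word power scores of the words in one query.

    Assumption: words never seen in the training set contribute `default`.
    """
    return sum(word_power.get(word, default) for word in query.lower().split())


# Toy data, invented for illustration only.
train_queries = ["angle bracket", "deck bracket", "deck screws"]
relevance = [3.0, 2.0, 1.0]

word_power = build_word_power(train_queries, relevance)

# Apply the training-set dictionary to (hypothetical) test set queries.
test_queries = ["deck bracket", "angle grinder"]
test_scores = [query_word_power(q, word_power) for q in test_queries]
```

In this toy example "bracket" appears in two training observations with relevancy 3.0 and 2.0, so its word power is 2.5, and the test query "deck bracket" scores 1.5 + 2.5 = 4.0.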
One last point about our approach to feature engineering might be worth noting. We used R's tm package, but not for the tf-idf (term frequency-inverse document frequency) calculations for which it is often used. Instead, we found it to be an efficient tool for performing the word lookups needed for word score calculations. Its document-term matrix provided a convenient (and relatively fast) way to identify the words in the search term dictionary that also appeared in product titles. From there, it was a straightforward process to calculate the sum of word scores for each observation.
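The lookup can be sketched in Python as well. This is only an illustration of the general document-term matrix idea, not a port of the tm package or of our R code. The `document_term_matrix` helper, the product titles, and the scores are hypothetical, and the sketch assumes that a dictionary word contributes its score once per title in which it appears.

```python
def document_term_matrix(documents, vocabulary):
    """Return one row per document holding term counts for each vocabulary word."""
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for token in doc.lower().split():
            j = index.get(token)
            if j is not None:  # ignore tokens outside the search word dictionary
                matrix[i][j] += 1
    return matrix


# Hypothetical word scores and product titles, for illustration only.
word_power = {"deck": 1.5, "bracket": 2.5, "angle": 3.0}
vocabulary = sorted(word_power)  # fixed column order: angle, bracket, deck
titles = ["Steel Angle Bracket", "Deck Screws 5 lb", "Cedar Deck Post Bracket"]

dtm = document_term_matrix(titles, vocabulary)
weights = [word_power[term] for term in vocabulary]

# Assumption: a dictionary word adds its score once if it appears in the title at all.
title_scores = [
    sum(weight for count, weight in zip(row, weights) if count > 0)
    for row in dtm
]
```

Restricting the matrix columns to the dictionary words keeps it small, and each observation's score becomes a single pass over one row.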