Improving Home Depot Search Relevance

Contributed by Amy (Yujing) Ma, Brett Amdur, and Christopher Redino. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from January 11th to April 1st, 2016. This post is based on their machine learning project (due in the 8th week of the program).

Given only raw text as input, our goal was to predict the relevance of products to search queries on the Home Depot website. Our strategy was a little different from that of most other teams in this Kaggle competition: we built a workflow that starts with text cleaning, passes through feature engineering, and ends with model selection and parameter tuning, in an attempt to stand out among thousands of competitors.

Feature Engineering

One interesting aspect of this project was that "feature engineering" here was essentially equivalent to "feature creation." That's because the data set that Home Depot provided contained no actual features that we could use as inputs to a model. Instead, our task was to take the data provided (search queries and product titles/descriptions/attributes) and use that data to derive all the features to use as predictors.

From the very beginning of the feature engineering process, our primary challenge was relatively clear: fix the upper left problem. The upper left problem refers to a recurring issue: any single feature we used as a predictor during our simple exploratory analysis performed reasonably well at higher values, but abysmally at lower values. In other words, the upper left of a correlation plot was always too heavily populated. A word match plot illustrates the point: using training set data, the x axis shows the number of words in an observation's search term that match the product title, and the y axis shows the associated relevance score for that observation. It is not surprising that higher match counts generate higher relevance scores. What might be surprising is that the opposite is not true: lower match counts were just as likely to generate high relevance scores as low ones. We surmised that success in this competition might depend on our ability to find features (or sets of features) that didn't have such wide dispersion in their outputs at lower values of the feature.
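
To make the discussion concrete, here is a minimal sketch of how a word match feature of this kind can be computed; the column names and sample rows are illustrative assumptions, not our actual code.

```python
import pandas as pd

# Illustrative sketch of a word-match feature like the one plotted above.
# Column names follow the Kaggle data; the rows here are just examples.
def word_match_count(search_term, title):
    """Count search-term words that also appear in the product title."""
    title_words = set(title.lower().split())
    return sum(1 for w in search_term.lower().split() if w in title_words)

train = pd.DataFrame({
    "search_term": ["angle bracket", "deck over", "rain shower head"],
    "product_title": ["Simpson Strong-Tie 12-Gauge Angle",
                      "BEHR Premium Textured DeckOver",
                      "Delta Single-Setting Rain Shower Head"],
    "relevance": [3.0, 2.5, 2.33],
})
train["word_match"] = train.apply(
    lambda r: word_match_count(r["search_term"], r["product_title"]), axis=1)
print(train[["word_match", "relevance"]])
```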


Ultimately, the features we fed into our model fell into four categories. "Direct Match" features are relatively straightforward: they track "hits," meaning search term words and phrases that matched words in the target variables (i.e., title, description, and brand names). "Ratio" features use the percentage of search words that are hits (in, for example, the product description), and "Length" features refer to the number of words in the variable's content.
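
A minimal sketch of the first three categories, assuming simple whitespace tokenization (the function names here are illustrative, not the project's actual identifiers):

```python
def direct_match(query, text):
    """Direct Match: count query words that appear in the target text."""
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)

def match_ratio(query, text):
    """Ratio: fraction of query words that are hits."""
    n = len(query.split())
    return direct_match(query, text) / n if n else 0.0

def length_feature(text):
    """Length: number of words in the variable's content."""
    return len(text.split())

query = "rain shower head"
description = "Single-setting rain shower head with chrome finish"
print(direct_match(query, description))   # 3
print(match_ratio(query, description))    # 1.0
print(length_feature(description))        # 7
```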

The last category of features is probably worth some explanation. Certain features we designed were related only to data in the training set, and were therefore "disconnected" from the test set. For example, we devised a methodology for assigning a "word power" score to words contained in search queries. Specifically, for every word in a training set search term (after the data cleansing performed in the first phase, of course), we looked at the average relevance score of the observations in which it appeared. This allowed us to create a dictionary with search words as keys and word power scores as values. We then applied this dictionary to the test set: each word in a test set search query received the word power score learned from the training set, and the sum of these word scores became the word power score for that test set search.
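
A minimal sketch of the word power idea, using toy training data (the queries and scores below are illustrative assumptions):

```python
from collections import defaultdict

# Build the word power dictionary: for each word seen in training-set
# search terms, average the relevance of the observations it appears in.
train = [("angle bracket", 3.0), ("metal bracket", 2.5), ("angle grinder", 2.0)]

sums, counts = defaultdict(float), defaultdict(int)
for query, relevance in train:
    for word in set(query.lower().split()):
        sums[word] += relevance
        counts[word] += 1
word_power = {w: sums[w] / counts[w] for w in sums}

def query_power(query):
    """Sum of word power scores over the words of a test-set query."""
    return sum(word_power.get(w, 0.0) for w in query.lower().split())

print(word_power["bracket"])         # (3.0 + 2.5) / 2 = 2.75
print(query_power("angle bracket"))  # 2.5 + 2.75 = 5.25
```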

One last point about our approach to feature engineering might be worth noting. We used R's tm package, but not for the tf-idf (term frequency - inverse document frequency) calculations for which it is often used. Instead, we found it to be an efficient tool for performing word lookups for word score calculations. Its document term matrix provided a convenient (and relatively fast) way to identify the words in the search term dictionary that also appeared in product titles. From there, it was a straightforward process to calculate the sum of word scores for each observation.
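
We used R's tm for this step; purely as an illustration of the same idea, here is a Python analogue that uses scikit-learn's CountVectorizer to build a document term matrix over the search word dictionary and sums per-word scores with a matrix product (titles and scores below are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

titles = ["simpson strong-tie angle bracket",
          "behr premium textured deckover",
          "delta rain shower head"]
word_power = {"angle": 2.5, "bracket": 2.75, "shower": 2.1, "rain": 2.4}

# Restrict the document term matrix to the search word dictionary.
vec = CountVectorizer(vocabulary=sorted(word_power))
dtm = vec.transform(titles)                       # docs x dictionary words
scores = np.array([word_power[w] for w in vec.get_feature_names_out()])

# Word power score per title: word counts times per-word scores.
print(dtm @ scores)   # [5.25  0.    4.5 ]
```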

(Slide deck: our Kaggle presentation, originally embedded here via SlideShare.)

The Python 3 code for our best model is shown below:
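
What follows is a minimal sketch of the modeling stage; the estimator choice (gradient boosting), the parameter grid, and the synthetic data are all illustrative assumptions, not our tuned submission.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in for the engineered feature matrix (match counts, ratios,
# lengths, word power); in practice these come from the steps above.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "title_match": rng.integers(0, 5, 200),
    "desc_ratio": rng.random(200),
    "query_length": rng.integers(1, 8, 200),
    "word_power": rng.random(200) * 3,
})
y = rng.uniform(1, 3, 200)   # relevance scores fall in [1, 3]

# Grid search over a small illustrative parameter grid, scored by RMSE
# (the competition's evaluation metric).
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```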

About Authors

Brett Amdur

Brett has spent his career at the intersection of technology, analytics, business and law. As a Fellow at NYC Data Science Academy, he is applying this diverse experience to helping organizations maximize the impact of data driven decisions....

Christopher Redino

The common thread through all of Christopher's endeavors is his love of problem solving, with his usual methods being analytical and computational in nature. Having learned coding at an early age, Christopher picks up new programming languages quickly...
