The goal of our capstone project is to build a clothing recommender system: given a user’s choice of an Instagram fashion blogger’s photo, our algorithm returns clothes with a similar style in a more affordable price range. For example, if a user likes influencer Chiara Ferragni’s style, our recommender system will suggest similar pieces of apparel from selected ecommerce or department stores available on our website (http://fashion-rec.com).
Influencer marketing is a fast-growing industry. An influencer is someone on Instagram who commands a lot of attention (likes, views, comments, number of followers) and who has considerable influence over the shopping choices made by his or her followers. According to Mediakix, advertisers spent around $1.6 billion on Instagram advertising alone.
Many influencers get paid by the brand that they are wearing when they post a photo on Instagram, usually with the post mentioning the brand’s name. This is a common marketing technique deployed by these brands as they expect influencers’ followers to buy their products as well. Our recommender system might seem to work against these marketing campaigns, but in reality, we are simply bringing more choices to Instagram followers/consumers.
We used an agile project pipeline approach to complete our capstone project. Group members trained the machine learning models and developed the Flask application in parallel. Below is the workflow that combines all the components:
- Tops and dresses for women were web scraped from 5 different retailers: Macy’s, Bloomingdale’s, H&M, ASOS, and Fashion Nova. The information we scraped included product title, price, product description, and image.
- We also web scraped likes, comments, and Instagram posts from the past six months for 13 Instagram influencers (weworewhat, alexachung, chiara, songofstyle, fashiondiary, kyliejenner, oliviapalermo, oliviaculpo, gigihadid, helenabordon, juliahengel, heileybaldwin, imjennim).
- We applied the NLP technique Latent Dirichlet Allocation (LDA) to our web scraped product descriptions to group our products into 6 different current market styles.
- We trained our image classification model on the iMaterialist Challenge (Fashion) at FGVC5 dataset, a Kaggle image dataset containing 1,014,544 unbiased clothing images and 228 categorical labels.
- We extracted product images from the 6 clusters derived through LDA, and used our trained image classification model to find products similar to the clothes the influencers wear in their scraped Instagram posts.
- We built a Flask app that provides the user interface for a consumer to browse endlessly through different influencers’ photos and our products from major retailers. The web app is hosted on AWS.
Exploratory Data Analysis (EDA)
Our dataset contains ~10,000 product records from retailers and ~1,000 Instagram posts from influencers. The median product price ranges from ~$20 to ~$150 across the retailers, as shown below.
The majority of product items came from ASOS, chosen for its affordable price range and its diversity of clothing styles.
The total number of Instagram posts collected per influencer over the past six months is shown below. On average, the posting frequency ranges from ~5 posts/month to ~20 posts/month.
The median number of likes per Instagram post reflects the popularity of each influencer. Note: Kylie Jenner, who has ~6 million median likes per post, is left out of the graph below so that the median likes of the other influencers are easier to compare.
The high number of likes for the bloggers in our database indicates that many people are attracted to their outfits. The recommender system will be a good platform for users to discover their favorite blogger’s style and find similar apparel.
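The per-retailer median price mentioned above can be computed with a simple group-and-aggregate step. The snippet below is a minimal sketch under assumptions: the function name `median_price_by_retailer`, the retailer names, and the prices are all hypothetical placeholders, not values from our dataset.

```python
from collections import defaultdict
from statistics import median


def median_price_by_retailer(products):
    """Group (retailer, price) pairs by retailer and return each median price."""
    prices = defaultdict(list)
    for retailer, price in products:
        prices[retailer].append(price)
    return {retailer: median(values) for retailer, values in prices.items()}


# Hypothetical toy records; real rows would come from the scraped product data.
toy_products = [
    ("retailer_a", 10.0),
    ("retailer_a", 20.0),
    ("retailer_a", 30.0),
    ("retailer_b", 100.0),
    ("retailer_b", 200.0),
]
medians = median_price_by_retailer(toy_products)
```

The median is used rather than the mean so that a few very expensive items do not dominate a retailer's summary.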
Machine Learning Models
1. Natural Language Processing (NLP)
NLP analysis was applied to the item descriptions of the web-scraped products to uncover current market clothing trends. The styles discovered by NLP were then used to partition each blogger’s outfits into fashion styles that are currently available in the market.
Prediction-based methods and frequency-based methods are two common families of NLP methods. Frequency-based methods assume that the words in a text are independent of each other and take only the occurrence of words into account. In contrast, prediction-based methods take the co-occurrence of words into consideration, which offers a clear advantage when dealing with text that has strong inter-word relationships.
- Word2Vec and Doc2Vec
We tried both prediction-based and frequency-based methods in our project to evaluate which gives the best result. For the prediction-based method, two approaches (Word2Vec and Doc2Vec) were used to generate a vector representation for each item description, followed by K-means clustering to group the products into different style categories based on vector distance.
For the Word2Vec-style analysis, word vectors were taken from pre-trained GloVe embeddings (Common Crawl, 42B tokens, 1.9M vocab, 300 dimensions, https://github.com/stanfordnlp/GloVe). The word vectors within an item description were averaged to generate a single vector that represents that description. For Doc2Vec, the vector for each description was generated by a Doc2Vec model built with Gensim and trained on the words from our item text descriptions.
With Doc2Vec vectors, the items behind the 10 most similar vectors looked more alike than with averaged word vectors, so we proceeded with K-means clustering on the Doc2Vec vectors. The item descriptions were separated into 6 groups using K-means clustering based on the cosine distance between vectors. The prediction-based method did not perform well in this case: the t-SNE graph of the K-means clusters did not show a clear separation, as shown below. One explanation could be that product descriptions are made of key phrases rather than full sentences with strong context.
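The averaging step described above can be sketched as follows. This is an illustration only: the function name `average_word_vectors`, the whitespace tokenization, the choice to skip out-of-vocabulary words, and the 3-dimensional toy vectors are all assumptions (the pre-trained vectors have 300 dimensions).

```python
def average_word_vectors(description, word_vectors, dim):
    """Average the vectors of all known words in a description.

    Words missing from the vocabulary are skipped (an assumption of this sketch).
    Returns a zero vector if no word is known.
    """
    tokens = description.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each column is averaged.
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical toy vocabulary with 3-dimensional vectors.
toy_vectors = {
    "floral": [1.0, 0.0, 0.0],
    "dress": [0.0, 1.0, 0.0],
}
vec = average_word_vectors("Floral dress with straps", toy_vectors, dim=3)
```

Averaging discards word order, which is one reason a document-level model such as Doc2Vec can behave differently on the same descriptions.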
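One common way to run K-means with cosine distance is spherical k-means: normalize every vector to unit length, assign each point to the centroid with the highest cosine similarity, and re-normalize the centroids after each update. The sketch below illustrates that variant in plain Python; it is an assumption about how such clustering can be done, not a record of the exact implementation we used, and the function names and toy vectors are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, n_iter=20, seed=0):
    """Cluster vectors by cosine similarity; returns one cluster index per vector."""
    points = [normalize(v) for v in vectors]
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: for unit vectors, the dot product is the cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update step: mean of each cluster's members, re-normalized to unit length.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


# Hypothetical 2-D toy vectors: two point mostly along x, two mostly along y.
toy_vectors = [[1.0, 0.1], [2.0, 0.3], [0.1, 1.0], [0.2, 3.0]]
labels = spherical_kmeans(toy_vectors, k=2)
```

In this toy example the vectors differ in length but are grouped by direction only, which is the behavior cosine distance is meant to give.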
- Latent Dirichlet Allocation (LDA)
On the other hand, a frequency-based method, specifically Latent Dirichlet Allocation (LDA), showed better results. As one of the popular topic modeling algorithms, LDA takes all the words and their counts in each document as input and attempts to find the structure, or topics, in this collection of unlabelled documents. Topic modeling assumes that word usage is correlated with topic occurrence. Each topic is a combination of different words with different weights, and each document is a mixture of different topics.
In our case, the documents are item descriptions and the topics are different fashion styles described by different keywords. For each product, we chose the style with the highest score to represent it. One tricky part of this method is that we do not know how many styles are covered in the retailer data we collected. We tried several different numbers of topics and, after inspecting the keywords of each topic, found that 6 was a reasonable choice. It is worth noting, though, that this is not the only valid answer. There are other possible ways to do the clustering, and this is both the challenge and the beauty of unsupervised clustering problems.
After we grouped the products, we applied t-SNE to better visualize the results. t-SNE reduces a high-dimensional data set to lower dimensions so that we can view it in a 2-D graph. In the graph below, all the products are separated into 6 groups with different colors. The keywords of each group are shown in the upper left corner. As shown below, instead of simply separating the products into dresses or tops, the algorithm grouped the clothes into 6 styles by their features, for example ‘print, floral’, ‘straps, bodycon’ and ‘soft touch, stripe’. These keywords carry more information and better represent the style features of the products.
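Choosing "the style with the highest score" amounts to taking the argmax over each product's topic distribution. The snippet below sketches only that step; the function name `dominant_style`, the item names, and the scores are hypothetical, and in practice the scores would come from a fitted LDA model.

```python
def dominant_style(topic_scores):
    """Return the index of the topic (style) with the highest score."""
    return max(range(len(topic_scores)), key=lambda i: topic_scores[i])


# Hypothetical per-product topic mixtures over 6 styles (each row sums to 1).
doc_topic_scores = {
    "item_a": [0.05, 0.70, 0.05, 0.10, 0.05, 0.05],
    "item_b": [0.10, 0.10, 0.10, 0.10, 0.50, 0.10],
}
styles = {item: dominant_style(scores) for item, scores in doc_topic_scores.items()}
```

A product whose mixture is nearly flat still gets a single style under this rule, which is one reason the grouping should not be read as an absolute answer.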
2. Image Classification
We used a deep Convolutional Neural Network (CNN), VGG16 pre-trained on ImageNet, to perform multi-category classification on the roughly one million labeled fashion images provided by the recent Kaggle competition (iMaterialist Challenge (Fashion) at FGVC5). This gave us the first version of our image recognition model (classifier). Combined with a cosine similarity measurement, the algorithm can recommend similar items to online shoppers.
- General description of Kaggle dataset
The dataset covers 228 fashion attribute classes, with multiple ground truth labels per image. It includes a total of 1,014,544 images for training, 10,586 images for validation and 42,590 images for testing. For this project, we didn’t use any of the testing images from the Kaggle dataset.
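Because each image carries several ground truth labels, one common way to represent the training target is a 228-dimensional multi-hot vector. The sketch below shows that encoding only; the post does not describe the encoding we used, so this is an assumption, as are the function name `multi_hot`, the 1-based label IDs, and the example IDs.

```python
NUM_LABELS = 228  # number of attribute classes in the dataset


def multi_hot(label_ids, num_labels=NUM_LABELS):
    """Encode a list of label IDs as a 0/1 vector of length num_labels.

    Assumes label IDs run from 1 to num_labels (an assumption of this sketch).
    """
    target = [0] * num_labels
    for label_id in label_ids:
        target[label_id - 1] = 1
    return target


# Hypothetical image annotated with three labels.
target = multi_hot([1, 17, 228])
```

With targets in this form, each of the 228 outputs can be treated as its own yes/no decision, which is what makes per-label evaluation possible.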
- Model performance
The overall ROC and precision-recall curves on the validation set are shown below. Overall performance of the model is good. However, some labels did not perform well. We are currently looking into those classes and seeking a way to fine-tune the algorithm to improve model performance. As a next step, we wanted to see how the model performs visually, which is why we started deploying all our models to Amazon Web Services.
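To find which labels perform poorly, precision and recall can be computed one label at a time from thresholded predictions. The following is a minimal sketch under assumptions: the function name `per_label_precision_recall` and the two-label toy data are hypothetical, and it evaluates a single threshold rather than the full curves.

```python
def per_label_precision_recall(y_true, y_pred):
    """Return a (precision, recall) pair for each label column.

    y_true and y_pred are lists of 0/1 rows, one row per image.
    A label with no predictions or no positives gets 0.0 for that metric.
    """
    n_labels = len(y_true[0])
    results = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall))
    return results


# Hypothetical ground truth and predictions for 3 images and 2 labels.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 1], [1, 0]]
metrics = per_label_precision_recall(y_true, y_pred)
```

In this toy case the first label is over-predicted (lower precision) while the second is under-predicted (lower recall), the two failure modes worth separating when inspecting weak classes.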
Website AWS Deployment
The end product of our project is the Fashion-Rec (http://fashion-rec.com) website, built with Flask and hosted on AWS. The first page contains images of influential fashion bloggers for the user to choose from (e.g. Chiara, Helena Bordon).
Once the user clicks on one of them, the website redirects to another page containing pictures of that specific blogger. The algorithm maps each blogger’s style into 5 or 6 currently available clothing trend groups according to the NLP analysis. This ensures that users have as many and as diverse choices as possible. For example, in the picture below, this blogger has six diverse styles such as floral dress, casual shirt and formal suit.
If the user clicks on any of these pictures, the website sends the URL of the clicked image to the server, which looks for similar clothes in our database. The criterion we use to quantify the similarity between two pieces of clothing is the cosine similarity calculated on the CNN model outputs. The higher the similarity score, the more similar the two items are.
For example, if you select ‘Style0’ above, the system will automatically show similar dresses that are available in our product database. The user can continue clicking on items under similar products, and the system will keep providing more similar dresses based on the user’s preference.
To enrich the user’s experience, we also provide a list of the most dissimilar clothes while users are shopping. That way, if they want to try out a different style, they can easily add some new clothes to their shopping cart.
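The similar and dissimilar lists described above can both be read off a single ranking by similarity score. The snippet below is a sketch under assumptions: the function names, the product names, and the 3-dimensional feature vectors are hypothetical stand-ins for the CNN model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_products(query_features, catalog):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_features, features))
        for name, features in catalog.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors; real ones would be CNN outputs per image.
catalog = {
    "dress_a": [0.9, 0.1, 0.0],
    "dress_b": [0.1, 0.9, 0.2],
    "top_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # features of the clicked influencer photo

ranked = rank_products(query, catalog)
most_similar = ranked[:2]      # head of the ranking
most_dissimilar = ranked[-1:]  # tail of the ranking
```

Taking both lists from the same ranking means the dissimilar suggestions cost no extra computation once the scores are available.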
Please check out our website: http://fashion-rec.com