Fashion Rec.


The goal of our capstone project is to build a clothing recommender system: given a user’s choice of a photo from an Instagram fashion blogger, our algorithm returns clothes in a similar style at a more affordable price. For example, if a user likes influencer Chiara Ferragni’s style, our recommender system will suggest similar pieces of apparel from selected e-commerce or department stores available on our website.

Influencer marketing is a fast-growing industry. An influencer is someone on Instagram who commands a lot of attention (likes, views, comments, number of followers) and who has considerable influence over the shopping choices made by their followers. According to Mediakix, advertisers spent around $1.6 billion on Instagram advertising alone.

Many influencers get paid by the brand that they are wearing when they post a photo on Instagram, usually with the post mentioning the brand’s name. This is a common marketing technique deployed by these brands as they expect influencers’ followers to buy their products as well. Our recommender system might seem to work against these marketing campaigns, but in reality, we are simply bringing more choices to Instagram followers/consumers.


Project Pipeline

We used an agile approach to complete our capstone project: group members trained the machine learning models and developed the Flask application in parallel. The workflow that combines all the components is as follows:

  1. Tops and dresses for women were scraped from 5 different retailers: Macy’s, Bloomingdale’s, H&M, ASOS, and Fashion Nova. The information we scraped included product title, price, product description, and image.
  2. We also scraped likes, comments, and Instagram posts from the past six months for 13 Instagram influencers (weworewhat, alexachung, chiara, songofstyle, fashiondiary, kyliejenner, oliviapalermo, oliviaculpo, gigihadid, helenabordon, juliahengel, heileybaldwin, imjennim).
  3. We used the NLP technique Latent Dirichlet Allocation (LDA) on our scraped product descriptions to group our products into 6 different current market styles.
  4. We trained our image classification model on the iMaterialist Challenge (Fashion) at FGVC5 dataset, a Kaggle image dataset containing 1,014,544 unbiased clothing images and 228 categorical labels.
  5. We extracted product images from the 6 clusters we derived through LDA in step 3, and used the image classification model trained in step 4 to find similarities to the clothes influencers wear (step 2).
  6. We built a Flask app that provides the user interface for a consumer to browse endlessly through different influencers’ photos and our products from major retailers. The web app is hosted on AWS.


Exploratory Data Analysis (EDA)

Our dataset contains about 10,000 product records from retailers and about 1,000 Instagram records from influencers. Median product prices range from ~$20 to ~$150 across the retailers, as shown below.
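A per-retailer median like the one described above could be computed along these lines. This is a minimal sketch with pandas; the column names and the prices are made up for illustration and are not taken from our scraped data.

```python
import pandas as pd

# Hypothetical scraped records; column names and prices are illustrative only.
products = pd.DataFrame(
    {
        "retailer": ["ASOS", "ASOS", "ASOS", "H&M", "H&M", "Macy's", "Macy's", "Macy's"],
        "price": [22.0, 35.0, 48.0, 15.0, 25.0, 60.0, 89.0, 140.0],
    }
)

# Median price per retailer, sorted from most to least expensive.
median_price = (
    products.groupby("retailer")["price"].median().sort_values(ascending=False)
)
print(median_price)
```

The median is used rather than the mean so that a few very expensive items do not dominate a retailer's summary.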

The majority of product items came from ASOS because of its affordable price range and the diversity of its clothing styles.

The total number of Instagram posts per influencer collected over the past six months is shown below. On average, posting frequency ranges from ~5 posts/month to ~20 posts/month.

The median number of likes per Instagram post reflects the popularity of each influencer. Note: Kylie Jenner, who has ~6 million median likes per post, is not included in the graph below so that the median likes for the other influencers are easier to see.

The high number of likes for the bloggers in our database indicates that many people are attracted to their outfits. The recommender system will be a good platform for users to discover their favorite blogger’s style and find similar apparel.


Machine Learning Models

1. Natural Language Processing (NLP)

NLP analysis was applied to the item descriptions of the web-scraped products to uncover current market clothing trends. The styles discovered this way were then used to partition each blogger’s outfits into fashion styles that are currently available in the market.

Prediction-based and frequency-based methods are two common families of NLP methods. Frequency-based methods assume that the words in a text are independent of each other and take only word occurrence counts into account. In contrast, prediction-based methods take the co-occurrence of words into consideration, which offers a clear advantage for text with strong relationships between words.

  • Word2Vec and Doc2Vec

We applied both prediction-based and frequency-based methods in our project to evaluate which gives the best result. For the prediction-based method, two approaches, Word2Vec and Doc2Vec, were used to generate a vector representation of each item description, followed by K-means clustering to group the products into style categories based on vector distance.

For the Word2Vec analysis, word vectors were obtained from a pre-trained Word2Vec model (Common Crawl 42B, 1.9M vocab and 300 dimensions). The word vectors within an item description were averaged to generate a single vector that represents that description. For Doc2Vec, the vector for each description was generated by a Gensim Doc2Vec model trained on the words from our item descriptions.
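The averaging step can be sketched as follows. This is only an illustration: the tiny 4-dimensional `word_vectors` dictionary stands in for the 300-dimensional pre-trained vectors, and `describe_vector` is a hypothetical helper name, not a function from our code base.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings standing in for the pre-trained vectors.
word_vectors = {
    "floral": np.array([1.0, 0.0, 0.0, 0.0]),
    "print": np.array([0.0, 1.0, 0.0, 0.0]),
    "dress": np.array([0.0, 0.0, 1.0, 0.0]),
}


def describe_vector(description, vectors, dim=4):
    """Average the vectors of the known words; return zeros if none are known."""
    tokens = description.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "midi" is not in the vocabulary, so it is skipped.
vec = describe_vector("Floral print midi dress", word_vectors)
print(vec)
```

Words missing from the vocabulary are simply skipped here, which is one common choice; other handling (for example a shared "unknown" vector) is equally possible.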

Among the top 10 most similar vectors, Doc2Vec vectors returned item images that looked more alike than those from Word2Vec, so we proceeded with K-means clustering on the Doc2Vec vectors. The item descriptions were separated into 6 groups using K-means clustering based on the cosine distance between vectors. The prediction-based method did not perform well in this case: the t-SNE plot of the K-means clusters did not show clear separation, as shown below. One explanation could be that product descriptions are made up of key phrases rather than full sentences with strong context.
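One way to approximate cosine-based K-means with scikit-learn is sketched below. It is an assumption about tooling rather than our exact code: scikit-learn's `KMeans` uses Euclidean distance, so the vectors are L2-normalized first, which makes squared Euclidean distance equal to twice the cosine distance. The random `doc_vectors` stand in for real Doc2Vec output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Stand-in for Doc2Vec description vectors: 60 items, 20 dimensions.
doc_vectors = rng.normal(size=(60, 20))

# After L2 normalization, ||a - b||^2 = 2 * (1 - cos(a, b)), so Euclidean
# K-means groups items by cosine distance (centroids are not re-normalized).
unit_vectors = normalize(doc_vectors)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(unit_vectors)

# Number of items assigned to each of the 6 clusters.
print(np.bincount(labels, minlength=6))
```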

  • Latent Dirichlet Allocation (LDA)

In contrast, a frequency-based method, Latent Dirichlet Allocation (LDA), showed better results. A popular topic modeling algorithm, LDA takes all the words and their counts in each document as input and attempts to find the structure, or topics, in a collection of unlabelled documents. Topic modeling assumes that word usage is correlated with topic occurrence. Each topic is a combination of different words with different weights, and each document is a mixture of different topics.

In our case, the documents are item descriptions and the topics are different fashion styles described by different keywords. For each product, we chose the style with the highest score to represent it. One tricky part of this method is that we do not know how many styles are covered in the retailer data we collected. We tried several different numbers and found that 6 was a reasonable choice based on the keywords in each topic. It is worth noting, though, that this is not the only answer to this problem. There are other possible ways to do the clustering, and this is both the challenge and the beauty of unsupervised clustering problems.

After grouping the products, we applied t-SNE to visualize the results. t-SNE reduces a high-dimensional data set to lower dimensions so that it can be viewed in a 2-D graph. In the graph below, all the products are separated into 6 groups with different colors. The keywords of each group are shown in the upper left corner. As shown below, instead of simply separating the products into dresses and tops, the algorithm grouped the clothes into 6 styles by features, for example ‘print, floral’, ‘straps, bodycon’ and ‘soft touch, stripe’. These keywords carry more information and better represent the style features of those products.
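A minimal sketch of the projection step, assuming scikit-learn's `TSNE`. The `topic_weights` array is synthetic (random rows that sum to 1) and only stands in for the per-product topic weights from LDA.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for per-product topic weights: 50 products, 6 styles, rows sum to 1.
topic_weights = rng.dirichlet(alpha=np.ones(6), size=50)

# Style assigned to each product, usable as the color in a scatter plot.
style = topic_weights.argmax(axis=1)

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
coords = tsne.fit_transform(topic_weights)
print(coords.shape)
```

Plotting `coords[:, 0]` against `coords[:, 1]` colored by `style` would give a graph of the kind described above.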


2. Image Classification

We applied a deep Convolutional Neural Network (CNN), a VGG16 model pre-trained on ImageNet, to perform multi-category classification on the million fashion images with labels provided by a recent Kaggle competition (iMaterialist Challenge (Fashion) at FGVC5) to obtain the first version of our image recognition model (classifier). Combined with a cosine similarity measurement, the algorithm can recommend similar items to online shoppers.

  • General description of Kaggle dataset

The training dataset includes images from 228 fashion attribute classes with multiple ground truth labels per image. It includes a total of 1,014,544 images for training, 10,586 images for validation and 42,590 images for testing. For this project, we didn’t use any of the testing images from the Kaggle dataset.

  • Model performance

The overall ROC and precision-recall curves on the validation set are shown below. Overall, the model performs well. However, some labels did not perform well. We are currently looking into those classes and seeking a way to fine-tune the algorithm to improve model performance. Nevertheless, as a next step we wanted to see how the model performs visually, which is why we deployed all our models to Amazon Web Services.
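Weak labels in a multi-label classifier can be spotted by scoring each label separately. The sketch below uses scikit-learn metrics on synthetic data: 5 labels instead of 228, with label 4 deliberately given pure-noise scores. None of these numbers come from our model.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_images, n_labels = 200, 5

# Synthetic multi-label ground truth: each image can carry several labels.
y_true = rng.integers(0, 2, size=(n_images, n_labels))

# Synthetic scores: informative for labels 0-3, pure noise for label 4.
scores = y_true * 0.6 + rng.random((n_images, n_labels)) * 0.8
scores[:, 4] = rng.random(n_images)

# average=None returns one value per label instead of a single summary.
auc = roc_auc_score(y_true, scores, average=None)
ap = average_precision_score(y_true, scores, average=None)

worst = int(np.argmin(auc))
print("per-label ROC AUC:", np.round(auc, 3))
print("per-label average precision:", np.round(ap, 3))
print("weakest label:", worst)
```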



Website AWS Deployment

The end product of our project is the Fashion-Rec website, hosted on AWS using Flask. The first page contains images of influential fashion bloggers for the user to choose from (e.g. Chiara, Helena Bordon).

Once the user clicks on one of them, the website redirects to another page containing pictures of that specific blogger. The algorithm maps each blogger’s style into 5 or 6 currently available clothing trend groups according to the NLP analysis. This ensures that users have as many and as diverse choices as possible. For example, in the picture below, this blogger has six diverse styles, such as floral dress, casual shirt and formal suit.

If the user clicks on any of these pictures, the website sends the URL of the clicked image back to the server, which looks for similar clothes in our database. The criterion we use to quantify the similarity between two items of clothing is the cosine similarity calculated on the CNN model outputs. The higher the similarity score, the more similar the two items are.
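The request flow can be illustrated with a small Flask route. Everything specific here is hypothetical: the `/similar` path, the `image_url` query parameter, and the in-memory `SIMILAR_PRODUCTS` lookup, which stands in for the real similarity search against the product database.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical lookup from an influencer image URL to ranked product IDs.
SIMILAR_PRODUCTS = {
    "https://example.com/influencer/style0.jpg": ["dress-101", "dress-245", "top-077"],
}


@app.route("/similar")
def similar():
    """Return the products ranked as most similar to the clicked image."""
    image_url = request.args.get("image_url", "")
    matches = SIMILAR_PRODUCTS.get(image_url)
    if matches is None:
        return jsonify(error="unknown image"), 404
    return jsonify(image_url=image_url, products=matches)
```

In a browser, clicking a picture would issue a request such as `/similar?image_url=...`, and the returned product IDs would be rendered as the list of similar items.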

For example, if you select ‘Style0’ above, the system will automatically generate similar dresses that are available in our product database. The user can continue clicking on items under similar products, and the system will keep providing more similar dresses based on the user’s preference.

To enrich the user’s experience, we also provide a list of the most dissimilar clothes to users while they are shopping. That way, if they want to try out a different style, they can easily add some new clothes to their shopping cart.
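Ranking the catalog by cosine similarity gives both lists at once: the top of the ranking holds the most similar items and the bottom holds the most dissimilar ones. The sketch below uses NumPy with made-up 3-dimensional feature vectors in place of real CNN outputs; `cosine_similarities` is an illustrative helper name.

```python
import numpy as np


def cosine_similarities(query, catalog):
    """Cosine similarity between one query vector and each row of the catalog."""
    query = query / np.linalg.norm(query)
    catalog = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    return catalog @ query


# Made-up feature vectors for 5 catalog items (real CNN outputs are much longer).
catalog_features = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.5, 0.5, 0.0],
    ]
)
# Feature vector of the clicked influencer photo (also made up).
query_features = np.array([1.0, 0.05, 0.0])

sims = cosine_similarities(query_features, catalog_features)
order = np.argsort(sims)[::-1]  # catalog indices, most similar first

most_similar = order[:2]
most_dissimilar = order[-1:]
print("most similar items:", most_similar)
print("most dissimilar item:", most_dissimilar)
```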


About Authors

Kelly Ho

Kelly graduated from Cornell University with a Master of Engineering degree. She has three years of experience in analytics, statistical modeling, and providing data-driven recommendations for process improvement.

Richie Bui

Richie Bui has experience collecting, processing and capturing large datasets from working with Medtronic over the past 3 years. Closely working with his colleagues in engineering, product management, and bio-statisticians to gather information on the medical...

Samuel Mao

Samuel Mao is a data scientist with three years of experience using R and Python to develop models addressing needs across business functions. He also has demonstrated experience growing US/China cross border enterprise value.

Silvia Lu

Silvia is currently working as a Data Analysis Intern at the Aviation Planning Department of Port Authority of New York and New Jersey. She also has a Psychology background and is a second-year graduate student in New York...

Zhenggang Xu

Zhenggang is currently a data science fellow at NYC Data Science Academy. He received his education in computational chemistry and worked in deep water exploration for a few years. He believes in numbers since computations have helped him...