likePredict: A product to predict "likes" of an Instagram Post

Instagram “likes” Pose a Problem

An internal Instagram study showed that teens delete up to half of all their Instagram posts because the posts do not receive enough likes. In fact, Instagram’s new “Instagram story” is in part an effort to counter the vanity imposed by the “likes” metric (Business Insider).

As a result, there is a growing market need for a tool that can accurately predict the likes of a post for a given user. While a number of services provide basic to complex analytics, predictive modeling is rarely offered to the average user. Hence we developed “likePredict”: a “like” predictor for use with public Instagram accounts.






Web Scraping Instagram Posts

Gathering the data for this project was the largest challenge. In recent years, Instagram has adopted an increasingly stringent API policy, requiring developers to go through a number of approval steps before being given a key. As such, we needed to scrape the data.

Instagram, like many other large organizations, has many built-in tools to limit and trap web scrapers before they collect too much data. However, there are a number of Instagram web viewers that display Instagram content without the same restrictions. For our project we used two sources: Instagim and Instaliga.

Instagim - Posts by Tag

Instagim allows photos to be displayed by tag on a single page, showing username, likes, comments, caption, filter and some relevant hashtags. Due to the scope and timing of this project, we focused solely on photos under the “nature” tag.

While Instagim can constantly be refreshed to show the latest photos, we needed photos that had been posted for at least 24 hours so that they had time to accumulate likes. When clicking the “load more” button at the bottom of the page, we noticed that the website made an AJAX call with a unique key. By scraping only these keys and waiting 24 hours, we could replicate the same AJAX call from the previous day and obtain photos posted the day prior (we did our best to keep the 24-hour period constant). Using BeautifulSoup, urllib and the requests library in Python, we were easily able to download each photo and get the metrics specific to it.
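The "collect the key now, replay it a day later" bookkeeping can be sketched as follows. This is a minimal illustration using only the standard library; the record layout and the function names `record_key` and `keys_ready_for_replay` are our own assumptions, not the original scraper code, and the actual AJAX request is left out because the endpoint format is not documented here.

```python
import time

DAY_SECONDS = 24 * 60 * 60


def record_key(records, key, now=None):
    """Store a pagination key together with the time it was scraped."""
    collected_at = time.time() if now is None else now
    records.append({"key": key, "collected_at": collected_at})
    return records


def keys_ready_for_replay(records, now=None, min_age=DAY_SECONDS):
    """Return the keys that were scraped at least `min_age` seconds ago.

    These are the keys whose AJAX call can be replayed to fetch photos
    that have had a full day to accumulate likes.
    """
    now = time.time() if now is None else now
    return [r["key"] for r in records if now - r["collected_at"] >= min_age]
```

Passing `now` explicitly makes the 24-hour window easy to check without waiting; in a real run both functions would simply use the current time.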

Instaliga - User Information

While we had the photo and the likes it got, we were still missing a key metric: followers. We needed to find a website that could be easily crawled for follower and following counts, since they would be important features in predicting the number of likes for a photo. Instaliga was easy to scrape since it allowed us to append usernames to the end of a URL, and then crawl for this information using Scrapy.

Additionally, we were able to scrape the metrics for the last 20 posts from a given user, giving us a baseline for their standard level of engagement with their audience. However, since each user has its own URL, the script generated too many requests and was often met with server errors. Auto-throttling Scrapy’s wait time turned out not to be time efficient, so we moved forward by hammering their server with requests and getting what data we could.
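For readers who would rather back off than hammer a server, Scrapy's AutoThrottle extension and retry middleware are controlled from a project's settings.py. The setting names below are Scrapy's own, but the values are illustrative assumptions and not the configuration used in this project, which ultimately ran without throttling.

```python
# settings.py fragment (illustrative values, not the project's configuration)

# Let Scrapy adapt the delay between requests to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # upper bound on the delay under load
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per server

# Cap parallel requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Retry responses that usually indicate an overloaded server.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

A higher target concurrency or a lower maximum delay trades politeness for speed, which is the trade-off described above.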

EDA & Feature Engineering

We started off with some basic EDA. Our histogram of "likes" across all of our observations shows that they are not normally distributed. Followers, following and other metrics seemed to follow the same pattern. We debated applying some basic transformations (logarithmic or Box-Cox); however, since our final model was likely to be a neural network or tree-based model, we deemed it unnecessary.



The other non-photo features were relatively straightforward. The bulk of our engineering was generating the mean, mode, min and max likes of each user's previous posts. As intuition would suggest, all of these metrics were fairly correlated with the likes received on the user's newest post, unless of course other extraneous factors were at work (a sudden influx of followers, a particularly "likeable" post such as a celebrity post, etc.).
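As a rough sketch of these "previous likes" features, the summary statistics for one user can be computed with Python's standard library. The function name, the output keys, and the choice to break ties between modes by taking the smallest value are our assumptions for illustration.

```python
from statistics import mean, multimode


def previous_like_features(likes):
    """Summarize the like counts of a user's previous posts."""
    if not likes:
        raise ValueError("at least one previous post is required")
    return {
        "prev_mean": mean(likes),
        # multimode returns every most-common value; take the smallest on ties
        "prev_mode": min(multimode(likes)),
        "prev_min": min(likes),
        "prev_max": max(likes),
    }


# Example: like counts from a user's earlier posts
features = previous_like_features([10, 20, 20, 30])
```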

However, many other features had more complex relationships with likes. It seems fair to assume that followers would be correlated with likes, and a relationship is present, but it is subject to a large amount of variance. This variance is likely explained by a number of other factors, hence our interest in the "previous likes" metrics.

Some features, such as the filter applied to the photo, seemed to have even less of an effect than we originally thought. These features would likely matter more within particular photo types: for example, selfies may commonly use a filter to make a person’s skin look healthier. However, for our “nature” photos there seemed to be little correlation.

Photo Features - Extraction

Our plan was to supplement all of the user data with information from the photos to achieve a more accurate prediction. The logic is that if a user’s followers and general account popularity define a range within which a photo's likes should fall, the features of that photo will help pin down a more precise prediction within that range. After turning the photos into arrays, the PIL library was used to extract summary statistics for each color band. Other features such as luminance were easy to calculate given this data.
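To make the per-band statistics concrete, here is a minimal sketch that works on a plain list of (R, G, B) tuples rather than on PIL objects, so it runs with the standard library alone. The function names are ours, and the luminance weights are the common Rec. 601 coefficients, which is an assumption since the exact formula used in the project is not stated.

```python
from statistics import mean, pstdev


def band_statistics(pixels):
    """Return min, max, mean and standard deviation for each color band.

    `pixels` is a sequence of (R, G, B) tuples with values from 0 to 255.
    """
    pixels = list(pixels)
    stats = {}
    for index, band in enumerate("RGB"):
        values = [pixel[index] for pixel in pixels]
        stats[band] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stddev": pstdev(values),
        }
    return stats


def mean_luminance(pixels):
    """Average perceived brightness using Rec. 601 weights (an assumption)."""
    return mean(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
```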

However, we also wanted to extract more complex features. Using the OpenCV library, we were able to use a pre-trained model for face detection and assign each photo a value based on the number of faces detected in it. We also measured the blur of each photo.
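The blur measure itself is not specified here. One widely used choice is the variance of the Laplacian, where a low variance suggests few sharp edges and therefore a blurrier image. The sketch below is a pure-Python version of that idea on a 2D grid of grayscale values; treating it as the project's blur metric is an assumption, and the function name is ours.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a 2D list of grayscale intensities. Higher values indicate
    more edges (a sharper image); lower values indicate more blur.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)
```

A perfectly flat image scores zero, while a high-contrast pattern such as a checkerboard scores much higher, so the value can be used directly as a numeric feature.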

Photo Features - EDA

Unfortunately, few of these photo features showed a meaningful correlation with likes in our subset of data. We compared many of the features against likes, and against the likes/followers ratio as a way to normalize likes, but the correlations still seemed weak.
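The comparison against the normalized target can be sketched as a Pearson correlation between a photo feature and the likes/followers ratio. This is an illustration only: the function names are ours, and dropping accounts with zero followers is an assumption made to avoid division by zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def correlation_with_like_ratio(feature, likes, followers):
    """Correlate a feature with likes/followers, skipping zero-follower rows."""
    rows = [(f, l / n) for f, l, n in zip(feature, likes, followers) if n > 0]
    kept_feature = [f for f, _ in rows]
    ratios = [r for _, r in rows]
    return pearson(kept_feature, ratios)
```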


In the future, we plan to extract even more features, and scrape data from multiple accounts to see if photo features matter more in the realm of a single user. Comparing some of these photo attributes against the mean and median likes for users may also have been beneficial.

Model Tuning and Selection

Neural Network

We began the modeling process by constructing a basic Multi-Layer Perceptron with a single layer of input nodes and a single output node, using the Rectified Linear Unit as our activation function. After setting up this basic model using the Keras API with a TensorFlow backend, we tuned its hyper-parameters extensively. We then constructed other models for comparison.

Gradient Boosting Regressor

Given the results of the Multi-Layer Perceptron, we sought to compare it with a less complex model. Using the GradientBoostingRegressor in sklearn, we were able to obtain much better results both in cross-validation and in predictions on our test data.

Ultimately we were able to compare our predicted results to the actual amount of likes a post received in our test set and determined that 95% of the predictions were within 30 likes.
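A figure like "95% of predictions within 30 likes" is the share of test posts whose absolute error falls under a tolerance. A minimal sketch of that calculation follows; the function name and the example numbers are made up for illustration and are not the project's data.

```python
def fraction_within(actual, predicted, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)


# Made-up example: three of the four predictions are within 30 likes
share = fraction_within([100, 200, 300, 400], [110, 260, 295, 400], tolerance=30)
```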

likePredict: Flask app for predicting the likes of a post

Finally, we created a front-end application using Flask. This library lets you easily link back-end Python code with HTML templates to build interactive web apps. Once launched, this would give users a simple interface to upload their image and type in their Instagram handle to receive predictions. It’s important to note that we are not web designers, so the temporary UX/UI leaves something to be desired.

Upon image upload, the back end collects and analyzes the image and exports the results as a data frame. Upon handle input, a Scrapy spider automatically collects the user's previous 20 posts and other relevant data.

These data frames are combined to match our model's training columns. The model is then applied to the resulting data frame and an output prediction is generated.
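The column-matching step can be sketched without any data frame library: merge the photo features and the user features, then emit the values in exactly the order the model was trained on. The feature names below are hypothetical, and filling missing columns with a default value is an assumption rather than the app's documented behavior.

```python
def build_feature_row(photo_features, user_features, training_columns, fill_value=0.0):
    """Combine both feature sets into one row ordered like the training data.

    Columns missing from both inputs are filled with `fill_value`;
    extra keys that the model never saw are dropped.
    """
    combined = {**photo_features, **user_features}
    return [combined.get(name, fill_value) for name in training_columns]


# Hypothetical column names, for illustration only
training_columns = ["followers", "prev_mean", "mean_luminance", "face_count"]
row = build_feature_row(
    {"mean_luminance": 120.5, "face_count": 0, "unused_feature": 1},
    {"followers": 850, "prev_mean": 42.0},
    training_columns,
)
```

The resulting `row` can then be passed to the trained model's prediction method as a single observation.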

Future Directions

In the future, we’d like to get access to the Instagram API since a scrapy based web application is not a model for stability and scalability. It would alleviate the majority of issues currently present in our process, and allow us to expand our models for different categories of photos. Currently, we are limited exclusively to public accounts due to data access. Additional features such as time of the week, follower involvement/network analysis, and a greater variety of image analysis would further reduce the error on our prediction within a given subset of photos. With the right amount of data and computational power, this applied model could solve some inherent issues with Instagram’s “likes” metric, and even expand to other platforms.

About Authors

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. Following, he continued on for his Master's in Molecular and Cellular Biology, received in 2016. Cultivating high level skills in data science through his analytical work...

Mayank Shah

Financial analyst -> Educational startup founder -> Facebook Analyst -> Data Scientist

Christopher Capozzola

Christopher is a passionate analyst with a certification as a Data Scientist and extensive background in mathematics. He has years of experience in research, analytics, and modeling. He leveraged his experience at NYC Data Science Academy to enrich...

Leave a Comment

Ravali March 22, 2019
Hello!! I am Ravali. Currently I am doing my masters in IT. As a part of my curriculum I am asked to complete a project on web scraping using beautiful soup and selenium web drivers. I have chosen Instagram as my website to scrape the pictures and number of likes for each picture. Now here comes the tough part for me. I need to analyze the user interests by analyzing the likes of different pictures with the help of python tools and plot the graphs about my analysis... So I need a help from you on how to analyze the data based on the number of user likes using python tools like data frames, Numpy etc. I would be really thankful for this help.
Tracy July 27, 2017
What were the photo features you extracted? Have those features expanded since you wrote this?
Kyle Gallatin July 31, 2017
Basic stats on the R, G and B portions of the image (max, min, mean, stddev, etc.), some luminance metrics, image size, blur, and the number of faces. We haven't yet added more features, but plan to use more pre-trained libraries (similar to the face detection in OpenCV) to extract more concrete image features that we think may influence certain categories.
