Scraping Instagram for Hashtags
On Instagram, I have an account where I share pictures and videos related to my yoga practice. Lately, I have been thinking about how to acquire more followers. I was interested in knowing what hashtags my favorite Instagram yoga teachers use. Typically, on any given day, there is a trend going on in the yoga Instagram community. For example, some days, you may find yogis posting headstands and other days handstands or balancing poses. Usually, I would visit my favorite yogis' pages to see the trending yoga pose of the day/week and collect their hashtags from their most recent posts. In turn, I would use these hashtags on my posts in the hopes of getting more views and likes. Then I thought to myself, why not automate it?
I decided to scrape the last seven posts (pictures) from dylanwerneryoga (Dylan), seanphelpsyoga (Sean), and kevindhofer (Kevin).
Technologies Used:
Selenium
Python
R Studio
WordCloud2
The Process
First, I had to automate signing in to my account (you cannot see posts without an Instagram account). Then I found my way to each yogi's page by using the XPath of the search field and creating an ActionChain to type and click on the handle I wanted. Afterward, using another XPath and another ActionChain, I clicked on the most recent post. However, when I reached the desired page, the hashtags were not available to scrape because they were not visible on the page. The reason is that many teachers place their hashtags in a comment on the post instead of in the caption, so that people focus solely on the content of the caption. As a result, the hashtag comment "disappears" once followers start commenting on the post as well. Consequently, the only way to see the hashtag comment is to load all the comments for that post.
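The sign-in step could be sketched roughly as follows. This is a minimal sketch, not the code from the gist: the three XPaths, the function names, and the choice to open a profile by URL (instead of typing into the search field with an ActionChain, as described above) are all assumptions made for illustration. Instagram's markup changes often, so any selector should be checked in the browser's inspector first.

```python
LOGIN_URL = "https://www.instagram.com/accounts/login/"
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH

# Hypothetical selectors, not taken from the original program.
SELECTORS = {
    "username": "//input[@name='username']",
    "password": "//input[@name='password']",
    "submit": "//button[@type='submit']",
}


def sign_in(driver, username, password, selectors=SELECTORS, wait_seconds=10):
    """Open the login page, fill in the credentials, and submit the form."""
    driver.implicitly_wait(wait_seconds)  # let find_element poll while pages load
    driver.get(LOGIN_URL)
    driver.find_element(XPATH, selectors["username"]).send_keys(username)
    driver.find_element(XPATH, selectors["password"]).send_keys(password)
    driver.find_element(XPATH, selectors["submit"]).click()


def profile_url(handle):
    """Build the profile URL for a handle such as 'dylanwerneryoga'."""
    return f"https://www.instagram.com/{handle.strip('@/')}/"


# Usage (requires Selenium and a browser driver to be installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# sign_in(driver, "my_handle", "my_password")
# driver.get(profile_url("dylanwerneryoga"))
```

Going straight to the profile URL skips the search field entirely, which leaves one fewer selector to maintain. The ActionChain approach is closer to what a person would do in the browser.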
To tackle this problem, I created a while loop that clicked the "load more comments" button until it disappeared. Next, I scraped the hashtags from the yogis' comments and saved them to a CSV file to be imported into R Studio later.
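The loop and the export can be sketched along these lines. This is a sketch under stated assumptions and not the code from the gist: the button XPath is passed in by the caller, and the function names, the CSV column names, and the `max_clicks` safety cap are all made up for illustration.

```python
import csv
import re
import time

XPATH = "xpath"  # same string value as selenium's By.XPATH
HASHTAG_RE = re.compile(r"#\w+")


def load_all_comments(driver, button_xpath, pause=1.0, max_clicks=200):
    """Click the "load more comments" button until it is gone.

    Returns the number of clicks. max_clicks is a safety cap (an assumption)
    so that a button that never disappears cannot loop forever.
    """
    clicks = 0
    while clicks < max_clicks:
        # find_elements returns an empty list when nothing matches
        buttons = driver.find_elements(XPATH, button_xpath)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the next batch of comments time to render
    return clicks


def extract_hashtags(text):
    """Return the hashtags in a comment, lowercased, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def save_hashtags(rows, path):
    """Write (handle, post, hashtag) rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["handle", "post", "hashtag"])
        writer.writerows(rows)
```

Using `find_elements` (plural) avoids having to catch an exception when the button is no longer on the page, since an empty list is enough to end the loop.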
Exploratory Data Analysis
Total likes and average likes per post.
Likes per day for the seven most recent posts.
Dylan used 76 hashtags, 41 of which were unique. On average, Dylan used 11 hashtags per post. His favorites are #yoga (7), #mensyoga (7), #yogainspiration (7), and #yogachallenge (5).
Sean used 78 hashtags, 49 of which were unique. On average, Sean used 12 hashtags per post. His favorites are #yogatips (8), #yogahelp (4), #yogafit (4), and #yogabeginners (3). Examining his hashtag choices shows that he is targeting practitioners who are new to yoga. In fact, Sean has recently released an online training program.
Kevin used 163 hashtags, 84 of which were unique. On average, Kevin used about 23 hashtags per post (163 across seven posts). His favorites are #portugal (11), #yoga (7), #yogainspiration (7), and #instayoga (6). Judging by his tags, Kevin focuses on yoga postures and on the place or environment of the post itself.
In total, the 21 photos provided 317 hashtags, 156 of which were unique.
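The counts above were produced in R Studio. For readers who prefer to stay in Python, the same kind of summary (total, unique, average per post, most frequent tags) could be computed roughly as below. The function name and the sample data are hypothetical and are not the scraped results.

```python
from collections import Counter


def summarize(posts):
    """Summarize hashtag usage; posts is a list of hashtag lists, one per post."""
    all_tags = [tag for post in posts for tag in post]
    counts = Counter(all_tags)
    return {
        "total": len(all_tags),
        "unique": len(counts),
        "per_post": len(all_tags) / len(posts) if posts else 0.0,
        "top": counts.most_common(3),  # (hashtag, count) pairs, most used first
    }


# Made-up sample data, for illustration only.
sample = [
    ["#yoga", "#mensyoga"],
    ["#yoga", "#yogachallenge"],
    ["#yoga"],
]
print(summarize(sample))
```

Note that unique counts are not additive across accounts: tags shared by several yogis are counted once in the combined total, which is why the combined unique figure is smaller than the sum of the three individual ones.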
Problems
My program works only if the yogi's hashtag comment is the first comment on a given post. If it is not, I have to change the XPath to match the index of the comment I am looking for. The same approach can also be used to scrape captions. However, if the caption contains emojis, an error is raised.
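One possible way around the first limitation is to scrape every comment as a (username, text) pair and then filter by the account owner's handle, instead of relying on the comment's position. The sketch below assumes that such pairs have already been collected. The function name and the sample comments are hypothetical. It does not address the emoji error, whose cause is not stated here, although the regular expression itself simply skips over emoji characters.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")


def owner_hashtags(comments, owner):
    """Collect hashtags from every comment written by the post's owner.

    comments is a list of (username, text) pairs in page order, so the
    hashtag comment is found regardless of where it sits in the thread.
    """
    tags = []
    for username, text in comments:
        if username.lower() == owner.lower():
            tags.extend(HASHTAG_RE.findall(text))
    return tags


# Made-up comments, for illustration only.
comments = [
    ("some_follower", "Amazing! #goals"),
    ("dylanwerneryoga", "#yoga \U0001F64F #mensyoga"),
]
print(owner_hashtags(comments, "dylanwerneryoga"))
```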
Results
#yogainspiration, #yoga, and #mensyoga were the hashtags my favorite yogis used most widely in their most recent posts. I plan to use these tags, and this scraper, to decide which tags to put on my next 21 posts. I will update this blog with the results when that is complete.
Code
https://gist.github.com/JosephMata/7c3bac580234a17102c2d3ba19822cd2