Latent Dirichlet Allocation and Topic Modelling: A Link Between The Quantitative and Qualitative?

Kyle D. Weber
Posted on Nov 4, 2019

Introduction/Motivation:

When discussing the application of machine learning techniques and the use of data science in industrial settings, one challenge is that the quantitative techniques that are commonly used -- regularized regression, GLMs, time series methods, and decision trees -- may be hard to explain to non-practitioners.  Moreover, the concepts and variables whose relationships we want to test are often not explicitly defined in our data set, and practitioners may be hesitant to manually code additional variables in cases where the appropriate boundaries are hard to define.

One methodology that helps with both of these issues is topic modelling, which uses variation in the tendency of different sets of words or phrases to appear together to generate topics, each defined as a probability distribution over individual words.  Words that are more strongly associated with a topic are estimated as having a high probability of appearing under that topic.  This technique can easily be explained to workers outside the data science field by showing the lists of words associated with different topics.  Moreover, individuals with domain knowledge can assist data scientists in interpreting the meaning of the words associated with each topic, making this technique uniquely well-suited to cross-team collaboration.  Another advantage is that Latent Dirichlet Allocation (LDA), the topic-modelling technique used in this post, often identifies concepts that would be much more difficult to define a priori.  (For example, an LDA model might include a topic with words like "storms," "sunshine," and "showers" associated with weather, but it would be difficult to guess in advance which words appear in weather-related contexts across corpora consisting of hundreds of thousands of pages.)

The basic goal of this blog post is threefold.  First, I want to introduce this new data set with scraped information about headphones that I collected and show how it might be used to gather information about the current and historical state of the headphone market.  Next, I want to briefly explain what topic modelling is, what its output might look like, and its advantages and disadvantages.  Finally, I want to provide some context for how topic modelling can help organizations and data scientists extract qualitative information about a product or business from textual data.

Data:

The data for this project was scraped from Head-fi.org, a website dedicated to providing audiophiles and other fans of portable audio equipment a space to discuss different brands and products.  While the website's main forum contains a huge number of reviews, I chose to focus on the "Head Gear" portion of the site, which contains descriptions and reviews for 6,813 different products listed under the main category of headphones.  The descriptions are generally copied directly from manufacturers, while reviews are written by individual users.  Items are also categorized on the site based on which type of headphone they represent.  There are 10,301 reviews in this data set, with each reviewed product receiving an average of 3.7 reviews.  I used the Scrapy package in Python to extract this information from the head-fi.org site, and I used RStudio to generate the plots seen below.

I was interested in scraping information about headphones because I had some background information about the product (I've bought a lot of headphones in my life), and so I thought it would be an interesting context in which to show how textual analysis could be combined with domain-specific knowledge to better understand an industry's dynamics.  I also liked the huge amount of product and brand variety present in this market (it does not exhibit the concentration seen in some software and professional hardware markets).  The market will also likely remain the site of robust competition as other hardware markets mature and greater focus turns to the relative advantages of different "wearable" technologies, so I think there will be some value in comparing my current findings with the characteristics of the head-fi.org site in the future.

I chose the head-fi.org site specifically for three reasons.  First, head-fi.org proved relatively cooperative for scraping (although it did limit the number of concurrent requests I could make).  Second, I appreciated that -- as a specialist forum -- the reviews posted to the site are generally extremely detailed in their descriptions of the products.  To provide one rough illustration, there are over fifty reviews that would run over ten pages if typed and printed (meaning they contain more than 34,000 characters at roughly 3,400 characters per page).  Third, the site has more traffic than most of its competitors (whathifi.com, avforums.com, headphones.com, audioholics.com, etc.) according to Alexa, which suggests that it would likely have richer data than competing forums.

Exploratory Data Analysis:

Before turning to how I used textual analysis methods, I want to discuss some analytical questions that could be answered without using textual analysis. 

First, we can examine the distribution of ratings associated with Head-fi reviews, finding that it has significant left skew, with the peak of the rating distribution around a rating of four.  Given that this is an enthusiast community, this is likely explained by the fact that better-informed reviewers tend to purchase products that they already expect to like.  This suggests that explaining which products get reviewed at all is as important as understanding which products receive positive reviews when modelling how products gain the approval of this community.
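The skew in the rating distribution can be quantified with a sample-skewness calculation; a minimal sketch using toy ratings (not the real Head-fi data):

```python
from statistics import mean, pstdev

def sample_skewness(xs):
    """Third standardized moment: negative for a left-skewed distribution."""
    m, s, n = mean(xs), pstdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

# Toy ratings clustered near 4-5 with a thin tail of low scores, mimicking
# the shape of the Head-fi rating distribution (not the real data).
toy_ratings = [5, 5, 4, 4, 4, 5, 3, 2, 1]
print(round(sample_skewness(toy_ratings), 3))  # negative => left skew
```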

Second, by examining the distribution of products across categories based on the date they were posted, we can identify trends in the market for headphones.  More specifically, the graph below shows that the proportion of reviews categorized as over-ear and on-ear headphones has tended to fall over time, while the proportion categorized as universal fit or in-ear has tended to rise.

We can also use the data set to examine whether various brands produce products that are rated significantly better or worse than the average of all other products.  Most of the easily recognizable brands are either worse than average or not significantly different from average; the specialist brands that are likely less recognizable to readers of this blog tended to have review scores significantly above average.  One caveat is that Sennheiser had far more headphones in the data set than any other manufacturer, so its mean score is estimated with greater precision.  As a result, Sennheiser ends up significantly worse than the average of other manufacturers despite having a higher mean score than some brands that were not flagged as significantly worse, implying that its significance is attributable to the precision with which its mean score is estimated rather than to its products necessarily being worse than those of other brands.

Brand          Different From Average?   P-Value   Which Direction?   Mean Value of Brand
akg            No                        0.2497    N/A                3.95
apple          Yes                       0.0409    Worse              2.75
audiotechnica  No                        0.5052    N/A                4.09
beats          Yes                       <.01      Worse              2.81
beyerdynamic   No                        0.1383    N/A                4.17
bose           Yes                       0.0123    Worse              3.51
brainwavz      No                        0.0603    N/A                3.89
jvc            Yes                       <.01      Worse              3.69
sennheiser     Yes                       0.0285    Worse              3.93
skullcandy     Yes                       <.01      Worse              3.06
sony           Yes                       <.01      Worse              3.79
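A table like this can be produced with two-sample Welch t-tests, comparing one brand's scores against everyone else's.  A sketch with simulated scores (the means, spreads, and sample sizes below are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy data: one brand's review scores vs. everyone else's (simulated).
brand_scores = rng.normal(loc=3.5, scale=0.8, size=60)
other_scores = rng.normal(loc=3.9, scale=0.9, size=2000)

# Welch's t-test (unequal variances): is this brand's mean score
# significantly different from the rest of the market?
t_stat, p_value = stats.ttest_ind(brand_scores, other_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Repeating this test per brand, with each brand's scores held out of the "everyone else" sample, yields the columns above.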

Finally, based on my observation above that smaller specialist brands tended to sell more highly rated products than larger brands, I wanted to test this by producing a scatter-plot (with some jitter to avoid overlapping points) of each brand's average score against its number of products, with a LOESS regression overlay to estimate the non-linear relationship between these variables.  From this graph, we can see that brands with more products initially have higher average scores (some one-product brands have average scores below two), but past a certain point, brands with more products start having lower average scores.  The smallest headphone makers exhibit an extremely large dispersion of product quality, while larger headphone makers tend to make products whose average score is lower than that of smaller manufacturers.  This gives consumers useful information about how a brand's size relates to product quality when deciding which headphones to buy.
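The LOESS overlay can be reproduced in Python with statsmodels' LOWESS smoother; a sketch on simulated brand data (the jittered product counts and the quadratic score pattern are invented for illustration):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
# Toy brand data: jittered product counts and noisy average scores whose
# true relationship rises and then falls (illustration, not the real data).
n_products = rng.integers(1, 60, size=200) + rng.normal(0, 0.1, size=200)
avg_score = (3.5 + 0.01 * n_products - 0.0002 * n_products**2
             + rng.normal(0, 0.3, size=200))

# LOWESS fit of average score on product count; frac controls smoothing.
smoothed = lowess(avg_score, n_products, frac=0.5)
print(smoothed.shape)  # (200, 2): sorted x values and fitted y values
```

The second column of `smoothed` is the curve that would be drawn over the jittered scatter-plot.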

Several Applications of Textual Analysis:

Before proceeding with some examples of what textual analysis can help us conclude about this data set, it is worth providing a simple layman's explanation of how topic modelling works.  The basic idea is that -- through some type of optimization method -- the LDA technique finds the distribution of words in each topic (in other words, how often we would expect each word to appear in a document about that topic) and the distribution of topics over documents (in other words, the probability that each document is about a specific topic) that together best match the observed pattern of word occurrences across a large number of documents (in my case, documents are reviews).

The advantages of this method are threefold.  First, it allows huge quantities of data to be analyzed much faster than human coders could process individual documents.  Second, the list of words associated with individual topics is an extremely accessible output of the LDA model that can be used to gather feedback from other individuals with knowledge of a given industry or field.  Finally, the LDA method produces estimates of the probability that individual documents are about each topic.  Given that topics can represent abstract concepts, this allows us to generate a numeric measure of an abstract quantity that would otherwise be difficult to code.  The main caveats are that you have to choose the number of topics you want to generate, you need to interpret the meaning of the words associated with each topic, and topics may not represent meaningful concepts.  (A textual analysis of customer reviews may generate a topic associated with profanity, rudeness, or insults, but those topics might not be relevant for a company trying to improve its operations.)

We can now examine the ten words assigned the highest probability under each of the ten topics generated by applying LDA to the product description data.  Note that the text was stemmed to remove suffixes, so that different tenses and forms of a word map to the same token.
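The stemming step can be illustrated with the Porter stemmer (I am assuming NLTK's implementation here); it explains why the tables below contain stems like "earphon" and "qualiti" rather than full words:

```python
from nltk.stem import PorterStemmer

# Porter stemming maps different forms of a word to a single stem, which is
# why the topic tables in this post show stems such as "earphon" and "qualiti".
stemmer = PorterStemmer()
words = ["earphones", "earphone", "quality", "qualities", "comfortable", "comfort"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```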

 

Topic 1:  ear, earphon, cabl, fit, tip, custom, isol, silicon, monitor, includ
Topic 2:  ear, fit, comfort, design, use, cord, earbud, includ, carri, size
Topic 3:  music, bluetooth, wireless, control, time, hour, phone, call, microphon, devic
Topic 4:  design, qualiti, color, ipod, bass, ear, comfort, player, style, music
Topic 5:  cabl, frequenc, driver, respons, plug, khz, imped, sensit, type, plate
Topic 6:  music, can, bass, will, like, get, just, good, feel, sound
Topic 7:  nois, cancel, listen, game, system, use, volum, featur, level, stereo
Topic 8:  high, comfort, design, featur, headband, listen, perform, pad, adjust, ear
Topic 9:  new, product, technolog, qualiti, design, world, made, one, first, develop
Topic 10: driver, high, balanc, low, armatur, frequenc, dynam, diaphragm, hous, mid

Even though the algorithm that produced this list of terms did not understand the meaning behind any of these words, we can examine the term list and find that the words have real-world meaning.  The first topic is a list of the physical characteristics of earbuds, the third topic is a list of features headphones can have, the fifth topic contains words related to technical specifications, the fourth and sixth topics appear to be mixtures of filler words, the seventh topic seems to be associated with noise-cancelling features, the eighth topic relates to the physical comfort of using a given pair of headphones, and the tenth topic contains a number of words that audiophiles use to describe headphone quality (its highs and lows, its balance, its dynamism, and so on).

We can then examine the correlation between each topic and review score.  Here we note that topic 10 -- which contains words that are typically associated with audiophile discussions of headphones -- is the most positively associated with review score.  Technical specifications are also positively associated with review score, although this correlation ends up being much smaller.  Noise cancelling features are negatively associated with review score, as are both filler word topics.

Topic 1    0.0289536
Topic 2   -0.1617635
Topic 3   -0.0845823
Topic 4   -0.1780946
Topic 5    0.0873286
Topic 6   -0.0839854
Topic 7   -0.1563137
Topic 8   -0.0173788
Topic 9    0.0953731
Topic 10   0.2458633
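Each entry in this table is a Pearson correlation between a topic's per-document probability and the review score; a self-contained sketch with illustrative numbers (not the real model output):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative numbers only: one topic's per-review probability and the
# corresponding review scores.
topic_prob = [0.05, 0.40, 0.10, 0.60, 0.02, 0.55]
scores = [3.0, 4.5, 3.5, 5.0, 2.5, 4.0]
print(round(pearson(topic_prob, scores), 3))  # positive: topic tracks high scores
```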

We can also examine the topics most closely associated with different brands to get a sense of each brand's positioning.  (Just to recap from my earlier description, this table lists the topic to which each brand's descriptions are assigned with the highest probability on average). 

Brand Most Common Topic
akg Topic 8
apple Topic 4
audio Topic 8
audiotechnica Topic 8
beats Topic 6
beyerdynamic Topic 8
bose Topic 7
brainwavz Topic 1
jvc Topic 5
philips Topic 8
sennheiser Topic 8
skullcandy Topic 6
sony Topic 8

Note that Brainwavz, a small earbud maker that mostly sells on Amazon, is most associated with topic 1, the list of keywords describing the physical features of earbuds.  Beats and Skullcandy have product descriptions most associated with topic 6 -- a filler topic -- suggesting that these two brands may want to change their marketing if they want to appeal to specialists.  JVC seems to rely more heavily on technical specifications (topic 5) than other brands.  Interestingly, topic 8 -- reflecting information about the comfort of a pair of headphones -- was the top topic for many brands even though its presence was negatively associated with review score.
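The brand-to-topic assignment above can be computed by averaging per-description topic probabilities within each brand and taking the argmax; a sketch with invented probabilities (columns and values are illustrative only):

```python
import pandas as pd

# Invented per-description topic probabilities with a brand label
# (not the real model output; only three topic columns are shown).
df = pd.DataFrame({
    "brand":   ["brainwavz", "brainwavz", "beats", "beats"],
    "topic_1": [0.70, 0.60, 0.10, 0.05],
    "topic_6": [0.10, 0.15, 0.65, 0.80],
    "topic_8": [0.20, 0.25, 0.25, 0.15],
})

# Average topic probability per brand, then take each brand's top topic.
brand_means = df.groupby("brand").mean(numeric_only=True)
top_topic = brand_means.idxmax(axis=1)
print(top_topic.to_dict())  # {'beats': 'topic_6', 'brainwavz': 'topic_1'}
```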

We can repeat the same analysis using the set of topics associated with the "cons" that are listed in each review.  (Note that the unit of analysis is now an individual review and not a product.)

Topic 1:  price, high, low, none, much, volum, better, frequenc, noth, realli
Topic 2:  case, accessori, larg, control, better, driver, limit, hous, includ, iem
Topic 3:  ear, pad, uncomfort, long, heavi, small, headband, cord, head, big
Topic 4:  qualiti, build, isol, poor, cheap, plastic, look, feel, nois, averag
Topic 5:  need, use, good, hard, amp, best, will, expens, sourc, requir
Topic 6:  fit, may, comfort, might, design, issu, like, everyon, signatur, nozzl
Topic 7:  can, bit, littl, get, time, bright, sometim, side, loos, somewhat
Topic 8:  mid, trebl, slight, bass, recess, upper, midrang, harsh, sibil, lower
Topic 9:  cabl, tip, non, stock, microphon, detach, thin, short, easili, remov
Topic 10: bass, lack, detail, soundstag, trebl, extens, sub, roll, end, light

Again, we can see certain patterns associated with the words used in each topic.  Topic 10 contains words related to detail and precision, topic 9 contains words related to the flimsiness and physical characteristics of earbuds, topic 8 contains words related to the mid-range notes in a headphone (note the word "sibil" from "sibilant", referring to the hissing that headphones make if treble notes are not reproduced precisely), topic 7 contains words relating to the fit of a pair of headphones, topic 4 contains words related to build quality, topic 3 contains additional words related to the physical fit of headphones, and topic 2 contains words describing the accessories included with a pair of headphones.

I conclude by examining the correlation between these topics and the review score given in a review.  Topic 10 (the detail in a product's sound), topic 8 (mid-range performance), and topic 4 (build quality) are the only topics negatively correlated with a headphone's review score, suggesting that manufacturers may want to pay special attention to these characteristics when assessing whether a product is the right fit for the specialist market.  Many of the other topics are weakly positively associated with review score, which makes sense.  If a review's "cons" section primarily focuses on the words of topic 2 (which describe the accessories included with a product), that review is likely to be positive, given that users on this forum care primarily about the sonic performance of headphones and other audio products rather than whether these products come with carrying cases.

Topic 1    0.0036798
Topic 2    0.0470067
Topic 3    0.0102361
Topic 4   -0.173448
Topic 5    0.0870136
Topic 6    0.0746698
Topic 7    0.0844362
Topic 8   -0.1014616
Topic 9    0.0427622
Topic 10  -0.0566891

 

Next Steps:

To further utilize the code that I wrote to scrape data from the Head-fi.org website, I plan to implement several major extensions to this project.  First, once the relevant descriptions and reviews are posted, I would like to examine how the set of topics used in the description and discussion of Apple's new noise-cancelling headphones differs from those used to describe Apple's earlier products.  I am especially interested in examining whether Apple is changing its product characteristics to use more of the noise-cancelling words associated with Bose's products, and which words are associated with criticisms of this product in user reviews.  This might provide the reader valuable information about how Apple is planning to position this product.

Second, I want to expand the data set that I am using to include additional reviews that were featured as forum posts.  Since the head-fi.org forum is too sprawling to scrape in its entirety (not to mention that it contains a number of off-topic threads), I would focus on threads whose titles include the name of one of the brands mentioned above.  This would allow the major findings of my two analyses to be properly compared and contrasted.

Third, I want to add additional detail to my analysis of the features associated with the headphones market by running a regression analysis to examine which topics featured in reviews and descriptions are most associated with high scores.  While establishing a pattern of causation between the presence of these topics and high review scores is extremely difficult in the present context, this information might provide industry analysts with additional insights into the types of product descriptions that are most associated with high scores and the characteristics of reviews that tend to rate products highly.  (After limiting the sample to products with a large number of reviews, it might also be interesting to run quantile regression to examine how the distribution of review scores changes based on the presence of certain repeated themes in reviews.)
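The planned regression can be sketched as an ordinary least-squares fit of review score on per-review topic probabilities.  The data below is simulated so that topic 10 drives scores, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Simulated per-review topic probabilities (10 topics, rows sum to one) and
# scores constructed so that topic 10 drives the rating -- illustration only.
topics = rng.dirichlet(np.ones(10), size=300)
scores = 3.0 + 2.0 * topics[:, 9] + rng.normal(0, 0.2, size=300)

model = LinearRegression().fit(topics, scores)
# The largest coefficient should correspond to topic 10 (index 9).
print(int(np.argmax(model.coef_)))
```

On real data, the coefficients would rank which review themes move scores the most, subject to the usual caveat that this is association rather than causation.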

About Author

Kyle D. Weber

Kyle D. Weber is currently training at the NYC Data Science Academy to gain additional proficiency with machine learning techniques, SQL, Python, and database management tools such as Spark, Hadoop, and Hive. He holds a Masters of Philosophy...
