Latent Dirichlet Allocation and Topic Modelling: A Link Between The Quantitative and Qualitative?

Kyle D. Weber

Posted on Nov 4, 2019

Introduction/Motivation:

When discussing the application of machine learning techniques and the use of data science in industrial settings, one challenge is that the quantitative techniques that are commonly used -- regularized regression, GLMs, time series techniques, and decision trees -- may be hard to explain to non-practitioners. Moreover, the concepts and variables whose relationships we want to test are often not explicitly defined in our data set, and practitioners may be hesitant to manually code additional variables in cases where the appropriate boundaries are hard to define.

One methodology that helps with both of these issues is topic modelling, which uses variation in the tendency of different sets of words or phrases to appear together to generate a set of topics, each defined by the probabilities that individual words will appear in it. Words that are more strongly associated with a topic will be estimated as having a high probability of appearing in that topic. This technique can easily be explained to workers outside the data science field by showing the lists of words associated with different topics. Moreover, individuals with domain knowledge might be able to assist data scientists in interpreting the meaning of the words associated with different topics, making this technique uniquely well-suited to cross-team collaboration. Another advantage is that LDA (Latent Dirichlet Allocation, the topic modelling technique used in this post) often ends up identifying concepts that would be much more difficult to define a priori. (For example, an LDA model might end up including a topic with words like "storms," "sunshine," and "showers" associated with weather, but it might be difficult to guess in advance which words appear in a weather-related context for corpora consisting of hundreds of thousands of pages.)

The goal of this blog post is threefold. First, I want to introduce a new data set of scraped information about headphones that I collected and show how it might be used to gather information about the current and historical state of the headphone market. Next, I want to briefly explain what topic modelling is, what its output might look like, and its advantages and disadvantages. Finally, I want to provide some context for how topic modelling can help organizations and data scientists extract qualitative information about a product or business from textual data.

Data:

The data that I scraped for this project was collected from Head-fi.org, a website dedicated to providing audiophiles and other fans of portable audio equipment a space to discuss different brands and products. While the website's main forum contains a huge number of reviews, I chose to focus on the "Head Gear" portion of the site, which contains descriptions and reviews for 6,813 different products listed under the main category of headphones. The descriptions are generally copied directly from manufacturers, while reviews are written by individual users. Items are also categorized on the site based on which type of headphone they represent. There are 10,301 reviews in this data set, with each reviewed product getting an average of 3.7 reviews. I used the Scrapy package in Python to extract this information from the head-fi.org site, and I used RStudio to generate the plots discussed below.

I was interested in scraping information about headphones because I had some background knowledge about the product (I've bought a lot of headphones in my life), so I thought it would be an interesting context in which to show how textual analysis could be combined with domain-specific knowledge to better understand an industry's dynamics. I also liked the huge amount of product and brand variety present in this market (it does not exhibit the amount of concentration seen in some software and professional hardware markets). The market will also likely remain the site of robust competition as other hardware markets mature and greater focus turns to the relative advantages of different "wearable" technologies, so I think there will be some value in being able to compare my current findings with the characteristics of the head-fi.org site in the future.

I chose the head-fi.org site specifically for three reasons. First, head-fi.org proved to be relatively cooperative for scraping (although they did limit the number of concurrent requests I could make). Second, I really appreciated that -- as a specialist forum -- the reviews that were posted to the site generally ended up being extremely detailed in their descriptions of the products. To provide one rough illustration, there are over fifty reviews that would be over ten pages if typed and printed (meaning they have more than 34,000 characters at roughly 3,400 characters per page). Third, the site has more traffic than most of its competitors (whathifi.com, avforums.com, headphones.com, audioholics.com, etc.) according to Alexa, which suggests that this site would likely have richer data than competing forums.
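For readers who want to stay within a site's request limits, a Scrapy project can cap its own request rate in its settings file. The snippet below is a minimal sketch and not the configuration used for this project: the setting names are standard Scrapy settings, but every value is an illustrative assumption.

```python
# settings.py for a hypothetical Scrapy project (all values are assumptions).

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Keep only a couple of requests in flight against any one domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Wait at least this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Lower concurrency makes a full crawl slower, but it reduces the chance of being throttled or blocked partway through.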

Exploratory Data Analysis:

Before turning to how I used textual analysis methods, I want to discuss some analytical questions that could be answered without using textual analysis.

First, we can examine the distribution of ratings that are associated with Head-fi reviews, finding that it has significant skew and that the peak of the rating distribution appears around a rating of four. Given that this is an enthusiast community, this is likely explained by the fact that better informed reviewers will tend to purchase products that they like. This suggests that explaining which products get reviews is as important as understanding which products have positive reviews when modelling how products gain the approval of this community.

Second, by examining the distribution of products across categories based on the date they were posted, we can examine various trends in the market for headphones. More specifically, the graph of category shares over time shows that the proportion of reviews categorized as over-ear and on-ear headphones has tended to fall over time, while the proportion categorized as universal fit or in-ear has tended to rise.

We can also use information from our data set to examine whether various brands produce products that are significantly better or worse than the average of all other products in the data set. Most of the easily recognizable brands are either worse than average or not significantly different from average. This is because specialist brands, which are likely less recognizable to the readers of this blog, tended to have review scores that were significantly greater than average. One last caveat is that Sennheiser had significantly more headphones in the data set than any other manufacturer, meaning that its mean score is more precisely estimated from the data. As a result, Sennheiser tested as significantly worse than the average of other manufacturers despite having a better mean score than other brands that were not significantly worse than average. Significance in this case is therefore more attributable to the greater precision with which the Sennheiser mean score is estimated than to Sennheiser products being necessarily worse than the products of other brands.

Brand	Different From Average?	P-Value	Which Direction?	Mean Value of Brand
akg	No	0.2497	N/A	3.95
apple	Yes	0.0409	Worse	2.75
audiotechnica	No	0.5052	N/A	4.09
beats	Yes	<.01	Worse	2.81
beyerdynamic	No	0.1383	N/A	4.17
bose	Yes	0.0123	Worse	3.51
brainwavz	No	0.0603	N/A	3.89
jvc	Yes	<.01	Worse	3.69
sennheiser	Yes	0.0285	Worse	3.93
skullcandy	Yes	<.01	Worse	3.06
sony	Yes	<.01	Worse	3.79
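The post does not say which test produced these p-values. As a rough sketch of the precision point made above, the snippet below compares a brand's mean score against everyone else's using unequal-variance standard errors and a normal approximation for the p-value (an assumption that is only reasonable for large samples). The function name and all ratings are made up for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def brand_vs_rest(brand_scores, other_scores):
    """Return (z, two-sided p) for the difference in mean score.

    Uses unequal-variance (Welch-style) standard errors and a normal
    approximation instead of the t distribution.
    """
    standard_error = math.sqrt(
        variance(brand_scores) / len(brand_scores)
        + variance(other_scores) / len(other_scores)
    )
    z = (mean(brand_scores) - mean(other_scores)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up ratings: both brands average 3.9 against 4.0 for everyone else.
pattern = [3, 4, 5, 4, 3.5]
others = [3, 4, 5, 4, 4] * 400
small_brand = pattern * 2     # 10 reviews
large_brand = pattern * 100   # 500 reviews

z_small, p_small = brand_vs_rest(small_brand, others)
z_large, p_large = brand_vs_rest(large_brand, others)
print(f"small brand: z={z_small:.2f}, p={p_small:.3f}")
print(f"large brand: z={z_large:.2f}, p={p_large:.4f}")
```

Both hypothetical brands trail the rest by the same 0.1 points, yet only the brand with 500 reviews clears the 5% threshold, because its mean is estimated with a much smaller standard error.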

Finally, I wanted to test my statement above that smaller specialist brands tended to sell products that were more highly rated than the products of larger brands. To do so, I produced a scatter plot (with some jitter to avoid overlapping points) of each brand's average score against its number of products, with a LOESS regression overlay to show an estimate of the non-linear relationship between these variables. From this graph, we can see that brands with more products initially have higher average scores (since some one-product brands have average scores below two) but that after a certain point brands with more products start having lower average scores. This gives consumers useful information for assessing how much a manufacturer's size should matter when deciding which headphones to buy. The smallest headphone makers have an extremely large dispersion of product quality, but larger headphone makers tend to make products whose average score is lower than that of smaller manufacturers.
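To make the LOESS overlay more concrete, here is a minimal pure-Python sketch of local linear smoothing with tricube weights. It is not the R routine used for the plot, it omits the robustness iterations of full LOWESS, and the function name, the `frac` parameter, and the brand data are all illustrative assumptions.

```python
import math


def loess_at(xs, ys, x0, frac=0.6):
    """Estimate y at x0 with a tricube-weighted local linear fit."""
    k = max(2, math.ceil(frac * len(xs)))          # neighbours in each fit
    # Bandwidth: distance to the k-th nearest point (padded to avoid zero).
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1] + 1e-9
    weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3 for x in xs]

    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    spread = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    if spread == 0:                                # all weight on one x value
        return mean_y
    slope = sum(
        w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys)
    ) / spread
    return mean_y + slope * (x0 - mean_x)


# Hypothetical brands: (number of products, average score).
n_products = [1, 1, 2, 3, 5, 8, 12, 20, 40, 80]
avg_score = [1.8, 4.6, 3.9, 4.2, 4.3, 4.2, 4.1, 4.0, 3.9, 3.8]
smoothed = [loess_at(n_products, avg_score, n) for n in n_products]
```

Each smoothed value comes from a separate weighted regression that only "sees" nearby brands, which is what lets the fitted curve rise and then fall rather than follow a single straight line.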

Several Applications of Textual Analysis:

Before proceeding with some examples of what textual analysis can help us conclude about this data set, it is worth providing a simple layman's explanation of how textual analysis with LDA works in general. The basic idea is that -- through some type of optimization method -- the LDA technique finds the distribution of words in each topic (in other words, how often we would expect each word to appear in a document on that topic) and the distribution of topics in each document (in other words, the probability that each document is about a specific topic) that best match the observed frequency with which words appear across a large number of documents (in my case, documents are reviews).
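The post does not say which fitting procedure or library was used. As one illustration of the idea, the sketch below fits LDA with collapsed Gibbs sampling, a common sampling-based alternative to optimization-based fitting. The function name, the hyperparameters `alpha` and `beta`, and the toy corpus are all assumptions made for this example.

```python
import random


def fit_lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists.

    Returns (theta, phi): theta[d][k] is the share of topic k in document d,
    and phi[k][word] is the probability of the word under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]      # token counts per doc/topic
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(n_topics) for _ in doc]
        assignments.append(topics)
        for word, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Turn the final counts into smoothed probability estimates.
    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in row]
        for doc, row in zip(docs, doc_topic)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
         for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus (made up): two weather-like and two audio-like "reviews".
docs = [
    "storms sunshine showers storms showers".split(),
    "sunshine showers storms sunshine".split(),
    "bass treble driver bass cable".split(),
    "driver cable treble bass driver".split(),
]
theta, phi = fit_lda_gibbs(docs, n_topics=2)
for k, word_probs in enumerate(phi):
    top_words = sorted(word_probs, key=word_probs.get, reverse=True)[:3]
    print(f"Topic {k + 1}:", top_words)
```

Here `phi` plays the role of the word lists per topic and `theta` plays the role of the per-document topic probabilities. On a corpus this small the result depends heavily on the random seed, so the output should be read as a demonstration of the mechanics rather than a meaningful model.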

The basic advantage of this method is threefold. First, it allows huge quantities of data to be analyzed much faster than human coders could process individual documents. Second, the list of words associated with individual topics is an extremely accessible output of the LDA model that can be used to gather feedback from other individuals with knowledge of a given industry or field. Finally, the LDA method produces estimates of the probability that individual documents are about each topic that was analyzed. Given that topics can represent abstract concepts, this allows us to generate a numeric measure that is associated with some abstract quantity that would otherwise be difficult to code. The main caveats are that you have to choose the number of topics you want to generate, you need to interpret the meaning of the words associated with each topic, and topics may not represent meaningful concepts. (A textual analysis of customer reviews may generate a topic associated with profanity, rudeness, or insults, but those topics might not be relevant for a company trying to improve its operations.)

We can now examine the words that were assigned the highest probability in each of the ten topics generated by using LDA to analyze the product description data. Note that the text was stemmed to remove suffixes from words so that different tenses and forms of the same word would be treated as a single term.

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
ear	ear	music	design	cabl	music	nois	high	new	driver
earphon	fit	bluetooth	qualiti	frequenc	can	cancel	comfort	product	high
cabl	comfort	wireless	color	driver	bass	listen	design	technolog	balanc
fit	design	control	ipod	respons	will	game	featur	qualiti	low
tip	use	time	bass	plug	like	system	headband	design	armatur
custom	cord	hour	ear	khz	get	use	listen	world	frequenc
isol	earbud	phone	comfort	imped	just	volum	perform	made	dynam
silicon	includ	call	player	sensit	good	featur	pad	one	diaphragm
monitor	carri	microphon	style	type	feel	level	adjust	first	hous
includ	size	devic	music	plate	sound	stereo	ear	develop	mid

Even though the algorithm that produced this list of terms did not understand any of the meaning behind these words, we can examine this term list and find that the words have real-world meaning. The first topic is a list of the physical characteristics of earbuds, the third topic is a list of features headphones can have, the fifth topic contains words related to technical specifications, the fourth and sixth topics appear to be a mixture of filler words, the seventh topic seems to be associated with noise cancellation techniques, the eighth topic relates to the physical comfort associated with using a given pair of headphones, and the tenth topic contains a number of words that are associated with how audiophiles describe headphone quality (its highs and lows, its balance, its dynamism, and so on).

We can then examine the correlation between each topic and review score. Here we note that topic 10 -- which contains words that are typically associated with audiophile discussions of headphones -- is the most positively associated with review score. Technical specifications are also positively associated with review score, although this correlation ends up being much smaller. Noise cancelling features are negatively associated with review score, as are both filler word topics.

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
0.0289536	-0.1617635	-0.0845823	-0.1780946	0.0873286	-0.0839854	-0.1563137	-0.0173788	0.0953731	0.2458633
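The post does not show how these correlations were computed. One plausible reading is a Pearson correlation between each document's estimated probability for a topic and the review score attached to that document, which can be sketched as follows. The function name and all data values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs: one row of topic probabilities per document and
# the review score attached to each document.
topic_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.20, 0.10, 0.70],
    [0.50, 0.30, 0.20],
]
scores = [4.5, 3.0, 4.0, 5.0]

# One correlation per topic: column k of topic_probs against the scores.
correlations = [
    pearson([row[k] for row in topic_probs], scores)
    for k in range(len(topic_probs[0]))
]
```

Because the topic probabilities for a document sum to one, the columns are not independent of each other, so a positive correlation for one topic mechanically pushes the others toward negative values.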

We can also examine the topics most closely associated with different brands to get a sense of each brand's positioning. (To recap from my earlier description, this table lists the topic to which each brand's descriptions are assigned with the highest probability on average.)

Brand	Most Common Topic
akg	Topic 8
apple	Topic 4
audio	Topic 8
audiotechnica	Topic 8
beats	Topic 6
beyerdynamic	Topic 8
bose	Topic 7
brainwavz	Topic 1
jvc	Topic 5
philips	Topic 8
sennheiser	Topic 8
skullcandy	Topic 6
sony	Topic 8

Note that Brainwavz, a small earbud maker that mostly sells on Amazon, is most associated with topic 1 (a list of keywords describing the physical features of earbuds). Beats and Skullcandy have product descriptions that are most associated with topic 6 -- a filler topic -- suggesting that these two brands may want to change their marketing if they want to appeal to specialists. JVC seems to rely more heavily on technical specifications (as represented by topic 5) than other brands. Interestingly, topic 8 -- reflecting information about the comfort of a pair of headphones -- was a top topic for many brands even though its presence was (weakly) negatively associated with review score.
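The brand table above can be read as an average-then-argmax over the per-description topic probabilities. The sketch below shows that calculation under this assumption; the function name and the data are hypothetical.

```python
from collections import defaultdict


def most_common_topic(brands, topic_probs):
    """Map each brand to the 1-based topic with the highest mean probability."""
    sums = {}
    counts = defaultdict(int)
    for brand, probs in zip(brands, topic_probs):
        if brand not in sums:
            sums[brand] = [0.0] * len(probs)
        sums[brand] = [s + p for s, p in zip(sums[brand], probs)]
        counts[brand] += 1

    result = {}
    for brand, totals in sums.items():
        means = [t / counts[brand] for t in totals]
        result[brand] = means.index(max(means)) + 1
    return result


# Hypothetical product descriptions with three topics each.
brands = ["bose", "bose", "jvc", "jvc"]
topic_probs = [
    [0.1, 0.7, 0.2],
    [0.3, 0.5, 0.2],
    [0.6, 0.1, 0.3],
    [0.2, 0.2, 0.6],
]
print(most_common_topic(brands, topic_probs))
```

Reporting only the top topic hides how close the runner-up was, so the full vector of mean probabilities per brand may be worth keeping alongside a summary like this.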

We can repeat the same analysis using the set of topics associated with the "cons" that are listed in each review. (Note that the unit of analysis is now an individual review and not a product.)

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
price	case	ear	qualiti	need	fit	can	mid	cabl	bass
high	accessori	pad	build	use	may	bit	trebl	tip	lack
low	larg	uncomfort	isol	good	comfort	littl	slight	non	detail
none	control	long	poor	hard	might	get	bass	stock	soundstag
much	better	heavi	cheap	amp	design	time	recess	microphon	trebl
volum	driver	small	plastic	best	issu	bright	upper	detach	extens
better	limit	headband	look	will	like	sometim	midrang	thin	sub
frequenc	hous	cord	feel	expens	everyon	side	harsh	short	roll
noth	includ	head	nois	sourc	signatur	loos	sibil	easili	end
realli	iem	big	averag	requir	nozzl	somewhat	lower	remov	light

Again, we can see certain patterns associated with the words used in each topic. Topic 10 contains words related to detail and precision, topic 9 contains words related to the flimsiness and physical characteristics of earbuds, topic 8 contains words related to the mid-range notes in a headphone (note the word "sibil" from "sibilant", referring to the hissing that headphones make if treble notes are not reproduced precisely), topic 7 contains words relating to the fit of a pair of headphones, topic 4 contains words related to build quality, topic 3 contains additional words related to the physical fit of headphones, and topic 2 contains words describing the accessories included with a pair of headphones.

I conclude by examining the correlation between these topics and the review score given in a review. We can note that topic 10 (the detail in a product's sound), topic 8 (mid-range performance), and topic 4 (build quality) are the only topics that are negatively correlated with a headphone's review score, suggesting that manufacturers may want to pay special attention to these characteristics when assessing whether a product is the right fit for the specialist market. Many of the other topics are weakly positively associated with review score, which makes sense. If a review's "cons" section primarily focuses on the words of topic 2 (which discuss the accessories included with a product), that's likely to be a positive review given that users on this forum are primarily interested in the sonic performance of headphones and other audio products rather than whether these products come with carrying cases.

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10
0.0036798	0.0470067	0.0102361	-0.173448	0.0870136	0.0746698	0.0844362	-0.1014616	0.0427622	-0.0566891

Next Steps:

To further utilize the code that I wrote to scrape data from the Head-fi.org website, I plan to implement three major extensions to this project. First, once the relevant descriptions and reviews are posted, I would like to examine how the set of topics used in the description and discussion of Apple's new noise-cancelling headphones differs from those used to describe Apple's earlier products. I am especially interested in examining whether Apple is changing its product descriptions to use more of the noise-cancelling words that are associated with Bose's products, and which words are associated with criticisms of this product in user reviews. This might provide the reader with valuable information about how Apple is planning to position this product.

Second, I want to expand the data set that I am using to include additional reviews that were featured as forum posts. Since the head-fi.org forum is too sprawling to scrape in its entirety (not to mention that it contains a number of off-topic threads), I would want to focus on threads that include in their title the name of one of the brands mentioned above. This would allow the major findings of my two analyses to be properly compared and contrasted.

Third, I want to add detail to my analysis of the features associated with the headphones market by running a regression analysis to examine which topics featured in reviews and descriptions are most associated with high scores. While establishing a pattern of causation between the presence of these topics and high review scores is extremely difficult in the present context, this information might provide industry analysts with additional insights into the types of product descriptions that are most associated with high scores and the characteristics of reviews that tend to rate products highly. (After limiting the sample to products with a large number of reviews, it might also be interesting to run quantile regression to examine how the distribution of review scores changes based on the presence of certain repeated themes in reviews.)

About Author

Kyle D. Weber

Kyle D. Weber is currently training at the NYC Data Science Academy to gain additional proficiency with machine learning techniques, SQL, Python, and database management tools such as Spark, Hadoop, and Hive. He holds a Masters of Philosophy...
