Machines for Machine Jobs
Patient personalities can be grouped into certain archetypes. Healthcare practitioners make use of this knowledge by adapting their communication style and frame of mind to each patient, so they can better educate patients and engage them as active members of the decision-making process.
Can we create tools that let people know their audience ahead of time, so they can focus on being sensitive listeners and building rapport?
Using TF-IDF, I was able to determine that one group stood out. Although other groups appeared to exist on graphical examination, numerical examination showed they were not as consistently separated from the rest. Separating groups is possible, but achieving a clear distinction among the groups requires further study.
Immediate Challenges: In my experience working on data science projects, the rate-limiting factor in defining a project is finding the data set. In this case I was not able to find a data set specifically regarding patients and clinical diagnosis. I found a research paper on patient food diaries but was unable to acquire its data set. I then found a publication on categorizing restaurants based on their menu items.
Compromise: Because I was not able to find the data set I needed, I settled on a data set that had been used to solve a similar problem: grouping observations into categories based on text.
The Dataset: A collection of menu items with the following features: price, name of restaurant, type of food, and price range of the restaurant.
TF-IDF: a method of using the frequency of a word in a document and its frequency within a corpus to place a numerical weight on the importance of a word
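As an illustration of the weighting described above, here is a minimal sketch in plain Python. It assumes one common TF-IDF variant (term count divided by document length, times log(N / document frequency)); the exact variant used in this project is not stated, and the toy corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. This uses tf = count / document length
    and idf = log(N / df), which is one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: one token list per price range.
docs = [
    "fries burger fries soda".split(),
    "burger salad soda".split(),
    "truffle salad risotto".split(),
]
scores = tf_idf(docs)
# Highest-weighted word in each document (ties go to the first word seen).
top_words = [max(s, key=s.get) for s in scores]
```

A word that is frequent in one document but rare across the corpus ("fries" in the first toy document) receives the largest weight, while a word that appears in every document would receive a weight of zero.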
PCA: I first calculated the TF-IDF scores of all the words for the 4 restaurant price ranges ($, $$, $$$, $$$$). I then selected the top 50 words for each price range and ran PCA on a concatenated dataframe of the top 50 words from each category.
|The graph shows that most of the variation can be explained along one component|
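To make "variation explained along one component" concrete, the sketch below computes the share of total variance captured by the first principal component. It is an illustration only: it uses power iteration on the covariance matrix in plain Python rather than whatever PCA implementation was used here, the function name and toy matrix are hypothetical, and a library routine would normally be preferred.

```python
import math

def first_component_ratio(rows, iterations=200):
    """Share of total variance captured by the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of cov,
    # provided the all-ones start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance along v (top eigenvalue).
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    # The trace of the covariance matrix is the total variance.
    total = sum(cov[i][i] for i in range(d))
    return top / total

# Hypothetical toy matrix: the second column is roughly twice the first,
# and the third column is small noise.
rows = [
    [1.0, 2.1, 0.0],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.0],
    [4.0, 7.8, 0.1],
]
ratio = first_component_ratio(rows)
```

A ratio close to 1 means a single component accounts for nearly all of the variation, which is the pattern the graph describes.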
KMeans: I then moved to KMeans to determine whether there was any clustering I could do using the full data frame of my TF-IDF scores. Based on my knowledge of the data set, I initially ran it with 4 centroids.
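For readers unfamiliar with the algorithm, the following is a minimal sketch of KMeans (Lloyd's algorithm) in plain Python. It is not the implementation used in this project. The deterministic farthest-point initialization is a simplification chosen so the example is reproducible, and the function names and toy points are hypothetical.

```python
import math

def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(points, k=2)
```

In the project the rows would be TF-IDF vectors rather than 2-D points, and `k` would be the number of centroids being tried.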
Due Diligence: However, to ensure that I was not inventing clusters where there were none, I ran it with several different numbers of centroids. I used a graph of the points to determine visually which ones stood out, and I validated these findings numerically by comparing the Euclidean distances between the centroids.
|Here is a basic plot of 6 centroids. You will notice that only two points truly stand out|
|Here are the same 6 centroids, color coded. Based on the color coding, centroid 5 stands out most consistently|
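One possible way to do the numerical check described above is to compute, for each centroid, its mean Euclidean distance to all the other centroids and see which one is most separated. This is a sketch under that assumption; the exact comparison used in this project is not specified, and the helper name and centroid coordinates below are hypothetical.

```python
import math

def mean_distance_to_others(centroids):
    """For each centroid, the average Euclidean distance to every other one."""
    result = []
    for i, a in enumerate(centroids):
        others = [math.dist(a, b) for j, b in enumerate(centroids) if j != i]
        result.append(sum(others) / len(others))
    return result

# Hypothetical centroids: three close together and one far away.
centroids = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 4.0)]
separation = mean_distance_to_others(centroids)
# Index of the centroid with the largest mean distance to the rest.
standout = max(range(len(centroids)), key=separation.__getitem__)
```

Repeating this for each number of centroids shows whether the same centroid is consistently the most separated or whether the apparent grouping changes from run to run.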
Application to Industries
|Healthcare: Through grouping, it may be possible for doctors to be aware of the type of patient they are about to see and to know which topics to stress when communicating ideas or asking questions, based on previous diagnoses or documentation of patient symptoms|
|Insurance: Through grouping, it may be possible to offer lower premiums to patients who keep food and health journals and who may be shown to take care of themselves and have less need for catastrophic care|
|Marketing: Through grouping, it may be possible to use online blogs and reviews to understand what certain segments of a market value, based on how they describe and critique products|
Future Work and References
t-SNE: a plotting technique that can reveal relationships even with a large number of features
LSA: a different method of determining which words are important within a text
Cosine Similarity: as opposed to the Euclidean distance which I used in the KMeans comparison
Amazon Mechanical Turk: using human responses to develop a training set for future grouping projects
Regex Parts of speech
BOI categorization of words
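As a small illustration of why cosine similarity is listed above as an alternative to Euclidean distance, the sketch below compares two hypothetical weight vectors that point in the same direction but differ in magnitude. The vectors and the helper function are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical weight vectors: same word proportions, different scale.
short_menu = [1.0, 2.0, 0.0]
long_menu = [10.0, 20.0, 0.0]

similarity = cosine_similarity(short_menu, long_menu)
distance = math.dist(short_menu, long_menu)
```

Here the cosine similarity is essentially 1 even though the Euclidean distance is large, because cosine similarity ignores vector length and compares only direction.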
Thanks to Dan Jurafsky, author of "Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus", for making the data set publicly accessible.