Machines for Machine Jobs
Background
Patient personalities can be expressed in certain archetypes. Healthcare practitioners make use of this knowledge by adapting their communication style and frame of mind to each patient, so that they can educate patients and engage them as active members of the decision-making process.
The Question
Can we create tools that let people know their audience ahead of time, so that they can focus on being sensitive listeners and building rapport?
Results
Using TF-IDF, I was able to determine that one group stood out. Although graphical examination suggested other groups, numerical examination showed that they were not as consistently separated from the rest. Separating groups is possible, but separating them with a clear distinction among the groups requires further study.
The Process
Immediate Challenges: In my experience working on data science projects, the rate-limiting factor in defining a project is finding the data set. In this case I was not able to find a data set specifically about patients and clinical diagnosis. I found a research paper on patient food diaries but was unable to acquire its data set. I then found a publication on categorizing restaurants based on their menu items.
Compromise: Because I could not find the data set I needed, I used one that had been applied to a similar problem: grouping observations into categories based on text.
The Dataset: A collection of menu items with the following features: item price, restaurant name, type of food, and restaurant price range.
TF-IDF: a method that uses the frequency of a word in a document and its frequency within a corpus to place a numerical weight on the importance of that word
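To make the weighting concrete, here is a minimal sketch in plain Python. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); libraries often add smoothing or normalization, so the exact numbers will differ from what a given tool reports. The toy corpus and the `tf_idf` helper are hypothetical, not the code used in this project.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. Assumed variant:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: one "document" of menu words per group.
docs = [
    "burger fries burger soda".split(),
    "salad soup burger".split(),
    "truffle foie gras soup".split(),
]
scores = tf_idf(docs)
# Highest-weighted words for the first document.
top_words = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
print(top_words)
```

Note that "burger" is the most frequent word in the first document but is not its top-weighted word, because it also appears in a second document. That is the effect the corpus-level frequency is meant to have.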
PCA: I initially calculated the TF-IDF scores of all the words for the 4 price ranges of restaurant ($, $$, $$$, $$$$). I then selected the top 50 words for each price range and ran PCA on a concatenated dataframe of the top 50 words from each category.
Figure: The graph shows that most of the variation can be explained along one component.
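The claim that one component explains most of the variation can also be checked numerically. The sketch below is a stand-in for a library PCA, not the code used here: it estimates the largest eigenvalue of the covariance matrix by power iteration and divides it by the total variance. The function name and the toy rows are hypothetical.

```python
import math
import random

def first_component_share(rows, iterations=500):
    """Share of total variance carried by the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d); requires at least two rows.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    if total_variance == 0:
        return 0.0  # all rows identical, nothing to explain
    # Power iteration from a fixed random start vector.
    rng = random.Random(0)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v approximates the top eigenvalue.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    top_eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return top_eigenvalue / total_variance

# Hypothetical rows that lie almost on a straight line.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
share = first_component_share(rows)
print(round(share, 4))
```

A share close to 1 means a single direction captures nearly all of the spread. A share close to 1/d means no direction is special.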
KMeans: I then moved to KMeans to determine whether there was any clustering in the full data frame of TF-IDF scores. Based on my knowledge of the data set, I initially ran it with 4 centroids.
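For readers unfamiliar with the algorithm, here is a minimal sketch of the standard KMeans loop (Lloyd's algorithm) in plain Python. It is illustrative only: it seeds the centroids with the first k points so the result is reproducible, which is an assumption of this sketch, whereas library implementations such as scikit-learn's `KMeans` use better initialization. The toy points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The number of centroids is an input, not something the algorithm discovers, which is why the choice of 4 needs to be checked against other values.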
Due Diligence: To ensure that I was not inventing clusters where there were none, I also ran it with different numbers of centroids. I used a graph of the points to determine visually which ones stood out, and I validated these findings numerically by comparing the Euclidean distances between the centroids.
Figure: A basic plot of 6 centroids. Only two points truly stand out.
Figure: The same 6 centroids, color coded. Based on the color coding, centroid 5 stands out most consistently.
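The numerical check on the centroids can be sketched as follows. This is one plausible way to do it, not necessarily the exact comparison used here: for each centroid, take the mean Euclidean distance to every other centroid and see which one is farthest from the rest. The centroid coordinates are hypothetical.

```python
import math

def mean_distance_to_others(centroids):
    """For each centroid, the mean Euclidean distance to all other centroids.

    Requires at least two centroids.
    """
    result = []
    for i, a in enumerate(centroids):
        others = [math.dist(a, b) for j, b in enumerate(centroids) if j != i]
        result.append(sum(others) / len(others))
    return result

# Hypothetical centroids: three close together, one far away.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
isolation = mean_distance_to_others(centroids)
most_isolated = max(range(len(centroids)), key=lambda i: isolation[i])
print(most_isolated, [round(d, 2) for d in isolation])
```

Repeating this for several values of k shows whether the same cluster keeps standing out or whether the separation depends on the number of centroids chosen.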
Application to Industries
Healthcare: Through grouping, it may be possible for doctors to be aware of the type of patient they are about to see and to know which topics to stress when communicating ideas or asking questions, based on previous diagnoses or documented patient symptoms.
Insurance: Through grouping, it may be possible to offer lower premiums to patients who keep food and health journals, who may be shown to take care of themselves and to have less need for catastrophic care.
Marketing: Through grouping, it may be possible to use online blogs and reviews to understand what certain segments of a market value, based on how they describe and critique products.
Future Work and References
TSNE: a plotting technique that can demonstrate relationships even with a large number of features
LSA: a different method of determining which words are important within a text
Cosine Similarity: as opposed to the Euclidean distance which I used in the KMeans comparison
Amazon Mechanical Turk: using human responses to develop a training set for future grouping projects
Regex Parts of speech
BOI categorization of words
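Of the items above, cosine similarity is the simplest to illustrate. The sketch below, with hypothetical weight vectors, shows why it might behave differently from Euclidean distance on TF-IDF scores: two vectors with the same word proportions but different magnitudes are far apart in Euclidean terms yet have a cosine similarity of 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical word-weight vectors over the same three-word vocabulary.
short_menu = [1.0, 2.0, 0.0]
long_menu = [10.0, 20.0, 0.0]   # same proportions, ten times the magnitude
other_menu = [0.0, 0.0, 3.0]    # no words in common with the others

same_direction = cosine_similarity(short_menu, long_menu)
no_overlap = cosine_similarity(short_menu, other_menu)
euclidean_gap = math.dist(short_menu, long_menu)
print(round(same_direction, 6), no_overlap, round(euclidean_gap, 2))
```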
Thanks to Dan Jurafsky, author of "Linguistic Markers of Status in Food Culture: Bourdieu's Distinction in a Menu Corpus", for making the data set publicly accessible.