Machines for Machine Jobs

Frederick Cheung

Posted on Dec 23, 2016

Background

Patient personalities can be grouped into certain archetypes. Healthcare practitioners make use of this knowledge by adapting their communication style and frame of mind to each patient, so that they can educate patients and engage them as active members of the decision-making process.

The Question

Can we create tools that let people know their audience ahead of time, so that they can focus on being sensitive listeners and building rapport?

Results

Using TF-IDF, I was able to determine that one group stood out. Although other groups appeared to exist on graphical examination, numerical examination showed that they were not as consistently separated from the rest. Separating groups is possible, but separating them with a clear distinction requires further study.

The Process

Immediate Challenges: In my experience working on data science projects, the rate-limiting factor in defining a project is finding the data set. In this case I was not able to find a data set on patients and clinical diagnoses. I found a research paper on patient food diaries but was unable to acquire its data set. I then found a publication on the categorization of restaurants based on their menu items.

Compromise: Because I was not able to find the data set I needed, I found a data set that had been used in solving a similar problem of grouping observations into categories based on text.

The Dataset: It is a collection of menu items with the following features: item price, name of restaurant, type of food, and price range of restaurant.

TF-IDF: a method of using the frequency of a word in a document and its frequency within a corpus to place a numerical weight on the importance of a word
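The weighting described above can be sketched with the Python standard library alone. The toy menu snippets, the function name `tf_idf`, and the specific formula (relative term frequency times log(N / document frequency)) are assumptions for illustration; the original analysis may have used a library implementation with different smoothing.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical menu text, one "document" per group being compared.
menus = [
    "burger fries burger soda",
    "pasta salad wine",
    "truffle risotto wine",
]
weights = tf_idf(menus)
```

With this formula a word that appears in every document gets a weight of zero, while a word that is frequent in one document and rare elsewhere (here "burger") gets a high weight.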

PCA: I initially calculated the TF-IDF scores of all the words for the 4 price ranges of restaurant ($, $$, $$$, $$$$). I then selected the top 50 words for each price range and ran PCA on a concatenated dataframe of the top 50 words from each category.

You will notice that the graph shows that most of the variation can be explained along one component
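The share of variation captured by the first component can also be checked numerically. The following is a minimal sketch, not the original analysis: it builds the covariance matrix by hand and uses power iteration to estimate the largest eigenvalue. The function name `first_component_share` and the toy points are hypothetical, and the data is assumed to have at least two rows and non-zero variance.

```python
import math
import random

def first_component_share(rows, iterations=200, seed=0):
    """Estimate the fraction of total variance along the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the top eigenvalue.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    top_eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return top_eigenvalue / total_variance

# Hypothetical points lying almost on a straight line.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
share = first_component_share(points)
```

For points that lie almost on a line, `share` comes out close to 1, which is the numerical counterpart of a plot in which one component explains most of the variation.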

KMeans: I then moved to KMeans to determine whether there was any clustering I could do using the full data frame of my TF-IDF scores. Using my knowledge of the data set, I initially ran it with 4 centroids.
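For readers unfamiliar with KMeans, here is a minimal sketch of the standard Lloyd's algorithm in plain Python. It is not the code used for this project: the function name `kmeans`, the random initialization, and the toy 2-D points are assumptions, and a real TF-IDF matrix would have many more dimensions and would normally go through a library implementation.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
```

The number of centroids `k` is chosen by the analyst, which is why the choice of 4 (one per price range) needs the sanity checks described next.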

Due Diligence: However, to ensure that I was not inventing clusters where there were none, I also ran it with different numbers of centroids. I used a graph of the points to determine visually which ones stood out, and I validated these findings numerically by comparing the Euclidean distances between centroids.
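One way to make that numerical comparison concrete is to compute every pairwise Euclidean distance between centroids and then each centroid's mean distance to the others. This is a sketch under assumptions: the helper name `centroid_separation`, the "mean distance to the others" summary, and the toy centroids are illustrative choices, not the exact procedure used here. It expects at least two centroids.

```python
import math
from itertools import combinations

def centroid_separation(centroids):
    """Return pairwise distances and each centroid's mean distance to the others."""
    pairwise = {
        (i, j): math.dist(centroids[i], centroids[j])
        for i, j in combinations(range(len(centroids)), 2)
    }
    mean_to_others = []
    for i in range(len(centroids)):
        dists = [d for pair, d in pairwise.items() if i in pair]
        mean_to_others.append(sum(dists) / len(dists))
    return pairwise, mean_to_others

# Hypothetical 2-D centroids; index 3 is deliberately placed far from the rest.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (8.0, 8.0)]
pairwise, mean_to_others = centroid_separation(centroids)
most_isolated = max(range(len(centroids)), key=lambda i: mean_to_others[i])
```

Repeating this for several values of k and checking whether the same centroid keeps coming out as the most isolated is one way to test whether a cluster stands out consistently.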

Here is a basic plot of 6 centroids. You will notice that there are only two points that truly stand out.

Here are the same 6 centroids, color coded. Based on the color coding, we can see that it is centroid 5 that stands out most consistently.

Application to Industries

Healthcare: Through grouping, it may be possible for doctors to be aware of the type of patient they are about to see and to know which topics to stress when communicating ideas or asking questions, based on previous diagnoses or documentation of patient symptoms.

Insurance: Through grouping, it may be possible to offer lower premiums to patients who keep food and health journals and who may be shown to take care of themselves and have less need for catastrophic care.

Marketing: Through grouping, it may be possible to use online blogs and reviews to understand what certain segments of a market value, based on how they describe and critique products.

Future Work and References

TSNE: a plotting technique that can demonstrate relationships even with a large number of features
LSA: a different method of determining which words are important within a text
Cosine Similarity: as opposed to the Euclidean distance which I used in the KMeans comparison
Amazon Mechanical Turk: using human responses to develop a training set for future grouping projects
Regex Parts of speech
BOI categorization of words
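To illustrate why cosine similarity is listed as an alternative to Euclidean distance, here is a small sketch. The two vectors are hypothetical word-weight vectors with the same proportions but different magnitudes, and `cosine_similarity` is a hand-written helper, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same word proportions, ten times the magnitude

similarity = cosine_similarity(short_doc, long_doc)
distance = math.dist(short_doc, long_doc)
```

Here the cosine similarity is essentially 1 even though the Euclidean distance is large, so cosine similarity compares the mix of words rather than the overall size of the vectors.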

Thanks to Dan Jurafsky, author of "Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus", for making the data set publicly accessible.

About Author

Frederick Cheung

Hi my name is Fred. Although my educational background is an M.S. in Medical Science, my professional experience is with Small Business management, operations and sustainable business practices. I’ve recently completed a Data Science program working with languages...
