Machines for Machine Jobs

Frederick Cheung
Posted on Dec 23, 2016


Patient personalities can be expressed in certain archetypes. Healthcare practitioners make use of this knowledge by adapting their communication style and frame of mind to each patient, so that they can educate patients and engage them as active members of the decision-making process.

The Question

Can we create tools that let people know their audience ahead of time, so that they can focus on being sensitive listeners and building rapport?


Using TF-IDF, I determined that one group stood out. Although other groups appeared on graphical examination, numerical examination showed that they were not as consistently separated from the rest. Separating groups is possible, but achieving a clear distinction among the groups requires further study.

The Process

Immediate Challenges: In my experience working on data science projects, the rate-limiting factor in defining a project is finding the data set. In this case I was not able to find a data set on patients and clinical diagnoses. I found a research paper on patient food diaries but was unable to acquire its data set. I then found a publication on categorizing restaurants based on their menu items.

Compromise: Because I was not able to find the data set I needed, I found a data set that had been used in solving a similar problem of grouping observations into categories based on text.

The Dataset: It is a collection of menu items with the following features: price, name of restaurant, type of food, and price range of restaurant.

TF-IDF: a method that combines the frequency of a word in a document with its frequency across the corpus to place a numerical weight on the importance of that word
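To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency). It is an illustration only, not the code used for this project; the function name tf_idf and the three sample "menus" are hypothetical, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical stand-ins for the menu text of three price tiers.
menus = [
    "burger fries burger soda",
    "pasta salad soda",
    "truffle risotto salad wine",
]
scores = tf_idf(menus)
# Words sorted from most to least distinctive for the first menu.
top_words = sorted(scores[0], key=scores[0].get, reverse=True)
```

A word that is frequent in one document but rare elsewhere ("burger") scores higher than a word shared across documents ("soda"), which is the property used here to pick distinctive menu words.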

PCA: I initially calculated the TF-IDF scores of all the words for the 4 price ranges of restaurant ($, $$, $$$, $$$$). I then selected the top 50 words for each price range and ran PCA on a concatenated dataframe of the top 50 words from each category.

[Screenshot: PCA plot] You will notice that the graph shows that most of the variation can be explained along one component.
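The idea that one component explains most of the variation can be checked numerically. The sketch below is not the project's code: it uses plain Python and power iteration to estimate the share of total variance captured by the first principal component. The function name and the small score matrix (rows as price ranges, columns as words) are hypothetical. In practice a library PCA, such as scikit-learn's PCA with its explained_variance_ratio_ attribute, would report the same quantity.

```python
import math

def explained_variance_of_first_component(rows, iterations=200):
    """Share of total variance captured by the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient of the unit vector v gives the largest eigenvalue.
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    # The trace of the covariance matrix is the total variance.
    total = sum(cov[i][i] for i in range(d))
    return top / total

# Hypothetical TF-IDF scores: 4 price ranges (rows) by 3 words (columns).
scores = [
    [0.9, 0.1, 0.0],
    [0.7, 0.2, 0.1],
    [0.2, 0.1, 0.0],
    [0.1, 0.0, 0.1],
]
share = explained_variance_of_first_component(scores)
```

A share close to 1 corresponds to the pattern in the plot, where a single component accounts for most of the spread.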

KMeans: I then moved to KMeans to determine whether any clustering was possible using the full dataframe of my TF-IDF scores. Based on my knowledge of the data set, I initially ran it with 4 centroids.
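For readers unfamiliar with KMeans, here is a minimal sketch of Lloyd's algorithm, the standard procedure behind it. This is an illustration rather than the code used for the project: the kmeans function, the seed, and the toy two-dimensional points are all assumptions, and the example uses 2 centroids on toy data instead of the 4 centroids used on the real TF-IDF dataframe.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (lists of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical toy data: two tight groups far apart.
points = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
]
centroids, labels = kmeans(points, k=2)
```

Because the starting centroids are random, the cluster numbering is arbitrary; only the grouping of points is meaningful.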

Due Diligence: To ensure that I was not inventing clusters where there were none, I also ran it with different numbers of centroids. I used a graph of the points to determine visually which ones stood out, and I validated these findings numerically by comparing the Euclidean distances between the centroids.

[Screenshot: plot of 6 centroids] Here is a basic plot of 6 centroids. You will notice that there are only two points that truly stand out.
[Screenshots: the same 6 centroids, color coded] Here are the same 6 centroids but color coded. We can see from the color coding that it is centroid 5 that stands out most consistently.
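The numerical check described above, comparing the Euclidean distance between centroids, can be sketched as follows. The helper names and the four centroid coordinates are hypothetical and chosen only for illustration; the real centroids live in the much higher-dimensional TF-IDF space.

```python
import math
from itertools import combinations

def pairwise_centroid_distances(centroids):
    """Euclidean distance for every unordered pair of centroids."""
    return {
        (i, j): math.dist(centroids[i], centroids[j])
        for i, j in combinations(range(len(centroids)), 2)
    }

def mean_distance_to_others(centroids):
    """For each centroid, its average distance to all other centroids."""
    n = len(centroids)
    return [
        sum(math.dist(c, other) for j, other in enumerate(centroids) if j != i) / (n - 1)
        for i, c in enumerate(centroids)
    ]

# Hypothetical centroids: three close together and one far away.
centroids = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 4.0]]
distances = pairwise_centroid_distances(centroids)
isolation = mean_distance_to_others(centroids)
# The centroid with the largest average distance is the one that "stands out".
most_isolated = max(range(len(centroids)), key=isolation.__getitem__)
```

Repeating this for several values of k shows whether the same centroid stays isolated, which is a more reliable signal than a single plot.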

Application to Industries

Healthcare: Through grouping, it may be possible for doctors to know which type of patient they are about to see and which topics to stress when communicating ideas or asking questions, based on previous diagnoses or documentation of patient symptoms
Insurance: Through grouping, it may be possible to offer lower premiums to patients who keep food and health journals and who are shown to take care of themselves and have less need for catastrophic care
Marketing: Through grouping, it may be possible to use online blogs and reviews to understand certain segments of a market and see what they value, based on how they describe and critique products

Future Work and References

TSNE: a plotting technique that can demonstrate relationships even with a large number of features
LSA: a different method of determining which words are important within a text
Cosine Similarity: as opposed to the Euclidean distance which I used in the KMeans comparison
Amazon Mechanical Turk: using human responses to develop a training set for future grouping projects
Regex Parts of speech
BOI categorization of words
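As a small illustration of the Cosine Similarity item above, the sketch below contrasts it with Euclidean distance. The vectors are hypothetical word-weight vectors, not project data. Cosine similarity compares direction only, so two texts with the same word proportions but different lengths look identical, while Euclidean distance treats them as far apart.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical word-weight vectors.
short_text = [1.0, 2.0, 0.0]
long_text = [10.0, 20.0, 0.0]   # Same proportions, ten times the magnitude.
unrelated = [0.0, 0.0, 3.0]     # Shares no words with the other two.

same_direction = cosine_similarity(short_text, long_text)
no_overlap = cosine_similarity(short_text, unrelated)
euclidean_gap = math.dist(short_text, long_text)
```

Here same_direction is approximately 1 and no_overlap is 0, even though the Euclidean distance between short_text and long_text is large.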

Thanks to Dan Jurafsky author of "Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus" for making the data set publicly accessible
