Machines for Machine Jobs

Posted on Dec 23, 2016


Patient personalities can be described in terms of certain archetypes. Healthcare practitioners use this knowledge by adapting their communication style and frame of mind to each patient, so they can better educate patients and engage them as active participants in the decision-making process.

The Question

Can we create tools that let people know their audience ahead of time, so we can focus on being sensitive listeners and building rapport?


Using TF-IDF, I was able to determine that one group stood out. Although other groups appeared on graphical examination, numerical examination showed that they were not as consistently separated from the rest. Separating the groups is possible, but achieving a clear distinction among them requires further study.

The Process

Immediate Challenges: In my experience working on data science projects, the rate-limiting factor in defining a project is finding the data set. In this case I was not able to find a data set specifically regarding patients and clinical diagnosis. I found a research paper regarding patient food diaries but was unable to acquire its data set. I then found a publication regarding the categorization of restaurants based on their menu items.

Compromise: Because I was not able to find the data set I needed, I found a data set that had been used in solving a similar problem of grouping observations into categories based on text.

The Dataset: A collection of menu items with the following features: price, name of restaurant, type of food, and price of restaurant.

TF-IDF: a method that uses the frequency of a word in a document and its frequency across a corpus to assign a numerical weight to the importance of that word.
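As a rough illustration of the weighting described above, here is a minimal standard-library sketch. It assumes the common definition (term frequency multiplied by the log of inverse document frequency), which may differ from the variant used in the project. The menu strings are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical menu text, one "document" per group of restaurants.
menus = [
    "burger fries burger soda",
    "salad soup burger",
    "truffle risotto salad",
]
scores = tf_idf(menus)
print(scores[0])
```

In this toy corpus "fries" outweighs "burger" in the first document even though "burger" occurs twice, because "burger" also appears in another document and is therefore less distinctive.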

PCA: I initially calculated the TF-IDF scores of all the words for the 4 restaurant price ranges ($, $$, $$$, $$$$). I then selected the top 50 words for each price range and ran PCA on a concatenated dataframe of the top 50 words from each category.
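To make the idea of "variation explained along one component" concrete, the sketch below estimates the share of total variance captured by the first principal component. It is a standard-library approximation using power iteration on the covariance matrix, not the code used in the project; in practice a library routine such as scikit-learn's `PCA` would be the usual choice. The rows are made-up stand-ins for per-word TF-IDF scores across the 4 price ranges.

```python
import math

def first_component_ratio(rows, iterations=200):
    """Share of total variance captured by the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    # Power iteration converges toward the top eigenvector of the covariance.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient of that vector approximates the top eigenvalue.
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return top / total_variance

# Hypothetical scores: one row per word, one column per price range.
rows = [
    [0.10, 0.12, 0.09, 0.11],
    [0.30, 0.33, 0.28, 0.31],
    [0.50, 0.49, 0.52, 0.51],
    [0.70, 0.68, 0.71, 0.72],
    [0.90, 0.91, 0.88, 0.93],
]
ratio = first_component_ratio(rows)
print(f"first component explains {ratio:.1%} of the variance")
```

Because the four columns in this toy data move together almost perfectly, nearly all of the variance falls on a single component.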

[Figure: screenshot] The graph shows that most of the variation can be explained along one component.

KMeans: I then moved to KMeans to determine whether any clustering was possible using the full data frame of my TF-IDF scores. Based on my knowledge of the data set, I initially ran it with 4 centroids.
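For readers unfamiliar with KMeans, the following is a minimal standard-library sketch of the underlying procedure (Lloyd's algorithm) run with 4 centroids. It is not the project's code: the 2-D points are hypothetical stand-ins for TF-IDF rows, and the evenly spaced initialisation is a simplification (libraries typically use smarter schemes such as k-means++).

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    # Simplified deterministic start: pick k evenly spaced rows.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical data: four small, well-separated groups.
points = [
    (0, 0), (0, 1),
    (10, 0), (10, 1),
    (0, 10), (0, 11),
    (10, 10), (10, 11),
]
centroids, labels = kmeans(points, k=4)
print(labels)
print(centroids)
```

The number of centroids has to be chosen up front, which is why it is worth checking whether the resulting clusters hold up under other choices.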

Due Diligence: To ensure that I was not inventing clusters where there were none, I also ran it with different numbers of centroids. I used a graph of the points to see which ones stood out visually, and then validated these findings numerically by comparing the Euclidean distances between the centroids.
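One way such a numerical check could look is sketched below. Summarising each centroid by its mean Euclidean distance to all the others is an assumption made for this example (the exact comparison used in the project is not described), and the centroid coordinates are made up.

```python
import math

def centroid_separation(centroids):
    """Mean Euclidean distance from each centroid to all the others."""
    result = []
    for i, a in enumerate(centroids):
        others = [math.dist(a, b) for j, b in enumerate(centroids) if j != i]
        result.append(sum(others) / len(others))
    return result

# Hypothetical centroids: the last one sits far away from the rest.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (8.0, 8.0)]
separation = centroid_separation(centroids)
most_isolated = max(range(len(centroids)), key=separation.__getitem__)
print(separation)
print("most isolated centroid:", most_isolated)
```

Repeating this for several choices of the number of centroids shows whether the same cluster keeps standing out or whether the separation is an artifact of one particular run.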

[Figure: screenshot] A basic plot of 6 centroids. Only two points truly stand out.
[Figures: screenshots] The same 6 centroids, color coded. Based on the color coding, centroid 5 stands out most consistently.

Application to Industries

Healthcare: Through grouping, it may be possible for doctors to know which type of patient they are about to see, and which topics to stress when communicating ideas or asking questions, based on previous diagnoses or documentation of patient symptoms.
Insurance: Through grouping, it may be possible to offer lower premiums to patients who keep food and health journals and who may be shown to take care of themselves and have less need for catastrophic care.
Marketing: Through grouping, it may be possible to use online blogs and reviews to understand segments of a market and see what they value, based on how they describe and critique products.

Future Work and References

TSNE: a plotting technique that can demonstrate relationships even with a large number of features
LSA: a different method of determining which words are important within a text
Cosine Similarity: as opposed to the Euclidean distance which I used in the KMeans comparison
Amazon Mechanical Turk: Using Human responses to develop a training set for future grouping projects
Regex Parts of speech
BOI categorization of words
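To illustrate why cosine similarity is listed above as an alternative to Euclidean distance, here is a small standard-library sketch. The two vectors are hypothetical word-weight vectors that point in the same direction but differ in magnitude, as might happen when comparing a short menu with a long one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical weight vectors with the same proportions but different scale.
short_menu = [1.0, 2.0, 0.0]
long_menu = [3.0, 6.0, 0.0]

similarity = cosine_similarity(short_menu, long_menu)
distance = math.dist(short_menu, long_menu)
print("cosine similarity:", similarity)
print("euclidean distance:", distance)
```

Cosine similarity treats these two vectors as essentially identical, while Euclidean distance reports them as far apart, so the two measures can group the same text data differently.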

Thanks to Dan Jurafsky, author of "Linguistic Markers of Status in Food Culture: Bourdieu’s Distinction in a Menu Corpus", for making the data set publicly accessible.

About Author

Frederick Cheung

Hi my name is Fred. Although my educational background is an M.S. in Medical Science, my professional experience is with Small Business management, operations and sustainable business practices. I’ve recently completed a Data Science program working with languages...