Data Study on the Best Dinners
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Data shows that, like most busy people, we spend a lot of time wondering what we are going to make for dinner.
I wanted to scrape a website that helped me with this decision.
Data
Using Scrapy and Beautiful Soup, I scraped over 1900 recipes from chowhound.com. Among the most important details on the site are the ingredients each recipe requires. I used natural language processing to find the most frequently used ingredients across all 1900 recipes.
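The frequency count described above can be sketched with nothing more than the standard library. This is a minimal illustration, not the pipeline used for the article: the `recipes` list, the `STOP_WORDS` set, and both helper functions are hypothetical stand-ins, and a real run would need a much longer stop-word list.

```python
from collections import Counter
import re

# Hypothetical stand-in for the scraped data: one list of ingredient lines per recipe.
recipes = [
    ["2 cloves garlic, minced", "1 tablespoon olive oil", "1 teaspoon salt"],
    ["3 cloves garlic", "2 tablespoons butter", "1/2 teaspoon salt"],
    ["1 cup flour", "1 teaspoon salt", "2 tablespoons olive oil"],
]

# Assumed stop words: units and preparation terms, not ingredients.
STOP_WORDS = {"cloves", "tablespoon", "tablespoons", "teaspoon", "cup", "minced"}

def ingredient_words(line):
    """Lowercase a line and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def word_frequencies(recipes):
    """Count how often each ingredient word appears across all recipes."""
    counts = Counter()
    for ingredient_lines in recipes:
        for line in ingredient_lines:
            counts.update(ingredient_words(line))
    return counts

frequencies = word_frequencies(recipes)
print(frequencies.most_common(3))
```

A `Counter` like this can feed both a bar chart and a word cloud, since each only needs a mapping from word to count.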
Ingredients
The word cloud and graph illustrate the most frequently used words and how often these ingredients appear across all recipes. From a practical perspective, anyone is more likely to cook a recipe if they already have its ingredients.
Effort
A lot of cooking also comes down to how many ingredients a cook wants to use and how much effort, or how many steps, they are willing to put in. I used a histogram to show where most recipes fall in terms of the number of steps they require.
It turns out that more recipes fall in the 10 to 15 step category than in any other!
I wanted to do the same for the number of ingredients needed to cook a meal, so I plotted a histogram of ingredient counts per recipe.
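The binning behind both histograms can be sketched as follows. The `step_counts` values are made up for illustration, and the bin width of 5 is an assumption. Note that each bin here is half-open, so the bin printed as "10-14" covers 10 up to but not including 15.

```python
from collections import Counter

# Hypothetical step counts, one per recipe (real values would come from the scrape).
step_counts = [4, 7, 11, 12, 14, 10, 18, 13, 22, 9]

def bin_counts(values, bin_width=5):
    """Group values into [start, start + bin_width) bins and count each bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

def print_histogram(values, bin_width=5):
    """Print one text bar per non-empty bin, lowest bin first."""
    for start, count in bin_counts(values, bin_width).items():
        label = f"{start}-{start + bin_width - 1}"
        print(f"{label:>6} | {'#' * count}")

print_histogram(step_counts)
```

The same two functions work unchanged on a list of ingredient counts per recipe.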
Clustering
I then wanted to show the relationship between the words. K-means clustering allowed me to do this. K-means clustering is an unsupervised machine learning technique commonly used to cluster text and articles. I counted each recipe's instructions and ingredients and used k-means clustering to group the recipes into clusters.
These clusters show how the machine learning algorithm groups data points by distance: recipes with a similar number of ingredients and a similar number of instructions end up in the same cluster. I wanted to illustrate this simple example before attempting to cluster the recipes by the ingredients they use.
I took the 100 most used ingredients and performed k-means clustering on features indicating whether or not each ingredient is in a recipe.
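One way to build such presence-or-absence features is sketched below. The `recipes` sets and the helpers `top_ingredients` and `encode` are hypothetical, and the vocabulary is cut to 4 ingredients instead of 100 to keep the example readable.

```python
import math
from collections import Counter

# Hypothetical recipes, each reduced to the set of ingredient words it uses.
recipes = [
    {"salt", "garlic", "olive", "oil"},
    {"salt", "garlic", "butter"},
    {"salt", "flour", "sugar", "butter"},
]

def top_ingredients(recipes, n):
    """Return the n ingredients that appear in the most recipes."""
    counts = Counter(word for recipe in recipes for word in recipe)
    # Sort by count (descending), then by name, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

def encode(recipe, vocabulary):
    """One 0/1 entry per vocabulary ingredient: 1 if the recipe uses it."""
    return [1 if word in recipe else 0 for word in vocabulary]

vocabulary = top_ingredients(recipes, n=4)  # the article uses the top 100
vectors = [encode(recipe, vocabulary) for recipe in recipes]

print(vocabulary)
print(vectors)
# For 0/1 vectors, the squared Euclidean distance equals the number of
# vocabulary ingredients that one recipe uses and the other does not.
print(math.dist(vectors[0], vectors[1]) ** 2)
```

Because the features are binary, recipes that share most of their common ingredients sit close together, which is what lets a distance-based method like k-means group them.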
Using Euclidean distance, the k-means algorithm produced the following clustering.
The clustering did a pretty good job of breaking down the most used ingredients per cluster based on distance.