Data Study on the Ingredients and Features of the Recipe
Background
When people cook at home, they often check a recipe before they start. Two questions arise naturally: which ingredients are the most popular, and which combinations are common? When I cook, I sometimes go to the website xiachufang to look up good recipes.
Xiachufang means 'go to the kitchen' in Chinese. It is a Chinese website with thousands of recipes for Chinese dishes. Users are encouraged to submit their work to the website and to rate recipes. Besides regular components such as ingredients, each recipe carries three other features: exclusiveness, the existence of picture steps, and the recognition of master chefs. Exclusiveness marks a recipe that is available only on this website.
The icon for picture steps indicates whether or not the recipe contains picture instructions. The recognition of master chefs is a special program between the website and users whose recipes the website considers high quality. Besides researching the ingredients, I am interested in whether these special features lead to higher popularity, measured by the number of attempts on a recipe.
Methodology and Data
In this project, the Scrapy package for Python was used to extract data from the website. Running the spider produced nine CSV files, one for each of nine major ingredients: pork, chicken, beef, lamb, shrimp, fish, duck, eggs and tofu. After cleaning the data, 3245 recipes remained, each with eight features (variables): title, author, ingredients, number of attempts, icon for exclusiveness, existence of picture steps, master chef icon and major ingredient category.
Sometimes people use different words to refer to the same ingredient. Because of this, the ingredients of each recipe need to be normalized so that we can compare them more systematically. A Python function was applied to clean the ingredients, and a 9th variable, the number of ingredients used in each recipe, was added to the dataframe. In a nutshell, we have 3245 recipes with 9 variables attached to each recipe.
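The cleaning function itself is not shown in this post. As a rough sketch of how such a normalization step could look, the snippet below maps variant names to one canonical name and then counts the ingredients. The synonym table, the function name normalize_ingredients and the sample recipe are all made up for illustration and are not the project's actual code.

```python
# Hypothetical synonym table: variant name -> canonical name.
# The mapping actually used in the project is not given in the post.
SYNONYMS = {
    "green onion": "scallion",
    "spring onion": "scallion",
    "scallions": "scallion",
    "gingers": "ginger",
}

def normalize_ingredients(raw_ingredients):
    """Return canonical ingredient names, de-duplicated, in first-seen order."""
    cleaned = []
    for item in raw_ingredients:
        name = item.strip().lower()        # remove stray spaces and case differences
        name = SYNONYMS.get(name, name)    # replace known variants, keep the rest
        if name and name not in cleaned:   # drop blanks and duplicates
            cleaned.append(name)
    return cleaned

# Made-up recipe used only to show the call.
recipe = {"title": "example dish",
          "ingredients": ["Green Onion", " scallions", "Ginger", "pork"]}
recipe["ingredients"] = normalize_ingredients(recipe["ingredients"])
recipe["n_ingredients"] = len(recipe["ingredients"])  # the extra count variable
```

A lookup table like this is easy to extend as new variants turn up, but it only catches the spellings someone has listed.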
Data on Ratings, Number of Attempts and Ingredients
Ratings VS. Number of Attempts
Before researching the effect of the special website icons, I would like to see if there is any relationship between the ratings and the number of attempts for each recipe. After removing the outliers, we obtain the following plot. The only trend we can see in the plot is that as the number of attempts on a recipe grows, its rating tends to approach 8.0. However, within certain parts of the plot the rating can vary greatly. For example, when the number of attempts is between 0 and 1000, the ratings range from 6.5 to 9.5, which covers nearly all the ratings users give on the website.
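The post does not say how the outliers were identified. One common choice, assumed here purely for illustration, is Tukey's rule: drop values more than 1.5 times the interquartile range beyond the first or third quartile. The function names and the toy rows below are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(rows, key, k=1.5):
    """Keep only the rows whose value for `key` lies inside the fences."""
    low, high = iqr_bounds([row[key] for row in rows], k)
    return [row for row in rows if low <= row[key] <= high]

# Made-up rows; the last one has an extreme number of attempts.
toy_recipes = [
    {"rating": 7.9, "attempts": 120},
    {"rating": 8.1, "attempts": 300},
    {"rating": 7.5, "attempts": 450},
    {"rating": 8.4, "attempts": 800},
    {"rating": 8.0, "attempts": 950},
    {"rating": 8.0, "attempts": 50000},
]
cleaned = drop_outliers(toy_recipes, "attempts")
```

Attempt counts are typically very skewed, so the choice of rule changes how many popular recipes are discarded.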
The effect of exclusiveness
In this subsection, we check whether exclusiveness attracts more users to attempt a dish and how it affects the ratings. The above boxplot shows the ratings of each group (exclusive recipes and non-exclusive ones). We can see that the ratings for exclusive recipes are slightly higher than the others.
A summary of the key characteristics can be found in the above table. The numbers in parentheses are the values after removing the outliers. We can see that exclusiveness does attract more people to attempt a recipe, and users tend to give exclusive recipes higher ratings. Also, a two-sample t-test with bootstrapping confirmed that both the ratings and the number of attempts differ significantly between the two groups.
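The exact test procedure is not spelled out in this post. One standard way to combine a two-sample t statistic with bootstrapping, sketched below as an assumption rather than as the method actually used, is to shift both groups to a common mean (so the null hypothesis holds), resample each group with replacement, and count how often the resampled statistic is at least as extreme as the observed one. The function names and the toy ratings are made up.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:  # degenerate resample in which every value is identical
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se

def bootstrap_t_test(a, b, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for a difference in group means."""
    rng = random.Random(seed)
    t_obs = abs(welch_t(a, b))
    pooled_mean = statistics.mean(list(a) + list(b))
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Shift both groups so they share one mean, i.e. impose the null hypothesis.
    a0 = [x - mean_a + pooled_mean for x in a]
    b0 = [x - mean_b + pooled_mean for x in b]
    extreme = 0
    for _ in range(n_resamples):
        resampled_a = rng.choices(a0, k=len(a0))  # sample with replacement
        resampled_b = rng.choices(b0, k=len(b0))
        if abs(welch_t(resampled_a, resampled_b)) >= t_obs:
            extreme += 1
    return extreme / n_resamples

# Made-up ratings, only to show the call; not the project's data.
exclusive_ratings = [8.4, 8.6, 8.1, 8.9, 8.5, 8.7, 8.3, 8.8]
other_ratings = [7.2, 7.5, 7.1, 7.8, 7.4, 7.6, 7.0, 7.3]
p_value = bootstrap_t_test(exclusive_ratings, other_ratings)
```

A small p-value means a gap this large would rarely appear if the two groups really shared the same mean. The same function can be applied to the number of attempts.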
Data on the effect of picture steps
This time we investigate the effect of picture steps. The boxplots and the table summarizing the key data are as follows. The ratings for recipes with picture steps are higher, but the biggest difference appears in the number of attempts. Simply by comparing the data in the table, we can conclude that people are much more willing to attempt recipes with picture steps. As before, the statistical tests confirm that the differences are significant.
The effect of master chefs
Lastly, let's look into the effect of master chefs. The conclusion here is quite similar to what we found in the previous two cases. The recipes created by master chefs are more popular in terms of the number of attempts, and their ratings are slightly higher.
If we put these three boolean variables together, we find that the most popular recipes are the ones created by master chefs that also carry the exclusive and picture step icons. In summary, these special features of the website do boost the popularity of a recipe in terms of the number of attempts. Recipes with all three icons also have slightly higher ratings and slightly more ingredients, which suggests that they are of higher quality and slightly more complicated.
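Putting the three boolean variables together amounts to a group-by on the icon flags. The sketch below shows one way to do this with the standard library. The field names (exclusive, picture_steps, master_chef, attempts) and the rows are assumptions for illustration, not the project's dataframe.

```python
from collections import defaultdict
from statistics import mean

def mean_attempts_by_flags(recipes):
    """Average number of attempts for each combination of the three icons."""
    groups = defaultdict(list)
    for r in recipes:
        key = (r["exclusive"], r["picture_steps"], r["master_chef"])
        groups[key].append(r["attempts"])
    return {key: mean(values) for key, values in groups.items()}

# Made-up rows, only to show the shape of the result.
toy_recipes = [
    {"exclusive": True, "picture_steps": True, "master_chef": True, "attempts": 900},
    {"exclusive": True, "picture_steps": True, "master_chef": True, "attempts": 1100},
    {"exclusive": False, "picture_steps": True, "master_chef": False, "attempts": 300},
    {"exclusive": False, "picture_steps": False, "master_chef": False, "attempts": 40},
    {"exclusive": False, "picture_steps": False, "master_chef": False, "attempts": 60},
]
means = mean_attempts_by_flags(toy_recipes)
most_popular = max(means, key=means.get)  # flag combination with the highest mean
```

With three flags there are at most eight groups, so it is worth checking the group sizes before comparing their means.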
The Investigation of Ingredients
The popularity of ingredients
The table below summarizes the number of recipes for each major ingredient. Pork is the most popular, while fish and duck are less popular. This is somewhat surprising; one possible reason is that fish dishes are harder to cook at home.
The following table shows the ingredients that are used most often. Scallions, ginger and garlic are close to a must for many dishes. Peppers are used quite often as well.
The following two plots show the ratings and the number of attempts for each major category. We can see that egg dishes have the most attempts on average, while fish and duck dishes have fewer. As for the ratings, lamb dishes have the highest average rating, which could be because there are fewer lamb dishes. In general, the ratings are quite similar across categories.
Data on Popular combinations
I also investigated which combinations are common within each major ingredient category. The result is summarized in the following table.
It is interesting to notice that the ingredients most often combined with pork and chicken are other parts of pork and chicken. This suggests that it is quite reasonable to put them together. For the other categories, egg, mushroom and tomato show up quite often, so a grocery store might place them close to one another.
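How the combinations were counted is not described in the post. A simple way to get a table like this, shown here only as a sketch, is to count every pair of ingredients that appears together in a recipe, separately for each category. The field names category and ingredients and the toy recipes are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def common_pairs(recipes, top_n=3):
    """Most frequent ingredient pairs within each major ingredient category."""
    counts = defaultdict(Counter)
    for r in recipes:
        # Sort so that a pair is always stored in the same order.
        items = sorted(set(r["ingredients"]))
        counts[r["category"]].update(combinations(items, 2))
    return {cat: counter.most_common(top_n) for cat, counter in counts.items()}

# Made-up recipes, only to show the call.
toy_recipes = [
    {"category": "eggs", "ingredients": ["egg", "tomato", "scallion"]},
    {"category": "eggs", "ingredients": ["egg", "tomato"]},
    {"category": "eggs", "ingredients": ["egg", "mushroom"]},
    {"category": "pork", "ingredients": ["pork belly", "pork rib", "ginger"]},
]
top = common_pairs(toy_recipes, top_n=1)
```

Pair counts depend on ingredient names being normalized first, otherwise the same combination is split across several spellings.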
A side note on the number of ingredients
I also looked at how the number of ingredients affects the ratings and the number of attempts. The first plot shows that most recipes on the website have fewer than 15 ingredients. Many recipes have around 6 ingredients, which is close to the average of 7 ingredients we obtained in the previous group studies.
The following two plots show the relationship between the number of ingredients and the average rating and average number of attempts. The highest average ratings occur among recipes with more ingredients. This could be because those recipes have more flavor, or because they make very clear which ingredients should be included. However, the simplest dishes have high ratings as well. As for the average number of attempts, there is no specific pattern except that fewer people try the very complicated recipes (those that use a lot of ingredients).
Conclusion and Future Work
In conclusion, the special features/icons of the website do boost popularity and are associated with slightly higher ratings. If I could make a suggestion to the website, I would encourage them to develop those features further and to explore new ones if possible. As for users/recipe makers, it is very important to include pictures and a reasonable number of ingredients. Recipes with picture steps have clearly more attempts.
As for future work, I would like to create a more detailed classification of the ingredients. If possible, I would also try to scrape another similar website and see if certain conclusions about common combinations hold as well.