Data Study on the Ingredients and Features of the Recipe

Posted on Feb 3, 2020


When people cook at home, they often look up a recipe before they start. Two questions arise naturally: which ingredients are the most popular, and which combinations are common? When I cook, I sometimes go to the website xiachufang to find good recipes.

Xiachufang means 'go to the kitchen' in Chinese. It is a Chinese website with thousands of recipes for Chinese dishes. Users are encouraged to submit their work to the website and then rate the recipe. Besides regular components such as ingredients, each recipe on xiachufang carries three special features: exclusiveness, the existence of picture steps, and the recognition of master chefs. Exclusiveness means that the recipe is available only on this website.

The icon for picture steps indicates whether the recipe contains picture instructions. The recognition of master chefs is a special program between the website and users whose recipes the website considers high quality. Besides researching the ingredients, I am interested in whether these special features lead to higher popularity, measured by the number of attempts on a recipe.

Methodology and Data

In this project, the Python package Scrapy was used to extract data from the website. Running the spider produced nine CSV files, one for each of nine major ingredients: pork, chicken, beef, lamb, shrimp, fish, duck, eggs and tofu. After cleaning the data, 3245 recipes were recorded, each with eight features (variables): title, author, ingredients, number of attempts, icon for exclusiveness, existence of picture steps, master chef and major ingredient category.
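As a rough illustration of how per-ingredient CSV files could be combined into one table, here is a minimal sketch using only the standard library. The column names (`title`, `attempts`), the sample rows and the function name `merge_category_csvs` are assumptions made for the example; the actual files and cleaning code from the project are not shown in this post.

```python
import csv
import io


def merge_category_csvs(sources):
    """Read one CSV per major ingredient and tag each row with its category.

    `sources` maps a category name to an open text file (or file-like object).
    """
    rows = []
    for category, handle in sources.items():
        for row in csv.DictReader(handle):
            row["category"] = category  # the "major ingredient category" variable
            rows.append(row)
    return rows


# In-memory stand-ins for two of the nine files (made-up data).
pork_csv = io.StringIO("title,attempts\nBraised pork,120\n")
tofu_csv = io.StringIO("title,attempts\nMapo tofu,300\nTofu soup,45\n")

merged = merge_category_csvs({"pork": pork_csv, "tofu": tofu_csv})
for row in merged:
    print(row["category"], row["title"], row["attempts"])
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")` handles.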

Sometimes people use different words to refer to the same ingredient. Because of this, the ingredients of each recipe need to be normalized so that we can compare them more systematically. A Python function was applied to clean the ingredients, and a 9th variable, the number of ingredients used in each recipe, was added to the dataframe. In short, we have 3245 recipes with 9 variables each.
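The normalization step might look something like the sketch below. The synonym map, the sample recipe and the function name `normalize_ingredients` are hypothetical; the mapping actually used in the project is not given here.

```python
# Hypothetical synonym map: variant spelling -> canonical ingredient name.
SYNONYMS = {
    "green onion": "scallion",
    "spring onion": "scallion",
    "scallions": "scallion",
    "gingers": "ginger",
}


def normalize_ingredients(raw_ingredients):
    """Return canonical ingredient names, de-duplicated, in first-seen order."""
    cleaned = []
    for name in raw_ingredients:
        key = name.strip().lower()
        key = SYNONYMS.get(key, key)  # fall back to the name itself
        if key and key not in cleaned:
            cleaned.append(key)
    return cleaned


recipe = ["Green Onion", " ginger ", "Spring onion", "pork"]
cleaned = normalize_ingredients(recipe)
n_ingredients = len(cleaned)  # the 9th variable: number of ingredients
print(cleaned, n_ingredients)
```

Counting ingredients after normalization avoids counting the same ingredient twice when it appears under two names.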

Data on Ratings, Number of Attempts and Ingredients

Ratings VS. Number of Attempts

Before researching the effect of the special website icons, I would like to see whether there is any relationship between the ratings and the number of attempts for each recipe. After removing the outliers, we obtain the following plot. The only trend visible in the plot is that as the number of attempts on a recipe grows, its rating tends to approach 8.0. Within a given range of attempts, however, the rating can vary greatly. For example, for recipes with between 0 and 1000 attempts, the ratings range from 6.5 to 9.5, which covers nearly all the ratings users give on the website.
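The post does not say which rule was used to remove outliers. One common choice is the interquartile-range (IQR) rule, sketched below as an assumption rather than as the method actually applied; the attempt counts are made up.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up attempt counts with one extreme recipe.
attempts = [12, 40, 55, 80, 120, 150, 200, 260, 9000]
kept = remove_outliers_iqr(attempts)
print(kept)
```

A larger `k` keeps more of the extreme recipes; `k=1.5` is only the conventional default.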

The effect of exclusiveness

In this subsection, we check whether exclusiveness attracts more users to attempt the dishes and how it affects the ratings. The boxplot above shows the ratings of each group (exclusive and non-exclusive recipes); the ratings for exclusive recipes are slightly higher than for the others.

A summary of the key characteristics can be found in the table above. The numbers in parentheses are the values after removing the outliers. We can see that exclusiveness does attract more people to attempt a recipe, and users tend to give exclusive recipes higher ratings. A two-sample t-test with bootstrapping also confirmed that the ratings and the number of attempts differ significantly between the two groups.
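For readers unfamiliar with a bootstrapped two-sample t-test, the sketch below shows one standard variant: both samples are shifted to a common mean so that the null hypothesis holds, then resampled with replacement, and the resampled Welch t statistics are compared with the observed one. The exact procedure used in the project is not described, so this is an assumption, and the ratings are made-up numbers.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic; 0.0 for a degenerate resample with no spread."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def bootstrap_t_test(a, b, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    t_obs = welch_t(a, b)
    pooled_mean = mean(a + b)
    mean_a, mean_b = mean(a), mean(b)
    # Shift both samples to the pooled mean so the null hypothesis is true.
    a0 = [x - mean_a + pooled_mean for x in a]
    b0 = [x - mean_b + pooled_mean for x in b]
    extreme = 0
    for _ in range(n_boot):
        ra = rng.choices(a0, k=len(a0))  # resample with replacement
        rb = rng.choices(b0, k=len(b0))
        if abs(welch_t(ra, rb)) >= abs(t_obs):
            extreme += 1
    return (extreme + 1) / (n_boot + 1)


# Made-up ratings for the two groups.
exclusive = [8.4, 8.6, 8.1, 8.9, 8.5, 8.7, 8.3, 8.8]
non_exclusive = [7.6, 7.9, 7.4, 8.0, 7.7, 7.5, 7.8, 7.3]
p_value = bootstrap_t_test(exclusive, non_exclusive)
print(f"bootstrap p-value: {p_value:.4f}")
```

The same function can be applied to the number of attempts instead of the ratings.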

The effect of picture steps

This time we investigate the effect of picture steps. The boxplots and the table summarizing the key data are as follows. The ratings for recipes with picture steps are higher, but the biggest difference appears in the number of attempts. Comparing the figures in the table, people appear much more willing to attempt recipes with picture steps. As before, the statistical tests confirm that the differences are significant.

The effect of master chefs

Lastly, let's look into the effect of master chefs. The conclusion is quite similar to what we found in the previous two cases. Recipes created by master chefs are more popular in terms of the number of attempts, and their ratings are slightly higher.


If we put these three boolean variables together, we find that the most popular recipes are the ones created by master chefs that also carry the exclusive and picture-step icons. In summary, these special features of the website do boost the popularity of a recipe in terms of the number of attempts. Ratings and the number of ingredients are also slightly higher for recipes with all three icons, which suggests that these recipes are of somewhat higher quality and slightly more complicated.
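Combining the three boolean variables amounts to grouping recipes by the triple of flags and comparing group averages. A minimal sketch follows, with hypothetical field names (`exclusive`, `picture_steps`, `master`, `attempts`) and made-up numbers:

```python
from collections import defaultdict
from statistics import mean

# Made-up recipes; the field names are assumptions for this example.
recipes = [
    {"exclusive": True, "picture_steps": True, "master": True, "attempts": 900},
    {"exclusive": True, "picture_steps": True, "master": True, "attempts": 700},
    {"exclusive": False, "picture_steps": True, "master": False, "attempts": 300},
    {"exclusive": False, "picture_steps": False, "master": False, "attempts": 40},
    {"exclusive": False, "picture_steps": False, "master": False, "attempts": 60},
]


def mean_attempts_by_flags(rows):
    """Average number of attempts for each (exclusive, picture_steps, master) triple."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["exclusive"], row["picture_steps"], row["master"])
        groups[key].append(row["attempts"])
    return {key: mean(values) for key, values in groups.items()}


averages = mean_attempts_by_flags(recipes)
best = max(averages, key=averages.get)  # flag combination with the most attempts
print(best, averages[best])
```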

The Investigation of Ingredients

The popularity of ingredients

The table below summarizes the number of recipes for each major ingredient. Pork is the most popular, while fish and duck are less popular. The low count for fish is somewhat surprising; it could be because fish dishes are harder to cook at home.

The following table shows the ingredients that are used most often. Scallions, ginger, and garlic are close to a must for many dishes. Peppers are used quite often as well.

The following two plots show the ratings and the number of attempts for each major category. We can see that egg dishes have the most attempts on average, while fish and duck dishes have fewer. As for the ratings, lamb dishes have the highest average rating, which could be because there are fewer lamb dishes. In general, the ratings are quite similar across categories.

Data on Popular combinations

I also investigated which combinations are common within each major ingredient category; the results are summarized in the following table.

It is interesting to notice that the most popular companions of pork and chicken are other parts of pork and chicken, which suggests that it is quite reasonable to cook them together. For other ingredients, egg, mushroom and tomato show up quite often, so a grocery store might place them close to one another.
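One simple way to find common combinations is to count, for each major ingredient category, how often every other ingredient appears alongside it. The sketch below does this with `collections.Counter`; the sample recipes and the function name `top_companions` are made up for illustration, and the post does not say how the combinations were actually computed.

```python
from collections import Counter, defaultdict


def top_companions(recipes, top_n=3):
    """For each category, count the other ingredients that appear in its recipes.

    `recipes` is an iterable of (category, ingredient_list) pairs.
    """
    counts = defaultdict(Counter)
    for category, ingredients in recipes:
        # Use a set so an ingredient counts once per recipe, and drop the
        # category ingredient itself.
        counts[category].update(set(ingredients) - {category})
    return {cat: counter.most_common(top_n) for cat, counter in counts.items()}


# Made-up, already normalized recipes.
sample = [
    ("egg", ["egg", "tomato", "scallion"]),
    ("egg", ["egg", "tomato"]),
    ("tofu", ["tofu", "mushroom", "scallion"]),
    ("tofu", ["tofu", "mushroom"]),
]

companions = top_companions(sample)
for category, pairs in companions.items():
    print(category, pairs)
```

Dropping the category ingredient itself is a simplification; distinguishing different parts of pork or chicken would need a finer ingredient classification.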

A side note on the number of ingredients

I also looked at how the number of ingredients affects the ratings and the number of attempts. The first plot shows that most recipes on the website have fewer than 15 ingredients. Many recipes have around 6 ingredients, which is close to the average of 7 ingredients we obtained in the previous group studies.

The following two plots show the relationship between the number of ingredients and the average rating and average number of attempts. The highest average ratings occur for recipes with more ingredients. This could be because such recipes have more flavor, or because they make very clear which ingredients should be included. However, the simplest dishes have high ratings as well. As for the average number of attempts, there is no specific pattern, except that fewer people tend to try very complicated recipes (those that use a lot of ingredients).


Conclusion and Future Work

In conclusion, the special features/icons of the website do boost popularity and lead to slightly higher ratings. If I could make a suggestion to the website, I would encourage them to develop those features further and to explore new features if possible. As for users/recipe makers, it is very important to include pictures and a reasonable number of ingredients. Recipes with picture steps have clearly more attempts.

As for future work, I would like to create a more detailed classification of the ingredients. If possible, I would also try to scrape another similar website and see if certain conclusions about common combinations hold as well.

About Author

Hanbo Shao

Data Scientist with a strong quantitative background in mathematics and operations research. Detail-oriented, curious and highly motivated to apply data analysis and machine learning skills into solving real-life problems. A collaborative team player and loves to learn new...