Ingredients and Features of the Recipe

Hanbo Shao
Posted on Feb 3, 2020


When people cook at home, they often consult a recipe before they start. Two questions arise naturally: which ingredients are the most popular, and which combinations are common? When I cook, I sometimes go to the website xiachufang to find good recipes.

Xiachufang means 'go to the kitchen' in Chinese. It is a Chinese website with thousands of recipes for Chinese dishes. Users are encouraged to submit their work to the website and to rate recipes. Besides regular components such as ingredients, each recipe on xiachufang carries three special features: exclusiveness, picture steps, and master-chef recognition. Exclusiveness marks a recipe that is only available on this website. The picture-steps icon indicates whether the recipe contains picture instructions. Master-chef recognition is a special program between the website and some users whose recipes the website considers high quality. Besides researching the ingredients, I am interested in whether these special features lead to higher popularity, measured by the number of attempts on a recipe.

Methodology and Data

In this project, the Scrapy package for Python was used to extract data from the website. Running the spider produced nine CSV files, one for each of nine major ingredients: pork, chicken, beef, lamb, shrimp, fish, duck, eggs and tofu. After cleaning the data, 3245 recipes remained, each with eight features (variables): title, author, ingredients, number of attempts, the exclusiveness icon, the existence of picture steps, the master-chef icon and the major ingredient category. People sometimes use different words to refer to the same ingredient, so the ingredients of each recipe need to be normalized before they can be compared systematically. A Python method was applied to clean the ingredients, and a ninth variable, the number of ingredients used in each recipe, was added to the dataframe. In a nutshell, we have 3245 recipes with 9 variables attached to each.
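The cleaning method itself is not shown in this post, so the following is only a minimal sketch of what the normalization step could look like. The synonym map, the comma-separated input format and the field names (`ingredients`, `n_ingredients`) are all assumptions made for illustration.

```python
# Hypothetical synonym map; the real mapping used in the project is not shown.
SYNONYMS = {
    "green onion": "scallion",
    "spring onion": "scallion",
    "scallions": "scallion",
    "gingers": "ginger",
}


def normalize_ingredients(raw):
    """Split a comma-separated ingredient string and map synonyms to one name."""
    cleaned = []
    for item in raw.split(","):
        name = item.strip().lower()
        if not name:
            continue  # skip empty fragments such as a trailing comma
        name = SYNONYMS.get(name, name)
        if name not in cleaned:  # drop duplicates created by merging synonyms
            cleaned.append(name)
    return cleaned


# Made-up record standing in for one scraped row.
recipe = {"title": "Example dish", "ingredients": "Pork, Green Onion, scallions, Gingers"}
recipe["ingredients"] = normalize_ingredients(recipe["ingredients"])
recipe["n_ingredients"] = len(recipe["ingredients"])  # the ninth variable
```

With a mapping like this, "Green Onion" and "scallions" collapse into a single entry, so the ingredient count reflects distinct ingredients rather than distinct spellings.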

Ratings, Number of Attempts and Ingredients

Ratings vs. Number of Attempts

Before researching the effect of the special website icons, I would like to see if there is any relationship between the rating and the number of attempts for each recipe. After removing the outliers, we obtain the following plot. The only trend we can see is that as the number of attempts on a recipe grows, its rating tends to approach 8.0. However, within certain parts of the plot, the rating can vary greatly. For example, when the number of attempts is between 0 and 1000, the ratings range from 6.5 to 9.5, which covers nearly the full range of ratings users give on the website.
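The post does not say which rule was used to remove outliers. As one common possibility, the sketch below applies the 1.5 × IQR rule to the number of attempts; the rule, the function name and the sample numbers are assumptions for illustration, not the project's actual procedure.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up attempt counts with one extreme value.
attempts = [120, 340, 560, 80, 910, 450, 230, 50000]
filtered = remove_outliers_iqr(attempts)
```

A rule like this matters here because a handful of recipes with very large attempt counts would otherwise stretch the horizontal axis and hide the bulk of the data.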

The effect of exclusiveness

In this subsection, we check whether exclusiveness attracts more users to attempt a dish and how it affects the ratings. The above boxplot shows the ratings of each group (exclusive recipes and non-exclusive ones). We can see that the ratings for exclusive recipes are slightly higher than for the others.

A summary of the key characteristics can be found in the above table. The numbers in parentheses are the values after removing the outliers. We can see that exclusiveness does attract more people to attempt a recipe, and users tend to give higher ratings to exclusive recipes. A two-sample t-test with bootstrapping also confirmed that the ratings and the number of attempts differ significantly between the two groups.
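The exact test code is not included in this post. The sketch below shows one way a bootstrap two-sample comparison can be set up; note that it resamples the difference in means rather than the t statistic, so it is a simplified stand-in for the test described above. The function name and the ratings are made up for illustration.

```python
import random
import statistics


def bootstrap_mean_diff_test(a, b, n_boot=2000, seed=0):
    """Bootstrap test of H0: both groups have the same mean.

    Returns the observed difference in means and a two-sided p-value.
    """
    rng = random.Random(seed)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    observed = mean_a - mean_b
    pooled_mean = statistics.fmean(list(a) + list(b))
    # Shift each group to the pooled mean so that H0 holds in the resamples.
    a0 = [x - mean_a + pooled_mean for x in a]
    b0 = [x - mean_b + pooled_mean for x in b]
    extreme = 0
    for _ in range(n_boot):
        ra = rng.choices(a0, k=len(a0))  # resample with replacement
        rb = rng.choices(b0, k=len(b0))
        if abs(statistics.fmean(ra) - statistics.fmean(rb)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_boot + 1)


# Made-up ratings for illustration only.
exclusive = [8.4, 8.1, 8.6, 8.3, 8.5, 8.2, 8.7, 8.4]
non_exclusive = [7.6, 7.9, 7.4, 7.8, 7.5, 7.7, 7.3, 7.6]
diff, p_value = bootstrap_mean_diff_test(exclusive, non_exclusive)
```

Bootstrapping is a reasonable choice for the number of attempts in particular, since that variable is heavily skewed and a plain t-test's normality assumption is doubtful.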

The effect of picture steps

This time we investigate the effect of picture steps. The boxplots and the table summarizing the key data are as follows. The ratings for recipes with picture steps are higher, but the biggest difference appears in the number of attempts. Comparing the figures in the table, people appear much more willing to attempt recipes with picture steps. As before, the statistical tests confirm that the differences are significant.

The effect of master chefs

Lastly, let's look into the effect of master chefs. The conclusion here is quite similar to what we found in the previous two cases. The recipes created by master chefs are more popular in terms of the number of attempts, and their ratings are slightly higher.


If we put these three boolean variables together, we find that the most popular recipes are the ones created by master chefs that also carry the exclusive and picture-step icons. In summary, these special features of the website do boost the popularity of a recipe in terms of the number of attempts. Recipes with all three icons also have slightly higher ratings and slightly more ingredients, which suggests that they are of somewhat higher quality and slightly more complicated.
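Combining the three flags amounts to grouping the recipes by each combination of icons and comparing the group averages. A minimal sketch, assuming hypothetical field names (`exclusive`, `picture_steps`, `master_chef`, `attempts`) and made-up numbers:

```python
from collections import defaultdict
from statistics import fmean


def mean_attempts_by_flags(recipes):
    """Average number of attempts for each combination of the three icons."""
    groups = defaultdict(list)
    for r in recipes:
        key = (r["exclusive"], r["picture_steps"], r["master_chef"])
        groups[key].append(r["attempts"])
    return {key: fmean(vals) for key, vals in groups.items()}


# Made-up records for illustration only.
recipes = [
    {"exclusive": True, "picture_steps": True, "master_chef": True, "attempts": 900},
    {"exclusive": True, "picture_steps": True, "master_chef": True, "attempts": 1100},
    {"exclusive": False, "picture_steps": True, "master_chef": False, "attempts": 400},
    {"exclusive": False, "picture_steps": False, "master_chef": False, "attempts": 100},
    {"exclusive": False, "picture_steps": False, "master_chef": False, "attempts": 300},
]
means = mean_attempts_by_flags(recipes)
best_combination = max(means, key=means.get)  # flag combination with the most attempts
```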

The Investigation of Ingredients

The popularity of ingredients

The table below summarizes the number of recipes for each major ingredient. Pork is the most popular, while fish and duck are the least popular. The low count for fish is somewhat surprising; it could be because fish dishes are harder to cook at home.

The following table shows the ingredients that are used most often. Scallions, ginger and garlic are close to a must for many dishes. Peppers are used quite often as well.
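A frequency table like this can be produced by counting, for each ingredient, how many recipes contain it. The sketch below assumes each recipe is already a list of normalized ingredient names; the sample recipes are made up.

```python
from collections import Counter

# Made-up, already-normalized ingredient lists.
recipe_ingredients = [
    ["pork", "scallion", "ginger"],
    ["chicken", "ginger", "garlic"],
    ["tofu", "scallion", "garlic", "ginger"],
]

# set() ensures an ingredient is counted once per recipe even if listed twice.
counts = Counter(ing for ings in recipe_ingredients for ing in set(ings))
top_three = counts.most_common(3)
```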

The following two plots show the ratings and the number of attempts for each major category. Egg dishes have the most attempts on average, while fish and duck dishes have the fewest. As for the ratings, lamb dishes have the highest average rating, which could be because there are fewer lamb dishes. In general, the ratings are quite similar across categories.

Popular combinations

I also investigated which combinations are common within each major ingredient category. The results are summarized in the following table.

Interestingly, the ingredients most often combined with pork and chicken are other parts of pork and chicken, which suggests that it is quite reasonable to cook them together. Among the other ingredients, egg, mushroom and tomato show up quite often, so a grocery store might reasonably place them close to one another.
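The post does not show how the combinations were extracted. One simple approach is to count how often each pair of ingredients appears in the same recipe within a category, as in this sketch. The function name and the sample recipes are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def top_pairs(recipe_ingredients, n=3):
    """Return the n ingredient pairs that co-occur in the most recipes."""
    pair_counts = Counter()
    for ings in recipe_ingredients:
        # Sorting makes ("egg", "tomato") and ("tomato", "egg") the same key.
        for pair in combinations(sorted(set(ings)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


# Made-up recipes from a hypothetical "eggs" category.
egg_recipes = [
    ["egg", "tomato", "scallion"],
    ["egg", "tomato"],
    ["egg", "mushroom", "tomato"],
    ["egg", "mushroom"],
]
most_common_pair = top_pairs(egg_recipes, n=1)
```

Running the same function on each of the nine category files separately would give one list of pairs per major ingredient.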

A side note on the number of ingredients

I also looked at how the number of ingredients affects the ratings and the number of attempts. The first plot shows that most recipes on the website have fewer than 15 ingredients. Many recipes have around 6 ingredients, which is close to the average of 7 ingredients we obtained in the previous group studies.

The following two plots show the relationship between the number of ingredients and the average rating and the average number of attempts. The highest average ratings occur for recipes with more ingredients. This could be because such recipes have more flavor, or because they state very clearly which ingredients should be included. However, the simplest dishes have high ratings as well. As for the average number of attempts, there is no specific pattern except that fewer people tend to try very complicated recipes (those that use a lot of ingredients).


Conclusion and Future Work

In conclusion, the special features/icons of the website do boost popularity and are associated with slightly higher ratings. If I could make a suggestion to the website, I would encourage it to develop those features further and to explore new ones if possible. As for users and recipe makers, it is very important to include pictures and a reasonable number of ingredients. In this data, recipes with picture steps have clearly more attempts.

As for future work, I would like to create a more detailed classification of the ingredients. If possible, I would also scrape another similar website to see whether the conclusions about common combinations hold there as well.

