Exploring Avocado Data and Building Predictive Models

Posted on Dec 14, 2020
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Purpose and Goal:

Inspired by the popularity of avocado toast among millennials, and having noticed avocado prices skyrocketing in produce sections recently, I wanted to find out which U.S. cities offer the most reasonable avocado prices and to better understand the market and its trends, hopefully to the benefit of suffering millennials (including myself). I explored data on the prices and volumes of avocados sold in major metropolitan areas, analyzed prices across cities and the correlation between volume and price, and built machine learning models to predict prices.

Data Analysis on the Rise of Avocados

Word cloud from tweets containing "avocado" posted 11/1/2020 - 11/20/2020. "Avocado toast" was the most frequently tweeted term after "avocado".

Data:

The avocado price data covers observations from 2015 to 2018. It was originally extracted from the Hass Avocado Board and downloaded from Kaggle. The dataset includes average prices, types (conventional or organic), and the cities and regions where the avocados were sold.

Data Analysis:

1. Correlation between price and volume

My hypothesis was based on supply and demand: if the volumes consumed are higher, prices should be lower. The scatter plot, created with matplotlib in Python, suggests a trend in that direction. The Pearson correlation coefficient showed a small negative correlation (-0.225) between average price and average volume. So there is an association between volume and price, but it cannot explain everything about how prices are set. There are also some outliers on the right side of the plot, where some cities had the highest prices while the volumes consumed were limited.
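As a rough illustration of the coefficient mentioned above, here is a minimal sketch of the Pearson correlation in plain Python. The function name `pearson_r` and the sample values are made up for illustration; they are not taken from the avocado dataset or from the original analysis, which may well have used a library routine instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (not real avocado data).
avg_volume = [1.0, 2.0, 3.0, 4.0, 5.0]
avg_price = [1.9, 1.6, 1.7, 1.3, 1.2]
r = pearson_r(avg_volume, avg_price)
print(round(r, 3))
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0, like the -0.225 reported above, indicates a weak one.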

In the next series of analyses, I dive deeper into the data and test hypotheses about how the geography of consumption and production affects prices.


2. Ranking of cities by volume and price

Based on the bar chart below, Los Angeles surprisingly consumes twice as much as the second-highest-volume city, New York. The consumption volume of Los Angeles is astonishingly larger than that of any other city.


According to the U.S. Department of Agriculture, world avocado production in 2017 was 5.9 million tonnes, led by Mexico with 34% (2.01 million tonnes) of the total. In the U.S., California is the major producer, accounting for 93 percent of U.S. avocado output. However, most U.S. consumption relies on imports, and 89% of those imports come from Mexico. Thus, proximity to these areas is highly likely to affect avocado prices.

In terms of prices, the chart below shows that the Northeast and Mideast are the most expensive areas for avocados. Surprisingly, Hartford-Springfield, Connecticut, was the most expensive city. Also, San Francisco is geographically not far from the production regions, yet it ranked very high, even higher than New York and other Northeastern cities.

Data Findings

This leads to a hypothesis about one factor that drives up prices: are transportation costs from the main production areas (Mexico, California, etc.) the major element affecting prices? This would explain why Hartford and other Northeastern cities had the highest prices. A Department of Agriculture report on other fresh produce strongly supports this idea. The results of the "study indicate that transportation costs significantly increase the costs of marketing these produce items and therefore their wholesale price."

The next question is whether the cost of living could be another major factor, considering that some cities such as San Francisco, where proximity to Mexico should not be an issue, also ranked very high. San Francisco may be an exception to the "proximity" theory. However, the cost of living index points the other way: cities that do not normally register as expensive, such as Hartford, Philadelphia, and Albany (all Northeastern cities), ranked high for avocado prices, which supports the hypothesis that proximity to Mexico and South America matters for pricing.

3. Volume and price fluctuation (Time Series Analysis)

The graph below shows a huge drop in price in the summer of 2015. In general, the price fluctuates with the seasons: it goes down in summer and up in winter. As an overall trend, however, the plot shows prices steadily increasing over the years, with a tremendous increase in 2017.

It is interesting to compare these ups and downs with those of the volume. In a sense, volume and price are two sides of the same coin: the volume rises strikingly at the beginning of spring, with a smaller uptick in summer, and these show up on the price chart as dips. This is consistent with the Pearson correlation coefficient from the first analysis, which showed a small negative association between volume and price. When there are more avocados in the market, prices tend to go down, and when there are not enough, prices increase.
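One simple way to look at the seasonal pattern described above is to average prices by calendar month across all years. The sketch below does this in plain Python. The helper name `monthly_average` and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import date

def monthly_average(records):
    """Average price per calendar month (1-12) from (date, price) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum of prices, count]
    for day, price in records:
        totals[day.month][0] += price
        totals[day.month][1] += 1
    # Sort by month so the result reads January through December.
    return {month: total / count
            for month, (total, count) in sorted(totals.items())}

# Hypothetical weekly observations for illustration only.
records = [
    (date(2015, 1, 4), 1.50),
    (date(2015, 1, 11), 1.40),
    (date(2015, 7, 5), 1.10),
    (date(2016, 7, 3), 1.20),
]
by_month = monthly_average(records)
for month, price in by_month.items():
    print(month, round(price, 2))
```

Pooling the same month across years smooths out the year-over-year trend, so a consistently lower summer average would point to seasonality rather than a one-off drop.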

4. Conventional vs Organic

As expected, organic avocados are usually priced higher than conventional ones. In some cities and regions, such as Phoenix-Tucson, West Texas, and New Mexico, the premium for organic avocados over conventional ones is much larger than in other cities.

Machine Learning Models

1. Model building

For the predictive models, I trained the following five models on the dataset, predicted prices on the test data, and compared the predictions against actual prices to score accuracy.

  • Ensemble Model 
    • Combined the following 4 models
  • 2 Random Forest Models
    • Grid searched model and not grid searched model
  • XGBoost
  • Linear Regression
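The ensemble in the list above combines the four other models. A minimal sketch of one way to do that, assuming a simple unweighted average of each model's predictions, might look like this. The function name, the model labels, and the prediction values are all hypothetical.

```python
def average_ensemble(predictions_by_model):
    """Element-wise mean of several models' predictions for the same rows."""
    models = list(predictions_by_model.values())
    if not models:
        raise ValueError("need at least one model's predictions")
    n_rows = len(models[0])
    if any(len(preds) != n_rows for preds in models):
        raise ValueError("all models must predict the same number of rows")
    # zip(*models) groups the predictions for each row across models.
    return [sum(row) / len(models) for row in zip(*models)]

# Hypothetical predicted prices for two test rows (illustration only).
predictions = {
    "rf_default": [1.0, 1.5],
    "rf_grid_searched": [1.2, 1.4],
    "xgboost": [1.1, 1.6],
    "linear_regression": [0.9, 1.3],
}
ensemble = average_ensemble(predictions)
print(ensemble)
```

An unweighted average gives every model the same say, so one weak model can pull the ensemble below its best member.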

2. Accuracy scores

Based on the chart below, the Random Forest models performed considerably better than the other models, including the ensemble model, in which I averaged the outputs of all four models. I used mean squared error, root mean squared error, and R2 (the coefficient of determination) to compare accuracy, and on all three measures the Random Forest model with grid-searched parameters did best.
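For reference, the three measures named above can be computed as in the following sketch. It is written in plain Python to show the formulas. The function name `regression_scores` and the sample numbers are made up for illustration and are not the results from this study.

```python
import math

def regression_scores(actual, predicted):
    """Return MSE, RMSE, and R2 for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

# Hypothetical prices for illustration only.
actual_prices = [1.0, 2.0, 3.0]
predicted_prices = [1.5, 2.0, 2.5]
scores = regression_scores(actual_prices, predicted_prices)
print(scores)
```

Lower MSE and RMSE are better, while R2 closer to 1 is better, so a model can be ranked on all three at once.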

Conclusion and next steps

This study showed that there is some association between prices and volumes. Additionally, proximity to the major production region (Mexico) and the season can be strong factors in pricing. However, there are exceptions too. For example, you could move to Texas to buy cheaper avocados, but organic avocados there may not be as reasonably priced as conventional ones. And although San Francisco is relatively close to Mexico, that does not help much if you are trying to make avocado toast every day.

As a next step, I would look for more datasets to incorporate into the models to improve prediction accuracy, such as volumes imported from Mexico, fuel prices to account for transportation costs, and distances to the production areas.

Github

About Author

Kisaki Watanabe

Data Scientist with strong consulting experiences in data analytics/visualization and risk management, serving for industries ranging from social networking service, game, pharmaceutical, media, and advertising. Advanced skills in fraud investigation and trend projection/analysis with tools such as Tableau,...