What Toys Can Tell Us: Insight and Discussion

Posted on Oct 21, 2019
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


eBay is second only to Amazon in terms of e-commerce sales volume in North America, surpassing  Apple and Walmart. 

While 'electronics' is the largest category in terms of sales, the 'toys' category is uniquely positioned to give insight into current consumer trends, historical appetite, and, ultimately, the strength of a brand. This information has implications for both individuals and institutions.

The propagation of names such as 'Iron Man' and 'Thanos' has carried them from obscure references to near-household status. While quantifying this transition is beyond the scope of this project, it has led to the main question: Is there a way to track this propagation that is neatly encapsulated by a consumer product? There is, and the answer is toys. Thus, let's begin by asking additional questions: How much does a franchise matter to a brand? What are the spending habits of the toy shopper?

In our exploration, we will specifically take a look at the Action Figure Category.

Approach and Challenges

In theory, the questions are quite apparent, but in practice, the retrieval of sold listings proved to be a challenge. On any given day, the completed items in eBay's Action Figure category number over 1 million (reflecting several weeks' worth of data, and an ideal beginning sample size). At 100 results per page, this implies that there are over 10,000 pages' worth of completed listings, but retrieving them all proved elusive.

The data was scraped using Python's Scrapy package. The first crawl returned only a little over 8,000 listings. Upon further examination, Scrapy's response log indicated that only roughly 160 pages were scraped, at 50 results per page. Tweaking settings and making several adjustments to the code led to only marginal improvement; a second scrape produced only 180 pages.

Thus, the first takeaway for improvement is readily apparent: either devise a way to force eBay's servers to return the 1M+ results, or run the crawler every night, preferably over at least a 30-day interval, merging each iteration appropriately to avoid duplicating listings. Nevertheless, even with only 2 days' worth of sold listings, we can start asking questions and envisioning how the answers can be deduced. Total sales over the 2-day period were $625,642.
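As a rough illustration of the nightly-merge idea, here is a minimal sketch using only the standard library. It assumes each scraped listing is a dict carrying a unique identifier under a key I am calling `item_id`; that key name, the helper `merge_crawls`, and the sample rows are all hypothetical and are not taken from the actual crawler.

```python
from typing import Dict, Iterable, List


def merge_crawls(crawls: Iterable[List[dict]], key: str = "item_id") -> List[dict]:
    """Merge several crawl batches, keeping one record per listing."""
    merged: Dict[str, dict] = {}
    for batch in crawls:
        for listing in batch:
            # A later crawl overwrites an earlier copy of the same listing,
            # so a listing seen on two nights is only counted once.
            merged[listing[key]] = listing
    return list(merged.values())


# Hypothetical sample data standing in for two nightly crawls.
night_1 = [
    {"item_id": "A1", "title": "Hypothetical figure one", "price": 12.50},
    {"item_id": "B2", "title": "Hypothetical figure two", "price": 40.00},
]
night_2 = [
    {"item_id": "B2", "title": "Hypothetical figure two", "price": 40.00},
    {"item_id": "C3", "title": "Hypothetical figure three", "price": 7.25},
]

all_listings = merge_crawls([night_1, night_2])
total_sales = sum(listing["price"] for listing in all_listings)
print(len(all_listings), total_sales)
```

Keying on a listing identifier rather than on the title matters here, since many sellers reuse near-identical titles for different items.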

The second major challenge was the user-populated 'Item Specifics' box. There are upwards of 22 unique fields that the seller can populate in this box, but as nearly every field is optional, the information varied widely from listing to listing. 



Key to the analysis was the 'brand' field. Luckily, listings with a blank or omitted brand comprised only a little under 2.5% of all sales. More challenging was the breadth of spelling variations provided for many brands. Certainly, future improvements to the project would implement increasingly complex regular expressions to correct and anticipate the user-provided data. Nevertheless, with some rigorous cleaning, the impact by brand was captured closely enough to view immediate results.
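To make the regex-cleaning idea concrete, here is a small sketch of how spelling variants might be collapsed onto canonical brand names. The patterns, the `normalize_brand` helper, and the "Unspecified" label for blank entries are my own illustrative assumptions, not the cleaning rules actually used for this project.

```python
import re
from typing import Optional

# Hypothetical patterns: each canonical brand maps to a regex that matches
# spelling variants a seller might type into the 'brand' field.
BRAND_PATTERNS = {
    "Hasbro": re.compile(r"^has+bro\b", re.IGNORECASE),
    "Mattel": re.compile(r"^mat+el+\b", re.IGNORECASE),
    "Hot Toys": re.compile(r"^hot\s*toys?\b", re.IGNORECASE),
    "NECA": re.compile(r"^n\.?e\.?c\.?a\.?$", re.IGNORECASE),
}


def normalize_brand(raw: Optional[str]) -> str:
    """Return a canonical brand name for a raw, seller-entered value."""
    if raw is None or not raw.strip():
        # Blank or omitted field; kept separate from seller-entered values.
        return "Unspecified"
    cleaned = re.sub(r"\s+", " ", raw.strip())
    for canonical, pattern in BRAND_PATTERNS.items():
        if pattern.search(cleaned):
            return canonical
    # No known variant matched: keep the seller's value, whitespace-cleaned.
    return cleaned


for value in ["hasbro inc.", "Hassbro", " Mattell ", "hottoys", "N.E.C.A.", "", "Kenner"]:
    print(repr(value), "->", normalize_brand(value))
```

Each pattern is anchored at the start of the string so that a brand name appearing mid-sentence in a free-text entry is not silently reassigned.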

Two obvious "brands" stand out: Marvel and Star Wars. These are not in fact brands but franchises/intellectual property (incidentally, both belonging to Disney). This illustrates that user-populated data can be "lazy": the seller enters whatever is foremost and easiest to identify. Correcting for this is highly complex, and depends mainly on whether the seller provided the brand name in either the auction title or the description. So, for the scope of the project, I left these "brands" intact.

The illustrations below were made with a combination of the seaborn, WordCloud, and plotly packages (unfortunately, the interactive nature of plotly's graphs is lost when translating to a blog post).


A cursory look at sales by brand shows a very clear trend: toy brand Hasbro is a powerhouse:

Hasbro's sales volume through 48 hours’ worth of data is more than that of the next 3 largest brands combined, with shoppers purchasing over $150k worth of new and used toys. Of note is the defunct Kenner in 4th place, with roughly $30k, implying that collectors are driving those particular figures. This is a bit clearer when viewing sales by brand broken down by condition.


Together, these 20 brands constitute the large majority of the 2-day sales data. The collector segment is well represented, with used purchases driving sales for LJN, Mego, and Kenner, all either defunct or absorbed.

An inspection of the top 5 selling sub-categories shows interesting results.

Had there been a larger overall sample of data spanning at least 6 months, we could pose a hypothesis test with the null hypothesis (H0) that toy sales are independent of current trends in film, television, streaming media, and other platforms. Even without that test, the disparity in sales is large enough to infer that licensing and IP in the form of strongly supported media franchises do exceptionally well. I should note that while 'brand' is user-provided and thus inconsistent, the sub-categories shown are mandatory fields, so we can trust these segmentations with full confidence.


The only confusion would be how much overlap there is between "Comic Book Heroes" and "TV, Movies & Video Games." That is currently beyond the scope of the project, but the question is interesting. 

It is also noteworthy that the third-best-selling category, "Transformers & Robots," is kept distinct from "Military & Adventure," a point that will be revisited shortly.

I shift to a slightly more "bidder/buyer-centric" view here. This facet (from plotly) shows the average selling price, broken out by "buy-it-now" and "auction" format, classified by new/used condition. The top 5 categories span the columns, while brands populate the rows. Notice that NECA is included while the brand "Marvel" is excluded; I did this to make this plot strictly brand (i.e., manufacturer/producer)-based rather than franchise/IP-based. As such, the presence of "Unbranded" represents knock-off and unlicensed toys. 
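The aggregation behind a facet like this can be sketched without any plotting library: group the listings by category, brand, sale format, and condition, then average the selling price within each group. The field names, the `average_price` helper, and the sample rows below are hypothetical stand-ins, not the project's actual schema or data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical rows; real listings would come from the scraped data set.
listings = [
    {"category": "Transformers & Robots", "brand": "Hasbro",
     "format": "auction", "condition": "used", "price": 30.0},
    {"category": "Transformers & Robots", "brand": "Hasbro",
     "format": "auction", "condition": "used", "price": 50.0},
    {"category": "Military & Adventure", "brand": "Mattel",
     "format": "buy-it-now", "condition": "used", "price": 120.0},
    {"category": "Comic Book Heroes", "brand": "NECA",
     "format": "buy-it-now", "condition": "new", "price": 45.0},
]


def average_price(
    rows: Iterable[dict],
    keys: Sequence[str] = ("category", "brand", "format", "condition"),
) -> Dict[Tuple[str, ...], float]:
    """Average selling price for every combination of the grouping keys."""
    groups: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["price"])
    return {group: mean(prices) for group, prices in groups.items()}


averages = average_price(listings)
for group, avg in sorted(averages.items()):
    print(group, round(avg, 2))
```

Excluding franchise "brands" such as Marvel, as described above, would simply mean filtering the rows before they are passed to the function.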

Some takeaways from the plot: collectors drive the highest prices, with Mattel's proprietary IP, "Masters of the Universe," commanding buy-it-now (BIN) prices in excess of $100 per item on average within the Military & Adventure category. The "Transformers & Robots" category is where buyers have the strongest presence for the Hasbro brand in terms of average price. This implies sellers are uncertain of the value of their goods and opt for price discovery through auctions, with bidders meeting their asks.

Bid dispersion for the top 5 categories across all brands is strongest in the "TV, Movies & Video Games" (TVMVG) category, but the Transformers & Robots category exhibits the most bids at the 75th percentile (slightly under 25 bids).


When we shift to looking at the top 5 categories with only the top brands, competition is less frequent but is highly concentrated in the Comic Book Heroes category. Again, this suggests sellers are uncertain of the value of their goods, but when viewed in conjunction with the average selling price above, toys in the Comic Book Heroes category are relatively inexpensive and mostly new.

An outlier from the top brands is Hot Toys, which required its own plot.


Focusing strictly on the high-end collectors' market, buyers are more than willing to pay on average $200 and up per item.

I examined the top brands a bit more, curious to see how the brands were faring in the top 5 categories. Again, the graph was particularly illustrative for Hasbro.

This graph was done in plotly, so unfortunately some details are lost with the static image. The vertical bars within the color segments signal demarcations between used and new sales within each category. Still, it's apparent how heavily concentrated every brand is in the TVMVG category, though again with the exception of Hasbro. 


Hasbro has strong diversification away from the comic book/movie-related franchises, largely due to its own proprietary IPs: GI Joe and Transformers. A WordCloud pull (shown above) from all auction titles across all brands illustrates just how strong Hasbro and its IPs/licenses are, with "Star Wars" being an extremely common string in auction titles, along with "Marvel" and Hasbro's name itself.

Restricting the WordCloud to only listings with Hasbro indicated as the brand yields similar results, with "Spider-Man" and "Optimus Prime" even showing up. Again, the takeaway is clear: Hasbro is the "best diversified" of the top brands, with a seemingly unbeatable combination of top licenses (Star Wars, Marvel) and proprietary IP (Transformers, GI Joe).
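A word cloud sizes each term by how often it occurs, so the underlying numbers can be approximated with a simple frequency count over listing titles. The sketch below is not how the figures in this post were produced; the stopword list, the `title_term_counts` helper, and the sample titles are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical stopwords; a real list would be tuned to eBay listing titles.
STOPWORDS = {"the", "and", "of", "new", "lot", "with", "in"}


def title_term_counts(titles: Iterable[str], ngram: int = 1) -> Counter:
    """Count how often each term (or n-word phrase) appears across titles."""
    counts: Counter = Counter()
    for title in titles:
        tokens = [t for t in re.findall(r"[a-z0-9]+", title.lower())
                  if t not in STOPWORDS]
        # Slide a window of `ngram` consecutive tokens across the title.
        grams = zip(*(tokens[i:] for i in range(ngram)))
        counts.update(" ".join(gram) for gram in grams)
    return counts


# Hypothetical titles standing in for scraped listings.
sample_titles = [
    "Hasbro Star Wars Black Series Figure",
    "Star Wars Vintage Kenner Figure",
    "Marvel Legends Spider-Man Hasbro",
]

unigrams = title_term_counts(sample_titles)
bigrams = title_term_counts(sample_titles, ngram=2)
print(unigrams.most_common(3))
print(bigrams.most_common(1))
```

Counting two-word phrases is what lets a string like "star wars" surface as a single entry rather than as two unrelated words.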


Mattel is the next most ‘well-rounded’ brand after Hasbro, with strong support from collectors for its proprietary IP, ‘Masters of the Universe’, as well as contemporary DC and sports/WWE lines. One very significant factor to consider: eBay does not include Barbie in its Action Figure category, instead dedicating an entire section under "Dolls" to Barbie figures.

Hot Toys

Hot Toys licenses movie- and TV-show-related properties to produce high-end goods. Its market, as shown by the high average ending prices, is quite niche. License/IP-heavy, it operates almost exclusively within comic book and movie-related categories, with strong support from the Star Wars license.


NECA has the same strategy as Hot Toys, with IP-heavy licensing, but at the opposite end of the pricing spectrum: the average closing price for its goods is in the $50 range vs. Hot Toys’ high $200s to low $500s per sale. Thus it caters to a niche market that is alienated by Hot Toys' high price point, while avoiding the crushing weight of Hasbro and Mattel with their comic-book franchise licenses. Unopened figure listings sell particularly well.

Further Work and Closing Summary

The data presented is only weakly indicative of any overarching conclusion, due to the extremely small sampling period: essentially only 2 full days of sales. However, once repeated sampling periods are collected, much more comprehensive analysis can follow, such as predictive pricing, correlation analysis, and hypothesis testing.

A strength of eBay data over that of Amazon/Walmart is the ability to gauge immediate consumer interest in a given brand/IP on a real-time basis; you cannot tell when someone buys a toy on Amazon/Walmart. If a scraping package could be put together to incorporate all 3 websites, I imagine the trends and insights would be very interesting indeed.

A more robust cleaning methodology would contribute to better results; however, as they stand now, they are directionally correct and certainly within ‘ball-park’ range. A text-matching algorithm could be used to extract the ‘franchise’ from the listing title, since the franchise field is frequently omitted in the user-submitted details.
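One simple form such text matching could take is a keyword lookup: each franchise gets a list of names that signal it, and the first franchise whose names appear in the title wins. The alias table below is a hypothetical starting point built from names mentioned in this post, and `extract_franchise` is an illustrative helper rather than an existing implementation.

```python
import re
from typing import Dict, List, Optional

# Hypothetical alias table; a real one would be far larger and hand-curated.
FRANCHISE_ALIASES: Dict[str, List[str]] = {
    "Star Wars": ["star wars"],
    "Marvel": ["marvel", "iron man", "thanos", "spider-man", "spiderman"],
    "Transformers": ["transformers", "optimus prime"],
    "GI Joe": ["gi joe", "g.i. joe"],
    "Masters of the Universe": ["masters of the universe"],
}


def build_patterns(aliases: Dict[str, List[str]]) -> Dict[str, "re.Pattern"]:
    """Compile one case-insensitive, whole-word pattern per franchise."""
    return {
        franchise: re.compile(
            r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
            re.IGNORECASE,
        )
        for franchise, names in aliases.items()
    }


PATTERNS = build_patterns(FRANCHISE_ALIASES)


def extract_franchise(title: str) -> Optional[str]:
    """Return the first franchise (in table order) found in the title."""
    for franchise, pattern in PATTERNS.items():
        if pattern.search(title):
            return franchise
    return None


for title in ["Hasbro Black Series STAR WARS figure",
              "Optimus Prime G1 loose",
              "Vintage wrestling figure lot"]:
    print(title, "->", extract_franchise(title))
```

Because the first match in table order wins, a title that names two franchises is assigned to whichever comes first; a fuzzier matcher would be needed to handle misspellings.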

As mentioned earlier, Marvel and Star Wars were frequently populated in the ‘brand’ field, despite neither being a dedicated toy brand/maker. This suggests that, for long-standing IPs with media/film support, there is a customer segment that is brand-agnostic and more franchise-aware: they do not care which brand holds the license to make the franchise's toys, only that the franchise continues to be made available for toy purchase. Strong sales of ‘unbranded’/knock-off figures support this. However, for the brand, the franchise is clearly of high importance.

Case in point: Mattel has allowed their DC license to expire, and analysts postulate they will attempt to wrest control of the Star Wars and Marvel IPs from Hasbro…


About Author

Emanuel Pizana

An insight-driven data product with the proper context and intuition can really create bridges between Data Science and Business. A former finance professional with over 10 years of experience, I've spent time working with both Finance and Business...