Data Used for Football Transfers - A Study in Credibility

Posted on Oct 25, 2021

The data science skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Data used in Drinking From The Firehose

One obvious but important feature of the Internet Age is data overload. For just about any conceivable subject, no matter how obscure it seems, there is an overabundance of information. And having more data may or may not be better, even if we could somehow process all of it. Depending on how information reaches us and how we curate it, our most convenient, readily available sources may in fact be less reliable than sources that are more obscure to us.

Accordingly, there is tremendous value in being able to correctly assess the credibility of a source of information, both in economic terms and in our day-to-day personal lives.

What's Important About Football

In that vein, we attempt to develop heuristics for identifying reliable information, and reliable information sources, in the realm of football (soccer) transfers. For the reasons we just went over, this project has value even for those who don't care about football or football transfers. There is only one background point specific to football that needs to be understood for this project: for the bigger football teams in Europe, roster composition is largely determined by the transfer market.

That is, a player under contract with one club is sold to another for a fee agreed between the two clubs, subject to the player and the buying club also agreeing to a new contract.

In this project we consider one club in particular, Chelsea Football Club, based in West London and one of the more prominent clubs in England. And we consider one transfer window in particular, that is, a specific period during which the football authorities allow transfers between clubs: in this case, the summer window of 2020.

During this period Chelsea bought (or in a couple of cases acquired for free) several prominent players: Kai Havertz, Timo Werner, Ben Chilwell, Hakim Ziyech, Edouard Mendy and Thiago Silva. And there were other players whom Chelsea could have bought, or considered buying, but didn't: Sergio Reguilon, Jadon Sancho, Moussa Dembele, Jan Oblak, Kalidou Koulibaly, Raphael Varane, Nicolas Tagliafico, Declan Rice and Dean Henderson.

Data About Football Transfers

Given this transfer activity as a premise, is there any way to assess the likelihood of Chelsea buying one player or declining to buy another as a function of the press reports of Chelsea's transfer activity, or news regarding the club as a whole for that matter?

Data on Guardian

To that end, we consider data from The Guardian, a major UK newspaper, and in particular all of its football content from Apr 1, 2020 through Oct 31, 2020, a period containing that summer's transfer window. This data set added up to approximately 3000 pieces of content and 60 MB of data.

Searching through this data set, we consider a match to be a piece of content where the player's name is mentioned, as well as the string "Chelsea".  The guiding logic is the hope that if The Guardian repeatedly mentions "Chelsea" and a given player significantly more often than an otherwise comparable player, it is a positive indication that Chelsea will acquire that player.  And in fact, that's exactly what we found:


Data on Chelsea

The players Chelsea actually bought, represented in the red bars above, were mentioned much more frequently than the other plausible targets.
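The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the project: the function name `count_co_mentions`, the toy article texts and the placeholder player names are all assumptions, and the matching is a plain case-sensitive substring test.

```python
def count_co_mentions(articles, player, club="Chelsea"):
    """Count the pieces of content that contain both the player's name and the club string."""
    return sum(1 for text in articles if player in text and club in text)


# Toy stand-ins for Guardian content (hypothetical, not real articles).
articles = [
    "Chelsea are close to signing Player A from his current club.",
    "Player A scored twice at the weekend.",
    "Chelsea have cooled their interest in Player B.",
    "Player A completes his move to Chelsea.",
]

counts = {name: count_co_mentions(articles, name) for name in ["Player A", "Player B"]}
print(counts)
```

A plain substring test would miss spelling variants of a name, such as accented versus unaccented forms, so a real run would probably need some normalization first.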

And in case this was merely an artifact of my ad hoc choices of Chelsea's plausible targets, I extended this methodology to a wider set of the world's prominent footballers. To obtain this wider list, I again borrowed content from The Guardian.

Every year, The Guardian publishes a list of the world's Top 100 Footballers, voted on by its football writers and a few others to whom it grants votes. Conveniently for me, The Guardian publishes not just the Top 100 list, but also a link to a Google sheet which tabulates all of the underlying votes. From this sheet, we have access to a modicum of information for 440 of the world's most prominent footballers, i.e., all the footballers who received at least one vote in this survey.

And, compared to this wider set of players, we see the same phenomenon still in play:

The players Chelsea acquired were mentioned not merely more than an ad hoc group of targets, but more than nearly every other footballer in this wider set. Imagine, if you will, the chart above extending to the right for 440 columns, where each column represents the number of times that player's name was mentioned in The Guardian in the same piece of content as the string "Chelsea". In order to prevent the bars from being impossibly narrow, I truncated this chart to 40 players (and only one would-be red bar didn't make the cut).
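The ranking and truncation step might look something like the sketch below. The helper `top_n_by_mentions`, the counts and the player names are hypothetical placeholders, not figures from the actual data set.

```python
def top_n_by_mentions(counts, acquired, n=40):
    """Rank players by co-mention count, keep the top n, and flag the acquired ones."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, name in acquired) for name, count in ranked[:n]]


# Hypothetical counts for illustration only.
example_counts = {"Player A": 12, "Player B": 3, "Player C": 7, "Player D": 1}
example_acquired = {"Player A", "Player C"}

top_three = top_n_by_mentions(example_counts, example_acquired, n=3)
print(top_three)
```

The boolean flag in each tuple plays the role of the red bars: it marks which of the top-ranked players were actually acquired.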

All Good Things Must Come To An End

Unfortunately, these seemingly promising results are not actually meaningful, due to a severe methodological error on my part. For the players Chelsea actually acquired, most of the mentions occurred after Chelsea had already bought the player. Therefore, those mentions have little if any inferential value as to whether Chelsea will buy that player in the future.

This is shown by the blue portions of the bars in the chart above.
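One way to make this check explicit is to split each player's mentions at the date the transfer was completed. The sketch below assumes each matching piece of content carries a publication date. The function name `split_mentions` and all dates shown are placeholders for illustration, not the real dates.

```python
from datetime import date


def split_mentions(mention_dates, transfer_date):
    """Return (before, after) counts; mentions on the transfer date itself count as after."""
    before = sum(1 for d in mention_dates if d < transfer_date)
    after = len(mention_dates) - before
    return before, after


# Placeholder publication dates for one hypothetical acquired player.
mention_dates = [
    date(2020, 5, 10),
    date(2020, 6, 1),
    date(2020, 7, 15),
    date(2020, 8, 2),
    date(2020, 9, 20),
]
transfer_date = date(2020, 7, 1)  # placeholder, not a real announcement date

before, after = split_mentions(mention_dates, transfer_date)
print(before, after)
```

Only the "before" count could carry any predictive signal, so a fair comparison with the players Chelsea did not buy would have to use that count alone.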

The unfortunate conclusion to this research is that if we want to develop a truly credible source of information with predictive value as to Chelsea's actions in the transfer market, we will have to use a much more sophisticated methodology than what I have done here.


About Author

John Kosmicke

John is a quantitative technologist with experience in high-frequency trading and recruiting. He holds a master's degree from Iowa State and a bachelor's from Chicago.