Data Used for Football Transfers - A Study in Credibility
Data used in Drinking From The Firehose
One obvious though important feature of the Internet Age is data overload. For just about any conceivable subject, no matter how obscure it seems, there is an overabundance of information pertaining to it. And having more data may or may not be better, even if we could somehow process all of it. Depending on how information reaches us and how we curate it, our most convenient, readily available sources of information may in fact be less reliable than other sources which are more obscure to us.
Accordingly, there is tremendous value in being able to correctly assess the credibility of a source of information, both in economic terms and in our day-to-day personal lives.
What's Important About Football
In that vein, we attempt to develop heuristics toward reliable information, and reliable information sources, in the realm of football (soccer) transfers. For the reasons we just went over, this project has significant value even for those who don't care about football, or football transfers. And there is basically only one background point specific to football that needs to be understood for this project: for the bigger football teams in Europe, roster composition is largely determined by the transfer market.
That is, a player under contract with one club is sold to another for a fee agreed between the two clubs, subject to the player also agreeing a new contract with the buying club.
In this project we consider one club in particular, Chelsea Football Club, based in West London and one of the more prominent clubs in England. And we consider one transfer window in particular, that is, a specific period of time during which the football authorities allow transfers between clubs: in this case, the summer window of 2020.
During this period Chelsea bought (or in a couple of cases acquired for free) several prominent players: Kai Havertz, Timo Werner, Ben Chilwell, Hakim Ziyech, Edouard Mendy, Thiago Silva. And there were other players who Chelsea could have bought or considered buying but didn't: Sergio Reguilon, Jadon Sancho, Moussa Dembele, Jan Oblak, Kalidou Koulibaly, Raphael Varane, Nicolas Tagliafico, Declan Rice, Dean Henderson.
Data about football transfer
Given this transfer activity as a premise, is there any way to assess the likelihood of Chelsea buying one player or declining to buy another as a function of the press reports of Chelsea's transfer activity, or news regarding the club as a whole for that matter?
Data on Guardian
To that end, we consider data from The Guardian, a major UK newspaper, and in particular all of its football content from Apr 1, 2020 through Oct 31, 2020, a period containing that summer's transfer window. This data set added up to approximately 3000 pieces of content, occupying about 60 MB.
Searching through this data set, we consider a match to be a piece of content where the player's name is mentioned, as well as the string "Chelsea". The guiding logic is the hope that if The Guardian repeatedly mentions "Chelsea" and a given player significantly more often than an otherwise comparable player, it is a positive indication that Chelsea will acquire that player. And in fact, that's exactly what we found:
Data on Chelsea
The players Chelsea actually bought, represented in the red bars above, were mentioned much more frequently than the other plausible targets.
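The matching rule described above (a piece of content counts as a match when it contains both the player's name and the string "Chelsea") can be sketched roughly as follows. This is a minimal illustration rather than the code actually used for the project: the function name `count_co_mentions`, the placeholder player names, and the toy article strings are all invented for the example, and it assumes each piece of content is available as a plain string.

```python
def count_co_mentions(articles, players, club="Chelsea"):
    """For each player, count the articles that mention both the club and the player.

    `articles` is an iterable of plain-text strings (an assumption about the
    data layout). Matching is a simple case-sensitive substring test.
    """
    counts = {player: 0 for player in players}
    for text in articles:
        if club not in text:
            continue  # no club mention, so this article cannot be a match
        for player in players:
            if player in text:
                counts[player] += 1
    return counts


# Invented toy data, for illustration only.
articles = [
    "Chelsea are linked with Player A.",
    "Player A scored twice; Player B was injured.",  # no "Chelsea": not a match
    "Chelsea beat their rivals; Player A and Player B both featured.",
    "Player B signs a new contract.",                # no "Chelsea": not a match
]
players = ["Player A", "Player B"]

counts = count_co_mentions(articles, players)

# Rank players from most to least mentioned, as in the bar chart.
ranking = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

A plain substring test like this is crude: it is case-sensitive and will miss surname-only mentions or alternative spellings of a name, so real counts depend on how the names are written in the search list.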
And in case this was merely an artifact of my ad hoc choices of Chelsea's plausible targets, I extended this methodology to a wider set of the world's prominent footballers. To obtain this wider list, again we borrowed content from The Guardian.
Every year, The Guardian publishes a list of the world's Top 100 Footballers, voted on by its football writers and a few others to whom it grants votes. Conveniently for me, The Guardian publishes not just the Top 100 list, but also a link to a Google sheet which tabulates all of the underlying votes. From this sheet, we have access to a modicum of information for 440 of the world's most prominent footballers, i.e., all the footballers who received at least one vote in this survey.
And, compared to this wider set of players, we see the same phenomenon still in play:
The players Chelsea acquired were mentioned not merely more than an ad hoc group of targets, but more than every other footballer in the world. Imagine, if you will, the chart above extending to the right for 440 columns, where each column represents the number of times that player's name was mentioned in the Guardian in the same piece of content as the string "Chelsea". In order to prevent the bars from being impossibly narrow, I truncated this chart to 40 players (and only one would-be red bar didn't make the cut).
All Good Things Must Come To An End
Unfortunately, these seemingly promising results are not actually meaningful, due to a severe methodological error on my part. That is, for the players Chelsea actually acquired, most of the mentions occurred after Chelsea had already bought the player. Therefore, those mentions contain little if any inferential value as to whether Chelsea will buy that player in the future.
To see this, look at the blue proportions in the bars of the chart above.
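The flaw can be made concrete by splitting each player's mentions at the date the transfer was completed. The sketch below is one hypothetical way to do that, not the code behind the chart: it assumes each piece of content carries a publication date, and the function name `split_mentions_by_date`, the placeholder player, the signing date, and the toy articles are all invented for illustration.

```python
from datetime import date


def split_mentions_by_date(articles, player, signing_date, club="Chelsea"):
    """Count co-mentions of `club` and `player` before and after `signing_date`.

    `articles` is an iterable of (publication_date, text) pairs, which is an
    assumption about how the content is stored.
    Returns a (before, after) tuple; mentions on the signing date count as "after".
    """
    before = after = 0
    for published, text in articles:
        if club in text and player in text:
            if published < signing_date:
                before += 1
            else:
                after += 1
    return before, after


# Invented toy data and a hypothetical signing date, for illustration only.
articles = [
    (date(2020, 6, 1), "Chelsea are linked with Player A."),
    (date(2020, 7, 15), "Chelsea agree a fee for Player A."),
    (date(2020, 9, 1), "Player A starts for Chelsea."),
    (date(2020, 9, 20), "Player A scores for Chelsea."),
    (date(2020, 10, 3), "Chelsea win again, with Player A on the bench."),
    (date(2020, 10, 10), "Player B scores elsewhere."),
]
signing_date = date(2020, 8, 1)

before, after = split_mentions_by_date(articles, "Player A", signing_date)
print(f"before signing: {before}, after signing: {after}")
```

Only the "before" count could carry any predictive signal. Once a player has joined the club, routine match coverage will mention him alongside "Chelsea" regardless of how the transfer came about.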
The unfortunate conclusion to this research is that if we want to develop a truly credible source of information with predictive value as to Chelsea's actions in the transfer market, we will have to use a much more sophisticated methodology than what I have done here.