Video is all you need
Unstructured data, LLMs, and vocabulary
This is the age of unstructured data. There’s your pull-quote. In the real world, ‘unstructured data’ often means ‘words’, and part of the large language model hype is that they can provide an interface for this. In football, the more important world, the big mass of unstructured data would be video.
What we call ‘event data’ - the shots, the passes - is just a ‘structured data’ way of viewing matches. But the game tape - yes, plus wordy things like scout reports or coach analysis - is unstructured.
For all the benefits of structured data, it takes time to structure, into a system that you will inevitably hate when it’s just too late to change it. (something something data engineering, too). So why not just squeeze the juice straight out of the unstructured lemon?
Unstructured lemon juice like goalkeeper save technique from video, or goalkeeper save timing from video. (Goalkeepers don’t get enough attention, but are also on camera in just the right way for body pose detection). If outfielders are more your cup of tea, here’s a 2019 paper on body orientation from video.
Any other research like this I should know about?
If we’re being picky about metaphors, video and LLMs aren’t equivalents. Video and words aren’t even the direct equivalent. In football, tracking data is a form of unstructured data that has been around for a long time; working with the video directly is like working with audio. But the point remains the same: unstructured data.
However, to leverage this unstructured data you first need the game film. As Nancy Hensley pointed out at the recent Hudl Statsbomb conference, the lower level (in quality and existence) of women’s football coverage on TV affects not just fan engagement but data collection. In Belgium, they’re making a big play of installing cameras and getting data on everything that moves.
This harks back to 2022 Get Goalside, ‘How football competitions are their own competition’:
“Just like in the television industry, football leagues are now competing much more directly with their overseas equivalents. This is why La Liga (not just Real Madrid or Barcelona) are taking it upon themselves to complain about Paris Saint-Germain's and Manchester City's finances. It's also why they have their own analysis and visualisation tool, Mediacoach, which forms part of LaLiga Tech, which launched last September. [2024 ed: now called Sportian, and is part of the Belgian Pro League deal]. All a way of trying to make sure that theirs is the best product around.
On a slightly different scale, the relatively recently-formed Canadian Premier League has made a concerted effort to help the entire competition with its own CPL in-house analysts and expertise.”
As alluded to above, leagues centralising technical advancement is something that makes a lot of sense, not to dictate usage (which would likely stunt innovation) but to set a minimum standard. Although it does need to be a reasonable minimum standard. From Ian Graham’s How to Win the Premier League: “We also received tracking data for all UEFA games, but until 2021 UEFA did not exercise any quality control over it, so we could not trust it.”
(Related reading recommendation, The Formula by Joshua Robinson and Jonathan Clegg: “After nearly seven decades of [Formula One] teams fighting tooth and nail for every advantage […] Liberty presented them with a new reality. Instead of being rivals, these teams had to understand once and for all that they were all in business with each other.”)
The ‘pivot to unstructured data’ creates another interesting dynamic.
For a long time, access to data has been an issue for wannabe analysts or researchers. In their lifetime to date as a data provider, Statsbomb have been admirable in the amount they’ve made openly available. But if DIY collection from video takes off (not necessarily for tracking data - you could imagine someone taking this and deciding to create a shot-detection system), that would open interesting doors.
But to where?
Well, if video/tracking data is a rough equivalent of words and text analysis, maybe the recent use of generative AI can give some pointers.
There are two indisputably ‘successful’ use cases for genAI. One is coding assistants (for the more popular languages); another, though ‘successful’ is a loaded term, is art. Now, the development of lucrative tools built on scraped art, made by the profession who’ll be undercut by said tools, is the type of societal wrinkle you’d find in a dystopian novel. But these tools - think of Photoshop’s generative fill feature rather than entire artworks if it helps - produce convincing results. In the hands of artists, they produce art. In the hands of schmucks, they don’t. A lot of early genAI ‘art’ production was schmuckery.
These two use cases make sense when you think of how these kinds of generative AI systems work. ‘Art’ does not follow the same rules of ‘factual accuracy’ that so much of the rest of the world does (photorealism, and adherence to specific styles, aside). Where else but art could Caravaggio and Kahlo, Rothko and Ruysch exist as greats? Certainly not business chatbots. Coding, meanwhile, has a much stricter sense of ‘accuracy’ but a far more limited ‘vocabulary’. LLMs work by predicting the probability of the next word (strictly, the next token); English has an estimated 170,000 words, coding languages will have far fewer (hats off to the Reddit user who asked this question a few years ago).
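To put a rough number on the vocabulary point, here is a back-of-the-envelope sketch. It assumes the crudest possible predictor, one that treats every word in the vocabulary as equally likely, which no real LLM does. The 170,000 figure is the English estimate quoted above; `code_vocab` is a hypothetical placeholder, not a measured count for any programming language.

```python
import math


def uniform_guess_bits(vocab_size: int) -> float:
    """Bits of uncertainty per word if all vocab_size words were equally likely."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    return math.log2(vocab_size)


english_words = 170_000  # estimate quoted in the text
code_vocab = 1_000       # hypothetical placeholder, purely for illustration

print(f"English: {uniform_guess_bits(english_words):.1f} bits per word")
print(f"Code:    {uniform_guess_bits(code_vocab):.1f} bits per word")
```

The smaller the vocabulary, the fewer bits a blind guess has to cover, which is one loose way of seeing why a constrained language is an easier prediction target.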
So, to football.
At some point in the past year, I heard someone with a lot of experience (on a podcast episode, I think, that I now can’t find) give a warning about tracking data. They said that it’s tempting to go after the gold mines of off-ball metrics, but that this could be a red herring. A marsh that one would sink into.
I’m embellishing slightly, but I think the tendency of work using tracking data to split player movement into ‘runs’ is telling. Some of it is to create physical metrics, some of it is around concepts like ‘running in behind’ or ‘overlapping runs’: clear concepts, clear vocabulary, turning unstructured data into well-understood structured datapoints.
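As a minimal sketch of what ‘splitting movement into runs’ can look like, the snippet below takes one player's per-frame speed and groups consecutive fast frames into run records. Everything here is an assumption for illustration: the `Run` record, the `detect_runs` name, the 5 m/s threshold and the minimum length are made up, not taken from any data provider or published method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    start_frame: int   # first frame at or above the speed threshold
    end_frame: int     # last frame at or above the threshold (inclusive)
    peak_speed: float  # highest speed within the run, in m/s


def detect_runs(speeds: List[float],
                min_speed: float = 5.0,   # hypothetical threshold, m/s
                min_frames: int = 3) -> List[Run]:
    """Group consecutive frames at or above min_speed into Run records."""
    runs: List[Run] = []
    start: Optional[int] = None
    for i, speed in enumerate(speeds):
        if speed >= min_speed:
            if start is None:
                start = i  # a candidate run begins here
        elif start is not None:
            # the candidate run ended on the previous frame
            if i - start >= min_frames:
                runs.append(Run(start, i - 1, max(speeds[start:i])))
            start = None
    # handle a run that is still going when the data ends
    if start is not None and len(speeds) - start >= min_frames:
        runs.append(Run(start, len(speeds) - 1, max(speeds[start:])))
    return runs


speeds = [1.0, 2.0, 6.0, 7.0, 8.0, 2.0, 6.0, 1.0, 5.5, 6.0, 7.0]
for run in detect_runs(speeds):
    print(run)
```

The point is less the code than the choices baked into it: someone had to decide what speed counts, and for how long, before a ‘run’ existed as a datapoint at all.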
But there’s so much that isn’t well understood, or well verbalised, in football. Leander Forcher recently released his PhD dissertation on ‘success factors in soccer defense’, much of which highlighted the lack of pre-existing work in defensive analysis. Lack of pre-existing work often means lack of clear understanding of terms. (‘What if we’d focused on different parts of the sport’ is a continuing theme of Get Goalside.)
We’re in the age of unstructured data. We’re also in the toddling age of learning how to get the best use out of unstructured data.
The only thing that’s certain is that we’ll need more data engineers.