Foundational Model
A La Liga expected goals case study built with an R-first statistical workflow: interpretable probability models, season-based validation, calibration-led evaluation, and a small freeze-frame context upgrade, packaged as reusable shot-quality infrastructure.
How should shot quality be quantified from open event data so that analysts can separate chance quality from finishing and reuse the outputs in later forecasting and player models?
Test Sample
839 shots
The 2020/2021 holdout set contains 839 shots across 35 open-data matches.
Best Log Loss
0.3507
The logistic model is currently the best first-pass model on log loss.
Best Brier Score
0.0943
The same logistic model also leads on Brier score after the freeze-frame feature upgrade.
Estimate the probability that an individual shot results in a goal using public event data from La Liga.
Treat expected goals as a probability estimation problem rather than a classification exercise, because the real requirement is calibrated shot-quality estimates that analysts can trust.
Package the resulting xG values as reusable infrastructure for later team-strength, live win-probability, and player-rating work.
Raw goals and shot counts confound chance quality with finishing variance and goalkeeper outcomes, which makes them weak tools for evaluating attacking process.
A club-level xG model helps analysts benchmark chance creation, identify over- or under-performance, and build cleaner inputs for forecasting and player evaluation.
This project is intentionally scoped as a foundational model: it matters on its own, but it also underpins later portfolio projects.
The project uses StatsBomb Open Data, focused on La Liga to keep the competition context coherent and the scope manageable.
The current first pass contains 1,591 shots across 68 open-data matches, built from immutable raw JSON into one canonical shot table with geometry, event context, and freeze-frame-derived fields.
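The geometry fields can be derived directly from StatsBomb's 120 × 80 pitch coordinates, with the goal line at x = 120 and the posts at y = 36 and y = 44. A minimal R sketch with hypothetical names (not the project's actual schema):

```r
# Hypothetical helper: shot geometry from StatsBomb pitch coordinates.
# Distance is to the goal centre (120, 40); the angle is the visible
# width of the 8-unit goal mouth, via the law of cosines on the posts.
shot_geometry <- function(x, y) {
  dist_to_goal <- sqrt((120 - x)^2 + (40 - y)^2)
  a <- sqrt((120 - x)^2 + (36 - y)^2)  # distance to near post
  b <- sqrt((120 - x)^2 + (44 - y)^2)  # distance to far post
  angle <- acos(pmin(1, pmax(-1, (a^2 + b^2 - 8^2) / (2 * a * b))))
  data.frame(dist_to_goal = dist_to_goal, goal_angle = angle)
}

shot_geometry(x = 108, y = 40)  # a central shot 12 units from goal
```

Vectorised inputs fall out for free, so the same helper can populate the canonical shot table in one pass.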
The current implementation uses 2019/2020 as the training season and 2020/2021 as the test season, preserving a realistic temporal split while also making the dataset limitation explicit.
The model stack is deliberately simple and defensible. A naive conversion-rate baseline establishes the floor, a logistic regression provides an interpretable statistical baseline, and a GAM is the main model because shot geometry effects are nonlinear and the job spec explicitly values GAM competence.
The first implementation then adds a deliberately small freeze-frame upgrade: two features capturing defender count and goalkeeper distance. That gives the model some shot-context signal without pretending this is full tracking-data analysis.
More complex machine-learning models are intentionally excluded from the first version because they would increase implementation time, reduce interpretability, and weaken the portfolio unless validated to a much higher standard.
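The three-model stack and the season split can be sketched in a few lines of R. The simulated `shots` table and its column names stand in for the canonical shot table, which is not reproduced here; `mgcv` supplies the GAM:

```r
library(mgcv)  # GAM fitting (recommended package shipped with R)

# Simulated stand-in for the canonical shot table; column names
# (season, dist_to_goal, goal_angle, is_goal) are illustrative.
set.seed(42)
n <- 600
shots <- data.frame(
  season       = sample(c("2019/2020", "2020/2021"), n, replace = TRUE),
  dist_to_goal = runif(n, 3, 35),
  goal_angle   = runif(n, 0.1, 1.2)
)
shots$is_goal <- rbinom(n, 1, plogis(-0.5 - 0.12 * shots$dist_to_goal +
                                       1.5 * shots$goal_angle))

# Season-based holdout: train on 2019/2020, evaluate on 2020/2021.
train <- subset(shots, season == "2019/2020")
test  <- subset(shots, season == "2020/2021")

# 1. Naive baseline: training-season conversion rate for every shot.
p_base <- rep(mean(train$is_goal), nrow(test))

# 2. Interpretable logistic regression on shot geometry.
fit_glm <- glm(is_goal ~ dist_to_goal + goal_angle,
               family = binomial(), data = train)
p_glm   <- predict(fit_glm, newdata = test, type = "response")

# 3. GAM with smooth terms, allowing nonlinear geometry effects.
fit_gam <- gam(is_goal ~ s(dist_to_goal) + s(goal_angle),
               family = binomial(), data = train)
p_gam   <- predict(fit_gam, newdata = test, type = "response")
```

The same three prediction vectors then feed directly into the scoring-rule and calibration diagnostics.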
Validation should reflect how the model would behave on future football data, so the main evaluation design is season-based holdout rather than random train-test splitting.
Performance will be judged using proper scoring rules and calibration diagnostics, because a useful xG model must assign credible probabilities rather than simply rank shots well.
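Both headline metrics are proper scoring rules, minimised in expectation by reporting the true scoring probability, so they reward calibration rather than mere ranking. In R (function names are this sketch's, not the project's):

```r
# Log loss: heavily penalises confident wrong predictions.
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Brier score: mean squared error of the probabilities.
brier <- function(y, p) mean((y - p)^2)

# A constant 10% prediction against one goal in ten shots:
y <- c(1, rep(0, 9))
log_loss(y, rep(0.1, 10))  # 0.325 (to 3 d.p.)
brier(y, rep(0.1, 10))     # 0.09
```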
Diagnostics should explicitly check where the model is weak, especially on headers, long-range shots, and sparse contextual shot types.
On the 2020/2021 test split, both fitted models beat the naive baseline on proper scoring rules. The logistic model currently performs best on log loss and Brier score, while the GAM does not outperform it on this sample.
The current first-pass metrics are: baseline log loss 0.3908, logistic log loss 0.3507, GAM log loss 0.4024; baseline Brier 0.1148, logistic Brier 0.0943, GAM Brier 0.0955.
That is a useful result rather than a disappointment. It shows that limited data plus extra model flexibility can hurt performance, and that a stronger portfolio comes from reporting that honestly instead of forcing a more complex model to be the winner.
The exported artefacts now include a metric comparison chart and a calibration curve built from the real evaluation outputs.

The first-pass evaluation shows clear improvement over the naive baseline, with the logistic model outperforming the richer GAM on this limited open-data sample.

Calibration is informative but noisy in the upper probability buckets because the test sample is small. That uncertainty is part of the story, not a detail to hide.
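The reliability curve behind that caveat reduces to a binned comparison of predicted and observed goal rates. A minimal sketch (an illustrative helper, not the project's actual plotting code) makes the small-bin noise concrete: with few shots in a high-probability bucket, `obs_rate` swings hard on a handful of outcomes.

```r
# Bin predictions, then compare mean prediction to observed goal
# rate in each bin; empty bins surface as n = 0 with NA rates.
calibration_table <- function(y, p, breaks = seq(0, 1, by = 0.1)) {
  bin <- cut(p, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin       = levels(bin),
    n         = as.vector(table(bin)),
    mean_pred = as.vector(tapply(p, bin, mean)),
    obs_rate  = as.vector(tapply(y, bin, mean))
  )
}
```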
The original plan positioned the GAM as the natural headline model because shot quality is driven by nonlinear geometry. That reasoning is still statistically sound, but the first-pass results did not support it as the best deployed choice.
After adding limited freeze-frame context, the logistic model improved enough to beat the GAM on both log loss and Brier score. The GAM became harder to estimate cleanly on this small open-data sample and did not justify its extra flexibility.
That failure is valuable. It shows why model selection should be evidence-led rather than driven by prestige or complexity, and it gives a concrete example of why a club workflow needs validation discipline before promoting a richer model into production.
Analysts should use the model to separate chance quality from conversion outcomes, compare attacking process across teams or players, and create cleaner downstream features for later models.
The outputs should not be treated as a complete measure of finishing ability or offensive value in isolation, because event data misses defensive pressure, goalkeeper positioning, and other critical context.
Safe interpretation means treating xG as structured evidence about chance quality, not as an all-purpose truth label for attacking performance.
The implementation should be organised as a reproducible R project rather than an exploratory notebook. Raw JSON should remain immutable, feature generation should be scripted, and evaluation artefacts should be saved for reuse in the website.
The codebase should include reusable functions, pipeline entry scripts, config files, and testthat coverage for extraction and transformation logic.
This is also why the project does not need a database at this stage. Flat-file inputs and outputs keep the workflow reproducible, inspectable, and easy to understand.
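The testthat coverage can stay very small and still pin down the extraction contract. A sketch in the intended style, where `is_goal_outcome` is a hypothetical stand-in for a real transformation helper:

```r
library(testthat)

# Hypothetical extraction helper: encode a StatsBomb shot outcome
# string as a 0/1 goal flag.
is_goal_outcome <- function(outcome) as.integer(outcome == "Goal")

test_that("goal outcomes are encoded as 0/1", {
  expect_equal(is_goal_outcome("Goal"), 1L)
  expect_equal(is_goal_outcome("Saved"), 0L)
  expect_equal(is_goal_outcome(c("Goal", "Off T")), c(1L, 0L))
})
```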
The biggest practical limitation is data scope. StatsBomb Open Data covers only a subset of La Liga matches here, so this is a rigorous demonstration project rather than a production-strength league model.
Open event data omits important contextual information, including detailed defensive pressure and exact goalkeeper positioning, so some visually similar shots can have materially different true scoring probabilities.
Uncertainty is likely to be largest in sparse shot contexts such as unusual headers, rare set-piece patterns, and long-range attempts with limited contextual information.
The high-probability calibration bins are especially noisy because the sample is small. Any strong-looking performance result should therefore be accompanied by calibration and subgroup analysis, not presented as proof that the model captures all relevant football context.
After the first release, the main next step is not adding arbitrary complexity. It is improving contextual coverage, tightening uncertainty communication, and integrating xG outputs into the team-strength and player-rating projects.
The first refinement target is model stability: the logistic model currently beats the GAM, so the next iteration should simplify or regularise the GAM rather than blindly expanding it.
A useful extension would be bootstrap uncertainty intervals for evaluation metrics and a compact failure log showing which modelling choices did not survive diagnostics.
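A percentile bootstrap over shot indices is enough for that extension. A sketch under simulated stand-in data (the real `y` and `p` would come from the held-out season):

```r
# Bootstrap percentile interval for a test-set metric: resample shot
# indices with replacement and recompute the metric each time.
metric_ci <- function(y, p, metric, B = 2000, level = 0.95) {
  stats <- replicate(B, {
    i <- sample(seq_along(y), replace = TRUE)
    metric(y[i], p[i])
  })
  quantile(stats, probs = c((1 - level) / 2, 1 - (1 - level) / 2))
}

brier <- function(y, p) mean((y - p)^2)

set.seed(1)
y <- rbinom(839, 1, 0.11)  # stand-in for the 839 test shots
p <- pmin(pmax(rnorm(839, 0.11, 0.05), 0.01), 0.99)
metric_ci(y, p, brier)     # ~95% bootstrap interval for the Brier score
```

Showing how wide that interval is on 839 shots communicates the sample-size limitation more honestly than any point estimate.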
Only after the file-based workflow is stable would it make sense to consider a shared project database for later portfolio infrastructure.