Question

How should shot quality be quantified from open event data so that analysts can separate chance quality from finishing and reuse the outputs in later forecasting and player models?

Methods

  • Naive conversion baseline
  • Logistic regression
  • GAM with mgcv
  • Freeze-frame context features
  • Season-based validation
  • Probability calibration diagnostics

Data Sources

  • StatsBomb Open Data
  • La Liga event data
  • Shot-level feature table
  • Shot freeze-frame context

Test Sample

839 shots

The 2020/2021 holdout set contains 839 shots across 35 open-data matches.

Best Log Loss

0.3507

The logistic model is currently the best first-pass model on log loss.

Best Brier Score

0.0943

The same logistic model also leads on Brier score after the freeze-frame feature upgrade.

Problem

Estimate the probability that an individual shot results in a goal using public event data from La Liga.

Treat expected goals as a probability estimation problem rather than a classification exercise, because the real requirement is calibrated shot-quality estimates that analysts can trust.

Package the resulting xG values as reusable infrastructure for later team-strength, live win-probability, and player-rating work.

Football Context

Raw goals and shot counts confound chance quality with finishing variance and goalkeeper outcomes, which makes them weak tools for evaluating attacking process.

A club-level xG model helps analysts benchmark chance creation, identify over- or under-performance, and build cleaner inputs for forecasting and player evaluation.

This project is intentionally scoped as a foundational model: it matters on its own, but it also underpins later portfolio projects.

Data

The project uses StatsBomb Open Data, focused on La Liga to keep the competition context coherent and the scope manageable.

The current first pass contains 1,591 shots across 68 open-data matches, built from immutable raw JSON into one canonical shot table with geometry, event context, and freeze-frame-derived fields.

The current implementation uses 2019/2020 as the training season and 2020/2021 as the test season, preserving a realistic temporal split while also making the dataset limitation explicit.

  • Primary language: R
  • Core packages: data.table, mgcv, testthat, renv
  • Core features: shot distance, shot angle, body part, shot type, game state, minute, first-time shot, one-on-one, under pressure
  • Freeze-frame features added in the first pass: defender count and goalkeeper distance
  • Important limitation: StatsBomb Open Data is a subset, not full La Liga season coverage
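
The geometry features above are computed in R in the actual pipeline; the underlying calculation is language-agnostic, so here is a minimal sketch in Python, assuming StatsBomb's 120 × 80 pitch coordinates with the attacked goal at x = 120 and posts at y = 36 and 44 (the helper names are illustrative, not the project's):

```python
import math

# Assumed StatsBomb pitch coordinates: 120 x 80 pitch, attacking the goal
# at x = 120, with the posts at (120, 36) and (120, 44).
GOAL_X, POST_LOW, POST_HIGH = 120.0, 36.0, 44.0

def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, (POST_LOW + POST_HIGH) / 2 - y)

def shot_angle(x, y):
    """Angle (radians) subtended at the shot location by the two goalposts.

    Wider angle = more of the goal mouth visible, so closer, central shots
    score higher on this feature.
    """
    to_high = math.atan2(POST_HIGH - y, GOAL_X - x)
    to_low = math.atan2(POST_LOW - y, GOAL_X - x)
    return abs(to_high - to_low)
```

For instance, a shot from the penalty spot at (108, 40) is 12 units out and subtends 2·atan(1/3) ≈ 0.64 radians.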

Model Design

The model stack is deliberately simple and defensible. A naive conversion-rate baseline establishes the floor, a logistic regression provides an interpretable statistical baseline, and a GAM is the main model because shot geometry effects are nonlinear and the job spec explicitly values GAM competence.

The first implementation then adds a deliberately small freeze-frame upgrade through defender count and goalkeeper distance. That gives the model some shot-context signal without pretending this is full tracking-data analysis.
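
The two freeze-frame features can be sketched as follows, assuming the StatsBomb `freeze_frame` shape (a list of players with `location`, `teammate`, and `position` fields). Counting defenders inside the shooter-to-posts triangle is one plausible definition of "defender count"; the project's exact definition may differ, and the function names are illustrative (Python for illustration, though the project itself is in R):

```python
import math

POSTS = [(120.0, 36.0), (120.0, 44.0)]  # assumed StatsBomb goalposts

def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_shot_triangle(p, shooter):
    """True if point p lies inside the triangle shooter-post-post."""
    tri = [shooter, POSTS[0], POSTS[1]]
    signs = [_cross(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

def freeze_frame_features(shooter, freeze_frame):
    """Defender count in the shot triangle and shooter-to-keeper distance."""
    opponents = [f for f in freeze_frame if not f["teammate"]]
    n_block = sum(in_shot_triangle(tuple(f["location"]), shooter)
                  for f in opponents)
    gk = next((f for f in opponents
               if f["position"]["name"] == "Goalkeeper"), None)
    gk_dist = (math.hypot(gk["location"][0] - shooter[0],
                          gk["location"][1] - shooter[1])
               if gk else None)
    return n_block, gk_dist
```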

More complex machine-learning models are intentionally excluded from the first version because they would increase implementation time, reduce interpretability, and weaken the portfolio unless validated to a much higher standard.

  • Model 0: global conversion rate
  • Model 1: logistic regression on core shot features
  • Model 2: GAM with nonlinear geometry effects and limited freeze-frame context
  • Primary objective: well-calibrated probabilities, not headline classification accuracy
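
Models 1 and 2 are fitted with `glm` and mgcv's `gam` in R; only the Model 0 floor is trivial enough to sketch here. It assigns every future shot the training-set conversion rate (Python for illustration, hypothetical helper name):

```python
def fit_global_rate(train_outcomes):
    """Model 0: the training-set conversion rate becomes every shot's xG.

    train_outcomes is a sequence of 0/1 goal indicators.
    """
    rate = sum(train_outcomes) / len(train_outcomes)
    return lambda n_shots: [rate] * n_shots
```

Any candidate model that cannot beat this constant prediction on a proper scoring rule is adding no shot-level information.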

Validation

Validation should reflect how the model would behave on future football data, so the main evaluation design is season-based holdout rather than random train-test splitting.

Performance will be judged using proper scoring rules and calibration diagnostics, because a useful xG model must assign credible probabilities rather than simply rank shots well.

Diagnostics should explicitly check where the model is weak, especially on headers, long-range shots, and sparse contextual shot types.

  • Log loss
  • Brier score
  • Calibration curve and reliability table
  • Expected calibration error
  • Subgroup diagnostics by body part, shot zone, and game state
  • Season-by-season stability checks
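
The evaluation in the project is implemented in R, but the scoring rules and calibration diagnostics themselves are simple enough to sketch directly; a minimal Python version of log loss, Brier score, the reliability table, and ECE (illustrative function names):

```python
import math

def log_loss(y, p, eps=1e-15):
    """Mean negative log-likelihood of the observed 0/1 outcomes."""
    return -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                for yi, pi in zip(y, p)) / len(y)

def brier(y, p):
    """Mean squared error between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def reliability_table(y, p, n_bins=10):
    """(mean prediction, observed goal rate, count) per non-empty probability bin."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        bins[min(int(pi * n_bins), n_bins - 1)].append((yi, pi))
    return [(sum(pi for _, pi in b) / len(b),
             sum(yi for yi, _ in b) / len(b),
             len(b))
            for b in bins if b]

def ece(y, p, n_bins=10):
    """Expected calibration error: count-weighted |mean prediction - observed rate|."""
    n = len(y)
    return sum(count / n * abs(mean_p - rate)
               for mean_p, rate, count in reliability_table(y, p, n_bins))
```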

Results

On the 2020/2021 test split, both fitted models beat the naive baseline on proper scoring rules. The logistic model currently performs best on log loss and Brier score, while the GAM does not outperform it on this sample.

The current first-pass metrics are:

  • Log loss: baseline 0.3908, logistic 0.3507, GAM 0.4024
  • Brier score: baseline 0.1148, logistic 0.0943, GAM 0.0955

That is a useful result rather than a disappointment. It shows that limited data plus extra model flexibility can hurt performance, and that a stronger portfolio comes from reporting that honestly instead of forcing a more complex model to be the winner.

The exported artefacts now include a metric comparison chart and a calibration curve built from the real evaluation outputs.

Metric Comparison

Bar chart comparing log loss and Brier score across baseline, logistic, and GAM models.

The first-pass evaluation shows clear improvement over the naive baseline, with the logistic model outperforming the richer GAM on this limited open-data sample.

Calibration Curve

Calibration chart comparing baseline, logistic, and GAM predicted probabilities against observed goal rates.

Calibration is informative but noisy in the upper probability buckets because the test sample is small. That uncertainty is part of the story, not a detail to hide.

What Failed During Development

The original plan positioned the GAM as the natural headline model because shot quality is driven by nonlinear geometry. That reasoning is still statistically sound, but the first-pass results did not support it as the best deployed choice.

After adding limited freeze-frame context, the logistic model improved enough to beat the GAM on both log loss and Brier score. The GAM became harder to estimate cleanly on this small open-data sample and did not justify its extra flexibility.

That failure is valuable. It shows why model selection should be evidence-led rather than driven by prestige or complexity, and it gives a concrete example of why a club workflow needs validation discipline before promoting a richer model into production.

  • Planned headline model: GAM
  • Observed first-pass winner: logistic regression
  • Reason: limited sample plus extra flexibility reduced stability
  • Main lesson: complexity is only useful when the evidence supports it

Decision Use

Analysts should use the model to separate chance quality from conversion outcomes, compare attacking process across teams or players, and create cleaner downstream features for later models.

The outputs should not be treated as a complete measure of finishing ability or offensive value in isolation, because event data misses defensive pressure, goalkeeper positioning, and other critical context.

Safe interpretation means treating xG as structured evidence about chance quality, not as an all-purpose truth label for attacking performance.

Engineering

The implementation should be organised as a reproducible R project rather than an exploratory notebook. Raw JSON should remain immutable, feature generation should be scripted, and evaluation artefacts should be saved for reuse in the website.

The codebase should include reusable functions, pipeline entry scripts, config files, and testthat coverage for extraction and transformation logic.

This is also why the project does not need a database at this stage. Flat-file inputs and outputs keep the workflow reproducible, inspectable, and easy to understand.

  • Immutable raw data layer
  • Canonical shot table generation
  • Config-driven train, validation, and test splits
  • testthat checks for schema, geometry, and probability validity
  • Versioned figures and model artefacts
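
The actual test suite uses testthat in R; the shape of the probability-validity guard can be sketched like this (Python for illustration, hypothetical helper name):

```python
def check_probability_column(probs):
    """Guard mirroring the project's probability-validity tests:
    predictions must be non-empty, non-NaN, and inside [0, 1]."""
    assert len(probs) > 0, "empty prediction vector"
    assert all(p == p for p in probs), "NaN prediction present"  # NaN != NaN
    assert all(0.0 <= p <= 1.0 for p in probs), "probability outside [0, 1]"
    return True
```

Analogous guards for geometry (shot coordinates inside the 120 × 80 pitch) and schema (expected columns present) keep the pipeline from silently exporting invalid artefacts.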

Limitations

The biggest practical limitation is data scope. StatsBomb Open Data covers only a subset of La Liga matches here, so this is a rigorous demonstration project rather than a production-strength league model.

Open event data omits important contextual information, including detailed defensive pressure and exact goalkeeper positioning, so some visually similar shots can have materially different true scoring probabilities.

Uncertainty is likely to be largest in sparse shot contexts such as unusual headers, rare set-piece patterns, and long-range attempts with limited contextual information.

The high-probability calibration bins are especially noisy because the sample is small. Any strong-looking performance result should therefore be accompanied by calibration and subgroup analysis, not presented as proof that the model captures all relevant football context.

Next Iteration

After the first release, the main next step is not adding arbitrary complexity. It is improving contextual coverage, tightening uncertainty communication, and integrating xG outputs into the team-strength and player-rating projects.

The first refinement target is model stability: the logistic model currently beats the GAM, so the next iteration should simplify or regularise the GAM rather than blindly expanding it.

A useful extension would be bootstrap uncertainty intervals for evaluation metrics and a compact failure log showing which modelling choices did not survive diagnostics.
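
A percentile bootstrap over test shots is one straightforward way to attach such intervals to a metric; a minimal sketch for the Brier score (Python for illustration, hypothetical function name):

```python
import random

def bootstrap_brier_ci(y, p, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the Brier score.

    Resamples test shots with replacement and reads off the
    alpha/2 and 1 - alpha/2 quantiles of the resampled scores.
    """
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum((p[i] - y[i]) ** 2 for i in idx) / n)
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

On an 839-shot test set these intervals would likely be wide, which is exactly the point: they make the small-sample caveat quantitative.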

Only after the file-based workflow is stable would it make sense to consider a shared project database for later portfolio infrastructure.

Pipeline Workflow

  1. Ingest immutable raw StatsBomb JSON for La Liga seasons 2019/2020 and 2020/2021.
  2. Extract a canonical shot table from match event files.
  3. Engineer a compact, defensible set of geometry, event, and freeze-frame context features.
  4. Split train and test data by season rather than randomly.
  5. Fit a global-rate baseline, logistic regression, and headline GAM.
  6. Evaluate calibration, Brier score, log loss, and subgroup reliability.
  7. Export metrics and summary artefacts for the portfolio site.
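
Step 4 is the part most worth pinning down: the split is driven by season labels from config, never by a row-level shuffle. A minimal sketch, assuming each shot record carries a `season` field (Python for illustration; the real implementation lives in the R config and split functions):

```python
# Season labels mirroring the config-driven split: train on 2019/2020,
# test on 2020/2021, and surface anything else rather than silently using it.
TRAIN_SEASONS = {"2019/2020"}
TEST_SEASONS = {"2020/2021"}

def season_split(shots):
    """Partition shot records by season; leftovers are returned, not dropped."""
    train = [s for s in shots if s["season"] in TRAIN_SEASONS]
    test = [s for s in shots if s["season"] in TEST_SEASONS]
    leftover = [s for s in shots
                if s["season"] not in TRAIN_SEASONS | TEST_SEASONS]
    return train, test, leftover
```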

Repository Structure

  • modeling/project-1-xg/config/project_config.R for data paths and split definitions
  • modeling/project-1-xg/R/ for reusable ingestion, feature, split, model, and evaluation functions
  • modeling/project-1-xg/scripts/ for sequential pipeline entry points
  • modeling/project-1-xg/tests/testthat/ for geometry and evaluation tests
  • modeling/project-1-xg/data/raw/ for immutable StatsBomb source files
  • modeling/project-1-xg/data/processed/ for shot tables and split outputs
  • modeling/project-1-xg/outputs/ for metrics, calibration tables, and site export artefacts

What Wider Use Would Require

  • Stable raw-to-shot extraction pipeline
  • Feature validation and geometry tests
  • Saved model artefacts and evaluation reports
  • Monitoring for schema drift and missing feature rates
  • Clear analyst guidance on safe interpretation