Question

How should shot quality be quantified from open event data so that analysts can separate chance quality from finishing and reuse the outputs in later forecasting and player models?

Methods

  • Naive conversion baseline
  • Logistic regression
  • GAM with mgcv
  • Freeze-frame context features
  • Season-based validation
  • Probability calibration diagnostics

Data Sources

  • StatsBomb Open Data
  • La Liga event data
  • Shot-level feature table
  • Shot freeze-frame context

Test Sample

839 shots

The 2020/2021 holdout set contains 839 shots across 35 open-data matches.

Best Log Loss

0.3507

The logistic model is currently the best first-pass model on log loss.

Best Brier Score

0.0943

The same logistic model also leads on Brier score after the freeze-frame feature upgrade.

Problem

Estimate the probability that an individual shot results in a goal using public event data from La Liga.

Treat expected goals as a probability estimation problem rather than a classification exercise, because the real requirement is calibrated shot-quality estimates that analysts can trust.

Package the resulting xG values as reusable infrastructure for later team-strength, live win-probability, and player-rating work.

Football Context

Raw goals and shot counts confound chance quality with finishing variance and goalkeeper outcomes, which makes them weak tools for evaluating attacking process.

A club-level xG model helps analysts benchmark chance creation, identify over- or under-performance, and build cleaner inputs for forecasting and player evaluation.

This project is intentionally scoped as a foundational model: it matters on its own, but it also underpins later portfolio projects.

Data

The project uses StatsBomb Open Data, focused on La Liga to keep the competition context coherent and the scope manageable.

The current first pass contains 1,591 shots across 68 open-data matches, built from immutable raw JSON into one canonical shot table with geometry, event context, and freeze-frame-derived fields.

The current implementation uses 2019/2020 as the training season and 2020/2021 as the test season, preserving a realistic temporal split while also making the dataset limitation explicit.

  • Primary language: R
  • Core packages: data.table, mgcv, testthat, renv
  • Core features: shot distance, shot angle, body part, shot type, game state, minute, first-time shot, one-on-one, under pressure
  • Freeze-frame features added in the first pass: defender count and goalkeeper distance
  • Important limitation: StatsBomb Open Data is a subset, not full La Liga season coverage
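
The geometry features above are computed in R in the actual pipeline; the underlying calculation is language-agnostic, so here is a minimal sketch in Python, assuming StatsBomb's 120 × 80 pitch coordinates with the attacked goal at x = 120 and posts at y = 36 and 44 (the helper names are illustrative, not the project's):

```python
import math

# Assumed StatsBomb pitch coordinates: 120 x 80 pitch, attacking the goal
# at x = 120, with the posts at (120, 36) and (120, 44).
GOAL_X, POST_LOW, POST_HIGH = 120.0, 36.0, 44.0

def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, (POST_LOW + POST_HIGH) / 2 - y)

def shot_angle(x, y):
    """Angle (radians) subtended at the shot location by the two goalposts.

    Wider angle = more of the goal mouth visible, so closer, central shots
    score higher on this feature.
    """
    to_high = math.atan2(POST_HIGH - y, GOAL_X - x)
    to_low = math.atan2(POST_LOW - y, GOAL_X - x)
    return abs(to_high - to_low)
```

For instance, a shot from the penalty spot at (108, 40) is 12 units out and subtends 2·atan(1/3) ≈ 0.64 radians.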

Model Design

The model stack is deliberately simple and defensible. A naive conversion-rate baseline establishes the floor, a logistic regression provides an interpretable statistical baseline, and a GAM is the main model because shot geometry effects are nonlinear and the job spec explicitly values GAM competence.

The first implementation then adds a deliberately small freeze-frame upgrade through defender count and goalkeeper distance. That gives the model some shot-context signal without pretending this is full tracking-data analysis.
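
The two freeze-frame features can be sketched as follows, assuming the StatsBomb `freeze_frame` shape (a list of players with `location`, `teammate`, and `position` fields). Counting defenders inside the shooter-to-posts triangle is one plausible definition of "defender count"; the project's exact definition may differ, and the function names are illustrative (Python for illustration, though the project itself is in R):

```python
import math

POSTS = [(120.0, 36.0), (120.0, 44.0)]  # assumed StatsBomb goalposts

def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_shot_triangle(p, shooter):
    """True if point p lies inside the triangle shooter-post-post."""
    tri = [shooter, POSTS[0], POSTS[1]]
    signs = [_cross(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

def freeze_frame_features(shooter, freeze_frame):
    """Defender count in the shot triangle and shooter-to-keeper distance."""
    opponents = [f for f in freeze_frame if not f["teammate"]]
    n_block = sum(in_shot_triangle(tuple(f["location"]), shooter)
                  for f in opponents)
    gk = next((f for f in opponents
               if f["position"]["name"] == "Goalkeeper"), None)
    gk_dist = (math.hypot(gk["location"][0] - shooter[0],
                          gk["location"][1] - shooter[1])
               if gk else None)
    return n_block, gk_dist
```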

More complex machine-learning models are intentionally excluded from the first version because they would increase implementation time, reduce interpretability, and weaken the portfolio unless validated to a much higher standard.

  • Model 0: global conversion rate
  • Model 1: logistic regression on core shot features
  • Model 2: GAM with nonlinear geometry effects and limited freeze-frame context
  • Primary objective: well-calibrated probabilities, not headline classification accuracy
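
Models 1 and 2 are fitted with `glm` and mgcv's `gam` in R; only the Model 0 floor is trivial enough to sketch here. It assigns every future shot the training-set conversion rate (Python for illustration, hypothetical helper name):

```python
def fit_global_rate(train_outcomes):
    """Model 0: the training-set conversion rate becomes every shot's xG.

    train_outcomes is a sequence of 0/1 goal indicators.
    """
    rate = sum(train_outcomes) / len(train_outcomes)
    return lambda n_shots: [rate] * n_shots
```

Any candidate model that cannot beat this constant prediction on a proper scoring rule is adding no shot-level information.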

Validation

Validation should reflect how the model would behave on future football data, so the main evaluation design is season-based holdout rather than random train-test splitting.

Performance will be judged using proper scoring rules and calibration diagnostics, because a useful xG model must assign credible probabilities rather than simply rank shots well.

Diagnostics should explicitly check where the model is weak, especially on headers, long-range shots, and sparse contextual shot types.

  • Log loss
  • Brier score
  • Calibration curve and reliability table
  • Expected calibration error
  • Subgroup diagnostics by body part, shot zone, and game state
  • Season-by-season stability checks
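
The evaluation in the project is implemented in R, but the scoring rules and calibration diagnostics themselves are simple enough to sketch directly; a minimal Python version of log loss, Brier score, the reliability table, and ECE (illustrative function names):

```python
import math

def log_loss(y, p, eps=1e-15):
    """Mean negative log-likelihood of the observed 0/1 outcomes."""
    return -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                for yi, pi in zip(y, p)) / len(y)

def brier(y, p):
    """Mean squared error between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def reliability_table(y, p, n_bins=10):
    """(mean prediction, observed goal rate, count) per non-empty probability bin."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        bins[min(int(pi * n_bins), n_bins - 1)].append((yi, pi))
    return [(sum(pi for _, pi in b) / len(b),
             sum(yi for yi, _ in b) / len(b),
             len(b))
            for b in bins if b]

def ece(y, p, n_bins=10):
    """Expected calibration error: count-weighted |mean prediction - observed rate|."""
    n = len(y)
    return sum(count / n * abs(mean_p - rate)
               for mean_p, rate, count in reliability_table(y, p, n_bins))
```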

Results

On the 2020/2021 test split, both fitted models beat the naive baseline on proper scoring rules. The logistic model currently performs best on log loss and Brier score, while the GAM does not outperform it on this sample.

The current first-pass metrics are:

  • Log loss: baseline 0.3908, logistic 0.3507, GAM 0.4024
  • Brier score: baseline 0.1148, logistic 0.0943, GAM 0.0955

That is a useful result rather than a disappointment. It shows that limited data plus extra model flexibility can hurt performance, and that a stronger portfolio comes from reporting that honestly instead of forcing a more complex model to be the winner.

The exported artefacts now include a metric comparison chart and a calibration curve built from the real evaluation outputs.

Metric Comparison

Bar chart comparing log loss and Brier score across baseline, logistic, and GAM models.

The first-pass evaluation shows clear improvement over the naive baseline, with the logistic model outperforming the richer GAM on this limited open-data sample.

Calibration Curve

Calibration chart comparing baseline, logistic, and GAM predicted probabilities against observed goal rates.

Calibration is informative but noisy in the upper probability buckets because the test sample is small. That uncertainty is part of the story, not a detail to hide.

What Failed During Development

The original plan positioned the GAM as the natural headline model because shot quality is driven by nonlinear geometry. That reasoning is still statistically sound, but the first-pass results did not support it as the best deployed choice.

After adding limited freeze-frame context, the logistic model improved enough to beat the GAM on both log loss and Brier score. The GAM became harder to estimate cleanly on this small open-data sample and did not justify its extra flexibility.

That failure is valuable. It shows why model selection should be evidence-led rather than driven by prestige or complexity, and it gives a concrete example of why a club workflow needs validation discipline before promoting a richer model into production.

  • Planned headline model: GAM
  • Observed first-pass winner: logistic regression
  • Reason: limited sample plus extra flexibility reduced stability
  • Main lesson: complexity is only useful when the evidence supports it

Decision Use

Analysts should use the model to separate chance quality from conversion outcomes, compare attacking process across teams or players, and create cleaner downstream features for later models.

The outputs should not be treated as a complete measure of finishing ability or offensive value in isolation, because event data misses defensive pressure, goalkeeper positioning, and other critical context.

Safe interpretation means treating xG as structured evidence about chance quality, not as an all-purpose truth label for attacking performance.

Engineering

The implementation should be organised as a reproducible R project rather than an exploratory notebook. Raw JSON should remain immutable, feature generation should be scripted, and evaluation artefacts should be saved for reuse in the website.

The codebase should include reusable functions, pipeline entry scripts, config files, and testthat coverage for extraction and transformation logic.

This is also why the project does not need a database at this stage. Flat-file inputs and outputs keep the workflow reproducible, inspectable, and easy to understand.

  • Immutable raw data layer
  • Canonical shot table generation
  • Config-driven train, validation, and test splits
  • testthat checks for schema, geometry, and probability validity
  • Versioned figures and model artefacts
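
The actual test suite uses testthat in R; the shape of the probability-validity guard can be sketched like this (Python for illustration, hypothetical helper name):

```python
def check_probability_column(probs):
    """Guard mirroring the project's probability-validity tests:
    predictions must be non-empty, non-NaN, and inside [0, 1]."""
    assert len(probs) > 0, "empty prediction vector"
    assert all(p == p for p in probs), "NaN prediction present"  # NaN != NaN
    assert all(0.0 <= p <= 1.0 for p in probs), "probability outside [0, 1]"
    return True
```

Analogous guards for geometry (shot coordinates inside the 120 × 80 pitch) and schema (expected columns present) keep the pipeline from silently exporting invalid artefacts.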

Limitations

The biggest practical limitation is data scope. StatsBomb Open Data covers only a subset of La Liga matches here, so this is a rigorous demonstration project rather than a production-strength league model.

Open event data omits important contextual information, including detailed defensive pressure and exact goalkeeper positioning, so some visually similar shots can have materially different true scoring probabilities.

Uncertainty is likely to be largest in sparse shot contexts such as unusual headers, rare set-piece patterns, and long-range attempts with limited contextual information.

The high-probability calibration bins are especially noisy because the sample is small. Any strong-looking performance result should therefore be accompanied by calibration and subgroup analysis, not presented as proof that the model captures all relevant football context.

Next Iteration

After the first release, the main next step is not adding arbitrary complexity. It is improving contextual coverage, tightening uncertainty communication, and integrating xG outputs into the team-strength and player-rating projects.

The first refinement target is model stability: the logistic model currently beats the GAM, so the next iteration should simplify or regularise the GAM rather than blindly expanding it.

A useful extension would be bootstrap uncertainty intervals for evaluation metrics and a compact failure log showing which modelling choices did not survive diagnostics.
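
A percentile bootstrap over test shots is one straightforward way to attach such intervals to a metric; a minimal sketch for the Brier score (Python for illustration, hypothetical function name):

```python
import random

def bootstrap_brier_ci(y, p, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the Brier score.

    Resamples test shots with replacement and reads off the
    alpha/2 and 1 - alpha/2 quantiles of the resampled scores.
    """
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum((p[i] - y[i]) ** 2 for i in idx) / n)
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

On an 839-shot test set these intervals would likely be wide, which is exactly the point: they make the small-sample caveat quantitative.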

Only after the file-based workflow is stable would it make sense to consider a shared project database for later portfolio infrastructure.

Pipeline Workflow

  1. Ingest immutable raw StatsBomb JSON for La Liga seasons 2019/2020 and 2020/2021.
  2. Extract a canonical shot table from match event files.
  3. Engineer a compact, defensible set of geometry, event, and freeze-frame context features.
  4. Split train and test data by season rather than randomly.
  5. Fit a global-rate baseline, logistic regression, and headline GAM.
  6. Evaluate calibration, Brier score, log loss, and subgroup reliability.
  7. Export metrics and summary artefacts for the portfolio site.
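
Step 4 is the part most worth pinning down: the split is driven by season labels from config, never by a row-level shuffle. A minimal sketch, assuming each shot record carries a `season` field (Python for illustration; the real implementation lives in the R config and split functions):

```python
# Season labels mirroring the config-driven split: train on 2019/2020,
# test on 2020/2021, and surface anything else rather than silently using it.
TRAIN_SEASONS = {"2019/2020"}
TEST_SEASONS = {"2020/2021"}

def season_split(shots):
    """Partition shot records by season; leftovers are returned, not dropped."""
    train = [s for s in shots if s["season"] in TRAIN_SEASONS]
    test = [s for s in shots if s["season"] in TEST_SEASONS]
    leftover = [s for s in shots
                if s["season"] not in TRAIN_SEASONS | TEST_SEASONS]
    return train, test, leftover
```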

Repository Structure

  • modeling/project-1-xg/config/project_config.R for data paths and split definitions
  • modeling/project-1-xg/R/ for reusable ingestion, feature, split, model, and evaluation functions
  • modeling/project-1-xg/scripts/ for sequential pipeline entry points
  • modeling/project-1-xg/tests/testthat/ for geometry and evaluation tests
  • modeling/project-1-xg/data/raw/ for immutable StatsBomb source files
  • modeling/project-1-xg/data/processed/ for shot tables and split outputs
  • modeling/project-1-xg/outputs/ for metrics, calibration tables, and site export artefacts

What Wider Use Would Require

  • Stable raw-to-shot extraction pipeline
  • Feature validation and geometry tests
  • Saved model artefacts and evaluation reports
  • Monitoring for schema drift and missing feature rates
  • Clear analyst guidance on safe interpretation