MatchMind


What a reliability diagram tells you that a Brier score can't

Two models can share the same Brier score and still be wildly different. The reliability diagram is what separates them.

19 May 2026 · 6 min read

If you read enough football-prediction blog posts, you start to see a familiar move: a site quotes a single accuracy number (Brier score, hit rate, ROI) and asks you to trust their model on the strength of it. That single number is doing a lot of work. Maybe too much.

The problem with using one summary statistic to judge a probability model is that very different models can land on the same number. Two systems can both score Brier = 0.21 and still tell totally different stories about what the football season actually looked like. Here's how that happens, and why a reliability diagram is the visualisation that separates them.

The single-number trap

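The Brier score for a probabilistic prediction is the squared error between the stated probability and the actual outcome (1 if it happened, 0 if it didn't), averaged across all predictions. Lower is better, and a perfect model scores 0. Predicting 1/3 for each outcome in every fixture scores 0.222 when the squared error is averaged over the three outcomes (the original summed form of the score would give 0.667).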
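The arithmetic behind that 0.222 can be checked in a few lines. This is a minimal sketch, not MatchMind's own code: the function name `brier_score`, the home/draw/away ordering and the three-fixture sample are all assumptions made for illustration, and the score is averaged per outcome as described above.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Squared error per outcome, averaged over every outcome of every fixture.

    probs[i] holds the stated probabilities for fixture i (assumed order:
    home, draw, away); outcomes[i] is the index of the result that happened.
    """
    total = 0.0
    count = 0
    for fixture_probs, result in zip(probs, outcomes):
        for k, p in enumerate(fixture_probs):
            actual = 1.0 if k == result else 0.0
            total += (p - actual) ** 2
            count += 1
    return total / count


# Hypothetical sample: one home win, one draw, one away win.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
results = [0, 1, 2]
print(round(brier_score(uniform, results), 3))  # 0.222
```

The flat 1/3 forecast scores 0.222 whatever the results are, which is why it works as a no-information baseline.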

It's a fine summary metric, but it's an average. Averages hide shape. A model that's great at picking lopsided matches (80% favourites that win 80% of the time) and terrible at picking close ones (50/30/20 splits where the underdog wins) can end up with the same Brier score as a model that's mediocre across the board. From the headline number, they look identical. From the user's experience, one is a useful tool and the other is a generator of confident-sounding nonsense.

What a reliability diagram does

A reliability diagram bins predictions by their stated probability. All the predictions that said “50% home win” go in one bin, all the “60% home win” predictions in another, and so on. Then for each bin, you compute the actual frequency of home wins. Plot the predicted probability on the x-axis and the observed frequency on the y-axis. Perfect calibration is the y = x diagonal.
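The binning step can be sketched as follows. This is an illustration under assumptions rather than the code behind any published curve: the function name `reliability_table`, the ten equal-width bins and the eight sample predictions are all hypothetical. Each returned row is one dot on the diagram (x, y) plus the bin's sample size.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reliability_table(
    predicted: Sequence[float], happened: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed frequency, n) per non-empty bin."""
    bins: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for p, y in zip(predicted, happened):
        # Bin k covers [k/n_bins, (k+1)/n_bins); p == 1.0 is folded into the top bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    rows = []
    for index in sorted(bins):
        members = bins[index]
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        rows.append((mean_p, freq, len(members)))
    return rows


# Hypothetical home-win forecasts and whether the home side actually won.
predicted = [0.52, 0.55, 0.58, 0.56, 0.72, 0.75, 0.78, 0.74]
happened = [1, 0, 1, 1, 1, 0, 1, 0]
for mean_p, freq, n in reliability_table(predicted, happened):
    print(f"predicted ~{mean_p:.2f}  observed {freq:.2f}  n={n}")
```

In this toy sample the 50-60% bin has an observed frequency of 0.75 and the 70-80% bin has 0.50. With four predictions per bin that says nothing about a real model; it only shows how the dots are computed.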

Dots above the line: the model is under-confident in that probability range (saying 50% when reality is 60%). Dots below: over-confident (saying 70% when reality is 55%). Both kinds of error can hide inside the same Brier score.
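A deliberately extreme toy case makes the point concrete. The eight results and both "models" below are invented for illustration, and the helper names `binary_brier` and `observed_by_forecast` are hypothetical. Model A always says 50%; Model B only ever says 100% or 0%.

```python
from collections import defaultdict


def binary_brier(predicted, happened):
    """Mean squared error between stated home-win probability and the 0/1 result."""
    return sum((p - y) ** 2 for p, y in zip(predicted, happened)) / len(predicted)


def observed_by_forecast(predicted, happened):
    """Observed home-win frequency for each distinct stated probability."""
    groups = defaultdict(list)
    for p, y in zip(predicted, happened):
        groups[p].append(y)
    return {p: sum(ys) / len(ys) for p, ys in sorted(groups.items())}


# Hypothetical results for eight fixtures (1 = home win).
happened = [1, 1, 1, 0, 1, 0, 0, 0]
model_a = [0.5] * 8                                  # always on the fence
model_b = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # always certain

print(binary_brier(model_a, happened), binary_brier(model_b, happened))  # 0.25 0.25
print(observed_by_forecast(model_a, happened))  # {0.5: 0.5}
print(observed_by_forecast(model_b, happened))  # {0.0: 0.25, 1.0: 0.75}
```

Both models score 0.25. Model A's single dot sits exactly on the diagonal, while Model B's dots miss it by 25 points at each end. The Brier score cannot tell them apart; the per-forecast frequencies can.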

The MatchMind take

We publish a reliability diagram for our 1×2 home-win predictions on the public Track Record page. Today the sample is still building toward statistical significance: we hold the curve back until we have at least 50 pre-match evaluations, because anything less is just noise. When the threshold is met, the curve goes live. If we're under-confident in the 70-80% bracket, you'll see it. If we're over-confident in the 30-40% bracket, you'll see that too.

This isn't a sales move. It's the only way a probability model earns trust. If a site can't show you their reliability diagram, ask why.

What the diagram doesn't fix

Three honest caveats:

  • Small samples are noisy. A reliability diagram with 30 data points per bin can swing 10-15 percentage points from luck alone. Read the n carefully.
  • Sub-populations can hide. A model can be calibrated overall but mis-calibrated for high-scoring derbies, or after midweek European rotation. Slicing by league or by context is the next layer.
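  • Calibration ≠ skill. A model that predicts the same long-run base rates for every fixture is perfectly calibrated and completely useless (a flat 1/3 for each outcome would only be calibrated if home wins, draws and away wins were equally common). Reliability diagrams should be read alongside skill metrics like Brier and log-loss.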
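The small-sample caveat can be put in rough numbers with the binomial standard error. This is a back-of-the-envelope sketch, assuming fixtures in a bin are independent and the model is perfectly calibrated at 50% in that bin. The function name `bin_standard_error` and the sample sizes are illustrative, and 1.96 is the usual normal-approximation multiplier for a roughly 95% range.

```python
import math


def bin_standard_error(true_rate: float, n: int) -> float:
    """Standard error of an observed frequency from n independent fixtures."""
    return math.sqrt(true_rate * (1.0 - true_rate) / n)


# How far can a dot drift from the diagonal through luck alone?
for n in (30, 100, 500):
    se = bin_standard_error(0.5, n)
    print(f"n={n}: one standard error = {100 * se:.1f} points, "
          f"~95% range = +/- {100 * 1.96 * se:.1f} points")
```

At 30 fixtures per bin, one standard error is about 9 percentage points, so a perfectly calibrated model can easily look 10 or more points off. The noise shrinks with the square root of n: it takes roughly four times the data to halve it.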

Why this matters for a football fan

If you're the kind of fan who reads about xG over a coffee, you don't want a tipster shouting at you. You want to know whether the source publishing a probability has done the work to make that probability mean something. The reliability diagram is the visual answer to that question.

See ours, judge it for yourself: matchmind.dev/track-record →

MatchMind in 30 seconds

MatchMind publishes calibrated 1×2 win/draw/loss probabilities, xG, and AI-written match analysis for the Big-5 European leagues. Every probability is published alongside its calibration data, including when the model misses its target.
