What a reliability diagram tells you that a Brier score can't
Two models can share the same Brier score and still be wildly different. The reliability diagram is what separates them.
19 May 2026 · 6 min read
If you read enough football-prediction blog posts, you start to see a familiar move: a site quotes a single accuracy number (Brier score, hit rate, ROI) and asks you to trust their model on the strength of it. That single number is doing a lot of work. Maybe too much.
The problem with using one summary statistic to judge a probability model is that very different models can land on the same number. Two systems can both score Brier = 0.21 and still tell totally different stories about what the football season actually looked like. Here's how that happens, and why a reliability diagram is the visualisation that separates them.
The single-number trap
The Brier score for a probabilistic prediction is the squared error between the stated probability and the actual outcome (1 if it happened, 0 if it didn't), averaged across all predictions. Lower is better, and a perfect model scores 0. Predicting 1/3 for each outcome in every fixture scores 0.222.
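To make the definition concrete, here is a minimal Python sketch. The function name, the tuple layout and the three fixtures are illustrative assumptions, not MatchMind's code or data. It averages the squared error over every outcome of every fixture, which is the convention that produces the 0.222 figure; some sources sum over the three outcomes instead, which would give 0.667 for the same forecast.

```python
def brier_score(forecasts, results):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    forecasts: list of (p_home, p_draw, p_away) tuples, one per fixture.
    results:   list of result indices (0 = home win, 1 = draw, 2 = away win).
    """
    total = 0.0
    count = 0
    for probs, result in zip(forecasts, results):
        for k, p in enumerate(probs):
            actual = 1.0 if k == result else 0.0  # 1 if it happened, 0 if not
            total += (p - actual) ** 2
            count += 1
    return total / count


# Three made-up fixtures: a home win, a draw and an away win.
results = [0, 1, 2]

# Always predicting 1/3 for each outcome.
uniform = [(1 / 3, 1 / 3, 1 / 3)] * 3
print(f"uniform 1/3 forecast: {brier_score(uniform, results):.3f}")

# A forecast that put 100% on what actually happened.
perfect = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(f"perfect forecast:     {brier_score(perfect, results):.3f}")
```

The uniform forecast comes out at 2/9 whatever the results are, which is why it works as a "know-nothing" baseline.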
It's a fine summary metric, but it's an average. Averages hide shape. A model that's great at picking lopsided matches (80% favourites that win 80% of the time) and terrible at picking close ones (50/30/20 splits where the underdog wins) can end up with the same Brier score as a model that's mediocre across the board. From the headline number, they look identical. From the user's experience, one is a useful tool and the other is a generator of confident-sounding nonsense.
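A toy case, simpler than the one described above, shows how two models can tie on the headline number. The 20 fixtures and both "models" below are invented for illustration and simplified to a binary home-win question. Model A says 50% every time; Model B says 90% or 10%, but its 90% calls only come in 70% of the time.

```python
def brier(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# 20 made-up fixtures: 1 = home win, 0 = anything else.
outcomes = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7

# Model A hedges: 50% on every fixture (home teams won 10 of 20).
model_a = [0.5] * 20

# Model B is bold: 90% on the first ten fixtures, 10% on the last ten.
model_b = [0.9] * 10 + [0.1] * 10

print(f"Model A Brier: {brier(model_a, outcomes):.3f}")
print(f"Model B Brier: {brier(model_b, outcomes):.3f}")

# How often did the home team actually win when Model B said 90%?
said_90 = [y for p, y in zip(model_b, outcomes) if p == 0.9]
print(f"Model B says 90%, observed: {sum(said_90) / len(said_90):.0%}")
```

Both models score 0.250 here. The first is calibrated but tells you nothing beyond the base rate; the second sounds decisive and is over-confident by 20 points. The Brier score alone cannot tell you which one you are looking at.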
What a reliability diagram does
A reliability diagram bins predictions by their stated probability. All the predictions that said “50% home win” go in one bin, all the “60% home win” predictions in another, and so on. Then for each bin, you compute the actual frequency of home wins. Plot the predicted probability on the x-axis and the observed frequency on the y-axis. Perfect calibration is the y = x diagonal.
Dots above the line: the model is under-confident in that probability range (saying 50% when reality is 60%). Dots below: over-confident (saying 70% when reality is 55%). Both kinds of error can hide inside the same Brier score.
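The binning step can be sketched in a few lines of Python. This is a rough illustration under assumptions: ten equal-width bins, binary home-win outcomes, and made-up data. The function names are hypothetical and this is not a description of how MatchMind builds its own diagram.

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns (mean_predicted, observed_frequency, n) for each non-empty bin.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[idx] += p
        hits[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


def verdict(predicted, observed):
    """Dots above the diagonal are under-confident, dots below over-confident."""
    if observed > predicted:
        return "under-confident"
    if observed < predicted:
        return "over-confident"
    return "calibrated"


# Made-up data: ten 55% calls (7 home wins) and twenty 75% calls (11 home wins).
probs = [0.55] * 10 + [0.75] * 20
outcomes = [1] * 7 + [0] * 3 + [1] * 11 + [0] * 9

bins = reliability_bins(probs, outcomes)
for predicted, observed, n in bins:
    print(f"said {predicted:.2f}, saw {observed:.2f} (n={n}): {verdict(predicted, observed)}")
```

Plotting each (predicted, observed) pair against the y = x line gives the diagram. Keeping n alongside each point matters, because a bin with few predictions deserves less weight when you read the curve.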
The MatchMind take
We publish a reliability diagram for our 1×2 home-win predictions on the public Track Record page. Today the sample is still building toward statistical significance: we hold the curve back until we have at least 50 pre-match evaluations, because anything less is just noise. When the threshold is met, the curve goes live. If we're under-confident in the 70-80% bracket, you'll see it. If we're over-confident in the 30-40% bracket, you'll see that too.
This isn't a sales move. It's the only way a probability model earns trust. If a site can't show you their reliability diagram, ask why.
What the diagram doesn't fix
Three honest caveats:
- Small samples are noisy. A reliability diagram with 30 data points per bin can swing 10-15 percentage points from luck alone. Read the n carefully.
- Sub-populations can hide. A model can be calibrated overall but mis-calibrated for high-scoring derbies, or after midweek European rotation. Slicing by league or by context is the next layer.
- Calibration ≠ skill. A model that always predicts the long-run base rate for each outcome (1/3 each, if the three results were equally common) is perfectly calibrated and completely useless. Reliability diagrams should be read alongside skill metrics like Brier and log-loss.
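For the small-sample caveat, a back-of-the-envelope check helps. The sketch below assumes the predictions in a bin are independent and share one true rate (a plain binomial model, which real fixtures only approximate), and uses the normal approximation for the 95% range. The function name is illustrative.

```python
import math


def bin_standard_error(p, n):
    """Standard error of the observed frequency in a bin of n predictions
    whose true rate is p (binomial assumption)."""
    return math.sqrt(p * (1 - p) / n)


for n in (30, 100, 500):
    se = bin_standard_error(0.5, n)
    print(f"n={n:>3}: +/- {se:.3f} (1 s.e.), +/- {1.96 * se:.3f} (approx. 95%)")
```

With 30 predictions in a 50% bin, one standard error is about 9 percentage points, so a dot sitting 10 points off the diagonal is not yet evidence of miscalibration. The same gap at n = 500 would be.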
Why this matters for a football fan
If you're the kind of fan who reads about xG over a coffee, you don't want a tipster shouting at you. You want to know whether the source publishing a probability has done the work to make that probability mean something. The reliability diagram is the visual answer to that question.
See ours, judge it for yourself: matchmind.dev/track-record →
MatchMind in 30 seconds
MatchMind publishes calibrated 1×2 win/draw/loss probabilities, xG, and AI-written match analysis for the Big-5 European leagues. Every probability is published alongside its calibration data, including when the model misses its target.