We Wrote a Spec. Impulse Built a Calibrated NHL Win Probability Model

At Impulse, the NBA model went smoothly enough that the real question was whether our MLE agent could handle a noisier, harder domain. Hockey is where that question lives.
NHL win prediction is genuinely harder than NBA win prediction. Both leagues are parity-driven by design, but hockey adds wrinkles that basketball doesn't. Goalies can turn a winnable game on a single performance. 20-25% of games end in overtime or shootouts, where the underlying margin matters less than a single bounce. Scoring is sparse, so the signal per game is lower.
If Impulse can hand back a calibrated, leak-free model from data this noisy, the agent isn't a fluke.
Why hockey is harder than basketball to predict
A few honest contrasts worth surfacing up front, because they shape everything the model has to handle:
Base rate. NHL home teams win about 54% of the time. NBA home teams win 58%. Less home-ice signal means less easy edge for any model.
Goalies. One goalie can blow up an otherwise winnable game, and there's no real equivalent in basketball: no single player dominates a position the way an NHL goalie does. The model has to incorporate goalie performance, but there's no clean way to encode "today's actual starter" without an injury-report feed.
Overtime and shootouts. Roughly a fifth of NHL games end in overtime or a shootout, and those outcomes look more like coin flips than reflections of the underlying margin. The model has to be honest about not predicting these well.
Less signal per game. NHL games have 5-6 total goals on average. NBA games have around 220 total points. The signal-to-noise on rolling scoring stats is materially worse in hockey.
This is the kind of data the agent had to make sense of.
The data: all public, all parquet
Everything came from the public NHL API at api-web.nhle.com plus the legacy stats endpoint for team-id mapping. No paid feed. No proprietary tracking. No scraping.
Four raw sources pulled into parquet:
games.parquet: every NHL game from 2014-15 through the current 2025-26 playoffs. Game id, date, home and away teams, final score, season type, overtime and shootout flags, and decoded playoff round, matchup, and game-in-series for postseason games. About 17,000 rows.
box_team.parquet: per-game team rollup. Score and shots on goal at the team level.
box_player.parquet: per-game skater and goalie box scores. Time on ice, scoring, possession stats for skaters; saves, shots against, save percentage for goalies. About 680,000 rows.
team_seasons.parquet: per-season regular-season aggregates. Win percentage, goals-for and goals-against per game, goal differential.
At Impulse, we collapsed everything into a single feature table: 16,983 historical rows, 114 columns, one row per game. The label is home_win.
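For a sense of how the pieces fit, here is a minimal loading sketch with pandas. The file names come from the list above; the score column names and the join logic are illustrative assumptions, not the agent's actual schema.

```python
import pandas as pd

# The four raw sources described above (local paths assumed).
games = pd.read_parquet("data/games.parquet")
box_team = pd.read_parquet("data/box_team.parquet")
box_player = pd.read_parquet("data/box_player.parquet")
team_seasons = pd.read_parquet("data/team_seasons.parquet")

# One row per game; the label is whether the home team won.
# "home_score" and "away_score" are hypothetical column names.
features = games.copy()
features["home_win"] = (features["home_score"] > features["away_score"]).astype(int)

# The 114 columns come from rolling and prior-season feature joins against
# box_team, box_player, and team_seasons, omitted in this sketch.
print(len(features))  # 16,983 historical games
```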
What the agent did with it
The input wasn't code. It wasn't a notebook. It was a single technical prompt describing what we wanted predicted, what rules the model had to respect (strict time-based split: train on seasons through 2022, validate on 2023, hold out 2024 as untouched test), and what artifacts to produce. The prompt also asked for two shipped models, a LightGBM and a small MLP, so the live app could show both side by side in a dropdown.
Then we ran the agent.
Here's what happened, in order:
1. Profile the data. Row count, types, null distribution per column, label balance. The agent flagged a structural detail nobody told it about: rolling-window features are null on a team's first 20 games of franchise history. Vegas (2017-18), Seattle (2021-22), and the Utah Mammoth (the new franchise replacing Arizona) all have legitimate gaps in their rolling stats. The agent surfaced this as a constraint on model choice, not a data quality problem to clean up.
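That profiling pass is a handful of pandas calls. A sketch, assuming the feature table lives at a hypothetical features.parquet:

```python
import pandas as pd

df = pd.read_parquet("data/features.parquet")  # hypothetical path

print(len(df), "rows")                          # row count
print(df.dtypes.value_counts())                 # column type mix
print(df.isna().mean().sort_values(ascending=False).head(10))  # worst null rates
print(df["home_win"].mean())                    # label balance, ~0.54 home wins

# The nulls cluster in rolling-window features for games early in a
# franchise's history (Vegas, Seattle, Utah): structural, not dirty data.
```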
2. Set up time-based splits correctly. Standard ML training uses random splits. For time-series sports data, that's wrong. You'd be using future games to predict past ones. The agent set up expanding-window cross-validation: fit on 2014-2018, validate on 2019; fit on 2014-2019, validate on 2020; and so on through 2022. It picked the right window structure without being told the specific shape.
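Concretely, the split logic looks something like this sketch, assuming a season column holding each season's start year:

```python
def expanding_window_splits(df, first_val=2019, last_val=2022):
    """Yield (train, validation) index pairs where training data always
    predates the validation season. No future games leak backward."""
    for val_season in range(first_val, last_val + 1):
        train_idx = df.index[df["season"] < val_season]
        val_idx = df.index[df["season"] == val_season]
        yield train_idx, val_idx

# Fold 1: fit on 2014-2018, validate on 2019.
# Fold 2: fit on 2014-2019, validate on 2020. And so on through 2022.
```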
3. Build a baseline first. Before anything fancy, the agent built a simple two-feature baseline using prior-season win percentage and home-ice advantage. The point of the baseline is to set a floor: any fancier model has to beat it, or it's not worth shipping.
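A baseline of that shape is a few lines of scikit-learn. In this sketch the column name is a placeholder, and home-ice advantage enters through the intercept, since every row is framed from the home team's perspective:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Single feature: last season's win-percentage gap, home minus away
# (hypothetical column name; train/val frames come from the splits above).
X_tr = train[["prior_season_win_pct_diff"]]
X_va = val[["prior_season_win_pct_diff"]]

baseline = LogisticRegression().fit(X_tr, train["home_win"])
auc = roc_auc_score(val["home_win"], baseline.predict_proba(X_va)[:, 1])
print(f"baseline AUC: {auc:.3f}")  # the floor every later model must beat
```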
4. Reason about model family selection. The agent laid out its choices explicitly. 17k rows, 114 features, mostly continuous with occasional natural nulls from the new-franchise rolling windows. Tabular trees are the obvious starting point. Deep learning isn't. LightGBM is well-suited because it handles nulls natively and trains fast on CPU. The spec also asked for an MLP for the second model slot, so the agent built one with two hidden layers and modest regularization, knowing it would offer a different perspective in the dropdown without necessarily outperforming the tree.
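As a sketch of the two model slots (the hyperparameters here are placeholders, not the configurations the agent landed on):

```python
import lightgbm as lgb
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# LightGBM learns a default direction for missing values at each split,
# so the new-franchise rolling-window gaps need no imputation.
gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, max_depth=4)

# The MLP has no native null handling, so impute and scale first.
mlp = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), alpha=1e-3, max_iter=500),
)
```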
5. Tune both models. Bayesian optimization for each. For LightGBM: tree depth, learning rate, regularization. For the MLP: hidden layer sizes, learning rate, L2 strength, batch size. Capped at 10 trials each. The agent landed on conservative configurations: shallow trees and slow learning for LightGBM, a small MLP with strong regularization.
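The writeup doesn't name the optimizer; Optuna is one common choice for Bayesian search. A sketch of the LightGBM side with the same 10-trial cap, reusing the train and validation matrices from the earlier sketches:

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import roc_auc_score

def objective(trial):
    # Search over tree depth, learning rate, and regularization strength.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }
    model = lgb.LGBMClassifier(n_estimators=500, **params).fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)  # capped at 10 trials, as in the run
```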
6. Calibrate the output. Calibration means: when the model says 70%, does the team actually win 70% of the time? The agent fit isotonic regression for each model separately on its validation predictions, then verified the calibration before scoring the test set.
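Isotonic calibration is a monotone remap from raw scores to empirical win rates, fit on validation predictions only. A minimal sketch for one model:

```python
from sklearn.isotonic import IsotonicRegression

# Fit the calibrator on validation predictions, never on training data.
raw_val = gbm.predict_proba(X_va)[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_val, y_va)

# At scoring time, the calibrated probability is the remapped raw score:
# when the model says 0.70, the calibrated output tracks the observed win rate.
raw_test = gbm.predict_proba(X_test)[:, 1]
calibrated = calibrator.predict(raw_test)
```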
7. Explain the predictions. The agent produced two kinds of feature importance rankings for the LightGBM model: standard model-based importance and SHAP values (a more rigorous, per-prediction attribution method). For the MLP, since there's no clean SHAP path, it used permutation importance instead. Per-game SHAP values were saved for every game in the test set, so any prediction in the live app can be explained by which features pushed it which direction.
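Both attribution paths are available off the shelf; a sketch with the shap and scikit-learn libraries, variable names carried over from the sketches above:

```python
import numpy as np
import shap
from sklearn.inspection import permutation_importance

# Per-game SHAP attributions for the tree model: one row per test game.
explainer = shap.TreeExplainer(gbm)
vals = explainer.shap_values(X_test)
vals = vals[1] if isinstance(vals, list) else vals  # binary-classifier quirk
mean_abs_shap = np.abs(vals).mean(axis=0)  # the ranking used in the next section

# Permutation importance for the (already fitted) MLP, which has no clean
# SHAP path: shuffle each feature and measure the score drop.
perm = permutation_importance(mlp, X_va, y_va, n_repeats=10, random_state=0)
```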
8. Package and ship. Both models pickled. Both calibrators pickled. Feature manifest. Predictions parquet. Total artifact size: about 750 KB. Small enough to commit directly to git as the production artifacts.
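Packaging is plain pickle plus a JSON manifest. This sketch assumes the fitted objects from the earlier steps; the MLP's calibrator and the predictions frame follow the same pattern:

```python
import json
import pickle
from pathlib import Path

out = Path("artifacts")
out.mkdir(exist_ok=True)

# Models and calibrators, all pickled (object names from earlier sketches).
for name, obj in [("gbm", gbm), ("mlp", mlp), ("gbm_calibrator", calibrator)]:
    (out / f"{name}.pkl").write_bytes(pickle.dumps(obj))

# Feature manifest so the live app scores with the exact training columns.
(out / "feature_manifest.json").write_text(json.dumps(list(X_tr.columns)))
```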
The whole thing was one run. No back-and-forth. No "actually, can you also do X."
The top 5 features
From the mean-absolute-SHAP ranking on the LightGBM model:
Prior-season win percentage differential. The strongest signal in the dataset. Team strength carries forward year over year, though the carryover is weaker in hockey than in basketball.
Rolling 20-game save percentage differential. This is the goalie signal. Goalies dominate hockey outcomes more than any individual basketball player does, and the model picked up on it through team-level rolling save percentage.
Rolling 10-game goal differential. Captures in-season trajectory that prior-season can't see. Injuries, lineup changes, schemes that needed time to gel.
Rest days differential. Schedule effects matter more in hockey than basketball because of back-to-back density. A rested team facing a team on the second night of a back-to-back gets a 2-3 percentage point swing in win probability on its own.
Prior-season goal differential per game. Collinear with win percentage but adds value at the tails. Catches teams that were unlucky or lucky last season relative to their underlying play.
One detail worth flagging: the standard model-based importance ranking and the SHAP ranking didn't agree past the top three. The standard ranking over-credited features that appear in many decision splits but carry little impact per split; rolling shots-against and rolling penalty minutes fell into this bucket. SHAP correctly down-weighted those and elevated the schedule features. The agent knew the rankings would disagree and produced both anyway, because the disagreement is itself useful.
The agent's own assessment, surfaced unprompted: the achievable ceiling on NHL game prediction with any reasonable feature set is around 0.60 ROC-AUC. This is the well-known parity ceiling for hockey. Further gains would require injury and lineup data, shot-quality models (xG instead of raw shots), or goalie-specific rolling features rather than team aggregates.
What this means for everything else
Same takeaway as the NBA writeup.
At Impulse, we cleaned four public NHL data sources into a feature table. We wrote a spec describing what we wanted predicted, what rules the model had to respect, and what artifacts to ship. The agent did the rest: profiling, splits, baseline, model selection across two different model families, hyperparameter search, calibration, evaluation, explainability, packaging.
The implication is that the part of ML engineering that sits between a clean feature table and a deployed, calibrated model is mechanical work an agent can now do reliably across problem domains. NBA basketball one week, NHL hockey the next. Two different sports, two different statistical regimes, same workflow.
This is also why we're publishing prediction apps as our first proof points. Same workflow every time: pull the public data, write the spec, let the agent ship the model. The same process applies to churn, demand forecasting, fraud, lead conversion, insurance losses. Anything you can predict from tabular data.
If you have data and a problem worth solving, you can do this without writing a line of model code.
Try it on your own data → impulselabs.ai
See the live NHL app → nhlpredictions.impulselabs.ai