We Wrote a Spec. Impulse Built a Calibrated NBA Win Probability Model

At Impulse, we wanted a real stress test for our new MLE agent, so we picked NBA win prediction. It's a clean binary classification problem on the surface, and a mess underneath. Rookies with no history. Expansion teams. Lockout seasons. The COVID bubble. Lineups that rotate every week. Schedule asymmetries. If Impulse can hand back a working, well-calibrated model from data this messy, it can handle most prediction problems people actually have.

This post walks through what we gave the agent, what the agent gave back, and what we think it means.


The data: all public, all parquet

Everything came from public sources. The NBA Stats API (the same one nba_api wraps) plus Basketball Reference for a couple of historical fields the official API has gaps on. No paid feed. No odds data. No proprietary tracking.

Three raw sources pulled into parquet (a fast, columnar file format we like for tabular data):

  1. games.parquet — every NBA game from 2000-01 through the current 2025-26 season. Game id, date, home team, away team, final score, season, season type (regular, play-in, playoffs), overtime flag. About 34,000 rows.

  2. team_seasons.parquet — team-season aggregates from 2007-08 onward, which is where the official advanced stats coverage starts. Net rating, pace, shooting efficiency, turnover rate, rebound rates, win pct, road and home splits.

  3. team_logs.parquet — game-by-game team box scores, used as the substrate for "rolling window" features (how a team has performed over its most recent N games).

The data is freely available to anyone who wants to pull it. The point worth flagging here is that good prediction problems often don't need exotic data. They need someone to pull the public stuff carefully and structure it well.
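
None of this needs special tooling to look at. A minimal sketch, assuming pandas with a parquet engine (pyarrow) installed and the file names above:

```python
import pandas as pd

games = pd.read_parquet("games.parquet")               # ~34,000 rows, one per game
team_seasons = pd.read_parquet("team_seasons.parquet")  # team-season aggregates, 2007-08 onward
team_logs = pd.read_parquet("team_logs.parquet")        # game-by-game box scores

print(games.shape)
print(games.columns.tolist())
```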


What we built before the agent ran

We collapsed the three raw sources into a single dataset:

nba_games_features.parquet — 33,385 historical rows, 76 columns, one row per game. The label is home_win (1 if the home team won, 0 if not).

The 76 columns are organized in three conceptual blocks:

  • Prior-season block (frozen at the start of the season, so it can't accidentally leak future information): how each team performed last year. Net rating, win pct, pace, shooting efficiency, plus cross-context splits like "how the road team did on the road last year."

  • Rolling-10-game block: how each team has performed over its most recent 10 games strictly before today. Captures in-season trajectory that prior-season can't see (injuries, new acquisitions, schemes that took time to gel).

  • Schedule and context block: rest days, back-to-back flags, games in last 7 days, season type, playoff round, head-to-head history.

A real wart in this dataset, and one that mattered later: nulls are everywhere and they're legitimate. Rolling-10 is null on a team's first ten games of franchise history. Prior-season aggregates are null for 2000 through 2006 because the source data doesn't go back that far. Current-season aggregates only exist on non-playoff rows. The nulls themselves carry signal, so we didn't want to fill them in with placeholder values. That meant the model had to handle nulls natively.
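
To make the "strictly before today" part of the rolling-10 block concrete, here's a minimal sketch of how a feature like that can be computed from a game log. The column names (team_id, game_date, net_rating) are hypothetical, not the actual schema; the parts worth noting are the sort, the shift(1) that keeps the current game out of its own window, and the fact that a team's earliest games naturally come out as NaN instead of a filled-in placeholder.

```python
import pandas as pd

def add_rolling_form(team_logs: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Rolling mean of a per-game stat over the previous `window` games.

    shift(1) excludes the current game from its own window, so the feature
    only ever sees games strictly before the one being predicted.
    Column names here are illustrative, not the real schema.
    """
    df = team_logs.sort_values(["team_id", "game_date"]).copy()
    df["net_rating_last10"] = (
        df.groupby("team_id")["net_rating"]
          .transform(lambda s: s.shift(1).rolling(window, min_periods=window).mean())
    )
    # The first `window` games per team stay NaN on purpose -- the model
    # has to handle missing values natively rather than get a fake default.
    return df
```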

Everything else, we handed to the agent.


The spec

Here's what's actually different about Impulse. The input wasn't a notebook. It wasn't a series of "do X, then do Y" instructions. It was a single technical prompt that described what we wanted predicted, what rules the model had to respect (how to split the data, how to validate, what minimum quality bar to clear), and what artifacts to produce.

Then we ran the agent.

The thesis behind Impulse is that you shouldn't need to know ML to build models that work. You bring data and a question. Impulse handles the rest.


What the agent actually did

This is what separates Impulse from "we ran AutoML and got a model." The agent doesn't just brute-force its way through hyperparameters. It reads the spec, plans, and makes the same decisions an ML engineer would make, exposing its reasoning along the way.

Here's what happened, in order:

1. Profile the data. The agent's first move was to look at the data. Row count, types, null patterns per column, label balance (home wins about 58% of the time, slight class imbalance). It flagged the legitimate-null pattern explicitly and noted that the model would need to handle nulls natively.

2. Set up the data splits correctly. The spec required a strict time-based split: train on older seasons, validate on a more recent one, hold out the most recent season as untouched test data. Standard ML training often uses random splits, but for time-series problems like sports, that's wrong. You'd be using future games to predict past ones. The agent got this right without being told the specific window structure; a sketch of this kind of split follows the list.

3. Build a baseline first. Before anything fancy, the agent built a simple two-feature baseline model using just net rating and home-court advantage. The point of the baseline is to set a floor: any fancier model has to beat it, or it's not worth shipping. Building the baseline first is the right order. Building it later is how teams end up convinced their fancy model is better than it actually is.

4. Pick the right model family. The agent reasoned through the choices: 33k rows, 76 features, mix of continuous and categorical inputs, lots of legitimate nulls. Gradient-boosted trees (specifically LightGBM and XGBoost) are well-suited for this kind of data. Deep learning isn't. The agent picked LightGBM because it handles nulls natively and trains fast on CPU.

5. Tune the model. The agent ran a bounded hyperparameter search, optimizing for log-loss (a metric that rewards well-calibrated probabilities, not just correct yes/no answers). It landed on a conservative configuration: shallow trees, slow learning rate, strong regularization. The standard small-data tabular setup.

6. Calibrate the output. "Calibration" means: when the model says 70%, does the team actually win 70% of the time? The agent applied isotonic regression on top of the model's raw probabilities to fix this, then verified the calibration on validation data before scoring the test set.

7. Evaluate honestly. Final scoring on the held-out 2024 season, broken out by regular season vs. playoffs (because playoffs are noisier and the calibration bar is harder to clear there). Both the LightGBM model and the simple baseline were saved, because keeping the baseline around is useful for anyone who wants a reference point.

8. Explain the predictions. The agent produced two kinds of feature importance rankings: standard model-based importance and SHAP values (a more rigorous, per-prediction attribution method). It also saved per-game SHAP values for the entire test set, so any single prediction can be explained by which features pushed it which direction; a sketch of that attribution step also follows the list.

9. Package and ship. Saved model file, calibrator, feature manifest in the exact column order the model expects, predictions file with both raw and calibrated outputs, an inference helper, and a metrics report. Same bundle for the baseline.
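
To make steps 2 through 6 concrete, here's a compressed sketch of the kind of pipeline the agent assembled. It is not the agent's actual code: the column names, season cutoffs, and hyperparameter values are placeholders, and the real run searched a wider space. The shape of it (season-based split, two-feature baseline, LightGBM trained on raw nulls, isotonic calibration fit on validation) is the part worth reading.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

df = pd.read_parquet("nba_games_features.parquet")
label = "home_win"
# Numeric features only in this sketch; the real feature table also carries
# categorical context columns that need their own encoding.
features = [c for c in df.select_dtypes("number").columns if c not in (label, "season")]

# Step 2: strict time-based split. Older seasons train, a recent season
# validates, the most recent season stays untouched as test. Assumes `season`
# is stored as a year; the cutoffs here are placeholders, not the run's windows.
train = df[df["season"] <= 2022]
valid = df[df["season"] == 2023]
test  = df[df["season"] == 2024]

# Step 3: two-feature baseline (prior-season net rating diff + home court).
# Column names are hypothetical. Anything fancier has to beat this number.
base_cols = ["prior_net_rating_diff", "home_court_flag"]
baseline = LogisticRegression().fit(train[base_cols].fillna(0), train[label])
print("baseline val log-loss:",
      log_loss(valid[label], baseline.predict_proba(valid[base_cols].fillna(0))[:, 1]))

# Steps 4-5: LightGBM on the full feature set, NaNs passed through untouched
# (LightGBM routes missing values at each split natively). Shallow trees,
# slow learning rate, strong regularization -- illustrative values only.
model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.02,
    max_depth=4,
    num_leaves=15,
    reg_lambda=5.0,
)
model.fit(
    train[features], train[label],
    eval_set=[(valid[features], valid[label])],
    eval_metric="binary_logloss",
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

# Step 6: isotonic calibration, fit on validation probabilities only, then
# applied to the held-out test season.
raw_valid = model.predict_proba(valid[features])[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_valid, valid[label])

raw_test = model.predict_proba(test[features])[:, 1]
cal_test = calibrator.predict(raw_test)
print("test log-loss, raw:       ", log_loss(test[label], raw_test))
print("test log-loss, calibrated:", log_loss(test[label], cal_test))
```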
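
And here's what the per-game explanation artifact from step 8 might look like, continuing from the sketch above (same placeholder names, reusing `model`, `test`, and `features`). shap.TreeExplainer gives fast, exact attributions for tree models; each row of the resulting matrix says, for one game, how much each feature pushed the prediction up or down.

```python
import pandas as pd
import shap  # pip install shap

# Step 8: per-prediction attributions for every game in the held-out test season.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test[features])

# Older shap versions return one array per class for binary classifiers;
# newer versions return a single array. Normalize to the positive class.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# One row per test game, one column per feature, values in log-odds space:
# positive pushes toward a home win, negative pushes away.
per_game = pd.DataFrame(shap_values, columns=features, index=test.index)
per_game.to_parquet("test_shap_values.parquet")
```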

The whole thing was one run. No back-and-forth. No "actually, can you also do X."


The top 5 features

The model's top predictors, ranked by SHAP value:

  1. Last season's net rating differential. The single strongest signal in the dataset. How good each team was last year is a high-sample-size measurement that doesn't change much from one season to the next.

  2. Recent form (last 10 games). Captures injuries, midseason trades, scheme changes, all the in-season dynamics prior-season can't see.

  3. Last season's win percentage. Adds value at the tails (genuinely great and genuinely bad teams).

  4. Rest days difference. Schedule effects are real and bigger than people think. A team on a back-to-back vs. a team on three days rest is a 2–3 point swing on its own.

  5. Whether the home team is on a back-to-back. Home-court advantage exists, but it gets eroded fast when the home team is tired.

One detail worth flagging: the standard model-based importance ranking and the SHAP ranking didn't agree past the top two. The standard ranking over-credited features used in many decision splits but with small per-split impact. SHAP correctly down-weighted those and elevated the schedule features. This is a known limitation of standard importance, and it's exactly why the spec asked for both. The agent knew the rankings would disagree and produced both anyway, because the disagreement is itself useful.
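
If you want to see that disagreement for yourself, both rankings fall out of the objects built in the sketches above (names still placeholders): the booster's own split-count importance on one side, mean absolute SHAP on the other.

```python
import pandas as pd

# Standard model-based importance: how many splits used each feature. This is
# the ranking that over-credits features split on often but with small impact.
split_importance = pd.Series(
    model.booster_.feature_importance(importance_type="split"),
    index=features,
).sort_values(ascending=False)

# SHAP importance: mean absolute per-game attribution from the matrix above.
shap_importance = per_game.abs().mean().sort_values(ascending=False)

comparison = pd.DataFrame({
    "split_rank": split_importance.rank(ascending=False),
    "shap_rank": shap_importance.rank(ascending=False),
})
print(comparison.sort_values("shap_rank").head(10))
```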


What this means for everything else

At Impulse, we collapsed three public raw datasets into a feature table with leak-safe temporal logic. We wrote a spec describing what we wanted predicted and what rules the model had to respect. The agent did the rest: profiling, splits, baseline, model selection, hyperparameter search, calibration, evaluation, explainability, packaging.

The implication isn't "Impulse replaces ML engineers." It's that the part of ML engineering that's mechanical, the part that sits between a clean feature table and a deployed calibrated model, is now work an agent can do reliably. The part that's still domain expertise (knowing what features are safe to use, knowing what to predict, knowing what constraints matter) is where humans still belong.

This is also why we're publishing prediction apps for sports as our first proof points. NBA today, NHL tomorrow, hurricane risk and energy prices after that. Same workflow every time: pull the public data, encode the domain logic, write the spec, let the agent ship the model. The same process applies to churn, demand forecasting, fraud, lead conversion, insurance losses. Anything you can predict from tabular data.

If you have data and a problem worth solving, you can do this without writing a line of model code.

Try it on your own data: impulselabs.ai

See the live NBA app: nbapredictions.impulselabs.ai