We Wrote a Prompt. Impulse Built a 2026 World Cup Forecast

The NBA model out-predicted the field. The NHL model held its own against the best hockey models in the league. So the question for our MLE agent was the obvious next one: could it handle a sport that's genuinely harder to forecast than either?


International soccer is that sport, and the World Cup is the hardest version of it. We wrote a prompt, handed the agent three public datasets, and it built a full tournament forecast: a probability for every one of the 48 teams to advance, reach each knockout round, and win the cup. Here's how it works, and why the model it built looks different from the other two.


Why the World Cup is harder than the NBA or NHL

Soccer is low-scoring. A World Cup match averages under three goals. When one goal can decide a game, a single deflection or a missed call swings the result in a way that a hot quarter or a bad shift never does in basketball or hockey. There's less signal in each match and more noise.


Then there are draws. Basketball and hockey resolve every game to a winner. Soccer has a third outcome, and in the group stage draws are common and they matter for who advances. The model has to handle win, draw, and loss, not a binary.


The knockouts are single-elimination. The NBA and NHL settle their playoffs over best-of-seven series, which dampen variance and let the better team win most of the time. The World Cup gives a favorite one bad ninety minutes before it's out. One off day ends a tournament. There's no series to recover in.


The structure is also brand new. The 2026 World Cup is the first with 48 teams, 12 groups, and a round of 32. Nobody has run this exact format before, so there's no clean historical precedent for how it plays out.


And international soccer has thinner data than a pro league. National teams play a few times a year, rosters turn over between tournaments, and a team's competitive matches are spread across qualifiers, friendlies, and continental cups of wildly different quality. There's far less per-team data than the NHL or NBA generate in a single season.


Put together: more variance per game, a third outcome, no series safety net, an untested format, and less data. If the agent can build a forecast that holds up against the forecasts from Goldman Sachs and Opta in this environment, it isn't a fluke.


The data: all public, no setup

Three datasets, all public, bundled so the run needs no setup.


international_results.csv is the backbone. It's roughly 49,000 men's international matches from 1872 through 2026: friendlies, qualifiers, and tournament finals, each with a neutral-venue flag. This is the single most important input. Using the full match history, not just past World Cups, is what lets the model read a team's current strength rather than its historical reputation.


squad_values_2026.csv is one row per 2026 team, carrying the present-day roster signal that a results-only model misses: total squad market value, pre-tournament FIFA points, a host flag, and recent form. A team's match history can't fully capture how much talent is on the current roster, and this is the signal that does.


groups_2026.csv is the official draw: 48 teams across 12 groups, four per group. It defines the bracket the simulation walks through.


One wrinkle the agent had to handle: team names don't match across the files. The squad file says Korea Republic, the match history says South Korea. Türkiye and Turkey. Côte d'Ivoire and Ivory Coast. A small reconciliation step lines each team's history up with its current-squad features, and the same mapping runs again at the end to match the app's naming.
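A reconciliation step like this can be very small. The sketch below is illustrative only: the alias table holds just the three pairs named above, and the function names and the direction of the mapping (squad-file spelling to match-history spelling) are assumptions, not the agent's actual code.

```python
# Hypothetical alias table: squad-file spelling -> match-history spelling.
# Only the three pairs mentioned in the article are listed here.
ALIASES = {
    "Korea Republic": "South Korea",
    "Türkiye": "Turkey",
    "Côte d'Ivoire": "Ivory Coast",
}


def canonical_name(name: str) -> str:
    """Map a squad-file team name onto the match-history spelling."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


def unmatched(squad_names, history_names):
    """Return squad teams that still have no match-history counterpart."""
    known = set(history_names)
    return sorted(n for n in squad_names if canonical_name(n) not in known)
```

Running a check like `unmatched(...)` before modeling is a cheap way to catch a team that would otherwise silently lose its history.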


What the agent did with it

The input wasn't code or a notebook. It was a prompt describing what we wanted: a proper strength rating built from the full match history, blended with the current-squad signal, with knockout games resolved by team strength rather than coin flips, simulated across the whole tournament many times. The prompt also set the realism bar: the forecast had to discriminate between teams, producing a clear favorite rather than a near-uniform one-in-forty-eight chance for everyone.


Then we ran the agent. Here's what it did, in order.


First it built a World Football Elo over the entire match log, walking chronologically through every played international and updating both teams after each result. It used home advantage that switches off for neutral-site games, a goal-difference multiplier so that blowouts move ratings more than narrow wins, and tournament-importance weighting so that a World Cup final counts far more than a friendly. A friendly nudges a rating a little. A World Cup knockout game moves it a lot.
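For readers who haven't seen an Elo update written out, here is a minimal sketch of one step with the three ingredients described above. The constants (a 100-point home edge, K values of 20/40/60, and the goal-difference multipliers) follow the commonly published World Football Elo conventions and are assumptions here; the values the agent actually chose aren't stated.

```python
# Illustrative constants, not the agent's actual values.
HOME_ADVANTAGE = 100.0
K_BY_IMPORTANCE = {"friendly": 20.0, "qualifier": 40.0, "world_cup": 60.0}


def expected_score(rating_a, rating_b, neutral):
    """Expected result for team A (the listed home side) on a 0..1 scale."""
    edge = 0.0 if neutral else HOME_ADVANTAGE  # home edge off at neutral sites
    return 1.0 / (1.0 + 10 ** (-(rating_a + edge - rating_b) / 400.0))


def goal_diff_multiplier(goals_a, goals_b):
    """Blowouts move ratings more than one-goal wins."""
    margin = abs(goals_a - goals_b)
    if margin <= 1:
        return 1.0
    if margin == 2:
        return 1.5
    return 1.75 + (margin - 3) / 8.0


def update(rating_a, rating_b, goals_a, goals_b, importance, neutral):
    """Return both teams' new ratings after one match (zero-sum update)."""
    actual = 1.0 if goals_a > goals_b else 0.5 if goals_a == goals_b else 0.0
    k = K_BY_IMPORTANCE[importance] * goal_diff_multiplier(goals_a, goals_b)
    delta = k * (actual - expected_score(rating_a, rating_b, neutral))
    return rating_a + delta, rating_b - delta
```

With these assumed constants, a 1-0 friendly win between two equal teams on neutral ground moves each rating by 10 points, while a 3-0 World Cup win between the same teams moves each by 52.5.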


Then it blended that Elo with the squad-value and FIFA-points signals into a single rating. Each component is put on the same scale before combining, so a team's rating reflects all three rather than whichever happens to have the largest raw numbers. A host adjustment lifts the United States, Mexico, and Canada for playing at home.
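One plausible way to put the components on the same scale is to convert each to z-scores before taking a weighted sum. The sketch below assumes exactly that; the weights, the size of the host bonus, the log transform on market value, and the field names are all hypothetical, since the article doesn't give the agent's choices.

```python
import math
from statistics import mean, pstdev

# Hypothetical weights and host bump, chosen only for illustration.
WEIGHTS = {"elo": 0.5, "value": 0.3, "fifa": 0.2}
HOST_BONUS = 0.25


def zscores(values):
    """Standardize a list so every component has mean 0 and unit spread."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def blend(teams):
    """teams: dicts with name, elo, squad_value, fifa_points, is_host."""
    elo_z = zscores([t["elo"] for t in teams])
    # Market values are heavy-tailed, so this sketch standardizes their logs.
    value_z = zscores([math.log(t["squad_value"]) for t in teams])
    fifa_z = zscores([t["fifa_points"] for t in teams])
    ratings = {}
    for team, e, v, f in zip(teams, elo_z, value_z, fifa_z):
        score = WEIGHTS["elo"] * e + WEIGHTS["value"] * v + WEIGHTS["fifa"] * f
        ratings[team["name"]] = score + (HOST_BONUS if team["is_host"] else 0.0)
    return ratings
```

Without the standardization step, squad value in raw currency units would swamp an Elo in the low thousands, which is the failure the paragraph above describes.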


For each match, the agent drew a win, draw, or loss outcome that preserves the rating-implied expected result, with the draw probability shrinking as the gap between two teams widens. Two evenly matched sides draw often. A mismatch rarely ends level.
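A simple way to get both properties is to start from the rating-implied expected score and carve the draw probability out of it symmetrically. The sketch below is one such construction, not necessarily the agent's: the 30% peak draw rate, the linear shrink, and the 400-point scale are assumptions.

```python
import random

MAX_DRAW = 0.30  # hypothetical draw rate between two evenly matched teams


def outcome_probs(rating_a, rating_b, scale=400.0):
    """Return (p_win, p_draw, p_loss) for team A.

    The split keeps p_win + 0.5 * p_draw equal to the expected score,
    and the draw share shrinks linearly as the rating gap widens.
    """
    expected = 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / scale))
    p_draw = MAX_DRAW * (1.0 - abs(2.0 * expected - 1.0))
    p_win = expected - p_draw / 2.0
    p_loss = 1.0 - p_win - p_draw
    return p_win, p_draw, p_loss


def sample_outcome(rating_a, rating_b, rng):
    """Draw one result for team A: 'win', 'draw', or 'loss'."""
    p_win, p_draw, _ = outcome_probs(rating_a, rating_b)
    u = rng.random()
    if u < p_win:
        return "win"
    if u < p_win + p_draw:
        return "draw"
    return "loss"
```

Under these assumptions two equal teams split 35% / 30% / 35%, and because the draw share never exceeds twice the underdog's expected score, none of the three probabilities can go negative.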


The most important modeling choice is how it handles the knockouts. Knockout games are resolved by strength, not a coin flip. The stronger side carries its edge through extra time and penalties instead of the game being settled 50/50. This is what lets team quality propagate all the way to the final. Without it, every bracket round washes the favorites back toward even and the whole forecast flattens.
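To make the difference concrete, the sketch below compares the two ways of settling a level knockout game. It assumes the tie-break is won with probability equal to the stronger side's expected score; that rule, the 30% peak draw rate, and the function names are illustrative assumptions rather than the agent's documented method.

```python
import random


def expected_score(rating_a, rating_b, scale=400.0):
    """Rating-implied expected result for team A on a 0..1 scale."""
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / scale))


def advance_prob(rating_a, rating_b, max_draw=0.30, coin_flip=False):
    """Probability that team A goes through a single knockout tie."""
    e = expected_score(rating_a, rating_b)
    p_draw = max_draw * (1.0 - abs(2.0 * e - 1.0))  # level after 90 minutes
    p_win = e - p_draw / 2.0
    # Coin flip: level games split 50/50. Strength-resolved: the better
    # side keeps its edge through extra time and penalties.
    tiebreak = 0.5 if coin_flip else e
    return p_win + p_draw * tiebreak


def knockout_winner(team_a, team_b, ratings, rng):
    """Sample which of two teams advances."""
    p = advance_prob(ratings[team_a], ratings[team_b])
    return team_a if rng.random() < p else team_b
```

The per-game gap between the two rules is small, but it compounds over five knockout rounds, which is why the choice shows up so clearly in title probabilities.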


Finally it simulated the full 2026 format 40,000 times with a fixed seed, so the run reproduces exactly. Each simulation plays the group stage, sends the top two from each group plus the eight best third-placed teams into the round of 32, and runs the single-elimination bracket to a champion. Across all 40,000 runs, the agent counts how often each team wins its group, advances, and lifts the cup, and turns those counts into probabilities.
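The loop described here fits in a few dozen lines. The sketch below is a simplified stand-in, not the agent's code: it breaks group ties at random instead of by goal difference, shuffles the 32 qualifiers instead of using the official round-of-32 slotting, and the match model, `max_draw` value, and default seed are assumptions.

```python
import random
from collections import Counter
from itertools import combinations


def expected_score(ra, rb, scale=400.0):
    return 1.0 / (1.0 + 10 ** (-(ra - rb) / scale))


def play_group_match(ra, rb, rng, max_draw=0.30):
    """Return points (a, b): 3 for a win, 1 each for a draw."""
    e = expected_score(ra, rb)
    p_draw = max_draw * (1.0 - abs(2.0 * e - 1.0))
    u = rng.random()
    if u < e - p_draw / 2.0:
        return 3, 0
    if u < e + p_draw / 2.0:
        return 1, 1
    return 0, 3


def play_knockout(a, b, ratings, rng, max_draw=0.30):
    """Strength-resolved tie: level games go to A with probability e."""
    e = expected_score(ratings[a], ratings[b])
    p_draw = max_draw * (1.0 - abs(2.0 * e - 1.0))
    return a if rng.random() < (e - p_draw / 2.0) + p_draw * e else b


def simulate_once(groups, ratings, rng):
    winners, qualified, thirds = [], [], []
    for teams in groups.values():
        points = Counter({t: 0 for t in teams})
        for a, b in combinations(teams, 2):  # round robin: 6 games per group
            pa, pb = play_group_match(ratings[a], ratings[b], rng)
            points[a] += pa
            points[b] += pb
        # Simplification: ties on points are broken at random.
        table = sorted(teams, key=lambda t: (points[t], rng.random()), reverse=True)
        winners.append(table[0])
        qualified += table[:2]
        thirds.append((points[table[2]], rng.random(), table[2]))
    qualified += [t for _, _, t in sorted(thirds, reverse=True)[:8]]
    field = qualified[:]
    rng.shuffle(field)  # simplification: not the official bracket slotting
    while len(field) > 1:
        field = [play_knockout(field[i], field[i + 1], ratings, rng)
                 for i in range(0, len(field), 2)]
    return winners, qualified, field[0]


def simulate(groups, ratings, n_sims=40_000, seed=0):
    """Turn counts over many simulated tournaments into probabilities."""
    rng = random.Random(seed)  # fixed seed so the run reproduces exactly
    win_group, advance, champion = Counter(), Counter(), Counter()
    for _ in range(n_sims):
        winners, qualified, champ = simulate_once(groups, ratings, rng)
        win_group.update(winners)
        advance.update(qualified)
        champion[champ] += 1
    return {t: {"win_group": win_group[t] / n_sims,
                "advance": advance[t] / n_sims,
                "champion": champion[t] / n_sims} for t in ratings}
```

Calling `simulate(groups, ratings)` with a dict of 12 four-team groups and a rating per team returns one row per team. Three sanity checks are worth running on any version of this: champion probabilities should sum to 1, advance probabilities to 32, and group-win probabilities to 12.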


This is a different kind of model than the NBA and NHL forecasts. Those were trained classifiers that learned a win probability from labeled games. This is a strength rating feeding a tournament simulation. The agent recognized that a 48-team bracket with a few competitive matches per team is not a supervised-learning problem and built the thing the problem actually called for.


The signals that go into the rating

The rating blends three inputs, each picking up something the others miss.


Squad market value is the read on current talent. It captures how much quality is on a team's roster right now, which a results-based signal alone can be slow to reflect. A roster stacked with players starring in the Champions League rates highly on this measure regardless of how a few recent friendlies went.


The Elo rating from the full match history is the read on results. Built over a century of internationals, it encodes how teams have actually performed against the quality of opposition they faced, with recent competitive matches weighted most heavily. This is what keeps the forecast grounded in what teams have done rather than reputation.


FIFA points are an independent ranking signal that adds information the other two miss and steadies teams that are hard to read from match results alone.


On top of the three, a host adjustment lifts the United States, Mexico, and Canada for the advantage of playing at home across an entire tournament.


Inside the Elo itself, three sub-choices do real work: the home-advantage term, the goal-difference multiplier that rewards decisive wins, and the tournament-importance weighting that trusts a World Cup result far more than a friendly.


Why the blend matters: a naive version of this model fails badly. Train only on past World Cup matches, roughly 960 games, and you get a near-useless forecast where the favorite sits around 3% and the whole field is nearly uniform. Two reasons. Finals-only history can't know a 2026 squad, so a team like Norway, with little World Cup pedigree but a strong current roster, gets buried. And a washed-out match model with coin-flip knockouts never lets strength reach the later rounds. The squad-value blend fixes the first problem. The strength-resolved knockouts fix the second. The agent's prompt encoded both requirements, and the agent implemented both.
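The 3% figure is roughly what arithmetic alone predicts for coin-flip knockouts. The 2026 format has five knockout rounds, so even a team that is certain to escape its group tops out near 1 in 32 if every knockout game is a 50/50. The short calculation below shows this; the 70% per-round figure is a hypothetical number used only for contrast.

```python
KNOCKOUT_ROUNDS = 5  # round of 32, round of 16, quarterfinal, semifinal, final


def title_probability(p_advance_from_group, p_win_each_round,
                      rounds=KNOCKOUT_ROUNDS):
    """Chance of winning the cup if every knockout round has the same odds."""
    return p_advance_from_group * p_win_each_round ** rounds


# Coin-flip knockouts cap any team at 1/32, even with a guaranteed group exit.
coin_flip_ceiling = title_probability(1.0, 0.5)  # 0.03125

# Hypothetical favorite that wins each knockout game 70% of the time.
strong_favorite = title_probability(1.0, 0.7)  # about 0.168
```

That ceiling is why the forecast only separates teams once strength is allowed to carry through each round.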


What this means for everything else

Same takeaway as the NBA and NHL writeups, in a harder domain.


We assembled three public datasets. We wrote a prompt describing what we wanted forecast, the realism rules the model had to respect, and the output to produce. The agent did the rest: the rating, the match model, the strength-resolved bracket, 40,000 tournament simulations, and the probability table behind the live app. We didn't write the modeling code.


The point is that the work between a clean dataset and a deployed forecast is now mechanical work an agent can do across very different problems. NBA one week, NHL the next, the World Cup after that. Three sports, three different statistical regimes, three different model types, the same workflow each time. The same process applies to churn, demand forecasting, fraud, lead conversion, anything you can predict from tabular data.


The World Cup is the hardest forecasting problem we've put the agent on, and the tournament will tell us how well it did over the next month.


If you have data and a problem worth solving, you can do this without writing a line of model code.


See the live World Cup forecast: worldcup2026.impulselabs.ai

Try it on your own data: impulselabs.ai