An AI system designed to work directly with weather data has demonstrated a new ability to answer scientific questions by combining numerical forecasting models with natural language reasoning.
The result begins to close a long-standing gap between complex atmospheric data and the scientists who must interpret it, allowing weather models to be queried and understood far more directly.
What the tests showed
Within a large testing framework built around real global weather records, the AI agent Zephyrus handled everyday meteorological questions with a clear advantage over text-only AI models.
From those results, researchers at the University of California San Diego (UC San Diego) showed that the agent can translate plain-language questions into code that interrogates weather data and returns usable answers.
Across thousands of benchmark tasks, the agent repeatedly produced more accurate responses than comparable systems that relied only on text knowledge.
That pattern establishes the system as a practical step toward AI tools that can work directly with atmospheric data while revealing where deeper scientific reasoning still remains out of reach.
Why weather was the target
Modern forecasts still arrive as huge grids of numbers, even as newer AI models outperform traditional systems on many medium-range forecasting benchmarks.
Most large language models (LLMs) – AI systems trained to predict likely text – still cannot reason directly over those sprawling numerical fields.
Weather made a useful first target because the field mixes changing measurements, place names, time windows, and written bulletins.
“Weather prediction is a critical scientific challenge, with profound implications spanning agriculture, disaster preparedness, transportation, and energy management,” the researchers write.
How Zephyrus works
When a user types a request, the agent turns it into Python code and runs it in a controlled workspace.
There it can load weather data, map place names to coordinates, and call a forecast model already built into the system.
After each run, the stronger version of the agent checks what happened, fixes errors, and tries again before answering.
Because first attempts at these questions often fail, that extra pass turns many dead ends into usable results.
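The paper's code is not reproduced in this article, but the loop it describes can be sketched in a few lines. In the sketch below every name is hypothetical: generate_code stands in for the LLM call, and a subprocess acts as a crude stand-in for the controlled workspace.

```python
import subprocess
import sys
import tempfile

MAX_ATTEMPTS = 3  # illustrative retry budget, not a figure from the paper

def ask_weather_agent(question: str, generate_code) -> str:
    """Generate-execute-reflect loop (hypothetical sketch).

    generate_code(question, feedback) stands in for the LLM call that
    writes a Python script meant to answer the question from weather data.
    """
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        script = generate_code(question, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(script)
            path = f.name
        # Run the generated script in a separate process, a crude stand-in
        # for the controlled workspace described above.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=120
        )
        if result.returncode == 0:
            return result.stdout        # the run succeeded: return the answer
        feedback = result.stderr        # reflection: feed the error back in
    return "No working analysis after several attempts."
```

In a loop like this, the "reflection" is simply the feedback variable: each failed run's error message becomes part of the next code-generation prompt.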
The data behind the study
To test the agent, the team assembled 2,158 question-answer pairs across 46 tasks, from simple lookups to hard report writing.
Much of that testing relied on reanalysis, a reconstructed global weather record that blends observations with forecast models to rebuild past atmospheric conditions.
The data came from the ERA5 archive, a global atmospheric dataset produced by the European Centre for Medium-Range Weather Forecasts.
Another layer came from the WeatherBench 2 framework, which standardizes how weather models are checked against one another.
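To make that concrete, here is how a script of the kind Zephyrus generates might pull a single value out of an ERA5 archive using the xarray library. This is an illustrative sketch, not the paper's code: WeatherBench 2 hosts ERA5 as cloud Zarr stores, but the URL below is a placeholder, and the exact path and variable names should be taken from the WeatherBench 2 documentation.

```python
import xarray as xr

# Placeholder URL: WeatherBench 2 publishes ERA5 as Zarr stores on Google
# Cloud Storage; substitute a real path from the WeatherBench 2 docs.
ERA5_ZARR = "gs://weatherbench2/datasets/era5/<store-name>.zarr"

# Open the archive lazily; reading gs:// paths requires the gcsfs package.
ds = xr.open_zarr(ERA5_ZARR)

# The kind of simple lookup the benchmark tests: 2-meter temperature near
# San Diego (32.7 N, 117.2 W; ERA5 longitudes run 0-360, hence 242.8).
point = ds["2m_temperature"].sel(
    time="2020-07-01T12:00",
    latitude=32.7,
    longitude=242.8,
    method="nearest",
)
print(float(point) - 273.15, "degrees C")  # ERA5 stores temperature in kelvin
```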
Because the questions covered places, times, forecasts, and written summaries, the benchmark exposed both practical wins and obvious weak spots.
On easier tasks, the agent usually gave correct answers to the kinds of questions forecasters ask every day about values, times, and places.
For one tested model, the version that could test its answers against weather data reached 54.7 percent correctness, while a text-only version managed 19.9 percent.
Across all four tested AI models, the system answered weather questions about 29 to 35 percentage points more accurately than text-only versions.
Location questions stood out in particular: one version reached 86.6 percent accuracy, where text-only models could do little more than guess from stored knowledge.
Where Zephyrus struggled
Harder assignments told a rougher story, especially when the agent had to spot extremes or write broad weather discussions.
Report writing remained especially weak, with the best system scoring just 0.177 on a scale from 0 to 1 that measures how closely its reports matched expert forecasts.
Three-month global outlooks fell apart most clearly, while shorter U.S. forecast discussions showed at least some promise.
Those misses matter because scientists still need judgment, not fluent-sounding prose, when they brief people about rare or dangerous events.
Climate comes next
Zephyrus also reached beyond day-to-day weather by calling a climate simulator that can run longer scenarios.
Some harder prompts asked for counterfactual answers, reasoned guesses about what changes after an altered starting condition.
For climate research, that matters because scientists often need to explore causes, not just describe what already happened.
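The idea is easier to see with a toy example. The sketch below is purely illustrative and is not the simulator Zephyrus calls: a zero-dimensional energy-balance model is run twice, once with a baseline forcing and once with an altered one, and the counterfactual answer is the difference between the two runs.

```python
def toy_energy_balance(forcing_wm2: float, years: int = 200) -> float:
    """Zero-dimensional energy-balance model (toy illustration only).

    Steps dT/dt = (F - lambda * T) / C forward one year at a time and
    returns the final temperature anomaly in kelvin.
    """
    heat_capacity = 8.0   # C: effective heat capacity, W yr m^-2 K^-1
    feedback = 1.2        # lambda: net climate feedback, W m^-2 K^-1
    temp_anomaly = 0.0
    for _ in range(years):
        temp_anomaly += (forcing_wm2 - feedback * temp_anomaly) / heat_capacity
    return temp_anomaly

# Counterfactual comparison: identical model, altered forcing scenario.
baseline = toy_energy_balance(forcing_wm2=2.6)  # moderate-forcing pathway
altered = toy_energy_balance(forcing_wm2=4.5)   # what if forcing were higher?
print(f"extra warming in the altered scenario: {altered - baseline:.2f} K")
```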
The next version will use larger training sets and more tuning for climate-focused tasks, the researchers noted.
The future of Earth science
Much of the promise here is practical, because weather science still asks newcomers to juggle code, data stores, and domain jargon.
That barrier can slow students and early-career researchers long before they get to the science itself.
“Our vision is to democratize Earth science,” said Rose Yu, an associate professor in computer science and engineering at UC San Diego.
If tools like this mature, more researchers could spend less time plumbing data and more time testing actual ideas.
Seen more broadly, Zephyrus points toward a different kind of scientific software – one that can query data, write code, and explain its answer.
For now, Zephyrus appears best suited as an assistant rather than a replacement – fast and useful for focused tasks, but still limited when complex scientific judgment is required.
The study is published in the Proceedings of the 2026 International Conference on Learning Representations.