New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make a well-informed data choice. Ai2's research shows that small-model experiments can predict large-model performance with 80% accuracy, at just 0.01% of the compute.
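
To make that 80% figure a bit more concrete, here is a minimal Python sketch of one way such a number could be computed. It assumes "accuracy" means pairwise agreement: for every pair of candidate corpora, does the small-model experiment pick the same winner as the large-model run? That reading is my assumption, not a restatement of Ai2's exact method. The function name `decision_accuracy`, the corpus names, and all scores are invented for illustration.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs where the small-scale winner matches the large-scale winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    This is a hypothetical helper, not part of any Ai2 tooling.
    """
    names = sorted(set(small_scores) & set(large_scores))
    agree = 0
    total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        if small_diff == 0 or large_diff == 0:
            continue  # a tie gives no winner to compare, so skip the pair
        total += 1
        if (small_diff > 0) == (large_diff > 0):
            agree += 1
    if total == 0:
        raise ValueError("need at least one untied pair of corpora")
    return agree / total


# Invented scores for five hypothetical corpora (not real results).
small = {"corpus_a": 0.41, "corpus_b": 0.44, "corpus_c": 0.39,
         "corpus_d": 0.47, "corpus_e": 0.43}
large = {"corpus_a": 0.58, "corpus_b": 0.63, "corpus_c": 0.60,
         "corpus_d": 0.69, "corpus_e": 0.64}

print(decision_accuracy(small, large))  # 0.8 (8 of 10 pairs agree)
```

In this toy data, the small runs get two of the ten pairwise calls wrong (a vs. c and b vs. e), yet they still identify corpus_d as the best choice, which is the kind of decision a developer would actually care about.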


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data