New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide, a suite of research, benchmarks, and tools that helps developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to get high-accuracy results. Ai2's research shows that small-model experiments can predict large-model performance with 80% accuracy, at just 0.01% of the compute.
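To make that "80% accuracy" figure more concrete, here is a minimal sketch of one way such a number could be computed. It assumes the metric is pairwise agreement: for every pair of candidate corpora, does the small model pick the same winner as the large model? That reading is my assumption for illustration, not a definition taken from Ai2. The function name, corpus names, and scores below are all hypothetical.

```python
from itertools import combinations


def pairwise_decision_accuracy(small_scores, large_scores):
    """Share of corpus pairs where the small-model winner matches the large-model winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    This is an illustrative metric, assumed here; it is not Ai2's code.
    """
    names = sorted(set(small_scores) & set(large_scores))
    agree = 0
    total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        if small_diff == 0 or large_diff == 0:
            continue  # ties carry no ranking signal, so skip them
        total += 1
        if (small_diff > 0) == (large_diff > 0):
            agree += 1
    if total == 0:
        raise ValueError("need at least one untied pair of corpora")
    return agree / total


# Made-up scores for four hypothetical corpora.
small = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.44, "corpus_d": 0.40}
large = {"corpus_a": 0.62, "corpus_b": 0.57, "corpus_c": 0.60, "corpus_d": 0.59}

accuracy = pairwise_decision_accuracy(small, large)
print(f"{accuracy:.0%} of pairwise decisions agree")  # 5 of the 6 pairs agree
```

In this toy example, the small model misjudges only the corpus_a versus corpus_c comparison, so a team relying on the cheap experiments would still make the right call on most head-to-head choices.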


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data