New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make a well-informed data choice. Ai2's research shows that small-model experiments can predict which training data will perform best at a larger model scale with roughly 80% accuracy, at just 0.01% of the compute.
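
To make the 80% figure concrete, here is a minimal Python sketch of one way such a number could be computed. It assumes the figure counts how often a small-model comparison between two candidate datasets picks the same winner as the large model. The function name, corpus names, and scores are all hypothetical and are not taken from Ai2's tooling or results.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of dataset pairs where small and large models agree on the winner."""
    pairs = list(combinations(small_scores, 2))
    agree = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        large_prefers_a = large_scores[a] > large_scores[b]
        # Count the pair when both scales rank the two corpora the same way.
        agree += small_prefers_a == large_prefers_a
    return agree / len(pairs)


# Hypothetical benchmark scores for four candidate corpora (illustrative only).
small = {"corpus_a": 0.31, "corpus_b": 0.35, "corpus_c": 0.33, "corpus_d": 0.29}
large = {"corpus_a": 0.52, "corpus_b": 0.58, "corpus_c": 0.50, "corpus_d": 0.47}

print(decision_accuracy(small, large))
```

With these made-up numbers, the two scales agree on five of the six possible pairs. The only disagreement is between corpus_a and corpus_c, which swap places at the larger scale.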


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data