New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make a well-informed data choice. Ai2's research shows that experiments with small models can predict which training dataset will perform better at a larger scale about 80% of the time, using just 0.01% of the compute.
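
To make that 80% figure a little more concrete, here's a minimal sketch of one way to score this kind of prediction: take every pair of candidate corpora and check whether the small-model ranking agrees with the large-model ranking. This is an illustration, not Ai2's code. The function name, corpus names, and benchmark scores below are all made up, and it assumes accuracy is measured as pairwise agreement.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs where the small-model winner is also the large-model winner."""
    agree = 0
    total = 0
    for a, b in combinations(sorted(small_scores), 2):
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        if small_diff == 0 or large_diff == 0:
            continue  # skip ties, since there is no winner to predict
        total += 1
        if (small_diff > 0) == (large_diff > 0):
            agree += 1
    return agree / total if total else 0.0


# Hypothetical benchmark accuracy for models trained on five candidate corpora.
small_scores = {"corpus_a": 0.30, "corpus_b": 0.34, "corpus_c": 0.32,
                "corpus_d": 0.36, "corpus_e": 0.31}
large_scores = {"corpus_a": 0.50, "corpus_b": 0.58, "corpus_c": 0.55,
                "corpus_d": 0.56, "corpus_e": 0.53}

print(decision_accuracy(small_scores, large_scores))
```

In this toy data, the small models get one pair wrong (corpus_b vs. corpus_d) out of ten, so the cheap experiments would have pointed to the right corpus in most head-to-head comparisons.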


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data