New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to get high-accuracy results. Ai2's research shows that small-model experiments can predict large-model performance with 80% accuracy, at just 0.01% of the compute.


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data
By Bill Doerrfeld February 28, 2026
While hardware usually gets the spotlight in physical AI, the real differentiator won't be hardware. It'll be the models.
By Bill Doerrfeld February 27, 2026
In the latest DirectorPlus, Workato's CTO explains how MCP-enabled integration catalyzed internal AI usage and ROI.
By Bill Doerrfeld February 18, 2026
My latest on InfoWorld reviews MCP servers from 5 major cloud providers
By Bill Doerrfeld February 18, 2026
How are organizations actually using agentic knowledge bases in practice? My article for The New Stack looks at six emerging patterns.
eBPF in Production Report
By Bill Doerrfeld February 12, 2026
My report for the eBPF Foundation explores enterprise eBPF case studies, production deployments, and real business outcomes across cloud-native environments.
Close-up of whole bean coffee Bottomless
By Bill Doerrfeld February 10, 2026
Longtime Bottomless user sharing why I love automated coffee delivery triggered by a smart scale, plus a referral link for a free first bag.
By Bill Doerrfeld February 5, 2026
MCP servers can quickly drain context windows without guardrails. Thankfully, there are ways around this, say the experts.
By Bill Doerrfeld February 4, 2026
It may seem like AI agents are suddenly doing everything across industries. But in reality, the pace of agentic AI is moving carefully, and very deliberately, in highly regulated environments like finance and banking.
By Bill Doerrfeld February 3, 2026
My latest feature for InfoWorld explores when it makes sense to scrape public web sources, and when official API integrations are the better choice for external data.
By Bill Doerrfeld January 30, 2026
What does it mean to go nano with your software updates — to "carve with a scalpel" instead of swinging a hammer? For my latest DirectorPlus piece, I caught up with Chainguard VP Dustin Kirkland to dig into that idea.