New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make a reliable data choice. Ai2's research shows that small-model experiments can predict large-model performance with 80% accuracy, at just 0.01% of the compute.
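To make that accuracy figure a little more concrete, here is a minimal Python sketch of one way such a prediction score could be computed. It assumes "accuracy" means pairwise agreement: for every pair of candidate corpora, does the small-model experiment pick the same winner as the large model? That reading is an assumption on my part, and the function name, corpus names, and scores below are all made up for illustration. None of this is Ai2's code or data.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs where the small-scale winner matches the large-scale winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    This pairwise definition is an assumption made for illustration.
    """
    corpora = sorted(set(small_scores) & set(large_scores))
    pairs = list(combinations(corpora, 2))
    if not pairs:
        raise ValueError("need at least two corpora scored at both scales")

    agree = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        large_prefers_a = large_scores[a] > large_scores[b]
        if small_prefers_a == large_prefers_a:
            agree += 1
    return agree / len(pairs)


# Hypothetical benchmark scores for four made-up corpora.
small = {"corpus_a": 0.41, "corpus_b": 0.44, "corpus_c": 0.39, "corpus_d": 0.47}
large = {"corpus_a": 0.58, "corpus_b": 0.55, "corpus_c": 0.52, "corpus_d": 0.66}

print(f"{decision_accuracy(small, large):.2f}")
```

In this toy data, the small runs get five of the six pairwise comparisons right. The only miss is corpus_a versus corpus_b, where the ordering flips at the larger scale.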


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data

Other Blog Posts

By Bill Doerrfeld April 20, 2026
My InfoWorld feature reviews the key building blocks in agentic systems, with real-world examples from Shopify, Block, and others.
By Bill Doerrfeld March 31, 2026
My latest InfoWorld feature explores what makes an enterprise MCP registry effective, from semantic discovery to governance and security for AI agents.
By Bill Doerrfeld March 30, 2026
My first-ever contribution to CSO Online looks at the shifting landscape, from perimeter-based security to API security, and how CISOs are responding.
By Bill Doerrfeld March 29, 2026
My latest feature for The New Stack looks into solutions being proposed to fix open source Slopmageddon.
By Bill Doerrfeld March 27, 2026
My latest DirectorPlus looks at how agentic AI is reshaping platform engineering at Squarespace: less shared code and more developer experience focus.
By Bill Doerrfeld March 19, 2026
Usage-based pricing is reshaping the API economy. Discover 5 API monetization success stories, including OpenAI, Plaid, and AssemblyAI.
By Bill Doerrfeld March 18, 2026
Why event-driven APIs matter for AI workflows, enabling real-time data, scalable systems, and responsive agent behavior.
By Bill Doerrfeld February 28, 2026
While hardware usually gets the spotlight in physical AI, the real differentiator won't be hardware. It'll be the models.
By Bill Doerrfeld February 27, 2026
In the latest DirectorPlus, Workato's CTO explains how MCP-enabled integration catalyzed internal AI usage and ROI.
By Bill Doerrfeld February 18, 2026
My latest on InfoWorld reviews MCP servers from 5 major cloud providers