New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 that helps LLM developers make better decisions about training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide, a suite of research, benchmarks, and tools that helps developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to get high-accuracy results. Ai2's research shows that small-model experiments can predict large-model performance with 80% accuracy, at just 0.01% of the compute.
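
To make that accuracy figure a bit more concrete, here is a minimal Python sketch of one plausible way to score it: for every pair of candidate corpora, check whether the small-model experiment picks the same winner as the large model. This pairwise reading is my assumption, not a description of Ai2's exact method, and the function name, corpus names, and scores below are all made up for illustration.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs where the small-model winner matches the large-model winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    Ties are not handled specially in this sketch.
    """
    pairs = list(combinations(sorted(small_scores), 2))
    agree = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        large_prefers_a = large_scores[a] > large_scores[b]
        if small_prefers_a == large_prefers_a:
            agree += 1
    return agree / len(pairs)


# Hypothetical scores, invented purely for illustration (not real results).
small = {"corpus_a": 0.31, "corpus_b": 0.35, "corpus_c": 0.33,
         "corpus_d": 0.29, "corpus_e": 0.34}
large = {"corpus_a": 0.52, "corpus_b": 0.61, "corpus_c": 0.55,
         "corpus_d": 0.50, "corpus_e": 0.54}

# The small model ranks corpus_e above corpus_c, the large model disagrees,
# so 9 of the 10 pairs agree.
print(decision_accuracy(small, large))
```

With these toy numbers, the cheap experiments would have pointed to the right corpus in 9 out of 10 head-to-head comparisons.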


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data
