New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make well-informed data choices. Ai2's research shows that small-model experiments can predict which training datasets will perform best at larger scale with about 80% accuracy, at just 0.01% of the compute.
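
To make that accuracy figure concrete, here is a minimal Python sketch. It assumes the number is measured as pairwise agreement: for each pair of candidate corpora, does the small-model experiment pick the same winner as the large model? The function name `decision_accuracy`, the corpus names, and all of the scores are hypothetical, invented for illustration. They are not Ai2's code or data.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs that are ranked the same way at both scales."""
    pairs = list(combinations(sorted(small_scores), 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical benchmark scores (higher is better), invented for illustration.
small = {"corpus_a": 0.41, "corpus_b": 0.44, "corpus_c": 0.39, "corpus_d": 0.43}
large = {"corpus_a": 0.58, "corpus_b": 0.63, "corpus_c": 0.55, "corpus_d": 0.57}

print(f"{decision_accuracy(small, large):.2f}")  # 5 of 6 pairs agree -> 0.83
```

In this toy data, the small models get one pair wrong (`corpus_a` versus `corpus_d`) and still pick the same overall winner as the large models, which is the kind of cheap signal the research points to.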


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data
Nordic APIs ranked #1 API blog on the web
By Bill Doerrfeld June 7, 2025
Nordic APIs, the API-specific blog I edit, was recently ranked the top API blog online by FeedSpot. After ten years managing this presence, I reflect a bit on the journey thus far.
Tips to improve your AI vibe coding
By Bill Doerrfeld June 3, 2025
Developers are realizing that being productive with AI coding assistants takes a lot more than just asking nicely. There's real craft to it.
AI coding is the easy part. Now it's time to focus on production.
By Bill Doerrfeld May 30, 2025
AI coding is the easy part. Now it's time to focus on DevOps to get it into production. In a recent interview for LeadDev's DirectorPlus, Honeycomb's CTO, Charity Majors, shares expert tips on how to accomplish this.
MCP security vulnerabilities
By Bill Doerrfeld May 21, 2025
My APISEC|CON talk covers the hype around agentic AI and MCP, delves into inherent flaws in MCP architectures, and suggests mitigations.
Knowing when to use AI coding assistants
By Bill Doerrfeld May 6, 2025
AI coding assistants are a productivity dream in some cases — and a debugging nightmare in others. So, where’s the line?
How semantic caching reduces LLM API calls
By Bill Doerrfeld May 5, 2025
Semantic caching is like typical caching, but for AI. It could eliminate a lot of redundant API calls to LLMs, reducing costs and improving performance.
Using agentic AI for business workflows
By Bill Doerrfeld April 30, 2025
For CIO.com, leading executives shared with me how they're actively utilizing agentic AI to enhance core business workflows.
Making developer productivity metrics actionable
By Bill Doerrfeld April 25, 2025
Developer productivity metrics are useless if they're just sitting in dashboards. So, how can we use them to direct positive, real-world action?
New study reveals what really drives revenue per engineer
By Bill Doerrfeld April 10, 2025
What leads to a higher revenue per engineer? New benchmarking from DX reveals how areas like R&D spend, org size, and growth rate move the needle.
LLMs can now cite their sources
By Bill Doerrfeld April 9, 2025
My latest post on The New Stack reveals how researchers pinpoint the exact sources behind chatbot responses.