New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.

Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?

I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide — a suite of research, benchmarks, and tools that help developers make smarter pre-training data choices.

The big takeaway? You don't need to spend massive compute upfront to make a well-informed data choice. Ai2's research shows that small-model experiments can predict which training dataset will perform better at a larger scale about 80% of the time, at just 0.01% of the compute.
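To make that prediction claim a little more concrete, here is a minimal sketch of how you might measure it: take every pair of candidate datasets and check whether the small-model experiment picks the same winner as the large-model run. The function name, corpus names, and scores below are all made up for illustration; this is not Ai2's code or data.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of dataset pairs where small and large models agree on the winner.

    Both arguments map a dataset name to a benchmark score (higher is better).
    """
    pairs = list(combinations(sorted(small_scores), 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical scores for four invented corpora (not real results).
small = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.44, "corpus_d": 0.40}
large = {"corpus_a": 0.58, "corpus_b": 0.52, "corpus_c": 0.57, "corpus_d": 0.55}

print(f"{decision_accuracy(small, large):.0%} of pairwise decisions agree")
```

In this toy data, the small models pick the same winner as the large models in five of the six pairs. They only disagree on corpus_a versus corpus_c.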

Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.

Featured image credit: Javier Allegue Barros

Read: New Tools Help LLM Developers Choose Better Pre-Training Data
