New tools help LLM devs improve training data decisions

Bill Doerrfeld | May 29, 2025

My latest for The New Stack explores new research from Ai2 to help make better decisions around training data.


Training data matters. A lot. The data behind an LLM shapes its efficiency, accuracy, and bias. So, how do AI developers actually choose the right corpora?


I was surprised to learn that the status quo is often pretty ad hoc. But the Allen Institute for AI (Ai2) is challenging that with DataDecide, a suite of research, benchmarks, and tools that helps developers make smarter pre-training data choices.


The big takeaway? You don't need to spend massive compute upfront to make well-informed data choices. Ai2's research shows that small-model experiments can predict which training data will perform better at a larger scale with 80% accuracy, at just 0.01% of the compute.
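To make "predicting with 80% accuracy" a bit more concrete, here is a minimal sketch of one way such a number could be computed: take every pair of candidate corpora and check whether the small-model experiments pick the same winner as the large-model runs. The function name, corpus names, and scores below are all hypothetical and made up for illustration. This is not Ai2's code or data.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs where the small-scale winner matches the large-scale winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    Hypothetical helper for illustration, not part of any Ai2 tooling.
    """
    names = sorted(small_scores)
    pairs = list(combinations(names, 2))
    agreements = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        large_prefers_a = large_scores[a] > large_scores[b]
        if small_prefers_a == large_prefers_a:
            agreements += 1
    return agreements / len(pairs)


# Made-up benchmark scores for four hypothetical corpora.
small = {"corpus_a": 0.31, "corpus_b": 0.29, "corpus_c": 0.35, "corpus_d": 0.33}
large = {"corpus_a": 0.52, "corpus_b": 0.47, "corpus_c": 0.58, "corpus_d": 0.50}

# Only the (corpus_a, corpus_d) pair flips between scales, so 5 of 6 pairs agree.
print(f"{decision_accuracy(small, large):.2f}")  # prints 0.83
```

The point of a pairwise check like this is that the small models don't need to match the large model's absolute scores. They only need to rank the candidate corpora in the same order.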


Check out my latest for The New Stack, where I dive into the findings and their broader implications for AI development.


Featured image credit: Javier Allegue Barros


Read: New Tools Help LLM Developers Choose Better Pre-Training Data