Training large language model (LLM) agents to act intelligently in real-world environments has become one of AI's hottest frontiers. From autonomous coding assistants to web-browsing bots, these agents rely on data — not just text, but sequences of actions and observations that reflect real-world reasoning.

But here's the catch: while pretraining LLMs benefits from massive internet-scale text data, fine-tuning agent models has been slow and fragmented. Each dataset uses a different format, making it almost impossible to combine or reuse across frameworks.

That's exactly the bottleneck that researchers from Carnegie Mellon University, The Ohio State University, and others set out to fix with their new framework — the Agent Data Protocol (ADP).

The Problem: Fragmented Agent Datasets

Agent training data comes in many flavors — coding tasks, web browsing sessions, software engineering trajectories, and more. However, every dataset records agent behavior differently:

  • Some use HTML trees for web navigation
  • Some log API calls as plain text
  • Some mix natural language with structured code snippets

This heterogeneity makes it difficult to reuse data across systems like OpenHands, SWE-Agent, or AgentLab. As a result, researchers often rebuild data pipelines from scratch — a slow, error-prone process that limits scalability.

The ADP team summarized it perfectly:

"The bottleneck is not the lack of data, but a lack of standardization."

Introducing the Agent Data Protocol (ADP)

ADP is a lightweight, expressive representation language designed to standardize agent training data. Think of it as an "interlingua" that allows datasets from diverse domains to speak a common language.

At its core, every ADP dataset consists of trajectories — sequences of what the agent did (actions) and what it saw (observations).

Core Components of ADP

| Element | Description |
| --- | --- |
| Action | What the agent does, e.g., code execution, API calls, messages |
| Observation | What the agent perceives, e.g., web page content, text feedback |

Action Types:

  • 🧰 APIAction: structured tool calls like goto(url="google.com")
  • 💻 CodeAction: code snippets the agent executes, such as print("Hello World")
  • 💬 MessageAction: text-based interactions ("How can I help you?")

Observation Types:

  • 📜 TextObservation: captures feedback or execution results
  • 🌐 WebObservation: represents webpages (HTML, accessibility tree, screenshots)

This simple schema — implemented in Pydantic — turns complex agent interactions into consistent, machine-readable data.
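As a rough illustration, here is what such a schema could look like. This is a minimal, dependency-free sketch using Python dataclasses (the actual ADP implementation uses Pydantic, and its field names may differ):

```python
from dataclasses import dataclass

@dataclass
class APIAction:
    """A structured tool call, e.g. goto(url="google.com")."""
    function: str
    arguments: dict

@dataclass
class CodeAction:
    """A code snippet the agent executes."""
    language: str
    code: str

@dataclass
class MessageAction:
    """A natural-language message from the agent."""
    content: str

@dataclass
class TextObservation:
    """Text feedback or execution results seen by the agent."""
    content: str

@dataclass
class WebObservation:
    """A webpage as HTML and/or an accessibility tree."""
    html: str = ""
    accessibility_tree: str = ""

@dataclass
class Trajectory:
    """A sequence of alternating actions and observations."""
    id: str
    steps: list

# A tiny illustrative trajectory: navigate, observe the page, respond.
traj = Trajectory(
    id="demo-1",
    steps=[
        APIAction(function="goto", arguments={"url": "google.com"}),
        WebObservation(html="<html>...</html>"),
        MessageAction(content="How can I help you?"),
    ],
)
print(traj.steps[0].function)  # → goto
```

Because every dataset maps onto the same handful of types, downstream tooling can process trajectories from any domain with one code path.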

⚙️ How ADP Works: From Chaos to Clarity

ADP introduces a three-stage conversion pipeline:

  1. Raw → Standardized (ADP): Converts each dataset's unique format into ADP's unified schema.
  2. ADP → SFT (Supervised Fine-Tuning): Translates standardized ADP data into the format required by each agent framework (e.g., OpenHands, SWE-Agent).
  3. Quality Assurance: Automated validation ensures correctness of tool calls, reasoning text, and conversation flow.
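A toy converter makes the first and third stages concrete. The raw record format below is invented for illustration, and the ADP-style output uses plain dicts rather than the real schema classes:

```python
# Hypothetical raw record from a web-browsing dataset (format is illustrative).
raw_step = {
    "event": "click",
    "selector": "#search",
    "page_html": "<html>...</html>",
}

def to_adp(step: dict) -> list[dict]:
    """Stage 1: map one dataset-specific step to ADP-style records."""
    action = {
        "type": "APIAction",
        "function": step["event"],
        "arguments": {"selector": step["selector"]},
    }
    observation = {"type": "WebObservation", "html": step["page_html"]}
    return [action, observation]

records = to_adp(raw_step)

# Stage 3: a minimal quality-assurance pass, checking each record is well-formed.
for record in records:
    assert "type" in record, "every ADP record must declare its type"

print(records[0]["function"])  # → click
```

Each dataset needs one such converter into ADP; each framework needs one script out of ADP, and the two sides never need to know about each other.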

Without ADP, converting D datasets to A agent frameworks requires O(D × A) engineering effort. With ADP, it becomes O(D + A) — a massive simplification (see below):

  • Before: 13 datasets × 3 agents = 39 custom converters
  • After ADP: 13 converters + 3 scripts = 16 total

*ADP collapses many-to-many conversions into a hub-and-spoke pipeline.*

The Impact: Unified Data, Unified Gains

The team converted 13 major datasets (like AgentInstruct, Mind2Web, SWE-Gym, Synatra) into the ADP format, creating the largest open agent dataset ever released — over 1.3 million training trajectories.

They then fine-tuned models using ADP data across multiple frameworks, achieving dramatic improvements:

| Model | Task | Base Accuracy | +ADP Accuracy |
| --- | --- | --- | --- |
| SWE-Agent (7B) | SWE-Bench | 0.4% | 20.2% |
| OpenHands (7B) | SWE-Bench | 2.8% | 20.4% |
| AgentLab (7B) | WebArena | 4.5% | 21.0% |
| OpenHands (7B) | AgentBench | 3.5% | 27.1% |

Even 7B–8B models trained on ADP data matched or beat proprietary models like Claude 3 Sonnet on some tasks.

🔄 Beyond Fine-Tuning: Cross-Task Generalization

Interestingly, models trained on the diverse ADP mix outperformed those fine-tuned on task-specific datasets alone. For example:

  • On SWE-Bench, ADP-trained models achieved 10.4% accuracy vs. just 1.0% for SWE-smith-only fine-tuning.
  • On WebArena, ADP-trained models reached 20.1%, beating single-domain tuning (16.0%).

This highlights a key insight: diversity in training data improves generalization across domains — a critical property for building truly general-purpose LLM agents.

⚡ Why ADP Matters

ADP isn't just a technical contribution — it's a community framework. It:

  • Reduces redundant engineering
  • Makes agent research reproducible
  • Encourages dataset sharing
  • Enables fair benchmarking across agents

By acting as a universal "data bridge", ADP transforms the fragmented agent data landscape into a scalable ecosystem for fine-tuning, analysis, and innovation.

The Road Ahead

The authors outline three next steps for ADP's evolution:

  1. Multimodal Extensions — Incorporating screen recordings, images, and richer sensory data.
  2. Standardized Evaluation Protocols — Applying ADP's principles to testing and benchmarking.
  3. Community Growth — Encouraging open-source contributions and automated data validation tools.

Final Thoughts

The Agent Data Protocol represents a turning point for agentic AI research. By unifying how agent data is represented and shared, it removes one of the biggest roadblocks in LLM fine-tuning.

In the same way that datasets like ImageNet standardized computer vision, ADP could become the foundation for the next generation of intelligent, autonomous LLM agents.

🔗 Explore more: [2510.24702] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents