Full Report
MIT CSAIL's 2025 AI Agent Index puts opaque automated systems under the microscope

AI agents are becoming more common and more capable, without consensus or standards on how they should behave, say academic researchers.…
Analysis Summary
# Research: MIT CSAIL 2025 AI Agent Index
## Metadata
- **Authors:** Leon Staufer, Kevin Feng, Kevin Wei, Luke Bailey, Yawen Duan, Mick Yang, A. Pinar Ozisik, Stephen Casper, and Noam Kolt.
- **Institution:** MIT Computer Science & Artificial Intelligence Laboratory (CSAIL), in collaboration with University of Cambridge, University of Washington, Harvard Law School, Stanford University, and others.
- **Publication:** [AI Agent Index](https://aiagentindex.mit.edu/)
- **Date:** February 20, 2026 (Published context)
## Abstract
The 2025 AI Agent Index conducts a rigorous analysis of 30 prominent AI agents—systems capable of autonomous action across software services. The research reveals a significant transparency gap: while agentic capabilities and autonomy are accelerating, documentation regarding safety evaluations, third-party testing, and behavioral standards remains dangerously sparse. The study highlights the systemic risks posed by "opaque" automated systems that operate without a unified safety consensus.
## Research Objective
The study aims to address the lack of established standards and public information regarding the development, deployment, and safety of AI agents. It seeks to answer:
1. How autonomous are current AI agents?
2. What safety frameworks and evaluations are developers disclosing?
3. What are the legal and technical dependencies of the current agent ecosystem?
## Methodology
### Approach
The researchers conducted a multi-dimensional analysis using 45 specific annotation fields per agent. The evaluation was categorized into six pillars:
- Legal status and origin
- Technical capabilities
- Autonomy and control
- Ecosystem interaction
- Evaluation methodologies
- Safety protocols
### Dataset/Environment
The study analyzed **30 AI agents**, including:
- **Chat-based agents:** Manus AI, ChatGPT Agent, Claude Code.
- **Browser-based agents:** Perplexity Comet, ChatGPT Atlas, ByteDance Agent TARS.
- **Enterprise workflow agents:** Microsoft Copilot Studio, ServiceNow Agent.
### Tools & Technologies
The research utilized the **AI Agent Index platform**, a comparative tool designed to track machine learning models that possess agency (the ability to take actions via software APIs rather than just generating text).
## Key Findings
### Primary Results
1. **The Transparency Deficit:** 25 out of 30 agents provided no details on safety testing, and 23 offered no third-party testing data.
2. **Autonomy vs. Safety:** Of 13 agents reaching "frontier" levels of autonomy, only four (ChatGPT Agent, OpenAI Codex, Claude Code, and Gemini 2.5 Computer Use) disclosed agent-specific safety evaluations.
3. **Market Concentration:** The ecosystem is dominated by a handful of foundation model providers (Anthropic, Google, and OpenAI) acting as the base layer for most other "wrapper" agents.
4. **Closed Ecosystems:** 23 out of 30 agents are closed-source, complicating independent security audits.
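
To put the counts above on a common scale, the short sketch below converts them to percentages. The counts are taken directly from the findings listed here; the dictionary keys and the helper name `share` are illustrative labels, not terms used by the Index.

```python
# Counts as reported in the findings above: (agents matching, agents considered).
findings = {
    "no_safety_testing_details": (25, 30),
    "no_third_party_testing_data": (23, 30),
    "closed_source": (23, 30),
    "frontier_with_agent_safety_evals": (4, 13),
}


def share(matching: int, total: int) -> float:
    """Return matching/total as a percentage rounded to one decimal place."""
    return round(100 * matching / total, 1)


shares = {name: share(m, t) for name, (m, t) in findings.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Read this way, roughly five in six agents disclose no safety testing details, and fewer than a third of the frontier-autonomy agents disclose agent-specific evaluations.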
### Supporting Evidence
- 80% of the agents studied were updated or released within the 2024-2025 period, indicating rapid feature growth outpacing safety documentation.
- Geographical distribution: 13 agents are from U.S.-based (Delaware) firms, 5 from China, and 4 from other international jurisdictions (Germany, Norway, Cayman Islands).
### Novel Contributions
- **Multi-layered Dependency Mapping:** The research identifies the "scaffolding and orchestration" layers as a source of evaluation difficulty, where no single entity takes responsibility for the agent’s end-to-end behavior.
- **Protocol Obsolescence:** The study suggests traditional web protocols like `robots.txt` are becoming insufficient to manage or restrict autonomous agent behavior.
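
One way to see why `robots.txt` offers little control over agents is that the check runs on the client side and depends on the user agent the client chooses to declare. The sketch below uses Python's standard `urllib.robotparser`; the rules, the `ExampleAgent` and `SomeBrowser` names, and the `example.com` URLs are invented for illustration and do not come from the Index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleAgent" is an invented user agent.
ROBOTS_TXT = """\
User-agent: ExampleAgent
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved client asks before fetching.
allowed_public = parser.can_fetch("ExampleAgent", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleAgent", "https://example.com/private/data")

# The same path under a different declared user agent falls through to the
# wildcard rule, so the restriction no longer applies.
other_private = parser.can_fetch("SomeBrowser", "https://example.com/private/data")

print(allowed_public, allowed_private, other_private)
```

Nothing on the server enforces the `Disallow` rule: an agent that skips the check, or that presents itself as an ordinary browser, is not restricted by it.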
## Technical Details
The research clarifies that most modern agents are not monolithic. They consist of a **Foundation Model** provided by a major lab, wrapped in an **Orchestration Layer** (or "harness") that allows it to execute code, browse the web, or interact with APIs. This modularity creates a "safety debt" where the model provider claims the agent-builder is responsible for safety, while the agent-builder relies on the base model’s inherent safety filters, leaving a gap in the middle where the actual *actions* occur.
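
The layering described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any agent in the Index is built: `Harness`, `stub_model`, `fake_shell`, and the allow-list policy are all hypothetical names, and the "model" is a stub that always requests a shell command.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# A tool is any callable the harness exposes to the model (hypothetical interface).
Tool = Callable[[str], str]
# The "model" is a stand-in: it maps a task to a (tool_name, argument) request.
Model = Callable[[str], Tuple[str, str]]


@dataclass
class Harness:
    """Orchestration layer: routes model-requested actions to real tools."""

    model: Model
    tools: Dict[str, Tool]
    allowed: Set[str] = field(default_factory=set)
    actions: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, task: str) -> str:
        tool_name, argument = self.model(task)
        # The action boundary: nothing checks the request here unless the
        # harness builder adds a policy such as this allow-list.
        if tool_name not in self.allowed:
            return f"blocked: {tool_name}"
        self.actions.append((tool_name, argument))
        return self.tools[tool_name](argument)


def stub_model(task: str) -> Tuple[str, str]:
    """Stand-in for a foundation model; always asks to run a shell command."""
    return ("shell", f"echo {task}")


def fake_shell(command: str) -> str:
    """Stand-in tool; it only reports what it would have run."""
    return f"would run: {command}"


harness = Harness(model=stub_model, tools={"shell": fake_shell}, allowed={"browse"})
blocked = harness.run("deploy")

harness.allowed.add("shell")
executed = harness.run("deploy")
```

The point of the sketch is where the check lives: the model only proposes an action and the tool only executes it, so any safety policy for the action itself has to be written into the harness.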
## Practical Implications
### For Security Practitioners
- **Loss of Control:** Traditional "exclusion" methods (like `robots.txt`) are proving insufficient against autonomous agents, because compliance is voluntary. Practitioners should move toward stronger authentication and behavior-based bot detection.
- **Shadow Agency:** The rise of "wrappers" means enterprise data may be flowing through multiple third-party orchestration layers before reaching the foundation model.
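
As a rough illustration of what "behavior-based" detection can mean, the sketch below flags a session by its request rate instead of its declared user agent. It is a deliberately crude heuristic under assumed values: the class name `RateBasedDetector`, the threshold of 5 requests per second, and the session identifiers are all hypothetical, and a real deployment would combine many more signals.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateBasedDetector:
    """Flags a client whose request count exceeds a limit within a sliding window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client now exceeds the limit."""
        times = self._seen[client_id]
        times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


detector = RateBasedDetector(max_requests=5, window_seconds=1.0)

# Ten requests 50 ms apart from one authenticated session.
burst = [detector.observe("session-a", i * 0.05) for i in range(10)]
# Three requests two seconds apart from another session.
slow = [detector.observe("session-b", i * 2.0) for i in range(3)]
```

Here the first session is flagged from its sixth request onward, while the slower session is never flagged, regardless of what either one claims in its headers.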
### For Defenders
- **Agentic Evaluation:** Ensure that any AI agent integrated into a corporate network has undergone agent-specific safety evaluations, not just standard LLM "jailbreak" testing.
- **Audit Trails:** Given the opacity of these systems, defenders should implement independent logging of all actions taken by agentic service accounts.
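
A minimal sketch of independent action logging follows, assuming the agent's tools are plain Python callables that can be wrapped. The names `audited`, `AUDIT_LOG`, `delete_record`, and `svc-agent-01` are hypothetical, and the in-memory list stands in for an append-only store that the agent itself cannot modify.

```python
import json
import time
from typing import Any, Callable, Dict, List

# In-memory stand-in for an append-only log store outside the agent's control.
AUDIT_LOG: List[str] = []


def audited(account: str, action: str, fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap fn so every call by `account` is recorded, whether it succeeds or fails."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record: Dict[str, Any] = {
            "ts": time.time(),
            "account": account,
            "action": action,
            "args": repr(args),
            "kwargs": repr(kwargs),
        }
        try:
            result = fn(*args, **kwargs)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            # Written on both the success path and the failure path.
            AUDIT_LOG.append(json.dumps(record))

    return wrapper


def delete_record(record_id: int) -> str:
    """Hypothetical action an agent's service account might perform."""
    if record_id < 0:
        raise ValueError("invalid id")
    return f"deleted {record_id}"


safe_delete = audited("svc-agent-01", "delete_record", delete_record)
safe_delete(42)
try:
    safe_delete(-1)
except ValueError:
    pass

entries = [json.loads(line) for line in AUDIT_LOG]
```

Because the record is written by the wrapper and not by the agent, the log does not depend on the agent or its vendor reporting its own actions accurately.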
### For Researchers
- **Standardization:** There is an urgent need for a unified "Agent Safety Standard" that transcends specific company policies (like OpenAI’s Preparedness Framework).
## Limitations
- **Sample Size:** The study focused on 30 agents. While its analysis is deeper than that of the 2024 Index, it covers only a fraction of the growing agent marketplace.
- **Proprietary Barriers:** Because 23 agents are closed-source, researchers had to rely on public disclosures and external observations rather than internal code audits.
## Comparison to Prior Work
Unlike the 2024 Index, which analyzed 67 systems broadly, the 2025 Index is narrower but deeper, focusing specifically on the **action-taking** capabilities of agents. It moves beyond "text generation" safety to "interaction" safety.
## Real-world Applications
- **Enterprise Automation:** Streamlining multi-step office tasks (e.g., ServiceNow and Microsoft Copilot).
- **Cyber Engineering:** Agents like Claude Code and Google Gemini CLI are being used for automated software development and debugging.
## Future Work
- **Inter-Agent Protocols:** Investigating how agents interact with each other (e.g., Moltbook) and the risks of "automated collusion."
- **Policy Development:** Creating frameworks for "agentic accountability" to determine liability when an autonomous system causes financial or data loss.
## References
- MIT CSAIL AI Agent Index: [https://aiagentindex.mit.edu/](https://aiagentindex.mit.edu/)
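- Anthropic, "Measuring Agent Autonomy": [https://www.anthropic.com/research/measuring-agent-autonomy](https://www.anthropic.com/research/measuring-agent-autonomy)
- McKinsey Report on AI Economic Impact: [https://www.mckinsey.com/mgi/our-research/agents-robots-and-us](https://www.mckinsey.com/mgi/our-research/agents-robots-and-us)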