AI Agent Workers: Devin, Factory, and the Autonomous Coding Frontier (March 2026)

Executive Summary

  • AI coding agents have moved from demo-ware to production enterprise tools in 12 months; thousands of companies now deploy agents that write, test, and ship code with varying degrees of autonomy
  • The market spans fully autonomous cloud agents (Devin, Factory Droids, Cursor Cloud Agents) to IDE-integrated agentic modes (GitHub Copilot Agent, Cursor Agent) to open-source platforms (OpenHands) to terminal-native tools (Claude Code)
  • Pricing models are converging on usage-based (tokens, ACUs, credits, tasks) rather than per-seat, reflecting the shift from “tool” to “worker” framing
  • Real-world performance is improving rapidly – Devin’s PR merge rate nearly doubled from 34% to 67% in one year – but success rates on complex tasks remain modest (3/20 in independent testing, ~23% on SWE-bench Pro)
  • Corporate governance is unprepared: 88% of organizations deploy AI but only 25% have board-level AI policies; liability for agent-generated code falls squarely on the deploying organization, not the vendor
  • Security risks are acute: 45% of AI-generated code introduces vulnerabilities; agents have supply chain blind spots; secret leakage through prompts and CI logs is a systemic concern
  • The Anthropic 2026 Agentic Coding Trends Report identifies a fundamental shift: engineering roles are moving from implementation to agent supervision, system design, and output review
  • Karpathy’s autoresearch pattern (March 2026) demonstrates the “overnight autonomous loop” model: 700 experiments over two days, ~20 additive improvements, with human control limited to refining a high-level prompt

1. Platform-by-Platform Analysis

1.1 Devin (Cognition AI)

What it is: The first widely marketed “AI software engineer” – a cloud-hosted autonomous agent with its own browser, editor, terminal, and sandboxed environment.

Pricing (as of March 2026):

Plan | Cost | ACUs Included | Per-ACU Cost | Key Features
Core | $20/month | Pay-as-you-go | $2.25/ACU | Basic access, single user
Teams | $500/month | 250 ACUs | $2.00/ACU | Team management, API access
Enterprise | Custom | Custom | Custom | VPC deployment, SSO, admin dashboards

ACU = Agent Compute Unit: a normalized measure of VM time, model inference, and networking bandwidth consumed per task.
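The published rates imply a breakeven point between the two self-serve plans. A back-of-envelope sketch (illustrative only – actual ACU consumption per task varies with VM time, inference, and bandwidth):

```python
# Cost comparison for Devin's Core vs Teams plans, using the published
# rates: Core is $20/mo + $2.25/ACU pay-as-you-go; Teams is $500/mo
# including 250 ACUs, with overage assumed at the $2.00/ACU plan rate.

def core_cost(acus: float) -> float:
    """$20 base plus pay-as-you-go ACUs at $2.25."""
    return 20 + 2.25 * acus

def teams_cost(acus: float) -> float:
    """$500 base covers 250 ACUs; overage billed at $2.00/ACU."""
    return 500 + 2.00 * max(0, acus - 250)

# Find the monthly ACU volume at which Teams becomes cheaper than Core.
breakeven = next(a for a in range(1000) if teams_cost(a) < core_cost(a))
print(breakeven)  # → 214
```

In other words, a team burning more than roughly 215 ACUs a month is better off on the Teams bundle even before counting its collaboration features.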

What it can actually do autonomously:

  • Security vulnerability remediation: 20x faster than human developers (1.5 min vs 30 min per vulnerability)
  • Code modernization / migration: 10-14x faster than manual (e.g., Java version migration, ETL framework file migration for a large bank)
  • Test coverage expansion: organizations report coverage rising from 50-60% to 80-90%
  • Routine PR creation: 67% merge rate (up from 34% one year ago)
  • Integrates with Slack, Teams, and Jira for task assignment

What still requires human oversight:

  • Complex architectural decisions
  • Destructive or irreversible database operations
  • Tasks without clear, upfront requirements and verifiable outcomes
  • Anything requiring stakeholder management or cross-team coordination
  • Devin does not reliably surface uncertainty or flag dangerous actions

Real-world performance:

  • Goldman Sachs, Santander, and Nubank are deploying Devin at scale
  • Independent testing found only 3/20 assigned tasks completed successfully
  • Best suited for tasks that would take a junior engineer 4-8 hours with clear specifications
  • Cognition describes Devin as “senior-level at codebase understanding but junior at execution”
  • Recurring complaints: task failures without clear explanation, compute limits at the $20 tier, slower-than-expected output

Enterprise readiness: VPC deployment available; SSO; admin dashboards; machine snapshots for login workflows. Enterprise plan required for serious compliance needs.

Benchmarks: Devin was originally marketed on SWE-bench performance but specific recent numbers on SWE-bench Verified or Pro are not prominently reported in 2026.


1.2 Factory AI

What it is: An “agent-native” software development platform built around autonomous “Droids” that integrate into existing IDEs and workflows.

Pricing (as of March 2026):

Plan | Cost | Tokens/Month | Seats | Key Features
BYOK | Free | Your own API keys | Unlimited | Bring your own model keys
Pro | $20/month | 20M standard tokens | Up to 50 ($5/additional) | Dedicated compute, frontier models
Max | $200/month | 200M standard tokens | Up to 100 | Everything in Pro, scaled
Enterprise | Custom | Custom | Unlimited | SOC 2, GDPR, ISO 42001, CCPA

Token-based billing (not per-seat). Cached tokens are discounted. Hybrid pricing models available for enterprise (seats + tokens + usage overages).

What it can actually do autonomously:

  • Pull context from tickets, implement solutions, create PRs with full traceability from ticket to code
  • Multi-IDE support: VS Code, JetBrains, Vim, terminal CLI
  • Supports all major frontier models (GPT-5, Claude Sonnet 4, o3, Gemini 2.5 Pro, Claude Opus 4.1)
  • Claims ~$18,000/year savings per engineer through optimized development processes

What still requires human oversight:

  • PR review and merge decisions
  • Architectural planning
  • Requirements clarification on ambiguous tickets

Enterprise readiness: SOC 2, GDPR, ISO 42001, and CCPA compliant. Strong enterprise security posture. Token billing can be unpredictable with large context windows or long-running tasks.


1.3 Cosine / Genie

What it is: An autonomous AI software engineer powered by a proprietary model (Genie 2), trained on data that “codifies human reasoning” by observing how engineers actually work.

Pricing (as of March 2026):

Plan | Cost | Tasks/Month | Key Features
Free | $0 | 80 (for new users) | Trial access
Hobby | $20/month | 80 | Individual developers
Professional | $99/user/month | 240 | Team features
Enterprise | Custom | Custom | Advanced features, security

Task-based pricing model: once you start a task, you can iterate with Genie as many times as needed within that single task cost.

What it can actually do autonomously:

  • End-to-end feature development: understands codebase, plans, writes changes, opens PRs
  • Multi-agent decomposition: breaks backlog items into subtasks, assigns to specialized agents
  • Bug fixes, code refactoring, validation across 50+ languages
  • CLI mode: runs in your actual environment, accesses local files, runs builds, executes tests
  • Developers can assign multiple tickets simultaneously, then return to review/merge PRs

Benchmarks: Cosine AutoPM achieves 72% on SWE-Lancer, outperforming OpenAI and Anthropic on that benchmark. Note: SWE-Lancer is a different benchmark than SWE-bench.

Enterprise readiness: VS Code extension and cloud platform. Enterprise plan available but details are custom-negotiated.


1.4 Sweep AI

What it is: An AI coding assistant that automates task execution by reading codebases, planning changes, writing code, and submitting pull requests.

Pricing: Self-hosted and hosted deployment options. Uses its own LLMs for privacy (no code retained by third parties). Specific pricing tiers not prominently published.

What it can actually do autonomously:

  • Reads instructions from GitHub issues or Jira tickets
  • Searches entire codebase for context
  • Writes code changes across multiple files
  • Submits PRs for human review
  • Supports Python, JavaScript, Rust, Go, Java, C#, C++
  • JetBrains IDE plugin with inline completions, test generation, static analysis feedback

What still requires human oversight:

  • PR review and merge
  • Task specification and prioritization
  • Architecture decisions

Enterprise readiness: Self-hosted option available, which is critical for enterprises that cannot send code to third-party services. Uses proprietary LLMs, reducing third-party data exposure.


1.5 OpenHands (formerly OpenDevin)

What it is: The most popular open-source AI coding agent platform. Model-agnostic, sandboxed, and designed for both local development and cloud-scale deployment.

Pricing (as of March 2026):

Plan | Cost | Key Features
Cloud Individual | Free | Basic cloud access
Cloud Growth | $500/month | Unlimited users
Self-hosted Enterprise | Custom | Private VPC, SAML/SSO, unlimited conversations, priority support

What it can actually do autonomously:

  • Modify code, execute commands, browse the web, interact with APIs
  • Automate the full software development lifecycle
  • Works with Claude, GPT, or any other LLM (model-agnostic)
  • SDK: composable Python library for defining and running agents locally or at cloud scale
  • Solves over 50% of real GitHub issues on SWE-bench

What still requires human oversight:

  • Complex multi-step workflows may produce inconsistent results
  • Requires configuration and prompt engineering for optimal performance
  • Open-source means you own operations, upgrades, and security hardening

Enterprise readiness: Raised $18.8M Series A (November 2025) specifically to bring cloud coding agents to enterprises. Self-hosted via Kubernetes/Helm chart. SAML/SSO. Extended support and research team access for enterprise contracts.

Key differentiator: Open-source and model-agnostic – organizations can avoid vendor lock-in and run fully on-premise for maximum security.


1.6 GitHub Copilot Workspace / Coding Agent

What it is: Microsoft/GitHub’s agentic layer on top of Copilot, spanning IDE agent mode, CLI, and cloud-based coding agents.

Pricing (as of March 2026):

Plan | Cost | Premium Requests | Key Features
Free | $0 | 50/month | 2,000 completions, basic access
Pro | $10/month | 300/month | Standard developer tier
Pro+ | $39/month | 1,500/month | All AI models (Claude Opus 4, o3)
Business | $19/user/month | Included | Copilot coding agent, org management
Enterprise | $39/user/month | Included | All Business features + enterprise extras

Additional premium requests cost $0.04 each. Chat, agent mode, code review, coding agent, and CLI all consume premium requests.
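Because every agentic surface draws from the same allowance, overage costs are easy to estimate. A hedged sketch using the published Pro/Pro+ figures (the usage pattern in the example is hypothetical):

```python
# Copilot premium-request overage arithmetic: each paid plan includes a
# monthly allowance, and additional requests are billed at $0.04 each.
# Allowances below are the published Free/Pro/Pro+ figures.

ALLOWANCE = {"Free": 50, "Pro": 300, "Pro+": 1500}
OVERAGE_RATE = 0.04  # dollars per extra premium request

def monthly_overage(plan: str, requests_used: int) -> float:
    """Dollars billed beyond the plan's included premium requests."""
    extra = max(0, requests_used - ALLOWANCE[plan])
    return round(extra * OVERAGE_RATE, 2)

# A hypothetical Pro user making ~25 agent-mode requests per workday:
print(monthly_overage("Pro", 550))  # → 10.0 (250 extra × $0.04)
```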

What it can actually do autonomously (Agent Mode):

  • Determines which files to modify, offers code changes and terminal commands
  • Iterates to remediate issues until the task is complete
  • Reads files, runs code, checks output, identifies lint errors/test failures, loops to fix them
  • Tool use: read_file, edit_file, run_in_terminal, search workspace
  • Self-healing: recognizes and fixes compilation and runtime errors automatically
  • MCP (Model Context Protocol) integration for external tool access
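The iterate-to-green behavior above can be sketched as a simple control loop. This is a minimal illustration, not Copilot's actual implementation; `run_checks` and `propose_fix` are stand-ins for the tool-use and model-call machinery:

```python
# Minimal sketch of the "self-healing" loop agentic modes run: execute
# checks, feed failures back to the model, apply its edit, and repeat
# until green or a turn limit is hit.

def agent_loop(source: str, run_checks, propose_fix, max_turns: int = 5):
    """Iterate model-proposed edits until checks pass or turns run out."""
    for turn in range(max_turns):
        errors = run_checks(source)
        if not errors:
            return source, turn               # done: checks are green
        source = propose_fix(source, errors)  # model rewrites the file
    return source, max_turns                  # gave up; human takes over

# Toy stand-ins: "checks" demand the word 'fixed'; the "model" appends it.
checks = lambda src: [] if "fixed" in src else ["missing fix"]
fixer = lambda src, errs: src + " fixed"
final, turns = agent_loop("broken", checks, fixer)
print(final, turns)  # → broken fixed 1
```

The turn limit is the important design choice: it bounds autonomous spend and guarantees a human touchpoint when the loop cannot converge.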

What it can actually do autonomously (Coding Agent – cloud-based):

  • Assigned via GitHub Issues
  • Works on its own branch in a cloud environment
  • Creates PRs with proposed changes
  • Runs CI/CD and iterates on failures

Recent updates (March 2026):

  • Custom agents, sub-agents, and plan agent now GA
  • Agent hooks in preview with auto-approve support for MCP
  • Copilot CLI (terminal-based agent) reached GA for all paid subscribers
  • Major agentic improvements for JetBrains IDEs

What still requires human oversight:

  • PR review and merge approval
  • Complex architectural decisions
  • Security-sensitive changes
  • Cross-repository coordination

Enterprise readiness: Best-in-class for organizations already on GitHub Enterprise Cloud. Centralized policy management, audit logs, org-level controls. Massive ecosystem integration advantage.


1.7 Cursor Agent Mode / Cloud Agents

What it is: An AI-native IDE (fork of VS Code) with deep agentic capabilities, ranging from in-editor agent mode to fully autonomous cloud agents running on isolated VMs.

Pricing (as of March 2026):

Plan | Cost | Key Features
Hobby | Free | Limited agent requests and completions
Pro | $20/month | Unlimited completions, monthly credit pool
Pro+ | $60/month | Background agents, ~3x agent capacity
Ultra | $200/month | Maximum usage and premium model access
Teams | $40/user/month | Shared context, centralized billing, usage visibility
Enterprise | Custom | Negotiated per seat count

Credit-based system (since June 2025): monthly credit pool equal to plan price in dollars, consumed based on model selection. “Auto mode” is unlimited.

What it can actually do autonomously:

Agent Mode (in-editor):

  • Independently executes terminal commands, installs dependencies, runs tests
  • Analyzes compilation errors, proposes and applies fixes
  • Context-aware across the full project

Cloud Agents (launched February 2026):

  • Fully autonomous agents running on isolated VMs
  • Build software, test it, record video demos of their work
  • Produce merge-ready PRs
  • Run for 30-60 minutes independently on tasks
  • 30% of Cursor’s own PRs are now made by agents

Automations (March 2026):

  • Auto-launch agents triggered by codebase changes, Slack messages, or timers
  • Automated review and maintenance of agent-generated code

What still requires human oversight:

  • Final PR review and merge decisions
  • Architectural planning
  • Tasks exceeding ~60 minutes of complexity
  • Security review of generated code

Enterprise readiness: Teams plan provides centralized billing and usage visibility. Enterprise plan negotiated. The rapid pace of feature releases (cloud agents, automations) may raise concerns about stability for conservative enterprises.

Key insight from Cursor: “Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects.”


1.8 Claude Code (Anthropic)

What it is: A terminal-native agentic coding tool that reads codebases, edits files, runs commands, and integrates with development workflows. Also available in IDE, desktop app, and browser.

Pricing: Consumed via Anthropic API credits (Claude Pro $20/month for individuals; Claude Max plans for heavy usage; API pricing for enterprise/CI/CD integration). No separate Claude Code subscription.

What it can actually do autonomously:

  • Chains an average of 21.2 independent tool calls without human intervention
  • Session durations nearly doubled in three months (from ~25 to ~45 minutes)
  • Auto-Accept mode (shift+tab): autonomous loops where Claude writes code, runs tests, iterates until tests pass
  • Agent teams mode (research preview): multiple agents working in parallel, coordinating autonomously
  • Headless mode (-p flag): non-interactive execution for CI/CD pipelines, scripts, cron jobs
    • Supports --output-format (text, JSON, streaming)
    • --max-turns to control autonomous step count
    • Session IDs for context persistence across invocations (~200K token context)
  • Batch processing via /batch command
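For CI/CD use, a pipeline step typically just assembles a headless invocation from the flags listed above. A sketch using only the documented flags (-p, --output-format, --max-turns); the wrapper function itself is illustrative, not part of the tool:

```python
# Build the argv for a non-interactive Claude Code run, suitable for a
# CI step such as automated code review on a pull request.

def headless_cmd(prompt: str, output_format: str = "json", max_turns: int = 10):
    """Assemble a headless Claude Code command line."""
    return [
        "claude",
        "-p", prompt,                      # headless (print) mode
        "--output-format", output_format,  # text, json, or streaming
        "--max-turns", str(max_turns),     # cap autonomous steps
    ]

cmd = headless_cmd("Review this diff for security issues")
# In CI you would execute this via subprocess.run(cmd, check=True)
print(cmd[0])  # → claude
```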

Common CI/CD use cases:

  • Automated code review
  • Test generation
  • Changelog generation
  • Code migration
  • Accessibility auditing
  • Technical debt detection
  • Documentation translation

What still requires human oversight:

  • Architecture and design decisions
  • Security-sensitive changes
  • Destructive operations
  • Tasks requiring cross-repo or cross-system reasoning beyond the current codebase
  • Long-running sessions may accumulate context errors

Enterprise readiness: API-based usage with enterprise Anthropic contracts. No inherent VPC deployment for Claude Code itself (it runs locally and calls the API). Organizations control where the tool runs. SOC 2 compliant API. The tool is open-source on GitHub, allowing security auditing.

Benchmarks: Claude Code runs on Opus 4.6 (1M-token context window, 128K output tokens). Rather than headline leaderboard scores, Anthropic reports measuring agent autonomy systematically, tracking tool-call chain length and session duration as its metrics.

Key insight from Anthropic’s 2026 Agentic Coding Trends Report:

  • Developers integrate AI into 60% of their work
  • 80-100% of delegated tasks still receive active human oversight
  • Engineering roles are shifting from implementation to agent supervision, system design, and output review
  • Multi-agent systems are replacing single-agent workflows

2. Comparative Analysis

2.1 Pricing Comparison

Platform | Entry Price | Enterprise Price | Billing Model | Free Tier
Devin | $20/month + $2.25/ACU | Custom | ACU (compute units) | No
Factory AI | Free (BYOK) | Custom | Token-based | Yes (BYOK)
Cosine/Genie | $20/month | Custom | Task-based | Yes (80 tasks)
Sweep AI | Varies | Custom | Self-hosted option | Limited
OpenHands | Free (cloud) | Custom | Conversations/usage | Yes
GitHub Copilot | $10/month | $39/user/month | Premium requests | Yes (50 req/mo)
Cursor | $20/month | Custom | Credit-based | Yes (limited)
Claude Code | API pricing | API enterprise | Token-based (API) | Limited

Trend: The industry is moving from per-seat to usage-based pricing, reflecting the “AI worker” framing where you pay for work done, not seats occupied.

2.2 Autonomy Spectrum

Platform | Autonomy Level | Typical Task Duration | Human Touchpoint
Devin | High (cloud sandbox) | 1-8 hours | PR review
Factory AI | High (ticket-to-PR) | Variable | PR review
Cosine/Genie | High (multi-agent) | Variable | PR review
Sweep AI | Medium (issue-to-PR) | Minutes to hours | PR review
OpenHands | High (configurable) | Variable | PR review
GitHub Copilot Agent | Medium-High | Minutes to hours | PR review, agent mode approval
Cursor Cloud Agents | High (isolated VM) | 30-60 minutes | PR review
Claude Code | Medium-High | Up to 45+ minutes | Auto-accept or interactive

2.3 SWE-bench Benchmark Reality Check

SWE-bench has fragmented into multiple variants due to data contamination concerns:

Benchmark | What It Measures | Top Scores (March 2026)
SWE-bench Verified | 500 human-validated samples | ~81% (best models) – contaminated
SWE-bench Pro | Harder, less contaminated | ~46% (best) / 23% (Claude Opus 4.1, GPT-5)
SWE-bench Pro (private) | Previously unseen codebases | ~18% (Claude Opus 4.1) / ~15% (GPT-5)
SWE-bench Live | Monthly-updated, contamination-free | Updated monthly

Critical finding: The gap between Verified (~81%) and Pro private (~15-18%) reveals how much benchmark scores overstate real-world capability. On truly novel codebases, the best models resolve fewer than 1 in 5 issues autonomously.


3. Corporate Governance Challenges

3.1 Who Is Responsible for Agent-Generated Code?

The deploying organization bears full liability. Under typical AI vendor contracts, indemnification provisions are limited or excluded entirely for AI-generated output. Regulators have made clear that businesses deploying AI remain fully accountable for legal compliance, regardless of whether the AI functionality comes from a third-party vendor.

Key governance gaps:

  • 88% of organizations deploy AI, but only 25% have board-level AI policies governing that deployment
  • The 63-percentage-point gap represents boards operating without documented AI oversight while their companies deploy AI systems affecting customers, employees, and regulatory compliance
  • Under Caremark liability doctrine, the absence of board-level AI reporting and oversight is exactly the kind of governance gap that creates director liability exposure

3.2 Agentic AI Does Not Fit Existing Governance Frameworks

Autonomous AI agents that take actions, use tools, and make sequential decisions do not fit governance frameworks designed for prediction models or chatbots. Key challenges:

  • Accountability chains: When an agent calls another agent, which makes an API call, which triggers a purchase, who is responsible for the bad outcome?
  • Hallucination liability: AI-generated code, policies, SOPs, and training materials can trigger legal obligations the company does not realize it has assumed
  • IP uncertainty: The U.S. Supreme Court declined (March 2026) to extend copyright protection to purely AI-generated works, leaving ownership of agent-generated code legally ambiguous
  • Regulatory exposure: The EU AI Act high-risk obligations take effect August 2, 2026, with specific requirements for AI systems that generate code used in critical infrastructure

3.3 Emerging Best Practices

  1. Establish board-level AI governance with documented policies, reporting cadence, and clear accountability
  2. Define agent authorization levels – what agents can do autonomously vs. what requires human approval
  3. Maintain audit trails – every agent action logged with full traceability
  4. Implement “human-in-the-loop” gates at merge, deploy, and production-access decision points
  5. Review vendor contracts for indemnification scope on AI-generated output
  6. Treat agent-generated code as untrusted until human-reviewed, regardless of test pass rates

4. Security Implications

4.1 The Attack Surface

Giving AI agents access to codebases, CI/CD pipelines, and production environments creates a qualitatively different risk profile than traditional developer tooling:

Code quality risks:

  • 45% of AI-generated code introduces security vulnerabilities
  • LLMs choose insecure methods nearly half the time
  • Common vulnerabilities: unsafe dependencies, hallucinated functions, secret leaks

Supply chain risks:

  • Agents dynamically resolve and install packages, optimizing for task completion, not supply chain risk
  • CI/CD pipelines and GitHub Actions are prime targets for supply chain attacks (Datadog State of DevSecOps 2026)
  • The OpenClaw incident: ~900 malicious skills (20% of total packages), 283 leaking credentials, 76 containing malicious payloads

Data leakage risks:

  • Secrets appear in prompts during debugging, in code comments models read, and in CI logs that agents summarize
  • Proprietary code copied into AI tools may be stored, reused for training, and later exposed

Visibility gaps:

  • Security teams cannot see what processes an AI agent spawns, what endpoints it contacts, or what packages it installs at runtime
  • Traditional SAST/DAST/SCA tools were not designed for agent-generated code review

4.2 Security Architecture Recommendations

  1. Sandboxed execution environments – agents should never have direct production access
  2. Principle of least privilege – agents get only the permissions needed for their specific task
  3. Automated security scanning in the PR pipeline (not just at deploy)
  4. Secret management – never expose secrets in agent-accessible environments; use credential vaults
  5. Supply chain verification – lock dependency versions, verify package authenticity before agent installation
  6. Audit logging of all agent actions, including command execution, file access, and network calls
  7. Network isolation – agent environments should have restricted outbound access
  8. Regular vulnerability assessment of the agent tools themselves (30+ IDE vulnerabilities disclosed in 2025)
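Recommendation 6 – audit logging of every agent action – can be implemented as a thin wrapper in front of each tool an agent is allowed to call. A sketch with illustrative action names and log shape; a production system would ship these records to an append-only store:

```python
# Route every agent action through an audit wrapper before it executes,
# recording the action name, arguments, and a timestamp.
import json
import time

AUDIT_LOG = []

def audited(action: str):
    """Decorator that records each agent tool call before running it."""
    def wrap(fn):
        def inner(*args, **kwargs):
            AUDIT_LOG.append({
                "ts": time.time(),
                "action": action,
                "args": json.dumps([args, kwargs], default=str),
            })
            return fn(*args, **kwargs)
        return inner
    return wrap

@audited("file.read")
def read_file(path: str) -> str:
    # stand-in for a real tool implementation
    return f"<contents of {path}>"

read_file("src/app.py")
print(AUDIT_LOG[0]["action"])  # → file.read
```

The same wrapper pattern extends naturally to command execution and network calls, giving security teams the runtime visibility the section above says they currently lack.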

5. Human Oversight Models

5.1 The “Delegate, Review, Own” Pattern

Leading teams are converging on a simple operating model:

  1. Delegate: Engineer defines the task (ticket, issue, natural language prompt)
  2. Review: Agent produces output (code, PR, test results); engineer reviews
  3. Own: Engineer takes responsibility for merged code regardless of origin

The multi-agent pipeline emerging in practice:

Task Description (Human)
  --> Feature Author Agent (writes code)
  --> Test Generator Agent (writes tests)
  --> Code Reviewer Agent (reviews changes)
  --> Architecture Guardian Agent (checks compliance)
  --> Security Scanner Agent (vulnerability check)
  --> Human Review (final approval)
  --> CI/CD Pipeline (automated deployment)

The human remains the decision-maker at key checkpoints, but execution between checkpoints is fully autonomous.

5.2 Parallelism as the Key Multiplier

The emerging consensus: parallelism is the key productivity multiplier. Multiple agents on separate git worktrees, with human oversight as orchestrator rather than implementer.
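The orchestration pattern is straightforward to sketch: dispatch N agent runs concurrently and collect results for review. The agent here is a stub, and real setups isolate each run in its own git worktree or VM rather than a thread:

```python
# Parallel dispatch sketch: the human's role collapses to assigning
# tickets and reviewing the resulting PRs.
from concurrent.futures import ThreadPoolExecutor

def run_agent(ticket: str) -> str:
    # stand-in for a full agent run ending in a pull request
    return f"PR for {ticket}"

tickets = ["AUTH-101", "PERF-202", "BUG-303"]  # hypothetical ticket IDs
with ThreadPoolExecutor(max_workers=3) as pool:
    prs = list(pool.map(run_agent, tickets))   # preserves ticket order

print(prs)  # → ['PR for AUTH-101', 'PR for PERF-202', 'PR for BUG-303']
```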

From Anthropic’s 2026 report: developers integrate AI into 60% of their work, but maintain active oversight on 80-100% of delegated tasks. The shift is from writing code to directing and reviewing agent-generated code.

5.3 Three Requirements for Deployment Success

  1. Clean API access to all systems the agent must interact with (CRM, ITSM, ERP, version control)
  2. A governance model defining what agents execute autonomously, what triggers human review, and what gets logged
  3. An auditable monitoring layer covering every agent action with logging, anomaly detection, and rollback capability

6. The “AI Teammate” Concept in Practice

6.1 From Tool to Teammate

The framing is shifting from “AI coding assistant” to “AI teammate” – a persistent entity that participates in team workflows:

  • Microsoft Copilot Cowork (private preview, March 2026): digital colleague that plans and performs extended tasks across Microsoft 365
  • Sentra: AI teammate that creates knowledge graphs through conversations, integrated into meetings, Slack, Jira, and calendars
  • Devin: Integrates with Slack and Teams for task assignment, communicates progress, asks clarifying questions

6.2 Organizational Impact

  • Organizations winning in 2026 are transitioning early-career talent from “Code Generators” to “System Verifiers”
  • Teams that write well produce better agent output, because agents rely on written context
  • The 4% of companies maximizing AI benefits are building workflows that turn individual knowledge into company-wide memory

6.3 The Engineer’s Evolving Role

Engineers’ value lies in:

  • Designing overarching system architecture
  • Defining precise objectives and guardrails for AI agents
  • Rigorously validating final output for robustness, security, and business alignment
  • Moving from “hands-on keyboard creation” to “high-level system design, quality assurance, and strategic oversight”

7. Karpathy’s Autoresearch Pattern

7.1 What It Is

Released March 6, 2026, autoresearch is an open-source framework by Andrej Karpathy that lets AI agents autonomously run machine learning experiments overnight on a single GPU. It reached 30,307 GitHub stars in one week – one of the fastest-growing repos in GitHub history.

7.2 The Pattern

  1. Human writes a high-level prompt in a Markdown file (program.md)
  2. AI agent autonomously edits the training script (train.py)
  3. Each experiment runs for exactly 5 minutes of training time
  4. Agent evaluates results, keeps improvements, discards regressions
  5. Repeats: ~12 experiments/hour, 100+ overnight

After two days of autonomous operation: ~700 changes processed, ~20 additive improvements found that transferred to larger models.
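The keep-improvements/discard-regressions loop is a form of hill climbing, and can be shown in miniature. The "experiment" below is a toy scalar objective standing in for a 5-minute training run; everything else about the control flow mirrors the pattern:

```python
# The autoresearch control loop in miniature: mutate the configuration,
# evaluate, keep only improvements, discard regressions.
import random

def evaluate(config: float) -> float:
    # toy objective, best at config = 3.0 (stand-in for a training run)
    return -(config - 3.0) ** 2

random.seed(0)
config, best, kept = 0.0, evaluate(0.0), 0
for _ in range(700):                        # ~700 experiments over two days
    candidate = config + random.uniform(-0.5, 0.5)
    score = evaluate(candidate)
    if score > best:                        # keep improvements...
        config, best, kept = candidate, score, kept + 1
    # ...discard regressions by simply not updating

print(round(config, 2), kept)  # converges near 3.0 after a few dozen keeps
```

The loop never needs to understand why a change helped; scale and a reliable evaluation function do the work, which is exactly why humans only need to control the objective.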

7.3 Why It Matters for Enterprise

The autoresearch pattern demonstrates a replicable model for autonomous AI work:

  • Human control is at the prompt level, not the execution level
  • The loop is self-correcting: keep improvements, discard regressions
  • Scale is the advantage: no human can run 700 experiments in two days
  • Results transfer: improvements on small models generalized to larger ones

This pattern is directly applicable to:

  • Automated performance optimization
  • Configuration tuning
  • Test suite expansion
  • Code refactoring experiments
  • Security hardening iterations

The core insight: humans define the objective and evaluation criteria; agents handle execution at a scale humans cannot match.


8. Real vs. Hyped Capabilities

8.1 What Actually Works Today (March 2026)

Capability | Maturity | Evidence
Automated PR generation from tickets | Production-ready | Devin, Factory, Cosine, Sweep all ship this
Bug fixes with clear reproduction steps | Production-ready | 67% merge rate (Devin), 50%+ SWE-bench (OpenHands)
Test generation and coverage expansion | Production-ready | 50-60% to 80-90% coverage reported
Code migration (language/framework upgrades) | Production-ready | 10-14x speed improvements documented
Security vulnerability remediation | Production-ready | 20x speed improvement (Devin enterprise data)
Autonomous multi-hour task completion | Early production | 30-60 min reliable (Cursor); up to 45 min (Claude Code)
Multi-agent coordination | Research preview | Claude Code agent teams; Cosine multi-agent
Full feature development from spec | Emerging | Success rates vary widely by complexity
Architectural design | Not ready | All platforms require human architects

8.2 What Is Still Hype

  • “Replacing developers”: No platform replaces developers; they augment and redirect developer effort
  • SWE-bench scores as real-world predictors: 81% Verified vs 15-18% on private codebases shows benchmark inflation
  • Autonomous production deployment: No responsible platform ships agent code to production without human review
  • “AI teammate” parity with humans: Agents lack soft skills, contextual judgment, stakeholder management, and the ability to navigate organizational politics
  • Cost savings as advertised: Token/compute costs can be unpredictable; the $18K/year savings claim from Factory requires specific workflow optimization

9. Implications

9.1 Guidance for Foley Hoag Clients

  1. Liability clarity is urgent: Clients deploying AI agents need to understand that they – not AI vendors – bear responsibility for agent-generated code, including IP infringement, security vulnerabilities, and regulatory non-compliance
  2. Board governance gaps create director exposure: 63% of organizations deploying AI lack board-level policies, creating Caremark liability risk
  3. EU AI Act compliance deadline approaching: High-risk obligations take effect August 2, 2026; AI systems generating code for critical infrastructure may be in scope
  4. IP ownership remains unresolved: Agent-generated code may not be copyrightable; clients need to understand the implications for trade secret and patent strategies
  5. Vendor contract review is critical: Standard AI vendor indemnification provisions often exclude or severely limit coverage for AI-generated output
  6. Security governance must evolve: Traditional code review processes are insufficient for the volume and speed of agent-generated code; automated security scanning in the PR pipeline is now mandatory

9.2 Market Trajectory

The AI agents market is projected to grow from $7.84 billion (2025) to $52.62 billion by 2030 at a 46.3% CAGR. Every enterprise software engineering organization will need to develop agent governance policies within the next 12-18 months.


What This Means for Your Organization

AI coding agents have moved from demos to production in 12 months. Goldman Sachs, Santander, and Nubank are deploying Devin at scale. Cursor reports 30% of its own PRs are now made by autonomous agents. The agents market is projected to grow from $7.84 billion in 2025 to $52.62 billion by 2030 at 46.3% CAGR. This is not a future state to monitor. It is a current state to govern. If your organization does not have a policy defining what AI agents can do autonomously versus what requires human approval, you are operating in a governance vacuum that creates liability exposure your board has not assessed.

The benchmark data should temper any impulse to replace developers with agents. SWE-bench Verified scores reach 81%, which sounds impressive until you compare it to SWE-bench Pro on private, previously unseen codebases: 15-18%. That means the best AI agents in the world resolve fewer than one in five issues on code they have not seen before. Independent testing of Devin found only 3 of 20 assigned tasks completed successfully. Devin’s own data shows a 67% PR merge rate – dramatically improved from 34% a year ago, but still meaning one-third of its work product is rejected. Agents are valuable for well-specified tasks on familiar codebases. They are not replacing engineers. They are changing what engineers spend their time on: from writing code to defining intent, reviewing output, and owning architecture.

The liability question is the one your general counsel needs to answer before your first agent-generated PR ships to production. Under standard AI vendor contracts, the deploying organization – not the vendor – bears full liability for agent-generated code. Eighty-eight percent of organizations deploy AI, but only 25% have board-level AI policies. That 63-percentage-point gap represents boards operating without documented AI oversight while their companies deploy AI systems that write production code, install dependencies, and interact with APIs. Forty-five percent of AI-generated code introduces security vulnerabilities. The EU AI Act’s high-risk obligations take effect August 2, 2026. The organizations that define agent authorization levels, maintain audit trails, and implement human-in-the-loop gates at merge and deploy decision points now will have both a compliance advantage and a quality advantage over those that wait for an incident to force the conversation.

Created by Brandon Sneider | brandon@brandonsneider.com March 2026