AI Agent Workers: Devin, Factory, and the Autonomous Coding Frontier (March 2026)

Executive Summary

  • AI coding agents have moved from demo-ware to production enterprise tools in 12 months; thousands of companies now deploy agents that write, test, and ship code with varying degrees of autonomy
  • The market spans fully autonomous cloud agents (Devin, Factory Droids, Cursor Cloud Agents) to IDE-integrated agentic modes (GitHub Copilot Agent, Cursor Agent) to open-source platforms (OpenHands) to terminal-native tools (Claude Code)
  • Pricing models are converging on usage-based (tokens, ACUs, credits, tasks) rather than per-seat, reflecting the shift from “tool” to “worker” framing
  • Real-world performance is improving rapidly – Devin’s PR merge rate nearly doubled from 34% to 67% in one year – but success rates on complex tasks remain modest (3/20 in independent testing, ~23% on SWE-bench Pro)
  • Corporate governance is unprepared: 88% of organizations deploy AI but only 25% have board-level AI policies; liability for agent-generated code falls squarely on the deploying organization, not the vendor
  • Security risks are acute: 45% of AI-generated code introduces vulnerabilities; agents have supply chain blind spots; secret leakage through prompts and CI logs is a systemic concern
  • The Anthropic 2026 Agentic Coding Trends Report identifies a fundamental shift: engineering roles are moving from implementation to agent supervision, system design, and output review
  • Karpathy’s autoresearch pattern (March 2026) demonstrates the “overnight autonomous loop” model: 700 experiments over two days, ~20 additive improvements, with human control limited to refining a high-level prompt

1. Platform-by-Platform Analysis

1.1 Devin (Cognition AI)

What it is: The first widely marketed “AI software engineer” – a cloud-hosted autonomous agent with its own browser, editor, terminal, and sandboxed environment.

Pricing (as of March 2026):

Plan | Cost | ACUs Included | Per-ACU Cost | Key Features
Core | $20/month | Pay-as-you-go | $2.25/ACU | Basic access, single user
Teams | $500/month | 250 ACUs | $2.00/ACU | Team management, API access
Enterprise | Custom | Custom | Custom | VPC deployment, SSO, admin dashboards

ACU = Agent Compute Unit: a normalized measure of VM time, model inference, and networking bandwidth consumed per task.
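The published rates imply a breakeven point between the two self-serve plans. A back-of-envelope sketch (illustrative only – actual ACU consumption per task varies with VM time, inference, and bandwidth):

```python
# Cost comparison for Devin's Core vs Teams plans, using the published
# rates: Core is $20/mo + $2.25/ACU pay-as-you-go; Teams is $500/mo
# including 250 ACUs, with overage assumed at the $2.00/ACU plan rate.

def core_cost(acus: float) -> float:
    """$20 base plus pay-as-you-go ACUs at $2.25."""
    return 20 + 2.25 * acus

def teams_cost(acus: float) -> float:
    """$500 base covers 250 ACUs; overage billed at $2.00/ACU."""
    return 500 + 2.00 * max(0, acus - 250)

# Find the monthly ACU volume at which Teams becomes cheaper than Core.
breakeven = next(a for a in range(1000) if teams_cost(a) < core_cost(a))
print(breakeven)  # → 214
```

In other words, a team burning more than roughly 215 ACUs a month is better off on the Teams bundle even before counting its collaboration features.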

What it can actually do autonomously:

  • Security vulnerability remediation: 20x faster than human developers (1.5 min vs 30 min per vulnerability)
  • Code modernization / migration: 10-14x faster than manual (e.g., Java version migration, ETL framework file migration for a large bank)
  • Test coverage expansion: organizations report coverage rising from 50-60% to 80-90%
  • Routine PR creation: 67% merge rate (up from 34% one year ago)
  • Integrates with Slack, Teams, and Jira for task assignment

What still requires human oversight:

  • Complex architectural decisions
  • Destructive or irreversible database operations
  • Tasks without clear, upfront requirements and verifiable outcomes
  • Anything requiring stakeholder management or cross-team coordination
  • Devin does not reliably surface uncertainty or flag dangerous actions

Real-world performance:

  • Goldman Sachs, Santander, and Nubank are deploying Devin at scale
  • Independent testing found only 3/20 assigned tasks completed successfully
  • Best suited for tasks that would take a junior engineer 4-8 hours with clear specifications
  • Cognition describes Devin as “senior-level at codebase understanding but junior at execution”
  • Recurring complaints: task failures without clear explanation, compute limits at the $20 tier, slower-than-expected output

Enterprise readiness: VPC deployment available; SSO; admin dashboards; machine snapshots for login workflows. Enterprise plan required for serious compliance needs.

Benchmarks: Devin was originally marketed on SWE-bench performance but specific recent numbers on SWE-bench Verified or Pro are not prominently reported in 2026.


1.2 Factory AI

What it is: An “agent-native” software development platform built around autonomous “Droids” that integrate into existing IDEs and workflows.

Pricing (as of March 2026):

Plan | Cost | Tokens/Month | Seats | Key Features
BYOK | Free | Your own API keys | Unlimited | Bring your own model keys
Pro | $20/month | 20M standard tokens | Up to 50 ($5/additional) | Dedicated compute, frontier models
Max | $200/month | 200M standard tokens | Up to 100 | Everything in Pro, scaled
Enterprise | Custom | Custom | Unlimited | SOC 2, GDPR, ISO 42001, CCPA

Token-based billing (not per-seat). Cached tokens are discounted. Hybrid pricing models available for enterprise (seats + tokens + usage overages).

What it can actually do autonomously:

  • Pull context from tickets, implement solutions, create PRs with full traceability from ticket to code
  • Multi-IDE support: VS Code, JetBrains, Vim, terminal CLI
  • Supports all major frontier models (GPT-5, Claude Sonnet 4, o3, Gemini 2.5 Pro, Claude Opus 4.1)
  • Claims ~$18,000/year savings per engineer through optimized development processes

What still requires human oversight:

  • PR review and merge decisions
  • Architectural planning
  • Requirements clarification on ambiguous tickets

Enterprise readiness: SOC 2, GDPR, ISO 42001, and CCPA compliant. Strong enterprise security posture. Token billing can be unpredictable with large context windows or long-running tasks.


1.3 Cosine / Genie

What it is: An autonomous AI software engineer powered by a proprietary model (Genie 2), trained on data that “codifies human reasoning” by observing how engineers actually work.

Pricing (as of March 2026):

Plan | Cost | Tasks/Month | Key Features
Free | $0 | 80 (for new users) | Trial access
Hobby | $20/month | 80 | Individual developers
Professional | $99/user/month | 240 | Team features
Enterprise | Custom | Custom | Advanced features, security

Task-based pricing model: once you start a task, you can iterate with Genie as many times as needed within that single task cost.

What it can actually do autonomously:

  • End-to-end feature development: understands codebase, plans, writes changes, opens PRs
  • Multi-agent decomposition: breaks backlog items into subtasks, assigns to specialized agents
  • Bug fixes, code refactoring, validation across 50+ languages
  • CLI mode: runs in your actual environment, accesses local files, runs builds, executes tests
  • Developers can assign multiple tickets simultaneously, then return to review/merge PRs

Benchmarks: Cosine AutoPM achieves 72% on SWE-Lancer, outperforming OpenAI and Anthropic on that benchmark. Note: SWE-Lancer is a different benchmark than SWE-bench.

Enterprise readiness: VS Code extension and cloud platform. Enterprise plan available but details are custom-negotiated.


1.4 Sweep AI

What it is: An AI coding assistant that automates task execution by reading codebases, planning changes, writing code, and submitting pull requests.

Pricing: Self-hosted and hosted deployment options. Uses its own LLMs for privacy (no code retained by third parties). Specific pricing tiers not prominently published.

What it can actually do autonomously:

  • Reads instructions from GitHub issues or Jira tickets
  • Searches entire codebase for context
  • Writes code changes across multiple files
  • Submits PRs for human review
  • Supports Python, JavaScript, Rust, Go, Java, C#, C++
  • JetBrains IDE plugin with inline completions, test generation, static analysis feedback

What still requires human oversight:

  • PR review and merge
  • Task specification and prioritization
  • Architecture decisions

Enterprise readiness: Self-hosted option available, which is critical for enterprises that cannot send code to third-party services. Uses proprietary LLMs, reducing third-party data exposure.


1.5 OpenHands (formerly OpenDevin)

What it is: The most popular open-source AI coding agent platform. Model-agnostic, sandboxed, and designed for both local development and cloud-scale deployment.

Pricing (as of March 2026):

Plan | Cost | Key Features
Cloud Individual | Free | Basic cloud access
Cloud Growth | $500/month | Unlimited users
Self-hosted Enterprise | Custom | Private VPC, SAML/SSO, unlimited conversations, priority support

What it can actually do autonomously:

  • Modify code, execute commands, browse the web, interact with APIs
  • Automate the full software development lifecycle
  • Works with Claude, GPT, or any other LLM (model-agnostic)
  • SDK: composable Python library for defining and running agents locally or at cloud scale
  • Solves over 50% of real GitHub issues on SWE-bench

What still requires human oversight:

  • Complex multi-step workflows may produce inconsistent results
  • Requires configuration and prompt engineering for optimal performance
  • Open-source means you own operations, upgrades, and security hardening

Enterprise readiness: Raised $18.8M Series A (November 2025) specifically to bring cloud coding agents to enterprises. Self-hosted via Kubernetes/Helm chart. SAML/SSO. Extended support and research team access for enterprise contracts.

Key differentiator: Open-source and model-agnostic – organizations can avoid vendor lock-in and run fully on-premise for maximum security.


1.6 GitHub Copilot Workspace / Coding Agent

What it is: Microsoft/GitHub’s agentic layer on top of Copilot, spanning IDE agent mode, CLI, and cloud-based coding agents.

Pricing (as of March 2026):

Plan | Cost | Premium Requests | Key Features
Free | $0 | 50/month | 2,000 completions, basic access
Pro | $10/month | 300/month | Standard developer tier
Pro+ | $39/month | 1,500/month | All AI models (Claude Opus 4, o3)
Business | $19/user/month | Included | Copilot coding agent, org management
Enterprise | $39/user/month | Included | All Business features + enterprise extras

Additional premium requests cost $0.04 each. Chat, agent mode, code review, coding agent, and CLI all consume premium requests.
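Because every agentic surface draws from the same allowance, overage costs are easy to estimate. A hedged sketch using the published Pro/Pro+ figures (the usage pattern in the example is hypothetical):

```python
# Copilot premium-request overage arithmetic: each paid plan includes a
# monthly allowance, and additional requests are billed at $0.04 each.
# Allowances below are the published Free/Pro/Pro+ figures.

ALLOWANCE = {"Free": 50, "Pro": 300, "Pro+": 1500}
OVERAGE_RATE = 0.04  # dollars per extra premium request

def monthly_overage(plan: str, requests_used: int) -> float:
    """Dollars billed beyond the plan's included premium requests."""
    extra = max(0, requests_used - ALLOWANCE[plan])
    return round(extra * OVERAGE_RATE, 2)

# A hypothetical Pro user making ~25 agent-mode requests per workday:
print(monthly_overage("Pro", 550))  # → 10.0 (250 extra × $0.04)
```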

What it can actually do autonomously (Agent Mode):

  • Determines which files to modify, offers code changes and terminal commands
  • Iterates to remediate issues until the task is complete
  • Reads files, runs code, checks output, identifies lint errors/test failures, loops to fix them
  • Tool use: read_file, edit_file, run_in_terminal, search workspace
  • Self-healing: recognizes and fixes compilation and runtime errors automatically
  • MCP (Model Context Protocol) integration for external tool access
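The iterate-to-green behavior above can be sketched as a simple control loop. This is a minimal illustration, not Copilot's actual implementation; `run_checks` and `propose_fix` are stand-ins for the tool-use and model-call machinery:

```python
# Minimal sketch of the "self-healing" loop agentic modes run: execute
# checks, feed failures back to the model, apply its edit, and repeat
# until green or a turn limit is hit.

def agent_loop(source: str, run_checks, propose_fix, max_turns: int = 5):
    """Iterate model-proposed edits until checks pass or turns run out."""
    for turn in range(max_turns):
        errors = run_checks(source)
        if not errors:
            return source, turn               # done: checks are green
        source = propose_fix(source, errors)  # model rewrites the file
    return source, max_turns                  # gave up; human takes over

# Toy stand-ins: "checks" demand the word 'fixed'; the "model" appends it.
checks = lambda src: [] if "fixed" in src else ["missing fix"]
fixer = lambda src, errs: src + " fixed"
final, turns = agent_loop("broken", checks, fixer)
print(final, turns)  # → broken fixed 1
```

The turn limit is the important design choice: it bounds autonomous spend and guarantees a human touchpoint when the loop cannot converge.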

What it can actually do autonomously (Coding Agent – cloud-based):

  • Assigned via GitHub Issues
  • Works on its own branch in a cloud environment
  • Creates PRs with proposed changes
  • Runs CI/CD and iterates on failures

Recent updates (March 2026):

  • Custom agents, sub-agents, and plan agent now GA
  • Agent hooks in preview with auto-approve support for MCP
  • Copilot CLI (terminal-based agent) reached GA for all paid subscribers
  • Major agentic improvements for JetBrains IDEs

What still requires human oversight:

  • PR review and merge approval
  • Complex architectural decisions
  • Security-sensitive changes
  • Cross-repository coordination

Enterprise readiness: Best-in-class for organizations already on GitHub Enterprise Cloud. Centralized policy management, audit logs, org-level controls. Massive ecosystem integration advantage.


1.7 Cursor Agent Mode / Cloud Agents

What it is: An AI-native IDE (fork of VS Code) with deep agentic capabilities, ranging from in-editor agent mode to fully autonomous cloud agents running on isolated VMs.

Pricing (as of March 2026):

Plan | Cost | Key Features
Hobby | Free | Limited agent requests and completions
Pro | $20/month | Unlimited completions, monthly credit pool
Pro+ | $60/month | Background agents, ~3x agent capacity
Ultra | $200/month | Maximum usage and premium model access
Teams | $40/user/month | Shared context, centralized billing, usage visibility
Enterprise | Custom | Negotiated per seat count

Credit-based system (since June 2025): monthly credit pool equal to plan price in dollars, consumed based on model selection. “Auto mode” is unlimited.

What it can actually do autonomously:

Agent Mode (in-editor):

  • Independently executes terminal commands, installs dependencies, runs tests
  • Analyzes compilation errors, proposes and applies fixes
  • Context-aware across the full project

Cloud Agents (launched February 2026):

  • Fully autonomous agents running on isolated VMs
  • Build software, test it, record video demos of their work
  • Produce merge-ready PRs
  • Run for 30-60 minutes independently on tasks
  • 30% of Cursor’s own PRs are now made by agents

Automations (March 2026):

  • Auto-launch agents triggered by codebase changes, Slack messages, or timers
  • Automated review and maintenance of agent-generated code

What still requires human oversight:

  • Final PR review and merge decisions
  • Architectural planning
  • Tasks exceeding ~60 minutes of complexity
  • Security review of generated code

Enterprise readiness: Teams plan provides centralized billing and usage visibility. Enterprise plan negotiated. The rapid pace of feature releases (cloud agents, automations) may raise concerns about stability for conservative enterprises.

Key insight from Cursor: “Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects.”


1.8 Claude Code (Anthropic)

What it is: A terminal-native agentic coding tool that reads codebases, edits files, runs commands, and integrates with development workflows. Also available in IDE, desktop app, and browser.

Pricing: Consumed via Anthropic API credits (Claude Pro $20/month for individuals; Claude Max plans for heavy usage; API pricing for enterprise/CI/CD integration). No separate Claude Code subscription.

What it can actually do autonomously:

  • Chains an average of 21.2 independent tool calls without human intervention
  • Session durations nearly doubled in three months (from ~25 to ~45 minutes)
  • Auto-Accept mode (shift+tab): autonomous loops where Claude writes code, runs tests, iterates until tests pass
  • Agent teams mode (research preview): multiple agents working in parallel, coordinating autonomously
  • Headless mode (-p flag): non-interactive execution for CI/CD pipelines, scripts, cron jobs
    • Supports --output-format (text, JSON, streaming)
    • --max-turns to control autonomous step count
    • Session IDs for context persistence across invocations (~200K token context)
  • Batch processing via /batch command
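For CI/CD use, a pipeline step typically just assembles a headless invocation from the flags listed above. A sketch using only the documented flags (-p, --output-format, --max-turns); the wrapper function itself is illustrative, not part of the tool:

```python
# Build the argv for a non-interactive Claude Code run, suitable for a
# CI step such as automated code review on a pull request.

def headless_cmd(prompt: str, output_format: str = "json", max_turns: int = 10):
    """Assemble a headless Claude Code command line."""
    return [
        "claude",
        "-p", prompt,                      # headless (print) mode
        "--output-format", output_format,  # text, json, or streaming
        "--max-turns", str(max_turns),     # cap autonomous steps
    ]

cmd = headless_cmd("Review this diff for security issues")
# In CI you would execute this via subprocess.run(cmd, check=True)
print(cmd[0])  # → claude
```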

Common CI/CD use cases:

  • Automated code review
  • Test generation
  • Changelog generation
  • Code migration
  • Accessibility auditing
  • Technical debt detection
  • Documentation translation

What still requires human oversight:

  • Architecture and design decisions
  • Security-sensitive changes
  • Destructive operations
  • Tasks requiring cross-repo or cross-system reasoning beyond the current codebase
  • Long-running sessions may accumulate context errors

Enterprise readiness: API-based usage with enterprise Anthropic contracts. No inherent VPC deployment for Claude Code itself (it runs locally and calls the API). Organizations control where the tool runs. SOC 2 compliant API. The tool is open-source on GitHub, allowing security auditing.

Benchmarks: Claude Code runs on Opus 4.6 (1M-token context window, 128K output tokens). Rather than headline leaderboard scores, Anthropic reports measuring agent autonomy systematically, tracking tool-call chain length and session duration as its metrics.

Key insight from Anthropic’s 2026 Agentic Coding Trends Report:

  • Developers integrate AI into 60% of their work
  • 80-100% of delegated tasks still receive active human oversight
  • Engineering roles are shifting from implementation to agent supervision, system design, and output review
  • Multi-agent systems are replacing single-agent workflows

2. Comparative Analysis

2.1 Pricing Comparison

Platform | Entry Price | Enterprise Price | Billing Model | Free Tier
Devin | $20/month + $2.25/ACU | Custom | ACU (compute units) | No
Factory AI | Free (BYOK) | Custom | Token-based | Yes (BYOK)
Cosine/Genie | $20/month | Custom | Task-based | Yes (80 tasks)
Sweep AI | Varies | Custom | Self-hosted option | Limited
OpenHands | Free (cloud) | Custom | Conversations/usage | Yes
GitHub Copilot | $10/month | $39/user/month | Premium requests | Yes (50 req/mo)
Cursor | $20/month | Custom | Credit-based | Yes (limited)
Claude Code | API pricing | API enterprise | Token-based (API) | Limited

Trend: The industry is moving from per-seat to usage-based pricing, reflecting the “AI worker” framing where you pay for work done, not seats occupied.

2.2 Autonomy Spectrum

Platform | Autonomy Level | Typical Task Duration | Human Touchpoint
Devin | High (cloud sandbox) | 1-8 hours | PR review
Factory AI | High (ticket-to-PR) | Variable | PR review
Cosine/Genie | High (multi-agent) | Variable | PR review
Sweep AI | Medium (issue-to-PR) | Minutes to hours | PR review
OpenHands | High (configurable) | Variable | PR review
GitHub Copilot Agent | Medium-High | Minutes to hours | PR review, agent mode approval
Cursor Cloud Agents | High (isolated VM) | 30-60 minutes | PR review
Claude Code | Medium-High | Up to 45+ minutes | Auto-accept or interactive

2.3 SWE-bench Benchmark Reality Check

SWE-bench has fragmented into multiple variants due to data contamination concerns:

Benchmark | What It Measures | Top Scores (March 2026)
SWE-bench Verified | 500 human-validated samples | ~81% (best models) – contaminated
SWE-bench Pro | Harder, less contaminated | ~46% (best) / 23% (Claude Opus 4.1, GPT-5)
SWE-bench Pro (private) | Previously unseen codebases | ~18% (Claude Opus 4.1) / ~15% (GPT-5)
SWE-bench Live | Monthly-updated, contamination-free | Updated monthly

Critical finding: The gap between Verified (~81%) and Pro private (~15-18%) reveals how much benchmark scores overstate real-world capability. On truly novel codebases, the best models resolve fewer than 1 in 5 issues autonomously.


3. Corporate Governance Challenges

3.1 Who Is Responsible for Agent-Generated Code?

The deploying organization bears full liability. Under typical AI vendor contracts, indemnification provisions are limited or excluded entirely for AI-generated output. Regulators have made clear that businesses deploying AI remain fully accountable for legal compliance, regardless of whether the AI functionality comes from a third-party vendor.

Key governance gaps:

  • 88% of organizations deploy AI, but only 25% have board-level AI policies governing that deployment
  • The 63-percentage-point gap represents boards operating without documented AI oversight while their companies deploy AI systems affecting customers, employees, and regulatory compliance
  • Under Caremark liability doctrine, the absence of board-level AI reporting and oversight is exactly the kind of governance gap that creates director liability exposure

3.2 Agentic AI Does Not Fit Existing Governance Frameworks

Autonomous AI agents that take actions, use tools, and make sequential decisions do not fit governance frameworks designed for prediction models or chatbots. Key challenges:

  • Accountability chains: When an agent calls another agent, which makes an API call, which triggers a purchase, who is responsible for the bad outcome?
  • Hallucination liability: AI-generated code, policies, SOPs, and training materials can trigger legal obligations the company does not realize it has assumed
  • IP uncertainty: The U.S. Supreme Court declined (March 2026) to extend copyright protection to purely AI-generated works, leaving ownership of agent-generated code legally ambiguous
  • Regulatory exposure: The EU AI Act high-risk obligations take effect August 2, 2026, with specific requirements for AI systems that generate code used in critical infrastructure

3.3 Emerging Best Practices

  1. Establish board-level AI governance with documented policies, reporting cadence, and clear accountability
  2. Define agent authorization levels – what agents can do autonomously vs. what requires human approval
  3. Maintain audit trails – every agent action logged with full traceability
  4. Implement “human-in-the-loop” gates at merge, deploy, and production-access decision points
  5. Review vendor contracts for indemnification scope on AI-generated output
  6. Treat agent-generated code as untrusted until human-reviewed, regardless of test pass rates

4. Security Implications

4.1 The Attack Surface

Giving AI agents access to codebases, CI/CD pipelines, and production environments creates a qualitatively different risk profile than traditional developer tooling:

Code quality risks:

  • 45% of AI-generated code introduces security vulnerabilities
  • LLMs choose insecure methods nearly half the time
  • Common vulnerabilities: unsafe dependencies, hallucinated functions, secret leaks

Supply chain risks:

  • Agents dynamically resolve and install packages, optimizing for task completion, not supply chain risk
  • CI/CD pipelines and GitHub Actions are prime targets for supply chain attacks (Datadog State of DevSecOps 2026)
  • The OpenClaw incident: ~900 malicious skills (20% of total packages), 283 leaking credentials, 76 containing malicious payloads

Data leakage risks:

  • Secrets appear in prompts during debugging, in code comments models read, and in CI logs that agents summarize
  • Proprietary code copied into AI tools may be stored, reused for training, and later exposed

Visibility gaps:

  • Security teams cannot see what processes an AI agent spawns, what endpoints it contacts, or what packages it installs at runtime
  • Traditional SAST/DAST/SCA tools were not designed for agent-generated code review

4.2 Security Architecture Recommendations

  1. Sandboxed execution environments – agents should never have direct production access
  2. Principle of least privilege – agents get only the permissions needed for their specific task
  3. Automated security scanning in the PR pipeline (not just at deploy)
  4. Secret management – never expose secrets in agent-accessible environments; use credential vaults
  5. Supply chain verification – lock dependency versions, verify package authenticity before agent installation
  6. Audit logging of all agent actions, including command execution, file access, and network calls
  7. Network isolation – agent environments should have restricted outbound access
  8. Regular vulnerability assessment of the agent tools themselves (30+ IDE vulnerabilities disclosed in 2025)
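Recommendation 6 – audit logging of every agent action – can be implemented as a thin wrapper in front of each tool an agent is allowed to call. A sketch with illustrative action names and log shape; a production system would ship these records to an append-only store:

```python
# Route every agent action through an audit wrapper before it executes,
# recording the action name, arguments, and a timestamp.
import json
import time

AUDIT_LOG = []

def audited(action: str):
    """Decorator that records each agent tool call before running it."""
    def wrap(fn):
        def inner(*args, **kwargs):
            AUDIT_LOG.append({
                "ts": time.time(),
                "action": action,
                "args": json.dumps([args, kwargs], default=str),
            })
            return fn(*args, **kwargs)
        return inner
    return wrap

@audited("file.read")
def read_file(path: str) -> str:
    # stand-in for a real tool implementation
    return f"<contents of {path}>"

read_file("src/app.py")
print(AUDIT_LOG[0]["action"])  # → file.read
```

The same wrapper pattern extends naturally to command execution and network calls, giving security teams the runtime visibility the section above says they currently lack.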

5. Human Oversight Models

5.1 The “Delegate, Review, Own” Pattern

Leading teams are converging on a simple operating model:

  1. Delegate: Engineer defines the task (ticket, issue, natural language prompt)
  2. Review: Agent produces output (code, PR, test results); engineer reviews
  3. Own: Engineer takes responsibility for merged code regardless of origin

The multi-agent pipeline emerging in practice:

Task Description (Human)
  --> Feature Author Agent (writes code)
  --> Test Generator Agent (writes tests)
  --> Code Reviewer Agent (reviews changes)
  --> Architecture Guardian Agent (checks compliance)
  --> Security Scanner Agent (vulnerability check)
  --> Human Review (final approval)
  --> CI/CD Pipeline (automated deployment)

The human remains the decision-maker at key checkpoints, but execution between checkpoints is fully autonomous.

5.2 Parallelism as the Key Multiplier

The emerging consensus: parallelism is the key productivity multiplier. Multiple agents on separate git worktrees, with human oversight as orchestrator rather than implementer.
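The orchestration pattern is straightforward to sketch: dispatch N agent runs concurrently and collect results for review. The agent here is a stub, and real setups isolate each run in its own git worktree or VM rather than a thread:

```python
# Parallel dispatch sketch: the human's role collapses to assigning
# tickets and reviewing the resulting PRs.
from concurrent.futures import ThreadPoolExecutor

def run_agent(ticket: str) -> str:
    # stand-in for a full agent run ending in a pull request
    return f"PR for {ticket}"

tickets = ["AUTH-101", "PERF-202", "BUG-303"]  # hypothetical ticket IDs
with ThreadPoolExecutor(max_workers=3) as pool:
    prs = list(pool.map(run_agent, tickets))   # preserves ticket order

print(prs)  # → ['PR for AUTH-101', 'PR for PERF-202', 'PR for BUG-303']
```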

From Anthropic’s 2026 report: developers integrate AI into 60% of their work, but maintain active oversight on 80-100% of delegated tasks. The shift is from writing code to directing and reviewing agent-generated code.

5.3 Three Requirements for Deployment Success

  1. Clean API access to all systems the agent must interact with (CRM, ITSM, ERP, version control)
  2. A governance model defining what agents execute autonomously, what triggers human review, and what gets logged
  3. An auditable monitoring layer covering every agent action with logging, anomaly detection, and rollback capability

6. The “AI Teammate” Concept in Practice

6.1 From Tool to Teammate

The framing is shifting from “AI coding assistant” to “AI teammate” – a persistent entity that participates in team workflows:

  • Microsoft Copilot Cowork (private preview, March 2026): digital colleague that plans and performs extended tasks across Microsoft 365
  • Sentra: AI teammate that creates knowledge graphs through conversations, integrated into meetings, Slack, Jira, and calendars
  • Devin: Integrates with Slack and Teams for task assignment, communicates progress, asks clarifying questions

6.2 Organizational Impact

  • Organizations winning in 2026 are transitioning early-career talent from “Code Generators” to “System Verifiers”
  • Teams that write well produce better agent output, because agents rely on written context
  • The 4% of companies maximizing AI benefits are building workflows that turn individual knowledge into company-wide memory

6.3 The Engineer’s Evolving Role

Engineers’ value lies in:

  • Designing overarching system architecture
  • Defining precise objectives and guardrails for AI agents
  • Rigorously validating final output for robustness, security, and business alignment
  • Moving from “hands-on keyboard creation” to “high-level system design, quality assurance, and strategic oversight”

7. Karpathy’s Autoresearch Pattern

7.1 What It Is

Released March 6, 2026, autoresearch is an open-source framework by Andrej Karpathy that lets AI agents autonomously run machine learning experiments overnight on a single GPU. It reached 30,307 GitHub stars in one week – one of the fastest-growing repos in GitHub history.

7.2 The Pattern

  1. Human writes a high-level prompt in a Markdown file (program.md)
  2. AI agent autonomously edits the training script (train.py)
  3. Each experiment runs for exactly 5 minutes of training time
  4. Agent evaluates results, keeps improvements, discards regressions
  5. Repeats: ~12 experiments/hour, 100+ overnight

After two days of autonomous operation: ~700 changes processed, ~20 additive improvements found that transferred to larger models.
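The keep-improvements/discard-regressions loop is a form of hill climbing, and can be shown in miniature. The "experiment" below is a toy scalar objective standing in for a 5-minute training run; everything else about the control flow mirrors the pattern:

```python
# The autoresearch control loop in miniature: mutate the configuration,
# evaluate, keep only improvements, discard regressions.
import random

def evaluate(config: float) -> float:
    # toy objective, best at config = 3.0 (stand-in for a training run)
    return -(config - 3.0) ** 2

random.seed(0)
config, best, kept = 0.0, evaluate(0.0), 0
for _ in range(700):                        # ~700 experiments over two days
    candidate = config + random.uniform(-0.5, 0.5)
    score = evaluate(candidate)
    if score > best:                        # keep improvements...
        config, best, kept = candidate, score, kept + 1
    # ...discard regressions by simply not updating

print(round(config, 2), kept)  # converges near 3.0 after a few dozen keeps
```

The loop never needs to understand why a change helped; scale and a reliable evaluation function do the work, which is exactly why humans only need to control the objective.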

7.3 Why It Matters for Enterprise

The autoresearch pattern demonstrates a replicable model for autonomous AI work:

  • Human control is at the prompt level, not the execution level
  • The loop is self-correcting: keep improvements, discard regressions
  • Scale is the advantage: no human can run 700 experiments in two days
  • Results transfer: improvements on small models generalized to larger ones

This pattern is directly applicable to:

  • Automated performance optimization
  • Configuration tuning
  • Test suite expansion
  • Code refactoring experiments
  • Security hardening iterations

The core insight: humans define the objective and evaluation criteria; agents handle execution at a scale humans cannot match.


8. Real vs. Hyped Capabilities

8.1 What Actually Works Today (March 2026)

Capability | Maturity | Evidence
Automated PR generation from tickets | Production-ready | Devin, Factory, Cosine, Sweep all ship this
Bug fixes with clear reproduction steps | Production-ready | 67% merge rate (Devin), 50%+ SWE-bench (OpenHands)
Test generation and coverage expansion | Production-ready | 50-60% to 80-90% coverage reported
Code migration (language/framework upgrades) | Production-ready | 10-14x speed improvements documented
Security vulnerability remediation | Production-ready | 20x speed improvement (Devin enterprise data)
Autonomous multi-hour task completion | Early production | 30-60 min reliable (Cursor); up to 45 min (Claude Code)
Multi-agent coordination | Research preview | Claude Code agent teams; Cosine multi-agent
Full feature development from spec | Emerging | Success rates vary widely by complexity
Architectural design | Not ready | All platforms require human architects

8.2 What Is Still Hype

  • “Replacing developers”: No platform replaces developers; they augment and redirect developer effort
  • SWE-bench scores as real-world predictors: 81% Verified vs 15-18% on private codebases shows benchmark inflation
  • Autonomous production deployment: No responsible platform ships agent code to production without human review
  • “AI teammate” parity with humans: Agents lack soft skills, contextual judgment, stakeholder management, and the ability to navigate organizational politics
  • Cost savings as advertised: Token/compute costs can be unpredictable; the $18K/year savings claim from Factory requires specific workflow optimization

9. Implications

9.1 Guidance for Foley Hoag Clients

  1. Liability clarity is urgent: Clients deploying AI agents need to understand that they – not AI vendors – bear responsibility for agent-generated code, including IP infringement, security vulnerabilities, and regulatory non-compliance
  2. Board governance gaps create director exposure: 63% of organizations deploying AI lack board-level policies, creating Caremark liability risk
  3. EU AI Act compliance deadline approaching: High-risk obligations take effect August 2, 2026; AI systems generating code for critical infrastructure may be in scope
  4. IP ownership remains unresolved: Agent-generated code may not be copyrightable; clients need to understand the implications for trade secret and patent strategies
  5. Vendor contract review is critical: Standard AI vendor indemnification provisions often exclude or severely limit coverage for AI-generated output
  6. Security governance must evolve: Traditional code review processes are insufficient for the volume and speed of agent-generated code; automated security scanning in the PR pipeline is now mandatory

9.2 Market Trajectory

The AI agents market is projected to grow from $7.84 billion (2025) to $52.62 billion by 2030 at a 46.3% CAGR. Every enterprise software engineering organization will need to develop agent governance policies within the next 12-18 months.


What This Means for Your Organization

AI coding agents have moved from demos to production in 12 months. Goldman Sachs, Santander, and Nubank are deploying Devin at scale. Cursor reports 30% of its own PRs are now made by autonomous agents. The agents market is projected to grow from $7.84 billion in 2025 to $52.62 billion by 2030 at 46.3% CAGR. This is not a future state to monitor. It is a current state to govern. If your organization does not have a policy defining what AI agents can do autonomously versus what requires human approval, you are operating in a governance vacuum that creates liability exposure your board has not assessed.

The benchmark data should temper any impulse to replace developers with agents. SWE-bench Verified scores reach 81%, which sounds impressive until you compare it to SWE-bench Pro on private, previously unseen codebases: 15-18%. That means the best AI agents in the world resolve fewer than one in five issues on code they have not seen before. Independent testing of Devin found only 3 of 20 assigned tasks completed successfully. Devin’s own data shows a 67% PR merge rate – dramatically improved from 34% a year ago, but still meaning one-third of its work product is rejected. Agents are valuable for well-specified tasks on familiar codebases. They are not replacing engineers. They are changing what engineers spend their time on: from writing code to defining intent, reviewing output, and owning architecture.

The liability question is the one your general counsel needs to answer before your first agent-generated PR ships to production. Under standard AI vendor contracts, the deploying organization – not the vendor – bears full liability for agent-generated code. Eighty-eight percent of organizations deploy AI, but only 25% have board-level AI policies. That 63-percentage-point gap represents boards operating without documented AI oversight while their companies deploy AI systems that write production code, install dependencies, and interact with APIs. Forty-five percent of AI-generated code introduces security vulnerabilities. The EU AI Act’s high-risk obligations take effect August 2, 2026. The organizations that define agent authorization levels, maintain audit trails, and implement human-in-the-loop gates at merge and deploy decision points now will have both a compliance advantage and a quality advantage over those that wait for an incident to force the conversation.

Created by Brandon Sneider | brandon@brandonsneider.com March 2026