SWE-CI Benchmark: AI Agents Can Write Code but Cannot Maintain It
Executive Summary
- A new benchmark from Sun Yat-sen University and Alibaba (March 2026) tests whether AI agents can maintain codebases over months, not just fix isolated bugs. SWE-CI drops 18 models into 100 maintenance tasks drawn from 68 real-world Python repositories, each task averaging 233 days and 71 consecutive commits of development history. The results are stark: most models introduce regressions on more than 75% of long-term maintenance tasks.
- Only Anthropic’s Claude Opus series exceeds a 50% zero-regression rate. Claude Opus 4.6 achieves 0.76; every other model family scores below 0.25. A model that scores 70%+ on SWE-bench — the standard one-shot bug-fix benchmark — can still produce code that becomes a maintenance burden within weeks.
- The enterprise code quality crisis is already measurable. GitClear’s analysis of 211 million lines finds code duplication up 48%, refactoring down 60%, and code churn up 41% since AI adoption accelerated. CodeRabbit finds AI-generated code produces 1.7x more issues, with performance problems 8x more frequent and security vulnerabilities 2.7x higher.
- Gartner predicts prompt-to-app approaches will increase software defects by 2,500% by 2028 unless governance frameworks are implemented. 75% of technology leaders already face moderate-to-severe technical debt from AI-accelerated coding practices.
- The hidden cost: cognitive debt. AI agents produce code 5-7x faster than developers can understand it. Developer trust in AI code accuracy dropped from 43% to 29% even as AI tool usage climbed to 84%. Organizations are shipping code faster than anyone can comprehend, building systems that no one fully understands.
The SWE-CI Benchmark: What It Measures and Why It Matters
The Problem with Existing Benchmarks
SWE-bench, the industry standard for evaluating AI coding agents, tests a single capability: given one GitHub issue, produce one patch. This measures functional correctness on isolated tasks. It says nothing about what happens when that agent maintains the same codebase for months.
SWE-CI, published March 4, 2026, by researchers at Sun Yat-sen University and Alibaba Group (arXiv 2603.03823), fills this gap. The name stands for Software Engineering — Continuous Integration. It tests whether AI agents can sustain code quality through the kind of ongoing development that constitutes 80%+ of real software engineering work.
How It Works
The benchmark comprises 100 tasks drawn from 68 actively maintained Python repositories. Each task spans a real development history — an average of 233 days and 71 consecutive commits — with a minimum of 500 changed lines of non-test code per task. The selection pipeline started from 4,923 candidate repositories and filtered down through increasingly strict criteria: 3+ years of active maintenance, 500+ GitHub stars, permissive licensing, and viable test suites.
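The four selection criteria above amount to a simple predicate over candidate repositories. A minimal sketch (field names are hypothetical stand-ins, not the paper's actual pipeline code):

```python
# Sketch of the SWE-CI repository filters named in the text.
# Field names are illustrative; the paper's pipeline is not reproduced here.

MIN_YEARS_MAINTAINED = 3
MIN_STARS = 500

def passes_filters(repo):
    """Apply the four selection criteria to one candidate repository."""
    return (
        repo["years_maintained"] >= MIN_YEARS_MAINTAINED
        and repo["stars"] >= MIN_STARS
        and repo["license_permissive"]
        and repo["has_viable_tests"]
    )
```

Applied to the 4,923 candidates, filters like these narrow the pool to the 68 repositories the tasks are drawn from.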
SWE-CI uses a dual-agent protocol: an Architect agent analyzes failing tests, locates deficiencies, and designs improvements. A Programmer agent then implements the changes. Each task runs through up to 20 CI iterations, with a 3,600-second timeout per test cycle. The agents must iteratively resolve failing tests while preserving all previously passing tests — the same challenge human developers face during sustained maintenance.
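The Architect/Programmer protocol can be sketched as a control loop. This is a structural illustration only; the agent internals are hypothetical stand-ins, and only the loop shape (up to 20 iterations, 3,600-second test budget) comes from the text:

```python
# Sketch of the SWE-CI dual-agent CI loop described above.
# `architect`, `programmer`, and `run_tests` are hypothetical callables.

MAX_CI_ITERATIONS = 20
TEST_TIMEOUT_SECONDS = 3600  # per-cycle test budget from the benchmark

def run_maintenance_task(repo, architect, programmer, run_tests):
    """Iterate until all tests pass or the iteration budget is exhausted."""
    for iteration in range(1, MAX_CI_ITERATIONS + 1):
        result = run_tests(repo, timeout=TEST_TIMEOUT_SECONDS)
        if not result["failing"]:
            return {"passed": True, "iterations": iteration}
        # Architect: analyze failing tests, locate deficiencies, design a fix.
        plan = architect(repo, result["failing"])
        # Programmer: implement the plan without breaking passing tests.
        repo = programmer(repo, plan)
    return {"passed": False, "iterations": MAX_CI_ITERATIONS}
```

The constraint that matters for the benchmark is the invariant inside the loop: every iteration must preserve previously passing tests, not just fix the currently failing ones.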
The key metric is EvoScore, a future-weighted measure that penalizes agents whose early decisions cause problems in later iterations. An agent that passes today’s tests by writing brittle code will see its EvoScore decline as that code breaks downstream. The benchmark also tracks zero-regression rate: the proportion of tasks where an agent avoids introducing any regression throughout the entire maintenance history.
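The paper's exact EvoScore formula is not reproduced in this summary, but the idea of future-weighting can be illustrated with a toy metric: weight later checkpoints more heavily, so an agent whose early fixes decay scores worse than a steady one even at the same average pass rate.

```python
# Hypothetical future-weighted score in the spirit of EvoScore.
# This is an illustration of the weighting idea, not the paper's formula.

def future_weighted_score(pass_rates, gamma=1.5):
    """pass_rates: test pass rate at each successive maintenance checkpoint.
    Later checkpoints get geometrically larger weights, so code that
    degrades downstream drags the score down disproportionately."""
    weights = [gamma ** i for i in range(len(pass_rates))]
    return sum(w * p for w, p in zip(weights, pass_rates)) / sum(weights)

# Both trajectories have the same unweighted mean (0.7), but the brittle
# one (strong early, decaying later) scores worse under future-weighting:
brittle = future_weighted_score([1.0, 0.9, 0.6, 0.3])
steady = future_weighted_score([0.7, 0.7, 0.7, 0.7])
```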
What the Results Show
The evaluation consumed over 10 billion tokens across 18 models from 8 providers. The results reveal a clear performance hierarchy — and a clear industry-wide problem:
Zero-Regression Rates (proportion of tasks with no regressions):
| Model | Zero-Regression Rate |
|---|---|
| Claude Opus 4.6 | 0.76 |
| Claude Opus 4.5 | 0.51 |
| All other models | Below 0.25 |
Most AI coding agents introduce regressions on more than 75% of long-term maintenance tasks. Even Claude Opus 4.6 — the top performer — fails to avoid regressions in roughly 1 of every 4 tasks. As one developer noted in the Hacker News discussion: regressions in production systems “should be measured in 9s, not percentages.”
Within provider families, newer models consistently outperform older ones, with models released after January 2026 showing the largest gains. This suggests the problem is solvable with better training — but no model has solved it yet.
The EvoScore analysis reveals different optimization profiles: MiniMax, DeepSeek, and GPT models show a preference for long-term stability, while Kimi and GLM lean toward short-term gains. Claude and Qwen maintain stable rankings regardless of how heavily future iterations are weighted — suggesting more robust architectural reasoning.
The Broader Code Quality Crisis
SWE-CI quantifies what enterprise engineering leaders have been observing anecdotally: AI-generated code works on first pass but creates cascading problems over time. Multiple independent datasets confirm this pattern.
GitClear: 211 Million Lines Tell the Story
GitClear’s AI Copilot Code Quality Report (February 2025) analyzes 211 million changed lines of code committed between 2020 and 2024, drawn from repositories spanning Google, Microsoft, Meta, and enterprise customers. It documents the structural shift:
- Code duplication rose from 8.3% to 12.3% of changed lines — a 48% increase. Copy-pasted code now exceeds moved code for the first time in GitClear’s tracking history.
- Refactoring activity collapsed from 25% to below 10% of changed lines — developers are adding code rather than improving existing code.
- Code churn increased 41% — more code is being modified within its first month, a signal of quick-fix patterns rather than considered design.
- Code volume is up ~75% — developers are checking in substantially more code, but that code requires more rework and carries more duplication.
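Churn in the sense used above is the share of changed lines that get revised again within a month of being written. A minimal sketch of the calculation (the record shape is hypothetical, not GitClear's API):

```python
# Illustrative churn metric: fraction of lines re-modified within 30 days
# of being authored. Field names are hypothetical stand-ins.
from datetime import timedelta

CHURN_WINDOW = timedelta(days=30)

def churn_rate(line_changes):
    """line_changes: dicts with 'authored_at' and 'revised_at' datetimes
    ('revised_at' is None if the line was never touched again)."""
    churned = sum(
        1 for c in line_changes
        if c["revised_at"] is not None
        and c["revised_at"] - c["authored_at"] <= CHURN_WINDOW
    )
    return churned / len(line_changes)
```

A rising value of this ratio is the quick-fix signal GitClear flags: code being rewritten before it has even settled.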
As API evangelist Kin Lane observed: “I don’t think I have ever seen so much technical debt being created in such a short period of time during my 35-year career in technology.”
CodeRabbit: AI Code Produces 1.7x More Issues
CodeRabbit’s State of AI vs. Human Code Generation report (December 2025, n=470 PRs: 320 AI-co-authored, 150 human-only) provides granular defect data:
| Issue Category | AI Multiplier vs. Human Code |
|---|---|
| Performance inefficiencies | 8.0x |
| Readability problems | 3.0x |
| Security vulnerabilities | 2.74x |
| Formatting drift | 2.66x |
| Error handling gaps | 2.0x |
| Concurrency/dependency issues | 2.0x |
| Logic and correctness errors | 1.75x |
| Overall issue rate | 1.7x (10.83 vs. 6.45 issues per PR) |
The 8x performance gap is alarming: AI-generated code introduces excessive I/O operations, unnecessary computation, and resource-intensive patterns that look correct in isolation but degrade system performance at scale.
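A typical instance of this failure mode is the N+1 lookup: code that is correct in isolation but issues one round trip per item. A minimal contrast, with hypothetical data-access helpers (`fetch_user`, `fetch_users`):

```python
# Illustration of the performance inefficiency class described above.
# Both functions return identical results; only the I/O pattern differs.

def enrich_orders_slow(orders, fetch_user):
    # N+1 pattern: one lookup per order. Correct, but N round trips,
    # which degrades quietly as order volume grows.
    return [{**o, "user": fetch_user(o["user_id"])} for o in orders]

def enrich_orders_fast(orders, fetch_users):
    # Batched: one lookup for all distinct users, then an in-memory join.
    users = fetch_users({o["user_id"] for o in orders})
    return [{**o, "user": users[o["user_id"]]} for o in orders]
```

Review tooling catches the slow version easily; the point of the 8x figure is that AI-generated code produces patterns like it far more often than human-written code does.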
Apiiro: 4x Velocity, 10x Vulnerabilities
Apiiro’s analysis of Fortune 50 enterprise codebases (tens of thousands of repositories, several thousand developers, December 2024 – June 2025) reveals the security dimension:
- AI-assisted developers produced 3-4x more commits than non-AI peers
- Those commits introduced 10x more security findings — rising from ~1,000 to over 10,000 new findings per month in just six months
- Privilege escalation paths jumped 322%; architectural design flaws spiked 153%
- AI-assisted developers exposed cloud credentials nearly twice as often as non-AI peers
- Individual PRs grew significantly larger, touching more files and services per change — making review harder
The pattern is consistent: AI accelerates code production while degrading the structural qualities that make code maintainable and secure over time.
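The credential-exposure finding usually reduces to something as simple as a literal secret committed to source. A minimal contrast, with a placeholder value and a hypothetical variable name:

```python
import os

# Anti-pattern behind Apiiro's credential-exposure finding: a literal
# secret in source, visible to anyone with repository access.
AWS_SECRET = "EXAMPLE-NOT-A-REAL-KEY"  # never commit real credentials

# Safer baseline: read the secret from the environment at runtime,
# failing fast if it is absent rather than falling back to a literal.
def get_secret(name="AWS_SECRET_ACCESS_KEY"):
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required credential: {name}")
    return value
```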
Cognitive Debt: The Cost Nobody Is Budgeting For
Technical debt is measurable through linters and code quality tools. A newer concept — cognitive debt — captures something harder to detect and more dangerous to ignore.
Margaret-Anne Storey’s framework (February 2026) defines cognitive debt as what accumulates when a team ships code faster than they can understand it. It is a property of the developers, not the code. AI coding agents create a 5-7x velocity-comprehension gap: they generate 140-200 lines per minute while developers comprehend 20-40 lines per minute. The result is working code that nobody fully understands.
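The 5-7x figure follows directly from the cited rates; taking midpoints of both ranges gives roughly 5.7x:

```python
# Back-of-envelope check of the velocity-comprehension gap cited above.
generation_lpm = (140, 200)   # AI output, lines per minute
comprehension_lpm = (20, 40)  # developer reading, lines per minute

# Midpoint ratio: minutes of reading required per minute of generation.
gap = (sum(generation_lpm) / 2) / (sum(comprehension_lpm) / 2)  # ~5.7
```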
The data supports this concern:
- Developer confidence in AI code accuracy dropped from 43% to 29% over 18 months, even as usage climbed to 84% (JetBrains Developer Survey 2025, n=24,534)
- 67% of developers spend more time debugging AI-generated code despite initial velocity gains
- 68% spend more time resolving security vulnerabilities in AI code
- 71% refuse to merge AI code without manual review — but review quality degrades as code volume increases
The asymmetry is structural: AI makes the developer who generates code look productive, but the debt burden falls on whoever inherits the codebase. Fast developers appear successful while shipping; the problems surface months later during maintenance, debugging, and onboarding.
The Gartner Warning: 2,500% Defect Increase
Gartner’s December 2025 predictions quantify where this trajectory leads:
- Prompt-to-app approaches will increase software defects by 2,500% by 2028 — driven by “context-deficient code” that is syntactically correct but architecturally flawed
- 40% of enterprises on consumption-priced AI tools will face 2x+ cost overruns by 2027
- The root cause: automation bias — developers, particularly less experienced ones, trust AI suggestions based on surface-level results rather than engineering analysis
- The defects are harder to catch: “complex architectural and logical bugs that are more damaging and significantly harder to detect with traditional testing methods than common coding errors”
This is not a distant projection. 75% of technology leaders already face moderate-to-severe technical debt from AI-accelerated coding practices. 25% of engineering time and budget already goes toward managing technical debt — and AI adoption is compounding the problem faster than teams can address it.
Key Data Points
| Metric | Value | Source |
|---|---|---|
| AI models with >50% zero-regression rate | 2 of 18 (Claude Opus only) | SWE-CI, March 2026 |
| Tasks where most agents introduce regressions | 75%+ | SWE-CI, March 2026 |
| Code duplication increase (2021-2024) | 48% (8.3% → 12.3%) | GitClear, 211M lines, Feb 2025 |
| Refactoring decline (2021-2024) | 60%+ (25% → <10%) | GitClear, 211M lines, Feb 2025 |
| AI code issues vs. human code | 1.7x overall | CodeRabbit, n=470 PRs, Dec 2025 |
| AI code performance issues vs. human | 8.0x | CodeRabbit, n=470 PRs, Dec 2025 |
| Security findings spike (AI-assisted repos) | 10x in 6 months | Apiiro, Fortune 50, Dec 2024-Jun 2025 |
| Privilege escalation paths increase | 322% | Apiiro, Fortune 50, 2025 |
| Developer trust in AI code accuracy | Dropped 43% → 29% | JetBrains, n=24,534, 2025 |
| Defect increase prediction (prompt-to-app) | 2,500% by 2028 | Gartner, Dec 2025 |
| Tech leaders facing AI-driven tech debt | 75% | Industry consensus, 2026 |
| Velocity-comprehension gap | 5-7x | Storey framework, Feb 2026 |
| Engineering leaders cutting junior hires | 54% | Industry surveys, 2025-2026 |
What This Means for Your Organization
SWE-CI’s results expose a structural mismatch between how enterprises are evaluating AI coding tools and the actual risks those tools create. Most vendor demos, pilots, and ROI models focus on the first pass: how fast does the tool generate code, does it pass the tests, how many story points does it close? SWE-CI measures the second pass — what happens to that code over the next 233 days — and the answer, for 16 of 18 models tested, is that it degrades.
The practical implication is that AI coding tools shift cost forward in time. The velocity gains are real and immediate. The maintenance costs are real and deferred. If your organization is measuring AI ROI by lines of code, PR velocity, or task completion speed, you are measuring only the benefit side of the equation. The cost side — debugging, refactoring, security remediation, architectural rework — shows up 6-18 months later, often attributed to “tech debt” rather than to the tool that created it.
Three decisions matter now:
First, separate generation from maintenance in your AI strategy. AI tools are measurably effective at generating initial code — SWE-bench scores confirm this. They are measurably ineffective at maintaining code over time — SWE-CI proves this. Use AI for what it does well (drafting, prototyping, boilerplate, test generation) and invest human attention where AI fails (architecture decisions, cross-module integration, long-term design). Organizations treating AI as a replacement for engineering judgment, rather than a supplement to it, are the ones building the largest cognitive debt balances.
Second, budget for the maintenance tail. The evidence suggests a rough rule: for every dollar of velocity gain from AI-generated code, budget $0.50-1.00 for additional code review, security scanning, refactoring, and architectural oversight. GitClear’s refactoring collapse (25% → <10%) tells you that teams are not maintaining the code AI generates. CodeRabbit’s 8x performance deficit and Apiiro’s 322% privilege escalation spike tell you the consequences. If your 2026 AI coding budget allocates 100% to licenses and 0% to code quality governance, expect the 2027 budget to be dominated by remediation.
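The budgeting rule above can be made concrete with a one-line model. The 0.5-1.0 ratio is the article's rough rule of thumb, not a measured constant, and the function name is illustrative:

```python
# Sketch of the maintenance-tail budgeting rule: set aside $0.50-1.00 of
# governance spend (review, scanning, refactoring, architectural oversight)
# for every dollar of expected AI velocity gain.

def net_ai_benefit(velocity_gain_usd, governance_ratio=0.75):
    """Net value after governance spend; ratio defaults to the midpoint
    of the article's 0.5-1.0 range."""
    governance = velocity_gain_usd * governance_ratio
    return velocity_gain_usd - governance
```

At the midpoint ratio, a $100k projected velocity gain nets $25k after governance; teams budgeting zero for governance are implicitly claiming a ratio of 0.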
Third, treat SWE-CI as a vendor evaluation criterion. The 18-model evaluation reveals that not all AI coding tools carry equal maintenance risk. Claude Opus 4.6’s 0.76 zero-regression rate means roughly 1 in 4 maintenance tasks encounter regressions. Most competitors hit 3 in 4. If your organization plans to use AI agents for ongoing development — not just one-off generation — the maintenance profile of the model matters as much as its generation speed. Ask your vendor: what is your model’s SWE-CI EvoScore? If they cannot answer, they have not tested for the risk that matters most.
Sources
- SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration. Chen, J., Xu, X., Wei, H., Chen, C., Zhao, B. Sun Yat-sen University & Alibaba Group. arXiv 2603.03823, March 4, 2026. 100 tasks, 68 repositories, 18 models, 10B+ tokens. Independent academic benchmark; first long-term maintainability evaluation of AI coding agents. High credibility. https://arxiv.org/abs/2603.03823
- GitClear AI Copilot Code Quality 2025 Research. GitClear, February 2025. 211 million changed lines of code, 2020-2024. Independent code analytics firm; large dataset from real commercial and open-source repositories. High credibility. https://www.gitclear.com/ai_assistant_code_quality_2025_research
- State of AI vs. Human Code Generation Report. CodeRabbit, December 17, 2025. 470 open-source GitHub PRs (320 AI, 150 human). Vendor-produced but transparent methodology; moderate-high credibility. AI identification based on behavioral signals, not confirmation — acknowledged limitation. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks. Apiiro, 2025. Fortune 50 enterprise codebases, tens of thousands of repositories. Security vendor research on enterprise customers; moderate-high credibility due to Fortune 50 sample but vendor-produced. https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/
- Gartner Predicts 2026: AI Potential and Risks Emerge in Software Engineering Technologies. Gartner, December 2025. Tier-1 analyst firm; predictions carry institutional weight but are projections, not measured outcomes. High credibility for framing enterprise risk. https://www.armorcode.com/report/gartner-predicts-2026-ai-potential-and-risks-emerge-in-software-engineering-technologies
- Cognitive Debt: The Real Cost of AI-Generated Code. Bobby Blaine, referencing Margaret-Anne Storey’s framework. DEV Community, February 2026. Framework piece, not primary research. Useful for conceptual framing. Moderate credibility. https://dev.to/bobbyblaine/cognitive-debt-the-real-cost-of-ai-generated-code-33ep
- JetBrains Developer Ecosystem Survey 2025. JetBrains, 2025. n=24,534 developers. Large-sample independent survey from a tool vendor with no AI model to sell. High credibility for developer sentiment data. https://www.jetbrains.com/lp/devecosystem-2025/
- The AI Coding Technical Debt Crisis: What 2026-2027 Holds. Pixelmojo, 2026. Compilation of multiple sources. Secondary source aggregating primary research. Moderate credibility. https://www.pixelmojo.io/blogs/vibe-coding-technical-debt-crisis-2026-2027
- Hacker News Discussion on SWE-CI. March 2026. 116 points, 40 comments. Community discussion with practitioner perspectives. Useful for real-world context. https://news.ycombinator.com/item?id=47295537
Created by Brandon Sneider | brandon@brandonsneider.com | March 2026