Measuring What AI Broke: The Second-Order Effects Your Dashboard Does Not Show
Brandon Sneider | March 2026
Executive Summary
- AI-assisted developers produce 98% more pull requests — but review times increase 91%, bugs increase 9%, and organizational delivery does not improve at all. Telemetry from 10,000+ developers across 1,255 teams shows the bottleneck moved from coding to review. The speed evaporated at the next step in the process (Faros AI, 2025-2026).
- Every 25% increase in AI adoption correlates with a 7.2% decrease in delivery stability — while developers self-report feeling more productive. The subjective experience of speed is decoupled from objective measurement. Google’s DORA report found the perception gap is consistent and measurable (Google DORA, October 2024).
- 77% of employees say AI increased their workload, while 96% of executives expect productivity gains. Over 40% of AI users spend more time reviewing AI-generated content than they saved generating it. The editing, validating, and correcting cycle is invisible on executive dashboards (Upwork/Walr, n=2,500, July 2024).
- 89% of CEOs report zero productivity impact from AI over three years. Average executive AI usage is 1.5 hours per week. The macro evidence across approximately 6,000 senior leaders in four countries shows no measurable relationship between AI adoption and output per employee (NBER Working Paper 34836, February 2026).
- The pattern is consistent: AI accelerates one step and breaks the next. Companies that measure only the accelerated step see improvement. Companies that measure end-to-end see none.
The Bottleneck Migration Problem
The most dangerous AI metric is task-level time savings. It answers the wrong question. The right question is not “did AI make this step faster?” but “did the end-to-end process get faster, better, or cheaper?”
The Faros Data
Faros AI’s analysis of telemetry from 10,000+ developers across 1,255 teams — drawing from task management systems, IDEs, code analysis, CI/CD pipelines, version control, and incident management — provides the most granular picture of what happens when AI accelerates one step in a multi-step process.
At the individual level, the gains are real: 21% more tasks completed, 98% more pull requests merged, 47% more PRs touched daily. These are the numbers that appear on executive dashboards. These are the numbers that justify the Copilot license.
At the system level, the gains evaporate: PR review time increased 91%. Average PR size grew 154%. Bugs per developer increased 9%. There is no significant correlation between AI adoption and improvements at the company level across throughput, DORA metrics, or quality KPIs. Companies with heavy AI usage do not ship faster or more reliably than those without.
The root cause is Amdahl’s Law applied to organizations: end-to-end speedup is capped by the share of the process you did not accelerate. AI accelerated coding — the step that was already getting faster every year. It did nothing for review, testing, deployment, and release processes. When developers merge 98% more pull requests and review capacity stays the same, review becomes the bottleneck. The speed does not disappear — it queues.
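To make the queue arithmetic concrete, here is a minimal sketch with hypothetical stage durations (the hours are illustrative assumptions, not Faros figures) showing why halving coding time barely moves end-to-end delivery when review is untouched, and makes it worse when review drags:

```python
# Minimal sketch of Amdahl's Law at the process level.
# The stage durations are hypothetical illustrations, not Faros data.

def cycle_time(stages: dict) -> float:
    """End-to-end calendar time for one unit of work to cross every stage."""
    return sum(stages.values())

before = {"coding": 16, "review": 20, "testing": 8, "release": 6}   # 50 hrs total

# Scenario A: AI halves coding time; every other stage is untouched.
coding_only = {**before, "coding": 8}                                # 42 hrs

# Scenario B: coding is halved, but review grows 91% as PR volume and size rise.
review_drag = {**coding_only, "review": 20 * 1.91}                   # ~60 hrs

print(cycle_time(before))       # 50  -> baseline
print(cycle_time(coding_only))  # 42  -> coding 2x faster, delivery only ~16% faster
print(cycle_time(review_drag))  # ~60 -> coding 2x faster, delivery ~20% slower
```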
The Same Pattern Beyond Software
A global B2B software company deployed generative AI across 2,800 go-to-market employees. Email volume to prospects tripled. Sales teams generated tailored sequences in minutes. The executive dashboard reported a 6% “productivity gain.” But response rates declined steadily, unsubscribes increased, and downstream teams — sales ops, legal, deal desks — were overwhelmed with corrections. Win rates and cycle times remained flat or worsened (Hamilton Mann/IMD, January 2026).
The Atlassian Developer Experience Report (n=3,500 developers across 6 countries, March 2025) quantifies the offset directly: 68% of developers save 10+ hours per week with AI. Fifty percent lose 10+ hours per week to organizational friction — finding information, adopting new technology, and context switching. The net improvement is roughly zero. Meanwhile, the gap between what leaders see (delivery metrics and KPIs) and what practitioners experience (missing documentation, disjointed workflows, tool overload) widened 19 points in a single year.
The Quality Degradation That Compounds
AI-generated output is fast. It is not always durable.
GitClear’s analysis of 211 million changed lines of code from repositories at Google, Microsoft, Meta, and enterprise corporations (January 2020-December 2024) tracks the quality trajectory:
| Metric | 2021 | 2024 | Change |
|---|---|---|---|
| Copy/pasted (cloned) code | 8.3% | 12.3% | +48% |
| Duplicated code blocks | Baseline | 8x higher | +700% |
| Code revised within 2 weeks (“churn”) | 3.1% | 5.7% | +84% |
| Refactoring activity | 25% of changes | Under 10% | -60% |
The 2024 data marks the first year where copy/pasted lines exceeded refactored lines — a structural inversion. AI tools accelerate code production while degrading code maintainability. The debugging, refactoring, and security patching costs accrue on a delayed timeline that the 90-day metrics card never captures.
Security is not improving with model scale. Veracode’s testing of 100+ large language models (July 2025) found 45% of AI-generated code failed security tests and introduced OWASP Top 10 vulnerabilities. Java had a 72% failure rate. Critically, “security performance remained flat regardless of model size or training sophistication” — this is not a problem that gets solved by waiting for the next model version.
Harness’s survey of 500 engineering leaders and developers (January 2025) confirms the downstream cost: 67% spend more time debugging AI-generated code. Sixty-eight percent spend more time resolving AI-related security vulnerabilities. Ninety-two percent say AI tools increase code volume shipped to production but also increase the “blast radius” from bad deployments. Forrester predicts 75% of technology decision-makers will face moderate-to-severe technical debt by 2026, driven specifically by rapid AI-assisted development (Forrester, 2025).
The Perception Gap
The most consistent finding across the research is not that AI fails. It is that people believe it is working when the data says it is not.
METR’s randomized controlled trial — 16 experienced developers completing 246 tasks in mature open-source projects — found developers using AI took 19% longer to complete tasks. Before the study, developers predicted AI would speed them up by 24%. After experiencing the measurable slowdown, they still believed AI had sped them up by 20%. Screen-recording data revealed more idle time during AI-assisted coding: not waiting-for-the-model time but actual no-activity periods. Coding with AI requires less cognitive effort, which makes it feel faster while being slower (METR, July 2025).
Google’s DORA report finds the same pattern at organizational scale: every 25% increase in AI adoption correlates with a 2.1% increase in self-reported productivity and a 2.6% increase in job satisfaction — but a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability (DORA, October 2024).
The Upwork/Walr survey (n=2,500 across four countries, July 2024) reveals the organizational version of this gap: 96% of C-suite leaders expect AI to boost productivity. Seventy-seven percent of employees using AI say it has increased their workload. Forty-seven percent do not know how to achieve the expected productivity gains. Over 40% spend more time reviewing or moderating AI-generated content than they saved generating it.
The executive sees adoption rates climbing. The employee sees workload increasing. Both are right — and neither metric captures the end-to-end picture.
What the Macro Evidence Says
At the organizational and economy-wide level, the AI productivity dividend remains invisible to standard measurement.
An NBER working paper (February 2026) surveying approximately 6,000 CEOs, CFOs, and senior leaders across the U.S., UK, Germany, and Australia found that 89% report zero productivity change over three years. Average AI usage is 1.5 hours per week. Over 90% report no employment impact. Despite this, executives forecast AI will increase productivity by 1.4% over the next three years — a forecast based on hope rather than evidence.
BCG’s 2025 survey confirms the pattern at the firm level: 60% of companies generate no material value from AI. Only 5% qualify as “future-built” with substantial returns. The 5% achieve 5x the revenue increases and 3x the cost reductions — but those gains come from systematic workflow redesign, not from turning on AI tools and measuring task speed (BCG, n=1,250+, September 2025).
Vaccaro et al.'s meta-analysis of 106 experiments with 370 effect sizes, published in Nature Human Behaviour (October 2024), delivers the most counterintuitive finding: on average, human-AI combinations perform worse than the better of humans or AI alone on decision-making tasks. The “human + AI > either alone” assumption is wrong for most structured decision-making contexts.
A separate meta-analysis of 371 empirical estimates from studies published 2019-2024 found no robust, publication-bias-free relationship between AI adoption and labor market outcomes — covering employment, productivity, wages, and skill demand (Santarelli et al., February 2025).
The Three Questions That Catch Second-Order Effects
Standard AI measurement tracks three things: adoption rate, time saved per task, and cost per outcome. These are necessary but insufficient. The research points to three additional questions that detect the problems the dashboard misses.
Question 1: Where did the speed go?
If AI saved 5 hours per week per user, what happened to those 5 hours? Research on invisible productivity shows that saved time disperses into other tasks rather than being redeployed intentionally. Track downstream queue lengths — review time, approval time, QA time, customer response time — not just the AI-assisted step. If the AI-assisted step got 50% faster but the next step’s queue grew 90%, the process did not accelerate.
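A minimal sketch of that check, assuming a PR export with hypothetical column names (`opened_at`, `first_review_at`, `merged_at`) and an illustrative rollout date:

```python
# Did downstream queues grow after the AI rollout? Column names, file name,
# and the rollout date are assumptions; adapt them to your own tracker export.
import pandas as pd

prs = pd.read_csv("pull_requests.csv",
                  parse_dates=["opened_at", "first_review_at", "merged_at"])
rollout = pd.Timestamp("2025-06-01")

prs["review_wait_hrs"] = (prs["first_review_at"] - prs["opened_at"]).dt.total_seconds() / 3600
prs["open_to_merge_hrs"] = (prs["merged_at"] - prs["opened_at"]).dt.total_seconds() / 3600
prs["period"] = prs["opened_at"].ge(rollout).map({False: "pre-AI", True: "post-AI"})

# Median wait per period: a review queue that grew faster than coding sped up
# absorbs the task-level savings before they ever reach delivery.
print(prs.groupby("period")[["review_wait_hrs", "open_to_merge_hrs"]].median())
```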
Question 2: What step is now the bottleneck?
Map the end-to-end process before and after AI deployment. Measure cycle time from initiation to delivery, not task time. The Faros data shows this clearly: developers merged 98% more pull requests, review became the bottleneck, and end-to-end delivery did not change. The companies in BCG’s 5% redesigned the entire workflow. The other 95% accelerated one step and declared success.
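One way to run that mapping, sketched below with hypothetical timestamp columns from a work-item export; the stage holding the largest share of end-to-end cycle time is the current bottleneck:

```python
# Locate the bottleneck: share of end-to-end cycle time spent in each stage.
# Timestamp column names and the file name are assumptions, not a standard schema.
import pandas as pd

items = pd.read_csv("work_items.csv",
                    parse_dates=["created_at", "coding_done_at",
                                 "review_done_at", "deployed_at"])

stages = {
    "coding": ("created_at", "coding_done_at"),
    "review": ("coding_done_at", "review_done_at"),
    "deploy": ("review_done_at", "deployed_at"),
}
for name, (start, end) in stages.items():
    items[name] = (items[end] - items[start]).dt.total_seconds() / 3600

items["cycle_time"] = (items["deployed_at"] - items["created_at"]).dt.total_seconds() / 3600

# The largest share is what delivery is actually waiting on; rerun this before
# and after the AI rollout to see whether the bottleneck moved.
print(items[list(stages)].sum() / items["cycle_time"].sum())
```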
Question 3: Did cross-team handoffs get better or worse?
AI adoption in one department creates friction with non-adopting departments. A sales team generating 3x more proposals creates a 3x review burden on legal. A marketing team producing 3x more content creates a 3x approval burden on compliance. The Atlassian data shows the leader-practitioner perception gap widened 19 points in one year. Survey downstream teams — not just the team using AI — to detect friction before it compounds.
Key Data Points
| Metric | Finding | Source |
|---|---|---|
| PRs merged with AI | 98% increase — but 91% longer reviews, 9% more bugs, zero org-level improvement | Faros AI, 10,000+ developers, 1,255 teams |
| Delivery stability per 25% AI adoption | 7.2% decrease, despite 2.1% self-reported productivity increase | Google DORA, October 2024 |
| Developer speed with AI (RCT) | 19% slower — while believing they were 20% faster | METR, n=16, 246 tasks, July 2025 |
| Employee workload from AI | 77% say AI increased workload; 40%+ spend more time reviewing than saved | Upwork/Walr, n=2,500, July 2024 |
| CEO-reported productivity impact | 89% report zero impact over 3 years | NBER, ~6,000 leaders, February 2026 |
| Companies with material AI value | 5% “future-built”; 60% report minimal gains | BCG, n=1,250+, September 2025 |
| Code duplication growth | 8x increase in duplicated blocks; churn up 84% | GitClear, 211M lines, 2020-2024 |
| AI code security failures | 45% of AI-generated code fails security tests | Veracode, 100+ models, July 2025 |
| AI time savings vs. friction | 68% save 10+ hrs/week; 50% lose 10+ hrs/week to friction — net zero | Atlassian, n=3,500, March 2025 |
| Human-AI team performance | Worse than the better of humans or AI alone on decision tasks | Vaccaro et al., 106 experiments, Nature, October 2024 |
What This Means for Your Organization
The single most expensive AI measurement mistake is tracking only the step AI accelerated. Every executive dashboard in 2026 reports adoption rates and time savings. Almost none tracks what happened to the steps before and after the AI-assisted task.
The three-question framework above — where did the speed go, what is now the bottleneck, and did handoffs improve or degrade — takes 30 minutes to apply to any AI deployment. It requires no new tools. It requires asking the people downstream of the AI-assisted process whether their work got easier or harder. The answer, based on the evidence, is usually harder — and that answer is the starting point for capturing the value the dashboard says is already there.
The companies in BCG’s 5% did not achieve superior returns by deploying better AI tools. They achieved them by redesigning workflows around the AI — which starts with measuring the entire process, not just the automated step. The 60% that generate no material value are measuring the right tool and the wrong system.
If these findings raised questions about whether your AI metrics are capturing what matters, I would welcome that conversation — brandon@brandonsneider.com.
Sources
- Faros AI — “AI Software Engineering Productivity Paradox” (July 2025, updated March 2026). Telemetry from 10,000+ developers across 1,255 teams. Vendor-funded but negative finding against commercial interest. Credibility: High. https://www.faros.ai/blog/ai-software-engineering
- Google DORA — “Accelerate State of DevOps Report 2024” (October 2024). Annual global survey, methodology established over 10 years. Independent methodology, Google Cloud sponsorship. Credibility: Very High. https://dora.dev/research/2024/dora-report/
- METR — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (July 2025). Pre-registered RCT, n=16 developers, 246 tasks. Independent AI safety research. Credibility: Very High — gold-standard methodology. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- NBER — Working Paper 34836: “Firm Data on AI” (February 2026). ~6,000 CEOs, CFOs, and senior leaders across U.S., UK, Germany, Australia. Stanford/University of Chicago/Bank of England researchers. Credibility: Very High. https://www.nber.org/papers/w34836
- Vaccaro et al. — “Human-AI Decision-Making” in Nature Human Behaviour (October 2024). Pre-registered meta-analysis, 106 experiments, 370 effect sizes. Credibility: Very High — top journal, no commercial interest. https://www.nature.com/articles/s41562-024-02024-1
- BCG — “The Widening AI Value Gap” (September 2025). n=1,250+ firms worldwide. Independent consulting. Credibility: High. https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
- Upwork Research Institute / Walr (July 2024). n=2,500 (1,250 C-suite, 625 employees, 625 freelancers). Independent research firm methodology. Credibility: Medium-High — vendor-commissioned, independent execution. https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c
- GitClear — AI Assistant Code Quality Research (February 2025). 211 million changed lines, repositories from Google, Microsoft, Meta, and enterprise corporations. Vendor with transparent methodology. Credibility: High — massive dataset. https://www.gitclear.com/ai_assistant_code_quality_2025_research
- Atlassian — “State of Developer Experience 2025” (March 2025). n=3,500 developers and managers across 6 countries. Vendor research. Credibility: Medium-High. https://www.atlassian.com/blog/developer/developer-experience-report-2025
- Veracode — GenAI Code Security Report (July 2025). 100+ LLMs tested. Vendor (application security). Credibility: Medium — vendor, but finding that security doesn’t improve with model scale is independently corroborated. https://www.veracode.com/blog/genai-code-security-report/
- Harness — State of Software Delivery Report (January 2025). n=500 engineering leaders and developers. Vendor research. Credibility: Medium. https://www.harness.io/state-of-software-delivery
- Santarelli, Carbonara, and Tripathi — AI labor market meta-analysis (February 2025). 371 estimates from empirical studies, 2019-2024. Independent academic. Credibility: High. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5126345
- Hamilton Mann / IMD — “The AI Productivity Illusion” (January 2026). B2B case study, 2,800 employees. Academic/business school. Credibility: Medium-High — case study unnamed but pattern consistent. https://www.imd.org/ibyimd/artificial-intelligence/the-ai-productivity-illusion/
- ManpowerGroup — Global Talent Barometer 2026 (January 2026). ~14,000 workers across 19 countries. Independent staffing firm. Credibility: High. https://www.manpowergroup.com/en/insights/report/global-talent-barometer-january-2026
- Gruda & Aeon — “Seven Myths about AI and Productivity,” California Management Review (October 2025). Peer-reviewed academic review. Credibility: Very High. https://cmr.berkeley.edu/2025/10/seven-myths-about-ai-and-productivity-what-the-evidence-really-says/
- Forrester — 2025 Predictions on Technical Debt (2025). Independent analyst. Credibility: High. https://www.forrester.com/press-newsroom/forrester-predictions-2025-tech-security/
Brandon Sneider | brandon@brandonsneider.com | March 2026