Are Your AI Metrics Real? Three Signs You Are Measuring Value — and Three Signs You Are Measuring Theater
Brandon Sneider | March 2026
Executive Summary
- Developers using AI believe they are 20% faster. They are actually 19% slower. METR’s randomized controlled trial (n=16 experienced open-source developers, 246 tasks, July 2025) is the only independent RCT on AI coding productivity. The perception gap is not a bug in the study — it is the core finding. If your people believe AI is helping and your numbers confirm it, you may be measuring their belief, not their output.
- AI-assisted teams produce 98% more pull requests, but organizational delivery metrics do not improve. Faros AI’s telemetry analysis (10,000+ developers, 1,255 teams, 2025-2026) shows individual activity metrics surge while DORA metrics — lead time, deployment frequency, change failure rate — stay flat. The bottleneck moved. The dashboard did not.
- 72% of AI investments are destroying value through waste, and only 29% of organizations can measure ROI confidently. Larridin’s State of Enterprise AI Report (2025) finds most companies cannot answer “is AI working?” because they are tracking the wrong numbers. The measurement gap is not a data problem. It is a design problem.
- The 5% capturing real AI value measure differently from the 95% that do not. BCG’s survey of 1,250+ firms (September 2025) finds that “future-built” companies expect twice the revenue increase and 40% greater cost reductions — not because they bought better tools, but because they built measurement systems that connect tool usage to business outcomes.
The Perception-Reality Gap: Why This Matters Now
Goodhart’s Law — “when a measure becomes a target, it ceases to be a good measure” — is now the defining challenge of enterprise AI measurement.
The pattern repeats across every function. In software development, GitHub Copilot generates code faster, so teams track lines of code and pull requests. In customer service, AI drafts responses faster, so teams track tickets closed per hour. In marketing, AI produces content faster, so teams track pieces published per week. In every case, the activity metric goes up. In most cases, the outcome metric does not move.
The METR study makes the mechanism precise. Experienced developers working on their own repositories — code they know intimately — were 19% slower with AI tools. Before starting, they predicted AI would save 24% of their time. After finishing, they still believed it saved 20%. The gap between perception and reality persisted even after direct experience.
This is not a developer problem. It is a human cognition problem. AI tools feel productive. They generate output. They eliminate blank-page paralysis. The experience of using them is genuinely different from the experience of working without them. But “feels faster” is not “is faster,” and most enterprise dashboards cannot tell the difference.
Three Signs Your AI Metrics Are Real
1. Your metrics measure end-to-end outcomes, not step-level activity
The Faros AI data reveals the critical distinction. At the individual level, AI-assisted developers complete 21% more tasks and merge 98% more pull requests. At the system level, delivery metrics stay flat — because review time increased 91%, PR size grew 154%, and bugs per developer increased 9%.
The diagnostic question: Can you trace your AI metrics from tool usage to a business outcome a CFO would recognize? “More pull requests” is activity. “Reduced time-to-market for features that drive revenue” is an outcome. If your dashboard stops at the activity layer, it is measuring theater.
What the 5% do differently: They apply Amdahl’s Law to their processes. Before deploying AI, they map the entire workflow and identify the actual constraint. Then they measure whether AI moved the constraint or just accelerated a step that was not the bottleneck. Google’s 2024 DORA report confirms: for every 25% increase in AI adoption, delivery stability decreases 7.2% in organizations that do not address the downstream constraint.
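To make the constraint arithmetic concrete, here is a minimal sketch in Python. The stage durations are hypothetical assumptions chosen only for illustration; the single empirical input is the 91% review-time increase from the Faros AI data above.

```python
# A minimal sketch of Amdahl's Law applied to a delivery pipeline.
# All stage durations are hypothetical and exist only to illustrate
# why a step-level speedup rarely shows up in end-to-end lead time.

# Assumed hours of lead time per feature, by stage.
stages = {
    "write_code": 8,
    "code_review": 16,
    "qa_and_test": 12,
    "release_and_deploy": 4,
}

def end_to_end_speedup(before_hours: dict, after_hours: dict) -> float:
    """Amdahl's Law as a ratio: total time before / total time after."""
    return sum(before_hours.values()) / sum(after_hours.values())

# Case 1: AI makes code authoring 2x faster and nothing else changes.
faster_authoring = {**stages, "write_code": stages["write_code"] / 2}
print(end_to_end_speedup(stages, faster_authoring))  # ~1.11x, not 2x

# Case 2: authoring is 2x faster, but review time grows 91%
# (the increase Faros AI observed). Lead time gets worse, not better.
review_drag = {**faster_authoring, "code_review": stages["code_review"] * 1.91}
print(end_to_end_speedup(stages, review_drag))       # ~0.79x: net slower
```

Under these assumed durations, doubling authoring speed improves end-to-end lead time by roughly 11%, and the observed growth in review time is enough to erase the gain entirely. That is the arithmetic behind a dashboard showing more pull requests and flat DORA metrics.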
2. Your metrics include quality and rework, not just speed and volume
GitClear’s analysis of 211 million changed lines of code (January 2020-December 2024) tracks what happens when speed becomes the primary metric. Code churn — newly added lines revised within two weeks — increased from 5.5% in 2020 to 7.9% in 2024, a 44% relative increase. Copy-pasted code surged from 8.3% to 12.3%. Refactored code collapsed from 24.1% to 9.5%.
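The relative changes quoted in the Key Data Points table below follow directly from these shares. A quick sketch of the arithmetic, assuming nothing beyond the figures just quoted:

```python
# How the relative changes in the Key Data Points table follow from
# GitClear's raw shares. The before/after percentages are the figures
# quoted above; the arithmetic is the only thing added here.
def relative_change(before_pct: float, after_pct: float) -> float:
    """Percentage change of the share itself, not the percentage-point delta."""
    return (after_pct - before_pct) / before_pct * 100

print(round(relative_change(5.5, 7.9)))    # +44  -> code churn up 44%
print(round(relative_change(8.3, 12.3)))   # +48  -> copy-paste up 48%
print(round(relative_change(24.1, 9.5)))   # -61  -> refactoring down 61%
```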
The speed metric improved. The maintainability metric degraded. Organizations tracking only the first number concluded AI was working. Organizations tracking both numbers saw the trade-off.
The diagnostic question: For every speed metric on your AI dashboard, is there a corresponding quality metric? If you track tickets resolved per hour, do you also track ticket reopens, escalations, and customer satisfaction on AI-assisted interactions? If you track content produced per week, do you also track revision cycles, compliance flags, and engagement rates?
What the 5% do differently: BCG’s future-built companies measure both throughput and durability. They track not just “did AI make this faster?” but “did the faster output hold up under scrutiny, or did it create downstream rework that consumed the time saved?”
3. Your metrics survived a baseline comparison, not just a before-after narrative
RAND Corporation data on 2,400+ enterprise AI initiatives (2025-2026) provides the starkest finding on measurement discipline: projects with clear pre-deployment success metrics achieve a 54% success rate. Projects without defined metrics achieve 12%. The difference is 4.5x — created entirely by the act of writing down what “working” looks like before deployment.
The diagnostic question: Did you document cost per transaction, hours per process cycle, and error rates before AI deployment? Or did you deploy first and then look for evidence that it helped? The second approach guarantees you will find what you are looking for, because human pattern-matching fills gaps with confirmation bias.
What the 5% do differently: They run a 30-day baseline sprint before any AI deployment. They measure the current state of every workflow AI will touch — not because they doubt AI, but because they cannot calculate ROI without a denominator. McKinsey’s 2025 State of AI survey (n=1,993) identifies this practice as the single strongest predictor of AI bottom-line impact.
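A minimal sketch of what the baseline sprint buys you, using purely hypothetical numbers (the workflow, costs, and volumes below are illustrative assumptions, not figures from the studies cited):

```python
# A minimal sketch of the baseline-as-denominator idea. Every figure here is
# a hypothetical assumption for illustration, not data from any cited study.
from dataclasses import dataclass

@dataclass
class WorkflowBaseline:
    """Current-state numbers captured during the 30-day baseline sprint."""
    cost_per_transaction: float   # dollars, before AI deployment
    hours_per_cycle: float        # hours per process cycle, before AI
    error_rate: float             # share of outputs needing rework, before AI

def monthly_roi(baseline: WorkflowBaseline,
                post_cost_per_transaction: float,
                monthly_volume: int,
                monthly_ai_spend: float) -> float:
    """ROI = net monthly benefit / monthly AI spend.

    The baseline supplies the 'before' side of the subtraction. Deploy first
    and there is no defensible number to put here afterwards.
    """
    gross_saving = (baseline.cost_per_transaction
                    - post_cost_per_transaction) * monthly_volume
    return (gross_saving - monthly_ai_spend) / monthly_ai_spend

# Hypothetical invoice-processing workflow, measured before deployment.
invoices = WorkflowBaseline(cost_per_transaction=12.40,
                            hours_per_cycle=1.5,
                            error_rate=0.04)
print(monthly_roi(invoices, post_cost_per_transaction=9.10,
                  monthly_volume=5_000, monthly_ai_spend=6_000))  # 1.75
```

The structural point is the first argument: without the documented baseline there is nothing to subtract from, and any post-deployment "saving" is an estimate resting on the same self-report bias discussed below. The quality fields matter as much as the cost field; if the error rate rises after deployment, the rework cost belongs on the other side of the subtraction.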
Three Signs Your AI Metrics Are Performative
1. Your metrics measure adoption, not impact
Adoption rate is the most dangerous vanity metric in enterprise AI. A 97% adoption figure tells leadership that employees logged in. It does not tell leadership whether the tool created value or whether employees are using it productively, compliantly, or reluctantly.
PYMNTS.com (March 2026) reports that some companies now track AI adoption by token consumption — the volume of AI interactions per employee. OpenAI reports average reasoning token consumption per organization increased 320x in the past 12 months. But token volume reflects prompting behavior, not business value. High token usage often indicates inefficient prompting or runaway agentic workflows, not high-quality output.
The red flag: If your AI dashboard’s top metric is “percentage of employees using AI tools” or “average sessions per user per week,” you are measuring compliance, not value. Microsoft’s Katy George has publicly shifted the company’s internal measurement away from adoption rates to performance outcomes. Zapier’s Brandon Sammut called high adoption rates “meaningless for business results.” The vendors selling the tools do not trust this metric. Neither should you.
2. Your metrics rely on self-reported time savings
The METR RCT is definitive on this point: developers believed they were 20% faster when they were in fact 19% slower. The Upwork/Walr survey (n=2,500, July 2024) finds that 77% of employees say AI increased their workload, while 96% of executives expect productivity gains. Atlassian’s developer experience study finds that 68% of developers report saving 10+ hours per week, but 50% report losing an equivalent amount to organizational friction that AI created or exposed.
The red flag: If your ROI case depends on employees estimating how much time AI saves them, your ROI case is built on the same cognitive bias the METR study measured. People are bad at estimating time. They are worse at estimating time when using a tool that feels fast but generates output requiring extensive review, correction, and validation. Over 40% of AI users spend more time reviewing AI-generated content than they saved generating it.
3. Your metrics cannot explain why the P&L has not moved
BCG’s September 2025 survey of 1,250+ firms finds 60% generate “no material value” from AI despite active investment. McKinsey reports that 88% of organizations use AI in at least one function, but only 39% see any impact on EBIT, and among those that do, the impact is most often less than 5% of EBIT. The NBER study (n=5,956 executives, February 2026) across the U.S., U.K., Germany, and Australia finds 89% report zero measurable AI impact on labor productivity over three years.
The red flag: If your AI dashboard shows improvement but your P&L does not, one of two things is true: either the gains are real but too small to register at the enterprise level (in which case you are overinvesting), or the gains are not real and the dashboard is measuring the wrong thing. HBR’s Fall 2025 study identifies the mechanism: 30% of the workforce are “Disruptors” — employees who use AI heavily but score 4.6 on a 5-point resistance scale. They are performing adoption to avoid being seen as resisters. Usage is up. Value is not.
Key Data Points
| Signal | Metric | Source |
|---|---|---|
| Perception-reality gap | Developers believe 20% faster, actually 19% slower | METR RCT, n=16 experienced devs, 246 tasks, July 2025 |
| Bottleneck migration | 98% more PRs, 0% delivery improvement, +91% review time | Faros AI, 10,000+ devs, 1,255 teams, 2025-2026 |
| Quality degradation | Code churn up 44%, copy-paste up 48%, refactoring down 61% | GitClear, 211M lines analyzed, 2020-2024 |
| Measurement discipline | 54% success with pre-set metrics vs. 12% without | RAND Corporation, 2,400+ initiatives, 2025-2026 |
| The value gap | 60% of firms get no material value; 5% capture substantial value | BCG, 1,250+ firms, September 2025 |
| Self-report unreliability | 77% say AI increased workload; 96% of executives expect gains | Upwork/Walr, n=2,500, July 2024 |
| Stability cost | 7.2% decrease in delivery stability per 25% AI adoption increase | Google DORA, October 2024 |
| Token consumption trap | 320x increase in token usage per org in 12 months | OpenAI, reported via PYMNTS, March 2026 |
| Adoption illusion | 88% use AI in one function; only 39% see EBIT impact | McKinsey State of AI, n=1,993, 2025 |
What This Means for Your Organization
The executives who read the METR finding and immediately wonder “how do I know my team isn’t doing the same thing?” are asking the right question. The answer is not to distrust AI — the 5% capturing real value prove it works. The answer is to distrust metrics that measure activity instead of outcomes.
A practical starting point: take your current AI dashboard and apply three filters. First, for every metric, ask whether it measures something a CFO would recognize as value — revenue, cost, margin, risk, speed-to-market — or something that only makes sense inside the AI deployment conversation. Second, check whether any metric depends on employees self-reporting time saved; if it does, treat it as directional, not definitive. Third, verify that you documented a baseline before deployment; if you did not, your “improvement” number has no denominator and cannot be validated.
The organizations that get this right do not need more sophisticated dashboards. They need fewer metrics that connect to the P&L, measured against baselines established before deployment, with quality and rework tracked alongside speed and volume. If this diagnostic raised questions about whether your current measurement is capturing real value or documenting theater, I would welcome the conversation — brandon@brandonsneider.com.
Sources
- METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” July 2025. RCT, n=16 experienced open-source developers, 246 tasks. Independent research organization. High credibility: only independent RCT on AI developer productivity. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- Faros AI, “The AI Productivity Paradox,” 2025-2026. Telemetry analysis, 10,000+ developers, 1,255 enterprise engineering teams. Vendor research with proprietary data access. Moderate-high credibility: large sample, objective telemetry, but vendor has product interest in the finding. https://www.faros.ai/ai-productivity-paradox
- GitClear, “AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones,” 2025. 211 million changed lines of code analyzed, January 2020-December 2024. Independent analytics firm. High credibility: large longitudinal dataset, objective code metrics. https://www.gitclear.com/ai_assistant_code_quality_2025_research
- Google DORA, “State of AI-Assisted Software Development 2025,” December 2025. Industry survey with telemetry validation. High credibility: Google-backed, peer-reviewed methodology, 10+ year track record. https://dora.dev/research/2025/dora-report/
- BCG, “The Widening AI Value Gap: Build for the Future 2025,” September 2025. Survey of 1,250+ firms worldwide. High credibility: independent consulting research, large global sample. https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
- McKinsey, “The State of AI in 2025,” November 2025. n=1,993 respondents across 105 nations. High credibility: annual longitudinal survey, large sample. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- RAND Corporation / Pertama Partners, enterprise AI initiative analysis, 2025-2026. 2,400+ enterprise AI initiatives tracked. High credibility: independent research, large sample of real deployments. Cited via Larridin State of Enterprise AI Report.
- Larridin, “The AI ROI Measurement Framework: From Vibe-Based Spending to Measurable Business Value,” 2025. State of Enterprise AI Report. Moderate credibility: industry analyst, methodology not fully disclosed. https://larridin.com/blog/ai-roi-measurement
- NBER Working Paper 34836, “The Impact of AI on the Economy,” February 2026. n=5,956 executives across U.S., U.K., Germany, Australia. High credibility: peer-reviewed academic research, large multinational sample. Referenced via multiple secondary sources.
- Upwork/Walr, “How Executives and Employees Really Feel About AI,” July 2024. n=2,500 respondents including executives, full-time employees, and freelancers. Moderate credibility: commissioned survey, reasonable sample size. Referenced via PYMNTS.com.
- PYMNTS.com, “AI Adoption Is Being Measured in Tokens, but the Metric Falls Short, Experts Say,” March 2026. Industry reporting with expert commentary. Moderate credibility: trade journalism, multiple expert sources. https://www.pymnts.com/artificial-intelligence-2/2026/ai-adoption-is-being-measured-in-tokens-but-the-metric-falls-short-experts-say/
- IT Revolution, “AI’s Mirror Effect: How the 2025 DORA Report Reveals Your Organization’s True Capabilities,” 2025. Analysis of DORA findings. Moderate-high credibility: respected DevOps publisher, based on primary DORA data. https://itrevolution.com/articles/ais-mirror-effect-how-the-2025-dora-report-reveals-your-organizations-true-capabilities/
Brandon Sneider | brandon@brandonsneider.com | March 2026