
McKinsey on AI in Software Engineering Productivity: A Critical Analysis

Research compiled March 2026 for Foley Hoag AI Consulting Project

Executive Summary

  • McKinsey has published the most extensive body of research on AI-assisted developer productivity of any consulting firm, spanning controlled experiments, enterprise surveys (1,993 respondents, 105 nations), and executive interviews — but their methodology has drawn significant criticism from the engineering community
  • Headline finding: developers complete coding tasks up to 2x faster with AI — documentation in half the time, new code in nearly half the time, refactoring in about two-thirds the time — but gains diminish sharply on high-complexity tasks
  • The “adoption intensity” finding is their most actionable data point: companies with 80-100% developer adoption saw >110% productivity gains vs. 25% for average adopters — suggesting the ROI inflection point is universal rollout, not selective pilots
  • McKinsey’s recommended metrics framework builds on DORA and SPACE but adds “opportunity-focused” layers — and was heavily criticized by Kent Beck and Gergely Orosz for measuring output/effort rather than outcomes/impact
  • The “Unlocking the Value” study (Nov 2025) is their most mature work: identifies that top performers are 6-7x more likely to scale AI across 4+ use cases, see 16-45% improvements across quality/speed/productivity, and consistently pair tools with workflow redesign

McKinsey’s Major AI Developer Productivity Publications

McKinsey has produced a progression of increasingly sophisticated publications on this topic:

1. “The Economic Potential of Generative AI” (June 2023)

The foundational report that framed the market. Estimated $2.6-4.4 trillion in annual value from generative AI, with software engineering identified as one of four areas capturing ~75% of that value. Estimated the direct impact of AI on software engineering productivity at 20-45% of current annual spending — the number that launched a thousand consulting engagements.

Source: McKinsey — The Economic Potential of Generative AI

2. “Yes, You Can Measure Software Developer Productivity” (August 2023)

The controversial framework that proposed measuring developer productivity through three layers:

  • Inner loop metrics: Coding speed, code quality, iteration cycles
  • Outer loop metrics: Testing, integration, deployment efficiency
  • Team/organizational metrics: Planning, collaboration, throughput

This publication triggered the most significant backlash (see “Controversy” section below).

Source: McKinsey — Yes, You Can Measure Developer Productivity

3. “Unleashing Developer Productivity with Generative AI” (2023, updated through 2025)

McKinsey’s only controlled experiment on AI coding productivity. This is the study most frequently cited — and the one with the most granular task-level data.

Source: McKinsey — Unleashing Developer Productivity

4. “How Generative AI Could Accelerate Software Product Time to Market” (2024)

Extended the lens beyond developers to product managers. Found gen AI accelerated product time to market by 5%, improved PM productivity by 40%, and doubled employee-experience satisfaction scores. Based on empirical research with PMs in Europe and the Americas.

Source: McKinsey — How GenAI Could Accelerate Time to Market

5. “Unlocking the Value of AI in Software Development” (November 2025)

The most recent and most thorough study, covering productivity, quality, speed, and developer satisfaction. Surveyed ~300 senior leaders at publicly traded companies, of whom 100 assessed impact across four outcomes. Introduces the “two shifts, three enablers” framework.

Source: McKinsey — Unlocking the Value of AI in Software Development

6. “Measuring AI in Software Development” — Interview with Jellyfish CEO (December 2025)

McKinsey Senior Partner Martin Harrysson and Partner Prakhar Dixit interviewed Andrew Lau (Jellyfish CEO) on how the product development lifecycle is transforming, measurement challenges, and the next phase of developer productivity.

Source: McKinsey — Measuring AI in Software Development

7. “The State of AI in 2025” (March 2025)

The annual flagship survey (1,993 participants, 105 nations). Found only 5.5% of companies (109 of 1,993) drive significant value from AI. Software engineering is the top function for scaling AI agents, at 10% of organizations.

Source: McKinsey — The State of AI in 2025


Key Data Points: The Controlled Experiment

The “Unleashing Developer Productivity” study is McKinsey’s most methodologically rigorous work on AI coding productivity. Key details:

Study Design

  • Developers performed three garden-variety software tasks: refactor code into microservices, build new application functionality, and document code capabilities
  • Test/control design: Each developer participated in the test group (with AI tools) for half the tasks and the control group (no AI) for the other half
  • Participants had access to two tools: one using a general-purpose foundation model (prompt-based) and one using a fine-tuned model trained specifically on code
  • Sample size not publicly disclosed — a notable methodological weakness

Productivity Gains by Task Type

Time reduction with AI, by task:

  • Code documentation: ~50% reduction (half the time)
  • Writing new code: ~46% reduction (nearly half the time)
  • Code refactoring: ~35% reduction (about two-thirds the time)
  • High-complexity tasks: significantly smaller gains

Data Collection Methods

  1. Demographics survey: years of experience, expertise, prior knowledge
  2. Time tracking: start time, end time, break times recorded by participants
  3. Task surveys: perceived complexity and developer experience
  4. Judge evaluations: code demos assessed by judges for successful submissions
  5. Automated code quality review: open-source platform assessing readability, maintainability, bug detection
  6. Post-experiment survey: impressions of tools and experience

Code Quality Finding

Code quality (bugs, maintainability, readability) was marginally better in AI-assisted code — but developers actively iterated with the tools to achieve that quality. This signals AI is an augmentation tool, not a replacement.

Developer Satisfaction

Developers using AI tools were more than 2x as likely to report overall happiness, fulfillment, and entering a “flow” state.


Key Data Points: Enterprise Survey Findings

The Adoption-to-Impact Curve (from “Unlocking the Value,” Nov 2025)

Productivity gain by adoption level:

  • Average (all organizations tracked): 25%+ improvement
  • 80-100% developer adoption: >110% improvement
  • Top performers (4+ use cases at scale): 16-30% gains in productivity and time to market; 31-45% in quality

The critical insight: the relationship between adoption percentage and productivity gain is non-linear. There is a step-function increase once organizations cross the ~80% adoption threshold. This argues strongly against pilot programs and for organization-wide rollout.
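This argument can be made concrete with a small model. The sketch below is illustrative, not McKinsey's model: the two anchor points (25% gain at average adoption, >110% at 80-100% adoption) come from the report, but the curve shape between them — and the assumption that "average" adoption sits near 50% — are assumptions for illustration.

```python
def productivity_gain(adoption: float) -> float:
    """Hypothetical step-style gain curve anchored on the two reported points.

    adoption: fraction of developers actively using AI tools (0.0-1.0).
    Returns the modeled productivity improvement as a fraction (1.10 = 110%).
    """
    if adoption >= 0.80:
        return 1.10  # report: >110% improvement at 80-100% adoption
    # Assumption: below the threshold, gains scale roughly linearly,
    # passing through ~25% gain at ~50% ("average") adoption.
    return 0.25 * (adoption / 0.50)

# Why extrapolating from a small pilot is structurally wrong under a
# step-function curve: a 15% pilot predicts a fraction of the gain that
# actually materializes at 90% adoption.
pilot_estimate = productivity_gain(0.15)
full_rollout = productivity_gain(0.90)
print(f"pilot-based estimate: {pilot_estimate:.1%}, full rollout: {full_rollout:.1%}")
```

Linear extrapolation from the pilot would never predict the post-threshold jump; that is the whole case against selective rollouts.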

High Performers vs. Laggards

Top performers vs. bottom performers:

  • Use cases at scale (4+): nearly two-thirds of top performers vs. only 10% of bottom performers
  • Likelihood to scale AI: 6-7x higher for top performers
  • Hands-on workshops/coaching: 57% vs. 20%
  • Track quality improvements: 79% vs. not reported
  • Track speed gains: 57% vs. not reported

Broader State of AI Data (2025 Survey, n=1,993)

  • 88% of organizations deploy AI in at least one function
  • 72% use gen AI regularly
  • ~1/3 have begun genuine scaling (rest stuck in pilot/experiment)
  • Only 5.5% of companies drive significant value from AI
  • 90%+ of software teams use AI for refactoring, modernization, and testing
  • Average savings: 6 hours/week per developer
  • 10-20% cost reductions reported in software engineering, manufacturing, and IT

What They Recommend

McKinsey built their framework on two established standards:

  1. DORA metrics (Google, 2014): deployment frequency, lead time for changes, change failure rate, mean time to recovery
  2. SPACE framework (2021): satisfaction, performance, activity, communication, efficiency
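
The four DORA metrics can be computed directly from deployment and incident records. A minimal sketch, assuming a simple record schema (the field names here are illustrative, not a standard):

```python
from datetime import datetime

# Hypothetical deployment log: timestamp, commit-to-deploy lead time in
# hours, whether the change failed in production, and hours to restore
# service when it did. Field names are assumptions for illustration.
deploys = [
    {"at": datetime(2025, 3, 1), "lead_time_h": 20, "failed": False, "restore_h": None},
    {"at": datetime(2025, 3, 3), "lead_time_h": 30, "failed": True,  "restore_h": 4},
    {"at": datetime(2025, 3, 8), "lead_time_h": 10, "failed": False, "restore_h": None},
]

span_days = (deploys[-1]["at"] - deploys[0]["at"]).days or 1

# 1. Deployment frequency: deployments per day over the observed window
deploy_frequency = len(deploys) / span_days

# 2. Lead time for changes: mean commit-to-production time (hours)
lead_time = sum(d["lead_time_h"] for d in deploys) / len(deploys)

# 3. Change failure rate: fraction of deployments causing a failure
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# 4. Mean time to recovery: average restore time across failed deployments
failures = [d for d in deploys if d["failed"]]
mttr = sum(d["restore_h"] for d in failures) / len(failures)
```

The point of this outcome-oriented base layer is that it measures the delivery pipeline, not individual developer output — which is exactly the distinction at issue in the controversy discussed below.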

McKinsey adds a multi-layer measurement approach:

Layer 1 — Adoption Metrics

  • AI feature adoption rates
  • Tool usage frequency and breadth
  • Percentage of developers actively using AI tools

Layer 2 — Throughput & Process Efficiency

  • Pull request rate and cycle time
  • Latency in development pipeline
  • Sprint velocity changes

Layer 3 — Outcome Metrics

  • Software quality (defect rates, production incidents)
  • Time to market (feature delivery speed)
  • Customer satisfaction
  • Business objectives met

What They Explicitly Warn Against

  • “Percentage of code generated by AI” — a weak proxy that offers little insight into real productivity
  • Story points completed — can be gamed and doesn’t capture value delivered
  • Lines of code — long-discredited but still tempting for executives

The Jellyfish Interview Framework (Dec 2025)

McKinsey partnered with Jellyfish (engineering intelligence platform) to articulate a practical measurement approach:

  • Connect data across planning tools, code repositories, and AI usage logs
  • Create a consistent view of performance to identify bottlenecks
  • Overlay outcome metrics (productivity, speed, quality) with input metrics (AI adoption, defect detection)
  • Use the Jellyfish AI Impact Framework to track adoption, productivity, and outcomes

The “Two Shifts, Three Enablers” Framework (Nov 2025)

McKinsey’s most recent organizational framework for AI-driven software development:

Two Key Shifts

  1. Shift from task-level AI to lifecycle-level AI: Embed AI across the entire development lifecycle — from design and coding to testing, deployment, and monitoring — not just code completion
  2. Shift from tool adoption to organizational transformation: Redesign processes, roles, and team structures around AI capabilities

Three Critical Enablers

  1. Upskilling: Hands-on workshops and one-on-one coaching (57% of top performers vs. 20% of bottom performers)
  2. Impact Measurement: Track quality improvements (79% of top performers) and speed gains (57%) with connected data systems
  3. Change Management: Treat AI adoption as organizational transformation, not technology deployment

Performance Gap

A 15 percentage-point performance gap exists between top and bottom performers. Top performers demonstrate:

  • Higher artifact consistency and quality
  • Shorter sprint cycles
  • Smaller team sizes
  • Higher customer satisfaction scores
  • Nearly two-thirds use at least three of five key factors vs. only 10% of bottom performers

Talent and Workforce Implications

From McKinsey’s “Gen AI Skills Revolution” and related publications:

  • 65% of respondents regularly use gen AI, but only 13% systematically use gen AI in software engineering
  • Junior and midlevel roles are shrinking as automation takes hold
  • Increased need for senior/staff engineers who can navigate complex architecture and review AI-generated code
  • AI will push developers toward full-stack proficiency — front-end developers will gradually transition to full-stack roles
  • New roles emerging: prompt engineers, agent coaches, gen AI safety leads
  • Companies adopting AI earlier place greater emphasis on talent development — two-thirds already have a strategic approach to future talent needs
  • McKinsey predicts developers will gain ~3 hours/day back by 2030 through AI assistance

Source: McKinsey — The Gen AI Skills Revolution


The Controversy: What McKinsey Got Wrong

McKinsey’s developer productivity framework drew sharp criticism from respected engineering leaders. This context is essential for any consulting engagement.

Kent Beck & Gergely Orosz Critique (August 2023)

The most prominent response came from Kent Beck (creator of Extreme Programming, co-author of the Agile Manifesto) and Gergely Orosz (The Pragmatic Engineer newsletter). Their two-part critique argued:

  1. 4 out of 5 new metrics McKinsey proposed measure effort or output — not outcomes or impact. The framework “only measures effort or output, not outcomes and impact, which misses half of the software developer lifecycle.”
  2. McKinsey’s framework concentrates on a narrow definition of developer productivity: building, coding, and testing. This ignores the design, architecture, collaboration, and problem-definition activities where senior developers create the most value.
  3. The framework could harm engineering teams if implemented as-is, incentivizing local optimization (faster coding) at the expense of global outcomes (better products).

Source: Gergely Orosz — Measuring Developer Productivity? A Response to McKinsey
Source: Kent Beck — Measuring Developer Productivity? A Response to McKinsey
Source: LeadDev — What McKinsey Got Wrong

The SPACE Framework Authors’ Position

The original SPACE framework authors specifically warned against measuring developers by story points completed or lines of code — yet McKinsey’s framework effectively quantifies activity-level metrics in ways SPACE cautioned against.

McKinsey’s Response

McKinsey published a follow-up (“Re:think: Can software developer productivity really be measured?”) acknowledging the debate but largely standing by their framework, emphasizing that their approach was intended for executive audiences, not engineering teams.

Source: McKinsey — Re:think on Developer Productivity

Why This Matters for Consulting

The controversy reveals a fundamental tension: executives want quantifiable metrics (McKinsey’s audience), while engineering leaders resist being measured by output proxies. Any consulting engagement on AI developer productivity must navigate this tension carefully. The winning approach: measure outcomes (cycle time, defect rates, business value delivered) rather than outputs (lines of code, PRs completed, AI acceptance rates).


Critical Assessment: Hype vs. Reality

What’s Credible

  • Task-level speed improvements (documentation, boilerplate code) are well-established across multiple studies
  • The adoption intensity curve (>110% at 80%+ adoption) is consistent with network-effect dynamics in tool adoption
  • Developer satisfaction gains are consistent across multiple independent studies
  • The workflow redesign imperative is validated by every major consulting firm and analyst

What’s Overstated or Under-evidenced

  • Sample sizes not disclosed for the controlled experiment — making it impossible to assess statistical significance
  • “Up to 2x faster” headline is cherry-picked from documentation tasks; complex tasks show much smaller gains
  • The $2.6-4.4T economic estimate is a theoretical maximum, not a forecast — yet is frequently cited as if it’s inevitable
  • Selection bias in the survey: respondents are technology-forward enterprises, not representative of all companies
  • The METR RCT contradiction: METR’s randomized controlled trial found experienced developers were actually 19% slower with AI tools, despite believing they were 20% faster — a direct challenge to McKinsey’s optimistic findings

What’s Missing

  • Long-term code maintenance costs: speed gains during initial development may be offset by technical debt from AI-generated code
  • Security implications: McKinsey barely addresses the code security risks that Gartner warns about (2,500% defect increase prediction)
  • Cost-benefit analysis: no TCO modeling for AI tool adoption (licensing + training + security review + workflow redesign)
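
A back-of-envelope version of the missing TCO model shows why the gap matters. The 6 hours/week savings figure comes from the survey above; every cost number below is an assumption chosen purely for illustration, and a real engagement would replace them with client data.

```python
# Illustrative TCO sketch — NOT McKinsey data. Only hours_saved_per_week
# is a survey figure; all cost inputs are assumptions.
devs = 200                     # assumption: engineering headcount
hours_saved_per_week = 6       # survey figure: average savings per developer
loaded_hourly_cost = 100       # assumption: fully loaded cost, $/hour
license_per_dev_month = 40     # assumption: AI tool licensing, $/dev/month
training_one_time = 150_000    # assumption: workshops + coaching rollout
weeks_per_year = 52

gross_annual_benefit = devs * hours_saved_per_week * loaded_hourly_cost * weeks_per_year
annual_cost = devs * license_per_dev_month * 12 + training_one_time
print(f"gross annual benefit: ${gross_annual_benefit:,}")
print(f"annual cost (licenses + training): ${annual_cost:,}")
```

Even if only a fraction of the saved hours converts into delivered value — and even before the unmodeled security-review and technical-debt costs — the ratio makes clear that licensing is rarely the binding constraint; the real costs sit in workflow redesign and risk management, which is precisely what the analysis above says is missing.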

Implications for Foley Hoag

  1. Use McKinsey’s data selectively: The adoption intensity curve (>110% at 80%+ adoption) and the “two shifts, three enablers” framework are the most defensible and useful data points. Avoid citing the “up to 2x faster” headline without the complexity caveat.

  2. The metrics controversy is a consulting opportunity: Help clients navigate between executive demand for quantifiable productivity metrics and engineering resistance to output measurement. Position outcome-based measurement (DORA + business metrics) as the bridge.

  3. The 5.5% finding is the most powerful conversation opener: Only 5.5% of companies drive significant value from AI. This creates urgency without promising miracles.

  4. Pair McKinsey data with Gartner’s warnings: McKinsey’s optimism about productivity gains should be balanced with Gartner’s governance warnings (2,500% defect increase, 2x cost overruns) for a credible, balanced consulting narrative.

  5. The workforce restructuring angle matters for a law firm: McKinsey’s finding that junior/mid roles shrink while senior roles grow maps directly to how AI will reshape legal teams — associates doing AI-augmented research under partner supervision, with fewer associates needed per engagement.


What This Means for Your Organization

McKinsey’s most defensible finding is the adoption intensity curve: companies with 80-100% developer AI tool adoption saw greater than 110% productivity gains, while companies with average adoption saw 25%. That is a step-function increase, not a linear one. It means selective pilots and optional rollouts will never deliver the ROI that justifies the investment. If you are running a pilot with 15% of developers and extrapolating results to the full organization, your projections are structurally wrong. The ROI inflection point requires near-universal adoption, which requires organizational commitment that most pilot programs are not designed to generate.

The metrics controversy is not academic. It affects how your engineering leaders will receive any AI productivity initiative. Kent Beck and Gergely Orosz – two of the most respected voices in software engineering – publicly criticized McKinsey’s measurement framework for focusing on output metrics (coding speed, PR throughput) rather than outcome metrics (business value delivered, system reliability). If your executive team walks into an engineering all-hands citing McKinsey’s “2x faster” headline, expect pushback. The credible approach is to measure cycle time, defect rates, deployment frequency, and customer satisfaction – DORA metrics supplemented by business outcomes. These are metrics engineering leaders already respect.

The 5.5% finding from McKinsey’s survey of 1,993 participants across 105 nations is the number that matters most for strategic planning. Only 109 of those companies drive significant value from AI. The rest are experimenting, piloting, or spending without returns. That is consistent with BCG’s 5% and Accenture’s 13% – three independent firms converging on the same conclusion. The gap between knowing what to do and doing it well is where most organizations lose. McKinsey’s “two shifts, three enablers” framework (lifecycle-level AI, organizational transformation, upskilling, measurement, change management) describes the work. But frameworks do not implement themselves, and the 15-percentage-point performance gap between top and bottom performers is the cost of doing it poorly.


Created by Brandon Sneider | brandon@brandonsneider.com March 2026