Academic Research on AI-Assisted Programming Productivity: Eight Studies, Three Conclusions

Executive Summary

  • The RCT evidence converges on 0-26% productivity gains, depending on task complexity and developer experience — far below the 55-80% numbers in vendor marketing. Six randomized or quasi-randomized experiments now exist. The range spans from 19% slower (experienced developers, mature codebases) to 26% more tasks completed (mixed experience, enterprise settings). No independent study has replicated the 55% headline figure.
  • Google’s own internal RCT (n=96 engineers, October 2024) found a 21% speed improvement — but with a confidence interval so wide it includes near-zero. Google published this result and still cannot determine whether the effect is 5% or 40%. That is the state of measurement at the most AI-mature engineering organization on Earth.
  • Stanford’s analysis of 100,000 developers across 600+ companies finds the real average is 15-20%, with gains collapsing toward zero on complex tasks in large codebases. Task complexity is the dominant variable: greenfield simple tasks see 30-40% gains; complex brownfield work sees 0-10%. Most enterprise development falls in the latter category.
  • The largest observational study (Science, 2025, 170,000 developers, 30M commits) finds senior developers capture nearly all AI productivity gains while junior developers see no statistically significant benefit — the opposite of the “AI closes skill gaps” narrative. This contradicts Cui et al.'s finding that juniors benefit more, suggesting the answer depends on what you measure and how.
  • Three conclusions hold across all eight studies: (1) AI accelerates simple, isolated coding tasks; (2) the gains shrink or reverse as task and codebase complexity increase; (3) no study has found reliable organization-level productivity improvement.

The Eight Key Studies

The academic literature on AI-assisted programming now includes eight studies worth executive attention. They divide into three categories by methodology and what each can credibly claim.

Randomized Controlled Trials

1. METR (July 2025, updated February 2026)

The most cited independent RCT. Sixteen experienced open-source developers completed 246 real tasks in their own repositories, randomly assigned to AI-allowed or AI-disallowed conditions. Developers used Cursor Pro with Claude 3.5/3.7 Sonnet.

Result: AI-assisted work took 19% longer (95% CI: 2% to 40% longer). Developers predicted a 24% speedup before the study and, even after completing it, still estimated a 20% speedup; set against the measured 19% slowdown, that is a 39-percentage-point perception-reality gap.
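
The gap arithmetic can be made explicit; a minimal sketch using the study's reported numbers (expressed as fractions of baseline task time):

```python
# METR's reported numbers, as fractions of baseline task time.
predicted_speedup = 0.24   # pre-study forecast: 24% faster
estimated_speedup = 0.20   # post-study self-estimate: 20% faster
measured_change   = -0.19  # observed: 19% slower

# Perception-reality gap in percentage points: the distance between
# the post-study self-estimate and the measured effect.
gap_points = (estimated_speedup - measured_change) * 100
print(f"{gap_points:.0f}-point gap")  # → 39-point gap
```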

METR’s February 2026 follow-up expanded to 57 developers and 800+ tasks. The slowdown shrank to -4% for new recruits (CI: -15% to +9%), approaching neutrality. But METR abandoned this design because 30-50% of invited developers refused to participate without AI access — a selection bias that makes clean randomization impossible in 2026. Developers who stayed in the study were disproportionately those who benefit least from AI.

METR now believes “AI likely provides productivity benefits in early 2026” but cannot quantify the effect with their methodology.

Source: Becker et al., arXiv:2507.09089, July 2025; METR blog update, February 24, 2026. Credibility: High — independent, pre-registered, transparent about limitations. Small initial sample (n=16) is a real constraint.

2. Google Internal RCT (October 2024)

Google ran a randomized controlled trial with 96 full-time software engineers on a “complex, enterprise-grade” coding task in C++. All participants had at least one year of Google tenure.

Result: 21% speed improvement with AI tools enabled. But the confidence interval is wide (the authors do not publish exact bounds, stating only that the CI is “large”). Developers who spend more hours per day on code-related work saw greater AI-assisted speed gains.

The authors are explicit about limitations: “We cannot assume that the effect size obtained in our lab study will necessarily apply more broadly, or that the effect of AI found using internal Google tooling in the summer of 2024 will translate across tools and over time.”

Source: Paradis et al., “How much does AI impact development speed? An enterprise-based randomized controlled trial.” arXiv:2410.12944, October 2024. Credibility: High — RCT at a real company with real engineers, but Google-funded, single task, internal tooling that differs from commercial products.

3. Peng et al. / GitHub (February 2023)

Ninety-five freelance developers from Upwork implemented an HTTP server in JavaScript. The Copilot group finished 55.8% faster (95% CI: 21% to 89%, p=0.0017).

This remains the most-cited number in the industry and the weakest evidence base for enterprise decision-making. The task was greenfield, single-function, one language, with no code review, integration testing, or quality measurement. Of 95 recruits, only 35 completed the task — a 63% dropout rate that the study does not adequately address.

Source: Peng et al., arXiv:2302.06590, February 2023. Published in a 2024 journal version. Credibility: Medium — legitimate RCT design, but GitHub-commissioned, extreme task simplicity, and high dropout rate.

Large-Scale Field Experiments

4. Cui et al. / Microsoft-Accenture-Fortune 100 (2024, published Management Science 2026)

The strongest enterprise evidence. Three separate RCTs: Microsoft (7 months), Accenture (4 months), and an anonymous Fortune 100 electronics manufacturer (2 months, staggered rollout). Combined sample: 4,867 developers.

Combined result: 26.08% increase in completed tasks (SE: 10.3%), measured by pull requests. Also: 13.55% more commits and 38.38% more builds, with no negative impact on build success rates.
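
The reported standard error lets a reader recover an approximate 95% confidence interval, assuming a normal sampling distribution (an assumption for illustration; the paper's own inference may differ):

```python
# Approximate 95% CI from Cui et al.'s reported point estimate and
# standard error, under a normal approximation.
point_estimate = 26.08   # % increase in completed tasks
standard_error = 10.3    # percentage points

z = 1.96  # two-sided 95% critical value
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error
print(f"95% CI: {lower:.1f}% to {upper:.1f}%")  # → 95% CI: 5.9% to 46.3%
```

The interval excludes zero, so the effect is statistically significant, but it is wide enough to span "modest" to "transformative", the same imprecision Google reported for its own RCT.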

The experience-level finding is critical for workforce planning: short-tenure developers saw 27-39% output increases; long-tenure developers saw 8-13%. Junior developers saw 21-40% gains; senior developers 7-16%. Long-tenure developers were 4.3% less likely to accept AI code suggestions.

But 30-40% of developers across all three experiments never tried Copilot at all, despite having access. And the study measured task completion volume — not code quality, maintainability, or downstream rework.

Source: Cui et al., “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers.” Management Science, 2026. Also: SSRN #4945566. Credibility: High — peer-reviewed in a top journal, three independent experiments, large sample. Microsoft-funded; task completion is an incomplete productivity measure.

5. Microsoft “Dear Diary” Study (October 2024)

A three-week RCT combined with daily diary entries at a large multinational software company. Intake: 228 engineers; final analysis: 106 (those who completed all instruments).

The key finding is the perception-telemetry disconnect. Developers reported statistically significant increases in perceiving AI tools as “useful” (2.93 to 3.51, p=0.001) and “enjoyable” (2.72 to 3.61, p<0.0001). But telemetry showed no statistically significant difference in code changes, pull requests, or development time between treatment and control groups.

Trust in AI-generated code barely moved: roughly 20% trusted AI output before and after the study. Only 75% of participants adhered to their assigned condition — control group members tried AI tools, and treatment group members did not consistently use them.

Source: Butler et al., “Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace.” arXiv:2410.18334, October 2024. Also: IEEE Conference Publication. Credibility: Medium-high — RCT + diary method at real company, but single company, 3-week duration may be too short, 25% non-compliance.

Large-Scale Observational Studies

6. Stanford / Denisov-Blanch (2025)

The largest analysis of AI coding productivity to date. Researchers developed an automated code review algorithm and applied it to data from nearly 100,000 developers across 600+ companies, analyzing tens of millions of commits and billions of lines of private codebase data.

Net average productivity gain: 15-20% after accounting for rework.

But the average masks enormous variation by context:

  Task Type          Codebase                Productivity Gain
  Low complexity     Greenfield (new)        30-40%
  High complexity    Greenfield (new)        10-15%
  Low complexity     Brownfield (existing)   15-20%
  High complexity    Brownfield (existing)   0-10%

Programming language matters: popular languages (Python, Java, JavaScript) show 10-20% gains on simple tasks; low-popularity languages show minimal or negative gains. Codebase size erodes gains sharply — context window limitations, signal-to-noise degradation, and domain-specific logic all reduce AI effectiveness in large codebases.
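
One practical use of these per-cell estimates is a blended expectation for a given engineering portfolio. A minimal sketch: the gain values are midpoints of the Stanford ranges above, but the task-mix weights are hypothetical and should be replaced with your own time-allocation data:

```python
# Blend the Stanford per-cell gains into one expected figure for a
# hypothetical enterprise task mix. Weights are illustrative, not
# from the study; gains are midpoints of each reported range.
gains = {  # (complexity, codebase) -> midpoint of reported gain range
    ("low",  "greenfield"): 0.35,   # 30-40%
    ("high", "greenfield"): 0.125,  # 10-15%
    ("low",  "brownfield"): 0.175,  # 15-20%
    ("high", "brownfield"): 0.05,   # 0-10%
}
task_mix = {  # hypothetical share of engineering time per cell
    ("low",  "greenfield"): 0.10,
    ("high", "greenfield"): 0.10,
    ("low",  "brownfield"): 0.30,
    ("high", "brownfield"): 0.50,
}
blended = sum(gains[k] * task_mix[k] for k in gains)
print(f"Blended expected gain: {blended:.1%}")  # → Blended expected gain: 12.5%
```

With half the portfolio in complex brownfield work, the blend lands near 12.5%; the 15-20% headline average only holds for mixes lighter on complex work than most enterprises.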

Critically, the researchers found that self-reported AI productivity assessment is unreliable and that token spend (the most common AI usage metric) is a weak predictor of actual productivity gains.

Source: Denisov-Blanch, “Does AI Actually Boost Developer Productivity?” Presentation at AIEWF 2025; arXiv:2409.15152. Credibility: Medium-high — massive dataset, rigorous methodology, Stanford-affiliated. Not yet peer-reviewed in a top journal. Conference presentation, not full paper.

7. Daniotti et al. / Science (2025)

Published in Science, giving it the highest publication credibility of any study in this space. Researchers built a neural classifier to detect AI-generated code and applied it to 30+ million code contributions by 170,000 developers across six countries (US, France, Germany, India, China, Russia).

Quarterly output increased 3.6% due to AI adoption — a modest number compared to vendor claims. AI now writes an estimated 29% of Python functions in the US.

The most significant finding: senior developers capture nearly all productivity and exploration gains. Junior developers show no statistically significant benefit. Rather than closing the skill gap, AI appears to be widening it. Senior developers also expanded more readily into new software domains when using AI.

This directly contradicts Cui et al.'s finding that junior/short-tenure developers benefit more from AI. The likely explanation: Cui et al. measured task completion volume in a controlled setting; Daniotti et al. measured real-world code contributions over years. Different measures, different timeframes, different conclusions.

Source: Daniotti et al., “Who is using AI to code? Global diffusion and impact of generative AI.” Science, 2025. DOI: 10.1126/science.adz9311. Credibility: Very high — top-tier peer-reviewed journal, massive sample, novel measurement methodology. Observational (no randomization), which limits causal claims.

8. DORA 2025 State of AI-Assisted Software Development (September 2025)

Google’s annual DevOps report surveyed thousands of developers. Key findings: 90% now use AI at work; median usage is 2 hours daily; 80%+ believe AI has increased their productivity. AI adoption now positively correlates with throughput (a reversal from 2024) but continues to negatively correlate with delivery stability.

The DORA team identified seven capabilities that amplify AI’s positive impact, with platform engineering being the strongest predictor. Their central conclusion: “AI doesn’t fix a team; it amplifies what’s already there.”

Source: Google Cloud, “2025 State of AI-Assisted Software Development.” September 2025. Credibility: Medium-high — established methodology, large sample, but survey-based self-reporting from a company that sells AI tools.

The Contradictions That Matter

Three contradictions across these studies should inform executive decision-making:

Who benefits more — juniors or seniors? Cui et al. (n=4,867, RCT) finds juniors gain 21-40% while seniors gain 7-16%. Daniotti et al. (n=170,000, observational, Science) finds seniors capture nearly all gains while juniors see no statistically significant benefit. The likely resolution: juniors complete more discrete tasks with AI assistance, but seniors produce higher-quality output that survives in production. Both can be true simultaneously if you measure different things.

Does AI increase productivity or not? Google’s RCT says 21% faster. METR’s RCT says 19% slower. Stanford says 15-20% on average, near-zero on complex work. The variable that explains most of the divergence is task complexity and codebase maturity. Simple tasks in new codebases: consistent gains. Complex tasks in large, mature codebases: gains disappear or reverse. Most enterprise development is the latter.

Can you measure AI productivity at all? METR abandoned its study design because selection bias made randomization impossible. Microsoft’s Dear Diary study found telemetry showed nothing while surveys showed gains. Stanford found self-assessment unreliable. DORA’s survey data contradicts its own organizational metrics. The measurement problem is not a temporary inconvenience — it may be structural. AI changes how developers work in ways that existing productivity metrics cannot capture.

Key Data Points

  • 21% speed improvement: Google internal RCT, n=96, enterprise C++ task, wide CI (Paradis et al., Oct 2024)
  • 26% more completed tasks: Three RCTs at Microsoft/Accenture/Fortune 100, n=4,867 (Cui et al., Management Science, 2026)
  • 19% slower: Experienced OSS developers in mature repos, n=16, 246 tasks (METR, July 2025)
  • -4% (CI: -15% to +9%): METR follow-up, n=57, 800+ tasks, approaching neutrality (METR, Feb 2026)
  • 15-20% average gain: 100K developers, 600+ companies, but 0-10% on complex brownfield (Stanford, 2025)
  • 3.6% quarterly output increase: 170K developers, 30M commits, only seniors benefit (Science, 2025)
  • 55.8% faster: Greenfield JavaScript task, n=95 freelancers, 63% dropout (Peng et al., 2023)
  • No telemetry-measurable difference: 3-week RCT, n=228, despite positive self-reports (Microsoft Dear Diary, 2024)
  • 30-40% never adopt: Developers given free access across three enterprise experiments (Cui et al.)
  • 39-point perception gap: Developers believe 20% faster, measured 19% slower (METR, 2025)

What This Means for Your Organization

The academic evidence now includes enough independent data points to draw operational conclusions — even if the precise magnitude remains uncertain.

Budget for 10-20% developer productivity gains, not 50%. Stanford’s 100,000-developer analysis is the best available estimate of the real average, and it includes the critical caveat that gains approach zero on the complex, brownfield, large-codebase work that dominates enterprise development. If your business case requires 40%+ gains to justify the investment, it will not pencil out. If it works at 10-15%, proceed.

Do not use developer surveys to measure ROI. Every study that compares self-reported productivity to objective measurement — METR, Dear Diary, DORA — finds the same thing: developers consistently believe AI makes them faster regardless of what the data shows. The 39-point perception gap documented by METR is not an outlier; it is the norm. Measure pull request cycle time, defect escape rates, time-to-production, and rework rates. If the tool is working, those numbers will move.
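
The objective metrics named above are straightforward to compute from version-control timestamps. A minimal sketch of PR cycle time, using a hypothetical list of (opened, merged) records in place of a real version-control API:

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: (opened, merged) timestamps as pulled
# from your version-control API. Cycle time = merge minus open.
prs = [
    (datetime(2026, 3, 1, 9, 0),  datetime(2026, 3, 2, 17, 0)),
    (datetime(2026, 3, 3, 10, 0), datetime(2026, 3, 3, 15, 30)),
    (datetime(2026, 3, 4, 8, 0),  datetime(2026, 3, 7, 12, 0)),
]
cycle_hours = [(merged - opened).total_seconds() / 3600
               for opened, merged in prs]

# Track the median, not the mean: a few long-lived PRs will
# otherwise dominate the trend you are trying to detect.
print(f"Median PR cycle time: {median(cycle_hours):.1f} h")
```

The same pattern extends to defect escape rates (production incidents per release) and rework rates (share of AI-assisted lines modified within N weeks); the point is that every input comes from telemetry, not from a survey.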

Expect uneven adoption and uneven returns. Across all enterprise experiments, 30-40% of developers never use AI tools even when given free access. Among those who do, gains vary by 3-5x depending on task type, language, codebase size, and developer experience. A blanket per-seat license at $19-39/developer/month means paying for a tool that a third of your developers will not touch and another third will use only on tasks where it provides marginal benefit. Pilot programs with measured rollout are financially rational; enterprise-wide purchases on day one are not.

The junior-vs-senior question matters for workforce strategy. If Cui et al. is correct (juniors benefit more), AI tools are a force multiplier for cheaper talent. If Daniotti et al. is correct (seniors benefit more), AI widens the gap between your best and average developers. Both findings are credible. The difference likely depends on whether you value task throughput (juniors win) or production-quality output that survives long-term (seniors win). Your workforce investment strategy should not depend on which study you read last.

Sources

  1. Paradis et al. — “How much does AI impact development speed? An enterprise-based randomized controlled trial.” arXiv:2410.12944, October 2024. n=96 Google engineers. https://arxiv.org/abs/2410.12944 Credibility: High — RCT at Google, but single task, internal tooling, wide confidence interval.

  2. Becker et al. / METR — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” arXiv:2507.09089, July 2025. n=16 developers, 246 tasks. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ Credibility: High — independent, pre-registered RCT.

  3. METR — “We Are Changing Our Developer Productivity Experiment Design.” February 2026. n=57 developers, 800+ tasks. https://metr.org/blog/2026-02-24-uplift-update/ Credibility: High — transparent about methodology failure and selection bias.

  4. Peng et al. — “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.” arXiv:2302.06590, February 2023. n=95 developers. https://arxiv.org/abs/2302.06590 Credibility: Medium — legitimate RCT, but GitHub-commissioned, single greenfield task, 63% dropout.

  5. Cui et al. — “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers.” Management Science, 2026. n=4,867 developers. https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2025.00535 Credibility: High — peer-reviewed top journal, three independent RCTs.

  6. Butler et al. — “Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace.” arXiv:2410.18334, October 2024. n=228 engineers. https://arxiv.org/abs/2410.18334 Credibility: Medium-high — RCT + diary method, but single company, 25% non-compliance.

  7. Denisov-Blanch — “Does AI Actually Boost Developer Productivity?” Stanford Software Engineering Productivity Research, AIEWF 2025. ~100,000 developers, 600+ companies. https://softwareengineeringproductivity.stanford.edu/ai-impact Credibility: Medium-high — massive dataset, Stanford affiliation. Conference presentation, not peer-reviewed journal.

  8. Daniotti et al. — “Who is using AI to code? Global diffusion and impact of generative AI.” Science, 2025. 170,000 developers, 30M+ commits. https://www.science.org/doi/10.1126/science.adz9311 Credibility: Very high — Science publication, novel neural classifier methodology, massive sample.

  9. DORA / Google Cloud — “2025 State of AI-Assisted Software Development.” September 2025. https://dora.dev/research/2025/dora-report/ Credibility: Medium-high — established methodology, large survey, but Google is a vendor.

  10. Tabachnyk et al. / Google — “Achieving Productivity Gains with AI-based IDE features: A Journey at Google.” arXiv:2601.19964, accepted for LLM4Code '26. https://arxiv.org/abs/2601.19964 Credibility: Medium — Google-authored internal tooling study, abstract only available.


Created by Brandon Sneider | brandon@brandonsneider.com | March 2026