Does the AI Productivity Gain Last? The 18-Month Evidence

Brandon Sneider · April 2026

Most AI productivity studies in the corpus are RCTs lasting days or weeks. That is a problem for executives making three-year deployment decisions.

Executive Summary

Only two credible studies track the same workers using AI tools for 18+ months. They reach opposite conclusions — and both are right, because they measure different deployments.
The Brynjolfsson/Li/Raymond customer-support study (5,179 agents, Nov 2020–2022, published QJE 2025) finds that the 14% productivity gain appears in month 2 post-adoption and holds steady through the end of the sample. Gains during AI outages grow with months of exposure: workers are learning, not just leaning on the tool.
The NAV IT Copilot study (39 developers, Sep 2023–May 2025, 105 weeks per developer) finds no statistically significant change in commit output over 20 months, even though every Copilot user reported feeling more productive. The perception-reality gap is the finding.
Market-level data from the Federal Reserve (Apr 2026) shows enterprise AI adoption still accelerating: 41% of US workers use generative AI at work, up 31% YoY through Nov 2025. Adoption is not plateauing, but retention of individual tools is weak. Aggregated Ramp spend data shows Microsoft Copilot lost 39% of its paid-AI market share in six months, and 64% of employees with Copilot access do not use it.
The pattern across all three data sources: AI gains persist where the deployment is structured around a specific repeatable task with a clear performance metric. Gains evaporate where the deployment is “here is a tool, use it however you like.”

What the Long-Run Evidence Actually Shows

Most AI productivity studies in the corpus are RCTs lasting days or weeks. That is a problem for executives making three-year deployment decisions. An 80% time saving on a controlled task tells you nothing about whether the gain persists at month 6, month 12, month 18, or whether workers disengage, the tool atrophies, or the initial lift was a novelty effect.

Only a handful of studies track the same workers for more than a year. The two with the most signal reach opposite conclusions.

Study 1: Customer Support (Brynjolfsson, Li, Raymond — QJE 2025)

A Fortune 500 enterprise-software company deployed a generative-AI chat assistant to its technical-support agents. The tool monitored customer conversations and suggested responses in real time. The study observed 5,179 agents over the rollout window (bulk deployment Nov 2020–Feb 2021), with 1.2 million chats in the post-AI period.

The productivity gain (14% more issues resolved per hour) appears in month 2 and stays flat through the end of the panel. That is the most useful finding in the longitudinal literature: the gain is not a novelty effect and not a spike that decays. It is a step change that holds.

Two additional findings matter more than the headline:

Durable human-capital transfer. When the AI system went down (technical outages lasting minutes to hours), agents who had been using AI for 3+ months handled chats 15–25% faster than their own pre-AI baseline. Agents who experienced an outage one month into AI use showed no such gain. Workers are absorbing patterns from the AI recommendations and retaining them.
Tenure heterogeneity that persists. Agents with less than one month of tenure capture the largest gains. Agents with more than a year of tenure show no measurable productivity effect over the full study period. This does not resolve with time. Senior workers never show a gain, because they did not need the help.

Study 2: Developer Productivity (NAV IT — Stray et al., arXiv 2509.20353, 2025)

NAV IT, a Norwegian public-sector technology organization with ~1,000 employees, rolled out GitHub Copilot in September 2023. Stray and colleagues analyzed 26,317 commits across 703 repositories over 105 weeks, tracking 25 Copilot users and 14 non-users through May 2025.

Commit output did not change. Users averaged 188 lines added and 105 deleted per week before Copilot; 200 added and 98 deleted after. The delta is not statistically significant. Structural code-quality metrics did not change either.
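To put the size of that delta in perspective, the relative change in each weekly average can be worked out from the four figures quoted above. This is a minimal sketch: the function and variable names are illustrative, and the arithmetic says nothing about statistical significance, which depends on week-to-week variance that is not reproduced here.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Weekly averages per Copilot user, as quoted in the text above.
added_before, added_after = 188, 200
deleted_before, deleted_after = 105, 98

added_delta = pct_change(added_before, added_after)
deleted_delta = pct_change(deleted_before, deleted_after)

print(f"lines added:   {added_delta:+.1f}%")
print(f"lines deleted: {deleted_delta:+.1f}%")
```

Both shifts are in the single digits, one up and one down, which fits the study's conclusion that output did not move.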

Every Copilot user reported feeling more productive. The correlation between self-reported productivity gains and actual commit changes was ρ≈0.17 (p=0.40) — statistical noise. Developers described the benefit as reduced drudgery and smoother workflow, not more output.
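A rough way to see why a correlation of 0.17 reads as noise is to convert it to a t-statistic. The sketch below uses the standard t approximation for testing a correlation against zero. The sample size is an assumption: the text above does not say how many developers entered this correlation, so the code assumes one observation per Copilot user (n = 25). The study's own test may differ.

```python
import math

def corr_t_statistic(rho: float, n: int) -> float:
    """t-statistic for testing a correlation of `rho` against zero with n pairs."""
    return rho * math.sqrt((n - 2) / (1.0 - rho ** 2))

rho = 0.17
n_assumed = 25  # ASSUMPTION: one observation per Copilot user; not stated in the text
t_stat = corr_t_statistic(rho, n_assumed)

# Two-sided 5% critical value of Student's t with n - 2 = 23 degrees of freedom.
T_CRIT_DF23 = 2.069

print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRIT_DF23}")
```

Under that assumption the statistic falls well short of the conventional threshold, so a correlation this small in a sample this size cannot be distinguished from zero.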

Two readings of this study are credible. The generous reading: Copilot delivered mental relief that does not show up in commits but makes the work more sustainable for developers over time. The skeptical reading: after 20 months, the tool has not moved the needle on anything an executive can measure.

The Copilot adopters at NAV IT had significantly higher baseline activity than non-adopters before Copilot rolled out (p<0.005). The people who opted in were already the most engaged developers. Self-selection, not tool effect, explains the cross-sectional gap.
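The self-selection point can be made concrete by contrasting a naive adopter-versus-non-adopter comparison with a difference-in-differences estimate, which subtracts each group's own pre-rollout baseline. This is a sketch only: every number below is hypothetical and invented for illustration, and difference-in-differences is shown as one common way to handle baseline gaps, not as the method the NAV IT authors used.

```python
from statistics import mean

# HYPOTHETICAL weekly commit counts, invented purely for illustration.
# They are NOT data from the NAV IT study.
adopters_before = [12, 14, 13, 15]
adopters_after = [12, 15, 13, 14]
non_adopters_before = [7, 8, 6, 7]
non_adopters_after = [7, 8, 7, 6]

# Naive cross-sectional comparison: adopters vs non-adopters after rollout.
cross_sectional_gap = mean(adopters_after) - mean(non_adopters_after)

# Difference-in-differences: change among adopters minus change among non-adopters.
adopter_change = mean(adopters_after) - mean(adopters_before)
non_adopter_change = mean(non_adopters_after) - mean(non_adopters_before)
did_estimate = adopter_change - non_adopter_change

print(f"cross-sectional gap:       {cross_sectional_gap}")
print(f"difference-in-differences: {did_estimate}")
```

In this toy data the adopters look far more productive than non-adopters after rollout, yet the whole gap was already there beforehand, so the estimated tool effect is zero.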

Study 3: Market-Level Adoption Trajectory (Federal Reserve, Apr 2026)

The Federal Reserve’s April 2026 FEDS note triangulates three federal surveys. Through November 2025, enterprise AI adoption is still accelerating — not plateauing. Firm-level adoption grew 68% year-over-year through September 2025. Worker-level use reached 41% of the US workforce, up 31% year-over-year. The most recent quarter showed the strongest growth in the Real-Time Population Survey’s history.

But tool-specific retention tells a different story. Microsoft Copilot’s paid-AI-subscriber market share fell from 18.8% in July 2025 to 11.5% in January 2026 — a 39% contraction in six months. Where employees have access to both Copilot and ChatGPT, 76% use ChatGPT and 18% use Copilot. 64% of employees with Copilot access do not use it at all. The headline 41% worker-level adoption figure masks significant churn between tools.
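The 39% figure is a relative contraction, not a drop in percentage points, and the two are easy to conflate. A minimal sketch of the arithmetic, using only the two share figures quoted above (variable names are illustrative):

```python
share_jul_2025 = 18.8  # percent of paid-AI subscribers, as quoted in the text
share_jan_2026 = 11.5

# Absolute drop in percentage points vs. relative contraction of the share.
point_drop = share_jul_2025 - share_jan_2026
relative_drop = point_drop / share_jul_2025 * 100.0

print(f"{point_drop:.1f} percentage points, {relative_drop:.1f}% relative")
```

A loss of roughly 7 percentage points looks modest until it is set against a starting share of under 19%.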

Why the Two Longitudinal Studies Disagree

The customer-support study and the Copilot study are not contradictory. They measure different deployments.

Factor	Customer Support (Brynjolfsson 2025)	Developer Copilot (NAV IT 2025)
Task structure	Narrow, repeatable (technical support)	Open-ended (software development)
Performance metric	Clear (issues/hour, resolution rate)	Ambiguous (commits, PRs, code quality)
Worker baseline	Heterogeneous, with many juniors	Self-selected, already high-activity developers
AI role	Real-time suggestion specific to the conversation	General-purpose autocomplete
Gain appears	Month 2, holds to end of sample	Never appears in objective metrics

The deployment that delivered durable gains had: narrow task scope, measurable output, workers who did not already have the tacit knowledge the AI surfaced. The deployment that delivered no measurable gain had: broad task scope, ambiguous output, workers who were already top of the distribution.

This is an organizational-design question, not a technology question. Executives asking “does AI productivity last?” are asking the wrong question. The right question is “what task are we deploying it to, and how do we know when it is working?”

What the 18-Month Evidence Does Not Tell You

The corpus has real gaps worth naming:

No study tracks the same workers across multiple AI tool generations. The customer-support study used 2020-era AI; the NAV IT study used 2023–2025 Copilot. Whether gains persist when the underlying model upgrades (Copilot to Copilot Chat to agentic Copilot Workspace) is not known.
No study measures resistance or sabotage over 18 months. The adoption-challenge literature has point-in-time snapshots of who opts in or out. Whether active resistance hardens, softens, or stays stable over time is unmeasured.
No longitudinal data on mid-market deployments. Both major studies are single-firm: one Fortune 500, one public-sector tech org with 1,000 employees. No 200–500 person company has a published 18-month AI productivity panel.
No team-level longitudinal data. All published panels measure individual output. Whether AI-driven individual gains aggregate to team or firm-level gains over 18 months remains unanswered. The Santarelli et al. meta-analysis, as summarized by California Management Review (Berkeley), pooled 371 estimates and found no robust effect on aggregate labor-market outcomes.

Key Data Points

Finding	Source	Date	Sample	Tier
14% productivity gain in month 2, stable through end of sample	Brynjolfsson/Li/Raymond, QJE Vol 140(2)	2025 (data 2020–2022)	5,179 agents, 3M chats	3
Outage-period gains grow from ~0% at month 1 to 15–25% at month 3+	Same	2025	Same	3
Zero significant change in commit output over 105 weeks	Stray et al., arXiv 2509.20353	Sep 2025	39 devs, 26,317 commits, 703 repos	1
Correlation of perceived to actual productivity: ρ=0.17 (p=0.40)	Same	Sep 2025	63 survey respondents	1
US worker AI use: 41%, +31% YoY through Nov 2025	Federal Reserve FEDS Notes	Apr 2026	RPS 5–6K/quarter	1
Copilot paid-share: 18.8% → 11.5% in six months	AI Business Weekly aggregation of Ramp data	Jan 2026	Aggregated market data	1
64% of employees with Copilot access do not use it	Same	Jan 2026	Market-level	1
371-estimate meta-analysis: no robust AI effect on aggregate productivity	Santarelli et al. via CMR Berkeley	Oct 2025	371 estimates, 2019–2024	1

What This Means for Your Organization

Stop asking whether AI productivity gains last. That framing produces useless answers — some deployments hold for two years, some produce no measurable gain from day one. Start asking three sharper questions:

First, what is the repeatable task? The deployments that deliver durable gains target a narrow, measurable activity where the AI can surface patterns the worker does not already hold. Technical support chat resolution. Claims triage. Contract redline review. “Help my engineers code faster” is not a task — it is a hope.

Second, who actually captures the gain? Both long-run studies point the same way: AI compresses the experience curve for newer workers and does little or nothing for workers who are already strong performers. If your deployment is rolling out to senior engineers who already know the codebase, you should expect what NAV IT got: felt improvement, zero measurable change. If it is rolling out to new hires in their first 90 days, you should expect what the Fortune 500 support org got: a step change that holds.

Third, what does tool stickiness look like at month 6, month 12, month 18? 64% of employees with Copilot access do not use it. The market churn between tools is significant. Committing to a single-vendor three-year license on a tool your workforce may abandon for a competitor in quarter two is a procurement risk the longitudinal data now makes visible.
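One way to make stickiness measurable is to track, at each checkpoint, the share of licensed users who were active that month. The sketch below assumes a simple per-user usage log. The data structure, the user names, and every value in it are hypothetical; a real deployment would pull this from the vendor's admin or telemetry export.

```python
# HYPOTHETICAL usage log: for each licensed user, the months since rollout
# in which they used the tool at least once. Invented for illustration.
active_months = {
    "user_a": {1, 2, 3, 6, 12, 18},
    "user_b": {1, 2, 6},
    "user_c": {1},
    "user_d": {1, 2, 3, 6, 12},
    "user_e": set(),  # licensed but never used the tool
}

def active_share(log: dict[str, set[int]], month: int) -> float:
    """Share of licensed users who were active in the given month."""
    if not log:
        return 0.0
    return sum(month in months for months in log.values()) / len(log)

checkpoints = (6, 12, 18)
stickiness = {m: active_share(active_months, m) for m in checkpoints}

for month, share in stickiness.items():
    print(f"month {month:>2}: {share:.0%} of licensed users active")
```

The denominator is every licensed seat, not just the users who ever logged in, so unused licenses show up as lost stickiness rather than disappearing from the metric.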

If this raised questions specific to how your organization should structure its next 18 months of AI deployment, I’d welcome the conversation — brandon@brandonsneider.com.

Sources

Brynjolfsson, Li, Raymond. “Generative AI at Work.” NBER Working Paper 31161, revised Nov 2023. Published Quarterly Journal of Economics, Vol. 140, Issue 2, 2025. https://www.nber.org/papers/w31161 — HIGH credibility. Peer-reviewed academic panel study of 5,179 customer-support agents with staggered AI rollout.
Stray, Brandtzæg, Wivestad, Barbala, Moe. “Developer Productivity With and Without GitHub Copilot: A Longitudinal Mixed-Methods Case Study.” arXiv:2509.20353v2, September 2025. https://arxiv.org/abs/2509.20353 — HIGH credibility. Independent 105-week panel of 39 developers at NAV IT (Norwegian public-sector tech org).
Federal Reserve Board. “Monitoring AI Adoption in the U.S. Economy.” FEDS Notes, April 3, 2026. https://www.federalreserve.gov/econres/notes/feds-notes/monitoring-ai-adoption-in-the-u-s-economy-20260403.html — HIGH credibility. Federal statistical triangulation across BTOS, RPS, SBU surveys.
California Management Review. “Seven Myths about AI and Productivity: What the Evidence Really Says.” October 2025. https://cmr.berkeley.edu/2025/10/seven-myths-about-ai-and-productivity-what-the-evidence-really-says/ — HIGH credibility. Aggregates Santarelli et al. 371-study meta-analysis (no robust macro effect) and 37-study SE review.
AI Business Weekly. “Microsoft Copilot Statistics 2026.” January 2026. https://aibusinessweekly.net/p/microsoft-copilot-statistics — MEDIUM credibility. Aggregator of Ramp corporate-spend data and Microsoft admin-center metrics. Useful for retention trend direction, not absolute levels.
