Generative AI at Work: The Most-Cited Enterprise AI Field Study and What It Actually Found

Brandon Sneider · April 2026

See also (wiki): productivity-rcts · training-architecture · workflow-redesign

Temporal tier: TIER 4. Field data collected November 2020–2022 on GPT-3-era conversational AI. Published in QJE February 2025, but the underlying experiment pre-dates GPT-4 by 18+ months and current frontier models by 3+ years. The mechanism finding (AI disseminates top-performer knowledge down the skill curve) is likely durable across model generations. The magnitude findings (15% average, 34% novice) reflect a narrower, less capable model than current enterprise deployments. Cite the mechanism, not the magnitude, as a current benchmark. Cross-reference METR (Jul 2025, TIER 2) and Stanford AI Index 2026 (Apr 2026, TIER 1) for current productivity ranges.

Executive Summary

- Brynjolfsson, Li, and Raymond’s study of 5,172 customer support agents, published in the Quarterly Journal of Economics (2025), is the most-cited field evidence on AI and white-collar productivity. The headline is a 15% average productivity gain, but the average conceals almost everything important.
- The gains were not evenly distributed. New workers (less than two months’ tenure) saw 34% productivity improvements. The most experienced, highest-skilled agents saw small speed gains and small quality declines.
- The mechanism matters: the AI did not make workers smarter. It disseminated the best practices of the most experienced workers to everyone else. Agents with two months of tenure, using AI, matched the performance of agents with more than six months of tenure without it.
- Three outcomes went beyond productivity: customer sentiment improved, worker attrition fell (driven by retention of newer workers), and English fluency improved among international agents. The value extended well past issues-per-hour.
- The policy implication for mid-market employers: AI has the highest return where skill gaps are largest, not where the best people are. Organizations deploying AI primarily for their most experienced employees are directing the tool at the lowest-return population.

The Study

Brynjolfsson, Li, and Raymond — “Generative AI at Work” NBER Working Paper w31161 (April 2023, revised November 2023). Published in The Quarterly Journal of Economics, Vol. 140, Issue 2, pp. 889–942 (February 2025).

The study examined the staggered introduction of a generative AI-based conversational assistant at a software customer support firm. Staggered rollout across teams created a natural treatment-control comparison, allowing the researchers to identify causal effects rather than correlations. The sample covers 5,172 agents.
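The logic of a staggered rollout can be sketched as a simple difference-in-differences comparison: the change on a team that already has the tool, minus the change over the same period on a team still waiting for it. The sketch below is a deliberately simplified two-team, two-period illustration, not the paper's estimator, and every number in it is a hypothetical placeholder rather than data from the study.

```python
from statistics import mean

# Hypothetical issues-resolved-per-hour figures; NOT data from the study.
early_team_before = [2.0, 2.2, 2.1]   # team that gets the tool first
early_team_after = [2.5, 2.6, 2.4]
late_team_before = [2.1, 2.0, 2.2]    # team still waiting for the tool
late_team_after = [2.2, 2.1, 2.3]


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the not-yet-treated group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


effect = diff_in_diff(early_team_before, early_team_after,
                      late_team_before, late_team_after)
print(f"Estimated effect: {effect:+.2f} issues per hour")
```

Subtracting the waiting team's change removes whatever would have happened anyway (seasonality, product changes, general learning), which is why this kind of comparison says more than a simple before-and-after.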

The AI tool worked by listening to live customer conversations and surfacing real-time suggested responses — not writing them for the agent, but recommending the next best reply based on what had worked in similar situations. It encoded the institutional knowledge of the highest performers and made it available to every agent in real time.

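Source credibility: HIGH. This is an independent academic field study, peer-reviewed in a top economics journal, with a lead author (Brynjolfsson) known for the J-Curve theory of AI productivity diffusion. No vendor funding is disclosed. Strictly speaking, the design is a staggered rollout that compares agents with and without access, not an experiment with random assignment, but it supports causal identification in a way that a correlation or a survey cannot.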

What the Numbers Say

The 15% Average Is the Least Important Number

Aggregate productivity — issues resolved per hour — rose 15% on average across all 5,172 agents. For a customer service operation, that is substantial. But the average masks the distribution that makes this finding strategically useful.

| Worker Group | Productivity Change |
| --- | --- |
| Novice / lowest-skilled workers | +34% |
| Mid-tier workers | Above-average gains |
| Most experienced / highest-skilled workers | Small speed gain; small quality decline |
| Overall average | +15% |

The pattern has a direct explanation: the AI was surfacing the best practices of the top performers to everyone else. For the top performers, those suggestions were occasionally less accurate than what they would have written themselves — hence the small quality decline. For everyone below the top tier, the AI’s suggestions represented better-than-average guidance they would not otherwise have had.
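To see why the distribution matters more than the average, it can help to compute a headcount-weighted gain for your own workforce mix. The sketch below is illustrative only: the +34% novice figure comes from the study, but the mid-tier and experienced values, the `GAIN_BY_GROUP` table, and both example teams are hypothetical assumptions standing in for "above average" and "small".

```python
# Gains by group. Only the +34% novice figure comes from the study; the
# other two values are hypothetical placeholders and should be replaced
# with your own estimates.
GAIN_BY_GROUP = {
    "novice": 0.34,        # reported in the study
    "mid_tier": 0.18,      # assumption
    "experienced": 0.02,   # assumption
}


def blended_gain(headcount_by_group):
    """Headcount-weighted average gain for a given workforce mix."""
    total = sum(headcount_by_group.values())
    if total == 0:
        raise ValueError("workforce mix is empty")
    weighted = sum(GAIN_BY_GROUP[group] * count
                   for group, count in headcount_by_group.items())
    return weighted / total


# Two hypothetical teams with the same headcount but different skill mixes.
high_turnover_team = {"novice": 60, "mid_tier": 30, "experienced": 10}
veteran_team = {"novice": 5, "mid_tier": 25, "experienced": 70}

print(f"High-turnover team: {blended_gain(high_turnover_team):.1%}")
print(f"Veteran team:       {blended_gain(veteran_team):.1%}")
```

Under these assumed inputs, the same tool produces very different blended results depending only on who is on the team, which is the point the single 15% figure hides.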

The Experience Curve Finding

The most actionable number in the study: agents with two months of tenure using AI performed as well as agents with more than six months of tenure without AI.

That is a compression of at least four months in the experience curve. For organizations that hire regularly, train continuously, or operate in high-turnover environments, the workforce planning implication is direct: the cost of a new hire’s unproductive ramp period falls substantially with AI assistance.

This is not a theoretical effect. It is measured in resolved issues per hour against a control group.

Beyond Productivity: Three Additional Findings

The study tracks outcomes the productivity number does not capture:

Customer sentiment improved. Customers interacting with AI-assisted agents were more polite, used less hostile language, and were less likely to ask to speak to a supervisor. The AI was helping agents handle escalations more smoothly and communicate more clearly — the improvement showed up in how customers treated them.

Worker attrition fell. Employee turnover declined, driven primarily by retention of newer workers. The mechanism is plausible: newer workers who succeeded faster felt more confident and stayed. For any operation experiencing churn among new hires, this is a measurable cost reduction that does not appear in productivity metrics.

English fluency improved among international agents. Workers for whom English is a second language improved their fluency through AI-assisted interactions. The AI was not just routing queries — it was functioning as a continuous learning tool.

What This Finding Reveals About Most Enterprise AI Deployments

The study’s mechanism — AI encodes top-performer knowledge and distributes it down the skill curve — has an implication most organizations have not processed.

AI has the highest return where the skill gap is largest. The workers who benefit most are not the best — they are the newest, the least experienced, and the most in need of guidance. Organizations that deploy AI primarily as a productivity tool for their high performers are applying it to the population with the lowest marginal return.

This does not mean senior workers should not use AI. It means the deployment architecture should account for where the gains actually are. Teams with wide skill distributions, high new-hire volume, or significant ramp time capture the most value. Teams of uniformly experienced specialists may see smaller gains or, in the worst case, minor quality deterioration among their most experienced members.

The study also identifies a constraint that limits the gains: AI’s advantage is largest for “moderately rare problems” — questions uncommon enough that a new agent lacks experience, but common enough that the system has adequate training data. For highly novel or truly edge-case issues, AI assistance adds less. For rote, high-volume queries, humans were already fast. The value lives in the middle.

The Contrast with METR (July 2025)

The NBER finding and the METR developer study (n=16, 246 tasks, July 2025) are frequently cited as contradictory. They are not. They describe the same pattern:

- NBER: Junior customer service agents benefited substantially (+34%). Senior agents saw small declines.
- METR: Experienced developers (the only population studied) were 19% slower.

Both findings point in the same direction: AI disseminates best practices and accelerates the skill curve. Experienced workers doing complex, self-selected tasks — where their existing expertise exceeds what AI can suggest — see little benefit or mild interference. That is not a contradiction. It is the same effect observed in two different populations at two different skill levels.

The framing error most organizations make is treating either study as “AI works” or “AI doesn’t work.” Both studies say: AI works where the skill gap is largest, and provides diminishing or slightly negative returns where it is smallest.

Key Data Points

| Finding | Number | Source |
| --- | --- | --- |
| Sample size | 5,172 customer support agents | Brynjolfsson, Li, Raymond, QJE (2025) |
| Productivity gain, overall average | +15% (issues/hour) | Brynjolfsson, Li, Raymond, QJE (2025) |
| Productivity gain, novice / lowest-skilled | +34% | Brynjolfsson, Li, Raymond, QJE (2025) |
| Productivity change, most experienced agents | Small speed gain; small quality decline | Brynjolfsson, Li, Raymond, QJE (2025) |
| Experience curve compression | 2 months with AI ≈ 6+ months without AI | Brynjolfsson, Li, Raymond, QJE (2025) |
| Customer outcomes | Improved sentiment; lower escalation rate | Brynjolfsson, Li, Raymond, QJE (2025) |
| Attrition effect | Declined; driven by newer-worker retention | Brynjolfsson, Li, Raymond, QJE (2025) |

What This Means for Your Organization

The NBER study is the most important calibration tool in the enterprise AI literature — not because the 15% headline is reliable for all settings (customer support AI is a narrow context), but because the distribution of gains is almost certainly generalizable: newer and less experienced workers will benefit more than your best people.

The deployment sequence that follows from this: identify the workflows where skill variance is highest and new-hire ramp is longest. Those are the first candidates. Insurance claims review, customer onboarding, regulatory filings, first-level support — wherever the gap between a new employee and a ten-year employee is most costly to the organization.
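One rough way to turn that sequence into a shortlist is to score candidate workflows on the factors named above. The sketch below is a hypothetical heuristic, not something the study proposes: the `Workflow` fields, the multiplicative `priority_score`, and every input value are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    ramp_months: float        # time for a new hire to reach full productivity
    skill_gap: float          # novice-to-expert output gap, 0.0 to 1.0
    new_hires_per_year: int


def priority_score(w: Workflow) -> float:
    """Hypothetical heuristic: larger gap, longer ramp, more hires rank higher."""
    return w.skill_gap * w.ramp_months * w.new_hires_per_year


# Illustrative inputs only; substitute your organization's own figures.
candidates = [
    Workflow("senior specialist desk", ramp_months=3, skill_gap=0.1, new_hires_per_year=2),
    Workflow("insurance claims review", ramp_months=9, skill_gap=0.4, new_hires_per_year=12),
    Workflow("first-level support", ramp_months=6, skill_gap=0.5, new_hires_per_year=40),
]

# Highest score first: these are the first deployment candidates.
ranked = sorted(candidates, key=priority_score, reverse=True)
for w in ranked:
    print(f"{w.name}: {priority_score(w):.1f}")
```

A multiplicative score is only one possible choice. It has the convenient property that a workflow with almost no skill gap or almost no hiring falls to the bottom regardless of its other attributes.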

The experience-curve compression number (two months with AI roughly equals six or more without) also rewrites the new-hire economics calculation. If AI cuts productive ramp time by about four months, the ROI calculation for AI investment should count the salary and management overhead recovered in that compressed period as a benefit, rather than looking only at the per-seat license cost.
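The ramp-time arithmetic can be sketched as follows. This is a minimal back-of-envelope model under stated assumptions: the function names, the idea that a ramping hire delivers a fixed share of full output, and all dollar figures are hypothetical. Only the roughly four-month compression comes from the study, and it may not transfer to other settings.

```python
def ramp_savings_per_hire(monthly_cost, months_saved, ramp_productivity):
    """Value of output recovered when a new hire reaches full speed sooner.

    ramp_productivity is the share of full output a ramping hire delivers
    (0.0 to 1.0), so each saved month recovers the remaining share.
    """
    return monthly_cost * months_saved * (1.0 - ramp_productivity)


def annual_net_value(hires_per_year, seats, license_per_seat_year,
                     monthly_cost, months_saved, ramp_productivity):
    """Ramp savings across all new hires minus total license cost."""
    savings = hires_per_year * ramp_savings_per_hire(
        monthly_cost, months_saved, ramp_productivity)
    license_cost = seats * license_per_seat_year
    return savings - license_cost


# All inputs are hypothetical except months_saved, which borrows the
# study's roughly four-month compression as a planning figure.
net = annual_net_value(
    hires_per_year=20,
    seats=100,
    license_per_seat_year=600,
    monthly_cost=5000,        # loaded salary plus management overhead
    months_saved=4,
    ramp_productivity=0.5,
)
print(f"Net annual value: ${net:,.0f}")
```

The model deliberately ignores the steady-state productivity gain and the attrition effect, so it isolates the one term that a license-cost-only comparison leaves out.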

If the question for your organization is where to deploy first and which workforce segments to prioritize, that is exactly the kind of analysis worth having a specific conversation about — brandon@brandonsneider.com.

Sources

Brynjolfsson, Li, Raymond, “Generative AI at Work” (NBER Working Paper w31161, April 2023; revised November 2023; published QJE Vol. 140, Issue 2, pp. 889–942, February 2025). 5,172 customer support agents, staggered introduction design. Independent academic field study (staggered rollout rather than random assignment), no vendor funding disclosed, top peer-reviewed journal. Credibility: HIGH.
- NBER: https://www.nber.org/papers/w31161
- QJE (peer-reviewed): https://academic.oup.com/qje/article/140/2/889/7990658
- PDF (author): https://danielle.li/assets/docs/GenerativeAIatWork.pdf
METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (July 2025; n=16, 246 tasks). Independent RCT. Cross-referenced for the skill-level pattern comparison. Credibility: HIGH for internal validity, limited generalizability.