Which Specific Business Tasks AI Helps, Hurts, or Leaves Unchanged: The Evidence Base
Research compiled March 2026
Executive Summary
- AI task performance follows a “jagged frontier.” The BCG/Harvard study (n=758 consultants, September 2023) found that tasks inside the AI capability frontier saw 40% quality improvement and 25% speed gains, while a single task outside the frontier caused a 19-percentage-point drop in correct answers. The frontier is irregular and non-obvious — tasks of similar apparent difficulty can fall on opposite sides.
- Experienced workers on familiar codebases get slower, not faster. METR’s RCT (n=16 developers, 246 tasks, July 2025) found AI-assisted work took 19% longer. Developers believed they were 20% faster — a 39-point perception gap. METR’s February 2026 follow-up (57 developers, 800+ tasks) confirmed the pattern.
- Individual output gains do not translate to organizational productivity. Faros AI (n=10,000+ developers, 1,255 teams, July 2025) found 21% more tasks completed per developer but zero improvement in company-level throughput, DORA metrics, or quality KPIs. Review time increased 91%, bugs per developer increased 9%.
- AI compresses skill distributions. The consistent finding across studies is that low performers benefit most (43% improvement for bottom-half BCG consultants, 34% for novice customer service agents) while top performers see minimal or negative impact. This is the most robust finding in the literature.
- The task category that AI most reliably improves is first-draft generation of text and ideas — and the task category it most reliably degrades is anything requiring judgment over data the model has not seen. Between these poles lies a wide gray zone where outcomes depend on implementation quality, user skill, and organizational context.
Section 1: The Landmark Studies
1.1 METR Randomized Controlled Trial (2025)
Source: “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR, July 2025
Sample: 16 experienced open-source developers, 246 real tasks (bug fixes, features, refactors), repositories averaging 22,000+ stars and 1M+ lines of code
Method: Randomized controlled trial — tasks randomly assigned to AI-allowed or AI-disallowed conditions
Tools: Cursor Pro with Claude 3.5/3.7 Sonnet; developers had 50+ hours of Cursor experience
URL: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Core finding: Developers took 19% longer to complete tasks when using AI (95% CI: 2% to 40% longer).
The perception gap: Before tasks, developers predicted AI would speed them up by 24%. After the study concluded, they still estimated a 20% speedup — despite the measured slowdown.
Where the time went (from participant accounts):
- Extended prompting and re-prompting cycles when initial AI output missed codebase conventions
- Reviewing AI-generated code that was syntactically correct but semantically wrong
- Debugging AI suggestions that introduced subtle errors
- One participant reported a test-writing task estimated at 1 hour took 4 hours 20 minutes with AI
- Models repeatedly failed to match existing code style, fabricated constants, and broke project-specific conventions
METR February 2026 follow-up (n=57 developers, 800+ tasks, 143 repositories): Returning developers still showed an estimated 18% slowdown (CI: -38% to +9%). New developers showed a 4% slowdown (CI: -15% to +9%). METR concluded the data provides “an unreliable signal” because experienced developers increasingly refuse to work without AI, even for $150/hour.
What this means for mid-market companies: The METR finding applies specifically to experienced developers on mature, familiar codebases — precisely the scenario most mid-market companies face with their existing engineering teams. The result does not apply to greenfield development, junior developers learning new codebases, or highly repetitive coding tasks.
1.2 BCG/Harvard “Jagged Frontier” Study (2023)
Source: “Navigating the Jagged Technological Frontier,” Dell’Acqua, McFowland, Mollick et al., Harvard Business School Working Paper 24-013, September 2023
Sample: 758 BCG consultants (7% of individual contributor workforce)
Method: Field experiment — consultants randomly assigned to AI-access or no-AI-access conditions across 18 realistic consulting tasks
Tool: GPT-4
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321
Tasks inside the AI frontier (creative and generative work):
- Propose 10+ ideas for a new shoe targeting an underserved market
- Segment the footwear industry market
- Draft a press release for the new product
- Pen an inspirational memo to employees
- Create marketing slogans
- Write a 2,500-word article describing end-to-end process from pitch to launch
Results for inside-frontier tasks:
- 12.2% more tasks completed
- 25.1% faster completion
- 40% higher quality (human-rated)
The task outside the frontier (judgment over unfamiliar data):
- Analyze a business case using interview transcripts and financial spreadsheet data
- Required synthesizing qualitative and quantitative evidence to reach a conclusion
Results for the outside-frontier task:
- Without AI: consultants correct 84% of the time
- With AI: consultants correct only 60-70% of the time — a 19-percentage-point degradation
- Consultants trusted the AI’s confident but incorrect output and failed to apply independent judgment
The skill compression effect:
- Bottom-half performers improved 43% on inside-frontier tasks
- Top-half performers improved 17%
- AI compressed the performance distribution, making the range between best and worst much narrower
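The compression arithmetic can be made concrete. In the sketch below, the uplift percentages are the study's figures; the baseline quality scores are invented for illustration only:

```python
# Illustrative arithmetic: baseline scores are hypothetical, but the uplifts
# are the BCG/Harvard inside-frontier figures (+43% bottom half, +17% top half).
bottom_baseline = 50.0  # hypothetical quality score, bottom-half consultant
top_baseline = 80.0     # hypothetical quality score, top-half consultant

bottom_with_ai = bottom_baseline * 1.43  # 71.5
top_with_ai = top_baseline * 1.17        # 93.6

gap_before = top_baseline - bottom_baseline  # 30.0 points
gap_after = top_with_ai - bottom_with_ai     # ~22.1 points

print(f"Gap before AI: {gap_before:.1f} points")
print(f"Gap after AI:  {gap_after:.1f} points")
print(f"Compression:   {1 - gap_after / gap_before:.0%}")
```

With these (invented) baselines, the best-to-worst gap shrinks by roughly a quarter, which is the qualitative pattern the study reports: both halves improve, but the bottom half improves far more.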
What this means for mid-market companies: The frontier is not where most executives assume. Tasks that feel complex (creative ideation, first-draft writing) often fall inside it. Tasks that feel routine (analyzing a spreadsheet alongside qualitative data) can fall outside it. The danger is not that AI fails obviously — it fails persuasively.
1.3 Brynjolfsson/Stanford Customer Support Study (2023, published QJE 2025)
Source: “Generative AI at Work,” Brynjolfsson, Li, and Raymond, Quarterly Journal of Economics, 2025
Sample: 5,179 customer support agents at a Fortune 500 software company
Method: Staggered rollout — natural experiment as AI tool was introduced to different groups at different times
Tool: GPT-based conversational assistant providing real-time response suggestions
URL: https://www.nber.org/papers/w31161
Results by experience level:
- Overall: 14% increase in issues resolved per hour
- Novice/low-skill workers: 34% improvement in productivity
- Experienced/high-skill workers: minimal speed gain, small decline in quality
- AI effectively disseminated the tacit knowledge of top performers to newer agents
Additional findings:
- Customer sentiment improved
- Employee retention increased (less turnover in AI-assisted group)
- Evidence of genuine worker learning — agents improved even after AI was removed
What this means for mid-market companies: Customer support is the single strongest evidence case for AI deployment. The benefit concentrates in the first 6-12 months of a new agent’s tenure and in handling routine, well-documented issue types. The benefit diminishes sharply for experienced agents handling novel or complex escalations.
1.4 Faros AI Productivity Paradox (2025)
Source: “The AI Productivity Paradox,” Faros AI, July 2025
Sample: 10,000+ developers, 1,255 teams, 6+ companies, up to two years of telemetry data
Method: Observational — correlated AI adoption metrics with delivery outcomes using Spearman rank correlation
Data sources: Task management, IDE telemetry, static code analysis, CI/CD pipelines, version control, incident management, HR metadata
URL: https://www.faros.ai/blog/ai-software-engineering
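Spearman rank correlation, the statistic named in the method above, is an ordinary rank-based measure that needs no external libraries. A minimal sketch (the team-level numbers below are invented for illustration, not Faros data):

```python
# Minimal Spearman rank correlation, stdlib only.

def ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Invented data: per-team AI-adoption score vs. change in delivery throughput.
adoption = [0.1, 0.3, 0.5, 0.7, 0.9]
throughput_delta = [0.0, -0.1, 0.2, 0.1, 0.0]
print(spearman(adoption, throughput_delta))  # weak correlation despite rising adoption
```

Rank-based correlation is a sensible choice for telemetry of this kind because it is robust to outliers and does not assume the adoption-outcome relationship is linear.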
Individual-level gains:
- 21% more tasks completed
- 98% more pull requests merged
- 9% more tasks per day
- 47% more PRs per day
Quality and downstream costs:
- 154% increase in average PR size
- 9% increase in bugs per developer
- 91% increase in PR review time
- 1.7x more issues per PR in AI-generated code (10.83 vs 6.45)
Organizational-level impact:
- Zero measurable improvement in company-level throughput
- Zero improvement in DORA metrics (deployment frequency, lead time, change failure rate)
- Zero improvement in quality KPIs
Why gains evaporate: Individual output increases are absorbed by downstream bottlenecks — code review queues, testing pipelines, integration complexity, and deployment dependencies. Amdahl’s Law: the slowest stage determines total throughput.
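The bottleneck arithmetic can be sketched in a few lines. The stage rates below are invented for illustration; only the mechanism (drafting accelerates, review does not) comes from the findings above:

```python
# Bottleneck arithmetic behind "why gains evaporate": throughput of a serial
# pipeline is capped by its slowest stage, so doubling the drafting stage
# alone leaves end-to-end delivery unchanged. Stage rates are invented.

def pipeline_throughput(stage_rates):
    """Completed items/week for a serial pipeline; the minimum stage wins."""
    return min(stage_rates.values())

stages = {"draft": 20, "review": 8, "test": 10, "deploy": 12}  # items/week
before = pipeline_throughput(stages)

stages["draft"] *= 2  # AI doubles individual drafting output
after = pipeline_throughput(stages)

print(before, after)  # review (8/week) caps delivery in both cases
```

In this toy model the only interventions that move the company-level number are the ones aimed at the review stage, which is exactly where the Faros data shows time piling up (+91% review time).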
Section 2: Task-Level Evidence by Business Function
2.1 Software Engineering
| Task | Evidence | Direction |
|---|---|---|
| Greenfield code (new function, single language) | Peng et al. 2023, n=95: 55.8% faster | HELPS |
| Boilerplate/repetitive code | Microsoft internal, DORA 2025: consensus positive | HELPS |
| Writing unit tests | DORA 2025: 62% of test-writing developers use AI; GitClear: test coverage improvement | HELPS |
| Documentation generation | DORA 2025, Microsoft: reduced time, consensus positive | HELPS |
| Code review assistance | CodeRabbit 2025, n=470 PRs: AI flags issues but generates 1.7x more issues itself | MIXED |
| Debugging familiar codebase | METR 2025: included in 19% slowdown; participant reports of extended debugging cycles | HURTS (experienced devs) |
| Refactoring mature codebase | GitClear 2025: refactoring as share of changes dropped from 25% to <10% | HURTS (less refactoring done) |
| Complex feature in familiar repo | METR 2025: 19% slower overall; one task estimated at 30min took 4h7min | HURTS (experienced devs) |
| Code maintenance over time | GitClear 2025, 211M lines: code churn rose from 3.1% to 5.7%, duplicated blocks up 8x | HURTS |
Key code quality finding (GitClear, 211M lines of code, 2020-2024): As AI adoption increased, duplicated code blocks grew 8x, code requiring revision within two weeks rose from 3.1% to 5.7%, and refactoring as a proportion of changed code dropped from 25% to under 10%. The pattern: AI generates more code, less of it gets cleaned up, and more of it needs immediate rework.
Key organizational finding (DORA 2025, ~5,000 developers surveyed): AI adoption now positively correlates with throughput but continues to correlate negatively with delivery stability. DORA’s conclusion: “AI doesn’t fix a team; it amplifies what’s already there.” A 25% increase in AI adoption is associated with a 7.2% reduction in delivery stability.
2.2 Customer Service
| Task | Evidence | Direction |
|---|---|---|
| Routine ticket triage and routing | Brynjolfsson 2023/2025, n=5,179: 14% more issues/hour | HELPS |
| Response suggestions for common issues | Brynjolfsson: 34% improvement for novice agents | HELPS |
| Ticket deflection (FAQ, password reset) | Industry data: 45%+ deflection rates in retail/travel | HELPS |
| Complex escalations requiring empathy | Qualtrics 2025, n=20,000+ consumers: 4x failure rate vs other AI tasks | HURTS |
| Handling novel/unusual complaints | Qualtrics: nearly 1 in 5 consumers saw no benefit | HURTS |
| Post-call summarization | Microsoft/industry: consensus time savings | HELPS |
Key finding (Qualtrics XM Institute, Q3 2025, n=20,000+ consumers across 14 countries): AI-powered customer service fails at four times the rate of AI applied to other business tasks. Nearly 1 in 5 consumers who used AI for customer service saw no benefit. Consumers rank AI customer service among the worst applications for convenience, time savings, and usefulness. 53% of consumers cite data misuse as their top concern (up 8 points YoY).
2.3 Legal Review and Research
| Task | Evidence | Direction |
|---|---|---|
| Contract clause identification | Industry benchmarks 2025: 96-99% accuracy on structured contracts | HELPS |
| First-pass document review (privilege, relevance) | Deloitte 2025: 88% of legal teams report productivity gains | HELPS |
| Routine contract review speed | Concord 2025: review time from 92 minutes to 26 seconds per contract at 98% accuracy | HELPS |
| Legal research (case citation) | Stanford 2025: Lexis+ AI hallucinated 17%, Westlaw 33%, GPT-4 43% | HURTS |
| Complex legal analysis/reasoning | Stanford 2025: Ask Practical Law AI accurate only 18% of the time | HURTS |
| Brief drafting with novel arguments | Law360 tracker: 729+ documented AI hallucination incidents in legal practice by end 2025 | HURTS |
Key study (Stanford Law/HAI, published 2025): Hallucination rates for leading legal AI tools:
- Lexis+ AI: 17%
- Westlaw AI-Assisted Research: 33%
- Ask Practical Law AI: accurate only 18% of the time
- GPT-4 (general-purpose): 43%
- Earlier Stanford study (2024): general-purpose models hallucinated 58-88% on legal tasks
URL: https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413
Real-world consequences: Federal courts have imposed $50,000+ in fines for AI-generated false citations. 729+ AI hallucination incidents documented by Law360 through end of 2025.
2.4 Financial Analysis and Accounting
| Task | Evidence | Direction |
|---|---|---|
| Invoice/receipt data extraction (OCR) | Industry benchmarks 2025: 97-99% accuracy, error rates below 0.5% | HELPS |
| Accounts payable automation | Industry 2025: 90% error reduction vs manual entry | HELPS |
| Anomaly/fraud detection in transactions | MindBridge, Deloitte studies: improved detection of irregularities | HELPS |
| Routine audit sampling | Systematic reviews 2025: AI enables continuous monitoring vs periodic sampling | HELPS |
| Spreadsheet-based financial analysis | BCG/Harvard 2023: the outside-frontier task that degraded performance 19 points | HURTS |
| Numerical reasoning/calculation | LLM math studies 2025: 13% error rate even with mitigation; struggles with >4-digit multiplication | HURTS |
| Strategic financial judgment | Gartner 2024 CFO survey: “inadequate data quality” cited as #1 challenge | NEUTRAL-TO-HURTS |
Key limitation: LLMs are fundamentally unreliable for numerical reasoning. ChatGPT’s error rate on math tasks has been reduced from 29% to 13% through mitigation techniques, but more than 1 in 10 answers remains wrong. Models struggle especially with multi-step calculations, large-digit multiplication, and problems requiring precise numerical reasoning — core financial analysis skills.
2.5 Marketing and Content Creation
| Task | Evidence | Direction |
|---|---|---|
| First-draft copywriting | BCG/Harvard: inside-frontier, 40% quality improvement | HELPS |
| Ad copy generation/testing | Industry 2025: 450% increase in ad CTR reported | HELPS |
| Email subject line optimization | HubSpot analysis: 41% higher conversion with AI optimization | HELPS |
| Brainstorming/ideation quantity | BCG/Harvard: 12.2% more outputs generated | HELPS |
| Content personalization at scale | McKinsey: marketing & sales identified as highest-value AI use case | HELPS |
| SEO content production | Industry data: consensus time savings for routine SEO | HELPS |
| Brand voice consistency at scale | Doshi & Hauser 2024: AI homogenizes output toward generic styles | HURTS |
| Original thought leadership | Doshi & Hauser: collective diversity of content reduced even as individual quality improves | HURTS |
| Cultural/market-specific nuance | CHI 2025: AI homogenizes writing toward Western styles, diminishes cultural nuances | HURTS |
Key study (Doshi & Hauser, Science Advances, July 2024, n=293 writers, 600 evaluators, 3,519 evaluations): AI-assisted stories were rated as more creative, better written, and more enjoyable — especially for less creative writers. But AI-enabled stories were significantly more similar to each other than human-only stories. The diversity gap widens with more content produced.
URL: https://www.science.org/doi/10.1126/sciadv.adn5290
The homogenization trap: A 2025 follow-up study found that the induced homogeneity not only persists but continues to grow after AI access is removed. Writers do not retain the creative capability the AI provided, yet the generic style remains.
2.6 HR and Talent Screening
| Task | Evidence | Direction |
|---|---|---|
| Resume keyword matching | 83% of companies using by end 2025; effective for volume reduction | HELPS |
| Interview scheduling automation | Industry: consensus time savings | HELPS |
| Job description drafting | Industry: faster first drafts, broader language | HELPS |
| Candidate ranking/scoring | U. of Washington 2024: favors white-associated names 85% of the time | HURTS |
| Diversity screening | U. of Washington: male-associated names favored 52% of the time | HURTS |
| Cultural fit assessment | Multiple studies: AI replicates and amplifies historical hiring biases | HURTS |
| Evaluating non-traditional backgrounds | Workday class action (Mobley v. Workday, July 2024): systematic rejection pattern | HURTS |
Key study (University of Washington/Brookings, October 2024): Tested three state-of-the-art LLMs (Mistral AI, Salesforce, Contextual AI) on 120 names across 500+ real job listings. AI screening tools favored white-associated names 85% of the time and male-associated names 52% of the time. A follow-up study (n=528 people) found humans exposed to AI recommendations mirror the AI’s biases.
Adoption vs. awareness gap: 83% of companies were using AI resume screening by the end of 2025, yet 67% acknowledge their tools could introduce bias. 47% recognize age bias, 44% cite socioeconomic bias, 30% mention gender bias, 26% point to racial bias. The first major class action (Mobley v. Workday) was allowed to proceed in July 2024.
2.7 Administrative and Operational Tasks
| Task | Evidence | Direction |
|---|---|---|
| Meeting summarization | Microsoft 365 Copilot internal study: most-used feature; UK government trial (n=20,000): 26 min/day saved | HELPS |
| Email drafting | Microsoft/Forrester: measurable time savings on routine correspondence | HELPS |
| Calendar scheduling | Industry 2025: 4-5 hours/week saved on coordination | HELPS |
| Document summarization (short, factual) | Vectara 2026 benchmark: top models achieve 0.7-0.9% hallucination rate | HELPS |
| Document summarization (long, complex) | HELMET benchmark 2025: only GPT-4o maintains performance through 128K tokens | MIXED |
| Translation (high-resource language pairs) | Lokalise 2024, n=600 evaluations: Claude 3.5-Sonnet preferred in 78% of cases | HELPS |
| Translation (low-resource languages, technical) | Industry studies 2025: 60-85% accuracy, significant error rates in specialized domains | HURTS |
| Data entry from structured forms | Industry 2025: AI-enhanced OCR achieves 97-99% accuracy vs 96-99% manual | HELPS |
Key finding (UK Government Copilot Trial, 2025, n=20,000 civil servants, 3-month duration): Microsoft 365 Copilot saved government workers an average of 26 minutes per day on administrative tasks. Meeting summarization was the highest-usage feature. The study was conducted by the Central Digital and Data Office across multiple government departments.
Section 3: Patterns Across the Evidence
3.1 The Task Taxonomy That Predicts AI Impact
Based on the cumulative evidence, AI task performance follows a consistent pattern organized by three dimensions:
Dimension 1: Novelty of Input Data
- AI works best on tasks where all relevant information is already in the model’s training data or provided in the prompt
- AI fails on tasks requiring judgment over new, unseen, or ambiguous data (BCG outside-frontier task, METR experienced-developer findings)
Dimension 2: Cost of Errors
- AI works best on tasks where errors are cheap to catch and fix (first drafts, brainstorming, scheduling)
- AI fails on tasks where errors are expensive, subtle, or legally consequential (legal research, financial analysis, HR screening, medical diagnosis)
Dimension 3: Verification Difficulty
- AI works best on tasks where the user can quickly verify output quality (email drafts, meeting summaries, code that compiles)
- AI fails on tasks where verification requires the same expertise as creation (complex code logic, legal reasoning, strategic analysis)
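The three dimensions can be turned into a rough screening heuristic. Everything quantitative below (the 1-5 scale, the weights, the thresholds) is invented for illustration; only the dimension names come from the evidence above:

```python
# Toy screening heuristic over the three dimensions in the taxonomy.
# Scale, thresholds, and labels are invented, not drawn from any cited study.

def ai_fit(novel_data: int, error_cost: int, verify_difficulty: int) -> str:
    """Each input: 1 (low) to 5 (high). Higher total risk means worse AI fit."""
    risk = novel_data + error_cost + verify_difficulty  # ranges 3..15
    if risk <= 6:
        return "deploy with confidence"
    if risk <= 10:
        return "deploy with caution"
    return "significant safeguards or avoid"

# First-draft marketing copy: familiar data, cheap errors, easy verification.
print(ai_fit(novel_data=2, error_cost=1, verify_difficulty=1))
# Legal research with novel citations: unseen data, costly and subtle errors.
print(ai_fit(novel_data=5, error_cost=5, verify_difficulty=5))
```

The point of the sketch is the ordering, not the numbers: any task scoring high on all three dimensions lands in the same bucket as the documented failure cases (legal research, spreadsheet-plus-transcript analysis, HR screening).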
3.2 The Skill Compression Effect
The most robust finding across the entire literature:
| Study | Bottom Performers | Top Performers |
|---|---|---|
| BCG/Harvard (n=758) | +43% quality | +17% quality |
| Brynjolfsson (n=5,179) | +34% productivity | Minimal gain, slight quality decline |
| METR (n=16) | N/A (all experienced) | 19% slower |
| Doshi & Hauser (n=293) | “Especially” helped less creative writers | Smaller improvement |
Implication for mid-market companies: AI is most valuable as a floor-raiser, not a ceiling-raiser. Deploy it to accelerate onboarding, support less experienced staff, and standardize baseline quality. Do not expect it to make top performers better — the evidence consistently shows it does not.
3.3 The Organizational Throughput Paradox
Three independent data sources confirm the same pattern:
- Faros AI (2025): 21% more tasks per developer, zero improvement in company-level metrics
- DORA 2025: AI correlates with higher throughput but lower stability at the organizational level
- Upwork (2024, n=2,500): 77% of employees say AI has added to their workload; 71% report burnout
The mechanism: AI accelerates the fastest stage of a process (typically creation/drafting) while leaving the slowest stages unchanged (review, approval, integration, deployment). Because throughput is constrained by the bottleneck, speeding up creation produces more work-in-progress, not more completed work. The Faros data shows this precisely: 98% more PRs merged, 91% longer review times, zero change in delivery velocity.
Upwork study details (Walr survey, April-May 2024, n=2,500 across US, UK, Australia, Canada — 1,250 C-suite, 625 employees, 625 freelancers):
- 96% of C-suite leaders expect AI to increase productivity
- 77% of employees say AI has added to their workload
- 39% spend more time reviewing/moderating AI-generated content
- 23% spend more time learning to use AI tools
- 47% of employees using AI say they don’t know how to achieve expected productivity gains
- 71% of full-time employees are burned out
3.4 Hallucination Rates by Task Type
| Task Type | Hallucination Rate | Source |
|---|---|---|
| Short factual summarization (with source) | 0.7-0.9% (top models) | Vectara Hallucination Leaderboard, March 2026 |
| General knowledge questions | 2-5% (top models), 9.2% (average) | Multiple benchmarks, 2025 |
| Legal research (specialized tools) | 17% (Lexis+ AI) to 33% (Westlaw) | Stanford Law, 2025 |
| Legal research (general-purpose LLM) | 43-88% | Stanford Law, 2024-2025 |
| Open-ended factual questions | 33% (o3) to 48% (o4-mini) on PersonQA | OpenAI PersonQA benchmark, 2025 |
| News source identification | 94% wrong (Grok-3) | Columbia Journalism Review, March 2025 |
| Medical text summarization | 1.47% hallucination, 3.45% omission | Nature Digital Medicine, 2025 |
The critical pattern: Hallucination rates drop sharply when the model works from supplied source documents (summarization) and rise sharply on open-ended or domain-specific questions. This maps directly to the BCG “inside vs. outside the frontier” distinction. The frontier is partly defined by whether the model is interpolating from given data or extrapolating into unseen territory.
Section 4: The Practical Decision Framework
For a mid-market company evaluating where to deploy AI, the evidence supports this hierarchy:
Deploy with confidence (strong evidence of benefit):
- Meeting summarization and administrative text — highest adoption, lowest risk, most consistent time savings
- Customer service tier-1 (routine tickets, response suggestions for new agents) — the Brynjolfsson study is the gold standard; 14% overall, 34% for novice agents
- Invoice/receipt processing and data extraction — 97-99% accuracy, measurable cost reduction
- First-draft content generation (marketing copy, email drafts, internal communications) — consistent quality and speed improvements, but with human editing
- Boilerplate code generation and documentation — consensus positive across studies, low risk
Deploy with caution (mixed evidence, requires controls):
- Contract review (first-pass clause identification) — high accuracy on structured documents, but hallucination risk on novel clauses
- Translation (high-resource language pairs, non-critical content) — good for speed, requires human review for anything published
- Code review assistance — useful for flagging issues, but AI-generated code itself has 1.7x more issues
- Marketing personalization at scale — strong ROI evidence, but watch for homogenization over time
Deploy with significant safeguards or avoid:
- Legal research and case citation — 17-43% hallucination rate; $50,000+ in court fines documented
- Financial analysis requiring numerical reasoning — 13%+ error rate on calculations; the BCG outside-frontier task
- HR candidate screening and ranking — systematic racial and gender bias documented; active class-action litigation
- Complex customer service escalations — 4x failure rate vs other AI applications
- Any task requiring judgment over data not in the prompt — the consistent failure point across all studies
Source Index
| # | Source | Date | Sample | Independence Rating | URL |
|---|---|---|---|---|---|
| 1 | METR RCT | July 2025 | 16 devs, 246 tasks | Independent (nonprofit) | https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ |
| 2 | METR Follow-up | Feb 2026 | 57 devs, 800+ tasks | Independent (nonprofit) | https://metr.org/blog/2026-02-24-uplift-update/ |
| 3 | BCG/Harvard Jagged Frontier | Sept 2023 | 758 consultants | Academic (HBS, MIT, Wharton) | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321 |
| 4 | Brynjolfsson/Stanford | 2023/QJE 2025 | 5,179 agents | Academic (Stanford, NBER) | https://www.nber.org/papers/w31161 |
| 5 | Faros AI Paradox | July 2025 | 10,000+ devs, 1,255 teams | Vendor (but observational telemetry) | https://www.faros.ai/blog/ai-software-engineering |
| 6 | DORA 2025 | Sept 2025 | ~5,000 developers | Industry (Google, but peer-reviewed methodology) | https://dora.dev/research/2025/dora-report/ |
| 7 | GitClear Code Quality | 2025 | 211M lines of code | Independent (analytics vendor) | https://www.gitclear.com/ai_assistant_code_quality_2025_research |
| 8 | Stanford Law Hallucinations | 2025 | Legal benchmarks | Academic (Stanford Law) | https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413 |
| 9 | Qualtrics CX Study | Q3 2025 | 20,000+ consumers, 14 countries | Independent (XM Institute) | https://www.qualtrics.com/articles/news/ai-powered-customer-service-fails-at-four-times-the-rate-of-other-tasks/ |
| 10 | Doshi & Hauser Creativity | July 2024 | 293 writers, 600 evaluators | Academic (Science Advances) | https://www.science.org/doi/10.1126/sciadv.adn5290 |
| 11 | U. of Washington HR Bias | Oct 2024 | 120 names, 500+ job listings, 3 LLMs | Academic (UW/Brookings) | https://www.brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/ |
| 12 | Upwork Productivity Survey | July 2024 | 2,500 workers (US, UK, AU, CA) | Independent (Upwork Research Institute) | https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c |
| 13 | McKinsey State of AI | March 2025 | ~1,993 respondents | Consulting (McKinsey/QuantumBlack) | https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value |
| 14 | UK Government Copilot Trial | 2025 | 20,000 civil servants | Government (UK CDDO) | https://www.geekwire.com/2025/microsoft-ai-tools-saved-british-government-workers-26-minutes-a-day-new-study-shows/ |
| 15 | Peng et al. (GitHub Copilot) | Feb 2023 | 95 developers | Vendor-funded (GitHub) | https://arxiv.org/abs/2302.06590 |
| 16 | CodeRabbit PR Quality | Dec 2025 | 470 PRs | Vendor (CodeRabbit) | https://www.coderabbit.ai/ |
| 17 | Stanford HAI AI Index | 2025 | 456-page report | Academic (Stanford HAI) | https://hai.stanford.edu/ai-index/2025-ai-index-report |
| 18 | Cui et al. / MIT-Wharton | 2024 | 1,663 (Microsoft) + 311 (Accenture) | Academic (MIT, Wharton) | https://mit-genai.pubpub.org/pub/v5iixksv |
| 19 | Uplevel Data Labs | Sept 2024 | 785 developers | Independent (analytics vendor) | https://www.uplevelteam.com/ |
Methodological Notes
Independence ratings explained:
- Independent (nonprofit/academic): No financial relationship with AI vendors. Highest credibility.
- Industry (peer-reviewed methodology): Funded or conducted by companies with AI products, but uses accepted research methodology. Useful but discount accordingly.
- Vendor (observational telemetry): Company selling AI tools or analytics. Data may be sound but selection effects and presentation bias are likely. Cross-reference with independent sources.
- Vendor-funded: Study commissioned and funded by an AI vendor. Treat as marketing data unless corroborated.
What the evidence does not tell us:
- Almost no studies examine AI impact over time horizons longer than six months
- Almost no studies examine AI deployment at companies with 200-5,000 employees specifically
- The METR and BCG studies are the only true randomized experiments; everything else is observational with significant confounders
- The legal, financial, and HR evidence is mostly from benchmarks and case law, not controlled experiments
- No study has measured the long-term organizational effects of AI-induced skill atrophy or homogenization