Which Specific Business Tasks AI Helps, Hurts, or Leaves Unchanged: The Evidence Base

Research compiled March 2026


Executive Summary

  • AI task performance follows a “jagged frontier.” The BCG/Harvard study (n=758 consultants, September 2023) found that tasks inside the AI capability frontier saw 40% quality improvement and 25% speed gains, while a single task outside the frontier caused a 19-percentage-point drop in correct answers. The frontier is irregular and non-obvious — tasks of similar apparent difficulty can fall on opposite sides.
  • Experienced workers on familiar codebases get slower, not faster. METR’s RCT (n=16 developers, 246 tasks, July 2025) found AI-assisted work took 19% longer. Developers believed they were 20% faster — a 39-point perception gap. METR’s February 2026 follow-up (57 developers, 800+ tasks) confirmed the pattern.
  • Individual output gains do not translate to organizational productivity. Faros AI (n=10,000+ developers, 1,255 teams, July 2025) found 21% more tasks completed per developer but zero improvement in company-level throughput, DORA metrics, or quality KPIs. Review time increased 91%, bugs per developer increased 9%.
  • AI compresses skill distributions. The consistent finding across studies is that low performers benefit most (43% improvement for bottom-half BCG consultants, 34% for novice customer service agents) while top performers see minimal or negative impact. This is the most robust finding in the literature.
  • The task category that AI most reliably improves is first-draft generation of text and ideas — and the task category it most reliably degrades is anything requiring judgment over data the model has not seen. Between these poles lies a wide gray zone where outcomes depend on implementation quality, user skill, and organizational context.

Section 1: The Landmark Studies

1.1 METR Randomized Controlled Trial (2025)

Source: “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR, July 2025
Sample: 16 experienced open-source developers, 246 real tasks (bug fixes, features, refactors), repositories averaging 22,000+ stars and 1M+ lines of code
Method: Randomized controlled trial — tasks randomly assigned to AI-allowed or AI-disallowed conditions
Tools: Cursor Pro with Claude 3.5/3.7 Sonnet; developers had 50+ hours of Cursor experience
URL: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Core finding: Developers took 19% longer to complete tasks when using AI (95% CI: 2% to 40% longer).

The perception gap: Before tasks, developers predicted AI would speed them up by 24%. After the study concluded, they still estimated a 20% speedup — despite the measured slowdown.

Where the time went (from participant accounts):

  • Extended prompting and re-prompting cycles when initial AI output missed codebase conventions
  • Reviewing AI-generated code that was syntactically correct but semantically wrong
  • Debugging AI suggestions that introduced subtle errors
  • One participant reported a test-writing task estimated at 1 hour took 4 hours 20 minutes with AI
  • Models repeatedly struggled to match existing code style, avoid fabricating constants, and maintain project-specific conventions

METR February 2026 follow-up (n=57 developers, 800+ tasks, 143 repositories): Returning developers still showed an estimated 18% slowdown (95% CI spanning a 38% slowdown to a 9% speedup). New developers showed a 4% slowdown (CI spanning a 15% slowdown to a 9% speedup). METR concluded the data provides “an unreliable signal” because experienced developers increasingly refuse to work without AI, even for $150/hour.

What this means for mid-market companies: The METR finding applies specifically to experienced developers on mature, familiar codebases — precisely the scenario most mid-market companies face with their existing engineering teams. The result does not apply to greenfield development, junior developers learning new codebases, or highly repetitive coding tasks.


1.2 BCG/Harvard “Jagged Frontier” Study (2023)

Source: “Navigating the Jagged Technological Frontier,” Dell’Acqua, McFowland, Mollick et al., Harvard Business School Working Paper 24-013, September 2023
Sample: 758 BCG consultants (7% of individual contributor workforce)
Method: Field experiment — consultants randomly assigned to AI-access or no-AI-access conditions across 18 realistic consulting tasks
Tool: GPT-4
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321

Tasks inside the AI frontier (creative and generative work):

  • Propose 10+ ideas for a new shoe targeting an underserved market
  • Segment the footwear industry market
  • Draft a press release for the new product
  • Pen an inspirational memo to employees
  • Create marketing slogans
  • Write a 2,500-word article describing end-to-end process from pitch to launch

Results for inside-frontier tasks:

  • 12.2% more tasks completed
  • 25.1% faster completion
  • 40% higher quality (human-rated)

The task outside the frontier (judgment over unfamiliar data):

  • Analyze a business case using interview transcripts and financial spreadsheet data
  • Required synthesizing qualitative and quantitative evidence to reach a conclusion

Results for the outside-frontier task:

  • Without AI: consultants correct 84% of the time
  • With AI: consultants correct only 60-70% of the time — a 19-percentage-point degradation
  • Consultants trusted the AI’s confident but incorrect output and failed to apply independent judgment

The skill compression effect:

  • Bottom-half performers improved 43% on inside-frontier tasks
  • Top-half performers improved 17%
  • AI compressed the performance distribution, making the range between best and worst much narrower

What this means for mid-market companies: The frontier is not where most executives assume. Tasks that feel complex (creative ideation, first-draft writing) often fall inside it. Tasks that feel routine (analyzing a spreadsheet alongside qualitative data) can fall outside it. The danger is not that AI fails obviously — it fails persuasively.


1.3 Brynjolfsson/Stanford Customer Support Study (2023, published QJE 2025)

Source: “Generative AI at Work,” Brynjolfsson, Li, and Raymond, Quarterly Journal of Economics, 2025
Sample: 5,179 customer support agents at a Fortune 500 software company
Method: Staggered rollout — natural experiment as AI tool was introduced to different groups at different times
Tool: GPT-based conversational assistant providing real-time response suggestions
URL: https://www.nber.org/papers/w31161

Results by experience level:

  • Overall: 14% increase in issues resolved per hour
  • Novice/low-skill workers: 34% improvement in productivity
  • Experienced/high-skill workers: minimal speed gain, small decline in quality
  • AI effectively disseminated the tacit knowledge of top performers to newer agents

Additional findings:

  • Customer sentiment improved
  • Employee retention increased (less turnover in AI-assisted group)
  • Evidence of genuine worker learning — agents improved even after AI was removed

What this means for mid-market companies: Customer support is the single strongest evidence case for AI deployment. The benefit concentrates in the first 6-12 months of a new agent’s tenure and in handling routine, well-documented issue types. The benefit diminishes sharply for experienced agents handling novel or complex escalations.


1.4 Faros AI Productivity Paradox (2025)

Source: “The AI Productivity Paradox,” Faros AI, July 2025
Sample: 10,000+ developers, 1,255 teams, 6+ companies, up to two years of telemetry data
Method: Observational — correlated AI adoption metrics with delivery outcomes using Spearman rank correlation
Data sources: Task management, IDE telemetry, static code analysis, CI/CD pipelines, version control, incident management, HR metadata
URL: https://www.faros.ai/blog/ai-software-engineering

Individual-level gains:

  • 21% more tasks completed
  • 98% more pull requests merged
  • 9% more tasks per day
  • 47% more PRs per day

Quality and downstream costs:

  • 154% increase in average PR size
  • 9% increase in bugs per developer
  • 91% increase in PR review time
  • 1.7x more issues per PR in AI-generated code (10.83 vs 6.45)

Organizational-level impact:

  • Zero measurable improvement in company-level throughput
  • Zero improvement in DORA metrics (deployment frequency, lead time, change failure rate)
  • Zero improvement in quality KPIs
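The Spearman rank correlation Faros used to relate adoption metrics to these outcomes can be sketched in pure Python. The data below is invented to mirror the flat organizational-level pattern, and the helper names are ours, not Faros's:

```python
# Minimal Spearman rank correlation, stdlib only.
# Sample data is hypothetical: per-team AI adoption vs. delivery lead time.

def ranks(xs):
    """Rank values 1..n (no tie handling; fine for illustration)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho = Pearson correlation applied to the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

adoption = [0.1, 0.3, 0.5, 0.7, 0.9]   # share of AI-assisted commits per team
lead_time = [6.0, 5.5, 6.2, 5.8, 6.1]  # delivery lead time in days; nearly flat

print(round(spearman(adoption, lead_time), 2))  # → 0.3 (weak, near-zero relationship)
```

A rank correlation near zero, as in this toy data, is exactly the "zero organizational improvement" signature: adoption rises across teams while delivery outcomes stay flat.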

Why gains evaporate: Individual output increases are absorbed by downstream bottlenecks — code review queues, testing pipelines, integration complexity, and deployment dependencies. Amdahl’s Law: the slowest stage determines total throughput.


Section 2: Task-Level Evidence by Business Function

2.1 Software Engineering

Task | Evidence | Direction
Greenfield code (new function, single language) | Peng et al. 2023, n=95: 55.8% faster | HELPS
Boilerplate/repetitive code | Microsoft internal, DORA 2025: consensus positive | HELPS
Writing unit tests | DORA 2025: 62% of test-writing developers use AI; GitClear: test coverage improvement | HELPS
Documentation generation | DORA 2025, Microsoft: reduced time, consensus positive | HELPS
Code review assistance | CodeRabbit 2025, n=470 PRs: AI flags issues but generates 1.7x more issues itself | MIXED
Debugging familiar codebase | METR 2025: included in 19% slowdown; participant reports of extended debugging cycles | HURTS (experienced devs)
Refactoring mature codebase | GitClear 2025: refactoring as share of changes dropped from 25% to <10% | HURTS (less refactoring done)
Complex feature in familiar repo | METR 2025: 19% slower overall; one task estimated at 30min took 4h7min | HURTS (experienced devs)
Code maintenance over time | GitClear 2025, 211M lines: code churn rose from 3.1% to 5.7%, duplicated blocks up 8x | HURTS

Key code quality finding (GitClear, 211M lines of code, 2020-2024): As AI adoption increased, duplicated code blocks grew 8x, code requiring revision within two weeks rose from 3.1% to 5.7%, and refactoring as a proportion of changed code dropped from 25% to under 10%. The pattern: AI generates more code, less of it gets cleaned up, and more of it needs immediate rework.

Key organizational finding (DORA 2025, ~5,000 developers surveyed): AI adoption now positively correlates with throughput but continues to correlate negatively with delivery stability. DORA’s conclusion: “AI doesn’t fix a team; it amplifies what’s already there.” A 25% increase in AI adoption is associated with a 7.2% reduction in delivery stability.


2.2 Customer Service

Task | Evidence | Direction
Routine ticket triage and routing | Brynjolfsson 2023/2025, n=5,179: 14% more issues/hour | HELPS
Response suggestions for common issues | Brynjolfsson: 34% improvement for novice agents | HELPS
Ticket deflection (FAQ, password reset) | Industry data: 45%+ deflection rates in retail/travel | HELPS
Complex escalations requiring empathy | Qualtrics 2025, n=20,000+ consumers: 4x failure rate vs other AI tasks | HURTS
Handling novel/unusual complaints | Qualtrics: nearly 1 in 5 consumers saw no benefit | HURTS
Post-call summarization | Microsoft/industry: consensus time savings | HELPS

Key finding (Qualtrics XM Institute, Q3 2025, n=20,000+ consumers across 14 countries): AI-powered customer service fails at four times the rate of AI applied to other business tasks. Nearly 1 in 5 consumers who used AI for customer service saw no benefit. Consumers rank AI customer service among the worst applications for convenience, time savings, and usefulness. 53% of consumers cite data misuse as their top concern (up 8 points YoY).

URL: https://www.qualtrics.com/articles/news/ai-powered-customer-service-fails-at-four-times-the-rate-of-other-tasks/


2.3 Legal Work

Task | Evidence | Direction
Contract clause identification | Industry benchmarks 2025: 96-99% accuracy on structured contracts | HELPS
First-pass document review (privilege, relevance) | Deloitte 2025: 88% of legal teams report productivity gains | HELPS
Routine contract review speed | Concord 2025: review time from 92 minutes to 26 seconds per contract at 98% accuracy | HELPS
Legal research (case citation) | Stanford 2025: Lexis+ AI hallucinated 17%, Westlaw 33%, GPT-4 43% | HURTS
Complex legal analysis/reasoning | Stanford 2025: Ask Practical Law AI accurate only 18% of the time | HURTS
Brief drafting with novel arguments | Law360 tracker: 729+ documented AI hallucination incidents in legal practice by end 2025 | HURTS

Key study (Stanford Law/HAI, published 2025): Hallucination rates for leading legal AI tools:

  • Lexis+ AI: 17%
  • Westlaw AI-Assisted Research: 33%
  • Ask Practical Law AI: accurate only 18% of the time
  • GPT-4 (general-purpose): 43%
  • Earlier Stanford study (2024): general-purpose models hallucinated 58-88% on legal tasks

URL: https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413

Real-world consequences: Federal courts have imposed $50,000+ in fines for AI-generated false citations. 729+ AI hallucination incidents documented by Law360 through end of 2025.


2.4 Financial Analysis and Accounting

Task | Evidence | Direction
Invoice/receipt data extraction (OCR) | Industry benchmarks 2025: 97-99% accuracy, error rates below 0.5% | HELPS
Accounts payable automation | Industry 2025: 90% error reduction vs manual entry | HELPS
Anomaly/fraud detection in transactions | MindBridge, Deloitte studies: improved detection of irregularities | HELPS
Routine audit sampling | Systematic reviews 2025: AI enables continuous monitoring vs periodic sampling | HELPS
Spreadsheet-based financial analysis | BCG/Harvard 2023: the outside-frontier task that degraded performance 19 points | HURTS
Numerical reasoning/calculation | LLM math studies 2025: 13% error rate even with mitigation; struggles with >4-digit multiplication | HURTS
Strategic financial judgment | Gartner 2024 CFO survey: “inadequate data quality” cited as #1 challenge | NEUTRAL-TO-HURTS

Key limitation: LLMs are fundamentally unreliable for numerical reasoning. ChatGPT’s error rate on math tasks has been reduced from 29% to 13% through mitigation techniques, but more than 1 in 10 answers remains wrong. Models struggle especially with multi-step calculations, large-digit multiplication, and problems requiring precise numerical reasoning — core financial analysis skills.
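The standard mitigation implied by this limitation is to route arithmetic out of the model entirely: any figure the model claims is recomputed deterministically, and mismatches are flagged for human review. A minimal sketch (the function name is ours, for illustration):

```python
# Verify a model-claimed product deterministically rather than trusting it.
# Python ints are arbitrary-precision, so the recomputation is exact even
# for the large-digit multiplications where LLMs degrade most.

def verify_product(a: int, b: int, claimed: int) -> bool:
    """Recompute a*b exactly and compare against the model's claimed value."""
    return a * b == claimed

# Correct value passes; a plausible-looking wrong answer is caught.
print(verify_product(48271, 69337, 3346966327))  # → True
print(verify_product(48271, 69337, 3347466000))  # → False: flag for review
```

The same pattern generalizes: let the model decide *which* calculation to run, and let deterministic code produce the number.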


2.5 Marketing and Content Creation

Task | Evidence | Direction
First-draft copywriting | BCG/Harvard: inside-frontier, 40% quality improvement | HELPS
Ad copy generation/testing | Industry 2025: 450% increase in ad CTR reported | HELPS
Email subject line optimization | HubSpot analysis: 41% higher conversion with AI optimization | HELPS
Brainstorming/ideation quantity | BCG/Harvard: 12.2% more outputs generated | HELPS
Content personalization at scale | McKinsey: marketing & sales identified as highest-value AI use case | HELPS
SEO content production | Industry data: consensus time savings for routine SEO | HELPS
Brand voice consistency at scale | Doshi & Hauser 2024: AI homogenizes output toward generic styles | HURTS
Original thought leadership | Doshi & Hauser: collective diversity of content reduced even as individual quality improves | HURTS
Cultural/market-specific nuance | CHI 2025: AI homogenizes writing toward Western styles, diminishes cultural nuances | HURTS

Key study (Doshi & Hauser, Science Advances, July 2024, n=293 writers, 600 evaluators, 3,519 evaluations): AI-assisted stories were rated as more creative, better written, and more enjoyable — especially for less creative writers. But AI-enabled stories were significantly more similar to each other than human-only stories. The diversity gap widens with more content produced.

URL: https://www.science.org/doi/10.1126/sciadv.adn5290

The homogenization trap: A 2025 follow-up study found that the induced homogeneity persists, and even increases, after AI is removed. Writers do not retain the creative capability the AI provided — they lose it — yet the generic style remains.


2.6 HR and Talent Screening

Task | Evidence | Direction
Resume keyword matching | 83% of companies using by end 2025; effective for volume reduction | HELPS
Interview scheduling automation | Industry: consensus time savings | HELPS
Job description drafting | Industry: faster first drafts, broader language | HELPS
Candidate ranking/scoring | U. of Washington 2024: favors white-associated names 85% of the time | HURTS
Diversity screening | U. of Washington: male-associated names favored 52% of the time | HURTS
Cultural fit assessment | Multiple studies: AI replicates and amplifies historical hiring biases | HURTS
Evaluating non-traditional backgrounds | Workday class action (Mobley v. Workday, July 2024): systematic rejection pattern | HURTS

Key study (University of Washington/Brookings, October 2024): Tested three state-of-the-art LLMs (Mistral AI, Salesforce, Contextual AI) on 120 names across 500+ real job listings. AI screening tools favored white-associated names 85% of the time and male-associated names 52% of the time. A follow-up study (n=528 people) found humans exposed to AI recommendations mirror the AI’s biases.

URL: https://www.brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/

Adoption vs. awareness gap: 83% of companies will use AI resume screening by end 2025, yet 67% acknowledge their tools could introduce bias. 47% recognize age bias, 44% cite socioeconomic bias, 30% mention gender bias, 26% point to racial bias. The first major class action (Mobley v. Workday) was allowed to proceed in July 2024.


2.7 Administrative and Operational Tasks

Task | Evidence | Direction
Meeting summarization | Microsoft 365 Copilot internal study: most-used feature; UK government trial (n=20,000): 26 min/day saved | HELPS
Email drafting | Microsoft/Forrester: measurable time savings on routine correspondence | HELPS
Calendar scheduling | Industry 2025: 4-5 hours/week saved on coordination | HELPS
Document summarization (short, factual) | Vectara 2026 benchmark: top models achieve 0.7-0.9% hallucination rate | HELPS
Document summarization (long, complex) | HELMET benchmark 2025: only GPT-4o maintains performance through 128K tokens | MIXED
Translation (high-resource language pairs) | Lokalise 2024, n=600 evaluations: Claude 3.5-Sonnet preferred in 78% of cases | HELPS
Translation (low-resource languages, technical) | Industry studies 2025: 60-85% accuracy, significant error rates in specialized domains | HURTS
Data entry from structured forms | Industry 2025: AI-enhanced OCR achieves 97-99% accuracy vs 96-99% manual | HELPS

Key finding (UK Government Copilot Trial, 2025, n=20,000 civil servants, 3-month duration): Microsoft 365 Copilot saved government workers an average of 26 minutes per day on administrative tasks. Meeting summarization was the highest-usage feature. The study was conducted by the Central Digital and Data Office across multiple government departments.


Section 3: Patterns Across the Evidence

3.1 The Task Taxonomy That Predicts AI Impact

Based on the cumulative evidence, AI task performance follows a consistent pattern organized by three dimensions:

Dimension 1: Novelty of Input Data

  • AI works best on tasks where all relevant information is already in the model’s training data or provided in the prompt
  • AI fails on tasks requiring judgment over new, unseen, or ambiguous data (BCG outside-frontier task, METR experienced-developer findings)

Dimension 2: Cost of Errors

  • AI works best on tasks where errors are cheap to catch and fix (first drafts, brainstorming, scheduling)
  • AI fails on tasks where errors are expensive, subtle, or legally consequential (legal research, financial analysis, HR screening, medical diagnosis)

Dimension 3: Verification Difficulty

  • AI works best on tasks where the user can quickly verify output quality (email drafts, meeting summaries, code that compiles)
  • AI fails on tasks where verification requires the same expertise as creation (complex code logic, legal reasoning, strategic analysis)
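The three dimensions can be read as a rough screening rubric. The scoring below is our illustration of the framework, not part of any cited study; the thresholds and labels are assumptions chosen to line up with the three tiers in Section 4:

```python
# Rough screening rubric over the three dimensions above.
# Thresholds and tier labels are illustrative, not from the cited studies.

def screen_task(novel_data: bool, costly_errors: bool, hard_to_verify: bool) -> str:
    """Map the three risk dimensions to a deployment recommendation."""
    risk = sum([novel_data, costly_errors, hard_to_verify])
    if risk == 0:
        return "deploy with confidence"
    if risk == 1:
        return "deploy with caution"
    return "significant safeguards or avoid"

# First-draft marketing copy: known data, cheap errors, easy to verify.
print(screen_task(False, False, False))  # → deploy with confidence

# Legal research: novel matters, costly errors, verification needs expertise.
print(screen_task(True, True, True))     # → significant safeguards or avoid
```

The point of the sketch is that the dimensions compound: a task needs to clear all three to be a low-risk deployment, which is why the "deploy with confidence" list in Section 4 is so short.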

3.2 The Skill Compression Effect

The most robust finding across the entire literature:

Study | Bottom Performers | Top Performers
BCG/Harvard (n=758) | +43% quality | +17% quality
Brynjolfsson (n=5,179) | +34% productivity | Minimal gain, slight quality decline
METR (n=16) | N/A (all experienced) | 19% slower
Doshi & Hauser (n=293) | “Especially” helped less creative writers | Smaller improvement

Implication for mid-market companies: AI is most valuable as a floor-raiser, not a ceiling-raiser. Deploy it to accelerate onboarding, support less experienced staff, and standardize baseline quality. Do not expect it to make top performers better — the evidence consistently shows it does not.

3.3 The Organizational Throughput Paradox

Three independent data sources confirm the same pattern:

  1. Faros AI (2025): 21% more tasks per developer, zero improvement in company-level metrics
  2. DORA 2025: AI correlates with higher throughput but lower stability at the organizational level
  3. Upwork (2024, n=2,500): 77% of employees say AI has added to their workload; 71% report burnout

The mechanism: AI accelerates the fastest stage of a process (typically creation/drafting) while leaving the slowest stages unchanged (review, approval, integration, deployment). Because throughput is constrained by the bottleneck, speeding up creation produces more work-in-progress, not more completed work. The Faros data shows this precisely: 98% more PRs merged, 91% longer review times, zero change in delivery velocity.
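The arithmetic behind this mechanism can be made concrete. The figures below are hypothetical, chosen only to mirror the Faros pattern: creation nearly doubles, review slows per item, and completed throughput falls:

```python
# Illustrative two-stage pipeline: completed throughput is capped by the
# slowest stage, so accelerating only the fast stage does not help.
# All numbers are hypothetical, chosen to mirror the pattern in the text.

def pipeline_throughput(creation_rate: float, review_rate: float) -> float:
    """Items completed per week is limited by the bottleneck stage."""
    return min(creation_rate, review_rate)

# Before AI: developers draft 10 PRs/week, reviewers clear 8 PRs/week.
before = pipeline_throughput(creation_rate=10, review_rate=8)

# After AI: drafting nearly doubles (cf. 98% more PRs in the Faros data),
# but review slows per item (cf. 91% longer review times).
after = pipeline_throughput(creation_rate=10 * 1.98, review_rate=8 / 1.91)

print(before)  # 8 completed per week
print(after)   # ~4.2 completed per week: more work-in-progress, less delivered
```

Under these assumptions the organization ships *fewer* finished items after AI adoption, even though individual drafting output has almost doubled.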

Upwork study details (Walr survey, April-May 2024, n=2,500 across US, UK, Australia, Canada — 1,250 C-suite, 625 employees, 625 freelancers):

  • 96% of C-suite leaders expect AI to increase productivity
  • 77% of employees say AI has added to their workload
  • 39% spend more time reviewing/moderating AI-generated content
  • 23% spend more time learning to use AI tools
  • 47% of employees using AI say they don’t know how to achieve expected productivity gains
  • 71% of full-time employees are burned out

URL: https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c

3.4 Hallucination Rates by Task Type

Task Type | Hallucination Rate | Source
Short factual summarization (with source) | 0.7-0.9% (top models) | Vectara Hallucination Leaderboard, March 2026
General knowledge questions | 2-5% (top models), 9.2% (average) | Multiple benchmarks, 2025
Legal research (specialized tools) | 17% (Lexis+ AI) to 33% (Westlaw) | Stanford Law, 2025
Legal research (general-purpose LLM) | 43-88% | Stanford Law, 2024-2025
Open-ended factual questions | 33% (o3) to 48% (o4-mini) on PersonQA | OpenAI PersonQA benchmark, 2025
News source identification | 94% wrong (Grok-3) | Columbia Journalism Review, March 2025
Medical text summarization | 1.47% hallucination, 3.45% omission | Nature Digital Medicine, 2025

The critical pattern: Hallucination rates improve dramatically when the model has source documents to work from (summarization) and worsen dramatically on open-ended or domain-specific questions. This maps directly to the BCG “inside vs. outside the frontier” distinction. The frontier is partly defined by whether the model is interpolating from given data or extrapolating into unseen territory.


Section 4: The Practical Decision Framework

For a mid-market company evaluating where to deploy AI, the evidence supports this hierarchy:

Deploy with confidence (strong evidence of benefit):

  1. Meeting summarization and administrative text — highest adoption, lowest risk, most consistent time savings
  2. Customer service tier-1 (routine tickets, response suggestions for new agents) — the Brynjolfsson study is the gold standard; 14% overall, 34% for novice agents
  3. Invoice/receipt processing and data extraction — 97-99% accuracy, measurable cost reduction
  4. First-draft content generation (marketing copy, email drafts, internal communications) — consistent quality and speed improvements, but with human editing
  5. Boilerplate code generation and documentation — consensus positive across studies, low risk

Deploy with caution (mixed evidence, requires controls):

  1. Contract review (first-pass clause identification) — high accuracy on structured documents, but hallucination risk on novel clauses
  2. Translation (high-resource language pairs, non-critical content) — good for speed, requires human review for anything published
  3. Code review assistance — useful for flagging issues, but AI-generated code itself has 1.7x more issues
  4. Marketing personalization at scale — strong ROI evidence, but watch for homogenization over time

Deploy with significant safeguards or avoid:

  1. Legal research and case citation — 17-43% hallucination rate; $50,000+ in court fines documented
  2. Financial analysis requiring numerical reasoning — 13%+ error rate on calculations; the BCG outside-frontier task
  3. HR candidate screening and ranking — systematic racial and gender bias documented; active class-action litigation
  4. Complex customer service escalations — 4x failure rate vs other AI applications
  5. Any task requiring judgment over data not in the prompt — the consistent failure point across all studies

Source Index

# | Source | Date | Sample | Independence Rating | URL
1 | METR RCT | July 2025 | 16 devs, 246 tasks | Independent (nonprofit) | https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
2 | METR Follow-up | Feb 2026 | 57 devs, 800+ tasks | Independent (nonprofit) | https://metr.org/blog/2026-02-24-uplift-update/
3 | BCG/Harvard Jagged Frontier | Sept 2023 | 758 consultants | Academic (HBS, MIT, Wharton) | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321
4 | Brynjolfsson/Stanford | 2023/QJE 2025 | 5,179 agents | Academic (Stanford, NBER) | https://www.nber.org/papers/w31161
5 | Faros AI Paradox | July 2025 | 10,000+ devs, 1,255 teams | Vendor (but observational telemetry) | https://www.faros.ai/blog/ai-software-engineering
6 | DORA 2025 | Sept 2025 | ~5,000 developers | Industry (Google, but peer-reviewed methodology) | https://dora.dev/research/2025/dora-report/
7 | GitClear Code Quality | 2025 | 211M lines of code | Independent (analytics vendor) | https://www.gitclear.com/ai_assistant_code_quality_2025_research
8 | Stanford Law Hallucinations | 2025 | Legal benchmarks | Academic (Stanford Law) | https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413
9 | Qualtrics CX Study | Q3 2025 | 20,000+ consumers, 14 countries | Independent (XM Institute) | https://www.qualtrics.com/articles/news/ai-powered-customer-service-fails-at-four-times-the-rate-of-other-tasks/
10 | Doshi & Hauser Creativity | July 2024 | 293 writers, 600 evaluators | Academic (Science Advances) | https://www.science.org/doi/10.1126/sciadv.adn5290
11 | U. of Washington HR Bias | Oct 2024 | 120 names, 500+ job listings, 3 LLMs | Academic (UW/Brookings) | https://www.brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/
12 | Upwork Productivity Survey | July 2024 | 2,500 workers (US, UK, AU, CA) | Independent (Upwork Research Institute) | https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c
13 | McKinsey State of AI | March 2025 | ~1,993 respondents | Consulting (McKinsey/QuantumBlack) | https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value
14 | UK Government Copilot Trial | 2025 | 20,000 civil servants | Government (UK CDDO) | https://www.geekwire.com/2025/microsoft-ai-tools-saved-british-government-workers-26-minutes-a-day-new-study-shows/
15 | Peng et al. (GitHub Copilot) | Feb 2023 | 95 developers | Vendor-funded (GitHub) | https://arxiv.org/abs/2302.06590
16 | CodeRabbit PR Quality | Dec 2025 | 470 PRs | Vendor (CodeRabbit) | https://www.coderabbit.ai/
17 | Stanford HAI AI Index | 2025 | 456-page report | Academic (Stanford HAI) | https://hai.stanford.edu/ai-index/2025-ai-index-report
18 | Cui et al. / MIT-Wharton | 2024 | 1,663 (Microsoft) + 311 (Accenture) | Academic (MIT, Wharton) | https://mit-genai.pubpub.org/pub/v5iixksv
19 | Uplevel Data Labs | Sept 2024 | 785 developers | Independent (analytics vendor) | https://www.uplevelteam.com/

Methodological Notes

Independence ratings explained:

  • Independent (nonprofit/academic): No financial relationship with AI vendors. Highest credibility.
  • Industry (peer-reviewed methodology): Funded or conducted by companies with AI products, but uses accepted research methodology. Useful but discount accordingly.
  • Vendor (observational telemetry): Company selling AI tools or analytics. Data may be sound but selection effects and presentation bias are likely. Cross-reference with independent sources.
  • Vendor-funded: Study commissioned and funded by an AI vendor. Treat as marketing data unless corroborated.

What the evidence does not tell us:

  • Almost no studies examine AI impact over time horizons longer than six months
  • Almost no studies examine AI deployment at companies with 200-5,000 employees specifically
  • The METR and BCG studies are the only true randomized experiments; everything else is observational with significant confounders
  • The legal, financial, and HR evidence is mostly from benchmarks and case law, not controlled experiments
  • No study has measured the long-term organizational effects of AI-induced skill atrophy or homogenization