Which Specific Business Tasks AI Helps, Hurts, or Leaves Unchanged: The Evidence Base
Research compiled March 2026
Executive Summary
- AI task performance follows a “jagged frontier.” The BCG/Harvard study (n=758 consultants, September 2023) found that tasks inside the AI capability frontier saw 40% quality improvement and 25% speed gains, while a single task outside the frontier caused a 19-percentage-point drop in correct answers. The frontier is irregular and non-obvious — tasks of similar apparent difficulty can fall on opposite sides.
- Experienced workers on familiar codebases get slower, not faster. METR’s RCT (n=16 developers, 246 tasks, July 2025) found AI-assisted work took 19% longer. Developers believed they were 20% faster — a 39-point perception gap. METR’s February 2026 follow-up (57 developers, 800+ tasks) confirmed the pattern.
- Individual output gains do not translate to organizational productivity. Faros AI (n=10,000+ developers, 1,255 teams, July 2025) found 21% more tasks completed per developer but zero improvement in company-level throughput, DORA metrics, or quality KPIs. Review time increased 91%, bugs per developer increased 9%.
- AI compresses skill distributions. The consistent finding across studies is that low performers benefit most (43% improvement for bottom-half BCG consultants, 34% for novice customer service agents) while top performers see minimal or negative impact. This is the most robust finding in the literature.
- The task category that AI most reliably improves is first-draft generation of text and ideas — and the task category it most reliably degrades is anything requiring judgment over data the model has not seen. Between these poles lies a wide gray zone where outcomes depend on implementation quality, user skill, and organizational context.
Section 1: The Landmark Studies
1.1 METR Randomized Controlled Trial (2025)
Source: “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR, July 2025
Sample: 16 experienced open-source developers, 246 real tasks (bug fixes, features, refactors), repositories averaging 22,000+ stars and 1M+ lines of code
Method: Randomized controlled trial — tasks randomly assigned to AI-allowed or AI-disallowed conditions
Tools: Cursor Pro with Claude 3.5/3.7 Sonnet; developers had 50+ hours of Cursor experience
URL: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Core finding: Developers took 19% longer to complete tasks when using AI (95% CI: 2% to 40% longer).
The perception gap: Before tasks, developers predicted AI would speed them up by 24%. After the study concluded, they still estimated a 20% speedup — despite the measured slowdown.
Where the time went (from participant accounts):
- Extended prompting and re-prompting cycles when initial AI output missed codebase conventions
- Reviewing AI-generated code that was syntactically correct but semantically wrong
- Debugging AI suggestions that introduced subtle errors
- One participant reported a test-writing task estimated at 1 hour took 4 hours 20 minutes with AI
- Models repeatedly failed to match existing code style, fabricated constants, and broke project-specific conventions
METR February 2026 follow-up (n=57 developers, 800+ tasks, 143 repositories): Returning developers still showed an estimated 18% slowdown (CI: -38% to +9%). New developers showed a 4% slowdown (CI: -15% to +9%). METR concluded the data provides “an unreliable signal” because experienced developers increasingly refuse to work without AI, even for $150/hour.
What this means for mid-market companies: The METR finding applies specifically to experienced developers on mature, familiar codebases — precisely the scenario most mid-market companies face with their existing engineering teams. The result does not apply to greenfield development, junior developers learning new codebases, or highly repetitive coding tasks.
1.2 BCG/Harvard “Jagged Frontier” Study (2023)
Source: “Navigating the Jagged Technological Frontier,” Dell’Acqua, McFowland, Mollick et al., Harvard Business School Working Paper 24-013, September 2023
Sample: 758 BCG consultants (7% of individual contributor workforce)
Method: Field experiment — consultants randomly assigned to AI-access or no-AI-access conditions across 18 realistic consulting tasks
Tool: GPT-4
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321
Tasks inside the AI frontier (creative and generative work):
- Propose 10+ ideas for a new shoe targeting an underserved market
- Segment the footwear industry market
- Draft a press release for the new product
- Pen an inspirational memo to employees
- Create marketing slogans
- Write a 2,500-word article describing end-to-end process from pitch to launch
Results for inside-frontier tasks:
- 12.2% more tasks completed
- 25.1% faster completion
- 40% higher quality (human-rated)
The task outside the frontier (judgment over unfamiliar data):
- Analyze a business case using interview transcripts and financial spreadsheet data
- Required synthesizing qualitative and quantitative evidence to reach a conclusion
Results for the outside-frontier task:
- Without AI: consultants correct 84% of the time
- With AI: consultants correct only 60-70% of the time — a 19-percentage-point degradation
- Consultants trusted the AI’s confident but incorrect output and failed to apply independent judgment
The skill compression effect:
- Bottom-half performers improved 43% on inside-frontier tasks
- Top-half performers improved 17%
- AI compressed the performance distribution, making the range between best and worst much narrower
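The compression arithmetic can be made concrete. In the sketch below, the uplift percentages are the study's figures; the baseline quality scores are invented for illustration only:

```python
# Illustrative arithmetic: baseline scores are hypothetical, but the uplifts
# are the BCG/Harvard inside-frontier figures (+43% bottom half, +17% top half).
bottom_baseline = 50.0  # hypothetical quality score, bottom-half consultant
top_baseline = 80.0     # hypothetical quality score, top-half consultant

bottom_with_ai = bottom_baseline * 1.43  # 71.5
top_with_ai = top_baseline * 1.17        # 93.6

gap_before = top_baseline - bottom_baseline  # 30.0 points
gap_after = top_with_ai - bottom_with_ai     # ~22.1 points

print(f"Gap before AI: {gap_before:.1f} points")
print(f"Gap after AI:  {gap_after:.1f} points")
print(f"Compression:   {1 - gap_after / gap_before:.0%}")
```

With these (invented) baselines, the best-to-worst gap shrinks by roughly a quarter, which is the qualitative pattern the study reports: both halves improve, but the bottom half improves far more.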
What this means for mid-market companies: The frontier is not where most executives assume. Tasks that feel complex (creative ideation, first-draft writing) often fall inside it. Tasks that feel routine (analyzing a spreadsheet alongside qualitative data) can fall outside it. The danger is not that AI fails obviously — it fails persuasively.
1.3 Brynjolfsson/Stanford Customer Support Study (2023, published QJE 2025)
Source: “Generative AI at Work,” Brynjolfsson, Li, and Raymond, Quarterly Journal of Economics, 2025
Sample: 5,179 customer support agents at a Fortune 500 software company
Method: Staggered rollout — natural experiment as AI tool was introduced to different groups at different times
Tool: GPT-based conversational assistant providing real-time response suggestions
URL: https://www.nber.org/papers/w31161
Results by experience level:
- Overall: 14% increase in issues resolved per hour
- Novice/low-skill workers: 34% improvement in productivity
- Experienced/high-skill workers: minimal speed gain, small decline in quality
- AI effectively disseminated the tacit knowledge of top performers to newer agents
Additional findings:
- Customer sentiment improved
- Employee retention increased (less turnover in AI-assisted group)
- Evidence of genuine worker learning — agents improved even after AI was removed
What this means for mid-market companies: Customer support is the single strongest evidence case for AI deployment. The benefit concentrates in the first 6-12 months of a new agent’s tenure and in handling routine, well-documented issue types. The benefit diminishes sharply for experienced agents handling novel or complex escalations.
1.4 Faros AI Productivity Paradox (2025)
Source: “The AI Productivity Paradox,” Faros AI, July 2025
Sample: 10,000+ developers, 1,255 teams, 6+ companies, up to two years of telemetry data
Method: Observational — correlated AI adoption metrics with delivery outcomes using Spearman rank correlation
Data sources: Task management, IDE telemetry, static code analysis, CI/CD pipelines, version control, incident management, HR metadata
URL: https://www.faros.ai/blog/ai-software-engineering
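Spearman rank correlation, the statistic named in the method above, is an ordinary rank-based measure that needs no external libraries. A minimal sketch (the team-level numbers below are invented for illustration, not Faros data):

```python
# Minimal Spearman rank correlation, stdlib only.

def ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Invented data: per-team AI-adoption score vs. change in delivery throughput.
adoption = [0.1, 0.3, 0.5, 0.7, 0.9]
throughput_delta = [0.0, -0.1, 0.2, 0.1, 0.0]
print(spearman(adoption, throughput_delta))  # weak correlation despite rising adoption
```

Rank-based correlation is a sensible choice for telemetry of this kind because it is robust to outliers and does not assume the adoption-outcome relationship is linear.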
Individual-level gains:
- 21% more tasks completed
- 98% more pull requests merged
- 9% more tasks per day
- 47% more PRs per day
Quality and downstream costs:
- 154% increase in average PR size
- 9% increase in bugs per developer
- 91% increase in PR review time
- 1.7x more issues per PR in AI-generated code (10.83 vs 6.45)
Organizational-level impact:
- Zero measurable improvement in company-level throughput
- Zero improvement in DORA metrics (deployment frequency, lead time, change failure rate)
- Zero improvement in quality KPIs
Why gains evaporate: Individual output increases are absorbed by downstream bottlenecks — code review queues, testing pipelines, integration complexity, and deployment dependencies. Amdahl’s Law: the slowest stage determines total throughput.
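The bottleneck arithmetic can be sketched in a few lines. The stage rates below are invented for illustration; only the mechanism (drafting accelerates, review does not) comes from the findings above:

```python
# Bottleneck arithmetic behind "why gains evaporate": throughput of a serial
# pipeline is capped by its slowest stage, so doubling the drafting stage
# alone leaves end-to-end delivery unchanged. Stage rates are invented.

def pipeline_throughput(stage_rates):
    """Completed items/week for a serial pipeline; the minimum stage wins."""
    return min(stage_rates.values())

stages = {"draft": 20, "review": 8, "test": 10, "deploy": 12}  # items/week
before = pipeline_throughput(stages)

stages["draft"] *= 2  # AI doubles individual drafting output
after = pipeline_throughput(stages)

print(before, after)  # review (8/week) caps delivery in both cases
```

In this toy model the only interventions that move the company-level number are the ones aimed at the review stage, which is exactly where the Faros data shows time piling up (+91% review time).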
Section 2: Task-Level Evidence by Business Function
2.1 Software Engineering
| Task | Evidence | Direction |
|---|---|---|
| Greenfield code (new function, single language) | Peng et al. 2023, n=95: 55.8% faster | HELPS |
| Boilerplate/repetitive code | Microsoft internal, DORA 2025: consensus positive | HELPS |
| Writing unit tests | DORA 2025: 62% of test-writing developers use AI; GitClear: test coverage improvement | HELPS |
| Documentation generation | DORA 2025, Microsoft: reduced time, consensus positive | HELPS |
| Code review assistance | CodeRabbit 2025, n=470 PRs: AI flags issues but generates 1.7x more issues itself | MIXED |
| Debugging familiar codebase | METR 2025: included in 19% slowdown; participant reports of extended debugging cycles | HURTS (experienced devs) |
| Refactoring mature codebase | GitClear 2025: refactoring as share of changes dropped from 25% to <10% | HURTS (less refactoring done) |
| Complex feature in familiar repo | METR 2025: 19% slower overall; one task estimated at 30min took 4h7min | HURTS (experienced devs) |
| Code maintenance over time | GitClear 2025, 211M lines: code churn rose from 3.1% to 5.7%, duplicated blocks up 8x | HURTS |
Key code quality finding (GitClear, 211M lines of code, 2020-2024): As AI adoption increased, duplicated code blocks grew 8x, code requiring revision within two weeks rose from 3.1% to 5.7%, and refactoring as a proportion of changed code dropped from 25% to under 10%. The pattern: AI generates more code, less of it gets cleaned up, and more of it needs immediate rework.
Key organizational finding (DORA 2025, ~5,000 developers surveyed): AI adoption now positively correlates with throughput but continues to correlate negatively with delivery stability. DORA’s conclusion: “AI doesn’t fix a team; it amplifies what’s already there.” A 25% increase in AI adoption is associated with a 7.2% reduction in delivery stability.
2.2 Customer Service
| Task | Evidence | Direction |
|---|---|---|
| Routine ticket triage and routing | Brynjolfsson 2023/2025, n=5,179: 14% more issues/hour | HELPS |
| Response suggestions for common issues | Brynjolfsson: 34% improvement for novice agents | HELPS |
| Ticket deflection (FAQ, password reset) | Industry data: 45%+ deflection rates in retail/travel | HELPS |
| Complex escalations requiring empathy | Qualtrics 2025, n=20,000+ consumers: 4x failure rate vs other AI tasks | HURTS |
| Handling novel/unusual complaints | Qualtrics: nearly 1 in 5 consumers saw no benefit | HURTS |
| Post-call summarization | Microsoft/industry: consensus time savings | HELPS |
Key finding (Qualtrics XM Institute, Q3 2025, n=20,000+ consumers across 14 countries): AI-powered customer service fails at four times the rate of AI applied to other business tasks. Nearly 1 in 5 consumers who used AI for customer service saw no benefit. Consumers rank AI customer service among the worst applications for convenience, time savings, and usefulness. 53% of consumers cite data misuse as their top concern (up 8 points YoY).
2.3 Legal Review and Research
| Task | Evidence | Direction |
|---|---|---|
| Contract clause identification | Industry benchmarks 2025: 96-99% accuracy on structured contracts | HELPS |
| First-pass document review (privilege, relevance) | Deloitte 2025: 88% of legal teams report productivity gains | HELPS |
| Routine contract review speed | Concord 2025: review time from 92 minutes to 26 seconds per contract at 98% accuracy | HELPS |
| Legal research (case citation) | Stanford 2025: Lexis+ AI hallucinated 17%, Westlaw 33%, GPT-4 43% | HURTS |
| Complex legal analysis/reasoning | Stanford 2025: Ask Practical Law AI accurate only 18% of the time | HURTS |
| Brief drafting with novel arguments | Law360 tracker: 729+ documented AI hallucination incidents in legal practice by end 2025 | HURTS |
Key study (Stanford Law/HAI, published 2025): Hallucination rates for leading legal AI tools:
- Lexis+ AI: 17%
- Westlaw AI-Assisted Research: 33%
- Ask Practical Law AI: accurate only 18% of the time
- GPT-4 (general-purpose): 43%
- Earlier Stanford study (2024): general-purpose models hallucinated 58-88% on legal tasks
URL: https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413
Real-world consequences: Federal courts have imposed $50,000+ in fines for AI-generated false citations. 729+ AI hallucination incidents documented by Law360 through end of 2025.
2.4 Financial Analysis and Accounting
| Task | Evidence | Direction |
|---|---|---|
| Invoice/receipt data extraction (OCR) | Industry benchmarks 2025: 97-99% accuracy, error rates below 0.5% | HELPS |
| Accounts payable automation | Industry 2025: 90% error reduction vs manual entry | HELPS |
| Anomaly/fraud detection in transactions | MindBridge, Deloitte studies: improved detection of irregularities | HELPS |
| Routine audit sampling | Systematic reviews 2025: AI enables continuous monitoring vs periodic sampling | HELPS |
| Spreadsheet-based financial analysis | BCG/Harvard 2023: the outside-frontier task that degraded performance 19 points | HURTS |
| Numerical reasoning/calculation | LLM math studies 2025: 13% error rate even with mitigation; struggles with >4-digit multiplication | HURTS |
| Strategic financial judgment | Gartner 2024 CFO survey: “inadequate data quality” cited as #1 challenge | NEUTRAL-TO-HURTS |
Key limitation: LLMs are fundamentally unreliable for numerical reasoning. ChatGPT’s error rate on math tasks has been reduced from 29% to 13% through mitigation techniques, but more than 1 in 10 answers remains wrong. Models struggle especially with multi-step calculations, large-digit multiplication, and problems requiring precise numerical reasoning — core financial analysis skills.
2.5 Marketing and Content Creation
| Task | Evidence | Direction |
|---|---|---|
| First-draft copywriting | BCG/Harvard: inside-frontier, 40% quality improvement | HELPS |
| Ad copy generation/testing | Industry 2025: 450% increase in ad CTR reported | HELPS |
| Email subject line optimization | HubSpot analysis: 41% higher conversion with AI optimization | HELPS |
| Brainstorming/ideation quantity | BCG/Harvard: 12.2% more outputs generated | HELPS |
| Content personalization at scale | McKinsey: marketing & sales identified as highest-value AI use case | HELPS |
| SEO content production | Industry data: consensus time savings for routine SEO | HELPS |
| Brand voice consistency at scale | Doshi & Hauser 2024: AI homogenizes output toward generic styles | HURTS |
| Original thought leadership | Doshi & Hauser: collective diversity of content reduced even as individual quality improves | HURTS |
| Cultural/market-specific nuance | CHI 2025: AI homogenizes writing toward Western styles, diminishes cultural nuances | HURTS |
Key study (Doshi & Hauser, Science Advances, July 2024, n=293 writers, 600 evaluators, 3,519 evaluations): AI-assisted stories were rated as more creative, better written, and more enjoyable — especially for less creative writers. But AI-enabled stories were significantly more similar to each other than human-only stories. The diversity gap widens with more content produced.
URL: https://www.science.org/doi/10.1126/sciadv.adn5290
The homogenization trap: A 2025 follow-up study found that the induced homogeneity not only persists but continues to grow after AI access is removed. Writers do not retain the creative capability the AI provided, yet the generic style remains.
2.6 HR and Talent Screening
| Task | Evidence | Direction |
|---|---|---|
| Resume keyword matching | 83% of companies using by end 2025; effective for volume reduction | HELPS |
| Interview scheduling automation | Industry: consensus time savings | HELPS |
| Job description drafting | Industry: faster first drafts, broader language | HELPS |
| Candidate ranking/scoring | U. of Washington 2024: favors white-associated names 85% of the time | HURTS |
| Diversity screening | U. of Washington: male-associated names favored 52% of the time | HURTS |
| Cultural fit assessment | Multiple studies: AI replicates and amplifies historical hiring biases | HURTS |
| Evaluating non-traditional backgrounds | Workday class action (Mobley v. Workday, July 2024): systematic rejection pattern | HURTS |
Key study (University of Washington/Brookings, October 2024): Tested three state-of-the-art LLMs (Mistral AI, Salesforce, Contextual AI) on 120 names across 500+ real job listings. AI screening tools favored white-associated names 85% of the time and male-associated names 52% of the time. A follow-up study (n=528 people) found humans exposed to AI recommendations mirror the AI’s biases.
Adoption vs. awareness gap: 83% of companies were using AI resume screening by the end of 2025, yet 67% acknowledge their tools could introduce bias. 47% recognize age bias, 44% cite socioeconomic bias, 30% mention gender bias, 26% point to racial bias. The first major class action (Mobley v. Workday) was allowed to proceed in July 2024.
2.7 Administrative and Operational Tasks
| Task | Evidence | Direction |
|---|---|---|
| Meeting summarization | Microsoft 365 Copilot internal study: most-used feature; UK government trial (n=20,000): 26 min/day saved | HELPS |
| Email drafting | Microsoft/Forrester: measurable time savings on routine correspondence | HELPS |
| Calendar scheduling | Industry 2025: 4-5 hours/week saved on coordination | HELPS |
| Document summarization (short, factual) | Vectara 2026 benchmark: top models achieve 0.7-0.9% hallucination rate | HELPS |
| Document summarization (long, complex) | HELMET benchmark 2025: only GPT-4o maintains performance through 128K tokens | MIXED |
| Translation (high-resource language pairs) | Lokalise 2024, n=600 evaluations: Claude 3.5-Sonnet preferred in 78% of cases | HELPS |
| Translation (low-resource languages, technical) | Industry studies 2025: 60-85% accuracy, significant error rates in specialized domains | HURTS |
| Data entry from structured forms | Industry 2025: AI-enhanced OCR achieves 97-99% accuracy vs 96-99% manual | HELPS |
Key finding (UK Government Copilot Trial, 2025, n=20,000 civil servants, 3-month duration): Microsoft 365 Copilot saved government workers an average of 26 minutes per day on administrative tasks. Meeting summarization was the highest-usage feature. The study was conducted by the Central Digital and Data Office across multiple government departments.
Section 3: Patterns Across the Evidence
3.1 The Task Taxonomy That Predicts AI Impact
Based on the cumulative evidence, AI task performance follows a consistent pattern organized by three dimensions:
Dimension 1: Novelty of Input Data
- AI works best on tasks where all relevant information is already in the model’s training data or provided in the prompt
- AI fails on tasks requiring judgment over new, unseen, or ambiguous data (BCG outside-frontier task, METR experienced-developer findings)
Dimension 2: Cost of Errors
- AI works best on tasks where errors are cheap to catch and fix (first drafts, brainstorming, scheduling)
- AI fails on tasks where errors are expensive, subtle, or legally consequential (legal research, financial analysis, HR screening, medical diagnosis)
Dimension 3: Verification Difficulty
- AI works best on tasks where the user can quickly verify output quality (email drafts, meeting summaries, code that compiles)
- AI fails on tasks where verification requires the same expertise as creation (complex code logic, legal reasoning, strategic analysis)
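The three dimensions can be turned into a rough screening heuristic. Everything quantitative below (the 1-5 scale, the weights, the thresholds) is invented for illustration; only the dimension names come from the evidence above:

```python
# Toy screening heuristic over the three dimensions in the taxonomy.
# Scale, thresholds, and labels are invented, not drawn from any cited study.

def ai_fit(novel_data: int, error_cost: int, verify_difficulty: int) -> str:
    """Each input: 1 (low) to 5 (high). Higher total risk means worse AI fit."""
    risk = novel_data + error_cost + verify_difficulty  # ranges 3..15
    if risk <= 6:
        return "deploy with confidence"
    if risk <= 10:
        return "deploy with caution"
    return "significant safeguards or avoid"

# First-draft marketing copy: familiar data, cheap errors, easy verification.
print(ai_fit(novel_data=2, error_cost=1, verify_difficulty=1))
# Legal research with novel citations: unseen data, costly and subtle errors.
print(ai_fit(novel_data=5, error_cost=5, verify_difficulty=5))
```

The point of the sketch is the ordering, not the numbers: any task scoring high on all three dimensions lands in the same bucket as the documented failure cases (legal research, spreadsheet-plus-transcript analysis, HR screening).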
3.2 The Skill Compression Effect
The most robust finding across the entire literature:
| Study | Bottom Performers | Top Performers |
|---|---|---|
| BCG/Harvard (n=758) | +43% quality | +17% quality |
| Brynjolfsson (n=5,179) | +34% productivity | Minimal gain, slight quality decline |
| METR (n=16) | N/A (all experienced) | 19% slower |
| Doshi & Hauser (n=293) | “Especially” helped less creative writers | Smaller improvement |
Implication for mid-market companies: AI is most valuable as a floor-raiser, not a ceiling-raiser. Deploy it to accelerate onboarding, support less experienced staff, and standardize baseline quality. Do not expect it to make top performers better — the evidence consistently shows it does not.
3.3 The Organizational Throughput Paradox
Three independent data sources confirm the same pattern:
- Faros AI (2025): 21% more tasks per developer, zero improvement in company-level metrics
- DORA 2025: AI correlates with higher throughput but lower stability at the organizational level
- Upwork (2024, n=2,500): 77% of employees say AI has added to their workload; 71% report burnout
The mechanism: AI accelerates the fastest stage of a process (typically creation/drafting) while leaving the slowest stages unchanged (review, approval, integration, deployment). Because throughput is constrained by the bottleneck, speeding up creation produces more work-in-progress, not more completed work. The Faros data shows this precisely: 98% more PRs merged, 91% longer review times, zero change in delivery velocity.
Upwork study details (Walr survey, April-May 2024, n=2,500 across US, UK, Australia, Canada — 1,250 C-suite, 625 employees, 625 freelancers):
- 96% of C-suite leaders expect AI to increase productivity
- 77% of employees say AI has added to their workload
- 39% spend more time reviewing/moderating AI-generated content
- 23% spend more time learning to use AI tools
- 47% of employees using AI say they don’t know how to achieve expected productivity gains
- 71% of full-time employees are burned out
3.4 Hallucination Rates by Task Type
| Task Type | Hallucination Rate | Source |
|---|---|---|
| Short factual summarization (with source) | 0.7-0.9% (top models) | Vectara Hallucination Leaderboard, March 2026 |
| General knowledge questions | 2-5% (top models), 9.2% (average) | Multiple benchmarks, 2025 |
| Legal research (specialized tools) | 17% (Lexis+ AI) to 33% (Westlaw) | Stanford Law, 2025 |
| Legal research (general-purpose LLM) | 43-88% | Stanford Law, 2024-2025 |
| Open-ended factual questions | 33% (o3) to 48% (o4-mini) on PersonQA | OpenAI PersonQA benchmark, 2025 |
| News source identification | 94% wrong (Grok-3) | Columbia Journalism Review, March 2025 |
| Medical text summarization | 1.47% hallucination, 3.45% omission | Nature Digital Medicine, 2025 |
The critical pattern: Hallucination rates drop sharply when the model works from supplied source documents (summarization) and rise sharply on open-ended or domain-specific questions. This maps directly to the BCG “inside vs. outside the frontier” distinction. The frontier is partly defined by whether the model is interpolating from given data or extrapolating into unseen territory.
Section 4: The Practical Decision Framework
For a mid-market company evaluating where to deploy AI, the evidence supports this hierarchy:
Deploy with confidence (strong evidence of benefit):
- Meeting summarization and administrative text — highest adoption, lowest risk, most consistent time savings
- Customer service tier-1 (routine tickets, response suggestions for new agents) — the Brynjolfsson study is the gold standard; 14% overall, 34% for novice agents
- Invoice/receipt processing and data extraction — 97-99% accuracy, measurable cost reduction
- First-draft content generation (marketing copy, email drafts, internal communications) — consistent quality and speed improvements, but with human editing
- Boilerplate code generation and documentation — consensus positive across studies, low risk
Deploy with caution (mixed evidence, requires controls):
- Contract review (first-pass clause identification) — high accuracy on structured documents, but hallucination risk on novel clauses
- Translation (high-resource language pairs, non-critical content) — good for speed, requires human review for anything published
- Code review assistance — useful for flagging issues, but AI-generated code itself has 1.7x more issues
- Marketing personalization at scale — strong ROI evidence, but watch for homogenization over time
Deploy with significant safeguards or avoid:
- Legal research and case citation — 17-43% hallucination rate; $50,000+ in court fines documented
- Financial analysis requiring numerical reasoning — 13%+ error rate on calculations; the BCG outside-frontier task
- HR candidate screening and ranking — systematic racial and gender bias documented; active class-action litigation
- Complex customer service escalations — 4x failure rate vs other AI applications
- Any task requiring judgment over data not in the prompt — the consistent failure point across all studies
Source Index
| # | Source | Date | Sample | Independence Rating | URL |
|---|---|---|---|---|---|
| 1 | METR RCT | July 2025 | 16 devs, 246 tasks | Independent (nonprofit) | https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ |
| 2 | METR Follow-up | Feb 2026 | 57 devs, 800+ tasks | Independent (nonprofit) | https://metr.org/blog/2026-02-24-uplift-update/ |
| 3 | BCG/Harvard Jagged Frontier | Sept 2023 | 758 consultants | Academic (HBS, MIT, Wharton) | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321 |
| 4 | Brynjolfsson/Stanford | 2023/QJE 2025 | 5,179 agents | Academic (Stanford, NBER) | https://www.nber.org/papers/w31161 |
| 5 | Faros AI Paradox | July 2025 | 10,000+ devs, 1,255 teams | Vendor (but observational telemetry) | https://www.faros.ai/blog/ai-software-engineering |
| 6 | DORA 2025 | Sept 2025 | ~5,000 developers | Industry (Google, but peer-reviewed methodology) | https://dora.dev/research/2025/dora-report/ |
| 7 | GitClear Code Quality | 2025 | 211M lines of code | Independent (analytics vendor) | https://www.gitclear.com/ai_assistant_code_quality_2025_research |
| 8 | Stanford Law Hallucinations | 2025 | Legal benchmarks | Academic (Stanford Law) | https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413 |
| 9 | Qualtrics CX Study | Q3 2025 | 20,000+ consumers, 14 countries | Independent (XM Institute) | https://www.qualtrics.com/articles/news/ai-powered-customer-service-fails-at-four-times-the-rate-of-other-tasks/ |
| 10 | Doshi & Hauser Creativity | July 2024 | 293 writers, 600 evaluators | Academic (Science Advances) | https://www.science.org/doi/10.1126/sciadv.adn5290 |
| 11 | U. of Washington HR Bias | Oct 2024 | 120 names, 500+ job listings, 3 LLMs | Academic (UW/Brookings) | https://www.brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/ |
| 12 | Upwork Productivity Survey | July 2024 | 2,500 workers (US, UK, AU, CA) | Independent (Upwork Research Institute) | https://investors.upwork.com/news-releases/news-release-details/upwork-study-finds-employee-workloads-rising-despite-increased-c |
| 13 | McKinsey State of AI | March 2025 | ~1,993 respondents | Consulting (McKinsey/QuantumBlack) | https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value |
| 14 | UK Government Copilot Trial | 2025 | 20,000 civil servants | Government (UK CDDO) | https://www.geekwire.com/2025/microsoft-ai-tools-saved-british-government-workers-26-minutes-a-day-new-study-shows/ |
| 15 | Peng et al. (GitHub Copilot) | Feb 2023 | 95 developers | Vendor-funded (GitHub) | https://arxiv.org/abs/2302.06590 |
| 16 | CodeRabbit PR Quality | Dec 2025 | 470 PRs | Vendor (CodeRabbit) | https://www.coderabbit.ai/ |
| 17 | Stanford HAI AI Index | 2025 | 456-page report | Academic (Stanford HAI) | https://hai.stanford.edu/ai-index/2025-ai-index-report |
| 18 | Cui et al. / MIT-Wharton | 2024 | 1,663 (Microsoft) + 311 (Accenture) | Academic (MIT, Wharton) | https://mit-genai.pubpub.org/pub/v5iixksv |
| 19 | Uplevel Data Labs | Sept 2024 | 785 developers | Independent (analytics vendor) | https://www.uplevelteam.com/ |
Methodological Notes
Independence ratings explained:
- Independent (nonprofit/academic): No financial relationship with AI vendors. Highest credibility.
- Industry (peer-reviewed methodology): Funded or conducted by companies with AI products, but uses accepted research methodology. Useful but discount accordingly.
- Vendor (observational telemetry): Company selling AI tools or analytics. Data may be sound but selection effects and presentation bias are likely. Cross-reference with independent sources.
- Vendor-funded: Study commissioned and funded by an AI vendor. Treat as marketing data unless corroborated.
What the evidence does not tell us:
- Almost no studies examine AI impact over time horizons longer than six months
- Almost no studies examine AI deployment at companies with 200-5,000 employees specifically
- The METR and BCG studies are the only true randomized experiments; everything else is observational with significant confounders
- The legal, financial, and HR evidence is mostly from benchmarks and case law, not controlled experiments
- No study has measured the long-term organizational effects of AI-induced skill atrophy or homogenization