What AI Makes Worse: A Task Selection Card for Your First Pilot


Executive Summary

  • The #1 reason AI pilots fail is not the technology — it is the task. MIT found 95% of generative AI pilots produce no measurable P&L impact (n=300+, 2025). The most common cause: the pilot targeted a task where AI creates more problems than it solves.
  • The BCG/Harvard “jagged frontier” study (n=758 consultants, 2023) documented the pattern: AI improved quality by 40% on tasks inside its capability boundary and cut accuracy by 19 percentage points on a task just outside it. The frontier is invisible to the user.
  • This card sorts 34 common business tasks into three columns (AI helps, AI is neutral or mixed, AI hurts) based on independent evidence, not vendor claims. Every entry carries a source and credibility rating so the selection committee knows what to trust.
  • The three-question filter at the bottom takes 60 seconds per candidate task. Use it before committing budget.

How to Read This Card

Green (AI Helps): Strong independent evidence of measurable improvement. Deploy with standard governance.

Yellow (AI Neutral or Mixed): Evidence is contradictory, gains depend heavily on implementation, or individual speed gains do not translate to organizational outcomes. Deploy only with pre-defined success metrics and a 90-day kill switch.

Red (AI Hurts): Independent evidence shows AI degrades accuracy, introduces bias, or creates downstream costs that exceed the time savings. Avoid for the first pilot. Revisit only after the organization has reached AI governance maturity.


The Task Selection Table

Operations and Administration

Task | Rating | Evidence | Source / Credibility
Meeting summarization | GREEN | UK government trial: 26 min/day saved across 20,000 workers. Highest adoption, lowest risk. | UK CDDO, n=20,000, 2025. Government study. High credibility.
Email drafting (routine) | GREEN | Consistent time savings on standard correspondence. No quality degradation reported. | Microsoft/Forrester, 2025. Vendor-adjacent. Moderate-high credibility.
Invoice and receipt processing | GREEN | 97-99% accuracy. AP automation cuts cost per invoice from $15-22 to $2-4. Payback in 3-6 months. | APQC benchmarks, 2025. Independent. High credibility.
Calendar and scheduling coordination | GREEN | 4-5 hours/week saved on coordination tasks. Low error cost. | Industry consensus, 2025. Moderate credibility.
Data entry from structured forms | GREEN | AI-enhanced OCR achieves 97-99% accuracy. Low verification cost. | Industry benchmarks, 2025. Moderate credibility.
Document summarization (long, complex) | YELLOW | Performance degrades beyond 128K tokens. Only GPT-4o maintained accuracy through long documents. | HELMET benchmark, 2025. Academic. High credibility.
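
The invoice-processing row gives cost per invoice as a before range and an after range rather than a single percentage. As a rough check, the short sketch below works out the percentage reduction those ranges imply. The function name `percent_reduction` is illustrative, and pairing the range endpoints this way is an assumption about how to bound the saving, not a method taken from APQC.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Figures from the invoice-processing row: $15-22 before, $2-4 after.
BEFORE_RANGE = (15.0, 22.0)
AFTER_RANGE = (2.0, 4.0)

# Smallest saving: cheapest starting point, most expensive end point.
smallest_saving = percent_reduction(BEFORE_RANGE[0], AFTER_RANGE[1])

# Largest saving: most expensive starting point, cheapest end point.
largest_saving = percent_reduction(BEFORE_RANGE[1], AFTER_RANGE[0])

print(f"Cost per invoice falls by {smallest_saving:.0f}% to {largest_saving:.0f}%")
# prints: Cost per invoice falls by 73% to 91%
```

Under these assumptions the implied saving is roughly 73% to 91% per invoice, depending on where an organization starts and ends within the quoted ranges.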

Customer Service

Task | Rating | Evidence | Source / Credibility
Tier-1 ticket triage and routing | GREEN | 14% more issues resolved per hour. 34% improvement for new agents. The strongest RCT evidence for any AI business application. | Brynjolfsson/Stanford, n=5,179, QJE 2025. Academic. Very high credibility.
FAQ deflection (password reset, status check) | GREEN | 45%+ deflection rates in retail and travel. Low complexity, high volume. | Industry data, 2025. Moderate credibility.
Post-call summarization | GREEN | Consistent time savings. Low hallucination risk because the model works from a transcript. | Microsoft/industry, 2025. Moderate-high credibility.
Complex escalations requiring empathy | RED | AI customer service fails at 4x the rate of AI in other business tasks. 1 in 5 consumers saw no benefit. | Qualtrics XM Institute, n=20,000+, Q3 2025. Independent. High credibility.
Novel or unusual complaints | RED | Same Qualtrics finding. AI defaults to template responses. Customer satisfaction drops. | Qualtrics XM Institute, Q3 2025. High credibility.

Software Engineering

Task | Rating | Evidence | Source / Credibility
Boilerplate and scaffolding code | GREEN | 25-35% speed gains on repetitive coding. Universal consensus across studies. | DORA 2025 (~5,000 devs), GitHub, Microsoft. High credibility (convergent).
Unit test generation | GREEN | 83% coverage vs. 54% traditional. Highest-value use case with lowest risk. | QA industry data, DORA 2025. High credibility.
Documentation generation | GREEN | Automates the task developers skip. Low error cost: documentation is inherently reviewable. | DORA 2025, industry consensus. High credibility.
Code review assistance | YELLOW | Flags patterns and catches issues, but AI-generated code itself creates 1.7x more downstream issues than human code. | CodeRabbit, n=470 PRs, 2025. Vendor but independent analysis. Moderate credibility.
Complex features on familiar codebase | RED | Experienced developers took 19% longer with AI. One task estimated at 30 minutes took 4 hours 7 minutes. | METR RCT, n=16, 246 tasks, 2025. Independent, pre-registered. Very high credibility.
Architecture decisions | RED | AI lacks organizational context. Suggestions are plausible but miss constraints only humans know. Errors compound across the system. | Practitioner consensus, AlterSquare (20+ projects), 2026. Moderate credibility.

Legal

Task | Rating | Evidence | Source / Credibility
Contract clause identification (structured) | GREEN | 96-99% accuracy on standard contract formats. Review time from 92 minutes to 26 seconds per contract. | Concord/industry, 2025. Vendor-adjacent. Moderate-high credibility.
First-pass document review | GREEN | 88% of legal teams report productivity gains on volume review. Works when the task is classification, not judgment. | Deloitte Legal Survey, 2025. Consulting. Moderate credibility.
Legal research and case citation | RED | Best specialized tool (Lexis+ AI) hallucinates 17%. General-purpose LLMs: 43-88%. Federal courts have imposed $50,000+ in fines for fabricated citations. | Stanford Law/HAI, 2025. Academic. Very high credibility.
Brief drafting with novel arguments | RED | 729+ documented AI hallucination incidents in legal practice through end 2025. Confident, plausible, wrong. | Law360 Tracker, 2025. Independent. High credibility.

Finance

Task | Rating | Evidence | Source / Credibility
Accounts payable automation | GREEN | 90% error reduction vs. manual entry. Cost per invoice drops 80%. Fastest ROI of any AI deployment studied. | APQC, industry benchmarks, 2025. Independent. High credibility.
Fraud and anomaly detection | GREEN | 90%+ detection accuracy, 50% false positive reduction. Works because AI processes transaction patterns humans cannot see at scale. | MindBridge, Deloitte, 2025. Mixed sources. Moderate-high credibility.
Spreadsheet-based financial analysis | RED | This was the BCG/Harvard “outside-the-frontier” task. Consultants using AI were correct 60-70% of the time vs. 84% without AI, a 19-point accuracy drop. | BCG/Harvard, n=758, 2023. Academic. Very high credibility.
Numerical reasoning and calculation | RED | LLM error rate on math: 13% even with mitigation. Struggles with multi-step calculations and numbers above 4 digits. Financial analysis is exactly this. | LLM math benchmarks, 2025. Academic. High credibility.
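
The “19-point accuracy drop” in the spreadsheet-analysis row is a gap in percentage points, which is easy to misread as a 19% relative decline. The sketch below, using only the figures quoted in that row, shows how the two measures differ. Treating 19 as the midpoint of the quoted range is an observation about the arithmetic, not a claim about how the study derived its figure.

```python
# Figures quoted in the spreadsheet-analysis row.
CONTROL_ACCURACY = 84           # % correct without AI
AI_ACCURACY_RANGE = (60, 70)    # % correct with AI

# Gap in percentage points at each end of the quoted range.
smallest_gap = CONTROL_ACCURACY - AI_ACCURACY_RANGE[1]   # 84 - 70
largest_gap = CONTROL_ACCURACY - AI_ACCURACY_RANGE[0]    # 84 - 60
midpoint_gap = (smallest_gap + largest_gap) / 2

# The same midpoint expressed as a relative decline from the control group.
relative_decline = midpoint_gap / CONTROL_ACCURACY * 100

print(f"Gap: {smallest_gap} to {largest_gap} points, midpoint {midpoint_gap:.0f}")
# prints: Gap: 14 to 24 points, midpoint 19
print(f"Relative decline at the midpoint: {relative_decline:.1f}%")
# prints: Relative decline at the midpoint: 22.6%
```

Read this way, a 19-point gap means the AI-assisted group got about 23% fewer answers right than the control group.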

Marketing and Content

Task | Rating | Evidence | Source / Credibility
First-draft copywriting | GREEN | 40% quality improvement on creative and generative tasks (inside the AI frontier). 12.2% more outputs per session. | BCG/Harvard, n=758, 2023. Academic. Very high credibility.
Email subject line testing | GREEN | 41% higher conversion with AI-generated subject lines. High volume, fast feedback, low error cost. | HubSpot analysis, 2025. Vendor. Moderate credibility.
SEO content production | GREEN | Time savings on routine keyword-optimized content. Low risk when combined with human review. | Industry consensus, 2025. Moderate credibility.
Brand voice at scale | RED | AI homogenizes output. Content becomes more similar across pieces even as individual quality rises. The diversity gap widens over time. | Doshi & Hauser, n=293 writers, Science Advances, 2024. Academic. Very high credibility.
Original thought leadership | RED | Follow-up studies found homogenization persists even after AI is removed. Writers lose their original voice and do not recover it. | Doshi & Hauser follow-up, 2025. Academic. High credibility.

HR and Talent

Task | Rating | Evidence | Source / Credibility
Resume keyword matching (volume reduction) | YELLOW | Effective at reducing volume. But 83% of companies using AI screening acknowledge it could introduce bias. Use only as a first filter with human review. | Industry surveys, 2025. Moderate credibility.
Job description drafting | GREEN | Faster first drafts, broader language. Low risk with review. | Industry consensus, 2025. Moderate credibility.
Candidate ranking and scoring | RED | AI screening tools favor white-associated names 85% of the time. Male-associated names favored 52%. The first major class action (Mobley v. Workday) proceeded in July 2024. | U. of Washington/Brookings, 120 names, 500+ listings, 2024. Academic. Very high credibility.
Cultural fit assessment | RED | AI replicates and amplifies historical hiring biases. Humans exposed to AI recommendations mirror the AI’s biases. | U. of Washington follow-up, n=528, 2024. Academic. High credibility.

The Three-Question Filter

Before committing any task to an AI pilot, answer these three questions. If any answer is “yes,” move the task to the Yellow or Red column — regardless of what the vendor demo showed.
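
For teams that track candidate tasks in a spreadsheet or script, the rule above can be sketched as a small function whose inputs are the answers to the three questions that follow. This is a minimal illustration only: the names `FilterAnswers` and `screen_task` are hypothetical, and the example answers are assumptions made for demonstration, not ratings taken from the card.

```python
from dataclasses import dataclass


@dataclass
class FilterAnswers:
    """Yes/no answers to the three filter questions (True means "yes")."""
    needs_judgment_on_unseen_data: bool   # Question 1
    costly_hard_to_spot_errors: bool      # Question 2
    value_needs_more_than_speed: bool     # Question 3


def screen_task(name: str, answers: FilterAnswers) -> str:
    """Apply the card's rule: any single "yes" moves the task to Yellow or Red."""
    flagged = any((
        answers.needs_judgment_on_unseen_data,
        answers.costly_hard_to_spot_errors,
        answers.value_needs_more_than_speed,
    ))
    if flagged:
        return f"{name}: move to Yellow or Red"
    return f"{name}: no filter objection, keep current rating"


# Hypothetical answers for two candidate tasks.
print(screen_task("Invoice processing", FilterAnswers(False, False, False)))
# prints: Invoice processing: no filter objection, keep current rating
print(screen_task("Legal research", FilterAnswers(False, True, False)))
# prints: Legal research: move to Yellow or Red
```

The function deliberately does not choose between Yellow and Red. That decision still depends on the evidence in the table, which a yes/no filter cannot capture.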

Question 1: Does the task require judgment over data the AI has not seen?

If the answer depends on internal context, client history, institutional knowledge, or data not in the prompt, AI will fill the gap with plausible fabrication. This is the BCG “outside-the-frontier” pattern: on the study's outside-the-frontier task, consultants using AI were correct 60-70% of the time, versus 84% for those working without it. The frontier is invisible: the AI does not signal when it crosses from interpolation into invention.

Question 2: Is the cost of a wrong answer high and the error hard to spot?

AI errors are not random — they are confident and well-formatted. A hallucinated legal citation looks identical to a real one. A biased candidate ranking looks like data-driven objectivity. A subtly wrong financial calculation produces a clean spreadsheet. If the person reviewing the output needs the same expertise as the person creating it, AI does not save time — it shifts the work from creation to verification.

Question 3: Does organizational value require more than individual speed?

Faros AI tracked 10,000+ developers and found 21% more tasks completed per person — and zero improvement in organizational throughput. The speed went into longer review queues, not faster delivery. If the task you are selecting feeds into a downstream bottleneck that AI does not address, the pilot will produce impressive individual metrics and no business outcome. Ask: where does the output of this task go next? Is that next step ready for more volume?
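
One way to see why individual speed can vanish at the organizational level is a two-stage pipeline in which work ships only after review. The sketch below is a simplified model under stated assumptions: the baseline of 100 tasks per week and the fixed review capacity are hypothetical numbers, and only the 21% uplift comes from the Faros AI finding. It is not a reconstruction of how Faros measured throughput.

```python
def delivered_per_week(tasks_finished: float, review_capacity: float) -> float:
    """Work ships only after review, so the slower stage sets the pace."""
    return min(tasks_finished, review_capacity)


BASELINE_TASKS = 100.0    # hypothetical tasks finished per week before AI
REVIEW_CAPACITY = 100.0   # hypothetical reviews the team can complete per week
AI_UPLIFT = 0.21          # 21% more tasks completed per person (Faros AI)

tasks_with_ai = BASELINE_TASKS * (1 + AI_UPLIFT)

delivered_before = delivered_per_week(BASELINE_TASKS, REVIEW_CAPACITY)
delivered_after = delivered_per_week(tasks_with_ai, REVIEW_CAPACITY)

# Finished work that review cannot absorb piles up in the queue each week.
weekly_queue_growth = tasks_with_ai - delivered_after

print(f"Delivered before AI: {delivered_before:.0f}")
print(f"Delivered with AI:   {delivered_after:.0f}")
print(f"Review queue grows by {weekly_queue_growth:.0f} tasks per week")
```

In this model the pilot reports a 21% individual gain while delivery stays flat, which is why the question about the next step's capacity belongs in task selection.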


Key Data Points

Metric | Finding | Source
Organizations capturing substantial AI value | Only 5% | BCG, n=10,600, 2025
Inside-frontier quality improvement | 40% higher quality | BCG/Harvard, n=758, 2023
Outside-frontier accuracy degradation | 19-percentage-point drop (84% → 60-70%) | BCG/Harvard, n=758, 2023
Experienced developer slowdown | 19% slower with AI | METR RCT, n=16, 246 tasks, 2025
Perception vs. reality gap | Developers predicted +24%, measured -19% | METR RCT, 2025
Individual vs. organizational gains | 21% more tasks per person, 0% org improvement | Faros AI, n=10,000+, 2025
AI customer service failure rate | 4x higher than other AI applications | Qualtrics, n=20,000+, Q3 2025
Legal AI hallucination (best tool) | 17% (Lexis+ AI) | Stanford Law, 2025
HR screening racial bias | White-associated names favored 85% | U. of Washington/Brookings, 2024
Success with pre-defined metrics | 54% vs. 12% without | Pertama Partners, n=2,400+, 2026

What This Means for Your Organization

The executives who capture value from AI do not start with the most exciting task. They start with the most boring one — the task that is high-volume, low-judgment, and has a measurable cost per unit. Invoice processing. Ticket routing. Meeting summaries. Test generation. These are not the tasks that appear in vendor demos. They are the tasks that appear in ROI calculations 90 days later.

The single most expensive mistake in AI deployment is selecting a Red-column task for the first pilot. It produces a visible, politically damaging failure that poisons the organization’s willingness to try again. The METR data, the BCG data, and the Qualtrics data all confirm the same pattern: AI fails persuasively, not obviously. The output looks polished. The errors hide beneath the surface. By the time the organization discovers the problem, the pilot has consumed budget, credibility, and momentum.

Start with a Green-column task. Define success metrics before launch: organizations that do report a 54% success rate, versus 12% for those that do not (Pertama Partners, n=2,400+). Build a single success story. Then expand. If this card raised questions about which specific task in your organization belongs in which column, or how to design the pilot that follows, I would welcome that conversation at brandon@brandonsneider.com.

Sources

  1. METR — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” n=16 developers, 246 tasks. July 2025. Independent, pre-registered RCT. Very high credibility. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
  2. METR Follow-up — n=57 developers, 800+ tasks. February 2026. Very high credibility. https://metr.org/blog/2026-02-24-uplift-update/
  3. Dell’Acqua, McFowland, Mollick et al. (BCG/Harvard) — “Navigating the Jagged Technological Frontier.” n=758 BCG consultants, 18 tasks. Harvard Business School Working Paper 24-013, September 2023. Academic field experiment. Very high credibility. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321
  4. Brynjolfsson, Li, Raymond — “Generative AI at Work.” n=5,179 agents. Quarterly Journal of Economics, 2025. Academic. Very high credibility. https://www.nber.org/papers/w31161
  5. Faros AI — “The AI Productivity Paradox.” n=10,000+ developers, 1,255 teams. July 2025. Vendor but observational telemetry. High credibility. https://www.faros.ai/blog/ai-software-engineering
  6. Qualtrics XM Institute — AI customer service failure rates. n=20,000+ consumers, 14 countries, Q3 2025. Independent. High credibility. https://www.qualtrics.com/articles/news/ai-powered-customer-service-fails-at-four-times-the-rate-of-other-tasks/
  7. Stanford Law School / HAI — Legal AI hallucination rates. 2025. Academic. Very high credibility. https://onlinelibrary.wiley.com/doi/full/10.1111/jels.12413
  8. Doshi & Hauser — “Generative AI Enhances Individual Creativity but Reduces the Collective Diversity of Novel Content.” n=293 writers, 600 evaluators. Science Advances, July 2024. Academic. Very high credibility. https://www.science.org/doi/10.1126/sciadv.adn5290
  9. University of Washington / Brookings — AI resume screening bias. 120 names, 500+ job listings, 3 LLMs. October 2024. Academic. Very high credibility. https://www.brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/
  10. Pertama Partners — AI project success rates. n=2,400+ initiatives. 2025-2026. Independent. High credibility. https://www.pertamapartners.com/insights/ai-project-failure-statistics-2026
  11. BCG — “AI at Work 2025.” n=10,600+ workers, 11 countries. Only 5% of organizations achieving substantial AI gains. Independent survey. High credibility. https://www.bcg.com/publications/2025/ai-at-work-momentum-builds-but-gaps-remain
  12. UK Central Digital and Data Office — Microsoft 365 Copilot trial. n=20,000 civil servants. 2025. Government. High credibility. https://www.geekwire.com/2025/microsoft-ai-tools-saved-british-government-workers-26-minutes-a-day-new-study-shows/
  13. APQC — AP automation benchmarks. 2025. Independent. High credibility. https://www.apqc.org/resources/benchmarking/
  14. CodeRabbit — AI code quality analysis. n=470 PRs. 2025. Vendor but independent analysis. Moderate credibility. https://www.coderabbit.ai/
  15. DORA (DevOps Research and Assessment) — State of DevOps 2025. ~5,000 developers. Google-affiliated but peer-reviewed methodology. High credibility. https://dora.dev/research/2025/dora-report/
  16. Law360 — AI hallucination incident tracker. 729+ incidents through 2025. Independent legal publication. High credibility.
  17. HubSpot — Email optimization analysis. 2025. Vendor. Moderate credibility.
  18. Concord — Contract review speed benchmarks. 2025. Vendor. Moderate credibility.
  19. AlterSquare — AI code quality across 20+ client projects. 2026. Practitioner data. Moderate credibility.
  20. GitClear — 211M lines of code analysis. 2025. Independent analytics. High credibility. https://www.gitclear.com/ai_assistant_code_quality_2025_research

Brandon Sneider | brandon@brandonsneider.com | March 2026