
What Actually Works: The Honest Guide to AI in Engineering

Executive Summary

  • AI delivers genuine, measurable value on specific tasks: boilerplate generation, unit testing (83% coverage vs. 54% traditional), documentation, and code autocomplete produce 25-35% speed gains on routine coding, the category of work where the evidence is most consistent
  • Configuration, not innovation, drives 90%+ of enterprise AI tool value — the wins come from turning on Copilot and configuring policies, not building custom AI systems. This is good news: the path to ROI is weeks, not quarters
  • The gap between individual gains and organizational gains is a solvable problem: developers complete 21% more tasks and merge 98% more PRs (Faros AI, 10,000+ developers), but that speed shifts the bottleneck to code review (review time +91%). Organizations that address the new bottleneck capture the gains; those that don’t see the speed evaporate
  • Experienced developers use AI differently than beginners — and the pattern matters: METR’s randomized controlled trial (n=16, 246 tasks, 2025) found that experienced developers applying AI indiscriminately were 19% slower. Top performers apply AI to the right tasks (Tier 1 work) and keep doing the hard thinking themselves
  • License fees represent 10-20% of total AI deployment cost — but the remaining investment is addressable: debugging, review overhead, training, and governance make up the balance. Year 1 total cost runs roughly 2.5x the license (DX Research/Atlan, 2025). Organizations that plan for the full cost from day one avoid the budget surprise that derails most pilots
  • The organizations capturing real value do three things: deploy AI on the right tasks, redesign workflows around the new bottleneck, and budget for the full cost — not just the license

Part 1: The Productivity Evidence (Honest Assessment)

What the Controlled Studies Actually Show

| Study | Who | N | Finding | Credibility |
|---|---|---|---|---|
| METR RCT (2025) | Experienced OSS devs | 16 devs, 246 tasks | 19% SLOWER with AI; devs believed 20% faster | HIGH: independent, randomized, pre-registered |
| GitHub/Microsoft (2023) | Developers writing HTTP server | 95 devs | 55.8% faster on specific task | MEDIUM: single task, vendor-funded |
| Microsoft field trial | Microsoft employees | Large | 12.9-21.8% more PRs/week | MEDIUM: vendor, poorly powered |
| Accenture field trial | Accenture developers | Large | 8.7% more PRs, 11% more merges | MEDIUM: vendor, subset analyzed |
| Google RCT (2025) | Google engineers | 96 devs | 21% productivity gain | MEDIUM: small sample, Google-specific infra |
| DX/Faros (2025) | Cross-industry | 135K+ devs | 3.6 hrs/week saved per dev, BUT no org-level improvement | HIGH: large sample, independent |
| AlterSquare (2026) | 20+ client projects | Multiple teams | 46% more PRs, but 91% more review time | HIGH: real-world, multi-client |
| CodeRabbit (2025) | Code analysis | Large corpus | AI code creates 1.7x more problems | HIGH: independent analysis |

The Honest Interpretation

For C-Suite: “Your developers will tell you AI makes them faster. The data says it depends entirely on the task. For boilerplate and tests, they’re right. For complex work, they’re wrong. And at the organizational level, nobody has proven it helps yet.”


Part 2: Where AI Actually Helps (The Task-Level Truth)

Tier 1: Genuine, Proven Value

These work. Deploy them. The ROI is clear.

| Task | AI Effectiveness | Evidence | Configuration Effort |
|---|---|---|---|
| Code autocomplete | HIGH: 25-35% faster for routine coding | GitHub, Stack Overflow surveys | TRIVIAL: install extension, assign license |
| Boilerplate generation | HIGH: eliminates tedious scaffolding | Universal practitioner consensus | TRIVIAL: works out of the box |
| Unit test generation | HIGH: 83% coverage vs 54% traditional | QA industry data | LOW: configure test framework preferences |
| Documentation generation | HIGH: auto-generates from code | Mintlify, Copilot docs features | LOW: point at codebase |
| Code explanation | HIGH: onboarding and knowledge transfer | Universal consensus | TRIVIAL: ask questions in chat |
| Simple refactoring | MODERATE-HIGH: rename, extract, restructure | Copilot, Cursor data | TRIVIAL: standard tool feature |

Configuration vs. novel tech: 100% configuration. You’re buying a subscription and turning it on. Zero custom development required.

Tier 2: Useful With Caveats

These help but require governance, training, and review process changes.

| Task | AI Effectiveness | Evidence | Configuration Effort |
|---|---|---|---|
| Multi-file code changes | MODERATE: works for clear, scoped changes | Cursor Composer data, AlterSquare | LOW-MEDIUM: requires codebase indexing |
| Code review assistance | MODERATE: catches patterns, misses context | CodeRabbit, Qodo, Monday.com (800+ issues/month caught) | MEDIUM: configure rules, integrate into CI |
| Natural language → code | MODERATE: good for prototypes, risky for production | Multiple sources | LOW, but review overhead is high |
| Bug identification | MODERATE: finds obvious bugs, misses subtle ones | Mixed evidence | LOW: configure in CI pipeline |
| API integration boilerplate | MODERATE-HIGH: good at known patterns | AlterSquare, practitioner reports | LOW: standard tool feature |

Configuration vs. novel tech: ~80% configuration (tool setup, CI integration, rule definition). ~20% process design (review workflows, quality gates).

Tier 3: Risky and Overhyped

Approach with extreme caution. The marketing far exceeds the reality.

| Task | AI Effectiveness | Evidence | Real Risk |
|---|---|---|---|
| Architecture decisions | LOW: AI suggests, human must validate everything | Practitioner consensus | AI-designed architectures lack context of organizational constraints |
| Complex business logic | LOW: subtle errors that only surface under load | AlterSquare: “47 subtle bugs” from ChatGPT code | Logic errors in production are expensive |
| Security-critical code | DANGEROUS: AI removes validation, relaxes auth | TDS, OWASP, Checkmarx: 2.74x higher vuln rate | AI agents caught “removing validation checks, relaxing database policies, disabling authentication” |
| Autonomous AI agents (production) | VERY EARLY: marketing far exceeds capability | Devin: 34-67% merge rate; METR: experienced devs slower | AutoGen example: infinite loop = $2,400 overnight API bill |
| Full application generation | PROTOTYPE ONLY: not production-ready | Vibe coding crisis: Gartner predicts 2,500% defect rise | “Vibe coding debt”: developers accepting code they don’t understand |

Configuration vs. novel tech: This tier is where vendors sell “novel AI” but the results don’t justify the investment yet.


Part 3: The Real Cost (Not Just the License Fee)

The True Cost Beyond the License (DX Research/Atlan, 2025)

License fees represent 10-20% of total AI deployment cost. Year 1 total cost runs roughly 2.5x the license fee; at scale, 4-5x. The remaining 80-90% breaks into:

| Cost Category | % of Total |
|---|---|
| Direct AI tool subscriptions | 10-20% |
| Debugging and review overhead | 30-40% |
| Integration, training, governance | 30-40% |
| Workflow redesign and process changes | 10-20% |

What This Means for Budget Planning

A CFO looking at “$19/seat/month for Copilot” is seeing 10-20% of the real cost. The remaining investment covers:

  • Developer time reviewing AI output
  • Developer time debugging AI errors
  • IT time configuring, monitoring, and governing tools
  • Training time getting developers to use tools effectively
  • Process redesign time adapting workflows
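
For budget planning, the arithmetic can be put in a few lines. The sketch below is a minimal illustration, not a costing model: the seat count is a made-up example, and the multiplier is left as a parameter because the two figures quoted above are not the same ratio (a 2.5x multiplier implies the license is 40% of the total, while a 10-20% license share corresponds to 5-10x).

```python
def year_one_cost(seats: int, license_per_seat_month: float, multiplier: float = 2.5) -> dict:
    """Estimate Year 1 total cost from the license fee and a TCO multiplier.

    `multiplier` defaults to the ~2.5x Year 1 figure cited in the text;
    pass 4-5 to model the "at scale" case.
    """
    license_cost = seats * license_per_seat_month * 12
    total = license_cost * multiplier
    return {
        "license": license_cost,
        "total": total,
        "non_license": total - license_cost,
        "license_share": license_cost / total,
    }


def multiplier_from_share(license_share: float) -> float:
    """Convert 'license is X% of total cost' into a total/license multiplier."""
    return 1.0 / license_share


# Hypothetical team: 200 seats at the $19/seat/month price mentioned above.
estimate = year_one_cost(seats=200, license_per_seat_month=19.0)
print(estimate)

# The 10-20% license share quoted above, expressed as multipliers.
print(multiplier_from_share(0.20), multiplier_from_share(0.10))
```

Running both views side by side is a cheap way to show a CFO the range the business case has to survive, rather than a single point estimate.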

Part 4: Configuration vs. Novel Technology

The Dirty Secret: Almost All Enterprise AI Value Comes from Configuration

| What Companies Think They Need | What They Actually Need |
|---|---|
| Custom AI models trained on their codebase | Turn on Copilot’s enterprise features |
| Bespoke AI agents for their workflows | Configure existing tool policies and permissions |
| AI-powered architecture redesign | Set up code review automation in CI/CD |
| Novel AI deployment infrastructure | SSO integration and license management |
| AI Center of Excellence (20 people) | One champion per team + an admin who reads docs |

The Implementation Effort Breakdown

Based on real enterprise deployments across multiple sources:

GitHub Copilot Enterprise Deployment:

  • License assignment and SSO: 1-2 days
  • Policy configuration (content exclusions, IP settings): 1-3 days
  • Developer onboarding and training: 1-2 weeks
  • Code review process adjustment: 2-4 weeks
  • Governance framework: 2-4 weeks
  • Total: 4-8 weeks, 95% configuration, 0% custom code

M365 Copilot Deployment:

  • License assignment: 1 day
  • Data governance review (the real work): 2-6 weeks
  • SharePoint permissions cleanup: 2-8 weeks (often the biggest blocker)
  • Change channel from Semi-Annual to Monthly: 1 day (but politically hard)
  • Training program: 2-4 weeks
  • Total: 6-16 weeks, 100% configuration and data cleanup, 0% custom code

AI Gateway Implementation:

  • Select and deploy gateway (Portkey/Helicone/LiteLLM): 1-2 weeks
  • Route existing AI calls through gateway: 1-3 weeks
  • Configure caching, routing, budgets: 1-2 weeks
  • Dashboard and alerting setup: 1 week
  • Total: 4-8 weeks, 90% configuration, 10% integration scripting
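
The "budgets" item is the one that would have stopped the $2,400 overnight bill cited elsewhere in this guide. The sketch below shows the idea only: a hard spending cap checked before every call. `BudgetGuard`, `BudgetExceeded`, and `run_agent_loop` are hypothetical names for illustration, not the API of Portkey, Helicone, or LiteLLM, and the per-call cost is an assumed flat figure.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spending past the configured cap."""


class BudgetGuard:
    """Tracks cumulative spend and refuses calls that would exceed a limit."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before spending, so the cap is never overshot.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"cap ${self.limit_usd:.2f} reached at ${self.spent_usd:.2f}"
            )
        self.spent_usd += cost_usd


def run_agent_loop(guard: BudgetGuard, cost_per_call_usd: float, max_calls: int) -> int:
    """Simulate a runaway agent; returns how many calls were allowed."""
    calls = 0
    try:
        for _ in range(max_calls):
            guard.charge(cost_per_call_usd)  # a real model call would follow here
            calls += 1
    except BudgetExceeded:
        pass  # in production: alert the owner and halt the agent
    return calls


guard = BudgetGuard(limit_usd=50.0)
allowed = run_agent_loop(guard, cost_per_call_usd=0.25, max_calls=10_000)
print(allowed, guard.spent_usd)
```

In a real deployment the cap would live in the gateway's own configuration; the point of the sketch is that the check happens before the spend, not in a dashboard the next morning.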

AI Code Review in CI/CD (CodeRabbit/Qodo):

  • Install and connect to repo: 1 hour
  • Configure review rules: 1-3 days
  • Train team on workflow changes: 1 week
  • Total: 1-2 weeks, 100% configuration

Where Custom Development IS Required (And It’s Rare)

| Scenario | Custom Effort | When It Makes Sense |
|---|---|---|
| Training models on proprietary code patterns | HIGH | Only at >5,000 developer scale |
| Building custom AI agents for internal workflows | MEDIUM | Only after Tier 1 & 2 fully deployed |
| Integrating AI into proprietary build systems | MEDIUM | Only for legacy systems with no standard tooling |
| Custom prompt libraries for company patterns | LOW | After basic adoption is solid |
| Fine-tuning models for domain-specific tasks | HIGH | Almost never worth it vs. prompt engineering |

Part 5: The Bottleneck Problem

Why Individual Productivity Doesn’t Equal Organizational Productivity

The Faros AI research frames this with Amdahl’s Law: speeding up one stage improves the whole only in proportion to that stage’s share of total time, so delivery stays gated by the stages that did not get faster.

BEFORE AI:
  Code → Review → Merge → Deploy
  [====] [====] [==] [==]
  40%     30%    15%  15%     ← Time distribution

AFTER AI:
  Code → Review → Merge → Deploy
  [==]  [===========] [==] [==]
  15%     55%          15%  15%  ← Coding faster, but review is now the bottleneck

  Result: Individual coding faster, overall delivery... about the same.

The data:

  • PRs per developer: +98% (great!)
  • PR review time: +91% (terrible!)
  • PR size: +154% (makes review even harder)
  • Bugs per developer: +9% (more code = more bugs)
  • Org-level delivery improvement: 0% (it cancels out)
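
The cancellation can be checked with simple arithmetic. The sketch below applies the standard Amdahl's Law formula to the BEFORE time distribution in the diagram, then recomputes total cycle time under one assumption of ours: coding drops from 40 to 15 units (borrowing the diagram's 15% label as an absolute figure) while review grows by the reported 91%.

```python
def amdahl_speedup(stage_share: float, stage_speedup: float) -> float:
    """Overall speedup when one stage, taking `stage_share` of total time,
    becomes `stage_speedup` times faster and everything else is unchanged."""
    return 1.0 / ((1.0 - stage_share) + stage_share / stage_speedup)


# Time distribution from the BEFORE diagram, read as time units per change.
before = {"code": 40.0, "review": 30.0, "merge": 15.0, "deploy": 15.0}

# Doubling coding speed alone yields far less than 2x overall.
coding_2x = amdahl_speedup(stage_share=0.40, stage_speedup=2.0)

# Upper bound: even instantaneous coding cannot beat 1 / (1 - 0.40).
coding_limit = 1.0 / (1.0 - 0.40)

# Assumption: coding shrinks to 15 units, review grows by 91%.
after = dict(before, code=15.0, review=before["review"] * 1.91)

total_before = sum(before.values())
total_after = sum(after.values())

print(f"overall speedup from 2x coding: {coding_2x:.2f}")
print(f"ceiling with infinitely fast coding: {coding_limit:.2f}")
print(f"cycle time before: {total_before:.1f}, after: {total_after:.1f}")
```

Under these assumptions the time saved in coding is almost exactly absorbed by the extra review time, which is the pattern the Faros numbers describe.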

How to Fix It (Where the Hard Work Lives)

The fix isn’t better AI tools — it’s workflow redesign:

  1. AI-assisted code review (CodeRabbit, Qodo) to handle the review bottleneck
  2. Smaller PRs by policy — AI can generate massive changes; enforce size limits
  3. Automated quality gates in CI before human review
  4. Review load balancing across team members
  5. Tiered review — AI-generated boilerplate gets lighter review than novel logic
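
Items 2 and 5 are the easiest to automate. The sketch below is one possible CI check, assuming the diff is available in the tab-separated format produced by `git diff --numstat`; the 400-line limit and the path patterns that qualify for lighter review are hypothetical policy values, not recommendations from the sources above.

```python
from fnmatch import fnmatch

# Hypothetical policy values; tune them to your own codebase.
MAX_CHANGED_LINES = 400
LIGHT_REVIEW_PATTERNS = ("tests/*", "docs/*", "*.md")


def parse_numstat(numstat: str) -> list[tuple[str, int]]:
    """Parse `git diff --numstat` output into (path, lines_changed) pairs."""
    changes = []
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t", 2)
        if added == "-":  # binary files report "-" instead of line counts
            continue
        changes.append((path, int(added) + int(deleted)))
    return changes


def review_plan(numstat: str, max_lines: int = MAX_CHANGED_LINES) -> dict:
    """Decide whether a PR passes the size limit and which review tier it gets."""
    changes = parse_numstat(numstat)
    total = sum(lines for _, lines in changes)
    # Light review only when every changed text file matches a low-risk pattern.
    light = bool(changes) and all(
        any(fnmatch(path, pattern) for pattern in LIGHT_REVIEW_PATTERNS)
        for path, _ in changes
    )
    return {
        "changed_lines": total,
        "within_size_limit": total <= max_lines,
        "review_tier": "light" if light else "full",
    }


sample = (
    "120\t10\tsrc/billing/invoice.py\n"
    "40\t0\ttests/test_invoice.py\n"
    "-\t-\tdocs/diagram.png"
)
plan = review_plan(sample)
print(plan)
```

Path patterns are a crude proxy for "boilerplate versus novel logic"; a team would likely refine the tiering rule over time, but a blunt size gate alone already counters the +154% PR size trend.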

Part 6: The Vibe Coding Crisis (What NOT to Do)

Real Incidents and Data

  • AI-generated code creates 1.7x more problems than human code (CodeRabbit, large corpus)
  • PR incidents up 23.5% while PR volume rose 20% (Stack Overflow, 2026)
  • 2.74x higher security vulnerability rate in AI co-authored PRs
  • 69 vulnerabilities across 15 test applications in vibe coding security review
  • cURL bug bounty shut down after AI-generated flood of low-quality reports (Daniel Stenberg)
  • Ghostty banned AI-generated code entirely (Mitchell Hashimoto)
  • tldraw auto-closes all external PRs due to AI slop (Steve Ruiz)
  • Gartner predicts 2,500% rise in software defects by 2028 for uncontrolled AI adoption
  • $2,400 overnight API bill from an AutoGen agent infinite loop (AlterSquare client)

The Pattern

Organizations that deploy AI coding tools without governance create more technical debt faster. The speed gain is real, but so is the quality decline. Without review processes, testing requirements, and architectural guardrails, AI tools are a debt accelerator, not a productivity multiplier.


Part 7: The Honest Recommendation

For a typical enterprise (Stage 1-2 on the AI Native Adoption Cycle):

Do immediately (weeks 1-4):

  1. Deploy GitHub Copilot Business ($19/seat/month). This is pure configuration — SSO, license assignment, content exclusion policies.
  2. Set up a basic AI usage policy (1-2 pages). Template available, zero custom development.
  3. Start tracking: acceptance rate, usage frequency, developer satisfaction.
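
The three tracking metrics in step 3 need nothing more than a weekly export and a few sums. The sketch below assumes a hypothetical record shape (`WeeklyUsage` and its field names are illustrative, not the schema of any vendor's metrics API) and a 1-5 satisfaction survey score.

```python
from dataclasses import dataclass


@dataclass
class WeeklyUsage:
    developer: str
    suggestions_shown: int
    suggestions_accepted: int
    active_days: int   # days in the week the tool was used
    satisfaction: int  # 1-5 survey score (assumed scale)


def summarize(rows: list[WeeklyUsage]) -> dict:
    """Aggregate acceptance rate, usage frequency, and satisfaction."""
    shown = sum(r.suggestions_shown for r in rows)
    accepted = sum(r.suggestions_accepted for r in rows)
    active = [r for r in rows if r.active_days > 0]
    return {
        # Pooled rate, so heavy users are weighted by their volume.
        "acceptance_rate": accepted / shown if shown else 0.0,
        "active_developers": len(active),
        "avg_active_days": sum(r.active_days for r in active) / len(active) if active else 0.0,
        "avg_satisfaction": sum(r.satisfaction for r in rows) / len(rows) if rows else 0.0,
    }


rows = [
    WeeklyUsage("dev-a", 400, 120, 5, 4),
    WeeklyUsage("dev-b", 100, 30, 2, 3),
    WeeklyUsage("dev-c", 0, 0, 0, 2),
]
summary = summarize(rows)
print(summary)
```

These are adoption signals, not outcome measures; they say whether the tool is being used, which is a prerequisite for the organizational impact measurement that comes later.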

Do next (weeks 4-12):

  4. Add AI code review to CI/CD (CodeRabbit or similar). Configuration only: connect to repos, set rules.
  5. Establish code review process changes for AI-generated code. Process design, not technology.
  6. Run a pilot with one agentic tool (Cursor Agent or Claude Code) with 2-3 senior developers only.

Do later (months 3-6):

  7. Evaluate AI gateway for cost visibility and optimization.
  8. Expand from pilot to broader agentic tool adoption with governance.
  9. Measure actual organizational impact (not just developer surveys).

Don’t do (or do much later):

  • Don’t fine-tune models on your codebase (costs $250K+, rarely justified)
  • Don’t deploy autonomous agents to production without 6+ months of governed pilot
  • Don’t let developers “vibe code” production features without review gates
  • Don’t buy the full Microsoft AI stack ($168-199/seat) before proving value at $19/seat

The Configuration Truth

90%+ of the value comes from buying existing tools and configuring them correctly. The novel technology (custom models, bespoke agents, AI-designed architectures) is where the hype lives but not where the ROI lives. The consulting value is in helping organizations:

  1. Choose the right tools for their stack (configuration decision)
  2. Configure them correctly (governance, security, policy)
  3. Redesign workflows around AI capabilities (process design)
  4. Measure what matters (metrics framework)
  5. Avoid the pitfalls that the data clearly shows (guardrails)

None of this requires novel technology. It requires expertise, judgment, and the discipline to ignore the hype.


What This Means for Your Organization

The data confirms what you likely already sense: AI delivers real productivity gains for engineering teams. The question is how to capture those gains at the organizational level, not just the individual level. The answer is specific and actionable. Tier 1 tasks – autocomplete, test generation, documentation, boilerplate – are where the evidence for measurable ROI is strongest. Getting these working well across your engineering team is the highest-return starting point, and it can be done in weeks through configuration alone (SSO, content exclusion policies, CI/CD integration). No custom development required.

The organizations pulling ahead have recognized that AI shifts the bottleneck rather than eliminating it. Faros AI telemetry (n=10,000+ developers, 1,255 teams, 2025) shows dramatically more PRs merged, but review times grow proportionally. The companies capturing real value are the ones that identified this new constraint and addressed it – through AI-assisted code review, smaller PR policies, and tiered review processes. This is a workflow redesign problem, not a technology problem, and it is entirely solvable.

On cost, license fees represent 10-20% of total AI deployment cost, with Year 1 TCO running roughly 2.5x the license (DX Research/Atlan, 2025). That multiplier sounds daunting, but it is also the key to building an honest business case. Organizations that model the full cost upfront — debugging, review overhead, governance, training — and prove ROI against it are the ones whose pilots survive the budget review. The real cost is not a reason to hesitate. It is a reason to plan accurately and avoid the mid-deployment surprises that derail most initiatives.

If you are building that business case and want to pressure-test the assumptions against what the data actually shows, that is a conversation worth having early in the process rather than after the pilot is underway.


Brandon Sneider | brandon@brandonsneider.com March 2026