What Actually Works: The Honest Guide to AI in Engineering

Brandon Sneider · March 2026

Executive Summary

AI delivers genuine, measurable value on specific tasks: boilerplate generation, unit testing (83% coverage vs. 54% traditional), documentation, and code autocomplete produce 25-35% speed gains that hold up across every controlled study in the literature
Configuration, not innovation, drives 90%+ of enterprise AI tool value — the wins come from turning on Copilot and configuring policies, not building custom AI systems. This is good news: the path to ROI is weeks, not quarters
The difference between individual gains and organizational gains is a solvable problem: developers complete 21% more tasks and merge 98% more PRs (Faros AI, 10,000+ developers across 1,255 teams), but that speed shifts the bottleneck to code review (review time +91%). Organizations that address the new bottleneck capture the gains; those that don’t see the speed evaporate
Experienced developers use AI differently than beginners — and the pattern matters: METR’s randomized controlled trial (n=16, 246 tasks, 2025) found that experienced developers applying AI indiscriminately were 19% slower. Top performers apply AI to the right tasks (Tier 1 work) and keep doing the hard thinking themselves
License fees represent 10-20% of total AI deployment cost — but the remaining investment is addressable: debugging, review overhead, training, and governance make up the balance. Year 1 total cost runs roughly 2.5x the license (DX Research/Atlan, 2025). Organizations that plan for the full cost from day one avoid the budget surprise that derails most pilots
The organizations capturing real value do three things: deploy AI on the right tasks, redesign workflows around the new bottleneck, and budget for the full cost — not just the license

Part 1: The Productivity Evidence (Honest Assessment)

What the Controlled Studies Actually Show

Study	Who	N	Finding	Credibility
METR RCT (2025)	Experienced OSS devs	16 devs, 246 tasks	19% SLOWER with AI; devs believed 20% faster	HIGH — independent, randomized, pre-registered
GitHub/Microsoft (2023)	Developers writing HTTP server	95 devs	55.8% faster on specific task	MEDIUM — single task, vendor-funded
Microsoft field trial	Microsoft employees	Large	12.9-21.8% more PRs/week	MEDIUM — vendor, poorly powered
Accenture field trial	Accenture developers	Large	8.7% more PRs, 11% more merges	MEDIUM — vendor, subset analyzed
Google RCT (2025)	Google engineers	96 devs	21% productivity gain	MEDIUM — small sample, Google-specific infra
DX/Faros (2025)	Cross-industry	135K+ devs	3.6 hrs/week saved per dev, BUT no org-level improvement	HIGH — large sample, independent
AlterSquare (2026)	20+ client projects	Multiple teams	46% more PRs, but 91% more review time	HIGH — real-world, multi-client
CodeRabbit (2025)	Code analysis	Large corpus	AI code creates 1.7x more problems	HIGH — independent analysis

The Honest Interpretation

For C-Suite: “Your developers will tell you AI makes them faster. The data says it depends entirely on the task. For boilerplate and tests, they’re right. For complex work, they’re wrong. And at the organizational level, nobody has proven it helps yet.”

Part 2: Where AI Actually Helps (The Task-Level Truth)

Tier 1: Genuine, Proven Value

These work. Deploy them. The ROI is clear.

Task	AI Effectiveness	Evidence	Configuration Effort
Code autocomplete	HIGH — 25-35% faster for routine coding	GitHub, Stack Overflow surveys	TRIVIAL — install extension, assign license
Boilerplate generation	HIGH — eliminates tedious scaffolding	Universal practitioner consensus	TRIVIAL — works out of the box
Unit test generation	HIGH — 83% coverage vs 54% traditional	QA industry data	LOW — configure test framework preferences
Documentation generation	HIGH — auto-generates from code	Mintlify, Copilot docs features	LOW — point at codebase
Code explanation	HIGH — onboarding and knowledge transfer	Universal consensus	TRIVIAL — ask questions in chat
Simple refactoring	MODERATE-HIGH — rename, extract, restructure	Copilot, Cursor data	TRIVIAL — standard tool feature

Configuration vs. novel tech: 100% configuration. You’re buying a subscription and turning it on. Zero custom development required.

Tier 2: Useful With Caveats

These help but require governance, training, and review process changes.

Task	AI Effectiveness	Evidence	Configuration Effort
Multi-file code changes	MODERATE — works for clear, scoped changes	Cursor Composer data, AlterSquare	LOW-MEDIUM — requires codebase indexing
Code review assistance	MODERATE — catches patterns, misses context	CodeRabbit, Qodo, Monday.com (800+ issues/month caught)	MEDIUM — configure rules, integrate into CI
Natural language → code	MODERATE — good for prototypes, risky for production	Multiple sources	LOW — but review overhead is high
Bug identification	MODERATE — finds obvious bugs, misses subtle ones	Mixed evidence	LOW — configure in CI pipeline
API integration boilerplate	MODERATE-HIGH — good at known patterns	AlterSquare, practitioner reports	LOW — standard tool feature

Configuration vs. novel tech: ~80% configuration (tool setup, CI integration, rule definition). ~20% process design (review workflows, quality gates).

Tier 3: Risky and Overhyped

Approach with extreme caution. The marketing far exceeds the reality.

Task	AI Effectiveness	Evidence	Real Risk
Architecture decisions	LOW — AI suggests, human must validate everything	Practitioner consensus	AI-designed architectures lack context of organizational constraints
Complex business logic	LOW — subtle errors that only surface under load	AlterSquare: “47 subtle bugs” from ChatGPT code	Logic errors in production are expensive
Security-critical code	DANGEROUS — AI removes validation, relaxes auth	TDS, OWASP, Checkmarx: 2.74x higher vuln rate	AI agents caught “removing validation checks, relaxing database policies, disabling authentication”
Autonomous AI agents (production)	VERY EARLY — marketing far exceeds capability	Devin: 34-67% merge rate; METR: experienced devs slower	AutoGen example: infinite loop = $2,400 overnight API bill
Full application generation	PROTOTYPE ONLY — not production-ready	Vibe coding crisis: Gartner predicts 2,500% defect rise	“Vibe coding debt” — developers accepting code they don’t understand

Configuration vs. novel tech: This tier is where vendors sell “novel AI” but the results don’t justify the investment yet.

Part 3: The Real Cost (Not Just the License Fee)

The True Cost Beyond the License (DX Research/Atlan, 2025)

License fees represent 10-20% of total AI deployment cost. Year 1 total cost runs roughly 2.5x the license fee; at scale, 4-5x. The remaining 80-90% breaks into:

Cost Category	% of Total
Direct AI tool subscriptions	10-20%
Debugging and review overhead	30-40%
Integration, training, governance	30-40%
Workflow redesign and process changes	10-20%

What This Means for Budget Planning

A CFO looking at “$19/seat/month for Copilot” is seeing 10-20% of the real cost. The remaining investment covers:

Developer time reviewing AI output
Developer time debugging AI errors
IT time configuring, monitoring, and governing tools
Training time getting developers to use tools effectively
Process redesign time adapting workflows
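
As a rough illustration of the budgeting arithmetic, the sketch below turns the multipliers quoted above (about 2.5x the license fee in Year 1, 4-5x at scale) into dollar figures. The function name, the 100-seat team, and the flat $19/seat/month price are assumptions chosen for the example, not figures from a specific deployment.

```python
def total_cost_estimates(seats: int, price_per_seat_month: float) -> dict:
    """Hypothetical helper: apply the article's cost multipliers to a license bill."""
    annual_license = seats * price_per_seat_month * 12
    return {
        "annual_license": annual_license,
        "year1_total": annual_license * 2.5,   # "roughly 2.5x the license fee"
        "at_scale_low": annual_license * 4,    # "at scale, 4-5x"
        "at_scale_high": annual_license * 5,
    }


# Assumed scenario: 100 seats at $19/seat/month.
estimate = total_cost_estimates(seats=100, price_per_seat_month=19)
for label, dollars in estimate.items():
    print(f"{label}: ${dollars:,.0f}")
# annual_license: $22,800
# year1_total: $57,000
# at_scale_low: $91,200
# at_scale_high: $114,000
```

The point of running the numbers this way is that the budget request starts from the Year 1 figure, not from the license line item.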

Part 4: Configuration vs. Novel Technology

The Dirty Secret: Almost All Enterprise AI Value Comes from Configuration

What Companies Think They Need	What They Actually Need
Custom AI models trained on their codebase	Turn on Copilot’s enterprise features
Bespoke AI agents for their workflows	Configure existing tool policies and permissions
AI-powered architecture redesign	Set up code review automation in CI/CD
Novel AI deployment infrastructure	SSO integration and license management
AI Center of Excellence (20 people)	One champion per team + an admin who reads docs

The Implementation Effort Breakdown

Based on real enterprise deployments across multiple sources:

GitHub Copilot Enterprise Deployment:

License assignment and SSO: 1-2 days
Policy configuration (content exclusions, IP settings): 1-3 days
Developer onboarding and training: 1-2 weeks
Code review process adjustment: 2-4 weeks
Governance framework: 2-4 weeks
Total: 4-8 weeks, 95% configuration, 0% custom code

M365 Copilot Deployment:

License assignment: 1 day
Data governance review (the real work): 2-6 weeks
SharePoint permissions cleanup: 2-8 weeks (often the biggest blocker)
Switch the Microsoft 365 Apps update channel from Semi-Annual to Monthly: 1 day (but politically hard)
Training program: 2-4 weeks
Total: 6-16 weeks, 100% configuration and data cleanup, 0% custom code

AI Gateway Implementation:

Select and deploy gateway (Portkey/Helicone/LiteLLM): 1-2 weeks
Route existing AI calls through gateway: 1-3 weeks
Configure caching, routing, budgets: 1-2 weeks
Dashboard and alerting setup: 1 week
Total: 4-8 weeks, 90% configuration, 10% integration scripting
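
To make the "budgets" step concrete, here is a minimal sketch of a hard spend ceiling wrapped around a loop of model calls. It does not use the API of Portkey, Helicone, LiteLLM, or any other gateway; `SpendGuard`, `BudgetExceeded`, the $50 limit, and the $0.50 per-call cost are all hypothetical names and numbers picked for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured ceiling."""


class SpendGuard:
    """Hypothetical per-run budget: refuse any call that would exceed the limit."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"limit ${self.limit_usd:.2f} reached after ${self.spent_usd:.2f}"
            )
        self.spent_usd += cost_usd


def run_agent_loop(guard: SpendGuard, cost_per_call_usd: float, max_iterations: int) -> int:
    """Simulate an agent that keeps calling a model; return how many calls went through."""
    calls = 0
    try:
        for _ in range(max_iterations):
            guard.charge(cost_per_call_usd)  # a real loop would call the model here
            calls += 1
    except BudgetExceeded:
        pass  # stop the run instead of spending through the night
    return calls


guard = SpendGuard(limit_usd=50.0)
calls_made = run_agent_loop(guard, cost_per_call_usd=0.5, max_iterations=10_000)
print(calls_made, guard.spent_usd)  # 100 50.0
```

A runaway loop still happens in this sketch, but it stops at the ceiling rather than at whatever the provider bills by morning.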

AI Code Review in CI/CD (CodeRabbit/Qodo):

Install and connect to repo: 1 hour
Configure review rules: 1-3 days
Train team on workflow changes: 1 week
Total: 1-2 weeks, 100% configuration

Where Custom Development IS Required (And It’s Rare)

Scenario	Custom Effort	When It Makes Sense
Training models on proprietary code patterns	HIGH	Only at >5,000 developer scale
Building custom AI agents for internal workflows	MEDIUM	Only after Tier 1 & 2 fully deployed
Integrating AI into proprietary build systems	MEDIUM	Only for legacy systems with no standard tooling
Custom prompt libraries for company patterns	LOW	After basic adoption is solid
Fine-tuning models for domain-specific tasks	HIGH	Almost never worth it vs. prompt engineering

Part 5: The Bottleneck Problem

Why Individual Productivity Doesn’t Equal Organizational Productivity

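The Faros AI research frames this with Amdahl’s Law: speeding up one stage improves the whole pipeline only in proportion to that stage’s share of total time, so the stages left untouched set the ceiling on overall delivery speed.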

BEFORE AI:
  Code → Review → Merge → Deploy
  [====] [====] [==] [==]
  40%     30%    15%  15%     ← Time distribution

AFTER AI:
  Code → Review → Merge → Deploy
  [==]  [===========] [==] [==]
  15%     55%          15%  15%  ← Coding faster, but review is now the bottleneck

  Result: Individual coding faster, overall delivery... about the same.

The data:

PRs per developer: +98% (great!)
PR review time: +91% (terrible!)
PR size: +154% (makes review even harder)
Bugs per developer: +9% (more code = more bugs)
Org-level delivery improvement: 0% (it cancels out)
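
The arithmetic behind the diagram can be sketched as follows. The stage shares come from the BEFORE diagram above; treating "+98% PRs" as coding becoming 2x faster and applying the +91% review figure directly to review time are simplifying assumptions made for this example, not how the underlying research computed its results.

```python
def amdahl_speedup(fraction_improved: float, stage_speedup: float) -> float:
    """Overall speedup when only one fraction of the work gets faster (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / stage_speedup)


# Time shares from the BEFORE diagram (percent of delivery time).
before = {"code": 40.0, "review": 30.0, "merge": 15.0, "deploy": 15.0}

# Best case: coding is 2x faster and nothing else changes (assumption).
best_case = amdahl_speedup(fraction_improved=0.40, stage_speedup=2.0)

# More realistic case: coding is 2x faster, but review time grows by 91%.
after = dict(before)
after["code"] = before["code"] / 2.0
after["review"] = before["review"] * 1.91
net_speedup = sum(before.values()) / sum(after.values())

print(f"best case: {best_case:.2f}x")    # best case: 1.25x
print(f"with review growth: {net_speedup:.2f}x")  # with review growth: 0.93x
```

Even the best case caps out at 1.25x because coding was only 40% of the pipeline; once review time grows, the net result under these assumptions dips slightly below 1x, which is consistent with the flat org-level number above.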

How to Fix It (Where the Hard Work Lives)

The fix isn’t better AI tools — it’s workflow redesign:

AI-assisted code review (CodeRabbit, Qodo) to handle the review bottleneck
Smaller PRs by policy — AI can generate massive changes; enforce size limits
Automated quality gates in CI before human review
Review load balancing across team members
Tiered review — AI-generated boilerplate gets lighter review than novel logic

Part 6: The Vibe Coding Crisis (What NOT to Do)

Real Incidents and Data

AI-generated code creates 1.7x more problems than human code (CodeRabbit, large corpus)
PR incidents up 23.5% despite 20% more PRs (Stack Overflow, 2026)
2.74x higher security vulnerability rate in AI co-authored PRs
69 vulnerabilities across 15 test applications in vibe coding security review
cURL bug bounty shut down after AI-generated flood of low-quality reports (Daniel Stenberg)
Ghostty banned AI-generated code entirely (Mitchell Hashimoto)
tldraw auto-closes all external PRs due to AI slop (Steve Ruiz)
Gartner predicts 2,500% rise in software defects by 2028 for uncontrolled AI adoption
$2,400 overnight API bill from an AutoGen agent infinite loop (AlterSquare client)

The Pattern

Organizations that deploy AI coding tools without governance create more technical debt faster. The speed gain is real, but so is the quality decline. Without review processes, testing requirements, and architectural guardrails, AI tools are a debt accelerator, not a productivity multiplier.

Part 7: The Honest Recommendation

For a typical enterprise (Stage 1-2 on the AI Native Adoption Cycle):

Do immediately (weeks 1-4):

Deploy GitHub Copilot Business ($19/seat/month). This is pure configuration — SSO, license assignment, content exclusion policies.
Set up a basic AI usage policy (1-2 pages). Template available, zero custom development.
Start tracking: acceptance rate, usage frequency, developer satisfaction.
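
As one way to start tracking, the sketch below computes a per-developer acceptance rate from a list of suggestion events. The event format (developer name plus an accepted flag) is an assumption made for the example; in practice the numbers would come from whatever usage reporting your tool provides.

```python
from collections import defaultdict


def acceptance_rates(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Return accepted / shown suggestions per developer."""
    shown: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for developer, was_accepted in events:
        shown[developer] += 1
        if was_accepted:
            accepted[developer] += 1
    return {dev: accepted[dev] / shown[dev] for dev in shown}


# Hypothetical sample: (developer, suggestion accepted?)
sample_events = [
    ("dev_a", True), ("dev_a", False), ("dev_a", True), ("dev_a", True),
    ("dev_b", False), ("dev_b", True),
]
rates = acceptance_rates(sample_events)
print(rates)  # {'dev_a': 0.75, 'dev_b': 0.5}
```

Acceptance rate on its own says little about value delivered, so it is best read alongside usage frequency and satisfaction, as the list above suggests.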

Do next (weeks 4-12):

Add AI code review to CI/CD (CodeRabbit or similar). Configuration only: connect to repos, set rules.
Establish code review process changes for AI-generated code. Process design, not technology.
Run a pilot with one agentic tool (Cursor Agent or Claude Code) with 2-3 senior developers only.

Do later (months 3-6):

Evaluate AI gateway for cost visibility and optimization.
Expand from pilot to broader agentic tool adoption with governance.
Measure actual organizational impact (not just developer surveys).

Don’t do (or do much later):

Don’t fine-tune models on your codebase (costs $250K+, rarely justified)
Don’t deploy autonomous agents to production without 6+ months of governed pilot
Don’t let developers “vibe code” production features without review gates
Don’t buy the full Microsoft AI stack ($168-199/seat) before proving value at $19/seat

The Configuration Truth

90%+ of the value comes from buying existing tools and configuring them correctly. The novel technology (custom models, bespoke agents, AI-designed architectures) is where the hype lives but not where the ROI lives. The consulting value is in helping organizations:

Choose the right tools for their stack (configuration decision)
Configure them correctly (governance, security, policy)
Redesign workflows around AI capabilities (process design)
Measure what matters (metrics framework)
Avoid the pitfalls that the data clearly shows (guardrails)

None of this requires novel technology. It requires expertise, judgment, and the discipline to ignore the hype.

What This Means for Your Organization

The data confirms what you likely already sense: AI delivers real productivity gains for engineering teams. The question is how to capture those gains at the organizational level, not just the individual level. The answer is specific and actionable. Tier 1 tasks – autocomplete, test generation, documentation, boilerplate – produce measurable ROI across every controlled study in the literature. Getting these working well across your engineering team is the highest-return starting point, and it can be done in weeks through configuration alone (SSO, content exclusion policies, CI/CD integration). No custom development required.

The organizations pulling ahead have recognized that AI shifts the bottleneck rather than eliminating it. Faros AI telemetry (n=10,000+ developers, 1,255 teams, 2025) shows 98% more PRs merged per developer, but review time grows by 91%. The companies capturing real value are the ones that identified this new constraint and addressed it – through AI-assisted code review, smaller PR policies, and tiered review processes. This is a workflow redesign problem, not a technology problem, and it is entirely solvable.

On cost, license fees represent 10-20% of total AI deployment cost, with Year 1 TCO running roughly 2.5x the license (DX Research/Atlan, 2025). That multiplier sounds daunting, but it is also the key to building an honest business case. Organizations that model the full cost upfront — debugging, review overhead, governance, training — and prove ROI against it are the ones whose pilots survive the budget review. The real cost is not a reason to hesitate. It is a reason to plan accurately and avoid the mid-deployment surprises that derail most initiatives.

If you are building that business case and want to pressure-test the assumptions against what the data actually shows, that is a conversation worth having early in the process rather than after the pilot is underway.

Brandon Sneider | brandon@brandonsneider.com March 2026