Executive Summary
- Data preparation consumes 40–60% of total AI project cost and 60–80% of project time, according to O’Reilly analysis. Most budgets underprice this line item by an order of magnitude.
- 96% of companies begin AI projects without sufficient high-quality training data (Optimus AI, 2025), triggering unplanned remediation spend of $10K–$90K per project — typically absorbed as scope creep rather than a planned capital expense.
- Industry context matters: healthcare EHR migration (Epic, Oracle Health) runs $1M–$500M depending on system size; manufacturing ERP/MES normalization and financial services multi-core-banking integration carry their own distinct cost curves and failure rates.
- Build-vs-buy: Fivetran starts at $500/month for the first 1M monthly active rows (MAR) but escalates sharply at scale. MuleSoft and Informatica deals land at $100K–$2M ARR at enterprise scope. dbt Cloud starts at $100/developer/month on top of a free open-source core.
- 83% of data migration projects fail or exceed budget/timeline (Gartner). The predictable cause is underscoping remediation before AI workloads begin.
The Data Problem Is the AI Problem
The evidence is consistent across every credible analysis of enterprise AI ROI: model performance is downstream of data quality. O’Reilly reports data preparation consumes 60–80% of AI project time and 40–60% of project cost. Anaconda’s more recent survey of data scientists puts the time share at 45% — lower, but still the single largest line item in any serious deployment.
The framing that matters for a CFO: when a vendor quotes a $200K enterprise license for an AI platform, the honest TCO is closer to $400K–$500K once legacy data is actually ready to feed it. Most AI programs stall here, at integration rather than at inference: BCG’s AI at Work 2025 survey (n=10,635) finds that only 5% capture substantial financial gains.
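One way to sanity-check the $200K versus $400K–$500K framing is to apply the data-preparation cost shares quoted above to the license figure. The sketch below assumes, as a simplification not stated in the source, that the platform license is the entire non-data-prep spend; the function name is illustrative.

```python
def implied_total_cost(platform_spend: float, data_prep_share: float) -> float:
    """Total project cost if data prep takes `data_prep_share` of the total.

    Assumption (not from the source): platform_spend covers everything
    that is not data preparation.
    """
    if not 0 <= data_prep_share < 1:
        raise ValueError("data_prep_share must be in [0, 1)")
    return platform_spend / (1 - data_prep_share)


license_quote = 200_000  # the $200K vendor quote used in the text

# Walk the 40-60% cost-share range attributed to O'Reilly.
for share in (0.40, 0.50, 0.60):
    total = implied_total_cost(license_quote, share)
    print(f"data prep at {share:.0%} of cost: total is about ${total:,.0f}")
```

Under that assumption, the 50% and 60% shares give $400K and $500K, which lines up with the range quoted in the text; the 40% share gives roughly $333K.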
Cost Structure by Industry
The published benchmarks are uneven. Healthcare has the most transparent cost data because EHR migrations are capital projects with RFP paper trails. Financial services and manufacturing cost curves are harder to pin down because most remediation happens inside existing ERP/core-banking programs and never gets isolated as a line item.
Healthcare (Epic, Oracle Health / Cerner)
Epic commands 42.3% of the acute-care EHR market and manages over 305 million patient records (IntuitionLabs, 2025). Implementation costs range from $1M for small practices to $500M for large integrated delivery networks. Annual maintenance runs 15–20% of initial license cost.
Legacy data migration inside Epic uses Data Courier and HL7 batch loads; Oracle Health / Cerner uses HL7 ADT and FHIR where the source system supports it. Both require significant custom transformation work — there is no “lift and shift.” AI feature readiness (ambient scribe, clinical documentation automation) sits downstream of a clean migration, not alongside it.
Financial Services
Innoflexion’s 2025 Data Readiness Index places financial services organizations in the 0.65–0.74 range: strong source-system data quality, but weak cross-system coherence due to multi-core-banking and multi-ERP sprawl. A score of 0.75 or above is considered autonomous-AI-ready. A score of 0.50–0.74 supports AI with human review checkpoints. A score below 0.50 requires structured remediation before production AI.
The practical implication: most FS firms need a data fabric or data product layer before agent-based AI is safe to deploy. That layer is typically a $500K–$3M multi-quarter program before the AI tooling clock even starts.
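The Innoflexion readiness bands cited in this section can be written as a simple lookup, which is handy when scoring several business units at once. This is a minimal sketch: the function name is hypothetical, scores are assumed to lie on a 0–1 scale, and the 0.75 boundary is treated as inclusive to match the "0.75+" threshold in the data table.

```python
def readiness_band(dri_score: float) -> str:
    """Map a Data Readiness Index score to the band described in the text.

    Assumptions: scores run from 0 to 1, and anything from 0.50 up to
    (but not including) 0.75 falls in the human-review band.
    """
    if not 0.0 <= dri_score <= 1.0:
        raise ValueError("dri_score is assumed to be between 0 and 1")
    if dri_score >= 0.75:
        return "autonomous-AI-ready"
    if dri_score >= 0.50:
        return "AI with human review checkpoints"
    return "structured remediation required"


# The financial services range quoted above sits in the middle band.
for score in (0.65, 0.74, 0.75, 0.49):
    print(f"{score:.2f}: {readiness_band(score)}")
```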
Manufacturing
Manufacturing’s data-readiness problem is ERP/MES integration — OEE data in the MES, financial data in SAP or Oracle, quality data in a third system, none of them natively joined. Published TCO benchmarks are scarce because remediation tends to be bundled inside ERP modernization projects. Gartner’s 2025 data classifies 25% of ERP implementations as catastrophic failures; running parallel systems during a migration costs $50K–$200K/month in duplicated infrastructure alone.
Legal / Professional Services
The primary legacy-data problem in legal is document management system (DMS) preparation — iManage and NetDocuments as source systems, with inconsistent matter-level metadata, conflicting tagging schemas across practice groups, and privilege markers that must survive any AI ingestion. Remediation here is less about platform cost and more about matter-partner time to normalize taxonomy before ingestion. Published benchmarks are rare; real-world scope typically runs 6–12 months of part-time knowledge management work plus a $50K–$250K integration spend.
Build vs. Buy: The Integration Platform Landscape
| Platform | Entry Pricing | Typical Enterprise ARR | Best Fit |
|---|---|---|---|
| Fivetran | $500/mo for 1M MAR | $100K–$500K | High-volume SaaS source ingestion |
| dbt Cloud | $100/developer/mo | $50K–$300K | Analytics/transformation layer; pairs with Fivetran |
| MuleSoft | No public pricing | $100K–$500K | Complex API orchestration, Salesforce stack |
| Informatica | Enterprise-only | $250K–$2M | Regulated industries, governance-heavy |
Fivetran’s 2026 pricing change bills at the connector level, which means cost grows faster than data volume as source systems multiply. This is the single largest hidden cost in mid-market Fivetran deployments — budget it explicitly.
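To see why connector-level billing can raise cost as source systems multiply, consider a declining-rate schedule in which volume discounts apply per connector instead of being pooled across the account. The sketch below is illustrative only: the tier boundaries and prices are hypothetical and are not Fivetran's rate card, and the pooling mechanism is an assumption about how connector-level billing behaves.

```python
# Hypothetical declining schedule: (cumulative MAR ceiling, price per million MAR).
# These numbers are placeholders for illustration, not vendor pricing.
TIERS = [(5_000_000, 500.0), (20_000_000, 300.0), (float("inf"), 150.0)]


def tiered_cost(mar: int) -> float:
    """Monthly cost of `mar` monthly active rows under the hypothetical tiers."""
    cost, floor = 0.0, 0
    for ceiling, price_per_million in TIERS:
        if mar <= floor:
            break
        rows_in_tier = min(mar, ceiling) - floor
        cost += rows_in_tier / 1_000_000 * price_per_million
        floor = ceiling
    return cost


def account_level_bill(connector_mar: list[int]) -> float:
    """Discount tiers applied once to the pooled volume."""
    return tiered_cost(sum(connector_mar))


def connector_level_bill(connector_mar: list[int]) -> float:
    """Discount tiers restart for every connector."""
    return sum(tiered_cost(mar) for mar in connector_mar)


# Same 40M total MAR, spread over more and more source systems.
for n in (1, 4, 8):
    volumes = [40_000_000 // n] * n
    print(n, account_level_bill(volumes), connector_level_bill(volumes))
```

With these placeholder tiers the pooled bill is the same regardless of connector count, while the per-connector bill rises as the same volume is split across more sources. Swap in the contract's actual schedule before using this for budgeting.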
Where Remediation Budgets Go Wrong
Three failure modes account for most of the 83% of data migrations that Gartner reports as failing outright or exceeding budget or timeline:
- Remediation treated as an IT expense, not a capital project. Data cleanup without an executive sponsor and a dated charter drifts indefinitely. The work is tedious, thankless, and never ranked above whatever ticket came in yesterday.
- Buying the integration platform before scoping the data. Teams sign the Fivetran or Informatica contract, then discover the source systems require six months of normalization before the platform adds value. The meter runs; the value doesn’t.
- No target-state data model. Remediation with no reference architecture produces clean data in the shape of the legacy system. AI workloads then can’t use it without another remediation pass.
Key Data Points
| Metric | Value | Source | Date |
|---|---|---|---|
| Data prep share of AI project cost | 40–60% | O’Reilly | 2025 |
| Data prep share of AI project time | 60–80% | O’Reilly | 2025 |
| Data scientists’ time on data prep | 45% | Anaconda | 2024 |
| Companies starting AI without quality data | 96% | Optimus AI | 2025 |
| Unplanned remediation per project | $10K–$90K | Optimus AI | 2025 |
| Epic implementation range | $1M–$500M | DashTech | 2025 |
| Data migration project failure rate | 83% | Gartner | 2025 |
| ERP implementation catastrophic failure rate | 25% | Gartner | 2025 |
| Parallel ERP run cost during migration | $50K–$200K/mo | Software Modernization Services | 2025 |
| FS AI readiness range (DRI score) | 0.65–0.74 | Innoflexion | 2025 |
| AI-ready threshold | 0.75+ | Innoflexion | 2025 |
What This Means for Your Organization
The question a CIO should ask before approving any AI platform contract: what is our data readiness score, and what would it cost to move it above the 0.75 threshold? If the answer is “we don’t know,” the AI project is not ready to start. Starting anyway is how pilots stall at the proof-of-concept stage: McKinsey’s November 2025 survey (n=1,933) puts two-thirds of enterprises still at that stage, with only 6% capturing more than 5% EBIT impact.
The cost model to build internally: a three-line TCO that separates (1) integration platform spend, (2) remediation labor (internal and external), and (3) target-state data product investment. Most teams collapse these into one line and lose the ability to forecast when each cost comes online. Healthcare, financial services, and manufacturing all have distinct remediation curves; pretending they are the same produces bad budgets.
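The three-line model described above can be sketched as a small forecast that keeps each cost line separate and shows when it comes online. Everything numeric here is a hypothetical placeholder (start quarters, durations, and amounts), and the class and function names are illustrative, not part of any cited framework.

```python
from dataclasses import dataclass


@dataclass
class CostLine:
    name: str
    start_quarter: int       # first quarter in which spend lands (1-based)
    quarters: int            # how many quarters the spend runs
    cost_per_quarter: float  # hypothetical flat quarterly amount

    def spend_in(self, quarter: int) -> float:
        active = self.start_quarter <= quarter < self.start_quarter + self.quarters
        return self.cost_per_quarter if active else 0.0


def quarterly_forecast(lines: list[CostLine], horizon: int) -> list[dict[str, float]]:
    """One dict per quarter, keyed by cost line, so lines are never collapsed."""
    return [{line.name: line.spend_in(q) for line in lines}
            for q in range(1, horizon + 1)]


# Placeholder figures for illustration only.
tco_lines = [
    CostLine("remediation labor", start_quarter=1, quarters=3, cost_per_quarter=120_000),
    CostLine("target-state data product", start_quarter=2, quarters=3, cost_per_quarter=80_000),
    CostLine("integration platform", start_quarter=3, quarters=2, cost_per_quarter=50_000),
]

forecast = quarterly_forecast(tco_lines, horizon=4)
for q, row in enumerate(forecast, start=1):
    print(f"Q{q}: total ${sum(row.values()):,.0f}  {row}")
```

Starting the platform line later than remediation in this example mirrors the failure mode of signing the platform contract before the data is scoped; adjust the start quarters to test other sequencing choices.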
If this raised questions about how to scope remediation specific to your industry and data environment, I welcome the conversation — brandon@brandonsneider.com.
Sources
- O’Reilly analysis on AI project cost distribution, via Optimus AI, 2025. Credibility: MEDIUM (secondary citation of primary research). https://optimusai.ai/data-scientists-spend-80-time-cleaning-data/
- Pragmatic Institute, “Overcoming the 80/20 Rule in Data Science,” 2025. Credibility: MEDIUM. https://www.pragmaticinstitute.com/resources/articles/data/overcoming-the-80-20-rule-in-data-science/
- DashTech, “Epic EHR Integration vs Other Systems Comparison Guide 2025.” Credibility: MEDIUM (vendor-adjacent). https://dashtechinc.com/blog/epic-ehr-integration-vs-other-systems-comparison-guide-2025/
- EHR in Practice, “Epic EHR vs Oracle Health Comparison 2025.” Credibility: MEDIUM. https://www.ehrinpractice.com/epic-ehr-vs-cerner-ehr-comparison.html
- IntuitionLabs, “Epic vs Cerner: AI in EHRs.” Credibility: MEDIUM. https://intuitionlabs.ai/articles/epic-vs-cerner-ai-comparison
- DataFlowMapper (citing Gartner), “Data Migration Cost Analysis 2025.” Credibility: MEDIUM (Gartner secondary). https://dataflowmapper.com/blog/data-migration-costs-quantitative-analysis
- RecordPoint, “Hidden Costs of Maintaining Legacy Systems,” 2025. Credibility: MEDIUM. https://www.recordpoint.com/blog/maintaining-legacy-systems-costs
- Software Modernization Services, “ERP Modernization Cost Benchmarks,” 2025. Credibility: MEDIUM. https://softwaremodernizationservices.com/erp-modernization/
- Innoflexion, “AI Readiness Assessment: Data Readiness Index,” 2025. Credibility: MEDIUM (vendor framework). https://www.innoflexion.com/blog/ai-readiness-assessment-data-readiness-index
- Valiotti, “Fivetran Review 2026: Pricing, Features & Enterprise ETL Assessment.” Credibility: MEDIUM. https://valiotti.com/blog/fivetran-review-2025/
- Xenoss, “Data Integration Platforms Compared: Fivetran, Airbyte, DLT, dbt.” Credibility: MEDIUM. https://xenoss.io/blog/data-integration-platforms
- Integrate.io, “Fivetran vs Informatica vs Integrate.io,” 2025. Credibility: LOW–MEDIUM (competitor comparison). https://www.integrate.io/blog/fivetran-vs-informatica-vs-integrateio/
- USM Systems, “AI Software Cost: 2025 Enterprise Pricing Benchmarks for Manufacturing Leaders.” Credibility: LOW–MEDIUM (vendor-adjacent). https://usmsystems.com/ai-software-cost/
Brandon Sneider | brandon@brandonsneider.com | April 2026