Executive Summary
- Gartner (Feb 2025, n=248 data management leaders) predicts 60% of AI projects unsupported by AI-ready data will be abandoned through 2026. 63% of organizations either lack or are unsure they have the right data management practices for AI.
- Qlik / Wakefield Research (Feb 2025, n=500 U.S. data professionals at $500M+ firms) finds 81% report significant data quality problems and 85% say leadership isn’t addressing them. 96% expect widespread crises from poor AI data quality. This is the cleanest recent picture of the gap between AI ambition and data reality.
- Realistic enterprise data modernization runs 6–18 months end-to-end. The bottleneck is not moving the raw data — it is the consumption layer (dashboards, reports) and transformation logic (thousands of stored procedures, undocumented business rules). Mid-market Snowflake setups cost $200K–$1M+ to stand up and $150K–$400K/year to run.
- AI-assisted tooling compresses one slice of the work dramatically: Caylent’s SQL Server-to-PostgreSQL migration for Teamfront/Arborgold converted 2,500+ stored procedures in 10 weeks vs. a 40-week manual estimate, with 70% AI-automated, 20% AI-assisted, and 10% hand-coded. The remaining 30%, the part that still requires humans, is where mid-market companies stall.
- Anaconda’s recurring State of Data Science survey finds data scientists spend 39–45% of their time on data preparation (not the often-cited 80%). Independent of AI, 47% of newly created records contain at least one critical error (MIT Sloan) and only 3% of companies’ data meets basic quality standards (HBR). Cleaning is continuous, not a one-time project.
What “Doing It Well” Actually Looks Like
The honest answer: almost nobody has finished. The companies cited as data-ready role models are mostly large enterprises (Morgan Stanley, JPMorgan, Walmart) whose data platform teams run on seven-figure budgets and that had been investing in data infrastructure for a decade before GenAI arrived. There is no 500-person law firm or manufacturer in the public record that ran a 90-day data-cleanup sprint and emerged AI-ready. What exists instead is a smaller pattern: companies that picked a narrow scope, funded it properly, and accepted that “clean” is a steady-state operational discipline, not a project milestone.
The independent benchmark to anchor on is Gartner’s Q3 2024 data management survey (n=248). 63% of organizations do not have or are unsure they have AI-ready data management practices. Against that baseline, the firms that have made real progress share four traits: executive sponsorship of data quality as an ongoing P&L line, a named data owner per domain (not a shared services team), ruthless scope reduction (one AI use case at a time, not a “platform”), and willingness to rewrite source system integrations rather than paper over them with transformation layers.
Qlik’s February 2025 survey of 500 U.S. professionals at $500M+ companies sharpens the picture: 81% report significant data quality problems; 85% say leadership isn’t addressing them. The same survey shows 77% of $5B+ companies expect a major crisis from poor AI data quality — meaning senior leaders now see the risk but the execution gap remains. That is the operational reality of most 200–5,000 employee American companies right now.
Realistic Timelines: What 6–18 Months Actually Buys
Enterprise data migrations consistently run 6–18 months across public case write-ups from Hakkoda, Databricks, NTT Data, and Acuvate. The time is not spread evenly across phases. Roughly:
| Phase | Typical duration (mid-market) | What slows it |
|---|---|---|
| Discovery & source audit | 4–8 weeks | Undocumented source systems; shadow spreadsheets; orphaned databases |
| Architecture decision (warehouse/lakehouse) | 2–6 weeks | Vendor selection, procurement, security review |
| Raw ingestion (bronze tier) | 4–12 weeks | Rarely the bottleneck once tooling is in place |
| Cleaning, deduplication, entity resolution (silver) | 3–9 months | Business rule definition — the binding constraint |
| Business-ready aggregates (gold) | 2–6 months | Stakeholder alignment on definitions, metrics, hierarchies |
| Consumption-layer rebuild (dashboards, stored procs) | 3–9 months | Volume: 2,500+ stored procedures is a real mid-market number |
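As a rough check on the table above, the sketch below sums the phase ranges as if every phase ran strictly back to back. The zero-overlap assumption and the week-to-month conversion (52/12 weeks per month) are illustrative choices, not figures from the cited case write-ups.

```python
# Phase duration ranges from the table above, expressed in weeks.
# Assumption (not from the source): 1 month = 52/12 weeks, and phases never overlap.
WEEKS_PER_MONTH = 52 / 12

phases = {
    "Discovery & source audit": (4, 8),
    "Architecture decision": (2, 6),
    "Raw ingestion (bronze)": (4, 12),
    "Cleaning / entity resolution (silver)": (3 * WEEKS_PER_MONTH, 9 * WEEKS_PER_MONTH),
    "Business-ready aggregates (gold)": (2 * WEEKS_PER_MONTH, 6 * WEEKS_PER_MONTH),
    "Consumption-layer rebuild": (3 * WEEKS_PER_MONTH, 9 * WEEKS_PER_MONTH),
}


def sequential_total_months(phase_ranges):
    """Return (low, high) total duration in months if phases run back to back."""
    low_weeks = sum(lo for lo, _ in phase_ranges.values())
    high_weeks = sum(hi for _, hi in phase_ranges.values())
    return low_weeks / WEEKS_PER_MONTH, high_weeks / WEEKS_PER_MONTH


low, high = sequential_total_months(phases)
print(f"Back-to-back total: {low:.1f} to {high:.1f} months")
```

Under these assumptions the back-to-back total comes out at roughly 10 to 30 months. Read against the 6–18 month range, that suggests phases overlap in practice or that few projects hit the top of every range at once.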
The silver tier is where 60% of AI projects die, the same finding as the companion medallion timeline brief. Cleaning raw data is tool-assisted; resolving “is Customer #1002 the same as Customer #10020?” is not. Neither is deciding whether revenue recognition follows the contract or the invoice.
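To illustrate why entity resolution resists automation, here is a minimal sketch using only Python's standard library. The records, the legal-suffix list, and both thresholds are hypothetical; a real project would set them with the domain's data owner. The point is the middle band: pairs that are similar but not identical still land in a human review queue.

```python
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip common legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)


def match_score(a: dict, b: dict) -> float:
    """Similarity (0.0 to 1.0) of two customer records, based on name only."""
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def classify(a: dict, b: dict, auto_merge: float = 0.95, review: float = 0.75) -> str:
    """Thresholds are illustrative assumptions, not recommended values."""
    score = match_score(a, b)
    if score >= auto_merge and a["postal_code"] == b["postal_code"]:
        return "merge"
    if score >= review:
        return "human_review"
    return "distinct"


# Hypothetical records from three source systems.
crm = {"id": "1002", "name": "Acme Landscaping, Inc.", "postal_code": "44114"}
erp = {"id": "10020", "name": "ACME Landscaping", "postal_code": "44114"}
billing = {"id": "7731", "name": "Acme Landscaping Supply", "postal_code": "44114"}

print(classify(crm, erp))      # merge
print(classify(crm, billing))  # human_review
```

The tooling can score the pairs, but whether “Acme Landscaping Supply” is a subsidiary, a separate customer, or a duplicate is a business decision that someone with domain knowledge has to make.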
What AI-Assisted Tools Actually Compress
The Caylent / Teamfront / Arborgold case is the clearest recent datapoint. Four SQL Server clusters, 2,500+ stored procedures, 40-week manual estimate compressed to 10 weeks with AI tooling. 70% was AI-automated, 20% AI-assisted with human guidance, 10% hand-coded for edge cases. That 70% is the largest verified compression ratio in a published mid-market case. The work was done by Caylent (a systems integrator), not by internal staff — which matters for cost modeling.
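The arithmetic behind those figures can be sketched as follows. Two assumptions are mine, not the case study's: the "2,500+" count is treated as exactly 2,500, and the 70/20/10 mix is applied to procedure counts rather than to effort hours.

```python
# Figures quoted from the Caylent / Teamfront case; see assumptions in the text above.
TOTAL_PROCEDURES = 2500  # "2,500+" in the source, used here as a floor
MIX = {"ai_automated": 0.70, "ai_assisted": 0.20, "hand_coded": 0.10}
MANUAL_WEEKS = 40
ACTUAL_WEEKS = 10

# Approximate number of procedures in each bucket.
counts = {bucket: round(TOTAL_PROCEDURES * share) for bucket, share in MIX.items()}

# Procedures that still needed a person in the loop.
human_touched = counts["ai_assisted"] + counts["hand_coded"]

# Schedule compression implied by 40 weeks -> 10 weeks.
schedule_reduction = 1 - ACTUAL_WEEKS / MANUAL_WEEKS
speedup = MANUAL_WEEKS / ACTUAL_WEEKS

print(counts)
print(f"Human-touched procedures: {human_touched}")
print(f"Schedule reduction: {schedule_reduction:.0%} ({speedup:.0f}x faster)")
```

On these assumptions, roughly 750 procedures still needed human attention, and the schedule shrank by 75% (a 4x speedup). The 70% automation share and the 75% schedule reduction are different measures and should not be quoted interchangeably.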
Databricks partner Trellis IQ reports clearing a seven-year data harmonization backlog in seven days for a global CPG manufacturer; SevDesk migrated 600+ dbt models from Redshift to Snowflake in “a few weeks,” saving ~2,500 hours. These are vendor-published with no independent verification; treat them as directional. The consistent pattern: AI-assisted code translation and schema mapping compress one slice (transformation logic) by 70–90%; entity resolution, business rule definition, and source documentation compress by 20–40% at most.
Anaconda’s State of Data Science surveys (recurring, most recent wave 2024) consistently find data scientists spend 39–45% of their time on data preparation — not the “80%” figure circulating for a decade. The 80% figure comes from older CrowdFlower/Figure Eight surveys that included data collection and labeling. For mid-market planning, assume 40% of data team capacity goes to preparation on an ongoing basis even after the “big cleanup” is complete. This is the steady-state operating cost most planning decks omit.
Who Actually Does the Work
The pattern across Teamfront (Caylent), the CPG manufacturer (Databricks/Trellis IQ), and the broader modernization literature is consistent: AI-era data remediation at mid-market scale is almost always done by a systems integrator with AI tooling, not by internal staff. Internal teams own domain knowledge (what the data means) and governance decisions. The SI owns the mechanical work (extraction, migration, code translation). A realistic mid-market staffing model:
- Internal: 1 data owner per domain (part-time), 1 data engineer, 1 analytics lead, plus executive sponsor for scope and governance decisions.
- External: SI team of 3–8, usually 6–12 months of engagement.
- Cost envelope: $300K–$1.5M for a focused mid-market initiative (one domain, one AI use case), on top of platform run costs ($150K–$400K/year for Snowflake/Databricks at this scale).
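For budgeting, the two ranges in the cost envelope can be combined as in the sketch below. It assumes the initiative cost is a one-time spend and platform run cost accrues linearly per year; internal staff time is not included, since the source gives no figure for it.

```python
# Ranges quoted in the staffing model above, in USD.
INITIATIVE_COST = (300_000, 1_500_000)      # focused, single-domain initiative
PLATFORM_RUN_PER_YEAR = (150_000, 400_000)  # Snowflake/Databricks at this scale


def total_cost(years: float) -> tuple[float, float]:
    """Return (low, high) cumulative spend after `years`.

    Assumes a one-time initiative cost plus linear platform run cost.
    Internal staff cost is excluded (not quantified in the source).
    """
    low = INITIATIVE_COST[0] + PLATFORM_RUN_PER_YEAR[0] * years
    high = INITIATIVE_COST[1] + PLATFORM_RUN_PER_YEAR[1] * years
    return low, high


for years in (1, 2, 3):
    low, high = total_cost(years)
    print(f"Year {years}: ${low:,.0f} to ${high:,.0f}")
```

On these assumptions, first-year external spend lands between $450K and $1.9M, and the platform run cost keeps accruing after the SI engagement ends.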
What makes this category hard to benchmark honestly: the vendors and SIs who write the case studies have every incentive to publish the fastest examples. The failures are private. Qlik’s finding that 47% of executives worry their company overinvested in AI is the other side of that same ledger.
Key Data Points
| Finding | Source | Date | Credibility |
|---|---|---|---|
| 60% of AI projects without AI-ready data will be abandoned through 2026 | Gartner press release, n=248 data leaders | Feb 26, 2025 | HIGH — independent analyst, named methodology |
| 63% of orgs lack/unsure of AI-ready data management practices | Gartner Q3 2024 survey | Sep 2024 | HIGH |
| 81% report significant data quality problems; 85% say leadership isn’t addressing | Qlik / Wakefield Research, n=500 U.S. data pros, $500M+ firms | Feb 4–18, 2025 | MEDIUM-HIGH — independent field research, Qlik-commissioned |
| 47% of companies worry they overinvested in AI | Qlik / Wakefield | Feb 2025 | MEDIUM-HIGH |
| Enterprise migrations: 6–18 months typical | Hakkoda/NTT Data/Databricks aggregation | 2024–2025 | MEDIUM — SI-published |
| 2,500+ stored procedures, 40 weeks → 10 weeks with AI | Caylent / Teamfront case study | 2025 | MEDIUM — SI case study, no independent audit |
| 70% AI-automated / 20% AI-assisted / 10% hand-coded conversion mix | Caylent / Teamfront | 2025 | MEDIUM |
| Data scientists spend 39–45% of time on preparation | Anaconda State of Data Science | Recurring through 2024 | HIGH — independent repeat survey |
| Poor data quality costs orgs $12.9M/year avg | Gartner (ongoing citation) | 2020–2024 | MEDIUM (cited widely, original methodology opaque) |
| 47% of newly created records contain at least one critical error | MIT Sloan | Published 2017–2020, still cited | MEDIUM — predates current tools, trend only |
| Only 3% of companies’ data meets basic quality standards | Harvard Business Review | 2017 | LOW (historical context only) |
| 15–25% revenue loss annually from poor data quality | MIT Sloan with Cork Univ. Business School | 2022 | MEDIUM |
| IDC/NetApp “Scaling Enterprise AI Responsibly,” n=1,200+ global decision makers | IDC | Oct 2025 | HIGH — independent, large-n |
| 42% of companies abandoned at least one AI initiative in 2025; avg $7.2M sunk cost | Aggregator stat (multiple sources) | 2025 | LOW-MEDIUM — origin chain unclear |
Note: the viral near-total-failure headline figure from 2025 is excluded because its attribution is fabricated. The 60% abandonment prediction from Gartner is the better-sourced and more defensible number.
What This Means for Your Organization
Three practical calibrations. First, if your board is budgeting a “data readiness project” as a one-time capital expense, reframe it. Data readiness for AI is an operating discipline with a multi-year horizon. The companies that capture AI value are the ones that funded data quality as a permanent line, not as a sprint. Plan for 6–18 months on the first domain and a named data owner who stays in the role.
Second, the work decomposes into two buckets with very different compression ratios. Code translation, schema mapping, and bronze-tier ingestion are compressed 70–90% by AI tooling. Entity resolution, business rule definition, and source documentation compress 20–40% at best. Cleaning is the easy part; deciding what “customer” means across your CRM, ERP, and billing system is not. Budget accordingly: the slowest slice is governance, not engineering.
Third, the SI-with-AI-tooling model is the dominant execution pattern at mid-market scale, and the cost envelope is $300K–$1.5M for a focused, single-domain initiative with one AI use case tied to it. If a vendor or SI pitches a platform-first rather than use-case-first project, that is usually a sign the scope will expand and the timeline will slip. If you are deciding between an in-house build, a platform-first engagement, and a use-case-scoped SI engagement — or trying to estimate what your organization specifically should spend before AI value is realistic — I’d welcome the conversation: brandon@brandonsneider.com.
Sources
- Gartner, “Lack of AI-Ready Data Puts AI Projects at Risk” (press release, Feb 26, 2025). Q3 2024 survey of 248 data management leaders. HIGH credibility. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
- Qlik / Wakefield Research, “Data Quality Is Not Being Prioritized on AI Projects” (Feb 2025). n=500 U.S. data and analytics professionals at $500M+ firms. Field dates Feb 4–18, 2025. MEDIUM-HIGH credibility (Qlik-commissioned, independent fielder). https://www.qlik.com/us/news/company/press-room/press-releases/data-quality-is-not-being-prioritized-on-ai-projects
- IDC / NetApp, “Scaling Enterprise AI Responsibly: The Critical Role of Data Readiness and an Intelligent Data Infrastructure” (Oct 2025). n=1,200+ global decision makers, two waves Jan 2024 and Jun 2025. HIGH credibility (large-n, independent fielder; vendor-published, treat headline framings with caution). https://www.netapp.com/media/142474-idc-2025-ai-maturity-findings.pdf
- Anaconda, “State of Data Science” (recurring annual survey, most recent cited wave 2024). HIGH credibility — independent repeat survey. https://www.anaconda.com/resources/whitepapers
- Caylent, “How AI Is Revolutionizing Database Migration” (2025). Teamfront/Arborgold case. MEDIUM credibility — systems integrator case study, no independent audit. https://caylent.com/blog/how-ai-is-revolutionizing-database-migration-from-year-long-projects-to-quarterly-wins
- Informatica, “The Surprising Reason Most AI Projects Fail” (2025). Vendor marketing; useful for aggregated stats with original citations. LOW-MEDIUM credibility. https://www.informatica.com/blogs/the-surprising-reason-most-ai-projects-fail-and-how-to-avoid-it-at-your-enterprise.html
- Integrate.io, “Data Quality Improvement Stats from ETL” (updated 2026). Aggregation of Gartner, IDC, MIT Sloan, HBR source stats. MEDIUM — derivative but well-sourced. https://www.integrate.io/blog/data-quality-improvement-stats-from-etl/
- MIT Sloan Management Review with Cork University Business School. Data quality revenue-loss research (2017–2022). MEDIUM — original methodology not fully public.
- Companion research in this corpus: research/07-adoption-challenges/bronze-silver-gold-data-tiering-timelines.md (medallion architecture), and research/01-ai-native-landscape/real-roi-by-function.md.
Brandon Sneider | brandon@brandonsneider.com | April 2026