Data Cleaning for AI: What It Actually Takes, Who Actually Does It, and How Long It Really Runs

The honest answer: almost nobody has finished.


Executive Summary

  • Gartner (press release Feb 2025; Q3 2024 survey, n=248 data management leaders) predicts that 60% of AI projects unsupported by AI-ready data will be abandoned through 2026. In the same survey, 63% of organizations either lack, or are unsure whether they have, the right data management practices for AI.
  • Qlik / Wakefield Research (Feb 2025, n=500 U.S. data professionals at $500M+ firms) finds 81% report significant data quality problems and 85% say leadership isn’t addressing them. 96% expect widespread crises from poor AI data quality. This is the cleanest recent picture of the gap between AI ambition and data reality.
  • Realistic enterprise data modernization runs 6–18 months end-to-end. The bottleneck is not moving the raw data — it is the consumption layer (dashboards, reports) and transformation logic (thousands of stored procedures, undocumented business rules). Mid-market Snowflake setups cost $200K–$1M+ to stand up and $150K–$400K/year to run.
  • AI-assisted tooling compresses one slice of the work dramatically: Caylent’s SQL Server-to-PostgreSQL migration for Teamfront/Arborgold converted 2,500+ stored procedures in 10 weeks vs. a 40-week manual estimate, with 70% AI-automated, 20% AI-assisted, and 10% hand-coded. The remaining 30%, the part that still requires humans, is where mid-market companies stall.
  • Anaconda’s recurring State of Data Science survey finds data scientists spend 39–45% of their time on data preparation (not the often-cited 80%). Independent of AI, 47% of newly created records contain at least one critical error (MIT Sloan) and only 3% of companies’ data meets basic quality standards (HBR). Cleaning is continuous, not a one-time project.

What “Doing It Well” Actually Looks Like

The honest answer: almost nobody has finished. The companies cited as data-ready role models are mostly large enterprises with seven-figure data platform teams (Morgan Stanley, JPMorgan, Walmart) that have been investing in data infrastructure for a decade before GenAI arrived. There is no 500-person law firm or manufacturer in the public record that ran a 90-day data-cleanup sprint and emerged AI-ready. What exists instead is a smaller pattern: companies that picked a narrow scope, funded it properly, and accepted that “clean” is a steady-state operational discipline, not a project milestone.

The independent benchmark to anchor on is Gartner’s Q3 2024 data management survey (n=248). 63% of organizations do not have or are unsure they have AI-ready data management practices. Against that baseline, the firms that have made real progress share four traits: executive sponsorship of data quality as an ongoing P&L line, a named data owner per domain (not a shared services team), ruthless scope reduction (one AI use case at a time, not a “platform”), and willingness to rewrite source system integrations rather than paper over them with transformation layers.

Qlik’s February 2025 survey of 500 U.S. professionals at $500M+ companies sharpens the picture: 81% report significant data quality problems; 85% say leadership isn’t addressing them. The same survey shows 77% of $5B+ companies expect a major crisis from poor AI data quality — meaning senior leaders now see the risk but the execution gap remains. That is the operational reality of most 200–5,000 employee American companies right now.

Realistic Timelines: What 6–18 Months Actually Buys

Enterprise data migrations consistently run 6–18 months across public case write-ups from Hakkoda, Databricks, NTT Data, and Acuvate. That time is not spread evenly across phases. Roughly:

| Phase | Typical duration (mid-market) | What slows it |
|---|---|---|
| Discovery & source audit | 4–8 weeks | Undocumented source systems; shadow spreadsheets; orphaned databases |
| Architecture decision (warehouse/lakehouse) | 2–6 weeks | Vendor selection, procurement, security review |
| Raw ingestion (bronze tier) | 4–12 weeks | Rarely the bottleneck once tooling is in place |
| Cleaning, deduplication, entity resolution (silver) | 3–9 months | Business rule definition (the binding constraint) |
| Business-ready aggregates (gold) | 2–6 months | Stakeholder alignment on definitions, metrics, hierarchies |
| Consumption-layer rebuild (dashboards, stored procs) | 3–9 months | Volume: 2,500+ stored procedures is a real mid-market number |
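
As a rough check on how the phase durations relate to the 6–18 month headline, the sketch below adds them up under one deliberately simple assumption: that each phase waits for the previous one to finish. The source does not say the phases run back-to-back, so this is an illustration of the arithmetic, not a project plan. The dictionary keys and the weeks-per-month conversion are choices made for this example.

```python
WEEKS_PER_MONTH = 52 / 12  # average month length in weeks

# (low, high) durations copied from the phase table, converted to weeks.
phases_weeks = {
    "discovery": (4, 8),
    "architecture": (2, 6),
    "bronze_ingestion": (4, 12),
    "silver_cleaning": (3 * WEEKS_PER_MONTH, 9 * WEEKS_PER_MONTH),
    "gold_aggregates": (2 * WEEKS_PER_MONTH, 6 * WEEKS_PER_MONTH),
    "consumption_rebuild": (3 * WEEKS_PER_MONTH, 9 * WEEKS_PER_MONTH),
}

# Assumption for illustration only: every phase waits for the previous one.
sequential_low = sum(low for low, _ in phases_weeks.values()) / WEEKS_PER_MONTH
sequential_high = sum(high for _, high in phases_weeks.values()) / WEEKS_PER_MONTH

# Floor if everything ran fully in parallel: the longest single phase.
longest_low = max(low for low, _ in phases_weeks.values()) / WEEKS_PER_MONTH
longest_high = max(high for _, high in phases_weeks.values()) / WEEKS_PER_MONTH

print(f"Back-to-back: {sequential_low:.1f} to {sequential_high:.1f} months")
print(f"Longest single phase: {longest_low:.1f} to {longest_high:.1f} months")
```

Run strictly in sequence, the phases total roughly 10 to 30 months, which is longer than the 6–18 month range. One reading is that the published timelines overlap phases, with the 3–9 month silver and consumption-layer phases setting the floor.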

The silver tier is where 60% of AI projects die — the same finding as Pass 88’s medallion timeline brief. Cleaning raw data is tool-assisted; resolving “is Customer #1002 the same as Customer #10020?” is not. Neither is deciding whether revenue recognition follows the contract or the invoice.
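
To make the “Customer #1002 vs. Customer #10020” question concrete, here is a minimal sketch of the tool-assisted half of entity resolution: flagging candidate duplicates for a human to review. The records, the 0.85 threshold, and the rule that differing ZIP codes block a match are all hypothetical choices for illustration. It uses only Python's standard-library `difflib`, not any particular vendor's matching engine.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records; the IDs echo the example in the text.
customers = [
    {"id": "1002", "name": "Acme Landscaping LLC", "zip": "30301"},
    {"id": "10020", "name": "ACME Landscaping, L.L.C.", "zip": "30301"},
    {"id": "2077", "name": "Birch Tree Care Inc", "zip": "98101"},
]

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a: dict, b: dict) -> float:
    """Name similarity in [0, 1]; zero when ZIP codes differ (an assumed rule)."""
    if a["zip"] != b["zip"]:
        return 0.0
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()

def candidate_pairs(records, threshold=0.85):
    """Return (id_a, id_b, score) for pairs a human should review."""
    pairs = []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

review_queue = candidate_pairs(customers)
print(review_queue)
```

The code can surface the pair. It cannot decide whether the two records are one customer, a parent and a subsidiary, or a re-keyed account. The threshold and the ZIP rule are themselves business rules that someone has to own.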

What AI-Assisted Tools Actually Compress

The Caylent / Teamfront / Arborgold case is the clearest recent datapoint. Four SQL Server clusters and 2,500+ stored procedures were migrated in 10 weeks with AI tooling, against a 40-week manual estimate. 70% of the conversion was AI-automated, 20% AI-assisted with human guidance, and 10% hand-coded for edge cases. That 40-to-10-week reduction (75% less calendar time) is the largest compression reported in a published mid-market case, though it comes from an SI case study with no independent audit. The work was done by Caylent (a systems integrator), not by internal staff, which matters for cost modeling.
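
The figures in this case mix two different measures: calendar-time reduction and the share of procedures handled by each method. The short sketch below separates them. It uses 2,500 as the procedure count, which is a lower bound since the source says “2,500+”, and the variable names are illustrative.

```python
manual_weeks = 40
ai_weeks = 10
procedures = 2500  # lower bound; the case study says "2,500+"

# Conversion mix as reported in the case study.
mix = {"ai_automated": 0.70, "ai_assisted": 0.20, "hand_coded": 0.10}

# Calendar-time view: how much shorter the project ran.
time_reduction = 1 - ai_weeks / manual_weeks
speedup = manual_weeks / ai_weeks

# Workload view: how many procedures fall into each method.
counts = {method: round(procedures * share) for method, share in mix.items()}
human_touched = counts["ai_assisted"] + counts["hand_coded"]

print(f"Calendar time reduced by {time_reduction:.0%} ({speedup:.0f}x faster)")
print(f"Procedures by method: {counts}")
print(f"Procedures still needing a person: {human_touched}")
```

Even at a 70% automation share, at least 750 procedures still needed human attention. That is the part of the workload an internal team or SI has to staff for.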

Databricks partner Trellis IQ reports clearing a seven-year data harmonisation backlog in seven days for a global CPG manufacturer; SevDesk migrated 600+ dbt models from Redshift to Snowflake in “a few weeks,” saving ~2,500 hours. These are vendor-published with no independent verification — treat as directional. The consistent pattern: AI-assisted code translation and schema mapping compresses one slice (transformation logic) by 70–90%; entity resolution, business rule definition, and source documentation compress by 20–40% at most.

Anaconda’s State of Data Science surveys (recurring, most recent wave 2024) consistently find data scientists spend 39–45% of their time on data preparation — not the “80%” figure circulating for a decade. The 80% figure comes from older CrowdFlower/Figure Eight surveys that included data collection and labeling. For mid-market planning, assume 40% of data team capacity goes to preparation on an ongoing basis even after the “big cleanup” is complete. This is the steady-state operating cost most planning decks omit.

Who Actually Does the Work

The pattern across Teamfront (Caylent), the CPG manufacturer (Databricks/Trellis IQ), and the broader modernization literature is consistent: AI-era data remediation at mid-market scale is almost always done by a systems integrator with AI tooling, not by internal staff. Internal teams own domain knowledge (what the data means) and governance decisions. The SI owns the mechanical work (extraction, migration, code translation). A realistic mid-market staffing model:

  • Internal: 1 data owner per domain (part-time), 1 data engineer, 1 analytics lead, plus executive sponsor for scope and governance decisions.
  • External: SI team of 3–8, usually 6–12 months of engagement.
  • Cost envelope: $300K–$1.5M for a focused mid-market initiative (one domain, one AI use case), on top of platform run costs ($150K–$400K/year for Snowflake/Databricks at this scale).
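
The cost envelope above can be turned into a simple multi-year range. This sketch assumes the SI engagement is a one-time cost and that platform run costs stay flat each year. Both are simplifications, and internal staff time is not included. The function name and parameters are hypothetical.

```python
def cost_range(si_low, si_high, run_low, run_high, years):
    """Return (low, high) total cost: one-time SI fee plus flat yearly run cost."""
    return (si_low + run_low * years, si_high + run_high * years)

# Figures from the staffing model: $300K–$1.5M SI, $150K–$400K/year platform.
first_year = cost_range(300_000, 1_500_000, 150_000, 400_000, years=1)
three_year = cost_range(300_000, 1_500_000, 150_000, 400_000, years=3)

print(f"Year 1:  ${first_year[0]:,} to ${first_year[1]:,}")
print(f"3 years: ${three_year[0]:,} to ${three_year[1]:,}")
```

On these assumptions, a single-domain initiative lands between $450K and $1.9M in the first year, and between $750K and $2.7M over three years.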

What makes this category hard to benchmark honestly: the vendors and SIs who write the case studies have every incentive to publish the fastest examples. The failures are private. Qlik’s finding that 47% of executives worry their company overinvested in AI is the other side of that same ledger.

Key Data Points

| Finding | Source | Date | Credibility |
|---|---|---|---|
| 60% of AI projects without AI-ready data will be abandoned through 2026 | Gartner press release, n=248 data leaders | Feb 26, 2025 | HIGH: independent analyst, named methodology |
| 63% of orgs lack/unsure of AI-ready data management practices | Gartner Q3 2024 survey | Sep 2024 | HIGH |
| 81% report significant data quality problems; 85% say leadership isn’t addressing | Qlik / Wakefield Research, n=500 U.S. data pros, $500M+ firms | Feb 4–18, 2025 | MEDIUM-HIGH: independent field research, Qlik-commissioned |
| 47% of companies worry they overinvested in AI | Qlik / Wakefield | Feb 2025 | MEDIUM-HIGH |
| Enterprise migrations: 6–18 months typical | Hakkoda/NTT Data/Databricks aggregation | 2024–2025 | MEDIUM: SI-published |
| 2,500+ stored procedures, 40 weeks → 10 weeks with AI | Caylent / Teamfront case study | 2025 | MEDIUM: SI case study, no independent audit |
| 70% AI-automated / 20% AI-assisted / 10% hand-coded conversion mix | Caylent / Teamfront | 2025 | MEDIUM |
| Data scientists spend 39–45% of time on preparation | Anaconda State of Data Science | Recurring through 2024 | HIGH: independent repeat survey |
| Poor data quality costs orgs $12.9M/year avg | Gartner (ongoing citation) | 2020–2024 | MEDIUM (cited widely, original methodology opaque) |
| 47% of newly created records contain at least one critical error | MIT Sloan | Published 2017–2020, still cited | MEDIUM: predates current tools, trend only |
| Only 3% of companies’ data meets basic quality standards | Harvard Business Review | 2017 | TIER 4: historical context only |
| 15–25% revenue loss annually from poor data quality | MIT Sloan with Cork Univ. Business School | 2022 | MEDIUM |
| IDC/NetApp “Scaling Enterprise AI Responsibly,” n=1,200+ global decision makers | IDC | Oct 2025 | HIGH: independent, large-n |
| 42% of companies abandoned at least one AI initiative in 2025; avg $7.2M sunk cost | Aggregator stat (multiple sources) | 2025 | LOW-MEDIUM: origin chain unclear |

Note: the viral near-total-failure headline figure from 2025 is excluded because its attribution appears to be fabricated. The 60% abandonment prediction from Gartner is the better-sourced and more defensible number.

What This Means for Your Organization

Three practical calibrations. First, if your board is budgeting a “data readiness project” as a one-time capital expense, reframe it. Data readiness for AI is an operating discipline with a multi-year horizon. The companies that capture AI value are the ones that funded data quality as a permanent line, not as a sprint. Plan for 6–18 months on the first domain and a named data owner who stays in the role.

Second, the work decomposes into two buckets with very different compression ratios. Code translation, schema mapping, and bronze-tier ingestion are compressed 70–90% by AI tooling. Entity resolution, business rule definition, and source documentation compress 20–40% at best. Cleaning is the easy part; deciding what “customer” means across your CRM, ERP, and billing system is not. Budget accordingly: the slowest slice is governance, not engineering.
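
A small worked example shows why the slower bucket dominates once AI tooling is applied. The 40/60 split of baseline effort between mechanical and governance work is a hypothetical assumption, not a figure from any of the cited cases. The compression values are the midpoints of the ranges above (80% and 30%).

```python
def blended(effort_share, compression):
    """Return overall effort reduction and each bucket's share of what remains."""
    remaining = {k: effort_share[k] * (1 - compression[k]) for k in effort_share}
    total_remaining = sum(remaining.values())
    shares = {k: v / total_remaining for k, v in remaining.items()}
    return 1 - total_remaining, shares

# Hypothetical split of baseline effort (an assumption for illustration).
effort_share = {"mechanical": 0.40, "governance": 0.60}
# Midpoints of the 70–90% and 20–40% compression ranges.
compression = {"mechanical": 0.80, "governance": 0.30}

overall_reduction, remaining_share = blended(effort_share, compression)

print(f"Overall effort reduction: {overall_reduction:.0%}")
print(f"Governance share of remaining work: {remaining_share['governance']:.0%}")
```

Under these assumptions, total effort falls by about half, and governance work goes from 60% of the project to roughly 84% of what is left. Changing the split changes the numbers, but the direction holds whenever the two buckets compress at such different rates.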

Third, the SI-with-AI-tooling model is the dominant execution pattern at mid-market scale, and the cost envelope is $300K–$1.5M for a focused, single-domain initiative with one AI use case tied to it. If a vendor or SI pitches a platform-first rather than use-case-first project, that is usually a sign the scope will expand and the timeline will slip. If you are deciding between an in-house build, a platform-first engagement, and a use-case-scoped SI engagement — or trying to estimate what your organization specifically should spend before AI value is realistic — I’d welcome the conversation: brandon@brandonsneider.com.

Brandon Sneider | brandon@brandonsneider.com April 2026