Data Cleaning for AI: What It Actually Takes, Who Actually Does It, and How Long It Really Runs

Brandon Sneider · April 2026

The honest answer: almost nobody has finished.

Executive Summary

Gartner (press release Feb 2025; Q3 2024 survey, n=248 data management leaders) predicts that through 2026, 60% of AI projects unsupported by AI-ready data will be abandoned. 63% of organizations either lack the right data management practices for AI or are unsure whether they have them.
Qlik / Wakefield Research (Feb 2025, n=500 U.S. data professionals at $500M+ firms) finds 81% report significant data quality problems and 85% say leadership isn’t addressing them. 96% expect widespread crises from poor AI data quality. This is the cleanest recent picture of the gap between AI ambition and data reality.
Realistic enterprise data modernization runs 6–18 months end-to-end. The bottleneck is not moving the raw data — it is the consumption layer (dashboards, reports) and transformation logic (thousands of stored procedures, undocumented business rules). Mid-market Snowflake setups cost $200K–$1M+ to stand up and $150K–$400K/year to run.
AI-assisted tooling compresses one slice of the work dramatically: Caylent’s SQL-to-PostgreSQL migration for Teamfront/Arborgold ran 2,500+ stored procedures in 10 weeks vs. a 40-week manual estimate, with 70% AI-automated, 20% AI-assisted, 10% hand-coded. The remaining 30% — the part that still requires humans — is where mid-market companies stall.
Anaconda’s recurring State of Data Science survey finds data scientists spend 39–45% of their time on data preparation (not the often-cited 80%). Independent of AI, 47% of newly created records contain at least one critical error (MIT Sloan) and only 3% of companies’ data meets basic quality standards (HBR). Cleaning is continuous, not a one-time project.

What “Doing It Well” Actually Looks Like

The honest answer: almost nobody has finished. The companies cited as data-ready role models are mostly large enterprises with seven-figure data platform teams (Morgan Stanley, JPMorgan, Walmart) that have been investing in data infrastructure for a decade before GenAI arrived. There is no 500-person law firm or manufacturer in the public record that ran a 90-day data-cleanup sprint and emerged AI-ready. What exists instead is a smaller pattern: companies that picked a narrow scope, funded it properly, and accepted that “clean” is a steady-state operational discipline, not a project milestone.

The independent benchmark to anchor on is Gartner’s Q3 2024 data management survey (n=248). 63% of organizations do not have or are unsure they have AI-ready data management practices. Against that baseline, the firms that have made real progress share four traits: executive sponsorship of data quality as an ongoing P&L line, a named data owner per domain (not a shared services team), ruthless scope reduction (one AI use case at a time, not a “platform”), and willingness to rewrite source system integrations rather than paper over them with transformation layers.

Qlik’s February 2025 survey of 500 U.S. professionals at $500M+ companies sharpens the picture: 81% report significant data quality problems; 85% say leadership isn’t addressing them. The same survey shows 77% of $5B+ companies expect a major crisis from poor AI data quality — meaning senior leaders now see the risk but the execution gap remains. That is the operational reality of most 200–5,000 employee American companies right now.

Realistic Timelines: What 6–18 Months Actually Buys

Enterprise data migrations consistently run 6–18 months across public case write-ups from Hakkoda, Databricks, NTT Data, and Acuvate. The time is not spread evenly across phases. Roughly:

Phase	Typical duration (mid-market)	What slows it
Discovery & source audit	4–8 weeks	Undocumented source systems; shadow spreadsheets; orphaned databases
Architecture decision (warehouse/lakehouse)	2–6 weeks	Vendor selection, procurement, security review
Raw ingestion (bronze tier)	4–12 weeks	Rarely the bottleneck once tooling is in place
Cleaning, deduplication, entity resolution (silver)	3–9 months	Business rule definition — the binding constraint
Business-ready aggregates (gold)	2–6 months	Stakeholder alignment on definitions, metrics, hierarchies
Consumption-layer rebuild (dashboards, stored procs)	3–9 months	Volume: 2,500+ stored procedures is a real mid-market number

The silver tier is where 60% of AI projects die, the same finding as the companion medallion-architecture timeline brief listed under Sources. Cleaning raw data is tool-assisted; resolving “is Customer #1002 the same as Customer #10020?” is not. Neither is deciding whether revenue recognition follows the contract or the invoice.
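A minimal Python sketch makes the entity-resolution point concrete. The record fields, sample values, and the 0.85 threshold are all hypothetical and are not drawn from any case cited here. The sketch shows that near-identical IDs say nothing about identity: the match has to run on attributes, and which attributes count, and how close is close enough, are business decisions rather than engineering ones.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0 after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_customer(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Hypothetical rule: same postal code and a close name match.

    The threshold and the choice of fields are assumptions a data owner
    would have to set; they cannot be derived from the data itself.
    """
    same_postcode = rec_a["postcode"] == rec_b["postcode"]
    name_score = similarity(rec_a["name"], rec_b["name"])
    return same_postcode and name_score >= threshold


# Hypothetical records: the first two IDs differ only by a trailing digit.
a = {"id": "1002", "name": "Acme Landscaping, Inc.", "postcode": "30301"}
b = {"id": "10020", "name": "ACME Landscaping Inc", "postcode": "30301"}
c = {"id": "10021", "name": "Apex Lawn Care", "postcode": "30301"}

print(likely_same_customer(a, b))  # names match after normalization
print(likely_same_customer(a, c))  # similar ID, different business
```

The code is the easy part. Deciding that postal code plus name is sufficient, or that a parent company and its subsidiary count as one customer, is the rule definition work that tooling does not remove.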

What AI-Assisted Tools Actually Compress

The Caylent / Teamfront / Arborgold case is the clearest recent datapoint. Four SQL Server clusters and 2,500+ stored procedures were migrated in 10 weeks with AI tooling against a 40-week manual estimate. 70% of the conversion was AI-automated, 20% AI-assisted with human guidance, and 10% hand-coded for edge cases. That 70% automated share is the largest reported in a published mid-market case, though it comes from Caylent’s own case study with no independent audit. The work was done by Caylent (a systems integrator), not by internal staff, which matters for cost modeling.
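The case reports two different kinds of number: a schedule (40 weeks to 10) and a conversion mix (70/20/10). A short Python sketch keeps them apart. It assumes the mix percentages apply to the count of stored procedures, which the case study does not state explicitly, and it treats "2,500+" as a floor.

```python
# Figures as reported in the Caylent / Teamfront case.
PROCEDURES = 2500          # "2,500+" treated as a floor
MANUAL_WEEKS = 40
ACTUAL_WEEKS = 10
MIX_PERCENT = {"ai_automated": 70, "ai_assisted": 20, "hand_coded": 10}


def schedule_reduction(manual_weeks: int, actual_weeks: int) -> float:
    """Fraction of the manual schedule that was eliminated."""
    return (manual_weeks - actual_weeks) / manual_weeks


def procedures_by_mode(total: int, mix_percent: dict) -> dict:
    """Split a procedure count by conversion mode using integer percentages.

    Assumption: the published mix is by procedure count, not by effort.
    """
    return {mode: total * pct // 100 for mode, pct in mix_percent.items()}


reduction = schedule_reduction(MANUAL_WEEKS, ACTUAL_WEEKS)
speedup = MANUAL_WEEKS / ACTUAL_WEEKS
counts = procedures_by_mode(PROCEDURES, MIX_PERCENT)
human_touched = counts["ai_assisted"] + counts["hand_coded"]

print(f"Schedule reduction: {reduction:.0%}")
print(f"Speedup: {speedup:.0f}x")
print(counts)
print(f"Procedures needing human work: at least {human_touched}")
```

Under that assumption, the 30% that still needs people is at least 750 stored procedures, a substantial body of work for a mid-market team even after the automated share is done.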

Databricks partner Trellis IQ reports clearing a seven-year data harmonisation backlog in seven days for a global CPG manufacturer; SevDesk migrated 600+ dbt models from Redshift to Snowflake in “a few weeks,” saving ~2,500 hours. These are vendor-published with no independent verification — treat as directional. The consistent pattern: AI-assisted code translation and schema mapping compresses one slice (transformation logic) by 70–90%; entity resolution, business rule definition, and source documentation compress by 20–40% at most.

Anaconda’s State of Data Science surveys (recurring, most recent wave 2024) consistently find data scientists spend 39–45% of their time on data preparation — not the “80%” figure circulating for a decade. The 80% figure comes from older CrowdFlower/Figure Eight surveys that included data collection and labeling. For mid-market planning, assume 40% of data team capacity goes to preparation on an ongoing basis even after the “big cleanup” is complete. This is the steady-state operating cost most planning decks omit.

Who Actually Does the Work

The pattern across Teamfront (Caylent), the CPG manufacturer (Databricks/Trellis IQ), and the broader modernization literature is consistent: AI-era data remediation at mid-market scale is almost always done by a systems integrator with AI tooling, not by internal staff. Internal teams own domain knowledge (what the data means) and governance decisions. The SI owns the mechanical work (extraction, migration, code translation). A realistic mid-market staffing model:

Internal: 1 data owner per domain (part-time), 1 data engineer, 1 analytics lead, plus executive sponsor for scope and governance decisions.
External: SI team of 3–8, usually 6–12 months of engagement.
Cost envelope: $300K–$1.5M for a focused mid-market initiative (one domain, one AI use case), on top of platform run costs ($150K–$400K/year for Snowflake/Databricks at this scale).
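The two cost ranges above can be combined into a rough cumulative figure. The Python sketch below is a simplification under stated assumptions: the engagement is paid once, platform run cost is flat year over year, and internal staff time is left out. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsdRange:
    low: int
    high: int

    def plus(self, other: "UsdRange") -> "UsdRange":
        return UsdRange(self.low + other.low, self.high + other.high)

    def times(self, factor: int) -> "UsdRange":
        return UsdRange(self.low * factor, self.high * factor)


# Ranges quoted in the staffing model above.
ENGAGEMENT = UsdRange(300_000, 1_500_000)        # focused single-domain initiative
PLATFORM_PER_YEAR = UsdRange(150_000, 400_000)   # Snowflake/Databricks run cost


def cumulative_spend(years: int) -> UsdRange:
    """One-time engagement plus flat platform run cost for `years` years.

    Assumption: no price changes, no second domain, no internal salaries.
    """
    return ENGAGEMENT.plus(PLATFORM_PER_YEAR.times(years))


for years in (1, 3):
    total = cumulative_spend(years)
    print(f"{years} yr: ${total.low:,} to ${total.high:,}")
```

On these assumptions, the first year lands between $450K and $1.9M, and by the third year platform run cost can approach the size of the original engagement.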

What makes this category hard to benchmark honestly: the vendors and SIs who write the case studies have every incentive to publish the fastest examples. The failures are private. Qlik’s finding that 47% of executives worry their company overinvested in AI is the other side of that same ledger.

Key Data Points

Finding	Source	Date	Credibility
60% of AI projects without AI-ready data will be abandoned through 2026	Gartner press release, n=248 data leaders	Feb 26, 2025	HIGH — independent analyst, named methodology
63% of orgs lack/unsure of AI-ready data management practices	Gartner Q3 2024 survey	Sep 2024	HIGH
81% report significant data quality problems; 85% say leadership isn’t addressing	Qlik / Wakefield Research, n=500 U.S. data pros, $500M+ firms	Feb 4–18, 2025	MEDIUM-HIGH — independent field research, Qlik-commissioned
47% of companies worry they overinvested in AI	Qlik / Wakefield	Feb 2025	MEDIUM-HIGH
Enterprise migrations: 6–18 months typical	Hakkoda/NTT Data/Databricks aggregation	2024–2025	MEDIUM — SI-published
2,500+ stored procedures, 40 weeks → 10 weeks with AI	Caylent / Teamfront case study	2025	MEDIUM — SI case study, no independent audit
70% AI-automated / 20% AI-assisted / 10% hand-coded conversion mix	Caylent / Teamfront	2025	MEDIUM
Data scientists spend 39–45% of time on preparation	Anaconda State of Data Science	Recurring through 2024	HIGH — independent repeat survey
Poor data quality costs orgs $12.9M/year avg	Gartner (ongoing citation)	2020–2024	MEDIUM (cited widely, original methodology opaque)
47% of newly created records contain at least one critical error	MIT Sloan	Published 2017–2020, still cited	MEDIUM — predates current tools, trend only
Only 3% of companies’ data meets basic quality standards	Harvard Business Review	2017	TIER 4 — historical context only
15–25% revenue loss annually from poor data quality	MIT Sloan with Cork Univ. Business School	2022	MEDIUM
IDC/NetApp “Scaling Enterprise AI Responsibly,” n=1,200+ global decision makers	IDC	Oct 2025	HIGH — independent, large-n
42% of companies abandoned at least one AI initiative in 2025; avg $7.2M sunk cost	Aggregator stat (multiple sources)	2025	LOW-MEDIUM — origin chain unclear

Note: the viral near-total-failure headline figure from 2025 is excluded per the banned-statistics list (fabricated attribution). The 60% abandonment prediction from Gartner is the better-sourced and more defensible number.

What This Means for Your Organization

Three practical calibrations. First, if your board is budgeting a “data readiness project” as a one-time capital expense, reframe it. Data readiness for AI is an operating discipline with a multi-year horizon. The companies that capture AI value are the ones that funded data quality as a permanent line, not as a sprint. Plan for 6–18 months on the first domain and a named data owner who stays in the role.

Second, the work decomposes into two buckets with very different compression ratios. Code translation, schema mapping, and bronze-tier ingestion are compressed 70–90% by AI tooling. Entity resolution, business rule definition, and source documentation compress 20–40% at best. Cleaning is the easy part; deciding what “customer” means across your CRM, ERP, and billing system is not. Budget accordingly: the slowest slice is governance, not engineering.
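A back-of-envelope Python sketch shows why the slower bucket ends up dominating the budget. The 50/50 split of baseline effort between the two buckets is a hypothetical input, not a figure from the research; the compression ranges are the ones quoted above.

```python
def remaining_effort(share_mechanical: float,
                     mechanical_reduction: float,
                     judgment_reduction: float) -> float:
    """Fraction of the original effort left after AI tooling is applied.

    share_mechanical: portion of baseline effort in code translation,
        schema mapping and ingestion (hypothetical input).
    mechanical_reduction / judgment_reduction: fractional savings per bucket.
    """
    share_judgment = 1.0 - share_mechanical
    return (share_mechanical * (1.0 - mechanical_reduction)
            + share_judgment * (1.0 - judgment_reduction))


# Hypothetical 50/50 split of baseline effort between the two buckets.
SHARE_MECHANICAL = 0.5
best = remaining_effort(SHARE_MECHANICAL, 0.90, 0.40)   # top of both ranges
worst = remaining_effort(SHARE_MECHANICAL, 0.70, 0.20)  # bottom of both ranges

# Judgment-bucket work still left in the best case.
judgment_left = (1 - SHARE_MECHANICAL) * (1 - 0.40)

print(f"Effort remaining: {best:.0%} to {worst:.0%} of baseline")
print(f"Judgment work as share of what remains: {judgment_left / best:.0%}")
```

With that hypothetical split, even the best case leaves roughly a third of the original effort, and most of what remains is entity resolution, rule definition, and documentation. Changing `SHARE_MECHANICAL` to match your own estimate changes the totals but not the direction of the result.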

Third, the SI-with-AI-tooling model is the dominant execution pattern at mid-market scale, and the cost envelope is $300K–$1.5M for a focused, single-domain initiative with one AI use case tied to it. If a vendor or SI pitches a platform-first rather than use-case-first project, that is usually a sign the scope will expand and the timeline will slip. If you are deciding between an in-house build, a platform-first engagement, and a use-case-scoped SI engagement — or trying to estimate what your organization specifically should spend before AI value is realistic — I’d welcome the conversation: brandon@brandonsneider.com.

Sources

Gartner, “Lack of AI-Ready Data Puts AI Projects at Risk” (press release, Feb 26, 2025). Q3 2024 survey of 248 data management leaders. HIGH credibility. https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
Qlik / Wakefield Research, “Data Quality Is Not Being Prioritized on AI Projects” (Feb 2025). n=500 U.S. data and analytics professionals at $500M+ firms. Field dates Feb 4–18, 2025. MEDIUM-HIGH credibility (Qlik-commissioned, independent fielder). https://www.qlik.com/us/news/company/press-room/press-releases/data-quality-is-not-being-prioritized-on-ai-projects
IDC / NetApp, “Scaling Enterprise AI Responsibly: The Critical Role of Data Readiness and an Intelligent Data Infrastructure” (Oct 2025). n=1,200+ global decision makers, two waves Jan 2024 and Jun 2025. HIGH credibility (large-n, independent fielder; vendor-published, treat headline framings with caution). https://www.netapp.com/media/142474-idc-2025-ai-maturity-findings.pdf
Anaconda, “State of Data Science” (recurring annual survey, most recent cited wave 2024). HIGH credibility — independent repeat survey. https://www.anaconda.com/resources/whitepapers
Caylent, “How AI Is Revolutionizing Database Migration” (2025). Teamfront/Arborgold case. MEDIUM credibility — systems integrator case study, no independent audit. https://caylent.com/blog/how-ai-is-revolutionizing-database-migration-from-year-long-projects-to-quarterly-wins
Informatica, “The Surprising Reason Most AI Projects Fail” (2025). Vendor marketing; useful for aggregated stats with original citations. LOW-MEDIUM credibility. https://www.informatica.com/blogs/the-surprising-reason-most-ai-projects-fail-and-how-to-avoid-it-at-your-enterprise.html
Integrate.io, “Data Quality Improvement Stats from ETL” (updated 2026). Aggregation of Gartner, IDC, MIT Sloan, HBR source stats. MEDIUM — derivative but well-sourced. https://www.integrate.io/blog/data-quality-improvement-stats-from-etl/
MIT Sloan Management Review with Cork University Business School. Data quality revenue-loss research (2017–2022). MEDIUM — original methodology not fully public.
Companion research in this corpus: research/07-adoption-challenges/bronze-silver-gold-data-tiering-timelines.md (medallion architecture), and research/01-ai-native-landscape/real-roi-by-function.md.
