See also (wiki): wiki/data-products-reuse.md, wiki/data-readiness.md, wiki/ai-delivery-pods.md, wiki/mlops-ai-platform-engineering.md
Executive Summary
- Every organization pays data preparation costs on every AI deployment. The question is whether it pays them once or repeatedly. O’Reilly research finds data preparation consumes 60–80% of AI project time. Without a data product architecture, an organization building 10 AI workflows pays those costs 10 times. With data products, it pays them once, and each subsequent workflow is faster, cheaper, and more reliable.
- A data product is not a report, a database, or a data lake. It is a governed asset with four properties: a documented schema, quality monitoring, a named owner, and a consumption interface. The distinction matters because most “data assets” organizations already have lack three of the four.
- McKinsey Rewired (Ch. 25, 26, 31) identifies data architecture as the single strongest structural predictor of sustained AI value. “No data architecture, no AI advantage” (Ch. 25) is the book’s most unhedged claim. The independent evidence (Gartner 2025: 60% of AI projects abandoned through 2026 due to data readiness gaps; Palantir 139% NRR Q4 2025 driven by customers reusing shared data infrastructure) supports it.
- For mid-market companies, the data product architecture does not require a multimillion-dollar data platform overhaul. It requires identifying the two or three datasets that will serve the most AI workflows, assigning owners to them, adding quality monitoring, and creating a consumption interface. The minimum viable data product program can start with $50K–$150K and 90 days.
- The biggest risk in data product investment is building the product before the consumer exists. The correct sequence is: commit to two or three AI workflows, identify their shared data requirements, and build the data product to serve those workflows. Data products built speculatively, without committed consumers, are expensive data warehouse entries.
Why the Bespoke Extract Pattern Fails at Scale
Most mid-market AI deployments start with a bespoke data extract: the AI engineer connects directly to the CRM, ERP, or HRIS, queries the fields needed for the pilot workflow, and feeds them into the model. This works for one pilot. It fails at scale for five predictable reasons:
Duplication of effort. When the second AI workflow needs customer data, a second engineer queries the CRM, extracts a different subset of fields with different normalization logic, and builds a second pipeline. The two pipelines may produce different answers to the same question from the same source data because they applied different entity resolution logic, different null handling, or different date formats. With five pipelines from the same source, the inconsistencies multiply. The IBM CDO Survey 2025 indicates how widespread the underlying problem is: only 26% of organizations can use unstructured data in a way that delivers business value, a gap consistent with the other 74% being unable to get consistent answers from their own systems.
No one to call when the data breaks. A bespoke extract has no owner. When the source system changes its schema — a routine event in any actively maintained CRM or ERP — the extract breaks silently. The AI starts running on corrupted or missing data. No alert fires because no one set up monitoring. The production failure is discovered by a user complaint, not by the team.
No quality guarantees for downstream AI. A bespoke extract does not document its quality thresholds. The AI engineer who built it knows (roughly) what the data is like. The ML engineer building the next workflow on top of the same source does not. Every new consumer of the bespoke extract discovers its quality problems independently, in production.
Cannot support regulated data handling. When a compliance team asks “which AI systems have access to customer financial data?” — a question that is now legally relevant in five U.S. states and under GDPR — a bespoke extract architecture cannot answer it reliably. The access governance required for regulated data does not exist at the extract level; it must exist at the data product level.
Technical debt that compounds. Every bespoke extract is a custom integration that requires maintenance. Organizations with 10 AI workflows and 10 bespoke extracts have 10 maintenance obligations, 10 failure modes, and 10 separate monitoring requirements. The operational overhead grows linearly with the number of deployments. Data products reverse this: 10 workflows sharing 3 data products have 3 maintenance obligations.
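The duplication problem can be made concrete with a minimal sketch. The rows, field names, and both extract functions below are hypothetical; they only illustrate how two pipelines reading the same source records can report different answers because of different null handling.

```python
# Hypothetical CRM rows; None marks a missing order value.
SOURCE_ROWS = [
    {"customer_id": "C1", "order_value": 100.0},
    {"customer_id": "C2", "order_value": None},
    {"customer_id": "C3", "order_value": 200.0},
]


def extract_a_average(rows):
    """Pipeline A: drops rows with a missing order value."""
    values = [r["order_value"] for r in rows if r["order_value"] is not None]
    return sum(values) / len(values)


def extract_b_average(rows):
    """Pipeline B: treats a missing order value as zero."""
    values = [r["order_value"] or 0.0 for r in rows]
    return sum(values) / len(values)


avg_a = extract_a_average(SOURCE_ROWS)  # 150.0
avg_b = extract_b_average(SOURCE_ROWS)  # 100.0
print(avg_a, avg_b)
```

Neither function is wrong in isolation. The inconsistency exists because no owner decided which null-handling rule is canonical, which is the decision a data product makes once for all consumers.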
The Four Properties That Make Something a Data Product
A data product is defined by four properties. A dataset that lacks any of these four is not a data product — it is a data asset, which is a lower tier of governance and reliability.
Property 1 — Governed schema
A documented, versioned definition of what the product contains. This includes:
- Field definitions: For each field in the product, a business definition (not just a column name), the source system and field it maps to, and the transformation logic applied if any.
- Version history: Changes to the schema are versioned (v1, v2, v3), with a change log and a communication record for consumers. Schema changes are announced before they are deployed so consuming AI workflows can be updated.
- Inclusion rules: The rules that determine which source records are included in the product. A customer-360 product might include all accounts with at least one transaction in the last 24 months and exclude test accounts and internal accounts. These rules must be documented explicitly — every “obvious” rule that is not documented becomes a source of incorrect AI outputs when the rule is not enforced consistently.
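One way to keep field definitions, version history, and inclusion rules explicit is to store them as code next to the pipeline. The sketch below is illustrative only: the class names, source paths such as `crm.accounts.acct_id`, and the 730-day approximation of 24 months are assumptions, not part of any specific platform.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FieldDefinition:
    name: str
    business_definition: str
    source: str                  # source system and field it maps to
    transformation: str = "none"


@dataclass(frozen=True)
class SchemaVersion:
    version: str
    change_log: str
    fields: tuple


CUSTOMER_360_V1 = SchemaVersion(
    version="v1",
    change_log="Initial release.",
    fields=(
        FieldDefinition(
            name="customer_id",
            business_definition="Canonical identifier for one resolved customer.",
            source="crm.accounts.acct_id",  # hypothetical source path
        ),
        FieldDefinition(
            name="last_transaction",
            business_definition="Date of the most recent completed transaction.",
            source="billing.invoices.paid_on",  # hypothetical source path
            transformation="max(paid_on) per customer",
        ),
    ),
)


def is_included(account: dict, today: date) -> bool:
    """Inclusion rule from the text: at least one transaction in the last
    24 months, excluding test and internal accounts."""
    if account["is_test"] or account["is_internal"]:
        return False
    cutoff = today - timedelta(days=730)  # 24 months approximated as 730 days
    return account["last_transaction"] >= cutoff
```

Because the inclusion rule is a function rather than tribal knowledge, every consumer applies the same rule, and a change to it becomes a visible schema version change.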
Property 2 — Quality monitoring
Automated checks that run on every refresh cycle, with alerting for out-of-threshold results. The minimum monitoring set:
- Completeness: Percentage of expected records present. A customer-360 product that is missing 30% of accounts because a CRM sync failed is a data quality event, not a normal condition.
- Freshness: Timestamp of the most recent source data. If the product is supposed to refresh daily, a 48-hour-old refresh is stale by definition.
- Schema consistency: All expected fields present, correct data types, no new nulls where values are required.
- Referential integrity: Foreign keys resolve correctly. Customer IDs in the product match customer IDs in the transaction system. Broken referential integrity is a silent cause of AI personalization failures: the customer 360 says the customer has no recent purchases while the transaction system shows 10 in the last month, because the join key broke.
The quality monitoring results should be visible to the data product owner and to the AI delivery pod (the engineering team consuming the product). When a quality check fails, the data product owner resolves it before consumer teams escalate.
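The four minimum checks can be sketched as plain functions that run on each refresh. This is a simplified illustration, not a monitoring framework: the 98% completeness threshold, the 24-hour freshness window, and the two required fields are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def completeness(actual: int, expected: int) -> float:
    """Share of expected records that are present (0.0 to 1.0)."""
    return actual / expected if expected else 0.0


def is_fresh(latest_source_ts: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the newest source record is no older than max_age."""
    return now - latest_source_ts <= max_age


def schema_violations(rows: list, required: dict) -> list:
    """List (row_index, field, problem) for missing, null, or mistyped fields."""
    problems = []
    for i, row in enumerate(rows):
        for name, expected_type in required.items():
            if name not in row:
                problems.append((i, name, "missing"))
            elif row[name] is None:
                problems.append((i, name, "null"))
            elif not isinstance(row[name], expected_type):
                problems.append((i, name, "wrong type"))
    return problems


def orphaned_keys(product_ids, transaction_customer_ids) -> set:
    """Customer IDs seen in transactions that do not resolve to a product record."""
    return set(transaction_customer_ids) - set(product_ids)


def run_checks(rows, expected_count, latest_source_ts, now, transaction_customer_ids):
    """Return the names of the checks that failed on this refresh."""
    failures = []
    if completeness(len(rows), expected_count) < 0.98:  # assumed threshold
        failures.append("completeness")
    if not is_fresh(latest_source_ts, now, timedelta(hours=24)):  # assumed daily refresh
        failures.append("freshness")
    if schema_violations(rows, {"customer_id": str, "email": str}):
        failures.append("schema")
    ids = [r.get("customer_id") for r in rows]
    if orphaned_keys(ids, transaction_customer_ids):
        failures.append("referential_integrity")
    return failures


rows = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": None},  # required value is null
]
failed = run_checks(
    rows,
    expected_count=2,
    latest_source_ts=datetime(2026, 1, 1),
    now=datetime(2026, 1, 3),              # source data is 48 hours old
    transaction_customer_ids=["C1", "C3"],  # C3 has no product record
)
print(failed)  # ['freshness', 'schema', 'referential_integrity']
```

In practice the returned list would feed an alert to the data product owner and the consuming pod rather than a print statement.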
Property 3 — Named ownership
A specific person or team accountable for the data product’s quality, schema governance, consumer support, and evolution roadmap.
The ownership question is the one most organizations avoid, because it requires someone to accept accountability for something that will fail occasionally. The data product model cannot work without it. Assigning ownership to “the data team” is not ownership — it is diffusion of accountability.
The data product owner is accountable for:
- The quality monitoring setup and response
- Schema versioning and consumer communication
- Adjudicating conflicts between consumer needs (when two AI workflows need the same field with different normalization logic, the owner decides the canonical version)
- The evolution roadmap (what new fields or entities will be added in future versions)
In a mid-market organization with a small data team, one person may own two to three data products. That is acceptable. No owner is not.
Property 4 — Consumption interface
A consistent, documented way for AI workflows and other consumers to access the data product. The interface abstracts the underlying source systems — when the source changes, the interface does not, and downstream consumers are unaffected.
Consumption interface options in order of maturity:
- Governed SQL view or table in a shared data warehouse: The simplest implementation. The data product is a view in Snowflake, BigQuery, or Databricks with documented column names and access controls. Any AI workflow with access credentials can query it.
- Feature store entry: For ML-intensive workflows, a feature store (Feast, Tecton, Vertex AI Feature Store) serves the data product as a versioned feature set with train/serve parity: the model training environment and the production serving environment see the same data.
- REST API endpoint: For real-time or near-real-time consumption. The data product is served as an API that returns a customer record, supplier profile, or transaction history on demand. Appropriate for customer-facing AI workflows where latency matters.
The consumption interface is also where access governance is enforced: only consumers with documented business need and appropriate credentials can read the product.
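The simplest interface option, a governed view, can be illustrated with an in-memory SQLite database standing in for Snowflake, BigQuery, or Databricks. The table name, column names, and view name are hypothetical, and access controls are not shown because SQLite has no grant model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Source system table with its own naming conventions.
    CREATE TABLE crm_accounts_v2 (acct_id TEXT, acct_nm TEXT, is_test INTEGER);
    INSERT INTO crm_accounts_v2 VALUES ('C1', 'Acme', 0), ('T1', 'Test Co', 1);

    -- The view is the consumption interface: stable column names,
    -- with the inclusion rule (no test accounts) already applied.
    CREATE VIEW customer_360_v1 AS
    SELECT acct_id AS customer_id, acct_nm AS customer_name
    FROM crm_accounts_v2
    WHERE is_test = 0;
    """
)

rows = conn.execute(
    "SELECT customer_id, customer_name FROM customer_360_v1"
).fetchall()
print(rows)  # [('C1', 'Acme')]
```

If the CRM later renames `acct_nm`, only the view definition changes. Consumers keep querying `customer_360_v1` with the same column names, which is the abstraction the interface is meant to provide.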
The Four Data Products Most Worth Building First
Customer 360
What it is: A unified view of each customer across all touchpoints: identity resolved across CRM, billing, support, and e-commerce; transaction history; engagement history; risk or opportunity signals.
Why it is the highest-leverage starting point: The customer 360 data product enables the widest range of AI workflows of any single investment. Customer service routing, personalization, churn prediction, cross-sell modeling, customer lifetime value, contract renewal forecasting — all of these draw from a shared customer data foundation. Building the customer 360 once makes each of these workflows faster to build and more reliable in production.
The identity resolution challenge: The hardest part of building a customer 360 is not the schema design — it is resolving the same customer across systems that use different IDs. A customer who is “CRM-00345” in Salesforce, “CUST-8821” in the billing system, and “user@email.com” in the support system is one customer in reality and three records in data. Identity resolution — matching and merging these records into a canonical customer entity — is the data engineering work that makes the customer 360 accurate. This is not a one-time project; it requires maintenance as new accounts are created and existing accounts change.
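A deliberately simplified sketch of identity resolution follows, using the three identifiers from the example above plus one unrelated record. It assumes an exact match on a normalized email address is enough to link records; real identity resolution usually combines several attributes and fuzzy matching, and the record layout here is hypothetical.

```python
from collections import defaultdict

# Hypothetical source records; each system uses its own ID scheme.
RECORDS = [
    {"system": "crm", "id": "CRM-00345", "email": "User@Email.com "},
    {"system": "billing", "id": "CUST-8821", "email": "user@email.com"},
    {"system": "support", "id": "user@email.com", "email": "user@email.com"},
    {"system": "crm", "id": "CRM-00999", "email": "other@example.com"},
]


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so trivial variants match."""
    return email.strip().lower()


def resolve_identities(records):
    """Group source records that share a normalized email into one entity."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_email(rec["email"])].append((rec["system"], rec["id"]))
    # Assign canonical IDs in a deterministic order (sorted by email).
    return {
        f"customer-{i}": sorted(members)
        for i, (_, members) in enumerate(sorted(groups.items()), start=1)
    }


entities = resolve_identities(RECORDS)
for canonical_id, members in entities.items():
    print(canonical_id, members)
```

The four source records collapse into two canonical customers. The maintenance burden described above shows up here as the need to rerun and review the matching whenever accounts are created or emails change.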
Budget range: $80K–$350K for a mid-market company (500–2,000 employees, 10K–100K customer accounts), depending on the number of source systems, the quality of existing data, and whether identity resolution tooling is purchased or built.
Supplier / Vendor Risk Profile
What it is: A governed view of each supplier: financial health indicators, contractual terms, performance history (on-time delivery, quality, compliance), and third-party risk signals.
Why it matters for AI: MHI/Deloitte Supply Chain AI 2026 finds 43% of supply chain AI failures trace to data quality issues in supplier records. Demand forecasting models that include supplier reliability signals dramatically outperform those that do not — but only if the supplier data is governed and current. A supplier risk profile data product enables: procurement AI for bid evaluation, supply disruption prediction, contract renewal prioritization, and ESG reporting automation.
The freshness challenge: Supplier risk data is unusually time-sensitive. A supplier that was financially stable six months ago may be in distress today. The quality monitoring for a supplier risk profile must include freshness checks on the risk signals (credit ratings, news feeds, compliance filings) in addition to the internal performance data.
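Because each supplier risk signal ages at a different rate, a per-signal freshness check is more useful than a single product-level timestamp. The sketch below assumes hypothetical signal names and maximum ages; appropriate values depend on the feed and the risk appetite.

```python
from datetime import datetime, timedelta

# Assumed maximum acceptable age per signal.
MAX_AGE = {
    "credit_rating": timedelta(days=30),
    "news_feed": timedelta(days=1),
    "compliance_filing": timedelta(days=90),
    "on_time_delivery": timedelta(days=7),
}


def stale_signals(last_updated: dict, now: datetime) -> list:
    """Return signals that are older than their maximum age or missing."""
    stale = []
    for signal, max_age in MAX_AGE.items():
        ts = last_updated.get(signal)
        if ts is None or now - ts > max_age:
            stale.append(signal)
    return stale


supplier_timestamps = {
    "credit_rating": datetime(2025, 9, 1),        # six months old
    "news_feed": datetime(2026, 2, 28, 12, 0),    # 12 hours old
    "compliance_filing": datetime(2026, 1, 15),   # 45 days old
    # "on_time_delivery" is missing entirely
}
stale = stale_signals(supplier_timestamps, now=datetime(2026, 3, 1))
print(stale)  # ['credit_rating', 'on_time_delivery']
```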
Budget range: $50K–$200K for a mid-market company with 200–2,000 active suppliers, depending on the number of data sources and the degree of third-party risk feed integration.
Employee Capability and Capacity Model
What it is: A governed, role-level view of workforce skills, capacity availability, and deployment history.
Why it matters for AI: Mercer Global Talent Trends 2026 (n=12,000) documents a CHRO workforce planning gap: 65% of executives expect 11–30% of their workforce to be redeployed or reskilled within two years, but only 29% of CHROs feel confident in their workforce planning capability. The capability model is the data foundation for AI-assisted workforce planning, skills-gap analysis, training prioritization, and the hire/train/borrow/bot decision.
The access governance requirement: This data product carries the highest access governance requirements of the four. Individual-level workforce data is subject to five state and local employment AI laws in effect as of 2026 (Illinois, Colorado, California, New Jersey, and New York City), which require bias audits, transparency disclosures, and impact assessments for AI systems that use this data for hiring, promotion, or discipline decisions. The data product governance layer must include documented access controls, an audit trail of which AI systems have accessed the data, and a legal review before each new AI workflow is connected.
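The access control and audit trail requirements can be sketched as a thin wrapper around reads. Everything here is hypothetical (the class, the consumer names, the placeholder return value); the point is only that every read attempt is logged, including denied ones, so the question of which AI systems have accessed the data can be answered from the log.

```python
from datetime import datetime, timezone


class GovernedProduct:
    """Illustrative wrapper: approved-consumer check plus an audit log."""

    def __init__(self, name, approved_consumers):
        self.name = name
        self.approved_consumers = set(approved_consumers)
        self.audit_log = []

    def read(self, consumer: str, purpose: str) -> str:
        allowed = consumer in self.approved_consumers
        # Log every attempt, allowed or not.
        self.audit_log.append(
            {
                "ts": datetime.now(timezone.utc),
                "consumer": consumer,
                "purpose": purpose,
                "allowed": allowed,
            }
        )
        if not allowed:
            raise PermissionError(f"{consumer} is not approved for {self.name}")
        return f"rows from {self.name}"  # placeholder for the real query

    def consumers_with_access(self) -> list:
        """AI systems that have successfully read the product."""
        return sorted({e["consumer"] for e in self.audit_log if e["allowed"]})


product = GovernedProduct("employee_capability_model", ["skills-gap-model"])
product.read("skills-gap-model", "training prioritization")
try:
    product.read("marketing-bot", "ad targeting")
    denied = False
except PermissionError:
    denied = True

print(product.consumers_with_access())  # ['skills-gap-model']
```

In this sketch, the legal review step would correspond to the act of adding a consumer to the approved list, which should happen before the workflow is connected rather than after.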
Budget range: $40K–$150K for a mid-market company, if HRIS data is reasonably structured. Higher if skills data is manually maintained in spreadsheets rather than in a structured system.
Financial Transaction Ledger
What it is: A governed, reconciled view of financial transactions across business units, legal entities, and time periods.
Why it matters for AI: IBM IBV Dynamic Finance 2026 finds organizations with unified financial data products deploy CFO-office AI 40% faster than those building workflow-by-workflow extracts. The financial ledger enables: AI-assisted close process, anomaly detection and fraud monitoring, tax preparation automation, board financial reporting, and the CFO workflow automation documented in wiki/cfo-ai-workflows.md.
The multi-entity challenge: Mid-market companies with multiple legal entities (common in PE-backed platforms and multi-brand organizations) have the most complex financial ledger data product requirements. The product must resolve transactions across entities with different chart-of-accounts structures, different ERP instances, and different period-end dates. The investment in building this product correctly is typically the highest of the four, but so is the downstream AI value (automated consolidation, intercompany reconciliation, board package generation).
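The chart-of-accounts part of the problem reduces to a governed mapping from entity-specific account codes to one canonical account. The sketch below uses hypothetical entities, codes, and amounts, and it deliberately leaves out currency conversion, period-end alignment, and intercompany eliminations.

```python
from collections import defaultdict

# Hypothetical mapping: (entity, local account code) -> canonical account.
ACCOUNT_MAP = {
    ("entity_a", "4000"): "revenue",
    ("entity_b", "R-100"): "revenue",
    ("entity_a", "6100"): "travel_expense",
    ("entity_b", "E-220"): "travel_expense",
}


def consolidate(transactions):
    """Sum amounts by canonical account; return unmapped rows separately."""
    totals = defaultdict(float)
    unmapped = []
    for txn in transactions:
        canonical = ACCOUNT_MAP.get((txn["entity"], txn["account"]))
        if canonical is None:
            unmapped.append(txn)  # surfaced for the owner, never silently dropped
            continue
        totals[canonical] += txn["amount"]
    return dict(totals), unmapped


transactions = [
    {"entity": "entity_a", "account": "4000", "amount": 1000.0},
    {"entity": "entity_b", "account": "R-100", "amount": 500.0},
    {"entity": "entity_b", "account": "E-220", "amount": 75.0},
    {"entity": "entity_b", "account": "X-999", "amount": 10.0},  # no mapping yet
]
totals, unmapped = consolidate(transactions)
print(totals)    # {'revenue': 1500.0, 'travel_expense': 75.0}
print(unmapped)  # the X-999 row
```

Returning unmapped rows instead of discarding them keeps the reconciliation gap visible, which matters more in a ledger product than in most other data products.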
Budget range: $100K–$500K for a mid-market company with multiple legal entities; $40K–$150K for a single-entity company.
Build Sequence: The Data Product Roadmap
The data product roadmap is derived from the AI workflow roadmap. The sequence:
Step 1: Commit to the first two to three AI workflows. The data product investment is only justified by committed consumers. Identify the workflows that are going into production in the next 12 months.
Step 2: Map the data requirements. For each workflow, document: source systems required, fields needed, quality thresholds required for acceptable AI performance, access governance requirements.
Step 3: Identify the shared foundation. Look for overlap across the workflows’ data requirements. If the first three AI workflows all need customer identity + transaction history + support history, that overlap defines the customer 360 data product.
Step 4: Build the data product to serve the first workflow; design the schema to serve the others. The first consumer funds the build; additional consumers share the benefit. This is the economic logic Rewired (Ch. 31) calls the reuse case.
Step 5: Assign an owner before deployment, not after. Ownership assigned retroactively, after the first consumer team has already started querying the product, creates confusion about governance authority. Assign the owner before the product goes live.
Step 6: Run the quality monitoring through at least two refresh cycles before connecting AI workflows. Quality issues discovered in production AI are more expensive to diagnose and remediate than quality issues discovered in the monitoring setup phase.
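Steps 2 and 3 amount to a set intersection over the workflows' documented requirements. A minimal sketch, with hypothetical workflow and requirement names:

```python
# Hypothetical output of Step 2: data requirements per committed workflow.
WORKFLOW_REQUIREMENTS = {
    "churn_prediction": {
        "customer_identity", "transaction_history", "support_history", "contract_terms",
    },
    "service_routing": {
        "customer_identity", "transaction_history", "support_history", "agent_skills",
    },
    "cross_sell": {
        "customer_identity", "transaction_history", "support_history", "product_catalog",
    },
}


def shared_foundation(requirements: dict) -> set:
    """Step 3: requirements that every committed workflow needs."""
    if not requirements:
        return set()
    return set.intersection(*requirements.values())


def workflow_specific(requirements: dict) -> dict:
    """What remains for each workflow to source on its own."""
    shared = shared_foundation(requirements)
    return {name: needs - shared for name, needs in requirements.items()}


shared = shared_foundation(WORKFLOW_REQUIREMENTS)
specific = workflow_specific(WORKFLOW_REQUIREMENTS)
print(sorted(shared))
# ['customer_identity', 'support_history', 'transaction_history']
print(specific["churn_prediction"])  # {'contract_terms'}
```

The shared set defines the scope of the data product. The workflow-specific remainders stay with each consuming team until a second consumer needs them too.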
The 90-Day Data Product Sprint
For a mid-market organization building its first data product to support an upcoming AI deployment:
Days 1–15: Discovery
- Inventory the source systems that contain the data needed for the first two AI workflows
- Identify the overlapping data requirements (define the data product scope)
- Map the source-to-target transformation requirements (entity resolution needs, field normalization, null handling)
- Assign the data product owner
Days 16–45: Build
- Build the ingestion pipeline from source systems to the staging environment
- Implement entity resolution logic
- Define and implement quality monitoring rules
- Document the schema (field definitions, inclusion rules, version 1.0)
- Build the consumption interface (SQL view, feature store entry, or API)
Days 46–60: Validation
- Run the quality monitoring through three refresh cycles
- Have the AI engineer for the first consuming workflow query the product and validate it produces the data the AI workflow expects
- Document any schema refinements and update to version 1.1 before production
Days 61–90: Production and handoff
- Connect the first AI workflow to the data product
- Monitor quality for the first 30 days with heightened attention (daily check instead of weekly)
- Document the data product in the data catalog with the schema, owner, refresh cadence, and consumption interface
- Prepare the “second pod inheritance” documentation: what a second AI team needs to know to consume the product
What Good Looks Like at Month 6
At month 6 after the first data product is in production serving AI workflows, a CIO should be able to observe:
- The second AI workflow connected to the data product in less time than the first. If the second workflow took the same amount of time and cost as the first to get data-ready, the reuse architecture is not working. The target is 40–60% reduction in data preparation time for the second workflow drawing from the same data product.
- Data quality alerts are firing and being resolved before AI workflows are affected. The monitoring is working if it catches issues before they surface as AI quality problems. If the AI team first learns about a data issue from a bad AI output, the monitoring thresholds are too lenient or the checks do not cover that failure mode.
- The data product owner can answer “which AI systems use this data” in under five minutes. If the lineage documentation is not maintained, the governance layer is incomplete.
- Schema version history exists and is current. If the product has been in production for six months and is still listed as v1.0 with no change log, schema changes are being applied without versioning.
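The reuse target in the first bullet is a simple ratio, shown here as a worked check. The day counts are made-up inputs, and the 40% floor is the lower end of the 40–60% range stated above.

```python
def prep_time_reduction(first_days: float, second_days: float) -> float:
    """Fractional reduction in data preparation time for the second workflow."""
    if first_days <= 0:
        raise ValueError("first_days must be positive")
    return (first_days - second_days) / first_days


def meets_reuse_target(first_days: float, second_days: float, minimum: float = 0.40) -> bool:
    """True when the reduction reaches the lower bound of the target range."""
    return prep_time_reduction(first_days, second_days) >= minimum


# Hypothetical example: 50 days of data prep for the first workflow, 25 for the second.
print(prep_time_reduction(50, 25))  # 0.5
print(meets_reuse_target(50, 25))   # True
print(meets_reuse_target(50, 45))   # False
```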
Connecting Data Products to the Broader Architecture
Data products do not stand alone — they are one layer in a three-layer AI data architecture:
Layer 1 — Source data: The CRM, ERP, HRIS, data lake, or other systems that contain the raw operational data. This layer is not designed for AI consumption; it is designed for operational transaction processing.
Layer 2 — Data products: The governed, versioned, quality-monitored assets built on top of source data. This layer abstracts source systems from AI consumers and enforces quality and governance. This is the investment this document is about.
Layer 3 — AI workflow data: The final transformation and indexing steps specific to each AI workflow — the vector embeddings for a RAG pipeline, the feature engineering for an ML model, the prompt context assembly for an LLM workflow. This layer is workflow-specific and is the AI engineer’s domain.
Rewired (Ch. 25) summarizes the dependency: without Layer 2 (data products / data architecture), Layer 3 (AI workflow data) is rebuilt from scratch for every deployment. The investment in Layer 2 is the investment that makes Layer 3 scalable.
Related Reading
- wiki/data-products-reuse.md — concept page; anatomy of a data product, the four high-value products, governance trap
- wiki/data-readiness.md — upstream data quality and the reset decision; data products are the structured output of a successful data readiness investment
- wiki/ai-delivery-pods.md — the data engineer role within the pod; pod data requirements that drive data product prioritization
- wiki/mlops-ai-platform-engineering.md — data pipeline monitoring that feeds into data product quality monitoring
- research/09-ai-adoption-cycle/ai-data-reset-decision-tree.md — decision tree for data investment; when to build a data product vs. bespoke extract
- research/07-adoption-challenges/ai-data-reset-decision-framework.md — full reset framework; data product investment fits the “Reset” path for multi-domain AI programs
- research/04-consulting-firms/mckinsey-rewired-2nd-edition-synthesis.md — Rewired Ch. 25 (“no data architecture, no AI advantage”), Ch. 26 (data products as reusable building blocks), Ch. 31 (“the best use case is the reuse case”)