MLOps Production Readiness: What It Takes to Keep AI Working After Launch

See also (wiki): wiki/mlops-ai-platform-engineering.md, wiki/ai-delivery-pods.md, wiki/agentic-ai-governance.md, wiki/data-readiness.md


Executive Summary

  • The most expensive AI mistake a CIO can make is launching without operational infrastructure. Gartner (Feb 2025, n=248 data management leaders) finds 60% of AI projects that reach the pilot stage are abandoned before full production — not because the model failed, but because the organization could not maintain it. The failure mode is not a bad pilot. It is a good pilot deployed into an operational vacuum.
  • MLOps is not a toolset. It is a discipline. Model registries, data pipeline monitors, evaluation harnesses, deployment automation, and production dashboards are the components. The discipline is the decision that all five will be in place before production launch — not as a post-launch retrofit. Organizations that treat MLOps as a Phase 2 concern spend Phase 2 doing emergency triage instead of expanding.
  • The evaluation gap is the root cause of most “AI is not working” conclusions. METR’s July 2025 RCT found experienced developers believed they were 20% faster with AI tools while actually running 19% slower. The belief-reality gap exists because most professional environments lack measurement infrastructure. An evaluation harness built before launch closes that gap — or reveals, before launch, that the system is not production-ready.
  • Agentic AI raises the operational stakes beyond what most organizations have modeled. When AI takes external actions — booking, filing, ordering, communicating on behalf of employees — a silent failure is not a bad document. It is a wrong booking, a misfiled compliance document, or an unauthorized customer communication. The monitoring and governance infrastructure required for Level 3-4 agentic autonomy (Rewired, Ch. 5 framework) is meaningfully more complex than what is required for augmentation tools.
  • The build-vs-buy answer is clear at mid-market scale. PwC AI Performance Study 2026: median mid-market AI platform builds achieve 0.7–1.3x ROI over three years; median mid-market platform purchases achieve 1.4–2.2x. Buy the infrastructure layer; invest the saved engineering capacity in evaluation discipline and workflow design — the two components that are workflow-specific and cannot be purchased off the shelf.

The Production Failure Taxonomy

Understanding why AI deployments fail after launch is the prerequisite for building infrastructure that prevents failure. The post-launch failures cluster into five types:

Type 1 — Silent quality degradation

The AI produces lower-quality outputs over time, but no one notices because there is no monitoring in place. Common triggers: provider-side model updates (GPT-4o behavior changed between the version used in pilot and the version running in production), data drift (the source CRM schema changed, altering the data flowing into the AI pipeline), or prompt regression (a well-intentioned prompt refinement degraded performance on edge cases).

The danger: Silent degradation compounds. By the time someone notices — usually a user complaint, an audit finding, or a random quality check — the organization has weeks or months of degraded outputs in production. The investigation is expensive, the remediation is disruptive, and the trust damage persists.

Prevention: Automated evaluation suite running on a weekly cadence against a fixed test set. Alert threshold set for any metric drop of >5 percentage points week-over-week.
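The alert rule can be sketched in a few lines. This is a minimal illustration, assuming metrics are stored as percentages keyed by name; the function name `find_regressions` and the sample metric names are hypothetical, not part of any particular tool.

```python
from typing import Dict, List

# Threshold from the text: alert on a drop of more than 5 percentage points.
ALERT_DROP_PP = 5.0


def find_regressions(previous: Dict[str, float], current: Dict[str, float],
                     max_drop_pp: float = ALERT_DROP_PP) -> List[str]:
    """Return one alert message per metric that dropped by more than max_drop_pp."""
    alerts = []
    for name, prev_value in previous.items():
        if name not in current:
            # A metric that silently disappears is itself worth an alert.
            alerts.append(f"{name}: missing from this week's run")
            continue
        drop = prev_value - current[name]
        if drop > max_drop_pp:
            alerts.append(
                f"{name}: dropped {drop:.1f} pp ({prev_value:.1f} -> {current[name]:.1f})"
            )
    return alerts


# Hypothetical weekly results on the same fixed test set.
last_week = {"task_success_rate": 92.0, "extraction_precision": 88.5}
this_week = {"task_success_rate": 85.5, "extraction_precision": 87.0}
for alert in find_regressions(last_week, this_week):
    print(alert)
```

Because the test set is fixed, any drop reflects a change in the system rather than a change in the inputs, which is what makes the week-over-week comparison meaningful.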

Type 2 — Data pipeline breaks

The data feeding the AI workflow changes in a way the pipeline was not designed to handle. Schema changes in source systems (a new field added to the CRM, a legacy field deprecated, a table renamed) are the most common trigger. The AI continues running but on data that does not match its training distribution or its prompt expectations.

The danger: Unlike total system failure, data pipeline breaks are often partial — some records process correctly, others produce garbage. The mixed signal makes the break harder to detect and diagnose.

Prevention: Data pipeline monitoring with schema-change alerting and record-count anomaly detection. The data product governance model (wiki/data-products-reuse.md) adds a versioned schema layer that isolates AI workflows from upstream source system changes.
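Schema-change alerting amounts to comparing the fields a pipeline run actually received against the fields the workflow expects. The sketch below assumes schemas are represented as simple field-name to type-name mappings; `diff_schema` and the CRM field names are illustrative assumptions.

```python
from typing import Dict, List


def diff_schema(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Compare field-name -> type-name mappings and describe every difference."""
    findings = []
    for field, expected_type in expected.items():
        if field not in observed:
            findings.append(f"missing field: {field}")
        elif observed[field] != expected_type:
            findings.append(
                f"type changed: {field} ({expected_type} -> {observed[field]})"
            )
    for field in observed:
        if field not in expected:
            findings.append(f"new field: {field}")
    return findings


# Hypothetical CRM export: one field retyped, one renamed upstream.
expected_schema = {"account_id": "str", "deal_value": "float", "region": "str"}
observed_schema = {"account_id": "str", "deal_value": "str", "territory": "str"}
for finding in diff_schema(expected_schema, observed_schema):
    print(finding)
```

A rename shows up as one missing field plus one new field, which is usually enough of a hint for the data engineer to find the upstream change.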

Type 3 — Adoption collapse after initial enthusiasm

User adoption spikes at launch, then drops sharply after 30–60 days as the novelty effect wears off and the friction of the new workflow becomes apparent. This is not a change management failure at launch — it is a failure to build the reinforcement mechanisms that sustain adoption through the “adoption valley of death” (the 60–90 day post-launch period).

The signal: A climbing user rejection rate (where the workflow lets users override AI outputs) or declining usage frequency in the monitoring data.

Prevention: Design the adoption reinforcement cadence before launch. The change lead role in the pod owns this. Monthly domain owner communication, visible wins tracked and shared, user feedback loops that demonstrably influence the product — these are the mechanisms that sustain adoption past day 60.

Type 4 — Scope and model drift

The AI workflow is extended informally — users start submitting inputs the system was not designed for, or someone adds a new data source without updating the evaluation baseline. The system starts being used for purposes it was not evaluated for, and quality degrades on those new use cases.

Prevention: Versioned scope documentation and evaluation baseline. Every change to the workflow’s scope triggers an evaluation update. The model versioning log captures what the system was expected to do at each version.

Type 5 — Governance and compliance exposure

An agentic AI system takes an action that turns out to be unauthorized, incorrect, or in violation of a regulatory requirement. Common in workflows where AI sends external communications, files documents with regulatory agencies, or modifies financial records.

Prevention: Tool-call logging, human-in-the-loop gates for irreversible actions, and the right-to-deploy review framework (Rewired, Ch. 34; Bain Phase 1 framework). This is not optional for Level 3-4 agentic autonomy — it is the difference between a governance-defensible deployment and a liability event. See wiki/agentic-ai-governance.md for the full framework.


The Five MLOps Components: Build Sequence

The five components are built in a deliberate sequence because later components depend on earlier ones: the production dashboard reports metrics the evaluation harness defines, and the promotion gate runs that same harness. Organizations that skip the sequence often end up with components that do not integrate.

Component 1 — Evaluation harness (build first, before production deployment)

The evaluation harness must exist before the first production deployment, not as a post-launch audit mechanism. It defines what “working correctly” means for this specific workflow.

Minimum viable evaluation harness:

  • A test set of 100–500 examples representing the full distribution of expected inputs (not just the clean, obvious cases — intentionally include edge cases and ambiguous inputs)
  • Metrics defined for each test case: task success rate (binary: did the AI do the right thing?), and workflow-specific accuracy metrics (clause extraction precision/recall for a contract workflow; retrieval accuracy for a RAG pipeline; response appropriateness for a customer-facing workflow)
  • Hallucination check: a sample of outputs reviewed for content not supported by the AI’s input context
  • Automated execution: the full test suite runs in under 30 minutes so it can be run before every production deployment
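A minimal harness along these lines might look like the following sketch. It assumes a contract-style workflow whose output is a set of extracted clause names; `EvalCase`, `evaluate`, and the keyword-based `toy_extractor` are hypothetical stand-ins, and a real system under test would replace the toy function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class EvalCase:
    case_id: str
    input_text: str
    expected: Set[str]  # items the system should extract for this input


def evaluate(system: Callable[[str], Set[str]], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case and return task success rate plus precision and recall."""
    if not cases:
        raise ValueError("the test set is empty")
    true_pos = false_pos = false_neg = successes = 0
    for case in cases:
        predicted = system(case.input_text)
        true_pos += len(predicted & case.expected)
        false_pos += len(predicted - case.expected)
        false_neg += len(case.expected - predicted)
        successes += int(predicted == case.expected)  # binary: exactly right or not
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {
        "task_success_rate": successes / len(cases),
        "precision": precision,
        "recall": recall,
    }


def toy_extractor(text: str) -> Set[str]:
    """Stand-in for the AI system: naive keyword matching."""
    keywords = {"indemnification", "termination", "confidentiality"}
    return {k for k in keywords if k in text.lower()}


cases = [
    EvalCase("clean", "Termination and confidentiality apply.",
             {"termination", "confidentiality"}),
    EvalCase("paraphrase", "Either party may end this agreement with notice.",
             {"termination"}),
    EvalCase("negation", "Indemnification is excluded; see confidentiality.",
             {"confidentiality"}),
]
print(evaluate(toy_extractor, cases))
```

The second and third cases are the deliberately awkward inputs the list above calls for: a paraphrase the system misses and a negated clause it wrongly extracts. The dictionary returned at launch is the evaluation baseline worth recording.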

The evaluation baseline problem: Most organizations pilot an AI workflow, declare it “good enough” based on a demo and a handful of manual spot checks, and launch. The evaluation baseline — the precise metrics on the test set at launch — is not recorded. When quality degrades, there is no baseline to compare against. Recording the evaluation baseline at launch is the single most underinvested MLOps practice.

Component 2 — Data pipeline monitoring (build before or concurrent with evaluation harness)

Automated checks on the data flowing into the AI workflow. The checks vary by data type but the minimum set is:

  • Volume check: Is the expected number of records arriving in each pipeline run? A 30%+ drop signals a source system issue.
  • Freshness check: Is the data current? Stale data (pipeline ran but source data was not updated) is a silent failure mode.
  • Schema consistency check: Are all expected fields present with expected data types? Schema changes in source systems are detected immediately.
  • Distribution check (quarterly, not daily): Has the statistical distribution of key input fields drifted from the launch baseline? Distribution drift predicts quality degradation before it appears in evaluation metrics.
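The volume, freshness, and distribution checks can each be expressed as a small function. The sketch below is illustrative only: the 24-hour freshness window is an assumption, and the population stability index (PSI) is one common drift measure chosen here for the example, not a method the checklist prescribes.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import List, Sequence


def volume_ok(actual_count: int, expected_count: int, max_drop: float = 0.30) -> bool:
    """False when the record count falls 30% or more below expectation."""
    return actual_count > expected_count * (1 - max_drop)


def is_fresh(latest_source_update: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the source data was updated within max_age (assumed: 24 hours)."""
    return now - latest_source_update <= max_age


def population_stability_index(baseline: Sequence[float], current: Sequence[float],
                               bins: int = 10) -> float:
    """PSI between two non-empty samples, binned on the baseline's range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


now = datetime(2025, 3, 10, 9, 0, tzinfo=timezone.utc)
launch_values = [float(v) for v in range(100)]
shifted_values = [v + 50.0 for v in launch_values]
print(volume_ok(actual_count=650, expected_count=1000))
print(is_fresh(datetime(2025, 3, 8, 9, 0, tzinfo=timezone.utc), now))
print(population_stability_index(launch_values, shifted_values))
```

A PSI of zero means the current sample fills the baseline bins in identical proportions; the alert threshold is a judgment call left to the team.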

Alert routing: pipeline failures alert the data engineer (immediate) and the workflow designer (within 24 hours). The domain owner does not need pipeline alerts — they need the business impact summary, which flows from the domain owner briefing cadence, not from operational alerts.

Component 3 — Model and prompt registry (build before or concurrent with first production deployment)

A versioned log of every model, prompt template, and agent configuration in production. At mid-market minimum, this is a structured document (Notion, Confluence, or GitHub README) with the following fields per entry:

  • Deployment date
  • Model provider and version (e.g., OpenAI GPT-4o, version 2025-01-01)
  • Prompt template version (v1, v2, etc.) with change summary
  • Evaluation baseline metrics at deployment
  • Owner (AI engineer name)
  • Last reviewed date

The registry enables rollback: when a model update or prompt change degrades performance, the previous version is retrieved and redeployed in minutes, not hours. Without a registry, rollback requires reconstructing what was running before — a research exercise under pressure.
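Even when the registry lives in a document, its fields map onto a simple record structure. The following sketch assumes that structure; `RegistryEntry`, `rollback_target`, the owner names, and the metric values are hypothetical, and the model version string repeats the example given in the field list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RegistryEntry:
    deployment_date: date
    model_provider: str
    model_version: str
    prompt_version: str
    change_summary: str
    baseline_metrics: Dict[str, float]
    owner: str
    last_reviewed: date


def rollback_target(entries: List[RegistryEntry]) -> Optional[RegistryEntry]:
    """Return the entry deployed immediately before the current one, if any."""
    ordered = sorted(entries, key=lambda e: e.deployment_date)
    return ordered[-2] if len(ordered) >= 2 else None


registry = [
    RegistryEntry(date(2025, 2, 3), "OpenAI", "GPT-4o 2025-01-01", "v1",
                  "initial launch", {"task_success_rate": 0.91},
                  "A. Engineer", date(2025, 2, 3)),
    RegistryEntry(date(2025, 3, 17), "OpenAI", "GPT-4o 2025-01-01", "v2",
                  "tightened output format", {"task_success_rate": 0.93},
                  "A. Engineer", date(2025, 3, 17)),
]
previous = rollback_target(registry)
print(previous.prompt_version if previous else "nothing to roll back to")
```

The point of the structure is that the rollback question ("what was running before?") becomes a lookup rather than a reconstruction.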

For organizations running multiple AI workflows, the registry also serves as the audit trail that governance and compliance functions require. Regulatory inquiries about AI decision-making become answerable.

Component 4 — Production monitoring dashboard (build during the first sprint after production launch)

Metrics visible to the domain owner, not just to engineering. The dashboard must answer three questions:

  1. Is the AI running? (Uptime, error rate, pipeline status)
  2. Is it performing correctly? (Task success rate from the evaluation harness, run on the most recent week’s production outputs)
  3. Is it moving the business metric? (The KPI the domain owner owns: handle time, error rate, processing cost, conversion rate — whichever line item the workflow is designed to move)

The business metric is the most important and the least commonly implemented. Organizations that only track technical performance metrics (uptime, latency, API error rate) often declare AI “working” while it is producing no business impact.

Dashboard cadence: the domain owner should review the business metric trend weekly in steady state — 5 minutes, not a meeting. Engineering reviews the full dashboard daily during the first 30 days post-launch, weekly thereafter.

Component 5 — Deployment automation and rollback (build before the second production deployment)

The ability to promote a new version to production without manual intervention and to roll back to the previous version within 30 minutes if monitoring detects a regression.

Minimum viable:

  • Two environments: staging (exact replica of production for testing) and production
  • Automated promotion gate: the evaluation harness must pass at >X% on all metrics before a change can be promoted from staging to production
  • Documented rollback procedure: step-by-step instructions that any member of the pod can execute, not knowledge in the AI engineer’s head
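The promotion gate reduces to a comparison between staging evaluation results and per-metric minimums. In this sketch the threshold values are placeholders, since the actual cutoff is workflow-specific; `promotion_gate` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def promotion_gate(metrics: Dict[str, float],
                   thresholds: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Pass only if every thresholded metric is reported and strictly above its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: not reported")
        elif value <= minimum:
            failures.append(f"{name}: {value:.1f} is not above {minimum:.1f}")
    return (not failures, failures)


# Placeholder thresholds (percent); real values are set per workflow.
thresholds = {"task_success_rate": 90.0, "extraction_recall": 85.0}
staging_results = {"task_success_rate": 93.2, "extraction_recall": 84.0}

passed, reasons = promotion_gate(staging_results, thresholds)
print("promote" if passed else "blocked: " + "; ".join(reasons))
```

Treating a missing metric as a failure keeps a partially broken evaluation run from waving a change through.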

Production standard:

  • Canary release: new versions deployed to 5% of traffic first, monitored for 24–48 hours before full promotion
  • Feature flags: individual workflow components can be enabled/disabled without a full deployment
  • Shadow mode testing: new version runs in parallel with production (not affecting outputs) for evaluation comparison
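One common way to send a fixed share of traffic to a canary is to hash a stable request key into buckets. The sketch below assumes such a key exists (a user or session ID, for instance); `routes_to_canary` is an illustrative name, and managed platforms typically provide their own traffic-splitting mechanism.

```python
import hashlib


def routes_to_canary(request_key: str, canary_percent: float = 5.0) -> bool:
    """Deterministically assign a key to the canary for roughly canary_percent of keys."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100


# The same key always lands on the same side, so a user sees one version consistently.
keys = [f"user-{i}" for i in range(20_000)]
share = sum(routes_to_canary(k) for k in keys) / len(keys)
print(f"canary share: {share:.3f}")
```

Hashing rather than random sampling keeps the assignment stable across requests, which makes canary metrics easier to compare against the rest of production.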

The deployment automation investment pays off at the second production deployment. Without it, every change is a manual, risky event. With it, the pod ships improvements weekly.


The Agentic AI Escalation: Additional Requirements for Level 3-4 Automation

The Rewired four-level automation framework (Ch. 5, Exhibit 5.1) — individual augmentation, task automation, agentic workflows, agentic systems — represents an escalating operational requirement, not just an escalating capability level.

Levels 1 and 2 (augmentation and task automation) produce outputs that humans review before action. The operational failure mode is a bad document, which is recoverable.

Level 3 (agentic workflows) and Level 4 (agentic systems) take actions in external systems. The operational failure mode is a wrong booking, misfiled document, unauthorized transaction, or incorrect communication. These may not be recoverable, and they create operational and legal exposure.

Additional MLOps requirements for Level 3-4 agentic deployments:

Tool-call audit log. Every API call the agent makes — to a CRM, ERP, calendar, email system, or external service — is logged with: timestamp, agent ID, tool called, input sent, output received, and the triggering decision context. This is the investigation tool when something goes wrong. Without it, tracing a wrongly executed action is archaeology.
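The six logged fields translate directly into a record type written as one JSON object per line. This is a sketch under assumptions: `ToolCallRecord` and `append_tool_call` are hypothetical names, the tool and agent identifiers are invented, and an in-memory buffer stands in for an append-only log file.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, TextIO


@dataclass
class ToolCallRecord:
    timestamp: str
    agent_id: str
    tool: str
    input_sent: Dict[str, Any]
    output_received: Dict[str, Any]
    decision_context: str


def append_tool_call(stream: TextIO, record: ToolCallRecord) -> None:
    """Write the record as a single JSON line so the log stays easy to search."""
    stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


buffer = io.StringIO()  # stand-in for an append-only log file
append_tool_call(buffer, ToolCallRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    agent_id="scheduler-agent-01",
    tool="calendar.create_event",
    input_sent={"attendee": "client@example.com", "start": "2025-04-02T10:00"},
    output_received={"status": "created", "event_id": "evt_123"},
    decision_context="User asked to book the follow-up call from ticket 4521",
))
print(buffer.getvalue())
```

Capturing the decision context alongside the call is what lets an investigator answer why the agent acted, not only what it did.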

Reversibility classification. Before deployment, every action type in the agentic workflow is classified: reversible (can be undone within N minutes), difficult-to-reverse (requires human effort), or irreversible (cannot be undone). The classification drives the oversight architecture: irreversible actions require human approval before execution. This is not a policy preference — it is the governance requirement documented in IBM IBV Agentic Governance Playbook 2026 and Bain’s agentic deployment framework.

Human-in-the-loop gates for irreversible actions. The gateway between “AI recommends” and “AI executes” on irreversible actions. The gate can be synchronous (explicit human approval before execution) or asynchronous (a 24-hour window during which a human can cancel). The choice depends on action frequency, action consequence, and user tolerance for approval friction. See wiki/hitl-deployment-pattern.md for the full architecture.
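The reversibility classes and the two gate modes can be combined into a single execution check. The sketch below is one possible reading: it gates only irreversible actions and lets the other two classes through, which is an assumption, since a team may well choose to gate difficult-to-reverse actions too. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    DIFFICULT = "difficult-to-reverse"
    IRREVERSIBLE = "irreversible"


class GateMode(Enum):
    SYNCHRONOUS = "synchronous"    # explicit approval before execution
    ASYNCHRONOUS = "asynchronous"  # executes unless cancelled within the window


@dataclass
class PendingAction:
    action_type: str
    reversibility: Reversibility
    requested_at: datetime
    approved: bool = False
    cancelled: bool = False


def may_execute(action: PendingAction, mode: GateMode, now: datetime,
                cancel_window: timedelta = timedelta(hours=24)) -> bool:
    """Decide whether the agent may execute the action right now."""
    if action.cancelled:
        return False
    if action.reversibility is not Reversibility.IRREVERSIBLE:
        return True  # assumption: only irreversible actions are gated in this sketch
    if mode is GateMode.SYNCHRONOUS:
        return action.approved
    return now - action.requested_at >= cancel_window


filing = PendingAction("file_regulatory_document", Reversibility.IRREVERSIBLE,
                       requested_at=datetime(2025, 4, 1, 9, 0))
print(may_execute(filing, GateMode.SYNCHRONOUS, now=datetime(2025, 4, 1, 9, 5)))
print(may_execute(filing, GateMode.ASYNCHRONOUS, now=datetime(2025, 4, 2, 9, 5)))
```

The synchronous mode trades speed for certainty; the asynchronous mode keeps throughput high but relies on someone actually reviewing the queue inside the window.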

Agent performance dashboard, separated from the technical dashboard. The workflow owner (not just engineering) needs to see: what types of actions the agent is taking, at what volume, with what success rate, and what error patterns are appearing. This is a different view from the technical uptime dashboard — it is a behavioral audit of the agent’s decisions.


Build vs. Buy: The MLOps Decision Framework

The mid-market default for each MLOps component:

| Component | Mid-Market Default | Leading Platforms | When to Build Custom |
|---|---|---|---|
| Evaluation harness | Build (test cases are workflow-specific) | None (this must be custom) | Always custom to the workflow |
| Data pipeline monitoring | Buy / configure | Great Expectations, dbt tests, Snowflake data quality, Fivetran alerts | When on-premises data with compliance requirements |
| Model / prompt registry | Build (lightweight document system is sufficient) | MLflow Model Registry, Weights & Biases, Vertex AI Model Registry | When running >10 models in production |
| Production monitoring | Buy / configure | Grafana, Datadog, platform-native | When security isolation is required |
| Deployment automation | Buy / configure | GitHub Actions, Vertex Pipelines, Azure DevOps, Databricks Jobs | When specialized compliance gating is required |
| Agentic audit logs | Build (tool-call schema is workflow-specific) | Partial: LangSmith, Arize AI, WhyLabs | Always need custom log schema for governance |

The Grant Thornton 2026 warning: 60%+ of mid-market AI platform builds never fully escape vendor dependency. The correct posture is: buy and configure the commodity components (pipeline monitoring, deployment automation, production dashboards), and build the workflow-specific components (evaluation harness, audit logs) on top. Never build what can be purchased; never purchase what must be tailored.


The 90-Day MLOps Implementation Sprint

For a mid-market organization launching its first production AI workflow, a 90-day sprint to production-grade MLOps:

Days 1–30: Foundation

  • Define evaluation metrics for the workflow (task success rate, accuracy metrics, hallucination check)
  • Build the evaluation test set (100+ examples)
  • Run the evaluation baseline against the pilot-stage AI and record it
  • Set up data pipeline monitoring with volume, freshness, and schema checks
  • Create the model/prompt registry (can be a structured document)

Days 31–60: Production launch

  • Deploy to staging, run the evaluation harness, confirm metrics above threshold
  • Set up the production monitoring dashboard with business metric tracking
  • Deploy to production
  • Configure automated evaluation harness to run weekly
  • Activate pipeline monitoring alerts

Days 61–90: Stabilization and agentic preparation (if applicable)

  • Review first month of production performance data with domain owner
  • Document any deployment issues and update the registry
  • If the workflow includes agentic actions: implement tool-call audit logging and HITL gates for irreversible actions
  • Train a backup operator on the rollback procedure (eliminates a single point of failure)
  • Document the MLOps playbook for the next pod to inherit

Key Metrics: What to Track at Month 3

At month 3 post-launch, a CIO should be able to answer these questions from a dashboard, not from a meeting:

  1. Task success rate this week vs. launch baseline. If it has declined >5 percentage points, investigate.
  2. Data pipeline health: last successful run, record count vs. expected, schema errors in last 30 days.
  3. User adoption: % of eligible users actively using the workflow this week vs. month 1.
  4. Business metric delta: the KPI the domain owner owns, current vs. pre-deployment baseline.
  5. Version currency: is the deployed model/prompt version the one we intended to have in production?
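The five questions can be answered from a single snapshot of values the earlier components already produce. The sketch below assumes such a snapshot exists; `Month3Snapshot`, `month3_report`, and every sample value are hypothetical, and the KPI is reported without judging direction because whether up or down is good depends on the metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Month3Snapshot:
    success_rate_now: float        # percent, this week
    success_rate_baseline: float   # percent, recorded at launch
    last_successful_run: str
    records_last_run: int
    records_expected: int
    schema_errors_30d: int
    active_users_now: int
    active_users_month1: int
    eligible_users: int
    kpi_now: float
    kpi_pre_deployment: float
    deployed_version: str
    intended_version: str


def month3_report(s: Month3Snapshot) -> List[str]:
    """Return one line per question, flagging the two conditions the text names."""
    drop_pp = s.success_rate_baseline - s.success_rate_now
    adoption_now = 100.0 * s.active_users_now / s.eligible_users
    adoption_m1 = 100.0 * s.active_users_month1 / s.eligible_users
    return [
        f"1. Task success: {s.success_rate_now:.1f}% vs baseline {s.success_rate_baseline:.1f}%"
        + (" -> INVESTIGATE" if drop_pp > 5.0 else ""),
        f"2. Pipeline: last run {s.last_successful_run}, "
        f"{s.records_last_run}/{s.records_expected} records, "
        f"{s.schema_errors_30d} schema errors in 30 days",
        f"3. Adoption: {adoption_now:.0f}% of eligible users vs {adoption_m1:.0f}% in month 1",
        f"4. Business KPI: {s.kpi_now:g} vs pre-deployment {s.kpi_pre_deployment:g}",
        "5. Version: "
        + ("as intended" if s.deployed_version == s.intended_version else "MISMATCH"),
    ]


snapshot = Month3Snapshot(84.0, 91.0, "2025-06-02", 980, 1000, 0,
                          120, 150, 200, 6.4, 8.1, "v2", "v3")
print("\n".join(month3_report(snapshot)))
```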

If a CIO cannot answer all five questions within 5 minutes of opening the dashboard, the MLOps instrumentation is incomplete. That is not a technology gap — it is a governance gap. The data exists somewhere; the dashboard is the discipline of surfacing it to the decision-maker who needs it.