OpenAI Enterprise Case Studies (2025–2026): What the Evidence Actually Shows

Brandon Sneider · April 2026

These case studies are vendor-published and represent selected wins with no control group and no independent verification.

Executive Summary

OpenAI has published case studies from Morgan Stanley, Lowe’s, Balyasny Asset Management, BBVA, Moderna, Commonwealth Bank, Booking.com, and Thermo Fisher Scientific. The strongest show real results in a specific, narrow workflow, not organization-wide transformation; the weakest report only deployment figures or a partnership announcement.
Every case study is vendor-published, features self-selected winners, has no control group, and was not independently verified. They represent the ceiling of what is achievable, not the median.
The contrast with independent evidence is stark: METR’s RCT (n=16 experienced developers, July 2025) found participants were 19% slower with AI tools while believing they were 20% faster. The CMU analysis of 807 GitHub repos found cognitive complexity increased 39% in AI-assisted projects.
The cases that deliver measurable ROI share a common pattern: narrow task scope, clear input/output definition, workflow redesign before deployment, and human oversight maintained. Cases without these conditions produce adoption metrics, not business outcomes.
For executives evaluating these cases: the question is not “did it work for Lowe’s?” but “which of the conditions that made it work for Lowe’s does my organization have in place?”

Methodology Caveat (Read Before Proceeding)

These case studies are vendor-published and represent selected wins with no control group and no independent verification. Cross-reference against: METR’s RCT (experienced developers 19% slower, July 2025, n=16, 246 tasks); CMU analysis (39% cognitive complexity increase in AI-assisted repos, 807 repos studied through August 2025); Denis Atlan’s 200-deployment B2B analysis (median +159.8% ROI over 24 months, but 27% failure rate, with success concentrated in projects where training consumed 25%+ of budget).
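
To make the METR perception gap concrete, the sketch below converts the two percentages into task time. It assumes “19% slower” means completion time rose 19% and “20% faster” means developers believed completion time fell 20%. The 60-minute baseline is a hypothetical chosen for illustration, not a figure from the study.

```python
# Hypothetical baseline: minutes a task takes without AI (not from the METR study).
BASELINE_MIN = 60.0

MEASURED_CHANGE = 0.19    # measured: completion time 19% longer with AI
PERCEIVED_CHANGE = -0.20  # perceived: developers believed time fell by 20%


def time_with_ai(baseline: float, relative_change: float) -> float:
    """Return task time after applying a relative change to the baseline."""
    return baseline * (1.0 + relative_change)


measured = time_with_ai(BASELINE_MIN, MEASURED_CHANGE)
perceived = time_with_ai(BASELINE_MIN, PERCEIVED_CHANGE)
gap_points = (MEASURED_CHANGE - PERCEIVED_CHANGE) * 100

print(f"measured:  {measured:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"perception gap: {gap_points:.0f} percentage points of baseline time")
```

Under these assumptions, self-reported time savings and measured time can point in opposite directions, which is why the self-reported figures in the cases below deserve separate scrutiny.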

The pattern across this independent research: AI delivers real gains in specific, well-scoped tasks when the workflow is redesigned first. It produces measurement theater when deployed broadly without process change.

Financial Services: The Clearest Evidence

Morgan Stanley — Wealth Management Knowledge Retrieval

Morgan Stanley deployed two OpenAI-powered tools for its financial advisors: the AI @ Morgan Stanley Assistant (September 2023) and AI @ Morgan Stanley Debrief (a GPT-4-powered meeting summary tool). The institution reports 98% advisor adoption across its wealth management division.

The measurable result is specific: query resolution expanded from 7,000 answerable questions to effectively any question across 350,000+ documents. Response time dropped from 30+ minutes of manual search to seconds. Document retrieval efficiency improved from 20% to 80%.

What to notice about this case: The task is knowledge retrieval — well-defined input, well-defined output, clear quality standard. It did not require advisors to change how they advise clients; it changed how they find information before the client conversation. The 98% adoption reflects a tool that fit into an existing workflow rather than requiring a new one. This is the structural condition most enterprise AI deployments lack.

Source credibility: MEDIUM — OpenAI-published, no independent measurement, Morgan Stanley’s own press releases confirm adoption. The specific numbers are plausible given the task scope.

Balyasny Asset Management — Investment Research Acceleration

Balyasny (hedge fund, $29B AUM) built a custom AI research platform on top of OpenAI’s GPT-5.4 model family, deployed to investment teams. OpenAI published this case study in March 2026. The headline metric: a central bank speech analyst reduced macroeconomic scenario analysis from 2 days to approximately 30 minutes.

OpenAI reports 95% of investment teams actively using the platform. Additional applications include a Merger Arbitrage Superforecaster agent that continuously monitors deal probabilities, replacing bespoke spreadsheets.

What to notice about this case: Balyasny invested years in model evaluation, selecting GPT-5.4 empirically for multi-step planning, tool execution, and hallucination reduction. The 2-days-to-30-minutes result involves a single analyst performing a single type of analysis (central bank speech processing). It does not represent the average across all users or all tasks. The 95% adoption figure measures access and usage, not outcome improvement per analyst.
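
The difference between a gain on one task and a gain across an analyst’s whole workload can be sketched with Amdahl’s law, which bounds overall speedup by the share of time the accelerated task occupies. The sketch assumes a “day” is 8 working hours (so 2 days is 16 hours) and uses hypothetical task shares; neither assumption comes from the case study.

```python
def overall_speedup(task_share: float, task_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `task_share` of time is accelerated."""
    if not 0.0 <= task_share <= 1.0:
        raise ValueError("task_share must be between 0 and 1")
    if task_speedup <= 0:
        raise ValueError("task_speedup must be positive")
    return 1.0 / ((1.0 - task_share) + task_share / task_speedup)


# Assumption: 2 days = 16 working hours; 30 minutes = 0.5 hours.
TASK_SPEEDUP = 16.0 / 0.5  # 32x on the single highlighted task

# Hypothetical shares of an analyst's time spent on that task.
for share in (0.05, 0.10, 0.25, 0.50):
    gain = overall_speedup(share, TASK_SPEEDUP)
    print(f"task share {share:.0%}: overall speedup {gain:.2f}x")
```

Under these assumptions, a 32x gain on a task that fills a tenth of an analyst’s time yields roughly an 11% overall gain, which is why the headline figure says little about average productivity.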

Source credibility: MEDIUM — OpenAI-published, March 2026. No independent verification of the 2-days-to-30-minutes claim or what percentage of workflows see similar gains. The Hedgeweek reporting notes separately that “Balyasny’s AI outperforms OpenAI in financial applications” — suggesting Balyasny also uses internal models, complicating attribution.

BBVA — Banking Workflow Transformation

BBVA started with 3,300 ChatGPT Enterprise accounts in May 2024 and expanded to 120,000 employees by late 2025. OpenAI reports employees saved close to 3 hours per week on routine tasks, and 80%+ engage with tools daily. BBVA created 20,000+ custom GPTs.

The most specific data point in the BBVA case: their Mexican legal compliance workflow (“bastanteo” — verifying company representative authority before transactions) processed 9,000+ queries annually through automation and redeployed the equivalent of 3 FTE toward producing 11,000+ bastanteos per year, delivering 26% of the Legal Services division’s annual savings KPI.

What to notice about this case: The bastanteo example is the most credible metric in the BBVA case — it has a defined process, a measurable input/output, and a redeployment outcome (3 FTE shifted, not eliminated). The 3-hours-per-week savings figure is self-reported and not independently audited. BBVA’s 80% daily engagement rate reflects mandated corporate deployment, which does not validate that engagement produces proportional value.

Source credibility: MEDIUM — OpenAI-published. The bastanteo metric is specific and plausible. The headline hours-saved figure is unverified.

Retail: Scale Without Uniform Depth

Lowe’s — Customer-Facing AI at 1,700+ Stores

Lowe’s deployed “Mylow” (customer-facing AI) and “Mylow Companion” (store associate AI) simultaneously across all 1,700+ U.S. stores in May 2025. The tools handle approximately 1 million customer questions per month.

OpenAI reports: conversion rates more than doubled when customers engaged with Mylow during online visits; in-store customer satisfaction scores rose 200 basis points when associates used Mylow Companion. An OpenAI advisory engagement also identified $1.2M in AI cost avoidance over 24 months by restructuring Lowe’s API spend — reducing projected API costs from $2.4M to $1.2M while maintaining performance.

Lowe’s engineering team reports “double-digit productivity gains” from AI-assisted development.

What to notice about this case: Conversion rate doubling during Mylow-engaged sessions is a real metric, but self-selected: customers who engage with Mylow are already in a higher-intent state than average visitors. A 200-basis-point rise in customer satisfaction scores is meaningful but narrow. The $1.2M cost avoidance figure relates to API pricing optimization, not to business outcome improvement. “Double-digit productivity gains” in engineering lacks a numerator, denominator, and control group.
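
The selection effect can be illustrated with a small calculation. All numbers below are hypothetical, not Lowe’s data: visitors are split into high-intent and low-intent segments, the assistant is given zero causal effect on conversion, and high-intent visitors are simply more likely to engage. Under those assumptions the engaged group still converts at more than twice the rate of the non-engaged group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    visitors: int           # number of visitors in the segment
    conversion_rate: float  # same whether or not they use the assistant
    engage_rate: float      # share of the segment that uses the assistant


# Hypothetical segments; the assistant has NO causal effect in this model.
HIGH_INTENT = Segment(visitors=20_000, conversion_rate=0.10, engage_rate=0.30)
LOW_INTENT = Segment(visitors=80_000, conversion_rate=0.02, engage_rate=0.03)


def observed_rates(segments: list[Segment]) -> tuple[float, float]:
    """Return (engaged, non-engaged) conversion rates pooled across segments."""
    eng_visitors = sum(s.visitors * s.engage_rate for s in segments)
    eng_orders = sum(s.visitors * s.engage_rate * s.conversion_rate for s in segments)
    non_visitors = sum(s.visitors * (1 - s.engage_rate) for s in segments)
    non_orders = sum(
        s.visitors * (1 - s.engage_rate) * s.conversion_rate for s in segments
    )
    return eng_orders / eng_visitors, non_orders / non_visitors


engaged, not_engaged = observed_rates([HIGH_INTENT, LOW_INTENT])
print(f"engaged: {engaged:.2%}, not engaged: {not_engaged:.2%}")
print(f"apparent lift: {engaged / not_engaged:.2f}x")
```

Separating the tool’s effect from the mix of who uses it requires a comparison within the same intent level, or a randomized holdout. The published figures do not say whether either was done.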

Source credibility: MEDIUM — OpenAI-published. The conversion and satisfaction metrics are plausible and specific enough to be useful, but the selection effect in customer engagement data is not disclosed.

Life Sciences: Partnership Announcements vs. Demonstrated Results

Moderna — Enterprise-Wide Deployment

Moderna deployed ChatGPT Enterprise across 80%+ of its workforce and created 750 custom GPTs within two months of launch. The headline case study application is “Dose ID,” a data-analysis assistant for clinical study teams that helps evaluate optimal vaccine doses.

What to notice about this case: Moderna’s case study is primarily a deployment announcement, not a demonstrated-outcomes study. The 80% adoption and 750 GPTs in two months reflect organizational mandate and ease of initial deployment. No clinical outcome, drug development timeline acceleration, or cost metric is disclosed in the OpenAI case study. The Dose ID application “helps evaluate” doses — the outcome of that assistance is not quantified.

Source credibility: LOW for business impact claims — OpenAI-published, deployment metrics only. Moderna’s clinical pipeline progress (Phase 2/3 trials for personalized cancer vaccines) is separately documented but is not attributable to the OpenAI partnership specifically.

Thermo Fisher Scientific — Strategic Partnership (2025)

Thermo Fisher Scientific (130,000 employees, $40B+ revenue) announced a strategic collaboration with OpenAI in October 2025. The partnership focuses on embedding OpenAI APIs into product development, service delivery, customer engagement, and operational efficiency, with initial focus on the PPD clinical research business (trial data analysis, study design optimization, patient recruitment).

What to notice about this case: This is a partnership announcement, not a results case study. No specific outcomes had been disclosed as of Q1 2026.

Source credibility: N/A — No outcome data available to evaluate.

Banking and Travel: Early Signals

Commonwealth Bank of Australia — Scam Prevention and Internal Adoption

Commonwealth Bank deployed ChatGPT Enterprise to nearly 50,000 employees and reports that AI tools helped cut customer scam losses by 50%, with a separate figure citing 70% reduction in scam losses through real-time GenAI and predictive AI. Call center wait times fell 40%.

What to notice about this case: The scam loss reduction figures (50% and 70% cited in different sources) likely conflate GenAI with broader predictive AI systems that CBA has been building for years. The specific contribution of the OpenAI partnership to scam prevention vs. existing ML infrastructure is not disaggregated. The 40% call center wait time reduction has no baseline date, no volume context, and no connection to specific AI tools.

Source credibility: LOW for specific attributions — Multiple conflicting figures, unclear attribution between OpenAI tools and pre-existing AI infrastructure.

Booking.com — Travel AI at Scale

Booking.com deployed an AI Trip Planner and Smart Filters using OpenAI’s GPT models, building the first prototype in 10 weeks. CEO Glenn Fogel called it “very encouraging” with “very early signals” around impact. Metrics being tracked: faster search, better conversion, lower cancellation rates, customer satisfaction.

What to notice about this case: Booking.com’s own leadership describes results as early-stage. No specific conversion or revenue metric has been published. This is an honest case study in one respect: the company has not overstated the evidence.

Source credibility: LOW for business outcomes — Self-described “early signals” from OpenAI-published case study. No quantified results disclosed.

Key Data Points

Company	Industry	Metric Claimed	Scope	Credibility
Morgan Stanley	Wealth Management	98% advisor adoption; 7K answerable questions → any question across 350K+ documents	Specific workflow (knowledge retrieval)	MEDIUM
Balyasny	Hedge Fund	2 days → 30 min (one analyst task); 95% team adoption	Custom platform, single task type highlighted	MEDIUM
BBVA	Banking	~3 hrs/week saved; 80% daily engagement; 26% of Legal Services savings KPI	Broad deployment + one specific process	MEDIUM
Lowe’s	Retail	>2x conversion rate (engaged customers); +200 bps CSAT; ~1M queries/month	Customer-facing + associate tools	MEDIUM
Moderna	Biopharma	80% adoption; 750 custom GPTs in 2 months	Deployment metrics only	LOW
Commonwealth Bank	Banking	50–70% scam loss reduction; 40% call wait reduction	Unclear attribution to OpenAI vs. existing AI	LOW
Booking.com	Travel	“Very encouraging early signals”	Not quantified	LOW
Thermo Fisher	Life Sciences	Partnership announced	No outcomes disclosed	N/A

What This Means for Your Organization

The pattern across these cases is consistent. The cases with the strongest evidence — Morgan Stanley, BBVA’s legal workflow, Lowe’s customer conversion — share three conditions: a specific, well-bounded task with clear success criteria; workflow redesign before deployment (not adding AI to an existing broken process); and measurement that connects AI usage to a business output rather than a usage metric.
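
One way to apply these three conditions to an internal pilot is as a simple checklist. The sketch below is a hypothetical rubric, not a method used by OpenAI or the companies profiled; the class, the field names, and the example pilots are illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PilotChecklist:
    """Hypothetical checklist; field names are illustrative, not from the case studies."""
    bounded_task: bool         # specific, well-bounded task with clear success criteria
    workflow_redesigned: bool  # workflow redesigned before deployment
    output_measured: bool      # measurement tied to a business output, not usage


def missing_conditions(pilot: PilotChecklist) -> list[str]:
    """Return the names of the conditions the pilot does not yet satisfy."""
    return [f.name for f in fields(pilot) if not getattr(pilot, f.name)]


# Hypothetical pilots for illustration only.
narrow_pilot = PilotChecklist(bounded_task=True, workflow_redesigned=True, output_measured=True)
broad_rollout = PilotChecklist(bounded_task=False, workflow_redesigned=False, output_measured=False)
partial_pilot = PilotChecklist(bounded_task=True, workflow_redesigned=False, output_measured=True)

for label, pilot in [("narrow", narrow_pilot), ("broad", broad_rollout), ("partial", partial_pilot)]:
    gaps = missing_conditions(pilot)
    print(f"{label}: {'all conditions met' if not gaps else 'missing ' + ', '.join(gaps)}")
```

The value of a checklist like this lies in forcing an explicit answer for each condition before deployment; the booleans are judgment calls, not measurements.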

The cases with weak evidence share a different pattern: broad rollout, high adoption numbers, and no clear line from AI usage to business outcome. 80% of Moderna employees using ChatGPT Enterprise is a deployment fact, not a performance fact.

For executives in the 200–5,000 employee range: the enterprise case study library is a catalog of what is possible under ideal conditions, not a forecast of what your organization will experience. The four-week engagement that saved Lowe’s $1.2M in API costs was the result of a careful scoping process, not the natural output of deploying OpenAI tools. The 2-days-to-30-minutes Balyasny result reflects a single analyst and one task type, after years of platform investment.

The most useful question these cases answer is not “does AI work?” — it works demonstrably in narrow, well-designed applications. The useful question is: “what level of process clarity, task scoping, and measurement infrastructure does my organization have compared to these cases?” If your answer is “less than BBVA’s” or “less than Morgan Stanley’s,” that gap is where your AI program will either deliver or fail.

If you’re mapping your own workflows against these case studies and want an outside read on where your conditions match — and where they don’t — that’s a conversation worth having. brandon@brandonsneider.com.

Sources

OpenAI / Morgan Stanley case study — “Morgan Stanley uses AI evals to shape the future of financial services,” openai.com/index/morgan-stanley/, 2025. Credibility: MEDIUM (vendor-published; adoption metrics corroborated by Morgan Stanley press releases).
OpenAI / Balyasny Asset Management case study — “How Balyasny Asset Management built an AI research engine,” openai.com/index/balyasny-asset-management/, March 2026. Credibility: MEDIUM (vendor-published; 2-days-to-30-min is one analyst, one task).
OpenAI / BBVA case studies — “BBVA puts AI in the hands of every team with OpenAI,” openai.com/index/bbva/, 2024–2025; “BBVA and OpenAI collaborate to transform global banking,” openai.com/index/bbva-collaboration-expansion/, December 2025. Credibility: MEDIUM (bastanteo metric is specific and plausible; headline hours-saved figures are unverified).
OpenAI / Lowe’s case study — “OpenAI Transforms Lowe’s Retail Experience,” aiplusinfo.com, 2025; Redress Compliance advisory case study, redresscompliance.com, 2025. Credibility: MEDIUM (conversion and CSAT metrics plausible; selection effects in engagement data not disclosed).
OpenAI / Moderna case study — “Collaboration with OpenAI: Transforming the Way We Work and Innovate Through AI,” modernatx.com; BioPharma Dive, “Moderna turns to AI to change how its employees work,” biopharmadive.com, 2024. Credibility: LOW for business impact (deployment metrics only).
OpenAI / Commonwealth Bank case study — “Commonwealth Bank of Australia builds AI fluency at scale,” openai.com/index/commonwealth-bank-of-australia/, 2025. Credibility: LOW for specific impact attributions (conflicting figures across sources; GenAI vs. ML attribution unclear).
OpenAI / Booking.com case study — “Booking.com and OpenAI personalize travel at scale,” openai.com/index/booking-com/, 2025; Booking Holdings Q3 2025 earnings, PhocusWire. Credibility: LOW for business outcomes (self-described early signals, no quantified results).
Thermo Fisher Scientific / OpenAI partnership announcement — “Thermo Fisher Scientific to Accelerate Life Science Breakthroughs with OpenAI,” ir.thermofisher.com, October 16, 2025. Credibility: N/A (partnership announcement, no outcome data).
METR RCT — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” metr.org/blog/2025-07-10, July 2025. n=16 experienced developers, 246 tasks. Credibility: HIGH (independent RCT, pre-registered, open-source methodology).
CMU / GitHub repository analysis — Analysis of 807 AI-assisted repos vs. 1,380 control repos, tracking through August 2025; reported via Rob Bowley / DevOps.com, December 2025. 39% increase in cognitive complexity in AI-assisted repos. Credibility: HIGH (independent observational study with control group).
Denis Atlan, “AI ROI Analysis: Evidence from 200 B2B Deployments (2022–2025)” — SSRN, 2025. Median ROI +159.8% over 24 months; 27% failure rate; 8-month median breakeven. Credibility: MEDIUM (independent, dataset publicly available under CC BY 4.0, methodology documented; French mid-market context, not directly comparable to U.S. enterprise).
