Category: AI Strategy

AI-Ready Data Governance: A Practical Trust Framework
You are ready to move an AI capability from pilot to production. The demo performs well, but the release review exposes harder questions: Which data produced this answer? Was the system allowed to use it? What happens when the data becomes stale, its meaning changes, or a customer challenges the result?

If you cannot answer those questions quickly, you do not have an AI model problem yet. You have a trust-chain problem. The practical goal of AI-ready governance is to make every important input identifiable, interpretable, permitted, observable, and recoverable without turning each release into a committee project.

Trust is a chain, not a model score

A strong evaluation score can tell you how a system behaved against a defined set of cases. It cannot prove that production data was collected lawfully, interpreted consistently, retrieved with the right permissions, or handled according to retention rules. Those are separate conditions, and a trustworthy AI product needs all of them.

My working definition is simple: trust is the justified ability to rely on an AI system for a defined use case and level of consequence. It is not a general property that a model earns once. Change the data, user, purpose, or action, and you need to validate the chain again.

Use four questions to expose where that chain is weak:
1. What did the system use? You should be able to trace the relevant inputs, transformations, retrieval results, and freshness state.
2. What did the data mean? Business definitions, schemas, labels, and event taxonomies should be consistent enough that producers and consumers interpret the signal the same way.
3. Was this use allowed? Data classification, consent, retention, purpose, and user permissions should travel with the data rather than disappear at the model boundary.
4. Can you prove the controls worked? Automated checks, policy decisions, exceptions, human reviews, and operational events should leave evidence suitable for investigation and audit.
A no to any one of these questions is a specific failure, not a vague lack of AI readiness. That distinction matters because the remedies differ. Missing or duplicate records require data-quality work. Conflicting definitions require semantic ownership. An unauthorized retrieval requires access-policy work. A grounded answer that still violates a product rule requires an output control. Retraining the model will not repair any of those failures.

When an output is challenged, diagnose it in this order: authorization, retrieved context, source meaning and freshness, transformation logic, then model behavior. Starting with the model encourages expensive experimentation while the actual defect remains upstream.
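
The diagnosis order above can be sketched as a small triage helper. This is a minimal illustration, not a prescribed implementation: the layer names and the idea of registering one boolean check per layer are assumptions made for the example.

```python
from typing import Callable, Optional

# Diagnosis order taken from the text; the identifiers are illustrative labels.
DIAGNOSIS_ORDER = [
    "authorization",
    "retrieved_context",
    "source_meaning_and_freshness",
    "transformation_logic",
    "model_behavior",
]


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first layer, in diagnosis order, whose check fails.

    A layer with no registered check is treated as unverified and reported,
    so a missing check cannot hide an upstream defect.
    """
    for layer in DIAGNOSIS_ORDER:
        check = checks.get(layer)
        if check is None or not check():
            return layer
    return None
```

Because the loop stops at the first failure, a defect in source freshness is reported before anyone spends time on model behavior.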

AI-ready does not mean making every table in the company pristine. It means the data used by a particular AI capability has an explicit purpose, accountable ownership, reliable semantics, enforceable policy, and enough lineage to reconstruct what happened. Treating data as a product turns those requirements into an operating responsibility instead of an indefinite cleanup program.

Build a minimum control plane around each data product

Start with the data products that feed production AI use cases. A data product may be an event stream, a document corpus, a labeled outcome set, or a derived feature set. For each one, create a contract that answers the questions a producer, consumer, reviewer, and incident responder will actually ask.
- Purpose: the decision, experience, or workflow the data is intended to support.
- Accountability: a data owner responsible for meaning and policy, plus an AI use-case owner responsible for how the product relies on it.
- Semantics: field definitions, schema, taxonomy, labels, deduplication rules, and known limitations.
- Quality: the agreed expectations for completeness, validity, uniqueness, and freshness, including what happens when an expectation is missed.
- Lineage: where the data originated, which transformations changed it, and which indexes, features, or contexts consume it.
- Policy: sensitivity classification, permitted purposes, access conditions, consent state, retention, masking, and deletion behavior.
- Evidence: the tests, logs, approvals, exceptions, and monitoring signals that demonstrate the contract is operating.
A quality SLA is only useful when it has a measurable condition and a failure response. Do not write that data should be timely. Define the freshness expectation appropriate to the use case, identify who receives the alert, and specify whether the AI product should continue, degrade, abstain, or escalate when the expectation is breached. The appropriate threshold will differ between use cases, so the contract should carry it rather than burying it in general policy.
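
A contract that carries its own freshness threshold and failure response might look like the sketch below. The class names, the `support_articles` product, the 24-hour threshold, and the owner names are all hypothetical; the four behaviors come from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Behavior(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class FreshnessSLA:
    max_age: timedelta    # the measurable condition
    alert_recipient: str  # who receives the alert
    on_breach: Behavior   # what the AI product does when the SLA is missed


@dataclass(frozen=True)
class DataContract:
    name: str
    purpose: str
    data_owner: str
    use_case_owner: str
    freshness: FreshnessSLA


def freshness_behavior(contract: DataContract, last_updated: datetime, now: datetime) -> Behavior:
    """Return CONTINUE while data is within its SLA, else the contract's breach behavior."""
    if now - last_updated <= contract.freshness.max_age:
        return Behavior.CONTINUE
    return contract.freshness.on_breach


# Hypothetical contract for illustration only.
contract = DataContract(
    name="support_articles",
    purpose="ground answers to customer billing questions",
    data_owner="knowledge-team",
    use_case_owner="support-ai-team",
    freshness=FreshnessSLA(timedelta(hours=24), "knowledge-oncall", Behavior.ABSTAIN),
)
now = datetime(2025, 1, 2, tzinfo=timezone.utc)
print(freshness_behavior(contract, now - timedelta(hours=25), now))
```

Keeping the threshold and the breach behavior on the contract means a different use case can choose a different response without changing the checking code.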

The next step is to enforce the contract at the moments when risk enters the system:
- At change time, run schema and data-contract checks in CI/CD. Pair tracking or taxonomy changes with code review so a renamed event or field cannot silently alter downstream behavior.
- At access time, apply least-privilege permissions through role- or attribute-based controls. Carry consent and purpose metadata into the decision, and apply masking or exclusion before sensitive values reach an index, training set, or prompt.
- At request time, filter retrieval using the requesting identity and use case. Record which eligible inputs informed the response and which policy decisions were applied.
- At output time, check for PII exposure, policy violations, unsafe actions, and adversarial behavior. Add human review where the consequence warrants judgment.
- At incident time, preserve a usable audit trail and invoke a defined response playbook with an owner, containment path, and recovery decision.
This is what it means to make approval workflows guardrails rather than gates. Schema checks, data contracts, least-privilege access, consent metadata, and policy-as-code can run inside the delivery workflow. A review board should handle material ambiguity and exceptions, not manually repeat checks that software can perform consistently.
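
As one example of a check that software can run consistently at change time, the sketch below compares a proposed schema against the fields a contract promises. It assumes schemas are represented as simple field-name-to-type mappings, which is a simplification; a real pipeline would likely read them from a schema registry or migration file.

```python
def schema_violations(contract_fields: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """List contract fields that a proposed schema removes or retypes.

    A renamed field shows up as a missing field, which is the silent
    downstream break the change-time check is meant to catch.
    Extra fields in the proposal are allowed.
    """
    violations = []
    for field, expected_type in sorted(contract_fields.items()):
        if field not in proposed:
            violations.append(f"missing field: {field}")
        elif proposed[field] != expected_type:
            violations.append(f"type changed: {field} {expected_type} -> {proposed[field]}")
    return violations
```

A CI job could fail the build whenever the returned list is non-empty and route the change to the data owner named in the contract.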

Do not apply one approval path to every AI change. Classify changes by data sensitivity, consequence, autonomy, reversibility, and external exposure. A low-consequence internal feature using non-sensitive data may be eligible for self-service release when its automated controls pass. A customer-facing capability using sensitive context needs designated review. A high-stakes or difficult-to-reverse action should retain meaningful human control.
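
The classification can be expressed as policy-as-code. The sketch below is one possible reading of the paragraph above, with assumptions stated plainly: the attribute names are hypothetical, a change that fails its automated controls is blocked, and designated review is triggered by either sensitive data or customer exposure, which is stricter than the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeProfile:
    sensitive_data: bool
    high_consequence: bool
    reversible: bool
    customer_facing: bool


def release_path(change: ChangeProfile, automated_controls_passed: bool) -> str:
    """Map a change to a release path, checking the most restrictive rule first."""
    if not automated_controls_passed:
        return "blocked"
    if change.high_consequence or not change.reversible:
        return "human_control"
    if change.sensitive_data or change.customer_facing:
        return "designated_review"
    return "self_service"
```

Ordering the rules from most to least restrictive ensures a change that qualifies for several paths always lands on the strictest one.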

Human-in-the-loop is not satisfied by placing a person at the end of the workflow. The reviewer needs the relevant context, source trace, risk flags, and authority to stop or change the action. Otherwise, the human is only absorbing accountability from a system they cannot evaluate.

Consent, lawful basis, retention, and regulatory duties depend on jurisdiction and the precise use of the data. Treat those as decisions to make with qualified privacy or legal counsel, then translate the decisions into technical rules. An architecture checklist is not a legal determination, and silently guessing can create customer and regulatory exposure.

Govern the full path from ingestion to feedback

Many AI governance programs focus on model output because that is what users see. The more persistent risks often begin earlier, when data is collected for one purpose, transformed without visible lineage, indexed under broader permissions, or reused as feedback without a deliberate policy decision. You need controls across the complete path.

Ingestion and preparation

Every input should arrive with enough metadata to determine its origin, owner, meaning, sensitivity, permitted use, retention rule, and freshness. If those attributes are unknown, label the gap rather than allowing an implicit assumption to harden into production behavior.
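
Labeling the gap can be as simple as the sketch below, which marks each missing attribute explicitly instead of dropping it. The attribute keys mirror the list in the paragraph above; the `UNKNOWN` marker and the function name are assumptions for illustration.

```python
from typing import Optional

REQUIRED_ATTRIBUTES = (
    "origin", "owner", "meaning", "sensitivity",
    "permitted_use", "retention_rule", "freshness",
)
UNKNOWN = "UNKNOWN"


def label_metadata_gaps(metadata: dict[str, Optional[str]]) -> tuple[dict[str, str], list[str]]:
    """Return metadata with every required attribute present, plus the list of gaps.

    Missing or blank attributes are set to an explicit UNKNOWN marker so
    downstream systems cannot mistake absence for a permissive default.
    """
    labeled: dict[str, str] = {}
    gaps: list[str] = []
    for attr in REQUIRED_ATTRIBUTES:
        value = (metadata.get(attr) or "").strip()
        if value:
            labeled[attr] = value
        else:
            labeled[attr] = UNKNOWN
            gaps.append(attr)
    return labeled, gaps
```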

Do not assume that permission to analyze data also grants permission to train on it, place it in a retrieval index, or expose it to another user through generated text. Evaluate each purpose explicitly. Apply deterministic masking and exclusions before the data crosses into a system where removal becomes harder to verify.
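
One way to make masking deterministic is a keyed hash, sketched below with Python's standard `hmac` module. The field classifications are hypothetical, and key management is out of scope here. Note that keyed hashing is pseudonymization rather than anonymization, so whether it is sufficient for a given purpose remains a privacy decision.

```python
import hashlib
import hmac

# Hypothetical classification; in practice this would come from the data contract.
SENSITIVE_FIELDS = {"email", "phone"}
EXCLUDED_FIELDS = {"free_text_notes"}


def mask_record(record: dict[str, str], key: bytes) -> dict[str, str]:
    """Drop excluded fields and replace sensitive values with a keyed digest.

    The same value and key always produce the same token, so joins and
    deduplication still work after masking.
    """
    masked = {}
    for field, value in record.items():
        if field in EXCLUDED_FIELDS:
            continue
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
            masked[field] = f"masked:{digest[:16]}"
        else:
            masked[field] = value
    return masked
```

Running this step before indexing or prompt assembly keeps raw sensitive values out of systems where removal is harder to verify.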

Data labeling deserves product-level attention. A label should have a documented definition, creation method, owner, and review path. If two teams use the same label to mean different outcomes, the model receives a conflict that infrastructure cannot resolve. If the definition changes, treat that change like an API change: identify consumers, test the impact, and preserve the lineage.

Retrieval and response

A retrieval-first architecture can improve grounding only when retrieval itself is governed. At query time, determine the requesting identity, account context, permitted purpose, and eligible sources before assembling model context. Do not retrieve broadly and hope the prompt tells the model what to ignore.
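
A minimal sketch of filtering before context assembly follows. The `Document` and `Requester` shapes are hypothetical, and a production system would typically push these conditions into the index query instead of filtering in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    account_id: str
    allowed_roles: frozenset[str]
    permitted_purposes: frozenset[str]


@dataclass(frozen=True)
class Requester:
    user_id: str
    account_id: str
    roles: frozenset[str]


def eligible_documents(docs: list[Document], requester: Requester, purpose: str) -> list[Document]:
    """Keep only documents the requester may use for this purpose.

    All three conditions must hold: same account, at least one shared role,
    and the purpose explicitly permitted on the document.
    """
    return [
        doc for doc in docs
        if doc.account_id == requester.account_id
        and doc.allowed_roles & requester.roles
        and purpose in doc.permitted_purposes
    ]
```

Only the returned documents are passed to the model, so the prompt never has to instruct the model to ignore material it should not have seen.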

Keep the context window relevant as well as permitted. Irrelevant, conflicting, or stale material can obscure the signal even when every document is technically accessible. Context management should therefore enforce both policy and quality: authorized does not automatically mean useful.

The system also needs an explicit failure behavior. When retrieval returns insufficient, conflicting, stale, or unauthorized material, decide whether the product should abstain, ask for clarification, use a constrained fallback, or route the case to a person. A fluent answer is not an acceptable default when the evidence is inadequate.
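
An explicit failure behavior can be captured as a small policy table. The mapping below is only an assumed example; the right response for each evidence state is a product decision that should be recorded per use case.

```python
from enum import Enum
from typing import Optional


class Evidence(Enum):
    SUFFICIENT = "sufficient"
    INSUFFICIENT = "insufficient"
    CONFLICTING = "conflicting"
    STALE = "stale"
    UNAUTHORIZED = "unauthorized"


class Behavior(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"
    CLARIFY = "ask_for_clarification"
    CONSTRAINED_FALLBACK = "constrained_fallback"
    ROUTE_TO_PERSON = "route_to_person"


# Assumed defaults for illustration only.
DEFAULT_POLICY = {
    Evidence.SUFFICIENT: Behavior.ANSWER,
    Evidence.INSUFFICIENT: Behavior.CLARIFY,
    Evidence.CONFLICTING: Behavior.ROUTE_TO_PERSON,
    Evidence.STALE: Behavior.CONSTRAINED_FALLBACK,
    Evidence.UNAUTHORIZED: Behavior.ABSTAIN,
}


def choose_behavior(evidence: Evidence, policy: Optional[dict] = None) -> Behavior:
    """Look up the configured behavior; abstain when the policy has no entry."""
    return (policy or DEFAULT_POLICY).get(evidence, Behavior.ABSTAIN)
```

Falling back to abstention when no rule matches keeps a fluent answer from becoming the accidental default.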

For a material production interaction, retain enough evidence to reconstruct the event:
- The requesting actor or account context, represented in a privacy-conscious way.
- The use case and relevant system configuration.
- The retrieved inputs and their lineage or version identifiers.
- The access, consent, retention, and policy decisions applied.
- The output risk flags and any automated intervention.
- The human decision or override when review was required.
- The time of the event and the retention class governing the evidence.
Audit data needs governance too. Prompt and response logs can contain the same sensitive information you are trying to control. Collect the minimum evidence required for the stated purpose, mask where possible, restrict access, and apply an explicit retention rule. Logging everything forever is not traceability; it is an unmanaged secondary dataset.
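
The evidence list and the retention requirement can be combined in one record type, sketched below. The field names, retention classes, and day counts are hypothetical. The record stores version identifiers and an already pseudonymized actor reference rather than raw prompts or identities.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical retention classes; actual periods are a policy decision.
RETENTION_DAYS = {"standard": 30, "regulated": 365}


@dataclass(frozen=True)
class AuditRecord:
    actor_ref: str                 # pseudonymized upstream, never a raw identity
    use_case: str
    config_version: str
    input_versions: tuple          # lineage or version identifiers, not content
    policy_decisions: tuple
    risk_flags: tuple
    human_decision: Optional[str]  # None when no review was required
    occurred_at: datetime
    retention_class: str

    def expires_at(self) -> datetime:
        return self.occurred_at + timedelta(days=RETENTION_DAYS[self.retention_class])

    def to_json(self) -> str:
        payload = asdict(self)
        payload["occurred_at"] = self.occurred_at.isoformat()
        payload["expires_at"] = self.expires_at().isoformat()
        return json.dumps(payload, sort_keys=True)


record = AuditRecord(
    actor_ref="masked:3f2a9c",
    use_case="billing_answers",
    config_version="2025-01-01.1",
    input_versions=("kb-article-12@v3",),
    policy_decisions=("access:allow", "consent:present"),
    risk_flags=(),
    human_decision=None,
    occurred_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    retention_class="standard",
)
print(record.to_json())
```

Computing the expiry at write time gives a deletion job a concrete date to act on, which is what separates a retention rule from an intention.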

Feedback and continuous improvement

User interactions, corrections, and business outcomes can improve an AI product, but they should not flow automatically into evaluation or training. First decide what the feedback represents, whether it is permitted for that purpose, how it will be labeled, and how long it should be retained.

Build evaluation cases from approved examples and segment results by the use case and risk that matter. A single average can hide a severe failure in a sensitive path. Pair model evaluations with source-quality checks, retrieval traces, policy results, human-review outcomes, and data-drift monitoring. That lets you distinguish a model regression from a context, permission, or data-contract regression.
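
To see how a single average can hide a severe failure, the sketch below computes pass rates per use case and risk class. The result shape, with `use_case`, `risk_class`, and `passed` keys, is an assumption for the example.

```python
from collections import defaultdict


def pass_rates_by_segment(results: list[dict]) -> dict[tuple[str, str], float]:
    """Compute the pass rate for each (use_case, risk_class) segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [passed, total]
    for result in results:
        segment = (result["use_case"], result["risk_class"])
        counts[segment][1] += 1
        if result["passed"]:
            counts[segment][0] += 1
    return {segment: passed / total for segment, (passed, total) in counts.items()}


# Nine passing low-risk cases and one failing high-risk case.
results = [{"use_case": "faq", "risk_class": "low", "passed": True}] * 9
results.append({"use_case": "refunds", "risk_class": "high", "passed": False})

overall = sum(r["passed"] for r in results) / len(results)
by_segment = pass_rates_by_segment(results)
```

In this sample the overall pass rate is 0.9 while the high-risk refunds segment passes none of its cases, which is the failure the segmented view exists to surface.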

Continuous monitoring, audit logs, PII checks, adversarial testing, drift detection, and incident playbooks make governance part of normal operations. The essential move is closing the loop: a failed case should lead to the layer that owns the defect, a corrective change, and a test that prevents the same failure from returning unnoticed.

Measure whether governance is earning trust

A dashboard labeled governance health is not useful unless each metric supports a decision. Start with measures that reveal coverage, control performance, delivery friction, and product consequences. Define each numerator, denominator, owner, and escalation condition so the number cannot drift into decorative reporting.
- Coverage: the share of production AI use cases with a named owner, current data contract, documented lineage, policy classification, and risk-based release path.
- Data reliability: schema-check pass rate, freshness-SLA compliance, duplicate or missing-data failures, and restoration time after a breach.
- Access and privacy: blocked unauthorized attempts, open policy exceptions, consent or retention violations, PII risk flags, and time to resolve each class of issue.
- Traceability: the share of reviewed outputs for which the team can reconstruct the relevant inputs, transformations, policy decisions, and reviewer actions.
- Evaluation: pass rates by use case and risk class, with failures attributed to data, retrieval, policy, model, or workflow layers.
- Delivery: lead time from a production-ready change to release, manual-review waiting time, and rework caused by late data or policy discovery.
- Consequences: incident frequency and severity, repeated failure modes, customer disputes, support escalations, and the product outcome the AI capability is meant to improve.
Read these measures in pairs. Faster release time with a growing backlog of unreviewed exceptions is not healthy acceleration. A high number of blocked access attempts may indicate that controls are working, that clients are misconfigured, or that an attempted abuse pattern is increasing. A rising evaluation score alongside worsening traceability means you know more about test performance but less about production accountability.
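
A metric definition that carries its numerator, denominator, owner, and escalation condition could be sketched as follows. The class name, the coverage figures, and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GovernanceMetric:
    name: str
    numerator: int
    denominator: int
    owner: str
    escalate_below: float

    def value(self) -> Optional[float]:
        """Return the ratio, or None when there is nothing to measure."""
        if self.denominator == 0:
            return None
        return self.numerator / self.denominator

    def needs_escalation(self) -> bool:
        """Escalate when the value is below threshold or cannot be computed."""
        value = self.value()
        return value is None or value < self.escalate_below


# Hypothetical coverage metric: 6 of 8 production use cases have a current contract.
coverage = GovernanceMetric("contract_coverage", 6, 8, "data-governance-lead", 0.9)
```

Treating an empty denominator as an escalation keeps an unmeasured control from reading as a healthy one.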

Do not collapse the dashboard into one trust score. A composite number hides which control failed and encourages teams to optimize the arithmetic. Executives can use a compact status view, but product, data, security, and privacy owners need the underlying measures and exception details.

Each material release should also produce an evidence packet containing the current data contract, automated test results, evaluation results, applicable approvals or exceptions, monitoring configuration, and incident owner. This does not need to become a large document. It needs to be complete enough that a reviewer can reproduce the release decision without relying on memory.

Finally, connect governance to outcomes rather than celebrating control activity. The relevant question is not how many reviews occurred. It is whether teams can ship responsibly with less rework, whether incidents and repeat failures decline, whether challenged outputs can be explained, and whether the intended product outcome improves without transferring hidden risk to the customer.

A 30-60-90 day path from policy to operating system

You do not need to finish an enterprise-wide catalog before improving one production path. Use a high-value AI capability as a vertical slice while the broader inventory progresses. That forces the governance design to survive real delivery constraints and produces reusable patterns for the next use case.

Days 1-30: expose the current state
- Inventory production AI use cases and the systems, datasets, indexes, outputs, and feedback loops they depend on.
- Map one priority flow from collection through transformation, retrieval, generation, action, and feedback.
- Assign accountable data and use-case owners. Record unknown ownership as a risk, not as a shared responsibility.
- Classify PII and other sensitive data, then document the current consent, purpose, lawful-basis, and retention decisions with the appropriate specialists.
- Define the first quality SLAs and failure behaviors for the inputs that can materially change the product result.
- Publish a concise operating policy that product managers, engineers, analysts, security partners, and reviewers can use during normal delivery.
The exit test is evidence, not document completion. For the priority use case, you should be able to name the owners, draw the data path, identify sensitive inputs, show the current permissions, and list the unresolved gaps that could block or constrain release.

Days 31-60: turn decisions into controls
- Standardize the metadata required for ownership, lineage, classification, consent, retention, quality, and permitted use.
- Implement fine-grained access controls and propagate the requesting identity into retrieval.
- Add consent-aware tracking, masking, and exclusions at the earliest enforceable point in the flow.
- Wire schema checks, data-contract tests, PII checks, and policy checks into CI/CD and runtime monitoring.
- Establish risk-based release paths so low-risk compliant changes can move without waiting for a general committee.
- Create the first governance dashboard using access attempts, exceptions, quality failures, risk flags, trace coverage, and delivery time.
The exit test is an end-to-end trace. Select a production interaction and reconstruct what the system used, what each important field meant, why access was allowed, which checks ran, and how an owner would respond if the result were challenged.

Days 61-90: close the learning and accountability loop
- Connect governance measures to outcomes such as release cycle time, avoidable rework, incident severity, repeat failures, and a defined customer-trust signal.
- Add human review to high-consequence paths and give reviewers the context and authority required to make a real decision.
- Run the incident playbook against a realistic failure and repair gaps in ownership, evidence, containment, or recovery.
- Review exceptions for recurring patterns. Automate repeatable decisions and escalate unresolved policy ambiguity to the accountable owner.
- Train product and engineering teams on the operating rules, then use a community of practice to share decisions and reusable controls.
- Review one release using the complete evidence packet and remove any step that produces ceremony without decision value.
The exit test is repeatability. A second team should be able to adopt the contracts, controls, evidence requirements, and escalation paths without inventing a separate governance system.

Key takeaways
- Define trust for a specific use case and consequence; do not treat it as a permanent property of a model.
- Trace four things for every material output: inputs, meaning, permission, and control evidence.
- Put governance into data contracts, CI/CD, access decisions, retrieval, monitoring, and incident response.
- Use risk-based release paths so routine compliant changes move quickly while sensitive or high-consequence decisions receive judgment.
- Measure coverage, control performance, delivery friction, and product consequences separately rather than hiding them in one score.
- Use the first 90 days to prove one end-to-end operating path, then reuse it across additional AI products.
At your next AI roadmap review, choose one production use case and ask the four trust-chain questions. Turn every missing answer into a named contract, control, owner, or test before expanding the capability’s reach. That is the point at which governance stops being overhead and starts making responsible delivery repeatable.

References
December 2, 2025
Own Your AI: 4 Essential Roles to Supercharge Support and Prevent Performance Drift by 2026

AI doesn’t fail because the model is bad; it fails because ownership is missing.

When someone truly owns your AI, everything changes. Resolution and automation rates climb, the system self-improves, and the customer experience transforms in ways a dashboard alone will never show you.

This is part three of our five-part series on customer service planning for 2026. We’ll be sharing all five editions on our blog and on LinkedIn.

Last week, we introduced the four roles that make AI actually work in a support organization. These roles are already showing up inside the teams who are scaling AI the fastest, and this week, we get closer to the ground.

Here’s what these roles look like in practice — what they do, how they work, and why your AI performance will inevitably drift without them.

AI operations lead — owns AI performance, every day. I think of this person as the air-traffic controller for our AI Agent. I treat the AI as a living system that needs ongoing supervision, evaluation, and tuning. This role is accountable for what leaders care about most: quality, reliability, and continuous improvement.

The AI ops lead sees the whole picture: conversation quality, missing knowledge, flawed assumptions, unexpected failures, new opportunities for automation, and the subtle signals that the system is beginning to drift. In practice, that vigilance is the difference between steady gains and slow decline.

Day-to-day, here’s what I expect from this role.

1. Reviews AI conversations and surfaces performance patterns. The AI ops lead monitors the AI Agent’s behavior — the tone shift after a product launch, a sudden dip in resolution for a specific intent, or conversation clusters revealing new customer behavior. They scan for anomalies, trends, and early warnings, with an emphasis on what’s happening right now, not last week. Without this intentional ownership, I’ve watched a 2% dip turn into a 10% drop in days.

2. Prioritizes fixes and improvements. Once patterns emerge, they triage fixes like a product team handles bugs. Missing or incorrect content? They route it to the knowledge manager. Behavioral issues? They adjust guidance and guardrails. Action or system issues? They partner with the automation specialist. This connective tissue turns individual fixes into compounding improvements.

3. Defines and maintains AI guardrails. Leaders everywhere worry about AI doing things it shouldn’t. This role answers that fear by establishing clarification logic, escalation rules, “never answer” policies, and safety boundaries. The goal is predictable behavior that protects customer trust — an essential pillar of any AI Strategy and AI risk management practice.

4. Aligns reporting with leadership. The AI ops lead reports on resolution rate, CX Score, CSAT, automation coverage, and hours saved, making the economic impact visible. That visibility is a foundational step in any credible customer support AI strategy.

Why this role exists now. AI systems are dynamic and require constant tuning. A small dip in quality quickly becomes an operational issue, and no existing role naturally owns that. When someone does, teams feel the benefit almost immediately.

Knowledge manager — builds and maintains the structured knowledge AI depends on. I hear the same thing from leaders again and again: AI is only as good as the content you give it. This role is rapidly evolving from classic knowledge management into knowledge strategy — part content designer, part systems thinker, part information architect. Their job is to build the knowledge scaffolding that lets AI answer accurately, consistently, and safely.

Here’s how the knowledge manager creates leverage.

1. Writes, maintains, and improves support knowledge — continuously. After every product change, they update articles, remove duplication, resolve contradictions, and pay down “knowledge debt” that quietly erodes accuracy. The upkeep is shaped by AI performance; when patterns expose gaps, they fix the source.

2. Structures knowledge for AI, not for browsing. Traditional help centers are for humans skimming pages. AI needs clean intent signals, crisp formatting, and clearly structured language. The knowledge manager designs that structure as intentionally as the content itself.

3. Works hand-in-hand with AI ops. Many performance issues stem from missing or unclear knowledge. When the AI ops lead surfaces recurring misunderstandings or low-resolution categories, the knowledge manager resolves the root cause at the source.

4. Ensures accuracy and compliance at scale. As AI handles more sensitive situations, the knowledge manager safeguards correctness, currency, and compliance — critical for data governance and regulatory alignment.

5. Develops a cross-functional knowledge strategy. The role creates a canonical, cross-functional source of truth that product, engineering, product marketing, go-to-market, and support (AI and human) can all rely on.

Why this role exists now. This is one of the highest-leverage positions in an AI-first support org. Teams like Rocket Money and Anthropic are hiring knowledge managers because AI accuracy depends on the quality of knowledge feeding it. Without this role, resolution rate caps out early and never climbs.

Conversation designer — designs how the AI speaks, clarifies, and interacts. AI isn’t just a tool customers use; it’s a representative they interact with. Tone, clarity, pacing, and conversational structure matter, especially in voice. Every word affects perceived expertise, trustworthiness, and brand. The conversation designer ensures the AI feels human-friendly without pretending to be human — the sweet spot that builds trust without misleading customers.

In my experience, staffing conversation design early accelerates results. It changes not only how we tune AI, but how we understand the end-to-end customer experience.

Here’s what great conversation design looks like.

1. Shapes the AI’s tone, voice, and communication style. This role refines phrasing, tunes politeness, adjusts how confusion is handled, and shapes micro-interactions that determine whether customers feel cared for or dismissed. On voice channels, natural cadence is make-or-break.

2. Designs flows for high-value conversations. They design how the AI clarifies intent, branches, communicates uncertainty, verifies details, escalates, hands off, and returns to the main thread without feeling mechanical — treating customer experience as a product with language as the interface.

3. Translates procedures and complex workflows into natural language and logic. As AI runs structured procedures and actions, this role becomes a conversational system architect, translating SOPs into conditional logic with exceptions and fallbacks. At Intercom, for example, our conversation designer uses Simulations to see where the AI Agent gets confused, overconfident, or awkward, and refines flows until the interaction feels effortless end-to-end.

4. Ensures transitions to humans feel smooth and respectful. Handoffs should provide clear context to the human agent and maintain continuity so customers never feel dropped.

Why this role exists now. As AI becomes the primary interface, conversation design directly influences trust, brand perception, and operational outcomes. It’s a core competency for any generative AI and LLM program aimed at product managers.

Support automation specialist — builds the backend actions that allow AI to do real work. If the conversation designer shapes expression, this role shapes capability. They transform AI from an answering machine into an outcome engine by bridging AI and the systems it must safely and deterministically act on.

Support teams increasingly expect AI to do what a human would do: refund a charge, adjust a subscription, verify an identity, update an account setting, or pull relevant data. That expectation creates a new technical role at the edge of support, ops, and engineering.

What I rely on this specialist to deliver.

1. Creates and maintains backend workflows the AI executes. This includes building and maintaining:
- Fin Tasks.
- Fin Procedures with embedded steps.
- Action flows that call internal and external APIs.
- Automations that span billing systems, user identity layers, CRM objects, subscription entitlements, refund tools, and more.
They ensure the AI can act compliantly and predictably. These are the playbooks that turn intent into action.

2. Owns the integrations required for advanced automation. Many problems require data elsewhere — billing platforms, internal databases, systems of record. The specialist ensures the AI can retrieve, validate, and use that information safely, often partnering closely on CRM integration and internal services.

3. Partners closely with product and engineering. Some workflows require new endpoints, permission layers, safety gates, or deterministic fallbacks. This role drives those changes across the stack.

4. Ensures reliability and safety at every step. Guardrails, validation logic, exception handling, safe execution paths — all are essential. They confirm that the AI has access to the correct data, the action matches policy, edge cases are accounted for, risky flows have deterministic constraints, and every action is auditable and reversible.

Why this role exists now. Customers don’t want answers; they want outcomes. AI can now deliver those outcomes, but only with the right backend scaffolding. This role modernizes operational architecture and unlocks end-to-end automation.

How these roles work together — the new operating loop. These roles aren’t silos; they’re interdependent parts of one system. The AI ops lead identifies patterns and performance gaps. The knowledge manager resolves inaccuracies or missing content. The conversation designer improves clarity, tone, and flow. The automation specialist expands the system’s ability to take action. Each improvement compounds the next, moving you from early automation to transformational resolution rates through continuous refinement.

This loop is what separates teams that plateau early from teams that scale AI into a reliable, high-performing system — the essence of a durable AI Strategy.

How to get started (even if you can’t hire all four roles today). Most teams phase into this model: assign partial ownership, formalize responsibilities, then specialize as AI volume grows. Here’s the progression I recommend.

Phase 1: Assign ownership. Give each role’s core responsibilities to someone who can devote five to ten hours weekly. Early on, support ops, enablement, senior ICs, and technically inclined teammates can anchor the work.

Phase 2: Formalize the responsibilities. As AI resolves more queries, optimization becomes core operational work. Formalizing ownership prevents performance drift and knowledge debt.

Phase 3: Specialize and hire. Once AI handles 50–70% of incoming volume, these responsibilities become full-time roles. Investing in specialization becomes essential infrastructure for the next scale stage.

The bottom line. AI changes the shape of your support team. These four roles — AI operations lead, knowledge manager, conversation designer, and support automation specialist — form the backbone of the AI-first support organization. They bring order to a constantly changing environment and enable AI to deliver the outcomes leaders and customers expect heading into 2026.

Next week, we’ll continue the 2026 planning series with a deep dive into org design models for AI-first support teams — how to structure people, workflows, and accountability in a world where AI resolves most conversations before a human ever sees them.

Inspired by this post on The Intercom Blog.

December 2, 2025
Unlock AI Product Roadmaps: Essential Tools Every PM Needs to Prioritize and Ship Faster

In my role leading product teams, the AI product roadmap isn’t just a plan—it’s the operating system for how we discover value, prioritize with rigor, and ship with confidence. The pace has changed, the stakes are higher, and the best product managers are now orchestrating AI capabilities, data, and customer insight in near-real time.

Master the evolving art of the AI product roadmap: prioritize smarter, turn data into direction, and turn insight into action, all much faster.

When I say “AI product roadmap,” I’m talking about a living system that blends strategy, discovery, and delivery. It’s less about dates and more about outcomes, risk reduction, and sequencing learning. In practice, that means combining AI Strategy with product roadmapping and sprint planning, then validating each bet with real customer signals.

For prioritization, I anchor on outcome-based rather than output-based OKRs and connect them to measurable signals across the funnel. Continuous discovery keeps insights flowing, while unified analytics and retention analysis show me where the lift is. This lets me rank initiatives not just by impact and effort, but by how quickly we can learn, iterate, and compound value.

On discovery, product trios are non-negotiable. We prototype early with generative AI and LLMs to accelerate concept validation and reduce ambiguity. When customers can co-create through in-app guides or lightweight product tours, we turn vague needs into crisp problem statements and testable hypotheses far faster.

On delivery, I pair tight feedback loops with experimentation. A deliberate cadence of A/B testing and strong instrumentation ensures we’re learning every sprint, not just launching. The goal is to de-risk decisions quickly, keep momentum high, and translate signals into roadmap movement without thrash.

Under the hood, the AI stack matters. I rely on a retrieval-first pipeline to ground models in trusted data, and I’m intentional about privacy-by-design and data governance from day one. As agentic AI patterns emerge, I put evaluation workflows in place so we can ship confidently—and safely—without slowing down innovation.

Finally, alignment is the multiplier. Clear narrative roadmaps tied to customer outcomes help stakeholders see trade-offs, while crisp interfaces with go-to-market and CRM integration close the loop from roadmap to revenue. When everyone can trace a line from AI strategy to shipped value, prioritization becomes easier and trust grows.

If you’re feeling the acceleration, you’re not alone. With the right AI product toolbox—rooted in discovery, grounded in data, and delivered through tight feedback loops—you can move faster, learn smarter, and build products your customers can’t live without.

Inspired by this post on Product School.

December 1, 2025
AI Product Owner in 2026: The High-Impact Role Every Team Needs to Win With AI

By 2026, the AI Product Owner will be the keystone role that turns AI strategy into measurable business outcomes. In my teams, this seat bridges market insight, model capability, data governance, and shipping velocity—so product decisions are not just clever, but compliant, reliable, and fast.

I often describe the remit simply: "Here is your clear guide to the AI product owner role (skills, responsibilities, how it differs from PM) and ways AI tools supercharge delivery." In practice, the AI Product Owner translates business goals into model-backed experiences, aligns cross-functional execution, and ensures the product’s AI behavior remains safe, lawful, and on-brand under real-world constraints.

How does this differ from a traditional PM? While Product Management sets portfolio strategy, positioning, and market narratives, the AI Product Owner owns the AI experience end-to-end: data readiness, evaluation harnesses, safety guardrails, and the iterative model improvements that drive outcome-based rather than output-based OKRs. I anchor the role inside empowered product teams and product trios (PM/Design/ML Eng) to keep discovery continuous and delivery disciplined.

On responsibilities, I expect four pillars. First, discovery: continuous discovery with customers and internal experts to uncover use cases where generative AI or LLMs beat the status quo. Second, experience: define the right interaction patterns for AI UX, including retrieval-first pipeline choices, context window management, and feedback loops for human-in-the-loop correction. Third, governance: privacy-by-design, AI risk management, data governance, and regulatory compliance baked into the roadmap. Fourth, delivery: CI/CD for models and prompts, observable evaluation with A/B testing and minimum detectable effect (MDE), and SRE-grade incident management when AI behavior drifts.

Skills-wise, I look for product sense plus technical fluency. That includes working knowledge of LLMs (prompting, grounding, RAG), analytics mastery (Amplitude analytics, retention analysis, activation metrics), and comfort with DORA metrics such as deployment frequency to keep iteration high but safe. Strong stakeholder management and clear writing are non-negotiable: AI capabilities evolve fast, and leaders must see risk, cost, and ROI with no ambiguity.

AI tools truly supercharge delivery when they eliminate bottlenecks. My practical stack: an AI product toolbox with Claude Code and a ChatGPT connector for rapid prototyping; CustomGPT workflows for support triage and internal knowledge; Pendo product tours and in-app guides to validate behavior changes; Intercom for customer support AI strategy; and tight CRM integration via HubSpot to measure revenue impact. The outcome is faster idea-to-learning cycles, sharper telemetry, and far cleaner handoffs.

For roadmapping, I prioritize thin slices that prove value early: shipping narrowly scoped assistants or copilots, then expanding with product roadmapping and sprint planning that tie capability unlocks to outcomes. A unified analytics platform helps compare human-only baselines to augmented workflows, while agentic AI patterns automate routine steps under strict guardrails.

Risk is a product surface, not a side task. I require explicit policy gates (PII handling, red-teaming, bias audits), clear escalation paths, and incident playbooks. When we treat policy and reliability as features, customers reward us with deeper adoption and higher trust.

If you’re pursuing the AI Product Owner path, build a portfolio around shipped learnings: the experiment you killed with data, the safety constraint you designed, the postmortem you led, and the business metric you moved. That story—evidence of disciplined discovery, responsible delivery, and real-world results—is exactly what teams (and boards) want to see in 2026.

Inspired by this post on Product School.

November 26, 2025

How to Build Marketing Analytics That Measures Revenue

You are probably not short of marketing data. The harder problem appears when a budget decision is due: campaign reports show conversions, the CRM shows pipeline, product analytics shows activation, and finance shows revenue. Every number can be locally correct while the business still cannot explain which investment created durable growth.

If you need to decide where the next dollar or product sprint should go, do not start by choosing a more elaborate attribution model. Build a measurement chain that follows an eligible customer from a consented marketing touch to product value, commercial outcomes, retention, and expansion. Then match each decision to the kind of evidence it actually requires.

Start with the revenue decision, not the dashboard

A dashboard becomes useful only when someone can name the decision it is meant to change. “Improve marketing performance” is not a decision. Reallocating campaign spend, changing an audience, fixing trial onboarding, revising lifecycle messaging, or testing a pricing signal are decisions.

Before requesting another report, write a short measurement brief with these fields:

Decision: What will you start, stop, scale, or change?
Eligible population: Which users or accounts could have received the intervention?
Primary outcome: Which business result determines the decision?
Leading indicator: Which earlier behavior should move if the mechanism is working?
Guardrails: Which important outcome must not deteriorate while the primary metric improves?
Observation window: How long must the customer journey remain visible before the result is interpretable?
Evidence standard: Do you need descriptive reporting, diagnosis, a causal estimate, or an economic forecast?
Decision rule: What result would cause each available action?

Set those fields before looking at the result. If the outcome, segment, or success threshold changes after the data arrives, the analysis has become a story fitted to the answer.
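One way to keep the brief from drifting is to store it as an immutable record before any result is queried. The sketch below is illustrative only: the class name, field names, string labels, and example values are assumptions that mirror the fields listed above, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

# Evidence classes named in the brief; the string labels are an assumption.
EVIDENCE_STANDARDS = {"descriptive", "diagnostic", "causal", "economic"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class MeasurementBrief:
    decision: str
    eligible_population: str
    primary_outcome: str
    leading_indicator: str
    guardrails: tuple
    observation_window_days: int
    evidence_standard: str
    decision_rule: str
    locked_on: date  # date the brief was fixed, before results were inspected

    def __post_init__(self):
        if self.evidence_standard not in EVIDENCE_STANDARDS:
            raise ValueError(f"unknown evidence standard: {self.evidence_standard}")
        if self.observation_window_days <= 0:
            raise ValueError("observation window must be positive")


# Hypothetical example values.
brief = MeasurementBrief(
    decision="Scale or stop the trial-onboarding email sequence",
    eligible_population="New trial accounts created during the test period",
    primary_outcome="Trial-to-paid conversion",
    leading_indicator="Share of accounts reaching the activation event",
    guardrails=("Unsubscribe rate", "Early cancellation rate"),
    observation_window_days=60,
    evidence_standard="causal",
    decision_rule="Scale if the measured lift meets the agreed threshold; otherwise stop",
    locked_on=date(2025, 11, 26),
)
```

Because the record is frozen, changing the outcome or threshold after the data arrives requires creating a new brief with a new date, which makes the change visible in review.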

Separate four questions that dashboards often blur

What happened? Descriptive reporting counts touches, sign-ups, opportunities, revenue events, and retained customers.
Where did the journey weaken? Diagnostic analysis examines segments, cohorts, funnel transitions, time-to-value, and behavior preceding the change.
Did marketing cause the change? Causal analysis asks what would have happened to an equivalent eligible population without the intervention.
Was the change economically worthwhile? Revenue analysis adds acquisition cost, customer value, payback, retention, and expansion to the observed lift.

These questions can use some of the same data, but they do not have interchangeable answers. An attribution report can distribute credit for observed revenue without estimating incremental revenue. An experiment can estimate lift without proving that the lift will repay its cost. A conversion increase can be real while customer quality and retention decline.

Connect every marketing touch to a customer value journey

Channel dashboards split one customer into several records: an ad click, a web visitor, a trial user, an account in the CRM, and a commercial outcome. Revenue measurement starts by reconnecting those records without pretending that every join is reliable.

A practical journey model contains the following stages:

Acquisition: Record the eligible campaign, audience, creative, source, and consent state.
Identity: Define how an anonymous visitor becomes a known user and how users map to an account. In B2B products, a user identifier alone cannot represent a buying group or an account-level revenue event.
Activation: Capture the first observable behavior that indicates the customer has received meaningful product value.
Engagement: Measure whether the customer repeats the valuable behavior, uses it more deeply, or adopts the critical workflow around it.
Commercial progression: Join the account to clearly defined CRM stages and the authoritative commercial outcome.
Retention and expansion: Observe whether the acquired cohort continues receiving value and whether its usage produces credible expansion signals.

Putting campaign performance, product behavior, and CRM pipeline into one journey changes the management question. Instead of asking which channel deserves all the credit, you can ask where each acquired cohort reached value, stalled, converted, retained, or expanded.

A unified platform does not create this chain merely by ingesting every table. You still need a canonical user and account identity, consistent timestamps, stable campaign identifiers, documented CRM stages, and explicit ownership of every event. A silent identity merge can make the journey look complete while assigning one customer’s behavior or revenue to another. Preserve the raw identifiers, record the join method, and make uncertain matches visible rather than forcing them into a clean-looking funnel.
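A minimal sketch of what a visible identity join could look like, assuming each link keeps its raw identifiers, the method that produced it, and a flag for uncertain matches. The class, the join-method labels, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdentityLink:
    anonymous_id: str          # raw identifier, preserved as collected
    user_id: str               # raw identifier, preserved as collected
    account_id: Optional[str]  # None when no account mapping exists
    join_method: str           # hypothetical labels: "login", "form_email", "inferred"
    certain: bool              # False marks a match that should stay visible as uncertain


def partition_links(links):
    """Split links so uncertain matches are reported, not silently merged."""
    confirmed = [link for link in links if link.certain]
    uncertain = [link for link in links if not link.certain]
    return confirmed, uncertain


links = [
    IdentityLink("anon-1", "user-1", "acct-1", "login", True),
    IdentityLink("anon-2", "user-2", "acct-1", "form_email", True),
    IdentityLink("anon-3", "user-3", None, "inferred", False),
]
confirmed, uncertain = partition_links(links)
print(f"{len(confirmed)} confirmed, {len(uncertain)} uncertain of {len(links)} links")
```

Reporting the uncertain count beside the funnel keeps the gap explicit instead of folding weak matches into a clean-looking journey.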

For each event used in revenue analysis, document its business meaning, trigger, actor, account mapping, source system, required properties, consent treatment, owner, and version history. Event names are not definitions. Two teams can emit an event called activated while measuring entirely different customer behaviors.
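The documentation fields above can be held as a versioned event contract and checked against incoming payloads. This is a sketch under assumptions: the contract class, the `activated` example, its property names, and the owner label are all hypothetical, and only a subset of the documented fields is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventContract:
    name: str
    version: int
    business_meaning: str
    trigger: str
    owner: str
    source_system: str
    required_properties: frozenset


def missing_properties(contract, payload):
    """Return required properties absent from an event payload, sorted."""
    return sorted(contract.required_properties - set(payload))


# Hypothetical contract: the event name and properties are examples only.
activated_v2 = EventContract(
    name="activated",
    version=2,
    business_meaning="Account completed its first value-producing workflow",
    trigger="Workflow run finishes successfully",
    owner="product-analytics",
    source_system="product backend",
    required_properties=frozenset(
        {"user_id", "account_id", "occurred_at", "consent_state"}
    ),
)

payload = {"user_id": "user-1", "occurred_at": "2025-11-26T10:00:00Z"}
print(missing_properties(activated_v2, payload))
```

Two teams emitting `activated` would then have to agree on one contract version or publish distinct ones, which surfaces the definitional difference instead of hiding it behind a shared name.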

Instrument value moments instead of feature clicks

A feature click proves that an interface element was used. It does not prove that the customer solved the problem they came to solve. Define activation around a completed value-producing behavior, then measure time-to-value, depth of use, and signals associated with expansion.

Describe the customer outcome in plain language before naming an event.
Identify the smallest observable behavior that credibly represents that outcome.
Instrument completion, not merely entry into the workflow.
Measure how long eligible users take to reach the event and whether they repeat or deepen the behavior.
Compare later conversion and retention for cohorts that reach the value moment and cohorts that do not.
Treat that comparison as diagnostic evidence until an experiment tests whether moving the value moment changes the later outcome.

That last distinction matters. A behavior associated with retention may simply identify customers who were already more motivated. It is still a valuable signal for diagnosis and segmentation, but correlation does not turn it into a causal lever.
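The comparison described in the steps above can be sketched as follows. The cohort data is invented for illustration, and the field names are assumptions. The output is diagnostic: it describes an association, not a causal effect.

```python
from datetime import date
from statistics import median

# Hypothetical cohort: value_on is None when the value moment was never reached.
users = [
    {"id": "u1", "signed_up": date(2025, 1, 1), "value_on": date(2025, 1, 3), "retained": True},
    {"id": "u2", "signed_up": date(2025, 1, 1), "value_on": date(2025, 1, 5), "retained": True},
    {"id": "u3", "signed_up": date(2025, 1, 2), "value_on": date(2025, 1, 8), "retained": False},
    {"id": "u4", "signed_up": date(2025, 1, 2), "value_on": None, "retained": False},
    {"id": "u5", "signed_up": date(2025, 1, 3), "value_on": None, "retained": True},
]


def retention_rate(cohort):
    """Share of the cohort still retained; None for an empty cohort."""
    if not cohort:
        return None
    return sum(u["retained"] for u in cohort) / len(cohort)


reached = [u for u in users if u["value_on"] is not None]
not_reached = [u for u in users if u["value_on"] is None]

# Time-to-value in days, only for users who reached the value moment.
days_to_value = [(u["value_on"] - u["signed_up"]).days for u in reached]
median_days_to_value = median(days_to_value)

print("median days to value:", median_days_to_value)
print("retention, reached value:", retention_rate(reached))
print("retention, did not reach value:", retention_rate(not_reached))
```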

Build a driver tree from realized revenue back to controllable inputs

Revenue is an outcome, not an operating lever. A driver tree makes the path to that outcome explicit. It also prevents marketing, product, sales, and finance from optimizing different definitions of success.

Start with the commercial outcome your finance function recognizes. Branch it into new-customer revenue, retained revenue, and expansion where those distinctions fit your business. Then work backward through the behaviors and transitions that teams can influence:

Acquisition quality: Eligible demand reaches the intended customer profile and enters a measurable journey.
Activation: Acquired users or accounts reach the defined value moment.
Conversion: Activated customers progress to the relevant commercial outcome.
Retention: Cohorts continue performing the valuable behavior and remain commercially active.
Expansion: Usage depth, account participation, or repeated value creates a credible reason to grow the relationship.
Efficiency: Customer acquisition cost, lifetime value assumptions, and payback remain acceptable for the decision being considered.

Do not collapse the tree into a single blended conversion rate. Read it by acquisition cohort, customer segment, route to market, and other distinctions that could change the mechanism. A campaign can generate inexpensive trials yet perform poorly on activation. Another can create fewer trials but stronger retention and expansion. The top-of-funnel view favors the first campaign; the revenue journey may favor the second.
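To make the two-campaign contrast concrete, here is a small sketch with invented numbers. The campaign names, spend, and counts are assumptions chosen only to show how a blended rate hides the difference between cohorts.

```python
# Hypothetical cohorts: every figure below is illustrative.
cohorts = {
    "campaign_a": {"spend": 5000, "trials": 1000, "activated": 100, "retained": 20},
    "campaign_b": {"spend": 6000, "trials": 300, "activated": 120, "retained": 60},
}


def cohort_view(c):
    """Read one acquisition cohort along the driver tree."""
    return {
        "cost_per_trial": c["spend"] / c["trials"],
        "activation_rate": c["activated"] / c["trials"],
        "retention_of_activated": c["retained"] / c["activated"],
        "cost_per_retained": c["spend"] / c["retained"],
    }


views = {name: cohort_view(c) for name, c in cohorts.items()}

# The blended rate mixes both cohorts and describes neither.
blended_activation = sum(c["activated"] for c in cohorts.values()) / sum(
    c["trials"] for c in cohorts.values()
)

for name, view in views.items():
    print(name, view)
print("blended activation rate:", round(blended_activation, 3))
```

In this example campaign A wins on cost per trial while campaign B wins on cost per retained customer, which is the reversal the paragraph above describes.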

Metric	Decision it can inform	Definition that must be locked
Campaign-attributed revenue	Consistent reporting and allocation	Attribution rule, eligible touches, identity logic, and observation window
Activation	Audience quality and onboarding priorities	Value event, eligible population, unit of analysis, and observation window
Retention	Customer quality and durable growth	Starting cohort, retained behavior or commercial state, and comparison period
Customer acquisition cost	Acquisition efficiency	Included costs and the definition of an acquired customer
Lifetime value and payback	Whether and how aggressively to scale	Value horizon, cost boundary, retention assumptions, and treatment of expansion

Finance should remain the owner of authoritative commercial definitions. Marketing analytics can connect those outcomes to customer journeys, but it should not quietly substitute attributed pipeline, bookings, billing, collections, and recognized revenue for one another. If the decision uses money, state exactly which commercial event the number represents.

Assign every driver a definition, owner, system of record, refresh expectation, and decision it supports. If a metric has no owner or cannot alter a decision, it is probably dashboard inventory rather than a management instrument.
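The efficiency rows in the table depend heavily on what is locked into the definition. The sketch below uses one common simplified convention, stated here as an assumption rather than a recommendation: acquisition cost is the sum of explicitly included costs divided by acquired customers, and payback is that cost divided by monthly gross margin per customer. All figures and cost categories are invented.

```python
def customer_acquisition_cost(included_costs, customers_acquired):
    """CAC under an explicit cost boundary: only the listed costs count."""
    if customers_acquired <= 0:
        raise ValueError("customers_acquired must be positive")
    return sum(included_costs.values()) / customers_acquired


def payback_months(cac, monthly_revenue_per_customer, gross_margin):
    """Months of gross margin needed to recover CAC (ignores churn and expansion)."""
    monthly_margin = monthly_revenue_per_customer * gross_margin
    if monthly_margin <= 0:
        raise ValueError("monthly margin must be positive")
    return cac / monthly_margin


# Hypothetical cost boundary: changing what is included changes the answer.
included_costs = {"media": 30000, "tools": 5000, "salaries": 15000}

cac = customer_acquisition_cost(included_costs, customers_acquired=100)
months = payback_months(cac, monthly_revenue_per_customer=100, gross_margin=0.8)
print(f"CAC: {cac:.2f}, payback: {months:.2f} months")
```

Dropping the salaries line from `included_costs` would lower the result without any change in performance, which is why the cost boundary belongs in the locked definition.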

Keep attribution in its lane and use experiments for incrementality

Attribution is a rule for distributing credit among recorded touches. It is useful when the business needs a consistent reporting convention, campaign history, or a shared way to discuss observed journeys. It does not create the missing counterfactual: what the same eligible customers would have done without the marketing intervention.

Choose the method from the question:

Use attribution to describe how observed revenue is assigned across recorded touchpoints.
Use funnel and cohort analysis to locate friction and generate hypotheses about the mechanism.
Use randomized experiments when you need a defensible estimate of incremental impact and randomization is feasible.
Use customer acquisition cost, lifetime value, and payback to decide whether the measured impact is economically attractive.

Do not make an attribution disagreement carry more meaning than it has. Different attribution rules can produce different answers from the same customer journey because they distribute credit differently. That disagreement does not tell you which touch caused the revenue. If the decision depends on causality, the next step is better experimental design, not another credit-allocation rule.
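A short sketch shows why the disagreement is mechanical. It applies three common attribution rules (first-touch, last-touch, and linear) to one invented journey; the channel names and revenue figure are assumptions.

```python
from collections import defaultdict


def first_touch(touches, revenue):
    """All credit to the first recorded touch."""
    return {touches[0]: revenue}


def last_touch(touches, revenue):
    """All credit to the last recorded touch."""
    return {touches[-1]: revenue}


def linear(touches, revenue):
    """Equal credit to every recorded touch."""
    credit = defaultdict(float)
    for touch in touches:
        credit[touch] += revenue / len(touches)
    return dict(credit)


# Hypothetical journey and revenue.
journey = ["paid_search", "webinar", "email"]
revenue = 900.0

for rule in (first_touch, last_touch, linear):
    print(rule.__name__, rule(journey, revenue))
```

Each rule distributes the same 900 differently, and none of them estimates what the customer would have done without any of the touches.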

Define the minimum detectable effect before an A/B test begins

The minimum detectable effect is the smallest effect your test is designed to detect with its chosen statistical setup. It should come from the business decision: the smallest improvement that would justify the intervention after considering cost, risk, and downstream quality. It should not be selected merely because a smaller number sounds impressive.

A credible test plan records the hypothesis, eligibility rule, randomization unit, primary outcome, guardrails, minimum detectable effect, exposure logic, measurement window, and analysis plan before results are inspected. A/B testing with explicit MDE discipline and cohort-based retention analysis keep teams focused on decision-relevant effects instead of test volume.
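To see how the minimum detectable effect drives test size, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The baseline rate, the significance level, and the power are assumptions for illustration; a real plan should use whatever statistical setup the analysis plan specifies.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate units per arm to detect an absolute lift of mde_abs.

    Normal approximation for a two-sided test of two proportions:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not 0 < p1 < 1 or not 0 < p2 < 1 or mde_abs <= 0:
        raise ValueError("rates must lie in (0, 1) and mde_abs must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs**2)


# Hypothetical baseline: 10% conversion, comparing two candidate MDEs.
n_two_points = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
n_one_point = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.01)
print(n_two_points, n_one_point)
```

Halving the MDE raises the required sample by well over three times in this example, which is why the threshold should come from the business decision and not from what sounds impressive.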

Match the randomization unit to the way the intervention spreads. If people within the same account influence one another or share the commercial outcome, randomizing individual users can contaminate the comparison. Consider the account as the unit when the treatment, customer value, or revenue event operates at account level.
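One way to randomize at account level is to derive the arm from a hash of the account identifier, so every user in the account receives the same treatment. This is a sketch under assumptions: the experiment name, identifiers, and two-arm split are hypothetical, and a production system would also need exposure logging.

```python
import hashlib


def assign_arm(unit_id, experiment, arms=("control", "treatment")):
    """Deterministically map a randomization unit to an arm.

    Hashing the experiment name with the unit id keeps assignments stable
    across sessions and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]


# Hypothetical users mapped to accounts.
users = [("u1", "acct-1"), ("u2", "acct-1"), ("u3", "acct-2")]

# Assign by account, not by user, so colleagues share one experience.
assignment = {user: assign_arm(account, "onboarding-v2") for user, account in users}
print(assignment)
```

The analysis should then treat the account as the unit as well; counting users as independent observations would overstate the effective sample.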

Do not stop the analysis at the easiest conversion event when the decision depends on durable revenue. A message can increase sign-ups while bringing in users who never activate. An onboarding change can improve activation while harming a later guardrail. Follow the cohort far enough to observe the outcome named in the measurement brief.

When randomization is not feasible, label the evidence as observational. Record plausible alternative explanations, look for consistent signals across campaign exposure, product behavior, CRM progression, and cohort outcomes, and make the resulting decision more reversible. Honest uncertainty is more useful than a precise causal claim the design cannot support.

Turn revenue measurement into an operating cadence

The work is not complete when a dashboard ships. Measurement becomes operational when the same definitions guide budget choices, product experiments, lifecycle changes, and executive reviews.

Use each decision review to answer a fixed sequence of questions:

Which business outcome changed, and for which eligible cohort?
Which branch of the driver tree explains the movement?
Where in the customer journey did behavior diverge?
Is the evidence descriptive, diagnostic, causal, or economic?
What decision follows, who owns it, and what evidence would reverse it?
Which instrumentation or definition gap weakened confidence in the answer?

Ownership should follow the underlying data-generating process. Marketing owns campaign taxonomy, spend, audiences, and creative metadata. Product owns value events, activation, and engagement definitions. Sales and revenue operations own CRM stage fidelity and account mapping. Data teams own transformation logic, quality tests, and the semantic layer. Finance owns the commercial definitions used for authoritative revenue decisions.

Treat governance as part of growth infrastructure. Consented data, privacy-by-design, documented schemas, and clear metric definitions make analysis more dependable and executive decisions easier to defend. Do not stitch identities beyond the permission and purpose under which the data was collected. The safe alternative is an explicit gap in the journey, with its effect on the analysis documented.

Use generative AI as an analyst, not a measurement authority

Generative AI can accelerate query drafting, anomaly discovery, segment exploration, and the first pass at possible drivers. It cannot repair an ambiguous activation event, an unreliable identity join, or a CRM stage that teams use inconsistently. It also cannot turn observational data into causal evidence by explaining it fluently.

Require every AI-generated finding to show the metric definition, filters, eligible population, time window, comparison, underlying query or transformation, and evidence class. Validate the denominator and join logic before acting. Keep causal conclusions behind the same experimental and statistical standards you would require from a human analyst.

The leverage comes from combining fast exploration with a strong taxonomy and disciplined validation. Without those foundations, AI produces a faster version of the same disagreement that fragmented dashboards created.

Key takeaways

Start every analytics request with the decision, eligible population, outcome, evidence standard, and decision rule.
Connect campaigns to account identity, product value, CRM progression, revenue, retention, and expansion.
Use a revenue driver tree to expose which controllable behavior connects marketing activity to durable growth.
Keep attribution for consistent credit allocation; use experiments when the decision requires incremental impact.
Define value moments, event contracts, commercial outcomes, and MDE before inspecting results.
Let AI accelerate exploration, but require transparent definitions, queries, joins, and human validation.

Begin with the next disputed budget or roadmap decision. Write its measurement brief, then trace one eligible cohort from a consented first touch through product value, CRM progression, and the authoritative commercial outcome. Wherever that chain breaks is the next item for your analytics backlog.

Once the same journey can be reproduced without manual interpretation, add more channels and automate more analysis. That is the point at which marketing analytics stops being a reporting layer and becomes a revenue management system.

References

Amplitude – Marketing Analytics in 2026: Bold, Data-Driven Predictions to Outperform Your Market

November 25, 2025

How to Build a Conversation-Based Customer Experience Score

Your dashboard says the ticket was resolved. The customer remembers repeating the problem, moving between an AI agent and a teammate, and discovering that company policy still blocked the outcome they wanted. Product, Support, and Operations can all look at the same conversation and reach different conclusions.

If you are considering conversation-based customer experience scoring, the hard part is not asking an AI model for a rating. It is designing a measurement system that distinguishes the experience from its causes, shows people why the score exists, and sends each cause to someone who can change it.

A useful score separates experience from ownership

A customer experience score should answer a narrow question: how well did this interaction work for the customer? It should not silently answer a different question, such as whether the support agent performed well or whether the product team made the right policy decision.

Those questions overlap, but they are not interchangeable. A teammate can give a clear and accurate explanation of an unpopular refund policy. The teammate’s answer quality may be strong while the overall experience remains poor. An AI agent can use a warm tone while giving an incorrect answer. The sentiment may look positive even though the handling failed. A product limitation can make resolution impossible despite excellent support work.

This is why a credible score needs several layers:

Outcome: Was the customer’s request resolved, partially resolved, redirected to a workable next step, or left unresolved?
Answer quality: Were the responses clear, accurate, relevant, and internally consistent? Evaluate AI and human responses separately when both participated.
Customer effort: Did the customer repeat information, endure avoidable handoffs, chase a promised follow-up, or clarify something the company should already have understood?
Emotional context: Did the customer express strong frustration, anger, relief, gratitude, or delight? Treat emotion as context rather than a verdict by itself.
Product or service feedback: Was the customer reacting to a bug, missing capability, reliability problem, delivery failure, confusing design, or service issue?
Policy feedback: Was the real source of dissatisfaction a refund rule, eligibility condition, account limit, return policy, or another business decision?

These dimensions reflect the reality that customers react to the whole interaction, including effort and product or policy constraints, not merely the final support response.

Score the experience first. Attribute the drivers second. Assign ownership third. Reversing that order creates predictable dysfunction: teams defend their own performance, difficult conversations get excluded, and the metric becomes a political argument instead of a customer signal.

Design the score as a diagnosis, not a black box

Leadership may want one number for a dashboard, but the useful product is the diagnostic record underneath it. If a support leader cannot open a low-scoring conversation and see why it received that result, the number is not ready for coaching, prioritization, or executive reporting.

The minimum record behind each score

For every eligible conversation, preserve these fields:

Overall experience band: A small set of anchored labels is easier to calibrate than a decimal-heavy score that implies unsupported precision.
Eligibility status: Record whether the interaction was scored, excluded under a defined rule, or genuinely lacked enough information.
Outcome status: Resolved, partially resolved, unresolved, or unclear.
Answer-quality results: Separate evaluations for AI and teammate contributions where applicable.
Driver codes: Effort, strong emotion, product or service feedback, policy feedback, and any operational reason codes you have explicitly defined.
Evidence: The specific message or interaction event that supports each driver. A generated explanation without transcript evidence is an assertion, not an explanation.
Plain-language summary: What the customer needed, what happened, and why the experience earned its band.
System metadata: The scoring model, rubric, and schema versions used to produce the record.
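The fields above can be sketched as a record type with one check: every driver code must point to transcript evidence. The class names, status labels, band names, and sample values are assumptions for illustration, not a reference schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    driver_code: str
    message_id: str  # pointer to the transcript message that supports the driver


@dataclass
class ScoreRecord:
    conversation_id: str
    eligibility_status: str  # e.g. "scored", "excluded", "insufficient_information"
    experience_band: str     # anchored label, not a decimal score
    outcome_status: str      # "resolved", "partially_resolved", "unresolved", "unclear"
    answer_quality: dict     # separate results, e.g. {"ai": ..., "teammate": ...}
    driver_codes: list
    evidence: list
    summary: str
    model_version: str
    rubric_version: str
    schema_version: str


def unsupported_drivers(record):
    """Driver codes that have no supporting transcript evidence."""
    supported = {item.driver_code for item in record.evidence}
    return [code for code in record.driver_codes if code not in supported]


# Hypothetical record with one driver lacking evidence.
record = ScoreRecord(
    conversation_id="conv-101",
    eligibility_status="scored",
    experience_band="weak",
    outcome_status="partially_resolved",
    answer_quality={"ai": "inaccurate", "teammate": "accurate"},
    driver_codes=["customer_effort", "policy_feedback"],
    evidence=[Evidence("customer_effort", "msg-7")],
    summary="Customer repeated account history after handoff; refund rule blocked the request.",
    model_version="scorer-a",
    rubric_version="rubric-3",
    schema_version="1",
)
print(unsupported_drivers(record))
```

A record that returns a non-empty list here contains an assertion without an explanation and should not reach a dashboard or a coaching review.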

I would begin with anchored experience bands rather than pretending the system can distinguish tiny numerical differences. A practical rubric might distinguish a strong experience, an acceptable experience with minor friction, a weak experience with material friction or incomplete resolution, and a poor experience with an unresolved outcome, serious inaccuracy, contradiction, or excessive burden.

The labels matter less than the anchors. Reviewers need observable criteria for each band. Phrases such as good conversation or unhappy customer leave too much room for interpretation. Criteria such as customer repeated the account history after a handoff or answer contradicted an earlier commitment can be checked against the transcript.

Do not let emotion dominate the rubric. A customer may arrive angry because of a product outage and receive excellent assistance. Another may remain polite after receiving a materially wrong answer. Emotion can increase urgency and explain the experience, but it cannot substitute for outcome, accuracy, and effort.

Do not average away disagreement between dimensions either. An acceptable overall score can conceal an inaccurate AI answer that a teammate later repaired. Preserve that AI-quality failure as a driver so the AI product team can add it to an evaluation set even when the customer ultimately gets a resolution.

Make the metric reliable enough for decisions

A score can look stable while measuring a changing subset of conversations. If short threads, low-context requests, escalations, or mixed AI-human interactions are harder to score, improvements in the average may simply reflect which conversations entered the denominator.

Coverage therefore belongs beside the score, not in a technical footnote. Broader scoring can reveal parts of the support mix that were previously invisible, and adding previously unscored conversations can change the reported result even when operating performance has not changed.

Define eligibility before calibration. Spam, automated notifications, internal-only threads, and interactions with no customer request may reasonably sit outside the metric. A short conversation should not be excluded merely because it is short, and a difficult conversation should not be excluded merely because the model is uncertain. Track uncertainty explicitly rather than removing inconvenient cases from view.

Your recurring dashboard should show:

The share of eligible conversations that received a score.
The distribution across experience bands, not just an average.
The mix of positive and negative drivers.
Results split by AI-only, teammate-only, and mixed handling.
Relevant slices such as channel, language, issue type, conversation length, product area, and escalation path.
The active model, rubric, and schema versions.
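A minimal sketch of the first few dashboard items, assuming each conversation carries an eligibility flag, a band that is empty when unscored, and a handling type. The sample conversations and labels are invented.

```python
from collections import Counter

# Hypothetical conversations: band is None when no score was produced.
conversations = [
    {"id": "c1", "eligible": True, "band": "strong", "handling": "ai_only"},
    {"id": "c2", "eligible": True, "band": "weak", "handling": "mixed"},
    {"id": "c3", "eligible": True, "band": None, "handling": "teammate_only"},
    {"id": "c4", "eligible": False, "band": None, "handling": "ai_only"},  # excluded by rule
    {"id": "c5", "eligible": True, "band": "strong", "handling": "mixed"},
]


def coverage_report(rows):
    """Coverage over eligible conversations, plus band distributions."""
    eligible = [r for r in rows if r["eligible"]]
    scored = [r for r in eligible if r["band"] is not None]
    return {
        "eligible": len(eligible),
        "scored": len(scored),
        "coverage": len(scored) / len(eligible) if eligible else None,
        "bands": Counter(r["band"] for r in scored),
        "bands_by_handling": Counter((r["handling"], r["band"]) for r in scored),
    }


report = coverage_report(conversations)
print(report)
```

Note that the unscored but eligible conversation lowers coverage, while the conversation excluded under a defined rule does not enter the denominator at all.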

Calibration should happen against human judgment before the score becomes a target. Use a representative set containing routine resolutions, short exchanges, long investigations, escalations, emotionally charged threads, AI-only conversations, human-only conversations, and AI-to-human handoffs. Have independent reviewers apply the same rubric, examine disagreements, and rewrite any criterion that depends on intuition rather than observable evidence.

Then test the slices separately. Aggregate agreement can hide systematic failure in one language, channel, issue class, or interaction type. The acceptable level of disagreement depends on the decision. A model used to discover recurring workflow friction can tolerate more uncertainty than one used in individual performance management.
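Slice-level agreement can be sketched with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of kappa is an assumption of this example, since no specific agreement statistic is prescribed here, and the labeled pairs are invented.

```python
from collections import Counter, defaultdict


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration pairs: (slice, human band, model band).
pairs = [
    ("english", "strong", "strong"), ("english", "weak", "weak"),
    ("english", "poor", "poor"), ("english", "strong", "strong"),
    ("german", "strong", "strong"), ("german", "weak", "strong"),
    ("german", "poor", "weak"), ("german", "strong", "strong"),
]

by_slice = defaultdict(lambda: ([], []))
for slice_name, human, model in pairs:
    by_slice[slice_name][0].append(human)
    by_slice[slice_name][1].append(model)

kappa_by_slice = {name: cohens_kappa(h, m) for name, (h, m) in by_slice.items()}
kappa_overall = cohens_kappa([p[1] for p in pairs], [p[2] for p in pairs])
print(kappa_by_slice, round(kappa_overall, 3))
```

In this invented data the overall figure sits between the two slices, so a single aggregate number would hide that agreement in one language is close to chance.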

Keep the adjudicated examples as a regression set. Re-run them whenever you change the prompt, model, rubric, knowledge architecture, conversation parser, or driver definitions. Review newly common failure patterns as well; a frozen evaluation set eventually stops representing the work.

Model changes require visible reporting boundaries. A more contextual scoring system may produce a one-time shift without a corresponding decline in support quality. Backfill historical conversations with the new version when that is practical. Otherwise, annotate the change on every trend view and establish a new baseline. Never splice two scoring regimes into one continuous line and ask leaders to interpret the movement as operational performance.

Turn low scores into routed work, not dashboard theatre

A low score is only a symptom. The driver determines who should investigate it and what kind of intervention is plausible. Sending every poor experience to the support manager guarantees that product defects, policy choices, and broken workflows will be misclassified as coaching problems.

Primary driver	What to inspect	Primary owner	Default next action
AI answer quality	Inaccuracy, contradiction, irrelevant guidance, or repeated clarification	AI product and knowledge owners	Correct the underlying knowledge or response path, then add the failure to the AI evaluation set
Teammate answer quality	Unclear explanation, incorrect guidance, missed question, or inconsistent commitment	Support lead or enablement owner	Review the conversation against the rubric, then improve coaching, documentation, or access to information
Customer effort	Repeated information, handoff loops, unnecessary forms, follow-up chasing, or duplicated verification	Support operations or journey owner	Map the failing transition and remove the avoidable step, ownership gap, or workflow rule
Product or service feedback	Bug, missing capability, confusing design, reliability issue, delivery failure, or service breakdown	Relevant Product, Engineering, or service owner	Cluster related conversations, connect them to the product area, and decide whether the response is a fix, discovery work, or an explicit trade-off
Policy feedback	Refund, return, eligibility, account, usage, or limit rule	Business or operations owner responsible for the policy	Separate unclear communication from disagreement with the policy, then revise the explanation, the policy, or neither – deliberately
Strong negative emotion	The event that triggered the emotion and whether the issue remains unresolved	Triage owner, followed by the owner of the actual cause	Prioritize review where appropriate, but do not treat emotion alone as proof of agent failure

Automation should route the evidence package, not just the score. Include the conversation link, customer request, outcome, overall band, driver codes, supporting messages, scoring version, and proposed owner. That context lets the receiving team judge the issue without rereading an entire thread or trusting an opaque summary.
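
The evidence package can be modelled as a small record that refuses to travel half-empty. This is a minimal sketch: the class name and the `missing_fields` helper are hypothetical, and the fields mirror the list in the paragraph above.

```python
from dataclasses import dataclass

@dataclass
class EvidencePackage:
    conversation_link: str
    customer_request: str
    outcome: str
    overall_band: str
    driver_codes: list
    supporting_messages: list
    scoring_version: str
    proposed_owner: str

    def missing_fields(self):
        """Return the names of empty fields so incomplete packages are not routed."""
        return [name for name, value in vars(self).items() if not value]
```

A router could hold back any package where `missing_fields()` is non-empty, which keeps bare scores from reaching an owner without the context needed to judge them.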

Use separate operating lanes for individual cases and recurring patterns. A materially incorrect answer may need immediate review. Repeated handoff friction usually needs aggregation so Operations can see the broken transition. Product and policy feedback becomes useful when related conversations are clustered around a shared problem, while still retaining representative examples.

Count affected conversations consistently rather than allowing a verbose customer to create many separate votes within one thread. Preserve the denominator for every filter. A driver that appears frequently in one product area may look dominant in a filtered dashboard while remaining uncommon across the full support mix.
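
A small sketch of that counting rule, assuming message-level rows with hypothetical `conversation_id`, `product_area`, and `drivers` fields. It counts distinct conversations and always returns the denominator alongside the share.

```python
def driver_share(messages, driver, product_area=None):
    """Share of conversations (not messages) carrying a driver, with its denominator."""
    in_scope = [m for m in messages
                if product_area is None or m["product_area"] == product_area]
    all_conversations = {m["conversation_id"] for m in in_scope}
    affected = {m["conversation_id"] for m in in_scope if driver in m["drivers"]}
    share = len(affected) / len(all_conversations) if all_conversations else 0.0
    return {"affected": len(affected), "denominator": len(all_conversations), "share": share}

# Illustrative data: one verbose billing customer flags "effort" three times.
messages = [
    {"conversation_id": "c1", "product_area": "billing", "drivers": ["effort"]},
    {"conversation_id": "c1", "product_area": "billing", "drivers": ["effort"]},
    {"conversation_id": "c1", "product_area": "billing", "drivers": ["effort"]},
    {"conversation_id": "c2", "product_area": "billing", "drivers": []},
    {"conversation_id": "c3", "product_area": "onboarding", "drivers": []},
    {"conversation_id": "c4", "product_area": "onboarding", "drivers": []},
]
filtered = driver_share(messages, "effort", product_area="billing")
overall = driver_share(messages, "effort")
```

In this toy data the driver covers half of billing conversations but a quarter of the full mix, which is exactly the gap a filtered dashboard can hide.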

For recurring themes, maintain a problem record with the driver, affected journey, frequency, severity, controllable owner, proposed intervention, status, and comparable post-change cohort. This converts conversation scoring into a product and operations feedback loop. Without that record, the same issue can be rediscovered in every review without anyone becoming accountable for changing it.

After an intervention, compare like with like: the same scoring version, eligibility rules, issue type, and relevant handling path. If the score improves but coverage falls, or the issue mix changes, you do not yet know whether the intervention worked.
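
The like-with-like check can be expressed as a guard that runs before any before/after comparison is reported. This is a sketch under assumptions: the cohort dictionaries, key names, and the 5-point coverage tolerance are all hypothetical choices to adapt.

```python
COMPARABILITY_KEYS = ("scoring_version", "eligibility_rules", "issue_type", "handling_path")

def comparability_problems(before, after, max_coverage_drop=0.05):
    """Return the reasons two cohorts should not be compared directly."""
    problems = [key for key in COMPARABILITY_KEYS if before[key] != after[key]]
    # A large coverage drop means the denominator changed, not necessarily the experience.
    if before["coverage"] - after["coverage"] > max_coverage_drop:
        problems.append("coverage")
    return problems
```

An empty list means the comparison is at least structurally fair; any entry is a reason to hold the "it worked" claim until the cohorts are rebuilt.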

Earn the right to replace CSAT

Conversation scoring addresses a real blind spot: survey metrics describe the customers who choose to respond, while a conversation-based system can evaluate a much broader share of eligible support volume. That makes it attractive as a replacement for CSAT, but broader coverage does not automatically make the new metric valid.

Start in shadow mode. Continue the existing reporting while you calibrate the new score, inspect disagreements, and learn which drivers are actionable. Do not demand that the two measures match. They observe different things: one evaluates evidence in the interaction, while the other records a respondent’s self-reported reaction.

Move the conversation score into operational reviews once teams can inspect its reasoning and route its drivers. Move it into executive reporting only after coverage, version changes, and slice-level performance are visible. Consider reducing or retiring a survey only when all of the following are true:

Eligibility and coverage are stable enough that changes in the denominator cannot masquerade as experience improvements.
The rubric has been calibrated against human review, including difficult and ambiguous conversations.
Explanations consistently point to transcript evidence rather than merely producing plausible prose.
Important channels, languages, issue classes, and AI-human handling paths have been checked separately.
Model and rubric changes are versioned, regression-tested, and visibly marked in reporting.
Driver routing produces owned work, and teams can show what they changed because of the signal.
Material disagreements between the conversation score and survey feedback are investigated rather than averaged away.

Keep a higher standard for individual performance decisions. A conversation score can flag work for human QA, but it should not become an automatic employee rating merely because it covers more conversations. Product limitations, customer history, policy constraints, and model error can all affect the result. Use the driver record and human review to establish what the teammate actually controlled.

Key takeaways

Measure the customer’s experience separately from the performance of the AI agent, teammate, product, policy, or workflow that shaped it.
Keep an overall band for scanning, but preserve outcome, answer quality, effort, emotion, feedback drivers, evidence, and version metadata underneath it.
Report coverage and score distribution together; an unexplained denominator change can invalidate the trend.
Calibrate with representative human-reviewed conversations and retest meaningful slices after every scoring change.
Route each driver to the owner who can change it, then measure a comparable cohort after the intervention.
Replace CSAT only after the conversation score has earned trust as both a measurement system and an operating loop.

At your next customer experience review, bring one low-scoring conversation, its evidence-backed driver record, and the owner capable of changing that driver. If the meeting ends with only a debate about whether the number is fair, calibration is unfinished. If it ends with a named intervention and a valid way to examine comparable future conversations, the score is doing useful work.

References

Intercom – The new CX Score explained

November 25, 2025

AI-Enabled Customer Support Roles: A Practical Org Design

Your AI agent is resolving enough conversations that queue volume is no longer a useful blueprint for organizing the team. Yet your org chart still assumes that every customer outcome belongs to a human agent. The result is a dangerous ownership gap: everyone can recognize a poor AI interaction, but no one is clearly responsible for the content, behavior, action, or handoff that caused it.

The decision in front of you is not simply which jobs AI will remove. It is which responsibilities become more important, who should own them now, and when that work deserves a dedicated role. You can answer those questions before adding headcount.

The unit of work has shifted from tickets to systems

A human-owned ticket usually has a visible assignee, a queue, and a closed state. An AI conversation can fail much earlier in the system. The policy may be stale. The right knowledge may exist but be difficult to retrieve. The response may be accurate but confusing. A backend action may fail. A handoff may reach a person without the context needed to continue.

If you classify all of those outcomes as generic AI accuracy problems, the team will spend its time rewriting prompts while structural defects remain untouched. Diagnosis has to begin with the layer that failed.

Knowledge: Did the system have an accurate, current, and unambiguous basis for answering?
Conversation: Did it communicate clearly, follow policy, and recognize when it should stop or escalate?
Action: Could it complete the requested task safely and confirm the result?
Operations: Could the organization detect the failure, assign it, correct it, and verify that the correction worked?

When an AI agent carries a substantial share of conversation volume, human work moves from processing individual questions toward improving the system that handles them. That does not make human support irrelevant. People still own ambiguity, exceptions, sensitive situations, and cases that require judgment. What changes is the management model around them.

Start with a simple accountability rule: every layer needs one named owner. Several people can contribute, but shared contribution is not shared accountability. If nobody has the authority to prioritize a correction and see it through, performance will drift between support, product, engineering, and content teams.

Assign four ownership roles before opening four requisitions

I would not begin by hiring four specialists. I would begin by assigning four explicit responsibilities to people who already understand the customers, policies, tools, and failure modes. A dedicated job becomes necessary when the work is continuous, consequential, and repeatedly displaced by the owner’s primary role.

AI operations lead: owns performance and improvement

The AI operations lead is accountable for day-to-day performance. This person maintains the quality view, classifies recurring failure modes, prioritizes corrections, and coordinates changes across support, product, data, content, and engineering.

This is not a meeting coordinator or a person who manually edits every weak response. The role needs enough operating authority to decide which problems deserve attention and enough analytical depth to separate isolated bad conversations from systemic patterns. Support operations is often a strong internal starting point because the function already understands workflows, tooling, routing, and capacity.

The first useful deliverable is an AI performance register. For each recurring issue, record the affected customer intent, observed outcome, failure layer, accountable owner, proposed change, validation method, and current status. That register becomes the shared backlog for the AI support system.

A good decision boundary is equally important: the operations lead prioritizes what must improve, while the relevant domain owner decides how to change knowledge, conversation behavior, or automation. Otherwise, one person becomes a bottleneck for every adjustment.

Knowledge manager: owns what the AI is allowed to know

The knowledge manager owns the material that grounds customer answers: help content, internal procedures, macros, snippets, policy explanations, and the relationships between them. The job is not merely publishing more documentation. It is making sure the AI has one dependable answer for each supported question.

Conflicting content is often more damaging than missing content because the system can produce a plausible answer from the wrong instruction. Every important knowledge item should therefore have a clear owner, audience, scope, status, and review trigger. Product or policy changes should update the source of truth before teams try to compensate through prompt wording.

The first useful deliverable is a knowledge inventory organized by customer intent. Mark where content is missing, duplicated, contradictory, overly broad, or dependent on information the AI cannot reliably access. This turns an abstract content audit into a prioritized quality backlog.

Measure this role by whether knowledge-related failures become easier to prevent and diagnose. Page count is an output. Dependable answers are the outcome.

Conversation designer: owns how the AI behaves

The conversation designer defines how the AI communicates and how an interaction progresses. That includes tone, question sequencing, explanation structure, confirmation language, boundaries, escalation triggers, and the context passed to a human agent.

This role is broader than polishing wording. A response can be factually correct and still produce a poor outcome because it is too confident, asks for information in the wrong order, buries a constraint, or continues after the situation calls for human judgment. Conversation design turns brand, policy, and customer-experience expectations into observable behavior.

The first useful deliverable is an interaction specification for each important intent. It should define the customer’s likely goal, information required, acceptable response structure, prohibited claims, escalation conditions, action confirmation, and handoff payload. That specification gives QA something concrete to evaluate and gives the automation specialist a stable flow to implement.

Content design, UX writing, and support enablement are natural backgrounds for this work. The essential skill is not clever phrasing. It is recognizing how small language and sequencing choices affect comprehension, trust, and completion.

Support automation specialist: owns safe execution

The automation specialist connects customer intent to business systems. This person builds and maintains the workflows that let an AI agent retrieve account-specific information, update a record, initiate an approved process, or complete another backend task.

This is the role that moves AI support from answering to resolving. It also introduces a different class of risk. A weak answer can be corrected in conversation; a wrongly executed refund, cancellation, permission change, or account update can create financial loss, access problems, or corrupted state. Begin with reversible, bounded actions. Enforce identity, authorization, business rules, and transaction limits outside the language model, and preserve a human path when the system cannot establish that an action is safe.

The first useful deliverable is an action catalog. For each action, document the eligible intent, required inputs, source system, authorization rule, success response, known failure states, recovery path, and human fallback. Do not enable an action merely because the model can call it.
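
To make "outside the language model" concrete, here is a minimal sketch of a gate that sits between a model-requested action and the business system. The catalog entry, role name, and amount limit are hypothetical; the point is that the decision is ordinary code the model cannot talk its way around.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionRule:
    name: str
    reversible: bool
    max_amount: float
    required_role: str

# Hypothetical catalog with a single bounded, reversible action.
CATALOG = {
    "refund": ActionRule("refund", reversible=True, max_amount=50.0,
                         required_role="verified_customer"),
}

def authorize(action, amount, identity_verified, roles):
    """Decide in plain code whether a requested action may run; otherwise use the human path."""
    rule = CATALOG.get(action)
    if rule is None:
        return "human_fallback"  # not catalogued: callable by the model is not the same as enabled
    if not identity_verified or rule.required_role not in roles:
        return "human_fallback"
    if amount > rule.max_amount:
        return "human_fallback"
    return "allow"
```

Anything the gate cannot positively establish as safe falls through to a person, which matches the default the action catalog is meant to enforce.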

Support engineering, systems administration, solutions engineering, and tooling operations can all supply the necessary background. The role must be able to work with product and engineering without waiting for those teams to own every support-specific workflow.

Organize around human support, AI support, and optimization

The four roles solve ownership at the working level. You still need an organizational model that prevents AI support from becoming an isolated automation project. A practical structure uses three connected pillars: Human Support, AI Support, and Support Operations and Optimization.

Pillar	Primary responsibility	Questions it must answer
Human Support	Resolve complex, sensitive, ambiguous, and exception-driven customer needs	What requires judgment? What should be escalated? What are people learning that the system does not yet know?
AI Support	Own automated knowledge, behavior, actions, and continuous performance improvement	Where does the AI succeed or fail? What change will improve the outcome? Who can safely approve that change?
Support Operations and Optimization	Provide tooling, analytics, enablement, QA, workflow design, and capacity planning	Can performance be measured? Can failures be routed to an owner? How should human capacity change as automation coverage changes?

The reporting lines can vary. The interfaces cannot. Before debating where each role sits, write down how work crosses the boundaries.

Human Support to AI Support: Frontline agents provide structured evidence about missing knowledge, failed automation, confusing language, and escalation gaps. A collection of anecdotes is not enough; feedback needs an intent and a failure category.
AI Support to Human Support: A handoff carries the customer’s goal, relevant context, questions already asked, actions attempted, confirmed results, and remaining uncertainty. The customer should not have to reconstruct the conversation.
Operations to both: Operations supplies the measurement, workflow, tools, and change process needed to turn observed failures into verified improvements.
Product and engineering partnership: Support owns the customer problem and operating priority. Product and engineering own changes that affect the core product, shared platform, security boundary, or technical architecture.
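
The AI-to-human handoff above is easier to enforce when it is written down as a contract. The sketch below assumes a plain dictionary payload with hypothetical field names drawn from that description; it only checks that each field is present, since an empty list is a legitimate answer when nothing was attempted.

```python
REQUIRED_HANDOFF_FIELDS = (
    "customer_goal",
    "relevant_context",
    "questions_asked",
    "actions_attempted",
    "confirmed_results",
    "remaining_uncertainty",
)

def handoff_gaps(payload):
    """List required handoff fields that are absent or explicitly None."""
    return [name for name in REQUIRED_HANDOFF_FIELDS
            if name not in payload or payload[name] is None]
```

A handoff with gaps can still go through, but logging the gaps gives QA and the conversation designer a measurable view of continuity.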

Make decision rights explicit as well. Name who can publish source knowledge, change conversational behavior, enable or disable an action, alter handoff rules, accept residual risk, and declare a correction ready. Without these boundaries, teams either move recklessly or wait for broad consensus on routine changes.

Transition the current team without hiring ahead of the work

Most organizations do not need a separate AI department on the first day. They need visible ownership. Distributing the responsibilities across existing people lets you prove where the workload and leverage actually sit before converting responsibilities into job titles.

Map recent failures to ownership layers. Classify each meaningful problem as knowledge, conversation, action, or operations. If the team cannot classify it, that ambiguity is itself an operations problem.
Put a person’s name beside every layer. Avoid team names such as Support Ops or Product. A team cannot make a decision; an accountable owner can.
Give each owner an artifact and decision boundary. Use a performance register, knowledge inventory, interaction specification, and action catalog so that the role produces something inspectable.
Run the work on a fixed operating cadence. Review outcomes, inspect representative conversations, assign root causes, prioritize changes, and check whether previous corrections held.
Formalize the role when borrowed capacity stops working. A dedicated hire is justified when the responsibility is continuous, affects important outcomes, and repeatedly loses priority to the owner’s original job.

The existing support functions should evolve at the same time:

Frontline agents spend less time repeating known answers and more time resolving exceptions, preserving trust in difficult moments, and supplying structured feedback about system weaknesses.
Enablement teaches agents how to receive AI handoffs, identify failure layers, use AI-generated context critically, and submit feedback that another owner can act on.
Quality assurance expands beyond grading agent conversations. It evaluates the end-to-end customer outcome, including AI behavior, action results, escalation decisions, and continuity after handoff.
Workforce management plans for automation coverage and the type of work reaching people, not only gross inbound volume. Lower human volume can still demand substantial capacity when the remaining cases are more complex.
Support leadership becomes a player-coach responsibility. The leader must understand performance data and system behavior well enough to guide priorities while helping people move into unfamiliar work.

Do not treat the move as a title-renaming exercise. A knowledge manager without publishing authority, an operations lead without a performance view, or an automation specialist without access to technical partners will reproduce the old model under new labels.

This transition can also create credible internal career paths. Analytical support-operations talent can grow into AI operations. Content and enablement specialists can move toward knowledge or conversation design. Technically inclined support staff can develop into automation. Frontline experts with strong policy judgment can contribute to knowledge governance, QA, and escalation design. The best candidate is often the person who already understands where customer intent and company systems fail to meet.

Run AI support as a product, not a side project

An AI support system changes whenever its knowledge, instructions, workflows, integrations, policies, or underlying product changes. It therefore needs a product-like operating loop: observe an outcome, diagnose the responsible layer, change the right artifact, validate the result, and watch for regression.

The scorecard should distinguish customer outcomes from automation activity. An impressive volume metric can hide poor resolution, unnecessary handoffs, or actions that appear successful but do not complete in the business system.

Resolution quality: Did the customer achieve the intended outcome, rather than merely receive a response?
Handoff quality: Was escalation appropriate, correctly routed, and supplied with enough context for a person to continue?
Action reliability: Did the requested action complete, produce the expected state, and recover safely when it failed?
Knowledge health: Which failures came from missing, stale, conflicting, or poorly scoped information?
Customer signals: Do repeat contacts, corrections, abandonment, or explicit dissatisfaction indicate that an apparently completed interaction did not work?
Coverage: Which customer intents are eligible for automation, and which remain deliberately human-owned?
Human workload: What volume, complexity, and urgency reach agents after automation and handoffs?

Segment these measures by customer intent. A single aggregate can conceal a reliable password-reset flow beside a weak billing or cancellation flow. Intent-level views also make ownership clearer: you can connect a measurable outcome to the knowledge, conversation specification, action workflow, and escalation rule behind it.
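
A short sketch of why the aggregate misleads, using made-up numbers and hypothetical `intent` and `resolved` fields. The same data yields a respectable overall rate and a weak cancellation flow.

```python
from collections import defaultdict

def resolution_rate_by_intent(conversations):
    """Compute the resolved share per customer intent."""
    totals = defaultdict(lambda: [0, 0])  # intent -> [resolved, total]
    for c in conversations:
        totals[c["intent"]][0] += int(c["resolved"])
        totals[c["intent"]][1] += 1
    return {intent: resolved / total for intent, (resolved, total) in totals.items()}

# Illustrative data: a reliable password-reset flow beside a weak cancellation flow.
conversations = (
    [{"intent": "password_reset", "resolved": True}] * 9
    + [{"intent": "password_reset", "resolved": False}]
    + [{"intent": "cancellation", "resolved": True}] * 2
    + [{"intent": "cancellation", "resolved": False}] * 3
)
by_intent = resolution_rate_by_intent(conversations)
overall = sum(c["resolved"] for c in conversations) / len(conversations)
```

Here the overall rate sits between the two intents and describes neither of them, so the review should start from `by_intent`.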

During an operating review, resist the urge to solve every failure by changing the prompt. First classify the root cause. Correct the source material when the knowledge is wrong. Change the interaction specification when the behavior is wrong. Repair the workflow when an action is wrong. Improve instrumentation or accountability when the organization cannot tell what happened.
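
The classify-first rule can be captured as a lookup that the review uses before anyone opens the prompt. A minimal sketch with hypothetical names; treating an unclassified failure as an operations problem follows the transition guidance earlier in this article.

```python
FIX_BY_LAYER = {
    "knowledge": "correct the source material",
    "conversation": "change the interaction specification",
    "action": "repair the workflow",
    "operations": "improve instrumentation or accountability",
}

def next_fix(failure_layer):
    """Map a classified failure layer to the artifact that should change."""
    # If the team cannot classify the failure, that ambiguity is an operations problem.
    return FIX_BY_LAYER.get(failure_layer, FIX_BY_LAYER["operations"])
```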

The leader’s job is to keep that loop moving. AI support needs someone who can move between customer experience, operational data, content, and technical constraints. Pure people management is insufficient, but so is pure systems administration. The effective leader coaches the people while actively shaping the system they operate.

Key takeaways

Organize AI support around four accountable layers: operations, knowledge, conversation, and action.
Assign the responsibilities before creating dedicated positions; hire when continuous ownership can no longer fit beside an existing role.
Connect Human Support, AI Support, and Support Operations through explicit handoffs, feedback contracts, and decision rights.
Evolve enablement, QA, workforce management, and leadership around system outcomes rather than ticket throughput.
Measure resolution, action reliability, handoff quality, and knowledge health by customer intent, then fix the layer that actually failed.

Your first move should be small but explicit. Pull recent AI failures, classify each one into the four ownership layers, and put a person’s name beside every layer. Then publish what each owner may change and how the team will verify that a correction worked.

Do that before requesting a new organization chart. Once the work is visible, you will know which responsibilities can remain distributed and which have become real jobs. More importantly, your customers will no longer depend on an AI system that everybody observes but nobody owns.

References

Intercom – The Customer Service Roles AI Needs to Thrive: A Practical Playbook for High-Impact Support

November 25, 2025

Mastering Data Governance in the AI Era: Move Fast, Reduce Risk, and Unlock Trusted Insights

Every week, I’m in conversations with product leaders, engineers, and security teams who are trying to ship AI features faster without compromising trust. The tension is real: stakeholders want velocity, customers want transparency, and regulators want accountability. That’s exactly where modern data governance earns its keep.

New AI pressures are redefining what good governance takes. Learn how to build better frameworks, move fast with confidence, and keep your data from being a black box.

In my role leading product management, I’ve learned that robust data governance isn’t a compliance checkbox—it’s a strategic capability. When we treat governance as a product, we architect for clarity, safety, and speed. That means aligning AI Strategy with day-to-day delivery so teams know what they can ship, when, and why.

Here’s the practical blueprint I rely on. First, establish ownership and a shared language. Create a living data catalog, lineage maps, and clear data classifications so teams know which assets are sensitive, regulated, or eligible for training LLMs. Second, harden privacy-by-design and least-privilege access. Bake PII detection, secrets management, and role-based policies directly into your workflows. Third, bring quality and observability to the forefront: instrument data contracts, monitor drift, and track model performance across environments. Finally, implement model governance end to end—dataset cards, model cards, bias testing, human-in-the-loop review, and a repeatable evaluation harness.

To move fast with confidence, make governance invisible and automated. Treat policies as code in CI/CD, gate deployments with pre-merge checks, and fail builds that violate data contracts. Log prompts and outputs responsibly, route unsafe patterns to red-teaming, and use a retrieval-first pipeline to anchor models on verified sources rather than fragile context stuffing. This is how we scale AI product development while keeping audit trails complete and costs in check.
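
As a rough illustration of a data contract enforced in a pipeline, the sketch below checks sample records against a declared schema and returns a non-zero code when anything violates it. The contract, field names, and `main` entry point are hypothetical; a real setup would more likely use a schema or policy tool than hand-written checks.

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT = {"user_id": str, "event_name": str, "amount_cents": int}

def contract_violations(record, contract=CONTRACT):
    """Return human-readable violations for one record."""
    violations = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            violations.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            violations.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}"
            )
    return violations

def main(sample_records):
    """Print every violation and return an exit code suitable for a CI step."""
    failures = [v for record in sample_records for v in contract_violations(record)]
    for failure in failures:
        print(failure)
    return 1 if failures else 0
```

In a CI job, a wrapper script could pass the return value of `main` to `sys.exit` so that a contract violation fails the build before the change merges.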

Avoiding the black-box problem starts with transparency. Document assumptions, training data sources, and known limitations—then expose explanations where it matters in the product experience. Pair this with a unified analytics platform to tie telemetry, feature flags, and user feedback to model changes. When something goes sideways, your observability, incident management playbooks, and threat detection and response processes should make root-cause analysis fast and defensible.

If you’re building your program from scratch, use a 30-60-90 approach. In the first 30 days, inventory systems, classify data, and map high-risk use cases. By day 60, formalize RACI for governance, deploy access controls, and set up your evaluation pipeline with golden datasets and measurable acceptance thresholds. By day 90, operationalize incident response, conduct tabletop exercises, and wire governance outcomes into OKRs—think time-to-approval for high-risk changes, reduction in production incidents, and model evaluation pass rates.

This playbook pays off in board conversations and with customers. You can articulate your AI risk management posture, show measurable progress on regulatory compliance, and demonstrate how governance accelerates—not hinders—delivery. Most importantly, your teams gain the confidence to experiment, knowing there’s a safety net that protects users, the brand, and the business.

If your organization is wrestling with how to balance innovation and control, start small, codify what works, and scale with intent. With the right foundations in data governance, AI becomes an engine for durable advantage—not a source of sleepless nights.

Inspired by this post on Amplitude – Perspectives.

November 21, 2025

How We Built an AI Sleep Coach: CBTI, Voice AI, and a Product Playbook for Better Rest

What if your morning started with a helpful check-in from a voice AI that actually improves your sleep—using the same core principles that typically cost thousands of dollars and come with year-and-a-half waitlists? That idea energizes me as a product leader, because it blends clinical-grade outcomes with consumer-grade accessibility. Recently, I dug into how the team at Rest built an AI sleep coach inspired by Cognitive Behavioral Therapy for Insomnia (CBTI), and why their method offers a repeatable blueprint for complex, personal AI products.

The origin story is a classic product discovery moment. Rest’s team noticed that a meaningful slice of users in their podcast app were using audio to fall asleep. Although it represented only about 10% of users, that group showed a high willingness to pay. That signal pushed them to explore a dedicated sleep solution, moving from a general audio app to a targeted sleep experience—and eventually toward an AI-powered coach as LLMs matured.

Through jobs-to-be-done research, they identified a clear, underserved segment: “DIY sleep hackers.” These are motivated users who want agency, structure, and results without navigating clinical systems. Choosing CBTI (a clinically proven approach with a reported 80% efficacy) gave the product a strong evidence-based foundation while remaining accessible as a wellness tool. It’s the kind of strategic choice I look for: credible, measurable, and aligned with user motivation.

The product evolution moved in smart, incremental steps. Rest started with a basic text chatbot before graduating to a voice-first experience—using Vapi for voice and OpenAI for reasoning. Voice changed the relationship dynamic: it increased intimacy, lowered friction for daily check-ins, and made behavioral coaching feel human without pretending to be. The team built a memory system that tracks context (like traveling or having a dog) with time-based relevance, which keeps conversations fresh, respectful, and genuinely personalized.
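
I don't know how Rest implements its memory system, so treat the following as one possible reading of "time-based relevance" rather than their design. Each remembered fact carries an optional shelf life, and only facts still inside it are handed to the next conversation. All names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Memory:
    fact: str
    noted_on: date
    relevant_for_days: Optional[int]  # None means the fact does not expire

def active_memories(memories, today):
    """Return the facts that are still relevant on the given day."""
    return [
        m.fact for m in memories
        if m.relevant_for_days is None
        or today <= m.noted_on + timedelta(days=m.relevant_for_days)
    ]
```

Under this scheme a trip mentioned weeks ago drops out of context on its own, while a standing fact such as having a dog stays available.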

Daily engagement is driven by dynamic agendas that adapt based on sleep data, the user’s stage in the program, and their recent compliance. I love this mechanic: it operationalizes behavior change by sequencing the right intervention at the right time. In parallel, they developed text via OpenAI Assistants while building voice with Vapi, which let them ship value while learning in two modes. They also moved from massive system prompts to RAG for general sleep knowledge, keeping personal user context in the prompt—reducing brittleness while improving scalability.
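
A dynamic agenda of this kind can be sketched as a small rule function over the three inputs the paragraph names. The thresholds and agenda items below are placeholders I made up for illustration; they are neither Rest's logic nor clinical guidance.

```python
def build_agenda(program_week, sleep_efficiency, days_logged_last_week):
    """Assemble today's check-in agenda from program stage, sleep data, and compliance."""
    agenda = []
    if days_logged_last_week < 4:      # placeholder compliance threshold
        agenda.append("revisit sleep-diary habit")
    if sleep_efficiency < 0.85:        # placeholder sleep-data threshold
        agenda.append("review time-in-bed schedule")
    agenda.append(f"week {program_week} lesson")
    return agenda
```

Keeping the sequencing in explicit rules like these, rather than leaving it to the model, makes the program structure testable and easy for domain experts to review.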

Because sleep sits close to healthcare, the team drew a firm line between wellness and medical positioning. They implemented clear guardrails: no diagnosis, no medication advice, and strong boundaries on scope. Weekly error analyses with domain experts (sleep therapists) tightened quality and tone, and they adopted LLM-powered evals to enforce safety boundaries. For observability and evaluations, they leveraged Langfuse, and they experimented with Hamming for voice testing to refine the experience end-to-end.

Under the hood, this is a great example of “one bite of the apple at a time” product building in AI. Start with a simple interface, anchor on an evidence-based method, layer personalization with memory, formalize program structure with dynamic agendas, and shift to RAG when general knowledge outgrows prompt engineering. As a product leader, I see strong echoes of agentic patterns here—goal-oriented orchestration, stateful memory, and adaptive planning—shipped in pragmatic increments rather than as a monolithic platform rewrite.

A few takeaways I’m applying with my teams: First, segment deeply and pick a high-intent niche (those “DIY sleep hackers” were the right beachhead). Second, let modality fit the job—voice is not a gimmick when it boosts compliance and empathy. Third, design safety and scope from day one if you’re anywhere near health. Finally, invest early in evals and observability so you can improve with confidence, not hope.

If you want to explore the full conversation and product decisions, you can listen here: Spotify | Apple Podcasts.

Resources & Links:

Rest – AI sleep coach app

Vapi – Voice agent platform Rest uses

Langfuse – Observability and evals platform

Hamming – Voice testing platform

AI Evals Maven Course by Hamel Husain and Shreya Shankar

Bottom line: Rest demonstrates how to take a clinically grounded method like CBTI, translate it into a daily voice-first experience, and ship it with rigor. If you’re building in AI, this is a model worth studying—practical, safe, and deeply user-centered.

Inspired by this post on Product Talk.

November 20, 2025

High-Quality Data, High-Velocity AI: My Product Playbook for Governance, Trust, and Scale

Every breakthrough we ship in AI reinforces a simple truth I live by: "Companies that prioritize data quality, governance, and structure will accelerate their AI initiatives the fastest." That statement captures the difference between flashy demos and durable, scalable products. In my experience, the strongest AI Strategy starts with the discipline to treat data as a product, not an afterthought.

When teams rush to production with generative AI or LLMs, the first issues rarely come from the model itself; they come from the data. Poor lineage leads to hallucinations, inconsistent schemas inflate costs, and weak access controls erode trust. For product managers working with LLMs, this is the gap between a compelling prototype and a reliable system customers depend on every day.

Let me clarify what I mean by data quality, governance, and structure. Quality is completeness, accuracy, freshness, and consistency across sources. Governance is policy, ownership, and accountability—privacy-by-design, regulatory compliance, and AI risk management built in from day one. Structure is the architecture: clear data contracts, standardized schemas, metadata and lineage, and role-based access that keeps sensitive signals protected while enabling speed.

Here’s the product playbook I use to operationalize this. First, map critical sources and define data contracts at the edges so producers and consumers can move independently. Second, standardize schemas and entity resolution to eliminate ambiguous joins. Third, enforce privacy-by-design with policy-as-code and automated redaction. Fourth, converge analytics into a unified analytics platform so definitions, freshness, and observability are shared. Fifth, instrument end-to-end lineage and quality SLAs with alerting. Finally, close the loop with human feedback and labeling to continuously improve model performance.
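To make the first step less abstract, here is a minimal sketch of a data contract enforced at the producer's edge. The field names, the 24-hour freshness limit, and the `validate_record` helper are all assumptions for illustration, not part of any specific platform.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an "order event" record: field name -> expected type.
ORDER_EVENT_CONTRACT = {
    "order_id": str,
    "customer_id": str,
    "amount_cents": int,
    "occurred_at": datetime,  # assumed to be timezone-aware
}
MAX_AGE = timedelta(hours=24)  # assumed freshness SLA


def validate_record(record, now=None):
    """Return a list of contract violations; an empty list means the record passes."""
    now = now or datetime.now(timezone.utc)
    violations = []
    for field, expected_type in ORDER_EVENT_CONTRACT.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"wrong type for {field}")
    occurred_at = record.get("occurred_at")
    if isinstance(occurred_at, datetime) and now - occurred_at > MAX_AGE:
        violations.append("stale record")
    return violations
```

A producer could run a check like this before publishing, and a consumer could run the same check on arrival, so both sides test against one written definition instead of an informal understanding.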

For generative AI workloads, a retrieval-first pipeline is essential. Unify trusted sources (product analytics, CRM, support, docs), embed and index them with guardrails, and focus on context window management to keep prompts lean, relevant, and cost-effective. This approach improves response quality, reduces token spend, and makes updates near-real-time—without retraining the base model every week.
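Context window management can start very simply. The sketch below assumes each retrieved chunk already carries a relevance score and a token count (the `score` and `tokens` keys are hypothetical), and greedily keeps the most relevant chunks that fit a fixed budget. Real pipelines often add deduplication and reranking on top.

```python
def pack_context(chunks, token_budget):
    """Greedily keep the highest-scoring chunks that fit within the token budget.

    Each chunk is a dict with hypothetical keys: "text", "score", "tokens".
    Returns the selected chunks and the number of tokens they use.
    """
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= token_budget:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected, used
```

The point of the budget is discipline: a chunk that does not fit is dropped explicitly rather than silently truncated in the middle of the prompt.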

Measure what matters. Tie model outcomes to product metrics through rigorous A/B testing, and size experiments with minimum detectable effect (MDE) so you can ship confidently. Use product analytics to verify that better data actually improves activation, retention, and support deflection. When teams can trace an AI improvement back to a specific data-quality fix, they invest in governance with conviction.
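Sizing an experiment from the MDE is mostly arithmetic. This sketch uses the common normal-approximation formula for comparing two conversion rates; the default significance level and power are conventional choices I am assuming here, not values from any particular tool.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Two-sided test on two proportions, normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

With a 20% baseline and a 2-point absolute MDE, this works out to roughly 6,500 users per arm. Halving the MDE roughly quadruples the requirement, which is why the MDE should be agreed before the test starts.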

Culture closes the gap. Empowered product teams and product trios (PM, design, engineering) make crisper decisions when data stewards are embedded and accountable. Clear ownership, shared definitions, and transparent dashboards reduce friction with security and compliance while speeding up delivery. This is how product management leadership sustains velocity without trading away trust.

The bottom line: if we want faster, safer, and more scalable AI, we start with the data. Build strong foundations, treat governance as enablement, and structure every step so improvements compound. With that in place, generative AI stops being a science experiment and becomes a durable competitive advantage.

Inspired by this post on Amplitude – Perspectives.

November 19, 2025

AI-First Customer Support for Sustainable Ecommerce Growth

Your ecommerce support queue is growing, but cutting ticket volume is not the real decision in front of you. The harder question is which customer outcomes you can let AI own – from order questions to address changes and refunds – without creating a faster path to a wrong answer or action.

AI-first support earns its place when it completes customer work safely, gives human agents the full context when it cannot, and produces evidence you can use to improve the buying and ownership experience. Growth does not mean forcing a sale into every conversation. It means removing avoidable friction before purchase, resolving post-purchase problems well, and turning repeated support demand into better product and operational decisions.

Define the unit of automation as a resolved customer job

A message is not a resolution. An answer is not always a resolution either. If a customer asks to cancel an order, sending the cancellation policy may be factually correct while leaving the actual job unfinished.

For an AI agent to resolve that request, it must verify the customer and order, check whether cancellation is allowed, execute the permitted action, confirm the exact outcome, and recognize when an exception requires a person. This distinction matters because a deflected conversation can still represent an unresolved customer and a second contact waiting to happen.

Start by separating support demand into four kinds of work:

Informational work: order status, delivery information, return-policy questions, and other requests that can be completed with a grounded answer.
Bounded transactional work: changing an eligible shipping address, cancelling an order, issuing an allowed refund, or performing another action with clear rules and permissions.
Advisory work: helping a shopper find a suitable product using current catalog data and the constraints the shopper has provided.
Judgment-heavy work: policy exceptions, ambiguous intent, conflicting account data, unusual financial consequences, or emotionally sensitive cases where discretion matters.

Use a workflow map like this before choosing what to automate:

Customer job	AI needs	Evidence of completion	When AI must stop
Get current order information	Verified identity, correct storefront, and current order data	The requested state is returned from the commerce system	Identity, store, or order data is missing or inconsistent
Change a shipping address	An eligible order, editable fields, an authorized tool, and customer confirmation	The commerce platform accepts the new value and returns the updated order	The order has progressed too far, the address is ambiguous, or the tool fails
Cancel or refund an order	Policy rules, order state, transaction permissions, and explicit confirmation	The platform confirms the exact cancellation or refund that occurred	The request is an exception, the amount is unclear, or execution is incomplete
Choose a product	Current catalog data and relevant shopper constraints	The shopper receives grounded options or a clean route to human advice	Required constraints are unknown or the catalog cannot support the recommendation

For example, a Shopify support integration can distinguish between retrieving order information and executing actions such as address edits, cancellations, refunds, and duplicate-order workflows. That separation is the architectural principle to preserve: knowing something about an order is not the same as having permission to change it.

Prioritize each workflow using three factors: how much customer demand it represents, how ready the required data and tools are, and how costly a wrong outcome would be. High frequency alone is a poor selection rule. A common request with unreliable data will produce common failures, while a lower-volume workflow with clear rules may be the better place to prove the operating model.

Build shared context, bounded actions, and deliberate handoffs

Treating AI as infrastructure and assigning clear ownership of its performance changes the design question. You are no longer adding a writing assistant to an inbox. You are creating a customer-facing system that reads business state, applies policy, calls tools, and hands work to people.

The minimum useful context for ecommerce support usually includes verified customer identity, storefront, order and customer records, applicable policies, product or catalog information, conversation history, and the current state of any attempted workflow. Multi-store merchants need the store identifier to travel with the conversation. A valid order number in the wrong storefront is still the wrong context.

Data architecture deserves the same attention as the model. Capabilities such as multi-store handling, synchronized custom fields, updated data mappings, and EU workspace support illustrate the practical requirements. If the AI cannot determine which record is authoritative, it should expose the conflict and stop. It should never manufacture the missing state.

Give every action an explicit contract

A prompt is not an adequate control for a transactional workflow. Every tool the AI can call should have an action contract that defines:

Preconditions: what must be true before the action is available.
Required inputs: which values must come from verified commerce data and which may come from the customer.
Permissions: which customers, agents, stores, order states, and transaction types are eligible.
Confirmation: the exact order, field, amount, or consequence the customer must approve.
Execution response: a structured success or failure state returned by the commerce platform, not a guess based on generated text.
Duplicate-submission protection: how the system prevents the same action from being executed twice.
Failure behavior: whether to retry, stop, reverse a reversible step, or hand the case to a person.
Audit data: what action was requested, which policy was applied, what the tool returned, and what the customer was told.
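The contract elements above can be expressed as a thin wrapper around each tool. This is a sketch under stated assumptions: `ActionContract`, its field names, and the result dictionaries are hypothetical, the tool is any callable that returns a structured result with an `ok` flag, and the in-memory set of idempotency keys stands in for durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class ActionContract:
    name: str
    preconditions: list  # callables taking the order dict and returning bool
    requires_confirmation: bool = True
    executed_keys: set = field(default_factory=set)  # stand-in for a durable store

    def execute(self, order, confirmed, idempotency_key, tool):
        # Preconditions come first: confirmation cannot rescue an ineligible order.
        if not all(check(order) for check in self.preconditions):
            return {"status": "escalate", "reason": "precondition_failed"}
        if self.requires_confirmation and not confirmed:
            return {"status": "needs_confirmation"}
        # Duplicate-submission protection: the same key never executes twice.
        if idempotency_key in self.executed_keys:
            return {"status": "duplicate"}
        result = tool(order)  # structured result from the commerce platform
        if result.get("ok"):
            self.executed_keys.add(idempotency_key)
            return {"status": "done", "tool_result": result}
        return {"status": "escalate", "reason": "tool_failed", "tool_result": result}
```

Every branch returns a structured status, so the audit trail can record what was requested and what the tool returned without parsing generated text.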

Separate permissions by consequence. Reading authenticated order status is different from drafting a proposed change. Drafting is different from executing a reversible update. A cancellation or refund carries financial and customer-trust consequences, so it needs stricter eligibility checks, explicit confirmation, and a reliable human path for exceptions. Customer confirmation does not compensate for an ineligible order or an unreliable tool.

The integration method does not remove these obligations. Whether a tool is exposed through a native connector, an internal API, or Model Context Protocol, the AI still needs a constrained schema, narrow permissions, deterministic validation, and an unambiguous result.

Make escalation a designed path, not a failure bucket

AI-first does not mean AI-only. Humans should enter when judgment adds value or when a control condition is triggered. Define those conditions before launch rather than expecting the model to improvise them.

Escalate when identity cannot be verified, records conflict, a policy exception is requested, a consequential action falls outside permission, a tool returns an incomplete result, the customer disputes an executed action, or the customer asks for a person. A model confidence score is not enough unless you have calibrated it against the actual intents and failure costs in your environment.
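Because these conditions are defined before launch, they can live in deterministic code rather than in a prompt. In this sketch the flag names on the case record are hypothetical; each rule mirrors one stop condition from the paragraph above.

```python
# Hypothetical case flags; a missing flag is treated as "not triggered",
# except identity, which must be positively verified.
ESCALATION_RULES = {
    "identity_unverified": lambda c: not c.get("identity_verified", False),
    "records_conflict": lambda c: c.get("records_conflict", False),
    "policy_exception_requested": lambda c: c.get("policy_exception_requested", False),
    "action_outside_permission": lambda c: c.get("action_outside_permission", False),
    "tool_result_incomplete": lambda c: c.get("tool_result_incomplete", False),
    "customer_disputes_action": lambda c: c.get("customer_disputes_action", False),
    "human_requested": lambda c: c.get("human_requested", False),
}


def escalation_reasons(case):
    """Return the names of every triggered rule; an empty list means AI may continue."""
    return [name for name, rule in ESCALATION_RULES.items() if rule(case)]
```

Returning every triggered reason, not only the first, gives the receiving agent a fuller picture and gives you cleaner data about why cases escalate.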

The human receiving the conversation should get a compact handoff package containing:

The customer’s current request and the reason for escalation.
The verified customer, storefront, and order identifiers.
A short summary of facts already established.
Every action attempted and the exact tool result.
The unresolved decision or exception.
Anything already promised to the customer.

The customer should not have to reconstruct the case. When the AI has enough context to recognize that it cannot finish, passing that context forward is part of the resolution experience.
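One way to keep the handoff package honest is to give it a fixed shape and check completeness before routing. The `HandoffPackage` class and its field names are assumptions for illustration; attempted actions and promises are treated as optional because a case can escalate before either exists.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffPackage:
    request: str
    escalation_reason: str
    customer_id: str
    storefront_id: str
    order_id: str
    established_facts: list
    actions_attempted: list  # may legitimately be empty
    unresolved_decision: str
    promises_made: list  # may legitimately be empty

    def missing_fields(self):
        """List required fields that are empty, so incomplete handoffs are visible."""
        optional = {"actions_attempted", "promises_made"}
        return [
            f.name
            for f in fields(self)
            if f.name not in optional and not getattr(self, f.name)
        ]
```

Tracking how often `missing_fields()` is non-empty is a direct measure of handoff quality.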

Measure verified outcomes, system reliability, and growth impact

Deflection is an activity measure. It tells you a human did not enter the conversation, but it does not prove the customer received the right answer, the requested action succeeded, or the issue stayed resolved. An AI-first operating model should instead emphasize resolution, impact, and system reliability.

Define a successful automated resolution before you build a dashboard. A practical definition is: the AI correctly understood an eligible request, delivered the correct answer or completed the authorized action, communicated the outcome accurately, and did not create an avoidable repeat contact within a fixed follow-up window. Choose the window for your business and apply it consistently.
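That definition translates into a predicate you can apply to every closed conversation. The record keys and the seven-day default window below are assumptions; the window in particular should come from your own repeat-contact patterns.

```python
from datetime import datetime, timedelta


def is_verified_resolution(conversation, later_contacts, window=timedelta(days=7)):
    """Apply the four-part definition to one conversation record (hypothetical keys)."""
    if not (
        conversation["eligible"]
        and conversation["outcome_correct"]
        and conversation["outcome_communicated"]
    ):
        return False
    window_end = conversation["closed_at"] + window
    # A repeat contact is the same customer, same intent, inside the follow-up window.
    repeat = any(
        c["customer_id"] == conversation["customer_id"]
        and c["intent"] == conversation["intent"]
        and conversation["closed_at"] < c["opened_at"] <= window_end
        for c in later_contacts
    )
    return not repeat
```

Note that a conversation cannot be scored until its window has elapsed, so verified-resolution numbers always lag activity numbers by at least that long.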

Report coverage and success separately. A strong success rate on a very narrow set of conversations can look impressive while leaving most customer demand untouched. A broad coverage rate can hide weak execution. At minimum, track these metric layers:

Eligibility and coverage: the share of total conversations that match a workflow AI is allowed to handle, followed by the share it actually attempts.
Resolution quality: verified correctness by intent, policy adherence, repeat contact, customer dispute, and the rate of unnecessary escalation.
Action reliability: successful tool execution, rejected actions, duplicate attempts, incomplete results, and wrong or unauthorized changes.
Handoff quality: whether the right cases escalate, whether the context package is complete, and whether customers must repeat information.
Customer experience: time to the completed outcome and satisfaction segmented by intent and resolution path.
Business impact: cost per verified resolution, pre-purchase assisted conversion where attribution is credible, and downstream retention or repeat-purchase signals.
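To see why coverage and success need separate reporting, here is a small sketch that computes both from conversation records. The boolean keys (`eligible`, `attempted`, `verified_resolved`) are hypothetical labels you would derive from your own data.

```python
def coverage_and_success(conversations):
    """Report eligibility, attempt, and success rates separately."""
    total = len(conversations)
    eligible = [c for c in conversations if c["eligible"]]
    attempted = [c for c in eligible if c["attempted"]]
    resolved = [c for c in attempted if c["verified_resolved"]]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "eligibility_rate": ratio(len(eligible), total),
        "attempt_rate": ratio(len(attempted), len(eligible)),
        "success_rate": ratio(len(resolved), len(attempted)),
        "verified_share_of_total": ratio(len(resolved), total),
    }
```

A lane can show a perfect success rate while its verified share of total demand stays small, which is exactly the distortion that a single blended number hides.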

Do not present an association as growth causation. Customers who contact support may already differ from those who do not. Use controlled experiments where they are practical, compare like-for-like intent cohorts, and treat retention as a downstream signal unless the measurement design supports a stronger claim.

Ownership matters as much as measurement. Assign someone to own AI support as a product surface, someone to govern knowledge and policy, someone to own commerce integrations and permissions, and someone to review quality and customer harm. These are responsibilities, not mandatory job titles. A smaller organization may place several with one person, but none should be left implicit.

During a live rollout, I would review every failed or disputed write action and sample successful actions across each active intent every operating day. Once the important failure modes are understood and performance is stable, intent-level review can move to a weekly cadence. Scope changes should still happen through an explicit release decision, not because the queue happens to be busy.

Roll out one dependable resolution lane at a time

The safest path to meaningful automation is not a site-wide chatbot launch. It is a sequence of narrow resolution lanes, each with grounded data, an evaluation set, clear permissions, a human fallback, and a rollback path.

Establish the baseline. Group current conversations by customer intent and record volume, time to outcome, repeat contact, escalation, and the systems or policies each intent depends on.
Select a narrow first lane. Favor a request with clear rules, reliable data, and outcomes that are read-only or easy to reverse. Authenticated order information is often a better proving ground than refunds, but your own data readiness should decide.
Create an evaluation set from real, appropriately handled conversations. Include ordinary cases as well as missing orders, stale data, multi-store ambiguity, policy exceptions, tool errors, changed customer intent, and explicit requests for a person.
Write expected outcomes before testing. For every case, specify whether AI should answer, act, ask for missing information, or escalate. Classify unauthorized disclosure, wrong transactional action, and missed consequential escalation as critical failures that an overall average cannot hide.
Observe before granting broad action permissions. If your platform supports a draft or shadow mode, compare proposed behavior with the expected outcomes. Then launch to a limited storefront, channel, workflow, or customer cohort with active monitoring.
Add one write action at a time. Confirm the action contract, permissions, confirmation language, duplicate protection, audit trail, human fallback, and rollback mechanism before expanding eligibility.
Protect peak periods. Do not introduce a consequential workflow immediately before your highest-demand period unless it has already passed realistic evaluation and the operating team can disable it quickly. Keep staffing and fallback capacity based on verified workload movement, not projected deflection.
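The rule that critical failures must not hide inside an average can be enforced with a simple release gate over the evaluation results. The failure-type labels and the 95% pass-rate threshold are assumptions; set them from your own baseline and risk tolerance.

```python
# Hypothetical labels for the three critical failure classes named above.
CRITICAL = {"unauthorized_disclosure", "wrong_transactional_action", "missed_escalation"}


def release_gate(results, min_pass_rate=0.95):
    """results: list of dicts with "passed" (bool) and "failure_type" (str or None)."""
    if not results:
        raise ValueError("evaluation set is empty")
    critical = [r for r in results if r.get("failure_type") in CRITICAL]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {
        "pass_rate": pass_rate,
        "critical_failures": len(critical),
        # A single critical failure blocks release regardless of the average.
        "release": not critical and pass_rate >= min_pass_rate,
    }
```

In this sketch, a run with 99 passing cases and one wrong transactional action reports a 99% pass rate and still blocks the release.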

This expansion model creates a compounding loop. Every failed or repeated conversation should produce a specific improvement task: repair missing knowledge, correct a data mapping, clarify a policy, tighten an action permission, improve the handoff, or send a recurring upstream problem to product, merchandising, fulfillment, or operations. The value is not only that AI absorbs work. It is that support demand becomes structured evidence about where ecommerce growth is leaking.

Continue expanding only when a lane remains dependable under real conditions. Tight merchant feedback loops and peak-season planning are especially important as the agent moves from answering questions to taking actions. Pause when unresolved contacts or ambiguous cases rise. Roll back immediately when the system performs an unauthorized or incorrect consequential action.

Key takeaways

Optimize for completed customer jobs, not avoided human conversations.
Separate information retrieval from transactional authority, and give every action a testable contract.
Make verified identity, storefront, order state, policy, and tool state part of the shared context.
Design human escalation before launch so judgment-heavy cases arrive with their context intact.
Report eligibility, coverage, resolution quality, action reliability, and business impact separately.
Expand through evaluated resolution lanes with explicit release, monitoring, and rollback decisions.

Your next move is concrete: choose one customer job, write down its required data, allowed actions, stop conditions, success evidence, and human fallback. If you cannot make those five elements explicit, the workflow is not ready for autonomous resolution. If you can, you have the first building block of an AI-first support system that can grow without asking customers to absorb the risk.

November 18, 2025

PendomoniumX London: An Operating Model for AI Products
If your AI portfolio has plenty of prototypes but little habitual use, the gap is probably not access to better models. It is operating design. A team can ship an impressive assistant and still fail because it chose a weak workflow, buried the feature, measured clicks instead of changed behavior, or treated trust as a post-launch review.

At PendomoniumX London, more than 350 software leaders gathered around AI transformation and product innovation. The useful signal for product leaders was the move from broad enthusiasm to execution: clearer customer problems, measurable adoption, faster learning, and explicit governance. You can turn that signal into an operating model for your own AI roadmap.

Transform a customer workflow, not a feature list

An AI feature generates, summarizes, classifies, recommends, or takes an action. An AI product transformation changes how a person completes a meaningful job. The distinction matters because customers do not adopt model capabilities in isolation. They adopt a faster, easier, or more reliable way to get something done.

Starting with the model usually produces a familiar failure mode: the team finds technically plausible places to insert AI, ships several disconnected experiences, and then struggles to explain why customers should change their behavior. Starting with the workflow forces the team to identify the user, the moment of friction, the desired behavior, and the evidence that would justify further investment.

I would not approve an AI roadmap item until the team can complete this sentence:

For a specific user completing a specific workflow, the product will use AI to remove a named source of effort or uncertainty, leading to an observable behavior change and a defined customer or business outcome, within explicit trust boundaries.

Build the statement in this order:
1. Describe the current workflow. Write the steps a customer takes now, including any handoffs, repeated decisions, manual checks, or places where work is abandoned.
2. Isolate one consequential friction point. Avoid vague problems such as “the workflow is inefficient.” Name the decision, delay, rework, or uncertainty that prevents progress.
3. Define the assistance. State whether AI will draft, recommend, retrieve, classify, predict, or act. These modes create different expectations and require different controls.
4. Name the behavior that should change. Examples include completing a setup step, accepting or editing a recommendation, resolving a case, or returning to use the capability again.
5. Connect the behavior to an outcome. A click is not an outcome. Faster time-to-value, lower abandonment, greater task completion, and sustained use are closer to the value you need to establish.
6. Write the boundary before the prototype. Specify what data the system may use, what the user must verify, when a human remains responsible, and what happens when the system cannot produce an acceptable result.
This framing also gives you a useful way to reduce an overcrowded AI roadmap. Reject ideas that cannot name a recurring workflow, an observable behavior, and a credible path to customer value. A clever demonstration without those elements is an experiment, not yet a product commitment.

Run one evidence loop from discovery through go-to-market

AI work becomes slow when discovery, delivery, analytics, and go-to-market operate as separate projects. Research identifies one problem, engineering explores another, marketing promises a broad capability, and analytics arrives after launch. Each function can appear busy while the product accumulates uncertainty.

The better unit of management is one evidence loop:
1. Discovery identifies the costly moment. Combine customer interviews with behavioral data. Interviews explain the user’s reasoning and workarounds; analytics shows where the behavior occurs, which segments encounter it, and whether the problem is frequent enough to matter.
2. Prioritization exposes the assumptions. Compare bets using problem severity, workflow frequency, data readiness, trust burden, reach, and speed of learning. Do not hide weak evidence behind a single calculated score. Record why each factor received its assessment.
3. Sprint planning targets uncertainty. A prototype should answer a specific question: whether customers want assistance at this moment, whether the available context supports an acceptable output, or whether users understand how to review the result. Building the full workflow before answering the riskiest question creates expensive evidence.
4. Go-to-market explains the changed job. Lead with what the customer can now accomplish. “AI-powered” describes an implementation choice; it does not tell a customer when to use the capability, what input it needs, or what outcome to expect.
5. Post-launch behavior changes the roadmap. Compare actual use with the original baseline and bet statement. Look at starts, completions, acceptance or editing of outputs, abandonment, repeated use, and downstream outcomes. Feed those observations into the next discovery decision.
A lightweight decision log keeps this loop honest. For every AI bet, record the customer problem, riskiest assumption, evidence collected, decision made, owner, and next review condition. The log prevents a prototype from quietly becoming a permanent commitment simply because significant effort has already been spent.

A prototype that misses the mark can still be valuable if it retires uncertainty. If customers do not recognize the problem, stop. If they value the workflow but distrust the output, change the interaction or control model. If the output is useful but discovery is weak, address distribution and onboarding. Those are different diagnoses, so they should not all produce the same response of adding more features.

Make adoption part of the product itself

Launching an AI capability does not teach customers when to trust it, what information to provide, or how it fits into an existing routine. That education is part of the experience, especially when the product asks someone to replace a familiar manual process with a probabilistic system.

Examples at PendomoniumX paired Pendo’s in-app guides and product tours with behavioral analytics to improve activation and reduce friction around important onboarding moments. The transferable lesson is not to add a tour to every AI release. It is to place guidance at the moment of intent and measure whether it helps the customer reach value.

Instrument the adoption path before you publish the guidance:
- Eligible: the right user reaches the relevant workflow and has permission to use the AI capability.
- Exposed: the user can see the entry point or receives contextual guidance.
- Started: the user initiates the AI-assisted action.
- Delivered: the system returns an output or completes the requested action.
- Evaluated: the user accepts, edits, rejects, retries, or reverses the result.
- Completed: the user finishes the larger workflow in which the AI action sits.
- Repeated: the user chooses the capability again when the relevant need returns.
This sequence prevents a common measurement mistake. A guide view shows exposure, not activation. A button click shows curiosity, not value. Even a generated output may not matter if the user discards it or fails to complete the surrounding task. Define activation at the first point where the customer receives meaningful value, then monitor whether that behavior repeats.
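If your analytics tool exports raw events, the adoption path can be computed as a strict funnel. This sketch assumes events arrive as `(user_id, stage)` pairs using the stage names above. Counting a user at a stage only if they also reached every earlier stage is a design choice, not the only valid one.

```python
STAGES = ["eligible", "exposed", "started", "delivered", "evaluated", "completed", "repeated"]


def adoption_funnel(events):
    """Count users who reached each stage and all stages before it.

    events: iterable of (user_id, stage) pairs.
    """
    reached = {}
    for user_id, stage in events:
        reached.setdefault(user_id, set()).add(stage)
    counts = []
    for i, stage in enumerate(STAGES):
        required = set(STAGES[: i + 1])
        counts.append((stage, sum(1 for seen in reached.values() if required <= seen)))
    return counts
```

Reading the counts side by side shows where the path narrows, which tells you whether the next investment belongs in discovery, guidance, output quality, or the surrounding workflow.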

Keep the guidance proportional to the decision:
- Use a short contextual prompt when the customer only needs to notice a new action.
- Use a tooltip when the customer needs one local explanation, such as what information the model will use.
- Use a multi-step tour only when the workflow itself spans multiple unfamiliar steps.
- Show an example input when output quality depends heavily on how the request is framed.
- Explain review and fallback behavior next to the action, not in a distant help page.
- Let experienced users dismiss education that no longer helps them.
If traffic and risk permit a controlled experiment, compare eligible guided and unguided cohorts on workflow completion and repeated use. If you cannot create a credible control group, use a documented baseline and staged rollout. In either case, do not claim that guidance caused adoption merely because guide views and feature use rose at the same time.

Make trust boundaries and decision rights explicit

Trust is not a legal checklist appended to an otherwise finished AI experience. It affects what the system may do, what the interface must explain, which events need monitoring, and whether the customer remains in control. Deferring these decisions creates rework because the team may later need to change data flows, permissions, interaction design, or the scope of automation.

For each workflow, answer these questions in language the product team can implement:
- What customer, account, or third-party data may enter the system?
- What context is necessary, and what data should be excluded even if it could improve the output?
- What is retained, for what purpose, and who can access it?
- Which outputs are suggestions, and which can cause an action in the customer’s environment?
- What must the user review or confirm before an action becomes consequential?
- How does the experience communicate uncertainty, missing context, or inability to complete the task?
- What fallback lets the customer continue when the AI path fails?
- Which signals trigger investigation, rollback, or a narrower release?
- Who owns customer feedback, incidents, and changes to the evaluation criteria?
When personal data, sensitive customer information, or regulated decisions are involved, bring privacy, security, and legal reviewers into discovery. The safe alternative to making assumptions is to narrow the data and action scope until the appropriate review is complete.

Governance must be matched by clear decision rights. An empowered product team is not an ungoverned team. It is a team that knows which decisions it can make, the evidence expected, and the boundary at which another owner must participate.

A practical division is to distinguish three layers:
- Team-owned decisions: workflow design, contextual education, experiments within approved boundaries, evaluation cases, and roadmap changes supported by product evidence.
- Cross-functional review: new data access, material changes to retention, model-provider changes, higher-impact automation, and controls that affect security, privacy, support, or compliance.
- Leadership decisions: risk tolerance, strategic investment across portfolios, shared platform choices, and conflicts that cannot be resolved within the product team.
Write these rights into the AI bet rather than relying on organizational memory. Also define the conditions for continuing, reworking, pausing, or stopping the work. The exact thresholds should come from your baseline and risk context, but the decisions should exist before launch. Otherwise, encouraging signals will be celebrated while contradictory evidence is explained away.

Key takeaways
- Frame every AI investment around a recurring customer workflow, not a model capability.
- Require a bet statement that connects assistance, behavior change, customer value, and trust boundaries.
- Use one evidence loop across discovery, prioritization, sprint planning, go-to-market, and post-launch learning.
- Measure the full adoption path from eligibility to repeated use; guide views and feature clicks are intermediate signals.
- Treat in-app education as contextual product design, not a substitute for a clear value proposition.
- Set data boundaries, human-review points, fallback behavior, decision rights, and stop conditions before broad release.
In your next planning cycle, choose one live AI initiative and rewrite it as a workflow bet. Add its behavioral baseline, activation event, trust boundary, decision owner, and stop condition. Then instrument the path before expanding the feature set. If the team cannot agree on those elements, the roadmap item is not ready. If it can, AI has started to become a managed product capability rather than a collection of prototypes.

References
- Pendo – Perspectives – Inside PendomoniumX London: AI Transformation, Real-World Wins, and Product Innovation
November 17, 2025
