AI-Ready Data Governance: A Practical Trust Framework
You are ready to move an AI capability from pilot to production. The demo performs well, but the release review exposes harder questions: Which data produced this answer? Was the system allowed to use it? What happens when the data becomes stale, its meaning changes, or a customer challenges the result?

If you cannot answer those questions quickly, you do not have an AI model problem yet. You have a trust-chain problem. The practical goal of AI-ready governance is to make every important input identifiable, interpretable, permitted, observable, and recoverable without turning each release into a committee project.

Trust is a chain, not a model score

A strong evaluation score can tell you how a system behaved against a defined set of cases. It cannot prove that production data was collected lawfully, interpreted consistently, retrieved with the right permissions, or handled according to retention rules. Those are separate conditions, and a trustworthy AI product needs all of them.

My working definition is simple: trust is the justified ability to rely on an AI system for a defined use case and level of consequence. It is not a general property that a model earns once. Change the data, user, purpose, or action, and you need to validate the chain again.

Use four questions to expose where that chain is weak:
1. What did the system use? You should be able to trace the relevant inputs, transformations, retrieval results, and freshness state.
2. What did the data mean? Business definitions, schemas, labels, and event taxonomies should be consistent enough that producers and consumers interpret the signal the same way.
3. Was this use allowed? Data classification, consent, retention, purpose, and user permissions should travel with the data rather than disappear at the model boundary.
4. Can you prove the controls worked? Automated checks, policy decisions, exceptions, human reviews, and operational events should leave evidence suitable for investigation and audit.
A no to any one of these questions is a specific failure, not a vague lack of AI readiness. That distinction matters because the remedies differ. Missing or duplicate records require data-quality work. Conflicting definitions require semantic ownership. An unauthorized retrieval requires access-policy work. A grounded answer that still violates a product rule requires an output control. Retraining the model will not repair any of those failures.

When an output is challenged, diagnose it in this order: authorization, retrieved context, source meaning and freshness, transformation logic, then model behavior. Starting with the model encourages expensive experimentation while the actual defect remains upstream.
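
The diagnostic order can be written down as a small routine so that every investigation starts upstream. This is a minimal sketch rather than a prescribed implementation: the layer names and the `first_failing_layer` helper are hypothetical, and each check is assumed to be a callable that returns True when its layer is healthy.

```python
from typing import Callable, Dict, Optional

# Diagnostic order taken from the text: upstream layers first, model last.
DIAGNOSIS_ORDER = [
    "authorization",
    "retrieved_context",
    "source_meaning_and_freshness",
    "transformation_logic",
    "model_behavior",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first layer, in diagnostic order, whose check fails.

    Layers without a registered check are skipped. Returns None when
    every registered check passes.
    """
    for layer in DIAGNOSIS_ORDER:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


# Hypothetical example: retrieval and the model both look wrong, but the
# routine reports the upstream defect first.
example_checks = {
    "authorization": lambda: True,
    "retrieved_context": lambda: False,
    "model_behavior": lambda: False,
}
print(first_failing_layer(example_checks))  # retrieved_context
```

Returning only the first failing layer is a deliberate choice in this sketch: it keeps attention on the earliest defect before anyone spends time on model experiments.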

AI-ready does not mean making every table in the company pristine. It means the data used by a particular AI capability has an explicit purpose, accountable ownership, reliable semantics, enforceable policy, and enough lineage to reconstruct what happened. Treating data as a product turns those requirements into an operating responsibility instead of an indefinite cleanup program.

Build a minimum control plane around each data product

Start with the data products that feed production AI use cases. A data product may be an event stream, a document corpus, a labeled outcome set, or a derived feature set. For each one, create a contract that answers the questions a producer, consumer, reviewer, and incident responder will actually ask.
- Purpose: the decision, experience, or workflow the data is intended to support.
- Accountability: a data owner responsible for meaning and policy, plus an AI use-case owner responsible for how the product relies on it.
- Semantics: field definitions, schema, taxonomy, labels, deduplication rules, and known limitations.
- Quality: the agreed expectations for completeness, validity, uniqueness, and freshness, including what happens when an expectation is missed.
- Lineage: where the data originated, which transformations changed it, and which indexes, features, or contexts consume it.
- Policy: sensitivity classification, permitted purposes, access conditions, consent state, retention, masking, and deletion behavior.
- Evidence: the tests, logs, approvals, exceptions, and monitoring signals that demonstrate the contract is operating.
A quality SLA is only useful when it has a measurable condition and a failure response. Do not write that data should be timely. Define the freshness expectation appropriate to the use case, identify who receives the alert, and specify whether the AI product should continue, degrade, abstain, or escalate when the expectation is breached. The appropriate threshold will differ between use cases, so the contract should carry it rather than burying it in general policy.
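
A freshness expectation of this kind can be carried in the contract as data. The following is a minimal sketch under stated assumptions: the `FreshnessSla` type, the `check_freshness` helper, the six-hour threshold, and the alert address are all hypothetical and would be replaced by whatever the use case actually requires.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class OnBreach(Enum):
    """The four failure responses named in the text."""
    CONTINUE = "continue"
    DEGRADE = "degrade"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class FreshnessSla:
    max_age: timedelta       # measurable condition
    alert_recipient: str     # who receives the alert
    on_breach: OnBreach      # what the AI product should do


def check_freshness(
    sla: FreshnessSla, last_updated: datetime, now: datetime
) -> Optional[OnBreach]:
    """Return the contracted failure response if the data is too old, else None."""
    if now - last_updated > sla.max_age:
        return sla.on_breach
    return None


# Hypothetical contract entry: data older than six hours means abstain.
sla = FreshnessSla(
    max_age=timedelta(hours=6),
    alert_recipient="data-owner@example.com",
    on_breach=OnBreach.ABSTAIN,
)
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = datetime(2025, 1, 1, 3, 0, tzinfo=timezone.utc)   # 9 hours old
fresh = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)   # 4 hours old
print(check_freshness(sla, stale, now))  # OnBreach.ABSTAIN
print(check_freshness(sla, fresh, now))  # None
```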

The next step is to enforce the contract at the moments when risk enters the system:
- At change time, run schema and data-contract checks in CI/CD. Pair tracking or taxonomy changes with code review so a renamed event or field cannot silently alter downstream behavior.
- At access time, apply least-privilege permissions through role- or attribute-based controls. Carry consent and purpose metadata into the decision, and apply masking or exclusion before sensitive values reach an index, training set, or prompt.
- At request time, filter retrieval using the requesting identity and use case. Record which eligible inputs informed the response and which policy decisions were applied.
- At output time, check for PII exposure, policy violations, unsafe actions, and adversarial behavior. Add human review where the consequence warrants judgment.
- At incident time, preserve a usable audit trail and invoke a defined response playbook with an owner, containment path, and recovery decision.
This is what it means to make approval workflows guardrails rather than gates. Schema checks, data contracts, least-privilege access, consent metadata, and policy-as-code can run inside the delivery workflow. A review board should handle material ambiguity and exceptions, not manually repeat checks that software can perform consistently.
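
As one illustration of a change-time guardrail, a CI step could compare a proposed schema against the fields a data contract promises. This is a sketch only: the `contract_violations` helper, the field names, and the convention of representing types as plain strings are assumptions made for the example, not the API of any particular contract tool.

```python
from typing import Dict, List


def contract_violations(contract: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """List contracted fields that the proposed schema removes or retypes.

    Extra fields in the proposed schema are allowed in this sketch; only
    changes to fields that consumers already rely on are reported.
    """
    violations = []
    for field, expected_type in sorted(contract.items()):
        if field not in proposed:
            violations.append(f"missing field: {field}")
        elif proposed[field] != expected_type:
            violations.append(
                f"type changed: {field} {expected_type} -> {proposed[field]}"
            )
    return violations


# Hypothetical change: an event field is renamed and a numeric type is altered.
contract = {"user_id": "string", "event_name": "string", "amount": "float"}
proposed = {"user_id": "string", "event": "string", "amount": "int"}

for problem in contract_violations(contract, proposed):
    print(problem)
# type changed: amount float -> int
# missing field: event_name
```

A CI job could fail the build whenever the returned list is non-empty, which is how a renamed event is caught before it silently alters downstream behavior.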

Do not apply one approval path to every AI change. Classify changes by data sensitivity, consequence, autonomy, reversibility, and external exposure. A low-consequence internal feature using non-sensitive data may be eligible for self-service release when its automated controls pass. A customer-facing capability using sensitive context needs designated review. A high-stakes or difficult-to-reverse action should retain meaningful human control.
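
The three release paths described above can be encoded as a small routing rule. Treat this as a sketch under assumptions: the `Change` fields, the path names, and the decision to send a change to designated review when it is either sensitive or customer-facing (a more conservative reading than the example in the text) are illustrative choices. Autonomy is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    sensitive_data: bool
    customer_facing: bool
    high_consequence: bool
    reversible: bool
    automated_checks_passed: bool


def release_path(change: Change) -> str:
    """Route a change to a release path based on its risk attributes."""
    if not change.automated_checks_passed:
        return "blocked"             # no path is open until controls pass
    if change.high_consequence or not change.reversible:
        return "human_control"       # high-stakes or hard to reverse
    if change.sensitive_data or change.customer_facing:
        return "designated_review"   # conservative assumption: either one triggers review
    return "self_service"            # low consequence, internal, non-sensitive


internal_tweak = Change(False, False, False, True, True)
support_bot = Change(True, True, False, True, True)
auto_refund = Change(True, True, True, False, True)

print(release_path(internal_tweak))  # self_service
print(release_path(support_bot))     # designated_review
print(release_path(auto_refund))     # human_control
```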

Human-in-the-loop is not satisfied by placing a person at the end of the workflow. The reviewer needs the relevant context, source trace, risk flags, and authority to stop or change the action. Otherwise, the human is only absorbing accountability from a system they cannot evaluate.

Consent, lawful basis, retention, and regulatory duties depend on jurisdiction and the precise use of the data. Treat those as decisions to make with qualified privacy or legal counsel, then translate the decisions into technical rules. An architecture checklist is not a legal determination, and silently guessing can create customer and regulatory exposure.

Govern the full path from ingestion to feedback

Many AI governance programs focus on model output because that is what users see. The more persistent risks often begin earlier, when data is collected for one purpose, transformed without visible lineage, indexed under broader permissions, or reused as feedback without a deliberate policy decision. You need controls across the complete path.

Ingestion and preparation

Every input should arrive with enough metadata to determine its origin, owner, meaning, sensitivity, permitted use, retention rule, and freshness. If those attributes are unknown, label the gap rather than allowing an implicit assumption to harden into production behavior.

Do not assume that permission to analyze data also grants permission to train on it, place it in a retrieval index, or expose it to another user through generated text. Evaluate each purpose explicitly. Apply deterministic masking and exclusions before the data crosses into a system where removal becomes harder to verify.
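
One way to make masking deterministic is a keyed hash, so the same input always produces the same token without exposing the original value. The sketch below uses Python's standard `hmac` and `hashlib` modules; the `mask_record` helper, the field names, and the 16-character truncation are assumptions for illustration. Keyed hashing is pseudonymization, not anonymization, so whether it satisfies a given policy is a separate decision.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def mask_record(
    record: Dict[str, str],
    masked_fields: Iterable[str],
    excluded_fields: Iterable[str],
    key: bytes,
) -> Dict[str, str]:
    """Drop excluded fields and replace masked fields with a keyed-hash token."""
    to_mask = set(masked_fields)
    to_drop = set(excluded_fields)
    result = {}
    for name, value in record.items():
        if name in to_drop:
            continue  # exclusion: the value never crosses the boundary
        if name in to_mask:
            digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
            result[name] = digest[:16]  # truncated token; the length is an arbitrary choice
        else:
            result[name] = value
    return result


# Hypothetical record and key. In practice the key would come from a secrets manager.
key = b"example-key-not-for-production"
record = {"email": "ana@example.com", "notes": "free text", "plan": "pro"}
safe = mask_record(record, masked_fields=["email"], excluded_fields=["notes"], key=key)
print(sorted(safe))  # ['email', 'plan']
```

Because the token is stable for a given key, joins and deduplication on the masked field still work downstream.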

Data labeling deserves product-level attention. A label should have a documented definition, creation method, owner, and review path. If two teams use the same label to mean different outcomes, the model receives a conflict that infrastructure cannot resolve. If the definition changes, treat that change like an API change: identify consumers, test the impact, and preserve the lineage.

Retrieval and response

A retrieval-first architecture can improve grounding only when retrieval itself is governed. At query time, determine the requesting identity, account context, permitted purpose, and eligible sources before assembling model context. Do not retrieve broadly and hope the prompt tells the model what to ignore.
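
Filtering before context assembly might look like the following sketch. The `Source` type, the role and purpose labels, and the `eligible_sources` helper are hypothetical. A real system would typically push this filter into the retrieval query itself rather than filtering results in memory.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Source:
    source_id: str
    allowed_roles: FrozenSet[str]
    permitted_purposes: FrozenSet[str]


def eligible_sources(
    sources: Iterable[Source], requester_roles: FrozenSet[str], purpose: str
) -> List[Source]:
    """Keep only sources the requester may read for the stated purpose."""
    return [
        s
        for s in sources
        if s.allowed_roles & requester_roles and purpose in s.permitted_purposes
    ]


# Hypothetical corpus with two sources under different permissions.
corpus = [
    Source("help-center", frozenset({"support_agent", "customer"}), frozenset({"support"})),
    Source("billing-notes", frozenset({"billing_admin"}), frozenset({"support", "audit"})),
]
allowed = eligible_sources(corpus, frozenset({"support_agent"}), "support")
print([s.source_id for s in allowed])  # ['help-center']
```

Only the returned sources are passed to context assembly, so the prompt never has to instruct the model to ignore material the requester should not see.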

Keep the context window relevant as well as permitted. Irrelevant, conflicting, or stale material can obscure the signal even when every document is technically accessible. Context management should therefore enforce both policy and quality: authorized does not automatically mean useful.

The system also needs an explicit failure behavior. When retrieval returns insufficient, conflicting, stale, or unauthorized material, decide whether the product should abstain, ask for clarification, use a constrained fallback, or route the case to a person. A fluent answer is not an acceptable default when the evidence is inadequate.
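
Making the failure behavior explicit can be as simple as a lookup that the use-case owner fills in. The mapping below is purely illustrative: which behavior fits which retrieval problem is a product decision, and the `RetrievalIssue`, `Behavior`, and `choose_behavior` names are hypothetical.

```python
from enum import Enum
from typing import Set


class RetrievalIssue(Enum):
    UNAUTHORIZED = "unauthorized"
    CONFLICTING = "conflicting"
    STALE = "stale"
    INSUFFICIENT = "insufficient"


class Behavior(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"
    ASK_FOR_CLARIFICATION = "ask_for_clarification"
    CONSTRAINED_FALLBACK = "constrained_fallback"
    ROUTE_TO_PERSON = "route_to_person"


# Illustrative policy, ordered so the most severe issue wins.
POLICY = [
    (RetrievalIssue.UNAUTHORIZED, Behavior.ABSTAIN),
    (RetrievalIssue.CONFLICTING, Behavior.ROUTE_TO_PERSON),
    (RetrievalIssue.STALE, Behavior.CONSTRAINED_FALLBACK),
    (RetrievalIssue.INSUFFICIENT, Behavior.ASK_FOR_CLARIFICATION),
]


def choose_behavior(issues: Set[RetrievalIssue]) -> Behavior:
    """Return the configured behavior for the most severe issue present."""
    for issue, behavior in POLICY:
        if issue in issues:
            return behavior
    return Behavior.ANSWER  # only reached when no issue was detected


print(choose_behavior({RetrievalIssue.STALE, RetrievalIssue.INSUFFICIENT}))
# Behavior.CONSTRAINED_FALLBACK
```

The point of the table is that answering is the result of passing every check, not the default when a check is missing.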

For a material production interaction, retain enough evidence to reconstruct the event:
- The requesting actor or account context, represented in a privacy-conscious way.
- The use case and relevant system configuration.
- The retrieved inputs and their lineage or version identifiers.
- The access, consent, retention, and policy decisions applied.
- The output risk flags and any automated intervention.
- The human decision or override when review was required.
- The time of the event and the retention class governing the evidence.
Audit data needs governance too. Prompt and response logs can contain the same sensitive information you are trying to control. Collect the minimum evidence required for the stated purpose, mask where possible, restrict access, and apply an explicit retention rule. Logging everything forever is not traceability; it is an unmanaged secondary dataset.
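
The evidence list above maps naturally onto a small structured record. This sketch assumes a hypothetical `AuditRecord` shape: it stores references and version identifiers instead of raw prompts or responses, and every field name and value shown is illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuditRecord:
    actor_ref: str                 # pseudonymous reference, not a raw identifier
    use_case: str
    config_version: str
    input_versions: List[str]      # lineage or version IDs of retrieved inputs
    policy_decisions: List[str]    # access, consent, retention outcomes
    risk_flags: List[str]
    human_decision: Optional[str]  # None when no review was required
    event_time: str                # ISO 8601 timestamp
    retention_class: str           # governs how long this record is kept


def to_log_line(record: AuditRecord) -> str:
    """Serialize the record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical interaction record.
record = AuditRecord(
    actor_ref="acct-7f3a",
    use_case="support-answer",
    config_version="2025-01-01.1",
    input_versions=["help-center@v42"],
    policy_decisions=["access:allow", "consent:present"],
    risk_flags=[],
    human_decision=None,
    event_time="2025-01-01T12:00:00+00:00",
    retention_class="audit-1y",
)
print(to_log_line(record))
```

Keeping `retention_class` on the record itself means the evidence carries its own deletion rule instead of depending on a separate policy lookup.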

Feedback and continuous improvement

User interactions, corrections, and business outcomes can improve an AI product, but they should not flow automatically into evaluation or training. First decide what the feedback represents, whether it is permitted for that purpose, how it will be labeled, and how long it should be retained.

Build evaluation cases from approved examples and segment results by the use case and risk that matter. A single average can hide a severe failure in a sensitive path. Pair model evaluations with source-quality checks, retrieval traces, policy results, human-review outcomes, and data-drift monitoring. That lets you distinguish a model regression from a context, permission, or data-contract regression.
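
A short calculation shows how an average can hide a failing segment. The `pass_rates` helper and the segment labels are hypothetical, and the results are made-up illustrative data rather than measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per segment plus an 'overall' rate.

    Each result is (segment, passed). The key 'overall' is reserved, so
    segments should not use that name.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for segment, passed in results:
        for key in (segment, "overall"):
            totals[key] += 1
            if passed:
                passes[key] += 1
    return {key: passes[key] / totals[key] for key in totals}


# Illustrative data: nine low-risk cases pass, the single sensitive case fails.
results = [("low_risk", True)] * 9 + [("sensitive", False)]
rates = pass_rates(results)
print(rates["overall"])    # 0.9
print(rates["sensitive"])  # 0.0
```

A 90 percent overall pass rate looks acceptable until the sensitive segment is reported on its own.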

Continuous monitoring, audit logs, PII checks, adversarial testing, drift detection, and incident playbooks make governance part of normal operations. The essential move is closing the loop: a failed case should lead to the layer that owns the defect, a corrective change, and a test that prevents the same failure from returning unnoticed.

Measure whether governance is earning trust

A dashboard labeled governance health is not useful unless each metric supports a decision. Start with measures that reveal coverage, control performance, delivery friction, and product consequences. Define each numerator, denominator, owner, and escalation condition so the number cannot drift into decorative reporting.
- Coverage: the share of production AI use cases with a named owner, current data contract, documented lineage, policy classification, and risk-based release path.
- Data reliability: schema-check pass rate, freshness-SLA compliance, duplicate or missing-data failures, and restoration time after a breach.
- Access and privacy: blocked unauthorized attempts, open policy exceptions, consent or retention violations, PII risk flags, and time to resolve each class of issue.
- Traceability: the share of reviewed outputs for which the team can reconstruct the relevant inputs, transformations, policy decisions, and reviewer actions.
- Evaluation: pass rates by use case and risk class, with failures attributed to data, retrieval, policy, model, or workflow layers.
- Delivery: lead time from a production-ready change to release, manual-review waiting time, and rework caused by late data or policy discovery.
- Consequences: incident frequency and severity, repeated failure modes, customer disputes, support escalations, and the product outcome the AI capability is meant to improve.
Read these measures in pairs. Faster release time with a growing backlog of unreviewed exceptions is not healthy acceleration. A high number of blocked access attempts may indicate that controls are working, that clients are misconfigured, or that an attempted abuse pattern is increasing. A rising evaluation score alongside worsening traceability means you know more about test performance but less about production accountability.
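
Defining the numerator and denominator in code keeps a metric from drifting. The sketch below computes the coverage measure under assumptions: the attribute names in `REQUIRED` and the dictionary shape of each use case are hypothetical stand-ins for whatever inventory format is in use.

```python
from typing import Dict, List, Tuple

# Attributes a production use case must have to count as covered (from the text).
REQUIRED = ("owner", "data_contract", "lineage", "policy_classification", "release_path")


def coverage(use_cases: List[Dict[str, object]]) -> Tuple[int, int, float]:
    """Return (numerator, denominator, share) for the coverage metric.

    Numerator: use cases where every required attribute is present and non-empty.
    Denominator: all production use cases supplied.
    """
    denominator = len(use_cases)
    numerator = sum(
        1 for uc in use_cases if all(uc.get(attr) for attr in REQUIRED)
    )
    share = numerator / denominator if denominator else 0.0
    return numerator, denominator, share


# Hypothetical inventory: the second use case has no named owner.
inventory = [
    {"owner": "ana", "data_contract": "v3", "lineage": "doc-12",
     "policy_classification": "internal", "release_path": "self_service"},
    {"owner": "", "data_contract": "v1", "lineage": "doc-40",
     "policy_classification": "sensitive", "release_path": "designated_review"},
]
print(coverage(inventory))  # (1, 2, 0.5)
```

Returning the numerator and denominator alongside the share lets a reader see whether a change in the percentage came from better coverage or from a different count of use cases.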

Do not collapse the dashboard into one trust score. A composite number hides which control failed and encourages teams to optimize the arithmetic. Executives can use a compact status view, but product, data, security, and privacy owners need the underlying measures and exception details.

Each material release should also produce an evidence packet containing the current data contract, automated test results, evaluation results, applicable approvals or exceptions, monitoring configuration, and incident owner. This does not need to become a large document. It needs to be complete enough that a reviewer can reproduce the release decision without relying on memory.

Finally, connect governance to outcomes rather than celebrating control activity. The relevant question is not how many reviews occurred. It is whether teams can ship responsibly with less rework, whether incidents and repeat failures decline, whether challenged outputs can be explained, and whether the intended product outcome improves without transferring hidden risk to the customer.

A 30-60-90 day path from policy to operating system

You do not need to finish an enterprise-wide catalog before improving one production path. Use a high-value AI capability as a vertical slice while the broader inventory progresses. That forces the governance design to survive real delivery constraints and produces reusable patterns for the next use case.

Days 1-30: expose the current state
- Inventory production AI use cases and the systems, datasets, indexes, outputs, and feedback loops they depend on.
- Map one priority flow from collection through transformation, retrieval, generation, action, and feedback.
- Assign accountable data and use-case owners. Record unknown ownership as a risk, not as a shared responsibility.
- Classify PII and other sensitive data, then document the current consent, purpose, lawful-basis, and retention decisions with the appropriate specialists.
- Define the first quality SLAs and failure behaviors for the inputs that can materially change the product result.
- Publish a concise operating policy that product managers, engineers, analysts, security partners, and reviewers can use during normal delivery.
The exit test is evidence, not document completion. For the priority use case, you should be able to name the owners, draw the data path, identify sensitive inputs, show the current permissions, and list the unresolved gaps that could block or constrain release.

Days 31-60: turn decisions into controls
- Standardize the metadata required for ownership, lineage, classification, consent, retention, quality, and permitted use.
- Implement fine-grained access controls and propagate the requesting identity into retrieval.
- Add consent-aware tracking, masking, and exclusions at the earliest enforceable point in the flow.
- Wire schema checks, data-contract tests, PII checks, and policy checks into CI/CD and runtime monitoring.
- Establish risk-based release paths so low-risk compliant changes can move without waiting for a general committee.
- Create the first governance dashboard using access attempts, exceptions, quality failures, risk flags, trace coverage, and delivery time.
The exit test is an end-to-end trace. Select a production interaction and reconstruct what the system used, what each important field meant, why access was allowed, which checks ran, and how an owner would respond if the result were challenged.

Days 61-90: close the learning and accountability loop
- Connect governance measures to outcomes such as release cycle time, avoidable rework, incident severity, repeat failures, and a defined customer-trust signal.
- Add human review to high-consequence paths and give reviewers the context and authority required to make a real decision.
- Run the incident playbook against a realistic failure and repair gaps in ownership, evidence, containment, or recovery.
- Review exceptions for recurring patterns. Automate repeatable decisions and escalate unresolved policy ambiguity to the accountable owner.
- Train product and engineering teams on the operating rules, then use a community of practice to share decisions and reusable controls.
- Review one release using the complete evidence packet and remove any step that produces ceremony without decision value.
The exit test is repeatability. A second team should be able to adopt the contracts, controls, evidence requirements, and escalation paths without inventing a separate governance system.

Key takeaways
- Define trust for a specific use case and consequence; do not treat it as a permanent property of a model.
- Trace four things for every material output: inputs, meaning, permission, and control evidence.
- Put governance into data contracts, CI/CD, access decisions, retrieval, monitoring, and incident response.
- Use risk-based release paths so routine compliant changes move quickly while sensitive or high-consequence decisions receive judgment.
- Measure coverage, control performance, delivery friction, and product consequences separately rather than hiding them in one score.
- Use the first 90 days to prove one end-to-end operating path, then reuse it across additional AI products.
At your next AI roadmap review, choose one production use case and ask the four trust-chain questions. Turn every missing answer into a named contract, control, owner, or test before expanding the capability’s reach. That is the point at which governance stops being overhead and starts making responsible delivery repeatable.

References
December 2, 2025
Own Your AI: 4 Essential Roles to Supercharge Support and Prevent Performance Drift by 2026

AI doesn’t fail because the model is bad; it fails because ownership is missing.

When someone truly owns your AI, everything changes. Resolution and automation rates climb, the system self-improves, and the customer experience transforms in ways a dashboard alone will never show you.

This is part three of our five-part series on customer service planning for 2026. We’ll be sharing all five editions on our blog and on LinkedIn.

Last week, we introduced the four roles that make AI actually work in a support organization. These roles are already showing up inside the teams who are scaling AI the fastest, and this week, we get closer to the ground.

Here’s what these roles look like in practice — what they do, how they work, and why your AI performance will inevitably drift without them.

AI operations lead — owns AI performance, every day. I think of this person as the air-traffic controller for our AI Agent. I treat the AI as a living system that needs ongoing supervision, evaluation, and tuning. This role is accountable for what leaders care about most: quality, reliability, and continuous improvement.

The AI ops lead sees the whole picture: conversation quality, missing knowledge, flawed assumptions, unexpected failures, new opportunities for automation, and the subtle signals that the system is beginning to drift. In practice, that vigilance is the difference between steady gains and slow decline.

Day-to-day, here’s what I expect from this role.

1. Reviews AI conversations and surfaces performance patterns. The AI ops lead monitors the AI Agent’s behavior — the tone shift after a product launch, a sudden dip in resolution for a specific intent, or conversation clusters revealing new customer behavior. They scan for anomalies, trends, and early warnings, with an emphasis on what’s happening right now, not last week. Without this intentional ownership, I’ve watched a 2% dip turn into a 10% drop in days.

2. Prioritizes fixes and improvements. Once patterns emerge, they triage fixes like a product team handles bugs. Missing or incorrect content? They route it to the knowledge manager. Behavioral issues? They adjust guidance and guardrails. Action or system issues? They partner with the automation specialist. This connective tissue turns individual fixes into compounding improvements.

3. Defines and maintains AI guardrails. Leaders everywhere worry about AI doing things it shouldn’t. This role answers that fear by establishing clarification logic, escalation rules, “never answer” policies, and safety boundaries. The goal is predictable behavior that protects customer trust, an essential pillar of any AI strategy and AI risk management practice.

4. Aligns reporting with leadership. The AI ops lead reports on resolution rate, CX Score, CSAT, automation coverage, and hours saved, making the economic impact visible. That visibility is a foundational step in any credible customer support AI strategy.

Why this role exists now. AI systems are dynamic and require constant tuning. A small dip in quality quickly becomes an operational issue, and no existing role naturally owns that. When someone does, teams feel the benefit almost immediately.

Knowledge manager — builds and maintains the structured knowledge AI depends on. I hear the same thing from leaders again and again: AI is only as good as the content you give it. This role is rapidly evolving from classic knowledge management into knowledge strategy — part content designer, part systems thinker, part information architect. Their job is to build the knowledge scaffolding that lets AI answer accurately, consistently, and safely.

Here’s how the knowledge manager creates leverage.

1. Writes, maintains, and improves support knowledge — continuously. After every product change, they update articles, remove duplication, resolve contradictions, and pay down “knowledge debt” that quietly erodes accuracy. The upkeep is shaped by AI performance; when patterns expose gaps, they fix the source.

2. Structures knowledge for AI, not for browsing. Traditional help centers are for humans skimming pages. AI needs clean intent signals, crisp formatting, and clearly structured language. The knowledge manager designs that structure as intentionally as the content itself.

3. Works hand-in-hand with AI ops. Many performance issues stem from missing or unclear knowledge. When the AI ops lead surfaces recurring misunderstandings or low-resolution categories, the knowledge manager resolves the root cause at the source.

4. Ensures accuracy and compliance at scale. As AI handles more sensitive situations, the knowledge manager safeguards correctness, currency, and compliance — critical for data governance and regulatory alignment.

5. Develops a cross-functional knowledge strategy. The role creates a canonical, cross-functional source of truth that product, engineering, product marketing, go-to-market, and support (AI and human) can all rely on.

Why this role exists now. This is one of the highest-leverage positions in an AI-first support org. Teams like Rocket Money and Anthropic are hiring knowledge managers because AI accuracy depends on the quality of knowledge feeding it. Without this role, resolution rate caps out early and never climbs.

Conversation designer — designs how the AI speaks, clarifies, and interacts. AI isn’t just a tool customers use; it’s a representative they interact with. Tone, clarity, pacing, and conversational structure matter, especially in voice. Every word affects perceived expertise, trustworthiness, and brand. The conversation designer ensures the AI feels human-friendly without pretending to be human — the sweet spot that builds trust without misleading customers.

In my experience, staffing conversation design early accelerates results. It changes not only how we tune AI, but how we understand the end-to-end customer experience.

Here’s what great conversation design looks like.

1. Shapes the AI’s tone, voice, and communication style. This role refines phrasing, tunes politeness, adjusts how confusion is handled, and shapes micro-interactions that determine whether customers feel cared for or dismissed. On voice channels, natural cadence is make-or-break.

2. Designs flows for high-value conversations. They design how the AI clarifies intent, branches, communicates uncertainty, verifies details, escalates, hands off, and returns to the main thread without feeling mechanical — treating customer experience as a product with language as the interface.

3. Translates procedures and complex workflows into natural language and logic. As AI runs structured procedures and actions, this role becomes a conversational system architect, translating SOPs into conditional logic with exceptions and fallbacks. For example, in Intercom, our conversation designer uses Simulations to run test conversations, find where the AI Agent gets confused, over-confident, or awkward, and refine flows until the interaction feels effortless end-to-end.

4. Ensures transitions to humans feel smooth and respectful. Handoffs should provide clear context to the human agent and maintain continuity so customers never feel dropped.

Why this role exists now. As AI becomes the primary interface, conversation design directly influences trust, brand perception, and operational outcomes. It’s a core competency for any program that teaches generative AI and LLMs to product managers.

Support automation specialist — builds the backend actions that allow AI to do real work. If the conversation designer shapes expression, this role shapes capability. They transform AI from an answering machine into an outcome engine by bridging AI and the systems it must safely and deterministically act on.

Support teams increasingly expect AI to do what a human would do: refund a charge, adjust a subscription, verify an identity, update an account setting, or pull relevant data. That expectation creates a new technical role at the edge of support, ops, and engineering.

What I rely on this specialist to deliver.

1. Creates and maintains backend workflows the AI executes. This includes building and maintaining:
- Fin Tasks.
- Fin Procedures with embedded steps.
- Action flows that call internal and external APIs.
- Automations that span billing systems, user identity layers, CRM objects, subscription entitlements, refund tools, and more.
They ensure the AI can act compliantly and predictably: these are the playbooks that turn intent into action.

2. Owns the integrations required for advanced automation. Many problems require data elsewhere — billing platforms, internal databases, systems of record. The specialist ensures the AI can retrieve, validate, and use that information safely, often partnering closely on CRM integration and internal services.

3. Partners closely with product and engineering. Some workflows require new endpoints, permission layers, safety gates, or deterministic fallbacks. This role drives those changes across the stack.

4. Ensures reliability and safety at every step. Guardrails, validation logic, exception handling, safe execution paths — all are essential. They confirm that the AI has access to the correct data, the action matches policy, edge cases are accounted for, risky flows have deterministic constraints, and every action is auditable and reversible.

Why this role exists now. Customers don’t want answers; they want outcomes. AI can now deliver those outcomes, but only with the right backend scaffolding. This role modernizes operational architecture and unlocks end-to-end automation.

How these roles work together — the new operating loop. These roles aren’t silos; they’re interdependent parts of one system. The AI ops lead identifies patterns and performance gaps. The knowledge manager resolves inaccuracies or missing content. The conversation designer improves clarity, tone, and flow. The automation specialist expands the system’s ability to take action. Each improvement compounds the next, moving you from early automation to transformational resolution rates through continuous refinement.

This loop is what separates teams that plateau early from teams that scale AI into a reliable, high-performing system, which is the essence of a durable AI strategy.

How to get started (even if you can’t hire all four roles today). Most teams phase into this model: assign partial ownership, formalize responsibilities, then specialize as AI volume grows. Here’s the progression I recommend.

Phase 1: Assign ownership. Give each role’s core responsibilities to someone who can devote five to ten hours weekly. Early on, support ops, enablement, senior ICs, and technically inclined teammates can anchor the work.

Phase 2: Formalize the responsibilities. As AI resolves more queries, optimization becomes core operational work. Formalizing ownership prevents performance drift and knowledge debt.

Phase 3: Specialize and hire. Once AI handles 50–70% of incoming volume, these responsibilities become full-time roles. Investing in specialization becomes essential infrastructure for the next scale stage.

The bottom line. AI changes the shape of your support team. These four roles — AI operations lead, knowledge manager, conversation designer, and support automation specialist — form the backbone of the AI-first support organization. They bring order to a constantly changing environment and enable AI to deliver the outcomes leaders and customers expect heading into 2026.

Next week, we’ll continue the 2026 planning series with a deep dive into org design models for AI-first support teams — how to structure people, workflows, and accountability in a world where AI resolves most conversations before a human ever sees them.

Inspired by this post on The Intercom Blog.

December 2, 2025
AI Product Owner in 2026: The High-Impact Role Every Team Needs to Win With AI

By 2026, the AI Product Owner will be the keystone role that turns AI strategy into measurable business outcomes. In my teams, this seat bridges market insight, model capability, data governance, and shipping velocity—so product decisions are not just clever, but compliant, reliable, and fast.

I often describe the remit simply: "Here is your clear guide to the AI product owner role (skills, responsibilities, how it differs from PM) and ways AI tools supercharge delivery." In practice, the AI Product Owner translates business goals into model-backed experiences, aligns cross-functional execution, and ensures the product’s AI behavior remains safe, lawful, and on-brand under real-world constraints.

How does this differ from a traditional PM? While Product Management sets portfolio strategy, positioning, and market narratives, the AI Product Owner owns the AI experience end-to-end: data readiness, evaluation harnesses, safety guardrails, and the iterative model improvements that drive outcome-based rather than output-based OKRs. I anchor the role inside empowered product teams and product trios (PM/Design/ML Eng) to keep discovery continuous and delivery disciplined.

On responsibilities, I expect four pillars. First, discovery: continuous discovery with customers and internal experts to uncover use cases where generative AI or LLMs beat the status quo. Second, experience: define the right interaction patterns for AI UX, including retrieval-first pipeline choices, context window management, and feedback loops for human-in-the-loop correction. Third, governance: privacy-by-design, AI risk management, data governance, and regulatory compliance baked into the roadmap. Fourth, delivery: CI/CD for models and prompts, observable evaluation with A/B testing and minimum detectable effect (MDE), and SRE-grade incident management when AI behavior drifts.

Skills-wise, I look for product sense plus technical fluency. That includes a working knowledge of LLMs (prompting, grounding, RAG), analytics mastery (Amplitude analytics, retention analysis, activation metrics), and comfort with DORA metrics such as deployment frequency to keep iteration high but safe. Strong stakeholder management and clear writing are non-negotiable: AI capabilities evolve fast, and leaders must see risk, cost, and ROI with no ambiguity.

AI tools truly supercharge delivery when they eliminate bottlenecks. My practical stack: an AI product toolbox with Claude Code and a ChatGPT connector for rapid prototyping; CustomGPT workflows for support triage and internal knowledge; Pendo product tours and in-app guides to validate behavior changes; Intercom for customer support AI; and tight CRM integration via HubSpot to measure revenue impact. The outcome is faster idea-to-learning cycles, sharper telemetry, and far cleaner handoffs.

For roadmapping, I prioritize thin slices that prove value early—shipping narrowly scoped assistants or copilots, then expanding with product roadmapping and sprint planning that ties capability unlocks to outcomes. A unified analytics platform helps compare human-only baselines to augmented workflows, while agentic AI patterns automate routine steps under strict guardrails.

Risk is a product surface, not a side task. I require explicit policy gates (PII handling, red-teaming, bias audits), clear escalation paths, and incident playbooks. When we treat policy and reliability as features, customers reward us with deeper adoption and higher trust.

If you’re pursuing the AI Product Owner path, build a portfolio around shipped learnings: the experiment you killed with data, the safety constraint you designed, the postmortem you led, and the business metric you moved. That story—evidence of disciplined discovery, responsible delivery, and real-world results—is exactly what teams (and boards) want to see in 2026.

Inspired by this post on Product School.

November 26, 2025
Mastering Data Governance in the AI Era: Move Fast, Reduce Risk, and Unlock Trusted Insights

Every week, I’m in conversations with product leaders, engineers, and security teams who are trying to ship AI features faster without compromising trust. The tension is real: stakeholders want velocity, customers want transparency, and regulators want accountability. That’s exactly where modern data governance earns its keep.

New AI pressures are redefining what good governance takes. Learn how to build better frameworks, move fast with confidence, and keep your data from being a black box.

In my role leading product management, I’ve learned that robust data governance isn’t a compliance checkbox—it’s a strategic capability. When we treat governance as a product, we architect for clarity, safety, and speed. That means aligning AI Strategy with day-to-day delivery so teams know what they can ship, when, and why.

Here’s the practical blueprint I rely on. First, establish ownership and a shared language. Create a living data catalog, lineage maps, and clear data classifications so teams know which assets are sensitive, regulated, or eligible for training LLMs. Second, harden privacy-by-design and least-privilege access. Bake PII detection, secrets management, and role-based policies directly into your workflows. Third, bring quality and observability to the forefront: instrument data contracts, monitor drift, and track model performance across environments. Finally, implement model governance end to end—dataset cards, model cards, bias testing, human-in-the-loop review, and a repeatable evaluation harness.

To move fast with confidence, make governance invisible and automated. Treat policies as code in CI/CD, gate deployments with pre-merge checks, and fail builds that violate data contracts. Log prompts and outputs responsibly, route unsafe patterns to red-teaming, and use a retrieval-first pipeline to anchor models on verified sources rather than fragile context stuffing. This is how we scale AI product development while keeping audit trails complete and costs in check.
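To make "fail builds that violate data contracts" concrete, here is a minimal sketch in Python. It checks one sample record against a hand-written contract and returns violations that a CI job could treat as a failed check. The names `FieldRule`, `check_contract`, and `EVENT_CONTRACT`, the field list, and the PII-approval flag are all illustrative assumptions, not the API of any specific governance tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """One field in a hypothetical data contract (illustrative, not a real tool's schema)."""
    name: str
    dtype: type
    required: bool = True
    contains_pii: bool = False


def check_contract(record: dict, rules: list, pii_approved: bool = False) -> list:
    """Return human-readable violations; an empty list means the record passes."""
    violations = []
    for rule in rules:
        value = record.get(rule.name)
        if value is None:
            if rule.required:
                violations.append(f"missing required field: {rule.name}")
            continue
        if not isinstance(value, rule.dtype):
            violations.append(f"wrong type for {rule.name}: expected {rule.dtype.__name__}")
        if rule.contains_pii and not pii_approved:
            violations.append(f"PII field present without approval: {rule.name}")
    return violations


# Assumed contract for an events feed; the fields are made up for this example.
EVENT_CONTRACT = [
    FieldRule("event_id", str),
    FieldRule("amount", float),
    FieldRule("email", str, required=False, contains_pii=True),
]
```

In a pipeline, a small wrapper could run `check_contract` over a sample of records before merge and exit non-zero when any violation is returned, which is what turns the policy into a gate rather than a guideline.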

Avoiding the black-box problem starts with transparency. Document assumptions, training data sources, and known limitations—then expose explanations where it matters in the product experience. Pair this with a unified analytics platform to tie telemetry, feature flags, and user feedback to model changes. When something goes sideways, your observability, incident management playbooks, and threat detection and response processes should make root-cause analysis fast and defensible.

If you’re building your program from scratch, use a 30-60-90 approach. In the first 30 days, inventory systems, classify data, and map high-risk use cases. By day 60, formalize RACI for governance, deploy access controls, and set up your evaluation pipeline with golden datasets and measurable acceptance thresholds. By day 90, operationalize incident response, conduct tabletop exercises, and wire governance outcomes into OKRs—think time-to-approval for high-risk changes, reduction in production incidents, and model evaluation pass rates.

This playbook pays off in board conversations and with customers. You can articulate your AI risk management posture, show measurable progress on regulatory compliance, and demonstrate how governance accelerates—not hinders—delivery. Most importantly, your teams gain the confidence to experiment, knowing there’s a safety net that protects users, the brand, and the business.

If your organization is wrestling with how to balance innovation and control, start small, codify what works, and scale with intent. With the right foundations in data governance, AI becomes an engine for durable advantage—not a source of sleepless nights.

Inspired by this post on Amplitude – Perspectives.

November 21, 2025
How We Built an AI Sleep Coach: CBTI, Voice AI, and a Product Playbook for Better Rest

What if your morning started with a helpful check-in from a voice AI that actually improves your sleep—using the same core principles that typically cost thousands of dollars and come with year-and-a-half waitlists? That idea energizes me as a product leader, because it blends clinical-grade outcomes with consumer-grade accessibility. Recently, I dug into how the team at Rest built an AI sleep coach inspired by Cognitive Behavioral Therapy for Insomnia (CBTI), and why their method offers a repeatable blueprint for complex, personal AI products.

The origin story is a classic product discovery moment. Rest’s team noticed that a meaningful slice of users in their podcast app were using audio to fall asleep. Although it represented only about 10% of users, that group showed a high willingness to pay. That signal pushed them to explore a dedicated sleep solution, moving from a general audio app to a targeted sleep experience—and eventually toward an AI-powered coach as LLMs matured.

Through jobs-to-be-done research, they identified a clear, underserved segment: “DIY sleep hackers.” These are motivated users who want agency, structure, and results without navigating clinical systems. Choosing CBTI (a clinically proven approach with 80% efficacy) gave the product a strong evidence-based foundation while remaining accessible as a wellness tool. It’s the kind of strategic choice I look for: credible, measurable, and aligned with user motivation.

The product evolution moved in smart, incremental steps. Rest started with a basic text chatbot before graduating to a voice-first experience—using Vapi for voice and OpenAI for reasoning. Voice changed the relationship dynamic: it increased intimacy, lowered friction for daily check-ins, and made behavioral coaching feel human without pretending to be. The team built a memory system that tracks context (like traveling or having a dog) with time-based relevance, which keeps conversations fresh, respectful, and genuinely personalized.
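I have not seen Rest’s implementation, so treat the following as a hypothetical sketch of what "memory with time-based relevance" could look like rather than a description of their system. Each remembered fact carries an optional relevance window: a trip expires after a few days, while a durable fact such as owning a dog does not. `MemoryItem` and `active_facts` are names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryItem:
    """A remembered fact; hypothetical structure, not Rest's actual schema."""
    fact: str
    noted_at: datetime
    relevant_for: Optional[timedelta] = None  # None means the fact does not expire


def active_facts(memories: list, now: datetime) -> list:
    """Return the facts that are still relevant at `now`."""
    return [
        m.fact
        for m in memories
        if m.relevant_for is None or now - m.noted_at <= m.relevant_for
    ]
```

Only the facts returned by `active_facts` would be placed in the prompt for a given check-in, which is one simple way to keep the coach from asking about a trip that ended weeks ago.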

Daily engagement is driven by dynamic agendas that adapt based on sleep data, the user’s stage in the program, and their recent compliance. I love this mechanic: it operationalizes behavior change by sequencing the right intervention at the right time. In parallel, they developed text via OpenAI Assistants while building voice with Vapi, which let them ship value while learning in two modes. They also moved from massive system prompts to RAG for general sleep knowledge, keeping personal user context in the prompt—reducing brittleness while improving scalability.

Because sleep sits close to healthcare, the team drew a firm line between wellness and medical positioning. They implemented clear guardrails: no diagnosis, no medication advice, and strong boundaries on scope. Weekly error analyses with domain experts (sleep therapists) tightened quality and tone, and they adopted LLM-powered evals to enforce safety boundaries. For observability and evaluations, they leveraged Langfuse, and they experimented with Hamming for voice testing to refine the experience end-to-end.

Under the hood, this is a great example of “one bite of the apple at a time” product building in AI. Start with a simple interface, anchor on an evidence-based method, layer personalization with memory, formalize program structure with dynamic agendas, and shift to RAG when general knowledge outgrows prompt engineering. As a product leader, I see strong echoes of agentic patterns here—goal-oriented orchestration, stateful memory, and adaptive planning—shipped in pragmatic increments rather than as a monolithic platform rewrite.

A few takeaways I’m applying with my teams: First, segment deeply and pick a high-intent niche (those “DIY sleep hackers” were the right beachhead). Second, let modality fit the job—voice is not a gimmick when it boosts compliance and empathy. Third, design safety and scope from day one if you’re anywhere near health. Finally, invest early in evals and observability so you can improve with confidence, not hope.

If you want to explore the full conversation and product decisions, you can listen here: Spotify | Apple Podcasts.

Resources & Links:

Rest – AI sleep coach app

Vapi – Voice agent platform Rest uses

Langfuse – Observability and evals platform

Hamming – Voice testing platform

AI Evals Maven Course by Hamel Husain and Shreya Shankar

Bottom line: Rest demonstrates how to take a clinically grounded method like CBTI, translate it into a daily voice-first experience, and ship it with rigor. If you’re building in AI, this is a model worth studying—practical, safe, and deeply user-centered.

Inspired by this post on Product Talk.

November 20, 2025
High-Quality Data, High-Velocity AI: My Product Playbook for Governance, Trust, and Scale

Every breakthrough we ship in AI reinforces a simple truth I live by: "Companies that prioritize data quality, governance, and structure will accelerate their AI initiatives the fastest." That statement captures the difference between flashy demos and durable, scalable products. In my experience, the strongest AI Strategy starts with the discipline to treat data as a product, not an afterthought.

When teams rush to production with generative AI or LLMs, the first issues rarely come from the model itself; they come from the data. Poor lineage leads to hallucinations, inconsistent schemas inflate costs, and weak access controls erode trust. For product managers working with LLMs, this is the gap between a compelling prototype and a reliable system customers depend on every day.

Let me clarify what I mean by data quality, governance, and structure. Quality is completeness, accuracy, freshness, and consistency across sources. Governance is policy, ownership, and accountability—privacy-by-design, regulatory compliance, and AI risk management built in from day one. Structure is the architecture: clear data contracts, standardized schemas, metadata and lineage, and role-based access that keeps sensitive signals protected while enabling speed.

Here’s the product playbook I use to operationalize this. First, map critical sources and define data contracts at the edges so producers and consumers can move independently. Second, standardize schemas and entity resolution to eliminate ambiguous joins. Third, enforce privacy-by-design with policy-as-code and automated redaction. Fourth, converge analytics into a unified analytics platform so definitions, freshness, and observability are shared. Fifth, instrument end-to-end lineage and quality SLAs with alerting. Finally, close the loop with human feedback and labeling to continuously improve model performance.

For generative AI workloads, a retrieval-first pipeline is essential. Unify trusted sources (product analytics, CRM, support, docs), embed and index them with guardrails, and focus on context window management to keep prompts lean, relevant, and cost-effective. This approach improves response quality, reduces token spend, and makes updates near-real-time—without retraining the base model every week.
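A minimal sketch of context window management, assuming retrieval has already returned chunks with relevance scores: keep the highest-scoring chunks that fit a token budget and drop the rest. The word-count token estimate is a deliberate simplification (a real system would use the model’s tokenizer), and `estimate_tokens` and `pack_context` are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(scored_chunks: list, budget_tokens: int) -> list:
    """Greedily keep the highest-scoring chunks that fit within the token budget.

    `scored_chunks` is a list of (relevance_score, text) pairs; scores are assumed
    to come from whatever retriever is in use.
    """
    selected = []
    used = 0
    for score, text in sorted(scored_chunks, key=lambda chunk: chunk[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected
```

The greedy choice is not optimal packing, but it is easy to reason about and keeps the most relevant source in the prompt first, which is usually the property that matters for answer quality and token spend.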

Measure what matters. Tie model outcomes to product metrics through rigorous A/B testing, and size experiments with minimum detectable effect (MDE) so you can ship confidently. Use product analytics to verify that better data actually improves activation, retention, and support deflection. When teams can trace an AI improvement back to a specific data-quality fix, they invest in governance with conviction.
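To show how MDE drives experiment size, here is a small sketch using the standard normal-approximation formula for comparing two conversion rates. It assumes a two-sided test, equal-sized arms, and an absolute lift; the function name and the default significance and power values are my choices for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of `mde_abs`.

    Uses the normal approximation for two proportions; defaults are assumptions.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)
```

Because the required sample grows with the inverse square of the MDE, halving the effect you want to detect roughly quadruples the traffic you need. That is the practical reason to agree on the MDE before the experiment starts rather than after the results arrive.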

Culture closes the gap. Empowered product teams and product trios (PM, design, engineering) make crisper decisions when data stewards are embedded and accountable. Clear ownership, shared definitions, and transparent dashboards reduce friction with security and compliance while speeding up delivery. This is how product management leadership sustains velocity without trading away trust.

The bottom line: if we want faster, safer, and more scalable AI, we start with the data. Build strong foundations, treat governance as enablement, and structure every step so improvements compound. With that in place, generative AI stops being a science experiment and becomes a durable competitive advantage.

Inspired by this post on Amplitude – Perspectives.

November 19, 2025
AI Won’t Replace Engineers—Engineers Using AI Will: A Practical Playbook for Your Next Move

Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.

I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.

Today’s gen AI assistants, such as Claude Code and a ChatGPT connector, excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.

What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI Strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.

Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, stable change failure rate, and faster MTTR. Pair that with outcome-based rather than output-based OKRs to avoid gaming the system: shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
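As a rough sketch of how those four DORA signals can be computed from deployment records, the example below assumes each record carries a commit time, a deploy time, a failure flag, and an optional restore time. The `Deployment` shape and `dora_summary` function are hypothetical; real data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Assumed record shape for one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_summary(deployments: list, window_days: int) -> dict:
    """Compute the four DORA metrics over a window; expects a non-empty list."""
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": mean(restore_hours) if restore_hours else None,
    }
```

Running the same summary for the period before and after an AI pilot gives a like-for-like comparison, provided the window length and the definition of a failed change stay fixed.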

Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.

If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.

How to upskill fast: build an AI product toolbox and ship small. Prototype gen AI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Leverage CustomGPT workflows for internal enablement and in-app guides or product tours to onboard users safely.

Here’s a pragmatic 90-day plan. Week 0–2: audit your top 10 engineering tasks by time spent; identify 3 that are ripe for AI augmentation. Week 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Week 7–10: productionize the wins; document runbooks; add incident management paths. Week 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.

AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.

Inspired by this post on Product School.

November 12, 2025

How to Evaluate AI Voice Support in Real-World Conditions

You have a shortlist of AI voice support products, a polished recording, and a decision that could affect thousands of customer conversations. The hard question is not whether an agent can sound convincing during one ideal call. It is whether the system stays useful when a caller interrupts, corrects themselves, asks an ambiguous question, waits on a backend system, or needs a human.

You can answer that question before a broad rollout. The method is to test complete support outcomes, introduce controlled complications, score failures separately from conversational polish, and use the result to define a limited production pilot.

Evaluate the support outcome, not the performance

A natural voice can create an impression of competence before the agent has done anything useful. Pleasant pacing, expressive speech, and a quick opening matter, but they cannot compensate for retrieving the wrong account, misunderstanding the request, or claiming that an action succeeded when it did not.

Treat the unit of evaluation as a completed support job. Depending on the intent, that job may require the agent to identify the caller, understand the request, retrieve the right information, explain the answer, perform an authorized action, confirm the resulting state, and send a follow-up or transfer the conversation. If you score only the spoken answer, you leave most of the product untested.

One live Fin Voice call illustrated this end-to-end standard in about 90 seconds: the agent verified identity, retrieved account information, managed an interruption, presented options, completed a workflow, and sent a follow-up email. That sequence is a useful model for constructing a test. It is not, by itself, proof of reliability across other calls.

Before anyone places a test call, write an outcome contract for each scenario:

Caller goal: What is the person trying to accomplish?
Starting state: What customer, account, order, subscription, or case data exists before the call?
Available evidence: Which knowledge, policies, and records may the agent use?
Permitted actions: What may the agent change, create, send, cancel, or escalate?
Required clarification: Which missing or conflicting facts must be resolved before an answer or action?
Completion evidence: What observable state proves that the request was resolved?
Unacceptable outcome: What error would make the call a failure even if the conversation sounded good?

This contract prevents a common scoring mistake: confusing non-transfer with resolution. A call can remain inside the AI channel and still leave the customer with a wrong answer, an incomplete action, or no idea what happens next. Conversely, an intentional transfer can be the correct resolution when the agent reaches a policy, permission, or confidence boundary.
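One way to keep the outcome contract honest is to write it as data and score resolution from evidence. The sketch below is an assumption about how you might encode it, not a feature of any voice product: the field names mirror the list above, and `transfer_is_valid` is an extra flag I added to mark scenarios where a handoff is the intended resolution.

```python
from dataclasses import dataclass


@dataclass
class OutcomeContract:
    """Hypothetical encoding of one scenario's outcome contract."""
    caller_goal: str
    starting_state: dict
    available_evidence: list
    permitted_actions: list
    required_clarification: list
    completion_evidence: dict        # backend fields that must hold after the call
    unacceptable_outcomes: list      # error labels that fail the call outright
    transfer_is_valid: bool = False  # True when a handoff is the intended resolution


def is_resolved(contract: OutcomeContract, final_state: dict,
                observed_errors: list, transferred: bool) -> bool:
    """Score resolution from evidence, not from whether the call stayed with the AI."""
    if any(err in contract.unacceptable_outcomes for err in observed_errors):
        return False
    if transferred:
        return contract.transfer_is_valid
    return all(final_state.get(key) == value
               for key, value in contract.completion_evidence.items())
```

Note that a call that never transfers still returns `False` unless the final backend state matches the completion evidence, which is the non-transfer versus resolution distinction expressed in code.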

Build scenarios around the ways real calls become difficult

Start with support intents your operation actually receives. Prioritize intents that are frequent, expensive to handle, important to customer trust, or dependent on multiple systems. Do not begin with trivia questions that merely demonstrate broad language-model knowledge. You are evaluating support execution.

For every core intent, create a straightforward case and several controlled variants. Keep the customer objective constant while changing one condition at a time. That makes a failure diagnosable instead of merely disappointing.

A practical scenario matrix

Clean path: The caller gives the relevant facts in a clear order. This establishes whether the basic workflow works at all.
Missing information: Omit a detail the agent needs. Check whether it asks a focused question instead of guessing or restarting the intake.
Ambiguous intent: Use wording that could map to two support issues. The agent should disambiguate before retrieving data or taking action.
Mid-call correction: Let the caller change an account detail, date, product, or preferred option. Check whether the corrected fact replaces the old one throughout the workflow.
Interruption: Speak while the agent is answering. Observe whether it stops cleanly, understands the new input, and continues from the right point.
Backend delay: Introduce a slow retrieval or action. Evaluate how the agent manages the wait and whether it distinguishes a pending operation from a completed one.
Backend failure: Make a required system unavailable or return an error. The agent should not fabricate a result or promise completion it cannot verify.
Policy boundary: Ask for something the agent is not allowed to do. Test the explanation, alternatives, and escalation path.
Human request: Ask directly for a person. Verify that the agent follows the configured policy without turning the handoff into an argument.
Listening conditions: If your deployment must support different languages, accents, devices, or noisy environments, test each condition explicitly rather than treating one clear studio call as representative.
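The matrix above can be generated rather than hand-maintained. This sketch builds one clean path per intent plus one variant per complication, never combining two complications in a single scenario. The complication labels are my shorthand for the rows above; listening conditions are left out because they depend on your deployment, and `build_scenarios` is a hypothetical helper.

```python
# Shorthand labels for the complications in the matrix (assumed naming).
COMPLICATIONS = (
    "missing_information",
    "ambiguous_intent",
    "mid_call_correction",
    "interruption",
    "backend_delay",
    "backend_failure",
    "policy_boundary",
    "human_request",
)


def build_scenarios(intents: list, complications: tuple = COMPLICATIONS) -> list:
    """One clean path per intent, plus one variant per complication (never combined)."""
    scenarios = []
    for intent in intents:
        scenarios.append({"intent": intent, "complication": None})  # clean path
        for complication in complications:
            scenarios.append({"intent": intent, "complication": complication})
    return scenarios
```

Two intents with the eight complications above yield 18 scenarios, which is a useful sanity check on how much live testing time a shortlist of intents will actually require.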

Give testers the goal, account state, and one complication. Do not script every sentence. A fully written dialogue tests whether the agent can follow the dialogue you anticipated; a goal-based scenario tests whether it can manage the conversation the caller actually creates.

Keep a few variants undisclosed until the live session. This is not a trick. It prevents the evaluation from becoming a memorized path while still keeping every test fair and reproducible. Record the exact variant afterward so another evaluator can run it again.

Run the call through the systems you expect to deploy

An unedited live call is more informative than a produced recording, but live alone is not enough. A live test can still use ideal data, a simplified integration, a practiced caller, and a workflow that avoids the hard parts of your environment.

Ask to run the scenario through a path that resembles the intended deployment:

Place a normal phone call through the proposed telephony route. If production will use call forwarding, test the forwarding path rather than a direct internal endpoint.
Use a safe test account containing representative records, permissions, and history.
Require the agent to retrieve data from the backend system that will be authoritative in production.
Introduce the chosen interruption, correction, ambiguity, delay, or error during the live conversation.
Require a real test action where it is safe to do so, not a verbal description of what the agent would have done.
Inspect the backend state after the call. Confirm that the correct record changed once, with the expected values.
Verify every promised follow-up, case creation, notification, or handoff outside the voice channel.
Retain the recording, transcript, timestamps, tool activity, and final system state for scoring.

This is especially important when an agent can take consequential actions. A fluent confirmation is not evidence that the action happened. The system of record is the evidence.
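Checking the system of record can be a short script instead of a manual look. The sketch below assumes you can export a snapshot of the relevant records, keyed by record id, before and after the call; it then confirms that only the intended record changed and that it holds the expected values. The snapshot format and `verify_single_change` are assumptions for illustration.

```python
def verify_single_change(before: dict, after: dict, record_id: str, expected: dict) -> list:
    """Compare backend snapshots keyed by record id and list anything unexpected.

    An empty list means exactly the intended record changed, with the expected values.
    """
    problems = []
    changed = sorted(
        rid for rid in before.keys() | after.keys()
        if before.get(rid) != after.get(rid)
    )
    if changed != [record_id]:
        problems.append(f"expected only {record_id} to change, but changed: {changed}")
    record = after.get(record_id, {})
    for key, value in expected.items():
        if record.get(key) != value:
            problems.append(f"{record_id}.{key} is {record.get(key)!r}, expected {value!r}")
    return problems
```

Because new record ids appear in the diff too, the same check catches duplicated work such as a second case or a second subscription created by a retried action.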

Repeat important scenarios with different wording and a different caller. One successful run demonstrates that the capability can work. Repeated variants reveal whether the capability depends on a narrow phrase, a rehearsed cadence, or an unusually forgiving path.

Key takeaways

Score complete resolution, including backend state and follow-up, rather than voice quality alone.
Change one condition at a time so you can identify why a call failed.
Test interruptions, corrections, ambiguity, system delays, system errors, and escalation.
Measure different kinds of waiting separately; a lookup pause and a turn-detection problem are not the same defect.
Treat a successful demo as evidence for a pilot, not permission for an unrestricted rollout.

Score conversation, reasoning, and operational closure separately

A single overall rating hides the information you need to make a product decision. The call may sound awkward but reach the correct outcome, or sound excellent while making a dangerous mistake. Separate the evaluation into three layers.

Layer	What to inspect	Evidence of a pass	Typical failure
Conversation mechanics	Turn detection, interruption handling, pacing, response length, and intelligibility	The caller can speak naturally, correct the agent, and follow the response without fighting for the floor	The agent talks over the caller, leaves confusing silence, or delivers answers too long to retain by ear
Decision quality	Intent recognition, clarification, use of account context, policy application, and answer accuracy	The agent asks only for missing information, uses the correct evidence, and avoids unsupported conclusions	The agent guesses, asks redundant questions, ignores a correction, or applies the wrong policy
Operational closure	Identity checks, tool calls, state changes, confirmation, follow-up, and escalation	The verified backend state matches the caller’s request and the agent’s final explanation	The agent claims success without a completed action, changes the wrong record, duplicates work, or drops context during handoff

Use a simple 0-2 score for each criterion: 0 for failed or unsupported, 1 for completed with material caller effort or recovery, and 2 for correct and usable. The scale is deliberately small. Evaluators can usually distinguish failure, friction, and success more consistently than they can defend the difference between seven and eight on a ten-point scale.

Do not average away critical errors. A wrong account action, failed identity control, fabricated completion, or forbidden disclosure should remain visible as a release blocker even if many low-risk calls receive high scores. Record both the criterion scores and the count of critical failures.
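A small sketch of that reporting rule: average the 0-2 scores per layer, but carry critical failures as a separate list that blocks release regardless of the averages. The call record shape, the layer keys, and `summarize_scores` are assumptions made for this example.

```python
from statistics import mean

# Assumed keys for the three layers in the table above.
LAYERS = ("conversation", "decision", "closure")


def summarize_scores(calls: list) -> dict:
    """Average 0-2 scores per layer while keeping critical failures as a separate signal."""
    summary = {layer: mean(call["scores"][layer] for call in calls) for layer in LAYERS}
    critical = [call["id"] for call in calls if call.get("critical_failures")]
    summary["critical_failure_calls"] = critical
    summary["release_blocked"] = bool(critical)
    return summary
```

With this shape, a batch of mostly perfect calls and one fabricated completion still reports `release_blocked` as true, so the blocker cannot disappear into a high average.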

Break latency into moments the caller can feel

Latency is not one number. Capture at least three moments: the time the agent takes to recognize that the caller has finished, the time it spends reasoning or waiting for a system, and the time needed to begin and complete the spoken response.

End-of-turn delay: A long delay after every caller turn makes the exchange feel unresponsive and can encourage both sides to start speaking at once.
Reasoning or retrieval delay: A pause can be appropriate when the agent is checking account data or invoking a backend workflow. In the live call, brief pauses were audible during subscription and backend checks, and hearing those waits is more informative than having them edited out.
Response delivery: A fast start does not help if the answer becomes a long monologue. Voice responses need structure and pacing that work for listening, not merely text that sounds acceptable when read.

Ask what is happening during a pause. If the system is doing useful work, the next statement should reflect that work and the action log should verify it. If the pause is long enough to make a caller wonder whether the call has dropped, the experience needs an appropriate progress cue. If the agent answers instantly but guesses, speed is concealing a quality problem.

Review individual timings as well as an average. A generally responsive agent with occasional severe stalls creates a different operational problem from one that is consistently a little slow. Your test recordings and timestamps should make both patterns visible without inventing a universal pass threshold that ignores the complexity of the workflow.
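To make both patterns visible, the sketch below reports the median and the worst case for each of the three latency segments, along with the turn where the worst case occurred. It assumes you have per-turn timings in seconds from your recordings or logs; the segment keys and `latency_profile` are illustrative names, and no pass threshold is applied.

```python
from statistics import median

# Assumed keys for the three caller-perceived latency segments.
SEGMENTS = ("end_of_turn", "reasoning", "delivery")


def latency_profile(turns: list) -> dict:
    """Report median and worst case per segment so rare stalls are not averaged away."""
    profile = {}
    for segment in SEGMENTS:
        values = [turn[segment] for turn in turns]
        worst = max(turns, key=lambda turn: turn[segment])
        profile[segment] = {
            "median_s": median(values),
            "max_s": worst[segment],
            "worst_turn": worst["turn_id"],
        }
    return profile
```

A segment whose median is low but whose maximum is large points at occasional stalls, and the `worst_turn` id tells you which recording to replay first.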

Make recovery and escalation part of the product test

The strongest voice experiences are not the ones that never encounter confusion. They are the ones that recover without making the caller restart. Recovery is therefore a capability to test, not an embarrassing exception to hide.

Interrupt the agent in the middle of an answer. Correct a fact it has already used. Add a second request after the first appears resolved. Say that an explanation was unclear. Ask for a human. These moves reveal whether the agent maintains conversational state or merely produces plausible turns one at a time.

During recovery, look for specific behavior:

It stops speaking promptly when the caller takes the turn.
It identifies what changed instead of repeating the whole interaction.
It replaces corrected information rather than carrying both versions forward.
It asks a narrow clarification when the next action is uncertain.
It does not claim to understand when the transcript or subsequent action shows otherwise.
It preserves verified context and the reason for contact when a human takes over.
It tells the caller what will happen next instead of ending on an internal routing label.

Tone belongs in this test, but not as a beauty contest between synthetic voices. Evaluate whether pacing, brevity, acknowledgement, and word choice suit the moment. A caller correcting a billing detail needs a clear acknowledgement and an accurate update, not theatrical empathy. A caller who sounds uncertain may need a shorter explanation and a confirming question. Tone is the behavior of the conversation, not just the timbre selected in a settings menu.

Escalation should also count as a valid outcome when it is timely and informed. Define which conditions require a handoff, which allow one, and what context must travel with it. Then test the handoff from the caller’s side. If the customer reaches a person but has to repeat identity, intent, and every attempted step, the routing technically worked while the support experience failed.

Turn the evaluation into a controlled pilot decision

A strong live evaluation earns the right to run a pilot. It does not justify sending every eligible call to the agent. Production introduces variation in callers, data quality, traffic, integrations, and issue combinations that a demonstration cannot reproduce fully.

I would require five gates before approving even a limited external pilot:

Capability gate: Every must-have intent has completed its end-to-end workflow, including at least one controlled complication.
Critical-risk gate: No unresolved failure can expose the wrong account, bypass a required check, perform an unauthorized action, or report a false completion.
Conversation gate: The agent can handle interruptions, corrections, clarification, and explicit human requests without trapping the caller in a loop.
Operations gate: Your team can configure terminology, guidance, escalation behavior, greetings, voice, and deployment controls for the intended support environment.
Learning gate: Owners can inspect recordings, transcripts, tool activity, outcomes, and failures, then change the knowledge, workflow, policy, or conversation design responsible.
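
One way to keep the five gates from blurring into a general impression is to record them as explicit checks. The sketch below is illustrative only: the class name, field names, and pass rules are my assumptions, not a vendor schema, and your own gates will need richer evidence than a list of strings.

```python
from dataclasses import dataclass, field


@dataclass
class PilotEvidence:
    # All field names are hypothetical; adapt them to your own review template.
    intents_required: set[str]
    intents_passed_end_to_end: set[str]
    unresolved_critical_failures: list[str] = field(default_factory=list)
    conversation_checks_failed: list[str] = field(default_factory=list)
    operations_controls_missing: list[str] = field(default_factory=list)
    learning_artifacts_missing: list[str] = field(default_factory=list)


def pilot_gates(evidence: PilotEvidence) -> dict[str, bool]:
    """Return one pass/fail result per gate instead of a blended score."""
    return {
        # Every must-have intent has to appear in the passed set.
        "capability": evidence.intents_required <= evidence.intents_passed_end_to_end,
        "critical_risk": not evidence.unresolved_critical_failures,
        "conversation": not evidence.conversation_checks_failed,
        "operations": not evidence.operations_controls_missing,
        "learning": not evidence.learning_artifacts_missing,
    }


def approve_pilot(evidence: PilotEvidence) -> bool:
    # A single failed gate blocks the pilot; gates are not averaged.
    return all(pilot_gates(evidence).values())
```

Keeping the result per gate makes it obvious which owner has work to do when approval is withheld.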

Start the pilot with a reversible slice of traffic and a clear human fallback. Select intents whose correct outcome can be verified in your systems. Define who reviews failed and escalated calls, who can pause the rollout, and who owns each class of fix. An answer-quality issue, a telephony issue, and a backend integration issue require different owners even when the caller experiences all three as one bad call.

Expand only when observed calls meet the outcome contracts you wrote before the demo. If the definition of success keeps changing after failures appear, the evaluation is no longer protecting the decision.

For your next vendor session, replace “show me your best call” with a scenario pack, a test account, and a request to inspect the final system state. You will learn more from one imperfect call that recovers correctly than from a flawless recording that never had to recover at all.

References

Intercom – Stop Falling for Hollywood Demos: The Unfiltered Truth of Live AI Voice for Support

November 11, 2025

Agentic AI for Incident Response: A Practical Operating Model

An incident fires. Your responders are not short of data; they are short of a trustworthy path through it. Deployment timelines, service ownership, dashboards, logs, runbooks, and prior incidents live in separate places, while the cost of a wrong action rises by the minute.

The decision in front of you is not whether AI can summarize the incident channel. It is whether an agent can shorten the investigation without becoming another failure mode. That requires an operating model covering the agent’s job, context, permissions, interface, and evaluation before you give it meaningful authority.

Give the agent an investigation job before action authority

An incident-response agent should run a goal-directed investigation loop, not wait for isolated prompts like a chatbot. A credible implementation can collect context, form and test hypotheses, and draft fixes inside Slack. The important product decision is where that loop must stop for human judgment.

Model the loop on the work a strong responder already performs:

Scope the incident. Identify the affected service, environment, customer surface, start time, and known symptoms. Preserve unknowns instead of filling them with plausible guesses.
Gather relevant context. Retrieve recent changes, service ownership, dependencies, telemetry, runbooks, feature-flag changes, and similar incidents.
Form competing hypotheses. Produce a ranked set rather than locking onto the first convincing explanation. Distinguish observed facts from inferences.
Test each hypothesis. Use read-only tools to query metrics, logs, traces, deployment state, and dependency health. Record what supports or weakens each possibility.
Propose the next best action. Explain the target, expected effect, risk, preconditions, and recovery path. Do not hide uncertainty behind an authoritative tone.
Update the investigation. Incorporate tool results and responder corrections, discard disproven hypotheses, and choose the next check.

The incident commander remains accountable for priorities and mitigation. The agent acts as an investigation engine: it gathers, tests, organizes, and proposes. This division is more useful than treating human involvement as a final approval click after the AI has already made every material decision.

Choose the first workflow with care. A good starting point has a bounded service area, dependable read-only signals, known responders, established runbooks, and outcomes you can verify after the incident. A workflow that depends on undocumented tribal knowledge or unrestricted production access is not ready for agentic automation. Fix the operating practices around the incident before expecting a model to compensate for them.

Do not begin with the most dramatic remediation you can automate. Early value usually comes from reducing context switching, locating the correct owner, connecting symptoms to recent changes, and eliminating weak hypotheses. Those tasks consume scarce attention but do not require the agent to mutate production.

Context quality determines the ceiling of the investigation

A capable model cannot reason with operational context it cannot find, distinguish, or trust. If a service has three names across the deployment system, observability platform, and incident channel, retrieval becomes unreliable before model reasoning even begins.

Create a context contract for every service placed within the agent’s scope. At minimum, make these fields explicit:

Identity: canonical service name, aliases, repository, runtime, and environment.
Ownership: accountable team, current on-call route, and escalation path.
Topology: upstream dependencies, downstream consumers, data stores, queues, and shared infrastructure.
Change history: deployments, configuration changes, feature flags, migrations, and rollback state.
Operational knowledge: current runbooks, known failure modes, dashboards, alerts, and prior incident records.
Control policy: tools the agent may call, environments it may inspect, actions it may propose, and actions it may never execute.
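
A context contract is easier to enforce when gaps can be listed mechanically. The following is a minimal sketch, assuming the contract is stored as a nested dictionary; the section and field names are hypothetical and simply mirror the list above.

```python
# Hypothetical required fields, grouped by the contract sections described above.
REQUIRED_FIELDS = {
    "identity": ["canonical_name", "repository", "runtime", "environment"],
    "ownership": ["team", "on_call_route", "escalation_path"],
    "topology": ["upstream", "downstream", "data_stores"],
    "change_history": ["deployment_source", "feature_flag_source"],
    "operational_knowledge": ["runbooks", "dashboards"],
    "control_policy": ["allowed_tools", "inspectable_environments", "forbidden_actions"],
}


def contract_gaps(contract: dict) -> list[str]:
    """List every required field that is absent or empty in a service contract."""
    gaps = []
    for section, names in REQUIRED_FIELDS.items():
        values = contract.get(section, {})
        for name in names:
            if values.get(name) in (None, "", [], {}):
                gaps.append(f"{section}.{name}")
    return gaps
```

A service with a non-empty gap list would stay outside the agent's scope until its owners fill the missing fields.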

Start retrieval with exact operational signals. Filter by canonical service, environment, incident time window, deployment identifier, alert type, and ownership tag. Then rerank the surviving records for the current question. This deterministic tagging and reranking foundation is easier to debug than making semantic similarity responsible for every retrieval decision.
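
The filter-then-rerank order can be sketched in a few lines. This is an assumption-laden illustration: the Record fields are hypothetical, and the term-overlap score is only a stand-in for whatever reranker you actually use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    # Hypothetical shape for a retrievable operational record.
    service: str
    environment: str
    timestamp: datetime
    text: str


def retrieve(records, service, environment, start, end, question, limit=5):
    """Apply exact filters first, then rerank only the surviving records."""
    # Stage 1: deterministic boundaries that ranking must never override.
    candidates = [
        r for r in records
        if r.service == service
        and r.environment == environment
        and start <= r.timestamp <= end
    ]
    # Stage 2: placeholder reranker based on shared words with the question.
    terms = set(question.lower().split())

    def overlap(record):
        return len(terms & set(record.text.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:limit]
```

Because the environment filter runs before ranking, a staging record cannot outrank a production record no matter how similar its wording is to the question.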

Add embeddings where language actually creates ambiguity: matching an unfamiliar symptom to a differently worded historical incident, finding a relevant paragraph inside a long runbook, or connecting terminology used by two teams. Semantic retrieval should widen discovery, not erase exact boundaries such as production versus staging or one tenant versus another.

Require every retrieved item to carry provenance that a responder can inspect: its system of record, service and environment, creation or update time, incident-time availability, and reason for retrieval. This lets the responder notice four common failures quickly:

A runbook is relevant but stale.
An ownership record is current but was different when the incident began.
A similar incident came from another environment with different dependencies.
A historical evaluation accidentally exposed the final root cause before the agent could have known it.
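
Some of these failures can be flagged automatically once provenance is attached to each item. The sketch below covers three of the four (staleness, environment mismatch, and information that was not yet available); the field names and the 180-day staleness threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Provenance:
    # Hypothetical provenance fields mirroring the requirements above.
    system_of_record: str
    service: str
    environment: str
    updated_at: datetime
    available_at: datetime  # when a responder could first have seen the item
    reason: str


def provenance_warnings(item, incident_environment, as_of, max_age=timedelta(days=180)):
    """Return human-readable warnings for one retrieved item at a point in time."""
    warnings = []
    if item.environment != incident_environment:
        warnings.append("different environment")
    if as_of - item.updated_at > max_age:
        warnings.append("possibly stale")
    if item.available_at > as_of:
        warnings.append("not available at this point in the incident")
    return warnings
```

Detecting an ownership record that changed mid-incident would additionally need a history of ownership, which this sketch does not model.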

Treat missing context as an observable product state. The agent should say that it cannot locate a deployment record or dependency map, identify which system was checked, and propose a safe way to continue. A confident answer assembled around a missing record is more dangerous than an explicit gap.

Scale permissions to reversibility and blast radius

Autonomy is not one switch. It is a set of permissions attached to particular tools, targets, environments, and action classes. Granting broad credentials because the agent usually behaves conservatively turns a model-quality issue into a production-control issue.

Action class	Appropriate agent role	Required human control
Read-only investigation	Query approved telemetry, changes, ownership, and runbooks	Audited access with service and environment boundaries
Recommendation or communication	Draft a diagnostic check, remediation plan, incident update, or escalation	A responder reviews customer-facing messages and consequential recommendations
Bounded, reversible execution	Invoke a preapproved runbook against an explicitly named target	Approval bound to the exact action, target, inputs, and current incident
Irreversible or broad execution	Explain the need and prepare a plan, but do not execute during the initial rollout	Existing change controls and accountable operators remain in force

Do not label an action reversible merely because the interface contains a rollback button. A deployment rollback can still be unsafe after an incompatible schema or data change. A restart can amplify load or destroy useful diagnostic state. Reversibility has to be validated for the specific service state, not inferred from the action name.

For every executable tool, define guardrails outside the prompt:

Use least-privilege credentials scoped by service and environment.
Allowlist tools, targets, and input shapes rather than relying on natural-language prohibitions.
Preview the exact command or workflow, target, parameters, and expected effect before approval.
Bind approval to that exact action so the agent cannot reuse it for a changed target or plan.
Use rate limits, idempotency controls, and circuit breakers where repeated calls could cause harm.
Route production changes through existing CI/CD or runbook automation when possible.
Record retrievals, tool inputs, tool outputs, approvals, denials, and resulting state changes in an audit trail.
Provide a direct way to suspend the agent’s tool access without disabling the incident workflow itself.
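
Binding approval to the exact action is one guardrail that is simple to enforce in code. A minimal sketch, assuming the proposed action is a JSON-serializable dictionary and approvals are held in memory; the ApprovalStore class is hypothetical, and a real system would persist approvals and tie them to the incident and approver.

```python
import hashlib
import json


def action_fingerprint(action: dict) -> str:
    """Hash a canonical form of the action so any changed field changes the hash."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalStore:
    """Illustrative in-memory store of single-use approvals."""

    def __init__(self):
        self._approved = set()

    def approve(self, action: dict) -> str:
        fingerprint = action_fingerprint(action)
        self._approved.add(fingerprint)
        return fingerprint

    def consume(self, action: dict) -> bool:
        """Allow execution only if this exact action was approved, and only once."""
        fingerprint = action_fingerprint(action)
        if fingerprint in self._approved:
            self._approved.remove(fingerprint)
            return True
        return False
```

If the agent changes the target or a parameter after approval, the fingerprint no longer matches and the tool layer refuses the call, regardless of what the prompt says.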

The action proposal should be a control artifact, not a conversational suggestion. It needs the evidence supporting the action, the exact target, the expected observable result, the maximum intended scope, known preconditions, and what the responder will do if the result does not appear. If the agent cannot supply those fields, it has not earned execution authority for that action.

Keep outward communication on a separate permission path. Drafting a status update is low-risk technically but consequential for customers and the business. Human review should verify what is known, what remains uncertain, and whether the message promises a recovery time the evidence cannot support.

Make evidence and uncertainty legible in the incident room

Putting the agent inside the collaboration surface where incidents already unfold reduces the friction of opening another product and re-explaining the situation. It also means the agent’s output competes with urgent human messages. Long narrative answers will be skipped, however intelligent they sound.

Give each investigation update a stable structure:

Observed: facts returned by named systems, with timestamps and links where available.
Hypotheses: ranked explanations with the supporting and conflicting evidence for each.
Changed since the last update: new evidence, rejected hypotheses, and responder corrections.
Next check: the read-only query or tool call most likely to distinguish between the remaining possibilities.
Proposed action: target, expected effect, blast radius, preconditions, and recovery path.
Decision needed: the specific approval, input, or ownership choice required from a human.
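
A stable structure is easier to keep when the update is a data object rather than free text. The sketch below is one possible shape, with hypothetical field names; it renders empty sections explicitly so that a missing hypothesis or decision stays visible instead of silently disappearing.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationUpdate:
    # Each section holds short, checkable statements rather than narrative.
    observed: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)
    changed_since_last: list[str] = field(default_factory=list)
    next_check: list[str] = field(default_factory=list)
    proposed_action: list[str] = field(default_factory=list)
    decision_needed: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Observed", self.observed),
            ("Hypotheses", self.hypotheses),
            ("Changed since the last update", self.changed_since_last),
            ("Next check", self.next_check),
            ("Proposed action", self.proposed_action),
            ("Decision needed", self.decision_needed),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            if items:
                lines.extend(f"  - {item}" for item in items)
            else:
                lines.append("  - (nothing yet)")
        return "\n".join(lines)
```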

This is not a request to expose a model’s private, free-form chain of thought. Responders need a structured evidence trail: claims, retrieved signals, tool results, rejected alternatives, and action rationale. That artifact is more useful for review because each part can be checked against the operational record.

Confidence labels are helpful only when they change behavior. Define what the interface does when confidence is low: ask for a missing service identifier, run another safe check, present multiple hypotheses, or escalate to the owner. Do not display a precise-looking score unless you have evaluated whether that score corresponds to actual correctness in your incident set.

Design human correction as part of the main workflow. A responder should be able to reject a hypothesis, correct the service or environment, mark a retrieved record stale, deny an action, and state why. The agent should preserve that decision in the incident record and replan from it. Repeatedly resurfacing a rejected hypothesis erodes trust even when the underlying model is otherwise capable.

Watch for a subtle interface failure: polished summaries can make weak investigations look complete. Make unresolved questions and conflicting signals visually prominent in the message structure. The goal is not to make the agent sound certain. It is to help the incident commander see what is known, what is inferred, and what decision comes next.

Test against past incidents, then expand authority one boundary at a time

A demo proves that the agent can complete a favorable path. It does not prove that the agent will retrieve the right context, resist a misleading correlation, respect permissions, or propose a safe action when production is ambiguous.

Use post-incident time-travel evaluations. Reconstruct what the agent could have known at each point in a real incident. Begin with the original trigger and expose deployments, telemetry, messages, and tool results only when they became available. Hide the final root cause, later analysis, and corrected metadata until the corresponding point in the replay. Otherwise, you are testing hindsight rather than incident response.
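
The core of a time-travel replay is a filter on when each piece of information became available. A minimal sketch, assuming every event in the reconstructed incident carries an available_at timestamp; the ReplayEvent type and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReplayEvent:
    available_at: datetime  # when responders could first have seen this
    kind: str               # for example "alert", "deploy", "message", "postmortem"
    payload: str


def visible_at(events, replay_time):
    """Return only the events the agent could have known at replay_time, oldest first."""
    return sorted(
        (e for e in events if e.available_at <= replay_time),
        key=lambda e: e.available_at,
    )
```

Stepping replay_time forward and calling visible_at at each step keeps the postmortem and corrected metadata hidden until the moment they actually existed.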

Grade the investigation on operational usefulness, not prose quality:

Scoping accuracy: Did it identify the correct service, environment, symptoms, and ownership route?
Context retrieval: Did it find the relevant change, runbook, dependency, or earlier incident without mixing incompatible records?
Hypothesis quality: Where did the eventual cause appear in the ranked set, and what evidence was used to test it?
Evidence integrity: Does every factual claim match a retrieved record or tool result? Did the agent invent a signal that was never observed?
Tool correctness: Did it select the correct tool, target, environment, and parameters?
Action safety: Was the proposed action inside policy, and were its blast radius, preconditions, and recovery path explicit?
Calibration: Did expressed certainty track actual correctness, especially when context was incomplete?
Time compression: How did the time to a useful hypothesis, correct owner, mitigation decision, and recovery compare with the existing workflow?
Human effort: Which searches, handoffs, repeated explanations, and diagnostic checks did the agent remove or add?

Treat safety failures differently from diagnostic misses. A missed hypothesis is a capability problem. Crossing a permission boundary, inventing evidence, or targeting the wrong environment is a release blocker for that tool path. Averaging all outcomes into one quality score can conceal exactly the failure that matters most.
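
A small example shows how an average can hide a blocker. In this sketch the failure labels and the result format are assumptions; the point is only that release eligibility is computed separately from the mean score.

```python
# Hypothetical labels for failures that block release of a tool path.
BLOCKING_FAILURES = {"permission_violation", "invented_evidence", "wrong_environment"}


def summarize(results):
    """results: list of dicts with a numeric 'score' and a list of 'failures'."""
    blockers = sorted(
        {f for r in results for f in r["failures"] if f in BLOCKING_FAILURES}
    )
    mean_score = sum(r["score"] for r in results) / len(results)
    return {
        "mean_score": mean_score,
        "blockers": blockers,
        # A high mean never overrides a blocking failure.
        "release_ok": not blockers,
    }
```

Nine clean replays plus one replay that targeted the wrong environment still produce a mean score near the top of the scale, which is exactly why the blocker list has to be reported on its own.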

A practical rollout sequence

Instrument the human workflow. Capture incident timelines, ownership changes, diagnostic steps, approvals, mitigations, and outcomes. You need a baseline before claiming improvement.
Replay historical incidents. Use time-bounded context and score the agent against known outcomes. Repair retrieval and service metadata before tuning for eloquence.
Run in shadow mode. Let the agent investigate live incidents without posting conclusions or changing systems. Compare its evidence and hypotheses with the responder’s path.
Expose read-only assistance. Allow responders to request context, hypothesis checks, and draft updates. Collect explicit acceptance, correction, and rejection signals.
Add recommendation mode. Let the agent propose remediations using the structured action artifact, while humans continue to execute through established controls.
Enable one bounded action path. Choose a preapproved runbook with a clear target, validated preconditions, observable effect, and recovery procedure. Keep approval attached to the exact invocation.
Expand by tool and service. Grant additional authority only when evaluation evidence supports that particular boundary. Do not treat success on one service as proof of readiness everywhere.

Re-run the evaluation set after changes to prompts, models, tools, service topology, runbooks, or permissions. An agent can regress even when its general language quality improves. Operational behavior depends on the whole system around the model.

Key takeaways

Start with investigation and context compression; earn execution authority later.
Build deterministic service, environment, time, and ownership filters before depending on semantic retrieval.
Separate observed facts, hypotheses, and proposed actions in every incident update.
Enforce permissions in tools and infrastructure, not only in prompts.
Evaluate with historical time travel so the agent never sees facts that were unavailable during the real incident.
Expand autonomy one action, tool, service, and environment boundary at a time.

The next outage is the wrong time to discover that your agent cannot distinguish a plausible explanation from verified evidence. Before it happens, choose one bounded incident workflow, define its context contract and permission envelope, and replay several real investigations without future information. If the agent can make its evidence legible, stay inside policy, and consistently move responders toward the next correct decision, you have a foundation worth expanding.

References

Shivam.Consulting Blog — How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

November 6, 2025

AI at Home, Impact at Work: Experiments That Supercharged My Product Leadership

I recently listened to an All Things Product episode featuring Teresa Torres and Petra Wille on how experimenting with AI in everyday life sharpens how we build AI-powered products at work. The core premise matches my own AI strategy: low-stakes, personal experiments accelerate confidence, clarify limitations, and build an AI product toolbox we can bring into the office with rigor.

If you want to dive in, you can listen on Spotify or Apple Podcasts. I found the conversation especially relevant for product trios and anyone shaping LLMs for product managers in high-stakes environments.

The idea is simple but powerful: when I prototype with AI at home—where the stakes are low—I learn faster, make safer mistakes, and internalize critical product patterns. Over time, those patterns transfer directly to work: tighter context management, sharper bias awareness, clearer human-in-the-loop guardrails, and a more nuanced view of when to use AI as a thought partner versus when to consider agentic AI.

In my own practice, I’ve mirrored many of the scenarios discussed: using ChatGPT by OpenAI to plan meals, analyze public data sets like school budgets, and even sanity-check real estate evaluations. These seemingly mundane tasks are fertile ground for learning about context window limits, hallucination, AI bias, and privacy-by-design trade-offs. Each experiment helps me craft better prompts, structure data for clarity, and decide when a human review step is non-negotiable, all core habits for AI risk management.

At work, I treat AI as a thought partner for writing, research synthesis, and contract review. I also explore when and how to responsibly evolve toward agentic AI for repeatable workflows. The distinction matters: a thought partner augments judgment; an agent automates execution. Building the right scaffolding—data governance, auditability, constraints, and escalation paths—ensures we unlock speed without compromising safety.

Three lines from the episode stayed with me: “I’m trying to write things that only I can write — that’s my guiding writing light right now.” — Teresa. “The more we use AI, the more we learn what it’s good at, what it’s not good at, and where context becomes a limitation.” — Teresa. “It’s a safer playground — we can build our toolbox at home before bringing those lessons to work.” — Petra. These are practical north stars for product management leadership in the GenAI era.

For anyone getting started, here’s what worked for me: begin with “low-stakes” personal experiments, write down your prompts and outcomes, and reflect on failure modes. Treat each activity as product discovery: What problem am I solving? What outcome matters? What data and context does the model need? Which decisions must stay human-in-the-loop? This discipline builds an AI product toolbox you can confidently apply to real customer problems.

I also keep a running toolkit of references and tools that inform my practice: Context window as a concept helps me size and sequence information. Visual and video tools like Midjourney and Sora expand how I think about multimodal experiences. I rotate between Claude by Anthropic and ChatGPT by OpenAI depending on task fit, and I’ve used Claude Code when I need structured assistance with code review. For knowledge capture and workflow, Readwise and Ghost help me structure insights and ship content.

If you want more structured learning paths, I found Josh Seiden’s Learn AI With Me, A 30-Day Sprint to be a practical primer, and the broader community conversation at Product at Heart Conference is invaluable. For a deeper grounding in risk, I recommend reading up on hallucination, AI bias, and agentic AI, and revisiting the complementary episode, Context is King.

I’d love to hear how you’re experimenting: Where have you seen AI meaningfully reduce toil? Where does it still struggle? How are you balancing creativity, data safety, and compliance as you scale? Drop a comment below and let’s compare notes—especially on patterns that help product trios move faster without sacrificing trust.

Bottom line: start small at home, carry lessons into the office, and build with curiosity and intentionality. That’s how we level up our product discovery, sharpen our value proposition, and lead teams confidently through the GenAI transition.

Inspired by this post on Product Talk.

November 4, 2025

How to Build an Evaluation-Driven AI Innovation Strategy

Your team has several credible AI demos, every sponsor sees potential, and no one can answer the question that matters: which idea deserves more engineering time, customer exposure, and operating risk?

That is not an ideation problem. It is an evidence-design problem. A useful AI innovation strategy makes each investment earn its way forward through customer outcomes, representative evaluations, and explicit kill-or-scale decisions. The result is not less experimentation. It is faster learning with fewer expensive surprises.

Start every AI bet with a decision contract

Most AI roadmaps begin too far downstream. The discussion jumps to a model, an assistant, or an agent before the team agrees on the user problem or the evidence required to fund the next stage. The feature then acquires momentum simply because it exists.

Replace the feature brief with a decision contract. This is a short agreement about what the bet must prove, how it will be evaluated, and what happens when the evidence arrives. It connects vision, portfolio choices, and execution to measurable outcomes before implementation choices harden.

Name the user and the job. Specify who encounters the capability, what they are trying to accomplish, and which situations are out of scope. “Improve support with AI” is not a problem statement. “Help eligible customers resolve account questions without waiting for an agent” is testable.
Choose the business outcome and its baseline. Use resolution rate, time-to-value, activation, retention, revenue lift, or another measure of customer and business value. Record how the existing workflow performs so the AI is compared with a real alternative, not with an empty screen.
State the behavioral hypothesis. Explain how the proposed capability should cause the outcome to move. This exposes weak logic early. A faster response, for example, does not automatically produce a correct resolution.
Define the evidence stack. Identify the offline evaluations needed to establish behavioral confidence and the live experiment needed to validate customer impact. Neither can substitute for the other.
Set constraints and hard guardrails. Include unacceptable failures, privacy boundaries, safe-action requirements, latency expectations, and cost limits. A capability that is accurate but too slow, unsafe, or uneconomic is not ready.
Pre-commit to the decision. Record the minimum detectable effect for the live experiment, the evaluation thresholds that block release, the time at which evidence will be reviewed, and the conditions for killing, refining, or scaling the bet.

The contract should separate three metric layers. The outcome metric tells you whether customer or business value changed. Behavioral metrics tell you whether the AI performed its assigned job. Guardrails tell you whether that performance remained safe, reliable, responsive, and affordable. This prevents a team from celebrating a model score while the customer experience deteriorates.

Consider a customer-support assistant. Eligible deflection and first-contact resolution can represent the business outcome. Factuality against the approved knowledge base, helpfulness, tone, retrieval accuracy, and safe CRM actions describe the system’s behavior. Harmful-content rate, unsafe-action rate, response latency, and token cost act as guardrails. A live test can then examine customer satisfaction and resolution instead of merely counting generated replies.
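
The three layers can be checked in a fixed order so that a good outcome number never excuses a guardrail breach. The sketch below is one possible review rule, not a standard: the function name, the metric names in the usage note, and the ordering (guardrails, then behavior, then outcome) are my assumptions.

```python
def review_bet(outcome_lift, minimum_effect, behavior, behavior_floors,
               guardrails, guardrail_ceilings):
    """Return a (decision, reasons) pair from pre-committed thresholds."""
    # Guardrails first: any value above its ceiling blocks the bet.
    breached = sorted(
        name for name, ceiling in guardrail_ceilings.items()
        if guardrails[name] > ceiling
    )
    if breached:
        return "block", breached
    # Behavioral metrics next: the AI must do its assigned job well enough.
    weak = sorted(
        name for name, floor in behavior_floors.items()
        if behavior[name] < floor
    )
    if weak:
        return "refine", weak
    # Only then does the outcome metric decide between scaling and stopping.
    if outcome_lift >= minimum_effect:
        return "scale", []
    return "pause_or_kill", []
```

For the support assistant, behavior might hold factuality and retrieval accuracy, while guardrails hold unsafe-action rate, latency, and token cost, each compared with limits written into the decision contract before the test began.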

This is the practical difference between an output and an outcome. Shipping an assistant is an output. Producing more successful resolutions without unacceptable safety, latency, or cost regressions is an outcome. Disciplined evaluation makes that distinction measurable.

Match the evidence burden to the type and consequence of the bet

A portfolio needs different kinds of AI innovation, but it should not evaluate every bet in the same way. Core optimization, adjacent expansion, and transformational innovation face different uncertainties. The label determines the strategic question. The consequence of failure determines the rigor.

Portfolio bet	Question it must answer	Evidence that matters most	Typical decision
Core optimization	Can AI improve an established journey without damaging what already works?	A reliable baseline, regression tests, live A/B results, and cost and latency guardrails	Adopt the change only when the improvement survives the existing quality bar
Adjacent expansion	Does the capability solve a known job for a new segment, channel, or use case?	Problem discovery, segment-representative evaluation cases, activation signals, and retention evidence	Expand only after the new audience reaches a meaningful value moment
Transformational innovation	Can a materially different workflow create value and be trusted?	Task-completion tests, human review, adversarial testing, safe tool-use checks, and a staged customer pilot	Increase autonomy and exposure only as reliability and business evidence mature

A core change can have a small strategic scope and still require a high evidence burden. An apparently simple classifier may sit inside a sensitive workflow. Conversely, a transformational concept can begin with a narrow, reversible prototype. Do not use “experimental” as permission to lower the bar for privacy, security, or consequential actions.

The same discipline improves build, partner, and buy decisions. Generic demonstrations do not reveal how a system will perform on your customers’ language, your knowledge, your policies, or your tools. Run every viable option through the same representative task set. Compare task quality, latency, cost, integration effort, data boundaries, governance fit, and failure recovery. The vendor category matters less than whether the option can satisfy the decision contract.

Portfolio funding should follow evidence maturity rather than presentation quality. Continue a bet when the team can identify remaining uncertainty and run a proportionate test to reduce it. Pause or kill it when customer value does not materialize, critical failure modes remain unresolved, or the required quality cannot fit inside the operating cost and latency envelope.

A neutral experiment is not automatically wasted work. It can eliminate a weak hypothesis and release capacity for a better bet. But a poorly instrumented or under-sensitive experiment does not produce a useful neutral result. Set the minimum detectable effect and instrumentation before launch so “no movement” has an interpretable meaning.
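
To see what a minimum detectable effect implies for traffic, a rough sample-size estimate helps. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test; the default significance level and power are conventional assumptions, and your experimentation platform may use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, minimum_effect, alpha=0.05, power=0.8):
    """Approximate users needed in each arm to detect an absolute lift."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / minimum_effect ** 2)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the effect size has to be agreed before launch rather than after a flat result.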

Build an evaluation stack that resembles the real product

An AI evaluation is useful only when it represents the decisions the product must make under realistic conditions. A polished answer to a convenient prompt is weak evidence. The production system also has to handle ambiguous requests, imperfect retrieval, policy boundaries, long-tail inputs, adversarial behavior, and tool failures.

Turn the golden dataset into an executable product specification

Your golden dataset should express product intent through examples. Start with real, properly anonymized inputs from discovery, support, and product usage. Add important edge cases, long-tail situations, and adversarial prompts deliberately; waiting for production to reveal them transfers avoidable risk to customers.

Each case should carry enough context to diagnose a failure, not just assign a score:

The user input and relevant conversation or workflow state
The approved information or system state the response may rely on
The expected behavior, acceptable answer range, or permitted action
A rubric for correctness, helpfulness, tone, and safety
A risk label that distinguishes ordinary quality defects from release-blocking failures
Metadata for the user segment, use case, input pattern, or workflow stage
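
One way to make those fields hard to omit is to give each case a typed record. The following sketch is illustrative: the class, field names, and the two risk labels are assumptions, and the held_out flag reflects the practice of keeping some cases away from prompt tuning.

```python
from dataclasses import dataclass, field

RISK_LABELS = {"ordinary", "release_blocking"}  # hypothetical label set


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    workflow_state: dict
    approved_sources: list[str]
    expected_behavior: str
    rubric: dict[str, str]          # dimension -> what a passing answer looks like
    risk: str
    metadata: dict[str, str] = field(default_factory=dict)
    held_out: bool = False          # True means never used for prompt tuning

    def __post_init__(self):
        if self.risk not in RISK_LABELS:
            raise ValueError(f"unknown risk label: {self.risk}")


def tuning_cases(cases):
    """Cases the team may iterate against; held-out cases stay unseen."""
    return [c for c in cases if not c.held_out]
```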

Keep the set versioned. Preserve cases that caught previous regressions, refresh it as customer behavior changes, and hold back examples that are not used for prompt tuning. Otherwise, the team can optimize for a familiar test set while making little progress on the wider product experience.

Privacy belongs in dataset design. Anonymization, access control, retention rules, and approved data boundaries should be established before customer interactions become test fixtures. Retrofitting those controls after an evaluation pipeline spreads sensitive data is slower and riskier.

Use several evaluators because each catches a different failure

No single evaluation method is a complete quality system. Layer methods according to what is being tested:

Deterministic tests are appropriate for business rules, schemas, required fields, forbidden actions, exact calculations, and tool arguments. If a rule can be checked directly, do not ask another model to guess whether it passed.
Grounded checks compare claims with an approved knowledge base or retrieved context. They are essential when the product promises answers based on company or account information.
LLM-as-judge scoring can cover subjective dimensions such as helpfulness, relevance, and tone at useful scale. Define the rubric tightly and calibrate the judge against human decisions. Consistency is not enough if the judge consistently applies the wrong standard.
Pairwise preference tests help compare prompt, retrieval, or model variants when an absolute score is hard to interpret. They answer which candidate better satisfies the same rubric.
Human review remains necessary for critical, ambiguous, policy-sensitive, or high-consequence cases. It also provides the reference needed to recalibrate automated judges.
Red teaming probes manipulation, unsafe requests, policy evasion, and unexpected combinations of otherwise valid instructions.
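
The deterministic layer is the easiest to make concrete. The sketch below checks a proposed tool call against a rule table without involving another model. The tool names, argument schema, and forbidden list are hypothetical examples, not part of any real product.

```python
# Hypothetical tool schema: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "issue_refund": {"order_id": str, "amount": float},
    "lookup_order": {"order_id": str},
}
# Hypothetical tools the agent must never call in this workflow.
FORBIDDEN_TOOLS = {"delete_account"}


def check_tool_call(name, args):
    """Return a list of rule violations; an empty list means the call passes."""
    if name in FORBIDDEN_TOOLS:
        return [f"forbidden tool: {name}"]
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    for field_name, expected_type in schema.items():
        if field_name not in args:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(args[field_name], expected_type):
            problems.append(f"wrong type for {field_name}")
    # Extra arguments are flagged too, so silent schema drift is visible.
    problems += [f"unexpected field: {k}" for k in args if k not in schema]
    return problems
```

A check like this either passes or fails the same way on every run, which is the property that makes it suitable as a release-blocking test.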

Agentic systems need evaluation beyond the final prose. A fluent confirmation can hide a failed or unauthorized action. Measure whether the agent chose the correct tool, supplied valid arguments, respected permissions and confirmation requirements, completed the intended task, and recovered safely when a dependency failed. Task-completion reliability and safe-action rate are more revealing than answer style alone.
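
One way to compute those two measures from recorded agent runs is sketched below. The AgentRun fields and the definitions of "completed" and "safe" are assumptions chosen for illustration; your own definitions should follow the product's permission and confirmation rules.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    expected_tool: str
    called_tool: str
    args_valid: bool
    permitted: bool       # the call stayed within the user's permissions
    confirmed: bool       # required confirmation obtained (True if none needed)
    task_completed: bool


def summarize(runs):
    """Compute task-completion reliability and safe-action rate over runs."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    # A run counts as completed only if the right tool was called correctly.
    completed = sum(
        r.task_completed and r.called_tool == r.expected_tool and r.args_valid
        for r in runs
    )
    safe = sum(r.permitted and r.confirmed for r in runs)
    return {
        "task_completion": completed / total,
        "safe_action_rate": safe / total,
    }
```

Scoring the two rates separately keeps a run that finished the task through an unauthorized call from being counted as a success.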

Quality must also be evaluated inside the cost-quality-latency envelope. A larger model can improve a difficult generation task and still be the wrong default for a simple classification step. Test model routing, token budgets, caching, prompt structure, retrieval quality, and function-calling patterns by task. The goal is not to minimize each cost independently; it is to meet the product’s quality bar with an operating profile the business can sustain.

Turn evaluations into release gates and portfolio decisions

An evaluation document that lives outside delivery will eventually be skipped. The evaluation suite should run whenever a prompt, model, retrieval pipeline, knowledge source, tool schema, or workflow changes. That makes evaluation part of the release mechanism instead of a launch ceremony.

Use a gate sequence from discovery through production

Stage	Evidence to collect	Decision enabled
Problem discovery	User problem, current workflow, baseline, value hypothesis, and major risks	Decide whether the problem deserves an AI bet
Prototype	Representative golden-set results, failure taxonomy, latency, and estimated operating cost	Decide whether the capability has a credible path to the product bar
Pre-release	Regression suite, calibrated human review, adversarial cases, privacy checks, and safe-action tests	Block, revise, or approve a controlled rollout
Controlled rollout	Predefined A/B test, value-moment telemetry, satisfaction, guardrails, and incident signals	Validate whether offline quality creates customer and business value
Production scale	Continuous monitoring, segment-level failures, cost and latency trends, incidents, and refreshed evaluations	Scale, route, constrain, roll back, or retire the capability

Separate hard gates from optimization targets. A prohibited action, a privacy-boundary violation, or a broken business rule should block release. A modest tone improvement or non-critical cost regression may be handled as a tracked trade-off. If every metric is a hard gate, delivery stalls. If none is, the gate is theater.

I use a simple test for gate quality: if two accountable leaders can read the same result and reach opposite release decisions, the decision rule is incomplete. Define the failing threshold, affected cases, permitted exception process, and rollback action before the result arrives.
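
A decision rule of that kind can be written down before any result arrives. The sketch below separates hard gates from optimization targets; the metric names and thresholds in the usage example are invented for illustration.

```python
def release_decision(results, hard_gates, targets):
    """Apply a predefined release rule to evaluation results.

    results:    metric -> observed value
    hard_gates: metric -> maximum allowed value; exceeding it blocks release
    targets:    metric -> minimum desired value; missing it is a tracked trade-off
    """
    missing = [m for m in hard_gates if m not in results]
    if missing:
        # A hard gate with no evidence is treated as a failure, not a pass.
        reasons = [f"no result for {m}" for m in missing]
        return {"decision": "block", "reasons": reasons, "tradeoffs": []}
    blockers = [m for m, limit in hard_gates.items() if results[m] > limit]
    tradeoffs = [m for m, goal in targets.items() if results.get(m, 0) < goal]
    decision = "block" if blockers else "approve"
    return {"decision": decision, "reasons": blockers, "tradeoffs": tradeoffs}


# Usage with hypothetical metrics: one privacy violation blocks the release,
# while the tone shortfall is only recorded as a trade-off.
outcome = release_decision(
    results={"prohibited_actions": 0, "privacy_violations": 1, "tone_score": 0.7},
    hard_gates={"prohibited_actions": 0, "privacy_violations": 0},
    targets={"tone_score": 0.8},
)
```

Because the rule is a function of the results alone, two reviewers reading the same output cannot reach opposite decisions without first changing the rule.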

For systems that can change customer data, communicate externally, or trigger another consequential action, start with narrow permissions and human confirmation. Log the proposed action, the tool call, the result, and the reason for escalation. Increase autonomy only when the relevant task and safety evaluations hold under real usage. A human-in-the-loop control is most useful when the escalation path, response owner, and incident procedure are explicit.

Offline evaluations create confidence to expose the product. They do not prove business impact. A live experiment must test the stated outcome with a predefined minimum detectable effect while watching for novelty bias and segment-specific failures. Instrument the customer’s value moment, not merely clicks on the AI entry point. An assistant can attract curiosity without improving activation, retention, resolution, or satisfaction.
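
To make the minimum detectable effect concrete, the sketch below estimates how many users each arm of an A/B test needs for a conversion-style metric. It assumes a two-sided test of two proportions with the standard normal approximation; the default significance level and power are common conventions, not values taken from this article.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde`.

    baseline: current conversion rate, e.g. 0.20
    mde:      minimum detectable effect as an absolute difference, e.g. 0.02
    """
    p1, p2 = baseline, baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde == 0:
        raise ValueError("rates must lie in (0, 1) and mde must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why the minimum detectable effect should be agreed before the experiment starts rather than chosen after the data arrives.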

Production telemetry should feed back into the golden dataset. Add recurring failures, newly observed edge cases, incidents, and examples where users abandon or escalate. This turns customer reality into the next regression suite and prevents evaluation from freezing at the assumptions held before launch.

Carry one scorecard from the product team to the QBR

Leadership does not need a separate innovation narrative built from feature updates. Use one scorecard at product reviews, investment reviews, and QBRs. It should contain:

The portfolio class and strategic outcome
The target user, job, and current baseline
The causal hypothesis and non-AI alternative
The primary business metric and minimum detectable effect
The offline quality measures and live outcome measures
The safety, privacy, latency, reliability, and cost guardrails
The current evidence, unresolved uncertainty, and confidence level
The next test, accountable owner, review point, and kill-or-scale rule

This creates a common language for product, engineering, design, go-to-market, risk, and executive stakeholders. The conversation becomes: What did the bet need to prove? What evidence changed? Which uncertainty remains? What decision follows? It no longer depends on who presents the most persuasive demonstration.

The scorecard also protects speed. Teams with explicit boundaries can make routine prompt, retrieval, routing, and interface improvements without reopening the entire strategy. Leadership attention can stay on exceptions, material regressions, capital allocation, and bets whose evidence no longer supports the original thesis.

Key takeaways for your next AI portfolio review

Require a decision contract before an AI idea receives roadmap momentum: user, outcome, hypothesis, evidence, guardrails, and kill-or-scale rule.
Classify each bet as core, adjacent, or transformational, but set evaluation rigor according to the consequence of failure.
Build a versioned golden dataset from anonymized real inputs, important edge cases, long-tail situations, and adversarial prompts.
Layer deterministic checks, grounded tests, calibrated model judging, human review, preference testing, and red teaming.
Evaluate agent actions and task completion, not only the fluency of the final response.
Run relevant regressions whenever prompts, models, retrieval, knowledge, tools, or workflows change.
Use offline evaluation to control release risk and live experimentation to validate customer and business impact.
Fund, refine, pause, or kill bets based on evidence maturity rather than demo quality or sunk effort.

At your next roadmap review, pick one upcoming AI bet and pause the implementation discussion until its decision contract is complete. Then run the current workflow through a representative evaluation set before changing it. That baseline gives every later improvement something honest to beat.

When each investment has a visible path from user problem to evaluation to decision, AI innovation stops being a contest between plausible demos. It becomes a repeatable way to allocate attention, manage risk, and scale the capabilities that produce durable value.

November 3, 2025

AI-Enabled Product Management: A Practical Operating Model

Your product managers are probably already using AI to summarize feedback, draft requirements, and prepare planning documents. The harder question is whether any of that is improving the decisions behind the documents.

That distinction matters. Faster artifact production can create the appearance of progress while weak evidence, unclear ownership, and unresolved trade-offs remain untouched. A useful AI-enabled product operating model shortens the path from customer evidence to accountable action without treating fluent output as product judgment.

Start with a recurring decision, not a general-purpose assistant

The natural starting point is an assistant that can answer anything. It is also difficult to evaluate because every request has different inputs, quality criteria, and consequences. Start with one recurring decision whose current workflow you understand.

AI is already useful for synthesizing feedback, drafting PRDs and acceptance criteria, turning notes into user stories, and preparing experiment plans. Those are valuable tasks, but they are parts of a workflow. None of them determines which customer problem deserves investment or which trade-off the company should accept.

Define a decision contract before choosing a model or writing a prompt:

Decision: State the exact choice to be made. Replace “improve onboarding” with “choose which activation barrier to address next.”
Trigger: Name when the workflow runs, such as before roadmap review, after a discovery cycle, or when an anomaly appears.
Required evidence: Identify the interviews, support records, analytics, CRM context, experiments, and strategic constraints that must inform the choice.
Output contract: Specify the claims, citations, contradictory evidence, unknowns, and proposed next questions the AI must return.
Decision owner: Name the person accountable for accepting, rejecting, or changing the recommendation.
Red lines: Identify actions the system may not take, data it may not expose, and conclusions it may not present without review.
Outcome signal: Choose the product or workflow measure that will reveal whether the decision improved anything.

If you cannot name the decision owner and the action that follows the output, you have an AI demonstration rather than an operating workflow.
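
A lightweight way to enforce that rule is to treat the contract as a record and refuse to run the workflow while any field is empty. This is a sketch under the assumption that the seven fields above are stored as plain strings and lists; the class and function names are illustrative.

```python
from dataclasses import dataclass, fields


@dataclass
class DecisionContract:
    decision: str
    trigger: str
    required_evidence: list
    output_contract: list
    decision_owner: str
    red_lines: list
    outcome_signal: str


def missing_fields(contract):
    """Return the names of contract fields that are still empty."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Usage: this draft names no decision owner, so it is not ready to run.
draft = DecisionContract(
    decision="choose which activation barrier to address next",
    trigger="before roadmap review",
    required_evidence=["interviews", "support records", "analytics"],
    output_contract=["claims", "citations", "contradictory evidence", "unknowns"],
    decision_owner="",
    red_lines=["no roadmap commitment without review"],
    outcome_signal="activation rate",
)
```

Here missing_fields(draft) reports the absent owner, which is the signal to stop and assign one before building anything.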

Product decision	What AI can prepare	What the PM must decide
Which problem to investigate	Clusters of interview, support, and behavioral signals with links to the underlying records	Whether the pattern is strategically important and which customers need follow-up
Which roadmap request deserves attention	Evidence by segment, frequency, workflow, and conflicting signal	Opportunity cost, strategic fit, and whether the request represents a problem or a proposed solution
Whether an experiment is ready	Hypothesis, acceptance criteria, instrumentation needs, and minimum detectable effect inputs	Whether the causal question is worth testing and whether the exposure risk is acceptable
How to position a capability	Customer language, points of parity, objections, and candidate messages	The value proposition and competitive differentiation the company can credibly defend
How to respond to an operational signal	Anomaly context, affected journey stage, supporting records, and candidate playbooks	Whether to intervene, whom to affect, and how to judge the result

The prompt should reflect that contract. A weak request says: “summarize customer feedback.” A decision-ready request says: “for the specified segment and workflow, group evidence by customer problem, cite every supporting record, identify contradictions and missing coverage, separate observation from inference, and propose the next discovery question without recommending a roadmap commitment.”

That change is small but important. It directs AI toward evidence preparation while preserving the PM’s responsibility for interpretation and commitment.

Build a context layer your PMs can interrogate and verify

A generic model knows language patterns, not the current state of your customers, product, strategy, or commitments. Copying a few notes into a prompt helps with an isolated task, but it does not create a reliable product-management system.

Retrieval-Augmented Generation connects an LLM to internal product, customer, and market knowledge so relevant material can be retrieved when a question is asked. For a PM, that knowledge may include interview notes, support tickets, win-loss records, QBRs, specifications, CRM data, and product analytics. The practical benefit is not merely a more personalized answer. It is an answer that can be checked against the company’s evidence.

Do not begin by indexing every repository. A large corpus increases coverage, but it also introduces stale specifications, duplicate tickets, conflicting terminology, inaccessible customer data, and documents whose status is unclear. Trust is usually lost at the corpus boundary before it is lost at the model layer.

A minimum trustworthy context layer needs:

Explicit scope: Document which repositories, products, segments, and time periods are included. The system should disclose when a question falls outside that scope.
Access enforcement: Apply user and tenant permissions during retrieval, not merely after an answer has been generated. A record being technically retrievable does not make it appropriate for every PM or every output.
Useful metadata: Preserve product area, customer segment, workflow, channel, date, product version, record owner, and status where available. These fields help distinguish current evidence from historical noise.
Evidence hierarchy: Decide how the system handles an approved specification that conflicts with an old planning note, or verified analytics that conflict with an anecdotal request. It should show the conflict rather than silently blending the two.
Answer boundaries: Require separate sections for supported facts, inferences, contradictory evidence, and unknowns. Require links to the records carrying each material claim.
Feedback history: Store reviewer corrections and the failure category behind each correction. A thumbs-down with no explanation does not tell you whether retrieval, reasoning, freshness, permissions, or presentation failed.
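
The scope, access, and metadata requirements above can be applied as a filter that runs before ranking or generation. The sketch below assumes each record already carries tenant, role, status, and date metadata; the Record fields and the retrievable function are hypothetical, and a real system would push these filters into the search index rather than a Python loop.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    record_id: str
    text: str
    tenant: str
    allowed_roles: frozenset
    status: str    # e.g. "approved", "draft", "superseded"
    updated: date


def retrievable(records, *, tenant, role, in_scope_since, statuses=("approved",)):
    """Filter candidate records before they can reach the model's context.

    Permission, scope, and status checks all run here, so a record the user
    may not see is never retrieved in the first place.
    """
    return [
        r for r in records
        if r.tenant == tenant
        and role in r.allowed_roles
        and r.status in statuses
        and r.updated >= in_scope_since
    ]
```

Filtering at this point, rather than after generation, means an answer cannot be shaped by evidence the requesting PM was never allowed to read.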

Start in read-only mode with a narrow, high-signal workflow, such as synthesizing support patterns for one segment. Ask reviewers to mark each important claim as supported, partly supported, or unsupported and to note relevant evidence that was missed. A polished answer with no traceable basis fails even when its conclusion happens to be plausible.

RAG does not turn internal data into truth. Retrieval can return stale, partial, or contradictory material, and a missing record is not proof that a customer problem does not exist. Your PM still has to assess coverage, distinguish signal from sampling bias, and decide when fresh discovery is necessary.

Privacy-by-design belongs in this layer as well. Support and CRM records may contain personal information, confidential commitments, or account-specific context. Minimize what is indexed, redact what is not needed, preserve access controls, and define which outputs may leave the internal workflow. Data governance is part of product quality here, not an administrative task to add after launch.

Match AI autonomy to the consequence of being wrong

“Human review” is too vague to be a control. It can mean a careful decision by an accountable owner, or a hurried click on an approval button after the work has effectively been accepted. Define autonomy according to the consequence and reversibility of each action.

Assist: AI transforms material without changing external state. Examples include transcribing notes, formatting requirements, clustering feedback, or drafting an internal brief. The user reviews the result before relying on it.
Recommend: AI interprets evidence and proposes a choice, but a named owner makes the decision. Roadmap evidence summaries, experiment proposals, and candidate positioning belong here.
Act reversibly: AI performs a bounded action that is observable and easy to undo, such as creating a draft ticket, applying an internal label, running an analysis, or staging an in-app guide in preview. Tool permissions, scope, and rollback must be enforced.
Act with material consequence: The workflow affects customers, exposure to an experiment, permissions, contractual commitments, published messaging, or data that cannot be restored easily. Require explicit approval from the accountable owner before execution.
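
The four levels can be expressed as a small classification rule. The sketch below is a deliberate simplification: it reduces each action to four yes-or-no properties, which are assumptions made for illustration, and a real workflow would likely need more context than that.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    ASSIST = 1
    RECOMMEND = 2
    ACT_REVERSIBLY = 3
    ACT_WITH_MATERIAL_CONSEQUENCE = 4


def classify(changes_external_state, interprets_evidence, reversible, affects_customers):
    """Map an action's properties to the autonomy level it belongs in."""
    if not changes_external_state:
        return Autonomy.RECOMMEND if interprets_evidence else Autonomy.ASSIST
    if reversible and not affects_customers:
        return Autonomy.ACT_REVERSIBLY
    # Anything irreversible or customer-facing falls into the strictest level.
    return Autonomy.ACT_WITH_MATERIAL_CONSEQUENCE


def needs_owner_approval(level):
    """Explicit approval before execution applies to the strictest level."""
    return level is Autonomy.ACT_WITH_MATERIAL_CONSEQUENCE
```

Defaulting unclear cases to the strictest level keeps a misclassified action from executing without an accountable owner's sign-off.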

A credible direction of travel includes agents that monitor activation funnels, flag anomalies, prepare playbooks, and help coordinate experiments or in-app guidance. That does not justify giving one agent broad access to analytics, messaging, experimentation, and customer data. Each tool should have the narrowest permission and action scope the workflow needs.

For consequential actions, make the approval packet decision-ready:

The exact action the agent proposes to take
The affected product area, customer cohort, or internal system
The evidence supporting the action, with links
Contradictory evidence and unresolved uncertainty
The expected product outcome and how it will be observed
The rollback procedure and the conditions that trigger it
The approver, approval expiry, and complete action log

Enforce guardrails in the system rather than relying on prompt language. Use constrained service accounts, scoped tools, staging environments, rate limits, complete logs, and an accessible kill switch. A prompt is an instruction to a model; it is not a security boundary.
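
The difference between a prompt instruction and a system control is easier to see in code. The sketch below wraps tool access in a guard that checks scope, a call budget, approval expiry, and a kill switch, and logs every verdict. ToolGuard and its fields are hypothetical, and the call budget is a simple stand-in for a real rate limiter.

```python
from datetime import datetime, timezone


class ToolGuard:
    def __init__(self, allowed_tools, max_calls):
        self.allowed_tools = set(allowed_tools)
        self.max_calls = max_calls  # simple call budget, not a time window
        self.calls = 0
        self.killed = False         # set to True to stop all further actions
        self.log = []

    def authorize(self, tool, approval_expires=None, now=None):
        """Return True only if the call is in scope, in budget, and approved."""
        now = now or datetime.now(timezone.utc)
        if self.killed:
            verdict = "denied: kill switch engaged"
        elif tool not in self.allowed_tools:
            verdict = "denied: tool out of scope"
        elif self.calls >= self.max_calls:
            verdict = "denied: rate limit reached"
        elif approval_expires is not None and now >= approval_expires:
            verdict = "denied: approval expired"
        else:
            verdict = "allowed"
            self.calls += 1
        # Denied attempts are logged as well, so the record is complete.
        self.log.append((now.isoformat(), tool, verdict))
        return verdict == "allowed"
```

Whatever the model is told or tricked into requesting, the guard decides what actually executes, and the log shows every attempt.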

My rule is simple: if the accountable PM cannot explain how the evidence supports the proposed action, the workflow has not earned more autonomy. The right response is to improve the context and evaluation loop, not to make the approval interface easier to click through.

Evaluate the output, the workflow, and the product outcome

An AI initiative can generate more documents while making product management worse. More drafts may create review queues, spread unsupported claims, or encourage teams to reopen decisions that lacked new evidence. Measure three layers so local speed is not mistaken for organizational value.

Evaluation layer	Question	Evidence to inspect
Output reliability	Is the result grounded, complete enough for its purpose, appropriately uncertain, and safe to use?	Citation checks, missed evidence, unsupported claims, privacy failures, and subject-matter review
Workflow performance	Does AI reduce elapsed time and rework without moving effort into a hidden review step?	Time from trigger to decision, acceptance and editing patterns, handoffs, reopened work, and blocked decisions
Product impact	Did the resulting decision improve the customer or business outcome the workflow exists to influence?	The relevant activation, retention, experiment, support, or commercial measure, interpreted in the context of the decision

Baseline the existing workflow before introducing AI. Record its trigger, participants, elapsed time, common failure modes, and decision outcome. Otherwise, a faster AI run will be compared with an imaginary manual process instead of the work people actually perform.

Use outcomes rather than artifact volume when setting the objective. Drafts produced, prompts submitted, and active users describe activity. A shorter evidence-to-decision cycle, fewer unsupported roadmap claims, or better performance on the product outcome describes value. The metric must match the workflow; there is no universal AI productivity score.

A practical review loop looks like this:

Maintain a representative evaluation set containing ordinary cases, known failures, ambiguous inputs, permission boundaries, and contradictory evidence.
Run the current prompt, retrieval configuration, model, and tools against that set.
Have the relevant product, design, engineering, data, or domain reviewer score the output against the decision contract.
Classify each failure. Separate missing retrieval from unsupported inference, stale context, permission errors, incomplete instructions, and poor presentation.
Change one major component at a time so you can tell whether the prompt, corpus, retrieval rules, model, tool, or approval design improved the result.
Run the full evaluation set again before promoting the change. Keep prompts and retrieval configurations versioned so regressions can be traced and reversed.
Review production corrections and near misses, add them to the evaluation set, and revisit the autonomy level if the consequence profile has changed.
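
The promotion step in that loop comes down to comparing two runs of the same evaluation set. The sketch below assumes each run is reduced to a pass or fail per case; the function name and the rule that any regression blocks promotion are illustrative choices, not a fixed policy.

```python
def compare_runs(baseline, candidate):
    """Compare two evaluation runs over the same cases.

    baseline, candidate: case_id -> True (pass) or False (fail).
    A case missing from the candidate run is treated as a failure.
    """
    regressions = sorted(
        c for c, ok in baseline.items() if ok and not candidate.get(c, False)
    )
    fixes = sorted(
        c for c, ok in candidate.items() if ok and not baseline.get(c, False)
    )
    return {"regressions": regressions, "fixes": fixes, "promote": not regressions}


# Usage with invented case IDs: the change fixes one case and breaks another,
# so it is not promoted until the regression is understood.
report = compare_runs(
    baseline={"a": True, "b": False, "c": True},
    candidate={"a": True, "b": True, "c": False},
)
```

Listing regressions by case ID, rather than reporting only an aggregate score, lets the team trace each one to the prompt or retrieval version that introduced it.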

This is a good ritual for a product trio, with engineering or a forward-deployed engineer handling system integration and observability where the workflow requires it. The PM owns the problem definition and decision quality; design protects the fidelity of customer interpretation; engineering owns the reliability and bounded behavior of the implementation. Subject-matter owners still review claims that cross their domain.

Expand in stages. Move from a single-segment synthesis to a cited discovery brief, then to roadmap evidence, experiment preparation, and only later to reversible execution. Do not promote the workflow when material claims remain uncited, permission failures are unresolved, reviewers cannot explain its conclusions, or downstream rework is increasing. Those are operating failures, even if the model’s prose looks strong.

Key takeaways

Choose one recurring product decision and define its owner, evidence, output, red lines, and outcome before selecting AI tools.
Use a governed retrieval layer to make internal context accessible, current, permission-aware, and traceable to the underlying records.
Separate evidence preparation from judgment. AI can organize and challenge the case; the PM remains accountable for the bet.
Increase autonomy only when actions are bounded, observable, reversible, and supported by an explicit approval model.
Evaluate output reliability, workflow performance, and product impact. Artifact volume is not a proxy for better product management.
Scale only after real corrections and failure cases have been added to a repeatable evaluation set.

Before your next planning cycle, pick one disputed decision that repeats often. Write its decision contract, assemble a small representative evidence set, and run the AI workflow in read-only mode beside the current process. If reviewers can trace the material claims, identify what is missing, and make the decision with less rework, you have a foundation worth expanding. If they cannot, improve the context and controls before adding another feature or agent.

November 3, 2025
