How to Scale AI Pilots Into Mature Production Systems

You have AI pilots that demo well, enthusiastic teams asking for broader rollout, and executives expecting the investment to show up in operating results. Yet the closer you get to production, the longer the list of unresolved questions becomes: Who owns the workflow? How will quality be measured? What happens when the model is wrong? Can the economics survive real usage?

The next move is not to launch more pilots. It is to install a system that can repeatedly turn a validated use case into a governed, measurable, and improving production workflow. That system is what separates AI experimentation from mature deployment.

A successful pilot is not evidence of production readiness

AI adoption is already common enough that adoption itself tells you very little. Among more than 2,400 global customer service professionals, 82% of senior leaders invested in AI in 2025, 87% planned to invest in 2026, and only 10% described their deployment as mature. The sample is specific to customer service, so those figures are better used as a directional benchmark than as a universal maturity rate. The underlying execution problem applies much more broadly: buying or piloting AI is easier than making it dependable inside a core workflow.

A pilot is designed to answer a narrow learning question. Can the model classify this request, draft this response, summarize this record, or choose the next action under controlled conditions? Production has to answer a harder question: can the entire workflow create enough value, across ordinary and difficult cases, while remaining safe, observable, supportable, and economically sensible?

I use a simple test. If the team can describe the model but cannot describe the operating workflow around it, the work is still a prototype. A production case should make each of these elements explicit:

Outcome: The customer or business result that should improve, plus the current baseline.
Workflow boundary: Where AI enters, which decisions it may make, which systems it may use, and where its authority ends.
Quality standard: The evaluation cases, acceptance criteria, and failure categories that determine whether a release is good enough.
Safe failure path: What the system does when information is missing, a tool fails, a policy is triggered, or the requested action exceeds its authority.
Accountability: A named product owner for the outcome and a named operational owner for production performance.
Economics: The value created and the full cost of inference, retrieval, tools, review, support, and incident handling.
Learning mechanism: How production failures and user corrections return to the evaluation set and release process.

These are not finishing tasks to schedule after the model works. They are part of the product. Deferring them creates a predictable trap: the pilot looks increasingly impressive while the distance to a responsible launch quietly grows.

Do not confuse automation coverage with maturity, either. A system can handle many requests and still be immature if nobody can explain why it made a decision, detect a quality regression, contain a failure, or calculate the value and cost it produced. Conversely, a narrowly scoped workflow can be mature when its boundaries, controls, outcomes, and ownership are clear.

Depth matters because quality is produced by the whole operating system, not the prompt alone. In customer service, 43% of mature adopters reported higher quality and consistency, compared with 24% of teams in earlier stages. These are self-reported results, but the practical implication is sound: integration, evaluation, and continuous improvement are not overhead around the AI. They are how the AI becomes useful at scale.

Promote each workflow through explicit maturity gates

Maturity should be earned workflow by workflow. An organization does not become mature because it has a central AI team, an approved model vendor, or a large portfolio. It becomes mature when important workflows can move through a repeatable sequence of decisions without relying on heroics.

Stage	Decision to make	Evidence required to advance	Reason to hold
Discover	Is this a valuable and appropriate problem for AI?	A defined user problem, current baseline, workflow map, risk classification, and initial build-versus-buy view	The use case is driven by model novelty, has no meaningful outcome, or depends on inaccessible data
Prove	Can the proposed workflow improve on the current process?	Representative evaluation cases, a working prototype, documented failure modes, and a controlled comparison with the baseline	Success appears only in curated demos, or the team cannot reproduce the result across realistic cases
Operate	Can the workflow run safely and reliably in production?	Monitoring, escalation, access controls, auditability, incident procedures, release controls, rollback, and an accountable operator	Failures cannot be detected or contained, or production responsibility is still ambiguous
Scale	Should usage, autonomy, channels, or organizational reach expand?	Sustained outcome improvement, acceptable quality and risk, validated economics, user adoption, and reusable operating components	Volume is growing faster than quality, cost, support capacity, or governance can be understood

The purpose of a gate is not to create a committee. It is to prevent enthusiasm, executive attention, or sunk cost from substituting for evidence. The domain team should be able to prepare the evidence as part of normal product development. Specialist review should become more demanding only as the possible consequence of failure increases.
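
As a rough illustration of how the gate table could be checked mechanically, the sketch below compares the evidence a team has recorded with the evidence each stage requires. The evidence labels and the `gate_decision` helper are hypothetical names chosen for this example, not part of any standard.

```python
# Evidence labels are illustrative shorthand for the "Evidence required" column.
REQUIRED_EVIDENCE = {
    "discover": {"user_problem", "baseline", "workflow_map",
                 "risk_classification", "build_vs_buy_view"},
    "prove": {"evaluation_cases", "working_prototype",
              "failure_modes", "baseline_comparison"},
    "operate": {"monitoring", "escalation", "access_controls", "auditability",
                "incident_procedures", "release_controls", "rollback",
                "accountable_operator"},
    "scale": {"sustained_outcome", "acceptable_quality_and_risk",
              "validated_economics", "user_adoption", "reusable_components"},
}


def gate_decision(stage: str, evidence: set[str]) -> tuple[str, list[str]]:
    """Return ("advance" or "hold", sorted list of missing evidence)."""
    missing = sorted(REQUIRED_EVIDENCE[stage] - evidence)
    return ("advance" if not missing else "hold", missing)


# A prototype with evaluation cases but no baseline comparison stays in Prove.
status, missing = gate_decision("prove", {"evaluation_cases", "working_prototype"})
```

A check like this only confirms that evidence exists. Whether the evidence is good enough remains a judgment for the domain team and, for higher-consequence workflows, the specialist reviewers.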

Give every workflow a short deployment contract. Keep it in the same system where the team manages releases and evaluations, not in a presentation that disappears after approval. The contract should include:

The intended user, job to be done, business outcome, and current baseline.
The inputs the workflow accepts and the outputs or actions it may produce.
The actions that are prohibited, require confirmation, or must be routed to a person.
The data sources, retrieval rules, system permissions, retention rules, and privacy constraints.
The evaluation set, quality dimensions, acceptance criteria, and known limitations.
The failure taxonomy, escalation path, incident owner, and customer recovery procedure.
The prompt, model, retrieval, tool, and policy versions included in the release.
The production metrics, cost measures, rollout control, and rollback conditions.
The product owner, operational owner, and risk approvers.
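
One way to keep the contract next to releases and evaluations is to store it as structured data. The sketch below is a minimal example under that assumption; the field names are a condensed, hypothetical subset of the list above, and `missing_fields` is an illustrative helper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DeploymentContract:
    # Field names are illustrative; adapt them to your own contract template.
    intended_user: str = ""
    business_outcome: str = ""
    baseline: str = ""
    allowed_actions: list[str] = field(default_factory=list)
    confirmation_required: list[str] = field(default_factory=list)
    data_sources: list[str] = field(default_factory=list)
    evaluation_set: str = ""
    acceptance_criteria: str = ""
    escalation_path: str = ""
    release_versions: dict[str, str] = field(default_factory=dict)
    rollback_conditions: str = ""
    product_owner: str = ""
    operational_owner: str = ""

    def missing_fields(self) -> list[str]:
        """List fields that are still empty (in this sketch, empty means unfilled)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


contract = DeploymentContract(intended_user="support agent", product_owner="A. Owner")
gaps = contract.missing_fields()  # everything except the two fields set above
```

A release pipeline could refuse to promote a workflow while `missing_fields()` is non-empty, which keeps the contract from drifting into a document nobody reads.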

The acceptance criteria will differ by workflow. A drafting assistant, an internal search experience, and an agent authorized to modify a customer account should not face the same bar. Base the bar on consequence, reversibility, detectability, and recovery. If an error can create an irreversible change, expose sensitive data, make a material commitment, or deny someone an important service, require an appropriate human authorization step rather than relying on average model performance.
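
The rule in the last sentence can be written as a small predicate. This is a sketch: the `ActionRisk` fields mirror the four conditions named above, and how each flag is determined for a real action is left as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRisk:
    irreversible: bool
    exposes_sensitive_data: bool
    material_commitment: bool
    denies_important_service: bool


def requires_human_authorization(risk: ActionRisk) -> bool:
    """Any single high-consequence property is enough to require sign-off."""
    return any((
        risk.irreversible,
        risk.exposes_sensitive_data,
        risk.material_commitment,
        risk.denies_important_service,
    ))


draft_reply = ActionRisk(False, False, False, False)
close_account = ActionRisk(True, False, False, True)
```

Note that average model quality does not appear in the function. The decision depends on what the action can do, not on how often the model is right.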

The deployment contract also makes scope changes visible. Adding a new tool, data source, channel, language, model, or autonomous action is not merely more traffic. It changes the system’s failure surface. Update the contract, extend the evaluation set, and pass the relevant gate again.

Build three feedback loops before increasing autonomy

A mature deployment learns at three levels: whether the workflow creates value, whether its decisions meet the required standard, and whether the production system remains reliable. If any loop is missing, the team can collect impressive activity metrics while the actual product deteriorates.

Connect model behavior to a business outcome

Start with the baseline process, not an AI metric. If the workflow is intended to resolve a support request, qualify an opportunity, complete an onboarding step, or assist an employee, measure how that outcome happens without the new system. Otherwise, you will know that the AI generated output but not whether it improved anything.

Use a metric stack that separates outcomes from diagnostics:

Business outcome: The customer, revenue, cost, risk, or productivity result the investment is meant to change.
Workflow outcome: Completion, resolution, successful handoff, correction, rework, abandonment, or another measure of whether the task reached its intended end.
Quality and safety: Correctness, grounding, policy compliance, appropriate escalation, harmful failure, and user correction.
Operational performance: Availability, latency, tool success, retrieval quality, incident volume, and recovery.
Economics: Cost per successful outcome, including model usage, infrastructure, external tools, human review, support, and remediation.

The layers diagnose different problems. A prompt change may improve an offline score without changing task completion. More automation may reduce handling work while increasing corrections. A cheaper model may lower inference cost but create enough rework to raise the cost per successful outcome. Do not compress those effects into one AI score.
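
To make the cheaper-model case concrete, here is a small worked calculation. Every number is hypothetical, and the cost model (per-request inference and review costs, plus a rework cost for each failed request) is a simplifying assumption.

```python
def cost_per_successful_outcome(requests, success_rate, inference_cost,
                                review_cost, rework_cost):
    """inference_cost and review_cost apply per request; rework_cost per failure."""
    successes = requests * success_rate
    failures = requests - successes
    total_cost = requests * (inference_cost + review_cost) + failures * rework_cost
    return total_cost / successes


# Hypothetical: the cheaper model cuts inference cost from 0.05 to 0.01 per
# request, but its success rate drops from 90% to 80%.
larger_model = cost_per_successful_outcome(1000, 0.90, 0.05, 0.10, 4.00)
cheaper_model = cost_per_successful_outcome(1000, 0.80, 0.01, 0.10, 4.00)
```

With these inputs the larger model costs roughly 0.61 per successful outcome and the cheaper model roughly 1.14, because rework on the extra failures outweighs the inference savings.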

Measurement tends to improve as deployment deepens. In the customer service maturity data, reported ROI tracking increased from 35% among teams exploring AI to 70% among mature deployments. That does not prove maturity automatically causes measurement, but it shows how closely operational depth and measurement discipline travel together.

When traffic and product conditions support an experiment, compare the AI workflow with the current experience. Define the decision metric and minimum detectable effect before running an A/B test. For lower-volume or higher-risk workflows, use controlled rollout evidence, expert review, and structured case analysis rather than pretending a small sample provides statistical certainty.
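
Before committing to an A/B test, it helps to estimate how much traffic the chosen minimum detectable effect implies. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline rate, effect size, significance level, and power in the example are assumptions you would replace with your own.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_effect,
                        alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical: 60% baseline resolution rate, 5-point minimum detectable effect.
needed = sample_size_per_arm(0.60, 0.05)
```

If the result is far above the traffic the workflow will see in a reasonable window, that is a signal to use controlled rollout evidence and expert review instead of an underpowered experiment.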

Turn evaluations into release criteria

An evaluation set is not a collection of attractive examples. It should represent ordinary work, difficult edge cases, policy boundaries, known failures, and the situations in which the system should refuse or escalate. Build it before optimizing the prompt so the team cannot unconsciously redefine success around whatever the prototype already does well.

For each case, record the expected behavior and why it is expected. Some outputs can be checked against a deterministic answer. Others need a rubric that distinguishes task completion, factual support, instruction following, tone, policy compliance, and escalation quality. Where reviewers can reasonably disagree, capture that disagreement instead of forcing false precision into a single label.
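
A minimal sketch of an evaluation case that keeps the rationale and the individual reviewer labels, instead of collapsing them into one verdict, might look like this. The `EvalCase` structure and its field names are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    input_text: str
    expected_behavior: str  # e.g. "answer", "refuse", "escalate"
    rationale: str          # why this behavior is expected
    reviewer_labels: list[str] = field(default_factory=list)

    def agreement(self) -> float:
        """Share of reviewers who chose the most common label (1.0 = unanimous)."""
        if not self.reviewer_labels:
            return 0.0
        top_count = Counter(self.reviewer_labels).most_common(1)[0][1]
        return top_count / len(self.reviewer_labels)


case = EvalCase(
    case_id="refund-policy-017",
    input_text="Customer asks for a refund outside the stated window.",
    expected_behavior="escalate",
    rationale="Exceptions to the refund window need a person's approval.",
    reviewer_labels=["pass", "pass", "fail"],
)
```

Cases with low agreement are candidates for a clearer rubric or for explicit treatment as ambiguous, not for a forced majority label.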

Use offline and online evaluation for different jobs. Offline evaluation protects releases by testing candidate changes against a stable set. Online evaluation reveals distribution shifts, new user behavior, integration failures, and outcomes that cannot be recreated fully before launch. Neither is sufficient on its own.

Version the entire behavior-producing system: model, prompt, retrieval configuration, knowledge snapshot, tools, policies, and routing logic. A model comparison is not meaningful if the surrounding system changed silently. For every proposed release, make the decision policy explicit: ship, hold, narrow the scope, expand gradually, or roll back. This is the practical core of eval-driven development: target metrics and a decision policy are defined before launch, not negotiated afterward.
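
One lightweight way to detect silent changes is to fingerprint every versioned component together and attach that fingerprint to each evaluation run. The sketch below assumes components are tracked as simple version strings. The decision thresholds are hypothetical, and the function covers only three of the outcomes listed above.

```python
import hashlib
import json


def release_fingerprint(components: dict[str, str]) -> str:
    """Stable short hash over all behavior-producing component versions."""
    canonical = json.dumps(components, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def release_decision(pass_rate: float, critical_failures: int,
                     min_pass_rate: float = 0.95) -> str:
    """Illustrative policy, agreed before launch; thresholds are assumptions."""
    if critical_failures > 0:
        return "hold"
    if pass_rate >= min_pass_rate:
        return "expand gradually"
    return "narrow the scope"


release = {
    "model": "model-v1",
    "prompt": "prompt-v3",
    "retrieval_config": "retrieval-v2",
    "knowledge_snapshot": "2026-01-01",
    "tools": "tools-v4",
    "policies": "policy-v2",
    "routing": "routing-v1",
}
fingerprint = release_fingerprint(release)
```

Two evaluation runs are comparable only when their fingerprints differ in the component you meant to change. If the knowledge snapshot also moved, the comparison is measuring both changes at once.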

Operate the workflow as a production service

AI introduces variable outputs, but it still depends on familiar production systems: identity, permissions, data pipelines, APIs, queues, search, external tools, and user interfaces. A model can appear to be wrong when retrieval returned stale information or a downstream tool rejected an action. Monitoring only the final text hides the failure that engineers need to fix.

Trace the workflow end to end. Subject to your privacy and retention rules, capture the release version, retrieval and tool events, policy decisions, response, escalation, user correction, and eventual workflow outcome. Monitor distributions and failure categories, not just averages. An acceptable overall score can conceal a serious regression for a particular intent, customer segment, channel, or action.
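
The point about averages can be shown with a few lines over trace records. The trace format here (a dict with an `intent` and a `failed` flag) and the counts are invented for illustration.

```python
from collections import defaultdict


def failure_rate_by_segment(traces, key):
    """Failure rate per distinct value of `key` (e.g. intent, channel, segment)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for trace in traces:
        totals[trace[key]] += 1
        if trace["failed"]:
            failures[trace[key]] += 1
    return {segment: failures[segment] / totals[segment] for segment in totals}


# Hypothetical traffic: 90 billing traces with 2 failures, 10 refund traces with 5.
traces = (
    [{"intent": "billing", "failed": i < 2} for i in range(90)]
    + [{"intent": "refund", "failed": i < 5} for i in range(10)]
)
overall = sum(t["failed"] for t in traces) / len(traces)
by_intent = failure_rate_by_segment(traces, "intent")
```

Here the overall failure rate is 7%, which might pass a dashboard threshold, while the refund intent fails half the time.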

When the workflow depends on changing or private knowledge, connect it to governed retrieval instead of expecting the base model to contain the right answer. Use safe integration points for tools, least-privilege access, and explicit authorization for consequential actions. CI/CD, feature flags, canary releases, observability, audit trails, privacy controls, red teaming, and human review form a practical control plane for releasing changes without exposing the entire population at once.
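
As one example of releasing without exposing the entire population at once, a percentage rollout can assign users to stable buckets by hashing. This is a generic sketch, not the API of any particular feature-flag product; `in_rollout` and the flag name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < percent / 100


# The same user always lands in the same bucket for a given flag, so raising
# the percentage only adds users and never removes anyone already exposed.
exposed_at_10 = [u for u in ("u1", "u2", "u3", "u4") if in_rollout(u, "ai-refund-agent", 10)]
```

Setting the percentage back to 0 acts as the rollback switch, which is why the rollback conditions belong in the deployment contract before the rollout starts.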

Every material production failure should produce more than an incident ticket. Classify the failure, add or update the corresponding evaluation case, correct the prompt, retrieval, policy, tool, or interface responsible, and retest the workflow before restoring scope. That turns operational pain into a permanent improvement in the release system.
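
The incident-to-evaluation step can be as simple as appending a regression case keyed by the incident. The record shapes below are assumptions made for this sketch.

```python
def add_regression_case(eval_set: list[dict], incident: dict) -> dict:
    """Add an evaluation case for an incident; repeated calls do not duplicate it."""
    case = {
        "case_id": f"incident-{incident['id']}",
        "input": incident["input"],
        "expected_behavior": incident["expected_behavior"],
        "failure_category": incident["category"],
    }
    if all(existing["case_id"] != case["case_id"] for existing in eval_set):
        eval_set.append(case)
    return case


eval_set: list[dict] = []
incident = {
    "id": "1042",
    "input": "Customer asks to change the payout account.",
    "expected_behavior": "escalate",
    "category": "exceeded_authority",
}
add_regression_case(eval_set, incident)
```

After the fix, rerunning the suite against the new release shows whether the case now passes, and the case keeps protecting later releases from the same failure.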

Use 30-60-90 days to build the scaling system

A useful 30-60-90-day sequence starts with two lighthouse use cases. The goal is not to force every use case into production within a quarter. It is to prove that your organization can move valuable workflows through the same gates, shared controls, and learning loops.

Days 0-30: narrow the portfolio and establish accountability

Inventory active pilots and classify each by gate stage: Discover, Prove, Operate, or Scale. Do not let a polished demo assign its own stage.
Select two lighthouse workflows using customer impact, feasibility, strategic relevance, and risk. Choose workflows meaningful enough to matter but bounded enough to operate responsibly.
Record the current process and baseline before the AI changes user or employee behavior.
Name the product owner, operational owner, and required risk decision-makers for each workflow.
Complete the first version of each deployment contract, including the autonomy boundary and safe failure path.
Make the build-versus-buy decision at the workflow level. Include data access, integration, auditability, evaluation portability, operating cost, and switching constraints.
Pause pilots that have no accountable owner, no measurable outcome, or no plausible route through the operating gate.

This first phase is where leadership earns focus. A broad AI mandate often creates a queue of unrelated prototypes, each with its own vendor, data assumptions, and definition of success. Choosing lighthouse workflows gives the platform and governance work a real customer instead of turning them into abstract architecture programs.

Days 31-60: install evaluation, controls, and workflow operations

Build the offline evaluation set from representative work, edge cases, policy boundaries, and failures already found during discovery.
Define acceptance criteria and the release decision policy before further prompt or model optimization.
Integrate the necessary retrieval and tools through governed access points. Keep permissions narrower than the user’s full access where the workflow does not need it.
Add observability across retrieval, reasoning inputs, tool execution, output, escalation, and business outcome.
Prepare feature flags, a controlled rollout, rollback, incident procedures, and a customer recovery path.
Run the workflow with appropriate human oversight. Record corrections and escalations as structured evidence, not informal feedback in chat.
Train the people who will supervise, support, and improve the workflow. Update operating procedures before transferring real responsibility to AI.

Training cannot be limited to prompt tips. Operators need to know what the system may do, how its failure modes appear, when to intervene, how to report a new failure, and who can change production behavior. Product and engineering teams need the same vocabulary for evaluation, incidents, and risk.

Days 61-90: expand evidence, not enthusiasm

Increase scope only for workflows that meet their operating gate. Expansion may mean more traffic, another intent, a new channel, or greater autonomy; evaluate each change explicitly.
Compare the production outcome and cost with the original baseline. Include corrections, review, support, and remediation in the economics.
Turn repeated needs into shared components such as model access, retrieval, identity, evaluation infrastructure, observability, policy enforcement, and audit logging.
Move validated production failures into the evaluation suite and confirm that the release process catches them.
Review job responsibilities, incentives, staffing assumptions, and training needs created by the redesigned workflow.
Hold a portfolio decision for every remaining pilot: advance, narrow, combine, pause, buy, or stop.

Organizational change is part of this phase. As AI altered customer service work, 45% of teams updated job descriptions and 40% increased AI training. That is a useful warning against treating adoption as an in-app onboarding problem. If AI takes responsibility for part of a workflow, someone must take responsibility for supervising it, handling exceptions, and improving the system.

Assign decision rights clearly. The domain product team should own the user problem, outcome, workflow design, evaluation cases, and adoption. A platform function should own shared access, retrieval, observability, release infrastructure, and policy enforcement. Risk specialists should define control requirements and review higher-consequence uses. The operational owner should manage quality, escalations, and incidents after launch. Executive leadership should decide portfolio priority, capacity, and which bets no longer deserve investment.

This structure avoids two common extremes. A fully centralized AI team becomes a delivery bottleneck and loses domain context. Fully independent teams duplicate infrastructure and apply inconsistent controls. Centralize reusable capabilities and non-negotiable policies; keep workflow outcomes and day-to-day learning with empowered domain teams.

Expect pressure to spread successful patterns. In customer service organizations, 52% planned to scale AI into areas such as customer success, marketing, and sales. Reuse the platform, governance, evaluation methods, and operating vocabulary. Do not copy a support workflow into another function and assume its value, risks, permissions, or quality bar remain valid.

FAQ: decisions that determine whether AI scales

Should AI be owned centrally or by product teams?

Use a federated model. Centralize capabilities that become safer, cheaper, or more consistent when shared: approved model access, identity, data controls, retrieval services, evaluation tooling, observability, auditability, incident standards, and risk policies. Embed workflow ownership in the domain team that understands the user, process, and business outcome. A central group can set the paved road, but it should not become the permanent product team for every AI use case.

When is an AI workflow ready for more autonomy?

Increase autonomy when the workflow has demonstrated acceptable behavior for the exact action and population being added, failures are detectable, consequences are containable, rollback works, and an operational owner can handle exceptions. Do not remove human review merely because the average quality score improved. Judge autonomy by the worst credible consequence, the reversibility of the action, and the system’s ability to recognize when it should stop.

Autonomy is not binary. The system can retrieve information, recommend an action, draft the result, ask for confirmation, execute within a limited permission, or execute and trigger retrospective review. Choose the narrowest level that captures the value. Expand only when evidence supports the next level.
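
The levels above can be modeled as an ordered ladder that a workflow climbs one step at a time. Treating the listed order as a strict ranking is an assumption of this sketch, and `next_level` is a hypothetical helper.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    RETRIEVE = 1
    RECOMMEND = 2
    DRAFT = 3
    ASK_CONFIRMATION = 4
    EXECUTE_LIMITED = 5
    EXECUTE_WITH_REVIEW = 6


def next_level(current: Autonomy, evidence_supports_next: bool) -> Autonomy:
    """Move up exactly one level, and only when the evidence supports it."""
    if evidence_supports_next and current < Autonomy.EXECUTE_WITH_REVIEW:
        return Autonomy(current + 1)
    return current


level = next_level(Autonomy.DRAFT, evidence_supports_next=True)
```

Because the function never skips a level, each expansion has to be justified by evidence gathered at the level just below it.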

When should a pilot be stopped rather than scaled?

Stop or reframe a pilot when it has no accountable workflow owner, cannot beat a meaningful baseline, works only on curated inputs, requires unacceptable access, has no safe failure path, or creates more review and remediation than the outcome justifies. Also stop when the supposed AI problem is actually a broken policy, missing data, or poorly designed process that should be fixed directly.

A failed autonomy concept can still reveal a useful assistive product. If execution is too risky, narrow the workflow to retrieval, recommendation, drafting, or exception detection. That is a product decision, not a face-saving exercise. The right scope is the one that creates measurable value under an operating model you can defend.

At your next AI portfolio review, ask each owner to bring a baseline, deployment contract, evaluation evidence, and a clear gate decision. Fund shared infrastructure where the lighthouse workflows expose a recurring need. Expand only after the operating evidence catches up with the demo. That is how you turn a collection of pilots into an AI capability that can carry real responsibility.

References

January 28, 2026

Stop Losing Customers: Predict Churn with Digital Analytics and Act Before It’s Too Late

I stopped treating churn as a postmortem and started treating it as a forecasting problem. When we instrument our product, connect the dots across journeys, and embed those signals into our daily operations, churn becomes predictable—and preventable. This shift has been one of the most impactful product strategy moves my teams have made for product-led growth and retention analysis.

"Discover why and how CS teams can use digital analytics to take a proactive, predictive approach to churn, stopping it before it happens." That is exactly the mindset I bring to customer success and product collaboration: anticipate risk, intervene with precision, and demonstrate measurable impact.

The practical work starts with leading indicators. I look at user activation milestones, time-to-first-value, feature adoption depth, frequency and recency of key events, account-level coverage (are multiple users active or just one champion?), usage volatility, and friction signals like repeated errors or stalled onboarding. These behavioral inputs are stronger predictors of churn than survey sentiment alone.

From there, I create a churn risk score. Early on, a transparent rules-based model is usually enough to separate healthy from at-risk accounts. Over time, we can layer in supervised learning if the data supports it. I rely on Amplitude analytics, Pendo, or a unified analytics platform to tag events, build cohorts, and compute risk in near real time. This is where we consistently see the patterns that matter—especially around user activation and sustained adoption.

Signals without action won’t save a customer, so I connect the model to our systems of engagement. Through CRM integration, at-risk accounts trigger clear playbooks for CSMs and lifecycle marketers. Inside the product, in-app guides address gaps exactly where they occur—guiding users to the next best action, unblocking onboarding, or showcasing the value hidden behind underused features.

Because not every nudge works for every segment, we treat intervention design as a product problem and run A/B testing on copy, timing, channel, and offer. We test whether a contextual tooltip outperforms an email sequence, whether a short product tour beats a knowledge base link, and which incentives accelerate onboarding without cannibalizing expansion.

Operationally, this is a team sport. Product, CS, and marketing meet in product trios to review risk cohorts, prioritize root-cause fixes, and tune playbooks. We run a weekly risk review to turn insights into decisions, and we use monthly business reviews to connect leading indicators to lagging outcomes like retention, expansion, and NRR.

Measurement is non-negotiable. We pair retention analysis with qualitative feedback to understand whether our interventions truly change behavior. The goal is to close the loop: when a risk cluster improves, we codify the playbook; when a tactic underperforms, we learn, adjust, and try again. Over time, the organization builds a muscle for proactive, data-informed customer health management.

If you’re getting started, begin by instrumenting events tied to value moments, define a simple health score, and stand up a basic alerting workflow. Pilot one or two interventions, measure lift, and iterate. Within a single quarter, you’ll have enough signal to prioritize product improvements and scale the practices that reliably reduce risk.

Churn rarely surprises teams that listen to their data and respond in real time. With disciplined analytics, thoughtful in-product guidance, and tight alignment across CS and product, we can move from reacting to predicting—and keep more customers succeeding with far less effort.

Inspired by this post on Amplitude – Perspectives.

January 27, 2026
Build vs. Buy in an AI-First World: My Framework to De-Risk Decisions and Own Your Data

Build vs. buy is a decision that never truly goes away, and with AI reshaping the economics of software, I’m revisiting this question more frequently—and with more nuance—than ever. The temptation to “just build it” is real when prototypes are cheaper, shipping feels faster, and small tools can rival big platforms. But the real decision has never been about code; it’s about value, data, and long-term responsibility.

Across product orgs at every stage, I see the same pattern: AI makes building feel easier—but it doesn’t eliminate the tradeoffs. The hard part is separating what differentiates your product from what simply supports it. That’s why I start by asking whether the capability is truly core to my value stream, and then I force myself to reason about ownership and maintenance, not just velocity.

My rule of thumb remains simple: If something isn’t core to your value stream, don’t build it. And even when it is core, vendors may still be better positioned—especially for payments, invoicing, and infrastructure. Those domains carry deep operational complexity, continuous compliance, and reliability requirements that are easy to underestimate and painful to own.

Here’s how this plays out for me. I would never build my own blogging platform. I moved from WordPress to Ghost, because publishing isn’t where I differentiate, and the long tail of upgrades, security, and performance is a drag on focus. The platform does the job, my audience gets a better experience, and my team avoids owning commodity maintenance work.

On the other hand, I did build my own task management system—despite the abundance of excellent tools like Trello, Evernote, and OmniFocus. For me, tasks, notes, and workflows are deeply personal and idiosyncratic. I wanted my system to reflect how I think, plan, and communicate, with tight integration to my daily product rituals. In this case, the underlying data became the real product—and owning and controlling that data changed the equation.

That’s the heart of the decision: When the underlying data becomes the real product, ownership matters. Task management, notes, and workflows evolve into a personalized operating system. The moment your data model represents your unique value—and your future differentiation—build vs. buy is no longer a tooling choice; it’s a strategy choice.

AI is pushing this even further. Cheaper prototyping and “vibe coding” lower the cost of building. Tools like Claude Code and platforms from OpenAI make it viable to ship smaller, targeted tools that would have been uneconomical a few years ago. That expands the frontier of what teams can build without committing to a monolithic platform—and it puts pressure on vendors to improve data portability.

Which brings me to vendor lock-in. Exports aren’t always enough. When I evaluate CRMs or course platforms, I look for more than CSV dumps. I want robust, well-documented APIs, webhook coverage, import/export parity, schema transparency, and a clear migration path. I’ve seen teams drown in brittle integrations with Salesforce or HubSpot, struggle to unwind course data from Teachable, or get stuck in signature workflows around DocuSign without a clean escape hatch. Portability is table stakes now.

I treat build vs. buy as a discovery problem. Options are assumptions to test. On the build side, I run feasibility spikes: proof-of-concept integrations, latency checks, cost-to-serve models, and a sober read on maintenance. On the buy side, I trial vendors, not their marketing. I replicate a real workflow, test the edges, validate data portability, and simulate failure modes like vendor downtime or schema changes.

A word of caution on complexity: “we can build anything” is not the same as “we should build this.” Long-lived products accumulate hidden complexity over time—security, privacy, performance, observability, SRE runbooks, QA automation, documentation, and compliance. Be honest about engineering capabilities and maintenance costs, especially when uptime and regulatory exposure are in play.

My practical checklist looks like this: Is this core to our differentiation? Do we need to own the data model? How strong is data portability (APIs, webhooks, mapping, re-import)? What’s the true total cost of ownership over three years (people, ops, security, compliance)? Are there regulatory or reliability constraints better handled by a vendor? What’s the opportunity cost of not building something more strategic? And if we buy, what’s our exit plan?

Ultimately, build vs. buy isn’t just about speed or cost—it’s about core value, data ownership, and long-term responsibility. AI lowers the barrier to building, but it doesn’t erase complexity. Treat build vs. buy decisions like any other discovery effort: test assumptions, prototype, and validate before committing. Ask not just can we build it, but should we own it?

If you’re wrestling with vendor lock-in, fielding pressure to “just build it,” or rethinking your stack in an AI-first world, this lens will help you ask better questions before you commit. And if you’re exploring targeted builds alongside platforms like Stripe, Dropbox, Obsidian, or Ghost, I’d love to hear what’s working for you and where portability remains a hurdle.

Inspired by this post on Product Talk.

January 27, 2026
The Customer Feedback Playbook: AI-Powered Tactics I Use to Make Better Product Decisions

Customer feedback is the most reliable compass I have for product strategy and execution. Over the years leading product at HighLevel, I’ve built and refined a system that turns raw signals from users into clear, prioritized decisions our teams can confidently ship.

A practical guide to collecting and using product feedback in product management (from AI tools to early-stage tactics) for better product decisions.

My playbook starts with continuous discovery. I keep a steady flow of insights from sales calls, customer support threads, community forums, and in-product behavior so I can triangulate patterns rather than chase loud anecdotes. This mix of quantitative and qualitative data helps me separate urgent noise from strategically meaningful trends.

On the quantitative side, I rely on product analytics to ground the conversation. Amplitude analytics gives me activation, retention cohorts, and feature engagement, while controlled experiments and A/B testing validate whether an idea actually moves a target metric. Tying these signals to specific customer segments helps me see where product-led growth is working—and where it’s stalling.

For qualitative insight, I combine in-app guides and lightweight surveys (via tools like Pendo) with structured interviews and support escalations (often surfaced through platforms like Intercom). I map problems using the Kano Model to understand which requests are basic expectations, which are performance drivers, and which are potential delights. This keeps our roadmap focused on outcomes, not just outputs.

AI now accelerates the synthesis step. With LLMs in my product toolbox, I summarize interview transcripts, cluster themes across thousands of notes, and quantify sentiment without losing nuance. I still review raw artifacts to catch hallucinations and preserve context, but AI dramatically reduces the time from signal to insight, freeing me to spend more energy on judgment and storytelling.

In early-stage contexts, I bias toward speed and proximity to users. I schedule founder- or PM-led discovery calls weekly, instrument product tours early, and launch scrappy in-product prompts to validate demand before over-investing. When data is sparse, I focus on high-signal channels (power users, churned customers with qualified use cases) and document crisp problem statements that connect directly to activation, retention, and revenue outcomes.

Prioritization ties everything together. I translate insights into hypotheses aligned to outcome-based OKRs rather than output targets, then pressure-test them for feasibility and strategic fit. We run small, measurable experiments, track deltas in activation and retention, and adjust the roadmap and sprint planning cadence based on what the data and customers teach us.

This approach builds trust with stakeholders and creates empowered product teams. By grounding decisions in a transparent trail of feedback, analytics, and experiments, we reduce thrash, move faster, and—most importantly—ship product moments that customers value.

If you’re refining your own feedback engine, start by instrumenting the basics, set a weekly discovery rhythm, and let AI handle the heavy lifting on aggregation and synthesis. The compounding effect is real: better insights lead to better bets, which lead to better outcomes for your users and your business.

Inspired by this post on Product School.

January 26, 2026
Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

Trust is the currency of any high-stakes AI product, and nowhere is that more true than in healthcare. I recently dug into how Healio built an AI assistant for physicians—an audience that can’t afford to be wrong—and it’s a masterclass in balancing accuracy, transparency, and speed without compromising credibility.

Healio, a 125-year-old medical publishing company, set out to create Healio AI to help clinicians prepare for patient care. From the outset, their guiding principle was simple: physicians won’t trust you until you prove it. That lens shaped every decision—from discovery and prototyping to architecture, evaluation, and ongoing validation.

Discovery started with a survey of 300 healthcare professionals to understand real-world needs at the point of care. The headline insight: physicians primarily want AI for preparation, not bedside use. Even more surprising, the top ask wasn’t purely diagnostic support; it was help with patient communication and empathy—translating complex information into clear, accessible conversation.

Momentum mattered. After beginning with Figma mockups to validate workflows, the team built a working prototype in a single weekend using Cursor. That velocity wasn’t about cutting corners; it was about proving value quickly, reducing ambiguity, and iterating with concrete feedback from physicians.

Under the hood, the system employs RAG and hybrid search—combining lexical search, vector search, and semantic search across multiple trusted sources like PubMed. As any PM who has integrated biomedical literature knows, "just use PubMed" isn’t simple—there are five different ways to access the same data, each with trade-offs. The team made pragmatic choices to balance freshness, coverage, latency, and cost while preserving trust in source quality.
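To make "hybrid search" concrete, here is a minimal sketch of one common way to merge a lexical ranking and a vector ranking: reciprocal rank fusion (RRF). It is an illustration only. The source does not say which fusion method Healio uses, and the document IDs, the function name, and the `k = 60` constant are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A larger k flattens the advantage of top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a lexical index and a vector index.
lexical_hits = ["pmid:111", "pmid:222", "pmid:333"]
vector_hits = ["pmid:222", "pmid:444", "pmid:111"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

Documents that both retrievers rank well rise to the top, which is the behavior you want when neither keyword matching nor embeddings alone can be trusted with biomedical terminology.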

Designing for trust extended all the way to the citation UX. The team leaned into citations that physicians actually trust: subscripts, hover states, and progressive disclosure. This gave clinicians verifiable threads back to source material without overwhelming the core interaction, aligning with how experts want to audit evidence under time pressure.

Evaluation wasn’t left to chance. They stood up eight LLM judges for evals: safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and overall quality. Just as importantly, they treated those signals as directional, not definitive. In a high-stakes domain, physician feedback trumps LLM-as-judge feedback—so they complemented automated evals with direct reviews from practicing clinicians to calibrate quality and reduce hallucinations.

On the safety front, the team implemented HIPAA compliance and input guardrails for masking personal health information. That choice reflects strong data governance and privacy-by-design thinking: protect PHI by default, constrain prompts to safe boundaries, and make compliance a first-class citizen in the product architecture.

They also addressed monetization without compromising experience. Serving contextual ads while the LLM processes queries is a practical approach that preserves physician workflow efficiency and creates a clear, non-intrusive revenue model.

Critically, the work didn’t stop at launch. The Healio Innovation Partners provide ongoing discovery and validation, ensuring the system evolves with physician needs and the medical evidence base. This is the operating cadence you want for any AI product that sits at the intersection of safety, accuracy, and fast-changing knowledge.

My takeaways for building AI in high-stakes domains: prioritize retrieval-first pipelines over model cleverness; couple RAG with hybrid search across vetted sources; design citations that earn trust at a glance; use eval-driven development, but let domain-expert feedback be the ultimate judge; and embed regulatory compliance into your product strategy from day one. If trust is your North Star, this is a playbook worth emulating.

Inspired by this post on Product Talk.

January 22, 2026
AI-Powered Growth Loops: Transform Your PLG Product into a Self-Optimizing Engine

Across my teams and portfolio, I’m watching AI fundamentally reshape product-led growth—from static funnels and one-off playbooks to adaptive, compounding growth loops that learn in real time. The shift isn’t just technological; it’s an operating model change that rewards continuous discovery, rigorous instrumentation, and outcome-driven product strategy.

"Learn how AI is transforming PLG with a new generation of growth loops that can turn your product into a self-optimizing platform." That line captures what I’ve been building toward: systems that sense user intent, decide the next best action, act contextually, and learn to improve the loop with every interaction.

Here’s the core pattern I rely on. First, sense: unify product analytics and behavioral signals (think Amplitude analytics, Pendo events, Intercom conversations) into a single, queryable, privacy-safe layer. Second, decide: combine LLMs, rules, and retrieval to segment users by intent and probability of success. Third, act: deliver in-app guides, product tours, tooltips, or personalized nudges that accelerate user activation and time-to-value. Finally, learn: run A/B tests with a clear minimum detectable effect (MDE), then feed outcomes back into the model for continuous optimization.

Activation is where the gains start compounding. With generative AI, I can auto-generate tailored onboarding checklists, dynamic walkthroughs, and contextual help that adapts to the user’s role, data maturity, and current friction points. We’ve moved from generic product tours to precision guidance that updates based on real-time behavior, often lifting first-week activation and shortening time-to-first-value without adding support load.

Experimentation is the governor that keeps speed and quality in balance. I instrument every growth loop end to end and pair eval-driven development with A/B testing to confirm incremental impact. Amplitude analytics gives me cohort views and path analysis; Pendo or Intercom can deliver in-app variants; a unified analytics platform closes the loop on retention analysis so I’m not optimizing for click-through at the expense of long-term value.

Retention and expansion are where AI shines as a compounding engine. Retrieval-first pipeline patterns allow instant, contextual support that deflects tickets and boosts perceived product competence. Agentic AI can orchestrate next-best actions—prompting power users toward advanced features, surfacing value moments, or timing expansion prompts when success signals appear. The result is a virtuous cycle: better guidance drives deeper adoption, which improves model accuracy, which unlocks more relevant guidance.

None of this works without guardrails. I bake in AI risk management from the start: strict data governance, privacy-by-design, human-in-the-loop review for high-impact actions, transparent user consent, and continuous drift monitoring. The goal is reliable automation that users trust—augmented by clear fail-safes when confidence drops.

Operationally, I anchor the work in empowered product teams and product trios, set OKRs around outcomes rather than outputs, and practice continuous discovery to validate problems and solutions before scaling. The baseline metrics I watch are activation rate, time-to-value, week-four retention, PQL/PQA conversion, expansion revenue, and support deflection, each tied to a specific growth loop hypothesis.

If you’re starting fresh, begin with the highest-leverage loop: user activation. Instrument your onboarding journey, define the critical path to value, ship two to three personalized interventions, and measure impact with a precommitted MDE. Scale what wins, drop what doesn’t, and iterate weekly. Once activation is compounding, extend the same approach to adoption depth, collaboration features, and expansion triggers.

In practical terms, AI-powered PLG is less about flashy features and more about disciplined feedback loops. Build the sensing fabric, keep the decision layer auditable, ship small actions quickly, and treat learning as the product. Do that, and your product doesn’t just grow—it becomes a self-optimizing platform.

Inspired by this post on Product School.

January 21, 2026
Inside Product at Heart 2026: Bold Single-Track Vision, AI Everywhere, Deeper Connections

I just tuned into the latest conversation on the upcoming Product at Heart 2026, and it hit on the exact challenges product leaders are navigating right now: curating meaningful content in a world where AI moves faster than our agendas, designing formats that create real connection, and ensuring every minute earns its place. Listening to Petra Wille and Teresa Torres map out the speaker lineup, workshops, and structural shifts, I found myself nodding along—this is the kind of thoughtful curation we need if we want product teams and product leaders to walk away with practical value, not just inspiration.

What stood out immediately is the bold move to a single-track conference for 2026. In an era of generative AI hype and endless breakouts, this choice signals clear intent: tighter curation, a shared experience, and less FOMO. The team isn’t carving out a separate AI track, and I love that decision. Their stance is simple and sensible: AI will show up everywhere, but not as a siloed topic, because the team sees it as part of the everyday toolkit. That mirrors how high-performing, empowered product teams actually work today: AI workflows are part of the operating system, not a sideshow.

The keynote lineup is already compelling. Christian Idiodi (SVPG) brings storytelling that turns product principles into habits you can actually use on Monday. Elaine Kasket, a cyberpsychologist who explores the digital afterlife and AI replicas, will push us to think more deeply about the human side of our systems. And Teresa Torres will share what she’s learning about AI, exactly the kind of continuous discovery mindset we need as we integrate LLMs into product discovery and delivery.

I’m also thrilled to see roundtables become what they’re calling an “alternative track.” That’s a smart way to deepen learning without fragmenting attention. The best conference ROI I’ve had often comes from targeted small-group conversations—where product trios compare approaches, swap metrics frameworks, or challenge each other’s product strategy assumptions. It’s a design choice that rewards curiosity and builds communities of practice.

We also get a behind-the-scenes look at Teresa’s Maker Studio workshop, where participants will build personal AI workflows. That’s exactly the hands-on, practitioner-first approach teams need right now—less demo theater, more systems that stick. If your roadmap includes integrating LLMs into continuous discovery or augmenting your team’s decision velocity, this kind of guided practice is gold.

The broader workshop slate looks deep and balanced. Expect returning favorites and practical frameworks: Rich Mironov on the realities of product leadership in complex orgs, Büşra Coşkuner’s metrics workshop on translating outcomes into action, and further workshops from Marcus Castenfors and Özlem Yüce. From success metrics to toolkits for product managers, the content spans IC to product leadership, which is ideal if you’re stepping into a new role or scaling empowered product teams.

One of the most exciting evolutions is the Product Leadership Event, now a 1.5-day retreat. The format blends talk sessions, mini-workshops, dinners, and small-group excursions (boat rides, improv, etc.), giving leaders time and space to exchange playbooks, stress-test decisions, and build real relationships. It’s capped at 60 attendees (all in product leadership roles) to keep it intimate and useful. As someone who believes in outcome-based OKRs and first-principles decision making, I appreciate how this structure encourages depth over breadth and real accountability among peers.

Here are the core takeaways I’m carrying into my own planning: single-track means tighter curation, so every talk has to earn its place. Roundtables are growing into an “alternative track,” offering more ways to engage beyond stage talks. Workshops go deep and meet you where you are—IC, manager, or executive. And the leadership retreat expands to maximize learning from peers, not just from the stage. If you care about product discovery, product strategy, and conference networking that leads to actual business impact, this program looks thoughtfully engineered.

If you’re planning your 2026 calendar—or just curious how conferences evolve alongside the craft—this is a thoughtful walkthrough of what to expect. Come say hi to Teresa and Petra—on stage, at a roundtable, or somewhere in the hallway conversations that make these events memorable.

For more context and resources mentioned, explore: Product at Heart, Arne Kittler, Mind the Product, Christian Idiodi of Silicon Valley Product Group, Elaine Kasket, House of Beautiful Business, The 7 Habits of Highly Effective People by Stephen Covey, Rich Mironov, Marty Cagan, Claude Code, Codex by OpenAI, Marcus Castenfors, Büşra Coşkuner and her Success Metrics: A Playbook for Product Managers, Özlem Yüce’s Essential Toolkit for Product Managers, Petra’s Product Leadership Wheel (PLwheel), and Netlight.

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Inspired by this post on Product Talk.

January 20, 2026
How I Harness AI to Supercharge Product Discovery for Faster Research, Prototyping, and Validation

I’ve led product teams through countless discovery cycles, and nothing has accelerated our learning loops like AI. By weaving AI into our continuous discovery practice at HighLevel, I cut time-to-insight, reduce risk earlier, and keep our product strategy relentlessly focused on customer outcomes.

AI streamlines product discovery by accelerating research, prototyping, and validation, enabling teams to make faster, smarter, and user-driven decisions.

In the research phase, I use generative AI to synthesize interviews, cluster themes, and surface unmet needs in minutes instead of days. Pairing those qualitative insights with behavioral signals in Amplitude analytics helps me spot high-intent cohorts and friction points at scale, so our problem framing is both human-centered and data-backed.

From there, I translate insights into crisp hypotheses and prioritize them with the Kano Model and outcome-based OKRs. To keep experiments honest, I define a minimum detectable effect (MDE) up front and design A/B testing plans that reflect realistic traffic and seasonality, ensuring our decisions are statistically grounded rather than anecdotal.
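A quick way to check whether an MDE is realistic for your traffic is to estimate the sample size it implies. The sketch below uses the standard normal-approximation formula for a two-sided, two-proportion test. The baseline rate, the absolute MDE, and the weekly traffic figure are made-up inputs for illustration, not numbers from my experiments.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10% baseline activation, 2-point absolute MDE.
n = sample_size_per_arm(baseline=0.10, mde_abs=0.02)

# Assuming 2,000 eligible users per week split across two variants.
weeks_needed = math.ceil(2 * n / 2_000)
```

Halving the MDE roughly quadruples the required sample, so if the implied runtime is longer than you can tolerate, the honest options are a larger MDE, a higher-traffic surface, or a different validation method.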

Prototyping is where generative AI really shines. I spin up multiple UX flows, UI copy variants, and edge-case scenarios using prompt engineering, then iterate with rapid feedback from product trios. When needed, I mock in-app guides and product tours to validate onboarding concepts before we commit to code, preserving velocity without sacrificing quality.

For validation, I lean on a mix of lightweight experiments—fake-door tests, concierge pilots, and targeted A/B testing—augmented by in-product surveys via Pendo or Intercom. For AI-powered features, I apply eval-driven development to measure relevance, latency, and safety, so we can ship responsibly while maintaining the pace of learning.

This approach only works when the team is structured to move fast. Empowered product teams and product trios own discovery end-to-end, with clear guardrails around data governance, privacy-by-design, and AI risk management. That alignment lets us shift from opinions to evidence, and from output to outcomes, without friction.

If you’re getting started, pick one discovery loop to transform: automate research synthesis, prototype two to three variants with AI, and validate with a tightly scoped experiment. Instrument your analytics, track time-to-insight and time-to-prototype, and adjust your roadmap and sprint planning based on what you learn. The payoff is immediate: faster cycles, stronger conviction, and a more user-driven path to product-led growth.

Inspired by this post on Product School.

January 19, 2026

Agentic AI for Construction Tendering: A Product Playbook

Your tender inbox contains a deadline, a stack of attachments, and a chain of decisions that still lives in people’s heads. The tempting response is to buy a model that can read PDFs. That solves only the most visible part of the problem.

A useful tendering product must determine which documents matter, extract requirements with evidence, match those requirements to a catalog, retrieve approved pricing, draft an offer, identify uncertainty, and route exceptions before anything reaches the customer. If you lead AI or product strategy for a manufacturer or supplier, your first goal should not be an autonomous bidder. It should be the smallest tender category in which every decision can be observed, evaluated, and improved.

Pick a bounded quote, not the entire tendering department

Construction tendering is too broad for a credible first release. Product categories have different terminology, selection rules, catalogs, pricing structures, and exception patterns. A system that works for one bounded category has not automatically learned how to quote every building product.

One effective wedge started with radiator requests for a single design partner before expanding to other building products. That constraint made the catalog, expected outputs, and expert reviewers knowable. It also created a place to learn the real workflow rather than designing from an idealized process diagram.

Choose your wedge using operational criteria, not enthusiasm for the model:

The request appears often enough for reviewers to recognize recurring patterns.
The relevant catalog is bounded and maintained by a clear owner.
A domain expert can explain why a product is suitable, unsuitable, or uncertain.
The correct price can be traced to an approved system or document.
Historical tenders and reviewer corrections are available for evaluation.
An error can be caught during review before it becomes a customer-facing commitment.

A design partner is especially valuable because the work is not fully documented. In one implementation, the product team spent a week observing the process on-site. That kind of observation exposes the browser tabs, informal checks, catalog shortcuts, and exception handling that an interview alone can miss.

Follow one tender from the incoming email to the final offer. At every handoff, record five things: the input, the decision, the evidence used, the person or system responsible, and the condition that triggers an exception. If you cannot state those five things, you do not yet have a well-defined agent task.
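The five-part record can be captured in a small structure that flags which elements are still undefined. This is a sketch: the field names and the example values are illustrative, not taken from a real tendering system.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffRecord:
    """One handoff in the tender workflow; all field names are illustrative."""
    input_artifact: str
    decision: str
    evidence: str
    owner: str
    exception_trigger: str

    def missing(self) -> list[str]:
        """Names of the five elements that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_well_defined(self) -> bool:
        return not self.missing()


# Hypothetical observation from following one tender through intake.
record = HandoffRecord(
    input_artifact="Tender email with attached specification PDF",
    decision="Is this request within the radiator category?",
    evidence="Section headings and product terms in the specification",
    owner="Inside sales reviewer",
    exception_trigger="",  # nobody could state when this step escalates
)
```

Any handoff where `missing()` is non-empty is a discovery task, not yet an agent task.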

Keep three layers of information separate from the beginning:

Stated requirements: what the tender explicitly asks for, with the originating file, page, section, or table cell.
Interpretations: conclusions the system or reviewer draws when terminology is ambiguous, incomplete, or inconsistent.
Commercial decisions: the selected product, approved price, assumptions, exclusions, and offer language.

This separation matters because a polished offer can hide a weak inference. A reviewer needs to see where the tender ends and the system’s judgment begins.

Define the first outcome as a review-ready tender case: organized source documents, structured requirements, proposed product matches, price provenance, unresolved issues, and a draft offer. That is a more useful product boundary than “understands construction PDFs.” It gives the reviewer something concrete to accept, correct, or reject.

Turn the workflow into a decision graph with specialist agents

A chatbot is the wrong mental model. Tendering is a decision graph in which an early classification or extraction error can contaminate every downstream step. Real packages can range from a short request to more than 1,800 pages describing an entire building. The system therefore needs to plan work, retain state, reconcile evidence, and know when coverage is incomplete.

Use agents only where a task requires interpretation, planning, or exception handling. Keep exact operations – file handling, arithmetic, catalog queries, identifiers, template validation, and access control – in deterministic code or approved systems.

Stage	Preferred control	Required output
Intake	Rules plus a classifier	A tender case containing the email, attachments, document types, and routing status
Requirement extraction	Parser tools plus a specialist agent	Structured requirements with source locations, missing fields, and ambiguities
Product matching	Catalog retrieval plus a reasoning agent	Candidate products, requirement coverage, incompatibilities, and rationale
Pricing	Approved database, CPQ, or pricing service	Exact product-price records with source and validity information
Offer generation	Controlled template plus a drafting agent	A draft that distinguishes confirmed facts, assumptions, and exclusions
Quality review	Rules plus a separate review agent	A pass or block decision with issue codes and supporting evidence
Human approval	Domain and commercial policy	An approval, correction, rejection, or escalation that becomes evaluation data

Each agent should have an explicit contract. Specify its permitted inputs, tools, output schema, evidence requirements, completion test, and escalation behavior. “Find the right product” is not a contract. “Return catalog candidates that meet the extracted requirements, identify uncovered requirements, cite the catalog evidence, and abstain when no candidate qualifies” is much closer to one.
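Here is a minimal sketch of what the product-matching contract could look like as an output schema plus a completion rule. Exact attribute equality stands in for the reasoning agent so the example stays runnable; the SKUs, attribute names, and class names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MatchResult:
    candidates: list[str] = field(default_factory=list)   # catalog SKUs meeting every requirement
    uncovered: list[str] = field(default_factory=list)    # requirements no catalog item meets
    evidence: dict[str, dict[str, str]] = field(default_factory=dict)  # SKU -> cited attributes
    abstained: bool = False


def match_products(requirements: dict[str, str],
                   catalog: dict[str, dict[str, str]]) -> MatchResult:
    """Return qualifying candidates with evidence, or abstain when none qualifies."""
    result = MatchResult()
    covered: set[str] = set()
    for sku, attrs in catalog.items():
        met = {key for key, value in requirements.items() if attrs.get(key) == value}
        covered |= met
        if met == set(requirements):
            result.candidates.append(sku)
            result.evidence[sku] = {key: attrs[key] for key in requirements}
    result.uncovered = sorted(set(requirements) - covered)
    result.abstained = not result.candidates
    return result


# Hypothetical two-item catalog.
catalog = {
    "RAD-600": {"height_mm": "600", "connection": "side"},
    "RAD-900": {"height_mm": "900", "connection": "bottom"},
}
ok = match_products({"height_mm": "600", "connection": "side"}, catalog)
none = match_products({"height_mm": "600", "connection": "center"}, catalog)
```

The point of the schema is that abstention and uncovered requirements are first-class outputs, so downstream stages never have to infer them from a missing answer.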

For large document sets, require the workflow to maintain a task plan. It should inventory files, identify relevant sections, process bounded units of work, track completed and pending units, reconcile repeated or conflicting requirements, and run a final coverage check. A generated answer is not proof that the package was fully processed.
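A task plan can be as simple as three sets and a final coverage check. The sketch below assumes work units are identified by strings such as file name plus page range; that naming scheme and the class itself are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPlan:
    """Tracks bounded units of work for one tender package."""
    pending: set[str] = field(default_factory=set)
    completed: set[str] = field(default_factory=set)
    failed: set[str] = field(default_factory=set)

    def add_units(self, units: list[str]) -> None:
        # Units already completed are not re-queued.
        self.pending.update(u for u in units if u not in self.completed)

    def mark(self, unit: str, ok: bool) -> None:
        self.pending.discard(unit)
        self.failed.discard(unit)
        (self.completed if ok else self.failed).add(unit)

    def coverage_gaps(self) -> list[str]:
        """Units that were planned but never successfully processed."""
        return sorted(self.pending | self.failed)


plan = TaskPlan()
plan.add_units(["spec.pdf:1-50", "spec.pdf:51-100", "drawings.pdf:1-20"])
plan.mark("spec.pdf:1-50", ok=True)
plan.mark("spec.pdf:51-100", ok=False)  # e.g. a parser error on a scanned section
gaps = plan.coverage_gaps()
```

A non-empty `coverage_gaps()` should block the case from reaching the drafting stage, however plausible the generated answer looks.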

The review agent deserves its own role. Asking the drafting agent to “check your work” keeps creation and approval inside the same reasoning path. A separate reviewer can inspect the draft against the extracted requirements, catalog evidence, price records, and policy rules. It should return defects and a gate decision rather than silently rewriting the offer. Silent rewriting makes it harder to identify which upstream component failed.

This pattern has practical value because a dedicated review agent can catch errors before human review, much like a separate code review step. Independence comes from the reviewer’s task and evidence contract; adding more agent personas without distinct responsibilities only creates orchestration overhead.
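The reviewer's contract can be sketched as a function that reads the draft and the upstream evidence, returns defects with the stage they point to, and never edits the draft. Simple rule checks stand in for the review agent here; the issue codes, field names, and price tolerance are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    code: str       # issue code, e.g. "PRICE_MISMATCH"
    stage: str      # upstream stage the defect points to
    evidence: str   # what the reviewer compared
    blocking: bool = True


def review_offer(draft: dict, requirements: dict[str, str],
                 price_records: dict[str, float]) -> tuple[str, list[Defect]]:
    """Inspect a draft against upstream evidence and return a gate decision."""
    defects: list[Defect] = []
    for key in requirements:
        if key not in draft["covered_requirements"]:
            defects.append(Defect("MISSING_REQUIREMENT", "product_matching",
                                  f"requirement '{key}' is not addressed in the draft"))
    for sku, quoted in draft["prices"].items():
        approved = price_records.get(sku)
        if approved is None:
            defects.append(Defect("NO_APPROVED_PRICE", "pricing",
                                  f"{sku} has no approved price record"))
        elif abs(approved - quoted) > 0.005:
            defects.append(Defect("PRICE_MISMATCH", "pricing",
                                  f"{sku}: draft {quoted} vs approved {approved}"))
    decision = "block" if any(d.blocking for d in defects) else "pass"
    return decision, defects


# Hypothetical draft with one uncovered requirement and one wrong price.
draft = {"covered_requirements": ["height_mm"], "prices": {"RAD-600": 129.0}}
decision, defects = review_offer(
    draft,
    requirements={"height_mm": "600", "connection": "side"},
    price_records={"RAD-600": 119.0},
)
```

Because each defect names a stage, a blocked case tells you which upstream component to inspect instead of leaving a silently corrected offer.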

The interface is part of the architecture. During discovery, a dedicated web workbench can be more useful than hiding the workflow behind a legacy integration. Put the source document, extracted requirement, proposed match, price evidence, and review issue within the same review path. That gives the product team control over feedback capture and makes the reason for each correction visible. One tendering product used its own web application to iterate toward greater automation rather than beginning as a backend-only integration.

You can still read from and write to existing systems at defined boundaries. The distinction is between integration and dependence: integrate with systems of record for catalogs, prices, customers, and approved quotes, but do not let an inflexible legacy screen determine how reviewers inspect an emerging AI workflow.

Evaluate every decision before judging the complete quote

An end-to-end result tells you whether a tender case failed. It rarely tells you why. If the final product is wrong, the defect may have come from document routing, requirement extraction, catalog retrieval, product reasoning, pricing, drafting, or review. A single overall accuracy score collapses those failure modes into an unactionable number.

Build an evaluation set for each agent contract and retain a smaller end-to-end set for workflow behavior. Per-agent evaluations make changes and regressions easier to localize. The useful measures differ by decision:

Intake: correct document classification, attachment coverage, and routing accuracy for each supported tender type.
Extraction: field-level completeness, exactness for identifiers and numeric values, source-location accuracy, and the rate of unsupported fields.
Product matching: reviewer agreement, requirement coverage, incompatible recommendations, unsupported matches, and appropriate abstention.
Pricing: exact agreement with the approved source, correct product-price association, formula validation, and rejection of unavailable or invalid records.
Offer generation: required-field completeness, consistency with selected products and prices, correct treatment of assumptions, and unsupported statements.
Review: detection of known defects, false blocks on valid cases, issue classification, and evidence quality.

Slice failures by characteristics that change the work: document length, file type, layout, product category, presence of tables, and conflicting or revised requirements. An aggregate score can improve while performance deteriorates on the long or unusual tenders that consume most reviewer attention.
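The gap between an aggregate score and a sliced score is easy to see with a toy evaluation set. The counts below are invented to illustrate the effect; the slice key and labels are assumptions.

```python
from collections import defaultdict


def accuracy_by_slice(cases: list[dict], slice_key: str) -> dict[str, float]:
    """Share of correct cases within each value of slice_key."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for case in cases:
        bucket = case[slice_key]
        totals[bucket] += 1
        correct[bucket] += int(case["correct"])
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


# Invented evaluation results: many short tenders, few long ones.
cases = (
    [{"length": "short", "correct": True}] * 9
    + [{"length": "short", "correct": False}] * 1
    + [{"length": "long", "correct": True}] * 1
    + [{"length": "long", "correct": False}] * 1
)

overall = sum(case["correct"] for case in cases) / len(cases)
by_length = accuracy_by_slice(cases, "length")
```

Here the overall figure is about 83%, while long tenders sit at 50%. Reporting both keeps the hard cases from disappearing into the average.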

Use end-to-end measures for the product outcome: review time, correction volume, correction severity, exception rate, percentage of cases that reach the defined completion state, and whether the workflow finishes before its operational deadline. Keep commercial outcomes separate from model correctness. Quote acceptance or win rate can be affected by price, availability, competition, customer relationships, and sales execution; it should not be treated as a clean extraction or matching metric.

Observability must connect those layers. For each tender case, retain the task plan, agent inputs and outputs, tool calls, retrieved catalog or price records, prompt and model versions, gate decisions, latency, failures, and human corrections. Complex agent chains can exceed what generic monitoring exposes, which is why custom tracing and Agent Analytics became necessary in a production tendering workflow.

Capture reviewer feedback as structured data, not only as an edited final document. Store the original output, corrected value, responsible stage, reason code, evidence used, and final disposition. Useful reason codes include missing requirement, incorrect extraction, unsupported product match, pricing issue, unresolved conflict, invalid assumption, and drafting defect.

Do not feed every edit directly back into the system and call it self-learning. A reviewer may change wording for preference, apply customer-specific knowledge, or correct an upstream error in the final draft. Validate the correction, assign it to the right component, and add it to the corresponding evaluation set. That turns human review into controlled learning rather than an untraceable feedback loop.
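One way to keep that discipline is to store each correction as a typed record and route only validated corrections into the evaluation set of the responsible stage. This is a sketch under assumed names: the stage labels, the `validated` flag, and the example corrections are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class ReasonCode(Enum):
    MISSING_REQUIREMENT = "missing_requirement"
    INCORRECT_EXTRACTION = "incorrect_extraction"
    UNSUPPORTED_PRODUCT_MATCH = "unsupported_product_match"
    PRICING_ISSUE = "pricing_issue"
    UNRESOLVED_CONFLICT = "unresolved_conflict"
    INVALID_ASSUMPTION = "invalid_assumption"
    DRAFTING_DEFECT = "drafting_defect"


@dataclass
class Correction:
    original: str
    corrected: str
    stage: str            # component held responsible after validation
    reason: ReasonCode
    evidence: str
    disposition: str
    validated: bool = False  # set by a separate check, not by the edit itself


def route_to_eval_sets(corrections: list[Correction]) -> dict[str, list[Correction]]:
    """Group validated corrections by responsible stage; skip everything else."""
    eval_sets: dict[str, list[Correction]] = {}
    for correction in corrections:
        if correction.validated:
            eval_sets.setdefault(correction.stage, []).append(correction)
    return eval_sets


corrections = [
    Correction("height 500 mm", "height 600 mm", "extraction",
               ReasonCode.INCORRECT_EXTRACTION, "spec.pdf p. 12, table 3",
               "accepted", validated=True),
    Correction("We are pleased to offer", "We offer", "offer_generation",
               ReasonCode.DRAFTING_DEFECT, "reviewer wording preference",
               "not a defect"),
]
eval_sets = route_to_eval_sets(corrections)
```

The wording preference stays in the audit trail but never becomes a test case, which keeps the evaluation sets about correctness rather than style.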

Release changes through two gates. First, the modified agent must pass its own evaluation set. Second, the complete workflow must pass the end-to-end set because an improvement in one component can change the assumptions of another. The trace should show exactly which version produced every customer-facing artifact.
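The two gates can be expressed as a small release check. The metric names and thresholds below are placeholders; the real values belong to whoever owns each evaluation set.

```python
def passes(results: dict[str, float], thresholds: dict[str, float]) -> bool:
    """True when every thresholded metric is present and meets its minimum."""
    return all(results.get(metric, float("-inf")) >= minimum
               for metric, minimum in thresholds.items())


def release_decision(agent_results: dict[str, float],
                     agent_thresholds: dict[str, float],
                     e2e_results: dict[str, float],
                     e2e_thresholds: dict[str, float]) -> str:
    """Gate 1 is the changed agent's own set; gate 2 is the end-to-end set."""
    if not passes(agent_results, agent_thresholds):
        return "blocked: agent evaluation set"
    if not passes(e2e_results, e2e_thresholds):
        return "blocked: end-to-end set"
    return "release"


# Hypothetical change: extraction improved, but fewer cases reach completion.
outcome = release_decision(
    agent_results={"field_completeness": 0.97},
    agent_thresholds={"field_completeness": 0.95},
    e2e_results={"completion_rate": 0.88},
    e2e_thresholds={"completion_rate": 0.90},
)
```

In this example the component-level improvement is real, yet the release is still blocked at the second gate, which is exactly the interaction the end-to-end set exists to catch.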

Earn autonomy one commercial boundary at a time

“Straight-through processing” is incomplete unless you define where the straight-through path ends. Automatically extracting requirements is not the same risk as automatically selecting a product, committing a price, writing to the CPQ, or sending an offer to a customer.

Use an autonomy ladder with an explicit boundary at each stage:

Shadow: the system processes live-shaped cases, but its outputs do not affect the operational tender.
Assist: it organizes documents and extracts requirements while a person performs matching, pricing, and drafting.
Draft: it proposes products and produces an offer, but a human must review and approve every case.
Gated processing: it completes predefined internal actions for in-scope cases and sends all exceptions to a reviewer.
External dispatch: it sends an offer without case-by-case approval only when commercial policy explicitly permits that action and every required gate passes.

Eligibility for a higher-autonomy path should be machine-checkable. At minimum, confirm that the product category is in scope, every required document was processed, required fields are present, source evidence is attached, conflicts and revisions are resolved, the product match satisfies its evidence rules, the price comes from an approved valid source, the review agent reports no blocking defect, and the full trace is retained.
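Machine-checkable means the list above can be written as named boolean checks that either all pass or return the reasons they did not. The sketch below assumes each upstream stage has already written its status onto the case; the field names, the in-scope category set, and the routing labels are illustrative.

```python
from dataclasses import dataclass

IN_SCOPE_CATEGORIES = {"radiators"}  # assumption: one bounded category


@dataclass
class TenderCase:
    category: str
    all_documents_processed: bool
    missing_fields: list[str]
    evidence_attached: bool
    open_conflicts: int
    match_evidence_ok: bool
    price_source_approved: bool
    blocking_defects: int
    trace_retained: bool


def eligibility_failures(case: TenderCase) -> list[str]:
    """Names of the gates that fail; an empty list means the case is eligible."""
    checks = {
        "category_in_scope": case.category in IN_SCOPE_CATEGORIES,
        "all_documents_processed": case.all_documents_processed,
        "required_fields_present": not case.missing_fields,
        "source_evidence_attached": case.evidence_attached,
        "conflicts_resolved": case.open_conflicts == 0,
        "match_evidence_ok": case.match_evidence_ok,
        "price_source_approved": case.price_source_approved,
        "no_blocking_defect": case.blocking_defects == 0,
        "trace_retained": case.trace_retained,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical case with one unresolved conflict between document revisions.
case = TenderCase("radiators", True, [], True, 1, True, True, 0, True)
failures = eligibility_failures(case)
route = "gated_processing" if not failures else "human_review"
```

Returning the failed gate names, not just a boolean, is what makes the reason for every block legible in the reviewer's exception queue.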

A wrong product or price can create margin, delivery, contractual, and customer-trust exposure. If an offer may create a binding commitment, keep human approval until the appropriate commercial and legal owners have defined the policy for automatic dispatch. The safe alternative is to automate preparation while preserving approval at the commitment boundary.

Operational controls matter after launch. Give reviewers a visible exception queue, make the reason for every block legible, preserve manual processing when the AI path is unavailable, and provide a way to suspend autonomous actions without disabling access to already processed cases. Assign owners for catalog quality, pricing validity, tender policy, model behavior, and production reliability; otherwise each exception will bounce between teams.

Expansion should follow evidence and customer pull. A request to replace an existing CPQ system is a meaningful product-market signal, but it is also a change in product scope. CPQ replacement introduces responsibilities for quote versions, approval policy, catalog administration, pricing governance, integrations, and records. Treat that request as a roadmap decision, not as proof that the original agent workflow already covers those capabilities.

Key takeaways

Start with one product category, one known workflow, and reviewers who can explain the correct decision.
Optimize first for a review-ready tender case, not an impressive answer from a general chatbot.
Use deterministic systems for exact operations and specialist agents for interpretation, planning, and exceptions.
Require structured outputs, source evidence, explicit completion tests, and an abstain-or-escalate path from every agent.
Evaluate each stage separately, then use end-to-end metrics to measure the operational outcome.
Increase autonomy only when observable eligibility gates protect the next commercial boundary.

At your next roadmap review, put one representative tender on the screen and draw the decision path from email to offer. Name the owner, evidence, pass condition, and exception path for every node. Wherever those are missing, the next task is workflow discovery, not another agent. Once the graph is explicit, you can automate one bounded decision, measure it, and earn the right to automate the next.

References

Shivam.Consulting Blog – From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

January 15, 2026

AI Product Governance: A Practical Operating Model for PMs

Your AI feature has passed the demo. Customers want it, leadership wants a date, and the team believes the remaining risks can be handled before launch. The problem is that nobody can state what evidence would make the feature safe enough to release – or who can stop it when that evidence is missing.

This is where AI ethics has to become product governance. You need a repeatable way to classify risk, set release conditions, assign decision rights, test safeguards, and respond when production behavior differs from the demo. The goal is not to eliminate uncertainty. It is to make uncertainty visible and govern the consequences.

Start with a release contract, not a list of principles

Principles such as fairness, transparency, privacy, and safety matter, but they do not tell a team whether Friday’s build should ship. A release decision needs observable conditions. That requires putting the intended outcome and its ethical constraints in the same product brief.

For each AI capability, write a short release contract before implementation begins. It should answer:
1. What decision or task is the product helping with? Describe the user outcome, not the model output. Generating a response is an output; helping a support agent resolve a request accurately is an outcome.
2. What must the system never do? Name unacceptable behavior such as exposing restricted data, presenting unsupported claims as facts, acting without required confirmation, or concealing that AI influenced an outcome.
3. Who can be affected? Include people represented in the data, people discussed in generated content, employees asked to rely on the output, and anyone subject to a downstream decision.
4. How consequential is a wrong result? Separate an inconvenient suggestion from an output that can affect access, money, employment, safety, privacy, or another difficult-to-reverse outcome.
5. What evidence is required to ship? Tie every material risk to an evaluation, control, review, or operational test. Avoid release criteria such as “reasonable quality” or “adequate safeguards”; two reviewers can interpret those phrases differently.
6. What will stop or reverse the feature? Define the conditions for disabling an action, reverting a version, narrowing availability, or returning the workflow to human handling.

Treat these conditions as part of the acceptance criteria. If a trust condition fails, the feature has not passed release readiness even when its primary quality metric looks strong. That keeps ethical constraints from becoming optional work negotiated away at the end of the schedule.
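
One way to make the six questions hard to skip is to store the contract as a structured record and list whatever is still blank. The sketch below is illustrative only; `ReleaseContract`, its field names, and `release_ready` are hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ReleaseContract:
    """Hypothetical record with one field per question in the list above."""
    intended_outcome: str = ""
    never_do: list[str] = field(default_factory=list)
    affected_parties: list[str] = field(default_factory=list)
    consequence_of_error: str = ""
    required_evidence: dict[str, str] = field(default_factory=dict)  # risk -> evidence
    stop_conditions: list[str] = field(default_factory=list)


def open_items(contract: ReleaseContract) -> list[str]:
    """Names of the questions that have no answer yet."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


def release_ready(contract: ReleaseContract,
                  trust_conditions_passed: bool,
                  quality_metric_passed: bool) -> bool:
    """A strong quality metric cannot compensate for a failed trust condition."""
    return (not open_items(contract)
            and trust_conditions_passed
            and quality_metric_passed)
```

Keeping the trust conditions and the quality metric as separate inputs means neither can be averaged away by the other.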

Classify the use case by consequence, autonomy, and reversibility

A model does not have one fixed risk level. The same underlying model can draft a headline, recommend an account action, or execute that action. Governance should therefore follow the use case rather than the model name.

A practical classification starts with three questions:
- Consequence: What happens if the output is wrong, biased, misleading, or disclosed to the wrong person?
- Autonomy: Does the system inform a person, recommend a decision, or take the action itself?
- Reversibility: Can the affected person notice the result, challenge it, and restore the prior state without disproportionate effort?

Use those answers to choose a product path. A reviewable drafting aid may rely on disclosure, editing controls, standard evaluations, and ordinary monitoring. A consequential recommendation needs stronger evidence, an accountable human reviewer, and a clear appeal or correction path. An autonomous, hard-to-reverse action should not launch until the team can justify the autonomy, constrain permissions, require confirmation where appropriate, and demonstrate a reliable override.
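
The three questions can be turned into an explicit routing rule so that two teams classifying similar features reach the same path. The mapping below is an assumption made for illustration: the tier names (`standard`, `elevated`, `restricted`) and the exact rules are hypothetical, while the controls attached to each tier come from the paragraph above.

```python
from enum import Enum


class Consequence(Enum):
    LOW = "low"
    HIGH = "high"


class Autonomy(Enum):
    INFORM = "inform"
    RECOMMEND = "recommend"
    ACT = "act"


# Controls per path, taken from the text; tier names are illustrative.
CONTROLS = {
    "standard": ["disclosure", "editing controls", "standard evaluations",
                 "ordinary monitoring"],
    "elevated": ["stronger evidence", "accountable human reviewer",
                 "appeal or correction path"],
    "restricted": ["justified autonomy", "constrained permissions",
                   "confirmation where appropriate", "demonstrated override"],
}


def product_path(consequence: Consequence, autonomy: Autonomy,
                 reversible: bool) -> str:
    """Pick the most demanding path that any single answer calls for."""
    if autonomy is Autonomy.ACT and not reversible:
        return "restricted"
    if (consequence is Consequence.HIGH
            or autonomy is Autonomy.ACT
            or not reversible):
        return "elevated"
    return "standard"
```

Because the function takes the use case as input and never the model name, the same model can land on different paths for different features.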

Do not confuse a human in the workflow with meaningful human oversight. A person who lacks context, time, authority, or a usable way to reject the output is functioning as a rubber stamp. For higher-risk actions, the reviewer needs the evidence behind the recommendation, a clear indication of uncertainty or limitations, and the authority to choose a non-AI path.

Record the classification in an AI risk register. Each entry should contain the risk scenario, affected parties, possible impact, warning signals, preventive control, detection method, response, owner, required evidence, residual risk, and the person authorized to accept that residual risk. A model defect belongs in the backlog; a plausible future failure belongs in the risk register; a failure already affecting users belongs in incident management. Keeping those states distinct prevents serious risks from disappearing into a generic bug queue.

Likelihood will often be uncertain before production. Do not turn that uncertainty into a convenient low-risk label. Record what is unknown, how the team will test it, and which production signal will cause a review. For a consequential or difficult-to-reverse feature, I would also separate the person implementing the control from the person accepting the remaining risk.
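
A register entry can encode these rules directly, so that an unknown likelihood cannot quietly become a low one. The following sketch uses a trimmed subset of the fields listed earlier; `RiskEntry`, `IssueState`, and `review_findings` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IssueState(Enum):
    MODEL_DEFECT = "model_defect"
    PLAUSIBLE_FUTURE_FAILURE = "plausible_future_failure"
    FAILURE_AFFECTING_USERS = "failure_affecting_users"


# Each state has exactly one home, as described in the text.
DESTINATION = {
    IssueState.MODEL_DEFECT: "backlog",
    IssueState.PLAUSIBLE_FUTURE_FAILURE: "risk_register",
    IssueState.FAILURE_AFFECTING_USERS: "incident_management",
}


@dataclass
class RiskEntry:
    scenario: str
    control_owner: str                # person implementing the control
    risk_acceptor: str                # person accepting the residual risk
    likelihood: Optional[str] = None  # None means "unknown", never "low"
    test_plan: str = ""
    review_signal: str = ""


def review_findings(entry: RiskEntry, consequential: bool) -> list[str]:
    """Return the gaps a governance review should raise for this entry."""
    findings = []
    if entry.likelihood is None and not (entry.test_plan and entry.review_signal):
        findings.append("unknown likelihood needs a test plan and a review signal")
    if consequential and entry.control_owner == entry.risk_acceptor:
        findings.append("control owner and risk acceptor should be different people")
    return findings
```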

Turn governance into four evidence-based release gates

A governance meeting should inspect evidence, not collect reassuring opinions. Four gates cover the path from data collection to production response. The depth of each gate should match the use-case classification.

Data gate: prove that the inputs are governed

Trust problems often begin before a prompt reaches the model. The data gate should make the full path of customer and organizational data inspectable.
- Document what data is collected, where it came from, why it is needed, and which product purpose it serves.
- Identify the applicable basis for processing and make consent flows explicit where consent is used. Legal requirements depend on the product, data, and jurisdiction, so product teams should validate this with qualified privacy and legal partners rather than infer an answer from a generic checklist.
- Remove fields that are not needed for the stated outcome. Data minimization reduces both privacy exposure and the number of inputs that can produce unexpected behavior.
- Map data lineage across ingestion, retrieval, model calls, logs, analytics, support tools, and vendors. A deletion promise is not credible if the team cannot locate every copy.
- Apply role-based access to raw inputs, retrieved context, generated outputs, and operational logs. Access to the application should not automatically imply access to all AI interaction data.
- Set retention and deletion rules, then test that they work across the full data path rather than only in the primary database.

The gate passes when the team can trace an input, explain its permitted use, name who can access it, and show how it is removed. A policy document without an enforceable data path is not sufficient evidence.
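
A deletion test across the full data path can be as simple as asking every store whether it still holds the record. This is a toy sketch under a strong assumption: each store can be queried for the record IDs it contains. The store names and the `remaining_copies` helper are hypothetical.

```python
def remaining_copies(record_id: str, stores: dict[str, set[str]]) -> list[str]:
    """List every store that still holds the record after a deletion request."""
    return sorted(name for name, ids in stores.items() if record_id in ids)


# Illustrative state after deleting "r1" from the primary database only.
stores_after_delete = {
    "primary_db": set(),
    "retrieval_index": set(),
    "model_call_logs": {"r1"},
    "analytics": {"r1", "r2"},
    "vendor_cache": set(),
}

leftovers = remaining_copies("r1", stores_after_delete)
```

In this example `leftovers` names the logs and analytics stores, which is the kind of gap a check limited to the primary database would miss.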

Model gate: test the failures that matter to the use case

Do not ask whether the model is good. Ask whether the complete product system performs acceptably under the conditions in which customers will use it. Eval-driven development makes quality, safety, bias, and robustness testable release concerns instead of post-launch aspirations.
- Map every important risk in the register to an evaluation. If a risk has no test, state which manual review or production control provides the evidence instead.
- Define the passing condition before reviewing final results. Moving a threshold after seeing a disappointing result turns a gate into a negotiation.
- Test normal requests, ambiguous requests, edge cases, adversarial prompts, and realistic multi-step interactions. A polished set of happy-path prompts will not expose operational failure modes.
- Compare performance across the user groups and contexts relevant to the product. Aggregate quality can conceal a meaningful gap affecting a smaller group.
- Red-team prompts, retrieved context, tool use, and permission boundaries. For an agentic workflow, the safety of the text is only one part of the problem; the allowed action is another.
- Keep the evaluation set and results tied to the model, prompt, retrieval configuration, tools, and policy version that produced them. Otherwise, a passing report can outlive the system it evaluated.

When an LLM must answer from known organizational information, a retrieval-first pipeline can ground the response in authoritative material. It does not remove the need for evaluation. Test missing documents, conflicting documents, stale content, access-restricted content, and questions the knowledge base cannot answer. The safe behavior may be to abstain, ask for clarification, or route the task to a person.
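
Two of the points above, fixing the passing condition in advance and checking each user group separately, fit in a few lines. The metric names and threshold values in this sketch are invented for illustration; only the pattern is the point.

```python
from types import MappingProxyType

# Frozen before final results are reviewed. Names and values are illustrative.
THRESHOLDS = MappingProxyType({
    "groundedness": 0.95,
    "abstains_when_unanswerable": 0.90,
})


def failing_checks(results_by_group: dict[str, dict[str, float]]) -> list[tuple[str, str]]:
    """Return (group, metric) pairs below threshold; a missing score fails."""
    failures = []
    for group, scores in sorted(results_by_group.items()):
        for metric, minimum in THRESHOLDS.items():
            score = scores.get(metric)
            if score is None or score < minimum:
                failures.append((group, metric))
    return failures


results = {
    "all_users": {"groundedness": 0.97, "abstains_when_unanswerable": 0.93},
    "small_segment": {"groundedness": 0.88, "abstains_when_unanswerable": 0.93},
}
gate_failures = failing_checks(results)
```

Here the aggregate row passes while the smaller segment fails on groundedness, so the gate fails. Treating a missing score as a failure keeps an untested metric from passing by default.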

Experience gate: help users exercise judgment and control

Disclosure is useful only when it changes what a person can understand or do. Place it near the AI-assisted decision, in plain language, and explain the limitation that matters in that moment. A broad statement hidden in terms and conditions does not help a user assess a specific output.
- Make it clear when AI generated, transformed, recommended, or acted on information.
- Let users inspect, edit, reject, or correct an output before a consequential action where that control is meaningful.
- Separate generated content from verified facts in the interface. Do not use confident UX writing to imply certainty the system cannot support.
- Explain what data the feature needs and what changes when the user turns it off.
- Provide a non-AI or human-assisted path when the AI path is unsuitable for the task.
- Test whether users understand the system’s role. A control that exists but cannot be found or understood is not an effective safeguard.

Match the amount of friction to the consequence. Requiring confirmation for every low-impact suggestion can train users to click through automatically. For a high-impact or hard-to-reverse action, the extra pause may be the safeguard that preserves meaningful control.

Operations gate: demonstrate that failure can be contained

Pre-launch evaluations cannot cover every production context. The operations gate determines whether the team can detect, contain, and learn from behavior that escaped testing.
- Monitor model behavior and customer impact. Technical availability can look healthy while unsupported outputs, harmful actions, or repeated user corrections are increasing.
- Assign an owner and response for each alert. An unowned dashboard is visibility without control.
- Create a kill switch or permission cutoff for risky actions, plus a rollback path for model, prompt, retrieval, and tool changes.
- Test the rollback under realistic access and dependency conditions. A safeguard that nobody has exercised may fail during the incident it was meant to contain.
- Prepare an incident playbook covering triage, containment, evidence preservation, affected-user assessment, communication, recovery, and the decision to restore service.
- Keep a human override for high-risk actions and verify that the operator can use it without depending on the failing AI path.

This gate passes when the team can answer three questions without improvising: How will the failure be detected? Who can stop it? What evidence is required before it is turned back on?
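
A kill switch is easier to exercise when it is a small, explicit object rather than a configuration edit. The sketch below assumes a single in-process flag; a real deployment would more likely use a shared flag service, and every name here (`ActionSwitch`, `handle_case`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ActionSwitch:
    """Hypothetical permission cutoff for one risky autonomous action."""
    enabled: bool = True
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def suspend(self, operator: str, reason: str) -> None:
        self.enabled = False
        self.audit_log.append(("suspend", operator, reason))

    def restore(self, operator: str, evidence: str) -> None:
        # Turning the action back on requires recorded evidence.
        if not evidence:
            raise ValueError("restoring service requires recorded evidence")
        self.enabled = True
        self.audit_log.append(("restore", operator, evidence))


def handle_case(case_id: str, switch: ActionSwitch,
                ai_action: Callable[[str], str],
                manual_queue: list[str]) -> str:
    """Route to human handling whenever the autonomous path is suspended."""
    if not switch.enabled:
        manual_queue.append(case_id)
        return "routed_to_human"
    return ai_action(case_id)
```

The manual path does not call the AI action at all, so the override keeps working while the AI path is failing. The audit log records who stopped the action and what evidence justified turning it back on.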

Assign decision rights across the product lifecycle

Governance slows teams when everyone can raise concerns but nobody knows who decides. Put decision rights beside the risk register and release gates.
- Product: owns the intended outcome, use-case classification, release contract, customer trade-offs, and completeness of the risk register.
- Engineering and data: produce evidence for system behavior, data lineage, access controls, evaluations, technical constraints, and remediation.
- Design and research: verify disclosure, comprehension, correction, appeal, and user control in the actual workflow.
- Security and privacy: examine access, abuse paths, data handling, vendor exposure, and response controls.
- Legal and compliance: interpret applicable obligations and identify where a product decision creates legal exposure. Product leaders should bring these partners in while choices are still reversible.
- SRE and operations: own observability, alerting, rollback mechanics, incident readiness, and production recovery with the product team.
- Executive risk owner: accepts material residual risk when the decision exceeds the product team’s authority and ensures that the required mitigation has resources.

The review itself should be a decision forum, not a status meeting. Send the release contract, risk register, failed and passed evaluations, unresolved questions, and requested decision in advance. End with one of four outcomes: approved, approved with explicit conditions, returned for more evidence, or rejected. Record the rationale and the event that will trigger another review.

Apply the same discipline to purchased models and AI services. A vendor can operate part of the stack, but it cannot absorb your accountability to customers. Due diligence should cover model provenance, data use and retention, access, evaluation evidence, incident history, change notification, and subcontracted dependencies. Contracts should carry operational commitments such as service levels, deletion obligations, audit rights, and incident responsibilities into the vendor relationship.

If a vendor cannot answer a material question, record the item as unknown. Do not silently translate missing evidence into low risk. Decide whether a compensating control – limited data, narrower permissions, independent evaluation, or a manual workflow – makes the unknown acceptable. If not, change the design or supplier.

Treat launch approval as a monitored, reversible decision

Approval should attach to a defined system configuration and use case, not to the feature name forever. A model change, system-prompt change, new retrieval corpus, broader user group, expanded data access, new tool permission, or shift from recommendation to autonomous action can invalidate earlier evidence. Put those change triggers in the original approval.
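
Change triggers are easiest to enforce when the approval stores the configuration it covered. In this sketch the field names follow the triggers listed above, and the values are invented; `changed_triggers` and `approval_still_valid` are hypothetical helpers.

```python
# Fields whose change should reopen the approval (from the list above).
APPROVAL_FIELDS = (
    "model",
    "system_prompt_version",
    "retrieval_corpus",
    "user_group",
    "data_access",
    "tool_permissions",
    "autonomy",
)


def changed_triggers(approved: dict, current: dict) -> list[str]:
    """Return every approval field whose value differs from the approved one."""
    return [name for name in APPROVAL_FIELDS
            if approved.get(name) != current.get(name)]


def approval_still_valid(approved: dict, current: dict) -> bool:
    return not changed_triggers(approved, current)


approved_config = {
    "model": "model-a",                 # illustrative values only
    "system_prompt_version": "v3",
    "retrieval_corpus": "help-center-2026-01",
    "user_group": "internal-support",
    "data_access": ("tickets",),
    "tool_permissions": ("read",),
    "autonomy": "recommend",
}
live_config = dict(approved_config,
                   tool_permissions=("read", "write"),
                   autonomy="act")
reopened_by = changed_triggers(approved_config, live_config)
```

In this example the new tool permission and the move from recommendation to action both invalidate the earlier approval, and the returned list tells the reviewers which evidence needs to be refreshed.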

Launch with the smallest exposure that can produce useful operational evidence. Watch model-quality signals alongside user corrections, overrides, complaints, unexpected actions, access violations, and downstream customer impact. Set an owner and response for each signal before rollout. Waiting for a broad satisfaction metric to move can leave a concentrated harm hidden inside an apparently successful launch.

Customer trust also depends on what you reveal outside the internal review. A customer-facing trust center can publish the AI system’s role, material limitations, relevant data practices, available controls, change history, and a path for reporting problems. Model facts, limitations, and change logs make responsible operation visible. Candor about a boundary is more useful than a vague claim that the system is responsible or safe.

Key takeaways
- Govern the use case, not the model in isolation. Consequence, autonomy, and reversibility determine the controls you need.
- Pair every success metric with an unacceptable outcome and observable release condition.
- Use one living risk register to connect risk scenarios, evidence, owners, safeguards, residual risk, and review triggers.
- Require evidence across data, model behavior, user experience, and production operations before release.
- Treat human oversight as a designed capability. The reviewer needs context, time, authority, and a usable alternative.
- Carry governance into vendor selection, contracts, monitoring, incident response, and material system changes.

Take one AI item from your current roadmap and write its release contract before the next planning or governance meeting. Name the intended decision, unacceptable outcomes, affected people, required evidence, stop conditions, and accountable risk owner. Any blank you cannot fill is not paperwork still to complete. It is product work you have found before customers find it for you.

References
- Product School – AI Ethics That Win Trust: The Product Manager’s Playbook for Safe, Scalable Innovation

January 15, 2026

New Year, New Product Habits: AI Workflows, Coaching Culture, and Community in 2026

Happy New Year! I’m kicking off 2026 with a behind-the-scenes look at what’s changing in my product practice, the experiments I’m running with my teams at HighLevel, and the trends I’m most energized by—especially around continuous discovery, AI workflows, and building stronger coaching cultures.

If you want to listen to the conversation that sparked many of these reflections, you can find it here: Spotify | Apple Podcasts.

Why Teresa sunset the live deep-dive cohorts—and how on-demand and the new Discovery Habits Toolbox better support real behavior change. This pivot resonated with my own experience: some skills, especially discovery habits, only stick when they’re reinforced in the flow of real product work, not just in a time-boxed cohort. In my org, we’re leaning into on-demand learning paired with manager coaching to drive durable behavior change.

What leaders actually need to coach interviewing, assumption testing, and core discovery habits inside their orgs. I’ve found that empowered product teams thrive when leaders have lightweight coaching tools, practical prompts, and clear expectations for product trios. This is less about one-off training and more about building communities of practice where deliberate practice and feedback loops become routine.

Why training is shifting toward ongoing, leader-supported learning (and how AI will accelerate the shift). AI strategy is about learning systems, not just tools. For LLMs to give product managers real leverage, we need eval-driven development, privacy-by-design, and clear guardrails. I’m building AI workflows that enable managers to review interviews, spot anti-patterns, and nudge teams toward better decisions without replacing critical thinking.

Teresa’s move into paid subscriptions and why AI content doesn’t fit the classic “design once, run for years” course model. I see the same reality in my content roadmap: the half-life of AI guidance is short. That pushes us toward subscription models, tighter feedback loops, and a more adaptive go-to-market strategy for education products.

A sneak peek into the AI tools Teresa is building for discovery work—from interview coaching to near-ready interview snapshot generation. I’m particularly excited by tooling that scaffolds better interviews, sharpens assumption testing, and speeds up synthesis without skipping the human judgment step. These capabilities map directly to where I want my teams investing time: spending less energy on admin and more on learning from customers.

Petra’s plans for the year: community building with Product at Heart, a new product leadership email course, her Product Leadership Wheel, and workshops launching in Cairo. As someone who believes in conferences as high-quality “energy wells,” I’m inspired by how these programs create momentum for leaders who are upgrading their coaching muscles.

The role of conferences and retreats in staying grounded, inspired, and connected. I treat these gatherings as strategic resets—spaces to test ideas, confront blind spots, and deepen my network for future collaboration. The best outcomes often come from serendipitous hallway conversations and hands-on sessions where you can pressure test frameworks with peers.

How Teresa is staying on top of academic research (and why “synthetic users” aren’t ready for prime time). I agree: while synthetic data can be useful for scaffolding, it’s not a substitute for direct customer contact. Combine academic rigor with real-world interviewing and strong data governance—especially when operating under General Data Protection Regulation (GDPR).

The shared challenge of evaluating vendors and conference speakers making questionable AI claims. My heuristic: ask for clear problem statements, reproducible evaluations, grounded benchmarks, and a path to safe deployment. If a pitch can’t show measurable uplift or ignores compliance, it’s not ready for empowered product teams.

Key takeaways I’m carrying into 2026: delivery models matter; leaders need coaching tools, not just training; AI is reshaping how we teach and learn; experimentation is the theme of 2026; and community still energizes. That’s the blueprint I’m using to strengthen continuous discovery, refine our AI workflows, and sustain high standards in product management leadership.

What about you? How are you integrating AI workflows into your discovery practice, and what coaching tools are helping your managers reinforce the right habits? Share your approach—I’d love to learn what’s working in your context.

Resources & Links:

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Teresa’s website: Product Talk

General Data Protection Regulation (GDPR)

Product Talk Academy

Deliberate Practice – ATP episode where Teresa talked about ending the live cohorts for Deep Dive classes

Teresa’s Discovery Habits Toolbox program

Petra’s A 52-Week Transformation Journey

Teresa’s Product Talk subscriptions (AI workflows + discovery content)

Claude Code

The Interview Coach by Teresa

Product at Heart Conference (Hamburg)

Petra’s Coaching Packages

Petra’s Ways We Can Work Together

Petra’s Product Leadership Wheel (PLwheel)

Petra’s Product Manager (PMwheel)

Prdkt+ MENA Product Summit 2026

World Beautiful Business Forum by House of Beautiful Business

Melissa Suzuno

Vistaly (Teresa’s integration partner for some upcoming AI tools)

Teresa’s Just Now Possible podcast

Inspired by this post on Product Talk.

January 13, 2026

11 Product Management Shifts Redefining 2026: Actionable Signals from Top Leaders

2026 is closer than it feels, and the signals are already clear. I’ve been synthesizing what I’m seeing across empowered product teams, boards, and cross-functional partners into a practical view of what matters next. This is a sharp look at product management trends for 2026: not guesses, but signals from top product leaders shaping how PMs will actually work next.

In this analysis, I distill eleven shifts that are changing the craft, from OKRs framed around outcomes rather than output and continuous discovery to stronger product strategy and tighter roadmapping and sprint planning. The throughline is simple: prioritize customer value, ship with focus, and measure what moves the business. These aren’t headline trends; they’re working patterns I’m seeing across high-performing organizations.

AI is no longer a side project—it’s part of the product manager’s core toolkit. Agentic AI, LLMs for product managers, and trustworthy AI workflows are accelerating discovery, sharpening problem framing, and enabling faster iteration. The best teams pair this with disciplined evaluation and experimentation, so insight compounds without sacrificing safety, privacy, or product quality.

Execution is getting crisper through product trios and stronger stakeholder management. When design, product, and engineering co-own discovery and delivery, teams reduce handoffs and increase clarity. That alignment translates into better prioritization, fewer context-switches, and a roadmap that reflects real trade-offs—not wish lists.

On growth, product-led growth remains a durable engine when it’s anchored in a compelling value proposition and instrumented end-to-end. Clear activation moments, in-app guides, and thoughtful product tours outperform brute-force acquisition. When we connect these motions back to product strategy and the roadmap, we create a repeatable loop that compounds adoption and retention.

Governance and trust are now table stakes. Privacy-by-design, data governance, and a pragmatic approach to regulatory compliance protect both users and velocity. Teams that build these practices into their operating model move faster because they avoid late-stage rework and maintain stakeholder confidence.

If you’re leading a product org—or aspiring to—this is your field guide to 2026. I’ll unpack where these shifts are strongest, how to apply them in your context, and the pitfalls to avoid. The aim is to give you clear language, concrete practices, and a sharper edge as you shape what your team builds next.

Inspired by this post on Product School.

January 12, 2026
