
How to Evaluate a Shopify-Native AI Shopping Agent

You’ve probably been asked a deceptively simple question: should your Shopify store add an AI shopping agent? The hard part isn’t installing another chat widget. It is deciding whether the agent can help an uncertain shopper choose correctly without recommending unavailable products, misreading policy, or making a costly order change.

Treat this as a commerce-system decision, not a chatbot decision. A useful agent must connect conversation, live product data, cart behavior, checkout, and post-purchase service. The evaluation framework below will help you separate a persuasive demo from a system you can trust with customers and revenue.

The native test: can the agent read, reason, and act?

“Shopify-native” should describe an architecture, not a distribution channel. Being listed in an app marketplace or embedded in a storefront does not make an agent native. The meaningful test is whether Shopify remains the operational source of truth while the agent uses its data and APIs in the customer’s current context.

A concrete implementation shows how high that bar can be: a Shopify connection can expose products, variants, content, live inventory, order data, policies, and transactional APIs to the same customer-facing agent. That combination matters because a correct product description is still a bad answer if the relevant size is unavailable, and an accurate return policy is insufficient if the customer must start over somewhere else to use it.

I would evaluate an agent at four capability levels. A product that stops at the first level may still be useful, but it should not be presented internally as an autonomous commerce agent.

Capability	What the agent needs	What you should ask it to prove
Answer	Product content, store policies, and current catalog facts	Answer a precise product or policy question and identify the relevant item, variant, or rule
Recommend	Catalog relationships, inventory, conversation context, and the shopper’s constraints	Turn an ambiguous request into a short, reasoned shortlist instead of returning generic search results
Transact	Cart or order APIs, authentication, permissions, and confirmation controls	Update a test cart or prepare an order change while showing exactly what will happen before execution
Recover	Shared state across shopping, service, and human escalation	Resolve a support interruption and resume the customer’s original shopping task without asking for the same context again

Freshness is part of correctness. During a test, change the availability of a variant in Shopify and repeat the same shopping request. The agent should stop recommending that variant once the change is reflected in the connected system. Run a similar test with a policy update. A polished answer based on yesterday’s state is not a small quality defect; it can create a promise your operations team must later unwind.
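The availability half of that test can be scripted. The sketch below is a minimal illustration, not a vendor API: the `Recommendation` type, the `availability` mapping, and the variant IDs are hypothetical stand-ins for whatever your test harness captures from the agent and from the store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    variant_id: str
    reason: str


def stale_recommendations(recommendations, availability):
    """Return recommendations whose variant is not currently available.

    `availability` maps variant_id -> bool (hypothetical shape).
    Variants missing from the mapping are treated as unavailable.
    """
    return [r for r in recommendations if not availability.get(r.variant_id, False)]


# State before and after the tester marks one variant out of stock.
before = {"tee-blue-m": True, "tee-blue-l": True}
after = {**before, "tee-blue-m": False}

recs = [
    Recommendation("tee-blue-m", "matches size and color"),
    Recommendation("tee-blue-l", "closest available size"),
]

print(stale_recommendations(recs, before))                          # []
print([r.variant_id for r in stale_recommendations(recs, after)])   # ['tee-blue-m']
```

In a real run, `recs` would be captured from the agent after the change has had time to propagate; any flagged item is a freshness failure.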

Actions deserve an even stricter test. Ask the vendor to demonstrate the complete chain: customer identification, authorization, interpretation of the request, action preview, explicit confirmation, API execution, and a visible result. If any step is simulated, ask which one. A fast setup can reduce implementation effort, but it does not prove that the agent is accurate, observable, or safe.

Design the shopping dialogue around decisions, not keywords

Traditional ecommerce search works best when the customer already knows the product vocabulary. A shopping agent earns its place when the request is incomplete: a gift for a partner, a mattress for a particular sleep preference, or shoes that must work across road and trail conditions. The agent’s first job is not to produce an answer. It is to discover which answer would be useful.

A strong product-discovery dialogue follows a repeatable decision sequence:

1) Restate the customer’s job in plain language so a misunderstanding becomes visible early.
2) Identify hard constraints first, such as an in-stock variant, required use case, compatibility, budget boundary, or delivery requirement.
3) Ask only for information that could change the recommendation. A question that does not alter ranking or eligibility is conversational overhead.
4) Present a small shortlist and tie each option to the constraints the customer supplied.
5) Explain the meaningful tradeoff between the options instead of declaring a universal winner.
6) Offer the next useful action: compare details, select a variant, update the cart, or continue narrowing the choice.
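The sequence implies an order of operations: eligibility first, ranking second. Here is a minimal sketch of that idea, assuming a hypothetical `Variant` shape and using price as a placeholder ranking signal. A real agent would rank on fit to the stated need.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    sku: str
    price: float
    in_stock: bool
    use_cases: frozenset


def shortlist(variants, required_use, max_price: Optional[float] = None, limit=3):
    """Apply hard constraints first, then rank what remains."""
    eligible = [
        v for v in variants
        if v.in_stock
        and required_use in v.use_cases
        and (max_price is None or v.price <= max_price)
    ]
    # Placeholder ranking: cheapest first.
    return sorted(eligible, key=lambda v: v.price)[:limit]


# Illustrative catalog; SKUs and prices are made up.
catalog = [
    Variant("road-a", 90.0, True, frozenset({"road"})),
    Variant("hybrid-b", 130.0, True, frozenset({"road", "trail"})),
    Variant("trail-c", 150.0, False, frozenset({"trail"})),
    Variant("trail-d", 170.0, True, frozenset({"trail"})),
]

print([v.sku for v in shortlist(catalog, "trail", max_price=140)])  # ['hybrid-b']
print([v.sku for v in shortlist(catalog, "trail")])                 # ['hybrid-b', 'trail-d']
```

Removing the budget constraint changes the shortlist, and the out-of-stock `trail-c` never appears regardless of how well it would otherwise match.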

This sequence turns conversation design into a product requirement. For every recommended item, your evaluation record should capture the customer’s stated need, the product facts used, the reason the item fits, the tradeoff disclosed, current variant availability, and the next question or action. That record gives your team something concrete to inspect when a recommendation is challenged.
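One way to keep that record inspectable is to give it a fixed shape and flag incomplete entries. This is a sketch only; the field names are assumptions chosen to mirror the six items above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RecommendationRecord:
    stated_need: str = ""
    product_facts: tuple = ()
    fit_reason: str = ""
    tradeoff: str = ""
    variant_available: Optional[bool] = None  # None means "not recorded"
    next_step: str = ""


def missing_fields(record):
    """Names of fields that were left empty or unrecorded."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        is_empty = hasattr(value, "__len__") and len(value) == 0
        if value is None or is_empty:
            missing.append(f.name)
    return missing


record = RecommendationRecord(
    stated_need="daily trail running",
    product_facts=("lugged outsole", "rock plate"),
    fit_reason="built for uneven terrain",
    variant_available=False,  # recorded, even though the answer is "no"
    next_step="offer an available alternative",
)
print(missing_fields(record))  # ['tradeoff']
```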

Test whether the reasoning is responsive rather than decorative. Change one important answer while holding the rest of the conversation constant. If the customer switches from occasional use to daily use, removes a budget constraint, or requires an available color, the shortlist should change when that information is material. If the products remain identical regardless of the customer’s answers, the experience is probably search with conversational packaging.

The agent should also know when not to narrow further. Once the customer has enough information to choose, another question adds friction. Conversely, confidence should not be manufactured when catalog data cannot resolve the request. A safe response identifies the missing fact, asks for clarification, or hands the conversation to a person with the constraints already summarized.

Product cards can accelerate the final step, but the interface should preserve the reasoning that produced them. An image, name, and price answer “what is this?” The conversation must also answer “why does this fit me?” That is the difference between displaying inventory and assisting a decision.

Make shopping and support one customer state machine

A shopper does not experience your sales and support departments as separate funnels. The same person can compare products, ask about shipping, check an existing order, correct a variant, and return to buying in one session. Routing each intent to a separate tool forces the customer to reconstruct context at every boundary.

Model the journey as one state machine: discover, decide, transact, service, and resume. The agent can move between those states, but it should retain the customer’s goal, constraints, products considered, cart state, relevant order, completed actions, and unresolved question. That shared state is more important than whether the organization labels the current message “sales” or “support.”
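A minimal sketch of that shared state follows. It assumes "resume" is modeled as a transition back to the interrupted state rather than as a state of its own; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    DISCOVER = "discover"
    DECIDE = "decide"
    TRANSACT = "transact"
    SERVICE = "service"


@dataclass
class Session:
    goal: str
    constraints: dict = field(default_factory=dict)
    considered: list = field(default_factory=list)
    state: State = State.DISCOVER
    interrupted_from: Optional[State] = None

    def interrupt_for_service(self):
        """Enter service while remembering where shopping left off."""
        if self.state is not State.SERVICE:
            self.interrupted_from = self.state
            self.state = State.SERVICE

    def resume(self):
        """Return to the interrupted shopping state, context intact."""
        if self.state is State.SERVICE and self.interrupted_from is not None:
            self.state = self.interrupted_from
            self.interrupted_from = None


session = Session(goal="trail shoes", constraints={"size": "10"})
session.state = State.DECIDE
session.interrupt_for_service()   # customer asks about an existing order
session.resume()                  # support question resolved
print(session.state, session.constraints)
```

The point of the design is that `goal`, `constraints`, and `considered` live on the session, not on the state, so a detour through service cannot erase them.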

This is where a connected agent can do more than answer FAQs. Current Shopify-oriented implementations can handle tracking, returns, exchanges, refunds, order changes, shipping questions, and subscription updates through connected procedures and APIs. Each additional action increases usefulness, but it also increases the consequence of a misunderstanding.

Use different controls for different action classes:

Read-only actions, such as showing order status, should still require appropriate customer identification but do not change commercial state.
Reversible shopping actions, such as adding or removing a cart item, should be immediately visible and easy for the customer to undo.
Financially consequential actions, including refunds, paid order changes, and subscription updates, should require authentication, an exact action summary, explicit confirmation, and a durable result or receipt.
Ambiguous or unsupported actions should stop safely and transfer to a person. The agent must not treat conversational enthusiasm, silence, or an inferred preference as consent.
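These classes can be expressed as a small policy table that is checked before any tool call. The sketch below is illustrative: the control names are hypothetical, and treating reversible cart actions as requiring no prior control is an assumption you may want to tighten.

```python
from enum import Enum


class ActionClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    FINANCIAL = "financial"
    UNSUPPORTED = "unsupported"


# Controls that must be satisfied BEFORE execution (assumed names).
# The durable receipt for financial actions is produced after execution.
REQUIRED = {
    ActionClass.READ_ONLY: {"identified"},
    ActionClass.REVERSIBLE: set(),
    ActionClass.FINANCIAL: {"authenticated", "summary_shown", "explicit_confirmation"},
}


def decide(action_class, satisfied):
    """Return 'execute', 'handoff', or a description of the missing controls."""
    if action_class is ActionClass.UNSUPPORTED:
        return "handoff"
    missing = REQUIRED[action_class] - set(satisfied)
    if missing:
        return "need: " + ", ".join(sorted(missing))
    return "execute"


print(decide(ActionClass.READ_ONLY, {"identified"}))      # execute
print(decide(ActionClass.FINANCIAL, {"authenticated"}))   # need: explicit_confirmation, summary_shown
print(decide(ActionClass.UNSUPPORTED, {"authenticated"})) # handoff
```

Note that nothing in `satisfied` can be inferred: "explicit_confirmation" should be set only by a direct customer response to the action summary.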

That last distinction protects both the customer and the business. A mistaken recommendation can usually be reconsidered. An executed refund or subscription change creates financial and operational consequences. If the system cannot preview and verify the exact action, keep that workflow read-only and let a trained person execute it.

The transition back to shopping also needs deliberate design. After resolving a delivery problem or order correction, the agent should restore the prior context and offer a relevant path forward. It should not force an upsell into every service interaction. The next best action after a serious order problem is often confirmation that the problem is resolved. Commercial momentum comes from reducing friction, not from ignoring the customer’s immediate priority.

When escalation is necessary, pass a structured handoff rather than a transcript dump. Include the detected intent, verified identity state, constraints already collected, products or orders involved, actions attempted, results returned by Shopify, and the unresolved decision. A human agent should be able to continue with the next question, not repeat the first one.
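As an illustration, the handoff can be a small typed payload rather than free text. The field names and example values below are hypothetical; map them to whatever your help desk accepts.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    intent: str
    identity_verified: bool
    constraints: dict = field(default_factory=dict)
    items_involved: list = field(default_factory=list)
    actions_attempted: list = field(default_factory=list)
    platform_results: list = field(default_factory=list)
    unresolved_decision: str = ""


handoff = Handoff(
    intent="exchange_variant",
    identity_verified=True,
    constraints={"size": "M", "color": "blue"},
    items_involved=["order:EXAMPLE-1001", "variant:tee-blue-m"],
    actions_attempted=["check_exchange_eligibility"],
    platform_results=["eligible: true"],
    unresolved_decision="Customer has not chosen between exchange and refund.",
)

# Serialize for the receiving system; a person reads the last field first.
payload = json.dumps(asdict(handoff), indent=2)
print(payload)
```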

Measure incremental commerce value and operational risk

Chat conversion is an attractive metric and an easy one to misread. People who open a shopping conversation may already have higher intent than people who do not. Comparing those groups directly can credit the agent for demand it did not create.

Ninja Transfers reported that 10% of its conversations converted to orders with values 20% above the store’s average order value. That is a useful customer result, but it is a vendor-supplied case from one merchant, not a universal benchmark or proof that the agent caused the full difference. Your business case should depend on your own incremental test.

Where traffic permits, randomize eligible storefront sessions between an agent experience and the existing experience. Measure the result across all eligible sessions, not only the visitors who choose to chat. That intention-to-treat view reduces self-selection bias and answers the executive question: what changed when the store made the agent available?
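A small worked example shows why the denominator matters. All counts below are invented for illustration; they are not results from any merchant.

```python
def rate(orders, sessions):
    """Orders per session, guarding against an empty arm."""
    return orders / sessions if sessions else 0.0


# Hypothetical counts from a randomized test.
control = {"sessions": 10_000, "orders": 200}
treatment = {
    "sessions": 10_000,    # every session where the agent was available
    "orders": 230,
    "chat_sessions": 800,  # the subset that actually opened the agent
    "chat_orders": 80,
}

# Intention-to-treat: compare ALL eligible sessions in each arm.
itt_lift = rate(treatment["orders"], treatment["sessions"]) - rate(
    control["orders"], control["sessions"]
)

# Naive view: compare only chatters against the control arm.
naive_gap = rate(treatment["chat_orders"], treatment["chat_sessions"]) - rate(
    control["orders"], control["sessions"]
)

print(f"ITT lift:  {itt_lift:.2%}")   # ITT lift:  0.30%
print(f"Naive gap: {naive_gap:.2%}")  # Naive gap: 8.00%
```

In this made-up data the chatters convert at 10%, but making the agent available moved conversion across eligible sessions by 0.3 percentage points. The second number is the one the business case should rest on, and it still needs a significance check before anyone acts on it.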

Use a balanced scorecard rather than a single conversion target:

Business outcomes: completed orders per eligible session, revenue per eligible session, average order value, checkout completion, and assisted revenue.
Decision outcomes: recommendation engagement, product-detail visits after a recommendation, add-to-cart actions, and successful comparison flows.
Downstream quality: cancellations, returns, exchanges, and contacts caused by a poor recommendation or incorrect expectation.
Service outcomes: successful action completion, repeat contact for the same problem, human escalation, and time to a confirmed resolution.
Agent quality: use of current catalog facts, in-stock recommendation rate, policy accuracy, clarification behavior, safe refusal, and correct tool execution.
Risk outcomes: unauthorized or incorrect actions, failed confirmations, customer complaints, policy exceptions, and cases requiring operational repair.

Conversion and average order value belong beside returns and cancellations. An agent can raise the initial basket by recommending a more expensive option while reducing customer fit. Without the downstream view, the dashboard rewards the sale and hides the repair.

Your event model should make the journey reconstructable. Useful events include agent opened, intent classified, clarification answered, recommendation shown, product selected, cart changed, checkout started, order completed, support action requested, confirmation received, action completed, and human handoff. Join these events through an appropriately governed session or order identifier so the team can inspect both funnel movement and individual failure paths.
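To make "reconstructable" concrete, here is a minimal sketch that rebuilds each journey from a flat event log and flags support actions that never completed. The event dictionary shape, the snake_case event names, and the session IDs are assumptions for illustration.

```python
from collections import defaultdict


def journeys(events):
    """Group event names by session, ordered by timestamp."""
    by_session = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        by_session[event["session_id"]].append(event["name"])
    return dict(by_session)


def unfinished_actions(journey_map):
    """Sessions where a support action was requested but never completed."""
    return [
        session_id
        for session_id, names in journey_map.items()
        if "support_action_requested" in names and "action_completed" not in names
    ]


# Events arrive out of order, as they often do in practice.
events = [
    {"session_id": "s1", "ts": 3, "name": "cart_changed"},
    {"session_id": "s1", "ts": 1, "name": "agent_opened"},
    {"session_id": "s1", "ts": 2, "name": "recommendation_shown"},
    {"session_id": "s2", "ts": 1, "name": "agent_opened"},
    {"session_id": "s2", "ts": 2, "name": "support_action_requested"},
    {"session_id": "s2", "ts": 3, "name": "human_handoff"},
]

paths = journeys(events)
print(paths["s1"])                # ['agent_opened', 'recommendation_shown', 'cart_changed']
print(unfinished_actions(paths))  # ['s2']
```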

Build the evaluation before the vendor demo

Create scenarios from your real catalog shape, policies, and failure modes. For each scenario, write the expected outcome and the failures that would make deployment unacceptable. Include:

An ambiguous shopping request that requires clarification.
Two products that appear similar but differ on a constraint customers care about.
An unavailable variant that would otherwise be the best match.
A question whose answer depends on store policy rather than product copy.
A conversation that moves from shopping to order support and back again.
A request that sounds actionable but lacks required authentication or confirmation.
An unsupported request that should trigger a safe handoff.
A catalog or policy change made after the agent’s initial synchronization.

Run the same scenarios repeatedly and record the underlying catalog state each time. You are testing consistency, grounding, and recovery, not literary elegance. A shorter answer that uses the correct variant and policy is more valuable than a fluent answer that improvises.
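A repeated-run harness can be very small. In this sketch the `agent` argument is any callable that takes a prompt and returns recommended SKUs; that interface, the stub agent, and the SKU names are all hypothetical.

```python
def run_scenario(agent, prompt, must_include, must_exclude, catalog_state, runs=5):
    """Run one scenario several times and log each result with the catalog state."""
    log = []
    for run in range(runs):
        skus = set(agent(prompt))
        log.append({
            "run": run,
            "catalog_state": catalog_state,
            "skus": sorted(skus),
            # Pass only if every required SKU appears and no forbidden one does.
            "ok": must_include <= skus and not (must_exclude & skus),
        })
    return log


def stub_agent(prompt):
    """Stand-in for a real agent call; always returns the same SKU."""
    return ["tee-blue-l"]


log = run_scenario(
    stub_agent,
    prompt="blue tee, medium if you have it",
    must_include={"tee-blue-l"},
    must_exclude={"tee-blue-m"},  # marked unavailable for this scenario
    catalog_state="snapshot-with-medium-out-of-stock",
)
print(all(entry["ok"] for entry in log), len(log))  # True 5
```

With a real agent, a single failing run among several passes is itself a finding: it points to inconsistency rather than a wrong rule.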

Expand autonomy in the order of consequence

A staged rollout lets evidence determine how much authority the agent receives:

1) Evaluate offline with approved scenarios and representative catalog data.
2) Launch read-only product discovery and policy answers with an obvious human fallback.
3) Add visible, reversible cart actions after recommendation quality is stable.
4) Introduce authenticated order-support workflows with previews and confirmations.
5) Enable financially consequential actions only after tool execution, auditability, exception handling, and operational ownership have been tested end to end.

Define the owner of each failure before launch. Product should own the intended customer behavior and success measures. Commerce or operations should own policy and workflow correctness. Engineering should own integration reliability and observability. Customer support should own escalation quality and emerging failure patterns. The exact reporting lines can vary; an unowned failure queue cannot.

Review both aggregates and conversation-level traces after release. Aggregate metrics tell you whether the experience is moving the business. Traces tell you why. A small cluster of incorrect variant recommendations or failed order actions can disappear inside a healthy conversion average while creating disproportionate customer harm.

Key takeaways

A Shopify-native agent should use live commerce data and governed APIs; storefront placement alone is not enough.
The agent’s product-discovery job is to uncover decision criteria, apply hard constraints, explain tradeoffs, and lead to the next useful action.
Shopping and support should share customer context, but the agent’s permissions must become stricter as actions become harder to reverse.
Conversation conversion is a diagnostic metric, not causal proof. Measure incremental results across eligible traffic whenever possible.
Pair conversion and average order value with returns, cancellations, incorrect actions, and operational repair costs.
Begin with read-only assistance and expand autonomy only after each workflow proves accurate, observable, recoverable, and properly owned.

Before you approve a purchase, bring a vague shopping request, a policy edge case, an unavailable variant, a mixed sales-and-support conversation, and a consequential order action into a live test store. If the agent cannot show where its answer came from, what it will change, and how it fails safely, you have found the next product requirement, not a detail to defer until launch.

References

Intercom – Fin for Ecommerce: The Shopify-native AI Agent transforming product discovery and sales

May 7, 2026

Taste vs. Evidence in the AI Era: What Product Leaders Must Invest In Now

I just finished listening to "Taste – All Things Product Podcast with Teresa Torres & Petra Wille," and as a product leader shipping AI-powered capabilities at HighLevel, Inc., I wanted to pressure-test the sudden obsession with "taste."

If you're curious, you can listen to this episode on Spotify or Apple Podcasts.

The core question landed perfectly for our moment: Is "taste" the must-have skill of the AI era — or just the latest tech buzzword in a world where AI is eating through design, delivery, and discovery?

Teresa pushes back hard, highlighting how slippery the term can be. "It's just this month's flavor of founder mode." She points out that "taste" is rarely defined, can't be easily taught, and too often becomes shorthand for "my preference trumps yours." Just as importantly, "It's not about your taste. It's about your customer's taste."

Petra adds needed nuance from years in the craft: pattern-recognition is real, and some people do develop sharper product sense over time. As she put it, "I am a strong believer that you develop product sense and taste over time. It's never finished."

Both threads lead back to familiar roots in product: product sense, founder mode, and the enduring myth of the lone visionary. They also grapple with the question on everyone’s mind, "Will AI eat taste too?", and with where that leaves product teams as they work with GenAI and LLMs and adapt their product strategy.

Here’s my take. "Taste" can be useful as a personal north star, but it is not a decision system. In my teams, we bias toward evidence: continuous discovery, customer interviews, discovery synthesis with opportunity solution trees, and tight collaboration in product trios. Opinion can start the conversation, but evidence should end it.

Practically, that means investing in the skills that compound: discovery skills (understanding customers and matching solutions to real needs), human-to-human interaction skills, effective collaboration with AI, and critical thinking and judgment grounded in evidence.

On AI collaboration specifically, we treat GenAI as a force multiplier, not a decider. We prototype with AI to explore breadth, then narrow with qualitative and quantitative signals, ablation-style experiments, and clear success criteria. The bar I hold myself to is simple: taste without evidence is just opinion.

Three lines I underlined from the conversation:

"It's just this month's flavor of founder mode." — Teresa Torres

"It's not about your taste. It's about your customer's taste." — Teresa Torres

"I am a strong believer that you develop product sense and taste over time. It's never finished." — Petra Wille

If you want to go deeper, these references are helpful for sharpening judgment without falling into the "great man" theory trap.

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Founder mode

Marty Cagan: Founder-Style Leadership

Vercel/v0 CEO Guillermo Rauch on building taste, from Lenny Rachitsky’s LinkedIn post

Continuous discovery (read Teresa’s Everyone Can Do Continuous Discovery—Even You! Here’s How)

The "great man" theory

Steve Jobs and the myth of the lone product visionary

Have thoughts on this episode? Leave a comment below and share how your team balances product sense with evidence in the age of AI.

Inspired by this post on Product Talk.

May 5, 2026
Mastering Product Marketing with Amplitude Analytics: Proven Playbooks for Sustainable Growth

I’m continually refining how we use analytics to elevate product marketing, and this collection brings together my most effective playbooks for driving measurable growth with Amplitude Analytics. If you’re focused on product-led growth, you’ll find pragmatic guidance on translating behavioral analytics into sharper positioning, stronger activation, and durable retention.

In my day-to-day work, I connect product strategy with go-to-market strategy by grounding every narrative in real user behavior. That means using event data to validate our value proposition, mapping journeys to uncover friction, and aligning product positioning with the moments that actually matter in-app. The outcome is a marketing engine that mirrors how customers discover, adopt, and expand within the product.

Activation and retention are where outcomes are won or lost. I detail how to set leading indicators for user activation, instrument key behaviors, and run retention analysis that distinguishes healthy engagement from noisy usage. You’ll see how I turn cohort insights into precise messaging, targeted onboarding, and experiments that compound over time.

Cross-functional execution is essential, so I share ways to operationalize a unified analytics platform across product, marketing, and customer success. With shared metrics, product trios can move faster from product discovery to launch, and marketing can scale campaigns that reflect what’s truly driving adoption. This tight loop reduces guesswork and increases our hit rate on both features and narratives.

If you’re building a modern product marketing function, these essays and guides will help you move from intuition-led storytelling to evidence-backed strategy. Dive in to learn how I connect behavioral analytics to positioning, packaging, and roadmap choices—so every campaign and release ladders up to meaningful customer outcomes and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 4, 2026
Master Opportunity Mapping with Continuous Discovery Habits — Join the May 2026 Book Club

Five years in, Continuous Discovery Habits continues to be one of the most practical frameworks I use to align empowered product teams, sharpen product strategy, and convert customer interviews into outcomes. To celebrate its impact, I’m hosting a community read-along and inviting you to dig in with me this May.

Each month, I’m releasing an in-depth reading guide to make learning stick. You’ll find the chapters we’ll be reading, a preview of the essential concepts, short videos to help you spread the ideas across your organization, individual and team discussion prompts, team exercises to put the concepts into practice, and additional reading if you want to go deeper. My goal is simple: help you turn product discovery into a steady habit, not a once-a-quarter activity.

We’ll discuss each month’s reading in the comments, and we’ll gather quarterly on a live call to compare notes and share what’s working. Joining late is absolutely fine—I monitor the conversation throughout the year. Start with the current month or rewind to January; you can ask for help, share wins and roadblocks, and connect with other readers anytime.

If you want to participate, grab a copy of the book (or dust off your old one), share the "Spread the Love" videos with your team, block focused time for the exercises, and register for the community sessions. Let’s do this together.

This Month’s Reading

Chapter 6: Mapping the Opportunity Space

Estimated reading time: ~23 minutes

This month’s chapter will introduce you to why opportunity mapping is critical for structuring the ill-structured problem of reaching your desired outcome; how to move from overwhelming opportunity backlogs to well-structured opportunity spaces; the power of tree structures for depicting parent-child and sibling relationships between opportunities; how to identify distinct branches in your opportunity space using key moments in time; common anti-patterns to avoid when building your first opportunity solution tree; and why structure "gets done, undone, and redone" as you continue to learn.

Need a copy? Grab the book.

Share the Love with Friends and Colleagues

We learn best in community. Use these short videos to spread the core concepts from this chapter—then invite your team to join the book club with you.

The need for opportunity mapping – You will never fully satisfy your customers' desires

Understanding the structure of an opportunity solution tree – Depicting two types of relationships

Turn big intractable problems into smaller, more solvable problems – The power of decomposition

How to map an opportunity space – Getting started with opportunity solution trees

A well-structured opportunity space has distinct branches – Identify key moments in time

Reflect & Discuss What You Read

Reflection turns reading into capability. This chapter asks us to shift from reacting to every request to deliberately structuring the opportunity space. If you’ve ever felt overwhelmed by a never-ending backlog or pressure to ship output over outcomes, this is where the fog starts to lift. As you read, focus on how your team currently organizes (or doesn’t organize) what you hear from customers.

Individual Reflection

1) Think about your current product backlog or opportunity list. Is it a flat list, or do you have some structure to it? If you were to group similar opportunities together, what patterns would emerge?

2) When was the last time you heard a customer need and immediately jumped to a solution without exploring whether there were related opportunities? What would change if you took the time to map how that opportunity connects to others?

3) Review the anti-patterns from the chapter (opportunities framed from your company's perspective, vertical opportunities, opportunities with multiple parents, etc.). Which of these do you recognize in how your team currently talks about opportunities?

Team Discussion

1) As a team, pick a top-level opportunity you're currently working on. Try breaking it down into sub-opportunities together. Where do you struggle? Where do you disagree about how to frame or group opportunities? What does that tell you about gaps in your shared understanding?

2) Look at your experience map (from Chapter 4) and identify 3-5 distinct moments in time during your customer's experience. Could these become the top-level branches of your opportunity solution tree? Where do you see overlap, and where are there clear distinctions?

3) Discuss the quote from Barbara Tversky: "Structure gets done, undone, and redone." How does your team currently respond when you discover new information that changes how you understand the opportunity space? Do you treat your opportunity map as fixed or as something that evolves?

Put It Into Practice

Reading is step one; building your first opportunity solution tree is where the real learning happens. The exercises below are exactly how I coach product trios to transform ambiguous problems into aligned action.

Exercise: Build Your First Opportunity Solution Tree

Time: 60 minutes. Do this: With your product trio.

Start by reviewing your interview snapshots from the past few weeks. For each opportunity you captured, ask the three questions from the chapter:

Is this opportunity framed as a customer need, pain point, or desire (not a solution)?

Is this opportunity unique to one customer, or have we seen it in more than one interview?

If we address this opportunity, will it drive our desired outcome?

Then, using your experience map, identify 3-5 distinct moments in time to serve as your top-level opportunities. Group the opportunities from your interviews under these top-level branches.

Look for opportunities to add structure to each branch. Group similar opportunities together and identify a parent opportunity. Look for vertical stacks (one parent, one child) and fill in missing siblings. Reframe opportunities that are too broad or that could live in multiple branches.

Don’t aim for perfection. Get something on paper (or a digital canvas) and iterate the tree with every new interview.

Exercise: Practice Framing Opportunities from Your Customer’s Perspective

Time: 30-45 minutes. Do this: With your product trio.

Take 10-15 opportunities from your current backlog or list. For each one, ask: "Can I imagine a customer saying this?" If the answer is no, reframe it from your customer’s perspective. For example:

"Increase subscription conversions" becomes "I want to know if this product is worth paying for"

"Reduce support tickets" becomes "I can't figure out how to do X"

"Improve onboarding completion" becomes "I'm not sure what to do next"

This exercise helps you spot business-centric opportunities disguised as customer opportunities. It also trains your team to listen for opportunities in interviews that are framed from the customer’s point of view.

Go Deeper: Additional Reading

If you prefer an audio summary of this month’s reading, including the book chapters and the following resources, I’ve included an audio version for paid subscribers at the bottom of this post.

Related In-Depth Guides

Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive Outcomes

Customer Interviews: Uncover Hidden Insights from Every Conversation

Supplementary Reading

Prioritize Opportunities, Not Solutions

Product in Practice: Opportunity Mapping at Grailed

Product in Practice: Opportunity Mapping at trivago

7 Key Benefits of Using Opportunity Solution Trees

Getting Started with Opportunity Solution Trees at SuperAwesome

Bringing Order to Chaos: Using Opportunity Solution Trees in Everyday Life

Other Voices

Why Groups Struggle to Solve Problems Together by Al Pittampalli

More PM Problem Areas by Marty Cagan

Five Superpowers of Diagrams by Abby Covert

Critical Thinking is Product Management by This Is Product Management

Our Live Discussion Schedule

Our live discussion sessions are for paid subscribers. Sessions are not recorded. Invitations will go out to Supporting Members and CDH Members two weeks before the scheduled event. But reserve the time on your calendar now.

Tuesday, June 16, 2026: 9am-10am PDT

Thursday, September 17, 2026: 9am-10am PDT

Wednesday, December 16, 2026: 9am-10am PST

Audio Summary

This summary was produced by NotebookLM. The sources supplied were the book chapters as well as all of the additional reading.

Inspired by this post on Product Talk.

May 4, 2026
Beyond Command and Control: How I Build Trust, Speed, and Autonomy in Product Teams

When uncertainty spikes, I notice many organizations snap back to "Command and control." It feels fast, safe, and decisive—especially when the stakes are high. But in product management leadership, speed without shared context is often an illusion, and control without trust rarely scales. I’ve learned that what looks like strength from the top can quietly create bottlenecks, missed signals, and disengaged teams.

Why do smart companies revert in tough times? Familiarity. Centralizing decisions can reduce short-term cognitive load and signal clarity. Yet the cost shows up quickly: leaders become single-threaded on context they cannot possibly hold, and teams spend cycles asking for permission rather than creating value. The result is slower learning and weaker product strategy just when continuous discovery and iteration matter most.

Here’s the hard truth: no single leader can hold all the context required to make every decision in a modern, cross-functional environment. The hidden complexity of customer segments, technical debt, data signals, and go-to-market constraints outstrips any one person’s bandwidth. That’s why empowered product teams, staffed with domain experts, outperform command centers—provided they’re aligned on outcomes and guardrails.

I like the burning house analogy: in a true emergency, crisp direction helps—"take the stairs, not the elevator"—because the problem is clear, the time horizon is short, and the action is obvious. But most product work is not a single burning house; it’s a city with evolving fire codes, shifting weather, and neighborhoods that look different block to block. In that environment, distributed action scales better than centralized control.

Strong leadership is not the same as command-and-control. In practice, it means setting a compelling direction, defining guardrails, and running tight feedback loops. I aim for what I call the "Flotilla of kayaks": we’re all headed to the same lighthouse, but each kayak navigates its own currents based on local information. That’s aligned autonomy—fast, resilient, and deeply accountable.

People often ask why some command-and-control companies still succeed. My view: beneath the surface, there’s usually more trust and unofficial autonomy than their org charts suggest. Teams earn freedom by shipping reliably, sharing decision rationales, and showing outcomes. Leaders tolerate—and even quietly endorse—those pockets of autonomy because they see the results.

It’s a spectrum, not a binary. I flex my style based on risk, reversibility, and time horizon—what I’d call spectrum thinking. Early in a bet, or when risks are existential, I raise the altitude and tighten the cadence. As confidence builds, I widen autonomy and shift the team to outcomes over outputs. Beware "Founder mode" when it drifts from vision-setting into day-to-day decision vetoes; it’s intoxicating early and suffocating at scale.

On decision-making, I prefer a simple principle: let the person with the most relevant expertise decide, while incorporating the right input. That’s "Consultative decision-making" in practice. In some regions, you’ll hear it called "Konsultativer Einzelentscheid." The point is to seek counsel without defaulting to a consensus process that slows the decision. One person owns the call, and everyone commits to the decision once it’s made.

Practically, here’s what works for my teams: we clarify decision rights up front, draft pre-reads with clear options and risks, involve the smallest set of stakeholders required, and document the decision and expected signals ahead of time. Product trios keep discovery tight with design and engineering, while stakeholder management focuses on context, not sign-offs. We track outcome-based OKRs rather than output targets and hold regular decision reviews so we can reverse or double down fast.

My key takeaways are consistent: "Command and control" can feel efficient, but it doesn’t scale in complex environments. No leader can hold all the context. Strong leadership is about direction, guardrails, and feedback loops—not control. High-performing teams balance autonomy with alignment. Decision-making should sit with the person closest to the problem, supported by the right input and transparent reasoning. Trust is built and earned over time—and it changes how teams operate.

Reflection prompts I use with my leads: Where does your team sit on the command-and-control ↔ autonomy spectrum? Are the highest-context people truly making the decisions? What would it take to increase trust and autonomy—better instrumentation, clearer guardrails, or tighter cadences? Which calls require consensus, and which deserve a decisive, single-threaded owner?

If you’re wrestling with speed, alignment, and autonomy in your organization, start small: pilot "Consultative decision-making" on one consequential decision, set explicit guardrails, and measure the outcome. You may be surprised how quickly aligned autonomy compounds into better product discovery, sharper product strategy, and stronger execution.

Inspired by this post on Product Talk.

April 28, 2026
Master Build-to-Learn: The Essential FAQ to Supercharge Product Discovery in the AI Era

In the age of AI, I’ve come to believe we’re all builders, yet not all building is the same. There is a meaningful difference between building to learn (product discovery) and building to earn (product delivery). When we confuse the two, we waste time, budget, and team energy on output over outcomes. My goal in this FAQ-style reflection is to clarify when and how to choose each mode so we can make smarter, faster, more confident product decisions.

Why does this distinction matter so much right now? Because as the cost of product delivery continues to drop, the scarce resource shifts from shipping capacity to clarity of problem, solution, and value. Cloud infrastructure, CI/CD, feature flags, and even gen AI code assistance have made it cheaper to launch. That’s great—but if we don’t learn the right things before we scale, we’ll efficiently deliver the wrong product. Discovery is how we de-risk that.

What do I mean by build to learn? I use discovery to quickly validate problems, test value, and shape solutions before committing delivery teams to scale. In practice, that means continuous discovery with customer interviews, rapid prototyping, and lightweight experiments that put us in front of real users fast. I rely on product trios and empowered product teams to co-own outcomes, not just output, and I anchor decisions in outcome-based OKRs so we stay focused on measurable impact.

How do I structure discovery sprints? I start with an opportunity solution tree to map customer pain points and candidate solutions, then select the smallest test that can invalidate a risky assumption. When signals are ambiguous, I refine the questions and instrument better learning loops rather than pushing harder on delivery. For experiments, I keep a bias to speed: clickable prototypes, concierge tests, or generative AI prototypes often reveal more in days than a coded MVP does in weeks. When experiments go live, I use a clear minimum detectable effect (MDE) and resist reading noise as signal.

Where does AI change the calculus? LLMs are turbocharging discovery for product managers by accelerating research synthesis, persona drafts, and early concept validation. I pair that with eval-driven development to set crisp acceptance criteria for AI behaviors before any production integration. Prompt engineering and conversation design are part of the toolkit, but the same rule applies: prototype to learn, not to impress. AI can make bad ideas cheaper to build, so disciplined discovery matters more than ever.

So when do I switch to build to earn? Once I have evidence of value and feasibility, I shift into product delivery to scale with quality, security, and reliability. This is where I bring in product roadmapping and sprint planning, DORA metrics to monitor deployment frequency and lead time, and strong SRE and observability practices to safeguard the user experience. The handoff isn’t a wall; discovery continues inside delivery to refine scope, reduce risk, and maintain momentum.

What pitfalls do I watch for? The biggest is treating delivery as discovery: shipping features to “see what happens” without a clear learning thesis. Another is tech-first decisions driven by technology FOMO instead of product strategy and customer value. I also see teams set output-based commitments that crowd out learning; outcome-based OKRs keep us honest. And when considering build vs buy, I evaluate whether the capability differentiates us; if not, I’ll buy to preserve discovery capacity on what truly matters.

My operating conviction is simple: invest early and deliberately in build to learn so build to earn becomes high-confidence, high-velocity, and high-impact. In practical terms, that means smaller bets, faster feedback, clearer outcomes, and tighter collaboration across product, design, and engineering. If we get discovery right, delivery feels inevitable—and customers feel understood.

Inspired by this post on SVPG.

April 27, 2026

AI Product Validation: From Promising Demo to Proven Value

You have an AI demo that looks impressive. It answers the happy-path prompt, the latency seems acceptable, and stakeholders can already imagine the launch. The uncomfortable question is whether any of that proves the product is worth building.

It does not. A useful validation process has to reduce several different risks: whether customers care, whether the workflow helps them, whether the AI performs reliably, whether the economics work, and whether failures remain tolerable. Test those risks in that order and you can make a defensible investment decision without turning production traffic into your debugging environment.

Define the decision before you design the AI

The first artifact for an AI initiative should not be a model shortlist or a prototype. It should be a decision contract that states what must become true for the initiative to deserve more investment.

A practical decision statement has this shape: For a defined user in a defined situation, the proposed capability will improve an observable outcome relative to the current alternative, without breaching named guardrails. If the agreed threshold is met, you will advance. If it is not, you will stop or change a specific assumption.

Write down these five elements before the experiment begins:

User and job: Name who encounters the problem, when it occurs, and what they are trying to complete. A broad label such as knowledge workers is not precise enough to design a useful test.
Current alternative: Record what the user does now, including manual work, an existing product flow, a rules engine, or simply tolerating the problem. This is the baseline the AI must beat.
Observable outcome: Choose a user or business result, not a model activity. Task completion, time-to-value, corrected routing, rework, repeat use, or downstream resolution can carry more meaning than generations or prompt volume.
Success threshold and guardrails: Decide how much improvement would justify the cost and what must not deteriorate. Safety failures, latency, privacy exposure, retention, and cost per successful outcome can all constrain an otherwise positive result.
Decision rule: State what evidence will trigger expansion, another iteration, a change in direction, or cancellation. Precommitting prevents enthusiasm for a polished demo from moving the goalposts later.
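
One way to keep the five elements from drifting is to record them as a structured object that can be checked before the experiment starts. The sketch below is only an illustration: the class name, the field names, and the example values are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContract:
    """Hypothetical record of the five elements, written before the test runs."""

    user_and_job: str
    current_alternative: str
    observable_outcome: str
    success_threshold: float   # smallest absolute lift that justifies the cost
    guardrails: dict           # metric name -> worst acceptable value
    decision_rule: dict        # evidence state -> precommitted action

    def missing_elements(self):
        """Return the names of elements that are still blank or undefined."""
        missing = [
            name
            for name in ("user_and_job", "current_alternative", "observable_outcome")
            if not getattr(self, name).strip()
        ]
        if self.success_threshold <= 0:
            missing.append("success_threshold")
        if not self.guardrails:
            missing.append("guardrails")
        if not self.decision_rule:
            missing.append("decision_rule")
        return missing


# Illustrative values only: a support-triage contract with no decision rule yet.
contract = DecisionContract(
    user_and_job="Support agent triaging a new inbound ticket",
    current_alternative="Manual category and priority selection",
    observable_outcome="Share of tickets routed correctly on first assignment",
    success_threshold=0.05,
    guardrails={"p95_latency_seconds": 2.0},
    decision_rule={},
)
print(contract.missing_elements())  # ['decision_rule']
```

A contract that still reports missing elements is a signal to finish the conversation before building anything.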

The threshold is not universal. It should reflect the value of the outcome, the implementation and operating costs, the consequences of errors, and the return available from competing roadmap work. Minimum detectable effect belongs here: define the smallest improvement that would actually change your decision, then size the test to detect that effect. A test that cannot distinguish a worthwhile gain from noise is not a faster test. It is a delayed decision.
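
To make the sizing step concrete, here is a minimal sketch that estimates how many users each arm needs to detect a given absolute lift in a rate metric. It assumes a two-sided test on two proportions with the standard normal approximation. The function name and the default alpha and power are my own illustrative choices, not a prescription.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect an absolute lift of mde_abs.

    Assumes a two-sided comparison of two proportions (normal approximation).
    """
    treated_rate = baseline_rate + mde_abs
    if not 0 < baseline_rate < 1 or not 0 < treated_rate < 1:
        raise ValueError("rates must stay strictly between 0 and 1")
    if mde_abs == 0:
        raise ValueError("the minimum detectable effect cannot be zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = baseline_rate * (1 - baseline_rate) + treated_rate * (1 - treated_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative numbers: a 60% baseline and two candidate thresholds.
for mde in (0.05, 0.02):
    print(mde, sample_size_per_arm(0.60, mde))
```

Halving the effect you want to detect roughly quadruples the required sample, which is why the threshold should come from the business decision and not from the traffic you happen to have.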

A driver tree helps prevent a common measurement mistake. Start with the desired outcome, connect it to the user behaviors that could produce that outcome, and then connect those behaviors to system-level drivers. For an AI support-triage capability, the outcome might be faster correct routing. Accepted category and priority suggestions are leading signals; downstream corrections, reassignment, and resolution are closer to the outcome. Model classification accuracy matters, but it is only one driver in the chain.

If the proposal involves an autonomous or semi-autonomous agent, run a precondition check before planning the experiment. Volume, instructions, tolerance, access, and a learning loop expose whether agentic complexity is justified:

Volume: Does the workflow happen often enough for automation to create meaningful leverage?
Instructions: Can success, constraints, and exceptions be expressed in testable terms?
Tolerance: Is the likely failure reversible, detectable, and contained?
Access: Can the system use the necessary data and tools with reliable integrations and least-privilege permissions?
Learning loop: Can you measure quality, latency, cost, and failures after launch?

A missing condition tells you what to validate next. Unclear instructions call for more discovery and rubric design. Weak access calls for an integration or data-quality spike. Low error tolerance calls for approvals and a narrower action space. Low volume may mean that a clear workflow, a rule, or better product UX is the better answer. The purpose of validation is not to prove that AI belongs in the solution; it is to discover whether it does.
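
The mapping from a missing condition to the next validation step can be written down as a small lookup. This is a sketch under assumptions: the key names are hypothetical, and the suggested step for a missing learning loop is my own addition, since the paragraph above does not name one.

```python
# Next step for each unmet precondition, in the order the checks are listed.
NEXT_STEP = {
    "volume": "Consider a clear workflow, a rule, or better product UX instead",
    "instructions": "Run more discovery and design a rubric",
    "tolerance": "Add approvals and narrow the action space",
    "access": "Run an integration or data-quality spike",
    # Assumption: not stated in the text, added for completeness.
    "learning_loop": "Instrument quality, latency, cost, and failure measurement",
}


def next_validation_steps(checks):
    """Return the follow-up work implied by every precondition that is not met.

    `checks` maps a condition name to True (met) or False (not met).
    A condition that is absent from `checks` is treated as not met.
    """
    unknown = set(checks) - set(NEXT_STEP)
    if unknown:
        raise ValueError(f"unknown conditions: {sorted(unknown)}")
    return [step for name, step in NEXT_STEP.items() if not checks.get(name, False)]


steps = next_validation_steps(
    {"volume": True, "instructions": False, "tolerance": True,
     "access": False, "learning_loop": True}
)
for step in steps:
    print(step)
```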

Climb an evidence ladder instead of jumping to a pilot

An oversized pilot often mixes market, usability, model, integration, and operational risk into one expensive test. When the result disappoints, nobody knows which assumption failed. An evidence ladder gives each experiment one dominant question and increases fidelity only after the previous uncertainty has been reduced.

Question to answer	Lean experiment	Evidence to inspect	What it does not prove
Do users care enough to act?	Painted door, landing page, waitlist, concierge offer, preorder, or deposit where appropriate	Click-through intent, qualified sign-ups, willingness to pay, and continued requests	Usability, AI quality, or scalable delivery
Can the proposed workflow help?	Wizard-of-Oz flow or realistic interactive prototype	Task completion, time on task, errors, material friction, and repeat use	Whether an AI system can deliver the experience reliably
Can the system perform the job?	Offline evaluation on a curated golden set plus targeted technical spikes	Rubric results by case type, failure patterns, latency, and cost	Whether the complete product changes user behavior
Does the product improve the target outcome?	Feature-flagged A/B test or holdout	Primary outcome, leading indicators, cohort effects, and guardrails	Long-term stability under every operating condition
Can it operate within acceptable risk?	Capped rollout with approvals, audit logs, monitoring, and rollback controls	Harm and privacy events, reversals, escalations, reliability, and cost per successful outcome	That future changes will remain safe without continued evaluation

Use the first row when demand is the dominant risk. A painted-door click is a signal of curiosity, not proof of durable value. A qualified sign-up asks for more commitment. A preorder or deposit, when honest and operationally appropriate, tests willingness to pay. Repeated use of a manually delivered service provides stronger behavioral evidence. Do not collapse these signals into a single conversion metric; they represent different levels of commitment.

Once demand appears credible, use a prototype or Wizard-of-Oz flow to learn whether the proposed interaction helps. Pretotyping should answer whether the product deserves to exist, while prototyping should answer how it needs to work. Keeping those questions separate prevents a polished interface from disguising weak demand and prevents a crude early interface from killing a valuable idea before its workflow has been understood.

These experiments still owe users honest expectations. A painted door should reveal that the capability is unavailable after the user expresses interest, rather than pretending it already exists. A concierge or Wizard-of-Oz flow should be explicit about how data will be handled and what follow-up the participant can expect. Deception can manufacture a metric while damaging the trust the eventual product will need.

Advance when the evidence changes the dominant uncertainty. Strong demand does not authorize a production launch; it authorizes a workflow test. A usable workflow authorizes a system evaluation. An offline pass authorizes limited exposure. Each rung earns the next investment without pretending to answer questions it was not designed to answer.

Separate model quality from product value

A model can produce better answers while the product creates less value. Added latency can interrupt the workflow. A retrieval failure can ground an otherwise capable model in the wrong context. A user may spend more time checking and rewriting an answer than doing the task manually. This is why a single accuracy score cannot validate an AI product.

Build a golden set from the work users actually do

Eval-driven development starts before production traffic. Build a curated set of cases that reflects real user complexity, then turn your definition of good into a reproducible scoring process.

Define the evaluation unit: Score the completed job whenever possible, not merely an isolated response. An agent that drafts a correct message but sends it to the wrong destination has failed the job.
Represent meaningful variation: Include normal cases, longer and shorter inputs, ambiguous requests, important customer segments, and known edge conditions. A convenience sample of clean happy paths measures demo readiness.
Tag each slice: Label cases by intent, complexity, risk, input type, or other distinctions that could conceal a concentrated failure. Aggregate performance can improve while a critical slice gets worse.
Write a multidimensional rubric: Score correctness, completeness, groundedness, safety, tone, policy compliance, and any task-specific requirements separately. Add latency and cost as system measures rather than blending everything into an opaque average.
Choose a real baseline: Compare the candidate with the current product, manual workflow, rules-based approach, or incumbent model. The relevant question is not whether the candidate looks capable in isolation; it is whether switching produces enough value.
Preserve regression evidence: Keep a stable set for comparisons and add newly discovered failures to an evolving challenge set. This turns production learning into protection against recurrence.
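
The warning about aggregates hiding a failing slice is easy to show with a few lines of code. The sketch below assumes each golden-set result has already been reduced to a slice tag and a pass or fail judgment. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_slice(results):
    """Compute the pass rate per slice from (slice_tag, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passes, cases]
    for tag, passed in results:
        totals[tag][0] += int(passed)
        totals[tag][1] += 1
    return {tag: passes / cases for tag, (passes, cases) in totals.items()}


def regressed_slices(baseline, candidate, tolerance=0.0):
    """List slices where the candidate scores worse than the baseline."""
    base = pass_rate_by_slice(baseline)
    cand = pass_rate_by_slice(candidate)
    return sorted(tag for tag in base if cand.get(tag, 0.0) < base[tag] - tolerance)


def overall(results):
    """Aggregate pass rate across every case, ignoring slices."""
    return sum(passed for _, passed in results) / len(results)


# Made-up data: the candidate wins overall but loses on the high-risk slice.
baseline = ([("routine", True)] * 6 + [("routine", False)] * 4
            + [("high_risk", True)] * 4 + [("high_risk", False)] * 1)
candidate = ([("routine", True)] * 10
             + [("high_risk", True)] * 2 + [("high_risk", False)] * 3)

print(overall(baseline) < overall(candidate))   # True
print(regressed_slices(baseline, candidate))    # ['high_risk']
```

Reporting the regressed slices next to the headline number keeps a concentrated failure from being averaged away.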

Keep the measurement layers visible in every readout:

Output quality: correctness, completeness, groundedness, tone, safety, and compliance.
System performance: retrieval quality, tool execution, policy enforcement, latency, reliability, and cost.
User outcome: task completion, time-to-value, edits, rejection, rework, escalation, and repeat use.
Business consequence: the downstream result the initiative was funded to improve, along with retention or other core guardrails where relevant.

Each layer diagnoses a different problem. If output quality is weak, work on context, prompts, retrieval, tools, policies, or the model. If output quality passes but completion does not improve, inspect the interaction and workflow. If users succeed but the cost per successful outcome is unacceptable, narrow the use case or revisit the architecture. A composite score can hide these distinctions at exactly the moment you need them.

Test the behavior distribution, not a lucky response

AI output is variable, so a candidate should not pass because one run happened to look good. Use two evaluation modes. A regression configuration should be as controlled as the system allows, with model, prompt, retrieval, tool, temperature, top-p, and seed settings documented where they apply. A production-like configuration should match the variability users will experience and repeat cases often enough to reveal unstable behavior and tail failures.

Run candidate and baseline systems on the same cases under comparable settings.
Inspect results by slice and failure type, not only the overall average.
Repeat stochastic cases so the team sees consistency, variance, and severe outliers.
Automate clear rubric checks, but retain human review for ambiguous or high-consequence judgments.
Version the model, prompt, retrieval configuration, tools, policies, and evaluation set so a change can be reproduced.
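
As a rough illustration of reading a distribution instead of a single run, the sketch below summarizes repeated rubric scores per case and flags any case that fails at least once. The case names, scores, and the 0.7 pass mark are invented for the example.

```python
from statistics import mean, pstdev


def summarize_runs(scores_by_case, pass_mark=0.7):
    """Summarize repeated rubric scores (0 to 1) for each evaluation case."""
    summary = {}
    for case_id, scores in scores_by_case.items():
        if not scores:
            raise ValueError(f"no runs recorded for case {case_id!r}")
        summary[case_id] = {
            "mean": mean(scores),
            "spread": pstdev(scores),   # population standard deviation across runs
            "worst": min(scores),       # the tail failure a single run can hide
            "pass_rate": sum(s >= pass_mark for s in scores) / len(scores),
        }
    return summary


def unstable_cases(summary, min_pass_rate=1.0):
    """Return cases whose share of passing runs is below the required rate."""
    return sorted(c for c, s in summary.items() if s["pass_rate"] < min_pass_rate)


# Hypothetical scores from four production-like runs per case.
runs = {
    "refund-policy": [0.9, 0.9, 0.8, 0.9],
    "ambiguous-order": [0.9, 0.2, 0.9, 0.9],
}
summary = summarize_runs(runs)
print(unstable_cases(summary))  # ['ambiguous-order']
```

The second case clears the pass mark on average, yet one run in four fails badly, which is the kind of result a lucky demo never shows.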

This creates a release gate instead of a demo contest. Offline evaluation will not prove market value, but it can prevent known regressions, unsafe behavior, and obviously weak variants from consuming customer trust in a live experiment.

Make the production test answer a business decision

Production exposure is justified when demand, workflow, and offline performance have enough evidence behind them. The live test should then answer a narrow causal question: does access to this capability improve the intended outcome for the eligible population, compared with the current experience, without violating the operating constraints?

Instrument the complete causal chain

Your event schema should connect eligibility to exposure, interaction, system behavior, task completion, and downstream consequences. At minimum, distinguish these moments:

The user or account became eligible for the test.
The treatment was actually shown or made available.
The capability was invoked, whether explicitly or automatically.
The system succeeded, failed, timed out, or triggered a safeguard.
The output was displayed, accepted, edited, rejected, reversed, or escalated.
The target task was completed or abandoned.
The downstream outcome occurred, such as a correction, reassignment, reopening, or successful resolution for a support workflow.

Attach the cohort and the relevant model, prompt, retrieval, tool, and policy versions to the trace. Capture latency, cost, and safety results without indiscriminately logging sensitive payloads. Privacy-by-design and data governance determine which data may be retained, who may inspect it, and how long it should remain available.

Missing links create predictable misreadings. Without an exposure event, low adoption can be confused with low visibility. Without version information, a regression cannot be tied to a system change. Without the downstream event, acceptance can be mistaken for value even when users later undo the AI’s work.
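
Here is a small sketch of how a trace with explicit stages separates low visibility from low adoption. The stage names, the event fields, and the sample counts are hypothetical. A real schema would carry more context and follow your data-governance rules.

```python
from dataclasses import dataclass

# Ordered stages of the causal chain (names are illustrative).
STAGES = (
    "eligible", "exposed", "invoked", "system_succeeded",
    "output_accepted", "task_completed", "downstream_ok",
)


@dataclass(frozen=True)
class TraceEvent:
    subject_id: str
    stage: str
    model_version: str = ""
    prompt_version: str = ""


def funnel(events):
    """Count distinct subjects that reached each stage."""
    seen = {stage: set() for stage in STAGES}
    for event in events:
        if event.stage not in seen:
            raise ValueError(f"unknown stage: {event.stage}")
        seen[event.stage].add(event.subject_id)
    return {stage: len(subjects) for stage, subjects in seen.items()}


def rate(counts, numerator, denominator):
    """Share of subjects at `denominator` that also reached `numerator`."""
    return counts[numerator] / counts[denominator] if counts[denominator] else None


# Made-up trace: 10 eligible users, 4 actually saw the feature, 3 used it.
events = (
    [TraceEvent(f"u{i}", "eligible") for i in range(10)]
    + [TraceEvent(f"u{i}", "exposed", "m1", "p3") for i in range(4)]
    + [TraceEvent(f"u{i}", "invoked", "m1", "p3") for i in range(3)]
)
counts = funnel(events)
print(rate(counts, "invoked", "eligible"))  # 0.3
print(rate(counts, "invoked", "exposed"))   # 0.75
```

Measured against eligibility, adoption looks weak. Measured against exposure, most users who saw the capability tried it, so the problem to fix is visibility.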

Choose the design and sample around the decision

Randomization: Choose user, account, workflow, or time window based on where contamination can occur. If people in one account share outputs, user-level assignment may mix treatment and control experiences.
Population: Define eligibility before launch. Balance or stratify meaningful groups such as new accounts and power users when their behavior or exposure differs.
Primary metric: Select one outcome that can settle the main question. Treat diagnostic measures as supporting evidence, not a menu from which to pick a winner later.
Guardrails: Monitor core experience, retention where relevant, time-to-value, safety, privacy, reliability, and cost. Write rollback conditions for unacceptable movement before exposure begins.
Effect size and power: Set the minimum detectable effect from the business decision, estimate the required sample, and acknowledge when available traffic cannot support the desired conclusion.
Exposure control: Use feature flags, a capped rollout, and a holdout so you can stop quickly and preserve a valid comparison.
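
Exposure control depends on assignment that is stable across sessions. One common approach, sketched below under my own assumptions, hashes the randomization unit together with the experiment name and maps the result to a bucket. The group labels, the default shares, and the function name are illustrative and do not describe any specific feature-flag product.

```python
import hashlib


def assign(unit_id, experiment, treatment_share=0.10, holdout_share=0.05):
    """Deterministically place a unit in 'holdout', 'treatment', or 'control'.

    The same unit and experiment name always produce the same group, so the
    cap can be raised later without reshuffling units that are already exposed.
    """
    if treatment_share < 0 or holdout_share < 0 or treatment_share + holdout_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")

    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)

    if bucket < holdout_share:
        return "holdout"
    if bucket < holdout_share + treatment_share:
        return "treatment"
    return "control"


# Assign by account, not by user, when outputs are shared inside an account.
groups = [assign(f"acct-{i}", "triage-suggestions") for i in range(10_000)]
print({name: groups.count(name) for name in ("holdout", "treatment", "control")})
```

Because the holdout occupies the lowest buckets, raising treatment_share later only moves units from control into treatment and leaves the holdout intact.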

Standard A/B testing fits many product changes. Ranking and retrieval changes can benefit from interleaving when alternatives can be compared within the same experience. Switchback designs can help when time, seasonality, or shared operating conditions make simultaneous assignment misleading. Match the design to the interference in the workflow instead of defaulting to the experiment template you use for deterministic UI changes.

AI variability also changes the readout. Aggregate outcomes across the multiple interactions users have, compare cohorts, and track confidence intervals over time. A snapshot p-value should not overrule an underpowered test, an unstable effect, or a concentrated safety failure. A statistically inconclusive result means the test did not resolve the decision; it does not prove that the feature has no effect.
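
To show why an inconclusive result differs from a null result, the sketch below compares a confidence interval for the lift with the minimum detectable effect. It assumes a simple Wald interval for the difference between two proportions, which is only one of several reasonable choices. The function names, labels, and counts are illustrative.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(success_c, n_c, success_t, n_t, confidence=0.95):
    """Wald interval for (treatment rate - control rate)."""
    if n_c <= 0 or n_t <= 0:
        raise ValueError("both arms need at least one observation")
    p_c, p_t = success_c / n_c, success_t / n_t
    std_err = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff - z * std_err, diff + z * std_err


def readout(interval, mde):
    """Compare the whole interval, not the point estimate, with the threshold."""
    low, high = interval
    if low >= mde:
        return "clears threshold"
    if high < mde:
        return "below threshold"
    return "inconclusive"


# Same observed direction, very different amounts of evidence (made-up counts).
small = diff_confidence_interval(120, 200, 130, 200)
large = diff_confidence_interval(12_000, 20_000, 13_400, 20_000)
print(readout(small, mde=0.05))  # inconclusive
print(readout(large, mde=0.05))  # clears threshold
```

The small test observed a lift equal to the threshold but cannot rule out either zero effect or a large one, so the honest readout is that the decision is still open.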

Prewrite the scale, iterate, and stop rules

I prefer four explicit decision states because they force the readout to connect evidence to action:

Scale: The primary outcome clears the meaningful threshold, guardrails hold, important cohorts do not show an unacceptable reversal, and reliability and cost remain viable.
Iterate the AI system: User intent is strong, but a defined output or system failure blocks value. The next test should target that failure rather than repeat the same broad pilot.
Change the product experience: Offline quality passes, but users cannot discover, trust, control, or efficiently use the capability. Treat this as workflow evidence, not an automatic reason to swap models.
Stop or reframe: Demand is weak, the economics cannot work, the necessary data or access is unavailable, or credible risk remains outside tolerance.
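
Writing the four states as a function forces the team to agree on precedence before the data arrives. The sketch below is one possible ordering, not a rule taken from any framework: the field names are hypothetical, and sending a guardrail or cohort problem with passing system quality to the product-experience state is my simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SCALE = "scale"
    ITERATE_AI = "iterate the AI system"
    CHANGE_EXPERIENCE = "change the product experience"
    STOP = "stop or reframe"


@dataclass(frozen=True)
class Readout:
    demand_is_strong: bool
    economics_can_work: bool
    data_and_access_available: bool
    risk_within_tolerance: bool
    system_quality_passes: bool
    outcome_clears_threshold: bool
    guardrails_hold: bool
    cohorts_acceptable: bool


def decide(r):
    """Map a readout to one of the four precommitted decision states."""
    # Stop conditions come first: no metric win overrides them.
    if not (r.demand_is_strong and r.economics_can_work
            and r.data_and_access_available and r.risk_within_tolerance):
        return Decision.STOP
    # A defined output or system failure blocks value, so fix that next.
    if not r.system_quality_passes:
        return Decision.ITERATE_AI
    if r.outcome_clears_threshold and r.guardrails_hold and r.cohorts_acceptable:
        return Decision.SCALE
    # Assumption: quality passes but the outcome, guardrails, or cohorts do not.
    return Decision.CHANGE_EXPERIENCE


example = Readout(True, True, True, True, True, False, True, True)
print(decide(example).value)  # change the product experience
```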

Risk must be part of the launch rule, not a review added after a positive metric appears. Include toxicity and personally identifiable information checks where relevant, enforce least-privilege access, retain appropriate audit logs, and make rollback operational before exposure. For irreversible financial actions, sensitive regulatory decisions, or any workflow where the acceptable error rate is effectively zero, keep a qualified human approval step or defer autonomy. Faster execution does not compensate for an unacceptable blast radius.

Autonomy should be earned in stages. Begin with assistance that the user can inspect. Move to required approval before actions. Allow autonomous execution only for narrow, low-stakes, reversible actions after stability is demonstrated. Expand permissions and exposure only when monitoring shows that the earlier guardrails continue to hold.

The experiment does not end at launch. Model behavior, retrieval content, user mix, prompts, tools, and operating costs can change. Continue tracking quality, latency, cost per successful outcome, safety, and cohort behavior. Feed new failures into the evaluation set and keep a holdout when the decision warrants one. A weekly readout should identify what changed, which assumption the evidence affected, and what decision follows; it should not become a tour of every available dashboard.

Key takeaways

Start with a precommitted decision contract: user, job, baseline, outcome, threshold, guardrails, and next action.
Validate demand before usability, usability before system capability, and system capability before broad production impact.
Compare the AI with the user’s current alternative, not with an abstract standard of impressive output.
Measure output quality, system performance, user outcomes, and business consequences separately so failures remain diagnosable.
Treat stochastic behavior as a distribution: document configurations, repeat runs, inspect slices, and watch severe outliers.
Use feature flags, holdouts, exposure caps, auditability, and prewritten rollback rules to contain risk while learning.

At your next AI review, ask for the experiment contract instead of another demo. If the team cannot name the dominant risk, current baseline, meaningful threshold, guardrails, and action for each possible result, the next step is not production exposure. It is a sharper test.

Start with the smallest experiment that could credibly invalidate the idea. Evidence that survives that test earns the right to spend more, increase fidelity, and expose more users.

April 27, 2026

The AI PM One-Pager: Radical prototyping requirements for speed, clarity, and truth

I move fastest in generative AI when I strip work down to its essential signals. At HighLevel, I rely on a single-page format, “Prototyping Requirements: The One-Pager for AI PMs,” to turn ideas into testable artifacts within hours, not weeks. This approach reinforces AI strategy, minimizes coordination overhead, and keeps product management focused on learning over ceremony.

“Prototyping requirements go rogue: one page, zero bureaucracy, built for AI. Shape concepts fast, prompt tools directly, and get to the truth sooner.”

In practice, my one-pager captures only what’s required to run an immediate experiment: the user problem, the target behavior change, success signals, core constraints, intended AI workflows, and the smallest realistic path to an evaluable demo. I also include example prompts, guardrails, and evaluation criteria so the team can write prompts and judge LLM output without guessing.

This is eval-driven development in action. I document a minimal hypothesis, concrete inputs/outputs, and a quick plan for metrics, including qualitative signals from product discovery and continuous discovery. By prompting tools directly, we expose assumptions early, shorten feedback loops, and build an AI product toolbox that compounds learning sprint after sprint.

I run this with a product trio to ensure we balance feasibility, usability, and value. We align on risks, dependencies, and what “good” looks like, then we integrate the learnings into product roadmapping and sprint planning. The result: fewer meetings, tighter collaboration, and empowered product teams delivering sharper outcomes with less friction.

If you want speed and clarity without sacrificing rigor, adopt the one-pager. It centers the conversation on evidence, accelerates AI workflows from prompt to prototype, and makes it obvious what to try next—and what to stop doing. Most importantly, it keeps the team focused on truth over theater, which is how great AI products actually ship.

Inspired by this post on Product School.

April 24, 2026
Product Work Is Relationship Work: How I Align Stakeholders Faster and Cut Team Politics

Lately, I keep hearing a familiar question: with AI making it so easy to generate ideas and build products, do we still need product managers? My answer is unequivocal—yes. Tools accelerate delivery, but they don’t build trust, reconcile competing incentives, or create the shared understanding teams need to ship outcomes. Product work is relationship work.

I recently listened to “Product Work Is Relationship Work – All Things Product with Teresa & Petra,” and it echoed what I see every day in high-performing product organizations. If you prefer to watch, here’s the episode on YouTube: https://www.youtube.com/embed/d-0f8uAfc8w?feature=oembed

While AI can help build things faster, it can’t replace the relationship work required to align stakeholders, navigate competing priorities, and create shared understanding across teams. That’s the hard, human part of product management—and it’s not going away.

In my experience, product teams stall when collaboration becomes transactional. We jump to negotiation (“What can you commit by Friday?”) before establishing context (“What problem are we solving and why now?”). When I slow down to get curious—about constraints, incentives, and assumptions—momentum actually increases because we’re rowing in the same direction.

Stakeholder alignment often breaks down when we conflate advocacy with exploration. We argue our viewpoint as if it were the only lens that matters, rather than making space to surface how others see the system. I’ve found the distinction between “dialogue vs. discussion,” rooted in work by Chris Argyris and elaborated in The Fifth Discipline by Peter Senge, to be a powerful reset. Dialogue builds shared understanding; discussion decides. You need both, in the right order.

Language matters in the room. The improv principle “Yes, and” is deceptively simple but transformative. When a designer, engineer, or executive feels heard (“Yes”) and we build on their idea (“and”), we create psychological safety without sacrificing critical thinking. I use “Yes, and” to explore perspectives before we converge on decisions—especially with product trios and senior stakeholders.

Here are the moves I rely on to keep collaboration relational and outcomes-focused. First, we align on outcomes before solutions. I explicitly separate outcomes vs output OKRs so we’re clear on what success looks like, independent of the features we ship. That clarity reduces rework and speeds up decision-making later.

Second, we operationalize curiosity with continuous discovery. I schedule recurring, lightweight touchpoints with customers and internal stakeholders so insights compound. When learning is continuous, debates quiet down—evidence does the heavy lifting.

Third, we invest in relationship rituals. Regular 1:1s with key partners, stakeholder maps that capture motivations, and pre-reads that frame trade-offs all prevent misalignment from surfacing in the last mile. These small habits pay huge dividends in trust and speed.

Fourth, I’m explicit about mode-switching in meetings: are we advocating a position or exploring perspectives? Calling the mode out loud prevents people from mistaking questions for opposition and keeps the conversation productive.

Fifth, we use “Yes, and” to move from possibility to practicality. We explore generously, then converge rigorously—ranking options by impact, effort, and risk so decisions are transparent and fair.

If stakeholder alignment, team dynamics, or product “politics” slow your team down, this conversation offers a practical reframe. You’ll move faster when you build the relational tissue first—because alignment is an accelerant, not a tax.

Resources & Links:

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Mentioned in this episode:

Petra’s Coaching Packages

Work by Chris Argyris on organizational learning and dialogue vs. discussion

The Fifth Discipline: The Art and Practice of the Learning Organization by Peter Senge

Improv principle “Yes, and”: Saying “Yes, and” — A principle for improv, business & life and Yes, and …

Have thoughts on this episode or examples from your team? Leave a comment below—I’d love to learn what’s working (and what’s not) in your stakeholder landscape.

Inspired by this post on Product Talk.

April 14, 2026
Commercial vs. Internal Products: Hard Truths, High Leverage, and How I Make the Call

Internal Products Are Hard; Commercial Products Are Harder. That line captures years of hard-won lessons from leading both internal platforms and market-facing SaaS at HighLevel. I’ve seen how the two demand different muscles—even when the tech stack, talent, and timelines look the same on paper.

When I talk about internal products, I mean services and solutions that our own employees use to take care of customers—customer-enabling tools and services, agent consoles, fulfillment and billing workflows, operations dashboards, and the underlying platforms that keep them fast, compliant, and resilient. These tools don’t generate revenue directly, but they quietly determine customer experience, gross margin, and how quickly we can ship, resolve issues, and scale.

Commercial products, by contrast, add a second layer of challenge. Beyond discovery, usability, and reliability, we must conquer positioning, pricing and packaging, competitive differentiation, sales enablement, procurement hurdles, and an ongoing customer success motion. The surface area for failure is bigger, and the time-to-signal on product-market fit is slower and noisier.

Here’s how I decide where to invest. First, I anchor on outcomes, not output. If the business priority is net revenue retention, faster onboarding, or reduced cost-to-serve, internal products often provide the highest-leverage path. If the priority is new revenue, new market entry, or a must-have differentiator, we lean commercial. I make the trade-off explicit by writing OKRs around outcomes rather than output, so we can defend the decision when pressure mounts.

Second, I run a clear build vs buy calculus. For internal needs, the default is buy if a mature, configurable solution exists that meets our security, data governance, and integration requirements. I only build when the workflow is core to our differentiation, the TCO of customization is lower than vendor sprawl, or we can capture unique proprietary advantage. For commercial products, I avoid embedding third-party IP in a way that caps differentiation or compresses margins as we scale.

Third, I insist on continuous discovery. Internal audiences are not a captive market—they’re discerning experts with real jobs to do. I treat them like customers, with structured customer interviews, journey mapping, and opportunity solution trees. I rely on empowered product teams and product trios to validate problems and reduce solution risk before we commit engineering time.

Fourth, I frame commercial vs internal work with capacity guardrails. In most planning cycles, I reserve explicit allocation for platform scalability and internal tooling, separate from feature bets. Without this, internal products become backlog filler, which guarantees we’ll pay the interest later in churn, SLA breaches, and slower delivery.

Execution differs too. For internal products, change management is the make-or-break. I plan enablement as a first-class deliverable: clear rollouts, in-app guides, training, and feedback loops with frontline champions. I track adoption, time-to-resolution, error rate, and satisfaction for internal users with the same rigor we apply to external users.

For commercial products, I design the discovery-to-GTM handshake early. Pricing and packaging must reflect value drivers discovered in research, not what’s easiest to meter. Sales and solutions engineering need crisp narratives, objection handling, and proof points. Customer success needs activation plans and health signals tied directly to leading indicators of retention.

Across both, I instrument the product and process. I lean on feature flags and progressive delivery to manage risk, and I protect SLOs with error budgets so teams balance reliability with iteration speed. CI/CD isn’t a badge—it’s how we earn the right to ship continuously without eroding trust.

Common pitfalls recur. Teams skip UX for employee tools because “they have to use it”—which backfires as shadow workflows and rework. Leaders underfund internal platforms, then wonder why velocity stalls. On the commercial side, teams over-index on features and under-invest in positioning and onboarding, leading to poor activation and elongated sales cycles.

What’s the payoff? When we treat internal products as products, we unlock scale: shorter handling times, fewer escalations, clearer accountability, and higher customer satisfaction. When we approach commercial products with the same discovery rigor plus smart GTM, we compress time-to-value and amplify differentiation. The craft is knowing which lever to pull when—and having the discipline to measure what matters.

My rule of thumb is simple. If the goal is operational excellence that compounds across the entire customer journey, invest in internal products with the same intensity you reserve for revenue-generating features. If the goal is market expansion or category leadership, invest in commercial products with a tight discovery-to-GTM loop. In either case, clarity of outcomes, disciplined discovery, and empowered teams win the day.

Inspired by this post on SVPG.

April 9, 2026
Beat AI FOMO: A Product Leader’s Playbook to Choose Tools, Stay Focused, and Learn Deeply

Lately, it feels like every morning brings a new AI launch, a dazzling demo, or a must-try tool. I love the pace of innovation, but the constant stream can trigger counterproductive FOMO if I’m not intentional. As a product leader, I’ve learned to turn that anxiety into a disciplined learning system—one that keeps me curious without letting novelty hijack my focus.

That’s exactly why this conversation with Petra Wille and Teresa Torres resonated with me. They explore how to stay experimental in the AI era without chasing every shiny object. Their perspective aligns closely with my own operating cadence: start with real problems, go deep on a small set of tools, and create explicit boundaries between work, learning, and play.

Here’s the mindset I apply. I don’t start with tools—I start with problems. When I encounter concrete friction in a workflow or see a credible opportunity to improve an outcome, that’s my trigger to explore a new capability. This mirrors the continuous discovery habit of prioritizing opportunities over solutions, and it’s how I avoid performing “innovation theater.”

To keep exploration healthy, I time-box my learning. I block recurring windows specifically for experiments, reading, and hands-on trials so they don’t overrun my core product work. During these blocks, I’ll set a clear question, run a tight test, and capture what I learned. No rabbit holes, no endless tinkering.

I also separate “interesting” from “actionable.” Plenty of inputs are worth awareness, but very few deserve immediate action. I bookmark the rest for later. This simple filter reduces cognitive load and keeps my backlog—from ideas to proofs of concept—well-governed.

Social media can amplify technology hype cycles, so I establish boundaries. I batch consumption, mute low-signal channels, and prioritize practitioner communities over performative threads. The goal isn’t to be first; it’s to be right for my customers, my team, and our strategy.

When choosing what to try next, I use a practical rubric. Does the tool target a real friction I’ve seen in discovery or delivery? Can it plug cleanly into our AI workflows without unsustainable glue work? Do we have a safe, compliant way to test it? Is there a plausible path from trial to compounding value? If the answer isn’t a confident yes to most of these, I wait.

Depth beats breadth. I’d rather take one promising tool into a real use case, instrument it, and measure outcomes than skim ten trending demos. That tighter loop produces sharper intuition, clearer product bets, and better partner decisions. A quick opportunity solution tree helps me connect user pain to outcomes before I let any solution onto the field.

In the episode, Petra Wille and Teresa Torres talk candidly about managing FOMO, deciding which tools to explore, and designing intentional learning systems. They discuss why starting with a problem is more valuable than starting with a tool, how social media amplifies technology FOMO, and why going deeper with fewer tools can lead to better learning. If you’ve ever felt like you’re falling behind because you haven’t tried the latest AI tool yet, this conversation will help you rethink how you approach learning and experimentation.

If you’re curious about what came up, here are some of the tools and communities mentioned: Claude Code, OpenClaw (formerly Clawdbot, Moltbot), NotebookLM, Product Talk, ElevenLabs, Lenny’s Newsletter Community, and even a nod to Bridgerton for a touch of levity.

My takeaway is simple but powerful: curiosity doesn’t require constant experimentation. The best product managers cultivate a balanced system—grounded in product discovery, energized by focused experiments, and protected by clear boundaries—so we can learn faster while staying pointed at outcomes that matter.

Discussion Question: How do you decide which new tools or technologies are worth exploring—and which ones you can safely ignore?

Resources & Links: Follow Teresa Torres: https://ProductTalk.org | Follow Petra Wille: https://Petra-Wille.com

Have thoughts on this episode? Leave a comment below.

Inspired by this post on Product Talk.

April 7, 2026
Stop Misleading A/B Tests: Master Sample Size Assumptions for Reliable Results

I’ve learned the hard way that sample size calculators can be both empowering and deceptive. They feel wonderfully precise, but they’re only as trustworthy as the assumptions you feed them. When I lead A/B testing at scale, I treat the calculator as a planning tool, not a verdict—then I systematically validate the assumptions behind it so our decisions stay rigorous and our roadmap stays credible.

At a minimum, most calculators assume you know your baseline rate, your “minimum detectable effect (MDE),” your desired statistical power, and your significance level. They also quietly assume independent observations, clean randomization, stable traffic quality, and a fixed test horizon with no peeking. If any of those break, the “right” sample size can be wildly wrong—and the test conclusions can nudge teams toward the wrong product or go-to-market bet.

Baseline and variance come first for me. I estimate the baseline conversion (and volatility) from recent behavior using behavioral analytics, sanity-check it across key segments, and look for seasonality. Tools like Amplitude analytics help me spot anomalies, bots, or instrumentation drift. If baseline is unstable or highly skewed, I either stabilize it with longer lookbacks or narrow the target segment to reduce noise.

Setting the “minimum detectable effect (MDE)” is where product strategy meets statistics. I work backward from an outcome that actually matters: the revenue, retention, or activation uplift that justifies the opportunity cost of building and running the experiment. If that effect size is implausible given historic lift and variance, I rethink the scope or stack changes into a sequenced set of learning experiments rather than overpromising a single moonshot.

For power and alpha, I default to 80–90% power and a 5% significance level unless the downside risk of a false positive is unusually high, in which case I tighten alpha. I choose one-tailed tests only when we would not act on a negative result and we’ve explicitly pre-registered that decision; otherwise, two-tailed is safer for real-world ambiguity.
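To make those inputs concrete, here is a minimal sketch of the arithmetic most calculators run for a conversion metric. It assumes a two-proportion z-test with the unpooled variance approximation; the function name `sample_size_per_group` and its parameters are my own illustration, not any specific calculator's API, and other tools may use slightly different formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate users needed per arm for a two-proportion z-test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  absolute lift to detect, e.g. 0.02 means 10% -> 12%
    """
    p1 = baseline
    p2 = baseline + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be in (0, 1) and the MDE must be non-zero")

    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)

    # Sum of the two binomial variances (unpooled approximation).
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Compare the cost of 80% vs 90% power for the same baseline and MDE.
    for power in (0.80, 0.90):
        print(power, sample_size_per_group(0.10, 0.02, power=power))
```

Running it with a few different MDE values is a quick way to see how sharply the required sample grows as the effect you want to detect shrinks.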

Randomization and independence are where many tests quietly fail. I randomize at the user level (not session or pageview), guard against cross-device contamination, and ensure consistent exposure via feature flags. If there’s shared context—say, team-based usage or geographic clustering—I account for it via cluster randomization or acknowledge the inflated variance it can introduce.
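One common way to get stable user-level assignment is to hash the user ID together with the experiment name. The sketch below is an assumption about how that could look, not a description of any particular feature-flag product; `assign_variant` and the example experiment key are hypothetical names.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment.

    The same user always lands in the same variant, regardless of
    session or device, as long as the same user_id is supplied.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer for a near-uniform bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    print(assign_variant("user-42", "checkout-cta"))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who sees the treatment in one test is not systematically placed in the treatment of the next.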

Traffic allocation integrity is non-negotiable. I monitor for sample ratio mismatch (SRM) by comparing observed group splits to the intended allocation, and I pause immediately if the split drifts further than chance would explain. When SRM appears, the root cause is often instrumentation gaps, eligibility filters applied asymmetrically, or caching layers. Fixing that early preserves trust in every test that follows.
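A sample ratio mismatch check is usually a chi-square goodness-of-fit test on the group counts. Here is a minimal two-arm sketch using only the standard library. The function names and the 0.001 alert threshold are my assumptions; the threshold is deliberately strict so that routine noise does not trigger a pause.

```python
import math


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_n + treatment_n
    if total == 0 or not 0 < expected_treatment_share < 1:
        raise ValueError("need observations and an expected share in (0, 1)")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, treatment_n, expected_treatment_share=0.5, threshold=0.001):
    """Flag the split when it is very unlikely under the intended allocation."""
    return srm_p_value(control_n, treatment_n, expected_treatment_share) < threshold


if __name__ == "__main__":
    print(srm_p_value(5000, 5050), has_srm(5000, 5050))
    print(srm_p_value(5000, 5600), has_srm(5000, 5600))
```

For tests with more than two arms, the same idea applies with more degrees of freedom, at which point a statistics library is the easier route.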

Fixed-horizon math assumes no peeking. If stakeholders need continuous reads, I use sequential testing methods with alpha spending or always-valid approaches designed for ongoing monitoring. If we commit to a fixed horizon, we stay disciplined: no early looks, no midstream metric swaps, no retrofitted hypotheses.
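To show what alpha spending means in practice, the sketch below computes how much of the total false-positive budget is released at each planned look, using the Lan-DeMets O'Brien-Fleming-type spending function. The function names are hypothetical, and this only illustrates the budget: converting it into actual stopping boundaries requires numerical integration that established group-sequential libraries handle.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given fraction of the planned sample.

    Uses the O'Brien-Fleming-type spending function:
    alpha(t) = 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t)))
    """
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return 2 * (1 - z.cdf(z_crit / information_fraction ** 0.5))


def spending_schedule(looks, alpha=0.05):
    """Return (fraction, cumulative alpha, incremental alpha) for each planned look."""
    schedule, previous = [], 0.0
    for t in looks:
        spent = obf_alpha_spent(t, alpha)
        schedule.append((t, spent, spent - previous))
        previous = spent
    return schedule


if __name__ == "__main__":
    for fraction, cumulative, increment in spending_schedule([0.25, 0.5, 0.75, 1.0]):
        print(f"{fraction:.2f}  cumulative={cumulative:.5f}  increment={increment:.5f}")
```

The schedule makes the trade visible: early looks get very little of the budget, so an early stop needs an overwhelming effect, and most of the alpha is preserved for the final read.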

Multiple comparisons can quietly inflate false positives. I predeclare one primary metric to decide, define guardrail metrics to protect experience and revenue, and apply appropriate corrections (for example, controlling the false discovery rate) when testing many variants or slicing results by numerous segments.
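For the false discovery rate correction mentioned above, the Benjamini-Hochberg procedure is a common choice. This is a minimal standard-library sketch; the function name and the example p-values are illustrative, not results from a real test.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    segment_p_values = [0.02, 0.03, 0.035, 0.5]  # made-up values for illustration
    print(benjamini_hochberg(segment_p_values))
```

Note the step-up behavior: a p-value that misses its own rank threshold can still be rejected if a larger p-value further down the list clears its threshold.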

Duration and seasonality matter more than most roadmaps admit. I run through full business cycles (at least one complete week to cover day-of-week patterns, longer for B2B buying rhythms), plan for novelty effects, and watch for behavior settling after initial exposure. If the intervention changes long-run behavior, I extend the measurement window or add a post-test holdout to capture durable impact.
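Translating a required sample size into a run length is simple arithmetic, but rounding up to whole weeks is easy to forget. A rough sketch, assuming `eligible_users_per_day` counts new unique users entering the experiment each day (returning users would stretch the real duration); the function name is hypothetical.

```python
import math


def planned_duration_days(required_per_group, n_groups, eligible_users_per_day, min_weeks=1):
    """Days to run, rounded up to whole weeks so every weekday is covered equally."""
    if eligible_users_per_day <= 0:
        raise ValueError("eligible_users_per_day must be positive")
    total_needed = required_per_group * n_groups
    raw_days = math.ceil(total_needed / eligible_users_per_day)
    weeks = max(min_weeks, math.ceil(raw_days / 7))
    return weeks * 7


if __name__ == "__main__":
    print(planned_duration_days(required_per_group=3839, n_groups=2, eligible_users_per_day=1000))
```

For longer buying cycles, raising `min_weeks` encodes the business rhythm directly instead of relying on the sample size alone.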

Not all metrics are binomial. For revenue, time-on-task, or heavy-tailed distributions, I confirm variance assumptions, use robust estimators or bootstrapping, and consider variance reduction methods like CUPED (Controlled-experiment Using Pre-Experiment Data) to improve power without overextending duration. The calculator’s simplicity should not mask the data’s complexity.
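CUPED is easier to reason about with the adjustment written out. The sketch below subtracts the part of each user's metric that is predictable from the same user's pre-experiment value. The function names and sample numbers are illustrative assumptions; in a real analysis the coefficient is estimated on data pooled across all arms, and the covariate must be measured before exposure.

```python
from statistics import fmean, pvariance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)), theta = cov(x, y) / var(x)."""
    if len(metric) != len(pre_metric) or len(metric) < 2:
        raise ValueError("need two equal-length series with at least two users")
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    var_x = sum((x - mean_x) ** 2 for x in pre_metric)
    if var_x == 0:
        return list(metric)  # a constant covariate carries no information
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric))
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


def variance_reduction(metric, pre_metric):
    """Share of variance removed by the adjustment (0.0 means no gain)."""
    original = pvariance(metric)
    if original == 0:
        return 0.0
    return 1 - pvariance(cuped_adjust(metric, pre_metric)) / original


if __name__ == "__main__":
    pre = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up pre-period spend per user
    post = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up in-experiment spend per user
    print(variance_reduction(post, pre))
```

The adjustment leaves the mean of the metric unchanged, so the treatment effect estimate is preserved while its variance shrinks in proportion to how well the pre-period value predicts the outcome.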

Finally, I connect experimentation to product outcomes. I map hypotheses to a driver tree, ensure each test ladders to activation, retention, or monetization, and document assumptions up front so we learn even when results are null. The result is a culture that respects the math and moves faster precisely because we trust our reads.

Here’s the practical checklist I use before pressing “Start”: validate baseline and variance from recent behavior; set an MDE tied to meaningful business impact; choose power and alpha explicitly; confirm user-level randomization and stable exposure; watch for sample ratio mismatch; align on fixed-horizon vs sequential testing; predeclare a single primary metric and guardrails; run long enough to cover seasonality; use robust methods for non-binomial metrics; and write a brief pre-read so the whole team commits to the plan.

When we honor these assumptions, sample size calculators become sharp instruments rather than blunt ones. You’ll ship fewer misleading wins, avoid costly false negatives, and build a repeatable experimentation engine that compounds learning—and results—over time.

Inspired by this post on Amplitude – Perspectives.

April 6, 2026