Tag: AI product toolbox

Claude Code for Product Managers: Accelerate Prototypes, Validate Faster, Ship with Confidence

I build products under constant pressure to learn faster without breaking trust. Claude Code has become a pragmatic addition to my AI product toolbox because it helps me move from idea to evidence with less friction—while keeping engineering, design, and compliance in the loop.

“Claude Code for Product Managers explained: what it is, why it matters, and how it helps PMs prototype, validate, and move faster.” That line captures the essence. In practice, I use it to turn ambiguous problem statements into tangible artifacts—API stubs, SQL queries, test data, and lightweight prototypes—that sharpen conversation and accelerate decision cycles.

What is it in PM terms? A code-aware assistant that helps me prototype safely and quickly. I can generate example API calls, transform messy CSVs for retention analysis, draft instrumentation plans for Amplitude analytics, or spin up a mock service to validate an integration. Because it understands structure, it’s effective at scaffolding small utilities (e.g., a data cleaner or a CLI harness) that make discovery and validation faster.

Day to day, Claude Code reduces handoffs. If I’m exploring a new partner integration, I’ll have it produce a curl library and a Postman collection, then annotate each step with acceptance criteria and expected responses. When I’m shaping a feature, I lean on it to outline event taxonomies and feature flags so that engineering can wire telemetry without guesswork. For insights work, I’ll ask it to propose SQL for cohort, funnel, and retention analysis—always verifying against source schemas before anything touches production.

Speed is only useful when it improves signal quality. I anchor the workflow in continuous discovery: small hypotheses, thin-slice prototypes, and fast instrumentation. Claude Code helps me estimate A/B testing readiness (including minimum detectable effect), generate smoke tests for critical user paths, and structure an eval-driven development loop so we learn from every iteration. It also supports context window management by summarizing long PRDs into the few constraints a prototype must respect.

Governance matters. I apply AI readiness and AI risk management principles: never paste secrets or PII, isolate sandboxes, and log prompts as docs-as-code for auditability. I prefer a retrieval-first pipeline that feeds approved product docs, OpenAPI specs, and design tokens so generations stay grounded. When tools are integrated, I favor the Model Context Protocol (MCP) to constrain capabilities and maintain least-privilege access. Human-in-the-loop review is non-negotiable—especially for anything that might influence customer data or pricing.

The best outcomes show up in product trios. I’ll facilitate a live session with design and engineering: we co-create prompts, compare alternatives, and converge on a thin slice we can ship. That collaboration keeps us empowered, reduces interpretation drift, and turns Claude Code into an accelerant rather than a sidecar. Over time, the trio curates a reusable prompt library for PRD outlines, experiment checklists, and integration playbooks.

Getting started is straightforward: define a safe environment, assemble your authoritative corpus (requirements, specs, taxonomies), and codify a few high-value templates—API exploration, instrumentation plans, sandbox data generators, and acceptance tests. Track impact with simple, objective metrics: cycle time from hypothesis to instrumented prototype, time-to-first-signal, and the proportion of decisions made with data versus opinion.

There are pitfalls. Hallucinated fields can creep into API calls, schema drift can break generated queries, and “clever” refactors may miss edge cases. I mitigate this by grounding generations in current specs, asking for unit tests alongside any code, and validating against a staging environment before anyone talks about production. Treat Claude Code as a collaborator, not an oracle.

If your mandate is to learn faster, de-risk bets, and ship with confidence, Claude Code is worth adopting. Used thoughtfully, it compresses the distance between questions and answers, elevates product discovery, and lets teams validate more ideas with fewer meetings—without compromising on governance or quality.

Inspired by this post on Product School.

June 12, 2026
The AI PM One-Pager: Radical prototyping requirements for speed, clarity, and truth

I move fastest in Generative AI when I strip work down to its essential signals. At HighLevel, I rely on a single-page format—”Prototyping Requirements: The One-Pager for AI PMs”—to turn ideas into testable artifacts within hours, not weeks. This approach reinforces AI Strategy, minimizes coordination overhead, and keeps Product Management focused on learning over ceremony.

“Prototyping requirements go rogue: one page, zero bureaucracy, built for AI. Shape concepts fast, prompt tools directly, and get to the truth sooner.”

In practice, my one-pager captures only what’s required to run an immediate experiment: the user problem, the target behavior change, success signals, core constraints, intended AI workflows, and the smallest realistic path to an evaluable demo. I also include example prompts, guardrails, and evaluation criteria so the team can apply prompt engineering and LLMs for product managers without guessing.

This is eval-driven development in action. I document a minimal hypothesis, concrete inputs/outputs, and a quick plan for metrics, including qualitative signals from product discovery and continuous discovery. By prompting tools directly, we expose assumptions early, shorten feedback loops, and build an AI product toolbox that compounds learning sprint after sprint.

I run this with a product trio to ensure we balance feasibility, usability, and value. We align on risks, dependencies, and what “good” looks like, then we integrate the learnings into product roadmapping and sprint planning. The result: fewer meetings, tighter collaboration, and empowered product teams delivering sharper outcomes with less friction.

If you want speed and clarity without sacrificing rigor, adopt the one-pager. It centers the conversation on evidence, accelerates AI workflows from prompt to prototype, and makes it obvious what to try next—and what to stop doing. Most importantly, it keeps the team focused on truth over theater, which is how great AI products actually ship.

Inspired by this post on Product School.

April 24, 2026

Build an AI Toolbox That Improves Product Management

You have an interview transcript waiting to be synthesized, a roadmap debate with more opinions than evidence, and a stakeholder update due before the decisions are settled. A general-purpose chatbot can help with each task. It can also produce a polished version of the wrong answer.

I’ve evaluated dozens of generative AI products against the work product managers actually do, from discovery through launch. The useful pattern is simple: choose a recurring decision, connect the model to the evidence for that decision, define the human review, and measure whether the workflow improves. The tool is only one part of that system.

Start with the decision that needs to improve

If you begin with a tool, its demo will define your use case. You will end up generating summaries, specifications, and slide copy because those outputs are easy to show, not because they remove the most important constraint in your product process.

Begin with a decision that is slow, inconsistent, or poorly supported. Write the workflow in one sentence:

When [trigger occurs], [owner] uses [approved evidence] to produce [decision artifact], which [reviewer] checks before [downstream action]. Success is measured by [workflow metric] and [product metric].

For customer discovery, that might become: when an interview round closes, the product manager uses transcripts, participant metadata, and the research question to produce a theme map and a list of unresolved questions. A research or design partner checks the evidence before the findings enter an opportunity solution tree. Synthesis time, evidence corrections, and the quality of the next research questions show whether the workflow is helping.

A strong first use case has four properties:

It recurs. A workflow used repeatedly gives you enough opportunities to find failure modes and improve the prompt.
Its evidence is bounded. You can identify the transcripts, event definitions, strategy documents, or decision logs the model is allowed to use.
A qualified person can review it. The reviewer knows what a plausible but unsupported answer looks like.
The improvement is observable. You can compare cycle time, rework, evidence quality, or another meaningful measure before and after introducing AI.

My rule is to start with frequent, evidence-rich work where a mistake is reversible. Interview synthesis, experiment readouts, roadmap option framing, and release communication are usually better learning environments than an autonomous decision that immediately changes customer data or launches an experience.

Capture a baseline before changing the workflow. Record how long the work takes, where review cycles occur, which errors appear repeatedly, and what downstream decision the artifact supports. Without that baseline, faster drafting can look like progress even when reviewers spend the saved time correcting unsupported claims.

Build the toolbox in layers, not by brand

An effective product management toolbox connects LLMs, research synthesis, behavioral analytics, and lightweight automation. These layers solve different problems. Buying several products that all generate text does not create a complete system.

Tool layer	Best PM job	Evidence it needs	Useful output	Main failure to catch
General-purpose LLM workspace	Framing, critique, drafting, and option generation	Objective, constraints, definitions, and approved documents	Questions, alternatives, structured drafts, and decision briefs	Confident invention or generic advice detached from product context
Research synthesis	Organizing customer interviews and qualitative feedback	Transcripts, participant identifiers, segment metadata, and research questions	Evidence-linked themes, contradictions, unmet needs, and follow-up questions	Treating a small sample as market prevalence or erasing minority views
Behavioral analytics	Finding where behavior changes and sizing an opportunity	Event definitions, entity grain, cohorts, funnels, paths, retention views, and experiment results	Drop-off patterns, affected segments, anomalies, and testable hypotheses	Turning correlation into causation or analyzing an incorrectly defined event
Knowledge and retrieval layer	Grounding answers in current product context	Strategy, decision logs, research, taxonomy, policies, and product documentation	Traceable answers with evidence and visible conflicts	Retrieving stale, unauthorized, or contradictory material without warning
Workflow and experience automation	Moving an approved decision into repeatable execution	Approved copy, segments, triggers, stop conditions, owners, and measurement events	In-app guides, product tours, handoffs, checklists, and status updates	Publishing or acting before human approval, measurement, or rollback is ready

Use the table to expose missing layers. If research synthesis is strong but event definitions are unreliable, another writing assistant will not improve opportunity sizing. If analytics is mature but the model cannot access the current strategy or decision history, its prioritization advice will remain generic. If automation is available but ownership and rollback are unclear, speed will amplify operational risk.

Evaluate each candidate against the workflow, not against a feature checklist. Ask:

Can it work where the approved evidence already lives, or will people create uncontrolled copies?
Can a reviewer trace a conclusion back to a transcript, event definition, document, or decision record?
Can access, retention, sharing, and deletion follow your data governance rules?
Can you test a stable workflow with representative examples instead of judging a polished demo?
Can you observe failures, corrections, latency, and cost after rollout?
Does the total cost include integration, governance, evaluation, review time, and maintenance rather than only the license?

A vendor can be impressive and still be wrong for your operating environment. The decisive question is whether it strengthens a specific product decision without weakening evidence quality, privacy, or accountability.

Turn the tools into repeatable PM workflows

The prompt is not the workflow. A production workflow includes prepared inputs, an output contract, a review step, a decision owner, and a place to record what happened. The following patterns cover the PM work where AI can create leverage without pretending to replace product judgment.

Synthesize interviews without manufacturing certainty

Qualitative synthesis becomes unreliable when the model merges observation, interpretation, and recommendation into one smooth narrative. Preserve those boundaries. Give each participant a stable identifier, retain relevant segment context, and tell the model to cite the evidence behind every theme.

Copy-paste prompt: Act as a product research analyst. Use only the supplied interviews and research brief. For each theme, return the claim, supporting participant identifiers, contradictory evidence, affected segment, confidence with a reason, and the next unanswered question. Separate direct observations from interpretation and recommendation. Do not infer market prevalence from this interview sample. If a conclusion lacks evidence, label it unsupported.

Review the output by opening the cited passages, not by judging whether the summary sounds plausible. Look for participants who do not fit the dominant theme. Check whether two different needs have been combined because they use similar words. Confirm that the model has not converted the loudest quotation into the most important opportunity.

Only then move the findings into your discovery structure. The useful handoff is not a list of themes. It is a set of evidence-backed needs, open questions, affected segments, and assumptions that the product trio can investigate.

Combine behavioral data with customer evidence

Behavioral analytics can tell you where users drop out, which segments behave differently, and whether a pattern is large enough to deserve attention. It does not tell you why the behavior occurred. Interviews can reveal possible motivations, but a qualitative sample does not establish how common each motivation is. Use the two evidence types together without asking either to answer the other’s question.

Before involving an LLM, verify the event name, event meaning, user or account grain, relevant cohort, and analysis window. If instrumentation changed, include that context. Prefer aggregated or appropriately governed data; do not paste raw personal or confidential customer data into an unapproved model.

Copy-paste prompt: Use the supplied event definitions, cohort table, funnel, and interview themes. Identify the largest observed behavior changes by segment. For each change, distinguish the observed fact from possible explanations. List data quality questions, supporting customer evidence, conflicting customer evidence, and the cheapest analysis or experiment that could reduce uncertainty. Do not claim causation from a correlation.

Return to the analytics system to validate every material claim. The model is useful for connecting evidence and generating hypotheses; the governed analytics layer remains the place to confirm event behavior, segment definitions, retention patterns, and experiment results.

Frame roadmap choices as options, not generated certainty

A roadmap debate rarely fails because nobody can generate feature ideas. It fails when alternatives, assumptions, constraints, and expected outcomes are implicit. AI is most useful here as an argument compiler: it can turn scattered evidence into comparable options and expose what each option requires you to believe.

Copy-paste prompt: Use the supplied product objective, customer evidence, behavioral evidence, strategic constraints, technical constraints, and decision history. Create a set of distinct options rather than a ranked feature backlog. For each option, state the target outcome, supporting evidence, contradictory evidence, critical assumptions, excluded alternatives, leading indicator, delivery risk, and cheapest test. Flag any recommendation that lacks a traceable source. Do not make the final priority decision.

This format makes outcome-versus-output confusion visible. An option such as build a new onboarding checklist is an output. Improve successful first-time setup for a defined customer segment is an outcome. The first can support the second, but the relationship is still a hypothesis. Keep that hypothesis visible in the roadmap and in the experiment plan.

The human decision owner should record the selected option, why it won, what evidence mattered, which assumption remains unresolved, and when the decision should be revisited. That decision log becomes grounding material for later planning instead of forcing the next model session to reconstruct context from scattered documents.

Move an approved launch into an observable experience

Once the decision is approved, AI can reduce the mechanical work of adapting positioning into release notes, support context, product tours, and in-app guides. The risky part is not drafting the words. It is allowing generated content to reach the wrong segment, appear at the wrong moment, or launch without a measurement and stop condition.

Copy-paste prompt: Using only the approved positioning, UX terminology, target segment, trigger event, and product constraints, draft an in-app sequence. For each message, state its purpose, trigger, target user, action requested, dismissal behavior, stop condition, and measurement event. Preserve the approved claim boundaries. Flag any copy that introduces a benefit, capability, or promise not present in the supplied material.

Review the experience in context. Confirm that the audience definition matches the analytics definition, the trigger can actually be observed, the requested action exists in the current interface, and users can dismiss or complete the sequence. Keep experiment design and success analysis outside the copy generator. Fluent wording cannot declare the launch successful.

Make every output inspectable before it becomes operational

The difference between a useful personal assistant and a dependable organizational workflow is inspectability. A reviewer must be able to see which evidence was available, which instructions shaped the answer, what the model produced, what a person changed, and which decision followed.

Use a retrieval-first pipeline grounded in product documents and decision logs. Do not rely on model memory for current strategy, naming, policy, or product behavior. Define an authority order for conflicting material. A current approved decision record should not silently lose to an older planning document simply because the older document contains more text.

Your grounding layer should preserve permissions. Retrieval is not an excuse to expose every document to every workflow. Record the owner and freshness of important material, remove obsolete versions from the approved collection, and instruct the model to show conflicts instead of resolving them invisibly.

Treat each repeated prompt as a small product surface with a contract:

Goal: the decision or artifact the workflow must support.
Allowed evidence: the documents, data, and tools the model may use.
Definitions: the product terms, entities, events, segments, and metrics that must remain consistent.
Method constraints: what the model must separate, preserve, cite, or avoid inferring.
Output contract: the required fields, order, labels, and evidence links.
Uncertainty behavior: when to flag missing context, conflicting inputs, or unsupported conclusions.
Review and stop conditions: who approves the output and what prevents it from moving downstream.

Then create an evaluation set from representative work. Include ordinary inputs, ambiguous cases, conflicting documents, incomplete evidence, sensitive-data traps, and previously observed failures. A good evaluation checks groundedness, traceability, coverage, decision usefulness, confidentiality, and consistency. Writing quality matters, but polish is not evidence.

Re-run the evaluation whenever the model, prompt, connector, knowledge collection, event taxonomy, or output schema changes. A workflow that passed yesterday’s cases can regress when one dependency changes. This is why eval-driven development, observability, privacy-by-design, and AI risk management belong in the product manager’s toolbox rather than in a separate governance document.

For each operational run, retain enough information to diagnose failure: workflow name, input sources, prompt or configuration version, output, reviewer corrections, final decision, latency, and cost where available. The record should support improvement without retaining sensitive data longer than your policy permits.

A screenshot checklist can make the workflow easier to teach and audit. Capture the approved input location, relevant access setting, prompt configuration, evidence-linked output, human edits, final decision record, and measurement view. Screenshots do not replace logs or documentation, but they give PMs and stakeholders the same operating picture during onboarding and review.

Scale adoption through gates and measurable outcomes

Do not roll an AI tool out to every product manager and hope good practices emerge. Move one workflow through explicit gates:

Baseline the current workflow. Record cycle time, review effort, recurring errors, and the downstream outcome it supports.
Run in shadow mode. Produce the AI-assisted artifact without allowing it to drive the real decision. Compare it with the normal process and save failure cases.
Introduce assisted use. Let a named human owner use and edit the output. Require evidence checks before it reaches stakeholders or customers.
Standardize the operating pattern. Publish the input rules, prompt contract, evaluation set, owner, storage location, escalation path, and fallback process.
Expand only after the workflow holds up. Add users, data sources, or automation after quality, privacy, and review behavior remain dependable.

Measure the workflow at more than one level. Cycle time tells you whether work moves faster. Correction rate and review effort show whether speed is hiding rework. Evidence coverage shows whether claims can be defended. The linked product metric shows whether the artifact supports a meaningful outcome. Total cost tells you whether licenses, integration, evaluations, governance, and human review are worth the saved effort.

Do not count prompts submitted, words generated, summaries created, or seats assigned as product impact. Those are activity measures. A workflow is valuable when it shortens a real decision cycle, improves the evidence behind a decision, reduces preventable rework, or helps the team learn about an outcome sooner.

Pause or roll back the workflow when material claims cannot be traced, confidential data crosses an unapproved boundary, reviewers begin rubber-stamping output, small configuration changes cause unpredictable recommendations, or the review and governance burden cancels the useful gain. A graceful fallback to the previous process is part of the design, not an admission that AI failed.

Key takeaways

Choose a recurring product decision before choosing an AI product.
Combine LLMs, research synthesis, behavioral analytics, grounded knowledge, and automation only where the workflow needs them.
Require bounded evidence, visible uncertainty, traceable claims, and a named human decision owner.
Turn repeated prompts into governed contracts with evaluations, observability, and clear stop conditions.
Judge the toolbox by cycle time, evidence quality, rework, product learning, and total cost rather than by generated output.

This week, select one recurring PM decision and write its workflow sentence. Baseline the current process, run the AI-assisted version in shadow mode, and save every failure as an evaluation case. Your toolbox becomes valuable when it improves a decision you can defend, not when it produces more material to review.

References

Shivam.Consulting Blog – My Essential AI Toolbox for Product Managers: Tested Picks, Prompts, Workflows + Checklists

April 22, 2026

Beat AI FOMO: A Product Leader’s Playbook to Choose Tools, Stay Focused, and Learn Deeply

Lately, it feels like every morning brings a new AI launch, a dazzling demo, or a must-try tool. I love the pace of innovation, but the constant stream can trigger counterproductive FOMO if I’m not intentional. As a product leader, I’ve learned to turn that anxiety into a disciplined learning system—one that keeps me curious without letting novelty hijack my focus.

That’s exactly why this conversation with Petra Wille and Teresa Torres resonated with me. They explore how to stay experimental in the AI era without chasing every shiny object. Their perspective aligns closely with my own operating cadence: start with real problems, go deep on a small set of tools, and create explicit boundaries between work, learning, and play.

Listen to this episode on: Spotify | Apple Podcasts

Here’s the mindset I apply. I don’t start with tools—I start with problems. When I encounter concrete friction in a workflow or see a credible opportunity to improve an outcome, that’s my trigger to explore a new capability. This mirrors the continuous discovery habit of prioritizing opportunities over solutions, and it’s how I avoid performing “innovation theater.”

To keep exploration healthy, I time-box my learning. I block recurring windows specifically for experiments, reading, and hands-on trials so they don’t overrun my core product work. During these blocks, I’ll set a clear question, run a tight test, and capture what I learned. No rabbit holes, no endless tinkering.

I also separate “interesting” from “actionable.” Plenty of inputs are worth awareness, but very few deserve immediate action. I bookmark the rest for later. This simple filter reduces cognitive load and keeps my backlog—from ideas to proofs of concept—well-governed.

Social media can amplify technology hype cycles, so I establish boundaries. I batch consumption, mute low-signal channels, and prioritize practitioner communities over performative threads. The goal isn’t to be first; it’s to be right for my customers, my team, and our strategy.

When choosing what to try next, I use a practical rubric. Does the tool target a real friction I’ve seen in discovery or delivery? Can it plug cleanly into our AI workflows without unsustainable glue work? Do we have a safe, compliant way to test it? Is there a plausible path from trial to compounding value? If the answer isn’t a confident yes to most of these, I wait.

Depth beats breadth. I’d rather take one promising tool into a real use case, instrument it, and measure outcomes than skim ten trending demos. That tighter loop produces sharper intuition, clearer product bets, and better partner decisions. A quick opportunity solution tree helps me connect user pain to outcomes before I let any solution onto the field.

In the episode, Petra Wille and Teresa Torres talk candidly about managing FOMO, deciding which tools to explore, and designing intentional learning systems. They discuss why starting with a problem is more valuable than starting with a tool, how social media amplifies technology FOMO, and why going deeper with fewer tools can lead to better learning. If you’ve ever felt like you’re falling behind because you haven’t tried the latest AI tool yet, this conversation will help you rethink how you approach learning and experimentation.

If you’re curious about what came up, here are some of the tools and communities mentioned: Claude Code, OpenClaw (formerly Clawdbot, Moltbot), NotebookLM, Product Talk, ElevenLabs, Lenny’s Newsletter Community, and even a nod to Bridgerton for a touch of levity.

My takeaway is simple but powerful: curiosity doesn’t require constant experimentation. The best product managers cultivate a balanced system—grounded in product discovery, energized by focused experiments, and protected by clear boundaries—so we can learn faster while staying pointed at outcomes that matter.

Discussion Question: How do you decide which new tools or technologies are worth exploring—and which ones you can safely ignore?

Resources & Links: Follow Teresa Torres: https://ProductTalk.org | Follow Petra Wille: https://Petra-Wille.com

Full transcripts are only available for paid subscribers.

Have thoughts on this episode? Leave a comment below.

Inspired by this post on Product Talk.

April 7, 2026
The Customer Feedback Playbook: AI-Powered Tactics I Use to Make Better Product Decisions

Customer feedback is the most reliable compass I have for product strategy and execution. Over the years leading product at HighLevel, I’ve built and refined a system that turns raw signals from users into clear, prioritized decisions our teams can confidently ship.

A practical guide to collecting and using product feedback in product management (from AI tools to early-stage tactics) for better product decisions.

My playbook starts with continuous discovery. I keep a steady flow of insights from sales calls, customer support threads, community forums, and in-product behavior so I can triangulate patterns rather than chase loud anecdotes. This mix of quantitative and qualitative data helps me separate urgent noise from strategically meaningful trends.

On the quantitative side, I rely on product analytics to ground the conversation. Amplitude analytics gives me activation, retention cohorts, and feature engagement, while controlled experiments and A/B testing validate whether an idea actually moves a target metric. Tying these signals to specific customer segments helps me see where product-led growth is working—and where it’s stalling.

For qualitative insight, I combine in-app guides and lightweight surveys (via tools like Pendo) with structured interviews and support escalations (often surfaced through platforms like Intercom). I map problems using the Kano Model to understand which requests are basic expectations, which are performance drivers, and which are potential delights. This keeps our roadmap focused on outcomes, not just outputs.

AI now accelerates the synthesis step. With LLMs for product managers in my AI product toolbox, I summarize interview transcripts, cluster themes across thousands of notes, and quantify sentiment without losing nuance. I still review raw artifacts to avoid hallucinations and preserve context, but AI reduces the time from signal to insight dramatically—freeing me to spend more energy on judgment and storytelling.

In early-stage contexts, I bias toward speed and proximity to users. I schedule founder- or PM-led discovery calls weekly, instrument product tours early, and launch scrappy in-product prompts to validate demand before over-investing. When data is sparse, I focus on high-signal channels (power users, churned customers with qualified use cases) and document crisp problem statements that connect directly to activation, retention analysis, and revenue outcomes.

Prioritization ties everything together. I translate insights into hypotheses aligned to outcomes vs output OKRs, then pressure-test them with feasibility and strategic fit. We run small, measurable experiments, track deltas in activation and retention, and adjust the product roadmapping and sprint planning cadence based on what the data and customers teach us.

This approach builds trust with stakeholders and creates empowered product teams. By grounding decisions in a transparent trail of feedback, analytics, and experiments, we reduce thrash, move faster, and—most importantly—ship product moments that customers value.

If you’re refining your own feedback engine, start by instrumenting the basics, set a weekly discovery rhythm, and let AI handle the heavy lifting on aggregation and synthesis. The compounding effect is real: better insights lead to better bets, which lead to better outcomes for your users and your business.

Inspired by this post on Product School.

January 26, 2026
PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

I’ve sat in countless AI measurement debates and noticed a recurring gap. One major voice has been noticeably underrepresented in the AI measurement conversation: the product manager (PM) that’s leading development. From experience, PMs and developers do need different measurement tools—and making those differences explicit is exactly what speeds up decisions and improves outcomes.

Developers optimize the model and system layer. Their toolkit centers on eval-driven development: offline evals, regression suites, red-teaming, latency and throughput monitoring, token cost tracking, and hallucination rate reduction. On the delivery side, engineering teams watch DORA metrics alongside CI/CD performance to keep iteration fast and safe. When building LLM-backed experiences, they also care deeply about retrieval-first pipeline quality and context window management because those mechanics determine grounding, relevance, and consistency.

PMs, by contrast, own outcomes. We instrument user journeys end to end and define a clear north-star tied to value: activation, time-to-value, task success rate, retention analysis, support deflection, and revenue contribution. We rely on A/B testing frameworks and minimum detectable effect (MDE) planning to separate real impact from noise, and we consolidate behavioral signals in a unified analytics platform like Amplitude analytics and Pendo to understand adoption, friction, and cohort differences. This is the heart of product-led growth and continuous discovery: evidence, not anecdotes.

The fact that these toolboxes differ is a strength, not a weakness. Specialized metrics keep responsibilities crisp: developers guarantee model quality and reliability; PMs guarantee that quality translates into customer and business outcomes. What we need is an explicit metrics ladder that connects layers—model-level quality floors and SLOs, feature-level KPIs, and company-level results—so trade-offs are transparent and prioritization is principled.

In practice, I create a shared measurement contract for every AI initiative. It links eval sets to user-facing success criteria, defines acceptance thresholds, and spells out observability across the stack. We include governance from day one—AI risk management, privacy-by-design, and data governance—so we can scale responsibly without slowing teams down.

Here’s the AI product toolbox I give my teams: start with a concise value hypothesis; define a success rubric the customer would recognize; instrument the happy path and the failure path; plan experiments with MDE up front; segment results by persona and job-to-be-done; and close the loop with qualitative feedback inside the product via in-app guides, product tours, and lightweight surveys. For AI features specifically, add Agent Analytics for agentic AI, capture grounding sources for explainability, and log model/context inputs to make debugging and iteration repeatable. That way, LLMs for product managers stop being magic and start being manageable.

When we roll out a new assistant—whether a retrieval-augmented copilot or a voice AI agent—we set two dashboards: one for developers (eval pass rates, latency, context integrity, error budgets) and one for PMs (activation, task completion, deflection, satisfaction). The dashboards read differently by design, yet they are joined at the hip by shared definitions and experiment IDs. This lets us move quickly with confidence: engineering can tighten quality loops while product steers toward the outcome that matters most.

If you’re feeling the tension between model metrics and product metrics, don’t collapse them—connect them. Start with a thin slice, agree on 3–5 measurable outcomes, and let your evals and A/B tests work together. With a clear metrics ladder and a unified analytics platform, PMs and developers can each excel at their craft and still ship AI that customers love.

Inspired by this post on Pendo – Perspectives.

January 7, 2026
AI Product Management Skills: A Practical 12-Month Roadmap
You may know how to prompt a model and still feel unprepared to own an AI product. That gap is real. Producing a plausible response is easy; deciding what should be built, how to evaluate it, when to trust it, and whether it improved the user journey requires a broader product skill set.

The useful roadmap is not a queue of courses or tools. It is a sequence of increasingly consequential work: understand model behavior, turn ideas into testable artifacts, ship a bounded workflow, and then build the operating system that lets more teams do it responsibly.

What you should be able to do after 12 months

An AI product manager does not need to become a machine-learning engineer. You do need enough technical judgment to frame a feasible problem, challenge an architecture, inspect failures, define an evaluation, and make a release decision with engineering and design.

The 12-month progression from foundations to governed scale works because each phase produces evidence needed by the next one. You learn model constraints before promising a user experience. You build evaluations before exposing the system to real customers. You prove one workflow before standardizing it across a product organization.

Key takeaways
- Months 1-3: Learn model behavior, context management, prompting, retrieval, privacy, and data governance. Apply them to product discovery.
- Months 4-6: Build prototypes and an evaluation system. Instrument activation and retention before treating the feature as ready.
- Months 7-9: Ship a bounded AI-enabled workflow with safeguards, monitoring, recovery paths, and clear human control.
- Months 10-12: Standardize evaluation gates, analytics, discovery practices, roadmapping, and outcome-based reporting.
Treat these as capability gates, not calendar milestones. If you cannot explain why a prototype failed in month six, more production infrastructure will not fix the problem. If you cannot show that users received value in month nine, scaling the feature will only distribute uncertainty.

By the end of the roadmap, your portfolio should contain operating artifacts rather than course certificates: an AI product brief, a prompt and retrieval pattern, a reusable evaluation set, an instrumented production workflow, a risk checklist, and a scale playbook. Those artifacts demonstrate that you can move from possibility to accountable product performance.

Months 1-3: Learn enough AI to make sound product decisions

Your first objective is not technical fluency for its own sake. It is learning where model behavior changes a familiar product decision. A deterministic feature is expected to return the same result for the same state. A generative feature can produce different, incomplete, or confidently incorrect outputs. That changes acceptance criteria, testing, interface design, and the meaning of “done.”

Build an operator’s mental model

Work through four capabilities in order:
1. Model behavior and constraints: Learn what the model receives, what it produces, where variability enters, and which failures matter to the user. You should be able to distinguish a capability problem from a context, instruction, or workflow problem.
2. Context window management: Decide which information belongs in the model’s working context, which information is stale, and which information should never be sent. More context is not automatically better context. Irrelevant material can obscure the evidence the task actually requires.
3. Prompting as product specification: Write reusable instructions that state the task, relevant context, constraints, required output, and quality criteria. Save the prompt with examples of both acceptable and unacceptable behavior. A prompt library is useful only when another person can reproduce and assess the result.
4. Retrieval-first design: For tasks that depend on changing or proprietary knowledge, learn the basic pipeline: retrieve relevant approved information, give that information to the model, generate an answer, and preserve enough traceability to investigate failures. This is a product choice as much as an architecture choice because it determines what the experience can reliably know.
Pair these capabilities with privacy-by-design and data governance from the beginning. Before using customer or company information, write down which data classes are permitted, who can access them, where they may be retained, and what must be removed or masked. If those answers are unclear, use synthetic or explicitly approved material until the policy is settled. Avoiding sensitive data at the prototype stage is safer than trying to remove it after it has spread through prompts, logs, and evaluation files.

Apply the foundations to product discovery

Discovery gives you a low-risk place to practise. Use generative AI to summarize research, cluster feedback, compare recurring needs, or sharpen a value proposition. Keep the model in an assistive role: every synthesized theme should remain traceable to the underlying customer evidence. If you cannot inspect the feedback behind a cluster, you cannot tell whether the model found a pattern or flattened important differences.

Create an AI product brief for one candidate problem. Include:
- The user and the job they are trying to complete.
- The decision or work the model will assist with.
- The inputs the system may use and the inputs it must reject.
- The expected output and the conditions that make it useful.
- The consequence of a wrong, missing, or delayed output.
- The point at which a person reviews, edits, approves, or overrides the result.
- The product signal that would show improved user behavior.
You are ready for the next phase when you can explain the proposed experience without hiding behind model vocabulary. You should be able to identify the necessary context, name the important failure modes, explain whether retrieval is needed, and show how the user remains in control.

Months 4-6: Prototype the experience and build its evaluation system

A prototype is valuable when it tests uncertainty, not when it merely looks polished. Use generative AI to accelerate UX mocks, PRDs, in-app guidance, and alternative interaction flows, but spend the saved time on the questions that determine whether the product deserves to ship.

Prototype the entire decision loop. Show where the user supplies context, how the result is presented, what happens when the answer is weak, how the user corrects it, and whether that correction improves the next step. The error state is part of the primary AI experience; hiding it until engineering integration creates false confidence.

Use evaluation as a development method

Eval-driven development turns a vague judgment such as “the answers seem good” into a repeatable product decision. Build the evaluation alongside the prototype:
1. Define the task boundary. State what the system is expected to do and what remains outside its responsibility.
2. Collect representative cases. Include normal inputs, ambiguous inputs, missing information, adversarial behavior, and cases where the correct response is to stop or ask for clarification.
3. Write a scoring rubric. Assess the properties the user actually needs, such as correctness, relevance, completeness, appropriate tone, traceability, or compliance with a constraint.
4. Record a baseline. Compare the proposed experience with the current workflow or a simpler non-AI alternative. A model output is not valuable merely because it exists.
5. Inspect failure patterns. Separate prompt failures, missing-context failures, retrieval failures, model limitations, interface confusion, and policy violations. Each category points to a different remedy.
6. Set a release gate. Decide which failures block launch, which require human review, and which are tolerable in the intended use case. The gate should reflect the consequence of error, not enthusiasm for the feature.
Keep the evaluation set versioned with the product. When you change the prompt, model, retrieval logic, or available tools, rerun the same cases. Otherwise, an apparent improvement in one example can conceal regressions elsewhere.

Instrument behavior before launch

Quality evaluation and product analytics answer different questions. An evaluation tells you whether the system behaved acceptably on known cases. Behavioral analytics tells you whether customers reached value in the product.

Define the journey in Amplitude or your existing analytics system before exposing the prototype broadly. Capture the moment a user encounters the feature, supplies enough information, receives an output, accepts or edits it, completes the downstream task, returns to use it again, abandons it, or escalates to a person. That sequence gives you activation and retention signals rather than a vanity count of generations.

If you run an A/B test, choose the minimum detectable effect before launch. The decision matters because an experiment that cannot detect a product-relevant change may produce an inconclusive result even when the dashboard looks busy. Define the primary outcome, guardrail metrics, exposure rule, and analysis plan before looking at the results.

Move forward when the prototype solves a defined task, the evaluation catches meaningful failures, the events expose the user journey, and the experiment can answer a decision. A persuasive demo without those four elements is still a demo.

Months 7-9: Ship a bounded workflow, not an open-ended assistant

The production phase is where product judgment becomes visible. Start with a workflow that has a recognizable beginning, end, and owner. Customer-support, CRM, and guided-onboarding workflows are useful patterns because the AI can sit inside an existing user journey rather than asking customers to invent a use case from a blank chat box.

Screen the workflow before committing engineering capacity:
- Is the user’s job clear enough to define a successful completion?
- Does the system have access to approved, relevant context?
- Can you observe whether the user accepted, corrected, ignored, or escalated the output?
- What happens to the customer if the system is wrong?
- Can a consequential action be paused, reviewed, or reversed?
- Is a generative system materially better than a rule, search result, template, or conventional workflow?
Use agentic AI only when the job genuinely requires several connected steps, tool use, or changing plans. Additional autonomy also creates more places for permissions, context, and actions to go wrong. Begin with the narrowest useful boundary, then expand it when production evidence supports the change.

Map the production loop before building it

A product trio should be able to trace the complete workflow on one page:
1. Trigger: What user action or system event begins the workflow?
2. Context: Which profile, conversation, account, or knowledge records are retrieved?
3. Generation or decision: What does the model produce, classify, recommend, or plan?
4. Tool action: Which systems can it read from or write to, and under whose authority?
5. Human checkpoint: Which output can be edited, rejected, or approved before it changes customer data or sends an external message?
6. Recovery: How does the product handle low confidence, missing data, tool failure, timeouts, or a user correction?
7. Learning signal: Which feedback updates the evaluation set, product decision, or workflow design?
Place safeguards at the point of consequence. Restrict the data and tools the workflow can access. Require explicit approval before a high-impact external action. Preserve a record of the inputs, retrieved context, output, action, and user response so a failure can be investigated. If an action cannot be safely reversed, keep it behind human review until the risk has been addressed.

Threat detection and response also need a product playbook. Define what counts as suspicious input or abnormal behavior, who receives the alert, how the workflow is disabled or contained, what evidence is retained, and how affected users are handled. The escalation path should exist before the first serious incident, not be improvised during it.

Monitor the experience at four levels
- User outcome: Did the customer complete the intended job with less effort or fewer avoidable handoffs?
- AI quality: Are the evaluation scores and failure categories changing after releases?
- Workflow health: Are retrieval, model, and tool steps completing as expected, and can the team locate the failing stage?
- Risk: Are users overriding outputs, escalating cases, encountering policy violations, or triggering suspicious behavior?
Track deployment frequency because a team that can release safely can also learn faster. Do not confuse release frequency with customer value, though. The useful loop connects a deployment to a quality change, a behavior change, and a decision about what to do next.

Months 10-12: Turn one successful product into a repeatable system

Scaling is not copying the same AI feature into every surface. It is making the successful practices reusable while preserving room for different user risks and workflow requirements.

Codify the operating assets that reduced uncertainty during the earlier phases:
- An intake template that starts with the user problem, current workflow, expected outcome, and consequence of error.
- A continuous-discovery practice that keeps generated themes connected to original customer evidence.
- A retrieval-first architecture template for products that depend on approved or changing knowledge.
- A shared prompt library with owners, versions, expected behavior, and known limitations.
- An evaluation gate covering representative cases, blocking failures, human-review requirements, and regression checks.
- A production checklist covering permissions, privacy, observability, recovery, threat response, and user control.
- A monitoring cadence that connects product behavior, AI quality, workflow health, and risk.
Do not impose one universal quality threshold on every AI feature. A low-consequence drafting aid and a workflow that changes a customer account do not carry the same downside. Use the same evaluation process across teams, but set release gates according to the task, affected user, reversibility, and consequence of failure.

Use common analytics without erasing product context

A unified analytics model lets leadership compare lift across products without forcing every team to use an identical funnel. Standardize the basic meanings of exposure, meaningful use, successful task completion, correction, abandonment, escalation, and return usage. Then let each product define the events that represent those states in its own journey.

This is also where roadmapping and sprint planning should move from output commitments to outcome-based decisions. “Ship an AI assistant” is an output. A useful objective describes the customer behavior or business result that should change. The roadmap can then contain competing ways to produce that change, including improvements that do not require AI.

Use a consistent stakeholder narrative:
- What shipped: The workflow or capability placed in users’ hands.
- What moved: The user, product, quality, and risk signals that changed.
- What was learned: The assumptions confirmed, rejected, or still unresolved.
- What happens next: The decision to expand, revise, contain, or stop the work.
That structure prevents activity from masquerading as progress. It also gives executives a clear basis for funding decisions: evidence of value, evidence of control, and a specific next bet.

Start this week with one recurring user decision. Write its AI product brief, run the workflow manually with permitted data, and save the successful and failed cases as the beginning of an evaluation set. If you cannot define a good result or the consequence of a bad one, stay in discovery. If you can, you have a concrete first artifact and a reason to proceed to a prototype.

References
- Shivam.Consulting Blog – Master AI as a Product Manager in 12 Months: My 2026 Roadmap to Ship Smarter, Faster
December 17, 2025
AI in Product Design: My Proven Playbook, Real Use Cases, and the Tools That Win Faster

In product design, AI has shifted from novelty to non-negotiable. I’ve watched teams accelerate discovery, compress prototyping cycles, and turn ambiguous ideas into validated experiences faster than ever—without sacrificing quality or customer trust.

AI in product design has quickly moved from new to necessary. Here are the AI product design tools and approaches you need to stay relevant in this decade.

From my vantage point leading product teams, “necessary” means AI is woven throughout the product lifecycle—discovery, prioritization, prototyping, validation, and iteration—not bolted on. The goal isn’t to chase hype; it’s to build durable advantage with clear AI Strategy, disciplined execution, and measurable outcomes.

First, anchor the work in strategy. Tie every AI initiative to a specific customer problem and value proposition, then express that linkage with outcomes vs output OKRs. This keeps teams focused on real impact and avoids feature-chasing. It also sharpens product positioning and clarifies where AI can deliver competitive differentiation versus simple points of parity.

Second, upgrade discovery. I rely on AI workflows to synthesize interviews, cluster themes, and surface insights at scale. A retrieval-first pipeline—grounding models in our own data—improves factuality and reduces hallucinations. Combine this with strong data governance and privacy-by-design so insights are trustworthy and compliant from day one.

Third, make quality measurable. Adopt eval-driven development: define evaluation sets and acceptance thresholds that reflect real user tasks before you ship. Pair that with A/B testing and minimum detectable effect (MDE) discipline, so you learn quickly and confidently. Add safety guardrails (red-teaming prompts, content filters, and bias checks) to manage AI risk without slowing the pace.

Fourth, enable empowered product teams. Product trios (PM, design, engineering) should co-create prompts, prototypes, and evaluation criteria. Give designers and PMs practical tools—LLMs for product managers, structured prompt templates, and reusable components—so AI-augmented work becomes the default, not a special project.

Where does AI shine in product design today? Concept exploration and market scans, turning fuzzy opportunity spaces into crisp problem statements. Rapid wireframes and interaction ideas, using gen ai for product prototyping to explore multiple design directions in minutes. UX writing that adapts tone and reduces friction across onboarding, tooltip design, and microcopy.

It also excels at guided experiences. I’ve seen strong lifts in user activation when we pair in-app guides and product tours with context-aware suggestions. For support and education use cases, a retrieval-grounded assistant can deflect tickets, shorten time-to-value, and reinforce the product’s value proposition at the exact moment a user needs help.

Voice is another frontier. A well-scoped voice AI agent can accelerate complex workflows (think data entry or multi-step configurations) when hands-free is faster or more intuitive. Just be intentional about when agentic AI adds net value versus when a simple UI tweak would do.

On the tooling side, my AI product toolbox is pragmatic and modular. For analytics and learning loops, Amplitude analytics and Pendo help quantify behavior changes and retention analysis. For in-product engagement and feedback routing, Intercom and HubSpot integrate cleanly with LLM-driven tagging and summarization. For ideation and automation, I use a ChatGPT connector and Claude Code for quick scripts, data wrangling, and prompt experiments. The constant: a retrieval-first pipeline that grounds models in approved knowledge and maintains context window management at scale.

Risk management is built in, not bolted on. Set clear AI risk management policies, catalog model and data dependencies, and document decisions. Align with regulatory compliance requirements early, and keep an audit trail of prompts, datasets, and eval results. That’s how you move fast without breaking trust.

If you’re getting started, begin small: pick one high-friction workflow, add a retrieval-grounded copilot, and measure the lift. Use the results to inform product roadmapping and sprint planning, then scale to adjacent use cases. With disciplined discovery, sharp evaluation, and the right tooling, AI becomes a force multiplier for product teams and a clear win for customers.

Inspired by this post on Product School.

December 15, 2025
From No-Code Hack to 10,000 Weekly Calls: Inside Perk’s Voice AI That Actually Works

I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.

What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?

Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.

What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.

From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.

They built a working prototype without writing a single line of backend code.

They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.

They created two eval systems: one for call success classification, another for conversational behavior.

They scaled from five calls a day to tens of thousands per week while maintaining quality.

This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.

Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.

What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Why they chose Make.com for prototyping—and shipped to production without touching backend code—underscores how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is exactly how you harden agent behavior for production.

Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.

The challenge of prompt engineering for voice—numbers, booking references, and text-to-speech markup—was non-trivial. Expanding to German revealed that prompts in native language improve results. And, as often happens with operations-heavy rollouts, this project uncovered other operational problems they didn't know existed—valuable signal for the roadmap.

Resources & Links: Perk. Make.com — No-code automation platform used for the prototype. Twilio — Voice/telephony provider. Eleven Labs — Text-to-speech provider (used in early experiments).

Chapters: 00:00 Introduction to the Team; 01:54 Understanding PERK's Mission; 02:59 Challenges in Travel Booking; 07:27 AI Solutions for Customer Care; 09:52 Prototyping with AI and Voice; 17:00 Implementing AI in Production; 25:51 Learning Through Trial and Error; 26:40 Prompting Challenges and Solutions; 27:58 Iterating on Prompts and Evaluations; 30:08 Scaling and Production Challenges; 32:43 Advanced Evaluation Techniques; 35:32 Real-World Applications and Success; 49:07 Future Directions and Expansion; 53:53 Conclusion and Team Reflections.

My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into deterministic stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language—your accuracy will thank you.

If you’re exploring agentic AI for operations, this is the blueprint: tight scoping, Make.com for speed, Twilio for reliability, structured prompts for control, and an eval-driven loop to scale quality with confidence.

Inspired by this post on Product Talk.

December 4, 2025
AI vs. Human Judgment in Customer Interviews: The Hard‑Won Lessons That Changed My Mind

I recently revisited a topic I once pushed back on: using AI to analyze (and maybe even synthesize) customer interviews. After six months of real-world experiments and countless conversations with seasoned product leaders, I’ve evolved my perspective. There is meaningful value here—but only when we’re clear about where AI helps and where it quietly erodes the hard-won customer understanding that powers great product decisions.

If you want to experience the conversation that sparked this reflection, you can listen to the episode on Spotify or Apple Podcast, and watch the discussion here: YouTube. It’s a candid, practical exploration of AI’s role in continuous discovery, and it mirrors what I’m seeing on the ground with product trios and empowered product teams.

Here’s the crux: AI raises the floor for beginners but accelerates experts even more. That matches my experience—early-career PMs get structure, momentum, and a confidence boost, while experienced interviewers can move faster without sacrificing nuance. But there’s a catch. If your interviewing skills aren’t solid yet, AI can create a veneer of insight that masks shallow understanding. In other words, it can help you go wrong more efficiently.

The conversation makes an important distinction between analysis and synthesis. Analysis is about extracting signals from the interview. Synthesis is about building meaning—connecting patterns, weighing contradictions, and deciding what to do next. AI can speed up the former with summaries and highlights. The latter—true synthesis—still demands expert judgment, context, and empathy.

One line from the episode stuck with me: your unpolished interview skills matter more than any shiny new AI workflow. I’ve felt that firsthand. When interview quality is uneven, dropping transcripts into an LLM won’t save you. You still need to synthesize every interview individually so the signals remain traceable and credible. That discipline keeps teams aligned, prevents overfitting to noise, and builds the organizational memory that fuels better bets.

We also explored the operational reality most teams face: interviews pile up. Backlogs grow. Leaders want speed. This is where “expert + AI” shines. With the right prompts, templates, and context, tools like ChatGPT and Claude can help transform raw transcripts into structured artifacts you can trust—provided a strong interviewer sets the frame and makes the calls. That balance preserves both velocity and quality.

What changed my mind most was the evidence from experiments—running sets of interviews through different LLMs and comparing outcomes. The patterns were consistent: beginner + AI is usually better than nothing, but the real performance gains come from expert + AI. When experts guide the process, AI becomes an accelerant rather than a crutch.

A favorite story in the episode takes a detour into building a gaming PC—an unexpected but perfect metaphor for AI’s limits. You can get great step-by-step guidance from a model, but when context shifts or edge cases appear, expertise is what keeps you from making expensive mistakes. Customer interviews are like that. Empathy comes from human interaction; AI can’t replace the experience of talking directly to your customers.

My practical guidance for teams integrating AI into continuous discovery: start with interviewing fundamentals, separate analysis from synthesis, and standardize how you capture single-interview learnings. If you need a tight template for this, refer to “The Interview Snapshot: How to Synthesize and Share What You Learned from a Single Customer Interview.” Use AI for summaries, clustering, and draft artifacts—but have an expert finalize the narratives, evaluate trade-offs, and document assumptions.

If you’re scaling this across an organization, invest in training first, then in workflows. Build a lightweight operating system for discovery: consistent interview guides, “story-based” techniques, and a shared library of prompts. Consider resources like “The Interview Coach,” as well as practical write-ups such as “Customer Interview Analysis: Where AI Helps and Hurts.” These help teams avoid common pitfalls and make better use of AI in high-judgment moments.

My bottom line: AI isn’t magic. It can help, but only if your interviews are strong and you provide the right context. Customer understanding is a competitive moat; outsourcing it entirely will cost you in the long run. Use AI to accelerate—not replace—the human judgment that makes product discovery work.

Resources and links worth exploring: ChatGPT, Claude, The Interview Snapshot: How to Synthesize and Share What You Learned from a Single Customer Interview, The Interview Coach, and Customer Interview Analysis: Where AI Helps and Hurts.

I’d love to hear how your team is using AI in discovery. What’s working, what’s risky, and where do you draw the line between automation and judgment? Share your experiences in the comments—our community learns faster when we compare notes.

Inspired by this post on Product Talk.

December 2, 2025
Unlock AI Product Roadmaps: Essential Tools Every PM Needs to Prioritize and Ship Faster

In my role leading product teams, the AI product roadmap isn’t just a plan—it’s the operating system for how we discover value, prioritize with rigor, and ship with confidence. The pace has changed, the stakes are higher, and the best product managers are now orchestrating AI capabilities, data, and customer insight in near-real time.

Master the evolving art of the AI product roadmap. Prioritize smarter, turn data into direction and insight into action, only much faster.

When I say “AI product roadmap,” I’m talking about a living system that blends strategy, discovery, and delivery. It’s less about dates and more about outcomes, risk reduction, and sequencing learning. In practice, that means combining AI Strategy with product roadmapping and sprint planning, then validating each bet with real customer signals.

For prioritization, I anchor on outcomes vs output OKRs and connect them to measurable signals across the funnel. Continuous discovery keeps insights flowing, while a unified approach to analytics and retention analysis tells me where the lift is. This lets me rank initiatives not just by impact and effort, but by how quickly we can learn, iterate, and compound value.

On discovery, product trios are non-negotiable. We prototype early with gen ai and LLMs for product managers to accelerate concept validation and reduce ambiguity. When customers can co-create through in-app guides or lightweight product tours, we turn vague needs into crisp problem statements and testable hypotheses far faster.

On delivery, I pair tight feedback loops with experimentation. A deliberate cadence of A/B testing and strong instrumentation ensures we’re learning every sprint, not just launching. The goal is to de-risk decisions quickly, keep momentum high, and translate signals into roadmap movement without thrash.

Under the hood, the AI stack matters. I rely on a retrieval-first pipeline to ground models in trusted data, and I’m intentional about privacy-by-design and data governance from day one. As agentic AI patterns emerge, I put evaluation workflows in place so we can ship confidently—and safely—without slowing down innovation.

Finally, alignment is the multiplier. Clear narrative roadmaps tied to customer outcomes help stakeholders see trade-offs, while crisp interfaces with go-to-market and CRM integration close the loop from roadmap to revenue. When everyone can trace a line from AI strategy to shipped value, prioritization becomes easier and trust grows.

If you’re feeling the acceleration, you’re not alone. With the right AI product toolbox—rooted in discovery, grounded in data, and delivered through tight feedback loops—you can move faster, learn smarter, and build products your customers can’t live without.

Inspired by this post on Product School.

December 1, 2025
AI Product Owner in 2026: The High-Impact Role Every Team Needs to Win With AI

By 2026, the AI Product Owner will be the keystone role that turns AI strategy into measurable business outcomes. In my teams, this seat bridges market insight, model capability, data governance, and shipping velocity—so product decisions are not just clever, but compliant, reliable, and fast.

I often describe the remit simply: "Here is your clear guide to the AI product owner role (skills, responsibilities, how it differs from PM) and ways AI tools supercharge delivery." In practice, the AI Product Owner translates business goals into model-backed experiences, aligns cross-functional execution, and ensures the product’s AI behavior remains safe, lawful, and on-brand under real-world constraints.

How does this differ from a traditional PM? While Product Management sets portfolio strategy, positioning, and market narratives, the AI Product Owner owns the AI experience end-to-end—data readiness, evaluation harnesses, safety guardrails, and the iterative model improvements that drive outcomes vs output OKRs. I anchor the role inside empowered product teams and product trios (PM/Design/ML Eng) to keep discovery continuous and delivery disciplined.

On responsibilities, I expect four pillars. First, discovery: continuous discovery with customers and internal experts to uncover use cases where generative AI or LLMs beat the status quo. Second, experience: define the right interaction patterns for AI UX, including retrieval-first pipeline choices, context window management, and feedback loops for human-in-the-loop correction. Third, governance: privacy-by-design, AI risk management, data governance, and regulatory compliance baked into the roadmap. Fourth, delivery: CI/CD for models and prompts, observable evaluation with A/B testing and minimum detectable effect (MDE), and SRE-grade incident management when AI behavior drifts.

Skills-wise, I look for product sense plus technical fluency. That includes LLMs for product managers (prompting, grounding, RAG), analytics mastery (Amplitude analytics, retention analysis, activation metrics), and comfort with DORA metrics and deployment frequency to keep iteration high but safe. Strong stakeholder management and clear writing are non-negotiable—AI capabilities evolve fast, and leaders must see risk, cost, and ROI with no ambiguity.

AI tools truly supercharge delivery when they eliminate bottlenecks. My practical stack: an AI product toolbox with Claude Code and a ChatGPT connector for rapid prototyping; CustomGPT workflows for support triage and internal knowledge; Pendo product tours and in-app guides to validate behavior changes; Intercom for customer support ai strategy; and tight CRM integration via HubSpot to measure revenue impact. The outcome is faster idea-to-learning cycles, sharper telemetry, and far cleaner handoffs.

For roadmapping, I prioritize thin slices that prove value early—shipping narrowly scoped assistants or copilots, then expanding with product roadmapping and sprint planning that ties capability unlocks to outcomes. A unified analytics platform helps compare human-only baselines to augmented workflows, while agentic AI patterns automate routine steps under strict guardrails.

Risk is a product surface, not a side task. I require explicit policy gates (PII handling, red-teaming, bias audits), clear escalation paths, and incident playbooks. When we treat policy and reliability as features, customers reward us with deeper adoption and higher trust.

If you’re pursuing the AI Product Owner path, build a portfolio around shipped learnings: the experiment you killed with data, the safety constraint you designed, the postmortem you led, and the business metric you moved. That story—evidence of disciplined discovery, responsible delivery, and real-world results—is exactly what teams (and boards) want to see in 2026.

Inspired by this post on Product School.

November 26, 2025