Tag: eval-driven development

From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

But AI has expanded what’s just now possible—for our products, and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

A clear, color‑coded map from desired outcome to opportunities, solutions, and assumption tests—showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.

Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.

Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.

That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.

On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.
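To make the no-data-loss rule concrete, here is a minimal Python sketch. It is not the code behind the service; the `Node` shape, the field names, and the `lost_items` helper are all assumptions made for illustration. It compares every source and child present before an update with those present afterwards and reports anything that went missing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One opportunity in the tree (hypothetical shape, for illustration only)."""
    name: str
    sources: set[str] = field(default_factory=set)   # supporting evidence from interviews
    children: set[str] = field(default_factory=set)  # names of sub-opportunities


def lost_items(before: dict[str, Node], after: dict[str, Node]) -> dict[str, set[str]]:
    """Return sources or children that existed before the update but not after it."""
    def collect(tree: dict[str, Node], attr: str) -> set[str]:
        # Union the chosen attribute across every node in the tree.
        return set().union(*(getattr(node, attr) for node in tree.values()))

    return {
        "sources": collect(before, "sources") - collect(after, "sources"),
        "children": collect(before, "children") - collect(after, "children"),
    }


before = {"A": Node("A", {"1", "2", "3"}, {"x", "y", "z"})}
after = {
    "A": Node("A", {"1", "2"}, {"x"}),
    "B": Node("B", {"3"}, {"y"}),  # child "z" was dropped by mistake
}
report = lost_items(before, after)
print(report)
```

In this example the report flags child `z` as lost while every source is accounted for. A real check would also need a way to exempt intentional deletes, which this sketch leaves out.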

In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.

Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.

At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.

Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.

When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
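The example above can be replayed in a few lines of Python. This is an illustrative sketch only: the dictionary layout and the `split`, `merge`, and `naive_diff` helpers are hypothetical names, not the service's implementation. It applies the split and the merge, then compares the first and last snapshots the way a naive diff would.

```python
import copy


def split(tree, node, new, sources, children):
    """Move the given sources and children out of `node` into a new node `new`."""
    tree[node]["sources"] -= sources
    tree[node]["children"] -= children
    tree[new] = {"sources": set(sources), "children": set(children)}


def merge(tree, absorbed, into):
    """Fold `absorbed` into `into`, keeping every source and child."""
    gone = tree.pop(absorbed)
    tree[into]["sources"] |= gone["sources"]
    tree[into]["children"] |= gone["children"]


def naive_diff(before, after):
    """Per node, report only what was gained or lost between two snapshots."""
    empty = {"sources": set(), "children": set()}
    diff = {}
    for name in sorted(before.keys() | after.keys()):
        old = before.get(name, empty)
        new = after.get(name, empty)
        old_items = old["sources"] | old["children"]
        new_items = new["sources"] | new["children"]
        if old_items != new_items:
            diff[name] = {"gained": new_items - old_items, "lost": old_items - new_items}
    return diff


original = {
    "A": {"sources": {"1", "2", "3"}, "children": {"x", "y", "z"}},
    "C": {"sources": {"4", "5"}, "children": {"t", "u", "v"}},
}
final = copy.deepcopy(original)
split(final, "A", "B", sources={"3"}, children={"y", "z"})
merge(final, absorbed="B", into="C")

diff = naive_diff(original, final)
print(diff)
```

The diff reports that A lost 3, y, and z and that C gained them. B never appears, because it existed only between the two snapshots.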

See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.

That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.

For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set, when replayed step by step, didn’t actually reproduce the recommended tree.
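The second class of error can be detected by replaying the change set against the starting tree and comparing the result with the recommended tree. The sketch below assumes a tiny, hypothetical action vocabulary (`add`, `delete`, `merge`) and a made-up step format; the real action language is richer than this.

```python
import copy


def replay(tree, change_set):
    """Apply each step in order to a copy of the tree and return the result."""
    result = copy.deepcopy(tree)
    for step in change_set:
        if step["op"] == "add":
            result[step["node"]] = {"sources": set(step["sources"]), "children": set()}
        elif step["op"] == "delete":
            del result[step["node"]]
        elif step["op"] == "merge":
            gone = result.pop(step["absorbed"])
            result[step["into"]]["sources"] |= gone["sources"]
            result[step["into"]]["children"] |= gone["children"]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return result


def change_set_matches(tree, change_set, recommended):
    """True when replaying the steps reproduces the recommended tree exactly."""
    return replay(tree, change_set) == recommended


current = {
    "B": {"sources": {"3"}, "children": {"y", "z"}},
    "C": {"sources": {"4", "5"}, "children": {"t", "u", "v"}},
}
change_set = [{"op": "merge", "absorbed": "B", "into": "C"}]
good = {"C": {"sources": {"3", "4", "5"}, "children": {"t", "u", "v", "y", "z"}}}
drifted = {"C": {"sources": {"4", "5"}, "children": {"t", "u", "v", "y", "z"}}}  # source 3 dropped

ok = change_set_matches(current, change_set, good)
bad = change_set_matches(current, change_set, drifted)
print(ok, bad)
```

The first comparison passes and the second fails, because the drifted recommendation silently lost source 3 even though its change set reads correctly.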

Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
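The directional rule for parent and child merges can be written as a small check. This is a sketch following the rule as described above; `merge_orphans` and the `children_of` mapping are hypothetical names chosen for illustration.

```python
def merge_orphans(children_of, keep, delete):
    """Return the children that would be orphaned by merging `delete` into `keep`.

    `children_of` maps a node name to the list of its children.
    """
    if keep in children_of.get(delete, []):
        # The surviving node sits under the node being deleted,
        # so its siblings would lose their parent.
        return [child for child in children_of[delete] if child != keep]
    return []


children_of = {"A": ["B", "C", "D"]}
allowed = merge_orphans(children_of, keep="A", delete="B")
rejected = merge_orphans(children_of, keep="B", delete="A")
print(allowed, rejected)
```

An empty list means the move is safe. A non-empty list names the nodes that would be stranded, which is useful feedback to hand back to the model.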

Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.

Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, even that rate compounds: if errors are independent, somewhere between a fifth and two-fifths of all runs would still contain at least one mismatch. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
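As a back-of-the-envelope check on how a small per-call error rate compounds across a run, here is the arithmetic in Python. It assumes each call fails independently, which is my simplifying assumption and not something measured in this story; correlated failures would change the numbers.

```python
def run_failure_rate(per_call_failure: float, calls_per_run: int) -> float:
    """Probability that at least one call in a run fails, assuming independence."""
    return 1 - (1 - per_call_failure) ** calls_per_run


low = run_failure_rate(1 / 40, 10)   # shortest runs
high = run_failure_rate(1 / 40, 20)  # longest runs
print(f"{low:.0%} to {high:.0%} of runs contain at least one mismatch")
```

Under that assumption, a 1-in-40 per-call rate works out to roughly 22% of runs at 10 calls and roughly 40% at 20 calls.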

Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
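Detection really is the easy half. The sketch below flags any step that references a node an earlier step already deleted. The step format and the `find_conflicts` name are assumptions for illustration, and the function deliberately stops at reporting: deciding what the model meant is the part that has no unambiguous answer.

```python
NODE_FIELDS = ("node", "absorbed", "into")  # hypothetical keys that name a node in a step


def find_conflicts(change_set):
    """Flag steps that reference a node after an earlier step deleted it."""
    deleted_at = {}  # node name -> index of the step that deleted it
    conflicts = []
    for index, step in enumerate(change_set):
        for key in NODE_FIELDS:
            name = step.get(key)
            if name in deleted_at:
                conflicts.append(
                    f"step {index} ({step['op']}) references '{name}', "
                    f"which step {deleted_at[name]} deleted"
                )
        if step["op"] == "delete":
            deleted_at[step["node"]] = index
    return conflicts


conflicts = find_conflicts([
    {"op": "delete", "node": "A"},
    {"op": "merge", "absorbed": "A", "into": "B"},
])
clean = find_conflicts([{"op": "merge", "absorbed": "A", "into": "B"}])
print(conflicts, clean)
```

The first change set produces one conflict message that names the offending step; the second produces none.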

A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code. I knew exactly when a change set failed, and why. No amount of prompt tuning alone was fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses those instructions to repair the change set and calls the tool again. Iterate until success or a max number of turns.
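The shape of that loop can be sketched in Python with the model and the validator stubbed out. Everything here is hypothetical: `repair_loop`, `fake_model`, and `fake_validator` are stand-ins, and the loop calls the validator directly instead of exposing it as a tool the model invokes, which simplifies the control flow described above.

```python
from typing import Callable


def repair_loop(
    propose: Callable[[list[str]], list[dict]],
    validate: Callable[[list[dict]], list[str]],
    max_turns: int = 5,
) -> tuple[list[dict], int]:
    """Ask for a change set, validate it, and feed errors back until it passes."""
    feedback: list[str] = []
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)   # the model sees the previous errors, if any
        feedback = validate(change_set)  # an empty list means the change set passed
        if not feedback:
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")


def fake_model(feedback: list[str]) -> list[dict]:
    """Stand-in for an LLM call: the first attempt is wrong, the repair is right."""
    if feedback:
        return [{"op": "merge", "absorbed": "B", "into": "C"}]
    return [{"op": "delete", "node": "B"}, {"op": "merge", "absorbed": "B", "into": "C"}]


def fake_validator(change_set: list[dict]) -> list[str]:
    """Stand-in validator: reject merges that use an already deleted node."""
    deleted, errors = set(), []
    for step in change_set:
        if step["op"] == "merge" and step["absorbed"] in deleted:
            errors.append(f"merge uses deleted node {step['absorbed']}")
        if step["op"] == "delete":
            deleted.add(step["node"])
    return errors


change_set, turns = repair_loop(fake_model, fake_validator)
print(turns, change_set)
```

The cap on turns matters as much as the feedback: without it, a loop that fails to converge just keeps spending compute.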

I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.

Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

Third, I learned when I can and can’t trust Claude to write code for me. Since Opus 4.6 came out, I had been giving Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I simply trusted that Claude got it right, when in fact it hadn’t. If you don’t have the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.

Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.

Inspired by this post on Product Talk.

May 13, 2026

AI-Assisted Product Strategy: A Practical Operating System

You can get an AI model to produce a roadmap in minutes. That is precisely the problem. A polished roadmap can hide weak evidence, unresolved trade-offs, and a strategy that never made a real choice.

The useful question is not whether AI can do product management work. It is where AI should accelerate the path from evidence to decision, where human judgment must remain explicit, and how you will know the resulting strategy is working. The operating system below gives you that separation.

Key takeaways

Give AI a defined role in the decision process. It can extract, organize, challenge, and draft; the product leader still owns choices, trade-offs, and commitments.
Build a strategy chain from customer problem to business result before asking AI for initiatives. Otherwise, the model will fill strategic gaps with plausible language.
Ground every workflow in canonical product context, and require every important claim to point back to evidence.
Use AI to shorten discovery synthesis, not to turn a limited set of interviews or support conversations into false market certainty.
Carry the same strategic hypothesis through the roadmap, experiment, launch, and learning review. Changing the success definition between those stages makes measurement meaningless.

Start with decision architecture, not a better prompt

Most weak AI-assisted strategy work begins with an underspecified request: analyze this feedback, prioritize these ideas, or build a roadmap. The model responds by making silent assumptions about the customer, the business objective, and the meaning of priority. Its output may read well while answering a question nobody deliberately chose.

Write a decision brief before opening the model. This is not a conventional product requirements document. It is a compact contract defining the decision AI is helping you make.

Decision: State the choice in one sentence. For example, decide which onboarding opportunity deserves discovery capacity in the next planning cycle.
Target customer and context: Name the segment, job, and situation. Feedback from an administrator configuring an account should not be blended with feedback from an end user completing a daily task.
Desired outcome: Identify the customer behavior you want to change and the business result it is expected to influence.
Evidence in scope: List the interviews, behavioral data, support conversations, journey maps, and prior experiments the model may use.
Constraints: Include privacy requirements, technical dependencies, commercial commitments, capacity limits, and non-goals.
Decision owner: Name the person accountable for accepting the trade-off. An AI-generated recommendation does not distribute accountability.

Build a strategy chain the model can inspect

Your strategy should form a traceable chain:

Choose the customer and job that matter.
Define the value proposition, including what must match the market and what should be meaningfully different.
Name the customer outcome and business outcome.
Break that outcome into drivers the product can influence.
Select an opportunity supported by evidence.
Form a testable product bet.
Decide what evidence would justify continuing, changing, or stopping.

A driver tree makes this chain concrete. It creates a visible connection between roadmap work and measures such as activation, retention, expansion, and Net Revenue Retention. AI is useful here as a critic. Ask it to identify unsupported jumps, duplicated drivers, initiatives disguised as outcomes, and metrics the proposed product change cannot plausibly affect.

Keep outputs and outcomes separate. Shipping an AI onboarding assistant is an output. Changing a defined activation behavior for a defined customer segment is an outcome. The model can help rewrite output-oriented objectives, but it cannot choose a credible target without baseline data, business context, and an accountable owner.

Force a distinction between fact, inference, and assumption

Require the model to label every material statement as one of three things:

Observed: Directly supported by a supplied interview, event, support conversation, or experiment.
Inferred: A reasonable interpretation that combines observations but is not explicitly stated by the customer or proven by the data.
Assumed: Necessary for the recommendation to work but not yet supported by the supplied evidence.

This simple classification prevents an attractive narrative from laundering assumptions into facts. It also improves discovery planning: the most consequential assumption with the weakest evidence becomes a candidate for the next test.

A useful instruction is: Use only the supplied material. For every recommendation, show the observations that support it, the inference connecting those observations to the recommendation, the assumptions that remain, and the evidence that could disprove it. If support is missing, say that it is missing.
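If the model returns its statements as structured records, the three-way labeling can be checked mechanically before anyone reads the narrative. The following is a minimal sketch; the record fields (`label`, `evidence`) and the `check_claims` helper are assumptions, not a standard schema.

```python
ALLOWED_LABELS = {"observed", "inferred", "assumed"}


def check_claims(claims):
    """Return a list of problems; an empty list means every claim is labeled and grounded."""
    problems = []
    for index, claim in enumerate(claims):
        label = claim.get("label")
        if label not in ALLOWED_LABELS:
            problems.append(f"claim {index}: unknown label {label!r}")
        elif label in {"observed", "inferred"} and not claim.get("evidence"):
            # Observations and inferences must point back to supplied material.
            problems.append(f"claim {index}: {label} but cites no evidence")
    return problems


claims = [
    {"text": "Admins abandon setup at the SSO step", "label": "observed", "evidence": ["interview-3"]},
    {"text": "SSO friction delays activation", "label": "inferred", "evidence": []},
    {"text": "Most of the market needs SSO", "label": "fact"},
    {"text": "A guided setup would reduce drop-off", "label": "assumed"},
]
problems = check_claims(claims)
print(problems)
```

Here the ungrounded inference and the unlabeled market claim are both rejected, while the open assumption passes because it is honestly labeled as one.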

Build a controlled workflow from context to decision record

AI assistance becomes reliable when it is a workflow rather than a chat session. A chat encourages improvisation: context changes, instructions disappear, and nobody can reconstruct why an answer looked different the next time. A workflow gives each pass a defined input, output, and approval gate.

Ground the model in canonical product context

Start with a retrieval-first set of canonical documents. At minimum, that context should include the current vision, product strategy, target segments, value proposition, OKRs, metric definitions, analytics dashboards, relevant discovery evidence, decision history, and definition-of-done checks.

Canonical does not mean comprehensive. More context can make conflicts harder to notice. Give each item an owner, a freshness indicator, and an authority level. If an old positioning document conflicts with the approved strategy, the workflow should identify the conflict rather than silently averaging the two.

Include exclusions as well. Tell the model which documents are historical, which metrics are deprecated, which segments are out of scope, and which proposals have already been rejected. Without those boundaries, previously abandoned ideas can return as apparently new recommendations.

Separate extraction, synthesis, challenge, and approval

Extract: Pull observations, customer language, events, metrics, decisions, and unresolved questions from the supplied material. Preserve links to the original evidence.
Synthesize: Group related observations and propose opportunity statements. Keep contradictory evidence visible.
Challenge: Look for alternative explanations, missing segments, weak causal claims, metric gaming, dependencies, and reasons the recommendation could fail.
Decide: Have the accountable product leader and relevant partners accept, modify, or reject the recommendation. Record the trade-off explicitly.
Publish: Store the decision, evidence, owner, expected outcome, guardrails, and next review trigger in the system the team already uses.

Do not combine these passes into one request for a final answer. Extraction should not quietly prioritize. Synthesis should not hide inconvenient evidence. A challenge pass should test a proposed direction without changing the original evidence set. The human approval gate should be visible, not implied by the fact that somebody copied the output into a roadmap.

Raw interviews, support threads, CRM records, and analytics exports can contain personal or confidential data. Do not paste them into an unapproved model. Minimize the data, remove identifiers that are not needed for the decision, use the governed environment approved by your organization, and retain only what the workflow requires. Privacy-by-design belongs at intake because redacting an output does not undo an inappropriate disclosure in the input.

For recurring workflows, add acceptance criteria and evaluation cases. A discovery synthesis evaluation might check whether every theme retains evidence links, whether contradictions survive summarization, and whether unsupported market-size claims are rejected. A strategy evaluation might check whether every initiative maps to an outcome driver and whether an output has been mislabeled as an objective. Re-run those checks when the model, prompt, context set, or output schema changes.
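The discovery synthesis checks described above could be encoded as one evaluation function that runs against a fixed fixture. This is a sketch under assumptions: the theme fields (`evidence_links`, `contradicting_evidence`, `market_size_claim`, `market_size_source`) and the check names are hypothetical.

```python
def evaluate_synthesis(themes, expected_contradictions):
    """Return the names of the checks that failed for one evaluation case."""
    failed = []
    # Every theme must keep at least one link to its evidence.
    if any(not theme.get("evidence_links") for theme in themes):
        failed.append("theme_without_evidence")
    # Contradictions known to exist in the fixture must survive summarization.
    reported = {e for theme in themes for e in theme.get("contradicting_evidence", [])}
    if not set(expected_contradictions) <= reported:
        failed.append("contradiction_dropped")
    # A market-size claim without a source should have been rejected.
    if any(t.get("market_size_claim") and not t.get("market_size_source") for t in themes):
        failed.append("unsupported_market_size")
    return failed


themes = [
    {"name": "setup confusion", "evidence_links": ["int-1", "int-4"], "contradicting_evidence": ["int-2"]},
    {"name": "pricing anxiety", "evidence_links": [], "market_size_claim": "large"},
]
failed = evaluate_synthesis(themes, expected_contradictions=["int-2", "int-7"])
print(failed)
```

Because the fixture and its expected contradictions stay fixed, the same case can be re-run unchanged whenever the model, prompt, context set, or output schema changes.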

Use AI in discovery without laundering uncertainty

Discovery generates exactly the kind of material language models handle well: interview transcripts, support conversations, journey notes, behavioral patterns, and open-ended hypotheses. AI can reduce the time between collecting this material and discussing it. It cannot make a biased sample representative or turn a correlation into a cause.

Run synthesis as part of a weekly learning cadence that combines customer evidence with journey and behavioral analysis. Waiting for a large quarterly research readout increases the distance between observation and decision. Treating every new conversation as a roadmap mandate creates the opposite problem. A regular review gives the team a stable point at which evidence can accumulate, conflict, and change an existing belief.

A cluster is a lead, not a finding

Theme clustering is useful for navigation. It is not proof of importance. A frequent topic in support data may reflect product friction, a noisy customer segment, a documentation gap, or a recent incident. The model sees only the supplied dataset, not the market outside it.

Require each proposed opportunity to include:

The affected segment and the context in which the problem occurs.
The job the customer is trying to complete.
Links to supporting observations, including direct customer language where it preserves important nuance.
The observed count within the supplied dataset, clearly distinguished from prevalence in the customer base or market.
Behavioral evidence that supports or challenges the qualitative pattern.
The outcome driver the opportunity could influence.
Contradictory evidence and plausible alternative explanations.
The unanswered question that creates the greatest decision risk.
The next piece of evidence that would materially change the decision.

Then place the opportunity in an opportunity solution tree. Keep the opportunity separate from candidate solutions. If the branch says customers need an AI assistant, it has already collapsed a customer problem into a preferred implementation. Rewrite it in terms of the customer’s obstacle or desired progress, then generate multiple ways to address it.

At the weekly review, ask four practical questions: What did the team observe? Which belief changed? Which important assumption remains weakly supported? What evidence should be collected next? AI can prepare the evidence packet and show deltas from the prior review. The product trio should decide what the evidence means and whether it changes the opportunity being pursued.

Connect roadmap, experiment, launch, and learning

A strategy loses integrity when each delivery stage invents its own explanation. The roadmap promises one outcome, the experiment measures another, the launch emphasizes a feature, and the retrospective celebrates shipping. AI can help maintain the thread, but only if the same hypothesis and metric definitions travel with the work.

Decision layer	Useful AI assistance	Required human judgment	Artifact to preserve
Strategy	Check the chain from customer value to business result and expose unsupported jumps	Choose the segment, differentiation, outcome, and trade-offs	Strategy brief and driver tree
Discovery	Extract observations, cluster themes, retain contradictions, and draft opportunities	Interpret evidence and choose the next uncertainty to reduce	Evidence-linked opportunity record
Roadmap	Map candidate initiatives to drivers, surface dependencies, and prepare option comparisons	Allocate capacity and accept opportunity cost	Prioritization decision record
Experiment	Draft hypotheses, instrumentation, guardrails, edge cases, and analysis checks	Approve the test design, statistical assumptions, and decision rule	Experiment brief
Launch	Adapt release notes, in-product guidance, support material, and segment messaging	Approve claims, rollout risk, positioning, and readiness	Launch plan and approved message set
Learning	Summarize funnels, cohorts, retention patterns, qualitative feedback, and anomalies	Decide whether to continue, revise, expand, or stop	Learning review and updated decision

Make the roadmap show its reasoning

Ask AI to produce roadmap options, not a single supposedly objective ranking. Each option should show the outcome driver it targets, evidence strength, important dependencies, unresolved risk, stakeholder impact, and the work displaced by choosing it. A priority score can organize inputs, but it cannot resolve a strategic disagreement about which customer or outcome matters most.

Every roadmap item should answer: Why this customer problem, why now, what behavior should change, which business result should follow, and what observation would make the team reconsider? If the answer is merely that customers requested it or a competitor has it, the strategy is incomplete.

Make experiments decision-ready before they run

An AI-drafted experiment brief should contain a falsifiable hypothesis, eligible population, primary metric, guardrail metrics, instrumentation plan, exposure logic, expected mechanism, known confounders, and decision rule. For A/B testing, define the minimum detectable effect before the test runs, not when interpreting results. The value must be tied to a practically meaningful change and checked against baseline behavior and available traffic; a model cannot infer those constraints from a feature description.
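Checking a minimum detectable effect against baseline behavior and available traffic usually comes down to a sample-size estimate. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test; the 20% baseline, two-point lift, 5% significance level, and 80% power are illustrative assumptions, not recommendations.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


n_two_points = sample_size_per_arm(baseline=0.20, mde=0.02)
n_one_point = sample_size_per_arm(baseline=0.20, mde=0.01)
print(n_two_points, n_one_point)
```

With these inputs the estimate is roughly 6,500 users per arm, and halving the detectable effect roughly quadruples it. If eligible traffic cannot reach that number in a reasonable window, the brief needs a larger effect, a different metric, or a different test design.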

Instrumentation deserves its own review. Specify the event, properties, eligibility conditions, trigger, and expected sequence in the funnel. Use behavioral analytics to check that exposure and activation are measured consistently across variants. Feature flags can separate deployment from release, support a controlled ramp, and limit exposure while the team checks behavior.

For an AI-powered product experience, add eval-driven checks alongside product metrics. Define the behavior the model should exhibit, edge cases it must handle, unacceptable outputs, privacy constraints, and regression cases. Product success cannot compensate for a model behavior that violates an explicit safety or trust requirement.

Keep launch language tied to the original value proposition

AI can adapt UX copy, product tours, tooltips, release notes, in-app guides, and support macros for different segments. Give every channel the same approved value proposition, capability boundaries, terminology, and claims. Otherwise, speed creates message drift: the release note promises an outcome the interface does not support, while the support macro describes a different workflow again.

After release, bring the original decision brief into the learning review. Examine the target cohort, funnel behavior, activation, retention, qualitative feedback, and guardrails. Do not ask only whether the feature was adopted. Ask whether the intended customer behavior changed, whether the assumed mechanism appears credible, and whether the business outcome remains a reasonable consequence.

Scale the workflow only when another person can audit it

Before expanding AI assistance across the product organization, hand one completed decision package to a colleague who was not part of the workflow. They should be able to identify the governing strategy, trace each important claim to evidence, see which assumptions remain open, understand the trade-off, and find the metric that will trigger the next decision.

If they cannot, do not solve the problem with a longer prompt. Repair the missing artifact, unclear ownership, broken evidence link, or inconsistent metric definition. That is where strategic reliability lives.

Start with one decision entering your next weekly discovery review. Build its evidence set, label observations and assumptions, run separate synthesis and challenge passes, and publish the human decision with its reversal signal. Once that chain survives review, reuse the workflow. The goal is not more AI-generated product work. It is a shorter, more inspectable path from customer evidence to a measurable strategic choice.

May 12, 2026

How to Link AI Evals to Retention Without Chasing Proxies

Your AI activation rate is rising. More users are reaching the agent, completing setup, or trying the workflow. Yet the retention curve is flat. That usually means you know who touched the product, but not who received enough value to return.

A higher aggregate eval score will not resolve that gap. You need to identify an AI quality signal that appears early, connect it to later behavior, and determine whether changing that signal can change retention. The result should influence onboarding, roadmap priorities, customer success, and model releases, not just add another chart to an eval dashboard.

Start with the retention decision, not the eval dashboard

The wrong opening question is: Which evals can the team measure? Start with: What must a user experience early enough that returning becomes the rational next action?

That framing forces you to define retention before searching for a predictive signal. A login is rarely enough. Choose a return behavior that represents recurring value: running another workflow, completing another meaningful task, or bringing the agent into an ongoing process. Then make five decisions explicit:

Define the eligible population. Decide whether you are studying newly activated users, newly activated tenants, or another clearly bounded cohort.
Choose the unit of analysis. Use the user when value is individual. Use the tenant or account when adoption and renewal depend on a shared workflow.
Name the retained behavior. It should represent renewed value, not passive presence.
Select the retention window. Weekly and monthly cohorts answer different product questions, so do not switch between them after seeing the result.
Close the observation period before the retention outcome begins. Otherwise, later behavior can leak into the feature that supposedly predicts it.
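The last of those decisions is easy to get wrong in code, so here is a minimal sketch of non-overlapping windows. The window lengths, event names, and helper functions are illustrative assumptions. Intervals are half-open, so an event on the boundary date counts toward retention only.

```python
from datetime import date, timedelta


def build_windows(activated_on: date, observation_days: int = 7, retention_days: int = 28):
    """Return half-open [start, end) windows; retention starts where observation ends."""
    observation_end = activated_on + timedelta(days=observation_days)
    return {
        "observation": (activated_on, observation_end),
        "retention": (observation_end, observation_end + timedelta(days=retention_days)),
    }


def split_events(events, windows):
    """Assign each (day, name) event to the feature side or the outcome side."""
    def within(day, window):
        start, end = window
        return start <= day < end

    features = [e for e in events if within(e[0], windows["observation"])]
    outcomes = [e for e in events if within(e[0], windows["retention"])]
    return features, outcomes


windows = build_windows(date(2026, 1, 5))
events = [
    (date(2026, 1, 6), "workflow_completed"),
    (date(2026, 1, 12), "workflow_completed"),  # boundary day: counts as retention
    (date(2026, 1, 20), "workflow_completed"),
]
features, outcomes = split_events(events, windows)
print(len(features), len(outcomes))
```

Only the first event feeds the predictive feature. The other two belong to the outcome, so they cannot leak into the signal that is supposed to predict them.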

This distinction matters when activation improves but retention does not. Activation proves that a user crossed a product milestone. It does not prove that the AI produced a trustworthy, complete, safe, and usable outcome. Your eval candidates should measure that missing experience.

Eval family	Question it should answer	When it deserves product attention
Semantic accuracy	Did the output correctly address the intended task?	Incorrect results prevent completion or make the user unwilling to rely on the agent again.
Containment	Did the agent complete the eligible workflow without an avoidable human escalation?	Escalation prevents the workflow from delivering repeatable automation.
Safety	Did the interaction remain within the product’s acceptable risk boundaries?	A regression creates unacceptable exposure, even if another engagement metric improves.
Latency	Did the result arrive fast enough for the user’s workflow?	Delay causes abandonment, repeated attempts, or a return to the previous process.
UX friction	Could the user reach a good outcome without unnecessary setup, retries, or corrections?	Users fail before they have a fair chance to experience the agent’s value.

Shortlist three to five candidates tied to these user outcomes. A long eval inventory makes analysis look comprehensive while weakening the decision. You are not trying to find every quality problem. You are looking for an early signal that is measurable, related to meaningful retention, and alterable through a product intervention.

Build an identity and time contract before modeling anything

The hardest part is usually not the statistical model. It is joining AI interactions to product behavior without duplicating records, losing users, or assigning an outcome to the wrong account. Evals often live in notebooks or model-observability systems while retention events live in product analytics. A plausible-looking join can still be wrong.

Create a data contract that covers both systems. At minimum, it should specify:

Stable user and tenant identifiers, including the rule used when a user belongs to more than one tenant.
The timestamp that determines whether an interaction belongs inside the observation period.
The model and workflow version associated with the interaction.
The conditions that make an interaction eligible for each eval.
The grain of the analysis table, such as one row per user-day or tenant-day.
The treatment of missing data, especially the difference between no eligible interaction and an evaluated failure.

That last distinction is easy to miss. A user who never invoked an eligible workflow did not fail the accuracy eval. Combining non-use and poor quality into the same value hides whether the retention problem comes from discovery, setup, or AI performance.

Compute daily per-user and per-tenant features rather than joining every raw interaction directly to every product event. Each feature should retain its denominator or exposure count. A pass rate without the number of eligible interactions can make sparse use look equivalent to sustained use.
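
A minimal sketch of such a daily feature, assuming each interaction row carries hypothetical `user_id`, `day`, `eligible`, and `passed` fields. It keeps the denominator next to the rate and returns `None`, not zero, when a user had no eligible interaction that day.

```python
from collections import defaultdict


def daily_pass_rates(interactions):
    """Aggregate raw interactions to one feature row per (user_id, day).

    Each interaction is a dict with the assumed keys
    user_id, day, eligible (bool), and passed (bool).
    """
    counts = defaultdict(lambda: {"eligible": 0, "passed": 0})
    for row in interactions:
        bucket = counts[(row["user_id"], row["day"])]  # row exists even if nothing was eligible
        if row["eligible"]:
            bucket["eligible"] += 1
            bucket["passed"] += int(row["passed"])

    features = {}
    for key, c in counts.items():
        # None means "no eligible interaction"; 0.0 means "evaluated and failed".
        rate = c["passed"] / c["eligible"] if c["eligible"] else None
        features[key] = {
            "eligible": c["eligible"],
            "passed": c["passed"],
            "pass_rate": rate,
        }
    return features
```

Downstream analysis can then filter on `eligible` to separate sparse use from sustained use instead of treating the two pass rates as equivalent.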

Keep the definition of each feature readable. Containment, for example, needs an explicit eligible-workflow denominator and an explicit rule for what counts as avoidable escalation. UX friction needs named events, such as a retry or correction, rather than an opaque composite score. If a product manager cannot explain how the feature changes, the team will struggle to turn it into a roadmap decision.

Watch for many-to-many joins. One AI interaction may generate several product events, and one product session may contain several AI interactions. Joining both raw tables can multiply rows and inflate success or failure counts. Aggregate each side to the agreed grain first, then join the resulting features to the retention cohort.
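
The row multiplication is easy to reproduce with a toy example. The sketch below uses plain Python and assumed field names (`user_id`, `day`); a real pipeline would do the same thing in SQL or a dataframe library. It contrasts a raw join with aggregating each side to the user-day grain first.

```python
from collections import Counter


def naive_join_row_count(ai_rows, event_rows):
    """Count rows produced by joining raw tables on (user_id, day)."""
    return sum(
        1
        for a in ai_rows
        for e in event_rows
        if (a["user_id"], a["day"]) == (e["user_id"], e["day"])
    )


def aggregate_then_join(ai_rows, event_rows):
    """Aggregate each side to one row per (user_id, day), then join."""
    ai = Counter((r["user_id"], r["day"]) for r in ai_rows)
    events = Counter((r["user_id"], r["day"]) for r in event_rows)
    return {
        key: {"ai_interactions": ai.get(key, 0), "product_events": events.get(key, 0)}
        for key in ai.keys() | events.keys()
    }


# Three AI interactions and two product events for the same user-day.
ai_rows = [{"user_id": "u1", "day": "2026-01-01"}] * 3
event_rows = [{"user_id": "u1", "day": "2026-01-01"}] * 2
```

With this input the raw join yields six rows for what was really three interactions and two events, while the aggregated join yields one row that preserves both counts.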

Versioning also matters. If a model or workflow changes during the observation period, an account-level average can blend materially different experiences. Preserve the version so you can distinguish a real quality improvement from a change in traffic or segment mix.

Find a threshold that survives segment and leakage checks

Once the dataset is reliable, begin with cohort analysis rather than a complex predictive model. Compare retention among users who reached different levels of each candidate signal. You are looking for a separation that is large enough to matter, stable enough to repeat, and reachable through product changes.

Use this sequence:

Plot weekly or monthly retention against each early eval feature.
Use a driver tree to show where the feature sits between acquisition, activation, AI quality, repeat behavior, and the final retention outcome.
Fit a simple logistic model that controls for plan type, segment, region, and acquisition channel.
Repeat the analysis inside important segments instead of relying only on the blended population.
Check whether the threshold remains directionally useful when you vary the observation definition without allowing it to overlap the outcome.

The controls are not statistical decoration. Higher-plan customers may have better implementation support. One region may contain a different account mix. A high-intent acquisition channel may produce both better agent usage and better retention. Without those checks, customer composition can masquerade as model improvement.

In one product context, users who crossed a specific eval threshold early showed three times higher retention than peers who did not. That is evidence that an eval can become a commercially useful leading indicator. It is not a universal benchmark. Your threshold, effect size, eligible population, and retention behavior will depend on your product.

Do not choose the threshold merely because it creates the largest visual gap. Prefer a boundary that has enough eligible users on both sides, persists across relevant segments, and corresponds to an experience the product can influence. A dramatic ratio from a small cohort is a hypothesis, not a roadmap mandate.
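
A minimal sketch of that rule, assuming each cohort member is represented as a `(feature_value, retained)` pair. The `min_cohort` default of 30 is an arbitrary placeholder, not a recommendation; the point is that a candidate threshold is rejected outright when either side is too small to compare.

```python
def retention_by_threshold(rows, threshold, min_cohort=30):
    """Compare retention above and below a candidate threshold.

    rows: list of (feature_value, retained_bool) pairs.
    Returns None when either side has fewer than min_cohort members.
    """
    above = [retained for value, retained in rows if value >= threshold]
    below = [retained for value, retained in rows if value < threshold]
    if len(above) < min_cohort or len(below) < min_cohort:
        return None  # too sparse: treat as a hypothesis, not a finding

    rate_above = sum(above) / len(above)
    rate_below = sum(below) / len(below)
    return {
        "n_above": len(above),
        "n_below": len(below),
        "rate_above": rate_above,
        "rate_below": rate_below,
        "difference": rate_above - rate_below,
    }
```

Running the same function once per plan, region, or acquisition channel gives a quick view of whether the separation persists inside segments or only in the blended population.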

Run an explicit leakage review before presenting the result. Common forms include an eval feature calculated after the retention window begins, an account-health field that already contains renewal information, or a usage feature whose value can only rise when the user returns. Leakage can make a weak signal look uncannily predictive.
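
The first form of leakage can be checked mechanically. The sketch below assumes each feature row records the timestamp of the latest event it used (`latest_event_at`, a hypothetical field) and that the start of each unit's retention window is known. It catches only timing leakage; fields that already encode renewal information still need a manual review.

```python
from datetime import datetime


def find_timing_leaks(feature_rows, retention_start_by_unit):
    """Return (unit_id, feature) pairs whose inputs reach into the retention window.

    feature_rows: dicts with assumed keys unit_id, feature, latest_event_at.
    retention_start_by_unit: dict mapping unit_id to the retention window start.
    """
    leaks = []
    for row in feature_rows:
        starts_at = retention_start_by_unit[row["unit_id"]]
        if row["latest_event_at"] >= starts_at:
            leaks.append((row["unit_id"], row["feature"]))
    return leaks
```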

The decision artifact should show the cohort definition, feature window, retention window, cohort sizes, effect estimate, control variables, and segment sensitivity together. If the threshold only works for a particular plan or acquisition channel, say so. A narrow, honest signal is more actionable than a broad result that disappears when the mix changes.

Use experiments to separate a predictor from a product lever

A predictive eval signal is not automatically causal. Sophisticated users may configure the agent better, choose easier workflows, or persist through early friction. Their higher eval scores and higher retention may share the same cause. Improving the score will not necessarily reproduce their behavior.

Convert the signal into a testable product intervention:

Choose an intervention that can move the signal during the early observation period. Depending on the failure, that could be an in-app guide, a product tour, a setup change, or a model change behind a feature flag.
Keep the threshold definition fixed for the experiment. Redefining success after seeing the result turns the test into another exploratory analysis.
Predefine the retained behavior, retention window, target population, and second-order guardrails.
Use a minimum detectable effect calculation to determine whether the experiment can answer the question with the available population.
Run an A/B test where randomization is practical. Measure whether the intervention moves the eval signal and whether that movement is followed by the intended retention lift.
Inspect results by the same segments used in the observational analysis. A blended win can hide a regression for a strategically important group.
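
For the minimum detectable effect step, a rough sizing check can be done with the standard normal-approximation formula for comparing two proportions. The sketch below is an approximation under that assumption, with conventional defaults of a 5% two-sided significance level and 80% power; the function names are illustrative, and a real experiment may need a different test or corrections for multiple segments.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, target, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect baseline -> target retention."""
    if baseline == target:
        raise ValueError("target must differ from baseline")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_power) ** 2 * variance / (target - baseline) ** 2)


def can_answer(available_per_arm, baseline, target, **kwargs):
    """True when the eligible population can plausibly detect the effect."""
    return available_per_arm >= sample_size_per_arm(baseline, target, **kwargs)
```

If `can_answer` returns `False` for the smallest lift that would justify the work, the experiment cannot settle the question with the current population, and that is worth knowing before launch.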

This creates a necessary chain of evidence: the intervention changed the early experience, the early eval feature moved, and the retention outcome moved in the expected direction. If retention improves without movement in the eval, your intervention may work through another mechanism. If the eval improves without retention, the signal is not yet a proven growth lever.

Treat safety differently from an ordinary optimization metric. A retention increase cannot compensate for an unacceptable safety regression. Use risk scoring to gate exposure, keep model changes behind feature flags until the required evals pass, and monitor anomalies in both the score and its eligible volume. A stable percentage on a collapsing sample is not stability.
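
A minimal sketch of monitoring the score and its eligible volume together. The trailing-mean baseline and both tolerances (`max_rate_drop`, `min_volume_ratio`) are hypothetical placeholders; a production system would likely use a proper anomaly-detection method.

```python
from statistics import mean


def check_eval_health(history, today, max_rate_drop=0.05, min_volume_ratio=0.5):
    """Flag a drop in pass rate or a collapse in eligible volume.

    history: list of (pass_rate, eligible_count) tuples for prior days.
    today: one (pass_rate, eligible_count) tuple.
    """
    baseline_rate = mean(rate for rate, _ in history)
    baseline_volume = mean(count for _, count in history)
    rate, volume = today

    alerts = []
    if baseline_rate - rate > max_rate_drop:
        alerts.append("rate_drop")
    if volume < min_volume_ratio * baseline_volume:
        alerts.append("volume_collapse")  # a stable rate on a shrinking sample
    return alerts
```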

Track support tickets, NPS, and Net Revenue Retention alongside the primary retention result. These measures operate on different timelines, but they help catch proxy optimization. An intervention that pushes users across an eval threshold while increasing support burden or degrading customer sentiment has not produced a clean product win.

Separate the user-level and release-level uses of the signal. A user-level signal can trigger onboarding or customer-success help when a new account has not reached the value threshold. A release-level eval can prevent a model change from expanding when quality falls. Combining both into one vague health score makes ownership and response unclear.

Put the winning signal into the product operating system

The analysis matters only when it changes what happens next. Give the signal a definition, an owner, an intervention, and a response to regression.

For onboarding, guide new users toward the workflow conditions associated with crossing the threshold. Do not merely show them where the AI button is.
For customer success, add the signal to a health score only when the team has a specific action to take. A warning without a playbook creates dashboard noise.
For roadmap planning, require proposed work to identify which eval feature it should move, why that feature connects to retention, and how the effect will be tested.
For model releases, keep exposure controlled with feature flags until the relevant eval improves without violating safety or experience guardrails.
For monitoring, use anomaly detection on the eval value, eligible interaction volume, and important segments so a blended average does not conceal a regression.

This operating model also clarifies ownership. Product owns the intervention and decision. Data science owns the validity of the feature and analysis. Engineering owns reliable instrumentation and release controls. Customer success owns the response when an account-level signal indicates missed value. Those responsibilities can be distributed differently in your organization, but none should be implicit.

Key takeaways

Define the retained behavior, population, unit, and time window before selecting an eval.
Shortlist three to five eval candidates that describe real user value: accuracy, containment, safety, latency, or UX friction.
Aggregate reliable daily features with stable user and tenant identifiers before joining them to product cohorts.
Use cohort analysis, driver trees, and simple controlled models to find a predictive threshold, then check sample size, segment mix, and label leakage.
Use an A/B test to learn whether a product intervention can move both the eval signal and retention.
Operationalize a validated signal through onboarding, customer success, release gates, feature flags, and anomaly detection.

At your next product review, bring a short decision sheet with the retained behavior, observation window, no more than five candidate evals, join keys, and the first intervention you can test. If the team cannot fill in those fields, fix the analytics contract first. If it can, run the smallest credible experiment and let retained behavior, not a prettier eval dashboard, decide the roadmap.

References

Amplitude — The Surprising Eval Signal That Tripled Retention: How I Connected AI Evals to Product KPIs

May 7, 2026

Amplitude MCP: Evidence-Grounded AI Workflows for Product Teams

An AI assistant can produce a convincing roadmap recommendation or code patch before you have established what users actually did. That speed feels productive until a confident answer turns an instrumentation gap, a rare edge case, or a coincidental sequence into a product decision.

Amplitude MCP is most useful when it reverses that order. The assistant retrieves behavioral evidence first, labels what is observed versus inferred, proposes a bounded action, and defines how the result will be verified. You still make the decision and own the release, but you spend less time moving context between analytics, product documents, Session Replay, and the development environment.

Key takeaways

Treat Amplitude MCP as an evidence-retrieval layer, not an automated decision-maker. Access to analytics does not make every conclusion valid.
Require every response to separate observed behavior, inferred explanations, proposed actions, and verified outcomes.
Use aggregate analytics to establish prevalence and affected segments, Session Replay to understand the journey, and code-level tests to validate a technical explanation.
End product workflows with a decision brief and engineering workflows with a reproducible test, a controlled release plan, and post-release behavioral verification.
Begin with a narrow, high-value workflow. Apply least-privilege access, redact sensitive data, and evaluate retrieval accuracy, analytical discipline, latency, and business usefulness before expanding.

Create an evidence contract before asking for a recommendation

An MCP connection can make evidence accessible, but it cannot decide whether your event taxonomy is reliable, whether a cohort is appropriate, or whether a pattern is causal. Amplitude MCP can let an assistant request behavioral context such as funnels, cohorts, segments, and user journeys as needed. Your workflow still has to constrain what is retrieved and how it may be interpreted.

The practical control is an evidence contract: a short specification for the question, the permitted data, the expected output, and the point at which the assistant must stop. Write it before asking for a recommendation. Otherwise, the assistant can silently change the population, comparison, or definition while producing an answer that sounds coherent.

Decision: State the exact choice the analysis is meant to inform. “Improve onboarding” is a theme; “decide which onboarding step needs further investigation” is a decision.
Population: Name the relevant segment, account type, lifecycle stage, product surface, or release exposure. Do not let the assistant substitute all users because that query is easier.
Behavior definition: Specify the events or funnel that represent the outcome. If activation, retention, or failure has no agreed event definition, resolve that ambiguity before interpreting results.
Comparison: Define the cohort, release, segment, or other baseline against which a difference should be assessed.
Permitted evidence: List the analytics views, event paths, Session Replays, error details, and code context the assistant may use.
Required traceability: Make the assistant identify the query, event definition, segment, and replay behind each material observation.
Abstention rule: Require the assistant to say when missing instrumentation, insufficient data, or conflicting evidence prevents a conclusion.

A reusable prompt can be direct: “Analyze [outcome] for [segment] using [funnel, cohort, or event path]. Use [comparison] as the baseline. For every conclusion, identify the supporting query or replay. Return observed facts, data limitations, hypotheses, next retrievals, recommended action, and a verification plan. If the evidence is insufficient, state what is missing instead of filling the gap.”
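
To keep the bracketed slots from being left vague, the prompt can be rendered from a small structure that refuses empty fields. This is a sketch only: `EvidenceContract` and its field names are hypothetical, and it builds a prompt string without calling any Amplitude or MCP API.

```python
from dataclasses import dataclass


@dataclass
class EvidenceContract:
    # Hypothetical fields mirroring the bracketed slots in the prompt.
    outcome: str
    segment: str
    evidence_view: str  # funnel, cohort, or event path
    comparison: str

    def to_prompt(self) -> str:
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"evidence contract is missing: {name}")
        return (
            f"Analyze {self.outcome} for {self.segment} using {self.evidence_view}. "
            f"Use {self.comparison} as the baseline. "
            "For every conclusion, identify the supporting query or replay. "
            "Return observed facts, data limitations, hypotheses, next retrievals, "
            "recommended action, and a verification plan. "
            "If the evidence is insufficient, state what is missing instead of filling the gap."
        )
```

Raising on an empty field forces the population and comparison to be decided by a person before the assistant sees the question.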

The labels matter. Without them, a behavioral sequence can become a supposed root cause within one paragraph. Use the following distinction in product investigations, incident work, and roadmap analysis:

Layer	What belongs here	What must support it
Observed	An event pattern, funnel difference, cohort trend, replayed interaction, error, or test result	A traceable query, event timeline, replay, log, or test output
Inferred	A plausible explanation for the observed behavior	Supporting and conflicting evidence, plus assumptions that remain unverified
Proposed	An instrumentation change, discovery step, experiment, code change, or rollout action	A stated rationale, expected effect, risk, and owner
Verified	A conclusion that the intervention produced the intended result without an unacceptable regression	Post-change tests and behavioral evidence using definitions consistent with the original investigation
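
The table can double as a review checklist. The sketch below assumes the assistant's output has been parsed into claims, each a dict with a `layer`, a `text`, and support fields; every field name here is hypothetical. It flags any claim that lacks the support its layer requires.

```python
# Support each layer must carry, following the table above (field names are assumed).
REQUIRED_SUPPORT = {
    "observed": ["evidence_refs"],
    "inferred": ["supporting_evidence", "conflicting_evidence", "assumptions"],
    "proposed": ["rationale", "expected_effect", "risk", "owner"],
    "verified": ["post_change_evidence"],
}


def review_claims(claims):
    """Return (claim_text, problem) pairs for claims missing required support."""
    problems = []
    for claim in claims:
        text = claim.get("text", "")
        layer = claim.get("layer")
        if layer not in REQUIRED_SUPPORT:
            problems.append((text, "unknown layer"))
            continue
        for field in REQUIRED_SUPPORT[layer]:
            # Missing or empty counts as unsupported; record "none found" explicitly.
            if not claim.get(field):
                problems.append((text, f"missing {field}"))
    return problems
```

A claim such as "the copy is confusing" labeled as observed, with no query or replay behind it, would be returned with `missing evidence_refs`.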

This structure does more than improve prompt quality. It makes reviews faster. A product manager can challenge the population, an analyst can challenge the event definition, and an engineer can challenge the technical hypothesis without reopening the entire conversation.

Turn product questions into bounded analytics tasks

Broad questions invite broad stories. “Why is activation down?” asks the assistant to choose the definition, locate a pattern, infer a cause, and recommend a solution in one leap. Break that work into retrieval, interpretation, and decision stages instead.

Find an activation blocker without inventing causality

Suppose you need to determine which onboarding step deserves attention for an SMB segment. Behavioral analytics can locate where journeys diverge, while Session Replay can show what happened around that point. Neither alone proves why the behavior occurred.

Define activation. Name the event or event sequence that represents the outcome. If stakeholders use different definitions, surface that disagreement rather than averaging it away.
Fix the population and comparison. Specify the SMB segment and the cohort, release, or successful journey against which it should be compared.
Retrieve the funnel or event path. Ask for the event definitions as well as the result. An unexplained event name is not enough to support a decision.
Locate the observed divergence. Identify where completion or progression differs. Call it a divergence, not a cause or even a blocker yet.
Inspect contrasting journeys. Review unsuccessful and successful Session Replays around the same step. Capture UI state, preceding actions, environment details, errors, and unexpected loops.
Generate competing hypotheses. Include product friction, technical failure, user intent, and instrumentation error where each is plausible. Ask what evidence would weaken each explanation.
Choose the next action that matches the evidence. That may be additional instrumentation, customer discovery, a controlled experiment, a targeted technical investigation, or a product change. The assistant should not default to shipping.
Write the decision record. Preserve the query, segment, replay references, observed facts, unresolved uncertainty, chosen action, and verification signal.

Do not let the assistant jump from “fewer users completed this step” to “the copy is confusing.” The first statement may be observable. The second is a hypothesis that needs corroboration. This distinction is the difference between faster analysis and faster rationalization.

Use behavioral context to sharpen roadmap decisions

Behavioral evidence can show whether a problem appears in real journeys, which segments encounter it, and how the surrounding path differs. It does not determine strategic importance, implementation cost, contractual commitments, regulatory exposure, or the opportunity cost of displacing other work. Those remain product leadership inputs.

Ask the assistant to produce an opportunity brief rather than a priority score. The brief should contain:

The outcome and user segment under consideration
The observed behavior and the exact analytics definition behind it
The prevalence and journey context the available evidence can support, without pretending that frequency equals severity
Successful paths or unaffected segments that provide counterevidence
Known data-quality limitations
Competing explanations and what would distinguish them
The smallest useful discovery, instrumentation, experiment, or delivery step
The signal that would cause you to continue, revise, or stop

This format is particularly useful for activation and retention work because it prevents a familiar category error: an analytics pattern describes behavior, while a roadmap decision combines that behavior with strategy, feasibility, risk, and judgment. Amplitude MCP can improve the behavioral part of the decision without pretending to own the whole decision.

Close the engineering loop from customer signal to verified fix

Code generation is only the middle of a debugging workflow. The more important sequence is evidence, reproduction, hypothesis, failing test, bounded change, controlled release, and verification. Amplitude MCP helps connect the customer side of that sequence to Claude or Cursor, but a plausible diff is not a completed investigation.

From a customer report to a reproducible failure

A support ticket usually contains a symptom. Turn it into an evidence packet before asking the coding assistant for a fix.

Establish impact. Use behavioral analytics to find affected segments, related anomalies, and comparable successful journeys. This tells you whether you are investigating an isolated path or a broader degradation.
Reconstruct the experience. Use Session Replay to capture the sequence of actions, UI state, environment, and the moment the behavior diverged. Preserve timestamps for relevant console errors or API failures.
State expected versus actual behavior. Do not make the coding assistant infer the product requirement from the failure.
Provide constraints. Include known dependencies, release exposure, rate limits, feature-flag state, and any code areas that must not change.
Ask for hypotheses before a patch. Require a list of candidate causes, supporting evidence, contradictory evidence, and missing instrumentation.
Request the smallest failing test. Whenever feasible, reproduce the failure in a test before accepting a code change. If urgent containment is necessary, record it separately from the durable fix.
Validate locally and through CI/CD. A generated test or patch still needs human review and the normal engineering checks.
Release behind a feature flag where appropriate. Limit exposure while verifying the behavior in production.
Verify with the original signals. Re-run the relevant analytics, inspect post-change replays, and monitor related behavioral and performance indicators before increasing exposure.

This workflow can turn a replayed customer problem into reproduction steps, a root-cause hypothesis, a minimal failing test, and a controlled verification plan. The human owner still decides whether the evidence is sufficient, whether the patch is safe, and whether the rollout should continue.

A useful debugging prompt is: “Reconstruct the observed sequence from this replay and event timeline. Separate facts from suspected causes. Identify missing instrumentation. Propose the smallest failing test and the narrowest relevant patch surface. State what post-release evidence would confirm or falsify the fix.”

A passing test proves that the code behaves under the conditions represented by that test. It does not prove that the affected customer journey is repaired. That is why the workflow returns to behavioral evidence after deployment.

From a code symptom back to customer impact

Sometimes the investigation begins with a flaky test, a suspicious diff, or a performance regression. In that direction, the assistant first maps possible failure modes and critical code paths. Amplitude then helps answer whether real users reach those paths, under which conditions, and with what observable consequences.

Give the assistant the test failure, diff, or performance symptom and ask it to enumerate the affected code paths.
Translate those paths into observable events, screens, releases, or journey conditions. If no observable signal exists, add instrumentation before making a product-impact claim.
Retrieve matching behavioral patterns and inspect replays that support and contradict the suspected failure.
Separate technical correctness from operational priority. A real defect may have limited observed reach; a common path may still be functioning correctly.
Implement and test the narrowest justified change.
After release, monitor the original journey, relevant errors, and performance measures such as Web Vitals before ramping the flag.

Frequency must not become the only severity test. Security, privacy, data-integrity, and irreversible-loss risks can demand action even when behavioral analytics shows few affected sessions. Use analytics to understand exposure, not to override the appropriate risk process.

Scale only after retrieval and governance earn trust

The strongest rollout begins with one recurring question, not unrestricted access to every project and replay. Activation blockers and bug triage are good candidates because the input, evidence, decision, and verification artifacts can all be made explicit. Start with a high-value, lower-risk dataset and expand only after the workflow performs reliably.

Make access narrower than the assistant’s capability

Session Replay and event data can contain sensitive customer context. An MCP connection does not remove the obligations attached to that data. Apply the same access rules inside the AI workflow that apply in the analytics product, then reduce exposure further where the task does not require it.

Begin with read-only retrieval for the selected workflow.
Limit access to the relevant projects, datasets, and replay permissions supported by your access model.
Redact sensitive fields before the data reaches either replay or the assistant.
Send the minimum context necessary for the task. Prefer event identifiers, stack traces, test cases, and bounded timelines over raw personally identifiable information.
Keep analytics retrieval, code modification, and deployment authority separate. Successful retrieval is not a reason to grant release permissions.
Preserve the query and evidence references behind material decisions so a reviewer can reconstruct what the assistant saw.
Treat a replay link as governed customer data, not as a generic attachment that can be copied into any conversation.
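
A minimal sketch of the minimum-context rule, assuming events arrive as flat dicts and that an allowlist of fields has been agreed for the workflow. The regular expression covers only email-like strings and is not a complete PII detector; real redaction should happen upstream, before data reaches replay or the assistant.

```python
import re

# Illustrative pattern for email-like strings only; not a full PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace email-like substrings with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def minimal_context(event: dict, allowed_fields: set) -> dict:
    """Keep only allowlisted fields and redact any string values among them."""
    return {
        key: redact(value) if isinstance(value, str) else value
        for key, value in event.items()
        if key in allowed_fields
    }
```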

These controls reflect a practical privacy-by-design rule: include only the information needed to reach the fix and favor structured technical artifacts over raw PII. If the workflow cannot answer a question within those boundaries, the correct result may be escalation to an authorized person rather than broader automated access.

Evaluate the workflow, not just the prose

A polished response is a weak success criterion. Build an evaluation set from representative work and include cases where the answer is easy, ambiguous, unsupported by current instrumentation, and blocked by permissions. The assistant should succeed by reaching the right conclusion or by refusing to overstate what the evidence supports.

Retrieval correctness: Did it use the intended project, event definitions, segment, comparison, and available time scope?
Traceability: Can a reviewer follow every material observation back to a query, replay, error, or test?
Analytical discipline: Did it distinguish behavioral association from cause and identify counterevidence?
Action quality: Is the proposed next step bounded, testable, and proportionate to the evidence?
Abstention quality: Did it stop when data was missing, permissions were insufficient, or the available evidence conflicted?
Latency: Did the workflow reduce time spent finding and transferring context without adding review overhead elsewhere?
Business usefulness: Did the evidence improve the decision, reproduction, or verification outcome rather than merely shorten the response?
Governance: Did retrieval stay within approved access and data-handling boundaries?

Classify failures by layer. A wrong segment is a retrieval failure. An unsupported causal claim is an interpretation failure. An oversized code rewrite is an action failure. Exposure of unnecessary customer data is a governance failure. That classification tells you whether to change permissions, analytics definitions, prompts, review rules, or the underlying product instrumentation.
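
One simple way to record this during an evaluation run is to tag each reviewed case with at most one failure layer and tally the results. The sketch below assumes a hypothetical list of case records with a `failure` field; the layer names follow the paragraph above.

```python
from collections import Counter
from enum import Enum


class FailureLayer(Enum):
    RETRIEVAL = "retrieval"            # e.g. wrong segment
    INTERPRETATION = "interpretation"  # e.g. unsupported causal claim
    ACTION = "action"                  # e.g. oversized code rewrite
    GOVERNANCE = "governance"          # e.g. unnecessary customer data exposed


def tally_failures(cases):
    """Count failures per layer; cases without a failure are skipped."""
    return Counter(case["failure"] for case in cases if case.get("failure"))
```

The tally shows where most failures concentrate, which indicates whether to look first at access, definitions, prompts, or review rules.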

Use a narrow adoption sequence

Choose one repeated workflow with a visible evidence trail, such as activation analysis or production bug triage.
Record how the workflow operates without MCP, including where context is lost and which handoffs cause rework.
Define the evidence contract, approved access, expected artifact, and human decision gate.
Run representative cases and record retrieval, interpretation, action, and governance failures.
Standardize the prompts, evidence packet, and review checklist only after the failure patterns are understood.
Measure time-to-insight, decision usefulness, and engineering outcomes without assuming that faster responses mean better decisions.
Expand to retention analysis, roadmap shaping, or experiment generation only when the narrow workflow remains traceable and safe.

For incident and engineering use cases, preserve root causes and guardrails as docs-as-code so the next investigation can retrieve known failure patterns instead of rediscovering them. Watch change lead time and deployment frequency alongside stability; speed that produces more regressions is not an improvement.

Start with one decision your team faces repeatedly. Define what the assistant may observe, how it must label inference, who approves the action, and what evidence will verify the result. If it cannot show that chain, it is not ready to influence the decision. If it can, Amplitude MCP becomes more than a convenient connector: it becomes part of a disciplined evidence loop between product behavior and execution.

May 6, 2026

5 Proven Agent Skills I Use to Automate Weekly Product Reviews with Claude, Cursor, and Codex

Weekly product reviews are where strategy meets execution, and over the past year I’ve turned them into a high-signal, low-friction ritual by leaning on agentic AI. As VP of Product Management at HighLevel, Inc., I’ve standardized a set of agent skills that compress preparation time, surface the right insights, and keep PMs, engineers, and designers focused on decisions—not document wrangling.

"Learn how our teams use agent skills with Claude, Cursor, and Codex to run product reviews as PMs, engineers, and designers. Here are 5 killer use cases for builders."

Below, I walk through the five skills I rely on most in our weekly cadence—each one mapped to a clear product management outcome. They’re simple to set up, easy to govern, and aligned with core practices like continuous discovery, product roadmapping and sprint planning, and eval-driven development.

Skill 1 — Backlog triage with signal extraction: I point an agent at fresh tickets, customer notes, and experiment results to cluster themes, tag impact, and flag regressions. Using a retrieval-first pipeline and Agent Analytics, the assistant ranks items by value, effort, and risk so our meeting starts with a prioritized, explainable shortlist instead of a raw queue.

Skill 2 — PRD and spec synthesizer: Ahead of the review, an agent drafts a one-page PRD update from design diffs, git history, and decision logs. With Claude Code and Cursor, it highlights interface changes, acceptance criteria, and open questions, linking back to sources. The result is a crisp, auditable brief that keeps product trios aligned without re-litigating context.

Skill 3 — Experiment and metrics analyzer: An analytics agent pulls A/B testing readouts, checks minimum detectable effect assumptions, and annotates anomalies. It turns raw telemetry into a narrative: what moved, by how much, and whether we trust it. This makes our discussion about tradeoffs, not spreadsheets, and speeds commitments on next steps.

Skill 4 — Voice-of-customer synthesizer: The assistant clusters interviews, support threads, and NPS verbatims into jobs-to-be-done and pain themes. It proposes opportunity solution tree updates and calls out places where our roadmap diverges from customer signal. That keeps continuous discovery alive in the room—even when time is tight.

Skill 5 — Roadmap and sprint planning co-pilot: After decisions, an agent converts outcomes into scoped backlog items, engineering tasks, and stakeholder updates. It drafts sprint goals, flags dependency risks, and aligns work to objectives. Because it’s grounded in the meeting record, it preserves intent while removing ambiguity.

Under the hood, prompt engineering patterns and guardrails keep these workflows predictable: a retrieval-first pipeline for context, eval-driven development for quality checks, and role-specific prompts for PMs, engineers, and designers. With Claude Code I generate structured diffs and test scaffolds; with Cursor I accelerate code-review summaries; and with Codex I bootstrap utility scripts that keep the loop tight between insights and implementation.

The payoff is tangible: higher decision velocity, fewer meetings to “re-clarify,” and clearer accountability across the product organization. Just as important, governance and privacy-by-design are built in—every agent logs rationale, cites sources, and respects data boundaries—so leaders can scale AI workflows confidently.

If you’re looking to level up your product reviews, start with these five skills, measure impact with Agent Analytics, and iterate. Small automations compound quickly, and the more consistently you run them, the more your team’s attention shifts from preparing content to making better product decisions.

Inspired by this post on Amplitude – Perspectives.

May 4, 2026

How to Build a Reliable WhatsApp AI Ordering Agent

You are not really deciding whether an LLM can chat about a menu. You are deciding whether it can turn a messy WhatsApp exchange into a correct, payable order without making the customer or venue staff repair its work.

That distinction changes the product. The hard parts are structured order state, deterministic commerce operations, response time, failure recovery, and venue-specific evaluation. Get those right and WhatsApp can become a genuine ordering channel. Get them wrong and you have a fluent chatbot sitting in front of an unreliable transaction.

Key takeaways

- Define success as a confirmed, recoverable order in the system of record, not a conversation that sounded helpful.
- Let the model interpret customer language, but keep menu data, prices, modifiers, delivery eligibility, payment state, and order commits behind deterministic tools.
- Store the current order as structured state outside the transcript. A conversation is evidence of intent, not an order ledger.
- Measure useful response time across the complete WhatsApp-to-POS path, then remove tool round trips and parallelize safe read operations.
- Make item identification accuracy the primary trust metric, supported by guardrails for modifiers, payments, duplicate submissions, handoffs, and latency.
- Evaluate every venue against its real menu and rules, then turn recurring configuration, tests, and operating procedures into reusable templates.

Define the product around a completed order

WhatsApp is the interface, not the product boundary. The product boundary should run from the customer’s first request to an order state that the venue can fulfill and the customer can verify.

A useful benchmark is the end-to-end flow implemented by AITropos: recommendations, item modifiers, delivery-zone checks, payment links, and status updates inside WhatsApp. Covering the whole journey matters because every missing step creates a handoff. A bot that recommends a meal but cannot resolve its required modifiers is a discovery feature. A bot that drafts an order but cannot verify submission is an assistant. Neither is yet an autonomous ordering agent.

Write an order contract before choosing models or orchestration frameworks. The contract is the minimum structured state required to fulfill, charge for, recover, and audit an order. It will usually include:

- The venue and the applicable menu version.
- Canonical item identifiers, quantities, and customer-facing item names.
- Required and optional modifier selections, represented by identifiers rather than prose alone.
- Fulfillment method, such as pickup or delivery.
- The validated delivery result when delivery is requested.
- A system-generated quote, including the values the customer must approve before payment or submission.
- Payment-link and payment states, without treating a generated link as proof of payment.
- Customer confirmation state, POS submission state, and the resulting order identifier.
- The current owner of the interaction: agent, venue staff, or a defined recovery process.

The contract gives product, engineering, operations, and venue teams the same definition of done. It also exposes where autonomy is not yet safe. If the integration cannot validate a delivery zone, for example, the agent should collect the address and hand the order to a person. It should not infer eligibility from a conversational guess.
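
To make the contract concrete, here is a minimal sketch of how it might look as typed state. The class names, field names, and the `ready_to_commit` rule are my own illustrative assumptions, not AITropos's schema. Whether payment must complete before submission depends on the venue, so that check is deliberately left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Owner(Enum):
    AGENT = "agent"
    STAFF = "venue_staff"
    RECOVERY = "recovery_process"


class PaymentState(Enum):
    NOT_REQUESTED = "not_requested"
    LINK_CREATED = "link_created"  # a link exists; this is NOT proof of payment
    PAID = "paid"
    FAILED = "failed"


@dataclass
class OrderLine:
    item_id: str        # canonical identifier from the menu (hypothetical format)
    display_name: str   # customer-facing name
    quantity: int
    modifier_ids: list[str] = field(default_factory=list)


@dataclass
class OrderContract:
    venue_id: str
    menu_version: str
    lines: list[OrderLine]
    fulfillment: str                            # "pickup" or "delivery"
    delivery_validated: Optional[bool] = None   # None until a delivery check has run
    quote_id: Optional[str] = None              # issued by the commerce system
    payment_state: PaymentState = PaymentState.NOT_REQUESTED
    customer_confirmed: bool = False
    pos_order_id: Optional[str] = None
    owner: Owner = Owner.AGENT

    def ready_to_commit(self) -> bool:
        """True only when the assumed preconditions for POS submission hold."""
        delivery_ok = self.fulfillment == "pickup" or self.delivery_validated is True
        return (
            bool(self.lines)
            and delivery_ok
            and self.quote_id is not None
            and self.customer_confirmed
            and self.pos_order_id is None   # not already submitted
        )
```

A delivery order with an unvalidated address never reports `ready_to_commit()`, which is the point at which the agent should hand off instead of guessing.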

| Order stage | The agent’s job | Condition before proceeding |
| --- | --- | --- |
| Discover | Map natural language to menu candidates and explain relevant options. | One supported item is identified, or the agent asks a specific clarifying question. |
| Configure | Capture quantity, required modifiers, exclusions, and additions. | Every required choice is present and valid for that item. |
| Fulfillment | Resolve pickup or delivery and call the applicable eligibility checks. | The requested fulfillment method is supported for this order. |
| Quote and payment | Retrieve the authoritative quote and create the approved payment flow. | Prices and payment state come from the commerce system, not generated text. |
| Commit | Present the structured summary and submit the confirmed order once. | The customer has confirmed the current version and the POS returns a result. |
| Status and recovery | Report system-backed status or transfer the interaction with its context intact. | The response is tied to an order identifier or an explicit handoff owner. |

Pay particular attention to the acceptance boundary. A friendly message such as “your order is being prepared” is an operational commitment. It must only appear after the system of record has accepted the order. If submission times out or returns an ambiguous result, the safe response is that confirmation is still pending, followed by a status check or human recovery. Guessing success can create duplicate orders, missed orders, and payment disputes.
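
One way to enforce the acceptance boundary is to derive the customer message from the submission result rather than from the model's own wording. The sketch below is an assumption-laden illustration: `SubmitResult`, `customer_message`, and the message text are hypothetical names and copy, not an existing API.

```python
from enum import Enum
from typing import Optional


class SubmitResult(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    UNRESOLVED = "unresolved"  # timeout or ambiguous POS response


def customer_message(result: SubmitResult, order_id: Optional[str] = None) -> str:
    """Only an accepted result with an order identifier may claim confirmation."""
    if result is SubmitResult.ACCEPTED and order_id:
        return f"Your order {order_id} is confirmed and being prepared."
    if result is SubmitResult.REJECTED:
        return "The order was not accepted. I can help you change it or connect you with staff."
    # UNRESOLVED, or ACCEPTED without an identifier: never guess success.
    return "I have not received confirmation yet. I am checking the status and will update you."
```

An accepted result that arrives without an order identifier is treated as pending, because the response could not be tied back to the system of record.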

You can still launch with partial automation, but name it accurately. Menu search, order drafting, and staff-assisted submission can deliver value while the integrations mature. The mistake is allowing the customer to believe the order was accepted when the product has only generated a plausible summary.

Keep the order deterministic even when the conversation is not

Customers do not speak in schemas. They change quantities, refer to items by incomplete names, add a second request before answering the first question, and revise earlier choices. Your architecture has to translate that non-deterministic conversation into structured, POS-compatible data without losing which version the customer actually approved.

My rule is simple: the model may interpret intent and propose an order-state change, but deterministic services must validate and commit it. The transcript should never be the only place where the current order exists.

A reliable turn can follow this sequence:

1. Load the current structured order, venue configuration, and relevant menu context.
2. Interpret the latest message as a proposed change: add, remove, replace, modify, confirm, cancel, pay, or request status.
3. Resolve referenced items and modifiers to canonical identifiers.
4. Call read-only tools for availability, configuration, fulfillment rules, or quotes as needed.
5. Validate the proposed change against required modifiers and venue rules.
6. Write a new order-state version and generate the next response from that validated state.
7. Use a separate, idempotent write operation when the customer confirms submission.

This design makes corrections much safer. If the customer says, “Make the second one large and remove the fries,” the agent should apply a state delta to the identified lines, validate the revised configuration, and show the updated summary. It should not regenerate the entire order from memory and hope that unrelated details remain intact.
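
Here is a small sketch of the "make the second one large and remove the fries" correction applied as a state delta. The `Line`, `OrderState`, and `apply_delta` names are hypothetical, and the validation step against venue rules is omitted for brevity. It would run before the new version is written.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Line:
    item_id: str
    size: str
    modifier_ids: tuple[str, ...] = ()


@dataclass(frozen=True)
class OrderState:
    version: int
    lines: tuple[Line, ...]


def apply_delta(state: OrderState, line_index: int, *, size=None, remove_modifiers=()) -> OrderState:
    """Return a new order version in which only the targeted line changes."""
    target = state.lines[line_index]
    updated = replace(
        target,
        size=size or target.size,
        modifier_ids=tuple(m for m in target.modifier_ids if m not in remove_modifiers),
    )
    lines = state.lines[:line_index] + (updated,) + state.lines[line_index + 1:]
    return OrderState(version=state.version + 1, lines=lines)


v1 = OrderState(1, (Line("burger", "regular", ("fries",)), Line("burger", "regular", ("fries",))))
v2 = apply_delta(v1, 1, size="large", remove_modifiers=("fries",))
```

Because the state is immutable and versioned, the first burger is carried over untouched, and the summary the customer approves can be tied to `v2.version`.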

Tool contracts should be narrow and explicit. Menu search should return canonical candidates and the information needed to distinguish them. Item detail should return valid modifier groups. A quote tool should return authoritative values. A payment tool should return a system-created link or a structured error. An order-submission tool should return an accepted identifier, a definite rejection, or an unresolved state that triggers recovery.

Do not let the model invent a price, payment URL, availability claim, delivery decision, or order status. These are business facts with financial and operational consequences. The response composer can explain them in natural language, but the underlying values must come from an approved system.

Separate reads from writes in the architecture. Independent menu and item lookups can often run in parallel. Writes should be serialized against a known order-state version. Every commit operation should accept an idempotency key so a retry cannot create a second order. If the state changed after the customer saw the summary, require confirmation of the new version rather than silently committing it.
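
The commit rules above can be sketched in a few lines. This is an in-memory illustration under my own assumptions: `OrderCommitter`, `StaleVersionError`, and the `submit_to_pos` callable are hypothetical, and a real service would keep the idempotency record in durable storage and handle a submission that times out before a result is stored.

```python
class StaleVersionError(Exception):
    """The order changed after the customer saw the summary."""


class OrderCommitter:
    def __init__(self, submit_to_pos):
        self._submit_to_pos = submit_to_pos     # hypothetical POS adapter callable
        self._results: dict[str, str] = {}      # idempotency key -> POS order id

    def commit(self, idempotency_key: str, confirmed_version: int,
               current_version: int, payload: dict) -> str:
        # A retry with the same key returns the earlier result instead of resubmitting.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Never silently commit a version the customer did not approve.
        if confirmed_version != current_version:
            raise StaleVersionError("ask the customer to confirm the new version")
        order_id = self._submit_to_pos(payload)
        self._results[idempotency_key] = order_id
        return order_id
```

Deriving the key from the conversation and the confirmed order version is one reasonable choice, since a retry of the same confirmation then maps to the same key.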

The same discipline applies to human handoff. Transfer the structured cart, unresolved question, relevant tool results, and submission state along with the transcript. A handoff that forces staff to reread the entire conversation and reconstruct the order is not graceful degradation; it is deferred manual work.

Choose the orchestration pattern from the service objective, not from architectural fashion. Under tight response constraints, AITropos chose direct tool calls instead of MCP or a multi-stage pipeline to reduce orchestration overhead. That is not a universal argument against MCP. It is a reason to benchmark the actual path. Compare end-to-end latency, traceability, schema governance, failure isolation, and engineering cost using representative ordering turns. If an abstraction adds useful control, keep it. If it only adds another round trip, remove it.

Manage latency as part of the customer experience

The model’s inference time is only one part of latency. From the customer’s perspective, the clock starts when the message is sent and stops when a useful next action arrives. Context retrieval, menu search, validation, payment calls, POS submission, message delivery, retries, and overloaded queues all sit inside that interval.

Instrument the complete path before optimizing it. Capture timestamps for message receipt, context assembly, model execution, every tool call, state validation, response creation, and outbound delivery. Report median and tail latency by turn type. A single average can hide a checkout path that is consistently slower than menu questions.
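
As a rough sketch of that reporting, the function below groups end-to-end latencies by turn type and returns the median and an approximate 95th percentile for each. The `(turn_class, seconds)` sample shape is an assumption. In practice you would compute the same figures per component from the captured timestamps.

```python
from collections import defaultdict
from statistics import median, quantiles


def latency_report(samples):
    """samples: iterable of (turn_class, seconds). Returns {turn_class: (median, p95)}."""
    by_class = defaultdict(list)
    for turn_class, seconds in samples:
        by_class[turn_class].append(seconds)

    report = {}
    for turn_class, values in by_class.items():
        if len(values) > 1:
            # 19 cut points for n=20; the last one approximates the 95th percentile.
            p95 = quantiles(values, n=20, method="inclusive")[-1]
        else:
            p95 = values[0]  # quantiles() needs at least two data points
        report[turn_class] = (median(values), p95)
    return report
```

Reporting both numbers per turn class is what exposes a checkout path that looks acceptable in the overall average but is slow at the tail.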

At minimum, separate these turn classes:

- Menu discovery and recommendation.
- Item identification and configuration.
- Cart edits and corrections.
- Delivery or fulfillment validation.
- Quote and payment-link creation.
- Order confirmation and POS submission.
- Order-status retrieval.
- Human escalation and recovery.

Set a service objective for each class from observed channel behavior and the operational risk of delay. There is no useful universal number. A status lookup and a multi-item order edit do different work. What matters is that the team can see which component consumes the budget and what happens when that component times out.

Optimize in the order that removes uncertainty as well as delay:

1. Remove unnecessary model and tool round trips. Load the active order and venue configuration before asking the model what to do.
2. Parallelize independent read operations, such as resolving multiple products mentioned in one message.
3. Prefetch likely item context so the agent does not discover basic menu facts one call at a time.
4. Inject only the context needed for the current turn. An oversized prompt moves latency rather than eliminating it.
5. Keep deterministic validation outside the model when a rule or schema check can answer immediately.
6. Give every external dependency a timeout, an observable error state, and a safe recovery path.
7. Use concise responses that advance the order. Extra prose increases reading time and can obscure the decision you need from the customer.
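
Two of these steps, parallel reads and per-dependency timeouts, fit in a short `asyncio` sketch. `search_menu` is a stand-in for a real read-only menu tool, and the names and the two-second default are assumptions for illustration only.

```python
import asyncio


async def search_menu(query: str) -> dict:
    """Stand-in for a read-only menu search tool (hypothetical)."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"query": query, "candidates": [f"{query}-sku"]}


async def resolve_products(queries: list[str], timeout_s: float = 2.0) -> list[dict]:
    """Run independent read-only lookups concurrently, each with its own timeout."""

    async def one(query: str) -> dict:
        try:
            return await asyncio.wait_for(search_menu(query), timeout=timeout_s)
        except asyncio.TimeoutError:
            # An observable error state the caller can route to recovery.
            return {"query": query, "error": "timeout"}

    # gather() preserves input order, so results line up with the customer's request.
    return await asyncio.gather(*(one(q) for q in queries))


results = asyncio.run(resolve_products(["margherita", "cola"]))
```

Only reads are parallelized here. Writes still go through a single serialized path against a known order-state version.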

A useful implementation pattern is already visible in production: multiple product searches run in parallel, product context is prefetched, and smaller, faster components prepare the relevant context for each turn. The product lesson is not to create a swarm of agents. It is to move predictable preparation out of the critical reasoning loop while preserving one coherent order state.

Watch the failure mode on the other side of aggressive optimization. Cached menu metadata can reduce retrieval work, but stale availability or price data can create a wrong commitment. Define which fields are stable enough to cache, how they are invalidated, and which values must be retrieved at quote or submission time. Speed is valuable only when the answer remains authoritative.

When a slow operation cannot be avoided, use an honest progress message and preserve the pending state. Do not fill the wait with repeated acknowledgements that imply completion. If the customer sends another message while the tool is running, the state machine should know whether to queue the change, cancel the pending operation, or ask the customer to wait for its result.

Evaluate each venue, then template what repeats

Make item accuracy precise enough to govern decisions

Item identification accuracy deserves to be the primary trust metric. If the agent resolves the wrong item, every later component can behave perfectly and still produce the wrong order. AITropos treats order item identification accuracy as its most important KPI, giving model, prompt, retrieval, and fallback decisions a common objective.

Define the metric before building a dashboard. I would count an attempted line item as correct only when the canonical item, quantity, and required modifier interpretation match the customer’s resolved intent. A necessary clarification is not automatically an error; it should count against a separate clarification-burden metric. Otherwise, the team may improve apparent accuracy by asking the customer to confirm every obvious detail.
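
A minimal sketch of that definition keeps the two metrics separate. The `LineAttempt` record and its fields are hypothetical labels that a human reviewer or an independent grader would fill in.

```python
from dataclasses import dataclass


@dataclass
class LineAttempt:
    item_ok: bool
    quantity_ok: bool
    modifiers_ok: bool
    clarifications: int = 0  # questions asked before the line was resolved


def item_accuracy(attempts: list[LineAttempt]) -> float:
    """Share of attempted lines where item, quantity, and modifiers all match intent."""
    if not attempts:
        return 0.0
    correct = sum(a.item_ok and a.quantity_ok and a.modifiers_ok for a in attempts)
    return correct / len(attempts)


def clarification_burden(attempts: list[LineAttempt]) -> float:
    """Average number of clarifying questions per attempted line."""
    if not attempts:
        return 0.0
    return sum(a.clarifications for a in attempts) / len(attempts)
```

Tracking both makes the tradeoff visible: a change that raises accuracy while doubling the clarification burden is not an unambiguous improvement.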

Do not let the primary KPI hide transaction failures. Pair it with guardrails for:

- Unsupported substitutions or invented items.
- Missing and invalid required modifiers.
- Customer corrections after the agent presents a summary.
- Quote, payment-link, and POS tool failures.
- False confirmations, unresolved submissions, and duplicate commits.
- Order completion and abandonment by journey stage.
- Human handoff rate, reason, and time to recovery.
- End-to-end latency by turn class and venue.

Link corrections back to the original decision. If the customer changes an item because the agent misunderstood it, label the item-resolution turn rather than treating the correction as an unrelated edit. That is how production behavior becomes useful evaluation data instead of a collection of support anecdotes.

Simulate failures before customers encounter them

A venue-specific evaluation suite should use that venue’s menu identifiers, modifiers, availability behavior, delivery rules, payment flow, and POS adapter. A generic restaurant benchmark can test language understanding, but it cannot tell you whether the agent knows that a particular size requires a particular modifier or that two similar menu names map to different SKUs.

Build test families for:

- Incomplete names, colloquial references, and ambiguous matches.
- Several products requested in one message.
- Required modifiers, exclusions, additions, and invalid combinations.
- Quantity changes, replacements, removals, and cancellation.
- Unavailable items and acceptable alternatives.
- Pickup, delivery, and addresses that cannot be validated.
- Quote changes before confirmation.
- Payment failure, delayed payment state, and an abandoned payment flow.
- Tool timeouts, malformed tool results, retries, and uncertain POS submission.
- Interrupted conversations that resume with an existing cart.
- Requests that require staff judgment rather than autonomous execution.

Generate the expected structured order independently from the agent being tested. Otherwise, the same model can reproduce its own misunderstanding in both the answer and the grade. Keep a small, human-reviewed set of critical conversations alongside the larger generated suite, and add every material production failure to the permanent regression set.
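
Grading against an independent expectation can be a plain structural comparison with no model in the loop. The sketch below assumes each order is a list of dictionaries with `item_id`, `quantity`, and optional `modifier_ids`. That shape and the function names are illustrative, not a fixed format.

```python
from collections import Counter


def normalize(order: list[dict]) -> Counter:
    """Reduce an order to a multiset of (item, quantity, modifiers), ignoring line order."""
    return Counter(
        (line["item_id"], line["quantity"], frozenset(line.get("modifier_ids", ())))
        for line in order
    )


def grade(expected: list[dict], actual: list[dict]) -> dict:
    """Compare the agent's structured order with the independently written expectation."""
    exp, act = normalize(expected), normalize(actual)
    return {
        "passed": exp == act,
        "missing": list((exp - act).elements()),      # expected but not produced
        "unexpected": list((act - exp).elements()),   # produced but not expected
    }
```

The `missing` and `unexpected` lists give each failure a category, which matters more for triage than the pass rate alone.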

Scale matters when menus contain many combinations. Before each new venue goes live, AITropos runs thousands of simulated customer conversations overnight. The number alone is not the release gate. Coverage, a trustworthy expected answer, and clear failure categories are what make simulation useful.

Simulation also cannot reproduce every production condition. Follow it with a staff sandbox and a controlled production phase. Use only redacted, properly authorized customer conversations in evaluation systems, and retain no more personal data than the test requires.

I would treat any path that invents a price or payment state, falsely confirms an order, or can duplicate a commit as release-blocking. Other thresholds should reflect the venue’s menu complexity, existing human baseline, handoff capacity, and the cost of a wrong order. Record those thresholds before the final test run so launch pressure cannot redefine success afterward.
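
Recording the gate as code before the final run is one way to keep it from moving. In this sketch the blocking categories mirror the ones I named above, while the metric names and threshold values are placeholders each venue would set for itself.

```python
# Failure categories that block release regardless of any other metric.
BLOCKING = {"invented_price", "invented_payment_state", "false_confirmation", "duplicate_commit"}


def release_decision(failure_counts: dict[str, int],
                     metrics: dict[str, float],
                     thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ready, reasons). thresholds maps a metric name to its minimum value."""
    blockers = sorted(name for name in BLOCKING if failure_counts.get(name, 0) > 0)
    misses = sorted(
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum   # a missing metric counts as a miss
    )
    return (not blockers and not misses), blockers + misses
```

A single false confirmation fails the gate even when every accuracy threshold is met, which is the behavior release-blocking implies.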

Roll out autonomy in observable stages

Start with a venue that is operationally manageable but representative enough to expose real modifiers, fulfillment rules, and integration behavior. An unusually simple pilot may produce a clean demo while postponing the problems that determine whether the product can scale.

1. Configuration: ingest and normalize the menu, map canonical identifiers, mark required modifiers, connect fulfillment and payment rules, and produce a completeness report. No customer-facing ordering is enabled.
2. Sandbox: let venue staff run realistic conversations while write tools remain disabled or point to a test environment.
3. Approval mode: allow the agent to prepare a structured order, but require a person to approve the commit. Measure how often the person changes it and why.
4. Constrained production: enable autonomous submission for the supported venue, fulfillment modes, and order types, with a staffed handoff path and rapid rollback.
5. Expansion: widen scope only after production traces confirm the accuracy, latency, recovery, and operational workload expected by the release criteria.

For every stage, decide who can pause the agent, how staff take over an active conversation, how the customer learns that a person has taken over, and how an uncertain submission is reconciled before another order is created. These are product requirements, not post-launch operating notes.

Once one venue works, resist copying its prompt and integrations into a new branch. Make venue differences configuration wherever possible: normalized menu schemas, modifier patterns, fulfillment policies, tool mappings, escalation contacts, evaluation packs, and dashboard dimensions. Keep truly distinct behavior explicit rather than burying it in prompt prose.

The scalability payoff can be substantial. AITropos reduced new-venue onboarding from three months to a few weeks, while domain templates are being used to shorten it further. Track your own onboarding work by category: configuration, data cleanup, integration, prompt or policy changes, evaluation, venue training, and launch support. If every venue still requires bespoke code and a rewritten conversation flow, the product has not yet separated its platform from its implementations.

Your next step should be concrete. Choose one representative venue and create three artifacts: the canonical order contract, a failure-and-recovery matrix for every tool, and a venue-specific evaluation set built from redacted, authorized scenarios. If those artifacts cannot show what happens when item resolution, a modifier, delivery validation, payment, or POS submission fails, the agent is not ready to accept orders. Once those states are explicit, model and architecture choices become testable decisions rather than matters of confidence.

References

Shivam.Consulting Blog — Inside AITropos: Lightning-Fast AI Employees for Hospitality That Take Orders on WhatsApp

April 30, 2026

CPO Leadership in the AI Era: A Practical System for Focus
You open a portfolio review and find an AI request from nearly every direction. One team wants an assistant. Another wants an agent. A third has a promising prototype that now needs production funding. Every request sounds plausible, yet approving all of them would spread the company across disconnected experiments.

This isn’t primarily a prioritization problem. It is a leadership-system problem. Your job as CPO is to define the customer advantage worth pursuing, concentrate attention on a few coherent bets, specify the evidence that earns more investment, and make it clear what the company will stop doing. The roadmap should record those choices. It should not make them for you.

Allocate attention before you allocate roadmap space

AI expands the number of things a product team can plausibly build. It does not expand engineering capacity, customer attention, management bandwidth, or the company’s tolerance for operational risk at the same rate. That mismatch is why an orderly backlog can still represent a deeply unfocused strategy.

A prototype adds to the confusion because it compresses the distance between an idea and a convincing demonstration. A good demo shows that a capability may be technically possible under selected conditions. It does not establish that customers will adopt it, that it will perform reliably across real workflows, that its economics will work, or that competitors cannot reproduce it.

Before discussing priority, force each proposed investment through four decisions:
- Customer advantage: What will a specific customer be able to do materially better, faster, or more safely?
- Behavioral outcome: What observable change would show that the advantage matters, such as stronger activation, repeated use, retention, or expansion?
- Business consequence: Which company outcome should move if the customer behavior changes, such as NRR, gross margin, payback, or cost-to-serve?
- Opportunity cost: Which existing initiative, workflow, or commitment will receive less attention if this bet is funded?
The fourth decision is where focus becomes real. If a proposal enters the portfolio without displacing time, money, or executive attention somewhere else, the company has not prioritized it. It has merely added it.

A shared driver tree makes these trade-offs visible. Start with the company outcome. Connect it to the customer behavior that must change, the product lever expected to change that behavior, and the evidence required from the current initiative. If a team cannot draw a credible path through those layers, pause the funding discussion until it can. That is more useful than arguing about whether the item belongs near the top or middle of a feature list.

Your leadership context changes how you create this clarity. In a founder-led company, you often need to influence without becoming deferential: preserve the ambition in the founder’s vision while pressure-testing assumptions with customer evidence, data, and portfolio consequences. Under a hired CEO, the emphasis shifts toward explicit investment theses, capital allocation, and a tighter connection among product, financial, and go-to-market plans.

In either setting, ambition must be more precise than a mandate to become an AI company. Name the customer capability you want to own, the workflow in which it matters, and the durable advantage the company can build around it. Technology is an ingredient. Customer advantage is the strategic claim.

Turn AI feature requests into testable investment theses

A feature request arrives with a solution already embedded in it. An investment thesis keeps the solution open long enough to test whether the opportunity deserves capital. That distinction matters when models, interfaces, and implementation patterns are changing faster than an annual plan can absorb.

Rewrite each material AI proposal using this structure:
April 30, 2026
AI Product Growth: A Strategy and Execution Operating System
You have an AI capability that demos well, yet its growth story is still unclear. Some users try it once. The team debates model quality. The roadmap fills with features, while the link to activation, retention, or revenue remains an assumption.

You can fix that by managing AI growth as a measurable path from user intent to trusted value, repeated behavior, and a business outcome. That path tells you where growth is breaking, which experiment to run next, and whether a more capable model would solve the problem at all.

Build the growth thesis as a measurable chain

AI products invite feature-shaped goals: launch a copilot, add an agent, improve the prompt, or introduce recommendations. Those goals describe output. They do not tell you whose behavior should change or why the change matters to the business.

In my product strategy work at HighLevel, I use a simple test: if a roadmap item cannot name the user behavior it should change and the business lever that behavior affects, it is not yet a growth strategy. A North Star and its driver tree force that connection into the open.

Build your driver tree from right to left. Start with the business outcome, identify the customer behavior that can produce it, and then identify the AI-assisted moments that can change that behavior. This order prevents model capabilities from dictating the roadmap.
1. Name the segment. Choose a group with a shared job and context. New administrators setting up an account are more useful than all users because their intent, constraints, and success event can be observed.
2. Define the value moment. State what the user can do after the AI interaction that was difficult before it. An answer displayed is not a value moment. A configured workflow, resolved issue, completed analysis, or approved action can be.
3. Select the behavior change. Decide whether you need more users to reach first value, reach it sooner, repeat a valuable workflow, or adopt an additional capability.
4. Connect the behavior to one growth mechanism. Activation, retention, and expansion require different product decisions. Choose a primary mechanism for the bet instead of claiming that one feature will improve all three.
5. Add quality and trust guardrails. Relevance, correctness, abandonment, corrections, unauthorized actions, privacy exposure, and recoverability can invalidate an apparent growth win.
A practical AI growth equation is: eligible users multiplied by discovery, first successful use, repeat successful use, and downstream conversion. You do not need to treat the equation as a financial model. Use it to locate the weakest link. More traffic will not repair poor first-use success, and better answers will not create growth if eligible users never discover the capability.
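
The equation is simple enough to sketch directly. Every number below is made up for illustration, and picking the lowest rate is only a first-pass heuristic: a stage can be the real constraint even when its rate is not the smallest, so compare each rate with its own baseline before acting.

```python
from math import prod


def growth_funnel(eligible_users: int, rates: dict[str, float]) -> tuple[float, str]:
    """Multiply eligible users through each stage rate and name the lowest-rate stage."""
    converted = eligible_users * prod(rates.values())
    weakest = min(rates, key=rates.get)
    return converted, weakest


# Illustrative values only, not benchmarks.
converted, weakest = growth_funnel(
    10_000,
    {
        "discovery": 0.40,
        "first_successful_use": 0.50,
        "repeat_successful_use": 0.60,
        "downstream_conversion": 0.25,
    },
)
```

With these inputs roughly 300 of 10,000 eligible users reach the downstream outcome. Doubling eligible users doubles that figure but leaves every stage rate, and therefore the weakest link, exactly where it was.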

Turn the thesis into a one-page decision document before adding projects to the roadmap. It should contain:
- The target segment and the high-value job it is trying to complete.
- The current friction, supported by behavioral evidence or customer discovery.
- The proposed AI intervention and why AI is necessary for this step.
- The primary behavioral outcome and its baseline.
- The activation, retention, or expansion lever that outcome should affect.
- The leading indicators that can move before the business outcome does.
- The quality, reliability, and trust guardrails that must not deteriorate.
- The assumptions that would cause you to stop, narrow, or redesign the bet.
Your outcome statement can follow this form: increase a named behavior for a named segment by improving a specific driver, without degrading named guardrails. Supply the target only after you have a baseline and know what change your measurement system can detect. A target chosen for presentation value is not a strategy.

A worked example: turn AI search into an activation path

Consider a SaaS administrator who searches for help while configuring a workflow. A search team could optimize result clicks and declare success. A growth team traces the job further: query submitted, useful guidance received, setup started, workflow activated, and the workflow used successfully.

If result clicks rise but completed setups do not, the team improved engagement with search rather than activation. If setup completion rises but repeat use does not, the next constraint may be workflow value, onboarding, or the quality of the initial configuration. The query-to-outcome path makes those distinctions visible.

This is why every AI growth bet needs an explicit endpoint. The endpoint is not the response. It is the valuable behavior the response enables.

Choose an intent wedge before choosing the AI experience

A broad assistant usually creates a broad measurement problem. It serves unrelated intents, carries different failure costs, and leaves the team unable to explain why adoption changed. Start with an intent wedge: a narrow set of related requests from one segment, encountered at a meaningful point in its journey.

A strong first wedge has several useful properties:
- The job recurs. Repetition gives the product a chance to create a habit or reduce recurring friction.
- The current path is observable. You can see where users search, abandon, ask for help, switch tools, or fail to complete the workflow.
- Success is verifiable. The product can observe a completed action or downstream outcome instead of relying only on a positive reaction to the answer.
- The job is close to value. Improving it can plausibly affect activation, retention, or expansion.
- The failure is recoverable. Early versions should avoid irreversible or high-cost autonomy when a suggestion, preview, or confirmation can solve the same problem safely.
- The scope is evaluable. The team can assemble representative intents and define what an acceptable response or action looks like.
Use continuous discovery and journey mapping to find that wedge. Review behavioral funnels and query logs, then speak with users who completed the job, abandoned it, and avoided the AI experience entirely. The last group matters because usage data cannot explain a discovery problem among people who never entered the funnel.

Capture each candidate in an opportunity card. Record the segment, trigger, intent in the user’s own language, current workaround, failure consequence, next valuable action, available evidence, trust constraints, and outcome metric. This keeps prioritization centered on customer work rather than the novelty of a model capability.

When comparing opportunities, do not collapse everything into one unexplained score. Look separately at the strength of the evidence, proximity to a growth outcome, frequency of the job, severity of the friction, ability to measure success, and cost of a wrong answer or action. A high-frequency request with no meaningful downstream behavior may be less valuable than a narrower request sitting directly before activation.

Match autonomy to the evidence you have

AI product teams often jump from static software to autonomous agents in one roadmap step. A safer growth path increases autonomy only when the preceding level has demonstrated reliable value.
1. Visibility. Capture and classify what users are trying to accomplish. This exposes unmet demand before you automate anything.
2. Retrieval and explanation. Return relevant, grounded information that helps the user make the decision. A retrieval-first approach is often the cleanest starting point because the evidence and failure points are easier to inspect.
3. Recommendation. Suggest a next action using the user’s context, while keeping the decision with the user.
4. Guided or agentic execution. Prepare or perform a multi-step workflow with appropriate permissions, confirmation, observability, and recovery.
Move up a level when the current experience has repeat use, its major failure classes are understood, and the next level removes a documented point of friction. Do not add agency merely because the model can call tools. An agent that takes the wrong action creates a more serious problem than a search result that fails to earn a click.

The decision rule is straightforward: use the least autonomous experience that can produce the target behavior. This makes learning cheaper, limits risk, and shows whether users want the job completed before you invest in completing it on their behalf.
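As a minimal sketch of that decision rule, assuming three yes/no evidence checks that you would derive from your own data, the ladder can be written as a small gate. The names `Autonomy`, `LevelEvidence`, and `next_level` are illustrative and do not come from any framework.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    """The four levels described above, ordered from least to most autonomous."""
    VISIBILITY = 1
    RETRIEVAL = 2
    RECOMMENDATION = 3
    EXECUTION = 4


@dataclass
class LevelEvidence:
    """Hypothetical evidence flags for the level currently in production."""
    repeat_use: bool                   # users come back to the current experience
    failure_classes_understood: bool   # major failure classes are documented
    next_level_removes_friction: bool  # the next level removes a documented friction


def next_level(current: Autonomy, evidence: LevelEvidence) -> Autonomy:
    """Move up exactly one level, and only when all three conditions hold."""
    ready = (
        evidence.repeat_use
        and evidence.failure_classes_understood
        and evidence.next_level_removes_friction
    )
    if ready and current < Autonomy.EXECUTION:
        return Autonomy(current + 1)
    return current


# Retrieval has repeat use and understood failures, but no documented friction
# that a recommendation would remove, so the level stays where it is.
stay = next_level(Autonomy.RETRIEVAL, LevelEvidence(True, True, False))
move = next_level(Autonomy.RETRIEVAL, LevelEvidence(True, True, True))
```

The gate never skips a level, which mirrors the point that each step has to earn the next one.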

Instrument the full path from interaction to revenue

You cannot manage AI growth from a usage count. Monthly users of an AI feature can rise because of novelty, forced exposure, or repeated failure. Instrument a closed loop that connects intent, system behavior, user response, task completion, and the relevant business outcome.

A useful event spine contains the following stages:
1. Eligibility and exposure: Was the right user able to discover the capability at the right moment?
2. Intent: What job was the user trying to complete, and how was that intent classified?
3. Response: Which result, recommendation, or planned action did the system produce?
4. User judgment: Did the user select, accept, edit, reject, retry, or abandon it?
5. Execution: Did the user or agent start and complete the intended workflow?
6. Value: Did the product observe the success event defined in the growth thesis?
7. Business outcome: Did the relevant account activate, retain, expand, or contribute to Net Recurring Revenue through the defined path?
Concrete event names make implementation reviews easier. Depending on the experience, you might use events such as ai_query_submitted, ai_answer_shown, ai_recommendation_accepted, ai_action_started, ai_action_completed, ai_answer_corrected, and ai_flow_abandoned. The exact naming convention matters less than preserving the sequence and using it consistently.

Attach the properties needed to diagnose changes: segment, intent class, entry point, answer type, retrieval or ranking version, model and prompt version, experiment assignment, content identifiers, action type, completion status, and failure class. Carry account and customer identifiers into downstream systems only under your approved privacy and data-governance rules.
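One way to make the sequence reviewable is to check each session against the spine and report the first stage that never fired. The sketch below assumes the example event names above and a hypothetical mapping from events to stages; adapt both to your own schema.

```python
from typing import Iterable, Optional

# Hypothetical mapping from spine stages to the example event names.
# Stages are listed in the order a session should pass through them.
SPINE = [
    ("intent", {"ai_query_submitted"}),
    ("response", {"ai_answer_shown"}),
    ("user_judgment", {"ai_recommendation_accepted", "ai_answer_corrected"}),
    ("execution_started", {"ai_action_started"}),
    ("execution_completed", {"ai_action_completed"}),
]


def first_missing_stage(event_names: Iterable[str]) -> Optional[str]:
    """Return the earliest stage with no matching event, or None if all fired."""
    seen = set(event_names)
    for stage, names in SPINE:
        if not seen & names:
            return stage
    return None


# A session where the answer was shown but the user never accepted or corrected it.
stalled = first_missing_stage(["ai_query_submitted", "ai_answer_shown"])
# stalled == "user_judgment"
```

Aggregating `first_missing_stage` by segment or intent class shows where the path breaks before anyone debates the fix.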

Raw prompts and conversations can contain personal, confidential, or commercially sensitive information. Logging them by default can create exposure that outlives the experiment. Define redaction, retention, access control, and deletion rules before broad collection. If a diagnostic goal can be met with a classified intent or structured error code, do not retain the raw text merely because it may be useful later.
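A small sketch of the "classified intent instead of raw text" idea follows. It assumes a classifier function you supply; `keyword_classifier` here is a deliberately naive stand-in, and the field names are hypothetical.

```python
from typing import Callable, Dict


def loggable_record(
    raw_prompt: str,
    classify: Callable[[str], str],
    keep_raw: bool = False,
) -> Dict[str, object]:
    """Build an analytics record that omits the raw prompt unless explicitly allowed."""
    record: Dict[str, object] = {
        "intent_class": classify(raw_prompt),
        "prompt_chars": len(raw_prompt),  # a coarse size signal, not the content
    }
    if keep_raw:
        # Only set this when redaction, retention, and access rules are in place.
        record["raw_prompt"] = raw_prompt
    return record


def keyword_classifier(text: str) -> str:
    """Toy classifier for illustration only; a real one would be evaluated."""
    lowered = text.lower()
    if "invoice" in lowered or "refund" in lowered:
        return "billing"
    return "other"


record = loggable_record("Refund my invoice for jane@example.com", keyword_classifier)
# The record carries the intent class and length, but not the email address.
```

Making raw retention an explicit opt-in keeps the default safe and forces a decision each time someone wants the text.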

Organize the resulting metrics into five views:
- Reach: eligible users, exposure, discovery, and first use.
- Experience: acceptance, correction, retry, abandonment, and progression to the next step.
- Task value: successful workflow completion and time to the defined value event.
- Repeat value: return use for the same job and successful use across relevant workflows.
- Business impact: activation, retention, expansion, and revenue outcomes for the target segment.
Sentiment can help you locate frustration, but it should not become the success metric. A polite response can still be wrong, and a frustrated user can still complete the task. Pair inferred sentiment with observable behavior such as correction, abandonment, repeated queries, completion, and downstream product use.

Revenue attribution needs similar discipline. Connect experiment exposure and product behavior to the CRM or revenue system, choose an attribution window that matches the natural decision cycle, and distinguish influenced revenue from causally demonstrated lift. Users who voluntarily adopt an AI capability may already be more engaged, so a dashboard correlation does not prove that AI caused their retention or expansion.

This distinction changes roadmap decisions. Behavioral analytics can reveal where the path is breaking. Controlled experiments are needed when you want to know whether fixing that point changes behavior or dollars.

Turn evaluation and experiments into a delivery system

AI growth execution needs two evidence gates. Offline evaluation asks whether the system performs the intended task well enough to expose safely. Online experimentation asks whether the experience changes customer behavior. Passing one gate does not imply that you will pass the other.

Gate releases with eval-driven development

Build the first evaluation set from real intents in the chosen wedge. Cover common requests, ambiguous requests, known failures, and cases where a wrong answer or action carries a higher cost. Preserve segment and intent labels so an average score cannot hide a severe failure in an important slice.
1. Write the rubric before tuning. Define what must be true for the response or action to pass: correct intent, relevant evidence, accurate guidance, appropriate next step, permitted action, and recoverability where needed.
2. Separate failure classes. Coverage, retrieval, generation, interaction design, tool execution, permissions, and policy failures need different fixes.
3. Version the system. Record the model, prompt, retrieval configuration, content version, tools, and policy configuration associated with each result.
4. Review performance by slice. Inspect high-value intents and high-consequence failures instead of relying only on the aggregate.
5. Keep human review where judgment is material. Automated scoring can accelerate evaluation, but uncertain or consequential cases still need an accountable review path.
6. Promote real failures into the eval set. Production corrections and abandoned workflows should make future regressions easier to catch.
Do not treat every low score as a prompt problem. If the required information is absent, fix content or data coverage. If the right material exists but is not retrieved, fix retrieval or ranking. If the answer is accurate but users cannot act on it, fix interaction design or the handoff into the workflow. Prompt iteration cannot compensate for every layer of the system.
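To illustrate why slice labels matter, here is a minimal sketch that compares an aggregate pass rate with per-slice pass rates. The slice names and results are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_slice(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate for each slice from (slice_name, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        if passed:
            passes[slice_name] += 1
    return {name: passes[name] / totals[name] for name in totals}


# Illustrative data: 18 passing how-to cases and 2 failing refund actions.
results = [("how_to", True)] * 18 + [("refund_action", False)] * 2

overall = sum(passed for _, passed in results) / len(results)  # 0.9
by_slice = pass_rate_by_slice(results)  # {"how_to": 1.0, "refund_action": 0.0}
```

A 90 percent aggregate looks shippable here, while the high-consequence slice fails every time. Reviewing `by_slice` first is what surfaces that.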

Use online experiments to answer growth questions

Once the experience passes its release gate, write an experiment card that another product leader could audit without attending the planning meeting.
- The target segment and eligibility rule.
- The behavior you expect to change and the mechanism behind that change.
- The control and variant, including model, prompt, ranking, content, or interaction differences.
- One primary behavioral outcome tied to the growth thesis.
- Quality, reliability, cost, privacy, and business guardrails relevant to the change.
- The randomization unit, especially when users within the same account can affect one another.
- The minimum detectable effect and the sample required to evaluate it.
- The natural usage or buying cycle the test must observe.
- The decision rule for shipping, iterating, narrowing, or stopping.
Good early tests isolate a decision. Compare retrieval or ranking approaches when users cannot find the right information. Compare concise and detailed answer formats when comprehension or action is the constraint. Test prompt variants when the failure is genuinely in instruction following or response construction. Test a guided workflow against a static answer when users understand the answer but fail at the next step.
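For the minimum detectable effect and sample size on the experiment card, a rough starting point is the standard normal-approximation formula for comparing two proportions. The sketch below assumes equal-sized arms, a two-sided test, and independent users; if you randomize by account, treat the result as a lower bound and consult your data science partner about clustering.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    baseline: current rate of the primary outcome, e.g. 0.20
    mde_abs:  smallest absolute change worth detecting, e.g. 0.05
    """
    if not 0 < baseline < 1 or mde_abs <= 0 or not 0 < baseline + mde_abs < 1:
        raise ValueError("rates must stay strictly between 0 and 1")
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative inputs: 20% baseline completion, 5-point absolute lift.
n_large_effect = sample_size_per_arm(0.20, 0.05)
# A smaller effect needs many more users per arm.
n_small_effect = sample_size_per_arm(0.20, 0.02)
```

Running the numbers before launch tells you whether the target segment has enough traffic to answer the question within the natural usage cycle.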

A click is an acceptable primary metric only when the click itself represents value. Otherwise, measure the completed behavior downstream. A compelling answer that produces no useful action is engagement without growth.

Run the first 90 days around evidence, not launch theater

A 30-60-90 operating sequence gives the team enough structure to create momentum while preserving room to learn.
1. Days 1-30: establish the truth. Select the segment and intent wedge. Baseline the driver tree. Map the current journey. Audit events and data access. Build the initial evaluation set. Define privacy and permission constraints. Write the first growth thesis and identify the assumptions most likely to break it.
2. Days 31-60: ship the smallest complete path. Release a thin experience behind a feature flag. Instrument the full event spine. Run offline evaluations and the first controlled experiment. Review failed intents and abandoned workflows every week. Fix the largest diagnosable constraint rather than adding adjacent features.
3. Days 61-90: prove, prune, and scale. Aim to land two or three measurable wins, remove low-signal bets, and decide whether the wedge deserves more distribution, deeper personalization, or greater autonomy. Standardize the operating cadence only after the team has learned which reviews lead to decisions.
The product trio should co-own problem framing and solution shaping. Product owns the growth thesis and trade-offs. Design owns comprehension, control, feedback, and recovery in the interaction. Engineering owns system behavior, instrumentation, reliability, and safe delivery. Data science should help design evaluations, experiments, and attribution. Customer-facing teams should validate whether the job and value proposition match the language customers actually use.

Use a cadence that matches the decision:
- Weekly: inspect intent coverage, failure classes, corrections, abandonment, and unexpected trust issues.
- Every two-week sprint: ship a testable improvement, review the evidence, and update the decision record.
- Monthly: review the driver tree, experiment portfolio, business impact, costs, and cross-functional blockers.
- Quarterly: reset outcomes, stop bets that have lost their evidence, and fund the next constraint in the growth path.
Feature flags, CI/CD, and observability are growth infrastructure because they reduce the cost and risk of learning. They let you separate code deployment from customer exposure, compare variants, detect regressions, and reverse a problematic release. Privacy-by-design, data governance, and observability should be release requirements rather than work deferred until scale.
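Separating deployment from exposure usually comes down to a deterministic assignment function. The following is a sketch under assumptions, not any vendor's API: it hashes a hypothetical account identifier so that every user in the same account sees the same experience.

```python
import hashlib


def variant_for(account_id: str, experiment: str, exposure: float) -> str:
    """Assign an account to 'variant' or 'control' deterministically.

    exposure is the fraction of accounts (0.0 to 1.0) that should see the variant.
    Hashing the experiment name with the account id keeps assignments
    independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{account_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "variant" if bucket < exposure else "control"


# The same account always lands in the same arm for a given experiment.
first = variant_for("acct-123", "guided-workflow-v1", 0.10)
second = variant_for("acct-123", "guided-workflow-v1", 0.10)
```

Setting `exposure` to 0.0 turns the experience off without a redeploy, which is the reversal path described above.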

Watch for five failure modes
- The roadmap is organized by AI features. Reorganize it around segments, intents, and outcomes so multiple solution types can compete for the same problem.
- One quality score hides the system. Break performance into coverage, retrieval, generation, interaction, execution, and policy slices so the owner of the next fix is clear.
- The team optimizes prompts while the funnel is broken elsewhere. Locate the failing step before choosing the technical intervention.
- Autonomy arrives before trust. Start with visibility, retrieval, or recommendations, then increase agency when reliable task completion and recovery have been demonstrated.
- AI usage becomes a vanity metric. Keep the primary outcome downstream of the interaction and tie it to activation, retention, or expansion.
Key takeaways
- An AI growth strategy must connect a defined segment and intent to a valuable behavior and one primary business lever.
- Start with a narrow intent wedge whose success is observable, repeatable, and close to activation, retention, or expansion.
- Use the least autonomous experience that solves the documented friction, then earn the right to add agency.
- Instrument eligibility, intent, response, user judgment, execution, value, and business outcome as one path.
- Use offline evaluations to manage quality and controlled experiments to establish behavioral or revenue impact.
- Treat feature flags, observability, privacy, and data governance as part of the growth system.
- Review failures weekly, ship testable improvements in two-week sprints, and prune bets that do not change customer behavior.
Start with one segment, one recurring intent, and one outcome. Trace the current path event by event, identify its weakest link, and write the smallest experiment that can test your explanation. That is enough to turn AI growth from a feature campaign into a learning system.

April 29, 2026

How to Run Customer-Facing AI Agents Across Sales and Support

You don’t have a chatbot problem. You have an operating-model decision: which customer outcomes may an AI agent own, when must a person take over, and who is accountable when the system gets it wrong?

Get those decisions right and one frontline system can qualify buyers, resolve routine requests, and give specialists better conversations. Get them wrong and you will automate confusion: weak meetings enter the pipeline, unresolved tickets look like successful deflection, and customers repeat themselves after every handoff.

Give the agent a job with an observable finish line

The phrase ‘handle customer conversations’ is not a usable product requirement. It describes a channel, not a job. An agent needs a bounded responsibility, the information and actions required to perform it, and an event that tells you whether the work was completed correctly.

Write a job card before designing prompts or choosing a model. It should specify:

- Customer job: the need the agent is expected to address, such as qualifying an inbound buyer or resolving a known setup question.
- Eligible intents: the requests it may own and the requests it must immediately transfer.
- Required context: identity, account state, product entitlement, lifecycle stage, prior conversations, or qualification facts.
- Allowed actions: retrieve an approved answer, update a permitted field, schedule a meeting, initiate a workflow, or route to a named queue.
- Completion event: a correctly qualified meeting, a documented disqualification, a verified resolution, or an accepted handoff.
- Failure event: an unsupported answer, an incorrect action, a dropped conversation, a lost handoff, or an outcome that violates policy.
- Accountable owner: one person who owns performance across the model, knowledge, workflow, integrations, and operating policy.
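The job card can live as a structured record rather than a slide. This is a minimal sketch, with hypothetical field values, of how the fields above could be captured and checked before the agent takes ownership of a conversation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobCard:
    customer_job: str
    eligible_intents: List[str]   # requests the agent may own
    transfer_intents: List[str]   # requests it must hand over immediately
    required_context: List[str]
    allowed_actions: List[str]
    completion_events: List[str]
    failure_events: List[str]
    accountable_owner: str

    def may_own(self, intent: str) -> bool:
        """The agent owns an intent only if it is eligible and not a forced transfer."""
        return intent in self.eligible_intents and intent not in self.transfer_intents

    def missing_context(self, available: Dict[str, Optional[str]]) -> List[str]:
        """List required context keys that are absent or empty."""
        return [key for key in self.required_context if not available.get(key)]


# Illustrative values only.
card = JobCard(
    customer_job="Resolve a known setup question",
    eligible_intents=["setup_question"],
    transfer_intents=["billing_dispute"],
    required_context=["identity", "product_entitlement"],
    allowed_actions=["retrieve_approved_answer"],
    completion_events=["verified_resolution"],
    failure_events=["unsupported_answer", "lost_handoff"],
    accountable_owner="support-ai-owner",
)
gaps = card.missing_context({"identity": "verified"})  # ["product_entitlement"]
```

Keeping the card in code or configuration also gives you something to version alongside prompts and policies.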

The finish line matters because apparent activity is easy to mistake for value. A calendar booking is not a sales success if the buyer does not meet the qualification rules. A conversation that ends without a human transfer is not a support resolution if the customer simply gives up.

Correct disqualification and justified escalation should count as valid outcomes. The objective is not to force every conversation into automation. It is to move every eligible conversation to the right outcome with the least avoidable effort.

Start by running the agent beside an existing human path. Parallel operation gives you a visible fallback, preserves service while the system is learning, and makes outcome quality easier to compare. Broaden ownership only after the agent performs reliably on the job it already has.

Route by customer intent, not your organization chart

Customers do not arrive thinking in sales and support queues. A question about a feature may come from an anonymous buyer, a trial user, an existing customer considering an upgrade, or a customer blocked from completing a task. The words can be identical while the correct response, permitted data, and next action are completely different.

This is why CRM integration and conversation context are core parts of the product rather than optional enrichment. The agent needs enough verified context to determine which job it is performing. It should not expose account-specific information, alter a record, or initiate a commercial workflow until identity and permissions are clear.

A practical conversation policy follows this sequence:

1. Establish the relationship. Determine whether the person is an anonymous visitor, prospect, trial user, customer, or authorized account contact.
2. Classify the job. Identify the outcome the customer wants, not merely the keywords in the message.
3. Retrieve permitted context. Load only the account, conversation, product, and lifecycle information needed for that job.
4. Ask for missing facts. Collect the minimum qualification or troubleshooting details required to make the next decision.
5. Complete or transfer. Take an approved action when confidence, policy, and permissions allow it. Otherwise, move the conversation to the correct person.
6. Record the disposition. Store the recognized intent, facts collected, actions attempted, outcome, and reason for any handoff.

The handoff is part of the agent experience. It should contain the person’s identity and account state, the stated goal, relevant facts, knowledge consulted, actions already attempted, results, and the recommended next step. A transcript dump is not enough. It makes the human reconstruct the problem and usually makes the customer repeat it.

Define transfer triggers before launch. Useful triggers include missing or contradictory approved knowledge, insufficient identity, an action outside the agent’s permissions, repeated failed attempts, an explicit request for a person, a commercial exception, or a conversation where relationship judgment matters more than speed.
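Transfer triggers are easier to audit when they are explicit conditions rather than prompt wording. The sketch below encodes several of the triggers named above; the state fields, reason codes, and the retry threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConversationState:
    identity_verified: bool
    approved_knowledge_found: bool
    action_within_permissions: bool
    failed_attempts: int
    asked_for_person: bool
    commercial_exception: bool = False


def transfer_reasons(state: ConversationState, max_failed_attempts: int = 2) -> List[str]:
    """Return every trigger that fired; an empty list means the agent may continue."""
    reasons: List[str] = []
    if state.asked_for_person:
        reasons.append("customer_requested_person")
    if not state.identity_verified:
        reasons.append("insufficient_identity")
    if not state.approved_knowledge_found:
        reasons.append("missing_approved_knowledge")
    if not state.action_within_permissions:
        reasons.append("action_outside_permissions")
    if state.failed_attempts >= max_failed_attempts:
        reasons.append("repeated_failed_attempts")
    if state.commercial_exception:
        reasons.append("commercial_exception")
    return reasons


# A verified customer whose question has approved knowledge, but who asks for a person.
reasons = transfer_reasons(ConversationState(True, True, True, 0, asked_for_person=True))
# reasons == ["customer_requested_person"]
```

Storing the returned reason codes with the disposition gives you the escalation-reason signal without parsing transcripts later.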

Keep the commercial objective visible without letting it corrupt support. Resolve the customer’s blocking issue before introducing an upgrade unless the customer explicitly asks about buying. Likewise, a low-intent visitor does not need to be forced into a meeting. The agent can direct that visitor to useful self-service material and preserve context for a later conversation.

Measure sales creation and support resolution separately

A single automation rate hides the decisions you need to make. Sales and support share an interface, but they create different outcomes. Give each motion its own scorecard and connect the two through shared measures for handoff quality, trust, and customer effort.

| Motion | Primary outcome | Diagnostic signals | Downstream proof |
| --- | --- | --- | --- |
| Sales | A correctly qualified meeting, documented disqualification, or appropriate nurture path | Qualified, disqualified, dropped, routed, and handoff-accepted conversations | Opportunity creation, attributable pipeline, and revenue |
| Support | A correct routine resolution or a context-rich transfer | Intent, topic, repeated attempt, escalation reason, time to resolution, and where customers abandon the flow | Successful resolution, repeat contact, sentiment, and CSAT |
| Shared experience | A trustworthy completion with no unnecessary restart | Unsupported answers, incorrect actions, lost context, policy violations, and customer-requested transfers | Outcome quality by intent, channel, customer type, and agent version |

Give agent-originated sales conversations a distinct origin field in the CRM. Retain the conversation identifier, final disposition, and qualification facts, then follow each cohort through opportunity and close. If agent results disappear inside total inbound performance, you cannot tell whether the system created incremental pipeline, shifted work from another channel, or merely booked more low-quality meetings. Meetings, pipeline, and revenue need explicit attribution.

Support needs the same discipline. Do not treat a lack of escalation as proof of resolution. Examine whether the requested task was completed, whether the answer came from approved knowledge, whether the customer returned with the same issue, and whether the handoff arrived in a usable state. Topic and intent analytics should reveal where demand is rising, where customers get stuck, and which workflows actually shorten resolution.

Use a high-performing human on the same channel as the operational benchmark. That comparison is more useful than a generic automation target because it preserves the standards customers already experience. It is a target for your system, not a claim that every agent meets it. Compare like with like: the same eligible intents, customer mix, qualification policy, and access to knowledge.

Before expanding eligibility, use eval-driven development and controlled experiments. Keep the eligibility rules stable during a comparison, segment results by intent, and change one major layer at a time. If the prompt, knowledge base, routing policy, and action permissions all change together, a better aggregate score will not tell you what improved or which new failure mode you introduced.

Put one owner over knowledge, guardrails, and iteration

A customer-facing agent is a production system, not a launch asset. Product knowledge changes. Qualification rules change. Integrations fail. Customers find language the original tests did not cover. Performance will drift unless someone owns the whole loop.

That owner needs program-level responsibility. In sales, the role may be an AI SDR program lead. In support, it may sit with an AI operations or product leader. The title matters less than the decision rights: the owner must be able to change eligibility, knowledge, prompts, workflows, routing, evaluation criteria, and rollout scope.

The operating loop should be explicit:

1. Review outcomes by intent. Inspect successful completions as well as failures; a passing aggregate can conceal one dangerous category.
2. Classify the failure. Separate knowledge gaps, intent errors, policy mistakes, tool failures, permission problems, poor handoffs, and correct answers delivered in an unhelpful way.
3. Fix the smallest upstream cause. Update the audited knowledge when the fact is missing, the workflow when the action is wrong, the policy when the boundary is unclear, or the conversation design when the interaction creates friction.
4. Replay representative evaluations. Test the changed component against known successful cases, known failures, ambiguous requests, and transfer scenarios.
5. Release to limited eligibility. Preserve the human fallback and monitor the affected intent before increasing traffic or adding actions.
6. Record the change. Version the knowledge, prompt, policy, workflow, and evaluation set so a metric movement can be traced to a real product change.

Ground answers in a retrieval-first pipeline backed by audited knowledge. The generative layer should explain and adapt approved information; it should not invent product behavior, policy, eligibility, or commercial commitments. When the agent can take action, give each action its own identity checks, required fields, permission boundary, confirmation behavior, and failure path.

CRM context improves relevance, but it also increases the cost of a permission mistake. Apply privacy-by-design at the workflow level: retrieve only what the current job needs, verify identity before exposing account details, restrict actions by role, and preserve an audit trail of what the agent saw and did. A fluent response does not compensate for unauthorized access.

The rollout is incomplete until human work changes. Salespeople should gain time for higher-conversion conversations, multi-stakeholder account development, guided trials, and situations where judgment affects the buying process. Support specialists should receive the nuanced, emotionally sensitive, or genuinely novel problems with the context already assembled.

Removing the human development path entirely is a brittle cost decision. The SDR role often develops future closing talent, while frontline support builds product and customer judgment. Move people toward higher-leverage work instead of assuming the function has become unnecessary.

Key takeaways: use six checks as your launch gate

- Is the job bounded? The eligible intents, required context, allowed actions, and prohibited actions are written down.
- Is success observable? Sales quality reaches pipeline and revenue; support quality reaches real resolution rather than mere non-escalation.
- Is the transfer designed? Triggers are explicit, the receiving queue is known, and the human receives a structured handoff instead of a raw transcript.
- Is attribution separate? Agent-originated conversations, dispositions, downstream outcomes, and versions can be analyzed without disappearing into channel totals.
- Is trust engineered? Approved knowledge, evaluations, identity controls, action permissions, privacy rules, and audit records exist before broad access does.
- Has human capacity been reassigned? Sales and support specialists have named higher-value work to absorb the time the agent releases.

If any answer is no, do not widen the agent’s scope yet. Tighten the job, instrumentation, or boundary that is missing. More traffic will amplify an unclear operating model faster than it will improve one.

Your next move is small but concrete: choose one frequent intent with audited knowledge and an unambiguous finish line. Write its job card, run it beside the existing human path, assign one accountable owner, and track the outcome through the system that ultimately matters. Expand only when the agent is reliably completing that job and the human team is using the released capacity deliberately.

April 28, 2026

Amplitude AI Product Analytics: A Practical Agent Playbook

You are deciding whether Amplitude Agents deserve a place in your product operating system. A fluent answer or polished insight is easy to admire. The harder question is whether the agent helps someone make a better decision, complete a valuable task, or change user behavior.

That distinction determines how you should instrument, evaluate, and roll out the experience. Treat Amplitude as the measurement spine connecting agent activity to funnels, cohorts, experiments, retention, and product outcomes. Otherwise, you will know that the agent was used without knowing whether it was useful.

Pick a workflow with an observable finish line

Do not begin with a broad ambition such as helping everyone understand the data. It cannot be measured cleanly, and it gives the agent too much room to produce plausible output without resolving a real job.

The useful standard is that AI product management remains accountable for helping teams build better products. The agent response is therefore an intermediate output, not the outcome. A strong starting point is one narrowly scoped, high-signal workflow with an unambiguous done state.

Write a workflow contract before configuring dashboards or prompts:

- User: Name the role doing the work, such as a product manager investigating onboarding friction.
- Trigger: Describe the event that makes the job necessary, such as a drop in activation or an unexpected cohort difference.
- Bounded job: State exactly what the agent should help accomplish.
- Required evidence: Identify the events, funnels, segments, or cohorts that should support the output.
- Done state: Define the observable action that marks useful completion.
- Fallback: Decide what happens when the inputs are missing, the evidence conflicts, or the agent cannot complete the task reliably.

For an onboarding investigation, the contract might ask the agent to help identify where a defined cohort leaves the activation journey and produce evidence-backed hypotheses for the product manager to review. The task is not complete when text appears. It is complete when the user reviews the relevant evidence and records a decision, launches a follow-up analysis, or creates an experiment.

Use a simple outcome ladder to keep the team honest: eligible users see the experience, some start it, some reach the workflow’s done state, some act on the result, and the intended product outcome changes. Each level answers a different question. Collapsing them into an agent usage metric hides the point at which value disappears.

Instrument the agent journey, not just the final answer

Your event design should let you reconstruct the journey from opportunity to outcome. The names below are examples, not an official Amplitude schema. Adapt them to your existing naming convention and governance rules.

| Journey stage | Question it answers | Suggested event |
| --- | --- | --- |
| Eligible | Who could reasonably use this workflow? | agent_workflow_eligible |
| Exposed | Who actually saw an entry point? | agent_entry_viewed |
| Started | Who chose to begin? | agent_run_started |
| Evidence reviewed | Who engaged with the information needed to judge the output? | agent_evidence_viewed |
| Completed | Who reached the workflow-specific done state? | agent_task_completed |
| Actioned | Who used the output in a downstream decision or action? | agent_output_applied |
| Handed off | Where did the experience require a deterministic flow or human review? | agent_handoff_triggered |
| Returned | Who came back when the job occurred again? | agent_run_started, segmented by prior successful completion |

Add properties that explain why behavior differs: workflow identifier, product surface, user role, account cohort, journey stage, agent version, prompt or instruction version, completion reason, handoff reason, and error class. Version properties are essential. Without them, a release can change output quality while the dashboard incorrectly treats the experience as one stable product.

If prompts may contain customer or company data, do not log the raw text by default. Prefer derived classifications, structured outcome fields, or properly redacted samples governed by your retention and access policies. Product analytics should increase observability without creating an unnecessary copy of sensitive input.

Build each metric with an explicit denominator:

- Discovery rate: exposed eligible users divided by eligible users.
- Start rate: users who start divided by users exposed to the entry point.
- Completion rate: users reaching the workflow-specific done state divided by users who start.
- Action rate: users taking the defined downstream action divided by users who complete.
- Retained use: previously successful users who return when the job recurs divided by previously successful users who had another opportunity.

The eligibility and opportunity conditions matter as much as the numerator. A user cannot be counted as retained on a workflow whose job has not recurred, and someone who never saw the entry point should not be treated as a failed starter.
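The denominators above can be made explicit in a few lines. The counts below are invented for illustration, and the key names are hypothetical; the point is that each rate divides by the population that actually had the chance to take the step.

```python
from typing import Dict, Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when nobody had the opportunity."""
    return numerator / denominator if denominator else None


# Illustrative user counts for one workflow over one period.
counts = {
    "eligible": 1000,
    "exposed": 400,
    "started": 100,
    "completed": 60,
    "actioned": 30,
    "prior_success_with_recurrence": 0,  # successful users whose job came up again
    "returned": 0,
}

funnel: Dict[str, Optional[float]] = {
    "discovery_rate": rate(counts["exposed"], counts["eligible"]),    # 0.4
    "start_rate": rate(counts["started"], counts["exposed"]),         # 0.25
    "completion_rate": rate(counts["completed"], counts["started"]),  # 0.6
    "action_rate": rate(counts["actioned"], counts["completed"]),     # 0.5
    "retained_use": rate(counts["returned"], counts["prior_success_with_recurrence"]),  # None
}
```

Returning `None` instead of zero for retained use keeps "no one had the opportunity yet" from being read as "no one came back".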

In Amplitude, separate the views rather than forcing everything into one chart. Use an exposure funnel for discoverability, a workflow funnel for completion, cohorts for segment differences, retention analysis for repeat behavior, and a guardrail view for errors, retries, and handoffs. Use Agent Analytics for the execution signals available from the agent, then connect those signals to the behavioral events that represent product value.

Keep output quality and product impact on separate scorecards

Behavioral analytics cannot tell you whether an answer was correct. An evaluation set cannot tell you whether customers changed their behavior. You need both views because they fail in different ways.

Before widening access, create an evaluation set drawn from the workflow contract. Include ordinary cases, incomplete inputs, ambiguous requests, conflicting evidence, and cases that should trigger a handoff. Grade the output against criteria that can be reviewed consistently:

- Correctness: Does the conclusion match the available evidence?
- Grounding: Can the user see which events, funnels, cohorts, or other inputs support it?
- Task adherence: Did the agent solve the bounded job rather than produce a generic analysis?
- Uncertainty handling: Does it distinguish supported conclusions from hypotheses?
- Handoff behavior: Does it stop or redirect appropriately when required evidence is unavailable?
- Actionability: Can the intended user make the next decision without reconstructing the analysis?

Record pass or fail for non-negotiable criteria such as unsupported conclusions and failed handoffs. Keep graded usefulness criteria separate. A high average score should not conceal a smaller set of serious failures.
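One way to keep the two kinds of criteria apart is to report hard failures as a list and usefulness as an average, and to gate release on the list alone. This is a sketch under assumed names: the `EvalResult` fields and the 1 to 5 usefulness scale are hypothetical, not a prescribed rubric.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    case_id: str
    unsupported_conclusion: bool  # non-negotiable: must be False
    failed_handoff: bool          # non-negotiable: must be False
    usefulness: int               # graded by a reviewer, assumed 1-5 scale


def summarize(results: list) -> dict:
    """Report hard failures separately so the average cannot hide them."""
    hard_failures = [
        r.case_id for r in results
        if r.unsupported_conclusion or r.failed_handoff
    ]
    return {
        "hard_failures": hard_failures,
        "mean_usefulness": mean(r.usefulness for r in results),
        "release_ok": not hard_failures,
    }


# Hypothetical results: a high average with one serious failure.
results = [
    EvalResult("c1", False, False, 5),
    EvalResult("c2", False, False, 5),
    EvalResult("c3", False, True, 4),
]
summary = summarize(results)
print(summary)
```

Here the average usefulness is high, yet `release_ok` is false because one case failed its handoff.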

Run the same evaluation set when you change instructions, tools, model configuration, retrieval behavior, or the data made available to the agent. This is the practical value of eval-driven development: a fast release becomes a controlled product change rather than an untraceable shift in behavior.

Your online scorecard should then contain distinct layers:

Primary outcome: the workflow-specific completion or downstream action that represents value.
Adoption diagnostics: eligibility, exposure, start rate, and first successful completion.
Quality diagnostics: evaluation results, user corrections, retries, and unsupported-output flags.
Operational guardrails: errors, latency appropriate to the workflow, abandonment, and handoffs.
Product impact: the activation, feature adoption, retention, or other behavioral outcome the workflow is intended to influence.

Choose one primary outcome before launch. The other measures explain why it moved or protect against a misleading win. If every metric is primary, the team can always find one that improved after the fact.

User ratings can help diagnose tone, relevance, or missing context, but they are not a substitute for observed outcomes. A response can feel impressive and still produce no action. It can also look terse and still help an expert complete the job quickly. Pair stated feedback with completion, downstream action, and return behavior.

Run an experiment that can survive executive scrutiny

Do not compare enthusiastic agent adopters with everyone who ignored it. Those groups selected themselves, so their product outcomes may have differed before the agent appeared. Establish a baseline and create a controlled comparison wherever the workflow and traffic permit it.

Write the hypothesis in behavioral terms. Name the user, workflow, expected action, and product outcome.
Measure the current workflow before introducing the agent. Capture completion, abandonment, downstream action, and relevant guardrails.
Define eligibility before assignment so the comparison includes people with the same underlying job.
Choose the assignment unit that matches how the workflow spreads. Use an account-level unit when teammates share agent output; use a user-level unit only when experiences are genuinely independent.
Expose the treatment through a feature flag or controlled rollout, while keeping the existing path available as the comparison and fallback.
Evaluate the primary outcome and guardrails together. Do not call a faster workflow successful if output quality, error handling, or downstream behavior deteriorates.
Inspect cohorts to understand a credible result, not to search endlessly for a segment that happens to look positive.
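The assignment-unit step can be sketched as a deterministic hash, assuming your feature-flag tool does not already handle bucketing for you. The function names and the `account_id` and `user_id` fields are hypothetical. Hashing the account identifier keeps teammates who share agent output in the same arm.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Map a unit to a stable arm using a hash of experiment and unit."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def variant_for_user(user: dict, experiment: str, shared_output: bool) -> str:
    """Pick the unit that matches how the workflow spreads."""
    # Account-level when teammates see each other's agent output,
    # user-level only when experiences are independent.
    unit = user["account_id"] if shared_output else user["user_id"]
    return assign_variant(unit, experiment)


# Hypothetical users in the same account.
ana = {"user_id": "u1", "account_id": "acct-9"}
ben = {"user_id": "u2", "account_id": "acct-9"}
print(variant_for_user(ana, "agent_v1", shared_output=True))
print(variant_for_user(ben, "agent_v1", shared_output=True))
```

Including the experiment name in the hash means the same account can land in different arms across unrelated experiments.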

The metric pattern often tells you where to investigate next:

High exposure with low starts can indicate weak positioning, poor timing, or an irrelevant eligible population.
Healthy starts with low completion can indicate that the promise is attractive but the workflow, inputs, or output quality is failing.
High completion with low downstream action can indicate that your done state is too shallow or the output is not trusted enough to use.
Strong agent engagement without movement in the product outcome can indicate a locally pleasant experience that does not change the broader journey.
Strong first use with weak return behavior can indicate novelty, unreliable value, or a job that simply occurs infrequently. Check opportunity before interpreting it as churn.
Good aggregate results with concentrated handoffs in one cohort can indicate missing context, permissions, or data for that segment.

Guardrails should be operational, not aspirational. Validate required inputs. Make the agent’s task and evidence boundaries clear. Route the user to a deterministic flow or human review when observable conditions show that the task cannot be completed. Missing data, failed tool calls, validation failures, and unsupported claims are stronger handoff triggers than an agent merely describing itself as confident.
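A sketch of routing on observable conditions, with hypothetical signal names. The agent's self-reported confidence is recorded but deliberately not consulted when deciding whether to hand off.

```python
from dataclasses import dataclass, field


@dataclass
class RunSignals:
    missing_inputs: list = field(default_factory=list)
    failed_tool_calls: int = 0
    validation_errors: int = 0
    unsupported_claims: int = 0
    self_reported_confidence: float = 0.0  # logged, not used for routing


def handoff_reasons(signals: RunSignals) -> list:
    """Return the observable reasons to route away from the agent."""
    reasons = []
    if signals.missing_inputs:
        reasons.append("missing_data")
    if signals.failed_tool_calls > 0:
        reasons.append("failed_tool_call")
    if signals.validation_errors > 0:
        reasons.append("validation_failure")
    if signals.unsupported_claims > 0:
        reasons.append("unsupported_claim")
    return reasons


# Hypothetical run: the agent sounds confident, but a tool call failed.
run = RunSignals(failed_tool_calls=1, self_reported_confidence=0.99)
reasons = handoff_reasons(run)
print("handoff" if reasons else "continue", reasons)
```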

Scale only when value repeats under real conditions

A spike in usage after launch mainly proves that people noticed something new. Scale when the complete chain repeats: eligible users discover the workflow, finish it, act on the result, and return when the same job appears again.

Segment that chain by role, account cohort, use case, journey, and agent version. A workflow that helps an experienced product analyst may confuse a first-time manager. An onboarding investigation may need different evidence and handoffs from a retention investigation. Aggregate adoption can hide both realities.

Expand the rollout when the primary outcome improves, evaluation quality remains stable across relevant cohorts, guardrail failures stay controlled, and repeat use matches the natural frequency of the job. Redesign when successful users cannot find the entry point, retries cluster around the same step, completed outputs rarely lead to action, or results depend on one unusually capable cohort.

Pause expansion when the agent does not improve the existing workflow, important outputs cannot be audited back to evidence, or failures cannot be routed safely. More exposure only creates more ambiguous data when the workflow contract itself is weak.

Key takeaways

Define one bounded workflow and an observable done state before measuring adoption.
Connect agent execution signals to exposure, completion, downstream action, and product outcomes in Amplitude.
Use evaluation sets for output quality and behavioral analytics for real-world impact; neither replaces the other.
Compare the agent with the existing workflow among equally eligible users.
Treat retries, errors, unsupported outputs, and handoffs as product signals, not merely engineering logs.
Scale repeatable value across cohorts and versions, not a launch-driven usage spike.

Your next move should fit on one page: the workflow contract, event map, evaluation criteria, experiment metric, and fallback path. If those elements are clear, Amplitude can show where the agent creates value and where it merely creates activity. If they are not clear, narrow the workflow before you widen the rollout.

April 28, 2026

Master Build-to-Learn: The Essential FAQ to Supercharge Product Discovery in the AI Era

In the age of AI, I’ve come to believe we’re all builders, yet not all building is the same. There is a meaningful difference between building to learn (product discovery) and building to earn (product delivery). When we confuse the two, we waste time, budget, and team energy on output over outcomes. My goal in this FAQ-style reflection is to clarify when and how to choose each mode so we can make smarter, faster, more confident product decisions.

Why does this distinction matter so much right now? Because as the cost of product delivery continues to drop, the scarce resource shifts from shipping capacity to clarity of problem, solution, and value. Cloud infrastructure, CI/CD, feature flags, and even gen AI code assistance have made it cheaper to launch. That’s great—but if we don’t learn the right things before we scale, we’ll efficiently deliver the wrong product. Discovery is how we de-risk that.

What do I mean by build to learn? I use discovery to quickly validate problems, test value, and shape solutions before committing delivery teams to scale. In practice, that means continuous discovery with customer interviews, rapid prototyping, and lightweight experiments that put us in front of real users fast. I rely on product trios and empowered product teams to co-own outcomes, not just output, and I anchor decisions with outcomes vs output OKRs so we stay focused on measurable impact.

How do I structure discovery sprints? I start with an opportunity solution tree to map customer pain points and candidate solutions, then select the smallest test that can invalidate a risky assumption. When signals are ambiguous, I refine the questions and instrument better learning loops rather than pushing harder on delivery. For experiments, I keep a bias toward speed: clickable prototypes, concierge tests, or gen AI prototypes often reveal more in days than a coded MVP does in weeks. When experiments go live, I set a clear minimum detectable effect (MDE) in advance and resist reading noise as signal.
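To make the MDE concrete, here is a sketch of one common normal-approximation formula for comparing two conversion rates. It is an illustration, not the method any particular experimentation tool uses. The baseline, lift, significance level, and power values are assumptions you would replace with your own.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical: 20% baseline completion, looking for a 5-point lift.
print(sample_size_per_arm(baseline=0.20, mde=0.05))
# A smaller MDE needs far more traffic.
print(sample_size_per_arm(baseline=0.20, mde=0.02))
```

If the required sample is larger than the traffic you can realistically get, that is a signal to choose a cheaper test or a larger expected effect before the experiment runs.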

Where does AI change the calculus? LLMs are speeding up discovery for product managers by accelerating research synthesis, persona drafts, and early concept validation. I pair that with eval-driven development to set crisp acceptance criteria for AI behaviors before any production integration. Prompt engineering and conversation design are part of the toolkit, but the same rule applies: prototype to learn, not to impress. AI can make bad ideas cheaper to build, so disciplined discovery matters more than ever.

So when do I switch to build to earn? Once I have evidence of value and feasibility, I shift into product delivery to scale with quality, security, and reliability. This is where I bring in product roadmapping and sprint planning, DORA metrics to monitor deployment frequency and lead time, and strong SRE and observability practices to safeguard the user experience. The handoff isn’t a wall; discovery continues inside delivery to refine scope, reduce risk, and maintain momentum.

What pitfalls do I watch for? The biggest is treating delivery as discovery—shipping features to “see what happens” without a clear learning thesis. Another is tech-first decisions driven by technology FOMO instead of product strategy and customer value. I also see teams set output-based commitments that crowd out learning; outcomes vs output OKRs keep us honest. And when considering build vs buy, I evaluate whether the capability differentiates us; if not, I’ll buy to preserve discovery capacity on what truly matters.

My operating conviction is simple: invest early and deliberately in build to learn so build to earn becomes high-confidence, high-velocity, and high-impact. In practical terms, that means smaller bets, faster feedback, clearer outcomes, and tighter collaboration across product, design, and engineering. If we get discovery right, delivery feels inevitable—and customers feel understood.

Inspired by this post on SVPG.

April 27, 2026

AI Product Data Security: A Practical Playbook for PMs

Your AI feature is ready to move beyond the prototype, but one question can still stop the release: exactly which customer data leaves your boundary, where is it copied, and who can retrieve it later? If the answer is scattered across architecture diagrams, vendor settings, and assumptions, you do not yet have a security decision.

You can resolve that uncertainty without turning every experiment into a committee exercise. Map the data path, assign the capability a risk lane, minimize what the model receives, and automate the controls that follow from the classification. The result is a release process that is both faster and easier to defend.

Start with the data path, not the model

The first security question is not what the model knows. It is what your product sends, retrieves, transforms, stores, logs, and displays. A provider can have a strong security posture while your implementation still exposes data through an overbroad retrieval query, a debug log, or an incorrectly scoped support tool.

Draw the complete path for one user request. Do not use a generic platform diagram. Follow the actual capability from the moment a user or system creates an input until every resulting copy has expired or been deleted.

Identify the original input, including form fields, uploaded files, messages, system-generated events, and API payloads.
List the context added by your application, such as account attributes, conversation history, analytics, retrieved documents, feature configuration, or tool results.
Mark every transformation before the model call: filtering, redaction, tokenization, summarization, chunking, or schema conversion.
Name the service that receives each payload, including gateways, model providers, observability tools, evaluation systems, queues, and caches.
Trace the response through validation, tool execution, display, analytics, support access, and downstream storage.
Record when each copy expires, how deletion propagates, and who can access it while it exists.

For every step, capture six fields: data class, system owner, access scope, external recipient, retention rule, and failure consequence. If any field is unknown, label it unknown. An explicit unknown is useful discovery work; an undocumented assumption is hidden risk.
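A sketch of the six-field record, assuming a simple in-code representation. The class name, field names, and the two sample steps are hypothetical. Every field defaults to an explicit unknown so gaps show up as work items instead of silent assumptions.

```python
from dataclasses import dataclass, fields

UNKNOWN = "unknown"


@dataclass
class DataPathStep:
    name: str
    data_class: str = UNKNOWN
    system_owner: str = UNKNOWN
    access_scope: str = UNKNOWN
    external_recipient: str = UNKNOWN
    retention_rule: str = UNKNOWN
    failure_consequence: str = UNKNOWN


def open_unknowns(steps: list) -> list:
    """List every (step, field) pair still marked unknown."""
    return [
        (step.name, f.name)
        for step in steps
        for f in fields(step)
        if getattr(step, f.name) == UNKNOWN
    ]


# Hypothetical path for one request.
path = [
    DataPathStep("model call", "confidential", "platform team",
                 "service account", "model provider", "30 days",
                 "customer text disclosed"),
    DataPathStep("debug log", data_class="confidential",
                 system_owner="platform team"),
]
for step_name, field_name in open_unknowns(path):
    print(f"{step_name}: {field_name} is unknown")
```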

Do not stop at obvious records such as customer PII and payment identifiers. Prompts, retrieved context, user-linked analytics, internal roadmaps, feature flags, configuration values, embeddings, vector stores, and evaluation datasets can also reveal confidential facts or inferred identity. Treat them as product data with owners and controls, not harmless implementation residue.

Use a completion test that exposes weak assumptions

Your map is ready for a decision when someone outside the feature team can answer these questions from it:

What is the most sensitive field the capability can receive?
Which fields cross the company boundary, and which named service receives them?
Can one customer ever retrieve another customer’s data?
Are raw prompts, completions, retrieved passages, or tool results logged?
Which identities can inspect those logs or replay a request?
What happens to derived data when the original record is deleted or its permissions change?
Which control contains the incident if the model, retrieval layer, or tool call behaves unexpectedly?

If the team can only answer these questions by asking several vendors or searching production settings, keep the release open. The missing work is not paperwork. It is part of the product’s operating design.

Turn the risk assessment into a release lane

A risk score is useful only when it changes what the team must do. Avoid a long questionnaire that ends with an ambiguous rating. Use a small number of lanes, give each lane an observable entry condition, and attach default release controls.

Risk lane	Typical signals	Default release posture
Low	Internal capability; synthetic or public inputs; no sensitive context; no consequential external action	Approved provider, least-privilege credentials, basic access tests, and confirmation that secrets are not entering prompts or logs
Elevated	Customer-facing capability; authenticated user context; behavioral telemetry; stored prompts or outputs; retrieval from private content	Data minimization, pre-call redaction, permission-aware retrieval, explicit retention, adversarial evaluations, runtime monitoring, and a named incident owner
High	Regulated-data adjacent; payment identifiers; broad confidential retrieval; sensitive identity data; or authority to perform a consequential action	Early security, legal, privacy, and data involvement; documented threat model; human approval where an action warrants it; verified containment; and release evidence reviewed before exposure

These lanes are an operating model, not a compliance determination. Applicable controls depend on the actual data, customer contracts, geography, industry, and use case. Security and legal specialists should make those determinations when the capability creates legal, regulatory, or material customer exposure.

Classify the capability, not the entire product. A writing assistant that uses text supplied for a single request may sit in a different lane from an account assistant that searches every customer conversation and updates CRM records, even when both use the same model.

Score the capability across these dimensions:

Data sensitivity: public, internal, confidential, personal, payment-related, or regulated-data adjacent.
Audience: constrained employee group, all employees, authenticated customers, or public users.
Retrieval reach: one supplied record, an authorized account subset, or a broad internal corpus.
Action authority: produces a suggestion, drafts a change, or executes an external action.
Persistence: ephemeral processing, structured event storage, or retained raw inputs and outputs.
Third-party exposure: stays inside your controlled environment or passes through one or more providers and subprocessors.

Use the highest-risk dimension to set the initial lane. Lower it only after a design change removes the exposure. A promise to be careful is not a mitigating control; scoped retrieval, enforced redaction, disabled raw logging, and restricted tool permissions are.
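The highest-dimension rule can be sketched as a maximum over per-dimension lanes. The mapping from each dimension's value to a lane is your team's judgment and is assumed here. The scores below are hypothetical.

```python
from enum import IntEnum


class Lane(IntEnum):
    LOW = 1
    ELEVATED = 2
    HIGH = 3


def initial_lane(dimension_lanes: dict) -> tuple:
    """Return the highest lane and the dimensions that drive it."""
    lane = max(dimension_lanes.values())
    drivers = [name for name, value in dimension_lanes.items() if value == lane]
    return lane, drivers


# Hypothetical scoring for an assistant that can update CRM records.
scores = {
    "data_sensitivity": Lane.ELEVATED,
    "audience": Lane.ELEVATED,
    "retrieval_reach": Lane.LOW,
    "action_authority": Lane.HIGH,
    "persistence": Lane.LOW,
    "third_party_exposure": Lane.ELEVATED,
}
lane, drivers = initial_lane(scores)
print(lane.name, drivers)
```

Returning the driving dimensions shows the team which design change would actually lower the lane.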

Reclassify when the feature changes its data, audience, retrieval reach, retention, provider, or ability to act. A seemingly small roadmap addition, such as remembering past conversations or connecting a second data source, can change the security posture more than a model upgrade does.

Design the system to disclose less data

The most reliable way to protect data is to keep unnecessary data out of the AI path. Encryption and contractual terms matter, but they do not make an irrelevant customer field necessary. Start with the user outcome and ask which minimum facts the model needs to produce it.

Minimize before you redact

Redaction is a valuable deterministic safeguard, but it should not carry the whole design. Free-form text can contain names, secrets, identifiers, and confidential business information in formats your rules do not recognize. Reduce the payload first, then redact the smaller payload that remains.

Replace a full customer object with the few fields required for the task.
Use a temporary account token when the model does not need a person’s name, email address, or payment identifier.
Convert long interaction histories into purpose-specific structured fields when the task does not require the original prose.
Exclude internal notes, disabled fields, hidden metadata, and unrelated attachments by default.
Log structured events such as policy result, model identifier, latency, and request status when raw prompt text is not required.

Separate identity from content wherever the workflow allows it. The application can retain the relationship between a temporary token and an account while the model processes only the content needed for the task. Access to the token map should remain narrower than access to routine AI telemetry.
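A minimal sketch of minimize-then-redact with a temporary token. The allowlisted fields, the token format, and the single email pattern are assumptions. A real redaction rule set would cover more patterns and would still miss some, which is why the allowlist runs first.

```python
import re
import secrets

# Hypothetical allowlist: only the fields this task needs.
ALLOWED_FIELDS = {"plan", "open_ticket_summary"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def minimize(customer: dict) -> dict:
    """Keep only allowlisted fields; everything else never leaves."""
    return {k: v for k, v in customer.items() if k in ALLOWED_FIELDS}


def redact(text: str) -> str:
    """Replace one known sensitive pattern in the remaining text."""
    return EMAIL.sub("[EMAIL]", text)


def build_payload(customer: dict, token_map: dict) -> dict:
    """Minimize, then redact, then attach a token instead of identity."""
    token = "acct_" + secrets.token_hex(8)
    token_map[token] = customer["account_id"]  # stays inside your boundary
    payload = {
        k: redact(v) if isinstance(v, str) else v
        for k, v in minimize(customer).items()
    }
    payload["account_token"] = token
    return payload


# Hypothetical customer object.
customer = {
    "account_id": "A-17", "name": "Jane Doe", "plan": "pro",
    "internal_notes": "escalated twice",
    "open_ticket_summary": "Customer wrote from jane@example.com about billing",
}
token_map = {}
payload = build_payload(customer, token_map)
print(payload)
```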

Make retrieval permission-aware

A retrieval-first architecture can keep the raw corpus inside your controlled boundary while selecting only relevant context for a request. It is not automatically private. If an external model receives the selected passages, those passages still cross the boundary and still require minimization, redaction, approved-provider controls, and a clear retention policy.

Apply authorization when the request is made, not only when content is indexed. The retrieval layer should constrain results by tenant, user, role, and current document permissions before any text becomes model context. Do not index content that the eventual searcher could never be allowed to read unless the architecture has another enforceable isolation boundary.
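A sketch of request-time authorization, using a toy keyword match in place of a real vector search. The `Doc` and `Requester` types and the sample index are hypothetical. What matters is the order: the permission filter runs before ranking, so unauthorized text never becomes a candidate for model context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    allowed_roles: frozenset
    text: str


@dataclass(frozen=True)
class Requester:
    user_id: str
    tenant_id: str
    role: str


def authorized(doc: Doc, who: Requester) -> bool:
    """Check tenant and role against the document's current permissions."""
    return doc.tenant_id == who.tenant_id and who.role in doc.allowed_roles


def retrieve(query: str, index: list, who: Requester, limit: int = 3) -> list:
    """Filter by permission first, then rank what remains."""
    terms = query.lower().split()
    candidates = [d for d in index if authorized(d, who)]
    scored = [(sum(t in d.text.lower() for t in terms), d) for d in candidates]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:limit]]


# Hypothetical index spanning two tenants.
index = [
    Doc("d1", "t1", frozenset({"agent", "admin"}), "Refund policy for annual plans"),
    Doc("d2", "t2", frozenset({"agent"}), "Refund exceptions for tenant two"),
    Doc("d3", "t1", frozenset({"admin"}), "Refund fraud investigation notes"),
]
alice = Requester("u1", "t1", "agent")
results = retrieve("refund policy", index, alice)
print([d.doc_id for d in results])
```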

Treat embeddings and vector-store metadata as sensitive derived data. A vector is not a magic anonymizer, and metadata can disclose document names, account relationships, categories, or activity patterns even when full text is elsewhere. Your deletion and permission-change process must reach the index, cached results, evaluation copies, and any stored citations, not just the primary database.

Retrieved content is also untrusted input. A malicious or compromised document can contain instructions intended to change model behavior. Keep system instructions separate, restrict available tools, validate tool arguments, and enforce authorization in application code. The model should never be the component that decides whether a user may access a record or perform an action.

Place deterministic controls on both sides of the call

Before the call: validate the request schema, remove disallowed fields, redact known sensitive patterns, apply allow and deny policies, and constrain retrieval.
After the call: validate output structure, block disallowed sensitive patterns, verify any cited record belongs to the authorized scope, and check tool arguments before execution.
During operation: monitor unusual prompt, output, retrieval, and access patterns without creating a second uncontrolled store of raw content.

An output filter cannot undo data already disclosed to an external provider. Use post-call checks to protect users and downstream systems, but use pre-call minimization and access enforcement to prevent the disclosure itself.
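A sketch of one post-call check: validating a proposed tool call in application code before it runs. The tool name, argument names, and the shape of the `call` dictionary are hypothetical and not tied to any provider's format.

```python
# Hypothetical allowlist: tool name -> permitted argument names.
ALLOWED_TOOLS = {"draft_reply": {"ticket_id", "body"}}


def check_tool_call(call: dict, authorized_ticket_ids: set) -> list:
    """Return the reasons to block this call; an empty list means proceed."""
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        return [f"tool not allowed: {name}"]
    problems = []
    extra = set(args) - ALLOWED_TOOLS[name]
    if extra:
        problems.append(f"unexpected arguments: {sorted(extra)}")
    # Authorization is decided here, not by the model.
    if args.get("ticket_id") not in authorized_ticket_ids:
        problems.append("ticket outside authorized scope")
    return problems


# Hypothetical calls proposed by the model.
ok_call = {"name": "draft_reply", "arguments": {"ticket_id": "T-1", "body": "Hi"}}
bad_call = {"name": "draft_reply", "arguments": {"ticket_id": "T-9", "body": "Hi"}}
print(check_tool_call(ok_call, {"T-1", "T-2"}))
print(check_tool_call(bad_call, {"T-1", "T-2"}))
```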

Make vendor approval specific to the intended use

Do not approve an AI vendor in the abstract. Approve a defined service, account configuration, data class, region, retention posture, and use case. A provider suitable for public-content summarization may not be suitable for customer conversations or payment-related identifiers.

Ask questions that produce enforceable answers rather than broad assurances:

Training and service improvement: Can prompts, files, retrieved passages, outputs, feedback, or metadata be used to train models or improve services? Is the restriction a default, a setting, or a contractual term?
Retention: How long does each data type remain in primary systems, safety systems, failure logs, backups, and support tooling? What initiates deletion, and what exceptions apply?
Human access: Under what conditions can provider personnel inspect customer content, and how is that access authorized, logged, and reviewed?
Security controls: Is data encrypted in transit and at rest? What key-management options, private networking, scoped credentials, access logs, and administrative controls are available?
Location and subprocessors: Which regions process and store the data? Where can support access occur? Which subprocessors participate in the path?
Assurance evidence: Which services and controls are covered by SOC 2, ISO 27001, or HIPAA-related commitments where relevant to the use case?
Response: How will the provider communicate a security incident, policy change, model change, or subprocessor change that affects your approved use?

An audit or certification is useful evidence about a defined scope. It is not proof that your architecture, settings, or use case is safe. Confirm that the service named in the evidence is the service your product will actually call, and that your configuration does not bypass the controls you evaluated.

Keep a short decision record with the approved purpose, permitted and prohibited data, named endpoints or services, required account settings, retention terms, region, responsible owner, and review triggers. Reopen the decision when the purpose, data class, provider terms, model path, subprocessor chain, or architecture changes.

A shared catalog of approved providers and patterns also reduces shadow AI. Make the approved route easier to use by supplying scoped credentials, reference architectures, redaction utilities, retrieval patterns, and clear examples of prohibited inputs. Governance works better when the safe path is a usable product for internal teams.

Put the controls into delivery and incident response

A policy that depends on every engineer remembering every rule will drift. Store the capability’s classification, required controls, approved provider configuration, and decision owner alongside the delivery artifacts. Version changes so the team can see when a new data source or retention behavior altered the release posture.

Translate the release lane into automated checks wherever the control can be tested:

Scan prompts, templates, configuration, and code for exposed secrets and unapproved endpoints.
Unit-test redaction and tokenization against representative allowed and disallowed inputs.
Integration-test tenant boundaries, role permissions, retrieval filters, and deletion propagation.
Run evaluations that attempt to elicit restricted data, override instructions, retrieve unauthorized records, or trigger tools outside the allowed scope.
Validate the selected provider, model path, region, logging setting, and retention configuration against the approval record.
Block release when required evidence, monitoring, rollback controls, or an incident owner is missing.
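The configuration and release-blocking checks above can be sketched as a comparison between the approval record and what is actually deployed. All keys and values are hypothetical. In a pipeline, a non-empty result would fail the build.

```python
# Hypothetical settings that must match the approval record exactly.
EXACT_KEYS = ("provider", "model_path", "region", "raw_logging")


def config_violations(approval: dict, deployed: dict) -> list:
    """Compare deployed settings with the approval record."""
    violations = [
        f"{key}: approved {approval[key]!r}, deployed {deployed.get(key)!r}"
        for key in EXACT_KEYS
        if deployed.get(key) != approval[key]
    ]
    # A missing retention setting is treated as unbounded retention.
    if deployed.get("retention_days", float("inf")) > approval["max_retention_days"]:
        violations.append("retention exceeds approved maximum")
    if not deployed.get("incident_owner"):
        violations.append("incident owner missing")
    return violations


# Hypothetical approval record and deployed configuration.
approval = {"provider": "vendor-a", "model_path": "chat-large",
            "region": "eu", "raw_logging": False, "max_retention_days": 30}
deployed = {"provider": "vendor-a", "model_path": "chat-large",
            "region": "us", "raw_logging": False, "retention_days": 30}
problems = config_violations(approval, deployed)
print("BLOCK" if problems else "PASS", problems)
```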

Evaluation data needs the same scrutiny as production data. Remove unnecessary identities, restrict access, define retention, and avoid copying raw customer interactions merely because an evaluation system is internal. A test corpus can become a long-lived data store if nobody owns its lifecycle.

Monitor security-relevant events rather than indiscriminately recording content. Useful signals include blocked sensitive-data patterns, denied cross-scope retrieval, calls to unapproved services, unusual access behavior, unexpected changes in model or endpoint usage, and failed retention or deletion jobs. Structured metadata often provides the operational signal you need without preserving every prompt and completion.

Prepare containment before the first customer request

Your incident runbook should name the people and mechanisms needed to contain the feature. Depending on the incident, that can include disabling the affected path with a feature flag, revoking or rotating credentials, restricting retrieval, stopping unsafe logging, locating downstream copies, and contacting the provider.

Do not improvise evidence deletion or customer notification during an incident. Security, privacy, and legal owners should determine preservation, notification, and regulatory obligations based on the specific exposure. The product runbook should make those owners reachable and give them an accurate data-flow record, timestamps, affected systems, and containment status.

After containment, update the control that failed: the architecture, automated check, provider setting, policy, runbook, or team guidance. A review that ends with a reminder to be more careful leaves the same mechanism in place.

Key takeaways

Map every copy of the data, including retrieved passages, logs, embeddings, evaluations, caches, and tool results.
Classify individual capabilities by their highest-risk dimension, then attach mandatory controls to the lane.
Minimize fields before redaction, enforce permissions outside the model, and treat derived stores as sensitive.
Approve vendors for a named use, configuration, data class, region, and retention posture rather than issuing blanket approval.
Put redaction, access, retrieval, configuration, evaluation, and release checks into CI/CD.
Design containment and ownership before launch so an incident does not begin with a search for the right people and switches.

Pick one AI capability currently approaching release and produce its request-to-deletion data map. Assign its lane, turn every unknown into an owned backlog item, and automate the first control the team is still checking by hand. That is how security becomes part of product delivery instead of a negotiation at the end.

References

Shivam.Consulting Blog – AI Data Security for Product Teams: Protect Sensitive Product Data Without Slowing Innovation

April 27, 2026
