Category: Product Management

I Pointed a “Ralph Wiggum” AI Loop at My Product for a Week—The Data That Stopped Chaos

I spent a week pointing a "Ralph Wiggum loop" at my product to see how far an agentic AI could take pragmatic, everyday improvements without human micromanagement. It was equal parts exhilarating and nerve-wracking. The short version: the loop moved fast and broke assumptions, but Amplitude analytics kept it from going off the rails—and turned chaos into controlled acceleration.

By "Ralph Wiggum loop," I mean a deliberately naive, endlessly curious cycle: try something small, ship it behind a flag, watch the data, then try again. It is the product equivalent of a fearless intern who experiments constantly. That energy is invaluable for discovery, but it absolutely demands strong guardrails and a clear definition of success.

Before I started, I framed the outcomes I cared about: user activation within the first session, reduction in time-to-value, and early retention indicators. I set baselines and a minimum detectable effect (MDE) for A/B testing so the loop could distinguish noise from signal. I also documented a driver tree of the behaviors we wanted to influence and verified that every event was cleanly instrumented in Amplitude, so the behavioral data behind each decision could be trusted.

The guardrails mattered most. I put every change behind feature flags with instant rollback. I defined "off the rails" conditions upfront, including regression thresholds for activation and retention analysis, and enabled anomaly detection to surface unexpected spikes or drops. Session replay was ready to diagnose confusion fast, and I kept a daily evaluation cadence so the loop never ran unattended for long.
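As a rough illustration, "off the rails" conditions like these can be written down as data before the loop starts. The sketch below is an assumption about how such a check might look: the metric names, baselines, thresholds, and the `breached` helper are all hypothetical, and it does not call any Amplitude API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str               # hypothetical metric name
    baseline: float           # value measured before the experiment
    max_relative_drop: float  # 0.05 means a 5% relative drop triggers rollback


def breached(guardrails, observed):
    """Return the metrics whose observed value fell too far below baseline."""
    failures = []
    for g in guardrails:
        value = observed.get(g.metric)
        if value is None:
            # Missing data counts as a breach: no signal, no exposure.
            failures.append(g.metric)
            continue
        drop = (g.baseline - value) / g.baseline
        if drop > g.max_relative_drop:
            failures.append(g.metric)
    return failures


GUARDRAILS = [
    Guardrail("activation_rate", baseline=0.40, max_relative_drop=0.05),
    Guardrail("day7_retention", baseline=0.25, max_relative_drop=0.04),
]

# Activation is up, but retention has softened past its threshold.
observed = {"activation_rate": 0.43, "day7_retention": 0.23}
print(breached(GUARDRAILS, observed))  # a non-empty list means: roll back
```

Treating a missing metric as a breach is a deliberate choice in this sketch: an experiment that cannot be measured should not keep its exposure.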

Day by day, the loop proposed micro-experiments: onboarding copy variants, tooltip timing, in-app guide sequencing, and subtle changes to progressive disclosure. Each iteration shipped behind a flag to a small cohort. I watched leading indicators in real time, then zoomed out to cohort views to guard against short-term gains that might erode longer-term value. When something looked promising, we expanded exposure methodically; when something looked risky, we paused immediately.

We had a pivotal moment where the loop suggested a bolder call-to-action that spiked activation. On the surface, it looked like a win. Amplitude cohorts told a fuller story: downstream engagement softened, and anomaly detection flagged a pattern that hinted at premature conversion rather than genuine intent. A quick rollback through feature flags saved the week—and reminded me why eval-driven development should be the default for agentic AI workflows.

The most surprising part was how quickly the loop unlocked small compounding gains once the measurement scaffolding was in place. With a unified analytics platform and crisp guardrails, the system became a safe sandbox where the AI could explore aggressively while we stayed anchored to outcomes. The combination of behavioral analytics, A/B testing discipline, and daily human review turned raw speed into durable learning.

My takeaways are direct. Agentic AI can accelerate discovery, but only if you define stop conditions and wire strict feedback loops into your stack. Measurement is product strategy here—without it, you get noisy activity instead of progress. Invest in instrumentation first, treat feature flags as non-negotiable, and let anomaly detection and session replay be your early warning system. Most of all, tie every experiment to activation, engagement, or retention, not vanity metrics.

If you’re considering your own week with a "Ralph Wiggum loop," start painfully small, constrain the blast radius, and insist on decision-quality data. Do that, and you’ll turn a chaotic agent into a compounding engine for product discovery—one that moves fast, learns faster, and stays on track.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of agentic and data driven product development." That vision resonates deeply with how I lead teams: anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators (activation, engagement depth, and retention), then instrument product telemetry in Amplitude so behavioral analytics can surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action, grounded in a retrieval-first pipeline and governed by eval-driven development. For product managers, this demands fluency with LLMs and practical prompt engineering, plus a rigorous AI strategy around data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

Cross-functional cadence is everything. I partner closely with Principal AI Engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and OKRs framed around outcomes rather than outputs to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

But AI has expanded what’s just now possible—for our products, and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

A clear, color‑coded map from desired outcome to opportunities, solutions, and assumption tests—showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.

Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.

Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.

That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.

On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.

In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.

Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.

At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.

Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.

When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
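The split-then-merge example can be reproduced in a few lines. This is a minimal sketch, not the service's actual data model: the `Node` class and the `split`, `merge`, and `naive_diff` helpers are hypothetical names, and the tree is flattened to a dictionary for brevity.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    sources: set = field(default_factory=set)   # evidence from interviews
    children: set = field(default_factory=set)  # sub-opportunities


def split(tree, node, new, sources, children):
    """Move the given sources and children out of `node` into a new node."""
    tree[node].sources -= sources
    tree[node].children -= children
    tree[new] = Node(set(sources), set(children))


def merge(tree, keep, delete):
    """Fold `delete` into `keep`, carrying over all evidence and children."""
    gone = tree.pop(delete)
    tree[keep].sources |= gone.sources
    tree[keep].children |= gone.children


def naive_diff(before, after):
    """Compare only the start and end states, node by node."""
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        old = before[name].sources | before[name].children
        new = after[name].sources | after[name].children
        if old != new:
            changes[name] = {"lost": sorted(old - new), "gained": sorted(new - old)}
    return changes


before = {
    "A": Node({"1", "2", "3"}, {"x", "y", "z"}),
    "C": Node({"4", "5"}, {"t", "u", "v"}),
}
after = copy.deepcopy(before)
split(after, "A", "B", sources={"3"}, children={"y", "z"})
merge(after, keep="C", delete="B")

print(naive_diff(before, after))
```

The diff reports that A lost 3, y, and z and that C gained them. Node B never appears, because it existed only between the two moves, so the explanation for the transfer is gone.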

See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.

That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.

For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set didn’t actually generate the recommendation.

Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
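One way to express the directional rule is to reject any merge that deletes an ancestor of the node being kept. The sketch below assumes a simple parent-to-children mapping; `descendants` and `merge_is_valid` are hypothetical helpers, and the real rule set may differ (for example, it could re-parent the remaining children instead of rejecting the move).

```python
def descendants(children_of, node):
    """Collect every node below `node` in a parent -> children mapping."""
    found = set()
    stack = list(children_of.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children_of.get(current, ()))
    return found


def merge_is_valid(children_of, keep, delete):
    """Reject merges that would delete an ancestor of the node being kept."""
    if keep == delete:
        return False
    return keep not in descendants(children_of, delete)


tree = {"A": ["B", "C", "D"]}
print(merge_is_valid(tree, keep="A", delete="B"))  # parent absorbs child
print(merge_is_valid(tree, keep="B", delete="A"))  # would orphan C and D
```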

Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.

Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, that meant roughly half of all runs still failed. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
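The reason a small per-call error rate hurts so much is compounding: a run fails if any one of its calls fails. The sketch below makes a simplifying assumption that is mine, not the post's, namely that each call fails independently at exactly 1 in 40.

```python
def run_failure_probability(per_call_failure, calls):
    """Chance that at least one of `calls` independent calls fails."""
    return 1 - (1 - per_call_failure) ** calls


for calls in (10, 20):
    share = run_failure_probability(1 / 40, calls)
    print(f"{calls} calls per run: about {share:.0%} of runs fail")
```

Under that assumption the estimate lands between roughly a fifth and two fifths of runs. The higher share reported above would be consistent with more than one change-set instance per call or with failures that cluster, but either way the conclusion holds: a rate that looks small per call is far too high per run.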

Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
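To make the detection side concrete, here is a minimal sketch of one verifier: it walks a change set in order and flags any step that uses a node that no longer exists. The step format (`op`, `node`, `keep`, `delete`, `new`) is a hypothetical schema chosen for illustration, not the service's actual contract, and the checks for data loss and orphans are omitted.

```python
def validate_change_set(existing_nodes, change_set):
    """Return readable errors; an empty list means the steps are consistent."""
    live = set(existing_nodes)
    errors = []
    for i, step in enumerate(change_set, start=1):
        op = step["op"]
        if op == "add":
            used, created, removed = [], [step["node"]], []
        elif op == "delete":
            used, created, removed = [step["node"]], [], [step["node"]]
        elif op == "merge":
            used, created, removed = [step["keep"], step["delete"]], [], [step["delete"]]
        elif op == "split":
            used, created, removed = [step["node"]], [step["new"]], []
        else:
            errors.append(f"step {i}: unknown operation '{op}'")
            continue
        for name in used:
            if name not in live:
                errors.append(f"step {i}: '{name}' does not exist at this point")
        for name in created:
            if name in live:
                errors.append(f"step {i}: '{name}' already exists")
        live -= set(removed)
        live |= set(created)
    return errors


bad = [
    {"op": "delete", "node": "A"},
    {"op": "merge", "keep": "B", "delete": "A"},
]
print(validate_change_set({"A", "B"}, bad))
```

The validator can say that step 2 refers to a deleted node. It cannot say whether the model meant to discard A or to fold A's evidence into B, which is the gap between detection and correction.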

A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set must generate the output tree. I had already built robust validation code. I knew exactly when a change set failed—and why. No amount of prompt tuning alone was fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses those instructions to repair the change set and calls the tool again. Iterate until success or a max number of turns.
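The shape of that loop can be sketched as follows. Everything here is a stand-in: `stub_model` plays the role of the LLM call, `stub_validate` plays the role of the validation tool, and the loop calls the validator on the model's behalf, whereas in the setup described above the model invokes the tool itself.

```python
def repair_loop(propose, validate, max_turns=5):
    """Ask `propose` for a change set until `validate` returns no errors."""
    feedback = []
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)    # the model sees the last errors
        feedback = validate(change_set)   # the tool returns specific failures
        if not feedback:
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")


def stub_validate(change_set):
    """Toy validator: flag any step that touches an already deleted node."""
    deleted, errors = set(), []
    for i, (op, node) in enumerate(change_set, start=1):
        if node in deleted:
            errors.append(f"step {i}: '{node}' was already deleted")
        if op == "delete":
            deleted.add(node)
    return errors


def stub_model(feedback):
    """Toy model: makes a mistake first, then repairs it once it gets feedback."""
    if not feedback:
        return [("delete", "A"), ("reframe", "A")]
    return [("reframe", "A")]


change_set, turns = repair_loop(stub_model, stub_validate)
print(change_set, "accepted after", turns, "turns")
```

The turn limit matters as much as the validator: without it, a loop that fails to converge only accumulates compute.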

I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.

Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

Third, I learned when I can and can’t trust Claude to write code for me. Since Opus 4.6 came out, I had been giving Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I simply trusted that Claude had gotten it right, when in fact it hadn’t. If you don’t have the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.

Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.

Inspired by this post on Product Talk.

May 13, 2026

AI-Assisted Product Strategy: A Practical Operating System

You can get an AI model to produce a roadmap in minutes. That is precisely the problem. A polished roadmap can hide weak evidence, unresolved trade-offs, and a strategy that never made a real choice.

The useful question is not whether AI can do product management work. It is where AI should accelerate the path from evidence to decision, where human judgment must remain explicit, and how you will know the resulting strategy is working. The operating system below gives you that separation.

Key takeaways

Give AI a defined role in the decision process. It can extract, organize, challenge, and draft; the product leader still owns choices, trade-offs, and commitments.
Build a strategy chain from customer problem to business result before asking AI for initiatives. Otherwise, the model will fill strategic gaps with plausible language.
Ground every workflow in canonical product context, and require every important claim to point back to evidence.
Use AI to shorten discovery synthesis, not to turn a limited set of interviews or support conversations into false market certainty.
Carry the same strategic hypothesis through the roadmap, experiment, launch, and learning review. Changing the success definition between those stages makes measurement meaningless.

Start with decision architecture, not a better prompt

Most weak AI-assisted strategy work begins with an underspecified request: analyze this feedback, prioritize these ideas, or build a roadmap. The model responds by making silent assumptions about the customer, the business objective, and the meaning of priority. Its output may read well while answering a question nobody deliberately chose.

Write a decision brief before opening the model. This is not a conventional product requirements document. It is a compact contract defining the decision AI is helping you make.

Decision: State the choice in one sentence. For example, decide which onboarding opportunity deserves discovery capacity in the next planning cycle.
Target customer and context: Name the segment, job, and situation. Feedback from an administrator configuring an account should not be blended with feedback from an end user completing a daily task.
Desired outcome: Identify the customer behavior you want to change and the business result it is expected to influence.
Evidence in scope: List the interviews, behavioral data, support conversations, journey maps, and prior experiments the model may use.
Constraints: Include privacy requirements, technical dependencies, commercial commitments, capacity limits, and non-goals.
Decision owner: Name the person accountable for accepting the trade-off. An AI-generated recommendation does not distribute accountability.

Build a strategy chain the model can inspect

Your strategy should form a traceable chain:

Choose the customer and job that matter.
Define the value proposition, including what must match the market and what should be meaningfully different.
Name the customer outcome and business outcome.
Break that outcome into drivers the product can influence.
Select an opportunity supported by evidence.
Form a testable product bet.
Decide what evidence would justify continuing, changing, or stopping.

A driver tree makes this chain concrete. It creates a visible connection between roadmap work and measures such as activation, retention, expansion, and Net Revenue Retention. AI is useful here as a critic. Ask it to identify unsupported jumps, duplicated drivers, initiatives disguised as outcomes, and metrics the proposed product change cannot plausibly affect.

Keep outputs and outcomes separate. Shipping an AI onboarding assistant is an output. Changing a defined activation behavior for a defined customer segment is an outcome. The model can help rewrite output-oriented objectives, but it cannot choose a credible target without baseline data, business context, and an accountable owner.

Force a distinction between fact, inference, and assumption

Require the model to label every material statement as one of three things:

Observed: Directly supported by a supplied interview, event, support conversation, or experiment.
Inferred: A reasonable interpretation that combines observations but is not explicitly stated by the customer or proven by the data.
Assumed: Necessary for the recommendation to work but not yet supported by the supplied evidence.

This simple classification prevents an attractive narrative from laundering assumptions into facts. It also improves discovery planning: the most consequential assumption with the weakest evidence becomes a candidate for the next test.

A useful instruction is: Use only the supplied material. For every recommendation, show the observations that support it, the inference connecting those observations to the recommendation, the assumptions that remain, and the evidence that could disprove it. If support is missing, say that it is missing.
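The three labels become easier to enforce once they are a field on each claim rather than a convention in prose. The sketch below is one possible shape; the `Claim` structure, the 1-to-5 `impact` score, and the sample claims are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Support(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"
    ASSUMED = "assumed"


@dataclass
class Claim:
    text: str
    support: Support
    evidence_ids: list = field(default_factory=list)
    impact: int = 1  # hypothetical 1-5 score: how much the decision depends on it


def unsupported_observations(claims):
    """Claims labeled as observed that point to no evidence at all."""
    return [c for c in claims if c.support is Support.OBSERVED and not c.evidence_ids]


def next_test_candidate(claims):
    """Most consequential claim that is not directly observed; fewer links wins ties."""
    open_claims = [c for c in claims if c.support is not Support.OBSERVED]
    return max(open_claims, key=lambda c: (c.impact, -len(c.evidence_ids)), default=None)


claims = [
    Claim("Admins abandon setup at the permissions step", Support.OBSERVED,
          ["interview-03", "funnel-export"], impact=4),
    Claim("Admins lack a mental model of roles", Support.INFERRED, ["interview-03"], impact=3),
    Claim("A guided template would cut setup time", Support.ASSUMED, [], impact=5),
    Claim("Setup friction is the main churn driver", Support.OBSERVED, [], impact=5),
]

print([c.text for c in unsupported_observations(claims)])
print(next_test_candidate(claims).text)
```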

Build a controlled workflow from context to decision record

AI assistance becomes reliable when it is a workflow rather than a chat session. A chat encourages improvisation: context changes, instructions disappear, and nobody can reconstruct why an answer looked different the next time. A workflow gives each pass a defined input, output, and approval gate.

Ground the model in canonical product context

Start with a retrieval-first set of canonical documents. At minimum, that context should include the current vision, product strategy, target segments, value proposition, OKRs, metric definitions, analytics dashboards, relevant discovery evidence, decision history, and definition-of-done checks.

Canonical does not mean comprehensive. More context can make conflicts harder to notice. Give each item an owner, a freshness indicator, and an authority level. If an old positioning document conflicts with the approved strategy, the workflow should identify the conflict rather than silently averaging the two.

Include exclusions as well. Tell the model which documents are historical, which metrics are deprecated, which segments are out of scope, and which proposals have already been rejected. Without those boundaries, previously abandoned ideas can return as apparently new recommendations.
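A small sketch of how owner, freshness, authority, and exclusions might be attached to context documents so that conflicts are surfaced instead of blended. The `ContextDoc` fields, the integer authority scale, and the sample documents are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContextDoc:
    name: str
    topic: str
    statement: str
    owner: str
    updated: date
    authority: int            # higher wins; hypothetical scale
    historical: bool = False  # excluded material never reaches the model


def resolve(docs):
    """Return (canonical doc per topic, conflicts that a human should see)."""
    canonical, conflicts = {}, []
    for doc in docs:
        if doc.historical:
            continue
        current = canonical.get(doc.topic)
        if current is None:
            canonical[doc.topic] = doc
            continue
        if current.statement != doc.statement:
            conflicts.append((doc.topic, current.name, doc.name))
        if (doc.authority, doc.updated) > (current.authority, current.updated):
            canonical[doc.topic] = doc
    return canonical, conflicts


docs = [
    ContextDoc("positioning-2023", "positioning", "All-in-one suite",
               "marketing", date(2023, 3, 1), authority=1),
    ContextDoc("strategy-approved", "positioning", "Focused onboarding tool",
               "product", date(2025, 1, 15), authority=3),
    ContextDoc("old-roadmap", "roadmap", "Build a chatbot",
               "product", date(2022, 6, 1), authority=2, historical=True),
]

canonical, conflicts = resolve(docs)
print(canonical["positioning"].name, conflicts)
```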

Separate extraction, synthesis, challenge, and approval

Extract: Pull observations, customer language, events, metrics, decisions, and unresolved questions from the supplied material. Preserve links to the original evidence.
Synthesize: Group related observations and propose opportunity statements. Keep contradictory evidence visible.
Challenge: Look for alternative explanations, missing segments, weak causal claims, metric gaming, dependencies, and reasons the recommendation could fail.
Decide: Have the accountable product leader and relevant partners accept, modify, or reject the recommendation. Record the trade-off explicitly.
Publish: Store the decision, evidence, owner, expected outcome, guardrails, and next review trigger in the system the team already uses.

Do not combine these passes into one request for a final answer. Extraction should not quietly prioritize. Synthesis should not hide inconvenient evidence. A challenge pass should test a proposed direction without changing the original evidence set. The human approval gate should be visible, not implied by the fact that somebody copied the output into a roadmap.

Raw interviews, support threads, CRM records, and analytics exports can contain personal or confidential data. Do not paste them into an unapproved model. Minimize the data, remove identifiers that are not needed for the decision, use the governed environment approved by your organization, and retain only what the workflow requires. Privacy-by-design belongs at intake because redacting an output does not undo an inappropriate disclosure in the input.

For recurring workflows, add acceptance criteria and evaluation cases. A discovery synthesis evaluation might check whether every theme retains evidence links, whether contradictions survive summarization, and whether unsupported market-size claims are rejected. A strategy evaluation might check whether every initiative maps to an outcome driver and whether an output has been mislabeled as an objective. Re-run those checks when the model, prompt, context set, or output schema changes.
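Evaluation cases of this kind can start as plain structural checks that run on every synthesis output. In the sketch below the theme schema (`evidence_ids`, `had_contradictions`, `market_size_claim`, and so on) is hypothetical; the point is that each acceptance criterion becomes a check that either passes or names its failure.

```python
def evaluate_synthesis(themes):
    """Run simple structural checks over a discovery synthesis output."""
    failures = []
    for theme in themes:
        name = theme.get("name", "<unnamed>")
        if not theme.get("evidence_ids"):
            failures.append(f"{name}: no evidence links")
        if theme.get("had_contradictions") and not theme.get("contradictions"):
            failures.append(f"{name}: contradictions dropped in summary")
        if theme.get("market_size_claim") and not theme.get("market_size_source"):
            failures.append(f"{name}: unsupported market-size claim")
    return failures


themes = [
    {"name": "Permissions confusion", "evidence_ids": ["interview-03"],
     "had_contradictions": True, "contradictions": ["interview-07"]},
    {"name": "Export demand", "evidence_ids": [],
     "market_size_claim": "Most mid-market teams need this"},
]

for failure in evaluate_synthesis(themes):
    print(failure)
```

Because the checks are deterministic, the same cases can be re-run unchanged whenever the model, prompt, context set, or output schema changes.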

Use AI in discovery without laundering uncertainty

Discovery generates exactly the kind of material language models handle well: interview transcripts, support conversations, journey notes, behavioral patterns, and open-ended hypotheses. AI can reduce the time between collecting this material and discussing it. It cannot make a biased sample representative or turn a correlation into a cause.

Run synthesis as part of a weekly learning cadence that combines customer evidence with journey and behavioral analysis. Waiting for a large quarterly research readout increases the distance between observation and decision. Treating every new conversation as a roadmap mandate creates the opposite problem. A regular review gives the team a stable point at which evidence can accumulate, conflict, and change an existing belief.

A cluster is a lead, not a finding

Theme clustering is useful for navigation. It is not proof of importance. A frequent topic in support data may reflect product friction, a noisy customer segment, a documentation gap, or a recent incident. The model sees only the supplied dataset, not the market outside it.

Require each proposed opportunity to include:

The affected segment and the context in which the problem occurs.
The job the customer is trying to complete.
Links to supporting observations, including direct customer language where it preserves important nuance.
The observed count within the supplied dataset, clearly distinguished from prevalence in the customer base or market.
Behavioral evidence that supports or challenges the qualitative pattern.
The outcome driver the opportunity could influence.
Contradictory evidence and plausible alternative explanations.
The unanswered question that creates the greatest decision risk.
The next piece of evidence that would materially change the decision.

Then place the opportunity in an opportunity solution tree. Keep the opportunity separate from candidate solutions. If the branch says customers need an AI assistant, it has already collapsed a customer problem into a preferred implementation. Rewrite it in terms of the customer’s obstacle or desired progress, then generate multiple ways to address it.

At the weekly review, ask four practical questions: What did the team observe? Which belief changed? Which important assumption remains weakly supported? What evidence should be collected next? AI can prepare the evidence packet and show deltas from the prior review. The product trio should decide what the evidence means and whether it changes the opportunity being pursued.

Connect roadmap, experiment, launch, and learning

A strategy loses integrity when each delivery stage invents its own explanation. The roadmap promises one outcome, the experiment measures another, the launch emphasizes a feature, and the retrospective celebrates shipping. AI can help maintain the thread, but only if the same hypothesis and metric definitions travel with the work.

Decision layer	Useful AI assistance	Required human judgment	Artifact to preserve
Strategy	Check the chain from customer value to business result and expose unsupported jumps	Choose the segment, differentiation, outcome, and trade-offs	Strategy brief and driver tree
Discovery	Extract observations, cluster themes, retain contradictions, and draft opportunities	Interpret evidence and choose the next uncertainty to reduce	Evidence-linked opportunity record
Roadmap	Map candidate initiatives to drivers, surface dependencies, and prepare option comparisons	Allocate capacity and accept opportunity cost	Prioritization decision record
Experiment	Draft hypotheses, instrumentation, guardrails, edge cases, and analysis checks	Approve the test design, statistical assumptions, and decision rule	Experiment brief
Launch	Adapt release notes, in-product guidance, support material, and segment messaging	Approve claims, rollout risk, positioning, and readiness	Launch plan and approved message set
Learning	Summarize funnels, cohorts, retention patterns, qualitative feedback, and anomalies	Decide whether to continue, revise, expand, or stop	Learning review and updated decision

Make the roadmap show its reasoning

Ask AI to produce roadmap options, not a single supposedly objective ranking. Each option should show the outcome driver it targets, evidence strength, important dependencies, unresolved risk, stakeholder impact, and the work displaced by choosing it. A priority score can organize inputs, but it cannot resolve a strategic disagreement about which customer or outcome matters most.

Every roadmap item should answer: Why this customer problem, why now, what behavior should change, which business result should follow, and what observation would make the team reconsider? If the answer is merely that customers requested it or a competitor has it, the strategy is incomplete.

Make experiments decision-ready before they run

An AI-drafted experiment brief should contain a falsifiable hypothesis, eligible population, primary metric, guardrail metrics, instrumentation plan, exposure logic, expected mechanism, known confounders, and decision rule. For A/B testing, define the minimum detectable effect before the test runs, not while interpreting results. The value must be tied to a practically meaningful change and checked against baseline behavior and available traffic; a model cannot infer those constraints from a feature description.
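To make the check against baseline behavior and available traffic concrete, here is a minimal sketch using the standard normal-approximation sample-size formula for comparing two proportions. The function name, the default significance level and power, and the example inputs are assumptions for illustration, not values from this playbook.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test.

    baseline_rate: current rate of the primary metric, between 0 and 1.
    mde_abs: smallest absolute change worth detecting (0.02 means 2 points).
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10% baseline activation, 2-point MDE, 1,500 eligible users per day.
needed = sample_size_per_variant(0.10, 0.02)
days_of_traffic = ceil(2 * needed / 1500)
print(f"{needed} users per variant, roughly {days_of_traffic} days of traffic")
```

If the required traffic exceeds what the launch window can supply, that is a reason to revisit the MDE, the eligible population, or the test duration before the experiment starts.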

Instrumentation deserves its own review. Specify the event, properties, eligibility conditions, trigger, and expected sequence in the funnel. Use behavioral analytics to check that exposure and activation are measured consistently across variants. Feature flags can separate deployment from release, support a controlled ramp, and limit exposure while the team checks behavior.
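One way to check that exposure is measured consistently across variants is a sample ratio mismatch test. This is a common technique swapped in for illustration, not something the workflow above prescribes. The sketch assumes a two-variant test and an alert threshold of 0.001; both are assumptions.

```python
from math import erfc, sqrt


def sample_ratio_p_value(control_exposed: int, treatment_exposed: int,
                         expected_treatment_share: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) of observed vs. planned exposure split."""
    total = control_exposed + treatment_exposed
    if total == 0:
        raise ValueError("no exposure events recorded")
    if not 0 < expected_treatment_share < 1:
        raise ValueError("expected share must be strictly between 0 and 1")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_exposed - expected_treatment) ** 2 / expected_treatment
            + (control_exposed - expected_control) ** 2 / expected_control)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical exposure counts for a planned 50/50 split.
p_value = sample_ratio_p_value(control_exposed=5000, treatment_exposed=4500)
if p_value < 0.001:  # assumed alert threshold
    print("Exposure split deviates from the plan; review instrumentation first.")
```

A very small p-value suggests the exposure logic or event tracking differs between variants, which makes any activation comparison unreliable until it is explained.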

For an AI-powered product experience, add eval-driven checks alongside product metrics. Define the behavior the model should exhibit, edge cases it must handle, unacceptable outputs, privacy constraints, and regression cases. Product success cannot compensate for a model behavior that violates an explicit safety or trust requirement.

Keep launch language tied to the original value proposition

AI can adapt UX copy, product tours, tooltips, release notes, in-app guides, and support macros for different segments. Give every channel the same approved value proposition, capability boundaries, terminology, and claims. Otherwise, speed creates message drift: the release note promises an outcome the interface does not support, while the support macro describes a different workflow again.

After release, bring the original decision brief into the learning review. Examine the target cohort, funnel behavior, activation, retention, qualitative feedback, and guardrails. Do not ask only whether the feature was adopted. Ask whether the intended customer behavior changed, whether the assumed mechanism appears credible, and whether the business outcome remains a reasonable consequence.

Scale the workflow only when another person can audit it

Before expanding AI assistance across the product organization, hand one completed decision package to a colleague who was not part of the workflow. They should be able to identify the governing strategy, trace each important claim to evidence, see which assumptions remain open, understand the trade-off, and find the metric that will trigger the next decision.

If they cannot, do not solve the problem with a longer prompt. Repair the missing artifact, unclear ownership, broken evidence link, or inconsistent metric definition. That is where strategic reliability lives.

Start with one decision entering your next weekly discovery review. Build its evidence set, label observations and assumptions, run separate synthesis and challenge passes, and publish the human decision with its reversal signal. Once that chain survives review, reuse the workflow. The goal is not more AI-generated product work. It is a shorter, more inspectable path from customer evidence to a measurable strategic choice.

May 12, 2026

How to Run AI-Assisted Feature Launches That Drive Growth

You are days from releasing a feature. Engineering needs a rollout decision. Go-to-market teams need a clear promise. Support needs to know what could go wrong. Leadership wants to know whether the release changed customer behavior. Dropping an AI bot into the launch channel will not resolve those tensions. If the metrics, authority, and escalation rules are vague, the bot will only answer ambiguity faster.

The useful model is a closed loop: define the behavior you want to change, instrument exposure and value, operate the rollout from one shared channel, let agents handle repeatable retrieval and synthesis, and reserve consequential decisions for accountable people. Done well, AI reduces the coordination tax around a launch while making the growth decision more disciplined.

Define the growth decision before you automate the launch

A feature being available is an output. A customer reaching value is an outcome. Your launch plan has to connect the two before anyone writes an agent prompt or schedules a readout.

A durable growth plan translates the product North Star into activation and retention signals, then defines the minimum detectable effect before experimentation. The North Star provides direction, but it is often too distant to diagnose a new feature. A launch needs an earlier behavioral signal that can tell you whether eligible users encountered the feature, understood it, and reached its intended value.

Write a short launch contract with these fields:
1. Target user and moment: Name the user or account segment, the situation that makes the feature relevant, and any eligibility rules. A feature intended for a new administrator solving an initial setup problem should not be evaluated across every user in the product.
2. Behavioral hypothesis: State the current behavior, the desired behavior, and why the feature should cause the change. If the causal link cannot be written plainly, the team is not ready to interpret the launch data.
3. Measurement chain: Instrument eligibility, actual exposure, meaningful engagement, the activation action, and the downstream value event. If you record engagement but not exposure, low adoption could mean either that users ignored the feature or that they never saw it.
4. Primary signal: Choose the behavior closest to customer value that can mature within the launch window. Do not promote every available metric to equal status. That turns a decision into a search for whichever chart looks most favorable.
5. Guardrails: Name the operational and customer signals that can stop a rollout, such as degraded performance, errors, support burden, privacy concerns, or a harmful shift in another important behavior. Define the actual acceptable bounds in your internal contract before launch; do not negotiate them after a concerning result appears.
6. Minimum detectable effect: Decide what change would be large enough to matter to the product and business. This keeps the team from celebrating meaningless movement or waiting indefinitely for certainty that the planned test cannot provide.
7. Decision rule and authority: Specify what evidence permits a ramp, what requires a hold, what triggers investigation, and who can pause or roll back the feature. An agent may assemble the evidence, but it should not invent the rule during the incident.
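The seven fields above can be kept as a small structured record so an incomplete contract is caught before rollout. This is a minimal sketch; the class name, field names, and example values are hypothetical, and the check only looks for blank text.

```python
from dataclasses import dataclass, fields


@dataclass
class LaunchContract:
    # One attribute per field of the launch contract; names are illustrative.
    target_user_and_moment: str = ""
    behavioral_hypothesis: str = ""
    measurement_chain: str = ""
    primary_signal: str = ""
    guardrails: str = ""
    minimum_detectable_effect: str = ""
    decision_rule_and_authority: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


draft = LaunchContract(
    target_user_and_moment="New administrators during initial setup",
    primary_signal="Setup completed within the first session",
)
print("Still to write:", draft.missing_fields())
```

A blank-field check cannot judge whether a hypothesis or decision rule is any good. It only makes the gaps visible before anyone schedules a readout.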
The contract should also distinguish a growth signal from a health signal. Activation, conversion, or repeated use may tell you whether the feature is producing value. Latency, error rates, complaints, and anomalous segment behavior tell you whether it is safe to continue. A healthy system with an immature growth signal may justify holding the rollout. Broken instrumentation or a material guardrail breach calls for a different response.

This distinction prevents a common category error: treating an inconclusive experiment as a failed feature, or treating early adoption as proof of durable value. The launch decision should always answer the same question: given trustworthy exposure data, the primary signal, and the guardrails, should you ramp, hold, investigate, or roll back?

Turn the launch channel into a decision system

A launch channel becomes useful when it preserves context and decisions, not merely conversation. A practical setup is one channel named #launch-[feature], with its scope, service expectations, success metrics, dashboards, and rollout plan pinned. Product, engineering, data, support, and go-to-market stakeholders can then work from the same operational record.

Set up the channel before rollout begins:
1. Pin the launch contract: Include the hypothesis, eligible population, event definitions, primary signal, guardrails, rollout stages, owners, and links to live dashboards. A screenshot becomes stale; a governed dashboard remains inspectable.
2. Create stable work lanes: Use separate parent threads for metrics, incidents, enablement, and customer feedback. This gives each agent and human responder a predictable place to work without fragmenting the overall launch record.
3. Publish response expectations: State which questions the agent can answer immediately, which require a human owner, and how urgent operational issues are escalated. The agent should never make an urgent request look handled merely because it produced a fluent reply.
4. Keep a decision ledger: For every ramp, hold, pause, or rollback, record the timestamp, evidence considered, decision, rationale, approver, and next review point. This matters later when a stakeholder asks why exposure changed or when the team compares the result with the original hypothesis.
5. Require channel-visible handoffs: If a question moves to a data, engineering, or privacy owner, the agent should post the handoff and preserve the relevant query, definitions, filters, and context. Do not let direct messages become a shadow operating system.
Give every automated data answer a consistent shape:
- The direct answer, including the population and time window.
- The metric definition and denominator.
- The relevant cohort, segment, environment, and experiment variant.
- A link to the approved underlying data or dashboard.
- An as-of timestamp so readers know how fresh the result is.
- Any missing data, definition conflict, or limitation that changes interpretation.
- The named human owner when judgment or investigation is required.
An activation rate without its denominator, an environment, or a timestamp is not decision-grade evidence. A polished answer should not receive more trust than the data lineage beneath it. Make uncertainty visible instead of prompting the agent to conceal it behind a confident summary.
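A sketch of how that answer shape could be enforced before an agent posts a reply. The field names mirror the list above; the class, the helper, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Context every automated answer needs before it counts as decision-grade.
REQUIRED_CONTEXT = (
    "population", "time_window", "metric_definition", "denominator",
    "segment_and_variant", "source_link", "as_of",
)


@dataclass
class DataAnswer:
    answer: str
    population: Optional[str] = None
    time_window: Optional[str] = None
    metric_definition: Optional[str] = None
    denominator: Optional[str] = None
    segment_and_variant: Optional[str] = None
    source_link: Optional[str] = None
    as_of: Optional[str] = None
    limitations: Optional[str] = None  # may stay empty when nothing changes interpretation
    human_owner: Optional[str] = None  # set when judgment or investigation is required


def missing_context(reply: DataAnswer) -> list[str]:
    """List the required context fields the reply left blank."""
    return [name for name in REQUIRED_CONTEXT if not getattr(reply, name)]


reply = DataAnswer(
    answer="Activation rate is 31%",
    population="Eligible new administrators",
    time_window="First 24 hours after exposure",
    as_of="2026-05-12T16:00Z",
)
gaps = missing_context(reply)
if gaps:
    print("Not decision-grade yet; missing:", ", ".join(gaps))
```

In this example the reply lacks a metric definition, denominator, segment, and source link, so it would be held back or posted with the gaps stated.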

Give agents narrow jobs and humans explicit authority

The safest launch architecture separates three jobs: retrieving data, operating rollout controls, and interpreting evidence. Combining them in one broadly empowered agent creates unnecessary risk. It also makes failures harder to diagnose because you cannot tell whether the problem came from a bad query, a bad recommendation, or an unauthorized action.

Use a data agent for retrieval and first-pass synthesis

Connect the data agent only to approved sources and metric definitions. It can answer repeatable questions such as activation by cohort, conversion by segment, latency by region, exposure by variant, or the movement of a named guardrail. It should provide citations and timestamps, then route questions requiring nuance to an owner while keeping the context in the thread.

Write the escalation boundary into its operating instructions. Escalate when metric definitions conflict, required data is unavailable, a query touches restricted information, the request asks for a causal conclusion that descriptive data cannot establish, or the answer would materially change rollout. The best response in those cases is not a guess. It is a precise statement of what is missing and who must resolve it.

Keep the feature-flag agent read-only by default

A flag agent can safely expose status by environment, current rollout allocation, and change history. That alone removes many repetitive questions. Write access is different: an incorrect production change can expose an unready experience, expand an incident, or remove access unexpectedly.

When you permit flag mutations, require an explicit sequence:
1. The requester names the feature, environment, target population, requested action, and reason.
2. The agent shows the current flag state and summarizes the evidence relevant to the request.
3. The authorized approver confirms the exact change. Approval cannot be inferred from an emoji, an ambiguous reply, or the agent’s own recommendation.
4. The integration performs only the approved action through constrained permissions.
5. The agent posts the resulting state, timestamp, requester, approver, rationale, and change-history link.
Do not give the agent a broad production credential merely because the chat interface is convenient. Restrict its access by environment and role, preserve an audit trail, and keep a manual rollback path available to the responsible engineer.
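The approval step can be sketched as a strict equality check between the requested change and what an authorized person explicitly confirmed. The types, names, and example values below are hypothetical; a real integration would also enforce permissions in the flag system itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlagChange:
    feature: str
    environment: str
    target_population: str
    action: str
    reason: str


def approval_is_valid(requested: FlagChange,
                      confirmed: Optional[FlagChange],
                      approver: str,
                      authorized_approvers: frozenset) -> bool:
    """True only when an authorized human restated the exact requested change.

    `confirmed` is None when the reply was an emoji, an ambiguous message,
    or the agent's own recommendation rather than an explicit confirmation.
    """
    if approver not in authorized_approvers:
        return False
    return confirmed is not None and confirmed == requested


request = FlagChange("bulk-invite", "production", "beta cohort",
                     "set rollout to 25%", "T+24 hour readout met the decision rule")
approvers = frozenset({"eng-oncall-lead"})  # the agent is deliberately not listed

print(approval_is_valid(request, None, "eng-oncall-lead", approvers))     # ambiguous reply
print(approval_is_valid(request, request, "launch-agent", approvers))     # agent approving itself
print(approval_is_valid(request, request, "eng-oncall-lead", approvers))  # explicit confirmation
```

Because the dataclass is frozen and compared field by field, a confirmation for a different environment or population does not match the request.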

Use a readout agent to maintain the launch narrative

Scheduled summaries prevent the team from rebuilding the same analysis for every stakeholder. A useful default is to publish readouts at T+1 hour, T+24 hours, and T+7 days, while adapting the questions to the product’s actual usage cycle:
- T+1 hour: Confirm that exposure is occurring as intended. Check instrumentation, operational performance, obvious anomalies, and incident status. This checkpoint is primarily about measurement and safety, not declaring growth success.
- T+24 hours: Review adoption and activation by the planned cohorts, early conversion movement where applicable, support themes, and any uneven behavior across important segments.
- T+7 days: Evaluate experiment results that have had time to mature, retention or repeated-value signals when the product cycle makes them observable, significant outliers, and the follow-up work needed to harden or revise the experience.
These checkpoints are operating cadences, not guarantees of statistical maturity. A feature used on a longer cycle may not produce a meaningful retention signal by the final checkpoint. The readout should say that plainly instead of treating missing maturity as neutral evidence.
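Computing the default checkpoints from the launch time is simple enough to automate. A minimal sketch, assuming a UTC launch timestamp; the launch time shown is hypothetical, and the offsets should be adapted to the product's usage cycle.

```python
from datetime import datetime, timedelta, timezone

# Default readout cadence; change the offsets to fit the product's usage cycle.
READOUT_OFFSETS = {
    "T+1 hour": timedelta(hours=1),
    "T+24 hours": timedelta(hours=24),
    "T+7 days": timedelta(days=7),
}


def readout_schedule(launch_time: datetime) -> dict:
    """Map each checkpoint label to the time its readout is due."""
    return {label: launch_time + offset for label, offset in READOUT_OFFSETS.items()}


launch = datetime(2026, 5, 12, 15, 0, tzinfo=timezone.utc)  # hypothetical launch time
for label, due in readout_schedule(launch).items():
    print(f"{label}: {due.isoformat()}")
```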

Every readout should end with a decision or an explicit statement that no decision is yet warranted. It should also name the evidence still needed, the owner, and the next review point. A summary that lists charts but does not clarify the decision state creates more reading without reducing uncertainty.

Make the accountability map visible
- Product owns the behavioral hypothesis, the primary growth signal, and the recommendation to ramp, hold, or change direction.
- Engineering owns operational health, flag implementation, incident response, and safe rollback execution.
- Data owns metric definitions, instrumentation validity, experiment design, and interpretation limits.
- Support and go-to-market owners contribute customer feedback, readiness concerns, enablement status, and communication needs.
- Agents retrieve, summarize, route, and perform narrowly preauthorized steps. They do not approve their own consequential recommendations.
The governance layer is part of the product, not a final compliance check. Apply role-based access, protect personally identifiable information, require source citations, and retain transparent logs. Then monitor response accuracy, deflection, and time-to-answer through Agent Analytics. Deflection alone is a poor success metric: a confidently wrong response may reduce human questions while increasing decision risk. Review incorrect answers, unnecessary escalations, missed escalations, and stale data as carefully as response speed.

Run the rollout as a sequence of evidence gates

A feature flag is not merely a switch. It lets you separate deployment from exposure and turn a large release decision into a sequence of smaller, inspectable decisions. The appropriate rollout stages depend on the feature’s operational, privacy, and customer risk, so define them in advance rather than copying a universal percentage.

Use this operating sequence:
1. Preflight the measurement: Verify eligibility, exposure, activation, value, and guardrail events in the intended environment. Confirm that dashboards use the launch contract’s definitions and that the agent can retrieve the same governed numbers.
2. Release to the defined cohort: Use the flag to control who can receive the experience. Confirm actual exposure before interpreting engagement. Eligibility and exposure are different facts.
3. Inspect evidence at the scheduled gates: Start with instrumentation and safety, then move to activation, conversion, retention, and other downstream value signals as they become observable. Review the preselected segments before exploring unexpected cuts of the data.
4. Choose a named decision state: Ramp, hold, investigate, pause, or roll back. Record the evidence and rationale. Avoid vague states such as “looking good,” because they do not tell engineering what to do or stakeholders what has been decided.
5. Feed the learning back into the journey: Update onboarding, in-product guidance, targeting, positioning, or the feature itself based on the observed friction. A winning test becomes a growth mechanism only when the trigger, experience, and value-producing behavior can be repeated reliably.
Use a clear decision ladder:
- Ramp when measurement is trustworthy, guardrails remain inside the pre-agreed bounds, the evidence meets the decision rule, and customer-facing teams are ready for broader exposure.
- Hold when the system is healthy but the outcome has not matured enough to support a decision. State what evidence is pending and when it can reasonably be reviewed.
- Investigate when an anomaly, segment divergence, definition conflict, or instrumentation gap makes the aggregate result unreliable.
- Pause when continued exposure could obscure an incident, contaminate the test, or expand a customer problem while the team diagnoses it.
- Roll back when a material operational, privacy, safety, or customer guardrail crosses the boundary defined in the launch contract. Do not wait for the primary growth metric to mature before acting on a serious downside.
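The ladder can be written as an ordered check so the most serious condition always wins. This is a sketch under assumptions: the input flags are hypothetical summaries a human owner would set, and the precedence order (roll back, pause, investigate, hold, ramp) is one reasonable reading of the ladder, not a rule stated in the launch contract.

```python
from dataclasses import dataclass


@dataclass
class GateEvidence:
    guardrail_breached: bool       # a material guardrail crossed its contract boundary
    exposure_adds_risk: bool       # continued exposure could worsen or obscure a problem
    data_unreliable: bool          # anomaly, segment divergence, or instrumentation gap
    outcome_matured: bool          # the primary signal has had time to mature
    decision_rule_met: bool        # evidence satisfies the pre-agreed rule
    teams_ready: bool              # customer-facing teams are ready for more exposure


def decision_state(evidence: GateEvidence) -> str:
    """Return one named decision state, checking the most serious conditions first."""
    if evidence.guardrail_breached:
        return "roll back"
    if evidence.exposure_adds_risk:
        return "pause"
    if evidence.data_unreliable:
        return "investigate"
    if evidence.outcome_matured and evidence.decision_rule_met and evidence.teams_ready:
        return "ramp"
    # Assumption: anything short of the ramp criteria stays at hold for the owner to review.
    return "hold"


healthy_but_early = GateEvidence(False, False, False, False, False, True)
print(decision_state(healthy_but_early))
```

The function only names the state. The accountable owner still records the evidence and rationale and approves the change.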
If the feature itself uses AI, measure the product experience separately from the operational agents supporting its launch. AI can provide intelligent nudges, next-best actions, and adaptive experiences while applying privacy-by-design and strong data governance. That creates at least four distinct questions: Was the user eligible? Was the AI experience delivered? Did the user engage with it? Did that engagement lead to value without violating a guardrail?

Logging only the final conversion makes those questions impossible to separate. A delivery problem, poor recommendation, confusing interaction, and weak value proposition can all produce the same downstream result. Preserve the path from eligibility through value, including the experience variant the user received. If targeting or adaptive behavior changes during the test, log the change and account for it in the interpretation.
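To show why the intermediate stages matter, the sketch below computes step-by-step rates from eligibility through value and reports the weakest step. The stage names and counts are hypothetical, and the counts are assumed to come from a single experience variant.

```python
from typing import Optional

# Ordered stages of the path from eligibility to value (names are illustrative).
STAGES = ("eligible", "delivered", "engaged", "reached_value")


def step_rates(counts: dict) -> dict:
    """Rate of each stage relative to the stage before it; None when the base is zero."""
    rates = {}
    for previous, current in zip(STAGES, STAGES[1:]):
        base = counts[previous]
        rates[f"{previous} -> {current}"] = counts[current] / base if base else None
    return rates


def weakest_step(counts: dict) -> Optional[str]:
    """Name the transition with the lowest measurable rate."""
    measurable = {step: rate for step, rate in step_rates(counts).items() if rate is not None}
    return min(measurable, key=measurable.get) if measurable else None


# Hypothetical counts for one variant.
variant_counts = {"eligible": 10000, "delivered": 4000, "engaged": 3000, "reached_value": 1500}
print(step_rates(variant_counts))
print("Weakest step:", weakest_step(variant_counts))
```

Here the largest loss sits between eligibility and delivery, which points to a delivery or targeting problem and not to a weak value proposition. The final conversion count alone would hide that.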

Do not confuse high initial use with a durable growth loop. Novelty can produce engagement without retained value. Look for the sequence that matters to your product: activation, repeated value, and then the relevant expansion, collaboration, or retention behavior. If the product has no natural invitation or sharing mechanic, do not force a viral story onto it. Build the loop around the behavior customers already have a reason to repeat.

Key takeaways
- Start with a launch contract that names the user, behavioral hypothesis, measurement chain, primary signal, guardrails, minimum detectable effect, decision rules, and accountable owners.
- Use one launch channel as the shared operational record, but separate metrics, incidents, enablement, and feedback into stable threads.
- Split agent responsibilities across data retrieval, flag operations, and scheduled readouts. Keep consequential actions approval-gated and auditable.
- Treat T+1 hour, T+24 hours, and T+7 days as decision checkpoints, not automatic declarations of success.
- Use feature flags to move through evidence gates. Ramp, hold, investigate, pause, or roll back according to rules written before the data arrives.
- Measure AI-powered experiences from eligibility through delivery, engagement, value, and guardrails so you can diagnose why growth did or did not move.
For your next launch, begin with a narrow operating slice: a completed launch contract, a structured channel, a data agent limited to approved queries, read-only flag visibility, and scheduled readouts. Review every wrong answer, escalation, and decision after the rollout. Expand the agent’s authority only when the evidence shows that the control system is trustworthy.

References
- Amplitude – My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts
- Amplitude – How I Orchestrate Growth & AI at Amplitude to Ignite Viral Product Engagement

May 11, 2026

How a Digital Analytics Visionary Shapes My Product Strategy for Growth, Retention & Monetization

Data has always been my compass for building products that customers love and businesses depend on. Few descriptions distill that imperative as crisply as the one below, and it continues to inform how I prioritize, experiment, and scale outcomes across the roadmap.

Krista is a digital analytics leader, product strategist, and industry evangelist. She helps businesses use data to drive growth, retention, and monetization.

That mandate mirrors how I run product: leverage behavioral analytics to uncover patterns, translate those insights into hypotheses, and validate them through rigorous A/B testing. I start by instrumenting the user journey end to end, then use cohort analysis, funnel diagnostics, and retention analysis to pinpoint where activation, engagement, or monetization is stalling. From there, I map driver trees to connect inputs (feature adoption, time-to-value, onboarding friction) to outputs (retention, conversion, revenue), so every experiment has a clear line of sight to business impact.

On experimentation, I hold the bar high: define the minimum detectable effect (MDE) up front, ensure clean experiment design, and size samples so the test has enough power to detect that effect. I combine Amplitude analytics with qualitative signals from continuous discovery to prioritize tests that move the needle, not just the vanity metrics. When a variant wins, I don’t stop at the lift; I track downstream effects on user activation, long-term retention, and monetization, ensuring we’re compounding gains rather than optimizing in silos.

For product-led growth, I focus on the moments that matter most: first-value, aha, and habit formation. Journey mapping helps me identify the shortest, clearest path to value, while targeted in-app experiences and contextual nudges accelerate activation without adding friction. Every iteration feeds a learning loop—measure, learn, and ship—so we can pursue step-change outcomes, not incremental tweaks.

Ultimately, the craft is in translating analytics into action. When teams can trace a feature idea to a specific behavioral pattern, test it with a well-powered A/B experiment, and observe durable improvements in retention and revenue, momentum takes care of itself. That’s how I operationalize data to deliver growth, retention, and monetization at scale.

Inspired by this post on Amplitude – Best Practices.

May 11, 2026

How to Build a SaaS Retention and Expansion System

Your team can explain churn after it happens. The harder problem is seeing a customer change direction early enough to do something useful, then knowing whether the intervention actually changed the outcome.

You do not solve that problem with another health dashboard. You solve it with a closed-loop operating system: define how customers progress toward value, detect when that progression changes, choose the right intervention, and measure the incremental result. Built well, the same system protects retention and identifies credible expansion opportunities.

Treat retention and expansion as one value-progression system

Retention and expansion are often split across teams, tools, and meetings. Customer Success monitors renewal risk. Product watches activation and feature adoption. Sales looks for additional revenue. Support handles whatever breaks. Marketing runs lifecycle campaigns. Each function can be busy while the customer still receives a fragmented experience.

The better organizing principle is customer value progression. A retained customer continues receiving enough value to justify the relationship. An expanding customer is ready to receive that value across more users, workflows, usage, or capabilities. The two outcomes sit on the same path.

That changes the question from “Which accounts might churn?” to “What value state is this account in, what evidence supports that assessment, and what should happen next?”

1. Define the state. Translate product, support, CRM, and commercial signals into a recognizable customer condition.
2. Make a decision. Select an intervention, assign a human owner, or deliberately take no action.
3. Act in context. Use the channel and message appropriate to the customer’s current job, friction, and relationship.
4. Observe the response. Track whether behavior, value attainment, or commercial outcomes changed.
5. Learn and revise. Keep playbooks that produce incremental value, change weak ones, and retire harmful or noisy ones.

This loop is the system. A prediction model, lifecycle tool, or customer-success platform is only one component inside it.

Key takeaways

- Model movement toward and away from value, not churn as a single binary event.
- Keep the account state, its underlying drivers, and the recommended action visible together.
- Use automated journeys for clear, low-complexity situations and human help when diagnosis or commercial context matters.
- Separate risk recovery from expansion outreach, even when both use the same underlying data.
- Measure incremental outcomes with an eligible comparison group or holdout whenever possible.
- Start with one segment and one customer state before adding more data, models, and playbooks.

Instrument customer states, not a pile of events

A login is not value. A feature click is not adoption. A support ticket is not necessarily risk. Raw events become useful only when you interpret them in the context of a customer journey.

Begin with a small set of decisions your system must support. Common starting use cases include an activation funnel, onboarding drop-off, and adoption of the product’s core capability. A lightweight tracking plan, consistent event names, and explicit initial use cases give Product, Data, Growth, and Customer Success a shared language for those decisions.

Define customer states before designing a score. The exact evidence will differ by product, segment, pricing model, and maturity, but the state taxonomy can remain understandable:

Customer state	Evidence to define for your product	Decision the state should enable
Onboarding stalled	A required setup or first-value milestone was started but not completed, or progress stopped relative to the expected journey	Remove a specific blocker before sending broader education
Activated but shallow	The account reached initial value, but usage remains concentrated in one person, workflow, or capability	Help the account repeat and distribute the successful behavior
Healthy and deepening	Core outcomes recur, usage is stable or growing, and value is spreading through the intended scope	Reinforce success and watch for an adjacent need
Contracting	Relevant usage, active participation, or workflow breadth is declining relative to the account’s own baseline	Diagnose whether the cause is friction, seasonality, organizational change, or reduced need
Expansion ready	The current scope is producing value and the account has an evidenced adjacent need, capacity constraint, or unserved group	Offer a relevant next step without disrupting existing value

Do not assign universal activity thresholds merely because they are easy to query. The same number of weekly users can mean strong adoption for a small account and serious contraction for a larger one. Compare an account with its expected journey, purchased scope, peer segment, and prior behavior.
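A small sketch of why the same raw count means different things for different accounts. It compares weekly users with purchased seats and with the account's own baseline. The function names and numbers are hypothetical, and seats stand in for purchased scope only as an assumption.

```python
def seat_coverage(weekly_active_users: int, purchased_seats: int) -> float:
    """Share of purchased seats that were active this week."""
    if purchased_seats <= 0:
        raise ValueError("purchased_seats must be positive")
    return weekly_active_users / purchased_seats


def change_vs_baseline(current: float, baseline: float) -> float:
    """Relative change against the account's own baseline (-0.5 means half the baseline)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


# The same 20 weekly users in two hypothetical accounts.
small_account = seat_coverage(weekly_active_users=20, purchased_seats=25)
large_account = seat_coverage(weekly_active_users=20, purchased_seats=400)
print(f"Small account coverage: {small_account:.0%}")
print(f"Large account coverage: {large_account:.0%}")

# The large account averaged 60 weekly users over its prior baseline period.
print(f"Large account vs. baseline: {change_vs_baseline(20, 60):+.0%}")
```

A fixed threshold of 20 weekly users would treat both accounts the same, even though one is broadly adopted and the other has contracted sharply.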

Your data model also needs to distinguish a person from an account. A power user can make an account look healthy while every other intended user disengages. Conversely, a stable automated workflow may create value without frequent logins. Track the unit at which value is delivered, then roll that evidence up to the commercial account.

For each meaningful behavioral event, capture enough context to reconstruct what happened: account identity, user identity where relevant, event name, timestamp, source, product object or workflow, plan or entitlement context, and outcome. Resolve duplicate identities before calculating breadth or frequency. Missing data must remain distinguishable from negative behavior; an integration outage is not customer disengagement.

Behavior alone is incomplete. Useful retention systems can combine product usage, CRM context, support interactions, billing health, and qualitative session evidence. Each signal should have an owner, a freshness expectation, and a clear meaning. If nobody can explain how a field affects a decision, it does not yet belong in the model.

Turn signals into explainable risk and opportunity decisions

A single health score is convenient for sorting accounts. It is poor guidance for action. Two accounts can receive the same score for completely different reasons: one failed to finish onboarding, while another lost active users after months of successful use. They should not receive the same message or playbook.

Keep a compact score if it helps prioritize work, but expose the dimensions beneath it:

- Value attainment: Has the account completed the behaviors associated with its intended outcome?
- Depth: Is the core workflow repeated enough to become part of normal work?
- Breadth: Is value distributed across the intended users, teams, use cases, or product areas?
- Trajectory: Is relevant behavior growing, stable, stalled, or declining against an appropriate baseline?
- Friction: Are unresolved issues, repeated failures, poor outcomes, or setup barriers preventing progress?
- Commercial health: Is the account approaching a renewal, reducing scope, encountering billing trouble, or operating near a legitimate capacity boundary?

Every flagged account should carry reason codes in plain language. A useful record says that core workflow usage declined from the account baseline, active participation narrowed, the change began after an unresolved issue, and the evidence was refreshed recently. A label such as “health score: 42” does not tell an owner what to do.
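Reason codes can be generated from the same signals that feed a score. The sketch below is illustrative only: the signal names and the 20% decline threshold are assumptions, and real rules would be defined per segment.

```python
from dataclasses import dataclass
from typing import Optional

DECLINE_THRESHOLD = 0.20  # assumed: flag declines of 20% or more against baseline


@dataclass
class AccountSignals:
    core_usage_change: float              # relative change vs. account baseline (-0.35 = 35% lower)
    active_user_change: float             # relative change in active participants
    unresolved_issue_days: Optional[int]  # None when no issue is open
    evidence_age_days: int                # days since the data was refreshed


def reason_codes(signals: AccountSignals) -> list[str]:
    """Translate signals into plain-language reasons an owner can act on."""
    reasons = []
    if signals.core_usage_change <= -DECLINE_THRESHOLD:
        reasons.append(f"Core workflow usage is {abs(signals.core_usage_change):.0%} "
                       "below the account baseline.")
    if signals.active_user_change <= -DECLINE_THRESHOLD:
        reasons.append(f"Active participation narrowed by {abs(signals.active_user_change):.0%}.")
    if signals.unresolved_issue_days is not None:
        reasons.append(f"An issue has been unresolved for {signals.unresolved_issue_days} days.")
    reasons.append(f"Evidence was refreshed {signals.evidence_age_days} day(s) ago.")
    return reasons


flagged = AccountSignals(core_usage_change=-0.35, active_user_change=-0.25,
                         unresolved_issue_days=12, evidence_age_days=1)
for reason in reason_codes(flagged):
    print("-", reason)
```

Each reason maps to a different follow-up, which a single number cannot do.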

Also show what would disconfirm the assessment. If a supposed contraction signal is seasonal, expected, or caused by a tracking change, the owner needs a way to correct it. That feedback should improve the rule or model instead of disappearing into private notes.

My default is to begin with transparent rules and cohort comparisons. Add machine learning when the volume, complexity, and demonstrated lift justify it. A black-box score creates false precision if Product cannot trace it to behavior and Customer Success does not trust it enough to act. Clear drivers, cohort-level analysis, and explainable scoring are operational requirements, not cosmetic reporting features.

AI is useful for classifying issue themes, summarizing account context, detecting unusual changes, ranking eligible accounts, and recommending a playbook. It should not silently make ambiguous commercial commitments or send sensitive outreach to a strategically important account without the controls your business requires. Preserve the underlying evidence, model or rule version, chosen action, human override, and eventual outcome so the decision can be audited.

Apply the same discipline to governance. Limit access to account data by role, record consequential changes, define how customer data may be used, and evaluate retention tooling for privacy, implementation burden, and maintainability as well as predictive performance. A model that cannot be governed will eventually become difficult to trust or operate.

Match each customer state to a bounded playbook

A signal without an intervention is reporting. An intervention without eligibility rules is noise. Build a small library of bounded playbooks, each designed for one customer condition and one desired state change.

Every playbook should specify:

The eligible segment and state.
The evidence that triggers entry.
Conditions that suppress outreach, such as an unresolved incident, a recent human conversation, an opt-out, or an active commercial negotiation.
The customer problem and value hypothesis.
The channel, message, and accountable owner.
The action you want the customer to take.
The success event and business outcome.
The guardrails that reveal annoyance, added support burden, or unintended contraction.
The exit condition, expiration rule, and fallback if the customer does not respond.
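
The checklist above can be captured as a small data structure so that eligibility and suppression are evaluated the same way every time. This is a sketch under assumptions: the field names, state labels, and condition codes are hypothetical, and a real system would carry more of the template (channel, message, guardrails) than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    name: str
    eligible_states: set[str]
    trigger_evidence: set[str]                       # evidence codes that must all be present
    suppress_if: set[str] = field(default_factory=set)
    owner: str = "unassigned"
    success_event: str = ""
    expires_after_days: int = 30                     # hypothetical default expiration


def should_enter(playbook: Playbook, state: str, evidence: set[str],
                 conditions: set[str], days_since_trigger: int) -> tuple[bool, str]:
    """Decide whether an account may enter the playbook, and say why."""
    if state not in playbook.eligible_states:
        return False, "state not eligible"
    blocking = playbook.suppress_if & conditions
    if blocking:
        return False, "suppressed: " + ", ".join(sorted(blocking))
    missing = playbook.trigger_evidence - evidence
    if missing:
        return False, "missing evidence: " + ", ".join(sorted(missing))
    if days_since_trigger > playbook.expires_after_days:
        return False, "trigger expired"
    return True, "eligible"


rescue = Playbook(
    name="onboarding_rescue",
    eligible_states={"onboarding_stalled"},
    trigger_evidence={"milestone_missing"},
    suppress_if={"open_incident", "recent_human_conversation", "opt_out"},
    owner="customer_success",
    success_event="milestone_completed",
)
print(should_enter(rescue, "onboarding_stalled", {"milestone_missing"}, {"open_incident"}, 3))
```

Returning the reason alongside the decision keeps suppressed accounts explainable, which matters when someone asks why a customer did not receive outreach.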

That template forces useful distinctions between common plays:

Onboarding rescue. Identify the missing value milestone and address that obstacle directly. Use an in-product guide for a clear, contextual step. Route technical ambiguity or multi-step setup to a person who can diagnose it.
Shallow-adoption expansion. Help an already successful user repeat the core workflow or bring the right colleagues into it. Do not pitch additional commercial scope before the existing scope is working.
Friction recovery. Connect repeated errors, unresolved issues, or failed outcomes to the affected workflow. Fixing the underlying problem takes priority over a generic educational campaign.
Contraction diagnosis. Ask why behavior changed before prescribing a solution. Declining activity may reflect product friction, a completed project, seasonality, team turnover, or a genuine loss of need.
Consultative expansion. Trigger outreach after demonstrated success and an evidenced adjacent need. Frame the next step around the customer’s outcome, not an arbitrary quota or a feature list.

Channel choice matters. In-app guidance works when the next step is clear and the customer is already in the relevant context. Lifecycle messaging can reinforce an understood behavior. Customer Success or Sales should handle relationship-heavy and commercial situations. Support is especially valuable when the opportunity requires product depth, diagnosis, or credibility earned through solving a real problem.

AI automation can give support teams capacity for that higher-context work, but capacity alone does not create a consultative motion. One AI-enabled support transformation started with a small volunteer cohort inside an organization of more than 100 people and grew to roughly 16 participants across regions within a year. Early use cases focused on trial guidance, optimization for mature customers, and accounts that appeared ready for broader adoption.

The implementation lesson is more important than the org chart: protect core support quality, recruit people who want to test the motion, and train for curiosity, commercial awareness, and broader customer context. Product knowledge is necessary, but consultative work also requires the restraint to ask another question before recommending an answer.

Keep automation reversible. If the account’s state changes, a human begins working the case, or new evidence contradicts the trigger, stop the sequence. A retention system should respond to current customer reality, not continue executing an outdated classification.

Prove incremental impact and build an operating rhythm

The easiest measurement mistake is comparing customers who accepted help with customers who ignored it. In a six-month comparison, accounts that engaged with proactive support grew roughly twice as fast in both usage and expansion as accounts that were contacted but did not respond. That is a meaningful operational signal, but it is not the same as randomized causal proof: customers who engage may already be more motivated, better staffed, or more likely to grow.

When the stakes and volume permit, define the eligible population first and assign eligible accounts to treatment and holdout groups. Randomize at the account level when account-level outcomes and cross-user spillover matter. Measure all assigned accounts in their assigned group, including customers who never engage with the intervention. That estimates the effect of offering the playbook, not merely the characteristics of people who accepted it.
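
One common way to implement account-level assignment is to hash the account identifier, so every user in an account lands in the same group and the assignment is reproducible. The sketch below assumes that approach; the function names, the experiment label, and the 20% holdout share are illustrative, and a hash is a stand-in for whatever randomization your experimentation tooling provides.

```python
import hashlib


def assign_arm(account_id: str, experiment: str, holdout_share: float = 0.2) -> str:
    """Deterministically place an account in 'treatment' or 'holdout'."""
    digest = hashlib.sha256(f"{experiment}:{account_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform value in [0, 1)
    return "holdout" if bucket < holdout_share else "treatment"


def outcome_rate_by_arm(assigned: dict[str, str], converted: set[str]) -> dict[str, float]:
    """Rate over ALL assigned accounts, including those that never engaged."""
    counts: dict[str, tuple[int, int]] = {}
    for account_id, arm in assigned.items():
        total, hits = counts.get(arm, (0, 0))
        counts[arm] = (total + 1, hits + (account_id in converted))
    return {arm: hits / total for arm, (total, hits) in counts.items()}


assigned = {f"acct-{i}": assign_arm(f"acct-{i}", "onboarding-rescue-v1") for i in range(1000)}
print(sum(arm == "holdout" for arm in assigned.values()), "of 1000 accounts held out")
```

Including the experiment name in the hash input keeps assignments independent across playbooks, so the same accounts are not always the ones held out.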

Before launch, document:

The customer state and segment being tested.
The intervention unit: user, workspace, account, or another value-bearing entity.
The primary outcome the playbook is meant to change.
The observation window, chosen to match the expected behavior and commercial cycle.
The minimum detectable effect (MDE) that would make the effort worth acting on.
Leading indicators that show whether customers moved through the intended mechanism.
Guardrails that would stop or narrow the rollout.
The decision rule for scaling, revising, or retiring the playbook.

If random assignment is not practical, use the strongest comparison your context allows. At minimum, compare accounts that were eligible at the same time and stratify by segment, starting health, lifecycle stage, and prior trajectory. Label the result as observational. Do not turn a directional association into a causal revenue claim.

Use a measurement stack rather than one success metric:

Mechanism metrics: completion of the missing milestone, restored core behavior, increased workflow breadth, or resolution of the triggering friction.
Intervention metrics: eligibility, delivery, response, acceptance, completion, time to action, and exit reason.
Commercial outcomes: renewal, churn, contraction, expansion, and Net Revenue Retention (NRR).
Guardrails: opt-outs, complaints, avoidable support demand, negative product outcomes, and harm to other customer journeys.

A common NRR calculation is starting recurring revenue plus expansion, minus contraction and churn, divided by starting recurring revenue. Document your exact definition and keep it stable. Report gross retention, contraction, and expansion beside NRR because strong expansion can conceal losses elsewhere in the customer base.
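
As a worked example of that calculation, the sketch below computes NRR next to gross retention. The gross retention formula (starting revenue minus contraction and churn, divided by starting revenue) is a commonly used definition that I am assuming here, and the revenue figures are invented for illustration.

```python
def retention_metrics(starting: float, expansion: float,
                      contraction: float, churn: float) -> dict[str, float]:
    """Compute NRR and gross retention for one cohort over one period."""
    if starting <= 0:
        raise ValueError("starting recurring revenue must be positive")
    return {
        "nrr": (starting + expansion - contraction - churn) / starting,
        # Assumed definition: gross retention ignores expansion entirely.
        "gross_retention": (starting - contraction - churn) / starting,
        "expansion_rate": expansion / starting,
        "contraction_rate": contraction / starting,
        "churn_rate": churn / starting,
    }


# Illustrative cohort: 1,000 starting, 200 expansion, 50 contraction, 100 churn.
metrics = retention_metrics(1000, 200, 50, 100)
print(f"NRR {metrics['nrr']:.0%}, gross retention {metrics['gross_retention']:.0%}")
```

In this invented cohort NRR is above 100% while gross retention is 85%, which is exactly the case where expansion conceals losses.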

The operating review should end in decisions, not dashboard commentary. Inspect data quality first. Then review movement between customer states, playbook reach and outcomes, experiment evidence, guardrail breaches, and customer feedback. For every change, record an owner, the rule or playbook being changed, the expected effect, and when the evidence will be reviewed.

Ownership must follow the loop. Product can define value milestones and product interventions. Data can maintain instrumentation and analytical quality. Support and Customer Success can diagnose context and execute human plays. Growth can operate scaled journeys. Revenue Operations can maintain CRM and commercial definitions. One accountable leader still needs to own whether the complete system produces better customer and business outcomes.

Do not begin by buying a prediction platform or modeling every possible customer state. Choose one segment where a meaningful signal appears early enough to act. Define the state, instrument the evidence, create one bounded playbook, and preserve a credible comparison group. Add complexity only after that loop changes an outcome you care about. That is how retention stops being a renewal rescue exercise and becomes a product operating capability.

May 8, 2026

Outcome-Based Pricing That Delivers: Pay $10 Only for Qualified Leads with Fin for Sales

Our outcome-based pricing model hinges on one principle: you pay when Fin delivers value.

As Fin takes on new roles, that principle doesn’t change, but the definition of value does.

Fin for Sales qualifies leads, engages prospects, and routes high-intent buyers to your sales team. The value it creates isn’t a resolved query, but a pipeline of qualified opportunities. So we price accordingly: $10 per qualified lead. And you, the customer, define what “qualified” means, not Fin.

This is the first outcome-based pricing model for an AI Agent for sales. Here’s why I believe it’s the right approach and how I’ve seen it change the way teams think about SaaS pricing and ROI.

Over the years, I’ve learned that the fastest way to earn trust with sales and finance leaders is to align pricing with outcomes they actually report on. The core finding from our research was unambiguous: zero buyers preferred paying for activity. They wanted to pay for results.

That insight shaped how we priced Fin for its service role, $0.99 per resolution, where a resolution means the customer’s issue is fully solved without human intervention. More recently, we evolved that model to outcomes, reflecting the broader ways Fin delivers value across complex workflows. We believe pricing should be aligned with value delivery, and the vendor should carry risk when the product doesn’t perform. In sales, the best unit of value is pipeline.

Most sales teams today are overwhelmed by leads. Early in my career, I watched reps spend hours chasing form fills that looked promising but went nowhere. That experience cemented a lesson I still use: volume is vanity; qualification is sanity.

Getting the right opportunities to your sales team promptly is what makes the difference. When a prospect visits your site, engages with Fin, answers qualifying questions, and is routed to a sales rep, Fin has determined that the opportunity is worth your team’s time. That determination is the value it delivers.

Charging per conversation would penalize businesses for every curious visitor who asks a question but isn’t a buyer. And charging per token, well, that’s always been a model that protects the vendor, not the customer.

We needed a metric that captures the actual value Fin creates in a sales context: qualified leads.

The purest version of outcome-based pricing for Fin’s sales role would be a percentage of closed revenue. Fin qualifies the lead, a rep closes the deal, and we take a cut. On paper, it looks elegant; in practice, I found it breaks down for two reasons that matter to operators.

First, attribution. Between the moment Fin qualifies a lead and the moment a deal closes, dozens of things can impact the final result. The quality of human-led demos can differ, products can have outages, prospects’ budgets can get cut. Tying Fin’s price to the final outcome holds it accountable for variables entirely outside its control.

Second, measurement. To track closed revenue, we’d need deep integration into every customer’s CRM, tracking each opportunity from qualification through to close. That’s a significant implementation burden that slows time to value, which is the opposite of what we want.

So we asked: what’s the most honest proxy for the value Fin delivers, where Fin is clearly the one creating it?

A qualified lead is that proxy. It represents the moment Fin has done its job. It has engaged the prospect, gathered the relevant information, evaluated them against your criteria, and determined they’re qualified. Everything up to that point is Fin’s work. Everything after it is the rep’s. At $10 per qualified lead, the pricing reflects this boundary.

There are two key components to how this pricing model works.

First, the customer defines success. With Fin’s sales role, the customer sets their own qualification criteria based on their business context. A company with high average contract values might set a lower bar because they can’t afford to miss anyone. A company where rep time is scarce and deal sizes are smaller might set a much higher bar, filtering aggressively to only surface the most promising prospects. The criteria flex to match the business.

Second, the economics are different by design. As a Customer Agent, Fin can switch between roles such as sales and service. So if you’ve deployed Fin for Sales, it can still handle support queries, such as a prospect asking a product question. Those queries are charged at $0.99 per resolution, consistent with our service pricing. Disqualifications, where Fin determines a prospect doesn’t meet the criteria, are also $0.99. The $10 price point for qualified leads reflects the higher value of pipeline creation compared to issue resolution.

The ROI speaks for itself. Early customers are reporting significant returns using Fin for Sales. One shared a perspective that mirrors what I hear in executive QBRs:

“I would say it’s at least 10 times the value. You’re now giving the business exactly what it needs as opposed to just activity. We say this expression in sales leadership all the time – ‘I don’t pay my sales team for activity. I pay them for results.’ I want my AI engine to be the same way.”

When you compare the cost of a qualified lead from Fin against the fully loaded cost of an SDR—salary, benefits, tooling, ramp time—the economics are compelling. For many businesses, particularly those that never had SDRs in the first place, Fin for Sales isn’t just replacing headcount, but creating an entirely new capability that wasn’t economically viable before.

This pricing model came from extensive customer research—qualitative interviews and quantitative studies—exploring how buyers want to pay for AI in a sales context. We tested multiple concepts: per-conversation, per-token, per-seat, revenue share, and per-qualified-lead. The research consistently pointed to outcome-aligned pricing as the preferred model, with the qualified lead emerging as the metric that best balances value alignment, measurability, and practical implementation.

Outcome-based pricing is still rare in AI, but we think that will change. For Sales Agents, we’re the first to do it. Transparency is part of the model. If you understand why we price the way we do, you can evaluate whether it works for your business.

Inspired by this post on The Intercom Blog.

May 8, 2026
4 Costly Agent Analytics Myths—And the Data-Backed Metrics I Rely on Instead

In my work with product, operations, and support leaders, I’m often asked to help make sense of Agent Analytics—what to track, how to attribute outcomes, and where to invest. After reviewing countless dashboards and running experiments across human agents and AI agents, I’ve learned that some of the most common measurement beliefs are precisely the ones that lead teams astray.
What comes up in conversation with leaders about Agent Analytics, and why not everything is what it seems.
Below, I unpack four pervasive myths I encounter and share the data-centered practices I use to replace them. My goal is simple: help you upgrade the way you measure performance so you can improve customer outcomes, accelerate learning, and scale impact with confidence.
Myth 1: “Lower average handle time (AHT) means higher performance.” AHT is useful but incomplete. When teams optimize solely for speed, they often push complexity into repeat contacts, reopens, or escalations. In the data, that shows up as a weak or negative relationship between lower AHT and durable outcomes like first contact resolution (FCR), customer effort, or revenue per conversation.
Reality and what I measure instead: I right-size speed by pairing AHT with intent-level resolution and recontact rate. For simple intents (password reset, billing address update), shorter is usually better. For complex intents (tiered troubleshooting, multi-step verification), “right-speeding” wins—slightly longer interactions that prevent rework. Practically, that means segmenting by intent complexity using behavioral analytics, tracking weighted “intent resolution rate,” and monitoring repeat-contact windows (24–168 hours) to catch downstream pain.
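
A minimal sketch of the recontact measurement, assuming each contact record carries a customer ID, an intent label, and a start time (all hypothetical field names). I read the 24–168 hour range as a set of window lengths to compare, from one day to one week; that reading is my assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recontact_rate_by_intent(contacts: list[dict], window_hours: int) -> dict[str, float]:
    """Share of contacts followed by another contact from the same customer
    on the same intent within window_hours."""
    window = timedelta(hours=window_hours)
    threads = defaultdict(list)
    for c in contacts:
        threads[(c["customer_id"], c["intent"])].append(c["started_at"])

    totals, repeats = defaultdict(int), defaultdict(int)
    for (_customer, intent), times in threads.items():
        times.sort()
        # Pair each contact with the next one in the same thread, if any.
        for current, following in zip(times, times[1:] + [None]):
            totals[intent] += 1
            if following is not None and following - current <= window:
                repeats[intent] += 1
    return {intent: repeats[intent] / totals[intent] for intent in totals}


t0 = datetime(2026, 1, 5, 9, 0)
sample = [
    {"customer_id": "A", "intent": "billing", "started_at": t0},
    {"customer_id": "A", "intent": "billing", "started_at": t0 + timedelta(hours=10)},
    {"customer_id": "A", "intent": "billing", "started_at": t0 + timedelta(hours=100)},
    {"customer_id": "B", "intent": "billing", "started_at": t0},
    {"customer_id": "C", "intent": "password_reset", "started_at": t0},
]
for hours in (24, 168):
    print(hours, recontact_rate_by_intent(sample, hours))
```

Comparing the short and long windows side by side shows whether rework surfaces immediately or days later, which handle time alone cannot reveal.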
Myth 2: “AI agent containment tells the whole story.” A high containment rate can mask failure modes such as unresolved intent, silent abandonment, or low-quality handoffs that frustrate customers and spike human workload later.
Reality and what I measure instead: I break containment into three parts for voice and chat flows: (1) intent resolution without escalation, (2) graceful handoff quality when escalation is necessary, and (3) post-handoff efficiency and satisfaction. For voice AI agent experiences, I also track escalation clarity (did the handoff summarize history and intent?), time-to-human, and customer satisfaction on the combined interaction. This gives a fuller view of how well the customer support AI strategy is working and avoids over-crediting automation for partial wins.
Myth 3: “Quality is subjective, so it can’t be measured at scale.” Teams often default to sporadic QA because they assume it can’t be standardized across channels or agent types. The result is noisy feedback loops and stalled coaching.
Reality and what I measure instead: Quality becomes measurable when it’s grounded in observable behaviors linked to outcomes. I use a rubric anchored in behavioral analytics (e.g., verified customer need, correct resolution path, policy compliance, empathy markers) and validate it via correlation with FCR, recontact, and retention analysis. To scale, I combine calibrated human reviews with AI-assisted scoring, check inter-rater reliability weekly, and use driver trees to connect quality levers to business results. This creates a consistent, coachable signal for both human agents and AI flows.
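
For the weekly inter-rater check, one standard statistic is Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. The post does not name a statistic, so treating kappa as the measure is my assumption; the labels below are invented.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
ai_assisted = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(human, ai_assisted))
```

The same function works for human-versus-human calibration and for human-versus-AI scoring, so one threshold can govern both.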
Myth 4: “If the dashboard is green after launch, we’ve won.” Early wins can reflect novelty effects, cherry-picked routing, or short-term incentives that don’t persist. Declaring victory too soon locks in fragile gains and hides regressions across cohorts.
Reality and what I measure instead: I treat go-live as the start of learning. I use A/B testing with a clear minimum detectable effect (MDE), stagger ramps, and hold out stable control cohorts for at least one full demand cycle. I track outcome OKRs rather than output OKRs, focusing on intent resolution, customer effort, and revenue and customer health over vanity metrics. I also monitor seasonality and channel mix shifts inside a unified analytics platform to ensure improvements generalize beyond the first week.
How I operationalize this day to day: (1) define intents and complexity upfront, (2) unify journey data across channels, (3) instrument resolution and recontact rigorously, (4) apply driver trees to isolate what actually moves outcomes, and (5) iterate via disciplined experiments rather than sweeping changes. This approach aligns product and operations, speeds up coaching, and ensures AI investments compound rather than decay.
If you’re rethinking your Agent Analytics stack, start by replacing each myth with a sharper metric: pair AHT with intent-level resolution, pair containment with handoff quality and satisfaction, pair QA with outcome-linked rubrics, and pair green dashboards with robust experiments. The payoff is a measurement system that earns trust, guides better decisions, and consistently improves customer and business results.

Inspired by this post on Pendo – Best Practices.

May 7, 2026

How to Evaluate a Shopify-Native AI Shopping Agent

You’ve probably been asked a deceptively simple question: should your Shopify store add an AI shopping agent? The hard part isn’t installing another chat widget. It is deciding whether the agent can help an uncertain shopper choose correctly without recommending unavailable products, misreading policy, or making a costly order change.

Treat this as a commerce-system decision, not a chatbot decision. A useful agent must connect conversation, live product data, cart behavior, checkout, and post-purchase service. The evaluation framework below will help you separate a persuasive demo from a system you can trust with customers and revenue.

The native test: can the agent read, reason, and act?

“Shopify-native” should describe an architecture, not a distribution channel. Being listed in an app marketplace or embedded in a storefront does not make an agent native. The meaningful test is whether Shopify remains the operational source of truth while the agent uses its data and APIs in the customer’s current context.

A concrete implementation shows how high that bar can be: a Shopify connection can expose products, variants, content, live inventory, order data, policies, and transactional APIs to the same customer-facing agent. That combination matters because a correct product description is still a bad answer if the relevant size is unavailable, and an accurate return policy is insufficient if the customer must start over somewhere else to use it.

I would evaluate an agent at four capability levels. A product that stops at the first level may still be useful, but it should not be presented internally as an autonomous commerce agent.

Capability	What the agent needs	What you should ask it to prove
Answer	Product content, store policies, and current catalog facts	Answer a precise product or policy question and identify the relevant item, variant, or rule
Recommend	Catalog relationships, inventory, conversation context, and the shopper’s constraints	Turn an ambiguous request into a short, reasoned shortlist instead of returning generic search results
Transact	Cart or order APIs, authentication, permissions, and confirmation controls	Update a test cart or prepare an order change while showing exactly what will happen before execution
Recover	Shared state across shopping, service, and human escalation	Resolve a support interruption and resume the customer’s original shopping task without asking for the same context again

Freshness is part of correctness. During a test, change the availability of a variant in Shopify and repeat the same shopping request. The agent should stop recommending that variant once the change is reflected in the connected system. Run a similar test with a policy update. A polished answer based on yesterday’s state is not a small quality defect; it can create a promise your operations team must later unwind.

Actions deserve an even stricter test. Ask the vendor to demonstrate the complete chain: customer identification, authorization, interpretation of the request, action preview, explicit confirmation, API execution, and a visible result. If any step is simulated, ask which one. A fast setup can reduce implementation effort, but it does not prove that the agent is accurate, observable, or safe.

Design the shopping dialogue around decisions, not keywords

Traditional ecommerce search works best when the customer already knows the product vocabulary. A shopping agent earns its place when the request is incomplete: a gift for a partner, a mattress for a particular sleep preference, or shoes that must work across road and trail conditions. The agent’s first job is not to produce an answer. It is to discover which answer would be useful.

A strong product-discovery dialogue follows a repeatable decision sequence:

Restate the customer’s job in plain language so a misunderstanding becomes visible early.
Identify hard constraints first, such as an in-stock variant, required use case, compatibility, budget boundary, or delivery requirement.
Ask only for information that could change the recommendation. A question that does not alter ranking or eligibility is conversational overhead.
Present a small shortlist and tie each option to the constraints the customer supplied.
Explain the meaningful tradeoff between the options instead of declaring a universal winner.
Offer the next useful action: compare details, select a variant, update the cart, or continue narrowing the choice.
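
The core of that sequence (hard constraints first, then a short ranked list with reasons) can be sketched as follows. This is an illustration under assumptions: the `Variant` fields, tag vocabulary, and product names are invented, and a real agent would read them from live catalog data rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    title: str
    price: int
    in_stock: bool
    tags: frozenset


def shortlist(variants, required_tags, max_price=None, preferred_tags=frozenset(), limit=3):
    """Filter on hard constraints, then rank by how many preferences each option meets."""
    eligible = [
        v for v in variants
        if v.in_stock
        and required_tags <= v.tags
        and (max_price is None or v.price <= max_price)
    ]
    ranked = sorted(eligible, key=lambda v: (-len(preferred_tags & v.tags), v.price))
    # Return each option with the preferences it satisfies, so the reason is inspectable.
    return [(v.title, sorted(preferred_tags & v.tags)) for v in ranked[:limit]]


catalog = [
    Variant("Ridge Runner", 120, True, frozenset({"trail", "road", "waterproof"})),
    Variant("City Glide", 90, True, frozenset({"road"})),
    Variant("Summit Pro", 150, False, frozenset({"trail", "road", "waterproof"})),
    Variant("Path Finder", 110, True, frozenset({"trail", "road"})),
]
need = frozenset({"trail", "road"})
print(shortlist(catalog, need, max_price=130, preferred_tags=frozenset({"waterproof"})))
print(shortlist(catalog, need, max_price=115, preferred_tags=frozenset({"waterproof"})))
```

Changing only the budget changes the shortlist, which is the behavior to look for when checking that answers to clarifying questions actually affect the recommendation.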

This sequence turns conversation design into a product requirement. For every recommended item, your evaluation record should capture the customer’s stated need, the product facts used, the reason the item fits, the tradeoff disclosed, current variant availability, and the next question or action. That record gives your team something concrete to inspect when a recommendation is challenged.

Test whether the reasoning is responsive rather than decorative. Change one important answer while holding the rest of the conversation constant. If the customer switches from occasional use to daily use, removes a budget constraint, or requires an available color, the shortlist should change when that information is material. If the products remain identical regardless of the customer’s answers, the experience is probably search with conversational packaging.

The agent should also know when not to narrow further. Once the customer has enough information to choose, another question adds friction. Conversely, confidence should not be manufactured when catalog data cannot resolve the request. A safe response identifies the missing fact, asks for clarification, or hands the conversation to a person with the constraints already summarized.

Product cards can accelerate the final step, but the interface should preserve the reasoning that produced them. An image, name, and price answer “what is this?” The conversation must also answer “why does this fit me?” That is the difference between displaying inventory and assisting a decision.

Make shopping and support one customer state machine

A shopper does not experience your sales and support departments as separate funnels. The same person can compare products, ask about shipping, check an existing order, correct a variant, and return to buying in one session. Routing each intent to a separate tool forces the customer to reconstruct context at every boundary.

Model the journey as one state machine: discover, decide, transact, service, and resume. The agent can move between those states, but it should retain the customer’s goal, constraints, products considered, cart state, relevant order, completed actions, and unresolved question. That shared state is more important than whether the organization labels the current message “sales” or “support.”
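
A minimal sketch of that state machine, assuming a simple transition table of my own choosing (the article names the states but not the allowed transitions). The point it illustrates is that context lives on the session and survives every state change.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    DISCOVER = "discover"
    DECIDE = "decide"
    TRANSACT = "transact"
    SERVICE = "service"
    RESUME = "resume"


# Hypothetical transition table; adjust to your own journey.
ALLOWED = {
    State.DISCOVER: {State.DECIDE, State.SERVICE},
    State.DECIDE: {State.DISCOVER, State.TRANSACT, State.SERVICE},
    State.TRANSACT: {State.DISCOVER, State.SERVICE},
    State.SERVICE: {State.RESUME},
    State.RESUME: {State.DISCOVER, State.DECIDE, State.TRANSACT, State.SERVICE},
}


@dataclass
class Session:
    goal: str
    state: State = State.DISCOVER
    constraints: dict = field(default_factory=dict)
    products_considered: list = field(default_factory=list)
    resume_to: Optional[State] = None

    def move(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"cannot move from {self.state.value} to {target.value}")
        # Remember where shopping was interrupted so it can be restored later.
        if target is State.SERVICE and self.state is not State.RESUME:
            self.resume_to = self.state
        self.state = target


s = Session(goal="trail shoes for daily use")
s.constraints["budget"] = 130
s.move(State.DECIDE)
s.move(State.SERVICE)      # customer asks about an existing order
s.move(State.RESUME)
s.move(s.resume_to)        # back to deciding, with constraints intact
print(s.state, s.constraints)
```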

This is where a connected agent can do more than answer FAQs. Current Shopify-oriented implementations can handle tracking, returns, exchanges, refunds, order changes, shipping questions, and subscription updates through connected procedures and APIs. Each additional action increases usefulness, but it also increases the consequence of a misunderstanding.

Use different controls for different action classes:

Read-only actions, such as showing order status, should still require appropriate customer identification but do not change commercial state.
Reversible shopping actions, such as adding or removing a cart item, should be immediately visible and easy for the customer to undo.
Financially consequential actions, including refunds, paid order changes, and subscription updates, should require authentication, an exact action summary, explicit confirmation, and a durable result or receipt.
Ambiguous or unsupported actions should stop safely and transfer to a person. The agent must not treat conversational enthusiasm, silence, or an inferred preference as consent.
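
Those action classes can be expressed as a gate that runs before any tool call. The control names below are hypothetical labels for the requirements in the list above; the sketch only shows the decision logic, not the Shopify calls themselves.

```python
from enum import Enum


class ActionClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    FINANCIAL = "financial"
    UNSUPPORTED = "unsupported"


# Hypothetical control names mirroring the requirements described above.
REQUIRED_CONTROLS = {
    ActionClass.READ_ONLY: {"customer_identified"},
    ActionClass.REVERSIBLE: {"result_visible", "undo_available"},
    ActionClass.FINANCIAL: {
        "authenticated",
        "action_summary_shown",
        "explicit_confirmation",
        "receipt_available",
    },
}


def gate(action_class: ActionClass, satisfied: set) -> str:
    """Return 'execute', 'handoff', or a 'blocked: ...' message listing missing controls."""
    if action_class is ActionClass.UNSUPPORTED:
        return "handoff"
    missing = REQUIRED_CONTROLS[action_class] - satisfied
    if missing:
        return "blocked: missing " + ", ".join(sorted(missing))
    return "execute"


# An inferred preference is not a confirmation, so the refund stays blocked.
print(gate(ActionClass.FINANCIAL, {"authenticated", "action_summary_shown"}))
```

Keeping the requirements in one table makes the stricter treatment of financial actions reviewable by operations, not just by engineering.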

That last distinction protects both the customer and the business. A mistaken recommendation can usually be reconsidered. An executed refund or subscription change creates financial and operational consequences. If the system cannot preview and verify the exact action, keep that workflow read-only and let a trained person execute it.

The transition back to shopping also needs deliberate design. After resolving a delivery problem or order correction, the agent should restore the prior context and offer a relevant path forward. It should not force an upsell into every service interaction. The next best action after a serious order problem is often confirmation that the problem is resolved. Commercial momentum comes from reducing friction, not from ignoring the customer’s immediate priority.

When escalation is necessary, pass a structured handoff rather than a transcript dump. Include the detected intent, verified identity state, constraints already collected, products or orders involved, actions attempted, results returned by Shopify, and the unresolved decision. A human agent should be able to continue with the next question, not repeat the first one.

Measure incremental commerce value and operational risk

Chat conversion is an attractive metric and an easy one to misread. People who open a shopping conversation may already have higher intent than people who do not. Comparing those groups directly can credit the agent for demand it did not create.

Ninja Transfers reported that 10% of its conversations converted to orders with values 20% above the store’s average order value. That is a useful customer result, but it is a vendor-supplied case from one merchant, not a universal benchmark or proof that the agent caused the full difference. Your business case should depend on your own incremental test.

Where traffic permits, randomize eligible storefront sessions between an agent experience and the existing experience. Measure the result across all eligible sessions, not only the visitors who choose to chat. That intention-to-treat view reduces self-selection bias and answers the executive question: what changed when the store made the agent available?
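
The difference between the two views is easy to see with a toy dataset. The sessions and revenue figures below are invented purely for illustration; the sketch computes revenue per eligible session by assigned arm and contrasts it with the chat-only average.

```python
def revenue_per_session(sessions: list[dict]) -> dict[str, float]:
    """Average revenue over every eligible session in each assigned arm."""
    totals: dict[str, tuple[int, int]] = {}
    for s in sessions:
        count, revenue = totals.get(s["arm"], (0, 0))
        totals[s["arm"]] = (count + 1, revenue + s["revenue"])
    return {arm: revenue / count for arm, (count, revenue) in totals.items()}


# Invented data: four eligible sessions per arm.
sessions = [
    {"arm": "agent", "chatted": True, "revenue": 80},
    {"arm": "agent", "chatted": True, "revenue": 0},
    {"arm": "agent", "chatted": False, "revenue": 40},
    {"arm": "agent", "chatted": False, "revenue": 0},
    {"arm": "control", "chatted": False, "revenue": 60},
    {"arm": "control", "chatted": False, "revenue": 0},
    {"arm": "control", "chatted": False, "revenue": 40},
    {"arm": "control", "chatted": False, "revenue": 0},
]

itt = revenue_per_session(sessions)
itt_lift = itt["agent"] - itt["control"]

chat_revenue = [s["revenue"] for s in sessions if s["chatted"]]
chat_only_average = sum(chat_revenue) / len(chat_revenue)

print("Per eligible session:", itt, "lift:", itt_lift)
print("Chat-only average:", chat_only_average)
```

In this toy data the chat-only average looks far stronger than the per-session lift, because the people who chose to chat were never a random sample of eligible traffic.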

Use a balanced scorecard rather than a single conversion target:

Business outcomes: completed orders per eligible session, revenue per eligible session, average order value, checkout completion, and assisted revenue.
Decision outcomes: recommendation engagement, product-detail visits after a recommendation, add-to-cart actions, and successful comparison flows.
Downstream quality: cancellations, returns, exchanges, and contacts caused by a poor recommendation or incorrect expectation.
Service outcomes: successful action completion, repeat contact for the same problem, human escalation, and time to a confirmed resolution.
Agent quality: use of current catalog facts, in-stock recommendation rate, policy accuracy, clarification behavior, safe refusal, and correct tool execution.
Risk outcomes: unauthorized or incorrect actions, failed confirmations, customer complaints, policy exceptions, and cases requiring operational repair.

Conversion and average order value belong beside returns and cancellations. An agent can raise the initial basket by recommending a more expensive option while reducing customer fit. Without the downstream view, the dashboard rewards the sale and hides the repair.

Your event model should make the journey reconstructable. Useful events include agent opened, intent classified, clarification answered, recommendation shown, product selected, cart changed, checkout started, order completed, support action requested, confirmation received, action completed, and human handoff. Join these events through an appropriately governed session or order identifier so the team can inspect both funnel movement and individual failure paths.
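
Here is a small sketch of what "reconstructable" means in practice: group events by session, order them by time, and query for sessions that stalled between two steps. The event dictionaries, field names, and snake_case event names are assumptions modeled on the list above.

```python
from collections import defaultdict


def reconstruct_journeys(events: list[dict]) -> dict[str, list[str]]:
    """Group event names by session, ordered by timestamp."""
    by_session = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        by_session[e["session_id"]].append(e["name"])
    return dict(by_session)


def stalled_after(journeys: dict[str, list[str]], step: str, next_step: str) -> list[str]:
    """Sessions that reached `step` but never reached `next_step` afterwards."""
    stalled = []
    for session_id, names in journeys.items():
        if step in names and next_step not in names[names.index(step) + 1:]:
            stalled.append(session_id)
    return stalled


events = [
    {"session_id": "s1", "ts": 1, "name": "agent_opened"},
    {"session_id": "s1", "ts": 2, "name": "recommendation_shown"},
    {"session_id": "s1", "ts": 3, "name": "cart_changed"},
    {"session_id": "s2", "ts": 1, "name": "agent_opened"},
    {"session_id": "s2", "ts": 5, "name": "recommendation_shown"},
]
paths = reconstruct_journeys(events)
print(stalled_after(paths, "recommendation_shown", "cart_changed"))
```

The same grouped structure serves both uses named above: counting sessions per step gives the funnel, and the per-session list is the individual failure path.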

Build the evaluation before the vendor demo

Create scenarios from your real catalog shape, policies, and failure modes. For each scenario, write the expected outcome and the failures that would make deployment unacceptable. Include:

An ambiguous shopping request that requires clarification.
Two products that appear similar but differ on a constraint customers care about.
An unavailable variant that would otherwise be the best match.
A question whose answer depends on store policy rather than product copy.
A conversation that moves from shopping to order support and back again.
A request that sounds actionable but lacks required authentication or confirmation.
An unsupported request that should trigger a safe handoff.
A catalog or policy change made after the agent’s initial synchronization.

Run the same scenarios repeatedly and record the underlying catalog state each time. You are testing consistency, grounding, and recovery, not literary elegance. A shorter answer that uses the correct variant and policy is more valuable than a fluent answer that improvises.

Expand autonomy in the order of consequence

A staged rollout lets evidence determine how much authority the agent receives:

Evaluate offline with approved scenarios and representative catalog data.
Launch read-only product discovery and policy answers with an obvious human fallback.
Add visible, reversible cart actions after recommendation quality is stable.
Introduce authenticated order-support workflows with previews and confirmations.
Enable financially consequential actions only after tool execution, auditability, exception handling, and operational ownership have been tested end to end.
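The stages above can be expressed as an ordered permission gate. The sketch below assumes hypothetical action names and a simple rule that order-support and financial actions also need customer confirmation; a real deployment would take both from its own policy.

```python
from enum import IntEnum

class Stage(IntEnum):
    OFFLINE_EVAL = 0
    READ_ONLY = 1
    CART_ACTIONS = 2
    ORDER_SUPPORT = 3
    FINANCIAL_ACTIONS = 4

# Hypothetical action names mapped to the lowest stage that permits them.
REQUIRED_STAGE = {
    "answer_policy_question": Stage.READ_ONLY,
    "add_to_cart": Stage.CART_ACTIONS,
    "change_shipping_address": Stage.ORDER_SUPPORT,
    "issue_refund": Stage.FINANCIAL_ACTIONS,
}

def is_allowed(action, current_stage, confirmed=False):
    """Allow an action only if the rollout stage and confirmation state permit it."""
    required = REQUIRED_STAGE.get(action)
    if required is None:
        return False  # unknown actions are denied and should route to a human
    if current_stage < required:
        return False
    # Assumption: harder-to-reverse actions also need explicit confirmation.
    if required >= Stage.ORDER_SUPPORT and not confirmed:
        return False
    return True
```

Because unmapped actions are denied by default, adding a new tool forces someone to decide which stage it belongs to.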

Define the owner of each failure before launch. Product should own the intended customer behavior and success measures. Commerce or operations should own policy and workflow correctness. Engineering should own integration reliability and observability. Customer support should own escalation quality and emerging failure patterns. The exact reporting lines can vary; an unowned failure queue cannot.

Review both aggregates and conversation-level traces after release. Aggregate metrics tell you whether the experience is moving the business. Traces tell you why. A small cluster of incorrect variant recommendations or failed order actions can disappear inside a healthy conversion average while creating disproportionate customer harm.

Key takeaways

A Shopify-native agent should use live commerce data and governed APIs; storefront placement alone is not enough.
The agent’s product-discovery job is to uncover decision criteria, apply hard constraints, explain tradeoffs, and lead to the next useful action.
Shopping and support should share customer context, but the agent’s permissions must become stricter as actions become harder to reverse.
Conversation conversion is a diagnostic metric, not causal proof. Measure incremental results across eligible traffic whenever possible.
Pair conversion and average order value with returns, cancellations, incorrect actions, and operational repair costs.
Begin with read-only assistance and expand autonomy only after each workflow proves accurate, observable, recoverable, and properly owned.

Before you approve a purchase, bring a vague shopping request, a policy edge case, an unavailable variant, a mixed sales-and-support conversation, and a consequential order action into a live test store. If the agent cannot show where its answer came from, what it will change, and how it fails safely, you have found the next product requirement, not a detail to defer until launch.

References

Intercom – Fin for Ecommerce: The Shopify-native AI Agent transforming product discovery and sales

May 7, 2026

How to Scale Session Replay Without Sacrificing Privacy

You want session replay on more journeys because the blind spots are expensive. A funnel can show where users leave, but it cannot show whether they encountered a broken control, a confusing message, a layout shift, or an error that never reached your analytics. Replay can turn those behavioral signals into enough context to make a product decision.

The hard part is expanding that visibility without collecting data you should not have, degrading the experience you are trying to understand, or filling storage with recordings nobody will use. The answer is not a single masking setting. You need a capture contract, a delivery architecture, a sampling model, and an operating scorecard that treat performance, fidelity, and privacy as one system.

Set the capture contract before you expand coverage

Replay programs often begin with a coverage question: what percentage of sessions should you record? That is the wrong first question. Start with the decision you expect the recording to change. If nobody can name that decision, more coverage will create more cost and exposure without producing more insight.

Write a capture contract for each product surface. This is a short, reviewable specification that connects a business purpose to technical controls. It should answer:

What question is replay meant to answer? Examples include diagnosing failed activation, explaining an error spike, or finding friction in a conversion step.
Which routes, components, and user cohorts are in scope? Name them. Do not approve an undefined all-product rollout.
Which data is prohibited? Include form values, credentials, payment details, message content, health information, account-recovery data, and any product-specific sensitive fields that apply.
What consent state permits capture? The recorder should not initialize before the required state is known. Withdrawal should stop capture and prevent queued data from being sent.
Who can watch a replay? Define roles by purpose. Product discovery, support investigation, engineering diagnosis, and administration do not automatically require identical access.
How long will the data remain available? Tie retention to the stated purpose rather than keeping replay indefinitely because storage permits it.
What sampling rule applies? State the baseline rate, targeted cohorts, exclusions, temporary overrides, owner, and expiry condition.

Selective capture, redaction, consent, retention, role-based access, and environment-aware sampling are separate controls. Treating one of them as a substitute for the others creates predictable gaps. Masking does not grant consent. Restricted access does not justify excessive collection. Short retention does not make an exposed credential harmless.

Apply those controls as close to collection as possible. A web replay is commonly reconstructed from serialized page state, changes, and interaction events. The privacy risk therefore sits in the data leaving the browser, not only in what the player later displays. A value hidden during playback may already exist in an outbound payload or stored record.

A useful default is to block text and input values, then allowlist only fields proven safe and necessary. Add route-level and component-level exclusions for sensitive surfaces. Use a separate, time-bounded approval for diagnostic capture that needs greater fidelity. I would reject a policy that merely says to mask personal information: the term depends on context, and engineers cannot reliably implement an undefined category.
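A block-by-default rule with an allowlist might look like the sketch below. It is written in Python for readability; a real recorder would apply the same logic in the browser before serialization. The field names, route prefixes, and mask string are hypothetical.

```python
MASK = "***"

# Hypothetical allowlist: only fields reviewed as safe and necessary.
ALLOWED_FIELDS = {"plan_name", "step_title"}
# Hypothetical route prefixes where nothing is captured at all.
EXCLUDED_ROUTES = ("/billing", "/account/recovery")

def redact_snapshot(route, fields):
    """Return the payload that may leave the client, or None if the route is excluded."""
    if route.startswith(EXCLUDED_ROUTES):
        return None
    return {
        name: (value if name in ALLOWED_FIELDS else MASK)
        for name, value in fields.items()
    }
```

A field that nobody has reviewed is masked automatically, so a new form input fails closed rather than open.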

Test the contract against the raw system, not just the player. Seed a non-production fixture page with recognizable test values, exercise every relevant component state, inspect the browser payload, inspect the stored representation, and verify that exports and downstream tools preserve the restriction. If a prohibited test value crosses the collection boundary, the control has failed even if the replay screen obscures it.
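The seeded-value check can be automated with a simple scan of whatever the recorder emits or stores. In this sketch the canary strings and payload shape are invented for illustration; the idea is only that a recognizable test value should never appear past the collection boundary.

```python
import json

# Hypothetical seeded test values placed on the fixture page.
CANARIES = {"password": "CANARY-PW-7f3a", "card": "CANARY-CARD-91b2"}

def find_leaks(payload, canaries=CANARIES):
    """Return the names of seeded values that appear anywhere in a serialized payload."""
    serialized = json.dumps(payload)
    return sorted(name for name, value in canaries.items() if value in serialized)

clean_payload = {"nodes": [{"tag": "input", "value": "***"}]}
leaky_payload = {"nodes": [{"tag": "input", "value": "CANARY-PW-7f3a"}]}

clean_result = find_leaks(clean_payload)
leaky_result = find_leaks(leaky_payload)
```

Running the same scan against the outbound request body, the stored record, and any export covers each boundary named above with one fixture.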

Consent and retention obligations vary by jurisdiction, contract, and data type. Your privacy or legal owner must approve those rules for the markets you serve. Engineering can enforce an approved policy; it cannot infer that policy from a generic replay configuration.

Keep capture off the user’s critical path

Scalable replay starts in the browser, where your product competes with the recorder for main-thread time, memory, and bandwidth. A backend that can ingest billions of events does not help if the recorder makes an interaction sluggish or loses the DOM changes needed to explain the problem.

The delivery design should make page experience more important than recording completeness. Decoupled capture and delivery, adaptive batching, compression, backpressure controls, and priority handling provide the basic pattern:

Capture the minimum useful representation. Filter excluded nodes and values before serialization. Avoid collecting detail that no approved use case needs.
Separate recording from transport. The capture path should write to a bounded queue rather than waiting for a network request. Upload latency must not become interaction latency.
Batch adaptively. Small batches can reduce delay during quiet periods, while larger compressed batches can reduce request overhead during sustained activity. The policy should respond to queue pressure and network conditions.
Define backpressure behavior. When production exceeds delivery capacity, the recorder needs a documented degradation order. Preserve navigation, consent changes, errors, and the structural events required for reconstruction before lower-value detail. Never freeze the page to protect the replay.
Bound long sessions. Flush incrementally, cap memory use, and make reconnection behavior explicit. A queue that grows for the life of a tab will eventually turn a delivery problem into a page-performance problem.
Make partial data visible. Mark gaps, dropped segments, and incomplete uploads. A replay that silently appears complete is more dangerous than one that clearly communicates its limits.

Backpressure deserves special attention because it forces a product decision disguised as an implementation detail. If the system cannot retain everything, what must survive? The answer should come from the capture contract. An error marker without enough surrounding state may be useless, but exhaustive cursor movement may be expendable. Rank event classes before an incident forces the recorder to choose implicitly.
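To make the ranking concrete, here is a language-neutral sketch, written in Python, of a bounded queue that drops the least important event on overflow and counts what it dropped. The event kinds and their priorities are assumptions; your capture contract would supply the real ranking.

```python
from collections import deque

# Hypothetical ranking: a lower number means more important to keep.
PRIORITY = {"consent_change": 0, "navigation": 0, "error": 0,
            "dom_mutation": 1, "cursor_move": 2}

class BoundedReplayQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = {}  # event kind -> count, so gaps can be marked in playback

    def push(self, event):
        """Enqueue without blocking; on overflow, drop the least important event."""
        self.items.append(event)
        if len(self.items) <= self.capacity:
            return
        # Oldest event among the least important kind; unknown kinds rank last.
        victim = max(self.items, key=lambda e: PRIORITY.get(e["kind"], 3))
        self.items.remove(victim)
        self.dropped[victim["kind"]] = self.dropped.get(victim["kind"], 0) + 1

    def drain(self, batch_size):
        """Remove and return up to batch_size events for upload."""
        batch = []
        while self.items and len(batch) < batch_size:
            batch.append(self.items.popleft())
        return batch

queue = BoundedReplayQueue(capacity=3)
for kind in ["navigation", "cursor_move", "cursor_move", "error", "dom_mutation"]:
    queue.push({"kind": kind})
kept = [e["kind"] for e in queue.items]
```

In this run the two cursor events are dropped and recorded in `queue.dropped`, which is the information a player needs to mark the gap instead of presenting the session as complete.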

Do not validate the client only on a fast laptop and stable connection. Use representative complex pages and test replay on and off under CPU pressure, constrained networking, rapid DOM change, background-tab transitions, reconnection, and long sessions. Compare Web Vitals, long tasks, memory growth, bytes transferred, queue drops, upload completion, and playback completeness. Long sessions, traffic spikes, complex interactions, and variable networks are precisely where an apparently sound design reveals its failure modes.

There is no universal acceptable overhead that fits every product. Set budgets relative to your production baseline and the importance of the journey. A small regression on a frequently used mobile activation path may matter more than a larger regression on an internal administration page. Segment the results by route, browser, device class, network condition, and session length so averages do not hide the users most affected.

Sample for decisions, not for a warehouse of footage

A single global sample rate is easy to configure and hard to defend. It spends collection capacity uniformly even though product questions are not uniformly valuable. It can also miss rare failures while overrepresenting routine sessions that nobody will watch.

Use a portfolio of sampling modes:

Random baseline sampling gives you a less biased view of ordinary behavior and lets you notice problems you did not predefine.
Cohort sampling increases visibility for a defined population such as new users, a browser family, a release cohort, or users entering a critical journey.
Signal-based sampling concentrates diagnosis around errors, failed steps, rage clicks, dead clicks, abnormal exits, or other instrumented friction signals.
Temporary diagnostic sampling raises fidelity for a narrow incident or release window, with an owner and an automatic expiry condition.
Hard exclusions override every sampling mode. A high-value investigation is not permission to collect from a prohibited surface or consent state.
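The precedence among these modes can be sketched as a single decision function. The rates, session fields, and reason labels below are illustrative assumptions, and the temporary diagnostic mode is left out for brevity. The baseline check runs before the targeted modes so that sessions tagged "baseline" remain a random sample.

```python
import hashlib

def stable_fraction(session_id):
    """Map a session id to a repeatable number in [0, 1)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def should_record(session, baseline_rate=0.02, cohort_rates=None, signal_rate=0.5):
    """Return (decision, reason). Rates and field names are illustrative assumptions."""
    # Hard exclusions win over every other mode; unknown consent counts as no consent.
    if not session.get("consent") or session.get("route_excluded"):
        return False, "excluded"
    roll = stable_fraction(session["id"])
    if roll < baseline_rate:
        return True, "baseline"
    cohort_rate = (cohort_rates or {}).get(session.get("cohort"), 0.0)
    if roll < cohort_rate:
        return True, "cohort"
    if session.get("friction_signal") and roll < signal_rate:
        return True, "signal"
    return False, "not_sampled"
```

Storing the reason with each recording lets analysts separate the random baseline from targeted sessions later.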

Onboarding, activation, high-friction conversion flows, and paths with disproportionate revenue or trust impact are sensible places to begin because a clearer diagnosis can change a meaningful decision. Signals such as errors, rage clicks, dead clicks, scroll behavior, and stalled progress can then help you find the sessions worth examining.

Keep one statistical distinction clear. Targeted replay is good for explaining a known problem, but it cannot tell you how prevalent that problem is. If you record sessions because they contain an error, the resulting library will naturally make errors look common. Use analytics or a random baseline to measure frequency. Use replay to understand mechanism and context.
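A small simulation shows why the two libraries answer different questions. The population size, error count, sample size, and seed are all invented for illustration.

```python
import random

# Hypothetical population: 1,000 sessions, 20 of which hit an error (2%).
population = [{"id": i, "had_error": i < 20} for i in range(1000)]

def error_share(sessions):
    """Fraction of sessions in a collection that contain an error."""
    return sum(s["had_error"] for s in sessions) / len(sessions)

rng = random.Random(7)  # fixed seed so the sketch is repeatable
random_baseline = rng.sample(population, 200)
targeted_library = [s for s in population if s["had_error"]]

true_prevalence = error_share(population)
baseline_estimate = error_share(random_baseline)
targeted_share = error_share(targeted_library)
```

The targeted library contains an error in every session by construction, so its share says nothing about frequency. Only the full population or the random baseline can estimate how common the problem is.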

A disciplined investigation looks like this:

Find a measurable change in a funnel, cohort, error rate, performance signal, or support pattern.
Define the affected population before opening replays.
Review a deliberately selected set of relevant sessions and record recurring observable behaviors, not interpretations of user intent.
Turn those observations into a falsifiable product or technical hypothesis.
Instrument, release, or experiment so the hypothesis can be measured outside the replay player.

This prevents two common mistakes: browsing memorable sessions until a story feels true, and treating one vivid recording as evidence of market-wide demand. Replay is strongest when it explains a quantitative signal and leads back to a measurable change.

Run replay with a coupled performance, privacy, and value scorecard

Session replay is not finished when playback works. It is an operating capability with client releases, configuration changes, storage growth, access decisions, and incident risk. Give it an owner and review the system across five dimensions.

Dimension	Signals to watch	Decision the signals should trigger
User experience	Web Vitals, long tasks, main-thread work, memory growth, and replay bytes	Reduce capture detail, change delivery behavior, narrow coverage, or halt a rollout when the recorder breaks its budget
Replay fidelity	Queue drops, missing segments, incomplete uploads, event integrity, and playback reconstruction errors	Fix prioritization or transport before teams rely on incomplete recordings for decisions
Platform reliability	Ingestion failures, processing delay, retrieval latency, playback-start failures, and behavior during traffic spikes	Add capacity, repair a failing stage, or adjust sampling without shifting the problem into the browser
Privacy and governance	Redaction test failures, capture outside approved consent states, retention exceptions, and access outside approved roles	Disable affected capture, contain the data, follow the approved deletion or incident process, and repair the control before restoring it
Decision value	Investigations that reached a useful replay, time to diagnosis, time to resolution, and product hypotheses validated outside replay	Move coverage toward high-value use cases or retire collection that produces no action

These dimensions constrain each other. Aggressive compression may improve bandwidth while hurting reconstruction. More capture may improve fidelity while violating the page budget. Narrow access may improve governance while blocking the support engineers responsible for incident response. The job is not to maximize any single metric; it is to keep the entire system inside approved boundaries.

Version capture configuration like production code. A seemingly harmless selector change can expose text, remove necessary context, or increase mutation volume. Test recorder and configuration releases against fixture pages containing known sensitive values and known reconstructable interactions. Keep a rollback path.

Prepare shutdown controls before launch. You should be able to stop capture for a component, route, environment, tenant group, or the whole product without waiting for a new application release. Document who can use each control, how queued data is handled, how affected stored data is identified, and when privacy, security, support, and engineering must be involved. If collection crosses a prohibited boundary, continuing to record while the team debates ownership compounds the exposure.

Finally, connect replay operations to the workflows that consume it. Product teams need links from behavioral cohorts to relevant sessions. Support needs controlled escalation paths. Engineering and SRE need errors, network signals, layout shifts, and performance context close to the replay timeline. Connecting interaction context to observability and delivery workflows can shorten the path from an anomaly to a testable explanation, but only if the data remains trustworthy and accessible to the right roles.

Key takeaways

Approve a capture contract for each surface before approving a broader sample rate.
Redact or exclude sensitive data before it leaves the browser; a masked player is not enough.
Protect the page with decoupled delivery, bounded queues, adaptive batching, and explicit backpressure priorities.
Keep random sampling for prevalence and use targeted sampling to explain known signals.
Operate performance, fidelity, platform reliability, privacy, and decision value as a coupled scorecard.
Require scoped shutdown controls, retention handling, access ownership, and rollback before production expansion.

Before you increase replay coverage, ask for two artifacts: a one-page capture contract for the next journey and a replay-on versus replay-off test under that journey’s difficult conditions. If the team cannot show what is allowed to leave the browser, how the page stays within budget, and which decision the recordings will change, the rollout is not ready to scale.

May 7, 2026

How to Link AI Evals to Retention Without Chasing Proxies

Your AI activation rate is rising. More users are reaching the agent, completing setup, or trying the workflow. Yet the retention curve is flat. That usually means you know who touched the product, but not who received enough value to return.

A higher aggregate eval score will not resolve that gap. You need to identify an AI quality signal that appears early, connect it to later behavior, and determine whether changing that signal can change retention. The result should influence onboarding, roadmap priorities, customer success, and model releases, not just add another chart to an eval dashboard.

Start with the retention decision, not the eval dashboard

The wrong opening question is: Which evals can the team measure? Start with: What must a user experience early enough that returning becomes the rational next action?

That framing forces you to define retention before searching for a predictive signal. A login is rarely enough. Choose a return behavior that represents recurring value: running another workflow, completing another meaningful task, or bringing the agent into an ongoing process. Then make five decisions explicit:

Define the eligible population. Decide whether you are studying newly activated users, newly activated tenants, or another clearly bounded cohort.
Choose the unit of analysis. Use the user when value is individual. Use the tenant or account when adoption and renewal depend on a shared workflow.
Name the retained behavior. It should represent renewed value, not passive presence.
Select the retention window. Weekly and monthly cohorts answer different product questions, so do not switch between them after seeing the result.
Close the observation period before the retention outcome begins. Otherwise, later behavior can leak into the feature that supposedly predicts it.

This distinction matters when activation improves but retention does not. Activation proves that a user crossed a product milestone. It does not prove that the AI produced a trustworthy, complete, safe, and usable outcome. Your eval candidates should measure that missing experience.

Eval family	Question it should answer	When it deserves product attention
Semantic accuracy	Did the output correctly address the intended task?	Incorrect results prevent completion or make the user unwilling to rely on the agent again.
Containment	Did the agent complete the eligible workflow without an avoidable human escalation?	Escalation prevents the workflow from delivering repeatable automation.
Safety	Did the interaction remain within the product’s acceptable risk boundaries?	A regression creates unacceptable exposure, even if another engagement metric improves.
Latency	Did the result arrive fast enough for the user’s workflow?	Delay causes abandonment, repeated attempts, or a return to the previous process.
UX friction	Could the user reach a good outcome without unnecessary setup, retries, or corrections?	Users fail before they have a fair chance to experience the agent’s value.

Shortlist three to five candidates tied to these user outcomes. A long eval inventory makes analysis look comprehensive while weakening the decision. You are not trying to find every quality problem. You are looking for an early signal that is measurable, related to meaningful retention, and alterable through a product intervention.

Build an identity and time contract before modeling anything

The hardest part is usually not the statistical model. It is joining AI interactions to product behavior without duplicating records, losing users, or assigning an outcome to the wrong account. Evals often live in notebooks or model-observability systems while retention events live in product analytics. A plausible-looking join can still be wrong.

Create a data contract that covers both systems. At minimum, it should specify:

Stable user and tenant identifiers, including the rule used when a user belongs to more than one tenant.
The timestamp that determines whether an interaction belongs inside the observation period.
The model and workflow version associated with the interaction.
The conditions that make an interaction eligible for each eval.
The grain of the analysis table, such as one row per user-day or tenant-day.
The treatment of missing data, especially the difference between no eligible interaction and an evaluated failure.

That last distinction is easy to miss. A user who never invoked an eligible workflow did not fail the accuracy eval. Combining non-use and poor quality into the same value hides whether the retention problem comes from discovery, setup, or AI performance.

Compute daily per-user and per-tenant features rather than joining every raw interaction directly to every product event. Each feature should retain its denominator or exposure count. A pass rate without the number of eligible interactions can make sparse use look equivalent to sustained use.
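A minimal sketch of a user-day feature that keeps its denominator and separates "no eligible interaction" from "evaluated failure". The interaction records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical evaluated interactions; "eligible" marks whether the accuracy eval applied.
interactions = [
    {"user": "u1", "day": "2026-01-05", "eligible": True, "passed": True},
    {"user": "u1", "day": "2026-01-05", "eligible": True, "passed": False},
    {"user": "u1", "day": "2026-01-05", "eligible": False, "passed": None},
    {"user": "u2", "day": "2026-01-05", "eligible": False, "passed": None},
]

def daily_accuracy_features(interactions):
    """One row per user-day with the eligible count kept beside the pass rate."""
    totals = defaultdict(lambda: {"eligible": 0, "passed": 0})
    for row in interactions:
        bucket = totals[(row["user"], row["day"])]
        if row["eligible"]:
            bucket["eligible"] += 1
            bucket["passed"] += int(row["passed"])
    return {
        key: {
            "eligible": t["eligible"],
            # None means "no eligible interaction", which differs from a 0.0 pass rate.
            "pass_rate": t["passed"] / t["eligible"] if t["eligible"] else None,
        }
        for key, t in totals.items()
    }

features = daily_accuracy_features(interactions)
```

Here `u2` used the product but never invoked an eligible workflow, so the feature is missing rather than zero.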

Keep the definition of each feature readable. Containment, for example, needs an explicit eligible-workflow denominator and an explicit rule for what counts as avoidable escalation. UX friction needs named events, such as a retry or correction, rather than an opaque composite score. If a product manager cannot explain how the feature changes, the team will struggle to turn it into a roadmap decision.

Watch for many-to-many joins. One AI interaction may generate several product events, and one product session may contain several AI interactions. Joining both raw tables can multiply rows and inflate success or failure counts. Aggregate each side to the agreed grain first, then join the resulting features to the retention cohort.
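The inflation is easy to reproduce with a toy example. The tables and field names below are hypothetical; the point is the difference between joining raw rows and joining rows already aggregated to the agreed grain.

```python
ai_interactions = [
    {"session": "s1", "passed": True},
    {"session": "s1", "passed": False},
]
product_events = [
    {"session": "s1", "name": "export_created"},
    {"session": "s1", "name": "export_shared"},
    {"session": "s1", "name": "export_downloaded"},
]

# Naive raw join: every interaction pairs with every event in the same session.
raw_join = [(a, e) for a in ai_interactions for e in product_events
            if a["session"] == e["session"]]
inflated_passes = sum(a["passed"] for a, _ in raw_join)

def aggregate(rows, key, summarize):
    """Collapse rows to one summary per key value."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)
    return {k: summarize(v) for k, v in grouped.items()}

# Aggregate each side to one row per session first, then join on the key.
ai_by_session = aggregate(
    ai_interactions, "session",
    lambda rows: {"interactions": len(rows), "passes": sum(r["passed"] for r in rows)})
events_by_session = aggregate(
    product_events, "session", lambda rows: {"events": len(rows)})
joined = {s: {**ai_by_session[s], **events_by_session[s]}
          for s in ai_by_session.keys() & events_by_session.keys()}
```

The raw join turns one passing interaction into three, while the aggregated join keeps the true counts.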

Versioning also matters. If a model or workflow changes during the observation period, an account-level average can blend materially different experiences. Preserve the version so you can distinguish a real quality improvement from a change in traffic or segment mix.

Find a threshold that survives segment and leakage checks

Once the dataset is reliable, begin with cohort analysis rather than a complex predictive model. Compare retention among users who reached different levels of each candidate signal. You are looking for a separation that is large enough to matter, stable enough to repeat, and reachable through product changes.

Use this sequence:

Plot weekly or monthly retention against each early eval feature.
Use a driver tree to show where the feature sits between acquisition, activation, AI quality, repeat behavior, and the final retention outcome.
Fit a simple logistic model that controls for plan type, segment, region, and acquisition channel.
Repeat the analysis inside important segments instead of relying only on the blended population.
Check whether the threshold remains directionally useful when you vary the observation definition without allowing it to overlap the outcome.

The controls are not statistical decoration. Higher-plan customers may have better implementation support. One region may contain a different account mix. A high-intent acquisition channel may produce both better agent usage and better retention. Without those checks, customer composition can masquerade as model improvement.

In one product context, users who crossed a specific eval threshold early retained at three times the rate of peers who did not. That is evidence that an eval can become a commercially useful leading indicator. It is not a universal benchmark. Your threshold, effect size, eligible population, and retention behavior will depend on your product.

Do not choose the threshold merely because it creates the largest visual gap. Prefer a boundary that has enough eligible users on both sides, persists across relevant segments, and corresponds to an experience the product can influence. A dramatic ratio from a small cohort is a hypothesis, not a roadmap mandate.
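The sketch below checks a candidate threshold for cohort size on both sides and then repeats the comparison inside each segment. The data is constructed, not observed: it is built so that a blended gap comes entirely from segment mix. The field names and the `min_cohort` value are assumptions.

```python
def retention_gap(users, threshold, min_cohort=2):
    """Retention above vs. below a threshold, or None when either side is too small."""
    above = [u for u in users if u["signal"] >= threshold]
    below = [u for u in users if u["signal"] < threshold]
    if len(above) < min_cohort or len(below) < min_cohort:
        return None
    rate = lambda group: sum(u["retained"] for u in group) / len(group)
    return {"above": rate(above), "below": rate(below),
            "n_above": len(above), "n_below": len(below)}

def gap_by_segment(users, threshold, segment_key, min_cohort=2):
    """Repeat the comparison inside each segment."""
    segments = sorted({u[segment_key] for u in users})
    return {s: retention_gap([u for u in users if u[segment_key] == s],
                             threshold, min_cohort)
            for s in segments}

def make(segment, signal, retained, n):
    """Build n identical synthetic users (illustrative data only)."""
    return [{"segment": segment, "signal": signal, "retained": retained}] * n

users = (
    make("enterprise", 0.9, True, 6) + make("enterprise", 0.9, False, 2)
    + make("enterprise", 0.5, True, 3) + make("enterprise", 0.5, False, 1)
    + make("self_serve", 0.9, True, 1) + make("self_serve", 0.9, False, 3)
    + make("self_serve", 0.5, True, 2) + make("self_serve", 0.5, False, 6)
)

blended = retention_gap(users, threshold=0.8)
by_segment = gap_by_segment(users, threshold=0.8, segment_key="segment")
```

In this synthetic data the blended view shows higher retention above the threshold, yet inside each segment the two sides retain at the same rate. The apparent effect is customer composition, which is the pattern the segment check exists to catch.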

Run an explicit leakage review before presenting the result. Common forms include an eval feature calculated after the retention window begins, an account-health field that already contains renewal information, or a usage feature whose value can only rise when the user returns. Leakage can make a weak signal look uncannily predictive.
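Part of that review can be mechanical. This sketch assumes each feature records the latest day of data it used (`last_input_day`, a hypothetical field) and flags any feature that reaches into the outcome window.

```python
from datetime import date, timedelta

def retention_start(activation, observation_days):
    """The outcome window opens only after the observation period has closed."""
    return activation + timedelta(days=observation_days)

def leaking_features(feature_rows, activation, observation_days):
    """Return names of features whose latest input falls inside the outcome window."""
    cutoff = retention_start(activation, observation_days)
    return sorted(row["name"] for row in feature_rows
                  if row["last_input_day"] >= cutoff)

# Hypothetical feature metadata for one cohort activated on 5 January 2026.
feature_rows = [
    {"name": "accuracy_pass_rate", "last_input_day": date(2026, 1, 11)},
    {"name": "sessions_total", "last_input_day": date(2026, 1, 20)},
    {"name": "retry_count", "last_input_day": date(2026, 1, 12)},
]
flagged = leaking_features(feature_rows, activation=date(2026, 1, 5), observation_days=7)
```

A timestamp check will not catch every form of leakage, such as an account-health field that already encodes renewal, so it supplements the manual review rather than replacing it.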

The decision artifact should show the cohort definition, feature window, retention window, cohort sizes, effect estimate, control variables, and segment sensitivity together. If the threshold only works for a particular plan or acquisition channel, say so. A narrow, honest signal is more actionable than a broad result that disappears when the mix changes.

Use experiments to separate a predictor from a product lever

A predictive eval signal is not automatically causal. Sophisticated users may configure the agent better, choose easier workflows, or persist through early friction. Their higher eval scores and higher retention may share the same cause. Improving the score will not necessarily reproduce their behavior.

Convert the signal into a testable product intervention:

Choose an intervention that can move the signal during the early observation period. Depending on the failure, that could be an in-app guide, a product tour, a setup change, or a model change behind a feature flag.
Keep the threshold definition fixed for the experiment. Redefining success after seeing the result turns the test into another exploratory analysis.
Predefine the retained behavior, retention window, target population, and second-order guardrails.
Use a minimum detectable effect calculation to determine whether the experiment can answer the question with the available population.
Run an A/B test where randomization is practical. Measure whether the intervention moves the eval signal and whether that movement is followed by the intended retention lift.
Inspect results by the same segments used in the observational analysis. A blended win can hide a regression for a strategically important group.

This creates a necessary chain of evidence: the intervention changed the early experience, the early eval feature moved, and the retention outcome moved in the expected direction. If retention improves without movement in the eval, your intervention may work through another mechanism. If the eval improves without retention, the signal is not yet a proven growth lever.
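For the minimum detectable effect step, a common normal-approximation formula for comparing two proportions gives a rough per-arm sample size. This is a planning sketch, not a substitute for your experimentation platform's calculator; the default `alpha` and `power` values are conventional assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift of `mde` over `baseline`.

    Uses the two-sided normal approximation for two independent proportions.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Hypothetical planning inputs: 30% baseline retention, 5-point absolute lift.
needed = sample_size_per_arm(baseline=0.30, mde=0.05)
```

If `needed` exceeds the eligible population you can enroll within the retention window, the experiment cannot answer the question as designed, and the intervention or the target lift needs to change.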

Treat safety differently from an ordinary optimization metric. A retention increase cannot compensate for an unacceptable safety regression. Use risk scoring to gate exposure, keep model changes behind feature flags until the required evals pass, and monitor anomalies in both the score and its eligible volume. A stable percentage on a collapsing sample is not stability.
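Monitoring the score and its eligible volume together might look like the following sketch. The thresholds and field names are hypothetical and would come from your own baselines.

```python
def check_eval_health(baseline, current, max_rate_drop=0.03, max_volume_drop=0.5):
    """Flag regressions in the pass rate or in the eligible volume behind it."""
    alerts = []
    if current["eligible"] < baseline["eligible"] * (1 - max_volume_drop):
        alerts.append("eligible_volume_collapsed")
    if current["eligible"] == 0:
        alerts.append("no_eligible_interactions")
        return alerts
    base_rate = baseline["passed"] / baseline["eligible"]
    current_rate = current["passed"] / current["eligible"]
    if base_rate - current_rate > max_rate_drop:
        alerts.append("pass_rate_regressed")
    return alerts

baseline = {"eligible": 1000, "passed": 950}
collapsed = check_eval_health(baseline, {"eligible": 200, "passed": 190})
regressed = check_eval_health(baseline, {"eligible": 1000, "passed": 900})
healthy = check_eval_health(baseline, {"eligible": 980, "passed": 931})
```

The `collapsed` case keeps the same 95% pass rate on a fifth of the volume, which a rate-only monitor would report as stable.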

Track support tickets, NPS, and net revenue retention (NRR) alongside the primary retention result. These measures operate on different timelines, but they help catch proxy optimization. An intervention that pushes users across an eval threshold while increasing support burden or degrading customer sentiment has not produced a clean product win.

Separate the user-level and release-level uses of the signal. A user-level signal can trigger onboarding or customer-success help when a new account has not reached the value threshold. A release-level eval can prevent a model change from expanding when quality falls. Combining both into one vague health score makes ownership and response unclear.

Put the winning signal into the product operating system

The analysis matters only when it changes what happens next. Give the signal a definition, an owner, an intervention, and a response to regression.

For onboarding, guide new users toward the workflow conditions associated with crossing the threshold. Do not merely show them where the AI button is.
For customer success, add the signal to a health score only when the team has a specific action to take. A warning without a playbook creates dashboard noise.
For roadmap planning, require proposed work to identify which eval feature it should move, why that feature connects to retention, and how the effect will be tested.
For model releases, keep exposure controlled with feature flags until the relevant eval improves without violating safety or experience guardrails.
For monitoring, use anomaly detection on the eval value, eligible interaction volume, and important segments so a blended average does not conceal a regression.

This operating model also clarifies ownership. Product owns the intervention and decision. Data science owns the validity of the feature and analysis. Engineering owns reliable instrumentation and release controls. Customer success owns the response when an account-level signal indicates missed value. Those responsibilities can be distributed differently in your organization, but none should be implicit.

Key takeaways

Define the retained behavior, population, unit, and time window before selecting an eval.
Shortlist three to five eval candidates that describe real user value: accuracy, containment, safety, latency, or UX friction.
Aggregate reliable daily features with stable user and tenant identifiers before joining them to product cohorts.
Use cohort analysis, driver trees, and simple controlled models to find a predictive threshold, then check sample size, segment mix, and label leakage.
Use an A/B test to learn whether a product intervention can move both the eval signal and retention.
Operationalize a validated signal through onboarding, customer success, release gates, feature flags, and anomaly detection.

At your next product review, bring a short decision sheet with the retained behavior, observation window, no more than five candidate evals, join keys, and the first intervention you can test. If the team cannot fill in those fields, fix the analytics contract first. If it can, run the smallest credible experiment and let retained behavior, not a prettier eval dashboard, decide the roadmap.

References

Amplitude — The Surprising Eval Signal That Tripled Retention: How I Connected AI Evals to Product KPIs

May 7, 2026
