Tag: A/B testing

Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.

Learn how to track your web vitals and page rankings faster with Amplitude AI Agents and improve your site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.

Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.

My practical playbook is straightforward: 1) Establish a performance baseline for Core Web Vitals on key templates and critical user paths. 2) Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction. 3) Connect those vitals to downstream behaviors—scroll depth, engagement, and conversion—so we prioritize fixes that move business outcomes, not just lab scores. 4) Use feature flags and A/B testing to ship improvements safely and quantify uplift. 5) Close the loop with Agent Analytics to keep learnings visible and actionable.
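
As one way to make steps 1 and 2 concrete, the sketch below groups field samples by segment, takes the 75th percentile of each metric, and rates it against the published Core Web Vitals thresholds (LCP 2.5 s and 4 s, INP 200 ms and 500 ms, CLS 0.1 and 0.25). The sample values, segment labels, and function names are hypothetical, and nothing here is an Amplitude API.

```python
from collections import defaultdict
from statistics import quantiles

# Published Core Web Vitals thresholds: (good upper bound, poor lower bound).
# LCP and INP are in milliseconds; CLS is unitless.
THRESHOLDS = {"LCP": (2500, 4000), "INP": (200, 500), "CLS": (0.1, 0.25)}


def p75(values):
    """Return the 75th percentile using inclusive linear interpolation."""
    if len(values) == 1:
        return values[0]
    return quantiles(values, n=4, method="inclusive")[2]


def rate(metric, value):
    """Classify a value as good, needs-improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    return "needs-improvement" if value <= poor else "poor"


def baseline(samples):
    """Group (segment, metric, value) samples and rate each group's p75."""
    grouped = defaultdict(list)
    for segment, metric, value in samples:
        grouped[(segment, metric)].append(value)
    return {key: rate(key[1], p75(vals)) for key, vals in grouped.items()}


# Hypothetical field samples: (device segment, metric, value in ms).
samples = [
    ("mobile", "LCP", 1800), ("mobile", "LCP", 2600),
    ("mobile", "LCP", 3900), ("mobile", "LCP", 4700),
    ("desktop", "LCP", 1200), ("desktop", "LCP", 1500),
    ("desktop", "LCP", 1900), ("desktop", "LCP", 2300),
]
report = baseline(samples)
```

With these made-up numbers the desktop segment rates as good while the mobile segment rates as poor, which is the kind of gap a blended average would hide.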

Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading user experience for release velocity, or the reverse.

The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.

If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.

Inspired by this post on Amplitude – Best Practices.

May 20, 2026

How to Validate Behavioral Heatmap Accuracy Before You Act

Your heatmap puts a bright cluster on the primary call to action, and the next step seems obvious: move the button, rewrite the copy, or prioritize a mobile redesign. Pause before turning that picture into a roadmap decision. A heatmap can look coherent while representing the wrong interface state, assigning clicks to the wrong element, or combining users whose layouts are materially different.

Behavioral heatmap accuracy is not about whether the colors look plausible. It is about whether each recorded interaction appears on the interface the user actually encountered, within the correct context, and supports the conclusion you want to draw. You need to validate that chain before you act on the pattern.

Treat accuracy as a chain, not a single metric

There is no single accuracy score that makes a heatmap trustworthy. Four separate conditions have to hold:
- Capture fidelity: The background image represents the relevant product state. The release, page structure, loaded content, navigation, overlays, and experiment variant should match what generated the interactions.
- Placement fidelity: A click is attached to the intended interface element after responsive reflow, personalization, localization, and other layout changes. A precise coordinate on the wrong screenshot is still wrong.
- Population fidelity: The map contains the users, devices, variants, and product states relevant to your decision. An aggregate can be mathematically correct while describing an interface that no individual user experienced.
- Inference fidelity: The visualization can support the claim being made. A click establishes an interaction, not the user’s motivation. Scroll depth establishes reach, not attention, comprehension, or persuasion.

Reliable screenshot capture, selector-based placement, automatic device detection, and clearer scrollmaps address important failure modes in this chain. They reduce ambiguity, but they do not eliminate the need to inspect your product states, filters, selectors, and supporting evidence.

The weakest link determines whether the map is useful. Perfect element placement cannot rescue a screenshot from an old release. Clean device segmentation cannot justify a claim about user intent. Before discussing what the hot area means, establish what was captured, where it was placed, and whose behavior was included.

Run a validation pass before reading the colors

Use the same validation sequence whenever a heatmap is about to influence an experiment, design change, or roadmap priority. This turns accuracy from a vague feeling into a reviewable process.
1. Write down the decision first. Be specific: move the primary action, remove a section, change the activation path, or investigate a mobile interaction. This tells you which page states and elements require the strongest validation.
2. Freeze the analysis scope. Record the screen or template, analysis window, release, experiment variant, device class, and user segment. If the interface changed during the selected window, split the data or identify the limitation rather than treating the period as one stable experience.
3. Build a state matrix. List only the states that materially alter the interface: desktop and mobile layouts, relevant locales, personalized variants, authenticated and unauthenticated views, expanded and collapsed components, or overlays that cover the underlying page. You do not need every possible segment. You need every state capable of moving, replacing, hiding, or duplicating the elements involved in your decision.
4. Compare the screenshot with each relevant state. Check the order and size of major sections, sticky navigation, banners, modals, lazy-loaded content, and conditional components. If the displayed background is stale or combines interactions from incompatible layouts, stop interpreting the map and repair the capture or filtering first.
5. Test element placement. In a controlled recorded session, interact with the target and with nearby controls that could be confused with it. Repeat the check on the layouts that move the element. The target’s hotspot should remain attached to the target rather than to an old coordinate. Exclude the controlled session from normal analysis when your tooling allows it.
6. Inspect critical selectors. Ask engineering to confirm that each selector identifies the intended component across the templates and states in scope. Pay particular attention to repeated cards, reused button components, translated labels, and responsive navigation. If adjacent actions collapse into one hotspot, the map is not suitable for deciding between those actions.
7. Reconcile the picture with events and replay. Apply equivalent page, date, device, user, and variant filters before comparing evidence. Exact numerical agreement is only a reasonable expectation when the systems use the same interaction definition and filters. Otherwise, document why their coverage differs and investigate unexplained gaps.
8. Assign a confidence grade. Mark the map as decision-grade, directional, or invalid. Decision-grade means the relevant states and placements were verified. Directional means a pattern is visible but a known limitation prevents a precise conclusion. Invalid means the visual representation is wrong for the proposed decision.

For a critical call to action, treat any reproducible placement error as a blocker. A hotspot that sometimes lands on a neighboring control can reverse the apparent preference between the two controls. Fix the representation before discussing design implications.
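
The grading step can be written down as a small rule so that reviewers apply it the same way each time. This is a minimal sketch: the field names are hypothetical, and the choice of which failed checks count as invalid versus directional is an assumption layered on the definitions above.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Outcome of one validation pass; field names are illustrative."""
    screenshot_matches_state: bool
    placement_verified: bool
    selectors_confirmed: bool
    population_matches_scope: bool
    known_limitations: tuple = ()


def grade(result):
    """Return 'invalid', 'directional', or 'decision-grade'."""
    # Assumption: a wrong screenshot or a reproducible placement error
    # makes the picture itself wrong, so it blocks the decision.
    if not (result.screenshot_matches_state and result.placement_verified):
        return "invalid"
    # Assumption: unconfirmed selectors, a mismatched population, or any
    # documented limitation leaves a visible pattern without precision.
    if (not result.selectors_confirmed
            or not result.population_matches_scope
            or result.known_limitations):
        return "directional"
    return "decision-grade"
```

For example, `grade(ValidationResult(True, False, True, True))` returns "invalid" because placement could not be verified, regardless of how clean the other checks are.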

Split heatmaps when the interface or interaction model changes

Segmentation is not merely an analytical refinement. It is part of measurement accuracy. Mobile and desktop users may see different navigation, stacking order, content length, control size, and interaction affordances. Combining them can create a vivid composite that corresponds to neither experience.

Use a simple rule: split the map whenever a cohort can encounter different geometry, different elements, or a different way of interacting. Check these questions before aggregating:
- Does the same element exist in every included state?
- Does it keep the same purpose and selector?
- Does responsive behavior move it relative to neighboring elements?
- Does a variant, locale, or personalized state change the surrounding content?
- Are touch and pointer interactions being interpreted in a comparable way?
- Did a release alter the template during the selected analysis window?

If any answer exposes a material difference, inspect separate maps first. You can compare the resulting patterns afterward, but you should not use the blended view as the primary evidence.
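
One way to apply the split rule mechanically is to group sessions by the attributes that can change geometry or elements, then build one map per group. The sketch below assumes hypothetical session records with `device_class`, `variant`, `locale`, and `release` fields; a real export will use different names.

```python
from collections import defaultdict


def layout_signature(session):
    """Key of the attributes that can move, replace, or hide elements."""
    return (session["device_class"], session["variant"],
            session["locale"], session["release"])


def split_for_heatmaps(sessions):
    """Group session ids so each heatmap covers one consistent layout."""
    groups = defaultdict(list)
    for session in sessions:
        groups[layout_signature(session)].append(session["id"])
    return dict(groups)


# Hypothetical sessions: two share a layout, one does not.
sessions = [
    {"id": "s1", "device_class": "mobile", "variant": "A",
     "locale": "en", "release": "r1"},
    {"id": "s2", "device_class": "desktop", "variant": "A",
     "locale": "en", "release": "r1"},
    {"id": "s3", "device_class": "mobile", "variant": "A",
     "locale": "en", "release": "r1"},
]
maps = split_for_heatmaps(sessions)
```

Only attributes that materially alter the interface belong in the signature. Adding every available dimension would fragment the data without improving accuracy.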

Scrollmaps need the same discipline. The same depth percentage can correspond to different content when a mobile page stacks sections that sit side by side on desktop. Compare scroll behavior within consistent layouts, then map each depth region to the actual value proposition, trust element, form, or call to action shown there. Scroll reach tells you that a region was scrolled into view during the session; it does not prove that the person read or understood it.

Match the decision to what the evidence can prove

Even a technically accurate heatmap is an observation layer. It can show where interactions accumulated or how far sessions progressed. It cannot, by itself, tell you why the behavior occurred or whether a proposed design change will improve an outcome.

Use an evidence ladder instead of promoting every hotspot directly into the backlog:
- Heatmaps locate the pattern. They help you identify concentrated clicks, neglected controls, competing actions, and sections reached by fewer sessions.
- Event data measures the associated behavior. Use it to determine whether the interaction registered, where it sits in the funnel, and whether it connects to the micro-conversion or product outcome you care about.
- Session replay supplies sequence and context. Inspect what happened immediately before and after the interaction, including overlays, loading states, repeated attempts, navigation changes, and other conditions that an aggregate view hides.
- A controlled experiment evaluates the proposed change. When the claim is that a different placement, label, or layout will improve an outcome, compare that change against a baseline rather than treating the heatmap as causal proof.

The combination also helps you diagnose apparent contradictions. A strong hotspot with no corresponding outcome event may indicate a broken interaction, incomplete instrumentation, or an action whose result is unclear. Low interaction on content that few sessions reach is first a placement or journey question, not automatically a copy problem. High scroll reach with low interaction means the region was available to users, but it does not establish that they noticed or rejected its message. A hotspot outside the visible target is a measurement defect, not a behavioral insight.

Translate each finding into the next appropriate action:
- If the screenshot, selector, or segment is wrong, create an instrumentation or analytics repair.
- If the behavior is verified but its explanation is uncertain, create a discovery question and inspect relevant replays.
- If the behavior is verified and tied to an outcome gap, define a hypothesis and an A/B test.
- If the evidence reveals a reproducible interaction defect, prioritize the defect without disguising it as a preference experiment.

This language matters in product reviews. Say that you observed a pattern, verified its representation, formed a hypothesis, and selected the next test. Do not say that users prefer, understand, ignore, or want something unless your evidence can support that stronger claim.

Key takeaways
- A heatmap is decision-grade only when the captured state, element placement, population, and proposed inference all align.
- Validate the critical target and its neighboring controls across every layout that can move or replace them.
- Split device classes, variants, releases, locales, or personalized states when they produce materially different interfaces.
- Read scroll depth as reach and click concentration as interaction. Neither measure establishes attention, intent, or causality.
- Pair heatmaps with event data and session replay, then use a controlled experiment when your decision depends on predicted impact.

At your next heatmap review, do not begin with the hottest color. Begin with the screenshot, segment label, release, and one critical interaction traced from capture through outcome. If that path survives validation, turn the pattern into a hypothesis or product action. If it does not, fix the measurement before it becomes roadmap evidence.

References
- Amplitude – Amplitude Heatmaps Rebuilt: Rock-Solid Screenshots, Precise Placement, Smarter Scrollmaps

May 15, 2026

How to Prove the ROI of an AI Product Before You Scale It

Your AI product is getting used. The demos land well, task completion is improving, and internal enthusiasm is high. Then the CFO asks a harder question: what changed in the business because this product exists?

You cannot answer that question with prompt volume, response quality, adoption, or tickets touched. You need a measurement system that separates activity from incremental value, counts the full operating cost, and makes risk visible before a rollout gets larger. Here is how to build one.

Start with the decision your ROI model must support

ROI is not a retrospective slide assembled after launch. It is a decision rule. Before development begins, decide what evidence would justify launching, scaling, redesigning, rolling back, or retiring the capability.

That distinction changes the conversation. Instead of asking whether the agent is accurate enough or popular enough, you ask whether a measurable change in customer behavior produces a measurable business result without crossing an unacceptable risk threshold.

Build a driver tree with four levels:

Company outcome: revenue growth, lower cost to serve, or reduced business risk.
Customer outcome: the user completes a valuable job, reaches value sooner, or resolves a problem without unnecessary effort.
Product behavior: the AI capability changes conversion, expansion, self-service completion, containment, handle time, or escalation.
Controllable lever: the team changes the workflow, model behavior, conversation design, human review, or product guidance.

The chain matters because a model metric is rarely a business metric. Better answer quality may improve task completion, which may improve trial-to-paid conversion. The ROI case depends on the full chain, not the first link.
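
A driver tree can be kept as plain linked records so the full chain stays visible next to any metric. The sketch below is illustrative only: the `Driver` class and the example descriptions are assumptions, filled in with the answer-quality example from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Driver:
    """One node in the driver tree, linked to the level above it."""
    level: str
    description: str
    parent: Optional["Driver"] = None

    def chain(self):
        """Walk from this node up to the company outcome."""
        node, path = self, []
        while node is not None:
            path.append(f"{node.level}: {node.description}")
            node = node.parent
        return path


company = Driver("Company outcome", "revenue growth")
customer = Driver("Customer outcome", "user completes a valuable job", company)
behavior = Driver("Product behavior", "higher trial-to-paid conversion", customer)
lever = Driver("Controllable lever", "improved answer quality", behavior)

for step in lever.chain():
    print(step)
```

Printing the chain from the lever upward makes it obvious when a proposed metric stops at the first link and never reaches a company outcome.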

Value path	Business outcome	Leading evidence	Guardrails
Revenue	Higher conversion, average order value, or expansion	Time-to-first-value and self-service completion	Errors, complaints, and policy violations
Cost	Lower cost to serve	Containment, deflection, and reduced handle time	Escalations, false resolution, and downstream customer harm
Risk	Lower frequency or impact of harmful failures	Human-review events and detected violations	False positives, false negatives, hallucinations, and security breaches

Choose one primary value path for the investment case. Revenue, cost, and risk can all appear on the scorecard, but declaring all three as primary makes it too easy to rescue a weak result with whichever metric moved after launch.

A support agent, for example, may appear successful because it contains more conversations. But containment is only valuable if customers actually resolve their problems. A conversation that never reaches a human can reduce measured support volume while increasing complaints or churn risk. This is why revenue, cost, and risk measures must be evaluated together.

Write the measurement contract before you build the dashboard

A measurement contract is a short agreement among product, data, finance, and the operational team affected by the AI workflow. It prevents the definitions, cost boundaries, and success thresholds from changing after results arrive.

Your contract should answer these questions:

Who is eligible? Define the users, accounts, tasks, channels, and exclusions. Do not mix workflows with materially different economics.
What is the intervention? Name the AI capability and the version being evaluated. A model, prompt, retrieval pipeline, policy, or escalation change can alter the result.
What is the primary outcome? Select the business metric that determines whether the hypothesis passed.
What are the leading indicators? Use measures such as time-to-first-value, containment, and self-service completion to diagnose movement before lagging results mature.
What are the guardrails? Predefine acceptable limits for errors, hallucinations, false positives, false negatives, escalations, complaints, security events, and policy violations.
What is the baseline? Freeze the comparison period or control group before exposing the eligible population to the capability.
How will incrementality be proven? Specify the experiment, holdout, assignment unit, and minimum detectable effect.
What costs count? Agree on model or API consumption, labeling, evaluation, human review, and ongoing oversight before calculating value.
What action follows each result? Record the thresholds for launch, scale, redesign, rollback, and retirement.
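
The questions above translate naturally into a completeness check that can run before any dashboard work starts. This is a sketch under stated assumptions: the field names are hypothetical labels for the nine questions, and the contract is represented as a plain dictionary.

```python
# Hypothetical field names, one per question in the contract.
REQUIRED_FIELDS = (
    "eligibility", "intervention", "primary_outcome", "leading_indicators",
    "guardrails", "baseline", "incrementality_design", "cost_boundary",
    "decision_thresholds",
)


def missing_fields(contract):
    """Return the contract questions that are still unanswered."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]


def status(contract):
    """Label a capability by whether its contract is complete."""
    return "exploration" if missing_fields(contract) else "ready to measure"


draft = {
    "eligibility": "trial accounts using in-app support",
    "intervention": "support agent, version under evaluation",
    "primary_outcome": "trial-to-paid conversion",
}
print(status(draft), missing_fields(draft))
```

A draft that names only the eligibility, intervention, and primary outcome is reported as an exploration, along with the six fields still to be agreed.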

The contract should distinguish an outcome OKR from an output OKR. Shipping the agent, generating responses, and increasing feature use are outputs. Improving conversion, lowering verified cost to serve, or reducing harmful failures are outcomes. Outputs can explain what happened, but they cannot establish value on their own.

Instrument the complete journey, not just the conversation

An AI log tells you what the model did. An ROI dataset must also tell you what the user did next.

Connect the journey from eligibility to business outcome:

The user or account became eligible for the capability.
The AI experience was offered, viewed, and engaged.
A task was attempted, completed, abandoned, or repeated.
A response was accepted, corrected, regenerated, or sent for human review.
The interaction was contained, escalated, or handed to another workflow.
The downstream conversion, expansion, support, retention, or complaint event occurred.
The associated model cost, labeling work, and human-oversight cost were recorded.

Carry a stable user or account identifier, experiment assignment, agent version, and journey identifier across those events. Without that connective tissue, the team may have an impressive agent dashboard and no defensible way to attribute a business outcome to the experience.
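
To illustrate why those identifiers matter, the sketch below defines a hypothetical event record that carries them on every step, then counts distinct journeys reaching a given event per experiment arm. The schema and event names are assumptions for illustration, not a prescribed taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyEvent:
    """One step in the journey; all field names are illustrative."""
    event_type: str             # e.g. "eligible", "task_completed", "converted"
    account_id: str             # stable identifier carried across events
    journey_id: str             # ties the steps of one journey together
    experiment_assignment: str  # "treatment" or "control"
    agent_version: str
    cost_usd: float = 0.0       # model, labeling, or review cost for this step


def journeys_reaching(events, event_type):
    """Count distinct journeys per assignment that logged `event_type`."""
    reached = {}
    for event in events:
        if event.event_type == event_type:
            arm = reached.setdefault(event.experiment_assignment, set())
            arm.add(event.journey_id)
    return {arm: len(ids) for arm, ids in reached.items()}


# Hypothetical events for three journeys.
events = [
    JourneyEvent("eligible", "acct-1", "j-1", "treatment", "v2"),
    JourneyEvent("task_completed", "acct-1", "j-1", "treatment", "v2", 0.04),
    JourneyEvent("converted", "acct-1", "j-1", "treatment", "v2"),
    JourneyEvent("eligible", "acct-2", "j-2", "control", "none"),
    JourneyEvent("converted", "acct-2", "j-2", "control", "none"),
    JourneyEvent("eligible", "acct-3", "j-3", "treatment", "v2"),
]
eligible = journeys_reaching(events, "eligible")
converted = journeys_reaching(events, "converted")
```

Because every event carries the assignment and journey identifier, the eligible and converted counts can be compared per arm without joining against a separate agent log.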

Use behavioral analytics and session replay to understand why a metric moved. Use journey mapping and retention analysis to locate the friction worth solving in the first place. Product tours and in-app guidance can then help eligible users reach a validated workflow. This creates a closed loop from journey friction to experiment and measurable outcome, instead of a collection of disconnected AI metrics.

Calculate economic value without turning activity into savings

Start with net business value:

Net business value = incremental revenue + cost avoided – total operating cost – quantified risk loss

If finance requires an ROI percentage, divide net business value by the agreed investment base. Keep both the numerator and denominator visible. A percentage without its cost boundary is easy to inflate and hard to audit.
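
The formula and the ROI ratio can be expressed directly, which keeps the numerator and the investment base visible in one place. The figures in the example are invented purely to show the arithmetic.

```python
def net_business_value(incremental_revenue, cost_avoided,
                       total_operating_cost, quantified_risk_loss=0.0):
    """Net value = incremental revenue + cost avoided - operating cost - risk loss."""
    return (incremental_revenue + cost_avoided
            - total_operating_cost - quantified_risk_loss)


def roi(net_value, investment_base):
    """ROI as a fraction of the agreed investment base."""
    if investment_base <= 0:
        raise ValueError("investment base must be positive")
    return net_value / investment_base


# Hypothetical figures, in the same currency and period.
value = net_business_value(
    incremental_revenue=120_000,
    cost_avoided=30_000,
    total_operating_cost=90_000,
    quantified_risk_loss=10_000,
)
ratio = roi(value, investment_base=100_000)
```

Here the net value is 50,000 and the ratio is 0.5, or 50 percent of the assumed investment base. Changing the base changes the percentage without changing the net value, which is why both should be reported.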

Count only incremental revenue

Do not credit the AI product with every transaction it touched. Credit it with the difference between the exposed population and the valid control or holdout.

A practical revenue calculation is:

Incremental revenue = eligible volume × measured outcome lift × value per additional outcome

The measured outcome might be trial-to-paid conversion, self-service upsell, average order value, or expansion. Use the same eligibility definition, attribution window, and revenue treatment for the intervention and control. If the AI experience merely appears somewhere in a successful journey, that is influenced revenue, not proof of incremental revenue.
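
A worked version of the calculation, assuming the lift is measured as the absolute difference in conversion rate between the assigned treatment group and the control. All numbers are hypothetical.

```python
def conversion_rate(conversions, assigned):
    """Rate over everyone assigned to the arm, not only those who engaged."""
    return conversions / assigned


def incremental_revenue(eligible_volume, treated_rate, control_rate,
                        value_per_outcome):
    """Eligible volume x absolute lift x value per additional outcome."""
    lift = treated_rate - control_rate
    return eligible_volume * lift * value_per_outcome


# Hypothetical experiment: 10,000 accounts assigned to each arm.
treated = conversion_rate(460, 10_000)   # 4.6%
control = conversion_rate(400, 10_000)   # 4.0%
revenue = incremental_revenue(
    eligible_volume=50_000,
    treated_rate=treated,
    control_rate=control,
    value_per_outcome=200,
)
```

With these inputs the lift is 0.6 percentage points and the incremental revenue comes to roughly 60,000. Multiplying the 460 treated conversions by 200 instead would give influenced revenue, which overstates the claim.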

Separate capacity from cashable savings

Cost claims require more care than a deflection count. A contained interaction may create capacity without reducing expenditure. That capacity can still be valuable, but it should not be presented as cash savings unless spending actually changes.

Capacity created: employees have time available for other work, but the existing cost base remains.
Variable cost avoided: the company no longer incurs a cost that would have grown with each additional interaction.
Cashable savings: an approved budget, vendor charge, or staffing requirement is actually reduced.

Report these separately. Otherwise, the same saved minute can be counted once as employee capacity and again as reduced spend.
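
One simple safeguard against double counting is to require each savings line item to sit in exactly one bucket. The sketch below assumes hypothetical line items identified by an id; the bucket keys mirror the three categories above.

```python
BUCKETS = ("capacity_created", "variable_cost_avoided", "cashable_savings")


def summarize(items):
    """Sum amounts per bucket, rejecting any item claimed more than once."""
    totals = dict.fromkeys(BUCKETS, 0.0)
    seen = set()
    for item_id, bucket, amount in items:
        if bucket not in totals:
            raise ValueError(f"unknown bucket: {bucket}")
        if item_id in seen:
            raise ValueError(f"item counted twice: {item_id}")
        seen.add(item_id)
        totals[bucket] += amount
    return totals


# Hypothetical line items: (id, bucket, amount).
items = [
    ("agent-minutes-q2", "capacity_created", 8_000.0),
    ("overflow-vendor-q2", "cashable_savings", 3_000.0),
    ("per-ticket-fees-q2", "variable_cost_avoided", 1_500.0),
]
totals = summarize(items)
```

The three totals stay separate in the result, so a reader can see how much of the claimed benefit is capacity and how much is spend that actually changed.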

Validate that a deflected task was resolved, not abandoned or displaced to another channel. Then calculate avoided cost from the incremental lift in verified resolution, not the total number of conversations the agent handled.

Include the operating costs that make the agent dependable

Model or API cost is only one part of the investment. Include labeling, evaluation, human review, and operational oversight. If a safer workflow requires more review, that review is part of the product’s economics, not an external inconvenience to exclude from the model.

Segment cost by agent, workflow, and outcome. Cost per response is useful for infrastructure management, but cost per verified successful outcome is the better economic unit. A cheap response that triggers retries, escalations, or corrections may be more expensive than a higher-cost response that completes the job.
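
The difference between the two units is easy to see with made-up numbers. In this sketch the per-response prices, escalation cost, and outcome counts are all hypothetical, and the only costs counted are responses and escalations.

```python
def total_cost(responses, cost_per_response, escalations, cost_per_escalation):
    """Response spend plus the human cost of escalations."""
    return responses * cost_per_response + escalations * cost_per_escalation


def cost_per_verified_outcome(cost, verified_outcomes):
    """Cost divided by verified successful outcomes; infinite if there are none."""
    if verified_outcomes == 0:
        return float("inf")
    return cost / verified_outcomes


# Workflow A: cheap responses, many escalations, fewer verified outcomes.
cost_a = total_cost(1_000, 0.02, 200, 5.0)
unit_a = cost_per_verified_outcome(cost_a, 600)

# Workflow B: pricier responses, few escalations, more verified outcomes.
cost_b = total_cost(1_000, 0.10, 20, 5.0)
unit_b = cost_per_verified_outcome(cost_b, 900)
```

Workflow A is five times cheaper per response, yet its cost per verified outcome is several times higher than workflow B's once escalations are included.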

Do not bury risk inside an average ROI number

Risk adjustment should make uncertainty visible, not create false precision. Use three layers:

Hard guardrails: security and policy conditions that trigger containment or rollback regardless of financial upside.
Observed risk indicators: error, hallucination, escalation, complaint, false-positive, and false-negative rates tracked by workflow and cohort.
Financial adjustment: expected loss deducted from net value only when the probability and impact assumptions are credible enough for finance and risk owners to accept.

Do not let a low-frequency, high-consequence failure disappear inside a high average success rate. If the downside cannot be defensibly monetized, keep it as an explicit decision constraint rather than assigning it a convenient dollar value.

Prove incrementality before claiming impact

The strongest ROI calculation still fails if the attribution is weak. A before-and-after improvement may come from seasonality, pricing, traffic quality, a support policy change, or another product release. The AI capability needs a counterfactual: what would have happened to comparable eligible users without it?

Use an A/B test or holdout whenever the product and risk profile allow it. Make these choices before launch:

Assignment unit: Randomize at the level where the outcome occurs. If expansion is measured per account, account-level assignment can prevent users in the same customer organization from receiving conflicting experiences.
Primary outcome: Pick the metric that determines success and keep diagnostic metrics secondary.
Minimum detectable effect: Precompute the smallest lift worth detecting based on the baseline, available population, and business value. If the experiment cannot detect a decision-relevant change, extending the metric list will not fix it.
Guardrails: Test quality, escalation, complaints, security, and policy outcomes alongside the primary metric.
Analysis population: For a product-level ROI claim, analyze eligible users according to their assigned experience. Looking only at people who voluntarily used the agent introduces selection bias.
Measurement horizon: Keep the holdout long enough to observe the outcome named in the contract. Leading indicators can guide iteration, but they should not be substituted for retention, churn, Net Recurring Revenue, or other lagging outcomes.
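
To check whether an experiment can detect a decision-relevant change, the required population can be estimated in advance. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The 5 percent significance level and 80 percent power are conventional defaults assumed here, not values taken from this article.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Units needed per arm to detect an absolute lift of `mde_abs`."""
    if mde_abs <= 0:
        raise ValueError("minimum detectable effect must be positive")
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical: 4% baseline conversion, 0.6 percentage point lift worth detecting.
needed = sample_size_per_arm(baseline_rate=0.04, mde_abs=0.006)
```

With these assumed inputs the estimate is on the order of eighteen thousand units per arm. If the eligible population is smaller than that, the design cannot reliably detect the lift, and adding more metrics will not change that.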

If randomization is not practical, use a fixed holdout or a frozen comparison period and document the limitations. A weaker design can still inform a decision, but the ROI claim should carry less confidence. Do not quietly promote correlation to causation because the rollout has executive attention.

Interpret the result as a system. Suppose self-service completion rises but the business outcome does not. The agent may be solving a low-value task, attracting users who would have converted anyway, or shifting effort to a later step. If conversion improves while complaints or policy violations cross the guardrail, the value hypothesis may be valid but the implementation is not ready to scale.

This is eval-driven development applied to product economics: define acceptable behavior and business success, measure both under controlled conditions, diagnose the failures, and repeat the test after a meaningful change.

Turn ROI into a portfolio operating system

A one-time business case goes stale as models, prompts, traffic, user behavior, and operating costs change. Maintain an Agent Analytics view for every production capability.

Each agent scorecard should show:

The primary business outcome and current experiment result.
Leading journey metrics from eligibility through verified completion.
Revenue contribution, cost avoided, and total operating cost using the agreed definitions.
Quality and risk guardrails, including escalations and human-review events.
Performance by relevant customer, task, and journey cohort.
The agent, model, policy, and workflow version associated with the result.
The current decision status: exploring, launching, scaling, redesigning, contained, or retiring.

Use the dashboard to make portfolio decisions, not merely to report trends:

Scale when the primary outcome clears the precommitted threshold, guardrails hold, net value is positive, and the result remains credible across the cohorts that matter.
Redesign when leading indicators improve but the business outcome does not, or when human review and escalation erase the economic gain.
Contain or roll back when a hard security, policy, or customer-harm threshold is breached, even if average financial performance is positive.
Retire when controlled measurement shows no decision-relevant incrementality or when dependable operation costs more than the value created.
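
The four rules can be encoded so that every scorecard review applies them in the same order. This is a sketch, not a definitive policy: the input names are hypothetical, and the precedence (hard thresholds first, then scale, redesign, retire) is an assumption about how to resolve cases where more than one rule could apply.

```python
def portfolio_decision(*, hard_threshold_breached, primary_outcome_met,
                       guardrails_hold, net_value, credible_across_cohorts,
                       leading_indicators_improved, incrementality_detected):
    """Map scorecard signals to one of the portfolio actions."""
    # Hard security, policy, or customer-harm breaches override financial upside.
    if hard_threshold_breached:
        return "contain or roll back"
    if (primary_outcome_met and guardrails_hold
            and net_value > 0 and credible_across_cohorts):
        return "scale"
    # Leading indicators moved, but the outcome or the economics did not.
    if leading_indicators_improved and (not primary_outcome_met or net_value <= 0):
        return "redesign"
    if not incrementality_detected or net_value <= 0:
        return "retire"
    return "keep measuring"


decision = portfolio_decision(
    hard_threshold_breached=False, primary_outcome_met=False,
    guardrails_hold=True, net_value=-5_000, credible_across_cohorts=True,
    leading_indicators_improved=True, incrementality_detected=True,
)
```

In this hypothetical case the leading indicators improved but the primary outcome did not, so the function returns "redesign" rather than "retire".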

Review operational signals with frontline teams because they can explain patterns hidden by aggregate metrics. Review portfolio value in QBRs with product, data, finance, and risk owners so investment follows evidence rather than novelty.

Only accelerate adoption after the workflow has demonstrated unit value. In-app guides, product tours, and lifecycle nudges can bring more eligible users into a validated flow. Measure whether those interventions increase the business outcome, not merely clicks or agent sessions. Scaling exposure to an unproven workflow scales its cost and risk as readily as its potential benefit.

Key takeaways

Treat ROI as a precommitted decision rule for launch, scale, redesign, rollback, or retirement.
Connect model behavior to customer behavior and then to revenue, cost, or risk through a driver tree.
Freeze the baseline, cost boundary, guardrails, attribution method, and success thresholds before results arrive.
Credit only incremental revenue and verified avoided cost. Keep created capacity separate from cashable savings.
Include model consumption, labeling, evaluation, human review, and oversight in the operating cost.
Use controlled experiments or holdouts, with a decision-relevant minimum detectable effect, to separate causal impact from correlation.
Keep severe risk conditions as explicit constraints when they cannot be responsibly converted into a financial estimate.
Scale adoption only after the AI workflow has shown positive unit value under acceptable risk.

Pick one high-friction customer journey and complete its measurement contract before the next roadmap review. If the team cannot name the baseline, control, primary outcome, cost boundary, guardrails, and decision thresholds, the capability is still an exploration. Label it honestly, instrument it properly, and earn the right to make an ROI claim.

May 15, 2026

No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.

To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.

To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.

Inspired by this post on Amplitude – Best Practices.

May 13, 2026
From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of agentic and data driven product development." That vision resonates deeply with how I lead teams: anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators—activation, engagement depth, and retention—then instrument product telemetry with Amplitude analytics and behavioral analytics to surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action, grounded in a retrieval-first pipeline and governed by eval-driven development. For product managers, this demands working fluency with LLMs and practical prompt engineering, plus a rigorous AI strategy covering data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

Cross-functional cadence is everything. I partner closely with Principal AI Engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and OKRs written around outcomes rather than output to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
How to Run AI-Assisted Feature Launches That Drive Growth
You are days from releasing a feature. Engineering needs a rollout decision. Go-to-market teams need a clear promise. Support needs to know what could go wrong. Leadership wants to know whether the release changed customer behavior. Dropping an AI bot into the launch channel will not resolve those tensions. If the metrics, authority, and escalation rules are vague, the bot will only answer ambiguity faster.

The useful model is a closed loop: define the behavior you want to change, instrument exposure and value, operate the rollout from one shared channel, let agents handle repeatable retrieval and synthesis, and reserve consequential decisions for accountable people. Done well, AI reduces the coordination tax around a launch while making the growth decision more disciplined.

Define the growth decision before you automate the launch

A feature being available is an output. A customer reaching value is an outcome. Your launch plan has to connect the two before anyone writes an agent prompt or schedules a readout.

A durable growth plan translates the product North Star into activation and retention signals, then defines the minimum detectable effect before experimentation. The North Star provides direction, but it is often too distant to diagnose a new feature. A launch needs an earlier behavioral signal that can tell you whether eligible users encountered the feature, understood it, and reached its intended value.

Write a short launch contract with these fields:
1. Target user and moment: Name the user or account segment, the situation that makes the feature relevant, and any eligibility rules. A feature intended for a new administrator solving an initial setup problem should not be evaluated across every user in the product.
2. Behavioral hypothesis: State the current behavior, the desired behavior, and why the feature should cause the change. If the causal link cannot be written plainly, the team is not ready to interpret the launch data.
3. Measurement chain: Instrument eligibility, actual exposure, meaningful engagement, the activation action, and the downstream value event. If you record engagement but not exposure, low adoption could mean either that users ignored the feature or that they never saw it.
4. Primary signal: Choose the behavior closest to customer value that can mature within the launch window. Do not promote every available metric to equal status. That turns a decision into a search for whichever chart looks most favorable.
5. Guardrails: Name the operational and customer signals that can stop a rollout, such as degraded performance, errors, support burden, privacy concerns, or a harmful shift in another important behavior. Define the actual acceptable bounds in your internal contract before launch; do not negotiate them after a concerning result appears.
6. Minimum detectable effect: Decide what change would be large enough to matter to the product and business. This keeps the team from celebrating meaningless movement or waiting indefinitely for certainty that the planned test cannot provide.
7. Decision rule and authority: Specify what evidence permits a ramp, what requires a hold, what triggers investigation, and who can pause or roll back the feature. An agent may assemble the evidence, but it should not invent the rule during the incident.
The contract should also distinguish a growth signal from a health signal. Activation, conversion, or repeated use may tell you whether the feature is producing value. Latency, error rates, complaints, and anomalous segment behavior tell you whether it is safe to continue. A healthy system with an immature growth signal may justify holding the rollout. Broken instrumentation or a material guardrail breach calls for a different response.

This distinction prevents a common category error: treating an inconclusive experiment as a failed feature, or treating early adoption as proof of durable value. The launch decision should always answer the same question: given trustworthy exposure data, the primary signal, and the guardrails, should you ramp, hold, investigate, or roll back?
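One way to keep the launch contract honest is to store it as structured data and check it for gaps before rollout. The sketch below is illustrative only: the class names, field names, and the stage labels in `REQUIRED_CHAIN` are assumptions made for this example, not a schema from any particular tool.

```python
from dataclasses import dataclass

# Hypothetical stage labels for the measurement chain described in the contract.
REQUIRED_CHAIN = ("eligible", "exposed", "engaged", "activated", "value")


@dataclass(frozen=True)
class Guardrail:
    name: str
    upper_bound: float  # agreed before launch; breach when the observed value exceeds it


@dataclass(frozen=True)
class LaunchContract:
    target_segment: str
    hypothesis: str
    measurement_chain: tuple
    primary_signal: str
    guardrails: tuple
    minimum_detectable_effect: float
    rollback_authority: str


def contract_gaps(contract: LaunchContract) -> list:
    """Return human-readable gaps; an empty list means the contract is complete."""
    gaps = []
    missing = [s for s in REQUIRED_CHAIN if s not in contract.measurement_chain]
    if missing:
        gaps.append("measurement chain is missing: " + ", ".join(missing))
    if not contract.guardrails:
        gaps.append("no guardrails defined")
    if contract.minimum_detectable_effect <= 0:
        gaps.append("minimum detectable effect must be positive")
    if not contract.rollback_authority:
        gaps.append("no rollback authority named")
    return gaps
```

A contract that records engagement but not exposure would surface here as a gap before launch, which is exactly the ambiguity the measurement chain is meant to prevent.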

Turn the launch channel into a decision system

A launch channel becomes useful when it preserves context and decisions, not merely conversation. A practical setup is one channel named #launch-[feature], with its scope, service expectations, success metrics, dashboards, and rollout plan pinned. Product, engineering, data, support, and go-to-market stakeholders can then work from the same operational record.

Set up the channel before rollout begins:
1. Pin the launch contract: Include the hypothesis, eligible population, event definitions, primary signal, guardrails, rollout stages, owners, and links to live dashboards. A screenshot becomes stale; a governed dashboard remains inspectable.
2. Create stable work lanes: Use separate parent threads for metrics, incidents, enablement, and customer feedback. This gives each agent and human responder a predictable place to work without fragmenting the overall launch record.
3. Publish response expectations: State which questions the agent can answer immediately, which require a human owner, and how urgent operational issues are escalated. The agent should never make an urgent request look handled merely because it produced a fluent reply.
4. Keep a decision ledger: For every ramp, hold, pause, or rollback, record the timestamp, evidence considered, decision, rationale, approver, and next review point. This matters later when a stakeholder asks why exposure changed or when the team compares the result with the original hypothesis.
5. Require channel-visible handoffs: If a question moves to a data, engineering, or privacy owner, the agent should post the handoff and preserve the relevant query, definitions, filters, and context. Do not let direct messages become a shadow operating system.
Give every automated data answer a consistent shape:
- The direct answer, including the population and time window.
- The metric definition and denominator.
- The relevant cohort, segment, environment, and experiment variant.
- A link to the approved underlying data or dashboard.
- An as-of timestamp so readers know how fresh the result is.
- Any missing data, definition conflict, or limitation that changes interpretation.
- The named human owner when judgment or investigation is required.
An activation rate without its denominator, an environment, or a timestamp is not decision-grade evidence. A polished answer should not receive more trust than the data lineage beneath it. Make uncertainty visible instead of prompting the agent to conceal it behind a confident summary.
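The answer shape above can be enforced mechanically before an agent posts to the channel. This is a minimal sketch under assumed field names; `DataAnswer` and `missing_evidence` are hypothetical and do not correspond to any specific agent framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataAnswer:
    answer: str
    population: Optional[str] = None
    time_window: Optional[str] = None
    metric_definition: Optional[str] = None
    denominator: Optional[int] = None
    segment: Optional[str] = None
    environment: Optional[str] = None
    source_link: Optional[str] = None
    as_of: Optional[str] = None
    limitations: Optional[str] = None  # optional: only when something changes interpretation
    owner: Optional[str] = None        # optional: only when judgment is required


# Fields every automated answer must carry to count as decision-grade.
REQUIRED_FIELDS = (
    "population", "time_window", "metric_definition", "denominator",
    "segment", "environment", "source_link", "as_of",
)


def missing_evidence(ans: DataAnswer) -> list:
    """List required fields that are absent; a denominator of 0 still counts as present."""
    return [name for name in REQUIRED_FIELDS if getattr(ans, name) in (None, "")]
```

If `missing_evidence` returns anything, the agent can state the gap explicitly instead of posting a confident summary.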

Give agents narrow jobs and humans explicit authority

The safest launch architecture separates three jobs: retrieving data, operating rollout controls, and interpreting evidence. Combining them in one broadly empowered agent creates unnecessary risk. It also makes failures harder to diagnose because you cannot tell whether the problem came from a bad query, a bad recommendation, or an unauthorized action.

Use a data agent for retrieval and first-pass synthesis

Connect the data agent only to approved sources and metric definitions. It can answer repeatable questions such as activation by cohort, conversion by segment, latency by region, exposure by variant, or the movement of a named guardrail. It should provide citations and timestamps, then route questions requiring nuance to an owner while keeping the context in the thread.

Write the escalation boundary into its operating instructions. Escalate when metric definitions conflict, required data is unavailable, a query touches restricted information, the request asks for a causal conclusion that descriptive data cannot establish, or the answer would materially change rollout. The best response in those cases is not a guess. It is a precise statement of what is missing and who must resolve it.

Keep the feature-flag agent read-only by default

A flag agent can safely expose status by environment, current rollout allocation, and change history. That alone removes many repetitive questions. Write access is different: an incorrect production change can expose an unready experience, expand an incident, or remove access unexpectedly.

When you permit flag mutations, require an explicit sequence:
1. The requester names the feature, environment, target population, requested action, and reason.
2. The agent shows the current flag state and summarizes the evidence relevant to the request.
3. The authorized approver confirms the exact change. Approval cannot be inferred from an emoji, an ambiguous reply, or the agent’s own recommendation.
4. The integration performs only the approved action through constrained permissions.
5. The agent posts the resulting state, timestamp, requester, approver, rationale, and change-history link.
Do not give the agent a broad production credential merely because the chat interface is convenient. Restrict its access by environment and role, preserve an audit trail, and keep a manual rollback path available to the responsible engineer.
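The approval sequence can be sketched as a gate that refuses anything other than an explicit restatement of the change by an authorized person. Everything here is hypothetical (the `FlagChangeRequest` fields, the exact-match rule, the shape of the audit record); a real integration would call your flag system's own API with constrained permissions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlagChangeRequest:
    feature: str
    environment: str
    target_population: str
    action: str
    reason: str
    requester: str

    def summary(self) -> str:
        # The exact text an approver must restate to confirm the change.
        return f"{self.action} {self.feature} in {self.environment} for {self.target_population}"


def approve(request: FlagChangeRequest, approver: str, confirmation: str,
            authorized_approvers: set) -> dict:
    """Return an audit record only when an authorized human confirms the exact change."""
    # The agent itself should never appear in authorized_approvers.
    if approver not in authorized_approvers:
        raise PermissionError(f"{approver} is not authorized to approve flag changes")
    if confirmation.strip() != request.summary():
        raise ValueError("confirmation must restate the exact change; emoji or vague replies do not count")
    return {
        "change": request.summary(),
        "reason": request.reason,
        "requester": request.requester,
        "approver": approver,
        "approved_at": datetime.now(timezone.utc).isoformat(),
    }
```

The returned record is what the agent would post back to the channel alongside the resulting flag state and change-history link.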

Use a readout agent to maintain the launch narrative

Scheduled summaries prevent the team from rebuilding the same analysis for every stakeholder. A useful default is to publish readouts at T+1 hour, T+24 hours, and T+7 days, while adapting the questions to the product’s actual usage cycle:
- T+1 hour: Confirm that exposure is occurring as intended. Check instrumentation, operational performance, obvious anomalies, and incident status. This checkpoint is primarily about measurement and safety, not declaring growth success.
- T+24 hours: Review adoption and activation by the planned cohorts, early conversion movement where applicable, support themes, and any uneven behavior across important segments.
- T+7 days: Evaluate experiment results that have had time to mature, retention or repeated-value signals when the product cycle makes them observable, significant outliers, and the follow-up work needed to harden or revise the experience.
These checkpoints are operating cadences, not guarantees of statistical maturity. A feature used on a longer cycle may not produce a meaningful retention signal by the final checkpoint. The readout should say that plainly instead of treating missing maturity as neutral evidence.

Every readout should end with a decision or an explicit statement that no decision is yet warranted. It should also name the evidence still needed, the owner, and the next review point. A summary that lists charts but does not clarify the decision state creates more reading without reducing uncertainty.

Make the accountability map visible
- Product owns the behavioral hypothesis, the primary growth signal, and the recommendation to ramp, hold, or change direction.
- Engineering owns operational health, flag implementation, incident response, and safe rollback execution.
- Data owns metric definitions, instrumentation validity, experiment design, and interpretation limits.
- Support and go-to-market owners contribute customer feedback, readiness concerns, enablement status, and communication needs.
- Agents retrieve, summarize, route, and perform narrowly preauthorized steps. They do not approve their own consequential recommendations.
The governance layer is part of the product, not a final compliance check. Apply role-based access, protect personally identifiable information, require source citations, and retain transparent logs. Then monitor response accuracy, deflection, and time-to-answer through Agent Analytics. Deflection alone is a poor success metric: a confidently wrong response may reduce human questions while increasing decision risk. Review incorrect answers, unnecessary escalations, missed escalations, and stale data as carefully as response speed.

Run the rollout as a sequence of evidence gates

A feature flag is not merely a switch. It lets you separate deployment from exposure and turn a large release decision into a sequence of smaller, inspectable decisions. The appropriate rollout stages depend on the feature’s operational, privacy, and customer risk, so define them in advance rather than copying a universal percentage.

Use this operating sequence:
1. Preflight the measurement: Verify eligibility, exposure, activation, value, and guardrail events in the intended environment. Confirm that dashboards use the launch contract’s definitions and that the agent can retrieve the same governed numbers.
2. Release to the defined cohort: Use the flag to control who can receive the experience. Confirm actual exposure before interpreting engagement. Eligibility and exposure are different facts.
3. Inspect evidence at the scheduled gates: Start with instrumentation and safety, then move to activation, conversion, retention, and other downstream value signals as they become observable. Review the preselected segments before exploring unexpected cuts of the data.
4. Choose a named decision state: Ramp, hold, investigate, pause, or roll back. Record the evidence and rationale. Avoid vague states such as looking good, because they do not tell engineering what to do or stakeholders what has been decided.
5. Feed the learning back into the journey: Update onboarding, in-product guidance, targeting, positioning, or the feature itself based on the observed friction. A winning test becomes a growth mechanism only when the trigger, experience, and value-producing behavior can be repeated reliably.
Use a clear decision ladder:
- Ramp when measurement is trustworthy, guardrails remain inside the pre-agreed bounds, the evidence meets the decision rule, and customer-facing teams are ready for broader exposure.
- Hold when the system is healthy but the outcome has not matured enough to support a decision. State what evidence is pending and when it can reasonably be reviewed.
- Investigate when an anomaly, segment divergence, definition conflict, or instrumentation gap makes the aggregate result unreliable.
- Pause when continued exposure could obscure an incident, contaminate the test, or expand a customer problem while the team diagnoses it.
- Roll back when a material operational, privacy, safety, or customer guardrail crosses the boundary defined in the launch contract. Do not wait for the primary growth metric to mature before acting on a serious downside.
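The ladder can be written down as a small function so the rule exists before the data arrives. This is a sketch, not a prescribed policy: the boolean inputs are simplifications, and the precedence shown (serious downside first, growth evidence last) is an assumption you should replace with the order agreed in your own launch contract.

```python
def decision_state(*, guardrail_breached: bool, incident_open: bool,
                   measurement_trustworthy: bool, evidence_meets_rule: bool,
                   teams_ready: bool) -> str:
    """Map launch evidence to one named decision state."""
    if guardrail_breached:
        return "roll back"      # do not wait for the growth metric to mature
    if incident_open:
        return "pause"          # stop expanding exposure while diagnosing
    if not measurement_trustworthy:
        return "investigate"    # the aggregate result cannot be relied on yet
    if evidence_meets_rule and teams_ready:
        return "ramp"
    return "hold"               # healthy system, outcome not yet decision-ready
```

Because the function always returns one of the five named states, a readout can never end on "looking good".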
If the feature itself uses AI, measure the product experience separately from the operational agents supporting its launch. AI can provide intelligent nudges, next-best actions, and adaptive experiences while applying privacy-by-design and strong data governance. That creates at least four distinct questions: Was the user eligible? Was the AI experience delivered? Did the user engage with it? Did that engagement lead to value without violating a guardrail?

Logging only the final conversion makes those questions impossible to separate. A delivery problem, poor recommendation, confusing interaction, and weak value proposition can all produce the same downstream result. Preserve the path from eligibility through value, including the experience variant the user received. If targeting or adaptive behavior changes during the test, log the change and account for it in the interpretation.
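To see why the full path matters, consider step-by-step conversion between the four questions. The sketch below assumes you already have a user count per stage; the stage names and the idea of flagging the "weakest step" are illustrative choices for this example, not a standard method.

```python
STAGES = ("eligible", "delivered", "engaged", "value")


def step_rates(counts: dict) -> dict:
    """Conversion from each stage to the next; None when the earlier stage is empty."""
    rates = {}
    for prev, cur in zip(STAGES, STAGES[1:]):
        base = counts.get(prev, 0)
        rates[f"{prev}->{cur}"] = counts.get(cur, 0) / base if base else None
    return rates


def weakest_step(counts: dict):
    """Name the transition with the lowest observed conversion, if any."""
    observed = {k: v for k, v in step_rates(counts).items() if v is not None}
    return min(observed, key=observed.get) if observed else None
```

With counts of 1000 eligible, 400 delivered, 300 engaged, and 150 reaching value, the weakest step is eligibility to delivery, which points at a delivery problem and not at the recommendation quality. The final conversion number alone could not tell those apart.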

Do not confuse high initial use with a durable growth loop. Novelty can produce engagement without retained value. Look for the sequence that matters to your product: activation, repeated value, and then the relevant expansion, collaboration, or retention behavior. If the product has no natural invitation or sharing mechanic, do not force a viral story onto it. Build the loop around the behavior customers already have a reason to repeat.

Key takeaways
- Start with a launch contract that names the user, behavioral hypothesis, measurement chain, primary signal, guardrails, minimum detectable effect, decision rules, and accountable owners.
- Use one launch channel as the shared operational record, but separate metrics, incidents, enablement, and feedback into stable threads.
- Split agent responsibilities across data retrieval, flag operations, and scheduled readouts. Keep consequential actions approval-gated and auditable.
- Treat T+1 hour, T+24 hours, and T+7 days as decision checkpoints, not automatic declarations of success.
- Use feature flags to move through evidence gates. Ramp, hold, investigate, pause, or roll back according to rules written before the data arrives.
- Measure AI-powered experiences from eligibility through delivery, engagement, value, and guardrails so you can diagnose why growth did or did not move.
For your next launch, begin with a narrow operating slice: a completed launch contract, a structured channel, a data agent limited to approved queries, read-only flag visibility, and scheduled readouts. Review every wrong answer, escalation, and decision after the rollout. Expand the agent’s authority only when the evidence shows that the control system is trustworthy.

References
- Amplitude – My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts
- Amplitude – How I Orchestrate Growth & AI at Amplitude to Ignite Viral Product Engagement
May 11, 2026
How a Digital Analytics Visionary Shapes My Product Strategy for Growth, Retention & Monetization

Data has always been my compass for building products that customers love and businesses depend on. Few sentences distill that imperative as crisply as the one below—and it continues to inform how I prioritize, experiment, and scale outcomes across the roadmap.

Krista is a digital analytics leader, product strategist, and industry evangelist. She helps businesses use data to drive growth, retention, and monetization.

That mandate mirrors how I run product: leverage behavioral analytics to uncover patterns, translate those insights into hypotheses, and validate them through rigorous A/B testing. I start by instrumenting the user journey end to end, then use cohort analysis, funnel diagnostics, and retention analysis to pinpoint where activation, engagement, or monetization is stalling. From there, I map driver trees to connect inputs (feature adoption, time-to-value, onboarding friction) to outputs (retention, conversion, revenue), so every experiment has a clear line of sight to business impact.

On experimentation, I hold the bar high: define the minimum detectable effect (MDE) up front, ensure clean experiment design, and size samples to reduce noise. I combine Amplitude analytics with qualitative signals from continuous discovery to prioritize tests that move the needle, not just the vanity metrics. When a variant wins, I don’t stop at the lift—I track downstream effects on user activation, long-term retention, and monetization, ensuring we’re compounding gains rather than optimizing in silos.
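To make the link between MDE and sample size concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions with a two-sided test. The defaults (5% significance, 80% power) and the example baseline are illustrative assumptions, not numbers from my own experiments.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)
```

For a 10% baseline conversion and a 2-point absolute MDE, this comes out to a little under 4,000 users per arm. Halving the MDE roughly quadruples the requirement, which is why the MDE has to be decided before the test rather than after.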

For product-led growth, I focus on the moments that matter most: first-value, aha, and habit formation. Journey mapping helps me identify the shortest, clearest path to value, while targeted in-app experiences and contextual nudges accelerate activation without adding friction. Every iteration feeds a learning loop—measure, learn, and ship—so we can pursue step-change outcomes, not incremental tweaks.

Ultimately, the craft is in translating analytics into action. When teams can trace a feature idea to a specific behavioral pattern, test it with a well-powered A/B experiment, and observe durable improvements in retention and revenue, momentum takes care of itself. That’s how I operationalize data to deliver growth, retention, and monetization at scale.

Inspired by this post on Amplitude – Best Practices.

May 11, 2026
4 Costly Agent Analytics Myths—And the Data-Backed Metrics I Rely on Instead

In my work with product, operations, and support leaders, I’m often asked to help make sense of Agent Analytics—what to track, how to attribute outcomes, and where to invest. After reviewing countless dashboards and running experiments across human agents and AI agents, I’ve learned that some of the most common measurement beliefs are precisely the ones that lead teams astray.
The same questions come up whenever I talk with leaders about Agent Analytics, and not everything is what it seems.
Below, I unpack four pervasive myths I encounter and share the data-centered practices I use to replace them. My goal is simple: help you upgrade the way you measure performance so you can improve customer outcomes, accelerate learning, and scale impact with confidence.
Myth 1: “Lower average handle time (AHT) means higher performance.” AHT is useful but incomplete. When teams optimize solely for speed, they often push complexity into repeat contacts, reopens, or escalations. In the data, that shows up as a weak or negative relationship between lower AHT and durable outcomes like first contact resolution (FCR), customer effort, or revenue per conversation.
Reality and what I measure instead: I right-size speed by pairing AHT with intent-level resolution and recontact rate. For simple intents (password reset, billing address update), shorter is usually better. For complex intents (tiered troubleshooting, multi-step verification), “right-speeding” wins—slightly longer interactions that prevent rework. Practically, that means segmenting by intent complexity using behavioral analytics, tracking weighted “intent resolution rate,” and monitoring repeat-contact windows (24–168 hours) to catch downstream pain.
Myth 2: “AI agent containment tells the whole story.” A high containment rate can mask failure modes such as unresolved intent, silent abandonment, or low-quality handoffs that frustrate customers and spike human workload later.
Reality and what I measure instead: I break containment into three parts for voice and chat flows: (1) intent resolution without escalation, (2) graceful handoff quality when escalation is necessary, and (3) post-handoff efficiency and satisfaction. For voice AI agent experiences, I also track escalation clarity (did the transcript summarize history and intent?), time-to-human, and customer satisfaction on the combined interaction. This gives a fuller view of how effective the customer support AI strategy really is and avoids over-crediting automation for partial wins.
Myth 3: “Quality is subjective, so it can’t be measured at scale.” Teams often default to sporadic QA because they assume it can’t be standardized across channels or agent types. The result is noisy feedback loops and stalled coaching.
Reality and what I measure instead: Quality becomes measurable when it’s grounded in observable behaviors linked to outcomes. I use a rubric anchored in behavioral analytics (e.g., verified customer need, correct resolution path, policy compliance, empathy markers) and validate it via correlation with FCR, recontact, and retention analysis. To scale, I combine calibrated human reviews with AI-assisted scoring, check inter-rater reliability weekly, and use driver trees to connect quality levers to business results. This creates a consistent, coachable signal for both human agents and AI flows.
Myth 4: “If the dashboard is green after launch, we’ve won.” Early wins can reflect novelty effects, cherry-picked routing, or short-term incentives that don’t persist. Declaring victory too soon locks in fragile gains and hides regressions across cohorts.
Reality and what I measure instead: I treat go-live as the start of learning. I use A/B testing with a clear minimum detectable effect (MDE), stagger ramps, and hold out stable control cohorts for at least one full demand cycle. I write OKRs around outcomes rather than output, focusing on intent resolution, customer effort, and revenue and customer health over vanity metrics. I also monitor seasonality and channel mix shifts inside a unified analytics platform to ensure improvements generalize beyond the first week.
How I operationalize this day to day: (1) define intents and complexity upfront, (2) unify journey data across channels, (3) instrument resolution and recontact rigorously, (4) apply driver trees to isolate what actually moves outcomes, and (5) iterate via disciplined experiments rather than sweeping changes. This approach aligns product and operations, speeds up coaching, and ensures AI investments compound rather than decay.
If you’re rethinking your Agent Analytics stack, start by replacing each myth with a sharper metric: pair AHT with intent-level resolution, pair containment with handoff quality and satisfaction, pair QA with outcome-linked rubrics, and pair green dashboards with robust experiments. The payoff is a measurement system that earns trust, guides better decisions, and consistently improves customer and business results.
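The weekly inter-rater reliability check mentioned under Myth 3 can be as simple as Cohen's kappa between two reviewers scoring the same conversations. This is a minimal sketch; kappa is one common agreement statistic that I am using for illustration, and the "pass"/"fail" labels are placeholders for whatever your rubric produces.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Two reviewers who agree on three of four conversations can still land at a kappa of only 0.5 once chance agreement is removed, which is why raw agreement percentages tend to flatter a QA program.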

Inspired by this post on Pendo – Best Practices.

May 7, 2026

How to Link AI Evals to Retention Without Chasing Proxies

Your AI activation rate is rising. More users are reaching the agent, completing setup, or trying the workflow. Yet the retention curve is flat. That usually means you know who touched the product, but not who received enough value to return.

A higher aggregate eval score will not resolve that gap. You need to identify an AI quality signal that appears early, connect it to later behavior, and determine whether changing that signal can change retention. The result should influence onboarding, roadmap priorities, customer success, and model releases, not just add another chart to an eval dashboard.

Start with the retention decision, not the eval dashboard

The wrong opening question is: Which evals can the team measure? Start with: What must a user experience early enough that returning becomes the rational next action?

That framing forces you to define retention before searching for a predictive signal. A login is rarely enough. Choose a return behavior that represents recurring value: running another workflow, completing another meaningful task, or bringing the agent into an ongoing process. Then make five decisions explicit:

1. Define the eligible population. Decide whether you are studying newly activated users, newly activated tenants, or another clearly bounded cohort.
2. Choose the unit of analysis. Use the user when value is individual. Use the tenant or account when adoption and renewal depend on a shared workflow.
3. Name the retained behavior. It should represent renewed value, not passive presence.
4. Select the retention window. Weekly and monthly cohorts answer different product questions, so do not switch between them after seeing the result.
5. Close the observation period before the retention outcome begins. Otherwise, later behavior can leak into the feature that supposedly predicts it.
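The leakage rule in the last decision is easy to state in code: events feed the predictive features only if they fall before the observation cutoff, and the retention outcome is read only from the window that follows. The 7-day observation period and 28-day retention window below are placeholder values, and `split_events` is a hypothetical helper, not part of any analytics library.

```python
from datetime import datetime, timedelta


def split_events(activated_at: datetime, event_times: list,
                 observation_days: int = 7, retention_days: int = 28):
    """Separate events into the observation period and the later retention window."""
    observation_end = activated_at + timedelta(days=observation_days)
    retention_end = observation_end + timedelta(days=retention_days)
    # Features may only use events before the cutoff.
    observed = [t for t in event_times if activated_at <= t < observation_end]
    # The outcome is read only from the window that starts at the cutoff.
    outcome = [t for t in event_times if observation_end <= t < retention_end]
    return observed, outcome
```

Because the two windows never overlap, a return visit cannot be counted both as evidence of early quality and as the retention it is supposed to predict.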

This distinction matters when activation improves but retention does not. Activation proves that a user crossed a product milestone. It does not prove that the AI produced a trustworthy, complete, safe, and usable outcome. Your eval candidates should measure that missing experience.

Eval family	Question it should answer	When it deserves product attention
Semantic accuracy	Did the output correctly address the intended task?	Incorrect results prevent completion or make the user unwilling to rely on the agent again.
Containment	Did the agent complete the eligible workflow without an avoidable human escalation?	Escalation prevents the workflow from delivering repeatable automation.
Safety	Did the interaction remain within the product’s acceptable risk boundaries?	A regression creates unacceptable exposure, even if another engagement metric improves.
Latency	Did the result arrive fast enough for the user’s workflow?	Delay causes abandonment, repeated attempts, or a return to the previous process.
UX friction	Could the user reach a good outcome without unnecessary setup, retries, or corrections?	Users fail before they have a fair chance to experience the agent’s value.

Shortlist three to five candidates tied to these user outcomes. A long eval inventory makes analysis look comprehensive while weakening the decision. You are not trying to find every quality problem. You are looking for an early signal that is measurable, related to meaningful retention, and alterable through a product intervention.

Build an identity and time contract before modeling anything

The hardest part is usually not the statistical model. It is joining AI interactions to product behavior without duplicating records, losing users, or assigning an outcome to the wrong account. Evals often live in notebooks or model-observability systems while retention events live in product analytics. A plausible-looking join can still be wrong.

Create a data contract that covers both systems. At minimum, it should specify:

Stable user and tenant identifiers, including the rule used when a user belongs to more than one tenant.
The timestamp that determines whether an interaction belongs inside the observation period.
The model and workflow version associated with the interaction.
The conditions that make an interaction eligible for each eval.
The grain of the analysis table, such as one row per user-day or tenant-day.
The treatment of missing data, especially the difference between no eligible interaction and an evaluated failure.

That last distinction is easy to miss. A user who never invoked an eligible workflow did not fail the accuracy eval. Combining non-use and poor quality into the same value hides whether the retention problem comes from discovery, setup, or AI performance.

Compute daily per-user and per-tenant features rather than joining every raw interaction directly to every product event. Each feature should retain its denominator or exposure count. A pass rate without the number of eligible interactions can make sparse use look equivalent to sustained use.
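
A minimal sketch of that aggregation, assuming a hypothetical record shape of (user_id, day, eligible, passed). The field names and sample rows are illustrative and do not come from any particular eval or analytics system. It keeps the eligible count beside the pass rate and uses None, not zero, when a user had no eligible interaction.

```python
from collections import defaultdict
from datetime import date

# Hypothetical interaction records: (user_id, day, eligible, passed).
# `eligible` marks whether the interaction qualified for the accuracy eval.
interactions = [
    ("u1", date(2026, 1, 5), True, True),
    ("u1", date(2026, 1, 5), True, False),
    ("u2", date(2026, 1, 5), False, False),  # active, but no eligible workflow
    ("u3", date(2026, 1, 5), True, False),
]


def daily_accuracy_features(rows):
    """Aggregate to one row per user-day, keeping the denominator."""
    counts = defaultdict(lambda: {"eligible": 0, "passed": 0})
    for user_id, day, eligible, passed in rows:
        bucket = counts[(user_id, day)]  # the user-day exists even with nothing eligible
        if eligible:
            bucket["eligible"] += 1
            bucket["passed"] += int(passed)
    features = {}
    for key, c in counts.items():
        # None means "no eligible interaction", which is not a 0.0 pass rate.
        rate = c["passed"] / c["eligible"] if c["eligible"] else None
        features[key] = {"eligible": c["eligible"], "passed": c["passed"], "pass_rate": rate}
    return features


features = daily_accuracy_features(interactions)
```

Here u2 ends up with a pass rate of None while u3 ends up with 0.0, so non-use and evaluated failure stay separable in later cohort work.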

Keep the definition of each feature readable. Containment, for example, needs an explicit eligible-workflow denominator and an explicit rule for what counts as avoidable escalation. UX friction needs named events, such as a retry or correction, rather than an opaque composite score. If a product manager cannot explain how the feature changes, the team will struggle to turn it into a roadmap decision.

Watch for many-to-many joins. One AI interaction may generate several product events, and one product session may contain several AI interactions. Joining both raw tables can multiply rows and inflate success or failure counts. Aggregate each side to the agreed grain first, then join the resulting features to the retention cohort.
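
The row multiplication is easy to reproduce on a toy example. The sketch below assumes two hypothetical raw tables keyed by user and day, with made-up event names. It compares a naive raw join with aggregating each side to a user-day grain first.

```python
from collections import Counter

# Hypothetical raw tables: three AI interactions and two product events
# for the same user on the same day.
ai_interactions = [
    ("u1", "2026-01-05", "pass"),
    ("u1", "2026-01-05", "fail"),
    ("u1", "2026-01-05", "pass"),
]
product_events = [
    ("u1", "2026-01-05", "report_exported"),
    ("u1", "2026-01-05", "report_shared"),
]

# Naive raw join: every interaction pairs with every event on that user-day.
naive_rows = [
    (user, day, outcome, event)
    for user, day, outcome in ai_interactions
    for event_user, event_day, event in product_events
    if (user, day) == (event_user, event_day)
]
naive_passes = sum(1 for row in naive_rows if row[2] == "pass")

# Aggregate each side to the agreed grain first, then join on the key.
passes = Counter((user, day) for user, day, outcome in ai_interactions if outcome == "pass")
interactions = Counter((user, day) for user, day, _ in ai_interactions)
events = Counter((user, day) for user, day, _ in product_events)
joined = {
    key: {"interactions": interactions[key], "passes": passes[key], "events": events[key]}
    for key in interactions.keys() | events.keys()
}
```

The naive join produces six rows and counts four passes from two real ones. The aggregated join keeps one row per user-day with the true counts.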

Versioning also matters. If a model or workflow changes during the observation period, an account-level average can blend materially different experiences. Preserve the version so you can distinguish a real quality improvement from a change in traffic or segment mix.

Find a threshold that survives segment and leakage checks

Once the dataset is reliable, begin with cohort analysis rather than a complex predictive model. Compare retention among users who reached different levels of each candidate signal. You are looking for a separation that is large enough to matter, stable enough to repeat, and reachable through product changes.

Use this sequence:

Plot weekly or monthly retention against each early eval feature.
Use a driver tree to show where the feature sits between acquisition, activation, AI quality, repeat behavior, and the final retention outcome.
Fit a simple logistic model that controls for plan type, segment, region, and acquisition channel.
Repeat the analysis inside important segments instead of relying only on the blended population.
Check whether the threshold remains directionally useful when you vary the observation definition without allowing it to overlap the outcome.

The controls are not statistical decoration. Higher-plan customers may have better implementation support. One region may contain a different account mix. A high-intent acquisition channel may produce both better agent usage and better retention. Without those checks, customer composition can masquerade as model improvement.
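
A small fabricated cohort shows how composition can create the whole effect. The rows below are invented for illustration, with plan type standing in for any control variable. The blended view shows a large retention gap, and the within-plan view shows none.

```python
from collections import defaultdict

# Hypothetical cohort rows: (plan, crossed_threshold, retained).
cohort = (
    [("enterprise", True, True)] * 80 + [("enterprise", True, False)] * 20
    + [("enterprise", False, True)] * 16 + [("enterprise", False, False)] * 4
    + [("starter", True, True)] * 4 + [("starter", True, False)] * 16
    + [("starter", False, True)] * 20 + [("starter", False, False)] * 80
)


def retention_by_group(rows, key):
    """Return {group: (retained, total, rate)} for the grouping function `key`."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = totals[key(row)]
        bucket[0] += int(row[2])
        bucket[1] += 1
    return {group: (r, n, r / n) for group, (r, n) in totals.items()}


# Blended view: group only by whether the user crossed the threshold.
blended = retention_by_group(cohort, key=lambda row: row[1])
# Stratified view: group by plan and threshold together.
by_plan = retention_by_group(cohort, key=lambda row: (row[0], row[1]))
```

In this toy data the blended rates are 70 percent and 30 percent, yet enterprise accounts retain at 80 percent and starter accounts at 20 percent on both sides of the threshold. A logistic model with controls asks the same question with more variables at once.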

In one product context, users who crossed a specific eval threshold early retained at three times the rate of peers who did not. That is evidence that an eval can become a commercially useful leading indicator. It is not a universal benchmark. Your threshold, effect size, eligible population, and retention behavior will depend on your product.

Do not choose the threshold merely because it creates the largest visual gap. Prefer a boundary that has enough eligible users on both sides, persists across relevant segments, and corresponds to an experience the product can influence. A dramatic ratio from a small cohort is a hypothesis, not a roadmap mandate.
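
One way to enforce the cohort-size rule is to scan candidate thresholds and discard any that leave too few users on either side. This is a sketch on invented data. The MIN_COHORT floor and the candidate values are placeholders, not recommendations.

```python
# Hypothetical user rows: (early_pass_rate, retained).
users = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.65, False), (0.60, True), (0.50, False),
    (0.40, False), (0.30, False), (0.20, True), (0.10, False),
]

MIN_COHORT = 4  # illustrative floor; derive yours from a power calculation


def scan_thresholds(rows, candidates, min_cohort):
    """Report retention on each side of a threshold, skipping thin cohorts."""
    results = []
    for t in candidates:
        above = [kept for score, kept in rows if score >= t]
        below = [kept for score, kept in rows if score < t]
        if len(above) < min_cohort or len(below) < min_cohort:
            continue  # a dramatic gap from a tiny cohort is only a hypothesis
        results.append({
            "threshold": t,
            "n_above": len(above),
            "n_below": len(below),
            "retention_above": sum(above) / len(above),
            "retention_below": sum(below) / len(below),
        })
    return results


report = scan_thresholds(users, candidates=[0.25, 0.55, 0.85, 0.95], min_cohort=MIN_COHORT)
```

Only the 0.55 candidate survives here. The others leave fewer than four users on one side, whatever gap they appear to show. Surviving candidates still need the segment and leakage checks.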

Run an explicit leakage review before presenting the result. Common forms include an eval feature calculated after the retention window begins, an account-health field that already contains renewal information, or a usage feature whose value can only rise when the user returns. Leakage can make a weak signal look uncannily predictive.
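
The first form of leakage can be checked mechanically. The sketch below assumes hypothetical window boundaries and a made-up (user_id, event_time, feature_name) record. It separates feature inputs that fall inside the observation window from those recorded after it closes.

```python
from datetime import datetime

# Hypothetical window boundaries for one signup cohort.
OBSERVATION_END = datetime(2026, 1, 15)  # feature window closes here
RETENTION_START = datetime(2026, 1, 15)  # outcome window opens here

# Hypothetical feature inputs: (user_id, event_time, feature_name).
feature_events = [
    ("u1", datetime(2026, 1, 3), "accuracy_eval"),
    ("u1", datetime(2026, 1, 20), "accuracy_eval"),  # recorded after the outcome window opened
    ("u2", datetime(2026, 1, 14), "containment_eval"),
]


def split_leaky(events, observation_end):
    """Separate events inside the observation window from ones that would leak."""
    clean = [e for e in events if e[1] < observation_end]
    leaky = [e for e in events if e[1] >= observation_end]
    return clean, leaky


# The two windows must not overlap before any feature is computed.
assert OBSERVATION_END <= RETENTION_START, "observation window overlaps the outcome window"
clean_events, leaky_events = split_leaky(feature_events, OBSERVATION_END)
```

A timestamp filter will not catch the other two forms. A health field that already encodes renewal, or a usage counter that only rises on return, needs a manual review of how each feature is defined.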

The decision artifact should show the cohort definition, feature window, retention window, cohort sizes, effect estimate, control variables, and segment sensitivity together. If the threshold only works for a particular plan or acquisition channel, say so. A narrow, honest signal is more actionable than a broad result that disappears when the mix changes.

Use experiments to separate a predictor from a product lever

A predictive eval signal is not automatically causal. Sophisticated users may configure the agent better, choose easier workflows, or persist through early friction. Their higher eval scores and higher retention may share the same cause. Improving the score will not necessarily reproduce their behavior.

Convert the signal into a testable product intervention:

Choose an intervention that can move the signal during the early observation period. Depending on the failure, that could be an in-app guide, a product tour, a setup change, or a model change behind a feature flag.
Keep the threshold definition fixed for the experiment. Redefining success after seeing the result turns the test into another exploratory analysis.
Predefine the retained behavior, retention window, target population, and second-order guardrails.
Use a minimum detectable effect calculation to determine whether the experiment can answer the question with the available population.
Run an A/B test where randomization is practical. Measure whether the intervention moves the eval signal and whether that movement is followed by the intended retention lift.
Inspect results by the same segments used in the observational analysis. A blended win can hide a regression for a strategically important group.
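
For the minimum detectable effect step, a rough sizing calculation is often enough to tell you whether the test is feasible. This sketch uses the standard normal approximation for a two-sided comparison of two proportions. The baseline and lift are illustrative inputs, not figures from any product.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of `mde`.

    Uses the normal approximation for a two-sided test of two proportions.
    """
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Illustrative inputs only: 20% baseline retention, 5-point absolute lift.
needed = sample_size_per_arm(baseline=0.20, mde=0.05)
```

With these inputs the estimate is roughly 1,100 users per arm. If your eligible population in the target segment is smaller than that, you need a larger effect, a longer test, or a narrower question.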

This creates a necessary chain of evidence: the intervention changed the early experience, the early eval feature moved, and the retention outcome moved in the expected direction. If retention improves without movement in the eval, your intervention may work through another mechanism. If the eval improves without retention, the signal is not yet a proven growth lever.

Treat safety differently from an ordinary optimization metric. A retention increase cannot compensate for an unacceptable safety regression. Use risk scoring to gate exposure, keep model changes behind feature flags until the required evals pass, and monitor anomalies in both the score and its eligible volume. A stable percentage on a collapsing sample is not stability.
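
A gate that watches only the percentage will miss a collapsing sample. The sketch below checks the pass rate and the eligible volume separately. The function name and both tolerances are assumptions chosen for illustration.

```python
def release_gate(prev, curr, max_rate_drop=0.02, max_volume_drop=0.30):
    """Flag a regression in either the eval pass rate or its eligible volume.

    `prev` and `curr` are (passed, eligible) counts for comparable periods.
    The tolerances are placeholders, not recommended values.
    """
    prev_passed, prev_eligible = prev
    curr_passed, curr_eligible = curr
    if prev_eligible == 0 or curr_eligible == 0:
        return ["no eligible volume: pass rate is undefined"]
    alerts = []
    prev_rate = prev_passed / prev_eligible
    curr_rate = curr_passed / curr_eligible
    if prev_rate - curr_rate > max_rate_drop:
        alerts.append("pass rate dropped")
    if (prev_eligible - curr_eligible) / prev_eligible > max_volume_drop:
        alerts.append("eligible volume collapsed")
    return alerts


# Same 98% pass rate, but on a tenth of the previous volume.
alerts = release_gate(prev=(980, 1000), curr=(98, 100))
```

The example passes a rate-only check and still raises an alert, because the eligible volume fell by 90 percent.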

Track support tickets, NPS, and Net Recurring Revenue alongside the primary retention result. These measures operate on different timelines, but they help catch proxy optimization. An intervention that pushes users across an eval threshold while increasing support burden or degrading customer sentiment has not produced a clean product win.

Separate the user-level and release-level uses of the signal. A user-level signal can trigger onboarding or customer-success help when a new account has not reached the value threshold. A release-level eval can prevent a model change from expanding when quality falls. Combining both into one vague health score makes ownership and response unclear.

Put the winning signal into the product operating system

The analysis matters only when it changes what happens next. Give the signal a definition, an owner, an intervention, and a response to regression.

For onboarding, guide new users toward the workflow conditions associated with crossing the threshold. Do not merely show them where the AI button is.
For customer success, add the signal to a health score only when the team has a specific action to take. A warning without a playbook creates dashboard noise.
For roadmap planning, require proposed work to identify which eval feature it should move, why that feature connects to retention, and how the effect will be tested.
For model releases, keep exposure controlled with feature flags until the relevant eval improves without violating safety or experience guardrails.
For monitoring, use anomaly detection on the eval value, eligible interaction volume, and important segments so a blended average does not conceal a regression.

This operating model also clarifies ownership. Product owns the intervention and decision. Data science owns the validity of the feature and analysis. Engineering owns reliable instrumentation and release controls. Customer success owns the response when an account-level signal indicates missed value. Those responsibilities can be distributed differently in your organization, but none should be implicit.

Key takeaways

Define the retained behavior, population, unit, and time window before selecting an eval.
Shortlist three to five eval candidates that describe real user value: accuracy, containment, safety, latency, or UX friction.
Aggregate reliable daily features with stable user and tenant identifiers before joining them to product cohorts.
Use cohort analysis, driver trees, and simple controlled models to find a predictive threshold, then check sample size, segment mix, and label leakage.
Use an A/B test to learn whether a product intervention can move both the eval signal and retention.
Operationalize a validated signal through onboarding, customer success, release gates, feature flags, and anomaly detection.

At your next product review, bring a short decision sheet with the retained behavior, observation window, no more than five candidate evals, join keys, and the first intervention you can test. If the team cannot fill in those fields, fix the analytics contract first. If it can, run the smallest credible experiment and let retained behavior, not a prettier eval dashboard, decide the roadmap.

References

Amplitude — The Surprising Eval Signal That Tripled Retention: How I Connected AI Evals to Product KPIs

May 7, 2026

AI Product Growth: A Strategy and Execution Operating System
You have an AI capability that demos well, yet its growth story is still unclear. Some users try it once. The team debates model quality. The roadmap fills with features, while the link to activation, retention, or revenue remains an assumption.

You can fix that by managing AI growth as a measurable path from user intent to trusted value, repeated behavior, and a business outcome. That path tells you where growth is breaking, which experiment to run next, and whether a more capable model would solve the problem at all.

Build the growth thesis as a measurable chain

AI products invite feature-shaped goals: launch a copilot, add an agent, improve the prompt, or introduce recommendations. Those goals describe output. They do not tell you whose behavior should change or why the change matters to the business.

In my product strategy work at HighLevel, I use a simple test: if a roadmap item cannot name the user behavior it should change and the business lever that behavior affects, it is not yet a growth strategy. A North Star and its driver tree force that connection into the open.

Build your driver tree from right to left. Start with the business outcome, identify the customer behavior that can produce it, and then identify the AI-assisted moments that can change that behavior. This order prevents model capabilities from dictating the roadmap.
1. Name the segment. Choose a group with a shared job and context. New administrators setting up an account are more useful than all users because their intent, constraints, and success event can be observed.
2. Define the value moment. State what the user can do after the AI interaction that was difficult before it. An answer displayed is not a value moment. A configured workflow, resolved issue, completed analysis, or approved action can be.
3. Select the behavior change. Decide whether you need more users to reach first value, reach it sooner, repeat a valuable workflow, or adopt an additional capability.
4. Connect the behavior to one growth mechanism. Activation, retention, and expansion require different product decisions. Choose a primary mechanism for the bet instead of claiming that one feature will improve all three.
5. Add quality and trust guardrails. Relevance, correctness, abandonment, corrections, unauthorized actions, privacy exposure, and recoverability can invalidate an apparent growth win.
A practical AI growth equation is: eligible users multiplied by the rates of discovery, first successful use, repeat successful use, and downstream conversion. You do not need to treat the equation as a financial model. Use it to locate the weakest link. More traffic will not repair poor first-use success, and better answers will not create growth if eligible users never discover the capability.
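
The equation can be sketched with stage counts. The numbers and stage names below are hypothetical. Treating the lowest step rate as the weakest link is a starting heuristic, since the lowest rate is not always the most fixable one.

```python
# Hypothetical stage counts for one segment over one period.
funnel = [
    ("eligible", 10_000),
    ("discovered", 4_000),
    ("first_success", 1_000),
    ("repeat_success", 600),
    ("converted", 300),
]


def step_rates(stages):
    """Return the conversion rate of each step relative to the step before it."""
    return [
        (f"{prev_name}->{name}", count / prev_count)
        for (prev_name, prev_count), (name, count) in zip(stages, stages[1:])
    ]


rates = step_rates(funnel)
weakest_step, weakest_rate = min(rates, key=lambda item: item[1])

# Eligible users times every step rate reproduces the final count.
projected = funnel[0][1]
for _, rate in rates:
    projected *= rate
```

In this example the weakest step is discovered to first_success at 25 percent. That points toward first-use success, not toward more traffic.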

Turn the thesis into a one-page decision document before adding projects to the roadmap. It should contain:
- The target segment and the high-value job it is trying to complete.
- The current friction, supported by behavioral evidence or customer discovery.
- The proposed AI intervention and why AI is necessary for this step.
- The primary behavioral outcome and its baseline.
- The activation, retention, or expansion lever that outcome should affect.
- The leading indicators that can move before the business outcome does.
- The quality, reliability, and trust guardrails that must not deteriorate.
- The assumptions that would cause you to stop, narrow, or redesign the bet.
Your outcome statement can follow this form: increase a named behavior for a named segment by improving a specific driver, without degrading named guardrails. Supply the target only after you have a baseline and know what change your measurement system can detect. A target chosen for presentation value is not a strategy.

A worked example: turn AI search into an activation path

Consider a SaaS administrator who searches for help while configuring a workflow. A search team could optimize result clicks and declare success. A growth team traces the job further: query submitted, useful guidance received, setup started, workflow activated, and the workflow used successfully.

If result clicks rise but completed setups do not, the team improved engagement with search rather than activation. If setup completion rises but repeat use does not, the next constraint may be workflow value, onboarding, or the quality of the initial configuration. The query-to-outcome path makes those distinctions visible.

This is why every AI growth bet needs an explicit endpoint. The endpoint is not the response. It is the valuable behavior the response enables.

Choose an intent wedge before choosing the AI experience

A broad assistant usually creates a broad measurement problem. It serves unrelated intents, carries different failure costs, and leaves the team unable to explain why adoption changed. Start with an intent wedge: a narrow set of related requests from one segment, encountered at a meaningful point in its journey.

A strong first wedge has several useful properties:
- The job recurs. Repetition gives the product a chance to create a habit or reduce recurring friction.
- The current path is observable. You can see where users search, abandon, ask for help, switch tools, or fail to complete the workflow.
- Success is verifiable. The product can observe a completed action or downstream outcome instead of relying only on a positive reaction to the answer.
- The job is close to value. Improving it can plausibly affect activation, retention, or expansion.
- The failure is recoverable. Early versions should avoid irreversible or high-cost autonomy when a suggestion, preview, or confirmation can solve the same problem safely.
- The scope is evaluable. The team can assemble representative intents and define what an acceptable response or action looks like.
Use continuous discovery and journey mapping to find that wedge. Review behavioral funnels and query logs, then speak with users who completed the job, abandoned it, and avoided the AI experience entirely. The last group matters because usage data cannot explain a discovery problem among people who never entered the funnel.

Capture each candidate in an opportunity card. Record the segment, trigger, intent in the user’s own language, current workaround, failure consequence, next valuable action, available evidence, trust constraints, and outcome metric. This keeps prioritization centered on customer work rather than the novelty of a model capability.

When comparing opportunities, do not collapse everything into one unexplained score. Look separately at the strength of the evidence, proximity to a growth outcome, frequency of the job, severity of the friction, ability to measure success, and cost of a wrong answer or action. A high-frequency request with no meaningful downstream behavior may be less valuable than a narrower request sitting directly before activation.

Match autonomy to the evidence you have

AI product teams often jump from static software to autonomous agents in one roadmap step. A safer growth path increases autonomy only when the preceding level has demonstrated reliable value.
1. Visibility. Capture and classify what users are trying to accomplish. This exposes unmet demand before you automate anything.
2. Retrieval and explanation. Return relevant, grounded information that helps the user make the decision. A retrieval-first approach is often the cleanest starting point because the evidence and failure points are easier to inspect.
3. Recommendation. Suggest a next action using the user’s context, while keeping the decision with the user.
4. Guided or agentic execution. Prepare or perform a multi-step workflow with appropriate permissions, confirmation, observability, and recovery.
Move up a level when the current experience has repeat use, its major failure classes are understood, and the next level removes a documented point of friction. Do not add agency merely because the model can call tools. An agent that takes the wrong action creates a more serious problem than a search result that fails to earn a click.

The decision rule is straightforward: use the least autonomous experience that can produce the target behavior. This makes learning cheaper, limits risk, and shows whether users want the job completed before you invest in completing it on their behalf.

Instrument the full path from interaction to revenue

You cannot manage AI growth from a usage count. Monthly users of an AI feature can rise because of novelty, forced exposure, or repeated failure. Instrument a closed loop that connects intent, system behavior, user response, task completion, and the relevant business outcome.

A useful event spine contains the following stages:
1. Eligibility and exposure: Was the right user able to discover the capability at the right moment?
2. Intent: What job was the user trying to complete, and how was that intent classified?
3. Response: Which result, recommendation, or planned action did the system produce?
4. User judgment: Did the user select, accept, edit, reject, retry, or abandon it?
5. Execution: Did the user or agent start and complete the intended workflow?
6. Value: Did the product observe the success event defined in the growth thesis?
7. Business outcome: Did the relevant account activate, retain, expand, or contribute to Net Recurring Revenue through the defined path?
Concrete event names make implementation reviews easier. Depending on the experience, you might use events such as ai_query_submitted, ai_answer_shown, ai_recommendation_accepted, ai_action_started, ai_action_completed, ai_answer_corrected, and ai_flow_abandoned. The exact naming convention matters less than preserving the sequence and using it consistently.

Attach the properties needed to diagnose changes: segment, intent class, entry point, answer type, retrieval or ranking version, model and prompt version, experiment assignment, content identifiers, action type, completion status, and failure class. Carry account and customer identifiers into downstream systems only under your approved privacy and data-governance rules.
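
A lightweight validation step can catch missing diagnostic properties before events reach your analytics tool. The event names below are the examples given above. The required-property set and the sample values are assumptions for illustration, not an official schema.

```python
# Event names come from the examples above; the required-property set is an
# assumption chosen for illustration, not an official schema.
REQUIRED_PROPERTIES = {
    "segment", "intent_class", "entry_point",
    "model_version", "prompt_version", "experiment_assignment",
}
KNOWN_EVENTS = {
    "ai_query_submitted", "ai_answer_shown", "ai_recommendation_accepted",
    "ai_action_started", "ai_action_completed", "ai_answer_corrected",
    "ai_flow_abandoned",
}


def validate_event(name, properties):
    """Return a list of problems; an empty list means the event is well formed."""
    problems = []
    if name not in KNOWN_EVENTS:
        problems.append(f"unknown event: {name}")
    missing = sorted(REQUIRED_PROPERTIES - set(properties))
    if missing:
        problems.append("missing properties: " + ", ".join(missing))
    return problems


# This event omits the version and experiment properties.
issues = validate_event(
    "ai_answer_shown",
    {"segment": "new_admin", "intent_class": "workflow_setup", "entry_point": "help_search"},
)
```

Running a check like this in an implementation review or a CI step keeps version and experiment properties from disappearing from new events.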

Raw prompts and conversations can contain personal, confidential, or commercially sensitive information. Logging them by default can create exposure that outlives the experiment. Define redaction, retention, access control, and deletion rules before broad collection. If a diagnostic goal can be met with a classified intent or structured error code, do not retain the raw text merely because it may be useful later.
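
One way to make that default concrete is to build the log record so that raw text is excluded unless a caller explicitly opts in. The function, field names, and sample values here are hypothetical. A real implementation would tie the opt-in to your approved retention policy.

```python
def build_log_record(raw_prompt, intent_class, error_code=None, keep_raw_text=False):
    """Build the record that leaves the request path.

    Raw text is dropped unless an approved policy explicitly opts in.
    """
    record = {
        "intent_class": intent_class,
        "error_code": error_code,
        "prompt_length": len(raw_prompt),  # coarse diagnostic that carries no content
    }
    if keep_raw_text:
        record["raw_prompt"] = raw_prompt
    return record


# The classified intent and error code are kept; the prompt text is not.
record = build_log_record(
    "Why did invoice 4471 for Acme fail to send?",
    intent_class="billing_troubleshooting",
    error_code="RETRIEVAL_EMPTY",
)
```

The stored record still tells you what kind of request failed and why. The customer name and invoice number never leave the request path.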

Organize the resulting metrics into five views:
- Reach: eligible users, exposure, discovery, and first use.
- Experience: acceptance, correction, retry, abandonment, and progression to the next step.
- Task value: successful workflow completion and time to the defined value event.
- Repeat value: return use for the same job and successful use across relevant workflows.
- Business impact: activation, retention, expansion, and revenue outcomes for the target segment.
Sentiment can help you locate frustration, but it should not become the success metric. A polite response can still be wrong, and a frustrated user can still complete the task. Pair inferred sentiment with observable behavior such as correction, abandonment, repeated queries, completion, and downstream product use.

Revenue attribution needs similar discipline. Connect experiment exposure and product behavior to the CRM or revenue system, choose an attribution window that matches the natural decision cycle, and distinguish influenced revenue from causally demonstrated lift. Users who voluntarily adopt an AI capability may already be more engaged, so a dashboard correlation does not prove that AI caused their retention or expansion.

This distinction changes roadmap decisions. Behavioral analytics can reveal where the path is breaking. Controlled experiments are needed when you want to know whether fixing that point changes behavior or dollars.

Turn evaluation and experiments into a delivery system

AI growth execution needs two evidence gates. Offline evaluation asks whether the system performs the intended task well enough to expose safely. Online experimentation asks whether the experience changes customer behavior. Passing one gate does not imply that you will pass the other.

Gate releases with eval-driven development

Build the first evaluation set from real intents in the chosen wedge. Cover common requests, ambiguous requests, known failures, and cases where a wrong answer or action carries a higher cost. Preserve segment and intent labels so an average score cannot hide a severe failure in an important slice.
1. Write the rubric before tuning. Define what must be true for the response or action to pass: correct intent, relevant evidence, accurate guidance, appropriate next step, permitted action, and recoverability where needed.
2. Separate failure classes. Coverage, retrieval, generation, interaction design, tool execution, permissions, and policy failures need different fixes.
3. Version the system. Record the model, prompt, retrieval configuration, content version, tools, and policy configuration associated with each result.
4. Review performance by slice. Inspect high-value intents and high-consequence failures instead of relying only on the aggregate.
5. Keep human review where judgment is material. Automated scoring can accelerate evaluation, but uncertain or consequential cases still need an accountable review path.
6. Promote real failures into the eval set. Production corrections and abandoned workflows should make future regressions easier to catch.
Do not treat every low score as a prompt problem. If the required information is absent, fix content or data coverage. If the right material exists but is not retrieved, fix retrieval or ranking. If the answer is accurate but users cannot act on it, fix interaction design or the handoff into the workflow. Prompt iteration cannot compensate for every layer of the system.
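
Reviewing performance by slice, as step 4 recommends, takes only a few lines once results carry an intent label. The slices and counts below are invented to show how a healthy aggregate can hide a failing high-consequence slice.

```python
from collections import defaultdict

# Hypothetical eval results: (intent_slice, passed).
results = (
    [("how_to_question", True)] * 90 + [("how_to_question", False)] * 2
    + [("permission_change", True)] * 2 + [("permission_change", False)] * 6
)


def pass_rate_by_slice(rows):
    """Return {slice: (passed, total, rate)} so the aggregate cannot hide a slice."""
    counts = defaultdict(lambda: [0, 0])
    for slice_name, passed in rows:
        counts[slice_name][0] += int(passed)
        counts[slice_name][1] += 1
    return {name: (p, n, p / n) for name, (p, n) in counts.items()}


overall = sum(passed for _, passed in results) / len(results)
by_slice = pass_rate_by_slice(results)
```

The overall pass rate is 92 percent while the permission_change slice passes only 25 percent of the time. Keeping the count beside each rate also shows that the weak slice rests on just eight cases.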

Use online experiments to answer growth questions

Once the experience passes its release gate, write an experiment card that another product leader could audit without attending the planning meeting.
- The target segment and eligibility rule.
- The behavior you expect to change and the mechanism behind that change.
- The control and variant, including model, prompt, ranking, content, or interaction differences.
- One primary behavioral outcome tied to the growth thesis.
- Quality, reliability, cost, privacy, and business guardrails relevant to the change.
- The randomization unit, especially when users within the same account can affect one another.
- The minimum detectable effect and the sample required to evaluate it.
- The natural usage or buying cycle the test must observe.
- The decision rule for shipping, iterating, narrowing, or stopping.
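
When users in the same account can affect one another, the randomization unit should be the account. A minimal sketch, assuming hypothetical account identifiers and a made-up experiment key, hashes the account so that every user in it lands in the same arm.

```python
import hashlib


def assign_variant(account_id, experiment_key, variants=("control", "variant")):
    """Deterministically assign every user in an account to the same arm."""
    digest = hashlib.sha256(f"{experiment_key}:{account_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Hypothetical users; the account, not the user, is the randomization unit.
users = [("u1", "acct_a"), ("u2", "acct_a"), ("u3", "acct_b")]
assignments = {user: assign_variant(account, "ai_search_v2") for user, account in users}
```

Including the experiment key in the hash keeps assignments independent across experiments. If you randomize by account, the sample-size calculation should count accounts, not users.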
Good early tests isolate a decision. Compare retrieval or ranking approaches when users cannot find the right information. Compare concise and detailed answer formats when comprehension or action is the constraint. Test prompt variants when the failure is genuinely in instruction following or response construction. Test a guided workflow against a static answer when users understand the answer but fail at the next step.

A click is an acceptable primary metric only when the click itself represents value. Otherwise, measure the completed behavior downstream. A compelling answer that produces no useful action is engagement without growth.

Run the first 90 days around evidence, not launch theater

A 30-60-90 operating sequence gives the team enough structure to create momentum while preserving room to learn.
1. Days 1-30: establish the truth. Select the segment and intent wedge. Baseline the driver tree. Map the current journey. Audit events and data access. Build the initial evaluation set. Define privacy and permission constraints. Write the first growth thesis and identify the assumptions most likely to break it.
2. Days 31-60: ship the smallest complete path. Release a thin experience behind a feature flag. Instrument the full event spine. Run offline evaluations and the first controlled experiment. Review failed intents and abandoned workflows every week. Fix the largest diagnosable constraint rather than adding adjacent features.
3. Days 61-90: prove, prune, and scale. Aim to land two or three measurable wins, remove low-signal bets, and decide whether the wedge deserves more distribution, deeper personalization, or greater autonomy. Standardize the operating cadence only after the team has learned which reviews lead to decisions.
The product trio should co-own problem framing and solution shaping. Product owns the growth thesis and trade-offs. Design owns comprehension, control, feedback, and recovery in the interaction. Engineering owns system behavior, instrumentation, reliability, and safe delivery. Data science should help design evaluations, experiments, and attribution. Customer-facing teams should validate whether the job and value proposition match the language customers actually use.

Use a cadence that matches the decision:
- Weekly: inspect intent coverage, failure classes, corrections, abandonment, and unexpected trust issues.
- Every two-week sprint: ship a testable improvement, review the evidence, and update the decision record.
- Monthly: review the driver tree, experiment portfolio, business impact, costs, and cross-functional blockers.
- Quarterly: reset outcomes, stop bets that have lost their evidence, and fund the next constraint in the growth path.
Feature flags, CI/CD, and observability are growth infrastructure because they reduce the cost and risk of learning. They let you separate code deployment from customer exposure, compare variants, detect regressions, and reverse a problematic release. Privacy-by-design, data governance, and observability should be release requirements rather than work deferred until scale.

Watch for five failure modes
- The roadmap is organized by AI features. Reorganize it around segments, intents, and outcomes so multiple solution types can compete for the same problem.
- One quality score hides the system. Break performance into coverage, retrieval, generation, interaction, execution, and policy slices so the owner of the next fix is clear.
- The team optimizes prompts while the funnel is broken elsewhere. Locate the failing step before choosing the technical intervention.
- Autonomy arrives before trust. Start with visibility, retrieval, or recommendations, then increase agency when reliable task completion and recovery have been demonstrated.
- AI usage becomes a vanity metric. Keep the primary outcome downstream of the interaction and tie it to activation, retention, or expansion.

Key takeaways
- An AI growth strategy must connect a defined segment and intent to a valuable behavior and one primary business lever.
- Start with a narrow intent wedge whose success is observable, repeatable, and close to activation, retention, or expansion.
- Use the least autonomous experience that solves the documented friction, then earn the right to add agency.
- Instrument eligibility, intent, response, user judgment, execution, value, and business outcome as one path.
- Use offline evaluations to manage quality and controlled experiments to establish behavioral or revenue impact.
- Treat feature flags, observability, privacy, and data governance as part of the growth system.
- Review failures weekly, ship testable improvements in two-week sprints, and prune bets that do not change customer behavior.
Start with one segment, one recurring intent, and one outcome. Trace the current path event by event, identify its weakest link, and write the smallest experiment that can test your explanation. That is enough to turn AI growth from a feature campaign into a learning system.

References
- Shivam.Consulting Blog – Make AI Search Count: Convert Every Query into Revenue with Visibility, Sentiment, and Action
- Shivam.Consulting Blog – Principal Product Manager Playbook: Strategy, Leadership, and Execution That Scales
April 29, 2026

Amplitude AI Product Analytics: A Practical Agent Playbook

You are deciding whether Amplitude Agents deserve a place in your product operating system. A fluent answer or polished insight is easy to admire. The harder question is whether the agent helps someone make a better decision, complete a valuable task, or change user behavior.

That distinction determines how you should instrument, evaluate, and roll out the experience. Treat Amplitude as the measurement spine connecting agent activity to funnels, cohorts, experiments, retention, and product outcomes. Otherwise, you will know that the agent was used without knowing whether it was useful.

Pick a workflow with an observable finish line

Do not begin with a broad ambition such as helping everyone understand the data. It cannot be measured cleanly, and it gives the agent too much room to produce plausible output without resolving a real job.

The useful standard is that AI product management remains accountable for helping teams build better products. The agent response is therefore an intermediate output, not the outcome. A strong starting point is one narrowly scoped, high-signal workflow with an unambiguous done state.

Write a workflow contract before configuring dashboards or prompts:

User: Name the role doing the work, such as a product manager investigating onboarding friction.
Trigger: Describe the event that makes the job necessary, such as a drop in activation or an unexpected cohort difference.
Bounded job: State exactly what the agent should help accomplish.
Required evidence: Identify the events, funnels, segments, or cohorts that should support the output.
Done state: Define the observable action that marks useful completion.
Fallback: Decide what happens when the inputs are missing, the evidence conflicts, or the agent cannot complete the task reliably.

For an onboarding investigation, the contract might ask the agent to help identify where a defined cohort leaves the activation journey and produce evidence-backed hypotheses for the product manager to review. The task is not complete when text appears. It is complete when the user reviews the relevant evidence and records a decision, launches a follow-up analysis, or creates an experiment.
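A workflow contract is easier to review when it lives as a small structured record rather than a paragraph in a planning doc. The sketch below is one way to capture it; the class name, field names, and example values are illustrative assumptions, not an Amplitude object.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkflowContract:
    """Hypothetical record of the six contract elements described above."""
    user: str
    trigger: str
    bounded_job: str
    required_evidence: tuple
    done_state: str
    fallback: str


def missing_fields(contract: WorkflowContract) -> list:
    """Return the names of contract fields that were left empty."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Example values paraphrase the onboarding investigation; the fallback is
# deliberately left blank to show the check catching an incomplete contract.
onboarding = WorkflowContract(
    user="Product manager investigating onboarding friction",
    trigger="Drop in activation for a defined cohort",
    bounded_job="Identify where the cohort leaves the activation journey",
    required_evidence=("activation funnel", "cohort definition"),
    done_state="PM records a decision, launches an analysis, or creates an experiment",
    fallback="",
)

print(missing_fields(onboarding))
```

A contract that still reports missing fields is a signal to keep scoping before any dashboard or prompt work begins.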

Use a simple outcome ladder to keep the team honest: eligible users see the experience, some start it, some reach the workflow’s done state, some act on the result, and the intended product outcome changes. Each level answers a different question. Collapsing them into an agent usage metric hides the point at which value disappears.

Instrument the agent journey, not just the final answer

Your event design should let you reconstruct the journey from opportunity to outcome. The names below are examples, not an official Amplitude schema. Adapt them to your existing naming convention and governance rules.

Journey stage	Question it answers	Suggested event
Eligible	Who could reasonably use this workflow?	agent_workflow_eligible
Exposed	Who actually saw an entry point?	agent_entry_viewed
Started	Who chose to begin?	agent_run_started
Evidence reviewed	Who engaged with the information needed to judge the output?	agent_evidence_viewed
Completed	Who reached the workflow-specific done state?	agent_task_completed
Actioned	Who used the output in a downstream decision or action?	agent_output_applied
Handed off	Where did the experience require a deterministic flow or human review?	agent_handoff_triggered
Returned	Who came back when the job occurred again?	agent_run_started, segmented by prior successful completion

Add properties that explain why behavior differs: workflow identifier, product surface, user role, account cohort, journey stage, agent version, prompt or instruction version, completion reason, handoff reason, and error class. Version properties are essential. Without them, a release can change output quality while the dashboard incorrectly treats the experience as one stable product.

If prompts may contain customer or company data, do not log the raw text by default. Prefer derived classifications, structured outcome fields, or properly redacted samples governed by your retention and access policies. Product analytics should increase observability without creating an unnecessary copy of sensitive input.
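The two rules above, always attach version properties and never log raw prompt text by default, can be enforced at the point where events are built. This is a minimal sketch under stated assumptions: the property names, the blocked keys, and the payload shape are hypothetical and should be mapped to your own taxonomy and SDK.

```python
# Properties every agent event must carry (illustrative names).
REQUIRED_PROPERTIES = frozenset(
    {"workflow_id", "product_surface", "user_role", "agent_version", "prompt_version"}
)
# Keys that would copy sensitive free text into analytics (illustrative names).
RAW_TEXT_KEYS = frozenset({"prompt_text", "response_text"})


def build_agent_event(event_type: str, user_id: str, properties: dict) -> dict:
    """Validate properties and return a plain event payload."""
    keys = set(properties)
    missing = REQUIRED_PROPERTIES - keys
    if missing:
        raise ValueError(f"missing required properties: {sorted(missing)}")
    leaked = RAW_TEXT_KEYS & keys
    if leaked:
        raise ValueError(f"raw text must not be logged: {sorted(leaked)}")
    return {
        "event_type": event_type,
        "user_id": user_id,
        "event_properties": dict(properties),
    }


event = build_agent_event(
    "agent_run_started",
    "user-123",
    {
        "workflow_id": "onboarding_investigation",
        "product_surface": "dashboard",
        "user_role": "product_manager",
        "agent_version": "2026-04-01",
        "prompt_version": "v7",
        "intent_class": "activation_drop",  # derived classification, not raw text
    },
)
```

Rejecting the event at build time keeps a missing version or a leaked prompt from silently reaching the dashboard.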

Build each metric with an explicit denominator:

Discovery rate: exposed eligible users divided by eligible users.
Start rate: users who start divided by users exposed to the entry point.
Completion rate: users reaching the workflow-specific done state divided by users who start.
Action rate: users taking the defined downstream action divided by users who complete.
Retained use: previously successful users who return when the job recurs divided by previously successful users who had another opportunity.

The eligibility and opportunity conditions matter as much as the numerator. A user cannot be counted as retained or lost for a job that has not recurred, and someone who never saw the entry point should not be treated as a failed starter.
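The denominators above can be made explicit in code by computing each rate only over the population that had the opportunity. The sketch below uses sets of user IDs; the numbers are made-up illustration data, not results.

```python
from typing import Optional


def rate(numerator: set, denominator: set) -> Optional[float]:
    """Share of the denominator population that also appears in the numerator.

    Returns None when nobody had the opportunity, instead of reporting 0%.
    """
    if not denominator:
        return None
    return len(numerator & denominator) / len(denominator)


# Illustrative user-ID sets for one workflow.
eligible = set(range(10))
exposed = set(range(8))
started = set(range(4))
completed = set(range(2))
actioned = {0}

ladder = {
    "discovery_rate": rate(exposed, eligible),
    "start_rate": rate(started, exposed),
    "completion_rate": rate(completed, started),
    "action_rate": rate(actioned, completed),
    # Nobody has had a second opportunity yet, so retained use is undefined.
    "retained_use": rate(set(), set()),
}
print(ladder)
```

Returning `None` for an empty denominator is a deliberate choice: an undefined rate should look different from a real zero.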

In Amplitude, separate the views rather than forcing everything into one chart. Use an exposure funnel for discoverability, a workflow funnel for completion, cohorts for segment differences, retention analysis for repeat behavior, and a guardrail view for errors, retries, and handoffs. Use Agent Analytics for the execution signals available from the agent, then connect those signals to the behavioral events that represent product value.

Keep output quality and product impact on separate scorecards

Behavioral analytics cannot tell you whether an answer was correct. An evaluation set cannot tell you whether customers changed their behavior. You need both views because they fail in different ways.

Before widening access, create an evaluation set drawn from the workflow contract. Include ordinary cases, incomplete inputs, ambiguous requests, conflicting evidence, and cases that should trigger a handoff. Grade the output against criteria that can be reviewed consistently:

Correctness: Does the conclusion match the available evidence?
Grounding: Can the user see which events, funnels, cohorts, or other inputs support it?
Task adherence: Did the agent solve the bounded job rather than produce a generic analysis?
Uncertainty handling: Does it distinguish supported conclusions from hypotheses?
Handoff behavior: Does it stop or redirect appropriately when required evidence is unavailable?
Actionability: Can the intended user make the next decision without reconstructing the analysis?

Record pass or fail for non-negotiable criteria such as unsupported conclusions and failed handoffs. Keep graded usefulness criteria separate. A high average score should not conceal a smaller set of serious failures.
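One way to keep non-negotiable criteria from being averaged away is to report them separately from the graded score. The sketch below assumes a hypothetical result format with two hard criteria and a 1 to 5 usefulness grade; the case data is illustrative.

```python
from statistics import mean

# Criteria that block a release on any failure (illustrative names).
HARD_CRITERIA = ("grounded_conclusions", "correct_handoff")


def summarize(results: list) -> dict:
    """Report hard failures separately from the graded usefulness average."""
    hard_failures = [
        r["case_id"] for r in results if not all(r[c] for c in HARD_CRITERIA)
    ]
    return {
        "hard_failures": hard_failures,
        "mean_usefulness": mean(r["usefulness"] for r in results),
        "release_blocked": bool(hard_failures),
    }


results = [
    {"case_id": "ordinary-1", "grounded_conclusions": True, "correct_handoff": True, "usefulness": 5},
    {"case_id": "conflict-1", "grounded_conclusions": True, "correct_handoff": True, "usefulness": 4},
    {"case_id": "missing-data-1", "grounded_conclusions": False, "correct_handoff": False, "usefulness": 3},
]
summary = summarize(results)
print(summary)
```

Here the average usefulness looks acceptable, yet the release is blocked because one case produced an unsupported conclusion and missed its handoff.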

Run the same evaluation set when you change instructions, tools, model configuration, retrieval behavior, or the data made available to the agent. This is the practical value of eval-driven development: a fast release becomes a controlled product change rather than an untraceable shift in behavior.

Your online scorecard should then contain distinct layers:

Primary outcome: the workflow-specific completion or downstream action that represents value.
Adoption diagnostics: eligibility, exposure, start rate, and first successful completion.
Quality diagnostics: evaluation results, user corrections, retries, and unsupported-output flags.
Operational guardrails: errors, latency appropriate to the workflow, abandonment, and handoffs.
Product impact: the activation, feature adoption, retention, or other behavioral outcome the workflow is intended to influence.

Choose one primary outcome before launch. The other measures explain why it moved or protect against a misleading win. If every metric is primary, the team can always find one that improved after the fact.

User ratings can help diagnose tone, relevance, or missing context, but they are not a substitute for observed outcomes. A response can feel impressive and still produce no action. A terse response can look unimpressive and still help an expert complete the job quickly. Pair stated feedback with completion, downstream action, and return behavior.

Run an experiment that can survive executive scrutiny

Do not compare enthusiastic agent adopters with everyone who ignored it. Those groups selected themselves, so their product outcomes may have differed before the agent appeared. Establish a baseline and create a controlled comparison wherever the workflow and traffic permit it.

Write the hypothesis in behavioral terms. Name the user, workflow, expected action, and product outcome.
Measure the current workflow before introducing the agent. Capture completion, abandonment, downstream action, and relevant guardrails.
Define eligibility before assignment so the comparison includes people with the same underlying job.
Choose the assignment unit that matches how the workflow spreads. Use an account-level unit when teammates share agent output; use a user-level unit only when experiences are genuinely independent.
Expose the treatment through a feature flag or controlled rollout, while keeping the existing path available as the comparison and fallback.
Evaluate the primary outcome and guardrails together. Do not call a faster workflow successful if output quality, error handling, or downstream behavior deteriorates.
Inspect cohorts to understand a credible result, not to search endlessly for a segment that happens to look positive.
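The assignment-unit step can be sketched as a deterministic hash of the unit ID, so every teammate in an account lands in the same arm. This is an illustrative stand-in for whatever feature-flag or experimentation tool you use, not a description of how any particular product assigns variants; the experiment name and IDs are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map an assignment unit to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    # Use the first 32 bits of the hash as a bucket in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Account-level assignment: pass the account ID, not the user ID, when
# teammates share agent output.
users = [("user-1", "acct-42"), ("user-2", "acct-42"), ("user-3", "acct-77")]
for user_id, account_id in users:
    print(user_id, assign_variant(account_id, "agent_onboarding_v1"))
```

Salting the hash with the experiment name keeps assignments independent across experiments while staying stable within one.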

The metric pattern often tells you where to investigate next:

High exposure with low starts can indicate weak positioning, poor timing, or an irrelevant eligible population.
Healthy starts with low completion can indicate that the promise is attractive but the workflow, inputs, or output quality is failing.
High completion with low downstream action can indicate that your done state is too shallow or the output is not trusted enough to use.
Strong agent engagement without movement in the product outcome can indicate a locally pleasant experience that does not change the broader journey.
Strong first use with weak return behavior can indicate novelty, unreliable value, or a job that simply occurs infrequently. Check opportunity before interpreting it as churn.
Good aggregate results with concentrated handoffs in one cohort can indicate missing context, permissions, or data for that segment.

Guardrails should be operational, not aspirational. Validate required inputs. Make the agent’s task and evidence boundaries clear. Route the user to a deterministic flow or human review when observable conditions show that the task cannot be completed. Missing data, failed tool calls, validation failures, and unsupported claims are stronger handoff triggers than an agent merely describing itself as confident.
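A handoff rule built on observable conditions might look like the sketch below. The signal names and reason codes are assumptions for illustration. The function deliberately ignores the agent's self-reported confidence, in line with the point above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunSignals:
    """Observable facts about one agent run (illustrative fields)."""
    missing_inputs: list = field(default_factory=list)
    failed_tool_calls: int = 0
    validation_errors: int = 0
    unsupported_claims: int = 0
    self_reported_confidence: float = 1.0  # recorded, but not used for routing


def handoff_reason(signals: RunSignals) -> Optional[str]:
    """Return a handoff reason code, or None if the run may proceed."""
    if signals.missing_inputs:
        return "missing_data"
    if signals.failed_tool_calls > 0:
        return "tool_failure"
    if signals.validation_errors > 0:
        return "validation_failure"
    if signals.unsupported_claims > 0:
        return "unsupported_claim"
    return None


confident_but_ungrounded = RunSignals(unsupported_claims=2, self_reported_confidence=0.99)
print(handoff_reason(confident_but_ungrounded))
```

The returned reason code can be logged as the handoff reason property on the handoff event, so handoffs stay analyzable by cause.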

Scale only when value repeats under real conditions

A spike in usage after launch mainly proves that people noticed something new. Scale when the complete chain repeats: eligible users discover the workflow, finish it, act on the result, and return when the same job appears again.

Segment that chain by role, account cohort, use case, journey, and agent version. A workflow that helps an experienced product analyst may confuse a first-time manager. An onboarding investigation may need different evidence and handoffs from a retention investigation. Aggregate adoption can hide both realities.

Expand the rollout when the primary outcome improves, evaluation quality remains stable across relevant cohorts, guardrail failures stay controlled, and repeat use matches the natural frequency of the job. Redesign when successful users cannot find the entry point, retries cluster around the same step, completed outputs rarely lead to action, or results depend on one unusually capable cohort.

Pause expansion when the agent does not improve the existing workflow, important outputs cannot be audited back to evidence, or failures cannot be routed safely. More exposure only creates more ambiguous data when the workflow contract itself is weak.

Key takeaways

Define one bounded workflow and an observable done state before measuring adoption.
Connect agent execution signals to exposure, completion, downstream action, and product outcomes in Amplitude.
Use evaluation sets for output quality and behavioral analytics for real-world impact; neither replaces the other.
Compare the agent with the existing workflow among equally eligible users.
Treat retries, errors, unsupported outputs, and handoffs as product signals, not merely engineering logs.
Scale repeatable value across cohorts and versions, not a launch-driven usage spike.

Your next move should fit on one page: the workflow contract, event map, evaluation criteria, experiment metric, and fallback path. If those elements are clear, Amplitude can show where the agent creates value and where it merely creates activity. If they are not clear, narrow the workflow before you widen the rollout.

April 28, 2026

AI Product Validation: From Promising Demo to Proven Value

You have an AI demo that looks impressive. It answers the happy-path prompt, the latency seems acceptable, and stakeholders can already imagine the launch. The uncomfortable question is whether any of that proves the product is worth building.

It does not. A useful validation process has to reduce several different risks: whether customers care, whether the workflow helps them, whether the AI performs reliably, whether the economics work, and whether failures remain tolerable. Test those risks in that order and you can make a defensible investment decision without turning production traffic into your debugging environment.

Define the decision before you design the AI

The first artifact for an AI initiative should not be a model shortlist or a prototype. It should be a decision contract that states what must become true for the initiative to deserve more investment.

A practical decision statement has this shape: For a defined user in a defined situation, the proposed capability will improve an observable outcome relative to the current alternative, without breaching named guardrails. If the agreed threshold is met, you will advance. If it is not, you will stop or change a specific assumption.

Write down these five elements before the experiment begins:

User and job: Name who encounters the problem, when it occurs, and what they are trying to complete. A broad label such as knowledge workers is not precise enough to design a useful test.
Current alternative: Record what the user does now, including manual work, an existing product flow, a rules engine, or simply tolerating the problem. This is the baseline the AI must beat.
Observable outcome: Choose a user or business result, not a model activity. Task completion, time-to-value, corrected routing, rework, repeat use, or downstream resolution can carry more meaning than generations or prompt volume.
Success threshold and guardrails: Decide how much improvement would justify the cost and what must not deteriorate. Safety failures, latency, privacy exposure, retention, and cost per successful outcome can all constrain an otherwise positive result.
Decision rule: State what evidence will trigger expansion, another iteration, a change in direction, or cancellation. Precommitting prevents enthusiasm for a polished demo from moving the goalposts later.

The threshold is not universal. It should reflect the value of the outcome, the implementation and operating costs, the consequences of errors, and the return available from competing roadmap work. Minimum detectable effect belongs here: define the smallest improvement that would actually change your decision, then size the test to detect that effect. A test that cannot distinguish a worthwhile gain from noise is not a faster test. It is a delayed decision.
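To make the sizing step concrete, the sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline rate, effect size, significance level, and power are placeholder assumptions; substitute your own, and treat the result as a planning estimate rather than a guarantee.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Two-sided test on two proportions, normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 20% baseline completion, 5-point minimum meaningful lift.
print(sample_size_per_arm(0.20, 0.05))
# A smaller minimum detectable effect requires a much larger sample.
print(sample_size_per_arm(0.20, 0.02))
```

If the required sample exceeds the eligible traffic you can realistically expose, that is the moment to revisit the threshold or the test design, not after the readout.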

A driver tree helps prevent a common measurement mistake. Start with the desired outcome, connect it to the user behaviors that could produce that outcome, and then connect those behaviors to system-level drivers. For an AI support-triage capability, the outcome might be faster correct routing. Accepted category and priority suggestions are leading signals; downstream corrections, reassignment, and resolution are closer to the outcome. Model classification accuracy matters, but it is only one driver in the chain.

If the proposal involves an autonomous or semi-autonomous agent, run a precondition check before planning the experiment. Volume, instructions, tolerance, access, and a learning loop expose whether agentic complexity is justified:

Volume: Does the workflow happen often enough for automation to create meaningful leverage?
Instructions: Can success, constraints, and exceptions be expressed in testable terms?
Tolerance: Is the likely failure reversible, detectable, and contained?
Access: Can the system use the necessary data and tools with reliable integrations and least-privilege permissions?
Learning loop: Can you measure quality, latency, cost, and failures after launch?

A missing condition tells you what to validate next. Unclear instructions call for more discovery and rubric design. Weak access calls for an integration or data-quality spike. Low error tolerance calls for approvals and a narrower action space. Low volume may mean that a clear workflow, a rule, or better product UX is the better answer. The purpose of validation is not to prove that AI belongs in the solution; it is to discover whether it does.

Climb an evidence ladder instead of jumping to a pilot

An oversized pilot often mixes market, usability, model, integration, and operational risk into one expensive test. When the result disappoints, nobody knows which assumption failed. An evidence ladder gives each experiment one dominant question and increases fidelity only after the previous uncertainty has been reduced.

Question to answer	Lean experiment	Evidence to inspect	What it does not prove
Do users care enough to act?	Painted door, landing page, waitlist, concierge offer, preorder, or deposit where appropriate	Click-through intent, qualified sign-ups, willingness to pay, and continued requests	Usability, AI quality, or scalable delivery
Can the proposed workflow help?	Wizard-of-Oz flow or realistic interactive prototype	Task completion, time on task, errors, material friction, and repeat use	Whether an AI system can deliver the experience reliably
Can the system perform the job?	Offline evaluation on a curated golden set plus targeted technical spikes	Rubric results by case type, failure patterns, latency, and cost	Whether the complete product changes user behavior
Does the product improve the target outcome?	Feature-flagged A/B test or holdout	Primary outcome, leading indicators, cohort effects, and guardrails	Long-term stability under every operating condition
Can it operate within acceptable risk?	Capped rollout with approvals, audit logs, monitoring, and rollback controls	Harm and privacy events, reversals, escalations, reliability, and cost per successful outcome	That future changes will remain safe without continued evaluation

Use the first row when demand is the dominant risk. A painted-door click is a signal of curiosity, not proof of durable value. A qualified sign-up asks for more commitment. A preorder or deposit, when honest and operationally appropriate, tests willingness to pay. Repeated use of a manually delivered service provides stronger behavioral evidence. Do not collapse these signals into a single conversion metric; they represent different levels of commitment.

Once demand appears credible, use a prototype or Wizard-of-Oz flow to learn whether the proposed interaction helps. Pretotyping should answer whether the product deserves to exist, while prototyping should answer how it needs to work. Keeping those questions separate prevents a polished interface from disguising weak demand and prevents a crude early interface from killing a valuable idea before its workflow has been understood.

These experiments still owe users honest expectations. A painted door should reveal that the capability is unavailable after the user expresses interest, rather than pretending it already exists. A concierge or Wizard-of-Oz flow should be explicit about how data will be handled and what follow-up the participant can expect. Deception can manufacture a metric while damaging the trust the eventual product will need.

Advance when the evidence changes the dominant uncertainty. Strong demand does not authorize a production launch; it authorizes a workflow test. A usable workflow authorizes a system evaluation. An offline pass authorizes limited exposure. Each rung earns the next investment without pretending to answer questions it was not designed to answer.

Separate model quality from product value

A model can produce better answers while the product creates less value. Added latency can interrupt the workflow. A retrieval failure can ground an otherwise capable model in the wrong context. A user may spend more time checking and rewriting an answer than doing the task manually. This is why a single accuracy score cannot validate an AI product.

Build a golden set from the work users actually do

Eval-driven development starts before production traffic. Build a curated set of cases that reflects real user complexity, then turn your definition of good into a reproducible scoring process.

Define the evaluation unit: Score the completed job whenever possible, not merely an isolated response. An agent that drafts a correct message but sends it to the wrong destination has failed the job.
Represent meaningful variation: Include normal cases, longer and shorter inputs, ambiguous requests, important customer segments, and known edge conditions. A convenience sample of clean happy paths measures demo readiness, not product readiness.
Tag each slice: Label cases by intent, complexity, risk, input type, or other distinctions that could conceal a concentrated failure. Aggregate performance can improve while a critical slice gets worse.
Write a multidimensional rubric: Score correctness, completeness, groundedness, safety, tone, policy compliance, and any task-specific requirements separately. Add latency and cost as system measures rather than blending everything into an opaque average.
Choose a real baseline: Compare the candidate with the current product, manual workflow, rules-based approach, or incumbent model. The relevant question is not whether the candidate looks capable in isolation; it is whether switching produces enough value.
Preserve regression evidence: Keep a stable set for comparisons and add newly discovered failures to an evolving challenge set. This turns production learning into protection against recurrence.
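The slice-tagging step matters because an aggregate can look healthy while one slice fails completely. The sketch below shows that effect with a hypothetical golden-set result format; the slice names and outcomes are illustrative.

```python
from collections import defaultdict


def pass_rate_by_slice(cases: list) -> dict:
    """Return the pass rate for each tagged slice of the golden set."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for case in cases:
        totals[case["slice"]][0] += int(case["passed"])
        totals[case["slice"]][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


# Eight routine cases pass; both high-risk cases fail.
cases = [{"slice": "routine", "passed": True} for _ in range(8)]
cases += [{"slice": "high_risk", "passed": False} for _ in range(2)]

overall = sum(c["passed"] for c in cases) / len(cases)
by_slice = pass_rate_by_slice(cases)
print(overall, by_slice)
```

An 80% overall pass rate here hides a 0% pass rate on the slice where failure is most costly, which is why the readout should always include the per-slice view.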

Keep the measurement layers visible in every readout:

Output quality: correctness, completeness, groundedness, tone, safety, and compliance.
System performance: retrieval quality, tool execution, policy enforcement, latency, reliability, and cost.
User outcome: task completion, time-to-value, edits, rejection, rework, escalation, and repeat use.
Business consequence: the downstream result the initiative was funded to improve, along with retention or other core guardrails where relevant.

Each layer diagnoses a different problem. If output quality is weak, work on context, prompts, retrieval, tools, policies, or the model. If output quality passes but completion does not improve, inspect the interaction and workflow. If users succeed but the cost per successful outcome is unacceptable, narrow the use case or revisit the architecture. A composite score can hide these distinctions at exactly the moment you need them.

Test the behavior distribution, not a lucky response

AI output is variable, so a candidate should not pass because one run happened to look good. Use two evaluation modes. A regression configuration should be as controlled as the system allows, with model, prompt, retrieval, tool, temperature, top-p, and seed settings documented where they apply. A production-like configuration should match the variability users will experience and repeat cases often enough to reveal unstable behavior and tail failures.

Run candidate and baseline systems on the same cases under comparable settings.
Inspect results by slice and failure type, not only the overall average.
Repeat stochastic cases so the team sees consistency, variance, and severe outliers.
Automate clear rubric checks, but retain human review for ambiguous or high-consequence judgments.
Version the model, prompt, retrieval configuration, tools, policies, and evaluation set so a change can be reproduced.

This creates a release gate instead of a demo contest. Offline evaluation will not prove market value, but it can prevent known regressions, unsafe behavior, and obviously weak variants from consuming customer trust in a live experiment.
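Repeating stochastic cases only helps if the summary exposes spread and worst-case behavior alongside the mean. A minimal sketch, assuming each case is scored on a numeric rubric across several runs; the scores and the failure threshold are illustrative.

```python
from statistics import mean, pstdev


def summarize_runs(scores_by_case: dict, fail_below: float) -> dict:
    """Summarize repeated runs per case: average, spread, and worst run."""
    summary = {}
    for case_id, scores in scores_by_case.items():
        worst = min(scores)
        summary[case_id] = {
            "mean": mean(scores),
            "spread": pstdev(scores),
            "worst": worst,
            "unstable": worst < fail_below,
        }
    return summary


# Two cases with the same average but very different reliability.
runs = {
    "steady_case": [4, 4, 4, 4],
    "lucky_case": [5, 5, 1, 5],
}
report = summarize_runs(runs, fail_below=3)
for case_id, stats in report.items():
    print(case_id, stats)
```

Both cases average the same score, but only one of them would ever look bad in a single demo run; the worst-run and spread columns make that visible.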

Make the production test answer a business decision

Production exposure is justified when demand, workflow, and offline performance have enough evidence behind them. The live test should then answer a narrow causal question: does access to this capability improve the intended outcome for the eligible population, compared with the current experience, without violating the operating constraints?

Instrument the complete causal chain

Your event schema should connect eligibility to exposure, interaction, system behavior, task completion, and downstream consequences. At minimum, distinguish these moments:

The user or account became eligible for the test.
The treatment was actually shown or made available.
The capability was invoked, whether explicitly or automatically.
The system succeeded, failed, timed out, or triggered a safeguard.
The output was displayed, accepted, edited, rejected, reversed, or escalated.
The target task was completed or abandoned.
The downstream outcome occurred, such as a correction, reassignment, reopening, or successful resolution for a support workflow.

Attach the cohort and the relevant model, prompt, retrieval, tool, and policy versions to the trace. Capture latency, cost, and safety results without indiscriminately logging sensitive payloads. Privacy-by-design and data governance determine which data may be retained, who may inspect it, and how long it should remain available.

Missing links create predictable misreadings. Without an exposure event, low adoption can be confused with low visibility. Without version information, a regression cannot be tied to a system change. Without the downstream event, acceptance can be mistaken for value even when users later undo the AI’s work.
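The misreadings above can be avoided when the event log carries both the exposure event and the downstream reversal. The sketch below uses hypothetical event types and fields to compute adoption among exposed users and to separate gross acceptance from acceptance that survived.

```python
def acceptance_summary(events: list) -> dict:
    """Compute adoption among exposed users and durable acceptances."""
    exposed = {e["user_id"] for e in events if e["type"] == "treatment_exposed"}
    invoked = {e["user_id"] for e in events if e["type"] == "capability_invoked"}
    accepted = {e["output_id"] for e in events if e["type"] == "output_accepted"}
    reversed_later = {e["output_id"] for e in events if e["type"] == "output_reversed"}
    return {
        # Denominator is exposed users, so low visibility is not read as low adoption.
        "adoption_among_exposed": len(invoked & exposed) / len(exposed) if exposed else None,
        "gross_acceptances": len(accepted),
        # Acceptances that users did not later undo.
        "durable_acceptances": len(accepted - reversed_later),
    }


events = [
    {"type": "treatment_exposed", "user_id": "u1"},
    {"type": "treatment_exposed", "user_id": "u2"},
    {"type": "capability_invoked", "user_id": "u1"},
    {"type": "output_accepted", "user_id": "u1", "output_id": "o1"},
    {"type": "output_accepted", "user_id": "u1", "output_id": "o2"},
    {"type": "output_reversed", "user_id": "u1", "output_id": "o2"},
]
stats = acceptance_summary(events)
print(stats)
```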

Choose the design and sample around the decision

Randomization: Choose user, account, workflow, or time window based on where contamination can occur. If people in one account share outputs, user-level assignment may mix treatment and control experiences.
Population: Define eligibility before launch. Balance or stratify meaningful groups such as new accounts and power users when their behavior or exposure differs.
Primary metric: Select one outcome that can settle the main question. Treat diagnostic measures as supporting evidence, not a menu from which to pick a winner later.
Guardrails: Monitor core experience, retention where relevant, time-to-value, safety, privacy, reliability, and cost. Write rollback conditions for unacceptable movement before exposure begins.
Effect size and power: Set the minimum detectable effect from the business decision, estimate the required sample, and acknowledge when available traffic cannot support the desired conclusion.
Exposure control: Use feature flags, a capped rollout, and a holdout so you can stop quickly and preserve a valid comparison.

Standard A/B testing fits many product changes. Ranking and retrieval changes can benefit from interleaving when alternatives can be compared within the same experience. Switchback designs can help when time, seasonality, or shared operating conditions make simultaneous assignment misleading. Match the design to the interference in the workflow instead of defaulting to the experiment template you use for deterministic UI changes.

AI variability also changes the readout. Aggregate outcomes across the multiple interactions each user has, compare cohorts, and track confidence intervals over time. A single significant p-value should not outweigh signs that the test is underpowered, the effect is unstable, or a safety failure is concentrated in one cohort. A statistically inconclusive result means the test did not resolve the decision; it does not prove that the feature has no effect.
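A confidence interval makes the difference between "no effect" and "not resolved" visible. The sketch below computes a simple Wald interval for the difference between two completion rates; the counts are invented for illustration, and the method assumes independent units and reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(
    success_control: int,
    n_control: int,
    success_treatment: int,
    n_treatment: int,
    confidence: float = 0.95,
) -> tuple:
    """Return (difference, lower bound, upper bound) for treatment minus control."""
    p_c = success_control / n_control
    p_t = success_treatment / n_treatment
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff, diff - z * se, diff + z * se


# Hypothetical readout: 200/1000 completions in control, 230/1000 in treatment.
diff, low, high = diff_in_proportions_ci(200, 1000, 230, 1000)
print(f"lift={diff:.3f}, interval=({low:.3f}, {high:.3f})")
```

In this made-up readout the interval spans zero while still including lifts large enough to matter, so the honest conclusion is that the test has not resolved the decision.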

Prewrite the scale, iterate, and stop rules

I prefer four explicit decision states because they force the readout to connect evidence to action:

Scale: The primary outcome clears the meaningful threshold, guardrails hold, important cohorts do not show an unacceptable reversal, and reliability and cost remain viable.
Iterate the AI system: User intent is strong, but a defined output or system failure blocks value. The next test should target that failure rather than repeat the same broad pilot.
Change the product experience: Offline quality passes, but users cannot discover, trust, control, or efficiently use the capability. Treat this as workflow evidence, not an automatic reason to swap models.
Stop or reframe: Demand is weak, the economics cannot work, the necessary data or access is unavailable, or credible risk remains outside tolerance.
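Writing the four states as a function before the test runs is one way to precommit. The sketch below is a deliberate simplification: the boolean inputs are hypothetical summaries of the evidence, and any case where offline quality passes but the outcome or guardrails fall short is routed to a product-experience review.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SCALE = "scale"
    ITERATE_AI_SYSTEM = "iterate_ai_system"
    CHANGE_PRODUCT_EXPERIENCE = "change_product_experience"
    STOP_OR_REFRAME = "stop_or_reframe"


@dataclass
class Readout:
    """Illustrative evidence summary for one experiment readout."""
    demand_credible: bool
    economics_viable: bool
    data_and_access_available: bool
    risk_within_tolerance: bool
    offline_quality_passes: bool
    outcome_clears_threshold_with_guardrails: bool


def decide(r: Readout) -> Decision:
    """Map a readout to one of the four prewritten decision states."""
    if not (r.demand_credible and r.economics_viable
            and r.data_and_access_available and r.risk_within_tolerance):
        return Decision.STOP_OR_REFRAME
    if not r.offline_quality_passes:
        return Decision.ITERATE_AI_SYSTEM
    if r.outcome_clears_threshold_with_guardrails:
        return Decision.SCALE
    return Decision.CHANGE_PRODUCT_EXPERIENCE


readout = Readout(True, True, True, True, offline_quality_passes=True,
                  outcome_clears_threshold_with_guardrails=False)
print(decide(readout))
```

The value is less in the code than in the argument it forces: the team has to agree on the inputs and their order of precedence before seeing results.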

Risk must be part of the launch rule, not a review added after a positive metric appears. Include toxicity and personally identifiable information checks where relevant, enforce least-privilege access, retain appropriate audit logs, and make rollback operational before exposure. For irreversible financial actions, sensitive regulatory decisions, or any workflow where the acceptable error rate is effectively zero, keep a qualified human approval step or defer autonomy. Faster execution does not compensate for an unacceptable blast radius.

Autonomy should be earned in stages. Begin with assistance that the user can inspect. Move to required approval before actions. Allow autonomous execution only for narrow, low-stakes, reversible actions after stability is demonstrated. Expand permissions and exposure only when monitoring shows that the earlier guardrails continue to hold.

The experiment does not end at launch. Model behavior, retrieval content, user mix, prompts, tools, and operating costs can change. Continue tracking quality, latency, cost per successful outcome, safety, and cohort behavior. Feed new failures into the evaluation set and keep a holdout when the decision warrants one. A weekly readout should identify what changed, which assumption the evidence affected, and what decision follows; it should not become a tour of every available dashboard.

Key takeaways

Start with a precommitted decision contract: user, job, baseline, outcome, threshold, guardrails, and next action.
Validate demand before usability, usability before system capability, and system capability before broad production impact.
Compare the AI with the user’s current alternative, not with an abstract standard of impressive output.
Measure output quality, system performance, user outcomes, and business consequences separately so failures remain diagnosable.
Treat stochastic behavior as a distribution: document configurations, repeat runs, inspect slices, and watch severe outliers.
Use feature flags, holdouts, exposure caps, auditability, and prewritten rollback rules to contain risk while learning.

At your next AI review, ask for the experiment contract instead of another demo. If the team cannot name the dominant risk, current baseline, meaningful threshold, guardrails, and action for each possible result, the next step is not production exposure. It is a sharper test.

Start with the smallest experiment that could credibly invalidate the idea. Evidence that survives that test earns the right to spend more, increase fidelity, and expose more users.

April 27, 2026
