Tag: Agent Analytics

Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.

Learn how to track your web vitals and page rankings faster with Amplitude AI Agents and improve your site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.

Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.

My practical playbook is straightforward:

1) Establish a performance baseline for Core Web Vitals on key templates and critical user paths.
2) Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction.
3) Connect those vitals to downstream behaviors (scroll depth, engagement, and conversion) so we prioritize fixes that move business outcomes, not just lab scores.
4) Use feature flags and A/B testing to ship improvements safely and quantify uplift.
5) Close the loop with Agent Analytics to keep learnings visible and actionable.
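To make the baseline step concrete, here is a minimal sketch that rates field samples for one page template. It assumes the samples have already been collected and exported. The thresholds are Google's published "good" and "poor" boundaries for LCP, INP, and CLS, assessed at the 75th percentile. The function names and sample values are illustrative and not part of any Amplitude API.

```python
import math

# Google's published Core Web Vitals boundaries: (good upper bound, poor lower bound).
THRESHOLDS = {
    "LCP": (2500.0, 4000.0),  # milliseconds
    "INP": (200.0, 500.0),    # milliseconds
    "CLS": (0.1, 0.25),       # unitless layout-shift score
}

def p75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]

def rate(metric, samples):
    """Classify a metric's p75 as 'good', 'needs improvement', or 'poor'."""
    good, poor = THRESHOLDS[metric]
    value = p75(samples)
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"

# Hypothetical LCP samples (ms) for a single template.
checkout_lcp = [1800, 2100, 2300, 2600, 3900, 2200, 2400, 5200]
print(rate("LCP", checkout_lcp))
```

Running the same function per device, channel, or content type gives the segmented view described in step 2.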

Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading user experience for release velocity, or vice versa.

The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.

If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.

Inspired by this post on Amplitude – Best Practices.

May 20, 2026

How to Prove the ROI of an AI Product Before You Scale It

Your AI product is getting used. The demos land well, task completion is improving, and internal enthusiasm is high. Then the CFO asks a harder question: what changed in the business because this product exists?

You cannot answer that question with prompt volume, response quality, adoption, or tickets touched. You need a measurement system that separates activity from incremental value, counts the full operating cost, and makes risk visible before a rollout gets larger. Here is how to build one.

Start with the decision your ROI model must support

ROI is not a retrospective slide assembled after launch. It is a decision rule. Before development begins, decide what evidence would justify launching, scaling, redesigning, rolling back, or retiring the capability.

That distinction changes the conversation. Instead of asking whether the agent is accurate enough or popular enough, you ask whether a measurable change in customer behavior produces a measurable business result without crossing an unacceptable risk threshold.

Build a driver tree with four levels:

Company outcome: revenue growth, lower cost to serve, or reduced business risk.
Customer outcome: the user completes a valuable job, reaches value sooner, or resolves a problem without unnecessary effort.
Product behavior: the AI capability changes conversion, expansion, self-service completion, containment, handle time, or escalation.
Controllable lever: the team changes the workflow, model behavior, conversation design, human review, or product guidance.

The chain matters because a model metric is rarely a business metric. Better answer quality may improve task completion, which may improve trial-to-paid conversion. The ROI case depends on the full chain, not the first link.

Value path	Business outcome	Leading evidence	Guardrails
Revenue	Higher conversion, average order value, or expansion	Time-to-first-value and self-service completion	Errors, complaints, and policy violations
Cost	Lower cost to serve	Containment, deflection, and reduced handle time	Escalations, false resolution, and downstream customer harm
Risk	Lower frequency or impact of harmful failures	Human-review events and detected violations	False positives, false negatives, hallucinations, and security breaches

Choose one primary value path for the investment case. Revenue, cost, and risk can all appear on the scorecard, but declaring all three as primary makes it too easy to rescue a weak result with whichever metric moved after launch.

A support agent, for example, may appear successful because it contains more conversations. But containment is only valuable if customers actually resolve their problems. A conversation that never reaches a human can reduce measured support volume while increasing complaints or churn risk. This is why revenue, cost, and risk measures must be evaluated together.

Write the measurement contract before you build the dashboard

A measurement contract is a short agreement among product, data, finance, and the operational team affected by the AI workflow. It prevents the definitions, cost boundaries, and success thresholds from changing after results arrive.

Your contract should answer these questions:

Who is eligible? Define the users, accounts, tasks, channels, and exclusions. Do not mix workflows with materially different economics.
What is the intervention? Name the AI capability and the version being evaluated. A model, prompt, retrieval pipeline, policy, or escalation change can alter the result.
What is the primary outcome? Select the business metric that determines whether the hypothesis passed.
What are the leading indicators? Use measures such as time-to-first-value, containment, and self-service completion to diagnose movement before lagging results mature.
What are the guardrails? Predefine acceptable limits for errors, hallucinations, false positives, false negatives, escalations, complaints, security events, and policy violations.
What is the baseline? Freeze the comparison period or control group before exposing the eligible population to the capability.
How will incrementality be proven? Specify the experiment, holdout, assignment unit, and minimum detectable effect.
What costs count? Agree on model or API consumption, labeling, evaluation, human review, and ongoing oversight before calculating value.
What action follows each result? Record the thresholds for launch, scale, redesign, rollback, and retirement.
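One way to keep the contract honest is to store it as structured data and check for unanswered questions before any dashboard work starts. The sketch below is an assumption about how a team might do that; the class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, field, fields

@dataclass
class MeasurementContract:
    # One field per contract question; empty means "not yet agreed".
    eligibility: str = ""
    intervention: str = ""
    primary_outcome: str = ""
    leading_indicators: list = field(default_factory=list)
    guardrails: dict = field(default_factory=dict)
    baseline: str = ""
    incrementality_design: str = ""
    cost_boundary: list = field(default_factory=list)
    decision_thresholds: dict = field(default_factory=dict)

    def missing_fields(self):
        """Names of contract questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# Hypothetical draft with only three questions answered.
draft = MeasurementContract(
    eligibility="trial accounts using in-app support chat",
    intervention="support agent v3 (hypothetical version label)",
    primary_outcome="trial-to-paid conversion",
)
print(draft.missing_fields())
```

A non-empty result is a signal that the capability is still an exploration rather than something ready for an ROI claim.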

The contract should distinguish an outcome OKR from an output OKR. Shipping the agent, generating responses, and increasing feature use are outputs. Improving conversion, lowering verified cost to serve, or reducing harmful failures are outcomes. Outputs can explain what happened, but they cannot establish value on their own.

Instrument the complete journey, not just the conversation

An AI log tells you what the model did. An ROI dataset must also tell you what the user did next.

Connect the journey from eligibility to business outcome:

The user or account became eligible for the capability.
The AI experience was offered, viewed, and engaged.
A task was attempted, completed, abandoned, or repeated.
A response was accepted, corrected, regenerated, or sent for human review.
The interaction was contained, escalated, or handed to another workflow.
The downstream conversion, expansion, support, retention, or complaint event occurred.
The associated model cost, labeling work, and human-oversight cost were recorded.

Carry a stable user or account identifier, experiment assignment, agent version, and journey identifier across those events. Without that connective tissue, the team may have an impressive agent dashboard and no defensible way to attribute a business outcome to the experience.
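As a rough illustration of that connective tissue, the sketch below drops events that lack any of the four identifiers and then counts distinct journeys per experiment arm that reached a downstream outcome. The event shape and key names are assumptions for illustration, not a specific analytics schema.

```python
from collections import defaultdict

# Identifiers every event must carry to be attributable (assumed key names).
REQUIRED_KEYS = ("account_id", "journey_id", "assignment", "agent_version")

def missing_keys(event):
    """Return the identifiers an event lacks; such events cannot be attributed."""
    return [key for key in REQUIRED_KEYS if not event.get(key)]

def journeys_reaching(events, outcome_name):
    """Count distinct journeys per experiment arm that logged the outcome event."""
    reached = defaultdict(set)
    for event in events:
        if missing_keys(event):
            continue  # skip unattributable events rather than guessing
        if event["name"] == outcome_name:
            reached[event["assignment"]].add(event["journey_id"])
    return {arm: len(ids) for arm, ids in reached.items()}

# Hypothetical events; the last one has no journey identifier.
events = [
    {"name": "eligible", "account_id": "a1", "journey_id": "j1",
     "assignment": "treatment", "agent_version": "v3"},
    {"name": "converted", "account_id": "a1", "journey_id": "j1",
     "assignment": "treatment", "agent_version": "v3"},
    {"name": "converted", "account_id": "a2", "journey_id": "j2",
     "assignment": "control", "agent_version": "none"},
    {"name": "converted", "account_id": "a3", "journey_id": "",
     "assignment": "treatment", "agent_version": "v3"},
]
print(journeys_reaching(events, "converted"))
```

The third treatment conversion is excluded because it cannot be tied to a journey, which is exactly the loss the identifiers are meant to prevent.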

Use behavioral analytics and session replay to understand why a metric moved. Use journey mapping and retention analysis to locate the friction worth solving in the first place. Product tours and in-app guidance can then help eligible users reach a validated workflow. This creates a closed loop from journey friction to experiment and measurable outcome, instead of a collection of disconnected AI metrics.

Calculate economic value without turning activity into savings

Start with net business value:

Net business value = incremental revenue + cost avoided - total operating cost - quantified risk loss

If finance requires an ROI percentage, divide net business value by the agreed investment base. Keep both the numerator and denominator visible. A percentage without its cost boundary is easy to inflate and hard to audit.
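The formula translates directly into code. This sketch keeps the net value and the investment base as separate inputs so both stay visible; all figures are hypothetical placeholders.

```python
def net_business_value(incremental_revenue, cost_avoided,
                       total_operating_cost, quantified_risk_loss=0.0):
    """Net value as defined above: revenue plus avoided cost, minus cost and risk loss."""
    return (incremental_revenue + cost_avoided
            - total_operating_cost - quantified_risk_loss)

def roi_percentage(net_value, investment_base):
    """ROI as a percentage of the agreed investment base."""
    if investment_base <= 0:
        raise ValueError("investment base must be positive")
    return 100.0 * net_value / investment_base

# Hypothetical quarterly figures agreed with finance.
net_value = net_business_value(
    incremental_revenue=120_000.0,
    cost_avoided=30_000.0,
    total_operating_cost=90_000.0,
    quantified_risk_loss=10_000.0,
)
roi = roi_percentage(net_value, investment_base=90_000.0)
print(net_value, round(roi, 1))
```

Reporting both numbers together makes it harder to inflate the percentage by quietly shrinking the denominator.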

Count only incremental revenue

Do not credit the AI product with every transaction it touched. Credit it with the difference between the exposed population and the valid control or holdout.

A practical revenue calculation is:

Incremental revenue = eligible volume x measured outcome lift x value per additional outcome

The measured outcome might be trial-to-paid conversion, self-service upsell, average order value, or expansion. Use the same eligibility definition, attribution window, and revenue treatment for the intervention and control. If the AI experience merely appears somewhere in a successful journey, that is influenced revenue, not proof of incremental revenue.
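Here is a small sketch of that calculation, assuming the lift is measured as the difference in conversion rate between the exposed group and a valid control. The counts and the value per outcome are made up for illustration.

```python
def conversion_rate(conversions, population):
    """Share of a group that reached the measured outcome."""
    return conversions / population

def incremental_revenue(eligible_volume, exposed, control, value_per_outcome):
    """Eligible volume x measured lift x value per additional outcome.

    `exposed` and `control` are (conversions, population) tuples measured with
    the same eligibility definition and attribution window.
    """
    lift = conversion_rate(*exposed) - conversion_rate(*control)
    return eligible_volume * lift * value_per_outcome

# Hypothetical experiment: 12% exposed conversion versus 10% in control.
estimate = incremental_revenue(
    eligible_volume=10_000,
    exposed=(600, 5_000),
    control=(500, 5_000),
    value_per_outcome=500.0,
)
print(round(estimate, 2))
```

If the control converts better than the exposed group, the same function returns a negative number, which is the honest result to report.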

Separate capacity from cashable savings

Cost claims require more care than a deflection count. A contained interaction may create capacity without reducing expenditure. That capacity can still be valuable, but it should not be presented as cash savings unless spending actually changes.

Capacity created: employees have time available for other work, but the existing cost base remains.
Variable cost avoided: the company no longer incurs a cost that would have grown with each additional interaction.
Cashable savings: an approved budget, vendor charge, or staffing requirement is actually reduced.

Report these separately. Otherwise, the same saved minute can be counted once as employee capacity and again as reduced spend.

Validate that a deflected task was resolved, not abandoned or displaced to another channel. Then calculate avoided cost from the incremental lift in verified resolution, not the total number of conversations the agent handled.
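A sketch of that validation, under the assumption that each conversation record carries three flags: whether it stayed away from a human, whether the problem was resolved, and whether the customer came back through another channel. The flag names and the unit cost are hypothetical.

```python
def is_verified_resolution(conversation):
    """Contained, actually resolved, and not displaced to another channel."""
    return (
        conversation["contained"]
        and conversation["resolved"]
        and not conversation["recontacted_elsewhere"]
    )

def verified_resolution_rate(conversations):
    verified = sum(1 for c in conversations if is_verified_resolution(c))
    return verified / len(conversations)

def avoided_cost(treatment, control, eligible_volume, variable_cost_per_contact):
    """Avoided cost from the lift in verified resolution, not from total volume."""
    lift = verified_resolution_rate(treatment) - verified_resolution_rate(control)
    return eligible_volume * lift * variable_cost_per_contact

def convo(contained, resolved, recontacted_elsewhere):
    return {"contained": contained, "resolved": resolved,
            "recontacted_elsewhere": recontacted_elsewhere}

# Hypothetical samples: 3 of 4 treatment conversations were contained,
# but only 2 were verified resolutions.
treatment = [convo(True, True, False), convo(True, True, False),
             convo(True, True, True), convo(False, True, False)]
control = [convo(True, True, False), convo(False, True, False),
           convo(False, True, False), convo(True, False, False)]
print(avoided_cost(treatment, control, eligible_volume=1_000,
                   variable_cost_per_contact=8.0))
```

Containment alone would credit three of four treatment conversations; the verified rate credits two, and only the difference against control counts.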

Include the operating costs that make the agent dependable

Model or API cost is only one part of the investment. Include labeling, evaluation, human review, and operational oversight. If a safer workflow requires more review, that review is part of the product’s economics, not an external inconvenience to exclude from the model.

Segment cost by agent, workflow, and outcome. Cost per response is useful for infrastructure management, but cost per verified successful outcome is the better economic unit. A cheap response that triggers retries, escalations, or corrections may be more expensive than a higher-cost response that completes the job.
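To show why the unit matters, this sketch compares two hypothetical workflows. The cost components and numbers are assumptions chosen only to illustrate the point that a cheaper response can be the more expensive outcome.

```python
def cost_per_verified_outcome(responses, cost_per_response,
                              review_cost, verified_outcomes):
    """Total operating cost divided by verified successful outcomes."""
    if verified_outcomes == 0:
        return float("inf")  # spend with nothing verified to show for it
    total_cost = responses * cost_per_response + review_cost
    return total_cost / verified_outcomes

# Hypothetical workflow A: cheap responses, many retries, heavy human review.
cheap = cost_per_verified_outcome(
    responses=1_000, cost_per_response=0.02,
    review_cost=300.0, verified_outcomes=200,
)
# Hypothetical workflow B: pricier responses that complete the job more often.
premium = cost_per_verified_outcome(
    responses=400, cost_per_response=0.10,
    review_cost=60.0, verified_outcomes=250,
)
print(round(cheap, 2), round(premium, 2))
```

Workflow A is five times cheaper per response in this example, yet workflow B costs less per verified outcome once retries and review are counted.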

Do not bury risk inside an average ROI number

Risk adjustment should make uncertainty visible, not create false precision. Use three layers:

Hard guardrails: security and policy conditions that trigger containment or rollback regardless of financial upside.
Observed risk indicators: error, hallucination, escalation, complaint, false-positive, and false-negative rates tracked by workflow and cohort.
Financial adjustment: expected loss deducted from net value only when the probability and impact assumptions are credible enough for finance and risk owners to accept.

Do not let a low-frequency, high-consequence failure disappear inside a high average success rate. If the downside cannot be defensibly monetized, keep it as an explicit decision constraint rather than assigning it a convenient dollar value.

Prove incrementality before claiming impact

The strongest ROI calculation still fails if the attribution is weak. A before-and-after improvement may come from seasonality, pricing, traffic quality, a support policy change, or another product release. The AI capability needs a counterfactual: what would have happened to comparable eligible users without it?

Use an A/B test or holdout whenever the product and risk profile allow it. Make these choices before launch:

Assignment unit: Randomize at the level where the outcome occurs. If expansion is measured per account, account-level assignment can prevent users in the same customer organization from receiving conflicting experiences.
Primary outcome: Pick the metric that determines success and keep diagnostic metrics secondary.
Minimum detectable effect: Precompute the smallest lift worth detecting based on the baseline, available population, and business value. If the experiment cannot detect a decision-relevant change, extending the metric list will not fix it.
Guardrails: Test quality, escalation, complaints, security, and policy outcomes alongside the primary metric.
Analysis population: For a product-level ROI claim, analyze eligible users according to their assigned experience. Looking only at people who voluntarily used the agent introduces selection bias.
Measurement horizon: Keep the holdout long enough to observe the outcome named in the contract. Leading indicators can guide iteration, but they should not be substituted for retention, churn, net revenue retention, or other lagging outcomes.
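For the minimum detectable effect, a quick power calculation shows whether the available population can support the decision at all. The sketch below uses the standard normal-approximation formula for comparing two proportions with equal arms and a two-sided test. The default alpha and power are common conventions, not values from this article, and an experimentation platform may use a different method.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

def can_detect(available_per_arm, baseline_rate, mde_abs):
    """True when the eligible population is large enough for the chosen effect."""
    return available_per_arm >= sample_size_per_arm(baseline_rate, mde_abs)

# Hypothetical case: 10% baseline conversion, 2-point lift worth acting on.
needed = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
print(needed, can_detect(2_000, 0.10, 0.02))
```

Halving the effect size roughly quadruples the required sample, which is why the smallest lift worth detecting should come from business value rather than optimism.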

If randomization is not practical, use a fixed holdout or a frozen comparison period and document the limitations. A weaker design can still inform a decision, but the ROI claim should carry less confidence. Do not quietly promote correlation to causation because the rollout has executive attention.

Interpret the result as a system. Suppose self-service completion rises but the business outcome does not. The agent may be solving a low-value task, attracting users who would have converted anyway, or shifting effort to a later step. If conversion improves while complaints or policy violations cross the guardrail, the value hypothesis may be valid but the implementation is not ready to scale.

This is eval-driven development applied to product economics: define acceptable behavior and business success, measure both under controlled conditions, diagnose the failures, and repeat the test after a meaningful change.

Turn ROI into a portfolio operating system

A one-time business case goes stale as models, prompts, traffic, user behavior, and operating costs change. Maintain an Agent Analytics view for every production capability.

Each agent scorecard should show:

The primary business outcome and current experiment result.
Leading journey metrics from eligibility through verified completion.
Revenue contribution, cost avoided, and total operating cost using the agreed definitions.
Quality and risk guardrails, including escalations and human-review events.
Performance by relevant customer, task, and journey cohort.
The agent, model, policy, and workflow version associated with the result.
The current decision status: exploring, launching, scaling, redesigning, contained, or retiring.

Use the dashboard to make portfolio decisions, not merely to report trends:

Scale when the primary outcome clears the precommitted threshold, guardrails hold, net value is positive, and the result remains credible across the cohorts that matter.
Redesign when leading indicators improve but the business outcome does not, or when human review and escalation erase the economic gain.
Contain or roll back when a hard security, policy, or customer-harm threshold is breached, even if average financial performance is positive.
Retire when controlled measurement shows no decision-relevant incrementality or when dependable operation costs more than the value created.
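These rules can be written down as an explicit function so the decision is reproducible. The sketch below is one possible encoding. The field names are hypothetical, and the order in which redesign is checked before retire is my assumption, since the rules above do not rank them.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    hard_guardrail_breached: bool
    incrementality_detected: bool
    primary_outcome_met: bool
    leading_indicators_improved: bool
    guardrails_hold: bool
    credible_across_cohorts: bool
    net_value: float

def decide(card):
    """Map a scorecard to a portfolio decision; hard guardrails always win."""
    if card.hard_guardrail_breached:
        return "contain or roll back"
    if (card.primary_outcome_met and card.guardrails_hold
            and card.credible_across_cohorts and card.net_value > 0):
        return "scale"
    if card.leading_indicators_improved and not card.primary_outcome_met:
        return "redesign"
    if card.primary_outcome_met and card.net_value <= 0:
        return "redesign"  # review and escalation cost erased the gain
    if not card.incrementality_detected or card.net_value <= 0:
        return "retire"
    return "keep exploring"

# Hypothetical scorecard: profitable on average, but a hard threshold was breached.
risky = Scorecard(True, True, True, True, False, True, 40_000.0)
print(decide(risky))
```

Checking the hard guardrail first is what keeps a positive average from overriding a security, policy, or customer-harm breach.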

Review operational signals with frontline teams because they can explain patterns hidden by aggregate metrics. Review portfolio value in QBRs with product, data, finance, and risk owners so investment follows evidence rather than novelty.

Only accelerate adoption after the workflow has demonstrated unit value. In-app guides, product tours, and lifecycle nudges can bring more eligible users into a validated flow. Measure whether those interventions increase the business outcome, not merely clicks or agent sessions. Scaling exposure to an unproven workflow scales its cost and risk as readily as its potential benefit.

Key takeaways

Treat ROI as a precommitted decision rule for launch, scale, redesign, rollback, or retirement.
Connect model behavior to customer behavior and then to revenue, cost, or risk through a driver tree.
Freeze the baseline, cost boundary, guardrails, attribution method, and success thresholds before results arrive.
Credit only incremental revenue and verified avoided cost. Keep created capacity separate from cashable savings.
Include model consumption, labeling, evaluation, human review, and oversight in the operating cost.
Use controlled experiments or holdouts, with a decision-relevant minimum detectable effect, to separate causal impact from correlation.
Keep severe risk conditions as explicit constraints when they cannot be responsibly converted into a financial estimate.
Scale adoption only after the AI workflow has shown positive unit value under acceptable risk.

Pick one high-friction customer journey and complete its measurement contract before the next roadmap review. If the team cannot name the baseline, control, primary outcome, cost boundary, guardrails, and decision thresholds, the capability is still an exploration. Label it honestly, instrument it properly, and earn the right to make an ROI claim.

May 15, 2026

No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.

To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.

To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.

Inspired by this post on Amplitude – Best Practices.

May 13, 2026

Knowledge Management for AI Sales Agents: A Practical System

Your AI sales agent answers the pricing question, then recommends the wrong plan. It identifies a promising buyer, then sends the conversation to the wrong queue. If the underlying facts, decision rules, or routing policy are missing, another prompt adjustment cannot fix the problem.

You need a knowledge operating system, not a larger folder of sales collateral. The goal is to give the agent the smallest reliable path from a buyer’s question to an accurate answer, an appropriate recommendation, useful qualification, and the correct next step.

Key takeaways

Separate product facts, decision guidance, and execution policy. Each solves a different part of the sales conversation.
Turn long documents into focused, approved knowledge records with an owner, scope, effective date, and explicit boundaries.
Launch a complete sales motion for a narrow set of buyer intents before trying to document everything.
Test recommendations, qualification, routing, and escalation behavior, not just whether the agent can repeat a correct sentence.
Convert unanswered, incorrect, and disengaged conversations into a managed improvement queue.

Design for answering, recommending, qualifying, and routing

A conventional knowledge base helps someone find information. An AI sales agent has a harder job: it must interpret a buyer’s situation and decide what to do with the information it retrieves.

A language model does not inherently know your current plans, qualification criteria, commercial boundaries, or customer-specific use cases. That context is unique to your business and must be made explicit. Fluency cannot compensate for a missing policy.

I find it useful to divide sales knowledge into three layers:

Knowledge layer	What it contains	What the agent should do with it	Typical failure when it is missing
Product facts	Pricing, plan structure, capabilities, limitations, availability, and supported use cases	Give a direct, accurate answer	The agent guesses, gives a vague response, or repeats obsolete information
Decision guidance	Plan-fit logic, relevant constraints, case studies, approved comparisons, and the context behind product facts	Explain which option fits and why	The answer is technically correct but does not help the buyer decide
Execution policy	Qualification questions, required fields, routing conditions, escalation rules, and actions the agent may take	Advance the conversation within defined authority	The agent collects irrelevant details, makes an unsupported commitment, or routes the buyer incorrectly

This distinction exposes why uploading product pages is not enough. Public pages and product documentation are useful starting points, but a capable inbound motion also needs FAQs, pricing explanations, case studies, competitive material, qualification criteria, and internal sales guidance.

Facts answer, “What does the product do?” Decision guidance answers, “Is this appropriate for my situation?” Execution policy answers, “What should happen next?” Audit your knowledge against all three questions.

Set a hard boundary around commercial exceptions. The agent should not infer an unlisted discount, invent a contractual commitment, or turn an internal hypothesis into a buyer-facing claim. It should state the approved terms, gather the information required by policy, and route the exception to an authorized person. A plausible but unauthorized promise can create financial and legal exposure.

Turn scattered documents into governed knowledge

Build sales-ready knowledge records

A long document can be correct and still be poor input for an agent. Pricing may be buried below an obsolete introduction. A feature table may omit the condition that changes plan fit. A battlecard may combine approved facts with a rep’s unverified notes.

Convert those documents into focused records. Each record should cover one buyer intent or one tightly related decision. Use a consistent template:

Buyer intent: The question or decision this record addresses, including common alternative phrasings.
Approved answer: The direct response the agent may give without qualification.
Decision context: Why the fact matters and when it changes the recommendation.
Constraints and exceptions: What the answer does not cover, including conditions that require clarification.
Next question or action: The appropriate follow-up, qualification step, route, or escalation.
Scope: The plans, markets, customer types, channels, or agents allowed to use the record.
Evidence location: The canonical product, pricing, or policy record from which the answer was derived.
Owner and approver: The people accountable for accuracy and authorization.
Lifecycle metadata: Effective date, review status, and whether the record replaces an earlier version.
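The template maps naturally onto a typed record. The sketch below is a hypothetical shape with one possible usability check; the field names follow the template, while the status values and the example content are placeholders I made up.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class KnowledgeRecord:
    buyer_intent: str
    alternative_phrasings: list
    approved_answer: str
    decision_context: str
    constraints: str
    next_action: str
    scope: list
    evidence_location: str
    owner: str
    approver: str
    effective_date: date
    review_status: str   # assumed values: "draft", "approved", "retired"
    replaces: str = ""   # identifier of the superseded record, if any

def is_usable(record, today):
    """A record is usable only when approved, owned, and already in effect."""
    return (
        record.review_status == "approved"
        and bool(record.owner)
        and bool(record.approver)
        and record.effective_date <= today
    )

# Placeholder record; none of this content describes a real product.
record = KnowledgeRecord(
    buyer_intent="Which plan supports multiple workspaces?",
    alternative_phrasings=["Can I run separate teams on one account?"],
    approved_answer="(approved answer text)",
    decision_context="Changes the recommendation when the buyer has more than one team.",
    constraints="Does not cover custom contracts.",
    next_action="Ask how many teams need separate workspaces.",
    scope=["web chat"],
    evidence_location="(canonical pricing record)",
    owner="(named knowledge owner)",
    approver="(named domain approver)",
    effective_date=date(2026, 5, 1),
    review_status="approved",
)
print(is_usable(record, today=date(2026, 5, 13)))
```

Keeping the check in code means a draft or a record without a named owner never reaches the agent by accident.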

For example, a record about plan fit should not stop after naming a plan. It should state the relevant requirement, identify the condition that changes the answer, give the agent an approved follow-up question, and define where to route a buyer whose situation falls outside the standard policy. The recommendation then becomes reproducible rather than improvised.

Keep buyer language in the record. Prospects rarely use your internal taxonomy, and the same intent may appear as a product question, an outcome question, or a comparison. Alternative phrasing helps the retrieval layer recognize that these expressions belong to the same approved answer.

Create an authority hierarchy

A centralized repository is valuable only if the agent can distinguish current authority from historical residue. Define the hierarchy before connecting more content:

Designate one canonical record for each product fact, commercial rule, or routing policy.
Make approved sales explanations point back to that record rather than becoming independent versions of the truth.
Treat scripts and examples as phrasing aids unless they are explicitly approved to carry facts.
Keep drafts, call notes, chat fragments, and retired material outside the agent’s usable knowledge until they are reviewed.

Do not let the agent reconcile conflicting records by choosing the newest upload or blending the language. When two approved items disagree, the safe behavior is to withhold the disputed claim, follow the defined escalation path, and send the conflict to its owner.
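A minimal sketch of that safe behavior, assuming each approved record exposes an answer and an owner. The dictionary keys and action labels are illustrative.

```python
def resolve_claim(approved_records):
    """Answer only when approved records agree; otherwise withhold and escalate."""
    if not approved_records:
        return {"action": "escalate", "reason": "no approved record", "notify": []}
    answers = {record["answer"] for record in approved_records}
    if len(answers) == 1:
        return {"action": "answer", "text": answers.pop()}
    # Disagreement: never pick the newest upload or blend the wording.
    owners = sorted({record["owner"] for record in approved_records})
    return {"action": "escalate",
            "reason": "conflicting approved records",
            "notify": owners}

# Hypothetical conflict between two approved records for the same fact.
conflict = [
    {"answer": "(terms A)", "owner": "pricing owner", "uploaded": "2026-03-01"},
    {"answer": "(terms B)", "owner": "sales enablement owner", "uploaded": "2026-04-15"},
]
print(resolve_claim(conflict))
```

The upload date is present in the data but deliberately ignored, which is the whole point of the rule.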

Ownership should also be specific. A knowledge owner maintains the record. A domain approver authorizes sensitive claims. An operations owner monitors how the agent uses the record in conversations. One person may hold more than one role, but every role needs a name rather than a department-shaped placeholder.

Target knowledge by audience and action

Internal knowledge is not automatically buyer-facing knowledge. A qualification score may guide routing without being disclosed. A battlecard may help frame an approved comparison without exposing internal commentary. A security question may require an authorized answer rather than the agent’s summary of a sales note.

Mark each record as buyer-answerable, decision-only, action-only, or restricted. Then expose only the appropriate material to each agent, channel, market, and sales motion. This kind of centralization and content targeting reduces duplication while keeping internal policy separate from the words a prospect sees.
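In practice this can be a simple filter applied before retrieval. The sketch below assumes each record carries an exposure class plus the channels and markets it is scoped to; the record identifiers and key names are hypothetical.

```python
EXPOSURE_CLASSES = {"buyer-answerable", "decision-only", "action-only", "restricted"}

def buyer_visible(records, channel, market):
    """Identifiers of records whose wording may be shown to a prospect here."""
    visible = []
    for record in records:
        if record["exposure"] not in EXPOSURE_CLASSES:
            raise ValueError(f"unknown exposure class: {record['exposure']}")
        if (record["exposure"] == "buyer-answerable"
                and channel in record["channels"]
                and market in record["markets"]):
            visible.append(record["id"])
    return visible

# Hypothetical records for one sales motion.
records = [
    {"id": "pricing-overview", "exposure": "buyer-answerable",
     "channels": ["web chat"], "markets": ["US"]},
    {"id": "qualification-score", "exposure": "decision-only",
     "channels": ["web chat"], "markets": ["US"]},
    {"id": "security-sales-note", "exposure": "restricted",
     "channels": ["web chat"], "markets": ["US"]},
]
print(buyer_visible(records, channel="web chat", market="US"))
```

Decision-only and action-only records would be loaded through separate paths that influence routing without contributing buyer-facing text.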

Launch the smallest complete sales motion

Trying to clean every sales document before launch creates a long project with no conversational evidence. Launching with disconnected FAQs creates a different failure: the agent answers isolated questions but cannot move the buyer forward.

The better unit of scope is a complete motion for a bounded set of intents. For each selected intent, the agent needs an answer, the relevant fit logic, the next qualification question, a route or resolution, and an escalation path.

Prioritize by demand and consequence

Start with questions that appear repeatedly, delay buyers, consume sales time, signal meaningful intent, or cause material damage when answered incorrectly. Pricing, plan differences, core capabilities, common use cases, qualification, and routing are natural candidates when they dominate your actual inbound conversations.

Two simple calculations help quantify repetitive work: team time reclaimed = average response composition time x question frequency, while buyer wait avoided = number of prospects asking x average response time. These calculations are useful for prioritizing knowledge work, but neither should be presented as revenue without downstream evidence.

Add consequence to the ranking. A frequent low-risk question may save time, but an infrequent error involving price, eligibility, security, or a contractual promise may deserve earlier treatment. Frequency tells you where the volume is. Consequence tells you where control matters.
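The two calculations and the consequence adjustment can be combined into a rough ranking. In this sketch the consequence weight is a hypothetical 1-to-5 judgment supplied by the team, and multiplying it by time reclaimed is my assumption about how to combine the two signals, not a rule from the text.

```python
def team_time_reclaimed(avg_composition_minutes, question_frequency):
    """Average response composition time x question frequency."""
    return avg_composition_minutes * question_frequency

def buyer_wait_avoided(prospects_asking, avg_response_minutes):
    """Number of prospects asking x average response time."""
    return prospects_asking * avg_response_minutes

def rank_intents(intents):
    """Order intents by time reclaimed, scaled by an assumed consequence weight."""
    def score(intent):
        reclaimed = team_time_reclaimed(intent["composition_minutes"],
                                        intent["frequency"])
        return reclaimed * intent["consequence_weight"]
    return [i["name"] for i in sorted(intents, key=score, reverse=True)]

# Hypothetical monthly figures for three buyer intents.
intents = [
    {"name": "integration list", "frequency": 120,
     "composition_minutes": 4, "consequence_weight": 1},
    {"name": "contract exception", "frequency": 15,
     "composition_minutes": 10, "consequence_weight": 5},
    {"name": "plan differences", "frequency": 80,
     "composition_minutes": 6, "consequence_weight": 2},
]
print(rank_intents(intents))
```

The most frequent intent ends up last in this example because an error there costs little, while the rare contract question moves up on consequence alone.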

Your initial release is ready when the selected motion has:

Approved answers for the recurring and commercially important questions in scope.
Plan-fit or use-case guidance where the buyer needs a recommendation rather than a fact.
Explicit qualification fields and follow-up questions.
Routing rules for the standard paths.
A visible no-answer and human-escalation path.
Owners and lifecycle metadata for every active record.
A representative evaluation set based on real buyer language.

Test the conversation, not the sentence

A retrieval test can show that the right paragraph was found. It cannot show that the agent asked the necessary follow-up, respected a restriction, or routed the lead correctly. Evaluate the complete interaction.

Your test set should include direct questions, paraphrases, multi-part questions, ambiguous requests, outdated assumptions, missing qualification details, requests for exceptions, and scenarios that should be escalated. For each case, check:

Is every factual claim aligned with approved knowledge?
Does the response answer the buyer’s actual question before adding detail?
Does the recommendation apply the right conditions rather than matching a keyword?
Does the agent ask only for information required by the qualification policy?
Does it avoid unsupported commitments and internal-only language?
Does the final route or escalation match the execution rule?

Record pass or fail at the behavior level and attach a reason to every failure. Do not allow a strong average score to conceal a severe pricing or policy error. High-consequence failures should block that behavior from release until the knowledge or policy is corrected.
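
The release rule above can be sketched as a small gate. The `CaseResult` structure, the behavior names, and the sample results are hypothetical; the point is that the gate looks at individual high-consequence failures rather than the average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    behavior: str
    passed: bool
    high_consequence: bool
    reason: str = ""  # required whenever passed is False


def blocked_behaviors(results):
    """Behaviors with at least one failed high-consequence case."""
    return sorted({r.behavior for r in results if r.high_consequence and not r.passed})


def failures_missing_reason(results):
    """Failed cases that were recorded without an explanation."""
    return [r for r in results if not r.passed and not r.reason.strip()]


results = [
    CaseResult("plan_comparison", passed=True, high_consequence=False),
    CaseResult("plan_comparison", passed=True, high_consequence=False),
    CaseResult("routing", passed=True, high_consequence=False),
    CaseResult("pricing", passed=True, high_consequence=True),
    CaseResult("pricing", passed=False, high_consequence=True,
               reason="quoted a retired discount"),
]

pass_rate = sum(r.passed for r in results) / len(results)
print(f"Overall pass rate: {pass_rate:.0%}")
print("Blocked from release:", blocked_behaviors(results))
```

In this sample the overall pass rate is 80 percent, yet the pricing behavior stays blocked until its failure is corrected.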

Use a controlled launch with an obvious human path. The point is to begin collecting real conversational evidence early, not to claim autonomy before the boundaries are reliable. Fast deployment and continuous iteration work when the feedback loop is designed before traffic arrives.

Run the knowledge flywheel from real conversations

Once the agent is live, conversation failures become your most useful knowledge backlog. Do not place every poor result under a generic label such as “bad answer.” Classify the mechanism so the right owner can fix it.

Coverage gap: No approved record addresses the buyer’s intent.
Retrieval failure: The right knowledge exists, but the wrong record was selected or the right one was missed.
Freshness failure: The agent used information that should have been replaced or retired.
Guidance failure: The fact was correct, but the recommendation ignored relevant context.
Qualification failure: The agent skipped a required question, collected unnecessary information, or misread the answer.
Routing failure: The collected information was correct, but the next action did not follow policy.
Boundary failure: The agent disclosed restricted material or made an unauthorized claim.
Conversation failure: The content was accurate, but the response was unclear, repetitive, or poorly sequenced.

Turn each confirmed failure into a work item containing the conversation, intent, root cause, affected knowledge record, accountable owner, proposed change, and regression test. A change is not complete when the wording is edited. It is complete when the test passes, the approved version is published, and conflicting text is retired.
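
A minimal sketch of such a work item, assuming a simple in-memory record. The class and field names are hypothetical; the failure types mirror the classification above, and `is_complete` encodes the three completion conditions.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    COVERAGE_GAP = "coverage gap"
    RETRIEVAL = "retrieval failure"
    FRESHNESS = "freshness failure"
    GUIDANCE = "guidance failure"
    QUALIFICATION = "qualification failure"
    ROUTING = "routing failure"
    BOUNDARY = "boundary failure"
    CONVERSATION = "conversation failure"


@dataclass
class WorkItem:
    conversation_id: str
    intent: str
    root_cause: FailureType
    knowledge_record: str
    owner: str
    proposed_change: str
    regression_test: str
    test_passed: bool = False
    approved_version_published: bool = False
    conflicting_text_retired: bool = False

    def is_complete(self) -> bool:
        """Editing the wording is not enough; all three conditions must hold."""
        return (
            self.test_passed
            and self.approved_version_published
            and self.conflicting_text_retired
        )


item = WorkItem(
    conversation_id="conv-001",            # hypothetical identifiers
    intent="plan differences",
    root_cause=FailureType.FRESHNESS,
    knowledge_record="kb-plans-overview",
    owner="product marketing",
    proposed_change="Replace the retired plan table",
    regression_test="eval-plan-differences-03",
)
item.test_passed = True
print(item.is_complete())  # still open: nothing published or retired yet
```

Keeping the three completion flags separate makes it visible which step is still outstanding.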

Track a small set of operational signals that lead to decisions:

Signal	What it tells you	What to do with it
Coverage	Which eligible buyer intents have an approved answer and action path	Add knowledge where demand and consequence justify it
Reviewed correctness	Whether sampled claims match the approved record	Repair facts, retrieval, or response generation
Knowledge conflicts	Where active records disagree or overlap ambiguously	Resolve authority and retire obsolete material
Qualification completion	Whether required information was collected for eligible conversations	Improve questions, field definitions, or sequencing
Routing compliance	Whether the next action matched the approved rule	Correct policy logic or integrations
Buyer progression	Whether the conversation reached its intended next step	Inspect guidance and friction, then validate changes against downstream outcomes
Content health	Which active records lack an owner, approval, scope, or lifecycle status	Repair governance before stale content becomes a live failure

Review these signals by intent and sales path. A global average can look acceptable while one plan, market, or routing branch fails repeatedly. Conversion can be a useful downstream outcome, but it is not proof that a knowledge change caused the result. Traffic mix, offer changes, seasonality, and human follow-up can also move it. Use controlled comparisons where practical and pair outcome data with conversation-level review.
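
To illustrate how a global average can hide a failing branch, here is a small sketch that breaks routing compliance down by intent and sales path. The tuple layout and the sample conversations are hypothetical.

```python
from collections import defaultdict


def compliance_by_segment(conversations):
    """Routing compliance per (intent, sales_path) segment.

    conversations: iterable of (intent, sales_path, routed_per_policy) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [compliant, total]
    for intent, sales_path, routed_per_policy in conversations:
        segment = counts[(intent, sales_path)]
        segment[0] += int(routed_per_policy)
        segment[1] += 1
    return {key: compliant / total for key, (compliant, total) in counts.items()}


# Hypothetical sample: one healthy branch and one small failing branch.
conversations = (
    [("pricing", "self-serve", True)] * 45
    + [("pricing", "enterprise", True)] * 1
    + [("pricing", "enterprise", False)] * 4
)

global_rate = sum(ok for _, _, ok in conversations) / len(conversations)
by_segment = compliance_by_segment(conversations)

print(f"Global compliance: {global_rate:.0%}")
for segment, rate in sorted(by_segment.items()):
    print(segment, f"{rate:.0%}")
```

Here the global figure is 92 percent while the enterprise branch complies in only one of five conversations, which is the pattern the segment review is meant to catch.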

Reserve a recurring weekly block for unanswered questions, disengaged prospects, high-consequence errors, and unresolved conflicts. Process pricing, product, and policy changes as immediate knowledge events rather than waiting for the review block. This is where knowledge management becomes an operating responsibility instead of a cleanup project.

The compounding effect comes from the loop: approved knowledge improves conversations; conversations reveal gaps; each repaired gap becomes a reusable capability. That is why small, well-chosen content improvements can have effects beyond the original conversation.

Start with your recent inbound conversations. Choose the most repeated unanswered question, the most consequential incorrect answer, and one routing failure. Convert each into an owned knowledge record, add a regression test, and release only the behavior that passes. That small loop is the foundation of a sales agent you can trust with progressively more of the funnel.

References

Intercom – The Ultimate Knowledge Management Playbook to Supercharge Your AI Sales Agent

May 13, 2026

From Internal FinOps Agents to Customer-Embedded Optimization

Your cloud-cost agent can identify the line item that moved and still fail to change a single decision. The gap appears after the diagnosis: the recommendation arrives without the product, pricing, ownership, and risk context needed to act.

If you are taking an internal FinOps capability into the customer experience, design for a closed decision loop. The goal is not autonomous cost cutting. It is a governed system that connects spend to customer value, recommends the next move, and proves whether the move worked.

Design a decision loop, not another cost dashboard

Start by naming the decision your product will improve. A broad promise such as “optimize cloud spend” gives the agent no useful boundary. A better contract is: detect a material change in workload cost, identify the most plausible driver, propose one permitted response, route it to the right owner, and verify the effect.

Draw the product boundary around an outcome

The operating loop is simple to describe: observe, explain, propose, authorize, execute, and verify. A dashboard normally stops at observe or explain. An agentic FinOps workflow carries evidence into a recommendation and then closes the loop with an approved action and post-action telemetry.

Agentic does not mean unrestricted. It means the agent can select the next permitted step based on context. Deterministic services should still perform calculations, enforce policies, check permissions, and execute infrastructure changes. Use the model where interpretation is valuable: reconciling signals, building a driver narrative, identifying missing context, explaining tradeoffs, and routing a decision.

That distinction matters in FinOps. A model should not improvise a billing calculation, invent a price, or bypass a commitment policy. If a calculation has one correct result, compute it in code and give the result to the agent as evidence.
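
As a sketch of that division of labor, a deterministic function can compute the cost change and hand the result to the agent as a structured evidence record. The function name, the field names, and the figures are hypothetical; `Decimal` is used so that currency arithmetic is exact.

```python
from decimal import Decimal


def cost_change_evidence(workload: str, previous: Decimal, current: Decimal) -> dict:
    """Compute the cost delta in code and package it as evidence for the agent."""
    delta = current - previous
    # Percent change is undefined when the previous cost is zero.
    percent = (delta / previous * 100).quantize(Decimal("0.1")) if previous else None
    return {
        "workload": workload,
        "previous_cost": str(previous),
        "current_cost": str(current),
        "delta": str(delta),
        "percent_change": None if percent is None else str(percent),
        "computed_by": "deterministic service",  # not generated by the model
    }


evidence = cost_change_evidence(
    "staging-batch", Decimal("1200.00"), Decimal("1500.00")
)
print(evidence)
```

The agent can then explain or route the change, but it never has to produce the numbers itself.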

Build four layers with explicit responsibilities

Evidence layer: Billing exports, usage metering, observability, product telemetry, pricing logic, feature flags, deployment activity, environment metadata, customer segmentation, and ownership records.
Reasoning layer: Driver trees, anomaly triage, competing explanations, confidence evidence, and recommendation selection.
Action layer: Policy checks, approval routing, change preparation, execution, rollback, and escalation.
Learning layer: Post-action telemetry, realized outcomes, agent evaluations, customer feedback, and recurring patterns that belong in the product roadmap.

A retrieval-first pipeline that combines billing, usage, observability, product, and go-to-market context is more useful than a large prompt containing a monthly cost export. Retrieve the records needed for the current decision and preserve their lineage. Every recommendation should reveal which records were used, when they were updated, which pricing assumptions applied, and what the agent could not retrieve.

Customer-facing retrieval adds another non-negotiable boundary: tenant isolation must be enforced before context reaches the model. Do not rely on a prompt to prevent cross-customer disclosure. Access control belongs in the retrieval and service layers, with the resulting access decision recorded in the audit trail.
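
A minimal sketch of enforcing the tenant boundary in the retrieval layer, before any text is assembled into model context. The `Record` type, the function name, and the audit entry shape are hypothetical; a real system would enforce this in the data store or service layer as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    tenant_id: str
    text: str


def retrieve_for_tenant(candidates, tenant_id, audit_log):
    """Return only the caller's records and log the access decision."""
    allowed = [r for r in candidates if r.tenant_id == tenant_id]
    audit_log.append({
        "tenant_id": tenant_id,
        "returned": [r.record_id for r in allowed],
        "withheld_count": len(candidates) - len(allowed),
    })
    return allowed


candidates = [
    Record("r1", "tenant-a", "Tenant A billing export summary"),
    Record("r2", "tenant-b", "Tenant B billing export summary"),
    Record("r3", "tenant-a", "Tenant A deployment activity"),
]
audit_log = []
context = retrieve_for_tenant(candidates, "tenant-a", audit_log)
print([r.record_id for r in context], audit_log[-1])
```

Because the filter runs before prompt construction, the model never sees another tenant's records, and the audit entry records what was returned and how much was withheld.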

Start with one anomaly and one reversible response

Your first release does not need to optimize every cloud service. A practical thin slice is anomaly detection plus one high-leverage remediation path. For example, the agent might detect a change in non-production workload cost, connect it to a schedule change, prepare a schedule correction, request approval from the workload owner, and monitor the next usage window.

Choose a first action that is bounded and reversible. A scheduling correction is easier to inspect and undo than a long-term financial commitment or a production capacity change. The purpose of the thin slice is to prove the whole operating loop, not merely the anomaly model.

Make every recommendation safe enough to act on

A recommendation without an execution envelope is an opinion. It may be correct, but the recipient still has to reconstruct the evidence, find the owner, assess the downside, and decide how to validate it. That is where apparently intelligent systems create more work than they remove.

Use a recommendation contract

Treat every agent recommendation as a structured product object. At minimum, require these fields:

Decision: The exact choice the recipient is being asked to make.
Scope: The account, workload, service, environment, and time window affected.
Owner: The person or role accountable for the workload and the person authorized to approve the action.
Evidence: Links to the billing, usage, observability, deployment, and product records that support the diagnosis, including their freshness.
Driver path: The causal chain the agent believes explains the change, plus material alternative explanations it considered.
Proposed action: The change, its expected mechanism, and any assumptions behind an estimated effect. If the effect cannot be estimated reliably, say that it is unknown.
Confidence and unknowns: Evaluation-backed confidence evidence, missing context, and conditions that would invalidate the recommendation.
Execution envelope: Policy checks, blast radius, approver, expiration, rollback procedure, and escalation path.
Verification plan: The telemetry, observation window, success condition, and stop condition used after the action.

The expiration field is easy to overlook. Cloud state changes quickly enough that an old recommendation can remain plausible after its evidence has gone stale. Expire the recommendation when its pricing, topology, deployment, or usage assumptions are no longer current. Force a fresh retrieval before execution.
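
The contract and its expiration check can be sketched as a structured object. This is an illustration, not a prescribed schema: the class name, field names, and sample values are hypothetical, and several contract fields are flattened into plain strings for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Recommendation:
    decision: str
    scope: str
    owner: str
    approver: str
    evidence_updated_at: dict  # evidence link -> time the record was last updated
    driver_path: str
    proposed_action: str
    unknowns: list
    rollback: str
    verification_plan: str
    expires_at: datetime

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at

    def stale_evidence(self, now: datetime, max_age: timedelta) -> list:
        """Evidence links whose records are older than the allowed age."""
        return sorted(
            link for link, updated in self.evidence_updated_at.items()
            if now - updated > max_age
        )

    def ready_to_execute(self, now: datetime, max_age: timedelta) -> bool:
        return not self.is_expired(now) and not self.stale_evidence(now, max_age)


created = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)  # hypothetical timestamp
rec = Recommendation(
    decision="Restore the overnight shutdown schedule?",
    scope="non-production batch workload, last 7 days",
    owner="workload owner",
    approver="workload owner",
    evidence_updated_at={
        "billing-export": created,
        "deploy-log": created - timedelta(hours=2),
    },
    driver_path="schedule change -> instances running overnight -> higher cost",
    proposed_action="Re-enable the previous schedule",
    unknowns=["effect size not estimated"],
    rollback="Disable the schedule again",
    verification_plan="Compare the next usage window with the baseline",
    expires_at=created + timedelta(hours=24),
)

one_day = timedelta(hours=24)
print(rec.ready_to_execute(created + timedelta(hours=1), one_day))
print(rec.ready_to_execute(created + timedelta(hours=25), one_day))
```

When the second check fails, the intended response is a fresh retrieval and a new recommendation, not an extension of the old one.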

Grant autonomy by action class

Do not give an agent one global autonomy setting. Earn autonomy independently for each action class:

Observe: Detect and organize a possible anomaly.
Explain: Build a driver tree and expose supporting evidence without proposing a change.
Recommend: Propose an action while a human retains approval and execution.
Prepare: Generate a change plan or dry run, but require an authorized owner to apply it.
Execute within policy: Apply a reversible, bounded action only when the policy engine, permissions, evidence freshness, and rollback checks all pass.

Purchasing a cloud commitment or altering production resources can create real financial or availability exposure. Keep finance and service owners in the approval path until confidence evidence and post-action telemetry demonstrate reliable performance for that specific intervention. Good results on anomaly explanations do not establish that the same agent is safe to execute infrastructure changes.
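
One way to represent per-class autonomy is a lookup from action class to the highest level it has earned, combined with the four execution checks. The action class names and the `GRANTED` table are hypothetical; unknown action classes default to the lowest level.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    OBSERVE = 1
    EXPLAIN = 2
    RECOMMEND = 3
    PREPARE = 4
    EXECUTE_WITHIN_POLICY = 5


# Hypothetical grants: autonomy is earned separately for each action class.
GRANTED = {
    "nonprod_schedule_correction": Autonomy.EXECUTE_WITHIN_POLICY,
    "commitment_purchase": Autonomy.RECOMMEND,
}


def may_auto_execute(action_class, *, policy_ok, permissions_ok,
                     evidence_fresh, rollback_ready):
    """Allow execution only for classes that earned it and only if every check passes."""
    level = GRANTED.get(action_class, Autonomy.OBSERVE)
    checks = (policy_ok, permissions_ok, evidence_fresh, rollback_ready)
    return level >= Autonomy.EXECUTE_WITHIN_POLICY and all(checks)


all_pass = dict(policy_ok=True, permissions_ok=True,
                evidence_fresh=True, rollback_ready=True)
print(may_auto_execute("nonprod_schedule_correction", **all_pass))
print(may_auto_execute("commitment_purchase", **all_pass))
```

A commitment purchase stays at the recommend level here even when every check passes, which keeps finance and service owners in the approval path.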

Governance should be visible in the product, not left in a policy document. Show the approver which data was accessed, which rules passed, who changed the recommendation, what action ran, and what happened afterward. Privacy-by-design, data controls, and transparent decision logs are part of the user experience when the system influences money and production infrastructure.

Evaluate the decision loop, not the prose

A polished explanation is not evidence of a useful agent. Build evaluations around the failure modes that can block or distort a decision:

Did the recommendation use the correct customer, workload, environment, price, and time window?
Can each material claim be traced to an underlying record?
Does the driver path match known cases, including cases with several plausible causes?
Does the agent abstain when ownership, telemetry, or pricing context is missing?
Did approval routing and policy enforcement behave correctly?
Can the recipient perform the proposed action without reconstructing missing steps?
Did post-action telemetry confirm the expected direction of change without creating an unacceptable operational tradeoff?

Put retrieval changes, prompts, policies, and tools through the same delivery discipline as application code. Eval-driven development, CI/CD, and a weekly shipping cadence make regressions visible before a persuasive but poorly grounded recommendation reaches an operator or customer.

Embed the capability with customers before scaling it

The first customer version should not be a general-purpose cost chatbot. It should be a narrow, product-assisted engineering motion in which a Forward Deployed Engineer, or FDE, helps the customer connect product usage, cloud architecture, and cost-to-value.

Choose a small pod and customers that can teach you

A sensible starting shape is one FDE pod focused on two or three high-potential customers. High potential should not mean merely the largest cloud bill. Select customers where the team can access the necessary evidence, an accountable sponsor can authorize changes, the problem is likely to recur, and the customer agrees to clear data and governance boundaries.

Evidence readiness: Billing, metering, observability, pricing, and deployment context can be joined without weeks of manual reconciliation.
Decision access: An engineering, product, or finance owner can approve an intervention and explain the operational constraints.
Learning value: The problem represents a pattern that may apply beyond one account.
Measurability: The customer and FDE can agree on a cost-to-value measure before making a change.
Governance fit: Data access, retention, tenant isolation, approvals, and audit expectations are explicit.

If any of these conditions is absent, the engagement may still be commercially important, but it is a weak environment for deciding whether the agentic product works. Separate account urgency from product-learning quality.

Run a customer optimization loop that produces reusable knowledge

Define the value unit. Agree on what an active workload or valuable unit of product usage means. Total spend alone cannot distinguish efficient growth from contraction.
Establish the baseline. Record current cost per active workload, time-to-first-value, relevant deployment behavior, and the constraints the customer will not trade away.
Build the driver tree. Connect the spend change to services, environments, releases, product behavior, and customer usage. Surface gaps instead of filling them with assumptions.
Select one intervention. Prefer the smallest action that can test the diagnosis. Document the expected mechanism, approver, risk, and rollback before execution.
Verify the outcome. Compare post-action telemetry with the agreed baseline. Record savings, unit-economics movement, performance effects, adoption effects, and unintended consequences separately.
Codify the pattern. Capture the inputs, decision rule, action, exceptions, safeguards, and evidence required to repeat the intervention.
Send a weekly learning packet to product. Include successful patterns, failed diagnoses, missing platform capabilities, customer language, and recommendations that still depend on FDE judgment.

Within a quarter, this loop should make it possible to distinguish interventions that can be automated, patterns that should become native product features, and problems that still require deeper solutions engineering. The point is not to eliminate the FDE. It is to reserve that scarce judgment for cases where ambiguity and customer context remain material.

Make the commercial incentive legible

Customer-embedded optimization creates an obvious trust question for a consumption business: does the vendor want the customer to spend less or consume more? The clean answer is to optimize cost-to-value rather than either number in isolation.

A customer’s total cloud cost can rise while cost per active workload improves because valuable usage is growing. Total cost can also fall because the customer is using less of the product, which is not an optimization success. Label the outcome precisely: lower total spend, lower unit cost, avoided waste, shifted commitment, higher useful consumption, or reduced operational risk. Do not collapse these different effects into a generic savings claim.

The FDE is also a trust boundary. The role should explain the recommendation, expose assumptions, and represent the customer’s constraints. It should not become a human interface for repetitive exports and one-off queries that the platform ought to handle.

Turn field work into a roadmap, not permanent custom service

A strong FDE can make a weak product look successful by solving every gap manually. That is useful for an individual customer and dangerous for product strategy. You need an explicit test for moving work from the field into an agent workflow or native platform capability.

Apply a productization test to every recurring intervention

Can the same signal be retrieved reliably across the intended customer segment?
Can the decision logic be expressed without undocumented customer-specific knowledge?
Can the action be bounded by a stable policy, approval path, and rollback procedure?
Can the outcome be measured with telemetry that exists before and after the change?
Do the likely exceptions fit a review workflow, or do they fundamentally change the decision?

If the signal, decision, action, and measurement are repeatable, make the pattern a native feature or automated playbook. If the evidence is repeatable but judgment varies, keep an agentic workflow with human review. If the action carries high financial or availability risk, keep the FDE and accountable owner in the loop. If the pattern is a one-off, document it but resist turning it into product scope.
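
The decision rules in this paragraph can be sketched as a single function. The order in which the rules are checked and the final fallback label are assumptions made for the example; the paragraph itself does not rank them.

```python
def productization_path(*, recurring, signal_repeatable, decision_repeatable,
                        action_bounded, outcome_measurable, high_risk):
    """Map the productization test answers to one of the paths described above."""
    if not recurring:
        return "document, keep out of product scope"
    if high_risk:
        # Assumed precedence: risk overrides repeatability.
        return "keep FDE and accountable owner in the loop"
    if signal_repeatable and decision_repeatable and action_bounded and outcome_measurable:
        return "native feature or automated playbook"
    if signal_repeatable and not decision_repeatable:
        return "agentic workflow with human review"
    # Fallback label is an assumption for cases the rules do not cover.
    return "continue field work"


path = productization_path(
    recurring=True,
    signal_repeatable=True,
    decision_repeatable=False,
    action_bounded=True,
    outcome_measurable=True,
    high_risk=False,
)
print(path)
```

Writing the rules down this way also exposes the gaps, such as a recurring pattern whose signal cannot yet be retrieved reliably.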

Use a scorecard that reveals where the loop is breaking

Dimension	Measure	Decision it informs
Insight speed	Time-to-insight from a material spend change	Is the system finding the issue early enough to change an engineering decision?
Action quality	Recommendations with evidence, an owner, a permitted action, and a verification plan	Is the agent producing executable decisions or polished commentary?
Economics	Realized savings per recommendation and cost per active workload	Did the intervention improve spend or unit economics for the intended value unit?
Reliability	Post-action effects, abstentions, rollbacks, and policy failures by action class	Which interventions have earned more autonomy, and which need tighter controls?
Customer outcome	Time-to-first-value and net revenue retention (NRR) movement on FDE-supported accounts	Is the motion improving adoption and durable account value? NRR is directional evidence, not proof of causation.
Product leverage	Recurring field patterns converted into features, guardrails, or in-product guidance	Is customer work compounding into a scalable product?

Recommendation volume, prompt length, and agent activity are operating diagnostics, not business outcomes. A quiet system that changes a few high-value decisions can be more useful than an active system that produces hundreds of unactioned findings.

Make build versus buy a component decision

Do not treat the choice as one monolithic platform decision. Separate commodity capabilities from the context and workflow that create differentiation. Evaluate billing ingestion, normalization, anomaly detection, the context model, pricing logic, recommendation policy, approval routing, execution, and agent analytics independently.

Does the capability require knowledge of your architecture, pricing model, feature flags, customer usage, or deployment behavior?
Can an external component preserve evidence lineage, tenant isolation, and decision logs at the level your customers require?
Is the capability a generic input to the product, or is it where your product makes a differentiated decision?
Can your team evaluate and operate the component continuously, including regressions after model, prompt, policy, or data changes?
Will the component reduce time-to-value without trapping critical customer and pricing context in an opaque workflow?

Unique architecture, pricing, and growth loops can justify building the context and decision layers. But weak tagging, unclear ownership, and missing observability undermine either path. Fix those foundations before expecting an in-house or purchased agent to produce precise optimization decisions.

Give the core product to a product trio spanning product management, engineering, and FinOps. Bring FDE, customer success, SRE, finance, and security into discovery and evaluation where their decisions are affected. Field requests should enter the roadmap with evidence of recurrence, strategic importance, or platform leverage rather than becoming an informal side door to custom development.

Key takeaways

Define the product as observe, explain, propose, authorize, execute, and verify. Diagnosis alone is not an agentic outcome.
Retrieve billing, usage, observability, pricing, product, and ownership context for each decision, with lineage and tenant boundaries enforced outside the prompt.
Represent every recommendation as a governed contract containing evidence, owner, action, risk, approval, rollback, expiration, and verification.
Grant autonomy by action class. Keep humans in the loop for commitments and production changes until that intervention has reliable post-action evidence.
Start customer delivery with one FDE pod and two or three customers that offer evidence access, decision access, measurable value, and reusable learning.
Measure time-to-insight, realized outcomes, unit economics, reliability, customer value, and productized patterns instead of counting recommendations.

This week, choose one recurring cost anomaly and map the complete path from underlying records to a verified action. Name the owner, approval rule, rollback, and success telemetry before improving the prompt. Do not add a second workflow until the first can explain what changed, why the action was allowed, and whether it improved customer cost-to-value.

May 11, 2026

4 Costly Agent Analytics Myths—And the Data-Backed Metrics I Rely on Instead

In my work with product, operations, and support leaders, I’m often asked to help make sense of Agent Analytics—what to track, how to attribute outcomes, and where to invest. After reviewing countless dashboards and running experiments across human agents and AI agents, I’ve learned that some of the most common measurement beliefs are precisely the ones that lead teams astray.
This post covers what comes up in my conversations with leaders about Agent Analytics, and why not everything is what it seems.
Below, I unpack four pervasive myths I encounter and share the data-centered practices I use to replace them. My goal is simple: help you upgrade the way you measure performance so you can improve customer outcomes, accelerate learning, and scale impact with confidence.
Myth 1: “Lower average handle time (AHT) means higher performance.” AHT is useful but incomplete. When teams optimize solely for speed, they often push complexity into repeat contacts, reopens, or escalations. In the data, that shows up as a weak or negative relationship between lower AHT and durable outcomes like first contact resolution (FCR), customer effort, or revenue per conversation.
Reality and what I measure instead: I right-size speed by pairing AHT with intent-level resolution and recontact rate. For simple intents (password reset, billing address update), shorter is usually better. For complex intents (tiered troubleshooting, multi-step verification), “right-speeding” wins—slightly longer interactions that prevent rework. Practically, that means segmenting by intent complexity using behavioral analytics, tracking weighted “intent resolution rate,” and monitoring repeat-contact windows (24–168 hours) to catch downstream pain.
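
A minimal sketch of the recontact measurement, assuming contact records arrive as (customer, intent, timestamp) tuples. The function name, the definition of a recontact (same customer and same intent within the window), and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recontact_rate_by_intent(contacts, window_hours=168):
    """Share of contacts followed by another contact from the same customer
    about the same intent within the window."""
    window = timedelta(hours=window_hours)
    stamps_by_key = defaultdict(list)
    for customer, intent, timestamp in contacts:
        stamps_by_key[(customer, intent)].append(timestamp)

    followed = defaultdict(int)
    total = defaultdict(int)
    for (_, intent), stamps in stamps_by_key.items():
        stamps.sort()
        for i, timestamp in enumerate(stamps):
            total[intent] += 1
            if i + 1 < len(stamps) and stamps[i + 1] - timestamp <= window:
                followed[intent] += 1
    return {intent: followed[intent] / total[intent] for intent in total}


contacts = [
    ("c1", "password_reset", datetime(2026, 1, 5, 10)),
    ("c2", "password_reset", datetime(2026, 1, 5, 11)),
    ("c1", "troubleshooting", datetime(2026, 1, 6, 9)),
    ("c1", "troubleshooting", datetime(2026, 1, 7, 12)),  # 27 hours later
    ("c3", "troubleshooting", datetime(2026, 1, 6, 15)),
]
rates_7d = recontact_rate_by_intent(contacts, window_hours=168)
rates_24h = recontact_rate_by_intent(contacts, window_hours=24)
print(rates_7d, rates_24h)
```

Comparing the 24-hour and 168-hour windows shows how much downstream pain a short window would miss for complex intents.
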
Myth 2: “AI agent containment tells the whole story.” A high containment rate can mask failure modes such as unresolved intent, silent abandonment, or low-quality handoffs that frustrate customers and spike human workload later.
Reality and what I measure instead: I break containment into three parts for voice and chat flows: (1) intent resolution without escalation, (2) graceful handoff quality when escalation is necessary, and (3) post-handoff efficiency and satisfaction. For voice AI agent experiences, I also track escalation clarity (did the transcript summarize history and intent?), time-to-human, and customer satisfaction on the combined interaction. This provides a fuller view of how effective the customer support AI strategy is and avoids over-crediting automation for partial wins.
Myth 3: “Quality is subjective, so it can’t be measured at scale.” Teams often default to sporadic QA because they assume it can’t be standardized across channels or agent types. The result is noisy feedback loops and stalled coaching.
Reality and what I measure instead: Quality becomes measurable when it’s grounded in observable behaviors linked to outcomes. I use a rubric anchored in behavioral analytics (e.g., verified customer need, correct resolution path, policy compliance, empathy markers) and validate it via correlation with FCR, recontact, and retention analysis. To scale, I combine calibrated human reviews with AI-assisted scoring, check inter-rater reliability weekly, and use driver trees to connect quality levers to business results. This creates a consistent, coachable signal for both human agents and AI flows.
Myth 4: “If the dashboard is green after launch, we’ve won.” Early wins can reflect novelty effects, cherry-picked routing, or short-term incentives that don’t persist. Declaring victory too soon locks in fragile gains and hides regressions across cohorts.
Reality and what I measure instead: I treat go-live as the start of learning. I use A/B testing with a clear minimum detectable effect (MDE), stagger ramps, and hold out stable control cohorts for at least one full demand cycle. I set OKRs on outcomes rather than outputs, focusing on intent resolution, customer effort, and revenue and customer health instead of vanity metrics. I also monitor seasonality and channel mix shifts inside a unified analytics platform to ensure improvements generalize beyond the first week.
How I operationalize this day to day: (1) define intents and complexity upfront, (2) unify journey data across channels, (3) instrument resolution and recontact rigorously, (4) apply driver trees to isolate what actually moves outcomes, and (5) iterate via disciplined experiments rather than sweeping changes. This approach aligns product and operations, speeds up coaching, and ensures AI investments compound rather than decay.
If you’re rethinking your Agent Analytics stack, start by replacing each myth with a sharper metric: pair AHT with intent-level resolution, pair containment with handoff quality and satisfaction, pair QA with outcome-linked rubrics, and pair green dashboards with robust experiments. The payoff is a measurement system that earns trust, guides better decisions, and consistently improves customer and business results.

Inspired by this post on Pendo – Best Practices.

May 7, 2026

How to Link AI Evals to Retention Without Chasing Proxies

Your AI activation rate is rising. More users are reaching the agent, completing setup, or trying the workflow. Yet the retention curve is flat. That usually means you know who touched the product, but not who received enough value to return.

A higher aggregate eval score will not resolve that gap. You need to identify an AI quality signal that appears early, connect it to later behavior, and determine whether changing that signal can change retention. The result should influence onboarding, roadmap priorities, customer success, and model releases, not just add another chart to an eval dashboard.

Start with the retention decision, not the eval dashboard

The wrong opening question is: Which evals can the team measure? Start with: What must a user experience early enough that returning becomes the rational next action?

That framing forces you to define retention before searching for a predictive signal. A login is rarely enough. Choose a return behavior that represents recurring value: running another workflow, completing another meaningful task, or bringing the agent into an ongoing process. Then make five decisions explicit:

Define the eligible population. Decide whether you are studying newly activated users, newly activated tenants, or another clearly bounded cohort.
Choose the unit of analysis. Use the user when value is individual. Use the tenant or account when adoption and renewal depend on a shared workflow.
Name the retained behavior. It should represent renewed value, not passive presence.
Select the retention window. Weekly and monthly cohorts answer different product questions, so do not switch between them after seeing the result.
Close the observation period before the retention outcome begins. Otherwise, later behavior can leak into the feature that supposedly predicts it.
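
The last decision can be made concrete with two half-open date windows that never overlap. This is a sketch; the function names, the window lengths, and the activation date are hypothetical.

```python
from datetime import date, timedelta


def cohort_windows(activation_day, observation_days, retention_days):
    """Return (observation, retention) as half-open [start, end) date ranges.

    The retention window starts exactly where the observation window ends,
    so no retention-period behavior can leak into an observed feature.
    """
    observation = (activation_day, activation_day + timedelta(days=observation_days))
    retention = (observation[1], observation[1] + timedelta(days=retention_days))
    return observation, retention


def in_window(day, window):
    start, end = window
    return start <= day < end


observation, retention = cohort_windows(date(2026, 5, 1), observation_days=7,
                                        retention_days=28)
boundary_day = date(2026, 5, 8)
print(in_window(boundary_day, observation), in_window(boundary_day, retention))
```

An event on the boundary day belongs to the retention window only, so it can count as an outcome but never as a predictor.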

This distinction matters when activation improves but retention does not. Activation proves that a user crossed a product milestone. It does not prove that the AI produced a trustworthy, complete, safe, and usable outcome. Your eval candidates should measure that missing experience.

Eval family	Question it should answer	When it deserves product attention
Semantic accuracy	Did the output correctly address the intended task?	Incorrect results prevent completion or make the user unwilling to rely on the agent again.
Containment	Did the agent complete the eligible workflow without an avoidable human escalation?	Escalation prevents the workflow from delivering repeatable automation.
Safety	Did the interaction remain within the product’s acceptable risk boundaries?	A regression creates unacceptable exposure, even if another engagement metric improves.
Latency	Did the result arrive fast enough for the user’s workflow?	Delay causes abandonment, repeated attempts, or a return to the previous process.
UX friction	Could the user reach a good outcome without unnecessary setup, retries, or corrections?	Users fail before they have a fair chance to experience the agent’s value.

Shortlist three to five candidates tied to these user outcomes. A long eval inventory makes analysis look comprehensive while weakening the decision. You are not trying to find every quality problem. You are looking for an early signal that is measurable, related to meaningful retention, and alterable through a product intervention.

Build an identity and time contract before modeling anything

The hardest part is usually not the statistical model. It is joining AI interactions to product behavior without duplicating records, losing users, or assigning an outcome to the wrong account. Evals often live in notebooks or model-observability systems while retention events live in product analytics. A plausible-looking join can still be wrong.

Create a data contract that covers both systems. At minimum, it should specify:

Stable user and tenant identifiers, including the rule used when a user belongs to more than one tenant.
The timestamp that determines whether an interaction belongs inside the observation period.
The model and workflow version associated with the interaction.
The conditions that make an interaction eligible for each eval.
The grain of the analysis table, such as one row per user-day or tenant-day.
The treatment of missing data, especially the difference between no eligible interaction and an evaluated failure.

That last distinction is easy to miss. A user who never invoked an eligible workflow did not fail the accuracy eval. Combining non-use and poor quality into the same value hides whether the retention problem comes from discovery, setup, or AI performance.

Compute daily per-user and per-tenant features rather than joining every raw interaction directly to every product event. Each feature should retain its denominator or exposure count. A pass rate without the number of eligible interactions can make sparse use look equivalent to sustained use.
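
A minimal sketch of that aggregation, assuming a hypothetical Interaction record with an eligibility flag and an optional pass result. The field names are illustrative and not taken from any particular eval system. The point is that every user-day row carries its denominator, and that "no eligible interaction" stays distinct from "failed".

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Interaction:
    user_id: str
    day: date
    eligible: bool           # met the eligibility rule for this eval
    passed: Optional[bool]   # None when the interaction was not evaluated


def daily_accuracy_features(interactions):
    """Aggregate raw interactions to one row per (user_id, day)."""
    counts = defaultdict(lambda: {"eligible": 0, "passed": 0})
    for it in interactions:
        row = counts[(it.user_id, it.day)]  # creates the row even if ineligible
        if it.eligible and it.passed is not None:
            row["eligible"] += 1
            row["passed"] += int(it.passed)
    features = {}
    for key, row in counts.items():
        n = row["eligible"]
        features[key] = {
            "eligible_count": n,
            "pass_count": row["passed"],
            # None means "no eligible interaction", not "failed".
            "pass_rate": row["passed"] / n if n else None,
        }
    return features
```

A user with one pass out of one eligible interaction and a user with fifty passes out of fifty both show a pass rate of 1.0, so keep eligible_count next to the rate wherever the feature is used.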

Keep the definition of each feature readable. Containment, for example, needs an explicit eligible-workflow denominator and an explicit rule for what counts as avoidable escalation. UX friction needs named events, such as a retry or correction, rather than an opaque composite score. If a product manager cannot explain how the feature changes, the team will struggle to turn it into a roadmap decision.

Watch for many-to-many joins. One AI interaction may generate several product events, and one product session may contain several AI interactions. Joining both raw tables can multiply rows and inflate success or failure counts. Aggregate each side to the agreed grain first, then join the resulting features to the retention cohort.
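
The row multiplication is easy to reproduce with toy data. This sketch uses made-up session rows, with the session as the agreed grain purely for illustration, to contrast a raw join with an aggregate-then-join approach.

```python
from collections import Counter

# Hypothetical raw rows keyed by session: 2 AI interactions, 3 product events.
ai_interactions = [("s1", "pass"), ("s1", "fail")]
product_events = [("s1", "export"), ("s1", "share"), ("s1", "save")]

# Naive raw join on session: every interaction pairs with every event.
naive = [(a, e) for a in ai_interactions for e in product_events if a[0] == e[0]]
naive_pass_count = sum(1 for a, _ in naive if a[1] == "pass")

# Aggregate each side to one row per session first.
ai_by_session = Counter()
for session, result in ai_interactions:
    ai_by_session[session] += result == "pass"
events_by_session = Counter(session for session, _ in product_events)

# Joining the aggregated sides keeps exactly one row per session.
joined = {
    s: {"pass_count": ai_by_session[s], "event_count": events_by_session[s]}
    for s in ai_by_session.keys() & events_by_session.keys()
}
```

The naive join yields six rows and counts the single passing interaction three times. The aggregated join yields one row with one pass and three events.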

Versioning also matters. If a model or workflow changes during the observation period, an account-level average can blend materially different experiences. Preserve the version so you can distinguish a real quality improvement from a change in traffic or segment mix.

Find a threshold that survives segment and leakage checks

Once the dataset is reliable, begin with cohort analysis rather than a complex predictive model. Compare retention among users who reached different levels of each candidate signal. You are looking for a separation that is large enough to matter, stable enough to repeat, and reachable through product changes.

Use this sequence:

Plot weekly or monthly retention against each early eval feature.
Use a driver tree to show where the feature sits between acquisition, activation, AI quality, repeat behavior, and the final retention outcome.
Fit a simple logistic model that controls for plan type, segment, region, and acquisition channel.
Repeat the analysis inside important segments instead of relying only on the blended population.
Check whether the threshold remains directionally useful when you vary the observation definition without allowing it to overlap the outcome.

The controls are not statistical decoration. Higher-plan customers may have better implementation support. One region may contain a different account mix. A high-intent acquisition channel may produce both better agent usage and better retention. Without those checks, customer composition can masquerade as model improvement.

In one product context, users who crossed a specific eval threshold early showed three times higher retention than peers who did not. That is evidence that an eval can become a commercially useful leading indicator. It is not a universal benchmark. Your threshold, effect size, eligible population, and retention behavior will depend on your product.

Do not choose the threshold merely because it creates the largest visual gap. Prefer a boundary that has enough eligible users on both sides, persists across relevant segments, and corresponds to an experience the product can influence. A dramatic ratio from a small cohort is a hypothesis, not a roadmap mandate.
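
One way to make those criteria mechanical is to refuse to report a split when either side is too small, and to repeat the same split inside each segment. The sketch below assumes each user is reduced to a (feature_value, retained) pair. The default minimum cohort size of 30 is an arbitrary placeholder, not a recommendation.

```python
def retention_split(users, threshold, min_cohort=30):
    """Compare retention above and below a threshold on an early eval feature.

    `users` is a list of (feature_value, retained) pairs for users with at
    least one eligible interaction. Returns None when either side is too
    small to support a comparison.
    """
    above = [r for v, r in users if v >= threshold]
    below = [r for v, r in users if v < threshold]
    if len(above) < min_cohort or len(below) < min_cohort:
        return None
    rate_above = sum(above) / len(above)
    rate_below = sum(below) / len(below)
    return {
        "n_above": len(above),
        "n_below": len(below),
        "rate_above": rate_above,
        "rate_below": rate_below,
        "difference": rate_above - rate_below,
    }


def split_by_segment(users_by_segment, threshold, min_cohort=30):
    """Repeat the comparison inside each segment instead of only the blend."""
    return {
        seg: retention_split(rows, threshold, min_cohort)
        for seg, rows in users_by_segment.items()
    }
```

A None result for a segment is information in itself: the threshold has not been tested there, which is different from having failed there.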

Run an explicit leakage review before presenting the result. Common forms include an eval feature calculated after the retention window begins, an account-health field that already contains renewal information, or a usage feature whose value can only rise when the user returns. Leakage can make a weak signal look uncannily predictive.
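
The first of those checks can be automated. This sketch assumes you can record, for each feature, the latest timestamp of any data that went into it. The feature names in the usage note are hypothetical.

```python
from datetime import datetime


def leakage_violations(feature_latest_input, retention_window_start):
    """Return the features whose inputs reach into the retention window.

    `feature_latest_input` maps a feature name to the latest timestamp of
    any data used to compute it. A feature is flagged when that timestamp
    is at or after the start of the retention window.
    """
    return sorted(
        name
        for name, latest in feature_latest_input.items()
        if latest >= retention_window_start
    )
```

For example, with a retention window starting on February 1, an early_pass_rate feature built from January data passes, while an account_health field refreshed on February 10 is flagged. A timestamp check does not catch every form of leakage, so fields that encode renewal information still need a manual review.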

The decision artifact should show the cohort definition, feature window, retention window, cohort sizes, effect estimate, control variables, and segment sensitivity together. If the threshold only works for a particular plan or acquisition channel, say so. A narrow, honest signal is more actionable than a broad result that disappears when the mix changes.

Use experiments to separate a predictor from a product lever

A predictive eval signal is not automatically causal. Sophisticated users may configure the agent better, choose easier workflows, or persist through early friction. Their higher eval scores and higher retention may share the same cause. Improving the score will not necessarily reproduce their behavior.

Convert the signal into a testable product intervention:

Choose an intervention that can move the signal during the early observation period. Depending on the failure, that could be an in-app guide, a product tour, a setup change, or a model change behind a feature flag.
Keep the threshold definition fixed for the experiment. Redefining success after seeing the result turns the test into another exploratory analysis.
Predefine the retained behavior, retention window, target population, and second-order guardrails.
Use a minimum detectable effect calculation to determine whether the experiment can answer the question with the available population.
Run an A/B test where randomization is practical. Measure whether the intervention moves the eval signal and whether that movement is followed by the intended retention lift.
Inspect results by the same segments used in the observational analysis. A blended win can hide a regression for a strategically important group.
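
For the minimum detectable effect step, a rough planning calculation is often enough to tell whether the test is worth running. The sketch below uses the standard normal approximation for a two-sided comparison of two proportions. It assumes equal-sized arms and applies the baseline variance to both, which is a simplification. The defaults of 5% significance and 80% power are common conventions, not values from this article.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.8):
    """Approximate absolute lift detectable with n_per_arm users in each arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * standard_error
```

With a 30% baseline retention rate and 2,000 eligible users per arm, this works out to roughly 4 percentage points. If the intervention cannot plausibly move retention that much, the population is too small to answer the question and the design needs to change before launch.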

This creates a necessary chain of evidence: the intervention changed the early experience, the early eval feature moved, and the retention outcome moved in the expected direction. If retention improves without movement in the eval, your intervention may work through another mechanism. If the eval improves without retention, the signal is not yet a proven growth lever.

Treat safety differently from an ordinary optimization metric. A retention increase cannot compensate for an unacceptable safety regression. Use risk scoring to gate exposure, keep model changes behind feature flags until the required evals pass, and monitor anomalies in both the score and its eligible volume. A stable percentage on a collapsing sample is not stability.
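
A release gate along these lines can be sketched in a few lines. The eval names, floors, and minimum eligible volume are hypothetical inputs you would set for your own product. The gate treats a missing or thinly sampled eval as a failure rather than a pass.

```python
def release_gate(eval_results, required, min_eligible):
    """Decide whether a flagged model change may expand exposure.

    eval_results: eval name -> (pass_rate, eligible_count)
    required:     eval name -> minimum acceptable pass rate
    Returns (allowed, reasons), where reasons lists every blocking problem.
    """
    reasons = []
    for name, floor in required.items():
        if name not in eval_results:
            reasons.append(f"{name}: no result")
            continue
        rate, eligible = eval_results[name]
        if eligible < min_eligible:
            # A healthy rate on a collapsing sample is not evidence.
            reasons.append(f"{name}: only {eligible} eligible interactions")
        elif rate < floor:
            reasons.append(f"{name}: pass rate {rate:.2f} below {floor:.2f}")
    return (not reasons, reasons)
```

Because the gate returns every blocking reason instead of stopping at the first, the release owner sees the full picture in one check.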

Track support tickets, NPS, and Net Recurring Revenue alongside the primary retention result. These measures operate on different timelines, but they help catch proxy optimization. An intervention that pushes users across an eval threshold while increasing support burden or degrading customer sentiment has not produced a clean product win.

Separate the user-level and release-level uses of the signal. A user-level signal can trigger onboarding or customer-success help when a new account has not reached the value threshold. A release-level eval can prevent a model change from expanding when quality falls. Combining both into one vague health score makes ownership and response unclear.

Put the winning signal into the product operating system

The analysis matters only when it changes what happens next. Give the signal a definition, an owner, an intervention, and a response to regression.

For onboarding, guide new users toward the workflow conditions associated with crossing the threshold. Do not merely show them where the AI button is.
For customer success, add the signal to a health score only when the team has a specific action to take. A warning without a playbook creates dashboard noise.
For roadmap planning, require proposed work to identify which eval feature it should move, why that feature connects to retention, and how the effect will be tested.
For model releases, keep exposure controlled with feature flags until the relevant eval improves without violating safety or experience guardrails.
For monitoring, use anomaly detection on the eval value, eligible interaction volume, and important segments so a blended average does not conceal a regression.

This operating model also clarifies ownership. Product owns the intervention and decision. Data science owns the validity of the feature and analysis. Engineering owns reliable instrumentation and release controls. Customer success owns the response when an account-level signal indicates missed value. Those responsibilities can be distributed differently in your organization, but none should be implicit.

Key takeaways

Define the retained behavior, population, unit, and time window before selecting an eval.
Shortlist three to five eval candidates that describe real user value: accuracy, containment, safety, latency, or UX friction.
Aggregate reliable daily features with stable user and tenant identifiers before joining them to product cohorts.
Use cohort analysis, driver trees, and simple controlled models to find a predictive threshold, then check sample size, segment mix, and label leakage.
Use an A/B test to learn whether a product intervention can move both the eval signal and retention.
Operationalize a validated signal through onboarding, customer success, release gates, feature flags, and anomaly detection.

At your next product review, bring a short decision sheet with the retained behavior, observation window, no more than five candidate evals, join keys, and the first intervention you can test. If the team cannot fill in those fields, fix the analytics contract first. If it can, run the smallest credible experiment and let retained behavior, not a prettier eval dashboard, decide the roadmap.

References

Amplitude — The Surprising Eval Signal That Tripled Retention: How I Connected AI Evals to Product KPIs

May 7, 2026

5 Proven Agent Skills I Use to Automate Weekly Product Reviews with Claude, Cursor, and Codex

Weekly product reviews are where strategy meets execution, and over the past year I’ve turned them into a high-signal, low-friction ritual by leaning on agentic AI. As VP of Product Management at HighLevel, Inc., I’ve standardized a set of agent skills that compress preparation time, surface the right insights, and keep PMs, engineers, and designers focused on decisions—not document wrangling.

"Learn how our teams use agent skills with claude, cursor and codex to run product reviews as PMs, engineers, and designers. Here are 5 killer use cases for builder."

Below, I walk through the five skills I rely on most in our weekly cadence—each one mapped to a clear product management outcome. They’re simple to set up, easy to govern, and aligned with core practices like continuous discovery, product roadmapping and sprint planning, and eval-driven development.

Skill 1 — Backlog triage with signal extraction: I point an agent at fresh tickets, customer notes, and experiment results to cluster themes, tag impact, and flag regressions. Using a retrieval-first pipeline and Agent Analytics, the assistant ranks items by value, effort, and risk so our meeting starts with a prioritized, explainable shortlist instead of a raw queue.

Skill 2 — PRD and spec synthesizer: Ahead of the review, an agent drafts a one-page PRD update from design diffs, git history, and decision logs. With Claude Code and Cursor, it highlights interface changes, acceptance criteria, and open questions, linking back to sources. The result is a crisp, auditable brief that keeps product trios aligned without re-litigating context.

Skill 3 — Experiment and metrics analyzer: An analytics agent pulls A/B testing readouts, checks minimum detectable effect assumptions, and annotates anomalies. It turns raw telemetry into a narrative: what moved, by how much, and whether we trust it. This makes our discussion about tradeoffs, not spreadsheets, and speeds commitments on next steps.

Skill 4 — Voice-of-customer synthesizer: The assistant clusters interviews, support threads, and NPS verbatims into jobs-to-be-done and pain themes. It proposes opportunity solution tree updates and calls out places where our roadmap diverges from customer signal. That keeps continuous discovery alive in the room—even when time is tight.

Skill 5 — Roadmap and sprint planning co-pilot: After decisions, an agent converts outcomes into scoped backlog items, engineering tasks, and stakeholder updates. It drafts sprint goals, flags dependency risks, and aligns work to objectives. Because it’s grounded in the meeting record, it preserves intent while removing ambiguity.

Under the hood, prompt engineering patterns and guardrails keep these workflows predictable: a retrieval-first pipeline for context, eval-driven development for quality checks, and role-specific prompts for PMs, engineers, and designers. With Claude Code I generate structured diffs and test scaffolds; with Cursor I accelerate code-review summaries; and with Codex I bootstrap utility scripts that keep the loop tight between insights and implementation.

The payoff is tangible: higher decision velocity, fewer meetings to “re-clarify,” and clearer accountability across the product organization. Just as important, governance and privacy-by-design are built in—every agent logs rationale, cites sources, and respects data boundaries—so leaders can scale AI workflows confidently.

If you’re looking to level up your product reviews, start with these five skills, measure impact with Agent Analytics, and iterate. Small automations compound quickly, and the more consistently you run them, the more your team’s attention shifts from preparing content to making better product decisions.

Inspired by this post on Amplitude – Perspectives.

May 4, 2026

Supercharge Claude and Cursor with Amplitude Plug and Play: Your AI Analytics Expert in One Install

I’m excited to share that we’ve brought Amplitude Plug and Play to the Claude and Cursor marketplaces—a lightweight way to infuse your everyday prompts with serious product analytics context and speed.

"Learn more about our new AI plugin, the easiest way to turn your favorite AI client into an analytics expert with a single-install."

For years, I’ve watched teams lose momentum hopping between dashboards, docs, and spreadsheets just to answer simple questions like “What changed in activation last week?” or “Which cohort is driving retention?” With Amplitude analytics and behavioral analytics at the core, Amplitude Plug and Play collapses that friction by bringing the answers to where you already think and build—inside Claude and Cursor.

In practice, this means I can ask natural-language questions such as “Show me the funnel from signup to activation by region,” “Compare retention week over week for new users from our latest release,” or “Summarize our last A/B testing results on onboarding” and get structured, context-aware responses. The goal is to keep me in flow while still honoring the rigor of a unified analytics platform.

What I love most is how this elevates both discovery and delivery. Product managers can accelerate continuous discovery by querying cohorts, drivers, and anomalies mid-conversation. Engineers working in Cursor or with Claude Code can validate event definitions, sanity-check metrics, and spot regressions without leaving their IDE. The result is tighter feedback loops and better decision quality.

Just as importantly, the experience is designed for clarity and consistency. When I ask about activation, I expect the same canonical definition every time. When I explore a retention analysis, I want clear assumptions and transparent logic. By anchoring responses to well-defined metrics and event taxonomies, the plugin helps reinforce good data governance while keeping the interaction fast and conversational.

Getting started takes only a few minutes. Open the Claude or Cursor marketplace, search for Amplitude Plug and Play, complete the single-install flow, and connect to your Amplitude analytics workspace. From there, start prompting as you normally would—only now your AI client can reason with product context.

This launch is part of how I see generative AI reshaping AI workflows for product teams: less context switching, more signal per prompt, and a shared, accessible understanding of what’s really moving the business. If you’re ready to turn your AI assistant into a trusted partner for product insight, Amplitude Plug and Play is a powerful next step.

Inspired by this post on Amplitude – Best Practices.

May 1, 2026

Amplitude AI Product Analytics: A Practical Agent Playbook

You are deciding whether Amplitude Agents deserve a place in your product operating system. A fluent answer or polished insight is easy to admire. The harder question is whether the agent helps someone make a better decision, complete a valuable task, or change user behavior.

That distinction determines how you should instrument, evaluate, and roll out the experience. Treat Amplitude as the measurement spine connecting agent activity to funnels, cohorts, experiments, retention, and product outcomes. Otherwise, you will know that the agent was used without knowing whether it was useful.

Pick a workflow with an observable finish line

Do not begin with a broad ambition such as helping everyone understand the data. It cannot be measured cleanly, and it gives the agent too much room to produce plausible output without resolving a real job.

The useful standard is that AI product management remains accountable for helping teams build better products. The agent response is therefore an intermediate output, not the outcome. A strong starting point is one narrowly scoped, high-signal workflow with an unambiguous done state.

Write a workflow contract before configuring dashboards or prompts:

User: Name the role doing the work, such as a product manager investigating onboarding friction.
Trigger: Describe the event that makes the job necessary, such as a drop in activation or an unexpected cohort difference.
Bounded job: State exactly what the agent should help accomplish.
Required evidence: Identify the events, funnels, segments, or cohorts that should support the output.
Done state: Define the observable action that marks useful completion.
Fallback: Decide what happens when the inputs are missing, the evidence conflicts, or the agent cannot complete the task reliably.

For an onboarding investigation, the contract might ask the agent to help identify where a defined cohort leaves the activation journey and produce evidence-backed hypotheses for the product manager to review. The task is not complete when text appears. It is complete when the user reviews the relevant evidence and records a decision, launches a follow-up analysis, or creates an experiment.
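
Writing the contract as a small data structure makes gaps visible before any dashboard exists. This is one possible shape, with field names that mirror the list above. The class itself is illustrative and not part of any Amplitude API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkflowContract:
    user: str
    trigger: str
    bounded_job: str
    required_evidence: tuple  # events, funnels, segments, or cohorts
    done_state: str
    fallback: str

    def missing_fields(self):
        """Names of fields left empty, so an incomplete contract is obvious."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


onboarding = WorkflowContract(
    user="Product manager investigating onboarding friction",
    trigger="Drop in activation for a defined cohort",
    bounded_job="Identify where the cohort leaves the activation journey",
    required_evidence=("activation funnel", "cohort definition"),
    done_state="Decision recorded, follow-up analysis launched, or experiment created",
    fallback="",  # not decided yet
)
```

Here onboarding.missing_fields() returns ["fallback"], which is a useful prompt to decide what happens when evidence is missing before anyone configures the agent.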

Use a simple outcome ladder to keep the team honest: eligible users see the experience, some start it, some reach the workflow’s done state, some act on the result, and the intended product outcome changes. Each level answers a different question. Collapsing them into an agent usage metric hides the point at which value disappears.

Instrument the agent journey, not just the final answer

Your event design should let you reconstruct the journey from opportunity to outcome. The names below are examples, not an official Amplitude schema. Adapt them to your existing naming convention and governance rules.

Journey stage	Question it answers	Suggested event
Eligible	Who could reasonably use this workflow?	agent_workflow_eligible
Exposed	Who actually saw an entry point?	agent_entry_viewed
Started	Who chose to begin?	agent_run_started
Evidence reviewed	Who engaged with the information needed to judge the output?	agent_evidence_viewed
Completed	Who reached the workflow-specific done state?	agent_task_completed
Actioned	Who used the output in a downstream decision or action?	agent_output_applied
Handed off	Where did the experience require a deterministic flow or human review?	agent_handoff_triggered
Returned	Who came back when the job occurred again?	agent_run_started, segmented by prior successful completion

Add properties that explain why behavior differs: workflow identifier, product surface, user role, account cohort, journey stage, agent version, prompt or instruction version, completion reason, handoff reason, and error class. Version properties are essential. Without them, a release can change output quality while the dashboard incorrectly treats the experience as one stable product.

If prompts may contain customer or company data, do not log the raw text by default. Prefer derived classifications, structured outcome fields, or properly redacted samples governed by your retention and access policies. Product analytics should increase observability without creating an unnecessary copy of sensitive input.
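
Both rules, required version properties and no raw text by default, can be enforced where events are constructed. The sketch below builds a plain dictionary and does not call any SDK. The property names, including prompt_text and response_text, are assumptions made for illustration.

```python
REQUIRED_PROPERTIES = {"workflow_id", "agent_version", "prompt_version"}
BLOCKED_PROPERTIES = {"prompt_text", "response_text"}  # raw text never logged


def build_agent_event(event_type, user_id, properties):
    """Return an event payload, or raise if version context is missing."""
    missing = REQUIRED_PROPERTIES - set(properties)
    if missing:
        raise ValueError(f"missing required properties: {sorted(missing)}")
    # Drop raw text fields even if a caller passes them by mistake.
    safe = {k: v for k, v in properties.items() if k not in BLOCKED_PROPERTIES}
    return {
        "event_type": event_type,
        "user_id": user_id,
        "event_properties": safe,
    }
```

Rejecting events that lack version properties is stricter than logging a warning, but it prevents a release from silently producing unattributable data.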

Build each metric with an explicit denominator:

Discovery rate: exposed eligible users divided by eligible users.
Start rate: users who start divided by users exposed to the entry point.
Completion rate: users reaching the workflow-specific done state divided by users who start.
Action rate: users taking the defined downstream action divided by users who complete.
Retained use: previously successful users who return when the job recurs divided by previously successful users who had another opportunity.

The eligibility and opportunity conditions matter as much as the numerator. A user cannot be counted as retained or lost on a workflow whose job has not recurred, and someone who never saw the entry point should not be treated as a failed starter.
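
The first four rates can be sketched with sets of user IDs, where each stage is restricted to users who also reached the previous one. This is a simplified illustration. Retained use is left out because it also needs a per-user record of whether the job recurred.

```python
def rate(numerator, denominator):
    """Share of the denominator set, or None when there is no denominator."""
    return len(numerator) / len(denominator) if denominator else None


def funnel_metrics(eligible, exposed, started, completed, actioned):
    """Each argument is a set of user IDs observed at that stage."""
    # Restrict every stage to users who also reached the previous stage.
    exposed = exposed & eligible
    started = started & exposed
    completed = completed & started
    actioned = actioned & completed
    return {
        "discovery_rate": rate(exposed, eligible),
        "start_rate": rate(started, exposed),
        "completion_rate": rate(completed, started),
        "action_rate": rate(actioned, completed),
    }
```

Returning None for an empty denominator keeps "nobody had the chance" separate from a true zero rate.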

In Amplitude, separate the views rather than forcing everything into one chart. Use an exposure funnel for discoverability, a workflow funnel for completion, cohorts for segment differences, retention analysis for repeat behavior, and a guardrail view for errors, retries, and handoffs. Use Agent Analytics for the execution signals available from the agent, then connect those signals to the behavioral events that represent product value.

Keep output quality and product impact on separate scorecards

Behavioral analytics cannot tell you whether an answer was correct. An evaluation set cannot tell you whether customers changed their behavior. You need both views because they fail in different ways.

Before widening access, create an evaluation set drawn from the workflow contract. Include ordinary cases, incomplete inputs, ambiguous requests, conflicting evidence, and cases that should trigger a handoff. Grade the output against criteria that can be reviewed consistently:

Correctness: Does the conclusion match the available evidence?
Grounding: Can the user see which events, funnels, cohorts, or other inputs support it?
Task adherence: Did the agent solve the bounded job rather than produce a generic analysis?
Uncertainty handling: Does it distinguish supported conclusions from hypotheses?
Handoff behavior: Does it stop or redirect appropriately when required evidence is unavailable?
Actionability: Can the intended user make the next decision without reconstructing the analysis?

Record pass or fail for non-negotiable criteria such as unsupported conclusions and failed handoffs. Keep graded usefulness criteria separate. A high average score should not conceal a smaller set of serious failures.
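
A small summary function shows how to keep the two kinds of criteria apart. The case fields (grounded, handoff_correct, and a numeric usefulness grade) are hypothetical names chosen for this sketch.

```python
NON_NEGOTIABLE = ("grounded", "handoff_correct")


def summarize_evals(cases):
    """Summarize graded cases without letting averages hide hard failures.

    Each case is a dict with a case_id, boolean non-negotiable fields,
    and a numeric usefulness grade.
    """
    hard_failures = [
        c["case_id"] for c in cases if not all(c[k] for k in NON_NEGOTIABLE)
    ]
    grades = [c["usefulness"] for c in cases]
    return {
        "hard_failures": hard_failures,
        "mean_usefulness": sum(grades) / len(grades) if grades else None,
        "release_blocked": bool(hard_failures),
    }
```

A case that scores 5 for usefulness but draws an unsupported conclusion still blocks the release in this scheme, which is the behavior the separation is meant to guarantee.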

Run the same evaluation set when you change instructions, tools, model configuration, retrieval behavior, or the data made available to the agent. This is the practical value of eval-driven development: a fast release becomes a controlled product change rather than an untraceable shift in behavior.

Your online scorecard should then contain distinct layers:

Primary outcome: the workflow-specific completion or downstream action that represents value.
Adoption diagnostics: eligibility, exposure, start rate, and first successful completion.
Quality diagnostics: evaluation results, user corrections, retries, and unsupported-output flags.
Operational guardrails: errors, latency appropriate to the workflow, abandonment, and handoffs.
Product impact: the activation, feature adoption, retention, or other behavioral outcome the workflow is intended to influence.

Choose one primary outcome before launch. The other measures explain why it moved or protect against a misleading win. If every metric is primary, the team can always find one that improved after the fact.

User ratings can help diagnose tone, relevance, or missing context, but they are not a substitute for observed outcomes. A response can feel impressive and still produce no action. It can also look unremarkable while helping an expert complete the job quickly. Pair stated feedback with completion, downstream action, and return behavior.

Run an experiment that can survive executive scrutiny

Do not compare enthusiastic agent adopters with everyone who ignored it. Those groups selected themselves, so their product outcomes may have differed before the agent appeared. Establish a baseline and create a controlled comparison wherever the workflow and traffic permit it.

Write the hypothesis in behavioral terms. Name the user, workflow, expected action, and product outcome.
Measure the current workflow before introducing the agent. Capture completion, abandonment, downstream action, and relevant guardrails.
Define eligibility before assignment so the comparison includes people with the same underlying job.
Choose the assignment unit that matches how the workflow spreads. Use an account-level unit when teammates share agent output; use a user-level unit only when experiences are genuinely independent.
Expose the treatment through a feature flag or controlled rollout, while keeping the existing path available as the comparison and fallback.
Evaluate the primary outcome and guardrails together. Do not call a faster workflow successful if output quality, error handling, or downstream behavior deteriorates.
Inspect cohorts to understand a credible result, not to search endlessly for a segment that happens to look positive.
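
The assignment-unit step is where many experiments go wrong, so it is worth making explicit in code. This sketch uses deterministic hashing, a common technique that your feature-flag tool may already implement for you. The function names and the 50/50 default split are assumptions.

```python
import hashlib


def assign_variant(unit_id, experiment_key, treatment_share=0.5):
    """Deterministically map an assignment unit to a variant."""
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def variant_for_user(user_id, account_id, experiment_key, shared_output=True):
    """Assign by account when teammates share agent output, else by user."""
    unit = account_id if shared_output else user_id
    return assign_variant(unit, experiment_key)
```

With shared_output=True, every teammate in an account lands in the same variant, so the control group is not contaminated by agent output passed between colleagues.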

The metric pattern often tells you where to investigate next:

High exposure with low starts can indicate weak positioning, poor timing, or an irrelevant eligible population.
Healthy starts with low completion can indicate that the promise is attractive but the workflow, inputs, or output quality is failing.
High completion with low downstream action can indicate that your done state is too shallow or the output is not trusted enough to use.
Strong agent engagement without movement in the product outcome can indicate a locally pleasant experience that does not change the broader journey.
Strong first use with weak return behavior can indicate novelty, unreliable value, or a job that simply occurs infrequently. Check opportunity before interpreting it as churn.
Good aggregate results with concentrated handoffs in one cohort can indicate missing context, permissions, or data for that segment.

Guardrails should be operational, not aspirational. Validate required inputs. Make the agent’s task and evidence boundaries clear. Route the user to a deterministic flow or human review when observable conditions show that the task cannot be completed. Missing data, failed tool calls, validation failures, and unsupported claims are stronger handoff triggers than an agent merely describing itself as confident.
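
Expressed as code, the handoff decision reads only observable conditions. The run-state keys below are hypothetical, and the order of checks is one reasonable choice rather than a prescribed priority.

```python
HANDOFF_CHECKS = (
    ("missing_data", lambda s: bool(s.get("missing_inputs"))),
    ("tool_failure", lambda s: s.get("failed_tool_calls", 0) > 0),
    ("validation_failure", lambda s: not s.get("validation_passed", True)),
    ("unsupported_claim", lambda s: s.get("unsupported_claims", 0) > 0),
)


def handoff_reason(run_state):
    """Return the first observable handoff reason, or None to continue.

    The agent's self-reported confidence is deliberately never consulted.
    """
    for reason, check in HANDOFF_CHECKS:
        if check(run_state):
            return reason
    return None
```

Logging the returned reason as a handoff property also gives you the data to see which trigger dominates in each cohort.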

Scale only when value repeats under real conditions

A spike in usage after launch mainly proves that people noticed something new. Scale when the complete chain repeats: eligible users discover the workflow, finish it, act on the result, and return when the same job appears again.

Segment that chain by role, account cohort, use case, journey, and agent version. A workflow that helps an experienced product analyst may confuse a first-time manager. An onboarding investigation may need different evidence and handoffs from a retention investigation. Aggregate adoption can hide both realities.

Expand the rollout when the primary outcome improves, evaluation quality remains stable across relevant cohorts, guardrail failures stay controlled, and repeat use matches the natural frequency of the job. Redesign when successful users cannot find the entry point, retries cluster around the same step, completed outputs rarely lead to action, or results depend on one unusually capable cohort.

Pause expansion when the agent does not improve the existing workflow, important outputs cannot be audited back to evidence, or failures cannot be routed safely. More exposure only creates more ambiguous data when the workflow contract itself is weak.

Key takeaways

Define one bounded workflow and an observable done state before measuring adoption.
Connect agent execution signals to exposure, completion, downstream action, and product outcomes in Amplitude.
Use evaluation sets for output quality and behavioral analytics for real-world impact; neither replaces the other.
Compare the agent with the existing workflow among equally eligible users.
Treat retries, errors, unsupported outputs, and handoffs as product signals, not merely engineering logs.
Scale repeatable value across cohorts and versions, not a launch-driven usage spike.

Your next move should fit on one page: the workflow contract, event map, evaluation criteria, experiment metric, and fallback path. If those elements are clear, Amplitude can show where the agent creates value and where it merely creates activity. If they are not clear, narrow the workflow before you widen the rollout.

April 28, 2026

How to Build Agentic AI for Product Analytics and Support

Your support bot can tell a customer where a setting lives, yet leave that customer to diagnose the problem, change the setting, and hope it worked. Your product team then receives a chat transcript without knowing whether the interaction improved activation, feature adoption, or retention.

If you are deciding how to connect AI, product analytics, and support, do not start with the model. Design the closed loop first: assemble trustworthy context, choose an allowed action, verify the resulting product state, and measure the user outcome. The model is one component inside that system.

Treat product analytics as the agent’s control plane

A useful standard is an assistant that understands the user’s context, can complete an allowed action, and measures whether the action helped. Remove any one of those capabilities and the experience degrades quickly. Context without action produces advice. Action without context creates risk. Action without measurement creates an impressive demo that cannot earn a durable place on the roadmap.

Product analytics supplies the behavioral context and outcome signals for this loop. It can show where the user is in a journey, which features have been adopted, which step failed, and whether the expected success event eventually occurred. It should not be treated as a warehouse-sized attachment to the prompt.

Define a support context contract

Create a small, governed context object for each supported workflow. Give the agent only the fields required to understand and resolve that workflow:

Actor and access: the authenticated user, account, role, entitlements, and permissions relevant to the requested action.
Journey state: the onboarding step, feature-adoption state, experiment assignment, or other stage that explains what the user is trying to complete.
Current product state: the relevant configuration from the operational system of record, including whether required prerequisites are satisfied.
Friction evidence: recent failed events, validation results, repeated attempts, and known errors connected to this workflow.
Desired outcome: the product state and behavioral event that will count as successful resolution.

Resolve analytics events and tool calls to the same stable user and account identifiers. Preserve timestamps and the origin of each field. For a live action decision, let the operational system of record determine current state; use analytics to explain the journey and measure the outcome. An event stream can be delayed or incomplete, so it should not overrule a current configuration read.

Treat behavior as evidence, not as proof of intent. Repeated visits to a setup screen could indicate confusion, careful verification, or an advanced workflow. When those interpretations require different actions, the agent should ask one targeted question instead of turning a behavioral pattern into a confident diagnosis.

Apply data minimization at this boundary. Do not place secrets, payment information, unrelated conversation history, or an account’s entire event history into the model context. Filter fields before the model sees them, and enforce the filter in code rather than relying on a prompt instruction.
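One way to enforce the filter in code is an explicit allowlist per workflow. The following is a minimal sketch, not a reference implementation: the field names, the `build_context` helper, and the shape of the raw input are all assumptions made for illustration.

```python
from typing import Any, Mapping

# Hypothetical allowlist for one onboarding workflow; field names are illustrative.
ONBOARDING_CONTEXT_FIELDS = frozenset({
    "user_id",
    "account_id",
    "role",
    "entitlements",
    "journey_step",
    "current_config",
    "prerequisites_met",
    "recent_errors",
    "success_event",
})


def build_context(raw: Mapping[str, Any],
                  allowed: frozenset = ONBOARDING_CONTEXT_FIELDS) -> dict:
    """Return only allowlisted fields, keeping each field's origin and timestamp.

    `raw` maps a field name to {"value": ..., "source": ..., "observed_at": ...}.
    Anything not on the allowlist is dropped before the model sees it.
    """
    context = {}
    for name in sorted(allowed):
        if name in raw:
            entry = raw[name]
            context[name] = {
                "value": entry["value"],
                "source": entry["source"],
                "observed_at": entry["observed_at"],
            }
    return context


raw_fields = {
    "user_id": {"value": "u_123", "source": "auth",
                "observed_at": "2026-04-21T10:00:00Z"},
    "journey_step": {"value": "connect_source", "source": "analytics",
                     "observed_at": "2026-04-21T09:58:00Z"},
    # Present upstream, but never allowlisted, so it never reaches the model.
    "payment_token": {"value": "EXAMPLE-ONLY", "source": "billing",
                      "observed_at": "2026-04-21T09:00:00Z"},
}
context = build_context(raw_fields)
print(sorted(context))  # ['journey_step', 'user_id']
```

Because the allowlist lives in code, adding a field becomes a reviewed change rather than a prompt edit.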

Give the analytics agent a metric contract

An internal analytics agent has a different job from a customer-facing support agent. It may translate a product question into metrics, cohorts, funnels, or retention views, but a fluent answer is not enough. Require every analysis to return:

the product question it interpreted;
the metric definition and success event it used;
the cohort, filters, and observation window;
the analysis or query reference needed to reproduce the result;
known data-quality limitations and unresolved ambiguity; and
a clear distinction between observed association and demonstrated causal lift.

This turns the analytics agent into a traceable decision aid. It also prevents two agents from using the same metric name while silently applying different event definitions, account filters, or windows.
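The metric contract can be expressed as a typed result that refuses to exist without its required parts. This is a sketch under assumptions: the `AnalysisResult` class and its field names are hypothetical, and a real contract would likely use richer types than plain strings.

```python
from dataclasses import dataclass, fields, replace

EVIDENCE_TYPES = {"observed_association", "causal_lift"}


@dataclass(frozen=True)
class AnalysisResult:
    interpreted_question: str
    metric_definition: str
    success_event: str
    cohort: str
    filters: str
    observation_window: str
    query_reference: str
    limitations: str
    evidence_type: str  # one of EVIDENCE_TYPES

    def __post_init__(self):
        # Reject any analysis that omits a required part of the contract.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing required field: {f.name}")
        if self.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"evidence_type must be one of {sorted(EVIDENCE_TYPES)}")


result = AnalysisResult(
    interpreted_question="Did setup completion change after the new wizard shipped?",
    metric_definition="setup_completed within 7 days of signup",
    success_event="setup_completed",
    cohort="accounts created in the observation window",
    filters="internal test accounts excluded",
    observation_window="28 days",
    query_reference="saved-analysis-id (placeholder)",
    limitations="mobile events may arrive late",
    evidence_type="observed_association",
)
print(result.evidence_type)  # observed_association
```

An answer that cannot populate `limitations` or `evidence_type` fails at construction time, before it reaches a decision-maker.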

Design one closed loop from signal to verified outcome

The core unit of agentic support is not the conversation. It is a resolution attempt with a beginning, an authorized action, and a verifiable end state. Use the following loop for every workflow:

Observe the trigger. Capture the user’s request or a product signal that indicates likely friction.
Assemble scoped context. Load only the identity, permission, journey, state, and error fields defined in the context contract.
Diagnose the next constraint. Determine which prerequisite, configuration, permission, or knowledge gap is blocking progress. If the evidence is ambiguous, ask rather than assume.
Select an approved playbook. Match the constraint to a versioned workflow with explicit eligibility rules, allowed tools, and prohibited actions.
Obtain the required authorization. Show the proposed change and its consequence whenever the action changes product state or affects other people.
Execute through a narrow tool. Use a typed, allowlisted operation. Make retryable actions idempotent so a repeated call does not create duplicate changes.
Verify the result. Read the resulting product state and look for the defined success event. Tool completion alone does not prove customer resolution.
Record the outcome. Log the context version, playbook, model, policy decision, tool call, result, success signal, and any escalation or user reversal.
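The execute, verify, and record steps can be sketched against an in-memory stand-in for the product. Everything here is hypothetical (`FakeProduct`, `set_setting`, `resolve`, and the event naming); the point is only to show an idempotent call followed by a postcondition check.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProduct:
    """In-memory stand-in for the operational system of record."""
    settings: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    applied_keys: set = field(default_factory=set)

    def set_setting(self, key: str, value: str, idempotency_key: str) -> None:
        # A repeated call with the same idempotency key is a no-op.
        if idempotency_key in self.applied_keys:
            return
        self.applied_keys.add(idempotency_key)
        self.settings[key] = value
        self.events.append(f"setting_changed:{key}")


def resolve(product: FakeProduct, key: str, value: str,
            attempt_id: str, success_event: str) -> dict:
    """Execute one allowlisted change, then verify state and success event."""
    product.set_setting(key, value, idempotency_key=attempt_id)
    state_ok = product.settings.get(key) == value
    event_ok = success_event in product.events
    # The returned record is what would be written to the outcome log.
    return {
        "attempt_id": attempt_id,
        "tool": "set_setting",
        "state_verified": state_ok,
        "success_event_seen": event_ok,
        "resolved": state_ok and event_ok,
    }


product = FakeProduct()
first = resolve(product, "timezone", "UTC", "attempt-1", "setting_changed:timezone")
retry = resolve(product, "timezone", "UTC", "attempt-1", "setting_changed:timezone")
print(first["resolved"], len(product.events))  # True 1
```

The retry reuses the same attempt identifier, so the change and its event are recorded once even though the tool was called twice.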

The loop supports two related products without collapsing their permissions. An internal analytics agent can identify an affected cohort, inspect a funnel, or surface a recurring failure pattern. A customer-facing support agent can use the approved finding to help one authenticated user, but it should see only that user’s permitted context and tools. A human support operator should receive the same trace when the agent escalates.

Keep the shared layer deliberately small: stable identities, canonical metric definitions, governed context fields, outcome events, and versioned playbooks. The analytics agent and support agent can then improve the same system while retaining separate access policies and evaluation criteria.

Do not automatically convert every observed correlation into a new support action. Let analytics generate a candidate playbook, review the causal logic and risk, test it against known cases, and release it through a controlled experiment. The learning unit is the reviewed playbook, not an unexamined prompt change.

Choose a first workflow that can prove its own value

The first pilot should be easy to verify, not merely easy to demonstrate. A conversational answer looks polished even when it does not change the user’s outcome. A narrow configuration or onboarding workflow is usually a better proving ground because eligibility, allowed actions, and success can be defined before launch.

Score candidate workflows against these criteria:

Repeated demand: the same intent or failure appears often enough to justify a reusable playbook.
Observable state: the agent can read the prerequisites and current configuration instead of guessing from the user’s description.
Clear success: one product state or behavioral event can verify that the problem was resolved.
Safe execution: the initial actions are reversible, user-scoped, and unlikely to affect billing, security, data retention, or other users.
Short feedback: the primary outcome appears soon enough to support iteration, even if retention is monitored later.
Enough eligible traffic: the workflow can support a meaningful experiment rather than a handful of anecdotes.

Write the pilot contract before the prompt

A pilot contract forces the product, analytics, support, engineering, and risk decisions into one inspectable artifact. It should specify:

the user problem and eligible cohort;
the trigger that starts the workflow;
the context fields and systems of record;
the approved diagnostic branches;
the allowed and prohibited actions;
the point at which confirmation is required;
the precondition and postcondition for each tool call;
the success event and observation window;
known failure modes and the human handoff rule; and
the primary outcome, guardrail metrics, experiment design, and minimum detectable effect.

Consider an onboarding configuration workflow. The trigger might be a user repeatedly reaching setup without completing it. The context could include entitlement, current configuration, prerequisite status, and the latest validation result. The agent may be allowed to run validation, explain a missing prerequisite, prefill a reversible setting, or launch the next approved step. Resolution requires both the expected configuration state and its corresponding success event. If validation continues to fail, the handoff should include the exact state, error, playbook branch, and actions already attempted.

Avoid starting with data deletion, broad permission changes, security recovery, billing adjustments, or external communications. Those workflows combine difficult authorization questions with high consequences. Prove context quality, tool reliability, verification, and measurement on a narrower action set before expanding the blast radius.

Set the minimum detectable effect before the experiment. If the eligible population is too small to detect an outcome change that would justify the investment, narrow the claim, run the experiment over a longer period, or choose a more observable workflow. Do not treat an underpowered neutral result as proof that the agent has no effect.
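To check whether a workflow has enough eligible traffic, a rough sample-size estimate is often sufficient. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline rate, lift, significance level, and power are illustrative assumptions, not figures from any real pilot.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided test of two proportions.

    baseline: expected success rate in the control arm.
    mde: smallest absolute lift worth detecting.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Illustrative inputs: 40% verified resolution today, 5-point lift worth detecting.
needed = sample_size_per_arm(baseline=0.40, mde=0.05)
print(needed)
```

With these inputs the estimate is roughly 1,500 eligible users per arm. Halving the detectable lift approximately quadruples the requirement, which is why the effect size has to be fixed before launch rather than after the data arrives.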

Instrument the agent like a product surface, not a transcript

Conversation volume, message count, and thumbs-up feedback are diagnostic signals. They are not sufficient outcome measures. A customer can like an explanation and still remain blocked; another can dislike the wording even though the configuration was fixed.

Measurement layer	Question it answers	Useful signals
Operational reliability	Did the system execute as designed?	Tool success, validation failure, retry, latency, rollback, and escalation
Verified resolution	Did the requested product state become true?	Verified resolution rate, time to resolution, repeat attempt, and repeat contact
Product outcome	Did the user progress in the journey?	Activation, feature adoption, workflow completion, and later retention
Support outcome	Did the workflow reduce avoidable support effort?	Eligible ticket rate, escalation reason, handle-time impact, and handoff quality
Safety and trust	Did the agent stay within policy and user intent?	Permission block, wrong-action review, user reversal, policy violation, and privacy incident

Define the denominators as carefully as the numerators. Verified resolution rate should use eligible support sessions as its denominator and require the success state defined in the pilot contract. Action completion rate should use authorized action attempts, not every conversation. Time to resolution should begin with the original request and stop only when the postcondition is verified, not when the agent finishes generating text.
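The denominator rules are easier to audit when each metric is a small function over the same session records. This sketch assumes a hypothetical record shape (`eligible`, `authorized`, `completed`, `requested_at`, `verified_at`, with times in seconds); the sample data is invented for illustration.

```python
from statistics import median

# Invented sample sessions. verified_at is None when the postcondition was never verified.
sessions = [
    {"eligible": True, "authorized": True, "completed": True, "requested_at": 0, "verified_at": 240},
    {"eligible": True, "authorized": True, "completed": True, "requested_at": 60, "verified_at": None},
    {"eligible": True, "authorized": True, "completed": False, "requested_at": 90, "verified_at": None},
    {"eligible": False, "authorized": False, "completed": False, "requested_at": 120, "verified_at": None},
    {"eligible": True, "authorized": True, "completed": True, "requested_at": 100, "verified_at": 400},
]


def verified_resolution_rate(rows):
    # Denominator: eligible sessions only. Numerator: verified postconditions.
    eligible = [r for r in rows if r["eligible"]]
    verified = [r for r in eligible if r["verified_at"] is not None]
    return len(verified) / len(eligible) if eligible else None


def action_completion_rate(rows):
    # Denominator: authorized action attempts, not every conversation.
    attempts = [r for r in rows if r["authorized"]]
    completed = [r for r in attempts if r["completed"]]
    return len(completed) / len(attempts) if attempts else None


def median_time_to_resolution(rows):
    # Clock runs from the original request to the verified postcondition.
    durations = [r["verified_at"] - r["requested_at"]
                 for r in rows if r["verified_at"] is not None]
    return median(durations) if durations else None


print(verified_resolution_rate(sessions))   # 0.5
print(action_completion_rate(sessions))     # 0.75
print(median_time_to_resolution(sessions))  # 270.0
```

The second session shows why the two rates differ: the tool completed, but the postcondition was never verified, so it counts toward action completion and not toward verified resolution.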

Do not optimize ticket deflection or containment in isolation. The absence of a ticket can represent resolution, abandonment, or a user working around the problem. Pair support-efficiency measures with product success, repeat contact, and safety guardrails.

Use evaluations and experiments for different questions

A disciplined AI product rhythm combines pre-release evaluations, shadow operation, controlled experiments, and production monitoring. Each mechanism answers a different question:

Pre-release evaluations: Can the system interpret known intents, select the right context, follow policy, choose an allowed tool, handle tool errors, and verify the expected postcondition? Run the relevant suite whenever the model, prompt, context contract, policy, tool, or playbook changes.
Shadow operation: What would the agent have proposed in real traffic without being allowed to change state? Review mismatched diagnoses, unsupported context, unsafe actions, and missed escalation conditions.
Controlled experiments: Does the agent improve the predefined outcome compared with the existing support experience for the eligible population? Record assignment before the interaction and preserve it through outcome analysis.
Production monitoring: Are errors, reversals, escalations, latency, or policy blocks changing by journey, user role, entitlement, playbook, or release version?

Be careful with naive correlation. Users who invoke support are often already struggling, so their outcomes may look worse than those of users who never needed help. Random assignment among eligible users gives you a defensible counterfactual. When randomization is not possible, describe the result as observational and avoid claiming that the agent caused the change.
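Recording assignment before the interaction is simpler when assignment is a pure function of stable identifiers. The sketch below uses hash-based bucketing, a common deterministic stand-in for random assignment; the `assign_arm` function, the arm names, and the experiment key are assumptions for illustration.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign an eligible user to an experiment arm.

    The same user and experiment always map to the same arm, so the
    assignment can be logged before the interaction and recomputed later.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "agent" if bucket < treatment_share else "existing_support"


arm = assign_arm("u_123", "onboarding-agent-pilot")
print(arm == assign_arm("u_123", "onboarding-agent-pilot"))  # True
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who lands in the agent arm of one pilot is not systematically placed in the agent arm of the next.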

Log enough version information to reproduce a decision: model, prompt, policy, context schema, playbook, experiment assignment, tool version, input identifiers, authorization result, and postcondition. Do not place raw secrets or unrestricted personal data in that trace. A metric change is actionable only when you can connect it to the system version that produced it.
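A fixed-schema trace record makes the version fields mandatory and leaves no free-form slot for raw payloads. The `DecisionTrace` class, its field names, and the sample values below are hypothetical; treat this as a sketch of the shape rather than a logging standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionTrace:
    model_version: str
    prompt_version: str
    policy_version: str
    context_schema_version: str
    playbook_version: str
    tool_version: str
    experiment_assignment: str
    input_ids: tuple  # identifiers only, never raw field values
    authorization_result: str
    postcondition_verified: bool


def to_log_line(trace: DecisionTrace) -> str:
    """Serialize the trace as one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = DecisionTrace(
    model_version="model-2026-04",
    prompt_version="p-17",
    policy_version="pol-4",
    context_schema_version="ctx-3",
    playbook_version="onboarding-setup-v2",
    tool_version="set_setting-1.1",
    experiment_assignment="agent",
    input_ids=("u_123", "acct_9"),
    authorization_result="allowed",
    postcondition_verified=True,
)
line = to_log_line(trace)
print(json.loads(line)["playbook_version"])  # onboarding-setup-v2
```

Because the record stores identifiers rather than values, the trace can be joined back to governed sources when an investigation needs more detail.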

Set action boundaries before the model receives tool access

Model confidence is not authority. A highly confident response must never expand a user’s permissions, bypass confirmation, or convert a prohibited action into an allowed one. Authorization belongs in deterministic policy and tool infrastructure outside the model.

Action class	Typical scope	Required controls	Verification
Read and explain	Show relevant state, explain an error, or recommend a next step	User-scoped reads, field filtering, and visible uncertainty when evidence conflicts	Confirm that the response used current state and an approved knowledge path
Reversible change	Update a non-sensitive preference, run validation, or trigger a recoverable workflow	Preview, confirmation when needed, typed input, idempotency, and rollback	Read the resulting state and observe the workflow’s success event
Consequential change	Alter billing, permissions, security, external communication, or retained data	Strong confirmation or human review, separation of duties, and a complete audit trail	Verify every postcondition and provide a safe recovery or escalation route

Implement the boundary with controls the agent cannot negotiate away:

Least-privilege credentials: issue short-lived, user-scoped authorization rather than a general service credential wherever the architecture permits it.
Allowlisted tools: expose narrow actions with typed parameters, explicit preconditions, and constrained targets. Do not give a customer-facing agent arbitrary database or shell access.
Policy before execution: validate identity, permission, data sensitivity, action class, and confirmation status outside the model before any state-changing call.
Postcondition checks: require the agent to read the resulting state. A successful API response can still produce the wrong business outcome.
Safe retries: attach idempotency controls to operations that might be repeated after a timeout or interrupted conversation.
Complete handoffs: send the human operator the intent, relevant context, diagnosis, attempted action, tool result, and unresolved condition so the customer does not have to start over.
Controlled release: use feature flags, cohort restrictions, action-level limits, and an immediate disable path while a workflow is being validated.
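The "policy before execution" control can be sketched as a deterministic check that runs outside the model. The tool names, permission strings, and `authorize` function are assumptions, and this version simplifies the table above by always requiring user confirmation for reversible changes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ActionClass(Enum):
    READ = "read_and_explain"
    REVERSIBLE = "reversible_change"
    CONSEQUENTIAL = "consequential_change"


# Hypothetical allowlist; a real deployment would load this from reviewed config.
ALLOWLISTED_TOOLS = frozenset({"read_config", "run_validation", "update_preference"})


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    action_class: ActionClass
    user_permissions: frozenset
    required_permission: str
    user_confirmed: bool = False
    human_approved: bool = False


def authorize(request: ToolRequest) -> Tuple[bool, str]:
    """Decide whether a tool call may run. Model confidence is not an input."""
    if request.tool not in ALLOWLISTED_TOOLS:
        return False, "tool_not_allowlisted"
    if request.required_permission not in request.user_permissions:
        return False, "permission_denied"
    if request.action_class is ActionClass.CONSEQUENTIAL and not request.human_approved:
        return False, "human_review_required"
    if request.action_class is ActionClass.REVERSIBLE and not request.user_confirmed:
        return False, "confirmation_required"
    return True, "allowed"


request = ToolRequest(
    tool="update_preference",
    action_class=ActionClass.REVERSIBLE,
    user_permissions=frozenset({"settings:write"}),
    required_permission="settings:write",
)
print(authorize(request))  # (False, 'confirmation_required')
```

The request carries no confidence score at all, so nothing the model generates can widen its own authority. The only way to change the decision is to change the user's permissions, the confirmation state, or the reviewed allowlist.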

Evaluate build-versus-buy decisions at the system boundary

Conversation quality is easy to demonstrate and difficult to use as a purchasing criterion. Evaluate an agent platform on whether it can operate inside your context, permission, observability, and experimentation model.

Can you define and inspect the context contract for each workflow?
Can the platform use user-scoped credentials and enforce tool permissions outside the prompt?
Can every decision, action, version, and outcome be exported to your unified analytics platform?
Can you separate aggregate analytics access from individual customer support access?
Can you run offline evaluations, shadow traffic, controlled experiments, and cohort rollouts?
Can you configure confirmation, rollback, handoff, retention, and data-residency policies?
Can you change the model, tool, or support system without losing metric definitions and historical outcome traces?

A platform that generates excellent dialogue but cannot expose its action trace or connect to verified outcomes will make governance and product measurement harder. A less theatrical system with clear contracts may be the more useful product foundation.

Key takeaways

Start with a governed context contract, not a larger prompt or model.
Connect product analytics and support through shared identities, metric definitions, outcome events, and versioned playbooks.
Give customer-facing agents user-scoped context and a small set of reversible, allowlisted actions.
Count a resolution only when the intended product state or success event is verified.
Use offline evaluations for capability and policy, controlled experiments for causal impact, and production monitoring for drift and safety.
Expand autonomy only after context accuracy, tool reliability, outcome lift, and guardrails have all been demonstrated.

At your next roadmap review, ask for one pilot contract rather than a broad AI support initiative. Choose one recurring journey, name its verified success event, define the smallest safe action set, and make the owner show how every action will be authorized, observed, and reversed. That is enough to move from a chatbot concept to an agentic product you can manage.

April 21, 2026

How We Taught Agentic AI to Speak Product Analytics—and Unlocked Actionable Insights

I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.

The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.

We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.

Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check the minimum detectable effect (MDE) for an A/B test, and summarize findings with references.

Architecturally, we treated this as “Agent Analytics”: an orchestrator that selects tools based on intent—querying Amplitude analytics or Pendo for behavioral paths and funnels, hitting our warehouse for cohort tables, or pulling experiment metadata and anomaly detection alerts. Tool use is permission-aware, auditable, and designed to fail safe. The agent’s outputs include citations back to the exact definitions, dashboards, and SQL used, so reviewers can validate and iterate.

Quality came from eval-driven development, not intuition. We built a gold set of representative product questions (activation inflections, retention analysis by segment, funnel drop-offs after feature launches) and scored the agent on faithfulness to definitions, numerical accuracy, latency, and actionability. We incorporated regression checks to catch drifts after schema changes, and we tuned prompts to reduce overconfident answers and push for clarifying questions when context was missing.

Safety and reliability were non-negotiable. We layered AI risk management with role-based access, guardrails that block destructive queries, and risk scoring for unfamiliar joins or sudden spikes in metric deltas. The agent logs every step—what it retrieved, which tools it called, and why—so analysts can replay and refine the chain of thought with transparent provenance.

The payoff: product teams now self-serve nuanced questions in minutes instead of days, and our analysts spend more time on discovery than report wrangling. Retention analysis improved as the agent standardized cohort logic; conversion investigations accelerated thanks to consistent funnel definitions; and cross-functional decisions aligned around the same driver trees and shared language. Most importantly, the agent turned ambiguous asks into structured analyses that stand up to scrutiny.

For fellow product leaders, my lesson is simple: start with semantics, not models. A crisp ontology, disciplined taxonomy, and clear ownership will outperform a flashy stack riddled with ambiguity. Avoid technology FOMO; favor retrieval-first grounding, small sharp tools, and continuous discovery with your product trios. When your organization speaks a common analytics language, agents can finally think with you, not just for you.

Next, we’re extending the agent’s planning skills to recommend experiment designs, estimate statistical power and the minimum detectable effect (MDE), and propose driver-tree-informed bet sizing. We’re also tightening feedback loops so every accepted answer, edit, or override strengthens the retrieval corpus and evaluations. The vision: a calm, reliable layer that makes rigorous product analytics feel conversational and helps teams move from questions to confident action.

Inspired by this post on Amplitude – Best Practices.

April 13, 2026