Author: Shivam Tiwari

A Reliable Amplitude AI Workflow for Product Decisions
You ask Amplitude AI why activation fell. It returns a convincing explanation, a few plausible segments, and a recommendation your team could act on. The problem is that you still don’t know whether the answer reflects your product data, an ambiguous metric, or a reasonable-sounding guess.

You don’t fix that uncertainty with a longer prompt. You fix it with a controlled workflow: define the decision, provide only the context needed to analyze it, let AI run a bounded sequence of checks, and require evidence before accepting a conclusion. The result is an analysis another product manager can inspect, reproduce, and turn into action.

Start with a decision contract, not an open-ended question

A request such as “analyze our onboarding” leaves too many choices to the model. It must decide what onboarding means, which users count, what success looks like, which period matters, and whether the goal is diagnosis or opportunity discovery. A polished answer can hide those unresolved choices.

Write a short decision contract before opening the analysis. It should contain five elements:
- Decision: State what someone will decide after reading the result. For example: decide which activation bottleneck the onboarding team should investigate next.
- Population: Name the eligible users, accounts, plan types, platforms, markets, or acquisition channels.
- Metric: Supply the exact event or formula, its time window, and any exclusions.
- Evidence bar: Specify what the answer must show, such as the supporting events, segments, funnel steps, or behavioral trend.
- Output: Ask for a conclusion, competing explanations, uncertainties, and the next analysis or product action.
A useful objective is narrow enough to fit in one sentence. Your quality rubric can be slightly longer: require every conclusion to identify the relevant metric, population, comparison, and evidence. This intent-first, evaluation-driven approach keeps the analysis tied to a product decision instead of rewarding whatever answer sounds most complete.

Constraints belong in the contract too. If the team cannot change pricing, instrumentation, or a particular onboarding step, say so. If a result must remain descriptive because the analysis cannot establish causality, require that distinction. AI is more useful when it knows which doors are closed.
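
One way to keep the contract honest is to store it as structured data and refuse to start an analysis while any element is blank. The sketch below is a minimal illustration, not an Amplitude feature: the `DecisionContract` class, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContract:
    """Hypothetical container for the five contract elements plus constraints."""
    decision: str
    population: str
    metric: str
    evidence_bar: str
    output: str
    constraints: tuple = ()  # doors that are closed, e.g. "pricing cannot change"

    def missing_fields(self) -> list:
        """Return the names of required elements that were left blank."""
        required = ("decision", "population", "metric", "evidence_bar", "output")
        return [name for name in required if not getattr(self, name).strip()]


# Illustrative values only; the metric is left blank on purpose.
contract = DecisionContract(
    decision="Decide which activation bottleneck the onboarding team should investigate next.",
    population="New self-serve accounts on web",
    metric="",
    evidence_bar="Supporting events, segments, and funnel steps for every conclusion",
    output="Conclusion, competing explanations, uncertainties, next analysis",
    constraints=("Descriptive only: the analysis cannot establish causality",),
)

print(contract.missing_fields())
```

A check like this is cheap to run before any prompt is sent, and it makes an unresolved metric definition visible instead of leaving the model to fill it in.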

Build a compact context packet Amplitude AI can actually use

Amplitude AI can only interpret behavior through the data model it receives. If two teams use different definitions of an activated account, or an event changed meaning after an instrumentation update, the model can produce a coherent answer to the wrong question.

Create a reusable context packet for each important product area. Keep it short enough to review, but precise enough to remove semantic guesswork. Include:
- Metric definitions: Write the numerator, denominator, qualifying window, and exclusions for activation, retention, conversion, or any other decision metric.
- Event taxonomy: List the events and properties relevant to the question, including known aliases or deprecated events that should not be used.
- Segment definitions: Explain how key cohorts are formed and which properties distinguish users from accounts.
- Known data limitations: Flag missing platforms, delayed events, identity-resolution issues, tracking changes, and periods that should not be compared.
- Recent product context: Include only releases, experiments, or journey changes that could plausibly affect the behavior under review.
Use retrieval before expansion. Start with the smallest relevant set of definitions and observations. Add more context only when the analysis reaches a question that requires it. Dumping an entire analytics catalog into the prompt makes it harder to see which definitions shaped the answer and gives irrelevant details more chances to distract the model.

Examples can stabilize recurring work, but choose them carefully. One to three strong examples are enough to demonstrate the expected structure, evidence standard, and level of uncertainty. Remove old conclusions and stale numbers before reuse. You want the model to copy the analytical pattern, not inherit a previous answer.

Version this packet alongside the workflow. When an event definition, segment, or guardrail changes, record the change and rerun the analyses that depend on it. That turns context management from prompt housekeeping into part of your analytics governance.
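
A lightweight way to version the packet is to fingerprint its contents and store that fingerprint with every analysis. The following sketch assumes the packet is kept as a JSON-serializable dictionary; the function names, packet keys, and event names are hypothetical placeholders rather than anything Amplitude defines.

```python
import hashlib
import json


def packet_version(packet: dict) -> str:
    """Fingerprint a context packet; key order does not affect the result."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stale_analyses(analysis_versions: dict, current_version: str) -> list:
    """Return analyses recorded against a different packet version."""
    return sorted(
        name for name, version in analysis_versions.items()
        if version != current_version
    )


# Illustrative packet; the event names are placeholders.
packet = {
    "metric_definitions": {
        "activation": {
            "numerator": "accounts with project_created",
            "denominator": "accounts with signup_completed",
            "window_days": 7,
        }
    },
    "known_limitations": ["mobile events delayed"],
}
v1 = packet_version(packet)

# A definition change produces a new version.
updated = json.loads(json.dumps(packet))
updated["metric_definitions"]["activation"]["window_days"] = 14
v2 = packet_version(updated)

analyses = {"activation_drop_review": v1, "onboarding_funnel_review": v2}
print(stale_analyses(analyses, v2))
```

Anything the last call returns was produced under an older definition and is a candidate for a rerun.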

Run a bounded analysis loop, then challenge the result

Move from observation to explanation in explicit steps

Don’t ask for a diagnosis in a single jump. A reliable workflow separates what happened from why it may have happened. Use a fixed sequence:
1. Establish the baseline. Confirm the metric definition, eligible population, comparison, and direction of change.
2. Locate the difference. Break the result down by the segments most relevant to the decision. Avoid exploring every available property.
3. Inspect the journey. Examine funnel steps, behavioral paths, retention patterns, or other views that can show where behavior diverges.
4. Generate competing hypotheses. Ask for more than one plausible explanation and require supporting and contradicting evidence for each.
5. Choose the next best analysis. Run the segment drill-down, funnel attribution, or anomaly check most likely to separate the leading explanations.
6. Apply a stop rule. End when the evidence is sufficient for the stated decision, when the remaining uncertainty requires new instrumentation, or when another analysis would not change the next action.
The stop rule matters. Without one, an agentic workflow can keep generating cuts of the data that add activity without increasing confidence. Before each tool call, require the system to state what question the analysis will answer and how each possible result would change its next step.
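
The stop rule and the "state the question first" requirement can be sketched as a small loop. This is an illustration under stated assumptions: `PlannedCheck`, `run_bounded_loop`, and the convention that a next step beginning with "stop" ends the loop are all invented for the example, and each `run` callable stands in for whatever analysis the workflow would actually execute.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlannedCheck:
    question: str             # what the analysis will answer
    if_supported: str         # next step if the result supports the hypothesis
    if_contradicted: str      # next step if it does not
    run: Callable[[], bool]   # hypothetical analysis; True means "supported"


def run_bounded_loop(checks, max_checks=5):
    """Run checks in order, skipping any that cannot change the next step."""
    trace = []
    executed = 0
    for check in checks:
        if executed >= max_checks:
            trace.append((check.question, "not run: check budget exhausted"))
            break
        if check.if_supported == check.if_contradicted:
            trace.append((check.question, "skipped: no result would change the next step"))
            continue
        executed += 1
        next_step = check.if_supported if check.run() else check.if_contradicted
        trace.append((check.question, next_step))
        if next_step.startswith("stop"):
            break
    return trace


checks = [
    PlannedCheck("Did activation fall for the eligible population?",
                 "continue: segment the drop", "stop: no change to explain", lambda: True),
    PlannedCheck("Does plan type matter?", "continue", "continue", lambda: True),
    PlannedCheck("Is the drop concentrated at one funnel step?",
                 "stop: investigate that step", "continue: check tracking changes", lambda: True),
    PlannedCheck("Did a tracking change coincide with the drop?",
                 "stop: fix instrumentation", "continue", lambda: False),
]
trace = run_bounded_loop(checks)
for question, outcome in trace:
    print(f"{question} -> {outcome}")
```

The second check is skipped because both outcomes lead to the same step, and the fourth never runs because the third already reached a stop condition.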

If you expose Amplitude actions through MCP or another callable interface, keep each tool narrow and observable. A call should have explicit inputs, a recognizable output shape, and an error state the workflow can surface. Log the question, parameters, returned evidence, and the interpretation built from it. Tool access makes iteration faster; it does not remove the need for an audit trail.
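
As a rough sketch of that audit trail, the wrapper below records the question, parameters, returned evidence, and any error for each call. It does not use a real Amplitude or MCP API: `call_with_audit`, `ToolCallRecord`, and the two stand-in tools are hypothetical, and a real tool would be whatever callable your interface exposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    question: str
    tool: str
    parameters: dict
    evidence: Optional[Any] = None
    error: Optional[str] = None
    interpretation: str = ""   # filled in after the evidence is reviewed


def call_with_audit(question: str, tool_name: str, tool: Callable[..., Any],
                    parameters: dict, log: list) -> ToolCallRecord:
    """Call a tool, capture its output or error, and append the record to the log."""
    record = ToolCallRecord(question=question, tool=tool_name, parameters=parameters)
    try:
        record.evidence = tool(**parameters)
    except Exception as exc:  # surface the error state instead of hiding it
        record.error = f"{type(exc).__name__}: {exc}"
    log.append(record)
    return record


# Stand-in tools for illustration only.
def fake_funnel_query(step: str) -> dict:
    return {"step": step, "rows": []}


def broken_tool(step: str) -> dict:
    raise TimeoutError("query timed out")


log = []
ok = call_with_audit("Where do users drop off?", "funnel_query",
                     fake_funnel_query, {"step": "invite_teammate"}, log)
failed = call_with_audit("Where do users drop off?", "funnel_query",
                         broken_tool, {"step": "invite_teammate"}, log)
print(ok.evidence, failed.error)
```

Keeping the interpretation as a separate field makes it easier to see later whether a conclusion came from the returned evidence or was added on top of it.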

Put every conclusion through a verification gate

Before a finding reaches a stakeholder, check it against a simple evidence ledger. For each important claim, record:
- the event, metric, segment, funnel step, or trend that supports it;
- the population and comparison to which it applies;
- whether it is an observation, interpretation, or causal hypothesis;
- the strongest alternative explanation;
- the assumptions or data limitations that could change the conclusion;
- the next check required if confidence is still too low for the decision.
Then try to disprove the preferred answer. Ask whether the pattern survives a relevant segment change, whether a tracking change could explain it, and whether the same evidence also supports a competing hypothesis. This adversarial pass is often more valuable than asking the model to make its first response more detailed.
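
The ledger fields above can be turned into a gate that blocks a claim until the required evidence is recorded. This is a minimal sketch; the class, field names, and example claim are hypothetical, and the rule that a low-confidence claim must name a next check is one reasonable reading of the list, not a fixed standard.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    OBSERVATION = "observation"
    INTERPRETATION = "interpretation"
    CAUSAL_HYPOTHESIS = "causal hypothesis"


@dataclass
class LedgerEntry:
    claim: str
    support: str                    # event, metric, segment, funnel step, or trend
    population_and_comparison: str
    claim_type: ClaimType
    strongest_alternative: str
    assumptions: str
    next_check: str = ""
    confident_enough: bool = False


def gate_failures(entry: LedgerEntry) -> list:
    """List the reasons an entry should not reach a stakeholder yet."""
    problems = []
    for name in ("support", "population_and_comparison",
                 "strongest_alternative", "assumptions"):
        if not getattr(entry, name).strip():
            problems.append(f"missing {name}")
    if not entry.confident_enough and not entry.next_check.strip():
        problems.append("low confidence without a next check")
    return problems


# Illustrative entry with two gaps.
entry = LedgerEntry(
    claim="Activation fell mainly at the invite step.",
    support="Funnel step conversion by week",
    population_and_comparison="New web accounts, this month vs. last month",
    claim_type=ClaimType.OBSERVATION,
    strongest_alternative="",
    assumptions="No tracking change in the period",
)
print(gate_failures(entry))
```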

Turn repeated checks into an evaluation set. Save representative questions, approved metric definitions, required evidence fields, and known failure cases. Rerun them when prompts, context, instrumentation, or model versions change. Review failures by category: wrong scope, wrong metric, unsupported inference, missed uncertainty, or unusable recommendation. That gives your team a regression signal instead of a vague impression that the workflow still works.
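
A regression signal of this kind can be as simple as comparing two runs of the same evaluation set. In the sketch below, the result format (a mapping from case ID to a failure category, or `None` for a pass) and both function names are assumptions made for illustration; the category labels come from the list above.

```python
from collections import Counter

FAILURE_CATEGORIES = (
    "wrong scope",
    "wrong metric",
    "unsupported inference",
    "missed uncertainty",
    "unusable recommendation",
)


def failures_by_category(results: dict) -> Counter:
    """Count failed cases per category; None means the case passed."""
    counts = Counter()
    for case_id, failure in results.items():
        if failure is None:
            continue
        if failure not in FAILURE_CATEGORIES:
            raise ValueError(f"{case_id}: unknown failure category {failure!r}")
        counts[failure] += 1
    return counts


def regressions(previous: dict, current: dict) -> list:
    """Cases that passed in the previous run and fail in the current one."""
    return sorted(
        case for case, failure in current.items()
        if failure is not None and case in previous and previous[case] is None
    )


# Illustrative runs before and after a prompt change.
previous = {"q1": None, "q2": "wrong metric", "q3": None}
current = {"q1": "unsupported inference", "q2": "wrong metric", "q3": None}

print(regressions(previous, current))
print(failures_by_category(current))
```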

Hand stakeholders a decision artifact, not an AI transcript

The output should make the next decision easier. A long transcript of prompts, tool calls, and exploratory branches shifts the work of interpretation onto the reader. Keep the trace for auditability, but present a concise decision artifact with six fields:
- Decision: The choice this analysis informs.
- Finding: The clearest supported behavioral observation.
- Evidence: The exact events, segments, funnel steps, or trends behind the finding.
- Uncertainty: What remains unknown and what the analysis cannot establish.
- Recommendation: The next analysis, discovery activity, experiment, or product change justified by the evidence.
- Owner: The person responsible for the next step and the condition that triggers a follow-up.
Keep human judgment at the decision boundary. Amplitude AI can retrieve definitions, propose analyses, call tools, compare patterns, and draft the artifact. A product leader should still decide whether the evidence is strong enough, whether the recommendation fits current constraints, and whether the cost of being wrong is acceptable.

That division of labor also clarifies accountability. If the AI workflow produces an unsupported inference, improve the context, tool contract, or evaluation. If the evidence is sound but the organization chooses a different path, record the strategic reason. Don’t let an AI-generated recommendation blur the difference between analytical output and an accountable product decision.

Key takeaways
- Begin with the decision, population, metric, evidence bar, and required output.
- Give Amplitude AI a small, versioned context packet instead of an unfiltered analytics catalog.
- Separate baseline measurement, segmentation, journey analysis, hypothesis generation, and the next tool call.
- Require evidence, alternatives, assumptions, and a stop rule before accepting a conclusion.
- Save recurring checks as evaluations and rerun them when data, prompts, tools, or models change.
- Deliver a decision artifact with a named owner while keeping the analytical trace available for review.
Start with one recurring product question this week. Write its decision contract, assemble the minimum context packet, and define the verification gate before asking Amplitude AI to analyze anything. Once that workflow survives review, save it as the template for the next question.

References
- Shivam.Consulting Blog — Decode How Amplitude AI Thinks: Proven Workflows to Get Actionable, High-Accuracy Results
June 2, 2026
Join Me in June: Master Opportunity-First Product Strategy with Continuous Discovery Habits

I’m celebrating the five-year anniversary of Continuous Discovery Habits by inviting you to read it with me this June. As someone who leads product management and coaches product trios, I’ve seen how a shared discovery practice tightens alignment, speeds up learning, and drives outcomes. This month, we’ll go deep on prioritizing opportunities—not solutions—and I’ll guide you step by step so you can apply the ideas on your own team.

Each month, I’m releasing an in-depth reading guide that includes:
- The chapters we will be reading
- A preview of the most important concepts we'll be learning about
- Short videos you can share with friends and colleagues to help spread the ideas
- Individual and team discussion questions to help you absorb and engage with the reading
- Team exercises to help you put the ideas into practice
- Additional reading to help you go deeper on the core ideas

We’ll discuss each month’s reading in the comments, and we’ll gather quarterly on a live call to unpack real-world applications, trade wins and missteps, and keep the momentum going.

Joining late? No problem. I monitor the comments on each reading guide throughout the year. Start with the current month or go back to January, whatever works for you. Ask for help, share what’s working, and connect with other readers at any point.

If you want to participate, grab a copy of the book (or dust off your old copy), share the “Spread the Love” videos with your team, block time for the exercises, and register for the community sessions. Let’s do this.

This Month’s Reading

Chapter:
- Chapter 7: Prioritizing Opportunities, Not Solutions

Estimated reading time: ~16 minutes

This month's chapter will introduce you to:
- Why product strategy happens in the opportunity space, not the solution space
- How to focus on one target opportunity at a time to deliver value iteratively
- Using the tree structure to simplify prioritization decisions
- The four criteria for assessing opportunities: sizing, market factors, company factors, and customer factors
- Why treating prioritization as a messy, subjective decision leads to better outcomes than scoring formulas
- The concept of two-way door decisions and how they apply to opportunity prioritization

Need a copy? Grab the book

Share the Love with Friends and Colleagues

We learn best in community. Use these short videos to spread the key ideas across your product trios, engineering partners, and stakeholders. Invite them to read along with you so your discovery cadence and your product strategy advance together.
- Work on one small opportunity at a time – Reduce your batch size
- Getting started with compare and contrast decisions – Choose the right target opportunity
- Turn big intractable problems into smaller, more solvable problems – The power of decomposition

Reflect & Discuss What You Read

When we reflect and discuss what we read, we absorb more and apply it faster. This chapter challenges a deeply ingrained habit: prioritizing solutions. I’ve been in those meetings: spreadsheets full of features, heated roadmap debates, and a creeping sense that we’re optimizing outputs rather than outcomes. The shift to opportunity-first thinking changed how my teams frame bets, sequence discovery, and communicate product strategy.

Individual Reflection
- Think about your team's current roadmap or backlog. How much of your time is spent prioritizing features versus understanding and prioritizing customer opportunities? What would change if you flipped that ratio?
- Reflect on the last time you made a product decision. Did you treat it as a one-way door (irreversible) or a two-way door (reversible)? How did that framing affect your decision-making process and timeline?
- Consider the four assessment criteria (opportunity sizing, market factors, company factors, customer factors). Which of these does your team currently emphasize most? Which do you tend to overlook or underweight?

Team Discussion
- As a team, list the top 5-10 items on your current roadmap or backlog. For each one, try to identify the underlying customer opportunity it addresses. If you can't clearly articulate the opportunity, what does that tell you about how you're making decisions?
- The chapter argues against scoring formulas (like RICE or ICE) for prioritization, calling them "made-up math." If your team uses a scoring system, discuss: What is it really measuring? Does it help you make better decisions, or does it just make subjective decisions feel more objective?
- Walk through a recent prioritization decision. Did you assess options in isolation ("should we build this?") or compare and contrast them? How might your decision have been different with a compare-and-contrast approach?

Put It Into Practice

This month is all about shifting from solution-first to opportunity-first thinking. These short, focused exercises will help your product trio practice opportunity prioritization and improve decision speed without sacrificing product discovery rigor.

Exercise: Map Your Roadmap to Opportunities

Time: 45 minutes
Do this: With your product trio

Take your current roadmap or backlog and work backwards. For each planned feature or solution:
- Identify the customer opportunity it's meant to address
- Write it as something a customer might say (e.g., "I can't find anything to watch" not "We need better search")
- Look for patterns: Are multiple solutions addressing the same opportunity? Are some solutions disconnected from any clear customer need?

This exercise often reveals that you're either:
- Spreading yourself thin across too many opportunities
- Over-investing in a single opportunity with multiple solutions
- Building solutions with no clear opportunity attached

Use these insights to inform your next prioritization conversation.

Exercise: Practice Two-Way Door Thinking

Time: 30 minutes
Do this: With your product trio

Choose 3-5 recent or upcoming product decisions. For each one, discuss:
- Is this a one-way door decision (hard to reverse) or a two-way door decision (easy to reverse)?
- If it's a two-way door, what's the smallest step we could take to learn whether we're on the right track?
- What would we need to see to know we made the wrong choice?
- If we realize we're wrong, how quickly could we course-correct?

The goal is to calibrate your team's decision-making speed. Two-way door decisions should be made quickly with "just enough" evidence. One-way door decisions deserve more deliberation and data.

Go Deeper: Additional Reading

If you prefer an audio summary of this month’s reading, including the book chapters and the following resources, I’ve included an audio version for members at the bottom of this post.

Related In-Depth Guides
- Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive Outcomes
- Customer Interviews: Uncover Hidden Insights from Every Conversation

Supplementary Reading
- Prioritize Opportunities, Not Solutions
- 7 Key Benefits of Using Opportunity Solution Trees
- Product in Practice: How 2-Way Door Decisions Helped Simply Business Learn Fast
- Product in Practice: Getting Started with Opportunity Solution Trees at SuperAwesome

Related Courses
- Product Discovery Fundamentals: Learn a structured and sustainable approach to continuous discovery.

Our Live Discussion Schedule

Our live discussion sessions are for registered members. Sessions are not recorded. Invitations will go out two weeks before the scheduled event, so reserve time now.
- Tuesday, June 16, 2026: 9am-10am PDT
- Thursday, September 17, 2026: 9am-10am PDT
- Wednesday, December 16, 2026: 9am-10am PST

Audio Summary

Prefer to listen? Stream the audio overview here: June: Prioritizing Opportunities (audio).

Ready to put continuous discovery into action? Grab the book, share the videos with your team, schedule the exercises, and join the community sessions. Opportunity-first product strategy is a muscle we can build together.

Inspired by this post on Product Talk.

June 2, 2026
Stop Support Tickets Before They Start: How AI Unsticks Users and Lifts Conversions

Every moment of friction in a product carries a hidden cost: attention drifts, motivation wanes, and the next click becomes a support ticket—or worse, silent churn. Over the years, I’ve learned to treat “stuck” as an urgent product signal, not just an operational nuisance. When we unstick users in the flow, we protect revenue, brand trust, and the momentum that powers product-led growth.

“Learn how Amplitude’s Global Support team uses AI Assistant to reduce support tickets, prevent user churn, and increase conversions.”

I reference that line often because it captures a proven pattern: meet users where confusion peaks and resolve it instantly. In my practice, the formula is straightforward—pair behavioral analytics and session replay with a just-in-time AI Assistant, routed by clear driver trees. This transforms support from reactive firefighting into a proactive, in-product experience that accelerates onboarding and boosts user activation.

Here’s how I operationalize it. First, I use Amplitude analytics and behavioral analytics to surface high-friction steps—pages with elevated drop-off, loops, or rage clicks. Session replay clarifies the “why” behind the numbers, while cohort and retention analysis reveal who’s most at risk. Then I deploy targeted in-app guides and tooltip design to preempt known pitfalls, while an AI Assistant handles real-time questions with context from our knowledge base and product docs.

The AI Assistant is more than a chatbot. With well-structured AI workflows, it detects intent, pulls precise snippets from docs-as-code, and handles routine issues instantly. When complexity spikes, it executes a graceful handoff to consultative support via Intercom or a Zendesk integration—preserving conversation history and sentiment cues—so humans spend time where judgment matters. This hybrid model keeps response times low without sacrificing quality.

To de-risk changes, I lean on A/B testing and feature flags. I measure time-to-value, activation rate, and funnel conversion as leading indicators, while tracking ticket deflection, CSAT, and NRR as trailing indicators. The goal isn’t just fewer tickets; it’s faster learning loops and a compounding improvement in user outcomes. When we see activation curves steepen and onboarding friction flatten, we know the system is working.

Practically, I start with the top three friction points in onboarding, implement narrow in-app guides, and deploy the AI Assistant with strict guardrails and clear escalation paths. Weekly reviews align product, customer success, and solutions engineering around shared telemetry—so we tune prompts, content, and UI patterns together. Over time, I’ve seen ticket volume decline meaningfully, while conversion and retention rise as users experience fewer dead ends.

If you’re evaluating where to begin, identify the moments where confusion compounds—pricing configuration, integrations, and data mapping are common culprits. Then introduce targeted, context-aware help right where users hesitate. You’ll not only prevent “every stuck user” from turning into a ticket—you’ll convert friction into confidence, and confidence into growth.

Inspired by this post on Amplitude – Best Practices.

June 1, 2026

How to Build a Resilient Experimentation Program at Scale

Your teams are running more experiments, but decisions are not getting easier. Results arrive late, apparent wins fail to repeat, and every readout starts a new argument about the data.

The fix is not another testing tool or a higher experiment count. You need an operating system that protects validity when traffic, products, models, and customer behavior change underneath you. That system starts before exposure, routes each question to the right evaluation method, and ends with a decision your team can execute.

Give every experiment a decision contract

An experiment should begin with a decision, not a feature. Ask what you will do if the result is positive, negative, inconclusive, or unsafe. If the answer is the same in every case, the test is not worth running.

Turn the proposed test into a short decision contract before engineering begins. Record:

- The customer problem: the friction or unmet need you observed.
- The causal hypothesis: the product change, the behavior it should alter, and why.
- The eligible population: who can enter the experiment and who must be excluded.
- The primary outcome: the one metric that determines whether the hypothesis worked.
- The guardrails: the measures that can block a rollout even when the primary outcome improves.
- The decision thresholds: the minimum effect worth acting on and the conditions for shipping, iterating, stopping, or rolling back.

A driver tree helps you connect the primary metric to the business outcome without pretending that one experiment can prove the entire chain. If the goal is retention, for example, the immediate experiment may be designed to change activation behavior. The contract should distinguish that leading behavior from the longer-term outcome.

Set the minimum detectable effect and guardrails before reading results. The minimum detectable effect is not the smallest movement your analytics can display. It is the smallest improvement that would justify the cost, risk, and complexity of the change. If your available population cannot reliably detect that effect, narrow the question, combine low-traffic variants, choose a more sensitive proximal metric, or do not run the test.
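
To check whether the available population can detect the chosen effect, a rough sample-size estimate is usually enough. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name, the default significance level and power, and the example rates are assumptions for illustration, and your experimentation platform may use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative numbers: 20% baseline conversion, 2-point minimum worthwhile lift.
needed = sample_size_per_arm(baseline_rate=0.20, mde_abs=0.02)
smaller_effect = sample_size_per_arm(baseline_rate=0.20, mde_abs=0.01)
print(needed, smaller_effect)
```

Halving the effect roughly quadruples the required sample, which is why narrowing the question or choosing a more sensitive proximal metric can matter more than waiting longer.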

Pre-committing to the metric, stopping rule, exclusions, and decision criteria also limits convenient reinterpretation. Teams can still investigate unexpected patterns, but those findings should become new hypotheses rather than retroactive proof that the original bet won.

Match the question to the cheapest reliable evidence

Production A/B testing is only one layer of experimentation. It is often the slowest and most expensive layer because it consumes customer attention, operational capacity, and statistical power. Use it when real behavior is necessary to resolve a meaningful decision.

Evidence layer	Best question	Move forward when
Offline evaluation	Does the output meet a defined quality, policy, or safety standard?	The candidate passes the agreed evaluation set and regression checks.
Replay or shadow mode	How would the change behave on realistic inputs without affecting users?	Failure patterns, cost, and latency remain inside the operating limits.
Targeted canary	Is the change safe and observable under live conditions?	Telemetry is healthy and no guardrail triggers a rollback.
Controlled A/B test	Does the change cause a valuable shift in user behavior?	The result meets the pre-registered decision criteria.
Progressive rollout	Does the effect and reliability persist as exposure expands?	Segment-level outcomes and operational measures remain acceptable.

This layered model becomes essential for AI products. Prompts, retrieval logic, policies, model versions, and traffic composition can all change the experience. A single production metric cannot tell you whether a decline came from product value, output quality, latency, cost, safety, or an upstream model shift.

Build an evaluation stack for prompts, policies, regressions, canaries, and selective A/B tests. A candidate should earn broader exposure by passing the cheaper layers first. This reduces traffic waste and gives the team diagnostic evidence when a live result moves unexpectedly.
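
The "earn broader exposure" rule can be expressed as a simple ordering check. This sketch assumes a candidate must pass every earlier layer from the table before entering the next one, which is a stricter reading than some teams will want; the identifiers are hypothetical.

```python
# Evidence layers in the order listed in the table above.
EVIDENCE_LAYERS = (
    "offline_evaluation",
    "replay_or_shadow",
    "targeted_canary",
    "controlled_ab_test",
    "progressive_rollout",
)


def may_enter(layer: str, passed: set) -> bool:
    """True if every cheaper layer before `layer` has already been passed."""
    index = EVIDENCE_LAYERS.index(layer)  # raises ValueError for unknown layers
    return all(earlier in passed for earlier in EVIDENCE_LAYERS[:index])


passed = {"offline_evaluation", "replay_or_shadow"}
print(may_enter("targeted_canary", passed))
print(may_enter("controlled_ab_test", passed))
```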

Do not use a multi-armed bandit simply because it can direct more traffic toward a leading variant. Bandits are useful when the objective is clear, feedback is timely, and guardrails are dependable. They are a poor substitute for stable measurement or causal understanding. If you need to estimate an effect, learn about segments, or detect delayed harm, retain a controlled comparison.

Engineer trustworthy measurement and reversible delivery

An experimentation program is only as resilient as its event pipeline. A mathematically correct analysis built on shifting event definitions is still wrong. Treat instrumentation as a product interface with owners, documentation, versioning, tests, and observability.

Before exposure begins, verify that assignment, exposure, outcome, and guardrail events share consistent identities and timestamps. Confirm that users enter only the experiments for which they are eligible. Check that retries, duplicate events, delayed ingestion, and cross-device behavior cannot silently change the denominator.

Naming conventions, schema versioning, lineage, anomaly detection, and pipeline observability are not analytics housekeeping. They let teams move without sacrificing the meaning of their measurements. Assign an owner to every critical event and make schema changes visible to the teams whose experiments depend on them.

During the run, monitor data quality separately from product performance. Sample ratio mismatch, assignment failures, missing exposure events, sharp volume changes, and implausible segment movements should pause interpretation. Do not explain these signals away because the headline result looks attractive.
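
Sample ratio mismatch is one of the easier signals to automate. The sketch below runs a chi-square goodness-of-fit test on the two arm counts using only the standard library; the function names are hypothetical, and the 0.001 pause threshold is an assumed convention rather than a rule from this article.

```python
from math import erfc, sqrt


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test with one degree of freedom."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


def should_pause(control_n: int, treatment_n: int, threshold: float = 0.001) -> bool:
    """Pause interpretation when the observed split is very unlikely under the design."""
    return srm_p_value(control_n, treatment_n) < threshold


print(should_pause(5000, 5000))
print(should_pause(5000, 5400))
```

A very small p-value does not say what went wrong. It only says the assignment or exposure data should be investigated before anyone reads the headline result.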

Delivery must be reversible as well as measurable. Put material treatments behind feature flags. Start with a targeted canary, watch operational and customer guardrails, and expand exposure in stages. Define who can stop the rollout and make sure that person has both the telemetry and access required to act.
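
Staged exposure works best when bucketing is deterministic, so that a user who saw the treatment at 5% still sees it at 20%. The sketch below shows one common way to do that with a hash. It is not tied to any feature-flag product: `bucket`, `is_exposed`, the flag key, and the kill switch parameter are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> float:
    """Map a user to a stable value in [0, 1) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def is_exposed(user_id: str, flag_key: str, exposure_fraction: float,
               kill_switch: bool = False) -> bool:
    """Expose the user if their bucket falls under the current exposure fraction."""
    if kill_switch:  # rollback path: nobody sees the treatment
        return False
    return bucket(user_id, flag_key) < exposure_fraction


users = [f"user-{i}" for i in range(1000)]
canary = {u for u in users if is_exposed(u, "new-onboarding", 0.05)}
expanded = {u for u in users if is_exposed(u, "new-onboarding", 0.20)}
rolled_back = {u for u in users if is_exposed(u, "new-onboarding", 0.20, kill_switch=True)}

# Everyone in the canary stays exposed after expansion.
print(canary <= expanded, len(rolled_back))
```

Including the flag key in the hash keeps bucket assignments independent across flags, so the same users are not always the first to receive every change.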

For broad platform or AI changes, maintain a persistent holdout when feasible. A long-lived control gives you a reference point for cumulative effects that short experiments miss, including changes in retention, trust, support burden, and cost. Protect the holdout from accidental contamination and document every change that affects its interpretation.

Scale the program around decisions, not test volume

A central experimentation team cannot design and analyze every test at scale. Product teams need autonomy inside a governed system. Centralize the parts where inconsistency creates shared risk: assignment services, metric definitions, event standards, quality checks, templates, and audit records. Let teams own hypotheses, customer context, treatment design, and decisions inside those guardrails.

Use a lightweight review based on risk. A reversible interface change with a proven metric can follow a standard path. A pricing change, safety policy, ranking system, or shared AI capability deserves stronger review, tighter exposure controls, and a clearer rollback plan. Governance should become more demanding as the blast radius grows.

Maintain a portfolio view rather than a leaderboard of teams by test count. For each active experiment, track the decision it supports, expected value, detectable effect, traffic requirement, risk class, owner, and current evidence layer. This reveals when several teams are competing for the same population, when a strategic question is underpowered, and when multiple small tests should become one coherent learning plan.

Reset a brittle program over 90 days

You can make the operating model concrete without attempting a platform-wide rebuild:

- By day 30: audit the backlog and current tests. Stop or consolidate experiments that cannot meet their minimum detectable effect. Identify unreliable events, missing owners, conflicting metric definitions, and launches without explicit decision criteria. For AI surfaces, establish a minimal offline evaluation harness for prompts, policies, quality, and safety.
- By day 60: publish standard hypothesis and readout templates. Put high-risk changes behind feature flags, make guardrails visible, and introduce canary exposure. Establish persistent holdouts where broad or cumulative effects matter. Add alerts for instrumentation drift and operational regressions.
- By day 90: manage a balanced portfolio across offline evaluations, replay or shadow tests, canaries, controlled experiments, and progressive rollouts. Review program health through decision speed, valid learning, repeatability, and detected harm rather than the number of tests launched.

Create a community of practice alongside these controls. Regularly examine inconclusive results, failed replications, instrumentation incidents, and stopped rollouts. These cases expose weaknesses in the system more reliably than a gallery of wins. The goal is not to eliminate failure. It is to make failure informative, contained, and cheap.

Key takeaways

- Start with the decision the experiment must support, then pre-register the hypothesis, primary metric, guardrails, detectable effect, and stopping rule.
- Use offline evaluations, replay, shadow mode, and canaries to eliminate weak or unsafe candidates before consuming production traffic.
- Treat event semantics, assignment, exposure, lineage, and anomaly detection as production infrastructure.
- Pair controlled measurement with feature flags, progressive exposure, explicit rollback authority, and persistent holdouts where cumulative effects matter.
- Judge the program by trustworthy decisions and reusable learning, not experiment volume or the percentage of positive results.

Choose one upcoming decision with meaningful customer or operational risk. Write its decision contract, identify the cheapest evidence layer that could disprove it, and verify the rollback path before anyone builds the treatment. That single discipline is a practical starting point for a program that can keep learning as your product and organization change.

June 1, 2026

An AI Operating Model That Measures Outcomes, Not Activity

Your AI team is shipping, dashboards are filling up, and executives are still asking the uncomfortable question: what changed for the customer or the business?

The answer is rarely another model metric. You need an operating model that connects AI quality to customer behavior, workflow performance, commercial results, and risk. When that chain is visible, you can decide what to scale, what to repair, and what to stop.

Key takeaways

- Give every AI initiative an outcome contract that names the target behavior, business result, guardrails, and decision owner.
- Measure four linked layers: AI quality, user behavior, workflow results, and business outcomes.
- Preserve the context behind each interaction so you can compare outcomes by customer, workflow, model version, and acquisition path.
- Run one recurring evidence review where teams make explicit scale, fix, hold, or stop decisions.
- Use the first 90 days to prove a reusable learning system, not merely a functioning AI experience.

Start each initiative with an outcome contract

A feature brief tells a team what to build. An outcome contract tells it why the work exists, how evidence will be interpreted, and who can act on that evidence. It is the smallest practical unit of an outcome-led AI portfolio.

Write the contract before choosing a model or polishing a prompt. Keep it to one page and require six fields:

- Target workflow: Name the repeated job being changed, such as resolving a support request or preparing a sales follow-up.
- Target user behavior: Describe what a person should do differently. Adoption alone is weak; successful completion, accepted recommendations, or reduced rework is stronger.
- Business outcome: Connect the behavior to retention, expansion, qualified demand, service capacity, or another commercial result.
- Quality floor: Define the task-level evaluation the AI must pass before exposure expands.
- Guardrails: Name the safety, privacy, latency, reliability, and cost conditions that must remain acceptable.
- Decision rule: State what evidence will trigger a scale, fix, hold, or stop decision, and name the person accountable for making it.
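
One way to keep the contract honest is to store it as structured data and refuse to proceed while any field is blank. The sketch below is only an illustration: the class name, field names, and the choice to hold the decision owner as its own attribute are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OutcomeContract:
    # Hypothetical field names that mirror the six fields listed above.
    target_workflow: str
    target_user_behavior: str
    business_outcome: str
    quality_floor: str
    guardrails: tuple[str, ...]
    decision_rule: str
    decision_owner: str  # assumption: the accountable person is stored separately


def missing_fields(contract: OutcomeContract) -> list[str]:
    """Return the names of fields that are empty or contain only whitespace."""
    missing = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        if isinstance(value, str):
            if not value.strip():
                missing.append(f.name)
        elif not value:  # empty tuple of guardrails
            missing.append(f.name)
    return missing


draft = OutcomeContract(
    target_workflow="Resolve a support request",
    target_user_behavior="Agents accept the suggested reply without rework",
    business_outcome="Service capacity",
    quality_floor="Passes the offline task evaluation",
    guardrails=("privacy", "latency", "cost"),
    decision_rule="Scale if rework falls and guardrails hold",
    decision_owner="",  # not yet assigned
)
gaps = missing_fields(draft)
```

Here `gaps` names the unassigned decision owner, which is the kind of omission worth catching before a model is chosen.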

A driver tree makes the logic inspectable. Start with the business result, work backward to the customer behavior that can influence it, then identify the product and AI capabilities that can change that behavior. This prevents a model improvement from being mistaken for business progress.

The contract also gives empowered teams useful boundaries. Leaders align the portfolio around outcomes and constraints; teams retain room to change prompts, retrieval methods, interaction design, or even the proposed solution. That is the practical connection between AI strategy, continuous discovery, evaluation, delivery, and value capture.

Build one scorecard across four layers

AI outcome analytics is not a single north-star metric. It is a chain of evidence. If you measure only the beginning of the chain, you learn whether the system produced an answer. If you measure only the end, you may see revenue move without knowing why.

| Measurement layer | Question it answers | Useful examples | Typical decision |
| --- | --- | --- | --- |
| AI quality | Did the system perform the intended task? | Task success, groundedness, safety failures, response variance | Change prompts, context, retrieval, model, or fallback |
| User behavior | Did a person trust and use the result? | Acceptance, correction, abandonment, repeat use, human escalation | Change the interaction, explanation, or moment of assistance |
| Workflow outcome | Did the job become meaningfully better? | Successful completion, rework, cycle time, resolution quality | Expand, narrow, or redesign the workflow |
| Business and risk outcome | Did the change create durable value within constraints? | Retention, expansion, qualified leads, cost per successful outcome, incidents | Scale, repackage, hold, or stop |

Read the layers in order, from AI quality through to business and risk outcome. Good AI quality with weak behavior usually points to product design, trust, or workflow placement. Strong usage with no workflow improvement may indicate novelty rather than value. Workflow gains with poor economics mean the experience works but the architecture or packaging does not.
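
Reading the chain for its first break can be expressed in a few lines. This is a minimal sketch under stated assumptions: each layer has already been reduced to a pass or fail judgment, and the layer keys and hint wording are illustrative labels drawn from the paragraph above.

```python
from typing import Optional

# Assumed layer keys, in the order the evidence chain is read.
LAYERS = ("ai_quality", "user_behavior", "workflow_outcome", "business_outcome")

# Where to look when a given layer is the first one that fails.
HINTS = {
    "ai_quality": "prompts, context, retrieval, model, or fallback",
    "user_behavior": "product design, trust, or workflow placement",
    "workflow_outcome": "possible novelty rather than value",
    "business_outcome": "architecture or packaging economics",
}


def first_break(status: dict[str, bool]) -> Optional[str]:
    """Return the first failing layer, or None if the whole chain holds."""
    for layer in LAYERS:
        if not status[layer]:
            return layer
    return None


status = {
    "ai_quality": True,
    "user_behavior": True,
    "workflow_outcome": False,
    "business_outcome": False,
}
broken = first_break(status)
hint = HINTS[broken] if broken else "chain holds"
```

The function deliberately returns one layer instead of an averaged score, which matches the advice to look for breaks in the chain.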

Use the workflow attempt as the basic unit of analysis whenever possible. A generic session can contain several unrelated intentions. A workflow attempt lets you connect the user request, retrieved context, model and prompt version, response, correction, completion, and downstream result.

Persist the properties needed to reconstruct that journey. Customer segment, acquisition context, workflow type, entitlement, experiment group, model version, retrieval version, and human-handoff status often matter more than another page-view event. Carrying critical context across visits lets you trace behavior from early exploration to conversion and expansion instead of losing the causal story at signup.

Keep the event taxonomy small enough to govern. Instrument decisions and state changes, not every interface movement. For each event, document its owner, trigger, required properties, prohibited sensitive data, and validation method. A dashboard built on ambiguous events creates confidence without clarity.
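
A documented event can double as a validation rule. The sketch below assumes a hypothetical `EventSpec` record and property names; it checks one payload against the required and prohibited properties described above and is not tied to any particular analytics SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSpec:
    # Hypothetical record for one governed event.
    name: str
    owner: str
    trigger: str
    required: frozenset[str]
    prohibited: frozenset[str]  # sensitive properties that must never be sent


def validate_event(spec: EventSpec, payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means the payload is valid."""
    present = set(payload)
    problems = [f"missing required property: {p}" for p in sorted(spec.required - present)]
    problems += [f"prohibited property present: {p}" for p in sorted(spec.prohibited & present)]
    return problems


spec = EventSpec(
    name="workflow_attempt_completed",
    owner="support-product",
    trigger="User accepts or rejects the AI response",
    required=frozenset({"workflow_type", "model_version"}),
    prohibited=frozenset({"email"}),
)
problems = validate_event(spec, {"workflow_type": "billing_update", "email": "a@example.com"})
```

Running a check like this in the ingestion path or in tests is one possible validation method; the source does not prescribe a specific one.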

Run a weekly loop from evidence to decision

Analytics creates value only when it changes a decision. Give each AI initiative a recurring evidence review attended by the product trio and the engineering, data, risk, operations, or go-to-market partners needed for that workflow.

1) Check the contract. Reconfirm the target workflow, primary outcome, evaluation floor, and guardrails. If the goal has changed, update the contract before interpreting the data.
2) Inspect the scorecard. Review AI quality, behavior, workflow, business, risk, and cost in that order. Look for breaks in the chain rather than averaging them into one health score.
3) Segment the result. Compare the cohorts that could conceal a failure: new and experienced users, customer tiers, workflow types, channels, experiment groups, and system versions.
4) Review failure cases. Sample unsuccessful attempts and classify the reason: missing context, poor retrieval, incorrect generation, confusing interaction, policy restriction, latency, or a problem outside the AI system.
5) Make one portfolio decision. Choose scale, fix, hold, or stop. Record the evidence, owner, next test, and condition for revisiting the decision.

Do not let offline evaluations and online analytics compete. Offline evaluations test whether a candidate change can handle representative tasks and known edge cases. Online measures show whether the released experience changes real behavior under real conditions. A candidate should clear the evaluation floor before broader exposure, then earn expansion through customer and business evidence.

When you run an experiment, agree on the hypothesis, primary outcome, guardrails, minimum detectable effect, and stopping rule before looking at results. Feature flags and progressive rollout keep the decision reversible. If the result is ambiguous, improve the test or narrow the population; do not promote the most flattering proxy.

This rhythm makes learning rate operational. The useful question is not how many experiments ran. It is how many consequential uncertainties were resolved and converted into a product, portfolio, or go-to-market decision. Testable decisions, behavioral analytics, and guarded rollouts make speed credible because the evidence can survive scrutiny.

Assign decision rights before the dashboard turns red

AI products cross boundaries that ordinary feature teams can often ignore. Product owns the customer and business outcome. Engineering owns service behavior and remediation. Data or AI teams own evaluation integrity and model observability. Risk, security, legal, and operations own constraints that cannot be traded away informally.

Write those responsibilities into the operating model. For each risk tier, specify who can approve an initial release, expand exposure, pause the system, change a model or retrieval source, accept a temporary exception, and communicate an incident. A named decision owner is more useful than a large committee with shared accountability.

Governance should begin during discovery. The team can then choose acceptable data, design consent, build traceability, create fallbacks, and define escalation paths before those choices become expensive. Model cards, data records, evaluation results, release history, and incident decisions should form one audit trail rather than separate compliance paperwork.

The same principle applies to commercial decisions. Product, finance, sales, and customer success need a shared definition of value. Measure inference and support costs against successful workflow outcomes, not raw requests or tokens. Packaging can then reflect delivered value while protecting margins and avoiding incentives for wasteful usage.
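
The arithmetic is simple, but writing it down shows how the denominator changes the picture. The function name and the figures below are made up for illustration.

```python
from typing import Optional


def cost_per_successful_outcome(
    inference_cost: float, support_cost: float, successful_outcomes: int
) -> Optional[float]:
    """Total cost divided by successful workflow outcomes; None if there were none."""
    if successful_outcomes <= 0:
        return None
    return (inference_cost + support_cost) / successful_outcomes


# Illustrative numbers only: 2,000 requests, of which 250 ended in a successful outcome.
per_request = (900.0 + 100.0) / 2000
per_success = cost_per_successful_outcome(900.0, 100.0, 250)
```

With these example figures the cost per request looks eight times cheaper than the cost per successful outcome, which is why the raw-request view can flatter the economics.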

Use simple decision tests:

- Scale when the primary outcome improves, quality and safety floors hold, economics remain acceptable, and the result repeats in the intended cohorts.
- Fix when the chain reveals a local weakness, such as adequate AI quality but low acceptance, or strong adoption but excessive rework.
- Hold when the evidence is inconclusive, the measurement is unreliable, or a guardrail is close enough to its limit that broader exposure would create avoidable risk.
- Stop when only proxy metrics improve, the target workflow does not change, or the value depends on manual intervention that cannot be sustained.
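
The four tests can be written as a small rule so that the review records which condition fired. This is a sketch, not a policy: the field names are hypothetical, and the precedence (stop, then hold, then scale, otherwise fix) is an assumption your own operating model may order differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    # Hypothetical flags summarizing one evidence review.
    primary_outcome_improved: bool
    quality_and_safety_floors_hold: bool
    economics_acceptable: bool
    repeats_in_intended_cohorts: bool
    measurement_reliable: bool = True
    guardrail_near_limit: bool = False
    only_proxies_improved: bool = False
    depends_on_unsustainable_manual_work: bool = False


def portfolio_decision(e: Evidence) -> str:
    """Map a review's evidence to scale, fix, hold, or stop (assumed precedence)."""
    if e.only_proxies_improved or e.depends_on_unsustainable_manual_work:
        return "stop"
    if not e.measurement_reliable or e.guardrail_near_limit:
        return "hold"
    if (
        e.primary_outcome_improved
        and e.quality_and_safety_floors_hold
        and e.economics_acceptable
        and e.repeats_in_intended_cohorts
    ):
        return "scale"
    return "fix"  # some link in the chain is weak but measurable


decision = portfolio_decision(Evidence(True, True, True, True, guardrail_near_limit=True))
```

In this example a guardrail close to its limit produces a hold even though every scale condition is met.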

Use 90 days to prove the operating system

Your first 90 days should produce more than a working use case. They should leave behind a repeatable contract, event model, evaluation set, rollout path, governance record, and decision cadence that the next team can reuse.

- Weeks 1–2: choose the workflow. Audit available content and data, map the highest-value repeatable workflows, and select one where behavior and business impact can be observed. Write the outcome contract and assign decision rights.
- Weeks 3–4: define the evidence. Build the driver tree, establish the four-layer scorecard, create representative offline evaluations, classify risk, and document the release and stop conditions.
- Weeks 5–8: build and instrument. Create the retrieval and prompt baseline, capture lineage and version context, validate events, implement observability, and test graceful fallbacks. Rehearse how the team will diagnose a failed attempt.
- Weeks 9–12: release and learn. Ship behind a feature flag, begin with limited exposure, compare behavior and outcomes, inspect failure cohorts, and make explicit scale, fix, hold, or stop decisions.

At the end, ask for three forms of proof. Can the team explain which customer behavior changed? Can it connect that behavior to a workflow and business result without hand-waving? Can another team reuse the operating artifacts without rebuilding them from scratch?

If any answer is no, keep the rollout narrow and repair the system of learning. If all three are yes, fund the next workflow using the same operating model. The goal is not a larger collection of AI features. It is an organization that can turn uncertain AI capabilities into measurable outcomes, repeatedly and responsibly.


May 28, 2026

AI-Ready Customer Support: An Operating Model That Works
You may already have an AI agent, a vendor shortlist, or pressure to automate more tickets. But if your policies conflict, ownership is unclear, and agents routinely rely on tribal knowledge, adding AI will expose those weaknesses at customer speed.

The practical goal is not to make every ticket autonomous. It is to build a support operation in which AI can resolve the right issues, recognize when it lacks authority or information, and help your team improve the system after every failure.

Key takeaways
- Start with a bounded customer problem, not a general mandate to automate support.
- Treat knowledge as a controlled production input with owners, audience rules, and review triggers.
- Define acceptable outcomes, prohibited actions, and escalation conditions before configuring the agent.
- Preserve a middle path where a human can unblock the AI without taking over the entire conversation.
- Expand automation only when evaluation results, live outcomes, and operational ownership support it.
Start with the queue, not the model

An AI-ready operation begins with a resolvable job. “Handle customer support” is too broad. “Help authenticated customers update their billing details under the current policy” is something you can document, test, monitor, and constrain.

Choose an initial queue where demand is meaningful, the desired outcome is clear, and the governing policy is reasonably stable. Avoid starting with cases that depend on negotiation, undocumented exceptions, or several teams making judgment calls behind the scenes. Those cases may become suitable later, but they are poor places to learn basic operational control.

Review a representative slice of conversations from that queue. For each one, record the customer’s intent, the information required, the systems touched, the policy applied, the final outcome, and any human judgment that changed the path. This turns a pile of tickets into a resolution map.

Pay special attention to cases that look identical at first but require different actions. A refund request may depend on plan type, purchase date, account state, or a regulatory restriction. These branches are where a fluent answer can still be operationally wrong.

You also need to decide where AI will sit. In most established operations, the safer path is to work through the support systems, queues, and reporting practices your team already uses. Replacing the help desk and automating the work at the same time creates two migrations and makes failures harder to diagnose.

Turn knowledge into a controlled production input

Your help center is only one part of the answer set. Reliable support may also depend on internal runbooks, policy clarifications, troubleshooting steps, approved reply snippets, product limitations, escalation instructions, and information held by product or customer success teams.

Bring those materials into a governed knowledge inventory. Every record should answer seven operational questions:
May 28, 2026
How to Design a Dependable CLI Agent Users Can Trust
Your CLI agent can look impressive in a controlled demo and still feel unsafe in a real repository. The moment it can edit files, invoke tools, or use credentials, users need to understand what it will do before they let it proceed.

The dependable design is rarely the one with the most capabilities. It is the one with the smallest clear promise, predictable execution, visible controls, and evidence that it succeeds repeatedly.

Define the boundary before you define the features

Start by writing an operating contract for the agent. This is a product decision, not a prompt-writing exercise. A useful contract answers five questions:
- What job does the agent complete?
- Which resources and tools may it use?
- What must it never do?
- Which actions require explicit approval?
- What observable result counts as success?
Keep the job narrow enough to explain in one sentence. If the description needs a collection of exceptions, the interface is already carrying too much ambiguity. Split the work into a clearly named subcommand or make the advanced behavior opt-in.

Treat every flag, tool, and permission as an increase in blast radius. A new option does not merely add flexibility. It creates another state the agent can misunderstand, another path you must test, and another behavior the user must learn. Reducing the surface area can improve repeatability and trust because both the agent and the user have fewer possible paths to reason about.

When reviewing a proposed capability, ask whether it makes the mental model smaller. If it does not, remove it, defer it, or isolate it behind progressive disclosure. Safe, fast defaults should handle the common case without demanding that a new user understand the entire system.

Design one boring, observable execution path

A dependable run should feel like a transaction with recognizable stages. The model can help interpret intent, but it should not invent the execution contract as it goes.
- Capture intent: Ask only for information required to resolve the task. If a missing choice would materially change the result, stop and ask.
- Retrieve context: Fetch the smallest relevant set of files, facts, or records. More context can introduce conflicting instructions and distract the agent from the requested change.
- Show the plan: Present a compact description of the intended actions, affected targets, and likely side effects.
- Preview when useful: Provide a dry run for operations whose effects the user should inspect before execution.
- Execute through narrow tools: Give each tool a deterministic input and output contract. Reject malformed responses instead of guessing what they meant.
- Verify the result: Check the resulting state and tell the user what changed, what did not, and whether any step failed.
The agent should stop when the requested scope changes, required context is unavailable, or a tool returns an unexpected result. A visible stop is easier to recover from than confident improvisation.
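
Rejecting malformed tool output instead of guessing might look like the following. The response shape (a JSON object with `status` and `changed_files`) and the exception name are assumptions chosen for illustration; a real tool would declare its own contract.

```python
import json


class MalformedToolResponse(ValueError):
    """Raised when a tool's output does not match its declared contract."""


# Assumed contract for a hypothetical file-editing tool.
EXPECTED_FIELDS = {"status": str, "changed_files": list}
ALLOWED_STATUS = {"ok", "failed"}


def parse_tool_response(raw: str) -> dict:
    """Parse and validate a tool response; raise rather than repair bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedToolResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise MalformedToolResponse("top-level value must be a JSON object")
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise MalformedToolResponse(f"field {name!r} is missing or has the wrong type")
    if data["status"] not in ALLOWED_STATUS:
        raise MalformedToolResponse(f"unexpected status: {data['status']!r}")
    return data


result = parse_tool_response('{"status": "ok", "changed_files": ["README.md"]}')
```

The caller can treat `MalformedToolResponse` as one of the visible stop conditions described above.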

Favor idempotent operations wherever you can. Repeating an idempotent action produces the intended state without duplicating or compounding its effects. That property matters in a CLI because interrupted runs and retries are normal operating conditions. Test the second run as deliberately as the first.
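
As a small illustration of idempotency, the helper below adds a line to a file only if it is not already there, so a retry leaves the file unchanged. The function name and the `.gitignore` example are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the intended state, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


# Run the same operation twice, as an interrupted-and-retried session would.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".gitignore"
    first_run_changed = ensure_line(target, "build/")
    second_run_changed = ensure_line(target, "build/")
    content = target.read_text()
```

The second call reports no change and the file holds a single entry, which is the property worth asserting in a rerun test.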

Put human control at the blast-radius boundary

Do not ask for approval at every step. Constant prompts train users to approve without reading. Place confirmation gates where the consequence or scope changes.
- Read-only work: Make inspection and planning the default where possible.
- Scoped writes: Request access only to the specific project, service, or resource needed for the task.
- Destructive actions: Require a separate confirmation that names the target and explains the consequence.
- Credentials: Use narrowly scoped, time-bounded access rather than broad credentials that persist beyond the run.
- Expanded capability: Let users opt into advanced tools instead of quietly enabling them for every session.
A confirmation message should help the user make a decision. Replace a generic question such as “Continue?” with a concrete statement of what will be changed and whether it can be undone.
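
A confirmation gate of this kind can be sketched as a pure function over a planned action. The risk categories, field names, and the rule that only destructive or irreversible actions need confirmation are assumptions for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    SCOPED_WRITE = "scoped_write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class PlannedAction:
    verb: str        # e.g. "delete"
    target: str      # the named resource the user will see
    risk: Risk
    reversible: bool


def needs_confirmation(action: PlannedAction) -> bool:
    """Assumed policy: gate destructive or irreversible actions only."""
    return action.risk is Risk.DESTRUCTIVE or not action.reversible


def confirmation_prompt(action: PlannedAction) -> str:
    """State what will change and whether it can be undone."""
    undo = "This can be undone." if action.reversible else "This cannot be undone."
    return f"About to {action.verb} {action.target}. {undo} Type the target name to confirm."


drop = PlannedAction("delete", "branch 'release-old'", Risk.DESTRUCTIVE, reversible=False)
prompt = confirmation_prompt(drop)
```

Asking the user to type the target name is one possible design; it makes approving without reading harder than a yes/no question does.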

Reversibility should shape the underlying implementation as well. Prefer changes that can be represented as a patch, show the proposed difference before applying it, and preserve enough information to explain how to undo the operation. When reversal is impossible, make that fact visible before execution.

Use a simple review question for each workflow: can a user predict the maximum consequence of saying yes? If the answer is unclear, the permission boundary is too broad or the confirmation arrives too late.

Prove reliability before expanding the roadmap

Do not use capability count as the measure of progress. Before adding a feature, define the task it should complete, the success threshold it must meet, and the smallest interface needed to test it. This turns roadmap discussions into observable product decisions.

Evaluate at least three outcomes: task completion, time to first successful result, and stability when the same operation is run again. A capability that succeeds once but behaves differently on a retry is not ready merely because the first demonstration worked.
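
These three outcomes can be computed from a handful of recorded runs of the same task. The `Run` record and the definition of stability used here (every run succeeded and produced the same result digest) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    started_at: float    # seconds since the user first attempted the task
    duration: float      # seconds the run took
    succeeded: bool
    result_digest: str   # hash of the resulting state, empty if the run failed


def completion_rate(runs: list[Run]) -> float:
    return sum(r.succeeded for r in runs) / len(runs)


def time_to_first_success(runs: list[Run]) -> Optional[float]:
    """Elapsed seconds until the first successful run finished, or None."""
    for r in sorted(runs, key=lambda r: r.started_at):
        if r.succeeded:
            return r.started_at + r.duration
    return None


def rerun_stable(runs: list[Run]) -> bool:
    """Assumed definition: all runs succeeded and ended in the same state."""
    return all(r.succeeded for r in runs) and len({r.result_digest for r in runs}) == 1


runs = [
    Run(0.0, 4.0, False, ""),
    Run(5.0, 3.0, True, "abc"),
    Run(10.0, 3.0, True, "abc"),
]
summary = (completion_rate(runs), time_to_first_success(runs), rerun_stable(runs))
```

In this sample the task eventually works, but the failed first attempt means it does not yet count as stable under the definition chosen here.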

Instrument each run with Agent Analytics. Capture the input, tools selected, duration, outcome, and error pattern. Review those signals to find where the agent asks unnecessary questions, repeats tool calls, loses users, or encounters the same failure. The response may be a smaller prompt, a tighter tool contract, a safer default, or the removal of a confusing option.

Documentation belongs in this reliability loop. Keep runnable examples alongside the code and make them reflect the golden path. Treat any mismatch between documented behavior and actual behavior as a product defect. If the workflow cannot be explained and demonstrated simply, it is not yet a dependable workflow.

Use these evaluations as promotion gates. Add power only after the current path is measurable, understandable, and stable. That discipline earns you the right to expand without turning the CLI into a collection of loosely related agent behaviors.

Key takeaways
- Write the agent’s operating contract before choosing its tools or refining its prompt.
- Keep the default workflow narrow, safe, fast, and explainable in one sentence.
- Retrieve minimal context, show a compact plan, execute through deterministic contracts, and verify the result.
- Place explicit approval at destructive, irreversible, or scope-expanding boundaries.
- Measure completion, time to first success, and rerun stability before adding another capability.
- Use run telemetry and executable documentation to decide what to simplify next.
Choose one golden-path task and write its operating contract now. Then run it twice: once normally and once as a retry. Every surprise you find is a reliability requirement to resolve before you broaden the agent’s reach.

References
- Shivam.Consulting Blog — The Counterintuitive Playbook for CLI Agents: Why Ruthless Subtraction Beats Feature Creep
May 27, 2026
Analytics-Led Growth Engineering: A Practical Operating Model
Your team has dashboards, event data, and a backlog of growth ideas. Yet decisions still come down to whoever has the strongest opinion, and experiment results rarely change the roadmap.

The missing piece is usually not another analytics tool. It is an operating model that connects user behavior to a decision, a controlled release, and a measurable business result. Here is how to build one.

Start with a growth constraint, not a dashboard

Analytics-led growth begins with a constraint you want to remove. A broad instruction such as “improve onboarding” gives your team too much room to produce activity without progress. Frame the problem as a break in the user journey instead: qualified users reach the setup screen but fail to complete the action associated with first value.

Connect that problem to your North Star metric through a driver tree. If the North Star depends on retained active accounts, its drivers might include the number of activated accounts, how frequently they return, and how deeply they use the product. Each driver can then be decomposed into observable behaviors.

This prevents a common mistake: optimizing the easiest metric to move rather than the metric that matters. More tooltip clicks are not useful if they do not increase successful setup. Higher setup completion is still questionable if those users never return.

Before opening your analytics platform, write down four things: the user segment, the behavior that is breaking, the outcome it should influence, and the decision you will make if the signal changes. If you cannot name the decision, you are probably requesting a report rather than investigating a growth opportunity.

Build an evidence chain you can trust

A growth team needs to trace the path from exposure to durable value. That requires more than counting page views. Instrument the events that represent intent, progress, successful value delivery, and return behavior.

For every important event, define who triggered it, what object it affected, where it occurred, and whether it represents an attempt or a successful outcome. A generic event such as “integration clicked” cannot tell you whether the connection worked. Separate the attempt, completion, failure, and first successful use.

Then inspect the journey through three complementary views. Funnel analysis shows where users stop progressing. Cohorts reveal whether the problem is concentrated among particular acquisition channels, plans, roles, or use cases. Retention analysis tests whether an apparent activation gain survives after the initial session.
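
To make the funnel view concrete, here is a minimal sketch that counts how many users reached each step. The event names are hypothetical, and the logic is deliberately simplified: it ignores timestamps and only checks that a user has recorded every step up to that point.

```python
def funnel_counts(events: list[tuple[str, str]], steps: list[str]) -> list[int]:
    """Count users who have recorded every step up to and including each position."""
    seen: dict[str, set[str]] = {}
    for user_id, event_name in events:
        seen.setdefault(user_id, set()).add(event_name)
    counts = []
    for i in range(len(steps)):
        needed = set(steps[: i + 1])
        counts.append(sum(1 for names in seen.values() if needed <= names))
    return counts


# Hypothetical events that separate attempt, completion, failure, and first use.
events = [
    ("u1", "integration_attempted"),
    ("u1", "integration_completed"),
    ("u1", "integration_first_used"),
    ("u2", "integration_attempted"),
    ("u2", "integration_completed"),
    ("u3", "integration_attempted"),
    ("u3", "integration_failed"),
]
steps = ["integration_attempted", "integration_completed", "integration_first_used"]
counts = funnel_counts(events, steps)
```

Because failure is its own event, the drop between the first two steps can be split into users who failed and users who simply left, which a single click event could not show.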

Behavior alone will not explain motivation. Pair the quantitative signal with customer interviews, support conversations, or session-level evidence. If a funnel shows that users abandon a configuration step, qualitative evidence can distinguish confusing language from missing permissions, weak intent, or a technical failure.

Treat instrumentation defects as product defects. An event that fires twice, changes meaning, or omits a critical property can send engineering effort toward the wrong problem. Assign an owner to each decision-critical event and verify it across the full journey before using it to approve a rollout. Reliable behavioral analytics, cohorting, and funnel analysis are the foundation of this operating model, not a reporting layer added after release.

Turn every growth idea into an experiment contract

An experiment should begin with a falsifiable claim. Use this structure: for a defined user segment, changing a specific part of the experience should change a target behavior because it removes an identified barrier.

Complete the contract before implementation. Name the primary success metric, the guardrails that must not deteriorate, the expected direction of change, and the minimum detectable effect. The MDE forces a useful product decision: what is the smallest improvement that would justify shipping and maintaining this change?

Power considerations belong in planning, not in the explanation written after results arrive. If the eligible audience cannot produce a credible read on the effect that matters, change the experiment. You can target a higher-signal segment, test a stronger intervention, choose a more responsive leading indicator, or treat the release as a qualitative learning exercise rather than claiming a statistical win.
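
A rough planning check can show whether the eligible audience is large enough. The sketch below uses the standard normal-approximation formula for comparing two proportions; the defaults (5% two-sided significance, 80% power) and the example baseline are assumptions, and your experimentation platform may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float, mde_abs: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate users needed per arm to detect an absolute change of `mde_abs`."""
    p1, p2 = baseline, baseline + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be strictly between 0 and 1, with a nonzero MDE")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Illustrative: 20% baseline setup completion, smallest effect worth shipping is +2 points.
needed = sample_size_per_arm(0.20, 0.02)
needed_for_larger_effect = sample_size_per_arm(0.20, 0.04)
```

Doubling the effect you need to detect cuts the required sample to roughly a quarter, which is why testing a stronger intervention is a legitimate response to a small audience.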

Pre-commit to the decision rules as well. A positive primary metric with damaged guardrails should not become an automatic launch. A neutral result can still eliminate a weak theory. A surprising segment difference should become a new hypothesis, not an invitation to search repeatedly for a favorable slice of the data.

This discipline changes backlog quality. Ideas compete on the strength of their evidence, the importance of the driver they address, and the clarity of the learning they can produce. The roadmap becomes a portfolio of testable growth mechanisms rather than a list of requested features.

Use staged releases to separate learning from risk

Feature flags let you control exposure without tying every decision to a new deployment. Start with internal validation, expose the change to an eligible cohort, watch technical and user guardrails, and widen access only when the evidence supports it.
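
Staged exposure depends on assigning each user to a stable bucket so that widening the rollout only adds users and never reshuffles them. The following is a generic sketch using a hash of the flag name and user ID; it does not reflect any specific feature-flag product, and the names are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Return True if the user falls inside the first `percent` of buckets (0 to 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


users = [f"user-{i}" for i in range(1000)]
exposed_at_10 = {u for u in users if in_rollout(u, "guided-setup", 10)}
exposed_at_50 = {u for u in users if in_rollout(u, "guided-setup", 50)}
```

Including the flag name in the hash keeps one experiment's exposed cohort from lining up with another's, and the same user always gets the same answer for a given percentage.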

Keep three decisions distinct. The first is whether the change works as designed. The second is whether it improves the intended user behavior. The third is whether that behavior produces a lasting outcome. Passing the first decision does not answer the other two.

Onboarding illustrates the difference. A clearer tooltip may increase interaction with a setup control. An in-app guide may increase completion of the setup flow. Neither result proves that users reached value or formed a durable habit. Follow the exposed cohort through the activation event and into retention before declaring the intervention successful.

Small, reversible changes are especially useful here. Progressive disclosure, revised UX writing, a better default, or guidance at a predictable stall point can isolate a mechanism more clearly than a full onboarding redesign. When several elements change together, you may see movement without learning what caused it.

Make the product trio accountable for learning

Growth engineering is not an analytics team handing insights to a delivery team. Product, engineering, and design should jointly own the hypothesis, the intervention, the instrumentation, and the interpretation.

Product connects the opportunity to the growth model and defines the decision. Design identifies the user friction and shapes the smallest credible intervention. Engineering validates event behavior, controls exposure, and protects reliability. All three inspect the outcome together.

Close each experiment with a short decision record. Capture what you believed, what changed, which users were exposed, what happened to the primary metric and guardrails, what you decided, and which assumption changed. Record neutral and negative results as carefully as wins. Otherwise, old ideas return with new wording and consume another cycle.

Leaders should review the quality of this learning system, not just the number of tests shipped. Notice whether teams are testing consequential hypotheses, whether events remain trustworthy, whether results lead to explicit decisions, and whether short-term activation gains are being checked against retention. Experiment volume without decision quality is another output metric.

Key takeaways
- Define the broken user behavior and the decision it affects before opening a dashboard.
- Connect activation, depth, and frequency to your North Star through a driver tree.
- Specify the hypothesis, primary metric, guardrails, MDE, and decision rules before implementation.
- Use feature flags and staged exposure to manage risk while preserving a valid learning loop.
- Validate leading indicators against retention, and store every result in a reusable decision record.
Choose one important journey this week and trace it from first intent to retained value. If the events, ownership, or decision rules break anywhere along that path, fix that link before adding another growth experiment. Compounding growth begins with compounding clarity.

References
- Amplitude — Inside Growth Engineering at Amplitude: My Playbook to Accelerate Product-Led Growth with Analytics
May 27, 2026
Speed-to-Lead Is Dead: How AI Agents End the Wait and Rebuild a High-Velocity Sales Org

A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.

We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, and forty-eight hours for everyone else. Those were considered good numbers.

No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.

An AI Agent closes it completely.

When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.

Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.

Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.

What AI changes

Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.

The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.

The constraint the industry built itself around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.

If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.


SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.

Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.

You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:

1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.

2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.

3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.

These three questions reveal whether the motion is working. Time-to-first-response can’t.
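As a rough illustration of the three questions, here is a minimal Python sketch that turns a list of Agent-handled conversations into a scorecard. The `Conversation` record, its field names, and the sample values are hypothetical, and the choice to compute each rate over all conversations is an assumption; map them to whatever your own reporting exports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    """One Agent-handled conversation (hypothetical record shape)."""
    completed: bool              # reached a clear outcome rather than being abandoned
    qualified: bool              # the Agent judged the prospect a fit
    contact_captured: bool       # an email address or phone number was collected
    meeting_booked: bool
    pipeline_value: float        # pipeline created by this conversation, 0.0 if none
    csat: Optional[int] = None   # 1-5 rating; None when the buyer skipped the survey


def share(flags):
    """Fraction of True values, or 0.0 when there are no conversations."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def scorecard(conversations):
    """Summarise Agent work, pipeline, and buyer experience in one dict."""
    ratings = [c.csat for c in conversations if c.csat is not None]
    return {
        # 1) Is the Agent doing the work?
        "completion_rate": share(c.completed for c in conversations),
        "qualification_rate": share(c.qualified for c in conversations),
        "contact_capture_rate": share(c.contact_captured for c in conversations),
        # 2) Is the work producing pipeline?
        "meetings_booked": sum(c.meeting_booked for c in conversations),
        "pipeline_created": sum(c.pipeline_value for c in conversations),
        # 3) Are buyers having a good experience?
        "avg_csat": sum(ratings) / len(ratings) if ratings else None,
    }


# Made-up sample data, for illustration only.
sample = [
    Conversation(True, True, True, True, 20000.0, 5),
    Conversation(True, False, True, False, 0.0, 4),
    Conversation(False, False, False, False, 0.0, None),
    Conversation(True, True, True, False, 0.0, 3),
]
card = scorecard(sample)
print(card)
```

Time-to-first-response is deliberately absent from the output: when every first response is instant, it no longer separates a working motion from a failing one.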

Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.

An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.

The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.

If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.

Inspired by this post on The Intercom Blog.

May 26, 2026

A Product Leader’s Playbook for Humane, Sustainable Growth

Your growth dashboard can be green while your product is becoming less valuable to the people who use it. Activation rises. Engagement deepens. Revenue follows. Yet customers feel pressured, workers absorb hidden costs, or automation removes the human contact that made the experience trustworthy.

You don’t have to choose between humane technology and commercial performance. You do need an operating model that treats human outcomes as product outcomes, exposes harmful trade-offs early, and rewards durable value rather than extraction.

Start with the harm your growth model could create

Most growth models describe the path from acquisition to revenue. A humane growth model also describes who could be worse off if that path succeeds.

Map the product’s intended value first: the problem a person wants to solve, the moment they receive a useful result, and the reason they would return. Then examine the same journey from the perspective of people who may not appear in your analytics. That can include a customer’s employees, contractors who deliver the service, family members affected by the product, local businesses, or people excluded by the design.

Create an impact ledger for the growth surface you are reviewing. Keep it beside the business case, not in a separate ethics document that nobody consults during prioritization.

Impact area	Question to answer	Signal to monitor
User agency	Can people understand the choice, refuse it, reverse it, and leave?	Overrides, cancellations, reversals, and interview evidence
Well-being	Does additional use help people finish their intended task, or merely keep them present?	Successful outcomes, passive time, and expressions of regret
Economic fairness	Who captures the value, and who absorbs the labor, risk, or cost?	Complaints, payout concerns, and changes in burden across participants
Human connection	Does the experience strengthen useful relationships or replace them unnecessarily?	Human handoffs and feedback from affected communities
Trust and safety	Do people know when automation is involved and what happens to their data?	Escalations, corrections, safety reports, and trust feedback

The ledger is not an attempt to predict every consequence. It is a way to make foreseeable trade-offs visible before a team becomes committed to a launch. This matters commercially as well as ethically: extractive growth can weaken trust and retention while increasing regulatory and reputational exposure.

Pair every growth metric with a human countermetric

A metric becomes dangerous when the team can improve it while making the customer’s life worse. Engagement is the familiar example. More time in a product may indicate value, confusion, dependency, or difficulty leaving. The number alone cannot tell you which.

Give each primary growth metric a countermetric that protects the outcome you actually intend. The pair should appear in the same experiment brief and the same review meeting.

Growth metric	Human countermetric	Decision it improves
Activation	Completion of the customer’s intended outcome	Whether setup creates value or only reaches an internal milestone
Engagement	Intentional task completion	Whether additional use is productive or merely prolonged
Retention	Trust, voluntary continuation, and ease of exit	Whether customers stay because the product remains useful
Conversion	Comprehension of price, consent, and commitment	Whether revenue depends on informed choice
Automation rate	Correction, reversal, and human-escalation success	Whether efficiency survives real-world exceptions

Do not combine the pair into a single score too quickly. A blended score can conceal the exact trade-off leaders need to see. Review both trends and ask whether the business result would still be desirable if the countermetric deteriorated further.

Set the stopping condition before running an experiment. Decide which trust, safety, fairness, or agency signal would block rollout even if the primary metric improves. A guardrail invented after seeing strong conversion is rarely a real guardrail.
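One way to make the stopping condition concrete is to record it as data before the experiment runs and let a small function apply it afterwards. The Python sketch below is a simplified illustration: the `Guardrail` class, the metric names, and the thresholds are hypothetical, and it compares raw rates without any statistical significance test, which a real experiment review would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """A countermetric limit written down before the experiment starts."""
    countermetric: str
    max_relative_drop: float  # 0.05 means "block if it falls by more than 5%"


def relative_change(control, treatment):
    """Change of treatment versus control, as a fraction of control."""
    return (treatment - control) / control


def rollout_decision(primary, countermetrics, guardrails):
    """Return (decision, names of breached guardrails).

    `primary` and each countermetric value are (control, treatment) pairs.
    A breached guardrail blocks rollout even when the primary metric improves.
    """
    breached = [
        g.countermetric
        for g in guardrails
        if relative_change(*countermetrics[g.countermetric]) < -g.max_relative_drop
    ]
    if breached:
        return "block", breached
    if relative_change(*primary) > 0:
        return "expand", []
    return "hold", []


# Hypothetical pairing: conversion with comprehension of price as its countermetric.
guardrails = [Guardrail("price_comprehension", max_relative_drop=0.05)]
decision = rollout_decision(
    primary=(0.10, 0.12),                                   # conversion improved
    countermetrics={"price_comprehension": (0.80, 0.70)},   # comprehension fell 12.5%
    guardrails=guardrails,
)
print(decision)
```

Because the guardrail is checked before the primary metric is even read, a strong conversion result cannot talk the team out of a limit it set in advance.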

Expand discovery beyond the people who already love the product

Power users are good at explaining how to improve the experience they have accepted. They are less able to represent people who abandoned it, avoided it, could not access it, or carry costs without being the buyer.

Add an outside-in lane to continuous discovery. Include customers who reduced usage or left, people who encountered a failed automation, front-line workers affected by the workflow, and community members who experience consequences without controlling the purchase. Treat these conversations as product discovery, not public relations.

Ask questions that reveal displacement and dependency: What became easier? What became harder? What did this replace? When did you feel unable to make a meaningful choice? Who else had to change their behavior so you could receive the benefit? What would a responsible version of this experience preserve?

Bring the evidence into roadmap decisions in its original shape. A complaint about loss of control should not be translated into a generic request for better usability. A contractor describing unfair risk is not reporting a minor service defect. Name the underlying impact so the team can address the product model rather than polish its interface.

Put humane constraints inside the experiment

Principles have little effect if they enter the process after pricing, interaction design, and technical architecture are settled. Put them into the experiment before the team writes production code.

- State the human outcome. Describe what should become better in the person’s life or work, not merely what behavior should increase.
- Name the affected groups. Include non-users who supply labor, absorb risk, or experience downstream effects.
- Define meaningful choice. Specify how people will understand automation, decline it, correct it, and reverse important actions.
- Design the failure path. Decide how a person reaches human help when the system is uncertain, unsafe, or wrong.
- Pre-commit to a stopping rule. Record which negative signal pauses expansion regardless of the growth result.

For AI products, this is where risk management becomes part of product management. Give users enough information to understand when AI is acting. Preserve review for consequential outputs. Build correction and escalation into the main workflow. Apply privacy-by-design while deciding what data the product needs, rather than after collecting everything that might be useful.

The product trio should own these decisions. Legal, security, trust, and policy partners can strengthen the work, but they cannot compensate for a roadmap whose incentives reward harm. The product leader remains accountable for the whole system being optimized.

Choose durable depth over indiscriminate scale

Scale is not proof of value. It is an amplifier. If the operating model depends on weak consent, hidden costs, unfair labor, or the removal of every human interaction, scale magnifies those weaknesses.

A narrower product can create a stronger business when the team understands a community deeply enough to solve its full problem. A locally focused mobility service, for example, could optimize for rider safety, driver economics, and neighborhood usefulness rather than treating every participant as an interchangeable unit of supply or demand. The market is smaller by design, but the value proposition can be clearer and trust can become part of the product’s advantage.

Test the durability of your strategy with a simple question: if customers become better informed and cultural expectations become stricter, does the growth model become stronger or weaker? Product leaders should expect social norms to change, sometimes in direct opposition to the adoption assumptions embedded in a forecast. A group of German primary-school parents, for example, collectively chose to delay smartphones until age 11 or 12.

At the next roadmap review, challenge any initiative that needs customers to misunderstand a choice, remain dependent, or accept worsening treatment as the company grows. If removing that mechanism destroys the economics, you have found a strategy problem, not an optimization problem.

Key takeaways

- Document who could be harmed by a successful growth initiative, including people who never appear in the customer database.
- Pair activation, engagement, retention, conversion, and automation metrics with measures of outcomes, agency, trust, and recovery.
- Include former users, affected workers, and non-buyers in continuous discovery.
- Define consent, correction, escalation, and stopping conditions before launching an experiment.
- Prefer a focused market with durable value over scale that depends on hidden human costs.

Start with the growth initiative carrying the greatest human risk. Add its impact ledger and countermetric to the next decision meeting, assign an owner, and make expansion conditional on both business value and human value holding up.

References

Shivam.Consulting Blog — Is Technology Still Net Positive? A Product Leader’s Reckoning and Playbook for Humane Growth

May 26, 2026

Prompt Engineering for Amplitude Global Agent That Holds Up

You ask Amplitude Global Agent why activation fell. It returns a plausible explanation, but you still can’t tell which events it examined, whether the comparison was valid, or what your product team should do next.

The fix is to treat the prompt as an analysis specification. Define the decision, provide the relevant analytics context, constrain unsupported conclusions, and make the agent show its work. You will get an answer that is easier to verify and more useful in a product review.

Start with the decision, not a broad request for insights

Requests such as “analyze activation” leave several decisions unresolved. The agent must guess what activation means, which users belong in the analysis, which period matters, and what kind of answer you expect. Even a polished response may answer the wrong question.

Before writing the prompt, complete this sentence: “After reading the answer, we need to decide whether to…” Your ending might be “change the onboarding sequence,” “investigate a recent release,” or “prioritize one segment for discovery.” That decision gives the analysis a destination.

Then assign a role that matches the work. “You are a product analyst investigating activation performance” is more useful than “You are a helpful assistant.” Add the audience as well. An executive needs the size and business relevance of a change; a product trio also needs the affected steps, segments, and follow-up questions.

A strong opening contains three elements:

- Role: the analytical perspective the agent should take.
- Decision: what the team will choose or investigate after reading the result.
- Success criteria: what the answer must establish before it is useful.

For example: “You are a product analyst helping the onboarding team decide whether to redesign a weak activation step. Identify the largest meaningful drop-off, show which defined segment is most affected, and separate measured findings from possible explanations.”

Give the agent a compact analytics contract

The most reliable prompt names the data the agent may use. Include the relevant event names, property names, segment definitions, filters, and timeframe. If activation has an internal definition, write it out rather than relying on the agent to infer it.

This is a retrieval-first approach: put authoritative definitions, dashboard context, and prior query logic into the request before asking for interpretation. Concrete grounding reduces room for invented assumptions and makes repeated analyses easier to compare. A structured prompt can also specify the role, business objective, allowed data, and output fields.

Prompt element	What to provide	What it prevents
Metric definition	The exact event sequence or outcome that counts	A different interpretation of activation or retention
Population	Included users or accounts and explicit exclusions	Comparisons across unlike populations
Segments	Named properties and the values to compare	Arbitrary segmentation
Timeframe	The analysis period and comparison period	Hidden or inconsistent date choices
Evidence boundary	The events, properties, definitions, and dashboards allowed	Unsupported claims presented as measured facts
Output contract	Required sections, fields, ordering, and length	A long narrative that cannot be reviewed quickly

Do not dump every available definition into the context. Include only what the question requires. More context is useful when it resolves ambiguity; irrelevant context competes for attention and makes the prompt harder for a teammate to audit.

Use a reusable prompt that exposes uncertainty

You can adapt the following structure for activation, retention, anomaly investigation, or another behavioral analysis:

- Role and audience: “Act as a product analyst. Write for the product manager and analytics lead responsible for [area].”
- Decision: “Help us decide whether to [decision].”
- Question: “Determine [specific analytical question].”
- Definitions: “For this analysis, [metric] means [explicit event or outcome definition].”
- Data context: “Use these events: [names]. Use these properties: [names]. Compare these segments: [definitions]. Analyze [timeframe] against [comparison period]. Apply [filters and exclusions].”
- Constraints: “Use only the supplied Amplitude analytics events, properties, and definitions. Do not treat an unmeasured explanation as a finding.”
- Output: “Return the metric result, segment comparison, timeframe, evidence, interpretation, confidence or limitation, and recommended next check.”
- Fallback: “If the available data cannot answer the question, state what is missing and provide the smallest follow-up query needed.”
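
If your team keeps this structure in a shared script, a small helper can refuse to produce a prompt while any bracketed input is still empty. The Python sketch below is one possible way to do that; the placeholder names and the sample event, property, and segment values are hypothetical, and the output is plain text to paste into the agent, not a call to any Amplitude API.

```python
from string import Formatter

TEMPLATE = """\
Role and audience: Act as a product analyst. Write for the product manager and analytics lead responsible for {area}.
Decision: Help us decide whether to {decision}.
Question: Determine {question}.
Definitions: For this analysis, {metric} means {metric_definition}.
Data context: Use these events: {events}. Use these properties: {properties}. Compare these segments: {segments}. Analyze {timeframe} against {comparison_period}. Apply {filters}.
Constraints: Use only the supplied Amplitude analytics events, properties, and definitions. Do not treat an unmeasured explanation as a finding.
Output: Return the metric result, segment comparison, timeframe, evidence, interpretation, confidence or limitation, and recommended next check.
Fallback: If the available data cannot answer the question, state what is missing and provide the smallest follow-up query needed.
"""

# Derive the required inputs from the template so the two cannot drift apart.
REQUIRED_FIELDS = [name for _, name, _, _ in Formatter().parse(TEMPLATE) if name]


def build_prompt(context):
    """Fill the template, or fail loudly when a required input is missing or blank."""
    missing = [f for f in REQUIRED_FIELDS if not str(context.get(f, "")).strip()]
    if missing:
        raise ValueError("Missing prompt inputs: " + ", ".join(missing))
    return TEMPLATE.format(**context)


# Hypothetical inputs; replace every value with your own definitions.
example_context = {
    "area": "onboarding",
    "decision": "redesign the workspace setup step",
    "question": "which activation step has the largest drop-off",
    "metric": "activation",
    "metric_definition": "completing signup_completed and then project_created within 7 days",
    "events": "signup_completed, project_created",
    "properties": "platform, plan_type",
    "segments": "platform = web versus platform = mobile",
    "timeframe": "the last 28 days",
    "comparison_period": "the previous 28 days",
    "filters": "exclude internal test accounts",
}
print(build_prompt(example_context))
```

Failing on a blank field mirrors the fallback instruction: a teammate learns which definition is missing before the agent has a chance to guess it.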

The fallback matters. Without it, the agent has an incentive to complete the requested narrative even when the evidence is incomplete. A useful failure is specific: it identifies a missing event, undefined property, absent comparison, or ambiguous metric. Your team can fix that. A confident guess is harder to detect.

Ask for measured findings, interpretations, and recommendations as separate fields. A measured drop-off is evidence. A claim that users were confused is an interpretation unless the supplied data establishes it. A recommendation to inspect session replay or conduct customer interviews is a next step, not proof of the cause. Keeping those layers separate makes the result safer to use in prioritization.

Turn prompt quality into a small product evaluation

Do not judge a prompt by whether one response sounds intelligent. Save the prompt version, input context, and output. Then test it against a question whose answer your team already knows. This gives you a reference point for accuracy before you use the template on an ambiguous problem.

Score each version on three dimensions:

- Accuracy: Did the answer use the supplied definitions, filters, segments, and timeframe correctly?
- Clarity: Can a reviewer distinguish evidence, interpretation, limitations, and next steps?
- Actionability: Does the result support the stated decision or name the next query required?

Change one meaningful element at a time. You might compare a broad objective with a decision-specific objective, a narrative response with a fixed output contract, or an unrestricted answer with an explicit evidence boundary. Run the same test question through each variant. Otherwise, you will not know which change improved the result.
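
To keep that comparison honest, record reviewer scores per dimension and look at the difference between two versions instead of a single blended number. The Python sketch below assumes a 1 to 5 reviewer scale and uses invented scores; both the scale and the version names are hypothetical.

```python
DIMENSIONS = ("accuracy", "clarity", "actionability")


def validate(scores):
    """Require a 1-5 score for every dimension (raises KeyError if one is absent)."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1-5, got {scores[dim]}")


def score_deltas(baseline, variant):
    """Per-dimension change from the baseline prompt to the variant."""
    validate(baseline)
    validate(variant)
    return {dim: variant[dim] - baseline[dim] for dim in DIMENSIONS}


# Same test question, one changed element: a decision-specific objective.
broad_objective = {"accuracy": 2, "clarity": 3, "actionability": 2}
decision_specific = {"accuracy": 4, "clarity": 3, "actionability": 4}

deltas = score_deltas(broad_objective, decision_specific)
print(deltas)
```

A delta of zero on one dimension is useful information too: here the changed objective did nothing for clarity, which points to the output contract as the next element to test.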

Commit to two or three prompt iterations for one critical workflow. Review the failures, tighten the ambiguous instruction, and keep the better-performing version. Within a sprint, that process can produce a reusable template for a recurring analysis such as activation, retention, or anomaly detection.

Store winning prompts with their required inputs and known limitations. A template without those notes becomes a cargo cult: teammates copy the wording but omit the definitions that made it work. Treat the prompt, context requirements, evaluation question, and scoring criteria as one asset.

Key takeaways

- State the product decision before requesting analysis.
- Define the metric, population, segments, filters, and timeframe explicitly.
- Restrict conclusions to the analytics evidence you supplied.
- Separate measured findings from interpretations and recommended actions.
- Require a specific fallback when the data is insufficient.
- Version and score prompts for accuracy, clarity, and actionability.

Start with the recurring Amplitude question that currently creates the most debate. Write its decision, definitions, evidence boundary, and output contract. Run two or three scored iterations, then give the winning template to another product manager. If they can obtain a defensible answer without you translating the prompt, it is ready to become part of the team’s operating system.

References

Amplitude — Prompt Like a Pro: Three Battle-Tested Tips for Amplitude Global Agent Success

May 26, 2026

Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

How does it handle your real-world setup?

Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
- Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
- Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
- Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
- Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
- Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
- Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.
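
The rephrasing scenario in the list above is easy to automate in a rough form. The Python sketch below sends several phrasings of one question to an agent and reports which ones lose the expected fact. Everything in it is hypothetical: `ask_agent` stands for whatever function calls the Agent under test, `stub_agent` is a toy stand-in so the example runs on its own, and the substring check is a crude proxy for a proper answer review.

```python
from typing import Callable, List


def paraphrase_failures(
    ask_agent: Callable[[str], str],
    phrasings: List[str],
    must_contain: str,
) -> List[str]:
    """Return the phrasings whose answer is missing the expected fact."""
    expected = must_contain.lower()
    return [p for p in phrasings if expected not in ask_agent(p).lower()]


def stub_agent(question: str) -> str:
    """Toy stand-in for a real Agent: it only recognises the word 'refund'."""
    if "refund" in question.lower():
        return "Refunds are available within 30 days of purchase."
    return "Sorry, I'm not sure about that."


# Made-up phrasings of one question, including a typo and an indirect wording.
phrasings = [
    "What is your refund policy?",
    "can i get my money back",
    "refnd policy??",
]
failures = paraphrase_failures(stub_agent, phrasings, must_contain="30 days")
print(failures)
```

Run against a real Agent, a non-empty result is a prompt to read those transcripts by hand, since a missing substring can also mean the answer was correct but worded differently.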

What does it feel like to interact with the Agent?

Two AI Agents can post the same quantitative scores—resolution rates, containment rate, and more—and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
- Is the tone natural and on-brand, or does it feel robotic and generic?
- Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
- When it doesn’t know the answer, does it handle that gracefully?
- When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it may be the most important one. A functional demo shows that an Agent works today; it does not show whether you can keep improving the customer experience over time. You’re buying a system that must get better every week, not just during the first sprint.

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
- How does customer feedback influence the product roadmap, and can they show you examples?
- If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
- What kind of support will you get post-launch?
- Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.

Inspired by this post on The Intercom Blog.
May 22, 2026

