Tag: Agent Analytics

How to Operate AI Customer Agents as a Reliable CX System

AI customer agents are expanding from answering routine questions toward handling complex workflows and potentially supporting more of the customer lifecycle. The operational challenge is no longer simply whether an agent can produce a plausible answer. It is whether the organization can keep that agent accurate, controlled, measurable, and ready whenever the business changes.

Taken together, the source reports point to a practical operating model: connect product releases to knowledge updates, test behavior before exposure, measure the full interaction rather than a narrow survey sample, and assign people to improve the system continuously. That turns an AI agent from a channel feature into managed CX infrastructure.

Key takeaways
- Agent reliability depends on a continuous train, test, deploy, and analyze cycle, not a one-time implementation.
- A product release is not operationally complete until the agent has current, unambiguous, and retrievable information about it.
- Pre-release evaluation should test realistic customer questions, policy conditions, system actions, and required human handoffs.
- Survey metrics remain useful, but conversation-level analysis provides broader visibility into answer quality, effort, sentiment, and recurring friction.
- Human roles increasingly shift toward knowledge stewardship, exception handling, policy design, evaluation, and cross-functional CX improvement.

Treat the agent as a product system, not a chatbot

The Pioneer 2025 report describes Fin 3 through four operating stages: training, testing, deployment, and analysis. It reports that Procedures combines natural-language instructions with deterministic controls for complex work, while Simulations is intended to test behavior before customers encounter it. The report also describes deployment across additional channels, including Slack and Discord, improvements to Voice, and analytics features such as CX Score Reasons and Topic Trends.

These are vendor-reported capabilities, but the underlying operating principle applies beyond one platform. An agent that can act in business systems needs more than fluent language generation. It needs explicit procedures, boundaries on what it may do, test cases that expose failure modes, controlled channel deployment, and evidence showing what happened after release.

The same report presents a longer-term Customer Agent vision built around roles, goals, persistent memory, business knowledge, and interoperability. That vision should be distinguished from currently reported product functionality. It nevertheless clarifies the governance challenge: as an agent gains continuity and operational reach, errors can travel across more stages of the customer journey. Ownership of objectives, data, permissions, escalation, and measurement therefore becomes part of CX design.

This also changes how success should be framed. Resolution volume is an operational output, but a dependable CX system must also answer whether the agent followed policy, used current knowledge, completed the intended action, recognized an exception, and left the customer with an acceptable amount of effort. Automation without those checks can move work while concealing deterioration in the experience.

Move agent readiness into the product release process

The NPI playbook focuses on a common source of agent failure: products change faster than their supporting knowledge. When a feature launches without usable documentation, the source reports that the agent may hand conversations to people just as launch-related volume rises. The resulting backlog is therefore not only a support problem; it is a release-readiness problem.

A stronger definition of done includes agent readiness. The NPI source recommends bringing support or knowledge specialists into product walkthroughs, product marketing kick-offs, and pre-release testing. It also calls for a named owner, whether an NPI manager, knowledge manager, support lead, or product operations owner. The title can vary, but accountability cannot be distributed so widely that nobody verifies readiness.

The required knowledge must be designed for retrieval as well as human reading. According to the source, documentation should include both internal feature names and the phrases customers actually use, expand acronyms, state plan and availability conditions explicitly, and reproduce the substance of screenshots or videos in text. This is important because information can be technically present yet remain difficult for an agent to retrieve or apply correctly.

Release work must also remove knowledge that a launch has invalidated. Searching related articles, macros, notes, and workflows can reveal stale or contradictory guidance. Duplicate content deserves particular attention: competing versions of an answer can create inconsistent agent behavior even when the newest article is accurate.

Testing then connects knowledge preparation to customer outcomes. The NPI playbook recommends assembling likely questions from launch content, beta feedback, and early support conversations; running them in the environment customers will use; rating the answers; correcting the underlying content or structure; and repeating the evaluation. Conditions such as phased rollout, plan eligibility, regional availability, and mandatory human escalation require explicit coverage rather than an assumption that the agent will infer the right behavior.
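
The test cycle described above can be sketched as a small evaluation harness. This is a minimal illustration, not the playbook's method: the `ask_agent` callable, the `EvalCase` fields (`must_mention`, `must_escalate`), and the stand-in `fake_agent` are all hypothetical names, and simple substring matching stands in for whatever rating approach a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    question: str
    must_mention: list[str] = field(default_factory=list)  # facts the answer needs
    must_escalate: bool = False  # True when policy requires a human handoff


@dataclass
class AgentReply:
    text: str
    escalated: bool


def run_eval(cases: list[EvalCase], ask_agent: Callable[[str], AgentReply]) -> list[dict]:
    """Run each case and record why it failed, so content can be fixed and re-tested."""
    results = []
    for case in cases:
        reply = ask_agent(case.question)
        missing = [m for m in case.must_mention if m.lower() not in reply.text.lower()]
        handoff_ok = reply.escalated == case.must_escalate
        results.append({
            "question": case.question,
            "passed": not missing and handoff_ok,
            "missing": missing,
            "handoff_ok": handoff_ok,
        })
    return results


# Hypothetical stand-in for the real agent in its customer-facing environment.
def fake_agent(question: str) -> AgentReply:
    if "refund" in question.lower():
        return AgentReply(text="I will connect you with a specialist.", escalated=True)
    return AgentReply(text="Exports are available on the Pro plan.", escalated=False)


cases = [
    EvalCase("Can I export reports?", must_mention=["Pro plan"]),
    EvalCase("Is export available during the phased rollout?", must_mention=["rollout"]),
    EvalCase("I want a refund", must_escalate=True),
]
results = run_eval(cases, fake_agent)
failed = [r["question"] for r in results if not r["passed"]]
```

In this invented run, only the phased-rollout question fails, which points to a missing availability condition in the content rather than a handoff problem. Recording the failure reason per case is what lets the correction go to the right place before the set is run again.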

This creates a two-speed control model. Before launch, teams test expected questions and known edge cases. After launch, they watch real conversations for unexpected language, missing scenarios, or product behavior that the original documentation did not anticipate. The feedback should return to the release tracker, knowledge source, procedure, or product team according to the root cause.

Measure experience at conversation scale

Release evaluation shows whether an agent appears ready, but production measurement shows whether that readiness survives real customer behavior. The CX measurement source reports that CSAT captures fewer than 10% of conversations and that respondents tend to represent more extreme reactions. By that account, survey results leave a large unobserved middle and cannot by themselves explain whether dissatisfaction arose from service, product behavior, or policy.

The source describes an alternative in which AI evaluates every human and agent interaction across dimensions such as service quality, resolution, and customer effort. It reports that Intercom’s CX Score assigns interactions a score from 1 to 5, exposes reasons behind the score, and gives most teams roughly five times the coverage of CSAT alone. Those product-specific claims are reported by the source rather than independently verified here, but they illustrate the broader distinction between voluntary feedback and systematic conversation review.

Fuller coverage does not make direct customer feedback obsolete. CSAT can still capture what a customer chooses to say, while conversation analysis can detect repeated explanations, handoff friction, weak answer quality, unresolved intent, and neutral interactions that generate no survey response. The two signals answer different questions and should be interpreted together rather than forced into a single interchangeable benchmark.

New coverage also requires new baselines. The measurement source cautions against transferring an old CSAT target directly to a conversation-scoring system because the populations and methods differ. It recommends correlating the new score with operational measures such as first response time and time to close, then examining underlying attributes including answer quality, customer effort, and product feedback. Its illustrative targets of 80% for Fin support, 70% for human support, and 78% overall are examples derived from the scenario described in that article, not universal standards.
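
The three illustrative targets can be related through simple arithmetic. The sketch below assumes the overall figure is a volume-weighted mean of the agent and human targets; the source does not state how its overall target was derived, so both function names and the weighting itself are assumptions made for illustration.

```python
def blended_target(agent_target: float, human_target: float, agent_share: float) -> float:
    """Volume-weighted overall target, given the share of conversations the agent handles."""
    return agent_share * agent_target + (1 - agent_share) * human_target


def implied_agent_share(agent_target: float, human_target: float, overall: float) -> float:
    """Agent share that would make the overall target a weighted mean of the other two."""
    return (overall - human_target) / (agent_target - human_target)


# Using the article's illustrative targets: 80% agent, 70% human, 78% overall.
share = implied_agent_share(0.80, 0.70, 0.78)
overall = blended_target(0.80, 0.70, share)
```

Under that assumption, the figures are consistent with the agent handling about four fifths of conversations. A team with a different mix of agent and human volume would arrive at a different overall baseline from the same two component targets, which is one reason the numbers do not transfer directly.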

Segmentation is equally important. Complex, high-touch cases should not automatically be compared with transactional contacts, and aggregate results can hide a poorly performing topic or channel. Useful analysis separates agent and human conversations, examines topics and handoffs, and preserves context about case type. The most actionable output is not the score alone but a reason that can be routed to a responsible owner.
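
A short sketch shows how an acceptable aggregate can hide a weak segment. The records, the 1-to-5 scores, the topic names, and the threshold are invented for illustration and do not come from any product's data model.

```python
from collections import defaultdict

# Invented records: (handler, topic, score on a 1-5 scale, reason for the score)
conversations = [
    ("agent", "billing", 5, "clear answer"),
    ("agent", "billing", 4, "clear answer"),
    ("agent", "sso_setup", 2, "outdated article"),
    ("agent", "sso_setup", 1, "outdated article"),
    ("human", "sso_setup", 4, "resolved after handoff"),
    ("human", "billing", 5, "clear answer"),
]


def mean(values):
    return sum(values) / len(values)


overall = mean([score for _, _, score, _ in conversations])

by_segment = defaultdict(list)
for handler, topic, score, _ in conversations:
    by_segment[(handler, topic)].append(score)
segment_means = {segment: mean(scores) for segment, scores in by_segment.items()}

# Segments below an illustrative threshold, paired with reasons to route to an owner.
THRESHOLD = 3.0
low_segments = sorted(s for s, m in segment_means.items() if m < THRESHOLD)
reasons = sorted({r for h, t, _, r in conversations if (h, t) in low_segments})
```

Here the overall mean is 3.5, while agent-handled `sso_setup` conversations average 1.5 and share a single reason. That reason, rather than the score, is what a knowledge owner can act on.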

Build one improvement loop across CX, product, and knowledge

The sources approach AI customer agents from different angles: the Pioneer report emphasizes expanding capabilities and a broader customer-agent vision; the NPI playbook concentrates on release and knowledge readiness; and the measurement article addresses visibility after deployment. Their combined implication is that these activities cannot remain separate programs.

A low-quality interaction might originate in several places. The knowledge may be missing or contradictory, the procedure may express the wrong policy, the product may behave unexpectedly, the agent may fail to retrieve applicable information, or the case may require a human specialist. Conversation-level reasons help locate the problem, but the organization still needs a route from evidence to correction and then to re-evaluation.

That operating loop changes human work. Customer-facing specialists remain essential for sensitive, ambiguous, or exceptional cases, while also contributing customer language, testing scenarios, escalation criteria, and knowledge improvements. Product and engineering teams become accountable for the support consequences of releases. Knowledge teams manage information as production input, and CX leaders set objectives that balance resolution, effort, policy compliance, and service quality.

The most revealing opportunities may sit in interactions that are neither failures nor successes. Broader conversation analysis can surface answers that were technically acceptable but unnecessarily difficult, impersonal, or incomplete. Improving that middle ground requires more than tuning a model: it may require clearer documentation, a better workflow, a product fix, or a different escalation rule.

As agents acquire more roles, memory, knowledge, and access to business systems, CX operations will increasingly resemble product operations for a continuously changing service. Organizations that establish release gates, evaluation sets, conversation-level diagnostics, and unambiguous ownership will be better positioned to expand agent responsibility without allowing reliability to become an afterthought.

References
July 3, 2026
Connecting Product Analytics, Attribution, and Growth Decisions

Connected product analytics is not simply a larger collection of events, dashboards, and campaign reports. Its practical value comes from preserving the context behind customer behavior, applying consistent definitions, and carrying trustworthy insights into the systems where teams make decisions.

The four source articles describe complementary parts of that operating model: journey-aware attribution, governed product data, AI-assisted analysis across tools, and continuous measurement. Combined, they offer a framework for turning scattered signals into more defensible growth decisions.

Key takeaways
- Attribution becomes more informative when relevant campaign, session, and product context remains connected to later outcomes.
- Persisted context can reveal associations across a journey, but it does not by itself prove that a touchpoint caused a conversion.
- Naming standards, ownership, metadata, and shared customer definitions determine whether connected analytics can be trusted.
- AI agents and connectors can reduce the effort required to investigate and communicate insights, provided permissions and analytical boundaries are explicit.
- Growth improves through a repeatable learning loop that connects observed behavior to a decision, an intervention, and subsequent measurement.

Attribution improves when journey context survives the final click

The source on persisted properties challenges the idea that the last recorded interaction adequately explains a conversion. It reports that customer decisions may be shaped by activity distributed across sessions, channels, campaigns, and product experiences. In its examples, an e-commerce purchase may follow product discovery, promotions, and cart activity; a financial-services outcome may depend on education, trust-building, eligibility checks, and compliance-sensitive steps; and a B2B lead may emerge after product tours, comparison pages, demos, onboarding interactions, stakeholder reviews, and CRM touchpoints.

Persisted properties address part of this measurement problem by retaining meaningful context as a user continues through a journey. This gives analysts more than the attributes attached to the final event and supports questions such as which acquisition context is associated with later activation, which discovery experience precedes stronger conversion, or which onboarding path appears among retained users.
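
One way to picture this is as a carry-forward of the last known context onto later events. The sketch below is a simplified assumption about how persistence could work, not a description of any analytics platform's implementation; the event fields (`user`, `ts`, `name`, `campaign`) and the function name are hypothetical.

```python
def persist_properties(events, keys):
    """Copy the most recent known value of each key onto later events for the same user."""
    last_seen = {}  # user id -> {key: most recent value}
    enriched = []
    for event in sorted(events, key=lambda e: e["ts"]):
        context = last_seen.setdefault(event["user"], {})
        for key in keys:
            if event.get(key) is not None:
                context[key] = event[key]
        enriched.append({**event, **{k: context.get(k) for k in keys}})
    return enriched


events = [
    {"user": "u1", "ts": 1, "name": "landing_view", "campaign": "spring_promo"},
    {"user": "u1", "ts": 2, "name": "product_tour"},
    {"user": "u2", "ts": 3, "name": "landing_view"},
    {"user": "u1", "ts": 4, "name": "purchase"},
]
enriched = persist_properties(events, keys=["campaign"])
purchase = next(e for e in enriched if e["name"] == "purchase")
```

The purchase event now carries the campaign recorded three events earlier, so it can be grouped by acquisition context. The result describes an association along the journey; it does not show that the campaign caused the purchase.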

That richer context should not be confused with automatic causal proof. Attribution assigns or interprets credit according to available data and a chosen analytical approach. A recurring touchpoint may be a useful signal, a proxy for user intent, or an actual contributor to an outcome. Connected journey data makes those possibilities easier to investigate, while controlled experiments and other appropriate evaluation methods remain necessary when a team needs to establish whether changing a touchpoint changes the result.

The practical shift is therefore from asking which interaction deserves all the credit to asking which sequence of interactions warrants attention. That framing is more useful for product roadmaps, campaign investment, onboarding design, and retention analysis because it treats conversion as the outcome of a journey rather than an isolated click.

Data governance supplies the shared meaning behind every signal

More connected data creates more analytical value only when teams agree on what the data represents. The Pendo administration source emphasizes naming conventions, ownership rules, and review cycles for pages, features, segments, guides, and reports. It also describes visitor, account, and product metadata as a strategic asset that should reflect concepts such as onboarding stage, plan type, activation, customer-success motion, and retention.

The marketing analytics source approaches the same requirement from an organizational angle. It argues that analytics works best as a shared language across product, marketing, sales, and customer success. Instead of allowing each function to interpret campaign and product signals independently, teams can align around customer journeys, funnel behavior, and the points at which users find value or leave.

Together, these sources show that the semantic layer is as important as the technical connection. A campaign label, user segment, account tier, activation event, and retention definition must remain intelligible when they move between an analytics platform, a CRM integration, a product report, or an AI-assisted workflow. Otherwise, a connected system can distribute ambiguity more efficiently without improving judgment.

Governance also affects interventions, not just reports. The Pendo source recommends contextual and concise in-app guides, product tours, and tooltips tied to measurable outcomes. This connects the measurement layer to the product experience: the same governed definitions used to identify friction should inform who receives guidance, what behavior the guidance is intended to change, and how the result will be evaluated.

AI connectors reduce workflow friction but do not repair weak analytics

The agent-connectors source extends connected analytics beyond dashboards. It describes an agent working across tools already used by product, analytics, and go-to-market teams, allowing context, analysis, and action to be brought into a more unified interaction. Its central benefit is operational: people can spend less effort moving information between tabs and systems while maintaining the flow of an investigation.

The marketing source similarly presents AI as most useful when paired with behavioral analytics, customer context, disciplined measurement, positioning, and a clear go-to-market strategy. In that account, AI workflows improve the scale and speed of judgment; they do not create durable growth independently of a sound measurement practice.

This distinction matters because an agent can make an answer easier to obtain without making its underlying evidence more reliable. If event definitions conflict, metadata is incomplete, or attribution assumptions are hidden, a connected agent may produce a fluent response to the wrong question. The connector source therefore places importance on permissions, appropriate context, governance, and boundaries alongside prompt design.

A well-designed workflow should preserve the path from a business question to the supporting behavioral evidence. It should also make clear which system supplied the context, which segment or journey definition was used, and whether the result is a descriptive association, an attributed outcome, or evidence from a stronger evaluation. That transparency helps an agent accelerate analysis without becoming an unexamined source of truth.

A connected growth loop joins evidence, intervention, and learning

The sources converge on a continuous operating loop even though each enters it at a different point. Persisted properties preserve the journey context needed to form a better question. Governance and metadata make the relevant users, accounts, features, and outcomes consistently identifiable. Behavioral analytics helps teams locate meaningful movement or friction. Product guidance, campaigns, positioning changes, and go-to-market decisions then become interventions whose effects can be measured.

The Pendo source makes this learning loop explicit by recommending that initiatives record the expected behavior, the observed result, the change in the customer journey, and the team’s next response. The marketing source adds that product, marketing, sales, and customer success should use those findings collectively. The agent-connectors source supplies a potential interface for carrying the analysis across their tools, while the attribution source supplies the longitudinal context needed to avoid judging the intervention solely by the final interaction.

This model also clarifies what a useful growth insight looks like. It is not merely a rising metric or a generated explanation. It connects a defined audience and journey to an observable outcome, states the limits of the attribution, identifies a decision the organization can make, and establishes what should be measured afterward. That standard directs attention toward learning and resource allocation rather than dashboard activity.

The next stage of connected analytics will depend less on adding isolated reports and more on maintaining reliable context as questions move across teams and tools. Organizations that preserve that context, govern its meaning, and test the decisions made from it will be better positioned to turn analytics and AI into a durable growth capability.

References
July 3, 2026
Behavioral Analytics for AI Agent Activation and Retention

AI agent growth is not simply a matter of attracting more users or generating more conversations. The central product question is whether people reach a useful outcome quickly enough to return, and whether the organization can respond intelligently when that journey breaks down.

The two source accounts describe complementary parts of that challenge. The Pendo account focuses on measuring and improving the path from first use to recurring engagement, while the Amplitude account focuses on turning observed behavior into workflows across product and go-to-market systems. Together, they suggest an operating model in which analytics first identifies meaningful behavior and then helps teams act on it.

Treat the agent as a measurable product experience

An AI agent can appear busy without becoming valuable. Conversation counts, prompt volume, and feature exposure show activity, but they do not establish that users completed meaningful work. Behavioral analytics becomes more useful when the agent is treated as an end-to-end product experience rather than an isolated interface.

The Pendo account describes mapping the journey from activation and a first successful task through repeat usage and habit formation. It also reports that the team defined stickiness around the agent’s jobs to be done instead of relying on a generic engagement measure. That distinction matters because a meaningful return pattern depends on the work the agent is intended to support.

The Amplitude account extends the same reasoning beyond analysis. It describes agents operating on verified product events, including high-intent milestones, changes in feature adoption, and signals associated with churn risk. In this model, instrumentation is not merely a reporting layer. It supplies the evidence used to trigger a subsequent decision or workflow.

A practical measurement chain therefore begins with eligibility and exposure, continues through an attempted interaction and a verified first success, and then examines whether users achieve additional useful outcomes over later sessions. The exact events must reflect the agent’s purpose. The durable principle is to measure completed value, not just interface activity.
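
The measurement chain can be sketched as a staged funnel with step-by-step conversion. The stage names and the per-user data below are invented for illustration; real stages would need to be defined from the agent's purpose and verified events.

```python
STAGES = ["eligible", "exposed", "attempted", "first_success", "repeat_success"]

# Invented data: the furthest stage each user reached.
furthest_stage = {
    "u1": "repeat_success",
    "u2": "first_success",
    "u3": "attempted",
    "u4": "exposed",
    "u5": "eligible",
}


def funnel(furthest):
    """Count users who reached each stage, plus conversion from the previous stage."""
    rank = {stage: i for i, stage in enumerate(STAGES)}
    reached = [sum(1 for s in furthest.values() if rank[s] >= i) for i in range(len(STAGES))]
    rows = []
    for i, stage in enumerate(STAGES):
        if i == 0:
            step_rate = None  # no previous stage to convert from
        else:
            step_rate = reached[i] / reached[i - 1] if reached[i - 1] else 0.0
        rows.append((stage, reached[i], step_rate))
    return rows


rows = funnel(furthest_stage)
```

Keeping a separate rate for each step, instead of one end-to-end conversion figure, shows where users stop: before exposure, before an attempt, or between an attempt and a verified success.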

Define activation as the first meaningful success

Activation is most informative when it marks a result that demonstrates the agent’s value. Opening the agent, viewing a suggested prompt, or sending a message may be necessary steps, but none necessarily proves that the user accomplished the intended task.

Pendo’s account reports that activation contained unnecessary cognitive load and that the first-session path did not consistently lead users to a quick win. The reported response included simplifying onboarding, clarifying prompts, and using in-app guidance to make valuable capabilities easier to recognize. This connects activation analysis directly to product design: when users stall before a first success, the remedy may involve reducing choices, clarifying expectations, or improving contextual guidance rather than adding more agent functionality.

Journey analysis should separate several different failure modes. A user who never starts may not understand the value proposition. A user who starts but abandons the task may encounter interaction friction. A user who receives an answer but does not act on it may lack confidence, context, or a clear next step. Combining these outcomes into one conversion rate would hide the product decision each one implies.

Activation should also be connected to the behavior that follows it. If an event labelled as success has no observable relationship with later value, it may be a convenient instrumentation point rather than a meaningful milestone. Behavioral cohorts can help compare subsequent engagement among users who reached different early outcomes, although those relationships should initially be treated as diagnostic evidence rather than proof of causation.

Measure retention as repeated value, not raw frequency

Retention analysis asks whether users continue to obtain value after activation. For an AI agent, that requires more context than a simple count of returning users. A return can indicate trust and usefulness, but it can also reflect an unresolved task, repeated correction, or a workflow that unnecessarily forces the user back.

The Pendo account presents stickiness as a proxy for trust and reports a 61% increase after the team established Agent Analytics and ran a series of product experiments. The same source associates stronger return behavior with proactive anticipation of intent and associates context-rich interactions, supported by timely nudges and in-app guides, with deeper engagement over later sessions. These are reported findings from one product account, not an independently verified benchmark for other agents.

The more transferable lesson is methodological. Teams can segment retention by the early behavior users completed, the type of task attempted, and the context surrounding the interaction. They can then examine whether retained users are repeating successful work, expanding into additional useful tasks, or merely revisiting the same point of friction.

This approach also guards against optimizing stickiness in isolation. Frequent use is desirable only when it reflects repeated useful outcomes. Where the agent’s job is to resolve work efficiently, fewer interactions may sometimes represent a better experience than a longer conversation. The retention definition must therefore stay anchored to the user’s intended result.

Turn behavioral signals into controlled interventions

Analytics creates leverage when it changes what the product or organization does next. The sources cover two levels of intervention. Pendo describes changes inside the experience, such as onboarding simplification, prompt clarification, contextual guides, tuned triggers, and tighter feedback loops. Amplitude describes workflows that cross system boundaries, such as initiating outreach for churn risk, triggering experimentation when adoption falls, activating users after high-intent milestones, and updating CRM records.

These approaches are complementary. In-product interventions can help a user complete the current journey, while cross-functional workflows can coordinate actions that require product, sales, or customer-success involvement. The behavioral signal should determine which response is appropriate: interface friction calls for a product change, an unmet need may call for research, and an account-level risk signal may justify a carefully governed human follow-up.

Automation does not remove the need for experimentation. Pendo reports using A/B tests to evaluate changes, while the Amplitude account emphasizes success criteria, governance guardrails, observability, iteration, and aligned performance measures. A sound operating loop combines those ideas: define the target behavior, verify the underlying events, choose an intervention, test its effect, monitor unintended outcomes, and retain only changes that improve the intended user result.
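
The "test its effect" step can be illustrated with a standard two-proportion z-test. The sources do not say which statistical method either team used, so this is a textbook technique offered as an assumption, with invented counts and a conventional 0.05 cutoff.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Return (absolute lift, z statistic, two-sided p-value) for arm B versus arm A."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return rate_b - rate_a, z, p_value


# Invented counts: users who reached a verified first success in each arm.
lift, z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
keep_change = lift > 0 and p_value < 0.05  # 0.05 is a conventional cutoff, not a mandate
```

A decision rule like `keep_change` covers only the target behavior. The monitoring step in the loop still has to check guardrail measures, because a change can improve first success while degrading something the test did not measure.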

That loop is especially important when an agent both interprets behavior and initiates action. Poor event quality, ambiguous thresholds, or drifting agent performance can otherwise scale an incorrect decision. Human ownership, visible workflow history, and clear evaluation criteria help distinguish useful orchestration from automated noise.

Key takeaways
- Define activation around a verified first useful outcome, not merely opening the agent or sending a prompt.
- Analyze each stage between exposure, attempted use, successful completion, and later return so different forms of friction remain visible.
- Interpret retention through repeated value and task context; activity alone is not sufficient evidence of trust.
- Use behavioral cohorts to generate hypotheses, then apply controlled experiments before treating an observed relationship as causal.
- Match interventions to the signal: improve the experience when friction is local, and use governed cross-functional workflows when follow-through spans multiple systems or teams.
- Monitor data quality and agent performance because automated actions can amplify both accurate and inaccurate interpretations.

The next stage of AI agent maturity will depend less on adding visible capabilities and more on connecting meaningful outcomes to disciplined follow-through. Teams that can measure the first win, recognize repeated value, and govern the actions between them will be better positioned to turn agent adoption into durable product behavior.

References
- Shivam.Consulting Blog – Stop Guessing: Deploy AI Agents That Act on Real User Behavior with Amplitude Workflows
- Shivam.Consulting Blog – Inside the 61% Stickiness Lift for Pendo’s AI Agent: My Agent Analytics Playbook
June 23, 2026
How I Use Novus, the First Product Agent, to Turn Rapid Releases into Measurable Wins

In a world of relentless CI/CD and accelerating release trains, product leaders like me can’t afford lagging signals or fuzzy readouts on what’s truly moving the needle. I need immediate, trustworthy feedback that connects code shipped to outcomes achieved and customer value created.

Coding agents compress weeks of development into hours, but the faster your codebase changes, the harder it is to know what’s actually helping end-users.

That tension is exactly why I brought Novus into my product toolbox. To keep up with the pace of development, over 600 product teams are already using Novus, the first-of-its-kind product agent, which sets itself up automatically, monitors product data, and tells them what to do next.

From my chair, that promise matters only if it translates into clear decisions. With Novus, I’ve been able to tighten the loop between experimentation and learning: it pairs eval-driven development with behavioral analytics and observability so I can see how a release influences activation, engagement, and retention—without spelunking through fragmented dashboards. The agentic AI backbone reduces the manual stitching I used to do across events, cohorts, and funnels, letting me focus on prioritization and product strategy instead of report wrangling.

Day to day, Novus fits naturally into our AI workflows. It surfaces anomalies early, clarifies trade-offs, and frames next-best actions in the language of outcomes. Because it plugs into a unified analytics platform approach, I can maintain continuous discovery at scale while preserving the rigor of Agent Analytics: hypotheses are explicit, telemetry is consistent, and results are traceable. That’s the operating cadence I expect from modern product management leadership.

If your roadmap moves faster than your learning loops, a product agent can be the missing link between speed and certainty. Novus helps me convert rapid releases into measurable wins, keeping the team aligned and confident about what to build next—and just as importantly, what to stop doing.

Inspired by this post on Pendo – Best Practices.

June 17, 2026
AI Agent Product Development: From Workflow to Autonomy

AI agent product development is not primarily a model-selection exercise. It is the work of turning a business outcome into a bounded system that can retrieve information, use tools, make decisions, and escalate safely.

The practical payoff comes from sequencing those capabilities carefully. A focused workflow, explicit measures, controlled access, and continuous evaluation provide a more credible path to value than attempting broad autonomy at launch.

Key takeaways
- Define the business outcome and proof of success before choosing prompts, models, or tools.
- Begin with a repeatable workflow whose inputs, outputs, and failure conditions can be judged clearly.
- Increase capability in stages: relevant retrieval, limited tools, read-only integrations, controlled actions, and then broader autonomy.
- Treat privacy, governance, evaluation, observability, and human escalation as product requirements from the beginning.
- Scale only when operational quality and the intended business outcome remain stable in production.

Start with a decision contract, not an agent concept

An agent initiative becomes testable when the team can state what decision or task the system will handle, what information it requires, what it must never do, and how success will be measured. This creates a decision contract between the product, its users, and the organization operating it.
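
A decision contract can be written down as a small data structure so that gaps are visible before any prompt is drafted. The class, its field names, and the example values below are hypothetical; the source describes the idea, not this representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContract:
    task: str
    required_inputs: tuple[str, ...]
    prohibited_actions: tuple[str, ...]
    success_metric: str
    escalate_when: tuple[str, ...]

    def gaps(self) -> list[str]:
        """Names of fields left empty, i.e. product work still to be done."""
        return [name for name, value in vars(self).items() if not value]


contract = DecisionContract(
    task="Triage inbound support tickets by product area",
    required_inputs=("ticket_subject", "ticket_body"),
    prohibited_actions=("issue_refund", "close_ticket"),
    success_metric="first-contact resolution",
    escalate_when=(),  # not yet defined, so the contract is incomplete
)
open_items = contract.gaps()
```

In this example the only open item is the escalation condition. An empty field signals work for the product team to complete, not a reason to reach for a more capable model.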

The supplied source recommends anchoring an AI strategy to one measurable outcome before writing a prompt or selecting a model. It gives lead response time, first-contact resolution, and time-to-first-value as possible measures. Those examples illustrate an important distinction: the agent is a means of changing workflow performance, not the outcome itself.

This framing also makes AI readiness concrete. Instead of asking whether an organization is generally ready for agents, a product team can examine one workflow: Is the required data available? Are the inputs sufficiently consistent? Can acceptable output be recognized? Are the constraints and escalation conditions explicit? A negative answer identifies product work to complete; it does not automatically call for a more capable model.

A useful initial scope therefore has clear boundaries and frequent enough repetition to produce evidence. The source identifies support-ticket triage, inbound-lead qualification, and account-note summarization as examples. Their significance is not that every organization should adopt them, but that they offer observable inputs and outputs. That makes errors easier to classify and improvements easier to evaluate.

Design capability as an autonomy ladder

The core architectural question is not whether an agent can perform an action. It is what evidence should be required before the product is allowed to perform that action without review. Treating capability as an autonomy ladder gives the team intermediate states between a passive assistant and an unrestricted operator.

The source proposes a retrieval-first pipeline that introduces only relevant knowledge into the context window. In product terms, retrieval is part of the experience contract: the system should receive the information needed for the task without being burdened by unrelated material. This can improve the conditions for relevant responses, although retrieval does not eliminate the need to evaluate the final behavior.

Tool access should be similarly bounded. The source recommends a small, explicit tool catalog, with the agent’s role, constraints, and escalation routes documented. It also points to Model Context Protocol as a way to standardize tool invocation across services. Standardization can make integrations more consistent, but it does not decide which tools the agent should receive or what permissions those tools should carry; those remain product and risk decisions.

Systems of record deserve special caution. The source advises beginning with read-only CRM integration and adding actions only after reliability has been demonstrated. This suggests a practical progression: first observe and recommend, then prepare an action for approval, and only later execute eligible actions within defined limits. Each step creates new failure consequences, so each should have its own evidence threshold.
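
The progression described above can be sketched as an ordered set of levels, each with its own evidence threshold. The level names, the run counts, and the success rates below are assumptions chosen for illustration; a real team would set them from its own risk assessment.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    OBSERVE_AND_RECOMMEND = 1   # read-only: observe and suggest
    PREPARE_FOR_APPROVAL = 2    # draft an action, a human approves it
    EXECUTE_WITHIN_LIMITS = 3   # execute eligible actions inside defined limits


# Hypothetical evidence needed to ENTER each level:
# (minimum reviewed runs at the current level, minimum success rate).
EVIDENCE_THRESHOLDS = {
    AutonomyLevel.PREPARE_FOR_APPROVAL: (200, 0.95),
    AutonomyLevel.EXECUTE_WITHIN_LIMITS: (1000, 0.99),
}


def next_level(current, reviewed_runs, success_rate):
    """Advance one rung only if the evidence meets the next rung's threshold."""
    if current == AutonomyLevel.EXECUTE_WITHIN_LIMITS:
        return current  # already at the top of this ladder
    candidate = AutonomyLevel(current + 1)
    min_runs, min_rate = EVIDENCE_THRESHOLDS[candidate]
    if reviewed_runs >= min_runs and success_rate >= min_rate:
        return candidate
    return current
```

Because the function moves at most one rung per call, strong results at a low level never unlock the highest level directly.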

Prompt engineering belongs inside this broader capability design. A prompt can express the agent’s role and boundaries, but predictable operation also depends on retrieved context, tool definitions, permissions, timeouts, escalation logic, and the surrounding user experience. Managing only the prompt would leave much of the product’s actual behavior outside the team’s control.

Make trust an executable product requirement

Agent risk becomes manageable when broad principles are translated into system behavior. Privacy-by-design should affect what data enters the workflow. Data governance should determine which sources and actions are permitted. Human oversight should appear as an explicit escalation path rather than an informal promise that someone can intervene.

The source calls for regression evaluations covering safety, accuracy, and bias, alongside logs of agent actions, rate limits, timeouts, and risk scoring for high-impact operations. Together, these controls form a layered safety model. Evaluations test expected behavior before and during release; operational limits constrain runtime exposure; logs support diagnosis and accountability; and risk gates determine when automation must stop or seek approval.

Uncertainty should also have a designed destination. According to the source, the default response for high-stakes or uncertain situations should be human escalation. A useful handoff needs more than a generic error message: the receiving person should be able to understand the request, the context used, the action considered, and why the system declined to continue. Handoff quality is therefore part of the product experience as well as the risk model.
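
A handoff with those four elements can be sketched as a small record. The `Handoff` class, the `build_handoff` function, and the 0.8 confidence threshold are hypothetical names and values used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Hypothetical payload a human receives when the agent declines to continue."""
    request: str
    context_used: list
    action_considered: str
    reason_declined: str


def build_handoff(request, context_used, action_considered,
                  confidence, high_stakes, min_confidence=0.8):
    """Return a Handoff for high-stakes or uncertain cases, otherwise None."""
    if high_stakes:
        reason = "action is classified as high-stakes"
    elif confidence < min_confidence:
        reason = (f"confidence {confidence:.2f} is below "
                  f"threshold {min_confidence:.2f}")
    else:
        return None  # the agent may continue on the automated path
    return Handoff(request, list(context_used), action_considered, reason)
```

Returning a structured record rather than an error string lets the receiving interface show the request, the evidence, and the reason side by side.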

This approach avoids treating guardrails as a final compliance checkpoint. When controls are defined alongside workflow requirements, they influence architecture, permissions, interface design, analytics, and release criteria. Trust then becomes something the team can test and operate, rather than a claim attached to the launch.

Use two evidence loops to decide when to scale

An agent can appear technically competent without improving the business outcome that justified it. Product development therefore needs two connected evidence loops: one for operational quality and another for workflow impact.

For operational quality, the source recommends monitoring precision, latency, containment, and handoff quality through agent analytics. These measures answer different questions. Precision concerns whether outputs or decisions are correct enough for the task. Latency affects whether the agent fits the pace of the workflow. Containment indicates how often work remains within the automated path. Handoff quality examines whether escalation preserves context and enables a productive recovery.
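
As a rough sketch, the four measures can be computed from per-run records. The record keys below are assumptions, and "precision" is approximated here as the share of runs judged correct; a team with labeled positives and negatives may prefer a stricter definition.

```python
from statistics import median


def summarize_runs(runs):
    """Summarize operational quality from a list of run records.

    Each record is a dict with hypothetical keys:
    'correct' (bool), 'latency_s' (float), 'escalated' (bool),
    'handoff_complete' (bool, only meaningful when escalated).
    """
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    escalated = [r for r in runs if r["escalated"]]
    handoff_quality = None
    if escalated:
        complete = sum(1 for r in escalated if r["handoff_complete"])
        handoff_quality = complete / len(escalated)
    return {
        "precision": sum(1 for r in runs if r["correct"]) / total,
        "median_latency_s": median(r["latency_s"] for r in runs),
        "containment": (total - len(escalated)) / total,
        "handoff_quality": handoff_quality,
    }
```

Keeping the four values separate in the result mirrors the point above: each one answers a different question and should not be folded into a single score.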

The business loop returns to the original outcome, using outcomes-versus-output OKRs to avoid equating shipped features with value. A team might improve a prompt, add a tool, or increase containment while leaving the target workflow unchanged. That is useful diagnostic progress, but it is not yet evidence that the product investment is working.

The source also recommends A/B testing prompts and tools and considering minimum detectable effect when sizing experiments. Experimentation is most informative when the changed component, eligible population, success measure, and guardrails are defined in advance. Otherwise, movement in a downstream metric can be difficult to attribute to the agent change.
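
To make the role of minimum detectable effect concrete, the sketch below uses the standard normal-approximation formula for comparing two proportions. It is an illustration under stated assumptions (a two-sided test, equal arm sizes, a binary success measure), not a method the source specifies; the function name and defaults are invented for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde`.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1, and mde must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)
```

Halving the detectable effect roughly quadruples the required sample, which is why the effect size worth detecting should be agreed before the experiment starts.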

Qualitative learning completes the loop. The source describes product trios spanning product management, design, and engineering, supported by continuous discovery, weekly transcript review, and the conversion of failure modes into test cases. It also recommends keeping prompts, tools, and evaluations versioned through a docs-as-code approach. This connects discovery to engineering discipline: observed failures become reproducible evaluations, evaluated changes become versioned releases, and releases can be compared or reversed.

Scope and autonomy should expand only when both loops support the decision. Stable technical metrics without workflow impact suggest that the use case or experience needs reconsideration. Business improvement accompanied by unsafe or unreliable behavior suggests that scaling is premature. Evidence across both dimensions supports a measured move into adjacent tasks or higher-impact actions.

Build the next release around earned autonomy

The durable pattern for AI agent products is earned autonomy: every increase in access or authority follows evidence from a narrower operating state. As evaluations accumulate and real workflow performance becomes visible, teams can make expansion decisions based on demonstrated capability rather than the apparent fluency of a demo.

References
- Shivam.Consulting Blog — Kickstart AI Agents with Confidence: 5 Proven Practices I Use to Ship Impact Fast
June 10, 2026
Why We Made Fin the Most Open Agent: Instant HubSpot & Freshdesk Support With 76% Resolutions

I’ve spent my career pairing product strategy with customer reality, and nothing is clearer right now than the demand for openness and speed. Today, we’re announcing that Fin can be used as a Service Agent on top of HubSpot and Freshworks, meaning you can use the world’s best Agent without migrating off your helpdesk.

HubSpot and Freshdesk customers can now:
- Get Fin live, integrated, and working seamlessly in less than an hour.
- Reach a 76% average resolution rate.
- Cover all customer channels (voice, email, chat, social, and more).
- Resolve complex queries that require reading and writing to third-party systems.
- Configure everything to follow the unique policies of their individual business.

This launch is a very visible step in a journey we’ve been on from day one: building an open, customer-first platform that plays well with the rest of your stack. We’ve long known that businesses want flexibility in how they configure their customer-facing tech stack. Since the very beginning, we have built Fin as an open platform, with APIs, MCPs, a CLI, and access to Apex, our proprietary trained model that delivers best-in-class performance.

To make things easy for our customers, we have extensive public documentation of our product on our website, in our help center, and in our developer docs. We are the only Agent company in our space to do this; others hide most details behind sign-in screens, which we don’t believe is the right thing to do.

Open Agent platforms will win because customers refuse to be boxed into closed ecosystems. We now believe our category has reached a stage where customers demand open platforms, that those who open up are more likely to win, and those who remain closed and protectionist will accelerate their demise.

We are operating in a fast-changing world, and customers do not want to be locked into a single vendor or closed ecosystem. They want the ability to experiment, to swap things in and out, and to move everything with ease, technically and commercially.

In an open world, where businesses can easily swap vendors, the best product will win. We are happy to compete on that front, confident that Fin delivers the best customer experience and the highest performance.

From a product management lens, this openness is powered by agentic AI patterns paired with robust CRM integration. Under the hood, we use Model Context Protocol (MCP), well-documented APIs, and orchestrated AI workflows to read from and write to third-party systems. That’s how Fin handles true multi-channel work—including voice AI agent scenarios—while giving teams the observability they need through Agent Analytics.

If you are a HubSpot or Freshdesk customer, you can now have Fin integrated and live within an hour, without needing any help from us. We’re here if you want us, but as part of our commitment to building an open platform, we’ve designed everything to be self-servable. Start in minutes or watch a quick demo of how everything works.

Fin for HubSpot

Fin for Freshdesk

Inspired by this post on The Intercom Blog.

June 9, 2026

Reusable AI Agent Workflows Need Evaluation Contracts

Reusing an AI agent capability can accelerate delivery, but reuse also multiplies the consequences of an undetected defect. A retrieval component, tool-call routine, or safety check may appear in several workflows, so its quality cannot depend on the team that happens to integrate it next.

The practical answer is to package each reusable skill with an evaluation contract: defined behavior, test fixtures, observability, guardrails, and outcome measures that travel with the component. Read together, the two source articles outline how modular workflow design and eval-driven development can reinforce each other from prototype through production.

Reuse requires a contract, not just a prompt

The AI skills library article describes modular capabilities for retrieval and grounding, summarization, classification, tool use, data enrichment, safety controls, and evaluation harnesses. Its central architectural idea is consistency: common interfaces and conventions allow teams to compose capabilities and replace implementations without rebuilding an entire flow.

That modularity addresses code and workflow reuse, but it leaves an important product question: what must remain true when an implementation is replaced? The product-manager evaluation playbook supplies the missing half. It calls for versioned prompts, tools, and datasets; fixed offline scenarios; production experiments; and traces that expose how an agent reached an answer.

The synthesis is an evaluation contract attached to every reusable skill. The contract defines acceptable inputs and outputs, relevant policies, expected telemetry, representative tests, and promotion thresholds. A skill is then reusable because its behavior can be checked repeatedly, not merely because its code can be imported.
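
A minimal sketch of that idea follows, assuming a contract reduced to fixtures and a pass-rate threshold. The `EvaluationContract` class, the `route_ticket` skill, the fixtures, and the keyword router are all invented for illustration; neither source article defines them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationContract:
    """Hypothetical contract that travels with a reusable skill."""
    skill: str
    fixtures: tuple        # (input, expected_output) pairs
    min_pass_rate: float   # promotion threshold


def evaluate(contract, implementation):
    """Run any implementation against the contract's fixtures."""
    if not contract.fixtures:
        raise ValueError("contract has no fixtures")
    passed = sum(1 for given, expected in contract.fixtures
                 if implementation(given) == expected)
    rate = passed / len(contract.fixtures)
    return rate, rate >= contract.min_pass_rate


ROUTING_CONTRACT = EvaluationContract(
    skill="route_ticket",
    fixtures=(
        ("I was charged twice", "billing"),
        ("The app crashes on login", "technical"),
        ("How do I export my data?", "how_to"),
    ),
    min_pass_rate=1.0,
)


def keyword_router(text):
    """Toy implementation; a model-backed one could be swapped in unchanged."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "technical"
    return "how_to"
```

Because `evaluate` accepts any callable, a replacement implementation is checked against the same fixtures and threshold as the one it replaces.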

This distinction matters most in composed workflows. A summarizer that performs well on clean text may behave differently after a weak retrieval step. A tool-use component may generate a plausible response even when the underlying action fails. Reusable interfaces make these components interchangeable; evaluation contracts make the substitutions accountable.

Measure four layers of agent quality

No single score can represent the quality of a reusable agent workflow. The evaluation article separates concerns such as task success, factuality, safety, latency, cost, evidence quality, and product outcomes. The skills-library article adds operational concerns around guardrails, runtime metrics, and production monitoring. Combined, they suggest a four-layer model.

Evaluation layer	Question it answers	Reusable evidence	Reported signals
Component behavior	Does the skill perform its assigned task?	Fixed fixtures, golden examples, and domain scenarios	Task success, factuality, and retrieval evidence quality
Safety and policy	Does it remain within required boundaries?	Adversarial cases, policy checks, and guardrail configurations	Safety performance, PII handling, and content-policy adherence
Operational performance	Can it run reliably within product constraints?	Traces, logs, version records, and production dashboards	Latency, cost, tool success, and fallback behavior
Product impact	Does better agent behavior create user or business value?	Experiment definitions and driver-tree mappings	Task completion, satisfaction, activation, retention, and NRR

The layers should remain distinguishable even when a dashboard brings them together. If a workflow’s task-success score rises while latency or cost deteriorates, the trade-off should be visible. If offline factuality improves without changing completion or satisfaction in production, the result should not automatically be treated as a product win.

Retrieval-first workflows illustrate the value of separation. The evaluation playbook recommends assessing the quality of retrieved evidence independently from generation. That boundary makes a failure attributable: the system can distinguish missing or irrelevant evidence from a generator that mishandled useful context. The same principle applies to classification, tool selection, tool execution, and response composition.
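
The attribution logic can be sketched in a few lines, assuming each test case carries a set of document IDs labeled as relevant. The function name and the three outcome labels are assumptions made for this example.

```python
def attribute_failure(retrieved_ids, relevant_ids, answer_correct):
    """Classify one evaluated case as 'ok', 'retrieval', or 'generation'.

    A wrong answer is attributed to retrieval when none of the labeled
    relevant documents were retrieved, and to generation otherwise.
    """
    if answer_correct:
        return "ok"
    if not set(retrieved_ids) & set(relevant_ids):
        return "retrieval"   # useful evidence never reached the generator
    return "generation"      # evidence was present but mishandled
```

This binary view is deliberately coarse; graded relevance or partial-evidence cases would need a richer rule.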

A reusable workflow needs a controlled promotion path

The two sources describe complementary stages rather than competing evaluation methods. The skills-library article starts with a quick-start chain, configurable skills, guardrails, evaluation datasets, and instrumentation. The evaluation playbook places fixed offline suites before user exposure, followed by controlled online validation. Together they form a promotion path from composable prototype to measured production capability.

Offline evaluation establishes eligibility

A candidate workflow should first face stable examples representing core scenarios, known failure modes, edge cases, adversarial prompts, and domain-specific questions, as reported by the evaluation playbook. Stable fixtures make comparisons reproducible when a prompt, model, tool, retrieval strategy, or policy changes. Running these checks through CI/CD, as proposed in the skills-library article, turns evaluation into a regular release control instead of a separate audit.

Model-based judges can expand coverage for qualities such as helpfulness, coherence, and adherence, but the evaluation article cautions that they require calibration against a small, high-quality human-labeled set. It also recommends monitoring judge drift and retaining human review for edge cases or flows where mistakes carry greater consequences. A reusable judge configuration should therefore include its rubric, reference labels, version, and conditions for escalation.
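
One common way to quantify calibration is Cohen's kappa, which measures agreement between judge and human labels after correcting for chance. The evaluation article does not name a specific statistic, so treating kappa as the calibration measure is an assumption of this sketch.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between a model judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(1 for j, h in zip(judge_labels, human_labels) if j == h) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Recomputing this value on the same human-labeled set after each judge or model update gives a simple signal for the judge drift mentioned above.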

Online evaluation establishes value

Passing offline checks shows that a variant is eligible for controlled exposure; it does not prove that users benefit from it. Both articles describe feature flags and A/B testing as mechanisms for comparing workflow variants in production. The evaluation playbook identifies conversation outcomes, tool success rates, human-support fallbacks, and user satisfaction as useful online signals.

This staged approach also limits ambiguity. An offline regression can block a weak component before exposure, while an online experiment can test whether an eligible improvement changes real behavior. Promotion should depend on both: acceptable component performance and evidence that the complete workflow advances its intended outcome.
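
The two-stage rule can be sketched as a single decision function. The parameter names, the use of a confidence-interval lower bound as the online criterion, and the returned status strings are assumptions for illustration rather than anything either article prescribes.

```python
def promotion_decision(offline_pass_rate, offline_minimum, lift_ci_lower=None):
    """Combine offline eligibility with online evidence of value.

    lift_ci_lower is the lower bound of the confidence interval for the
    measured outcome lift, or None if no online experiment has run yet.
    """
    if offline_pass_rate < offline_minimum:
        return "blocked: offline regression"
    if lift_ci_lower is None:
        return "eligible: run controlled experiment"
    if lift_ci_lower > 0:
        return "promote"
    return "hold: no demonstrated outcome improvement"
```

The ordering matters: a variant that fails offline checks is never exposed to users, regardless of how promising it looks elsewhere.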

Traces turn composition failures into fixable problems

Composability increases the number of boundaries at which a workflow can fail. The evaluation playbook treats traces as the backbone of agent evaluation because they record inputs, intermediate actions, invoked tools, and final responses. The skills-library article similarly connects reusable chains to logs, traces, metrics, and production dashboards.

A final-answer score alone may reveal that a workflow failed, but a trace can localize the failure. It can show whether retrieval supplied poor evidence, classification selected an unsuitable route, a tool call failed, a guardrail intervened, or generation ignored valid context. This makes evaluation useful for component ownership: teams can repair the relevant skill rather than adding a broad prompt patch to the entire chain.

Trace analysis also supports reuse decisions. If one component repeatedly causes latency, cost, or safety regressions across several workflows, improving that shared component may create more value than optimizing each application independently. Conversely, a component that succeeds in one context but fails in another may need a narrower contract rather than a universal interface.
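
Both uses of traces, localizing a single failure and spotting a shared component that fails repeatedly, can be sketched with a simple trace shape. The span format (a `step` name and a `status` string) is a hypothetical simplification, not the schema of any particular tracing tool.

```python
from collections import Counter


def first_failing_step(trace):
    """Return the name of the first step whose status is not 'ok', or None."""
    for span in trace:
        if span["status"] != "ok":
            return span["step"]
    return None


def failure_hotspots(traces):
    """Count which step fails first across many workflow traces."""
    steps = (first_failing_step(trace) for trace in traces)
    return Counter(step for step in steps if step is not None)
```

A step that dominates the resulting counts across several workflows is a candidate for a shared fix rather than per-application patches.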

Versioning is essential to that diagnosis. The evaluation playbook recommends versioning prompts, tools, and datasets, while the skills-library article emphasizes swappable implementations and comparable variants. Without linked versions for the component, evaluation set, judge, and workflow configuration, an apparent improvement may be difficult to reproduce or attribute.

Governance and product outcomes belong in the same system

Reusable workflows can spread good controls, but they can also propagate weak ones. The skills-library article reports guardrails for PII redaction, content-policy checks, and rate limiting, alongside configuration intended to support privacy-by-design. Packaging these controls as reusable capabilities can make the approved path easier to adopt, while evaluation fixtures test whether the controls continue to work as surrounding workflows change.

Governance should not be reduced to a final pass-or-fail gate. Safety, privacy, and policy behavior need their own cases and traces throughout development. The amount of human review can then reflect the cost of error, consistent with the evaluation playbook’s recommendation to retain human oversight for higher-risk flows.

The same evaluation system must connect technical quality to product value. The evaluation playbook proposes a driver tree that links per-turn measures such as helpfulness, safety, and latency to session outcomes such as task completion, and then to product measures including activation, retention, and Net Revenue Retention (NRR). This hierarchy prevents a local metric from becoming the objective by default.

For product teams, the resulting unit of roadmap work is not simply a new skill. It is a versioned capability with evidence about behavior, operational fitness, policy compliance, and contribution to an intended outcome. That shared definition gives product trios, engineers, and governance stakeholders a more precise basis for deciding whether to reuse, revise, or retire a component.

Key takeaways

- Package each reusable agent skill with an evaluation contract covering behavior, fixtures, telemetry, policies, and promotion criteria.
- Keep component quality, safety, operational performance, and product impact distinct so improvements and trade-offs remain attributable.
- Use fixed offline evaluations to establish release eligibility, then controlled online experiments to determine real-world value.
- Trace intermediate steps and tool activity so failures can be assigned to the correct component instead of patched at the final response.
- Version workflows, prompts, tools, datasets, and judges together so results remain comparable and reproducible.

As skill libraries expand, their lasting advantage will come from accumulated evidence rather than component count. Teams that make evaluation portable alongside implementation can reuse workflows without surrendering visibility, governance, or product accountability.

June 5, 2026

Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.

Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.

Practically, this means I can treat Amplitude analytics as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift—from “insight later” to “insight-to-action now.”

Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.

From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments—not replaces—expert judgment. Data governance and permissions remain front and center: the agent sees what your teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.

If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.

The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.

Inspired by this post on Amplitude – Best Practices.

June 3, 2026
How to Design a Dependable CLI Agent Users Can Trust
Your CLI agent can look impressive in a controlled demo and still feel unsafe in a real repository. The moment it can edit files, invoke tools, or use credentials, users need to understand what it will do before they let it proceed.

The dependable design is rarely the one with the most capabilities. It is the one with the smallest clear promise, predictable execution, visible controls, and evidence that it succeeds repeatedly.

Define the boundary before you define the features

Start by writing an operating contract for the agent. This is a product decision, not a prompt-writing exercise. A useful contract answers five questions:
- What job does the agent complete?
- Which resources and tools may it use?
- What must it never do?
- Which actions require explicit approval?
- What observable result counts as success?
Keep the job narrow enough to explain in one sentence. If the description needs a collection of exceptions, the interface is already carrying too much ambiguity. Split the work into a clearly named subcommand or make the advanced behavior opt-in.

Treat every flag, tool, and permission as an increase in blast radius. A new option does not merely add flexibility. It creates another state the agent can misunderstand, another path you must test, and another behavior the user must learn. Reducing the surface area can improve repeatability and trust because both the agent and the user have fewer possible paths to reason about.

When reviewing a proposed capability, ask whether it makes the mental model smaller. If it does not, remove it, defer it, or isolate it behind progressive disclosure. Safe, fast defaults should handle the common case without demanding that a new user understand the entire system.

Design one boring, observable execution path

A dependable run should feel like a transaction with recognizable stages. The model can help interpret intent, but it should not invent the execution contract as it goes.
- Capture intent: Ask only for information required to resolve the task. If a missing choice would materially change the result, stop and ask.
- Retrieve context: Fetch the smallest relevant set of files, facts, or records. More context can introduce conflicting instructions and distract the agent from the requested change.
- Show the plan: Present a compact description of the intended actions, affected targets, and likely side effects.
- Preview when useful: Provide a dry run for operations whose effects the user should inspect before execution.
- Execute through narrow tools: Give each tool a deterministic input and output contract. Reject malformed responses instead of guessing what they meant.
- Verify the result: Check the resulting state and tell the user what changed, what did not, and whether any step failed.
The agent should stop when the requested scope changes, required context is unavailable, or a tool returns an unexpected result. A visible stop is easier to recover from than confident improvisation.
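
The "reject malformed responses" and "visible stop" rules above can be sketched as a strict parser for tool output. The `StopRun` exception and `parse_tool_output` function are hypothetical names, and the example assumes tools return a JSON object.

```python
import json


class StopRun(Exception):
    """Raised to halt the run visibly instead of improvising."""


def parse_tool_output(raw, required_keys):
    """Accept only a JSON object that contains every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise StopRun(f"tool returned invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise StopRun("tool output is not a JSON object")
    missing = sorted(set(required_keys) - set(data))
    if missing:
        raise StopRun(f"tool output is missing keys: {', '.join(missing)}")
    return data
```

The CLI can catch `StopRun` at the top level, print the message, and exit without applying any further step.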

Favor idempotent operations wherever you can. Repeating an idempotent action produces the intended state without duplicating or compounding its effects. That property matters in a CLI because interrupted runs and retries are normal operating conditions. Test the second run as deliberately as the first.
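
As a small illustration of idempotence, the sketch below adds an entry to a list of lines (for example, the contents of an ignore file) only when it is absent. The function name and the example entries are assumptions for this example.

```python
def ensure_entry(lines, entry):
    """Return (new_lines, changed); adding an existing entry is a no-op."""
    if entry in lines:
        return list(lines), False
    return [*lines, entry], True


# Run the same operation twice, as a retry would.
first, changed_first = ensure_entry(["node_modules/"], ".env")
second, changed_second = ensure_entry(first, ".env")
```

After both calls, `first` and `second` hold the same lines and only the first call reports a change, which is exactly the property a retry test should assert.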

Put human control at the blast-radius boundary

Do not ask for approval at every step. Constant prompts train users to approve without reading. Place confirmation gates where the consequence or scope changes.
- Read-only work: Make inspection and planning the default where possible.
- Scoped writes: Request access only to the specific project, service, or resource needed for the task.
- Destructive actions: Require a separate confirmation that names the target and explains the consequence.
- Credentials: Use narrowly scoped, time-bounded access rather than broad credentials that persist beyond the run.
- Expanded capability: Let users opt into advanced tools instead of quietly enabling them for every session.
A confirmation message should help the user make a decision. Replace a generic question such as “Continue?” with a concrete statement of what will be changed and whether it can be undone.

Reversibility should shape the underlying implementation as well. Prefer changes that can be represented as a patch, show the proposed difference before applying it, and preserve enough information to explain how to undo the operation. When reversal is impossible, make that fact visible before execution.
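
Python's standard `difflib` module can produce such a preview. The sketch below assumes the agent holds the file's current and proposed text in memory; `preview_patch` and the file name are hypothetical.

```python
import difflib


def preview_patch(path, before, after):
    """Return a unified diff of the proposed change, or '' if nothing changes."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


patch = preview_patch("config.toml", "debug = true\n", "debug = false\n")
undo = preview_patch("config.toml", "debug = false\n", "debug = true\n")
```

Showing `patch` before applying the change and keeping `undo` afterward gives the user both the preview and a description of how to reverse the operation.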

Use a simple review question for each workflow: can a user predict the maximum consequence of saying yes? If the answer is unclear, the permission boundary is too broad or the confirmation arrives too late.

Prove reliability before expanding the roadmap

Do not use capability count as the measure of progress. Before adding a feature, define the task it should complete, the success threshold it must meet, and the smallest interface needed to test it. This turns roadmap discussions into observable product decisions.

Evaluate at least three outcomes: task completion, time to first successful result, and stability when the same operation is run again. A capability that succeeds once but behaves differently on a retry is not ready merely because the first demonstration worked.
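
These three outcomes can be computed from a simple log of attempts. The input shapes below, ordered `(succeeded, seconds)` tuples and pairs of resulting states, are assumptions chosen to keep the sketch short.

```python
def evaluate_task(attempts, rerun_states):
    """Compute completion, time to first success, and rerun stability.

    attempts: ordered list of (succeeded, seconds) tuples for one task.
    rerun_states: list of (state_after_first_run, state_after_rerun) pairs.
    """
    if not attempts or not rerun_states:
        raise ValueError("attempts and rerun_states must be non-empty")
    completion_rate = sum(1 for ok, _ in attempts if ok) / len(attempts)
    elapsed = 0.0
    time_to_first_success = None  # stays None if no attempt succeeded
    for ok, seconds in attempts:
        elapsed += seconds
        if ok:
            time_to_first_success = elapsed
            break
    stable = sum(1 for first, rerun in rerun_states if first == rerun)
    return {
        "completion_rate": completion_rate,
        "time_to_first_success_s": time_to_first_success,
        "rerun_stability": stable / len(rerun_states),
    }
```

A capability that scores well on completion but poorly on rerun stability is the case described above: it worked once, yet it is not ready.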

Instrument each run with Agent Analytics. Capture the input, tools selected, duration, outcome, and error pattern. Review those signals to find where the agent asks unnecessary questions, repeats tool calls, loses users, or encounters the same failure. The response may be a smaller prompt, a tighter tool contract, a safer default, or the removal of a confusing option.

Documentation belongs in this reliability loop. Keep runnable examples alongside the code and make them reflect the golden path. Treat any mismatch between documented behavior and actual behavior as a product defect. If the workflow cannot be explained and demonstrated simply, it is not yet a dependable workflow.

Use these evaluations as promotion gates. Add power only after the current path is measurable, understandable, and stable. That discipline earns you the right to expand without turning the CLI into a collection of loosely related agent behaviors.

Key takeaways
- Write the agent’s operating contract before choosing its tools or refining its prompt.
- Keep the default workflow narrow, safe, fast, and explainable in one sentence.
- Retrieve minimal context, show a compact plan, execute through deterministic contracts, and verify the result.
- Place explicit approval at destructive, irreversible, or scope-expanding boundaries.
- Measure completion, time to first success, and rerun stability before adding another capability.
- Use run telemetry and executable documentation to decide what to simplify next.
Choose one golden-path task and write its operating contract now. Then run it twice: once normally and once as a retry. Every surprise you find is a reliability requirement to resolve before you broaden the agent’s reach.

References
- Shivam.Consulting Blog — The Counterintuitive Playbook for CLI Agents: Why Ruthless Subtraction Beats Feature Creep
May 27, 2026
Speed-to-Lead Is Dead: How AI Agents End the Wait and Rebuild a High-Velocity Sales Org

A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.

We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, and forty-eight hours for everyone else. Those were considered good numbers.

No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.

An AI Agent closes it completely.

When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.

Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.

Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.

What AI changes

Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.

The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.

The constraint the industry built around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.

If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.


SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.

Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.

You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:

1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.

2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.

3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.

These three questions reveal whether the motion is working. Time-to-first-response can’t.
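
The three questions above can be sketched as one summary over Agent-handled conversations. This is a minimal illustration, assuming a hypothetical `Conversation` record; the field names and the satisfaction scale are not taken from any real Fin or CRM schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conversation:
    """One Agent-handled sales conversation (hypothetical schema)."""
    completed: bool              # reached a clear outcome
    qualified: bool              # prospect met the qualification criteria
    contact_captured: bool       # usable contact details were collected
    meeting_booked: bool
    pipeline_value: float        # pipeline created, in your reporting currency
    satisfaction: Optional[int]  # assumed 1-5 rating; None if not rated


def outcome_metrics(conversations: list[Conversation]) -> dict:
    """Summarize work done, pipeline produced, and buyer experience."""
    total = len(conversations)
    if total == 0:
        return {}

    def rate(flag: str) -> float:
        return sum(getattr(c, flag) for c in conversations) / total

    rated = [c.satisfaction for c in conversations if c.satisfaction is not None]
    return {
        # 1) Is the Agent doing the work?
        "completion_rate": rate("completed"),
        "qualification_rate": rate("qualified"),
        "contact_capture_rate": rate("contact_captured"),
        # 2) Is the work producing pipeline?
        "meetings_booked": sum(c.meeting_booked for c in conversations),
        "pipeline_created": sum(c.pipeline_value for c in conversations),
        # 3) Are buyers having a good experience?
        "avg_satisfaction": sum(rated) / len(rated) if rated else None,
    }
```

Time-to-first-response does not appear in the summary at all, which is the point: none of these outcomes can be inferred from it.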

Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.

An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.

The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.

Speed-to-lead made sense when the delay was structural. It isn’t anymore. If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.

Inspired by this post on The Intercom Blog.

May 26, 2026

Prompt Engineering for Amplitude Global Agent That Holds Up

You ask Amplitude Global Agent why activation fell. It returns a plausible explanation, but you still can’t tell which events it examined, whether the comparison was valid, or what your product team should do next.

The fix is to treat the prompt as an analysis specification. Define the decision, provide the relevant analytics context, constrain unsupported conclusions, and make the agent show its work. You will get an answer that is easier to verify and more useful in a product review.

Start with the decision, not a broad request for insights

Requests such as “analyze activation” leave several decisions unresolved. The agent must guess what activation means, which users belong in the analysis, which period matters, and what kind of answer you expect. Even a polished response may answer the wrong question.

Before writing the prompt, complete this sentence: “After reading the answer, we need to decide whether to…” Your ending might be “change the onboarding sequence,” “investigate a recent release,” or “prioritize one segment for discovery.” That decision gives the analysis a destination.

Then assign a role that matches the work. “You are a product analyst investigating activation performance” is more useful than “You are a helpful assistant.” Add the audience as well. An executive needs the size and business relevance of a change; a product trio also needs the affected steps, segments, and follow-up questions.

A strong opening contains three elements:

Role: the analytical perspective the agent should take.
Decision: what the team will choose or investigate after reading the result.
Success criteria: what the answer must establish before it is useful.

For example: “You are a product analyst helping the onboarding team decide whether to redesign a weak activation step. Identify the largest meaningful drop-off, show which defined segment is most affected, and separate measured findings from possible explanations.”

Give the agent a compact analytics contract

The most reliable prompt names the data the agent may use. Include the relevant event names, property names, segment definitions, filters, and timeframe. If activation has an internal definition, write it out rather than relying on the agent to infer it.

This is a retrieval-first approach: put authoritative definitions, dashboard context, and prior query logic into the request before asking for interpretation. Concrete grounding reduces room for invented assumptions and makes repeated analyses easier to compare. A structured prompt can also specify the role, business objective, allowed data, and output fields.

Prompt element	What to provide	What it prevents
Metric definition	The exact event sequence or outcome that counts	A different interpretation of activation or retention
Population	Included users or accounts and explicit exclusions	Comparisons across unlike populations
Segments	Named properties and the values to compare	Arbitrary segmentation
Timeframe	The analysis period and comparison period	Hidden or inconsistent date choices
Evidence boundary	The events, properties, definitions, and dashboards allowed	Unsupported claims presented as measured facts
Output contract	Required sections, fields, ordering, and length	A long narrative that cannot be reviewed quickly

Do not dump every available definition into the context. Include only what the question requires. More context is useful when it resolves ambiguity; irrelevant context competes for attention and makes the prompt harder for a teammate to audit.
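
As a rough sketch, the contract in the table can be held as a small structure that reports which elements are still empty before the prompt is sent. The class name, field names, and the `missing_elements` helper are hypothetical, introduced only to illustrate the idea.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AnalyticsContract:
    """Hypothetical container mirroring the six prompt elements in the table."""
    metric_definition: str = ""
    population: str = ""
    segments: list[str] = field(default_factory=list)
    timeframe: str = ""
    evidence_boundary: list[str] = field(default_factory=list)
    output_contract: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Names of elements left empty, in the order they are declared."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

An empty result from `missing_elements()` does not prove the contract is good, only that nothing was left for the agent to guess.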

Use a reusable prompt that exposes uncertainty

You can adapt the following structure for activation, retention, anomaly investigation, or another behavioral analysis:

Role and audience: “Act as a product analyst. Write for the product manager and analytics lead responsible for [area].”
Decision: “Help us decide whether to [decision].”
Question: “Determine [specific analytical question].”
Definitions: “For this analysis, [metric] means [explicit event or outcome definition].”
Data context: “Use these events: [names]. Use these properties: [names]. Compare these segments: [definitions]. Analyze [timeframe] against [comparison period]. Apply [filters and exclusions].”
Constraints: “Use only the supplied Amplitude analytics events, properties, and definitions. Do not treat an unmeasured explanation as a finding.”
Output: “Return the metric result, segment comparison, timeframe, evidence, interpretation, confidence or limitation, and recommended next check.”
Fallback: “If the available data cannot answer the question, state what is missing and provide the smallest follow-up query needed.”

The fallback matters. Without it, the agent has an incentive to complete the requested narrative even when the evidence is incomplete. A useful failure is specific: it identifies a missing event, undefined property, absent comparison, or ambiguous metric. Your team can fix that. A confident guess is harder to detect.
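
The structure above can be kept as a versioned template so the constraints and fallback are never dropped by accident. The following is a minimal sketch using Python's standard `string.Template`; the placeholder names and the `render_prompt` helper are assumptions made for illustration.

```python
from string import Template

# Placeholder names ($area, $decision, ...) are illustrative assumptions.
PROMPT_TEMPLATE = Template("""\
Role and audience: Act as a product analyst. Write for the product manager and analytics lead responsible for $area.
Decision: Help us decide whether to $decision.
Question: Determine $question.
Definitions: For this analysis, $metric means $metric_definition.
Data context: Use these events: $events. Use these properties: $properties. Compare these segments: $segments. Analyze $timeframe against $comparison_period. Apply $filters.
Constraints: Use only the supplied Amplitude analytics events, properties, and definitions. Do not treat an unmeasured explanation as a finding.
Output: Return the metric result, segment comparison, timeframe, evidence, interpretation, confidence or limitation, and recommended next check.
Fallback: If the available data cannot answer the question, state what is missing and provide the smallest follow-up query needed.
""")


def render_prompt(**values: str) -> str:
    """Fill the template; fail loudly rather than send an underspecified prompt."""
    blank = [name for name, value in values.items() if not value.strip()]
    if blank:
        raise ValueError(f"empty values for: {', '.join(sorted(blank))}")
    # Template.substitute raises KeyError if any placeholder has no value.
    return PROMPT_TEMPLATE.substitute(**values)
```

Because the constraints and fallback lines are fixed text rather than parameters, a teammate reusing the template cannot omit them by leaving a field out.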

Ask for measured findings, interpretations, and recommendations as separate fields. A measured drop-off is evidence. A claim that users were confused is an interpretation unless the supplied data establishes it. A recommendation to inspect session replay or conduct customer interviews is a next step, not proof of the cause. Keeping those layers separate makes the result safer to use in prioritization.

Turn prompt quality into a small product evaluation

Do not judge a prompt by whether one response sounds intelligent. Save the prompt version, input context, and output. Then test it against a question whose answer your team already knows. This gives you a reference point for accuracy before you use the template on an ambiguous problem.

Score each version on three dimensions:

Accuracy: Did the answer use the supplied definitions, filters, segments, and timeframe correctly?
Clarity: Can a reviewer distinguish evidence, interpretation, limitations, and next steps?
Actionability: Does the result support the stated decision or name the next query required?

Change one meaningful element at a time. You might compare a broad objective with a decision-specific objective, a narrative response with a fixed output contract, or an unrestricted answer with an explicit evidence boundary. Run the same test question through each variant. Otherwise, you will not know which change improved the result.
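
A lightweight way to record this comparison is sketched below. The 1-5 scale, the equal weighting of the three dimensions, and the accuracy tie-break are all assumptions for the example; a team may reasonably weight them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptScore:
    """Reviewer scores for one prompt variant on the same test question."""
    version: str
    changed_element: str  # the single element varied from the baseline
    accuracy: int         # assumed 1-5 scale
    clarity: int          # assumed 1-5 scale
    actionability: int    # assumed 1-5 scale

    @property
    def total(self) -> int:
        return self.accuracy + self.clarity + self.actionability


def best_version(scores: list[PromptScore]) -> PromptScore:
    """Pick the highest total; break ties in favor of accuracy (an assumption)."""
    return max(scores, key=lambda s: (s.total, s.accuracy))
```

Recording `changed_element` alongside each score keeps the one-change-at-a-time discipline visible when someone reviews the history later.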

Commit to two or three prompt iterations for one critical workflow. Review the failures, tighten the ambiguous instruction, and keep the better-performing version. Within a sprint, that process can produce a reusable template for a recurring analysis such as activation, retention, or anomaly detection.

Store winning prompts with their required inputs and known limitations. A template without those notes becomes cargo cult: teammates copy the wording but omit the definitions that made it work. Treat the prompt, context requirements, evaluation question, and scoring criteria as one asset.

Key takeaways

- State the product decision before requesting analysis.
- Define the metric, population, segments, filters, and timeframe explicitly.
- Restrict conclusions to the analytics evidence you supplied.
- Separate measured findings from interpretations and recommended actions.
- Require a specific fallback when the data is insufficient.
- Version and score prompts for accuracy, clarity, and actionability.

Start with the recurring Amplitude question that currently creates the most debate. Write its decision, definitions, evidence boundary, and output contract. Run two or three scored iterations, then give the winning template to another product manager. If they can obtain a defensible answer without you translating the prompt, it is ready to become part of the team’s operating system.

References

Amplitude — Prompt Like a Pro: Three Battle-Tested Tips for Amplitude Global Agent Success

May 26, 2026

Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at too narrow a set of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance (accuracy scores, resolution rates, and benchmark tests on curated datasets) soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

How does it handle your real-world setup?

Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
- Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
- Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
- Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
- Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
- Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
- Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.
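
Before running a POC, it can help to check that the scenario set actually covers the mix listed above. The sketch below assumes each scenario is a dict with a `category` key; the category labels and the minimum of three scenarios per category are hypothetical choices, not a standard.

```python
from collections import Counter

# Category labels are illustrative names for the scenario types listed above.
REQUIRED_CATEGORIES = frozenset({
    "multi_turn",
    "vague_input",
    "edge_case",
    "rephrasing",
    "multi_source",
    "multilingual",  # drop this if your customer base does not require it
})


def coverage_gaps(scenarios, required=REQUIRED_CATEGORIES, min_per_category=3):
    """Return {category: current_count} for every under-covered category."""
    counts = Counter(s["category"] for s in scenarios)
    return {
        category: counts[category]
        for category in sorted(required)
        if counts[category] < min_per_category
    }
```

An empty result means the suite meets the minimum mix; it says nothing yet about how the Agent performs on it.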

What does it feel like to interact with the Agent?

Two AI Agents can post the same quantitative scores (resolution rate, containment rate, and more) and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
- Is the tone natural and on-brand, or does it feel robotic and generic?
- Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
- When it doesn’t know the answer, does it handle that gracefully?
- When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and that you can keep improving over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.
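
As a minimal sketch of the tagging step, reviewed conversations can be rolled up into a ranked list of failure modes so the team knows which targeted eval to write first. The tag names and the `(conversation_id, tags)` shape are assumptions for illustration, not an Agent Analytics export format.

```python
from collections import Counter


def top_failure_modes(reviewed, limit=3):
    """Rank failure tags by how many reviewed conversations carry them.

    `reviewed` is an iterable of (conversation_id, tags) pairs. Each tag is
    counted at most once per conversation so one messy thread cannot dominate.
    """
    counts = Counter(
        tag
        for _conversation_id, tags in reviewed
        for tag in dict.fromkeys(tags)  # de-duplicate while keeping order
    )
    return counts.most_common(limit)
```

Rerunning the same rollup after each fix shows whether a failure mode is actually shrinking or just being relabeled.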

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
- How does customer feedback influence the product roadmap, and can they show you examples?
- If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
- What kind of support will you get post-launch?
- Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.

Inspired by this post on The Intercom Blog.
May 22, 2026
