Tag: A/B testing

  • AI Evals for Product Managers: How I Measure Agent Quality—A Beginner’s Playbook

    AI Evals for Product Managers: How I Measure Agent Quality—A Beginner’s Playbook

    I’ve led multiple AI agent launches, and the single most reliable way I’ve found to ship with confidence is to treat evaluations as a product capability, not a side project. When we make AI quality measurable, predictable, and comparable over time, we move faster, reduce risk, and build trust with customers and stakeholders.

    Learn how product managers use AI evaluations to measure agent quality. Covers traces, LLM judges, offline evals, online evals, and how to connect evals to product outcomes.

    Why does this matter so much in product management? Because agent quality is only meaningful when it drives adoption, satisfaction, and revenue. I use eval-driven development to align the day-to-day iteration of prompts, policies, and workflows with business outcomes like activation, retention, and Net Recurring Revenue (NRR). That alignment turns AI quality from an abstract notion into a roadmap lever.

    First, traces. Traces are the spine of evaluation for agentic AI: they capture inputs, intermediate steps, tools invoked, and final responses. I instrument traces to make reasoning visible—what the agent tried, where it hesitated, and why it chose a path. With that visibility, I can compare prompts, policies, and tools, and I can teach the team to fix the root cause instead of patching symptoms. This is also where Agent Analytics becomes real: we move from anecdotes to observable behavior trends across cohorts and use cases.

    Next, LLM judges. I use model-as-judge to score qualities like helpfulness, coherence, or adherence to brand and policy. The trick is calibration. I pair LLM judges with a small, high-quality human-labeled set to ground the scale, then monitor drift as models, prompts, or data shift. LLM judges help me evaluate at speed, but I still spot-check edge cases and highly regulated flows to balance efficiency with risk controls.

    Offline evals come first. Before I expose users to changes, I run fixed test suites representing core scenarios, failure modes, and edge cases. I include golden examples, adversarial prompts, and domain-specific queries. Metrics cover task success, factuality, safety, latency, and cost. This is where prompt engineering and retrieval quality are tuned; if I’m using a retrieval-first pipeline, I evaluate evidence quality separately from generation so improvements are attributable and reproducible.

    Online evals follow to validate real-world performance. I roll changes out behind feature flags and use A/B testing to compare variants under production conditions. I track conversation outcomes, tool success rates, fallbacks to human support, and user satisfaction. These online signals close the loop on whether an offline improvement actually compounds value in the product—critical for product-led growth.

    Connecting evals to product outcomes is non-negotiable. I map quality signals to a driver tree: from per-turn scores (helpfulness, safety, latency) up to session-level outcomes (task completion, deflection, revenue intent), and finally to product KPIs (activation, retention, NRR). With this structure, I can set thresholds for launch gates, prioritize roadmap items that move the biggest levers, and build dashboards that leadership understands at a glance.

    A few lessons learned. Start with a minimal but durable test set and grow it as you discover new failure modes. Version everything—prompts, tools, and datasets—so you can reproduce wins. Beware metric drift when you swap models or update prompts. Blend human review where the cost of error is high. Above all, make evaluations part of your AI workflows and sprint rituals so quality improves continuously, not sporadically.

    If you’re just getting started, begin with traces and a small offline suite, add LLM judges for scale, then prove impact with a focused online experiment. Within a few cycles, you’ll have a living evaluation system that guides decisions, accelerates delivery, and gives your team—and your customers—confidence in every AI release.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Director of Product, Growth & AI at Amplitude: My Playbook for Viral Growth and Engagement

    Director of Product, Growth & AI at Amplitude: My Playbook for Viral Growth and Engagement

    I see the Director of Product, Growth & AI at Amplitude as a mandate to operationalize "viral and core growth strategies, user acquisition, and product engagement" with precision. From my vantage point, that means building a rigorous, metrics-first operating system grounded in Amplitude analytics and product-led growth principles, then layering in an AI Strategy that personalizes experiences without sacrificing control or safety.

    I start by defining a clear North Star Metric and mapping a driver tree to expose causal levers across acquisition, activation, engagement, retention, and monetization. With behavioral analytics and cohort analysis, I quantify which user behaviors correlate with long-term value. I operationalize rapid experimentation through A/B testing with sensible minimum detectable effect (MDE) thresholds, guardrail metrics, and sequential testing to ensure we move fast while preserving measurement integrity.

    For "viral and core growth strategies," I lean on durable growth loops more than one-off hacks. Viral loops might include collaboration invites, user-generated content, and shareable artifacts that make the product more valuable as it spreads. Core growth centers on frictionless activation: guided onboarding, in-app guides, product tours, progressive disclosure, and judicious tooltip design that connects users to the ‘aha’ moment quickly. Session replay and funnel instrumentation help isolate friction and systematically remove it.

    On user acquisition, I connect performance channels and go-to-market strategy tightly to in-product activation. Rather than optimizing for clicks, I optimize for post-signup behaviors that predict retention. This includes improving landing page-message-product congruence, refining qualification (so top-of-funnel aligns with downstream value), and orchestrating lifecycle messaging that nudges users toward key activation milestones.

    To deepen product engagement, I focus on leading indicators of retention and feature adoption. I segment by jobs-to-be-done and intent, then personalize in-app prompts to surface the right capability at the right moment. Retention analysis, pathing, and funnel breakouts inform which nudges to deploy and where—whether that’s smarter checklists, contextual education, or lightweight in-product interventions that turn sporadic usage into reliable habits.

    AI raises the ceiling on what’s possible here. With a thoughtful AI Strategy, I use gen ai to personalize onboarding flows, recommend next-best actions based on behavioral signals, and summarize complex activity patterns into actionable insights for the team. I maintain strict measurement: every AI intervention ships behind feature flags, is evaluated through controlled experiments, and adheres to privacy-by-design principles. The outcome is a system that learns continuously while staying aligned to business and user outcomes.

    Execution is where strategy becomes real. I rely on empowered product trios, continuous discovery with customers, and outcome-focused roadmaps that tie directly to the driver tree. This keeps the organization moving in sync: engineering prioritizes the highest-signal experiments, design accelerates comprehension and task success, and product ensures each release strengthens the core loop rather than adding ornamental features.

    Ultimately, the blueprint is simple and disciplined: anchor on "viral and core growth strategies, user acquisition, and product engagement," quantify what matters with behavioral analytics, and iterate through well-instrumented experiments. Combine that with targeted AI augmentation, and you create a compounding growth engine that is both measurable and resilient.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • AI Broke Your A/B Tests: 3 Proven Shifts to Rebuild a Resilient Experimentation Program

    AI Broke Your A/B Tests: 3 Proven Shifts to Rebuild a Resilient Experimentation Program

    I’ve watched a once-reliable A/B testing playbook buckle under the weight of generative AI. Traffic patterns aren’t stable, LLMs update behind the scenes, prompts evolve weekly, and personalization reshapes cohorts mid-flight. The result is non-stationary data, diluted statistical power, and “wins” that don’t replicate in production. If your experimentation program feels slower, noisier, and less trustworthy, you’re not imagining it—and you’re not alone.

    Learn why running more tests isn’t the answer to AI, and the three ways mature teams are shifting their experimentation programs.

    First, I’ve shifted from test volume to an evaluation stack—what I call eval-driven development. Instead of defaulting to production A/B tests, we front-load learning with offline evaluations (golden sets, synthetic scenarios), automated regressions on prompts and policies, and pre-production canaries. We size experiments with a clear minimum detectable effect (MDE), use sequential or Bayesian methods to handle drift, and reserve full A/B runs for hypotheses with sufficient power and operational readiness. This layered approach accelerates decisions, reduces traffic waste, and restores trust in effect sizes.

    Second, I’ve re-anchored our metrics and governance for AI-era reliability. We define a driver tree that links value creation to guardrail metrics such as latency, hallucination rate, cost per request, safety incidents, and user trust proxies. Persistent holdouts and long-lived control cohorts protect against platform-wide regressions, while anomaly detection highlights model or data shifts before they corrupt reads. Strong instrumentation—behavioral analytics, consistent event semantics, and product telemetry wired into Amplitude analytics—keeps our feedback loop tight and auditable.

    Third, we rebuilt rollout mechanics to make delivery experimentation-native. Feature flags, progressive delivery, and targeted canaries let us test safely in production while gating exposure by segment, risk, or policy. Shadow mode and offline replay provide signal before real users see risk. Multi-armed bandits help with exploration when goals are clear and guardrails are enforced, but we resist over-rotating to bandits when measurement is fragile. Tightly integrating experiments into CI/CD and observability shortens the cycle from hypothesis to validated outcome.

    In practice, here’s how I operationalize this shift. In 30 days, I audit the backlog, kill or consolidate tests that can’t meet MDE, and establish a minimal evaluation harness for prompts, policies, and safety checks. By 60 days, guardrail metrics are live with persistent holdouts and feature flags across AI surfaces. By 90 days, the team runs a balanced portfolio: offline evals for fast iteration, canaries for risk, and selective A/B testing for strategic bets—supported by continuous discovery to keep hypotheses grounded in real customer needs.

    AI didn’t eliminate the need for experimentation; it raised the bar for rigor. By moving from volume to validity, from vanity lifts to guardrailed outcomes, and from monolithic launches to progressive delivery, I’ve seen experimentation regain its edge—fewer false positives, faster cycles, and clearer signal on what truly drives impact. That’s how we turn a brittle testing culture into a resilient, learning system built for LLMs and beyond.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • How I Build High-Impact Experimentation Programs with Amplitude: Proven Practices at Scale

    How I Build High-Impact Experimentation Programs with Amplitude: Proven Practices at Scale

    I build experimentation programs to drive measurable outcomes, not just dashboards. In my product leadership work, I’ve seen how the right operating model turns experimentation into a reliable growth engine—especially when paired with the analytical depth of Amplitude. My goal is to help teams move from ad-hoc tests to a disciplined system that compounds learning and impact.

    Rigor starts with clarity. I translate strategic goals into testable hypotheses using driver trees, then structure A/B testing with a defined minimum detectable effect (MDE), guardrail metrics, and pre-registered decision criteria. This reduces p-hacking, shortens debate cycles, and makes outcomes auditable. I’m equally deliberate about risk: we monitor sample ratio mismatch, use feature flags for safe rollouts, and align on outcomes vs output OKRs so we celebrate business impact, not vanity wins.

    Amplitude analytics is my backbone for behavioral analytics at every step. I instrument clean event taxonomies, build funnels and cohorts to track user activation and retention analysis, and centralize experiment readouts in a unified analytics platform. This lets product trios quickly see how treatments shift behavior, where friction hides, and which moments matter most for product-led growth. The result is a trusted, shared source of truth that accelerates continuous discovery.

    At enterprise scale, governance matters as much as math. I often point to lessons inspired by Peacock’s experimentation program: standard naming conventions, centralized QA, CI/CD integration, and an active community of practice. Those practices keep velocity high without sacrificing validity, and they make wins repeatable across teams and surfaces.

    Operationally, I anchor the program in clear roles (data, engineering, design, product), templates for hypotheses and readouts, and a tight feedback loop from deploy to decision. With Amplitude, solutions engineering partnerships, and disciplined experiment hygiene, teams learn faster, ship safer, and build products customers love. That’s how experimentation becomes a strategic capability—not a side project.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Building AI-Era GTM and Analytics That Make Tough Calls Simple: A Product Leader’s Playbook

    Building AI-Era GTM and Analytics That Make Tough Calls Simple: A Product Leader’s Playbook

    I build "GTM and analytics products for the AI era—tools that make hard calls simple." That guiding principle shapes how I design systems, prioritize roadmaps, and lead teams: we earn speed by engineering clarity. My north star is straightforward—turn noisy signals into trusted insights that move the business, without adding friction for customers or chaos for teams.

    In practice, this starts with behavioral analytics. Whether you're using Amplitude analytics or a homegrown stack, the goal is the same: a unified analytics platform that captures clean events, enforces a clear taxonomy, and maps behaviors to outcomes. I focus on journey mapping, activation and retention analysis, and honest attribution so that every GTM motion ladders to real product usage, not vanity metrics.

    Decisions should be testable and reversible. I operationalize experimentation with A/B testing, feature flags, and guardrailed rollouts. Minimum detectable effect, power analyses, and anomaly detection aren’t academic exercises; they’re the foundation for credible learnings. When a result is unclear, we tighten hypotheses, shrink blast radius, and iterate quickly—biasing for learning while protecting the customer experience.

    AI changes the surface area of product work, but it doesn’t change the discipline. I treat LLMs for product managers as a capability, not a shortcut: eval-driven development, clear success criteria, and human-in-the-loop feedback remain non-negotiable. Privacy-by-design and data governance shape what we build; responsible prompts, retrieval strategies, and safety checks shape how it behaves in the wild. When the model is uncertain, the product should be honest about it—and offer a graceful fallback.

    Great GTM is a system, not a launch day. I connect product strategy to go-to-market strategy through product-led growth loops: in-app guides that meet users where they are, onboarding that accelerates time-to-value, and signals that identify true qualified intent. Driver trees tie adoption to monetization so that marketing, sales, and success work from the same picture—making trade-offs visible and reversible.

    Execution is where clarity compounds. Continuous discovery with product trios keeps problems crisp and solutions grounded in user truth. Product roadmapping and sprint planning follow outcome-first principles: fewer projects, clearer intents, stronger accountability. When teams can trace every backlog item to a metric that matters, they move faster with less oversight—and deliver results that stand up to scrutiny.

    When we do all of this well, decisions feel simple because the work behind them is rigorous. That’s the promise of modern GTM and analytics in the AI era: no theatrics, just dependable systems that turn possibilities into predictable progress.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Inside Growth Engineering at Amplitude: My Playbook to Accelerate Product-Led Growth with Analytics

    Inside Growth Engineering at Amplitude: My Playbook to Accelerate Product-Led Growth with Analytics

    I’m often asked how leading growth teams turn insights into compounding business results. Few organizations illustrate this better than the Growth Engineering team at Amplitude. Drawing from their example and my own experience, I’ve distilled a practical playbook that any product organization can use to move faster, learn smarter, and scale impact.

    At the core is a disciplined blend of behavioral analytics and rapid experimentation. Amplitude analytics, as part of a unified analytics platform, enables precise event instrumentation, cohorting, and funnel analysis that surface where activation and retention truly break down. When I combine those signals with qualitative insights, I can prioritize fewer, higher-leverage bets that directly improve user activation and long-term retention.

    My growth loop always starts with clearly stated hypotheses, success metrics, and A/B testing power considerations, including a defined minimum detectable effect (MDE). I pair feature flags with staged rollouts to de-risk changes and accelerate iteration without compromising stability. This cadence turns every release into a learning opportunity, compounding knowledge across teams and time.

    Cross-functional execution is non-negotiable. I rely on tight “product trios” collaboration—product, engineering, and design—so we can ship small, measurable changes quickly, observe outcomes, and then widen scope with confidence. The Growth Engineering mindset keeps us grounded in real user behavior, not assumptions, and ensures our roadmap is fueled by evidence rather than opinion.

    Consider onboarding. Instead of a single redesign, I prefer a series of targeted experiments—tweaking progressive disclosure, refining tooltip design, and adding in-app guides where users predictably stall. Each test is instrumented end to end, from first action to activation event, and validated via retention analysis to confirm that short-term lifts turn into durable habit formation.

    When prioritizing, I map ideas to driver trees tied to our North Star metric. Behavioral analytics tell me which levers—time-to-value, depth-of-use, or frequency—will yield the biggest gain. That clarity focuses engineering effort on interventions that actually shift outcomes, not just outputs.

    If you’re building your own Growth Engineering capability, start with three moves: instrument ruthlessly so you can trust your signals, adopt feature flags to speed safe experimentation, and hold teams accountable to measurable, user-centric outcomes. Do this consistently and you’ll feel the compounding effect—faster learning cycles, stronger product-market fit signals, and a durable engine for product-led growth.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Prompt Like a Pro: Three Battle-Tested Tips for Amplitude Global Agent Success

    Prompt Like a Pro: Three Battle-Tested Tips for Amplitude Global Agent Success

    When I guide teams building agentic AI features, I’ve seen a single prompt turn Amplitude Global Agent into either a world-class analyst or a well-meaning rambler. The difference isn’t magic—it’s method. With the right structure and iteration, we consistently get faster, clearer insights that stand up to product and analytics scrutiny.

    AI has gotten really good, but success still depends on the quality of your prompts. Explore three best practices for prompting in Amplitude Global Agent.

    Tip 1 — Define the role, goal, and guardrails. I begin every prompt by stating the agent’s role (for example: “You are a product analyst”), the business objective (“identify activation drop-offs by cohort”), and the boundaries (“use only Amplitude analytics events and properties provided; return JSON with metric, segment, timeframe”). This simple pattern reduces ambiguity, improves context window management, and yields outputs I can compare across runs.

    Tip 2 — Ground the model with concrete context and examples. Agent outputs improve dramatically when I supply the exact data it should reference: event names, properties, segments, filters, and timeframes. I often include a short example—one ideal question and one ideal answer—to anchor tone, structure, and depth. Think retrieval-first pipeline: feed the agent authoritative snippets (definitions, dashboards, prior queries) rather than hoping it guesses. That’s how I cut hallucinations and make results reproducible for LLMs for product managers.

    Tip 3 — Iterate with measurement, not vibes. I version prompts, A/B test variants, and log inputs/outputs so I can score quality with lightweight evals (accuracy against known answers, clarity, and actionability). Over time, a small library of “winning” prompts emerges for common AI workflows—activation analysis, retention cohorts, anomaly detection—so the team can move from tinkering to repeatable performance. This is where Agent Analytics practices pay off: we inspect outcomes, not just outputs.

    A practical starter structure I use: Role and Audience; Objective and Success Criteria; Data Context (events, properties, segments, timeframe); Constraints (sources, methods, privacy); Output Format (tables/JSON, fields, length); Examples (one good Q/A); and Fallbacks (what to do when data is insufficient). Even written as plain language, that scaffold reliably steers Amplitude Global Agent to precise, defensible answers.

    The emotional arc here is familiar: when the agent nails a complex funnel question in one pass, the team gets that “oh wow” moment; when it meanders, morale dips. Clear prompting turns those spikes of delight into a steady cadence of wins—less rework, faster learning loops, and cleaner handoffs from discovery to delivery. In short, invest in prompt engineering once, and you compound gains across every analysis session.

    If you’re just getting started, pick one critical question (for example, activation or retention), apply the three tips above, and commit to two to three prompt iterations with scoring. Within a single sprint, you’ll have a robust template you can reuse and adapt—helping Amplitude Global Agent deliver trustworthy insights at the speed your product strategy demands.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Stop Losing Users: How a Second Message and Prompt Audit Drive 2–3x Retention

    Stop Losing Users: How a Second Message and Prompt Audit Drive 2–3x Retention

    Default prompts are quietly sabotaging agent retention. I learned this the hard way while reviewing early funnels for our voice and chat agents—engagement looked great at the greeting, but the moment the agent stopped after a single reply, the conversation flatlined. The fix wasn’t a fancy LLM trick; it was a disciplined second message and a rigorous audit of defaults across every entry point.

    When an AI agent opens with a generic, low-friction greeting and then waits, users hesitate. Cognitive load rises, intent stays fuzzy, and drop-off follows. A thoughtful second message—delivered quickly, with clarity and options—reduces ambiguity and gives people a low-effort path to progress. It’s a small behavioral nudge that pays off in outsized retention gains.

    Here’s the pattern that consistently works for me. First, keep the initial default prompt short, confident, and specific to the channel and task domain. Then ship a fast follow-up if the user hesitates for a few seconds. That second message should clarify what the agent can do, present 2–3 concrete choices, and invite free-form input. I’ve repeatedly seen this simple sequence unlock a 2–3x retention lift in early sessions, especially for first-time users.

    Auditing default prompts is where the leverage lives. I inventory every ingress—web widget, IVR, SMS, in-app, help center—and catalogue the exact default system, developer, and user-facing prompts. Then I inspect turn-1 and turn-2 transcripts in Agent Analytics to quantify where users stall: time-to-first-intent, clarification rate, option selection rate, and completion. This makes the drop-off visible and turns “vibes” into data we can A/B test.

    Designing the second message is a conversation design exercise, not a copy tweak. My recipe: empathize with the user’s likely uncertainty, constrain scope so the agent appears capable, and apply choice architecture. For voice AI agents, I keep it shorter, use confirmation questions, and bias toward read-back for accuracy. For chat, I include tappable options and examples that mirror top intents. The goal is momentum without feeling pushy.

    Operationally, I run controlled A/B tests on default and second-message variants, sized to a realistic minimum detectable effect. I segment by source (ad, organic, support), device, and use case, because the winning prompt for sales qualification rarely matches the one for customer support. With proper instrumentation in our analytics stack, we track retention curves over the first 3–5 sessions, not just single-session reply rates, to avoid optimizing for chatter over outcomes.

    Strong prompt engineering underpins the experience. I keep system prompts stable and explicit about persona, tone, and refusal behavior; manage the context window so examples don’t drown live intent; and use a retrieval-first pipeline when domain knowledge matters. The most expensive mistake I see is shipping defaults like “How can I help you?” without guardrails or examples—great for demos, bad for real users.

    If you’re starting fresh, begin with a prompt audit this week: list all defaults, map them to top intents, and pair each with a channel-appropriate second message. Instrument the funnel, launch two variants, and set a crisp success metric (e.g., turn-2 continuation rate to task start, then task completion). This is one of those rare changes that is simple to ship and compounds across onboarding, activation, and long-term retention.

    The takeaway is straightforward: don’t let your best work stall after the first reply. A disciplined second message and a focused default prompt audit will lift engagement, reduce ambiguity, and create the kind of early momentum that sustains retention over time.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

    Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

    I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.

    Learn how to track your web vitals and page rankings faster with Amplitude AI Agents and improve your site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.

    Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.

    My practical playbook is straightforward: 1) Establish a performance baseline for Core Web Vitals on key templates and critical user paths. 2) Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction. 3) Connect those vitals to downstream behaviors—scroll depth, engagement, and conversion—so we prioritize fixes that move business outcomes, not just lab scores. 4) Use feature flags and A/B testing to ship improvements safely and quantify uplift. 5) Close the loop with Agent Analytics to keep learnings visible and actionable.

    Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading UX for speed—or vice versa.

    The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.

    If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Proven 3-Step Playbook to Quantify AI Agent ROI: Boost Revenue, Cut Costs, Reduce Risk

    AI agents are only as valuable as the measurable outcomes they deliver. In my role leading product strategy at HighLevel, I’ve learned that the fastest way to earn executive trust is to translate agent performance into clear revenue impact, cost savings, and risk reduction. The challenge isn’t enthusiasm for AI; it’s creating a disciplined, repeatable way to prove business value.

    Here’s the three-step playbook my teams and I use to quantify the value of agentic AI, align stakeholders, and scale what works.

    Step 1 — Define value outcomes and success criteria. Start with a driver tree that ties agent outcomes to company-level goals. For revenue, target conversion lift, average order value, and expansion (e.g., trial-to-paid, self-serve upsell). For cost, focus on containment/deflection rate, reduced handle time, and lower cost to serve. For risk, measure error rates, hallucinations, security/policy violations, and customer complaint rate. Convert these into outcomes vs output OKRs, set baselines, and pre-commit to thresholds for launch, scale, or rollback. This ensures the team is accountable to business KPIs, not vanity metrics.

    Step 2 — Instrument comprehensively and establish baselines. Instrument the full journey: prompts, responses, human-in-the-loop events, escalations, feedback, and downstream conversions. Capture both leading indicators (time-to-first-value, containment rate, self-serve completion) and lagging outcomes (NRR, churn, LTV/CAC). Use behavioral analytics, session replay, product tours, and in-app guides to contextualize what users do before and after agent interactions. Baselines matter—freeze a control period so improvements are truly incremental.

    Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.

    Step 3 — Experiment, attribute, and risk-adjust. Treat every agent capability like a hypothesis. Run A/B tests or holdouts with a precomputed minimum detectable effect so you can ship confidently. Attribute outcomes to the agent by linking events to conversions and support deflection, and calculate ROI as (incremental revenue + cost avoided – total operating cost, including model/API, labeling, and oversight). Apply AI risk management by tracking false positives/negatives, escalation rate, and policy breaches; adjust ROI with a risk score so the “cheapest” agent isn’t inadvertently the riskiest. This is eval-driven development in practice: define success, measure, iterate.

    Operationalizing the playbook requires crisp reporting. Stand up Agent Analytics dashboards in your unified analytics platform that roll up per-agent KPIs, funnel performance, cohort trends, and experiment results. Review them in QBRs and with frontline teams to connect numbers to lived customer experience. When metrics improve, amplify with product-led growth motions—targeted in-app guides and lifecycle nudges to get more users into high-value agent flows.

    What does this look like in the real world? Early on, we celebrated “tickets deflected” and missed that some conversations quietly increased churn risk. After we adopted this three-step approach, we saw the full picture: a modest dip in deflection quality was offset by a larger lift in expansion revenue and a meaningful drop in time-to-resolution. The risk-adjusted ROI was unambiguous, and the CFO greenlit broader rollout.

    If you’re building or scaling AI agents, anchor on outcomes, instrument ruthlessly, and insist on experimentation. With the right measurement discipline, you’ll know exactly which agents deserve more investment, which need redesign, and which should be retired. The result is a portfolio of agents that reliably drive adoption, engagement, and durable business value.


    Inspired by this post on Pendo – Best Practices.


    Book a consult png image
  • Amplitude Heatmaps Rebuilt: Rock-Solid Screenshots, Precise Placement, Smarter Scrollmaps

    Amplitude Heatmaps Rebuilt: Rock-Solid Screenshots, Precise Placement, Smarter Scrollmaps

    When a platform as foundational as Amplitude refreshes a core feature, I pay close attention. Heatmaps are where qualitative intuition meets quantitative scale, and reliability and precision determine whether teams trust what they see. The latest update meaningfully raises the bar for product analytics teams who depend on crisp visual evidence to guide experiments, diagnose friction, and accelerate product-led growth.

    Here’s the essence of the change, in Amplitude’s own terms: “more reliable screenshot capture, selector-based placement, automatic device detection, and a redesigned scrollmap.” That combination tackles the two biggest historical pain points with heatmaps—stability in dynamic interfaces and confidence that clicks are attributed to the right UI elements across devices and layouts.

    First, more reliable screenshot capture improves the fidelity of what I’m analyzing. When screenshots consistently mirror the live UI state, I can compare sessions across releases without worrying about rendering quirks or timing artifacts. That boosts trust in behavioral analytics, shortens feedback loops with engineering, and makes heatmaps a dependable companion to A/B testing and session replay.

    Second, selector-based placement is a pragmatic step toward precision. In modern, componentized front ends where elements shift with personalization, localization, or responsive design, stable selectors dramatically reduce misattributed interactions. In practice, this means cleaner insights for funnel drop-off analysis, clearer readouts for micro-conversions (e.g., CTA vs. secondary actions), and more confident iteration on UX copy and layout—without constant re-instrumentation.

    Third, automatic device detection aligns insights with the actual context of use. Patterns on mobile often diverge from desktop, and blending them can mask critical signals. Accurate device-specific readouts help me tailor experiments, refine activation paths, and decide when to prioritize mobile-first optimizations versus desktop refinements.

    Finally, the redesigned scrollmap matters because attention is a finite resource. Knowing how far users scroll—and where they pause—helps me position value propositions, trust elements, and calls to action where they’ll be seen. Combining scroll insights with session replay and event data gives me a sharper picture of what’s above the fold, what’s ignored, and where copy or layout needs a rethink.

    How I’d operationalize this update: validate key selectors with engineering and design for critical templates; compare pre- and post-update heatmaps to establish new baselines; segment by device to isolate diverging behaviors; map scroll depth to conversion micro-moments; and feed prioritized findings into backlog grooming and product roadmapping. This keeps heatmaps directly connected to outcomes rather than just interesting visuals.

    Bottom line: these improvements make heatmaps a more trustworthy lens for discovery and optimization. With sturdier screenshot capture, precise selector-based placement, automatic device detection, and a redesigned scrollmap, I can move faster from observation to decision—reducing analysis ambiguity, tightening experiment cycles, and turning behavioral analytics into measurable product strategy.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

    No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

    Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

    Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

    In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

    We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.

    To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.

    To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

    Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

    Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

    The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

    If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image