I often look to Amplitude and its core analytics product when I’m coaching teams and refining our own product strategy. The discipline required to turn raw event streams into actionable behavioral analytics mirrors what I expect from empowered product teams: precise instrumentation, clear decision points, and a relentless focus on outcomes.
Some of the most effective product managers I meet began their careers in the ed-tech and recruiting space. That early-stage, resource-constrained environment cultivates sharp prioritization instincts and a comfort with ambiguity—muscles that translate directly into building scalable analytics capabilities without losing speed or customer empathy.
In my practice, I anchor discovery and roadmap decisions in driver trees that connect north-star outcomes to measurable input metrics. That structure keeps product trios aligned on the questions that matter: What behaviors predict retention? Where does user activation stall? Which experiments will meaningfully shift our core metrics? Paired with continuous discovery, this approach ensures we ship learnings—not just features.
Tactically, I encourage teams to combine Amplitude analytics with a unified analytics platform mindset: centralize event taxonomy, standardize cohort definitions, and operationalize retention analysis alongside acquisition and activation. When we treat analytics as a product, not a tool, we unlock faster iteration loops, smarter A/B testing, and clearer trade-offs between depth and breadth in our product surface area.
Product-led growth hinges on narratives supported by evidence. I’ve found that clear opportunities emerge when we map journeys, quantify friction with session replay and funnels, and then validate solution ideas through small, reversible bets. This is where outcome-based roadmapping shines: we commit to moving a metric, not to a specific feature, and we let the data guide sequencing.
At the leadership level, I focus on execution readiness: crisp problem statements, decision logs, and CI/CD practices that reduce batch size and increase deployment frequency. The goal isn’t shipping more; it’s compounding learning. When teams internalize this mindset, analytics stops being a dashboard and becomes a competitive advantage.
Inspired by this post on Amplitude – Perspectives.
I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.
But AI has expanded what’s just now possible—for our products, and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.
I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.
A clear, color‑coded map from desired outcome to opportunities, solutions, and assumption tests—showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.
Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.
Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.
Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.
Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.
To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.
A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.
That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.
I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.
Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.
On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.
In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.
Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.
At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.
They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.
From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.
Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.
In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.
Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.
When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.
Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
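The worked example above can be replayed in a few lines. This is a minimal sketch in Python (the prototype described in this post was in Node.js); the dictionary shape and the `split`, `merge`, and `naive_diff` helpers are assumptions made for illustration, not the service's actual data model.

```python
from copy import deepcopy

# Hypothetical node shape: each opportunity keeps its interview evidence
# ("sources") and its sub-opportunities ("children").
original = {
    "A": {"sources": {1, 2, 3}, "children": {"x", "y", "z"}},
    "C": {"sources": {4, 5}, "children": {"t", "u", "v"}},
}


def split(tree, node, new_node, moved_sources, moved_children):
    """Move the given sources and children from `node` into a new node."""
    tree[node]["sources"] -= moved_sources
    tree[node]["children"] -= moved_children
    tree[new_node] = {"sources": set(moved_sources), "children": set(moved_children)}


def merge(tree, source, target):
    """Fold `source` into `target`, keeping every source and child."""
    removed = tree.pop(source)
    tree[target]["sources"] |= removed["sources"]
    tree[target]["children"] |= removed["children"]


def naive_diff(before, after):
    """Compare only the start and end states, node by node."""
    empty = {"sources": set(), "children": set()}
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name, empty), after.get(name, empty)
        changes[name] = {
            "lost": (old["sources"] - new["sources"]) | (old["children"] - new["children"]),
            "gained": (new["sources"] - old["sources"]) | (new["children"] - old["children"]),
        }
    return changes


final = deepcopy(original)
split(final, "A", "B", {3}, {"y", "z"})  # A keeps 1, 2, x; B takes 3, y, z
merge(final, "B", "C")                   # C absorbs everything B held
diff = naive_diff(original, final)
```

In `diff`, B never appears: A simply loses source 3 and children y and z, and C gains them. The intermediate split and merge are only recoverable if the steps themselves are recorded.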
See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.
That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.
I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.
Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.
For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set didn’t actually generate the recommendation.
Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
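The parent-child merge rule can be expressed as a small check. This is a sketch under assumptions: the `parent_of` mapping and the `orphans_after_merge` name are hypothetical, and it only covers the parent-child case described above, not every move in the action set.

```python
# Hypothetical tree: child -> parent (None marks a node with no parent here).
parent_of = {"A": None, "B": "A", "C": "A", "D": "A"}


def orphans_after_merge(parent_of, keep, delete):
    """Return the nodes that would be stranded if `delete` were merged into
    `keep` without re-parenting. An empty list means the merge is safe."""
    if parent_of.get(delete) == keep:
        # Child merged into its parent: the parent survives, so the
        # remaining children keep their parent.
        return []
    if parent_of.get(keep) == delete:
        # Parent merged into one of its children: every other child of the
        # deleted parent loses its parent.
        return sorted(n for n, p in parent_of.items() if p == delete and n != keep)
    # Nodes that are not parent and child are outside this sketch.
    return []


safe = orphans_after_merge(parent_of, keep="A", delete="B")
unsafe = orphans_after_merge(parent_of, keep="B", delete="A")
```

Here `safe` is empty, while `unsafe` lists C and D, which is the signal a validator could use to reject the move or ask for explicit re-parenting.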
Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.
Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, and assuming failures are roughly independent, that meant somewhere between 22% and 40% of all runs still failed. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
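A quick way to sanity-check how a per-call discrepancy rate compounds across a run is shown below. It is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes each call fails independently, which the post does not state.

```python
def run_failure_rate(per_call_rate, calls_per_run):
    """Probability that at least one call in a run has a discrepancy,
    assuming calls fail independently (an assumption, not a measurement)."""
    return 1 - (1 - per_call_rate) ** calls_per_run


low = run_failure_rate(1 / 40, 10)   # shortest runs: 10 LLM calls
high = run_failure_rate(1 / 40, 20)  # longest runs: 20 LLM calls
print(f"{low:.0%} to {high:.0%} of runs hit at least one discrepancy")
```

The point of the exercise is that a per-call rate that looks small still leaves a large share of multi-call runs broken.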
Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
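To make the detection side concrete, here is a minimal sketch of one such verifier: it replays a change set and flags any step that touches a node that no longer exists. The change-set format (dicts with an `action` key) and the `find_conflicts` name are assumptions for illustration; the real contract with Vistaly is not shown in this post.

```python
def find_conflicts(existing_nodes, change_set):
    """Replay a change set against the set of live node ids and report
    every step that references a node that does not exist at that point."""
    live = set(existing_nodes)
    errors = []
    for step, change in enumerate(change_set, start=1):
        action = change["action"]
        if action == "add":
            needed, created, removed = [], [change["node"]], []
        elif action == "delete":
            needed, created, removed = [change["node"]], [], [change["node"]]
        elif action == "merge":  # fold `source` into `target`
            needed = [change["source"], change["target"]]
            created, removed = [], [change["source"]]
        elif action == "split":  # carve `new` out of `node`
            needed, created, removed = [change["node"]], [change["new"]], []
        else:
            errors.append(f"step {step}: unknown action {action!r}")
            continue
        missing = [n for n in needed if n not in live]
        if missing:
            errors.append(f"step {step}: {action} references missing node(s) {missing}")
            continue  # an invalid step is reported, not applied
        live.update(created)
        live.difference_update(removed)
    return errors


errors = find_conflicts(
    {"A", "B"},
    [
        {"action": "delete", "node": "A"},
        {"action": "merge", "source": "A", "target": "B"},
    ],
)
```

Note that the sketch only reports the conflict. It deliberately does not try to decide whether A should be removed or salvaged, which is exactly the intent-guessing problem described above.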
A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.
After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.
On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code. I knew exactly when a change set failed, and why. No amount of prompt tuning alone was fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.
The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses those instructions to repair the change set and calls the tool again. Iterate until success or a max number of turns.
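The shape of that loop can be sketched as follows. This is an illustration in Python rather than the Node.js prototype, and everything in it is hypothetical: `repair_loop`, the stub model, and the stub validator stand in for the real LLM call and validation tool.

```python
def repair_loop(propose, validate, max_turns=5):
    """Ask `propose` for a change set, validate it, and feed any errors back
    until validation passes or `max_turns` is reached."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    feedback = []
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)
        errors = validate(change_set)
        if not errors:
            return {"ok": True, "turns": turn, "change_set": change_set, "errors": []}
        feedback = errors  # the next proposal sees exactly what failed
    return {"ok": False, "turns": max_turns, "change_set": change_set, "errors": errors}


def stub_model(feedback):
    # Stand-in for an LLM call: the first draft reuses a deleted node,
    # and the repair drops the offending step once it sees feedback.
    if not feedback:
        return ["delete A", "merge A into B"]
    return ["merge A into B"]


def stub_validator(change_set):
    # Stand-in for the validation tool: flag any use of a deleted node.
    deleted, errors = set(), []
    for number, step in enumerate(change_set, start=1):
        action, node = step.split()[:2]
        if action == "delete":
            deleted.add(node)
        elif node in deleted:
            errors.append(f"step {number}: {node} was deleted in an earlier step")
    return errors


result = repair_loop(stub_model, stub_validator)
```

With these stubs the loop passes on the second turn. The `max_turns` cap matters as much as the validator: without it, a model that never converges would keep spending compute.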
I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.
I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.
Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.
Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.
Third, I learned when I can and can’t trust Claude to write code for me. Since Opus 4.6 came out, I had been giving Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I had simply trusted that Claude got it right when it hadn’t. If you don’t have the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.
Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.
I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.
When our cloud costs started outpacing growth, I knew we had to make a decisive call on “build vs buy.” Buying a FinOps platform would have been faster on paper, but it wouldn’t internalize our operational nuance. Building an agentic AI layer on top of our cost, telemetry, and product usage data promised not just dashboards—but compounding leverage. Less than a year later, our homegrown approach outperformed off‑the‑shelf alternatives on speed, precision, and organizational adoption.
The aspiration was clear from the outset: scale FinOps with AI agents by cutting manual work, accelerating insights, and turning a one-person function into a cost optimization engine. We set that as the bar for both outcomes and operating cadence, then translated it into a roadmap grounded in first principles.
Our build vs buy analysis hinged on three factors. First, cloud cost optimization is only as good as the context it carries; we needed deep hooks into our pricing, feature flags, and deployment frequency to reason about unit economics in real time. Second, we required agentic AI workflows that could detect anomalies, recommend actions, and close the loop—not just visualize waste. Third, governance mattered: privacy‑by‑design, data governance controls, and transparent decision logs were non‑negotiable under our AI Strategy and product management leadership standards.
We architected a retrieval‑first pipeline to blend billing exports, usage telemetry, and observability signals with product and GTM metadata. Agent workflows ran on top: one agent built driver trees that explained spend shifts by service, customer cohort, and environment; another specialized in anomaly detection with confidence scoring; a third agent proposed commitment strategy, rightsizing, and schedule adjustments. Each recommendation linked back to source data for auditability.
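As one illustration of what anomaly detection with confidence scoring can look like, the sketch below compares a day's spend with a trailing baseline using a z-score. This is a swapped-in, deliberately simple technique, not a description of the agents in this post; the function name and the sample figures are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def score_anomaly(history, today):
    """Compare today's spend with a trailing baseline and return the
    z-score plus a rough 0-1 confidence that the deviation is unusual."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        z = 0.0 if today == baseline else float("inf")
    else:
        z = (today - baseline) / spread
    # Two-sided: share of a normal distribution lying closer to the mean than z.
    confidence = 2 * NormalDist().cdf(abs(z)) - 1
    return {"z": z, "confidence": confidence}


daily_spend = [100, 102, 98, 101, 99]  # made-up trailing window
spike = score_anomaly(daily_spend, 130)
normal_day = score_anomaly(daily_spend, 101)
```

A real pipeline would also need to handle seasonality and trend, which a flat baseline like this ignores.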
From a delivery standpoint, we treated the system like a product, not a tool. A product trio (PM, engineering, and FinOps) ran continuous discovery interviews with stakeholders, instrumented eval‑driven development for agent prompts, and shipped improvements via CI/CD weekly. We optimized prompt engineering for decision clarity over verbosity and codified acceptance criteria: time‑to‑insight, actionability, and measurable savings per recommendation.
The impact was immediate and then compounding. Manual effort on month‑end analysis shrank as agents pre‑triaged drift and surfaced root causes with suggested remediations. Insights arrived continuously, not as end‑of‑month surprises, which meant engineering could fold changes into regular sprints. What started as a one‑person FinOps function evolved into a cost optimization engine embedded across teams—product, SRE, and finance—all speaking a shared language of drivers, tradeoffs, and outcomes.
Along the way, we learned where building truly beats buying. If your architecture, pricing model, and growth loops are unique—and they usually are in consumption SaaS—agentic AI amplifies institutional knowledge in a way generic platforms can’t. Conversely, if you lack clean tagging, clear ownership, or basic observability, investing there first will raise ROI on any approach, built or bought.
My advice if you’re at this crossroads: define success in terms of decisions changed, not reports shipped. Start with a thin slice—anomaly detection plus one high‑leverage remediation path—then iterate. Keep humans in the loop for executive sign‑off until your confidence intervals and post‑action telemetry prove reliability. With the right guardrails and focus, in‑house AI FinOps can move faster than the market and pay for itself well within a year.
Inspired by this post on Amplitude – Perspectives.
In my work with product, operations, and support leaders, I’m often asked to help make sense of Agent Analytics—what to track, how to attribute outcomes, and where to invest. After reviewing countless dashboards and running experiments across human agents and AI agents, I’ve learned that some of the most common measurement beliefs are precisely the ones that lead teams astray.
What comes up in conversation with leaders about Agent Analytics, and why not everything is what it seems.
Below, I unpack four pervasive myths I encounter and share the data-centered practices I use to replace them. My goal is simple: help you upgrade the way you measure performance so you can improve customer outcomes, accelerate learning, and scale impact with confidence.
Myth 1: “Lower average handle time (AHT) means higher performance.” AHT is useful but incomplete. When teams optimize solely for speed, they often push complexity into repeat contacts, reopens, or escalations. In the data, that shows up as a weak or negative relationship between lower AHT and durable outcomes like first contact resolution (FCR), customer effort, or revenue per conversation.
Reality and what I measure instead: I right-size speed by pairing AHT with intent-level resolution and recontact rate. For simple intents (password reset, billing address update), shorter is usually better. For complex intents (tiered troubleshooting, multi-step verification), “right-speeding” wins—slightly longer interactions that prevent rework. Practically, that means segmenting by intent complexity using behavioral analytics, tracking weighted “intent resolution rate,” and monitoring repeat-contact windows (24–168 hours) to catch downstream pain.
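A recontact rate per intent can be computed directly from contact logs. The sketch below assumes a hypothetical log of `(customer, intent, timestamp)` tuples and reads the 24–168 hour range as a span of window lengths to evaluate; both are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recontact_rate_by_intent(contacts, window_hours):
    """Share of contacts followed by another contact from the same customer
    on the same intent within `window_hours`."""
    stamps_by_key = defaultdict(list)
    for customer, intent, timestamp in contacts:
        stamps_by_key[(customer, intent)].append(timestamp)

    window = timedelta(hours=window_hours)
    totals, recontacted = defaultdict(int), defaultdict(int)
    for (_, intent), stamps in stamps_by_key.items():
        stamps.sort()
        for i, start in enumerate(stamps):
            totals[intent] += 1
            if any(timedelta(0) < later - start <= window for later in stamps[i + 1:]):
                recontacted[intent] += 1
    return {intent: recontacted[intent] / totals[intent] for intent in totals}


t0 = datetime(2024, 1, 1, 9, 0)
contacts = [
    ("c1", "password_reset", t0),
    ("c2", "password_reset", t0),
    ("c1", "troubleshooting", t0),
    ("c1", "troubleshooting", t0 + timedelta(hours=48)),
    ("c2", "troubleshooting", t0),
    ("c2", "troubleshooting", t0 + timedelta(hours=2)),
]
rates_24h = recontact_rate_by_intent(contacts, window_hours=24)
rates_168h = recontact_rate_by_intent(contacts, window_hours=168)
```

In this made-up data the troubleshooting intent looks twice as painful at the 168-hour window as at the 24-hour one, which is the kind of downstream rework a short window hides.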
Myth 2: “AI agent containment tells the whole story.” A high containment rate can mask failure modes such as unresolved intent, silent abandonment, or low-quality handoffs that frustrate customers and spike human workload later.
Reality and what I measure instead: I break containment into three parts for voice and chat flows: (1) intent resolution without escalation, (2) graceful handoff quality when escalation is necessary, and (3) post-handoff efficiency and satisfaction. For voice AI agent experiences, I also track escalation clarity (did the transcript summarize history and intent?), time-to-human, and customer satisfaction on the combined interaction. This provides a fuller view of how effective a customer support AI strategy is and avoids over-crediting automation for partial wins.
Myth 3: “Quality is subjective, so it can’t be measured at scale.” Teams often default to sporadic QA because they assume it can’t be standardized across channels or agent types. The result is noisy feedback loops and stalled coaching.
Reality and what I measure instead: Quality becomes measurable when it’s grounded in observable behaviors linked to outcomes. I use a rubric anchored in behavioral analytics (e.g., verified customer need, correct resolution path, policy compliance, empathy markers) and validate it via correlation with FCR, recontact, and retention analysis. To scale, I combine calibrated human reviews with AI-assisted scoring, check inter-rater reliability weekly, and use driver trees to connect quality levers to business results. This creates a consistent, coachable signal for both human agents and AI flows.
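One common way to check inter-rater reliability is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The post does not say which statistic is used, so treat this as an assumed choice; the labels below are made up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human_scores = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
ai_scores = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human_scores, ai_scores)
```

Tracking a number like this week over week shows whether human reviewers and AI-assisted scoring are drifting apart before the coaching signal degrades.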
Myth 4: “If the dashboard is green after launch, we’ve won.” Early wins can reflect novelty effects, cherry-picked routing, or short-term incentives that don’t persist. Declaring victory too soon locks in fragile gains and hides regressions across cohorts.
Reality and what I measure instead: I treat go-live as the start of learning. I use A/B testing with a clear minimum detectable effect (MDE), stagger ramps, and hold out stable control cohorts for at least one full demand cycle. I write OKRs around outcomes rather than outputs, focusing on intent resolution, customer effort, and revenue and customer health over vanity metrics. I also monitor seasonality and channel mix shifts inside a unified analytics platform to ensure improvements generalize beyond the first week.
How I operationalize this day to day: (1) define intents and complexity upfront, (2) unify journey data across channels, (3) instrument resolution and recontact rigorously, (4) apply driver trees to isolate what actually moves outcomes, and (5) iterate via disciplined experiments rather than sweeping changes. This approach aligns product and operations, speeds up coaching, and ensures AI investments compound rather than decay.
If you’re rethinking your Agent Analytics stack, start by replacing each myth with a sharper metric: pair AHT with intent-level resolution, pair containment with handoff quality and satisfaction, pair QA with outcome-linked rubrics, and pair green dashboards with robust experiments. The payoff is a measurement system that earns trust, guides better decisions, and consistently improves customer and business results.
I’ve learned that the fastest path to durable AI impact is a disciplined experimentation engine: one that moves quickly, reduces ambiguity, and earns trust with evidence. My goal isn’t just to ship models—it’s to ship measurable outcomes with repeatable rigor.
AI experimentation for product teams. Here’s how to test AI features, choose the right metrics, handle variability, and make data-driven decisions.
I start every AI initiative by framing a clear decision: what must be true for this feature to be worth building, and how will we know quickly? From there, I map driver trees that connect user value to measurable signals, so every test clarifies both impact and risk, not just accuracy.
Success criteria come next. I translate aspirations into testable thresholds, define leading and lagging indicators, and size tests with minimum detectable effect (MDE) so we don’t confuse noise for signal. This keeps us honest about sample sizes, power, and the real cost of waiting for certainty.
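To show how an MDE translates into a sample size, here is a sketch using the standard normal-approximation formula for comparing two proportions. The baseline rate, MDE, and default alpha and power values are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Users needed in each arm to detect an absolute lift of `mde_abs`
    over `baseline_rate` with a two-sided test (normal approximation)."""
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical scenario: 20% baseline conversion.
n_two_points = sample_size_per_arm(0.20, 0.02)  # detect a 2-point lift
n_one_point = sample_size_per_arm(0.20, 0.01)   # detect a 1-point lift
```

Halving the MDE roughly quadruples the required sample, which is the real cost of waiting for certainty.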
Before I touch production traffic, I run eval-driven development. I curate golden datasets that reflect real user complexity, codify rubrics for correctness, safety, tone, and latency, and automate scoring so improvements are reproducible—not anecdotal. This gives the team a stable baseline to iterate prompts, tools, and policies with confidence.
Model behavior is inherently stochastic, so I deliberately control variability. I document temperature, top-p, and seed strategies; I compare deterministic settings for regression checks versus sampled settings for user-facing creativity; and I test sensitivity across content lengths and edge cases. This reduces flakiness and prevents surprise regressions during CI/CD.
When it’s time to learn from real users, I favor A/B testing with thoughtful guardrails. I run holdouts, cap exposure with feature flags, and protect core experience metrics like retention and time-to-value. For ranking and retrieval changes, I’ll use interleaving or switchback tests to isolate effects from seasonality and traffic mix.
To handle LLM variability online, I aggregate outcomes over multiple prompts per cohort, use stratified bucketing to balance power users and new accounts, and track confidence intervals over time instead of snapshot p-values. This approach turns noisy model outputs into stable product signals.
Instrumentation fuels everything. I rely on behavioral analytics to trace user intent, effort, and satisfaction across flows, and I wire up Amplitude analytics for event schemas, funnel drop-offs, and cohort comparisons. Clear event taxonomies and naming discipline make it trivial to separate model quality from UX friction.
Risk is part of the work, so I bake in AI risk management early. I include toxicity and PII checks in my offline evals, monitor safety metrics in every A/B, and set rollback criteria tied to user harm and system costs. Privacy-by-design, audit logs, and runtime safeguards aren’t afterthoughts—they’re acceptance criteria.
The operating cadence matters as much as the math. I run continuous discovery with customer interviews to keep the test queue grounded in real jobs-to-be-done, and I align product trios on hypotheses, success metrics, and stop-loss rules before launch. Weekly readouts keep decisions crisp, and post-ship learning cycles feed the next iteration.
Finally, I invest in upskilling the team. We run internal workshops on LLMs for product managers, standardize experiment templates, and maintain a living playbook so new experiments start at 80% instead of 0%. The result: faster learning loops, safer bets, and more confident shipping.
I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.
The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.
We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.
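A catalog entry along these lines could be modeled as a small immutable record. The field names, the example formula, and the owner below are all hypothetical; the post does not show the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    formula: str
    cohort: str
    window_days: int
    owner: str
    version: int = 1
    drivers: tuple = ()


# Hypothetical catalog with a single illustrative entry.
CATALOG = {
    "activation": MetricDefinition(
        name="activation",
        formula="users_completing_first_project / new_signups",
        cohort="new_signups",
        window_days=7,
        owner="growth-pm",
        drivers=("onboarding_completion", "invite_sent"),
    ),
}


def lookup(term):
    """Return the canonical definition, or None so the caller can ask a
    clarifying question instead of guessing."""
    return CATALOG.get(term.strip().lower())


activation = lookup(" Activation ")
unknown = lookup("first value")
```

Returning `None` for an unregistered term, rather than falling back to a near match, keeps overlapping phrases like "first value" from being silently treated as "activation".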
Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check the minimum detectable effect (MDE) for an A/B test, and summarize findings with references.
Architecturally, we treated this as “Agent Analytics”: an orchestrator that selects tools based on intent—querying Amplitude analytics or Pendo for behavioral paths and funnels, hitting our warehouse for cohort tables, or pulling experiment metadata and anomaly detection alerts. Tool use is permission-aware, auditable, and designed to fail safe. The agent’s outputs include citations back to the exact definitions, dashboards, and SQL used, so reviewers can validate and iterate.
Quality came from eval-driven development, not intuition. We built a gold set of representative product questions (activation inflections, retention analysis by segment, funnel drop-offs after feature launches) and scored the agent on faithfulness to definitions, numerical accuracy, latency, and actionability. We incorporated regression checks to catch drifts after schema changes, and we tuned prompts to reduce overconfident answers and push for clarifying questions when context was missing.
Safety and reliability were non-negotiable. We layered AI risk management with role-based access, guardrails that block destructive queries, and risk scoring for unfamiliar joins or sudden spikes in metric deltas. The agent logs every step—what it retrieved, which tools it called, and why—so analysts can replay and refine the chain of thought with transparent provenance.
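A guardrail that blocks destructive queries might start as simply as the check below. It is a sketch under assumptions: the deny-list and function name are hypothetical, and a production guardrail would more likely combine read-only credentials with a real SQL parser than rely on pattern matching.

```python
import re

# Hypothetical deny-list of statement keywords the agent may never issue.
DESTRUCTIVE = re.compile(
    r"\b(drop|delete|truncate|update|insert|alter|grant)\b", re.IGNORECASE
)


def is_query_allowed(sql):
    """Allow only a single statement that starts with SELECT or WITH and
    contains no destructive keyword."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        return False  # more than one statement
    if not re.match(r"(?i)(select|with)\b", statement):
        return False
    return DESTRUCTIVE.search(statement) is None


allowed = is_query_allowed("SELECT count(*) FROM events WHERE updated_at > '2024-01-01';")
blocked = is_query_allowed("SELECT 1; DELETE FROM events")
```

The check is intentionally conservative: it will also reject harmless queries whose string literals happen to contain a blocked keyword, which fits the fail-safe stance described above.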
The payoff: product teams now self-serve nuanced questions in minutes instead of days, and our analysts spend more time on discovery than report wrangling. Retention analysis improved as the agent standardized cohort logic; conversion investigations accelerated thanks to consistent funnel definitions; and cross-functional decisions aligned around the same driver trees and shared language. Most importantly, the agent turned ambiguous asks into structured analyses that stand up to scrutiny.
For fellow product leaders, my lesson is simple: start with semantics, not models. A crisp ontology, disciplined taxonomy, and clear ownership will outperform a flashy stack riddled with ambiguity. Avoid technology FOMO; favor retrieval-first grounding, small sharp tools, and continuous discovery with your product trios. When your organization speaks a common analytics language, agents can finally think with you, not just for you.
Next, we’re extending the agent’s planning skills to recommend experiment designs, estimate statistical power and minimum detectable effect (MDE), and propose driver-tree-informed bet sizing. We’re also tightening feedback loops so every accepted answer, edit, or override strengthens the retrieval corpus and evaluations. The vision: a calm, reliable layer that makes rigorous product analytics feel conversational and helps teams move from questions to confident action.
Inspired by this post on Amplitude – Best Practices.
I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.
The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.
We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.
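To make the catalog idea concrete, here is a minimal sketch of what one entry could hold. The field names, metric names, owners, and formulas are hypothetical examples, not our actual catalog. The useful behaviors are that unknown terms fail loudly (so the agent asks instead of guessing) and that driver links can be validated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    formula: str                  # human-readable formula agents must quote
    cohort: str
    window_days: int
    owner: str
    version: int
    drivers: tuple[str, ...] = ()  # names of input metrics in the driver tree


# Hypothetical entries for illustration only.
CATALOG = {
    m.name: m
    for m in [
        MetricDefinition("setup_completion", "users finishing setup / new users",
                         "new_users", 1, "growth-team", 1),
        MetricDefinition("activation", "users reaching first value / new users",
                         "new_users", 7, "growth-team", 3, drivers=("setup_completion",)),
        MetricDefinition("retention_w4", "users active in week 4 / activated users",
                         "activated_users", 28, "core-team", 2, drivers=("activation",)),
    ]
}


def lookup(term: str, catalog: dict = CATALOG) -> MetricDefinition:
    """Return the canonical definition, or raise so the caller asks for clarification."""
    key = term.strip().lower()
    if key not in catalog:
        raise KeyError(f"'{term}' has no canonical definition; known metrics: {sorted(catalog)}")
    return catalog[key]


def undefined_drivers(catalog: dict) -> set[str]:
    """Drivers that are referenced but never defined, which signals a gap in the catalog."""
    return {d for m in catalog.values() for d in m.drivers if d not in catalog}
```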
Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check the minimum detectable effect (MDE) for an A/B test, and summarize findings with references.
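The retrieval step can be illustrated with a toy ranker. This sketch uses token overlap (Jaccard similarity) purely to stay self-contained; a production pipeline would more likely use embeddings or a search index, and the document ids and texts here are invented. What matters is the shape: retrieve first, return ids the answer can cite, and return nothing when nothing matches.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Rank documents by Jaccard overlap with the question and return (doc_id, score) pairs."""
    q = tokenize(question)
    scored = []
    for doc_id, text in corpus.items():
        d = tokenize(text)
        union = q | d
        overlap = len(q & d) / len(union) if union else 0.0
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Highest score first; ties broken by id so results are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical corpus entries.
CORPUS = {
    "def:activation": "activation means completing setup within 7 days of signup",
    "runbook:funnels": "how to build a signup funnel chart",
}
```

An empty result is a signal to ask a clarifying question rather than generate an ungrounded answer.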
Every week I meet marketers who are working harder than ever—more campaigns, more content, more dashboards—yet seeing less movement on metrics that matter. The surge of AI tooling has amplified activity, not necessarily impact. That’s the focus problem: we confuse motion with momentum, and our backlogs look great while our outcomes stall.
Learn how AI agents for marketing can help you prioritize impact so you can do important work, instead of just more work.
In my role leading product and growth teams, I’ve learned that AI only compounds value when it is pointed squarely at outcomes. If we don’t define what “good” looks like, agentic AI will simply scale busywork. The antidote is a disciplined operating model that connects strategy to execution and instruments agents with clear success criteria.
First, anchor your program with outcomes vs output OKRs. Choose one or two measurable business outcomes—such as qualified pipeline, conversion rate, or activation—and make everything else subordinate. This provides the compass agents need to make effective trade-offs when speed and volume tempt you to do “one more thing.”
Second, map a driver tree from the target outcome down to the controllable levers: audience segments, offers, channels, messaging, and experience friction. This traceability shows where agents can move the needle fastest—whether that’s accelerating research, sharpening positioning, or eliminating handoffs that slow experimentation.
Third, design a small, agentic AI workforce aligned to those levers. For example: a Research Agent that synthesizes market insights and past performance; a Copy Agent that generates on-brief, on-brand variants; a Distribution Agent that adapts content to each channel and schedules posts; and an Analytics Agent that runs A/B tests, summarizes results, and flags anomalies. Keep human oversight where judgment matters most—strategy, brand voice, and high-stakes decisions.
Fourth, instrument rigor from day one with Agent Analytics and eval-driven development. Define offline evals for brand consistency, factuality, safety, and response time; pair them with online experiments that quantify lift on your target outcomes. Set a minimum detectable effect (MDE) so you stop shipping changes that cannot plausibly move the metric.
Fifth, operationalize your AI workflows. Standardize prompts, inputs, and handoffs; templatize briefs and acceptance criteria; and keep a change log so improvements compound rather than reset. Use short, frequent feedback loops to prune low-impact work and double down on what demonstrably advances your objectives.
I’ve seen teams reclaim focus and momentum when they treat agents as teammates, not toys. The magic isn’t in producing more assets—it’s in consistently choosing the next best action in service of a clear outcome. When you combine outcome clarity, a driver tree, targeted agents, and tight evals, AI becomes a force multiplier for marketing impact.
If you’re feeling overwhelmed by AI’s possibilities, start small: commit to one outcome, one driver you believe is material, and one agent designed for that job. Prove lift, codify the workflow, then scale. Velocity is only valuable when it’s pointed in the right direction.
Inspired by this post on Amplitude – Best Practices.
Every planning cycle, I feel the drumbeat: “Show me the AI ROI—this quarter.” The pressure is real, especially when boards and CFOs expect immediate payback. Yet when I review stalled initiatives across teams and peers, the pattern is consistent: most companies treat AI like a feature to ship, not a system to manage. That mindset almost guarantees we measure the wrong things, declare victory (or failure) too early, and miss the durable value AI can create.
Here’s the core problem I see: we leap to solution and skip the counterfactual. Without a baseline, a clear control, or a defined “what would have happened otherwise,” we’re guessing. We also fixate on lagging, financial KPIs that move slowly (revenue, cost, risk), then use outputs—not outcomes—as OKRs. If we don’t align on outcomes vs output OKRs upfront, the best team in the world can still optimize for activity over impact.
My AI strategy starts from a simple truth: value shows up along three vectors (revenue, cost, and risk) on different timelines. In the near term, we must validate leading indicators (adoption, engagement, activation) that ladder to those vectors through a transparent driver tree. Over time, those drivers compound into the lagging KPIs finance cares about. When we make the driver tree explicit, everyone can see how model precision, response time, and workflow integration roll up to conversion lift, case deflection, time-to-resolution, or reduced exposure.
To make this rigorous, I run a five-step playbook. First, define the decision and business outcome in plain terms. Second, instrument the baseline with behavioral analytics on a unified analytics platform—tools like Amplitude analytics or Pendo help expose friction points we’ll later target. Third, create a counterfactual using A/B testing and specify a minimum detectable effect (MDE) so we know how long to run and how much traffic we need. Fourth, quantify costs (training, inference, integration, change management) and include AI risk management, privacy-by-design, and data governance up front. Fifth, lock a measurement plan that connects leading indicators to lagging ROI through the driver tree.
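For the third step, the link between MDE and traffic can be estimated with the standard normal-approximation formula for comparing two proportions. This is a planning sketch, assuming a two-sided test, equal-sized arms, and an absolute MDE; the default alpha and power are conventional choices, not values from our playbook.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be strictly between 0 and 1, and mde_abs non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

Halving the MDE roughly quadruples the required sample, which is why agreeing on the MDE up front determines how long the test must run.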
Most AI initiatives don’t fail on model quality—they fail on adoption. If the workflow isn’t smoother, trust isn’t earned, or value isn’t obvious, users revert. That’s why I invest early in onboarding, in-app guides, product tours, and thoughtful tooltip design to reduce the time-to-first-value. Then I watch user activation, retention analysis, and task completion to ensure the assistive experience is not just novel—it’s habit-forming.
For generative use cases, eval-driven development is non-negotiable. I maintain offline evaluations for accuracy and safety, and online evaluations for business impact. Retrieval-first pipeline health, context window management, and prompt engineering affect reliability; so do latency and grounding quality. We ship behind feature flags, measure guardrail effectiveness, and tighten feedback loops from human-in-the-loop reviews into model updates—continuously.
On the business side, I avoid “AI theater” by structuring benefits like a CFO. Revenue: increased conversion or expansion driven by better recommendations, faster sales cycles, or higher trial activation. Cost: case deflection, agent time saved, fewer escalations, and lower rework. Risk: reduced exposure via automated checks, anomaly detection, and consistent policy application. If any claim can’t be tied to measured deltas—via A/B testing or strong quasi-experiments—it doesn’t go in the deck.
Build vs buy deserves the same discipline. I map platform scalability, governance requirements, and total cost of ownership against time-to-impact. Teams often underestimate integration and maintenance drag; a pragmatic mix of bought components with thin custom layers can accelerate outcomes while keeping options open. The goal isn’t to own every layer—it’s to own the learning loop and the differentiated experience.
I also remind teams that tooling should serve the strategy, not replace it. I’ve seen concise, effective messaging that captures the point: “Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.” The words are compelling because they reflect the three-vector value model and the adoption imperative. The same standard should apply to any AI initiative we propose.
If you’re under pressure to prove ROI, shift the conversation: lead with the driver tree, specify your counterfactual, and anchor on leading indicators you can move in weeks—not quarters. Then connect those to the lagging KPIs finance expects over time. When we manage AI like a product—grounded in evidence, experimentation, and user-centered adoption—we don’t have to force ROI. We compound it.
Product roadmaps should not be promises etched in stone; they are portfolios of bets made under uncertainty. When I build a roadmap, I’m not predicting the future—I’m designing a system that helps the team learn faster than the market changes, allocate capital wisely, and create alignment across engineering, design, go-to-market, and leadership.
The best roadmaps I’ve seen and shipped anchor on outcomes rather than features. “Outcomes vs output OKRs” is more than a slogan; it’s how we translate strategy into measurable impact. I start by defining a small set of outcome metrics that matter—such as activation rate, time-to-first-value, or expansion revenue—and attach clear key results and guardrails to each theme. This reframes prioritization from “what can we build?” to “what must change in customer behavior?” and gives empowered product teams real autonomy.
I organize the roadmap into time horizons—Now, Next, Later—with explicit confidence levels. Near-term items have higher confidence and more specificity; mid- and long-term bets are thematic with wider time windows. This approach reduces false precision and builds trust because stakeholders can see both the intent and the uncertainty. When dates matter, I use windows and service level expectations rather than single deadlines, and I pair each initiative with a lightweight risk scoring so we can discuss uncertainty explicitly rather than implicitly.
Continuous discovery keeps the roadmap honest. I partner in tight “product trios” across product, design, and engineering to run rapid customer interviews, opportunity sizing, and assumption tests before we commit significant delivery capacity. The opportunity solution tree is my favorite artifact here; it visualizes the path from outcomes to opportunities to experiments and solutions, making trade-offs and sequencing transparent. By the time something moves into sprint planning, we’ve already reduced key uncertainties and clarified the narrowest viable slice we can ship.
Uncertainty demands options. I plan initiatives as options with stage gates and explicit kill criteria rather than as single monolithic projects. For every significant theme, I outline base, best, and worst-case scenarios with pre-decided triggers for when we escalate, pivot, or stop. This practice prevents sunk-cost fallacy and keeps the team focused on evidence. We treat scope as a knob, not a switch, and we bias toward small, sequential bets that compound learning.
Capacity is strategy. I routinely reserve a discovery buffer—typically 10–20%—and a contingency buffer for integration, security, and performance risks that always show up late. I ruthlessly control work-in-progress to limit thrash and protect the team’s ability to respond when new information arrives. When we must navigate dependencies, I use thin vertical slices and decouple via contracts or feature flags so discovery momentum doesn’t stall while platforms evolve underneath.
Prioritization under uncertainty benefits from explicit models. I combine value, effort, and confidence with risk scoring to surface where the unknowns are hiding. Driver trees help us connect top-level outcomes to leading indicators, so we can place bets where they have the highest causal leverage. I also lean on the Kano Model and qualitative signals to avoid over-investing in performance attributes while neglecting excitement features that unlock differentiation and word-of-mouth.
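One way to make such a model explicit is a simple score. The formula below is an assumption for illustration (value weighted by confidence, discounted by risk, divided by effort), not a standard or the exact model I use; the bet names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    name: str
    value: float       # expected impact on the outcome, in any consistent unit
    effort: float      # e.g. team-weeks; must be positive
    confidence: float  # 0..1, how much evidence supports the value estimate
    risk: float        # 0..1, chance that unknowns derail the bet


def priority(bet: Bet) -> float:
    """Illustrative score: confidence-weighted, risk-discounted value per unit of effort."""
    if bet.effort <= 0:
        raise ValueError("effort must be positive")
    return bet.value * bet.confidence * (1 - bet.risk) / bet.effort


def rank(bets: list[Bet]) -> list[Bet]:
    return sorted(bets, key=priority, reverse=True)
```

The score matters less than the conversation it forces: a bet with high value but low confidence or high risk surfaces as a candidate for discovery work before delivery.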
The most effective stakeholder management is narrative-first. For executives, I present a one-page outcomes roadmap that shows themes, expected shifts in key results, and the learning plan. For teams, I provide a more detailed plan that links discovery insights, assumptions-to-test, and decision points. I make room for a “what we’re not doing” section to reduce noise and prevent shadow backlogs from reappearing in every meeting. Most importantly, I socialize change before it happens, explaining the evidence and the trade-offs so adjustments feel like progress, not whiplash.
Measurement closes the loop. We instrument experiments and releases with leading indicators tied to the driver tree and review them on a predictable cadence. If movement stalls, we diagnose whether we have a targeting problem (wrong audience), a value problem (weak proposition), or a friction problem (broken journey). That discipline lets us iterate with purpose instead of chasing vanity metrics or isolated anecdotes.
Here’s a concrete example of roadmapping through uncertainty. Suppose our Q3 objective is to “Increase user activation” with key results to raise the Week-1 activation rate from 32% to 45% and cut time-to-first-value by 30%. In discovery, customer interviews reveal confusion in the first-run setup and a missing integration that advanced users expect. We map an opportunity solution tree and identify two high-leverage opportunities: simplifying the first 10 minutes and offering a guided setup for the integration. We then shape two minimal bets: an in-app guide to streamline the first three tasks and an integration wizard behind a feature flag. Each bet has an explicit decision rule and a two-sprint runway. We ship the guide first, confirm a statistically significant lift via A/B testing, then expand scope. The integration wizard underperforms initial expectations, so we pause, revisit the assumptions, and re-allocate buffer to the stronger path. The roadmap updates in real time, and everyone understands why.
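A decision rule like "confirm a statistically significant lift" can be checked with a two-proportion z-test. The sketch below uses the pooled normal approximation, which is reasonable for large samples; the counts in the usage note are hypothetical, not results from this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variance: both arms converted at 0% or 100%")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

With made-up counts of 320 activations out of 1,000 users in control and 380 out of 1,000 with the guide, `two_proportion_z_test(320, 1000, 380, 1000)` returns a p-value below 0.05, so a rule of "expand scope if p < 0.05 and the lift is positive" would pass.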
When uncertainty spikes—new competitor, pricing shock, platform deprecation—I shift the roadmap cadence to rolling-wave planning. We shorten planning horizons, increase the frequency of readouts, and elevate discovery allocations temporarily. We also create thematic “containment zones” where we explore multiple options in parallel with small budgets until one path justifies scale. This allows us to stay responsive without abandoning strategy.
Good governance accelerates; it doesn’t slow. A lightweight product council that reviews outcomes, risks, and cross-functional dependencies prevents surprise escalations and ensures we keep shipping what matters. We avoid death-by-approval by agreeing in advance on decision rights and thresholds. For example, a product trio can pivot a bet within a theme up to a certain budget or timeline impact without additional approval, as long as it improves the outcome likelihood.
If you’re evolving your roadmap practice, start with three moves. First, reframe your plan in outcomes and publish a driver tree that connects those outcomes to the few leading indicators you believe move them. Second, stand up a continuous discovery cadence with a visible opportunity solution tree and an assumptions-to-test backlog. Third, implement time windows and confidence levels for all mid- and long-term items, and pair each major initiative with explicit kill criteria. You’ll feel the difference in a single quarter: clearer trade-offs, faster learning, and more predictable delivery—despite uncertainty.
In the end, a roadmap that thrives in uncertainty is an agreement about how we learn and decide together. It aligns the organization on outcomes, it funds options—not fantasies—and it gives empowered product teams room to maneuver. That’s how top product teams plan for uncertainty and still deliver with confidence.
For years, I’ve watched product, growth, and data teams burn cycles stitching together manual dashboards and reports, then slogging through replay review just to validate a hunch. That overhead slows discovery and delays decisions. The promise here is different: "Discover how Amplitude AI Agents help product, growth, and data teams turn questions into action without manual dashboards, reports, or replay review." As someone obsessed with decision velocity and evidence-based product strategy, that shift is exactly what I’ve been waiting for.
In practice, I think about Amplitude AI Agents as always-on data analysts embedded in our workflow. Instead of queuing requests or context-switching into tooling, I can ask targeted questions, get synthesized insights, and move directly to action. This is a powerful example of agentic AI meeting behavioral analytics in a unified analytics platform, removing friction between inquiry and impact while keeping teams focused on outcomes, not artifacts.
What changes for my day-to-day? I can interrogate customer behavior in real time, pressure-test hypotheses from discovery interviews, and quickly understand whether activation, retention, or monetization is the current constraint. If I’m probing a driver tree for activation or a retention analysis for a specific cohort, I can get to a decision faster—without waiting on someone to build a bespoke dashboard. That means more cycles spent shaping product strategy and fewer sunk into report wrangling.
This matters beyond speed. When product, growth, and data leaders anchor discussions in the same source of truth, we shorten the distance from signal to decision. That alignment is the backbone of product-led growth and continuous discovery: shared context, faster feedback loops, and clearer trade-offs. It also reduces the long tail of analytics debt—those one-off reports and stale views that quietly accumulate across teams.
Of course, adopting any AI workflow in analytics demands governance. I hold these systems to the same bar I set for my teams: clarity of assumptions, consistent metric definitions, and auditable reasoning. Pairing Amplitude analytics with strong data governance, CI/CD for analytics definitions, and lightweight evals helps ensure the recommendations we act on are reliable, reproducible, and explainable. AI should accelerate our judgment, not replace it.
The strategic shift is simple and profound: move from building dashboards to making decisions. With always-on analysis, we can spend less time instrumenting analytics theater and more time delivering customer value. That is how we translate insights into impact—and why I’m excited to operationalize this capability across our product trios and go-to-market partners.
Inspired by this post on Amplitude – Best Practices.
Ever feel like your product team is “lost in the woods”? I’ve certainly been there—when strategy gets fuzzy, outcomes drift, or constraints aren’t clear. What helped me reframe the chaos was borrowing “lost person” patterns from search-and-rescue and mapping them to product strategy, product discovery, and team behaviors. The result is a practical playbook for product management leadership that keeps empowered product teams moving toward outcomes—not just outputs.
Here are the five patterns I see most often—and how I turn each one into forward motion: settle in place (freeze), chase shortcuts, follow the first visible path, use your own navigation (intuition/taste), and retrace your steps. Each of these has a smart, minimal move that helps teams reorient fast without abandoning continuous discovery or product strategy discipline.
Settle in place (freeze). Sometimes the smartest move is to stop. When my team lacks context or authority, I pause delivery work and escalate instead of improvising fixes. This prevents thrash, protects focus, and creates the air cover we need to realign outcomes vs output OKRs.
Chase shortcuts. Shortcuts can be brilliant—or overconfident. I’ve learned to pressure-test whether the “road” is where we think it is before we commit. That means lightweight experiments, clear exit criteria, and the humility to pivot. Think about big bets like Spotify podcasts: compelling vision, but you still have to validate assumptions step by step.
Follow the first visible path. The obvious option isn’t always the best one. My job as a product leader is to make multiple paths visible before we choose. I lean on opportunity solution trees and KPI trees (or driver trees) to surface alternatives, align stakeholders, and keep empowered product teams focused on customer impact and product-market fit—not just the loudest idea.
Use your own navigation (intuition/taste). Judgment matters, especially for product trios making fast calls—but it’s not a replacement for evidence. When my “compass” conflicts with what we observe, I anchor back to customer interviews, rapid tests, and discovery loops. Intuition should guide where we look, while data validates how we proceed.
Retrace your steps. When we’re drifting, I go back to what used to work: principles, quality practices, and discovery habits as feedback loops. Returning to fundamentals—clear problem statements, crisp value propositions, and disciplined outcomes—rebuilds momentum fast.
Team prompt to try: If your team is “lost” right now, which pattern are you defaulting to—and what’s the smallest move you can make this week to get oriented (escalate, test a shortcut, map options, validate intuition with evidence, or retrace to a principle)? I use this question in weekly reviews to keep us grounded in continuous discovery and product strategy.
Resources & Links:
Follow Teresa Torres: https://ProductTalk.org
Follow Petra Wille: https://Petra-Wille.com
Mentioned in the episode:
Lost Person Behavior: A Search and Rescue Guide on Where to Look – for Land, Air and Water, by Robert J. Koester
Examples referenced: Xerox, Nokia, Kodak, Volkswagen emissions scandal, Spotify podcasts, large-org tooling contexts like Oracle and SAP
Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive Outcomes
KPI Trees: How to Bridge the Gap Between Customer Behavior, Product Metrics, and Company Goals
Let's Read Continuous Discovery Habits Together (January 2026) for Continuous Discovery Habits (and the idea of habits as feedback loops)
Shifting from Outputs to Outcomes: Why It Matters and How to Get Started
I’d love to hear how your team navigates these patterns. Which small move will you try this week? Leave a comment below and let’s compare notes on product discovery, stakeholder management, and product roadmapping that actually drives outcomes.