Shipping great products is a game of making high‑quality decisions under uncertainty. In my role leading product management, I’ve seen teams stall when classic methods demand huge sample sizes before we can say anything useful. Bayesian statistics has become my go‑to approach for turning sparse data into clear, decision‑ready insights—especially when traffic is limited or experimentation windows are tight.
Understand Bayesian statistics vs. frequentist methods and learn how Bayesian approaches improve experiment insights with small sample sizes.
Here’s why I rely on it in A/B testing: frequentist methods focus on p‑values and long‑run error rates, which are tough to translate into action. With a Bayesian lens, I can express outcomes as intuitive probabilities—“Variant B has a 92% chance to outperform A”—and use credible intervals to communicate likely ranges of impact. That clarity reduces decision friction and helps the team move faster with confidence.
Bayesian methods shine when sample sizes are small and the minimum detectable effect (MDE) of a frequentist test would be impractically large. I incorporate prior knowledge—historical conversion trends, seasonality, and learnings from related experiments—to stabilize noisy early data. Done thoughtfully, priors improve estimate quality without overfitting; I always run sensitivity checks to ensure the posterior is driven by the data we’re observing, not wishful thinking.
In practice, my workflow is straightforward. I set a prior from historical performance in Amplitude analytics, run the experiment, and update the posterior daily. I track the probability of superiority, expected lift, and a credible interval that the CRO role can rally around. When the probability of a meaningful win crosses a pre‑agreed threshold, we ship. When it doesn’t, we bank the learning and move on—no prolonged debates about p‑values that few stakeholders truly understand.
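To make that workflow concrete, here is a minimal sketch of the daily posterior update, assuming a Beta-Binomial model for conversion rates. The function name, the example counts, and the flat `Beta(1, 1)` default prior are illustrative assumptions, not values from a real experiment. A historical prior would be passed through `prior_alpha` and `prior_beta`.

```python
import random


def posterior_summary(conv_a, n_a, conv_b, n_b,
                      prior_alpha=1.0, prior_beta=1.0,
                      draws=20000, seed=0):
    """Monte Carlo summary of a Beta-Binomial A/B comparison."""
    rng = random.Random(seed)
    # Conjugate update: Beta(alpha + conversions, beta + non-conversions).
    a_post = (prior_alpha + conv_a, prior_beta + n_a - conv_a)
    b_post = (prior_alpha + conv_b, prior_beta + n_b - conv_b)

    wins = 0
    lifts = []
    for _ in range(draws):
        rate_a = rng.betavariate(*a_post)
        rate_b = rng.betavariate(*b_post)
        wins += rate_b > rate_a              # True counts as 1
        lifts.append(rate_b / rate_a - 1.0)  # relative lift of B over A

    lifts.sort()
    return {
        "prob_b_beats_a": wins / draws,
        "expected_lift": sum(lifts) / draws,
        # Central 95% credible interval for the relative lift.
        "lift_95_interval": (lifts[int(0.025 * draws)],
                             lifts[int(0.975 * draws) - 1]),
    }


# Hypothetical counts: 40/1000 conversions on A, 55/1000 on B.
summary = posterior_summary(40, 1000, 55, 1000)
print(summary)
```

Passing something like `prior_alpha=4, prior_beta=96` would encode a roughly 4% historical baseline with the weight of about 100 observations. How much weight to give history is a judgment call that should be agreed before the test starts.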
This approach also strengthens product discovery. By using behavioral analytics and retention analysis as informative priors, I can evaluate early signals from narrower cohorts—new geographies, niche segments, or enterprise accounts—where traffic is scarce. The result is faster iteration in product‑led growth environments, even when a full‑funnel test would take weeks to reach frequentist significance.
Operationally, I treat Bayesian experimentation as part of a unified analytics platform strategy. The same posterior machinery that powers A/B testing can support anomaly detection during releases, quantify risk in phased rollouts, and estimate lift from in‑app guides or product tours. Because results are framed in plain language probabilities, cross‑functional teams make better, faster decisions aligned to outcomes rather than outputs.
A few guardrails keep me honest. I preregister decision rules (stop/go thresholds, guardrail metrics), run prior sensitivity analyses, and document assumptions alongside results. That discipline prevents overconfidence, improves reproducibility, and builds trust with leadership.
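A prior sensitivity analysis can be as simple as rerunning the same comparison under several priors and checking whether the pre-registered decision changes. The sketch below assumes a Beta-Binomial model. The three priors and the 0.90 ship threshold are hypothetical placeholders for whatever a team actually preregisters.

```python
import random

# Hypothetical candidate priors as (alpha, beta) pairs.
PRIORS = {
    "flat": (1, 1),
    "weak_history": (4, 96),       # ~4% baseline, weight of ~100 users
    "strong_history": (40, 960),   # ~4% baseline, weight of ~1000 users
}
SHIP_THRESHOLD = 0.90  # pre-registered probability of superiority (assumed)


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) under a shared Beta prior."""
    alpha, beta = prior
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(alpha + conv_b, beta + n_b - conv_b)
        > rng.betavariate(alpha + conv_a, beta + n_a - conv_a)
        for _ in range(draws)
    )
    return wins / draws


def sensitivity_report(conv_a, n_a, conv_b, n_b):
    """Return per-prior probabilities and whether the decision is stable."""
    probs = {
        name: prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior)
        for name, prior in PRIORS.items()
    }
    decisions = {p >= SHIP_THRESHOLD for p in probs.values()}
    robust = len(decisions) == 1  # every prior leads to the same ship/hold call
    return probs, robust


probs, robust = sensitivity_report(40, 1000, 55, 1000)
print(probs, "decision stable across priors:", robust)
```

If the ship/hold call flips depending on the prior, the data are not yet carrying the decision, and the honest readout is "keep collecting."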
If your experiments are bottlenecked by low traffic or you’re tired of waiting weeks for a binary “significant/not significant,” consider a Bayesian upgrade. You’ll get earlier readouts, clearer stakeholder communication, and a repeatable path to compounding learning—without sacrificing rigor.
Inspired by this post on Amplitude – Perspectives.
“Is product management dead?” I hear this question at almost every conference hallway chat. After listening to the latest Product Builders – All Things Product Podcast with Teresa Torres & Petra Wille, I’m more convinced than ever: product management isn’t dead—it’s evolving fast, and the leaders will be those who embrace the shift.
The core take resonated deeply with my day-to-day at HighLevel: product management isn’t dying—“the traditional product trio (PM, design, engineering) is collapsing into something new.” The center of gravity is shifting from swim lanes to outcomes, from rigid handoffs to fluid collaboration, and from role definitions to capabilities that actually ship value.
AI is raising the baseline across the board. That “80/20 shift: AI handles patterns, humans handle hard problems” is real on my teams. With LLMs like “GPT 5.2” and “Opus 4.5,” coding agents such as “Claude Code” and “Codex,” and tools like “Replit” and “Lovable,” we’re compressing cycle time on the repeatable 80%. The bottleneck is no longer typing code or drafting copy—it’s selecting the right problems, crafting sharp product strategy, and making confident trade-offs.
This is why the future belongs to “product builders” — people with a shared foundation across disciplines and deep expertise in one area. I look for teams that can shape, prototype, validate, and iterate in tight loops, blending continuous discovery with empowered product teams. The baseline expands, the craft deepens.
Functional expertise still matters—more than ever—because the hard parts are getting harder. We need leaders who can weigh platform scalability against time-to-value, protect privacy-by-design, apply AI risk management, and navigate data governance while sustaining product-market fit. When AI accelerates execution, judgment becomes the differentiator.
For leaders, this creates a clear mandate: build safe AI infrastructure. In practice, that means putting guardrails in place early: security reviews tailored to AI workflows, QA harnesses that include eval-driven development, model performance observability, and human-in-the-loop review systems. You can’t bolt this on later without paying a tax in velocity and trust.
Hiring signals are already shifting, and the change shows up in my reqs: we emphasize cross-functional range, fluency with AI workflows, prompt engineering literacy, and the ability to frame measurable outcomes. We still want craft depth (design systems, systems thinking in engineering, rigorous discovery), but we prize people who move seamlessly from discovery to delivery.
In the episode, I appreciated the crisp framing of why product management isn’t dying—but changing. The rise of the “product builder” foundation reframes team topology and unlocks smaller, more cross-functional squads. AI changes the baseline skill set across product teams, and ignoring it is a career risk. If you’re not learning AI tools, you’re falling behind.
My key takeaways were straightforward and actionable. Smaller, more cross-functional teams are likely. Deep expertise still matters—especially for complex trade-offs. Leaders need guardrails: security, QA, and review systems built for an AI-driven workflow. And if you work in product, design, or engineering, this episode is your signal to start upskilling now.
The risk of ignoring AI in your craft is not hypothetical. I encourage PMs to carve out weekly lab time for hands-on experiments with LLMs, build lightweight prototypes with Replit or Lovable, and pressure-test opportunity solution trees with data-informed discovery. Pair with your engineers on agentic AI use cases, and integrate model evals into your CI/CD pipelines.
Several resources mentioned in the episode are worth exploring: Product at Heart (June, Hamburg), Replit, Lovable, Every, Petra’s Coaching Packages, the coding agents Claude Code and Codex, and the LLMs GPT 5.2 and Opus 4.5. These are great jumping-off points for your own product builder toolkit.
My recommendation: queue up the episode on your commute, then pick one workflow to augment with AI before the week ends. Replace a handoff with a shared canvas. Automate a repetitive analysis. Ship a scrappy prototype. Momentum compounds.
Have thoughts on this episode? Leave a comment below. I’d love to hear how your teams are evolving your product trios, what AI workflows are sticking, and where governance has been most challenging.
PR review bots are all the rage, but they cost a premium. We built our own on the cheap, and they work just as well, if not better. Here's how.
As a VP of Product Management, I care deeply about the velocity and quality of our software delivery. The decision to build our own pull request (PR) review agents came from a simple calculus: we needed tighter control over developer experience, CI/CD integration, and cost—without sacrificing accuracy or reliability. The result was a pragmatic system that accelerates reviews, improves code quality, and pays for itself through faster feedback loops.
Before we wrote a line of code, we defined success. Our objectives were to shorten review cycles, reduce back-and-forth on style and test coverage, and surface risks earlier—measured against DORA metrics like lead time and deployment frequency. That focus aligned the team, guided our build vs buy decision, and anchored scope to the highest-impact use cases.
We started rules-first, AI-optional. The initial release enforced guardrails that are universally valuable: linting and formatting checks, required test coverage thresholds, commit message standards, ownership validation (CODEOWNERS), and basic security scans. These automated gates eliminated predictable review friction, freeing engineers to focus on logic and architecture rather than style debates.
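A rules-first gate needs very little machinery. The sketch below shows two of the checks named above, commit message standards and a coverage threshold, as plain functions. The commit pattern, the 80% threshold, and all names are hypothetical examples, not the policy we actually shipped.

```python
import re
from dataclasses import dataclass

# Hypothetical policy values; a real team would keep these in versioned config.
COMMIT_PATTERN = re.compile(
    r"^(feat|fix|docs|refactor|test|chore)(\([\w-]+\))?: .{1,72}$"
)
MIN_COVERAGE = 80.0


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_commit_messages(messages):
    """Fail if any commit subject line breaks the agreed pattern."""
    bad = [m for m in messages
           if not COMMIT_PATTERN.match(m.split("\n", 1)[0])]
    detail = "ok" if not bad else f"{len(bad)} non-conforming: {bad[0]!r}"
    return CheckResult("commit-messages", not bad, detail)


def check_coverage(percent):
    """Fail if reported test coverage is below the threshold."""
    passed = percent >= MIN_COVERAGE
    return CheckResult("coverage", passed,
                       f"{percent:.1f}% (minimum {MIN_COVERAGE:.1f}%)")


def run_gates(messages, coverage_percent):
    """Run every gate and report whether the PR may proceed."""
    results = [check_commit_messages(messages),
               check_coverage(coverage_percent)]
    return results, all(r.passed for r in results)


results, ok = run_gates(["feat(api): add export endpoint", "wip"], 83.5)
for r in results:
    print(r)
print("all gates passed:", ok)
```

Each result carries a human-readable `detail`, which is what makes the feedback specific enough to act on without a reviewer stepping in.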
Then we layered intelligence where it mattered. We added lightweight, explainable checks for common code smells and dependency risks, plus optional natural-language summaries that turn large diffs into concise context. Where appropriate, we introduced agentic AI workflows to triage PRs by risk, draft review comments, and suggest missing tests—always keeping humans in the loop. This hybrid approach kept costs low and outcomes high.
Integration with our CI/CD pipeline was non-negotiable. We wired GitHub/GitLab webhooks to a stateless service that queued work, executed checks in containerized workers, and posted results back as status checks and review comments. Caching, parallelization, and smart diff-scoping ensured we only computed what changed, keeping the experience snappy even on large repos.
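The shape of that pipeline can be sketched with the standard library alone: events go onto a queue, workers pull them off, and diff scoping selects only the checks relevant to the changed files. Everything here is an illustrative assumption, including the pattern-to-check mapping, the event fields, and the use of threads in place of containerized workers. A real service would post results back through the Git host's API instead of collecting them in a list.

```python
import queue
import threading
from fnmatch import fnmatchcase

# Hypothetical mapping from file-name patterns to the checks they trigger.
CHECKS_BY_PATTERN = {
    "*.py": ["lint", "unit-tests"],
    "*.md": ["docs-links"],
    "requirements*.txt": ["dependency-scan"],
}


def checks_for(changed_files):
    """Diff scoping: select only checks whose pattern matches a changed file."""
    selected = set()
    for path in changed_files:
        name = path.rsplit("/", 1)[-1]
        for pattern, checks in CHECKS_BY_PATTERN.items():
            if fnmatchcase(name, pattern):
                selected.update(checks)
    return sorted(selected)


def worker(jobs, results):
    """Pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        # Stand-in for running the checks and posting status back to the PR.
        results.append((job["pr"], checks_for(job["changed_files"])))


def process(events, workers=2):
    """Queue webhook-style events and fan them out to worker threads."""
    jobs = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for event in events:
        jobs.put(event)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)


print(process([
    {"pr": 1, "changed_files": ["src/app.py", "README.md"]},
    {"pr": 2, "changed_files": ["requirements.txt"]},
]))
```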
Adoption hinged on developer experience. We made the bot’s feedback fast, specific, and actionable, with clear remediation steps and links to documentation. Feature flags allowed teams to opt into new checks gradually. ChatOps commands enabled quick overrides for emergencies, while policy-as-code kept rules visible, versioned, and auditable.
We treated this like any product: eval-driven development for accuracy, ongoing telemetry for false-positive rates, and explicit SLAs for response times. We instrumented outcomes end-to-end—tracking PR cycle time, comment-to-merge ratios, and rework—so we could prove the ROI and tune the system without guesswork.
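Two of those measurements are simple enough to sketch directly: PR cycle time and the bot's false-positive rate. The record fields below are hypothetical, and "false positive" is assumed to mean a bot comment that a reviewer explicitly dismissed as wrong. Other definitions are possible.

```python
from datetime import datetime
from statistics import median


def cycle_time_hours(opened_at, merged_at):
    """Hours between opening a PR and merging it."""
    return (merged_at - opened_at).total_seconds() / 3600


def summarize(prs, bot_comments):
    """Compute median cycle time (merged PRs only) and false-positive rate."""
    times = [cycle_time_hours(p["opened_at"], p["merged_at"])
             for p in prs if p.get("merged_at")]
    dismissed = sum(1 for c in bot_comments if c["dismissed_as_wrong"])
    return {
        "median_cycle_time_hours": median(times) if times else None,
        "false_positive_rate": (dismissed / len(bot_comments)
                                if bot_comments else None),
    }


# Hypothetical telemetry records.
prs = [
    {"opened_at": datetime(2024, 1, 1, 9), "merged_at": datetime(2024, 1, 1, 15)},
    {"opened_at": datetime(2024, 1, 2, 9), "merged_at": datetime(2024, 1, 3, 9)},
    {"opened_at": datetime(2024, 1, 3, 9), "merged_at": None},  # still open
]
comments = [
    {"dismissed_as_wrong": False},
    {"dismissed_as_wrong": False},
    {"dismissed_as_wrong": False},
    {"dismissed_as_wrong": True},
]
print(summarize(prs, comments))
```

The median is used rather than the mean so that a handful of long-lived PRs does not dominate the trend.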
The outcome: a reliable PR review companion that runs on a shoestring budget, integrates cleanly with our workflows, and measurably improves engineering throughput. If you’re weighing build vs buy, start small with rules that deliver immediate value, then layer intelligence where it earns its keep. With a clear product strategy, you can stand up capable PR review bots quickly—and scale them as your needs grow.
If you’re ready to try this yourself, begin with your top three friction points in code reviews, wire them into your CI/CD checks, and pilot with a single team. Iterate weekly, measure relentlessly, and let your developers be your strongest signal. You’ll be surprised how far a pragmatic, product-led approach can take you.
Inspired by this post on Amplitude – Perspectives.
In my role leading product management at HighLevel, I study the architectures and operating models behind high-velocity learning. I often reference "Amplitude's MCP server and its experimentation platform" as a benchmark for how to operationalize scale, reliability, and speed of insight across complex product ecosystems. That lens informs how I design processes, data flows, and decision loops that turn ambiguity into measurable outcomes.
Experimentation is the heartbeat of eval-driven development. In practice, that means running disciplined A/B testing, deploying targeted feature flags to de-risk rollouts, and sizing experiments with a clear minimum detectable effect (MDE) so we avoid vanity wins. When teams internalize these habits, we shift from opinion-led debates to evidence-led decisions—and that’s where product-led growth compounds.
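Sizing against an MDE is a short calculation. The sketch below uses the common normal-approximation formula for comparing two proportions in a two-sided test. The baseline rate, MDE, significance level, and power are illustrative assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Normal approximation for a two-sided, two-proportion test:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde_abs^2
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical scenario: 4% baseline conversion, 1 point absolute MDE.
print(sample_size_per_arm(0.04, 0.01))
# Halving the MDE roughly quadruples the required sample.
print(sample_size_per_arm(0.04, 0.005))
```

Running the numbers before launch shows whether the available traffic can detect the effect you care about, or whether any result would be noise.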
I'm an AI enthusiast, so I think a lot about how experimentation accelerates AI roadmaps. The same rigor that validates UI changes should govern prompts, retrieval strategies, and policy settings for LLM-backed features. By treating AI behaviors as first-class experiment surfaces—and tying them to user activation, retention analysis, and value proposition metrics—we move faster without compromising safety, privacy-by-design, or customer trust.
Making this work in production demands clean instrumentation and a unified analytics platform. I look for stacks that combine Amplitude analytics with robust observability and CI/CD to ensure we can ship, measure, and iterate continuously. When platform scalability and data governance are baked in from the start, product trios can focus on product discovery rather than firefighting pipelines or reconciling metrics.
My playbook is straightforward: define decision-worthy questions, map them to crisp success metrics, run right-sized experiments with feature flags, and use consistent analytics to close the loop. Do this well, and you create a durable advantage—faster learning cycles, sharper product positioning, and a culture that lives by outcomes over output. That’s the real lesson I take from platforms that execute experimentation at scale: process and technology are table stakes; what wins is the discipline to learn relentlessly.
Inspired by this post on Amplitude – Perspectives.
I just watched one of the most significant leaps in customer service AI in years. Last week, a quiet but seismic release landed in CX: Fin introduced Apex, a vertical model purpose-built for support that raises the bar on speed, accuracy, and cost. For a product leader, this is exactly the kind of breakthrough that changes roadmaps, vendor strategies, and what customers can expect from modern service operations.
It’s a brand new model for Fin called Apex, and it’s objectively the highest performing, fastest, and cheapest model for customer service. It beats the very best models in the industry including GPT-5.4 and Opus 4.5.
In this analysis, I’ll unpack why the launch matters for the customer service agent category, what it signals for frontier labs and open‑weight ecosystems, and how leaders should rethink their AI strategy, build vs buy decisions, and eval-driven development roadmaps.
Fin was already the highest performing and most sophisticated agent in the customer service space, consistently beating impressive competitors like Decagon and Sierra with an average win rate in the 70% range. It operates at tremendous scale, now resolving almost 2M customer issues per week, a number that’s growing at an exponential clip. In its short life it’s grown to nearly $100M in recurring revenue.
As of last week, ~100% of all (English language, chat and email) customer conversations are now running on Apex. Since day 1, the Fin engine has comprised a system of models, and last year the team began replacing off‑the‑shelf models with custom ones trained on proprietary data. The core answering model had been a frontier labs offering—initially versions of GPT and more recently Sonnet 4.0. Now, that core answering model is Apex 1.0.
This model resolves customer issues at a materially higher rate than any other model available. One of their largest customers in the gaming space saw the resolution rate improve overnight from 68% to 75%, which means unresolved conversations fell from 32% to 25%, a relative reduction of about 22%. The team notes they had never seen a jump this large from a single improvement since they started Fin.
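The 22% figure follows from looking at the unresolved share rather than the resolved share. A quick check of the arithmetic, using only the two rates quoted above (the function name is just for illustration):

```python
def unresolved_reduction(rate_before, rate_after):
    """Relative drop in the unresolved share when the resolution rate rises."""
    unresolved_before = 1 - rate_before
    unresolved_after = 1 - rate_after
    return (unresolved_before - unresolved_after) / unresolved_before


# 68% -> 75% resolved means 32% -> 25% unresolved.
print(round(unresolved_reduction(0.68, 0.75) * 100, 1))  # 21.9
```

A seven-point gain in resolution rate sounds modest, but measured against the conversations that still need a human, it removes more than a fifth of the workload.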
Just as important, it’s dramatically faster, has fewer hallucinations, and is far cheaper than other available models—exactly the attributes operations leaders weigh most when deploying agents at scale. In practice, these are the levers that unlock higher CSAT, tighter SLAs, and better unit economics.
Achieving all three simultaneously is extraordinarily hard. Credit goes to foundational research from a 60‑person AI group run by Fergal Reid, and, crucially, to domain‑specific proprietary evals drawn from billions of human and agent interactions produced by the Fin resolution engine—already hand‑tuned to be the most effective in the category. That creates a flywheel: an eval‑driven development loop that trains models to keep improving at the edge of the system’s abilities. In other words, Apex 1.0 looks like the tip of the iceberg.
Zooming out, service is one of the few categories where generative AI has already delivered commercial impact at scale (alongside coding, and arguably the legal industry). With TAMs measured in the hundreds of billions, competition is intense and well capitalized. The pattern I’ve seen repeatedly is clear: winners in these spaces must become full‑stack AI companies. As features become ~free to build, durable competitive differentiation shifts under the hood—to proprietary data, post‑training, inference efficiency, and the quality of the eval loop.
Figure: side-by-side charts compare Fin Apex with Sonnet 4.6, Opus 4.5, and GPT-5.4, highlighting a 65% reduction in hallucinations and a time to first token of 3.7s (0.6s faster).
That’s why competitors will need to release their own models. Many appear to be just starting to hire the talent to do so, which likely gives Fin at least a year of head start. For product leaders, this is a strong signal to revisit build vs buy assumptions, and to quantify when owning your post‑training pipeline and evals becomes the rational move.
Honestly, 2–3 years ago I expected AI application differentiation to live mostly in what we built around third‑party models. The AI game humbles all of us; today it’s obvious that vertical models paired with proprietary evals create compounding moats.
In a podcast interview last week, Andrej Karpathy said:
"I do think we should expect more speciation in the intelligences. The animal kingdom is extremely [diverse] in the brains that exist. And there’s lots of different niches of nature… And I think we should be able to see more speciation. And you don’t need this oracle that knows everything. You kind of speciate it. And then you put it on a specific task. And we should be seeing some of that because you should be able to have much smaller models that still have the cognitive core."
The frontier labs still have the very best models, but open‑weight models aren’t far behind—making pre‑training look increasingly like a commodity. The frontier is moving to post‑training, which is precisely what we see with Apex (and Cursor’s Composer 2), and what we should expect to dominate going forward.
Labs now face a dual reality. On one hand, horizontal general‑purpose models can over‑serve specific verticals (e.g., customer service doesn’t need an oracle that knows everything). On the other, open‑weight models are good enough that high‑quality, domain‑specific post‑training can produce superior models for special‑purpose jobs—and in the ways that matter for those jobs. In service, soft factors like judgement, pleasantness, and attentiveness matter alongside hard factors like resolution effectiveness, speed, and cost.
I’m still bullish on the labs. Many organizations remain heavy customers of Anthropic—whether as part of multi‑model systems or through deep usage of Claude Code in engineering teams (see this example of Claude Code adoption). Yet classic disruption (à la the late, great Clay Christensen) is now at their door. The way out is to disrupt themselves by building cheaper specialized models too, which likely requires acquiring the evals—or the companies with the evals—needed for each task. Expect creative data partnerships, M&A consolidation, and a wave of hyper‑specific model providers that compete head‑to‑head with the labs.
In the meantime, Fin appears to be the only vendor in its space with a custom model that’s also objectively superior to everything else out there. I’m excited to see it deployed broadly for end customers, and I’m watching closely for the next announcement that will accelerate that rollout. For product leaders, the message is clear: the age of vertical models and agentic AI is here—bring your evals, or bring your checkbook.
Making the leap from engineer to CEO demands an almost entirely new skillset. I’ve felt that jolt firsthand: the tools that serve you as an IC or even a product leader—system design, crisp PRDs, elegant roadmaps—only get you about 20% of the way. The rest is learning to orchestrate go-to-market strategy, finance, hiring, culture, and product positioning with just enough depth to make sound, fast decisions while empowering true experts to execute.
My operating heuristic is the 80% rule. As CEO or GM, I don’t need to be the best marketer, seller, or finance leader; I need to understand 80% of each function well enough to set a compelling product strategy, ask the right questions, and catch the second-order effects. That breadth unlocks speed, quality of judgment, and the conviction to say no when the organization is tempted by what it can do rather than what it must do.
The clearest illustration comes from the journey that turned Apache Kafka—originally built at LinkedIn—into Confluent, a publicly traded enterprise software company. The technical insight was powerful, but the real lift came from translating that insight into a repeatable go-to-market engine. That required building new muscles: founder-led GTM, enterprise sales orchestration, and open source monetization without alienating the community that fueled adoption.
Early on, the product was “embarrassing” by enterprise standards—thin features, sharp edges, and a long tail of operational gaps. Shipping anyway was the point. A thin vertical slice into the market created learning loops with real customers, not hypotheticals. That uncomfortable speed became a superpower, especially when the company decided to push toward a cloud-first business in the face of widespread opposition.
The messaging challenge was just as hard as the technical one. Most marketing fails because it starts with what we built, not what customers must achieve. A simple product marketing pyramid—vision at the top, category framing and points of parity in the middle, crisp value props and proof at the base—helped explain Kafka to the world in customer language. When the narrative snaps into place, adoption accelerates. In Kafka’s case, one well-timed blog post clarified the “why now” and unlocked a step-change in community and enterprise pull.
There’s a pivotal distinction leaders underestimate: the gap between what a company can do and what it must do. I use a must-do filter before every planning cycle: What moves are non-discretionary for durable product-market fit? For Kafka and Confluent, that meant ruthless prioritization on managed cloud services, reliability, and platform scalability—even when it jeopardized short-term revenue or required retooling how engineering, sales, and support worked.
Fundraising strategy mirrored this clarity. Planning to raise before building the full product wasn’t about hype; it was about matching capital to the physics of the problem. If your category requires enterprise credibility, global infrastructure, and 24/7 SRE, you finance those table stakes early. That’s first principles decision making: instrument the constraints, then design the sequence that gets you to scale with the fewest irreversible mistakes.
In the early years, every product decision felt like a trade between polish and learning. The team essentially bludgeoned its way into a cloud-first posture—less because the initial product was ready, and more because the market’s must-do was obvious. That’s the essence of founder-led GTM: get into the field, close lighthouse customers, and use their arcs to shape the roadmap. It’s also where open source monetization matures from downloads into durable, enterprise value.
As the organization scales, excellence often erodes—the Chipotle problem. Process hardens; quality blurs; the magic decays. The antidotes are simple but hard: a few non-negotiable product quality bars, a short set of product-market fit metrics that everyone can recite, and empowered product teams who own outcomes over output. This is where organizational development matters as much as code: design clear interfaces between product, sales, and success, and you’ll keep velocity without losing standards.
Contrary to popular lore, founder optimism is overrated. Constructive realism wins. I try to model “probabilistic optimism”: assume we will win, but instrument the journey like an SRE runs an incident. Set leading indicators, rehearse failure modes, and make pre-commitments to the must-do path so you’re not swayed by the latest anecdote. It keeps the team out of a failure mindset while making room for rigorous course correction.
Giving up the right things at the right time is a CEO superpower. As complexity grows, I hand off decisions that benefit from specialization and keep only those tied to company narrative, must-do prioritization, and talent bar. CEO time management becomes a portfolio problem: ensure each week contains deep product time, frontline customer exposure, and one compounding systems fix (hiring loop, pricing rubric, or GTM enablement) that pays back for quarters.
If you’re moving from IC or PM into a GM/CEO role, here’s a practical playbook: build your product marketing pyramid; write the one-page must-do memo for the next six quarters; ship a narrow, managed cloud slice early; pick three product-market fit metrics (usage, time-to-value, retention) and publish them company-wide; and architect an enablement engine that turns field learnings into roadmap changes within one quarter. That’s how you transform technical advantage into a category-defining business.
The Kafka-to-Confluent arc reminds me that technology can open a door—but clarity of narrative, sequencing, and must-do focus determines whether you walk through it. When in doubt, bias toward shipping, talking to customers, and tightening the loop between what you learn and what you build. That’s the work of product management leadership at scale.
I’ve curated a focused set of product marketing insights that zero in on what actually moves the needle—turning data into decisions. You’ll find a special emphasis on Amplitude Analytics, because its behavioral analytics foundation makes it easier to translate product usage into clear messaging, sharper positioning, and measurable growth.
In my day-to-day as a product leader, I’m constantly bridging the gap between product discovery and go-to-market strategy. The best outcomes come when we connect quantitative signals to narrative: using behavioral analytics to inform the value proposition, refining product positioning with cohort trends, and driving product-led growth with activation and retention insights.
Here’s how I put this into practice. I start with user activation and retention analysis to identify the few behaviors that predict long-term value. Then I run tightly scoped A/B testing to validate messaging and in-product prompts that nudge those behaviors. When the numbers move, I translate wins into a consistent story—one that sales, success, and marketing can all rally around.
One pattern keeps repeating: clarity beats complexity. Instead of piling on more features, I focus on the minimum, verifiable set of behaviors that correlate with outcomes. That discipline makes it easier to craft a crisp value proposition, streamline go-to-market strategy, and accelerate feedback loops between product, design, and marketing.
As you explore this collection, expect practical playbooks over platitudes. You’ll see how to apply Amplitude Analytics to uncover hidden friction, validate hypotheses faster, and operationalize product-led growth motions that compound over time. My goal is to help you move from interesting dashboards to decisive actions that strengthen your roadmap and your revenue.
If you care about building empowered product teams that learn continuously, you’ll feel at home here. Dive in, borrow what works, and adapt the rest to your context—then measure it, iterate, and share the wins with your team.
Inspired by this post on Amplitude – Best Practices.
Healthcare leaders ask me the same question every week: how do we unlock AI-driven insights without risking patient trust or regulatory missteps? My approach is pragmatic and proven—connect business goals to measurable behavioral analytics, wrap everything in clear governance, and keep protected health information (PHI) out of the analytics layer by default. In other words, we earn the right to scale by making safety, compliance, and transparency visible in every step of the workflow with Amplitude AI.
At the core, I anchor our rollout on "governed analytics"—curated events, certified metrics, and role-based access that make audits straightforward and decision-making fast. When product, data, security, and compliance share a single source of truth in Amplitude analytics, we reduce rework, eliminate ambiguous definitions, and ship improvements with confidence. This is where AI Strategy meets operational excellence: a unified analytics platform that balances velocity with verification.
From there, I establish "PHI-safe workflows" by drawing a hard boundary around what data enters analytics. Behavioral signals flow in; identifiers stay in clinical systems. I lean on privacy-by-design, data minimization, and clear data governance so we can demonstrate regulatory compliance before a single end user is exposed to a new AI-powered experience. That alignment builds trust with legal and security, shortens review cycles, and operationalizes AI risk management without slowing innovation.
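One way to enforce that boundary in code is a default-deny allowlist applied before any event leaves the clinical system. The sketch below is illustrative only: the property names are hypothetical, and an allowlist like this supports a compliance program without replacing one.

```python
# Hypothetical allowlist of behavioral properties cleared for analytics.
ALLOWED_PROPERTIES = {
    "event_type", "screen", "feature", "duration_ms", "app_version",
}


def to_analytics_event(raw_event):
    """Keep only allowlisted properties; everything else is dropped by default.

    Returns the sanitized event plus the names of dropped fields, so the
    drop list can be logged for audit without logging the values themselves.
    """
    kept = {k: v for k, v in raw_event.items() if k in ALLOWED_PROPERTIES}
    dropped = sorted(set(raw_event) - ALLOWED_PROPERTIES)
    return kept, dropped


# Hypothetical raw event that mixes behavioral signals with identifiers.
raw = {
    "event_type": "appointment_booked",
    "screen": "scheduling",
    "duration_ms": 4200,
    "patient_name": "REDACTED-IN-EXAMPLE",
    "mrn": "REDACTED-IN-EXAMPLE",
}
event, dropped = to_analytics_event(raw)
print(event)
print("dropped fields:", dropped)
```

Default-deny matters here: a new field added upstream stays out of analytics until someone deliberately adds it to the allowlist.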
Insights must be "trusted insights"—reliable enough to drive care pathways, staffing decisions, and patient communications. I emphasize repeatable instrumentation, observability of data quality, and transparent lineage so teams can trace outcomes back to inputs. In practice, that means we agree on event contracts, enforce change control, and verify that behavioral analytics reflect real-world adoption and efficacy across patient and provider journeys.
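An event contract can start as a small, versioned mapping that instrumentation is validated against in CI. The sketch below assumes a hypothetical contract format of event name to required property types. It is not an Amplitude feature or API.

```python
# Hypothetical contract: event name -> required properties and their types.
CONTRACT = {
    "appointment_booked": {"screen": str, "duration_ms": int},
    "reminder_opened": {"channel": str},
}


def validate(event_type, properties):
    """Return a list of contract violations; an empty list means valid."""
    spec = CONTRACT.get(event_type)
    if spec is None:
        return [f"unknown event type: {event_type}"]
    errors = []
    for name, expected in spec.items():
        if name not in properties:
            errors.append(f"missing property: {name}")
        elif not isinstance(properties[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    # Properties outside the contract are flagged, not silently accepted.
    errors += [f"unexpected property: {n}"
               for n in sorted(set(properties) - set(spec))]
    return errors


print(validate("appointment_booked", {"screen": "scheduling", "duration_ms": 4200}))
print(validate("appointment_booked", {"screen": "scheduling", "duration_ms": "4200"}))
```

Because the contract lives in a reviewed file, changing an event definition becomes a visible pull request rather than a silent drift in dashboards.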
To move decisively from legal review to production, I run a two-speed rollout. First, we validate in a sandbox with synthetic or de-identified data to pressure-test prompts, dashboards, and alerting. Then we graduate to controlled pilots with strict guardrails, documented data flows, and pre-agreed risk mitigations. By the time we scale, stakeholders have evidence, not just assurances—accelerating approvals and reducing last-minute scope churn.
One pattern I rely on is connecting AI outcomes to product metrics that matter: activation, time-to-first-value, task completion rates, and variance in outcomes across segments. With Amplitude analytics, we can spot drop-offs, attribute improvements to specific design or model changes, and quantify impact in language that resonates with executives and clinicians alike. That rigor is what transforms AI from a promising prototype into a dependable operating capability.
Success looks like faster time-to-insight, fewer compliance iterations, and audit-ready documentation built into normal workflows. It also looks like teams who are confident enough in their data to run A/B testing and continuous discovery—because they know their dashboards reflect reality. When governance, safety, and clarity are designed in, product-led growth becomes compatible with healthcare’s unique regulatory and ethical obligations.
"See how to adopt AI in healthcare safely with Amplitude, using governed analytics, PHI-safe workflows, and trusted insights that help teams move from legal review to real usage." That’s the journey I guide teams through—measurable, compliant, and humane—so we can deliver AI that clinicians trust, patients respect, and leaders can scale.
Inspired by this post on Amplitude – Perspectives.
Customer experience is where strategy, data, and execution converge—and where AI can deliver compounding value when thoughtfully designed. In my work, I’ve seen how the right CX vision becomes a growth engine when it’s operationalized through clear measures, robust analytics, and disciplined product practices.
"Amanda Sime is the Customer Experience Strategy Lead at Amplitude. She shapes CX strategy and partners across orgs to design and scale AI-powered solutions." That concise description captures a model I deeply respect: start with a strong CX strategy, then partner across the organization to make AI real in the day-to-day. It’s not just about new technology; it’s about aligning teams, systems, and incentives to deliver consistent customer value.
Translating that approach into practice requires a rigorous AI Strategy, anchored in measurable outcomes and informed by behavioral analytics. I prioritize journey mapping to expose friction, then connect those insights to AI workflows that enhance customer success and in-product guidance. When cross-functional partners—from solutions engineering to support—operate from a shared driver tree, the roadmap balances speed with sustainability.
Data is the backbone. A unified analytics platform, often centered on Amplitude analytics, helps teams move beyond vanity metrics to track user activation, feature adoption, and retention with precision. With that foundation, we can test responsibly, iterate quickly, and validate impact with product-led growth motions that scale across segments without sacrificing quality.
Operational excellence matters just as much as vision. I’ve learned to treat CX programs like enduring products: build reliable feedback loops, connect customer support AI strategy to clear service-level outcomes, and empower product management leadership to make evidence-based tradeoffs. When teams have clarity on the problem space and access to trustworthy insights, they deliver solutions that feel both intelligent and human.
The real win is cultural: empowering product trios and partner teams to co-own outcomes, not just outputs. That’s how AI moves from a promising experiment to a durable capability—by aligning strategy, analytics, and execution so customers experience value at every touchpoint.
Inspired by this post on Amplitude – Perspectives.
I followed the energy at Fin Labs Paris and immediately zeroed in on the announcement of Monitors. In my view, it’s the missing piece that turns Fin’s powerful automation into an observable, trustworthy system—sitting alongside Insights and Recommendations to form a complete observability suite that gives teams confidence in what Fin is doing.
With Monitors, you define what conversations get reviewed, both Fin and human, and set evaluation criteria using Custom Scorecards. That level of control ensures you’re measuring the metrics that matter most to your business and holding support quality to your bar, not a generic one.
Used in concert with Insights and Recommendations, you can finally see what’s happening across your support operation, evaluate every conversation against your standards, and take targeted action to continuously move toward perfect customer experiences.
As Agents become more powerful, transparency and control become critical. I’ve seen this shift firsthand: AI is advancing fast, and the stakes are no longer theoretical—Agents are resolving real customer issues with real consequences at scale.
Visualizing the AI development flywheel—Train, Test, Deploy, Analyze—this graphic spotlights Analyze in orange to introduce Monitors, turning opaque model behavior into measurable signals and continuous customer service insights.
Fin has almost 8,000 customers, averages a 67% resolution rate, and resolves close to 2 million customer queries every single week, including highly complex queries in regulated industries.
At that scale, observability isn’t a nice-to-have; it’s a necessity. Traditional CSAT and small QA samples weren’t built for Agent-led operations—they miss edge cases, don’t scale, and can’t explain drift. The result is a black box. What teams need most right now is confidence, built on data you can trust and act on.
At Intercom, this is called the Fin Flywheel: Train, Test, Deploy, Analyze.
See inside Intercom's Monitors: a streamlined dashboard with pass‑rate charts and review queues, alongside a panel to define a 'Vulnerable customers' monitor, test it on sample chats, and run continuous checks.
Analyze is the step where you find out what’s actually happening, and it’s where improvement begins.
In my experience, achieving confidence in an AI support operation requires three things: (1) a complete understanding of what Fin, your human team, and your customers are talking about; (2) a way to monitor and score conversations based on the criteria that matter most to your business; and (3) AI-powered recommendations that make it easy to act on what you find. Intercom launched Insights and Recommendations to address the first and third. Now, Monitors completes the system for full observability and opens the black box.
Monitors: know whether every conversation met your standards. Customer sentiment is important, but it’s different from determining whether a conversation was handled correctly. With Monitors, you can do both—and do it at scale.
Customer support leaders praise Monitors for turning AI performance from a black box into measurable signals. This quote from Ineke Oates of Agorapulse highlights the shift from manual spot checks to continuous quality tracking.
Monitors is a new QA capability that delivers a structured, repeatable way to define which conversations get reviewed and evaluate them against quality criteria you set. It replaces ad-hoc sampling and spreadsheet-driven QA with a system that scales as your volume grows.
Two components work together: Monitors define what gets reviewed and Custom Scorecards define how each conversation is evaluated. That pairing brings the rigor of Agent Analytics and the discipline of eval-driven development to everyday CX operations.
Random sampling has always been a blunt tool. When AI is handling thousands of conversations a week, a small, arbitrary slice won’t reliably capture your highest-risk edge cases, your most complex escalations, or where quality is starting to drift. I’ve felt that pain in operations reviews—too many unknowns, not enough signal.
Open the AI black box with Monitors: track conversations, triage unreviewed items, and build transparent scorecards with criteria like accuracy, process adherence, and efficiency to lift customer support quality.
With Monitors, you select and evaluate conversations with intent. You can target specific signals of risk or failure, like “the customer showed signs of financial vulnerability” or “Fin looped around with the same answer without resolving the issue.” Or you can create consistent, repeatable samples to benchmark quality over time. Use the existing library of filters (customer data, channel, Fin-specific metrics) or describe nuanced scenarios in natural language. Most teams will do both: home in on the conversations that matter most and maintain a steady, structured QA sample each week.
"When I saw Monitors, my first reaction was — this is exactly what we need. The ability to track quality continuously, instead of relying on spot checks, is a big shift for us." Ineke Oates, Head of Support, Agorapulse
Custom Scorecards make your standards explicit and enforceable. One-size-fits-all rubrics never reflect your brand voice, industry constraints, or customer expectations. With Custom Scorecards, you define what “good” looks like for your business and turn that into a measurable, comparable quality score for every conversation.
A customer testimonial underscores the promise of Monitors: bring quality assurance into the flow of work, unifying AI assistant Fin and human agents in a single place for faster, clearer customer support.
You define the criteria that matter, how each should be measured, and how important each one is. Some criteria can be scored automatically by AI, others reviewed by a human, or both, all within the same scorecard. This means you’re not choosing between scale and judgment; you get both in one system.
Each conversation is then evaluated against these criteria, and the system calculates an overall quality score based on your configuration. You can weight what matters most, or mark certain criteria as critical, so a single failure can fail the entire evaluation when needed.
The result is a single, consistent quality score that reflects your standards—not a generic metric, and not a collection of disconnected checks. That’s what makes quality measurable over time and comparable across AI and human support.
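The weighting and critical-criteria behavior described above can be sketched in a few lines. This is an illustration of the idea only, not Intercom’s implementation: the `Criterion` class, the weights, and the `PASS_THRESHOLD` value are all assumptions.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.8  # hypothetical pass mark, chosen for illustration


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float           # relative importance of this criterion
    critical: bool = False  # if True, failing it fails the whole evaluation


def quality_score(criteria: list[Criterion], results: dict[str, bool]) -> tuple[float, bool]:
    """Return (weighted score between 0 and 1, overall pass flag)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")
    earned = sum(c.weight for c in criteria if results[c.name])
    score = earned / total_weight
    critical_failed = any(c.critical and not results[c.name] for c in criteria)
    return score, (not critical_failed) and score >= PASS_THRESHOLD


# Hypothetical scorecard: accuracy carries little weight but is critical.
scorecard = [
    Criterion("accuracy", weight=1, critical=True),
    Criterion("process_adherence", weight=3),
    Criterion("efficiency", weight=2),
]

# The weighted score clears the threshold, yet the evaluation fails on accuracy.
score, passed = quality_score(
    scorecard,
    {"accuracy": False, "process_adherence": True, "efficiency": True},
)
print(round(score, 3), passed)
```

Keeping the critical flag separate from the weight is the design point: a conversation can look fine on average and still fail because one non-negotiable criterion was missed.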
Monitors helps open the AI black box by turning model outputs into trackable reviews. This clean queue groups customers, monitor types, scores, and actions—with AI auto-review—so teams improve quality faster.
There’s an important distinction here: CX Score tells you how customers felt about a conversation. Custom Scorecards tell you whether it met your standards. You need both.
"We looked at dedicated QA tools, but what's compelling about Monitors is that it lives where our conversations already happen. We don't need another system — we can run QA across Fin and our human team in one place." Jared Ellis, Senior Director, Global Product Support, Culture Amp
When a conversation meets your criteria for review, Monitors routes it into a Review Queue. Each conversation is assigned to the right reviewer with its scorecard attached and status tracked end to end: Not reviewed, Reviewed, Needs a fix, Fix complete. Reviewers work directly in Intercom, capture what went wrong, and propose concrete fixes—like updating documentation or refining a workflow—so quality loops end in action, not just scores.
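One way to picture the end-to-end status tracking is as a small state machine. The four status names come from the description above; the allowed transitions between them are an assumption made for this sketch, not documented product behavior.

```python
from enum import Enum


class ReviewStatus(Enum):
    NOT_REVIEWED = "Not reviewed"
    REVIEWED = "Reviewed"
    NEEDS_FIX = "Needs a fix"
    FIX_COMPLETE = "Fix complete"


# Assumed transitions, for illustration only.
ALLOWED_TRANSITIONS = {
    ReviewStatus.NOT_REVIEWED: {ReviewStatus.REVIEWED, ReviewStatus.NEEDS_FIX},
    ReviewStatus.REVIEWED: {ReviewStatus.NEEDS_FIX},
    ReviewStatus.NEEDS_FIX: {ReviewStatus.FIX_COMPLETE},
    ReviewStatus.FIX_COMPLETE: set(),
}


def advance(current: ReviewStatus, target: ReviewStatus) -> ReviewStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target


status = advance(ReviewStatus.NOT_REVIEWED, ReviewStatus.NEEDS_FIX)
status = advance(status, ReviewStatus.FIX_COMPLETE)
print(status.value)
```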
Monitors turn AI performance from opaque to measurable. The Fin quality view summarizes review score, pass rate, and review counts while a time‑series chart tracks escalation ease, clarification, and efficiency—delivering fast, actionable CX insights.
Reporting turns QA into a continuous signal rather than a one-off audit. You can track review scores over time across Monitors and Scorecards, and compare them directly to CX Score, resolution rate, and other performance metrics. Patterns that were previously invisible become clear: a topic consistently underperforming, a quality dip correlated with a recent knowledge base change, or a team whose scores are improving week over week. This is observability applied to CX—evidence you can act on.
Monitors for Fin conversations is live today, and the roadmap goes further. Human agent QA will bring the same structured evaluation to your human team’s conversations, creating one consistent quality system across your entire support operation.
Real-time alerts will notify you the moment a conversation crosses a threshold you’ve defined—before the issue reaches more customers and risks compounding negative sentiment.
Knowledge base evaluation will connect AI scoring directly to your content so conversations are assessed against your latest policies and documentation, catching inaccurate or outdated responses and providing clear rationale linked to the relevant source.
Creating a perfect customer experience with AI requires transparency. You need to understand how the system is performing if you want to maintain and improve quality over time. With Insights, Monitors, and Recommendations, this is now possible: a complete analysis suite that lets you see what’s happening across every conversation, ensure it meets your standards, and pinpoint improvement opportunities when they matter most.
I’ve long advocated for a retrieval-first, eval-driven approach to AI Strategy because it makes risk visible and manageable. Monitors operationalizes that philosophy for CX leaders: you get continuous signal, shared definitions of quality, and a direct path from flags to fixes. If you’re scaling AI support, this is how you replace uncertainty with control—and turn the black box into a competitive advantage.
Are you an AI product manager, or do you want to become one? This guide cuts through the noise and shows where the PM role is really heading with AI.
I’ve spent the last few years scaling AI initiatives across complex SaaS products, and I’ve learned that “AI product manager” isn’t a vanity title—it’s a capability set. The role evolves traditional product management with new responsibilities across data, model behavior, risk, and continuous learning systems. My goal here is to demystify what matters, so you can lead with clarity, build with confidence, and deliver measurable outcomes.
First, let’s separate hype from reality. An effective AI Strategy starts with the customer problem, not the model. I anchor roadmaps around clear use cases, then evaluate whether we need a retrieval-first pipeline, agentic AI, or conventional automation. “Build vs buy” is no longer a procurement question; it’s a lifecycle question about iteration speed, quality control, data governance, and long-term unit economics.
Discovery also looks different. I still run continuous discovery and customer interviews, but I augment them with behavioral analytics and targeted experiments to validate feasibility, risk, and value. I practice privacy-by-design and AI risk management from day one, and I define guardrails for acceptable model behavior alongside success metrics. When high stakes are involved, I document data provenance and align with regulatory compliance standards to protect customers and the business.
Execution shifts from shipping static features to operating learning systems. In product roadmapping and sprint planning, I account for context window management, prompt engineering, and the realities of working with LLMs: latency, cost, drift, and failure modes. I use feature flags, A/B testing, and eval-driven development to move from offline model evals to online impact, sizing each experiment so that its minimum detectable effect (MDE) is large enough to justify the release risk. Observability, anomaly detection, and incident management aren’t optional; they’re how we earn trust.
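To see how the MDE drives experiment size, the sketch below uses the standard normal-approximation formula for a two-sided test comparing two proportions. The baseline rate, the MDE, and the default significance and power levels are assumptions for illustration; a real test plan may call for a different design.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    if mde_abs == 0:
        raise ValueError("mde_abs must be non-zero")
    p1, p2 = baseline, baseline + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical scenario: 10% baseline completion, 2-point absolute MDE.
n_large_mde = sample_size_per_arm(baseline=0.10, mde_abs=0.02)
# Halving the MDE roughly quadruples the traffic required.
n_small_mde = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
print(n_large_mde, n_small_mde)
```

The inverse-square relationship is the practical takeaway: if the smallest effect worth shipping is tiny, the experiment may need more traffic than the feature will see, which is a signal to rethink the release plan or the metric.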
Collaboration expands beyond engineering and design. I work closely with data science on evaluation frameworks, with solutions engineering to de-risk complex enterprise deployments, and with customer success to close the loop on model performance in the wild. Our outcomes vs output OKRs emphasize activation, time-to-value, and sustained retention over vanity accuracy metrics.
Tooling is now a strategic advantage. My AI product toolbox includes prompt libraries with versioning, synthetic data generation where appropriate, and a disciplined approach to model and prompt regression tests. I standardize AI workflows (intake, evaluation, deployment, and monitoring) so teams can ship faster without cutting corners. This is how empowered product teams scale safely.
Career-wise, I look for—and coach—PMs who can frame trade-offs crisply: explain when to fine-tune vs use retrieval, when to embed agents, and when not to use AI at all. Show me driver trees that connect model metrics to business outcomes, a clear risk register, and a plan for continuous discovery. If you can tell a compelling story backed by transparent evaluation and customer value, you’re already ahead.
Here’s the bottom line: the “AI product manager” that matters in 2026 is a product leader who can turn uncertainty into systematized learning. If you focus on real customer problems, rigorous evaluation, responsible design, and iterative delivery, you won’t just carry the title—you’ll create durable competitive differentiation.
I’ve watched too many AI agent deployments celebrate velocity while overlooking the one thing that determines long-term success: whether real users are actually getting value. Dashboards tend to spotlight model upgrades, prompt tweaks, and launch counts, yet they rarely quantify task completion, trust, or time-to-value. That blind spot isn’t technical—it’s human.
Enterprises are spending 93% of their AI budget building agents, and almost none know whether those agents are actually working for users. Pendo Agent Analytics closes the gap.
In my product reviews, I look for evidence that agentic AI is improving outcomes across the customer journey, not just the demo path. Without behavioral analytics and observability, teams optimize for throughput instead of resolution, for novelty instead of reliability. This is where eval-driven development, A/B testing, and rigorous cohort analysis become non-negotiable: they translate agent performance into user impact we can measure and improve.
Here’s the pattern that works for me: define user-centric success metrics first, then let the AI follow. I prioritize signals like successful task completion, low-friction activation, reduced escalations, and sentiment lift—tied directly to product-led growth indicators such as retention and expansion. When these metrics move in the right direction, I know the agent is creating compounding value, not just answering faster.
Practically, I operationalize this with an analytics spine that captures end-to-end agent interactions: intents, prompts, responses, clarifying turns, handoffs, and final outcomes. I segment by persona, journey stage, and account tier to uncover where agents delight and where they degrade trust. With this foundation, I can run controlled experiments, spot anomalies early, and connect improvements in agent behavior to improvements in business performance.
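As a rough illustration of that analytics spine, the sketch below models one agent session as a record and computes task completion by segment. The `AgentSession` fields and the sample rows are hypothetical; they do not reflect any particular vendor’s schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSession:
    session_id: str
    persona: str           # segment dimension, e.g. role of the user
    intent: str            # what the user was trying to do
    clarifying_turns: int  # follow-up questions the agent had to ask
    handed_off: bool       # True if the session escalated to a human
    completed: bool        # True if the user finished the task


def completion_rate_by(sessions: list[AgentSession], key: str) -> dict[str, float]:
    """Share of sessions with a completed task, grouped by the given field."""
    totals: dict[str, int] = defaultdict(int)
    completed: dict[str, int] = defaultdict(int)
    for session in sessions:
        segment = getattr(session, key)
        totals[segment] += 1
        completed[segment] += int(session.completed)
    return {segment: completed[segment] / totals[segment] for segment in totals}


# Made-up sample data for illustration.
sessions = [
    AgentSession("s1", "admin", "reset_password", 0, False, True),
    AgentSession("s2", "admin", "export_report", 1, False, True),
    AgentSession("s3", "analyst", "export_report", 3, True, False),
    AgentSession("s4", "analyst", "build_chart", 1, False, True),
]

rates = completion_rate_by(sessions, "persona")
print(rates)
```

The same function can group by `intent` or any other field, which is how a single event shape supports the persona, journey-stage, and account-tier cuts described above.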
Pendo Agent Analytics closes the loop by making these user outcomes visible and actionable. Instead of guessing whether an agent helped or hindered, I can analyze where users stall, which prompts or skills drive completion, and how interventions like in-app guides or product tours change behavior. That visibility lets me tune models and experiences in days, not quarters—and gives stakeholders confidence that our AI investments are paying off for customers.
If you’re scaling agents today, start small but instrument deeply: map top user intents, define offline and online evals, A/B test prompts and policies, monitor regressions, and tie every improvement to activation, adoption, and retention. The result is a durable feedback loop that keeps agents aligned with user value as your surface area grows.
AI agents are not a destination—they’re a capability. When we anchor that capability to clear user outcomes and measure it with the right analytics, we stop flying blind and start compounding advantage. That’s how we turn promising demos into dependable products.