In my role leading product management at HighLevel, I study the architectures and operating models behind high-velocity learning. I often reference "Amplitude's MCP server and its experimentation platform" as a benchmark for how to operationalize scale, reliability, and speed of insight across complex product ecosystems. That lens informs how I design processes, data flows, and decision loops that turn ambiguity into measurable outcomes.
Experimentation is the heartbeat of eval-driven development. In practice, that means running disciplined A/B testing, deploying targeted feature flags to de-risk rollouts, and sizing experiments with a clear minimum detectable effect (MDE) so we avoid vanity wins. When teams internalize these habits, we shift from opinion-led debates to evidence-led decisions—and that’s where product-led growth compounds.
I'm an AI enthusiast, so I think a lot about how experimentation accelerates AI roadmaps. The same rigor that validates UI changes should govern prompts, retrieval strategies, and policy settings for LLM-backed features. By treating AI behaviors as first-class experiment surfaces—and tying them to user activation, retention analysis, and value proposition metrics—we move faster without compromising safety, privacy-by-design, or customer trust.
Making this work in production demands clean instrumentation and a unified analytics platform. I look for stacks that combine Amplitude analytics with robust observability and CI/CD to ensure we can ship, measure, and iterate continuously. When platform scalability and data governance are baked in from the start, product trios can focus on product discovery rather than firefighting pipelines or reconciling metrics.
My playbook is straightforward: define decision-worthy questions, map them to crisp success metrics, run right-sized experiments with feature flags, and use consistent analytics to close the loop. Do this well, and you create a durable advantage—faster learning cycles, sharper product positioning, and a culture that lives by outcomes over output. That’s the real lesson I take from platforms that execute experimentation at scale: process and technology are table stakes; what wins is the discipline to learn relentlessly.
Inspired by this post on Amplitude – Perspectives.
I just watched one of the most significant leaps in customer service AI in years. Last week, a quiet but seismic release landed in CX: Fin introduced Apex, a vertical model purpose-built for support that raises the bar on speed, accuracy, and cost. As a product leader, this is exactly the kind of breakthrough that changes roadmaps, vendor strategies, and what customers can expect from modern service operations.
It’s a brand new model for Fin called Apex, and it’s objectively the highest performing, fastest, and cheapest model for customer service. It beats the very best models in the industry including GPT-5.4 and Opus 4.5.
In this analysis, I’ll unpack why the launch matters for the customer service agent category, what it signals for frontier labs and open‑weight ecosystems, and how leaders should rethink their AI Strategy, build vs buy decisions, and eval-driven development roadmaps.
Fin was already the highest performing and most sophisticated agent in the customer service space, consistently beating impressive competitors like Decagon and Sierra at an average win rate in the 70s. It operates at tremendous scale, now resolving almost 2M customer issues per week, a number that’s growing at an exponential clip. In its short life it’s grown to nearly $100M in recurring revenue.
As of last week, ~100% of all (English language, chat and email) customer conversations are now running on Apex. Since day 1, the Fin engine has comprised a system of models, and last year the team began replacing off‑the‑shelf models with custom ones trained on proprietary data. The core answering model had been a frontier labs offering—initially versions of GPT and more recently Sonnet 4.0. Now, that core answering model is Apex 1.0.
This model resolves customer issues at a materially higher rate than any other model available. One of their largest customers in the gaming space saw the resolution rate improve overnight from 68% to 75% (i.e. a reduction in unresolved conversations of 22%). The team notes they had never seen a jump this large from a single improvement since they started Fin.
Just as important, it’s dramatically faster, has fewer hallucinations, and is far cheaper than other available models—exactly the attributes operations leaders weigh most when deploying agents at scale. In practice, these are the levers that unlock higher CSAT, tighter SLAs, and better unit economics.
Achieving all three simultaneously is extraordinarily hard. Credit goes to foundational research from a 60‑person AI group run by Fergal Reid, and, crucially, to domain‑specific proprietary evals drawn from billions of human and agent interactions produced by the Fin resolution engine—already hand‑tuned to be the most effective in the category. That creates a flywheel: an eval‑driven development loop that trains models to keep improving at the edge of the system’s abilities. In other words, Apex 1.0 looks like the tip of the iceberg.
Zooming out, service is one of the few categories where generative AI has already delivered commercial impact at scale (alongside coding, and arguably the legal industry). With TAMs measured in the hundreds of billions, competition is intense and well capitalized. The pattern I’ve seen repeatedly is clear: winners in these spaces must become full‑stack AI companies. As features become ~free to build, durable competitive differentiation shifts under the hood—to proprietary data, post‑training, inference efficiency, and the quality of the eval loop.
Fin Apex raises the bar for finance-ready AI, highlighting a -65% cut in hallucinations and a quicker first token at 3.7s (0.6s faster), compared with Sonnet 4.6, Opus 4.5, and GPT-5.4 in side-by-side charts.
That’s why competitors will need to release their own models. Many appear to be just starting to hire the talent to do so, which likely gives Fin at least a year of head start. For product leaders, this is a strong signal to revisit build vs buy assumptions, and to quantify when owning your post‑training pipeline and evals becomes the rational move.
Honestly, 2–3 years ago I expected AI application differentiation to live mostly in what we built around third‑party models. The AI game humbles all of us; today it’s obvious that vertical models paired with proprietary evals create compounding moats.
In a podcast interview last week, Andrej Karpathy said:
"I do think we should expect more speciation in the intelligences. The animal kingdom is extremely [diverse] in the brains that exist. And there’s lots of different niches of nature… And I think we should be able to see more speciation. And you don’t need this oracle that knows everything. You kind of speciate it. And then you put it on a specific task. And we should be seeing some of that because you should be able to have much smaller models that still have the cognitive core."
The frontier labs still have the very best models, but open‑weight models aren’t far behind—making pre‑training look increasingly like a commodity. The frontier is moving to post‑training, which is precisely what we see with Apex (and Cursor’s Composer 2), and what we should expect to dominate going forward.
Labs now face a dual reality. On one hand, horizontal general‑purpose models can over‑serve specific verticals (e.g., customer service doesn’t need an oracle that knows everything). On the other, open‑weight models are good enough that high‑quality, domain‑specific post‑training can produce superior models for special‑purpose jobs—and in the ways that matter for those jobs. In service, soft factors like judgement, pleasantness, and attentiveness matter alongside hard factors like resolution effectiveness, speed, and cost.
I’m still bullish on the labs. Many organizations remain heavy customers of Anthropic—whether as part of multi‑model systems or through deep usage of Claude Code in engineering teams (see this example of Claude Code adoption). Yet classic disruption (à la the late, great Clay Christensen) is now at their door. The way out is to disrupt themselves by building cheaper specialized models too, which likely requires acquiring the evals—or the companies with the evals—needed for each task. Expect creative data partnerships, M&A consolidation, and a wave of hyper‑specific model providers that compete head‑to‑head with the labs.
In the meantime, Fin appears to be the only vendor in its space with a custom model that’s also objectively superior to everything else out there. I’m excited to see it deployed broadly for end customers, and I’m watching closely for the next announcement that will accelerate that rollout. For product leaders, the message is clear: the age of vertical models and agentic AI is here—bring your evals, or bring your checkbook.
Making the leap from engineer to CEO demands an almost entirely new skillset. I’ve felt that jolt firsthand: the tools that serve you as an IC or even a product leader—system design, crisp PRDs, elegant roadmaps—only get you about 20% of the way. The rest is learning to orchestrate go-to-market strategy, finance, hiring, culture, and product positioning with just enough depth to make sound, fast decisions while empowering true experts to execute.
My operating heuristic is the 80% rule. As CEO or GM, I don’t need to be the best marketer, seller, or finance leader; I need to understand 80% of each function well enough to set a compelling product strategy, ask the right questions, and catch the second-order effects. That breadth unlocks speed, quality of judgment, and the conviction to say no when the organization is tempted by what it can do rather than what it must do.
The clearest illustration comes from the journey that turned Apache Kafka—originally built at LinkedIn—into Confluent, a publicly traded enterprise software company. The technical insight was powerful, but the real lift came from translating that insight into a repeatable go-to-market engine. That required building new muscles: founder-led GTM, enterprise sales orchestration, and open source monetization without alienating the community that fueled adoption.
Early on, the product was “embarrassing” by enterprise standards—thin features, sharp edges, and a long tail of operational gaps. Shipping anyway was the point. A thin vertical slice into the market created learning loops with real customers, not hypotheticals. That uncomfortable speed became a superpower, especially when the company decided to push toward a cloud-first business in the face of widespread opposition.
The messaging challenge was just as hard as the technical one. Most marketing fails because it starts with what we built, not what customers must achieve. A simple product marketing pyramid—vision at the top, category framing and points of parity in the middle, crisp value props and proof at the base—helped explain Kafka to the world in customer language. When the narrative snaps into place, adoption accelerates. In Kafka’s case, one well-timed blog post clarified the “why now” and unlocked a step-change in community and enterprise pull.
There’s a pivotal distinction leaders underestimate: the gap between what a company can do and what it must do. I use a must-do filter before every planning cycle: What moves are non-discretionary for durable product-market fit? For Kafka and Confluent, that meant ruthless prioritization on managed cloud services, reliability, and platform scalability—even when it jeopardized short-term revenue or required retooling how engineering, sales, and support worked.
Fundraising strategy mirrored this clarity. Planning to raise before building the full product wasn’t about hype; it was about matching capital to the physics of the problem. If your category requires enterprise credibility, global infrastructure, and 24/7 SRE, you finance those table stakes early. That’s first principles decision making: instrument the constraints, then design the sequence that gets you to scale with the fewest irreversible mistakes.
In the early years, every product decision felt like a trade between polish and learning. The team essentially bludgeoned its way into a cloud-first posture—less because the initial product was ready, and more because the market’s must-do was obvious. That’s the essence of founder-led GTM: get into the field, close lighthouse customers, and use their arcs to shape the roadmap. It’s also where open source monetization matures from downloads into durable, enterprise value.
As the organization scales, excellence often erodes—the Chipotle problem. Process hardens; quality blurs; the magic decays. The antidotes are simple but hard: a few non-negotiable product quality bars, a short set of product-market fit metrics that everyone can recite, and empowered product teams who own outcomes over output. This is where organizational development matters as much as code: design clear interfaces between product, sales, and success, and you’ll keep velocity without losing standards.
Contrary to popular lore, founder optimism is overrated. Constructive realism wins. I try to model “probabilistic optimism”: assume we will win, but instrument the journey like an SRE runs an incident. Set leading indicators, rehearse failure modes, and make pre-commitments to the must-do path so you’re not swayed by the latest anecdote. It keeps the team out of a failure mindset while making room for rigorous course correction.
Giving up the right things at the right time is a CEO superpower. As complexity grows, I hand off decisions that benefit from specialization and keep only those tied to company narrative, must-do prioritization, and talent bar. CEO time management becomes a portfolio problem: ensure each week contains deep product time, frontline customer exposure, and one compounding systems fix (hiring loop, pricing rubric, or GTM enablement) that pays back for quarters.
If you’re moving from IC or PM into a GM/CEO role, here’s a practical playbook: build your product marketing pyramid; write the one-page must-do memo for the next six quarters; ship a narrow, managed cloud slice early; pick three product-market fit metrics (usage, time-to-value, retention) and publish them company-wide; and architect an enablement engine that turns field learnings into roadmap changes within one quarter. That’s how you transform technical advantage into a category-defining business.
The Kafka-to-Confluent arc reminds me that technology can open a door—but clarity of narrative, sequencing, and must-do focus determines whether you walk through it. When in doubt, bias toward shipping, talking to customers, and tightening the loop between what you learn and what you build. That’s the work of product management leadership at scale.
I’ve curated a focused set of product marketing insights that zero in on what actually moves the needle—turning data into decisions. You’ll find a special emphasis on Amplitude Analytics, because its behavioral analytics foundation makes it easier to translate product usage into clear messaging, sharper positioning, and measurable growth.
In my day-to-day as a product leader, I’m constantly bridging the gap between product discovery and go-to-market strategy. The best outcomes come when we connect quantitative signals to narrative: using behavioral analytics to inform the value proposition, refining product positioning with cohort trends, and driving product-led growth with activation and retention insights.
Here’s how I put this into practice. I start with user activation and retention analysis to identify the few behaviors that predict long-term value. Then I run tightly scoped A/B testing to validate messaging and in-product prompts that nudge those behaviors. When the numbers move, I translate wins into a consistent story—one that sales, success, and marketing can all rally around.
One pattern keeps repeating: clarity beats complexity. Instead of piling on more features, I focus on the minimum, verifiable set of behaviors that correlate with outcomes. That discipline makes it easier to craft a crisp value proposition, streamline go-to-market strategy, and accelerate feedback loops between product, design, and marketing.
As you explore this collection, expect practical playbooks over platitudes. You’ll see how to apply Amplitude Analytics to uncover hidden friction, validate hypotheses faster, and operationalize product-led growth motions that compound over time. My goal is to help you move from interesting dashboards to decisive actions that strengthen your roadmap and your revenue.
If you care about building empowered product teams that learn continuously, you’ll feel at home here. Dive in, borrow what works, and adapt the rest to your context—then measure it, iterate, and share the wins with your team.
Inspired by this post on Amplitude – Best Practices.
Healthcare leaders ask me the same question every week: how do we unlock AI-driven insights without risking patient trust or regulatory missteps? My approach is pragmatic and proven—connect business goals to measurable behavioral analytics, wrap everything in clear governance, and keep protected health information (PHI) out of the analytics layer by default. In other words, we earn the right to scale by making safety, compliance, and transparency visible in every step of the workflow with Amplitude AI.
At the core, I anchor our rollout on "governed analytics"—curated events, certified metrics, and role-based access that make audits straightforward and decision-making fast. When product, data, security, and compliance share a single source of truth in Amplitude analytics, we reduce rework, eliminate ambiguous definitions, and ship improvements with confidence. This is where AI Strategy meets operational excellence: a unified analytics platform that balances velocity with verification.
From there, I establish "PHI-safe workflows" by drawing a hard boundary around what data enters analytics. Behavioral signals flow in; identifiers stay in clinical systems. I lean on privacy-by-design, data minimization, and clear data governance so we can demonstrate regulatory compliance before a single end user is exposed to a new AI-powered experience. That alignment builds trust with legal and security, shortens review cycles, and operationalizes AI risk management without slowing innovation.
Insights must be "trusted insights"—reliable enough to drive care pathways, staffing decisions, and patient communications. I emphasize repeatable instrumentation, observability of data quality, and transparent lineage so teams can trace outcomes back to inputs. In practice, that means we agree on event contracts, enforce change control, and verify that behavioral analytics reflect real-world adoption and efficacy across patient and provider journeys.
To move decisively from legal review to production, I run a two-speed rollout. First, we validate in a sandbox with synthetic or de-identified data to pressure-test prompts, dashboards, and alerting. Then we graduate to controlled pilots with strict guardrails, documented data flows, and pre-agreed risk mitigations. By the time we scale, stakeholders have evidence, not just assurances—accelerating approvals and reducing last-minute scope churn.
One pattern I rely on is connecting AI outcomes to product metrics that matter: activation, time-to-first-value, task completion rates, and variance in outcomes across segments. With Amplitude analytics, we can spot drop-offs, attribute improvements to specific design or model changes, and quantify impact in language that resonates with executives and clinicians alike. That rigor is what transforms AI from a promising prototype into a dependable operating capability.
Success looks like faster time-to-insight, fewer compliance iterations, and audit-ready documentation built into normal workflows. It also looks like teams who are confident enough in their data to run A/B testing and continuous discovery—because they know their dashboards reflect reality. When governance, safety, and clarity are designed in, product-led growth becomes compatible with healthcare’s unique regulatory and ethical obligations.
"See how to adopt AI in healthcare safely with Amplitude, using governed analytics, PHI-safe workflows, and trusted insights that help teams move from legal review to real usage." That’s the journey I guide teams through—measurable, compliant, and humane—so we can deliver AI that clinicians trust, patients respect, and leaders can scale.
Inspired by this post on Amplitude – Perspectives.
Customer experience is where strategy, data, and execution converge—and where AI can deliver compounding value when thoughtfully designed. In my work, I’ve seen how the right CX vision becomes a growth engine when it’s operationalized through clear measures, robust analytics, and disciplined product practices.
"Amanda Sime is the Customer Experience Strategy Lead at Amplitude. She shapes CX strategy and partners across orgs to design and scale AI-powered solutions." That concise description captures a model I deeply respect: start with a strong CX strategy, then partner across the organization to make AI real in the day-to-day. It’s not just about new technology; it’s about aligning teams, systems, and incentives to deliver consistent customer value.
Translating that approach into practice requires a rigorous AI Strategy, anchored in measurable outcomes and informed by behavioral analytics. I prioritize journey mapping to expose friction, then connect those insights to AI workflows that enhance customer success and in-product guidance. When cross-functional partners—from solutions engineering to support—operate from a shared driver tree, the roadmap balances speed with sustainability.
Data is the backbone. A unified analytics platform—often centered on Amplitude analytics—helps teams move beyond vanity metrics to track user activation, feature adoption, and retention analysis with precision. With that foundation, we can test responsibly, iterate quickly, and validate impact with product-led growth motions that scale across segments without sacrificing quality.
Operational excellence matters just as much as vision. I’ve learned to treat CX programs like enduring products: build reliable feedback loops, connect customer support AI strategy to clear service-level outcomes, and empower product management leadership to make evidence-based tradeoffs. When teams have clarity on the problem space and access to trustworthy insights, they deliver solutions that feel both intelligent and human.
The real win is cultural: empowering product trios and partner teams to co-own outcomes, not just outputs. That’s how AI moves from a promising experiment to a durable capability—by aligning strategy, analytics, and execution so customers experience value at every touchpoint.
Inspired by this post on Amplitude – Perspectives.
I followed the energy at Fin Labs Paris and immediately zeroed in on the announcement of Monitors. In my view, it’s the missing piece that turns Fin’s powerful automation into an observable, trustworthy system—sitting alongside Insights and Recommendations to form a complete observability suite that gives teams confidence in what Fin is doing.
With Monitors, you define what conversations get reviewed, both Fin and human, and set evaluation criteria using Custom Scorecards. That level of control ensures you’re measuring the metrics that matter most to your business and holding support quality to your bar, not a generic one.
Used in concert with Insights and Recommendations, you can finally see what’s happening across your support operation, evaluate every conversation against your standards, and take targeted action to continuously move toward perfect customer experiences.
As Agents become more powerful, transparency and control become critical. I’ve seen this shift firsthand: AI is advancing fast, and the stakes are no longer theoretical—Agents are resolving real customer issues with real consequences at scale.
Visualizing the AI development flywheel—Train, Test, Deploy, Analyze—this graphic spotlights Analyze in orange to introduce Monitors, turning opaque model behavior into measurable signals and continuous customer service insights.
Fin has almost 8,000 customers, averages a 67% resolution rate, and resolves close to 2 million customer queries every single week, including highly complex queries in regulated industries.
At that scale, observability isn’t a nice-to-have; it’s a necessity. Traditional CSAT and small QA samples weren’t built for Agent-led operations—they miss edge cases, don’t scale, and can’t explain drift. The result is a black box. What teams need most right now is confidence, built on data you can trust and act on.
At Intercom, this is called the Fin Flywheel: Train, Test, Deploy, Analyze.
See inside Intercom's Monitors: a streamlined dashboard with pass‑rate charts and review queues, alongside a panel to define a 'Vulnerable customers' monitor, test it on sample chats, and run continuous checks.
Analyze is the step where you find out what’s actually happening and it’s where improvement begins.
In my experience, achieving confidence in an AI support operation requires three things: (1) a complete understanding of what Fin, your human team, and your customers are talking about; (2) a way to monitor and score conversations based on the criteria that matter most to your business; and (3) AI-powered recommendations that make it easy to act on what you find. Intercom launched Insights and Recommendations to address the first and third. Now, Monitors completes the system for full observability and opens the black box.
Monitors: know whether every conversation met your standards. Customer sentiment is important, but it’s different from determining whether a conversation was handled correctly. With Monitors, you can do both—and do it at scale.
Customer support leaders praise Monitors for turning AI performance from a black box into measurable signals. This quote from Ineke Oates of Agorapulse highlights the shift from manual spot checks to continuous quality tracking.
Monitors is a new QA capability that delivers a structured, repeatable way to define which conversations get reviewed and evaluate them against quality criteria you set. It replaces ad-hoc sampling and spreadsheet-driven QA with a system that scales as your volume grows.
Two components work together: Monitors define what gets reviewed and Custom Scorecards define how each conversation is evaluated. That pairing brings the rigor of Agent Analytics and the discipline of eval-driven development to everyday CX operations.
Random sampling has always been a blunt tool. When AI is handling thousands of conversations a week, a small, arbitrary slice won’t reliably capture your highest-risk edge cases, your most complex escalations, or where quality is starting to drift. I’ve felt that pain in operations reviews—too many unknowns, not enough signal.
Open the AI black box with Monitors: track conversations, triage unreviewed items, and build transparent scorecards with criteria like accuracy, process adherence, and efficiency to lift customer support quality.
With Monitors, you select and evaluate conversations with intent. You can target specific signals of risk or failure, like “the customer showed signs of financial vulnerability” or “Fin looped around with the same answer without resolving the issue.” Or you can create consistent, repeatable samples to benchmark quality over time. Use the existing library of filters (customer data, channel, Fin-specific metrics) or describe nuanced scenarios in natural language. Most teams will do both: hone in on the conversations that matter most and maintain a steady, structured QA sample each week.
"When I saw Monitors, my first reaction was — this is exactly what we need. The ability to track quality continuously, instead of relying on spot checks, is a big shift for us." Ineke Oates, Head of Support, Agorapulse
Custom Scorecards make your standards explicit and enforceable. One-size-fits-all rubrics never reflect your brand voice, industry constraints, or customer expectations. With Custom Scorecards, you define what “good” looks like for your business and turn that into a measurable, comparable quality score for every conversation.
A customer testimonial underscores the promise of Monitors: bring quality assurance into the flow of work, unifying AI assistant Fin and human agents in a single place for faster, clearer customer support.
You define the criteria that matters, how each should be measured, and how important each one is. Some criteria can be scored automatically by AI, others reviewed by a human, or both — all within the same scorecard. This means you’re not choosing between scale and judgment; you get both in one system.
Each conversation is then evaluated against these criteria, and the system calculates an overall quality score based on your configuration. You can weigh what matters most, or mark certain criteria as critical, so a single failure can fail the entire evaluation when needed.
The result is a single, consistent quality score that reflects your standards—not a generic metric, and not a collection of disconnected checks. That’s what makes quality measurable over time and comparable across AI and human support.
Monitors helps open the AI black box by turning model outputs into trackable reviews. This clean queue groups customers, monitor types, scores, and actions—with AI auto-review—so teams improve quality faster.
There’s an important distinction here: CX Score tells you how customers felt about a conversation. Custom Scorecards tell you whether it met your standards. You need both.
"We looked at dedicated QA tools, but what's compelling about Monitors is that it lives where our conversations already happen. We don't need another system — we can run QA across Fin and our human team in one place." Jared Ellis, Senior Director, Global Product Support, Culture Amp
When a conversation meets your criteria for review, Monitors routes it into a Review Queue. Each conversation is assigned to the right reviewer with its scorecard attached and status tracked end to end: Not reviewed, Reviewed, Needs a fix, Fix complete. Reviewers work directly in Intercom, capture what went wrong, and propose concrete fixes—like updating documentation or refining a workflow—so quality loops end in action, not just scores.
Monitors turn AI performance from opaque to measurable. The Fin quality view summarizes review score, pass rate, and review counts while a time‑series chart tracks escalation ease, clarification, and efficiency—delivering fast, actionable CX insights.
Reporting turns QA into a continuous signal rather than a one-off audit. You can track review scores over time across Monitors and Scorecards, and compare them directly to CX Score, resolution rate, and other performance metrics. Patterns that were previously invisible become clear: a topic consistently underperforming, a quality dip correlated with a recent knowledge base change, or a team whose scores are improving week over week. This is observability applied to CX—evidence you can act on.
Monitors for Fin conversations is live today, and the roadmap goes further. Human agent QA will bring the same structured evaluation to your human team’s conversations, creating one consistent quality system across your entire support operation.
Real-time alerts will notify you the moment a conversation crosses a threshold you’ve defined—before the issue reaches more customers and risks compounding negative sentiment.
Kick off your journey with the #1 Agent—an AI partner designed to turn resolutions into real outcomes. Tap “Start a free trial” to explore faster, smarter customer service and see how Fin delivers value from day one.
Knowledge base evaluation will connect AI scoring directly to your content so conversations are assessed against your latest policies and documentation, catching inaccurate or outdated responses and providing clear rationale linked to the relevant source.
Creating perfect customer experience with AI requires transparency. You need to understand how the system is performing if you want to maintain and improve quality over time. With Insights, Monitors, and Recommendations, this is now possible—a complete analysis suite that lets you see what’s happening across every conversation, ensure it meets your standards, and pinpoint improvement opportunities when they matter most.
I’ve long advocated for a retrieval-first, eval-driven approach to AI Strategy because it makes risk visible and manageable. Monitors operationalizes that philosophy for CX leaders: you get continuous signal, shared definitions of quality, and a direct path from flags to fixes. If you’re scaling AI support, this is how you replace uncertainty with control—and turn the black box into a competitive advantage.
Are you an AI product manager or want to become one? This guide cuts through the noise and shows where the PM role is really heading with AI.
I’ve spent the last few years scaling AI initiatives across complex SaaS products, and I’ve learned that “AI product manager” isn’t a vanity title—it’s a capability set. The role evolves traditional product management with new responsibilities across data, model behavior, risk, and continuous learning systems. My goal here is to demystify what matters, so you can lead with clarity, build with confidence, and deliver measurable outcomes.
First, let’s separate hype from reality. An effective AI Strategy starts with the customer problem, not the model. I anchor roadmaps around clear use cases, then evaluate whether we need a retrieval-first pipeline, agentic AI, or conventional automation. “Build vs buy” is no longer a procurement question; it’s a lifecycle question about iteration speed, quality control, data governance, and long-term unit economics.
Discovery also looks different. I still run continuous discovery and customer interviews, but I augment them with behavioral analytics and targeted experiments to validate feasibility, risk, and value. I practice privacy-by-design and AI risk management from day one, and I define guardrails for acceptable model behavior alongside success metrics. When high stakes are involved, I document data provenance and align with regulatory compliance standards to protect customers and the business.
Execution shifts from shipping static features to operating learning systems. In product roadmapping and sprint planning, I account for context window management, prompt engineering, and the realities of LLMs for product managers: latency, cost, drift, and failure modes. I use feature flags, A/B testing, and eval-driven development to move from offline model evals to online impact with a minimum detectable effect (MDE) worth the release risk. Observability, anomaly detection, and incident management aren’t optional—they’re how we earn trust.
Collaboration expands beyond engineering and design. I work closely with data science on evaluation frameworks, with solutions engineering to de-risk complex enterprise deployments, and with customer success to close the loop on model performance in the wild. Our outcomes vs output OKRs emphasize activation, time-to-value, and sustained retention over vanity accuracy metrics.
Tooling is now strategic advantage. My AI product toolbox includes prompt libraries with versioning, synthetic data generation where appropriate, and a disciplined approach to model and prompt regression tests. I standardize AI workflows—intake, evaluation, deployment, and monitoring—so teams can ship faster without cutting corners. This is how empowered product teams scale safely.
Career-wise, I look for—and coach—PMs who can frame trade-offs crisply: explain when to fine-tune vs use retrieval, when to embed agents, and when not to use AI at all. Show me driver trees that connect model metrics to business outcomes, a clear risk register, and a plan for continuous discovery. If you can tell a compelling story backed by transparent evaluation and customer value, you’re already ahead.
Here’s the bottom line: the “AI product manager” that matters in 2026 is a product leader who can turn uncertainty into systematized learning. If you focus on real customer problems, rigorous evaluation, responsible design, and iterative delivery, you won’t just carry the title—you’ll create durable competitive differentiation.
I’ve watched too many AI agent deployments celebrate velocity while overlooking the one thing that determines long-term success: whether real users are actually getting value. Dashboards tend to spotlight model upgrades, prompt tweaks, and launch counts, yet they rarely quantify task completion, trust, or time-to-value. That blind spot isn’t technical—it’s human.
Enterprises are spending 93% of their AI budget building agents and almost none know if those agents are actually working for users. Pendo Agent Analytics closes the gap.
In my product reviews, I look for evidence that agentic AI is improving outcomes across the customer journey, not just the demo path. Without behavioral analytics and observability, teams optimize for throughput instead of resolution, for novelty instead of reliability. This is where eval-driven development, A/B testing, and rigorous cohort analysis become non-negotiable: they translate agent performance into user impact we can measure and improve.
Here’s the pattern that works for me: define user-centric success metrics first, then let the AI follow. I prioritize signals like successful task completion, low-friction activation, reduced escalations, and sentiment lift—tied directly to product-led growth indicators such as retention and expansion. When these metrics move in the right direction, I know the agent is creating compounding value, not just answering faster.
Practically, I operationalize this with an analytics spine that captures end-to-end agent interactions: intents, prompts, responses, clarifying turns, handoffs, and final outcomes. I segment by persona, journey stage, and account tier to uncover where agents delight and where they degrade trust. With this foundation, I can run controlled experiments, spot anomalies early, and connect improvements in agent behavior to improvements in business performance.
Pendo Agent Analytics closes the loop by making these user outcomes visible and actionable. Instead of guessing whether an agent helped or hindered, I can analyze where users stall, which prompts or skills drive completion, and how interventions like in-app guides or product tours change behavior. That visibility lets me tune models and experiences in days, not quarters—and gives stakeholders confidence that our AI investments are paying off for customers.
If you’re scaling agents today, start small but instrument deeply: map top user intents, define offline and online evals, A/B test prompts and policies, monitor regressions, and tie every improvement to activation, adoption, and retention. The result is a durable feedback loop that keeps agents aligned with user value as your surface area grows.
AI agents are not a destination—they’re a capability. When we anchor that capability to clear user outcomes and measure it with the right analytics, we stop flying blind and start compounding advantage. That’s how we turn promising demos into dependable products.
What happens when an AI starts giving advice in your voice—advice you’d never actually give? I’ve been thinking a lot about that question, and this conversation hit home for me as a product leader navigating the fast-evolving reality of AI “clones.”
Listen to this episode on: https://open.spotify.com/episode/7DNDIlIimwbbMOytArewRp?ref=producttalk.org | https://podcasts.apple.com/kh/podcast/bad-advice/id1794203808?i=1000756914818&ref=producttalk.org. Prefer video? Watch on YouTube: https://www.youtube.com/embed/RF4BwaeMMlg?feature=oembed
The episode examines AI “clones” built from podcast transcripts and public content—where the experimentation feels exciting, where it crosses ethical lines, and what happens when mediocre AI outputs get attributed to real people. The tension is real: when a bot confidently answers in your style but misses the nuance, “it’s not me” becomes more than a disclaimer—it’s a reputational defense.
We dig into the messy parts: IP ownership of open-sourced transcripts, the role of pirated books in LLM training sets, rising inference costs, and the uncomfortable economic question: if anyone can prompt “act like Teresa,” how do creators make a living? In my own decision-making, I look for clear consent, guardrails that prevent impersonation, and transparent UX that never confuses a synthetic perspective with a human expert.
This isn’t anti-AI. It’s a nuanced conversation about quality, consent, and remembering there are real humans behind the ideas.
Here’s how I translate the key takeaways into practice. Using AI for perspective is fine—equating it to the real person isn’t. Free-feeling AI outputs still rely on someone’s work. Expertise is more than past content—it’s context, judgment, and evolution. If someone’s work influences you, find a way to support them. These principles help teams benefit from gen ai without eroding trust or the creator ecosystem.
“Technically possible” doesn’t mean “ethically okay.” My AI Strategy playbook includes privacy-by-design, clear data governance on training materials, and a bright line between inspiration and impersonation. When we ship AI features, we label synthetic outputs, avoid mimicking living experts without permission, and create paths to compensate or promote the humans whose thinking underpins the experience.
I’ve also tested the “act like X” pattern to stress-test product quality. Even when outputs sound plausible, they rarely capture the expert’s mental models, trade-offs, or the evolution of their thinking—especially in complex product discovery work. That gap is the difference between average AI text and expert product management leadership.
If you listen, consider a few reflection prompts: Have you ever used AI to “act like” someone you admire? Could you tell whether the output matched that person’s actual thinking? How do you decide what’s ethically okay when using public content in LLMs? And how can we support creators while still embracing new tools?
Have thoughts on this episode or practices that have worked in your org? Share them below—I’m keen to learn how other teams are balancing innovation with integrity.
For years, I’ve watched product, growth, and data teams burn cycles stitching together manual dashboards and reports, then slogging through replay review just to validate a hunch. That overhead slows discovery and delays decisions. The promise here is different: "Discover how Amplitude AI Agents help product, growth, and data teams turn questions into action without manual dashboards, reports, or replay review." As someone obsessed with decision velocity and evidence-based product strategy, that shift is exactly what I’ve been waiting for.
In practice, I think about "Amplitude AI Agents" as always-on data analysts embedded in our workflow. Instead of queuing requests or context-switching into tooling, I can ask targeted questions, get synthesized insights, and move directly to action. This is a powerful example of agentic AI meeting behavioral analytics in a unified analytics platform—removing friction between inquiry and impact while keeping teams focused on outcomes, not artifacts.
What changes for my day-to-day? I can interrogate customer behavior in real time, pressure-test hypotheses from discovery interviews, and quickly understand whether activation, retention, or monetization is the current constraint. If I’m probing a driver tree for activation or a retention analysis for a specific cohort, I can get to a decision faster—without waiting on someone to build a bespoke dashboard. That means more cycles spent shaping product strategy and fewer sunk into report wrangling.
This matters beyond speed. When product, growth, and data leaders anchor discussions in the same source of truth, we shorten the distance from signal to decision. That alignment is the backbone of product-led growth and continuous discovery: shared context, faster feedback loops, and clearer trade-offs. It also reduces the long tail of analytics debt—those one-off reports and stale views that quietly accumulate across teams.
Of course, adopting any AI workflow in analytics demands governance. I hold these systems to the same bar I set for my teams: clarity of assumptions, consistent metric definitions, and auditable reasoning. Pairing "Amplitude analytics" with strong data governance, CI/CD for analytics definitions, and lightweight evals helps ensure the recommendations we act on are reliable, reproducible, and explainable. AI should accelerate our judgment, not replace it.
The strategic shift is simple and profound: move from building dashboards to making decisions. With always-on analysis, we can spend less time instrumenting analytics theater and more time delivering customer value. That is how we translate insights into impact—and why I’m excited to operationalize this capability across our product trios and go-to-market partners.
Inspired by this post on Amplitude – Best Practices.
Every week, I coach product and documentation teams on a simple truth I keep pinned above my desk: "AI is reading your documentation! Learn tips from the Amplitude docs team about how to structure your documentation for both human and AI audiences." That line captures the shift we’re all living through—our docs must now serve customers, support engineers, and increasingly, LLMs powering chat, search, and in‑product help.
My AI strategy for documentation starts with intent. I map the core questions users ask at activation, onboarding, escalation, and renewal, then shape information architecture to reduce ambiguity. This helps humans find answers faster and helps LLMs retrieve the right chunks with higher precision—a win for UX writing, product-led growth, and support deflection.
Structure beats style when AI is in the loop. I rely on semantic headings (H1–H3), consistent slugs, stable anchors, and one‑topic pages that can stand alone. Short paragraphs, scannable summaries, and canonical references reduce duplication and improve retrieval quality. Treat docs-as-code with CI/CD so changes are reviewed, versioned, and shipped reliably—documentation deserves the same rigor as product releases.
Chunking matters for LLMs. I design content for context window management: one concept per section, tight procedures with numbered steps, and FAQs that mirror real queries. Glossaries define canonical terms and accepted synonyms so retrieval-first pipelines match user language without fragmenting meaning. Error messages and parameter names appear verbatim to strengthen search and grounding.
Metadata is a multiplier. I add clear titles, descriptions, last‑updated dates, product area tags, and audience labels (admin, developer, analyst) to boost SEO and machine readability. Stable IDs for components, examples, and API objects improve deep linking and evaluation. Where appropriate, I include structured examples that align with prompt engineering best practices so AI assistants can extract inputs, outputs, and constraints cleanly.
Quality is measured, not hoped for. I pair content audit checklists with analytics to see what’s searched, where users pogo‑stick, and which articles drive successful task completion. Tools like Amplitude analytics reveal gaps and dead‑ends, while lightweight evals (answer accuracy, grounding rate, latency) ensure LLMs retrieve the right doc chunks at the right time.
Consistency is a feature. I standardize terminology across UI, API, and docs, and I avoid synonym sprawl that confuses both readers and LLMs. Page intros state the job-to-be-done; conclusions link to adjacent tasks; and deprecation notes are explicit with forward paths. This coherence lowers cognitive load and improves both RAG performance and human trust.
Governance keeps it scalable. I assign owners per section, define SLAs for updates, and automate checks for broken links, orphaned pages, and outdated screenshots. Redirect rules avoid 404s, and version banners prevent LLMs from mixing deprecated guidance into current answers—small details that cumulatively protect customer experience.
If you’re just getting started, begin with three moves: clarify intents, restructure pages into atomic, linkable units, and add metadata that reflects how customers actually search. From there, tighten your retrieval-first pipeline and run regular evals. The payoff is durable: faster time to value for users, lower support load, and AI assistants that answer accurately, confidently, and consistently.
Inspired by this post on Amplitude – Perspectives.