Tag: LLMs for product managers

  • Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Trust is the currency of any high-stakes AI product, and nowhere is that more true than in healthcare. I recently dug into how Healio built an AI assistant for physicians—an audience that can’t afford to be wrong—and it’s a masterclass in balancing accuracy, transparency, and speed without compromising credibility.

    Healio, a 125-year-old medical publishing company, set out to create Healio AI to help clinicians prepare for patient care. From the outset, their guiding principle was simple: physicians won’t trust you until you prove it. That lens shaped every decision—from discovery and prototyping to architecture, evaluation, and ongoing validation.

    Discovery started with a survey of 300 healthcare professionals to understand real-world needs at the point of care. The headline insight: physicians primarily want AI for preparation, not bedside use. Even more surprising, the top ask wasn’t purely diagnostic support; it was help with patient communication and empathy—translating complex information into clear, accessible conversation.

    Momentum mattered. After beginning with Figma mockups to validate workflows, the team built a working prototype in a single weekend using Cursor. That velocity wasn’t about cutting corners; it was about proving value quickly, reducing ambiguity, and iterating with concrete feedback from physicians.

    Under the hood, the system employs RAG and hybrid search—combining lexical search, vector search, and semantic search across multiple trusted sources like PubMed. As any PM who has integrated biomedical literature knows, "just use PubMed" isn’t simple—there are five different ways to access the same data, each with trade-offs. The team made pragmatic choices to balance freshness, coverage, latency, and cost while preserving trust in source quality.

    Designing for trust extended all the way to the citation UX. The team leaned into citations that physicians actually trust: subscripts, hover states, and progressive disclosure. This gave clinicians verifiable threads back to source material without overwhelming the core interaction, aligning with how experts want to audit evidence under time pressure.

    Evaluation wasn’t left to chance. They stood up eight LLM judges for evals: safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and overall quality. Just as importantly, they treated those signals as directional, not definitive. In a high-stakes domain, physician feedback trumps LLM-as-judge feedback—so they complemented automated evals with direct reviews from practicing clinicians to calibrate quality and reduce hallucinations.

    On the safety front, the team implemented HIPAA compliance and input guardrails for masking personal health information. That choice reflects strong data governance and privacy-by-design thinking: protect PHI by default, constrain prompts to safe boundaries, and make compliance a first-class citizen in the product architecture.

    They also addressed monetization without compromising experience. Serving contextual ads while the LLM processes queries is a practical approach that preserves physician workflow efficiency and creates a clear, non-intrusive revenue model.

    Critically, the work didn’t stop at launch. The Healio Innovation Partners provide ongoing discovery and validation, ensuring the system evolves with physician needs and the medical evidence base. This is the operating cadence you want for any AI product that sits at the intersection of safety, accuracy, and fast-changing knowledge.

    My takeaways for building AI in high-stakes domains: prioritize retrieval-first pipelines over model cleverness; couple RAG with hybrid search across vetted sources; design citations that earn trust at a glance; use eval-driven development, but let domain-expert feedback be the ultimate judge; and embed regulatory compliance into your product strategy from day one. If trust is your North Star, this is a playbook worth emulating.


    Inspired by this post on Product Talk.


    Book a consult png image
  • AI-Powered Growth Loops: Transform Your PLG Product into a Self-Optimizing Engine

    AI-Powered Growth Loops: Transform Your PLG Product into a Self-Optimizing Engine

    Across my teams and portfolio, I’m watching AI fundamentally reshape product-led growth—from static funnels and one-off playbooks to adaptive, compounding growth loops that learn in real time. The shift isn’t just technological; it’s an operating model change that rewards continuous discovery, rigorous instrumentation, and outcome-driven product strategy.

    "Learn how AI is transforming PLG with a new generation of growth loops that can turn your product into a self-optimizing platform." That line captures what I’ve been building toward: systems that sense user intent, decide the next best action, act contextually, and learn to improve the loop with every interaction.

    Here’s the core pattern I rely on. First, sense: unify product analytics and behavioral signals (think Amplitude analytics, Pendo events, Intercom conversations) into a single, queryable, privacy-safe layer. Second, decide: apply AI Strategy—LLMs for product managers, rules, and retrieval—to segment users by intent and probability of success. Third, act: deliver in-app guides, product tours, tooltips, or personalized nudges that accelerate user activation and time-to-value. Finally, learn: run A/B testing with a clear minimum detectable effect (MDE), then feed outcomes back into the model for continuous optimization.

    Activation is where the gains start compounding. With gen ai, I can auto-generate tailored onboarding checklists, dynamic walkthroughs, and contextual help that adapts to the user’s role, data maturity, and current friction points. We’ve moved from generic product tours to precision guidance that updates based on real-time behavior—often lifting first-week activation and shortening time-to-first-value without adding support load.

    Experimentation is the governor that keeps speed and quality in balance. I instrument every growth loop end to end and pair eval-driven development with A/B testing to confirm incremental impact. Amplitude analytics gives me cohort views and path analysis; Pendo or Intercom can deliver in-app variants; a unified analytics platform closes the loop on retention analysis so I’m not optimizing for click-through at the expense of long-term value.

    Retention and expansion are where AI shines as a compounding engine. Retrieval-first pipeline patterns allow instant, contextual support that deflects tickets and boosts perceived product competence. Agentic AI can orchestrate next-best actions—prompting power users toward advanced features, surfacing value moments, or timing expansion prompts when success signals appear. The result is a virtuous cycle: better guidance drives deeper adoption, which improves model accuracy, which unlocks more relevant guidance.

    None of this works without guardrails. I bake in AI risk management from the start: strict data governance, privacy-by-design, human-in-the-loop review for high-impact actions, transparent user consent, and continuous drift monitoring. The goal is reliable automation that users trust—augmented by clear fail-safes when confidence drops.

    Operationally, I anchor the work in empowered product teams and product trios, focus on outcomes vs output OKRs, and practice continuous discovery to validate problems and solutions before scaling. The baseline metrics I watch: activation rate, time-to-value, week-four retention, PQL/PQA conversion, expansion revenue, and support deflection—each tied to a specific growth loop hypothesis.

    If you’re starting fresh, begin with the highest-leverage loop: user activation. Instrument your onboarding journey, define the critical path to value, ship two to three personalized interventions, and measure impact with a precommitted MDE. Scale what wins, drop what doesn’t, and iterate weekly. Once activation is compounding, extend the same approach to adoption depth, collaboration features, and expansion triggers.

    In practical terms, AI-powered PLG is less about flashy features and more about disciplined feedback loops. Build the sensing fabric, keep the decision layer auditable, ship small actions quickly, and treat learning as the product. Do that, and your product doesn’t just grow—it becomes a self-optimizing platform.


    Inspired by this post on Product School.


    Book a consult png image
  • How I Harness AI to Supercharge Product Discovery for Faster Research, Prototyping, and Validation

    How I Harness AI to Supercharge Product Discovery for Faster Research, Prototyping, and Validation

    I’ve led product teams through countless discovery cycles, and nothing has accelerated our learning loops like AI. By weaving AI into our continuous discovery practice at HighLevel, I cut time-to-insight, reduce risk earlier, and keep our product strategy relentlessly focused on customer outcomes.

    AI streamlines product discovery by accelerating research, prototyping, and validation, enabling teams to make faster, smarter, and user-driven decisions.

    In the research phase, I use gen ai and LLMs for product managers to synthesize interviews, cluster themes, and surface unmet needs in minutes instead of days. Pairing those qualitative insights with behavioral signals in Amplitude analytics helps me spot high-intent cohorts and friction points at scale, so our problem framing is both human-centered and data-backed.

    From there, I translate insights into crisp hypotheses and prioritize with the Kano Model and outcomes vs output OKRs. To keep experiments honest, I define a minimum detectable effect (MDE) up front and design A/B testing plans that reflect realistic traffic and seasonality, ensuring our decisions are statistically grounded rather than anecdotal.

    Prototyping is where gen ai for product prototyping really shines. I spin up multiple UX flows, UI copy variants, and edge-case scenarios using prompt engineering, then iterate with rapid feedback from product trios. When needed, I mock in-app guides and product tours to validate onboarding concepts before we commit to code, preserving velocity without sacrificing quality.

    For validation, I lean on a mix of lightweight experiments—fake-door tests, concierge pilots, and targeted A/B testing—augmented by in-product surveys via Pendo or Intercom. For AI-powered features, I apply eval-driven development to measure relevance, latency, and safety, so we can ship responsibly while maintaining the pace of learning.

    This approach only works when the team is structured to move fast. Empowered product teams and product trios own discovery end-to-end, with clear guardrails around data governance, privacy-by-design, and AI risk management. That alignment lets us shift from opinions to evidence, and from output to outcomes, without friction.

    If you’re getting started, pick one discovery loop to transform: automate research synthesis, prototype two to three variants with AI, and validate with a tightly scoped experiment. Instrument your analytics, track time-to-insight and time-to-prototype, and iterate your product roadmapping and sprint planning with what you learn. The payoff is immediate: faster cycles, stronger conviction, and a more user-driven path to product-led growth.


    Inspired by this post on Product School.


    Book a consult png image
  • From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

    From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

    Anyone who has lived inside construction tendering knows the grind. "When a construction company receives a bid request, someone has to open that email, parse the attached PDF (sometimes 1,800 pages describing an entire building), figure out which products are relevant, look up pricing, and draft a quote—all before the deadline. It's tedious, error-prone, and surprisingly manual." That painful reality is exactly why this conversation about Tendos AI caught my attention—and why it matters for product leaders building agentic AI in complex, document-heavy workflows.

    I listened as Daniel Kappler and Matthias Hilscher from Tendos AI walked through how they’re automating the tendering workflow for manufacturers in the construction industry. What began as a narrow prototype—matching radiator requests to product catalogs—has matured into a full agentic system that does the heavy lifting from email categorization to offer generation. The end result: a scalable AI workflow that tackles messy inputs, orchestrates specialized agents, and produces quotes that are ready for human review—or even straight-through processing.

    What impressed me most was the rigor. They validated the opportunity with a design partner, spent a week on-site observing real workflows, and then engineered a multi-agent architecture where specialized agents collaborate, including a "review agent" that checks work before anything reaches a human. They evaluate each agent independently (not just the whole chain), built custom observability when off-the-shelf tooling fell short, and use human-in-the-loop feedback to push toward a self-learning system.

    From a product management perspective, this is agentic AI done right. It blends continuous discovery with eval-driven development, thoughtful UX decisions, and pragmatic guardrails. Evaluating agents individually makes debugging tractable and change detection transparent; a dedicated "review agent" mirrors code review to reduce error propagation; and custom tracing plus Agent Analytics provide the observability needed to operate AI workflows reliably at scale.

    My key takeaway: "Start narrow to prove value: Tendos AI began with just radiators for one design partner before expanding to all building products"—a classic wedge strategy that accelerates learning while building credibility.

    Another takeaway I’ll adopt in future roadmaps: "Own the interface: building a web application (vs. integrating into legacy systems) gave them control over UX and the ability to iterate toward full automation." Controlling the surface area let them move faster than a purely backend integration ever could.

    On measurement and reliability, I loved this: "Evaluate each agent, not just the chain: per-agent evals make debugging tractable and show exactly where performance changed." That’s true eval-driven development—aligning metrics to decision points rather than only outcomes.

    Quality gates matter in automation, and they nailed it: "Use review agents: a separate agent that checks work (like code review) catches errors before they reach humans." It’s a simple pattern with outsized ROI.

    Finally, the product-market signal is unmistakable: "Let customers pull you: customers asked Tendos to replace their CPQ software—strong signals of product-market fit." When buyers invite you to displace existing systems, you’re past validation and into expansion.

    If you’re exploring agentic AI for enterprise workflows, the themes here are gold: the tendering chain in construction is ripe for automation; domain expertise accelerates opportunity discovery; robust entity extraction across PDFs ranging from 1 to 1,800+ pages is non-negotiable; planning patterns for creating and updating task plans matter; agents must reason about product fit against customer requirements; custom tracing and observability unlock debugging for complex agent chains; and human feedback loops pave the path to self-learning systems.

    Guests: Daniel Kappler — CPO (Product & Design), Tendos AI; Matthias Hilscher — CTO (Engineering), Tendos AI.

    Want to dive deeper? Listen to this episode on: Spotify | Apple Podcasts.

    Explore the team and product: Tendos AI.

    For builders of agentic AI, here’s my playbook distilled from this story: start narrow to earn trust and accuracy; own the interface to speed iteration; use per-agent evaluations to localize issues; add a "review agent" as a quality gate; invest early in tracing, observability, and Agent Analytics; keep humans in the loop until your metrics justify autonomy; and let strong pull signals guide your roadmap. That’s how you turn complex emails and massive PDFs into precise, production-grade quotes—consistently.


    Inspired by this post on Product Talk.


    Book a consult png image
  • AI Ethics That Win Trust: The Product Manager’s Playbook for Safe, Scalable Innovation

    AI Ethics That Win Trust: The Product Manager’s Playbook for Safe, Scalable Innovation

    I’ve learned that the fastest way to lose customers with AI is to ship something powerful but unpredictable. The fastest way to earn their loyalty is to ship something powerful and trustworthy. That’s the job.

    AI ethics in product management isn’t about theory anymore. It’s the line between trusted products and unpredictable ones. Here’s what PMs need to know.

    When I frame AI ethics for my team, I translate principles into practices that protect customers and accelerate velocity. We bake trust into product strategy, delivery, and operations—so ethics is not a separate checklist, but a core capability that compounds over time.

    First, I anchor the roadmap on explicit outcomes and guardrails. We set success metrics alongside ethical constraints, tying them to outcomes vs output OKRs, so teams know not only what to achieve but what to avoid. If a feature can’t meet our trust thresholds, it doesn’t ship—no matter how impressive the demo.

    Data is where trust starts. We enforce data governance from day one: clear data lineage, collection minimization, role-based access, and privacy-by-design defaults. We document lawful bases for processing, consent flows, and retention policies, then automate checks so they run with every change—not just at launch.

    On the model side, we use eval-driven development to turn subjective “looks good” into measurable quality. We design evaluations for safety, bias, robustness, and performance; we red-team prompts; and we test failure modes in realistic conditions. For LLMs, we lean on a retrieval-first pipeline to ground responses in authoritative data, and we apply context window management and prompt engineering patterns to reduce hallucinations.

    In the product experience, we make ethical choices visible. That means clear disclosures when AI is in the loop, user controls to review and correct outputs, and transparent UX writing that avoids overclaiming. In-app guides and thoughtful tooltip design help users understand capabilities and limits without friction.

    Shipping safely requires operational discipline. We build kill switches, human-in-the-loop overrides for high-risk actions, and incident playbooks that pair incident management with threat detection and response. SRE partnerships ensure observability covers both model behavior and customer impact, with rollback paths ready when drift or regressions appear.

    Governance is a team sport. I maintain an AI risk register, review it with security, legal, and product trios, and brief leadership on residual risks and mitigations. Regulatory compliance isn’t a final hurdle; it’s a design input that shapes technical choices long before code reaches production.

    Build vs buy decisions carry ethical implications too. Vendor due diligence covers model provenance, data handling, eval results, and incident history—not just feature checklists. Contracts codify SLAs, audit rights, and deletion commitments so our obligations to customers flow down the stack.

    Finally, we earn trust in public. We publish model facts, change logs, and limitations in a customer-facing trust center, and we invite feedback loops that turn real-world usage into better safeguards. Stakeholder management matters here: being candid about trade-offs often increases confidence more than chasing perfection.

    This is how I keep teams fast without being reckless: ethics as a product capability, not a poster. Build with intention, measure what matters, and make it easy for customers to understand, control, and benefit from your AI. That’s how we ship innovation that stays trusted—at scale.


    Inspired by this post on Product School.


    Book a consult png image
  • New Year, New Product Habits: AI Workflows, Coaching Culture, and Community in 2026

    New Year, New Product Habits: AI Workflows, Coaching Culture, and Community in 2026

    Happy New Year! I’m kicking off 2026 with a behind-the-scenes look at what’s changing in my product practice, the experiments I’m running with my teams at HighLevel, and the trends I’m most energized by—especially around continuous discovery, AI workflows, and building stronger coaching cultures.

    If you want to listen to the conversation that sparked many of these reflections, you can find it here: Spotify | Apple Podcasts.

    Why Teresa sunset the live deep-dive cohorts—and how on-demand and the new Discovery Habits Toolbox better support real behavior change. This pivot resonated with my own experience: some skills, especially discovery habits, only stick when they’re reinforced in the flow of real product work, not just in a time-boxed cohort. In my org, we’re leaning into on-demand learning paired with manager coaching to drive durable behavior change.

    What leaders actually need to coach interviewing, assumption testing, and core discovery habits inside their orgs. I’ve found that empowered product teams thrive when leaders have lightweight coaching tools, practical prompts, and clear expectations for product trios. This is less about one-off training and more about building communities of practice where deliberate practice and feedback loops become routine.

    Why training is shifting toward ongoing, leader-supported learning (and how AI will accelerate the shift). AI Strategy isn’t just about tools—it’s about learning systems. For LLMs for product managers to create leverage, we need eval-driven development, privacy-by-design, and clear guardrails. I’m building AI workflows that enable managers to review interviews, spot anti-patterns, and nudge teams toward better decisions—without replacing critical thinking.

    Teresa’s move into paid subscriptions and why AI content doesn’t fit the classic “design once, run for years” course model. I see the same reality in my content roadmap: the half-life of AI guidance is short. That pushes us toward subscription models, tighter feedback loops, and a more adaptive go-to-market strategy for education products.

    A sneak peek into the AI tools Teresa is building for discovery work—from interview coaching to near-ready interview snapshot generation. I’m particularly excited by tooling that scaffolds better interviews, sharpens assumption testing, and speeds up synthesis without skipping the human judgment step. These capabilities map directly to where I want my teams investing time: spending less energy on admin and more on learning from customers.

    Petra’s plans for the year: community building with Product at Heart, a new product leadership email course, her Product Leadership Wheel, and workshops launching in Cairo. As someone who believes in conferences as high-quality “energy wells,” I’m inspired by how these programs create momentum for leaders who are upgrading their coaching muscles.

    The role of conferences and retreats in staying grounded, inspired, and connected. I treat these gatherings as strategic resets—spaces to test ideas, confront blind spots, and deepen my network for future collaboration. The best outcomes often come from serendipitous hallway conversations and hands-on sessions where you can pressure test frameworks with peers.

    How Teresa is staying on top of academic research (and why “synthetic users” aren’t ready for prime time). I agree: while synthetic data can be useful for scaffolding, it’s not a substitute for direct customer contact. Combine academic rigor with real-world interviewing and strong data governance—especially when operating under General Data Protection Regulation (GDPR).

    The shared challenge of evaluating vendors and conference speakers making questionable AI claims. My heuristic: ask for clear problem statements, reproducible evaluations, grounded benchmarks, and a path to safe deployment. If a pitch can’t show measurable uplift or ignores compliance, it’s not ready for empowered product teams.

    Key takeaways I’m carrying into 2026: delivery models matter; leaders need coaching tools, not just training; AI is reshaping how we teach and learn; experimentation is the theme of 2026; and community still energizes. That’s the blueprint I’m using to strengthen continuous discovery, refine our AI workflows, and sustain high standards in product management leadership.

    What about you? How are you integrating AI workflows into your discovery practice, and what coaching tools are helping your managers reinforce the right habits? Share your approach—I’d love to learn what’s working in your context.

    Resources & Links:

    Follow Teresa Torres: https://ProductTalk.org

    Follow Petra Wille: https://Petra-Wille.com

    Teresa’s website: Product Talk

    General Data Protection Regulation (GDPR)

    Product Talk Academy

    Deliberate Practice – ATP episode where Teresa talked about the ending live cohorts for Deep Dive classes

    Teresa’s Discovery Habits Toolbox program

    Petra’s A 52-Week Transformation Journey

    Teresa’s Product Talk subscriptions (AI workflows + discovery content)

    Claude Code

    The Interview Coach by Teresa

    Product at Heart Conference (Hamburg)

    Petra’s Coaching Packages

    Petra’s Ways We Can Work Together

    Petra’s Product Leadership Wheel (PLwheel)

    Petra’s Product Manager (PMwheel)

    Prdkt+ MENA Product Summit 2026

    World Beautiful Business Forum by House of Beautiful Business

    Melissa Suzuno

    Vistaly (Teresa’s integration partner for some upcoming AI tools)

    Teresa’s Just Now Possible podcast


    Inspired by this post on Product Talk.


    Book a consult png image
  • 11 Product Management Shifts Redefining 2026: Actionable Signals from Top Leaders

    11 Product Management Shifts Redefining 2026: Actionable Signals from Top Leaders

    2026 is closer than it feels, and the signals are already clear. I’ve been synthesizing what I’m seeing across empowered product teams, boards, and cross-functional partners into a practical view of what matters next. A sharp look at product management trends for 2026. Not guesses, but signals from top product leaders shaping how PMs will actually work next.

    In this analysis, I distill eleven shifts that are changing the craft—from outcomes vs output OKRs and continuous discovery to stronger product strategy and tighter product roadmapping and sprint planning. The throughline is simple: prioritize customer value, ship with focus, and measure what moves the business. These aren’t headline trends; they’re working patterns I’m seeing across high-performing organizations.

    AI is no longer a side project—it’s part of the product manager’s core toolkit. Agentic AI, LLMs for product managers, and trustworthy AI workflows are accelerating discovery, sharpening problem framing, and enabling faster iteration. The best teams pair this with disciplined evaluation and experimentation, so insight compounds without sacrificing safety, privacy, or product quality.

    Execution is getting crisper through product trios and stronger stakeholder management. When design, product, and engineering co-own discovery and delivery, teams reduce handoffs and increase clarity. That alignment translates into better prioritization, fewer context-switches, and a roadmap that reflects real trade-offs—not wish lists.

    On growth, product-led growth remains a durable engine when it’s anchored in a compelling value proposition and instrumented end-to-end. Clear activation moments, in-app guides, and thoughtful product tours outperform brute-force acquisition. When we connect these motions back to product strategy and the roadmap, we create a repeatable loop that compounds adoption and retention.

    Governance and trust are now table stakes. Privacy-by-design, data governance, and a pragmatic approach to regulatory compliance protect both users and velocity. Teams that build these practices into their operating model move faster because they avoid late-stage rework and maintain stakeholder confidence.

    If you’re leading a product org—or aspiring to—this is your field guide to 2026. I’ll unpack where these shifts are strongest, how to apply them in your context, and the pitfalls to avoid. The aim is to give you clear language, concrete practices, and a sharper edge as you shape what your team builds next.


    Inspired by this post on Product School.


    Book a consult png image
  • How We Built an AI Career Co‑pilot that Turns Knowing into Doing for Disadvantaged Students

    How We Built an AI Career Co‑pilot that Turns Knowing into Doing for Disadvantaged Students

    How do you help disadvantaged students take action on opportunities they don't even know exist? That question has been top of mind for me as I’ve explored how AI can augment—not replace—human mentorship. Recently, I dug into the work behind Zero Gravity, a UK-based platform using mentoring, community, and learning pathways to unlock elite career opportunities for state school students. Their approach reframed a core problem I care deeply about: the "knowing-doing gap."

    I sat down with Elliot Little (Product Manager) and Dan St. Paul (Software Engineer) from Zero Gravity to unpack how they’re tackling this gap with an AI career co‑pilot. They’ve intentionally positioned the system as an orchestrator, not an automation tool—bridging the space between knowing what to do and actually doing it. As a product leader, I see this as a powerful pattern for Generative AI: use AI to coordinate steps, personalize guidance, and empower action in moments where confidence and clarity are fragile.

    What resonated most was the humility of their build journey. They started with grand visions of AI mentors and synthetic avatars, then scaled back to something simpler and more effective. The first prototype—a job suitability summary—didn’t deliver the "wow moment" they expected. And they discovered that hiding the "LLM magic" backfired—students needed to feel the personalization. That insight aligns with my own experience: users must perceive the value for trust and motivation to compound.

    From a UX standpoint, the team chose text chat over voice input and leaned into guided prompts rather than empty text boxes. That decision lowered cognitive load and increased completion rates—classic product management tradeoffs that privilege momentum over novelty. In my view, this is what good AI product strategy looks like: invite action with structure, then expand autonomy as confidence grows.

    The technical backbone is equally thoughtful. Multi‑month journeys require rigorous context window management to avoid exploding token counts and degrading quality. I appreciated their pragmatic toolkit: context management techniques like removing stale tool calls, summarizing history, exposing tools conditionally. They also used application logic rather than complex RAG architectures to manage tool availability and context freshness. This is the kind of disciplined engineering that keeps systems reliable at scale without overcomplicating the stack.

    Model selection was fit‑for‑purpose, not one‑size‑fits‑all. They’re using different models for different tasks, including "GPT-5 Nano for structured outputs, lighter models for quick replies." That modularity enables speed and cost control while preserving high‑fidelity moments where structure matters most.

    Safeguarding was treated as a first‑class concern—non‑negotiable when you’re building AI for 16‑year‑olds. Their safeguarding architecture pairs moderation endpoints with external verification via Unitary. They also invested in building a failure taxonomy through internal red team/green team exercises. This is AI risk management done right: define failure modes early, test ruthlessly, and wire safety into the product surface area—not just the model layer.

    Evaluation was grounded in outcomes, not demos. The team focused on whether students progressed from insight to action: applying, interviewing, and engaging with mentors. That aligns with how I run eval‑driven development—ship narrowly, measure real behavior, and iterate toward a repeatable "wow moment" that students can actually feel.

    Looking ahead, I’m excited by what’s next: long‑term memory management for multi‑year student journeys. It’s a hard problem—balancing privacy, provenance, and portability—but it’s precisely where an AI career co‑pilot can compound value over time. The vision is compelling: a resilient companion that remembers goals, adapts to context, and orchestrates the right next step.

    If you want to dive deeper, you can listen to the full conversation on Spotify and Apple Podcasts:

    Listen to this episode on: Spotify | Apple Podcasts

    Resources mentioned:

    Zero Gravity: https://zerogravity.co.uk/

    Unitary – AI-powered content moderation: https://www.unitary.ai/

    Blue Dot Impact AI Safety Course – free AI safety course Elliot recommended: https://bluedot.org/

    My key takeaways: build AI that augments human relationships, not replaces them; don’t hide the personalization—let learners feel it; privilege application logic over unnecessary architectural complexity; and treat safety, context, and evaluation as product features, not afterthoughts. That’s how we bridge the "knowing-doing gap" with integrity and scale.


    Inspired by this post on Product Talk.


    Book a consult png image
  • PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

    PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

    I’ve sat in countless AI measurement debates and noticed a recurring gap. One major voice has been noticeably underrepresented in the AI measurement conversation: the product manager (PM) that’s leading development. From experience, PMs and developers do need different measurement tools—and making those differences explicit is exactly what speeds up decisions and improves outcomes.

    Developers optimize the model and system layer. Their toolkit centers on eval-driven development: offline evals, regression suites, red-teaming, latency and throughput monitoring, token cost tracking, and hallucination rate reduction. On the delivery side, engineering teams watch DORA metrics alongside CI/CD performance to keep iteration fast and safe. When building LLM-backed experiences, they also care deeply about retrieval-first pipeline quality and context window management because those mechanics determine grounding, relevance, and consistency.

    PMs, by contrast, own outcomes. We instrument user journeys end to end and define a clear north-star tied to value: activation, time-to-value, task success rate, retention analysis, support deflection, and revenue contribution. We rely on A/B testing frameworks and minimum detectable effect (MDE) planning to separate real impact from noise, and we consolidate behavioral signals in a unified analytics platform like Amplitude analytics and Pendo to understand adoption, friction, and cohort differences. This is the heart of product-led growth and continuous discovery: evidence, not anecdotes.

    The fact that these toolboxes differ is a strength, not a weakness. Specialized metrics keep responsibilities crisp: developers guarantee model quality and reliability; PMs guarantee that quality translates into customer and business outcomes. What we need is an explicit metrics ladder that connects layers—model-level quality floors and SLOs, feature-level KPIs, and company-level results—so trade-offs are transparent and prioritization is principled.

    In practice, I create a shared measurement contract for every AI initiative. It links eval sets to user-facing success criteria, defines acceptance thresholds, and spells out observability across the stack. We include governance from day one—AI risk management, privacy-by-design, and data governance—so we can scale responsibly without slowing teams down.

    Here’s the AI product toolbox I give my teams: start with a concise value hypothesis; define a success rubric the customer would recognize; instrument the happy path and the failure path; plan experiments with MDE up front; segment results by persona and job-to-be-done; and close the loop with qualitative feedback inside the product via in-app guides, product tours, and lightweight surveys. For AI features specifically, add Agent Analytics for agentic AI, capture grounding sources for explainability, and log model/context inputs to make debugging and iteration repeatable. That way, LLMs for product managers stop being magic and start being manageable.

    When we roll out a new assistant—whether a retrieval-augmented copilot or a voice AI agent—we set two dashboards: one for developers (eval pass rates, latency, context integrity, error budgets) and one for PMs (activation, task completion, deflection, satisfaction). The dashboards read differently by design, yet they are joined at the hip by shared definitions and experiment IDs. This lets us move quickly with confidence: engineering can tighten quality loops while product steers toward the outcome that matters most.

    If you’re feeling the tension between model metrics and product metrics, don’t collapse them—connect them. Start with a thin slice, agree on 3–5 measurable outcomes, and let your evals and A/B tests work together. With a clear metrics ladder and a unified analytics platform, PMs and developers can each excel at their craft and still ship AI that customers love.


    Inspired by this post on Pendo – Perspectives.


    Book a consult png image
  • Inside the AI Customer Service Shift: What 166 Leaders Told Me About Teams, Roles, and ROI

    Inside the AI Customer Service Shift: What 166 Leaders Told Me About Teams, Roles, and ROI

    I wanted to cut through the hype and see what’s actually changing inside customer service teams as AI agents like Fin move from pilots to production. So I analyzed 166 interviews with support leaders, managers, and frontline specialists to understand how roles, workflows, and team structures evolve once AI becomes part of everyday work.

    The anecdotes were already loud: AI tools are transforming customer support. But the scale, shape, and consistency of that transformation? Less clear. I went to the source—the practitioners living it—to quantify what’s real and what’s next for customer support AI strategy.

    Here’s what I gleaned from the data.

    TL;DR — What’s changing

    AI is reorganizing core CS operations: Nearly every team (≈95%) reported meaningful workflow changes. Triage, routing, translation, and categorization are increasingly automated. Hybrid human+AI systems are taking their place.

    Frontline work is changing to AI oversight: Humans now QA, monitor, and test AI outputs. When it comes to handling queries, they step in for nuance, rather than repetition.

    Structural change is widespread but uneven across companies: 83% reported new responsibilities or roles. Some built AI pods, while others retained traditional setups.

    Tier 1 headcount demand is falling: 28% saw hiring freezes, slowdowns, or natural attrition at Tier 1 level as AI Agents manage more requests and improve operational efficiency.

    Skill gaps are widening inside teams: Data literacy, QA, and cross-functional communication are all rising in value. For many companies, long-term role strategy is lagging behind.

    Research methodology

    The goal of this research is to understand how many customer service teams have changed their roles, responsibilities and ways of working due to adopting AI agents, as well as understanding how these changes manifest within their organizations.

    For this study, the data chosen consists of interviews conducted by the research team, either with Intercom customers or prospects. This data was chosen because the focus of the interviews revolved around the individual experience of the participant, which gives a higher chance of information related to role changes to be present.

    The data was collected using Snowflake by pulling all interviews stored in gong conducted by a member of the research team from 01-01-2025 to 14-10-2025.

    After the data was pulled, a python script was used to clean the conversation corpus for each conversation retrieved. Common English stopwords (e.g. “and”, “very”, “with”, etc.) were removed, as well as all the text associated with a speaker in the conversation that was not the interview participant(s). This was done to reduce the computational power required for the conversation coding, avoid API timeouts and reduce costs.

    After the corpus was cleaned, the OpenAI API was employed, alongside a prompt, to code each conversation using closed codes defined in a closed codebook.

    The codes used were:

    No role change mentioned: No explicit changes to roles, teams, or reporting lines are attributed to AI/Fin.

    Role responsibilities changed due to AI/Fin: Duties/ownership moved between humans and AI/Fin, or scope of a role changed because AI/Fin handles tasks.

    Team structure/reporting changed due to AI/Fin: Org/team boundaries, team charters, or reporting lines changed due to adopting AI/Fin.

    Headcount/hiring impacted due to AI/Fin: Hiring plans, headcount, staffing coverage, or shifts/rotations changed due to AI/Fin.

    Workflow/process changed due to AI/Fin: Steps, triage/escalations, routing, or playbooks changed because AI/Fin alters the process.

    Other organizational changes due to AI/Fin: Other changes inside the organization due to AI/Fin that don’t involve a change in responsibilities, team structure/reporting lines, headcount or workflow/processes changes.

    Data analysis

    166 conversations were retrieved. More than 90% of all conversations report some sort of change either in their role, team, or processes due to implementing Fin, or a similar AI product, with only 13 participants reporting no changes.

    Across these conversations, each one could have multiple types of change associated with it (M = 2.35, Med = 2, Min = 1, Max = 4, N = 166).

    More specifically, after implementing Fin or a similar AI product:

    94.58% participants reported having their processes and workflows disrupted

    82.53% participants reported seeing their role and responsibilities change

    27.71% participants reported changes in company headcount or hiring

    6.02% participants reported their team structure or reporting lines changing as a result

    Additionally, 16.27% participants reported a change for a different reason from the ones highlighted above (“Other organizational changes due to AI/Fin”).

    Sample representativeness

    The sample is representative with a confidence level of 90% and a margin of error of ±6.4% (accounting for an overall unknown population size). The individual confidence intervals for each type of change are as follows.

    Workflow/process changed due to AI/Fin: 157 (94.6%), 90% CI: 91.7% – 97.5%

    Role responsibilities changed due to AI/Fin: 137 (82.5%), 90% CI: 77.7% – 87.4%

    Headcount/hiring impacted due to AI/Fin: 46 (27.7%), 90% CI: 22.0% – 33.4%

    Other organizational changes due to AI/Fin: 27 (16.3%), 90% CI: 11.6% – 21.0%

    No role change mentioned: 13 (7.8%), 90% CI: 4.4% – 11.3%

    Team structure/reporting changed due to AI/Fin: 10 (6.0%), 90% CI: 3.0% – 9.1%

    Thematic analysis

    1) Automation and AI integration replacing manual steps (94.58%). I see AI workflows embedding into every stage of support. Manual triage, routing, translations, and repetitive responses shift to Fin or similar systems, while agents focus on human-in-the-loop oversight.

    Agents’ day-to-day work now revolves around monitoring or fine-tuning AI outputs, not replying to the same questions. In many teams, conversations enter Fin first; humans only step in when nuance or exception handling is required. Testing, QA, and rollout practices have matured too—teams track Fin’s accuracy and iterate intentionally.

    2) Humans shift to oversight, AI handles execution (82.53%). The role resets are unmistakable. Support agents and managers move from high-volume execution to optimization, configuration, and measurement. New roles emerge—AI specialists, automation managers, Fin owners—while responsibilities migrate toward strategic analysis and quality assurance.

    Duties are redistributed: Fin takes on refunds, triage, simple messaging, even parts of the sales process. I’ve watched some careers pivot toward product/ops or AI systems strategy as managers coordinate testing and monitor adoption metrics.

    3) Reductions or slower growth due to efficiency gains (27.71%). Efficiency is real. Many teams reduce Tier 1 headcount needs or slow hiring because AI absorbs simpler requests. Others reallocate people to complex work or AI management. A few still expand—adding automation engineers, implementation specialists, or technical AI leads—but not at past growth rates.

    The upshot: organizations handle more volume while stabilizing or reducing staffing, especially at the frontline tier.

    4) New AI teams, flatter orgs, fewer escalation layers (6.02%). I’m seeing organizational design catch up to the tech. Some companies form dedicated LLM or automation teams. Others flatten hierarchies, design around workflow complexity instead of region, or merge roles. Dedicated escalation layers shrink as Fin routes or resolves more autonomously.

    Team design is getting more modular and data-driven, with clearer ownership for configuration, governance, and Agent Analytics.

    5) Broader digital transformation and operational modernization (16.27%). Beyond support, companies are modernizing their operating model: automation-first, digital self-service, better data foundations, and new vendor ecosystems. Collaboration patterns between data, ops, CX, and product/engineering are tightening, with a culture of experimentation and continuous improvement taking hold.

    How have customer service roles and responsibilities changed due to Fin/AI agent implementation?

    Implementing Fin or a similar AI agent profoundly changes how an organization operates, with around 95% of participants reporting some level of change in their processes after implementation. These systems have significantly reshaped the workflows that customer service teams are used to. Tasks once performed manually, such as ticket triage, routing, repetitive responses, and translations are now handled by AI agents.

    “This marks a clear transformation in how customer service agents work: moving away from directly resolving customer queries to focusing on more analytical and procedural work”

    As a result, customer service agents’ responsibilities have shifted from performing manual tasks to monitoring and fine-tuning the AI agent whenever its output is inaccurate or incomplete. This marks a clear transformation in how customer service agents work: moving away from directly resolving customer queries to focusing on more analytical and procedural work, such as testing, QA, and performance analysis of AI outputs.

    Human agents who still handle conversations tend to do so either because the AI agent cannot yet respond adequately, or because of an organizational choice to retain human involvement for sensitive or high-value interactions. Nevertheless, the need for such roles is diminishing. Around 28% of participants reported a reduction in Tier 1 staff or a hiring slowdown or a full hiring freeze, as AI agents increasingly manage simple requests and organizational attention shifts towards improving automation efficiency.

    “In some cases, this has led to the creation of specialized AI teams, reorganizations around workflow complexity, or the merging and redefinition of existing roles”

    However, this transformation is not uniform across companies. While some roles have disappeared (particularly escalation layers), others have emerged. Many organizations are reallocating existing staff to AI management or hiring new technical profiles such as automation engineers, implementation specialists, and AI leads. In some cases, this has led to the creation of specialized AI teams, reorganizations around workflow complexity, or the merging and redefinition of existing roles.

    Around 83% of participants reported changes to their roles or responsibilities following the introduction of Fin or similar AI agents. Specifically, customer service agents who no longer handle basic queries now focus on managing AI performance, reviewing Fin tasks and improving automation outputs. Managers oversee AI evaluation and implementation, coordinate testing, and monitor AI metrics such as resolution and involvement rates. In some organizations, new dedicated roles have emerged—AI specialists, automation managers, or Fin owners—reflecting a strategic shift toward automation-first, digital self-service models.

    These structural shifts are also cultural. I’m seeing teams embrace experimentation, versioning, and eval-driven development while deepening collaboration with data, operations, and product/engineering. The move from outcomes vs output OKRs is palpable: leaders are measuring containment, deflection, CSAT, and time-to-resolution with new rigor.

    Overall, a widespread transformation is underway. Roles are broadening, responsibilities are diversifying, and cross-functional collaboration is becoming the norm. Given the pace of gen ai improvement and the rise of agentic AI patterns, I expect these shifts to intensify.

    This evolution raises two important questions

    Firstly, do customer service agents possess the skills required to succeed in these new roles? While they are experts in customer interaction and company policy, their work now demands new competencies in data analysis (e.g. reporting AI agent performance and how it changes over time), quality assurance/debugging (e.g. Fin output testing and versioning), and cross-functional communication (e.g. if help from another team is required, drafting a business case to justify the resources required could be needed).

    Secondly, what long-term strategies are companies adopting to support these evolving roles? Some are reorganizing entirely around automation, while others retain traditional structures. For those undergoing transformation, it remains unclear whether these changes are part of a deliberate strategic plan aimed at achieving specific performance outcomes, or the result of experimentation without defined goals.

    Ultimately, Fin’s success— and of AI in customer service more broadly— depends not only on the technology itself but on the people and strategies that shape its use. In my experience, the winners invest early in data literacy, robust QA, clear ownership, and governance; they align product, ops, and CX around a shared AI roadmap; and they measure what matters with disciplined Agent Analytics. That’s how you turn AI workflows into durable customer and business outcomes.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • 4 Costly Misconceptions About Building AI Agents—and How I Turn Them Into Wins

    4 Costly Misconceptions About Building AI Agents—and How I Turn Them Into Wins

    I’ve lost count of how many times I’ve been asked for a “quick AI agent” that can autonomously fix customer problems, write code, or run sales ops. The promise is intoxicating—and I get why. But in practice, sustainable impact comes from disciplined product thinking, not wishful automation. Drawing on my experience leading product for complex, agentic AI initiatives, I want to debunk four misconceptions I see repeatedly and share what actually works.

    Misconception 1: AI agents are plug-and-play. The reality is that effective agentic AI behaves more like a new product line than a feature toggle. It needs clear job stories, domain grounding, tool access, and guardrails. I start by narrowing scope to one painful job to be done, then design AI workflows that reflect real constraints (SLAs, compliance, edge cases). From day one, I instrument with Agent Analytics and set up eval-driven development so we can see failure modes early and iterate with intent.

    What consistently moves the needle is treating the agent like a teammate you onboard: define responsibilities, provide the right tools, and measure outcomes. I pair scripted validations with live evals, track containment rates and handoff quality, and balance precision/recall depending on the risk profile. This is slow to fast, not fast to broken.

    Misconception 2: Bigger models make better agents. In my experience, architecture outperforms horsepower. A retrieval-first pipeline, tight context window management, and practical prompt engineering often beat an oversized model that hallucinates. Tool use matters more than model size: give the agent reliable APIs, clear schemas, and deterministic fallbacks. For LLMs for product managers, the play is to right-size the foundation model and invest in data quality, prompts, and evaluators that reflect your true acceptance criteria.

    When I see erratic behavior, I don’t immediately swap models; I improve retrieval, prune irrelevant context, and clarify the agent’s planning loop. Most performance gains come from better state management and grounding rather than a pricier token budget.

    Misconception 3: Agents replace teams. High-performing organizations design human-in-the-loop systems. I implement human review on high-risk actions, explicit escalation paths, and simple override mechanisms. That’s not just safety theater—it’s good product design. AI risk management and data governance are part of the product backlog, not an afterthought. In customer support ai strategy, for example, the agent drafts, a specialist approves, and the system learns from deltas to tighten future responses.

    The social system matters as much as the technical one: clear role boundaries, audit trails, and feedback loops turn the agent into a force multiplier. Teams gain leverage without surrendering accountability.

    Misconception 4: Shipping the agent equals success. Adoption is earned, not announced. I treat agent launches like any product-led growth motion: define activation events, remove friction with in-app guides and product tours, and A/B test prompts, tool choices, and UI affordances. We track time-to-value, task completion rate, and user trust signals (edits, undo patterns, and escalation requests). When we get those leading indicators right, retention follows.

    Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.

    My playbook is simple and repeatable: frame the problem narrowly, ground the agent with the right tools and data, measure with eval-driven development and Agent Analytics, then grow adoption with a disciplined go-to-market inside the product. The agents that win don’t feel like magic—they feel dependable. That’s what customers trust, and that’s what scales.


    Inspired by this post on Pendo – Best Practices.


    Book a consult png image
  • AI Context Pulling Playbook: How I Get LLMs and Teams to Collaborate for Better Product Outcomes

    AI Context Pulling Playbook: How I Get LLMs and Teams to Collaborate for Better Product Outcomes

    In my role leading product, I’ve learned that the fastest path to higher-quality deliverables from large language models (LLMs) is not a clever prompt—it’s rigorous context. I call the practice AI context pulling: a repeatable way to assemble, compress, and structure the most relevant knowledge before the model ever starts generating. Done well, it turns generative AI into a dependable partner for discovery, prioritization, and execution.

    AI context pulling means I proactively gather the right artifacts (customer insights, analytics, strategy, constraints), manage context windows intentionally, and shape the model’s task with clear objectives and guardrails. This reduces hallucinations, improves alignment, and creates traceability back to sources—critical for product management leadership and stakeholder trust.

    Learn a new way in which product professionals can collaborate with AI to get even better results on their projects.

    Here’s the simple flow I use: first, I define the intent (e.g., “synthesize discovery interviews for a positioning brief”). Next, I inventory relevant context: top customer pains from product discovery, usage patterns from Amplitude analytics, recent support trends from Intercom, and any constraints from our product strategy. Then I run a retrieval-first pipeline to select only the most pertinent slices—favoring recency, representativeness, and canonical sources.

    Because context window management matters, I compress long documents into short, source-cited summaries and keep raw excerpts handy when nuance is important. My prompts follow a consistent structure: role and objective, constraints and audience, curated context, the explicit ask, preferred output format, and a brief self-check (e.g., “cite sources and flag uncertainty”). This is prompt engineering for reliability, not theatrics.

    A quick example: when drafting a one-page feature brief, I attach three items—the product strategy paragraph that sets the frame, a usage cohort analysis that highlights who’s affected, and five verbatim customer quotes. I ask the LLM to propose a problem statement, success criteria, and a shortlist of solution hypotheses, each tied to a cited piece of evidence. The result is a grounded, decision-ready artifact I can share with product trios and stakeholders.

    Tooling-wise, I keep it pragmatic. A lightweight retrieval-first pipeline (embeddings, metadata filters, and recency rules) ensures the LLM pulls what matters. I version prompts and contexts together so I can run quick A/B testing on output quality. And I log decisions and sources to support eval-driven development and continuous discovery.

    Common pitfalls are avoidable. Too little context yields generic answers; too much overwhelms the model. Stale docs can mislead; curate aggressively. Vague asks invite fluffy prose; specify outcomes, audiences, and formats. If the task is high risk, I bias toward smaller, well-cited outputs and expand iteratively with human review in the loop.

    To measure impact, I track rework rate, review time, and stakeholder alignment on first pass. Over time, teams adopting AI context pulling report clearer artifacts, faster synthesis cycles, and more confident decisions—because every recommendation traces back to evidence. That’s how humans and LLMs truly collaborate better: we provide the right context, and the model amplifies our judgment.

    If you’re ready to operationalize this, start by templatizing your most common product workflows—discovery synthesis, roadmap rationale, and release notes—and attach small, high-signal context packs. With a retrieval-first mindset and disciplined prompting, AI becomes an extension of your product craft, not a gamble.


    Inspired by this post on Pendo – Perspectives.


    Book a consult png image