Tag: eval-driven development

  • Reinventing Product Management Workflow: The AI Upgrade I Use to Ship Faster, Smarter

    The most valuable upgrade I’ve made to my product management workflow isn’t a new framework or a shiny dashboard—it’s an AI-first operating model that compresses discovery-to-delivery cycles while increasing confidence in every decision. I built this approach to reduce context switching, remove toil, and keep the team relentlessly focused on outcomes over output. The result is a faster, clearer, and more reliable path from insight to shipped value.

    Here’s how I run an AI-powered product workflow end to end: continuous discovery, opportunity sizing, solution shaping, planning, execution, and iteration—each step instrumented with automation, retrieval, and evaluation so we learn faster without compromising rigor.

    Intake and triage start with a retrieval-first pipeline that unifies customer feedback, support tickets, sales notes, research transcripts, and usage analytics. I use embeddings to cluster themes, de-duplicate signals, and surface the most representative examples. This gives me an instant, always-fresh view of customer jobs, pains, and opportunities without manually combing through noise.
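
    To illustrate the clustering and de-duplication step, here is a minimal sketch. It assumes each feedback item already has an embedding vector; the three-dimensional vectors, the `cluster_feedback` helper, and the 0.9 similarity threshold are all hypothetical stand-ins, not the actual pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_feedback(items, threshold=0.9):
    """Greedy clustering: each item joins the first cluster whose
    representative vector is similar enough, else it starts a new cluster."""
    clusters = []
    for text, vec in items:
        for cluster in clusters:
            if cosine(vec, cluster["representative"]) >= threshold:
                cluster["members"].append(text)
                break
        else:
            clusters.append({"representative": vec, "members": [text]})
    return clusters

# Toy vectors stand in for real embeddings (assumption for illustration).
feedback = [
    ("CSV export fails on large files", [0.90, 0.10, 0.00]),
    ("Export to CSV times out", [0.88, 0.12, 0.02]),
    ("Need SSO for enterprise login", [0.05, 0.10, 0.95]),
]
clusters = cluster_feedback(feedback)
for cluster in clusters:
    # The first member serves as the representative example of the theme.
    print(len(cluster["members"]), "signal(s):", cluster["members"][0])
```

    The first member of each cluster doubles as its representative example; a production pipeline would likely use a vector index and a more robust clustering method.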

    For discovery, I rely on LLMs to accelerate the hard parts without replacing judgment. I generate interview guides, summarize transcripts, extract entities, and tag moments of friction. Prompt engineering and context window management ensure the model sees the right evidence at the right time. I keep all sensitive data under privacy-by-design and data governance controls.

    Opportunity sizing is where I connect insights to business impact. I map problems to a driver tree, quantify potential lift, and align to OKRs that measure outcomes rather than output. When relevant, I apply the Kano Model to balance basic, performance, and excitement attributes. To maintain rigor, I use eval-driven development on my prompts and heuristics so prioritization is repeatable, not anecdotal.

    Solution shaping is a collaborative exercise with product trios. I draft problem narratives and PRDs, generate acceptance criteria, and create first-pass UX flows. For speed, I use generative AI for prototyping to explore alternatives quickly, then gate final choices through usability feedback and feasibility checks. Where uncertainty is high, I define a minimum detectable effect (MDE) and design A/B testing plans upfront.
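
    Defining an MDE upfront matters because it fixes how much traffic an experiment needs. The sketch below uses the standard two-proportion approximation for a two-sided test; the 10% baseline conversion rate and the 2-point absolute MDE are made-up inputs, and `sample_size_per_arm` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of
    `mde_abs` over `baseline` with a two-sided test at the given power."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Hypothetical inputs: 10% baseline conversion, detect a 2-point lift.
n = sample_size_per_arm(baseline=0.10, mde_abs=0.02)
n_smaller_mde = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
print(f"per-arm sample size for a 2-point MDE: {n}")
print(f"per-arm sample size for a 1-point MDE: {n_smaller_mde}")
```

    Halving the MDE roughly quadruples the required sample, which is why the MDE belongs in the plan before the test starts rather than after.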

    Planning ties strategy to execution through product roadmapping and sprint planning. I break work into sequenced bets, enable feature flags for controlled exposure, and wire quality signals into CI/CD. DORA metrics—like deployment frequency and change failure rate—help me keep the system honest. Observability ensures we see the “why” behind behavior, not just the “what.”
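
    As a rough illustration of the two DORA metrics named above, this sketch computes them from a deploy log. The log entries and the `dora_snapshot` helper are hypothetical; real numbers would come from your CI/CD and incident systems.

```python
from datetime import date

# Hypothetical deploy log: (deploy date, caused a production failure?)
deploys = [
    (date(2024, 3, 1), False),
    (date(2024, 3, 2), True),
    (date(2024, 3, 4), False),
    (date(2024, 3, 7), False),
    (date(2024, 3, 10), False),
]

def dora_snapshot(deploys):
    """Deployment frequency and change failure rate over the log's window."""
    days = [day for day, _ in deploys]
    window = (max(days) - min(days)).days + 1  # inclusive window in days
    failures = sum(1 for _, failed in deploys if failed)
    return {
        "deploys_per_day": len(deploys) / window,
        "change_failure_rate": failures / len(deploys),
    }

snapshot = dora_snapshot(deploys)
print(snapshot)
```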

    Execution is instrumented with in-app guides, Intercom messaging, and Pendo to shape onboarding and activation. I connect Amplitude analytics to measure habit formation, retention analysis, and feature adoption. When experiments run, I monitor leading indicators in near real time while protecting against peeking and p-hacking. The point isn’t to prove we’re right; it’s to learn fast enough to get right.

    Iteration closes the loop. I use a unified analytics platform to compare expected vs actual outcomes, harvest qualitative feedback, and push new evidence back into discovery. The system improves with each cycle because the retrieval-first pipeline and eval harness both get smarter as data grows.

    Governance is non-negotiable. AI risk management, cybersecurity, and regulatory compliance sit alongside model evaluations to prevent drift, leakage, or bias. I document decisions, model versions, and test artifacts so we can audit how we got to a call—especially when trade-offs are nuanced.

    If you’re standing up this AI workflow from scratch, I recommend a 30/60/90 rollout. In the first 30 days, audit your data sources and build a retrieval-first pipeline. In days 31–60, pilot two high-leverage workflows—continuous discovery and PRD drafting—backed by eval-driven development. By days 61–90, scale to prioritization and experiment design, then thread the outputs into your planning and CI/CD rhythms.

    Common pitfalls I watch for: over-automation that blurs context, lack of evaluation frameworks, ungoverned data that undermines trust, and vanity metrics that celebrate activity over outcomes. The antidote is simple but disciplined—clear decision criteria, measurable hypotheses, and automated evaluations that run as guardrails, not bottlenecks.

    This AI upgrade doesn’t replace the craft of product management; it amplifies it. By combining judgment, clear strategy, and reliable automation, we ship value faster, reduce risk, and make better calls under uncertainty. The payoff is durable: compounding learning velocity and a team that spends more time solving the right problems—and less time wrestling the process.


    Inspired by this post on Product School.


  • Build CX Scores You Can Defend: My 5-step playbook for transparent, trustworthy AI metrics

    “You don’t have to trust the algorithm; you can see exactly why a conversation earned the score it did.”

    We recently shared how we redesigned CX Score to deliver deeper, more actionable insights across every conversation. The most common follow-up from support leaders was simpler and incredibly important: “Can I trust it?” It’s the right question—and it’s the one I use as my own bar for whether a metric is ready for the C‑suite.

    CS teams are the subject matter experts on customer experience. They understand the nuance of what customers feel, the context behind every interaction, and the difference between a technically resolved issue and a genuinely satisfied customer. I’ve learned, conversation by conversation, that any metric we ship has to capture that nuance at scale—or it doesn’t deserve to be used.

    We built CX Score to give support teams a complete view of how their customers feel across every conversation. It surfaces what’s working, what’s not, and why—so leaders can communicate impact clearly and drive change across support, product, and the wider business.

    [Image: interface card displaying 'CX Score: 2' for a case where repeated CSV export attempts failed and frustrated the customer; the AI agent explains the issue and requests more details.]
    A CX Score in action: repeated CSV export failures trigger a low score and customer frustration, while the AI agent clarifies next steps and gathers details, turning raw signals into actionable support insights.

    Here’s exactly how I approached building a trustworthy metric that support leaders can inspect, explain, and defend.

    1) It’s grounded in how support teams define quality. I started with how experienced support professionals actually evaluate conversations—collecting real examples of strong, mixed, and poor interactions across industries, identifying the specific factors that shape overall experience, and writing plain-English rules for each. The result: CX Score applies the same criteria a trained support professional would use, not generic LLM assumptions.

    2) It’s aligned with human judgment. We created a dataset of thousands of real customer conversations spanning multiple industries, languages, channels, and agent types. Each was manually reviewed by experienced support professionals—with two reviewers per conversation where possible and disagreement resolution to create stable consensus labels. The result: CX Score is trained and tested to behave like an expert reviewer, not a language model making broad guesses.

    [Image: analytics dashboard showing a CX Score with KPI cards and a Sankey performance funnel linking support channels to AI involvement, resolutions, and positive, neutral, or negative outcomes.]
    A modern CX analytics view shows how conversations flow from chat, email, and mobile into AI assistance, then to resolutions and sentiment outcomes, turning messy support data into a single, defensible CX Score.

    3) It’s engineered by AI specialists. CX Score isn’t a prompt attached to an LLM. It’s a production system built by Intercom’s AI Group: 37 ML scientists and 350 engineers whose full-time focus is AI for customer service. The system includes specialized handling for long transcripts, model configuration tailored for support language and subtle sentiment, prompt engineering designed to default to neutral when evidence is weak, and a multi-stage evaluation pipeline that checks for precision, consistency, and reliability. The result: a metric built by a team that understands LLM behavior in production support environments, where accuracy and consistency matter most.

    4) It’s validated statistically, not qualitatively. Trust requires measurement, not vibes. We tested CX Score against standard ML metrics: precision (when the model flags a negative experience, how often do humans agree?), recall (how many human-identified issues does it catch?), and F1 score (the harmonic mean of the two). We set an explicit bar: F1 above 0.8, representing high agreement with human judgment. We reran these evaluations through every revision, checking for regressions or biases, and I focused especially on negative experiences, because a false negative hides a real problem. The result: CX Score meets a measurable standard before it ships, a statistical requirement rather than a gut check.

    5) It was battle-tested with real customers. Lab accuracy isn’t enough. Customer environments are messy: varied ticket types, mixed languages, unpredictable edge cases. Before release, we ran a multi-phase field test: shadow-scoring conversations with both old and new models, validating sensible behavior across agent type and conversation length, then rolling out to a controlled customer group who confirmed the scores felt right, reasons were clear, and insights were actionable. The result: CX Score shipped because real teams told us it made sense in practice, not because it passed internal tests.
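
    The statistical bar from step 4 can be sketched in a few lines. This is an illustration only: the labels below are invented, "neg" stands for a negative experience as the class of interest, and `precision_recall_f1` is a hypothetical helper rather than the evaluation code behind CX Score.

```python
def precision_recall_f1(human, model, positive="neg"):
    """Compare model labels with human consensus labels for one class."""
    pairs = list(zip(human, model))
    tp = sum(h == positive and m == positive for h, m in pairs)
    fp = sum(h != positive and m == positive for h, m in pairs)
    fn = sum(h == positive and m != positive for h, m in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented labels: six negative and six acceptable conversations.
human = ["neg"] * 6 + ["ok"] * 6
model = ["neg"] * 5 + ["ok"] + ["neg"] + ["ok"] * 5  # one miss, one false alarm

precision, recall, f1 = precision_recall_f1(human, model)
passes_bar = f1 > 0.8  # the explicit bar described in step 4
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} pass={passes_bar}")
```

    Here the missed negative conversation (a false negative) lowers recall, which is the failure the text singles out as most costly.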

    [Image: donut chart of CX categories beside a chat UI showing a CX Score of 3 with a 'Negative policy feedback' tag, highlighting policy feedback, answer quality, customer effort, and emotion.]
    From conversation to clarity: this visual maps the drivers behind a CX Score. Explore how policy feedback, answer quality, and effort combine to produce defensible insights support leaders can act on.

    The importance of explainability. One of the most critical choices I made was ensuring CX Score isn’t a black box. Every score comes with clear reasons, concrete excerpts, and a short explanation of what influenced the rating. This turns the metric into something you can inspect, audit, and explain to executives. You don’t have to trust the algorithm. You can see exactly why a conversation earned the score it did.

    A metric that evolves with your business. Customer expectations shift. Products change. AI improves. A trustworthy metric can’t be static. CX Score evolves with the same commitments that shaped its redesign: evaluate the real signals that shape customer experience, keep the logic simple and interpretable, and ensure leaders can make clear decisions from it. It’s built to be a durable source of truth across every conversation.

    The takeaway. In a world where products look the same and AI can generate any interaction, customer experience is one of the few differentiators that actually matters. Support leaders have built that expertise conversation by conversation. What they’ve lacked is a measurement system that could validate it at scale: one that’s reliable enough to report to the C-suite, explainable enough to defend in strategy meetings, and rigorous enough to drive real decisions. That’s what CX Score is designed to be: a metric that reflects the reality support leaders see every day, backed by the technical rigor to make it credible everywhere else.

    Want to see CX Score in your workspace? Ask your admin to enable it for your team, and start using explainable AI insights to improve customer experience and coach with confidence.


    Inspired by this post on The Intercom Blog.


  • AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

    Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

    This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

    Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcomes vs output OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

    Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

    Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.
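
    A golden-dataset gate of the kind described above might look like the following sketch. Everything here is assumed for illustration: the `GoldenCase` structure, the substring-match scoring, the canned `stub_agent` that stands in for a real model call, and the 0.9 accuracy threshold.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    must_contain: str  # substring a correct answer has to include

def evaluate(agent, cases):
    """Run every golden case through the agent and aggregate accuracy."""
    passed = sum(
        case.must_contain.lower() in agent(case.prompt).lower() for case in cases
    )
    return {"accuracy": passed / len(cases)}

def failed_thresholds(metrics, thresholds):
    """Names of metrics that fall below their required minimum."""
    return [
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]

# Stub agent with canned answers (stands in for a real model call).
CANNED = {
    "What is the refund window?": "Refunds are available within 30 days.",
    "Do you support SSO?": "Yes, SSO is available on the enterprise plan.",
    "Can I export to CSV?": "I am not sure.",
}

def stub_agent(prompt):
    return CANNED.get(prompt, "")

golden = [
    GoldenCase("What is the refund window?", "30 days"),
    GoldenCase("Do you support SSO?", "SSO"),
    GoldenCase("Can I export to CSV?", "CSV"),
]
metrics = evaluate(stub_agent, golden)
blockers = failed_thresholds(metrics, {"accuracy": 0.9})
print(metrics, "blocked by:", blockers)
```

    A non-empty `blockers` list means the rollout stays where it is. Substring matching is the crudest possible scorer; helpfulness, latency, and cost would each need their own metric.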

    Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

    Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics such as deployment frequency and change failure rate to ensure we’re improving the engine, not just the output.
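
    One common way to implement a canary cohort behind a feature flag is deterministic hash bucketing, sketched below. The flag name, the user IDs, and the `in_rollout` helper are hypothetical; a feature-flag service would normally handle this for you.

```python
import hashlib

def in_rollout(user_id, flag, percent):
    """Deterministically place a user in one of 100 buckets per flag,
    then expose the flag to buckets below the rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent

# Hypothetical flag: start with a 10% canary, later widen to 50%.
users = range(1000)
canary = [u for u in users if in_rollout(u, "new-prompt-v2", 10)]
wider = [u for u in users if in_rollout(u, "new-prompt-v2", 50)]
print(len(canary), "users in canary;", len(wider), "after widening")
```

    Because the bucket depends only on the flag and the user ID, a user who sees the canary keeps seeing it as the percentage grows, and setting the percentage to 0 acts as the rollback.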

    Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool allowlists, separate read vs write permissions, and specify decision checkpoints. When the agent’s confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.
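
    These constraints can be expressed as a small policy check that runs before every tool call. The sketch below is one possible shape; the tool names, the limits, and the `decide` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutonomyPolicy:
    max_steps: int = 5
    read_tools: frozenset = frozenset({"search_kb", "get_order"})
    write_tools: frozenset = frozenset({"issue_refund"})
    min_confidence: float = 0.7

def decide(policy, step, tool, confidence):
    """Return what the runtime should do with a proposed tool call."""
    if step >= policy.max_steps:
        return "handoff"      # step limit reached
    if tool not in policy.read_tools | policy.write_tools:
        return "deny"         # tool is not on the allowlist
    if confidence < policy.min_confidence:
        return "handoff"      # too uncertain to act alone
    if tool in policy.write_tools:
        return "checkpoint"   # writes need an explicit approval step
    return "allow"

policy = AutonomyPolicy()
print(decide(policy, step=0, tool="search_kb", confidence=0.9))
print(decide(policy, step=0, tool="issue_refund", confidence=0.9))
print(decide(policy, step=0, tool="delete_account", confidence=0.9))
```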

    Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

    If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

    Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

    Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

    Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

    The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.


    Inspired by this post on Product School.


  • From Idea to Impact: My PM-Friendly Blueprint to Building Your First AI Agent Fast

    AI agents are quickly moving from novelty to necessity, and the fastest way to capture value is to approach them like any other high-stakes product initiative. In this guide, I share how I plan, build, and launch production-grade agents with a product mindset—balancing ambition with risk, speed with governance, and innovation with measurable outcomes.

    I start by getting crisp on the outcome. Who is the primary user, what job are they hiring the agent to do, and how will we know it’s working? I translate this into outcomes vs output OKRs, such as resolution rate, time-to-value, cost-to-serve, or qualified pipeline influenced—anchoring the roadmap before a single line of code or prompt is written.

    Next, I map the agent’s scope and boundaries. I write a simple capability canvas: the tasks the agent must perform, the tools it can use, the data it can access, and the constraints it must respect. Most successful builds follow a retrieval-first pipeline: connect trusted knowledge sources, enrich with metadata, and manage a lean context window to keep responses relevant and cost-efficient. From the start, I bake in privacy-by-design, data governance, and AI risk management so compliance isn’t an afterthought.
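
    Keeping the context window lean usually comes down to choosing which retrieved chunks fit a budget. Here is a minimal sketch that packs the highest-scoring chunks first; the relevance scores, the word-count stand-in for a tokenizer, and the `pack_context` helper are all assumptions for illustration.

```python
def pack_context(chunks, budget_tokens):
    """Greedily keep the highest-scoring chunks that fit the token budget.

    `chunks` is a list of (relevance_score, text). Token cost is
    approximated by word count here; a real tokenizer would differ.
    """
    chosen, used = [], 0
    for _, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen, used

# Hypothetical retrieval results with relevance scores.
retrieved = [
    (0.40, "Our office dog is named Biscuit"),
    (0.92, "Refunds are available within 30 days of purchase"),
    (0.85, "Enterprise plans include SSO and audit logs"),
]
context, used = pack_context(retrieved, budget_tokens=15)
print(used, "tokens used:", context)
```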

    Model selection comes after the workflow is clear. I choose an LLM for the job (latency, cost, multilingual needs, and tool-use fidelity) and pair it with the right connectors and actions—think CRM integration, ticketing, search, or internal APIs. For voice experiences, I define a voice AI agent persona, turn-taking rules, and barge-in behavior. This is where agentic AI patterns shine: structured planning, tool invocation, and verification loops create a resilient, goal-directed system.

    Prompt design is product design. I write system prompts that define role, tone, constraints, data sources, and success criteria. I add few-shot examples that mirror my top use cases and edge cases, then apply prompt engineering best practices to control style, limit speculation, and encourage citations. For voice, I include prompt engineering for voice to optimize brevity, warmth, and disfluency handling without sacrificing accuracy.

    Before launch, I build an eval-driven development workflow. I curate golden datasets from real user intents, add adversarial cases, and automate evals for accuracy, safety, grounding, and tool-use success. I set a minimum detectable effect (MDE) so A/B testing can validate improvements with confidence, and I define go/no-go thresholds to prevent regression. This becomes my continuous discovery loop for the agent.

    Instrumentation is non-negotiable. I wire up Agent Analytics to track task success, containment/deflection rate, handoff quality, cost per task, and user satisfaction. I supplement with a unified analytics platform and session replays to observe failure patterns. These signals feed prioritization and help me decide when to expand scope versus harden reliability.

    For delivery, I rely on CI/CD with feature flags to gate risky capabilities, plus canary releases for new tools and prompts. I monitor DORA metrics to maintain deployment frequency without trading off quality. When incidents happen, I treat them like production issues: incident management playbooks, rollbacks, and clear postmortems.

    Trust is earned through safety and transparency. I enforce least-privilege access, structured logging, and red-teaming for jailbreaks, prompt injection, and data exfiltration. Threat detection and response plus clear user disclosures keep the experience responsible and compliant with regulatory requirements.

    GTM is product-led. I use in-app guides, product tours, and onboarding checklists to drive user activation and early wins. I define success moments, turn them into habit loops, and run retention analysis to find where users stall. This tight loop of messaging, measurement, and iteration accelerates product-market fit.

    Common high-ROI use cases I prioritize include customer support (automated resolution and augmented agent assist), sales and success workflows (lead qualification, QBR prep), and internal knowledge copilots (policy, process, engineering runbooks). Each starts narrow, ships fast, and scales on evidence from analytics and experiments.

    If you’re skimming, here’s the blueprint: clarify outcomes, design AI workflows with a retrieval-first pipeline, select the right LLM and tools, engineer robust prompts, institutionalize evals and A/B testing, instrument Agent Analytics, ship with CI/CD and feature flags, and iterate with discipline. In the walkthrough video above, I go deeper on templates, prompts, and experiments you can use to build your first agent with confidence.


    Inspired by this post on Product School.


  • Becoming AI Native: A Practical Playbook to Transform Strategy, Teams, Data, and Tech

    AI Native is more than a feature set; it’s an operating system for the entire business. In my role leading product, I’ve seen that companies win when they treat AI as a first-class citizen across strategy, architecture, workflows, and go-to-market. Here I unpack what being AI Native means and how to get there in practice, sharing the frameworks I use to align vision, technology, and teams around measurable customer outcomes.

    When I say AI Native, I mean a company where core value creation, customer experience, and internal operations are powered by AI end-to-end. It’s not just bolting on a chatbot. It’s rethinking product strategy, data foundations, and execution so we can deliver differentiated experiences faster, at lower cost, and with higher reliability. This shift demands clarity on where AI truly creates leverage—and the courage to say no where it doesn’t.

    The starting point is strategy. I ground teams in outcomes vs output OKRs and a crisp value proposition: Which customer jobs-to-be-done benefit most from generative AI? Where can we unlock 10x improvements in speed, accuracy, or personalization? We prioritize a small number of high-signal use cases, size impact, and design Minimum Viable Experiments (MVEs) to de-risk assumptions before scaling. This is where build vs buy decisions matter—use foundation models and platforms for commodity needs, and invest your scarce engineering time where differentiation lives.

    Next comes architecture and data. AI Native products thrive on a retrieval-first pipeline, strong context window management, and model-agnostic abstraction so we can swap providers as needs evolve. I emphasize privacy-by-design, robust data governance, and observability across prompts, embeddings, latency, and cost. These guardrails let us move quickly without compromising trust, especially in regulated or enterprise settings.
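
    A model-agnostic abstraction can be as small as one interface that every provider adapter satisfies. The sketch below uses a Python `Protocol`; the `TextModel` interface and the two fake providers are hypothetical and make no real API calls.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only surface product code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class ProviderA:
    # Stand-in adapter; a real one would wrap a vendor SDK call.
    def complete(self, prompt: str) -> str:
        return f"[provider-a] {prompt.splitlines()[-1]}"

class ProviderB:
    def complete(self, prompt: str) -> str:
        return f"[provider-b] {prompt.splitlines()[-1]}"

def answer(model: TextModel, question: str, context: str) -> str:
    """Product logic builds the prompt once, whichever provider is used."""
    prompt = f"Context: {context}\nQuestion: {question}"
    return model.complete(prompt)

# Swapping providers is a one-argument change for the caller.
reply_a = answer(ProviderA(), "What is the refund window?", "Refunds: 30 days")
reply_b = answer(ProviderB(), "What is the refund window?", "Refunds: 30 days")
print(reply_a)
print(reply_b)
```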

    Execution shifts as well. I organize empowered product teams and product trios around the highest-value workflows, not components. Continuous discovery pairs with CI/CD, feature flags, and telemetry so we can test safely in production. Eval-driven development is non-negotiable: we design offline and online evaluations that mirror real user success criteria—accuracy, helpfulness, safety, and business outcomes—then wire those evals into the build pipeline to prevent regressions.

    On the intelligence layer, we increasingly rely on AI workflows and agentic AI to orchestrate multi-step tasks (retrieval, reasoning, tool use, and verification) with human-in-the-loop where appropriate. Clear system prompts, tool definitions, and fallbacks keep behavior predictable. This is where product craft meets prompt engineering: the best teams codify patterns, share prompts in a living library, and standardize on a lightweight AI product toolbox.

    Risk and reliability are part of the product, not an afterthought. I run AI risk management as a continuous program spanning red teaming, content filters, PII handling, audit trails, and incident response. We tie policies to concrete controls and create simple dashboards leaders can trust. The goal is to ship boldly with safety, maintainability, and scale in mind.

    Becoming AI Native also changes how we grow. We lean into product-led growth with clear in-app guides, product tours, and activation paths that teach users where AI shines. CRM integration ensures sales and success teams have context to coach customers. Pricing experiments—often usage- or value-based—align revenue with the impact customers feel, while retention analysis helps us double down on the use cases that drive compounding value.

    To make this real, I use a 90-day plan. Days 0–30: align on strategy, top use cases, and risk posture; stand up data pipelines and a basic retrieval-first stack; define evaluation metrics. Days 31–60: ship MVEs behind feature flags, run head-to-head evals, and instrument observability; start a cross-functional community of practice. Days 61–90: scale the winning use cases, formalize governance, and publish a roadmap tied to outcomes—not just features—with clear SLAs and success metrics.

    The destination is a durable advantage: faster iteration cycles, smarter experiences, and a product strategy that compounds with every interaction. If you’re ready to make the leap, start small, measure obsessively, and build the muscle to ship, learn, and adapt. That’s the heart of becoming AI Native—and it’s well within reach.


    Inspired by this post on Product School.


  • From Coaching to Co‑Pilots: How AI Elevates Product Owners and Feature Teams

    After two decades of coaching product teams, I’m making a deliberate shift in how I guide leaders and practitioners. The destination hasn’t changed—great products, empowered product teams, and durable outcomes—but the route has. AI is now a practical, compounding advantage, and it demands we evolve our product coaching model.

    In my day-to-day as a VP of Product Management at HighLevel, I’ve watched AI move from novelty to necessity. Large language models, agentic AI, and streamlined AI workflows now accelerate how we discover opportunities, test hypotheses, and communicate decisions. This is not about replacing product judgment; it’s about augmenting it with a disciplined AI strategy.

    For years, I’ve raised the alarm about the gap between execution and strategy among “product owners and feature team product managers.” The intent was never to pile on more process. It was to strengthen product discovery, sharpen product strategy, and clarify outcomes vs output OKRs so that teams ship what matters. AI finally gives us the leverage to make that shift unavoidable—and repeatable.

    Here’s the new coaching stance: treat AI as a co-pilot, not an answer engine. I coach teams to build an AI product toolbox they can trust—prompt engineering patterns, eval-driven development to measure model quality, and a retrieval-first pipeline for institutional knowledge. When combined with continuous discovery, this creates a tight loop between insight, iteration, and impact.

    Practically, this means elevating core rituals. In product trios, we start discovery with AI-assisted opportunity mapping, then pressure-test problem framing with user evidence. We generate multiple solution sketches with LLMs, annotate assumptions, and use A/B testing with a minimum detectable effect (MDE) to validate the riskiest bets. The result is faster learning without skipping the hard thinking.

    On the governance side, I set clear guardrails: privacy-by-design, data governance, AI risk management, and explicit criteria for acceptable model behavior. We treat prompts and evaluation datasets as versioned assets, and we pair product managers with forward deployed engineers to operationalize insights in production safely.

    Coaching also extends to measurement. We anchor product outcomes in the customer journey and watch leading indicators for activation, adoption, and retention. On the delivery side, we look at deployment frequency and the health of the feedback loop between support signals and roadmap choices—because empowered product teams win when they learn faster than the market shifts.

    The most profound cultural change is mindset. Instead of asking AI for answers, we ask it for alternatives, counterexamples, and structured ways to explain tradeoffs to stakeholders. That makes product positioning clearer, decision narratives stronger, and the path from insight to execution shorter.

    If you’re responsible for developing talent, reframe coaching as enablement plus guardrails. Build the AI muscle into everyday discovery and delivery, not as a side project. When we do this well, we transform good practitioners into strategic operators—people who pair judgment with leverage and consistently ship value.

    The bottom line: AI doesn’t replace the craft; it amplifies it. Our job as leaders is to harness that amplification responsibly and turn it into a durable competitive advantage.


    Inspired by this post on SVPG.


  • How We Built Rock-Solid AI Infrastructure: Lessons From Scaling AI Visibility and Reliability

    Scaling AI Visibility pushed me to rethink what “reliable” really means for AI infrastructure. As my team expanded usage across more datasets, models, and workflows, we uncovered unexpected sources of report failure and built the guardrails, observability, and processes that now anchor our stability strategy.

    In practice, the surprising failure modes were rarely the loud ones. We saw report failure triggered by small schema drift from non-deterministic LLM outputs, silent permission changes in upstream data sources, token-limit truncation that broke downstream parsing, third-party API rate limits that surfaced only under bursty load, and clock skew that confused idempotent writes. Individually these issues looked minor; together they created reliability debt.
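
    Two of the failure modes above, schema drift and truncated output, can be caught before they reach downstream parsing. The following is a minimal sketch, not our production validator: the field names in `REQUIRED_FIELDS` and the function `validate_payload` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical contract for a report payload; real field names will differ.
REQUIRED_FIELDS = {"report_id": str, "rows": list, "generated_at": str}


def validate_payload(raw: str, required: dict = REQUIRED_FIELDS) -> list:
    """Return a list of problems; an empty list means the payload passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Token-limit truncation usually shows up as unparseable JSON.
        return [f"invalid JSON (possibly truncated): {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


ok = validate_payload(
    '{"report_id": "r1", "rows": [], "generated_at": "2025-01-01T00:00:00Z"}'
)
truncated = validate_payload('{"report_id": "r1", "rows": [1, 2')
drifted = validate_payload('{"report_id": 7, "rows": []}')
```

    Returning a list of problems rather than raising on the first one makes it easier to log every drifted field in a single pass.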

    Our first move was deep observability. We instrumented the end-to-end pipeline with structured logs, distributed tracing, and high-signal metrics mapped to SLOs and error budgets. That visibility let us separate symptom from cause, quantify impact by segment, and prioritize fixes that moved business outcomes, not just vanity thresholds. It also gave product managers and SREs a shared, real-time view to make tradeoffs explicit.
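
    To make the error-budget idea concrete, here is a small sketch of how remaining budget could be computed for one SLO window. The `SloWindow` shape, the 99.5% target, and the request counts are assumptions for illustration, not figures from our pipeline.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.995 means 99.5% of reports must succeed


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = window.total_requests * (1.0 - window.slo_target)
    if allowed_failures == 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


# Illustrative numbers only: 400 failures against roughly 1,000 allowed.
window = SloWindow(total_requests=200_000, failed_requests=400, slo_target=0.995)
remaining = error_budget_remaining(window)
```

    A value near zero, or below it, is the signal to pause risky changes and spend the next cycle on reliability work.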

    Next, we hardened the runtime with resilience patterns: circuit breakers on flaky dependencies, timeouts tuned to p95 behavior, retries with jittered backoff, idempotent processing for at-least-once delivery, and backpressure-aware queues. We enforced schema contracts at ingestion with JSON validation and added feature flags to decouple deploys from releases, so we could roll forward or back within minutes when signals degraded.
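
    As one example of these patterns, retries with jittered backoff can be sketched in a few lines. This is an illustrative version using capped exponential backoff with full jitter. The function name, defaults, and the injectable `sleep` parameter are assumptions, and it is only safe to wrap calls that are idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential bound.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, ceiling))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


# Usage with a stand-in dependency that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_report_fetch() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("upstream rate limit")
    return "report-ok"


delays = []  # record delays instead of sleeping, to keep the example fast
result = retry_with_jitter(flaky_report_fetch, sleep=delays.append)
```

    The random delay spreads retries out so that many clients recovering from the same outage do not hit the dependency in lockstep.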

    On the product side, we adopted eval-driven development for model and prompt changes, shifting risky modifications behind canaries and staged rollouts. CI/CD gates required evaluation baselines to hold or improve before promotion. We tracked DORA metrics to keep deployment frequency high without sacrificing change failure rate, and we used P95 latency and budget burn as the forcing functions for prioritization.
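
    A promotion gate of this kind can be as simple as comparing candidate eval scores against a stored baseline. The sketch below assumes higher scores are better. The metric names, scores, and tolerance are hypothetical, and a real CI job would fail the build when `blocked` is true.

```python
def find_regressions(
    baseline: dict, candidate: dict, tolerance: float = 0.01
) -> list:
    """Return metrics where the candidate is missing or worse than baseline."""
    regressions = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        # A missing metric counts as a regression: the eval did not run.
        if cand_score is None or cand_score < base_score - tolerance:
            regressions.append(metric)
    return regressions


# Illustrative scores only.
baseline = {"faithfulness": 0.91, "schema_valid_rate": 0.99}
candidate = {"faithfulness": 0.93, "schema_valid_rate": 0.95}

regressions = find_regressions(baseline, candidate)
blocked = bool(regressions)  # a CI step would exit non-zero when this is True
```

    Here the prompt change improves faithfulness but is still blocked, because the rate of schema-valid outputs dropped by more than the tolerance.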

    Culture mattered as much as code. We formalized incident management with clear ownership, lightweight runbooks, and blameless reviews that produced crisp, automatable actions. We partnered early with SRE on SLO design, integrated privacy-by-design and PII scanning into the pipeline, and treated AI risk management as an ongoing product constraint rather than a checkbox.

    The net effect: fewer flaky reports, faster recovery when things do break, and far more confidence to ship improvements to AI Visibility at pace. If you’re scaling similar capabilities, start with observability, make resilience patterns non-negotiable, and let SLOs guide your product roadmap. Reliability is not a phase—it’s the product.


    Inspired by this post on Amplitude – Best Practices.


  • Inside Amplitude’s AI Playbook: Lessons from Leo Jiang on Ask Amplitude, Agents, and Visibility

    Inside Amplitude’s AI Playbook: Lessons from Leo Jiang on Ask Amplitude, Agents, and Visibility

    I continually study how high-velocity teams turn AI ambition into shipped product, and Amplitude’s approach stands out. "Leo Jiang is the Head of Engineering, AI Products at Amplitude, focused on building new AI and marketing products. He has helped build Ask Amplitude, Agents, and AI Visibility." From a product management leadership lens, that portfolio signals a clear AI strategy: enable insight (Ask Amplitude), drive action (Agents), and ensure trust and observability (AI Visibility).

    What I appreciate most is the sequencing: start with user-facing value, build agentic AI capabilities where tasks repeat and outcomes can be evaluated, and layer AI workflows with robust governance. For PMs working with LLMs, the implication is to define success via eval-driven development (quantitative rubrics, offline test sets, and real-time feedback loops) before scaling automation. This also hints at an emerging discipline of Agent Analytics: instrument prompts, tool calls, and outcome quality so we can tune performance like we tune a funnel.

    Ask Amplitude gives a relatable example: natural-language questions lower the activation barrier for product and growth teams inside an Amplitude analytics environment. When agents turn answers into next-best actions, product-led growth becomes measurable—from hypothesis to change to impact—inside a unified decision loop. That tight loop is where product strategy, design, and reliability meet to create compounding value.

    Operationally, I organize a product trio around each capability and pair it with forward deployed engineers to accelerate discovery with customers. I also invest in privacy-by-design and data governance early, ensuring marketing use cases respect compliance while keeping iteration speed high. The goal is a repeatable path from prototype to scale that preserves momentum without compromising safety.

    My takeaway for peers: pick one high-frequency workflow, define clear agent boundaries, ship a narrow slice, and measure relentlessly. Use retrieval-first pipeline patterns for grounding, add human-in-the-loop checkpoints, and close the loop with qualitative insights from in-app guides. When that works, expand capabilities—not just features—and let outcomes vs output OKRs steer prioritization.


    Inspired by this post on Amplitude – Best Practices.


  • The AI Deployment Gap Is Widening—Accelerate to Mature ROI and World-Class CX in 2026

    The AI Deployment Gap Is Widening—Accelerate to Mature ROI and World-Class CX in 2026

    I’ve watched AI adoption accelerate dramatically over the last year, and the momentum is undeniable. Teams everywhere are experimenting, piloting, and operationalizing AI—but the ways they’re doing it, and the outcomes they’re seeing, vary widely.

    Our latest research shows that 82% of senior leaders invested in AI for customer service in 2025, and 87% plan to in 2026. That’s the new baseline. The differentiator now is depth—how far AI is embedded into core workflows, accountability, and measurement.

    Infographic comparing AI benefits in customer service: 43% with mature deployment report higher quality and consistent support, versus 24% at initial deployment; survey allowed multiple responses.
    Teams with mature AI are almost twice as likely to achieve higher, more consistent support quality. Our survey shows 43% of advanced adopters citing this benefit compared with 24% of early deployments.

    But while most teams are using AI, our 2026 “Customer Service Transformation Report” shows that this usage is not equal. A gap is opening up between teams that have deployed AI at a surface level and those that have integrated it deeply. I see this firsthand: shallow deployments answer FAQs; deep deployments redesign processes, policies, and teams.

    Infographic comparing customer service improvements after AI: 87% of mature deployments report improved metrics vs 62% of all respondents.
    Survey results highlight the AI deployment gap: nearly nine in ten organizations with mature AI see improved customer service metrics (87%), compared with 62% across all respondents.

    For this year’s report, we surveyed over 2,400 global customer service professionals across a range of industries to see how they’re using AI today, where it’s paying off, and what they’re betting on as they plan for 2026. The findings mirror my experience leading AI Strategy and AI workflows at scale.

    Infographic of customer service teams measuring AI ROI by deployment stage: 70% mature, 60% scaling, 43% initial, 35% exploring, illustrating the deployment gap.
    As AI programs advance, measurement confidence surges. This chart shows how ROI tracking rises from 35% in exploring to 70% in mature deployments—evidence of a widening execution gap in customer service.

    We found that for many teams, AI is still doing narrow work like answering simple questions or handling small parts of workflows. These teams are seeing benefits, but only a fraction of what’s possible. Meanwhile, a smaller group is pulling away. They’ve put AI at the core of their service operation, integrating it into critical workflows, giving it more responsibility, and continuously improving it over time. That’s the hallmark of mature deployment.

    Side-by-side infographic comparing 2025 vs 2026 customer service priorities. In 2026, improving CX leads at 58%, followed by reducing costs and improving efficiency at 46%, with support quality still a key focus.
    Customer service priorities are shifting fast. By 2026, improving CX tops the list at 58%, cost and efficiency climb, and quality moves to third as teams prepare to scale operations and evolve skills.

    The difference in results and overall support experience – for both teams and customers – is significant. Here’s how I interpret the data and what I recommend to close the gap.

    Ranked customer service survey chart titled 'How are existing support roles changing on your team as a result of AI?' showing 45% updated job descriptions, 40% agent AI training, and other shifts at 27–24%.
    Survey insights from the 2026 customer service transformation report reveal how AI reshapes support roles: 45% of teams updated job descriptions and 40% ramped up AI training, while human agents focus more on complex escalations.

    AI adoption is the norm; depth makes the difference. According to senior leaders, 82% of organizations invested in AI in 2025, with 87% planning to invest in the year ahead. Despite this widespread investment, only 10% of teams report having reached a mature level of deployment, where AI is fully integrated into operations and working at scale. In my playbook, maturity means end-to-end ownership of well-defined workflows, robust guardrails, and clear success criteria.

    Survey chart showing drivers to expand AI beyond support: success with AI in support (57%), unified customer experience (49%), scaling without added headcount (33%), and cross-department demand (31%).
    Early AI wins are fueling expansion beyond support. Survey results show 57% cite proven success, 49% aim for a unified customer experience, 33% need to scale without adding headcount, and 31% see demand from other teams.

    Reaching this level of maturity is where AI’s real value lies. We found that 43% of teams with mature deployment report higher quality and consistency across support – nearly double the rate of those still in the exploration or initial deployment stages. That aligns with what I see when we move from point solutions to platform thinking and agentic AI patterns.

    Leaders are racing ahead with real AI in support. Explore the 2026 Customer Service Transformation Report to see where deployment is stalling, benchmark your team, and get practical steps to scale automation that delights.

    ROI becomes clearer with deeper integration. The economic benefits of AI tend to show up first in speed and throughput, and they show up fast. Across all respondents, 62% say their customer service metrics have improved since implementing AI. Most often, teams report their initial gains in efficiency and scale—faster responses, shorter handling times, and the ability to resolve more conversations with the same team—all driving lower cost per interaction.

    But the deeper teams go with deployment, the more the results show up in the metrics. Among teams that describe their AI deployment as mature, the share of respondents reporting improved metrics as a result of AI is 87%, up from 62% across all respondents. What’s more, teams with more mature deployments are significantly more likely to say they can measure the return on their AI investment. My advice: instrument everything upfront, baseline rigorously, and use eval-driven development to iterate with confidence.

    The bar has moved from ‘does it work?’ to ‘is it actually good?’ More than ever, teams are focused on improving customer experience and satisfaction, with 58% saying it’s the top priority for 2026. That number has more than doubled since last year, when just over a quarter (28%) of respondents cited it as a top priority. As AI assumes repetitive work, your people can shift from reactive triage to proactive journey design. Now is the time to invest in quality frameworks, prompt engineering standards, and LLM fluency for product managers to close the loop between product, ops, and CX.

    Important support work now extends beyond the inbox. AI is reorganizing core customer service operations as it starts to take on a higher volume of work and more complex tasks. Even at the initial deployment stage, 16% of teams report spending less time handling support volume since implementing AI – and among teams who’ve reached maturity, that figure rises to 28%. I’ve seen new roles emerge—AI operations managers, conversation designers, and model evaluators—alongside upskilling for agents into higher-order troubleshooting and relationship building.

    Support is creating the blueprint for AI deployment across the business. Support was the proving ground for AI, and our research suggests that businesses are now planning to expand its use to other areas based on the results it’s yielded so far. Fifty-two percent of respondents said that their organizations are actively planning to scale AI to departments like customer success, marketing, and sales in 2026. The two most cited driving forces behind this decision are the success support has seen with AI to date and a desire to create a unified customer experience. Treat your support stack as a reusable platform: shared services, governance, and reusable components accelerate adoption in adjacent functions.

    Seize the opportunity to close the gap. Having or not having AI isn’t a question anymore. What you should be asking now is how close you are to mature deployment, where AI is capable of tackling nuanced, high-stakes work. Those who have reached this stage show that going deep is what unlocks real value. That’s the opportunity. Push AI to do more, bring it to more channels, use it to resolve the most complex queries, and close the gap before it becomes too wide to close.

    This might seem daunting. But trying new things always is. What we’re experiencing now is a defining moment for customer service, and the teams that are leaning in are actively building the future. As this report shows, what works in customer service now will become the blueprint for how organizations transform the full customer journey with AI. If you want the benchmarks and the playbook to accelerate from pilots to production-grade outcomes, I recommend reviewing the full “2026 Customer Service Transformation Report.”


    Inspired by this post on The Intercom Blog.


  • AI Operating Model Masterclass: How I Scale Teams, Tech, and Governance Without Chaos

    AI Operating Model Masterclass: How I Scale Teams, Tech, and Governance Without Chaos

    When I set out to operationalize AI across a product organization, I focus on one promise: repeatable outcomes without chaos. An effective AI operating model turns experiments into an engine—aligning strategy, teams, technology, and governance so we can ship value safely and at scale.

    At its core, an AI operating model is the connective tissue between vision and delivery. I anchor it on a few pillars: clear AI Strategy, empowered cross-functional teams, a modern AI platform, rigorous AI risk management and data governance, and a cadence of eval-driven development that ties everything back to outcomes.

    Strategy comes first. I translate big ambitions into a portfolio of use cases ranked by customer impact, feasibility, and risk. I use continuous discovery to validate the problem, then frame each bet with outcomes vs output OKRs, a crisp value proposition, and a build vs buy decision. For generative AI, I encourage PMs to treat working with LLMs as a craft: rapid prototyping, deliberate prompt engineering, and disciplined evaluation from day one.

    Team design matters as much as models. I organize around product trios—PM, design, and engineering—augmented by data, ML, and a “forward deployed” mindset when the domain is complex. I invest in empowered product teams and communities of practice to spread patterns quickly while avoiding centralized bottlenecks.

    On the platform side, I start with a retrieval-first pipeline before fancy modeling. A solid foundation (feature stores, vector search, observability, and safe integration points) beats bolt-on hacks. I rely on CI/CD with feature flags, strong deployment frequency, DORA metrics, and SRE-grade reliability to keep the iteration loop tight and safe.

    Governance is non-negotiable. I implement privacy-by-design, clear data governance, audit trails, and policy controls aligned to regulatory compliance. AI risk management includes model red teaming, safety layers, and human-in-the-loop review where needed. The goal is confidence: we know what shipped, why it works, and how it fails.

    Execution rides on eval-driven development. For every AI workflow, I define offline and online test sets, target metrics, and a decision policy before launch. I size A/B tests for a predefined minimum detectable effect (MDE), layer canaries for protection, and monitor user experience and outcomes in production. This is how we turn “it seems smarter” into statistically confident improvements.
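
    Sizing a test around an MDE comes down to a short calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The 20% baseline rate and 2-point absolute MDE are illustrative assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs <= 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and mde_abs must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative inputs: 20% baseline task success, 2-point absolute MDE.
n_per_arm = sample_size_per_arm(baseline_rate=0.20, mde_abs=0.02)
```

    Because the required sample grows with the inverse square of the MDE, halving the effect you want to detect roughly quadruples the traffic you need, which is why the MDE is worth agreeing on before launch.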

    Adoption is a product in itself. I build onboarding, in-app guides, and product tours that help users form habits quickly. I monitor activation, time-to-value, and retention analysis while aligning with our customer support AI strategy to close the loop between real-world issues and roadmap priorities.

    Culture scales the system. I normalize rapid learning, shared playbooks, and personal knowledge management so insights don’t disappear into meetings or notebooks. I upskill teams on prompt engineering, context window management, and model selection, and I celebrate the humility required to refactor what “worked” yesterday.

    Operating cadence keeps it all coherent. I run an AI portfolio review tied to outcomes vs output OKRs, keep a single source of truth for evaluations, and align go-to-market strategy with release readiness. We review risks alongside results so speed never outruns safety.

    If you’re starting from scratch, I recommend a 30-60-90 approach: baseline your current state, choose two lighthouse use cases, stand up the retrieval-first pipeline and eval harness, define governance and data policies, then ship small, safe increments behind feature flags. Teach the system to learn before you make it run.

    I’ve felt the pain of brilliant prototypes that crumble in production and the thrill of AI features that compound value month after month. The difference is the operating model. Build it with intent, and you’ll scale AI with confidence—teams aligned, tech resilient, and customers seeing real outcomes.


    Inspired by this post on Product School.


  • Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Trust is the currency of any high-stakes AI product, and nowhere is that more true than in healthcare. I recently dug into how Healio built an AI assistant for physicians—an audience that can’t afford to be wrong—and it’s a masterclass in balancing accuracy, transparency, and speed without compromising credibility.

    Healio, a 125-year-old medical publishing company, set out to create Healio AI to help clinicians prepare for patient care. From the outset, their guiding principle was simple: physicians won’t trust you until you prove it. That lens shaped every decision—from discovery and prototyping to architecture, evaluation, and ongoing validation.

    Discovery started with a survey of 300 healthcare professionals to understand real-world needs at the point of care. The headline insight: physicians primarily want AI for preparation, not bedside use. Even more surprising, the top ask wasn’t purely diagnostic support; it was help with patient communication and empathy—translating complex information into clear, accessible conversation.

    Momentum mattered. After beginning with Figma mockups to validate workflows, the team built a working prototype in a single weekend using Cursor. That velocity wasn’t about cutting corners; it was about proving value quickly, reducing ambiguity, and iterating with concrete feedback from physicians.

    Under the hood, the system employs RAG and hybrid search—combining lexical search, vector search, and semantic search across multiple trusted sources like PubMed. As any PM who has integrated biomedical literature knows, "just use PubMed" isn’t simple—there are five different ways to access the same data, each with trade-offs. The team made pragmatic choices to balance freshness, coverage, latency, and cost while preserving trust in source quality.
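
    The source does not say how Healio merges results from its different retrievers. One common way to combine ranked lists in a hybrid search setup is reciprocal rank fusion (RRF), sketched below as an illustration only. The document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by any retriever earn a larger share.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a lexical retriever and a vector retriever.
lexical_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-b", "doc-d", "doc-a"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

    RRF needs only ranks, not comparable scores, which makes it a convenient starting point when lexical and vector retrievers report relevance on different scales.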

    Designing for trust extended all the way to the citation UX. The team leaned into citations that physicians actually trust: subscripts, hover states, and progressive disclosure. This gave clinicians verifiable threads back to source material without overwhelming the core interaction, aligning with how experts want to audit evidence under time pressure.

    Evaluation wasn’t left to chance. They stood up eight LLM judges for evals: safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and overall quality. Just as importantly, they treated those signals as directional, not definitive. In a high-stakes domain, physician feedback trumps LLM-as-judge feedback—so they complemented automated evals with direct reviews from practicing clinicians to calibrate quality and reduce hallucinations.
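
    The "directional, not definitive" principle can be expressed as a simple triage rule: automated judge scores flag answers for attention, and a physician verdict, when one exists, always wins. This is a hedged sketch of that idea, not Healio's implementation. The dimension names follow the eight judges listed above, while the score scale, the 0.7 floor, and the verdict labels are assumptions.

```python
from typing import Optional

# The eight judge dimensions named in the post.
JUDGE_DIMENSIONS = (
    "safety", "medical_accuracy", "faithfulness", "relevancy",
    "completeness", "reasoning", "clarity", "overall_quality",
)


def triage_answer(
    judge_scores: dict,
    physician_verdict: Optional[str] = None,
    floor: float = 0.7,
) -> str:
    """Route an answer using judge scores, deferring to physician review."""
    if physician_verdict is not None:
        # Expert feedback outranks LLM-as-judge signals.
        return physician_verdict
    missing = [d for d in JUDGE_DIMENSIONS if d not in judge_scores]
    if missing:
        raise ValueError(f"missing judge scores: {missing}")
    weak = [d for d in JUDGE_DIMENSIONS if judge_scores[d] < floor]
    return "needs_physician_review" if weak else "pass"


# Hypothetical scores on a 0-to-1 scale.
strong = {dimension: 0.9 for dimension in JUDGE_DIMENSIONS}
shaky = dict(strong, safety=0.5)

auto_pass = triage_answer(strong)
flagged = triage_answer(shaky)
overridden = triage_answer(strong, physician_verdict="fail")
```

    Note that a low score never fails an answer outright in this sketch; it only queues the answer for a clinician, which keeps the automated judges in an advisory role.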

    On the safety front, the team implemented HIPAA compliance and input guardrails for masking personal health information. That choice reflects strong data governance and privacy-by-design thinking: protect PHI by default, constrain prompts to safe boundaries, and make compliance a first-class citizen in the product architecture.
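
    An input guardrail for masking identifiers might look like the sketch below. It is deliberately simplistic: the regular expressions cover only a few obvious US-style formats, the placeholder tokens are invented for this example, and pattern matching alone would not be sufficient for HIPAA compliance. It shows the shape of the idea, not the team's actual guardrail.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def mask_phi(text: str) -> str:
    """Replace obvious identifiers with placeholders before the LLM call."""
    for placeholder, pattern in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Fabricated example input; none of these identifiers are real.
prompt = (
    "Patient DOB 04/12/1961, call 555-201-3344 or jo@example.org "
    "about SSN 123-45-6789."
)
masked = mask_phi(prompt)
```

    Masking at the input boundary means the model, its logs, and any downstream traces only ever see the placeholder tokens.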

    They also addressed monetization without compromising experience. Serving contextual ads while the LLM processes queries is a practical approach that preserves physician workflow efficiency and creates a clear, non-intrusive revenue model.

    Critically, the work didn’t stop at launch. The Healio Innovation Partners provide ongoing discovery and validation, ensuring the system evolves with physician needs and the medical evidence base. This is the operating cadence you want for any AI product that sits at the intersection of safety, accuracy, and fast-changing knowledge.

    My takeaways for building AI in high-stakes domains: prioritize retrieval-first pipelines over model cleverness; couple RAG with hybrid search across vetted sources; design citations that earn trust at a glance; use eval-driven development, but let domain-expert feedback be the ultimate judge; and embed regulatory compliance into your product strategy from day one. If trust is your North Star, this is a playbook worth emulating.


    Inspired by this post on Product Talk.


  • From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

    From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

    Anyone who has lived inside construction tendering knows the grind. "When a construction company receives a bid request, someone has to open that email, parse the attached PDF (sometimes 1,800 pages describing an entire building), figure out which products are relevant, look up pricing, and draft a quote—all before the deadline. It's tedious, error-prone, and surprisingly manual." That painful reality is exactly why this conversation about Tendos AI caught my attention—and why it matters for product leaders building agentic AI in complex, document-heavy workflows.

    I listened as Daniel Kappler and Matthias Hilscher from Tendos AI walked through how they’re automating the tendering workflow for manufacturers in the construction industry. What began as a narrow prototype—matching radiator requests to product catalogs—has matured into a full agentic system that does the heavy lifting from email categorization to offer generation. The end result: a scalable AI workflow that tackles messy inputs, orchestrates specialized agents, and produces quotes that are ready for human review—or even straight-through processing.

    What impressed me most was the rigor. They validated the opportunity with a design partner, spent a week on-site observing real workflows, and then engineered a multi-agent architecture where specialized agents collaborate, including a "review agent" that checks work before anything reaches a human. They evaluate each agent independently (not just the whole chain), built custom observability when off-the-shelf tooling fell short, and use human-in-the-loop feedback to push toward a self-learning system.

    From a product management perspective, this is agentic AI done right. It blends continuous discovery with eval-driven development, thoughtful UX decisions, and pragmatic guardrails. Evaluating agents individually makes debugging tractable and change detection transparent; a dedicated "review agent" mirrors code review to reduce error propagation; and custom tracing plus Agent Analytics provide the observability needed to operate AI workflows reliably at scale.

    My key takeaway: "Start narrow to prove value: Tendos AI began with just radiators for one design partner before expanding to all building products"—a classic wedge strategy that accelerates learning while building credibility.

    Another takeaway I’ll adopt in future roadmaps: "Own the interface: building a web application (vs. integrating into legacy systems) gave them control over UX and the ability to iterate toward full automation." Controlling the surface area let them move faster than a purely backend integration ever could.

    On measurement and reliability, I loved this: "Evaluate each agent, not just the chain: per-agent evals make debugging tractable and show exactly where performance changed." That’s true eval-driven development—aligning metrics to decision points rather than only outcomes.
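
    Per-agent evaluation can be sketched as a harness that scores each agent against its own labeled cases. Everything below is hypothetical: the two stub agents stand in for real LLM-backed components, and the test cases are invented. Tendos AI's actual harness is not described in the episode at this level of detail.

```python
def evaluate_agents(agents: dict, cases: dict) -> dict:
    """Score each agent on its own cases; returns accuracy per agent name."""
    report = {}
    for name, agent in agents.items():
        examples = cases.get(name, [])
        if not examples:
            continue  # no labeled cases for this agent yet
        correct = sum(1 for given, expected in examples if agent(given) == expected)
        report[name] = correct / len(examples)
    return report


# Stub agents standing in for real components (illustrative only).
def categorize_email(text: str) -> str:
    return "tender" if "bid" in text.lower() else "other"


def match_product(spec: str) -> str:
    # Deliberately case-sensitive, so one case below fails.
    return "radiator-600" if "radiator" in spec else "unknown"


agents = {"email_categorizer": categorize_email, "product_matcher": match_product}
cases = {
    "email_categorizer": [
        ("New bid request attached", "tender"),
        ("Lunch on Friday?", "other"),
    ],
    "product_matcher": [
        ("panel radiator 600mm", "radiator-600"),
        ("Radiator, type 22", "radiator-600"),
    ],
}

report = evaluate_agents(agents, cases)
```

    In this toy run the categorizer scores perfectly while the matcher drops to half, so the regression is localized to one agent without stepping through the whole chain.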

    Quality gates matter in automation, and they nailed it: "Use review agents: a separate agent that checks work (like code review) catches errors before they reach humans." It’s a simple pattern with outsized ROI.

    Finally, the product-market signal is unmistakable: "Let customers pull you: customers asked Tendos to replace their CPQ software—strong signals of product-market fit." When buyers invite you to displace existing systems, you’re past validation and into expansion.

    If you’re exploring agentic AI for enterprise workflows, the themes here are gold: the tendering chain in construction is ripe for automation; domain expertise accelerates opportunity discovery; robust entity extraction across PDFs ranging from 1 to 1,800+ pages is non-negotiable; planning patterns for creating and updating task plans matter; agents must reason about product fit against customer requirements; custom tracing and observability unlock debugging for complex agent chains; and human feedback loops pave the path to self-learning systems.

    Guests: Daniel Kappler — CPO (Product & Design), Tendos AI; Matthias Hilscher — CTO (Engineering), Tendos AI.

    Want to dive deeper? Listen to this episode on: Spotify | Apple Podcasts.

    Explore the team and product: Tendos AI.

    For builders of agentic AI, here’s my playbook distilled from this story: start narrow to earn trust and accuracy; own the interface to speed iteration; use per-agent evaluations to localize issues; add a "review agent" as a quality gate; invest early in tracing, observability, and Agent Analytics; keep humans in the loop until your metrics justify autonomy; and let strong pull signals guide your roadmap. That’s how you turn complex emails and massive PDFs into precise, production-grade quotes—consistently.


    Inspired by this post on Product Talk.

