Tag: prompt engineering

  • We Built Agent Analytics After Observability Broke—Why Your AI Team Needs It Now

    I remember the exact moment our product crossed the threshold from scripted automation to truly agentic AI. The excitement was real—so was the pit in my stomach when our dashboards went dark. Our trusted analytics and observability stack, which had served us flawlessly for traditional software, suddenly couldn’t explain what the agent was doing, why it made certain choices, or how to reproduce outcomes across runs.

    "The moment our product became an AI agent, our entire observability stack became irrelevant, which is not something you want as an analytics company. Here's what we did."

    Why does this happen? Agentic AI doesn’t behave like conventional apps. Instead of deterministic flows and neatly tagged events, we face non-deterministic trajectories, tool-use chains, evolving prompts, context window dynamics, and policy guardrails that influence outcomes in real time. Clicks and pageviews give way to tokens, tool calls, and conversation turns. Without purpose-built observability, you can’t do credible product discovery, measure behavioral analytics, or run eval-driven development with confidence.

    That’s why we built Agent Analytics. We needed a unified lens to trace every step of an AI workflow—from user intent to model prompts, function calls, retrievals, tool outputs, and final responses—while capturing latency, cost, guardrail hits, fallbacks, and outcome tags. We instrumented runs end-to-end, added experiment support for prompt engineering and policy variants, and wired in evaluations so we could turn subjective quality into objective signals the team could act on.
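
    To make the idea of a step-level trace concrete, here is a minimal sketch of what one traced run could look like. The `Step` and `RunTrace` names and their fields are assumptions chosen for illustration, not the schema of any real product.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step in an agent run (hypothetical schema for illustration)."""
    kind: str                  # e.g. "prompt", "tool_call", "retrieval", "response"
    name: str
    latency_ms: float
    cost_usd: float = 0.0
    guardrail_hit: bool = False
    fallback: bool = False


@dataclass
class RunTrace:
    """All steps of a single run, from user intent to final response."""
    run_id: str
    user_intent: str
    steps: list[Step] = field(default_factory=list)
    outcome_tag: str = "unknown"

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def summary(self) -> dict:
        return {
            "run_id": self.run_id,
            "steps": len(self.steps),
            "total_latency_ms": sum(s.latency_ms for s in self.steps),
            "total_cost_usd": round(sum(s.cost_usd for s in self.steps), 6),
            "guardrail_hits": sum(s.guardrail_hit for s in self.steps),
            "fallbacks": sum(s.fallback for s in self.steps),
            "outcome": self.outcome_tag,
        }


trace = RunTrace(run_id="run-001", user_intent="reschedule my appointment")
trace.record(Step("retrieval", "lookup_calendar", latency_ms=120.0))
trace.record(Step("tool_call", "book_slot", latency_ms=340.0, cost_usd=0.002, guardrail_hit=True))
trace.record(Step("response", "final_answer", latency_ms=800.0, cost_usd=0.004))
trace.outcome_tag = "resolved"
print(trace.summary())
```

    Keeping every step in one record per run is what makes a failure reproducible: you can replay the same intent and compare the step lists side by side.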

    The impact on product management was immediate. We shortened iteration cycles by making failure states obvious and reproducible, turned ambiguous feedback into structured data, and gave engineers and designers a shared source of truth for conversation design and AI workflows. With visibility into containment, escalation, autonomy ratio, and step-level success, we could ship confidently, roll back safely, and align roadmap bets to measurable outcomes, not anecdotes.

    Building this capability demanded more than logging. We invested in data governance and privacy-by-design to mask sensitive content while preserving semantic context, and we separated human-identifiable data from model telemetry. We treated prompts and policies like code—versioned, diffable, and safely rolled out behind feature flags and CI/CD—so we could experiment without risking regressions in production.
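
    Treating prompts like code can start very small. The sketch below assumes a content hash as the version id and uses Python’s standard `difflib` to produce a reviewable diff; the helper names `prompt_version` and `prompt_diff` and the sample prompts are illustrative only.

```python
import difflib
import hashlib


def prompt_version(text: str) -> str:
    """Content hash used as a version id, so identical prompts share an id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def prompt_diff(old: str, new: str) -> list[str]:
    """Unified diff between two prompt versions, one list entry per line."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=prompt_version(old),
        tofile=prompt_version(new),
        lineterm="",
    ))


v1 = "You are a support agent.\nAnswer only from the provided context."
v2 = v1 + "\nEscalate billing disputes to a human."

for line in prompt_diff(v1, v2):
    print(line)
```

    In practice the prompt text would live in version control and the diff would show up in code review, but the same two ideas apply: a stable identifier per version and a line-level diff between versions.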

    What should every team measure? Start with outcome quality (task success, resolution, containment), reliability (tool success rate, guardrail triggers, fallbacks), performance (time-to-first-token, total latency, step-level latency), and efficiency (tokens and cost per successful task). Add groundedness checks for retrieval steps, regression evals for core journeys, and post-release anomaly detection to catch drift before users do. These metrics become your operating system for agent performance and your compass for product strategy.
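
    Several of these metrics are simple ratios over run records. The following sketch assumes a flat `Run` record with made-up fields and sample values; it is meant to show the arithmetic, in particular that cost per successful task divides all spend (including failed runs) by successes only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    success: bool
    escalated: bool
    tokens: int
    cost_usd: float
    tool_calls: int
    tool_failures: int


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def agent_metrics(runs: list[Run]) -> dict:
    successes = sum(r.success for r in runs)
    tool_calls = sum(r.tool_calls for r in runs)
    tool_ok = tool_calls - sum(r.tool_failures for r in runs)
    return {
        "task_success_rate": ratio(successes, len(runs)),
        "containment_rate": ratio(sum(not r.escalated for r in runs), len(runs)),
        "tool_success_rate": ratio(tool_ok, tool_calls),
        # Spend on failed runs still counts toward the cost of each success.
        "cost_per_successful_task": ratio(sum(r.cost_usd for r in runs), successes),
        "tokens_per_successful_task": ratio(sum(r.tokens for r in runs), successes),
    }


runs = [
    Run(success=True, escalated=False, tokens=1000, cost_usd=0.01, tool_calls=2, tool_failures=0),
    Run(success=True, escalated=False, tokens=2000, cost_usd=0.02, tool_calls=2, tool_failures=0),
    Run(success=True, escalated=False, tokens=3000, cost_usd=0.03, tool_calls=3, tool_failures=1),
    Run(success=False, escalated=True, tokens=6000, cost_usd=0.06, tool_calls=3, tool_failures=1),
]
metrics = agent_metrics(runs)
print(metrics)
```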

    If you’re building or scaling AI agents, you need Agent Analytics before you hit your first incident. It’s the difference between guessing and knowing—between reactive firefighting and proactive iteration. With the right observability, your team can move faster, manage risk intelligently, and translate agent behavior into business outcomes that compound over time.


    Inspired by this post on Amplitude – Best Practices.


  • Agentic Architecture Demystified: How Modern AI Systems Plan, Learn, and Execute at Scale

    In my role leading product teams at HighLevel, I’m often asked to explain what’s really happening behind the scenes of today’s AI products. The short answer is that modern systems are built on an agentic architecture: not just a single model, but a coordinated loop of planning, tool use, memory, and evaluation. Once you see that pattern, the design decisions snap into focus and the roadmap becomes far easier to prioritize.

    At its core, agentic AI treats the model as a reasoning engine embedded within an AI workflow. The agent interprets intent, plans steps, calls the right tools and APIs, grounds itself in trusted data, and then evaluates outcomes before deciding to continue or stop. This loop creates reliability, reduces hallucinations, and enables the system to operate in real-world, multi-step scenarios.

    Here’s the practical lifecycle I rely on. A user provides intent (a goal or request). We run a retrieval-first pipeline to ground the model in accurate, current data. Prompt engineering structures the task and primes the agent with constraints and success criteria while keeping the context window under control. The agent generates a plan, executes steps by calling tools or services, evaluates intermediate results, reflects or revises as needed, and only then returns a final answer with clear citations or evidence.
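
    The lifecycle above can be sketched as a small control loop. Everything here is an assumption made for illustration: the `run_agent` function, its callback signatures, and the toy stand-ins for retrieval, planning, tool calls, and evaluation. A real system would call a model and real tools at each of those points.

```python
from typing import Callable


def run_agent(
    intent: str,
    retrieve: Callable[[str], list[str]],
    plan: Callable[[str, list[str]], list[str]],
    act: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str], bool],
    max_retries: int = 1,
) -> dict:
    evidence = retrieve(intent)             # ground first
    steps = plan(intent, evidence)          # then plan
    results = []
    for step in steps:
        for _attempt in range(max_retries + 1):
            result = act(step, evidence)    # call a tool or service
            if evaluate(step, result):      # check the intermediate result
                results.append(result)
                break
        else:
            # Every attempt failed evaluation: stop rather than guess.
            return {"status": "escalate", "failed_step": step, "evidence": evidence}
    return {"status": "done", "results": results, "evidence": evidence}


# Toy stand-ins so the loop can run end to end.
def retrieve(intent: str) -> list[str]:
    return ["Refunds are allowed within 30 days of purchase."]


def plan(intent: str, evidence: list[str]) -> list[str]:
    return ["check_eligibility", "draft_reply"]


def act(step: str, evidence: list[str]) -> str:
    return f"{step}: ok"


def evaluate(step: str, result: str) -> bool:
    return result.endswith("ok")


outcome = run_agent("Can I get a refund?", retrieve, plan, act, evaluate)
print(outcome["status"], outcome["results"])
```

    Returning the evidence alongside the results is what lets the final answer carry citations, and the explicit escalate branch gives the loop a defined way to stop.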

    For more complex work, I orchestrate multiple specialized agents—commonly a planner, a solver, and a critic—coordinated by a lightweight controller. This multi-agent pattern reduces single-agent blind spots, encourages self-checking, and mirrors how empowered product teams collaborate. Whether it’s conversation design for support flows or a voice AI agent driving hands-free tasks, orchestration is the difference between a clever demo and a dependable product.

    Memory is the second pillar. Short-term working context sits in the prompt, while long-term memory lives in vector stores or databases to track past interactions, preferences, and outcomes. Retrieval augments the model with the right facts at the right time, and tight context window management ensures the agent stays focused on signal, not noise. The result is faster responses, lower costs, and far better accuracy.
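
    As a rough illustration of long-term memory retrieval, the sketch below ranks stored items by cosine similarity to a query vector and keeps only the top few for the prompt. The three-dimensional vectors are hand-written placeholders; in a real system they would come from an embedding model and live in a vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], memory: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k memory texts most similar to the query, best first."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]


# Placeholder vectors; real ones would be produced by an embedding model.
memory = [
    ("Customer prefers email follow-ups", [0.9, 0.1, 0.0]),
    ("Last order was refunded in March", [0.1, 0.9, 0.1]),
    ("Account is on the annual plan", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.8, 0.1]

print(top_k(query, memory, k=2))
```

    Capping the result at `k` items is the context window management part: only the most relevant memories are spent on prompt tokens.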

    Reliability is earned through eval-driven development and robust AI risk management. I define offline and online evaluations, guardrails, and human-in-the-loop checkpoints before scaling traffic. These evaluations become living, automated tests that protect against regressions as prompts, models, and tools evolve. The payoff is real: fewer escalations, higher trust, and measurable improvements to quality over time.

    From a product strategy perspective, I resist over-engineering. Start with a simple retrieval-first pipeline and a single agent; prove value; then layer in multi-agent orchestration only where it moves key metrics. Instrument everything—latency, cost, grounding coverage, and outcome quality—and build Agent Analytics dashboards so teams can diagnose issues and iterate with confidence.

    If you’re looking for a practical playbook, here’s mine: clarify the user intent and success criteria; design the tools the agent can call; ground with authoritative data; write prompts that constrain scope and define termination conditions; add reflection and automated evaluations; and ship behind feature flags for safe, staged rollout. Each step compounds reliability without killing velocity.

    The diagram and the video above bring these patterns to life. If you watch closely, you’ll see the same loop—plan, retrieve, act, evaluate—show up in every effective implementation, regardless of domain. That repetition isn’t accidental; it’s the backbone of agentic architecture and a blueprint you can adapt to your own stack.

    Ultimately, what matters is outcomes. When we build around agentic AI, we create systems that are explainable to stakeholders, maintainable by engineers, and genuinely helpful to customers. That’s how we move past hype to durable impact—shipping AI products that plan, learn, and execute at scale.


    Inspired by this post on Product School.


  • Ship MVPs in Days, Not Months: My Proven Prompt Prototyping Playbook for Product Teams

    Most MVPs take too long, cost too much, and still miss the mark. Over the past year, I’ve shifted my team to a prototyping prompts approach that lets us validate problem-solution fit in days, not months. The result is faster learning loops, clearer tradeoffs, and a dramatically higher hit rate on features that actually move the needle.

    When I say prototyping prompts, I mean structured, layered instructions that guide generative AI systems to produce the right artifacts at the right fidelity. Instead of jumping straight to code, we generate concise problem briefs, user stories, interaction flows, low-fidelity UI descriptions, and test plans. Each pass is constrained by acceptance criteria and business outcomes, which keeps the work grounded in value rather than output.

    Here’s the playbook my product trios use to go from idea to a testable MVP in 48–72 hours. First, we anchor on outcomes vs output OKRs and clarify the customer job-to-be-done using evidence from customer interviews and support data. This is classic continuous discovery, but we compress it by focusing on the single riskiest assumption to de-risk this week.

    Second, we build a prompt scaffold. We specify the role, constraints, target users, success metrics, and the exact output format we expect. We also define evaluation upfront, borrowing from eval-driven development. For example, before any generation, we list the acceptance tests that a good solution must pass, including edge cases and compliance considerations. This discipline keeps hallucinations in check and improves repeatability.
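
    A prompt scaffold like this can be kept as structured data and rendered to text, which makes it easy to version and reuse. The `PromptScaffold` class, its field names, and the sample content below are assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass


def bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


@dataclass
class PromptScaffold:
    role: str
    target_users: str
    constraints: list[str]
    success_metrics: list[str]
    output_format: str
    acceptance_tests: list[str]

    def render(self) -> str:
        """Render the scaffold as one prompt, one labeled section per field."""
        sections = [
            ("Role", self.role),
            ("Target users", self.target_users),
            ("Constraints", bullets(self.constraints)),
            ("Success metrics", bullets(self.success_metrics)),
            ("Output format", self.output_format),
            ("Acceptance tests", bullets(self.acceptance_tests)),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


scaffold = PromptScaffold(
    role="Senior product manager drafting a lean product brief",
    target_users="Clinic receptionists who schedule appointments by phone",
    constraints=["Use only the attached interview notes", "Do not include patient names"],
    success_metrics=["Time to book an appointment", "Task completion rate"],
    output_format="One-page brief with sections: Problem, Users, Solution, Risks",
    acceptance_tests=["Covers the rescheduling edge case", "States one riskiest assumption"],
)
prompt = scaffold.render()
print(prompt)
```

    Writing the acceptance tests into the scaffold before any generation is the point: the same list can later be used to check the output.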

    Third, we spin up multiple prototypes in parallel. One prompt generates a lean product brief; another outlines user flows; a third proposes UI states and error handling. If we’re exploring voice, we add prompt engineering for voice to script dialogs and repair strategies. For data-heavy features, we call out retrieval-first pipeline patterns so the model references source-of-truth data rather than guessing.

    Fourth, we validate with real users using the lightest-weight experiment possible. Fake-door tests, concierge workflows, and guided click-throughs let us measure intent before we invest. Where we can, we run quick A/B tests and size them using the minimum detectable effect (MDE) so we don’t over- or under-sample. The point isn’t perfection; it’s fast, directional signal to inform the next iteration.
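
    For sizing, one common approach is the normal-approximation formula for comparing two conversion rates: the sample size per arm is (z for significance + z for power) squared, times the summed variances of the two rates, divided by the MDE squared. The sketch below assumes a two-sided test, an absolute MDE, and illustrative inputs (10% baseline, 2-point lift); it is a planning estimate, not a substitute for a proper power analysis.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


n = sample_size_per_arm(baseline=0.10, mde=0.02)
print(f"About {n} users per arm")
```

    Halving the MDE roughly quadruples the required sample, which is why picking the smallest effect worth acting on matters more than any other input.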

    Fifth, we instrument and ship behind feature flags. We track activation, task completion, and time-to-value from day one. On the delivery side, we watch DORA metrics and deployment frequency to ensure we’re learning continuously rather than batching big bets. This bridges discovery and delivery so roadmaps reflect real-world feedback, not assumptions.

    One recent example: we needed to evaluate a voice AI agent for appointment scheduling. In 72 hours, prompts produced the problem brief, dialog flows, error recovery strategies, and a sandbox to simulate inbound requests across three user personas. We exposed a thin slice to a pilot cohort, captured call outcomes, and iterated the repair prompts twice before writing any production code. The pilot converted at a higher rate than our control flow and gave us the confidence to invest in full integration.

    This approach only works if we treat governance as a first-class concern. We bake in privacy-by-design, clear data governance boundaries, and AI risk management from the start. Prompts include guardrails on personally identifiable information, explicit constraints on data use, and links to approved sources. We also maintain a prompt repository with versioning and automated evaluations so changes are observable and reversible.

    Practically, strong prompt scaffolds share three traits. They’re specific about context and constraints, they define success in measurable terms, and they separate concerns by artifact type. I’ll often ask for three variants with different tradeoffs, then run a quick synthesis prompt that highlights points of parity and differentiation. This gives the team structured options rather than a single, brittle path.

    If you’re starting from zero, begin with one high-leverage workflow. Write a crisp outcome statement, draft your acceptance tests, and create a prompt that outputs a one-page brief, three user flows, and the top five risks with mitigations. Validate with five users in 48 hours, then decide: double down, pivot, or park. Rinse and repeat, and your product roadmapping and sprint planning will shift from speculation to evidence.

    The bottom line is simple. Prototyping prompts won’t replace product judgment, but they will accelerate it. By turning ideas into testable artifacts in hours, you minimize waste, maximize learning, and ship better MVPs—fast.


    Inspired by this post on Product School.


  • 12 Game-Changing Updates to Fin Procedures & Simulations for Complex Queries

    Today, I’m excited to share 12 major updates to Fin’s Procedures and Simulations—the foundation that lets Fin handle complex work while keeping teams fully in control of the customer experience.

    In my work building AI workflows with product and support leaders, I’ve seen how the right blend of natural language instructions, deterministic controls, and fully agentic behavior turns Fin into a reliable problem solver. Procedures make this blend possible by enabling Fin to act like a human—yet with the repeatability and governance of software. Simulations then let us test those complex Procedures at scale before they reach customers, so we can deploy with confidence.

    Together, these capabilities make Fin self-manageable, transparent, and ready for genuinely complex work.

    Here’s what’s new at a glance: we’ve made Procedures easier to build and maintain; enhanced deterministic controls for precision and policy compliance; expanded agentic behavior so Fin can adapt in real time; and delivered more powerful Simulations to validate end-to-end workflows before go-live.

    Why did we build this? Many teams see early AI gains in speed, coverage, and cost to serve—but then hit a ceiling. They keep AI confined to simple automation and information retrieval, rather than setting it up to handle the nuanced, multi-step workflows they still trust to humans. We designed Procedures and Simulations to remove that ceiling, so teams can confidently set up, govern, and iterate on complex AI workflows without bottlenecks.

    Figure: The continuous AI lifecycle loop (Analyze, Train, Test, Deploy), with the Train phase highlighted to emphasize iterative development and evaluation.

    We also heard that teams needed an easy way to connect data so Fin could reliably check customer status or eligibility and then take action. And they didn’t want to route through engineering every time they needed to create or amend logic for mid-conversation decisions. Procedures combine natural language instructions with intuitive data connector setup. You tell Fin in your own words how you want it to behave, and you’re guided through creating conditional steps so Fin reacts consistently, with the option to add code snippets where absolute precision is required. Once you build one Procedure, we believe you’ll want to build several. Fin therefore continuously reads the conversation to make sure it’s following the most relevant Procedure, and jumps to a more relevant one if the user’s intent changes.

    I know that taking something like this live the first time can feel like a leap of faith. That’s exactly why we built Simulations—to test Procedures comprehensively, uncover edge cases, and launch with confidence.

    Reaching mature deployment takes a deliberate, ongoing commitment to training workflows, validating them before deployment, measuring performance in production, and refining them over time. At Intercom, we call this the Fin Flywheel: train, test, deploy, analyze. Procedures form the foundation of the train stage, and Simulations make the test stage reliable at scale. Together, they enable Fin to handle complex work, and teams to stay in control of it.

    Procedures: Define exactly how Fin handles complex work. With Procedures, I can set Fin up to resolve complex, time-consuming queries that require multiple steps or business logic. Fin follows standard operating procedures and applies sound judgment—just like a seasoned teammate—so even complicated queries are resolved in controllable, predictable ways.

    Figure: The Procedures editor for “Procedure: Damaged food order”, showing when-to-use guidance, the option to train Fin on examples, and the Test, Save, and Set live actions.

    Procedures combine three powerful elements. First, natural language instructions. You write a Procedure in plain language, just like documenting a process for a new teammate. You can paste in your existing SOPs, write from scratch, or let AI draft them for you, then iterate yourself.

    What’s new: Draft Procedures with AI. Share an outline of your process and Fin drafts a complete Procedure using your conversation history, knowledge hub content, and relevant data. If additional context is needed, it prompts you with clarifying questions to make sure the Procedure is thorough and tailored to your use case, significantly reducing setup time. For example: if you’re creating a refund workflow, the system can draft conditional paths for eligibility, approval thresholds, and verification steps based on your historical cases and policies.

    What’s new: Break complex workflows into Sub-procedures. Write a process once and reference it across multiple Procedures by breaking it down into reusable steps, called Sub-procedures. This makes workflows easier to read, faster to build, and simpler to maintain as things change.

    Second, deterministic controls. Natural language is flexible, but some steps need to be exact. You can layer in deterministic controls where precision matters, starting with a fully natural language Procedure and introducing structure gradually where it adds value: conditional steps (branching logic) to handle decision points so Fin’s behavior is consistent and predictable; data connectors so Fin can pull information from your tools or take actions automatically; code snippets for when absolute accuracy is essential; and checkpoints to pause for approval or hand off to a teammate.
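
    To show why some steps are better expressed as exact logic than as natural language, here is a generic sketch of a code step, a conditional step, and a checkpoint. It is not Fin’s API or Procedure syntax; the function names, the 60-day window, and the approval threshold are all assumptions invented for the example.

```python
from datetime import date


def check_dispute_eligibility(transaction_date: date, today: date, window_days: int = 60) -> bool:
    """Code step: an exact date rule that should never be left to interpretation."""
    return 0 <= (today - transaction_date).days <= window_days


def next_step(eligible: bool, amount: float, approval_threshold: float = 500.0) -> str:
    """Conditional step: branch on the result, with a checkpoint for large amounts."""
    if not eligible:
        return "explain_policy"
    if amount > approval_threshold:
        return "checkpoint_human_approval"   # pause for a teammate to approve
    return "open_dispute"


today = date(2025, 3, 1)
eligible = check_dispute_eligibility(date(2025, 2, 10), today)
print(next_step(eligible, amount=120.0))
print(next_step(eligible, amount=900.0))
```

    The conversation around these steps can stay flexible, while the date arithmetic and the threshold comparison give the same answer every time.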

    Figure: A transaction dispute Procedure with IF/ELSE logic, a code step for check_dispute_eligibility, and a Data Connector menu offering Freeze credit card and Get upcoming invoice.

    What’s new: Instruct Fin to read specific content from your knowledge hub. You can set clear rules for Fin to reference a specific policy or article from your knowledge hub in defined situations so Fin always surfaces the right context in a conversation.

    What’s new: Explicit Procedure switching under defined conditions. You can set rules that deterministically trigger a switch to a different Procedure, for example, escalating to a complaints Procedure if specific risk signals are detected mid-conversation.

    What’s new: Internal notes for human handoffs. When Fin hands off to a teammate, it can now include internal notes with relevant context so the person picking up the conversation knows exactly what happened and what needs to happen next.

    Third, fully agentic behavior. Because real conversations rarely follow the happy path, Procedures let Fin reason through what’s happening and adapt—jumping to the right step or switching Procedures entirely if a customer changes their mind or the issue shifts.

    Figure: The Simulations panel running a “Food order damage clear” test, with a simulated user and Fin AI Agent exchanging messages and green checks marking each triggered step.

    What’s new: Automatic Procedure switching. If a customer starts in a billing workflow but then asks about cancelling their subscription, Fin transitions to the relevant Procedure without forcing the customer to restart.

    What’s new: Structured data extraction from uploaded files. Fin can now extract structured data directly from PDFs and images uploaded by customers—like invoices, forms, or receipts—and use that data within the conversation. Customers don’t have to copy and paste or repeat themselves.

    As MONY Group put it:

    “If a customer starts down one path but their issue turns out to be something else entirely, Fin adapts seamlessly – no more getting stuck in loops or forcing customers into the wrong workflow.”

    Figure: The Simulations panel listing scenarios with their status: Damage confirmed (Pass), Refund subscription (Fail), and No subscriptions (Not run yet), alongside Run all, New, and suggested tests.

    The result is a conversation that feels fluid, but always follows your intended rules.

    Making complexity easier to manage is just as important as unlocking new capabilities. Beyond the core updates, we’ve focused on creation, governance, and scale—while keeping ownership with your team.

    What’s new: Improved instruction authoring. We’ve made it easier to write, edit, and structure Procedures, so building and updating them takes less time and requires less effort.

    What’s new: Reporting on when Procedures trigger, resolve, or hand off. You can now track how Procedures are performing directly within the Procedures UI, seeing exactly when they trigger, when they resolve, and when they hand off to a teammate. This visibility helps you spot issues early and improve over time.

    Figure: Customer testimonials from Raylo and MONY Group on the Procedures and Simulations update, citing payment query handling, ~94% CSAT for the Payment Information procedure, and real-time claims handling via API-driven decisions.

    Simulations: Test complex workflows at scale before they reach customers. Simulations let you validate how Procedures will perform before anything goes live, and continuously revalidate as things change. Deploying complex AI can feel uncertain; Simulations remove that uncertainty so you can launch with confidence and iterate safely.

    You can simulate full conversations. For any Procedure, choose a user or customer segment and run a complete, multi-turn simulated conversation. You see every step Fin takes, how it applies your rules, reasons through decisions, and where it passes or fails—giving you the observability to debug and fix issues before they ever reach customers.

    What’s new: Upload images for richer testing. Simulations now support image uploads, so you can test workflows that involve receipts, invoices, or forms—the same inputs your customers actually send.

    What’s new: Clearer visibility into Fin’s reasoning. You can now see exactly how Fin is thinking through each step of a Simulation, making it easier to understand behavior, catch unexpected decisions, and refine Procedures with confidence.

    You can also use AI to create, store, and rerun tests. Writing test coverage manually doesn’t scale. Fin’s AI Assistant generates Simulations directly from your Procedures, suggesting realistic edge cases like partial refund disputes, missing invoice uploads, or no subscription found, so you can expand coverage without expanding overhead. All the Simulations you create are stored in a central library. When a product changes, a policy updates, or a Procedure is edited, hit “run all” to instantly check whether anything has regressed. This applies the same rigor to AI automation that engineering teams bring to software testing.
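
    The “run all” idea maps closely onto a regression suite. The sketch below is a generic illustration, not how Simulations are implemented: the `Scenario` record, the toy `refund_procedure`, and its 30-day rule are assumptions standing in for a real workflow and a real simulated conversation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    customer: dict
    expected: str


def refund_procedure(customer: dict) -> str:
    """Toy stand-in for a workflow under test (rules invented for the example)."""
    if not customer.get("subscription"):
        return "no_subscription_found"
    if customer.get("days_since_purchase", 0) <= 30:
        return "refund_issued"
    return "handoff_to_teammate"


def run_all(procedure: Callable[[dict], str], library: list[Scenario]) -> dict[str, str]:
    """Re-run every stored scenario and report pass or fail by name."""
    return {
        s.name: "pass" if procedure(s.customer) == s.expected else "fail"
        for s in library
    }


library = [
    Scenario("Refund inside window", {"subscription": True, "days_since_purchase": 12}, "refund_issued"),
    Scenario("Refund outside window", {"subscription": True, "days_since_purchase": 45}, "handoff_to_teammate"),
    Scenario("No subscriptions", {"subscription": False}, "no_subscription_found"),
]

print(run_all(refund_procedure, library))
```

    Because the scenarios are stored separately from the workflow, editing the workflow and re-running the library immediately shows which expectations no longer hold.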

    What’s new: AI-suggested Simulations. You can now use AI to generate a full set of Simulations from any Procedure. The AI Assistant suggests realistic variations based on your workflow, so you can build comprehensive test coverage fast.

    Customers are already seeing this in production. “Fin can now handle payment-related queries that were never possible before… The impact on CSAT and overall CX has been pretty shocking – the Payment Information procedure CSAT is sitting at ~94%, and CX score is significantly higher than our average.” – Raylo

    “Procedures have fundamentally changed what we can achieve with Fin. Previously, complex processes like cashback claim investigations could only be handled through a static form on our website… Now, Fin can handle these sophisticated scenarios in real-time within the conversation itself. It checks account information via API calls, makes complex decisions, and guides customers through the entire claims process dynamically.” – MONY Group

    Procedures and Simulations are available now. I’m eager to see how teams use these updates to scale agentic AI, deliver faster resolutions, and raise the bar for customer experience—without sacrificing control, compliance, or quality.


    Inspired by this post on The Intercom Blog.


  • Human-in-the-Loop Mastery: Proven Oversight Tactics That Elevate AI Quality and Trust

    Human-in-the-loop oversight is the fastest and most reliable way I know to elevate AI quality, build user trust, and reduce risk. At HighLevel, my teams treat oversight as a product feature—not an afterthought—because dependable AI experiences come from deliberate design choices across data, models, and people.

    When I say “human-in-the-loop,” I mean a system that blends automation with targeted human judgment at key moments: during data curation, prompt engineering, evaluation, deployment, and post-launch learning. This approach turns “AI workflows” into measurable, repeatable processes and keeps me honest about what’s working, what’s drifting, and where a human safety net must step in.

    Architecturally, I start with a retrieval-first pipeline to ground outputs in trusted knowledge, then wrap it in guardrails. Deterministic preprocessing, careful prompt engineering, and post-processing validators catch obvious failure modes. Confidence thresholds and policy checks route ambiguous or sensitive cases to a human reviewer, while clear, auditable traces show why the system chose automation versus escalation. This balance supports reliability at scale while preserving agility for “agentic AI” patterns when they add value.
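
    The routing rule itself can be very small. In this sketch the `route` function, the 0.8 threshold, and the flag names are assumptions for illustration; the part worth copying is that every decision carries a reason, so the trace explains why a case was automated or escalated.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "automate" or "human_review"
    reason: str   # kept for the audit trail


def route(confidence: float, policy_flags: list[str], threshold: float = 0.8) -> Decision:
    # Policy checks win over confidence: a sensitive case is always reviewed.
    if policy_flags:
        return Decision("human_review", "policy: " + ", ".join(policy_flags))
    if confidence < threshold:
        return Decision("human_review", f"confidence {confidence:.2f} below {threshold:.2f}")
    return Decision("automate", f"confidence {confidence:.2f} meets {threshold:.2f}")


print(route(0.93, []))
print(route(0.93, ["pii_detected"]))
print(route(0.55, []))
```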

    Quality is only real if I can measure it, so I build with eval-driven development from day one. I maintain golden datasets, rubric-based scoring guidelines, and an automated evaluation harness that runs on every change to prompts, models, or data. Pre-production gates protect against regressions, while production telemetry surfaces drift by segment and use case. When it’s time to run experiments, I use A/B tests sized with a minimum detectable effect (MDE) to avoid overfitting to noise.
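
    A pre-production gate over a golden dataset can be sketched in a few lines. This example uses exact-match scoring as a simple stand-in for rubric-based scoring, and the golden set, the two toy “systems”, and the `passes_gate` helper are all assumptions for illustration.

```python
from typing import Callable


def exact_match_score(system: Callable[[str], str], golden_set: list[tuple[str, str]]) -> float:
    """Fraction of golden examples the system answers exactly as expected."""
    hits = sum(system(question) == expected for question, expected in golden_set)
    return hits / len(golden_set)


def passes_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Block the release if the candidate scores below the baseline minus tolerance."""
    return candidate >= baseline - tolerance


golden_set = [
    ("refund window?", "30 days"),
    ("support hours?", "9am-5pm"),
    ("annual discount?", "15%"),
    ("trial length?", "14 days"),
]

# Toy systems: lookup tables standing in for two prompt or model versions.
baseline_answers = {"refund window?": "30 days", "support hours?": "9am-5pm", "annual discount?": "15%"}
candidate_answers = {"refund window?": "30 days", "trial length?": "14 days"}

baseline_score = exact_match_score(lambda q: baseline_answers.get(q, ""), golden_set)
candidate_score = exact_match_score(lambda q: candidate_answers.get(q, ""), golden_set)

print(baseline_score, candidate_score, passes_gate(candidate_score, baseline_score))
```

    Here the candidate fixes one example but breaks two others, so the gate fails even though the change looked like an improvement on the case that motivated it.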

    Operationally, I optimize for outcomes, not output. I track task success rate, time-to-resolution, safety violation rate, hallucination rate, and cost-to-serve, then connect these to outcomes vs output OKRs. The signal I want is simple: are we reliably solving the user’s job-to-be-done with lower effort and higher confidence? If not, I tighten prompts, refine retrieval, or expand human review where it pays off most.

    Risk governance is non-negotiable. I design with privacy-by-design and data governance from the start—role-based access, audit trails, PII redaction, and red-team tests for safety. Clear reviewer playbooks and calibration sessions reduce bias and ensure consistent decisions. These practices aren’t bureaucracy; they’re how I operationalize AI risk management while maintaining velocity.

    Teams make or break this model. I empower product trios to own the full lifecycle—discovery, build, and learning—so feedback loops close quickly. In-product feedback widgets, reviewer queues, and incident management playbooks help us respond in hours, not weeks. Over time, human review becomes a targeted scalpel rather than a blanket requirement as the system learns and improves.

    Economics guide the level of oversight. I treat each workflow like a portfolio: where the value of accuracy is high and ambiguity is common, I route more to humans; where tasks are simple, frequent, and well-bounded, I automate aggressively. The goal isn’t zero humans—it’s optimal humans, deployed precisely where their judgment compounds ROI.

    If you’re getting started, begin with one high-impact workflow, establish your golden set and evaluation rubric, and wire in a simple review queue. Prove the lift, then scale. In the short video above, I walk through the patterns I use to design these loops, measure quality with rigor, and ship AI that teams—and customers—can trust.


    Inspired by this post on Product School.


  • Go From 3 Customer Interviews to a High-Quality Opportunity Solution Tree—In Minutes

    Most product teams—and especially well-run product trios—know they should be interviewing customers. More teams than ever are actually doing it. That’s the good news.

    The bad news? Many teams still struggle with what comes next. Turning raw recordings into a structured opportunity space that truly guides product discovery can feel overwhelming.

    In my experience, interview synthesis is cognitively demanding work. You have to extract the key moments from each conversation, translate those moments into clear opportunities, and then organize those opportunities into a coherent view of your opportunity space. It’s no surprise I hear teams say, "We need to stop interviewing so we can catch up on what we’ve already learned." Too often, they pause—and never start again.

    Recordings pile up. Maybe there are scattered notes. But nothing gets turned into an opportunity solution tree. The team hasn’t synthesized what they’ve learned, so the research isn’t actionable. That’s the gap I want to help close.

    What if you could go from 3 interviews to a draft OST in minutes?

    My AI goals are straightforward: 1) build tools that help you learn discovery and 2) build tools that help you do discovery. The learning tools are coming through on-demand courses. Today, I’m excited to share the first big step on the "do" side.

    I’m excited to see an expanded partnership with Vistaly—the opportunity solution tree tool many of you already use—to bring AI-powered discovery tools directly into their platform.

    Great synthesis happens in two steps: first, you synthesize each interview separately; then you synthesize across interviews. Most AI tools skip the first step and jump straight to cross-interview analysis—exactly how teams lose the nuance and context that make research actionable.

    This approach does both. You upload three interviews for the same product outcome. The AI extracts the key moments and opportunities from each one separately. Then it synthesizes across those interviews and generates a first draft of your opportunity solution tree for you. Three interviews in. A draft OST out.

    Here’s what this is—and what it isn’t. You’ve probably heard criticism of tools that promise "one-click opportunity solution trees." Those tools ask you to describe your market, click a button, and get a tree. The point of an opportunity solution tree is not to have one—it’s to synthesize what you’re learning from real customers so your team can align on the best path forward. A one-click tree built from made-up data is useless.

    Vistaly 2.0 landing page featuring 'Build what matters,' a blue Enroll in Beta button, and a dark-grid opportunity solution tree connecting an Outcome to Opportunity and Solution nodes.
    Turn interviews into insights in minutes with Vistaly. This hero screen invites you to enroll in beta and showcases an opportunity solution tree that maps outcomes to opportunities and actionable solutions.

    This approach is fundamentally different. It starts with your real customer interviews. The AI does the heavy lifting of extracting key moments and opportunities from those conversations and organizing them into a draft opportunity solution tree. But it’s a draft—you review it, refine it, and reorganize it. You bring your judgment and context to the work.

    My vision for AI-aided cross-interview synthesis is simple: AI identifies common opportunities across interviews, suggests a tree structure, and facilitates the team’s review. Historically, it’s been hard to give AI access to an opportunity solution tree in a way that preserves structure and context. The integration with Vistaly solves that problem by building this capability directly into the tool where your tree already lives.

    In my own experiments using Claude, the AI surfaced opportunities I missed, and I caught things it missed. The highest-quality synthesis came from combining both perspectives. Research backs this up: experts working with AI outperform both experts working alone and AI working alone. That’s the model we’re building toward: AI generates the draft, you bring the expertise.

    I have mixed feelings about AI doing discovery work for us because there is real value in doing the synthesis yourself. But I also know that a draft OST you actually refine is better than a perfect process you never get to. This is about raising the floor—helping more teams get to a structured opportunity space, even if they aren’t doing every step manually.

    We’re looking for a small group of alpha partners to help shape this product. To apply, sign up for a free Vistaly account and upload three customer interviews for the same outcome or product space.

    We’ll select alpha partners from the applicants. We want a range of interview styles, experience levels, and product spaces. Selected partners will get access to the AI-powered synthesis tools and will work closely with the team to shape the product. Even if you aren’t selected for the alpha, your application puts you at the front of the line when we enter beta.

    A few things to know as you apply: Your three interviews should be for the same outcome, goal, or product space, so the tool can generate a meaningful OST. You don’t need to be a Vistaly user today—the account is free. You don’t need to be an expert interviewer either; we’re looking for a range of experience levels, though we’re particularly interested in story-based customer interviews.

    This is just the beginning. The vision is a full AI-powered discovery suite inside Vistaly—from interview analysis to complete interview snapshots to opportunity solution trees and beyond. We’ll learn alongside our alpha partners and share what we discover as we go.

    If you’ve been looking to bridge the gap between your customer interviews and your opportunity space, this is your chance to help shape how that works. Apply for the alpha today.


    Inspired by this post on Product Talk.


  • LLMs vs AI Agents: Hard‑Won Lessons Product Teams Need to Nail for Real‑World Impact

    LLMs vs AI Agents: Hard‑Won Lessons Product Teams Need to Nail for Real‑World Impact

    When people ask me about "LLM vs AI Agents: What Product Teams Must Get Right," I start with a simple truth: an LLM is a powerful prediction engine, while an AI agent is a productized workflow that plans, takes actions with tools, remembers, and closes the loop on an outcome. That difference sounds academic until you’re on the hook for reliability, cost, and customer trust.

    In my role, I’ve shipped LLM copilots that delight users and piloted agents that automate complex workflows. The pattern that never fails is this: start assistive, then graduate to autonomy. Copilots accelerate people; agents own outcomes. When we respect that gradient, adoption climbs, incidents fall, and we earn the right to expand scope.

    The first decision point is use-case fit. If the task benefits from human judgment, high-context nuance, or brand voice, I frame it as a copilot with strong guardrails and crisp UX. If the task is well-bounded, tool-heavy, and verifiable, I consider an agent, but only after we can measure end‑to‑end task success with eval-driven development.

    Architecture matters. I reach for a retrieval-first pipeline to keep responses grounded in authoritative data, then add tool use for actions (search, write, schedule, transact) with deterministic scaffolding to prevent thrashing. Good prompt engineering is table stakes, but context window management and a clean memory strategy (short‑term scratchpad, long‑term facts, and policy) separate demos from durable systems.

    Agents amplify both value and risk. I build safety in layers: role and scope definition, tool whitelists, unit limits, human‑in‑the‑loop checkpoints at irreversible steps, and privacy-by-design data governance. We log every decision token-for-token because auditability isn’t optional once agents touch customers, money, or data.
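
    To show how those layers can stack, here is a minimal sketch of a policy check that runs before every tool call. Everything in it is hypothetical: the `ToolPolicy` class, the tool names, and my reading of "unit limits" as a cap on the number of calls per task are assumptions for illustration, not a specific framework's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class GuardrailError(Exception):
    """Raised when a proposed tool call violates the agent's policy."""


@dataclass
class ToolPolicy:
    allowed_tools: set[str]        # allowlist: anything else is rejected
    irreversible_tools: set[str]   # calls that need a human checkpoint
    max_calls: int                 # assumed cap on tool calls per task
    calls_made: int = 0
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, tool: str, approved_by_human: bool = False) -> None:
        """Check a proposed call against each layer and log the decision."""
        decision = "allowed"
        if tool not in self.allowed_tools:
            decision = "blocked: tool not on allowlist"
        elif self.calls_made >= self.max_calls:
            decision = "blocked: call limit reached"
        elif tool in self.irreversible_tools and not approved_by_human:
            decision = "blocked: needs human approval"
        # Every decision is logged, including the blocked ones.
        self.audit_log.append({"tool": tool, "decision": decision})
        if decision != "allowed":
            raise GuardrailError(decision)
        self.calls_made += 1


policy = ToolPolicy(
    allowed_tools={"search", "send_refund"},
    irreversible_tools={"send_refund"},
    max_calls=3,
)
policy.authorize("search")
try:
    policy.authorize("send_refund")  # irreversible, no approval yet
except GuardrailError as err:
    print(err)
policy.authorize("send_refund", approved_by_human=True)
```

    The design choice worth noting is that the blocked call is logged before the exception is raised, so the audit trail records what the agent tried to do, not only what it was allowed to do.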

    Measurement is non‑negotiable. For LLM features, I track time‑to‑first‑token, response latency, groundedness, and user satisfaction. For agents, I add Agent Analytics: task success rate, number of steps per task, tool error rate, loop detection, guardrail triggers, escalation to human, cost per successful task, and containment rate. If we can’t see it, we can’t ship it.
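
    Several of these agent metrics fall out of a simple per-task log. The sketch below assumes a hypothetical `TaskRun` record and defines containment as the share of tasks finished without escalation to a human; your own definitions may differ, so treat the field names and formulas as assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskRun:
    succeeded: bool
    steps: int
    tool_calls: int
    tool_errors: int
    escalated: bool
    cost_usd: float


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate per-task records into agent-level metrics."""
    total = len(runs)
    successes = sum(r.succeeded for r in runs)
    tool_calls = sum(r.tool_calls for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "task_success_rate": successes / total,
        "avg_steps_per_task": sum(r.steps for r in runs) / total,
        "tool_error_rate": sum(r.tool_errors for r in runs) / tool_calls if tool_calls else 0.0,
        "escalation_rate": sum(r.escalated for r in runs) / total,
        "containment_rate": sum(not r.escalated for r in runs) / total,
        # Failed runs still cost money, so total cost is divided by successes only.
        "cost_per_successful_task": total_cost / successes if successes else float("inf"),
    }


# Made-up sample data.
runs = [
    TaskRun(True, 4, 3, 0, False, 0.02),
    TaskRun(True, 6, 5, 1, False, 0.04),
    TaskRun(False, 9, 8, 2, True, 0.06),
    TaskRun(True, 5, 4, 0, False, 0.03),
]
metrics = summarize(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```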

    My delivery playbook mirrors modern software ops. We use feature flags, gated betas, and canary rollouts; we version prompts like code; we set incident management paths for model outages and tool drift; and we rehearse fallbacks so the experience degrades gracefully, not catastrophically. Dull operations build dazzling products.

    On roadmapping, I thin‑slice value. We introduce a minimum viable copilot that handles a single, frequent job-to-be-done with high success. Only after continuous discovery confirms product‑market fit do we grant more autonomy, one capability at a time. Outcomes vs output OKRs keep us honest: if the customer’s job gets done faster, cheaper, and with fewer errors, we scale; if not, we fix fundamentals before adding scope.

    Build vs buy is rarely binary. I tend to buy the undifferentiated heavy lifting—observability, prompt versioning, red‑teaming, and policy enforcement—while building the proprietary workflows, data modeling, and UX that encode our defensible advantage. The litmus test: if it’s part of our unique value proposition, we own it; if not, we integrate the best‑in‑class and move.

    Go‑to‑market must be as rigorous as the tech. We position clearly (assistant vs agent), price to value with transparent consumption SaaS pricing, and communicate risk posture in plain language. Customers don’t buy models; they buy confidence that a job gets done reliably within their constraints.

    Common failure modes repeat: shipping autonomy before instrumentation, treating prompts as magic instead of software, skipping data governance, and ignoring the human experience. The antidote is disciplined AI Strategy rooted in empowered product teams, tight feedback loops, and relentless evaluation.

    If you take nothing else: choose the right paradigm for the job (copilot first, agent when proven), ground with a retrieval-first pipeline, instrument with eval-driven development and Agent Analytics, and operationalize like a mission‑critical system. Do that, and you’ll turn LLM capabilities into durable product outcomes.


    Inspired by this post on Product School.


  • Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

    Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

    In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

    Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.

    When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

    The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
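
    The control flow is easier to see in code than in prose. This is a minimal sketch with stub functions standing in for the planner, executor, and verifier agents; in a real system each would call a model or a tool. The function names, the retry count, and the deliberately flaky executor are all illustrative assumptions.

```python
from __future__ import annotations

from typing import Callable


def orchestrate(
    goal: str,
    plan: Callable[[str], list[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_retries: int = 2,
) -> list[str]:
    """Run each planned step, retrying until the verifier accepts the output."""
    results = []
    for step in plan(goal):
        for _attempt in range(max_retries + 1):
            output = execute(step)
            if verify(step, output):
                results.append(output)
                break
        else:
            # No attempt passed verification: stop rather than ship a bad result.
            raise RuntimeError(f"step failed verification after retries: {step}")
    return results


# Stub agents for illustration only.
attempts: dict[str, int] = {}


def stub_plan(goal: str) -> list[str]:
    return goal.split(" then ")


def flaky_execute(step: str) -> str:
    attempts[step] = attempts.get(step, 0) + 1
    # Fail the first attempt at "draft reply" to exercise the retry path.
    if step == "draft reply" and attempts[step] == 1:
        return ""
    return f"done: {step}"


def stub_verify(step: str, output: str) -> bool:
    return output.startswith("done")


results = orchestrate("look up order then draft reply", stub_plan, flaky_execute, stub_verify)
print(results)
```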

    To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

    Governance is equally critical. I design privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews—habits borrowed from SRE culture that map well to AI systems.

    On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

    If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

    The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.


    Inspired by this post on Product School.


  • Context Engineering Playbook: 5 Proven Ways to Slash Context Rot and Scale Smarter AI

    Context Engineering Playbook: 5 Proven Ways to Slash Context Rot and Scale Smarter AI

    I've been getting a lot of questions about why I'm diving so deep into Claude Code, so I want to take a step back and provide some context.

    Last March, when I started building my first AI product—the Interview Coach—I felt like I had to figure it all out on my own. I had never built an AI product before, and I didn't have a team I could lean on. It was equal parts energizing and intimidating.

    I had a blast digging in, experimenting, and learning what I needed to learn to ship that first AI product. But I also started to wonder, "How are product teams going to learn this stuff?"

    As an industry, we are being asked to leverage a new technology that is foreign to us. We are all experimenting and learning what's just now possible. It's moving so fast, it's exhausting just following the news, let alone trying to learn and develop new skills.

    My mission has always been to help teams make better product decisions. That still drives me today.

    After releasing the Interview Coach, I asked myself two questions: "How am I going to rapidly develop my skill set?" and "How can I help others do the same?" I landed on a three-part plan: First, I'm going to collect and share stories about how other teams are learning and building AI products—that's why I launched Just Now Possible. Second, I'm going to push the boundaries on how I can use AI in my day-to-day life, and I'm going to write about it. Third, I'm going to keep building AI products—and I'm going to write about that, too.

    The Claude Code series was born out of number two. It’s had an interesting side effect: it’s also helping me build better AI products.

    The more I push the boundaries of what's possible with Claude Code, the more I understand how to build more robust AI products. That’s reinforced my belief that product teams need to get hands-on with this stuff in their day-to-day lives. It’s how we’re going to develop the skillsets we need to build tomorrow’s products.

    In my context rot article—where we learned how to manage the context window in Claude Code—I showed just how much day-to-day practice compounds. Today, I want to show how learning about context window management in our day-to-day lives directly maps to managing the context window in the AI products we might build. My hope is to make it crystal clear how experience in one area develops expertise in the other. Let’s dive in.

    Infographic titled What is Context Engineering? visualizing a context window with arrows and five strategies: compact prompts, external memory, curating turns, repeating info, and sub-agents.
    Discover how product teams engineer context in generative AI: compact prompts, curated turns, external memory, repetition, and sub-agents, all feeding a shared context window to deliver clearer, faster outcomes.

    A quick refresher on context window management. In the context rot article, we learned what the context window is and what goes into it; how to offload conversational context to the file system; how the /compact and /clear tools work; why to repeat critical information as the context window fills up, to overcome tokens "lost in the middle" or at the beginning of the input; and how to use agents to get access to more context windows.

    It turns out these exact same skills are being used by developers to manage the context window in production products. If you haven't read the context rot article, start there: "Context Rot: Why AI Gets Worse the Longer You Talk (And How to Fix It)."

    What is Context Engineering? Context engineering is the work that we do to manage the context window in the AI products and services that we build. It's how we give the large language model the context it needs to do the job well. It's also how we manage and mitigate context rot in our products and services, so that we can get the highest performance from the underlying model.

    Today, we are going to look at five different strategies that product teams are currently using in their context engineering efforts. You are going to see that each of these strategies ties back to a strategy you might already be using in your day-to-day AI usage (especially if you followed the advice in the context rot article).

    Here's how product teams are putting this into practice right now: designing compact system prompts by breaking big tasks into smaller tasks; building external memory/state structures to keep the context window clean; curating what goes into each turn; repeating critical information as context grows; and using sub-agents to grow the context window.
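
    To make a few of these strategies tangible, here is a rough sketch of assembling one turn under a token budget: recent turns are kept, older turns are offloaded to external memory, and critical facts are repeated at the end. The word-count tokenizer, the function names, and the budget are stand-in assumptions; a real product would use the model's own tokenizer and a proper memory store.

```python
from __future__ import annotations


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(
    system_prompt: str,
    critical_facts: list[str],
    turns: list[str],
    budget: int,
) -> tuple[list[str], list[str]]:
    """Return (messages for this turn, older turns offloaded to external memory)."""
    reminder = "Reminder: " + " ".join(critical_facts)
    used = count_tokens(system_prompt) + count_tokens(reminder)
    kept: list[str] = []
    # Walk backwards so the most recent turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    offloaded = turns[: len(turns) - len(kept)]
    # Critical facts are repeated last, after the curated turns.
    return [system_prompt, *kept, reminder], offloaded


turns = [
    "user: my export failed yesterday",
    "assistant: which file type did you export",
    "user: a CSV file",
]
messages, offloaded = build_context(
    system_prompt="You are a support agent.",
    critical_facts=["Customer is on the Pro plan."],
    turns=turns,
    budget=25,
)
print(messages)
print(offloaded)
```

    With this budget the oldest turn no longer fits, so it lands in `offloaded` instead of the prompt, which is the same trade you make when you write notes to a file and clear the conversation.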

    I'll connect each tactic back to patterns you're likely already using in your daily AI workflows, especially if you followed the advice in the context rot article. Along the way, I’ll share practical guardrails and instrumentation ideas so you can track quality with eval-driven development, reduce context rot, and scale performance predictably.

    Why this matters for product trios: these strategies clarify the handoffs between prompt engineering, external memory design, and orchestration, which strengthens collaboration across PM, design, and engineering. Whether you’re exploring gen AI prototypes, hardening a retrieval-first pipeline, or evolving toward agentic AI, context engineering is the backbone of reliable, high-performing experiences.

    If you build or lead “LLMs for product managers” initiatives, consider this your field guide. In upcoming posts, I’ll break down each strategy with concrete examples and templates you can adapt to your stack, so your team can move from experiments to durable, scalable AI workflows with confidence.


    Inspired by this post on Product Talk.


  • Reinventing Product Management Workflow: The AI Upgrade I Use to Ship Faster, Smarter

    Reinventing Product Management Workflow: The AI Upgrade I Use to Ship Faster, Smarter

    The most valuable upgrade I’ve made to my product management workflow isn’t a new framework or a shiny dashboard—it’s an AI-first operating model that compresses discovery-to-delivery cycles while increasing confidence in every decision. I built this approach to reduce context switching, remove toil, and keep the team relentlessly focused on outcomes over output. The result is a faster, clearer, and more reliable path from insight to shipped value.

    Here’s how I run an AI-powered product workflow end to end: continuous discovery, opportunity sizing, solution shaping, planning, execution, and iteration—each step instrumented with automation, retrieval, and evaluation so we learn faster without compromising rigor.

    Intake and triage start with a retrieval-first pipeline that unifies customer feedback, support tickets, sales notes, research transcripts, and usage analytics. I use embeddings to cluster themes, de-duplicate signals, and surface the most representative examples. This gives me an instant, always-fresh view of customer jobs, pains, and opportunities without manually combing through noise.
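
    The clustering step can be sketched without any particular vendor. The example below uses tiny hand-written vectors in place of real embeddings and a greedy cosine-similarity grouping; the feedback strings, the vectors, and the 0.9 threshold are all made up for illustration, and a production pipeline would get vectors from an embedding model and likely use a sturdier clustering method.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(items: dict[str, list[float]], threshold: float = 0.9) -> list[list[str]]:
    """Greedy grouping: each item joins the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for text, vector in items.items():
        for group in clusters:
            if cosine(vector, items[group[0]]) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])
    return clusters


# Toy vectors standing in for real embeddings.
feedback = {
    "CSV export fails": [0.9, 0.1, 0.0],
    "Export to CSV is broken": [0.85, 0.15, 0.05],
    "Need SSO login": [0.0, 0.2, 0.95],
    "Please add single sign-on": [0.05, 0.1, 0.9],
}
themes = cluster(feedback)
for group in themes:
    # The first item acts as the representative example; the rest are near-duplicates.
    print(group[0], "| signals:", len(group))
```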

    For discovery, I rely on “LLMs for product managers” to accelerate the hard parts without replacing judgment. I generate interview guides, summarize transcripts, extract entities, and tag moments of friction. Prompt engineering and context window management ensure the model sees the right evidence at the right time. I keep all sensitive data governed by privacy-by-design and data governance controls.

    Opportunity sizing is where I connect insights to business impact. I map problems to a driver tree, quantify potential lift, and align to outcomes vs output OKRs. When relevant, I apply the Kano Model to balance performance, basic, and excitement attributes. To maintain rigor, I use eval-driven development on my prompts and heuristics so prioritization is repeatable, not anecdotal.

    Solution shaping is a collaborative exercise with product trios. I draft problem narratives and PRDs, generate acceptance criteria, and create first-pass UX flows. For speed, I use gen AI for product prototyping to explore alternatives quickly, then gate final choices through usability feedback and feasibility checks. Where uncertainty is high, I define a minimum detectable effect (MDE) and design A/B testing plans upfront.
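
    Defining the MDE upfront matters because it determines how much traffic the test needs. As a rough sketch, the standard normal-approximation formula for comparing two conversion rates can be computed with the standard library. The baseline rate, the two-sided alpha of 0.05, and the power of 0.8 are conventional defaults I am assuming here, not values from any specific experiment.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`.

    Uses the normal approximation for a two-sided test on two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical scenario: 10% baseline conversion, detect a 2-point absolute lift.
n_two_points = sample_size_per_arm(baseline=0.10, mde=0.02)
# Halving the MDE roughly quadruples the required sample.
n_one_point = sample_size_per_arm(baseline=0.10, mde=0.01)
print(n_two_points, n_one_point)
```

    If the required sample is larger than the traffic you can realistically send to the experiment, that is a signal to test a bolder change or pick a more sensitive metric before you build.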

    Planning ties strategy to execution through product roadmapping and sprint planning. I break work into sequenced bets, enable feature flags for controlled exposure, and wire quality signals into CI/CD. DORA metrics—like deployment frequency and change failure rate—help me keep the system honest. Observability ensures we see the “why” behind behavior, not just the “what.”
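
    The two DORA metrics named above are simple to compute once deployments are logged. This sketch assumes a hypothetical `Deployment` record with a flag for whether the change caused a failure in production; how you decide that flag is the hard part and is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    caused_failure: bool  # assumed to be set during incident review


def dora_summary(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Deployment frequency (per week) and change failure rate over a period."""
    total = len(deployments)
    failures = sum(d.caused_failure for d in deployments)
    return {
        "deployments_per_week": total * 7 / period_days,
        "change_failure_rate": failures / total if total else 0.0,
    }


# Made-up log: 8 deployments over 28 days, 2 of which caused a failure.
log = [
    Deployment(date(2025, 1, 2), False),
    Deployment(date(2025, 1, 6), False),
    Deployment(date(2025, 1, 9), True),
    Deployment(date(2025, 1, 13), False),
    Deployment(date(2025, 1, 16), False),
    Deployment(date(2025, 1, 20), True),
    Deployment(date(2025, 1, 23), False),
    Deployment(date(2025, 1, 27), False),
]
summary = dora_summary(log, period_days=28)
print(summary)
```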

    Execution is instrumented with in-app guides, Intercom messaging, and Pendo to shape onboarding and activation. I connect Amplitude analytics to measure habit formation, retention analysis, and feature adoption. When experiments run, I monitor leading indicators in near real time while protecting against peeking and p-hacking. The point isn’t to prove we’re right; it’s to learn fast enough to get right.

    Iteration closes the loop. I use a unified analytics platform to compare expected vs actual outcomes, harvest qualitative feedback, and push new evidence back into discovery. The system improves with each cycle because the retrieval-first pipeline and eval harness both get smarter as data grows.

    Governance is non-negotiable. AI risk management, cybersecurity, and regulatory compliance sit alongside model evaluations to prevent drift, leakage, or bias. I document decisions, model versions, and test artifacts so we can audit how we got to a call—especially when trade-offs are nuanced.

    If you’re standing up this AI workflow from scratch, I recommend a 30/60/90 rollout. In the first 30 days, audit your data sources and build a retrieval-first pipeline. In days 31–60, pilot two high-leverage workflows—continuous discovery and PRD drafting—backed by eval-driven development. By days 61–90, scale to prioritization and experiment design, then thread the outputs into your planning and CI/CD rhythms.

    Common pitfalls I watch for: over-automation that blurs context, lack of evaluation frameworks, ungoverned data that undermines trust, and vanity metrics that celebrate activity over outcomes. The antidote is simple but disciplined—clear decision criteria, measurable hypotheses, and automated evaluations that run as guardrails, not bottlenecks.

    This AI upgrade doesn’t replace the craft of product management; it amplifies it. By combining judgment, clear strategy, and reliable automation, we ship value faster, reduce risk, and make better calls under uncertainty. The payoff is durable: compounding learning velocity and a team that spends more time solving the right problems—and less time wrestling the process.


    Inspired by this post on Product School.


  • From Chaos to Clarity with Claude Code: My Hands-On Playbook for Product Leaders

    From Chaos to Clarity with Claude Code: My Hands-On Playbook for Product Leaders

    I’ve been pushing hard to operationalize AI for real product work, and this episode zeroes in on the moment Claude Code stops feeling like a demo and starts behaving like a dependable teammate. If you’ve ever wondered how to go from clever prompts in the browser to durable, repeatable workflows on your machine, this walkthrough is for you.

    Listen on: Spotify | Apple Podcasts.

    My first honest reaction to installing and configuring the desktop agent was the all-too-relatable “this tool thinks everything is a code repo” reality. That framing helped me reset expectations fast: instead of treating it like a magical universal assistant, I began designing guardrails, context, and repeatable routines—exactly how I’d onboard a new team member.

    The shift from Claude-in-the-browser to Claude Code on my machine was the unlock. Locally, it can finally work with my files, folders, and workflows. That meant I could ground it in real artifacts—project docs, meeting notes, product specs, and historical decisions—so responses weren’t just plausible; they were contextual and verifiable.

    On setup, I now treat /init and CLAUDE.md files as my product requirements. I define roles, boundaries, and canonical sources up front, then run in a deliberate “walled garden.” The “treat it like an intern” model works beautifully: scope access intentionally, expand privileges as trust grows, and keep a tight audit trail of what it can touch and why.

    Surprisingly, task management became my ideal on-ramp. It’s easy to validate, the feedback loops are tight, and the ROI is immediate. I export calendar windows rather than granting full calendar access, then let the agent map priorities into Trello, reconcile time blocks, and surface trade-offs. Fast wins build confidence—mine and the agent’s.

    Model switching matters more than I expected. When speed is king and “good enough” will do, Haiku keeps the loop snappy. When stakes are higher—complex synthesis, nuanced product strategy, or gnarly ambiguity—I step up to Claude Opus 4.5. Being intentional about when to optimize for latency versus depth is a quiet superpower.

    Web tasks can still spiral. When that happens, I pause its autonomy, toggle to fewer steps, and ask, “What are you doing?” Paired with Claude’s Web fetch tool, this makes the agent explain its chain-of-thought planning without exposing hidden reasoning, so I can spot brittle assumptions, prune distractions, and re-ground the task.

    Content retrieval has become a killer workflow. I point the agent at my archives—blog posts, book drafts, transcripts, notes—and ask, “Where have I talked about this before?” It assembles a map of prior art, connects themes I’d forgotten, and prevents me from reinventing work. Over time, this evolves into a Zettelkasten-style research system that upgrades rigor and accelerates synthesis.

    I’ve also turned Claude Code into a publishing engine. From a single transcript, it drafts titles, descriptions, show notes, and chapters, then routes artifacts to Ghost for formatting. Before anything ships, I run fact-checking workflows that validate claims against transcripts and research sources. The output improves, but more importantly, the scaffolding makes quality repeatable.

    Reusable workflows compound. I rely on slash commands to trigger common jobs, break down larger efforts with sub-agents, and wire in hooks and plugins where external systems are needed. This is agentic AI at its most practical: fewer hero prompts, more reliable processes.

    Audience analytics and content prioritization are helpful with caveats. I let the agent cluster themes and flag gaps, then I pressure-test its suggestions against first-party data and strategic goals. As with any model-driven insight, triangulation beats blind faith.

    Two metaphors guide my day-to-day. First, Claude Code is like a dog—sometimes it returns with the stick, sometimes it gets lost in the woods. Second, the “intern” framing keeps me honest: don’t hand it the whole company on day one. With that mindset, my output jumped—more volume without sacrificing quality—because the workflow scaffolding got better.

    In this episode, I cover what Claude Code is and why it’s useful even if you’re not an engineer, the real difference between the browser experience and running locally, how to shape behavior with /init and CLAUDE.md files, why task management is the perfect proving ground, when to export calendar windows versus connecting directly, and when model-switching makes sense: Haiku for speed, Opus for depth.

    I also dig into debugging web tasks by asking “What are you doing?”, content retrieval workflows across personal archives, building reusable slash-command systems with sub-agents, hooks, and plugins, practical publishing stacks from transcripts, fact-checking against transcripts and research sources, and using analytics to prioritize content—with a healthy respect for uncertainty.

    If you’ve been trying to make Claude Code feel less like “throwing a stick into the woods,” this is the candid, tactical tour I wish I’d had on day one. Drop your questions and experiments below—I’m eager to compare notes and refine the playbook together.


    Inspired by this post on Product Talk.


  • Build CX Scores You Can Defend: My 5-step playbook for transparent, trustworthy AI metrics

    Build CX Scores You Can Defend: My 5-step playbook for transparent, trustworthy AI metrics

    “You don’t have to trust the algorithm; you can see exactly why a conversation earned the score it did.”

    We recently shared how we redesigned CX Score to deliver deeper, more actionable insights across every conversation. The most common follow-up from support leaders was simpler and incredibly important: “Can I trust it?” It’s the right question—and it’s the one I use as my own bar for whether a metric is ready for the C‑suite.

    CS teams are the subject matter experts on customer experience. They understand the nuance of what customers feel, the context behind every interaction, and the difference between a technically resolved issue and a genuinely satisfied customer. I’ve learned, conversation by conversation, that any metric we ship has to capture that nuance at scale—or it doesn’t deserve to be used.

    We built CX Score to give support teams a complete view of how their customers feel across every conversation. It surfaces what’s working, what’s not, and why—so leaders can communicate impact clearly and drive change across support, product, and the wider business.

    Interface card displaying 'CX Score: 2' summarizing a case where repeated CSV export attempts failed, frustrating the customer; the AI agent explains the issue and requests more details; rounded gradient border.
    A CX Score in action: repeated CSV export failures trigger a low score and customer frustration, while the AI agent clarifies next steps and gathers details—turning raw signals into actionable support insights.

    Here’s exactly how I approached building a trustworthy metric that support leaders can inspect, explain, and defend.

    1) It’s grounded in how support teams define quality. I started with how experienced support professionals actually evaluate conversations—collecting real examples of strong, mixed, and poor interactions across industries, identifying the specific factors that shape overall experience, and writing plain-English rules for each. The result: CX Score applies the same criteria a trained support professional would use, not generic LLM assumptions.

    2) It’s aligned with human judgment. We created a dataset of thousands of real customer conversations spanning multiple industries, languages, channels, and agent types. Each was manually reviewed by experienced support professionals—with two reviewers per conversation where possible and disagreement resolution to create stable consensus labels. The result: CX Score is trained and tested to behave like an expert reviewer, not a language model making broad guesses.
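
    The double-review step can be pictured as a small reconciliation pass. The sketch below is my own illustration, not Intercom's pipeline: it assumes labels keyed by a hypothetical conversation ID, accepts a label when both reviewers agree or when only one review exists, and queues disagreements for resolution.

```python
from __future__ import annotations


def reconcile(
    reviewer_a: dict[str, str],
    reviewer_b: dict[str, str],
) -> tuple[dict[str, str], list[str]]:
    """Split labels into agreed consensus labels and disputed conversation IDs."""
    consensus: dict[str, str] = {}
    disputed: list[str] = []
    for conv_id, label in reviewer_a.items():
        second = reviewer_b.get(conv_id)
        if second is None or second == label:
            # Single review, or both reviewers agree.
            consensus[conv_id] = label
        else:
            # Disagreement: hold for a resolution step instead of picking a side.
            disputed.append(conv_id)
    return consensus, disputed


# Made-up labels for four conversations; the second reviewer skipped c4.
labels_a = {"c1": "positive", "c2": "negative", "c3": "neutral", "c4": "negative"}
labels_b = {"c1": "positive", "c2": "neutral", "c3": "neutral"}
consensus, disputed = reconcile(labels_a, labels_b)
print(consensus)
print(disputed)
```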

    Analytics dashboard visualizing a CX Score with KPI cards and a Sankey performance funnel linking support channels to AI involvement, resolutions, and positive, neutral, or negative outcomes.
    A modern CX analytics view shows how conversations flow from chat, email, and mobile into AI assistance, then to resolutions and sentiment outcomes—turning messy support data into a single, defensible CX Score.

    3) It’s engineered by AI specialists. CX Score isn’t a prompt attached to an LLM. It’s a production system built by Intercom’s AI Group: 37 ML scientists and 350 engineers whose full-time focus is AI for customer service. The system includes specialized handling for long transcripts, model configuration tailored for support language and subtle sentiment, prompt engineering designed to default to neutral when evidence is weak, and a multi-stage evaluation pipeline that checks for precision, consistency, and reliability. The result: A metric built by a team that understands LLM behavior in production support environments, where accuracy and consistency matter most.

    4) It’s validated statistically, not qualitatively. Trust requires measurement, not vibes. We tested CX Score against standard ML metrics: precision (when the model flags a negative experience, how often do humans agree?), recall (how many human-identified issues does it catch?), and F1 score (the harmonic mean of the two). We set an explicit bar: F1 above 0.8, representing high agreement with human judgment. We reran these evaluations on every revision, checking for regressions or biases, and I focused especially on negative experiences, because a false negative (a bad experience the model fails to flag) hides a real problem. The result: CX Score meets a measurable standard before it ships. It is a statistical requirement, not a gut check.
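
    To make the three metrics concrete, here is a small Python sketch that treats "negative" as the class of interest and checks the result against the 0.8 bar. The label strings, function names, and sample data are made up for illustration; only the metric definitions and the F1 threshold come from the description above.

```python
from dataclasses import dataclass

F1_BAR = 0.8  # the explicit bar described above


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def negative_class_metrics(human_labels, model_labels, target="negative"):
    """Precision, recall, and F1 for one class, with human labels as truth."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, model_labels))
    tp = sum(1 for h, m in pairs if h == target and m == target)
    fp = sum(1 for h, m in pairs if h != target and m == target)
    fn = sum(1 for h, m in pairs if h == target and m != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


def meets_bar(metrics, bar=F1_BAR):
    """True only when F1 is strictly above the bar."""
    return metrics.f1 > bar


# Illustrative data, not real evaluation results.
human = ["negative", "negative", "negative", "neutral", "positive", "negative"]
model = ["negative", "negative", "neutral", "neutral", "negative", "negative"]
result = negative_class_metrics(human, model)
print(result, meets_bar(result))
```

    In this toy sample the model catches three of four human-flagged negatives and raises one false alarm, so precision, recall, and F1 all come out at 0.75, which would fail the bar.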

    5) It was battle-tested with real customers. Lab accuracy isn’t enough. Customer environments are messy: varied ticket types, mixed languages, unpredictable edge cases. Before release, we ran a multi-phase field test: shadow-scoring conversations with both old and new models, validating sensible behavior across agent types and conversation lengths, then rolling out to a controlled customer group who confirmed the scores felt right, reasons were clear, and insights were actionable. The result: CX Score shipped because real teams told us it made sense in practice, not merely because it passed internal tests.
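
    One way to picture the shadow-scoring phase is a comparison of old and new scores on the same conversations, broken down by agent type. The Python sketch below assumes a hypothetical record shape (agent_type, old_score, new_score) and uses exact-match agreement as the comparison. Neither is confirmed as the method actually used.

```python
from collections import defaultdict


def shadow_agreement(records):
    """Share of conversations where old and new models gave the same score,
    grouped by agent type.

    Each record is a dict with the hypothetical keys
    'agent_type', 'old_score', and 'new_score'.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for record in records:
        agent_type = record["agent_type"]
        totals[agent_type] += 1
        if record["old_score"] == record["new_score"]:
            matches[agent_type] += 1
    return {agent: matches[agent] / totals[agent] for agent in totals}


# Illustrative records, not real field-test data.
sample = [
    {"agent_type": "ai", "old_score": 2, "new_score": 2},
    {"agent_type": "ai", "old_score": 4, "new_score": 3},
    {"agent_type": "human", "old_score": 5, "new_score": 5},
]
print(shadow_agreement(sample))
```

    A low agreement rate for one agent type would not by itself say which model is wrong, but it would show where to pull conversations for manual review.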

    Donut chart of CX categories beside a chat UI showing a CX Score of 3 with a 'Negative policy feedback' tag, highlighting policy feedback, answer quality, customer effort, and emotion.
    From conversation to clarity: this visual maps the drivers behind a CX Score. Explore how policy feedback, answer quality, and effort combine to produce defensible insights support leaders can act on.

    The importance of explainability. One of the most critical choices I made was ensuring CX Score isn’t a black box. Every score comes with clear reasons, concrete excerpts, and a short explanation of what influenced the rating. This turns the metric into something you can inspect, audit, and explain to executives. You don’t have to trust the algorithm. You can see exactly why a conversation earned the score it did.
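
    To show what an inspectable score could look like as data, here is a minimal Python sketch of a record that carries the score together with its reasons, excerpts, and explanation, plus a simple audit check that every excerpt really appears in the transcript. All field and function names are hypothetical, and this is not CX Score's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredConversation:
    """Hypothetical shape for an explainable score (not the real schema)."""
    conversation_id: str
    score: int
    reasons: tuple[str, ...]
    excerpts: tuple[str, ...]
    explanation: str


def unsupported_excerpts(scored: ScoredConversation, transcript: str) -> list[str]:
    """Return excerpts that do not appear verbatim in the transcript.

    An empty list means every quoted excerpt can be traced to the source.
    """
    return [e for e in scored.excerpts if e not in transcript]


# Illustrative conversation, loosely modeled on the CSV export example.
transcript = (
    "Customer: The CSV export failed again. This is the third time.\n"
    "Agent: Sorry about that. Which report were you exporting?"
)
scored = ScoredConversation(
    conversation_id="demo-1",
    score=2,
    reasons=("Repeated failure", "Customer frustration"),
    excerpts=("The CSV export failed again.", "This is the third time."),
    explanation="Repeated export failures left the issue unresolved.",
)
print(unsupported_excerpts(scored, transcript))
```

    Keeping excerpts verbatim makes this kind of check trivial, which is part of what makes a score auditable instead of merely plausible.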

    A metric that evolves with your business. Customer expectations shift. Products change. AI improves. A trustworthy metric can’t be static. CX Score evolves with the same commitments that shaped its redesign: Evaluate the real signals that shape customer experience, keep the logic simple and interpretable, and ensure leaders can make clear decisions from it. It’s built to be a durable source of truth across every conversation.

    The takeaway. In a world where products look the same and AI can generate any interaction, customer experience is one of the few differentiators that actually matters. Support leaders have built that expertise conversation by conversation. What they’ve lacked is a measurement system that could validate it at scale—one that’s reliable enough to report to the C-suite, explainable enough to defend in strategy meetings, and rigorous enough to drive real decisions. That’s what CX Score is designed to be: A metric that reflects the reality support leaders see every day, backed by the technical rigor to make it credible everywhere else.

    Want to see CX Score in your workspace? Ask your admin to enable it for your team, and start using explainable AI insights to improve customer experience and coach with confidence.


    Inspired by this post on The Intercom Blog.

