Tag: AI readiness

  • AI Operating Model Playbook: Why 80% Stall—and How the Top 1% Accelerate with Discipline

    AI Operating Model Playbook: Why 80% Stall—and How the Top 1% Accelerate with Discipline

    I keep meeting talented product teams who can demo impressive proof-of-concepts but can’t get durable business impact into production. The difference isn’t raw ingenuity—it’s the operating model. As I’ve scaled AI initiatives in my own organization, one sentence has proven painfully accurate: "What the top 1% of AI-native product teams are doing differently – and why most won't catch up without rebuilding the operating model."

    When I say “AI operating model,” I mean the end-to-end way we set strategy, discover value, build, ship, govern, and learn—specifically adapted for AI systems. If we try to bolt AI onto a classic software cadence, we stall. If we rebuild our operating model around AI’s unique constraints and compounding advantages, we accelerate.

    It starts with strategy. I anchor our portfolio to explicit outcomes, not features—tying every initiative to measurable customer and commercial impact. Driver trees and an opportunity solution tree make tradeoffs transparent, while outcomes vs output OKRs prevent us from celebrating activity over results. This is how empowered product teams earn autonomy without losing alignment on the AI Strategy.

    Next is discovery. Continuous discovery reframes “can we ship a model?” into “can we change a behavior or decision with acceptable risk?” I pair customer interviews with in-product telemetry and journey mapping to qualify moments of high value and high frequency. The litmus test: can we describe the target workflow in plain language and simulate success before training models? If not, we’re not ready.

    Data foundations come third. A retrieval-first pipeline is now my default, not an afterthought. We invest in data governance, privacy-by-design, and observability so we can explain where answers come from, prove consent, and debug drift. Without trustworthy data and clear lineage, every downstream AI promise is fragile—and your AI readiness is mostly theater.

    Then I insist on eval-driven development. Before we optimize prompts or tune models, we define offline and online evals that represent the real task, including safety and “gotcha” cases. We treat prompt engineering, context window management, and agentic AI patterns as hypotheses that must beat a baseline under repeatable tests. This moves debate from opinions to evidence.

    Shipping is where most teams quietly stall. We integrate AI into our CI/CD with feature flags, shadow modes, and progressive rollouts, building MLOps into the same platform that runs our services. I watch DORA metrics to keep delivery velocity healthy, but I also watch AI-specific signals—input distribution shifts, response variance, and time-to-mitigation—so we catch regressions before customers do. Platform scalability matters more when inference costs and latency can spike overnight.

    Governance isn’t a gate at the end; it’s a runway from the start. We operationalize AI risk management with tiered reviews, model and data cards, and clear escalation paths. The goal is not to slow down, but to reduce surprise—so product managers, engineers, and legal share the same playbook for safety, fairness, and regulatory compliance.

    Value capture closes the loop. We connect product metrics to commercial levers like Net Recurring Revenue (NRR) and retention analysis, then shape packaging so customers pay for outcomes, not raw compute. This is where product-led growth meets sales-led growth: we demonstrate value in-product, then arm go-to-market teams with unambiguous proof.

    So why are 80% of teams stuck? Three patterns recur: technology FOMO masquerading as strategy, fragmented data that can’t support high-quality retrieval, and a lack of evals that forces decisions by vibes. Add ad hoc governance and you get pilots that impress in slides but wither under real-world variance.

    How do the top 1% think differently? They rebuild the operating model first. They position discovery around workflows, not models. They invest in retrieval-first architectures early. They standardize evals. They ship with guardrails. And they treat “learning per week” as a sacred metric—because compounding insight beats sporadic heroics.

    If you need a 90-day plan, here’s the sequence I use. Week 1–2: run a content audit of data sources and map the top five repeatable workflows ripe for AI leverage. Week 3–4: define success metrics and offline evals for one beachhead use case. Week 5–8: build the retrieval pipeline, implement prompt baselines, and instrument observability. Week 9–12: ship behind feature flags, run A/B testing with safety thresholds, and iterate on failure cases. By the end, you’ll have a reusable blueprint—not just a demo.

    Team design matters. I staff product trios (PM, design, tech lead) with forward deployed engineers or solutions engineering partners who sit with customers. That proximity reduces spec ambiguity and accelerates learning. It also sharpens our product roadmapping and sprint planning because we plan against outcomes, not outputs.

    The hardest part is emotional, not technical: letting go of familiar software rituals that don’t serve AI. Once we accept that AI demands a different operating rhythm, progress feels lighter. The top 1% don’t have secret models; they have disciplined systems. Rebuild yours, and the compounding benefits will outpace any single model upgrade.


    Inspired by this post on Product School.


    Book a consult png image
  • Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale

    When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

    If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

    How does it handle your real-world setup?

    Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

    When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:

    • Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
    • Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
    • Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
    • Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
    • Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
    • Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.

    This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.

    What does it feel like to interact with the Agent?

    Two AI Agents can post the same quantitative scores—resolution rates, containment rate, and more—and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

    Here’s what I look for to ensure the AI Agent is enjoyable to interact with:

    • Is the tone natural and on-brand, or does it feel robotic and generic?
    • Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
    • When it doesn’t know the answer, does it handle that gracefully?
    • When it hands off to a human, is that transition seamless, or does the customer feel abandoned?

    As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

    That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

    I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

    Can you keep improving it after launch?

    This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and ensures you can continuously improve the customer experience over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.

    The feedback loop

    Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.

    The speed of iteration

    When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

    The vendor partnership

    The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:

    • How does customer feedback influence the product roadmap, and can they show you examples?
    • If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
    • What kind of support will you get post-launch?
    • Are they shaping where AI customer experience is going, or reacting to what others are building?

    How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

    What a good POC proves

    If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

    There’s a question that runs underneath every AI Agent evaluation: what can it do?

    Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

    That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

    The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

    The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

    “Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

    Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

    Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

    Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, variations that have never been written down because they’ve always lived in someone’s head.

    Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

    Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

    Most companies have the first two, some have the third, fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

    We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

    One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

    In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding—walking users through processes step by step, rather than executing changes on their behalf. It worked, it resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

    Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

    The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI,” they had more standardized operations. Actions that were already well-defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior, it was triggering actions that were already tightly controlled elsewhere.

    Readiness enables capability, not the other way around. Which reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

    Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

    The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

    What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

    In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

    Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.


    Inspired by this post on The Intercom Blog.


    Book a consult png image