Shipping agentic AI into production is exhilarating until a flaky output torpedoes trust. Over the past year, I’ve led teams at HighLevel to operationalize agents across customer-facing and internal workflows, and I’ve learned that reliability isn’t an afterthought; it’s an architecture. In this piece, I share the agent orchestration patterns that have consistently delivered dependable outcomes at scale.
When we talk about orchestration, we’re talking about more than a single prompt. The shift is from monolithic calls to coordinated agentic AI, where routers, planners, and specialists collaborate through structured workflows. In practice, I rely on three canonical patterns: a planner–executor loop for multi-step tasks, a router–specialist setup for skill selection, and a retrieval-first pipeline that grounds generation in authoritative context before a single token is produced.
Reliability-by-design starts with typed inputs/outputs and strict validation. I standardize on JSON schemas, enforce tool/function signatures, and implement idempotency keys so retries don’t wreak havoc on downstream systems. Timeouts, circuit breakers, and backpressure protect the platform under load, while rate limiting and dead-letter queues keep failure modes contained. Most importantly, we engineer graceful degradation: agents “abstain” when uncertain, fall back to deterministic paths, and escalate to humans instead of guessing.
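To make the validation and idempotency ideas concrete, here is a minimal sketch. The `RefundRequest` type, the `RefundTool` class, and the field names are hypothetical, and the in-memory dictionary stands in for whatever durable store a real system would use.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RefundRequest:
    """Typed tool input. The fields are hypothetical examples."""
    order_id: str
    amount_cents: int


def validate(payload: dict) -> RefundRequest:
    """Reject malformed tool arguments before any side effect runs."""
    order_id = payload.get("order_id")
    amount = payload.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return RefundRequest(order_id, amount)


def idempotency_key(tool: str, request: RefundRequest) -> str:
    """Derive a stable key so a retried call maps to the same operation."""
    body = json.dumps({"tool": tool, **asdict(request)}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


class RefundTool:
    """Toy tool: remembers results by key so retries do not repeat side effects."""

    def __init__(self) -> None:
        self._results: dict = {}  # stand-in for a durable store
        self.side_effects = 0

    def call(self, payload: dict) -> dict:
        request = validate(payload)
        key = idempotency_key("refund", request)
        if key in self._results:
            return self._results[key]  # retry: return the stored result
        self.side_effects += 1  # the "real" downstream action happens once
        result = {"status": "refunded", "order_id": request.order_id}
        self._results[key] = result
        return result
```

Deriving the key from the request content is a simplification: it also collapses two intentionally identical refunds into one. A caller-supplied key per user intent avoids that, at the cost of threading the key through the agent loop.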
Safety is a first-class concern, not a bolt-on. Our “AI risk management” pipeline includes PII redaction, allow/deny lists for tools and data, and the principle of least privilege for every connector (yes, even the ChatGPT connector). We codify policy-as-code for repeatability and require human-in-the-loop approvals for sensitive or irreversible actions. In my experience, clear red lines and reversible defaults prevent the vast majority of regrettable outcomes.
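As an illustration of policy-as-code, the sketch below pairs simple PII redaction with a per-agent tool allowlist and an approval gate. The agent names, tool names, and regular expressions are assumptions for the example; production redaction usually needs far broader coverage than two patterns.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask emails and phone-like numbers before text reaches a model or a log."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Least privilege: each agent sees only the tools it needs (hypothetical names).
ALLOWED_TOOLS = {
    "support_agent": {"search_kb", "read_order"},
    "billing_agent": {"search_kb", "read_order", "issue_refund"},
}
# Irreversible actions require a named human approver.
NEEDS_APPROVAL = {"issue_refund"}


def authorize(agent: str, tool: str, approved_by=None) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a requested tool call."""
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        return "deny"  # unknown agents and unlisted tools are denied by default
    if tool in NEEDS_APPROVAL and approved_by is None:
        return "needs_approval"
    return "allow"
```

Keeping the policy in plain data structures makes it easy to review in a pull request and to test like any other code.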
Without strong “observability,” you’re flying blind. I instrument agents with an “Agent Analytics” layer that captures traces, spans, tool invocations, and token usage across the entire chain. The essential metrics are outcome quality (task success rate), latency (p50/p95), tool failure rates, cost per task, and user-level satisfaction signals. Cross-agent lineage allows us to pinpoint where a plan went awry and which tool or prompt introduced drift—vital for rapid remediation.
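The metrics listed above can be rolled up from per-task traces with very little code. This is a sketch under assumptions: `TaskTrace` and its fields are hypothetical, and percentiles use the simple nearest-rank definition rather than any particular vendor's method.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskTrace:
    """One completed agent task. Field names are illustrative."""
    latency_ms: float
    succeeded: bool
    tool_calls: int
    tool_failures: int
    cost_usd: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate the core reliability metrics for a non-empty batch of traces."""
    n = len(traces)
    latencies = [t.latency_ms for t in traces]
    total_calls = sum(t.tool_calls for t in traces)
    total_failures = sum(t.tool_failures for t in traces)
    return {
        "task_success_rate": sum(t.succeeded for t in traces) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "tool_failure_rate": total_failures / total_calls if total_calls else 0.0,
        "cost_per_task_usd": sum(t.cost_usd for t in traces) / n,
    }
```

Running the same rollup per segment or per workflow is usually what surfaces the drift that a global average hides.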
Quality improves fastest when it is measured relentlessly. I practice eval-driven development with golden datasets, rubric-based scoring, and risk-weighted sampling of edge cases. LLM-as-judge can help, but we always calibrate it against human ratings and monitor agreement between the two. In production, I blend online metrics with controlled A/B tests, sizing each experiment so it has enough traffic to detect a realistic minimum detectable effect (MDE). The result is a virtuous loop where prompt tweaks, tool changes, and retrieval adjustments are verified before wide rollout.
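To show what planning around an MDE involves, here is a sketch of the standard normal-approximation sample size for comparing two success rates. The function name and defaults are my own choices, and the formula assumes a two-sided test with equal traffic in both arms.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Tasks needed per arm to detect an absolute lift of `mde_abs`.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde^2,
    the usual two-proportion approximation with unpooled variance.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


if __name__ == "__main__":
    # Example: 70% baseline task success, looking for a 5-point lift.
    print(sample_size_per_arm(0.70, 0.05))
```

With a 70% baseline and a 5-point MDE this works out to roughly 1,250 tasks per arm. Shrinking the MDE raises the requirement quickly, since the sample size scales with the inverse square of the effect.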
Agents need the same rigor we expect from any modern system. I gate releases through CI/CD with linting for prompts, schema checks for tools, and simulation runs for critical paths. Feature flags enable shadow and canary deployments so we can throttle exposure by segment or workflow. I also track delivery performance with DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service), and I partner closely with SRE for on-call coverage, runbooks, and incident postmortems tailored to agent failure modes.
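A canary behind a feature flag usually comes down to stable bucketing: the same user or workflow should land on the same side of the flag every time, and widening the rollout should only add units. The sketch below assumes a hash-based scheme; the flag name and ID format are hypothetical and not tied to any flag vendor.

```python
import hashlib


def bucket(flag: str, unit_id: str) -> int:
    """Map a (flag, unit) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_canary(flag: str, unit_id: str, percent: float) -> bool:
    """True if this unit falls inside the first `percent` of buckets."""
    return bucket(flag, unit_id) < percent * 100


if __name__ == "__main__":
    exposed = sum(in_canary("agent-v2", f"user-{i}", 10) for i in range(10_000))
    print(f"{exposed} of 10000 users see the canary at 10%")
```

Including the flag name in the hash keeps different experiments from always exposing the same users. Because the threshold only grows, raising the percentage never removes someone who was already in the canary.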
Context is a resource to allocate, not a bottomless pit. Thoughtful “context window management” means curating retrieval, summarizing long-running threads, setting memory time-to-live, and constraining what the agent can see at any given step. I bias hard toward retrieval over recall, keep chunks small and semantically precise, and validate that the “retrieval-first pipeline” truly returns the right evidence—not just the nearest match.
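One way to treat context as a budget is to filter retrieved chunks by freshness and relevance, then pack the best ones until the budget is spent. In this sketch the `Chunk` type is hypothetical, the relevance scores are assumed to come from the retriever, and token counts are estimated by word count rather than a real tokenizer.

```python
import time
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float       # retrieval relevance, higher is better (assumed given)
    created_at: float  # unix timestamp in seconds


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_context(chunks, token_budget, ttl_seconds, now=None, min_score=0.0):
    """Pick fresh, relevant chunks in score order without exceeding the budget."""
    now = time.time() if now is None else now
    fresh = [
        c for c in chunks
        if now - c.created_at <= ttl_seconds and c.score >= min_score
    ]
    selected, used = [], 0
    for chunk in sorted(fresh, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # skip what does not fit; a smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected
```

The `min_score` floor is what guards against returning "just the nearest match": if nothing clears it, the agent gets an empty context and can abstain instead of answering from weak evidence.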
In day-to-day product work, I lean on a compact playbook: a router that selects the best specialist; a planner that decomposes tasks and allocates tools; a deterministic guard that verifies preconditions; an execution loop with explicit budgets; and a fallback policy that prefers abstaining over hallucinating. Together, these patterns create an agent that behaves like a dependable teammate rather than a creative wildcard.
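The playbook above can be sketched as a single control loop. Everything here is illustrative: the keyword router stands in for a real classifier, the static `PLANS` table stands in for a model-driven planner, and the tool names are hypothetical. Tools are passed in as plain callables that take the current state and return updates to it.

```python
ABSTAIN = "I can't complete this reliably; escalating to a human."

# Static plans per specialist; a real planner would generate these.
PLANS = {
    "billing": ["lookup_order", "check_eligibility", "issue_refund"],
    "account": ["send_reset_link"],
}


def route(request: str):
    """Router: choose a specialist, or None when nothing matches."""
    text = request.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "account"
    return None


def preconditions_met(step: str, state: dict) -> bool:
    """Deterministic guard: refunds require an explicit eligibility result."""
    if step == "issue_refund":
        return state.get("eligible") is True
    return True


def run(request: str, tools: dict, max_steps: int = 5) -> str:
    """Route, plan, then execute under a step budget; abstain on any doubt."""
    specialist = route(request)
    if specialist is None:
        return ABSTAIN  # fallback: no specialist fits the request
    state: dict = {}
    steps = list(PLANS[specialist])
    for executed, step in enumerate(steps):
        if executed >= max_steps or not preconditions_met(step, state):
            return ABSTAIN  # budget exhausted or guard failed
        state.update(tools[step](state))
    return "done: " + ", ".join(steps)
```

The guard and the budget sit outside the model on purpose: they are cheap, deterministic checks, so a planning mistake ends in an abstention rather than an unintended refund.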
No architecture thrives without the right rituals. Product trios keep discovery continuous, while clear outcomes (not output) align teams on value instead of vanity. We map risks early, maintain a public quality dashboard, and rehearse failure recoveries so incidents never become improvisations. The cultural signal is simple: we celebrate root-cause clarity and safe iteration over heroics.
If you’re just starting, implement three patterns first: retrieval before generation, abstain-and-escalate for low confidence, and canary releases under feature flags. Instrument everything from day one, run a weekly eval review, and expand scope only when the data says you’re ready. With these habits, your agents will earn user trust—and keep it.
To truly transform with AI, I’ve learned it’s never just about the technology—it’s about redesigning how we work. The teams that win don’t bolt AI on; they re-architect around it. That means rethinking roles, workflows, and governance to build a system that sustains and improves AI performance over time.
In The 2026 Customer Service Transformation Report, teams at every stage of maturity describe human agents taking on more proactive work—training AI systems, handling the hardest queries, and owning tasks that demand judgment. Job descriptions are shifting, too, with many organizations explicitly adding AI-related responsibilities.
I’m also seeing a clear rise in dedicated AI specialists. Conversation analysts, knowledge managers, and AI operations leads are fast becoming standard. For support professionals, this opens new, higher-leverage career paths—and creates a talent pipeline that blends service excellence, data fluency, and product thinking.
Support once centered on queue-level activity—ticket triage, routing, translations, and answering FAQs. Now, as AI handles more frontline interactions, our human roles are moving up the stack toward optimization, oversight, and continuous improvement.
According to the latest research, 45% of teams report updating job descriptions to include AI-related responsibilities, with 40% saying their human agents are now more focused on training AI systems. Another 27% report that human agents primarily handle the most complex escalations and edge cases, while a quarter say agents are doing more consultative and strategic work.
Even at the initial deployment stage, 16% of teams report spending less time handling support volume since implementing AI – and among teams who’ve reached maturity, that figure rises to 28%.
When Intercom’s Research, Analytics & Data Science (RAD) team interviewed 166 of our customers, similar themes emerged. Nearly all participants (≈95%) reported meaningful workflow changes, with manual processes being handled by AI, and humans focusing more on monitoring or fine-tuning AI outputs. Eighty-three percent of participants also reported seeing their team’s roles and responsibilities change to become more strategic and supervisory in nature.
AI is reshaping support teams: organizations are adding conversation analysts (32%), knowledge managers (30%), AI operations leads (28%), and support automation specialists (24%). Just 8% report no new AI roles.
It’s not just the work that’s evolving; organizational structures are, too. Some teams are reallocating existing talent into AI-focused roles; others are hiring entirely new skill sets. Many of the most common job titles in this space didn’t exist two years ago.
Consider a Senior AI Knowledge Manager, Beth-Ann Sher, who transitioned from a help center manager role. Like many careers transformed by AI, her work evolved from administrative to strategic. Instead of focusing solely on customer-facing, self-serve content, her mandate expanded to designing and optimizing knowledge inputs that directly improve AI Agent Fin’s performance—work that materially lifts resolution rates.
Or look at a Senior Conversation Designer, Fred Walton, hired specifically for an AI-first function. He focuses on frictionless customer journeys with Fin, smoothing handoffs between automation and human support while keeping customer satisfaction front and center—hallmarks of mature AI workflows and conversation design.
In high-performing organizations, roles like these typically sit within a dedicated AI support team under senior CS leadership. Clear ownership and accountability for AI performance is critical; without it, optimization stalls and trust erodes.
These shifts aren’t isolated. Take Robb Clarke from RB2B. He went from Head of Technical Operations to Head of AI. With Fin, his focus moved from repetitive support questions to managing knowledge and improving the system behind it—freeing him to be proactive about product improvements and fix issues before they hit customers.
Or consider Eric Broulette from Bloomerang, a support leader who leaned into AI and became the VP of Support and Education. By deploying Fin, his team found breathing room to invest in what’s next. Agents stepped into new roles, contributed to meaningful projects, and built skills that had previously felt out of reach. As Eric puts it: “Do not wait to embrace AI. It will unlock more career growth for your teams than you can imagine.”
Bringing AI into support will eventually change every agent’s day-to-day work. For leaders at the start of the journey, that can feel daunting. My perspective: the most successful teams treat this as an operating model shift, not a tooling rollout—anchored in AI Strategy, governance, and continuous improvement.
Be transparent about what’s changing, why it matters, and how success will be measured. Define how AI performance will be evaluated (resolution rate, containment, CSAT impact), empower agents to train and improve the system, and communicate how responsibilities will evolve. When teams help build the AI, they’re invested in making it great.
Here’s the playbook I rely on with support leaders. First, reset expectations about time allocation: less time in the queue, more time improving the AI system that serves the queue. Second, elevate knowledge management as a core capability. Prioritize content quality and coverage for your AI Agent, and carve out dedicated “out of the inbox” time so every agent contributes. Third, keep outcome metrics, especially resolution rate, front and center. They give the team a north star for experimentation and iteration.
Scaling AI is as much a people challenge as it is a technology challenge. As automation takes on more work, support roles become more proactive, strategic, and cross-functional—even early in the journey. Responsibilities expand, new roles emerge, and team structures adapt to concentrate on and amplify AI performance. In the process, support careers are transformed.
If you’re leading this shift, now’s the moment to reimagine your operating model: clarify ownership, invest in knowledge and conversation design, adopt eval-driven development, and build the muscle for continuous improvement. That’s how you move from tickets to strategy—and unlock compounding value for your customers, your business, and your teams.
Today, I’m excited to share 12 major updates to Fin’s Procedures and Simulations—the foundation that lets Fin handle complex work while keeping teams fully in control of the customer experience.
In my work building AI workflows with product and support leaders, I’ve seen how the right blend of natural language instructions, deterministic controls, and fully agentic behavior turns Fin into a reliable problem solver. Procedures make this blend possible by enabling Fin to act like a human—yet with the repeatability and governance of software. Simulations then let us test those complex Procedures at scale before they reach customers, so we can deploy with confidence.
Together, these capabilities make Fin self-manageable, transparent, and ready for genuinely complex work.
Here’s what’s new at a glance: we’ve made Procedures easier to build and maintain; enhanced deterministic controls for precision and policy compliance; expanded agentic behavior so Fin can adapt in real time; and delivered more powerful Simulations to validate end-to-end workflows before go-live.
Why did we build this? Many teams see early AI gains in speed, coverage, and cost to serve—but then hit a ceiling. They keep AI confined to simple automation and information retrieval, rather than setting it up to handle the nuanced, multi-step workflows they still trust to humans. We designed Procedures and Simulations to remove that ceiling, so teams can confidently set up, govern, and iterate on complex AI workflows without bottlenecks.
Follow the AI lifecycle as it cycles from Analyze to Train to Test to Deploy. This streamlined loop spotlights the TRAIN phase, underscoring faster iteration and feedback that power more capable procedures and realistic simulations.
We also heard that teams needed an easy way to connect data so Fin could reliably check customer status or eligibility and then take action. And they didn’t want to route through engineering every time they needed to create or amend logic for mid-conversation decisions. Procedures combine natural language instructions with intuitive data connector setup. You tell Fin in your own words how you want it to behave, and you’re guided through creating conditional steps so Fin reacts consistently, with the option to add code snippets where absolute precision is required. Once you build one Procedure, we believe you’ll want to build several. For that reason, Fin continuously reads the conversation it’s in to make sure it’s following the most relevant Procedure, and jumps to a more relevant one if the user’s intent changes.
I know that taking something like this live the first time can feel like a leap of faith. That’s exactly why we built Simulations—to test Procedures comprehensively, uncover edge cases, and launch with confidence.
Reaching mature deployment takes a deliberate, ongoing commitment to training workflows, validating them before deployment, measuring performance in production, and refining them over time. At Intercom, we call this the Fin Flywheel: train, test, deploy, analyze. Procedures form the foundation of the train stage, and Simulations make the test stage reliable at scale. Together, they enable Fin to handle complex work, and teams to stay in control of it.
Procedures: Define exactly how Fin handles complex work. With Procedures, I can set Fin up to resolve complex, time-consuming queries that require multiple steps or business logic. Fin follows standard operating procedures and applies sound judgment—just like a seasoned teammate—so even complicated queries are resolved in controllable, predictable ways.
A snapshot of the Procedures builder in action, mapping a clear path for handling damaged food orders while letting teams train Fin on examples, target channels, quickly test updates, and publish with Set live.
Procedures combine three powerful elements. First, natural language instructions. You write a Procedure in plain language, just like documenting a process for a new teammate. You can paste in your existing SOPs, write from scratch, or let AI draft them for you, then iterate yourself.
What’s new: Draft Procedures with AI. Share an outline of your process and Fin drafts a complete Procedure using your conversation history, knowledge hub content, and relevant data. If additional context is needed, it prompts you with clarifying questions to make sure the Procedure is thorough and tailored to your use case, significantly reducing setup time. For example: if you’re creating a refund workflow, the system can draft conditional paths for eligibility, approval thresholds, and verification steps based on your historical cases and policies.
What’s new: Break complex workflows into Sub-procedures. Write a process once and reference it across multiple Procedures by breaking it down into reusable steps, called Sub-procedures. This makes workflows easier to read, faster to build, and simpler to maintain as things change.
Second, deterministic controls. Natural language is flexible, but some steps need to be exact. You can layer in deterministic controls where precision matters, starting with a fully natural language Procedure and introducing structure gradually where it adds value: conditional steps (branching logic) to handle decision points so Fin’s behavior is consistent and predictable; data connectors so Fin can pull information from your tools or take actions automatically; code snippets for when absolute accuracy is essential; and checkpoints to pause for approval or hand off to a teammate.
Fin demonstrates structured troubleshooting: a transaction dispute flow with eligibility checks, clear IF/ELSE steps, and quick Data Connector actions like freezing a card or pulling invoices, streamlining complex support tasks.
What’s new: Instruct Fin to read specific content from your knowledge hub. You can set clear rules for Fin to reference a specific policy or article from your knowledge hub in defined situations so Fin always surfaces the right context in a conversation.
What’s new: Explicit Procedure switching under defined conditions. You can set rules that deterministically trigger a switch to a different Procedure, for example, escalating to a complaints Procedure if specific risk signals are detected mid-conversation.
What’s new: Internal notes for human handoffs. When Fin hands off to a teammate, it can now include internal notes with relevant context so the person picking up the conversation knows exactly what happened and what needs to happen next.
Third, fully agentic behavior. Because real conversations rarely follow the happy path, Procedures let Fin reason through what’s happening and adapt—jumping to the right step or switching Procedures entirely if a customer changes their mind or the issue shifts.
Procedures and Simulations in action: Fin rehearses a food order damage scenario, confirming details and progressing through each trigger. Teams validate complex flows end to end as steps turn green and outcomes are tracked.
What’s new: Automatic Procedure switching. If a customer starts in a billing workflow but then asks about cancelling their subscription, Fin transitions to the relevant Procedure without forcing the customer to restart.
What’s new: Structured data extraction from uploaded files. Fin can now extract structured data directly from PDFs and images uploaded by customers—like invoices, forms, or receipts—and use that data within the conversation. Customers don’t have to copy and paste or repeat themselves.
As MONY Group put it:
“If a customer starts down one path but their issue turns out to be something else entirely, Fin adapts seamlessly – no more getting stuck in loops or forcing customers into the wrong workflow.”
Simulations help teams rehearse procedures and verify outcomes before going live. Run all tests or launch a new one to ensure Fin handles tricky customer scenarios—from damage confirmation to refunds and missing subscriptions.
The result is a conversation that feels fluid, but always follows your intended rules.
Making complexity easier to manage is just as important as unlocking new capabilities. Beyond the core updates, we’ve focused on creation, governance, and scale—while keeping ownership with your team.
What’s new: Improved instruction authoring. We’ve made it easier to write, edit, and structure Procedures, so building and updating them takes less time and requires less effort.
What’s new: Reporting on when Procedures trigger, resolve, or hand off. You can now track how Procedures are performing directly within the Procedures UI, seeing exactly when they trigger, when they resolve, and when they hand off to a teammate. This visibility helps you spot issues early and improve over time.
Customer stories from Raylo and MONY Group show how Fin now resolves payment issues and complex claims in-chat, checks account data via APIs, and lifts CSAT to about 94%, highlighting the impact of Procedures and Simulations.
Simulations: Test complex workflows at scale before they reach customers. Simulations let you validate how Procedures will perform before anything goes live, and continuously revalidate as things change. Deploying complex AI can feel uncertain; Simulations remove that uncertainty so you can launch with confidence and iterate safely.
You can simulate full conversations. For any Procedure, choose a user or customer segment and run a complete, multi-turn simulated conversation. You see every step Fin takes, how it applies your rules, reasons through decisions, and where it passes or fails—giving you the observability to debug and fix issues before they ever reach customers.
What’s new: Upload images for richer testing. Simulations now support image uploads, so you can test workflows that involve receipts, invoices, or forms—the same inputs your customers actually send.
What’s new: Clearer visibility into Fin’s reasoning. You can now see exactly how Fin is thinking through each step of a Simulation, making it easier to understand behavior, catch unexpected decisions, and refine Procedures with confidence.
You can also use AI to create, store, and rerun tests. Writing test coverage manually doesn’t scale. Fin’s AI Assistant generates Simulations directly from your Procedures, suggesting realistic edge cases like partial refund disputes, missing invoice uploads, or no subscription found, so you can expand coverage without expanding overhead. All the Simulations you create are stored in a central library. When a product changes, a policy updates, or a Procedure is edited, hit “run all” to instantly check whether anything has regressed. This applies the same rigor to AI automation that engineering teams bring to software testing.
What’s new: AI-suggested Simulations. You can now use AI to generate a full set of Simulations from any Procedure. The AI Assistant suggests realistic variations based on your workflow, so you can build comprehensive test coverage fast.
Customers are already seeing this in production. “Fin can now handle payment-related queries that were never possible before… The impact on CSAT and overall CX has been pretty shocking – the Payment Information procedure CSAT is sitting at ~94%, and CX score is significantly higher than our average.” – Raylo
“Procedures have fundamentally changed what we can achieve with Fin. Previously, complex processes like cashback claim investigations could only be handled through a static form on our website… Now, Fin can handle these sophisticated scenarios in real-time within the conversation itself. It checks account information via API calls, makes complex decisions, and guides customers through the entire claims process dynamically.” – MONY Group
Procedures and Simulations are available now. I’m eager to see how teams use these updates to scale agentic AI, deliver faster resolutions, and raise the bar for customer experience—without sacrificing control, compliance, or quality.
Human-in-the-loop oversight is the fastest and most reliable way I know to elevate AI quality, build user trust, and reduce risk. At HighLevel, my teams treat oversight as a product feature—not an afterthought—because dependable AI experiences come from deliberate design choices across data, models, and people.
When I say “human-in-the-loop,” I mean a system that blends automation with targeted human judgment at key moments: during data curation, prompt engineering, evaluation, deployment, and post-launch learning. This approach turns “AI workflows” into measurable, repeatable processes and keeps me honest about what’s working, what’s drifting, and where a human safety net must step in.
Architecturally, I start with a retrieval-first pipeline to ground outputs in trusted knowledge, then wrap it in guardrails. Deterministic preprocessing, careful prompt engineering, and post-processing validators catch obvious failure modes. Confidence thresholds and policy checks route ambiguous or sensitive cases to a human reviewer, while clear, auditable traces show why the system chose automation versus escalation. This balance supports reliability at scale while preserving agility for “agentic AI” patterns when they add value.
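The routing logic described here fits in a few lines once the signals exist. The sketch below assumes a calibrated confidence score, a topic label, and a validator result are already available; the sensitive-topic list and the 0.8 threshold are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    route: str   # "automate" or "human_review"
    reason: str  # stored with the trace so the choice is auditable


# Hypothetical policy list; real lists come from the governance process.
SENSITIVE_TOPICS = {"legal", "medical", "account_closure"}


def decide(confidence: float, topic: str, validators_passed: bool,
           threshold: float = 0.8) -> Decision:
    """Apply policy first, then validators, then the confidence threshold."""
    if topic in SENSITIVE_TOPICS:
        return Decision("human_review", f"policy: sensitive topic '{topic}'")
    if not validators_passed:
        return Decision("human_review", "post-processing validator failed")
    if confidence < threshold:
        return Decision(
            "human_review",
            f"confidence {confidence:.2f} below threshold {threshold:.2f}",
        )
    return Decision("automate", f"confidence {confidence:.2f} meets threshold")
```

Ordering matters: policy checks run before the confidence check so that a highly confident answer on a sensitive topic still reaches a reviewer.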
Quality is only real if I can measure it, so I build with eval-driven development from day one. I maintain golden datasets, rubric-based scoring guidelines, and an automated evaluation harness that runs on every change to prompts, models, or data. Pre-production gates protect against regressions, while production telemetry surfaces drift by segment and use case. When it’s time to run experiments, I size A/B tests against a minimum detectable effect (MDE) chosen up front, so I don’t mistake noise for a real change.
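A pre-production gate can start very small. The sketch below treats the system under test as any callable from input to output, scores it against a golden set, and blocks the change when the score drops by more than a tolerance. The exact-match scorer and the 0.02 tolerance are assumptions; a rubric-based or judge-based scorer can be passed in through the same parameter.

```python
def exact_match(expected: str, actual: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def evaluate(system, golden, scorer=exact_match) -> float:
    """Mean score of `system` over a non-empty golden dataset."""
    scores = [scorer(case["expected"], system(case["input"])) for case in golden]
    return sum(scores) / len(scores)


def gate(candidate_score: float, baseline_score: float,
         max_regression: float = 0.02) -> bool:
    """Pass only if the candidate is within `max_regression` of the baseline."""
    return candidate_score >= baseline_score - max_regression


if __name__ == "__main__":
    golden = [
        {"input": "refund window?", "expected": "30 days"},
        {"input": "support hours?", "expected": "9am to 5pm"},
    ]
    answers = {"refund window?": "30 days", "support hours?": "9am to 5pm"}
    score = evaluate(lambda q: answers.get(q, ""), golden)
    print(score, gate(score, baseline_score=1.0))
```

Wiring `gate` into CI so it runs on every prompt, model, or data change is what turns the golden set from a report into a release control.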
Operationally, I optimize for outcomes, not output. I track task success rate, time-to-resolution, safety violation rate, hallucination rate, and cost-to-serve, then tie these to OKRs framed around outcomes rather than output. The signal I want is simple: are we reliably solving the user’s job-to-be-done with lower effort and higher confidence? If not, I tighten prompts, refine retrieval, or expand human review where it pays off most.
Risk governance is non-negotiable. I design with privacy-by-design and data governance from the start—role-based access, audit trails, PII redaction, and red-team tests for safety. Clear reviewer playbooks and calibration sessions reduce bias and ensure consistent decisions. These practices aren’t bureaucracy; they’re how I operationalize AI risk management while maintaining velocity.
Teams make or break this model. I empower product trios to own the full lifecycle—discovery, build, and learning—so feedback loops close quickly. In-product feedback widgets, reviewer queues, and incident management playbooks help us respond in hours, not weeks. Over time, human review becomes a targeted scalpel rather than a blanket requirement as the system learns and improves.
Economics guide the level of oversight. I treat each workflow like a portfolio: where the value of accuracy is high and ambiguity is common, I route more to humans; where tasks are simple, frequent, and well-bounded, I automate aggressively. The goal isn’t zero humans—it’s optimal humans, deployed precisely where their judgment compounds ROI.
If you’re getting started, begin with one high-impact workflow, establish your golden set and evaluation rubric, and wire in a simple review queue. Prove the lift, then scale. In the short video above, I walk through the patterns I use to design these loops, measure quality with rigor, and ship AI that teams—and customers—can trust.
What happens when you treat an AI agent not as a chatbot, but as a full teammate on your sales team – one that can jump on video calls, demo your product, make phone calls, and follow up over days?
I recently dug into this question with the team behind ShowMe, an AI-native startup building digital sales reps for inbound teams. Founded in April 2025, ShowMe has engineered a multi‑agent system that combines conversation agents for live voice and video interactions, evaluator agents that score every call for quality and sentiment, and creator agents that ingest customer documentation to build tailored playbooks. A workflow layer orchestrates the entire lead‑to‑close journey across days, not minutes—exactly the kind of agentic AI approach I expect to see become standard in revenue workflows.
What stood out to me first was the origin story: a glaring conversion gap on a previous website, and the realization that a purpose‑built AI could fill it. The initial MVP was refreshingly pragmatic—start with a voice agent, pair it with product videos, and back it with a simple RAG knowledge base. That retrieval‑first pipeline let the team ship quickly, validate real user behavior, and then scale sophistication where it mattered.
Then came a pivotal affordance shift: adding a realistic avatar via HeyGen. It wasn’t just eye candy; it changed how prospects engaged. The video-call UX established trust and made the AI’s capabilities legible at a glance. Prospects behaved as if they were with a human rep—interrupting, probing, and asking for demos—because the surface area invited that behavior.
On the architecture side, the team decomposed a single sales conversation into multiple specialized sub‑agents—greeting, qualifying, pitching—to manage latency, memory constraints, and model limitations. Deterministic workflows handle the happy paths reliably, while a smart orchestrator is emerging to break out of rigid paths when context demands it. Confidence scoring and frustration detection kick in for real‑time human handoff decisions, a must for revenue‑critical moments where a missed nuance can cost pipeline.
Training the system to sell like your team is where it gets powerful. ShowMe ingests sales transcripts and training materials to teach company‑specific sales skills, then uses creator agents to assemble tailored playbooks. Conversation agents stay focused on live interactions, while evaluator agents continuously score calls for quality and sentiment. The result: repeatable, compliant, and brand‑consistent selling—without flattening personalization.
Quality isn’t an afterthought—it’s operationalized. Early deployments run with customer-driven evaluation loops where 100% of conversations are reviewed, tapering to about 5% over time as confidence increases. Feedback becomes automated tests to prevent prompt regression, and production quality is proven with POCs, A/B rollouts, dashboards, and CRM logging. This is eval-driven development applied to go‑to‑market: measurable, auditable, and continuously improving.
I also appreciate how they treat the agent as a coworker, not a widget. Onboarding happens via Slack, weekly reporting aligns with sales leadership rhythms, and tight CRM integration keeps data flowing both ways. That mindset unlocks adoption because it fits how sales teams actually operate—and it creates real Agent Analytics you can manage.
From a product perspective, several pragmatic details matter. Real‑time voice and avatar demos rely on latency tricks and a library of video clips to keep interactions snappy. The conversation agent evolved from a basic Q&A bot into guided sales discovery, balancing personalization with the ever-present risks of hallucination. Guardrails, human‑in‑the‑loop, and clearly defined handoff rules are non‑negotiables in high‑stakes sales workflows.
Looking ahead, the roadmap makes sense: move toward self‑serve PLG setup, add smarter orchestration that adapts beyond deterministic flows, and expand into adjacent roles like customer success. For product leaders building in generative AI, the pattern here is instructive: start with inbound value, design AI workflows that align to proven sales motions, and use rigorous evals to earn the right to automate more.
If you want to go deeper into the build, the live demos, and the full multi‑agent orchestration, listen to this episode on Spotify or Apple Podcasts. For more on the stack, explore ShowMe and the avatar platform HeyGen.
Over the past few years, I’ve led cross-functional teams to deploy agentic AI in production, and I’ve learned that success rarely hinges on the model alone. It comes from methodically designing the right workflows, instrumenting every step, and building a feedback loop that compounds. Learn how companies like Replit are consolidating workflows, creating one-person departments, and building systems for scale with Amplitude.
When I talk about AI agents, I’m describing software that behaves like a focused teammate—owning a clear job to be done end-to-end. In practice, that means consolidating fragmented tasks into a single accountable “one-person department,” then giving it the context, tools, and analytics to perform reliably. This is how agentic AI moves beyond demos into durable business impact.
I start with outcomes, not algorithms. I map a driver tree from business goals (e.g., lower response time, higher activation, better retention) to the specific moments an agent can influence. This outcome-first alignment keeps scope tight, informs guardrails, and grounds the value proposition in measurable change instead of vanity metrics.
Next, I define the workflow the agent will fully own. I look for high-volume, rules-adjacent processes—think lead qualification, support triage, or billing inquiries—where clear decision criteria already exist but human time is the bottleneck. I document triggers, inputs, decision points, and handoffs, then design the ideal-state flow the agent will run autonomously, with transparent escalation paths to humans.
On architecture, I favor a retrieval-first pipeline to keep responses accurate and current. I scope the knowledge base, implement context window management, and standardize tools the agent can call (search, CRM actions, ticket updates). For teams new to this, I coach “LLMs for product managers” fundamentals so we make sensible trade-offs between speed and reliability rather than chasing model-of-the-week headlines.
Instrumentation is where the system becomes self-improving. I use Amplitude analytics and an Agent Analytics schema to track intent detection, tool usage, resolution rate, time-to-resolution, deflection, and escalation causes. A unified analytics platform lets me connect agent outcomes to core product metrics—activation, retention, and conversion—so we can see the real revenue and experience impact, not just local efficiency gains.
To validate impact, I run A/B testing when traffic allows, setting a minimum detectable effect (MDE) upfront to avoid inconclusive reads. In lower-volume scenarios, I lean on eval-driven development: curated test sets for edge cases, scenario-based regression suites, and error taxonomies that accelerate iteration. Feature flags let us stage capabilities safely (shadow mode, assistive, autonomous) while we monitor deltas before full rollout.
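To make the MDE step concrete, here is a minimal sketch of the sample-size arithmetic I would run before committing to an A/B test. It uses the standard normal-approximation formula for comparing two proportions; the function name `sample_size_per_arm` and the default 5% significance and 80% power are my own illustrative assumptions, not values from any particular experimentation tool.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`.

    baseline: current conversion rate (e.g. 0.10 for 10%)
    mde: minimum detectable effect as an absolute change (e.g. 0.02)
    """
    p1, p2 = baseline, baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde == 0:
        raise ValueError("baseline and baseline + mde must be in (0, 1), mde != 0")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


if __name__ == "__main__":
    # Halving the MDE roughly quadruples the traffic you need.
    print(sample_size_per_arm(0.10, 0.04))
    print(sample_size_per_arm(0.10, 0.02))
```

If the number that comes back exceeds the traffic you can realistically collect, that is the signal to fall back on the curated test sets and regression suites instead.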
Reliability and trust are designed in from the start. I apply AI risk management practices—privacy-by-design, data governance, and policy-aligned prompt templates—paired with observability to trace decisions. Clear escalation policies, incident management runbooks, and human-in-the-loop checkpoints ensure the agent fails safe, not silently.
Shipping cadence matters. I use CI/CD to increase deployment frequency, keep prompts and tools versioned, and gate risky changes with targeted rollouts. As patterns stabilize, we scale horizontally to new use cases, sharing core capabilities (retrieval, analytics, guardrails) as a platform. This is how “one-person departments” multiply without multiplying overhead.
Change management closes the loop. I partner with product trios and frontline teams to co-design prompts, set acceptance criteria, and define what “good” looks like in plain language. In-app guides and product tours introduce the agent’s role and limits, and structured feedback channels feed directly into our discovery and iteration rhythm.
The throughline of this playbook is simple: treat agents like real teammates with a job description, operating procedures, and performance reviews. With disciplined workflow design, a retrieval-first pipeline, and outcome-level instrumentation in Amplitude, agentic AI stops being a science project and starts compounding into durable product-led growth.
Inspired by this post on Amplitude – Perspectives.
When people ask me about "LLM vs AI Agents: What Product Teams Must Get Right," I start with a simple truth: an LLM is a powerful prediction engine, while an AI agent is a productized workflow that plans, takes actions with tools, remembers, and closes the loop on an outcome. That difference sounds academic until you’re on the hook for reliability, cost, and customer trust.
In my role, I’ve shipped LLM copilots that delight users and piloted agents that automate complex workflows. The pattern that never fails is this: start assistive, then graduate to autonomy. Copilots accelerate people; agents own outcomes. When we respect that gradient, adoption climbs, incidents fall, and we earn the right to expand scope.
The first decision point is use-case fit. If the task benefits from human judgment, high-context nuance, or brand voice, I frame it as a copilot with strong guardrails and crisp UX. If the task is well-bounded, tool-heavy, and verifiable, I consider an agent, but only after we can measure end‑to‑end task success with eval-driven development.
Architecture matters. I reach for a retrieval-first pipeline to keep responses grounded in authoritative data, then add tool use for actions (search, write, schedule, transact) with deterministic scaffolding to prevent thrashing. Good prompt engineering is table stakes, but context window management and a clean memory strategy (short‑term scratchpad, long‑term facts, and policy) separate demos from durable systems.
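As a rough illustration of that memory strategy, the sketch below assembles a prompt context from the three layers under a fixed budget: policy always goes in, long-term facts come next, and the scratchpad is trimmed to its most recent entries. The function `build_context` and the word-count cost are assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
def build_context(policy: str, facts: list[str],
                  scratchpad: list[str], budget: int) -> list[str]:
    """Return context entries that fit within `budget` (measured in words here)."""

    def cost(text: str) -> int:
        # Crude proxy for token count; swap in a real tokenizer in practice.
        return len(text.split())

    remaining = budget - cost(policy)
    if remaining < 0:
        raise ValueError("budget is too small to hold the policy")
    context = [policy]

    # Long-term facts: include each one that still fits, in priority order.
    for fact in facts:
        if cost(fact) <= remaining:
            context.append(fact)
            remaining -= cost(fact)

    # Short-term scratchpad: keep the newest notes, stop at the first that overflows.
    recent: list[str] = []
    for note in reversed(scratchpad):
        if cost(note) > remaining:
            break
        recent.append(note)
        remaining -= cost(note)
    context.extend(reversed(recent))  # restore chronological order
    return context
```

The ordering is the design choice that matters: policy is never evicted, and the oldest scratchpad notes are the first to go.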
Agents amplify both value and risk. I build safety in layers: role and scope definition, tool whitelists, unit limits, human‑in‑the‑loop checkpoints at irreversible steps, and privacy-by-design data governance. We log every decision token-for-token because auditability isn’t optional once agents touch customers, money, or data.
Measurement is non‑negotiable. For LLM features, I track time‑to‑first‑token, response latency, groundedness, and user satisfaction. For agents, I add Agent Analytics: task success rate, number of steps per task, tool error rate, loop detection, guardrail triggers, escalation to human, cost per successful task, and containment rate. If we can’t see it, we can’t ship it.
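Here is a minimal sketch of how a few of those agent metrics could be computed from per-task logs. The `TaskRecord` fields and the `summarize` function are hypothetical names I am using for illustration, and I am assuming containment means the share of tasks finished without escalation to a human; adjust the definitions to whatever your analytics schema actually records.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    succeeded: bool
    steps: int
    tool_calls: int
    tool_errors: int
    escalated: bool
    cost_usd: float


def summarize(records: list[TaskRecord]) -> dict[str, float]:
    """Aggregate per-task logs into a handful of agent-level metrics."""
    total = len(records)
    if total == 0:
        raise ValueError("no task records to summarize")
    successes = sum(r.succeeded for r in records)
    tool_calls = sum(r.tool_calls for r in records)
    escalations = sum(r.escalated for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "task_success_rate": successes / total,
        "avg_steps_per_task": sum(r.steps for r in records) / total,
        "tool_error_rate": (sum(r.tool_errors for r in records) / tool_calls
                            if tool_calls else 0.0),
        "escalation_rate": escalations / total,
        "containment_rate": 1 - escalations / total,
        # Failed tasks still cost money, so divide total spend by successes only.
        "cost_per_successful_task": (total_cost / successes
                                     if successes else float("inf")),
    }
```

Dividing total spend by successful tasks, rather than all tasks, is deliberate: it makes retries and failures show up as cost instead of hiding them.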
My delivery playbook mirrors modern software ops. We use feature flags, gated betas, and canary rollouts; we version prompts like code; we set incident management paths for model outages and tool drift; and we rehearse fallbacks so the experience degrades gracefully, not catastrophically. Dull operations build dazzling products.
On roadmapping, I thin‑slice value. We introduce a minimum viable copilot that handles a single, frequent job-to-be-done with high success. Only after continuous discovery confirms product‑market fit do we grant more autonomy, one capability at a time. Outcomes vs output OKRs keep us honest: if the customer’s job gets done faster, cheaper, and with fewer errors, we scale; if not, we fix fundamentals before adding scope.
Build vs buy is rarely binary. I tend to buy the undifferentiated heavy lifting—observability, prompt versioning, red‑teaming, and policy enforcement—while building the proprietary workflows, data modeling, and UX that encode our defensible advantage. The litmus test: if it’s part of our unique value proposition, we own it; if not, we integrate the best‑in‑class and move.
Go‑to‑market must be as rigorous as the tech. We position clearly (assistant vs agent), price to value with transparent consumption SaaS pricing, and communicate risk posture in plain language. Customers don’t buy models; they buy confidence that a job gets done reliably within their constraints.
Common failure modes repeat: shipping autonomy before instrumentation, treating prompts as magic instead of software, skipping data governance, and ignoring the human experience. The antidote is disciplined AI Strategy rooted in empowered product teams, tight feedback loops, and relentless evaluation.
If you take nothing else: choose the right paradigm for the job (copilot first, agent when proven), ground with a retrieval-first pipeline, instrument with eval-driven development and Agent Analytics, and operationalize like a mission‑critical system. Do that, and you’ll turn LLM capabilities into durable product outcomes.
In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.
Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.
When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.
The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
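The shape of that flow can be sketched in a few lines. This is a simplified illustration, not a production orchestrator: `orchestrate` and its callable parameters are hypothetical, and in a real system the planner, executor, and verifier would each wrap a model or tool call rather than plain functions.

```python
from typing import Callable

Planner = Callable[[str], list[str]]   # goal -> ordered steps
Executor = Callable[[str], str]        # step -> output
Verifier = Callable[[str, str], bool]  # (step, output) -> passes checks?


def orchestrate(goal: str, plan: Planner, execute: Executor,
                verify: Verifier, max_retries: int = 2) -> list[str]:
    """Run each planned step, retrying until the verifier accepts the output."""
    results: list[str] = []
    for step in plan(goal):
        for _attempt in range(max_retries + 1):
            output = execute(step)
            if verify(step, output):
                results.append(output)
                break
        else:
            # No attempt passed verification: fail loudly instead of
            # passing an unchecked result downstream.
            raise RuntimeError(f"step failed verification: {step}")
    return results
```

Keeping the verifier outside the executor means a bad output is caught at the handoff, before it reaches a customer or the next step.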
To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.
Governance is equally critical. I design privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews—habits borrowed from SRE culture that map well to AI systems.
On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.
If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.
The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.
The most valuable upgrade I’ve made to my product management workflow isn’t a new framework or a shiny dashboard—it’s an AI-first operating model that compresses discovery-to-delivery cycles while increasing confidence in every decision. I built this approach to reduce context switching, remove toil, and keep the team relentlessly focused on outcomes over output. The result is a faster, clearer, and more reliable path from insight to shipped value.
Here’s how I run an AI-powered product workflow end to end: continuous discovery, opportunity sizing, solution shaping, planning, execution, and iteration—each step instrumented with automation, retrieval, and evaluation so we learn faster without compromising rigor.
Intake and triage start with a retrieval-first pipeline that unifies customer feedback, support tickets, sales notes, research transcripts, and usage analytics. I use embeddings to cluster themes, de-duplicate signals, and surface the most representative examples. This gives me an instant, always-fresh view of customer jobs, pains, and opportunities without manually combing through noise.
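To show the clustering idea without any dependencies, here is a small sketch that groups feedback items by cosine similarity of their embeddings. It assumes the embeddings have already been produced by whatever embedding model you use; the greedy `cluster` function and the 0.9 threshold are illustrative choices of mine, and a larger corpus would call for a proper clustering library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(embeddings: list[list[float]], threshold: float = 0.9) -> list[list[int]]:
    """Greedy grouping: each item joins the first cluster whose first member
    (its representative) is at least `threshold` similar; otherwise it starts
    a new cluster. Returns lists of item indices."""
    clusters: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        for members in clusters:
            if cosine(vec, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The first index in each cluster doubles as the representative example, and the cluster sizes give a quick read on how often a theme recurs.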
For discovery, I rely on “LLMs for product managers” to accelerate the hard parts without replacing judgment. I generate interview guides, summarize transcripts, extract entities, and tag moments of friction. Prompt engineering and context window management ensure the model sees the right evidence at the right time. I keep all sensitive data governed by privacy-by-design and data governance controls.
Opportunity sizing is where I connect insights to business impact. I map problems to a driver tree, quantify potential lift, and align to outcomes vs output OKRs. When relevant, I apply the Kano Model to balance performance, basic, and excitement attributes. To maintain rigor, I use eval-driven development on my prompts and heuristics so prioritization is repeatable, not anecdotal.
Solution shaping is a collaborative exercise with product trios. I draft problem narratives and PRDs, generate acceptance criteria, and create first-pass UX flows. For speed, I use gen AI for product prototyping to explore alternatives quickly, then gate final choices through usability feedback and feasibility checks. Where uncertainty is high, I define a minimum detectable effect (MDE) and design A/B testing plans upfront.
Planning ties strategy to execution through product roadmapping and sprint planning. I break work into sequenced bets, enable feature flags for controlled exposure, and wire quality signals into CI/CD. DORA metrics—like deployment frequency and change failure rate—help me keep the system honest. Observability ensures we see the “why” behind behavior, not just the “what.”
Execution is instrumented with in-app guides, Intercom messaging, and Pendo to shape onboarding and activation. I connect Amplitude analytics to measure habit formation, retention analysis, and feature adoption. When experiments run, I monitor leading indicators in near real time while protecting against peeking and p-hacking. The point isn’t to prove we’re right; it’s to learn fast enough to get it right.
Iteration closes the loop. I use a unified analytics platform to compare expected vs actual outcomes, harvest qualitative feedback, and push new evidence back into discovery. The system improves with each cycle because the retrieval-first pipeline and eval harness both get smarter as data grows.
Governance is non-negotiable. AI risk management, cybersecurity, and regulatory compliance sit alongside model evaluations to prevent drift, leakage, or bias. I document decisions, model versions, and test artifacts so we can audit how we got to a call—especially when trade-offs are nuanced.
If you’re standing up this AI workflow from scratch, I recommend a 30/60/90 rollout. In the first 30 days, audit your data sources and build a retrieval-first pipeline. In days 31–60, pilot two high-leverage workflows—continuous discovery and PRD drafting—backed by eval-driven development. By days 61–90, scale to prioritization and experiment design, then thread the outputs into your planning and CI/CD rhythms.
Common pitfalls I watch for: over-automation that blurs context, lack of evaluation frameworks, ungoverned data that undermines trust, and vanity metrics that celebrate activity over outcomes. The antidote is simple but disciplined—clear decision criteria, measurable hypotheses, and automated evaluations that run as guardrails, not bottlenecks.
This AI upgrade doesn’t replace the craft of product management; it amplifies it. By combining judgment, clear strategy, and reliable automation, we ship value faster, reduce risk, and make better calls under uncertainty. The payoff is durable: compounding learning velocity and a team that spends more time solving the right problems—and less time wrestling the process.
I’ve been pushing hard to operationalize AI for real product work, and this episode zeroes in on the moment Claude Code stops feeling like a demo and starts behaving like a dependable teammate. If you’ve ever wondered how to go from clever prompts in the browser to durable, repeatable workflows on your machine, this walkthrough is for you.
Listen on Spotify or Apple Podcasts.
My first honest reaction to installing and configuring the desktop agent was the all-too-relatable “this tool thinks everything is a code repo” reality. That framing helped me reset expectations fast: instead of treating it like a magical universal assistant, I began designing guardrails, context, and repeatable routines—exactly how I’d onboard a new team member.
The shift from Claude-in-the-browser to Claude Code on my machine was the unlock. Locally, it can finally work with my files, folders, and workflows. That meant I could ground it in real artifacts—project docs, meeting notes, product specs, and historical decisions—so responses weren’t just plausible; they were contextual and verifiable.
On setup, I now treat /init and CLAUDE.md files as my product requirements. I define roles, boundaries, and canonical sources up front, then run in a deliberate “walled garden.” The “treat it like an intern” model works beautifully: scope access intentionally, expand privileges as trust grows, and keep a tight audit trail of what it can touch and why.
Surprisingly, task management became my ideal on-ramp. It’s easy to validate, the feedback loops are tight, and the ROI is immediate. I export calendar windows rather than granting full calendar access, then let the agent map priorities into Trello, reconcile time blocks, and surface trade-offs. Fast wins build confidence—mine and the agent’s.
Model switching matters more than I expected. When speed is king and “good enough” will do, Haiku keeps the loop snappy. When stakes are higher—complex synthesis, nuanced product strategy, or gnarly ambiguity—I step up to Claude Opus 4.5. Being intentional about when to optimize for latency versus depth is a quiet superpower.
Web tasks can still spiral. When that happens, I pause its autonomy, toggle to fewer steps, and ask, “What are you doing?” Paired with Claude’s Web fetch tool, this prompts the agent to summarize its plan and current step rather than expose its raw reasoning, so I can spot brittle assumptions, prune distractions, and re-ground the task.
Content retrieval has become a killer workflow. I point the agent at my archives—blog posts, book drafts, transcripts, notes—and ask, “Where have I talked about this before?” It assembles a map of prior art, connects themes I’d forgotten, and prevents me from reinventing work. Over time, this evolves into a Zettelkasten-style research system that upgrades rigor and accelerates synthesis.
I’ve also turned Claude Code into a publishing engine. From a single transcript, it drafts titles, descriptions, show notes, and chapters, then routes artifacts to Ghost for formatting. Before anything ships, I run fact-checking workflows that validate claims against transcripts and research sources. The output improves, but more importantly, the scaffolding makes quality repeatable.
Reusable workflows compound. I rely on slash commands to trigger common jobs, break down larger efforts with sub-agents, and wire in hooks and plugins where external systems are needed. This is agentic AI at its most practical: fewer hero prompts, more reliable processes.
Audience analytics and content prioritization are helpful with caveats. I let the agent cluster themes and flag gaps, then I pressure-test its suggestions against first-party data and strategic goals. As with any model-driven insight, triangulation beats blind faith.
Two metaphors guide my day-to-day. First, Claude Code is like a dog—sometimes it returns with the stick, sometimes it gets lost in the woods. Second, the “intern” framing keeps me honest: don’t hand it the whole company on day one. With that mindset, my output jumped—more volume without sacrificing quality—because the workflow scaffolding got better.
In this episode, I cover what Claude Code is and why it’s useful even if you’re not an engineer, the real difference between the browser experience and running locally, how to shape behavior with /init and CLAUDE.md files, why task management is the perfect proving ground, when to export calendar windows versus connecting directly, and when model-switching makes sense: Haiku for speed, Opus for depth.
I also dig into debugging web tasks by asking “What are you doing?”, content retrieval workflows across personal archives, building reusable slash-command systems with sub-agents, hooks, and plugins, practical publishing stacks from transcripts, fact-checking against transcripts and research sources, and using analytics to prioritize content—with a healthy respect for uncertainty.
If you’ve been trying to make Claude Code feel less like “throwing a stick into the woods,” this is the candid, tactical tour I wish I’d had on day one. Drop your questions and experiments below—I’m eager to compare notes and refine the playbook together.
Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.
This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.
Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcomes vs output OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.
Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.
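As one small piece of that PII handling, here is a hedged sketch of redaction applied before text enters the retrieval index. The two regular expressions are deliberately simple illustrations that I am assuming for this example; they will miss formats and can over-match, so treat them as a starting point rather than a compliance control.

```python
import re

# Illustrative patterns only: real deployments need broader, tested coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact("Reach me at jane.doe@example.com or +1 (415) 555-0100."))
```

Running redaction at ingestion, rather than at answer time, means the sensitive values never reach the index the agent retrieves from.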
Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.
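A golden-dataset gate can be very small. The sketch below scores an agent against expected answers and only allows a wider rollout when accuracy clears a threshold. The names `run_eval` and `may_widen_rollout`, the exact-match scoring, and the 0.9 threshold are assumptions for illustration; most real tasks need a softer grader than string equality.

```python
from typing import Callable

GoldenCase = tuple[str, str]  # (input, expected answer)


def run_eval(agent: Callable[[str], str], golden: list[GoldenCase]) -> float:
    """Return the fraction of golden cases the agent answers correctly.

    Scoring is a case-insensitive exact match after trimming whitespace.
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    correct = sum(
        agent(question).strip().lower() == expected.strip().lower()
        for question, expected in golden
    )
    return correct / len(golden)


def may_widen_rollout(accuracy: float, threshold: float = 0.9) -> bool:
    """Gate: only expand exposure when accuracy meets the target threshold."""
    return accuracy >= threshold
```

The same pattern extends to latency and cost: compute each metric on the golden set and require every threshold to pass before the flag opens further.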
Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.
Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics such as deployment frequency to ensure we’re improving the engine, not just the output.
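For the canary cohorts, one common approach is deterministic bucketing: hash the user ID with the flag name so each user lands in a stable bucket, then compare it to the rollout percentage. This is a minimal sketch under that assumption; `in_canary` is a hypothetical helper, not the API of any specific feature-flag product.

```python
import hashlib


def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    The same (flag, user_id) pair always maps to the same bucket, so raising
    `percent` only adds users and never reshuffles the existing cohort.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 covers buckets 0..499
```

Including the flag name in the hash keeps cohorts independent across flags, so the same users are not always the first to receive every risky change.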
Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool whitelists, separate read vs write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.
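Those constraints translate naturally into a small policy check that runs before every agent step. The sketch below is illustrative only: the `Guardrails` fields, the tool names, and the returned decision strings are hypothetical, and the confidence score is assumed to come from whatever signal your agent already produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_steps: int = 5
    read_tools: frozenset[str] = frozenset({"search_kb"})
    write_tools: frozenset[str] = frozenset({"update_ticket"})
    min_confidence: float = 0.7


def next_action(guard: Guardrails, step_index: int,
                tool: str, confidence: float) -> str:
    """Decide whether the agent may take its proposed step."""
    if step_index >= guard.max_steps:
        return "handoff:step_limit"       # stop runaway loops
    if confidence < guard.min_confidence:
        return "handoff:low_confidence"   # uncertain: route to a human
    if tool in guard.read_tools:
        return "allow"
    if tool in guard.write_tools:
        return "allow_with_checkpoint"    # writes pass through a decision checkpoint
    return "deny:tool_not_allowed"        # anything off the allowlist is refused
```

Checking the step limit and confidence before the tool lookup means an agent that is looping or unsure hands off even when the tool it wants is permitted.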
Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.
If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.
Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.
Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.
Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.
The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.
AI adoption is everywhere. I see more teams every quarter moving from pilots to production—and increasing their budgets accordingly. But the gap between “using AI” and truly transforming with it is widening fast. Launching an AI Agent is easy; building a mature, AI-powered support operation is where the real work—and the real value—lives.
According to new research, the "2026 Customer Service Transformation Report," the difference comes down to depth of deployment. It’s not enough to dabble. Teams that design their operations around AI are pulling away from those who treat AI like a bolt-on feature.
This article kicks off part one of my five-part deep dive into the research. I’ll unpack the data, share what I’ve learned leading product and AI strategy, and translate it into practical steps you can apply now.
First, the macro picture: 2,470 global support professionals across industries were surveyed to understand current AI usage, challenges, and the 2026 opportunities. The headline is clear—AI investment is now table stakes. Eighty-two percent of senior leaders say their teams invested in AI in the past year and 87% say they plan to invest in 2026. Those investments are already paying off: Over three-quarters of CS teams (77%) say AI is meeting or exceeding expectations, delivering faster response and resolution times, always-on coverage, cost savings, increased capacity, and multilingual support that scales globally.
And yet, only 10% of organizations say they have reached a "mature" level of deployment, where AI is fully integrated into operations and working at scale. That’s the tell: most teams are skimming the surface and leaving meaningful performance gains on the table.
Most service teams are still early in AI adoption. Only 10% report mature deployment, while 26% are scaling, 35% are in initial rollout, and 26% remain in exploration, with 3% unsure.
When I map the data to what I’ve seen in the field, the maturity difference shows up immediately in outcomes. Teams at mature deployment don’t just automate repetitive tasks; they build AI into critical workflows, give it real responsibility, and iterate continuously. Beyond automating the bulk of their manual work, they’re using AI to proactively engage customers and perform tasks on their behalf.
The results follow. Of the teams that have reached mature deployment, 43% report higher quality and consistency across support—nearly double the rate of those still in the initial deployment stage. That quality shift is how support evolves from a cost center to a value driver. Great experiences don’t just prevent churn; they create advocacy and become a reason customers choose you. The more you trust your AI Agent with meaningful work, the more it creates the conditions for higher-quality, more consistent support.
One example I point to often: Lightspeed. They operate a complex product across regions and languages, with tens of thousands of monthly requests. When they adopted Fin in early 2023, they needed a solution that could scale with that complexity—and they treated the transition like a first-class change program.
They leveraged foundational training and built custom, in-house modules aligned to their processes. They supported their team post-launch and worked closely with leadership to align on the goals and benefits of AI. In a large, distributed org, that executive alignment created ownership and momentum. Their VP of Information Systems, Yamine Gluchow, put it perfectly: "It’s not magic. If you invest in understanding, adoption, and great content, AI performance takes off."
Mature AI Agent rollouts deliver bigger gains in customer service—outperforming initial deployments in automation, proactive engagement, and task completion (63% vs 52%, 51% vs 41%, 45% vs 28%)—showing how depth drives measurable impact.
Their outcomes reflect that depth: an 88% involvement rate, 72% of Fin conversations resolved without human intervention, 43,000+ customer requests resolved monthly, service in 12+ languages across 100+ countries, and stable CSAT, with improvement in some markets.
What impressed me most was the complexity Fin now resolves. A merchant in France asked about tax invoices—normally a long phone call to check back-end data and explain rules step by step. Instead, Fin handled the conversation in French, provided an accurate end-to-end explanation, and earned positive CSAT. That’s what mature deployment looks like: a system that absorbs complexity and delivers correct, efficient results at scale.
So how do we build toward that level of maturity? In my experience, this journey requires a mindset shift and operational rigor—not just a bigger AI budget.
Rethink how you approach support. If you were building from scratch today, you’d design around AI from day one. As Grant Lee, CEO of Gamma, puts it: "If you want to unlock the real value of AI, you have to design for it, not retrofit around it." Treat AI as infrastructure, not a feature. That shift impacts your org design, workflows, and what “good” looks like.
Secure executive sponsorship early. You won’t scale without C-suite backing. AI reshapes how support works, how teams are structured, how performance is measured, and how cost and value flow. Align your CFO on ROI, your CCO on journey design, and your CEO on customer experience as a strategic advantage. Early wins are great—but the compounding gains only come when leadership backs AI as infrastructure, not a one-off cost save.
Assign clear ownership for AI performance. One common failure mode: no one owns the AI. Stand up an AI operations lead or support ops specialist to review resolution trends and handoffs, tune content and configuration, coordinate on systemic issues, and drive a prioritized improvement roadmap. Without this role, feedback loops break and performance plateaus.
Treat content as critical infrastructure. Your AI Agent is only as good as the knowledge it can access. Ensure coverage for the topics it must handle, keep information accurate and current, and structure content so it’s easy for AI to consume. Make maintenance part of BAU, not a quarterly fire drill. A clean, governed, retrieval-first pipeline dramatically increases autonomous resolution.
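To make "maintenance as BAU" concrete, here is a minimal Python sketch of a recurring content audit. It flags required topics with no article and articles that have not been reviewed recently. Everything in it is an assumption for illustration: the `Article` fields, the `audit_content` helper, and the 90-day review window are hypothetical and do not come from any particular help-center or AI Agent product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    # Hypothetical shape of a knowledge-base entry.
    title: str
    topic: str
    last_reviewed: date


def audit_content(articles, required_topics, today, max_age_days=90):
    """Return required topics with no article, and articles past their review window."""
    cutoff = today - timedelta(days=max_age_days)
    covered = {a.topic for a in articles}
    missing = sorted(set(required_topics) - covered)
    stale = sorted(a.title for a in articles if a.last_reviewed < cutoff)
    return {"missing_topics": missing, "stale_articles": stale}


# Illustrative data only.
articles = [
    Article("Refund policy", "refunds", date(2025, 5, 1)),
    Article("EU tax invoices", "tax invoices", date(2024, 11, 15)),
]
result = audit_content(
    articles,
    required_topics=["refunds", "tax invoices", "shipping"],
    today=date(2025, 6, 1),
)
print(result)
```

With the sample data, the audit reports "shipping" as an uncovered topic and "EU tax invoices" as overdue for review. Running a check like this on a schedule, and routing the output to whoever owns AI performance, is one way to keep content upkeep from turning into a quarterly fire drill.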
Build a continuous improvement system. AI performance isn’t static. Train your AI Agent by expanding its knowledge, refining behavior, and connecting new data sources to handle more scenarios autonomously. Validate changes against real scenarios before they ship. Roll out updates in a controlled way across channels and segments. Use performance data to find patterns, such as frequent handoffs or topics with low resolution rates, and decide what to improve next. I often point to the Fin Flywheel (Train → Test → Deploy → Analyze) as a practical example of turning performance data into action.
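As a rough illustration of the Analyze step, the Python sketch below groups conversations by topic, computes the share resolved without a human handoff, and ranks the topics that fall below a target. The `Conversation` fields, the `topics_to_improve` helper, and the thresholds are all hypothetical. This is not Fin's data model or reporting API, only one way a team might turn exported conversation data into a prioritized list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    # Hypothetical export format: one record per AI-handled conversation.
    topic: str
    resolved_by_ai: bool  # True if closed without a human handoff


def topics_to_improve(conversations, min_volume=3, target_rate=0.7):
    """Return (topic, volume, resolution_rate) for topics below target, most handoffs first."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for c in conversations:
        totals[c.topic] += 1
        if c.resolved_by_ai:
            resolved[c.topic] += 1

    report = []
    for topic, volume in totals.items():
        if volume < min_volume:
            continue  # too few conversations to draw a conclusion
        rate = resolved[topic] / volume
        if rate < target_rate:
            report.append((topic, volume, rate))

    # Rank by estimated number of handoffs, so high-volume pain points come first.
    report.sort(key=lambda row: row[1] * (1 - row[2]), reverse=True)
    return report


# Illustrative data only.
sample = (
    [Conversation("billing", True)] * 2 + [Conversation("billing", False)] * 6
    + [Conversation("shipping", True)] * 9 + [Conversation("shipping", False)] * 1
    + [Conversation("tax invoices", True)] * 2 + [Conversation("tax invoices", False)] * 2
    + [Conversation("rare", False)] * 1
)
for topic, volume, rate in topics_to_improve(sample):
    print(f"{topic}: {volume} conversations, {rate:.0%} resolved by AI")
```

With the sample data, "billing" ranks first (8 conversations, 25% resolved) and "tax invoices" second (4 conversations, 50% resolved). "shipping" is already above target and "rare" has too little volume to judge. Ranking by handoff count rather than by rate alone is a deliberate choice here: it points the next round of training at the topics that cost the team the most time.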
The big takeaway from the "2026 Customer Service Transformation Report" is encouraging: investment is widespread, and early returns are real. The bigger opportunity is to turn those early wins into durable transformation. Teams leaning into AI as infrastructure—supported by executive alignment, clear ownership, strong content, and a continuous improvement loop—are already separating from the pack.
Next up in this series, I’ll dig into how leading teams measure success. Beyond simple cost savings, mature deployments tie AI to clear ROI and strategic impact—shifting more work into value-adding, revenue-generating territory. Follow along here, or subscribe on LinkedIn to get the next installment in your feed.