
Battle-Tested AI Agent Orchestration Patterns for Reliable, Observable, Product-Ready Systems

Shipping agentic AI into production is exhilarating, until a flaky output torpedoes trust. Over the past year, I’ve led teams at HighLevel to operationalize agents across customer-facing and internal workflows, and I’ve learned that reliability isn’t an afterthought; it’s an architecture. In this piece, I share the orchestration patterns that consistently deliver dependable outcomes at scale.

When we talk about orchestration, we’re talking about more than a single prompt. The shift is from monolithic calls to coordinated “agentic AI” where routers, planners, and specialists collaborate through structured “AI workflows.” In practice, I rely on a few canonical patterns: a planner–executor loop for multi-step tasks, a router–specialist setup for skill selection, and a “retrieval-first pipeline” that grounds generation with authoritative context before a single token is produced.

Reliability-by-design starts with typed inputs/outputs and strict validation. I standardize on JSON schemas, enforce tool/function signatures, and implement idempotency keys so retries don’t wreak havoc on downstream systems. Timeouts, circuit breakers, and backpressure protect the platform under load, while rate limiting and dead-letter queues keep failure modes contained. Most importantly, we engineer graceful degradation: agents “abstain” when uncertain, fall back to deterministic paths, and escalate to humans instead of guessing.
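To make the idempotency point concrete, here is a minimal sketch of typed input validation combined with an idempotency key. The `RefundRequest` type, the `IdempotentExecutor` class, and the in-memory result store are all hypothetical names chosen for illustration; a real system would likely use a durable store and a schema library.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    """Typed input for a hypothetical refund tool."""
    order_id: str
    amount_cents: int

    def __post_init__(self):
        # Strict validation: reject malformed input before any side effect.
        if not self.order_id:
            raise ValueError("order_id must be non-empty")
        if self.amount_cents <= 0:
            raise ValueError("amount_cents must be positive")


def idempotency_key(tool: str, request: RefundRequest) -> str:
    """Derive a stable key from the tool name and the canonical payload."""
    payload = json.dumps(vars(request), sort_keys=True)
    return hashlib.sha256(f"{tool}:{payload}".encode()).hexdigest()


class IdempotentExecutor:
    """Runs a tool at most once per key; retries get the stored result."""

    def __init__(self):
        self._results = {}  # in-memory stand-in for a durable store (assumption)

    def run(self, tool, fn, request):
        key = idempotency_key(tool, request)
        if key not in self._results:
            self._results[key] = fn(request)
        return self._results[key]
```

With this shape, a retried call carrying the same payload returns the stored result instead of triggering the downstream side effect a second time.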

Safety is a first-class concern, not a bolt-on. Our “AI risk management” pipeline includes PII redaction, allow/deny lists for tools and data, and the principle of least privilege for every connector (yes, even the ChatGPT connector). We codify policy-as-code for repeatability and require human-in-the-loop approvals for sensitive or irreversible actions. In my experience, clear red lines and reversible defaults prevent the vast majority of regrettable outcomes.

Without strong “observability,” you’re flying blind. I instrument agents with an “Agent Analytics” layer that captures traces, spans, tool invocations, and token usage across the entire chain. The essential metrics are outcome quality (task success rate), latency (p50/p95), tool failure rates, cost per task, and user-level satisfaction signals. Cross-agent lineage allows us to pinpoint where a plan went awry and which tool or prompt introduced drift—vital for rapid remediation.
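As a rough illustration of how those metrics could be derived from captured traces, the sketch below aggregates a list of per-task records. The `TaskTrace` fields and the nearest-rank percentile method are assumptions made for the example, not the schema of any particular tracing tool.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskTrace:
    """One completed agent task (hypothetical trace record)."""
    succeeded: bool
    latency_ms: float
    tool_calls: int
    tool_failures: int
    cost_usd: float


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate the core reliability metrics over a non-empty trace list."""
    n = len(traces)
    latencies = [t.latency_ms for t in traces]
    calls = sum(t.tool_calls for t in traces)
    failures = sum(t.tool_failures for t in traces)
    return {
        "task_success_rate": sum(t.succeeded for t in traces) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "tool_failure_rate": failures / calls if calls else 0.0,
        "cost_per_task_usd": sum(t.cost_usd for t in traces) / n,
    }
```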

Quality improves fastest when it is measured relentlessly. I practice “eval-driven development” with golden datasets, rubric-based scoring, and risk-weighted sampling of edge cases. LLM-as-judge can help, but we always calibrate against human ratings and monitor agreement. In production, I blend online metrics with controlled “A/B testing” and size experiments so they can detect a realistic minimum detectable effect (MDE). The result is a virtuous loop where prompt tweaks, tool changes, and retrieval adjustments are verified before wide rollout.
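One common way to monitor agreement between an LLM judge and human raters is Cohen’s kappa, which corrects raw agreement for chance. The sketch below is a plain-Python version for categorical labels; the choice of kappa is my assumption for illustration, since the text above does not name a specific agreement statistic.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 suggests the judge tracks human ratings closely; a value near 0 suggests its agreement is no better than chance, which is a signal to recalibrate the rubric or the judge prompt.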

Agents need the same rigor we expect from any modern system. I gate releases through “CI/CD” with linting for prompts, schema checks for tools, and simulation runs for critical paths. “Feature flags” enable shadow and canary deployments so we can throttle exposure by segment or workflow. I also track delivery performance with “DORA metrics” such as “deployment frequency,” and I partner closely with “SRE” for on-call coverage, runbooks, and incident postmortems tailored to agent failure modes.
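A canary behind a feature flag usually depends on stable bucketing, so the same user stays in or out of the rollout across requests. The sketch below shows one way to do that with a hash; the function name `in_canary` and the flag naming are hypothetical, and a hosted feature-flag service would normally handle this for you.

```python
import hashlib


def in_canary(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically assign a user to a canary for a given flag.

    Hashing flag and user together keeps buckets independent across flags.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100
```

Because the bucket is fixed per user and flag, raising `rollout_percent` only ever adds users to the canary; nobody who already saw the new behavior is silently moved back.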

Context is a resource to allocate, not a bottomless pit. Thoughtful “context window management” means curating retrieval, summarizing long-running threads, setting memory time-to-live, and constraining what the agent can see at any given step. I bias hard toward retrieval over recall, keep chunks small and semantically precise, and validate that the “retrieval-first pipeline” truly returns the right evidence—not just the nearest match.
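Treating context as a budget can be sketched as a greedy packer that admits the highest-scoring retrieved chunks until a token limit is reached. The whitespace token counter is a deliberate simplification and an assumption; in practice you would plug in the tokenizer that matches your model.

```python
def pack_context(chunks, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Select retrieved chunks by descending relevance within a token budget.

    chunks: list of (relevance_score, text) pairs.
    Returns (selected_texts, tokens_used).
    """
    selected, used = [], 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; smaller ones may still fit
        selected.append(text)
        used += cost
    return selected, used
```

Small, semantically precise chunks make this kind of packing more effective, because fewer high-value passages are dropped for being too large to fit.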

In day-to-day product work, I lean on a compact playbook: a router that selects the best specialist; a planner that decomposes tasks and allocates tools; a deterministic guard that verifies preconditions; an execution loop with explicit budgets; and a fallback policy that prefers abstaining over hallucinating. Together, these patterns create an agent that behaves like a dependable teammate rather than a creative wildcard.
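The guard, budgeted loop, and abstain-first fallback from that playbook can be sketched in a few lines. Everything here is illustrative: `Budget`, `run_plan`, the step signature, and the 0.7 confidence floor are hypothetical choices, and routing and planning are assumed to have already produced the list of steps.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float


def abstain(reason):
    """Fallback result: stop and report why instead of guessing."""
    return {"status": "abstained", "reason": reason, "results": []}


def run_plan(steps, budget, preconditions_ok=True, min_confidence=0.7):
    """Run planned steps under explicit budgets.

    Each step is a callable returning (result, cost_usd, confidence).
    """
    if not preconditions_ok:  # deterministic guard runs before any step
        return abstain("precondition_failed")
    if len(steps) > budget.max_steps:
        return abstain("step_budget_exceeded")
    spent, results = 0.0, []
    for step in steps:
        result, cost, confidence = step()
        spent += cost
        if spent > budget.max_cost_usd:
            return abstain("cost_budget_exceeded")
        if confidence < min_confidence:
            return abstain("low_confidence")
        results.append(result)
    return {"status": "completed", "reason": None, "results": results}
```

An abstained result carries a machine-readable reason, which makes it straightforward to escalate to a human or to a deterministic path downstream.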

No architecture thrives without the right rituals. Product trios keep discovery continuous, while clear outcomes (not output) align teams on value instead of vanity. We map risks early, maintain a public quality dashboard, and rehearse failure recoveries so incidents never become improvisations. The cultural signal is simple: we celebrate root-cause clarity and safe iteration over heroics.

If you’re just starting, implement three patterns first: retrieval before generation, abstain-and-escalate for low confidence, and canary releases under feature flags. Instrument everything from day one, run a weekly eval review, and expand scope only when the data says you’re ready. With these habits, your agents will earn user trust—and keep it.

Inspired by this post on Product School.

March 2, 2026
From Tickets to Strategy: How AI Is Rewriting Support Careers—and Why Now Is the Moment

To truly transform with AI, I’ve learned it’s never just about the technology—it’s about redesigning how we work. The teams that win don’t bolt AI on; they re-architect around it. That means rethinking roles, workflows, and governance to build a system that sustains and improves AI performance over time.

In The 2026 Customer Service Transformation Report, teams at every stage of maturity describe human agents taking on more proactive work—training AI systems, handling the hardest queries, and owning tasks that demand judgment. Job descriptions are shifting, too, with many organizations explicitly adding AI-related responsibilities.

I’m also seeing a clear rise in dedicated AI specialists. Conversation analysts, knowledge managers, and AI operations leads are fast becoming standard. For support professionals, this opens new, higher-leverage career paths—and creates a talent pipeline that blends service excellence, data fluency, and product thinking.

Support once centered on queue-level activity—ticket triage, routing, translations, and answering FAQs. Now, as AI handles more frontline interactions, our human roles are moving up the stack toward optimization, oversight, and continuous improvement.

According to the latest research, 45% of teams report updating job descriptions to include AI-related responsibilities, with 40% saying their human agents are now more focused on training AI systems. Another 27% report that human agents primarily handle the most complex escalations and edge cases, while a quarter say agents are doing more consultative and strategic work.

Even at the initial deployment stage, 16% of teams report spending less time handling support volume since implementing AI – and among teams who’ve reached maturity, that figure rises to 28%.

When Intercom’s Research, Analytics & Data Science (RAD) team interviewed 166 of our customers, similar themes emerged. Nearly all participants (≈95%) reported meaningful workflow changes, with manual processes being handled by AI, and humans focusing more on monitoring or fine-tuning AI outputs. Eighty-three percent of participants also reported seeing their team’s roles and responsibilities change to become more strategic and supervisory in nature.

AI is reshaping support teams: organizations are adding conversation analysts (32%), knowledge managers (30%), AI operations leads (28%), and support automation specialists (24%). Just 8% report no new AI roles.

It’s not just the work that’s evolving; organizational structures are, too. Some teams are reallocating existing talent into AI-focused roles; others are hiring entirely new skill sets. Many of the most common job titles in this space didn’t exist two years ago.

Consider a Senior AI Knowledge Manager, Beth-Ann Sher, who transitioned from a help center manager role. Like many careers transformed by AI, her work evolved from administrative to strategic. Instead of focusing solely on customer-facing, self-serve content, her mandate expanded to designing and optimizing knowledge inputs that directly improve AI Agent Fin’s performance—work that materially lifts resolution rates.

Or look at a Senior Conversation Designer, Fred Walton, hired specifically for an AI-first function. He focuses on frictionless customer journeys with Fin, smoothing handoffs between automation and human support while keeping customer satisfaction front and center—hallmarks of mature AI workflows and conversation design.

In high-performing organizations, roles like these typically sit within a dedicated AI support team under senior CS leadership. Clear ownership and accountability for AI performance are critical; without them, optimization stalls and trust erodes.

These shifts aren’t isolated. Take Robb Clarke from RB2B. He went from Head of Technical Operations to Head of AI. With Fin, his focus moved from repetitive support questions to managing knowledge and improving the system behind it—freeing him to be proactive about product improvements and fix issues before they hit customers.

Or consider Eric Broulette from Bloomerang, a support leader who leaned into AI and became the VP of Support and Education. By deploying Fin, his team found breathing room to invest in what’s next. Agents stepped into new roles, contributed to meaningful projects, and built skills that had previously felt out of reach. As Eric puts it: “Do not wait to embrace AI. It will unlock more career growth for your teams than you can imagine.”


Bringing AI into support will eventually change every agent’s day-to-day work. For leaders at the start of the journey, that can feel daunting. My perspective: the most successful teams treat this as an operating model shift, not a tooling rollout, and anchor it in AI strategy, governance, and continuous improvement.

Be transparent about what’s changing, why it matters, and how success will be measured. Define how AI performance will be evaluated (resolution rate, containment, CSAT impact), empower agents to train and improve the system, and communicate how responsibilities will evolve. When teams help build the AI, they’re invested in making it great.

Here’s the playbook I rely on with support leaders: First, reset expectations about time allocation—less time in the queue, more time improving the AI system that serves the queue. Second, elevate knowledge management as a core capability. Prioritize content quality and coverage for your AI Agent, and carve out dedicated “out of the inbox” time so every agent contributes. Third, keep outcome metrics—especially resolution rate—front and center. It gives the team a north star for experimentation and iteration.

Scaling AI is as much a people challenge as it is a technology challenge. As automation takes on more work, support roles become more proactive, strategic, and cross-functional—even early in the journey. Responsibilities expand, new roles emerge, and team structures adapt to concentrate on and amplify AI performance. In the process, support careers are transformed.

If you’re leading this shift, now’s the moment to reimagine your operating model: clarify ownership, invest in knowledge and conversation design, adopt eval-driven development, and build the muscle for continuous improvement. That’s how you move from tickets to strategy—and unlock compounding value for your customers, your business, and your teams.

Inspired by this post on The Intercom Blog.

February 27, 2026
12 Game-Changing Updates to Fin Procedures & Simulations for Complex Queries

Today, I’m excited to share 12 major updates to Fin’s Procedures and Simulations—the foundation that lets Fin handle complex work while keeping teams fully in control of the customer experience.

In my work building AI workflows with product and support leaders, I’ve seen how the right blend of natural language instructions, deterministic controls, and fully agentic behavior turns Fin into a reliable problem solver. Procedures make this blend possible by enabling Fin to act like a human—yet with the repeatability and governance of software. Simulations then let us test those complex Procedures at scale before they reach customers, so we can deploy with confidence.

Together, these capabilities make Fin self-manageable, transparent, and ready for genuinely complex work.

Here’s what’s new at a glance: we’ve made Procedures easier to build and maintain; enhanced deterministic controls for precision and policy compliance; expanded agentic behavior so Fin can adapt in real time; and delivered more powerful Simulations to validate end-to-end workflows before go-live.

Why did we build this? Many teams see early AI gains in speed, coverage, and cost to serve—but then hit a ceiling. They keep AI confined to simple automation and information retrieval, rather than setting it up to handle the nuanced, multi-step workflows they still trust to humans. We designed Procedures and Simulations to remove that ceiling, so teams can confidently set up, govern, and iterate on complex AI workflows without bottlenecks.

[Figure: The AI lifecycle loop of Analyze, Train, Test, and Deploy, with the Train phase highlighted.]

We also heard that teams needed an easy way to connect data so Fin could reliably check customer status or eligibility and then take action. And they didn’t want to route through engineering every time they needed to create or amend logic for mid-conversation decisions. Procedures combine natural language instructions with intuitive data connector setup. You tell Fin in your own words how you want it to behave, and you’re guided through creating conditional steps so Fin reacts consistently, with the option to add code snippets where absolute precision is required. Once you build one Procedure, we believe you’ll want to build several. Fin therefore continually reads the conversation it’s in to make sure it’s following the most relevant Procedure, and jumps to a more relevant one if the user’s intent changes.

I know that taking something like this live the first time can feel like a leap of faith. That’s exactly why we built Simulations—to test Procedures comprehensively, uncover edge cases, and launch with confidence.

Reaching mature deployment takes a deliberate, ongoing commitment to training workflows, validating them before deployment, measuring performance in production, and refining them over time. At Intercom, we call this the Fin Flywheel: train, test, deploy, analyze. Procedures form the foundation of the train stage, and Simulations make the test stage reliable at scale. Together, they enable Fin to handle complex work, and teams to stay in control of it.

Procedures: Define exactly how Fin handles complex work. With Procedures, I can set Fin up to resolve complex, time-consuming queries that require multiple steps or business logic. Fin follows standard operating procedures and applies sound judgment—just like a seasoned teammate—so even complicated queries are resolved in controllable, predictable ways.

[Figure: The Procedures builder mapping a path for handling damaged food orders, with options to train Fin on examples, target channels, test updates, and publish with Set live.]

Procedures combine three powerful elements. First, natural language instructions. You write a Procedure in plain language, just like documenting a process for a new teammate. You can paste in your existing SOPs, write from scratch, or let AI draft them for you, then iterate yourself.

What’s new: Draft Procedures with AI. Share an outline of your process and Fin drafts a complete Procedure using your conversation history, knowledge hub content, and relevant data. If additional context is needed, it prompts you with clarifying questions to make sure the Procedure is thorough and tailored to your use case, significantly reducing setup time. For example: if you’re creating a refund workflow, the system can draft conditional paths for eligibility, approval thresholds, and verification steps based on your historical cases and policies.

What’s new: Break complex workflows into Sub-procedures. Write a process once and reference it across multiple Procedures by breaking it down into reusable steps, called Sub-procedures. This makes workflows easier to read, faster to build, and simpler to maintain as things change.

Second, deterministic controls. Natural language is flexible, but some steps need to be exact. You can layer in deterministic controls where precision matters, starting with a fully natural language Procedure and introducing structure gradually where it adds value: conditional steps (branching logic) to handle decision points so Fin’s behavior is consistent and predictable; data connectors so Fin can pull information from your tools or take actions automatically; code snippets for when absolute accuracy is essential; and checkpoints to pause for approval or hand off to a teammate.

[Figure: A transaction dispute Procedure with eligibility checks, IF/ELSE steps, and Data Connector actions such as freezing a card or pulling invoices.]

What’s new: Instruct Fin to read specific content from your knowledge hub. You can set clear rules for Fin to reference a specific policy or article from your knowledge hub in defined situations so Fin always surfaces the right context in a conversation.

What’s new: Explicit Procedure switching under defined conditions. You can set rules that deterministically trigger a switch to a different Procedure, for example, escalating to a complaints Procedure if specific risk signals are detected mid-conversation.

What’s new: Internal notes for human handoffs. When Fin hands off to a teammate, it can now include internal notes with relevant context so the person picking up the conversation knows exactly what happened and what needs to happen next.

Third, fully agentic behavior. Because real conversations rarely follow the happy path, Procedures let Fin reason through what’s happening and adapt—jumping to the right step or switching Procedures entirely if a customer changes their mind or the issue shifts.

[Figure: A Simulation of a food order damage scenario, with steps turning green as Fin confirms details and progresses through each trigger.]

What’s new: Automatic Procedure switching. If a customer starts in a billing workflow but then asks about cancelling their subscription, Fin transitions to the relevant Procedure without forcing the customer to restart.

What’s new: Structured data extraction from uploaded files. Fin can now extract structured data directly from PDFs and images uploaded by customers—like invoices, forms, or receipts—and use that data within the conversation. Customers don’t have to copy and paste or repeat themselves.

As MONY Group put it:

“If a customer starts down one path but their issue turns out to be something else entirely, Fin adapts seamlessly – no more getting stuck in loops or forcing customers into the wrong workflow.”

[Figure: The Simulations view, with options to run all tests or launch a new one across scenarios such as damage confirmation, refunds, and missing subscriptions.]

The result is a conversation that feels fluid, but always follows your intended rules.

Making complexity easier to manage is just as important as unlocking new capabilities. Beyond the core updates, we’ve focused on creation, governance, and scale—while keeping ownership with your team.

What’s new: Improved instruction authoring. We’ve made it easier to write, edit, and structure Procedures, so building and updating them takes less time and requires less effort.

What’s new: Reporting on when Procedures trigger, resolve, or hand off. You can now track how Procedures are performing directly within the Procedures UI, seeing exactly when they trigger, when they resolve, and when they hand off to a teammate. This visibility helps you spot issues early and improve over time.

[Figure: Customer stories from Raylo and MONY Group on resolving payment issues and complex claims in-chat with Procedures and Simulations.]

Simulations: Test complex workflows at scale before they reach customers. Simulations let you validate how Procedures will perform before anything goes live, and continuously revalidate as things change. Deploying complex AI can feel uncertain; Simulations remove that uncertainty so you can launch with confidence and iterate safely.

You can simulate full conversations. For any Procedure, choose a user or customer segment and run a complete, multi-turn simulated conversation. You see every step Fin takes, how it applies your rules, reasons through decisions, and where it passes or fails—giving you the observability to debug and fix issues before they ever reach customers.

What’s new: Upload images for richer testing. Simulations now support image uploads, so you can test workflows that involve receipts, invoices, or forms—the same inputs your customers actually send.

What’s new: Clearer visibility into Fin’s reasoning. You can now see exactly how Fin is thinking through each step of a Simulation, making it easier to understand behavior, catch unexpected decisions, and refine Procedures with confidence.

You can also use AI to create, store, and rerun tests. Writing test coverage manually doesn’t scale. Fin’s AI Assistant generates Simulations directly from your Procedures, suggesting realistic edge cases like partial refund disputes, missing invoice uploads, or no subscription found, so you can expand coverage without expanding overhead. All the Simulations you create are stored in a central library. When a product changes, a policy updates, or a Procedure is edited, hit “run all” to instantly check whether anything has regressed. This applies the same rigor to AI automation that engineering teams bring to software testing.
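The “run all” idea mirrors a regression suite in software testing. The sketch below is a generic illustration of that pattern only; it is not Fin’s API or data format, and `Simulation`, `run_all`, and the toy procedure callable are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Simulation:
    """A stored test case: scripted customer turns plus the expected outcome."""
    name: str
    customer_messages: List[str]
    expected_outcome: str


def run_all(library: List[Simulation], procedure: Callable[[List[str]], str]) -> List[str]:
    """Rerun every stored simulation; return the names of those that regressed."""
    failures = []
    for sim in library:
        if procedure(sim.customer_messages) != sim.expected_outcome:
            failures.append(sim.name)
    return failures
```

An empty list means nothing regressed; any returned names point to the scenarios to inspect after a policy, product, or workflow change.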

What’s new: AI-suggested Simulations. You can now use AI to generate a full set of Simulations from any Procedure. The AI Assistant suggests realistic variations based on your workflow, so you can build comprehensive test coverage fast.

Customers are already seeing this in production. “Fin can now handle payment-related queries that were never possible before… The impact on CSAT and overall CX has been pretty shocking – the Payment Information procedure CSAT is sitting at ~94%, and CX score is significantly higher than our average.” – Raylo

“Procedures have fundamentally changed what we can achieve with Fin. Previously, complex processes like cashback claim investigations could only be handled through a static form on our website… Now, Fin can handle these sophisticated scenarios in real-time within the conversation itself. It checks account information via API calls, makes complex decisions, and guides customers through the entire claims process dynamically.” – MONY Group

Procedures and Simulations are available now. I’m eager to see how teams use these updates to scale agentic AI, deliver faster resolutions, and raise the bar for customer experience—without sacrificing control, compliance, or quality.

Inspired by this post on The Intercom Blog.

February 25, 2026
Human-in-the-Loop Mastery: Proven Oversight Tactics That Elevate AI Quality and Trust

Human-in-the-loop oversight is the fastest and most reliable way I know to elevate AI quality, build user trust, and reduce risk. At HighLevel, my teams treat oversight as a product feature—not an afterthought—because dependable AI experiences come from deliberate design choices across data, models, and people.

When I say “human-in-the-loop,” I mean a system that blends automation with targeted human judgment at key moments: during data curation, prompt engineering, evaluation, deployment, and post-launch learning. This approach turns “AI workflows” into measurable, repeatable processes and keeps me honest about what’s working, what’s drifting, and where a human safety net must step in.

Architecturally, I start with a retrieval-first pipeline to ground outputs in trusted knowledge, then wrap it in guardrails. Deterministic preprocessing, careful prompt engineering, and post-processing validators catch obvious failure modes. Confidence thresholds and policy checks route ambiguous or sensitive cases to a human reviewer, while clear, auditable traces show why the system chose automation versus escalation. This balance supports reliability at scale while preserving agility for “agentic AI” patterns when they add value.
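The routing logic described here can be sketched as a small function that applies policy checks before confidence and records a reason for the audit trail. The sensitive action list, the 0.8 threshold, and the `Decision` type are hypothetical values for illustration, not settings from a real deployment.

```python
from dataclasses import dataclass

SENSITIVE_ACTIONS = {"refund", "delete_account"}  # hypothetical policy list


@dataclass
class Decision:
    route: str   # "automate" or "human_review"
    reason: str  # stored in the trace so the choice is auditable


def route_case(confidence, action, contains_pii, threshold=0.8):
    """Decide between automation and escalation for one case."""
    # Policy checks come first: they override model confidence.
    if action in SENSITIVE_ACTIONS:
        return Decision("human_review", f"sensitive_action:{action}")
    if contains_pii:
        return Decision("human_review", "pii_detected")
    if confidence < threshold:
        return Decision("human_review", "confidence_below_threshold")
    return Decision("automate", "passed_policy_and_confidence")
```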

Quality is only real if I can measure it, so I build with eval-driven development from day one. I maintain golden datasets, rubric-based scoring guidelines, and an automated evaluation harness that runs on every change to prompts, models, or data. Pre-production gates protect against regressions, while production telemetry surfaces drift by segment and use case. When it’s time to run experiments, I use A/B tests sized with a minimum detectable effect (MDE) to avoid overfitting to noise.
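Sizing a test around an MDE comes down to a sample-size calculation. The sketch below uses the standard normal approximation for comparing two proportions, with a two-sided alpha of 0.05 and 80% power as assumed defaults; the function name and parameters are mine, and a dedicated experimentation tool may use a slightly different formula.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

For a 10% baseline success rate and a 2-point absolute lift, this lands at a few thousand users per arm. Halving the MDE roughly quadruples the requirement, which is why an unrealistically small MDE tends to produce inconclusive experiments.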

Operationally, I optimize for outcomes, not output. I track task success rate, time-to-resolution, safety violation rate, hallucination rate, and cost-to-serve, then tie these to OKRs framed around outcomes rather than output. The signal I want is simple: are we reliably solving the user’s job-to-be-done with lower effort and higher confidence? If not, I tighten prompts, refine retrieval, or expand human review where it pays off most.

Risk governance is non-negotiable. I design with privacy-by-design and data governance from the start—role-based access, audit trails, PII redaction, and red-team tests for safety. Clear reviewer playbooks and calibration sessions reduce bias and ensure consistent decisions. These practices aren’t bureaucracy; they’re how I operationalize AI risk management while maintaining velocity.

Teams make or break this model. I empower product trios to own the full lifecycle—discovery, build, and learning—so feedback loops close quickly. In-product feedback widgets, reviewer queues, and incident management playbooks help us respond in hours, not weeks. Over time, human review becomes a targeted scalpel rather than a blanket requirement as the system learns and improves.

Economics guide the level of oversight. I treat each workflow like a portfolio: where the value of accuracy is high and ambiguity is common, I route more to humans; where tasks are simple, frequent, and well-bounded, I automate aggressively. The goal isn’t zero humans—it’s optimal humans, deployed precisely where their judgment compounds ROI.
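One simple way to frame that trade-off is an expected-value check: human review is worth it when the expected loss it prevents exceeds what the review costs. The function below is a back-of-the-envelope sketch with hypothetical parameter names; real workflows would also weigh latency and reviewer capacity.

```python
def should_route_to_human(p_error, cost_of_error, review_cost, review_catch_rate=1.0):
    """Return True when expected loss avoided by review exceeds the review cost.

    p_error: probability the automated output is wrong.
    cost_of_error: business cost if a wrong output ships.
    review_catch_rate: share of errors a reviewer is assumed to catch.
    """
    expected_loss_avoided = p_error * cost_of_error * review_catch_rate
    return expected_loss_avoided > review_cost
```

For example, with a 5% error rate, a $200 cost per error, and a $2 review, the expected loss avoided is $10, so review pays off. At a 1% error rate and a $20 cost per error, it is only $0.20, and automation is the better default.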

If you’re getting started, begin with one high-impact workflow, establish your golden set and evaluation rubric, and wire in a simple review queue. Prove the lift, then scale. In the short video above, I walk through the patterns I use to design these loops, measure quality with rigor, and ship AI that teams—and customers—can trust.

Inspired by this post on Product School.

February 23, 2026
Implementing AI Agents That Scale: My Playbook for One‑Person Departments with Amplitude

Over the past few years, I’ve led cross-functional teams to deploy agentic AI in production, and I’ve learned that success rarely hinges on the model alone. It comes from methodically designing the right workflows, instrumenting every step, and building a feedback loop that compounds. Learn how companies like Replit are consolidating workflows, creating one-person departments, and building systems for scale with Amplitude.

When I talk about AI agents, I’m describing software that behaves like a focused teammate—owning a clear job to be done end-to-end. In practice, that means consolidating fragmented tasks into a single accountable “one-person department,” then giving it the context, tools, and analytics to perform reliably. This is how agentic AI moves beyond demos into durable business impact.

I start with outcomes, not algorithms. I map a driver tree from business goals (e.g., lower response time, higher activation, better retention) to the specific moments an agent can influence. This outcome-first alignment keeps scope tight, informs guardrails, and grounds the value proposition in measurable change instead of vanity metrics.

Next, I define the workflow the agent will fully own. I look for high-volume, rules-adjacent processes—think lead qualification, support triage, or billing inquiries—where clear decision criteria already exist but human time is the bottleneck. I document triggers, inputs, decision points, and handoffs, then design the ideal-state flow the agent will run autonomously, with transparent escalation paths to humans.

On architecture, I favor a retrieval-first pipeline to keep responses accurate and current. I scope the knowledge base, implement context window management, and standardize tools the agent can call (search, CRM actions, ticket updates). For teams new to this, I coach “LLMs for product managers” fundamentals so we make sensible trade-offs between speed and reliability rather than chasing model-of-the-week headlines.

Instrumentation is where the system becomes self-improving. I use Amplitude analytics and an Agent Analytics schema to track intent detection, tool usage, resolution rate, time-to-resolution, deflection, and escalation causes. A unified analytics platform lets me connect agent outcomes to core product metrics—activation, retention, and conversion—so we can see the real revenue and experience impact, not just local efficiency gains.
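An agent analytics schema along these lines could start as a single typed event record. The sketch below is an assumption about what such a schema might contain; the field names and event types are hypothetical, and it deliberately stops short of calling any analytics SDK, since the payload can be handed to whichever client you use.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentEvent:
    """One agent event; event_type might be 'intent_detected', 'tool_called',
    'resolved', or 'escalated' (hypothetical names)."""
    event_type: str
    conversation_id: str
    user_id: str
    intent: Optional[str] = None
    tool: Optional[str] = None
    escalation_cause: Optional[str] = None
    duration_ms: Optional[int] = None
    timestamp: float = field(default_factory=time.time)

    def to_payload(self):
        """Flatten to a dict, dropping unset fields, for any analytics sink."""
        return {k: v for k, v in asdict(self).items() if v is not None}


def resolution_rate(events):
    """Share of conversations that contain at least one 'resolved' event."""
    resolved = {e.conversation_id for e in events if e.event_type == "resolved"}
    total = {e.conversation_id for e in events}
    return len(resolved) / len(total) if total else 0.0
```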

To validate impact, I run A/B testing when traffic allows, setting a minimum detectable effect (MDE) upfront to avoid inconclusive reads. In lower-volume scenarios, I lean on eval-driven development: curated test sets for edge cases, scenario-based regression suites, and error taxonomies that accelerate iteration. Feature flags let us stage capabilities safely (shadow mode, assistive, autonomous) while we monitor deltas before full rollout.
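The three staging modes can be sketched as a small dispatcher: in shadow mode the agent’s reply is only logged, in assistive mode a human sees it as a draft, and in autonomous mode it is sent directly. The `Mode` enum, `handle` function, and `human_reply_fn` callback are hypothetical names for illustration.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    ASSISTIVE = "assistive"
    AUTONOMOUS = "autonomous"


def handle(mode, agent_reply, human_reply_fn, log):
    """Return the reply the customer sees, given the current rollout mode.

    human_reply_fn receives the agent draft (or None) and returns the human reply.
    """
    # Always log the agent output so deltas can be compared across modes.
    log.append({"mode": mode.value, "agent_reply": agent_reply})
    if mode is Mode.SHADOW:
        return human_reply_fn(None)         # human answers without seeing the draft
    if mode is Mode.ASSISTIVE:
        return human_reply_fn(agent_reply)  # human approves or edits the draft
    return agent_reply                      # autonomous: agent reply goes out as-is
```

Logging the agent reply in every mode is what makes the shadow stage useful: you can compare what the agent would have said with what the human actually sent before granting any autonomy.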

Reliability and trust are designed in from the start. I apply AI risk management practices—privacy-by-design, data governance, and policy-aligned prompt templates—paired with observability to trace decisions. Clear escalation policies, incident management runbooks, and human-in-the-loop checkpoints ensure the agent fails safe, not silently.

Shipping cadence matters. I use CI/CD to increase deployment frequency, keep prompts and tools versioned, and gate risky changes with targeted rollouts. As patterns stabilize, we scale horizontally to new use cases, sharing core capabilities (retrieval, analytics, guardrails) as a platform. This is how “one-person departments” multiply without multiplying overhead.

Change management closes the loop. I partner with product trios and frontline teams to co-design prompts, set acceptance criteria, and define what “good” looks like in plain language. In-app guides and product tours introduce the agent’s role and limits, and structured feedback channels feed directly into our discovery and iteration rhythm.

The throughline of this playbook is simple: treat agents like real teammates with a job description, operating procedures, and performance reviews. With disciplined workflow design, a retrieval-first pipeline, and outcome-level instrumentation in Amplitude, agentic AI stops being a science project and starts compounding into durable product-led growth.

Inspired by this post on Amplitude – Perspectives.

February 18, 2026

How to Scale Trustworthy Enterprise Analytics With AI Agents

Your analytics agent can turn a question into a chart. Then a product leader asks which activation definition it used, an analyst gets a different cohort result, or security discovers that the agent queried data the user could not normally access. That is where a promising pilot becomes an enterprise risk.

The way through is not a better chat interface. You need a controlled path from question to decision: approved definitions, bounded tools, task-level evaluations, visible evidence, and permissions that expand only after the agent proves it can handle a specific workflow reliably.

Define trust as an executable contract

A trustworthy answer is more than a plausible explanation. It is the output of a reproducible analytical process. The enterprise bar includes consistent metric definitions, privacy-by-design, role-based access control, audit trails, low-latency support, and repeatable results. If any link in that chain is implicit, the agent can be eloquent and still be unsafe.

Before you give an agent a task, define its contract. The contract should answer five questions:

What decision is being supported? A request to explain a funnel is different from a request to change the funnel definition or publish a recommendation.
Which definitions are authoritative? Identify the canonical metric, its version, the population, the unit of analysis, the time window, and any approved exclusions.
What may the agent access and do? Specify datasets, fields, tools, credentials, and whether the task is read-only, produces a draft, or can trigger an action.
What evidence must accompany the answer? Require the metric identifier, query or tool calls, filters, lineage, assumptions, and enough result detail for an analyst to reproduce the work.
When must the agent stop? Define the ambiguities, policy conflicts, statistical gaps, and high-consequence actions that require clarification or approval.
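The five questions above can be captured as a typed record that the orchestration layer checks before and after a task runs. This is a minimal sketch: `TaskContract`, `ActionMode`, and every field name are hypothetical, chosen only to mirror the questions.

```python
from dataclasses import dataclass
from enum import Enum


class ActionMode(Enum):
    READ_ONLY = "read_only"
    DRAFT = "draft"
    TRIGGER_ACTION = "trigger_action"


@dataclass(frozen=True)
class TaskContract:
    decision_supported: str               # what decision is being supported
    metric_id: str                        # which definition is authoritative
    metric_version: str
    allowed_datasets: frozenset[str]      # what the agent may access
    allowed_tools: frozenset[str]
    action_mode: ActionMode               # what the agent may do
    required_evidence: tuple[str, ...]    # what must accompany the answer
    stop_conditions: tuple[str, ...]      # when the agent must stop

    def permits(self, tool: str, dataset: str) -> bool:
        """True only if both the tool and the dataset are inside the contract."""
        return tool in self.allowed_tools and dataset in self.allowed_datasets

    def missing_evidence(self, answer_evidence: dict) -> list[str]:
        """Names of required evidence items that are absent or empty."""
        return [key for key in self.required_evidence
                if not answer_evidence.get(key)]
```

An answer whose `missing_evidence` list is not empty would be held back rather than shown, which keeps the evidence requirement enforceable instead of aspirational.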

Consider a seemingly simple question: Did activation decline for new accounts? The answer depends on the approved activation event or event sequence, cohort entry rule, identity resolution, time zone, date range, and exclusions. If the agent silently supplies one of those details, it has made a product decision while pretending to perform analysis.

The safe behavior is straightforward. The agent should retrieve the approved definition, display the material assumptions, and ask for clarification when the remaining ambiguity could change the result. It should not create a new activation definition in the course of answering the question. Changes to definitions belong in a governed workflow with an owner, review, version history, and rollback path.
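That behavior can be sketched as a small pre-flight step. The catalog contents, the parameter names, and the `prepare_analysis` function are all assumptions made for illustration; a real metric store would supply the definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    metric_id: str
    version: str
    required_params: tuple[str, ...]


# Hypothetical stand-in for an approved metric catalog.
CATALOG = {
    "activation_rate": MetricDefinition(
        metric_id="activation_rate",
        version="v3",
        required_params=("cohort_entry_rule", "time_zone", "date_range"),
    ),
}


def prepare_analysis(metric_name: str, supplied: dict) -> dict:
    """Retrieve the approved definition, or stop or ask instead of guessing."""
    definition = CATALOG.get(metric_name)
    if definition is None:
        # Never invent a definition while answering a question.
        return {"status": "stop",
                "reason": f"no approved definition for {metric_name!r}"}
    missing = [p for p in definition.required_params if p not in supplied]
    if missing:
        return {"status": "clarify", "questions": missing}
    return {
        "status": "ready",
        "metric_id": definition.metric_id,
        "version": definition.version,
        "assumptions": dict(supplied),  # displayed alongside the answer
    }
```

The point of the three statuses is that silence is never an option: the agent either has everything it needs, asks for what is missing, or stops.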

This distinction also gives you a better definition of accuracy. An answer fails if it uses the wrong metric, violates an access rule, omits a material assumption, or cannot be reproduced, even when the final number happens to be correct. Trust is a property of the whole execution path, not only the sentence shown to the user.

Move through four levels of autonomy one task at a time

Teams often treat agent maturity as a platform-wide label. That hides risk. The same system may be mature enough to draft a funnel but not mature enough to interpret an under-specified experiment. Assign maturity to each task, dataset, and action instead.

Level	Agent role	Evidence required before moving forward
L0: Conversational interface	Summarizes charts or reports that already exist.	The agent accurately identifies the selected artifact, preserves its filters and caveats, and does not imply that it performed new analysis.
L1: Grounded retrieval	Retrieves definitions and context from the analytics catalog, taxonomy, or metric store before answering.	Canonical definitions are consistently selected, citations and assumptions are visible, and retrieval respects the requesting user’s permissions.
L2: Governed tool use	Reads schemas, generates safe SQL, calls approved tools, and reconciles results against canonical definitions.	Representative tasks pass golden-data and regression evaluations; queries, tool calls, lineage, errors, latency, and cost are observable.
L3: Bounded autonomous workflow	Completes an end-to-end workflow with approval gates, audit logs, feature flags, and rollback controls.	The exact workflow has a stable evaluation history, clear ownership, tested failure handling, and a reversible execution path.

L0 can still be useful. It reduces navigation work and helps a user understand an existing dashboard. The mistake is presenting that convenience as autonomous analytics. L1 improves trust by grounding language in the organization’s own definitions, but retrieval alone does not prove that a newly calculated result is correct.

L2 is the consequential transition. The agent is no longer explaining an approved artifact; it is producing analytical work. Schema awareness, safe SQL, result reconciliation, and complete traces become release requirements rather than optional diagnostics.

L3 should describe a narrow, governed workflow, not a general promise that the agent can handle anything. For example, an agent might autonomously refresh an approved weekly retention analysis while still requiring an analyst to approve a new cohort definition. Broaden the task boundary only after the additional behavior has its own tests and controls.
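One way to keep maturity attached to a task rather than to the platform is a small registry keyed by task and dataset. The sketch below is illustrative: the registry entries and the `requires_human` helper are hypothetical, and anything not listed falls back to the lowest level.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    L0_CONVERSATIONAL = 0
    L1_GROUNDED_RETRIEVAL = 1
    L2_GOVERNED_TOOL_USE = 2
    L3_BOUNDED_WORKFLOW = 3


# Hypothetical grants: maturity is assigned per (task, dataset) pair.
GRANTED = {
    ("weekly_retention_refresh", "approved_retention_dataset"): Autonomy.L3_BOUNDED_WORKFLOW,
    ("funnel_draft", "product_events"): Autonomy.L2_GOVERNED_TOOL_USE,
}


def requires_human(task: str, dataset: str, needed: Autonomy) -> bool:
    """True when the task needs more autonomy than has been granted for it."""
    granted = GRANTED.get((task, dataset), Autonomy.L0_CONVERSATIONAL)
    return needed > granted
```

Under these assumed grants, the weekly retention refresh runs unattended, while a request to define a new cohort on the same dataset is routed to an analyst because no grant exists for it.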

The capabilities that justify early investment are rapid exploration, schema-grounded SQL generation, experiment summarization, and conversion of natural-language questions into charts. Ambiguous metric semantics and under-specified experiment designs remain poor candidates for unreviewed autonomy. Use the agent to compress the mechanical work, but keep unresolved organizational judgment visible.

Build evaluations around the work people actually do

A generic chatbot benchmark will not tell you whether an agent can support your product decisions. Your evaluation unit should be a complete analytics task performed under your definitions, schemas, policies, and edge cases.

Start with the ten high-frequency analytics tasks that matter most in your environment. Do not select only the cleanest demonstrations. Include work that is frequent, consequential, and likely to expose semantic or governance failures.


February 17, 2026

Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.

When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
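Here is a minimal sketch of that loop, assuming each agent is just a Python callable. The `orchestrate` function, its retry limit, and the result shape are illustrative choices, not the API of any particular framework.

```python
from typing import Callable

Planner = Callable[[str], list[str]]   # goal -> ordered steps
Executor = Callable[[str], str]        # step -> output
Verifier = Callable[[str, str], bool]  # (step, output) -> acceptable?


def orchestrate(goal: str, plan: Planner, execute: Executor,
                verify: Verifier, max_retries: int = 2) -> list[dict]:
    """Run planned steps in order, retrying any step the verifier rejects."""
    results = []
    for step in plan(goal):
        for attempt in range(max_retries + 1):
            output = execute(step)
            if verify(step, output):
                results.append({"step": step, "output": output,
                                "attempts": attempt + 1})
                break
        else:
            # Every attempt failed verification: escalate and stop, because
            # later steps may depend on this one.
            results.append({"step": step, "output": None, "escalated": True})
            break
    return results
```

The verifier sits between execution and the result list, so nothing unverified is ever appended. Stopping at the first escalated step is a conservative assumption; a workflow with independent steps could continue instead.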

To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

Governance is equally critical. I build in privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews. These are habits borrowed from SRE culture, and they map well to AI systems.

On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.

Inspired by this post on Product School.

February 16, 2026

How to Build an AI-Native Product Development Workflow

Your team can generate a PRD, summarize an interview, and draft acceptance criteria in minutes. Yet the product still may not ship faster. Customer evidence remains scattered, decisions lose their rationale at handoffs, and nobody knows whether an AI-generated recommendation deserves to be trusted.

An AI-native product development workflow fixes that operating system. It connects evidence, decisions, delivery, and evaluation in one traceable learning loop. The goal is not to produce more documents. It is to shorten the path from a customer signal to a reliable product decision, then carry the result back into the next decision.

Change the unit of work from an artifact to a decision

AI-assisted teams use a model inside an existing process. They write the same documents, hold the same handoffs, and make the same decisions, only with faster drafting. That can save time, but it leaves the fundamental bottlenecks untouched.

An AI-native workflow reorganizes the process around decisions. Every meaningful unit of work should carry enough context for the next person or system to understand what is being decided, why it matters, and what evidence would change the decision.

Use a decision packet with five parts:

Decision: State the exact choice in front of the team. Replace a broad assignment such as “improve onboarding” with a decision such as “whether to change the first-session setup flow for a defined customer segment.”
Evidence: Link the customer examples, research moments, usage data, and business constraints that support the problem. Preserve the original evidence rather than storing only an AI summary.
Assumptions: Separate what the team knows from what it believes. An assumption should be written so that new evidence can confirm or challenge it.
Success condition: Name the customer or business behavior expected to change. For an experiment, define the hypothesis and, where appropriate, the minimum detectable effect before exposure begins.
Decision state: Record the owner, status, unresolved questions, next test, and reason for the latest change.

The model can retrieve evidence, compress it, identify inconsistencies, draft alternatives, and check whether required fields are missing. A person still owns the interpretation, trade-offs, priority, and release decision. This boundary prevents polished language from being mistaken for product judgment.
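The "check whether required fields are missing" job is easy to make mechanical. The sketch below assumes a simplified packet in which decision state is reduced to an owner and a status; `DecisionPacket` and `missing_parts` are hypothetical names.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DecisionPacket:
    decision: str = ""
    evidence_links: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    success_condition: str = ""
    owner: str = ""
    status: str = ""


def missing_parts(packet: DecisionPacket) -> list[str]:
    """Names of packet fields that are still empty, in declaration order."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name)]
```

A check like this only confirms that each part exists. Whether the evidence actually supports the decision remains a human call.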

Apply a simple test to every AI-generated artifact: what decision will this change? If the answer is unclear, the artifact is probably workflow noise. If the answer is clear, attach the artifact to the decision packet instead of allowing it to become another disconnected document.

Build an evidence spine before adding more automation

Most product workflows fragment evidence before a model ever sees it. Support tickets sit in one system, sales notes in another, interviews in folders, and behavioral data in an analytics platform. A prompt cannot recover relationships that the operating system never preserved.

A retrieval-first intake can unify customer feedback, support tickets, sales notes, research transcripts, and usage analytics. Embeddings can help cluster related signals and remove duplicates, but the useful output is not a list of themes. It is a navigable path from a theme to representative evidence and then to the decision it informed.

Build that path as a closed sequence:

Normalize incoming evidence while preserving its source identifier, relevant customer or segment context, and access permissions.
De-duplicate repeated signals and cluster related evidence without erasing meaningful differences between customers or use cases.
Retrieve a small set of representative examples for the decision being made. Do not dump the entire evidence store into the model context.
Write the approved decision, its assumptions, and its rationale into durable external state.
Return experiment results, release outcomes, and new qualitative feedback to the same evidence system.
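The first two steps can be sketched with the standard library alone. This example assumes exact-match de-duplication on normalized text as a stand-in for embedding-based clustering, and it groups within a segment so that the same complaint from two different segments stays distinguishable. `Evidence`, `fingerprint`, and `deduplicate` are illustrative names.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_id: str  # preserved so every group links back to the originals
    segment: str
    text: str


def fingerprint(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(items: list[Evidence]) -> dict[str, list[Evidence]]:
    """Group repeated signals per segment without discarding any source record."""
    groups: dict[str, list[Evidence]] = {}
    for item in items:
        key = f"{item.segment}:{fingerprint(item.text)}"
        groups.setdefault(key, []).append(item)
    return groups
```

Because each group keeps its full list of `Evidence` records, a theme can always be traced back to the tickets, notes, or transcripts behind it.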

Keep three forms of information distinct. The evidence store contains raw and normalized inputs. Working context contains only the material needed for the current task. The decision log contains approved conclusions, rejected alternatives, owners, and changes. Mixing all three creates stale prompts, contradictory instructions, and summaries that can no longer be audited.

A prioritization recommendation, for example, should link back to representative customer records and the relevant analytics view. A summary without those links is compression, not evidence. When somebody challenges the recommendation, the team should be able to inspect the underlying material without asking the model to reconstruct its reasoning from memory.

This is also where data governance belongs. Decide which systems the workflow may retrieve from, which fields require redaction, who can see sensitive records, and how model outputs will be retained before connecting those systems. Privacy-by-design, cybersecurity, and regulatory controls need to sit alongside the workflow, not appear as a review after customer information has already crossed an inappropriate boundary.

Run one closed loop from discovery to shipped learning

The product trio remains important in an AI-native workflow. Product, design, and engineering use automation to reach the evidence faster and explore more alternatives, while keeping explicit human gates around interpretation, feasibility, customer experience, and risk. Clear handoffs between context design, external memory, and orchestration make those responsibilities easier to see.

For each stage, name the AI job, the human gate, and the durable output. That turns a collection of AI tools into an operating workflow.

Stage	AI accelerates	Human gate	Durable output
Intake and triage	Normalize, de-duplicate, cluster, and retrieve representative customer signals.	Verify that a cluster reflects a real customer problem rather than repeated wording or a noisy channel.	An opportunity record linked to original evidence.
Discovery	Draft interview guides, summarize transcripts, extract entities, and tag moments of friction.	Interpret what the customer meant, identify contradictions, and decide which uncertainty deserves another conversation.	An evidence-backed problem narrative with open questions.
Opportunity sizing	Organize evidence against a driver tree and assemble available inputs about potential impact.	Choose the outcome, inspect data quality, expose assumptions, and make the prioritization trade-off.	A ranked opportunity with decision criteria and explicit assumptions.
Solution shaping	Generate alternatives, first-pass flows, PRD sections, acceptance criteria, and experiment ideas.	Test desirability, usability, feasibility, strategic fit, and the cost of being wrong.	A solution hypothesis, acceptance criteria, and a test plan.
Planning and execution	Break an approved bet into sequenced work, surface dependencies, and check artifacts for missing requirements.	Set scope, choose rollout controls, confirm instrumentation, and approve release readiness.	An instrumented release plan connected to feature flags, CI/CD, and observability.
Iteration	Compare expected and actual outcomes, organize qualitative feedback, and surface anomalies for review.	Decide whether to scale, revise, stop, or collect more evidence.	An updated decision record returned to the evidence spine.

Exit criteria keep each stage honest. Discovery is not complete because the transcripts have been summarized. It is complete enough to move forward when the team can name the customer problem, the supporting evidence, and the uncertainty it intends to resolve next. Solution shaping is not complete because a PRD exists. It is complete when the hypothesis, constraints, acceptance criteria, test method, and required telemetry are clear enough for a responsible decision.

Plan measurement before release. If the team will use an A/B test, write the hypothesis and minimum detectable effect before looking at the result. If controlled experimentation is not appropriate, name the expected behavior change and the qualitative evidence that would support or challenge it. Feature flags provide controlled exposure, while observability helps the team understand why behavior changed rather than merely showing that it changed.

The workflow closes only when actual outcomes return to discovery. Comparing expected and actual outcomes, harvesting qualitative feedback, and feeding the result back into the evidence system turns a release into organizational learning. Without that return path, the model keeps retrieving yesterday’s beliefs even after the product has disproved them.

Engineer context, evaluations, and decision rights together

Reliability cannot be added as a final quality check. Every AI transformation can lose evidence, introduce unsupported language, or carry stale assumptions into the next stage. The workflow needs controls at the moment each failure can occur.

Give each task a context contract

One large prompt that tries to perform discovery, prioritization, specification, and planning will accumulate irrelevant material and conflicting instructions. Break the workflow into smaller tasks, each with a compact context contract:

The decision or job the output must support.
The approved evidence the model may use.
The constraints and non-negotiable requirements.
The information the model must not infer.
The required output structure.
The conditions that require human review.

Compact task prompts, curated turns, external memory, repeated critical instructions, and isolated sub-agents are practical ways to manage a limited context window. Use external state for durable decisions and retrieve only the relevant slice for the current task. Repeat a critical constraint when the context grows rather than assuming an earlier mention will retain equal influence.
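A minimal sketch of two of those techniques, retrieving only a relevant slice and repeating critical constraints once the context grows. It assumes the snippets arrive already ranked by relevance and uses a character budget as a crude proxy for tokens; `build_context` is a hypothetical helper.

```python
def build_context(task: str, critical: list[str], snippets: list[str],
                  budget_chars: int = 2000) -> str:
    """Assemble a compact context: constraints, ranked snippets, constraints again."""
    rules = "\n".join(f"- {c}" for c in critical)
    header = f"Task: {task}\nCritical constraints:\n{rules}"
    footer = f"Reminder, these constraints still apply:\n{rules}"
    # Reserve room for the header, the footer, and the newline joining them.
    remaining = budget_chars - len(header) - len(footer) - 1
    chosen = []
    for snippet in snippets:  # assumed to be ordered most relevant first
        cost = len(snippet) + 1
        if cost > remaining:
            break
        chosen.append(snippet)
        remaining -= cost
    if not chosen:
        return header  # nothing was added, so no reminder is needed
    return "\n".join([header, *chosen, footer])
```

The reminder is appended only when evidence has been added, which mirrors the advice to repeat a constraint as the context grows rather than by default.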

Use a sub-agent when a task benefits from an isolated context or a separate review, such as checking a PRD against approved evidence. Do not add one merely to make the system look agentic. Every additional agent creates another handoff whose inputs, outputs, permissions, and failure behavior must be evaluated.

Build an evaluation harness before scaling the workflow

An evaluation should answer a repeatable question: does this workflow produce an acceptable result on representative work? A few impressive demonstrations do not tell you whether a prompt, retrieval change, or model update made the system more dependable.

Start with real task types your team already performs. Preserve representative inputs, the evidence that should be used, the requirements an acceptable output must satisfy, and known failure conditions. Then run those cases whenever you change the prompt, model, retrieval logic, tool permissions, or output schema.

Evaluate at least these dimensions:

Grounding: Can each important claim be traced to approved evidence?
Fidelity: Did the output preserve material differences, uncertainty, and constraints rather than flattening them into a convenient narrative?
Completeness: Are the fields required for the next decision present?
Decision usefulness: Does the output help a named owner make a specific choice?
Data handling: Did the workflow respect access, redaction, and retention rules?
Format and tool behavior: Did the model follow the schema and use only permitted systems or actions?

Eval-driven development makes prompts and heuristics repeatable. It also gives you a safer way to adopt new models: compare them against the same task set instead of judging them from a fresh demo with different inputs.
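A harness does not need to be elaborate to be useful. The sketch below scores only two of the dimensions above, grounding and completeness, and assumes the workflow returns a dictionary with a `citations` list. `EvalCase`, `score`, and `run_suite` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    task_input: str
    approved_evidence: frozenset[str]
    required_fields: frozenset[str]


def score(case: EvalCase, output: dict) -> dict[str, bool]:
    """Check one output against the case's grounding and completeness rules."""
    cited = set(output.get("citations", []))
    return {
        # Grounded: at least one citation, and every citation is approved.
        "grounding": bool(cited) and cited <= case.approved_evidence,
        # Complete: every field the next decision needs is present.
        "completeness": case.required_fields <= set(output),
    }


def run_suite(cases: list[EvalCase],
              workflow: Callable[[str], dict]) -> dict[str, dict[str, bool]]:
    """Run every case through the same workflow and collect per-case results."""
    return {case.name: score(case, workflow(case.task_input)) for case in cases}
```

Passing two workflow versions, or two models, through `run_suite` with the same cases gives a like-for-like comparison instead of a fresh demo.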

Measure learning flow, not AI activity

Documents generated, prompts executed, and summaries produced are activity measures. They can rise while product decisions become less reliable. Use four layers of measurement instead:

Learning flow: Time from a customer signal to an evidence-backed decision, time spent waiting at handoffs, and rework caused by missing context.
AI quality: Evaluation results by task, unsupported claims found during review, required fields missed, and human corrections before approval.
Customer outcome: The activation, adoption, retention, or other behavior named in the original hypothesis.
Delivery health: Deployment frequency, change failure rate, and the operational signals relevant to the release.

Keep decision rights visible beside those measures. The model may propose a priority, but the accountable product leader approves it. The model may draft a customer interpretation, but the product trio validates it against evidence. The model may prepare a release plan, but engineering owns operational readiness. Feature flags, access controls, and human approval are not signs that the workflow is insufficiently automated. They are what make greater automation responsible.

Log the decision, evidence references, model version, prompt or workflow version, retrieval configuration, evaluation result, and approving owner. Documenting decisions, model versions, and test artifacts makes a nuanced call auditable and gives the team a concrete starting point when quality changes.
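As a rough illustration, the log entry can be a flat, append-only JSON line. The field names below follow the list in the paragraph above, while `DecisionLogEntry` and `to_log_line` are assumed names rather than an established format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DecisionLogEntry:
    decision: str
    evidence_refs: tuple[str, ...]
    model_version: str
    workflow_version: str
    retrieval_config: str
    evaluation_result: str
    approving_owner: str


def to_log_line(entry: DecisionLogEntry,
                now: Optional[datetime] = None) -> str:
    """Serialize one entry as a single JSON line with a UTC timestamp."""
    record = asdict(entry)
    record["logged_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(record, sort_keys=True)
```

One line per decision keeps the log easy to diff and easy to search when quality changes after a model or prompt update.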

Key takeaways: a 30/60/90-day rollout

Do not begin by automating the full product lifecycle. Start with one recurring decision, connect its evidence to its outcome, and prove that the loop can be operated reliably. A practical 30/60/90 sequence expands from the evidence foundation to selected workflows and then into planning and delivery.

Days 1-30: Map the evidence systems used for one recurring product decision. Define the decision packet, access rules, retrieval path, current human gates, and initial evaluation cases. Build the smallest retrieval-first pipeline that can preserve links from a recommendation back to original evidence.
Days 31-60: Pilot continuous discovery and PRD drafting. Keep approval manual, evaluate representative cases, record recurring corrections, and tighten the context contract. Do not expand until the team can identify why an output passed or failed.
Days 61-90: Extend the proven pattern to prioritization and experiment design. Connect approved outputs to planning, CI/CD, feature flags, and observability. Feed release outcomes and customer feedback back into the evidence spine.

By the end of the rollout, you should be able to trace an AI recommendation to customer evidence, reconstruct why a decision changed, detect a quality regression after a workflow update, and compare the expected outcome with what happened after release. If one of those paths is missing, fix it before adding another agent or automating another handoff.

Your next move can be small. Choose one product decision scheduled for this week. Put its evidence, assumptions, success condition, and state into a decision packet. Then follow that packet through discovery, delivery, and the first outcome review. That single trace will reveal where your workflow is genuinely AI-native and where faster drafting is only hiding an old bottleneck.

February 11, 2026

From Chaos to Clarity with Claude Code: My Hands-On Playbook for Product Leaders

I’ve been pushing hard to operationalize AI for real product work, and this episode zeroes in on the moment Claude Code stops feeling like a demo and starts behaving like a dependable teammate. If you’ve ever wondered how to go from clever prompts in the browser to durable, repeatable workflows on your machine, this walkthrough is for you.

Listen on: Spotify | Apple Podcasts.

My first honest reaction to installing and configuring the desktop agent was the all-too-relatable “this tool thinks everything is a code repo” reality. That framing helped me reset expectations fast: instead of treating it like a magical universal assistant, I began designing guardrails, context, and repeatable routines—exactly how I’d onboard a new team member.

The shift from Claude-in-the-browser to Claude Code on my machine was the unlock. Locally, it can finally work with my files, folders, and workflows. That meant I could ground it in real artifacts—project docs, meeting notes, product specs, and historical decisions—so responses weren’t just plausible; they were contextual and verifiable.

On setup, I now treat /init and CLAUDE.md files as my product requirements. I define roles, boundaries, and canonical sources up front, then run in a deliberate “walled garden.” The “treat it like an intern” model works beautifully: scope access intentionally, expand privileges as trust grows, and keep a tight audit trail of what it can touch and why.

Surprisingly, task management became my ideal on-ramp. It’s easy to validate, the feedback loops are tight, and the ROI is immediate. I export calendar windows rather than granting full calendar access, then let the agent map priorities into Trello, reconcile time blocks, and surface trade-offs. Fast wins build confidence—mine and the agent’s.

Model switching matters more than I expected. When speed is king and “good enough” will do, Haiku keeps the loop snappy. When stakes are higher—complex synthesis, nuanced product strategy, or gnarly ambiguity—I step up to Claude Opus 4.5. Being intentional about when to optimize for latency versus depth is a quiet superpower.

Web tasks can still spiral. When that happens, I pause its autonomy, toggle to fewer steps, and ask, “What are you doing?” Paired with Claude’s Web fetch tool, this gets the agent to state its plan and its current step in plain language, without my needing to see its hidden reasoning, so I can spot brittle assumptions, prune distractions, and re-ground the task.

Content retrieval has become a killer workflow. I point the agent at my archives—blog posts, book drafts, transcripts, notes—and ask, “Where have I talked about this before?” It assembles a map of prior art, connects themes I’d forgotten, and prevents me from reinventing work. Over time, this evolves into a Zettelkasten-style research system that upgrades rigor and accelerates synthesis.

I’ve also turned Claude Code into a publishing engine. From a single transcript, it drafts titles, descriptions, show notes, and chapters, then routes artifacts to Ghost for formatting. Before anything ships, I run fact-checking workflows that validate claims against transcripts and research sources. The output improves, but more importantly, the scaffolding makes quality repeatable.

Reusable workflows compound. I rely on slash commands to trigger common jobs, break down larger efforts with sub-agents, and wire in hooks and plugins where external systems are needed. This is agentic AI at its most practical: fewer hero prompts, more reliable processes.

Audience analytics and content prioritization are helpful with caveats. I let the agent cluster themes and flag gaps, then I pressure-test its suggestions against first-party data and strategic goals. As with any model-driven insight, triangulation beats blind faith.

Two metaphors guide my day-to-day. First, Claude Code is like a dog—sometimes it returns with the stick, sometimes it gets lost in the woods. Second, the “intern” framing keeps me honest: don’t hand it the whole company on day one. With that mindset, my output jumped—more volume without sacrificing quality—because the workflow scaffolding got better.

In this episode, I cover what Claude Code is and why it’s useful even if you’re not an engineer, the real difference between the browser experience and running locally, how to shape behavior with /init and CLAUDE.md files, why task management is the perfect proving ground, when to export calendar windows versus connecting directly, and when model-switching makes sense: Haiku for speed, Opus for depth.

I also dig into debugging web tasks by asking “What are you doing?”, content retrieval workflows across personal archives, building reusable slash-command systems with sub-agents, hooks, and plugins, practical publishing stacks from transcripts, fact-checking against transcripts and research sources, and using analytics to prioritize content—with a healthy respect for uncertainty.

If you’ve been trying to make Claude Code feel less like “throwing a stick into the woods,” this is the candid, tactical tour I wish I’d had on day one. Drop your questions and experiments below—I’m eager to compare notes and refine the playbook together.

Inspired by this post on Product Talk.

February 10, 2026

AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcome-based OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.

Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics such as deployment frequency and change failure rate to ensure we’re improving the engine, not just the output.
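A canary cohort only limits blast radius if the same users stay in it from one request to the next. One common way to get that stability is to hash a user identifier into a bucket. The sketch below is illustrative only: the function name in_canary, the flag name, and the 5% rollout are assumptions, not the API of any particular feature-flag product.

```python
import hashlib

def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Return True if this user falls inside the canary cohort for a flag.

    Hashing the flag name together with the user ID gives every flag its
    own stable cohort: the same user always gets the same answer.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100

# Hypothetical rollout: expose a new prompt version to roughly 5% of users.
users = [f"user-{i}" for i in range(10_000)]
cohort = [u for u in users if in_canary(u, "agent-v2-prompt", 5.0)]
print(f"{len(cohort)} of {len(users)} users are in the canary")
```

Rolling back is then a matter of setting the percentage to zero; no per-user state has to be cleaned up.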

Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool allowlists, separate read and write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on track.
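These constraints are easiest to audit when they live in ordinary code rather than in the prompt. The following is a minimal sketch under stated assumptions: the AutonomyPolicy fields, the tool names, and the idea that the caller supplies a confidence score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutonomyPolicy:
    max_steps: int
    allowed_tools: frozenset   # every tool the agent may call
    write_tools: frozenset     # the subset that changes state
    min_confidence: float

def next_move(policy: AutonomyPolicy, step: int, tool: str,
              confidence: float, write_approved: bool) -> str:
    """Decide whether a proposed tool call proceeds or is handed off."""
    if step >= policy.max_steps:
        return "handoff:step_limit"
    if tool not in policy.allowed_tools:
        return "handoff:tool_not_allowed"
    if tool in policy.write_tools and not write_approved:
        return "handoff:write_needs_approval"
    if confidence < policy.min_confidence:
        return "handoff:low_confidence"
    return "proceed"

# Hypothetical policy: reads are free, credits need an approval checkpoint.
policy = AutonomyPolicy(
    max_steps=5,
    allowed_tools=frozenset({"lookup_order", "issue_credit"}),
    write_tools=frozenset({"issue_credit"}),
    min_confidence=0.8,
)
print(next_move(policy, 0, "lookup_order", 0.93, write_approved=False))
print(next_move(policy, 1, "issue_credit", 0.93, write_approved=False))
```

Returning a reason code with each handoff keeps the decision inspectable later, which a bare True or False would not.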

Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.

Inspired by this post on Product School.

February 9, 2026

How to Build a Mature AI Customer Service Operation

Your customer-service AI agent is live. It answers common questions, the launch dashboard looks healthy, and the next budget conversation is already about scale. Then a harder question arrives: which customer problems can the system actually own from start to finish?

That answer separates a production pilot from a mature deployment. Maturity is not the number of channels using AI or the quality of the demo. It is your ability to give the system meaningful responsibility, measure the result, recover safely when it fails, and improve it as part of normal operations. The framework below will help you diagnose where your deployment is shallow and decide what to build next.

Maturity begins where the pilot stops

Investment no longer distinguishes an AI leader. Among 2,470 global support professionals surveyed by Intercom, 82% of senior leaders said their teams had invested in AI during the previous year, 87% planned to invest in 2026, and 77% said AI was meeting or exceeding expectations. Yet only 10% classified their deployment as mature.

Those are self-reported responses collected by an AI-support vendor, so treat them as a directional benchmark rather than causal proof. The useful signal is the gap: buying and launching AI has become common, while redesigning customer service around it remains rare.

A pilot proves that an AI agent can participate. A mature operation proves that it can take responsibility. Participation might mean generating an answer before handing the conversation to a person. Responsibility means resolving the customer’s need, completing any permitted action, recording what happened, and escalating with context when human judgment is required.

Dimension	Pilot-shaped deployment	Mature operating behavior
Scope	A few answerable intents on one surface	Selected journeys owned from initial request through verified outcome
Work performed	Retrieves information or drafts a reply	Explains, gathers context, uses approved tools, and completes permitted tasks
Ownership	A launch team watches aggregate results	A named operator owns performance, failures, and the improvement backlog
Knowledge	Content is cleaned up before launch	Knowledge coverage, accuracy, and maintenance are governed as production dependencies
Testing	The happy path works in a demo	Realistic scenarios, boundary cases, and regressions are evaluated before changes ship
Handoffs	Escalation is an undifferentiated escape route	Every handoff has a reason, preserves context, and feeds the next improvement decision
Success	Containment or deflection rises	Verified resolution, task completion, quality, safety, and customer impact improve together

Use this as a constraint map, not an average score. A deployment with excellent content but unreliable account permissions is not ready to complete account changes. A deployment with strong automation but no failure taxonomy cannot improve systematically. Your least-developed operating dependency usually limits the next safe increase in responsibility.

Expand responsibility one customer intent at a time

The safest unit of expansion is not a channel, market, or percentage target. It is a customer intent with a defined outcome. Shipping an AI agent to every messaging surface can increase reach without increasing capability. Giving it end-to-end ownership of one additional support journey creates measurable depth.

For each intent, move up this responsibility ladder only when the previous level is dependable:

Answer: Retrieve and explain approved information.
Clarify: Ask the minimum questions needed to identify the customer’s situation.
Contextualize: Use authenticated account, product, region, or history data to provide the applicable answer.
Act: Complete a permitted task through a reliable tool or workflow, then confirm the result.
Intervene proactively: Detect a relevant condition and offer or perform an appropriate next step under explicit rules.

This ladder explains why an answer bot and an operational AI agent can look similar in a dashboard but create very different value. The first reduces reading and typing. The second can remove an entire unit of work for the customer and the support team.

The reported difference between early and deep deployments appears in the type of work performed. Mature teams were more likely than teams in initial deployment to report automation of manual work, proactive engagement, and task completion: 63% versus 52%, 51% versus 41%, and 45% versus 28%, respectively. Mature teams also reported higher quality and consistency more often. The figures do not establish that deployment depth alone caused the gains, but they show what deeper responsibility looks like in practice.

Before promoting an intent to the next rung, answer these questions:

Outcome: Can you state exactly what successful resolution means for the customer?
Knowledge: Is there an approved, current answer for the common case and its important exceptions?
Identity: Does the workflow know who the customer is when personalization or action requires authentication?
Authorization: Can the system verify that this customer and this AI workflow are allowed to perform the action?
Inputs: Can required values be validated before an action is submitted?
Confirmation: Can the system verify that the downstream task succeeded instead of assuming that a tool call worked?
Recovery: Is there a safe retry, rollback, approval, or human-handoff path?
Evidence: Can an operator reconstruct which knowledge, data, rules, and tool results produced the outcome?
Evaluation: Do your test scenarios cover ambiguity, missing information, exceptions, and known failure modes?

If an answer is no, you have found the next capability to build. Do not compensate with a more confident prompt. Missing permissions need a permission model. Unreliable data needs an integration fix. Conflicting policy pages need knowledge governance.
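The readiness questions can be recorded per intent so that the gaps become an explicit build list. This is a small sketch, assuming each question is stored under a short key; the key names and the function capabilities_to_build are illustrative.

```python
# Keys mirror the readiness questions, in the order they are asked.
READINESS_CHECKS = (
    "outcome", "knowledge", "identity", "authorization", "inputs",
    "confirmation", "recovery", "evidence", "evaluation",
)

def capabilities_to_build(answers: dict) -> list:
    """Return every check that is not a clear yes.

    A missing or non-True answer counts as a gap, so an unanswered
    question can never silently pass the gate.
    """
    return [check for check in READINESS_CHECKS if answers.get(check) is not True]

# Hypothetical intent: confirmation is unreliable, recovery was never assessed.
answers = {check: True for check in READINESS_CHECKS}
answers["confirmation"] = False
del answers["recovery"]
print(capabilities_to_build(answers))
```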

Use additional care for refunds, cancellations, account changes, identity-sensitive requests, and other consequential actions. Start with reversible or approval-gated operations. Validate the customer, the requested change, the permitted amount or scope, and the downstream result. A fast autonomous action is not a success if it creates financial loss, locks the wrong account, or leaves no reliable audit trail.

Build the operating system behind the agent

An AI agent does not mature on its own after launch. Performance plateaus when ownership, content, testing, integrations, and analysis remain side projects. These capabilities need to operate as one system.

Give performance to a named operator

Executive sponsorship and operational ownership solve different problems. The sponsor aligns customer experience, economics, organizational design, and cross-functional priorities. The operator turns failures into changes and makes sure those changes reach production safely. One person can fill both roles in a smaller organization, but the accountabilities should still be explicit.

The operator should own a working backlog organized by customer intent. Each entry needs enough context to support a decision:

The customer intent and desired outcome.
Where the current journey begins and ends.
Conversation volume and customer impact drawn from your own data.
The primary failure mode, supported by examples.
The proposed content, behavior, integration, or policy change.
The person responsible for the dependency.
The scenarios that will validate the change.
The deployment status, observed result, and rollback decision.

This prevents the backlog from becoming a collection of prompt tweaks. It also exposes systemic problems. If several intents fail because account status arrives late, the priority is the shared data dependency, not separate wording changes in every conversation.

Treat knowledge as a runtime dependency

Content quality is not a launch task. The AI agent depends on current knowledge every time it answers, just as a transactional workflow depends on a functioning service. A policy change can therefore create production failures even when no AI configuration changes.

Create a content contract for every intent you expect the agent to own:

Canonical location: Identify the approved source rather than allowing several conflicting pages to compete.
Coverage: Include the common case, eligibility conditions, exceptions, prerequisites, and the point where human judgment begins.
Scope: Separate product, plan, market, language, and policy variants when the answer differs.
Owner: Assign the person or function authorized to approve changes.
Freshness trigger: Tie review to the product, pricing, policy, or workflow event that can make the content stale.
Retirement: Remove or clearly supersede obsolete information so retrieval does not surface an old rule.
Validation: Attach representative scenarios that should pass whenever the knowledge changes.

A retrieval-first pipeline makes content maintainable because the approved explanation lives in governed knowledge instead of being buried inside prompts. Prompt behavior should decide how to use policy, not become a second unofficial policy store.
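A content contract can be kept as a small structured record next to the content it governs. The sketch below is one possible shape, not a prescribed schema: the ContentContract fields follow the list above, while the event names, the URL, and the needs_review rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContentContract:
    intent: str
    canonical_location: str
    owner: str
    scope: dict                      # e.g. {"plan": "pro", "market": "US"}
    last_reviewed: date
    freshness_triggers: tuple = ()   # event names that can make content stale
    validation_scenarios: tuple = () # scenario IDs that must keep passing
    retired: bool = False

def needs_review(contract: ContentContract, events: list) -> bool:
    """True if a trigger event happened after the last review.

    `events` is a list of (event_name, event_date) pairs.
    """
    if contract.retired:
        return False
    return any(
        name in contract.freshness_triggers and happened_on > contract.last_reviewed
        for name, happened_on in events
    )

# Hypothetical contract for a refund-eligibility intent.
refund = ContentContract(
    intent="refund_eligibility",
    canonical_location="https://help.example.com/refunds",
    owner="billing-policy",
    scope={"market": "US"},
    last_reviewed=date(2026, 1, 10),
    freshness_triggers=("pricing_change", "refund_policy_change"),
    validation_scenarios=("refund_within_window", "refund_after_window"),
)
print(needs_review(refund, [("pricing_change", date(2026, 2, 1))]))
```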

Run every change through an evaluation loop

A useful production loop is Train, Test, Deploy, Analyze. Its value is not the labels. It is the discipline of connecting an observed failure to a controlled change and then checking whether the change improved real outcomes.

Train: Change the relevant knowledge, behavior, data access, or tool. Record the failure you expect the change to fix.
Test: Run representative customer scenarios, including the happy path, ambiguous wording, missing data, policy exceptions, tool failure, and required escalation. Govern or redact conversation data under your privacy controls.
Deploy: Release to the intended intent, channel, customer segment, language, or market with a known fallback and rollback path.
Analyze: Check the customer outcome and guardrails, inspect new failure patterns, and decide whether to keep, revise, expand, or revert the change.

Your evaluation set should evolve with production. Add scenarios when a customer finds a new ambiguity, a product release changes the journey, or an integration fails in a way the original tests did not anticipate. Keep regression cases after the immediate defect is fixed. Otherwise, one improvement can quietly reintroduce an old failure elsewhere.
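Keeping regression cases pays off only if each candidate change is compared against the previous results. A minimal sketch, assuming evaluation results are stored as a mapping from scenario ID to pass or fail; the scenario names are hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """Scenario IDs that passed on the baseline but fail on the candidate.

    A scenario missing from the candidate run counts as a failure, so a
    dropped test cannot hide a regression.
    """
    return sorted(
        scenario for scenario, passed in baseline.items()
        if passed and not candidate.get(scenario, False)
    )

baseline = {
    "refund_happy_path": True,
    "refund_missing_order_id": True,
    "refund_tool_timeout": False,   # the defect this change targets
}
candidate = {
    "refund_happy_path": True,
    "refund_missing_order_id": False,  # old behavior broke
    "refund_tool_timeout": True,       # targeted defect is fixed
}
print(find_regressions(baseline, candidate))
```

In this example the change fixes the timeout case and breaks the missing-order-ID case, which is exactly the trade an aggregate pass rate would conceal.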

Make actions observable and recoverable

Answer quality alone is insufficient once the AI agent can perform tasks. Your operation must distinguish a bad explanation from a failed action, a denied permission, stale account data, a duplicate request, and a downstream timeout. Those failures require different owners and different fixes.

For each consequential workflow, preserve the facts needed to reconstruct the outcome: the detected intent, the applicable knowledge or policy version, required customer inputs, authorization result, tool invoked, request status, returned result, confirmation shown to the customer, and handoff reason. The goal is not indiscriminate data collection. Retain only what your privacy and security rules permit, but retain enough operational evidence to diagnose a failure.

Design the human path at the same time as the autonomous path. A handoff should carry the customer’s request, relevant facts already collected, actions attempted, results received, and the unresolved decision. Making the customer repeat the conversation transfers the AI agent’s failure cost directly to them.
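One way to make sure a handoff carries that context is to treat it as a typed payload and refuse to send it incomplete. The field names below are assumptions that mirror the paragraph above, not a schema from any support platform.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    reason: str
    customer_request: str
    unresolved_decision: str
    facts_collected: dict = field(default_factory=dict)
    actions_attempted: list = field(default_factory=list)  # (action, result) pairs

    def missing_fields(self) -> list:
        """Names of required text fields that are empty or whitespace."""
        required = ("reason", "customer_request", "unresolved_decision")
        return [name for name in required if not getattr(self, name).strip()]

# Hypothetical handoff for a duplicate-charge request.
handoff = Handoff(
    reason="authorization_boundary",
    customer_request="Refund the duplicate charge on my last invoice",
    unresolved_decision="Approve a refund above the automated limit?",
    facts_collected={"invoice_id": "INV-1001", "duplicate_confirmed": True},
    actions_attempted=[("lookup_invoice", "ok"), ("issue_refund", "denied: over limit")],
)
assert handoff.missing_fields() == []
print(asdict(handoff)["unresolved_decision"])
```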

Turn handoffs into the improvement backlog

A handoff is not automatically a failure. Some requests require empathy, judgment, negotiation, policy discretion, or authority that should remain with a person. The operational failure is an unexplained handoff. When every escalation looks the same in analytics, you cannot tell whether to improve knowledge, retrieval, workflow reliability, or the boundary itself.

Handoff or failure type	What to inspect	Likely improvement
Knowledge gap	No approved answer, missing exception, or obsolete policy	Create or update canonical content and add regression scenarios
Retrieval mismatch	Relevant content exists but the wrong variant is selected	Improve structure, metadata, scoping, or content separation
Interpretation or behavior error	The right information is available but applied incorrectly	Refine behavior instructions and add boundary-case evaluations
Missing customer context	The answer depends on account, plan, region, or history data that is unavailable	Connect the required data or ask a precise clarifying question
Authorization boundary	The requested action is not permitted for this customer or workflow	Preserve the guardrail; improve explanation or approval routing
Tool or data failure	A permitted action fails, times out, or returns an uncertain result	Improve integration reliability, confirmation, retry, and fallback behavior
Deliberate human boundary	The request requires judgment, discretion, or specialized handling	Keep the handoff and improve context transfer

Apply one primary reason to each reviewed failure, even when several contributing factors exist. Route the item to the owner who can change that dependency. Over time, the distribution of reasons tells you whether the deployment is becoming more capable or merely handing off in different places.
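Once each reviewed handoff has a single primary reason, the distribution is a simple count. The sketch assumes reviewed handoffs are dictionaries with a primary_reason key; the reason labels are shortened versions of the table rows.

```python
from collections import Counter

def reason_distribution(handoffs: list) -> dict:
    """Share of reviewed handoffs per primary reason, most common first."""
    counts = Counter(h["primary_reason"] for h in handoffs)
    total = sum(counts.values())
    return {reason: count / total for reason, count in counts.most_common()}

# Hypothetical review sample.
reviewed = [
    {"id": 1, "primary_reason": "knowledge_gap"},
    {"id": 2, "primary_reason": "tool_or_data_failure"},
    {"id": 3, "primary_reason": "knowledge_gap"},
    {"id": 4, "primary_reason": "deliberate_human_boundary"},
]
print(reason_distribution(reviewed))
```

Comparing this distribution between review periods shows whether fixable reasons are shrinking relative to deliberate human boundaries.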

Measure the operation as a stack rather than relying on one headline rate:

Reach: Where was the AI agent involved, broken down by intent, channel, language, market, and product area?
Outcome: Was the customer’s issue actually resolved, and did any requested task complete successfully?
Quality: Was the answer correct, consistent, clear, and appropriate for the applicable policy and context?
Customer impact: What happened to satisfaction, repeat contact, abandonment, and escalation experience?
Guardrails: Were there unauthorized actions, incorrect confirmations, failed tools, or missed mandatory handoffs?
Diagnostics: Which knowledge gaps, retrieval mismatches, behavior errors, and integration failures drove the result?

Do not confuse involvement with success. Involvement measures only how often the system participated. Do not treat a conversation that ended without a human as verified resolution either; the customer may have abandoned the interaction or returned through another channel. Tie autonomous resolution to evidence that the intended outcome occurred, especially when a tool or account change was involved.

Aggregate containment is also easy to misread. It can rise because the mix shifted toward simpler questions while a high-impact journey deteriorated. Review results by intent and relevant customer segment before crediting a model or configuration change. If containment improves while repeat contacts, task failures, or customer satisfaction worsen, the operation has not become more mature.
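A small worked example shows how a mix shift can raise aggregate containment while one journey gets worse. All numbers here are invented for illustration.

```python
def containment(rows: dict) -> float:
    """Aggregate containment across intents.

    `rows` maps intent -> (conversations, contained_conversations).
    """
    total = sum(n for n, _ in rows.values())
    contained = sum(c for _, c in rows.values())
    return contained / total

# Invented figures: (conversations, contained) per intent.
before = {"password_reset": (600, 540), "billing_dispute": (400, 200)}
after = {"password_reset": (900, 810), "billing_dispute": (100, 30)}

for label, rows in (("before", before), ("after", after)):
    per_intent = {intent: c / n for intent, (n, c) in rows.items()}
    print(label, round(containment(rows), 2), per_intent)
```

Aggregate containment moves from 74% to 84%, yet password resets hold steady at 90% and billing disputes fall from 50% to 30%. The headline improved only because simple conversations became a larger share of the mix.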

Key takeaways

AI deployment maturity is the ability to give an AI agent measurable, recoverable responsibility for customer outcomes, not simply expose it to more conversations.
Expand one customer intent at a time through answering, clarification, contextualization, action, and carefully governed proactive work.
Do not automate consequential actions until identity, authorization, validation, confirmation, observability, and recovery are in place.
Assign a named operator to own intent-level performance, failure analysis, dependencies, evaluations, and the improvement backlog.
Manage knowledge as production infrastructure with canonical content, explicit scope, accountable owners, freshness triggers, and regression scenarios.
Classify handoffs by root cause and measure verified resolution, quality, customer impact, and guardrails alongside containment.

At your next operating review, choose one important intent that the AI agent currently answers but does not own. Map it onto the responsibility ladder, run the readiness questions, name its operator, classify its current handoffs, and put the next change through the evaluation loop. The scope is deliberately narrow. The maturity gain is real: one more customer problem resolved safely from beginning to end.

References

Intercom – Go Deep or Get Left Behind: How AI Deployment Depth Transforms Customer Service

February 5, 2026

Building Reliable AI Agent Systems: A Product Leader’s Playbook

Your AI agent performs beautifully in a controlled demo. Then real users arrive with incomplete instructions, stale records, missing permissions, ambiguous goals, and requests that cross the boundary between drafting something and actually changing the business.

The answer is rarely a longer prompt or a newer model. A reliable agent is a product system: a bounded workflow with trusted context, constrained tools, explicit verification, measurable release gates, and a safe way to stop. Build those pieces together and you can increase autonomy without losing control of quality, cost, or risk.

Start with a reliability contract, not an agent architecture

Before discussing models, memory, orchestration, or frameworks, define the job the agent is accountable for completing. “Answer customer questions” is too vague. “Resolve an eligible billing question using approved account and policy data, record the result, and escalate when authorization or evidence is missing” is a testable contract.

This distinction separates output from outcome. A fluent answer is output. A correctly changed business state is an outcome. The useful metrics therefore sit at the workflow level: resolution rate, time to a verified result, cost per completed task, qualified pipeline influenced, or another measure tied to the user’s job. That outcome-first capability design should happen before anyone selects a model.

Contract field	Decision you must make	Evidence the system must retain
Outcome	What real-world state counts as completed?	The accepted artifact, updated system record, or verified tool result
Scope	Which intents, data, tools, and actions are allowed?	The classified intent, permission decision, and tools invoked
Quality bar	What must be correct, grounded, complete, and timely?	Evaluation results and postcondition checks for the task
Stopping condition	When must the agent ask, refuse, or hand off?	The missing evidence, policy conflict, failed tool call, or risk trigger
Recovery	How can a failed or interrupted run be resumed or reversed?	Run state, committed actions, pending actions, approvals, and rollback path

The stopping condition deserves as much product attention as the happy path. If two trusted records conflict, the reliable behavior may be to expose the conflict. If an API times out after a write, the agent must determine whether the write happened before retrying. If a request would delete data, spend money, alter access, contact a customer, or create a legal commitment, a draft-and-approve flow is safer than silent execution. The downside of silent execution is not an awkward response; it is an irreversible business action.
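The timeout case is worth spelling out, because a blind retry can duplicate a write that already landed. The sketch below checks the system of record after every attempt. The names WriteTimeout, write_then_verify, and FlakyLedger are hypothetical, and the ledger is an in-memory stand-in for a real downstream system.

```python
class WriteTimeout(Exception):
    """Raised when a write call times out and its outcome is unknown."""

def write_then_verify(write, is_committed, max_attempts: int = 3) -> str:
    """Attempt a write, but consult the system of record before retrying."""
    for _ in range(max_attempts):
        try:
            write()
        except WriteTimeout:
            pass  # outcome unknown: fall through to the state check
        if is_committed():
            return "committed"
    return "escalate"

class FlakyLedger:
    """Stand-in system whose write lands even though the reply is lost."""
    def __init__(self):
        self.refunds = []
        self.calls = 0

    def refund(self):
        self.calls += 1
        self.refunds.append("refund-123")
        raise WriteTimeout("no response from ledger")

    def has_refund(self):
        return "refund-123" in self.refunds

ledger = FlakyLedger()
status = write_then_verify(ledger.refund, ledger.has_refund)
print(status, ledger.calls, len(ledger.refunds))
```

Because the state check runs before any retry, the refund is issued once even though the first call appeared to fail.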

A practical autonomy ladder is observe, recommend, prepare, execute a reversible action, and execute a consequential action. Move a workflow upward only when the additional autonomy is necessary for the user outcome and the preceding level has evidence behind it. My rule is simple: earn autonomy one consequential action at a time.
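The ladder and its promotion rule can be written down as an ordered type so that a workflow's level is explicit rather than implied. This is a sketch; the Autonomy enum and the two boolean inputs are assumptions standing in for whatever evidence review a team actually runs.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    OBSERVE = 1
    RECOMMEND = 2
    PREPARE = 3
    EXECUTE_REVERSIBLE = 4
    EXECUTE_CONSEQUENTIAL = 5

def next_level(current: Autonomy, needed_for_outcome: bool,
               current_level_has_evidence: bool) -> Autonomy:
    """Promote by exactly one rung, and only when both conditions hold."""
    if current is Autonomy.EXECUTE_CONSEQUENTIAL:
        return current  # already at the top of the ladder
    if needed_for_outcome and current_level_has_evidence:
        return Autonomy(current + 1)
    return current

print(next_level(Autonomy.PREPARE, True, True).name)
print(next_level(Autonomy.PREPARE, True, False).name)
```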

Write the expected handoff as part of the contract. Name who receives it, what context travels with it, what the agent already attempted, and what decision remains. “Escalated to a person” is not a successful fallback if that person has to reconstruct the entire case.

Put a deterministic shell around the probabilistic core

An LLM can interpret ambiguity and propose a plan. It should not also be the unobserved authority for identity, permissions, transaction state, policy enforcement, and whether its own work succeeded. Keep those controls in ordinary application logic wherever possible.

A production workflow usually needs the following control points:

Authenticate the user and validate the request before sending it into the agent loop.
Retrieve only the authorized context needed for this task, with identifiers and provenance attached.
Ask the model for a structured plan that can be inspected, constrained, or rejected.
Validate every proposed tool and argument against policy, permissions, and a typed schema.
Execute scoped actions with timeouts, retry rules, and protection against duplicate writes.
Verify the resulting system state instead of trusting a generated claim that the task succeeded.
Return the result, evidence, unresolved uncertainty, and next state to the user.

That sequence creates a crucial separation between proposing an action, authorizing it, executing it, and verifying it. The LLM can participate in each stage, but it should not collapse all four into one opaque response.
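The separation between proposing, authorizing, executing, and verifying can be sketched in a few lines of application code. Everything named here is hypothetical: the update_email tool, the TOOL_SCHEMAS table, and the in-memory accounts dictionary stand in for a real model output, policy layer, and system of record.

```python
# Typed schema per tool: argument name -> required Python type.
TOOL_SCHEMAS = {"update_email": {"account_id": str, "email": str}}

def authorize(proposal: dict, user_permissions: set):
    """Check a model-proposed call against the schema and permissions."""
    tool = proposal.get("tool")
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False, "unknown_tool"
    if tool not in user_permissions:
        return False, "not_permitted"
    args = proposal.get("args", {})
    if set(args) != set(schema) or any(
        not isinstance(args[name], kind) for name, kind in schema.items()
    ):
        return False, "invalid_arguments"
    return True, "ok"

def run(proposal: dict, user_permissions: set, execute, read_state) -> dict:
    ok, reason = authorize(proposal, user_permissions)
    if not ok:
        return {"status": "rejected", "reason": reason}
    execute(proposal["tool"], proposal["args"])
    # Verify against the system of record, not the model's own claim.
    actual = read_state(proposal["args"]["account_id"])
    verified = actual == proposal["args"]["email"]
    return {"status": "verified" if verified else "unverified"}

accounts = {"acct-1": "old@example.com"}

def execute(tool, args):
    accounts[args["account_id"]] = args["email"]

def read_state(account_id):
    return accounts.get(account_id)

proposal = {"tool": "update_email",
            "args": {"account_id": "acct-1", "email": "new@example.com"}}
print(run(proposal, {"update_email"}, execute, read_state))
print(run({"tool": "delete_account", "args": {}}, {"update_email"}, execute, read_state))
```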

Retrieve evidence for the task, not everything that might be relevant

A retrieval-first pipeline is usually more controllable than placing a large collection of documents in the prompt. Filter by tenant, user permissions, document type, effective date, product area, and workflow state before semantic ranking. Preserve record IDs and timestamps so the answer can be traced back to what the agent actually saw. Lean context also reduces latency, cost, and the chance that irrelevant instructions steer the run.
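The ordering matters: apply the hard filters first, then rank what remains. The sketch below uses toy two-dimensional vectors and a plain dot product as a stand-in for a real embedding model and vector index; the document fields and the retrieve function are assumptions for illustration.

```python
from datetime import date

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(docs, query_vec, *, tenant, user_groups, as_of, k=3):
    """Filter by tenant, permission group, and effective date, then rank."""
    eligible = [
        d for d in docs
        if d["tenant"] == tenant
        and d["group"] in user_groups
        and d["effective_from"] <= as_of
    ]
    ranked = sorted(eligible, key=lambda d: dot(d["vec"], query_vec), reverse=True)
    # Keep IDs and dates so an answer can be traced to what was retrieved.
    return [{"id": d["id"], "effective_from": d["effective_from"].isoformat()}
            for d in ranked[:k]]

docs = [
    {"id": "refund-policy-v2", "tenant": "acme", "group": "support",
     "effective_from": date(2026, 1, 1), "vec": [0.9, 0.1]},
    {"id": "refund-policy-v3", "tenant": "acme", "group": "support",
     "effective_from": date(2026, 6, 1), "vec": [1.0, 0.0]},   # not yet effective
    {"id": "other-tenant-refunds", "tenant": "globex", "group": "support",
     "effective_from": date(2025, 1, 1), "vec": [1.0, 0.0]},   # wrong tenant
    {"id": "shipping-faq", "tenant": "acme", "group": "support",
     "effective_from": date(2025, 3, 1), "vec": [0.1, 0.9]},
]
hits = retrieve(docs, [1.0, 0.0], tenant="acme",
                user_groups={"support"}, as_of=date(2026, 2, 1))
print(hits)
```

The two best semantic matches are excluded before ranking ever sees them, one for belonging to another tenant and one for not being in effect yet.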

Embedding similarity is only one retrieval tool. Questions such as “Which decisions changed across these meetings?” depend on time, structure, and purpose, not just semantic proximity. A more capable search layer can combine vector retrieval, lexical search such as BM25, metadata queries, and purpose-built summaries. Route the query to the appropriate retrieval method and give the agent a way to inspect gaps rather than forcing every question through one embedding index.

Retrieved content is still untrusted input. A document can contain stale policy, hostile instructions, or text that resembles a system command. Keep instructions separate from evidence, restrict which tools retrieved text can influence, and apply least-privilege access at the API layer. Privacy-by-design, data governance, structured logs, and tests for prompt injection and data exfiltration belong in the architecture, not in a pre-launch checklist.

Treat every tool as a narrow product interface

A tool description is not merely prompt text. It is an interface contract. Give each tool a single clear responsibility, explicit input types, constrained values, recognizable error states, and a response the workflow can verify. Separate read tools from write tools. Where the underlying system allows it, add dry-run modes, idempotency keys, and an endpoint that checks the final state.
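As an illustration, here is a narrow write tool with constrained values, recognizable error states, a dry-run mode, an idempotency key, and a separate read method for checking final state. The TicketTool class is a hypothetical in-memory example, not the interface of any ticketing product.

```python
class TicketTool:
    """One responsibility: create a support ticket for an account."""
    PRIORITIES = ("low", "normal", "high")

    def __init__(self):
        self.tickets = {}
        self._seen = {}  # idempotency_key -> ticket_id

    def create_ticket(self, *, idempotency_key: str, account_id: str,
                      priority: str, dry_run: bool = False) -> dict:
        if priority not in self.PRIORITIES:
            return {"ok": False, "error": "invalid_priority"}
        if not isinstance(account_id, str) or not account_id:
            return {"ok": False, "error": "invalid_account_id"}
        if idempotency_key in self._seen:
            # A retry with the same key returns the original ticket.
            return {"ok": True, "ticket_id": self._seen[idempotency_key], "replayed": True}
        if dry_run:
            return {"ok": True, "ticket_id": None, "dry_run": True}
        ticket_id = f"T-{len(self.tickets) + 1}"
        self.tickets[ticket_id] = {"account_id": account_id, "priority": priority}
        self._seen[idempotency_key] = ticket_id
        return {"ok": True, "ticket_id": ticket_id, "replayed": False}

    def get_ticket(self, ticket_id: str):
        """Read-only check of the final state."""
        return self.tickets.get(ticket_id)

tool = TicketTool()
first = tool.create_ticket(idempotency_key="run-7:step-2", account_id="acct-1", priority="high")
retry = tool.create_ticket(idempotency_key="run-7:step-2", account_id="acct-1", priority="high")
print(first["ticket_id"], retry["replayed"], len(tool.tickets))
```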

Avoid exposing a broad “run anything” tool when the agent only needs to look up an account, prepare a ticket, or update one approved field. Narrow tools reduce the decision surface, simplify evaluation, and make permission reviews legible. They also let you disable one unsafe capability without taking the entire agent offline.

Persist enough state to answer operational questions after the run: which prompt and model version ran, what was retrieved, which plan was selected, which tools were attempted, what they returned, what was committed, which verification passed, and whether a person approved the action. Do not rely on a natural-language transcript as the only record. Store structured events with a run identifier and propagate that identifier through tool calls.
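A minimal sketch of structured run events follows, assuming events are appended to an in-memory list; a real system would write them to a log store. The event kinds, the call_tool wrapper, and the lookup_account tool are hypothetical.

```python
import json
import time
import uuid

def make_event(run_id: str, kind: str, **fields) -> dict:
    """One structured event; every event carries the run identifier."""
    return {"run_id": run_id, "ts": time.time(), "kind": kind, **fields}

def call_tool(run_id: str, log: list, name: str, fn, **args):
    """Record the attempt and the result, and pass run_id into the tool."""
    log.append(make_event(run_id, "tool_attempted", tool=name, args=args))
    result = fn(run_id=run_id, **args)
    log.append(make_event(run_id, "tool_returned", tool=name, result=result))
    return result

def lookup_account(run_id: str, account_id: str) -> dict:
    # A real tool would forward run_id to its downstream service.
    return {"account_id": account_id, "status": "active", "seen_run_id": run_id}

run_id = uuid.uuid4().hex
log = []
result = call_tool(run_id, log, "lookup_account", lookup_account, account_id="acct-1")
print(json.dumps(log[0], sort_keys=True))
```

Because each event is a plain dictionary, questions such as which tools were attempted in a given run become filters over the log rather than a reading of the transcript.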

Model selection comes after these boundaries are clear. Tool-use fidelity, prose quality, latency, multilingual performance, context needs, and cost can point to different choices. Newer is not automatically better: one production team found GPT-4.1 more suitable for its prose workload than newer alternatives. Keep the workflow and evaluation interfaces model-agnostic enough to compare or replace providers without rewriting the product.

The same discipline applies to multi-agent designs. Parallel agents are useful when tasks are genuinely independent, such as preparing different artifacts from a shared meeting. Specialized agents can also isolate permissions or context. But each added agent introduces another prompt, model call, state transition, failure path, and cost center. A second agent is not meaningful verification when it sees the same evidence, inherits the same assumptions, and merely agrees with the first. Add orchestration only when the separation has a measurable job.

Make workflow evaluations a release gate

A few attractive examples cannot tell you whether an agent is production-ready. Reliability work starts by naming how the workflow can fail, then turning those failure modes into repeatable tests.

Use a failure taxonomy that follows the run from request to outcome:

The agent misunderstood the intent or accepted a task outside its scope.
Retrieval omitted the necessary record, returned stale information, or crossed an access boundary.
The plan skipped a required step or selected an unsafe sequence.
The agent chose the wrong tool or supplied invalid arguments.
A tool failed, timed out, or completed after the agent assumed it had failed.
The response introduced an unsupported claim or concealed uncertainty.
The agent claimed success even though the intended system state was not reached.
The handoff occurred too late or omitted information the recipient needed.

Build a golden dataset from real user intents and known edge cases. Include normal successful work, ambiguous instructions, missing data, conflicting records, insufficient permissions, tool errors, adversarial content, and requests that should be refused or escalated. Each case needs an expected outcome, allowed tools, forbidden actions, required evidence, and an evaluation method. Otherwise the dataset is a collection of prompts, not a product specification.
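One possible shape for a golden case, with a deterministic grader, is sketched below. The GoldenCase fields follow the paragraph above; the structure of the run dictionary and the failure labels are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    user_input: str
    expected_outcome: str
    allowed_tools: frozenset
    forbidden_actions: frozenset
    required_evidence: frozenset

def grade(case: GoldenCase, run: dict) -> list:
    """Return the list of failures for one run; an empty list means pass.

    `run` has keys: outcome, tools_used, actions, evidence.
    """
    failures = []
    if run["outcome"] != case.expected_outcome:
        failures.append("wrong_outcome")
    if not set(run["tools_used"]) <= case.allowed_tools:
        failures.append("disallowed_tool")
    if set(run["actions"]) & case.forbidden_actions:
        failures.append("forbidden_action")
    if not case.required_evidence <= set(run["evidence"]):
        failures.append("missing_evidence")
    return failures

# Hypothetical case: an out-of-window refund request must be escalated.
case = GoldenCase(
    case_id="refund-out-of-window",
    user_input="I bought this 90 days ago, can I get my money back?",
    expected_outcome="escalated",
    allowed_tools=frozenset({"lookup_order"}),
    forbidden_actions=frozenset({"issue_refund"}),
    required_evidence=frozenset({"order_record", "refund_policy"}),
)
run = {"outcome": "resolved", "tools_used": ["lookup_order", "issue_refund"],
       "actions": ["issue_refund"], "evidence": ["order_record"]}
print(grade(case, run))
```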

Grade the system at several layers. Task success checks whether the intended state was reached. Grounding checks whether material claims are supported by authorized evidence. Tool-use evaluation checks selection, argument correctness, sequence, and postconditions. Safety evaluation checks policy and access boundaries. Handoff quality checks whether the receiving person can continue without repeating work. Latency and cost reveal whether the successful path is operationally sustainable.

Use deterministic checks where the answer is objective. An account ID, required field, permission decision, or database state should not need a subjective model judge. Use rubric-based model evaluation or calibrated human review for writing quality, helpfulness, and other dimensions that genuinely require judgment. Regularly compare automated grades with human decisions; an evaluator can drift or share the actor model’s blind spots.

Do not hide a severe failure behind an average score. Segment results by intent, tool, customer type, language, risk class, and workflow version. A high overall pass rate says little if the agent consistently fails the one action that changes access or sends a customer-facing commitment. Set separate go/no-go requirements for critical slices and treat forbidden actions as release blockers.
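A release gate that segments by slice might look like the following sketch. The slice names, thresholds, and result fields are invented; the point is only that the decision is made per slice and that any forbidden action blocks release regardless of pass rates.

```python
from collections import defaultdict

def release_decision(results: list, critical_minimums: dict,
                     default_minimum: float) -> dict:
    """Go/no-go from per-slice pass rates plus forbidden-action blockers."""
    blockers = [r["case_id"] for r in results if r["forbidden_action"]]
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        passes[r["slice"]] += 1 if r["passed"] else 0
    failing = sorted(
        s for s in totals
        if passes[s] / totals[s] < critical_minimums.get(s, default_minimum)
    )
    return {"go": not blockers and not failing,
            "blockers": blockers, "failing_slices": failing}

# Invented results: 18 FAQ cases pass, one of two access-change cases fails.
results = [
    {"case_id": f"faq-{i}", "slice": "faq", "passed": True, "forbidden_action": False}
    for i in range(18)
] + [
    {"case_id": "access-1", "slice": "change_access", "passed": True, "forbidden_action": False},
    {"case_id": "access-2", "slice": "change_access", "passed": False, "forbidden_action": False},
]
overall = sum(r["passed"] for r in results) / len(results)
decision = release_decision(results, {"change_access": 1.0}, default_minimum=0.9)
print(overall, decision)
```

The overall pass rate is 95%, yet the gate returns no-go because the access-changing slice passes only half the time.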

A disciplined release path looks like this:

Run offline evaluations against the current production version and the candidate change.
Replay representative historical traces with writes disabled and inspect changed decisions.
Shadow real traffic without allowing the candidate to act.
Expose the candidate behind a feature flag to internal or explicitly selected users.
Canary the workflow with a limited production population and a tested rollback path.
Use an online experiment when the question concerns user or business impact, defining the minimum detectable effect before interpreting the result.
Expand only after task success, safety, handoff, latency, and cost remain within their release requirements.

This is eval-driven development in practical terms. Prompt, retrieval, model, tool, and policy changes are versioned product changes. They enter the same comparison pipeline and cannot bypass it because someone considers a prompt edit “just configuration.”

Scale reliability and unit economics as one system

An agent can be accurate and still be unscalable. It can also look inexpensive per model call while becoming costly per resolved task because it retrieves too much, retries weak plans, invokes unnecessary tools, or sends avoidable cases to people.

Measure cost per completed safe task. The numerator should include model inference, retrieval, external APIs, tool execution, retries, verification calls, and required human review. The denominator should include only tasks that reached the intended state without violating the contract. Counting failed or falsely completed runs as successful makes the economics look better precisely when reliability is deteriorating.
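
A small sketch makes the difference between cost per run and cost per completed safe task concrete. The run records, cost categories, and amounts below are invented for illustration.

```python
def cost_per_safe_task(runs):
    """Total spend across ALL runs divided by runs that finished safely.

    Each run is a dict with 'costs' (category -> amount),
    'reached_intended_state', and 'violated_contract'.
    Returns None when no run qualifies, rather than dividing by zero.
    """
    total_cost = sum(sum(r["costs"].values()) for r in runs)
    safe = sum(
        1 for r in runs
        if r["reached_intended_state"] and not r["violated_contract"]
    )
    return total_cost / safe if safe else None


# Every run costs 0.50 in total (model, tools, and human review).
costs = {"model": 0.25, "tools": 0.125, "human_review": 0.125}
runs = [
    {"costs": costs, "reached_intended_state": True, "violated_contract": False},
    {"costs": costs, "reached_intended_state": True, "violated_contract": False},
    {"costs": costs, "reached_intended_state": False, "violated_contract": False},
    # Looks complete, but broke the contract, so it does not count.
    {"costs": costs, "reached_intended_state": True, "violated_contract": True},
]
per_run = sum(sum(r["costs"].values()) for r in runs) / len(runs)
per_safe_task = cost_per_safe_task(runs)
```

With these numbers the naive per-run figure is 0.50, while the cost per completed safe task is 1.00, because the failed and falsely completed runs still cost money but no longer inflate the denominator.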

Instrument the complete trace so you can attribute both cost and delay to a stage. Useful operating views include task success by intent, tool errors by endpoint, retries by plan type, escalations by reason, latency by stage, cost by model and workflow version, unsupported-claim rate, and verification failures. Pair those measures with user satisfaction and downstream correction signals; a fast completion is not a win if a person has to undo it later.
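
As a rough illustration of stage-level attribution, the sketch below rolls up a list of trace spans by stage. The span fields (`stage`, `latency_ms`, `cost`) and stage names are assumptions; a real system would read them from whatever tracing backend is in use.

```python
from collections import defaultdict


def attribute_by_stage(spans):
    """Aggregate call count, latency, and cost for each stage of a trace."""
    totals = defaultdict(lambda: {"calls": 0, "latency_ms": 0, "cost": 0.0})
    for span in spans:
        bucket = totals[span["stage"]]
        bucket["calls"] += 1
        bucket["latency_ms"] += span["latency_ms"]
        bucket["cost"] += span["cost"]
    return dict(totals)


# Hypothetical spans from one run; cost is in arbitrary units.
spans = [
    {"stage": "retrieval", "latency_ms": 120, "cost": 0.25},
    {"stage": "plan", "latency_ms": 900, "cost": 1.5},
    {"stage": "plan", "latency_ms": 850, "cost": 1.5},  # a retried plan
    {"stage": "tool:crm.update", "latency_ms": 300, "cost": 0.0},
]
by_stage = attribute_by_stage(spans)
slowest = max(by_stage, key=lambda stage: by_stage[stage]["latency_ms"])
```

In this example the retried planning step dominates both latency and cost, which points the optimization work at plan quality rather than at retrieval or the tool call.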

Cost work should target the mechanism, not apply a blanket downgrade. Shorten irrelevant context. Retrieve smaller evidence sets. Cache stable prompt prefixes where the provider and privacy posture allow it. Route simple classifications away from expensive reasoning models. Reuse deterministic results. Remove redundant verification, but only when evaluations show it adds no protection. In one concrete case, Earmark reported reducing its meeting workflow from about $70 per meeting to under $1 through prompt caching. That is a product-specific result, not a general benchmark, but it shows why context and caching decisions can determine whether an agent remains a demonstration or reaches everyday use.

Define service objectives around the user journey rather than a generic chatbot response. Track whether eligible tasks finish safely, whether consequential actions are verified, how long the user waits for the intended outcome, whether interrupted runs recover, and whether handoffs retain context. Set the actual thresholds from the workflow’s risk, user promise, baseline performance, and economics; there is no responsible universal target for every agent.

Prepare for incidents before increasing exposure. The operating playbook should identify the on-call owner, alert conditions, kill switch, feature flags, tool-specific disablement, prompt and model rollback procedure, trace replay process, customer-impact assessment, and postmortem owner. Test that the team can stop writes while preserving read-only or handoff behavior. An all-or-nothing shutdown is avoidable when capabilities are independently gated.
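
Independent gating can be sketched with a small flag store in which each capability is switched off on its own and unknown capabilities are denied by default. The class, capability names, and outcome strings are hypothetical.

```python
class CapabilityGates:
    """Per-capability switches; anything not explicitly enabled is denied."""

    def __init__(self, enabled):
        self._enabled = dict(enabled)

    def disable(self, capability):
        self._enabled[capability] = False

    def allowed(self, capability):
        return self._enabled.get(capability, False)


def dispatch(gates, action_kind):
    """Execute if the capability is on; otherwise degrade to a handoff."""
    if gates.allowed(action_kind):
        return "execute"
    if gates.allowed("handoff"):
        return "handoff"
    return "refuse"


gates = CapabilityGates({"read": True, "write": True, "handoff": True})
gates.disable("write")  # incident response: stop writes only
outcomes = {kind: dispatch(gates, kind) for kind in ("read", "write")}
```

After writes are disabled, reads still execute and write requests are routed to a person, which is the partial shutdown the playbook should be able to rehearse.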

Data retention is another scaling decision, not merely a legal footnote. Record what must be retained for debugging, audit, recovery, and user value; minimize everything else; define access and deletion behavior; and make the choice visible to enterprise reviewers. An ephemeral architecture can become a commercial advantage when persistent conversation storage is unnecessary: a no-storage design has eased a real enterprise adoption objection. It will not fit every workflow, especially where auditability requires durable records, so make retention a deliberate contract rather than a default.

Use the first 90 days to earn a narrow production footprint

A useful 90-day plan does not promise an autonomous platform by the end of the quarter. It creates one bounded production workflow, evidence that the workflow is valuable, and the controls required to expand it. The sequence below adapts an outcome-led 90-day AI operating model to agent reliability.

Days 0-30: define the contract and make failure observable

Choose a frequent workflow with a recognizable end state and enough value to justify automation.
Write the outcome, eligible intents, tools, data boundaries, prohibited actions, stopping conditions, and handoff owner.
Map every identity, permission, retention, and policy dependency before connecting write tools.
Baseline the current process so improvements in completion, time, cost, and quality have a meaningful comparison.
Assemble real and adversarial evaluation cases with expected outcomes and forbidden behaviors.
Implement structured traces and a read-only or dry-run version of the workflow.

The exit criterion is not a persuasive demo. You should be able to inspect a run and determine, without guessing, whether it completed the job, what evidence it used, what it changed, and why it stopped.
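
One possible shape for such an inspectable run is a structured trace record that answers those four questions directly. The field names and sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunTrace:
    """Hypothetical trace record for one workflow run."""
    run_id: str
    intent: str
    dry_run: bool
    completed: bool          # did it complete the job?
    stop_reason: str         # why did it stop?
    evidence: list = field(default_factory=list)  # what evidence it used
    changes: list = field(default_factory=list)   # what it changed

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


trace = RunTrace(
    run_id="run-001",
    intent="update_billing_email",
    dry_run=True,
    completed=False,
    stop_reason="dry_run_write_skipped",
    evidence=["kb://billing/change-email"],
    changes=[],  # a dry run proposes changes but applies none
)
record = json.loads(trace.to_json())
```

A record like this can be emitted by the read-only or dry-run version first, so the evaluation cases and the trace format are settled before any write tool is connected.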

Days 31-60: connect tools behind controls

Implement narrow tool adapters with typed inputs, permission checks, stable errors, timeouts, and duplicate-write protection.
Add retrieval filters, provenance, postcondition checks, and explicit approval points.
Version prompts, models, policies, retrieval settings, and tool schemas as one releasable workflow.
Run offline comparisons and shadow traffic, then review failures by category rather than as isolated bad answers.
Add feature flags, tool-specific disablement, alerts, and a tested rollback path.
Assign a product owner for the outcome and named engineering, risk, security, and operational partners for the controls they own.
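
The first item in this list can be sketched as a narrow adapter around a single write. Everything named here (`UpdateEmailTool`, the error codes, the in-memory store) is hypothetical, and timeouts are left out to keep the sketch short.

```python
from dataclasses import dataclass


class ToolError(Exception):
    """Stable error: callers branch on .code, not on message text."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


@dataclass(frozen=True)
class UpdateEmailInput:
    account_id: str
    new_email: str
    idempotency_key: str

    def __post_init__(self):
        # Typed input: reject malformed requests before any side effect.
        if not self.account_id or "@" not in self.new_email or not self.idempotency_key:
            raise ToolError("INVALID_INPUT", "account_id, new_email, idempotency_key required")


class UpdateEmailTool:
    def __init__(self, store, allowed_callers):
        self._store = store
        self._allowed = set(allowed_callers)
        self._completed = {}  # idempotency_key -> earlier result
        self.write_count = 0

    def call(self, caller, request):
        if caller not in self._allowed:  # permission check outside the model
            raise ToolError("PERMISSION_DENIED", f"{caller} may not update email")
        if request.idempotency_key in self._completed:
            return self._completed[request.idempotency_key]  # duplicate: no second write
        self._store[request.account_id] = request.new_email
        self.write_count += 1
        result = {"account_id": request.account_id, "status": "updated"}
        self._completed[request.idempotency_key] = result
        return result


store = {}
tool = UpdateEmailTool(store, allowed_callers={"billing_agent"})
req = UpdateEmailInput("acct_123", "new@example.com", "key-1")
first = tool.call("billing_agent", req)
retry = tool.call("billing_agent", req)  # same key: returns the stored result
```

Because the adapter, not the model, enforces permissions and deduplicates by idempotency key, a retried plan cannot produce a second write or reach a tool the caller was never granted.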

Leave this phase only when every serious known failure class has at least one of the following: a preventive control, a detection mechanism, or an explicit human gate. A line in a risk register is not a runtime control.

Days 61-90: canary, learn, and expand selectively

Release to a limited population whose intents and permissions match the evaluated scope.
Monitor safe task completion, false-success signals, handoffs, latency, cost, corrections, and user outcomes by workflow version.
Review traces for both failures and unexpected successes; an agent may reach the right answer through an unsafe path.
Run incident and rollback drills before raising the exposure or enabling a more consequential action.
Compare production behavior with the baseline and the predeclared release requirements.
Expand one dimension at a time: more users, another intent, a new tool, or greater autonomy. Re-run the relevant evaluations after each change.

The exit criterion is operational ownership. Someone owns the workflow’s outcome, someone responds when it degrades, the system can be rolled back, and the roadmap is driven by observed failure and value rather than a list of impressive agent capabilities.

Key takeaways

Define reliability as a completed, verified user outcome inside explicit boundaries.
Keep authorization, policy enforcement, transaction state, and postcondition checks outside the model wherever possible.
Evaluate retrieval, planning, tool use, safety, handoff, and final state – not just the generated response.
Gate changes with offline tests, shadowing, feature flags, canaries, and rollback procedures.
Measure cost per completed safe task and optimize the stage causing the expense.
Increase scope and autonomy separately so production evidence can tell you which change caused a regression.

Start with one workflow this week. Write its reliability contract, collect representative failures, and make a dry run traceable from request to verified outcome. Once that narrow path is measurable and recoverable, you have something worth scaling – and a defensible reason to grant the agent its next action.

February 5, 2026